Repo · MiniMax · published Jan 14, 2025 · seen 5d

MiniMax-AI/MiniMax-01

Python

Description: The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention

Language: Python

License: MIT

Stars: 3428

Forks: 329

Open issues: 8

Created: 2025-01-14T15:43:28Z

Pushed: 2025-07-07T12:24:30Z

Default branch: main

Fork: no

Archived: no

README:

MiniMax-01

1. Introduction

We introduce two models, MiniMax-Text-01 and MiniMax-VL-01. MiniMax-Text-01 is a language model with 456 billion total parameters, of which 45.9 billion are activated per token. To support long contexts, it adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). Using parallel strategies such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), its training context length extends to 1 million tokens, and it can handle up to 4 million tokens during inference. MiniMax-Text-01 achieves top-tier performance on a range of academic benchmarks.

Building on MiniMax-Text-01, we developed MiniMax-VL-01 for enhanced visual capabilities. It uses the "ViT-MLP-LLM" framework common in multimodal LLMs and is initialized and trained from three components: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base LLM. The model features a dynamic resolution mechanism: input images are resized according to a pre-set grid, with resolutions ranging from 336×336 to 2016×2016, while a 336×336 thumbnail is also kept. The resized images are split into non-overlapping patches of the same size. These patches and the thumbnail are encoded separately and then combined to form the full image representation. MiniMax-VL-01 achieves top-level performance on multimodal leaderboards, demonstrating its strength in complex multimodal tasks.
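The dynamic resolution mechanism can be illustrated with a small sketch. It is not the repository's preprocessing code. It assumes that the pre-set grid consists of multiples of 336 pixels per side (2016 = 6 × 336) and that each non-overlapping patch is a 336×336 tile, the same size as the thumbnail. The function names `tile_boxes`, `encoder_inputs`, and `vit_tokens_per_tile` are hypothetical, and the ViT patch size of 14 is taken from the architecture section.

```python
TILE = 336       # assumed tile side, equal to the 336x336 thumbnail
MAX_SIDE = 2016  # largest resolution named in the text (6 * 336)
VIT_PATCH = 14   # ViT patch size listed in the architecture section


def tile_boxes(cols, rows, tile=TILE, max_side=MAX_SIDE):
    """Return (left, top, right, bottom) boxes for a cols x rows grid of tiles."""
    limit = max_side // tile
    if not (1 <= cols <= limit and 1 <= rows <= limit):
        raise ValueError(f"grid must be between 1x1 and {limit}x{limit}")
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


def encoder_inputs(cols, rows):
    """Number of images sent to the ViT: every tile plus one thumbnail."""
    return len(tile_boxes(cols, rows)) + 1


def vit_tokens_per_tile(tile=TILE, patch=VIT_PATCH):
    """ViT patch tokens produced by one tile, before any pooling or merging."""
    return (tile // patch) ** 2


print(encoder_inputs(6, 6))    # 37: a 2016x2016 image gives 36 tiles plus the thumbnail
print(vit_tokens_per_tile())   # 576: (336 / 14) ** 2
```

Under these assumptions the largest grid yields 36 tiles, so the ViT encodes 37 images for a single 2016×2016 input. How the grid is chosen for a given image is not described here, so the sketch takes the grid as an input.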

2. Model Architecture

The architecture of MiniMax-Text-01 is briefly described as follows:

  • Total Parameters: 456B
  • Activated Parameters per Token: 45.9B
  • Number of layers: 80
  • Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers.
  • Number of attention heads: 64
  • Attention head dimension: 128
  • Mixture of Experts:
      • Number of experts: 32
      • Expert hidden dimension: 9216
      • Top-2 routing strategy
  • Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
  • Hidden Size: 6144
  • Vocab Size: 200,064
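A few of the numbers above can be made concrete with a short sketch. It is an illustration, not the model code. `layer_types` lays out the hybrid attention pattern, `rotary_dim` gives the part of each head that receives RoPE, and `top2_route` shows one common form of top-2 routing. All three names are hypothetical. The gate normalization (a softmax over the two selected logits) is an assumption, since the list above only states that routing is top-2.

```python
import math

NUM_LAYERS = 80
LIGHTNING_PER_SOFTMAX = 7
HEAD_DIM = 128
NUM_EXPERTS = 32


def layer_types(num_layers=NUM_LAYERS, lightning_run=LIGHTNING_PER_SOFTMAX):
    """One softmax attention layer after every `lightning_run` lightning layers."""
    period = lightning_run + 1
    return [
        "softmax" if (i + 1) % period == 0 else "lightning"
        for i in range(num_layers)
    ]


def rotary_dim(head_dim=HEAD_DIM):
    """RoPE is applied to half of the attention head dimension."""
    return head_dim // 2


def top2_route(gate_logits):
    """Pick the two highest-scoring experts and normalize their weights.

    Assumption: weights are a softmax over the two selected logits only.
    """
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:2]
    peak = gate_logits[top[0]]
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


layers = layer_types()
print(layers.count("lightning"), layers.count("softmax"))  # 70 10
print(rotary_dim())                                         # 64
```

With 80 layers and a period of 8, the layout contains 70 lightning attention layers and 10 softmax attention layers, and 64 of the 128 dimensions in each head are rotated.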

For MiniMax-VL-01, the additional ViT architecture details are as follows:

  • Total Parameters: 303M
  • Number of layers: 24
  • Patch size: 14
  • Hidden size: 1024
  • FFN hidden size: 4096
  • Number of heads: 16
  • Attention head dimension: 64
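As a rough consistency check on the ViT figures, the weight count can be estimated from the listed dimensions. The sketch below assumes a standard Transformer encoder block (four attention projection matrices and a two-matrix FFN) and ignores biases, layer norms, and the patch and position embeddings, so it is an approximation and not the exact parameter breakdown. The function names are hypothetical.

```python
def vit_block_params(hidden=1024, ffn=4096):
    """Weight matrices in one assumed standard encoder block."""
    attention = 4 * hidden * hidden  # Q, K, V, and output projections
    feed_forward = 2 * hidden * ffn  # up and down projections
    return attention + feed_forward


def vit_params(layers=24, hidden=1024, ffn=4096):
    """Approximate total over all blocks, excluding embeddings, biases, and norms."""
    return layers * vit_block_params(hidden, ffn)


print(16 * 64)       # 1024: heads * head dimension matches the hidden size
print(vit_params())  # 301989888, close to the reported 303M
```

The estimate comes to about 302M, and the omitted terms are small by comparison, so the listed dimensions are consistent with the stated 303M total.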

3. Evaluation

Text Benchmarks

Core Academic Benchmarks

| Tasks | GPT-4o (11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| General | | | | | | | | |
| MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5 |
| MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
| SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
| C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4 |
| IFEval _(avg)_ | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
| Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |
| Reasoning | | | | | | | | |
| GPQA* _(diamond)_ | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
| DROP* _(F1)_ | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8 |
| Mathematics | | | | | | | | |
| GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
| MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4 |
| Coding | | | | | | | | |
| MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7 |
| HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |

\* Evaluated following a _0-shot CoT_ setting.

Long Benchmarks

4M Needle In A Haystack Test

Ruler

| Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o (11-20) | 0.970 | 0.921 | 0.890 | 0.888 | 0.884 | - | - | - | - |
| Claude-3.5-Sonnet (10-22) | 0.965 | 0.960 | 0.957 | 0.950 | 0.952 | 0.938 | - | - | - |
| Gemini-1.5-Pro (002) | 0.962 | 0.960 | 0.960 | 0.958 | 0.938 | 0.917 | 0.916 | 0.861 | 0.850 |
| Gemini-2.0-Flash (exp) | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | - |
| MiniMax-Text-01 | 0.963 | 0.961 | 0.953 | 0.954 | 0.943 | 0.947 | 0.945 | 0.928 | 0.910 |

LongBench v2

| Model | overall | easy | hard | short | medium | long |
|---|---|---|---|---|---|---|
| Human | 53.7 | 100.0 | 25.1 | 47.2 | 59.1 | 53.7 |
| w/ CoT | | | | | | |
| GPT-4o (11-20) | 51.4 | 54.2 | 49.7 | 59.6 | 48.6 | 43.5 |
| Claude-3.5-Sonnet (10-22) | 46.7 | 55.2 | 41.5 | 53.9 | 41.9 | 44.4 |
| DeepSeek-V3 | - | - | - | - | - | - |
| Qwen2.5-72B-Inst. | 43.5 | 47.9 | 40.8 | 48.9 | 40.9 | 39.8 |
| MiniMax-Text-01 | 56.5 | 66.1 | 50.5 | 61.7 | 56.7 | 47.2 |
| w/o CoT | | | | | | |
| GPT-4o (11-20) | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
| Claude-3.5-Sonnet (10-22) | 41.0 | 46.9 | 37.3 | 46.1 | 38.6 | 37.0 |
| DeepSeek-V3 | 48.7 | - | - | - | - | - |
| Qwen2.5-72B-Inst. | 42.1 | 42.7 | 41.8 | 45.6 | 38.1 | 44.4 |
| MiniMax-Text-01 | 52.9 | 60.9 | 47.9 | 58.9 | 52.6 | 43.5 |

MTOB

| Context Type | no context | half book | full book | Δ half book | Δ full book |
|---|---|---|---|---|---|
| eng → kalam (ChrF) | | | | | |
| GPT-4o (11-20) | 9.90 | 54.30 | - | 44.40 | - |
…

Excerpt shown — open the source for the full document.

Notability

notability 8.0/10

New model release with strong GitHub traction.