google/gemma-4-12B-it-assistant

Open original ↗

Captured source

source ↗
published May 23, 2026seen 5dcaptured 9hhttp 200method plaintask any-to-anylicense apache-2.0library transformersparams 423Mdownloads 23klikes 81

Hugging Face | GitHub | Launch Blog | Documentation

License: Apache 2.0 | Authors: Google DeepMind

> [!Note] > This model card is for the Multi-Token Prediction (MTP) drafters for the Gemma 4 models. MTP is implemented by extending the base model with a smaller, faster draft model. When used in a Speculative Decoding pipeline, the draft model predicts several tokens ahead, which the target model then verifies in parallel. This results in significant decoding speedups (up to 3x) while guaranteeing the exact same quality as standard generation, making these checkpoints perfect for low-latency and on-device applications.

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in five distinct sizes: E2B, E4B, 12B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key capability and architectural advancements:

  • Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
  • Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B, E4B, and 12B models).
  • Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
  • Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
  • Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.
  • Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
  • Native System Prompt Support – Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.

Models Overview

Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (12B, 26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).

Dense Models

| Property | E2B | E4B | 12B Unified | 31B Dense | | :---- | :---- | :---- | :---- | :---- | | Total Parameters | 2.3B effective (5.1B with embeddings) | 4.5B effective (8B with embeddings) | 11.95B | 30.7B | | Layers | 35 | 42 | 48 | 60 | | Sliding Window | 512 tokens | 512 tokens | 1024 tokens | 1024 tokens | | Context Length | 128K tokens | 128K tokens | 256K tokens | 256K tokens | | Vocabulary Size | 262K | 262K | 262K | 262K | | Supported Modalities | Text, Image, Audio | Text, Image, Audio | Text, Image, Audio | Text, Image | | Vision Encoder Parameters | *~150M* | *~150M* | - | *~550M* | | Audio Encoder Parameters | *~300M* | *~300M* | - | No Audio |

The "E" in E2B and E4B stands for "effective" parameters. The smaller models incorporate Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters to the model, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for quick lookups, which is why the effective parameter count is much smaller than the total.

The "Unified" in Gemma 4 12B Unified refers to its encoder-free architecture. Other Gemma 4 models use dedicated encoders to process multimodal data before passing it to the LLM. Gemma 4 12B eliminates these encoders entirely, projecting raw image patches and audio waveforms directly into the LLM's embedding space through lightweight linear layers. This unified approach means all modalities flow straight into a single decoder-only transformer, reducing multimodal latency and allowing the entire model to be fine-tuned in one pass.

Mixture-of-Experts (MoE) Model

| Property | 26B A4B MoE | | :---- | :---- | | Total Parameters | 25.2B | | Active Parameters | 3.8B | | Layers | 30 | | Sliding Window | 1024 tokens | | Context Length | 256K tokens | | Vocabulary Size | 262K | | Expert Count | 8 active / 128 total and 1 shared | | Supported Modalities | Text, Image | | Vision Encoder Parameters | *~550M* |

The "A" in 26B A4B stands for "active parameters" in contrast to the total number of parameters the model contains. By only activating a 4B subset of parameters during inference, the Mixture-of-Experts model runs much faster than its 26B total might suggest. This makes it an excellent choice for fast inference compared to the dense 31B model since it runs almost as fast as a 4B-parameter model.

Benchmark Results

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked in the table are for instruction-tuned models.

| | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 12B Unified | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B (no think) | | :---- | :---- | :---- | :---- | :---- | :---- | :---- | | MMLU Pro | 85.2% | 82.6% | 77.2% | 69.4% | 60.0% | 67.6% | | AIME 2026 no tools | 89.2% | 88.3% | 77.5% | 42.5% | 37.5% | 20.8% | | LiveCodeBench v6 | 80.0% | 77.1% | 72.0% | 52.0% | 44.0% | 29.1% | | Codeforces ELO | 2150 | 1718 | 1659…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable model release from Google DeepMind with moderate downloads.