Model: Amazon (Nova), published Mar 31, 2026

amazon/GKA-primed-HQwen3-8B-Instruct

task: text-generation; license: apache-2.0; library: transformers; params: 8.5B; downloads: 3.5k; likes: 2

GKA-primed-HQwen3-8B-Instruct

GKA-primed-HQwen3-8B-Instruct is a Hybrid language model consisting of 50% Attention layers and 50% Gated KalmaNet (GKA) layers, primed from Qwen3-8B using the Hybrid Model Factory Priming pipeline. The model is instruction-tuned and supports context lengths up to 128K tokens.

GKA (pronounced as gee-ka) is a State-Space Model layer inspired by the Kalman Filter that solves an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length.
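To make "online ridge regression with constant memory and linear compute" concrete, here is a minimal sketch using textbook recursive least squares. It is not the GKA layer or its kernel, whose details are not given here; the function name `online_ridge`, the scalar-target setup, and the demo weights are illustrative assumptions.

```python
import random

def online_ridge(stream, dim, lam=1.0):
    """Textbook recursive least squares (illustrative, not the GKA kernel).

    Keeps a fixed-size state, P = (X^T X + lam*I)^-1 and the weights w,
    and updates both once per (x, y) pair, so memory is constant and
    total work is linear in the number of pairs.
    """
    P = [[(1.0 / lam if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    w = [0.0] * dim
    for x, y in stream:
        Px = [sum(P[i][j] * x[j] for j in range(dim)) for i in range(dim)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(dim))
        # Sherman-Morrison rank-1 update of the inverse (P stays symmetric).
        for i in range(dim):
            for j in range(dim):
                P[i][j] -= Px[i] * Px[j] / denom
        # Correct the weights by the prediction error on the new pair.
        err = y - sum(w[i] * x[i] for i in range(dim))
        for i in range(dim):
            w[i] += (Px[i] / denom) * err
    return w

if __name__ == "__main__":
    rng = random.Random(0)
    true_w = [0.5, -1.0, 2.0]  # hypothetical target weights
    data = []
    for _ in range(200):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        data.append((x, sum(a * b for a, b in zip(true_w, x))))
    print(online_ridge(data, dim=3, lam=1e-3))
```

The state (`P` and `w`) never grows with the number of pairs seen, which is the property that lets an SSM-style layer avoid a growing KV cache.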

By combining Attention with GKA, our Hybrid model achieves up to 2× faster inference at long contexts while closely matching the base Transformer's quality.

Why Hybrid?

Each Primed Hybrid model is initialized from a base Transformer by converting a portion of its Attention layers into State-Space Model (SSM) layers that maintain a fixed-size recurrent state instead of a growing KV cache. At a 50% Hybrid ratio, roughly half the KV cache (which grows linearly with sequence length) is replaced with fixed-size SSM state. The practical benefits:

  • Higher throughput at long contexts — less memory on KV cache means more memory for batching
  • More concurrent sequences — ~2× as many concurrent sequences before hitting memory limits
  • Growing advantage with context length — at long contexts, Attention dominates the forward pass while SSM layers remain negligible in cost. Since the Hybrid model makes roughly half as many Attention calls as the base Transformer, the throughput advantage grows with context length
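The memory argument in the list above can be checked with rough arithmetic. The sketch below assumes grouped-query attention with 8 KV heads of dimension 128 (values assumed for Qwen3-8B, not stated in this card) and bfloat16 storage; the 36-layer total follows from the 18 + 18 split. The fixed-size SSM state is left out because its size is not given here.

```python
def kv_cache_bytes(attn_layers, seq_len, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: K and V per attention layer per token.

    kv_heads and head_dim are assumed values, not taken from this model card.
    """
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * seq_len

GIB = 2 ** 30
seq_len = 128 * 1024  # 128K tokens

base = kv_cache_bytes(attn_layers=36, seq_len=seq_len)    # all-Attention baseline
hybrid = kv_cache_bytes(attn_layers=18, seq_len=seq_len)  # 50% Hybrid, SSM state not counted

print(f"base:   {base / GIB:.1f} GiB per sequence")
print(f"hybrid: {hybrid / GIB:.1f} GiB per sequence, plus a fixed-size SSM state")
```

Under these assumptions the baseline needs 18 GiB of KV cache per 128K-token sequence and the Hybrid 9 GiB, which is where the roughly 2× figure for concurrent sequences comes from.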

Increasing the hybridization ratio (replacing more Attention layers with SSM layers) further reduces memory use and increases throughput, typically at the expense of model quality.

Model Overview

  • Type: Causal Language Model (Hybrid Attention + SSM)
  • Base Model: Qwen3-8B
  • Hybrid Layer Type: Gated KalmaNet (GKA)
  • Hybrid Ratio: 50% (18 Attention + 18 GKA layers)
  • Parameters: ~8B
  • Context Length: 128K natively
  • Precision: bfloat16
  • License: Apache 2.0

Note: this is an instruction-tuned model, not a thinking model. It does not natively produce chain-of-thought thinking tokens in its generation trace.

Benchmark Results

Below we report benchmark performance for all our instruct-tuned Primed models. All Hybrid models use a 50% Hybrid ratio and are Primed from Qwen3-8B.

We consider two baselines:

1. Qwen3-8B (non-thinking, from HF): The original Qwen model evaluated in non-thinking mode, which is the intended mode for an Instruct model. This serves as the base Transformer from which we start the Priming procedure.
2. Qwen3-8B (Long): The Qwen model fine-tuned on our priming data, extending its native context length from 32K to 128K. All Primed Hybrid models use the same training hyperparameters and data as this baseline, making it a fair comparison for differing architectures.

On both long- and short-context benchmarks, our Primed Hybrid models closely match the performance of the Transformer model while having [considerably lower deployment costs](#inference-efficiency), showcasing the efficacy of the Priming process.

Long-Context Benchmarks

Models are evaluated on HELMET, MRCR, and BABILong across context lengths from 8K to 128K. Scores are aggregated with a weighted average whose weights increase geometrically with context length.
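One way to read "geometrically increasing weights" is sketched below. The ratio of 2 between consecutive weights and the doubling grid of context lengths are assumptions, since neither is stated here, and the scores are placeholders rather than results from this card.

```python
def weighted_context_average(scores_by_length, ratio=2.0):
    """Average scores with weights ratio**0, ratio**1, ... by increasing context length.

    The ratio is a hypothetical choice; the card does not state it.
    """
    lengths = sorted(scores_by_length)
    weights = [ratio ** i for i in range(len(lengths))]
    total = sum(w * scores_by_length[n] for w, n in zip(weights, lengths))
    return total / sum(weights)

# Placeholder scores, not results from this model card.
example = {8_192: 80.0, 16_384: 78.0, 32_768: 75.0, 65_536: 70.0, 131_072: 60.0}
print(weighted_context_average(example))
```

With a ratio above 1, a drop at the longest context pulls the aggregate down more than the same drop at 8K would.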

The plot below shows performance averaged over context lengths from 8K to 128K.

> [!NOTE]
> For the Qwen3-8B (non-thinking, from HF) model, we used YaRN to evaluate on long-context tasks, as directed in the model card.
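For orientation, a YaRN setup of the kind the note refers to is usually expressed as a `rope_scaling` entry in the model config. The sketch below assumes the scaling factor is simply the ratio of the 128K target to the 32K native window; the exact settings used for this evaluation are not given here.

```python
# Assumed values: a 32K native window stretched 4x to reach 128K.
native_ctx = 32 * 1024
target_ctx = 128 * 1024

# Keys follow the rope_scaling convention used in transformers model configs.
rope_scaling = {
    "rope_type": "yarn",
    "factor": target_ctx / native_ctx,
    "original_max_position_embeddings": native_ctx,
}
print(rope_scaling)
```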

How close are the Hybrid models to the Transformer baseline on long context tasks? Primed GKA and GDN Hybrids are within ~1.5 points of Qwen3-8B (Long) on average, while being [1.5–2× faster at inference](#inference-efficiency). Primed B'MOJO-F matches GKA/GDN in quality but is slower due to unfused SSM+SWA kernels. Primed Mamba2 lags further behind (a gap of roughly 3 points), consistent with GKA and GDN's higher expressivity.

Why SSM layers over Sliding Window Attention (SWA)? All Hybrid SSM models outperform the Hybrid SWA model (50% Attention + 50% SWA, window size 512). Even though SWA uses ~2× the effective state size of GKA at BF16, SSM layers retain information from the remote past, while SWA forgets everything beyond its window.

Short-Context NLP Benchmarks

Evaluations on Tulu3-dev from OLMES. All tasks are over a short-context length (≤ 8K). Each category in the table below averages the following Tulu3-dev subtasks:

1. Math: GSM8K, MATH.
2. Knowledge: MMLU, PopQA, TruthfulQA.
3. Coding: HumanEval, HumanEval+.
4. Reasoning: BigBenchHard.
5. Instruction Following: IFEval.

| Model | Math | Knowledge | Coding | Reasoning | Instruction Following | Average |
|---|---|---|---|---|---|---|
| Qwen3-8B [non-thinking, from HF] | 81.36 | 49.33 | 91.77 | 74.31 | 85.59 | 76.47 |
| Qwen3-8B [Long] | 64.56 | 49.75 | 91 | 76.27 | 74.49 | 71.21 |
| GKA-primed-HQwen3-8B-Instruct | 64.15 | 47.90 | 90.46 | 72.60 | 70.98 | 69.22 |
| GDN-primed-HQwen3-8B-Instruct | 59.54 | 48.41 | 91.18 | 72.97 | 73.57 | 69.13 |
| Mamba2-primed-HQwen3-8B-Instruct | 57.77 | 46.91 | 89.56 | 70.99 | 74.86 | 68.02 |
| BMOJOF-primed-HQwen3-8B-Instruct | 65.69 | 48.63 | 90.02 | 76.42 | 75.60 | 71.27 |
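The Average column and each model's distance from the Qwen3-8B [Long] baseline can be recomputed from the category scores. This is a small sketch that assumes Average is the unweighted mean of the five categories, which matches the reported values to rounding.

```python
# Category scores copied from the table above, in the order
# Math, Knowledge, Coding, Reasoning, Instruction Following.
scores = {
    "Qwen3-8B [non-thinking, from HF]": [81.36, 49.33, 91.77, 74.31, 85.59],
    "Qwen3-8B [Long]": [64.56, 49.75, 91.0, 76.27, 74.49],
    "GKA-primed-HQwen3-8B-Instruct": [64.15, 47.90, 90.46, 72.60, 70.98],
    "GDN-primed-HQwen3-8B-Instruct": [59.54, 48.41, 91.18, 72.97, 73.57],
    "Mamba2-primed-HQwen3-8B-Instruct": [57.77, 46.91, 89.56, 70.99, 74.86],
    "BMOJOF-primed-HQwen3-8B-Instruct": [65.69, 48.63, 90.02, 76.42, 75.60],
}

def average(model):
    """Unweighted mean over the five categories (assumed aggregation)."""
    return sum(scores[model]) / len(scores[model])

baseline = average("Qwen3-8B [Long]")
for model in scores:
    gap = baseline - average(model)
    print(f"{model}: average {average(model):.2f}, gap to Long {gap:+.2f}")
```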

How close are the Hybrid models to the Transformer baseline on short context tasks? All Primed Hybrid models are within ~3 points of Qwen3-8B (Long), using […]

> [!NOTE]
> For applications to complex reasoning and coding problems check out our Primed Hybrid Reasoning models.

About Gated KalmaNet (GKA)

Gated KalmaNet is a State-Space Model layer that is more expressive than both Mamba2 and Gated DeltaNet. GKA achieves this by employing the Kalman Filter to compute the optimal state at each time-step based on the…
