Model · Arcee AI · published Dec 14, 2025

arcee-ai/AFM-4.5B-Base-KDA-NoPE

task: image-feature-extraction · license: apache-2.0 · library: transformers · params: 5B · downloads: 11 · likes: 14

AFM-4.5B-Base-KDA-NoPE

A hybrid attention variant of AFM-4.5B-Base that combines Kimi Delta Attention (KDA) layers with NoPE (No Positional Encoding) full-attention layers in a 3:1 ratio. The model was produced by knowledge distillation from the full-attention AFM-4.5B-Base teacher, trading some benchmark accuracy for lower memory and KV-cache usage.

> ⚠️ Research Model: This is an experimental model released for research purposes. For production use, see AFM-4.5B.

More details are available in our blog post: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it

Overview

Following the Kimi Linear architecture pattern, this model interleaves KDA layers with periodic full-attention layers (using NoPE) in a 3:1 ratio. This hybrid structure reduces memory and KV-cache usage while preserving global information flow via the full attention layers.

Key characteristics:

  • 3:1 KDA to full-attention ratio
  • Full attention layers use NoPE (No Positional Encoding)
  • Trained up to 32k sequence length
  • Better short-context math performance (GSM8K) than pure KDA, with comparable results on knowledge benchmarks
  • Reduced memory footprint compared to full attention

Architecture

| Component | Details |
|-----------|---------|
| Parameters | 4.5B |
| Attention Pattern | 1 Full Attn (NoPE) : 3 KDA |
| Positional Encoding | NoPE on full attention layers |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |
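To make the attention pattern concrete, the following is a minimal sketch of how a 3:1 layer schedule could be laid out. The function name `hybrid_layer_schedule`, the layer labels, and the choice to put the full-attention layer last in each group of four are assumptions for illustration; the model card only states the ratio, and the depth of 8 layers is hypothetical.

```python
def hybrid_layer_schedule(num_layers: int, kda_per_full: int = 3) -> list[str]:
    """Return a per-layer list of attention types for a KDA / full-attention hybrid.

    Assumption: every group of (kda_per_full + 1) layers ends with the
    full-attention (NoPE) layer. The source only gives the 3:1 ratio.
    """
    group = kda_per_full + 1
    return [
        "full_nope" if (i + 1) % group == 0 else "kda"
        for i in range(num_layers)
    ]


# Hypothetical depth of 8 layers, chosen only to keep the printout short.
schedule = hybrid_layer_schedule(8)
print(schedule)
```

With this layout, three out of every four layers are KDA layers and one is a full-attention layer, matching the ratio in the table.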

Benchmark Results

Performance compared to the teacher model and other configurations:

| Benchmark | Teacher (Full Attn) | Hybrid (KDA-NoPE) | KDA-Only |
|-----------|:-------------------:|:-----------------:|:--------:|
| MMLU (Avg) | 63.1% | 55.1% | 55.8% |
| ARC-Challenge | 55.6% | 48.5% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% | 74.3% |
| GSM8K (Math) | 52.1% | 36.5% | 26.8% |

Key Findings

  • Math advantage: On GSM8K the hybrid scores 36.5% against 26.8% for pure KDA, a gap of 9.7 points, though both remain below the teacher's 52.1%
  • Knowledge benchmarks: Within about 1.5 points of KDA-Only on MMLU and ARC-Challenge (slightly lower on both), and tied on HellaSwag
  • Efficiency: Maintains efficiency gains from KDA while preserving global reasoning via NoPE layers
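One way to read these findings is as the fraction of the teacher's score each student retains. The sketch below does that arithmetic using only the numbers from the benchmark table; the `retention` helper and the dictionary layout are illustrative names, not part of any released tooling.

```python
# Scores copied from the benchmark table above (percent).
scores = {
    "MMLU (Avg)":       {"teacher": 63.1, "hybrid": 55.1, "kda_only": 55.8},
    "ARC-Challenge":    {"teacher": 55.6, "hybrid": 48.5, "kda_only": 49.9},
    "HellaSwag (Norm)": {"teacher": 78.0, "hybrid": 74.3, "kda_only": 74.3},
    "GSM8K (Math)":     {"teacher": 52.1, "hybrid": 36.5, "kda_only": 26.8},
}


def retention(student: float, teacher: float) -> float:
    """Fraction of the teacher's score that the student keeps."""
    return student / teacher


for name, row in scores.items():
    hybrid = retention(row["hybrid"], row["teacher"])
    kda_only = retention(row["kda_only"], row["teacher"])
    print(f"{name:18s} hybrid {hybrid:6.1%}   KDA-only {kda_only:6.1%}")
```

On these numbers the hybrid keeps roughly 70% of the teacher's GSM8K score, while KDA-Only keeps roughly 51%.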

Long-Context Performance (NIAH)

The hybrid model shows distinct long-context behavior:

  • 100% single-needle retrieval up to 32k
  • Sharp performance cliff past 32k training length
  • Near-zero performance beyond training context (vs. smooth degradation for KDA-Only)

The NoPE full-attention layers appear responsible for the hard cutoff: they were never trained on sequences longer than 32k tokens. KDA layers generalize more naturally to longer sequences.
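For readers unfamiliar with needle-in-a-haystack (NIAH) tests, the sketch below shows the general shape of a single-needle trial: a fact is buried at a chosen depth in filler text and the model is asked to retrieve it. The helper names, the filler sentence, and the substring-match scoring are hypothetical; the source does not describe the exact NIAH harness used.

```python
def build_niah_prompt(filler: str, needle: str, question: str,
                      n_filler: int, depth: float) -> str:
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) inside
    `n_filler` repetitions of `filler`, then append the question."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences) + "\n\n" + question


def retrieved(model_output: str, expected: str) -> bool:
    """Score one trial: did the expected answer appear in the model output?"""
    return expected.lower() in model_output.lower()


# Hypothetical example: needle placed halfway through 200 filler sentences.
prompt = build_niah_prompt(
    filler="The grass is green and the sky is blue.",
    needle="The secret number is 7421.",
    question="What is the secret number?",
    n_filler=200,
    depth=0.5,
)
```

Context length is measured in tokens, so in practice `n_filler` would be tuned with the model's tokenizer until the prompt reaches the target length (for example, just under and just over 32k).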

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-NoPE"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in bfloat16 and let Accelerate place them on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# This is a base model, so give it text to continue rather than an instruction.
prompt = "The theory of relativity states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Sample up to 100 new tokens with temperature and nucleus (top-p) sampling.
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
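Because retrieval drops sharply past the 32k training length, it may be worth checking the total sequence length before generating. The following is a small sketch of such a guard; the constant `MAX_TRAINED_TOKENS`, the reading of "32k" as 32,768 tokens, and the function name are assumptions, not values confirmed by the model card.

```python
# Assumption: "32k" is read as 32 * 1024 = 32,768 tokens.
MAX_TRAINED_TOKENS = 32 * 1024


def fits_training_context(num_prompt_tokens: int, max_new_tokens: int,
                          limit: int = MAX_TRAINED_TOKENS) -> bool:
    """Return True if prompt plus generated tokens stay within the trained length."""
    return num_prompt_tokens + max_new_tokens <= limit


# Example: a 32,700-token prompt leaves room for 68 new tokens, but not 100.
print(fits_training_context(32_700, 68))
print(fits_training_context(32_700, 100))
```

In the usage snippet above, `num_prompt_tokens` would correspond to `input_ids.shape[-1]`.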

Training Details

  • Method: Knowledge distillation from AFM-4.5B-Base using DistillKit
  • Teacher: AFM-4.5B-Base (full attention)
  • Student Architecture: Hybrid 3:1 KDA:NoPE
  • Training Length: 32k sequence length
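As background on the method, a common form of knowledge distillation trains the student to match the teacher's next-token distribution by minimizing a KL divergence between the two. The sketch below computes that quantity for a single token position in plain Python. It is a generic illustration, not DistillKit's actual objective, which the source does not specify; the function names and the `temperature` parameter are assumptions.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(teacher_logits: list[float], student_logits: list[float],
               temperature: float = 1.0) -> float:
    """KL(teacher || student) for one token position.

    Generic logit-matching loss; an assumption, not DistillKit's documented loss.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy vocabulary of three tokens: identical logits give zero loss,
# disagreeing logits give a positive loss.
print(kd_kl_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(kd_kl_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Real training would compute this over batched tensors and all token positions; the scalar version is only meant to show what the student is being pulled toward.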

Comparison: Hybrid vs KDA-Only

| Aspect | Hybrid (KDA-NoPE) | KDA-Only |
|--------|:-----------------:|:--------:|
| Math (GSM8K) | 36.5% ✓ | 26.8% |
| Within-training NIAH | 100% | 100% |
| Beyond-training behavior | Hard cliff | Smooth degradation |
| Memory efficiency | ~75% reduction | ~100% reduction |

Choose Hybrid for better short-context reasoning, especially math. Choose KDA-Only for more predictable long-context degradation.
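The ~75% figure in the table is consistent with a simple KV-cache estimate: if only one layer in four is a full-attention layer, only a quarter of the layers store keys and values that grow with sequence length. The sketch below works through that arithmetic. The layer count, KV-head count, and head dimension are hypothetical placeholders (the source does not list them), and the fixed-size recurrent state kept by KDA layers is ignored.

```python
def kv_cache_bytes(num_full_attn_layers: int, seq_len: int,
                   num_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * num_full_attn_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


# Hypothetical dimensions, for illustration only.
num_layers, num_kv_heads, head_dim, seq_len = 32, 8, 128, 32 * 1024

full = kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim)
hybrid = kv_cache_bytes(num_layers // 4, seq_len, num_kv_heads, head_dim)

reduction = 1 - hybrid / full
print(f"KV-cache reduction: {reduction:.0%}")  # prints "KV-cache reduction: 75%"
```

The ratio depends only on the share of full-attention layers, so the result is the same for any choice of the placeholder dimensions. A KDA-Only model has no growing KV cache at all, which corresponds to the ~100% entry.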

Intended Use

This model is intended for:

  • Research into hybrid attention architectures
  • Studying linear/full attention tradeoffs
  • Exploring NoPE attention in hybrid configurations
  • Benchmarking efficiency vs. capability tradeoffs

License

AFM-4.5B-Base-KDA-NoPE is released under the Apache-2.0 license.
