What does this model signal mean?

Arcee AI published arcee-ai/AFM-4.5B-Base-KDA-NoPE. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license apache-2.0 · 7 HF downloads · A 4.5 billion parameter language model without positional encodings. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Arcee AI Model: arcee-ai/AFM-4.5B-Base-KDA-NoPE

Captured source

Hugging Face: huggingface.co/arcee-ai/AFM-4.5B-Base-KDA-NoPE

arcee-ai/AFM-4.5B-Base-KDA-NoPE model card

published Dec 14, 2025 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task feature-extraction · license apache-2.0 · library transformers · params 5B · downloads 7 · likes 14

AFM-4.5B-Base-KDA-NoPE

A hybrid attention variant of AFM-4.5B-Base combining Kimi Delta Attention (KDA) layers with NoPE (No Positional Encoding) full-attention layers in a 3:1 ratio. The model was produced by knowledge distillation from AFM-4.5B-Base and aims to balance efficiency with performance.

> ⚠️ Research Model: This is an experimental model released for research purposes. For production use, see AFM-4.5B.

More details are available in our blog post: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it

Overview

Following the Kimi Linear architecture pattern, this model interleaves KDA layers with periodic full-attention layers (using NoPE) in a 3:1 ratio. This hybrid structure reduces memory and KV-cache usage while preserving global information flow via the full attention layers.

Key characteristics:

- 3:1 KDA to full-attention ratio
- Full attention layers use NoPE (No Positional Encoding)
- Trained up to 32k sequence length
- Better short-context performance than pure KDA
- Reduced memory footprint compared to full attention

Architecture

| Component | Details |
|-----------|---------|
| Parameters | 4.5B |
| Attention Pattern | 1 Full Attn (NoPE) : 3 KDA |
| Positional Encoding | NoPE on full attention layers |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |
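
The 1:3 attention pattern can be pictured as a per-layer schedule. The sketch below is illustrative only: the function name, the depth of 8 layers, and the choice to place the full-attention layer last in each group of four are assumptions, since the card states the ratio but not the exact layer positions.

```python
from collections import Counter


def hybrid_layer_schedule(num_layers: int, kda_per_full: int = 3) -> list[str]:
    """Return one attention type per layer for a KDA / full-attention hybrid.

    Assumption: every group of (kda_per_full + 1) layers ends with the
    full-attention (NoPE) layer. The model card only gives the 3:1 ratio.
    """
    group = kda_per_full + 1
    return [
        "full_nope" if (i + 1) % group == 0 else "kda"
        for i in range(num_layers)
    ]


# 8 is a placeholder depth, not the real layer count of the model.
schedule = hybrid_layer_schedule(num_layers=8)
print(schedule)
print(Counter(schedule))
```

With any depth that is a multiple of four, the counts come out at three KDA layers for every full-attention layer, matching the table.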

Benchmark Results

Performance compared to the teacher model and other configurations:

| Benchmark | Teacher (Full Attn) | Hybrid (KDA-NoPE) | KDA-Only |
|-----------|:-------------------:|:-----------------:|:--------:|
| MMLU (Avg) | 63.1% | 55.1% | 55.8% |
| ARC-Challenge | 55.6% | 48.5% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% | 74.3% |
| GSM8K (Math) | 52.1% | 36.5% | 26.8% |

Key Findings

- Math advantage: The hybrid recovers significantly more math performance (36.5%) than pure KDA (26.8%)
- Knowledge benchmarks: Performs comparably to KDA-Only on MMLU, ARC, and HellaSwag
- Efficiency: Maintains efficiency gains from KDA while preserving global reasoning via NoPE layers
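
One way to read these findings is as the share of the teacher's score that each student retains. The snippet below is a small arithmetic sketch using only the numbers from the benchmark table; the `retention` helper and the variable names are made up for illustration.

```python
# (teacher, hybrid, kda_only) scores in percent, copied from the benchmark table.
scores = {
    "MMLU (Avg)": (63.1, 55.1, 55.8),
    "ARC-Challenge": (55.6, 48.5, 49.9),
    "HellaSwag (Norm)": (78.0, 74.3, 74.3),
    "GSM8K (Math)": (52.1, 36.5, 26.8),
}


def retention(student: float, teacher: float) -> float:
    """Student score as a percentage of the teacher score."""
    return 100.0 * student / teacher


for name, (teacher, hybrid, kda_only) in scores.items():
    print(
        f"{name:18s} hybrid {retention(hybrid, teacher):5.1f}%   "
        f"KDA-only {retention(kda_only, teacher):5.1f}%"
    )
```

On GSM8K this works out to roughly 70% of the teacher score for the hybrid versus roughly 51% for KDA-Only, while the other three benchmarks differ by only a few points between the two students.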

Long-Context Performance (NIAH)

The hybrid model shows distinct long-context behavior:

- 100% single-needle retrieval up to 32k
- Sharp performance cliff past 32k training length
- Near-zero performance beyond training context (vs. smooth degradation for KDA-Only)

The NoPE full-attention layers appear responsible for the hard cutoff: they never saw sequences longer than 32k during training. KDA layers generalize more naturally to longer sequences.
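
For readers unfamiliar with needle-in-a-haystack (NIAH) tests, the sketch below shows how a single-needle prompt might be built. It is not the evaluation harness behind the results above: the function name, the filler text, the needle, and the use of word counts as a stand-in for token counts are all assumptions made for illustration.

```python
def build_niah_prompt(
    filler: str, needle: str, question: str, total_words: int, depth: float
) -> str:
    """Hide `needle` inside repeated filler text at a relative depth in [0, 1].

    Word counts are used as a rough proxy for tokens; a real harness would
    measure length with the model's tokenizer.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    filler_words = filler.split()
    # Repeat the filler until the haystack has the requested number of words.
    haystack = [filler_words[i % len(filler_words)] for i in range(total_words)]
    insert_at = int(depth * total_words)
    words = haystack[:insert_at] + needle.split() + haystack[insert_at:]
    return " ".join(words) + "\n\n" + question


prompt = build_niah_prompt(
    filler="The quick brown fox jumps over the lazy dog.",
    needle="The secret code is 7421.",
    question="What is the secret code?",
    total_words=200,
    depth=0.5,
)
print(prompt[:120])
```

A retrieval score is then typically computed by checking whether the model's answer contains the hidden fact (here, `7421`) while sweeping `total_words` and `depth`.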

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and model weights from the Hugging Face Hub.
model_id = "arcee-ai/AFM-4.5B-Base-KDA-NoPE"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Tokenize the prompt and move it to the device the model was placed on.
prompt = "The theory of relativity states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Sample a continuation of up to 100 new tokens.
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
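
Because the long-context results above show near-zero retrieval past the 32k training length, it may be worth capping generation so that prompt plus output stays inside that window. The helper below is a hypothetical sketch, not part of the model card or of `transformers`; it also assumes "32k" means 32,768 tokens, which the card does not specify.

```python
TRAINED_CONTEXT = 32 * 1024  # assumption: "32k" read as 32,768 tokens


def clamp_new_tokens(
    prompt_tokens: int, requested_new_tokens: int, limit: int = TRAINED_CONTEXT
) -> int:
    """Return a max_new_tokens value that keeps the total length within `limit`."""
    if prompt_tokens >= limit:
        raise ValueError("prompt already fills the trained context window")
    return min(requested_new_tokens, limit - prompt_tokens)


print(clamp_new_tokens(1_000, 100))   # 100: plenty of room left
print(clamp_new_tokens(32_700, 100))  # 68: only 68 tokens remain in the window
```

In the usage snippet, the result could be passed as `max_new_tokens`, with `prompt_tokens` taken from `input_ids.shape[-1]`.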

Training Details

- Method: Knowledge distillation from AFM-4.5B-Base using DistillKit
- Teacher: AFM-4.5B-Base (full attention)
- Student Architecture: Hybrid 3:1 KDA:NoPE
- Training Length: 32k sequence length
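
As background on the method, a common form of logit distillation minimizes the KL divergence between the teacher's and the student's next-token distributions. The pure-Python sketch below illustrates that generic objective for a single token position; it is not DistillKit's API, and the card does not say which loss was actually used, so the function names and the temperature parameter are assumptions.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(
    teacher_logits: list[float], student_logits: list[float], temperature: float = 1.0
) -> float:
    """KL(teacher || student) over one token position's vocabulary logits."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 0.5, -1.0]  # toy logits over a 3-token vocabulary
student = [1.5, 0.7, -0.5]
print(kl_divergence(teacher, student))
print(kl_divergence(teacher, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge; in practice it would be averaged over all positions in a batch.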

Comparison: Hybrid vs KDA-Only

| Aspect | Hybrid (KDA-NoPE) | KDA-Only |
|--------|:-----------------:|:--------:|
| Math (GSM8K) | 36.5% ✓ | 26.8% |
| Within-training NIAH | 100% | 100% |
| Beyond-training behavior | Hard cliff | Smooth degradation |
| Memory efficiency | ~75% reduction | ~100% reduction |

Choose Hybrid for better short-context reasoning, especially math. Choose KDA-Only for more predictable long-context degradation.
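
One plausible reading of the "~75% reduction" figure is that it refers to the KV cache: if only one layer in four is full attention, only a quarter of the layers keep keys and values that grow with sequence length. The arithmetic below sketches that reading. The layer count, KV head count, and head dimension are placeholders rather than the model's real configuration, and the fixed-size recurrent state of the KDA layers is ignored.

```python
def kv_cache_bytes(
    num_attn_layers: int,
    seq_len: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # bfloat16
) -> int:
    """Size of the K and V tensors cached across all full-attention layers."""
    return 2 * num_attn_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


# Hypothetical dimensions, chosen only to make the ratio concrete.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 32 * 1024

full = kv_cache_bytes(NUM_LAYERS, SEQ_LEN, NUM_KV_HEADS, HEAD_DIM)
hybrid = kv_cache_bytes(NUM_LAYERS // 4, SEQ_LEN, NUM_KV_HEADS, HEAD_DIM)
reduction = 1 - hybrid / full

print(f"full attention: {full / 2**30:.2f} GiB")
print(f"3:1 hybrid:     {hybrid / 2**30:.2f} GiB")
print(f"reduction:      {reduction:.0%}")
```

The ratio depends only on the share of full-attention layers, so the 75% result holds for any choice of the placeholder dimensions.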

Intended Use

This model is intended for:

- Research into hybrid attention architectures
- Studying linear/full attention tradeoffs
- Exploring NoPE attention in hybrid configurations
- Benchmarking efficiency vs. capability tradeoffs

License

AFM-4.5B is released under the Apache-2.0 license.

Notability

notability 2.0/10

Low traction, routine model release