Distilling Kimi Delta Attention into AFM-4.5B (and the Tool We Used to Do It)

Charles Goddard

December 15, 2025

Learn how Kimi Delta Attention was distilled into AFM-4.5B using knowledge distillation, long-context training, and Arcee’s open-source DistillKit.

Moonshot AI recently put out a great paper (and an associated model) on an extension of Gated DeltaNet that they have termed Kimi Delta Attention (KDA). The results look super promising, particularly in the now-classic three-to-one interleaved local and global attention hybrid arrangement. The pretrained model they released is great, but they've open sourced both training and inference kernels, so of course I had to do something to play with them.

Pretraining a whole model from scratch just for funsies is still a little too rich for my blood. Inspired by the paper RADLADS, I decided to try to convert AFM-4.5B-Base into a hybrid KDA and full-attention transformer through knowledge distillation, then see how far it generalizes in long context land.

A tiny note on terms before we go too far: when I say “full attention” I mean standard global self-attention. When I say “NoPE” in this post, I mean we removed RoPE and did not replace it with any other positional embedding scheme in those layers.

Creating the Student

First order of business was to create the student model, meaning both modeling code for the desired architecture and a set of decently-initialized weights.

Modeling code turned out to be super easy thanks to flash-linear-attention. The Moonshot AI folks contributed kernels and there's a complete layer implementation that is more or less a drop-in fit. The only real code changes needed were to plug that in, rip out RoPE, and add configuration for what layers are KDA vs. full attention.

Initializing weights is a little trickier, but not much. Obviously for the majority of the weights they can be copied straight through from the teacher to the student. MLP, embeddings, norms, and so forth can be kept exactly the same. The attention parameters need a bit more attention. The additional parameters unique to KDA layers (like A_log, dt_bias, and so forth) are initialized from scratch.
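
As a rough illustration of that copy-or-initialize split, here is a minimal sketch that compares parameter names and shapes between a teacher and a student and labels each student parameter. The parameter names, the toy shapes, and the `plan_weight_init` helper are all hypothetical, made up for this example; this is not the script actually used for AFM-4.5B.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def plan_weight_init(
    teacher: Dict[str, Shape], student: Dict[str, Shape]
) -> Dict[str, str]:
    """Label every student parameter with how it would be initialized.

    "copy":  same name and shape in the teacher, so copy it straight through.
    "adapt": same name but a different shape, so it needs special handling.
    "fresh": no teacher counterpart (e.g. KDA-only parameters), so it is
             initialized from scratch.
    """
    plan = {}
    for name, shape in student.items():
        if name not in teacher:
            plan[name] = "fresh"
        elif teacher[name] == shape:
            plan[name] = "copy"
        else:
            plan[name] = "adapt"
    return plan


# Toy names and shapes, purely illustrative (not AFM-4.5B's real ones).
teacher_shapes = {
    "model.embed_tokens.weight": (1000, 64),
    "model.layers.0.mlp.up_proj.weight": (256, 64),
    "model.layers.0.self_attn.q_proj.weight": (64, 64),
    "model.layers.0.self_attn.k_proj.weight": (16, 64),
}
student_shapes = {
    "model.embed_tokens.weight": (1000, 64),
    "model.layers.0.mlp.up_proj.weight": (256, 64),
    "model.layers.0.self_attn.q_proj.weight": (64, 64),
    "model.layers.0.self_attn.k_proj.weight": (64, 64),
    "model.layers.0.self_attn.A_log": (8,),
    "model.layers.0.self_attn.dt_bias": (64,),
}

plan = plan_weight_init(teacher_shapes, student_shapes)
for name, action in plan.items():
    print(f"{action:5s}  {name}")
```

Anything labeled "adapt" is where a shape mismatch between teacher and student has to be resolved by hand before the weights can be reused.
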
There are q_proj, k_proj, and so forth for Kimi Delta Attention layers, but as currently implemented there isn't an equivalent of GQA for KDA. (You can set num_v_heads, but that doesn't do anything for your key heads and I ran into some crashes when training with it enabled anyway.) So I used the very elegant solution of just repeating the grouped head weights out to MHA-shaped projections. This makes the resulting model about 5B parameters, up from the original 4.5B, but sure beats thinking any harder about it.

Distilling da Knowledge

The RADLADS paper sets out a pretty clear and effective three-step pipeline for this sort of distillation: first "Attention Hidden State Alignment", a distillation targeting hidden state alignment with only the attention parameters trainable; then a full-parameter distillation; and finally a fine tune at long context for sequence length generalization.

I found their pipeline highly effective, but after a bunch of experiments ended up settling on a slightly modified version that gave equivalent results for this specific setup while being much faster to iterate on. I collapsed the first two distillation stages into a single one with frozen MLP parameters, using a cosine loss instead of MSE between hidden states, then allowed the long context fine tune to pick up any slack necessary in adjusting the MLP layers.

It's a shame there isn't a handy open-source toolkit for doing this kind of distillation.
But partake in a flight of fancy with me for a moment, and imagine we live in a world in which this YAML config suffices to run it:

    project_name: distillkit-afm-kda
    model: arcee-train/afm-4p5b-kdanope-untrained
    trust_remote_code: true

    frozen_res: # regular expressions for parameter names to freeze during training
      - embed_tokens
      - lm_head
      - norm\.weight
      - ^model\.layers\.[0-9]+\.mlp\..*

    output_path: /workspace/models/afm-4p5b-kda-hsd
    use_flash_attention: true
    sequence_length: 2048

    dataset:
      train_dataset:
        repo_id: arcee-train/afm-autodistill-mix-v0
        split: train
      seed: 42

    loss_functions:
      - function: cross_entropy
        weight: 0.2
      - function: kl
        weight: 0.2
        temperature: 2.0
      - function: hs_cosine # cosine loss between hidden states
        weight: 0.6

    layer_mapping: all

    teacher:
      kind: hf
      path: arcee-ai/AFM-4.5B-Base
      kwargs:
        attn_implementation: flash_attention_2
        torch_dtype: bfloat16

    training_args:
      dataset_text_field: text
      packing: True
      num_train_epochs: 1
      per_device_train_batch_size: 4
      gradient_accumulation_steps: 2

      pad_to_multiple_of: 128 # training kernels got unhappy for very short packed sequences

      save_steps: 200
      save_total_limit: 1
      logging_steps: 1

      learning_rate: 1.0e-3
      weight_decay: 0.00
      warmup_ratio: 0.025
      lr_scheduler_type: cosine_with_min_lr
      lr_scheduler_kwargs:
        min_lr: 1.0e-5

      bf16: true
      max_grad_norm: 0.5
      optim: adamw_torch

      gradient_checkpointing: true
      gradient_checkpointing_kwargs:
        use_reentrant: false

What a wonderful world that would be.

This is surprisingly effective. Even with MLP and embedding parameters frozen, and only about 300 million tokens seen, token-averaged KL divergence between student and teacher converged to around 0.2 on a small held-out slice of the same mix. That's pretty good!
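
To make the frozen_res entries in the config above concrete, the sketch below applies those four patterns to a handful of parameter names and splits them into frozen and trainable. It assumes each pattern is matched with `re.search` against the full parameter name; that matching rule is an assumption made for illustration, not a confirmed detail of the tool. The parameter names and the `is_frozen` and `split_params` helpers are hypothetical too.

```python
import re
from typing import List, Tuple

# The four patterns from the frozen_res list in the config.
FROZEN_RES = [
    r"embed_tokens",
    r"lm_head",
    r"norm\.weight",
    r"^model\.layers\.[0-9]+\.mlp\..*",
]

# Hypothetical parameter names, chosen only to exercise each pattern.
PARAM_NAMES = [
    "model.embed_tokens.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.A_log",
    "model.layers.0.mlp.gate_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]


def is_frozen(name: str, patterns: List[str] = FROZEN_RES) -> bool:
    """Return True if any pattern matches somewhere in the name (assumed rule)."""
    return any(re.search(pattern, name) for pattern in patterns)


def split_params(names: List[str]) -> Tuple[List[str], List[str]]:
    """Split names into (frozen, trainable), preserving input order."""
    frozen = [n for n in names if is_frozen(n)]
    trainable = [n for n in names if not is_frozen(n)]
    return frozen, trainable


frozen, trainable = split_params(PARAM_NAMES)
print("frozen:   ", frozen)
print("trainable:", trainable)
```

Under that assumption, only the two attention parameters in the toy list stay trainable, which lines up with the description of a single stage that trains attention while the MLP and embedding parameters stay frozen.
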
I expected to need the full-parameter distillation as well to get any sort of good performance. But it seemed to be diminishing returns from this point, and fine tuning at 32k context did just about as well. So that's what I did: one-phase distillation, plus about a billion tokens at 32k sequence length, with a 1e-5 learning rate.

For comparison, I also ran this same pipeline with two other variations of the student model:

A) AFM-4.5B-KDA-NoPE (Hybrid, interleaved): Nine blocks of “3 KDA layers + 1 full-attn NoPE layer”.
B) AFM-4.5B-KDA-FLP (Front-loaded full attention): First four layers are full-attn NoPE, then the rest are...
