arcee-ai/DistillKit
Description: An Open Source Toolkit For LLM Distillation
Language: Python
License: Apache-2.0
Stars: 959
Forks: 126
Open issues: 4
Created: 2024-07-31T22:16:09Z
Pushed: 2026-05-12T20:10:14Z
Default branch: main
Fork: no
Archived: no
README:
DistillKit
A flexible and production-ready toolkit for knowledge distillation of large language models, supporting both online and offline distillation workflows with advanced logit compression.
DistillKit powers the training of many of Arcee's popular open-source models, including Virtuoso, SuperNova Medius, and Blitz.
Features
- Online Distillation: Real-time teacher inference during student training
- Offline Distillation: Train from pre-captured teacher outputs with advanced compression
- Advanced Logit Compression: Novel polynomial approximation + quantization + bit-packing achieving high compression ratios while preserving distillation quality
- Flexible Loss Functions: Composable losses including KL divergence, JSD, TVD, ranking losses, and hidden state alignment
- Sparse & Dense Support: Efficient sparse distributions (top-k) or exact dense distributions
- Battle-tested: The infrastructure powering Arcee's distilled model releases
- HuggingFace Integration: Built on Transformers, TRL, and Accelerate
Why DistillKit?
While online distillation is straightforward, offline distillation at scale requires careful engineering. Simply storing top-k token-logit pairs becomes prohibitively expensive when distilling on billions of tokens.
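To put a rough number on that cost, here is a back-of-envelope sketch. It assumes each top-k entry is stored naively as a 32-bit token ID plus a 32-bit float, with k = 128; these are illustrative assumptions, not a description of any DistillKit format, and the function names are hypothetical.

```python
def naive_topk_bytes_per_token(k: int, index_bytes: int = 4, value_bytes: int = 4) -> int:
    """Bytes needed to store k uncompressed (token_id, logit) pairs."""
    return k * (index_bytes + value_bytes)


def dataset_size_terabytes(num_tokens: int, bytes_per_token: int) -> float:
    """Total size in decimal terabytes (1 TB = 10**12 bytes)."""
    return num_tokens * bytes_per_token / 10**12


per_token = naive_topk_bytes_per_token(k=128)        # assumed int32 id + float32 logit
total_tb = dataset_size_terabytes(10**9, per_token)  # one billion tokens
print(f"{per_token} bytes/token, {total_tb:.3f} TB per billion tokens")
# prints "1024 bytes/token, 1.024 TB per billion tokens"
```

Under these assumptions every billion training tokens costs about a terabyte of teacher data before any compression.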
DistillKit's compression system is the result of months of experimentation to strike the delicate balance between storage costs, memory throughput, and distillation quality. Our approach:
1. Polynomial approximation of the logit distribution curve
2. Error-diffusion quantization of residuals to preserve quality
3. Bit-level packing with arbitrary bit widths (1-64 bits)
This enables practical offline distillation workflows that would otherwise be infeasible.
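Steps 2 and 3 can be illustrated with a minimal, standard-library-only sketch. It assumes a simple one-dimensional error diffusion (each rounding error is carried into the next value) and an LSB-first bit layout. The function names are hypothetical, and DistillKit's actual implementation may differ.

```python
def quantize_with_error_diffusion(values, step):
    """Round each value to a multiple of `step`, carrying the rounding error forward."""
    codes, carry = [], 0.0
    for v in values:
        target = v + carry            # add the error left over from earlier values
        code = round(target / step)   # integer bin index
        codes.append(code)
        carry = target - code * step  # error this value could not represent
    return codes


def pack_bits(codes, bits):
    """Pack unsigned integers (each < 2**bits) into bytes, least significant bits first."""
    acc = 0
    for i, c in enumerate(codes):
        if not 0 <= c < (1 << bits):
            raise ValueError(f"code {c} does not fit in {bits} bits")
        acc |= c << (i * bits)
    n_bytes = (len(codes) * bits + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def unpack_bits(data, bits, count):
    """Inverse of pack_bits: recover `count` integers of width `bits`."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


residuals = [0.26, 0.26, 0.26, 0.26]
codes = quantize_with_error_diffusion(residuals, step=0.5)
print(codes)  # [1, 0, 1, 0]

packed = pack_bits(codes, bits=3)  # four 3-bit codes fit in 2 bytes
assert unpack_bits(packed, bits=3, count=len(codes)) == codes
```

Plain rounding would map every residual here to bin 1, reconstructing a total of 2.0 against a true sum of 1.04. With the error carried forward, the reconstructed total is 1.0. Negative bin indices would need an offset before packing; the sketch leaves that out.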
Installation
git clone https://github.com/arcee-ai/distillkit.git
cd distillkit
pip install -e .
Optional: Logit Capture
To capture your own teacher outputs, install the capture dependencies:
pip install -e ".[capture]"
For most users, we recommend starting with the pre-captured teacher datasets we provide (see [Datasets](#datasets) below).
Quick Start
Offline Distillation
Train a student model using pre-captured teacher outputs:
# config.yaml
project_name: my-distillation
model: Qwen/Qwen3-8B
output_path: ./output
sequence_length: 8192

dataset:
  train_dataset:
    repo_id: arcee-ai/Qwen3-235B-Logits-Packed-8192  # Pre-captured teacher outputs
    split: train
  prepacked: true

teacher:
  kind: dataset
  logprob_compressor:
    d: 151936  # Vocabulary size
    delta_encoding: true
    error_diffusion: false
    exact_dtype: float32
    exact_k: 32
    k: 128
    polynomial_terms: [0, 1, 2]
    residual_bins: []
    term_dtype: float32

loss_functions:
  - function: cross_entropy
    weight: 0.5
  - function: kl
    weight: 0.5
    temperature: 1.0
    missing_probability_handling: zero
    sparse_chunk_length: 1024

training_args:
  num_train_epochs: 1
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 8
  learning_rate: 2.0e-6
  bf16: true
  optim: adamw_torch
  gradient_checkpointing: true
Run training:
distillkit config.yaml
Online Distillation
For online distillation where the teacher runs alongside student training, see [examples/afm_test.yml](examples/afm_test.yml) for a complete configuration example.
Core Concepts
Knowledge Distillation for LLMs
Knowledge distillation transfers knowledge from a (potentially larger) "teacher" model to a "student" model. Instead of training only on hard labels (the correct token), the student learns from the teacher's probability distribution over tokens, which is a much richer learning signal.
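As a minimal sketch of what "learning from the teacher's distribution" means for a single token position, the snippet below mixes a hard-label cross-entropy term with a KL term against the teacher. It assumes forward KL (teacher relative to student) and equal 0.5/0.5 weights, mirroring the Quick Start config. The function names are hypothetical, and this is plain Python for illustration, not DistillKit's loss code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Hard-label loss: negative log-probability of the correct token."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits, label,
                      ce_weight=0.5, kl_weight=0.5, temperature=1.0):
    """Weighted sum of the hard-label term and the soft-label (teacher) term."""
    ce = cross_entropy(student_logits, label)
    kl = kl_divergence(teacher_logits, student_logits, temperature)
    return ce_weight * ce + kl_weight * kl


teacher = [2.0, 1.0, 0.1]  # toy 3-token vocabulary
student = [1.0, 1.0, 1.0]  # untrained student: uniform distribution
loss = distillation_loss(teacher, student, label=0)
print(f"combined loss: {loss:.4f}")
```

The cross-entropy term only sees which token was correct, while the KL term also penalizes the student for disagreeing with how the teacher ranks the remaining tokens.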
Key benefits:
- Smaller, faster models with competitive performance
- Lower inference costs
- Easier deployment in resource-constrained environments
Online vs Offline Distillation
Online Distillation:
- Teacher runs in real-time during student training
- No storage overhead
- Best when: You have sufficient VRAM for both models and dense distributions
Offline Distillation:
- Teacher outputs pre-captured and compressed
- Enables training multiple students from the same teacher
- Best when: VRAM-limited, reusing teacher signals, or training at large scale
Rule of thumb: If you can fit both teacher and student with dense distributions into VRAM, use online distillation. Otherwise, offline distillation with our compression system is the way to go.
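For a sense of scale, the arithmetic below estimates the size of one dense teacher logit tensor using the sequence length and vocabulary size from the Quick Start config. It assumes float32 values materialized for the whole sequence at once; the function name is hypothetical, and real memory use also includes model weights, activations, and the student's own logits.

```python
def dense_logits_bytes(batch_size, sequence_length, vocab_size, bytes_per_value=4):
    """Bytes needed to hold a full (batch, sequence, vocab) logit tensor."""
    return batch_size * sequence_length * vocab_size * bytes_per_value


size = dense_logits_bytes(batch_size=1, sequence_length=8192, vocab_size=151936)
print(f"{size / 2**30:.2f} GiB")  # 4.64 GiB for a single sequence
```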
Sparse vs Dense Distributions
Dense distributions include probabilities for the full vocabulary. This is more accurate but memory-intensive.
Sparse distributions store only the top-k tokens and serve as a lossy, but useful and efficient, approximation of the full dense distribution. With sufficient training data, sparse distillation can match the performance of dense distillation.
DistillKit supports both, with automatic chunking for memory-efficient processing of long sequences.
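A toy illustration of the sparse case: keep the k most likely tokens and check how much probability mass survives. This is a standard-library sketch with hypothetical function names; it does not reflect how DistillKit stores or chunks distributions.

```python
import heapq
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_sparse(probs, k):
    """Return (token_ids, probs) for the k most likely tokens, most likely first."""
    top = heapq.nlargest(k, enumerate(probs), key=lambda pair: pair[1])
    ids = [i for i, _ in top]
    vals = [p for _, p in top]
    return ids, vals


# Toy 8-token vocabulary; a real vocabulary has on the order of 10**5 entries.
logits = [4.0, 0.5, 3.0, -1.0, 2.0, 0.0, -2.0, 1.0]
dense = softmax(logits)
ids, vals = top_k_sparse(dense, k=3)

print(ids)                                 # [0, 2, 4]
print(f"retained mass: {sum(vals):.3f}")   # the rest of the mass is dropped
```

The mass outside the top-k is what makes the representation lossy; how a loss function treats that missing mass is a separate design choice.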
Logit Compression
Our compression system balances storage efficiency with distillation quality:
1. Select top-k logits from teacher output
2. Sort by log-probability, optionally apply delta encoding
3. Fit polynomial to the distribution curve
4. Quantize residuals, with optional error diffusion
5. Bitpack everything into byte vectors
There are lots of knobs you can twiddle here to reach a storage/fidelity tradeoff that works for your particular needs.
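Steps 1 to 3 of the pipeline, plus the residuals that step 4 would quantize, can be sketched in plain Python. This sketch assumes the curve is fitted against rank normalized to [0, 1], reads integer entries of `polynomial_terms` as powers and `"sqrt"` as a square-root term, and skips delta encoding. All function names are hypothetical, and the real compressor may make different choices.

```python
import math


def basis_row(x, terms):
    """Evaluate each basis term at x: an int n means x**n, "sqrt" means sqrt(x)."""
    return [math.sqrt(x) if t == "sqrt" else x ** t for t in terms]


def solve_linear(a, b):
    """Solve a @ coeffs = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    coeffs = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - s) / m[r][r]
    return coeffs


def fit_curve(values, terms):
    """Least-squares fit of sorted values against basis terms of normalized rank."""
    n = len(values)  # requires n >= 2
    rows = [basis_row(i / (n - 1), terms) for i in range(n)]
    t = len(terms)
    # Normal equations: (R^T R) c = R^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(t)] for i in range(t)]
    aty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(t)]
    coeffs = solve_linear(ata, aty)
    fitted = [sum(c * b for c, b in zip(coeffs, r)) for r in rows]
    return coeffs, fitted


def compress_sketch(logprobs, k, terms):
    """Top-k select, sort descending, fit the curve, and return the residuals."""
    top = sorted(enumerate(logprobs), key=lambda p: p[1], reverse=True)[:k]
    ids = [i for i, _ in top]
    vals = [v for _, v in top]
    coeffs, fitted = fit_curve(vals, terms)
    residuals = [v - f for v, f in zip(vals, fitted)]
    return ids, coeffs, residuals


# Toy example: 12-token vocabulary, keep the top 8, fit a quadratic.
logprobs = [-0.5, -6.0, -1.2, -9.0, -2.0, -3.1, -12.0, -4.5, -2.6, -15.0, -5.2, -7.5]
ids, coeffs, residuals = compress_sketch(logprobs, k=8, terms=[0, 1, 2])
print(ids)  # [0, 2, 4, 8, 5, 7, 10, 1]
print(f"largest residual: {max(abs(r) for r in residuals):.3f}")
```

The better the fitted curve tracks the sorted log-probabilities, the smaller the residuals, and the fewer bits step 4 needs to spend on them.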
Recommended configuration (used at Arcee for new captures):
logprob_compressor:
  d: <vocabulary size>
  k: 128
  exact_k: 16
  exact_dtype: bfloat16
  polynomial_terms: [0, 1, 2, 3, 4, "sqrt"]
  term_dtype: float32
  residual_bins: []
  delta_encoding: false
  error_diffusion: false
This takes ~300 bytes/token (0.15% of uncompressed distribution size) with minimal quality loss.
If you're a little tight on storage, try the budget pick:
logprob_compressor:
  d: <vocabulary size>
  k: 50
  exact_k: 1
  exact_dtype: bfloat16
  polynomial_terms: [0, 1, "sqrt"]
  term_dtype: float32
  residual_bins: []
  delta_encoding: false
  error_diffusion: false
This weighs in at around 114 bytes per token, smaller…
Notability: 6.0/10. New repo with nearly 1k stars.