ByteDance-Seed/SwiftSpec
Description: This is a minimal artifact with state-of-the-art speculative decoding as described in the SwiftSpec paper [ASPLOS' 26].
Language: Python
License: Apache-2.0
Stars: 10
Forks: 0
Open issues: 0
Created: 2025-11-26T03:35:19Z
Pushed: 2026-03-24T07:18:48Z
Default branch: main
Fork: no
Archived: no
README:
SwiftSpec: Disaggregated Speculative Decoding and Fused Kernels for Low-Latency LLM Inference
Paper link: [ASPLOS'26]

This is a minimal artifact with state-of-the-art speculative decoding as described in the SwiftSpec paper (recently accepted in the ASPLOS 2026 Summer cycle!).
Highlighted results: an average of 369 tokens/second when serving the Llama3.3 70B int4-AWQ model on an NVIDIA 8xH800 GPU node!
Features
- Disaggregated tree generation: Support for both parallel tree generation (as in SwiftSpec) and serial tree generation (as in SpecExec).
- Latency-optimized kernels: A set of latency-optimized kernels that perform well at low batch sizes, especially for small models.
- Auto-pad for arbitrary Tensor Parallelism: Pads model weights to support any even tensor-parallel degree for supported models.
- Support for Qwen and Llama models: Supports models including Llama3/deepseek-coder/Qwen2/DeepSeek-R1-Distill-Qwen/DeepSeek-R1-Distill-Llama.
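The auto-padding idea in the feature list can be illustrated with a minimal sketch. This is not SwiftSpec's actual implementation: the helper names `padded_size` and `pad_and_shard_rows` are hypothetical, and a weight matrix is modeled as a plain list of rows to keep the example dependency-free. The sketch rounds a dimension up to the next multiple of the tensor-parallel degree, zero-pads it, and splits it evenly across ranks.

```python
from typing import List

Matrix = List[List[float]]


def padded_size(dim: int, tp_degree: int) -> int:
    """Round dim up to the nearest multiple of tp_degree."""
    if dim <= 0 or tp_degree <= 0:
        raise ValueError("dim and tp_degree must be positive")
    return ((dim + tp_degree - 1) // tp_degree) * tp_degree


def pad_and_shard_rows(weight: Matrix, tp_degree: int) -> List[Matrix]:
    """Zero-pad the row dimension, then split the rows evenly across ranks.

    Hypothetical helper for illustration only; a real implementation would
    operate on tensors and pad along whichever axis is being sharded.
    """
    rows, cols = len(weight), len(weight[0])
    total = padded_size(rows, tp_degree)
    # Copy the original rows and append all-zero rows up to the padded size.
    padded = [list(r) for r in weight] + [[0.0] * cols for _ in range(total - rows)]
    per_rank = total // tp_degree
    return [padded[i * per_rank:(i + 1) * per_rank] for i in range(tp_degree)]


if __name__ == "__main__":
    w = [[float(r)] * 2 for r in range(1, 9)]  # 8 rows x 2 columns
    shards = pad_and_shard_rows(w, tp_degree=6)
    print(padded_size(8, 6), [len(s) for s in shards])
```

Here 8 rows do not divide evenly by a tensor-parallel degree of 6, so the rows are padded to 12 and each rank receives 2 rows; the trailing ranks hold only zero padding.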
Performance
SwiftSpec Performance Examples
Serving Llama3.3-70B-Instruct INT4-AWQ on 8xH800:

📋 Table of Contents
- [Installation and Quick Start](#installation-and-quick-start)
- [Prerequisites](#prerequisites)
- [0. Install environment](#0-install-environment)
- [1. Modify the path in exp_configs.py](#1-modify-the-path-in-the-exp_configspy)
- [2. Download huggingface Models && Convert models into tensor parallel checkpoints](#2-download-huggingface-models--convert-models-into-tensor-parallel-checkpoints)
- [3. Run single request demo](#3-run-single-request-demo)
- [Model Support](#model-support)
- [Performance Results](#performance-results)
- [Inference Speed (Tokens/sec) on an 8xH800 GPU node](#inference-speed-tokenssec-on-an-8xh800-gpu-node)
- [Acknowledgement](#acknowledgement)
- [Citation](#citation)
Installation and Quick Start
Prerequisites
- Python 3.10
- CUDA 12.4
- H800 GPU
0. Install environment
    git submodule init
    git submodule update

    # install packages
    conda create -n awq python==3.10 -y
    conda activate awq
    pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
    pip install --upgrade pip
    cd swiftspec
    pip install --use-pep517 -e .
    pip install ninja
    pip install flash-attn==2.7.3 --no-build-isolation

    # install AWQ kernels
    cd awq/kernels
    python setup.py install
1. Modify the path in the [exp_configs.py](tinychat/utils/exp_configs.py)

    awq_prefix = "awq_cache/"  # You don't have to change this
    model_path_prefix = "/path/to/huggingface/model"  # change this to the path to the downloaded huggingface model
    ckpt_prefix = "/root/workspace/models/"  # change this to any path where you want to store the SwiftSpec ckpt
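Before running the preparation step, it can help to confirm that the configured paths are usable. The following is a minimal sketch, not part of SwiftSpec: `check_paths` is a hypothetical helper, and it only assumes that `model_path_prefix` should be an existing directory and that `ckpt_prefix` should be creatable. How the repository actually consumes these variables is not shown here.

```python
import os
import tempfile
from typing import List


def check_paths(model_path_prefix: str, ckpt_prefix: str) -> List[str]:
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    if not os.path.isdir(model_path_prefix):
        problems.append(f"model_path_prefix is not a directory: {model_path_prefix}")

    # The checkpoint directory may not exist yet, so walk up to the nearest
    # existing ancestor and check that it is writable.
    ancestor = os.path.abspath(ckpt_prefix)
    while not os.path.exists(ancestor):
        parent = os.path.dirname(ancestor)
        if parent == ancestor:  # reached the filesystem root
            break
        ancestor = parent
    if not os.access(ancestor, os.W_OK):
        problems.append(f"cannot write under: {ancestor}")
    return problems


if __name__ == "__main__":
    # Demonstration with a temporary directory standing in for real paths.
    with tempfile.TemporaryDirectory() as tmp:
        print(check_paths(tmp, os.path.join(tmp, "swiftspec_ckpt")))
        print(check_paths(os.path.join(tmp, "missing"), tmp))
```

In practice the two arguments would be the `model_path_prefix` and `ckpt_prefix` values set in the config file.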
2. Download huggingface Models && Convert models into tensor parallel checkpoints
    cd scripts
    python prepare_models.py llama3.3
3. Run single request demo
    cd scripts

    # launch demo on webpage
    # If you are launching the demo on an SSH-connected GPU server, consider using tools like proxychains
    # to forward the port to your local computer
    # (e.g. proxychains4 -f proxychains.conf ssh -L 7860:0.0.0.0:7860 [ssh_name])
    python web-demo.py llama3.3

    # run on all queries.
    # to get the data, copy the https://github.com/SafeAILab/EAGLE/tree/main/eagle/data folder into this repo (in the repo root directory)
    python bench_exp.py
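When the demo runs on a remote server, a quick way to confirm that the SSH port forward works is to test whether something is listening on the forwarded local port. The sketch below is an illustration only: `port_is_open` is a hypothetical helper, and the default port 7860 is taken from the `ssh -L 7860:0.0.0.0:7860` example above, on the assumption that the web demo listens there.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on 127.0.0.1:7860")
    else:
        print("Nothing is listening on 127.0.0.1:7860; check the SSH port forward")
```

A successful connection only shows that the port is reachable, not that the demo itself has finished loading the model.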
Model Support
Supported target/draft models
| Model Family | Sizes | Example Script |
|-------------|-------|----------------|
| Llama3/Llama3.3 | 1B/3B/8B/70B | [python prepare_models.py llama3.3](scripts/prepare_models.py) && [python web-demo.py llama3.3](scripts/web-demo.py) |
| deepseek-coder | 1.3b/6.7b/33b | [python prepare_models.py deepseek](scripts/prepare_models.py) && [python web-demo.py deepseek](scripts/web-demo.py) |
| Qwen2 | 0.5B/1.5B/7B/72B | [python prepare_models.py qwen](scripts/prepare_models.py) && [python web-demo.py qwen](scripts/web-demo.py) |
| DeepSeek-R1-Distill-Qwen | 1.5B/7B/32B | [python prepare_models.py r1qwen](scripts/prepare_models.py) && [python web-demo.py r1qwen](scripts/web-demo.py) |
| DeepSeek-R1-Distill-Llama | 8B/70B | [python prepare_models.py r1llama](scripts/prepare_models.py) && [python web-demo.py r1llama](scripts/web-demo.py) |
Performance Results
Inference Speed (Tokens/sec) on an 8xH800 GPU node
| Model | Draft model | Precision | Depth | Target TP | Tokens per second |
|-------|-----------|------|----------|-------------|--------|
| Llama-3.3-70b-Instruct | Llama-3.2-3B | bf16 | 6 | 4 | 369 |
| Llama-3-70b-Instruct | Llama-3.2-3B | bf16 | 6 | 4 | 347 |
| deepseek-coder-33b-instruct | deepseek-coder-1.3b-instruct | bf16 | 5 | 6 | 472 |
| Qwen2-72B-Instruct | Qwen2-1.5B-Instruct | bf16 | 5 | 6 | 274 |
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1-Distill-Qwen-1.5B | bf16 | 5 | 4 | 317 |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek-R1-Distill-Llama-8B | bf16 | 5 | 4 | 268 |

Performance is measured across 6 datasets (the same way the EAGLE series is evaluated).
Acknowledgement
Thanks to:
- llm-awq project, which a large part of our single model inference code relies on
- EAGLE project, from which we adapted the verification of the speculative decoding methods
Citation
If you find SwiftSpec useful in your research, please cite our paper:
@misc{zhang2025swiftspecultralowlatencyllm,
title={SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding},
author={Ziyi Zhang and Ziheng Jiang and Chengquan Jiang and Menghan Yu and Size Zheng and Haibin Lin and Henry Hoffmann and Xin Liu},
year={2025},
eprint={2506.11309},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2506.11309},
}