openbmb/MiniCPM-SALA
GitHub Repo | Technical Report | Join Us
👋 Contact us on Discord and WeChat
> [!NOTE]
> ### 🏆 2026 Sparse Operator Acceleration & Race (SOAR) is Now Live!
>
> "The MiniCPM-SALA architecture is just the beginning. Realizing its full potential requires deep system-level synergy and cross-layer compilation optimization."
>
> In collaboration with SGLang and NVIDIA, OpenBMB invites global geeks to push the boundaries of 9B-scale, 1M-token inference on NVIDIA 6000D.
>
> 💰 Prize Pool: >$100,000 USD (🥇 Top Prize: $89,000) | 🚀 Challenge: Single & Multi-batch Optimization
>
> 👉 [Click Here to Join the Race @ soar.openbmb.cn](https://soar.openbmb.cn/)
What's New
- [2026.02.11] MiniCPM-SALA is released! This is the first large-scale hybrid model to effectively integrate sparse and linear attention for million-token context modeling. See the technical report for details. 🔥🔥🔥
Highlights
MiniCPM-SALA (Sparse Attention and Linear Attention) is the first large-scale hybrid model to effectively integrate sparse and linear attention for million-token context modeling.
✅ Innovative Hybrid Architecture: Synergizes 25% Sparse Attention (InfLLM-V2) for high-fidelity long context modeling with 75% Linear Attention (Lightning Attention) for global efficiency.
✅ Shattering Efficiency Walls: Breaks the "Compute Wall" and the "Memory Wall," achieving up to 3.5× the inference speed of a dense baseline (Qwen3-8B at 256K tokens) with significantly lower KV-cache overhead.
✅ Million-Token Context: Empowered by HyPE (Hybrid Positional Embedding), it scales to 1M+ tokens while maintaining strong length generalization.
✅ HALO Adaptation: Utilizes Hybrid Attention via Layer Optimization (HALO), a novel distillation recipe that effectively transfers dense attention capabilities to the hybrid architecture, avoiding the severe performance degradation typical of pure linear models.
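The efficiency claim for the linear-attention layers comes from replacing a growing key/value cache with a fixed-size state. The following is a minimal sketch of one decoding step of plain, unnormalized linear attention in pure Python. It is a generic illustration, not the Lightning Attention kernel: decay terms, normalization, and multi-head batching are all omitted, and the function name is hypothetical.

```python
def linear_attention_step(state, q, k, v):
    """One decoding step of plain (unnormalized) linear attention.

    state is a d_k x d_v matrix (list of lists) that accumulates the outer
    products k v^T of every token seen so far. It is updated in place and
    never grows, no matter how many tokens have been processed.
    Returns the output vector o = q^T state (length d_v).
    """
    # Fold the new token into the running state: state += k v^T
    for i, k_i in enumerate(k):
        row = state[i]
        for j, v_j in enumerate(v):
            row[j] += k_i * v_j
    # Read out with the query: o_j = sum_i q_i * state[i][j]
    return [
        sum(q[i] * state[i][j] for i in range(len(q)))
        for j in range(len(v))
    ]


# Toy example with d_k = d_v = 2 (values chosen only for illustration).
state = [[0.0, 0.0], [0.0, 0.0]]
outputs = [
    linear_attention_step(state, q=[1, 0], k=[1, 0], v=[2, 3]),
    linear_attention_step(state, q=[0, 1], k=[0, 1], v=[5, 7]),
    linear_attention_step(state, q=[1, 1], k=[1, 0], v=[1, 1]),
]
print(outputs)
```

Each output equals the causal sum of `(q_t · k_s) * v_s` over all earlier tokens, but the memory held between steps is only the `d_k x d_v` state rather than one key and one value per token.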
Introduction
MiniCPM-SALA is an efficient hybrid model in which 25% of the layers adopt InfLLM-V2 and the remaining 75% utilize Lightning Attention. This architecture enables inference of one million tokens on consumer GPUs such as the NVIDIA RTX 5090.
- SALA Hybrid Attention Mechanism
  - Integrates 25% InfLLM-V2 and 75% Lightning Attention, combining the fine-grained focus of sparse attention on local details with the efficiency of linear attention over broad context.
- Transformer-to-Hybrid Continued Training
  - Avoids the inefficiency of cold-start training by architecturally transforming pre-trained weights, reducing the total training budget to approximately 25% of that needed to train a comparable model from scratch.
- [HyPE](https://arxiv.org/abs/2601.22156) (Hybrid Positional Encoding)
  - Balances performance across short and long contexts: general capabilities (e.g., knowledge, mathematics, and coding) stay comparable to modern full-attention models such as Qwen3-8B, while multiple long-context benchmarks show substantial gains.
- Efficient Inference on Long Sequences
  - Achieves up to 3.5x the inference speed of Qwen3-8B at a sequence length of 256K tokens on an NVIDIA A6000D. Supports inference at context lengths of up to 1M tokens on both NVIDIA A6000D and 5090 GPUs, whereas Qwen3-8B fails at this length with out-of-memory (OOM) errors.
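The out-of-memory behaviour of a dense model at 1M tokens can be made concrete with back-of-the-envelope KV-cache arithmetic. The sketch below uses assumed, illustrative dimensions for an 8B-class dense model with grouped-query attention (36 layers, 8 KV heads, head dimension 128, 16-bit cache); check the model's `config.json` for real values. The "hybrid" line is hypothetical: it reuses the same dimensions and only assumes that 25% of the layers keep a growing KV cache. It is not MiniCPM-SALA's actual configuration, and it ignores weights, activations, and the fixed-size state of the linear-attention layers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both K and V per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 2**30
SEQ_LEN = 2**20  # 1M tokens

# Assumed dense configuration (illustrative values, see lead-in).
dense = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

# Hypothetical hybrid: only 9 of the 36 layers (25%) keep a per-token KV cache.
hybrid = kv_cache_bytes(num_layers=9, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"dense:  {dense / GIB:.0f} GiB")
print(f"hybrid: {hybrid / GIB:.0f} GiB")
```

Under these assumptions the dense cache alone needs 144 GiB for a single 1M-token sequence, while caching only a quarter of the layers cuts that to 36 GiB before any further savings from sparse attention or cache compression.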
Inference
To achieve optimal performance, we recommend using Temperature=0.9.
HuggingFace
Our model is readily compatible with 🤗 Hugging Face transformers. You can perform inference with our model as follows:
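```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "openbmb/MiniCPM-SALA"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map="auto")
model.eval()

# The prompts differ in length, so the batch must be padded.
# Decoder-only models should be padded on the left for generation.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["My name is", "The capital of China is"]
with torch.no_grad():
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs)
output_texts = tokenizer.batch_decode(outputs)
print(output_texts)
```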
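The recommended temperature only takes effect when sampling is enabled. The sketch below shows one way to pass it to `generate`; the helper names `sampling_kwargs` and `sample_completion` and the `max_new_tokens=64` default are illustrative choices, not part of the model's documentation.

```python
def sampling_kwargs(temperature=0.9, max_new_tokens=64):
    """Keyword arguments for `model.generate` that enable temperature sampling."""
    # do_sample=True turns on sampling; unless the model's own generation
    # config already enables it, generate() decodes greedily and the
    # temperature has no effect.
    return {
        "do_sample": True,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,  # illustrative default
    }


def sample_completion(prompt, model_path="openbmb/MiniCPM-SALA"):
    """Load the model and sample one completion for a single prompt."""
    # Imported lazily so sampling_kwargs() can be used without loading torch.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True, device_map="auto"
    )
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, **sampling_kwargs())
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling `sample_completion("The capital of China is")` downloads the weights on first use, so it needs the same hardware as the example above.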
SGLang
Requirements
- CUDA 12.x or higher
- gcc / g++ compiler
- uv package manager (script will check)
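Before running the installer, it can help to confirm that the required tools are on `PATH`. This is a small preflight sketch, separate from the checks the install script performs itself. It assumes that finding `nvcc` is a reasonable proxy for an installed CUDA toolkit, and it does not verify the CUDA version.

```python
import shutil

# Tool names taken from the requirements list; nvcc stands in for the CUDA toolkit.
REQUIRED_TOOLS = ("nvcc", "gcc", "g++", "uv")


def check_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in check_tools().items():
        status = f"found at {path}" if path else "MISSING"
        print(f"{tool:5s} {status}")
```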
Installation
```bash
# Clone repository
git clone -b minicpm_sala https://github.com/OpenBMB/sglang.git
cd sglang

# One-click installation (creates venv and compiles all dependencies)
bash install_minicpm_sala.sh

# Or specify PyPI mirror
bash install_minicpm_sala.sh https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
The installation script performs the following steps:
1. Creates sglang_minicpm_sala_env virtual environment (Python 3.12)
2. Clones dependencies to 3rdparty/ (infllmv2) and initializes submodules (sparse_kernel)
3. Installs MiniCPM-SALA (current repo)
4. Compiles and installs infllmv2_cuda_impl
5. Compiles and installs sparse_kernel
6. Installs tilelang & flash-linear-attention
Usage
```bash
# Activate environment
source sglang_minicpm_sala_env/bin/activate

# Launch Inference Server (Replace MODEL_PATH with actual path)
MODEL_PATH=/path/to/your/MiniCPM-SALA
python3 -m sglang.launch_server \
    --model ${MODEL_PATH} \
    --trust-remote-code \
    --disable-radix-cache \
    --attention-backend minicpm_flashinfer \
    --chunked-prefill-size 8192 \
    --max-running-requests 32 \
    --skip-server-warmup \
    --port 31111 \
    --dense-as-sparse
```

| Parameter | Description |
|-----------|-------------|
| --trust-remote-code | Allow custom code in model |
| --disable-radix-cache | Disable RadixAttention prefix cache |
| --attention-backend minicpm_flashinfer | Use MiniCPM FlashInfer backend |
| --chunked-prefill-size 8192 | Chunked prefill size |
| --max-running-requests 32 | Max concurrent requests |
| --skip-server-warmup | Skip server warmup |
| --port 31111 | Server port |
| --dense-as-sparse | Use dense-as-sparse mode |
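Once the server is up, it can be queried over HTTP. The sketch below is a minimal standard-library client, assuming the fork keeps upstream SGLang's native `/generate` endpoint and its `sampling_params` request field. The helper names, the `max_new_tokens=128` default, and the timeout are illustrative; the port matches the launch command and the temperature follows the recommendation above.

```python
import json
import urllib.request


def build_generate_request(prompt, temperature=0.9, max_new_tokens=128):
    """Build the JSON payload for a single-prompt /generate request."""
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,  # illustrative default
        },
    }


def generate(prompt, base_url="http://localhost:31111", timeout=600, **kwargs):
    """POST the prompt to the running server and return the generated text."""
    body = json.dumps(build_generate_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Long prompts can take a while to prefill, hence the generous timeout.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["text"]
```

With the server running locally, `generate("The capital of China is")` returns the completion as a string.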
Manual Installation
If the script doesn't work for you, follow these steps:
```bash
# 0. Ensure uv is installed
pip install uv

# 1. Create venv
uv venv --python 3.12 sglang_minicpm_sala_env
source sglang_minicpm_sala_env/bin/activate

# 2. Install SGLang
uv pip install --upgrade pip setuptools wheel
uv pip install -e ./python[all]

# 3. Compile CUDA Extensions
# (Ensure dependencies are cloned to 3rdparty/)
cd…
```