AdapTive-LeArning Speculator System (ATLAS): A New Paradigm in LLM Inference via Runtime-Learning Accelerators
Research
Published 10/10/2025
ATLAS delivers up to 4x faster LLM inference, powered by Together Turbo’s latest research.
Authors
Junxiong Wang, Shirley Wu, Zelei Shao, Vikranth Srivatsa, Jue Wang, Roy Yuan, Qingyang Wu, Alpay Ariyak, Rupert Wu, Wai Tong Chung, Chenfeng Xu, Yonatan Oren, Pragaash Ponnusamy, Yineng Zhang, Avner May, Leon Song, Tri Dao, Percy Liang, Ce Zhang, Ben Athiwaratkun
At Together AI, the AI Native Cloud, we’re obsessed with performance. Making large language models faster, cheaper, and more efficient is not a one-trick problem; it requires optimizing along multiple axes. That is the philosophy behind Together Turbo, our suite of inference innovations that draw on research in algorithms, architectures, and modeling recipes. We’re excited to introduce the AdapTive-LeArning Speculator System (ATLAS), the first speculator of its kind that gives automatic performance improvements without any manual tuning. ATLAS offers a new way of doing speculative decoding, one that dynamically improves at runtime, and it fits seamlessly alongside our other Turbo techniques like the proprietary Together Turbo Speculator or Custom Speculators.

But why create an adaptive-learning speculator system? Standard speculators are trained for general workloads. Custom speculators are trained on your specific data, but only for a specific snapshot in time. As the workload evolves (the codebase grows, traffic patterns shift, request distributions change), even highly customized speculators can fall behind. In contrast, ATLAS evolves automatically with usage, learning from both historical patterns and live traffic to continuously align with the target model’s behavior in real time. This means the more you use our inference service, the better ATLAS will perform!

Built on top of Together Turbo Speculator, ATLAS reaches up to 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully adapted scenario, 2.65x faster than standard decoding, outperforming even specialized hardware like Groq (Figure 1).
Figure 1: Decoding speed on NVIDIA HGX B200 with our Turbo speculator and the adaptive-learning speculator system for DeepSeek-V3.1 (top) and Kimi-K2-0905 (bottom) with Arena Hard traffic.¹

1. Speculative Decoding

Speculative decoding is one of the most powerful levers for accelerating inference.² Instead of having the target model generate every token step by step, a faster speculator (also known as the draft model) proposes multiple tokens ahead, and the target model verifies them in parallel in a single forward pass. The verification process ensures that the output follows the same distribution as non-speculative decoding, while the speedup comes from accepting many tokens at a time.

The overall speed is influenced by the acceptance rate $α$ (i.e., how often the target model agrees with the tokens drafted by the speculator) and the relative latency $c$ of the draft model versus the target. Typically, larger speculators with more parameters yield higher acceptance rates due to their higher capacity, but they are slower to generate draft tokens. Progress therefore comes from both sides: aligning draft and target models to increase $α$ (training objectives, data, and algorithms) and designing draft models and kernels that keep $c$ low while maintaining $α$ (sparsity, quantization, lightweight and kernel-efficient architectures). The sweet spot is where a high $α$ meets a low $c$, minimizing end-to-end latency.
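To make the trade-off between $α$ and $c$ concrete, here is a minimal sketch based on the standard textbook analysis of speculative decoding, not on anything specific to ATLAS. It assumes each drafted token is accepted independently with probability $α$, and it introduces a lookahead parameter `gamma` (the number of drafted tokens per step) that is our own notation. Under those assumptions, one step emits $(1 - α^{γ+1}) / (1 - α)$ tokens in expectation and costs $γc + 1$ target-pass units.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes (simplification) that each drafted token is accepted
    independently with probability alpha, and that gamma tokens are drafted.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one token from the target itself.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup over plain decoding.

    One speculative step costs gamma draft passes (each c target-pass
    units) plus one target verification pass.
    """
    if c < 0.0:
        raise ValueError("c must be non-negative")
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1.0)


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the article.
    for alpha in (0.6, 0.8, 0.9):
        for c in (0.05, 0.2):
            s = expected_speedup(alpha, gamma=4, c=c)
            print(f"alpha={alpha:.1f} c={c:.2f} speedup={s:.2f}x")
```

Under this model, raising $α$ or lowering $c$ both increase the speedup, and a speculator with a low acceptance rate can make decoding slower than not speculating at all.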
At Together AI, the Turbo team has developed high-performance speculators that have achieved the world’s fastest decoding speeds on NVIDIA Blackwell by drawing on advances across architecture, sparsity, algorithms, post-training recipes, and data [1-9]. We’ve built a speculator design and selection framework that determines the optimal speculator architecture (width/depth, lookahead, sparsity/quantization, KV reuse) and a scalable training system that brings up speculators for the largest and most challenging open-source targets quickly and reproducibly (e.g., DeepSeek-V3.1 and Kimi-K2). For instance, while Kimi ships without a ready-to-use speculator, we can train and deploy one rapidly and take Kimi from ~150 TPS out of the box to 270+ TPS on the same hardware and batch settings, while preserving target-model quality (see Figure 1, yellow bars). This pipeline powers Turbo Speculators that deliver state-of-the-art decoding latency, and it sets the stage for what comes next: an Adaptive-Learning Speculator System that adjusts token drafting to the workload in real time.

2. Introducing Turbo’s Adaptive-Learning Speculator System

At Together AI, we power a broad range of inference workloads. But today’s speculative decoding methods are constrained to using a static speculator, trained on a fixed dataset. Once deployed, the speculator cannot adapt, so its performance degrades if the input distribution evolves. This problem is particularly pronounced in serverless, multi-tenant environments, where input diversity is sky-high. New users continuously arrive and bring with them unique workloads that the fixed speculator may not have seen during training. Furthermore, these speculators typically use a fixed lookahead, predicting the same number of tokens regardless of the speculator’s confidence. Put simply, a static speculator cannot keep up.
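As a rough illustration of the fixed-lookahead limitation, the sketch below shows one way a drafting loop could stop early when the speculator’s confidence drops, instead of always drafting the same number of tokens. This is a hypothetical toy, not ATLAS’s controller: the `DraftStep` interface, the `min_confidence` threshold, and the `max_lookahead` cap are all assumptions introduced for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical draft-model interface (an assumption for this sketch):
# given the tokens so far, return (next_token, confidence in [0, 1]).
DraftStep = Callable[[List[int]], Tuple[int, float]]


def draft_with_adaptive_lookahead(
    prefix: List[int],
    draft_step: DraftStep,
    max_lookahead: int = 8,
    min_confidence: float = 0.5,
) -> List[int]:
    """Draft up to max_lookahead tokens, stopping at the first low-confidence one.

    A fixed-lookahead speculator would always return max_lookahead tokens;
    here the draft length shrinks when the speculator is unsure.
    """
    drafted: List[int] = []
    context = list(prefix)  # copy so the caller's prefix is not mutated
    for _ in range(max_lookahead):
        token, confidence = draft_step(context)
        if confidence < min_confidence:
            break  # do not spend verification work on a likely rejection
        drafted.append(token)
        context.append(token)
    return drafted


def toy_draft_step(context: List[int]) -> Tuple[int, float]:
    """Toy speculator whose confidence decays as the context grows."""
    return len(context), 1.0 - 0.2 * len(context)


if __name__ == "__main__":
    print(draft_with_adaptive_lookahead([0], toy_draft_step))
```

The drafted tokens would still be verified by the target model as usual; the threshold only decides how many tokens are worth proposing.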
Figure 3: Two speculators, one static and one adaptive, work with a confidence-aware controller that selects between them and adjusts lookahead for optimal accuracy and speed.

To address these limitations, we designed the Adaptive-Learning Speculative…