Together AI · Published Jul 17, 2025

Together AI Delivers Top Speeds for DeepSeek-R1-0528 Inference on NVIDIA Blackwell



Inference

Published 7/17/2025

Authors

Together AI

Earlier this year, we invited select customers, including Zoom, Salesforce, and InVideo, to test-drive NVIDIA Blackwell GPUs on Together GPU Clusters. Now, we’re excited to share that Together AI is rolling out NVIDIA Blackwell support for Together Inference, unlocking the next level of performance for real-world AI applications.

The verdict is clear: Together AI inference is now among the world’s fastest, most capable platforms for running open-source reasoning models like DeepSeek-R1 at scale, thanks to our new inference engine designed for NVIDIA HGX B200.

In this blog post, we share results from an early-access production deployment of DeepSeek-R1-0528 on NVIDIA HGX B200. As of July 17, 2025, this is, to our knowledge, the fastest serverless inference performance for DeepSeek-R1. You can experience this speed today via Together Chat.

The serverless NVIDIA HGX B200 endpoint is currently in closed beta, serving production workloads from our early customers. Reach out to our Sales team to be among the first to get access as we expand.

Together AI optimizes every layer of the stack: (1) bespoke GPU kernels, (2) a purpose-built proprietary inference engine, (3) state-of-the-art speculative decoding methods, and (4) calibrated and quantized model optimization. Together, these boost LLM speed and efficiency to new heights without compromising quality.

Figure 1: Source: Artificial Analysis, 07/16/2025. See their website https://artificialanalysis.ai/models/deepseek-r1/providers

Figure 2: Source: Artificial Analysis. On 07/16/2025, Together AI demonstrated industry-leading throughput and latency on DeepSeek-R1-0528. See their website https://artificialanalysis.ai/models/deepseek-r1/providers

Together AI offers a flexible set of infrastructure options for running both inference and training of frontier AI workloads. Whether you're scaling up experiments or deploying production systems, you can choose the level of control and performance that fits your needs.

Get in touch to build with Together AI cloud services accelerated by NVIDIA Blackwell GPUs.

Together’s State-of-the-Art Inference on NVIDIA HGX B200

We now analyze how Together’s inference stack for R1-0528 performs relative to a leading open-source inference engine on NVIDIA HGX B200 GPUs and on the previous-generation NVIDIA H200 GPUs. As the figure below shows, Together’s inference stack with our in-house Turbo speculator attains a maximum decoding speed of ~334 tokens/sec. That is ~32 tokens/sec faster than the ~302 tokens/sec maximum achieved without Together’s inference stack, and a 2.3x to 2.8x speedup over H200 speeds. Crucially, these performance gains are achieved without sacrificing model quality (see Appendix for comparison).
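To make the reported decoding speeds more concrete, the following minimal sketch converts tokens/sec into per-token latency and into the decode time of a long reasoning trace. The two throughput figures come from the text above; the 2,000-token trace length and the helper names are illustrative assumptions, and prefill time (time to first token) is ignored.

```python
def per_token_latency_ms(tokens_per_sec: float) -> float:
    """Average time between consecutive decoded tokens, in milliseconds."""
    return 1000.0 / tokens_per_sec


def generation_time_s(num_tokens: int, tokens_per_sec: float) -> float:
    """Decode-only time for num_tokens tokens (prefill is ignored)."""
    return num_tokens / tokens_per_sec


# Maximum decoding speeds reported in the post (tokens/sec on HGX B200).
WITH_STACK_TPS = 334.0     # Together's inference stack with the Turbo speculator
WITHOUT_STACK_TPS = 302.0  # without Together's inference stack

# Hypothetical length of one reasoning trace; not a figure from the post.
TRACE_TOKENS = 2000

abs_gain = WITH_STACK_TPS - WITHOUT_STACK_TPS
rel_gain = WITH_STACK_TPS / WITHOUT_STACK_TPS - 1.0

print(f"absolute gain: {abs_gain:.0f} tokens/sec")
print(f"relative gain: {rel_gain:.1%}")
for label, tps in (("with stack", WITH_STACK_TPS), ("without stack", WITHOUT_STACK_TPS)):
    print(
        f"{label}: {per_token_latency_ms(tps):.2f} ms/token, "
        f"{generation_time_s(TRACE_TOKENS, tps):.2f} s for {TRACE_TOKENS} tokens"
    )
```

Reasoning models emit long traces before their final answer, so a difference of a fraction of a millisecond per token adds up over thousands of decoded tokens.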

Figure 3: We compared performance between NVIDIA HGX B200 with and without Together’s Inference Engine and HGX H200 without Together’s Inference Engine. The relaxed-speculator mode was disabled to avoid the quality regressions documented in the Appendix. See footnote [1] for benchmarking details.

Customize a Dedicated Endpoint for your Needs

For teams with high-performance production requirements, Dedicated Endpoints (DEs) unlock an additional layer of optimization beyond our default serverless configurations. With DEs, we can fine-tune the deployment environment, delivering a speedup of up to ~84 tokens/sec over deployments without Together’s Inference Engine: 302→386 tokens/sec at batch size 1 (bs=1), 198→227 at bs=8, and 107→133 at bs=32. These improvements maintain our strict quality–performance standards while giving customers more control over the latency–accuracy trade-off. DEs are ideal for production workloads where every millisecond counts and that benefit from infrastructure tailored to their specific needs. (See Appendix for details.)
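The batch-size figures above can be tabulated to show where the "up to ~84 tokens/sec" figure comes from and how the relative gain varies with batch size. This is a small sketch using only the numbers quoted in the text; the dictionary layout and function name are illustrative.

```python
# (tokens/sec without Together's Inference Engine, tokens/sec on the tuned
# Dedicated Endpoint), keyed by batch size, as quoted in the text.
REPORTED = {
    1: (302, 386),
    8: (198, 227),
    32: (107, 133),
}


def speedup_rows(reported):
    """Return one row per batch size with absolute and relative gains."""
    rows = []
    for batch_size, (baseline, tuned) in sorted(reported.items()):
        rows.append(
            {
                "bs": batch_size,
                "abs_gain": tuned - baseline,  # tokens/sec
                "rel_gain": tuned / baseline,  # multiplicative factor
            }
        )
    return rows


rows = speedup_rows(REPORTED)
for row in rows:
    print(f"bs={row['bs']:>2}: +{row['abs_gain']} tokens/sec ({row['rel_gain']:.2f}x)")

largest = max(rows, key=lambda r: r["abs_gain"])
print(f"largest absolute gain: +{largest['abs_gain']} tokens/sec at bs={largest['bs']}")
```

The largest absolute gain occurs at bs=1, which is the configuration the "up to ~84 tokens/sec" claim refers to.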

Figure 4: We can fully customize Dedicated Endpoints to optimize for specific workloads by balancing speed, quality, and cost. We optimized this deployment for speed (386 tokens/sec at bs=1) with a minor trade-off in quality. (See the Appendix for more details.)

An Inference Optimization Primer

Below, we break down the key inference optimizations that power Together’s industry-leading performance on NVIDIA Blackwell.

Together Inference Engine

An inference engine is a crucial software or hardware component responsible for executing trained AI models to make predictions on new data. At Together, our inference engine achieves state-of-the-art performance by integrating Together AI's latest advances, including FlashAttention-3, faster custom GEMM & MHA kernels, quality-preserving quantization, and speculative decoding. To reduce computational overhead, we unify this entire inference workflow, from prefill to speculative token verification, into single, dynamically captured NVIDIA CUDA graphs, with efficient compute-communication overlap and parallel stream orchestration to further boost efficiency.

Together Kernels

GPU kernels are custom-developed software programs that run on GPUs, performing critical AI computations such as attention mechanisms and matrix multiplications. By developing optimized kernels, Together AI can unlock faster inference speeds, reducing costs and improving efficiency. With NVIDIA B200 GPUs, Together AI has developed new kernels that take advantage of 5th-generation NVIDIA Tensor Cores and on-chip Tensor Memory. Using the ThunderKittens framework, the Together Kernels Lab built and open-sourced Blackwell GPU kernels matching NVIDIA’s performance within two weeks of getting hardware access. This continues our work on developing kernels that push the state-of-the-art in…
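For readers unfamiliar with the speculative decoding and speculative token verification mentioned in the inference engine description, here is a minimal sketch of the general idea in its simplest greedy form. It is not Together's Turbo speculator or engine. The two toy "models" (`target_next`, `draft_next`) and the function `speculative_generate` are hypothetical stand-ins invented for illustration. A cheap draft model proposes `k` tokens, the expensive target model checks them, and the longest agreeing prefix is kept along with one token from the target itself.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def target_next(ctx: Sequence[int]) -> int:
    """Hypothetical toy 'large' model: deterministic next token from context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: Sequence[int]) -> int:
    """Hypothetical toy speculator: agrees with the target except when the
    context length is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_generate(prompt: Sequence[int], target: NextToken, num_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_generate(
    prompt: Sequence[int], target: NextToken, draft: NextToken, num_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    passes = 0
    while len(tokens) < goal:
        # 1) The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2) The target verifies the proposal. A real engine scores all k
        #    positions in one batched forward pass; this loop emulates that.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # a match, or the target's correction
            if tok != expected:
                break
        else:
            # Every draft token matched: the same pass yields one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], passes


prompt = [1, 2, 3]
baseline = greedy_generate(prompt, target_next, 20)
spec_tokens, spec_passes = speculative_generate(prompt, target_next, draft_next, 20)
print("identical output:", spec_tokens == baseline)
print("target passes:", spec_passes, "vs 20 for plain greedy decoding")
```

Because each kept token is the target model's own greedy choice, the output matches plain greedy decoding while needing fewer sequential target passes. Production systems that sample rather than decode greedily typically use a probabilistic acceptance rule instead of exact matching.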
