Optimizing inference speed and costs: Lessons learned from large-scale deployments
Published 1/22/2026
Authors
David Nugent, Ingrid Xu
How can teams reduce inference latency without massive costs?
Achieving faster inference doesn't always mean paying more for a bigger cluster. At Together AI, we’ve seen teams that consistently deliver both low latency and low cost share these key habits:
- They maximize the usable work extracted from every GPU
- They actively eliminate invisible compute stalls
- They strategically select decoding techniques based on their specific traffic patterns
- They view performance tuning as an ongoing discipline, not a one-time configuration task
When you excel in these areas, your cluster can provide faster responses while simultaneously reducing the cost per token.

Why inference cost efficiency matters
AI products are getting more competitive by the week, and user expectations are rising just as fast. For leading AI-native companies (like Cursor, which needs massive throughput without compromising speed, and Decagon, which needs real-time responses despite unpredictable traffic patterns), the pressure is the same everywhere:
- Be fast. Sub-500 ms time to first token (TTFT) and fast decoding speed
- Be predictable. No surprise tail latencies
- Be affordable. GPU bills can’t scale linearly with traffic
- Be ready for spikes. Traffic never behaves the way you expect
Across customers, we consistently see the same imperative: deliver sub-second responses without doubling the GPU bill. The good news? You don’t need exotic architectures or hundreds of extra GPUs to maintain inference cost efficiency. Most teams get meaningful wins by optimizing how their inference runs, not purely how much hardware they buy.

How inference optimization works
Here are the levers that reliably move both speed and cost in the right direction.

1. Start at the model level: quantization and distillation

Quantization
Dropping precision (FP16 → FP8 → FP4) makes the model lighter on memory and faster to run, with virtually no quality loss when done well, as we do here at Together AI. This unlocks:
- Noticeably faster tokens/sec
- Bigger batch sizes at the same GPU footprint
- Lower cost per token
- Smoother scaling for real-time workloads
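To make the "bigger batch sizes at the same GPU footprint" point concrete, here is a back-of-the-envelope sketch of how weight precision changes memory use. The 8B-parameter model and the 80 GB GPU are hypothetical numbers chosen for illustration, and the estimate deliberately ignores quantization scale factors, activations, and runtime overhead.

```python
# Rough weight-memory arithmetic for different precisions.
# Assumption: every parameter is stored at the stated width; real
# quantized checkpoints also carry scale factors and some layers may
# stay in higher precision, so treat these as lower bounds.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory taken by the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def headroom_gb(gpu_memory_gb: float, num_params: float, precision: str) -> float:
    """Memory left for KV cache and activations once the weights are loaded."""
    return gpu_memory_gb - weight_memory_gb(num_params, precision)


if __name__ == "__main__":
    NUM_PARAMS = 8e9       # hypothetical 8B-parameter model
    GPU_MEMORY_GB = 80.0   # hypothetical 80 GB accelerator
    for precision in ("fp16", "fp8", "fp4"):
        weights = weight_memory_gb(NUM_PARAMS, precision)
        spare = headroom_gb(GPU_MEMORY_GB, NUM_PARAMS, precision)
        print(f"{precision}: weights {weights:.1f} GB, headroom {spare:.1f} GB")
```

Every gigabyte freed from weights is a gigabyte available for KV cache, which is what allows more concurrent sequences per GPU.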
We’ve seen in many production deployments that FP8 or FP4 quantization delivers a 20–40% throughput improvement without harming output quality.

Distillation
Not every workload needs the full weight of a frontier model. Distillation trains a smaller model to mimic a larger one, preserving reasoning patterns while cutting compute cost dramatically. DeepSeek-R1 is a great example. Its distilled variants are fast, lightweight, and still excellent at reasoning, making them perfect for:
- Interactive chat
- Coding assistants
- Routing and classification
- High-volume enterprise workloads
- Inference at the edge or under tight latency budgets
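As an illustration of what "trains a smaller model to mimic a larger one" can mean in practice, the sketch below computes the classic soft-label distillation loss: the KL divergence between temperature-softened teacher and student distributions. This is one common formulation, shown here as an assumption for illustration; it is not a description of how the R1 distilled variants were trained, and the logits are made-up values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the conventional correction that keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]        # made-up teacher logits
    close_student = [3.5, 1.2, 0.4]  # student that roughly agrees
    far_student = [0.5, 1.0, 4.0]    # student that disagrees
    print(distillation_loss(teacher, close_student))
    print(distillation_loss(teacher, far_student))
```

A student whose distribution tracks the teacher's gets a loss near zero, and training pushes the smaller model in that direction.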
You can see how teams deploy R1 and its distilled variants securely on Together AI in this post. Distilled R1 variants deliver a quality-to-latency ratio that’s extremely compelling for production workloads, often enabling 2–5× lower cost at similar quality bands for many tasks. Together, quantization and distillation offer some of the largest cost reductions available before touching hardware or cluster architecture.

2. Reduce network latency at the edge (regional inference proxies)
Sometimes the biggest latency win isn’t compute, but geography. Even with extremely fast models, network distance is often the slowest part of the request path. Dropping a lightweight proxy in the same region as your inference cluster cuts out long round-trip paths before generation even starts. This alone can shave 50–100 ms off TTFT and make tail latency far more predictable.

3. Reduce unnecessary compute (memory stalls, KV inefficiencies, fragmentation)
Most models aren’t slow; the pipelines around them are. As a result, your GPU spends a lot of time doing nothing but waiting. The biggest culprits tend to be:
- Kernels that don’t work together efficiently, forcing the GPU to pause between prefill, attention, and decoding
- MoE layers that spend more time waiting on memory than doing useful work, especially when expert routing is unbalanced
- Prefill paths that struggle with long prompts, leading to slow starts and uneven performance
- Batching or scheduling gaps that leave portions of the GPU idle while work is still available
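One simple way to quantify this kind of waiting is to take the busy intervals from a profiler trace and measure how much of the request window is not covered by any kernel. The sketch below assumes you already have (start, end) timestamps in milliseconds; the function name and the sample timeline are hypothetical and not taken from any specific profiler.

```python
def idle_fraction(busy_intervals, window_start, window_end):
    """Fraction of [window_start, window_end] not covered by any busy interval.

    busy_intervals: iterable of (start, end) pairs; overlaps are merged so
    concurrent kernels are not double-counted.
    """
    total = window_end - window_start
    if total <= 0:
        raise ValueError("window_end must be greater than window_start")

    # Clip each interval to the window and sort by start time.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in busy_intervals
        if end > window_start and start < window_end
    )

    busy = 0.0
    current_start, current_end = None, None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # A gap begins: close out the previous merged interval.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start

    return 1.0 - busy / total


if __name__ == "__main__":
    # Hypothetical timeline (ms): prefill, attention, decode with gaps between.
    timeline = [(0, 40), (45, 60), (70, 100)]
    print(f"idle: {idle_fraction(timeline, 0, 100):.0%}")
```

Tracking this number before and after a change such as kernel fusion or a scheduler tweak shows whether the stall time actually shrank.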
At Together AI, we’ve run benchmarks across the Llama, Qwen, Mistral, and DeepSeek families (highlighted in our blog post on the fastest inference for the top open-source models), which show that kernel fusion, smarter MoE execution, streamlined tokenization, and better scheduling can eliminate wasted time, unlocking faster responses and higher throughput.

4. Use the right decoding optimization (MTP, speculative decoding, draft models)
Decoding is where a lot of time gets lost, and also where some of the easiest wins live.
- MTP (multi-token prediction): Predicts multiple tokens at once, increasing decode speed and GPU efficiency
- Speculative decoding: Uses a small “draft” model to accelerate generation for predictable workloads
Traditional speculative decoding uses a fixed drafting strategy, but modern engines allow teams to optimize for their specific traffic distribution, maximizing speed while minimizing quality regressions. We did this with our own speculator, ATLAS. We break down these strategies in detail in our customized speculative decoding post.
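For readers new to the idea, here is a minimal sketch of the greedy form of speculative decoding: a cheap draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is kept along with one token from the target. The two toy "models" below are hypothetical stand-ins, and this is not how ATLAS or any production engine is implemented; production systems verify all proposed positions in one batched forward pass and typically use a sampling-based acceptance rule.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding with a fixed draft length k.

    target_next / draft_next: functions mapping a token list to the next token.
    Returns (tokens, target_passes); the output matches plain greedy decoding
    with the target model alone.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The draft model proposes k tokens autoregressively.
        context = list(tokens)
        proposal = []
        for _ in range(k):
            token = draft_next(context)
            proposal.append(token)
            context.append(token)

        # 2) The target model verifies the proposal. This loop stands in
        #    for a single batched forward pass in a real engine.
        target_passes += 1
        context = list(tokens)
        accepted = []
        for token in proposal:
            expected = target_next(context)
            if expected != token:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(token)
            context.append(token)
        else:
            accepted.append(target_next(context))  # bonus token when all match

        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def toy_target(context):
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context):
    """Hypothetical draft model: agrees with the target except after 3 and 7."""
    return 0 if context[-1] % 4 == 3 else (context[-1] + 1) % 10


def greedy_decode(next_token, prompt, max_new_tokens):
    """Baseline: one model pass per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


if __name__ == "__main__":
    output, passes = speculative_decode(toy_target, toy_draft, [0], 12)
    print(output, "target passes:", passes)
```

The speedup depends entirely on how often the draft agrees with the target, which is why tuning the drafter to the traffic distribution matters.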
When tuned properly, these techniques often deliver 20–50% faster decoding and significantly higher throughput per GPU.

5. Pick the right hardware for your workload (and use the right parallelism)
With new hardware generations arriving every year or so, hardware choice increasingly shapes both cost and latency. Blackwell GPUs offer major improvements in per-token throughput and…