Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase
Captured source
source ↗Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase
⚡️ FlashAttention-4: up to 1.3× faster than cuDNN on NVIDIA Blackwell →
Introducing Together AI's new look →
🔎 ATLAS: runtime-learning accelerators delivering up to 4x faster LLM inference →
⚡ Together GPU Clusters: self-service NVIDIA GPUs, now generally available →
📦 Batch Inference API: Process billions of tokens at 50% lower cost for most models →
🪛 Fine-Tuning Platform Upgrades: Larger Models, Longer Contexts →
All blog posts
Inference
Published 9/15/2025
Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase
Authors
Rajas Bansal, Mitali Meratwal, Nikitha Suryadevara, Will Van Eaton, Rishabh Bhargava
Table of contents
40+ Models Chosen for Production...40+ Models Chosen for Production...40+ Models Chosen for Production...
Links in this article
Try it today
Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase We've rolled out major improvements to our Batch Inference API, making it simpler, faster, and more powerful for teams processing massive datasets.
What's New Streamlined UI Create and track batch jobs in an intuitive interface — no complex API calls required. Universal Model Access The Batch Inference API now supports all serverless models and private deployments, so you can run batch workloads on exactly the models you need. Massive Scale Jump Rate limits are up from 10M to 30B enqueued tokens per model per user, a 3000× increase . Need more? We'll work with you to customize. Lower Cost For most serverless models, the Batch Inference API runs at 50% the cost of our real-time API, making it the most economical way to process high-throughput workloads. Batch Inference API in Action "We rely on the Batch Inference API to process very large amounts of requests. The high rate limits—up to 30B enqueued tokens—let us run massive experiments without bottlenecks, and jobs consistently finish well under the 24-hour SLA, often within just hours. It's transformed the pace at which we can test and iterate." — Volodymyr Kuleshov, Co-Founder, Inception Labs Inception Labs is one of many teams leveraging the Batch Inference API to accelerate experimentation and production workloads. From research datasets to customer-facing applications, Batch enables large-scale processing that simply wasn't feasible before. Ideal Use Cases The Batch Inference API is perfect when you need high throughput without real-time constraints: Large-scale text analysis : Sentiment analysis, document classification, content tagging Fraud detection : Scan millions of transactions for anomalies Synthetic data generation : Create massive training datasets Embedding generation : Turn large corpora into vector representations Content moderation : Process user-generated content at scale Model evaluation: Run large benchmark suites Customer support automation: Handle tickets with longer SLAs efficiently
Looking Ahead These updates mark a major step forward in making large-scale inference both accessible and cost-effective. With an upgraded UI, universal model support, and dramatically higher rate limits––all at typically half the cost of real-time APIs––the Batch Inference API is the most efficient way to handle massive workloads. Try the Batch Inference API today and start scaling your experiments without limits.
Notability
notability 5.0/10Product update with notable rate limit increase.