WritingTogether AITogether AIpublished Sep 15, 2025seen 5d

Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase

Open original ↗

Captured source

source ↗

Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase

⚡️ FlashAttention-4: up to 1.3× faster than cuDNN on NVIDIA Blackwell →

Introducing Together AI's new look →

🔎 ATLAS: runtime-learning accelerators delivering up to 4x faster LLM inference →

⚡ Together GPU Clusters: self-service NVIDIA GPUs, now generally available →

📦 Batch Inference API: Process billions of tokens at 50% lower cost for most models →

🪛 Fine-Tuning Platform Upgrades: Larger Models, Longer Contexts →

All blog posts

Inference

Published 9/15/2025

Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase

Authors

Rajas Bansal, Mitali Meratwal, Nikitha Suryadevara, Will Van Eaton, Rishabh Bhargava

Table of contents

40+ Models Chosen for Production...40+ Models Chosen for Production...40+ Models Chosen for Production...

Links in this article

Try it today

Improved Batch Inference API: Enhanced UI, Expanded Model Support, and 3000× Rate Limit Increase We've rolled out major improvements to our Batch Inference API, making it simpler, faster, and more powerful for teams processing massive datasets.

What's New Streamlined UI Create and track batch jobs in an intuitive interface — no complex API calls required. Universal Model Access The Batch Inference API now supports all serverless models and private deployments, so you can run batch workloads on exactly the models you need. Massive Scale Jump Rate limits are up from 10M to 30B enqueued tokens per model per user, a 3000× increase . Need more? We'll work with you to customize. Lower Cost For most serverless models, the Batch Inference API runs at 50% the cost of our real-time API, making it the most economical way to process high-throughput workloads. Batch Inference API in Action "We rely on the Batch Inference API to process very large amounts of requests. The high rate limits—up to 30B enqueued tokens—let us run massive experiments without bottlenecks, and jobs consistently finish well under the 24-hour SLA, often within just hours. It's transformed the pace at which we can test and iterate." — Volodymyr Kuleshov, Co-Founder, Inception Labs Inception Labs is one of many teams leveraging the Batch Inference API to accelerate experimentation and production workloads. From research datasets to customer-facing applications, Batch enables large-scale processing that simply wasn't feasible before. Ideal Use Cases The Batch Inference API is perfect when you need high throughput without real-time constraints: Large-scale text analysis : Sentiment analysis, document classification, content tagging Fraud detection : Scan millions of transactions for anomalies Synthetic data generation : Create massive training datasets Embedding generation : Turn large corpora into vector representations Content moderation : Process user-generated content at scale Model evaluation: Run large benchmark suites Customer support automation: Handle tickets with longer SLAs efficiently

Looking Ahead These updates mark a major step forward in making large-scale inference both accessible and cost-effective. With an upgraded UI, universal model support, and dramatically higher rate limits––all at typically half the cost of real-time APIs––the Batch Inference API is the most efficient way to handle massive workloads. Try the Batch Inference API today and start scaling your experiments without limits.

Notability

notability 5.0/10

Product update with notable rate limit increase.