From AWS to Together Dedicated Endpoints: Arcee AI's journey to greater inference flexibility
Captured source
source ↗From AWS to Together Dedicated Endpoints: Arcee AI's journey to greater inference flexibility
⚡️ FlashAttention-4: up to 1.3× faster than cuDNN on NVIDIA Blackwell →
Introducing Together AI's new look →
🔎 ATLAS: runtime-learning accelerators delivering up to 4x faster LLM inference →
⚡ Together GPU Clusters: self-service NVIDIA GPUs, now generally available →
📦 Batch Inference API: Process billions of tokens at 50% lower cost for most models →
🪛 Fine-Tuning Platform Upgrades: Larger Models, Longer Contexts →
All customer stories
From AWS to Together Dedicated Endpoints: Arcee AI's journey to greater inference flexibility
95%
faster TTFT
41+
queries per second
7+ models
Deployed
Client
Company segment
AI-Native Startup
Company Industry
Generative AI Platform
Highlights
95% faster TTFT 41+ QPS GPU fleet offloaded Zero downtime
Summary
If you've ever felt overwhelmed by the complexity of today's AI landscape, you're not alone. Amid this entropy, Arcee AI saw an opportunity to simplify AI adoption by creating efficient, smaller language models that help enterprises effortlessly integrate advanced AI workflows. In this customer story, we explore why Arcee AI transitioned its specialized small language models (SLMs) from AWS to Together Dedicated Endpoints—and how this migration unlocked significant improvements in cost, performance, and operational agility.
Training Small Language Models At the core of Arcee AI’s strategy is a focus on training specialized small language models (SLMs)—typically under 72 billion parameters—optimized for specific tasks. Leveraging their proprietary training stack—including specialized techniques for merging and distilling models—Arcee AI consistently produces high-performing models. These custom models, both open-source and proprietary, excel in distinct tasks such as coding, general text generation, and high-speed inference, providing precise and cost-efficient performance. We’re thrilled to announce that as of today, seven of these models are now available on Together AI serverless endpoints, so you can start using them with the Together API: Arcee AI Virtuoso-Large : Powerful 72B SLM, built for complex, cross-domain tasks and scalable enterprise-grade AI solutions. Arcee AI Virtuoso-Medium : Versatile 32B SLM built for precision and adaptability across domains, ideal for dynamic, compute-intensive use cases. Arcee AI Maestro : 32B SLM for advanced reasoning, excelling in complex problem-solving, abstract reasoning, and scenario modeling. Arcee AI Coder-Large : 32B model based on Qwen2.5-Instruct, fine-tuned for code generation and debugging, ideal for advanced development tasks. Arcee AI Caller : 32B SLM optimized for tool use and API calls, enabling precise execution and orchestration in automation pipelines. Arcee AI Spotlight : 7B vision-language model based on Qwen2.5-VL, refined by Arcee AI for visual tasks with a 32K context length for rich interaction. Arcee AI Blitz : Efficient 24B SLM with strong world knowledge, offering fast, affordable performance across diverse tasks.
Arcee Conductor & Arcee Orchestra After an initial focus on developing powerful models, Arcee AI built a software layer on top, currently consisting of two products: Arcee Conductor and Arcee Orchestra. Arcee Conductor is an intelligent inference routing system powered by a unique 150 million parameter classifier—such a small size that latency doesn’t come into play. Conductor was trained from scratch to evaluate each query or prompt, then quickly route it to the most suitable model based on requirements that include complexity, domain, and task type. Once a user enters a query, Conductor intelligently routes it to the best model for the specific use case, picking between one of Arcee AI’s suite of specialized models or a state-of-the-art third-party model such as GPT-4.1, Claude 3.7 Sonnet, and DeepSeek-R1.
“Most of the time, you think you need a GPT-4.1 caliber model, whereas for a large number of queries you don’t need that level of depth. So Conductor will route to one of our models, which are 95% cheaper than GPT and Sonnet.” – Mark McQuade, CEO In addition to the drastic cost reduction, Arcee AI also found that using Conductor improved performance across an entire benchmark suite, since it quickly routes to the best model for each specific task. Additionally, it allows customers to bring their own models and optimize the Conductor configuration to optimize a specific KPI (such as latency or cost), Arcee Orchestra focuses on building agentic workflows . It enables enterprises to automate tasks through seamless integration with third-party services and data sources. Orchestra stands out with its intuitive no-code interface, enhanced by AI-driven workflow generation capabilities, allowing users to effortlessly build automated workflows through simple prompts or voice commands. Orchestra users can opt to power their workflows with a third-party model, or with one of the Arcee SLMs that has been specially-trained for agentic workflows–via careful fine-tuning to be highly capable at instruction-following and function calling, and to understand the nuances of API inputs and outputs. “Orchestra allows you to automate complex tasks easily, creating workflows using a drag-and-drop interface. It gets really powerful once you start prompting it with a query or voice input to generate a workflow.” – Mark McQuade, CEO
Operational challenges: costs and complexity Arcee AI’s models are at the core of its product offering. Originally, Arcee AI deployed these models through AWS's managed Kubernetes service (EKS). Despite being a managed service, this AWS setup presented significant challenges. For one, managing it required significant engineering resources, specialized Kubernetes expertise and dedicated engineering talent. Plus, the team found that self-managed auto scaling of GPUs was very challenging. Ultimately, this siphoned off valuable resources from product innovation and development. “Kubernetes is not simple. You have to hire dedicated engineering talent to manage the fleet and the backend infrastructure–not just the compute, but also having the compute shared across nodes, load balancing, autoscaling, and all these fun things that require multiple engineers to manage. Even though EKS is a managed service, it was still a very manual, cumbersome and time-consuming process.” – Mark McQuade, CEO
At this stage…
Excerpt shown — open the source for the full document.
Notability
notability 4.0/10Company migration blog post, moderate industry interest.