Together AI analysis
Thesis
Together AI is consolidating its positioning as the AI-native cloud — an inference-first infrastructure platform that competes on raw speed and cost per token. The evidence pack shows the company simultaneously building out in three directions: (1) deepening the infrastructure surface from GPU clusters into managed storage, networking, and observability P3E23, (2) layering enterprise trust and access-control primitives (ISO 27001 certification, identity and RBAC systems) P24P2P13, and (3) investing heavily in research that directly improves its inference platform's competitive moat — multi-GPU kernel generation P7P9, RL/post-training co-design P8, speculative decoding E28, and agent infrastructure P20. A new Head of Hyperscaler Partnerships role P1 signals intent to scale distribution through cloud platform deals, while the Amsterdam engineering hub is expanding as a dedicated center for identity, collaboration, and sandbox backend systems P2P13P18E32. The research internship pipeline for Fall 2026 is unusually broad — spanning inference, model shaping, RL/post-training, GPU programming, and frontier agents — indicating sustained R&D investment across the full stack E9E10E11E15E29.
Signal desks
Hiring
- Identity & Collaboration team (Amsterdam): Hiring both a Software Engineer and Senior Software Engineer to build authentication flows (SSO, OAuth, SAML), organizations, API keys, and role-based access controls in Elixir/Phoenix + TypeScript/Next.js. Signals enterprise tenant readiness and multi-user collaboration product surface P2P13E4E8.
- Cloud infrastructure (San Francisco & Amsterdam): Multiple roles — Senior Software Engineer – Together Cloud Infrastructure (SF and Amsterdam), AI Infrastructure Engineer, Staff Engineer for Distributed Storage and HPC & AI Infrastructure, Infrastructure Design Engineer — point to a major buildout of GPU clusters, managed storage, and networking P3E6E22E23E30E32E35.
- Product Manager, AI Infrastructure (San Francisco): New role focused on GPU Clusters, Managed Storage, and observability, with a path to full product-area ownership within nine months. Indicates rapid surface expansion and formalization of infrastructure PM discipline P3E6.
- Hyperscaler partnerships (San Francisco): A principal-level Head of Hyperscaler Partnerships role reporting to the VP of Strategic Partnerships, with direct exposure to CEO/CFO/CRO. Explicitly covers model licensing, revshare, marketplace private offers, and scaled cloud distribution P1E2.
- Research interns (Fall 2026, San Francisco): Five distinct intern tracks — Inference P19E10, Model Shaping P17E9, RL & Post-Training Systems (Turbo) P8E15, Frontier Agents P20E11W1W2, and GPU Programming P21E29. Pattern: all require empirical research rigor, Python + deep learning frameworks, and familiarity with Transformer architectures; several explicitly mention CUDA/C++ and systems co-design.
- Inference engineering (San Francisco/NYC): Research Engineer for Frontier Speculative Decoding E28 and Machine Learning Engineer – Inference E27 signal continued heavy investment in inference speed as a differentiator.
- Data Center Operations (San Francisco): Data Center Operations Coordinator role to manage break/fix across multiple data center locations, indicating self-operated or closely managed colocation footprint P22.
- Data platform & analytics: Backend Software Engineer – Data Platform & AI Data Products E37, Analytics Engineer – Data Warehouse E33, Finance Analytics Engineer E25, and Infrastructure Accounting Manager E26 together suggest internal data infrastructure buildout and increasing financial/operational sophistication around GPU capacity economics.
- Customer support functions: Customer Support Engineer roles for both Inference and GPU Cluster E38E40 indicate growing paying customer base requiring dedicated production support.
- Voice AI: Staff Machine Learning Engineer, Voice AI role referenced externally W4 aligns with Together's speech-to-text and TTS product investments E50E55.
- Recruiting scale-up: Senior Technical Recruiter for AI/ML Research E21 signals competitive research hiring push, consistent with the breadth of Fall 2026 intern roles.
- Workplace/operations: Workplace Coordinator P11E16 and Lead Product Designer E39 round out organizational scaling.
Forks
- InferenceX (fork of SemiAnalysisAI/InferenceX): Apache-2.0 continuous inference benchmark platform targeting B200, GB200 NVL72, MI355X, GB300 NVL72, and TPU/Trainium. Description explicitly mentions Kimi K2.6, DeepSeekv4, GLM5. Forked to run official/unofficial Together inference benchmarking P23E43.
- DeepGEMM (fork of deepseek-ai/DeepGEMM): Forked April 2026. Signals interest in DeepSeek's FP8 GEMM library, likely for inference kernel optimization on Together's serving stack E60.
- k8s-netperf (fork of cloud-bulldozer/k8s-netperf): Kubernetes network performance benchmarking. Relevant to multi-datacenter GPU cluster networking and interconnect performance measurement E59.
- tinker-cookbook (fork of thinking-machines-lab/tinker-cookbook): Forked June 2026. A cookbook for model training/fine-tuning recipes; consistent with Model Shaping team's post-training and customization work E46.
- Historical forks: Port_FasterTransformer (NVIDIA/FasterTransformer) P25, flash-attention (Dao-AILab) P26, diffusers (HazyResearch/diffusers) P27, together-chat (AI-Yash/st-chat) P28 — all from 2022–2023, inactive or archival, reflecting earlier-stage exploration.
Releases
- together-py SDK (v2.16.1 → v2.20.0): Rapid iteration over ~10 days in late June 2026. New features: whoami CLI command and API endpoint P4E7, CLI endpoint adapter commands P10E14, remediation approval mode P5E5, upload-path memory fix P10. Pattern: CLI ergonomics, enterprise remediation workflows, and OpenAPI spec expansion E1E24.
- together-typescript SDK (v0.41.2): Chore-heavy release adding staging CI syncing and remediation summary docs improvements P14E19.
- together-sandbox (workspace v1.11.0 → v3.0.0): Breaking change in v3.0.0: snapshots.list() now returns cursor-paginated results instead of plain arrays P12E18. Consistent iteration cadence throughout May–June 2026 E45E49E51.
- detect_agent (v0.3.0): Release tagged May 2026; agent detection tooling E48.
- xorl-wheels (tilelang 0.1.10): Prebuilt CUDA 13.1 wheels for TileLang, a kernel DSL — relevant to GPU programming research E47.
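The together-sandbox v3.0.0 change from plain arrays to cursor-paginated results means callers must loop over pages instead of reading a single list. The sketch below illustrates the general pattern only. FakeSnapshots, Page, next_cursor, and the cursor keyword are hypothetical names invented for this example; the evidence does not describe the real SDK's response shape.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Page:
    """One page of results plus the cursor for the next page (None when done)."""
    items: List[str]
    next_cursor: Optional[str]


class FakeSnapshots:
    """Hypothetical stand-in for a snapshots resource; NOT the real SDK."""

    def __init__(self, names: List[str], page_size: int = 2) -> None:
        self._names = names
        self._page_size = page_size

    def list(self, cursor: Optional[str] = None) -> Page:
        # Assumption: the cursor is a stringified offset. Real cursors are
        # usually opaque tokens that clients must pass back unchanged.
        start = int(cursor) if cursor is not None else 0
        end = start + self._page_size
        next_cursor = str(end) if end < len(self._names) else None
        return Page(items=self._names[start:end], next_cursor=next_cursor)


def iter_snapshots(client: FakeSnapshots) -> Iterator[str]:
    """Yield every item by following cursors until none is returned."""
    cursor: Optional[str] = None
    while True:
        page = client.list(cursor=cursor)
        yield from page.items
        if page.next_cursor is None:
            break
        cursor = page.next_cursor


all_names = list(iter_snapshots(FakeSnapshots(["a", "b", "c", "d", "e"])))
```

Code written against the old array-returning form would need a wrapper like iter_snapshots to keep working, which is consistent with the change shipping as a major version bump.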
Talking
- ParallelKernelBench (June 2026): Flagship research post. Together tested frontier LLMs on 87 real-world multi-GPU CUDA kernel problems. Best model solved <33% correctly; fewer than 25% of those beat the naive PyTorch+NCCL baseline. A few generated kernels beat any public implementation, including one for NVIDIA NeMo-RL's GRPO training loop. Positions Together as the authority on inference-critical kernel benchmarking P7P9E13.
- Kimi K2.7 Code vs Claude Fable 5 (June 2026): Competitive cost/quality analysis. Kimi K2.7 Code produced landing pages at 94% lower cost than Claude Fable 5, within a few quality points. Narrative: open-source models are closing the gap on quality while being dramatically cheaper — reinforces Together's cost-leadership positioning P15E20.
- ISO 27001:2022 certification (June 2026): Enterprise trust milestone. Certification covers global platform ISMS, corporate HQ, and third-party data centers. Authored by Together's security leadership. Directly addresses enterprise procurement requirements P24E44.
- Serving MiniMax-M3 (June 2026): Technical deep-dive on efficient inference for a 1M-token-context multimodal model using KV-block-major sparse attention, paged MSA decode, and a Rust-based multimodal gateway E42.
- Speech-to-text stack (May 2026): Claims world's fastest ASR on Artificial Analysis by treating speech-to-text as a full-path systems problem rather than GPU inference alone E50.
- Coding agent benchmarks (May 2026): Claims 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6 for coding agent workloads E52.
- DeepSeek-V4 serving (May 2026): Explores million-token context as an inference systems problem on HGX B200 — compressed KV layouts, prefix caching, kernel maturity E56.
- Parcae (April 2026): Research on stable looped models — a 770M model matching 1.3B Transformer quality, with first scaling laws for looping E58.
- Product/partnership announcements: GPT Image 2 availability on Together W3, Pearl Research Labs partnership for discounted inference via Proof of Useful Work E53, Violin open-source video translation E54, Voice Finder tool for 600+ TTS voices E55, HuggingFace one-click deploy via Goose E57.
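The two ParallelKernelBench percentages above compound, so the share of problems with a generated kernel that is both correct and faster than the baseline is small. The sketch below bounds that share. It assumes both thresholds are strict inequalities on whole problem counts and that the 25% applies to the best model's solved set. The post's exact counts are not in the evidence, so these are upper bounds, not reported results.

```python
# Figures quoted in the post summary above.
PROBLEMS = 87
SOLVED_PCT_CAP = 33   # best model solved fewer than 33% correctly
BEAT_PCT_CAP = 25     # fewer than 25% of the solved ones beat the baseline


def largest_count_below(total: int, pct: int) -> int:
    """Largest integer n with n / total strictly below pct percent."""
    if total <= 0:
        return 0
    # n * 100 < total * pct, solved with integer arithmetic to avoid
    # floating-point rounding at the boundary.
    return (total * pct - 1) // 100


max_solved = largest_count_below(PROBLEMS, SOLVED_PCT_CAP)
max_beating = largest_count_below(max_solved, BEAT_PCT_CAP)
max_share = max_beating / PROBLEMS
print(max_solved, max_beating, round(100 * max_share, 1))
```

Under these assumptions the best model solved at most 28 of 87 problems, and at most 6 of those (about 7% of the benchmark) beat the naive PyTorch+NCCL baseline.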
Shipping
Together's shipping velocity in late June 2026 is concentrated on SDK maturity: four Python SDK releases (v2.16.1 through v2.20.0) across ~10 days added CLI ergonomics (whoami, endpoint adapters), enterprise remediation approval workflows, and upload stability fixes P4P5P10P16. The TypeScript SDK saw a chore/infra release focused on CI syncing P14. The together-sandbox product shipped a breaking pagination change (v3.0.0) signaling production API hardening for snapshot-heavy workflows P12. No new models or model cards shipped in the evidence window; shipping activity is overwhelmingly platform and developer-experience oriented.
Research themes
1. Multi-GPU kernel generation and benchmarking: ParallelKernelBench is the headline research artifact: an 87-problem benchmark and evaluation framework for LLM-generated multi-GPU CUDA kernels, directly relevant to inference serving performance P7P9E13.
2. Inference-aware RL and post-training: The Turbo team is co-designing RL algorithms with inference systems, studying how speculative decoding, KV cache management, and partial rollouts affect learning dynamics in GRPO/RLHF/DPO-style methods P8E15.
3. Model shaping and efficient training: Research interns are working on advanced post-training (supervised, preference optimization, RL), distributed training improvements, and foundation model evaluation P17E9. The team has recent publications at ICML 2026 and ICLR 2026 P17.
4. Frontier agents: The Agents team investigates post-training for agentic behavior, self-learning and long-horizon reasoning, evaluation frameworks for open-ended tasks, and agent infrastructure at scale P20E11W1W2.
5. Speculative decoding and inference optimization: Dedicated research engineer role for frontier speculative decoding E28, inference research intern covering KV cache design, compiler-aware optimization, and Mixture-of-Experts serving P19E10.
6. GPU kernel co-design: GPU Programming interns work on CUDA/Triton kernel optimization co-designed with model architecture, reflecting the hardware-software co-design philosophy articulated in Together's mission P21E29E31.
7. Looped/recurrent models: Parcae research (April 2026) demonstrated stable looped LMs matching Transformers 2× their size, with scaling laws for recurrence E58.
Hiring & scaling
Together AI is hiring across three geographic hubs: San Francisco (primary), Amsterdam (secondary engineering hub), and New York City (limited). The Amsterdam office is concentrated on Identity & Collaboration and Sandbox backend engineering P2P13P18E32, while San Francisco carries the full breadth: cloud infrastructure, research, product, partnerships, data platform, and operations. The Fall 2026 intern cohort (five distinct tracks, 12–16 weeks, September–December) is substantial and spans the full research-to-systems spectrum P8P17P19P20P21. A dedicated Senior Technical Recruiter for AI/ML Research E21, together with a Finance Analytics Engineer E25 and an Infrastructure Accounting Manager E26, suggests organizational scaling to support larger research headcount and increasingly complex GPU capacity economics. Data center operations hiring P22 indicates self-managed or closely managed colocation rather than purely cloud-based infrastructure. The hyperscaler partnerships role P1 is a principal-level strategic hire with direct CEO/CFO/CRO exposure, suggesting major cloud distribution deals are in negotiation or planning.
Category implications
- Infrastructure: Together is building a full-stack AI cloud — GPU clusters, managed storage, networking, and observability — not just an inference API P3E23E35. The Data Center Operations Coordinator P22 and Infrastructure Design Engineer E35 roles indicate physical infrastructure ownership. The InferenceX fork P23E43 and k8s-netperf fork E59 signal systematic benchmarking across hardware targets (B200, GB200, MI355X, TPU). This positions Together as competing with both hyperscalers and neoclouds on infrastructure depth, not just model serving. The hyperscaler partnerships hire P1 suggests a co-opetition strategy: build your own infra while also distributing through the majors.
- Product: The product surface is expanding rapidly from inference into storage, observability, sandbox environments, and enterprise identity/access management. The together-sandbox workspace v3.0.0 with pagination P12 and the Identity & Collaboration team buildout P2P13 point to a product evolving from developer tool to enterprise platform. Customer support engineers for both inference and GPU clusters E38E40 confirm paying production customers.
- Research: Together's research strategy is tightly coupled to its infrastructure business — kernel benchmarking P7, speculative decoding E28, inference-aware RL P8, and GPU programming P21 all directly improve inference cost and speed, which is the company's core GTM differentiator. The Model Shaping team's post-training work P17E9 extends this into the customization layer, enabling customers to fine-tune on Together's infrastructure. Frontier agents research P20 and Parcae looped models E58 represent longer-horizon bets.
- GTM: ISO 27001:2022 certification P24E44 is a deliberate enterprise GTM unlock. The cost-comparison blog posts (Kimi vs. Claude at 94% less cost P15, coding agents at 76% lower cost than Opus E52) are direct price-performance marketing aimed at developers and engineering teams. The Pearl Research Labs partnership E53 explores novel distribution via crypto-economic channels. The Head of Hyperscaler Partnerships P1 signals intent to pursue marketplace and revshare deals with major cloud platforms as a complementary GTM motion.
- Strategy: The evidence reveals a three-layer strategy: (1) be the fastest and cheapest inference platform through deep systems research and kernel optimization, (2) expand into adjacent infrastructure (storage, networking, observability, sandboxes) to increase platform stickiness and revenue per customer, (3) pursue enterprise adoption through compliance (ISO 27001), identity/access features, and hyperscaler marketplace distribution. The breadth of Fall 2026 research interns — spanning inference, post-training, agents, and GPU programming — suggests Together is betting that maintaining inference performance leadership requires sustained, multi-disciplinary research investment rather than one-off optimizations.
Traction highlights
- SDK release velocity: four together-py releases in ~10 days (v2.16.1 through v2.20.0) with feature-bearing changes P4P5P10P16E1E7E5E14E24.
- together-sandbox: four workspace releases in May–June 2026 (v1.11.0, v1.12.0, v2.0.0, v3.0.0) with breaking API changes indicating production usage P12E18E45E49E51.
- Enterprise certification: ISO 27001:2022 achieved via ANAB-accredited A-LIGN audit, covering global platform and third-party data centers P24E44.
- Research output: ParallelKernelBench paper with code, HuggingFace dataset, and public blog post P7P9E13; Parcae paper on looped models E58; ICML 2026 and ICLR 2026 publications from Model Shaping interns P17.
- Competitive benchmarking claims: 31% more TPS than TensorRT-LLM for coding agents E52, world's fastest speech-to-text on Artificial Analysis E50, 94% cost reduction vs. Claude Fable 5 for landing page generation P15E20.
- Model serving breadth: MiniMax-M3 (1M context, multimodal) E42, DeepSeek-V4 (1M context) E56, GPT Image 2 W3, Kimi K2.7 Code P15, Gemma-4-31B-it-pearl E53, HuggingFace any-model deployment E57.
- GitHub organization activity: 8+ active repositories with recent commits including ParallelKernelBench P9, InferenceX fork P23, together-storage-claude-skills (Go-based runbooks for T4+CS3 storage) E3, together-sandbox, together-py, together-typescript, detect_agent, and xorl-wheels.
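The "X% lower cost" claims in the list above are easier to compare as multipliers. The sketch below converts them, taking the quoted percentages at face value. The function name is illustrative, and the figures are Together's own claims as cited, not independent measurements.

```python
def times_cheaper(percent_lower: float) -> float:
    """Convert 'X% lower cost' into 'N times cheaper'."""
    if not 0 <= percent_lower < 100:
        raise ValueError("percent_lower must be in [0, 100)")
    # If the new cost is (100 - X)% of the old cost, the old cost is
    # 100 / (100 - X) times the new one.
    return 100.0 / (100.0 - percent_lower)


landing_pages = times_cheaper(94)   # Kimi K2.7 Code vs. Claude Fable 5 claim
coding_agents = times_cheaper(76)   # coding-agent workloads vs. Opus 4.6 claim
print(round(landing_pages, 1), round(coding_agents, 1))
```

Read this way, the 94% claim corresponds to roughly a 17× cost gap and the 76% claim to roughly 4×, so the landing-page comparison is about four times more aggressive than the coding-agent one.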