Together AI, published Mar 5, 2026

Key research and product announcements at the AI Native Conf

Research

Published 3/5/2026

Together Research is announcing FlashAttention-4, Reinforcement Learning API, ThunderAgent, ATLAS-2, and more at AI Native Conf.

The AI Native Cloud is more than a positioning statement. It is a full-stack AI cloud, purpose-built for AI-natives by researchers and engineers who have delivered foundational AI work such as FlashAttention and ThunderKittens. The same people who published that research are the ones running the production systems that customers such as Cursor and Decagon depend on. That proximity is hard to replicate. When a technique comes out of our research program, we can move it quickly from research to production and ship it for our customers' immediate benefit.

Today at the first AI Native Conf, we are announcing seven research and product releases across three areas: kernels, reinforcement learning, and algorithmic inference optimization. Each one represents a massive advancement from our research-to-production pipeline, available for customers to use.

Kernels

FlashAttention-4

FlashAttention is the attention engine powering many large-scale, frontier language models in production today. The research program led by Chief Scientist Tri Dao continues to push the limits of how fast attention can run. FlashAttention-4 pairs a new algorithm with a kernel co-design tuned for NVIDIA Blackwell GPUs, removing the new bottlenecks so the tensor cores stay busy.

It is 2.7x faster than Triton and 1.3x faster than cuDNN 9.13. For long-context workloads like video understanding, coding agents, and test-time compute scaling, this enables more intelligent capabilities at a lower cost per token on the latest NVIDIA GPUs. Read the FlashAttention-4 launch blog.

Together Megakernel

One of the leading real-time voice agent companies came to Together with a hard constraint: a time-to-first-64-tokens above roughly 100ms breaks the conversational experience. On their previous setup, deployed on NVIDIA B200 GPUs, they were hitting 281ms. That is fast for most workloads, but not fast enough for theirs. Together's kernels team worked with them to select a model architecture, then hand-optimized a Megakernel implementation that runs an entire model in a single kernel, targeting the HBM bandwidth ceiling of the NVIDIA H100.
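A rough sketch of why the HBM bandwidth ceiling is a natural target for low-latency decoding: at batch size 1, generating each token requires reading roughly all model weights from HBM, so weight bytes divided by memory bandwidth gives a floor on per-token time. The model size and precision below are hypothetical (the post does not name the model), the 3.35 TB/s figure assumes the published peak for the H100 SXM variant, and the estimate ignores prefill, KV-cache traffic, and launch overhead, so it is illustrative only.

```python
# Back-of-the-envelope decode latency floor from memory bandwidth alone.
# Assumption: H100 SXM spec-sheet peak HBM bandwidth (other variants differ).
H100_HBM_BYTES_PER_S = 3.35e12


def min_decode_ms(num_params: float, bytes_per_param: float, num_tokens: int,
                  bandwidth: float = H100_HBM_BYTES_PER_S) -> float:
    """Lower bound (ms) to decode num_tokens at batch size 1.

    Assumes every weight is read from HBM once per generated token and
    that nothing else (prefill, KV cache, compute) costs any time.
    """
    weight_bytes = num_params * bytes_per_param
    seconds_per_token = weight_bytes / bandwidth
    return num_tokens * seconds_per_token * 1e3


def max_params_for_budget(budget_ms: float, bytes_per_param: float,
                          num_tokens: int,
                          bandwidth: float = H100_HBM_BYTES_PER_S) -> float:
    """Largest parameter count whose bandwidth floor fits within budget_ms."""
    seconds_per_token = (budget_ms / 1e3) / num_tokens
    return seconds_per_token * bandwidth / bytes_per_param


if __name__ == "__main__":
    # Hypothetical 1B-parameter model stored in 16-bit precision.
    floor_ms = min_decode_ms(num_params=1e9, bytes_per_param=2, num_tokens=64)
    print(f"bandwidth floor for 64 tokens: {floor_ms:.1f} ms")

    # Largest 16-bit model that could fit the ~100 ms budget from the post.
    limit = max_params_for_budget(budget_ms=100, bytes_per_param=2, num_tokens=64)
    print(f"parameter limit at 100 ms: {limit / 1e9:.2f}B")
```

Under these assumptions the budget leaves little headroom beyond the raw weight reads, which is consistent with the post's point that model choice and kernel design had to be decided together.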

The resulting deployment hit 77ms, a 3.6x performance improvement with 7.2x better unit economics compared to their prior deployment. Together Megakernel is the production implementation of open-source research initially developed with collaborators at Stanford. Backed by the same research lineage as FlashAttention, it is hardware-software co-design that closes the gap between what is theoretically possible and what deployed systems deliver. Learn more about Megakernel.

together.compile

The kernel optimization that produces results like Together Megakernel has historically required specialists: engineers who understand GPU thread-block mapping, memory bandwidth constraints, and hardware-specific tuning at a depth most teams don't have on staff. together.compile automates much of that process. An extension of ThunderKittens, together.compile generates an optimized kernel stack at startup with a single function call, with no changes to model code required. When applied to Hedra's Omnia video model, together.compile accelerated generation of 200 frames by 25%.

In production Flux Kontext benchmarks, server startup plus generating 51 images across 17 resolutions completes in 329 seconds with together.compile, versus 558 seconds with torch.compile: a 41% improvement. Startup time drops as well, which matters for teams running autoscaled image and video generation at volume. together.compile is coming soon to Together Dedicated Container Inference. Get in touch if you'd like to join the beta.

Reinforcement Learning

Reinforcement Learning API

Together's Reinforcement Learning API brings the full Together stack to RL training. The kernels, inference optimizations, and research advances that power production inference on Together now apply directly to rollout-heavy workloads, the bottleneck that dominates RL wall-clock time.

The API gives teams control, not a black box. Inference and training are exposed as separate, configurable layers: teams decide rollout configuration, weight push frequency, and where compute runs. Together handles synchronization and scheduling; the decisions about how to run RL remain yours. This level of abstraction lets teams actually optimize their training loop, rather than working around someone else's assumptions about how RL should work.

Over 70% of RL wall-clock time is spent on rollouts, which are inference, and that is where Together's research program directly applies. Distribution-aware speculative decoding and ThunderAgent both target the throughput and latency characteristics that make rollouts fast, translating each research advance into faster RL training cycles.

The remaining bottleneck is weight distribution: getting updated weights to inference nodes after each training step. Within a datacenter, Together pushes new weights to all inference nodes in seconds. At globally distributed scale, with nodes across regions and different GPU types, synchronization completes in under one minute.

ThunderAgent

The Reinforcement Learning API handles the infrastructure layer. ThunderAgent addresses what happens when the workloads being trained and served are themselves agentic: coding agents, scientific discovery agents, and multi-step reasoning pipelines running at scale. Existing inference systems handle agentic workflows as sequences of independent, stateless requests. This creates three compounding problems: KV…
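Returning to the rollout share quoted in the Reinforcement Learning API section, a small sketch using Amdahl's law shows how a faster rollout phase translates into end-to-end training speedup. The 0.7 fraction is taken as a floor from the "over 70%" figure in the post; the rollout speedup factors are hypothetical and not measurements from Together.

```python
def overall_speedup(rollout_fraction: float, rollout_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the rollout phase gets faster.

    rollout_fraction: share of wall-clock time spent in rollouts (0..1).
    rollout_speedup:  factor by which rollouts alone are accelerated (> 0).
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be between 0 and 1")
    if rollout_speedup <= 0:
        raise ValueError("rollout_speedup must be positive")
    other_fraction = 1.0 - rollout_fraction
    return 1.0 / (other_fraction + rollout_fraction / rollout_speedup)


def speedup_ceiling(rollout_fraction: float) -> float:
    """Limit of overall_speedup as rollouts become infinitely fast."""
    if not 0.0 <= rollout_fraction < 1.0:
        raise ValueError("rollout_fraction must be in [0, 1)")
    return 1.0 / (1.0 - rollout_fraction)


if __name__ == "__main__":
    fraction = 0.7  # assumption: lower end of the "over 70%" share in the post
    for factor in (1.5, 2.0, 4.0):  # hypothetical rollout speedups
        total = overall_speedup(fraction, factor)
        print(f"rollouts {factor:.1f}x faster -> training {total:.2f}x faster")
    print(f"ceiling with instant rollouts: {speedup_ceiling(fraction):.2f}x")
```

The ceiling illustrates why the post treats weight distribution as the remaining bottleneck: once rollouts are fast, the non-rollout share of each step bounds further gains.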
