Inside the Together AI kernels team
Captured source
source ↗Inside the Together AI kernels team
⚡️ FlashAttention-4: up to 1.3× faster than cuDNN on NVIDIA Blackwell →
Introducing Together AI's new look →
🔎 ATLAS: runtime-learning accelerators delivering up to 4x faster LLM inference →
⚡ Together GPU Clusters: self-service NVIDIA GPUs, now generally available →
📦 Batch Inference API: Process billions of tokens at 50% lower cost for most models →
🪛 Fine-Tuning Platform Upgrades: Larger Models, Longer Contexts →
All blog posts
Research
Published 4/1/2026
Inside the Together AI kernels team
A look at the team making GPUs go faster than anyone thought possible
Authors
Will Van Eaton
Table of contents
40+ Models Chosen for Production...40+ Models Chosen for Production...40+ Models Chosen for Production...
Links in this article
FlashAttention FlashAttention-4 Megakernel ThunderKittens
The breakthrough came on a holiday weekend. Memorial Day 2022. While most of Silicon Valley was at barbecues, Dan Fu, Tri Dao, and their colleagues were about to prove the AI establishment wrong. The conventional wisdom was settled: transformer attention was already optimized. GPU experts had squeezed every drop of performance from the hardware. There wasn't much left to gain. Then Dan, Tri, and their colleagues published FlashAttention . Andrej Karpathy, then Senior Director of AI at Tesla, tweeted about it that afternoon. Within hours it was ricocheting through AI research channels. "Honestly, we weren't expecting anybody to see it when we released it," Dan recalls. "We were prepping some blogs and code release for Tuesday morning. But then we saw Karpathy tweet about it Monday at 7 p.m. — so then we knew it was something people were paying attention to."
Previous work on sparsity and low-rank methods showed theoretical speedup but only 10% real performance gains. The FlashAttention team took a different approach: understanding actual GPU memory movement and compute patterns. By applying classic database systems principles (unglamorous stuff about data locality and memory hierarchies) to attention, they achieved 2–3x speedups. For the researchers behind it, the implications were clear. Enormous untapped potential remained in GPU optimization. That single paper became the foundation for what's now one of the most impactful kernel research teams in AI, and a critical building block of the AI Native Cloud. The problem no one was talking about Here's what most people don't understand about AI: having the best models and the best hardware isn't enough. The bottleneck is the gap between them: the software layer that translates mathematical operations into silicon instructions. That's where kernels come in. The gap between what researchers design and what actually runs fast on hardware is vast. Many foundational architectures (ResNets, LSTMs, RNNs) were designed before the scaling paradigm took hold. As researchers began scaling models to hundreds of billions of parameters, hardware evolved with them. GPUs became increasingly specialized matrix multiplication machines, tuned for the Transformer architectures dominating modern AI. Kernels are the translation layer between mathematical abstraction and silicon reality: the software that tells GPUs exactly how to move data and perform computations efficiently. Get them right, and you unlock the full power of the hardware. Get them wrong, and that hardware sits idle. For AI-native applications (products built with AI at their core), this gap is existential. You can't build a responsive AI-native app on infrastructure running below optimal capacity. You can't scale an AI-native business when inference costs are 2x higher than they should be. The AI Native Cloud requires AI-native infrastructure, optimized from the silicon up. One week to match a year's work In March 2025, our kernels team had grown to about 15 people: a mix of ML researchers seeking systems challenges and GPU veterans moving into AI. We'd just gotten access to NVIDIA's new Blackwell GPUs, the latest generation of hardware with fundamentally different capabilities from their predecessors. The challenge was specific: NVIDIA's teams had spent a year developing optimized kernels for Blackwell, with dozens of engineers and intimate knowledge of the hardware. We had a week. We needed something that would let us move fast. That something was ThunderKittens, a library we'd been developing in collaboration with researchers at Stanford. "We had experience adapting kernels to new hardware, and we knew it would take time," Dan says. "For example, FlashAttention-3 came out a year after general Hopper availability. We wanted to build tools to make it easier to quickly build kernels for the new hardware generation, and ThunderKittens was the answer." ThunderKittens is built around NVIDIA's tensor cores, the hardware units on the GPU specialized for matrix multiplication. By building abstractions around tensor cores, ThunderKittens reduces what was once 1,000+ lines of CUDA code to 100–200 lines. The team worked around the clock, adapting ThunderKittens to Blackwell's new features: Fifth-generation tensor cores that run 2–2.5x faster than the previous generation, a new layer of tensor memory providing an extra 256KB of ultra-fast on-chip storage, and CTA pairs that allow thread blocks to coordinate more deeply.
Within one week of hardware access, we had some of the fastest FP4 and FP8 GEMM kernels available for Blackwell, with up to 2x speedups over cuBLAS on H100s. The academic engine ThunderKittens didn't come from nowhere. It's part of a broader pattern in how this team is built. Dan Fu runs a lab at UCSD focused on higher-risk fundamental research, including his personal passion project on FFT algorithms (a niche area most industry researchers would never touch). Together AI co-founder and Chief Scientist Tri Dao is at Princeton. Simran Arora is at Caltech. The model is symbiotic: de-risk ideas in academia, productionize them at Together AI. PhD students join the company. Together AI interns work on longer-term research in academic labs. Ideas flow both ways. This philosophy shapes hiring. We're not looking for people who just want to ship code or rack up citations. We want people who lose sleep over memory access patterns. Who find beauty in data flow diagrams. Who get genuinely excited about taking cutting-edge research into production.…
Excerpt shown — open the source for the full document.
Notability
notability 4.0/10Routine blog post, no major traction