Mamba-3

Published 3/17/2026

Authors

Aakash Lahoti* (CMU), Kevin Y. Li* (CMU), Berlin Chen* (Princeton), Caitlin Wang* (Princeton), Aviv Bick (CMU), J. Zico Kolter (CMU), Tri Dao (Princeton, Together AI), Albert Gu (CMU, Cartesia AI)

tl;dr

Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding. The result: Mamba-3 SISO beats Mamba-2, Gated DeltaNet, and even Llama-3.2-1B (Transformer) on prefill+decode latency across all sequence lengths at the 1.5B scale. The team also open-sourced the kernels, built using a mix of Triton, TileLang, and CuTe DSL for maximum hardware performance. This blog is cross-posted on the Goomba Lab blog and covers work done in collaboration between researchers at Carnegie Mellon University, Princeton University, Cartesia AI, and Together AI.

Since the release of Mamba-2 in mid-2024, most architectures have switched from Mamba-1. Why? Mamba-2 made the bet that training efficiency was the largest bottleneck for state space models (SSMs), and thus simplified the underlying SSM mechanism to deliver 2-8× faster training compared to its predecessor, leading to wider adoption.

Since then, the LLM landscape has started to shift. While pretraining is still super important, more attention has been focused on post-training and deployment, both of which are extremely inference-heavy. The scaling of post-training methods, especially with reinforcement learning with verifiable rewards (RLVR) for coding or math, requires huge amounts of generated rollouts, and most recently, agentic workflows, such as Codex, Claude Code, or even OpenClaw, have pushed inference demand through the roof.

Despite the clear, growing importance of inference, many linear architectures (including Mamba-2) were developed from a training-first perspective. To accelerate pretraining, the underlying SSM was progressively simplified (e.g., the diagonal transition was reduced to a scalar times identity). While this brought training speed, it left the inference step "too simple" and squarely memory-bound: the GPUs aren't brr-ing but moving memory most of the time.

In this new age of inference, we care a lot about pushing the boundaries of the quality-efficiency frontier: we want the better models to run faster. A natural question arises: What would an SSM designed with inference in mind look like?

The Mamba-3 model

What's missing?

The main appeal of linear models is in their name: compute scales linearly with sequence length because of a fixed-size state. Unfortunately, there is no free lunch.
The same fixed state size that enables efficient computation forces the model to compress all past information into one representation, the exact opposite of a Transformer, which stores all past information through a continuously growing state (the KV cache). This is a fundamental difference. So, if we can't grow the state, how do we make that fixed state do more work?

We see that earlier designs simplified the recurrence and the transition matrix to make training fast. However, the change also reduced the richness of the dynamics and left decoding memory-bound: each token update performs very little computation relative to memory movement. This provides us with three levers we can pull: (1) make the recurrence itself more expressive, (2) use a richer transition matrix, and (3) add more parallel (and almost free) work inside each update.

From these insights, we improve upon Mamba-2 in three core ways. We (1) increase the expressivity of the SSM mechanism through a more general recurrence derived from our exponential-trapezoidal discretization scheme, (2) expand the state-tracking capabilities by modeling a complex-valued SSM system, and (3) improve the model's general performance with little impact on decode latency by using multi-input, multi-output (MIMO) SSMs, which model multiple SSMs in parallel, instead of the current single-input, single-output (SISO) SSMs.
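To make the "almost free" parallel work in lever (3) concrete, here is a rough NumPy sketch of a single decode step for one head. It is an illustration under stated assumptions, not the Mamba-3 kernels: the shapes `N`, `P`, the rank `r`, the function names, and the FLOP count are all hypothetical, and the transition is simplified to a scalar decay. The point is only that a rank-r (MIMO-style) update reads and writes a state of the same size as a rank-1 (SISO-style) update while doing roughly r times more arithmetic on it.

```python
import numpy as np

N, P = 64, 32  # hypothetical state size and head dimension


def siso_step(H, a, b, x, c):
    """One decode step with a rank-1 (outer-product) state update."""
    H = a * H + np.outer(b, x)   # b: (N,), x: (P,)
    return H, c @ H              # output y: (P,)


def mimo_step(H, a, B, X, C):
    """One decode step with a rank-r update; the state stays (N, P)."""
    H = a * H + B @ X            # B: (N, r), X: (r, P)
    return H, C.T @ H            # output Y: (r, P)


def flops_per_state_element(r):
    """Rough count: 1 multiply for decay, 2r for the update, 2r for the readout."""
    return 1 + 4 * r


rng = np.random.default_rng(0)
H0 = rng.standard_normal((N, P))
a = 0.9  # scalar decay, standing in for a scalar-times-identity transition

# With r = 1 the MIMO step reduces to the SISO step.
b, x, c = rng.standard_normal(N), rng.standard_normal(P), rng.standard_normal(N)
H_siso, y_siso = siso_step(H0, a, b, x, c)
H_mimo, Y_mimo = mimo_step(H0, a, b[:, None], x[None, :], c[:, None])

# With r = 4 the state moved per step has the same size, but more work is done on it.
r = 4
B, X, C = (rng.standard_normal((N, r)),
           rng.standard_normal((r, P)),
           rng.standard_normal((N, r)))
H4, Y4 = mimo_step(H0, a, B, X, C)

print(H4.shape == H0.shape, flops_per_state_element(1), flops_per_state_element(r))
```

In this toy accounting the state traffic (one read and one write of `H`) is fixed, so raising `r` raises arithmetic per byte moved. Whether that extra work is actually hidden behind memory movement depends on the hardware and the kernel, which this sketch does not model.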

Through these three changes, Mamba-3 pushes the frontier of performance while maintaining similar inference latency. Notably, all three of these changes are inspired by the more "classical" control theory and state space model literature. Our work goes against the grain of many modern linear architectures, which use alternative interpretations of recurrence (such as linear attention or test-time training) that don't easily capture these concepts.

Architecture

What has changed in the Mamba-2 layer?

Beyond the three methodological upgrades to the core SSM discussed above, we've revamped the architecture a bit to make it more in line with conventional modern language models.

[Figure: Mamba-3 architecture]

Based on the diagram, you'll notice we've changed a couple of things. On a high level:

Norms. We added in QKNorm, or "BCNorm" in SSM terminology, which empirically stabilizes the training of Mamba-3 models. The addition of this norm brings Mamba-3 in line with contemporary Transformer and Gated DeltaNet (GDN) models. With QKNorm, the RMSNorm from Mamba-2 becomes optional. However, we empirically find that it may still be worth keeping in hybrid models because it helps length extrapolation. More on this later.

Goodbye Short Conv. We've been able to get rid of the pesky short causal convolution of Mamba-1/2 by combining (1) simple biases on B and C after BCNorm with (2) our new discretization-based recurrence. The new recurrence implicitly applies a convolution on the input to the hidden state, and we show how this is the case in Part 2 of our blog.

Can the short conv really be removed? The…
