The Inference Alpha: Maximizing Frontier Models on AMD
Captured source
source ↗The Inference Alpha: Maximizing Frontier Models on AMD | DigitalOcean
© 2026 DigitalOcean, LLC. Sitemap .
Dark mode is coming soon. Engineering The Inference Alpha: Maximizing Frontier Models on AMD
By Balaji Varadarajan and Emilio Andere
Updated: June 10, 2026 12 min read
<- Back to blog home
At DigitalOcean, we’re committed to providing high-performance infrastructure for the next generation of AI, which is why we’ve been focused on hosting frontier Large Language Models (LLMs) on frontier GPUs—including AMD GPUs .
We see inference performance as an intricate systems-level challenge. For frontier open-weight models, achieving peak output speed is not just about the raw hardware. It also depends on a complex interaction between model architecture, runtime execution, memory systems, scheduling, and decoding strategy.
We believe there’s a significant “performance alpha” found in specialized inference engineering. Optimizing for both speed and cost-efficiency requires a much deeper approach than standard configuration sweeps. By taking a custom approach to the software stack, we can demonstrate that achieving performance parity with more expensive hardware is entirely possible.
While the current software ecosystem often presents non-obvious hurdles, deep engineering allows us to deliver stronger inference economics on high-performance AMD infrastructure relative to conventional flagship deployments.
The Proof is in the Throughput
To ground our “Performance Alpha” theory in reality, DO worked with Wafer to achieve high performance on specific frontier models on AMD GPUs through various optimizations. By utilizing Wafer’s Agent to identify inefficiencies and apply appropriate fixes, we were able to move beyond marginal gains toward order-of-magnitude improvements that change how these models are used in production.*
Kimi 2.5 (High-Speed Single Stream)
On a standard 10k input / 1.5k output workload, a stock configuration on 8x MI350X/MI355x hardware delivered a baseline of 22.5 tok/s . Through deep kernel optimization and a customized inference framework, we increased this to 255.2 tok/s - representing an 11.33x speedup with zero trade-offs in accuracy.
DeepSeek V3.2 (Full-Stack Scaling)
While stock frameworks achieved 38.5 tok/s for single-request output speed, our optimized stack pushed this to 200.8 tok/s . More importantly, at a concurrency of 64, we saw a 7.32x improvement in per-request output speed and boosted aggregate throughput from 548 tok/s to 2,165 tok/s .
GLM-5 (Flagship Efficiency)
GLM-5 is a massive 774B parameter flagship model. By optimizing the deployment topology and specializing the decode path, we enabled a single 8-GPU MI350X node to serve this model with a mean throughput of 151.1 tok/s and an inter-token latency (ITL) of just 17.8 ms .
The Economic Thesis
Beyond the technical achievement, these results represent a fundamental shift in the economics of frontier inference. Our work demonstrates that fully optimized AMD infrastructure can achieve elite performance levels while remaining more cost-effective than traditional flagship hardware deployments.
The takeaway is straightforward: inference performance is increasingly a systems problem. Delivering both high performance and sustainable economics requires a deep, custom approach to the software stack that maximizes every cycle of the underlying silicon.
***** Performance results are based on internal testing using the configurations described. Results in customer environments may vary depending on hardware, specific workload characteristics, implementation, and utilization.
Why “Out-of-the-Box” Software Leaves Performance on the Table
In our research, we define “stock” frameworks as unmodified, out-of-the-box versions of inference engines or standard kernel libraries. While these tools are the fastest way to get a model running, they carry several architectural taxes that can hinder frontier model performance.
The Generality Trade-off
Stock kernels are generically written to support many model shapes, which often leaves them unoptimized for the specific dimensions of frontier architectures.
Prefill Bias
Standard kernels are sized for large prefill batches and carry large-scale tile dimensions, register pressure and launch machinery calibrated for that regime. At single-stream decode, this is a fundamental mismatch—workload is memory-bandwidth-bound, yet the kernel continues to schedule compute resources at prefill scale, leaving the bulk of matrix cores idle.
The “Launch Tax”
Stock setups often dispatch operations like all-reduce, residual add, and RMSNorm as three separate kernels. This leads to microsecond-level administrative overhead for each call and forces unnecessary data “round-trips” to High Bandwidth Memory (HBM).
Rigid Software Constraints
Standard libraries frequently contain hard-coded assertions—such as requiring head counts to be multiples of 16 - that can cause immediate incompatibilities with the unique configurations of new frontier models.
Problem Primer: Defining the Levers of High-Speed Inference
To understand how these gains were achieved, we must understand the systems-level concepts that govern frontier model performance. Inference engineering at its core is about mastering the interactions between hardware execution, memory hierarchy, the software dispatch layer, and knowing precisely where each lever sits.
MXFP4 (Microscaling Formats)
MXFP4 is an open-standard 4-bit floating point format jointly developed by AMD, NVIDIA, Microsoft, and others under the OCP Microscaling specification [1] . Unlike per-tensor or per-channel quantization schemes, MXFP4 operates at the block level: a group of 16 or 32 values shares a single 8-bit scaling exponent, giving an effective storage cost of approximately 4.25 bits per weight rather than a clean 4.0.
This shared-exponent design is the key insight - it preserves the dynamic range needed for numerically sensitive operations like expert routing and attention projection, while still achieving the memory footprint of a 4-bit format.
Compression
BF16 costs 16 bits per weight. MXFP4 at 4.25 bits is a ~3.8× reduction. For a model with hundreds of billions of parameters distributed across routed expert FFN layers, this is the difference between requiring multi-node serving and fitting comfortably on a single 8×GPU node. For GLM-5 at 774B parameters, the majority of parameters reside in expert weights that are...
Excerpt shown — open the source for the full document.