DigitalOcean (GradientAI), published Feb 19, 2026

DigitalOcean Gradient™ AI GPU Droplets Optimized for Inference: Increasing Throughput at Lower Cost

By Jason Peng and Hemasumanth Rasineni

Updated: February 19, 2026

Production-grade LLM inference demands more than just access to GPUs; it requires deep optimization across the entire serving stack, from quantization and attention kernels to memory management and parallelism strategies. Most teams deploying models like Llama 3.3 70B on vanilla configurations are leaving the majority of their hardware’s capability on the table: underutilized FLOPs, wasted memory bandwidth, and GPU hours spent waiting instead of computing.

To solve this, we built the Inference Optimized Image, a fully pre-configured OS image available on DigitalOcean’s GPU Droplets that layers speculative decoding, FP8 quantization, FlashAttention-3, paged attention, concurrent optimization, and prompt caching into a single deployable image. The result in our particular test: 143% higher throughput (2,000 vs. 823 tokens/second), 40.7% lower TTFT (187.9ms vs. 316.83ms), and a roughly 75% reduction in cost per million tokens ($1.472 vs. $5.80), all while running Llama 3.3 70B on 2 H100 GPUs instead of 4.
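As a quick sanity check, the headline percentages can be reproduced from the raw figures quoted above. This is only arithmetic on the published numbers; the helper names are ours, for illustration.

```python
def pct_increase(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def pct_reduction(new: float, old: float) -> float:
    """Relative reduction from `old` down to `new`, in percent."""
    return (old - new) / old * 100.0


# Figures quoted in this post (optimized image vs. baseline).
throughput_gain = pct_increase(2000.0, 823.0)   # tokens/second
ttft_reduction = pct_reduction(187.9, 316.83)   # milliseconds
cost_reduction = pct_reduction(1.472, 5.80)     # USD per million tokens

print(f"throughput: +{throughput_gain:.1f}%")
print(f"TTFT:       -{ttft_reduction:.1f}%")
print(f"cost:       -{cost_reduction:.1f}%")
```

The cost figure works out to just under 75%, which is why it is best read as an approximate reduction.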

In this post, we walk through the optimization stack, the engineering reasoning behind each layer, the benchmark methodology, and the test results showing these gains.

Prefill, Decode, and Why Optimization is Multiplicative

As we covered in our LLM Inference Benchmarking post, inference works in two distinct phases with fundamentally different computation characteristics. The prefill phase processes the entire input prompt through the model’s forward pass (self-attention, layer norms, feed-forward networks) and is compute-bound, with high arithmetic intensity (FLOPs per byte transferred). The decode phase generates tokens one at a time, loading the full model weights and KV cache from HBM for each token, making it strictly memory-bandwidth-bound.
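A back-of-the-envelope roofline estimate makes the prefill/decode split concrete. The sketch below is ours, not part of the benchmark: it assumes roughly 2 FLOPs per parameter per token, weights read from HBM once per forward pass, an H100 SXM memory bandwidth of about 3.35 TB/s, the 989 TFLOPS dense FP16 peak, and an illustrative 2,048-token prompt. KV cache traffic and attention FLOPs are ignored.

```python
PARAMS = 70e9                # approximate parameter count of Llama 3.3 70B
BYTES_PER_PARAM_FP16 = 2
PEAK_FLOPS = 989e12          # dense FP16 Tensor Core peak per H100
HBM_BANDWIDTH = 3.35e12      # bytes/second per H100 SXM (assumed spec value)


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and a single read of the
    weights per pass; KV cache reads are deliberately left out.
    """
    flops = 2 * PARAMS * tokens_per_pass
    bytes_moved = PARAMS * BYTES_PER_PARAM_FP16
    return flops / bytes_moved


# Intensity at which the GPU flips from bandwidth-bound to compute-bound.
ridge_point = PEAK_FLOPS / HBM_BANDWIDTH

for label, tokens in [("decode, 1 token", 1), ("prefill, 2048 tokens", 2048)]:
    ai = arithmetic_intensity(tokens)
    bound = "compute-bound" if ai > ridge_point else "memory-bandwidth-bound"
    print(f"{label}: {ai:.0f} FLOPs/byte -> {bound}")
```

Under these assumptions, single-token decode sits far below the ridge point while a long prefill sits far above it, which is the asymmetry the rest of the stack is built around.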

This distinction matters because each optimization in our stack targets a specific bottleneck. Speculative decoding attacks the sequential nature of decode. FP8 quantization reduces memory footprint and accelerates compute via higher-throughput Tensor Cores. FlashAttention-3 optimizes the prefill-heavy attention computation. Paged attention improves memory utilization under concurrent load. The gains are multiplicative: each layer compounds on the others because they address orthogonal bottlenecks in the inference pipeline.

The Optimization Stack

Speculative Decoding

Standard autoregressive generation is inherently sequential: each token requires a full forward pass through the 70B model, and at decode time, that forward pass is dominated by memory bandwidth. The GPU spends more time moving weights from HBM to compute cores than performing matrix multiplications.

Speculative decoding breaks this pattern with a small, fast draft model that proposes multiple candidate tokens in parallel. The full 70B target model then verifies these candidates in a single forward pass, and because verifying N tokens costs roughly the same as generating one, accepted speculations yield multiple tokens for roughly the cost of a single target-model pass.

The engineering challenge is draft model selection: fast enough that overhead doesn’t eat the gains, accurate enough that proposals are frequently accepted. We tuned the draft-target pair specifically for the Llama 3.3 70B architecture, optimizing acceptance rates across our benchmark workload distributions. Speculative decoding is the single largest contributor to both the throughput improvement and TTFT reduction in our stack.
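The propose-then-verify loop can be sketched in a few lines. This is a toy illustration of greedy speculative decoding, not the implementation in the image: `target_next` and `draft_next` are hypothetical stand-ins for the two models, tokens are plain integers, and the "single forward pass" of the target is simulated with a loop that is counted as one pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large model's greedy next-token choice."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(prompt: List[int], n_new: int, target: Model) -> List[int]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(
    prompt: List[int], n_new: int, k: int, draft: Model, target: Model
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The draft model proposes k candidate tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target scores every candidate position. A real engine does
        #    this in ONE batched forward pass; here it is a loop counted once.
        passes += 1
        verified = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the longest agreeing prefix, then one token from the target.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(proposal[:n_ok])
        out.append(verified[n_ok])
    return out[: len(prompt) + n_new], passes


prompt = [1, 2, 3]
spec_tokens, target_passes = speculative_decode(prompt, 16, 4, draft_next, target_next)
baseline = greedy_decode(prompt, 16, target_next)
print("same output as plain greedy decoding:", spec_tokens == baseline)
print("target passes:", target_passes, "vs. 16 for the baseline")
```

Because every kept token is either confirmed or produced by the target, the output matches plain greedy decoding; the draft's acceptance rate only changes how many target passes are needed.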

FP8 Quantization

Running Llama 3.3 70B at FP16 precision requires approximately 140GB of GPU memory for weights alone, nearly two H100s’ worth of HBM3, before allocating anything for KV cache, activations, or batch state. This is why vanilla deployments typically need 4 H100 GPUs.

FP8 quantization halves the memory footprint by representing weights in 8-bit floating point (E4M3), compressing the 70B model to ~70GB and making it feasible to serve on 2 H100 GPUs with TP=2. But the benefit goes beyond memory: H100 FP8 Tensor Cores deliver 2× the peak FLOPS compared to FP16: 1,979 TFLOPS vs. 989 TFLOPS. Quantization doesn’t just let us fit the model on fewer GPUs; it makes each GPU compute faster on every forward pass.

The practical result: FP8 enables the 2×H100 configuration that underpins the 75% cost reduction: half the GPUs, each running faster.
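The memory arithmetic behind the GPU count is easy to check. The sketch below assumes 70 billion parameters and 80GB of HBM per H100, counts weights only, and uses helper names of our own choosing; whatever remains is the headroom left for KV cache, activations, and batch state.

```python
PARAMS = 70e9        # approximate parameter count of Llama 3.3 70B
H100_HBM_GB = 80     # HBM capacity per H100 (assumed 80GB variant)


def weight_memory_gb(bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB."""
    return PARAMS * bytes_per_param / 1e9


def headroom_gb(bytes_per_param: float, num_gpus: int) -> float:
    """HBM left over for KV cache and activations after loading the weights."""
    return num_gpus * H100_HBM_GB - weight_memory_gb(bytes_per_param)


for name, bytes_per_param, num_gpus in [("FP16", 2, 2), ("FP16", 2, 4), ("FP8", 1, 2)]:
    print(
        f"{name} on {num_gpus}x H100: "
        f"weights {weight_memory_gb(bytes_per_param):.0f}GB, "
        f"headroom {headroom_gb(bytes_per_param, num_gpus):.0f}GB"
    )
```

FP16 weights technically fit on two GPUs, but with very little left for the KV cache, which is why the FP16 baseline is typically spread over four.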

FlashAttention-3 and Paged Attention

Attention computation scales quadratically with sequence length, making it a major bottleneck for longer prompts and generations. Two complementary optimizations address this.

FlashAttention-3 restructures attention to minimize HBM reads and writes. Instead of materializing the full N×N attention matrix in GPU memory, it tiles the computation so attention scores are computed and consumed in SRAM (fast on-chip memory) without ever writing the full matrix to HBM. On H100s, it also exploits the TMA (Tensor Memory Accelerator) unit to overlap data movement with computation, a capability introduced with the Hopper architecture.
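The tiling idea rests on an online softmax: scores are processed block by block while a running maximum and running sum are rescaled, so the full score row never has to exist at once. The pure-Python sketch below shows that idea for a single query vector. It is an illustration of the math only, under our own naming, and says nothing about the actual GPU kernel, TMA usage, or block sizes used in FlashAttention-3.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Materializes every score before the softmax (the memory-hungry way)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(
    q: Vector, keys: List[Vector], values: List[Vector], block_size: int = 2
) -> Vector:
    """Processes keys/values one block at a time with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start : start + block_size]
        v_blk = values[start : start + block_size]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [
            a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
            for d, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))
```

Only one block of scores is alive at any time in the tiled version, which is the property that lets a real kernel keep them in on-chip SRAM.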

Paged attention tackles KV cache memory management. In vanilla setups, KV cache is pre-allocated contiguously per request, causing fragmentation: short sequences hold reserved memory that longer sequences need. Paged attention borrows the virtual memory concept from operating systems, managing KV cache in fixed-size blocks allocated on demand. This dramatically improves memory utilization under variable-length concurrent workloads.
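The bookkeeping side of this can be sketched as a small block allocator. The class below is a hypothetical toy, not the serving engine's data structure: it tracks only which physical blocks each request owns (its block table), and the block count and block size are illustrative values.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block allocator: maps each request to the physical blocks it owns."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(5):       # short request: 5 tokens
    cache.append_token("short")
for _ in range(40):      # long request: 40 tokens
    cache.append_token("long")

print("short owns", len(cache.block_tables["short"]), "block(s)")
print("long owns", len(cache.block_tables["long"]), "block(s)")
print("free blocks:", len(cache.free_blocks))
cache.release("short")
print("free after release:", len(cache.free_blocks))
```

The short request holds a single block rather than a worst-case contiguous reservation, and its block returns to the pool as soon as it finishes, so a longer request can pick it up.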

Together, these optimizations let us serve more concurrent requests within the same memory envelope, keeping throughput high and TTFT stable as concurrency scales from 1 to 16 users.

Concurrent Optimization

This is perhaps the least intuitive optimization in our stack, but it delivers some of the most dramatic gains for multi-model deployments.

Concurrent optimization means running multiple instances of the same model in parallel on the same hardware, rather than the typical vLLM approach of one model instance occupying all available GPUs. The insight is rooted in GPU utilization patterns: a single 70B model split across 8 GPUs using TP=8 often can’t fully saturate the memory…
