DigitalOcean (GradientAI), published Jan 13, 2026

Technical Deep Dive: How DigitalOcean and AMD Delivered a 2x Production Inference Performance Increase for Character.ai

By Piyush Srivastava and Karnik Modi

Published: January 13, 2026 13 min read

Background: How Character.ai worked with DigitalOcean and AMD to optimize performance

Character.ai, a leading AI entertainment platform with about 20 million users worldwide, wanted to optimize GPU performance and lower inference costs for its application, which requires low latency at large scale. To achieve this, the company approached DigitalOcean and AMD. Working closely together, the Character.ai, AMD, and DigitalOcean teams optimized the AMD Instinct™ MI300X and MI325X GPU platforms, resulting in a 2x increase in production inference throughput. In optimized configurations, DigitalOcean delivered high request density per node while maintaining strong p90 responsiveness for the first token and sustained token generation throughput, outperforming prior deployments on generic, non-optimized GPU infrastructure.

These gains were achieved through platform-level optimizations, including clever parallelization strategies for large Mixture-of-Experts models, efficient FP8 execution paths, optimized kernels with AITER, topology-aware GPU allocation, and production-ready Kubernetes orchestration through DigitalOcean Kubernetes (DOKS). Together, these capabilities allowed Character.ai to scale inference predictably without increasing operational burden. In this post, we will explore the specific orchestration and tuning strategies that made these gains possible.

Technical deep dive overview

Character.ai uses multiple models, including Qwen and Mistral, to power its applications. This document focuses on how we optimized the Qwen3-235B Instruct FP8 model on a DigitalOcean cluster featuring AMD Instinct GPUs. The workload was migrated from a generic, non-optimized setup on other providers to the AMD Instinct™ MI325X platform on DigitalOcean. With the optimizations outlined here, we achieved up to a 2x improvement in request throughput (QPS) under strict latency and concurrency constraints. The Character.ai team has a demanding workload, but through deep collaboration with the customer and AMD, we achieved an outcome that exceeded Character.ai's expectations and resulted in a multi-year, eight-figure annual agreement with DigitalOcean for GPU infrastructure.

Our starting objective was to optimize Qwen3-235B on AMD Instinct™ GPUs for a 5600 / 140 (ISL / OSL) workload, that is, an input sequence length of 5600 tokens and an output sequence length of 140 tokens per request. The primary goal was to maximize request throughput (QPS) per 8x MI325X GPU server while keeping the p90 time to first token (TTFT) and time per output token (TPOT) under a defined upper bound. Once all the optimizations were done, we landed on ~2x QPS per 8x MI325X server compared to a generic setup on other providers.
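To make the objective concrete, here is a minimal sketch of how "highest QPS that still meets the p90 bounds" could be evaluated from load-test samples. The function names, the latency bounds, the nearest-rank percentile method, and the TPOT formula (decode time divided by the number of tokens after the first) are illustrative assumptions, not details taken from the Character.ai benchmark.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def tpot(e2e_latency_s, ttft_s, output_tokens):
    """Time per output token, using a common definition (an assumption here):
    decode time spread over the tokens generated after the first one."""
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)

def meets_slo(ttft_samples, tpot_samples, ttft_bound_s, tpot_bound_s):
    """True when both p90 latencies are within their (hypothetical) bounds."""
    return (percentile(ttft_samples, 90) <= ttft_bound_s
            and percentile(tpot_samples, 90) <= tpot_bound_s)

def max_qps_within_slo(runs, ttft_bound_s, tpot_bound_s):
    """runs: list of (offered_qps, ttft_samples, tpot_samples), one per load test.
    Returns the highest offered QPS whose p90 latencies stay within bounds."""
    passing = [qps for qps, ttft, tpot_s in runs
               if meets_slo(ttft, tpot_s, ttft_bound_s, tpot_bound_s)]
    return max(passing) if passing else None
```

With this framing, a 2x improvement means the highest passing QPS per server doubles while the same latency bounds still hold.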

We discuss the optimizations in the following section. Before we get deep into the technical weeds, however, it is worth defining a few terms used extensively below.

Distributed Serving

This technique runs multiple model replicas, both within a single node and across nodes in the cluster, and routes each incoming request to one independent replica. Replicas do not share weights or KV cache. Routing in distributed inference systems is usually based on several heuristics, such as current load and prefix cache awareness. At the cluster level, implementations like Character.ai's use persistent user sessions so that follow-up requests from the same user land on the same replica, which maximizes the KV cache hit rate.
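The routing idea can be sketched as a small session-affinity router. This is an illustration only: the `ReplicaRouter` class, the hash-based home replica, and the `max_inflight` spillover rule are hypothetical and do not describe Character.ai's actual router.

```python
import hashlib

class ReplicaRouter:
    """Keeps a user session on one replica so its KV cache prefix can be reused,
    and spills to the least-loaded replica when the pinned one is saturated."""

    def __init__(self, replicas, max_inflight):
        self.replicas = list(replicas)
        self.max_inflight = max_inflight          # hypothetical per-replica cap
        self.inflight = {r: 0 for r in self.replicas}
        self.sessions = {}                        # session_id -> pinned replica

    def _home(self, session_id):
        # Stable hash so a new session gets a deterministic starting replica.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        return self.replicas[int.from_bytes(digest[:8], "big") % len(self.replicas)]

    def route(self, session_id):
        target = self.sessions.get(session_id) or self._home(session_id)
        if self.inflight[target] >= self.max_inflight:
            # Load heuristic: give up cache affinity rather than queue behind a hot replica.
            target = min(self.replicas, key=lambda r: self.inflight[r])
        self.sessions[session_id] = target        # session is re-pinned after a spill
        self.inflight[target] += 1
        return target

    def finish(self, replica):
        """Call when a request completes to release its slot."""
        self.inflight[replica] -= 1
```

The trade-off in this sketch is explicit: affinity is preferred because it preserves cached prefixes, but load takes priority once a replica reaches its cap.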

Tensor Parallelism (TP)

Tensor Parallelism horizontally slices the model's layers, or tensors, across several GPUs. Every GPU works on the same layers at the same time, each computing a fraction of the layer's output. This technique makes it possible to run models that don't fit in a single GPU's memory. However, it requires the GPUs in the TP group to be connected over a high-speed link, so it is primarily designed for data center grade GPUs.
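A toy example may help show why each GPU can compute "a fraction of the layer's output". The sketch below splits a weight matrix by columns, one common way to shard a linear layer; the source does not say which split vLLM applies to each layer, so treat the column split and the helper names as assumptions. Plain Python lists stand in for GPU tensors.

```python
def matmul(x, w):
    """x: [rows][k], w: [k][cols] -> [rows][cols]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Slice the weight matrix into `parts` equal column blocks, one per GPU."""
    cols = len(w[0])
    if cols % parts:
        raise ValueError("column count must be divisible by the TP size")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def tensor_parallel_matmul(x, w, tp_size):
    shards = split_columns(w, tp_size)
    # Each partial product would run on its own GPU, all at the same time.
    partials = [matmul(x, shard) for shard in shards]
    # Stand-in for the all-gather over the GPU interconnect: concatenate column slices.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]
```

The final concatenation is the step that needs the high-speed link: every layer ends with a collective exchange among all GPUs in the TP group.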

Expert Parallelism (EP)

Expert Parallelism (EP) is used for Mixture of Experts (MoE) models to distribute experts across multiple GPUs rather than duplicating them on every GPU. Tokens are routed to the specific GPUs that hold the relevant experts. Because MoE models activate only a few experts per token (sparse activation), they need far less compute per token than a dense model with the same parameter count.
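The token routing described above can be sketched as a dispatch plan. The contiguous expert-to-GPU placement, the function names, and the expert counts in the usage note are illustrative assumptions, not the layout vLLM or Qwen3 actually uses.

```python
from collections import defaultdict

def expert_owner(expert_id, num_experts, ep_size):
    """GPU rank that holds an expert, assuming contiguous blocks of experts per GPU."""
    per_gpu = num_experts // ep_size
    return expert_id // per_gpu

def dispatch(token_experts, num_experts, ep_size):
    """token_experts[i] lists the expert ids the MoE router picked for token i.
    Returns {gpu_rank: [(token_index, expert_id), ...]}, i.e. what each GPU must process."""
    if num_experts % ep_size:
        raise ValueError("expert count must be divisible by the EP size")
    plan = defaultdict(list)
    for token_index, experts in enumerate(token_experts):
        for expert_id in experts:
            rank = expert_owner(expert_id, num_experts, ep_size)
            plan[rank].append((token_index, expert_id))
    return dict(plan)
```

For example, with 16 experts over 8 GPUs, `dispatch([[0, 5], [1, 15], [4, 5]], 16, 8)` sends work only to ranks 0, 2, and 7. The other GPUs receive nothing for this batch, which is why load balance across experts matters in EP deployments.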

AITER

AITER (AI Tensor Engine for ROCm) is AMD's centralized library of high-performance AI operators, designed to accelerate machine learning workloads on AMD Instinct GPUs. It is available as a GitHub repository. AITER provides a unified platform where developers can access optimized kernels (built on technologies like Triton, Composable Kernel (CK), and Assembly) and integrate them into frameworks like PyTorch and JAX to maximize hardware efficiency.

Technical optimizations

DP1 / TP8 / EP8 with AITER

Character.ai runs models using vLLM. Since this was the first time the team was using AMD Instinct GPUs, it was critical that they could migrate their software tooling to AMD Instinct GPUs without significant effort. AMD has contributed extensive ROCm support to upstream vLLM, which makes porting CUDA applications to ROCm almost fully compatible for generic, off-the-shelf open source models. When we ran the Qwen3 model using a vLLM image with ROCm support, we hit some initial hiccups, such as memory access faults during model loading and compatibility issues between TP, EP, and AITER. Through close technical collaboration and targeted fixes upstream, these issues were resolved, resulting in a stable and performant vLLM configuration for DP1 / TP8 / EP8 with AITER.

vLLM recipe for DP1 / TP8 / EP8 with AITER

    VLLM_USE_V1=1 SAFETENSORS_FAST_GPU=1 \
    VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1 \
    VLLM_USE_TRITON_FLASH_ATTN=0 \
    vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
      --tensor-parallel-size 8 \
      --enable-expert-parallel \
      --kv-cache-dtype fp8 \
      --quantization fp8 \
      --distributed-executor-backend mp \
      --compilation-config '{"full_cuda_graph":false, "max_capture_size": 32768}' \
      --trust-remote-code \
      --disable-log-requests \
      …
