Heaps do lie: debugging a memory leak in vLLM.
Mistral AI Engineering | January 21, 2026 | By Mathis Felardos
A few months ago, our team investigated a suspected memory leak in vLLM. At first, we thought the issue would be easy to spot, something confined to the upper layers of the codebase. But the deeper we looked, the more complex it became. This article kicks off our new Engineering Deep Dive series, where we’ll share how we tackle technical investigations and build solutions at Mistral AI.
The issue first appeared during pre-production testing of disaggregated serving with one of our frontier models. Memory usage was climbing steadily, but only under specific conditions: with vLLM, with our model Mistral Medium 3.1, and with graph compilation enabled. There were no crashes or errors, just a slow linear increase in system memory of 400 MB per minute on production-like traffic. After a few hours, that would lead to an "out of memory" state.
What followed was a methodical hunt, starting from high-level Python tools and descending into kernel-level tracing, until we finally uncovered the true source. Here’s how we tracked it down, and what it revealed about the hidden risks of dependency layers in today’s software.
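To put the 400 MB per minute figure in perspective, here is a small back-of-the-envelope sketch. The leak rate comes from the article; the amount of free memory is a hypothetical value chosen only for illustration.

```python
def hours_until_exhausted(headroom_mb: float, leak_mb_per_min: float) -> float:
    """Hours before a linear leak consumes the given memory headroom."""
    if leak_mb_per_min <= 0:
        raise ValueError("leak rate must be positive")
    return headroom_mb / leak_mb_per_min / 60


LEAK_MB_PER_MIN = 400.0   # rate reported in the article
HEADROOM_MB = 96_000.0    # hypothetical free system memory, not from the article

leak_gb_per_hour = LEAK_MB_PER_MIN * 60 / 1000
hours_left = hours_until_exhausted(HEADROOM_MB, LEAK_MB_PER_MIN)
print(f"{leak_gb_per_hour:.0f} GB/hour, about {hours_left:.1f} hours of headroom")
```

At this rate the process gains 24 GB every hour, which is consistent with an "out of memory" state after a few hours on a machine with headroom of that order.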
Not the kind of upward-trending graph we like to see on Grafana.
A leak that played hide and seek.
Initially, our approach followed a standard troubleshooting path: we aimed to isolate the source of the leak by replicating the issue on a smaller model, with fewer production optimizations activated. But after trying different settings and models, we couldn’t reproduce it on any other setup. The leak appeared only on a Prefill/Decode disaggregated setup with NIXL.
Given the central role Prefill/Decode (P/D) disaggregation plays in our story, let us walk you through the high-level mechanisms of how this inference setup works. P/D disaggregation splits the processing of a query into two phases, processed by different instances:
First, the router sends a “prefill request” (by setting max_tokens=1 and by setting an empty set of KV Transfer metadata) to a prefill vLLM instance to compute the KVCache of the request.
On completion, the router transfers KVCache metadata alongside a “decode request” to a decode vLLM instance.
KVCache transfer is initiated through NIXL and token generation happens on the decode vLLM instance by using and extending the transferred KVCache.
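The three steps above can be sketched as a toy router. This is a minimal illustration under stated assumptions, not vLLM's or NIXL's actual API: the payload field name `kv_transfer_metadata`, the helper functions, and the stub instances are all hypothetical. Only `max_tokens=1` and the empty metadata on the prefill request come from the description above.

```python
from typing import Any, Callable, Dict

Request = Dict[str, Any]
Response = Dict[str, Any]


def build_prefill_request(prompt: str) -> Request:
    # Step 1: one token and empty KV transfer metadata, so the prefill
    # instance only computes the KVCache for the prompt.
    return {"prompt": prompt, "max_tokens": 1, "kv_transfer_metadata": {}}


def build_decode_request(prompt: str, max_tokens: int, kv_metadata: Dict[str, Any]) -> Request:
    # Step 2: forward the metadata returned by prefill with the decode request.
    return {"prompt": prompt, "max_tokens": max_tokens, "kv_transfer_metadata": dict(kv_metadata)}


def route(prompt: str, max_tokens: int,
          send_prefill: Callable[[Request], Response],
          send_decode: Callable[[Request], Response]) -> Response:
    prefill_response = send_prefill(build_prefill_request(prompt))
    kv_metadata = prefill_response.get("kv_transfer_metadata", {})
    # Step 3 happens inside the decode instance: it pulls the KVCache
    # (through NIXL in the article's setup) and then generates tokens.
    return send_decode(build_decode_request(prompt, max_tokens, kv_metadata))


# Stub instances standing in for the real prefill and decode servers.
def fake_prefill(request: Request) -> Response:
    return {"kv_transfer_metadata": {"remote_blocks": [0, 1, 2]}}  # made-up metadata


def fake_decode(request: Request) -> Response:
    return {"received": request}


result = route("Hello", 64, fake_prefill, fake_decode)
print(result["received"]["kv_transfer_metadata"])
```

The decode side is the only one that receives and extends a transferred KVCache, which matches where the leak was observed.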
The leak was only observed on the decode side of this disaggregated setup, strongly suggesting that KVCache transfer through NIXL was the root cause of the leak. In our setup, NIXL relies on UCX (Unified Communication X), a high-performance communication library designed for data exchange in distributed systems. UCX enables optimized data transfer over a large set of technologies, including InfiniBand, a low-latency, high-throughput interconnect technology commonly used in HPC and data centers.
Overview of a P/D disaggregated serving deployment.
For the remainder of our investigation, we worked in this setup and started with Python memory profiling tools to pinpoint the source of the leak. We tried Memray and Guppy 3, but neither showed a leak: everything they let us observe looked normal. Attempting to use GDB made the entire process crash. Our vLLM setup was also too heavy for tools like Valgrind, which made it impractically slow or even impossible to use. It was clear that a more powerful tool was required to track the leak.
But before investing more time, we wanted to be sure that others could reproduce this leak. We reached out to the vLLM team by opening an issue on their GitHub repository, which helped confirm that we weren't the only ones seeing this issue and that a deeper investigation was warranted.
Counting mallocs and frees with Heaptrack.
To better track what was happening, we turned to Heaptrack: a memory profiler that overrides memory operations like malloc or free and records these events alongside stack traces. Milian Wolff, the creator of Heaptrack, has written an excellent introductory blog post to get you started with the tool. It’s a two-step process: first run the program with tracing, then interpret the data dumps.
To track the allocations confined to the vLLM worker process, we set LD_PRELOAD to libheaptrack_preload.so through vLLM. This ensures the library is loaded before any other and overrides the behavior of the memory-allocating functions, providing us with the data dump. We were then able to visualize this data through heaptrack_interpret:
$ git clone https://github.com/KDE/heaptrack.git
$ cd heaptrack && mkdir build && cd build && cmake .. && sudo make install
# Setting LD_PRELOAD=/path/to/libheaptrack_preload.so creates a temporary file named heaptrack.<pid>; here the pid is 2028233
$ /usr/local/lib/heaptrack/libexec/heaptrack_interpret < heaptrack.2028233 | gzip > heaptrack.vllm.2028233.gz
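The article does not show exactly how LD_PRELOAD was passed to the vLLM worker. As a minimal sketch, assuming the worker is launched as a child process from Python, the environment could be prepared as follows. The helper name is hypothetical and the library path is the placeholder used above.

```python
import os
from typing import Dict, Mapping, Optional


def env_with_preload(lib_path: str, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon-separated list; putting our library
    # first lets its malloc/free wrappers take precedence over later entries.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


env = env_with_preload("/path/to/libheaptrack_preload.so", base_env={"PATH": "/usr/bin"})
print(env["LD_PRELOAD"])
# The child process would then be started with this environment, e.g.
# subprocess.Popen(launch_command, env=env), where launch_command is
# whatever command starts the worker in your deployment.
```

Building the environment explicitly, rather than exporting LD_PRELOAD globally, keeps the profiler out of unrelated processes such as the shell or the launcher itself.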
Heaptrack provides a detailed, interactive graph of all heap allocations, down to the function level. We can track every malloc and free, with a clear breakdown of the memory usage.
The memory usage of our vLLM worker, as shown by Heaptrack.
At this point, one might ask: where is the memory leak? Indeed, the only visible memory increase was due to a lazy NIXL initialization. To validate that the leak was indeed happening in this setup, we ran a vLLM benchmark and created two Heaptrack snapshots using heaptrack_interpret: one at the beginning and one near the end. Although the heap memory itself remained stable, the peak resident memory (RSS), which we will cover in the next section, differed between the two snapshots. This discrepancy was visible in Heaptrack’s summary tab.
Peak RSS discrepancy in Heaptrack: before (1) and after (2) the benchmark.
This meant the leak was happening outside the heap, and therefore outside the memory that Heaptrack analyzes. We needed to change tools to track allocations outside of the heap.
Beyond the heap: understanding resident memory and system allocations.
To understand why Heaptrack couldn’t detect the leak, we first need to clarify what the Resident Set Size (RSS) actually includes. RSS represents the portion of a process’s memory held in RAM, and...
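On Linux, the resident memory discussed here can be observed independently of any heap profiler by reading /proc/<pid>/status, which exposes the current RSS as VmRSS and the peak RSS as VmHWM. The sketch below is not the tooling used in the investigation, only an illustration of where those numbers come from; the sample text is made up.

```python
from pathlib import Path
from typing import Dict


def parse_status_kb(status_text: str) -> Dict[str, int]:
    """Extract the kB-valued fields (VmRSS, VmHWM, ...) from /proc/<pid>/status text."""
    fields: Dict[str, int] = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        parts = value.split()
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


# Made-up sample in the format of /proc/<pid>/status.
SAMPLE = "Name:\tpython3\nVmHWM:\t  204800 kB\nVmRSS:\t  102400 kB\nThreads:\t4\n"
sample_fields = parse_status_kb(SAMPLE)
print(sample_fields)

status_file = Path("/proc/self/status")
if status_file.exists():  # only available on Linux
    live = parse_status_kb(status_file.read_text())
    print("current RSS (kB):", live.get("VmRSS"), "peak RSS (kB):", live.get("VmHWM"))
```

Sampling VmRSS periodically while heap usage stays flat is one simple way to confirm that growth is happening outside malloc-managed memory.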