NVIDIA/gdrcopy
C
Description: A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
Language: C
License: MIT
Stars: 1383
Forks: 189
Open issues: 73
Created: 2014-12-08T20:46:42Z
Pushed: 2026-06-13T00:07:43Z
Default branch: master
Fork: no
Archived: no
README:
GDRCopy
A low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology.
Introduction
While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, it is possible to use these same APIs to create perfectly valid CPU mappings of the GPU memory.
The advantage of a CPU driven copy is the very small overhead involved. That might be useful when low latencies are required.
What is inside
GDRCopy offers the infrastructure to create user-space mappings of GPU memory, which can then be manipulated as if it were plain host memory (caveats apply here).
A simple by-product of it is a copy library with the following characteristics:
- Very low overhead, as it is driven by the CPU. As a reference, a cudaMemcpy can currently incur a 6-7 us overhead.
- An initial memory *pinning* phase is required, which is potentially expensive: 10 us to 1 ms depending on the buffer size.
- Fast H-D, because of write-combining. H-D bandwidth is 6-8 GB/s on Ivy Bridge Xeon, but it is subject to NUMA effects.
- Slow D-H, because the GPU BAR, which backs the mappings, can't be prefetched, so burst read transactions are not generated through PCIe.
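Because pinning is a one-time cost while the per-call saving accrues on every copy, a rough break-even count can be estimated. The sketch below is only arithmetic and is not part of GDRCopy: the function name is hypothetical, and the per-copy GDRCopy overhead is a parameter the caller must measure, since the text above does not state it.

```c
#include <stdint.h>

/*
 * Number of copies after which a one-time pin cost is amortized.
 * All arguments are in nanoseconds.
 * Returns 0 when there is no per-copy saving, i.e. the pin cost is never
 * recovered.
 */
uint64_t breakeven_copies(uint64_t pin_cost_ns,
                          uint64_t memcpy_overhead_ns,
                          uint64_t gdr_overhead_ns)
{
    if (memcpy_overhead_ns <= gdr_overhead_ns)
        return 0; /* no saving per copy */

    uint64_t saving_ns = memcpy_overhead_ns - gdr_overhead_ns;

    /* Ceiling division: round up to a whole number of copies. */
    return (pin_cost_ns + saving_ns - 1) / saving_ns;
}
```

For example, with a 1 ms pin, a 6.5 us cudaMemcpy overhead, and an assumed (not measured) 0.5 us GDRCopy overhead, breakeven_copies(1000000, 6500, 500) returns 167, so the mapping would need to be reused for roughly that many small copies before the pin pays off.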
The library comes with a few tests like:
- gdrcopy_sanity, which contains unit tests for the library and the driver.
- gdrcopy_copybw, a minimal application which calculates the R/W bandwidth for a specific buffer size.
- gdrcopy_copylat, a benchmark application which calculates the R/W copy latency for a range of buffer sizes.
- gdrcopy_apiperf, an application for benchmarking the latency of each GDRCopy API call.
- gdrcopy_pplat, a benchmark application which calculates the round-trip ping-pong latency between GPU and CPU.
Requirements
GPUDirect RDMA requires an NVIDIA Data Center GPU or NVIDIA RTX GPU (formerly Tesla and Quadro) based on Kepler or newer generations, see GPUDirect RDMA. For more general information, please refer to the official GPUDirect RDMA design document.
The device driver requires GPU display driver >= 418.40 on ppc64le and >= 331.14 on other platforms. The library and tests require CUDA >= 6.0.
DKMS is a prerequisite for installing the GDRCopy kernel module package. On RHEL or SLE, however, users have the option to build a kmod package and install it instead of the DKMS package. See the [Build and installation](#build-and-installation) section for more details.
    # On RHEL
    # dkms can be installed from epel-release. See https://fedoraproject.org/wiki/EPEL.
    $ sudo yum install dkms

    # On Debian - No additional dependency

    # On SLE / Leap
    # On SLE dkms can be installed from PackageHub.
    $ sudo zypper install dkms rpmbuild
CUDA and GPU display driver must be installed before building and/or installing GDRCopy. The installation instructions can be found in https://developer.nvidia.com/cuda-downloads.
GPU display driver header files are also required. They are installed as a part of the driver (or CUDA) installation with *runfile*. If you install the driver via package management, we suggest:
- On RHEL, sudo dnf module install nvidia-driver:latest-dkms.
- On Debian, sudo apt install nvidia-dkms-.
- On SLE, sudo zypper install nvidia-gfx-kmp.
The supported architectures are Linux x86\_64, ppc64le, and arm64. The supported platforms are RHEL8, RHEL9, Ubuntu20\_04, Ubuntu22\_04, SLE-15 (any SP) and Leap 15.x.
Root privileges are necessary to load/install the kernel-mode device driver.
DMA-BUF mmap backend
GDRCopy can export GPU memory as a Linux dma-buf via the CUDA driver and map it into user space with a plain mmap() on the dma-buf file descriptor. This backend does not require the GDRCopy kernel module (gdrdrv) and is intended for environments where gdrdrv is not installed or loaded.
Requirements
- CUDA driver 13.3 or newer
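One way to gate use of this backend is to compare the version reported by the CUDA driver API against the 13.3 minimum. The sketch below assumes the requirement refers to the value returned by cuDriverGetVersion, which encodes a version as 1000 * major + 10 * minor (so 13.3 would be 13030). The helper names are hypothetical, and the code takes the version as an argument rather than calling CUDA itself.

```c
/* Encode a CUDA version the way cuDriverGetVersion reports it. */
int cuda_encode_version(int major, int minor)
{
    return 1000 * major + 10 * minor;
}

/*
 * Returns 1 if the encoded driver version is at least 13.3, else 0.
 * Assumption: "CUDA driver 13.3" means the driver API version.
 */
int cuda_version_allows_dmabuf_mmap(int encoded_version)
{
    return encoded_version >= cuda_encode_version(13, 3);
}
```

An application would pass in the integer obtained from cuDriverGetVersion and skip GDRCOPY_USE_DMABUF_MMAP when the check returns 0.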
Backend selection
On gdr_open(), GDRCopy tries gdrdrv first. If gdrdrv is not installed or fails to open, it falls back to the dma-buf mmap backend (provided the driver supports it). To force the dma-buf backend even when gdrdrv is available, set the environment variable GDRCOPY_USE_DMABUF_MMAP=1 before calling gdr_open(). To check at runtime which backend is active:
    int using_dmabuf;
    gdr_get_attribute(g, GDR_ATTR_USING_DMA_BUF_MMAP, &using_dmabuf);
    // using_dmabuf != 0 -> dma-buf mmap backend is in use
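The selection order described in this section can be modeled as a small decision function. This is a sketch of the documented behavior, not the library's implementation: the enum and function names are hypothetical, and the outcome when dma-buf is forced on a driver without support is an assumption, since the README does not specify it.

```c
/* Hypothetical labels for this sketch; these are not GDRCopy identifiers. */
enum sketch_backend {
    SKETCH_BACKEND_NONE,   /* no usable backend: gdr_open() would fail */
    SKETCH_BACKEND_GDRDRV, /* kernel-module backend */
    SKETCH_BACKEND_DMABUF  /* dma-buf mmap backend */
};

/*
 * gdrdrv_usable:    gdrdrv is installed and opens successfully
 * dmabuf_supported: the CUDA driver supports dma-buf mmap
 * force_dmabuf:     GDRCOPY_USE_DMABUF_MMAP=1 is set
 */
enum sketch_backend sketch_select_backend(int gdrdrv_usable,
                                          int dmabuf_supported,
                                          int force_dmabuf)
{
    if (force_dmabuf) {
        /* Assumption: forcing dma-buf without driver support fails
         * instead of silently using gdrdrv. */
        return dmabuf_supported ? SKETCH_BACKEND_DMABUF
                                : SKETCH_BACKEND_NONE;
    }

    /* Default order: gdrdrv first, dma-buf mmap as the fallback. */
    if (gdrdrv_usable)
        return SKETCH_BACKEND_GDRDRV;

    return dmabuf_supported ? SKETCH_BACKEND_DMABUF : SKETCH_BACKEND_NONE;
}
```

In a real application the result should still be confirmed with GDR_ATTR_USING_DMA_BUF_MMAP, since only the library knows which backend it actually opened.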
Mapping type
CPU cacheability is decided by the CUDA driver at pin time and cannot be changed afterwards. The mapping type depends on the pin flag and on whether the platform is coherent.
| Pin flag | Coherent platform | Non-coherent platform |
|---------------------------|-----------------------------|-----------------------|
| Default | GDR_MAPPING_TYPE_CACHING | GDR_MAPPING_TYPE_WC |
| GDR_PIN_FLAG_FORCE_PCIE | GDR_MAPPING_TYPE_WC | GDR_MAPPING_TYPE_WC |
The dma-buf backend does not support user-requested mapping types: the type the pin produces is the only type gdr_map_v2 will accept. Passing an explicit cache flag (GDR_MAP_FLAG_REQ_CACHE_MAPPING, GDR_MAP_FLAG_REQ_WC_MAPPING, …) that asks for anything other than the default returns EINVAL.
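The table and the EINVAL rule can be read together as two small functions. The following is a sketch for illustration only: the enum values and function names are hypothetical stand-ins for the GDR_MAPPING_TYPE_* constants, and nothing here calls GDRCopy.

```c
#include <errno.h>

/* Hypothetical stand-ins for the mapping types named in the table. */
enum sketch_mapping {
    SKETCH_MAPPING_CACHING,
    SKETCH_MAPPING_WC
};

/* Mapping type produced at pin time, following the table. */
enum sketch_mapping sketch_pinned_mapping_type(int force_pcie,
                                               int coherent_platform)
{
    /* Only the default pin flag on a coherent platform yields caching. */
    if (!force_pcie && coherent_platform)
        return SKETCH_MAPPING_CACHING;
    return SKETCH_MAPPING_WC;
}

/*
 * Validate an explicit mapping request against the pinned type.
 * requested < 0 means "no explicit request" (use the default).
 * Returns 0 if the request is acceptable, EINVAL otherwise.
 */
int sketch_check_map_request(enum sketch_mapping pinned, int requested)
{
    if (requested < 0)
        return 0;
    return requested == (int)pinned ? 0 : EINVAL;
}
```

For instance, on a coherent platform with the default pin flag the pinned type is caching, so asking for a write-combined mapping would be rejected in this model.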
Behavior differences vs. the gdrdrv backend
- Persistent mappings. All dma-buf mappings are persistent; GDR_ATTR_USE_PERSISTENT_MAPPING always returns 1.
- One fd per pinned buffer. Each gdr_pin_buffer consumes one dma-buf file descriptor until gdr_unpin_buffer. Applications that pin many buffers should account for the process FD limit.
- No timing fields. gdr_get_info_v2 returns tm_cycles = 0 and cycles_per_ms = 0.
- No invalidation callback. gdr_get_callback_flag always returns 0.
- GDR API compatibility. Standard GDRCopy APIs remain unchanged.
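Since each pinned buffer holds one file descriptor under this backend, an application might check its FD budget before pinning in bulk. The sketch below uses the POSIX getrlimit call; the function name and the idea of passing in the number of descriptors already in use are hypothetical choices for illustration.

```c
#include <sys/resource.h>

/*
 * Returns 1 if `wanted` additional pinned buffers fit under the soft
 * RLIMIT_NOFILE limit given `fds_in_use` descriptors already open,
 * 0 if they do not fit, and -1 if the limit cannot be queried.
 */
int pins_fit_fd_limit(unsigned long fds_in_use, unsigned long wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    if (rl.rlim_cur == RLIM_INFINITY)
        return 1; /* no soft limit configured */

    if (fds_in_use > rl.rlim_cur)
        return 0; /* already over budget */

    return wanted <= rl.rlim_cur - fds_in_use;
}
```

If the check fails, the process can raise its soft limit (up to the hard limit) with setrlimit, or pin fewer, larger buffers.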
Build and installation
We provide three ways for building and installing GDRCopy.
rpm package
    # For RHEL:
    $ sudo yum groupinstall 'Development Tools'
    $ sudo yum install dkms rpm-build make

    # For SLE:
    $ sudo zypper in dkms rpmbuild

    $ cd packages
    $ CUDA= ./build-rpm-packages.sh
    $ sudo rpm -Uvh...