What does this repo signal mean?

NVIDIA published NVIDIA/gdrcopy (C). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo NVIDIA/gdrcopy · language C · Solid utility library with moderate GitHub stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals. It also maps to Infrastructure in the data-business radar.

NVIDIA Repo: NVIDIA/gdrcopy

Captured source

GitHub/github.com/NVIDIA/gdrcopy

NVIDIA/gdrcopy repository metadata

published Dec 8, 2014 · seen 1w · captured 1w · http 200 · method plain

NVIDIA/gdrcopy

Description: A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

Language: C

License: MIT

Stars: 1383

Forks: 189

Open issues: 73

Created: 2014-12-08T20:46:42Z

Pushed: 2026-06-13T00:07:43Z

Default branch: master

Fork: no

Archived: no

README:

GDRCopy

A low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology.

Introduction

While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, it is possible to use these same APIs to create perfectly valid CPU mappings of the GPU memory.

The advantage of a CPU driven copy is the very small overhead involved. That might be useful when low latencies are required.

What is inside

GDRCopy offers the infrastructure to create user-space mappings of GPU memory, which can then be manipulated as if they were plain host memory (caveats apply here).

A simple by-product of it is a copy library with the following characteristics:

Very low overhead, as it is driven by the CPU. As a reference, a cudaMemcpy can currently incur a 6-7us overhead.
An initial memory pinning phase is required, which is potentially expensive: 10us-1ms depending on the buffer size.
Fast H-D, because of write-combining. H-D bandwidth is 6-8GB/s on Ivy Bridge Xeon, but it is subject to NUMA effects.
Slow D-H, because the GPU BAR, which backs the mappings, can't be prefetched, so burst read transactions are not generated through PCIe.
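A rough way to reason about the trade-off between the one-time pinning cost and the per-copy saving is a break-even count. The sketch below assumes, purely for illustration, that each copy through a GDRCopy mapping avoids the whole cudaMemcpy overhead quoted above; the helper name `copies_to_amortize` and its nanosecond arguments are hypothetical and not part of GDRCopy.

```c
#include <stdint.h>

/* Hypothetical helper, not part of GDRCopy: number of copies needed before
 * the one-time pinning cost is repaid by the overhead saved on each copy.
 * Both arguments are in nanoseconds. Returns UINT64_MAX when nothing is
 * saved per copy, meaning the pinning cost is never repaid. */
uint64_t copies_to_amortize(uint64_t pin_cost_ns, uint64_t saved_per_copy_ns)
{
    if (saved_per_copy_ns == 0)
        return UINT64_MAX;
    /* Integer ceiling division: round up to a whole number of copies. */
    return (pin_cost_ns + saved_per_copy_ns - 1) / saved_per_copy_ns;
}
```

Under that assumption, a 1 ms pin with 6.5us saved per copy gives `copies_to_amortize(1000000, 6500) == 154`, while a 10us pin is repaid after 2 copies. Real savings depend on buffer size and direction, so measured values from gdrcopy_copylat are a better input than these quoted ranges.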

The library comes with a few tests like:

gdrcopy_sanity, which contains unit tests for the library and the driver.
gdrcopy_copybw, a minimal application which calculates the R/W bandwidth for a specific buffer size.
gdrcopy_copylat, a benchmark application which calculates the R/W copy latency for a range of buffer sizes.
gdrcopy_apiperf, an application for benchmarking the latency of each GDRCopy API call.
gdrcopy_pplat, a benchmark application which calculates the round-trip ping-pong latency between GPU and CPU.

Requirements

GPUDirect RDMA requires NVIDIA Data Center GPU or NVIDIA RTX GPU (formerly Tesla and Quadro) based on Kepler or newer generations, see GPUDirect RDMA. For more general information, please refer to the official GPUDirect RDMA design document.

The device driver requires GPU display driver >= 418.40 on ppc64le and >= 331.14 on other platforms. The library and tests require CUDA >= 6.0.

DKMS is a prerequisite for installing the GDRCopy kernel module package. On RHEL or SLE, however, users can instead build a kmod package and install it in place of the DKMS package. See the Build and installation section for more details.

# On RHEL
# dkms can be installed from epel-release. See https://fedoraproject.org/wiki/EPEL.
$ sudo yum install dkms

# On Debian - No additional dependency

# On SLE / Leap
# On SLE dkms can be installed from PackageHub.
$ sudo zypper install dkms rpmbuild

CUDA and GPU display driver must be installed before building and/or installing GDRCopy. The installation instructions can be found in https://developer.nvidia.com/cuda-downloads.

GPU display driver header files are also required. They are installed as part of the driver (or CUDA) installation with the runfile installer. If you install the driver via package management, we suggest:

On RHEL, sudo dnf module install nvidia-driver:latest-dkms.
On Debian, sudo apt install nvidia-dkms-.
On SLE, sudo zypper install nvidia-gfx-kmp.

The supported architectures are Linux x86_64, ppc64le, and arm64. The supported platforms are RHEL8, RHEL9, Ubuntu20_04, Ubuntu22_04, SLE-15 (any SP) and Leap 15.x.

Root privileges are necessary to load/install the kernel-mode device driver.

DMA-BUF mmap backend

GDRCopy can export GPU memory as a Linux dma-buf via the CUDA driver and map it into user space with a plain mmap() on the dma-buf file descriptor. This backend does not require the GDRCopy kernel module (gdrdrv) and is intended for environments where gdrdrv is not installed or loaded.

Requirements

CUDA driver 13.3 or newer

Backend selection

On gdr_open(), GDRCopy tries gdrdrv first. If gdrdrv is not installed or fails to open, it falls back to the dma-buf mmap backend (provided the driver supports it). To force the dma-buf backend even when gdrdrv is available, set the environment variable GDRCOPY_USE_DMABUF_MMAP=1 before calling gdr_open(). To check at runtime which backend is active:

int using_dmabuf;
gdr_get_attribute(g, GDR_ATTR_USING_DMA_BUF_MMAP, &using_dmabuf);
// using_dmabuf != 0 -> dma-buf mmap backend is in use
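Forcing the backend from inside the process, rather than from the shell, could look like the sketch below. It uses only POSIX `setenv`; the helper name `request_dmabuf_backend` is hypothetical, and the call to gdr_open() is left as a comment because it needs the GDRCopy library and a GPU.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical helper, not part of GDRCopy: request the dma-buf mmap
 * backend for this process. It has to run before gdr_open(), since that
 * is where the backend is selected. Returns 0 on success, -1 on failure
 * (errno is set by setenv). */
int request_dmabuf_backend(void)
{
    /* Third argument 1 = overwrite any value already in the environment. */
    return setenv("GDRCOPY_USE_DMABUF_MMAP", "1", 1);
}

/* Intended call order (sketch):
 *   request_dmabuf_backend();
 *   g = gdr_open();
 *   then query GDR_ATTR_USING_DMA_BUF_MMAP as shown above to confirm. */
```

Checking the attribute afterwards is still worthwhile, since the fallback only works when the driver supports dma-buf export.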

Mapping type

CPU cacheability is decided by the CUDA driver at pin time and cannot be changed afterwards. The mapping type depends on the pin flag and on whether the platform is coherent.

| Pin flag                | Coherent platform        | Non-coherent platform |
|-------------------------|--------------------------|-----------------------|
| Default                 | GDR_MAPPING_TYPE_CACHING | GDR_MAPPING_TYPE_WC   |
| GDR_PIN_FLAG_FORCE_PCIE | GDR_MAPPING_TYPE_WC      | GDR_MAPPING_TYPE_WC   |
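The table reduces to a single rule, which the sketch below encodes as a lookup. The enum and function names are local stand-ins invented for illustration; real code would use the GDR_MAPPING_TYPE_* and GDR_PIN_FLAG_* definitions from the GDRCopy headers.

```c
#include <stdbool.h>

/* Local stand-ins for illustration only, not GDRCopy definitions. */
typedef enum {
    SKETCH_MAPPING_CACHING, /* stands in for GDR_MAPPING_TYPE_CACHING */
    SKETCH_MAPPING_WC       /* stands in for GDR_MAPPING_TYPE_WC */
} sketch_mapping_t;

/* Mirrors the table: only the default pin flag on a coherent platform
 * yields a caching mapping; every other combination is write-combining. */
sketch_mapping_t expected_mapping(bool force_pcie, bool coherent_platform)
{
    if (!force_pcie && coherent_platform)
        return SKETCH_MAPPING_CACHING;
    return SKETCH_MAPPING_WC;
}
```

A caller could use such a lookup to decide which mapping type to expect before mapping, instead of requesting a type the pin did not produce.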

The dmabuf backend does not support user-requested mapping types: the type the pin produces is the only type gdr_map_v2 will accept. Passing an explicit cache flag (GDR_MAP_FLAG_REQ_CACHE_MAPPING, GDR_MAP_FLAG_REQ_WC_MAPPING, …) that asks for anything other than the default returns EINVAL.

Behavior differences vs. the gdrdrv backend

Persistent mappings. All dma-buf mappings are persistent; GDR_ATTR_USE_PERSISTENT_MAPPING always returns 1.
One fd per pinned buffer. Each gdr_pin_buffer consumes one dma-buf file descriptor until gdr_unpin_buffer. Applications that pin many buffers should account for the process FD limit.
No timing fields. gdr_get_info_v2 returns tm_cycles = 0 and cycles_per_ms = 0.
No invalidation callback. gdr_get_callback_flag always returns 0.
GDR API compatibility. Standard GDRCopy APIs remain unchanged.
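Because each pinned buffer holds one file descriptor under this backend, an application might check its descriptor budget before pinning in bulk. The sketch below uses the POSIX `getrlimit` call; the helper names `pins_fit` and `soft_fd_limit` are hypothetical, and the count of descriptors already in use is assumed to be supplied by the caller.

```c
#include <stdint.h>
#include <sys/resource.h>

/* Hypothetical helper: do `planned_pins` extra descriptors fit under
 * `soft_limit`, given `fds_in_use` descriptors already open? The
 * subtraction form avoids overflow when adding the two counts. */
int pins_fit(uint64_t soft_limit, uint64_t fds_in_use, uint64_t planned_pins)
{
    if (fds_in_use > soft_limit)
        return 0;
    return planned_pins <= soft_limit - fds_in_use;
}

/* Hypothetical helper: read the soft RLIMIT_NOFILE of this process.
 * Returns 0 on success and stores the limit in *out, -1 on failure. */
int soft_fd_limit(uint64_t *out)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *out = (uint64_t)rl.rlim_cur;
    return 0;
}
```

For example, with a soft limit of 1024 and 24 descriptors already open, 1000 pinned buffers fit and 1001 do not. If the budget is too small, the options are raising the limit or pinning fewer, larger buffers.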

Build and installation

We provide three ways for building and installing GDRCopy.

rpm package

# For RHEL:
$ sudo yum groupinstall 'Development Tools'
$ sudo yum install dkms rpm-build make

# For SLE:
$ sudo zypper in dkms rpmbuild

$ cd packages
$ CUDA= ./build-rpm-packages.sh
$ sudo rpm -Uvh...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Solid utility library with moderate GitHub stars.