Release, NVIDIA, published Jun 11, 2026

NVIDIA/nvshmem v3.7.0-0

NVSHMEM 3.7.0-0

Repository: NVIDIA/nvshmem

Tag: v3.7.0-0

Published: 2026-06-11T09:51:03Z

Prerelease: no

Release notes:

NVIDIA NVSHMEM 3.7.0 Release Notes

NVIDIA® NVSHMEM is an implementation of the OpenSHMEM specification for NVIDIA GPUs. The NVSHMEM programming interface implements a Partitioned Global Address Space (PGAS) model across a cluster of NVIDIA GPUs. NVSHMEM provides an easy-to-use interface to allocate memory that is symmetrically distributed across the GPUs. In addition to a CPU-side interface, NVSHMEM provides an NVIDIA® CUDA® kernel-side interface that allows CUDA threads to access any location in the symmetrically distributed memory.

The release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 3.7.0 and earlier releases.

Key Features and Enhancements

The NVSHMEM release includes the following key features and enhancements:

  • Features
      • Added TMA-backed implementations for NVLink put and get operations supporting global-memory and shared-memory local buffers, with a shared-memory registration API (nvshmemx_ask_smem, nvshmemx_give_smem, nvshmemx_release_smem), selectable via NVSHMEM_TMA_POLICY.
      • Added GPUNetIO remote transport (NVSHMEM_REMOTE_TRANSPORT=gpunetio) with GPU-initiated communication (NVSHMEM_GPUNETIO_ENABLE_GDAKI=1) and support for the DOCA SDK; see https://github.com/NVIDIA-DOCA/gpunetio and https://developer.nvidia.com/networking/doca.
      • Added nvshmemx_flush APIs to provide source-buffer reusability without guaranteeing remote visibility.
      • Added experimental logical endpoint/CFT handle support for fabric-PTX unicast communication, including the introduction of NVSHMEM_TEAM_MC_SHARED.
      • Added floating-point atomic add/fetch_add APIs (nvshmemx_{half,float,double}_atomic_{add,fetch_add}) with P2P and proxy-backed IBRC support.
      • Added teams-based OpenSHMEM bootstrap support using SHMEM_TEAM_WORLD with fallback to legacy active-set collectives.
      • Added support for combined CUDA VMM handle flags in symmetric buffer registration.
      • Added support for GID-based routing on InfiniBand networks with IBRC and IBGDA transports.
      • Added support for InfiniBand PKey index selection (NVSHMEM_IB_PKEY_INDEX), QP ack timeout (NVSHMEM_IB_TIMEOUT), and retry count (NVSHMEM_IB_RETRY_CNT).
      • Added CMake target for nvidia-nvshmem-cuXX Python wheels via NVSHMEM_BUILD_LIBS_WHEEL.
      • Added multi-architecture device library support with a fatbin LTO-IR library and per-architecture LLVM bitcode libraries.
      • Improved diagnostics with human-readable status strings in error logs.
      • Improved header search for CUDA-version-agnostic CCCL distribution.
      • Switched to C++17 as the minimum required C++ version.
      • Changed licensing to Apache-2.0 and added DCO contributing guidance.
  • Bug Fixes
      • Fixed build portability issues around C++17/C++20 compilation, GNU extensions, and device-library CUDA architecture propagation.
      • Fixed non-RDC compilation for users that include nvshmem.h without compiling with -rdc=true.
      • Fixed hangs when PEs observed different NCCL availability during initialization.
      • Fixed NVLS capability detection on older CUDA drivers when cuDeviceGetAttribute returns CUDA_ERROR_INVALID_VALUE.
      • Fixed legacy OpenSHMEM bootstrap handling of non-4-byte allgather/alltoall payloads.
      • Fixed nvshmem_ptr handling to return NULL instead of segfaulting when the symmetric heap is not initialized.
      • Fixed IBDevX doorbell UMEM null-check handling.
      • Fixed IBGDA RC multi-port endpoint setup and CQ indexing across selected devices.
      • Fixed LTO-IR/bitcode build issues including LLVM 21 NVPTX intrinsic compatibility and CUDA architecture selection.
      • Fixed a build issue causing redefinition of mlx5dv macros in certain environments.
      • Fixed alltoall block-scoped warp quiet handling when no warps are unused.
      • Fixed standalone test builds against RPM/DEB installs by using exported find_package variables.
      • Fixed memory semantics of the ring allreduce example.
      • Fixed NVLS multimem architecture gating and two-shot tile_allreduce source-data ordering.
      • Fixed IBRC GDRCopy teardown for sysmem handles used by CPU atomics.
      • Fixed bootstrap and common IB transport robustness issues, including a bootstrap helper double-free.
  • External Contributions
      • Added NUMA-aware CPU affinity pinning controlled via NVSHMEM_CPU_AFFINITY. (AWS)
      • Added NVSHMEM_NETDEVS_POLICY to control NIC assignment policy. (AWS)
      • Improved libfabric transport progress, signaling, staged atomics, and ack aggregation for EFA environments. (AWS)
      • Added libfabric RMA batching support with transport-level batching hints and controls. (AWS)
      • Improved libfabric transport GDRCopy integration with opportunistic GDRCopy 2.5+ API loading and FORCE_PCIE support on coherent platforms. (AWS)
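
Several of the features above are selected through environment variables at launch time. As a minimal sketch, the two GPUNetIO settings named in these notes can be layered onto a process environment from Python. The helper name `gpunetio_env` is hypothetical and not part of NVSHMEM; only the variable names and values come from the release notes, and any other requirements of the transport (DOCA installation, supported NICs) are outside the sketch.

```python
import os


def gpunetio_env(base=None, gdaki=True):
    """Return a copy of `base` with the GPUNetIO settings applied.

    Hypothetical helper for illustration; not an NVSHMEM API.
    """
    # Start from the caller's environment, or the current process environment.
    env = dict(os.environ if base is None else base)
    # Select the GPUNetIO remote transport (value taken from the release notes).
    env["NVSHMEM_REMOTE_TRANSPORT"] = "gpunetio"
    if gdaki:
        # Enable GPU-initiated communication (value taken from the release notes).
        env["NVSHMEM_GPUNETIO_ENABLE_GDAKI"] = "1"
    return env


# Build from an empty base so the result contains only the NVSHMEM settings.
env = gpunetio_env(base={})
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the launcher of your choice.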

The NVSHMEM4Py 0.3.1 release includes the following:

  • Updated Numbast integration and dependency handling for newer Python/Numba-CUDA combinations, including Python 3.14 compatibility.
  • Removed hardcoded CUDA 13 build requirement.
  • Updated CuTe DSL RMA tensor tests to use Torch-backed tensors with DLPack conversion.
  • Fixed CuTe and Numba device collective generation and small-team handling, including reducescatter cooperative-launch bindings.
  • Fixed several minor bugs in NVSHMEM4Py tests.

Compatibility

NVSHMEM 3.7.0 has been tested with the following:

  • CUDA Toolkit:
      • 12.8
      • 12.9
      • 13.2
      • 13.3
  • CPUs:
      • *x86* processors
      • NVIDIA Grace™ processors
  • GPUs:
      • NVIDIA Ampere
      • NVIDIA Hopper™
      • NVIDIA Blackwell
  • NCCL 2.30.4
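
A build or CI script may want to warn when the local CUDA Toolkit is not one of the tested versions listed above. The following is a minimal sketch; the function name `is_tested_cuda` is hypothetical, and a `False` result only means the version is absent from the tested list, not that NVSHMEM is known to fail with it.

```python
# CUDA Toolkit versions listed as tested with NVSHMEM 3.7.0, as (major, minor).
TESTED_CUDA_TOOLKITS = {(12, 8), (12, 9), (13, 2), (13, 3)}


def is_tested_cuda(version: str) -> bool:
    """Return True if `version` (e.g. "12.8" or "13.3.1") matches a tested major.minor."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {version!r}")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) in TESTED_CUDA_TOOLKITS
```

For example, `is_tested_cuda("12.9")` is true, while `is_tested_cuda("13.0")` is false because 13.0 does not appear in the list.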

Limitations

  • NVSHMEM is not compatible with the PMI client library on Cray systems, and *must* use the NVSHMEM internal PMI-2 client library.
      • You must launch jobs with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and NVSHMEM_BOOTSTRAP_PMI=PMI-2, or directly by using the MPI or SHMEM bootstraps.
      • You must also set PMI-2 as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when you build NVSHMEM.
  • The libfabric transport currently does not support VMM, so you must disable VMM by setting NVSHMEM_DISABLE_CUDA_VMM=1.
  • Systems with PCIe peer-to-peer communication must do one of the following:
      • Provide InfiniBand to support NVSHMEM atomics API calls.
      • Use NVSHMEM’s UCX transport, which uses sockets for atomics if InfiniBand is absent.
  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between the source and...
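
The Slurm and libfabric limitations above translate into a specific launch command and environment. As a minimal sketch, the pieces can be assembled as follows; the helper name `pmi2_slurm_launch` and the application path are hypothetical, and the build-time setting NVSHMEM_DEFAULT_PMI2=1 is not covered because it applies when NVSHMEM itself is built.

```python
def pmi2_slurm_launch(app, ntasks, use_libfabric=False):
    """Return (command, extra_env) for a PMI-2 bootstrap launch under Slurm.

    Hypothetical helper for illustration; not an NVSHMEM or Slurm API.
    """
    # --mpi=pmi2 tells Slurm to provide the PMI-2 interface to the job.
    cmd = ["srun", "--mpi=pmi2", f"--ntasks={ntasks}", app]
    # Select the PMI-2 bootstrap inside NVSHMEM.
    extra_env = {"NVSHMEM_BOOTSTRAP_PMI": "PMI-2"}
    if use_libfabric:
        # The libfabric transport does not support VMM, so VMM is disabled.
        extra_env["NVSHMEM_DISABLE_CUDA_VMM"] = "1"
    return cmd, extra_env


# "./my_nvshmem_app" is a placeholder application name.
cmd, extra_env = pmi2_slurm_launch("./my_nvshmem_app", ntasks=4, use_libfabric=True)
```

To run the job, merge `extra_env` over a copy of `os.environ` and pass the result, together with `cmd`, to `subprocess.run`.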

