NVIDIA/cuDecomp v0.5.0

v0.5.0

Repository: NVIDIA/cuDecomp

Tag: v0.5.0

Published: 2025-04-08T22:36:15Z

Prerelease: no

Release notes:

What's Changed

This release includes several major updates to cuDecomp. New features make the library more flexible: memory orderings can now be customized per pencil axis through the new transpose_mem_order configuration option, and the transpose and halo update APIs now support padded input/output buffers. Support for multi-node NVLink (MNNVL) equipped clusters is also improved, with opt-in support for fabric-allocated cuDecomp workspace memory. Beyond this, the release includes expanded autotuning options and general improvements.
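
To make the idea of a per-axis memory ordering concrete, here is a minimal, library-independent sketch in Python. It does not call cuDecomp. It assumes an ordering is a permutation of the three global axes listed from fastest-varying to slowest-varying in memory, which is an assumption of this illustration. Check the cuDecomp documentation for the exact convention used by transpose_mem_order. The helper names strides_for_order and linear_index are hypothetical.

```python
def strides_for_order(extents, order):
    """Return the element stride of each global axis (x, y, z).

    extents: sizes indexed by global axis, e.g. (nx, ny, nz).
    order:   permutation of (0, 1, 2); ASSUMED here to list the axes
             from fastest-varying to slowest-varying in memory.
    """
    if sorted(order) != [0, 1, 2]:
        raise ValueError("order must be a permutation of 0, 1, 2")
    strides = [0, 0, 0]
    step = 1
    for axis in order:
        strides[axis] = step      # distance between neighbours along this axis
        step *= extents[axis]     # the next axis is slower by this factor
    return strides


def linear_index(coords, strides):
    """Offset of the element at global-axis coordinates (x, y, z)."""
    return sum(c * s for c, s in zip(coords, strides))


extents = (4, 3, 2)                                  # nx, ny, nz
x_fastest = strides_for_order(extents, (0, 1, 2))    # [1, 4, 12]
z_fastest = strides_for_order(extents, (2, 0, 1))    # [2, 8, 1]
offset = linear_index((1, 2, 1), z_fastest)          # 1*2 + 2*8 + 1*1 = 19
```

The same logical element lands at a different offset under each ordering, which is why a configurable ordering matters when a pencil buffer is handed to code that expects a particular contiguous axis.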

Breaking changes

  • https://github.com/NVIDIA/cuDecomp/pull/60 adds a new padding argument to several cuDecomp APIs: cudecompGetPencilInfo and the cudecompTranspose* and cudecompHaloUpdate* functions. Existing C++ and Fortran code that calls these functions will need to be updated (depending on usage). See the pull request and the documentation for more details.
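
As a rough illustration of what a per-dimension padding argument implies for buffer sizing, the following Python sketch computes the allocated shape of a local pencil. It does not call cuDecomp. It assumes that halos are added on both sides of each dimension and that padding adds extra elements once per dimension. Both are assumptions of this sketch, so consult the cuDecomp documentation for the exact layout. The function names are hypothetical.

```python
from math import prod


def padded_shape(interior, halo_extents, padding):
    """Allocated extent per dimension.

    ASSUMPTION: interior + halo on both sides + padding added once.
    """
    return [n + 2 * h + p for n, h, p in zip(interior, halo_extents, padding)]


def buffer_elements(shape):
    """Number of elements a buffer of the given shape must hold."""
    return prod(shape)


interior = (64, 32, 16)   # local pencil size without halos or padding
halo = (1, 1, 0)          # halo width per dimension
padding = (2, 0, 0)       # extra elements per dimension

shape = padded_shape(interior, halo, padding)   # [68, 34, 16]
count = buffer_elements(shape)                  # 68 * 34 * 16 = 36992
```

Code that previously sized its buffers from the interior and halo extents alone would under-allocate once padding is non-zero, which is one reason a change like this touches calling code.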

Deprecations

  • The Makefile-based build has been removed.

PRs included in this release

  • Made it possible to include library header from pure C program (https://github.com/NVIDIA/cuDecomp/pull/40)
  • Adding Fortran version of Taylor Green example (https://github.com/NVIDIA/cuDecomp/pull/41)
  • Fix integer overflow issue with C++ TG example for large problems. (https://github.com/NVIDIA/cuDecomp/pull/42)
  • Benchmark updates (https://github.com/NVIDIA/cuDecomp/pull/43)
  • Use unique ID based NVSHMEM initialization method for newer NVSHMEM versions (https://github.com/NVIDIA/cuDecomp/pull/44)
  • Removing Makefile build support and related files. (https://github.com/NVIDIA/cuDecomp/pull/45)
  • Add missing preprocessor guards to fix compilation without NVSHMEM enabled. (https://github.com/NVIDIA/cuDecomp/pull/46)
  • Add small MPI_Alltoall after autotuning to work around MPI memory registration delaying cudaFree. (https://github.com/NVIDIA/cuDecomp/pull/47)
  • Address narrowing conversion errors/warnings. (https://github.com/NVIDIA/cuDecomp/pull/48)
  • Add new transpose_mem_order configuration argument to enable more flexible pencil memory layouts. (https://github.com/NVIDIA/cuDecomp/pull/49)
  • Add opt-in support for fabric-registered workspace allocations via cuMem* APIs. (https://github.com/NVIDIA/cuDecomp/pull/50)
  • Dynamically load CUDA driver functions at runtime. (https://github.com/NVIDIA/cuDecomp/pull/51)
  • Increase buffer size used in post-autotuning MPI_Alltoall. (https://github.com/NVIDIA/cuDecomp/pull/52)
  • Fix integer overflow issue in Fortran poisson example. (https://github.com/NVIDIA/cuDecomp/pull/53)
  • Extend transpose shortcut handling to cases with halos. (https://github.com/NVIDIA/cuDecomp/pull/54)
  • Fix bug in handling of NVSHMEM halo backends from recent change. (https://github.com/NVIDIA/cuDecomp/pull/55)
  • Improve multi-node NVLink topology detection and communication ordering using NVML utilities. (https://github.com/NVIDIA/cuDecomp/pull/56)
  • Fix CUDART_VERSION guard for nvmlDeviceGetGpuFabricInfoV to restrict usage to CUDA >= 12.4. (https://github.com/NVIDIA/cuDecomp/pull/57)
  • Silence messages about NVML symbols failing to load. (https://github.com/NVIDIA/cuDecomp/pull/58)
  • Improve tests (https://github.com/NVIDIA/cuDecomp/pull/59)
  • Preserve original user transpose_mem_order settings after grid descriptor creation. (https://github.com/NVIDIA/cuDecomp/pull/61)
  • Add support for padded input/output buffers in transpose and halo communication routines (https://github.com/NVIDIA/cuDecomp/pull/60)
  • Improvements to batched memcpy kernel implementation. (https://github.com/NVIDIA/cuDecomp/pull/62)
  • Remove redundant axis-contiguous/transpose_mem_order configurations from halo tests. Update axis-contiguous test configurations to not supply transpose_mem_order argument. (https://github.com/NVIDIA/cuDecomp/pull/63)
  • Add Blackwell (cc100) support to default builds when using CUDA 12.8 or newer. (https://github.com/NVIDIA/cuDecomp/pull/64)
  • C++ Taylor Green example updates. (https://github.com/NVIDIA/cuDecomp/pull/65)
  • Add new autotuning options to set per operation halo extent and padding arguments. (https://github.com/NVIDIA/cuDecomp/pull/66)
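
Two of the fixes above concern integer overflow for large problems. The PR titles do not describe the exact fix, so the following Python sketch is only a general illustration of the failure mode: the total element count of a 3D grid can exceed what a signed 32-bit integer can hold. The helper names are hypothetical.

```python
INT32_MAX = 2**31 - 1   # largest value of a signed 32-bit integer


def grid_elements(nx, ny, nz):
    """Total element count of a 3D grid (Python ints do not overflow)."""
    return nx * ny * nz


def fits_in_int32(value):
    """True if the value can be stored in a signed 32-bit integer."""
    return -2**31 <= value <= INT32_MAX


def wrap_to_int32(value):
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


small = grid_elements(1024, 1024, 1024)   # 1073741824, fits in 32 bits
large = grid_elements(2048, 2048, 2048)   # 8589934592, does not fit
wrapped = wrap_to_int32(large)            # 0: the count silently wraps
```

In C++ or Fortran, the usual remedy is to carry sizes and offsets in a 64-bit integer type before multiplying.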

Full Changelog: https://github.com/NVIDIA/cuDecomp/compare/v0.4.2...v0.5.0