Release · NVIDIA · published Jun 15, 2026

NVIDIA/numba-cuda-mlir v0.4.0

v0.4.0

Repository: NVIDIA/numba-cuda-mlir

Tag: v0.4.0

Published: 2026-06-15T16:24:32Z

Prerelease: no

Release notes: This first update focuses on platform support, debugging, ecosystem enablement, performance, and broader CUDA Python compatibility.

Highlights

  • Added support for Windows and integrated Windows tests into CI.
  • Added CUDA-gdb CI workflow and debugging support validation.
  • Added experimental third-party ecosystem coverage for nvmath-python, RAPIDS/cuDF, and numbast extension backends.
  • Improved warm compile-time performance by redesigning how the extension registry is refreshed. On our benchmark suite this is ~40% faster than the previous implementation and brings the geomean warm compile-time speedup over numba-cuda to ~1.8x.
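
For readers unfamiliar with the metric, a geometric-mean (geomean) speedup is the n-th root of the product of the per-benchmark speedup ratios, so no single benchmark dominates the result. The sketch below shows the arithmetic only. The benchmark names and timings are hypothetical and are not measurements from this release.

```python
import math


def geomean_speedup(baseline_s, candidate_s):
    """Geometric mean of per-benchmark speedups (baseline time / candidate time)."""
    ratios = [baseline_s[name] / candidate_s[name] for name in baseline_s]
    # Average in log space so a long product cannot overflow or underflow.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical warm compile times in seconds; NOT figures from the release notes.
baseline = {"bench_a": 2.0, "bench_b": 4.5, "bench_c": 1.0}
candidate = {"bench_a": 1.0, "bench_b": 1.5, "bench_c": 1.0}

speedup = geomean_speedup(baseline, candidate)
print(f"geomean speedup: {speedup:.2f}x")
```

With these made-up inputs the per-benchmark ratios are 2, 3, and 1, so the result is the cube root of 6.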

Platform and Tooling Support

  • Introduced Windows CI coverage and related build fixes, including static CRT usage.
  • Added CUDA-gdb workflow coverage to validate debugging behavior.
  • Improved compatibility with newer libc++ versions.
  • Removed the implicit nvjitlink dependency derived from cudatoolkit.

Performance and Compilation

  • Replaced implicit context refresh with explicit initialization and version-tracked registries, reducing warm compile overhead.
  • Optimized CUDA Array Interface launch caching.
  • Avoided finalizing internal device callees during compilation.
  • Made disabling of LTOIR linker optimization user-controlled instead of unconditional.
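
The first item above describes a general caching pattern: a registry carries a version counter, and consumers rebuild derived state only when the counter has changed. The following is a minimal sketch of that pattern in plain Python. The class and method names (`VersionedRegistry`, `CompilerContext`, `initialize`, `lookup`) are hypothetical illustrations and are not taken from the numba-cuda-mlir code base.

```python
class VersionedRegistry:
    """Registry that bumps a counter on every mutation (hypothetical name)."""

    def __init__(self):
        self._entries = {}
        self.version = 0

    def register(self, name, impl):
        self._entries[name] = impl
        self.version += 1  # any change invalidates state cached by consumers

    def snapshot(self):
        return dict(self._entries)


class CompilerContext:
    """Consumer that rebuilds its lookup table only when the registry changed."""

    def __init__(self, registry):
        self._registry = registry
        self._seen_version = None
        self._table = {}
        self.rebuilds = 0

    def initialize(self):
        # Explicit initialization: the caller decides when the first sync happens.
        self._sync()

    def _sync(self):
        if self._seen_version != self._registry.version:
            self._table = self._registry.snapshot()
            self._seen_version = self._registry.version
            self.rebuilds += 1

    def lookup(self, name):
        self._sync()  # on the warm path this is a single integer comparison
        return self._table[name]


registry = VersionedRegistry()
registry.register("add", lambda a, b: a + b)

ctx = CompilerContext(registry)
ctx.initialize()
for _ in range(1000):
    ctx.lookup("add")  # warm path: no rebuild
registry.register("mul", lambda a, b: a * b)
ctx.lookup("mul")  # version changed: one rebuild
print(ctx.rebuilds)
```

The design choice illustrated here is that repeated (warm) lookups pay only for a version comparison, while a refresh happens once per actual registry change.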

CUDA Python Compatibility Improvements

  • Added full array.view() support, including dtype bitwidth changes.
  • Added support for vector types in local and shared memory.
  • Added CUDA vector / scalar operations and vector-to-complex conversions.
  • Added support for custom dtypes.
  • Added complex constructor support, including complex32.
  • Added support for complex CPointer getitem/setitem lowering.
  • Added support for NamedTuple usage in kernels.
  • Improved support for array slicing and shared-memory views.
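
To make "dtype bitwidth changes" concrete, the sketch below shows how NumPy's host-side `ndarray.view()` behaves when the new dtype has a different width. It runs on the CPU with NumPy only. That the in-kernel `array.view()` follows the same shape rules is an assumption, not something these notes state.

```python
import numpy as np

a = np.arange(4, dtype=np.float32)  # 4 elements x 4 bytes = 16 bytes

# Same bitwidth: the shape is unchanged, only the interpretation of the bits differs.
as_u32 = a.view(np.uint32)

# Narrower dtype: the last dimension grows by the bitwidth ratio (32 / 16 = 2).
as_u16 = a.view(np.uint16)

# Wider dtype: the last dimension shrinks by the bitwidth ratio (64 / 32 = 2).
as_f64 = a.view(np.float64)

print(as_u32.shape, as_u16.shape, as_f64.shape)

# No copy is made; every view aliases the original buffer.
print(np.shares_memory(a, as_u16))
```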

Lowering and Type System Fixes

  • Unified vector type handling by replacing VectorTypeStub with VectorType / VectorTypeClass.
  • Introduced a value/storage data model to fix float16 and bool memory representation issues.
  • Fixed lowering for defaults, tuples, dtype tokens, heterogeneous tuple assignment, optional values, and string constant folding.
  • Fixed array.real / array.imag on shared-memory arrays so that the address space is preserved.
  • Fixed VectorType-to-complex setitem behavior.
  • Fixed to_numba_type handling for NumPy dtypes.
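
The value/storage distinction mentioned above can be illustrated on the host: the type used in computation (the value) may differ from the bytes kept in memory (the storage). The sketch below uses NumPy only. The helper names `to_storage` and `to_value` are hypothetical, and whether numba-cuda-mlir uses these exact representations is an assumption.

```python
import numpy as np

# Storage side: NumPy keeps each bool in a full byte and each float16 in two bytes.
assert np.dtype(np.bool_).itemsize == 1
assert np.dtype(np.float16).itemsize == 2


def to_storage(flag):
    """Value -> storage: widen a logical truth value to the byte kept in memory."""
    return np.uint8(1 if flag else 0)


def to_value(byte):
    """Storage -> value: any non-zero byte loads as True."""
    return bool(byte != 0)


buf = np.zeros(4, dtype=np.uint8)  # stand-in for memory that holds bools
buf[2] = to_storage(True)
flags = [to_value(b) for b in buf]
print(flags)

# float16: the same 16 bits can be read as a half-precision value or as raw storage.
half_bits = int(np.array([1.5], dtype=np.float16).view(np.uint16)[0])
print(hex(half_bits))
```

Keeping the two representations separate means loads and stores convert explicitly, instead of assuming the in-register type and the in-memory layout are identical.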

Ecosystem and Extension Support

  • Enabled extension linkage in MLIR lowerings.
  • Added Extension API documentation.
  • Added Numbast MLIR source CI tests.
  • Added experimental cuDF / RAPIDS third-party test coverage, including use of pylibcudf from the active conda environment.
  • Prevented unintended invocation of the Numba-CUDA JIT and addressed resulting issues.

Documentation and Maintenance

  • Updated reference documentation.
  • Added PR documentation preview infrastructure.
  • Fixed PyPI-hosted README links.
  • Removed outdated conda install documentation.
  • Removed legacy @intrinsic implementations.
  • Removed dead NRT C++ code.
  • Removed cudasim support.
  • Removed unnecessary packaging dependency from numba_cuda._compat.

Bug Fixes

  • Fixed an internal compiler error (ICE) for raise-only kernels.
  • Fixed shared-memory view behavior with None starts.
  • Fixed array slicing issues.
  • Fixed multiple lowering edge cases involving tuples, optionals, constants, and complex/vector interactions.
  • Fixed cuDF CI and Numbast CI issues.
