Release · NVIDIA · published Mar 18, 2025

NVIDIA/nvidia-resiliency-ext v0.3.0

v0.3.0

Repository: NVIDIA/nvidia-resiliency-ext

Tag: v0.3.0

Published: 2025-03-18T05:46:17Z

Prerelease: no

Release notes:

NVIDIA Resiliency Extension (NVRx) is a Python package that framework developers and users can use to implement fault-tolerant features. It increases effective training time by minimizing downtime caused by failures and interruptions.

NVIDIA Resiliency Extension v0.3

Highlights

  • Support for Blackwell GPUs
  • Support for ARM-based host CPUs
  • In-process and in-job restart
      • Hierarchical in-process and in-job restart support
      • Warm spare support
  • Health checks
      • GPU health check based on NVML
      • NIC health check
  • Checkpointing
      • Checkpointing capabilities that were previously part of Megatron Core have been refactored into NVRx. The checkpointing feature will be maintained as part of NVRx, and Megatron Core and NeMo will use the NVRx code in the future.

Known Issues & Limitations

  • The GPU health check requires driver version >= 570
  • Checkpointing: persistent queue with replication is not supported
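
The driver requirement above can be checked before the GPU health check is enabled. The following is a minimal sketch and not part of NVRx: the function names `driver_meets_minimum` and `installed_driver_version` are hypothetical, and it assumes the `pynvml` module (from the `nvidia-ml-py` package) is what provides NVML access on the host. It also assumes the requirement refers to the major version of the NVIDIA driver.

```python
from typing import Optional


def driver_meets_minimum(version: str, minimum_major: int = 570) -> bool:
    """Return True if a driver version string such as '570.86.15'
    has a major version of at least `minimum_major`.

    Hypothetical helper; 570 is the minimum stated in the release notes.
    """
    major = int(version.split(".")[0])
    return major >= minimum_major


def installed_driver_version() -> Optional[str]:
    """Query the installed driver version through NVML.

    Returns None when pynvml is not installed or NVML cannot be
    initialized (for example, on a host without an NVIDIA driver).
    """
    try:
        import pynvml  # assumption: provided by the nvidia-ml-py package
    except ImportError:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None
    try:
        version = pynvml.nvmlSystemGetDriverVersion()
        # Older pynvml releases return bytes, newer ones return str.
        return version.decode() if isinstance(version, bytes) else version
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("NVML unavailable; cannot verify the driver version")
    elif driver_meets_minimum(version):
        print(f"Driver {version} satisfies the >= 570 requirement")
    else:
        print(f"Driver {version} is older than 570")
```

Keeping the version comparison separate from the NVML query lets the comparison be tested on machines without a GPU.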

Contributors

@apaithankar @grzegorz-k-karch @hexinw-nvidia @jbieniusiewi @j-szulc @mikolajblaz @sbak5 @skierat @srogawski-nvidia @szmigacz @yzhautouskay
