Release · NVIDIA · published May 28, 2025

NVIDIA/nvidia-resiliency-ext v0.4.0



v0.4.0

Repository: NVIDIA/nvidia-resiliency-ext

Tag: v0.4.0

Published: 2025-05-28T06:22:56Z

Prerelease: no

Release notes:


NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

NVIDIA Resiliency Extension v0.4.0

Highlights

  • Checkpointing
    • PR 29 - Support for storing checkpoints to cloud object stores
      • Leverage cloud storage provider’s multithreaded SDK for rapid loading and saving checkpoints to object stores such as AWS S3, Azure Blob Storage, Google Cloud Storage and more using NVIDIA Multi-storage Client.
      • Provide scalable, reliable, cheaper, single source of truth across clouds/regions
      • Provide opt-out configuration when creating FileSystemWriterAsync class instance to allow users to pass through to the filesystem
    • PR 36 - Critical bug fix to enable async checkpoint loading without errors
  • In-process & In-job restart
    • PR 35 - Nested restarter updates for in-process restart to align with in-job restart, so users have a consistent experience across in-process and in-job restarts
      • Updates to in-process nested restart functionality provided by Python Wrapper class and existing callback infrastructure with additional callbacks and logging
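The in-process restart idea in the last highlight can be illustrated with a minimal, library-free sketch: a wrapper re-invokes the training function inside the same process after a failure and runs user callbacks between attempts. All names here (`restart_on_failure`, `on_restart`, `max_restarts`) are hypothetical and are not the package's API; the real Wrapper class also coordinates restarts across distributed ranks, which this sketch does not attempt.

```python
import functools
import logging

logger = logging.getLogger("restart_sketch")


def restart_on_failure(max_restarts=3, on_restart=()):
    """Hypothetical decorator: retry `fn` in the same process after an exception.

    `on_restart` is a sequence of callbacks, each called with
    (attempt_number, exception) before the next attempt starts.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attempt = 0
            while True:
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    attempt += 1
                    if attempt > max_restarts:
                        # Restart budget exhausted: surface the last failure.
                        raise
                    logger.warning("attempt %d failed: %r; restarting", attempt, exc)
                    for callback in on_restart:
                        callback(attempt, exc)
        return wrapper
    return decorator


# Usage: a function that fails twice, then succeeds on the third call.
events = []
state = {"calls": 0}


@restart_on_failure(max_restarts=3, on_restart=[lambda n, e: events.append((n, str(e)))])
def train():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError(f"simulated fault {state['calls']}")
    return "done"


result = train()
```

Keeping the retry loop inside the process is what avoids the cost of tearing down and relaunching the job; the callbacks are the hook where per-restart cleanup or logging would go.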

Known Issues & Limitations

  • Dependencies:
    • In-process restart requires a PyTorch version that includes the changes in PR 150690, to avoid a deadlock in NCCL P2P communications (used in pipeline parallelism)

    • In-process requires Transformer Engine including at least PR 1715 (merged) and [PR 1812](https://github.com/NVIDIA/TransformerEngine/pull/1812) (not yet merged) to reduce cross-restart memory leaks
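Because these dependencies are stated in terms of upstream PRs rather than release numbers, a version string alone cannot confirm that the fixes are present. Recording which builds are installed is still a reasonable first step before checking them against the upstream PR history. The sketch below uses only the standard library; the distribution names `torch` and `transformer_engine` are assumptions about how the packages were installed and may differ in a given environment.

```python
from importlib import metadata


def installed_versions(distributions=("torch", "transformer_engine")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per distribution for inclusion in a bug report or job log.
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```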
