NVIDIA/nvidia-resiliency-ext v0.4.0
v0.4.0
Repository: NVIDIA/nvidia-resiliency-ext
Tag: v0.4.0
Published: 2025-05-28T06:22:56Z
Prerelease: no
Release notes:
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.
NVIDIA Resiliency Extension v0.4.0
Highlights
- Checkpointing
  - PR 29 - Support for storing checkpoints to cloud object stores
    - Leverage cloud storage providers' multithreaded SDKs for rapid loading and saving of checkpoints to object stores such as AWS S3, Azure Blob Storage, Google Cloud Storage, and more, using NVIDIA Multi-Storage Client.
    - Provide a scalable, reliable, lower-cost single source of truth across clouds and regions
    - Provide an opt-out configuration when creating a FileSystemWriterAsync class instance, allowing users to pass through to the filesystem
  - PR 36 - Critical bug fix to enable async checkpoint loading without errors
- In-process & In-job restart
  - PR 35 - Nested restarter updates for in-process restart to align with in-job restart, so users have a consistent experience across in-process and in-job restarts
    - Updates to the in-process nested restart functionality provided by the Python Wrapper class and the existing callback infrastructure, with additional callbacks and logging
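
To illustrate the general idea behind in-process restart with callbacks, here is a minimal, standard-library-only sketch. It is not the package's Wrapper class or its API: the decorator `restart_in_process`, its parameters `max_restarts` and `on_restart`, and the `train` function are all hypothetical names chosen for this example. It shows only the core pattern of re-running a failed function inside the same process and notifying callbacks on each restart.

```python
import functools
import logging
from typing import Callable, Sequence

logger = logging.getLogger("restart_sketch")


def restart_in_process(
    max_restarts: int = 3,
    on_restart: Sequence[Callable[[int, BaseException], None]] = (),
):
    """Hypothetical decorator: re-run the wrapped function in the same
    process after a failure, invoking callbacks before each restart."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attempt = 0
            while True:
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Give up once the restart budget is exhausted.
                    if attempt >= max_restarts:
                        raise
                    attempt += 1
                    logger.warning("restart %d after %r", attempt, exc)
                    # Notify every registered callback about this restart.
                    for callback in on_restart:
                        callback(attempt, exc)

        return wrapper

    return decorator


# Usage: a function that fails twice, then succeeds on the third call.
events = []
calls = {"count": 0}


@restart_in_process(max_restarts=2, on_restart=[lambda n, exc: events.append(n)])
def train():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated fault")
    return "done"


result = train()
```

A real in-process restart mechanism also has to coordinate across ranks and tear down and rebuild distributed state; this sketch deliberately leaves that out and models only the retry-and-callback loop.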
Known Issues & Limitations
- Dependencies:
  - In-process restart requires a PyTorch version that includes the changes in PR 150690, to avoid a deadlock in NCCL P2P communication (used in pipeline parallelism)
  - In-process restart requires a Transformer Engine version that includes at least PR 1715 (merged) and [PR 1812](https://github.com/NVIDIA/TransformerEngine/pull/1812) (not yet merged), to reduce cross-restart memory leaks