
NVIDIA/nvidia-resiliency-ext v0.5.0-release-notes


v0.5.0

Repository: NVIDIA/nvidia-resiliency-ext

Tag: v0.5.0-release-notes

Published: 2025-11-14T22:42:29Z

Prerelease: no

Release notes:


NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures and interruptions.

NVIDIA Resiliency Extension v0.5.0

Highlights

  • In-job restarts
      • PRs (185, 190, 201) improve the scalability, profiling, and performance of in-job restarts through improvements to the rendezvous operation
      • Key scaling and fault-tolerance improvements:
          • The new barrier-based rendezvous operation is a substantial redesign that addresses several limitations of the previous dynamic rendezvous implementation. It provides more predictable, stable, and scalable in-job behavior
      • Faster termination path:
          • The worker termination timeout (--workers-stop-timeout) has been reduced from 30 seconds to 15 seconds, improving failure recovery latency and overall job responsiveness
      • New flag for infra-aligned rank assignment:
          • A new flag, --ft-use-infra-group-rank, allows in-job scaling to follow the infrastructure scheduler’s rank assignment, preserving topology-aware placement decisions
      • Migration guidance:
          • While the previous dynamic rendezvous-based implementation (v1) remains supported, users are strongly encouraged to adopt barrier-based rendezvous (v2) for improved reliability, stability, and performance
  • Enhanced GPU and NVLink health checks
      • PR 145 introduces several improvements to the health check module, including:
          • Refactored GPUHealthCheck to support device-specific monitoring
          • New NVLHealthCheck class for NVLink health validation
          • Automatic health check chaining through the wrapper classes ChainedGPUHealthCheck and ChainedNVLHealthCheck for in-process use
          • Single-GPU health check API for individual device validation, with the trace collector updated to use the new GPU health check API
  • Checkpointing
      • PRs (108, 138, 154, 169, 170, 193, 197, 199) improve the stability of checkpointing by deprecating the use of fork in asynchronous checkpointing and by simplifying the error propagation and shutdown cleanup logic
      • Introduced the option to use multithreaded file I/O instead of multiprocess file I/O, which simplifies error propagation, improves shutdown cleanup, and enhances overall stability
      • Made the persistent async checkpoint worker the default (except for local checkpointing) and fixed cross-call state pollution
      • Added the ability to abort the async checkpoint process
  • Fault attribution (new module introduced in v0.5)
      • PR 141 introduces the base attribution class, which can be used to define any attribution module. It also supports asynchronously combining multiple modules directly.
      • PR 172 improves error attribution by dumping NCCL traces from PyTorch for collective analysis on a hang or watchdog timeout
      • The module is experimental. It identifies ranks that interrupt workload progress by analyzing Flight Recorder traces, and it detects GPU errors, host issues, and GIL locks
      • PyTorch’s watchdog is currently configured to include the training process’s stack trace when generating Flight Recorder traces. However, this can lead to a deadlock if the trainer fails inside a routine that performs collectives while holding the GIL, since capturing the stack trace requires reacquiring the GIL. A new environment variable, TORCH_INCLUDE_STACK_TRACE=False (default: True), has been added to PyTorch main to avoid this issue. This change will be included in the NGC PyTorch 25.11 container.
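
The TORCH_INCLUDE_STACK_TRACE setting described in the last item above can be applied from Python as well as from the shell. The following is a minimal sketch, not part of the package: the helper name disable_watchdog_stack_traces is hypothetical, and it assumes that the variable has to be present in the process environment before PyTorch initializes its distributed watchdog. It only has an effect on PyTorch builds that include the change.

```python
import os
from typing import MutableMapping, Optional


def disable_watchdog_stack_traces(
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Set TORCH_INCLUDE_STACK_TRACE=False in the given environment mapping.

    Hypothetical helper for illustration. When no mapping is passed,
    the environment of the current process (os.environ) is modified.
    """
    target = os.environ if env is None else env
    # Variable name and value are taken from the release notes above.
    target["TORCH_INCLUDE_STACK_TRACE"] = "False"
    return target


if __name__ == "__main__":
    # Assumption: call this before importing torch or creating a process
    # group, so the watchdog sees the value when it starts.
    disable_watchdog_stack_traces()
    print(os.environ["TORCH_INCLUDE_STACK_TRACE"])
```

Passing an explicit mapping is useful when the training script is started as a child process, since the same mapping can then be handed to the process launcher as its environment.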

Known Issues & Limitations

  • Spare-Node Support
      • Spare nodes are not supported by either dynamic rendezvous or barrier-based rendezvous in the current release.
      • The earlier dynamic rendezvous technically supported spare nodes, but only when infra group rank assignment was not used. That mode isn't viable in real deployments because bypassing the infrastructure topology-aware rank assignment leads to degraded performance and inconsistent scaling behavior. Because of this, spare-node support isn't available in this release.
      • With barrier-based rendezvous, we've aligned fully with infra-assigned ranks to ensure correctness and performance. Spare-node support for barrier-based rendezvous is planned for a future update.
  • CUDA 12 and Ubuntu 22.04 users are advised to build from source, since the PyPI wheel for v0.5 defaults to CUDA 13
  • In-process restart requires NCCL = 2.28.9 due to a segmentation fault issue
