ReleaseNVIDIANVIDIApublished May 14, 2026seen 5d

NVIDIA/nvidia-resiliency-ext v0.6.0

NVIDIA/nvidia-resiliency-ext

Open original ↗

Captured source

source ↗
published May 14, 2026seen 5dcaptured 10hhttp 200method plain

v0.6.0

Repository: NVIDIA/nvidia-resiliency-ext

Tag: v0.6.0

Published: 2026-05-14T21:54:30Z

Prerelease: no

Release notes:

NVIDIA Resiliency Extension v0.6.0

Highlights

  • In-job restart
  • Barrier-based rendezvous (v2) is now the default (#214). The legacy dynamic rendezvous (v1) is deprecated and will be removed in a future release (#282).
  • Rendezvous protocol hardening — round-scoped keys, round-fenced CAS to prevent stale slot writes, and cleaner handling of participants exiting mid-rendezvous (#262, #263, #300).
  • Robust startup and shutdown — wait for TCPStore on initial connection (#264), handle signals during rendezvous (#246), notify peers to abort current workers on failure (#228), fix terminate_mp_processes to cover failed workers (#270).
  • Hot-spare node support — closes the v0.5 spare-node gap. Hot-spare is always-on and works with --max-restarts (#226, #250, #266):
  • Simple mode (default, --ft-segment=None) for H100 / non-NVSwitch systems — first min_nodes from --nnodes=min:max become active, the rest become standbys with reserved ranks. No GPU ClusterUUID required.
  • Segment-aware mode (--ft-segment=N) for NVSwitch systems (DGX H200, HGX B200) — uses GPU ClusterUUID to identify NVLink domains; nodes in the same segment get contiguous group ranks for NVLink locality. Requires min_nodes % segment == 0.
  • Block-aware rank assignment (#250) and hot-spare exit-handling fix (#266).
  • Progress-based early termination for in-job restarts and progress-tracker enhancements (#218, #255).
  • External InJob control-plane (experimental) — embed ft_launcher orchestration in a host control plane (#321). Not yet QA-validated; APIs may change.
  • Section-timeout fixes — out-of-section timeout now fires for section-less workloads, baseline iteration tracking corrected (#261, #299).
  • --max-restarts now reflects job-level restart attempts (#211); ft_launcher runs with sensible defaults out of the box (#205, #271).
  • NUMA binding support in ft_launcher for optimized memory affinity (#209).
  • Health checks
  • NIC link-state health check (#230).
  • Distributed Storage health check (#239).
  • DCA integration for HealthCheck (#235).
  • Fail-count tracking in NodeHealthCheck (#244).
  • Checkpointing
  • CPU shared-memory D2H path (experimental) in FileSystemWriterAsync removes a redundant H2H copy and resolves the prior shm D2H race (#298).
  • PersistentAsyncCaller upgrades: QoS control, worker data cache, warmup, IPC-handle caching via ConsistentDataIdentifier, and class-level metadata cache in CachedMetadataFileSystemReader (#273, #274, #275).
  • Reliability fixes: SIGSEGV on SIGKILL with dangling CUDA IPC handles (#284), CUDA IPC handle errors in persistent worker (#288), premature GC of preloaded pinned host tensors in TemporalAsyncCaller (#291), MXFP8/TE quantized tensor handling in IPC cache (#276), spawned persistent worker CUDA-device init (#238).
  • Fault attribution — productized as standalone services (experimental)

> The attribution module — including the Attribution Service, Flight Recorder integration, LogSage, and MCP integration — remains experimental in v0.6. APIs, CLI flags, and service contracts may change in subsequent releases.

  • NVRx Attribution Service (`attrsvc`) and NVRx Slurm Monitor Service (`smonsvc`) introduced as FastAPI-based standalone services (#242, #248).
  • `ft_launcher`-managed `attrsvc` for co-located deployment (#318); UDS endpoints for attrsvc/smonsvc (#315).
  • Attribution is now an optional package — install with pip install nvidia-resiliency-ext[attribution] (#305). Attribution internals refactored under a svc subpackage with a clear controller/runner boundary (#295, #313, #316).
  • PyTorch Flight Recorder (experimental)

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine library release, no major traction