What does this release signal mean?

NVIDIA published v0.6.0 of NVIDIA/nvidia-resiliency-ext, its fault-tolerance library for AI training with PyTorch. The release was tagged v0.6.0, published on 2026-05-14T21:54:30Z, and is not a prerelease. This release signal is evidence of what shipped, changed, or was packaged for users. onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/nvidia-resiliency-ext v0.6.0

Captured source

GitHub/github.com/NVIDIA/nvidia-resiliency-ext

NVIDIA/nvidia-resiliency-ext v0.6.0

published May 14, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain

v0.6.0

Repository: NVIDIA/nvidia-resiliency-ext

Tag: v0.6.0

Published: 2026-05-14T21:54:30Z

Prerelease: no

Release notes:

NVIDIA Resiliency Extension v0.6.0

Highlights

In-job restart
Barrier-based rendezvous (v2) is now the default (#214). The legacy dynamic rendezvous (v1) is deprecated and will be removed in a future release (#282).
Rendezvous protocol hardening — round-scoped keys, round-fenced CAS to prevent stale slot writes, and cleaner handling of participants exiting mid-rendezvous (#262, #263, #300).
Robust startup and shutdown — wait for TCPStore on initial connection (#264), handle signals during rendezvous (#246), notify peers to abort current workers on failure (#228), fix terminate_mp_processes to cover failed workers (#270).
Hot-spare node support — closes the v0.5 spare-node gap. Hot-spare is always-on and works with --max-restarts (#226, #250, #266):
Simple mode (default, --ft-segment=None) for H100 / non-NVSwitch systems — first min_nodes from --nnodes=min:max become active, the rest become standbys with reserved ranks. No GPU ClusterUUID required.
Segment-aware mode (--ft-segment=N) for NVSwitch systems (DGX H200, HGX B200) — uses GPU ClusterUUID to identify NVLink domains; nodes in the same segment get contiguous group ranks for NVLink locality. Requires min_nodes % segment == 0.
Block-aware rank assignment (#250) and hot-spare exit-handling fix (#266).
Progress-based early termination for in-job restarts and progress-tracker enhancements (#218, #255).
External InJob control-plane (experimental) — embed ft_launcher orchestration in a host control plane (#321). Not yet QA-validated; APIs may change.
Section-timeout fixes — out-of-section timeout now fires for section-less workloads, baseline iteration tracking corrected (#261, #299).
--max-restarts now reflects job-level restart attempts (#211); ft_launcher runs with sensible defaults out of the box (#205, #271).
NUMA binding support in ft_launcher for optimized memory affinity (#209).
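
The simple hot-spare mode described above (the first min_nodes from --nnodes=min:max become active, the rest become standbys) can be illustrated with a minimal sketch. This is not the library's implementation: `parse_nnodes`, `assign_roles`, and `NodeRoles` are hypothetical names, and ordering nodes by the position in which they appear in the input list is an assumption. For segment-aware mode the sketch only checks the stated constraint that min_nodes % segment == 0; it does not model ClusterUUID discovery, reserved ranks, or contiguous rank assignment.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NodeRoles:
    """Hypothetical container for the outcome of role assignment."""
    active: List[str]
    standby: List[str]


def parse_nnodes(spec: str) -> Tuple[int, int]:
    """Parse a 'min:max' node-count spec; a bare 'n' means min == max."""
    lo, _, hi = spec.partition(":")
    min_nodes = int(lo)
    max_nodes = int(hi) if hi else min_nodes
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError(f"invalid nnodes spec: {spec!r}")
    return min_nodes, max_nodes


def assign_roles(nodes: List[str], nnodes: str,
                 segment: Optional[int] = None) -> NodeRoles:
    """Split nodes into active and standby sets (illustrative only).

    Assumption: `nodes` is already ordered the way the launcher would
    consider them, so "first min_nodes" is simply a list prefix.
    """
    min_nodes, max_nodes = parse_nnodes(nnodes)
    # Segment-aware mode requires min_nodes to be a multiple of the segment size.
    if segment is not None and min_nodes % segment != 0:
        raise ValueError("min_nodes must be divisible by the segment size")
    if len(nodes) < min_nodes:
        raise ValueError("not enough nodes to reach min_nodes")
    joined = nodes[:max_nodes]  # nodes beyond max are not admitted
    return NodeRoles(active=joined[:min_nodes], standby=joined[min_nodes:])


if __name__ == "__main__":
    roles = assign_roles(["n0", "n1", "n2", "n3", "n4", "n5"], "4:6")
    print("active:", roles.active)
    print("standby:", roles.standby)
```

With six nodes and a 4:6 spec, four nodes are active and two wait as standbys, so a failed active node can be replaced without requesting a new allocation.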

Health checks
NIC link-state health check (#230).
Distributed Storage health check (#239).
DCA integration for HealthCheck (#235).
Fail-count tracking in NodeHealthCheck (#244).

Checkpointing
CPU shared-memory D2H path (experimental) in FileSystemWriterAsync removes a redundant H2H copy and resolves the prior shm D2H race (#298).
PersistentAsyncCaller upgrades: QoS control, worker data cache, warmup, IPC-handle caching via ConsistentDataIdentifier, and class-level metadata cache in CachedMetadataFileSystemReader (#273, #274, #275).
Reliability fixes: SIGSEGV on SIGKILL with dangling CUDA IPC handles (#284), CUDA IPC handle errors in persistent worker (#288), premature GC of preloaded pinned host tensors in TemporalAsyncCaller (#291), MXFP8/TE quantized tensor handling in IPC cache (#276), spawned persistent worker CUDA-device init (#238).

Fault attribution — productized as standalone services (experimental)

> The attribution module — including the Attribution Service, Flight Recorder integration, LogSage, and MCP integration — remains experimental in v0.6. APIs, CLI flags, and service contracts may change in subsequent releases.

NVRx Attribution Service (`attrsvc`) and NVRx Slurm Monitor Service (`smonsvc`) introduced as FastAPI-based standalone services (#242, #248).
`ft_launcher`-managed `attrsvc` for co-located deployment (#318); UDS endpoints for attrsvc/smonsvc (#315).
Attribution is now an optional package — install with `pip install nvidia-resiliency-ext[attribution]` (#305). Attribution internals refactored under a `svc` subpackage with a clear controller/runner boundary (#295, #313, #316).
PyTorch Flight Recorder (experimental)...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine library release, no major traction