NVIDIA/nvidia-resiliency-ext v0.6.0
v0.6.0
Repository: NVIDIA/nvidia-resiliency-ext
Tag: v0.6.0
Published: 2026-05-14T21:54:30Z
Prerelease: no
Release notes:
NVIDIA Resiliency Extension v0.6.0
Highlights
- In-job restart
  - Barrier-based rendezvous (v2) is now the default (#214). The legacy dynamic rendezvous (v1) is deprecated and will be removed in a future release (#282).
  - Rendezvous protocol hardening: round-scoped keys, round-fenced CAS to prevent stale slot writes, and cleaner handling of participants exiting mid-rendezvous (#262, #263, #300).
  - Robust startup and shutdown: wait for TCPStore on initial connection (#264), handle signals during rendezvous (#246), notify peers to abort current workers on failure (#228), fix `terminate_mp_processes` to cover failed workers (#270).
  - Hot-spare node support: closes the v0.5 spare-node gap. Hot-spare is always-on and works with `--max-restarts` (#226, #250, #266):
    - Simple mode (default, `--ft-segment=None`) for H100 / non-NVSwitch systems: the first `min_nodes` from `--nnodes=min:max` become active, the rest become standbys with reserved ranks. No GPU ClusterUUID required.
    - Segment-aware mode (`--ft-segment=N`) for NVSwitch systems (DGX H200, HGX B200): uses GPU ClusterUUID to identify NVLink domains; nodes in the same segment get contiguous group ranks for NVLink locality. Requires `min_nodes % segment == 0`.
    - Block-aware rank assignment (#250) and hot-spare exit-handling fix (#266).
  - Progress-based early termination for in-job restarts and progress-tracker enhancements (#218, #255).
  - External InJob control-plane (experimental): embed `ft_launcher` orchestration in a host control plane (#321). Not yet QA-validated; APIs may change.
  - Section-timeout fixes: out-of-section timeout now fires for section-less workloads, baseline iteration tracking corrected (#261, #299).
  - `--max-restarts` now reflects job-level restart attempts (#211); `ft_launcher` runs with sensible defaults out of the box (#205, #271).
  - NUMA binding support in `ft_launcher` for optimized memory affinity (#209).
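The two hot-spare modes described above can be illustrated with a small, self-contained sketch. This is not the NVRx implementation: the function names `assign_simple` and `assign_segment_aware`, the `Assignment` type, and the use of a plain string as a stand-in for the GPU ClusterUUID are all hypothetical. The notes do not say how reserved standby ranks are numbered, so standbys are kept as an ordered list only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Assignment:
    active: dict[str, int]  # node name -> group rank
    standby: list[str]      # hot spares, in arrival order


def assign_simple(nodes: list[str], min_nodes: int) -> Assignment:
    """Simple mode: the first min_nodes arrivals are active, the rest stand by."""
    if len(nodes) < min_nodes:
        raise ValueError("fewer nodes than min_nodes")
    active = {node: rank for rank, node in enumerate(nodes[:min_nodes])}
    return Assignment(active, list(nodes[min_nodes:]))


def assign_segment_aware(
    node_domains: list[tuple[str, str]], min_nodes: int, segment: int
) -> Assignment:
    """Segment-aware mode: nodes sharing a domain id get contiguous ranks.

    node_domains holds (node, domain_id) pairs in arrival order; domain_id
    is a hypothetical stand-in for the GPU ClusterUUID.
    """
    if min_nodes % segment != 0:
        raise ValueError("min_nodes must be a multiple of segment")

    # Group nodes by domain, preserving arrival order within each domain.
    by_domain: dict[str, list[str]] = {}
    for node, domain in node_domains:
        by_domain.setdefault(domain, []).append(node)

    active: dict[str, int] = {}
    standby: list[str] = []
    for members in by_domain.values():
        for start in range(0, len(members), segment):
            chunk = members[start:start + segment]
            # Only complete segments become active, and only until
            # min_nodes active nodes have been placed.
            if len(chunk) == segment and len(active) < min_nodes:
                for node in chunk:
                    active[node] = len(active)
            else:
                standby.extend(chunk)

    if len(active) < min_nodes:
        raise ValueError("not enough complete segments to reach min_nodes")
    return Assignment(active, standby)
```

With `segment=2` and `min_nodes=4`, three nodes in domain `A` and two in domain `B` yield ranks 0 and 1 for the first two `A` nodes, ranks 2 and 3 for the `B` nodes, and the third `A` node as a standby. The divisibility check mirrors the `min_nodes % segment == 0` requirement: without it, the active set could not be built from whole segments.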
- Health checks
  - NIC link-state health check (#230).
  - Distributed Storage health check (#239).
  - DCA integration for HealthCheck (#235).
  - Fail-count tracking in `NodeHealthCheck` (#244).
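As a rough illustration of what a link-state probe combined with a fail counter can look like, the sketch below reads the Linux `operstate` file under `/sys/class/net` and only reports a node unhealthy after several consecutive failures. It is a generic example, not the NVRx code: `link_is_up`, `FailCountingCheck`, and the `max_failures` threshold are hypothetical names, and the release notes do not describe how `NodeHealthCheck` actually counts or acts on failures.

```python
from pathlib import Path


def link_is_up(iface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """Return True if the kernel reports the interface's operstate as 'up'."""
    try:
        state = (Path(sysfs_root) / iface / "operstate").read_text().strip()
    except OSError:
        # A missing or unreadable interface is treated as a failed check.
        return False
    return state == "up"


class FailCountingCheck:
    """Wrap a boolean check and tolerate a few consecutive failures.

    Hypothetical helper: the result stays healthy until `max_failures`
    consecutive failures have been seen; any success resets the counter.
    """

    def __init__(self, check, max_failures: int = 3):
        self.check = check
        self.max_failures = max_failures
        self.fail_count = 0

    def __call__(self) -> bool:
        if self.check():
            self.fail_count = 0
            return True
        self.fail_count += 1
        return self.fail_count < self.max_failures
```

A caller might wrap the probe as `FailCountingCheck(lambda: link_is_up("eth0"))` and invoke it on a timer. Counting consecutive failures rather than reacting to the first one avoids restarting a job over a transient link flap, at the cost of slower detection.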
- Checkpointing
  - CPU shared-memory D2H path (experimental) in `FileSystemWriterAsync` removes a redundant H2H copy and resolves the prior shm D2H race (#298).
  - PersistentAsyncCaller upgrades: QoS control, worker data cache, warmup, IPC-handle caching via `ConsistentDataIdentifier`, and class-level metadata cache in `CachedMetadataFileSystemReader` (#273, #274, #275).
  - Reliability fixes: SIGSEGV on SIGKILL with dangling CUDA IPC handles (#284), CUDA IPC handle errors in persistent worker (#288), premature GC of preloaded pinned host tensors in `TemporalAsyncCaller` (#291), MXFP8/TE quantized tensor handling in IPC cache (#276), spawned persistent worker CUDA-device init (#238).
- Fault attribution: productized as standalone services (experimental)
  > The attribution module (including the Attribution Service, Flight Recorder integration, LogSage, and MCP integration) remains experimental in v0.6. APIs, CLI flags, and service contracts may change in subsequent releases.
  - NVRx Attribution Service (`attrsvc`) and NVRx Slurm Monitor Service (`smonsvc`) introduced as FastAPI-based standalone services (#242, #248).
  - `ft_launcher`-managed `attrsvc` for co-located deployment (#318); UDS endpoints for `attrsvc/smonsvc` (#315).
  - Attribution is now an optional package: install with `pip install nvidia-resiliency-ext[attribution]` (#305). Attribution internals refactored under a `svc` subpackage with a clear controller/runner boundary (#295, #313, #316).
  - PyTorch Flight Recorder (experimental)…