NVIDIA/NVSentinel v1.5.1
Release v1.5.1
Repository: NVIDIA/NVSentinel
Tag: v1.5.1
Published: 2026-05-14T10:42:20Z
Prerelease: no
Release notes:
Release v1.5.1
This is a hotfix release on top of v1.5.0 containing a single critical fix to the fault-remediation cold-start path.
Bug Fixes
- Fault-Remediation Cold-Start Replay (#1281): Fixed cold-start replay for health events that were intentionally skipped but left without a terminal remediation status. Previously, some skip paths only advanced the change-stream token while leaving `healtheventstatus.faultremediated == nil` on the event document, so a fault-remediation restart could cold-start the stale event and process it a second time. This produced two related replay bugs:
  - A skipped event behind an equivalent in-progress remediation CR could create a duplicate `RebootNode` after fault-remediation restart, allowing workloads to be scheduled back onto an uncordoned node before an unwanted reboot was triggered.
  - Unsupported recommended actions (e.g., `CONTACT_SUPPORT`) could replay on every fault-remediation restart and re-apply the `dgxc.nvidia.com/nvsentinel-state=remediation-failed` label to a node that had already recovered.
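The replay condition described above can be illustrated with a minimal sketch. This is not NVSentinel's code: the `HealthEvent` class, its field names, and the helper functions are hypothetical, and the real cold-start query likely applies more criteria than the single `nil` check modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthEvent:
    """Hypothetical, simplified stand-in for a stored health event document."""
    event_id: str
    node: str
    recommended_action: str
    # None models `faultremediated == nil`: no terminal remediation status yet.
    fault_remediated: Optional[bool] = None


def cold_start_candidates(events):
    """Return the events a restart would pick up again (no terminal status)."""
    return [e for e in events if e.fault_remediated is None]


def skip_before_fix(event):
    """Old skip path as described: only the stream token advances; status stays None."""
    return event


def skip_after_fix(event):
    """Fixed skip path for unsupported actions: record a terminal, non-remediated status."""
    event.fault_remediated = False
    return event


events = [
    skip_before_fix(HealthEvent("e1", "node-a", "CONTACT_SUPPORT")),
    skip_after_fix(HealthEvent("e2", "node-a", "CONTACT_SUPPORT")),
]
replayed = [e.event_id for e in cold_start_candidates(events)]
print(replayed)  # ['e1']
```

Only the event left without a terminal status is eligible for replay, which is why the fix gives every skip path a terminal value.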
The fix uses the existing node remediation annotation to close stale events for the covered equivalence groups before clearing the annotation:
- On `UnQuarantined`, covered stale events are marked `faultremediated=true`.
- On `Cancelled`, covered stale events are marked `faultremediated=false` (manual/external cancellation does not prove the fault was remediated).
- Unsupported recommended actions are now made terminal with `faultremediated=false` instead of remaining cold-start eligible.
- The cold-start "unresolved remediation-ready event" query was extracted into a shared helper so cold-start and cleanup paths use the same criteria.
- A succeeded existing CR only covers events created before the remediation annotation for that equivalence group was created — later events are treated as a new remediation session and may create a new CR.
- Matching remediation annotation groups are evaluated deterministically by newest `CreatedAt` first.
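The last two rules (coverage by a succeeded CR, and newest-first evaluation) can be sketched as follows. The `AnnotationGroup` class, the group and CR names, the strict `<` boundary, and the name-based tie-breaker are all assumptions made for illustration; the release notes only state that coverage depends on the event predating the annotation and that ordering is by newest `CreatedAt` first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AnnotationGroup:
    """Hypothetical view of one equivalence-group entry in the node remediation annotation."""
    group: str
    cr_name: str
    created_at: datetime


def newest_first(groups):
    """Order matching groups by created_at, newest first.

    The cr_name tie-breaker is an assumption added to keep the order
    deterministic when two timestamps are equal.
    """
    return sorted(groups, key=lambda g: (g.created_at, g.cr_name), reverse=True)


def covered_by_succeeded_cr(event_created_at, group):
    """A succeeded CR covers only events created before the group's annotation entry."""
    return event_created_at < group.created_at


t0 = datetime(2026, 5, 1, 12, 0, 0)
groups = [
    AnnotationGroup("restart", "cr-old", t0),
    AnnotationGroup("restart", "cr-new", t0 + timedelta(hours=1)),
]
newest = newest_first(groups)[0]

earlier_event = t0 + timedelta(minutes=30)  # predates the newest annotation entry
later_event = t0 + timedelta(hours=2)       # created after it: a new remediation session

print(newest.cr_name)                                  # cr-new
print(covered_by_succeeded_cr(earlier_event, newest))  # True
print(covered_by_succeeded_cr(later_event, newest))    # False
```

An event that is not covered is treated as the start of a new remediation session and may therefore produce a new CR.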
Cleanup is scoped to the equivalence group, not the node, so unrelated remediation actions on the same node are not incorrectly closed.
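A minimal sketch of this scoping, combined with the `UnQuarantined` and `Cancelled` rules above, might look like the following. The `StaleEvent` class, the group names, and the `close_stale_events` helper are hypothetical and are not taken from the NVSentinel codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StaleEvent:
    """Hypothetical, simplified health event with an equivalence-group label."""
    event_id: str
    node: str
    group: str
    fault_remediated: Optional[bool] = None


# Terminal status written for each transition, per the release notes.
TERMINAL_STATUS = {"UnQuarantined": True, "Cancelled": False}


def close_stale_events(events, node, covered_groups, transition):
    """Mark unresolved events terminal, but only within the covered groups on this node."""
    status = TERMINAL_STATUS[transition]
    closed = []
    for event in events:
        in_scope = event.node == node and event.group in covered_groups
        if in_scope and event.fault_remediated is None:
            event.fault_remediated = status
            closed.append(event.event_id)
    return closed


events = [
    StaleEvent("e1", "node-a", "restart"),    # covered group on the target node
    StaleEvent("e2", "node-a", "component"),  # same node, unrelated group
    StaleEvent("e3", "node-b", "restart"),    # same group, different node
]
closed = close_stale_events(events, "node-a", {"restart"}, "Cancelled")
print(closed)  # ['e1']
```

Filtering on the group rather than the node alone leaves `e2` open, so an unrelated remediation on the same node is not closed by mistake.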
Acknowledgments
Thanks to @XRFXLP and @KaivalyaMDabhadkar for diagnosing and fixing this issue.
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Upgrade from v1.5.0:
    helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
      --version v1.5.1 \
      --namespace nvsentinel \
      --reuse-values
Fresh install:
    helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
      --version v1.5.1 \
      --namespace nvsentinel \
      --create-namespace