NVIDIA/NVSentinel v1.5.1

Release v1.5.1

Repository: NVIDIA/NVSentinel

Tag: v1.5.1

Published: 2026-05-14T10:42:20Z

Prerelease: no

Release notes:

This is a hotfix release on top of v1.5.0 containing a single critical fix to the fault-remediation cold-start path.

Bug Fixes

  • Fault-Remediation Cold-Start Replay (#1281): Fixed cold-start replay for health events that were intentionally skipped but left without a terminal remediation status. Previously, some skip paths only advanced the change-stream token while leaving healtheventstatus.faultremediated == nil on the event document, so after a restart fault-remediation could cold-start the stale event and process it a second time. This produced two related replay bugs:
      • A skipped event behind an equivalent in-progress remediation CR could create a duplicate RebootNode after a fault-remediation restart. Workloads could then be scheduled back onto the uncordoned node before the unwanted reboot was triggered.
      • Unsupported recommended actions (e.g., CONTACT_SUPPORT) could replay on every fault-remediation restart and re-apply the dgxc.nvidia.com/nvsentinel-state=remediation-failed label to a node that had already recovered.
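
The replay condition can be modelled in a few lines. The sketch below is a simplified Python illustration, not NVSentinel's actual code, which these notes do not show: the HealthEvent class, is_cold_start_eligible, and cold_start are hypothetical stand-ins for the event document and the cold-start query. It only shows why an event whose faultremediated field is still nil is picked up again on every restart, while an event with any terminal value is not.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HealthEvent:
    """Hypothetical stand-in for a health event document."""
    event_id: str
    recommended_action: str
    # None models healtheventstatus.faultremediated == nil (no terminal status).
    fault_remediated: Optional[bool] = None


def is_cold_start_eligible(event: HealthEvent) -> bool:
    """An event without a terminal remediation status is eligible for replay."""
    return event.fault_remediated is None


def cold_start(events: List[HealthEvent]) -> List[str]:
    """Return the IDs a restart would re-process (assumed, simplified query)."""
    return [e.event_id for e in events if is_cold_start_eligible(e)]


events = [
    # Skipped, but only the change-stream token advanced: status stays None.
    HealthEvent("evt-skipped", "CONTACT_SUPPORT"),
    # Skipped and closed with a terminal status.
    HealthEvent("evt-closed", "CONTACT_SUPPORT", fault_remediated=False),
]

first_restart = cold_start(events)
second_restart = cold_start(events)  # nothing changed, so the same event replays
```

In this model, evt-skipped is returned on both restarts because nothing ever writes a terminal value to it, which mirrors the stale-event behaviour described above.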

The fix uses the existing node remediation annotation to close stale events for the covered equivalence groups before clearing the annotation:

  • On UnQuarantined, covered stale events are marked faultremediated=true.
  • On Cancelled, covered stale events are marked faultremediated=false (manual/external cancellation does not prove the fault was remediated).
  • Unsupported recommended actions are now made terminal with faultremediated=false instead of remaining cold-start eligible.
  • The cold-start "unresolved remediation-ready event" query was extracted into a shared helper so cold-start and cleanup paths use the same criteria.
  • A succeeded existing CR only covers events created before the remediation annotation for that equivalence group was created. Later events are treated as a new remediation session and may create a new CR.
  • Matching remediation annotation groups are evaluated deterministically, newest CreatedAt first.
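
The last two rules can be sketched as follows. This is an illustrative Python model under stated assumptions: AnnotationGroup, pick_group, and covered_by_succeeded_cr are hypothetical names, the group name "restart" is invented for the example, "created before" is taken as a strict comparison, and ties on CreatedAt are broken by CR name. The release notes specify none of these details.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AnnotationGroup:
    """Hypothetical entry in the node remediation annotation."""
    equivalence_group: str
    cr_name: str
    created_at: datetime


def pick_group(groups: List[AnnotationGroup],
               equivalence_group: str) -> Optional[AnnotationGroup]:
    """Pick the newest matching group so the result never depends on input order.

    The cr_name tie-break is an assumption made for this sketch.
    """
    matches = [g for g in groups if g.equivalence_group == equivalence_group]
    return max(matches, key=lambda g: (g.created_at, g.cr_name), default=None)


def covered_by_succeeded_cr(event_created_at: datetime,
                            group: AnnotationGroup) -> bool:
    """A succeeded CR covers only events created before its annotation entry."""
    return event_created_at < group.created_at


def ts(minute: int) -> datetime:
    return datetime(2026, 5, 14, 10, minute, tzinfo=timezone.utc)


groups = [
    AnnotationGroup("restart", "cr-old", ts(0)),
    AnnotationGroup("restart", "cr-new", ts(20)),
    AnnotationGroup("other", "cr-x", ts(30)),
]

chosen = pick_group(groups, "restart")                      # newest "restart" entry
earlier_event_covered = covered_by_succeeded_cr(ts(10), chosen)
later_event_covered = covered_by_succeeded_cr(ts(25), chosen)  # new session
```

An event created after the annotation entry is not covered, so in this model it would be free to start a new remediation session.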

Cleanup is scoped to the equivalence group, not the node, so unrelated remediation actions on the same node are not incorrectly closed.
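
A minimal sketch of group-scoped cleanup, assuming a simplified in-memory model: StaleEvent and close_stale_events are hypothetical names, the group names are invented, and "covered" is reduced here to "same node, same equivalence group, no terminal status yet". The outcome strings UnQuarantined and Cancelled and the resulting faultremediated values follow the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StaleEvent:
    """Hypothetical stand-in for an unresolved health event."""
    event_id: str
    node: str
    equivalence_group: str
    fault_remediated: Optional[bool] = None  # None: no terminal status yet


def close_stale_events(events: List[StaleEvent], node: str,
                       equivalence_group: str, outcome: str) -> List[str]:
    """Mark covered stale events terminal and return the IDs that were closed."""
    if outcome == "UnQuarantined":
        terminal = True       # the node recovered, so the fault counts as remediated
    elif outcome == "Cancelled":
        terminal = False      # cancellation does not prove remediation
    else:
        raise ValueError(f"unexpected outcome: {outcome}")

    closed = []
    for event in events:
        in_scope = (event.node == node
                    and event.equivalence_group == equivalence_group)
        if in_scope and event.fault_remediated is None:
            event.fault_remediated = terminal
            closed.append(event.event_id)
    return closed


events = [
    StaleEvent("evt-1", "node-a", "restart"),
    StaleEvent("evt-2", "node-a", "other"),    # same node, unrelated group
    StaleEvent("evt-3", "node-b", "restart"),  # same group, different node
]

closed_ids = close_stale_events(events, "node-a", "restart", "Cancelled")
```

Only evt-1 is closed in this example. The unrelated group on node-a keeps its unresolved status, which is the behaviour the scoping rule is meant to guarantee.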

Acknowledgments

Thanks to @XRFXLP and @KaivalyaMDabhadkar for diagnosing and fixing this issue.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Upgrade from v1.5.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.5.1 \
  --namespace nvsentinel \
  --reuse-values

Fresh install:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.5.1 \
  --namespace nvsentinel \
  --create-namespace
