ReleaseNVIDIANVIDIApublished May 19, 2026seen 5d

NVIDIA/NVSentinel v1.6.0

NVIDIA/NVSentinel

Open original ↗

Captured source

source ↗
published May 19, 2026seen 5dcaptured 9hhttp 200method plain

Release v1.6.0

Repository: NVIDIA/NVSentinel

Tag: v1.6.0

Published: 2026-05-19T06:37:10Z

Prerelease: no

Release notes:

Release v1.6.0

This release adds configurable XID cancellation in the syslog health monitor, repeated-NIC analyzer rules for non-fatal degradation signals, a producer-side gate that prevents stale health events from skewing fault-quarantine metrics during platform-connector outages, identity-aware node-condition compaction that fixes stuck conditions on long entity names, and the v1.5.1 fault-remediation cold-start fix for users upgrading directly from v1.5.0.

Major New Features

XID Cancellation in Syslog Health Monitor (#1270)

The syslog-health-monitor can now be configured with cancellation rules that suppress related XID events when a source XID is observed. Rules are declared in a TOML ConfigMap by source/target error code:

cancellations:
- name: SysLogsXIDError
enabled: true
rules:
- onErrorCode: "162"
cancelErrorCodes: ["163"]

When a source XID fires, the monitor emits a synthetic healthy event that clears matching target XIDs from the node condition. The platform-connector and fault-quarantine resolve health events by errorCode when present (falling back to entities-impacted otherwise), so the existing resolution semantics for non-XID checks are unaffected. A new Prometheus metric counts emitted cancellations by check / source / target error code.

Repeated NIC Analyzer Rules (#1272)

Two new Health Events Analyzer rules escalate repeated non-fatal NIC signals:

  • `RepeatedNICDriverError`: escalates selected non-fatal SysLogsNICDriverError patterns when the same pattern repeats 3 times on a node within 1 hour. Noisy diagnostic-only signals like access_reg_failed are excluded from escalation.
  • `RepeatedNICDegradation`: escalates non-fatal NIC degradation events when the same NIC + NICPort sees 3 degradation events within 1 hour.

Both rules escalate to CONTACT_SUPPORT rather than REPLACE_VM — deterministic NIC failures still use first-event REPLACE_VM, while repeated diagnostic/degradation signals are surfaced for human triage. Aggregation is scoped to the same NIC + NICPort so events on different ports do not aggregate incorrectly.

Bug Fixes & Reliability

  • Platform-Connector Outage Gating (#1259): When platform-connector restarted (graceful redeploy, OOM, helm upgrade), every health monitor on the node held in-flight events in its retry loop with the original GeneratedTimestamp. When platform-connector returned, those stale events landed at fault-quarantine and were misattributed as multi-minute fault_quarantine_node_quarantine_duration_seconds histogram entries, even when fault-quarantine actually cordoned in ≤100 ms. Each monitor now stat-checks the platform-connector Unix socket before every gRPC send; if the socket is missing the send is skipped (no buffering, no cache mutation) and the next polling cycle re-emits the event with a fresh timestamp. Recovery is bounded by the polling cadence regardless of how long the outage lasts. A shared publisher in commons/pkg/healthpub consolidates the gate, retry policy, and Prometheus counters across all Go monitors; the Python gpu-health-monitor gets the same gate inline. Also fixes a related bug where syslog-health-monitor.handleBootIDChange persisted the new BootID before delivering post-reboot healthy events — any send failure left those events permanently lost. BootID is now persisted only after every healthy event has been delivered, and a pendingPostRebootBootIDClear is retried at the top of every poll cycle.
  • Node Condition Cleanup for Truncated Entity Messages (#1304): Fixed an issue where entities with long values (e.g., v1/Pod:prod/61f345d08c9a432a-134a464884734f90) would be byte-truncated mid-token by the platform-connector's per-message compaction, leaving subsequent healthy events unable to clear the condition (the exact-substring cleanup lookup never matched the truncated form). compactMessageField is rewritten to parse the structured identity prefix (ErrorCode + entity tokens) and only truncate the trailing diagnostic free-text — identity tokens are never byte-truncated. A backward-compatible entityMatchesMessage helper falls back to prefix matching when there is evidence of truncation (token ends in ... or is the last token with no Recommended Action=), so nodes already carrying truncated conditions from older releases can also be cleared.
  • Fault-Quarantine Empty-Annotation Handling (#1309): Fixed a bug where fault-quarantine treated quarantineHealthEvent: "[]" as an active quarantine. When fault-quarantine processed a healthy event that cleared the last entity from a quarantined node, it wrote the annotation as an empty JSON array before performUncordon() removed the key entirely. If fault-quarantine restarted or hit a conflict before the key was removed, the next fatal event for that node followed the handleAlreadyQuarantinedNode path — appending the event without cordoning. Adds a shared annotation.IsEmptyValue() helper that treats "", whitespace, and "[]" as absent, used by hasExistingQuarantine() and the related test helpers. The same PR also hardens NIC E2E teardown to restart the NIC monitor before deleting the fake sysfs tree, eliminating a burst of false "device disappeared" fatal events that contaminated downstream tests.
  • NIC Fatal Events Cordon Nodes (#1288): Updated fault-quarantine rules so fatal syslog-health-monitor events for the NIC component class now cordon nodes (previously only GPU did), and added a new ruleset for fatal nic-health-monitor events. E2E coverage was extended to assert that fatal NIC events cordon and that recovery uncordons. Also prevents node-drainer from marking drain status terminal when a node-state label update fails, so the event can be retried instead of leaving DB and node state inconsistent.
  • Syslog HM Metadata Cache Retry (#1302, #1287): Fixed an issue where the syslog-health-monitor cached the metadata-collector…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine minor version release