NVIDIA/NVSentinel v1.6.0
Repository: NVIDIA/NVSentinel
Tag: v1.6.0
Published: 2026-05-19T06:37:10Z
Prerelease: no
Release notes:
Release v1.6.0
This release adds configurable XID cancellation in the syslog health monitor, repeated-NIC analyzer rules for non-fatal degradation signals, a producer-side gate that prevents stale health events from skewing fault-quarantine metrics during platform-connector outages, identity-aware node-condition compaction that fixes stuck conditions on long entity names, and the v1.5.1 fault-remediation cold-start fix for users upgrading directly from v1.5.0.
Major New Features
XID Cancellation in Syslog Health Monitor (#1270)
The syslog-health-monitor can now be configured with cancellation rules that suppress related XID events when a source XID is observed. Rules are declared in a TOML ConfigMap by source/target error code:
    cancellations:
      - name: SysLogsXIDError
        enabled: true
        rules:
          - onErrorCode: "162"
            cancelErrorCodes: ["163"]
When a source XID fires, the monitor emits a synthetic healthy event that clears matching target XIDs from the node condition. The platform-connector and fault-quarantine resolve health events by errorCode when present (falling back to entities-impacted otherwise), so the existing resolution semantics for non-XID checks are unaffected. A new Prometheus metric counts emitted cancellations by check / source / target error code.
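The rule semantics described above can be sketched in a few lines of Python. This is only an illustration of the lookup from a source XID to the target XIDs it cancels: `CancellationRule`, `codes_to_cancel`, and `RULES` are hypothetical names, not the monitor's actual API, and the data mirrors the single example rule from the config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CancellationRule:
    """Hypothetical in-memory form of one rule: a source XID and the XIDs it cancels."""
    on_error_code: str
    cancel_error_codes: tuple[str, ...]


def codes_to_cancel(rules, observed_code, enabled=True):
    """Return the target error codes that would be cleared when `observed_code` is seen."""
    if not enabled:
        return []
    targets = []
    for rule in rules:
        if rule.on_error_code == observed_code:
            for code in rule.cancel_error_codes:
                if code not in targets:  # keep order, drop duplicates
                    targets.append(code)
    return targets


# Mirrors the example rule: XID 162 cancels XID 163.
RULES = [CancellationRule(on_error_code="162", cancel_error_codes=("163",))]
```

In this sketch, `codes_to_cancel(RULES, "162")` yields the codes for which a synthetic healthy event would be emitted; an XID with no matching rule yields an empty list.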
Repeated NIC Analyzer Rules (#1272)
Two new Health Events Analyzer rules escalate repeated non-fatal NIC signals:
- `RepeatedNICDriverError`: escalates selected non-fatal `SysLogsNICDriverError` patterns when the same pattern repeats 3 times on a node within 1 hour. Noisy diagnostic-only signals like `access_reg_failed` are excluded from escalation.
- `RepeatedNICDegradation`: escalates non-fatal NIC degradation events when the same `NIC` + `NICPort` sees 3 degradation events within 1 hour.
Both rules escalate to CONTACT_SUPPORT rather than REPLACE_VM — deterministic NIC failures still use first-event REPLACE_VM, while repeated diagnostic/degradation signals are surfaced for human triage. Aggregation is scoped to the same NIC + NICPort so events on different ports do not aggregate incorrectly.
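The "3 events within 1 hour, scoped to the same NIC + NICPort" condition amounts to a per-key sliding-window count. The sketch below illustrates that counting logic only; the release notes do not show how the analyzer rules are implemented, so `RepeatTracker` and the key layout `(node, nic, port)` are assumptions made for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # window length stated in the release notes
THRESHOLD = 3                # repeat count stated in the release notes


class RepeatTracker:
    """Hypothetical tracker: counts events per key inside a sliding time window.

    Assumes timestamps for a given key arrive in non-decreasing order.
    """

    def __init__(self, window=WINDOW, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._events = defaultdict(deque)

    def observe(self, key, timestamp):
        """Record an event; return True if `threshold` events for `key` fall within `window`."""
        events = self._events[key]
        events.append(timestamp)
        # Drop events older than the window that ends at `timestamp`.
        while events and timestamp - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold
```

Keying on the full `(node, nic, port)` tuple is what keeps events on different ports from being counted together.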
Bug Fixes & Reliability
- Platform-Connector Outage Gating (#1259): When platform-connector restarted (graceful redeploy, OOM, helm upgrade), every health monitor on the node held in-flight events in its retry loop with the original `GeneratedTimestamp`. When platform-connector returned, those stale events landed at fault-quarantine and were misattributed as multi-minute `fault_quarantine_node_quarantine_duration_seconds` histogram entries, even when fault-quarantine actually cordoned in ≤100 ms. Each monitor now stat-checks the platform-connector Unix socket before every gRPC send; if the socket is missing, the send is skipped (no buffering, no cache mutation) and the next polling cycle re-emits the event with a fresh timestamp. Recovery is bounded by the polling cadence regardless of how long the outage lasts. A shared publisher in `commons/pkg/healthpub` consolidates the gate, retry policy, and Prometheus counters across all Go monitors; the Python gpu-health-monitor gets the same gate inline. Also fixes a related bug where `syslog-health-monitor.handleBootIDChange` persisted the new BootID before delivering post-reboot healthy events, so any send failure left those events permanently lost. BootID is now persisted only after every healthy event has been delivered, and a `pendingPostRebootBootIDClear` is retried at the top of every poll cycle.
- Node Condition Cleanup for Truncated Entity Messages (#1304): Fixed an issue where entities with long values (e.g., `v1/Pod:prod/61f345d08c9a432a-134a464884734f90`) would be byte-truncated mid-token by the platform-connector's per-message compaction, leaving subsequent healthy events unable to clear the condition (the exact-substring cleanup lookup never matched the truncated form). `compactMessageField` is rewritten to parse the structured identity prefix (ErrorCode + entity tokens) and only truncate the trailing diagnostic free-text; identity tokens are never byte-truncated. A backward-compatible `entityMatchesMessage` helper falls back to prefix matching when there is evidence of truncation (token ends in `...` or is the last token with no `Recommended Action=`), so nodes already carrying truncated conditions from older releases can also be cleared.
- Fault-Quarantine Empty-Annotation Handling (#1309): Fixed a bug where fault-quarantine treated `quarantineHealthEvent: "[]"` as an active quarantine. When fault-quarantine processed a healthy event that cleared the last entity from a quarantined node, it wrote the annotation as an empty JSON array before `performUncordon()` removed the key entirely. If fault-quarantine restarted or hit a conflict before the key was removed, the next fatal event for that node followed the `handleAlreadyQuarantinedNode` path, appending the event without cordoning. Adds a shared `annotation.IsEmptyValue()` helper that treats `""`, whitespace, and `"[]"` as absent, used by `hasExistingQuarantine()` and the related test helpers. The same PR also hardens NIC E2E teardown to restart the NIC monitor before deleting the fake sysfs tree, eliminating a burst of false "device disappeared" fatal events that contaminated downstream tests.
- NIC Fatal Events Cordon Nodes (#1288): Updated fault-quarantine rules so fatal `syslog-health-monitor` events for the `NIC` component class now cordon nodes (previously only `GPU` did), and added a new ruleset for fatal `nic-health-monitor` events. E2E coverage was extended to assert that fatal NIC events cordon and that recovery uncordons. Also prevents node-drainer from marking drain status terminal when a node-state label update fails, so the event can be retried instead of leaving DB and node state inconsistent.
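The producer-side gate from the platform-connector outage fix (#1259) can be illustrated with a short Python sketch: stat the Unix socket before sending, and skip the send without buffering if it is not there. The function names and the `send` callback are hypothetical, and checking that the path is specifically a socket (rather than merely present) is an assumption; the notes only say the socket is stat-checked.

```python
import os
import stat


def socket_available(path):
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:  # includes FileNotFoundError when the socket is missing
        return False


def publish(event, socket_path, send):
    """Send `event` only when the socket is present; otherwise skip it.

    Nothing is buffered on a skip: the caller's next polling cycle is expected
    to rebuild the event with a fresh timestamp. `send` stands in for the
    real gRPC call.
    """
    if not socket_available(socket_path):
        return False
    send(event)
    return True
```

Because a skipped event is never queued, the oldest timestamp that can reach the consumer after an outage is bounded by one polling interval, which is the property the fix relies on.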
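For the empty-annotation fix (#1309), the check performed by the new helper can be approximated as follows. This is a Python paraphrase for illustration, not the project's Go code: `is_empty_value` and `has_existing_quarantine` are hypothetical stand-ins for `annotation.IsEmptyValue()` and `hasExistingQuarantine()`, and treating a missing key or a whitespace-padded `"[]"` as empty is an assumption.

```python
def is_empty_value(value):
    """Treat None, "", whitespace-only strings, and "[]" as an absent annotation."""
    if value is None:  # assumption: a missing key counts as absent
        return True
    stripped = value.strip()
    return stripped == "" or stripped == "[]"


def has_existing_quarantine(annotations, key="quarantineHealthEvent"):
    """Return True only if the node carries a non-empty quarantine annotation."""
    return not is_empty_value(annotations.get(key))
```

With this check, a node left with `quarantineHealthEvent: "[]"` after an interrupted uncordon is treated as not quarantined, so the next fatal event takes the normal cordon path.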