NVIDIA/NVSentinel v1.8.0
Repository: NVIDIA/NVSentinel
Tag: v1.8.0
Published: 2026-06-01T13:20:11Z
Prerelease: no
Release notes:
Release v1.8.0
This release replaces node-drainer's FIFO worker queue with a two-lane priority queue so a single noisy node can no longer starve drains on other nodes, adds a drainGPUPods flag to scope eviction to GPU-requesting workloads, makes drain and quarantine overrides configurable from the kubernetes-object-monitor, fills in missing recommended actions for newer XIDs, and remediates several CVEs across container images and the Go toolchain.
Major New Features
Priority Queue for Node-Drainer (#1341)
Replaced node-drainer's ready-FIFO ordering with a two-lane priority queue layered under the existing Kubernetes rate-limiting workqueue. Events for nodes that have not yet reached draining get one high-priority representative; additional queued work for the same node stays low-priority to prevent grouped floods from blocking later nodes. Queue priority state is in-memory and follows successful node label transitions — setting draining marks the node as draining, while unquarantine or terminal drain labels clear it. Retry, drain action evaluation, and health-event lifecycle semantics are unchanged. A new Prometheus counter node_drainer_queue_items_assigned_total{priority, reason} tracks assignment decisions.
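The lane-assignment rule described above can be sketched as follows. This is a minimal illustration, not NVSentinel's implementation: the names `TwoLaneQueue`, `Item`, `MarkDraining`, and `ClearDraining` are hypothetical, the sketch is not safe for concurrent use, and it omits the rate-limiting workqueue that the real queue is layered under. It also assumes that a node becomes eligible for a new high-priority representative once its current one has been dequeued.

```go
package main

// Priority names the lane an item was assigned to.
type Priority int

const (
	High Priority = iota
	Low
)

// Item is a hypothetical queued health event, keyed by node name.
type Item struct {
	Node    string
	EventID string
}

// TwoLaneQueue is an illustrative two-lane queue (not thread-safe).
type TwoLaneQueue struct {
	high, low  []Item
	hasHighRep map[string]bool // nodes with a high-priority item currently queued
	draining   map[string]bool // nodes whose draining label was set successfully
}

func NewTwoLaneQueue() *TwoLaneQueue {
	return &TwoLaneQueue{
		hasHighRep: make(map[string]bool),
		draining:   make(map[string]bool),
	}
}

// Add gives a node that is not yet draining exactly one high-priority
// representative; every other item for that node goes to the low lane.
func (q *TwoLaneQueue) Add(it Item) Priority {
	if !q.draining[it.Node] && !q.hasHighRep[it.Node] {
		q.hasHighRep[it.Node] = true
		q.high = append(q.high, it)
		return High
	}
	q.low = append(q.low, it)
	return Low
}

// Get serves the high lane first and falls back to the low lane.
func (q *TwoLaneQueue) Get() (Item, bool) {
	if len(q.high) > 0 {
		it := q.high[0]
		q.high = q.high[1:]
		// Assumption: the node may receive a new representative after this.
		delete(q.hasHighRep, it.Node)
		return it, true
	}
	if len(q.low) > 0 {
		it := q.low[0]
		q.low = q.low[1:]
		return it, true
	}
	return Item{}, false
}

// MarkDraining records a successful transition to the draining label.
func (q *TwoLaneQueue) MarkDraining(node string) { q.draining[node] = true }

// ClearDraining records an unquarantine or terminal drain label.
func (q *TwoLaneQueue) ClearDraining(node string) { delete(q.draining, node) }
```

With this rule, a burst of events for one node occupies at most one slot in the high lane, so the first event for any other node is served ahead of the rest of the burst.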
drainGPUPods Filter (#1310, #1264)
New Helm flag node-drainer.drainGPUPods (default false) restricts pod eviction during fault remediation to workloads that request GPU resources (nvidia.com/gpu or nvidia.com/pgpu). When enabled, CPU-only pods (logging agents, monitoring sidecars, infrastructure DaemonSets) stay running on the node, while GPU workloads — the ones actually blocked by the GPU fault — are evicted. The filter inspects both regular containers and init containers. Default behavior is unchanged so existing deployments are unaffected.
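A filter of this kind might look like the sketch below. The `Pod` and `Container` types are simplified, hypothetical stand-ins for the Kubernetes API types, and the function names are illustrative. Checking limits as well as requests is an assumption of this sketch; the release notes only state that the filter looks at GPU resource requests in regular and init containers.

```go
package main

// Container is a simplified stand-in for a Kubernetes container spec.
// Requests and Limits map resource names to integer quantities.
type Container struct {
	Name     string
	Requests map[string]int64
	Limits   map[string]int64
}

// Pod is a simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name           string
	Containers     []Container
	InitContainers []Container
}

// gpuResourceNames lists the resource names named in the release notes.
var gpuResourceNames = []string{"nvidia.com/gpu", "nvidia.com/pgpu"}

// containerRequestsGPU reports whether a container asks for any GPU resource.
// Reading from a nil map is safe in Go and yields zero.
func containerRequestsGPU(c Container) bool {
	for _, name := range gpuResourceNames {
		if c.Requests[name] > 0 || c.Limits[name] > 0 {
			return true
		}
	}
	return false
}

// PodRequestsGPU inspects both regular containers and init containers.
func PodRequestsGPU(p Pod) bool {
	for _, c := range p.Containers {
		if containerRequestsGPU(c) {
			return true
		}
	}
	for _, c := range p.InitContainers {
		if containerRequestsGPU(c) {
			return true
		}
	}
	return false
}

// PodsToEvict returns every pod when drainGPUPods is false (the default
// behavior) and only GPU-requesting pods when it is true.
func PodsToEvict(pods []Pod, drainGPUPods bool) []Pod {
	if !drainGPUPods {
		return pods
	}
	var out []Pod
	for _, p := range pods {
		if PodRequestsGPU(p) {
			out = append(out, p)
		}
	}
	return out
}
```

Keeping the default path a pass-through mirrors the note above that existing deployments see no change unless the flag is enabled.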
Drain & Quarantine Overrides from Kubernetes Object Monitor (#1342)
drainOverrides and quarantineOverrides are now configurable on health events emitted by kubernetes-object-monitor policies, matching the support that already existed in other monitors. Cluster operators can declare per-policy overrides directly in the TOML/YAML config:
    healthEvent:
      componentClass: Node
      isFatal: true
      message: "Node is not ready"
      recommendedAction: CONTACT_SUPPORT
      errorCode:
        - NODE_NOT_READY
      quarantineOverrides:
        force: true   # or skip: true; do not set both
      drainOverrides:
        skip: true    # or force: true; do not set both
force and skip are mutually exclusive per override block; the chart validates this at template time. This unlocks scenarios like "cordon the node but do not evict pods" (the example tested in the PR) without requiring a separate health monitor.
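The mutual-exclusion rule is enforced by the chart at template time. For readers who want the same check in application code, a rough Go equivalent is sketched below. The `Override` type, `ErrForceAndSkip`, and `ValidateOverrides` are hypothetical names for illustration and are not taken from the NVSentinel codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// Override is a hypothetical mirror of one override block from the config.
type Override struct {
	Force bool
	Skip  bool
}

// ErrForceAndSkip is returned when both fields are set in one block.
var ErrForceAndSkip = errors.New("force and skip are mutually exclusive")

// Validate rejects a block that sets both force and skip.
func (o Override) Validate() error {
	if o.Force && o.Skip {
		return ErrForceAndSkip
	}
	return nil
}

// ValidateOverrides checks each block independently; a nil pointer means the
// block was omitted. One block may force while the other skips, as in the
// "cordon the node but do not evict pods" scenario.
func ValidateOverrides(quarantine, drain *Override) error {
	if quarantine != nil {
		if err := quarantine.Validate(); err != nil {
			return fmt.Errorf("quarantineOverrides: %w", err)
		}
	}
	if drain != nil {
		if err := drain.Validate(); err != nil {
			return fmt.Errorf("drainOverrides: %w", err)
		}
	}
	return nil
}
```

The check is per block, so `quarantineOverrides.force: true` combined with `drainOverrides.skip: true` passes, while setting both fields inside either block fails.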
Bug Fixes & Reliability
- Missing XID Recommended Actions (#1343): Filled in recommended actions for XIDs that were missing from the gpu-health-monitor mapping but listed in the XID analyzer catalog. This adds an additional GPU recovery scenario that now triggers COMPONENT_RESET and fabric-related failures that now trigger RESTART_VM. Bringing the mapping in line with the catalog prevents these XIDs from being silently classified as NONE/CONTACT_SUPPORT.
- Preflight Build Platform Arg + FQ CEL for Preflight (#1352): Fixed a missing --platform argument in the preflight-checks Docker build/publish targets that caused multi-platform image operations to silently produce single-platform artifacts. Also added a new fault-quarantine CEL policy so nodes are cordoned when preflight agents emit fatal health events (respecting existing node-exclusion settings); preflight failures now flow through the same cordon path as other monitors.
Security & Infrastructure
- Go Toolchain 1.26.3 (#1346): Bumped Go from 1.26.2 → 1.26.3, remediating CVE-2026-39820, CVE-2026-42499, CVE-2026-42501, CVE-2026-33814, CVE-2026-39836, and CVE-2026-33811.
- Image-Level CVE Remediation (#1340):
  - preflight-nccl-loopback and preflight-nccl-allreduce: PyTorch base image nvcr.io/nvidia/pytorch:26.03-py3 → 26.04-py3 (pillow 12.1.1 → 12.2.0, fixing GHSA-whj4-6x5x-4v2j and GHSA-pwv6-vv43-88gr; onnx 1.18.0 → 1.21.0, fixing GHSA-q56x-g2fj-4rj6, GHSA-hqmj-h5c6-369m, GHSA-538c-55jv-c5g9, GHSA-3r9x-f23j-gc73). Unused uv/uvx binaries removed to eliminate the embedded vulnerable rustls-webpki (GHSA-82j2-j2ch-gfr8).
  - log-collector: kubectl v1.34.1 → v1.34.8 (picks up github.com/moby/spdystream v0.5.1, fixing GHSA-pc3f-x583-g7j2).
  - file-server-cleanup: base image python:3.13-alpine → python:3.14-alpine (fixes CVE-2026-7210 in expat, CVE-2026-6100 in Python decompression modules, CVE-2026-4786 in webbrowser).
  - gpu-health-monitor and preflight-dcgm-diag: removed unused gnupg package, eliminating CVE-2025-68973 (gnupg2).
Acknowledgments
This release includes contributions from:
- @XRFXLP
- @coderuhaan2004
- @deesharma24
- @jtschelling
- @lalitadithya
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @coderuhaan2004.
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
    helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
      --version v1.8.0 \
      --namespace nvsentinel \
      --create-namespace
To upgrade from v1.7.0:
    helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
      --version v1.8.0 \
      --namespace nvsentinel \
      --reuse-values