NVIDIA/NVSentinel v1.8.0

Release v1.8.0

Repository: NVIDIA/NVSentinel

Tag: v1.8.0

Published: 2026-06-01T13:20:11Z

Prerelease: no

Release notes:

This release replaces node-drainer's FIFO worker queue with a two-lane priority queue so a single noisy node can no longer starve drains on other nodes, adds a drainGPUPods flag to scope eviction to GPU-requesting workloads, makes drain and quarantine overrides configurable from the kubernetes-object-monitor, fills in missing recommended actions for newer XIDs, and remediates several CVEs across container images and the Go toolchain.

Major New Features

Priority Queue for Node-Drainer (#1341)

Replaced node-drainer's ready-FIFO ordering with a two-lane priority queue layered under the existing Kubernetes rate-limiting workqueue. Events for nodes that have not yet reached draining get one high-priority representative; additional queued work for the same node stays low-priority to prevent grouped floods from blocking later nodes. Queue priority state is in-memory and follows successful node label transitions — setting draining marks the node as draining, while unquarantine or terminal drain labels clear it. Retry, drain action evaluation, and health-event lifecycle semantics are unchanged. A new Prometheus counter node_drainer_queue_items_assigned_total{priority, reason} tracks assignment decisions.
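
The lane-assignment rule described above can be illustrated with a small in-memory model. This is a minimal sketch, not the node-drainer implementation: the class name, method names, and the `reason` label values are hypothetical, the rate-limiting workqueue underneath is omitted, and a plain `Counter` stands in for the Prometheus counter.

```python
from collections import Counter, deque

HIGH, LOW = "high", "low"


class TwoLaneQueue:
    """Toy model of a two-lane priority queue keyed by node name."""

    def __init__(self):
        self.high = deque()
        self.low = deque()
        self.draining = set()            # nodes whose draining label was set
        self.has_representative = set()  # nodes with a high-lane item queued
        # Stand-in for node_drainer_queue_items_assigned_total{priority, reason};
        # the reason strings below are assumptions, not the real label values.
        self.assigned = Counter()

    def add(self, node, event):
        if node in self.draining:
            priority, reason = LOW, "node_already_draining"
        elif node in self.has_representative:
            priority, reason = LOW, "representative_already_queued"
        else:
            priority, reason = HIGH, "first_event_for_node"
            self.has_representative.add(node)
        lane = self.high if priority == HIGH else self.low
        lane.append((node, event))
        self.assigned[(priority, reason)] += 1
        return priority

    def get(self):
        # The high lane is always served before the low lane.
        if self.high:
            node, event = self.high.popleft()
            self.has_representative.discard(node)
            return node, event
        if self.low:
            return self.low.popleft()
        return None

    def mark_draining(self, node):
        # Called after the draining label is applied successfully.
        self.draining.add(node)

    def clear(self, node):
        # Called on unquarantine or a terminal drain label.
        self.draining.discard(node)


# A burst of events for node-a no longer delays node-b's first event.
q = TwoLaneQueue()
for i in range(3):
    q.add("node-a", f"event-{i}")
q.add("node-b", "event-0")
order = [q.get() for _ in range(4)]
```

In this model, `order` starts with one event for `node-a`, then the event for `node-b`, and only then the remaining `node-a` events, which is the starvation behavior the release note says the change addresses.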

drainGPUPods Filter (#1310, #1264)

New Helm flag node-drainer.drainGPUPods (default false) restricts pod eviction during fault remediation to workloads that request GPU resources (nvidia.com/gpu or nvidia.com/pgpu). When enabled, CPU-only pods (logging agents, monitoring sidecars, infrastructure DaemonSets) stay running on the node, while GPU workloads — the ones actually blocked by the GPU fault — are evicted. The filter inspects both regular containers and init containers. Default behavior is unchanged so existing deployments are unaffected.
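
The filter rule can be sketched as a predicate over a pod spec. This is an illustrative sketch only: the function names are hypothetical, pods are represented as plain dictionaries shaped like the Kubernetes Pod object, and checking both `requests` and `limits` is an assumption (the release note says only that the workload must request one of the two resources). Any other eviction rules the drainer applies are ignored here.

```python
GPU_RESOURCES = ("nvidia.com/gpu", "nvidia.com/pgpu")


def requests_gpu(pod):
    """Return True if any regular or init container asks for a GPU resource."""
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        resources = container.get("resources") or {}
        for section in ("requests", "limits"):  # assumption: either counts
            amounts = resources.get(section) or {}
            for name in GPU_RESOURCES:
                if str(amounts.get(name, "0")) not in ("0", ""):
                    return True
    return False


def pods_to_evict(pods, drain_gpu_pods=False):
    """With the flag off, every pod is a candidate; with it on, only GPU pods."""
    if not drain_gpu_pods:
        return list(pods)
    return [pod for pod in pods if requests_gpu(pod)]


training_pod = {
    "metadata": {"name": "trainer"},
    "spec": {"containers": [{"resources": {"limits": {"nvidia.com/gpu": "8"}}}]},
}
init_gpu_pod = {
    "metadata": {"name": "warmup"},
    "spec": {
        "containers": [{"resources": {}}],
        "initContainers": [{"resources": {"requests": {"nvidia.com/pgpu": 1}}}],
    },
}
logging_pod = {
    "metadata": {"name": "log-agent"},
    "spec": {"containers": [{"resources": {"requests": {"cpu": "100m"}}}]},
}

pods = [training_pod, init_gpu_pod, logging_pod]
evicted = [p["metadata"]["name"] for p in pods_to_evict(pods, drain_gpu_pods=True)]
```

With the flag enabled, `evicted` contains the two GPU pods and leaves the CPU-only logging agent on the node; with the default `drain_gpu_pods=False` all three remain eviction candidates.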

Drain & Quarantine Overrides from Kubernetes Object Monitor (#1342)

drainOverrides and quarantineOverrides are now configurable on health events emitted by kubernetes-object-monitor policies, matching the support that already existed in other monitors. Cluster operators can declare per-policy overrides directly in the TOML/YAML config:

healthEvent:
  componentClass: Node
  isFatal: true
  message: "Node is not ready"
  recommendedAction: CONTACT_SUPPORT
  errorCode:
    - NODE_NOT_READY
  quarantineOverrides:
    force: true   # or skip: true; do not set both
  drainOverrides:
    skip: true    # or force: true; do not set both

force and skip are mutually exclusive per override block; the chart validates this at template time. This unlocks scenarios like "cordon the node but do not evict pods" (the example tested in the PR) without requiring a separate health monitor.
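
The mutual-exclusion rule is enforced by the chart at template time. For operators who want to pre-check a policy file before it reaches Helm (for example in CI), the same rule might be expressed as below. This is a sketch under assumptions: the function names are hypothetical, and the health event is taken to be an already-parsed dictionary with the keys shown in the example above.

```python
OVERRIDE_BLOCKS = ("quarantineOverrides", "drainOverrides")


def validate_overrides(health_event):
    """Raise ValueError if any override block sets both force and skip."""
    for name in OVERRIDE_BLOCKS:
        block = health_event.get(name) or {}
        if block.get("force") and block.get("skip"):
            raise ValueError(f"{name}: force and skip are mutually exclusive")
    return health_event


# "Cordon the node but do not evict pods", as in the release-note example.
cordon_only = {
    "componentClass": "Node",
    "isFatal": True,
    "message": "Node is not ready",
    "recommendedAction": "CONTACT_SUPPORT",
    "errorCode": ["NODE_NOT_READY"],
    "quarantineOverrides": {"force": True},
    "drainOverrides": {"skip": True},
}
validate_overrides(cordon_only)  # passes

conflicting = {"drainOverrides": {"force": True, "skip": True}}
try:
    validate_overrides(conflicting)
    conflict_error = None
except ValueError as exc:
    conflict_error = str(exc)
```

Setting `force` in one block and `skip` in the other is valid; only setting both inside the same block is rejected.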

Bug Fixes & Reliability

  • Missing XID Recommended Actions (#1343): Filled in recommended actions for XIDs that were listed in the XID analyzer catalog but missing from the gpu-health-monitor mapping. An additional GPU recovery scenario now triggers COMPONENT_RESET, and fabric-related failures now trigger RESTART_VM. Bringing the mapping in line with the catalog prevents these XIDs from being silently classified as NONE/CONTACT_SUPPORT.
  • Preflight Build Platform Arg + FQ CEL for Preflight (#1352): Fixed a missing --platform argument in the preflight-checks Docker build/publish targets that caused multi-platform image operations to silently produce single-platform artifacts. Also added a new fault-quarantine CEL policy so nodes are cordoned when preflight agents emit fatal health events (respecting existing node-exclusion settings) — preflight failures now flow through the same cordon path as other monitors.

Security & Infrastructure

  • Go Toolchain 1.26.3 (#1346): Bumped Go from 1.26.2 → 1.26.3, remediating CVE-2026-39820, CVE-2026-42499, CVE-2026-42501, CVE-2026-33814, CVE-2026-39836, and CVE-2026-33811.
  • Image-Level CVE Remediation (#1340):
      • preflight-nccl-loopback and preflight-nccl-allreduce: PyTorch base image nvcr.io/nvidia/pytorch:26.03-py3 → 26.04-py3 (pillow 12.1.1 → 12.2.0 fixing GHSA-whj4-6x5x-4v2j and GHSA-pwv6-vv43-88gr; onnx 1.18.0 → 1.21.0 fixing GHSA-q56x-g2fj-4rj6, GHSA-hqmj-h5c6-369m, GHSA-538c-55jv-c5g9, GHSA-3r9x-f23j-gc73). Unused uv/uvx binaries removed to eliminate the embedded vulnerable rustls-webpki (GHSA-82j2-j2ch-gfr8).
      • log-collector: kubectl v1.34.1 → v1.34.8 (picks up github.com/moby/spdystream v0.5.1, fixing GHSA-pc3f-x583-g7j2).
      • file-server-cleanup: base image python:3.13-alpine → python:3.14-alpine (fixes CVE-2026-7210 in expat, CVE-2026-6100 in Python decompression modules, CVE-2026-4786 in webbrowser).
      • gpu-health-monitor and preflight-dcgm-diag: removed unused gnupg package, eliminating CVE-2025-68973 (gnupg2).

Acknowledgments

This release includes contributions from:

  • @XRFXLP
  • @coderuhaan2004
  • @deesharma24
  • @jtschelling
  • @lalitadithya

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @coderuhaan2004.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.8.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.7.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.8.0 \
  --namespace nvsentinel \
  --reuse-values
