NVIDIA/NVSentinel v1.9.0

Release: NVIDIA/NVSentinel v1.9.0

  • Repository: NVIDIA/NVSentinel | NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments | 315 stars | Go
  • Name: Release v1.9.0
  • Author: [@github-actions[bot]](https://github.com/github-actions[bot])
  • Created: 2026-06-08T15:55:36Z
  • Published: 2026-06-08T15:58:21Z

Release v1.9.0

This release tightens preflight semantics so DCGM execution errors and non-actionable diagnostic failures no longer block workloads, adds per-init-container controls for inheriting workload env/volume mounts (so NCCL loopback can run in a clean environment while allreduce can pick up workload fabric config), introduces an optional image-cache DaemonSet for preflight images, adds an out-of-cluster deployment mode for platform-connector, and fixes a startup-race bug that left syslog-health-monitor using an empty GPU driver version for the lifetime of the pod.

Major New Features

Per-Init-Container Env & Volume Inheritance Flags (#1370)

Preflight init containers previously inherited workload environment variables matching ncclEnvPatterns and volume mounts matching volumeMountPatterns uniformly across every check. That was too broad — workload-specific NCCL/fabric configuration could poison checks meant to run with a curated environment (e.g. NCCL loopback inheriting workload settings that alter local GPU P2P/NVLink/NVSwitch behavior). Each preflight init container can now opt in or out of inheritance independently:

- name: preflight-nccl-loopback
  inheritUserEnv: false
  inheritUserVolumeMounts: false
- name: preflight-nccl-allreduce
  inheritUserEnv: true # workload fabric config still flows through
  inheritUserVolumeMounts: true

Built-in checks default to curated environments; deployments can opt in selectively where inheritance is actually required.
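The per-container opt-in can be pictured as a filter applied to the workload's environment. The following is a minimal sketch, not NVSentinel's implementation: the `InitContainerSpec` type and `inheritedEnv` function are hypothetical, and it assumes that entries in ncclEnvPatterns are shell-style globs matched against the variable name.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// InitContainerSpec is a hypothetical, simplified view of one preflight
// init container entry. Only the inheritUserEnv flag name comes from the
// release notes.
type InitContainerSpec struct {
	Name           string
	InheritUserEnv bool
}

// inheritedEnv returns the workload env entries ("KEY=value") that a check
// would receive. Patterns are ASSUMED to be shell-style globs on the key.
func inheritedEnv(spec InitContainerSpec, workloadEnv, patterns []string) []string {
	if !spec.InheritUserEnv {
		return nil // curated environment: nothing flows through
	}
	var out []string
	for _, kv := range workloadEnv {
		key, _, _ := strings.Cut(kv, "=")
		for _, p := range patterns {
			if ok, err := path.Match(p, key); err == nil && ok {
				out = append(out, kv)
				break // one matching pattern is enough
			}
		}
	}
	return out
}

func main() {
	env := []string{"NCCL_SOCKET_IFNAME=eth0", "NCCL_P2P_DISABLE=1", "HOME=/root"}
	patterns := []string{"NCCL_*"}

	loopback := InitContainerSpec{Name: "preflight-nccl-loopback", InheritUserEnv: false}
	allreduce := InitContainerSpec{Name: "preflight-nccl-allreduce", InheritUserEnv: true}

	fmt.Println(loopback.Name, inheritedEnv(loopback, env, patterns))
	fmt.Println(allreduce.Name, inheritedEnv(allreduce, env, patterns))
}
```

In this sketch the loopback check receives no workload variables, while the allreduce check receives only those matching the pattern list. The same shape would apply to inheritUserVolumeMounts and volumeMountPatterns.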

Image-Cache DaemonSet for Preflight Images (#1365)

New optional DaemonSet preflight-image-cache pre-pulls all preflight check images on every node, eliminating cold-start image-pull latency from the critical path of a workload's first preflight run. Each container in the DaemonSet idles after pulling its image. Gated behind imageCache.enabled (default false), with configurable resources, pod annotations, and scheduling overrides. A pod-template config checksum annotation forces a rollout when the config content changes.
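The checksum annotation works because any change to the pod template, including an annotation value, makes the DaemonSet controller roll its pods. A minimal sketch of the idea follows; the annotation key `checksum/config` and the helper names are assumptions borrowed from common Helm practice, not taken from the NVSentinel chart.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// configChecksum returns the hex-encoded SHA-256 digest of the rendered
// config content.
func configChecksum(config string) string {
	sum := sha256.Sum256([]byte(config))
	return hex.EncodeToString(sum[:])
}

// podAnnotations builds the pod-template annotations. The key
// "checksum/config" is a HYPOTHETICAL name used for illustration.
func podAnnotations(config string) map[string]string {
	return map[string]string{"checksum/config": configChecksum(config)}
}

func main() {
	before := podAnnotations("imageCache:\n  enabled: true\n")
	after := podAnnotations("imageCache:\n  enabled: true\n  resources: {}\n")

	// A different digest means a different pod template, which is what
	// prompts the controller to roll the DaemonSet pods.
	fmt.Println("template changed:", before["checksum/config"] != after["checksum/config"])
}
```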

Out-of-Cluster Platform-Connector Deployment (#1359)

platform-connectors accepts an optional --kubeconfig flag for explicit out-of-cluster Kubernetes authentication. The kubeconfig path is threaded through startup, connector initialization, and pipeline transformer creation so both the Kubernetes connector and MetadataAugmentor use the same client config when platform-connectors runs outside the cluster (e.g., under systemd). When --kubeconfig is unset, existing in-cluster auth behavior is unchanged.

Synced DCGM Error Mappings (#1369)

Updated dcgmerrorsmapping.csv to match the latest upstream DCGM dcgm_errors.h enum. New mappings:

  • DCGM_FR_SRAM_THRESHOLD, DCGM_FR_NVLINK_EFFECTIVE_BER_THRESHOLD, DCGM_FR_NVLINK_SYMBOL_BER_THRESHOLD, DCGM_FR_IMEX_UNHEALTHY, DCGM_FR_FABRIC_PROBE_STATE, DCGM_FR_BINARY_PERMISSIONS, DCGM_FR_GPU_RECOVERY_DRAIN_P2P → CONTACT_SUPPORT
  • DCGM_FR_FALLEN_OFF_BUS, DCGM_FR_GPU_RECOVERY_REBOOT → RESTART_BM
  • DCGM_FR_GPU_RECOVERY_RESET, DCGM_FR_GPU_RECOVERY_DRAIN_RESET, DCGM_FR_NCCL_ERROR → COMPONENT_RESET
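A mapping file like this is typically consumed as a lookup table from error code to recommended action. The sketch below assumes a two-column `errorCode,recommendedAction` layout, which may not match the real dcgmerrorsmapping.csv. The sample rows pair codes with actions as inferred from the list above, and the `NONE` fallback for unknown codes is likewise an assumption.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// loadMappings parses rows of the ASSUMED form "errorCode,recommendedAction".
func loadMappings(data string) (map[string]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.FieldsPerRecord = 2 // reject rows that do not have exactly two columns
	rows, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	m := make(map[string]string, len(rows))
	for _, row := range rows {
		m[strings.TrimSpace(row[0])] = strings.TrimSpace(row[1])
	}
	return m, nil
}

// actionFor returns the mapped action, or "NONE" for codes that are not in
// the table (an assumed fallback, chosen for illustration).
func actionFor(m map[string]string, code string) string {
	if action, ok := m[code]; ok {
		return action
	}
	return "NONE"
}

func main() {
	// Illustrative rows only; not a copy of the real file.
	sample := "DCGM_FR_FALLEN_OFF_BUS,RESTART_BM\n" +
		"DCGM_FR_GPU_RECOVERY_RESET,COMPONENT_RESET\n" +
		"DCGM_FR_IMEX_UNHEALTHY,CONTACT_SUPPORT\n"

	m, err := loadMappings(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(actionFor(m, "DCGM_FR_FALLEN_OFF_BUS"))
	fmt.Println(actionFor(m, "DCGM_FR_NOT_IN_TABLE"))
}
```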

Bug Fixes & Reliability

  • DCGM_ST_* Should Not Fail Preflight (#1364, #1363): DCGM_ST_* codes (e.g. DCGM_ST_IN_USE, DCGM_ST_DIAG_ALREADY_RUNNING) are diagnostic execution failures, meaning the framework could not complete the run; they are not confirmed hardware faults. Previously these surfaced as fatal health events that cordoned the node. preflight-dcgm-diag now retries on DCGM_ST_* for a configurable number of attempts (DCGM_DIAG_STATUS_RETRY_MAX_ATTEMPTS, DCGM_DIAG_STATUS_RETRY_INTERVAL_SECONDS); if the status persists it emits a non-fatal unhealthy HealthEvent with RecommendedAction=NONE (carrying the DCGM_ST_* status name in the errorCode) and exits successfully so the workload is not blocked. Also adds clean shutdown: dcgmStopDiagnostic is called on termination signals.
  • Preflight-DCGM-Diag Non-Actionable Failures Are Non-Fatal (#1358): DCGM diag failures whose recommended action resolves to NONE (e.g., XID detected during the run with no actionable remediation) are now emitted as non-fatal — the init container exits 0 and the workload's next preflight container runs. Previously these triggered Init:Error and blocked the workload. Bumped DCGM to 4.5.2 to match gpu-health-monitor.
  • Syslog-HM Driver Version Startup Race (#1362): Fixed a long-standing startup race where syslog-health-monitor cached DriverVersion = "" if the monitor started before metadata-collector populated /var/lib/nvsentinel/gpu_metadata.json. The stale empty value was then used for the lifetime of the pod, breaking driver-version-dependent XID 144–150 decoding (the analyzer fell back to WORKFLOW_NVLINK5_ERR → CONTACT_SUPPORT instead of returning RESET_GPU → COMPONENT_RESET). The bug was subtle because #1302 had already fixed metadata recovery for PCI → GPU UUID lookups, masking this code path. GetDriverVersion() now reloads metadata at request time when the cached value is empty, so the monitor recovers once metadata-collector writes the file. A new Prometheus metric tracks XID decode requests that ran without a driver version.
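The retry-then-report behavior described for DCGM_ST_* statuses can be sketched as a bounded loop that ends in a non-fatal outcome. This is an illustration only: `Outcome`, `runWithRetry`, and `envInt` are hypothetical names, the `runDiag` callback stands in for the real diagnostic call, and the fallback values 3 and 5 are placeholders rather than documented defaults. Only the two environment variable names come from the release notes.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// envInt reads a positive integer from an env var, falling back to def when
// the variable is unset or invalid.
func envInt(name string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(name)); err == nil && v > 0 {
		return v
	}
	return def
}

// Outcome is a HYPOTHETICAL summary of what the check would report.
type Outcome struct {
	Healthy   bool
	Fatal     bool
	Action    string
	ErrorCode string
}

// runWithRetry calls runDiag until it stops returning a DCGM_ST_* status or
// the attempts are exhausted. runDiag returns "" when the run completed.
func runWithRetry(runDiag func() string, maxAttempts int, interval time.Duration) Outcome {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	status := ""
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status = runDiag()
		if !strings.HasPrefix(status, "DCGM_ST_") {
			break
		}
		if attempt < maxAttempts {
			time.Sleep(interval) // wait before the next attempt
		}
	}
	if strings.HasPrefix(status, "DCGM_ST_") {
		// Execution failure persisted: report it, but do not block the workload.
		return Outcome{Healthy: false, Fatal: false, Action: "NONE", ErrorCode: status}
	}
	return Outcome{Healthy: true, Action: "NONE"}
}

func main() {
	// Placeholder defaults (3 attempts, 5 seconds); not documented values.
	maxAttempts := envInt("DCGM_DIAG_STATUS_RETRY_MAX_ATTEMPTS", 3)
	interval := time.Duration(envInt("DCGM_DIAG_STATUS_RETRY_INTERVAL_SECONDS", 5)) * time.Second
	fmt.Println("configured:", maxAttempts, interval)

	// Simulated diagnostic that is always busy; a short interval keeps the demo fast.
	busy := func() string { return "DCGM_ST_IN_USE" }
	fmt.Printf("%+v\n", runWithRetry(busy, maxAttempts, time.Millisecond))
}
```

A caller following the release notes would exit 0 when `Fatal` is false, so the next preflight container still runs.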

Acknowledgments

This release includes contributions from:

  • @XRFXLP
  • @sulixu
  • @lalitadithya

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @sulixu.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel…
