NVIDIA/NVSentinel v1.9.0
Release: NVIDIA/NVSentinel v1.9.0
- Repository: NVIDIA/NVSentinel | NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments | 315 stars | Go
- Name: Release v1.9.0
- Author: [@github-actions[bot]](https://github.com/github-actions[bot])
- Created: 2026-06-08T15:55:36Z
- Published: 2026-06-08T15:58:21Z
Release v1.9.0
This release tightens preflight semantics so that DCGM execution errors and non-actionable diagnostic failures no longer block workloads. It also adds per-init-container controls for inheriting workload env vars and volume mounts (so NCCL loopback can run in a clean environment while allreduce can pick up workload fabric config), introduces an optional image-cache DaemonSet for preflight images, adds an out-of-cluster deployment mode for platform-connector, and fixes a startup race that left syslog-health-monitor using an empty GPU driver version for the lifetime of the pod.
Major New Features
Per-Init-Container Env & Volume Inheritance Flags (#1370)
Preflight init containers previously inherited workload environment variables matching ncclEnvPatterns and volume mounts matching volumeMountPatterns uniformly across every check. That was too broad — workload-specific NCCL/fabric configuration could poison checks meant to run with a curated environment (e.g. NCCL loopback inheriting workload settings that alter local GPU P2P/NVLink/NVSwitch behavior). Each preflight init container can now opt in or out of inheritance independently:
    - name: preflight-nccl-loopback
      inheritUserEnv: false
      inheritUserVolumeMounts: false
    - name: preflight-nccl-allreduce
      inheritUserEnv: true  # workload fabric config still flows through
      inheritUserVolumeMounts: true
Built-in checks default to curated environments; deployments can opt in selectively where inheritance is actually required.
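The effect of the per-container flag can be sketched in Go. This is a minimal illustration, not NVSentinel's implementation: the `InitContainerSpec` type and `inheritedEnv` function are hypothetical, and treating `ncclEnvPatterns` as shell-style globs matched against the variable name is an assumption, since the release notes do not state the pattern syntax.

```go
package main

import (
	"path"
	"strings"
)

// InitContainerSpec is a hypothetical stand-in for one preflight
// init container entry; only the env flag is modeled here.
type InitContainerSpec struct {
	Name           string
	InheritUserEnv bool
}

// inheritedEnv returns the workload env entries ("KEY=VALUE") that the
// given check would inherit. A check that opts out inherits nothing;
// a check that opts in inherits only keys matching one of the patterns.
// Assumption: patterns are globs such as "NCCL_*".
func inheritedEnv(spec InitContainerSpec, workloadEnv, patterns []string) []string {
	if !spec.InheritUserEnv {
		return nil
	}
	var out []string
	for _, kv := range workloadEnv {
		key, _, _ := strings.Cut(kv, "=")
		for _, p := range patterns {
			if ok, err := path.Match(p, key); err == nil && ok {
				out = append(out, kv)
				break
			}
		}
	}
	return out
}
```

With a workload environment of `NCCL_DEBUG=INFO`, `HOME=/root`, and `NCCL_IB_HCA=mlx5` and the pattern `NCCL_*`, a spec with `InheritUserEnv: false` yields no inherited variables, while one with `InheritUserEnv: true` yields the two `NCCL_` entries. Volume mounts would presumably follow the same shape with `inheritUserVolumeMounts` and `volumeMountPatterns`.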
Image-Cache DaemonSet for Preflight Images (#1365)
New optional DaemonSet preflight-image-cache pre-pulls all preflight check images on every node, eliminating cold-start image-pull latency from the critical path of a workload's first preflight run. Each container in the DaemonSet idles after pulling its image. Gated behind imageCache.enabled (default false), with configurable resources, pod annotations, and scheduling overrides. A pod-template config checksum annotation forces a rollout when the config content changes.
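To see why a checksum annotation forces a rollout, consider the following sketch. It is illustrative only: in a Helm chart this is normally done in the template itself, and the annotation key `checksum/config` and both function names here are assumptions rather than names taken from the NVSentinel chart.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// configChecksum returns the hex-encoded SHA-256 digest of the
// rendered configuration content.
func configChecksum(rendered []byte) string {
	sum := sha256.Sum256(rendered)
	return hex.EncodeToString(sum[:])
}

// podTemplateAnnotations builds the pod-template annotations.
// The key "checksum/config" is a hypothetical example.
func podTemplateAnnotations(rendered []byte) map[string]string {
	return map[string]string{"checksum/config": configChecksum(rendered)}
}
```

Because the annotation lives on the pod template, any change to the config content changes the template, and the DaemonSet controller replaces the pods. Identical content produces an identical digest, so unrelated upgrades do not restart the cache pods.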
Out-of-Cluster Platform-Connector Deployment (#1359)
platform-connectors accepts an optional --kubeconfig flag for explicit out-of-cluster Kubernetes authentication. The kubeconfig path is threaded through startup, connector initialization, and pipeline transformer creation so both the Kubernetes connector and MetadataAugmentor use the same client config when platform-connectors runs outside the cluster (e.g., under systemd). When --kubeconfig is unset, existing in-cluster auth behavior is unchanged.
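The flag's fallback behavior can be sketched with the Go standard library. The names `authMode`, `parseAuth`, and the two mode constants are hypothetical; only the `--kubeconfig` flag and its "unset means in-cluster" semantics come from the release notes. How platform-connectors actually builds its client config is not stated here; with client-go, a function such as `clientcmd.BuildConfigFromFlags` is one common place to hand off the path.

```go
package main

import "flag"

// authMode names the two authentication paths described above.
type authMode string

const (
	inCluster    authMode = "in-cluster"
	outOfCluster authMode = "kubeconfig"
)

// parseAuth parses command-line arguments and reports which auth
// path applies. An empty or absent --kubeconfig keeps in-cluster auth.
func parseAuth(args []string) (authMode, string, error) {
	fs := flag.NewFlagSet("platform-connectors", flag.ContinueOnError)
	kubeconfig := fs.String("kubeconfig", "", "path to a kubeconfig file; empty means in-cluster auth")
	if err := fs.Parse(args); err != nil {
		return "", "", err
	}
	if *kubeconfig == "" {
		return inCluster, "", nil
	}
	return outOfCluster, *kubeconfig, nil
}
```

Resolving the path once and passing the same value to every consumer mirrors the point in the notes that the Kubernetes connector and MetadataAugmentor must share one client config.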
Synced DCGM Error Mappings (#1369)
Updated dcgmerrorsmapping.csv to match the latest upstream DCGM dcgm_errors.h enum. New mappings:
- DCGM_FR_SRAM_THRESHOLD, DCGM_FR_NVLINK_EFFECTIVE_BER_THRESHOLD, DCGM_FR_NVLINK_SYMBOL_BER_THRESHOLD, DCGM_FR_IMEX_UNHEALTHY, DCGM_FR_FABRIC_PROBE_STATE, DCGM_FR_BINARY_PERMISSIONS, DCGM_FR_GPU_RECOVERY_DRAIN_P2P → CONTACT_SUPPORT
- DCGM_FR_FALLEN_OFF_BUS, DCGM_FR_GPU_RECOVERY_REBOOT → RESTART_BM
- DCGM_FR_GPU_RECOVERY_RESET, DCGM_FR_GPU_RECOVERY_DRAIN_RESET, DCGM_FR_NCCL_ERROR → COMPONENT_RESET
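A lookup table of this kind could be loaded as follows. This is a sketch under assumptions: the two-column layout (`error name,recommended action`, no header row) is a guess at the shape of dcgmerrorsmapping.csv, and `loadMappings` and `sampleCSV` are hypothetical names. The three sample rows are taken from the mappings listed in this release.

```go
package main

import (
	"encoding/csv"
	"strings"
)

// sampleCSV holds three of the mappings added in this release.
// Assumption: one "ERROR_NAME,ACTION" pair per line.
const sampleCSV = `DCGM_FR_IMEX_UNHEALTHY,CONTACT_SUPPORT
DCGM_FR_FALLEN_OFF_BUS,RESTART_BM
DCGM_FR_NCCL_ERROR,COMPONENT_RESET
`

// loadMappings parses CSV text into an error-name -> action map.
// Rows that do not have exactly two fields cause an error.
func loadMappings(csvText string) (map[string]string, error) {
	rd := csv.NewReader(strings.NewReader(csvText))
	rd.FieldsPerRecord = 2
	rd.TrimLeadingSpace = true
	records, err := rd.ReadAll()
	if err != nil {
		return nil, err
	}
	mappings := make(map[string]string, len(records))
	for _, rec := range records {
		mappings[rec[0]] = rec[1]
	}
	return mappings, nil
}
```

Calling `loadMappings(sampleCSV)` gives a map in which `DCGM_FR_FALLEN_OFF_BUS` resolves to `RESTART_BM`. What the real code does for an error name missing from the table is not covered by the release notes, so the sketch leaves that decision to the caller.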
Bug Fixes & Reliability
- DCGM_ST_* Should Not Fail Preflight (#1364, #1363): DCGM_ST_* codes (e.g. DCGM_ST_IN_USE, DCGM_ST_DIAG_ALREADY_RUNNING) are diagnostic execution failures, meaning the framework could not complete the run, not confirmed hardware faults. Previously these surfaced as fatal health events that cordoned the node. preflight-dcgm-diag now retries on DCGM_ST_* for a configurable number of attempts (DCGM_DIAG_STATUS_RETRY_MAX_ATTEMPTS, DCGM_DIAG_STATUS_RETRY_INTERVAL_SECONDS); if the status persists it emits a non-fatal unhealthy HealthEvent with RecommendedAction=NONE (carrying the DCGM_ST_* status name in the errorCode) and exits successfully so the workload is not blocked. Also adds clean shutdown: dcgmStopDiagnostic is called on termination signals.
- Preflight-DCGM-Diag Non-Actionable Failures Are Non-Fatal (#1358): DCGM diag failures whose recommended action resolves to NONE (e.g., XID detected during the run with no actionable remediation) are now emitted as non-fatal: the init container exits 0 and the workload's next preflight container runs. Previously these triggered Init:Error and blocked the workload. Bumped DCGM to 4.5.2 to match gpu-health-monitor.
- Syslog-HM Driver Version Startup Race (#1362): Fixed a long-standing startup race where syslog-health-monitor cached DriverVersion = "" if the monitor started before metadata-collector populated /var/lib/nvsentinel/gpu_metadata.json. The stale empty value was then used for the lifetime of the pod, breaking driver-version-dependent XID 144–150 decoding (the analyzer fell back to WORKFLOW_NVLINK5_ERR → CONTACT_SUPPORT instead of returning RESET_GPU → COMPONENT_RESET). Subtle because #1302 had already fixed metadata recovery for PCI → GPU UUID lookups, masking this code path. GetDriverVersion() now reloads metadata at request time when the cached value is empty, so the monitor recovers once metadata-collector writes the file. A new Prometheus metric tracks XID decode requests that ran without a driver version.
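The DCGM_ST_* retry behavior from the first fix can be sketched as a bounded retry loop. Everything named here is hypothetical (`diagOutcome`, `runDiagWithRetry`, and the convention that the `run` callback returns a status name or an empty string); in the real check the attempt count and interval would come from DCGM_DIAG_STATUS_RETRY_MAX_ATTEMPTS and DCGM_DIAG_STATUS_RETRY_INTERVAL_SECONDS.

```go
package main

import (
	"strings"
	"time"
)

// diagOutcome is a hypothetical summary of what the check reports.
type diagOutcome struct {
	Completed         bool   // the diagnostic ran to completion
	Fatal             bool   // whether the emitted event is fatal
	RecommendedAction string // set only when the run never completed
	ErrorCode         string // last DCGM_ST_* status name seen
	ExitCode          int    // process exit code for the init container
}

// runDiagWithRetry calls run until it stops returning a DCGM_ST_*
// status or maxAttempts is reached. Assumption: run returns "" (or
// any non-DCGM_ST_ value) once the diagnostic actually completed.
func runDiagWithRetry(run func() string, maxAttempts int, interval time.Duration) diagOutcome {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	status := ""
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status = run()
		if !strings.HasPrefix(status, "DCGM_ST_") {
			return diagOutcome{Completed: true}
		}
		if attempt < maxAttempts {
			time.Sleep(interval) // wait before the next attempt
		}
	}
	// Execution failure persisted: report it as non-fatal with no
	// recommended action and exit 0 so the workload is not blocked.
	return diagOutcome{Fatal: false, RecommendedAction: "NONE", ErrorCode: status, ExitCode: 0}
}
```

The key design point is that an execution failure is never promoted to a hardware verdict: exhausting the retries yields a non-fatal result with exit code 0 instead of cordoning the node.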
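The driver-version fix follows a "reload while empty" pattern, sketched below. The type `DriverVersionCache`, the helper `fileSource`, and the JSON field name `driver_version` are assumptions made for illustration; the actual layout of gpu_metadata.json is not given in these notes.

```go
package main

import (
	"encoding/json"
	"os"
	"sync"
)

// gpuMetadata models only the field this sketch needs.
// Assumption: the JSON key name is hypothetical.
type gpuMetadata struct {
	DriverVersion string `json:"driver_version"`
}

// DriverVersionCache caches the driver version but keeps retrying
// the load for as long as the cached value is empty.
type DriverVersionCache struct {
	mu      sync.Mutex
	read    func() ([]byte, error) // source of the metadata bytes
	version string
}

// fileSource returns a reader for a metadata file on disk.
func fileSource(path string) func() ([]byte, error) {
	return func() ([]byte, error) { return os.ReadFile(path) }
}

// GetDriverVersion reloads metadata at request time when nothing is
// cached yet, so a monitor that started too early can recover.
func (c *DriverVersionCache) GetDriverVersion() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.version == "" {
		c.reload()
	}
	return c.version
}

// reload leaves the cache untouched when the file is missing or
// not yet valid JSON.
func (c *DriverVersionCache) reload() {
	data, err := c.read()
	if err != nil {
		return
	}
	var meta gpuMetadata
	if err := json.Unmarshal(data, &meta); err != nil {
		return
	}
	c.version = meta.DriverVersion
}
```

A cache built with `&DriverVersionCache{read: fileSource("/var/lib/nvsentinel/gpu_metadata.json")}` returns an empty string until the file appears, then returns the loaded value on the next request and stops re-reading. The buggy behavior corresponds to loading once at startup and never checking again.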
Acknowledgments
This release includes contributions from:
- @XRFXLP
- @sulixu
- @lalitadithya
Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @sulixu.
Container Images
See versions.txt for the full list of container images and versions.
Helm Chart
Install with:
helm install nvsentinel…