ReleaseNVIDIANVIDIApublished Jun 15, 2026seen 1w

NVIDIA/NVSentinel v1.10.0

NVIDIA/NVSentinel

Open original ↗

Captured source

source ↗
published Jun 15, 2026seen 1wcaptured 1whttp 200method plain

Release v1.10.0

Repository: NVIDIA/NVSentinel

Tag: v1.10.0

Published: 2026-06-15T12:56:35Z

Prerelease: no

Release notes:

Release v1.10.0

NVSentinel v1.10.0 expands GPU health coverage with a new GPU thermal-margin watch, reduces memory footprint through optional per-policy namespace scoping in the Kubernetes Object Monitor, and lays the API foundation for external breakfix coordination via the ExternalRemediationRequest CRD. This release also adds finer-grained scheduling control for platform connectors, fixes fault-remediation handling of deleted nodes, and corrects Helm rendering for proxy-terminated PostgreSQL TLS configurations.

Major New Features

GPU Thermal Margin Watch (#1371, #1388)

Adds a new GpuThermalMarginWatch health check to the GPU health monitor that detects when a GPU crosses its hardware thermal-slowdown boundary. The monitor samples DCGM field 153 (DCGM_FI_DEV_GPU_TEMP_LIMIT) as a signed thermal-margin signal and compares it against a per-GPU hardware slowdown threshold; because that offset varies by SKU and is not exposed by DCGM, the metadata-collector reads it once per GPU via NVML field 194 (NVML_FI_DEV_TEMPERATURE_SLOWDOWN_TLIMIT) and publishes it in gpu_metadata.json. When a GPU's live margin falls below its slowdown threshold, NVSentinel raises a fatal GpuThermalMarginWatch event (error code GPU_TEMP_HW_SLOWDOWN_VIOLATION, recommended action CONTACT_SUPPORT) and clears it automatically once the margin recovers. Unlike the existing GpuThermalWatch, which only signals that throttling increased, this gives operators a quantifiable measure of how far past the hardware slowdown line a GPU has gone. The feature is opt-in via enable/store-only toggles. A companion operator runbook walks responders through confirming the alert with live telemetry and nvidia-smi, checking per-GPU threshold metadata, applying remediation, and reproducing the condition under load.

Optional Namespace Scoping in KOM Policies (#1394)

The Kubernetes Object Monitor (KOM) now supports optional per-policy namespace scoping. Setting resource.namespace on a namespaced resource in a KOM policy instructs controller-runtime to build an informer cache scoped to a single namespace rather than watching every object of that GVK cluster-wide, dramatically reducing memory usage for high-cardinality resources such as Pods. In testing, a Pod-watching policy scoped to one namespace held steady at ~18Mi even with 2000 pods scheduled in an unmonitored namespace, versus ~153Mi (roughly 9x) when watching cluster-wide. The field is rejected for cluster-scoped resources; leave it unset when cluster-wide monitoring is genuinely required.

ExternalRemediationRequest CRD Foundation (#1376)

Introduces the foundation for the ExternalRemediationRequest (ERR) CRD, a new coordination surface in the nvsentinel.dgxc.nvidia.com API group that lets NVSentinel hand off node ownership to an external breakfix system. This first PR ships the API shape only: the apiserver now accepts ERR objects via a new proto-generated CRD packaged in the janitor Helm chart, with scheme registration and RBAC granting janitor access to externalremediationrequests plus its status and finalizers subresources. It also adds custom protojson marshaling so proto well-known types (such as the Timestamp on Condition.lastTransitionTime) serialize as RFC3339, and centralizes the nvsentinel.dgxc.nvidia.com/managed node label and ERR identity constants in a new commons/pkg/managed package. This is foundational/preview only: no component observes ERR objects yet, and the reconciler, fault-remediation producer, and node-labeler gating land in follow-up PRs.

Affinity Support for Platform Connector (#1375)

The NVSentinel Helm chart now supports a platformConnector.affinity value, letting operators control how the platform connector DaemonSet pods are scheduled onto nodes. When set, the affinity block (for example, nodeAffinity rules matching custom node labels) is rendered into the DaemonSet's pod spec; when left empty (the default is {}) it renders nothing, so existing deployments are unaffected. This is useful for pinning connectors to specific node pools or hardware. The change includes a new scheduling configuration reference in the platform-connectors docs.

Bug Fixes & Reliability

  • Ignore deleted nodes in fault remediation (#1396, #1387): Fixed fault-remediation retrying health events forever when the target node had been deleted from the cluster. Previously, GetRemediationState/checkExistingCRStatus failed with a Kubernetes "Node not found" error before the event could be marked terminal, so controller-runtime kept retrying and cold-start re-enqueued the stale event on every restart. The reconciler now detects the not-found error via apierrors.IsNotFound, marks remediation events for deleted nodes terminal with faultRemediated=false (cancellation events terminal with faultRemediated=true), and advances the change-stream resume token so the event is recorded as processed and never retried again.
  • Render valid platform-connectors DaemonSet without a client cert (#1397, #1241): Fixed the platform-connectors DaemonSet in the umbrella Helm chart so it renders a valid manifest when PostgreSQL is the datastore but no client certificate is mounted (platformConnector.postgresqlStore.clientCertMountPath set to ""), a common configuration when TLS is terminated by a cloud-sql-proxy sidecar. Previously the template emitted a volumeMount referencing a non-existent volume, causing ArgoCD and Kubernetes to reject the DaemonSet as invalid. The cert volume, volumeMount, and fix-cert-permissions init container are now only rendered when a mount path is actually configured, bringing the DaemonSet in line with the subchart Deployments that already handled this case.
  • Fixed preflight E2E test flakiness (#1374): Hardened the preflight E2E test helper to...

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

Incremental version release of NVIDIA's NVSentinel tool.