Repo: NVIDIA, published Sep 16, 2025

NVIDIA/NVSentinel

Go


Description: NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments

Language: Go

License: Apache-2.0

Stars: 315

Forks: 86

Open issues: 52

Created: 2025-09-16T12:43:23Z

Pushed: 2026-06-11T03:01:04Z

Default branch: main

Fork: no

Archived: no

README:

NVSentinel

GPU Fault Detection and Remediation for Kubernetes

NVSentinel automatically detects, classifies, and remediates hardware and software faults in GPU nodes. It monitors GPU health, system logs, and cloud provider maintenance events, then takes action: cordoning faulty nodes, draining workloads, and triggering break-fix workflows.

> [!NOTE]
> Beta / Stable
> NVSentinel is ready for production testing and use. APIs, configurations, and features may change between releases. If you encounter issues, please open an issue or start a discussion.

🚀 Quick Start

Prerequisites

  • Kubernetes 1.25+
  • Helm 3.0+
  • NVIDIA GPU Operator (includes DCGM for GPU monitoring)

Installation

NVSENTINEL_VERSION=v1.9.0
# Install from GitHub Container Registry
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--version "$NVSENTINEL_VERSION" \
--namespace nvsentinel \
--create-namespace

# View chart information
helm show chart oci://ghcr.io/nvidia/nvsentinel --version "$NVSENTINEL_VERSION"

✨ Key Features

  • 🔍 Comprehensive Monitoring: Real-time detection of GPU, NVSwitch, and system-level failures
  • 🔧 Automated Remediation: Intelligent fault handling with cordon, drain, and break-fix workflows
  • 📦 Modular Architecture: Pluggable health monitors with standardized gRPC interfaces
  • 🔄 High Availability: Kubernetes-native design with replica support and leader election
  • ⚡ Real-time Processing: Event-driven architecture with immediate fault response
  • 📊 Persistent Storage: MongoDB-based event store with change streams for real-time updates
  • 🛡️ Graceful Handling: Coordinated workload eviction with configurable timeouts
  • 🏷️ Metadata Enrichment: Automatic augmentation of health events with cloud provider and node metadata information

🧪 Complete Setup Guide

For a full installation with all dependencies, follow these steps:

1. Install cert-manager (for TLS)

helm repo add jetstack https://charts.jetstack.io --force-update
helm upgrade --install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace \
--version v1.19.1 --set installCRDs=true \
--wait

2. Install Prometheus (for metrics)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts --force-update
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set prometheus.enabled=true \
--set alertmanager.enabled=false \
--set grafana.enabled=false \
--set kubeStateMetrics.enabled=false \
--set nodeExporter.enabled=false \
--wait

3. Install NVSentinel

NVSENTINEL_VERSION=v1.9.0

helm upgrade --install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
--namespace nvsentinel --create-namespace \
--version "$NVSENTINEL_VERSION" \
--timeout 15m \
--wait

4. Verify Installation

kubectl get pods -n nvsentinel
kubectl get nodes # Verify GPU nodes are visible

# Run comprehensive validation
./scripts/validate-nvsentinel.sh --version "$NVSENTINEL_VERSION" --verbose

> Testing: The example above uses default settings. For production, customize values for your environment.

> Production: By default, only health monitoring is enabled. Enable fault quarantine and remediation modules via Helm values. See [Configuration](#-configuration) below.
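
Once fault quarantine is enabled, one way to check whether any node has been cordoned is to inspect `spec.unschedulable` in the output of `kubectl get nodes -o json`. The sketch below is a minimal illustration of that check, using only the Go standard library. The `CordonedNodes` helper and the sample node names are made up for this example; only the `metadata.name` and `spec.unschedulable` fields come from the Kubernetes Node API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeList mirrors only the fields of a Kubernetes NodeList that this sketch reads.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			Unschedulable bool `json:"unschedulable"`
		} `json:"spec"`
	} `json:"items"`
}

// CordonedNodes returns the names of nodes whose spec.unschedulable is true.
func CordonedNodes(raw []byte) ([]string, error) {
	var list nodeList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, fmt.Errorf("decode node list: %w", err)
	}
	var names []string
	for _, n := range list.Items {
		if n.Spec.Unschedulable {
			names = append(names, n.Metadata.Name)
		}
	}
	return names, nil
}

func main() {
	// Hand-written sample in the shape of `kubectl get nodes -o json` output.
	// The node names are invented for illustration.
	sample := []byte(`{"items":[
	  {"metadata":{"name":"gpu-node-1"},"spec":{"unschedulable":true}},
	  {"metadata":{"name":"gpu-node-2"},"spec":{}}
	]}`)

	names, err := CordonedNodes(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // prints [gpu-node-1]
}
```

To run this against a live cluster, the sample bytes could be replaced with data read from standard input and fed by `kubectl get nodes -o json`. With the default installation, where only health monitoring is enabled, an empty result is expected.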

🎮 Try the Demo

Demo Videos

See NVSentinel in action in the following demo videos:

  • End-to-End
  • Custom Health Monitors
  • Custom Drain Plugins
  • Extensible Remediation
  • Health Events Analyzer

See the [demos directory](demos/) for full descriptions.

Run It Locally

Want to try NVSentinel without GPU hardware? Run our [Local Fault Injection Demo](demos/local-fault-injection-demo/README.md):

  • 🚀 5-minute setup - runs entirely in a local KIND cluster
  • 🔍 Real pipeline - see fault detection → quarantine → node cordon
  • 🎯 No GPU required - uses simulated DCGM for testing

cd demos/local-fault-injection-demo
make demo # Automated: creates cluster, installs NVSentinel, injects fault, verifies cordon

Perfect for learning, presentations, or CI/CD testing!

🏗️ Architecture

NVSentinel follows a microservices architecture with modular health monitors and core processing modules:

graph LR
subgraph "Health Monitors"
GPU["GPU Health Monitor<br/>(DCGM Integration)"]
SYS["Syslog Health Monitor<br/>(Journalctl)"]
CSP["CSP Health Monitor<br/>(CSP APIs)"]
K8SOM["Kubernetes Object Monitor<br/>(CEL Policies)"]
end

subgraph "Core Processing"
PC["Platform Connectors<br/>(gRPC Server)"]
STORE[("MongoDB Store<br/>(Event Database)")]
FQ["Fault Quarantine<br/>(Node Cordon)"]
ND["Node Drainer<br/>(Workload Eviction)"]
FR["Fault Remediation<br/>(Break-Fix Integration)"]
HEA["Health Events Analyzer<br/>(Pattern Analysis)"]
LBL["Labeler<br/>(Node Labels)"]
end

subgraph "Kubernetes Cluster"
K8S["Kubernetes API<br/>(Nodes, Pods, Events)"]
end

GPU -->|gRPC| PC
SYS -->|gRPC| PC
CSP -->|gRPC| PC
K8SOM -->|gRPC| PC

PC -->|persist| STORE
PC -->|update status| K8S

FQ -.->|watch changes| STORE
FQ -->|cordon| K8S

ND -.->|watch changes| STORE
ND -->|drain| K8S

FR -.->|watch changes| STORE
FR -->|create CRDs| K8S

HEA -.->|watch changes| STORE

LBL -->|update labels| K8S

K8SOM -.->|watch changes| K8S

Data Flow:

1. Health Monitors detect hardware/software faults and send events via gRPC to Platform Connectors
2. Platform Connectors validate, persist events to MongoDB, and update Kubernetes node conditions
3. Core Modules independently watch MongoDB change streams for relevant events
4. Modules interact with Kubernetes API to cordon, drain, label nodes, and create remediation CRDs
5. Labeler monitors pods to automatically label nodes with DCGM and driver versions

> Note: All modules operate independently without direct communication. Coordination happens through MongoDB change streams and Kubernetes API.
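
The decoupling described above, where modules never call each other and react only to changes in a shared store, can be sketched in a few lines of Go. This is a toy model under stated assumptions: `HealthEvent`, `Store`, and `RunModule` are hypothetical names, the in-memory `Store` stands in for MongoDB, and its `Watch` channel stands in for a change stream. It ignores persistence, resume tokens, and any ordering between modules.

```go
package main

import (
	"fmt"
	"sync"
)

// HealthEvent is a hypothetical, simplified event. The real NVSentinel
// schema is not shown in this README.
type HealthEvent struct {
	Node    string
	Monitor string
	Fatal   bool
}

// Store is an in-memory stand-in for the event database. Watch plays the
// role of a change stream: every subscriber receives every persisted event.
type Store struct {
	mu       sync.Mutex
	events   []HealthEvent
	watchers []chan HealthEvent
}

// Watch registers a subscriber. Only events persisted after the call are delivered.
func (s *Store) Watch(buffer int) <-chan HealthEvent {
	ch := make(chan HealthEvent, buffer)
	s.mu.Lock()
	s.watchers = append(s.watchers, ch)
	s.mu.Unlock()
	return ch
}

// Persist appends the event and notifies all watchers.
// It blocks if a watcher's buffer is full.
func (s *Store) Persist(e HealthEvent) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events = append(s.events, e)
	for _, ch := range s.watchers {
		ch <- e
	}
}

// Close ends all watch streams.
func (s *Store) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ch := range s.watchers {
		close(ch)
	}
	s.watchers = nil
}

// RunModule consumes a watch stream until it closes and records one action
// per fatal event. Each module sees the stream on its own, without talking
// to other modules.
func RunModule(action string, in <-chan HealthEvent) []string {
	var out []string
	for e := range in {
		if e.Fatal {
			out = append(out, action+" "+e.Node)
		}
	}
	return out
}

func main() {
	store := &Store{}
	modules := []string{"cordon", "drain"}
	results := make([][]string, len(modules))

	var wg sync.WaitGroup
	for i, name := range modules {
		in := store.Watch(8)
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			results[i] = RunModule(name, in)
		}(i, name)
	}

	// A connector-like producer persists events; node names are made up.
	store.Persist(HealthEvent{Node: "gpu-node-1", Monitor: "gpu", Fatal: true})
	store.Persist(HealthEvent{Node: "gpu-node-2", Monitor: "syslog", Fatal: false})
	store.Close()
	wg.Wait()

	fmt.Println(results) // prints [[cordon gpu-node-1] [drain gpu-node-1]]
}
```

Because each module owns its subscription, one can be added, removed, or restarted without changing the producer or the other modules, which is the property the note above describes.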

⚙️ Configuration

NVSentinel is highly…

