NVIDIA/NVSentinel
Description: NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments
Language: Go
License: Apache-2.0
Stars: 315
Forks: 86
Open issues: 52
Created: 2025-09-16T12:43:23Z
Pushed: 2026-06-11T03:01:04Z
Default branch: main
Fork: no
Archived: no
README:
NVSentinel
GPU Fault Detection and Remediation for Kubernetes
NVSentinel automatically detects, classifies, and remediates hardware and software faults in GPU nodes. It monitors GPU health, system logs, and cloud provider maintenance events, then takes action: cordoning faulty nodes, draining workloads, and triggering break-fix workflows.
> [!NOTE]
> Beta / Stable
> NVSentinel is ready for production testing and use. APIs, configurations, and features may change between releases. If you encounter issues, please open an issue or start a discussion.
🚀 Quick Start
Prerequisites
- Kubernetes 1.25+
- Helm 3.0+
- NVIDIA GPU Operator (includes DCGM for GPU monitoring)
Installation
NVSENTINEL_VERSION=v1.9.0

# Install from GitHub Container Registry
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version "$NVSENTINEL_VERSION" \
  --namespace nvsentinel \
  --create-namespace

# View chart information
helm show chart oci://ghcr.io/nvidia/nvsentinel --version "$NVSENTINEL_VERSION"
✨ Key Features
- 🔍 Comprehensive Monitoring: Real-time detection of GPU, NVSwitch, and system-level failures
- 🔧 Automated Remediation: Intelligent fault handling with cordon, drain, and break-fix workflows
- 📦 Modular Architecture: Pluggable health monitors with standardized gRPC interfaces
- 🔄 High Availability: Kubernetes-native design with replica support and leader election
- ⚡ Real-time Processing: Event-driven architecture with immediate fault response
- 📊 Persistent Storage: MongoDB-based event store with change streams for real-time updates
- 🛡️ Graceful Handling: Coordinated workload eviction with configurable timeouts
- 🏷️ Metadata Enrichment: Automatic augmentation of health events with cloud provider and node metadata
🧪 Complete Setup Guide
For a full installation with all dependencies, follow these steps:
1. Install cert-manager (for TLS)
helm repo add jetstack https://charts.jetstack.io --force-update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.19.1 --set installCRDs=true \
  --wait
2. Install Prometheus (for metrics)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts --force-update
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.enabled=true \
  --set alertmanager.enabled=false \
  --set grafana.enabled=false \
  --set kubeStateMetrics.enabled=false \
  --set nodeExporter.enabled=false \
  --wait
3. Install NVSentinel
NVSENTINEL_VERSION=v1.9.0
helm upgrade --install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --namespace nvsentinel --create-namespace \
  --version "$NVSENTINEL_VERSION" \
  --timeout 15m \
  --wait
4. Verify Installation
kubectl get pods -n nvsentinel
kubectl get nodes  # Verify GPU nodes are visible

# Run comprehensive validation
./scripts/validate-nvsentinel.sh --version "$NVSENTINEL_VERSION" --verbose
> Testing: The example above uses default settings. For production, customize values for your environment.
> Production: By default, only health monitoring is enabled. Enable fault quarantine and remediation modules via Helm values. See [Configuration](#-configuration) below.
🎮 Try the Demo
Demo Videos
See NVSentinel in action — click any thumbnail to watch:
End-to-End
Custom Health Monitors
Custom Drain Plugins
Extensible Remediation
Health Events Analyzer
See the [demos directory](demos/) for full descriptions.
Run It Locally
Want to try NVSentinel without GPU hardware? Run our [Local Fault Injection Demo](demos/local-fault-injection-demo/README.md):
- 🚀 5-minute setup - runs entirely in a local KIND cluster
- 🔍 Real pipeline - see fault detection → quarantine → node cordon
- 🎯 No GPU required - uses simulated DCGM for testing
cd demos/local-fault-injection-demo
make demo  # Automated: creates cluster, installs NVSentinel, injects fault, verifies cordon
Perfect for learning, presentations, or CI/CD testing!
🏗️ Architecture
NVSentinel follows a microservices architecture with modular health monitors and core processing modules:
graph LR
subgraph "Health Monitors"
GPU["GPU Health Monitor<br/>(DCGM Integration)"]
SYS["Syslog Health Monitor<br/>(Journalctl)"]
CSP["CSP Health Monitor<br/>(CSP APIs)"]
K8SOM["Kubernetes Object Monitor<br/>(CEL Policies)"]
end
subgraph "Core Processing"
PC["Platform Connectors<br/>(gRPC Server)"]
STORE[("MongoDB Store<br/>(Event Database)")]
FQ["Fault Quarantine<br/>(Node Cordon)"]
ND["Node Drainer<br/>(Workload Eviction)"]
FR["Fault Remediation<br/>(Break-Fix Integration)"]
HEA["Health Events Analyzer<br/>(Pattern Analysis)"]
LBL["Labeler<br/>(Node Labels)"]
end
subgraph "Kubernetes Cluster"
K8S["Kubernetes API<br/>(Nodes, Pods, Events)"]
end
GPU -->|gRPC| PC
SYS -->|gRPC| PC
CSP -->|gRPC| PC
K8SOM -->|gRPC| PC
PC -->|persist| STORE
PC -->|update status| K8S
FQ -.->|watch changes| STORE
FQ -->|cordon| K8S
ND -.->|watch changes| STORE
ND -->|drain| K8S
FR -.->|watch changes| STORE
FR -->|create CRDs| K8S
HEA -.->|watch changes| STORE
LBL -->|update labels| K8S
K8SOM -.->|watch changes| K8S

Data Flow:
1. Health Monitors detect hardware/software faults and send events via gRPC to Platform Connectors
2. Platform Connectors validate, persist events to MongoDB, and update Kubernetes node conditions
3. Core Modules independently watch MongoDB change streams for relevant events
4. Modules interact with Kubernetes API to cordon, drain, label nodes, and create remediation CRDs
5. Labeler monitors pods to automatically label nodes with DCGM and driver versions
> Note: All modules operate independently without direct communication. Coordination happens through MongoDB change streams and Kubernetes API.
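The decoupling described in this note can be illustrated with a small Go sketch. It is an in-memory analogy only, assuming nothing about NVSentinel's actual code: `HealthEvent`, `Store`, `Watch`, `Insert`, and `runModule` are hypothetical names, and Go channels stand in for MongoDB change streams. Each module receives every persisted event from the store and never talks to another module directly.

```go
package main

import (
	"fmt"
	"sync"
)

// HealthEvent is a hypothetical, simplified event. The real schema is not
// shown in this README.
type HealthEvent struct {
	Node   string
	Source string
	Fatal  bool
}

// Store is an in-memory stand-in for the event store. Watch mimics
// subscribing to a change stream.
type Store struct {
	mu       sync.Mutex
	events   []HealthEvent
	watchers []chan HealthEvent
}

// Watch registers a subscriber that sees every event inserted afterwards.
func (s *Store) Watch(buffer int) <-chan HealthEvent {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan HealthEvent, buffer)
	s.watchers = append(s.watchers, ch)
	return ch
}

// Insert persists the event, then notifies every watcher.
func (s *Store) Insert(e HealthEvent) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events = append(s.events, e)
	for _, ch := range s.watchers {
		ch <- e
	}
}

// Close ends all streams so that watching modules can return.
func (s *Store) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ch := range s.watchers {
		close(ch)
	}
	s.watchers = nil
}

// runModule consumes one stream and returns the nodes it would act on.
// In this sketch a module reacts only to fatal events.
func runModule(stream <-chan HealthEvent) []string {
	var nodes []string
	for e := range stream {
		if e.Fatal {
			nodes = append(nodes, e.Node)
		}
	}
	return nodes
}

func main() {
	store := &Store{}
	streams := []<-chan HealthEvent{store.Watch(8), store.Watch(8)}
	names := []string{"quarantine", "drainer"}
	results := make([][]string, len(streams))

	var wg sync.WaitGroup
	for i, stream := range streams {
		wg.Add(1)
		go func(i int, stream <-chan HealthEvent) {
			defer wg.Done()
			results[i] = runModule(stream)
		}(i, stream)
	}

	// The platform connector role: persist events received from monitors.
	store.Insert(HealthEvent{Node: "gpu-node-1", Source: "gpu-health-monitor", Fatal: true})
	store.Insert(HealthEvent{Node: "gpu-node-2", Source: "syslog-health-monitor", Fatal: false})
	store.Close()
	wg.Wait()

	for i, name := range names {
		fmt.Println(name, "acted on:", results[i])
	}
}
```

Because each module only reads from its own stream, a module can be added, removed, or restarted without any other module changing, which is the property the note attributes to the change-stream design.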
⚙️ Configuration
NVSentinel is highly…
Notability: 6.0/10. New NVIDIA repo, moderate stars.