NVIDIA / NVSentinel
NVSentinel is a cross-platform fault remediation service designed to rapidly remediate runtime node-level issues in GPU-accelerated computing environments
Repository Overview (README excerpt)
NVSentinel

**GPU Fault Detection and Remediation for Kubernetes**

NVSentinel automatically detects, classifies, and remediates hardware and software faults in GPU nodes. It monitors GPU health, system logs, and cloud provider maintenance events, then takes action: cordoning faulty nodes, draining workloads, and triggering break-fix workflows.

> [!NOTE]
> **Beta / Stable**
> NVSentinel is ready for production testing and use. APIs, configurations, and features may change between releases. If you encounter issues, please open an issue or start a discussion.

🚀 Quick Start

Prerequisites
• Kubernetes 1.25+
• Helm 3.0+
• NVIDIA GPU Operator (includes DCGM for GPU monitoring)

Installation

✨ Key Features
• **🔍 Comprehensive Monitoring**: Real-time detection of GPU, NVSwitch, and system-level failures
• **🔧 Automated Remediation**: Intelligent fault handling with cordon, drain, and break-fix workflows
• **📦 Modular Architecture**: Pluggable health monitors with standardized gRPC interfaces
• **🔄 High Availability**: Kubernetes-native design with replica support and leader election
• **⚡ Real-time Processing**: Event-driven architecture with immediate fault response
• **📊 Persistent Storage**: MongoDB-based event store with change streams for real-time updates
• **🛡️ Graceful Handling**: Coordinated workload eviction with configurable timeouts
• **🏷️ Metadata Enrichment**: Automatic augmentation of health events with cloud provider and node metadata information

🧪 Complete Setup Guide

For a full installation with all dependencies, follow these steps:
• Install cert-manager (for TLS)
• Install Prometheus (for metrics)
• Install NVSentinel
• Verify Installation

> **Testing**: The example above uses default settings. For production, customize values for your environment.

> **Production**: By default, only health monitoring is enabled. Enable fault quarantine and remediation modules via Helm values. See Configuration below.
🎮 Try the Demo

Want to see NVSentinel in action without GPU hardware? Try our **Local Fault Injection Demo**:
• 🚀 **5-minute setup** - runs entirely in a local KIND cluster
• 🔍 **Real pipeline** - see fault detection → quarantine → node cordon
• 🎯 **No GPU required** - uses simulated DCGM for testing

Perfect for learning, presentations, or CI/CD testing!

🏗️ Architecture

NVSentinel follows a microservices architecture with modular health monitors and core processing modules:

**Data Flow**:
• **Health Monitors** detect hardware/software faults and send events via gRPC to Platform Connectors
• **Platform Connectors** validate, persist events to MongoDB, and update Kubernetes node conditions
• **Core Modules** independently watch MongoDB change streams for relevant events
• **Modules** interact with Kubernetes API to cordon, drain, label nodes, and create remediation CRDs
• **Labeler** monitors pods to automatically label nodes with DCGM and driver versions

> **Note**: All modules operate independently without direct communication. Coordination happens through MongoDB change streams and Kubernetes API.

⚙️ Configuration

NVSentinel is highly configurable with options for each module. For complete configuration documentation, see the **Helm Chart README**.

Quick Configuration Overview

**Configuration Resources**:
• **Helm Chart Configuration Guide**: Complete configuration reference
• **values-full.yaml**: Detailed reference with all options
• **values.yaml**: Default values

📦 Module Details

For detailed module configuration, see the **Helm Chart Configuration Guide**.
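To make the data flow from the Architecture section more concrete, here is a minimal in-memory sketch in Python. It is not NVSentinel code: `HealthEvent`, `EventStore`, `FakeCluster`, `PlatformConnector.receive`, and `FaultQuarantine.on_event` are hypothetical names invented for illustration, the event fields are assumed, and a plain callback list stands in for MongoDB change streams. It only illustrates the idea that modules never call each other and coordinate through the event store and the cluster state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class HealthEvent:
    # Hypothetical fields; the real event schema is not shown in this README.
    node: str
    check: str
    fatal: bool


class EventStore:
    """In-memory stand-in for MongoDB with change streams."""

    def __init__(self) -> None:
        self.events: List[HealthEvent] = []
        self._watchers: List[Callable[[HealthEvent], None]] = []

    def watch(self, callback: Callable[[HealthEvent], None]) -> None:
        # Modules register here instead of talking to each other.
        self._watchers.append(callback)

    def insert(self, event: HealthEvent) -> None:
        self.events.append(event)
        for notify in self._watchers:
            notify(event)


class FakeCluster:
    """Stand-in for the Kubernetes API: tracks which nodes are cordoned."""

    def __init__(self, nodes: List[str]) -> None:
        self.cordoned: Dict[str, bool] = {name: False for name in nodes}

    def cordon(self, node: str) -> None:
        self.cordoned[node] = True


class PlatformConnector:
    """Validates incoming events and persists them to the store."""

    def __init__(self, store: EventStore, cluster: FakeCluster) -> None:
        self.store = store
        self.cluster = cluster

    def receive(self, event: HealthEvent) -> bool:
        # Assumed validation rule: reject events for unknown nodes.
        if event.node not in self.cluster.cordoned:
            return False
        self.store.insert(event)
        return True


class FaultQuarantine:
    """Watches the store and cordons nodes that report fatal events."""

    def __init__(self, store: EventStore, cluster: FakeCluster) -> None:
        self.cluster = cluster
        store.watch(self.on_event)

    def on_event(self, event: HealthEvent) -> None:
        # A real deployment would evaluate configurable rules here.
        if event.fatal:
            self.cluster.cordon(event.node)


def run_demo() -> Tuple[Dict[str, bool], int]:
    cluster = FakeCluster(["gpu-node-1", "gpu-node-2"])
    store = EventStore()
    FaultQuarantine(store, cluster)
    connector = PlatformConnector(store, cluster)
    connector.receive(HealthEvent("gpu-node-1", "xid", fatal=True))
    connector.receive(HealthEvent("gpu-node-2", "thermal", fatal=False))
    connector.receive(HealthEvent("unknown-node", "xid", fatal=True))
    return cluster.cordoned, len(store.events)
```

Calling `run_demo()` sends three events through the connector: the fatal event cordons its node, the non-fatal event is stored without any action, and the event for an unknown node is rejected before it reaches the store. Adding another module in this sketch means registering one more watcher on the store, with no changes to the connector or to the quarantine module.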
🔍 Health Monitors
• **GPU Health Monitor**: Monitors GPU hardware health via DCGM - detects thermal issues, ECC errors, and XID events
• **Syslog Health Monitor**: Analyzes system logs for hardware and software fault patterns via journalctl
• **CSP Health Monitor**: Integrates with cloud provider APIs (GCP/AWS) for maintenance events
• **Kubernetes Object Monitor**: Policy-based monitoring for any Kubernetes resource using CEL expressions

🏗️ Core Modules
• **Platform Connectors**: Receives health events from monitors via gRPC, persists to MongoDB, and updates Kubernetes node status
• **Fault Quarantine**: Watches MongoDB for health events and cordons nodes based on configurable CEL rules
• **Node Drainer**: Gracefully evicts workloads from cordoned nodes with per-namespace eviction strategies
• **Fault Remediation**: Triggers external break-fix systems by creating maintenance CRDs after drain completion
• **Janitor**: Executes node reboots and terminations via cloud provider APIs
• **Health Events Analyzer**: Analyzes event patterns and generates recommended actions
• **Event Exporter**: Streams health events to external systems in CloudEvents format
• **MongoDB Store**: Persistent storage for health events with real-time change streams
• **Labeler**: Automatically labels nodes with DCGM and driver versions for self-configuration
• **Metadata Collector**: Gathers GPU and NVSwitch topology information
• **Log Collection**: Collects diagnostic logs and GPU reports for troubleshooting

📋 Requirements
• **Kubernetes**: 1.25 or later
• **Helm**: 3.0 or later
• **NVIDIA GPU Operator**: For GPU monitoring capabilities (includes DCGM)
• **Storage**: Persistent storage for MongoDB (recommended 10GB+)
• **Network**: Cluster networking for inter-service communication

🤝 Contributing

We welcome contributions!
Here's how to get started:

**Ways to Contribute**:
• 🐛 Report bugs and request features via issues
• 🧭 See what we're working on in the roadmap
• 📝 Improve documentation
• 🧪 Add tests and increase coverage
• 🔧 Submit pull requests to fix issues
• 💬 Help others in discussions

**Getting Started**:
• Read the Contributing Guide for guidelines
• Check the Development Guide for setup instructions
• Browse open issues for opportunities

All contributors must sign their commits (DCO). See the contributing guide for details.

💬 Support
• 🐛 **Bug Reports**: Create an issue
• ❓ **Questi…