
NVIDIA / aicr

Tooling for optimized, validated, and reproducible GPU-accelerated AI runtime in Kubernetes

View on GitHub
138 stars
16 forks
17 issues
Go · YAML · Shell

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing NVIDIA/aicr in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/NVIDIA/aicr)

Repository Overview (README excerpt)


# NVIDIA AI Cluster Runtime

AI Cluster Runtime (AICR) makes it easy to stand up GPU-accelerated Kubernetes clusters. It captures known-good combinations of drivers, operators, kernels, and system configurations and publishes them as version-locked **recipes** — reproducible artifacts for Helm, ArgoCD, and other deployment frameworks.

## Why We Built This

Running GPU-accelerated Kubernetes clusters reliably is hard. Small differences in kernel versions, drivers, container runtimes, operators, and Kubernetes releases can cause failures that are difficult to diagnose and expensive to reproduce. Historically, this knowledge has lived in internal validation pipelines and runbooks. AI Cluster Runtime makes it available to everyone.

Every AICR recipe is:

- **Optimized** — Tuned for a specific combination of hardware, cloud, OS, and workload intent.
- **Validated** — Passes automated constraint and compatibility checks before publishing.
- **Reproducible** — Same inputs produce identical deployments every time.

## Quick Start

Install and generate your first recipe in under two minutes. The generated directory contains per-component Helm charts with values files, checksums, and deployer configs. Deploy directly, commit to a GitOps repo, or use the built-in ArgoCD deployer.

See the Installation Guide for manual installation, building from source, and container images.

## Features

| Feature | Description |
|---------|-------------|
| **CLI** | Single binary. Generate recipes, create bundles, capture snapshots, validate configs. |
| **API Server** | REST API with the same capabilities as the CLI. Run in-cluster for CI/CD integration or air-gapped environments. |
| **Snapshot Agent** | Kubernetes Job that captures live cluster state (GPU hardware, drivers, OS, operators) into a ConfigMap for validation against recipes. |
| **Supply Chain Security** | SLSA Level 3 provenance, signed SBOMs, image attestations (cosign), and checksum verification on every release. |
## Supported Components

| Dimension | This Release |
|-----------|--------------|
| **Kubernetes** | Amazon EKS, Azure AKS (1.34+), GKE, self-managed (Kind) |
| **GPUs** | NVIDIA H100, GB200 |
| **OS** | Ubuntu |
| **Workloads** | Training (Kubeflow), Inference (Dynamo) |
| **Components** | GPU Operator, Network Operator, cert-manager, Prometheus stack, etc. |

See the full Component Catalog for every component that can appear in a recipe. Don't see what you need? Open an issue — that feedback directly shapes what gets validated next.

## How It Works

A **recipe** is a version-locked configuration for a specific environment. You describe your target (cloud, GPU, OS, workload intent), and the recipe engine matches it against a library of validated **overlays** — layered configurations that compose bottom-up from base defaults through cloud, accelerator, OS, and workload-specific tuning.

The **bundler** materializes a recipe into deployment-ready artifacts: one folder per component, each with Helm values, checksums, and a README. The **validator** compares a recipe against a live cluster snapshot and flags anything out of spec. This separation means the same validated configuration works whether you deploy with Helm, ArgoCD, Flux, or a custom pipeline.

## What AI Cluster Runtime Is Not

- Not a Kubernetes distribution
- Not a cluster provisioner or lifecycle management system
- Not a managed control plane or hosted service
- Not a replacement for your cloud provider or OEM platform

You bring your cluster and your tools. AI Cluster Runtime tells you what should be installed and how it should be configured.

## Documentation

Choose the path that matches how you'll use the project.
### User — Platform and Infrastructure Operators

- **Installation Guide** — Install the CLI (automated script, manual, or build from source)
- **CLI Reference** — Complete command reference with examples
- **API Reference** — REST API quick start
- **Agent Deployment** — Deploy the Kubernetes agent for automated snapshots
- **Component Catalog** — Every component that can appear in a recipe

### Contributor — Developers and Maintainers

- **Contributing Guide** — Development setup, testing, and PR process
- **Development Guide** — Local development, Make targets, and tooling
- **Architecture Overview** — System design and components
- **Bundler Development** — How to create new bundlers
- **Data Architecture** — Recipe data model and query matching
- **Agent Instructions** — Coding-agent guidance for Codex/Copilot

### Integrator — Automation and Platform Engineers

- **API Reference** — REST API endpoints and usage examples
- **Data Flow** — Understanding snapshots, recipes, and bundles
- **Automation Guide** — CI/CD integration patterns
- **Kubernetes Deployment** — Self-hosted API server setup
- **Recipe Development** — Adding and modifying recipe metadata

### Resources

- **Roadmap** — Feature priorities and development timeline
- **Security** — Supply chain security, vulnerability reporting, and verification
- **Releases** — Binaries, SBOMs, and attestations
- **Issues** — Bugs, feature requests, and questions

## Contributing

AI Cluster Runtime is Apache 2.0. Contributions are welcome: new recipes for environments we haven't covered (OpenShift, AKS, bare metal), additional bundler formats, validation checks, or bug reports. See CONTRIBUTING.md for development setup and the PR process.
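To make the bottom-up overlay composition described under "How It Works" concrete, here is a minimal Go sketch. It is an illustration of the layering idea only — the layer contents, key names, and `merge` function are all invented for this example and are not AICR's actual data model.

```go
package main

import "fmt"

// merge composes overlay layers bottom-up: each later layer overrides
// keys set by earlier layers, so workload-specific tuning wins over
// accelerator, cloud, and base defaults. This is a sketch of the
// layering concept, not AICR's real overlay engine.
func merge(layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	// All layer contents below are hypothetical.
	base := map[string]string{"driver": "generic", "mig": "disabled"}
	cloud := map[string]string{"cni": "cloud-default"}
	accelerator := map[string]string{"driver": "h100-tuned"}
	workload := map[string]string{"mig": "enabled"}

	recipe := merge(base, cloud, accelerator, workload)
	fmt.Println(recipe["driver"], recipe["mig"], recipe["cni"])
	// prints: h100-tuned enabled cloud-default
}
```

Because composition is deterministic — the same ordered layers always yield the same merged map — the same target description reproduces the same recipe, which is what makes the artifacts version-locked and reproducible.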