NVIDIA / aistore
AIStore: scalable storage for AI applications
AI Architecture Analysis
This repository is indexed by RepoMind. By analyzing NVIDIA/aistore in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.
Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.
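The difference between on-demand whole-file loading and chunk-based retrieval can be sketched in a few lines. Everything below (function names, the toy file contents) is hypothetical and only illustrates the general idea: a chunk retriever returns fixed-size fragments, while an on-demand loader returns complete files, so a definition and its surrounding context stay together.

```python
# Hypothetical sketch: whole-file, on-demand loading vs. chunked retrieval.

def chunk_retrieval(files: dict[str, str], query: str, chunk_size: int = 40) -> list[str]:
    """Chunk-based RAG style: split every file into fixed-size fragments,
    return only the fragments that mention the query."""
    chunks = [
        src[i:i + chunk_size]
        for src in files.values()
        for i in range(0, len(src), chunk_size)
    ]
    return [c for c in chunks if query in c]

def on_demand_load(files: dict[str, str], query: str) -> list[str]:
    """On-demand style: return each *complete* file that mentions the query,
    keeping its full context intact."""
    return [src for src in files.values() if query in src]

# Toy "repository" with made-up file names and contents.
files = {
    "proxy.go": "package main\nfunc redirect() {}\nfunc health() {}\n",
    "target.go": "package main\nfunc getObject() {}\n",
}

whole = on_demand_load(files, "redirect")   # the intact proxy.go source
parts = chunk_retrieval(files, "redirect")  # only matching fragments
```

The trade-off, in this toy form: fragments are cheaper to carry in context, but a whole file never cuts a function off mid-definition.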
Repository Overview (README excerpt)
**AIStore: High-Performance, Scalable Storage for AI Workloads**

AIStore (AIS) is a lightweight distributed storage stack tailored for AI applications. It's an elastic cluster that can grow and shrink at runtime and can be ad-hoc deployed, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size.

Built from scratch, AIS provides linear scale-out, consistent performance, and a flexible deployment model. AIS is a reliable storage cluster that can natively operate on both in-cluster and remote data, without treating either as a cache. AIS consistently shows balanced I/O distribution and linear scalability across an arbitrary number of clustered nodes. The system supports fast data access, reliability, and rich customization for data transformation workloads.

Features

• ✅ **Multi-Cloud Access:** Seamlessly access and manage content across multiple cloud backends (including AWS S3, GCS, Azure, and OCI), with fast-tier performance, configurable redundancy, and namespace-aware bucket identity (same-name buckets can coexist across accounts, endpoints, and providers).
• ✅ **Deploy Anywhere:** AIS runs on any Linux machine, virtual or physical. Deployment options range from a single Docker container and Google Colab to petascale Kubernetes clusters. There are no built-in limitations on deployment size or functionality.
• ✅ **High Availability:** Redundant control and data planes. Self-healing, end-to-end protection, n-way mirroring, and erasure coding. Arbitrary number of lightweight access points (AIS proxies).
• ✅ **HTTP-based API:** A feature-rich, native API (with user-friendly SDKs for Go and Python), and a compliant Amazon S3 API for running unmodified S3 clients.
• ✅ **Monitoring:** Comprehensive observability with integrated Prometheus metrics, Grafana dashboards, detailed logs with configurable verbosity, and CLI-based performance tracking for complete cluster visibility and troubleshooting. See AIStore Observability for details.
• ✅ **Chunked Objects:** High-performance chunked object representation, with independently retrievable chunks, metadata v2, and checksum-protected manifests. Supports rechunking, parallel reads, and seamless integration with Get-Batch, blob-downloader, and multipart uploads to supported cloud backends.
• ✅ **JWT Authentication and Authorization:** Validates request JWTs to provide cluster- and bucket-level access control, using static keys or dynamic OIDC issuer JWKS lookup.
• ✅ **Secure Redirects:** Configurable cryptographic signing of redirect URLs using HMAC-SHA256 with a versioned cluster key (distributed via metasync, stored in memory only).
• ✅ **Load-Aware Throttling:** Dynamic request throttling based on a multi-dimensional load vector (CPU, memory, disk, file descriptors, goroutines) to protect AIS clusters under stress.
• ✅ **Unified Namespace:** Attach AIS clusters together to provide unified access to datasets across independent clusters, allowing users to reference shared buckets with cluster-specific identifiers.
• ✅ **Turn-key Cache:** In addition to robust data protection features, AIS offers a per-bucket configurable LRU-based cache with eviction thresholds and storage capacity watermarks.
• ✅ **ETL Offload:** Execute I/O-intensive data transformations close to the data, either inline (on-the-fly, as part of each read request) or offline (batch processing, with the destination bucket populated with transformed results).
• ✅ **Get-Batch:** Retrieve multiple objects and/or archived files with a single call. Designed for ML/AI pipelines, Get-Batch fetches an entire training batch in one operation, assembling a TAR (or another supported serialization format) that contains all requested items in the exact user-specified order (paper).
• ✅ **Data Consistency:** Guaranteed consistency across all gateways, with write-through semantics in the presence of remote backends.
• ✅ **Serialization & Sharding:** Native, first-class support for TAR, TGZ, TAR.LZ4, and ZIP archives for efficient storage and processing of small-file datasets, with seamless integration into existing, unmodified workflows across all APIs and subsystems.
• ✅ **Kubernetes:** For production, AIS runs natively on Kubernetes. The dedicated ais-k8s repository includes the AIS K8s Operator, Ansible playbooks, Helm charts, and deployment guidance.
• ✅ **Batch Jobs:** More than 30 cluster-wide batch operations that you can start, monitor, and control.

> The feature set continues to grow and also includes: blob-downloader; a lightweight AuthN Service (Beta) to manage users and roles and generate JWTs; runtime management of TLS certificates; full support for adding/removing nodes at runtime; listing, copying, prefetching, and transforming virtual directories; executing presigned S3 requests; adaptive rate limiting; and more.

> For the original **white paper** and design philosophy, please see AIStore Overview, which also includes a high-level block diagram, terminology, APIs, CLI, and more.

> For our 2024 KubeCon presentation, please see AIStore: Enhancing petascale Deep Learning across Cloud backends.

CLI

AIS includes an integrated, scriptable CLI for managing clusters, buckets, and objects, running and monitoring batch jobs, viewing and downloading logs, generating performance reports, and more.

Developer Tools

AIS runs natively on Kubernetes and uses an open data format, so you are free to copy or move your data out of AIS at any time using familiar Linux tools.
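Because shards in the serialization features above are plain TAR (or TGZ/ZIP) archives, they stay readable with standard tools and libraries, with no AIS-specific format involved. A minimal stdlib sketch (the member names and payloads are made up) that writes a small shard in memory and reads its members back in order:

```python
import io
import tarfile

# Build a tiny "shard" in memory: three small files packed into one TAR,
# in a fixed, caller-specified order.
samples = [("a/img001.jpg", b"fake-jpeg-bytes"),
           ("a/img001.cls", b"7"),
           ("a/img002.jpg", b"more-fake-bytes")]

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in samples:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Read the shard back; member order matches the order of addition.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
    first = tar.extractfile("a/img001.jpg").read()
```

Since nothing here is AIS-specific, the same shard could equally be produced with `tar` on the command line or consumed by an unmodified data-loading pipeline.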
For developers and data scientists, there's also:

• Go API, used in the CLI and benchmarking tools
• Python SDK + Reference Guide
• PyTorch integration and usage examples
• Boto3 support

Quick Start

• Read the Getting Started Guide for a 5-minute local install, or
• Run a minimal AIS cluster consisting of a single gateway and a single storage node, or
• Clone the repo and run … followed by …

---------------------

Deployment options

AIS…
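Returning to the Quick Start above: once a local cluster is up, the native HTTP API can be exercised from any HTTP client. The stdlib sketch below only builds an object-GET URL; the `/v1/objects/...` path, the `provider` query parameter, the `localhost:8080` endpoint, and the bucket/object names are all assumptions to be checked against the current AIS API reference, not guaranteed specifics.

```python
from urllib.parse import quote, urlencode
# import urllib.request  # only needed to actually send the request

def object_url(endpoint: str, bucket: str, object_name: str,
               provider: str = "ais") -> str:
    """Build a GET URL for one object via the native API (path layout assumed)."""
    path = f"/v1/objects/{quote(bucket)}/{quote(object_name)}"
    query = urlencode({"provider": provider})
    return f"{endpoint}{path}?{query}"

url = object_url("http://localhost:8080", "my-bucket", "train/sample-0001.tar")
# With a running gateway, the object could then be fetched with, e.g.:
# data = urllib.request.urlopen(url).read()
```

The same request shape works from curl, Go, or any other client; the SDKs listed under Developer Tools wrap these calls for you.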