
ai-dynamo / dynamo

A Datacenter Scale Distributed Inference Serving Framework

6,292 stars
925 forks
519 issues
Rust · Python · Go

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing ai-dynamo/dynamo in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.
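The on-demand, whole-file loading idea can be sketched with a toy function. This is a hypothetical illustration of the CAG concept only, not RepoMind's actual implementation; the function name, the in-memory `files` mapping, and the selection heuristic are all invented for the example.

```python
# Hypothetical sketch of "load whole files on demand" (illustrative only):
# rather than retrieving pre-chunked snippets, entire source files are
# pulled into the model context when a question mentions them.

def build_context(question: str, files: dict[str, str], max_chars: int = 100_000) -> str:
    """Concatenate whole files whose base name appears in the question."""
    parts, total = [], 0
    for name in sorted(files):
        stem = name.rsplit("/", 1)[-1].split(".", 1)[0]   # "src/router.py" -> "router"
        if stem in question and total + len(files[name]) <= max_chars:
            parts.append(f"### {name}\n{files[name]}")    # whole file, no chunking
            total += len(files[name])
    return "\n\n".join(parts)
```

Because each selected file arrives intact, the model never sees a function split across retrieval chunks, which is the fragmentation problem the paragraph above refers to.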

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/ai-dynamo/dynamo)

Repository Overview (README excerpt)


| **Docs** | **Roadmap** | **Recipes** | **Examples** | **Prebuilt Containers** | **Blog** | **Design Proposals** |

# Dynamo

**The open-source, datacenter-scale inference stack.**

Dynamo is the orchestration layer above inference engines: it doesn't replace SGLang, TensorRT-LLM, or vLLM, it turns them into a coordinated multi-node inference system. Disaggregated serving, intelligent routing, multi-tier KV caching, and automatic scaling work together to maximize throughput and minimize latency for LLM, reasoning, multimodal, and video generation workloads. Built in Rust for performance, Python for extensibility.

## When to use Dynamo

- You're serving LLMs across **multiple GPUs or nodes** and need to coordinate them
- You want **KV-aware routing** to avoid redundant prefill computation
- You need to **independently scale prefill and decode** (disaggregated serving)
- You want **automatic scaling** that meets latency SLAs at minimum total cost of ownership (TCO)
- You need **fast cold starts** when spinning up new replicas

If you're running a single model on a single GPU, your inference engine alone is probably sufficient.

**Feature support at a glance:**

| | SGLang | TensorRT-LLM | vLLM |
|---|:----:|:----------:|:--:|
| **Disaggregated Serving** | ✅ | ✅ | ✅ |
| **KV-Aware Routing** | ✅ | ✅ | ✅ |
| **SLA-Based Planner** | ✅ | ✅ | ✅ |
| **KVBM** | 🚧 | ✅ | ✅ |
| **Multimodal** | ✅ | ✅ | ✅ |
| **Tool Calling** | ✅ | ✅ | ✅ |

> **Full Feature Matrix →** covers LoRA, request migration, speculative decoding, and feature interactions.
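The KV-aware routing idea (route on KV cache overlap, discounted by worker load) can be sketched with a toy scorer. This is a conceptual sketch, not Dynamo's actual router API; the `Worker` type, the scoring formula, and the `load_penalty` weight are all invented for illustration.

```python
# Toy sketch of KV-aware routing (illustrative, not the Dynamo API):
# score each worker by how much of the request's token prefix is already
# in its KV cache, discounted by its current load, and pick the best.

from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    active_requests: int
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)

def prefix_overlap(tokens: tuple[int, ...], cached: tuple[int, ...]) -> int:
    """Number of leading tokens shared between a request and a cached prefix."""
    n = 0
    for a, b in zip(tokens, cached):
        if a != b:
            break
        n += 1
    return n

def route(tokens: tuple[int, ...], workers: list[Worker],
          load_penalty: float = 2.0) -> Worker:
    """Pick the worker with the best (overlap - load) score.

    High overlap means less redundant prefill; a high active-request
    count means more queueing delay, so it is subtracted.
    """
    def score(w: Worker) -> float:
        best = max((prefix_overlap(tokens, c) for c in w.cached_prefixes), default=0)
        return best - load_penalty * w.active_requests
    return max(workers, key=score)
```

Under this kind of scoring, a request whose prompt shares a long prefix with a worker's cache is steered to that worker, which is the mechanism behind the "avoid redundant prefill" claim above.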
## Key Results

| Result | Context |
|--------|---------|
| **7x** higher throughput per GPU | DeepSeek R1 on GB200 NVL72 w/ Dynamo vs B200 without (InferenceX) |
| **7x** faster model startup | ModelExpress weight streaming (DeepSeek-V3 on H200) |
| **2x** faster time to first token | KV-aware routing, Qwen3-Coder 480B (Baseten benchmark) |
| **80%** fewer SLA breaches | Planner autoscaling at 5% lower TCO (Alibaba APSARA 2025 @ 2:50:00) |
| **750x** higher throughput | DeepSeek-R1 on GB300 NVL72 (InferenceXv2) |

## What Dynamo Does

Most inference engines optimize a single GPU or a single node. Dynamo is the **orchestration layer above them**: it turns a cluster of GPUs into a coordinated inference system.

**Architecture Deep Dive →**

## Core Capabilities

| Capability | What it does | Why it matters |
|------------|-------------|----------------|
| **Disaggregated Prefill/Decode** | Separates prefill and decode into independently scalable GPU pools | Maximizes GPU utilization; each phase runs on hardware tuned for its workload |
| **KV-Aware Routing** | Routes requests based on worker load and KV cache overlap | Eliminates redundant prefill computation; 2x faster TTFT |
| **KV Block Manager (KVBM)** | Offloads KV cache across GPU → CPU → SSD → remote storage | Extends effective context length beyond GPU memory |
| **ModelExpress** | Streams model weights GPU-to-GPU via NIXL/NVLink | 7x faster cold start for new replicas |
| **Planner** | SLA-driven autoscaler that profiles workloads and right-sizes pools | Meets latency targets at minimum total cost of ownership (TCO) |
| **Grove** | K8s operator for topology-aware gang scheduling (NVL72) | Places workloads optimally across racks, hosts, and NUMA nodes |
| **AIConfigurator** | Simulates 10K+ deployment configs in seconds | Finds optimal serving config without burning GPU-hours |
| **Fault Tolerance** | Canary health checks + in-flight request migration | Workers fail; user requests don't |

## New in 1.0

- **Zero-config deploy (DGDR)** *(beta)*: Specify model, HW, and SLA in one YAML; AIConfigurator auto-profiles the workload, Planner optimizes the topology, and Dynamo deploys
- **Agentic inference:** Per-request hints for latency priority, expected output length, and cache pinning TTL. LangChain + NeMo Agent Toolkit integrations
- **Multimodal E/P/D:** Disaggregated encode/prefill/decode with embedding cache; 30% faster TTFT on image workloads
- **Video generation:** Native FastVideo + SGLang Diffusion support; real-time 1080p on a single B200
- **K8s Inference Gateway plugin:** KV-aware routing inside the standard Kubernetes gateway
- **Storage-tier KV offload:** S3/Azure blob support + global KV events for cluster-wide cache visibility

## Quick Start

### Option A: Container (fastest)

Also available: and .

### Option B: Install from PyPI

Then start the frontend and a worker as shown above. See the full installation guide for system dependencies and backend-specific notes.

### Option C: Kubernetes (recommended)

For production multi-node clusters, install the Dynamo Platform and deploy with a single manifest:

Pre-built recipes for common models:

| Model | Framework | Mode | Recipe |
|-------|-----------|------|--------|
| Llama-3-70B | vLLM | Aggregated | View |
| DeepSeek-R1 | SGLang | Disaggregated | View |
| Qwen3-32B-FP8 | TensorRT-LLM | Aggregated | View |

See recipes/ for the full list. Cloud-specific guides: AWS EKS · Google GKE

## Building from Source

For contributors who want to build and develop locally. See the full build guide for details.

> VSCode/Cursor users: see the for a pre-configured dev environment.

## Community and Contributing

Dynamo is built in the open with an OSS-first development model. We welcome contributions of all kinds.
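The Planner's SLA-driven right-sizing of disaggregated pools can be illustrated with a toy capacity calculation. This is a sketch under assumed profiling numbers, not Dynamo's actual Planner; the function name, its parameters, and the linear throughput model are all invented for the example.

```python
# Toy sketch of SLA-driven pool sizing (illustrative, not Dynamo's Planner):
# given profiled per-GPU throughput for prefill and decode, compute how many
# GPUs each disaggregated pool needs to sustain a target request rate.

import math

def size_pools(req_per_s: float,
               prefill_tokens_per_req: float,
               decode_tokens_per_req: float,
               prefill_tok_per_s_per_gpu: float,
               decode_tok_per_s_per_gpu: float) -> tuple[int, int]:
    """Return (prefill_gpus, decode_gpus) needed at the offered load."""
    prefill_load = req_per_s * prefill_tokens_per_req   # prompt tokens/s
    decode_load = req_per_s * decode_tokens_per_req     # generated tokens/s
    return (math.ceil(prefill_load / prefill_tok_per_s_per_gpu),
            math.ceil(decode_load / decode_tok_per_s_per_gpu))
```

Because prefill and decode are sized independently, a prompt-heavy workload grows only the prefill pool and a generation-heavy workload grows only the decode pool, which is the cost argument for disaggregated serving made above.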
- **Contribution Guide**: How to contribute code, docs, and recipes
- **Design Proposals**: RFCs for major features
- **Office Hours**: Biweekly community calls
- **Discord**: Chat with the team and community
- **Dynamo Day Recordings**: Deep dives from production users

## Latest News

- [03/15] Dynamo 1.0 is here: production-ready with strong community adoption
- [03/15] NVIDIA Blackwell Ultra sets new inference records in MLPerf
- [03/15] NVIDIA Blackwell leads on SemiAnalysis InferenceMax benchmarks
- [12/05] Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200
- [12/02] Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo
- [11/20] …