kvcache-ai / Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Repository Overview (README excerpt)
A KVCache-centric Disaggregated Architecture for LLM Serving

Paper | Slides | Traces | Technical Report | Blog | Slack

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Both the Transfer Engine and Mooncake Store are now open-sourced! This repository also hosts the technical report and the open-sourced traces.

🔄 Updates

- **Mar 5, 2026**: LightX2V now supports disaggregated deployment based on Mooncake, enabling encoder/transformer service decoupling with the Mooncake Transfer Engine for high-performance cross-device and cross-machine data transfer.
- **Feb 25, 2026**: SGLang merged the Encoder Global Cache Manager, introducing a Mooncake-powered global multimodal embedding cache that enables cross-instance sharing of ViT embeddings to avoid redundant GPU computation.
- **Feb 24, 2026**: vLLM-Omni introduces disaggregated inference connectors for multi-node omni-modality pipelines.
- **Feb 12, 2026**: Mooncake joins the PyTorch Ecosystem! We are thrilled to announce that Mooncake has officially joined the PyTorch Ecosystem.
- **Jan 28, 2026**: FlexKV, a distributed KV store and cache system from Tencent and NVIDIA in collaboration with the community, now supports distributed KVCache reuse with the Mooncake Transfer Engine.
- **Dec 27, 2025**: Collaboration with ROLL! Check out the paper here.
- **Dec 23, 2025**: SGLang introduces Encode-Prefill-Decode (EPD) Disaggregation with Mooncake as a transfer backend. This integration decouples compute-intensive multimodal encoders (e.g., Vision Transformers) from language-model nodes, using Mooncake's RDMA engine for zero-copy transfer of large multimodal embeddings.
- **Dec 19, 2025**: The Mooncake Transfer Engine has been integrated into TensorRT LLM for KVCache transfer in PD-disaggregated inference.
- **Dec 19, 2025**: The Mooncake Transfer Engine has been integrated directly into vLLM v1 as a KV Connector in PD-disaggregated setups.
- **Nov 07, 2025**: RBG + SGLang HiCache + Mooncake: a role-based, out-of-the-box solution for cloud-native deployment that is elastic, scalable, and high-performance.
- **Sept 18, 2025**: Mooncake Store empowers vLLM Ascend by serving as the distributed KV cache pool backend.
- **Sept 10, 2025**: SGLang officially supports Mooncake Store as a hierarchical KV caching storage backend. The integration extends RadixAttention with multi-tier KV cache storage across device, host, and remote storage layers.
- **Sept 10, 2025**: The official, high-performance version of Mooncake P2P Store is open-sourced as checkpoint-engine. It has been successfully applied in K1.5 and K2 production training, updating the Kimi-K2 model (1T parameters) across thousands of GPUs in ~20s.
- **Aug 23, 2025**: The xLLM high-performance inference engine builds hybrid KV cache management on Mooncake, supporting global KV cache management with intelligent offloading and prefetching.
- **Aug 18, 2025**: vLLM-Ascend integrates the Mooncake Transfer Engine for KV cache registration and disaggregated prefill, enabling efficient distributed inference on Ascend NPUs.
- **Jul 20, 2025**: Mooncake powers the deployment of Kimi K2 on 128 H200 GPUs with PD disaggregation and large-scale expert parallelism, achieving 224k tokens/sec prefill throughput and 288k tokens/sec decode throughput.
- **Jun 20, 2025**: Mooncake becomes a PD disaggregation backend for LMDeploy.
- **May 9, 2025**: NIXL officially supports the Mooncake Transfer Engine as a backend plugin.
- **May 8, 2025**: Mooncake x LMCache unite to pioneer a KVCache-centric LLM serving system.
- **May 5, 2025**: Supported by the Mooncake team, SGLang released guidance for deploying DeepSeek with PD Disaggregation on 96 H100 GPUs.
- **Apr 22, 2025**: LMCache officially supports Mooncake Store as a remote connector.
- **Apr 10, 2025**: SGLang officially supports the Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.
- **Mar 7, 2025**: We open-sourced Mooncake Store, a distributed KVCache based on the Transfer Engine. vLLM's xPyD disaggregated prefilling & decoding based on Mooncake Store will be released soon.
- **Feb 25, 2025**: Mooncake receives the **Best Paper Award** at **FAST 2025**!
- **Feb 21, 2025**: The updated traces used in our FAST'25 paper have been released.
- **Dec 16, 2024**: vLLM officially supports the Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.
- **Nov 28, 2024**: We open-sourced the Transfer Engine, the central component of Mooncake. We also provide two demonstrations of the Transfer Engine: a P2P Store and a vLLM integration.
- **July 9, 2024**: We open-sourced the trace as a JSONL file.
- **June 27, 2024**: We present a series of Chinese blog posts with further discussion on Zhihu (parts 1–7).
- **June 26, 2024**: Initial technical report release.

🎉 Overview

Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache pool.

The core of Mooncake is its KVCache-centric scheduler, which maximizes overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces highly overloaded scenarios; to mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios: compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
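The idea behind prediction-based early rejection can be sketched as follows: rather than admitting every request and letting overloaded queues blow past the deadline, the scheduler predicts time-to-first-token (TTFT) at admission and rejects requests that would miss the SLO anyway. This is a minimal toy sketch of that policy; the linear load predictor and the constants below are illustrative assumptions, not Mooncake's actual scheduler logic.

```python
# Toy sketch of a prediction-based early-rejection policy.
# The throughput constant and SLO value are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int  # length of the incoming prompt

PREFILL_TOKENS_PER_SEC = 50_000  # assumed prefill throughput of one node
TTFT_SLO_SEC = 2.0               # assumed time-to-first-token SLO

def predict_ttft(queued_tokens: int, req: Request) -> float:
    """Predict TTFT: time to drain the queue, then prefill this request."""
    return (queued_tokens + req.prompt_tokens) / PREFILL_TOKENS_PER_SEC

def admit(queued_tokens: int, req: Request) -> bool:
    """Reject at admission time if the predicted TTFT would already
    violate the SLO, instead of queuing work that is doomed to miss it."""
    return predict_ttft(queued_tokens, req) <= TTFT_SLO_SEC

print(admit(20_000, Request(prompt_tokens=8_000)))   # predicted 0.56s -> True
print(admit(120_000, Request(prompt_tokens=8_000)))  # predicted 2.56s -> False
```

Under overload, this shifts failures from SLO violations (which waste the compute already spent) to explicit early rejections, preserving effective throughput for the requests that are admitted.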
🧩 Components

**Mooncake Core Component: Transfer Engine (TE)**

The core of Mooncake is the Transfer Engine (TE), which provides a unified interface for batched data transfer across various storage devices an…
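The "unified interface for batched data transfer" idea can be illustrated with a toy engine that accepts many transfer requests in a single submission. All names here (`ToyTransferEngine`, `submit_batch`, the node labels) are hypothetical stand-ins for illustration only, not Mooncake's actual API; the real engine issues RDMA/NVMe-oF operations against registered buffers rather than copying Python bytes.

```python
# Hypothetical sketch of a batched-transfer interface in the spirit of the
# Transfer Engine description above; names are illustrative, not Mooncake's API.
from dataclasses import dataclass

@dataclass
class TransferRequest:
    payload: bytes     # data to move (stand-in for a registered memory region)
    target_node: str   # destination segment, e.g. "decode-0"

class ToyTransferEngine:
    """Accepts a batch of requests in one submission, mirroring the
    'unified interface for batched data transfer' idea."""
    def __init__(self):
        self.store: dict[str, list[bytes]] = {}

    def submit_batch(self, requests: list[TransferRequest]) -> int:
        # A real engine would post all transfers together to amortize
        # per-operation overhead; here we just copy bytes per node.
        for req in requests:
            self.store.setdefault(req.target_node, []).append(req.payload)
        return len(requests)  # number of transfers submitted

engine = ToyTransferEngine()
n = engine.submit_batch([
    TransferRequest(b"kv-block-0", "decode-0"),
    TransferRequest(b"kv-block-1", "decode-0"),
])
print(n)  # 2 transfers submitted in one batch
```

Batching is the point of the interface: submitting many small KVCache-block transfers in one call lets the engine schedule them together across whatever transport (RDMA, TCP, local copy) connects the two nodes.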