
Mooncake

Added Apr 23, 2026
Category: Model & Inference Framework
Open Source
Tags: Python, Rust, PyTorch, LLM, CLI, Model & Inference Framework, Model Training & Inference, Protocol, API & Integration

A KVCache-centric disaggregated architecture for LLM serving that provides distributed KVCache pooling, a topology-aware high-speed transfer engine, and a centralized scheduler, with support for prefill-decode separation and elastic MoE inference.

Developed by Moonshot AI, Mooncake is the production-grade inference infrastructure behind the Kimi LLM. Its core design decouples the prefill and decode phases of LLM inference and, through a centralized scheduler, organizes idle CPU, DRAM, and SSD resources across the GPU cluster into a distributed KVCache pool, maximizing cluster throughput while meeting latency SLOs.
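To make the disaggregation concrete, here is a minimal conceptual sketch (not the real Mooncake API; all class and method names are illustrative) of a scheduler that runs prefill and decode on separate paths, handing the KV cache off through a shared pool in between:

```python
# Illustrative sketch of prefill/decode disaggregation with a shared
# KVCache pool. Names are hypothetical, not Mooncake's actual API.
from dataclasses import dataclass, field


@dataclass
class KVCachePool:
    """Stands in for pooled idle DRAM/SSD capacity, keyed by request id."""
    store: dict = field(default_factory=dict)

    def put(self, request_id: str, kv_blocks: list) -> None:
        self.store[request_id] = kv_blocks

    def get(self, request_id: str) -> list:
        return self.store.pop(request_id)


class Scheduler:
    """Routes each request's prefill and decode phases to separate workers."""

    def __init__(self, pool: KVCachePool):
        self.pool = pool

    def prefill(self, request_id: str, prompt_tokens: list) -> None:
        # A prefill worker computes KV for the whole prompt in one pass,
        # then publishes it to the pool instead of decoding locally.
        kv_blocks = [f"kv({tok})" for tok in prompt_tokens]
        self.pool.put(request_id, kv_blocks)

    def decode(self, request_id: str, steps: int) -> list:
        # A decode worker pulls the prefilled KV and generates token by token.
        kv_blocks = self.pool.get(request_id)
        out = []
        for i in range(steps):
            out.append(f"tok{i}")
            kv_blocks.append(f"kv(tok{i})")
        return out


scheduler = Scheduler(KVCachePool())
scheduler.prefill("req-1", ["The", "quick", "fox"])
tokens = scheduler.decode("req-1", steps=2)
```

The point of the split is that the compute-bound prefill phase and the memory-bandwidth-bound decode phase can be scaled and scheduled independently, with the pool absorbing the KV handoff.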

The platform comprises three core components:

- Transfer Engine: unified batch transfer across DRAM/VRAM/NVMe, supporting RDMA, CXL, NVMe-oF, and other protocols, reaching up to 190 GB/s on 8×400 Gbps links.
- Mooncake Store: a distributed KVCache storage engine with multi-replication, striped parallel I/O, multi-level caching policies, and intelligent prefetching.
- P2P Store: decentralized fast checkpoint distribution, validated in trillion-parameter model training scenarios.
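The striped parallel I/O idea in Mooncake Store can be illustrated with a small round-robin striping sketch (hypothetical helper functions, not Mooncake Store's interface): a payload is split into fixed-size stripes spread across several storage targets so reads can fan out in parallel and aggregate their bandwidth.

```python
# Hypothetical illustration of round-robin striping across storage targets.
def stripe(data: bytes, targets: int, stripe_size: int) -> list:
    """Split the payload into stripe_size chunks, round-robined over targets."""
    lanes = [[] for _ in range(targets)]
    for i in range(0, len(data), stripe_size):
        lanes[(i // stripe_size) % targets].append(data[i:i + stripe_size])
    return lanes


def unstripe(lanes: list) -> bytes:
    """Interleave stripes back in round-robin order to rebuild the payload."""
    out = []
    for round_idx in range(max(len(lane) for lane in lanes)):
        for lane in lanes:
            if round_idx < len(lane):
                out.append(lane[round_idx])
    return b"".join(out)


payload = bytes(range(20))
lanes = stripe(payload, targets=3, stripe_size=4)
assert unstripe(lanes) == payload
```

In a real engine each lane would map to a replica or device reachable over RDMA, and the per-lane reads would be issued concurrently rather than in a Python loop.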

For inference-framework integration, Mooncake Transfer Engine is integrated into vLLM v1 as a KV Connector, and Mooncake Store serves as the remote backend for SGLang HiCache; LMDeploy, TensorRT-LLM, and LMCache are also supported. The Mooncake-EP module provides elastic expert parallelism and fault tolerance for MoE models; with it, Kimi K2 achieves 224k tok/s prefill and 288k tok/s decode on a 128×H200 cluster.
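The elasticity in expert parallelism can be sketched as follows (an illustrative toy, not Mooncake-EP's actual placement algorithm): experts are spread over the healthy ranks, and when a rank fails, its experts are reassigned across the survivors so MoE inference continues at reduced capacity.

```python
# Toy sketch of elastic expert placement; names are hypothetical.
def assign_experts(num_experts: int, ranks: list) -> dict:
    """Round-robin experts over the currently healthy ranks."""
    return {e: ranks[e % len(ranks)] for e in range(num_experts)}


# Initial placement: 8 experts over 4 ranks.
placement = assign_experts(8, ranks=[0, 1, 2, 3])

# Rank 2 fails: rebalance its experts onto the survivors.
placement = assign_experts(8, ranks=[0, 1, 3])
assert 2 not in placement.values()          # no expert left on the dead rank
assert set(placement) == set(range(8))      # every expert still served
```

A production system would additionally migrate the failed rank's expert weights and in-flight state, but the core fault-tolerance contract is the same: every expert remains reachable after remapping.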

The project supports heterogeneous accelerator backends including CUDA, Cambricon MLU, Ascend NPU, and HIP. Results were published at USENIX FAST '25 (Best Paper Award) and in ACM Transactions on Storage. The code is open-sourced under Apache-2.0, with the core transfer engine distributed via PyPI alongside Docker images and a source-compilation option.
