
EricLBuehler / candle-vllm

Efficient platform for inference and serving local LLMs, including an OpenAI compatible API server.

View on GitHub
617 stars
71 forks
38 issues

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing EricLBuehler/candle-vllm in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/EricLBuehler/candle-vllm)
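Where raw HTML is preferred over Markdown (for example, inside a centered badge row in a README), an equivalent embed might look like the following — the image and link URLs are the same as in the Markdown snippet above:

```html
<a href="https://repomind.in/repo/EricLBuehler/candle-vllm">
  <img src="https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge"
       alt="Analyzed by RepoMind">
</a>
```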

Repository Overview (README excerpt)


Efficient, easy-to-use platform for inference and serving local LLMs, including an OpenAI compatible API server.

Features

• OpenAI compatible API server provided for serving LLMs.
• Highly extensible trait-based system to allow rapid implementation of new module pipelines.
• Streaming support in generation.
• Efficient management of key-value cache with PagedAttention.
• Continuous batching (batched decoding for incoming requests over time).
• GPTQ/AWQ quantization (and Marlin format conversion).
• GGUF format quantization (4-bit).
• Mac/Metal device support.
• Multi-GPU inference (both multi-process and multi-threaded modes).
• Multi-node inference with an MPI runner.
• Chunked prefilling (default chunk size 8K).
• CUDA Graph support.
• Model Context Protocol (MCP) and OpenAI-compatible tool calling.
• Prefix caching.
• Block-wise FP8 models (SM90+, Qwen3 series).
• Flashinfer backend.

Supported Models

Currently, candle-vllm supports chat serving for the following model structures.

| # | Model Type | Decoding Speed / Request (Hopper) | Quantized (Q4K or Marlin) |
|--|--|--|--|
| 1 | **LLAMA** | 105 tks/s (8B) | 154 tks/s (8B, Q4K), 163 tks/s (8B, **Marlin**) |
| 2 | **Mistral** | 112 tks/s (7B) | 171 tks/s (7B, Q4K), 175 tks/s (7B, **Marlin**) |
| 3 | **Phi3/Phi4** | 139 tks/s (3.8B) | 180 tks/s (3.8B, Q4K) |
| 4 | **QWen2/QWen3 Dense** | 96 tks/s (8B) | 135 tks/s (8B, Q4K) |
| 5 | **QWen3 MoE** | 92 tks/s (30B) | 114 tks/s (30B, Q4K) |
| 6 | **QWen3-Next MoE** | 71 tks/s (80B, BF16, tp=2) | TBD |
| 7 | **QWen3.5 Dense** | 30 tks/s (27B, BF16) | ~42 tks/s (27B, Q4K / FP8) |
| 8 | **QWen3.5 MoE** | 82 tks/s (35B) | 93 tks/s (35B, Q4K) |
| 9 | **Yi** | 148 tks/s (6B) | 180 tks/s (6B, Q4K) |
| 10 | **StableLM** | 223 tks/s (3B) | - |
| 11 | **Gemma-2/Gemma-3** | 92 tks/s (9B) | 115 tks/s (9B, **Marlin**) |
| 12 | **DeepSeek V2/V3/R1** | TBD | ~20 tks/s (AWQ 671B, tp=8, offloading) |
| 13 | **QwQ-32B** | 45 tks/s (32B, tp=2) | 63 tks/s (32B, Q4K) |
| 14 | **GLM4** | 89 tks/s (9B) | 124 tks/s (9B, Q4K) |

Demo Video (Nvidia GPU and Apple Silicon)

• Chat demo on **GPU** (A100, BF16, QWen3-8B reasoning model)
• Chat demo on **Apple Silicon** (M4 with 16GB unified memory, Q2K, QWen3-8B)

General Usage

Install Candle-vLLM

• Clone the code.
• **CUDA (CUDA 11+, 12+, 13.0)**: either install into Docker (option 1), or install manually (option 2) — install the dependencies, then install for single-node or multi-node inference.
• **Mac/Metal (single-node only)**: install the Xcode command line tools, then install with the Metal feature enabled.

Run Directly (Without Installation)

Candle-vllm can also be launched via cargo run with the desired build features and runtime flags.

**Example:**

• Environment: RUST_LOG=warn
• Build flags: --release --features cuda,nccl,flashinfer,cutlass,graph
• Runtime flags: --log --dtype bf16 --p 2000 --d 0,1 --gpu-memory-fraction 0.7 --isq q4k --prefill-chunk-size 8192 --frequency-penalty 1.1 --presence-penalty 1.1 --enforce-parser qwen_coder
• Model: --m Qwen/Qwen3.5-27B-FP8 (or specify a local model path)
• Optional: --fp8-kvcache, --ui-server

where:

• --p: server port;
• --d: device ids;
• the weight path points to a safetensors folder; a weight file is used for GGUF models;
• --m: Huggingface model-id;
• --isq: convert weights into the given quantized format during model loading;
• --prefill-chunk-size: chunk the prefill into the size defined by this flag (default 8K);
• --frequency-penalty and --presence-penalty: repetition penalties (values from -2.0 to 2.0);
• a fixed KV cache budget can be set in MB; otherwise the KV cache is auto-sized after model load using --gpu-memory-fraction;
• --enforce-parser: force a specific tool parser backend (e.g., qwen_coder);
• --fp8-kvcache: enable the FP8 KV cache;
• further flags enable prefix cache reuse and cap the prefix cache size;
• --ui-server: start with a built-in ChatGPT-like Web UI server.

Replace the Flashinfer entry in the feature list with the Flash attention feature to use that backend instead.

How to serve models?
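Once a server is running, any OpenAI-compatible client can talk to it. Below is a minimal sketch using only the Python standard library; the port (2000) and model name are taken from the example flags above, and the endpoint path follows the standard OpenAI chat-completions convention — adjust both to your own setup:

```python
import json
import urllib.request

# Assumed local endpoint: candle-vllm started with --p 2000.
BASE_URL = "http://localhost:2000/v1/chat/completions"


def build_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("Qwen/Qwen3.5-27B-FP8", "Hello!")
# urllib.request.urlopen(req) sends the request once the server is up.
```

Because the API is OpenAI compatible, the official OpenAI SDKs also work by pointing their base URL at the local server.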
• **Note:** for a docker build, first execute the command to enter the docker container.

• Run **Uncompressed** models: local path (with port, device); local path (ISQ, +UI server); model-ID (download from Huggingface); **FP8 model** (block-wise quant; build with the corresponding feature).

• Run **GGUF** models: local path; model-ID (download from Huggingface).

• Run **GGUF** models on **Apple Silicon**: local path (assume the model is downloaded in /home); model-ID (download from Huggingface).

• Run **any uncompressed model as quantized with in-situ quantization**: simply add the --isq parameter when running unquantized models. Options for the in-situ quantization parameter: ["q4_0", "q4_1", "q5_0", "q5_1", "q8_0", "q2k", "q3k", "q4k", "q5k", "q6k"].

• Run **Marlin-compatible GPTQ** models (4-bit GPTQ, 128-group, desc_act=False): local path; model-ID (download from Huggingface); any uncompressed model can also be converted to the Marlin-compatible format.

• Run **Marlin-compatible AWQ** models: convert the AWQ model to the Marlin-compatible format, then run the converted model.

• Run **Marlin-format** models.

• Run **large models in multi-process mode (Multi-GPU)**: QwQ-32B BF16 on two GPUs; QwQ-32B 4-bit AWQ on two GPUs (1. convert the AWQ model to the Marlin-compatible format; 2. run the converted model). **Note:** the number of GPUs used must be a power of two (e.g., 2, 4, or 8).

• Run **large models in multi-threaded mode (Multi-GPU, for debug purposes)**: simply add the corresponding parameter, e.g., for the QwQ-32B BF16 model on two GPUs. If you encounter problems under multi-threaded multi-GPU mode, workarounds are listed in the full README.

• Run **DeepSeek-R1 (671B/685B) on lower GPU memory (CPU offloading)**: 1. convert the DeepSeek-R1-AWQ model to the Marlin-compatible format; 2. run the DeepSeek-R1 model on 8 x A100 (40GB). **Note:** this setup offloads 15 experts per rank (a total of 120 out of 256 experts) to the CPU (around 150GB of additional host memory required). During inference, these offloaded experts are swapped back into GPU memory as needed. If you have even less GPU memory, consider increasing the pa…