vllm-project / vllm-metal

Community-maintained hardware plugin for vLLM on Apple Silicon

View on GitHub
732 stars
75 forks
24 issues

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing vllm-project/vllm-metal in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/vllm-project/vllm-metal)

Repository Overview (README excerpt)

# vLLM Metal Plugin

> **High-performance LLM inference on Apple Silicon using MLX and vLLM**

vLLM Metal is a plugin that enables vLLM to run on Apple Silicon Macs using MLX as the primary compute backend. It unifies MLX and PyTorch under a single lowering path.

## Features

- **MLX-accelerated inference**: faster than PyTorch MPS on Apple Silicon
- **Unified memory**: true zero-copy operations leveraging Apple Silicon's unified memory architecture
- **vLLM compatibility**: full integration with vLLM's engine, scheduler, and OpenAI-compatible API
- **Paged attention** *(experimental)*: efficient KV cache management for long sequences. Opt-in via an environment variable; the default path uses the MLX-managed KV cache. When enabled, expect significantly better serving performance (~82x TTFT, ~3.75x throughput in early benchmarks on Qwen3-0.6B). Other models may have rough edges.
- **GQA support**: Grouped-Query Attention for efficient inference

## Requirements

- macOS on Apple Silicon

## Installation

The install script installs the following under the default directory:

- the vllm-metal plugin
- vLLM core
- related libraries

After installation, the CLI becomes available and you can use vLLM right away. For how to use the CLI, see the official vLLM guide: https://docs.vllm.ai/en/latest/cli/

### Reinstallation and update

If any issues occur, reinstall the latest release version and check whether the problem is resolved. If the issue persists in the latest release, please report the details. (If you installed to a directory other than the default, substitute that path when running the command.)

### Uninstall

Delete the directory created by the install script. (If you installed to a directory other than the default, substitute that path.)
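Since the plugin exposes vLLM's OpenAI-compatible API, any OpenAI-style HTTP client can talk to a locally started server. A minimal sketch, assuming a server started with the vLLM CLI (e.g. `vllm serve Qwen/Qwen3-0.6B`) listening on the default `http://localhost:8000`; the model name and port here are illustrative:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running vLLM server and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (requires a running server):
# reply = chat(build_chat_request("Qwen/Qwen3-0.6B", "Hello!"))
# print(reply["choices"][0]["message"]["content"])
```

The official `openai` Python client works as well; point its `base_url` at `http://localhost:8000/v1`.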
## Architecture

## Configuration

Environment variables for customization:

| Variable | Default | Description |
|----------|---------|-------------|
| | | Allocates just enough memory plus a minimal KV cache, or a fraction of memory |
| | | Use MLX for compute (1=yes, 0=no) |
| | | MLX device |
| | | KV cache block size |
| | | Enable experimental paged KV cache |
| | | Enable debug logging |
| | | Set True to change the model registry |
| | None | Absolute path of the local model |
| | (unset) | Set to enable prefix caching for shared prompt reuse |
| | | Fraction of the MLX working set used for the prefix cache, in (0, 1] |

### Paged KV vs. MLX KV memory settings

- MLX path (paged KV disabled): the memory setting must be left at its default.
- Paged KV path (paged KV enabled): the memory setting can be left at the default or set to a numeric fraction.
- For paged KV with the default setting, vllm-metal uses a fraction of 0.9.

| Paged KV | Memory setting | Valid? | Notes |
|----------|----------------|--------|-------|
| off | default | Yes | MLX path (default) |
| on | default | Yes | Paged KV path; defaults to 0.9 internally |
| on | explicit fraction | Yes | Paged KV path with explicit memory budget |
| off | explicit fraction | No | Explicit fraction without paged KV is invalid |

## Acknowledgements

- The Metal paged attention kernels are currently adapted from mistral.rs (MIT license), via the HuggingFace kernels-community. We plan to develop custom kernels in the future.