
alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

View on GitHub
1,073 stars
162 forks
88 issues

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing alibaba/rtp-llm in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.
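The idea of loading whole files into context on demand, rather than retrieving pre-chunked fragments, can be sketched in a few lines. This is a minimal illustration under assumed names (`build_context` and its parameters are hypothetical, not RepoMind's actual API):

```python
from pathlib import Path

def build_context(repo_root: str, files: list[str], question: str) -> str:
    """Assemble a prompt by loading each requested source file whole,
    so no file is fragmented across retrieval chunks."""
    sections = []
    for rel in files:
        text = Path(repo_root, rel).read_text(encoding="utf-8")
        sections.append(f"### {rel}\n{text}")  # each file stays intact
    return "\n\n".join(sections) + f"\n\nQuestion: {question}"
```

A chunked RAG pipeline would instead embed and rank fixed-size slices of these files; loading them whole trades token budget for complete, unfragmented context.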

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/alibaba/rtp-llm)
Preview: Analyzed by RepoMind

Repository Overview (README excerpt)

**Documentation** | **Contact Us**

News

• [2025/09] 🔥 RTP-LLM 0.2.0 release with enhanced performance and new features
• [2025/01] 🚀 RTP-LLM now supports Prefill/Decode separation, with a detailed technical report
• [2025/01] 🌟 Qwen series models and the BERT embedding model are now supported on Yitian ARM CPUs
• [2024/06] 🔄 Major refactor: scheduling and batching framework rewritten in C++, complete GPU memory management, and a new Device backend
• [2024/06] 🏗️ Multi-hardware support in development: AMD ROCm, Intel CPU, and ARM CPU support coming soon

More

• New Breakthroughs in LLM Inference: Exploring Distributed Inference in Practice
• Preparing for Heterogeneous Inference: Design of the Next-Generation RTP-LLM Inference Engine
• LLM Inference Acceleration: Optimizing Decode-Phase Attention on GPUs
• LLM Inference Acceleration: Optimizing Decode-Phase Attention on GPUs (Part 2)

About

RTP-LLM is a Large Language Model (LLM) inference acceleration engine developed by Alibaba's Foundation Model Inference Team. It is widely used within Alibaba Group, supporting LLM services across multiple business units, including Taobao, Tmall, Idlefish, Cainiao, Amap, Ele.me, AE, and Lazada. RTP-LLM is a sub-project of the havenask project.

Key Features

🏢 Production Proven

Trusted and deployed across numerous LLM scenarios:
• Taobao Wenwen
• Alibaba's international AI platform, Aidge
• OpenSearch LLM Smart Q&A Edition
• Large Language Model based Long-tail Query Rewriting in Taobao Search

⚡ High Performance

• Utilizes high-performance CUDA kernels, including PagedAttention, FlashAttention, and FlashDecoding
• Implements WeightOnly INT8 quantization with automatic quantization at load time
• Supports WeightOnly INT4 quantization with GPTQ and AWQ
• Adaptive KVCache quantization
• Detailed optimization of dynamic-batching overhead at the framework level
• Specially optimized for the V100 GPU

🔧 Flexibility and Ease of Use

• Seamless integration with HuggingFace models, supporting multiple weight formats such as SafeTensors, PyTorch, and Megatron
• Deploys multiple LoRA services with a single model instance
• Handles multimodal inputs (combining images and text)
• Enables multi-machine/multi-GPU tensor parallelism
• Supports P-tuning models

🚀 Advanced Acceleration Techniques

• Loads pruned irregular models
• Contextual Prefix Cache for multi-turn dialogues
• System Prompt Cache
• Speculative Decoding

Getting Started

• Install RTP-LLM
• Quick Start
• Backend Tutorial
• Contribution Guide

Benchmark and Performance

Learn more about RTP-LLM's performance in our benchmark reports:
• Performance Benchmark Tool

Acknowledgments

Our project is mainly based on FasterTransformer, on top of which we have integrated some kernel implementations from TensorRT-LLM. We also draw inspiration from vllm, transformers, llava, and qwen-vl. We thank these projects for their inspiration and help.

Citation

If you find RTP-LLM useful in your research or project, please consider citing it.

Contact Us

• DingTalk Group
• WeChat Group
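The weight-only INT8 scheme listed under High Performance keeps activations in floating point and stores only the weights as int8, with one scale per output channel. A minimal NumPy sketch of the idea follows; it illustrates the technique only, not RTP-LLM's actual CUDA implementation, and the function names are my own:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-output-channel symmetric weight-only INT8 quantization:
    keep one float32 scale per row, map weights to [-127, 127]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float weights for the matmul."""
    return q.astype(np.float32) * scale
```

Because the scale is `max|w| / 127` per row, round-trip error is bounded by half a quantization step per element, which is why weight-only INT8 typically costs little accuracy while halving weight memory versus fp16.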