
intel / auto-round

🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantization, MXFP4, NVFP4, GGUF, and adaptive schemes.

908 stars
85 forks
119 issues
Python · Shell

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing intel/auto-round in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/intel/auto-round)

Repository Overview (README excerpt)


# Advanced Quantization Algorithm for LLMs

English | 简体中文 · User Guide | 用户指南

---

## 🚀 What is AutoRound?

AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It achieves high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging **sign-gradient descent**, and it offers broad hardware compatibility. See our papers SignRoundV1 and SignRoundV2 for more details. For usage instructions, please refer to the User Guide.

## 🆕 What's New

- [2026/03] **Block-wise FP8** quantization is available via .
- [2026/03] MTP layer quantization has been supported in PR
- [2025/12] The **SignRoundV2** paper is available. Turn on and use the **AutoScheme** API for mixed-precision quantization to reproduce the results: *Paper*, *Notes for evaluating LLaMA models*.
- [2025/11] AutoRound has landed in **LLM-Compressor**: *Usage*, *vLLM blog*, *RedHat blog*, *X post*, *Intel blog*, *LinkedIn*, *WeChat*, *Zhihu*.
- [2025/11] An **enhanced GGUF** quantization algorithm is available via : *Accuracy*.
- [2025/10] AutoRound has been integrated into **SGLang**: *Usage*, *LMSYS blog*, *X post*, *Intel blog*, *LinkedIn*.
- [2025/10] A **mixed-precision** algorithm is available that generates schemes in minutes: *Usage*, *Accuracy*.
- [2025/09] The **MXFP4** and **NVFP4** dtypes are available: *Accuracy*.
- [2025/08] An **improved INT2** algorithm is available via : *Accuracy*.
- [2025/07] The **GGUF** format is supported: *Usage*.
- [2025/05] AutoRound has been integrated into **vLLM**: *Usage*, *Medium blog*, *Xiaohongshu*.
- [2025/05] AutoRound has been integrated into **Transformers**: *Blog*.
- [2025/03] The INT2-mixed **DeepSeek-R1** model (~200 GB) retains 97.9% accuracy: *Model*.

## ✨ Key Features

- ✅ **Superior Accuracy** Delivers strong performance even at 2–3 bits (example models), with leading results at 4 bits (benchmark).
- ✅ **Ecosystem Integration** Works seamlessly with **Transformers, vLLM, SGLang**, and more.
- ✅ **Multiple Formats Export** Supports **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** for maximum compatibility. Details are shown in export formats.
- ✅ **Fast Mixed Bits/Dtypes Scheme Generation** Automatically configures schemes in minutes, with roughly 1.1x–1.5x the model's BF16 RAM size as overhead. See the accuracy results and user guide.
- ✅ **Optimized Round-to-Nearest Mode** Use  for fast quantization, with some accuracy drop at 4 bits. Details are shown in opt_rtn mode.
- ✅ **Affordable Quantization Cost** Quantizes 7B models in about 10 minutes on a single GPU. Details are shown in quantization costs.
- ✅ **10+ VLMs Support** Out-of-the-box quantization for 10+ vision-language models: example models, support matrix.
- ✅ **Multiple Recipes** Choose from , , and  to suit your needs. Details are shown in quantization recipes.
- ✅ **Advanced Utilities** Includes multi-GPU quantization, multiple calibration datasets, and support for 10+ runtime backends.
- ✅ **Beyond Weight-Only Quantization** We are actively expanding support for additional datatypes such as **MXFP**, NVFP, W8A8, and more.

## Installation

Install from pypi · Build from Source

## Model Quantization (CPU/Intel GPU/Gaudi/CUDA)

> If you encounter issues during quantization, try using pure RTN mode with iters=0, disable_opt_rtn=True. Additionally, using group_size=32 or mixed bits is recommended for better results.

### CLI Usage

The full list of supported arguments is provided by calling  on the terminal.

> **ModelScope is supported for model downloads; simply set .**

We offer another two recipes,  and , designed for optimal accuracy and improved speed, respectively. Details are as follows.

### Other Recipes

In conclusion, we recommend using **auto-round for W4A16 and auto-round-best with  for W2A16**. However, you may adjust the configuration to suit your specific requirements and available resources.

### API Usage

#### Important Hyperparameters

**Quantization Scheme & Configuration**

- ** (str|dict|AutoScheme)**: The predefined quantization keys, e.g. , , , .
  For MXFP4/NVFP4, we recommend exporting to the LLM-Compressor format.
- ** (int)**: Number of bits for quantization (default is ). If not None, it overrides the scheme setting.
- ** (int)**: Size of the quantization group (default is ). If not None, it overrides the scheme setting.
- ** (bool)**: Whether to use symmetric quantization (default is ). If not None, it overrides the scheme setting.
- ** (dict)**: Configuration for the layer-wise scheme (default is ), mainly for customized mixed schemes.

**Algorithm Settings**

- ** (bool)**: [Experimental Feature] Only for . Enables algorithm variants for specific schemes (e.g., MXFP4/W2A16) that can bring notable improvements. Default is .
- ** (bool|None)**: Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is . If None, it defaults to  in most cases to improve accuracy, but may be set to  due to known issues.

**Tuning Process Parameters**

- ** (int)**: Number of tuning iterations (default is ). Common values: 0 (RTN mode), 50 (with lr=5e-3 recommended), 1000. Higher values increase accuracy but slow down tuning.
- ** (float)**: The learning rate for the rounding value (default is ). When None, it will be set to  automatically.
- ** (int)**: Batch size for training (default is ). 4 is also commonly used.
- ** (bool)**: Whether to enable deterministic algorithms for reproducibility (default is ).

**Calibration Dataset**

- ** (str|list|tuple|torch.utils.data.DataLoader)**: The dataset for tuning (default is ). Supports local JSON files and dataset combinations, e.g. .
- ** (int)**: Number of samples for tuning (default is ).
- ** (int)**: Sequence length of the data used for tuning (default is ).

**Device/Speed Configuration**

- ** (bool)**: If no exception is raised, we typically recommend setting this to True for faster quantization with lower resource usage.
- ** (bool)**: Whether to offload intermediate features to CPU, at the cost of ~20% more tuning time (default is ).
- ** (bool)**: [Experimental Feat…
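The excerpt above repeatedly contrasts plain RTN (round-to-nearest, `iters=0`) with AutoRound's sign-gradient tuning of the rounding decision. The sketch below illustrates that idea on a single toy linear layer: it compares RTN against per-weight rounding offsets tuned with sign-SGD to minimize the layer's *output* error on a calibration batch. This is a minimal NumPy re-implementation of the concept only, not AutoRound's actual code; all shapes, the per-row scaling, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(W, v, scale, qmax):
    """Round with a learnable offset v in [-0.5, 0.5], then dequantize."""
    return np.clip(np.round(W / scale + v), -qmax - 1, qmax) * scale

# Toy layer and calibration batch (shapes are illustrative only).
W = rng.normal(size=(8, 16))      # weight matrix
X = rng.normal(size=(16, 32))     # calibration activations
bits = 4
qmax = 2 ** (bits - 1) - 1        # symmetric 4-bit range: [-8, 7]
scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row scale

ref = W @ X                       # full-precision layer output

def output_loss(v):
    """Mean squared error of the quantized layer's output."""
    return float(np.mean((fake_quantize(W, v, scale, qmax) @ X - ref) ** 2))

v = np.zeros_like(W)              # v = 0 is exactly round-to-nearest (RTN)
rtn_loss = output_loss(v)
best_v, best_loss = v.copy(), rtn_loss

lr, iters = 5e-3, 200             # assumed values, mirroring the lr/iters knobs above
for _ in range(iters):
    err = fake_quantize(W, v, scale, qmax) @ X - ref
    grad = (err @ X.T) * scale    # straight-through gradient w.r.t. the offsets
    v = np.clip(v - lr * np.sign(grad), -0.5, 0.5)   # sign-SGD step
    loss = output_loss(v)
    if loss < best_loss:          # keep the best offsets seen so far
        best_v, best_loss = v.copy(), loss

print(f"RTN loss: {rtn_loss:.4f}  tuned loss: {best_loss:.4f}")
```

Because the loop keeps the best offsets seen (including the initial `v = 0`), the tuned result can never be worse than the RTN baseline; this mirrors why `iters=0` serves as the fast fallback while larger `iters` trades tuning time for accuracy.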