wejoncy / QLLM
A general 2-8 bit quantization toolbox supporting GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime.
View on GitHub

AI Architecture Analysis
This repository is indexed by RepoMind. By analyzing wejoncy/QLLM in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.
Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.
Repository Overview (README excerpt)
QLLM

KeyWords: **Quantization**, **GPTQ**, **AWQ**, **HQQ**, **VPTQ**, **ONNX**, **ONNXRuntime**, **VLLM**

Quantize any LLM in HuggingFace/Transformers with GPTQ/AWQ/HQQ/VPTQ in mixed bit-widths (2-8 bit), and export it to an ONNX model.

QLLM is an out-of-the-box quantization toolbox for large language models. It is designed as an auto-quantization framework that processes any LLM layer by layer. It can also export a quantized model to ONNX with a single argument and run inference with ONNX Runtime. In addition, models quantized by the different methods (GPTQ/AWQ/HQQ/VPTQ) can be loaded from huggingface/transformers and converted to each other without extra effort.

We already support:
- [x] GPTQ quantization
- [x] AWQ quantization
- [x] HQQ quantization
- [x] VPTQ quantization

Features:
- [x] GPTQ supports all LLM models in huggingface/transformers; it automatically detects the model type and quantizes it.
- [x] Quantization to 2-8 bits, with different bit-widths allowed for different layers.
- [x] Automatic promotion of bits/group-size for better accuracy.
- [x] Export to an ONNX model and inference with ONNX Runtime.

*Latest News* 🔥
- [2026/03] CUDA 13.0 support, PyTorch 2.10, Python 3.11-3.13
- [2026/03] Support for H100/H200 (sm_90), B200/B300 (sm_100), RTX 5090 (sm_120)
- [2024/03] ONNX model export API
- [2024/01] Support for the HQQ algorithm
- [2023/12] First PyPI package released

Installation

QLLM is easy to install from PyPI, or from a release package; CUDA 13.0 is supported.
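To make the 2-8 bit scheme concrete, here is a minimal, self-contained sketch of per-group asymmetric quantization at a configurable bit-width. This illustrates the basic operation such quantizers apply to each group of weights; it is not QLLM's actual implementation, and the function names are hypothetical:

```python
# Illustrative sketch only -- not QLLM code. Shows how a group of float
# weights is mapped to `bits`-bit unsigned integers plus a scale/zero-point,
# and recovered by dequantization.

def quantize_group(weights, bits):
    """Quantize a list of floats to `bits`-bit ints with one scale/zero-point."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = round(-lo / scale)  # integer zero-point
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Map the integers back to approximate float weights."""
    return [(qi - zero) * scale for qi in q]

if __name__ == "__main__":
    w = [-0.5, -0.1, 0.0, 0.2, 0.7]
    for bits in (2, 4, 8):
        q, s, z = quantize_group(w, bits)
        w_hat = dequantize_group(q, s, z)
        err = max(abs(a - b) for a, b in zip(w, w_hat))
        print(f"{bits}-bit max abs error: {err:.4f}")
```

Real toolkits apply this per group (e.g. 128 weights sharing one scale/zero-point), which is why the group-size setting trades accuracy against model size; GPTQ/AWQ additionally choose the rounding to minimize the resulting layer output error rather than rounding each weight independently.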
[py311, py312, py313] https://github.com/wejoncy/QLLM/releases

Build from Source

**Please set the environment variable EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 for a fast build.**

How to Use It

Quantize llama2

Convert to an ONNX model: use the export command to export and save an ONNX model, or convert an existing model in the HF Hub.

(NEW) Quantize a model with mixed bits/group-size for higher precision (lower PPL)

NOTE:
- Only GPTQ is supported.
- The allow_mix_bits option is adapted from gptq-for-llama; QLLM makes it easier to use and more flexible.
- The difference from gptq-for-llama is that we grow the bit-width by one instead of doubling it.
- All configurations are saved/loaded automatically, instead of the quant-table used by gptq-for-llama.
- If --allow_mix_bits is enabled, the saved model is not compatible with vLLM for now.

Quantize a model for vLLM

Due to the difference in zero-point conventions, you need to set an environment variable if you set pack_mode to GPTQ, whether the method is AWQ or GPTQ. If you use the GEMM pack_mode, you don't have to set the variable.

Conversion among AWQ, GPTQ and Marlin

You can convert GPTQ to AWQ, GPTQ to Marlin, or AWQ to Marlin. Note: not all cases are supported; for example, a model quantized with different bit-widths for different layers cannot be converted to AWQ, and GPTQ models quantized with certain options cannot be converted to AWQ.

Model inference with the saved model

Model inference with ORT: you may want to use genai for generation with ORT.

Load a quantized model from huggingface/transformers

Start a chatbot: you may need to install fschat and accelerate with pip, and enable the chatbot plugin.

Use QLLM with an API

If you have connection issues reaching transformers, set the environment variable PROXY_PORT=your http proxy port (PowerShell, Bash, or Windows cmd).

Acknowledgements
- GPTQ
- GPTQ-triton
- AutoGPTQ
- llm-awq
- AutoAWQ
- HQQ
- VPTQ
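The pack modes and format conversions described above all ultimately shuffle low-bit integers stored inside 32-bit words. As a rough illustration of that packing step (not QLLM's actual per-kernel layout, which differs between GPTQ, AWQ, and Marlin; function names are hypothetical):

```python
# Illustrative sketch only -- packs `bits`-bit unsigned ints into 32-bit
# words, least-significant bits first, and unpacks them again. Format
# converters effectively unpack one layout and repack into another.

def pack(values, bits):
    """Pack `bits`-bit unsigned ints into a list of 32-bit words."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    words = []
    for i in range(0, len(values), per_word):
        word = 0
        for j, v in enumerate(values[i:i + per_word]):
            word |= (v & mask) << (j * bits)
        words.append(word)
    return words

def unpack(words, bits, count):
    """Recover `count` integers from the packed 32-bit words."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    out = []
    for word in words:
        for j in range(per_word):
            out.append((word >> (j * bits)) & mask)
    return out[:count]
```

For 4-bit weights this stores eight values per word, which is why packed checkpoint tensors are roughly an eighth the size of their fp32 counterparts (plus per-group scales and zero-points).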