An advanced post-training quantization toolkit for LLMs and VLMs by Intel, leveraging SignRound optimization to support 2–4 bit weight quantization and automatic mixed-precision scheme generation across Intel CPU/GPU, NVIDIA GPU, and Habana Gaudi.
AutoRound is an Intel-maintained post-training quantization toolkit for Large Language Models (LLMs) and Vision-Language Models (VLMs). Its core algorithm, SignRound, uses SignSGD to optimize rounding decisions and weight clipping in approximately 200 steps, combining the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ) without introducing extra inference overhead. The SignRoundV1 paper reports average zero-shot accuracy improvements of 6.91%–33.22% at 2-bit weight quantization.
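The core idea can be illustrated with a toy sketch: learn a small per-weight rounding offset v_i in [-0.5, 0.5] with sign-only gradient steps, so that the fake-quantized layer output matches the float output on calibration data. Everything below (the loss, the straight-through estimator, the best-so-far tracking) is an illustrative simplification, not the library's actual implementation.

```python
def fake_quant(w, v, scale):
    """Quantize-dequantize one weight with a learnable rounding offset v."""
    return round(w / scale + v) * scale

def signround(weights, calib, scale, steps=200, lr=0.01):
    """Toy SignSGD search over per-weight rounding offsets v_i in [-0.5, 0.5].

    `calib` is a list of calibration input vectors; the loss is the squared
    error of the layer output sum_i x_i * w_i after fake quantization.
    round() has zero gradient almost everywhere, so a straight-through
    estimator (d(dq_i)/dv_i ~= scale) is used for the backward pass.
    """
    n = len(weights)
    v = [0.0] * n

    def loss(vv):
        dq = [fake_quant(w, vi, scale) for w, vi in zip(weights, vv)]
        return sum(sum(x[i] * (dq[i] - weights[i]) for i in range(n)) ** 2
                   for x in calib)

    best_v, best_loss = list(v), loss(v)
    for _ in range(steps):
        dq = [fake_quant(w, vi, scale) for w, vi in zip(weights, v)]
        for i in range(n):
            # straight-through gradient of the loss w.r.t. offset v_i
            g = sum(2.0 * sum(x[j] * (dq[j] - weights[j]) for j in range(n))
                    * x[i] * scale for x in calib)
            if g != 0.0:
                # SignSGD: step by the sign of the gradient, then clip
                v[i] = max(-0.5, min(0.5, v[i] - lr * (1.0 if g > 0.0 else -1.0)))
        cur = loss(v)
        if cur < best_loss:
            best_v, best_loss = list(v), cur
    return best_v, best_loss
```

Even on a two-weight example, the learned offsets can flip a round-to-nearest decision so that rounding errors cancel in the layer output, which is exactly the effect plain RTN cannot achieve.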
The project supports a rich set of quantization data-type combinations (W2A16, W3A16, W4A16, W8A16, W4A4 in research stage, NVFP4, MXFP4, block-wise FP8, W8A8, among others), with export to five formats: AutoRound native, AutoAWQ, AutoGPTQ, GGUF, and LLM-Compressor. SignRoundV2 further introduces a fast sensitivity metric that combines gradient information with quantization bias for layer-wise bit allocation, plus a lightweight pre-tuning search over quantization scales.
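A sensitivity metric of this kind can be sketched as a first-order proxy: the loss change from quantizing a layer is roughly the gradient magnitude times the rounding error of each weight. The exact metric in SignRoundV2 differs; this function only illustrates the idea of combining gradient information with quantization bias.

```python
def layer_sensitivity(weights, grads, scale):
    """Toy sensitivity score for one layer: sum over weights of
    |gradient| * |quantization bias|, where the bias is the distance
    between a weight and its round-to-nearest quantized value.
    Higher scores suggest the layer should keep more bits.
    """
    return sum(abs(g) * abs(round(w / scale) * scale - w)
               for w, g in zip(weights, grads))
```

A layer whose weights sit near rounding boundaries (large bias) or carry large gradients scores higher than one whose weights are already close to grid points.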
The AutoScheme feature automatically generates layer-wise mixed-bit/data-type quantization plans within minutes (extra memory overhead of roughly 1.1–1.5× the BF16 model size) and supports per-layer customization via layer_config. On the engineering side, a 7B model completes W4A16 quantization in about 10 minutes on a single GPU, and three preset schemes (auto-round / auto-round-best / auto-round-light) cover different accuracy-speed trade-offs.
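Given per-layer sensitivity scores, a mixed-bit plan can be derived with a simple greedy search: start every layer at the lowest bit width and keep upgrading the most sensitive layers while the average bit width stays within budget. This is a hypothetical stand-in for AutoScheme's actual search, shown only to make the allocation idea concrete.

```python
def auto_scheme(sensitivities, avg_bits, choices=(2, 4, 8)):
    """Toy mixed-precision planner.

    Starts every layer at the lowest bit width in `choices`, then repeatedly
    upgrades layers one level at a time, most sensitive first, as long as
    the plan's average bit width does not exceed `avg_bits`.
    Returns the per-layer bit widths in layer order.
    """
    n = len(sensitivities)
    levels = sorted(choices)
    plan = [0] * n  # index into `levels`, all layers start at the lowest width
    order = sorted(range(n), key=lambda i: sensitivities[i], reverse=True)
    changed = True
    while changed:
        changed = False
        for i in order:
            if plan[i] + 1 < len(levels):
                # average bit width if layer i were upgraded one level
                total = (sum(levels[p] for p in plan)
                         - levels[plan[i]] + levels[plan[i] + 1])
                if total / n <= avg_bits:
                    plan[i] += 1
                    changed = True
    return [levels[p] for p in plan]
```

With four layers, one of which is far more sensitive than the rest, and a 3.5-bit average budget, the three most sensitive layers end up at 4 bits and the least sensitive one stays at 2 bits.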
Quantized models can be loaded directly in Transformers, vLLM, SGLang, and other mainstream inference frameworks without code modifications. The same quantization pipeline adapts to multiple hardware backends: Intel Xeon CPU, Intel GPU (XPU), NVIDIA GPU (CUDA), and Habana Gaudi (HPU). Additionally, it supports 10+ VLM models out of the box, Multi-Token Prediction (MTP) layer quantization, and switching between HuggingFace and ModelScope model sources via an environment variable.
The underlying CUDA quantization kernels reuse open-source libraries including AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLlamaV2. The academic foundation includes the SignRoundV1 paper (EMNLP 2024 Findings) and the subsequent SignRoundV2 paper.