
AutoRound

Added: Apr 24, 2026
Category: Model & Inference Framework
License: Open Source

Tags: Python · PyTorch · Large Language Models · Multimodal · Transformers · CLI · Model & Inference Framework · Model Training & Inference · Computer Vision & Multimodal

An advanced post-training quantization toolkit for LLMs and VLMs by Intel, leveraging SignRound optimization to support 2–4 bit weight quantization and automatic mixed-precision scheme generation across Intel CPU/GPU, NVIDIA GPU, and Habana Gaudi.

AutoRound is an Intel-maintained post-training quantization toolkit for Large Language Models (LLMs) and Vision-Language Models (VLMs). Its core algorithm, SignRound, leverages SignSGD to optimize rounding decisions and weight clipping in approximately 200 steps, merging the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ) without introducing extra inference overhead. SignRoundV1 reports average zero-shot accuracy improvements of 6.91%–33.22% at 2-bit weight quantization.
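The core idea behind SignRound can be illustrated with a toy sketch: instead of rounding each weight to the nearest grid point, learn a small per-weight offset in [-0.5, 0.5] with SignSGD so the quantized layer's output matches the full-precision output on calibration data. Everything below (shapes, scale choice, learning rate, the straight-through gradient) is illustrative, not Intel's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: weights W and calibration activations X (AutoRound uses real calibration data).
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 32))
ref = W @ X  # full-precision output the quantized layer should match

bits = 4
scale = np.abs(W).max() / (2 ** (bits - 1) - 1)  # symmetric per-tensor scale (illustrative)

def quantize(W, V):
    # SignRound's key idea: a learnable rounding offset V in [-0.5, 0.5] per weight.
    q = np.round(W / scale + V)
    q = np.clip(q, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

V = np.zeros_like(W)  # start at plain round-to-nearest
err0 = np.mean((quantize(W, V) @ X - ref) ** 2)
best = err0

for _ in range(200):  # ~200 optimization steps, as the paper reports
    out = quantize(W, V) @ X
    # Straight-through gradient of the output MSE w.r.t. V (round() treated as identity)
    grad = 2 * (out - ref) @ X.T * scale
    V = np.clip(V - 0.05 * np.sign(grad), -0.5, 0.5)  # SignSGD: step by gradient sign only
    best = min(best, np.mean((quantize(W, V) @ X - ref) ** 2))

print(f"output MSE: round-to-nearest {err0:.5f} -> tuned rounding {best:.5f}")
```

Because the update uses only the gradient's sign with a fixed step, tuning is cheap and bounded, which is why the method adds no inference-time overhead: after optimization, V is folded into the stored integer weights.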

The project supports a rich set of quantization data type combinations: W2A16, W3A16, W4A16, W8A16, W4A4 (research stage), NVFP4, MXFP4, Block-wise FP8, W8A8, etc., with export to five formats: AutoRound native, AutoAWQ, AutoGPTQ, GGUF, and LLM-Compressor. SignRoundV2 further introduces a fast sensitivity metric combining gradient information and quantization bias for layer-wise bit allocation, plus lightweight quantization scale pre-tuning search.
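To make the "WxAy" notation concrete, here is a minimal sketch of what W4A16 means: weights are stored as 4-bit integers with a per-group scale and dequantized to 16-bit floats for the matmul, while activations stay in 16-bit. The group size of 4 is chosen only for readability (128 is a common choice); the symmetric scheme and shapes are assumptions for illustration.

```python
import numpy as np

W = np.random.default_rng(1).normal(size=(2, 8)).astype(np.float16)
group = 4
Wg = W.reshape(-1, group).astype(np.float32)  # group weights that share one scale

# 4-bit symmetric quantization: integers in [-8, 7], one scale per group
scales = np.abs(Wg).max(axis=1, keepdims=True) / 7
q = np.clip(np.round(Wg / scales), -8, 7).astype(np.int8)  # stored packed as 4-bit on disk

# At inference the weights are dequantized back to fp16 for the "A16" matmul
W_deq = (q * scales).astype(np.float16).reshape(W.shape)
print("max abs reconstruction error:", np.abs(W.astype(np.float32) - W_deq.astype(np.float32)).max())
```

The other combinations vary the same two knobs: W2/W3/W8 change the integer range, while A4/A8 additionally quantize activations, and formats such as NVFP4/MXFP4 swap the integer grid for low-bit floating-point grids.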

The AutoScheme feature can automatically generate layer-wise mixed-bit/data-type quantization plans within minutes (extra memory overhead ~1.1–1.5× the BF16 model size), with support for per-layer customization via layer_config. On the engineering side, a 7B model completes W4A16 quantization in ~10 minutes on a single GPU, with three preset schemes (auto-round / auto-round-best / auto-round-light) covering different accuracy-speed tradeoffs.
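The intuition behind layer-wise mixed-bit allocation can be sketched as follows: given a per-layer sensitivity score and a target average bit-width, give the most sensitive layers more bits while staying within the budget. The function name, the greedy rule, and the sensitivity values are hypothetical illustrations, not AutoScheme's actual algorithm or `layer_config` schema.

```python
def allocate_bits(sensitivity, target_avg_bits, choices=(2, 4, 8)):
    """Greedy toy allocator: upgrade the most sensitive layers first
    while the average bit-width stays within the target budget."""
    n = len(sensitivity)
    bits = {layer: min(choices) for layer in sensitivity}  # start everywhere at the lowest bit
    order = sorted(sensitivity, key=sensitivity.get, reverse=True)
    for layer in order:
        for b in sorted(c for c in choices if c > bits[layer]):
            # Accept the upgrade only if the average bit-width stays in budget
            if (sum(bits.values()) - bits[layer] + b) / n <= target_avg_bits:
                bits[layer] = b
    return bits

# Hypothetical sensitivity scores for four attention projections
sens = {"q_proj": 0.9, "k_proj": 0.2, "v_proj": 0.7, "o_proj": 0.1}
plan = allocate_bits(sens, target_avg_bits=4)
print(plan)
```

AutoRound's real sensitivity metric combines gradient information with quantization bias rather than a single scalar per layer, but the budgeted-allocation shape of the problem is the same.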

Quantized models can be loaded directly in Transformers, vLLM, SGLang, and other mainstream inference frameworks without code modifications. The same quantization pipeline adapts to multiple hardware backends: Intel Xeon CPU, Intel GPU (XPU), NVIDIA GPU (CUDA), and Habana Gaudi (HPU). Additionally, it supports 10+ VLM models out of the box, quantization of Multi-Token Prediction (MTP) layers, and switching between HuggingFace and ModelScope model sources via an environment variable.

The underlying CUDA quantization kernels reuse open-source libraries including AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2. The academic foundation includes the SignRoundV1 paper (EMNLP 2024 Findings) and the subsequent SignRoundV2 paper.

