AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

[Paper][Slides][Video]

Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs.

The current release supports:

AWQ search for accurate quantization.
Pre-computed AWQ model zoo for LLMs (Llama-1/2/3, OPT, CodeLlama, StarCoder, Vicuna, VILA, LLaVA; load to generate quantized weights).
Memory-efficient 4-bit Linear in PyTorch.
Efficient CUDA kernel implementation for fast inference (supports both the context and decoding stages).
Examples on 4-bit inference of an instruction-tuned model (Vicuna) and multi-modal LM (VILA).

Thanks to AWQ, TinyChat can deliver more efficient responses with LLM/VLM chatbots through 4-bit inference.

TinyChat on RTX 4090 (3.4x faster than FP16):

TinyChat on Jetson Orin (3.2x faster than FP16):

TinyChat also supports inference with vision language models (e.g., VILA, LLaVA). In the following examples, W4A16 quantized models from the VILA family are launched with TinyChat.

TinyChat with VILA-13B on RTX 4090 (multi-image inputs supported):

TinyChat with VILA-7B/13B on Jetson Orin:

Check out TinyChat, which offers a turn-key solution for on-device inference of LLMs and VLMs on resource-constrained edge platforms. With TinyChat, it is now possible to efficiently run large models on small, low-power devices even without an Internet connection!

News

[2024/04] 🔥 We released AWQ and TinyChat support for the Llama-3 model family! Check out our example here.
[2024/03] 🔥 AWQ has been widely adopted by the industry, such as NVIDIA, Google, Amazon, and Intel!
[2024/02] 🔥 AWQ has been accepted to MLSys 2024!
[2024/02] 🔥 We added support for VILA Vision Language Models in AWQ & TinyChat! Check out our latest demos with multi-image inputs!
[2024/02] 🔥 We released a new version of the quantized GEMM/GEMV kernels in TinyChat, reaching an inference speed of 38 tokens/second on NVIDIA Jetson Orin!
[2023/11] 🔥 We added AWQ support and pre-computed search results for CodeLlama, StarCoder, and StableCode models. Check out our model zoo here!
[2023/11] 🔥 AWQ is now integrated natively in Hugging Face transformers through from_pretrained. You can either load quantized models from the Hub or your own HF quantized models.
[2023/10] AWQ is integrated into NVIDIA TensorRT-LLM
[2023/09] AWQ is integrated into FastChat, vLLM, HuggingFace TGI, and LMDeploy.
[2023/09] ⚡ Check out our latest TinyChat, which is ~2x faster than the first release on Orin!
[2023/09] ⚡ Check out AutoAWQ, a third-party implementation to make AWQ easier to expand to new models, improve inference speed, and integrate into Huggingface.
[2023/07] 🔥 We released TinyChat, an efficient and lightweight chatbot interface based on AWQ. TinyChat enables efficient LLM inference on both cloud and edge GPUs. Llama-2-chat models are supported! Check out our implementation here.
[2023/07] 🔥 We added AWQ support and pre-computed search results for Llama-2 models (7B & 13B). Check out our model zoo here!
[2023/07] We extended the support for more LLM models including MPT, Falcon, and BLOOM.

Install

Clone this repository and navigate to the llm-awq folder

git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq

Install Package

conda create -n awq python=3.10 -y
conda activate awq
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

For edge devices like Orin, before running the commands above, please:
1. Modify pyproject.toml by commenting out this line.
2. Set this line to transformers==4.32.0.
3. Manually install precompiled PyTorch binaries (>=2.0.0) from NVIDIA.
4. Set the appropriate Python version for conda environment (e.g., conda create -n awq python=3.8 -y for JetPack 5).

Install the efficient W4A16 (4-bit weight, 16-bit activation) CUDA kernels and the optimized FP16 kernels (e.g., layernorm, positional encodings).

cd awq/kernels
python setup.py install
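
To make the "4-bit weight" half of W4A16 concrete, here is a minimal sketch of how eight 4-bit values can share one 32-bit word. It assumes a simple low-nibble-first order; the kernels in awq/kernels may use a different (e.g., interleaved) layout, and pack_int4 / unpack_int4 are hypothetical helper names, not functions from this repository.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit integers (0..15) into one 32-bit word."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= 15:
            raise ValueError(f"value {v} does not fit in 4 bits")
        word |= v << (4 * i)  # value i occupies bits [4*i, 4*i + 3]
    return word


def unpack_int4(word):
    """Recover the eight 4-bit integers from a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


nibbles = [1, 2, 3, 4, 5, 6, 7, 15]
packed = pack_int4(nibbles)
assert unpack_int4(packed) == nibbles  # packing is lossless
```

Storing weights this way is what lets a 4-bit Linear layer hold eight weights in the space of two FP16 values; the activations themselves stay in 16-bit floating point.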

AWQ Model Zoo

We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:

# git lfs install  # install git lfs if not already
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache

The detailed support list:

Models	Sizes	INT4-g128	INT3-g128
Llama3	8B/70B	✅	✅
Llama2	7B/13B/70B	✅	✅
LLaMA	7B/13B/30B/65B	✅	✅
OPT	125m/1.3B/2.7B/6.7B/13B/30B	✅	✅
CodeLlama	7B/13B/34B	✅	✅
StarCoder	15.5B	✅	✅
Vicuna-v1.1	7B/13B	✅
LLaVA-v0	13B	✅
VILA	7B/13B	✅

Note: The table above lists only the models for which we have prepared AWQ search results. AWQ also supports other models, such as LLaVA-v1.5 7B, but you may need to run the AWQ search yourself to quantize them.

Examples

AWQ can be easily applied to various LMs thanks to its good generalization, including instruction-tuned models and multi-modal LMs. It provides an easy-to-use tool to reduce the serving cost of LLMs.

We provide two example applications of AWQ under the ./examples directory: Vicuna-7B (chatbot) and LLaVA-13B (visual reasoning). AWQ reduces the GPU memory needed for model serving and speeds up token generation, while the quantization stays accurate enough to preserve the models' reasoning outputs. You should be able to observe memory savings when running the models with 4-bit weights.

Note that we perform AWQ using only textual calibration data, despite running on multi-modal input. Please refer to ./examples for details.

Usage

We provide several sample scripts to run AWQ (please refer to ./scripts). We use Llama3-8B as an example.

Perform AWQ search and save search results (we already did it for you):

python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/llama3-8b-w4-g128.pt
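
For intuition about what the search step looks for, here is a toy sketch of activation-aware scaling as described in the AWQ paper: each input channel is scaled by its mean activation magnitude raised to a power alpha, the scaled weights are quantized, and alpha is chosen by grid search to minimize the output error on calibration inputs. This is not the code behind awq.entry; all function names are hypothetical, each weight row is treated as a single quantization group, and any extra normalization or clipping the repository performs is omitted.

```python
import random


def quantize_row(row, w_bit=4):
    """Min-max pseudo-quantize one weight group to w_bit bits (values stay floats)."""
    max_int = 2 ** w_bit - 1
    lo, hi = min(row), max(row)
    scale = max(hi - lo, 1e-5) / max_int
    zero = min(max(round(-lo / scale), 0), max_int)
    return [(min(max(round(w / scale) + zero, 0), max_int) - zero) * scale for w in row]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def output_error(W, W_q, s, X):
    """Mean squared error between W x and W_q (x / s) over the calibration inputs."""
    err = 0.0
    for x in X:
        ref = matvec(W, x)
        out = matvec(W_q, [xi / si for xi, si in zip(x, s)])
        err += sum((a - b) ** 2 for a, b in zip(ref, out))
    return err / len(X)


def search_scale(W, X, n_grid=20):
    """Grid-search alpha in [0, 1); alpha = 0 is plain round-to-nearest."""
    n_in = len(W[0])
    act_mag = [sum(abs(x[j]) for x in X) / len(X) for j in range(n_in)]
    best = (float("inf"), None, None)
    for i in range(n_grid):
        alpha = i / n_grid
        s = [max(m ** alpha, 1e-4) for m in act_mag]
        W_scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
        W_q = [quantize_row(row) for row in W_scaled]
        err = output_error(W, W_q, s, X)
        if err < best[0]:
            best = (err, alpha, s)
    return best


random.seed(0)
n_out, n_in = 8, 16
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
# Synthetic calibration data: channels 0 and 1 carry much larger activations.
X = [[random.gauss(0, 1) * (20.0 if j < 2 else 1.0) for j in range(n_in)] for _ in range(32)]

baseline_err = output_error(W, [quantize_row(r) for r in W], [1.0] * n_in, X)
best_err, best_alpha, _ = search_scale(W, X)
print(f"round-to-nearest error: {baseline_err:.4f}")
print(f"best alpha: {best_alpha:.2f}, error: {best_err:.4f}")
```

Because alpha = 0 is part of the grid, the selected scaling can never do worse than plain rounding on the calibration data; how much it helps depends on how uneven the activation magnitudes are.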

Evaluate the AWQ quantized model on WikiText-2 (simulated pseudo quantization)

python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/llama3-8b-w4-g128.pt \
    --q_backend fake

Generate real quantized weights (INT4)

mkdir quant_cache
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/llama3-8b-w4-g128.pt \
    --q_backend real --dump_quant quant_cache/llama3-8b-w4-g128-awq.pt

Load and evaluate the real quantized model (you should now see lower GPU memory usage)

python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/llama3-8b-w4-g128-awq.pt
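
As a rough guide to how much smaller the weights become, the following back-of-the-envelope sketch estimates weight storage for INT4 with group size 128. The assumptions are mine, not measurements from this repository: a round 8e9 parameters, every weight quantized, and one FP16 scale plus one 4-bit zero point stored per group. Layers kept in FP16, activations, and the KV cache are not counted, so real savings will be smaller.

```python
def bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16, zero_bits=4):
    """Effective storage per weight once per-group metadata is included."""
    return w_bit + (scale_bits + zero_bits) / q_group_size


def weight_memory_gb(n_params, bits):
    """Weight storage in GB (10^9 bytes) for n_params weights at `bits` bits each."""
    return n_params * bits / 8 / 1e9


n_params = 8e9  # assumed round figure for an 8B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, bits_per_weight())
print(f"FP16: {fp16_gb:.2f} GB, INT4-g128: {int4_gb:.2f} GB, "
      f"ratio: {fp16_gb / int4_gb:.2f}x")
```

Under these assumptions the per-group metadata adds about 0.16 bits per weight, which is why the ratio comes out slightly below 4x.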

Results on Vision-Language Models (VILA-7B/13B)

AWQ also seamlessly supports large multi-modal models (LMMs). We demonstrate the results on the recent VILA model family.

VILA-7B	VQA-v2	GQA	VizWiz	ScienceQA	TextVQA	POPE	MME	MMBench	MMBench-CN	SEED
FP16	80.3	63.1	59.6	68.0	62.6	86.3	1489.4	69.8	61.0	61.7
AWQ-INT4	80.1	63.0	57.8	68.3	61.9	85.3	1486.3	68.8	58.9	61.3

VILA-13B	VQA-v2	GQA	VizWiz	ScienceQA	TextVQA	POPE	MME	MMBench	MMBench-CN	SEED
FP16	80.5	63.6	63.1	70.5	64.0	86.3	1553.6	73.8	66.7	62.8
AWQ-INT4	80.4	63.6	63.0	71.2	63.5	87.0	1552.9	73.6	66.3	62.2

Inference speed (tokens/sec)

Model	Precision	A100	4090	Orin
VILA-7B	fp16	81.6	58.5	11.5
VILA-7B-AWQ	int4	155.3	168.1	35.6
VILA-13B	fp16	48.5	OOM	6.1
VILA-13B-AWQ	int4	102.1	99.0	17.5
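
The speedup of INT4 over FP16 on each device follows directly from the table. The short sketch below does the division, with the throughput numbers copied from the rows above and the out-of-memory entry treated as missing; the dictionary layout and the speedup helper are illustrative only.

```python
# Tokens/sec copied from the table above; None marks the out-of-memory entry.
throughput = {
    "VILA-7B": {
        "fp16": {"A100": 81.6, "4090": 58.5, "Orin": 11.5},
        "int4": {"A100": 155.3, "4090": 168.1, "Orin": 35.6},
    },
    "VILA-13B": {
        "fp16": {"A100": 48.5, "4090": None, "Orin": 6.1},
        "int4": {"A100": 102.1, "4090": 99.0, "Orin": 17.5},
    },
}


def speedup(model):
    """INT4 / FP16 throughput ratio per device, or None if FP16 did not run."""
    fp16, int4 = throughput[model]["fp16"], throughput[model]["int4"]
    return {
        device: (int4[device] / fp16[device] if fp16[device] else None)
        for device in fp16
    }


speedups = {model: speedup(model) for model in throughput}
for model, per_device in speedups.items():
    for device, ratio in per_device.items():
        text = f"{ratio:.2f}x" if ratio is not None else "n/a (FP16 OOM)"
        print(f"{model} on {device}: {text}")
```

For VILA-13B on the 4090 no ratio can be computed, since the FP16 model does not fit in memory while the INT4 model runs.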

Reference

If you find AWQ useful or relevant to your research, please kindly cite our paper:

@inproceedings{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  booktitle={MLSys},
  year={2024}
}

Related Projects

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers

Vicuna and FastChat

LLaVA: Large Language and Vision Assistant
