A combination of Oobabooga's fork and the main cuda branch of GPTQ-for-LLaMa in a package format.

jllllll/GPTQ-for-LLaMa-CUDA

This is a fork of qwopqwop200's repository meant for stable usage in text-generation-webui.

This package uses import redirection to allow for easier integration with existing projects.

  • Oobabooga's fork is used by default when a compatible GPU is detected.
  • qwopqwop200's 'cuda' branch is used for GPUs older than Pascal.
  • AMD-compatible conversions of both are available courtesy of WapaMario63's work: GPTQ-for-LLaMa-ROCm

Python modules can be imported as if they are in the main package and the appropriate versions will be selected:

import gptq_for_llama.llama_inference_offload
from gptq_for_llama.modelutils import find_layers
from gptq_for_llama.quant import make_quant

This can be overridden by setting the QUANT_CUDA_OVERRIDE environment variable to either old or new before importing. There is also an experimental function for switching versions on the fly:

from gptq_for_llama import switch_gptq

switch_gptq('new')
import gptq_for_llama.llama_inference_offload

Limited testing showed reliable swapping of versions. However, this may not work when swapping models repeatedly.
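
The selection behaviour described above can be sketched in plain Python. This is only an illustration of one common way to implement import redirection (aliasing a chosen module in sys.modules), not necessarily how this package does it. The names pick_version, redirect, and demo_quant are hypothetical stand-ins; only the QUANT_CUDA_OVERRIDE variable and its old/new values come from the text above.

```python
import os
import sys
import types


def pick_version(default):
    """Return 'old' or 'new', honouring QUANT_CUDA_OVERRIDE when it is valid."""
    override = os.environ.get("QUANT_CUDA_OVERRIDE", "").strip().lower()
    if override in ("old", "new"):
        return override
    return default


def redirect(alias, backends, default):
    """Register the chosen backend module under a single import name."""
    choice = pick_version(default)
    sys.modules[alias] = backends[choice]
    return choice


# Hypothetical stand-ins for the two bundled implementations.
backends = {}
for name in ("old", "new"):
    module = types.ModuleType(f"demo_quant_{name}")
    module.VERSION = name
    backends[name] = module

# The override must be in place before the redirected import happens.
os.environ["QUANT_CUDA_OVERRIDE"] = "old"
redirect("demo_quant", backends, default="new")

import demo_quant  # resolved through sys.modules, so the 'old' stand-in is used

print(demo_quant.VERSION)
```

Because the alias is resolved at import time, changing the variable after the first import has no effect in this sketch, which mirrors the "before importing" requirement stated above.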

GPTQ-for-LLaMA

4-bit quantization of LLaMA using GPTQ

GPTQ is a SOTA one-shot weight quantization method

This code is based on GPTQ

There is a pytorch branch that allows you to use groupsize and act-order together.

New Features

Changed to support new features proposed by GPTQ.

  • Slightly adjusted preprocessing of C4 and PTB for more realistic evaluations (used in our updated results); can be activated via the flag --new-eval.
  • Optimized cuda kernels, which are considerably faster especially on the A100, e.g. 1.9x -> 3.25x generation speedup for OPT-175B; can be activated via --faster-kernel.
  • Two new tricks: --act-order (quantizing columns in order of decreasing activation size) and --true-sequential (performing sequential quantization even within a single Transformer block). These fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.
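
As an illustration of the --act-order idea in the list above, the following pure-Python sketch ranks weight columns by decreasing activation size. It assumes "activation size" means the summed squared calibration activation per input column (the diagonal of X^T X). The function name and the sample data are hypothetical, and this is not the repository's implementation.

```python
def act_order_permutation(activations):
    """Return column indices sorted by decreasing summed squared activation.

    `activations` is a list of calibration rows, one value per input column.
    """
    n_cols = len(activations[0])
    energy = [sum(row[j] ** 2 for row in activations) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: energy[j], reverse=True)


# Two calibration samples with three input columns (made-up numbers).
calib = [
    [0.1, 2.0, -0.5],
    [0.2, -1.5, 0.4],
]
perm = act_order_permutation(calib)
print(perm)  # [1, 2, 0]
```

Columns would then be quantized in the order given by perm, so the columns that carry the largest activations are handled first.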

Currently, groupsize and act-order do not work together and you must choose one of them.

Result

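LLaMA-7B

| LLaMA-7B | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
| --- | --- | --- | --- | --- | --- |
| FP16 | 16 | - | 13940 | 5.68 | 12.5 |
| RTN | 4 | - | - | 6.29 | - |
| GPTQ | 4 | - | 4740 | 6.09 | 3.5 |
| RTN | 3 | - | - | 25.54 | - |
| GPTQ | 3 | - | 3852 | 8.07 | 2.7 |
| GPTQ | 3 | 128 | 4116 | 6.61 | 3.0 |

LLaMA-13B

| LLaMA-13B | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
| --- | --- | --- | --- | --- | --- |
| FP16 | 16 | - | OOM | 5.09 | 24.2 |
| RTN | 4 | - | - | 5.53 | - |
| GPTQ | 4 | - | 8410 | 5.36 | 6.5 |
| RTN | 3 | - | - | 11.40 | - |
| GPTQ | 3 | - | 6870 | 6.63 | 5.1 |
| GPTQ | 3 | 128 | 7277 | 5.62 | 5.4 |

LLaMA-33B

| LLaMA-33B | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
| --- | --- | --- | --- | --- | --- |
| FP16 | 16 | - | OOM | 4.10 | 60.5 |
| RTN | 4 | - | - | 4.54 | - |
| GPTQ | 4 | - | 19493 | 4.45 | 15.7 |
| RTN | 3 | - | - | 14.89 | - |
| GPTQ | 3 | - | 15493 | 5.69 | 12.0 |
| GPTQ | 3 | 128 | 16566 | 4.80 | 13.0 |

LLaMA-65B

| LLaMA-65B | Bits | group-size | memory (MiB) | Wikitext2 | checkpoint size (GB) |
| --- | --- | --- | --- | --- | --- |
| FP16 | 16 | - | OOM | 3.53 | 121.0 |
| RTN | 4 | - | - | 3.92 | - |
| GPTQ | 4 | - | OOM | 3.84 | 31.1 |
| RTN | 3 | - | - | 10.59 | - |
| GPTQ | 3 | - | OOM | 5.04 | 23.6 |
| GPTQ | 3 | 128 | OOM | 4.17 | 25.6 |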
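
The Wikitext2 column reports perplexity (lower is better). As a reminder of what that number means, this minimal sketch computes perplexity as the exponential of the mean per-token negative log-likelihood. The helper name and sample values are illustrative, and the repository's evaluation code may window and batch the data differently.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every correct token
# is "as confused as" a uniform choice among 4 options.
uniform_nlls = [math.log(4)] * 3
print(round(perplexity(uniform_nlls), 6))  # 4.0
```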

Quantization requires a large amount of CPU memory. However, the memory required can be reduced by using swap memory.

Depending on the GPUs/drivers, there may be a difference in performance, which decreases as the model size increases. (IST-DASLab/gptq#1)

According to the GPTQ paper, as the size of the model increases, the difference in performance between FP16 and GPTQ decreases.

Installation

If you don't have conda, install it first.

conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install

# Benchmark performance for FC2 layer of LLaMa-7B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py
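
Before building the kernels, it can help to confirm that the PyTorch install from the steps above actually sees a CUDA device. The sketch below is a hypothetical helper (check_environment is not part of this repository); it relies only on torch.cuda.is_available() and torch.version.cuda, and it degrades gracefully when PyTorch is missing.

```python
import importlib.util


def check_environment():
    """Report whether PyTorch is importable and whether it can see CUDA."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "cuda_version": None,
    }
    if report["torch_installed"]:
        try:
            import torch
        except Exception:
            # A broken install is reported the same way as a missing one.
            report["torch_installed"] = False
            return report
        report["cuda_available"] = bool(torch.cuda.is_available())
        report["cuda_version"] = torch.version.cuda  # None for CPU-only builds
    return report


report = check_environment()
print(report)
```

If cuda_available is False, building and running the CUDA kernels is unlikely to succeed until the PyTorch/CUDA install is fixed.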

Dependencies

All experiments were run on a single NVIDIA RTX3090.

Language Generation

LLaMA

# Convert LLaMA to hf
python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf

# Benchmark language generation with 4-bit LLaMA-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --save llama7b-4bit.pt
# Or save compressed `.safetensors` model
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --save_safetensors llama7b-4bit.safetensors
# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --load llama7b-4bit.pt --benchmark 2048 --check
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py ./llama-hf/llama-7b c4 --benchmark 2048 --check

# model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --wbits 4 --load llama7b-4bit.pt --text "this is llama"
# Model inference with the saved model, with offload. This is very slow; it is a simple implementation and could be improved with technologies like FlexGen (https://github.com/FMInference/FlexGen).
CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ./llama-hf/llama-7b --wbits 4 --load llama7b-4bit.pt --text "this is llama" --pre_layer 16
# With LLaMA-65B on a single RTX3090 and pre_layer set to 50, it takes about 180 seconds to generate 45 tokens (5 -> 50 tokens).
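
To make the offload numbers above concrete, here is a small sketch. It assumes that --pre_layer N keeps the first N transformer layers on the GPU and leaves the remaining layers on the CPU; that reading, the plan_offload helper, and the device labels are assumptions for illustration rather than the script's documented behaviour. The layer count of 32 is that of LLaMA-7B.

```python
def plan_offload(n_layers, pre_layer):
    """Assign the first `pre_layer` layers to the GPU and the rest to the CPU."""
    if pre_layer < 0:
        raise ValueError("pre_layer must be non-negative")
    on_gpu = min(pre_layer, n_layers)
    return ["cuda:0"] * on_gpu + ["cpu"] * (n_layers - on_gpu)


# LLaMA-7B has 32 transformer layers; --pre_layer 16 would keep half on the GPU.
plan = plan_offload(n_layers=32, pre_layer=16)
print(plan.count("cuda:0"), plan.count("cpu"))  # 16 16

# Throughput implied by the timing quoted above: 45 tokens in about 180 seconds.
tokens, seconds = 45, 180
print(tokens / seconds)  # 0.25
```

At roughly 0.25 tokens per second (about 4 seconds per token), offloading is best treated as a way to fit a model that would otherwise not load, not as a fast path.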

The CUDA kernels support 2, 3, 4, and 8 bits; the faster CUDA kernels support 2, 3, and 4 bits.

In general, 4-bit quantization with a groupsize of 128 is recommended.
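
To show why a groupsize helps, the sketch below applies plain round-to-nearest (RTN) min-max quantization to a toy weight vector, once with a single range for all weights and once with a separate range per group. It is a generic illustration, not the GPTQ algorithm or this repository's kernels; the function names and the toy values (2 bits, groupsize 2) are chosen only to make the effect visible.

```python
def rtn_quantize(weights, bits=4, groupsize=None):
    """Quantize then dequantize `weights` with per-group min-max RTN."""
    levels = 2 ** bits - 1
    size = groupsize or len(weights) or 1
    restored = []
    for start in range(0, len(weights), size):
        group = weights[start:start + size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        for w in group:
            code = round((w - lo) / scale)  # integer code in [0, levels]
            restored.append(lo + code * scale)
    return restored


def max_error(original, restored):
    """Largest absolute difference between original and restored weights."""
    return max(abs(a - b) for a, b in zip(original, restored))


# One outlier (100.0) stretches the shared range and hurts the small weights.
weights = [0.0, 1.5, 3.0, 100.0]
whole = rtn_quantize(weights, bits=2)
grouped = rtn_quantize(weights, bits=2, groupsize=2)

print(max_error(weights, whole))
print(max_error(weights, grouped))
```

With a single range, the outlier forces a coarse step and the small weights collapse to the same code; with per-group ranges, each group gets its own step size, at the cost of storing extra range metadata per group.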

Acknowledgements

This code is based on GPTQ

Thanks to Meta AI for releasing LLaMA, a powerful LLM.
