Add AWQ quantization inference support (#1019) #1054

Narsil · 2023-09-25T07:58:35Z

Add AWQ quantization inference support

Fixes #781

This PR (partially) adds support for AWQ quantization for inference.
More information on AWQ is available [here](https://arxiv.org/abs/2306.00978). In
general, AWQ is faster and more accurate than GPTQ, which is currently
supported by TGI.

This PR installs the 4-bit GEMM custom CUDA kernels released by the AWQ authors
(in requirements.txt, just a one-line change).

A quick way to test this PR is to bring up TGI as follows:

text-generation-server download-weights abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq

text-generation-launcher \
--huggingface-hub-cache ~/.cache/huggingface/hub/ \
--model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq \
--trust-remote-code --port 8080 \
--max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \
--quantize awq
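
Once the launcher is up, a request against TGI's `/generate` endpoint is a quick way to confirm that the quantized model responds. The following is a minimal sketch, assuming the server is reachable at `http://localhost:8080` (matching `--port 8080` above); the helper names `build_generate_request` and `generate` are made up for illustration and are not part of TGI.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=32, base_url="http://localhost:8080"):
    """Build a POST request for TGI's /generate endpoint.

    base_url is an assumption: it matches the --port 8080 flag used above.
    """
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["generated_text"]


# Example call, only meaningful while the launcher is running:
# print(generate("def fibonacci(n):", max_new_tokens=64))
```

Building the request separately from sending it keeps the payload easy to inspect without a live server.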

Please note:

* This PR was tested with FlashAttention v2 and vLLM.
* This PR adds support for AWQ inference, not for quantizing models. Quantization needs to be done outside of TGI; instructions are [here](https://github.com/mit-han-lab/llm-awq/tree/f084f40bd996f3cf3a0633c1ad7d9d476c318aaa).
* This PR only adds support for FlashLlama models for now.
* Multi-GPU setup has not been tested.
* No integration tests have been added so far; they will be added later if maintainers are interested in this change.
* This PR can be tested on any of the models released [here](https://huggingface.co/abhinavkulkarni?sort_models=downloads#models).
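
For readers unfamiliar with the `w4-g128` suffix in the model names, it refers to 4-bit weights with a quantization group size of 128. The sketch below illustrates the general idea of group-wise 4-bit storage and dequantization in plain Python. It is a simplified illustration under stated assumptions, not the AWQ kernels' actual memory layout: the function names are hypothetical, and the real kernels may order the packed nibbles differently.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit values into one 32-bit integer, lowest nibble first.

    The sequential nibble order is an assumption made for clarity.
    """
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (4 * i)
    return packed


def unpack_int4(packed):
    """Recover the eight 4-bit values from a packed 32-bit integer."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]


def dequantize(qweights, scales, zeros, group_size=128):
    """Group-wise dequantization: w[i] = (q[i] - zero[g]) * scale[g], with g = i // group_size."""
    return [
        (q - zeros[i // group_size]) * scales[i // group_size]
        for i, q in enumerate(qweights)
    ]


# Tiny example with a group size of 2 instead of 128, to keep it readable.
example = dequantize([0, 15, 8, 8], scales=[0.5, 2.0], zeros=[8, 8], group_size=2)
```

Each group shares one scale and one zero point, which is why the group size trades a little extra storage for better accuracy.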

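Please refer to the linked issue for benchmarks comparing [abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq](https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq) with [TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ).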

Please note, AWQ has released faster (and in case of Llama, fused)
kernels for 4-bit GEMM, currently at the top of the main branch at
https://github.com/mit-han-lab/llm-awq, but this PR uses an older commit
that has been tested to work. We can switch to latest commit later on.

Who can review?

@OlivierDehaene OR @Narsil

Co-authored-by: Abhinav Kulkarni

HuggingFaceDocBuilderDev · 2023-09-25T08:03:38Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Narsil · 2023-09-25T08:10:35Z

@abhinavkulkarni
@casper-hansen

For visibility.

casper-hansen · 2023-09-25T08:22:57Z

> Please note, AWQ has released faster (and in case of Llama, fused)
> kernels for 4-bit GEMM, currently at the top of the main branch at
> https://github.com/mit-han-lab/llm-awq, but this PR uses an older commit
> that has been tested to work. We can switch to latest commit later on.

They have released one new GEMV kernel and one new GEMM kernel. On their main branch, they use GEMV for token generation and GEMM for context processing. The GEMV kernel is 20% faster than the old GEMM kernel, but most importantly, the new GEMM kernel is 5-6x slower than the old GEMM kernel.

My conclusion is that the current GEMM kernel already used in this PR is the optimal one for now.

Narsil · 2023-09-25T09:18:53Z

> The GEMV kernel is 20% faster than the old GEMM kernel, but most importantly, the new GEMM kernel is 5-6x slower than the old GEMM kernel.

I didn't test the GEMV kernel, but the GEMM kernel seems to have similar speeds for me (on an A10G). Which card did you test on?

casper-hansen · 2023-09-25T09:24:01Z

> > The GEMV kernel is 20% faster than the old GEMM kernel, but most importantly, the new GEMM kernel is 5-6x slower than the old GEMM kernel.
>
> I didn't test the GEMV kernel, but the GEMM kernel seems to have similar speeds for me (on an A10G). Which card did you test on?

There are multiple GEMM kernels. The new one is slower because it implements a different packed format for quantized models. I tested on RTX 3090, 4090, and A100.

On RTX 3090, speed of context processing on LLaMa 7B:

GEMM (original): 2400 tokens/s
GEMM (new): 440 tokens/s
GEMV: 234 tokens/s

EDIT: What I mean by "new" GEMM kernel is this pull request that is about to be merged into the original llm-awq: mit-han-lab/llm-awq#90
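
The slowdown factors implied by the figures above can be checked with a few lines of arithmetic. This sketch only uses the three RTX 3090 context-processing numbers reported in this comment; the variable names are arbitrary.

```python
# Context-processing speeds reported above for LLaMa 7B on an RTX 3090 (tokens/s).
reported = {
    "GEMM (original)": 2400,
    "GEMM (new)": 440,
    "GEMV": 234,
}

baseline = reported["GEMM (original)"]

# Ratio of the original GEMM speed to each kernel's speed.
slowdowns = {name: baseline / tps for name, tps in reported.items()}

for name, factor in slowdowns.items():
    print(f"{name}: {reported[name]} tokens/s, original GEMM is {factor:.2f}x as fast")
```

The ratio for the new GEMM kernel comes out at roughly 5.5, consistent with the "5-6x slower" estimate, and the GEMV kernel is roughly 10x slower for context processing.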

Narsil · 2023-09-25T09:40:54Z

> GEMM (original): 2400 tokens/s
> GEMM (new): 440 tokens/s
> GEMV: 234 tokens/s

Those look like throughput values, which are generally less important to us than latency. Ideally, the whole latency/throughput curve gives a better picture, since we do want throughput, but at a fixed latency. For instance, 2x the throughput at 2x the latency is usually unacceptable in our deployments.
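
One way to read "throughput at fixed latency" is: among all operating points on a measured curve, pick the one with the highest throughput whose latency stays within a budget. The sketch below shows that selection. The `Point` type, the `best_under_budget` helper, and every number in `hypothetical_curve` are invented for illustration; none of them are measurements from this PR.

```python
from typing import NamedTuple, Optional, Sequence


class Point(NamedTuple):
    batch_size: int
    latency_ms: float   # per-token latency at this batch size
    throughput: float   # total tokens/s at this batch size


def best_under_budget(curve: Sequence[Point], max_latency_ms: float) -> Optional[Point]:
    """Return the highest-throughput point whose latency fits the budget, or None."""
    within_budget = [p for p in curve if p.latency_ms <= max_latency_ms]
    if not within_budget:
        return None
    return max(within_budget, key=lambda p: p.throughput)


# Hypothetical curve (made-up values): throughput grows with batch size, and so does latency.
hypothetical_curve = [
    Point(batch_size=1, latency_ms=20.0, throughput=50.0),
    Point(batch_size=4, latency_ms=25.0, throughput=160.0),
    Point(batch_size=16, latency_ms=40.0, throughput=400.0),
    Point(batch_size=64, latency_ms=90.0, throughput=710.0),
]

choice = best_under_budget(hypothetical_curve, max_latency_ms=40.0)
```

With a 40 ms budget, the batch-size-64 point is excluded despite having the highest raw throughput, which is the trade-off described above.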

Finishing nits + integration test

8ee9307

Add git to docker.

e08f3ac

Narsil added 3 commits September 25, 2023 08:24

Declare torch as build dep.

757cf17

Using kernel like Makefile instead.

27ecef5

Change deploy.

a8f870a

Make awq install optional + integration tests values.

02d4f62

Narsil added 5 commits September 25, 2023 09:41

Update dockerfile with new build.

4a29074

Adding target list.

2d8c034

Support TheBloke exported models.

cbf047b

Fix and test sharded version.

97292ec

Fix dockerfile.

e27438a

Narsil merged commit c5de7cd into main Sep 25, 2023
6 checks passed

Narsil deleted the awq_support branch September 25, 2023 13:31

zTaoplus mentioned this pull request Sep 25, 2023

Add support for AWQ quantized models ZJUICI/text-generation-inference#1

Closed

abhinavkulkarni mentioned this pull request Sep 27, 2023

[Announcement] AWQ is now supported in text-generation-inference mit-han-lab/llm-awq#92

Open
