GPTQ Collaboration? #75

Closed
dalistarh opened this issue Mar 23, 2023 · 4 comments

@dalistarh

Dear Qwopqwop200,

I'm writing on behalf of the authors of the GPTQ paper. We have been following your excellent work, and wanted to mention that we added a few updates to our repository yesterday, which may be of interest to you:

  • We added a minimal LLaMA integration demonstrating a few additional tricks that lead to some accuracy improvements (especially on the 7B model; for instance, GPTQ is now consistently better than RTN).
  • Further, we pushed a significantly faster 3-bit kernel (optimized for the A100) and slightly adjusted evaluation procedures for PTB and C4, which are used in the camera-ready version of our paper.

If you would be interested in collaborating more closely with us, please feel free to write to us at [email protected] / [email protected].

Best regards,
Dan

@Wingie

Wingie commented Mar 24, 2023

Hey Dan, nice to hear from you.
We are all playing with your project and love it!

I also had a couple of questions for you regarding kernels:
Do you recommend 3-bit or 4-bit kernels?
Do you think GPTQ can also help in non-CUDA implementations, like ggerganov/llama.cpp#397? I believe they don't have CUDA kernels, since it is a pure C++ implementation, and I think they perform RTN quantization, which (in my observations) degrades the model. Basically, I'm wondering whether there is a way to port the GPTQ algorithm to that C++ repo.

@MarkSchmidty

MarkSchmidty commented Mar 25, 2023

@Wingie llama.cpp has supported (4-bit) GPTQ inference for 4 days now. There is a script in that repo called convert-gptq-to-ggml.py to get you started.

do you think gptq can also help in non-cuda implementations?

GPTQ is indeed better than RTN, even in pure CPU implementations.
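
To make the contrast concrete, here is a minimal round-to-nearest sketch in Python (per-tensor and symmetric purely for illustration; real implementations typically quantize per group or per output channel, and this is not code from either repo). GPTQ replaces this data-free rounding with a calibration-based layer reconstruction step that uses approximate second-order information, which is where the accuracy gap at low bit-widths comes from.

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Toy per-tensor, symmetric round-to-nearest (RTN) quantization.

    Every weight is scaled and rounded independently, with no calibration
    data -- the baseline GPTQ is being compared against in this thread.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    scale = w.abs().max() / qmax                    # one scale for the whole tensor
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                # return dequantized weights

# Toy usage: mean absolute rounding error on a random weight matrix.
w = torch.randn(1024, 1024)
print((w - rtn_quantize(w, bits=4)).abs().mean().item())
```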

3 bit or 4 bit kernels?

With the latest optimizations to GPTQ, 13B 3-bit is superior to 7B 4-bit, 30B 3-bit is superior to 13B 4-bit, and so on. So you will likely want to optimize for the maximum number of parameters you can fit in the RAM/VRAM you have. If you have memory to spare, then more bits may produce marginally better results at the same parameter count.
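
For a rough sense of that trade-off, here is a back-of-the-envelope sketch; the parameter counts and the per-weight overhead for group scales/zero-points are assumptions for illustration, not figures from either repo, and activations/KV cache are ignored.

```python
def quantized_weight_gib(params_billion: float, bits: int,
                         overhead_bits: float = 0.25) -> float:
    """Very rough weight-only memory estimate for a quantized model.

    overhead_bits loosely accounts for per-group scales/zero-points
    (an assumption, not a measured value).
    """
    total_bits = params_billion * 1e9 * (bits + overhead_bits)
    return total_bits / 8 / 2**30

# Approximate LLaMA parameter counts, in billions (assumed).
for name, params in [("7B", 6.7), ("13B", 13.0), ("30B", 32.5)]:
    for bits in (3, 4):
        print(f"{name} @ {bits}-bit: ~{quantized_weight_gib(params, bits):.1f} GiB")
```

The point is simply that, at a fixed memory budget, the 3-bit quantization of the next model size up tends to fit where the 4-bit quantization of the smaller one would.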

@MarkSchmidty

https://github.com/IST-DASLab/gptq is the repository mentioned in the OP, for anyone who comes across this thread.

@sterlind

@dalistarh this is just a gardening thing, but I submitted a PR to this repo to make it pip-installable. I briefly browsed your repo and think it should more or less just work for yours as well, if you want to borrow it (or maybe @qwopqwop200 will upstream their repo to yours). Just FYI!
