Inference starcoder (4-bit/8-bit) with GPU #417
Comments
It's not working for me either.

EDIT: llama.cpp works just fine for me, though.
The CUDA backend requires some changes to the code to do full offloading; otherwise it is only used for multiplication of large matrices (generally, that only happens when evaluating large prompts). It will be easier to use once we implement a common interface for all the backends, but that is going to take a while. For an example of how to use it, you can look at the llama.cpp source code. In the future, llama.cpp will also be extended to support other LLMs.
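For reference, a minimal sketch of the offloading pattern llama.cpp used around this time, assuming ggml was built with cuBLAS support (`GGML_USE_CUBLAS` defined). The helper name `offload_tensor` is hypothetical, and the exact signature of `ggml_cuda_transform_tensor` (and the backend enum name) has changed between ggml versions, so check your `ggml-cuda.h` before copying this:

```cpp
// Sketch only: mirrors the per-tensor offloading pattern from llama.cpp
// (mid-2023). Requires a cuBLAS build; not part of the starcoder example.
#include "ggml.h"
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif

// Hypothetical helper: mark a weight tensor as GPU-resident so the CUDA
// backend runs the mat muls that consume it, rather than only the large
// prompt-time multiplications that go through cuBLAS automatically.
static void offload_tensor(struct ggml_tensor * t) {
#ifdef GGML_USE_CUBLAS
    // Older ggml versions used GGML_BACKEND_CUDA instead of GGML_BACKEND_GPU.
    t->backend = GGML_BACKEND_GPU;
    // Copy/convert the weight data to VRAM; the two-argument signature is
    // from mid-2023 llama.cpp and may differ in your checkout.
    ggml_cuda_transform_tensor(t->data, t);
#else
    (void) t; // CPU-only build: nothing to do
#endif
}
```

Without changes like this applied to every weight tensor in the model's eval code, a cuBLAS build only accelerates the large matrix multiplications during prompt evaluation.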
Thank you for the explanation.
First of all, thank you for your work! I used ggml to quantize the starcoder model to 8-bit (and 4-bit), but I ran into difficulties when using the GPU for inference. If you could provide me with an example, I would be very grateful.
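For what it's worth, the quantize-and-run part of the workflow in the ggml repo looks roughly like the sketch below. The binary names, model paths, and the quantization type code are assumptions here; check the starcoder example's README and `quantize.cpp` for the exact usage:

```sh
# Build ggml with cuBLAS so large matrix multiplications (e.g. prompt
# evaluation) are dispatched to the GPU; everything else stays on the CPU.
cmake -B build -DGGML_CUBLAS=ON
cmake --build build --config Release

# Quantize the f16 starcoder model. The trailing integer is a ggml_ftype
# code (2 = q4_0, 3 = q4_1); verify against quantize.cpp in your checkout.
./build/bin/starcoder-quantize models/ggml-model-f16.bin models/ggml-model-q4_0.bin 2

# Run inference; with a cuBLAS build, only the big prompt-time mat muls
# use the GPU unless the eval code is changed for full offloading.
./build/bin/starcoder -m models/ggml-model-q4_0.bin -p "def fib(n):"
```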