How to decrease inference time? #245
Comments
Check out llama.cpp: ggerganov/llama.cpp#771
Isn't that repo CPU-only? @rahulvigneswaran asked about GPU, and I am curious too. I'm only getting about 3 tokens per second on my 4090 in 8-bit mode.
Also, how can I run in 8-bit mode? Is that how the model runs when following the general instructions in the README?
If you do not have additional compute, I'd say you might want to use the compressed/quantized Vicuna (we provided an official 8-bit version yesterday). It will give you slightly higher throughput or lower latency, but at somewhat compromised quality. We're working on pushing some of our system technologies out to optimize inference and throughput speed, but it will take a while to land.
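For reference, here is a minimal sketch of loading a checkpoint in 8-bit via bitsandbytes through Hugging Face Transformers. The model path is a placeholder, and depending on your transformers version you may need to pass a `BitsAndBytesConfig` instead of `load_in_8bit=True`; FastChat's CLI also has an 8-bit option (`--load-8bit`), if I remember correctly.

```python
# Sketch: load a causal LM in 8-bit with bitsandbytes (placeholder model path).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/vicuna-7b"  # placeholder; point to your local or hub checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # requires bitsandbytes; newer transformers prefer BitsAndBytesConfig
    device_map="auto",   # place layers on available GPUs automatically
)

prompt = "Describe a sunset over the ocean."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```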
It takes around 40 minutes to generate 1,000 sentences describing something on a single V100 (32 GB). How can I reduce this and speed up generation?
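When generating many independent sentences, the biggest single win is usually batching prompts through `generate()` rather than looping one prompt at a time. Below is a minimal sketch assuming a Hugging Face Transformers checkpoint in fp16 on a single GPU; the model path, prompts, batch size, and sampling settings are placeholders to tune against V100 (32 GB) memory.

```python
# Sketch: batched generation to increase throughput for many independent prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/vicuna-7b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.padding_side = "left"                # left-pad so new tokens append correctly
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as padding if none is defined

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompts = [f"Describe object number {i}." for i in range(1000)]  # stand-in prompts
batch_size = 16  # placeholder; raise until you hit the memory limit
outputs = []
for start in range(0, len(prompts), batch_size):
    batch = prompts[start:start + batch_size]
    enc = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
    with torch.inference_mode():
        out = model.generate(**enc, max_new_tokens=64, do_sample=True, top_p=0.9)
    outputs.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
```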