
How to decrease inference time? #245

Closed
rahulvigneswaran opened this issue Apr 6, 2023 · 5 comments
Comments

@rahulvigneswaran

It takes around 40 mins to generate 1000 sentences describing something on a single V100 (32GB). How can I decrease this and increase the speed?
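For scale, the reported numbers work out to roughly the following throughput (the per-sentence token count is an assumption for illustration; the issue doesn't state average sentence length):

```python
# Reported: 1000 sentences in ~40 minutes on a single V100 (32 GB).
total_seconds = 40 * 60
sentences = 1000
sec_per_sentence = total_seconds / sentences  # 2.4 s per sentence

# Assuming ~40 generated tokens per sentence (a guess, not from the issue):
tokens_per_sentence = 40
tokens_per_sec = tokens_per_sentence / sec_per_sentence
print(f"{sec_per_sentence:.1f} s/sentence, ~{tokens_per_sec:.1f} tokens/s")
```

That is well below what a V100 can sustain with batching, which is why batched generation and quantization come up below.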


oldsj commented Apr 6, 2023

Check out llama.cpp: ggerganov/llama.cpp#771


Shiro836 commented Apr 6, 2023

> Check out llama.cpp: ggerganov/llama.cpp#771

Isn't that repo CPU-only? @rahulvigneswaran asked about GPU. I'm curious too: it's only 3 tokens per second on my 4090 in 8-bit mode.

@rahulvigneswaran (Author)

@oldsj Yeah, like @Shiro836 said, I wanna run on GPUs.

@rahulvigneswaran (Author)

> > Check out llama.cpp: ggerganov/llama.cpp#771
>
> Isn't that repo CPU-only? @rahulvigneswaran asked about GPU. I'm curious too: it's only 3 tokens per second on my 4090 in 8-bit mode.

Also, how can I run in 8-bit mode? Is that how the model runs when following the general instructions in the README?

@zhisbug (Collaborator) commented Apr 7, 2023

If you do not have additional compute, I'd say you might want to use the compressed/quantized Vicuna (we provided an official 8-bit version yesterday). It will give you slightly higher throughput or lower latency, but at compromised quality.

We're working on releasing some of our systems technologies to optimize inference latency and throughput, but it will take a while to land.
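For readers wondering what "8-bit" means here: the core idea is to store weights as int8 integers plus a per-row scale factor, dequantizing on the fly. Below is a toy sketch of row-wise absmax int8 quantization in pure Python — my illustration of the general technique, not the project's actual quantization code:

```python
def quantize_row(weights):
    """Row-wise absmax int8 quantization: scale the row so its largest
    magnitude maps to 127, then round every weight to an integer."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid 0 for all-zero rows
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

row = [0.5, -1.27, 0.003, 0.9]
q, s = quantize_row(row)
approx = dequantize_row(q, s)
# q holds small integers in [-128, 127]; approx is close to row,
# with error bounded by half a quantization step (scale / 2).
```

Halving the bytes per weight roughly halves memory traffic, which is why quantization can raise throughput on memory-bound generation, at some cost in quality as noted above.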

zhisbug closed this as completed Apr 12, 2023