
evaluation extremely slow with llama_cpp/gguf #1472

Open
mobicham opened this issue Feb 26, 2024 · 3 comments
Labels: bug (Something isn't working.)

@mobicham

Evaluation of GGUF models via the llama_cpp server is extremely slow. All the layers are offloaded to the GPU, so it should run at normal speed, yet truthfulqa takes about 10 hours when it should take roughly 40 minutes or less.
lm_eval --model gguf --model_args base_url=http://localhost:8000 --batch_size 16 --tasks truthfulqa_mc2 --num_fewshot 0
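
For reference, one common way to serve a GGUF model for this setup is llama-cpp-python's OpenAI-compatible server; the model path below is a placeholder, and `--n_gpu_layers -1` offloads all layers to the GPU:

```sh
# Hypothetical launch command; replace ./model.gguf with your model file.
python -m llama_cpp.server --model ./model.gguf --n_gpu_layers -1 --host localhost --port 8000
```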

@haileyschoelkopf added the bug label on Feb 26, 2024
@Andy1314Chen

> Note that for externally hosted models, configs such as `--device` and `--batch_size` should not be used and do not function.

I have the same problem. What can I do to make it support batch processing?
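
As far as I can tell, the harness's gguf adapter issues requests to the server one at a time over HTTP, so a client-side `--batch_size` cannot create server-side batching; each scored continuation is a separate call to the server's OpenAI-compatible completions route. A sketch of the kind of request involved (prompt and parameters are illustrative, not taken from the harness):

```sh
# Illustrative single request; assumes llama-cpp-python's OpenAI-compatible
# server is listening on localhost:8000.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Q: Is the sky blue? A:", "max_tokens": 1, "temperature": 0, "logprobs": 10}'
```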

@linotfan

It seems to be an issue with llama.cpp.

@mobicham
Author

mobicham commented Feb 27, 2024

> It seems to be an issue with llama.cpp.

So basically they say it's a problem with quantized models running with large prompts.
That sounds strange, because the impact of the dequantization step should actually matter much less with longer prompts and larger batch sizes.
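
A back-of-the-envelope version of that point (constants illustrative, not measured): if the whole prompt is processed in one forward pass, dequantizing N quantized weights costs roughly c·N operations once per pass, while the matmuls cost about 2·N·B·T for batch size B and T tokens per sequence. The relative overhead is then about c / (2·B·T), which shrinks as prompts get longer and batches get larger, so long prompts should make the dequantization step matter less, not more.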
