Evaluation extremely slow with llama_cpp/gguf #1472
Comments
I have the same problem. What can I do to make it support batch processing?
It seems to be an issue with llama.cpp.
So basically they say it's a problem with quantized models running with large prompts.
Evaluation of GGUF models via the llama_cpp server is extremely slow. All layers are offloaded to the GPU, so it should run at full speed, yet truthfulqa takes about 10 hours when it should normally take ~40 minutes or less.
lm_eval --model gguf --model_args base_url=http://localhost:8000 --batch_size 16 --tasks truthfulqa_mc2 --num_fewshot 0
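For context, the lm_eval command above assumes a llama-cpp-python server is already listening on port 8000. A minimal launch sketch is below; the model path is a placeholder and the flag values are assumptions, not the reporter's actual configuration:

```shell
# Sketch: start the llama-cpp-python OpenAI-compatible server.
# ./model.gguf is a placeholder path; --n_gpu_layers -1 requests
# offloading all layers to the GPU, and --port 8000 matches the
# base_url used in the lm_eval command.
python -m llama_cpp.server \
  --model ./model.gguf \
  --n_gpu_layers -1 \
  --port 8000
```

If the server log does not show layers being offloaded at startup, evaluation falls back to CPU and can be orders of magnitude slower, which would match the symptoms described here.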