
evaluation extremely slow with llama_cpp/gguf #1472

Open
mobicham opened this issue Feb 26, 2024 · 3 comments
Labels: bug (Something isn't working.)

@mobicham

Evaluation of GGUF models via the llama_cpp server is extremely slow. All the layers are offloaded to the GPU, so it should run at normal speed, yet truthfulqa takes about 10 hours when it should take roughly 40 minutes or less.
lm_eval --model gguf --model_args base_url=http://localhost:8000 --batch_size 16 --tasks truthfulqa_mc2 --num_fewshot 0
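
For reference, one common way to serve a GGUF model for this setup is llama-cpp-python's OpenAI-compatible server; the model path below is a placeholder, and `--n_gpu_layers -1` offloads all layers to the GPU:

```sh
# Hypothetical launch command; replace ./model.gguf with your model file.
python -m llama_cpp.server --model ./model.gguf --n_gpu_layers -1 --host localhost --port 8000
```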

@haileyschoelkopf added the bug label on Feb 26, 2024
@Andy1314Chen

> Note that for externally hosted models, configs such as `--device` and `--batch_size` should not be used and do not function.

I have the same problem. What can I do to make it support batch processing?
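
As far as I can tell, the harness's gguf adapter issues requests to the server one at a time over HTTP, so a client-side `--batch_size` cannot create server-side batching; each scored continuation is a separate call to the server's OpenAI-compatible completions route. A sketch of the kind of request involved (prompt and parameters are illustrative, not taken from the harness):

```sh
# Illustrative single request; assumes llama-cpp-python's OpenAI-compatible
# server is listening on localhost:8000.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Q: Is the sky blue? A:", "max_tokens": 1, "temperature": 0, "logprobs": 10}'
```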

@linotfan

It seems to be an issue with llama.cpp.

@mobicham
Author

mobicham commented Feb 27, 2024

> It seems to be an issue with llama.cpp.

So basically they say it's a problem with quantized models running with large prompts.
That sounds strange, because the impact of the dequantization step should actually matter much less with longer prompts and larger batch sizes.
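
A back-of-the-envelope version of that point (constants illustrative, not measured): if the whole prompt is processed in one forward pass, dequantizing N quantized weights costs roughly c·N operations once per pass, while the matmuls cost about 2·N·B·T for batch size B and T tokens per sequence. The relative overhead is then about c / (2·B·T), which shrinks as prompts get longer and batches get larger, so long prompts should make the dequantization step matter less, not more.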
