Speed up inference problems #1625
I am trying to speed up benchmarking on A100. Below are times of tests on one task, in two versions, using Mistral.
Unfortunately, using `torch.compile` and flash_attention slows down inference. Also, vllm is very slow for the loglikelihood task. Another issue is that scores with batch size 1 and 4 differ - tested with and without `logits_cache` and `torch.use_deterministic_algorithms(True)`. Is it possible to obtain the same results? Maybe there is some problem with padding?

Comments
Hi!
It's very surprising to me how slow vllm is. What batch size is vllm using?
@djstrong Previous issue about batch size affecting predictions: #704 (comment). It is still a good idea to check whether proper padding is applied. Those improvements will help others to dive deeply:
@haileyschoelkopf Flash attention without compile causes an error on my setup:
vllm was running with bs=4. Loglikelihood tasks with vllm are 4 times slower than hf (tested with bs=1 and bs=4).
You can replicate the vllm loglikelihood slowness and the different scores with e.g. `hf bs=1 00:52 0.3878`
Thanks! I'd recommend trying
Those sorts of differences in logprobs are expected when batch size changes; the issue @LSinev linked is a good one to reference. Unfortunately this can't very easily be "fixed", but the differences in logprobs from it should be very tiny, as you're seeing, and should likely not cause deviations that exceed stderr.
Like @haileyschoelkopf said, I think for a fair comparison you should use bs auto to take advantage of vLLM's continuous batching. I don't know if it slows down when logprobs are returned, but most of the tweaks in vllm are kv-cache related, so it makes sense that it doesn't do so well on non-generation tasks. They also have experimental support for prefix caching across batches. Mistral was particularly sensitive to batch differences: see #1425. Not sure what the reason was; Llama, by comparison, not so much.
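As a rough sketch of what such a run could look like through the harness's Python entry point (the model name, tasks, and memory fraction below are placeholders, and it's assumed the installed version of `lm_eval.simple_evaluate` and the vllm backend accept these arguments):

```python
# Sketch, not a verified invocation: evaluate with the vLLM backend and
# automatic batch sizing, so vLLM's continuous batching can do its work.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=mistralai/Mistral-7B-v0.1,"  # placeholder checkpoint
        "dtype=auto,"
        "gpu_memory_utilization=0.8"
    ),
    tasks=["hellaswag"],      # placeholder task
    batch_size="auto",        # per the comment above
)
print(results["results"])
```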
Thank you! bs auto usually doesn't work, and that is the case here too: OOM.
Oh, that's probably because of the large sequence length for mistral-7B (iirc it defaults to ~32000, and vllm preallocates memory according to that). You can set it lower, to 2048 or 4096, with
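The exact argument name was cut off above; in vLLM's own Python API the corresponding knob is `max_model_len` (it's an assumption that this is what was meant, and whether the harness forwards it under the same name is not confirmed here):

```python
# Sketch: cap vLLM's preallocated KV cache by lowering the max sequence length.
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # placeholder model
    max_model_len=4096,  # instead of the ~32k default, to avoid OOM
)
```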
Thanks!
Awesome! Added both these points to the readme in #1633, as this will probably confuse others too.
Using bs auto with vllm causes some extra time for "Processed prompts" - I don't know what it is, but in the end it is slower than bs=1. Remaining issues are:
About different scores with different batch sizes: I have run evaluation with max_len=1, 2 examples, and bs 1 vs. 2.
Logs: https://www.diffchecker.com/CpV3RaDU/ (scores are the same with
I have found the exact place:
and replicated it with minimal code (I get the same numbers in this line). A model loaded on CPU with bfloat16 gives the same numbers:
A model on GPU with bfloat16 gives different results:
A model on GPU with float16 gives the same results:
So the problem occurs only on GPU with bfloat16. Using:
does not help.
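The code snippets referenced in this comment were lost in extraction; a minimal sketch along the lines described (model, prompt, and comparison are assumptions, not the commenter's exact code) might look like this:

```python
# Sketch of the reproduction described above. Expectation per the comment:
# the printed max diff is 0 on CPU or with float16 on GPU, but nonzero on
# GPU with bfloat16, because batch size 1 vs. 2 dispatches different kernels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # assumption: the Mistral checkpoint used
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda").eval()

one = tok(["The quick brown fox"], return_tensors="pt").to("cuda")
two = tok(["The quick brown fox"] * 2, return_tensors="pt").to("cuda")

with torch.no_grad():
    logits_bs1 = model(**one).logits[0]  # same prompt, batch of 1
    logits_bs2 = model(**two).logits[0]  # same prompt, batch of 2

print((logits_bs1 - logits_bs2).abs().max().item())
```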
So the problem is actually with the model implementation somewhere in huggingface transformers, or with some specific modules in torch, and may be sorted out in their repos if reported, not within the lm-evaluation-harness repository?
Why do you think it is a problem with the model implementation? But yes, it is not related to the lm-evaluation-harness repository.
Nice bug hunting! I think this is related to the lower precision of bfloat16 compared to float16. Doing adds in a different order somewhere (because a different batch size launches a different kernel) causes errors due to the non-associativity of floating-point math.
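A tiny self-contained illustration of that non-associativity (the values are arbitrary; this is not the kernel in question, just the rounding effect):

```python
# Summing the same bfloat16 values in two different orders usually gives two
# different results, because each add rounds to bfloat16's 7-bit mantissa.
import torch

torch.manual_seed(0)
x = torch.randn(1000).to(torch.bfloat16)

fwd = torch.tensor(0.0, dtype=torch.bfloat16)
for v in x:          # accumulate left to right
    fwd = fwd + v

bwd = torch.tensor(0.0, dtype=torch.bfloat16)
for v in x.flip(0):  # accumulate right to left
    bwd = bwd + v

print(fwd.item(), bwd.item())  # typically not equal
```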
If this is the case, it should be seen with the same test procedure on several models, not just Mistral. Would be great if someone could confirm that. UPD.: may be helpful (with sublinks too): huggingface/transformers#28732
The same issue with
Maybe it is resolved in the new cuBLAS: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-12-3-update-1 ?
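Relatedly, PyTorch's documented route to reproducible cuBLAS behavior is the `CUBLAS_WORKSPACE_CONFIG` environment variable together with deterministic algorithms. Whether it helps with the batch-size-dependent kernel selection seen here is an open question, so treat this as something to try, not a fix:

```python
# Must be set before CUDA is initialized (i.e., before the first CUDA call).
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
torch.use_deterministic_algorithms(True)  # raises on known-nondeterministic ops
```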