
Speed up inference problems #1625

Open
djstrong opened this issue Mar 23, 2024 · 19 comments · Fixed by #1633

Comments

@djstrong
Contributor

I am trying to speed up benchmarking on an A100. Below are the times of tests on one task, in two versions, using Mistral.

[image: timing comparison table]

Unfortunately, using torch.compile and flash_attention slows down inference. Also, vllm is very slow for loglikelihood tasks.

Another issue is that the scores with batch size 1 and 4 differ (tested with and without logits_cache and with torch.use_deterministic_algorithms(True)). Is it possible to obtain the same results? Maybe there is some problem with padding?

@haileyschoelkopf
Contributor

Hi!

  • How much do the results differ, by chance? And in particular, do you have the actual logprobs saved, and do those differ based on batch size?
  • Does using flash attention, but not torch.compile, help at all?

It's very surprising to me how slow vllm is. What batch size is vllm using? auto would be ideal. I faintly recall some issue mentioning that vllm gets slowed down by having to return logprobs. @baberabb, do you happen to know whether this is the case?

@LSinev
Contributor

LSinev commented Mar 23, 2024

@djstrong Previous issue about batch size affecting predictions: #704 (comment).

It is still a good idea to check if proper padding and position_ids are applied to models in batch mode — different HF transformers model classes have different approaches to undefined batch inputs in model.generate and model.forward, starting from GPT2LMHeadModel at least.
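
For example, something along these lines (a minimal sketch assuming left padding and a standard HF causal LM, not the harness code itself) is what "proper padding and position_ids" means in practice:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

batch = tok(["short prompt", "a somewhat longer prompt"], return_tensors="pt", padding=True)

# Derive position_ids from the attention mask so that left-padding tokens
# do not shift the positions of the real tokens.
position_ids = batch["attention_mask"].cumsum(-1) - 1
position_ids.clamp_(min=0)

out = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    position_ids=position_ids,
)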

Answers to the following would help others dive in:

  • Can you please list the speed issues and the package versions, as well as the state of this repository that was used?
  • Why is the time bolded in some lines but not others?
  • Why are two generate_until times missing?
  • Does "normal" refer to the HF transformers (hub) implementation?
  • Which version of torch was used? (Several speed-degradation issues in the torch repo are still open.)
  • How many tokens were generated in generate_until mode?

@djstrong
Contributor Author

@haileyschoelkopf
The score for bs=1 is 0.7033 and for bs=4 it is 0.7111 (with stderr 0.01).
Logprobs are different for bs=1 and bs=4:
[image: logprobs for bs=1 vs bs=4]

Flash attention without compile causes an error on my setup:

RuntimeError: Failed to import transformers.models.mistral.modeling_mistral because of the following error (look up to see its traceback):
venv/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

vllm was running with bs=4. Loglikelihood tasks with vllm are 4 times slower than with hf (tested with bs=1 and bs=4).

@djstrong
Contributor Author

@LSinev

  1. I expected that vllm would also be faster for loglikelihood tasks. transformers 4.39.1, vllm 0.3.2; the state of this repo is from yesterday, cffc1bd.
  2. Nothing special. I bolded the best times, so hf is faster for loglikelihood, but vllm for generate_until.
  3. The times for generate_until are missing because the batch size was too big.
  4. Normal means hf.
  5. Torch 2.1.2
  6. max_gen_toks: 50

@djstrong
Contributor Author

You can replicate the vllm loglikelihood slowness and the different scores with, e.g., lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 1 --log_samples

hf bs=1 00:52 0.3878
hf bs=4 00:23 0.39
vllm bs=1 02:49 0.39
vllm bs=4 01:38 0.3878

@haileyschoelkopf
Contributor

Thanks!

I'd recommend trying auto batch size for vllm and seeing if that helps the speed.

Those sorts of differences in logprobs are expected when the batch size changes; the issue @LSinev linked is a good one to reference. Unfortunately, this can't easily be "fixed", but the differences in logprobs should be very tiny, as you're seeing, and should not cause deviations that exceed the stderr.

@baberabb
Contributor

baberabb commented Mar 23, 2024

Like @haileyschoelkopf said, I think for a fair comparison you should use bs auto to take advantage of vLLM's continuous batching. I don't know if it slows down when logprobs are returned, but most of the tweaks in vLLM are kv-cache related, so it makes sense that it doesn't do so well on non-generation tasks. They also have experimental support for prefix caching across batches (pass enable_prefix_caching=True to model_args; we might have to add it to the model init to format the boolean correctly), which might speed things up (especially for fewshot prompts).
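
For reference, a standalone sketch outside the harness (assuming extra model_args are forwarded to vLLM's LLM constructor, which is where enable_prefix_caching would end up):

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    enable_prefix_caching=True,  # experimental: caches shared prompt prefixes across requests
)
outputs = llm.generate(
    ["Question: 2+2=? Answer:", "Question: 3+3=? Answer:"],
    SamplingParams(temperature=0.0, max_tokens=50),
)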

Mistral was particularly sensitive to batch differences: see #1425. Not sure what the reason was; Llama, by comparison, was not so much.

@djstrong
Contributor Author

Thank you!

bs auto usually doesn't work, and that is the case here as well: OOM
vllm bs=auto OOM
vllm bs=32 OOM
vllm bs=16 01:31 0.3856

@baberabb
Contributor

Thank you!

bs auto usually doesn't work, and that is the case here as well: OOM vllm bs=auto OOM vllm bs=32 OOM vllm bs=16 01:31 0.3856

Oh, that's probably because of the large sequence length for Mistral-7B (IIRC it defaults to ~32000, and vLLM preallocates memory according to that). You can set it lower, to 2048 or 4096, with max_model_len. Lowering gpu_memory_utilization from its default of 0.9 also sometimes helps with OOMs (but the former should be enough).
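
Something like this is the idea (a sketch using the vllm package directly; in the harness these kwargs would go through model_args): cap the context length and the memory fraction so the KV-cache preallocation fits.

from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    max_model_len=4096,          # instead of the ~32k default context length
    gpu_memory_utilization=0.8,  # default is 0.9; lower it if you still hit OOM
)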

@djstrong
Contributor Author

Thanks!
vllm bs=auto max_model_len=4096 01:33 (+01:30 for Processed prompts?) 0.3856

@haileyschoelkopf
Contributor

Awesome! I added both of these points to the README in #1633, as this will probably confuse others too.

@djstrong
Contributor Author

Using bs auto with vllm adds some extra time for "Processed prompts"; I don't know what that is, but in the end it is slower than bs=1.

Remaining issues are:

  • vllm related: for any batch size, the loglikelihood computation is slower than hf (even with bs=1)
  • different scores with different batch sizes

@djstrong
Contributor Author

djstrong commented Mar 26, 2024

Regarding the different scores with different batch sizes: I have run an evaluation with max_length=1, 2 examples, and bs 1 vs. 2.

lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_length=1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 1 --log_samples --limit 2
lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_length=1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 2 --log_samples --limit 2

Logs: https://www.diffchecker.com/CpV3RaDU/ (scores are the same with logits_cache=False)

@djstrong
Contributor Author

djstrong commented Mar 27, 2024

I have found the exact place:

self._model_call(batched_inps, **call_kwargs), dim=-1

and replicated it with minimal code (I get the same numbers as in this line).

Model loaded on CPU with bfloat16 gives the same numbers:

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
>>> model(torch.tensor([[28994]])).logits
tensor([[[-11.5000, -11.2500,   3.2656,  ...,  -5.7500,  -2.7500,  -1.4375]]],
       grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]])).logits
tensor([[[-11.5000, -11.2500,   3.2656,  ...,  -5.7500,  -2.7500,  -1.4375]],

        [[-11.5000, -11.2500,   3.2656,  ...,  -5.7500,  -2.7500,  -1.4375]]],
       grad_fn=<ToCopyBackward0>)

Model on GPU with bfloat16 gives different results:

>>> model.to('cuda')
>>> model(torch.tensor([[28994]]).to('cuda')).logits
tensor([[[-11.5625, -11.3125,   3.3281,  ...,  -5.7500,  -2.6562,  -1.3906]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]]).to('cuda')).logits
tensor([[[-11.6250, -11.3125,   3.1406,  ...,  -5.7812,  -2.6562,  -1.3516]],

        [[-11.6250, -11.3125,   3.1406,  ...,  -5.7812,  -2.6562,  -1.3516]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

The model on GPU with float16 gives the same results:

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map='cuda')
>>> model(torch.tensor([[28994]]).to('cuda')).logits
tensor([[[-11.6406, -11.3828,   3.2852,  ...,  -5.7617,  -2.6348,  -1.3047]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]]).to('cuda')).logits
tensor([[[-11.6406, -11.3828,   3.2852,  ...,  -5.7578,  -2.6211,  -1.2930]],

        [[-11.6406, -11.3828,   3.2852,  ...,  -5.7578,  -2.6211,  -1.2930]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

So the problem is only on GPU with bfloat16. Using:

>>> torch.use_deterministic_algorithms(True)
>>> torch.backends.cudnn.benchmark = False

does not help.
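
(For completeness: torch's deterministic mode on CUDA also expects the CUBLAS_WORKSPACE_CONFIG environment variable to be set before the first CUDA call. A sketch of the full set of switches, though as far as I can tell this targets run-to-run reproducibility rather than the batch-size-dependent kernel selection seen here:)

import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # must be set before CUDA is initialized

import torch
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True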

@LSinev
Contributor

LSinev commented Mar 27, 2024

So the problem is only on GPU with bfloat16

So the problem is actually with the model implementation somewhere in huggingface transformers, or with some specific modules in torch, and might be sorted out in their repos if an issue is filed, rather than within the lm-evaluation-harness repository?

@djstrong
Contributor Author

Why do you think it is a problem with the model implementation? But yes, it is not related to the lm-evaluation-harness repository.
Maybe it is some GPU optimization (cuBLAS?).

@haileyschoelkopf
Contributor

Nice bug hunting!

I think this is related to the lower precision of bfloat16 compared to float16: doing the additions in a different order somewhere (because a different batch size launches a different kernel) causes discrepancies due to the non-associativity of floating-point math.
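
A toy illustration of that non-associativity in bfloat16 (just the rounding effect, not the harness code path):

import torch

x = torch.tensor(1.0, dtype=torch.bfloat16)
y = torch.tensor(0.003, dtype=torch.bfloat16)

# bfloat16 keeps only ~8 bits of mantissa, so the grouping of the additions matters:
print((x + y) + y)  # the small term is rounded away one add at a time -> tensor(1., dtype=torch.bfloat16)
print(x + (y + y))  # summing the small terms first keeps their contribution -> tensor(1.0078, dtype=torch.bfloat16)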

@LSinev
Contributor

LSinev commented Mar 27, 2024

lower precision of bfloat16 compared to float16: doing the additions in a different order somewhere (because a different batch size launches a different kernel) causes discrepancies due to the non-associativity of floating-point math.

If this is the case, it should show up with the same test procedure across several models, not just Mistral. It would be great if someone could confirm that.

Update: this may be helpful (including its sublinks): huggingface/transformers#28732

@djstrong
Contributor Author

The same issue occurs with meta-llama/Llama-2-7b-chat-hf.

Maybe it is resolved in a newer cuBLAS: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-12-3-update-1 ?
I am using CUDA 12.1, cuBLAS 12.1.3.1.
