
Speed up inference problems #1625

Open
djstrong opened this issue Mar 23, 2024 · 19 comments · Fixed by #1633

Comments

@djstrong
Contributor

I am trying to speed up benchmarking on an A100. Below are the times of tests on one task, in two versions, using Mistral.

[image: timing comparison table]

Unfortunately, using torch.compile and flash_attention slows down inference. Also, vllm is very slow for loglikelihood tasks.

Another issue is that the scores with batch size 1 and 4 differ (tested with and without logits_cache and with torch.use_deterministic_algorithms(True)). Is it possible to obtain the same results? Maybe there is some problem with padding?

@haileyschoelkopf
Contributor

Hi!

  • How much do the results differ, by chance? And in particular, do you have the actual logprobs saved, and do those differ based on batch size?
  • Does using flash attention, but not torch.compile, help at all?

It's very surprising to me how slow vllm is. What batch size is vllm using? auto would be ideal. I faintly recall some issue mentioning that vllm gets slowed down by having to return logprobs. @baberabb, do you happen to know whether this is the case?

@LSinev
Contributor

LSinev commented Mar 23, 2024

@djstrong Previous issue about batch size affecting predictions: #704 (comment).

It is still a good idea to check if proper padding and position_ids are applied to models in batch mode — different HF transformers model classes have different approaches to undefined batch inputs in model.generate and model.forward, starting from GPT2LMHeadModel at least.
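
For example, something along these lines (a minimal sketch assuming left padding and a standard HF causal LM, not the harness code itself) is what "proper padding and position_ids" means in practice:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

batch = tok(["short prompt", "a somewhat longer prompt"], return_tensors="pt", padding=True)

# Derive position_ids from the attention mask so that left-padding tokens
# do not shift the positions of the real tokens.
position_ids = batch["attention_mask"].cumsum(-1) - 1
position_ids.clamp_(min=0)

out = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    position_ids=position_ids,
)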

Answers to the following would help others dive in:

  • Can you please list the speed issues and the package versions, as well as the state of this repository that was used?
  • Why is the time bolded in some lines but not others?
  • Why are two generate_until times missing?
  • Does "normal" refer to the HF transformers (hub) implementation?
  • Which version of torch was used? (Several speed-degradation issues in the torch repo are still open.)
  • How many tokens were generated in generate_until mode?

@djstrong
Contributor Author

@haileyschoelkopf
The score for bs=1 is 0.7033 and for bs=4 it is 0.7111 (with stderr 0.01).
Logprobs are different for bs=1 and bs=4:
[image: logprobs for bs=1 vs bs=4]

Flash attention without compile causes an error on my setup:

RuntimeError: Failed to import transformers.models.mistral.modeling_mistral because of the following error (look up to see its traceback):
venv/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops15sum_IntList_out4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbSt8optionalINS5_10ScalarTypeEERS2_

vllm was running with bs=4. Loglikelihood tasks with vllm are 4 times slower than with hf (tested with bs=1 and bs=4).

@djstrong
Contributor Author

@LSinev

  1. I expected that vllm would also be faster for loglikelihood tasks. transformers 4.39.1, vllm 0.3.2; the state of this repo is from yesterday, cffc1bd.
  2. Nothing special. I bolded the best times, so hf is faster for loglikelihood, but vllm for generate_until.
  3. The times for generate_until are missing because the batch size was too big.
  4. Normal means hf.
  5. Torch 2.1.2
  6. max_gen_toks: 50

@djstrong
Contributor Author

You can replicate the vllm loglikelihood slowness and the different scores with, e.g., lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 1 --log_samples

hf bs=1 00:52 0.3878
hf bs=4 00:23 0.39
vllm bs=1 02:49 0.39
vllm bs=4 01:38 0.3878

@haileyschoelkopf
Contributor

Thanks!

I'd recommend trying auto batch size for vllm and seeing if that helps the speed.

Those sorts of differences in logprobs are expected when the batch size changes; the issue @LSinev linked is a good one to reference. Unfortunately, this can't easily be "fixed", but the differences in logprobs should be very tiny, as you're seeing, and should not cause deviations that exceed the stderr.

@baberabb
Contributor

baberabb commented Mar 23, 2024

Like @haileyschoelkopf said, I think for a fair comparison you should use bs auto to take advantage of vLLM's continuous batching. I don't know if it slows down when logprobs are returned, but most of the tweaks in vLLM are kv-cache related, so it makes sense that it doesn't do so well on non-generation tasks. They also have experimental support for prefix caching across batches (pass enable_prefix_caching=True to model_args; we might have to add it to the model init to format the boolean correctly), which might speed things up (especially for fewshot prompts).
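
For reference, a standalone sketch outside the harness (assuming extra model_args are forwarded to vLLM's LLM constructor, which is where enable_prefix_caching would end up):

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    enable_prefix_caching=True,  # experimental: caches shared prompt prefixes across requests
)
outputs = llm.generate(
    ["Question: 2+2=? Answer:", "Question: 3+3=? Answer:"],
    SamplingParams(temperature=0.0, max_tokens=50),
)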

Mistral was particularly sensitive to batch differences: see #1425. Not sure what the reason was; Llama, by comparison, was not so much.

@djstrong
Contributor Author

Thank you!

bs auto usually doesn't work, and that is the case here as well: OOM
vllm bs=auto OOM
vllm bs=32 OOM
vllm bs=16 01:31 0.3856

@baberabb
Contributor

Thank you!

bs auto usually doesn't work, and that is the case here as well: OOM vllm bs=auto OOM vllm bs=32 OOM vllm bs=16 01:31 0.3856

Oh, that's probably because of the large sequence length for Mistral-7B (IIRC it defaults to ~32000, and vLLM preallocates memory according to that). You can set it lower, to 2048 or 4096, with max_model_len. Lowering gpu_memory_utilization from its default of 0.9 also sometimes helps with OOMs (but the former should be enough).
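
Something like this is the idea (a sketch using the vllm package directly; in the harness these kwargs would go through model_args): cap the context length and the memory fraction so the KV-cache preallocation fits.

from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    max_model_len=4096,          # instead of the ~32k default context length
    gpu_memory_utilization=0.8,  # default is 0.9; lower it if you still hit OOM
)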

@djstrong
Contributor Author

Thanks!
vllm bs=auto max_model_len=4096 01:33 (+01:30 for Processed prompts?) 0.3856

@haileyschoelkopf
Contributor

Awesome! I added both of these points to the README in #1633, as this will probably confuse others too.

@djstrong
Contributor Author

Using bs auto with vllm adds some extra time for "Processed prompts"; I don't know what that is, but in the end it is slower than bs=1.

Remaining issues are:

  • vllm related: for any batch size, the loglikelihood computation is slower than hf (even with bs=1)
  • different scores with different batch sizes

@djstrong
Contributor Author

djstrong commented Mar 26, 2024

Regarding the different scores with different batch sizes: I have run an evaluation with max_length=1, 2 examples, and bs 1 vs. 2.

lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_length=1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 1 --log_samples --limit 2
lm_eval --model hf --model_args "pretrained=mistralai/Mistral-7B-v0.1,max_length=1" --output_path "date/"`date +%s` --tasks belebele_pol_Latn --num_fewshot 0 --device cuda:0 --batch_size 2 --log_samples --limit 2

Logs: https://www.diffchecker.com/CpV3RaDU/ (scores are the same with logits_cache=False)

@djstrong
Contributor Author

djstrong commented Mar 27, 2024

I have found the exact place:

self._model_call(batched_inps, **call_kwargs), dim=-1

and replicated it with minimal code (I get the same numbers as in this line).

Model loaded on CPU with bfloat16 gives the same numbers:

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
>>> model(torch.tensor([[28994]])).logits
tensor([[[-11.5000, -11.2500,   3.2656,  ...,  -5.7500,  -2.7500,  -1.4375]]],
       grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]])).logits
tensor([[[-11.5000, -11.2500,   3.2656,  ...,  -5.7500,  -2.7500,  -1.4375]],

        [[-11.5000, -11.2500,   3.2656,  ...,  -5.7500,  -2.7500,  -1.4375]]],
       grad_fn=<ToCopyBackward0>)

Model on GPU with bfloat16 gives different results:

>>> model.to('cuda')
>>> model(torch.tensor([[28994]]).to('cuda')).logits
tensor([[[-11.5625, -11.3125,   3.3281,  ...,  -5.7500,  -2.6562,  -1.3906]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]]).to('cuda')).logits
tensor([[[-11.6250, -11.3125,   3.1406,  ...,  -5.7812,  -2.6562,  -1.3516]],

        [[-11.6250, -11.3125,   3.1406,  ...,  -5.7812,  -2.6562,  -1.3516]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

The model on GPU with float16 gives the same results:

>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map='cuda')
>>> model(torch.tensor([[28994]]).to('cuda')).logits
tensor([[[-11.6406, -11.3828,   3.2852,  ...,  -5.7617,  -2.6348,  -1.3047]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)
>>> model(torch.tensor([[28994], [28994]]).to('cuda')).logits
tensor([[[-11.6406, -11.3828,   3.2852,  ...,  -5.7578,  -2.6211,  -1.2930]],

        [[-11.6406, -11.3828,   3.2852,  ...,  -5.7578,  -2.6211,  -1.2930]]],
       device='cuda:0', grad_fn=<ToCopyBackward0>)

So the problem is only on GPU with bfloat16. Using:

>>> torch.use_deterministic_algorithms(True)
>>> torch.backends.cudnn.benchmark = False

does not help.
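
(For completeness: torch's deterministic mode on CUDA also expects the CUBLAS_WORKSPACE_CONFIG environment variable to be set before the first CUDA call. A sketch of the full set of switches, though as far as I can tell this targets run-to-run reproducibility rather than the batch-size-dependent kernel selection seen here:)

import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # must be set before CUDA is initialized

import torch
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True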

@LSinev
Contributor

LSinev commented Mar 27, 2024

So the problem is only on GPU with bfloat16

So the problem is actually with the model implementation somewhere in huggingface transformers, or with some specific modules in torch, and might be sorted out in their repos if an issue is filed, rather than within the lm-evaluation-harness repository?

@djstrong
Contributor Author

Why do you think it is a problem with the model implementation? But yes, it is not related to the lm-evaluation-harness repository.
Maybe it is some GPU optimization (cuBLAS?).

@haileyschoelkopf
Contributor

Nice bug hunting!

I think this is related to the lower precision of bfloat16 compared to float16: doing the additions in a different order somewhere (because a different batch size launches a different kernel) causes discrepancies due to the non-associativity of floating-point math.
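
A toy illustration of that non-associativity in bfloat16 (just the rounding effect, not the harness code path):

import torch

x = torch.tensor(1.0, dtype=torch.bfloat16)
y = torch.tensor(0.003, dtype=torch.bfloat16)

# bfloat16 keeps only ~8 bits of mantissa, so the grouping of the additions matters:
print((x + y) + y)  # the small term is rounded away one add at a time -> tensor(1., dtype=torch.bfloat16)
print(x + (y + y))  # summing the small terms first keeps their contribution -> tensor(1.0078, dtype=torch.bfloat16)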

@LSinev
Contributor

LSinev commented Mar 27, 2024

lower precision of bfloat16 compared to float16: doing the additions in a different order somewhere (because a different batch size launches a different kernel) causes discrepancies due to the non-associativity of floating-point math.

If this is the case, it should show up with the same test procedure across several models, not just Mistral. It would be great if someone could confirm that.

Update: this may be helpful (including its sublinks): huggingface/transformers#28732

@djstrong
Contributor Author

The same issue occurs with meta-llama/Llama-2-7b-chat-hf.

Maybe it is resolved in a newer cuBLAS: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-release-12-3-update-1 ?
I am using CUDA 12.1, cuBLAS 12.1.3.1.
