
Process hangs when using tensor_parallel_size and data_parallel_size together #1734

Open · harshakokel opened this issue Apr 22, 2024 · 8 comments
Labels: bug (Something isn't working.)

@harshakokel

Hello,

I noticed that my process hangs at results = ray.get(object_refs) when I use data_parallel_size as well as tensor_parallel_size for vllm models.

For example, this call would hang.

lm_eval --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10

These would not.

lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=1,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10
lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=1 --tasks arc_easy --output ./trial/  --log_samples --limit 10
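For context, the data-parallel path roughly follows this shape (a simplified sketch, not the exact lm-eval-harness code): each Ray task builds its own vLLM engine with tensor_parallel_size set, and the driver blocks on the final ray.get, which is where the hang shows up.

import ray
from vllm import LLM, SamplingParams

ray.init()

@ray.remote  # GPU placement is glossed over in this sketch
def run_shard(prompts):
    # each data-parallel replica builds its own engine with TP=2
    llm = LLM(model="gpt2", tensor_parallel_size=2)
    return llm.generate(prompts, SamplingParams(max_tokens=32))

shards = [["prompt A"], ["prompt B"]]  # data_parallel_size=2
object_refs = [run_shard.remote(s) for s in shards]
results = ray.get(object_refs)  # <- hangs here when DP > 1 and TP > 1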

Does anyone else face a similar problem?

@harshakokel harshakokel changed the title Process Hangs when using tensor_parallel_size and data_parallel_size together Process hangs when using tensor_parallel_size and data_parallel_size together Apr 22, 2024
@haileyschoelkopf (Contributor)

Hi! What version of vLLM are you running with?

@baberabb has observed problems like this before with later versions of vllm (>v0.3.3, I believe).

@haileyschoelkopf haileyschoelkopf added the bug Something isn't working. label Apr 26, 2024
@harshakokel (Author)

I am on vllm 0.3.2.

@harshakokel (Author) commented Apr 26, 2024

Is this a vllm problem? Should I be raising an issue on that repo?

@baberabb (Contributor)

Hey. Have you tried caching the weights by running with DP=1 until they are downloaded? I found it prone to hang with DP otherwise.
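For example, something along these lines pre-populates the local HF cache before launching with DP (just an illustration; running the DP=1 command once does the same thing):

from huggingface_hub import snapshot_download
snapshot_download(repo_id="gpt2")  # pulls the weights into the local HF cache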

@harshakokel (Author)

Yes, the weights are cached. The process is hanging after llm.generate returns results.

@baberabb (Contributor)

> Yes, the weights are cached. The process is hanging after llm.generate returns results.

Hmm. It's working for me with 0.3.2. Have you tried running in a fresh virtual environment?

@harshakokel (Author) commented Apr 26, 2024

Just tried it on a separate server with a new env and I still face the same issue. What version of ray do you have? Mine is ray==2.10.0.

@baberabb (Contributor)

> Just tried it on a separate server with a new env and I still face the same issue. What version of ray do you have? Mine is ray==2.10.0.

Probably the latest one. I installed it with pip install -e ".[vllm]" on RunPod with 4 GPUs.
