
Process hangs when using tensor_parallel_size and data_parallel_size together #1734

Open · harshakokel opened this issue Apr 22, 2024 · 8 comments
Labels: bug (Something isn't working.)

@harshakokel

Hello,

I noticed that my process hangs at results = ray.get(object_refs) when I use data_parallel_size as well as tensor_parallel_size for vllm models.

For example, this call would hang.

lm_eval --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10

These would not.

lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=1,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10
lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=1 --tasks arc_easy --output ./trial/  --log_samples --limit 10
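For context, the data-parallel path roughly follows this shape (a simplified sketch, not the exact lm-eval-harness code): each Ray task builds its own vLLM engine with tensor_parallel_size set, and the driver blocks on the final ray.get, which is where the hang shows up.

import ray
from vllm import LLM, SamplingParams

ray.init()

@ray.remote  # GPU placement is glossed over in this sketch
def run_shard(prompts):
    # each data-parallel replica builds its own engine with TP=2
    llm = LLM(model="gpt2", tensor_parallel_size=2)
    return llm.generate(prompts, SamplingParams(max_tokens=32))

shards = [["prompt A"], ["prompt B"]]  # data_parallel_size=2
object_refs = [run_shard.remote(s) for s in shards]
results = ray.get(object_refs)  # <- hangs here when DP > 1 and TP > 1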

Does anyone else face a similar problem?

@harshakokel harshakokel changed the title Process Hangs when using tensor_parallel_size and data_parallel_size together Process hangs when using tensor_parallel_size and data_parallel_size together Apr 22, 2024
@haileyschoelkopf (Contributor)

Hi! What version of vLLM are you running with?

@baberabb has observed problems like this before with later versions of vllm (>v0.3.3, I believe).

@haileyschoelkopf haileyschoelkopf added the bug Something isn't working. label Apr 26, 2024
@harshakokel (Author)

I am on vllm 0.3.2.

@harshakokel (Author) commented Apr 26, 2024

Is this a vllm problem? Should I be raising an issue on that repo?

@baberabb (Contributor)

Hey. Have you tried caching the weights by running with DP=1 until they are downloaded? I found it prone to hang with DP otherwise.
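For example, something along these lines pre-populates the local HF cache before launching with DP (just an illustration; running the DP=1 command once does the same thing):

from huggingface_hub import snapshot_download
snapshot_download(repo_id="gpt2")  # pulls the weights into the local HF cache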

@harshakokel (Author)

Yes, the weights are cached. The process is hanging after llm.generate returns results.

@baberabb (Contributor)

> Yes, the weights are cached. The process is hanging after llm.generate returns results.

Hmm. It's working for me with 0.3.2. Have you tried running in a fresh virtual environment?

@harshakokel (Author) commented Apr 26, 2024

Just tried it on a separate server with a new env and I still face the same issue. What version of ray do you have? Mine is ray==2.10.0.

@baberabb (Contributor)

> Just tried it on a separate server with a new env and I still face the same issue. What version of ray do you have? Mine is ray==2.10.0.

Probably the latest one. I installed it with pip install -e ".[vllm]" on RunPod with 4 GPUs.
