Process hangs when using tensor_parallel_size and data_parallel_size together #1734
Comments
Hi! What version of vLLM are you running? @baberabb has observed problems like this before with later versions of vllm (>v0.3.3, I believe).
I am on vllm
Is this a vllm problem? Should I be raising an issue on that repo?
Hey. Have you tried caching the weights by running with DP=1 until they are downloaded? I found it prone to hang with DP otherwise.
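As an illustration of that workaround (a sketch, not taken from the thread), one way to warm the local cache before a data-parallel run is to pre-download the weights; the repo id below is a placeholder:

```python
# Sketch of the caching workaround suggested above: pre-populate the local
# Hugging Face cache so that data-parallel workers do not all try to
# download the same weights at once. The repo id is a placeholder.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="mistralai/Mistral-7B-v0.1")
```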
Yes, the weights are cached. The process is hanging after
Hmm. It's working for me with
Just tried it on a separate server with a new env and still face the same issue. What version of ray do you have? Mine is
Probably the latest one. I installed it with
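For reference, a quick way to check which ray version an environment actually has installed:

```python
# Print the installed ray version to compare environments.
import ray

print(ray.__version__)
```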
Hello,
I noticed that my process hangs at
results = ray.get(object_refs)
when I use data_parallel_size as well as tensor_parallel_size for vllm models. For example, this call would hang.
These would not.
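For illustration, a hypothetical sketch of such calls via the harness's Python API; the model name and task here are placeholders, not the original snippets:

```python
# Hypothetical sketch (model name and task are placeholders).
import lm_eval

# Hangs at ray.get(object_refs): tensor and data parallelism combined.
lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=mistralai/Mistral-7B-v0.1,tensor_parallel_size=2,data_parallel_size=2",
    tasks=["hellaswag"],
)

# These complete normally: only one form of parallelism at a time.
lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=mistralai/Mistral-7B-v0.1,tensor_parallel_size=2,data_parallel_size=1",
    tasks=["hellaswag"],
)
lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=mistralai/Mistral-7B-v0.1,tensor_parallel_size=1,data_parallel_size=2",
    tasks=["hellaswag"],
)
```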
Does anyone else face a similar problem?