peer access is not supported between these two devices #552

gmonair · 2024-06-16T18:22:56Z

After upgrading from sglang 0.1.16 to 0.1.17, I get the following error when loading a model with tp=2 on a 2xT4 machine (Kaggle). The same code worked on 0.1.16.

Error:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Failed: Cuda error /home/runner/work/vllm/vllm/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'
Failed: Cuda error /home/runner/work/vllm/vllm/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'

[rank1]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

[...]

Code:

runtime = sgl.Runtime(model_path=model_name, tp_size=2)

This used to run fine in 0.1.16 on the same machine. The model loaded is deepseek-7b, so it belongs to the LlamaForCausalLM family. Let me know if you want me to test with other models.
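To check whether the hardware is the limiting factor, you can ask PyTorch which GPU pairs report peer access. The sketch below is not part of sglang or of the original report. `pairs_without_p2p` and `main` are hypothetical helper names, and the script assumes PyTorch with CUDA support is installed.

```python
from itertools import permutations


def pairs_without_p2p(device_count, can_access_peer):
    """Return ordered (src, dst) GPU index pairs that lack peer access."""
    return [
        (src, dst)
        for src, dst in permutations(range(device_count), 2)
        if not can_access_peer(src, dst)
    ]


def main():
    try:
        import torch  # assumption: PyTorch is installed in this environment
    except ImportError:
        print("PyTorch is not installed; cannot query peer access.")
        return
    count = torch.cuda.device_count()
    if count < 2:
        print(f"Only {count} GPU(s) visible; peer access does not apply.")
        return
    # torch.cuda.can_device_access_peer(device, peer_device) returns a bool.
    blocked = pairs_without_p2p(count, torch.cuda.can_device_access_peer)
    if blocked:
        print("No peer access between:", blocked)
    else:
        print("All visible GPU pairs report peer access.")


if __name__ == "__main__":
    main()
```

If a pair is listed as blocked, a code path that tries to enable peer access between those devices is likely to fail with the error shown above.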


merrymercy · 2024-06-21T03:36:54Z

See PR #531 for a temporary fix. You can disable custom allreduce for your setup. If you get it fixed, please contribute a PR.

hnyls2002 · 2024-07-07T06:33:25Z

@gmonair Add the --enable-p2p-check option to the server args so that older GPUs can also run with tp=2.
PR #599 fixes your issue.
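As a rough illustration of passing the flag, the sketch below builds a server launch command. Only `--enable-p2p-check` comes from this thread. The `sglang.launch_server` module name, the `--model-path` and `--tp-size` flags, the `launch_command` helper, and the `path/to/deepseek-7b` placeholder are assumptions, so check them against `--help` for your installed version.

```python
import shlex
import sys


def launch_command(model_path, tp_size=2, enable_p2p_check=True):
    """Build an argv list for starting the server (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", model_path,                    # assumed flag name
        "--tp-size", str(tp_size),                     # assumed flag name
    ]
    if enable_p2p_check:
        # Flag named in this thread (added by PR #599).
        cmd.append("--enable-p2p-check")
    return cmd


# Print the command for inspection; pass the list to subprocess.run to launch.
print(shlex.join(launch_command("path/to/deepseek-7b")))
```

Building the command as a list rather than a single string avoids shell quoting problems when the model path contains spaces.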

hnyls2002 mentioned this issue Jul 7, 2024

Add --enable-p2p-check option #599

Merged

hnyls2002 closed this as completed Jul 7, 2024
