
[core][experimental] Add support for dynamically sized torch.Tensors passed via NCCL in accelerated DAG #45332

Merged
merged 95 commits into from
May 20, 2024

Conversation

@stephanie-wang stephanie-wang (Contributor) commented May 14, 2024

Why are these changes needed?

This adds support for dynamically sized torch.Tensors to be passed between accelerated DAG nodes via NCCL. Specifically, the following code is now supported, whereas previously `shape` and `dtype` had to be explicitly passed to `TorchTensorType`.

```python
with InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
```

The feature works by creating a shared memory channel to pass the metadata for the shape and dtype of the tensor. The metadata is then used to create a buffer of the correct size on the NCCL receiver.
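As a rough illustration, the receive path looks something like the following (a minimal sketch; `meta_channel`, `nccl_group`, and their `read`/`recv` methods are hypothetical stand-ins for the internal channel APIs, not the actual Ray interfaces):

```python
import torch

def receive_dynamic_tensor(meta_channel, nccl_group, device):
    # Hypothetical sketch; not the actual Ray channel API.
    # Read the (shape, dtype) metadata sent over the shared memory channel.
    shape, dtype = meta_channel.read()
    # Allocate a receive buffer of exactly the right size on the GPU.
    buf = torch.empty(shape, dtype=dtype, device=device)
    # Receive the tensor contents over NCCL directly into the buffer.
    nccl_group.recv(buf)
    return buf
```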

Initial microbenchmarks show this adds about 50% throughput overhead compared to statically declaring the shape and dtype, or about 160us per DAG call. This is somewhat higher than expected (see also #45319).

This also adds a few other fixes:

  • Support for reusing actors to create new NCCL groups, which is needed if a DAG is torn down and a new one is created.
  • A lock around DAG teardown, to prevent the same NCCL group from being destructed twice.
  • A user-defined TorchTensorType shape or dtype is now used as a hint for the buffer size, instead of a required size. Since buffers are currently static, an error is thrown if the user tries to return a too-large tensor (see the sketch after this list).
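For example, a minimal sketch of the hint behavior (using only the constructor arguments shown in this PR; the exact exception type is not specified here and `sender`/`receiver` are the same example actors as above):

```python
with InputNode() as inp:
    dag = sender.send.bind(inp)
    # shape/dtype now act as a buffer-size hint rather than a strict contract.
    dag = dag.with_type_hint(
        TorchTensorType(shape=(10,), dtype=torch.float16, transport="nccl")
    )
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
# A returned tensor that fits in the hinted buffer is fine; returning a larger
# tensor (e.g., shape (100,)) raises an error because buffers are static.
```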

Related issue number

Part 1 of #45306; a separate PR will follow for nested tensors.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

stephanie-wang and others added 30 commits April 17, 2024 15:56
@rkooo567 rkooo567 (Contributor) left a comment

LGTM! +1 on this #45319. I feel like it is safer to kill actors when NCCL destroy times out, just in case (especially given it is hard to test and we don't understand this very well yet), but I will leave it up to you.

@stephanie-wang (Contributor Author)

> LGTM! +1 on this #45319. I feel like it is safer to kill actors when NCCL destroy times out, just in case (especially given it is hard to test and we don't understand this very well yet), but I will leave it up to you.

For the initial case, what I would like to do is just offer two options: one that kills the actors, and another that syncs the stream and raises an exception to keep the actors alive. I agree it needs more testing, but we can probably defer that.

@stephanie-wang stephanie-wang added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") May 17, 2024
@stephanie-wang (Contributor Author)

Hmm, some CI test issue about not being able to find GPUs...

This reverts commit 9915fbe.
@stephanie-wang stephanie-wang enabled auto-merge (squash) May 18, 2024 00:13
@github-actions github-actions bot disabled auto-merge May 18, 2024 01:13
```diff
@@ -339,5 +339,6 @@ steps:
   - bazel run //ci/ray_ci:test_in_docker -- //python/ray/tests/... //python/ray/dag/... core
     --parallelism-per-worker 2 --gpus 2
     --build-name coregpubuild
-    --only-tags multi_gpu
+    --only-tags multi_gpu || true
+  - sleep 1000000
```
Collaborator
w00t, I think you might have forgotten to remove this on CI, so it was running for 8 hours.

@can-anyscale can-anyscale (Collaborator) left a comment

ci changes look good, thanks

```diff
@@ -1,7 +1,7 @@
 # coding: utf-8
 import logging
 import torch
-import pickle
+import ray.cloudpickle as pickle
```
Collaborator
do we want to run this release test on this PR?

@stephanie-wang stephanie-wang merged commit ca9f736 into ray-project:master May 20, 2024
5 of 6 checks passed
stephanie-wang added a commit that referenced this pull request May 26, 2024
…ython objects (#45473)

Allows torch.Tensors nested inside Python objects to be transferred via
NCCL using the following syntax:

```python
    with InputNode() as inp:
        dag = sender.send.bind(inp)
        dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
        dag = receiver.recv.bind(dag)
```

We implement this by using an additional shared memory channel to pass
CPU data, with a "nested" NCCL channel to pass the GPU data. Here is the
send procedure for the above code:
1. Serialize the data, extracting all tensors that are on the GPU and replacing them with placeholders.
2. Send a list of metadata through the meta_channel.
3. Send the GPU tensors through the NCCL channel.
4. Send the rest of the CPU data, with the placeholders for the GPU tensors, through a cpu_data_channel.
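A rough sketch of this four-step send path (the channel objects and the `send_nested` helper are hypothetical stand-ins; the real implementation lives in the channel internals):

```python
import torch
import ray.cloudpickle as pickle

def send_nested(value, meta_channel, nccl_channel, cpu_data_channel):
    # Hypothetical sketch of the send procedure described above.
    gpu_tensors = []

    def replace_gpu_tensors(obj):
        # Walk the object tree, swapping GPU tensors for placeholder indices.
        if isinstance(obj, torch.Tensor) and obj.is_cuda:
            gpu_tensors.append(obj)
            return ("__tensor_placeholder__", len(gpu_tensors) - 1)
        if isinstance(obj, (list, tuple)):
            return type(obj)(replace_gpu_tensors(x) for x in obj)
        if isinstance(obj, dict):
            return {k: replace_gpu_tensors(v) for k, v in obj.items()}
        return obj

    cpu_value = replace_gpu_tensors(value)                          # step 1
    meta_channel.write([(t.shape, t.dtype) for t in gpu_tensors])   # step 2
    for t in gpu_tensors:                                           # step 3
        nccl_channel.send(t)
    cpu_data_channel.write(pickle.dumps(cpu_value))                 # step 4
```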

Note that if the TorchTensorType doesn't have a shape and dtype
specified, we currently use the separate meta_channel to pass metadata
for the serialized tensors, as introduced in #45332. To elide the
cpu_data_channel, the user should now use
`TorchTensorType(direct_return=True)`, to indicate that no CPU data is
sent along with the GPU data. To elide the meta_channel, the user should
declare the shape and dtype, e.g., `TorchTensorType(shape=(10, ),
dtype=torch.float16)`.
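Concretely, a sketch of the two options (using only the constructor arguments named above; combining `shape`/`dtype` with `transport="nccl"` as in the earlier examples is assumed here):

```python
# Elide the cpu_data_channel: the task returns a bare GPU tensor, no CPU data.
dag = dag.with_type_hint(TorchTensorType(transport="nccl", direct_return=True))

# Elide the meta_channel: declare the shape and dtype statically.
dag = dag.with_type_hint(
    TorchTensorType(shape=(10,), dtype=torch.float16, transport="nccl")
)
```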

## Related issue number

Closes #45306.

---------

Signed-off-by: Stephanie Wang <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
…passed via NCCL in accelerated DAG (ray-project#45332)
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
…ython objects (ray-project#45473)
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
…passed via NCCL in accelerated DAG (ray-project#45332)
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
…ython objects (ray-project#45473)
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
…passed via NCCL in accelerated DAG (ray-project#45332)
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
…ython objects (ray-project#45473)
GabeChurch pushed a commit to GabeChurch/ray that referenced this pull request Jun 11, 2024
…passed via NCCL in accelerated DAG (ray-project#45332)