
[core][experimental] Pass torch.Tensors through accelerated DAGs #44825

Merged
merged 27 commits from dag-gpu-channels into ray-project:master on May 3, 2024

Conversation

stephanie-wang
Contributor

@stephanie-wang commented Apr 18, 2024

Why are these changes needed?

This PR adds support for passing torch.Tensors to local actors in an accelerated DAG, via Ray's shared memory store. It supports the following transfer cases, as long as the sending and receiving actors are on the same node: CPU-CPU, CPU-GPU, GPU-CPU, GPU-GPU (via CPU).

This iteration requires the user to explicitly declare which DAG nodes contain torch.Tensors, along with the tensors' shape and dtype, via a new with_type_hint method. For example:

    with InputNode() as inp:
        dag = sender.send.bind(inp)
        dag = dag.with_type_hint(TorchTensorType(SHAPE, DTYPE))
        dag = receiver.recv.bind(dag)

    compiled_dag = dag.experimental_compile()

This declaration isn't strictly necessary for this PR, but it is included now because it makes it much simpler to efficiently support other cases in the future, such as p2p GPU-GPU transfers.

When a TorchTensor node is declared, the serialization of the underlying torch.Tensor is performed differently from vanilla Ray. In particular, we store the numpy view of the data. On the receiving actor, we deserialize to a torch.Tensor and move it to the device assigned to the actor, if any. Microbenchmarking shows that this is 4x faster than normal pickling and unpickling of a torch.Tensor, likely due to Ray's serialization support for numpy. Also, when moving the torch.Tensor to a GPU on the receiving side, we can avoid one extra data copy by copying directly from Ray's shared memory buffer to GPU memory.
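
As a rough sketch of the scheme described above (the function names and registration details here are illustrative, not the PR's actual implementation, which lives in the new _TorchTensorSerializer):

```python
import numpy as np
import torch

def serialize_tensor(tensor: "torch.Tensor") -> np.ndarray:
    # Detach and move to CPU if necessary, then expose the data as a numpy
    # array. Ray serializes numpy arrays efficiently (out-of-band, zero-copy
    # buffers), which is likely where the speedup over plain pickling comes from.
    return tensor.detach().cpu().numpy()

def deserialize_tensor(array: np.ndarray, device: "torch.device") -> "torch.Tensor":
    # Construct the tensor directly on the receiving actor's assigned device.
    # When the device is a GPU, this copies straight from the shared-memory
    # buffer into GPU memory, avoiding an extra intermediate copy.
    return torch.as_tensor(array, device=device)
```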

Limitations:

  • Only supports tasks that directly return a torch.Tensor, i.e. the torch.Tensor cannot be nested in other data.
  • The task must declare the shape and dtype of its torch.Tensor at DAG compile time.
  • Does not support local p2p GPU-GPU transfer, either using cudaMemcpy or NCCL. A microbenchmark shows this can be >10x faster than transfer via CPU.
  • Does not support multinode GPU-GPU transfer, e.g., via RPC between hosts or NCCL.
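
For context, a hypothetical end-to-end sketch of the API introduced here, assuming simple sender/receiver actors. The import path for TorchTensorType is a guess based on the files touched in this diff (python/ray/dag/experimental/types.py), and the execute/read calls are omitted because that part of the API was still evolving at the time:

```python
import ray
import torch
from ray.dag import InputNode
# Assumed import path, based on python/ray/dag/experimental/types.py in this PR;
# it may differ in other Ray versions.
from ray.dag.experimental.types import TorchTensorType

SHAPE = (10_000,)
DTYPE = torch.float16

@ray.remote
class Sender:
    def send(self, value: int) -> torch.Tensor:
        # The task must return the torch.Tensor directly (not nested in other data).
        return torch.ones(SHAPE, dtype=DTYPE) * value

@ray.remote
class Receiver:
    def recv(self, tensor: torch.Tensor):
        return tensor.shape, tensor.dtype

sender = Sender.remote()
receiver = Receiver.remote()

with InputNode() as inp:
    dag = sender.send.bind(inp)
    # Declare the static shape/dtype so the channel can be set up at compile time.
    dag = dag.with_type_hint(TorchTensorType(SHAPE, DTYPE))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
# Execution then follows the usual accelerated-DAG execute/read flow (not shown).
```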

stephanie-wang and others added 9 commits April 17, 2024 15:56
Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
GPU
Signed-off-by: Your Name <[email protected]>
Signed-off-by: Your Name <[email protected]>
# Test torch.Tensor sent between actors.
with InputNode() as inp:
    dag = sender.send.bind(shape, dtype, inp)
    dag = TorchTensor(dag, shape, dtype)
Contributor

Instead of adding a wrapper DAG node, I'm thinking it may be advantageous to set typing information on the node itself to define the static shape and dtype. This would be just one case of adding more static typing information to the DAG; e.g., one could imagine size as a generally useful attribute too.

Syntactically this could make the code look like this instead:

        dag = sender.send.bind(shape, dtype, inp) \
                .with_output_type(TorchTensorType(shape, dtype))
        dag = receiver.recv.bind(dag)

I haven't looked closely at how this would change the backend implementation but I think this could simplify that as well by removing special cases.

Contributor Author

I thought about this but the one issue with that API is that it's not very clean for tasks that return multiple values. This isn't supported yet but I suspect we will need to support it eventually.

Contributor Author

Although yes, we could use that kind of dot syntax instead of a wrapper class.

Contributor

Hmm, eventually how would you support that? If the return type is arbitrarily complex, then you probably need to handle it at runtime in a custom serialization hook instead of in the DAG. Or the type annotation itself would have to become just as complex.

Not sure if runtime handling is sufficient to set up accelerated transfer in all cases, though, or if you need to know the type information statically ahead of time.

Contributor Author

For multiple return values, we can use the normal Ray num_returns syntax.

But yeah, the type annotations will get unwieldy for nested tensors. What I'm thinking is that for cases where the user knows the shape and dtype ahead of time, they should use the static annotation, which guarantees zero control-plane overhead from accelerated DAGs.

If they don't know it ahead of time, we can fall back to a custom serialization hook (we just have to make sure it applies only to DAG outputs, not to other objects serialized by the user). The serialization hook would pass the type metadata through a normal channel and start the accelerated transfer separately. It'll be slower because we need to send the type metadata synchronously, but it will be much easier to program against. With the current PR, it'd also be easy for the user to add their own type hints to the task return values by wrapping a returned torch.Tensor with _TorchTensorWrapper (currently a developer API).
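
For reference, the standard Ray num_returns syntax mentioned above looks roughly like this for a plain remote function (independent of accelerated DAGs; the split function is just an illustration):

```python
import ray

@ray.remote(num_returns=2)
def split(values):
    # Each return value becomes its own object, so each output could in
    # principle carry its own static type hint.
    mid = len(values) // 2
    return values[:mid], values[mid:]

first_ref, second_ref = split.remote(list(range(10)))
print(ray.get(first_ref), ray.get(second_ref))
```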

Contributor

Got it. I feel like maybe annotating the function is actually the best way to go in the long run, like @ray.remote(return_types=[...]), but any interim state between that and the nodes is probably fine.

I do think that using type annotations instead of a separate node wrapper class will clean up the internal code though (avoid "tensor" special cases and rewriting).

        return torch.ones(shape, dtype=dtype, device=self.device) * value

    def recv(self, tensor):
        assert tensor.device == self.device
Contributor

How would the multi-device case be handled in the DAG declaration? I imagine it would have to be something you also include as part of the DAG in order to set up the right deserialization processing.

Contributor Author

Yeah, I was thinking of exposing a call on the actor that you can use to set the default device.
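
A purely hypothetical sketch of what such a call might look like (set_device is not an API added by this PR; the class and method names are illustrative):

```python
import ray
import torch

@ray.remote  # in practice, likely declared with num_gpus=1
class Receiver:
    def __init__(self):
        # Default device until the user pins one explicitly.
        self.device = torch.device("cpu")

    def set_device(self, device: str) -> None:
        # Hypothetical helper: choose the device that deserialized tensors
        # should be placed on for this actor.
        self.device = torch.device(device)

    def recv(self, tensor: torch.Tensor):
        return tensor.to(self.device).shape

receiver = Receiver.remote()
ray.get(receiver.set_device.remote("cuda:0" if torch.cuda.is_available() else "cpu"))
```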

Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
@stephanie-wang changed the title from "[WIP] Pass torch.Tensors through accelerated DAGs" to "[core][experimental] Pass torch.Tensors through accelerated DAGs" on Apr 24, 2024
@stephanie-wang
Contributor Author

stephanie-wang commented Apr 25, 2024

Strangely, this commit caused the performance of CPU-CPU tensor transfer to drop from 1.29 GB/s to 400 MB/s. I have no idea why; the commit doesn't seem like it should have affected anything, and from what I can tell, the serialization/deserialization code wasn't touched. Will file an issue after this PR is merged...

Signed-off-by: Stephanie Wang <[email protected]>
@@ -59,6 +79,8 @@ def do_exec_compiled_task(
the loop.
Contributor

Add comment for output_wrapper_fn



@pytest.mark.parametrize("use_gpu", [False, True])
def test_torch_tensor_p2p(ray_start_regular_shared, use_gpu):
Contributor

Does it make sense to split this test into multiple tests?

Contributor Author

I think it's okay either way, I just used the same test to reduce duplication.

Contributor

@rkooo567 left a comment

Hmm, the type_hint approach doesn't seem like great UX. What's the exact limitation right now? I assume it's that we need to know ahead of time whether the output is a GPU tensor in order to decide the transport?

I think it is okay for the initial version, but it feels like pretty bad UX, so we should probably discuss how to get around it...

Other than this, most other comments are nits.



@DeveloperAPI
class _TorchTensorSerializer:
Contributor

Consider adding a unit test for this API?


@pytest.mark.parametrize("use_gpu", [False, True])
def test_torch_tensor_p2p(ray_start_regular_shared, use_gpu):
    if use_gpu and sum(node["Resources"].get("GPU", 0) for node in ray.nodes()) < 1:
Contributor

Does it actually run in this case? I don't know if we have an instance that runs on GPU.

Contributor Author

Yeah, it doesn't run right now, but I tested manually. Will probably revisit this later to run tests on GPU machines.

    def __init__(self, device: "torch.device"):
        self.device = device

    @staticmethod
Contributor

Is it a static method to be compatible with register_serializer?


        self.downstream_node_idxs = set()
        self.output_channel = None

        self.output_wrapper_fn = None
        if self.dag_node.type_hint is not None:
            if isinstance(self.dag_node.type_hint, TorchTensorType):
Contributor

I wonder if we can detect a torch.Tensor automatically here, and if it is on GPU, just wrap it with this wrapper?

Contributor Author

I wanted to preserve existing serialization methods for torch.Tensor since this acts a bit differently from normal serialization; it automatically sets the device on the receiving end.

There is probably a nicer way to do this, but for now this is the easiest and probably doesn't affect performance.

Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
x
Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
@stephanie-wang merged commit bbcdc49 into ray-project:master on May 3, 2024
5 checks passed
@stephanie-wang deleted the dag-gpu-channels branch on May 3, 2024, 20:31
harborn pushed a commit to harborn/ray that referenced this pull request May 8, 2024
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request May 13, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024