
Introduce CUDAScopedStream class for Multi-Stream Support #3732

Merged
merged 1 commit into from
Jul 16, 2021

Conversation

stotko
Contributor

@stotko stotko commented Jul 12, 2021

Highlights:

  • Rename CUDADeviceSwitcher to CUDAScopedDevice and remove the SwitchTo function to simplify its usage.
  • Add CUDAScopedStream class for setting the global internal per-thread stream state.
  • Add support for stream-based allocation (introduced in CUDA 11.2) and copy functions in CUDAMemoryManager.

Further changes:

  • Add cuda::{Get,Set}Stream functions to access the global state. These functions are used to implement CUDAScopedStream.
  • Add cuda::{Get,Set}Device to mirror the stream API. This makes CUDAScopedDevice's implementation look similar to CUDAScopedStream. It further hides the implementation detail that CUDA already manages the device state internally.
  • Make all of Open3D's kernel calls aware of the current stream.
  • Add unit tests to check thread locality.
  • Add CUDAScopedDeviceStream in ML ops as a combination of both scoped wrappers. It infers the device from the stream and sets both states simultaneously.

Future work:

  • Unify stream usage in ML ops to take advantage of CUDAScopedStream.
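
The scoped wrappers above can be sketched in use as follows. This is a minimal sketch only: the header path and exact constructor signatures are assumptions based on the PR description, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#include "open3d/core/CUDAUtils.h"  // assumed header location

void Example() {
    using namespace open3d::core;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    {
        // Both wrappers are RAII guards: they capture the previous
        // per-thread state in the constructor and restore it in the
        // destructor.
        CUDAScopedDevice scoped_device(0);       // switch to device 0
        CUDAScopedStream scoped_stream(stream);  // route kernels to `stream`

        // ... launch Open3D kernels here; they observe the current
        // per-thread stream via cuda::GetStream() ...
    }  // previous device and stream are restored here

    cudaStreamDestroy(stream);
}
```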


@stotko stotko requested review from yxlao and ssheorey July 12, 2021 17:52
@update-docs

update-docs bot commented Jul 12, 2021

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

@stotko stotko force-pushed the stotko/multi-stream-support branch 2 times, most recently from 6837881 to d7430fa Compare July 13, 2021 10:14
@stotko stotko requested a review from benjaminum July 13, 2021 12:22
Collaborator

@yxlao yxlao left a comment


Reviewed 12 of 17 files at r1, 8 of 8 files at r2.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @benjaminum, @ssheorey, and @stotko)


cpp/open3d/core/CUDAState.cuh, line 137 at r2 (raw file):

    ~CUDAScopedStream() { cuda::SetStream(prev_stream_); }

    CUDAScopedStream(CUDAScopedStream const&) = delete;

nit: const CUDAScopedStream & to be consistent with the rest.


cpp/open3d/core/CUDAUtils.cpp, line 93 at r2 (raw file):

    void Set(cudaStream_t stream) { stream_ = stream; }

    static cudaStream_t Default() { return (cudaStream_t)0; }

nit: static_cast<cudaStream_t>


cpp/open3d/ml/Helper.h, line 108 at r2 (raw file):

        if (err_get_string == CUDA_SUCCESS) {
            fprintf(stderr, "%s:%d CUDA driver error: %s", file, line,

Possible to convert this to utility::Logxxx()? Previously, the ML ops were independent of libOpen3D, but now libOpen3D is linked into the ML ops library.

@stotko stotko force-pushed the stotko/multi-stream-support branch from d7430fa to 3fe07f1 Compare July 14, 2021 09:39
Contributor Author

@stotko stotko left a comment


Reviewable status: 8 of 33 files reviewed, 1 unresolved discussion (waiting on @benjaminum, @ssheorey, and @yxlao)


cpp/open3d/core/CUDAState.cuh, line 137 at r2 (raw file):

Previously, yxlao (Yixing Lao) wrote…

nit: const CUDAScopedStream & to be consistent with the rest.

Done.


cpp/open3d/core/CUDAUtils.cpp, line 93 at r2 (raw file):

Previously, yxlao (Yixing Lao) wrote…

nit: static_cast<cudaStream_t>

Done.


cpp/open3d/ml/Helper.h, line 108 at r2 (raw file):

Previously, yxlao (Yixing Lao) wrote…

Possible to convert this to utility::Logxxx()? Previously, the ML ops were independent of libOpen3D, but now libOpen3D is linked into the ML ops library.

Done. I also applied some additional cleanups in Helper.h, CUDAUtils.h and CUDAState.cuh.

@stotko stotko force-pushed the stotko/multi-stream-support branch from 3fe07f1 to 9e434b3 Compare July 14, 2021 09:51
Contributor

@benjaminum benjaminum left a comment


Reviewable status: 8 of 33 files reviewed, 3 unresolved discussions (waiting on @ssheorey, @stotko, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

}

class CUDAScopedDeviceStream {

Where is this used? Is this the only file where we need the driver API?


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

                utility::DivUp(n, NMS_BLOCK_SIZE));
    dim3 threads(NMS_BLOCK_SIZE);
    NmsKernel<<<blocks, threads, 0, core::cuda::GetStream()>>>(

The Nms op should get the stream from tensorflow or pytorch. Looks like we did not pass this information from the op wrappers to here.

@stotko stotko force-pushed the stotko/multi-stream-support branch from 9e434b3 to 6c9640f Compare July 14, 2021 14:11
Contributor Author

@stotko stotko left a comment


Reviewable status: 8 of 39 files reviewed, 3 unresolved discussions (waiting on @benjaminum, @ssheorey, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

Previously, benjaminum (Benjamin Ummenhofer) wrote…

Where is this used? Is this the only file where we need the driver API?

This is a combination of CUDAScopedDevice and CUDAScopedStream where the device information is inferred from the stream via the CUDA driver API. Since we cannot directly put this into the core library, due to linking issues, this class is only exposed in the ML ops. It is currently not used, but may be useful in the future.

Is this combined functionality helpful in the context of the ML ops or are the individual wrappers enough to handle all use cases?
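
For reference, the device inference described here can be sketched with the driver API. cuStreamGetCtx, cuCtxPushCurrent, cuCtxGetDevice, and cuCtxPopCurrent are real driver-API calls; the helper name is hypothetical and error handling is omitted for brevity.

```cuda
#include <cuda.h>
#include <cuda_runtime.h>

// Infer the device that `stream` belongs to via the driver API
// (the runtime API has no equivalent query). Sketch only; real
// code must check every CUresult.
int DeviceFromStream(cudaStream_t stream) {
    CUcontext context;
    cuStreamGetCtx(reinterpret_cast<CUstream>(stream), &context);

    CUdevice device;
    cuCtxPushCurrent(context);  // make the stream's context current
    cuCtxGetDevice(&device);    // query its device ordinal
    cuCtxPopCurrent(&context);  // restore the previous context

    return static_cast<int>(device);
}
```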


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

Previously, benjaminum (Benjamin Ummenhofer) wrote…

The Nms op should get the stream from tensorflow or pytorch. Looks like we did not pass this information from the op wrappers to here.

Could you provide some pointers on how to approach this? I tried

TensorFlow: auto device = context->eigen_gpu_device(); then use device.stream().
PyTorch: auto stream = at::cuda::getCurrentCUDAStream();

but they only seem to be available in .cu files and I get some strange compiler errors instead. Is there some difference in the design of the Nms op compared to the others?
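
For reference, the two framework calls quoted above would be used roughly like this. The header paths are from memory and may differ across framework versions; treat them as assumptions, and note that each snippet must live in a compilation unit that links against the respective framework.

```cuda
#include <cuda_runtime.h>

// PyTorch: compiled in a unit that links against libtorch.
#include <c10/cuda/CUDAStream.h>

cudaStream_t CurrentTorchStream() {
    // c10::cuda::CUDAStream converts implicitly to cudaStream_t.
    return at::cuda::getCurrentCUDAStream();
}

// TensorFlow: inside OpKernel::Compute(OpKernelContext* context):
//
//   auto& device = context->eigen_gpu_device();
//   cudaStream_t stream = device.stream();
```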

@stotko stotko requested review from yxlao and benjaminum July 14, 2021 14:23
@stotko
Contributor Author

stotko commented Jul 14, 2021

Re-requesting reviews as I decided to merge further changes into this PR:

  • Rename CUDADeviceSwitcher to CUDAScopedDevice
  • Add support for stream-based allocation (introduced in CUDA 11.2) and copy functions in CUDAMemoryManager.

Copy link
Contributor

@benjaminum benjaminum left a comment


Reviewable status: 8 of 39 files reviewed, 3 unresolved discussions (waiting on @ssheorey, @stotko, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

Previously, stotko (Patrick Stotko) wrote…

This is a combination of CUDAScopedDevice and CUDAScopedStream where the device information is inferred from the stream via the CUDA driver API. Since we cannot directly put this into the core library, due to linking issues, this class is only exposed in the ML ops. It is currently not used, but may be useful in the future.

Is this combined functionality helpful in the context of the ML ops or are the individual wrappers enough to handle all use cases?

I think the scoped stream is the most important class. It will allow us to use our Tensor library within the ops.
I assume that the current device is already set by tf/torch when an operator is called. I agree that having a way to infer the device could become handy but I am not sure if there are implications with linking the driver api.


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

Previously, stotko (Patrick Stotko) wrote…

Could you provide some pointers on how to approach this? I tried

TensorFlow: auto device = context->eigen_gpu_device(); then use device.stream().
PyTorch: auto stream = at::cuda::getCurrentCUDAStream();

but they only seem to be available in .cu files and I get some strange compiler errors instead. Is there some difference in the design of the Nms op compared to the others?

Yes, we need to split the op up in multiple compilation units similar to the other ops. For tensorflow we already have the right structure but for torch it is missing.

@stotko stotko force-pushed the stotko/multi-stream-support branch from 6c9640f to 58097fb Compare July 14, 2021 17:28
Contributor Author

@stotko stotko left a comment


Reviewable status: 6 of 40 files reviewed, 3 unresolved discussions (waiting on @benjaminum, @ssheorey, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

Previously, benjaminum (Benjamin Ummenhofer) wrote…

I think the scoped stream is the most important class. It will allow us to use our Tensor library within the ops.
I assume that the current device is already set by tf/torch when an operator is called. I agree that having a way to infer the device could become handy but I am not sure if there are implications with linking the driver api.

Yes, as long as this assumption holds, CUDAScopedStream should be enough. Unfortunately, there is no corresponding API in the runtime library to infer the device, so we must use the driver API. If this may lead to any blockers in the future, we can remove CUDAScopedDeviceStream for now and consider adding it at a later point in time if we observe that we need it. Any preference on how we should proceed?


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

Previously, benjaminum (Benjamin Ummenhofer) wrote…

Yes, we need to split the op up in multiple compilation units similar to the other ops. For tensorflow we already have the right structure but for torch it is missing.

Reverted all applications of cuda::GetStream() in the ML ops. I would like to unblock this PR and suggest improving Nms and any other affected op in a separate PR.

Contributor

@benjaminum benjaminum left a comment


Reviewable status: 6 of 40 files reviewed, 3 unresolved discussions (waiting on @ssheorey, @stotko, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

Previously, stotko (Patrick Stotko) wrote…

Yes, as long as this assumption holds, CUDAScopedStream should be enough. Unfortunately, there is no corresponding API in the runtime library to infer the device, so we must use the driver API. If this may lead to any blockers in the future, we can remove CUDAScopedDeviceStream for now and consider adding it at a later point in time if we observe that we need it. Any preference on how we should proceed?

It is just a header file. I don't mind keeping it.


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

Previously, stotko (Patrick Stotko) wrote…

Reverted all applications of cuda::GetStream() in the ML ops. I would like to unblock this PR and suggest improving Nms and any other affected op in a separate PR.

Makes sense.

Member

@ssheorey ssheorey left a comment


Can we add stream support to the NPP image processing operations as well?
eg: Current NPP calls:
https://github.com/intel-isl/Open3D/blob/e0a6201554fa7a8e8c0038181dd68b76929a7203/cpp/open3d/t/geometry/kernel/NPPImage.cpp#L101

NPP>=v10.1 introduces a new API for application managed stream context that we should use. The function calls are now *_Ctx() and take an extra argument that can be added to the NPP_ARGS macro.

The older API was easier to use, with npp{Get,Set}Stream().

OK to do in another PR.
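
A minimal sketch of the application-managed stream-context pattern mentioned above: nppGetStreamContext() is a real NPP call that populates the device fields for the current device, after which the context can be pointed at an application stream. The helper name and the *_Ctx call in the usage comment are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <npp.h>

// Build an NppStreamContext that routes NPP work to `stream`.
// Sketch only; real code should check the NppStatus return values.
NppStreamContext MakeNppContext(cudaStream_t stream) {
    NppStreamContext ctx{};
    nppGetStreamContext(&ctx);  // fill device fields for the current device
    ctx.hStream = stream;       // override with the application's stream
    return ctx;
}

// Usage (hypothetical *_Ctx call):
//   NppStreamContext ctx = MakeNppContext(stream);
//   nppiSomeOperation_8u_C1R_Ctx(/* ...NPP_ARGS... */, ctx);
```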

Reviewed 1 of 8 files at r2, 1 of 25 files at r3, 1 of 1 files at r4, 4 of 9 files at r5, 1 of 4 files at r6.
Reviewable status: 13 of 40 files reviewed, 8 unresolved discussions (waiting on @stotko and @yxlao)


cpp/benchmarks/core/MemoryManager.cpp, line 72 at r6 (raw file):

Synchronize(

nit: static


cpp/open3d/core/CUDAState.cuh, line 114 at r6 (raw file):

class CUDAScopedStream {
public:
    explicit CUDAScopedStream(cudaStream_t stream)

We can simplify usage by making stream optional. If stream is not provided we create a new stream in the ctor. The user doesn't need to track streams for simple code.
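
The suggestion could look roughly like this. A sketch only: the save/restore logic is elided, and the member names and the ownership flag are assumptions, not the PR's actual implementation.

```cuda
#include <cuda_runtime.h>

class CUDAScopedStream {
public:
    // Caller-provided stream: the guard does not own it.
    explicit CUDAScopedStream(cudaStream_t stream)
        : stream_(stream), owns_stream_(false) {
        // ... save previous stream, set `stream_` as current ...
    }

    // No stream provided: create one and take ownership, so simple
    // code never has to track stream handles.
    CUDAScopedStream() : owns_stream_(true) {
        cudaStreamCreate(&stream_);
        // ... save previous stream, set `stream_` as current ...
    }

    ~CUDAScopedStream() {
        // ... restore the previous stream ...
        if (owns_stream_) {
            cudaStreamDestroy(stream_);
        }
    }

private:
    cudaStream_t stream_;
    bool owns_stream_;
};
```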


cpp/open3d/core/CUDAUtils.h, line 103 at r6 (raw file):

#ifdef BUILD_CUDA_MODULE

I would suggest only providing the CUDAScoped{Device,Stream} APIs and not providing these additional Set APIs. Two ways of doing the same thing can cause user confusion. The Get APIs are still useful.


cpp/open3d/core/CUDAUtils.cpp, line 97 at r6 (raw file):

    return device;
}

Consider not providing the Set APIs (see above).


cpp/tests/core/CUDAState.cpp, line 79 at r6 (raw file):

    std::vector<std::thread> threads;

    const int kIterations = 100000;

Can you comment on why we need a large kIterations?

Member

@ssheorey ssheorey left a comment


Also, Python bindings will be useful if a user wants to build their own pipeline in Python with Open3D components.

Reviewable status: 13 of 40 files reviewed, 8 unresolved discussions (waiting on @stotko and @yxlao)

Member

@ssheorey ssheorey left a comment


NPP stream context: https://docs.nvidia.com/cuda/npp/index.html#application_managed_stream_context

Reviewable status: 13 of 40 files reviewed, 8 unresolved discussions (waiting on @stotko and @yxlao)

Collaborator

@yxlao yxlao left a comment


Reviewed 19 of 25 files at r3, 1 of 1 files at r4, 8 of 9 files at r5, 4 of 4 files at r6.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @stotko)

@yxlao
Collaborator

yxlao commented Jul 16, 2021

Merging #3732 for now, and @stotko will address @ssheorey 's remaining comments in a separate PR.

@yxlao yxlao merged commit d21d12d into master Jul 16, 2021
@yxlao yxlao deleted the stotko/multi-stream-support branch July 16, 2021 09:18