
Introduce CUDAScopedStream class for Multi-Stream Support #3732

Merged
merged 1 commit into from
Jul 16, 2021

Conversation

stotko
Contributor

@stotko stotko commented Jul 12, 2021

Highlights:

  • Rename CUDADeviceSwitcher to CUDAScopedDevice and remove the SwitchTo function to simplify its usage.
  • Add CUDAScopedStream class for setting the global internal per-thread stream state.
  • Add support for stream-based allocation (introduced in CUDA 11.2) and copy functions in CUDAMemoryManager.

Further changes:

  • Add cuda::{Get,Set}Stream functions to access the global state. These functions are used to implement CUDAScopedStream.
  • Add cuda::{Get,Set}Device to mirror the stream API. This makes CUDAScopedDevice's implementation look similar to CUDAScopedStream. It further hides the implementation detail that CUDA already manages the device state internally.
  • Make all of Open3D's kernel calls aware of the current stream.
  • Add unit tests to check thread locality.
  • Add CUDAScopedDeviceStream in ML ops as a combination of both scoped wrappers. It infers the device from the stream and sets both states simultaneously.

Future work:

  • Unify stream usage in ML ops to take advantage of CUDAScopedStream.
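
The scoped wrappers above can be sketched in use as follows. This is a minimal sketch only: the header path and exact constructor signatures are assumptions based on the PR description, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#include "open3d/core/CUDAUtils.h"  // assumed header location

void Example() {
    using namespace open3d::core;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    {
        // Both wrappers are RAII guards: they capture the previous
        // per-thread state in the constructor and restore it in the
        // destructor.
        CUDAScopedDevice scoped_device(0);       // switch to device 0
        CUDAScopedStream scoped_stream(stream);  // route kernels to `stream`

        // ... launch Open3D kernels here; they observe the current
        // per-thread stream via cuda::GetStream() ...
    }  // previous device and stream are restored here

    cudaStreamDestroy(stream);
}
```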


@stotko stotko requested review from yxlao and ssheorey July 12, 2021 17:52
@update-docs

update-docs bot commented Jul 12, 2021

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

@stotko stotko force-pushed the stotko/multi-stream-support branch 2 times, most recently from 6837881 to d7430fa Compare July 13, 2021 10:14
@stotko stotko requested a review from benjaminum July 13, 2021 12:22
Collaborator

@yxlao yxlao left a comment


Reviewed 12 of 17 files at r1, 8 of 8 files at r2.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @benjaminum, @ssheorey, and @stotko)


cpp/open3d/core/CUDAState.cuh, line 137 at r2 (raw file):

    ~CUDAScopedStream() { cuda::SetStream(prev_stream_); }

    CUDAScopedStream(CUDAScopedStream const&) = delete;

nit: const CUDAScopedStream & to be consistent with the rest.


cpp/open3d/core/CUDAUtils.cpp, line 93 at r2 (raw file):

    void Set(cudaStream_t stream) { stream_ = stream; }

    static cudaStream_t Default() { return (cudaStream_t)0; }

nit: static_cast<cudaStream_t>


cpp/open3d/ml/Helper.h, line 108 at r2 (raw file):

        if (err_get_string == CUDA_SUCCESS) {
            fprintf(stderr, "%s:%d CUDA driver error: %s", file, line,

Possible to convert this to utility::Logxxx()? Previously, the ML ops were independent of libOpen3D, but now libOpen3D is linked into the ML ops library.

@stotko stotko force-pushed the stotko/multi-stream-support branch from d7430fa to 3fe07f1 Compare July 14, 2021 09:39
Contributor Author

@stotko stotko left a comment


Reviewable status: 8 of 33 files reviewed, 1 unresolved discussion (waiting on @benjaminum, @ssheorey, and @yxlao)


cpp/open3d/core/CUDAState.cuh, line 137 at r2 (raw file):

Previously, yxlao (Yixing Lao) wrote…

nit: const CUDAScopedStream & to be consistent with the rest.

Done.


cpp/open3d/core/CUDAUtils.cpp, line 93 at r2 (raw file):

Previously, yxlao (Yixing Lao) wrote…

nit: static_cast<cudaStream_t>

Done.


cpp/open3d/ml/Helper.h, line 108 at r2 (raw file):

Previously, yxlao (Yixing Lao) wrote…

Possible to convert this to utility::Logxxx()? Previously, the ML ops were independent of libOpen3D, but now libOpen3D is linked into the ML ops library.

Done. I also applied some additional cleanups in Helper.h, CUDAUtils.h and CUDAState.cuh.

@stotko stotko force-pushed the stotko/multi-stream-support branch from 3fe07f1 to 9e434b3 Compare July 14, 2021 09:51
Contributor

@benjaminum benjaminum left a comment


Reviewable status: 8 of 33 files reviewed, 3 unresolved discussions (waiting on @ssheorey, @stotko, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

}

class CUDAScopedDeviceStream {

Where is this used? Is this the only file where we need the driver API?


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

                utility::DivUp(n, NMS_BLOCK_SIZE));
    dim3 threads(NMS_BLOCK_SIZE);
    NmsKernel<<<blocks, threads, 0, core::cuda::GetStream()>>>(

The Nms op should get the stream from tensorflow or pytorch. Looks like we did not pass this information from the op wrappers to here.

@stotko stotko force-pushed the stotko/multi-stream-support branch from 9e434b3 to 6c9640f Compare July 14, 2021 14:11
Contributor Author

@stotko stotko left a comment


Reviewable status: 8 of 39 files reviewed, 3 unresolved discussions (waiting on @benjaminum, @ssheorey, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

Previously, benjaminum (Benjamin Ummenhofer) wrote…

Where is this used? Is this the only file where we need the driver API?

This is a combination of CUDAScopedDevice and CUDAScopedStream where the device information is inferred from the stream via the CUDA driver API. Since we cannot directly put this into the core library, due to linking issues, this class is only exposed in the ML ops. It is currently not used, but may be useful in the future.

Is this combined functionality helpful in the context of the ML ops or are the individual wrappers enough to handle all use cases?
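
For reference, the device inference described here can be sketched with the driver API. cuStreamGetCtx, cuCtxPushCurrent, cuCtxGetDevice, and cuCtxPopCurrent are real driver-API calls; the helper name is hypothetical and error handling is omitted for brevity.

```cuda
#include <cuda.h>
#include <cuda_runtime.h>

// Infer the device that `stream` belongs to via the driver API
// (the runtime API has no equivalent query). Sketch only; real
// code must check every CUresult.
int DeviceFromStream(cudaStream_t stream) {
    CUcontext context;
    cuStreamGetCtx(reinterpret_cast<CUstream>(stream), &context);

    CUdevice device;
    cuCtxPushCurrent(context);  // make the stream's context current
    cuCtxGetDevice(&device);    // query its device ordinal
    cuCtxPopCurrent(&context);  // restore the previous context

    return static_cast<int>(device);
}
```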


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

Previously, benjaminum (Benjamin Ummenhofer) wrote…

The Nms op should get the stream from tensorflow or pytorch. Looks like we did not pass this information from the op wrappers to here.

Could you provide some pointers on how to approach this? I tried

TensorFlow: auto device = context->eigen_gpu_device(); then use device.stream().
PyTorch: auto stream = at::cuda::getCurrentCUDAStream();

but they only seem to be available in .cu files and I get some strange compiler errors instead. Is there some difference in the design of the Nms op compared to the others?
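
For reference, the two framework calls quoted above would be used roughly like this. The header paths are from memory and may differ across framework versions; treat them as assumptions, and note that each snippet must live in a compilation unit that links against the respective framework.

```cuda
#include <cuda_runtime.h>

// PyTorch: compiled in a unit that links against libtorch.
#include <c10/cuda/CUDAStream.h>

cudaStream_t CurrentTorchStream() {
    // c10::cuda::CUDAStream converts implicitly to cudaStream_t.
    return at::cuda::getCurrentCUDAStream();
}

// TensorFlow: inside OpKernel::Compute(OpKernelContext* context):
//
//   auto& device = context->eigen_gpu_device();
//   cudaStream_t stream = device.stream();
```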

@stotko stotko requested review from yxlao and benjaminum July 14, 2021 14:23
@stotko
Contributor Author

stotko commented Jul 14, 2021

Re-requesting reviews as I decided to merge further changes into this PR:

  • Rename CUDADeviceSwitcher to CUDAScopedDevice
  • Add support for stream-based allocation (introduced in CUDA 11.2) and copy functions in CUDAMemoryManager.

Copy link
Contributor

@benjaminum benjaminum left a comment


Reviewable status: 8 of 39 files reviewed, 3 unresolved discussions (waiting on @ssheorey, @stotko, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

Previously, stotko (Patrick Stotko) wrote…

This is a combination of CUDAScopedDevice and CUDAScopedStream where the device information is inferred from the stream via the CUDA driver API. Since we cannot directly put this into the core library, due to linking issues, this class is only exposed in the ML ops. It is currently not used, but may be useful in the future.

Is this combined functionality helpful in the context of the ML ops or are the individual wrappers enough to handle all use cases?

I think the scoped stream is the most important class. It will allow us to use our Tensor library within the ops.
I assume that the current device is already set by tf/torch when an operator is called. I agree that having a way to infer the device could become handy but I am not sure if there are implications with linking the driver api.


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

Previously, stotko (Patrick Stotko) wrote…

Could you provide some pointers on how to approach this? I tried

TensorFlow: auto device = context->eigen_gpu_device(); then use device.stream().
PyTorch: auto stream = at::cuda::getCurrentCUDAStream();

but they only seem to be available in .cu files and I get some strange compiler errors instead. Is there some difference in the design of the Nms op compared to the others?

Yes, we need to split the op up in multiple compilation units similar to the other ops. For tensorflow we already have the right structure but for torch it is missing.

@stotko stotko force-pushed the stotko/multi-stream-support branch from 6c9640f to 58097fb Compare July 14, 2021 17:28
Contributor Author

@stotko stotko left a comment


Reviewable status: 6 of 40 files reviewed, 3 unresolved discussions (waiting on @benjaminum, @ssheorey, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

Previously, benjaminum (Benjamin Ummenhofer) wrote…

I think the scoped stream is the most important class. It will allow us to use our Tensor library within the ops.
I assume that the current device is already set by tf/torch when an operator is called. I agree that having a way to infer the device could become handy but I am not sure if there are implications with linking the driver api.

Yes, as long as this assumption holds, CUDAScopedStream should be enough. Unfortunately, there is no corresponding API in the runtime library to infer the device, so we must use the driver API. If this may lead to any blockers in the future, we can remove CUDAScopedDeviceStream for now and consider adding it at a later point in time if we observe that we need it. Any preference on how we should proceed?


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

Previously, benjaminum (Benjamin Ummenhofer) wrote…

Yes, we need to split the op up in multiple compilation units similar to the other ops. For tensorflow we already have the right structure but for torch it is missing.

Reverted all applications of cuda::GetStream() in the ML ops. I would like to unblock this PR and suggest improving Nms and any other affected op in a separate PR.

Contributor

@benjaminum benjaminum left a comment


Reviewable status: 6 of 40 files reviewed, 3 unresolved discussions (waiting on @ssheorey, @stotko, and @yxlao)


cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):

Previously, stotko (Patrick Stotko) wrote…

Yes, as long as this assumption holds, CUDAScopedStream should be enough. Unfortunately, there is no corresponding API in the runtime library to infer the device, so we must use the driver API. If this may lead to any blockers in the future, we can remove CUDAScopedDeviceStream for now and consider adding it at a later point in time if we observe that we need it. Any preference on how we should proceed?

It is just a header file. I don't mind keeping it.


cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):

Previously, stotko (Patrick Stotko) wrote…

Reverted all applications of cuda::GetStream() in the ML ops. I would like to unblock this PR and suggest improving Nms and any other affected op in a separate PR.

Makes sense.

Member

@ssheorey ssheorey left a comment


Can we add stream support to the NPP image processing operations as well?
eg: Current NPP calls:
https://github.com/intel-isl/Open3D/blob/e0a6201554fa7a8e8c0038181dd68b76929a7203/cpp/open3d/t/geometry/kernel/NPPImage.cpp#L101

NPP>=v10.1 introduces a new API for application managed stream context that we should use. The function calls are now *_Ctx() and take an extra argument that can be added to the NPP_ARGS macro.

The older API was easier to use, with npp{Get,Set}Stream().

OK to do in another PR.
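
A minimal sketch of the application-managed stream-context pattern mentioned above: nppGetStreamContext() is a real NPP call that populates the device fields for the current device, after which the context can be pointed at an application stream. The helper name and the *_Ctx call in the usage comment are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <npp.h>

// Build an NppStreamContext that routes NPP work to `stream`.
// Sketch only; real code should check the NppStatus return values.
NppStreamContext MakeNppContext(cudaStream_t stream) {
    NppStreamContext ctx{};
    nppGetStreamContext(&ctx);  // fill device fields for the current device
    ctx.hStream = stream;       // override with the application's stream
    return ctx;
}

// Usage (hypothetical *_Ctx call):
//   NppStreamContext ctx = MakeNppContext(stream);
//   nppiSomeOperation_8u_C1R_Ctx(/* ...NPP_ARGS... */, ctx);
```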

Reviewed 1 of 8 files at r2, 1 of 25 files at r3, 1 of 1 files at r4, 4 of 9 files at r5, 1 of 4 files at r6.
Reviewable status: 13 of 40 files reviewed, 8 unresolved discussions (waiting on @stotko and @yxlao)


cpp/benchmarks/core/MemoryManager.cpp, line 72 at r6 (raw file):

Synchronize(

nit: static


cpp/open3d/core/CUDAState.cuh, line 114 at r6 (raw file):

class CUDAScopedStream {
public:
    explicit CUDAScopedStream(cudaStream_t stream)

We can simplify usage by making stream optional. If stream is not provided we create a new stream in the ctor. The user doesn't need to track streams for simple code.
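
The suggestion could look roughly like this. A sketch only: the save/restore logic is elided, and the member names and the ownership flag are assumptions, not the PR's actual implementation.

```cuda
#include <cuda_runtime.h>

class CUDAScopedStream {
public:
    // Caller-provided stream: the guard does not own it.
    explicit CUDAScopedStream(cudaStream_t stream)
        : stream_(stream), owns_stream_(false) {
        // ... save previous stream, set `stream_` as current ...
    }

    // No stream provided: create one and take ownership, so simple
    // code never has to track stream handles.
    CUDAScopedStream() : owns_stream_(true) {
        cudaStreamCreate(&stream_);
        // ... save previous stream, set `stream_` as current ...
    }

    ~CUDAScopedStream() {
        // ... restore the previous stream ...
        if (owns_stream_) {
            cudaStreamDestroy(stream_);
        }
    }

private:
    cudaStream_t stream_;
    bool owns_stream_;
};
```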


cpp/open3d/core/CUDAUtils.h, line 103 at r6 (raw file):

#ifdef BUILD_CUDA_MODULE

I would suggest only providing the CUDAScoped{Device,Stream} APIs and not providing these additional Set APIs. Two ways of doing the same thing can cause user confusion. The Get APIs are still useful.


cpp/open3d/core/CUDAUtils.cpp, line 97 at r6 (raw file):

    return device;
}

Consider not providing the Set APIs (see above).


cpp/tests/core/CUDAState.cpp, line 79 at r6 (raw file):

    std::vector<std::thread> threads;

    const int kIterations = 100000;

Can you comment on why we need a large kIterations?

Member

@ssheorey ssheorey left a comment


Also, Python bindings will be useful if a user wants to build their own pipeline in Python with Open3D components.

Reviewable status: 13 of 40 files reviewed, 8 unresolved discussions (waiting on @stotko and @yxlao)

Member

@ssheorey ssheorey left a comment


NPP stream context: https://docs.nvidia.com/cuda/npp/index.html#application_managed_stream_context

Reviewable status: 13 of 40 files reviewed, 8 unresolved discussions (waiting on @stotko and @yxlao)

Collaborator

@yxlao yxlao left a comment


Reviewed 19 of 25 files at r3, 1 of 1 files at r4, 8 of 9 files at r5, 4 of 4 files at r6.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @stotko)

@yxlao
Collaborator

yxlao commented Jul 16, 2021

Merging #3732 for now, and @stotko will address @ssheorey 's remaining comments in a separate PR.

@yxlao yxlao merged commit d21d12d into master Jul 16, 2021
@yxlao yxlao deleted the stotko/multi-stream-support branch July 16, 2021 09:18