Introduce CUDAScopedStream class for Multi-Stream Support #3732
Conversation
Thanks for submitting this pull request! The maintainers of this repository would appreciate it if you could update the CHANGELOG.md based on your changes.
Force-pushed from 6837881 to d7430fa.
Reviewed 12 of 17 files at r1, 8 of 8 files at r2.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @benjaminum, @ssheorey, and @stotko)
cpp/open3d/core/CUDAState.cuh, line 137 at r2 (raw file):
~CUDAScopedStream() { cuda::SetStream(prev_stream_); } CUDAScopedStream(CUDAScopedStream const&) = delete;
nit: const CUDAScopedStream & to be consistent with the rest.
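For context, the RAII pattern under discussion looks roughly like this (a minimal sketch, not the PR's exact code; cuda::GetStream/SetStream are the per-thread accessors introduced in this PR, declared here only so the snippet is self-contained, and the deleted copy operations reflect the nit above):

```cpp
#include <cuda_runtime.h>

// Per-thread stream accessors introduced by this PR (see CUDAUtils.h);
// declared here only to make the sketch self-contained.
namespace cuda {
cudaStream_t GetStream();
void SetStream(cudaStream_t stream);
}  // namespace cuda

class CUDAScopedStream {
public:
    explicit CUDAScopedStream(cudaStream_t stream)
        : prev_stream_(cuda::GetStream()) {
        cuda::SetStream(stream);  // make `stream` the thread's current stream
    }
    ~CUDAScopedStream() { cuda::SetStream(prev_stream_); }  // restore on exit

    // Non-copyable: a copy would restore the previous stream twice.
    CUDAScopedStream(const CUDAScopedStream&) = delete;
    CUDAScopedStream& operator=(const CUDAScopedStream&) = delete;

private:
    cudaStream_t prev_stream_;
};
```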
cpp/open3d/core/CUDAUtils.cpp, line 93 at r2 (raw file):
void Set(cudaStream_t stream) { stream_ = stream; } static cudaStream_t Default() { return (cudaStream_t)0; }
nit: static_cast<cudaStream_t>
cpp/open3d/ml/Helper.h, line 108 at r2 (raw file):
if (err_get_string == CUDA_SUCCESS) { fprintf(stderr, "%s:%d CUDA driver error: %s", file, line,
Possible to convert this to utility::Logxxx()? Previously the ML Ops were independent from libOpen3D, but now libOpen3D is linked into the ML Ops library.
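A hedged sketch of what the suggested conversion could look like, assuming Open3D's fmt-style utility::LogError; the helper name and surrounding shape are illustrative, not this PR's actual code:

```cpp
#include <cuda.h>  // CUDA driver API: CUresult, cuGetErrorString

#include "open3d/utility/Console.h"  // utility::LogError (header path may vary by version)

// Hypothetical helper; mirrors the fprintf-based check quoted above.
inline void CheckCudaDriverError(CUresult err, const char* file, int line) {
    if (err == CUDA_SUCCESS) return;
    const char* error_string = nullptr;
    const CUresult err_get_string = cuGetErrorString(err, &error_string);
    if (err_get_string == CUDA_SUCCESS) {
        // LogError throws after printing, so no manual abort is needed.
        open3d::utility::LogError("{}:{} CUDA driver error: {}", file, line,
                                  error_string);
    } else {
        open3d::utility::LogError("{}:{} CUDA driver error: code {}", file,
                                  line, static_cast<int>(err));
    }
}
```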
Force-pushed from d7430fa to 3fe07f1.
Reviewable status: 8 of 33 files reviewed, 1 unresolved discussion (waiting on @benjaminum, @ssheorey, and @yxlao)
cpp/open3d/core/CUDAState.cuh, line 137 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
nit: const CUDAScopedStream & to be consistent with the rest.
Done.
cpp/open3d/core/CUDAUtils.cpp, line 93 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
nit: static_cast<cudaStream_t>
Done.
cpp/open3d/ml/Helper.h, line 108 at r2 (raw file):
Previously, yxlao (Yixing Lao) wrote…
Possible to convert this to utility::Logxxx()? Previously the ML Ops were independent from libOpen3D, but now libOpen3D is linked into the ML Ops library.
Done. I also applied some additional cleanups in Helper.h, CUDAUtils.h, and CUDAState.cuh.
Force-pushed from 3fe07f1 to 9e434b3.
Reviewable status: 8 of 33 files reviewed, 3 unresolved discussions (waiting on @ssheorey, @stotko, and @yxlao)
cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):
} class CUDAScopedDeviceStream {
Where is this used? Is this the only file where we need the driver API?
cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):
utility::DivUp(n, NMS_BLOCK_SIZE)); dim3 threads(NMS_BLOCK_SIZE); NmsKernel<<<blocks, threads, 0, core::cuda::GetStream()>>>(
The Nms op should get the stream from tensorflow or pytorch. Looks like we did not pass this information from the op wrappers to here.
Force-pushed from 9e434b3 to 6c9640f.
Reviewable status: 8 of 39 files reviewed, 3 unresolved discussions (waiting on @benjaminum, @ssheorey, and @yxlao)
cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):
Previously, benjaminum (Benjamin Ummenhofer) wrote…
Where is this used? Is this the only file where we need the driver API?
This is a combination of CUDAScopedDevice and CUDAScopedStream where the device information is inferred from the stream via the CUDA driver API. Since we cannot directly put this into the core library due to linking issues, this class is only exposed in the ML ops. It is currently not used, but may be useful in the future.
Is this combined functionality helpful in the context of the ML ops or are the individual wrappers enough to handle all use cases?
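For reference, inferring the device from a stream with the driver API looks roughly like this (the helper name is hypothetical; requires CUDA >= 9.2 for cuStreamGetCtx, error handling omitted):

```cpp
#include <cuda.h>          // driver API
#include <cuda_runtime.h>  // cudaStream_t

// Hypothetical helper: query the device owning `stream` via the driver API.
// The runtime API has no equivalent for this lookup.
int DeviceFromStream(cudaStream_t stream) {
    CUcontext ctx;
    cuStreamGetCtx(static_cast<CUstream>(stream), &ctx);  // context that owns the stream
    cuCtxPushCurrent(ctx);  // temporarily make it current ...
    CUdevice device;
    cuCtxGetDevice(&device);  // ... so we can ask for its device
    cuCtxPopCurrent(&ctx);  // restore the previous context
    return static_cast<int>(device);
}
```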
cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):
Previously, benjaminum (Benjamin Ummenhofer) wrote…
The Nms op should get the stream from tensorflow or pytorch. Looks like we did not pass this information from the op wrappers to here.
Could you provide some pointers on how to approach this? I tried
TensorFlow: auto device = context->eigen_gpu_device(); then use device.stream().
PyTorch: auto stream = at::cuda::getCurrentCUDAStream();
but they only seem to be available in .cu files and I get some strange compiler errors instead. Is there some difference in the design of the Nms op compared to the others?
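For reference, the two retrieval patterns with the headers they need; both pull in framework internals, which is why they compile only in framework-specific translation units. The includes are assumptions, not taken from this PR's build setup:

```cpp
// PyTorch: at::cuda::getCurrentCUDAStream() returns a c10::cuda::CUDAStream,
// which converts implicitly to cudaStream_t.
#include <c10/cuda/CUDAStream.h>

cudaStream_t GetTorchStream() { return at::cuda::getCurrentCUDAStream(); }

// TensorFlow (inside OpKernel::Compute(OpKernelContext* context)):
//   #include "tensorflow/core/framework/op_kernel.h"
//   auto& device = context->eigen_gpu_device();
//   cudaStream_t stream = device.stream();
```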
Re-requesting reviews as I decided to merge further changes into this PR.
Reviewable status: 8 of 39 files reviewed, 3 unresolved discussions (waiting on @ssheorey, @stotko, and @yxlao)
cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):
Previously, stotko (Patrick Stotko) wrote…
This is a combination of CUDAScopedDevice and CUDAScopedStream where the device information is inferred from the stream via the CUDA driver API. Since we cannot directly put this into the core library due to linking issues, this class is only exposed in the ML ops. It is currently not used, but may be useful in the future.
Is this combined functionality helpful in the context of the ML ops or are the individual wrappers enough to handle all use cases?
I think the scoped stream is the most important class. It will allow us to use our Tensor library within the ops.
I assume that the current device is already set by tf/torch when an operator is called. I agree that having a way to infer the device could become handy but I am not sure if there are implications with linking the driver api.
cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):
Previously, stotko (Patrick Stotko) wrote…
Could you provide some pointers on how to approach this? I tried
TensorFlow: auto device = context->eigen_gpu_device(); then use device.stream().
PyTorch: auto stream = at::cuda::getCurrentCUDAStream();
but they only seem to be available in .cu files and I get some strange compiler errors instead. Is there some difference in the design of the Nms op compared to the others?
Yes, we need to split the op up into multiple compilation units, similar to the other ops. For TensorFlow we already have the right structure, but for PyTorch it is missing.
Force-pushed from 6c9640f to 58097fb.
Reviewable status: 6 of 40 files reviewed, 3 unresolved discussions (waiting on @benjaminum, @ssheorey, and @yxlao)
cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):
Previously, benjaminum (Benjamin Ummenhofer) wrote…
I think the scoped stream is the most important class. It will allow us to use our Tensor library within the ops.
I assume that the current device is already set by tf/torch when an operator is called. I agree that having a way to infer the device could become handy but I am not sure if there are implications with linking the driver api.
Yes, as long as this assumption holds, CUDAScopedStream should be enough. Unfortunately, there is no corresponding API in the runtime library to infer the device, so we must use the driver API. If this leads to any blockers in the future, we can remove CUDAScopedDeviceStream for now and consider adding it at a later point in time if we observe that we need it. Any preference on how we should proceed?
cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):
Previously, benjaminum (Benjamin Ummenhofer) wrote…
Yes, we need to split the op up into multiple compilation units, similar to the other ops. For TensorFlow we already have the right structure, but for PyTorch it is missing.
Reverted all applications of cuda::GetStream() in the ML ops. I would like to unblock this PR and suggest improving Nms and any other affected op in a separate PR.
Reviewable status: 6 of 40 files reviewed, 3 unresolved discussions (waiting on @ssheorey, @stotko, and @yxlao)
cpp/open3d/ml/Helper.h, line 101 at r4 (raw file):
Previously, stotko (Patrick Stotko) wrote…
Yes, as long as this assumption holds, CUDAScopedStream should be enough. Unfortunately, there is no corresponding API in the runtime library to infer the device, so we must use the driver API. If this leads to any blockers in the future, we can remove CUDAScopedDeviceStream for now and consider adding it at a later point in time if we observe that we need it. Any preference on how we should proceed?
It is just a header file. I don't mind keeping it.
cpp/open3d/ml/contrib/Nms.cu, line 168 at r4 (raw file):
Previously, stotko (Patrick Stotko) wrote…
Reverted all applications of cuda::GetStream() in the ML ops. I would like to unblock this PR and suggest improving Nms and any other affected op in a separate PR.
Makes sense.
Can we add stream support to the NPP image processing operations as well?
e.g., current NPP calls:
https://github.com/intel-isl/Open3D/blob/e0a6201554fa7a8e8c0038181dd68b76929a7203/cpp/open3d/t/geometry/kernel/NPPImage.cpp#L101
NPP >= v10.1 introduces a new API for an application-managed stream context that we should use. The function calls are now *_Ctx() and take an extra argument that can be added to the NPP_ARGS macro. The older API was easier to use, with npp{Get,Set}Stream().
OK to do in another PR.
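For reference, a hedged sketch of the application-managed stream context: the NppStreamContext fields are filled from the CUDA runtime, and nppiFilterGauss_8u_C1R_Ctx in the comment stands in for any *_Ctx() call; the helper name is hypothetical:

```cpp
#include <cuda_runtime.h>
#include <nppi.h>

// Build an NppStreamContext for `stream`; remaining fields (shared memory
// size, compute capability, ...) are filled analogously via
// cudaDeviceGetAttribute.
NppStreamContext MakeNppStreamContext(cudaStream_t stream) {
    NppStreamContext ctx{};
    ctx.hStream = stream;
    cudaGetDevice(&ctx.nCudaDeviceId);
    cudaDeviceGetAttribute(&ctx.nMultiProcessorCount,
                           cudaDevAttrMultiProcessorCount, ctx.nCudaDeviceId);
    cudaDeviceGetAttribute(&ctx.nMaxThreadsPerMultiProcessor,
                           cudaDevAttrMaxThreadsPerMultiProcessor,
                           ctx.nCudaDeviceId);
    cudaDeviceGetAttribute(&ctx.nMaxThreadsPerBlock,
                           cudaDevAttrMaxThreadsPerBlock, ctx.nCudaDeviceId);
    cudaStreamGetFlags(ctx.hStream, &ctx.nStreamFlags);
    return ctx;
}

// The *_Ctx() variants then take the context as an extra trailing argument, e.g.
//   nppiFilterGauss_8u_C1R_Ctx(src, src_step, dst, dst_step, roi,
//                              NPP_MASK_SIZE_3_X_3, ctx);
```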
Reviewed 1 of 8 files at r2, 1 of 25 files at r3, 1 of 1 files at r4, 4 of 9 files at r5, 1 of 4 files at r6.
Reviewable status: 13 of 40 files reviewed, 8 unresolved discussions (waiting on @stotko and @yxlao)
cpp/benchmarks/core/MemoryManager.cpp, line 72 at r6 (raw file):
Synchronize(
nit: static
cpp/open3d/core/CUDAState.cuh, line 114 at r6 (raw file):
class CUDAScopedStream { public: explicit CUDAScopedStream(cudaStream_t stream)
We can simplify usage by making stream optional. If stream is not provided, we create a new stream in the ctor. The user doesn't need to track streams for simple code.
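A sketch of that suggestion, extending the scoped-stream class sketched earlier in the thread; the owns_stream_ flag is an assumed detail to distinguish owned from borrowed streams:

```cpp
// Default ctor: create and own a fresh stream so callers needn't manage one.
CUDAScopedStream() : prev_stream_(cuda::GetStream()), owns_stream_(true) {
    cudaStreamCreate(&stream_);
    cuda::SetStream(stream_);
}

// Existing ctor: borrow a caller-provided stream.
explicit CUDAScopedStream(cudaStream_t stream)
    : stream_(stream), prev_stream_(cuda::GetStream()), owns_stream_(false) {
    cuda::SetStream(stream_);
}

~CUDAScopedStream() {
    cuda::SetStream(prev_stream_);                 // restore previous stream
    if (owns_stream_) cudaStreamDestroy(stream_);  // only destroy what we created
}
```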
cpp/open3d/core/CUDAUtils.h, line 103 at r6 (raw file):
#ifdef BUILD_CUDA_MODULE
I would suggest only providing the CUDAScoped{Device,Stream} APIs and not providing these additional Set APIs. Two ways of doing the same thing can cause user confusion. Get APIs are still useful.
cpp/open3d/core/CUDAUtils.cpp, line 97 at r6 (raw file):
return device; }
Consider not providing the Set APIs (see above).
cpp/tests/core/CUDAState.cpp, line 79 at r6 (raw file):
std::vector<std::thread> threads; const int kIterations = 100000;
Can you comment on why we need a large kIterations?
Also, Python bindings will be useful if a user wants to build their own pipeline in Python with Open3D components.
Reviewable status: 13 of 40 files reviewed, 8 unresolved discussions (waiting on @stotko and @yxlao)
NPP stream context: https://docs.nvidia.com/cuda/npp/index.html#application_managed_stream_context
Reviewable status: 13 of 40 files reviewed, 8 unresolved discussions (waiting on @stotko and @yxlao)
Reviewed 19 of 25 files at r3, 1 of 1 files at r4, 8 of 9 files at r5, 4 of 4 files at r6.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @stotko)
Highlights:
- Rename CUDADeviceSwitcher to CUDAScopedDevice and remove the SwitchTo function to simplify its usage.
- New CUDAScopedStream class for setting the global internal per-thread stream state.
- Multi-stream support in CUDAMemoryManager.

Further changes:
- New cuda::{Get,Set}Stream functions to access the global state. These functions are used to implement CUDAScopedStream.
- New cuda::{Get,Set}Device to mirror the stream API. This makes CUDAScopedDevice's implementation look similar to CUDAScopedStream. It further allows hiding the implementation detail that CUDA is already capable of managing the device state internally.
- New CUDAScopedDeviceStream in ML ops as a combination of both scoped wrappers. It infers the device from the stream and sets both states simultaneously.

Future work:
- Use the framework streams in the ML ops (e.g., Nms) via CUDAScopedStream; reverted here and deferred to a separate PR.
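To illustrate how the pieces above fit together, a hypothetical usage sketch; the constructor arguments, namespaces, and header are assumptions, not taken from the PR:

```cpp
#include <cuda_runtime.h>

#include "open3d/core/CUDAUtils.h"  // assumed home of the scoped wrappers

void RunOnStream() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    {
        // Assumed ctor signatures: select GPU 0 and route subsequent Open3D
        // CUDA work on this thread through `stream`.
        open3d::core::CUDAScopedDevice scoped_device(0);
        open3d::core::CUDAScopedStream scoped_stream(stream);
        // ... launch Open3D tensor ops here; they pick up `stream` via
        // cuda::GetStream() ...
    }  // previous device/stream state restored here
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```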