Default to per-thread stream semantics #395

Merged: maleadt merged 6 commits on Aug 28, 2020
Conversation

maleadt (Member) commented Aug 26, 2020

This PR changes two threading- and stream-related things:

  • call the PTDS/PTSZ variants of CUDA driver functions, which automatically select a per-thread stream when you (implicitly) requested the default one (handle 0x0); see the sketch below this list
  • for libraries, explicitly use the special per-thread stream (handle 0x2)
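
To make the first bullet concrete, here is a rough sketch of the dispatch. memcpy_async is a hypothetical wrapper for illustration only (the PR's real wrappers are generated by res/wrap/wrap.jl), but cuMemcpyAsync_ptds is an actual symbol exported by libcuda:

# `memcpy_async` is a hypothetical name; the _ptds driver variant
# interprets stream handle 0x0 as the calling thread's default stream
# instead of the legacy global one.
const CUdeviceptr = UInt64       # driver handle for device memory
const CUstream    = Ptr{Cvoid}   # driver handle for a stream

function memcpy_async(dst::CUdeviceptr, src::CUdeviceptr, nbytes::Integer,
                      stream::CUstream=CUstream(0))   # 0x0: implicit default
    status = ccall((:cuMemcpyAsync_ptds, "libcuda"), Cint,
                   (CUdeviceptr, CUdeviceptr, Csize_t, CUstream),
                   dst, src, nbytes, stream)
    status == 0 || error("cuMemcpyAsync_ptds failed with status $status")
    return
end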

As a result, we get the new default stream semantics, where operations on streams are no longer synchronized by the global default stream:

using CUDA

N = 1 << 20

function kernel(x, n)
    tid = threadIdx().x + (blockIdx().x-1) * blockDim().x
    for i = tid:blockDim().x*gridDim().x:n
        x[i] = CUDA.sqrt(CUDA.pow(3.14159f0, i))
    end
    return
end

num_streams = 8

for i in 1:num_streams
    stream = CuStream()

    data = CuArray{Float32}(undef, N)

    @cuda blocks=1 threads=64 stream=stream kernel(data, N)

    @cuda kernel(data, 0)  # dummy launch on the default stream
end

Before: (profiler timeline: multistream_before)

After: (profiler timeline: multistream_after)

Another, maybe more important consequence is that we automatically get concurrent execution on different threads:

using CUDA

N = 1 << 20

function kernel(x, n)
    tid = threadIdx().x + (blockIdx().x-1) * blockDim().x
    for i = tid:blockDim().x*gridDim().x:n
        x[i] = CUDA.sqrt(CUDA.pow(3.14159f0, i))
    end
    return
end

Threads.@threads for i in 1:Threads.nthreads()
    data = CuArray{Float32}(undef, N)
    @cuda blocks=1 threads=64 kernel(data, N)
    synchronize(CuDefaultStream())
end

Before: (profiler timeline: multithread_before)

After: (profiler timeline: multithread_after)

Thanks to @marius311 for making me read https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/ again (this time implementing its features) :-)

cc @vchuravy @kshyatt

maleadt added the "cuda libraries" and "performance" labels Aug 26, 2020
maleadt changed the title from "Call API functions that use a per-thread default stream." to "Default to per-thread stream semantics" Aug 26, 2020
maleadt (Member, Author) commented Aug 26, 2020

The current implementation isn't configurable, and defaults to these semantics. Thoughts (especially @vchuravy)? With KA using explicit streams and not relying on the default stream to synchronize the world (note that device-wide synchronization still works like that), I don't think there should be any issues.

codecov bot commented Aug 26, 2020

Codecov Report

Merging #395 into master will decrease coverage by 0.12%.
The diff coverage is 15.38%.


@@            Coverage Diff             @@
##           master     #395      +/-   ##
==========================================
- Coverage   78.99%   78.86%   -0.13%     
==========================================
  Files         165      165              
  Lines        8731     8749      +18     
==========================================
+ Hits         6897     6900       +3     
- Misses       1834     1849      +15     
Impacted Files Coverage Δ
lib/cudadrv/context.jl 87.71% <0.00%> (-3.19%) ⬇️
lib/cudnn/CUDNN.jl 51.61% <ø> (ø)
lib/cusolver/CUSOLVER.jl 86.20% <ø> (ø)
lib/cusparse/CUSPARSE.jl 88.23% <ø> (ø)
lib/cutensor/wrappers.jl 93.65% <ø> (ø)
res/wrap/wrap.jl 0.00% <0.00%> (ø)
test/cufft.jl 90.22% <ø> (ø)
lib/cudadrv/stream.jl 87.87% <50.00%> (-2.45%) ⬇️
lib/cublas/CUBLAS.jl 82.97% <100.00%> (+0.37%) ⬆️
lib/cufft/fft.jl 89.95% <100.00%> (ø)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

vchuravy (Member) commented:
I think from a KA perspective this is fine, but we do currently synchronize against CuDefaultStream() to synchronize with broadcast code:

target = CUDADevice()
kernel! = mykernel!(target)

event = Event(target)                     # sync from above
event = kernel!(..., dependencies=event)

wait(target, event)                       # sync from below

where Event(CUDADevice) creates an event on the CuDefaultStream, and wait(CUDADevice, event) places a synchronization edge on the CuDefaultStream. So we would need a different way of doing that synchronization; probably we would synchronize with the per-thread stream.

How do normal memory copies behave? Do they still act as a global sync?

maleadt (Member, Author) commented Aug 26, 2020

So we would need a different way of doing that synchronization, probably we would synchronize with the per thread stream.

I don't think so, as cuStreamWaitEvent has a PTDS variant, so calling it on the default stream will synchronize the per-thread one.
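
For reference, a minimal driver-level sketch of that variant. stream_wait_event is a hypothetical wrapper name; cuStreamWaitEvent_ptds is the real libcuda symbol:

# Makes `stream` wait for `event`; with the _ptds variant, a stream
# handle of 0x0 resolves to the calling thread's default stream.
const CUstream = Ptr{Cvoid}
const CUevent  = Ptr{Cvoid}

function stream_wait_event(stream::CUstream, event::CUevent)
    status = ccall((:cuStreamWaitEvent_ptds, "libcuda"), Cint,
                   (CUstream, CUevent, Cuint),
                   stream, event, 0)   # flags: must be 0
    status == 0 || error("cuStreamWaitEvent_ptds failed with status $status")
    return
end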

How do normal memory copies behave? Do they still act as a global sync?

No, they only sync the thread-local default stream (which is a major part of the benefit here, since we can load data on a thread/task without disturbing others).
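
A hedged illustration of that, assuming the semantics from this PR (the loop body mirrors the multi-threading example above):

using CUDA

# Each thread uploads and downloads on its own default stream, so the
# copies no longer serialize against work submitted by other threads.
Threads.@threads for i in 1:Threads.nthreads()
    host = rand(Float32, 1 << 20)
    dev  = CuArray(host)               # upload on this thread's default stream
    copyto!(host, dev)                 # download; only syncs this thread's stream
    synchronize(CuDefaultStream())     # resolves to the per-thread stream
end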

maleadt (Member, Author) commented Aug 26, 2020

@jonathan-laurent This is probably also an interesting change for AlphaZero.jl (as a heavy user of GPU multi-threading/tasking), maybe you could give it a go to see if it improves performance and/or otherwise affects the programming model?

vchuravy (Member) commented:
@ali-ramadhan and @glwagner, if you have time it would be good to try Oceananigans on this, since the last time we messed up synchronization the tests showed it relatively quickly.

jonathan-laurent commented:

This is probably also an interesting change for AlphaZero.jl (as a heavy user of GPU multi-threading/tasking), maybe you could give it a go to see if it improves performance and/or otherwise affects the programming model?

@maleadt This looks nice.
To be clear, though, only a single thread uses the GPU at any given time in AlphaZero.jl: an inference server runs on a separate task and receives requests from other tasks. Is there still a chance that it can benefit from this PR?

maleadt (Member, Author) commented Aug 28, 2020

Is there still a chance that it can benefit from this PR?

No, this only has an impact when multiple threads perform GPU computations.
