Default to per-thread stream semantics #395

Merged: maleadt merged 6 commits on Aug 28, 2020
Conversation

maleadt (Member) commented Aug 26, 2020

This PR changes two threading- and stream-related things:

  • call the PTDS/PTSZ variants of CUDA driver functions, which automatically select a per-thread stream when you (implicitly) requested the default one (handle 0x0); see the sketch below this list
  • for libraries, explicitly use the special per-thread stream (handle 0x2)
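
To make the first bullet concrete, here is a rough sketch of the dispatch. memcpy_async is a hypothetical wrapper for illustration only (the PR's real wrappers are generated by res/wrap/wrap.jl), but cuMemcpyAsync_ptds is an actual symbol exported by libcuda:

# `memcpy_async` is a hypothetical name; the _ptds driver variant
# interprets stream handle 0x0 as the calling thread's default stream
# instead of the legacy global one.
const CUdeviceptr = UInt64       # driver handle for device memory
const CUstream    = Ptr{Cvoid}   # driver handle for a stream

function memcpy_async(dst::CUdeviceptr, src::CUdeviceptr, nbytes::Integer,
                      stream::CUstream=CUstream(0))   # 0x0: implicit default
    status = ccall((:cuMemcpyAsync_ptds, "libcuda"), Cint,
                   (CUdeviceptr, CUdeviceptr, Csize_t, CUstream),
                   dst, src, nbytes, stream)
    status == 0 || error("cuMemcpyAsync_ptds failed with status $status")
    return
end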

As a result, we get the new default stream semantics, where operations on streams are no longer synchronized by the global default stream:

using CUDA

N = 1 << 20

function kernel(x, n)
    tid = threadIdx().x + (blockIdx().x-1) * blockDim().x
    for i = tid:blockDim().x*gridDim().x:n
        x[i] = CUDA.sqrt(CUDA.pow(3.14159f0, i))
    end
    return
end

num_streams = 8

for i in 1:num_streams
    stream = CuStream()

    data = CuArray{Float32}(undef, N)

    @cuda blocks=1 threads=64 stream=stream kernel(data, N)

    @cuda kernel(data, 0)  # dummy launch on the default stream
end

Before: (profiler timeline: multistream_before)

After: (profiler timeline: multistream_after)

Another, maybe more important consequence is that we automatically get concurrent execution on different threads:

using CUDA

N = 1 << 20

function kernel(x, n)
    tid = threadIdx().x + (blockIdx().x-1) * blockDim().x
    for i = tid:blockDim().x*gridDim().x:n
        x[i] = CUDA.sqrt(CUDA.pow(3.14159f0, i))
    end
    return
end

Threads.@threads for i in 1:Threads.nthreads()
    data = CuArray{Float32}(undef, N)
    @cuda blocks=1 threads=64 kernel(data, N)
    synchronize(CuDefaultStream())
end

Before: (profiler timeline: multithread_before)

After: (profiler timeline: multithread_after)

Thanks to @marius311 for making me read https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/ again (this time implementing its features) :-)

cc @vchuravy @kshyatt

maleadt added the "cuda libraries" and "performance" labels Aug 26, 2020
maleadt changed the title from "Call API functions that use a per-thread default stream." to "Default to per-thread stream semantics" Aug 26, 2020
maleadt (Member, Author) commented Aug 26, 2020

The current implementation isn't configurable, and defaults to these semantics. Thoughts (especially @vchuravy)? With KA using explicit streams and not relying on the default stream to synchronize the world (note that device-wide synchronization still works like that), I don't think there should be any issues.

codecov bot commented Aug 26, 2020

Codecov Report

Merging #395 into master will decrease coverage by 0.12%.
The diff coverage is 15.38%.


@@            Coverage Diff             @@
##           master     #395      +/-   ##
==========================================
- Coverage   78.99%   78.86%   -0.13%     
==========================================
  Files         165      165              
  Lines        8731     8749      +18     
==========================================
+ Hits         6897     6900       +3     
- Misses       1834     1849      +15     
Impacted Files Coverage Δ
lib/cudadrv/context.jl 87.71% <0.00%> (-3.19%) ⬇️
lib/cudnn/CUDNN.jl 51.61% <ø> (ø)
lib/cusolver/CUSOLVER.jl 86.20% <ø> (ø)
lib/cusparse/CUSPARSE.jl 88.23% <ø> (ø)
lib/cutensor/wrappers.jl 93.65% <ø> (ø)
res/wrap/wrap.jl 0.00% <0.00%> (ø)
test/cufft.jl 90.22% <ø> (ø)
lib/cudadrv/stream.jl 87.87% <50.00%> (-2.45%) ⬇️
lib/cublas/CUBLAS.jl 82.97% <100.00%> (+0.37%) ⬆️
lib/cufft/fft.jl 89.95% <100.00%> (ø)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

vchuravy (Member) commented:
I think from a KA perspective this is fine, but we do currently synchronize against CuDefaultStream() to synchronize with broadcast code:

target = CUDADevice()
kernel! = mykernel!(target)

event = Event(target)                     # sync from above
event = kernel!(..., dependencies=event)

wait(target, event)                       # sync from below

where Event(CUDADevice) creates an event on the CuDefaultStream, and wait(CUDADevice, event) places a synchronization edge on the CuDefaultStream. So we would need a different way of doing that synchronization; probably we would synchronize with the per-thread stream.

How do normal memory copies behave? Do they still act as a global sync?

maleadt (Member, Author) commented Aug 26, 2020

So we would need a different way of doing that synchronization, probably we would synchronize with the per thread stream.

I don't think so, as cuStreamWaitEvent has a PTDS variant, so calling it on the default stream will synchronize the per-thread one.
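
For reference, a minimal driver-level sketch of that variant. stream_wait_event is a hypothetical wrapper name; cuStreamWaitEvent_ptds is the real libcuda symbol:

# Makes `stream` wait for `event`; with the _ptds variant, a stream
# handle of 0x0 resolves to the calling thread's default stream.
const CUstream = Ptr{Cvoid}
const CUevent  = Ptr{Cvoid}

function stream_wait_event(stream::CUstream, event::CUevent)
    status = ccall((:cuStreamWaitEvent_ptds, "libcuda"), Cint,
                   (CUstream, CUevent, Cuint),
                   stream, event, 0)   # flags: must be 0
    status == 0 || error("cuStreamWaitEvent_ptds failed with status $status")
    return
end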

How do normal memory copies behave? Do they still act as a global sync?

No, they only sync the thread-local default stream (which is a major part of the benefit here, since we can load data on a thread/task without disturbing others).
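
A hedged illustration of that, assuming the semantics from this PR (the loop body mirrors the multi-threading example above):

using CUDA

# Each thread uploads and downloads on its own default stream, so the
# copies no longer serialize against work submitted by other threads.
Threads.@threads for i in 1:Threads.nthreads()
    host = rand(Float32, 1 << 20)
    dev  = CuArray(host)               # upload on this thread's default stream
    copyto!(host, dev)                 # download; only syncs this thread's stream
    synchronize(CuDefaultStream())     # resolves to the per-thread stream
end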

maleadt (Member, Author) commented Aug 26, 2020

@jonathan-laurent This is probably also an interesting change for AlphaZero.jl (as a heavy user of GPU multi-threading/tasking), maybe you could give it a go to see if it improves performance and/or otherwise affects the programming model?

vchuravy (Member) commented:
@ali-ramadhan and @glwagner, if you have time it would be good to try Oceananigans on this, since the last time we messed up synchronization the tests showed it relatively quickly.

jonathan-laurent commented:

This is probably also an interesting change for AlphaZero.jl (as a heavy user of GPU multi-threading/tasking), maybe you could give it a go to see if it improves performance and/or otherwise affects the programming model?

@maleadt This looks nice.
To be clear, though, only a single thread uses the GPU at any given time in AlphaZero.jl: an inference server runs on a separate task and receives requests from other tasks. Is there still a chance that it can benefit from this PR?

maleadt (Member, Author) commented Aug 28, 2020

Is there still a chance that it can benefit from this PR?

No, this only has an impact when multiple threads perform GPU computations.
