Bitonic sort exceeds launch resources #2331

Closed
maleadt opened this issue Apr 16, 2024 · 3 comments · Fixed by #2353
Labels
bug Something isn't working cuda array Stuff about CuArray.

Comments

@maleadt
Member

maleadt commented Apr 16, 2024

CI has recently been showing lots of sorting-related errors, which I presume can be traced back to #2308. With #2329, we can see what's up:

  Expression: check_sort!(Int32, (2, 2, 50000); dims = 3, rev = true)
  Number of threads per block exceeds kernel limit (1024 > 896).
  Stacktrace:
    [1] error(s::String)
      @ Base ./error.jl:35
    [2] diagnose_launch_failure(f::CuFunction, err::CuError; blockdim::CuDim3, threaddim::CuDim3, shmem::Int64)
      @ CUDA /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/lib/cudadrv/execution.jl:120
    [3] launch(::CuFunction, ::CUDA.KernelState, ::CuDeviceArray{Int32, 3, 1}, ::Int32, ::Int32, ::Int32, ::Int32; blocks::Tuple{Int64, Int64, Int64}, threads::Tuple{Int64, Int64}, cooperative::Bool, shmem::Int64, stream::CuStream)
      @ CUDA /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/lib/cudadrv/execution.jl:73
    [4] launch
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/lib/cudadrv/execution.jl:52 [inlined]
    [5] #936
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/lib/cudadrv/execution.jl:189 [inlined]
    [6] macro expansion
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/lib/cudadrv/execution.jl:149 [inlined]
    [7] macro expansion
      @ ./none:0 [inlined]
    [8] convert_arguments
      @ ./none:0 [inlined]
    [9] #cudacall#935
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/lib/cudadrv/execution.jl:191 [inlined]
   [10] cudacall
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/lib/cudadrv/execution.jl:187 [inlined]
   [11] macro expansion
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/src/compiler/execution.jl:266 [inlined]
   [12] macro expansion
      @ ./none:0 [inlined]
   [13] call(::CUDA.HostKernel{typeof(CUDA.BitonicSortImpl.comparator_small_kernel), Tuple{CuDeviceArray{Int32, 3, 1}, Int32, Int32, Int32, Int32, typeof(identity), typeof(isless), Val{true}, Val{3}}}, ::CuDeviceArray{Int32, 3, 1}, ::Int32, ::Int32, ::Int32, ::Int32, ::typeof(identity), ::typeof(isless), ::Val{true}, ::Val{3}; call_kwargs::@Kwargs{threads::Tuple{Int64, Int64}, blocks::Tuple{Int64, Int64, Int64}, shmem::Int64})
      @ CUDA ./none:0
   [14] call
      @ ./none:0 [inlined]
   [15] (::CUDA.HostKernel{typeof(CUDA.BitonicSortImpl.comparator_small_kernel), Tuple{CuDeviceArray{Int32, 3, 1}, Int32, Int32, Int32, Int32, typeof(identity), typeof(isless), Val{true}, Val{3}}})(::CuArray{Int32, 3, CUDA.Mem.DeviceBuffer}, ::Int32, ::Int32, ::Int32, ::Int32, ::Function, ::Function, ::Val{true}, ::Val{3}; threads::Tuple{Int64, Int64}, blocks::Tuple{Int64, Int64, Int64}, kwargs::@Kwargs{shmem::Int64})
      @ CUDA /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/src/compiler/execution.jl:389
   [16] bitonic_sort!(c::CuArray{Int32, 3, CUDA.Mem.DeviceBuffer}; by::Function, lt::Function, rev::Bool, dims::Int64)
      @ CUDA.BitonicSortImpl /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/src/sorting.jl:948
   [17] bitonic_sort!
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/src/sorting.jl:902 [inlined]
   [18] #sort!#1224
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/src/sorting.jl:988 [inlined]
   [19] sort!
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/src/sorting.jl:987 [inlined]
   [20] #sort!#1225
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/src/sorting.jl:992 [inlined]
   [21] check_sort!(T::Type, N::Tuple{Int64, Int64, Int64}, f::Function; kwargs::@Kwargs{dims::Int64, rev::Bool})
      @ Main /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/test/base/sorting.jl:198
   [22] check_sort!
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/test/base/sorting.jl:196 [inlined]
   [23] macro expansion
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/test/base/sorting.jl:313 [inlined]
   [24] macro expansion
      @ ~/.cache/julia-buildkite-plugin/julia_installs/bin/linux/x64/1.10/julia-1.10-latest-linux-x86_64/share/julia/stdlib/v1.10/Test/src/Test.jl:669 [inlined]
   [25] macro expansion
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/test/base/sorting.jl:313 [inlined]
   [26] macro expansion
      @ ~/.cache/julia-buildkite-plugin/julia_installs/bin/linux/x64/1.10/julia-1.10-latest-linux-x86_64/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [27] macro expansion
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/test/base/sorting.jl:283 [inlined]
   [28] macro expansion
      @ ~/.cache/julia-buildkite-plugin/julia_installs/bin/linux/x64/1.10/julia-1.10-latest-linux-x86_64/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [29] top-level scope
      @ /var/lib/buildkite-agent/builds/gpuci-15/julialang/cuda-dot-jl/test/base/sorting.jl:281
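
The message compares the total number of threads in the requested block with a per-kernel limit, which can be lower than the usual device maximum of 1024 (commonly because of register usage). A minimal plain-Julia sketch of that comparison, runnable without a GPU, is below. `exceeds_kernel_limit` is a hypothetical helper, not a CUDA.jl function, and the `(256, 4)` block shape is an assumed example; only the numbers 1024 and 896 come from the error above.

```julia
# Hypothetical helper (not part of CUDA.jl): reproduces the comparison
# reported in the error message, using the product of the block dimensions.
function exceeds_kernel_limit(threads::Tuple{Vararg{Int}}, kernel_limit::Int)
    return prod(threads) > kernel_limit
end

kernel_limit = 896        # per-kernel limit quoted in the error message
block_shape = (256, 4)    # assumed 2-D block shape totalling 1024 threads

println(prod(block_shape))                                # 1024
println(exceeds_kernel_limit(block_shape, kernel_limit))  # true
```

Any 2-D shape whose dimensions multiply to 1024 fails the same way, which is consistent with the launch passing `threads::Tuple{Int64, Int64}` in frame [3].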
@maleadt maleadt added the bug Something isn't working label Apr 16, 2024
@maleadt maleadt changed the title sort exceeds launch resources Bitonic sort exceeds launch resources Apr 16, 2024
@maleadt maleadt added the cuda array Stuff about CuArray. label Apr 16, 2024
@maleadt
Member Author

maleadt commented Apr 16, 2024

The current implementation seems to mix up the two kernels' launch configurations: `comparator_small_kernel` (kernel1) is launched with a block size derived from kernel2's thread count. However, naively fixing that introduces test failures:

diff --git a/src/sorting.jl b/src/sorting.jl
index 7dd563831..70cd72e29 100644
--- a/src/sorting.jl
+++ b/src/sorting.jl
@@ -908,10 +908,12 @@ function bitonic_sort!(c; by = identity, lt = isless, rev = false, dims=1)

     # compile kernels (using Int32 for indexing, if possible, yielding a 70% speedup)
     I = c_len <= typemax(Int32) ? Int32 : Int
+
     args1 = (c, I(c_len), one(I), one(I), one(I), by, lt, Val(rev), Val(dims))
     kernel1 = @cuda launch=false comparator_small_kernel(args1...)
-
     config1 = launch_configuration(kernel1.fun, shmem = threads -> bitonic_shmem(c, threads))
+    threads1 = config1.threads
+
     args2 = (c, I(c_len), one(I), one(I), by, lt, Val(rev), Val(dims))
     kernel2 = @cuda launch=false comparator_kernel(args2...)
     config2 = launch_configuration(kernel2.fun, shmem = threads -> bitonic_shmem(c, threads))
@@ -940,11 +942,11 @@ function bitonic_sort!(c; by = identity, lt = isless, rev = false, dims=1)
                 pseudo_block_length = 1 << abs(j_final + 1 - j)
                 # N_pseudo_blocks = how many pseudo-blocks are in this layer of the network
                 N_pseudo_blocks = nextpow(2, c_len) ÷ pseudo_block_length
-                pseudo_blocks_per_block = threads2 ÷ pseudo_block_length
+                pseudo_blocks_per_block = threads1 ÷ pseudo_block_length

                 # grid dimensions
                 N_blocks = max(1, N_pseudo_blocks ÷ pseudo_blocks_per_block), other_block_dims...
-                block_size = pseudo_block_length, threads2 ÷ pseudo_block_length
+                block_size = pseudo_block_length, threads1 ÷ pseudo_block_length
                 kernel1(args1...; blocks=N_blocks, threads=block_size,
                         shmem=bitonic_shmem(c, block_size))
                 break

@xaellison Since you most recently looked at this code, do you have the time to take a look?
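
One possible reason the naive change misbehaves, not confirmed anywhere in this thread, is that 896 is not a power of two, so the integer divisions in the grid computation no longer divide evenly. The sketch below replays that arithmetic in plain Julia. `grid_coverage` is a hypothetical helper; 896 and 50000 come from the error report, while the `pseudo_block_length` values are assumed examples.

```julia
# Hypothetical helper: replays the grid arithmetic from the diff above
# for a given thread count and reports how many pseudo-blocks get covered.
function grid_coverage(threads::Int, c_len::Int, pseudo_block_length::Int)
    needed = nextpow(2, c_len) ÷ pseudo_block_length   # N_pseudo_blocks
    per_block = threads ÷ pseudo_block_length          # pseudo_blocks_per_block
    n_blocks = max(1, needed ÷ per_block)              # first grid dimension
    covered = n_blocks * per_block
    return (per_block = per_block, n_blocks = n_blocks,
            covered = covered, needed = needed)
end

# Assumed pseudo_block_length values; 896 vs. 1024 threads per block.
for pbl in (128, 256, 512)
    println(pbl, " => ", grid_coverage(896, 50_000, pbl),
            " vs ", grid_coverage(1024, 50_000, pbl))
end
```

With 1024 threads every case covers exactly the pseudo-blocks it needs. With 896 threads and a `pseudo_block_length` of 256, only 255 of 256 pseudo-blocks are covered, so part of the array would be left unprocessed under these assumptions.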

@maleadt
Member Author

maleadt commented Apr 23, 2024

In #2338 I padded the kernel1 block size to a power of 2, but that seems to lead to pseudo_blocks_per_block = 0.

@xaellison This is breaking lots of CI jobs. Could you give it a quick look?
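
The zero can be reproduced with integer division alone. The sketch below assumes that "padding" means rounding the 896-thread limit down with `prevpow` (rounding up would exceed the limit again), and that the layer selection still allows a `pseudo_block_length` of 1024. Neither detail is stated in the thread, and `pseudo_blocks_per_block` is written here as a hypothetical standalone function.

```julia
# Hypothetical standalone version of the expression from the diff above.
pseudo_blocks_per_block(threads::Int, pseudo_block_length::Int) =
    threads ÷ pseudo_block_length

threads1 = prevpow(2, 896)     # 512, assuming the padding rounds down
pseudo_block_length = 1024     # assumed: still sized for a 1024-thread block

per_block = pseudo_blocks_per_block(threads1, pseudo_block_length)
println(per_block)             # 0

# Dividing the pseudo-block count by zero then raises a DivideError.
failed = try
    64 ÷ per_block
    false
catch err
    err isa DivideError
end
println(failed)                # true
```

Under these assumptions, the quantity is zero whenever `pseudo_block_length` exceeds the thread count, so the kernel-selection condition and the block size need to be derived from the same limit.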

@xaellison
Contributor

fix #2353

@maleadt maleadt linked a pull request Apr 27, 2024 that will close this issue