
Fix incorrect timing results for CUDA.@elapsed #2118

Merged 1 commit into JuliaGPU:master on Oct 23, 2023

Conversation

@thomasfaingnaert (Member)

No description provided.

@vchuravy (Member)

Could you also add a test?

@thomasfaingnaert (Member, Author)

Added a test that checks that a naive matmul's runtime scales (approximately) cubically. This might be a bit flaky, though, so I'm open to suggestions for other ways to test this.
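A scaling check along those lines might look like the following sketch. It assumes the `simple_matmul` kernel from the example further down this thread is in scope, and is illustrative only; the test actually added in the PR may differ:

```julia
using CUDA, Test

A = CuArray(rand(Float32, 1000, 1000))
B = CuArray(rand(Float32, 1000, 1000))
C = similar(A)

# warmup, so compilation time is not measured
@cuda simple_matmul(C, A, B, 500)

t1 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 500)
t2 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 1000)

# Doubling N should increase the runtime roughly 8x. With the bug, both
# timings are a few microseconds and the ratio stays close to 1, so even
# a generous threshold distinguishes the two cases.
@test t2 / t1 > 4
```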

@vchuravy (Member)

I was thinking more of the fact that you changed the macro to interpolate `code` and not `ex`. So what did this fix? Reading the code, it seems the only difference is the splitting off of keyword arguments?

@thomasfaingnaert (Member, Author)

Here's an example illustrating the issue:

using CUDA

function simple_matmul(C, A, B, N)
    for i = 1:N
        for j = 1:N
            @inbounds elem = C[i, j]

            for k = 1:N
                @inbounds elem += A[i, k] * B[k, j]
            end

            @inbounds C[i, j] = elem
        end
    end

    nothing
end

A = CuArray(rand(Float32, (1000, 1000)))
B = CuArray(rand(Float32, (1000, 1000)))
C = similar(A)

# warmup
@cuda simple_matmul(C, A, B, 500)

N1 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 500)
N2 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 1000)

println(N1)
println(N2)

With this PR, this outputs:

4.490149
51.85088

Without this PR, I get:

1.888e-6
3.84e-6

For other kernels I've seen similar results: CUDA.@elapsed returning something on the order of a couple of microseconds, no matter the complexity of the kernel.

> Reading the code it seems the only difference is the splitting off of keyword arguments?

Yes, and I'm not exactly sure why the code as written in #2113 leads to this behaviour, because, as you said, the only difference should be the keyword arguments, which shouldn't impact the results.
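The difference can be illustrated with a minimal timing macro (hypothetical names and a CPU clock for simplicity; this is not the actual CUDA.jl source). A varargs macro receives its arguments as a tuple `ex`, so interpolating `ex` directly splices a tuple literal of quoted expressions into the expansion, whereas splatting it splices the expressions themselves:

```julia
macro elapsed_buggy(ex...)
    # `ex` is the tuple of argument expressions. Interpolating it directly
    # splices a tuple literal like (:(user code),) into the expansion, so at
    # runtime only a Tuple{Expr} is constructed; the user code never runs.
    quote
        t0 = time_ns()
        $(ex)
        (time_ns() - t0) / 1e9
    end
end

macro elapsed_fixed(ex...)
    # Splatting the tuple splices the expressions themselves, so they execute
    # between the two clock reads.
    quote
        t0 = time_ns()
        $(ex...)
        (time_ns() - t0) / 1e9
    end
end
```

With these definitions, `@elapsed_buggy sleep(0.1)` reports a time near zero, while `@elapsed_fixed sleep(0.1)` reports roughly 0.1 seconds, mirroring the microsecond-scale results above.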

@vchuravy (Member)

What is the @macroexpand1 before and after?

@thomasfaingnaert (Member, Author)

julia> @macroexpand1 CUDA.@elapsed @cuda simple_matmul(C, A, B, 500)
quote
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:119 =#
    (var"#23#t0", var"#24#t1") = (CUDA.CuEvent(), CUDA.CuEvent())
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:120 =#
    CUDA.record(var"#23#t0")
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:121 =#
    (:(#= REPL[17]:1 =# @cuda simple_matmul(C, A, B, 500)),)
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:122 =#
    CUDA.record(var"#24#t1")
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:123 =#
    CUDA.synchronize(var"#24#t1"; blocking = false)
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:124 =#
    CUDA.elapsed(var"#23#t0", var"#24#t1")
end

julia> @macroexpand1 CUDA.@elapsed_fixed @cuda simple_matmul(C, A, B, 500)
quote
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:148 =#
    (var"#25#t0", var"#26#t1") = (CUDA.CuEvent(), CUDA.CuEvent())
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:149 =#
    CUDA.record(var"#25#t0")
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:150 =#
    #= REPL[18]:1 =# @cuda simple_matmul(C, A, B, 500)
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:151 =#
    CUDA.record(var"#26#t1")
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:152 =#
    CUDA.synchronize(var"#26#t1"; blocking = false)
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:153 =#
    CUDA.elapsed(var"#25#t0", var"#26#t1")
end

@vchuravy (Member)

So... They are identical?

@thomasfaingnaert (Member, Author), Oct 19, 2023

No, there's a small difference:

(:(#= REPL[17]:1 =# @cuda simple_matmul(C, A, B, 500)),)

vs.

#= REPL[18]:1 =# @cuda simple_matmul(C, A, B, 500)

(Admittedly, it also took me some time to notice...)

@thomasfaingnaert (Member, Author)

So the problem is that the code is inserted as a 1-element tuple containing the to-be-benchmarked Expr, rather than as that Expr itself.
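Concretely, the offending value can be built by hand (an illustrative snippet, not from the PR):

```julia
# The buggy expansion effectively evaluates this tuple literal: its only
# element is the quoted, unevaluated kernel launch.
t = (:(@cuda simple_matmul(C, A, B, 500)),)

typeof(t)   # Tuple{Expr}: constructing it takes microseconds and launches nothing
```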

I adapted the test to check this.

@maleadt merged commit 87fd247 into JuliaGPU:master on Oct 23, 2023 (1 check passed).
@thomasfaingnaert deleted the tf/fix-at-elapsed branch on October 23, 2023.