
Fix incorrect timing results for CUDA.@elapsed #2118

Merged 1 commit into JuliaGPU:master on Oct 23, 2023

Conversation

@thomasfaingnaert (Member)

No description provided.

@vchuravy (Member)

Could you also add a test?

@thomasfaingnaert (Member, Author)

Added a test that checks that a naive matmul's runtime scales (approximately) cubically. This might be a bit flaky, though, so I'm open to suggestions for other ways to test this.
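A scaling check along those lines might look like the following sketch. It assumes the `simple_matmul` kernel from the example further down this thread is in scope, and is illustrative only; the test actually added in the PR may differ:

```julia
using CUDA, Test

A = CuArray(rand(Float32, 1000, 1000))
B = CuArray(rand(Float32, 1000, 1000))
C = similar(A)

# warmup, so compilation time is not measured
@cuda simple_matmul(C, A, B, 500)

t1 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 500)
t2 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 1000)

# Doubling N should increase the runtime roughly 8x. With the bug, both
# timings are a few microseconds and the ratio stays close to 1, so even
# a generous threshold distinguishes the two cases.
@test t2 / t1 > 4
```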

@vchuravy (Member)

I was thinking more of the fact that you changed the macro to interpolate `code` and not `ex`. So what did this fix? Reading the code, it seems the only difference is the splitting off of keyword arguments?

@thomasfaingnaert (Member, Author)

Here's an example illustrating the issue:

using CUDA

function simple_matmul(C, A, B, N)
    for i = 1:N
        for j = 1:N
            @inbounds elem = C[i, j]

            for k = 1:N
                @inbounds elem += A[i, k] * B[k, j]
            end

            @inbounds C[i, j] = elem
        end
    end

    nothing
end

A = CuArray(rand(Float32, (1000, 1000)))
B = CuArray(rand(Float32, (1000, 1000)))
C = similar(A)

# warmup
@cuda simple_matmul(C, A, B, 500)

N1 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 500)
N2 = CUDA.@elapsed @cuda simple_matmul(C, A, B, 1000)

println(N1)
println(N2)

With this PR, this outputs:

4.490149
51.85088

Without this PR, I get:

1.888e-6
3.84e-6

For other kernels I've seen similar results: CUDA.@elapsed returning something on the order of a couple of microseconds, no matter the complexity of the kernel.

> Reading the code it seems the only difference is the splitting off of keyword arguments?

Yes, and I'm not exactly sure why the code as written in #2113 leads to this behaviour, because, as you said, the only difference should be the keyword arguments, which shouldn't impact the results.
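The difference can be illustrated with a minimal timing macro (hypothetical names and a CPU clock for simplicity; this is not the actual CUDA.jl source). A varargs macro receives its arguments as a tuple `ex`, so interpolating `ex` directly splices a tuple literal of quoted expressions into the expansion, whereas splatting it splices the expressions themselves:

```julia
macro elapsed_buggy(ex...)
    # `ex` is the tuple of argument expressions. Interpolating it directly
    # splices a tuple literal like (:(user code),) into the expansion, so at
    # runtime only a Tuple{Expr} is constructed; the user code never runs.
    quote
        t0 = time_ns()
        $(ex)
        (time_ns() - t0) / 1e9
    end
end

macro elapsed_fixed(ex...)
    # Splatting the tuple splices the expressions themselves, so they execute
    # between the two clock reads.
    quote
        t0 = time_ns()
        $(ex...)
        (time_ns() - t0) / 1e9
    end
end
```

With these definitions, `@elapsed_buggy sleep(0.1)` reports a time near zero, while `@elapsed_fixed sleep(0.1)` reports roughly 0.1 seconds, mirroring the microsecond-scale results above.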

@vchuravy (Member)

What is the @macroexpand1 before and after?

@thomasfaingnaert (Member, Author)

julia> @macroexpand1 CUDA.@elapsed @cuda simple_matmul(C, A, B, 500)
quote
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:119 =#
    (var"#23#t0", var"#24#t1") = (CUDA.CuEvent(), CUDA.CuEvent())
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:120 =#
    CUDA.record(var"#23#t0")
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:121 =#
    (:(#= REPL[17]:1 =# @cuda simple_matmul(C, A, B, 500)),)
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:122 =#
    CUDA.record(var"#24#t1")
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:123 =#
    CUDA.synchronize(var"#24#t1"; blocking = false)
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:124 =#
    CUDA.elapsed(var"#23#t0", var"#24#t1")
end

julia> @macroexpand1 CUDA.@elapsed_fixed @cuda simple_matmul(C, A, B, 500)
quote
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:148 =#
    (var"#25#t0", var"#26#t1") = (CUDA.CuEvent(), CUDA.CuEvent())
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:149 =#
    CUDA.record(var"#25#t0")
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:150 =#
    #= REPL[18]:1 =# @cuda simple_matmul(C, A, B, 500)
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:151 =#
    CUDA.record(var"#26#t1")
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:152 =#
    CUDA.synchronize(var"#26#t1"; blocking = false)
    #= /home/tfaingna/.julia/dev/CUDA/lib/cudadrv/events.jl:153 =#
    CUDA.elapsed(var"#25#t0", var"#26#t1")
end

@vchuravy (Member)

So... They are identical?

@thomasfaingnaert (Member, Author), Oct 19, 2023

No, there's a small difference:

(:(#= REPL[17]:1 =# @cuda simple_matmul(C, A, B, 500)),)

vs.

#= REPL[18]:1 =# @cuda simple_matmul(C, A, B, 500)

(Admittedly, it also took me some time to notice...)

@thomasfaingnaert (Member, Author)

So the problem is that the code is inserted as a 1-element tuple containing the to-be-benchmarked Expr, rather than as that Expr itself.
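Concretely, the offending value can be built by hand (an illustrative snippet, not from the PR):

```julia
# The buggy expansion effectively evaluates this tuple literal: its only
# element is the quoted, unevaluated kernel launch.
t = (:(@cuda simple_matmul(C, A, B, 500)),)

typeof(t)   # Tuple{Expr}: constructing it takes microseconds and launches nothing
```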

I adapted the test to check this.

@maleadt merged commit 87fd247 into JuliaGPU:master on Oct 23, 2023 (1 check passed).
@thomasfaingnaert deleted the tf/fix-at-elapsed branch on October 23, 2023.