Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eltype occurs at runtime in getidx for at least some MatrixField operations #1871

Open
charleskawczynski opened this issue Jul 8, 2024 · 2 comments
Labels
bug Something isn't working performance

Comments

@charleskawczynski
Copy link
Member

charleskawczynski commented Jul 8, 2024

I was inspired by some performance helpdesk discussion over the weekend (xref JuliaLang/julia#55009), and made a very simple reproducer for our matrix-field getidx. I recently had a hunch that we should hoist our eltype calls, but @dennisYatunin pointed out that this should simply return a compile-time constant. The benchmark shows, however, that we're spending most of the time in eltype:

#=
julia --project
using Revise; include(joinpath("test", "Operators", "finitedifference", "getidx.jl"))
=#
using ClimaComms
ClimaComms.@import_required_backends
import BenchmarkTools
import ClimaCore
@isdefined(TU) || include(
    joinpath(pkgdir(ClimaCore), "test", "TestUtilities", "TestUtilities.jl"),
);
include(joinpath(pkgdir(ClimaCore), "test", "MatrixFields", "matrix_fields_broadcasting", "test_scalar_utils.jl"))
import .TestUtilities as TU;
using Test
using JET

import ClimaCore.MatrixFields: 
import ClimaCore: Utilities, Spaces, Fields, Operators
import LazyBroadcast: @lazy

function call_getidx(space, bc, loc, idx, hidx)
    @inbounds Operators.getidx(space, bc, loc, idx, hidx)
    return nothing
end

device = ClimaComms.device()
space = TU.CenterExtrudedFiniteDifferenceSpace(
    FT;
    zelem = 30,
    helem = 4,
    context = ClimaComms.context(device),
)
bc = @lazy @. ᶠᶜmat  ᶜᶠmat 
    (2 * ᶠᶜmat  ᶜᶜmat  ᶜᶠmat + ᶠᶠmat  ᶠᶠmat / 3 - (4I,))  ᶠᶠmat 
    (ᶠᶜmat  ᶜᶜmat  ᶜᶠmat * 2 - (ᶠᶠmat / 3)  ᶠᶠmat + (4I,))  ᶠᶠmat;
loc = Operators.Interior()
idx = 10
idx = Utilities.PlusHalf(10)
hidx = (1, 1, 1)
call_getidx(space, bc, loc, idx, hidx)
BenchmarkTools.@benchmark call_getidx(space, bc, loc, idx, hidx)

import Profile, ProfileCanvas

function do_work(space, bc, loc, idx, hidx, n)
    for i in 1:n
        call_getidx(space, bc, loc, idx, hidx)
    end
    return nothing
end

do_work(space, bc, loc, idx, hidx, 1)
Profile.clear()
prof = Profile.@profile do_work(space, bc, loc, idx, hidx, 10^6)
results = Profile.fetch()
Profile.clear()
ProfileCanvas.html_file("flame.html", results)
Screenshot 2024-07-08 at 9 41 00 AM

Does this mean that the compiler is not caching the inference result? @vchuravy

The good news is that: 1) there is clearly room for improving our emitted code, and 2) this is a really nice way to make reproducers that captures the fully complexity of getidx while only targeting a single point in space, and 3) this impacts both the CPU and the GPU, since these instructions could easily increase register usage on the gpu.

@charleskawczynski
Copy link
Member Author

#1880 helped a lot, so I'll rename the title.

@charleskawczynski charleskawczynski changed the title getidx spends most of its time in eltype for at least some MatrixField operations eltype occurs at runtime in getidx for at least some MatrixField operations Jul 11, 2024
@charleskawczynski
Copy link
Member Author

Here's the flame graph for latest version:

Screenshot 2024-07-11 at 2 31 05 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working performance
Projects
None yet
Development

No branches or pull requests

1 participant