Compile-time regression #46
Hmmm, now it seems like all the time stepping kernels are being compiled every time step, so I might have done something bad...
If you remove the call to
Ah, that would explain the slowdown, thanks for the tip! I must have misunderstood what it does. The kernels still take much longer to compile compared to GPUifyLoops v0.1.0, which I'm guessing might be because of all the Cassette stuff, but I'm not sure. I should probably get that last kernel working before worrying about speed, but 9-minute compilation times make debugging very slow :(
Yea, the compile times are killing our kernels as well. I haven't been able to wait long enough to get my kernels to even compile with v0.2.0. :-/
That sounds pretty bad... Do you know how long they took to compile with v0.1.0? I'm noticing that they're taking ~2.5x more time to compile with v0.2.0.
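As an aside on how such comparisons can be made: in Julia, the first call to a freshly defined function pays the JIT compilation cost, so timing the first call against a subsequent one roughly isolates compile latency. A minimal illustrative sketch (not the thread's actual methodology, and `f` is a made-up stand-in for a kernel):

```julia
# Sketch: estimating compilation latency of a freshly defined function.
# The first call includes JIT compilation; later calls do not.
f(x) = sum(sin, x) + sum(cos, x)

x = rand(1_000)
t_first  = @elapsed f(x)   # compile + run
t_second = @elapsed f(x)   # run only
println(t_first > t_second)
```

On a fresh session this prints `true`, since compilation dominates the first call.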
:/ Since v0.2.0 got rid of the
Yeah, I know it's a difficult problem; I don't know enough to have any ideas. For now I think if I spin up 10 Google Cloud GPU instances I can debug 10x as fast!
Can you post the output of
Just got
I used Oceananigans.jl to create an ocean model on a GPU and time step it once, which should compile and run most of the code and almost all the CUDA kernels.

With GPUifyLoops 0.1.0:

With GPUifyLoops 0.2.0 (without calling/compiling the most expensive kernel):

Will try on the new branches now.
Oh wow, that time is actually spent inside of the CUDAnative IR rewriting passes. That's surprising, but good news as well, since we should be able to optimize that more easily than, say, getting inference to work effectively with Cassette. Could you send me the modules you're processing? That should be easier than replicating your set-up:

```diff
diff --git a/src/compiler/irgen.jl b/src/compiler/irgen.jl
index a4e1f51..730b12b 100644
--- a/src/compiler/irgen.jl
+++ b/src/compiler/irgen.jl
@@ -173,6 +173,9 @@ function irgen(job::CompilerJob, linfo::Core.MethodInstance, world)
     globalUnique += 1
     llvmfn *= "_$globalUnique"
     LLVM.name!(entry, llvmfn)
+    open(joinpath(tempdir(), "module_$(globalUnique).bc"), "w") do io
+        write(io, mod)
+    end
     # minimal required optimization
     @timeit to[] "rewrite" ModulePassManager() do pm
```
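With a patch like the one above, each compiled module lands as a `module_*.bc` file in the system temp directory. A small sketch of how one might gather and rank the dumps afterwards (the `list_modules` helper and dummy file contents are hypothetical, invented for illustration):

```julia
# Hypothetical helper: collect bitcode modules dumped by a patch like the
# one above, sorted largest first so the suspicious ones surface on top.
list_modules(dir) = sort!(
    [joinpath(dir, f) for f in readdir(dir)
     if startswith(f, "module_") && endswith(f, ".bc")];
    by = filesize, rev = true)

# Demo with dummy files standing in for real dumps:
dir = mktempdir()
write(joinpath(dir, "module_1.bc"), "small")
write(joinpath(dir, "module_2.bc"), "a much bigger module blob")
println(basename.(list_modules(dir)))
```

This prints the bigger dummy module first: `["module_2.bc", "module_1.bc"]`.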
That would be awesome! I can easily modify
Yes.
Got 24 modules/files out. Wasn't sure of the best way of sharing them, so I just put them up in a public Google Drive folder: https://goo.gl/DARdUc

I was running with a
Yup, those are some good modules.
@vchuravy just tried running with the latest. Sounds like @maleadt has a fix though! CliMA/Oceananigans.jl#66 (comment)
Please try JuliaGPU/CUDAnative.jl#369 and report new timings; it should have fixed some inadvertent quadratic behavior and made it linear. Note that there are probably more inefficiencies; I haven't been working with exceptionally large modules. Don't hesitate to send me these, I'm all for optimizing the compiler (if it matters).
Awesome, thanks so much for the lightning fast fix! Will give it a try.
No problem. Pushed an additional optimization; let me know which commit you tested on.
Testing with your later commit (JuliaGPU/CUDAnative.jl@8801976), but it's still busy compiling (40 minutes and counting...). When I commented out the call to the problematic kernel above (#46 (comment)), everything took ~8 minutes. And when I tested on

I'll let it keep compiling for as long as I can to see what the
Strange, the speedup I noticed on the modules you sent was pretty large (400s -> 5s).
Interesting, I'll repatch and reproduce the modules. The modules I shared before may not have included the massive problematic kernel, as it was causing stack overflows until JuliaLang/julia@5875d8c.
I've shared the new modules in a new Google Drive folder: https://goo.gl/AeMDfP

There are now 2 more modules for a total of 26, and the new ones (25, 26) are huge compared to the others, which are tiny, so hopefully that's the problematic kernel. Sorry they weren't included earlier.
No, they are smaller and not as problematic as the previous dump of modules you sent. Anyway, I won't be doing much development this weekend, so take your time to figure out what the impact of those optimizations is and which modules remain problematic (for what part of the compiler -- the timings should be helpful there).

EDIT: also note that many of the modules are probably from the CUDAnative runtime library, which explains the many tiny kernels. You could get rid of those by warming up and making sure the runtime is compiled (or commenting out the global
Ah, I see, I thought the relative file size meant something. I tried again without the kernel I think is problematic, and indeed compilation is much faster (500 s -> 100 s) and IR rewrite time is only 7 seconds. The problem might be on my end at this point. I'll play around with that one kernel and see if I can isolate the problem and get it to compile, but everything looks great without it. Thanks again for your help!
IR emission and linking are both pretty direct calls into, respectively, Julia and LLVM, so there's not much to optimize there. Although the allocations are suspicious... To debug that phase of the compiler, it would be useful to have a single kernel invocation (or even better, a function + tuple type for use with e.g.
Turns out the "problematic kernel" does compile; it just takes 2 hours, most of which is spent in IR emission and linking. If there's not much to optimize, then I wonder if this one kernel can be launched the way it was done in GPUifyLoops 0.1.0 (where this kernel compiled relatively quickly) and all the others can be launched with Cassette. I'll try modifying the kernel to see if it's just the large number of

The new modules that were spit out seem much larger, including a 38 MiB one.
Not sure if this is helpful or what you were suggesting, but running
Not quite, since you are using
Ah, I thought I had to match the signature as defined. Just tried again with the actual types (no parameters), which looks kind of similar, so not sure if it's helpful: https://gist.github.com/ali-ramadhan/1d82f1414642fa9b5ba3b1300c51f66e

Not sure if it's relevant, but I also tried with the explicit types plus parameterized types (see bottom of comment) and hit an LLVM error, e.g.
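For readers following along, the "function + tuple type" being asked for is the function object plus the *concrete* types of the actual call, not the parametric signature from the method definition. A minimal sketch using the stdlib `InteractiveUtils` (the function `mykernel` is a made-up stand-in, not a kernel from this thread):

```julia
using InteractiveUtils

# Hypothetical function defined with a parametric signature...
mykernel(a::AbstractArray{T}, b::T) where {T} = a .+ b

# ...but inspected with the concrete types of a real call site:
buf = IOBuffer()
code_llvm(buf, mykernel, Tuple{Vector{Float64}, Float64})
ir = String(take!(buf))
println(occursin("define", ir))  # the dump contains an LLVM function definition
```

The same `(function, Tuple{...})` pair is what compiler-debugging entry points generally take, which is why matching the runtime types matters.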
@ali-ramadhan and I spent a bit more time on this issue. The underlying problem is that the 265 fix for Cassette causes a three-orders-of-magnitude blow-up in code size and a lot more work for CUDAnative.jl, so I will disable that. Together with the CUDAnative improvements, compile times for Ali are now rather reasonable (10 s).
49: disable 265 fix if not requested during precompile time r=vchuravy a=vchuravy

fixes #46

Co-authored-by: Valentin Churavy <[email protected]>
Loving the new `@launch` macro! But I noticed my kernels for Oceananigans.jl are taking much longer to compile once I upgraded to GPUifyLoops v0.2.0, and one kernel in particular would never finish compiling, as it somehow got stuck in an infinite recursive call ending with a `StackOverflowError`.

I was able to track this down to the following line

GPUifyLoops.jl/src/GPUifyLoops.jl
Line 109 in 66a6de1

which I changed to

which got me up and running again, as none of my kernels require `contextualize`, and brought compilation time back down to 3.5 minutes from 8.5 minutes with `contextualize`. No idea why the compiler got stuck when `contextualize` was turned on, as I didn't have any function calls that could be converted.

I was wondering if it would be a good idea to make `contextualize` an option when launching a kernel? From my understanding, it lets you write `exp(x)` in your kernel, which gets converted to `CUDAnative.exp(x)` when called via `@launch CUDA()`. I can see `contextualize` being useful, but it might be nice to turn it on only for certain kernels that need it, to speed up compilation. I'm happy to work on this. In case `contextualize` should always be on, I can just keep this on my fork.

I was able to reproduce this with Julia 1.1 and the latest nightly (Version 1.2.0-DEV.562 (2019-03-28), Commit 11156024da).

Unfortunately I'm not sure what the issue was, so I can't provide a minimal working example. Verbose details included below.

Problematic kernel (I'm working on making it less ugly!)

Launching the kernel

Error
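For readers unfamiliar with `contextualize`: GPUifyLoops uses Cassette to recursively rewrite calls inside a kernel so that, for example, `Base.exp` resolves to `CUDAnative.exp` on the device. A dependency-free sketch of the dispatch idea behind such substitution (this is illustrative only, not Cassette's actual machinery, and it only substitutes at the top level rather than recursively; `DeviceCtx`, `overdub`, and `device_exp` are invented names):

```julia
# Sketch of context-based call substitution, in the spirit of Cassette's
# overdub -- NOT the real GPUifyLoops/Cassette implementation.
struct DeviceCtx end

# Fallback: calls pass through unchanged.
overdub(::DeviceCtx, f, args...) = f(args...)

# "Contextualized" method: swap Base.exp for a device-flavored version.
device_exp(x) = exp(x)  # stand-in for CUDAnative.exp
overdub(::DeviceCtx, ::typeof(exp), x) = device_exp(x)

ctx = DeviceCtx()
println(overdub(ctx, exp, 1.0) == exp(1.0))  # substituted call agrees: true
println(overdub(ctx, sqrt, 4.0))             # untouched call falls through: 2.0
```

The real transform walks the whole call graph, which is part of why it can be so expensive to compile.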