
WIP/RFC: Split GC lowering pass into a platform-agnostic pass and a CPU-specific pass #31135

Merged: 7 commits, Jun 3, 2019

Conversation

jonathanvdc
Contributor

Hello! This is my first PR here. I'm an MSc student working on making CUDAnative interact more organically with the Julia compiler. @maleadt is my supervisor. (Hi, Tim)

Some background: I'm proposing the changes in this PR in order to make GPU codegen more efficient for constructs that require GC support. The LateLowerGCFrame pass performs desirable platform-agnostic optimizations and we'd like to enable it in the CUDAnative pass pipeline. Sadly, we actually can't do that at the moment because LateLowerGCFrame also lowers GC intrinsics to LLVM IR that can only be expected to work on a CPU. See 1 for an issue where this is discussed in a little more depth.

This PR retains only the platform-agnostic aspects of LateLowerGCFrame and moves the CPU-specific parts into a new pass called FinalLowerGC. This doesn't change any of the resulting codegen for CPU platforms, but it should make LateLowerGCFrame eligible for reuse by CUDAnative.
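In pipeline terms, the split looks roughly like this (a sketch only; addGCLoweringPasses is a hypothetical helper used to show the ordering, while the two create*Pass functions are the ones this PR works with):

```cpp
// Sketch only: addGCLoweringPasses is a hypothetical helper used here to
// show the ordering; the real pass-manager setup lives in Julia's codegen
// sources.
#include <llvm/IR/LegacyPassManager.h>
#include <llvm/Pass.h>

llvm::Pass *createLateLowerGCFramePass(); // existing pass, platform-agnostic after this PR
llvm::Pass *createFinalLowerGCPass();     // new pass holding the CPU-specific lowerings

static void addGCLoweringPasses(llvm::legacy::PassManagerBase &PM, bool targetIsCPU) {
    PM.add(createLateLowerGCFramePass());
    if (targetIsCPU)
        PM.add(createFinalLowerGCPass());
    // A GPU pipeline (e.g. CUDAnative) would append its own lowering for
    // the julia.* GC intrinsics here instead.
}
```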

I also added a helper file (llvm-pass-helpers.{cpp,h}) that contains functionality used by both the LateLowerGCFrame and FinalLowerGC passes—creating some helper functions and types seemed like a much more appropriate solution than duplicating code shared by LateLowerGCFrame and FinalLowerGC.

The helper functions/types mostly deal with scanning an llvm::Module& for Julia types, intrinsics and well-known functions. I did my best to design the helpers in such a way that they'll also be useful to other passes. For example, the initialization logic in AllocOpt could be simplified by using the helpers.
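To make this concrete, here is a minimal sketch of the lookup-or-declare pattern the helpers use (simplified; the actual IntrinsicDescription and getOrDeclare in the PR carry more information):

```cpp
// Simplified sketch of the lookup-or-declare helpers; the PR's real
// versions carry more detail (attributes, runtime-function variants, etc.).
#include <llvm/IR/DerivedTypes.h>
#include <llvm/IR/Function.h>
#include <llvm/IR/Module.h>

using namespace llvm;

struct IntrinsicDescription {
    const char *name;
    // Builds the declaration's type when the function is not yet present.
    FunctionType *(*makeType)(LLVMContext &);
};

// Look the function up by name in the module; declare it on demand if it
// is missing, so passes never have to hard-code declaration boilerplate.
static Function *getOrDeclare(Module &M, const IntrinsicDescription &desc) {
    if (Function *F = M.getFunction(desc.name))
        return F;
    return Function::Create(desc.makeType(M.getContext()),
                            Function::ExternalLinkage, desc.name, &M);
}
```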

Anyway, I'd love to get some feedback on this PR and I'm happy to make changes accordingly. Thanks!

@maleadt maleadt added compiler:codegen Generation of LLVM IR and native code domain:gpu Affects running Julia on a GPU labels Feb 21, 2019
@maleadt maleadt requested a review from Keno February 21, 2019 14:32
@Keno
Member

Keno commented Feb 21, 2019

What's the reasoning behind two passes as opposed to just doing codegen based on the target triple, or potentially making the pass a parent class with the codegen parts behind a virtual method? Seems like a lot of the complexity here stems from trying to preserve information across the pass boundary.
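(For reference, a minimal sketch of what that virtual-method variant could look like; the names below are illustrative only, not something in the diff:)

```cpp
// Illustrative only: one pass, with the target-specific pieces hidden
// behind virtual hooks that a CPU (or GPU) subclass would override.
#include <llvm/IR/IRBuilder.h>

struct GCLoweringTarget {
    virtual ~GCLoweringTarget() = default;
    // Emit this target's code for allocating a GC frame with 'slots' roots.
    virtual llvm::Value *emitNewGCFrame(llvm::IRBuilder<> &B, unsigned slots) = 0;
    // Emit this target's code for registering/unregistering that frame.
    virtual void emitPushGCFrame(llvm::IRBuilder<> &B, llvm::Value *frame) = 0;
    virtual void emitPopGCFrame(llvm::IRBuilder<> &B, llvm::Value *frame) = 0;
};
```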

@maleadt
Member

maleadt commented Feb 21, 2019

What's the reasoning behind two passes as opposed to just doing codegen based on the target triple, or potentially making the pass a parent class with the codegen parts behind a virtual method?

Being able to provide an implementation for the second pass in CUDAnative, written in Julia, using the LLVM C API. Kind of what we do now, where we provide our own implementation for all of LateLowerGCFrame.
EDIT: I guess that doesn't really mean it needs to be two passes, cf. createSimpleLoopUnrollPass/createLoopUnrollPass.

I was also imagining this "final" pass taking care of other platform-dependent intrinsics as well, whether they come from other LLVM-level lowering passes or maybe even from direct llvmcalls in the stdlib.
GC lowering seemed like a good testing ground for that approach.

@maleadt maleadt changed the title Split GC lowering pass into an platform-agnostic pass and a CPU-specific pass WIP/RFC: Split GC lowering pass into an platform-agnostic pass and a CPU-specific pass Feb 21, 2019
@jonathanvdc jonathanvdc changed the title WIP/RFC: Split GC lowering pass into an platform-agnostic pass and a CPU-specific pass WIP/RFC: Split GC lowering pass into a platform-agnostic pass and a CPU-specific pass Feb 21, 2019
@Keno
Member

Keno commented Feb 21, 2019

I get the motivation. On the other hand, this pass is very tightly coupled to the representation of the GC intrinsics, so I'm not really sure much is lost by just branching on the target architecture in the code generator and putting those details there. That seems far simpler than trying to encode the entire state data structure into intrinsics in the IR.

@jonathanvdc
Contributor Author

jonathanvdc commented Feb 21, 2019

@Keno thanks for your feedback! :) I'll try and address your concerns.

  • Regarding branching on the target architecture: wouldn't that fundamentally impose a "right" way to implement a GC on every architecture? Because I don't think there exists one "right" way to do that: at the moment, CUDAnative's GC strategy boils down to "malloc, never free". That's actually not such a crazy thing to do on a GPU, but we might want to implement an actual GC for CUDAnative in the future. At that point, the target-specific GPU codegen will change radically.

    If we were to branch on the target architecture, then every tweak to the way the CUDAnative GC works would require changes to the GPU GC codegen in LateLowerGCFrame. CPU GCs are fairly stable, but GPU GCs are much more of an active area of research, and we're just getting started with dynamic memory allocation in CUDAnative. It's not at all unreasonable to expect that CUDAnative's GC implementation will take a while to stabilize. We'd like to limit the effects of evolving a GC for CUDAnative to the CUDAnative project itself.

    Moreover, we might want to support multiple GC strategies for CUDAnative. For example, some kernels might be better served by a malloc-never-free strategy while others might really need a GC. If LateLowerGCFrame were to select instructions based on the target architecture, then we wouldn't be able to support both strategies at the same time. The changes in this PR allow CUDAnative to support both by customizing its pass pipeline.

  • I don't really understand what makes you say that my proposed changes try to encode the State data structure into intrinsics—that certainly wasn't my intention.

    I'll elaborate. I introduced six new intrinsics that are emitted by LateLowerGCFrame instead of something platform-specific. Each of those intrinsics can be lowered formulaically for both CPU and GPU back-ends; this PR implements the CPU lowerings, described below, in FinalLowerGC.

    1. julia.new_gc_frame allocates a new GC frame. This intrinsic expands to an alloca + memset when targeting the regular Julia GC. When targeting a malloc-never-free GC, we want julia.new_gc_frame to expand to a smaller alloca that will hopefully either get DCE'd or at least be replaced by scalars. The malloc-never-free GPU GC's alloca is smaller than the CPU GC's because the CPU GC needs some extra storage that the malloc-never-free GPU GC doesn't use.
    2. julia.push_gc_frame registers a GC frame with the GC. When targeting the CPU GC, it expands to a complex sequence of instructions that register the GC frame. When targeting a malloc-never-free GC, we simply delete all calls to this intrinsic, i.e., we turn it into a nop.
    3. julia.pop_gc_frame does the opposite. It unregisters a GC frame. Its lowering is analogous to julia.push_gc_frame's.
    4. julia.get_gc_frame_slot gets a pointer to a slot in a GC frame. It always expands to a GEP, but the CPU GC's GEP looks slightly different from the current CUDAnative lowering, because the layout of GC frames is different there.
    5. julia.gc_alloc_bytes allocates storage for an object of a particular size, just like julia.gc_alloc_obj. When targeting the CPU GC, it expands to either a gc_pool_alloc or a gc_big_alloc call. When targeting a malloc-never-free GC, julia.gc_alloc_bytes expands to a malloc call plus some pointer arithmetic.
    6. julia.queue_gc_root is lowered as a jl_gc_queue_root call when targeting the CPU GC and gets turned into a nop when targeting the malloc-never-free GC.

    All of these intrinsics are easy to lower, and they have clear semantics yet enough wiggle room to support multiple GC implementations. If anything, they're just a way for LateLowerGCFrame to declare "this is where I want platform-specific code." A sketch of what one such lowering looks like follows below.
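(A rough sketch, assuming a recent LLVM and using julia.queue_gc_root because it has the simplest lowering; this is illustrative, not the PR's actual code:)

```cpp
// Rough sketch of a back-end-specific final lowering for one intrinsic,
// julia.queue_gc_root. A CPU pipeline rewrites the calls to the runtime's
// jl_gc_queue_root; a malloc-never-free pipeline simply erases them.
#include <llvm/ADT/SmallVector.h>
#include <llvm/IR/Instructions.h>
#include <llvm/IR/Module.h>

using namespace llvm;

static void lowerQueueGCRoot(Module &M, bool mallocNeverFree) {
    Function *intrinsic = M.getFunction("julia.queue_gc_root");
    if (!intrinsic)
        return; // nothing to lower in this module

    SmallVector<CallInst *, 8> calls;
    for (User *U : intrinsic->users())
        if (auto *call = dyn_cast<CallInst>(U))
            calls.push_back(call);

    if (mallocNeverFree) {
        // The intrinsic is a no-op under this strategy: drop every call.
        for (CallInst *call : calls)
            call->eraseFromParent();
    } else {
        // CPU strategy: redirect the calls to the runtime function.
        FunctionCallee target = M.getOrInsertFunction(
            "jl_gc_queue_root", intrinsic->getFunctionType());
        for (CallInst *call : calls)
            call->setCalledFunction(target);
    }
}
```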

EDIT: I forgot about an intrinsic (julia.queue_gc_root) when I wrote this comment. Added it to the list now. Sorry about that.

@jonathanvdc
Contributor Author

jonathanvdc commented Apr 21, 2019

@Keno Hello. It's been a while since I opened this PR and it's kind of been dormant since then. In the meantime, I've successfully implemented a garbage collector for CUDAnative on top of the changes I proposed here. Garbage collection underpins a wide range of Julia features, so I'd like to upstream my GC into CUDAnative's master branch. However, I'm fairly sure that's just not going to happen until the intrinsics I'm proposing here—or some alternative but equivalent mechanism—are included in Julia itself.

I understand if you're not very excited about this admittedly rather large PR, but I would really like to salvage its core ideas. Code-wise, it consists of two main additions:

  1. I refactored the way Julia intrinsics and well-known functions are resolved by passes. I believe this is a good thing in its own right, orthogonal to the GC codegen changes I'm proposing. The refactor takes a lot of accidental complexity out of the optimization passes and moves it into a dedicated helper file, getting rid of a good chunk of code duplication. Furthermore, the resolution approach taken by the refactored code is simply more robust and arguably more maintainable than the ad-hoc approach previously taken by Julia's LLVM passes.

    That being said, if you genuinely don't want the refactor, then I can just tear it out and duplicate some code to achieve the same effect.

  2. The crux of this PR: I introduced six new intrinsics that are produced by LateLowerGCFrame and subsequently lowered to CPU-specific code by FinalLowerGC. @maleadt suggested I take this approach and I truly believe that this is the cleanest way to give CUDAnative and similar packages a window of opportunity to supply their own GC lowerings.

    With regard to your previous comments: I'm fairly sure that branching on the target triple to determine which lowering to use won't work. At the time of writing, my GC branch of CUDAnative has two such lowerings: one applies a "malloc never free" strategy, the other introduces calls to the GPU GC. They target the same architectures, so their target triples are identical.

    If you really don't want to introduce new intrinsics, then I suppose we could parameterize LateLowerGCFrame somehow and encode a variety of different lowerings in LateLowerGCFrame itself. That would imply moving all of this logic into Julia itself. I don't like this solution, but if you say that this is what you want then I'd be okay with implementing it. My end goal is to get my GPU GC for CUDAnative to work on unmodified Julia. This PR is mostly just a means to achieve that.

I'm not sure how to proceed here. Do you see an avenue for me to upstream some mechanism to support my GPU GC along with a "malloc never free" lowering? Would it help if I split this PR up into the refactor and its core contribution? Would you like me to make changes to the way the lowering process itself works?

I mostly just want a roadmap I can stick to for adding support for my CUDAnative GC to Julia. I'd be happy to rework this PR into one or more PRs that are more to your liking. But for that I need you to tell me what those PRs should look like.

Also, I'd like to point out that my thesis is due by the end of May. I'm very motivated to work on this PR at the moment, but I also have a lot of other things I need to take care of before that end of May deadline. So I can't really afford to experiment with a variety of different approaches before settling on some "optimal" solution. Do you think that—given these constraints—we can make that deadline?

Thanks!

@vchuravy
Sponsor Member

CI failure is #30365

char FinalLowerGC::ID = 0;
static RegisterPass<FinalLowerGC> X("FinalLowerGC", "Final GC intrinsic lowering pass", false, false);

Pass *createFinalLowerGCPass()
Sponsor Member


Please add a C-API function like

extern "C" JL_DLLEXPORT void LLVMExtraAddCombineMulAddPass(LLVMPassManagerRef PM)

Contributor Author


Alright. I added one.
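(For reference, such a hook follows the same pattern as the other LLVMExtraAdd* functions; a sketch, with JL_DLLEXPORT coming from Julia's own headers and the exact definition living in the PR diff:)

```cpp
// Sketch of the added hook; it mirrors the other LLVMExtraAdd* functions.
// JL_DLLEXPORT is Julia's export macro and is shown here as-is.
#include <llvm-c/Types.h>
#include <llvm/IR/LegacyPassManager.h>

llvm::Pass *createFinalLowerGCPass();

extern "C" JL_DLLEXPORT void LLVMExtraAddFinalLowerGCPass(LLVMPassManagerRef PM)
{
    llvm::unwrap(PM)->add(createFinalLowerGCPass());
}
```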

@vchuravy
Sponsor Member

@Keno so far we have been trying to separate the different backend implementations from the main Julia implementation, and I think that strategy has served us reasonably well. It has allowed the GPU backend to experiment and iterate more quickly than Julia's release cycle, whereas branching on the target would strongly couple the GPU backend to specific versions of Julia (in addition, as Jonathan says, there are several strategies that backends might want to experiment with).

@jonathanvdc
Contributor Author

@vchuravy Thanks for chiming in! I rebased the PR again and added the C-API function you asked for.

@Keno
Member

Keno commented Apr 23, 2019

If it's useful to the GPU backend I'm ok with this. I'll try to find the time to read through the details this week.

@codecov-io

Codecov Report

Merging #31135 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #31135   +/-   ##
=======================================
  Coverage   81.21%   81.21%           
=======================================
  Files         372      372           
  Lines       62957    62957           
=======================================
  Hits        51131    51131           
  Misses      11826    11826


@jonathanvdc
Contributor Author

Awesome, thanks!

@vchuravy
Sponsor Member

Jonathan, it would be great if you could squash some of these commits (the goal being self-contained changes that each pass tests).

@jonathanvdc
Contributor Author

Sure thing Valentin. I'll see what I can do tomorrow. Thanks for helping me navigate this!

@jonathanvdc
Contributor Author

jonathanvdc commented Apr 24, 2019

I rebased my branch to move bug-fix commits adjacent to buggy commits (bugs were usually related to ABIs different from my own machine's), then squashed those commits. I also squashed related commits to reduce the amount of noise.

My chronomancy seems to have confused GitHub, though: the commits in this PR are laid out in a more or less random order now. Sorry about that. Didn't see it coming. At least git log seems unaffected.

Here's git log's take on the commits in this PR, if that's at all useful.

commit ef31437ec71d654f0f96cd886d1614822dd81f38
Author: jonathanvdc <[email protected]>
Date:   Wed Feb 20 11:28:56 2019 +0100

    Change intrinsic declaration function signatures
    
    Insert a newline

commit 0f58a60109153a1f0ed5ccfd1abb52a6e8b6b7f1
Author: Jonathan <[email protected]>
Date:   Wed Apr 24 12:53:42 2019 +0200

    Rename 'IntrinsicDescription::getOrDefine' to 'getOrDeclare'
    
    Use getOrDeclare to get 'julia.gc_alloc_obj' intrinsic

commit 7bb118c2d02b642f741609fd187de26644ec6dff
Author: jonathanvdc <[email protected]>
Date:   Tue Feb 19 14:19:23 2019 +0100

    Introduce a namespace for well-known runtime functions
    
    Use helpers to grab well-known functions

commit 1e06be7da07a224dd9364f767cbde722430b51f8
Author: jonathanvdc <[email protected]>
Date:   Thu Feb 21 16:01:09 2019 +0100

    Define a 'julia.queue_gc_root' intrinsic
    
    Delay 'jl_gc_queue_root' call generation until final GC lowering pass
    
    Fix two broken assertions

commit fff80dbdae8e27fc3e303ccc8f41b4c2b2d9bacf
Author: jonathanvdc <[email protected]>
Date:   Wed Apr 24 16:03:57 2019 +0200

    Define a 'julia.gc_alloc_bytes' intrinsic
    
    Delay GC allocation lowering into the final GC lowering pass
    
    Fix a broken LLVM pass test
    
    Define GC alloc lowering pass tests
    
    Cast 'julia.gc_alloc_obj' argument
    
    Add missing annotation to test
    
    
    Relax final GC lowering pass test

commit 1352b973007a9487ac6390f9cc03785ce7bb7d6f
Author: jonathanvdc <[email protected]>
Date:   Tue Apr 23 19:02:55 2019 +0200

    Delay GC frame lowering into a separate pass
    
    Fix a typo in the final GC lowering pass
    
    Split up 'julia.push_new_gc_frame' into two intrinsics
    
    Steal name from existing GC frame when lowering 'julia.new_gc_frame'
    
    Move GC frame pop lowering to final GC lowering pass
    
    Change how GC helper intrinsics are accessed
    
    Remove fields from GCLoweringRefs
    
    Capture module pointer in GCLoweringRefs
    
    Make the 'julia.new_gc_frame' intrinsic noalias
    
    Move GC frame access lowering to final GC lowering pass
    
    Update some outdated comments
    
    Add some sanity checks to final GC lowering pass
    
    Include 'llvm-version.h' in final GC lowering pass
    
    Define tests that test GC frame lowering passes in isolation
    
    Use 'i32' instead of 'size_t' for 'julia.{new,push}_gc_frame' size args
    
    Try to fix GC lowering tests
    
    These tests work fine on most machines, but they break on x86 Travis CI builds.
    The breakage is almost certainly due to a mismatch between integer types in
    the tests and 'size_t', exposed by recent changes to the GC lowering pass.
    
    Define 'LLVMExtraAddFinalLowerGCPass'

commit 0bd560222b30eb07bb73e95a3fff5040cdc5827c
Author: jonathanvdc <[email protected]>
Date:   Mon Feb 18 17:53:33 2019 +0100

    Factor platform-agnostic types and intrinsics out of LateLowerGCFrame
    
    Factor a loop out of LateLowerGCFrame::runOnFunction and into GCLoweringRefs
    
    Move TBAA NDNode init from LateLowerGCFrame to GCLoweringRefs
    
    Rename 'llvm-late-gc-helpers.{cpp,h}' to 'llvm-pass-helpers.{cpp,h}'

@vchuravy
Sponsor Member

vchuravy commented May 3, 2019

@Keno good to merge?

@jonathanvdc
Contributor Author

@Keno Any chance we can get this merged in the near future?

@Keno Keno merged commit d6fe9c1 into JuliaLang:master Jun 3, 2019
@Keno
Member

Keno commented Jun 3, 2019

I reviewed this a while back and thought it was broadly fine, so I went ahead and merged it. There were a couple of minor things, but we can take care of those as part of a larger look at how this all works, which I don't think we need to do right this second.

@jonathanvdc
Contributor Author

Awesome! Thanks!

@vtjnash
Sponsor Member

vtjnash commented Jun 3, 2019

Might have been good to fix the commit message for that merge, so that it didn't still say work-in-progress, but it seems good to have this now! (With the proviso that we may break it whenever we find that we need to, though that is unlikely to be frequent.)

@jonathanvdc
Contributor Author

Hmmm. That's interesting. Your first error ("DEBUG was not declared in this scope") pertains to my changes here. I submitted a PR to fix it: #32265.

However, your second error ("TerminatorInst was not declared in this scope") seems unrelated to my changes: the line that triggers the error was added in 0c89c16, which is over two years old. So I don't really understand why you're not getting that error for 9dceb46.
