
Add an LLVM pass to remove Julia-specific address space information. #35645

Merged
maleadt merged 3 commits into master on May 3, 2020

Conversation

@maleadt (Member) commented Apr 29, 2020

Fixes #22414:

julia> @code_llvm debuginfo=:none [1]
define nonnull %jl_value_t addrspace(10)* @julia_vect_135(i64) {
top:
  %1 = call %jl_value_t addrspace(10)* inttoptr (i64 140092205758784 to %jl_value_t addrspace(10)* (%jl_value_t addrspace(10)*, i64)*)(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140091990275760 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64 1)
  %2 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %3 = bitcast %jl_value_t addrspace(11)* %2 to i64 addrspace(13)* addrspace(11)*
  %4 = load i64 addrspace(13)*, i64 addrspace(13)* addrspace(11)* %3, align 8
  store i64 %0, i64 addrspace(13)* %4, align 8
  ret %jl_value_t addrspace(10)* %1
}

becomes

define nonnull %jl_value_t* @julia_vect_353(i64) {
top:
  %1 = call %jl_value_t* inttoptr (i64 140357775323744 to %jl_value_t* (%jl_value_t*, i64)*)(%jl_value_t* inttoptr (i64 140357553911312 to %jl_value_t*), i64 1)
  %2 = bitcast %jl_value_t* %1 to i64**
  %3 = load i64*, i64** %2, align 8
  store i64 %0, i64* %3, align 8
  ret %jl_value_t* %1
}

Maintains target-specific address spaces:

julia> function foo(a)
           @inbounds a[1] = 1
           return
       end
foo (generic function with 1 method)

julia> CUDAnative.code_llvm(foo, Tuple{CuDeviceArray{Float32,2,AS.Global}}; debuginfo=:none)
define dso_local void @julia_foo_4061({ [2 x i64], i64 } addrspace(11)* nocapture nonnull readonly dereferenceable(24)) {
top:
  %1 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 } addrspace(11)* %0, i64 0, i32 1
  %2 = bitcast i64 addrspace(11)* %1 to float* addrspace(11)*
  %3 = load float*, float* addrspace(11)* %2, align 8
  %4 = addrspacecast float* %3 to float addrspace(1)*
  store float 1.000000e+00, float addrspace(1)* %4, align 4
  ret void
}

becomes:

define dso_local void @julia_foo_4055({ [2 x i64], i64 }* nocapture nonnull readonly dereferenceable(24)) {
top:
  %1 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 1
  %2 = bitcast i64* %1 to float**
  %3 = load float*, float** %2, align 8
  %4 = addrspacecast float* %3 to float addrspace(1)*
  store float 1.000000e+00, float addrspace(1)* %4, align 4
  ret void
}
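
(The before/after pairs above amount to a remapping rule along these lines: Julia's internal address spaces collapse to the generic one, while target-defined address spaces, such as the addrspace(1) "global" space in the CUDAnative example, are preserved. The sketch below illustrates that rule and is not the PR's exact code; the numbers 10 through 13 are Julia's Tracked, Derived, CalleeRooted and Loaded address spaces as used by its codegen.)

// Collapse Julia's internal address spaces to 0 and keep everything else.
static unsigned remapAddrspace(unsigned AS) {
    const unsigned FirstJuliaAS = 10; // Tracked
    const unsigned LastJuliaAS = 13;  // Loaded
    if (AS >= FirstJuliaAS && AS <= LastJuliaAS)
        return 0; // generic address space
    return AS;    // target-specific address spaces are left alone
}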

@maleadt maleadt added the compiler:codegen Generation of LLVM IR and native code label Apr 29, 2020
@maleadt maleadt marked this pull request as draft April 29, 2020 17:12
@vchuravy (Sponsor Member) left a comment

I like how short this is! It would be good if we could run @code_llvm both with and without address spaces, so that the AS information is still easily accessible when one needs to debug it.

src/aotcompile.cpp (inline review comment; outdated, resolved)
@maleadt (Member, Author) commented Apr 30, 2020

> It would be good if we could run @code_llvm both with and without address spaces, so that the AS information is still easily accessible when one needs to debug it.

I'm not sure it's worth a dedicated option; we don't have one for e.g. not lowering GC intrinsics either (which are similarly internal to Julia's IR).

@maleadt maleadt marked this pull request as ready for review April 30, 2020 11:29
@maleadt (Member, Author) commented Apr 30, 2020

Should be good to go, so feel free to review. I'd like this to be part of 1.5, as it is necessary for the SPIR-V compiler (and I'd rather not patch yet another back-end). Related to that, this PR should make it possible to remove those address space-related LLVM patches we've been carrying.

@Keno (Member) commented Apr 30, 2020

This is probably fine, but I do have to say I'm not a fan of all the extra memory traffic and latency this introduces.

@vchuravy (Sponsor Member) commented

If we switch to GlobalISel we might have to do this (https://www.youtube.com/watch?v=Oj1BNoL1jpM&t=143s), or backends generally need to start documenting which address spaces are okay for frontends to use. We can turn this off by default, but on the other hand it makes the IR much nicer to read ;)

@maleadt (Member, Author) commented Apr 30, 2020

Yeah, the pass is not cheap: around 2% of total LLVM pass time:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 33.9226 seconds (33.9245 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   7.0328 ( 22.7%)   0.6318 ( 21.1%)   7.6646 ( 22.6%)   7.6644 ( 22.6%)  X86 DAG->DAG Instruction Selection
   2.2434 (  7.3%)   0.5846 ( 19.5%)   2.8279 (  8.3%)   2.8283 (  8.3%)  X86 Assembly Printer
   2.3667 (  7.7%)   0.1848 (  6.2%)   2.5515 (  7.5%)   2.5501 (  7.5%)  Combine redundant instructions
   1.4999 (  4.8%)   0.1041 (  3.5%)   1.6040 (  4.7%)   1.6035 (  4.7%)  Greedy Register Allocator
   1.1356 (  3.7%)   0.0420 (  1.4%)   1.1776 (  3.5%)   1.1762 (  3.5%)  Global Value Numbering
   1.0332 (  3.3%)   0.0721 (  2.4%)   1.1052 (  3.3%)   1.1043 (  3.3%)  Induction Variable Simplification
   0.9135 (  3.0%)   0.0449 (  1.5%)   0.9584 (  2.8%)   0.9575 (  2.8%)  Loop Strength Reduction
   0.8980 (  2.9%)   0.0308 (  1.0%)   0.9288 (  2.7%)   0.9275 (  2.7%)  Combine redundant instructions #2
   0.7757 (  2.5%)   0.0312 (  1.0%)   0.8069 (  2.4%)   0.8058 (  2.4%)  Combine redundant instructions #3
-- 0.6786 (  2.2%)   0.0724 (  2.4%)   0.7510 (  2.2%)   0.7501 (  2.2%)  Remove IR address space information.
   0.6703 (  2.2%)   0.0741 (  2.5%)   0.7444 (  2.2%)   0.7429 (  2.2%)  Machine Instruction Scheduler
   0.5619 (  1.8%)   0.0247 (  0.8%)   0.5866 (  1.7%)   0.5858 (  1.7%)  Loop Invariant Code Motion
   0.3563 (  1.2%)   0.0301 (  1.0%)   0.3864 (  1.1%)   0.3852 (  1.1%)  CodeGen Prepare
   0.3377 (  1.1%)   0.0147 (  0.5%)   0.3525 (  1.0%)   0.3517 (  1.0%)  SLP Vectorizer
   0.3007 (  1.0%)   0.0365 (  1.2%)   0.3372 (  1.0%)   0.3362 (  1.0%)  Late Lower GCFrame Pass
...
  30.9312 (100.0%)   2.9914 (100.0%)  33.9226 (100.0%)  33.9245 (100.0%)  Total

We could whitelist it for targets that are known not to care about random address spaces (like X86). But I'd definitely enable it for reflection. I'll have a look, maybe I can optimize the implementation a little.
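
(For illustration, a minimal sketch of what such a whitelist could look like when assembling the pass pipeline. The pass-constructor name and the choice of architectures are assumptions for the example, not code from this PR.)

#include <llvm/ADT/Triple.h>
#include <llvm/IR/LegacyPassManager.h>
#include <llvm/Pass.h>
using namespace llvm;

// Assumed entry point of the new pass; declared here only for the sketch.
extern Pass *createRemoveJuliaAddrspacesPass();

// Only schedule the addrspace-removal pass for targets assumed not to care
// about nonstandard address spaces; x86 is used purely as an example.
static void maybeStripAddrspaces(legacy::PassManagerBase &PM, const Triple &TT) {
    if (TT.getArch() == Triple::x86 || TT.getArch() == Triple::x86_64)
        PM.add(createRemoveJuliaAddrspacesPass());
}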

@maleadt (Member, Author) commented May 1, 2020

I removed the pass from the default pipeline; it's now only called during reflection. We can always add it back when it's needed, or if specific targets require it. I also removed the NVPTX and WASM address space patches, since these aren't host targets and we can just run the address space removal pass from the respective compilers (cc @tshort, see JuliaGPU/GPUCompiler.jl#22).

@maleadt maleadt merged commit b8d9e78 into master May 3, 2020
@maleadt maleadt deleted the tb/strip_addrspace branch May 3, 2020 06:32
@Keno (Member) commented May 3, 2020

I do still think we should start a discussion with upstream about reserving some AS numbers for frontend usage and requiring the backends to ignore them. Any amount of overhead seems sad for something that could easily be fixed by an if/else in the backend. The reflection use case is legitimate and not performance-sensitive, of course.

Inline review comment on the declaration Type *remapType(Type *SrcTy):
Sponsor Member:
This algorithm here doesn't appear to have a guaranteed fixed point (it doesn't deal with the fact that PointerType->getElementType can form cycles in the type tree). We avoid emitting any code like that, so it's probably okay for us, but probably not for upstream.

Member Author:

How does LLVM typically deal with that? A Seen set?
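
(For reference, a common way to make this terminate is a memoization table keyed on the source type, in the spirit of the TypeMapTy helper that LLVM's IR linker uses: an identified struct gets an empty placeholder published in the map before its body is visited, so a pointer that cycles back to the struct finds the placeholder instead of recursing forever. The sketch below is illustrative only, not this PR's code, the names are made up, and it ignores arrays, vectors, and literal structs for brevity.)

#include <llvm/ADT/DenseMap.h>
#include <llvm/ADT/SmallVector.h>
#include <llvm/IR/DerivedTypes.h>
using namespace llvm;

struct AddrspaceTypeRemapper {
    // Maps already-visited source types to their address-space-free versions;
    // doubles as the "Seen" set that breaks cycles through named structs.
    DenseMap<Type *, Type *> MappedTypes;

    Type *remapType(Type *SrcTy) {
        auto It = MappedTypes.find(SrcTy);
        if (It != MappedTypes.end())
            return It->second; // already mapped (or currently being mapped)

        Type *DstTy = SrcTy;
        if (auto *Ptr = dyn_cast<PointerType>(SrcTy)) {
            DstTy = PointerType::get(remapType(Ptr->getElementType()),
                                     /*AddressSpace=*/0);
        } else if (auto *ST = dyn_cast<StructType>(SrcTy)) {
            if (ST->hasName()) {
                // Create the placeholder and publish it *before* visiting the
                // body, so self-referential element types terminate.
                StructType *NewST =
                    StructType::create(SrcTy->getContext(), ST->getName());
                MappedTypes[SrcTy] = NewST;
                SmallVector<Type *, 8> Elements;
                for (Type *Elem : ST->elements())
                    Elements.push_back(remapType(Elem));
                NewST->setBody(Elements, ST->isPacked());
                return NewST;
            }
        }
        MappedTypes[SrcTy] = DstTy;
        return DstTy;
    }
};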

Comment on lines +140 to +142:

    Type *SrcTy = remapType(
        cast<PointerType>(Src->getType()->getScalarType())
            ->getElementType());
Sponsor Member:

Suggested change:

    -    Type *SrcTy = remapType(
    -        cast<PointerType>(Src->getType()->getScalarType())
    -            ->getElementType());
    +    Type *SrcTy = remapType(cast<GetElementPtrConstantExpr>(CE)
    +                                ->getSourceElementType());

Member Author:

I don't think we can use GetElementPtrConstantExpr; it looks like a private class?
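
(If GetElementPtrConstantExpr is indeed private, one public alternative might be the GEPOperator wrapper from llvm/IR/Operator.h, which matches both GEP instructions and GEP constant expressions and exposes the source element type directly. This is an observation about the LLVM API, not necessarily what the PR ended up merging.)

#include <llvm/IR/Operator.h>
using namespace llvm;

// In the context of the suggestion above, with CE being the GEP ConstantExpr
// under rewrite and remapType the function discussed earlier:
Type *SrcTy = remapType(cast<GEPOperator>(CE)->getSourceElementType());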

Labels: compiler:codegen (Generation of LLVM IR and native code)
Successfully merging this pull request may close: Strip address space after gcframe allocation
4 participants