
jl_create_native generates bad code for *(Float16,Float16) #34993

Closed · Fixed by #35655
maleadt opened this issue Mar 4, 2020 · 4 comments

Labels: compiler:codegen (Generation of LLVM IR and native code) · kind:invalid (Indicates that an issue or pull request is no longer relevant)
maleadt (Member) commented Mar 4, 2020

The codegen restructuring has regressed some GPU code: where we used to get fully static code, we now get invokes and calls to jfptr functions:

julia> CUDAnative.code_llvm(*, Tuple{Float16, Float16}; strict=true)
ERROR: InvalidIRError: compiling *(Float16, Float16) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jfptr_Float32_2275)

;  @ float.jl:398 within `*'
define dso_local i16 @_Z10julia_MUL_7Float167Float16(i16, i16) local_unnamed_addr {
top:
  %2 = call fastcc float @j_Float32_4187(i16 %0)
  %3 = call fastcc float @j_Float32_4188(i16 %1)


define internal fastcc float @j_Float32_4127(i16) unnamed_addr #0 {
top:
  %1 = alloca %jl_value_t addrspace(10)*, align 8
  %2 = call fastcc %jl_value_t addrspace(10)* @ptx_gc_pool_alloc(i64 2)
  %3 = bitcast %jl_value_t addrspace(10)* %2 to i16 addrspace(10)*
  store i16 %0, i16 addrspace(10)* %3, align 8
  store %jl_value_t addrspace(10)* %2, %jl_value_t addrspace(10)** %1, align 8
  %4 = call fastcc nonnull %jl_value_t addrspace(10)* @tojlinvoke4128(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139691536025776 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** nonnull %1, i32 1)


define internal fastcc %jl_value_t addrspace(10)* @tojlinvoke4128(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) unnamed_addr {
top:
  %3 = call %jl_value_t addrspace(10)* @jfptr_Float32_2275(%jl_value_t addrspace(10)* %0, %jl_value_t addrspace(10)** %1, i32 %2, %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139690687678928 to %jl_value_t*) to %jl_value_t addrspace(10)*))
  ret %jl_value_t addrspace(10)* %3
}

I'm using CUDAnative.code_llvm here, which calls jl_create_native, as InteractiveUtils.code_llvm only shows the IR of the outer function even when dumping the entire module. I guess that may be a red herring though, since generating code for the constructor itself yields the expected IR (see below).
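For comparison, the stock call would look something like the following; dump_module=true prints the whole module, but as said it still doesn't show full definitions for the callees:

julia> using InteractiveUtils

julia> InteractiveUtils.code_llvm(*, Tuple{Float16, Float16}; dump_module=true)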

Before the refactor:

;  @ float.jl:398 within `*'
define i16 @julia___29(i16, i16) {
top:
  %2 = call float @julia_Float32_17988(i16 %0)
  %3 = call float @julia_Float32_17988(i16 %1)

;  @ float.jl:209 within `Float32'
define internal float @julia_Float32_17988(i16) {
top:
; direct implementation

Generating code for the constructor directly:

julia> CUDAnative.code_llvm(Float32, Tuple{Float16}; strict=true)

;  @ float.jl:209 within `Float32'
define dso_local float @_Z13julia_Float327Float16(%jl_value_t addrspace(10)*, i16) {
top:
; direct implementation

Ref #25984, JuliaGPU/CUDAnative.jl#162 (comment)

maleadt added the kind:regression (Regression in behavior compared to a previous version), compiler:codegen (Generation of LLVM IR and native code), and domain:gpu (Affects running Julia on a GPU) labels on Mar 4, 2020
vtjnash added the kind:invalid (Indicates that an issue or pull request is no longer relevant) label and removed the domain:gpu and kind:regression labels on Mar 4, 2020
vtjnash (Member) commented Mar 4, 2020

This is generally what it means to be non-recursive in codegen. jl_create_native should now give you back exactly the code it was asked for, and not recursively emit a lot of extra, unwarranted functions. It's probably still doing too much extra work, so it's not quite right yet, but it's getting closer to that goal.
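In other words, a caller that wants a self-contained module is expected to enumerate the callees itself and request them all at once, roughly along these lines (hypothetical sketch; mi_mul and mi_float32 stand for the MethodInstances of * and the Float32 constructor, obtained the same way as in the MWE further down):

# hypothetical sketch: ask for caller and callee together, so that
# neither has to be reached through a tojlinvoke stub
params = Base.CodegenParams()
native_code = ccall(:jl_create_native, Ptr{Cvoid},
                    (Vector{Core.MethodInstance}, Base.CodegenParams),
                    [mi_mul, mi_float32], params)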

maleadt (Member, Author) commented Mar 4, 2020

OK, I thought jl_create_native would collect the IR of all needed functions, similarly to how codegen worked before (but with a local cache instead of the global one). Any thoughts on how to do that, since GPU codegen generally requires that kind of fully static code (i.e. without the invoke/jfptr indirections)?

Keno (Member) commented Mar 4, 2020

I've used #33955 for that kind of thing in experiments, which I suspect may be the correct solution eventually, but obviously it would be nice to have a workaround for now.

maleadt (Member, Author) commented Mar 5, 2020

"This is generally what it means to be non-recursive in codegen."

It's one thing to emit placeholder functions that will later be replaced with (a call to) the final function; it's another to have these placeholders allocate GC frames and perform non-specsig calls.

There also seems to be some caching happening:

julia> println(codegen(*, Tuple{Float16, Float16}))

define internal i16 @julia_MUL._79(i16, i16) {
top:
  %2 = call %jl_value_t*** @julia.ptls_states()
  %3 = bitcast %jl_value_t*** %2 to %jl_value_t addrspace(10)**
  %4 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i64 4
  %5 = bitcast %jl_value_t addrspace(10)** %4 to i64**
  %6 = load i64*, i64** %5
  %7 = call float @julia_Float32_82(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140516592404080 to %jl_value_t*) to %jl_value_t addrspace(10)*), i16 %0)
  %8 = call float @julia_Float32_82(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140516592404080 to %jl_value_t*) to %jl_value_t addrspace(10)*), i16 %1)

define internal float @julia_Float32_82(%jl_value_t addrspace(10)*, i16) {
top:
; actual impl

Why is there an additional pointer argument here?

Anyway, after calling this function on the CPU, jl_create_native returns vastly different code:

julia> kernel(a,b) = a*b
kernel (generic function with 1 method)

julia> kernel(Float16(1),Float16(2))
Float16(2.0)

julia> println(codegen(*, Tuple{Float16, Float16}))

define internal i16 @julia_MUL._131(i16, i16) {
top:
  %2 = call %jl_value_t*** @julia.ptls_states()
  %3 = bitcast %jl_value_t*** %2 to %jl_value_t addrspace(10)**
  %4 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i64 4
  %5 = bitcast %jl_value_t addrspace(10)** %4 to i64**
  %6 = load i64*, i64** %5
  %7 = call float @j_Float32_132(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140516592404080 to %jl_value_t*) to %jl_value_t addrspace(10)*), i16 %0)
  %8 = call float @j_Float32_133(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140516592404080 to %jl_value_t*) to %jl_value_t addrspace(10)*), i16 %1)

define internal float @j_Float32_132(%jl_value_t addrspace(10)*, i16) #7 {
top:
  %2 = call %jl_value_t*** @julia.ptls_states()
  %3 = bitcast %jl_value_t*** %2 to %jl_value_t addrspace(10)**
  %4 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i64 4
  %5 = bitcast %jl_value_t addrspace(10)** %4 to i64**
  %6 = load i64*, i64** %5
  %7 = bitcast %jl_value_t*** %2 to i8*
  %8 = call noalias nonnull %jl_value_t addrspace(10)* @julia.gc_alloc_obj(i8* %7, i64 2, %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140516592403568 to %jl_value_t*) to %jl_value_t addrspace(10)*)) #6
  %9 = bitcast %jl_value_t addrspace(10)* %8 to i16 addrspace(10)*
  store i16 %1, i16 addrspace(10)* %9, align 8
  %10 = call cc37 nonnull %jl_value_t addrspace(10)* bitcast (%jl_value_t addrspace(10)* (%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32)* @tojlinvoke135 to %jl_value_t addrspace(10)* (%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)*)*)(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140516592404080 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* %8)
  %11 = call %jl_value_t addrspace(10)* @julia.typeof(%jl_value_t addrspace(10)* %10)
  %12 = icmp eq %jl_value_t addrspace(10)* %11, addrspacecast (%jl_value_t* inttoptr (i64 140516592404080 to %jl_value_t*) to %jl_value_t addrspace(10)*)
  br i1 %12, label %pass, label %fail

fail:                                             ; preds = %top
  %13 = addrspacecast %jl_value_t addrspace(10)* %10 to %jl_value_t addrspace(12)*
  call void @jl_type_error(i8* inttoptr (i64 94849326539024 to i8*), %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140516592404080 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(12)* %13)
  unreachable

pass:                                             ; preds = %top
  %14 = bitcast %jl_value_t addrspace(10)* %10 to float addrspace(10)*
  %15 = load float, float addrspace(10)* %14, align 4
  ret float %15
}

define internal %jl_value_t addrspace(10)* @tojlinvoke135(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) {
top:
  %3 = call %jl_value_t addrspace(10)* @jfptr_Float32_107(%jl_value_t addrspace(10)* %0, %jl_value_t addrspace(10)** %1, i32 %2, %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140516541103984 to %jl_value_t*) to %jl_value_t addrspace(10)*))
  ret %jl_value_t addrspace(10)* %3
}

declare %jl_value_t addrspace(10)* @jfptr_Float32_107(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32, %jl_value_t addrspace(10)*)

The calls to julia_Float32_82 got replaced by two different calls, to j_Float32_132 and j_Float32_133, whose implementations now contain those tojlinvoke placeholders that require a GC frame. The placeholder is supposed to call jfptr_Float32_107, but that function isn't even part of the module; it is only declared.

MWE without CUDAnative:

using Core.Compiler: MethodInstance
using Base: _methods_by_ftype

function codegen(f,tt)
    # get the method instance
    world = typemax(UInt)
    sig = Base.signature_type(f, tt)
    mthds = _methods_by_ftype(sig, -1, world)
    Base.isdispatchtuple(tt) || error("$tt is not a dispatch tuple")
    length(mthds) == 1 || throw(MethodError(f, tt))
    mtypes, msp, m = mthds[1]
    method_instance = ccall(:jl_specializations_get_linfo, Ref{MethodInstance}, (Any, Any, Any), m, mtypes, msp)

    # generate ir
    params = Base.CodegenParams()
    native_code = ccall(:jl_create_native, Ptr{Cvoid},
                        (Vector{Core.MethodInstance}, Base.CodegenParams),
                        [method_instance], params)
    @assert native_code != C_NULL
    llvm_mod = ccall(:jl_get_llvm_module, Ptr{Cvoid},
                            (Ptr{Cvoid},), native_code)
    @assert llvm_mod != C_NULL

    # get the top-level code
    code = Core.Compiler.inf_for_methodinstance(method_instance, world, world)

    # get the top-level function index
    llvm_func_idx = Ref{Int32}(-1)
    llvm_specfunc_idx = Ref{Int32}(-1)
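    # jl_breakpoint is a no-op in the runtime; it only serves as a convenient
    # symbol to break on from a native debugger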
    ccall(:jl_breakpoint, Nothing, ())
    ccall(:jl_get_function_id, Nothing,
            (Ptr{Cvoid}, Any, Ptr{Int32}, Ptr{Int32}),
            native_code, code, llvm_func_idx, llvm_specfunc_idx)
    @assert llvm_func_idx[] != -1
    @assert llvm_specfunc_idx[] != -1

    # get the top-level function
    llvm_func = ccall(:jl_get_llvm_function, Ptr{Cvoid},
                    (Ptr{Cvoid}, UInt32), native_code, llvm_func_idx[]-1)
    llvm_specfunc = ccall(:jl_get_llvm_function, Ptr{Cvoid},
                        (Ptr{Cvoid}, UInt32), native_code, llvm_specfunc_idx[]-1)
    @assert llvm_specfunc != C_NULL

    # dump ir
    ccall(:jl_dump_function_ir, Ref{String},
          (Ptr{Cvoid}, Bool, Bool, Ptr{UInt8}),
          llvm_specfunc, true, true, :none)
end
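For reference, the modules shown earlier in this comment come from calling this helper directly, once in a fresh session and once after executing the method on the CPU:

julia> println(codegen(*, Tuple{Float16, Float16}))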
