
Tuples made me sad (so I fixed them) #4042

Merged
merged 28 commits into master from kf/tuples on Dec 7, 2013

Conversation

Keno
Member

@Keno Keno commented Aug 13, 2013

Alright, this rabbit hole turned out to be a giant dungeon instead, but I think I have finally found the exit.

Basically, what this does is prevent heap allocation of tuples whenever possible, instead passing them in registers or on the stack (or wherever LLVM thinks is most efficient). An easy example is this:

julia> bar() = (1,2)

julia> code_llvm(bar,())

; Function Attrs: sspreq
define <2 x i64> @julia_bar714() #2 {
top:
  ret <2 x i64> <i64 1, i64 2>, !dbg !3834
}

whereas before it used to heap-allocate it (albeit it was smart enough to recognize that the result was constant, so it only did so once). From the old version:

julia> bar() = (1,2)

julia> code_llvm(bar,())

define %jl_value_t* @julia_bar() {
top:
  ret %jl_value_t* inttoptr (i64 62429664 to %jl_value_t*), !dbg !3407
}

A perhaps more exciting, real-world application is size on arrays, e.g.

julia> code_llvm(size,(Array{Int64,2},))

; Function Attrs: sspreq
define <2 x i64> @julia_size(%jl_value_t*) #2 {
top:
  %1 = getelementptr inbounds %jl_value_t* %0, i64 4, i32 0, !dbg !3876
  %2 = load %jl_value_t** %1, align 8, !dbg !3876
  %3 = ptrtoint %jl_value_t* %2 to i64, !dbg !3876
  %4 = insertelement <2 x i64> undef, i64 %3, i32 0, !dbg !3881, !julia_type !3882
  %5 = getelementptr inbounds %jl_value_t* %0, i64 5, i32 0, !dbg !3881
  %6 = load %jl_value_t** %5, align 8, !dbg !3881
  %7 = ptrtoint %jl_value_t* %6 to i64, !dbg !3881
  %8 = insertelement <2 x i64> %4, i64 %7, i32 1, !dbg !3881, !julia_type !3882
  ret <2 x i64> %8, !dbg !3881
}

Note that there is now no allocation involved, whereas before (again from the old version):

julia> code_llvm(size,(Array{Int64,2},))

define %jl_value_t* @julia_size(%jl_value_t*, %jl_value_t**, i32) {
top:
  %3 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*]* %3, i64 0, i64 0
  %4 = getelementptr [3 x %jl_value_t*]* %3, i64 0, i64 2, !dbg !3419
  store %jl_value_t* inttoptr (i64 2 to %jl_value_t*), %jl_value_t** %.sub, align 8
  %5 = getelementptr [3 x %jl_value_t*]* %3, i64 0, i64 1, !dbg !3419
  %6 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !3419
  %.c = bitcast %jl_value_t** %6 to %jl_value_t*, !dbg !3419
  store %jl_value_t* %.c, %jl_value_t** %5, align 8, !dbg !3419
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !3419
  store %jl_value_t* null, %jl_value_t** %4, align 8
  %7 = load %jl_value_t** %1, align 8, !dbg !3419
  %8 = getelementptr inbounds %jl_value_t* %7, i64 4, i32 0, !dbg !3419
  %9 = load %jl_value_t** %8, align 8, !dbg !3419
  %10 = ptrtoint %jl_value_t* %9 to i64, !dbg !3419
  %11 = call %jl_value_t* @jl_box_int64(i64 %10), !dbg !3424
  store %jl_value_t* %11, %jl_value_t** %4, align 8, !dbg !3424
  %12 = call %jl_value_t* @allocobj(i64 32), !dbg !3424
  %13 = getelementptr inbounds %jl_value_t* %12, i64 2, i32 0, !dbg !3424
  store %jl_value_t* %11, %jl_value_t** %13, align 8, !dbg !3424
  %14 = getelementptr inbounds %jl_value_t* %12, i64 0, i32 0, !dbg !3424
  store %jl_value_t* inttoptr (i64 28847136 to %jl_value_t*), %jl_value_t** %14, align 8, !dbg !3424
  %15 = getelementptr inbounds %jl_value_t* %12, i64 1, i32 0, !dbg !3424
  store %jl_value_t* inttoptr (i64 2 to %jl_value_t*), %jl_value_t** %15, align 8, !dbg !3424
  %16 = getelementptr inbounds %jl_value_t* %12, i64 3, i32 0, !dbg !3424
  store %jl_value_t* null, %jl_value_t** %16, align 8, !dbg !3424
  store %jl_value_t* %12, %jl_value_t** %4, align 8, !dbg !3424
  %17 = getelementptr inbounds %jl_value_t* %7, i64 5, i32 0, !dbg !3424
  %18 = load %jl_value_t** %17, align 8, !dbg !3424
  %19 = ptrtoint %jl_value_t* %18 to i64, !dbg !3424
  %20 = call %jl_value_t* @jl_box_int64(i64 %19), !dbg !3424
  store %jl_value_t* %20, %jl_value_t** %16, align 8, !dbg !3424
  %21 = load %jl_value_t** %5, align 8, !dbg !3424
  %22 = getelementptr inbounds %jl_value_t* %21, i64 0, i32 0, !dbg !3424
  store %jl_value_t** %22, %jl_value_t*** @jl_pgcstack, align 8, !dbg !3424
  ret %jl_value_t* %12, !dbg !3424
}

Another exciting part of this pull request is that if your tuple type happens to correspond to a machine vector type (a.k.a. SIMD type), it is represented as exactly that, allowing fun things like

julia> foo(x::(Int32,Int32,Int32,Int32)) = llvmcall("""%3 = add <4 x i32> %1, %0
       ret <4 x i32> %3""",(Int32,Int32,Int32,Int32),((Int32,Int32,Int32,Int32),(Int32,Int32,Int32,Int32)),(int32(1),int32(2),int32(3),int32(4)),
       x)
# methods for generic function foo
foo(x::(Int32,Int32,Int32,Int32)) at none:1

julia> code_llvm(foo,((Int32,Int32,Int32,Int32),))
; Number of arguments: 2
define <4 x i32> @julia_foo1(<4 x i32> ,<4 x i32> ) {
%3 = add <4 x i32> %1, %0
ret <4 x i32> %3
}
define linkonce_odr <4 x i32> @julia_foo1(<4 x i32>, <4 x i32>) {
  %3 = add <4 x i32> %1, %0
  ret <4 x i32> %3
}


; Function Attrs: sspreq
define <4 x i32> @julia_foo(<4 x i32>) #2 {
top:
  %1 = add <4 x i32> %0, <i32 1, i32 2, i32 3, i32 4>
  ret <4 x i32> %1, !dbg !3798
}

julia> code_native(foo,((Int32,Int32,Int32,Int32),))
    .text
Filename: none
Source line: 1
    push    RBP
    mov RBP, RSP
    sub RSP, 16
    mov RAX, QWORD PTR FS:[40]
    mov QWORD PTR [RBP - 8], RAX
    mov RAX, QWORD PTR FS:[40]
    cmp RAX, QWORD PTR [RBP - 8]
    jne 20
    movabs  RAX, 140737297706192
    paddd   XMM0, XMMWORD PTR [RAX]
Source line: 1
    add RSP, 16
    pop RBP
    ret
    movabs  RAX, 140737322375370
    call    RAX

Note in particular the paddd in the native disassembly. The llvmcall used here is a new intrinsic that makes it easy to embed arbitrary LLVM IR in your Julia functions. It is part of this pull request mostly for demonstration purposes, but it is fully featured and ready to be used; if people would rather not have it, I can rebase it out.
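
For reference, here is a minimal sketch of how the intrinsic is invoked, following the same calling convention as the foo example above (the add1 name is mine, and the exact signature may still change before this lands):

add1(x::Int32) = Base.llvmcall("""%2 = add i32 %0, 1
    ret i32 %2""", Int32, (Int32,), x)

add1(int32(41))   # should compile down to a single integer add and return 42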

Tuple unboxing is currently only enabled for immutable types due to GC considerations, but I will follow up with that patch after discussion on this one is done (it already works fine for other types if you disable the GC).
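
As a rough illustration of that restriction (a sketch only; the f/g names are mine and the exact codegen depends on this branch):

f() = (1, 2)        # elements are immutable bits types, so the tuple can stay unboxed
g() = ([1], [2])    # elements are mutable (Array), so the tuple still needs GC-visible storage
code_llvm(f, ())    # expected: returns <2 x i64> directly, like the bar example above
code_llvm(g, ())    # expected: still goes through jl_value_t* and heap allocation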

Fixes #2496
Fixes #2299
/cc @JeffBezanson

@@ -112,6 +115,13 @@ void __attribute__(()) __stack_chk_fail()
static DIBuilder *dbuilder;
static std::map<int, std::string> argNumberStrings;
static FunctionPassManager *FPM;
static PassManager *PM;
Member

Am I overlooking something, or is this unused?

Member Author

No, you're right. I used to have an inlining pass in there, but that didn't actually work.

@timholy
Sponsor Member

timholy commented Aug 13, 2013

This is huge! Really amazing work.

@StefanKarpinski
Sponsor Member

Very cool stuff. Have you observed any major speedups due to this change?

@JeffBezanson
Sponsor Member

You should benchmark complex sqrt and log, since they use ssqs which returns a tuple.
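
Something along these lines should show the difference (a sketch only; the 10^6 element count is arbitrary and timings will vary by machine):

z = complex(rand(10^6), rand(10^6))
sqrt(z)          # warm up / force compilation
@time sqrt(z)    # complex sqrt calls ssqs, which returns a two-element tuple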

@ViralBShah
Member

Really cool!

@Keno
Member Author

Keno commented Aug 13, 2013

Here's a fun one:

julia> foo(x::(Nothing,Nothing)) = x
# methods for generic function foo
foo(x::(Nothing,Nothing)) at none:1

julia> foo((nothing,nothing))
(nothing,nothing)

julia> code_llvm(foo,((Nothing,Nothing),))

define void @julia_foo631() {
top:
  ret void, !dbg !3461
}

@ViralBShah
Member

Marking this as 0.3 milestone.

@Keno
Member Author

Keno commented Aug 16, 2013

Took longer than I had hoped, but I feel this is done now. Complex sqrt for example got about 2x faster.

@lindahua
Contributor

Excellent!

I think the llvmcall intrinsic will eventually unlock the ability to use SIMD instructions? That would be a huge step.

@ViralBShah
Member

With the recent cleanup of the floating-point internals as part of the float16 work, it should be much easier to get SIMD capabilities. I hope we will have it in 0.3.

@Keno
Member Author

Keno commented Aug 19, 2013

I should have been clearer. This also adds SIMD types. In particular, the (Int32,Int32,Int32,Int32) in the example above is a SIMD vector; note that the llvmcall expands to a packed add.

@ViralBShah
Member

That is really cool.

@GunnarFarneback
Contributor

I still need to figure out how to do SIMD loading and storing of data and I don't know what it takes to obtain more exotic SIMD instructions like palignr, but llvmcall can definitely be used to generate some SIMD instructions. Here's another teaser, computing x*y+z and producing very clean code on a Sandy Bridge processor with 64-bit Linux:

typealias Float32x4 (Float32,Float32,Float32,Float32)
function bar(x::Float32x4, y::Float32x4, z::Float32x4)
    xy = Base.llvmcall("""%3 = fmul <4 x float> %1, %0                                                                                                                                                  
ret <4 x float> %3""", Float32x4, (Float32x4, Float32x4), x, y)
    Base.llvmcall("""%3 = fadd <4 x float> %1, %0                                                                                                                                                       
ret <4 x float> %3""", Float32x4, (Float32x4, Float32x4), xy, z)
end
julia> code_native(bar, (Float32x4, Float32x4, Float32x4))
        .text
Filename: none
Source line: 4
        push    RBP
        mov     RBP, RSP
        vmulps  XMM0, XMM0, XMM1
        vaddps  XMM0, XMM0, XMM2
Source line: 4
        pop     RBP
        ret

256 bit SIMD with AVX also works:

typealias Float32x8 (Float32, Float32, Float32, Float32, Float32, Float32, Float32, Float32)
function bar(x::Float32x8, y::Float32x8, z::Float32x8)
    xy = Base.llvmcall("""%3 = fmul <8 x float> %1, %0                                                                                                                                                  
ret <8 x float> %3""", Float32x8, (Float32x8, Float32x8), x, y)
    Base.llvmcall("""%3 = fadd <8 x float> %1, %0                                                                                                                                                       
ret <8 x float> %3""", Float32x8, (Float32x8, Float32x8), xy, z)
end
julia> code_native(bar, (Float32x8, Float32x8, Float32x8))
        .text
Filename: none
Source line: 4
        push    RBP
        mov     RBP, RSP
        vmulps  YMM0, YMM0, YMM1
        vaddps  YMM0, YMM0, YMM2
Source line: 4
        pop     RBP
        ret

@Keno
Member Author

Keno commented Aug 19, 2013

I should really make NTuple{Float32,8} work as a notation.

@JeffBezanson
Sponsor Member

NTuple{8,Float32} already works; it is treated as the same type.

@Keno
Member Author

Keno commented Aug 19, 2013

It didn't work when I tested it, but I will have another look.

@lindahua
Contributor

Just tested, NTuple{N,T} works, and NTuple{4, Int} is exactly just (Int, Int, Int, Int).

Looking forward to using llvmcall, then we can revive the discussion of SIMD design!
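
For reference, a quick way to check that equivalence from the REPL (pre-1.0 tuple-type syntax as used throughout this thread; the comments show the expected results):

NTuple{4,Int} === (Int, Int, Int, Int)   # expected: true
typealias Float32x4 NTuple{4,Float32}    # same type as (Float32,Float32,Float32,Float32)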

@Keno
Member Author

Keno commented Aug 19, 2013

Must've been doing something wrong then. Thanks for testing.

@ghost ghost assigned JeffBezanson Sep 18, 2013
@Keno
Member Author

Keno commented Oct 5, 2013

Rebased, for those following along at home ;).

@Keno Keno mentioned this pull request Nov 1, 2013
@ArchRobison
Contributor

Mapping tuples to the native vector datatypes is nifty. I want to try it out, but the fork is only partially working for me. I'm on a 64-bit Ubuntu 12.0 Linux box with a Haswell processor, and am running into a problem. The symptom is that julia-readline exits after I start it.

$ /localdisk/adrobiso/simd/julia/usr/bin/julia-readline
$

If I put the examples in an input file, it works fine. For example:

$ cat /tmp/a.jl
code_llvm(size,(Array{Int64,2},))
$ usr/bin/julia-readline /tmp/a.jl

define <2 x i64> @julia_size(%jl_value_t*) {
top:
  %1 = getelementptr inbounds %jl_value_t* %0, i64 4, i32 0, !dbg !2775
  %2 = load %jl_value_t** %1, align 8, !dbg !2775
  %3 = ptrtoint %jl_value_t* %2 to i64, !dbg !2775
  %4 = insertelement <2 x i64> undef, i64 %3, i32 0, !dbg !2780, !julia_type !2781
  %5 = getelementptr inbounds %jl_value_t* %0, i64 5, i32 0, !dbg !2780
  %6 = load %jl_value_t** %5, align 8, !dbg !2780
  %7 = ptrtoint %jl_value_t* %6 to i64, !dbg !2780
  %8 = insertelement <2 x i64> %4, i64 %7, i32 1, !dbg !2780, !julia_type !2781
  ret <2 x i64> %8, !dbg !2780
}

Newbie question: What's the recommended way to debug failure of the REPL? I started using gdb on usr/bin/julia-debug-basic, but was getting lost when I hit the jl_apply calls.

@staticfloat
Sponsor Member

Try running it inside of strace:

strace ./julia

That can sometimes print out interesting information.

@ViralBShah
Member

I believe this is going to be merged soon, as we will release 0.2 this weekend and branch.

@Keno
Member Author

Keno commented Nov 15, 2013

@ArchRobison I hadn't noticed, since I usually use my own REPL. That turns out to be an LLVM bug (sigh): http://llvm.org/bugs/show_bug.cgi?id=12618. I'll see if I can work around it.

@timholy
Sponsor Member

timholy commented Nov 15, 2013

Or we could just merge your REPL before we merge this patch? Hopefully we're talking a matter of days here...

@Keno
Member Author

Keno commented Nov 17, 2013

Now that 0.2 is tagged, can we finally merge this?

@ihnorton
Member

+1

@StefanKarpinski
Sponsor Member

Yes. This is a non-breaking change, right? So does that mean it can be merged into the release-0.2 branch?

@Keno
Member Author

Keno commented Nov 30, 2013

@JeffBezanson, anything else? I'd love to merge this soon so I can put up some of the followup stuff.

@Keno
Member Author

Keno commented Dec 5, 2013

@JeffBezanson bump. Let's get this merged soon.

JeffBezanson added a commit that referenced this pull request Dec 7, 2013
Tuples made me sad (so I fixed them)
@JeffBezanson JeffBezanson merged commit 3db073d into master Dec 7, 2013
@johnmyleswhite
Member

This sped up the library I'm working on to 66% of its previous runtime. Well done!

@timholy
Sponsor Member

timholy commented Dec 7, 2013

oh boy oh boy oh boy oh boy

julia> include("testgrid.jl")  # tests interpolation in Grid.jl
elapsed time: 0.647127396 seconds (175658452 bytes allocated)

# Now git pull && make, restart julia

julia> include("testgrid.jl")
elapsed time: 0.339640071 seconds (48006676 bytes allocated)

We are finally on par with Matlab's griddedInterpolant.

if (iity->getBitWidth() > 32)
    i = builder.CreateTrunc(i, T_int32);
else if (iity->getBitWidth() < 32)
    i = builder.CreateZExt(i, T_int32);
Sponsor Member

There's also a CreateZExtOrTrunc function, although it just implements this same logic.

Member Author

Ah, yes. Fair point. I was trying to remember what it was called, but then again, as you noted, it doesn't really matter.

@IainNZ
Member

IainNZ commented Dec 7, 2013

Prelim speed center results:

  • ATLAS builds of bench_eu_devec, bench_eu_vev, mandel, raytracer, and stockcorr became as fast as the OpenBLAS times (why?! - also some ATLAS builds maybe got slower too...)
  • gk, go, parse_int, randn_zig got faster

@ViralBShah
Member

No change for sparse matvec. I was not expecting one, but tried it just in case.

@ViralBShah
Member

nbody_vec seems to have slowed down a lot. @staticfloat nbody_vec does not seem to be running on criid.

@IainNZ
Member

IainNZ commented Dec 7, 2013

I should note I was looking at "criid".

@Keno Keno mentioned this pull request Dec 7, 2013
@staticfloat
Sponsor Member

@ViralBShah Already been reported; in fact, none of the shootout/ benchmarks are getting run, because we're dying somewhere in the process.

@timholy timholy mentioned this pull request Dec 7, 2013
stevengj added a commit that referenced this pull request Dec 7, 2013
gitfoxi pushed a commit to gitfoxi/julia that referenced this pull request Dec 13, 2013
* upstream/master: (53 commits)
  edit embedding chapter
  formatting fixes
  Fix JuliaLang#5056
  more consitent/useful description of predicate in ie all()
  NEWS for JuliaLang#4042
  Add sparse matvec to perf benchmark (JuliaLang#4707)
  Use warn_once instead of warn (JuliaLang#5038)
  Use warn() instead of println in base/sprase/sparsematrix.jl
  allow scale(b,A) or scale(A,b) when b is a scalar as well as a vector, don't restrict scale unnecessarily to arrays of numbers (e.g. scaling arrays of arrays should work), and improve documentation for scale!
  Added section about memory management
  added negative bitarray rolling
  More accurate linspace for types with greater precision than Float64
  add realmin realmax for BigFloat
  fix eps(realmax), add typemin and typemax for BigFloat
  fix JuliaLang#5025, ordering used by nextfloat/prevfloat
  roll back on errors during type definitions. fixes JuliaLang#5009
  improve method signature errors. fixes JuliaLang#5018
  update juliadoc
  add more asserts for casts
  Fix segfaulting on strstr() failure
  ...
@Keno Keno deleted the kf/tuples branch August 10, 2014 18:47
Successfully merging this pull request may close these issues: more efficient tuple representation; SIMD types.