Performance regression for exp, sin, cos #17751

Closed
cortner opened this issue Aug 2, 2016 · 56 comments · Fixed by #17864 or #17912
Labels
domain:maths (Mathematical functions) · kind:regression (Regression in behavior compared to a previous version) · performance (Must go faster) · status:priority (This should be addressed urgently)
Milestone
0.5.x

Comments

@cortner commented Aug 2, 2016

function test0(N)
   r = 0.234
   for n = 1:N
      exp(r)
   end
end
test0(10)
@time test0(1_000_000)
@time test0(1_000_000)
@time test0(1_000_000)

Output in v0.4:

$ julia exp_test.jl
  0.006532 seconds (148 allocations: 10.151 KB)
  0.006524 seconds (4 allocations: 160 bytes)
  0.006999 seconds (4 allocations: 160 bytes)

Output in v0.5:

$ julia5 exp_test.jl
  0.012711 seconds (133 allocations: 7.703 KB)
  0.012758 seconds (4 allocations: 160 bytes)
  0.014031 seconds (4 allocations: 160 bytes)

The same happens if I replace exp with sin or cos; I haven't tried other functions. I did test against a simple squaring function, where the performance remains the same in both versions.
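
For reference, a minimal sketch of the control benchmark mentioned above; the squaring kernel here is an assumed stand-in for what was actually run:

function test_sq(N)
   r = 0.234
   for n = 1:N
      r*r            # result discarded, mirroring the exp(r) loop above
   end
end
test_sq(10)          # warm-up call so compilation is excluded from timing
@time test_sq(1_000_000)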

LLVM code for v0.4

julia> @code_llvm test0(1_000_000)

define void @julia_test0_21678(i64) {
top:
  %1 = icmp sgt i64 %0, 0
  %2 = select i1 %1, i64 %0, i64 0
  %3 = icmp eq i64 %2, 0
  br i1 %3, label %L3, label %L

L:                                                ; preds = %L, %top
  %"#s1.0" = phi i64 [ %4, %L ], [ 1, %top ]
  %4 = add i64 %"#s1.0", 1
  %5 = call double inttoptr (i64 13299772784 to double (double)*)(double inreg 2.340000e-01)
  %6 = icmp eq i64 %"#s1.0", %2
  br i1 %6, label %L3, label %L

L3:                                               ; preds = %L, %top
  ret void
}

LLVM output for v0.5

julia> @code_llvm test0(1_000_000)

define void @julia_test0_68854(i64) #0 {
top:
  %1 = icmp slt i64 %0, 1
  br i1 %1, label %L2, label %if.preheader

if.preheader:                                     ; preds = %top
  br label %if

L2.loopexit:                                      ; preds = %if
  br label %L2

L2:                                               ; preds = %L2.loopexit, %top
  ret void

if:                                               ; preds = %if.preheader, %if
  %"#temp#.03" = phi i64 [ %2, %if ], [ 1, %if.preheader ]
  %2 = add i64 %"#temp#.03", 1
  %3 = call double inttoptr (i64 13444864816 to double (double)*)(double 2.340000e-01)
  %4 = icmp eq i64 %"#temp#.03", %0
  br i1 %4, label %L2.loopexit, label %if
}
@johnmyleswhite (Member)

I think the regression isn't in any of the functions you've listed, but in the way that this kind of no-op gets optimized. I see identical performance in 0.4 and 0.5 for the following:

function test0(N)
   r = 0.234
   s = 0.0
   for n = 1:N
      s += exp(r)
   end
end
test0(10)
@time test0(1_000_000)
@time test0(1_000_000)
@time test0(1_000_000)
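
A slightly more robust variant, as a sketch: since test0 above never uses s, returning the accumulator guards against the whole loop being eliminated as dead code.

function test1(N)
   r = 0.234
   s = 0.0
   for n = 1:N
      s += exp(r)
   end
   return s          # returning s keeps the loop from being optimized away
end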

@cortner (Author) commented Aug 2, 2016

I'm afraid not. I had already tested precisely your version as well, with the same result, hence I posted the simpler code:

Sunrise:JuLIPtests ortner$ julia exp_test.jl
  0.006522 seconds (148 allocations: 10.151 KB)
  0.006903 seconds (4 allocations: 160 bytes)
  0.006181 seconds (4 allocations: 160 bytes)
Sunrise:JuLIPtests ortner$ julia5 exp_test.jl
  0.012776 seconds (133 allocations: 7.703 KB)
  0.012746 seconds (4 allocations: 160 bytes)
  0.013848 seconds (4 allocations: 160 bytes)
julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce (2016-06-19 17:16 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
julia> versioninfo()
Julia Version 0.5.0-rc0+0
Commit 0030eec* (2016-07-26 20:22 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@yuyichao (Contributor) commented Aug 2, 2016

Have you tried 0.4 with LLVM 3.7.1 (and vice versa)?

What's the native code?

Ref https://llvm.org/bugs/show_bug.cgi?id=25277
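
As a usage sketch, the dumps in question come from the standard reflection macros:

julia> @code_llvm test0(1_000_000)     # LLVM IR, as posted above
julia> @code_native test0(1_000_000)   # the machine code that actually runs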

@cortner (Author) commented Aug 2, 2016

I'm afraid I haven't explored Julia deeply enough to know how to switch LLVM versions, sorry. It is interesting, though, that @johnmyleswhite cannot reproduce the problem. Linux/Windows vs. Apple?

@johnmyleswhite (Member)

It turns out that my 0.5 build was a few days old. I'm currently building the latest commit to see if the reason I couldn't reproduce your results is that the regression happened in the last 5 days. Will report back once the build is finished.

@johnmyleswhite (Member)

Still don't see it with the latest commit on master for 0.5:

julia> versioninfo()
Julia Version 0.5.0-rc0+122
Commit 633443c (2016-08-02 00:53 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

julia> function test0(N)
          r = 0.234
          s = 0.0
          for n = 1:N
             s += exp(r)
          end
       end
test0 (generic function with 1 method)

julia> test0(10)

julia> @time test0(1_000_000)
  0.008087 seconds (133 allocations: 7.703 KB)

julia> @time test0(1_000_000)
  0.008218 seconds (4 allocations: 160 bytes)

julia> @time test0(1_000_000)
  0.007780 seconds (4 allocations: 160 bytes)
julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce (2016-06-19 17:16 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> function test0(N)
          r = 0.234
          s = 0.0
          for n = 1:N
             s += exp(r)
          end
       end
test0 (generic function with 1 method)

julia> test0(10)

julia> @time test0(1_000_000)
  0.011705 seconds (146 allocations: 9.995 KB)

julia> @time test0(1_000_000)
  0.019303 seconds (4 allocations: 160 bytes)

julia> @time test0(1_000_000)
  0.007988 seconds (4 allocations: 160 bytes)

And I get the same results for your original example. Seems like there must be something subtly different between our Julia 0.5 setups. Did you build from source?

@cortner (Author) commented Aug 2, 2016

I just built from source (I don't normally), and now I don't see the regression anymore. On the nightly that I just downloaded I do see the regression.

@cortner (Author) commented Aug 2, 2016

Julia Version 0.5.0-rc0+122
Commit 633443c (2016-08-02 00:53 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

vs

Julia Version 0.5.0-rc0+115
Commit 643bac8 (2016-08-01 21:42 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@cortner (Author) commented Aug 2, 2016

Is the nightly built on a different system? See x86_64-apple-darwin13.4.0 vs. x86_64-apple-darwin15.6.0.

@cortner (Author) commented Aug 2, 2016

Let me know if you want me to try something else; I will also try the next nightly tomorrow.

@ViralBShah added this to the 0.5.x milestone Aug 2, 2016
@ViralBShah (Member)

Is this only in the libm functions, or do you see the same perf regression if you write a function f() that takes roughly the same execution time as exp?
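
A sketch of such an experiment; f here is a hypothetical stand-in, and @noinline forces an actual function call, roughly like the libm call in the exp loop:

@noinline f(x) = x*x + 2x + 1   # cheap pure-Julia kernel, hypothetical

function test_f(N)
   r = 0.234
   for n = 1:N
      f(r)           # result discarded, mirroring the original test0
   end
end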

@yuyichao (Contributor) commented Aug 2, 2016

let me know if you want me to try something else;

#17751 (comment)

What's the native code?

@ViralBShah (Member)

To build Julia 0.4 with LLVM 3.7.1, you have to clone Julia and then change the LLVM version in deps/Versions.make.
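
As a sketch, the edit looks roughly like this; the variable name is assumed and may differ between releases:

# deps/Versions.make (variable name assumed)
LLVM_VER = 3.7.1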

@KristofferC (Member)

I can reproduce: rc0 is slow, built from source is fast. Rebuilding the sysimg file did not help. The native code is identical:

julia> @code_native test0(1_000_000)
    .text
Filename: REPL[4]
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r14
    pushq   %rbx
    subq    $16, %rsp
    movq    %rdi, %rbx
Source line: 3
    testq   %rbx, %rbx
    jle L62
    movabsq $140279242572400, %rax  # imm = 0x7F954E6C4270
Source line: 4
    vmovsd  (%rax), %xmm0           # xmm0 = mem[0],zero
    vmovsd  %xmm0, -24(%rbp)
    movabsq $140279240293456, %r14  # imm = 0x7F954E497C50
L48:
    vmovsd  -24(%rbp), %xmm0        # xmm0 = mem[0],zero
    callq   *%r14
Source line: 3
    addq    $-1, %rbx
    jne L48
Source line: 4
L62:
    addq    $16, %rsp
    popq    %rbx
    popq    %r14
    popq    %rbp
    retq
    nopw    (%rax,%rax)

@pabloferz (Contributor) commented Aug 2, 2016

I was seeing something like this a couple of months ago using the generic Linux binaries vs. building from source. The difference was that the generic binaries were using a libm included in the binary instead of the system one. Manually changing the .so files to point to the system libm made the performance equal. I wouldn't be surprised if this were also the case here.
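
One way to check which libm a given Julia binary actually resolves (a sketch; the printed path is hypothetical):

julia> Libdl.dlpath("libopenlibm")
"/usr/lib/julia/libopenlibm.so"

If this points into the shipped binary tree rather than at the system library, the bundled libm is being used.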

@kshyatt added the performance (Must go faster), kind:regression (Regression in behavior compared to a previous version), and domain:maths (Mathematical functions) labels Aug 2, 2016
@ufechner7 commented Aug 2, 2016

On my machine, Julia 0.5rc0, compiled from source is faster than Julia 0.4:

Julia 0.5:

julia> @time test0(1_000_000)
  0.005780 seconds (133 allocations: 7.703 KB)
julia> @time test0(1_000_000)
  0.005524 seconds (4 allocations: 160 bytes)
julia> @time test0(1_000_000)
  0.005612 seconds (4 allocations: 160 bytes)

Julia 0.4:

julia> @time test0(1_000_000)
  0.006219 seconds (39 allocations: 33.984 KB)
julia> @time test0(1_000_000)
  0.006888 seconds (4 allocations: 160 bytes)
julia> @time test0(1_000_000)
  0.006956 seconds (4 allocations: 160 bytes)

Julia 0.4:
Version 0.4.5 (2016-03-18 00:58 UTC)
Installed as Ubuntu package.

Julia 0.5:
Version 0.5.0-rc0+0 (2016-07-26 20:22 UTC)

Uwe

@yuyichao (Contributor) commented Aug 2, 2016

@ufechner7 Please quote your code, especially macro calls. Even though the poor https://github.com/time doesn't seem to be particularly active...

@ufechner7

Sorry. Won't forget it next time.

@simonbyrne (Contributor)

I'm seeing the same thing (i.e. this example is about 3x slower on the downloaded rc0 than on the same version built locally). This appears to be a problem with the openlibm build we include in the rc: when I built Julia with

make LIBMNAME=/Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/libopenlibm.dylib

I also get the same slowdown.

@tkelman (Contributor) commented Aug 4, 2016

OS X binaries are built with MARCH=core2 so they run on old systems (Linux and Windows use MARCH=x86-64). That may be preventing openlibm from being compiled with more modern SIMD instructions?
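
Roughly, as a sketch of the assumed build invocations:

$ make MARCH=core2     # what the OS X binaries use
$ make MARCH=x86-64    # what the Linux/Windows binaries use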

@KristofferC (Member)

I reproduced it on the Linux binary, though.

@simonbyrne (Contributor)

Isn't core2 newer than x86-64?

@tkelman (Contributor) commented Aug 4, 2016

Yes. I don't think there were any Intel Macs that predated core2, though.

@simonbyrne (Contributor)

However, I can't seem to recreate the issue by just rebuilding libm with MARCH=core2. Could it be something else?

@tkelman (Contributor) commented Aug 4, 2016

rpath?

@simonbyrne (Contributor)

That's beyond my skillset...

@tkelman (Contributor) commented Aug 7, 2016

Reopen if that doesn't actually fix it.

@simonbyrne (Contributor)

I just downloaded the latest nightly (d3df8e7), but still see the issue.

@simonbyrne reopened this Aug 7, 2016
@cortner (Author) commented Aug 7, 2016

Same for me; and for what it's worth:

  • v0.4.6, prebuilt : 0.0064353 seconds
  • v0.5-rc1 built locally : 0.007741 seconds
  • nightly : 0.012451 seconds

@ViralBShah (Member)

Is this issue only on mac, or on linux as well?

@simonbyrne (Contributor) commented Aug 7, 2016

Yes, Linux has the problem too, according to Kristoffer above.

@ViralBShah (Member)

BTW, unlike Kristoffer, I do see different native code generated. The local build is generating vmovsd instructions before the call, whereas the downloaded one is generating movsd, perhaps because it is compiled to be more generic. On my local master build:

julia> VERSION
v"0.6.0-dev.72"

julia> @code_native test0(1_000_000);
    .section    __TEXT,__text,regular,pure_instructions
Filename: REPL[4]
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r14
    pushq   %rbx
    subq    $16, %rsp
    movq    %rdi, %rbx
Source line: 3
    testq   %rbx, %rbx
    jle L62
    movabsq $13175163952, %rax      ## imm = 0x3114D0C30
Source line: 4
    vmovsd  (%rax), %xmm0           ## xmm0 = mem[0],zero
    vmovsd  %xmm0, -24(%rbp)
    movabsq $exp, %r14
L48:
    vmovsd  -24(%rbp), %xmm0        ## xmm0 = mem[0],zero
    callq   *%r14
Source line: 3
    addq    $-1, %rbx
    jne L48
Source line: 4
L62:
    addq    $16, %rsp
    popq    %rbx
    popq    %r14
    popq    %rbp
    retq
    nopw    (%rax,%rax)

as opposed to the downloaded nightly:


julia> VERSION
v"0.6.0-dev.62"

julia> @code_native test0(1_000_000);
    .section    __TEXT,__text,regular,pure_instructions
Filename: REPL[5]
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r14
    pushq   %rbx
    subq    $16, %rsp
    movq    %rdi, %rbx
Source line: 3
    testq   %rbx, %rbx
    jle L61
    movabsq $13309076416, %rax      ## imm = 0x3194863C0
Source line: 4
    movsd   (%rax), %xmm0           ## xmm0 = mem[0],zero
    movsd   %xmm0, -24(%rbp)
    movabsq $exp, %r14
L48:
    movsd   -24(%rbp), %xmm0        ## xmm0 = mem[0],zero
    callq   *%r14
Source line: 3
    decq    %rbx
    jne L48
Source line: 4
L61:
    addq    $16, %rsp
    popq    %rbx
    popq    %r14
    popq    %rbp
    retq
    nopw    %cs:(%rax,%rax)

@ViralBShah (Member) commented Aug 7, 2016

For the openlibm from the nightlies, when I load it with gdb and disassemble sin, I see many more instructions (463) than in the one I built locally (212).

The compile line does not have anything that suggests arch-specific stuff is being done:

clang -stdlib=libc++ -mmacosx-version-min=10.7 -m64  -O3 -fno-builtin -std=c99 -Wall -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96 -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96/include -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96/ld80 -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96/amd64 -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96/src -DASSEMBLER -D__BSD_VISIBLE -Wno-implicit-function-declaration -fPIC  -c ld80/e_acoshl.c -o ld80/e_acoshl.c.o

To make this easier to analyse, what version of OS X are we building the nightlies on and with what compiler?
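
For reproducibility, the disassembly above can be obtained along these lines (paths hypothetical):

$ gdb /path/to/libopenlibm.dylib
(gdb) disassemble sin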

@simonbyrne (Contributor) commented Aug 8, 2016

This appears to be due to a change in the default openlibm build flags from -O3 to -O2 (JuliaMath/openlibm#106), combined with the fact that we package on OS X 10.9, which uses an older version of clang (apparently -O2 has improved recently).

Some options:

  • Build on a newer machine
  • Change build flag in the deps makefile
  • Change the build flag back in openlibm

@tkelman (Contributor) commented Aug 8, 2016

That was put back by JuliaMath/openlibm@f8d64df, though.

@ViralBShah (Member)

How about on Linux? A similar old-compiler issue?

@simonbyrne (Contributor)

Ah okay: the problem on the OS X buildbot is that there is already a CFLAGS environment variable, which overrides this.
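
A sketch of verifying and clearing the stray flag on such a machine (the deps target name is assumed):

$ env | grep CFLAGS                             # check what the environment sets
$ unset CFLAGS && make -C deps install-openlibm # rebuild openlibm without the override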

@simonbyrne (Contributor)

No idea about Linux; I don't have access to that buildbot.

@tkelman (Contributor) commented Aug 8, 2016

Who was setting CFLAGS, and to what?

@simonbyrne (Contributor)

It appears to be Homebrew-related.

simonbyrne added a commit that referenced this issue Aug 9, 2016
simonbyrne added a commit that referenced this issue Aug 9, 2016
@KristofferC (Member)

Just to make sure: have we tested the new build?

ViralBShah pushed a commit that referenced this issue Aug 9, 2016
@simonbyrne (Contributor)

No, we will have to wait until new nightlies are available.

@ViralBShah (Member)

The PR merge closed this issue. Let's reopen if necessary.

tkelman pushed a commit that referenced this issue Aug 11, 2016
(cherry picked from commit 008d8e7)
@simonbyrne (Contributor)

Just confirming that this now seems to be fixed on OS X in rc2.

@KristofferC Can you check the Linux one?
