Performance regression for exp, sin, cos #17751

Closed
cortner opened this issue Aug 2, 2016 · 56 comments · Fixed by #17864 or #17912
Labels
domain:maths (Mathematical functions) · kind:regression (Regression in behavior compared to a previous version) · performance (Must go faster) · status:priority (This should be addressed urgently)
Milestone
0.5.x

Comments

@cortner commented Aug 2, 2016

function test0(N)
   r = 0.234
   for n = 1:N
      exp(r)
   end
end
test0(10)
@time test0(1_000_000)
@time test0(1_000_000)
@time test0(1_000_000)

Output in v0.4:

$ julia exp_test.jl
  0.006532 seconds (148 allocations: 10.151 KB)
  0.006524 seconds (4 allocations: 160 bytes)
  0.006999 seconds (4 allocations: 160 bytes)

Output in v0.5:

$ julia5 exp_test.jl
  0.012711 seconds (133 allocations: 7.703 KB)
  0.012758 seconds (4 allocations: 160 bytes)
  0.014031 seconds (4 allocations: 160 bytes)

The same happens if I replace exp with sin or cos; I haven't tried other functions. I did test against a simple squaring function, where the performance remains the same in both versions.
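
For reference, a minimal sketch of the control benchmark mentioned above; the squaring kernel here is an assumed stand-in for what was actually run:

function test_sq(N)
   r = 0.234
   for n = 1:N
      r*r            # result discarded, mirroring the exp(r) loop above
   end
end
test_sq(10)          # warm-up call so compilation is excluded from timing
@time test_sq(1_000_000)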

LLVM code for v0.4

julia> @code_llvm test0(1_000_000)

define void @julia_test0_21678(i64) {
top:
  %1 = icmp sgt i64 %0, 0
  %2 = select i1 %1, i64 %0, i64 0
  %3 = icmp eq i64 %2, 0
  br i1 %3, label %L3, label %L

L:                                                ; preds = %L, %top
  %"#s1.0" = phi i64 [ %4, %L ], [ 1, %top ]
  %4 = add i64 %"#s1.0", 1
  %5 = call double inttoptr (i64 13299772784 to double (double)*)(double inreg 2.340000e-01)
  %6 = icmp eq i64 %"#s1.0", %2
  br i1 %6, label %L3, label %L

L3:                                               ; preds = %L, %top
  ret void
}

LLVM output for v0.5

julia> @code_llvm test0(1_000_000)

define void @julia_test0_68854(i64) #0 {
top:
  %1 = icmp slt i64 %0, 1
  br i1 %1, label %L2, label %if.preheader

if.preheader:                                     ; preds = %top
  br label %if

L2.loopexit:                                      ; preds = %if
  br label %L2

L2:                                               ; preds = %L2.loopexit, %top
  ret void

if:                                               ; preds = %if.preheader, %if
  %"#temp#.03" = phi i64 [ %2, %if ], [ 1, %if.preheader ]
  %2 = add i64 %"#temp#.03", 1
  %3 = call double inttoptr (i64 13444864816 to double (double)*)(double 2.340000e-01)
  %4 = icmp eq i64 %"#temp#.03", %0
  br i1 %4, label %L2.loopexit, label %if
}
@johnmyleswhite (Member)

I think the regression isn't in any of the functions you've listed, but in the way that this kind of no-op gets optimized. I see identical performance in 0.4 and 0.5 for the following:

function test0(N)
   r = 0.234
   s = 0.0
   for n = 1:N
      s += exp(r)
   end
end
test0(10)
@time test0(1_000_000)
@time test0(1_000_000)
@time test0(1_000_000)
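
A slightly more robust variant, as a sketch: since test0 above never uses s, returning the accumulator guards against the whole loop being eliminated as dead code.

function test1(N)
   r = 0.234
   s = 0.0
   for n = 1:N
      s += exp(r)
   end
   return s          # returning s keeps the loop from being optimized away
end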

@cortner (Author) commented Aug 2, 2016

I'm afraid not. I had already tested precisely your version as well, with the same result, hence I posted the simpler code:

Sunrise:JuLIPtests ortner$ julia exp_test.jl
  0.006522 seconds (148 allocations: 10.151 KB)
  0.006903 seconds (4 allocations: 160 bytes)
  0.006181 seconds (4 allocations: 160 bytes)
Sunrise:JuLIPtests ortner$ julia5 exp_test.jl
  0.012776 seconds (133 allocations: 7.703 KB)
  0.012746 seconds (4 allocations: 160 bytes)
  0.013848 seconds (4 allocations: 160 bytes)
julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce (2016-06-19 17:16 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
julia> versioninfo()
Julia Version 0.5.0-rc0+0
Commit 0030eec* (2016-07-26 20:22 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@yuyichao (Contributor) commented Aug 2, 2016

Have you tried 0.4 with LLVM 3.7.1 (and vice versa)?

What's the native code?

Ref https://llvm.org/bugs/show_bug.cgi?id=25277
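
As a usage sketch, the dumps in question come from the standard reflection macros:

julia> @code_llvm test0(1_000_000)     # LLVM IR, as posted above
julia> @code_native test0(1_000_000)   # the machine code that actually runs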

@cortner (Author) commented Aug 2, 2016

I'm afraid I haven't explored Julia deeply enough to know how to switch LLVM versions, sorry. It is interesting, though, that @johnmyleswhite cannot reproduce the problem. Linux/Windows vs. Apple?

@johnmyleswhite (Member)

It turns out that my 0.5 build was a few days old. I'm currently building the latest commit to see if the reason I couldn't reproduce your results is that the regression happened in the last 5 days. Will report back once the build is finished.

@johnmyleswhite (Member)

Still don't see it with the latest commit on master for 0.5:

julia> versioninfo()
Julia Version 0.5.0-rc0+122
Commit 633443c (2016-08-02 00:53 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

julia> function test0(N)
          r = 0.234
          s = 0.0
          for n = 1:N
             s += exp(r)
          end
       end
test0 (generic function with 1 method)

julia> test0(10)

julia> @time test0(1_000_000)
  0.008087 seconds (133 allocations: 7.703 KB)

julia> @time test0(1_000_000)
  0.008218 seconds (4 allocations: 160 bytes)

julia> @time test0(1_000_000)
  0.007780 seconds (4 allocations: 160 bytes)
julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce (2016-06-19 17:16 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> function test0(N)
          r = 0.234
          s = 0.0
          for n = 1:N
             s += exp(r)
          end
       end
test0 (generic function with 1 method)

julia> test0(10)

julia> @time test0(1_000_000)
  0.011705 seconds (146 allocations: 9.995 KB)

julia> @time test0(1_000_000)
  0.019303 seconds (4 allocations: 160 bytes)

julia> @time test0(1_000_000)
  0.007988 seconds (4 allocations: 160 bytes)

And I get the same results for your original example. Seems like there must be something subtly different between our Julia 0.5 setups. Did you build from source?

@cortner (Author) commented Aug 2, 2016

I just built from source (I don't normally), and now I don't see the regression anymore. On the nightly that I just downloaded I do see the regression.

@cortner (Author) commented Aug 2, 2016

Julia Version 0.5.0-rc0+122
Commit 633443c (2016-08-02 00:53 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

vs

Julia Version 0.5.0-rc0+115
Commit 643bac8 (2016-08-01 21:42 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@cortner (Author) commented Aug 2, 2016

Is the nightly built on a different system? See x86_64-apple-darwin13.4.0 vs. x86_64-apple-darwin15.6.0.

@cortner (Author) commented Aug 2, 2016

Let me know if you want me to try something else; I will also try the next nightly tomorrow.

@ViralBShah added this to the 0.5.x milestone Aug 2, 2016
@ViralBShah (Member)

Is this only in the libm functions, or do you see the same perf regression if you write a function f() that takes roughly the same execution time as exp?
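
A sketch of such an experiment; f here is a hypothetical stand-in, and @noinline forces an actual function call, roughly like the libm call in the exp loop:

@noinline f(x) = x*x + 2x + 1   # cheap pure-Julia kernel, hypothetical

function test_f(N)
   r = 0.234
   for n = 1:N
      f(r)           # result discarded, mirroring the original test0
   end
end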

@yuyichao (Contributor) commented Aug 2, 2016

let me know if you want me to try something else;

#17751 (comment)

What's the native code?

@ViralBShah (Member)

To build Julia 0.4 with LLVM 3.7.1, you have to clone Julia and then change the LLVM version in deps/Versions.make.
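
As a sketch, the edit looks roughly like this; the variable name is assumed and may differ between releases:

# deps/Versions.make (variable name assumed)
LLVM_VER = 3.7.1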

@KristofferC (Member)

I can reproduce: rc0 is slow, built from source is fast. Rebuilding the sysimg file did not help. The native code is identical:

julia> @code_native test0(1_000_000)
    .text
Filename: REPL[4]
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r14
    pushq   %rbx
    subq    $16, %rsp
    movq    %rdi, %rbx
Source line: 3
    testq   %rbx, %rbx
    jle L62
    movabsq $140279242572400, %rax  # imm = 0x7F954E6C4270
Source line: 4
    vmovsd  (%rax), %xmm0           # xmm0 = mem[0],zero
    vmovsd  %xmm0, -24(%rbp)
    movabsq $140279240293456, %r14  # imm = 0x7F954E497C50
L48:
    vmovsd  -24(%rbp), %xmm0        # xmm0 = mem[0],zero
    callq   *%r14
Source line: 3
    addq    $-1, %rbx
    jne L48
Source line: 4
L62:
    addq    $16, %rsp
    popq    %rbx
    popq    %r14
    popq    %rbp
    retq
    nopw    (%rax,%rax)

@pabloferz (Contributor) commented Aug 2, 2016

I was seeing something like this a couple of months ago using the generic Linux binaries vs. building from source. The difference was that the generic binaries were using a libm included in the binary instead of the system one. Manually changing the .so files to point to the system libm made the performance equal. I wouldn't be surprised if this were also the case here.
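
One way to check which libm a given Julia binary actually resolves (a sketch; the printed path is hypothetical):

julia> Libdl.dlpath("libopenlibm")
"/usr/lib/julia/libopenlibm.so"

If this points into the shipped binary tree rather than at the system library, the bundled libm is being used.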

@kshyatt added the performance (Must go faster), kind:regression (Regression in behavior compared to a previous version), and domain:maths (Mathematical functions) labels Aug 2, 2016
@ufechner7 commented Aug 2, 2016

On my machine, Julia 0.5rc0, compiled from source is faster than Julia 0.4:

Julia 0.5:

julia> @time test0(1_000_000)
  0.005780 seconds (133 allocations: 7.703 KB)
julia> @time test0(1_000_000)
  0.005524 seconds (4 allocations: 160 bytes)
julia> @time test0(1_000_000)
  0.005612 seconds (4 allocations: 160 bytes)

Julia 0.4:

julia> @time test0(1_000_000)
  0.006219 seconds (39 allocations: 33.984 KB)
julia> @time test0(1_000_000)
  0.006888 seconds (4 allocations: 160 bytes)
julia> @time test0(1_000_000)
  0.006956 seconds (4 allocations: 160 bytes)

Julia 0.4:
Version 0.4.5 (2016-03-18 00:58 UTC)
Installed as Ubuntu package.

Julia 0.5:
Version 0.5.0-rc0+0 (2016-07-26 20:22 UTC)

Uwe

@yuyichao (Contributor) commented Aug 2, 2016

@ufechner7 Please quote your code, especially macro calls. Even though the poor https://github.com/time doesn't seem to be particularly active...

@ufechner7

Sorry. Won't forget it next time.

@simonbyrne (Contributor)

I'm seeing the same thing (i.e. this example is about 3x slower on the downloaded rc0 than on the same version built locally). This appears to be a problem with the openlibm build we include in the rc: when I built Julia with

make LIBMNAME=/Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/libopenlibm.dylib

I also get the same slowdown.

@tkelman (Contributor) commented Aug 4, 2016

OS X binaries are built with MARCH=core2 so they run on old systems (Linux and Windows use MARCH=x86-64). That may be preventing openlibm from being compiled with more modern SIMD instructions?
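
Roughly, as a sketch of the assumed build invocations:

$ make MARCH=core2     # what the OS X binaries use
$ make MARCH=x86-64    # what the Linux/Windows binaries use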

@KristofferC (Member)

I reproduced it on the Linux binary, though.

@simonbyrne (Contributor)

Isn't core2 newer than x86-64?

@tkelman (Contributor) commented Aug 4, 2016

Yes. I don't think there were any Intel Macs that predated core2, though.

@simonbyrne (Contributor)

However, I can't seem to recreate the issue by just rebuilding libm with MARCH=core2. Could it be something else?

@tkelman (Contributor) commented Aug 4, 2016

rpath?

@simonbyrne (Contributor)

That's beyond my skillset...

@tkelman (Contributor) commented Aug 7, 2016

Reopen if that doesn't actually fix it.

@simonbyrne (Contributor)

I just downloaded the latest nightly (d3df8e7), but still see the issue.

@simonbyrne reopened this Aug 7, 2016
@cortner (Author) commented Aug 7, 2016

Same for me; and for what it's worth:

  • v0.4.6, prebuilt : 0.0064353 seconds
  • v0.5-rc1 built locally : 0.007741 seconds
  • nightly : 0.012451 seconds

@ViralBShah (Member)

Is this issue only on mac, or on linux as well?

@simonbyrne (Contributor) commented Aug 7, 2016

Yes, Linux has the problem too, according to Kristoffer above.

@ViralBShah (Member)

BTW, unlike Kristoffer, I do see different native code generated. The local build is generating vmovsd instructions before the call, whereas the downloaded one is generating movsd, perhaps because it is compiled to be more generic. On my local master build:

julia> VERSION
v"0.6.0-dev.72"

julia> @code_native test0(1_000_000);
    .section    __TEXT,__text,regular,pure_instructions
Filename: REPL[4]
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r14
    pushq   %rbx
    subq    $16, %rsp
    movq    %rdi, %rbx
Source line: 3
    testq   %rbx, %rbx
    jle L62
    movabsq $13175163952, %rax      ## imm = 0x3114D0C30
Source line: 4
    vmovsd  (%rax), %xmm0           ## xmm0 = mem[0],zero
    vmovsd  %xmm0, -24(%rbp)
    movabsq $exp, %r14
L48:
    vmovsd  -24(%rbp), %xmm0        ## xmm0 = mem[0],zero
    callq   *%r14
Source line: 3
    addq    $-1, %rbx
    jne L48
Source line: 4
L62:
    addq    $16, %rsp
    popq    %rbx
    popq    %r14
    popq    %rbp
    retq
    nopw    (%rax,%rax)

as opposed to the downloaded nightly:


julia> VERSION
v"0.6.0-dev.62"

julia> @code_native test0(1_000_000);
    .section    __TEXT,__text,regular,pure_instructions
Filename: REPL[5]
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r14
    pushq   %rbx
    subq    $16, %rsp
    movq    %rdi, %rbx
Source line: 3
    testq   %rbx, %rbx
    jle L61
    movabsq $13309076416, %rax      ## imm = 0x3194863C0
Source line: 4
    movsd   (%rax), %xmm0           ## xmm0 = mem[0],zero
    movsd   %xmm0, -24(%rbp)
    movabsq $exp, %r14
L48:
    movsd   -24(%rbp), %xmm0        ## xmm0 = mem[0],zero
    callq   *%r14
Source line: 3
    decq    %rbx
    jne L48
Source line: 4
L61:
    addq    $16, %rsp
    popq    %rbx
    popq    %r14
    popq    %rbp
    retq
    nopw    %cs:(%rax,%rax)

@ViralBShah (Member) commented Aug 7, 2016

For the openlibm from the nightlies, when I load it with gdb and disassemble sin, I see many more instructions (463) than in the one I built locally (212).

The compile line does not have anything that suggests arch-specific stuff is being done:

clang -stdlib=libc++ -mmacosx-version-min=10.7 -m64  -O3 -fno-builtin -std=c99 -Wall -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96 -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96/include -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96/ld80 -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96/amd64 -I/Users/viral/julia/deps/build/openlibm-0fa599cce845be67ed1729d9753f67e366c37b96/src -DASSEMBLER -D__BSD_VISIBLE -Wno-implicit-function-declaration -fPIC  -c ld80/e_acoshl.c -o ld80/e_acoshl.c.o

To make this easier to analyse, what version of OS X are we building the nightlies on and with what compiler?
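
For reproducibility, the disassembly above can be obtained along these lines (paths hypothetical):

$ gdb /path/to/libopenlibm.dylib
(gdb) disassemble sin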

@simonbyrne (Contributor) commented Aug 8, 2016

This appears to be due to a change in the default openlibm build flags from -O3 to -O2 (JuliaMath/openlibm#106), combined with the fact that we package on OS X 10.9, which uses an older version of clang (apparently -O2 has improved recently).

Some options:

  • Build on a newer machine
  • Change build flag in the deps makefile
  • Change the build flag back in openlibm

@tkelman (Contributor) commented Aug 8, 2016

That was put back by JuliaMath/openlibm@f8d64df, though.

@ViralBShah (Member)

How about on Linux? A similar old-compiler issue?

@simonbyrne (Contributor)

Ah okay: the problem on the OS X buildbot is that there is already a CFLAGS environment variable, which overrides this.
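
A sketch of verifying and clearing the stray flag on such a machine (the deps target name is assumed):

$ env | grep CFLAGS                             # check what the environment sets
$ unset CFLAGS && make -C deps install-openlibm # rebuild openlibm without the override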

@simonbyrne (Contributor)

No idea about Linux; I don't have access to that buildbot.

@tkelman (Contributor) commented Aug 8, 2016

Who was setting CFLAGS, and to what?

@simonbyrne (Contributor)

It appears to be Homebrew-related.

simonbyrne added a commit that referenced this issue Aug 9, 2016
simonbyrne added a commit that referenced this issue Aug 9, 2016
@KristofferC (Member)

Just to make sure: have we tested the new build?

ViralBShah pushed a commit that referenced this issue Aug 9, 2016
@simonbyrne (Contributor)

No, we will have to wait until new nightlies are available.

@ViralBShah (Member)

The PR merge closed this issue. Let's reopen if necessary.

tkelman pushed a commit that referenced this issue Aug 11, 2016
(cherry picked from commit 008d8e7)
@simonbyrne (Contributor)

Just confirming that this now seems to be fixed on OS X in rc2.

@KristofferC Can you check the Linux one?
