Performance regression for `exp`, `sin`, `cos` #17751
I think the regression isn't in any of the functions you've listed, but in the way that this kind of no-op gets optimized. I see identical performance in 0.4 and 0.5 for the following:

```julia
function test0(N)
    r = 0.234
    s = 0.0
    for n = 1:N
        s += exp(r)
    end
end

test0(10)
@time test0(1_000_000)
@time test0(1_000_000)
@time test0(1_000_000)
```
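The "no-op" point above can be probed directly. A minimal sketch of a DCE-proof variant (the name `test1` is illustrative, not from the thread): returning the accumulator keeps the loop body live, so the timing reflects the cost of `exp` rather than of work the optimizer may drop.

```julia
# Illustrative variant: same loop as test0, but the accumulator is
# returned, so the compiler cannot eliminate the loop body as dead code.
function test1(N)
    r = 0.234
    s = 0.0
    for n = 1:N
        s += exp(r)
    end
    return s
end

test1(10)               # warm up / force compilation
@time test1(1_000_000)
```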
I'm afraid not. I had tested precisely your version before as well, but got the same result, hence I posted the simpler code:
Have you tried 0.4 with LLVM 3.7.1 (and vice versa)? What's the native code?
I'm afraid I haven't explored Julia deeply enough to know how to switch LLVM versions, sorry. Interesting, though, that @johnmyleswhite cannot reproduce the problem. Linux/Windows vs. Apple?
It turns out that my 0.5 build was a few days old. I'm currently building the latest commit to see whether the reason I couldn't reproduce your results is that the regression happened in the last 5 days. Will report back once the build is finished.
Still don't see it with the latest commit on master for 0.5:

```julia
julia> versioninfo()
Julia Version 0.5.0-rc0+122
Commit 633443c (2016-08-02 00:53 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

julia> function test0(N)
           r = 0.234
           s = 0.0
           for n = 1:N
               s += exp(r)
           end
       end
test0 (generic function with 1 method)

julia> test0(10)

julia> @time test0(1_000_000)
  0.008087 seconds (133 allocations: 7.703 KB)

julia> @time test0(1_000_000)
  0.008218 seconds (4 allocations: 160 bytes)

julia> @time test0(1_000_000)
  0.007780 seconds (4 allocations: 160 bytes)
```

```julia
julia> versioninfo()
Julia Version 0.4.6
Commit 2e358ce (2016-06-19 17:16 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> function test0(N)
           r = 0.234
           s = 0.0
           for n = 1:N
               s += exp(r)
           end
       end
test0 (generic function with 1 method)

julia> test0(10)

julia> @time test0(1_000_000)
  0.011705 seconds (146 allocations: 9.995 KB)

julia> @time test0(1_000_000)
  0.019303 seconds (4 allocations: 160 bytes)

julia> @time test0(1_000_000)
  0.007988 seconds (4 allocations: 160 bytes)
```

And I get the same results for your original example. Seems like there must be something subtly different between our Julia 0.5 setups. Did you build from source?
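One way to separate Julia-side codegen differences from the libm routine itself (a sketch, not from the thread): call `exp` directly out of the bundled openlibm via `ccall` and time that loop. If the direct call shows the same gap between builds, the regression lives in the shipped openlibm binary, not in Julia's code generation. The helper names are illustrative, and this assumes a `libopenlibm` shared library is loadable from Julia's library search path.

```julia
# Illustrative helper (not from the thread): call exp straight from the
# bundled openlibm, bypassing Julia's own exp wrapper.
# Assumes libopenlibm is on the dynamic-library search path.
openlibm_exp(x::Float64) = ccall((:exp, "libopenlibm"), Float64, (Float64,), x)

function test_openlibm(N)
    r = 0.234
    s = 0.0
    for n = 1:N
        s += openlibm_exp(r)
    end
    return s
end

test_openlibm(10)               # warm up / force compilation
@time test_openlibm(1_000_000)
```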
I just built from source (I don't normally), and now I don't see the regression anymore. On the nightly that I just downloaded I do see the regression.
Is the nightly built on a different system? see
Let me know if you want me to try something else; I will also try the next nightly tomorrow.
Is this only in the libm functions, or also if you write a function f() that takes roughly the same execution time as
To build Julia 0.4 with LLVM 3.7.1, you have to clone Julia and then change the LLVM version in deps/Versions.make.
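Concretely, that is a one-line edit in deps/Versions.make followed by a rebuild; the variable name is from the Julia build system of that era, though the exact default varies by commit:

```make
# deps/Versions.make (sketch; the 0.4 default was an older LLVM)
LLVM_VER = 3.7.1
```

A clean rebuild of the LLVM dependency (e.g. `make -C deps distclean-llvm && make`) is then needed for the change to take effect.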
I can reproduce. rc0 is slow, built from source is fast. Rebuilding the sysimg file did not help. Native code is identical:

```julia
julia> @code_native test0(1_000_000)
	.text
Filename: REPL[4]
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r14
	pushq	%rbx
	subq	$16, %rsp
	movq	%rdi, %rbx
Source line: 3
	testq	%rbx, %rbx
	jle	L62
	movabsq	$140279242572400, %rax  # imm = 0x7F954E6C4270
Source line: 4
	vmovsd	(%rax), %xmm0           # xmm0 = mem[0],zero
	vmovsd	%xmm0, -24(%rbp)
	movabsq	$140279240293456, %r14  # imm = 0x7F954E497C50
L48:
	vmovsd	-24(%rbp), %xmm0        # xmm0 = mem[0],zero
	callq	*%r14
Source line: 3
	addq	$-1, %rbx
	jne	L48
Source line: 4
L62:
	addq	$16, %rsp
	popq	%rbx
	popq	%r14
	popq	%rbp
	retq
	nopw	(%rax,%rax)
```
I was seeing something like this with the generic Linux binaries vs. building from source a couple of months ago. The difference was that the generic binaries were using a libm included in the distribution instead of the system one. Manually changing the .so files to point to the system libm makes the performance equal. I wouldn't be surprised if this were also the case here.
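The "point at the system libm" experiment above can also be approximated from inside Julia without touching any .so files. A hedged sketch with hypothetical names: resolve `exp` out of a specific library with `Libdl` and compare its result (and timing) against Julia's `exp`. The soname `"libm.so.6"` is the glibc one, so this particular spelling is Linux-specific.

```julia
using Libdl  # stdlib module on recent Julia; on 0.4/0.5 this lived in Base

# Hypothetical check (Linux-specific soname): resolve exp from the system
# libm and call it through the raw function pointer, for comparison with
# whatever libm Julia itself was linked against.
const sys_libm    = Libdl.dlopen("libm.so.6")
const sys_exp_ptr = Libdl.dlsym(sys_libm, :exp)

sys_exp(x::Float64) = ccall(sys_exp_ptr, Float64, (Float64,), x)
```

On OS X the system math routines live in libSystem, so the `dlopen` target would differ there.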
On my machine, Julia 0.5rc0 compiled from source is faster than Julia 0.4:

Julia 0.5:

```julia
julia> @time test0(1_000_000)
  0.005780 seconds (133 allocations: 7.703 KB)

julia> @time test0(1_000_000)
  0.005524 seconds (4 allocations: 160 bytes)

julia> @time test0(1_000_000)
  0.005612 seconds (4 allocations: 160 bytes)
```

Julia 0.4:

```julia
julia> @time test0(1_000_000)
  0.006219 seconds (39 allocations: 33.984 KB)

julia> @time test0(1_000_000)
  0.006888 seconds (4 allocations: 160 bytes)

julia> @time test0(1_000_000)
  0.006956 seconds (4 allocations: 160 bytes)
```

Uwe
@ufechner7 Please quote your code, especially macro calls. Even though the poor @time doesn't seem to be particularly active...
Sorry. Won't forget it next time.
I'm seeing the same thing (i.e. this example is about 3x slower on the downloaded rc0 vs the same built locally). This appears to be a problem with the openlibm build we include in the rc: when I built Julia with
I also get the same slowdown.
OS X binaries are built with
I reproduced on the Linux binary, though.
isn't |
Yes. I don't think there were any Intel Macs that predated Core 2, though.
However, I can't seem to recreate the issue by just rebuilding libm with
rpath?
That's beyond my skill set...
Reopen if that doesn't actually fix it.
I just downloaded the latest nightly (d3df8e7), but still see the issue.
Same for me; and for what it's worth:
Is this issue only on mac, or on linux as well? |
Yes, Linux has the problem too, according to Kristoffer above.
BTW, unlike Kristoffer, I do see different native code generated. The local build is generating
as opposed to the downloaded nightly:
For the openlibm from the nightlies, when I load it with
The compile line does not have anything that suggests that arch-specific stuff is being done:
To make this easier to analyse: what version of OS X are we building the nightlies on, and with what compiler?
This appears to be due to a change in the default openlibm build flags from -O3 to -O2 (JuliaMath/openlibm#106), combined with the fact that we package on OS X 10.9, which uses an older version of clang (apparently -O2 has improved recently). Some options:
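The override mechanics behind this kind of flag loss can be sketched in make terms (a generic illustration of conditional assignment, not the actual openlibm build file): with `?=`, a `CFLAGS` already present in the environment silently wins over the Makefile's intended default.

```make
# Sketch: `?=` only assigns when the variable is not already set, and
# variables inherited from the environment count as set. So an exported
# CFLAGS (e.g. from packaging tooling) replaces the -O3 default below.
CFLAGS ?= -O3 -fno-builtin
```

Forcing the Makefile's value regardless of the environment would require `override CFLAGS := ...` or passing the flag on the make command line.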
That was put back by JuliaMath/openlibm@f8d64df, though.
How about on Linux? A similar old-compiler issue?
Ah okay: the problem on the OS X buildbot is that there is already a CFLAGS environment variable, which overrides this. |
No idea about Linux; I don't have access to that buildbot.
Who was setting |
It appears to be Homebrew-related.
Just to make sure: have we tested the new build?
No, we will have to wait until new nightlies are available.
The PR merge closed this issue. Let's reopen if necessary.
Just confirming that this now seems to be fixed on OS X in RC2. @KristofferC Can you check the Linux one?
Output in v0.4:

```
> julia exp_test.jl
  0.006532 seconds (148 allocations: 10.151 KB)
  0.006524 seconds (4 allocations: 160 bytes)
  0.006999 seconds (4 allocations: 160 bytes)
```

Output in v0.5:

Same if I replace `exp` with `sin` or `cos`; haven't tried others. I did test against a simple squaring function where the performance remains the same in both versions.

LLVM code for v0.4

LLVM output for v0.5
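The squaring control mentioned above can be sketched like this (the name `test_square` is illustrative). Since the loop body is pure arithmetic with no libm call, its timing should not depend on which libm build Julia was linked against, which is what makes it a useful baseline.

```julia
# Illustrative control: same loop shape as test0, but a cheap multiply
# in place of exp, so no libm routine is involved.
function test_square(N)
    r = 0.234
    s = 0.0
    for n = 1:N
        s += r * r
    end
    return s
end

test_square(10)               # warm up / force compilation
@time test_square(1_000_000)
```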