
Performance regression of scalar randn() between Julia 1.4 and 1.5 #37234

Closed
wants to merge 3 commits

Conversation

@lungben (Contributor) commented Aug 27, 2020

The scalar randn() function is significantly slower in Julia 1.5 (and on master) than it was in Julia 1.4. This is a side effect of the drastic speed-up implemented for generating vectors of randn values.

See #36414

The underlying issue is that a function call is not inlined.
In this PR, the issue is avoided by redefining the scalar version of randn() using the Julia 1.4 code.
This restores the 1.4 performance of randn(), but it is not the nicest solution; unfortunately, I have not found a better one yet.
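For reference, a minimal way to reproduce the timings discussed in this thread (assuming the BenchmarkTools package is available):

using Random, BenchmarkTools

# Scalar call that regressed between 1.4 and 1.5/master:
@btime randn()

# Passing the RNG explicitly, to separate out the cost of the default_rng() lookup:
@btime randn($(Random.default_rng()))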

@KristofferC (Member)

Should probably at least add a comment to both places about this duplication.

@lungben (Contributor, Author) commented Aug 28, 2020

Should probably at least add a comment to both places about this duplication.

Good point, I added some comments.

@rfourquet added the performance (Must go faster) and domain:randomness (Random number generation and the Random stdlib) labels Sep 1, 2020
#=
When defining
`@inline randn(rng::AbstractRNG=default_rng()) = _randn(rng, rand(rng, UInt52Raw()))`
the function call to `_randn` is currently not inlined, resulting in slightly worse
Member:
I was wondering whether it's really _randn which fails to be inlined... A quick look at @code_typed randn(MersenneTwister()) doesn't show a call to _randn, if I'm not mistaken. How did you conclude that _randn is not inlined?

Contributor (Author):

I deduced it from the @code_llvm output (but I might be wrong here; I do not have much experience with these topics):
In the Julia 1.5 version, there is the following call:
%30 = call double @j__randn_3665(%jl_value_t* nonnull %0, i64 %29) #0
In the Julia 1.4 code, this call is missing, but there are roughly 60 additional lines of LLVM code. I suppose this is the inlined function, but I am not sure about it.
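For anyone who wants to reproduce this inspection, the LLVM IR of the scalar method can be dumped as follows (a minimal sketch; the exact output depends on the Julia version):

using Random, InteractiveUtils

# On 1.5/master this prints a call to a separately compiled _randn specialization
# (the @j__randn_... call quoted above); on 1.4 the corresponding body is inlined.
@code_llvm randn(MersenneTwister())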

Member:

Since _randn is defined as calling randn, the inliner decides to stop. This is reflected in code_llvm, but due to the particular way that code_typed works and is cached, it won't be represented there.

Member:

To fix this, I would suggest moving this code duplication into randn_unlikely instead, so that it is the only recursive function and not all these others.

Member:

randn_unlikely is the noinline slow path though. If we moved the code we'd have to inline everything?

@rfourquet (Member) commented Sep 12, 2020

This patch looks fine as a last resort, but I would rather postpone merging it until closer to the 1.6 release, to give us a chance to better understand what exactly happens. If _randn is not inlined, then it should be, because of the @inline annotation, so the issue might be fixed by finding a way to force the compiler to inline it.

In case this is considered for backport to the 1.5 release, this commit could be applied directly to 1.5 for the next 1.5 patch release, and merged for 1.6 later only if we don't find anything better.

@rfourquet added this to the 1.6 features milestone Sep 30, 2020
@rfourquet (Member)

Looking again at the output of @code_typed randn(MersenneTwister()), it is AFAICT exactly the same between the version on master and the one with your patch, except for the type annotations on the global arrays wi and ki: on master it looks like

 %48 = Random.ki::Vector{UInt64}

whereas with your patch, it looks like

%48 = Random.ki::Core.Const(UInt64[0x0007799ec012f7b2, 0x0000000000000000, ... ]

I don't really know what it means, but it looks like inference is able to infer the "constness" of these arrays, leading to better performance. Would someone know what might cause the difference and whether it's possible to help inference on the master version without duplicating the code? (maybe @Keno?)

If someone wants to try this patch, just do

@eval Random @inline function randn(rng::AbstractRNG=default_rng())
    @inbounds begin
        r = rand(rng, UInt52Raw())

        # the following code is identical to the one in `_randn(rng::AbstractRNG, r::UInt64)`
        r &= 0x000fffffffffffff
        rabs = Int64(r>>1) # One bit for the sign
        idx = rabs & 0xFF
        x = ifelse(r % Bool, -rabs, rabs)*wi[idx+1]
        rabs < ki[idx+1] && return x # 99.3% of the time we return here 1st try
        return randn_unlikely(rng, idx, rabs, x)
    end
end

@KristofferC (Member) commented Oct 7, 2020

What happens if you make those arrays tuples? Tuples are immutable and might be handled better.

@rfourquet (Member)

What happens if you make those arrays tuples? Tuples are immutable and might be handled better.

Good idea, this seems to improve performance a bit on master. But doing the same trick on top of this PR also improves things, again thanks to a Const annotation:

%61 = Random.ki_::NTuple{256, UInt64}
# vs
%61 = Random.ki_::Core.Const((0x0007799ec012f7b2, 0x0000000000000000, ...
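For anyone who wants to try the tuple experiment locally, here is a minimal sketch (the constant names simply mirror the wi_/ki_ annotations above; the real change would of course live in the Random stdlib sources):

using Random

# Immutable copies of the ziggurat lookup tables: unlike the mutable Vector
# globals wi and ki, tuples can be propagated as compile-time constants.
@eval Random begin
    const wi_ = Tuple(wi)   # NTuple{256, Float64}
    const ki_ = Tuple(ki)   # NTuple{256, UInt64}
end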

@chriselrod (Contributor)

Perhaps interesting, Julia 1.4:

julia> using Random, BenchmarkTools

julia> const DEFAULTRNG = Random.default_rng();

julia> @benchmark Random.default_rng()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.850 ns (0.00% GC)
  median time:      2.857 ns (0.00% GC)
  mean time:        2.995 ns (0.00% GC)
  maximum time:     5.182 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn(DEFAULTRNG)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.592 ns (0.00% GC)
  median time:      4.061 ns (0.00% GC)
  mean time:        4.077 ns (0.00% GC)
  maximum time:     6.759 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.309 ns (0.00% GC)
  median time:      5.769 ns (0.00% GC)
  mean time:        5.792 ns (0.00% GC)
  maximum time:     19.508 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Julia master:

julia> using Random, BenchmarkTools

julia> const DEFAULTRNG = Random.default_rng();

julia> @benchmark Random.default_rng()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.844 ns (0.00% GC)
  median time:      2.848 ns (0.00% GC)
  mean time:        2.856 ns (0.00% GC)
  maximum time:     12.756 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn(DEFAULTRNG)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.864 ns (0.00% GC)
  median time:      4.233 ns (0.00% GC)
  mean time:        4.255 ns (0.00% GC)
  maximum time:     13.951 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.842 ns (0.00% GC)
  median time:      7.298 ns (0.00% GC)
  mean time:        7.322 ns (0.00% GC)
  maximum time:     11.818 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Keno added a commit that referenced this pull request Oct 29, 2020
When inference detects a call cycle, one of two things could happen:
1. It determines that in order for inference to converge it needs
   to limit the signatures of some methods to something more general, or
2. The cycle is determined to be harmless at the inference level (because
   though there is a cycle in the CFG there is no dependency cycle of
   type information).

In the first case, we simply disable optimizations, which is sensible,
because we're likely to have to recompute some information anyway,
when we actually get there dynamically.

In the second case, however, we do perform optimizations, but it's a bit
of an unusual case. In particular, inference usually delivers
methods to inlining in postorder (meaning callees get optimized
before their callers) such that a caller can always inline a callee.

However, if there is a cycle, there is of course no unique postorder
of functions, since by definition we're looking at a strongly
connected component. In this case, we would essentially just pick
an arbitrary order (which moreover depends on the point at which
we enter the cycle and is then cached, leading to potential
performance instabilities depending on function definition order).
However, the arbitrary order is quite possibly
suboptimal. For example in #36414, we have a cycle
randn -> _randn -> randn_unlikely -> rand. In this cycle the author
of this code expected both `randn` and `_randn` to inline and
annotated the functions as such. However, in 1.5+ the order we happened
to pick would have inlined randn_unlikely into _randn (had it not
been marked noinline), with a hard call break between randn and
_randn, which is problematic from a performance standpoint.

This PR aims to address this by introducing a heuristic: If some
functions in a cycle are marked as `@noinline`, we want to make
sure to infer these last (since they won't ever be inlined anyway).
To ensure this happens, while restoring postorder if this happens
to break the cycle, we perform a DFS traversal rooted at any `@noinline`
functions and then optimize the functions in the cycle in DFS-postorder.
Of course, this may still not be a true postorder in the inlining
graph (if the `@noinline` functions don't break the cycle), but
even in that case, it should be no worse than the default order.

Fixes #36414
Closes #37234
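As a rough, self-contained illustration of the ordering heuristic described in this commit message (a toy sketch only, not the actual inference code; the closing rand -> randn edge is assumed to complete the cycle):

# Toy call graph for the cycle mentioned above.
calls = Dict(
    :randn          => [:_randn],
    :_randn         => [:randn_unlikely],
    :randn_unlikely => [:rand],
    :rand           => [:randn],   # assumed closing edge
)
noinline = Set([:randn_unlikely])

# DFS postorder rooted at the @noinline functions: they come out last,
# so (inlineable) callees get optimized before their callers where possible.
function dfs_postorder(calls, roots)
    order = Symbol[]
    seen = Set{Symbol}()
    function visit(f)
        f in seen && return
        push!(seen, f)
        foreach(visit, get(calls, f, Symbol[]))
        push!(order, f)
    end
    foreach(visit, roots)
    foreach(visit, keys(calls))  # pick up any members not reachable from the roots
    return order
end

dfs_postorder(calls, collect(noinline))
# returns [:_randn, :randn, :rand, :randn_unlikely]  (the @noinline function comes last)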
@JeffBezanson added the backport 1.6 (Change should be backported to release-1.6) label Jan 15, 2021
@inline randn(rng::AbstractRNG=default_rng()) = _randn(rng, rand(rng, UInt52Raw()))
@inline function randn(rng::AbstractRNG=default_rng())
#=
When defining
Member:

Trailing whitespace here.

=#
@inbounds begin
r = rand(rng, UInt52Raw())

Member:

and here

@KristofferC (Member)

Rebased in #39319

@KristofferC removed the backport 1.6 (Change should be backported to release-1.6) label Feb 2, 2021