
Performance regression of scalar randn() between Julia 1.4 and 1.5 #37234

Closed
wants to merge 3 commits

Conversation

@lungben (Contributor) commented Aug 27, 2020

The scalar randn() function is significantly slower in Julia 1.5 (and on master) than it was in Julia 1.4. This is a side effect of the drastic speed-up implemented for generating vectors of randn values.

See #36414

The underlying issue is that a function call is not inlined.
In this PR, the issue is avoided by redefining the scalar version of randn() using the Julia 1.4 code.
This restores the 1.4 performance of randn(), but it is not the nicest solution; unfortunately, I have not found a better one yet.
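For reference, a minimal way to reproduce the timings discussed in this thread (assuming the BenchmarkTools package is available):

using Random, BenchmarkTools

# Scalar call that regressed between 1.4 and 1.5/master:
@btime randn()

# Passing the RNG explicitly, to separate out the cost of the default_rng() lookup:
@btime randn($(Random.default_rng()))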

@KristofferC (Member)

Should probably at least add a comment to both places about this duplication.

@lungben (Contributor, Author) commented Aug 28, 2020

Should probably at least add a comment to both places about this duplication.

Good point, I added some comments.

@rfourquet added the performance (Must go faster) and domain:randomness (Random number generation and the Random stdlib) labels Sep 1, 2020
#=
When defining
`@inline randn(rng::AbstractRNG=default_rng()) = _randn(rng, rand(rng, UInt52Raw()))`
the function call to `_randn` is currently not inlined, resulting in slightly worse
Member:
I was wondering whether it's really _randn which fails to be inlined... A quick look at @code_typed randn(MersenneTwister()) doesn't show a call to _randn, if I'm not mistaken. How did you conclude that _randn is not inlined?

Contributor (Author):

I deduced it from the @code_llvm output (but I might be wrong here; I do not have much experience with these topics):
In the Julia 1.5 version, there is the following call:
%30 = call double @j__randn_3665(%jl_value_t* nonnull %0, i64 %29) #0
In the Julia 1.4 code, this call is missing, but there are roughly 60 additional lines of LLVM code. I suppose this is the inlined function, but I am not sure about it.
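For anyone who wants to reproduce this inspection, the LLVM IR of the scalar method can be dumped as follows (a minimal sketch; the exact output depends on the Julia version):

using Random, InteractiveUtils

# On 1.5/master this prints a call to a separately compiled _randn specialization
# (the @j__randn_... call quoted above); on 1.4 the corresponding body is inlined.
@code_llvm randn(MersenneTwister())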

Member:

Since _randn is defined as calling randn, the inliner decides to stop. This is reflected in code_llvm, but due to the particular way that code_typed works and is cached, it won't be represented there.

Member:

To fix this, I would suggest moving this code duplication into randn_unlikely instead, so that it is the only recursive function and not all these others.

Member:

randn_unlikely is the noinline slow path though. If we moved the code we'd have to inline everything?

@rfourquet (Member) commented Sep 12, 2020

This patch looks fine as a last resort, but I would rather postpone merging it until closer to the 1.6 release, to give us a chance to better understand what exactly happens. If _randn is not inlined, then it should be, because of the @inline annotation, so the issue might be fixed by finding a way to force the compiler to inline it.

In case this is considered for backport to the 1.5 release, this commit could be applied directly to 1.5 for the next 1.5 patch release, and merged for 1.6 later only if we don't find anything better.

@rfourquet added this to the 1.6 features milestone Sep 30, 2020
@rfourquet (Member)

Looking again at the output of @code_typed randn(MersenneTwister()), it is AFAICT exactly the same between the version on master and the one with your patch, except for the type annotations on the global arrays wi and ki: on master it looks like

 %48 = Random.ki::Vector{UInt64}

whereas with your patch, it looks like

%48 = Random.ki::Core.Const(UInt64[0x0007799ec012f7b2, 0x0000000000000000, ... ]

I don't really know what it means, but it looks like inference is able to infer the "constness" of these arrays, leading to better performance. Would someone know what might cause the difference and whether it's possible to help inference on the master version without duplicating the code? (maybe @Keno?)

If someone wants to try this patch, just do

@eval Random @inline function randn(rng::AbstractRNG=default_rng())
    @inbounds begin
        r = rand(rng, UInt52Raw())

        # the following code is identical to the one in `_randn(rng::AbstractRNG, r::UInt64)`
        r &= 0x000fffffffffffff
        rabs = Int64(r>>1) # One bit for the sign
        idx = rabs & 0xFF
        x = ifelse(r % Bool, -rabs, rabs)*wi[idx+1]
        rabs < ki[idx+1] && return x # 99.3% of the time we return here 1st try
        return randn_unlikely(rng, idx, rabs, x)
    end
end

@KristofferC (Member) commented Oct 7, 2020

What happens if you make those arrays tuples? Tuples are immutable and might be handled better.

@rfourquet (Member)

What happens if you make those arrays tuples? Tuples are immutable and might be handled better.

Good idea, this seems to improve performance a bit on master. But doing the same trick on top of this PR also improves things, again thanks to a Const annotation:

%61 = Random.ki_::NTuple{256, UInt64}
# vs
%61 = Random.ki_::Core.Const((0x0007799ec012f7b2, 0x0000000000000000, ...
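For anyone who wants to try the tuple experiment locally, here is a minimal sketch (the constant names simply mirror the wi_/ki_ annotations above; the real change would of course live in the Random stdlib sources):

using Random

# Immutable copies of the ziggurat lookup tables: unlike the mutable Vector
# globals wi and ki, tuples can be propagated as compile-time constants.
@eval Random begin
    const wi_ = Tuple(wi)   # NTuple{256, Float64}
    const ki_ = Tuple(ki)   # NTuple{256, UInt64}
end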

@chriselrod (Contributor)

Perhaps interesting, Julia 1.4:

julia> using Random, BenchmarkTools

julia> const DEFAULTRNG = Random.default_rng();

julia> @benchmark Random.default_rng()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.850 ns (0.00% GC)
  median time:      2.857 ns (0.00% GC)
  mean time:        2.995 ns (0.00% GC)
  maximum time:     5.182 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn(DEFAULTRNG)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.592 ns (0.00% GC)
  median time:      4.061 ns (0.00% GC)
  mean time:        4.077 ns (0.00% GC)
  maximum time:     6.759 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.309 ns (0.00% GC)
  median time:      5.769 ns (0.00% GC)
  mean time:        5.792 ns (0.00% GC)
  maximum time:     19.508 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Julia master:

julia> using Random, BenchmarkTools

julia> const DEFAULTRNG = Random.default_rng();

julia> @benchmark Random.default_rng()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.844 ns (0.00% GC)
  median time:      2.848 ns (0.00% GC)
  mean time:        2.856 ns (0.00% GC)
  maximum time:     12.756 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn(DEFAULTRNG)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.864 ns (0.00% GC)
  median time:      4.233 ns (0.00% GC)
  mean time:        4.255 ns (0.00% GC)
  maximum time:     13.951 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.842 ns (0.00% GC)
  median time:      7.298 ns (0.00% GC)
  mean time:        7.322 ns (0.00% GC)
  maximum time:     11.818 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Keno added a commit that referenced this pull request Oct 29, 2020
When inference detects a call cycle, one of two things could happen:
1. It determines that in order for inference to converge it needs
   to limit the signatures of some methods to something more general, or
2. The cycle is determined to be harmless at the inference level (because
   though there is a cycle in the CFG there is no dependency cycle of
   type information).

In the first case, we simply disable optimizations, which is sensible,
because we're likely to have to recompute some information anyway,
when we actually get there dynamically.

In the second case, however, we do perform optimizations, but it's a bit
of an unusual case. In particular, inference usually delivers
methods to inlining in postorder (meaning callees get optimized
before their callers) such that a caller can always inline a callee.

However, if there is a cycle, there is of course no unique postorder
of functions, since by definition we're looking at a strongly
connected component. In this case, we would essentially just pick
an arbitrary order (which moreover depends on the point at which
we enter the cycle and is then cached, leading to potential
performance instabilities depending on function definition order).
However, the arbitrary order is quite possibly
suboptimal. For example in #36414, we have a cycle
randn -> _randn -> randn_unlikely -> rand. In this cycle the author
of this code expected both `randn` and `_randn` to inline and
annotated the functions as such. However, in 1.5+ the order we happened
to pick would have inlined randn_unlikely into _randn (had it not
been marked noinline), with a hard call break between randn and
_randn, which is problematic from a performance standpoint.

This PR aims to address this by introducing a heuristic: If some
functions in a cycle are marked as `@noinline`, we want to make
sure to infer these last (since they won't ever be inlined anyway).
To ensure this happens, while restoring postorder if this happens
to break the cycle, we perform a DFS traversal rooted at any `@noinline`
functions and then optimize the functions in the cycle in DFS-postorder.
Of course, this may still not be a true postorder in the inlining
graph (if the `@noinline` functions don't break the cycle), but
even in that case, it should be no worse than the default order.

Fixes #36414
Closes #37234
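As a rough, self-contained illustration of the ordering heuristic described in this commit message (a toy sketch only, not the actual inference code; the closing rand -> randn edge is assumed to complete the cycle):

# Toy call graph for the cycle mentioned above.
calls = Dict(
    :randn          => [:_randn],
    :_randn         => [:randn_unlikely],
    :randn_unlikely => [:rand],
    :rand           => [:randn],   # assumed closing edge
)
noinline = Set([:randn_unlikely])

# DFS postorder rooted at the @noinline functions: they come out last,
# so (inlineable) callees get optimized before their callers where possible.
function dfs_postorder(calls, roots)
    order = Symbol[]
    seen = Set{Symbol}()
    function visit(f)
        f in seen && return
        push!(seen, f)
        foreach(visit, get(calls, f, Symbol[]))
        push!(order, f)
    end
    foreach(visit, roots)
    foreach(visit, keys(calls))  # pick up any members not reachable from the roots
    return order
end

dfs_postorder(calls, collect(noinline))
# returns [:_randn, :randn, :rand, :randn_unlikely]  (the @noinline function comes last)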
@JeffBezanson added the backport 1.6 (Change should be backported to release-1.6) label Jan 15, 2021
@inline randn(rng::AbstractRNG=default_rng()) = _randn(rng, rand(rng, UInt52Raw()))
@inline function randn(rng::AbstractRNG=default_rng())
#=
When defining
Member:

Trailing whitespace here.

=#
@inbounds begin
r = rand(rng, UInt52Raw())

Member:

and here

@KristofferC (Member)

Rebased in #39319

@KristofferC removed the backport 1.6 (Change should be backported to release-1.6) label Feb 2, 2021