Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect quantiles for floating-point and integer arrays #144

Closed
yurivish opened this issue Apr 23, 2023 · 3 comments · Fixed by #145
Closed

Incorrect quantiles for floating-point and integer arrays #144

yurivish opened this issue Apr 23, 2023 · 3 comments · Fixed by #145

Comments

@yurivish
Copy link

yurivish commented Apr 23, 2023

Say we have an array with a large negative number and a smaller positive number:

array = Float16[-9000, 100]

Both of these numbers are represented exactly in Float16:

julia> Float16(-9000) == -9_000
true

julia> Float16(100) == 100
true

Julia will return the wrong answer for quantile queries over this array:

julia> using Statistics

julia> quantile(Float16[-9000, 100], 1.0)
104.0 # should be 100, not 104

julia> quantile(Float16[-9000, 100], 0.999999999)
103.99999089599987 # should be approximately 100, not 104

If we make the large number bigger, then quantile queries over Float32 and Float64 arrays are also incorrect:

julia> quantile(Float32[-1e9, 100], 1)
128.0f0 # should be 100, not 128

julia> quantile(Float32[-1e9, 100], 0.9999999999)
127.89999997615814 # should be approximately 100, not 128

julia> quantile(Float64[-1e20, 100], 1)
0.0 # should be 100, not 0

This can happen when both numbers are small:

julia> quantile(Float16[-255, 0.58], 1)
Float16(0.625) # should be 0.58, not 0.625

And it can happen when the array contains more than two numbers:

julia> quantile(Float32[-1e15, -1e14, -1e13, -1e12, -1e11, -1e10, -1e9, 100], 1)
128.0f0 # should be 100, not 128

For integers there’s the interesting twist – the quantiles can exceed the representable values of the integer type:

julia> quantile(Int8[-68, 60], 0.5)
-132.0 # should be -4, not less than typemin(Int8)

In some cases every quantile other than the 0th percentile is incorrect. Interestingly, the values decrease as we query successively higher percentiles:

julia> for percentile in 0:0.1:1
          arr = Int16[-30000, 30000]
          result = quantile(arr, percentile)
          println("p$(rpad(Int(100*percentile), 3)) = $result")
       end
p0   = -30000.0 # correct!
p10  = -30553.600000000002 # wrong
p20  = -31107.2 # wrong
p30  = -31660.8 # wrong
p40  = -32214.4 # wrong
p50  = -32768.0 # wrong
p60  = -33321.6 # wrong
p70  = -33875.2 # wrong
p80  = -34428.8 # wrong
p90  = -34982.4 # wrong
p100 = -35536.0 # wrong

If the numbers are big enough, comparable results can be found for Int32 and Int64:

julia> quantile(Int32[-1e9, 2e9], 1.0) < typemin(Int32)
true # should be false

julia> quantile(Int[-5e18, -2e18, 9e18], 1.0) < typemin(Int)
true # should be false

Randomized testing suggests that for Int32 this behavior occurs more frequently for short integer arrays. Based on a million samples at each length, approximately

  • 1 in every 4 random arrays of length 2 will produce incorrect quantiles.
  • 1 in every 8 random arrays of length 5 will produce incorrect quantiles.
  • 1 in every 120 random arrays of length 10 will produce incorrect quantiles.
  • 1 in every 45,000 random arrays of length 20 will produce incorrect quantiles.

For Int64 it looks like the error rate may not go down as array size decreases, and approximately

  • 1 in every 2 random arrays of length 2 will produce incorrect quantiles.
  • 1 in every 7 random arrays of length 10 will produce incorrect quantiles.
  • 1 in every 8 random arrays of length 50 will produce incorrect quantiles.
  • 1 in every 8 random arrays of length 500 will produce incorrect quantiles.
  • 1 in every 8 random arrays of length 5,000 will produce incorrect quantiles.
The function I used to compute these estimates
"""
Estimate the probability that incorrect quantiles will
be produced for a random array of a particular size and
element type.

Correctness is determined using a very naïve method where
we only consider a quantile result to be incorrect if it
lies outside the extrema of the array.
"""
function estimate_wrong_int_arrays(T, n, n_samples)
  count = 0
  percentiles = 0:0.01:1
  arr = zeros(T, n)
  for _ in 1:n_samples
    rand!(arr)
    # Since quantiles are returned in Float64 precision,
    # convert the extrema to that type for comparisons
    lo, hi = Float64.(extrema(arr))
    for p in percentiles
      result = quantile(arr, p)
      quantile_in_bounds = lo  result  hi
      if !quantile_in_bounds
        count += 1
        break
      end
    end
  end
  count / n_samples
end

This behavior reproduces on the current LTS release, Julia 1.6, as well as Julia 1.9.1, the current release as of 2023-06-07.

@nalimilan
Copy link
Member

Thanks. The problem seems to be due to this line:

return a + γ*(b-a)

E.g. with quantile(Float64[-1e20, 100], 1):

julia> (a, b, γ) = (-1.0e20, 100.0, 1)
(-1.0e20, 100.0, 1)

julia> a + γ*(b-a)
0.0

julia> (1-γ)*a + γ*b
100.0

We could change it to use (1-γ)*a + γ*b instead. But the current strategy was added by JuliaLang/julia#16572 to fix the fact that in some cases quantiles were not increasing as they should with p. Test cases added by that PR don't pass with that change.

Adding a small quantity like 4eps() to aleph at this line seems to fix the problem (this is what R does):

aleph = n*p + oftype(p, m)

@andreasnoack What do you think?

@andreasnoack
Copy link
Member

I tried to read through the old issue. It's not clear to me why the original version causes unsorted quantiles.

@nalimilan
Copy link
Member

nalimilan commented Jun 18, 2023

Actually it's not the original/current version which causes unsorted quantiles, it's the one I tested ((1-γ)*a + γ*b) to fix this issue.

Here's what happens using the test case from JuliaLang/julia#16572 and making the quantile print the values (returned quantile is the one with the (1-γ)*a + γ*b formula, i.e. the buggy one). It appears that a and b are equal to 0.41662034698690303, and γ changes from 0.2137500000000001 for quantile 6 to 0.6425000000000001 in quantile 7.

julia> y = [0.40003674665581906,0.4085630862624367,0.41662034698690303,0.41662034698690303,0.42189053966652057,0.42189053966652057,0.42553514344518345,0.43985732442991354]

julia> quantile(y, range(0.01, 0.99, length=17)[6])
(a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.2137500000000001)
0.4166203469869031

julia> (a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.2137500000000001)
(0.41662034698690303, 0.41662034698690303, 0.2137500000000001)

julia> a + γ*(b-a)
0.41662034698690303

julia> (1-γ)*a + γ*b
0.4166203469869031

julia> quantile(y, range(0.01, 0.99, length=17)[7])
(a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.6425000000000001)
0.416620346986903

julia> (a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.6425000000000001)
(0.41662034698690303, 0.41662034698690303, 0.6425000000000001)

julia> a + γ*(b-a)
0.41662034698690303

julia> (1-γ)*a + γ*b
0.416620346986903

I'm not sure what can be done about this. We could easily check whether a == b and return one of them in that case. But the same issue can happen with very close numbers, like this (taking the same numbers as before):

julia> (a, γ) = (0.41662034698690303, 0.2137500000000001)
(0.41662034698690303, 0.2137500000000001)

julia> (1-γ)*nextfloat(a) + γ*a
0.4166203469869031

julia> (a, γ) = (0.41662034698690303, 0.6425000000000001)
(0.41662034698690303, 0.6425000000000001)

julia> (1-γ)*nextfloat(a) + γ*a
0.41662034698690303

Maybe we should check whether the result is approximately equal to a or b, or whether a and b are approximately equal, and if so return a + γ*(b-a) like before (since with almost equal numbers there's no risk of precision loss due to subtraction)? This is almost equivalent to always returning a (or b...) but slightly cleaner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants