Incorrect quantiles for floating-point and integer arrays #144

yurivish · 2023-04-23T05:12:07Z

Say we have an array with a large negative number and a smaller positive number:

array = Float16[-9000, 100]

Both of these numbers are represented exactly in Float16:

julia> Float16(-9000) == -9_000
true

julia> Float16(100) == 100
true

Julia will return the wrong answer for quantile queries over this array:

julia> using Statistics

julia> quantile(Float16[-9000, 100], 1.0)
104.0 # should be 100, not 104

julia> quantile(Float16[-9000, 100], 0.999999999)
103.99999089599987 # should be approximately 100, not 104

If we make the large number bigger, then quantile queries over Float32 and Float64 arrays are also incorrect:

julia> quantile(Float32[-1e9, 100], 1)
128.0f0 # should be 100, not 128

julia> quantile(Float32[-1e9, 100], 0.9999999999)
127.89999997615814 # should be approximately 100, not 128

julia> quantile(Float64[-1e20, 100], 1)
0.0 # should be 100, not 0

This can happen when both numbers are small:

julia> quantile(Float16[-255, 0.58], 1)
Float16(0.625) # should be 0.58, not 0.625

And it can happen when the array contains more than two numbers:

julia> quantile(Float32[-1e15, -1e14, -1e13, -1e12, -1e11, -1e10, -1e9, 100], 1)
128.0f0 # should be 100, not 128

For integers there’s the interesting twist – the quantiles can exceed the representable values of the integer type:

julia> quantile(Int8[-68, 60], 0.5)
-132.0 # should be -4, not less than typemin(Int8)

In some cases every quantile other than the 0th percentile is incorrect. Interestingly, the values decrease as we query successively higher percentiles:

julia> for percentile in 0:0.1:1
          arr = Int16[-30000, 30000]
          result = quantile(arr, percentile)
          println("p$(rpad(Int(100*percentile), 3)) = $result")
       end
p0   = -30000.0 # correct!
p10  = -30553.600000000002 # wrong
p20  = -31107.2 # wrong
p30  = -31660.8 # wrong
p40  = -32214.4 # wrong
p50  = -32768.0 # wrong
p60  = -33321.6 # wrong
p70  = -33875.2 # wrong
p80  = -34428.8 # wrong
p90  = -34982.4 # wrong
p100 = -35536.0 # wrong

If the numbers are big enough, comparable results can be found for Int32 and Int64:

julia> quantile(Int32[-1e9, 2e9], 1.0) < typemin(Int32)
true # should be false

julia> quantile(Int[-5e18, -2e18, 9e18], 1.0) < typemin(Int)
true # should be false

Randomized testing suggests that for Int32 this behavior occurs more frequently for short integer arrays. Based on a million samples at each length, approximately

1 in every 4 random arrays of length 2 will produce incorrect quantiles.
1 in every 8 random arrays of length 5 will produce incorrect quantiles.
1 in every 120 random arrays of length 10 will produce incorrect quantiles.
1 in every 45,000 random arrays of length 20 will produce incorrect quantiles.

For Int64 it looks like the error rate may not go down as array size decreases, and approximately

1 in every 2 random arrays of length 2 will produce incorrect quantiles.
1 in every 7 random arrays of length 10 will produce incorrect quantiles.
1 in every 8 random arrays of length 50 will produce incorrect quantiles.
1 in every 8 random arrays of length 500 will produce incorrect quantiles.
1 in every 8 random arrays of length 5,000 will produce incorrect quantiles.

The function I used to compute these estimates

"""
Estimate the probability that incorrect quantiles will
be produced for a random array of a particular size and
element type.

Correctness is determined using a very naïve method where
we only consider a quantile result to be incorrect if it
lies outside the extrema of the array.
"""
function estimate_wrong_int_arrays(T, n, n_samples)
  count = 0
  percentiles = 0:0.01:1
  arr = zeros(T, n)
  for _ in 1:n_samples
    rand!(arr)
    # Since quantiles are returned in Float64 precision,
    # convert the extrema to that type for comparisons
    lo, hi = Float64.(extrema(arr))
    for p in percentiles
      result = quantile(arr, p)
      quantile_in_bounds = lo ≤ result ≤ hi
      if !quantile_in_bounds
        count += 1
        break
      end
    end
  end
  count / n_samples
end

This behavior reproduces on the current LTS release, Julia 1.6, as well as Julia 1.9.1, the current release as of 2023-06-07.

The text was updated successfully, but these errors were encountered:

nalimilan · 2023-06-17T16:03:37Z

Thanks. The problem seems to be due to this line:

Statistics.jl/src/Statistics.jl

Line 1014 in f13706e

return a + γ*(b-a)

E.g. with quantile(Float64[-1e20, 100], 1):

julia> (a, b, γ) = (-1.0e20, 100.0, 1)
(-1.0e20, 100.0, 1)

julia> a + γ*(b-a)
0.0

julia> (1-γ)*a + γ*b
100.0

We could change it to use (1-γ)*a + γ*b instead. But the current strategy was added by JuliaLang/julia#16572 to fix the fact that in some cases quantiles were not increasing as they should with p. Test cases added by that PR don't pass with that change.

Adding a small quantity like 4eps() to aleph at this line seems to fix the problem (this is what R does):

Statistics.jl/src/Statistics.jl

Line 1001 in f13706e

aleph = n*p + oftype(p, m)

@andreasnoack What do you think?

andreasnoack · 2023-06-17T18:32:25Z

I tried to read through the old issue. It's not clear to me why the original version causes unsorted quantiles.

nalimilan · 2023-06-18T15:13:11Z

Actually it's not the original/current version which causes unsorted quantiles, it's the one I tested ((1-γ)*a + γ*b) to fix this issue.

Here's what happens using the test case from JuliaLang/julia#16572 and making the quantile print the values (returned quantile is the one with the (1-γ)*a + γ*b formula, i.e. the buggy one). It appears that a and b are equal to 0.41662034698690303, and γ changes from 0.2137500000000001 for quantile 6 to 0.6425000000000001 in quantile 7.

julia> y = [0.40003674665581906,0.4085630862624367,0.41662034698690303,0.41662034698690303,0.42189053966652057,0.42189053966652057,0.42553514344518345,0.43985732442991354]

julia> quantile(y, range(0.01, 0.99, length=17)[6])
(a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.2137500000000001)
0.4166203469869031

julia> (a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.2137500000000001)
(0.41662034698690303, 0.41662034698690303, 0.2137500000000001)

julia> a + γ*(b-a)
0.41662034698690303

julia> (1-γ)*a + γ*b
0.4166203469869031

julia> quantile(y, range(0.01, 0.99, length=17)[7])
(a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.6425000000000001)
0.416620346986903

julia> (a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.6425000000000001)
(0.41662034698690303, 0.41662034698690303, 0.6425000000000001)

julia> a + γ*(b-a)
0.41662034698690303

julia> (1-γ)*a + γ*b
0.416620346986903

I'm not sure what can be done about this. We could easily check whether a == b and return one of them in that case. But the same issue can happen with very close numbers, like this (taking the same numbers as before):

julia> (a, γ) = (0.41662034698690303, 0.2137500000000001)
(0.41662034698690303, 0.2137500000000001)

julia> (1-γ)*nextfloat(a) + γ*a
0.4166203469869031

julia> (a, γ) = (0.41662034698690303, 0.6425000000000001)
(0.41662034698690303, 0.6425000000000001)

julia> (1-γ)*nextfloat(a) + γ*a
0.41662034698690303

Maybe we should check whether the result is approximately equal to a or b, or whether a and b are approximately equal, and if so return a + γ*(b-a) like before (since with almost equal numbers there's no risk of precision loss due to subtraction)? This is almost equivalent to always returning a (or b...) but slightly cleaner.

nalimilan mentioned this issue Jul 1, 2023

Fix overflows in quantile #145

Merged

nalimilan closed this as completed in #145 Jul 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect quantiles for floating-point and integer arrays #144

Incorrect quantiles for floating-point and integer arrays #144

yurivish commented Apr 23, 2023 •

edited

Loading

nalimilan commented Jun 17, 2023

andreasnoack commented Jun 17, 2023

nalimilan commented Jun 18, 2023 •

edited

Loading

Incorrect quantiles for floating-point and integer arrays #144

Incorrect quantiles for floating-point and integer arrays #144

Comments

yurivish commented Apr 23, 2023 • edited Loading

nalimilan commented Jun 17, 2023

andreasnoack commented Jun 17, 2023

nalimilan commented Jun 18, 2023 • edited Loading

yurivish commented Apr 23, 2023 •

edited

Loading

nalimilan commented Jun 18, 2023 •

edited

Loading