
variance nonzero for constant array #9631

Open · gerritholl opened this issue Aug 30, 2017 · 10 comments

Comments

@gerritholl
Contributor

NumPy may give a nonzero variance (and thus standard deviation) for a constant array. This may be due to loss of numerical precision, but Python's built-in variance routine gives the correct answer of 0, so clearly the loss is preventable:

In [45]: x = [6715266981.538051]*10

In [46]: statistics.variance(x)
Out[46]: 0.0

In [47]: numpy.array(x).var()
Out[47]: 9.0949470177292824e-13

I think it would be highly desirable if x.std() and x.var() for a constant array could be relied on to be exactly zero. I'm aware one should not compare floating-point numbers, but zero is a bit of a special case.

@njsmith
Member

njsmith commented Aug 31, 2017

FWIW, statistics.variance appears to do all its arithmetic using exact arbitrary-precision fractions, which, in addition to being extremely slow, will probably blow up your memory on reasonably sized datasets.
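
For illustration, a rough sketch of what exact-rational arithmetic involves (my example, not the statistics module's actual code): every binary float converts exactly to a Fraction, so the mean and the squared deviations are computed without rounding, but each intermediate value carries an ever-growing numerator and denominator.

from fractions import Fraction

def exact_sample_variance(data):
    # Sample variance (ddof=1, like statistics.variance) in exact rational arithmetic.
    xs = [Fraction(v) for v in data]   # binary floats convert to Fraction exactly
    n = len(xs)
    mean = sum(xs) / n
    return sum((v - mean) ** 2 for v in xs) / (n - 1)

print(exact_sample_variance([6715266981.538051] * 10))   # 0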

@bashtage
Contributor

If you want to check for 0 variance, you should use ptp().
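
For reference, a minimal sketch of that guard (my example, not bashtage's code): ptp() is the peak-to-peak range (max minus min), which is exactly zero only when every element is identical, so it can short-circuit the variance to an exact zero.

import numpy as np

x = np.array([6715266981.538051] * 10)

# np.ptp (max - min) is exactly 0.0 iff all elements are equal,
# so a constant array can be mapped to an exact zero variance.
var = 0.0 if np.ptp(x) == 0.0 else x.var()
print(var)   # 0.0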

@gerritholl
Contributor Author

gerritholl commented Aug 31, 2017

I see; the statistics approach is indeed too high a price to pay for NumPy. I figured there might be a reformulated numerical algorithm where loss of precision does not cause this problem, but I might be wrong there.

I ran into this in a vectorised situation, where I would have noticed a bug in my code months ago if x.std() had been zero instead of small (due to things being infinite). I'm prepending a .ptp() check now to act correctly where ptp() is zero.

Perhaps I'm asking too much here. FWIW, I confirmed that Matlab also gives a nonzero answer (although not the same as numpy!).

@njsmith
Member

njsmith commented Aug 31, 2017

I'm not aware of any general-purpose numerical algorithms that can give that kind of precision guarantee. I think the only ways you could guarantee exact 0 from a variance calculation are by using an infinite precision calculation, or by doing a preprocessing pass to explicitly check for all the values being equal. I guess the latter isn't even necessarily all that expensive in common cases where you can bail out after checking a few values, but it's extra complexity (esp. in the vectorized case) for a pretty specific and unusual case. I dunno.
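
To make the bail-out idea concrete (a hypothetical wrapper of my own, not a NumPy proposal), the check can probe a few elements before paying for a full comparison:

import numpy as np

def var_with_constant_check(a, probe=8):
    # Hypothetical wrapper: return an exact 0.0 for constant input,
    # otherwise fall back to the usual calculation.
    a = np.asarray(a).ravel()
    if a.size == 0:
        return np.var(a)   # keep NumPy's behaviour for the empty case
    # Cheap probe first: most non-constant arrays already differ in the
    # first few elements, so the full comparison is rarely reached.
    if np.all(a[:probe] == a[0]) and np.all(a == a[0]):
        return 0.0
    return np.var(a)

print(var_with_constant_check([6715266981.538051] * 10))   # 0.0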

@WarrenWeckesser
Member

> I'm not aware of any general-purpose numerical algorithms that can give that kind of precision guarantee.

FWIW, it is not hard to show that Welford's algorithm [1] will give a variance that is exactly 0 when the input is constant.
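
A minimal pure-Python sketch of Welford's update (my illustration, not NumPy's implementation): for constant input the running mean equals the data exactly after the first element, so every subsequent increment of the sum of squared deviations is exactly zero.

import numpy as np

def welford_var(a):
    # One-pass (online) population variance, matching np.var's default ddof=0.
    a = np.asarray(a, dtype=float).ravel()
    if a.size == 0:
        return np.nan
    mean = 0.0
    m2 = 0.0
    for n, x in enumerate(a, start=1):
        delta = x - mean
        mean += delta / n          # exact for constant input: mean == x after step 1
        m2 += delta * (x - mean)   # delta is exactly 0.0 from step 2 onwards
    return m2 / n

print(welford_var([6715266981.538051] * 10))   # 0.0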

NumPy uses a two-pass algorithm (compute the mean, then average the squared deviations), and the underlying problem appears to be that NumPy's calculation of the mean of x is off by one ULP:

In [258]: x = np.array([6715266981.538051]*10)

In [259]: np.mean(x)
Out[259]: 6715266981.5380497

In [260]: np.mean(x) == x[0]
Out[260]: False
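
For what it's worth, the numbers line up with a one-ULP error in the mean (my back-of-the-envelope check, not from the thread): the spacing of doubles near 6.7e9 is 2**-20, and the variance reported at the top is 2**-40, i.e. exactly that error squared.

import numpy as np

x = np.array([6715266981.538051] * 10)
print(np.spacing(x[0]))          # one ULP at this magnitude: 2**-20 ≈ 9.54e-07
print(x[0] - np.mean(x))         # the computed mean is low by one ULP
print((x[0] - np.mean(x)) ** 2)  # 2**-40 ≈ 9.09e-13, matching x.var() above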

[1] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm

@njsmith
Member

njsmith commented Sep 1, 2017

Ah, you're quite right -- I actually checked that one before posting, but misread the details :-). Do we know how it compares to our current algorithm in terms of speed or accuracy?

@WarrenWeckesser
Member

I don't know about the speed. John D. Cook blogged about the accuracy: https://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/

@sk1p

sk1p commented Jan 18, 2019

Related paper with accuracy and runtime comparisons of various methods: https://dl.acm.org/citation.cfm?id=3223036

@mhvk
Contributor

mhvk commented Jan 19, 2019

@sk1p - thanks for the link! I think it would be great to use that one-pass algorithm, ideally as a new gufunc that would calculate (weighted) mean and variance in one go.
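
For context, a rough Python sketch of what such a kernel could compute per call (the weighted incremental update usually attributed to West; this is my illustration of the inner loop, not the proposed gufunc and not any existing NumPy API):

import numpy as np

def weighted_mean_var(x, w):
    # One-pass weighted mean and (population-style, frequency-weight) variance.
    x = np.asarray(x, dtype=float).ravel()
    w = np.asarray(w, dtype=float).ravel()
    w_sum = 0.0
    mean = 0.0
    s = 0.0
    for xi, wi in zip(x, w):
        w_sum += wi
        delta = xi - mean
        mean += (wi / w_sum) * delta
        s += wi * delta * (xi - mean)
    if w_sum == 0.0:
        return np.nan, np.nan
    return mean, s / w_sum

mean, var = weighted_mean_var([6715266981.538051] * 10, np.ones(10))
print(mean, var)   # the constant value itself, and exactly 0.0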

@mhvk
Contributor

mhvk commented Mar 30, 2019

Following up on this: a quick gufunc implementation by @0x0L of the algorithm referenced by @sk1p above correctly gives a variance of zero for the example at the top of this issue.
