Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array hashing breaks equality on 0.7 #26034

Closed
chethega opened this issue Feb 13, 2018 · 4 comments
Closed

Array hashing breaks equality on 0.7 #26034

chethega opened this issue Feb 13, 2018 · 4 comments
Labels
domain:arrays [a, r, r, a, y, s] domain:hashing kind:regression Regression in behavior compared to a previous version

Comments

@chethega
Copy link
Contributor

struct totally_not_five end

Base.isequal(::totally_not_five, x)=isequal(5,x);
Base.isequal(x, ::totally_not_five)=isequal(5,x);
Base.isequal(::totally_not_five, ::totally_not_five)=true;
Base.hash(::totally_not_five, h::UInt64)=hash(5, h);
import Base.==
==(::totally_not_five, x)= (5==x);
==(x,::totally_not_five)= (5==x);
==(::totally_not_five,::totally_not_five)=true;

n5=totally_not_five();

############
versioninfo()
#Julia Version 0.7.0-DEV.3961
#Commit 12964e839e* (2018-02-13 10:42 UTC)

5== n5 #true
isequal([4,n5,6], [4,5,6]) #true 
isequal(hash([4,n5,6]), hash([4,5,6])) # false
isequal(hash([n5,4,n5,6]), hash([n5,4,5,6])) #true 

############ 
versioninfo()
#Julia Version 0.6.2
#Commit d386e40 (2017-12-13 18:08 UTC)

5== n5 #true
isequal([4,n5,6], [4,5,6]) #true 
isequal(hash([4,n5,6]), hash([4,5,6])) # true
isequal(hash([n5,4,n5,6]), hash([n5,4,5,6])) #true 

I fear that this is fundamental to the current approach for O(1) range hashing (but I would be happy to be corrected!).

@mbauman
Copy link
Sponsor Member

mbauman commented Feb 13, 2018

So you're defining a type that is equal to a number and hashes like a number... but it isn't itself a number and doesn't support subtraction? Yeah, I can see how that'd get you into trouble in many situations. I don't think this is really actionable… or that we'd ever guarantee such a type would work everywhere.

@nalimilan
Copy link
Member

Indeed, since #16401 a type should not be equal to a number if it doesn't support -. Honestly, that's not the biggest difficulty with that approach. :-)

@chethega
Copy link
Contributor Author

Oh, thanks. I apparently missed the warning that if there exists y isa Number such that isequal(x, y)===true then it is assumed that x isa Number; and, for all types <:Number, widen(x) + widen(y) must be well-defined mod isequal, i.e. isequal(x,y) and isequal(x2,y2) must imply isequal(widen(x1)+widen(y1), widen(x2)+widen(y2)) (resp. for subtraction).

That's the long-term plan for a simple rule, right? I mean addition/subtraction are intentionally not well-defined mod isequal, but the code assumes that it becomes well-defined post widening; and it would be cheap to restrict the special code to things that isa Number, in order to get rid of possible bugs like hash([1,2,[]])==MethodError.

@nalimilan
Copy link
Member

It's not clear yet what approach will be retained. See #26022.

@JeffBezanson JeffBezanson added kind:regression Regression in behavior compared to a previous version domain:arrays [a, r, r, a, y, s] labels Feb 14, 2018
mbauman added a commit that referenced this issue Jul 26, 2018
KristofferC pushed a commit that referenced this issue Feb 11, 2019
    Goal: Hash approximately log(N) entries with a higher density of hashed elements
    weighted towards the end and special consideration for repeated values. Colliding
    hashes will often subsequently be compared by equality -- and equality between arrays
    works elementwise forwards and is short-circuiting. This means that a collision
    between arrays that differ by elements at the beginning is cheaper than one where the
    difference is towards the end. Furthermore, blindly choosing log(N) entries from a
    sparse array will likely only choose the same element repeatedly (zero in this case).

    To achieve this, we work backwards, starting by hashing the last element of the
    array. After hashing each element, we skip the next `fibskip` elements, where
    `fibskip` is pulled from the Fibonacci sequence -- Fibonacci was chosen as a simple
    ~O(log(N)) algorithm that ensures we don't hit a common divisor of a dimension and
    only end up hashing one slice of the array (as might happen with powers of two).
    Finally, we find the next distinct value from the one we just hashed.

Fixes #27865 and fixes #26011.

Fixes #26034
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:arrays [a, r, r, a, y, s] domain:hashing kind:regression Regression in behavior compared to a previous version
Projects
None yet
Development

No branches or pull requests

4 participants