Hash function is invariant to reversal of arrays of symbols #20744

NickMcNutt · 2017-02-22T21:02:39Z

Julia Version 0.6.0-dev.2893
Commit 33f464ed25* (2017-02-21 17:48 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.5.0)
  CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)

From the manual for hash

Compute an integer hash code such that isequal(x,y) implies hash(x)==hash(y).

While the converse of this statement isn't strictly implied, I think it would be a reasonable thing to expect from a hash function (barring adversarial hash collisions), and this isn't the case for vectors of symbols:

isequal([1, 2, 3], [3, 2, 1]) # false
hash([1, 2, 3]) == hash([3, 2, 1]) # false

isequal([:a, :b, :c], [:c, :b, :a]) # false
hash([:a, :b, :c]) == hash([:c, :b, :a]) # true

Would it be possible to modify the hash function so that it is no longer invariant to commonly used object transformations?

The text was updated successfully, but these errors were encountered:

stevengj · 2017-02-22T21:27:36Z

The root problem here is that

hash(:c, hash(:b, hash(:a, zero(UInt)))) == hash(:a, hash(:b, hash(:c, zero(UInt))))

returns true, as a consequence of hash(x::Symbol, h::UInt) = 3*object_id(x) - h.

JeffBezanson · 2017-02-22T21:31:04Z

I think we should bring back the bitmix function deprecated in 0.3 (!). bitmix(x,y) was deprecated to hash(x,y), but I'm not really happy with that since hash has to pay the cost of hashing x numerically, when all we need is mixing. This makes it easier to mix multiple hash values more thoroughly.

JeffreySarnoff · 2017-02-23T02:22:41Z

@JeffBezanson Where may I see that bitmix(x,y) implementation?

JeffBezanson · 2017-02-23T02:33:36Z

It called C functions in src/support/hashing.[ch].

StefanKarpinski · 2017-02-23T18:44:34Z

Using bitmix for mixing hashes was causing us to do about twice as much hashing as necessary, which is why I got rid of it – we were hashing each value and then bitmixing the hashes. With a Merkle-Damgard construction, you only need to do one hash operation per object, not two. The trouble here is that we're doing zero actual hash operations per object and instead relying on object_id to be "random". The correct solution here is that symbol hashing (and general object hashing) should not be so simple and actually use a real hash function – i.e. not a function that can easily be predicted and inverted.

stevengj · 2017-02-23T19:29:05Z

@StefanKarpinski, this issue has nothing to do with whether object_id is "random" or a good hash. The problem is with the mixing — you get exactly the same collision if you use the same mixing function with rand(UInt) "hashes":

julia> badmix(x::UInt, h::UInt) = 3*x - h
badmix (generic function with 1 method)

julia> x1 = rand(UInt); x2 = rand(UInt); x3 = rand(UInt);

julia> badmix(x1, badmix(x2, badmix(x3, zero(UInt)))) == badmix(x3, badmix(x2, badmix(x1, zero(UInt))))
true

If we just defined hash(x::Symbol, h::UInt) = bitmix(object_id(x), h) with a better bitmix function, it seems like the current issue would be fixed..

stevengj · 2017-02-23T19:35:49Z

I can confirm that using the old bitmix function fixes the problem:

julia> int64hash(x) = ccall(:int64hash, UInt, (UInt,), x)
int64hash (generic function with 1 method)

julia> bitmix(a, b) = int64hash(xor(a, bswap(b)))
bitmix (generic function with 1 method)

julia> myhash(x::Symbol, h::UInt) = bitmix(object_id(x), h)
myhash (generic function with 1 method)

julia> myhash(:c, myhash(:b, myhash(:a, zero(UInt)))) == myhash(:a, myhash(:b, myhash(:c, zero(UInt))))
false

stevengj · 2017-02-23T19:44:45Z

An even simpler solution, that doesn't require the oldbitmix:

julia> myhash(x::Symbol, h::UInt) = hash(object_id(x), h)
myhash (generic function with 1 method)

julia> myhash(:c, myhash(:b, myhash(:a, zero(UInt)))) == myhash(:a, myhash(:b, myhash(:c, zero(UInt))))
false

Maybe this is what Stefan mean by "using an actual hash function?" I was distinguishing between "hashing x" and "mixing with h", but Stefan was not?

StefanKarpinski · 2017-02-23T19:48:29Z

I already have a PR fixing this. It's essentially the same as your fix except that it uses low-level bit hashing (hash_uint) directly instead of going through the hash function, which is the "cost of hashing x numerically" that @JeffBezanson was unhappy about paying. Although the cost is pretty low for UInt values, it's not nothing (there's an uitofp call required and some other ops).

StefanKarpinski · 2017-02-23T20:24:06Z

I was curious about the history of this: it was originally correct but was broken here in order to not pay the hashing cost for objects since object_id is already a hash. However, when chaining hashes of objects, this means no real hashing is ever done at all. A somewhat cheaper solution here would be to change the hashing function to 3h - object_id(x) since then hash(a, hash(b, hash(c, h))) wouldn't produce 3object_id(a) - 3object_id(b) + 3object_id(c) - h but would instead produce 27h - 9object_id(a) - 3object_id(b) - object_id(c) which at least treats a and c asymmetrically. But that still strikes me as fairly sketchy since we're not doing any real hashing – it's all simple linear operations. The hash_uint function I'm using in my PR is exactly translated from the C int64hash function that was previously used to define bitmix, so it's hard to see how my PR could be objectionable from a performance standpoint.

JeffreySarnoff · 2017-02-23T20:28:32Z

y'all seem to be satisfied with MurmurHash3 .. why?

StefanKarpinski · 2017-02-23T20:30:14Z

Inertia? That's only used for hashing large data blobs, this is a different hash function entirely.

StefanKarpinski · 2017-02-23T20:34:32Z

Another way to avoid any hashing beyond the computation of object_id in the single-value case would be to define single-argument hash(x::ANY) = object_id(x) and then define hash(x::ANY, h::UInt) = object_id(x) $ hash_uint(h) or something like that. This saves a single hash operation in any M-D style chain of hash operations, in particular doing no hashes in the single-object case. I'm not sure if object_id is good enough as a hash.

JeffBezanson · 2017-02-23T20:37:58Z

Yes, I agree your PR using hash_uint is acceptable. object_id also internally uses the same hashing and mixing functions (e.g. int64hash), so it's probably good enough on its own.

StefanKarpinski added a commit that referenced this issue Feb 23, 2017

fix hashing of general objects, including Symbols (fix #20744)

7e33982

StefanKarpinski added a commit that referenced this issue Feb 23, 2017

hashing of general objects was far too simple (fixes #20744)

df00880

StefanKarpinski added a commit that referenced this issue Feb 23, 2017

hashing of general objects was far too simple (fixes #20744)

9fe729b

tkelman closed this as completed in 0dca594 Feb 24, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hash function is invariant to reversal of arrays of symbols #20744

Hash function is invariant to reversal of arrays of symbols #20744

NickMcNutt commented Feb 22, 2017

stevengj commented Feb 22, 2017 •

edited

Loading

JeffBezanson commented Feb 22, 2017

JeffreySarnoff commented Feb 23, 2017

JeffBezanson commented Feb 23, 2017

StefanKarpinski commented Feb 23, 2017 •

edited

Loading

stevengj commented Feb 23, 2017

stevengj commented Feb 23, 2017

stevengj commented Feb 23, 2017 •

edited

Loading

StefanKarpinski commented Feb 23, 2017

StefanKarpinski commented Feb 23, 2017 •

edited

Loading

JeffreySarnoff commented Feb 23, 2017

StefanKarpinski commented Feb 23, 2017

StefanKarpinski commented Feb 23, 2017

JeffBezanson commented Feb 23, 2017

Hash function is invariant to reversal of arrays of symbols #20744

Hash function is invariant to reversal of arrays of symbols #20744

Comments

NickMcNutt commented Feb 22, 2017

stevengj commented Feb 22, 2017 • edited Loading

JeffBezanson commented Feb 22, 2017

JeffreySarnoff commented Feb 23, 2017

JeffBezanson commented Feb 23, 2017

StefanKarpinski commented Feb 23, 2017 • edited Loading

stevengj commented Feb 23, 2017

stevengj commented Feb 23, 2017

stevengj commented Feb 23, 2017 • edited Loading

StefanKarpinski commented Feb 23, 2017

StefanKarpinski commented Feb 23, 2017 • edited Loading

JeffreySarnoff commented Feb 23, 2017

StefanKarpinski commented Feb 23, 2017

StefanKarpinski commented Feb 23, 2017

JeffBezanson commented Feb 23, 2017

stevengj commented Feb 22, 2017 •

edited

Loading

StefanKarpinski commented Feb 23, 2017 •

edited

Loading

stevengj commented Feb 23, 2017 •

edited

Loading

StefanKarpinski commented Feb 23, 2017 •

edited

Loading