
Use run length encoding for hashing arrays #10167

Merged
merged 1 commit into from
Feb 14, 2015
Conversation

simonster
Member

Fixes #10160. This implementation only uses run length encoding in cases where there is a repeated element. The overhead appears to be pretty small. For Float64 arrays with no runs, it's about 3-5% slower than the old hash. For Int arrays, it actually seems to be a couple percent faster (??). For the all zeros dense case it's >5x faster, and for SparseMatrixCSC the complexity is O(nz) instead of O(m*n).
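The idea can be sketched as follows. This is a hypothetical illustration of RLE hashing, not the PR's actual code, and the seed value and function name are made up for the example:

```julia
# Illustrative seed; the real implementation uses its own constant.
const hashrle_seed = UInt(0x2aab8909bfcb3145)

# Hypothetical sketch: when an element repeats, hash the element once plus
# the run length instead of hashing every copy. Runs of length 1 are hashed
# exactly as before, which is why the overhead for run-free arrays is small.
function hash_rle(a::AbstractArray, h::UInt=zero(UInt))
    i = 1
    n = length(a)
    while i <= n
        x = a[i]
        runlength = 1
        while i + runlength <= n && isequal(a[i + runlength], x)
            runlength += 1
        end
        if runlength > 1
            # Mix in a distinct seed plus the run length so that, e.g.,
            # [0, 0, 0] and [0, 0] hash differently.
            h = hash(runlength, h + hashrle_seed)
        end
        h = hash(x, h)
        i += runlength
    end
    return h
end
```

For a dense array of all zeros this does O(1) element hashes instead of O(n), which is where the >5x speedup comes from.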

@StefanKarpinski
Sponsor Member

This seems like a big overall win to me.

@StefanKarpinski
Sponsor Member

In particular, the all zeros / many zeros dense array and sparse array cases are very common and quite important. Sacrificing a rather small performance hit for that seems like a good tradeoff.

x2 != x1 && break
runlength += 1
end
h += hashrle_seed
Sponsor Member

Yes, this is exactly the right thing to do – it makes causing collisions hard.

@JeffBezanson
Sponsor Member

Yep, +1.
How important is the RLEHasher? I'd rather avoid allocating lots of objects during hashing.

@simonster
Member Author

RLEHasher isn't entirely necessary. It just made this a bit easier to write. It should only allocate for SparseMatrixCSC of non-bits types. I could make it a type instead of an immutable and reuse the same object, although that might be a bit slower for bits types, or I can just get rid of it.

@simonster simonster force-pushed the sjk/rle-hash branch 2 times, most recently from 968425b to 116b9a7 Compare February 12, 2015 00:19
@simonster
Member Author

Removing RLEHasher wasn't too hard.

@StefanKarpinski
Sponsor Member

Nice. Any objections to merging this? If not, let's get it merged soon.

simonster added a commit that referenced this pull request Feb 14, 2015
Use run length encoding for hashing arrays
@simonster simonster merged commit 0179028 into master Feb 14, 2015
@simonster simonster deleted the sjk/rle-hash branch February 14, 2015 17:09
x1 = x2
x2, state = next(a, state)
if x2 == x1
# For repeated elements, use run length encoding
Sponsor Member

isequal?

Member Author

Probably. I guess this won't hash alternating positive and negative zeros differently from a run of zeros of the same sign and won't do RLE for runs of NaNs. Unfortunately it seems that isequal adds 15% overhead to Float64 hashing 😦
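The two corner cases of `==`-based run detection mentioned here can be checked directly with standard floating-point semantics:

```julia
# Runs of NaN are invisible to ==, so they would be hashed element by element:
@assert NaN != NaN

# Alternating zero signs look like a single run to ==, even though isequal
# (and therefore hashing) distinguishes the two zeros:
@assert 0.0 == -0.0
@assert !isequal(0.0, -0.0)
```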

Sponsor Member

I think we need to suck it up and do this correctly. It would be good to optimize isequal for Float64 though. Currently we have this:

julia> @code_llvm isequal(1.0, 1.0)

define i1 @julia_isequal_131700(double, double) {
top:
  %2 = bitcast double %0 to i64, !dbg !676
  %3 = bitcast double %1 to i64, !dbg !676
  %4 = fcmp uno double %0, 0.000000e+00, !dbg !676
  %5 = fcmp uno double %1, 0.000000e+00, !dbg !676
  %6 = and i1 %4, %5, !dbg !676
  %7 = icmp eq i64 %2, %3, !dbg !676
  %8 = or i1 %6, %7, !dbg !676
  ret i1 %8, !dbg !676
}

I wonder if there isn't something more efficient we could do. The challenges are:

  • all NaNs are equal to each other
  • 0.0 and -0.0 are not equal to each other
  • all other values are equal if and only if they are bit-identical

Thanks, Obama.
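Those three constraints can be verified directly against `isequal` (a quick sanity check, not code from this PR):

```julia
# The three constraints on Float64 isequal, checked directly:
@assert isequal(NaN, -NaN)              # all NaNs are isequal to each other
@assert !isequal(0.0, -0.0)             # +0.0 and -0.0 are not
@assert isequal(1.5, 1.5)               # other values: isequal iff bit-identical
@assert !isequal(1.5, nextfloat(1.5))
```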

Contributor

Why should all NaNs be equal to each other? Can't we redefine this elsewhere, e.g. in isequal? If we distinguish between +0.0 and -0.0, then we're outside of IEEE's numerical equality anyway. We should be able to just use bitwise equality here as well.

Put differently: different NaNs exist for a reason (whether we like it or not), and there needs to be a way to distinguish between them. isequal would be the natural choice for this.

If someone needs to treat all NaNs as equal, then that particular place should pay the overhead of calling isnan and introducing the special case.

Sponsor Member

No, isequal is not the natural way to distinguish different NaNs, that's what === is for (among other things). isequal is for hashing; we do not want all 2^53 different NaNs to hash differently. I guarantee you that having d[NaN] and d[-NaN] accessing different dict entries will cause no end of complaints.

This is very much outside of IEEE behavior, but then again, it's pretty clear that IEEE never thought for two seconds about hashing or other broader issues of equality in high-level programming languages. We have discussed making +0.0 and -0.0 hash the same, but I'm still not entirely sold on that.

Sponsor Member

One thing I like about making +0.0 isequal -0.0 is that it turns isequal into a strict coarsening of == and we could potentially have an entirely correct definition of isequal as:

isequal(x, y) = x == y
isequal(x::Number, y::Number) = x == y || x != x && y != y

Then no one would ever need to tinker with isequal except to use it or add methods for increased efficiency.

Member Author

It looks like:

isequal2(x::Float64, y::Float64) = (x === y) || ((x != x) & (y != y))

(with a branch) may be faster than:

isequal2(x::Float64, y::Float64) = (x === y) | ((x != x) & (y != y))

(which generates identical assembly to our current isequal). But due to the branch prediction penalty, performance may depend on more factors than I've benchmarked.
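Whichever variant wins the benchmark, both satisfy the semantics discussed above. A quick check (the names `isequal2_branch` and `isequal2_nobranch` are illustrative, not from the PR):

```julia
# Branching variant: short-circuits on bitwise equality.
isequal2_branch(x::Float64, y::Float64)   = (x === y) || ((x != x) & (y != y))

# Branchless variant: same result via bitwise or, matching the current isequal.
isequal2_nobranch(x::Float64, y::Float64) = (x === y) | ((x != x) & (y != y))

for f in (isequal2_branch, isequal2_nobranch)
    @assert f(NaN, -NaN)       # all NaNs compare equal
    @assert !f(0.0, -0.0)      # zero signs are distinguished (=== is bitwise)
    @assert f(1.5, 1.5)
    @assert !f(1.5, 2.5)
end
```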

simonster added a commit that referenced this pull request Feb 21, 2015

Successfully merging this pull request may close these issues.

Hashing SparseMatrixCSC is slow
5 participants