
Use run length encoding for hashing arrays #10167

Merged
merged 1 commit into from
Feb 14, 2015
Conversation

simonster
Member

Fixes #10160. This implementation only uses run length encoding in cases where there is a repeated element. The overhead appears to be pretty small. For Float64 arrays with no runs, it's about 3-5% slower than the old hash. For Int arrays, it actually seems to be a couple percent faster (??). For the all zeros dense case it's >5x faster, and for SparseMatrixCSC the complexity is O(nz) instead of O(m*n).
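The idea can be sketched as follows. This is a hypothetical illustration of RLE hashing, not the PR's actual code, and the seed value and function name are made up for the example:

```julia
# Illustrative seed; the real implementation uses its own constant.
const hashrle_seed = UInt(0x2aab8909bfcb3145)

# Hypothetical sketch: when an element repeats, hash the element once plus
# the run length instead of hashing every copy. Runs of length 1 are hashed
# exactly as before, which is why the overhead for run-free arrays is small.
function hash_rle(a::AbstractArray, h::UInt=zero(UInt))
    i = 1
    n = length(a)
    while i <= n
        x = a[i]
        runlength = 1
        while i + runlength <= n && isequal(a[i + runlength], x)
            runlength += 1
        end
        if runlength > 1
            # Mix in a distinct seed plus the run length so that, e.g.,
            # [0, 0, 0] and [0, 0] hash differently.
            h = hash(runlength, h + hashrle_seed)
        end
        h = hash(x, h)
        i += runlength
    end
    return h
end
```

For a dense array of all zeros this does O(1) element hashes instead of O(n), which is where the >5x speedup comes from.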

@StefanKarpinski
Sponsor Member

This seems like a big overall win to me.

@StefanKarpinski
Sponsor Member

In particular, the all zeros / many zeros dense array and sparse array cases are very common and quite important. Sacrificing a rather small performance hit for that seems like a good tradeoff.

x2 != x1 && break
runlength += 1
end
h += hashrle_seed
Sponsor Member

Yes, this is exactly the right thing to do – it makes causing collisions hard.

@JeffBezanson
Sponsor Member

Yep, +1.
How important is the RLEHasher? I'd rather avoid allocating lots of objects during hashing.

@simonster
Member Author

RLEHasher isn't entirely necessary. It just made this a bit easier to write. It should only allocate for SparseMatrixCSC of non-bits types. I could make it a type instead of an immutable and reuse the same object, although that might be a bit slower for bits types, or I can just get rid of it.

@simonster simonster force-pushed the sjk/rle-hash branch 2 times, most recently from 968425b to 116b9a7 Compare February 12, 2015 00:19
@simonster
Member Author

Removing RLEHasher wasn't too hard.

@StefanKarpinski
Sponsor Member

Nice. Any objections to merging this? If not, let's get it merged soon.

simonster added a commit that referenced this pull request Feb 14, 2015
Use run length encoding for hashing arrays
@simonster simonster merged commit 0179028 into master Feb 14, 2015
@simonster simonster deleted the sjk/rle-hash branch February 14, 2015 17:09
x1 = x2
x2, state = next(a, state)
if x2 == x1
# For repeated elements, use run length encoding
Sponsor Member

isequal?

Member Author

Probably. I guess this won't hash alternating positive and negative zeros differently from a run of zeros of the same sign and won't do RLE for runs of NaNs. Unfortunately it seems that isequal adds 15% overhead to Float64 hashing 😦
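The two corner cases of `==`-based run detection mentioned here can be checked directly with standard floating-point semantics:

```julia
# Runs of NaN are invisible to ==, so they would be hashed element by element:
@assert NaN != NaN

# Alternating zero signs look like a single run to ==, even though isequal
# (and therefore hashing) distinguishes the two zeros:
@assert 0.0 == -0.0
@assert !isequal(0.0, -0.0)
```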

Sponsor Member

I think we need to suck it up and do this correctly. It would be good to optimize isequal for Float64 though. Currently we have this:

julia> @code_llvm isequal(1.0, 1.0)

define i1 @julia_isequal_131700(double, double) {
top:
  %2 = bitcast double %0 to i64, !dbg !676
  %3 = bitcast double %1 to i64, !dbg !676
  %4 = fcmp uno double %0, 0.000000e+00, !dbg !676
  %5 = fcmp uno double %1, 0.000000e+00, !dbg !676
  %6 = and i1 %4, %5, !dbg !676
  %7 = icmp eq i64 %2, %3, !dbg !676
  %8 = or i1 %6, %7, !dbg !676
  ret i1 %8, !dbg !676
}

I wonder if there isn't something more efficient we could do. The challenges are:

  • all NaNs are equal to each other
  • 0.0 and -0.0 are not equal to each other
  • all other values are equal if and only if they are bit-identical

Thanks, Obama.
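Those three constraints can be verified directly against `isequal` (a quick sanity check, not code from this PR):

```julia
# The three constraints on Float64 isequal, checked directly:
@assert isequal(NaN, -NaN)              # all NaNs are isequal to each other
@assert !isequal(0.0, -0.0)             # +0.0 and -0.0 are not
@assert isequal(1.5, 1.5)               # other values: isequal iff bit-identical
@assert !isequal(1.5, nextfloat(1.5))
```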

Contributor

Why should all NaNs be equal to each other? Can't we redefine this elsewhere, e.g. in isequal? If we distinguish between +0.0 and -0.0, then we're outside of IEEE's numerical equality anyway. We should be able to just use bitwise equality here as well.

Put differently: different NaNs exist for a reason (whether we like it or not), and there needs to be a way to distinguish between them. isequal would be the natural choice for this.

If someone needs to treat all NaNs as equal, then that particular place should pay the overhead of calling isnan and introducing the special case.

Sponsor Member

No, isequal is not the natural way to distinguish different NaNs, that's what === is for (among other things). isequal is for hashing; we do not want all 2^53 different NaNs to hash differently. I guarantee you that having d[NaN] and d[-NaN] accessing different dict entries will cause no end of complaints.

This is very much outside of IEEE behavior, but then again, it's pretty clear that IEEE never thought for two seconds about hashing or other broader issues of equality in high-level programming languages. We have discussed making +0.0 and -0.0 hash the same, but I'm still not entirely sold on that.

Sponsor Member

One thing I like about making +0.0 isequal -0.0 is that it turns isequal into a strict coarsening of == and we could potentially have an entirely correct definition of isequal as:

isequal(x, y) = x == y
isequal(x::Number, y::Number) = x == y || x != x && y != y

Then no one would ever need to tinker with isequal except to use it or add methods for increased efficiency.

Member Author

It looks like:

isequal2(x::Float64, y::Float64) = (x === y) || ((x != x) & (y != y))

(with a branch) may be faster than:

isequal2(x::Float64, y::Float64) = (x === y) | ((x != x) & (y != y))

(which generates identical assembly to our current isequal). But due to the branch prediction penalty, performance may depend on more factors than I've benchmarked.
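Whichever variant wins the benchmark, both satisfy the semantics discussed above. A quick check (the names `isequal2_branch` and `isequal2_nobranch` are illustrative, not from the PR):

```julia
# Branching variant: short-circuits on bitwise equality.
isequal2_branch(x::Float64, y::Float64)   = (x === y) || ((x != x) & (y != y))

# Branchless variant: same result via bitwise or, matching the current isequal.
isequal2_nobranch(x::Float64, y::Float64) = (x === y) | ((x != x) & (y != y))

for f in (isequal2_branch, isequal2_nobranch)
    @assert f(NaN, -NaN)       # all NaNs compare equal
    @assert !f(0.0, -0.0)      # zero signs are distinguished (=== is bitwise)
    @assert f(1.5, 1.5)
    @assert !f(1.5, 2.5)
end
```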

simonster added a commit that referenced this pull request Feb 21, 2015

Successfully merging this pull request may close these issues.

Hashing SparseMatrixCSC is slow
5 participants