Use run length encoding for hashing arrays #10167
Conversation
This seems like a big overall win to me.

In particular, the all-zeros / many-zeros dense array and sparse array cases are very common and quite important. Accepting a rather small performance hit for that seems like a good tradeoff.
```julia
x2 != x1 && break
runlength += 1
end
```

```julia
h += hashrle_seed
```
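The fragments above are taken from the PR's diff. A self-contained sketch of the same run-length idea (hypothetical names and seed constant, simplified from the PR's actual code) might look like:

```julia
# Simplified sketch of RLE hashing (hypothetical names; not the PR's exact code).
# A run of repeated elements is folded into (seed + runlength, element), so an
# all-zeros array hashes in time proportional to its number of runs, not its length.
const hashrle_seed = UInt(0x2aab8909bfcb266d)  # arbitrary constant for illustration

function hash_rle(a::AbstractVector, h::UInt = zero(UInt))
    i = 1
    while i <= length(a)
        x = a[i]
        runlength = 1
        while i + runlength <= length(a) && isequal(a[i + runlength], x)
            runlength += 1
        end
        if runlength > 1
            # mixing in a seed makes it hard to construct collisions between an
            # encoded run and a literal occurrence of the run-length value
            h = hash(hashrle_seed + UInt(runlength), h)
        end
        h = hash(x, h)
        i += runlength
    end
    return h
end
```

Note how only arrays that actually contain runs pay the extra `hash` call, which is why the overhead on run-free data stays small.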
Yes, this is exactly the right thing to do – it makes causing collisions hard.
Yep, +1.
force-pushed from 968425b to 116b9a7
Removing
force-pushed from 116b9a7 to 7f9b707
Nice. Any objections to merging this? Otherwise let's get this merged soon.
```julia
x1 = x2
x2, state = next(a, state)
if x2 == x1
    # For repeated elements, use run length encoding
```
`isequal`?
Probably. I guess this won't hash alternating positive and negative zeros differently from a run of zeros of the same sign, and won't do RLE for runs of NaNs. Unfortunately it seems that `isequal` adds 15% overhead to Float64 hashing 😦
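The difference in run detection can be made concrete. Here is a hypothetical helper (names are for demonstration only) contrasting the two comparisons: with `==` a NaN never extends a run, while `isequal` groups NaNs but splits `0.0` from `-0.0`:

```julia
# Count how far the leading run of a extends under a given equality predicate.
function leading_run(a, eq)
    n = 1
    for i in 2:length(a)
        eq(a[i], a[1]) || break
        n += 1
    end
    return n
end

# NaN == NaN is false, so `==` sees no run; isequal sees a run of 3.
# 0.0 == -0.0 is true, so `==` merges mixed-sign zeros; isequal does not.
```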
I think we need to suck it up and do this correctly. It would be good to optimize `isequal` for `Float64`, though. Currently we have this:
```julia
julia> @code_llvm isequal(1.0, 1.0)

define i1 @julia_isequal_131700(double, double) {
top:
  %2 = bitcast double %0 to i64, !dbg !676
  %3 = bitcast double %1 to i64, !dbg !676
  %4 = fcmp uno double %0, 0.000000e+00, !dbg !676
  %5 = fcmp uno double %1, 0.000000e+00, !dbg !676
  %6 = and i1 %4, %5, !dbg !676
  %7 = icmp eq i64 %2, %3, !dbg !676
  %8 = or i1 %6, %7, !dbg !676
  ret i1 %8, !dbg !676
}
```
I wonder if there isn't something more efficient we could do. The challenges are:

- all `NaN`s are equal to each other
- `0.0` and `-0.0` are not equal to each other
- all other values are equal if and only if they are bit-identical
Thanks, Obama.
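A direct encoding of those three properties (a hypothetical sketch under an illustrative name, not Base's implementation) could be:

```julia
# NaNs collapse into a single equivalence class, 0.0 and -0.0 stay distinct,
# and everything else compares by bit pattern.
function isequal_f64(x::Float64, y::Float64)
    (x != x) & (y != y) && return true                    # all NaNs are equal
    return reinterpret(UInt64, x) == reinterpret(UInt64, y)  # bit-identical otherwise
end
```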
Why should all NaNs be equal to each other? Can't we redefine this elsewhere, e.g. in `isequal`? If we distinguish between +0.0 and -0.0, then we're outside of IEEE's numerical equality anyway. We should be able to just use bitwise equality here as well.

Put differently: different NaNs exist for a reason (that we may like or not), and there needs to be a way to distinguish between them. `isequal` would be the natural choice for this.

If someone needs to treat all NaNs as equal, then this particular place should pay the overhead for calling `isnan` and introducing the special case.
No, `isequal` is not the natural way to distinguish different NaNs; that's what `===` is for (among other things). `isequal` is for hashing; we do not want all 2^53 different `NaN`s to hash differently. I guarantee you that having `d[NaN]` and `d[-NaN]` access different dict entries will cause no end of complaints.

This is very much outside of IEEE behavior, but then again, it's pretty clear that IEEE never thought for two seconds about hashing or other broader issues of equality in high-level programming languages. We have discussed making +0.0 and -0.0 hash the same, but I'm still not entirely sold on that.
One thing I like about making +0.0 `isequal` -0.0 is that it turns `isequal` into a strict coarsening of `==`, and we could potentially have an entirely correct definition of `isequal` as:

```julia
isequal(x, y) = x == y
isequal(x::Number, y::Number) = x == y || x != x && y != y
```

Then no one would ever need to tinker with `isequal` except to use it or add methods for increased efficiency.
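Under a hypothetical name (to avoid shadowing `Base.isequal`), the proposed definition behaves like this:

```julia
# The proposed coarsening of ==, renamed isequal3 for illustration.
isequal3(x, y) = x == y
isequal3(x::Number, y::Number) = x == y || x != x && y != y
```

Note that under this definition `isequal3(0.0, -0.0)` holds (since `0.0 == -0.0`) and `isequal3(NaN, NaN)` holds too, so all NaNs fall into one equivalence class.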
It looks like:

```julia
isequal2(x::Float64, y::Float64) = (x === y) || ((x != x) & (y != y))
```

(with a branch) may be faster than:

```julia
isequal2(x::Float64, y::Float64) = (x === y) | ((x != x) & (y != y))
```

(which generates identical assembly to our current `isequal`). But due to the branch prediction penalty, performance may depend on more factors than I've benchmarked.
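For reference, the two variants agree on every input and differ only in whether a short-circuit branch is emitted (a sketch with both renamed so they can coexist):

```julia
# Same semantics, different control flow: `||` short-circuits, `|` does not.
isequal_branch(x::Float64, y::Float64)   = (x === y) || ((x != x) & (y != y))
isequal_nobranch(x::Float64, y::Float64) = (x === y) |  ((x != x) & (y != y))
```

Since `===` on `Float64` is bitwise comparison, both correctly report `NaN`/`NaN` as equal and `0.0`/`-0.0` as unequal; only the generated code differs.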
See discussion in code comments on #10167
Fixes #10160. This implementation only uses run length encoding in cases where there is a repeated element. The overhead appears to be pretty small. For Float64 arrays with no runs, it's about 3-5% slower than the old `hash`. For Int arrays, it actually seems to be a couple percent faster (??). For the all-zeros dense case it's >5x faster, and for SparseMatrixCSC the complexity is `O(nz)` instead of `O(m*n)`.