
Make arrays and ranges hash and compare equal #16401

Merged: 11 commits from nl/hash into master on Dec 21, 2017

Conversation

@nalimilan (Member) commented May 17, 2016

Instead of hashing the values themselves, hash the first value and
differences between subsequent elements using run-length encoding.
This allows for O(1) hashing of ranges consistent with AbstractArrays,
which means they can now compare equal.

Elements for which the - operator is not defined are hashed directly.
This assumes that types which can be used for ranges implement -.


This is based on the suggestion by @StefanKarpinski and @mbauman at #12226 (comment). It fixes two issues I repeatedly filed about the same problem (#13565 and #16364).

The hashing tests currently pass except for sparse matrices, for which I haven't adapted the algorithm yet. I'd like to get some feedback first. I'm a total newbie when it comes to hashing, so don't trust me on this -- though the existing code was already surprisingly close to what I needed, so the changes are limited.

The main limitation is the handling of the case where the difference between two subsequent elements cannot be computed. Since the goal is only to apply the same hashing to arrays and ranges, it doesn't really matter what we do when - doesn't work: ranges shouldn't work in that case either. But we still need to hash Any[1,2,3], [1,2,3] and 1:3 the same, which means we cannot rely on eltype to find out whether - works. I guess the best solution would be to have a fast path for when eltype is concrete (or precise enough to know whether - is applicable or not), and keep the current slow path for other cases. Suggestions welcome.

Fixes #16364.
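
As a rough illustration of the scheme (the helper name rle_diff_hash is made up; this is a simplified sketch, not the PR's implementation):

# Hedged sketch, not the PR's code: hash the first element, then
# run-length-encode the differences between consecutive elements so that
# [1,2,3], Any[1,2,3] and 1:3 all produce the same sequence of hashed values.
function rle_diff_hash(a, h::UInt=zero(UInt))
    isempty(a) && return h
    x1 = first(a)
    h = hash(x1, h)
    prev = x1
    lastdiff = nothing   # difference of the current run
    runlen = 0           # length of the current run
    for x in Iterators.drop(a, 1)
        d = applicable(-, x, prev) ? x - prev : x
        if runlen > 0 && isequal(d, lastdiff)
            runlen += 1
        else
            runlen > 0 && (h = hash(runlen, hash(lastdiff, h)))
            lastdiff, runlen = d, 1
        end
        prev = x
    end
    runlen > 0 && (h = hash(runlen, hash(lastdiff, h)))
    return h
end

For a range the same stream collapses to the first element plus a single run of identical steps, which is what allows a range to hash itself in O(1) while remaining consistent with the array hash.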

base/hashing.jl Outdated
h = hash(x1, h)
done(a, state) && return h

# Then hash the difference between two subsequent elements when - is supported,
Sponsor Member:

You could have a fast-path version wrapped in isleaftype(eltype(a)), which gets elided at compile-time (see #7088). Would be lovely to have applicable work the same way, whenever all passed types are leaf-types.

Demo for isleaftype:

julia> function testelt{T}(a::Array{T})
           if isleaftype(T)
               return "leaf"
           end
           return "nonleaf"
       end
testelt (generic function with 1 method)

julia> @code_warntype testelt(Int[1])
Variables:
  #self#::#testelt
  a::Array{Int64,1}

Body:
  begin  # REPL[1], line 2:
      $(QuoteNode(true)) # REPL[1], line 3:
      return "leaf"
  end::String

julia> @code_warntype testelt(Integer[1])
Variables:
  #self#::#testelt
  a::Array{Integer,1}

Body:
  begin  # REPL[1], line 2:
      $(QuoteNode(false))
      goto 4
      4:  # REPL[1], line 5:
      return "nonleaf"
  end::String

Member Author:

Yeah, that's more or less what I have in mind. Thanks for the details!

@nalimilan (Member Author) commented May 17, 2016

OK, I've updated the PR with an adapted hash method for sparse matrices, and with an attempt at using isleaftype to make the common case efficient. Not sure whether that's the best solution. Anyway, the tests now pass!

There's still an issue with some float ranges like 1.0:0.3:2.0 and linspace(1, 3, 10) (which are not in the tests currently). @StefanKarpinski Do you think we can do something about this? Looks like == is smarter than my hash.

EDIT: for now I had to comment out a test where a sparse matrix starts with -0.0. Needs some tweaking.

base/hashing.jl Outdated
val = (x1, x2) -> x2
end
else
val = (x1, x2) -> applicable(-, x2, x1) ? x2 - x1 : x2
Sponsor Member:

I wonder if this would be a little better defined if we just did RLE of the diff for Number subtypes. If a number doesn't support subtraction, I think it's pretty broken. But I could imagine a non-numeric type whose subtraction method doesn't definitively identify distinct elements.

That way you could simply express this as regular old dispatch. It'll automatically be fast for well-typed collections… and you won't need to worry about type stability yourself.

Member Author:

I'd thought about that, but one also needs to support non-Numbers like Char, Date, etc. Some kind of trait indicating whether a type can be used with ranges (i.e. supports subtraction and addition) would definitely help.

But there's also the problem that you need to return the same hash for untyped arrays, and I'm not sure how you'd dispatch on that...

@mbauman (Sponsor Member) commented May 18, 2016

I'm just imagining a generic function:

diffhash(a, b) = b
diffhash(a::Number, b::Number) = a-b
diffhash(a::DateTime, b::DateTime) = a-b
# … etc …

Custom types can directly hook into it themselves. And you don't need to worry about container types since Julia will just do the dispatch on the actual elements; well-typed containers will automatically be type-stable.

Member Author:

Good idea. But without a trait to identify types for which ranges work, authors of custom types could easily forget to add that definition, since hashing issues aren't always glaring. So I guess the Any fallback should still call applicable to be safe. Anyway with @timholy's PR we might not even need this.

@StefanKarpinski (Sponsor Member):

How does this handle a floating-point range like 0:0.1:3?

@StefanKarpinski (Sponsor Member):

Eh, sorry, I just saw that you asked me about that in the above comment. I think that's the really hard part and I can't really see a way to solve it – I suspect something really clever would be required.

@StefanKarpinski (Sponsor Member):

I guess one approach would be to just punt on ranges where the actual steps are not exactly equal to the requested step. However, I'm not entirely sure what that criterion should be. Conservatively, I would guess that if the endpoints and the step are all integral and have epsilons ≤ 1 it should be safe to do this. Otherwise, you'd have to just enumerate the values and compute the full hash like you would for a vector.
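
As a hedged sketch of that criterion (helper names are hypothetical, not part of the PR):

# Hypothetical sketch: only take the O(1) path when start, step and stop
# are integer-valued and small enough that eps(x) <= 1, i.e. adjacent
# representable floats are at most 1 apart, so the generated steps are exact.
integral_and_exact(x::Integer) = true
integral_and_exact(x::AbstractFloat) = isinteger(x) && eps(x) <= 1
integral_and_exact(x) = false

safe_for_o1_hash(r) =
    integral_and_exact(first(r)) &&
    integral_and_exact(step(r)) &&
    integral_and_exact(last(r))

Ranges failing the check would fall back to enumerating their values and hashing them like a plain vector, as described above.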

@nalimilan (Member Author):

Yeah, that would be a possible fallback if we don't find any other strategy. Though getting an O(n) operation for some float ranges and not others isn't ideal.

Another strategy I tried is to change the hashing of arrays to consider subsequent v2-v1 steps as equal when they differ by less than a few epsilons (but how many exactly?). That works great when the first step is equal to the range's step, but of course not in other cases like 0.3:-0.1:0.0. I wonder whether we could reliably compute the "real" step from its approximations. It doesn't sound easy: do we have any guarantee that it would be e.g. the most frequent of the approximations, or their average over a particular number of approximations?

@toivoh (Contributor) commented May 23, 2016

About float ranges: could we infer, for a given vector, which range it would be if it were a range, e.g. from the first and last element and the length, and then hash the differences between that range's elements and the given ones?

@StefanKarpinski (Sponsor Member):

Ah, that's a clever idea, @toivoh. That seems like a feasible solution.

@nalimilan (Member Author):

I don't get it. "Which range it would be if it were a range" is more or less what I was asking, but so far I don't know how to do that unfortunately. It would be great if you have a solution...

@StefanKarpinski (Sponsor Member) commented May 23, 2016

You hash v::Vector by comparing it to the range r with start v[1], end v[end], and as many elements as v has (i.e. r = v[1]:(v[end]-v[1])/(length(v)-1):v[end]). That way, if v is equal to r, they hash the same – namely by hashing the differences, which are all zeros.

@StefanKarpinski (Sponsor Member):

I think this requires RLE hashing of the differences between v and r. So to hash an actual range, we know that we can just hash the start, the stop, the number of elements in between, and then a zero.
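
A rough sketch of that scheme for a vector (hypothetical helper; it assumes numeric elements and 1-based indexing, ignores empty and one-element vectors, and simply skips zero deviations instead of run-length-encoding them):

# Hedged sketch of the idea above, not the PR's code: hash the range that v
# "would be" in O(1), then hash only the nonzero deviations from it -- there
# are none when v is exactly equal to a range.
function hash_against_range(v::AbstractVector, h::UInt=zero(UInt))
    length(v) < 2 && return hash(v, h)           # punt on trivial cases
    r = LinRange(v[1], v[end], length(v))        # the range v "would be"
    h = hash((first(r), last(r), length(r)), h)  # O(1) part; same for a true range
    for i in 1:length(v)
        d = v[i] - r[i]
        d == 0 || (h = hash((i, d), h))          # record where v departs from r
    end
    return h
end

An actual range can stop after the first hash call, so it stays O(1), while a vector that is exactly equal to a range contributes no extra terms and therefore hashes the same.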

@nalimilan (Member Author):

Ah, that's clever indeed. So we could compute these differences, and then do the hashing on the result.

Though we also need to hash sparse arrays efficiently. For these, the difference between a range and actual values will be equal to a range for each sequence of zero elements; in order to hash these efficiently, we need to hash the difference/step between two subsequent elements IIUC. This is getting pretty complex, but could work.

@StefanKarpinski (Sponsor Member):

Yeah, I'm not sure how to make it handle sparse arrays efficiently too...

@toivoh (Contributor) commented May 24, 2016

For range-like arrays, the only hashing case that is really important is when the array turns out to match a range exactly.

So another way could be to encode the number of initial elements that match the range, and then go on and hash the rest as normal. That should allow hashing sparse vectors efficiently as well.
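
A hedged sketch of this variant (hypothetical helper, not the PR's code; same assumptions as the previous sketch):

# Hash the implied range and the length of the leading prefix that matches it,
# then hash whatever is left elementwise.
function prefix_range_hash(v::AbstractVector, h::UInt=zero(UInt))
    length(v) < 2 && return hash(v, h)
    r = LinRange(v[1], v[end], length(v))        # the range v "would be"
    n = 0                                        # length of the matching prefix
    while n < length(v) && v[n + 1] == r[n + 1]
        n += 1
    end
    h = hash((first(r), last(r), n), h)
    for i in n+1:length(v)                       # empty when v is exactly a range
        h = hash(v[i], h)
    end
    return h
end

An actual range can produce the same hash in O(1), since its matching prefix is by construction the whole range.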

@StefanKarpinski (Sponsor Member):

Also a clever idea.

@nalimilan (Member Author):

I have something almost ready locally. This approach gives much simpler code, with only quite limited changes.

But it still doesn't work in corner cases. For example:

julia> r=0.1:0.1:0.3;

julia> first(r):(last(r)-first(r))/(length(r)-1):last(r)
0.1:0.09999999999999999:0.3

julia> (last(r)-first(r))/(length(r)-1) - step(r)
-1.3877787807814457e-17

Since this is exactly eps, can something be done about it?

@StefanKarpinski (Sponsor Member):

Ugh. Ranges are the worst.

@nalimilan (Member Author):

Maybe the same kind of trick you use to choose the best bounds could be used here?

@StefanKarpinski (Sponsor Member):

Yeah, it needs to be something similar.

@StefanKarpinski StefanKarpinski added the kind:breaking This change will break code label Sep 13, 2016
@StefanKarpinski StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016
@StefanKarpinski (Sponsor Member):

This is close but not quite there. We should figure this out for 0.6.

@nalimilan (Member Author):

I really need help to make progress here. If somebody can find a way to get the examples from my last comment to work, then I can work out the rest. I've just pushed the last state of my work from May 25th (which implements the alternative strategy suggested by @toivoh) to the nl/hash2 branch (beware, it's not clean yet).

@tkelman tkelman added this to the 1.0 milestone Jan 5, 2017
@nalimilan nalimilan added the status:triage This should be discussed on a triage call label Dec 21, 2017
@StefanKarpinski StefanKarpinski removed the status:triage This should be discussed on a triage call label Dec 21, 2017
@StefanKarpinski StefanKarpinski merged commit 8872f90 into master Dec 21, 2017
@nalimilan nalimilan deleted the nl/hash branch December 21, 2017 19:58
@timholy (Sponsor Member) commented Dec 21, 2017

I didn't think we were going to get this one. What a wonderful gift! Thanks, @nalimilan and all those who helped.

@nalimilan (Member Author):

I didn't either, so thank @StefanKarpinski for deciding that it had to be possible. :-) I just hope it won't turn out to be too complex/annoying, as you never know whether another ugly corner case might pop up...

@yurivish (Contributor) commented Dec 21, 2017

Could somebody explain the practical benefits of this one? I've skimmed the thread and am finding it a bit tricky to wrap my mind around.

Edit: Aha:

julia> 1:3 == [1,2,3]
true

So this is effectively a sort of (very subtle and nuanced) bugfix that corrects the behavior of == to correspond to the definition in the docs?

"Should be implemented for all types with a notion of equality, based on the abstract value that an instance represents. For example, all numeric types are compared by numeric value, ignoring type."

@nalimilan (Member Author):

The "only" benefit of this change is this:

julia> 1:3 == [1,2,3]
true

The drawbacks are more complex to explain... 😈

@StefanKarpinski (Sponsor Member):

The main drawbacks are: O(n) hashing for float ranges and needing to implement the widen operation for a type in order to hash ranges with that element type. That and a lot of blood, sweat and tears (mostly @nalimilan's).

@vchuravy (Sponsor Member):

@nalimilan (Member Author):

Indeed. Of course, equality tests between arrays and ranges got 686 times slower, because they previously compared pointers and returned false immediately. Most other changes in that report appear to be noise (especially all the unrelated improvements), so I'm not sure what to do with that. :-/

@vchuravy (Sponsor Member):

:) Well that's fair enough then. Expected regressions are better than unexpected regressions.

@JeffBezanson (Sponsor Member):

I really think this is too complex and brittle. Could we, say, restrict it to Integers? I would rather have hashing of ranges be O(n) more often than have cases like

julia> hash(Any[Date("2018-1-03"), Date("2018-1-04"), Date("2018-1-05")])
ERROR: MethodError: no method matching widen(::Type{Date})

@nalimilan (Member Author) commented Jan 4, 2018

I agree it would make sense to use the O(n) algorithm by default. The difficulty with that is that all types which can compare equal need to use the same algorithm: if we restrict the O(1) algorithm to Integer, 1:2 would not hash the same as 1.0:2.0.

EDIT: we could probably restrict the O(1) hashing to Number by default, and make it opt-in for other types.
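
As a hedged sketch of what such an opt-in could look like (the trait name is made up and this is not claimed to be what any later PR actually does):

# Hypothetical opt-in trait: only Number element types get the O(1)
# range-compatible hashing by default; everything else hashes elementwise
# unless it explicitly opts in.
hashes_like_range(::Type{<:Number}) = true
hashes_like_range(::Type) = false

# A package whose type supports ranges could then opt in with, e.g.:
# hashes_like_range(::Type{MyRangeable}) = true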

@StefanKarpinski (Sponsor Member):

The approach that could work here is to have a generic O(n) algorithm for ranges in general, which happens to collapse to an O(1) algorithm for integer ranges. I can try to show what I mean, but it won't be this week, so I propose that we leave this as is; we can then try to improve it so that it fails in fewer cases but is potentially slower in more of them.

@StefanKarpinski (Sponsor Member):

Just to clarify: this is the inverse design: it detects cases where the O(1) algorithm may be switched to, rather than using the same algorithm in different cases with different performance.

@mbauman mentioned this pull request Jan 12, 2018
@nalimilan (Member Author):

Without more details, I don't see how it could be implemented (but maybe I lack imagination). For 0.7, it wouldn't be unreasonable to restrict the O(1) algorithm to Number types, and allow custom types to opt-in if needed. The only assumption this would make is that no non-Number type should compare equal to a Number type (but that sounds like a good rule anyway?).

@nalimilan (Member Author):

#25822 restricts the O(1) algorithm to Number types to limit problems.

Labels: domain:hashing, kind:breaking (This change will break code)

Merging this pull request may close: Get rid of special-casing of ranges in == and isequal() for AbstractArrays