RFC: Simpler array hashing #26022

Merged: 8 commits into master from mb/simplerarrayhashing, Aug 2, 2018
Conversation

@mbauman (Author) commented Feb 13, 2018

This is a straw-man implementation of a simpler array hashing scheme. It's very basic and has room for further optimizations, but its intention is to provide a starting point as an alternative to #25822. In short: it hashes the size of the array, then the first three distinct elements with their linear indices, and then the last three distinct elements with their linear distance from the end of the array.

The exact scheme here is open to bikeshedding -- we could use more or fewer distinct elements. I use "distinct elements" as a way of ensuring that relatively empty sparse arrays don't all hash similarly, and I hash elements at both the beginning and end of the array because they're the simplest to get to and the final elements are the "most expensive" place to discover a difference if we have to fall back to an equality check. The most complicated part is keeping track of what you hashed in order to prevent running through the entire array twice (once forwards and once backwards).
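
For concreteness, here is a minimal self-contained sketch of the scheme described above (an illustration only, not the code in this PR; the name sketch_array_hash is made up, and it skips the bookkeeping that keeps the forward and backward passes from overlapping on small arrays):

    function sketch_array_hash(A::AbstractArray, h::UInt)
        h = hash(size(A), h)            # hash the size of the array
        isempty(A) && return h
        n = length(A)
        seen = Any[]                    # at most three values, so a plain Vector is fine
        # forward pass: first three distinct elements and their linear indices
        for i in 1:n
            v = A[i]
            if !any(x -> isequal(x, v), seen)
                push!(seen, v)
                h = hash(i => v, h)
                length(seen) == 3 && break
            end
        end
        # backward pass: last three distinct elements and their distance from the end
        empty!(seen)
        for i in n:-1:1
            v = A[i]
            if !any(x -> isequal(x, v), seen)
                push!(seen, v)
                h = hash((n - i) => v, h)
                length(seen) == 3 && break
            end
        end
        return h
    end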

@mbauman added the domain:arrays label on Feb 13, 2018
@nalimilan (Member) commented:

Thanks for this. I certainly appreciate the simplicity of this approach compared with the current one. I think we should keep in mind a third possible approach, though, which is to just drop O(1) hashing for ranges (it's not clear that's very useful in practice) and use a simple solution similar to what was implemented before #16401. As a hybrid solution, we could also continue to hash all elements for multidimensional arrays (as done both before and after #16401), and only change how vectors are hashed, since that's what is needed to hash ranges in O(1) time.

Assuming we adopt the general approach from this PR (i.e. do not hash all elements nor all differences between subsequent elements, but only a part of the information), several other quantities could be included in the hash, with the objective of making collisions less likely when modifying a single entry in the middle of the array:

  • hash the sum of all elements
  • hash the number of nonzero elements
  • hash the indices of series of nonzero elements (i.e. ranges of indices rather than individual indices)

These quantities have the advantage that they can be computed in O(1) time for (integer, at least) ranges and relatively efficiently for sparse matrices. But they require going over the full array, making the computation of hashes O(N) for general arrays while your PR is currently O(1).
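
To make the O(1)-versus-O(N) contrast concrete, here is a small illustration of my own (the helper name range_sum is made up; Base's sum already specializes on ranges in the same spirit): the sum of an integer range follows from the arithmetic-series formula without touching any element, while a generic array needs a full pass before the sum can be folded into the hash.

    # Sum of an arithmetic progression: n*(first + last)/2, computed in O(1).
    range_sum(r::AbstractRange{<:Integer}) = length(r) * (first(r) + last(r)) ÷ 2

    range_sum(1:1_000_000_000) == 500_000_000_500_000_000   # true, computed instantly
    # A generic AbstractArray has no such shortcut: sum(A) must visit every element.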

Overall I think we should identify a few typical use cases to decide which approach is the best one. Without that it's difficult to know what's the best tradeoff between computation cost and probability of collision. In general I'm a bit concerned about the fact that changing the value of entries in the middle of the array wouldn't change the hash: ideally collisions shouldn't happen in such trivial cases IMHO. But maybe that's fine in practice and the reduction in hashing cost is worth it. I'll just leave a pointer to JuliaLang/Distributed.jl#48, which AFAICT isn't a very correct use of array hashing, but which is going to become even more problematic if we make collisions much more likely (it could use a custom algorithm if needed).

Doing some research, I've found only two examples of array hashing in main languages:

# Efficient O(1) method equivalent to the O(N) AbstractArray fallback,
# which works only for ranges with regular step (RangeStepRegular)
function hash_range(r::AbstractRange, h::UInt)
    h += hashaa_seed
@mbauman (Author) commented on this diff:

Note to self: I forgot to include the hashaa_seed in the computation

# But we cannot define those here, due to bootstrapping. They're defined later in iterators.jl.
function hashndistinct(A, I, n, hx)
    _hashcheckbounds(A, I)
    seen = Set()
A member commented:

Would probably be better to use a Vector for this since the number of items is small, and it will avoid hashing them multiple times.

@mbauman (Author) replied:

Yeah, I was thinking of using a specialized 3-vector to avoid allocations entirely, but figured we could decide on the direction and exact behavior first before optimizing it further.

@JeffreySarnoff (Contributor) commented Feb 14, 2018

This is a good optimization. Here are some hash admixes worth considering.

In some manner, one from one or more of these pairs may help:

(a) the shape (dims) # reducingly hash each dim size
(α) the length (prod(dims))

(b) the initial 1st, 2nd, 7th and final 1st (end), 2nd, 7th (end-6) (or missings)
(β) the initial and final two and initial and final fn(length)th element
fn(length) = mod(maxmin(length, 61)...,) + 1

(edited to comport with reality, thank you for the note, Milan)

@nalimilan (Member) commented:

@JeffreySarnoff (0), (a) and (α) do not matter for isequal, so they definitely cannot enter the hash.

@mbauman (Author) commented Feb 14, 2018

So the options and their tradeoffs here are:

  • If we don't hash every single element, it's easy to mutate an array and have its hash remain the same. That's just fine, though, since hash collisions are allowed and will fall back to equality checks. We just don't want it to happen too often, since those equality checks could be expensive to repeatedly perform.
  • If we hash every single element, it's very expensive for sparse matrices to compute every single element. Making hashing ranges O(N) might make an otherwise cromulent structure like 1:typemax(Int) stall till the heat death of the universe if you happen to put it in a set.
  • If we do something clever, we have to be doubly clever in order to deal with heterogeneous arrays of things that might not support the cleverness or might overflow/underflow/change precision (e.g., subtraction for diff, addition for sum, promotion and precision difficulties in both cases). The only sort of cleverness I'd advocate for is the kind that only relies upon hashing and equality — like run-length encoding or distinct elements.

I'm becoming more convinced that an O(1) hashing method is the way to go here. I'd happily extend the number of elements to help assuage you.
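
For reference, a minimal sketch of the run-length-encoding flavor of "cleverness" mentioned in the last bullet above, relying only on isequal and hash (my own illustration; the name sketch_rle_hash is made up): hash each run of equal values once, together with its length. A real sparse speedup would also need a method that walks only the stored entries; this generic version still visits every element.

    function sketch_rle_hash(A::AbstractVector, h::UInt)
        h = hash(size(A), h)
        isempty(A) && return h
        runval = first(A)      # value of the current run
        runlen = 1             # length of the current run
        for i in firstindex(A)+1:lastindex(A)
            v = A[i]
            if isequal(v, runval)
                runlen += 1
            else
                h = hash(runlen => runval, h)   # emit the finished run
                runval, runlen = v, 1
            end
        end
        return hash(runlen => runval, h)        # emit the final run
    end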

@nalimilan (Member) commented:

I generally agree, the current approach is too brittle and complex. I'm still a bit sad to move to an approach which makes it so easy to get collisions. Maybe I just need some time to accept this. ;-)

Another less fragile possibility which occurred to me would be to use another property of ranges, which is less strict than computing differences between subsequent elements but also less problematic to check: the fact that ranges are monotonically increasing or decreasing. We could continue to hash all elements for vectors which are not sorted and multidimensional arrays (keeping RLE for efficiency with sparse matrices of course), but use the O(1) approach for sorted vectors (for element types which support isless). In practice, it should be quite rare that a user would hash several ordered vectors of the same length and which start and end with the same elements, so this should dramatically reduce the risk of collisions.

@JeffreySarnoff (Contributor) commented:

I agree that losing discrimination is a negative (there are algorithms that thrive on some probable assurance of different things hashing differently, so things need only be visited once to be useful with other things of a group).

Increasing the spans at the front and back that are hashed, and speckling in some of the ith quartile positions (or a few thereabouts), should do a decent job of discrimination, especially where the length >> some mod of a length-relevant prime is used to pick the span between successive hash-sampled values (or value pairs).

Or for speed, keep a sequence of indices to pluck and let them wrap on the length.

@JeffBezanson (Member) commented:

One option might be to use exponentially-spaced indices, and hash log(n) points.
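
A rough illustration of that idea (assumed name, vector-only, 1-based indexing): hash the entries at indices 1, 2, 4, 8, ..., which touches about log2(n) elements.

    function logpoints_hash(A::AbstractVector, h::UInt)
        i = 1
        while i <= length(A)
            h = hash(A[i], h)   # indices 1, 2, 4, 8, ...
            i *= 2
        end
        return h
    end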

@JeffreySarnoff (Contributor) commented:

... and interweave those indices, taking them from [1] increasing and from [end] decreasing, to disambiguate small vectors and similar arrays more successfully?

@JeffreySarnoff (Contributor) commented:

struct EntityHash{T<:Union{UInt64, Missing}}
    eachentity::UInt64  # every hashable thing has this valued
    collective::T       # an aggregative thing may have this valued
                        #    lazily done only if needed or where
                        #    the e.g. write stream does the calc
end

Things that are items or elements have sizeof(EntityHash) == sizeof(UInt64)
Things that are collectives and are more fully hashed have twice that size

@martinholters (Member) commented:

We could continue to hash all elements for vectors which are not sorted and multidimensional arrays (keeping RLE for efficiency with sparse matrices of course), but use the O(1) approach for sorted vectors (for element types which support isless).

Fails for the totally_not_five from #26034. (Not that that example seems particularly relevant, but the fact that it is so easy to define an object which breaks the approach is concerning.)

@mbauman (Author) commented Feb 16, 2018

Taking log(n) hashes is quite compelling — particularly if we increase distance between hashes as we iterate from the back of the array, since that weights the elements involved in the hash towards the "most expensive" to differentiate with isequal. The major downside I see there is that many sparse matrices will have fewer than log(n) stored values.

What if we used a combination of these approaches?

  • Given array A, consider the list of hashes hxs that contains the hashes of those elements that are distinct from their neighbor, working backwards from the end of the array. That is:
    hxs = [hash(A[end]); [hash(A[i]) for i in lastindex(A)-1:-1:firstindex(A) if hash(A[i]) != hash(A[i+1]) && A[i] != A[i+1]]]
  • hash(A) is sum(hxs[[2^i for i=0:floor(Int, log2(end))]])

Of course, you would never compute it this way, but it can be done iteratively.

This has the slightly strange effect that we compute the hash of every element for Array but don't use them all in the computation, but I think it's way better than taking the difference of every element. The advantages are: fast iteration through sparse arrays (often just alternating between a hash of zero and a hash of a stored value), and we only need to hash O(log(N)) elements in computed arrays where we know successive elements are not equal.
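
An iterative rendition of that definition, roughly (my own sketch of the scheme just described, with a made-up name; a real version would also mix in the array size): walk backwards, count the entries that differ from their right-hand neighbor, and add to the sum only those hashes whose count is a power of two.

    function sketch_hxs_hash(A::AbstractVector, h::UInt)
        isempty(A) && return h
        prev = A[end]
        s = hash(prev)          # hxs[1] = hash(A[end]); position 1 = 2^0 is always included
        count = 1               # how many entries hxs would contain so far
        next_pick = 2           # the next power-of-two position to include in the sum
        for i in lastindex(A)-1:-1:firstindex(A)
            v = A[i]
            if !isequal(v, prev)
                count += 1
                if count == next_pick
                    s += hash(v)
                    next_pick *= 2
                end
                prev = v
            end
        end
        return hash(s, h)
    end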

@mbauman (Author) commented Feb 16, 2018

The proposal in my last post still leaves a bad taste in my mouth. If we're going to do log(N), let's make it work for all array types. Here's another idea:

  • Iterate the array backwards
  • Add the hashes of the first 3 (or so) distinct values you see, stopping at index j
  • Launch into a blind log(N) hashing regime starting at index j, adding sum(hash.(A[[j - 2^i for i in 0:floor(Int, log2(end-j))]]))

@StefanKarpinski (Member) commented:

I'm a bit concerned that hashing log(N) entries in a sufficiently sparse array will tend to catch none of the non-zeros.

@mbauman (Author) commented Feb 16, 2018

Right — that's my second proposal. In short: Change this PR (which does O(1) hashing of some small number of distinct elements from the front and back) to ignore the forward iteration and augment the backwards iteration with log(N) hashes after it's hashed the small number of distinct elements.

@nalimilan (Member) commented:

That's an interesting solution. There are a few things I don't understand:

  • Why do you need to iterate backwards?
  • Why do you compare the hashes of subsequent elements before comparing them directly?
  • Why sum the hashes rather than combine them with hash?

@mbauman (Author) commented Feb 16, 2018

Great questions:

  • I iterate backwards because of the asymmetry between == and hash. In the case of a hash collision, we fall back to ==. By default and most commonly, == iteratively compares its elements walking forwards through the array. This means that — if the arrays are indeed different — it'll probably find that faster. I would bet this is why C# hashes the last 8 elements.
  • I compare hashes because I had been thinking we were already computing them — either as part of the total array hash or in a Set to hold the distinct elements. Might as well use them!
  • I sum h += hash(A[i]) instead of chaining with hash(A[i], h) in order to allow for the reuse of the hashes.

But you're right — it's probably not worth hashing elements that don't get used in the total hash, and we probably don't want to use a Set for such a limited number of elements and comparisons.
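
A tiny throwaway example of the difference between the two accumulation styles:

    h0 = zero(UInt)
    a, b = 1.5, 2.5
    h_sum   = h0 + hash(a) + hash(b)   # commutative; hash(a) and hash(b) can be reused as-is
    h_chain = hash(b, hash(a, h0))     # chained; each step depends on everything before it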

@nalimilan (Member) commented:

I think we should proceed with any reasonable solution so that the API is stabilized and BigFloat can be moved to stdlib (#25716). We can always switch to a different approach in the future since it's not breaking.

@JeffreySarnoff (Contributor) commented:

I'm a bit concerned that hashing log(N) entries in a sufficiently sparse array will tend to catch none of the non-zeros.

Unclear that log(N) is the "hatrack" .. the same would be true of sqrt(N) samples.

Milan's push to get this one done so other contingencies can move along makes sense. It all sounds non-blocking and pathway providing.

Maybe it is worthwhile to keep a short vector of linearized indices to nonzero values to provide informed indirection. 32 or 64 Int64s could do that for a sparse array of 10 trillion elements. If the data structure knows the index of its first and its final nonzero entries, two of those indices could be rewritten when that internal state is rewritten. Two can get two more (the next innermost) and the structure may find that useful too, giving the hash at worst 4 nonzero values and their linearized indices -- 8 nonzero Ints to hash, or better.

@nalimilan (Member) commented:

@mbauman Do you plan to finish this, or do you need help?

@JeffBezanson added this to the 1.0.x milestone on May 30, 2018
@mbauman (Author) commented Jun 29, 2018

Yes, let's resurrect this. I have a WIP branch locally where I started implementing #26022 (comment) a while ago, but I stalled on it (and then forgot about it) because I didn't like how hard it was to either implement or describe. I'll take a second look this afternoon, and see if I can find some simplifications.

@StefanKarpinski (Member) commented:

Here's another thought: hash arrays in such a manner that if an array approximates a range, it hashes like that range, and just take the hit of those two objects being in the same bucket. After all, what are the chances that you're hashing both arrays and ranges together and have an array that is almost equal to a range but not quite exactly equal to it? That's the only bad case with such an approach.

@oscardssmith (Member) commented:

Won't that fail because any bucket will have an edge with the same precision problems that you were trying to avoid?

@mbauman (Author) commented Jul 10, 2018

Ok, I've fixed the type instability (#15276), and I'm pretty happy with this now. I've done some cursory testing of the performance impacts here by filling and emptying dictionaries of arrays — overall things look pretty good but note that this can cause dictionaries to change algorithmic complexity. I contrived the last two cases such that they're hash collisions but unequal; this is a worst case and leads to O(length(A)*length(dict)^2) performance. Of course, it's also possible to construct a set of arrays that will all have colliding hashes with the current system, too — it's just a bit harder to do. I've flagged the asymptotically bad cases with an asterisk in the table below.

keys                                                                        fill_master  empty_master  fill_pr   empty_pr
[(1:10) .- i for i in 1:100000]                                             0.01         0.008         0.04      0.03
[(1:2^60) .- i for i in 1:100000]                                           0.01         0.005         0.1       0.07
[(1:10)./i for i in 1:100000]                                               0.08         0.04          0.05      0.03
[(1:2^60)./i for i in 1:100000]                                             ~Inf*        ~Inf*         0.4       0.2
[rand(1000, 100) for _ in 1:100]                                            0.1          0.07          0.00007   0.0001
[rand(1000, 100) for _ in 1:100000]                                         100          70            0.3       0.2
[Array{Any}(rand(1000,100)) for _ in 1:100]                                 0.9          0.4           0.0004    0.0006
[Array{Any}(rand(1000,100)) for _ in 1:1000]                                7            4             0.006     0.003
A = rand(1000,100); [setindex!(copy(A), i, length(A)-20) for i in 1:100]    0.1          0.07          1.1*      0.0002
A = rand(1000,100); [setindex!(copy(A), i, length(A)-20) for i in 1:1000]   2            0.7           100*      0.007

@vtjnash (Member) commented Jul 10, 2018

O(length(A)*length(dict)^2) performance

I think this is usually dealt with by using a more resilient dictionary algorithm (this is the problem case for the linear probing scheme we currently use) or a more tailored data-structure for the problem (such as a Trie). I'm surprised it's quite that bad though – typically the goal is to have the worst case for hash collision be to turn the lookup into a linear-scan (without the square power). Where does the ^2 power derive from?

i = last(I)
v1 = A[i]
h = hash(i=>v1, h)
i = let v1=v1; findprev(x->!isequal(x, v1), A, i); end
A member commented:

Possibly should use Ref{eltype(A)}(v1) here, or just open-code the loop (to make heterogeneous arrays fast). The same would apply to h = hash(i=>v1, h) though (vs. writing it as h = hash(i, hash(v1, h))), so I'm not sure of the importance of it.

@mbauman (Author) replied:

I added a simple type-unstable case to the matrix above — this is already smoking the status quo (largely by simply touching fewer unstable elements). I get your point about constructing the Pairs, but I don't fully understand the Ref in the closure. In any case, this should be a relatively small constant cost.

A member commented:

Maybe add a comment explaining why let is used.

@mbauman (Author) commented Jul 10, 2018

Where does the ^2 power derive from?

This is the complexity for inserting the whole set of hash-equal items into the dictionary. For each item, it must compare == against all existing items in the dictionary.
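
Spelling that count out with a rough back-of-the-envelope model (my own, not a measurement; the helper name is made up):

    # m mutually colliding but unequal arrays of length L: inserting the k-th one
    # compares it (via ==, an O(L) scan here) against the k-1 already stored.
    collision_fill_cost(m, L) = L * m * (m - 1) ÷ 2
    collision_fill_cost(1000, 100_000)   # ≈ 5e10, in line with the ~100 fill_pr in the table's last row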

@nalimilan (Member) commented:

Thanks! The code and the benchmarks look good to me. Of course it's somewhat slower for ranges, and much slower when there are collisions, as expected. But the main question which is hard to address is whether collisions are frequent in practice. If that's really the case (which isn't very likely), an ideal approach would maybe be to have Dict use two hashes: this one as a first default pass, and a systematic hashing of all elements as soon as a collision happens (ignored if ranges are involved). If collisions are rare, this PR should make things quite a bit faster since far fewer elements are hashed.

BTW, if this solution is retained, we'll need to do something about JuliaLang/Distributed.jl#48. Maybe just use the old algorithm there, to limit the risk that the array wouldn't be updated because of collisions (but of course it's not completely correct either).

fibskip = prevfibskip = oneunit(j)
while j > fibskip
    j -= fibskip
    h = hash(A[J[j]], h)
A member commented:

@inbounds would be OK here right?

Also, have you considered hashing the first entry which differs from the previous/next one starting from J[j]? That would improve detection of differences between sparse matrices. I'm concerned that hashing the last distinct values isn't enough.

@mbauman (Author) replied:

Sure, this could be @inbounds but I wouldn't expect it to make a difference in performance — profiling hashing a Float64 array shows checkbounds accounts for 8 samples out of ~5000. I'm inclined to leave it.

Also, have you considered hashing the first entry which differs from the previous/next one

Yes, something along those lines is compelling, but I had avoided it due to the challenge of efficiently skipping linear offsets across cartesian indices. Now that I have coded that computation here, though, perhaps it wouldn't be so bad. I'll see if it can be done reasonably.

…ments

    Goal: Hash approximately log(N) entries with a higher density of hashed elements
    weighted towards the end and special consideration for repeated values. Colliding
    hashes will often subsequently be compared by equality -- and equality between arrays
    works elementwise forwards and is short-circuiting. This means that a collision
    between arrays that differ by elements at the beginning is cheaper than one where the
    difference is towards the end. Furthermore, blindly choosing log(N) entries from a
    sparse array will likely only choose the same element repeatedly (zero in this case).

    To achieve this, we work backwards, starting by hashing the last element of the
    array. After hashing each element, we skip the next `fibskip` elements, where
    `fibskip` is pulled from the Fibonacci sequence -- Fibonacci was chosen as a simple
    ~O(log(N)) algorithm that ensures we don't hit a common divisor of a dimension and
    only end up hashing one slice of the array (as might happen with powers of two).
    Finally, we find the next distinct value from the one we just hashed.
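
A greatly simplified, vector-only sketch of that strategy (illustration only, not the actual implementation in this PR; it omits the keys/linear-index handling for multidimensional arrays and the mod-4096 gating discussed below, and the function name is made up):

    function sketch_fib_hash(A::AbstractVector, h::UInt)
        h = hash(size(A), h)
        isempty(A) && return h
        fibskip, prevfibskip = 1, 0
        j = lastindex(A)
        while true
            v = A[j]
            h = hash(j => v, h)            # hash index => value pairs for extra structure
            j -= fibskip                   # skip backwards by the current Fibonacci number
            fibskip, prevfibskip = fibskip + prevfibskip, fibskip
            j < firstindex(A) && break
            # move to the previous entry whose value differs from the one just hashed
            j = findprev(x -> !isequal(x, v), A, j)
            j === nothing && break
        end
        return h
    end
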
@mbauman (Author) commented Jul 11, 2018

Thanks @nalimilan — I like the design you proposed in #26022 (comment) much better.

fibskip, prevfibskip = fibskip + prevfibskip, fibskip

# Find a key index with a value distinct from `elt` -- might be `keyidx` itself
keyidx = findprev(!isequal(elt), A, keyidx)
A member commented:

So IIUC this finds the first entry after keyidx which differs from the last hashed element. That's a bit different from what I had in mind: I was thinking about finding the first element which differs from the one at keyidx. I'm not sure it's really better, but the idea was that since in a sparse array it's likely that keyidx hits a structural zero, looking for the previous distinct element makes it likely you'll hash a non-zero entry. With your approach, if you hash a zero the first time, you will look for the previous non-zero entry the next time, but if you hit a zero entry the time after that, you'll happily hash it; so you'll end up hashing a zero half of the time, right?

@mbauman (Author) replied:

Yes, that's exactly right. I think it makes the behavior a little more robust — otherwise the hashes of sparse arrays with a nonzero last element will more likely hash the same. I also think it's most likely that diagonals of sparse matrices are filled.

That said, this is now clearly not hashing enough elements. The test failures are from four-element arrays colliding. Gotta slow down the exponential a little bit.

The member replied:

Yes, that's exactly right. I think it makes the behavior a little more robust — otherwise the hashes of sparse arrays with a nonzero last element will more likely hash the same. I also think it's most likely that diagonals of sparse matrices are filled.

Sorry, I'm not sure I follow. Could you elaborate?

That said, this is now clearly not hashing enough elements. The test failures are from four-element arrays colliding. Gotta slow down the exponential a little bit.

Agreed.

The member added:

BTW, why use findprev, which requires all this keys vs. linear indices dance, instead of a plain loop? I imagine one reason could be that findprev could have a specialized method for sparse arrays which would skip empty ranges, but currently that's not the case. Is that what you have in mind?

@mbauman (Author) replied:

Ah, I misunderstood your comment at first — I thought you were suggesting only unique'ing against the very first element. Now I understand that you mean to access the value after each skip, and then hash the next element that's different from it.

There are two reasons to do the keys vs. linear indices dance: one is findprev, but the other is that I want to hash index-value pairs to add a bit more information about the structure of the array. And of course, we cannot introduce a hashing difference between IndexLinear and IndexCartesian arrays. The fact that we then also allow arrays to opt into optimizations via findprev is a nice bonus, especially since it's going to be a pain to completely re-write your own hash optimization that has exactly the same behavior.

@mbauman (Author) commented Jul 13, 2018

Ok, I've slowed down that exponential — by quite a bit. I now only grab the next number in the Fibonacci sequence after every 4096 hashes. I chose this number by considering:

  • The large anticipated penalty for a hash collision. I wanted to maximize the size of the array that gets 100% of its elements hashed, especially in light of JuliaLang/Distributed.jl#48 ("remotecall_fetch vulnerable to hash collisions?").
    • Every element in arrays smaller than length 4096 (64x64 or 8x8x8x8) is included in the hash computation.
    • For an array of size 8192 with no sequentially repeated elements, 75% of its elements are hashed: every other element is hashed in the first half of the array, and every element is hashed in the second half.
  • The time it takes to hash ranges with typemax(Int) elements: they used to hash instantly, so I wanted to keep this at least somewhat tractable. With this choice on my beefy machine, it takes ~0.5 s to hash (1:typemax(Int64))/pi, and ~10 ms to hash 1:typemax(Int64).
    • For an array of a billion elements with no repeats, 99,915 elements are hashed (10^5/10^9 — 0.01%)
    • For an array of typemax(Int) elements with no repeats, 276,303 elements are hashed (10^5.5/10^19).

This is the balance point that seems reasonable to me — a power of two just because it's slightly cheaper to perform the mod.

@nalimilan (Member) commented Jul 14, 2018

Makes sense. I wonder whether we couldn't find a function which increases slowly for common sizes, but much faster when sizes get really large. Or maybe we could just apply an arbitrary threshold beyond which we stop hashing entries? A Vector{UInt8} with typemax(Int) elements takes 8e6 TB, which hardware won't be able to handle for decades or centuries. It wouldn't be terrible if we stopped at typemax(Int)/1e6 elements.

EDIT: the goal being to ensure even the largest possible range doesn't take one second to hash.

# entirely-distinct 8000-element array will have ~75% of its elements hashed,
# with every other element hashed in the first half of the array. At the same
# time, hashing a `typemax(Int64)`-length Float64 range takes about a second.
if mod(n, 4096) == 0
A member commented:

Since performance is important here, why not replace the mod with explicit bit twiddling? In this case I would suggest replacing this line with:

if (Int==Int64 && n<<52==0) || n<<20==0

That's because 4096 = 2^12 and 52 = 64 - 12 and 20 = 32 - 12. Also the Int==Int64 part will get inlined so this will be effectively replaced by either n<<52==0 or n<<20==0.

Another member replied:

If you use rem(n, 4096) == 0 the compiler takes care of the rest.

The first member replied:

I always forget about the rem and %
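
For what it's worth, all the spellings agree for the nonnegative counters used here; a quick throwaway check:

    for n in (1, 4095, 4096, 8192, 12345)
        a = mod(n, 4096) == 0
        b = rem(n, 4096) == 0
        c = (n & 4095) == 0                              # 4095 = 2^12 - 1 masks the low 12 bits
        d = Int == Int64 ? n << 52 == 0 : n << 20 == 0   # the shift spelling suggested above
        @assert a == b == c == d
    end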

@JeffBezanson (Member) commented:

Bump. Could we add the test cases from #27865, #26011, and #16401 (comment) and merge this?

@mbauman (Author) commented Jul 26, 2018

Ok, test cases pushed. There are, of course, many other schemes that we could dream up here… but triage was in support of merging this as it stands.

To help avoid JuliaLang/Distributed.jl#48, I'm putting together another PR that will basically make Distributed use its own hashing system — the constraints are a little different there since we don't need to worry about hash equality across different types. I don't think the two need to be merged at the same time, but I should have that ready to go by tomorrow.

@JeffBezanson merged commit b0bf91e into master on Aug 2, 2018
@JeffBezanson deleted the mb/simplerarrayhashing branch on August 2, 2018
KristofferC pushed a commit that referenced this pull request Feb 11, 2019

Fixes #27865 and fixes #26011.

Fixes #26034
@nalimilan mentioned this pull request on Oct 9, 2023