Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace HashDictionary with an ordered hash dictionary #13

Merged
merged 20 commits into from
Jun 11, 2020

Conversation

andyferris
Copy link
Owner

@andyferris andyferris commented Feb 24, 2020

The new implementation preserves insertion order. It may use slightly more memory overall but is much faster to iterate because the indices/values are stored (somewhat) densely (performance comparable to Vector, i.e. limited by memory bandwidth). Faster insertion/deletion/resizing due to less need to call hash (hashes are recorded). Compared to OrderedCollections.jl this implementation (a) does not mutate the object on reads, like iteration (which can't be thread-safe) and (b) may theoretically deal better with deletion without slowdowns. Implementation is somewhat inspired by Python 3.6 ordered dictionary.

This PR now includes new functions distinct and index. The constructors for HashIndices and HashDictionary has been improved.

Things to do:

  • copy optimization
  • filter! optimization
  • make sizehints safe!
  • add performance benchmarks to repository
  • ensure HashIndices input is uniquified (or throw an error?)
  • think about adding a linear probing cut-off, double-check resizing... it's good enough and pretty well tested, can revisit for speed purposes

This PR obviously leaves Dictionaries.jl with only ordered collections - so we are free to add more semantics to AbstractIndices and AbstractDictionary - things to think about include sort, sortperm, permute!, push!/pop!/splice! (these aren't collections of Pairs though).

@andyferris andyferris added the enhancement New feature or request label Feb 24, 2020
@andyferris andyferris requested a review from c42f February 24, 2020 12:39
@codecov-io
Copy link

codecov-io commented Feb 24, 2020

Codecov Report

Merging #13 into master will increase coverage by 4.88%.
The diff coverage is 64.78%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
+ Coverage   59.10%   63.99%   +4.88%     
==========================================
  Files          18       18              
  Lines         983     1322     +339     
==========================================
+ Hits          581      846     +265     
- Misses        402      476      +74     
Impacted Files Coverage Δ
src/Dictionaries.jl 100.00% <ø> (ø)
src/MappedDictionary.jl 15.00% <0.00%> (+7.30%) ⬆️
src/map.jl 25.86% <16.12%> (-4.45%) ⬇️
src/Indices.jl 71.79% <18.18%> (-11.54%) ⬇️
src/Dictionary.jl 72.50% <25.00%> (-1.86%) ⬇️
src/tokens.jl 25.39% <60.00%> (+9.71%) ⬆️
src/HashDictionary.jl 61.61% <61.22%> (+0.42%) ⬆️
src/AbstractDictionary.jl 83.25% <66.66%> (+9.22%) ⬆️
src/AbstractIndices.jl 61.11% <68.42%> (+28.18%) ⬆️
src/insertion.jl 65.03% <71.42%> (+24.73%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f92b3d4...0d34363. Read the comment docs.

The new implementation preserves insertion order. It may use slightly
more memory overal but is much faster to iterate because the
indices/values are stored densely (performance comparable to `Vector`,
i.e. limited by memory bandwidth). Faster insertion/deletion/resizing
due to less need to call `hash` (hashes are recorded).
@andyferris andyferris force-pushed the ajf/order-hash-dictionary branch 9 times, most recently from 2f42b42 to 3119467 Compare March 11, 2020 00:18
@c42f
Copy link
Collaborator

c42f commented Apr 24, 2020

Hey Andy, sorry I haven't reviewed yet, this diff is kind of epic and scary!

I assume the best strategy would be to read the entirety of HashDictionary.jl and HashIndices.jl? The diffs look pretty useless as it's a big rewrite... and I didn't know the previous code anyway.

@andyferris
Copy link
Owner Author

Yes it’s a brand new implementation of a hash map based on a python 3.6 talk I saw regarding their then-new implementation.

My first hash map so it scares me! I need to copy some tests or something.

@andyferris
Copy link
Owner Author

OK this is getting super close. Benchmark suite is working, results are mostly good. The delete operation appears to be a bit slower than for Set or Dictionaries.OldHashIndices (which impacts other operations) so that needs investigating. On the other hand, operations like filter! (to few) and reductions are like totally rediculously awesome :)