java12-fundamentals-hash-map-implementation-workshop

quote

TDD - everybody knows that TDD stands for test driven development; however, people too often concentrate on the words test and development, and don't consider what the word driven really implies. For tests to drive development they must do more than just test that code performs its required functionality: they must clearly express that required functionality to the reader.

Nat Pryce and Steve Freeman, Are your tests really driving your development?

preface

  • goals of this workshop:
    • practice TDD
    • practice asking the right questions during development
    • become acquainted with the implementation of HashMap
    • chance to discuss some bitwise operations
  • exemplary solution is in the answers package

java 8 HashMap

implementation overview

  1. signed left shift operator <<
    • shifts a bit pattern to the left
    • used to compute powers of two
    • e.g. 1 << 4 = 16
  2. an unsigned right shift operator >>>
    • shifts a bit pattern to the right regardless of sign (zero bits fill in from the left)
    • e.g. A = 60 = 0011 1100 => A >>> 2 = 0000 1111 = 15
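    • a quick illustration of both operators (values in the comments are the actual results):
    
    int powerOfTwo = 1 << 4;     // 16, i.e. 2^4 - each shift left doubles the value
    int a = 60;                  // 0011 1100
    int logical = a >>> 2;       // 0000 1111 = 15
    int unsigned = -8 >>> 1;     // 2147483644 - zero bits fill in, sign is ignored
    int arithmetic = -8 >> 1;    // -4 - for contrast, >> extends the sign bit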
  3. transient Node<K,V>[] table;
    • HashMap uses writeObject and readObject to implement custom serialization rather than just letting its field be serialized normally
    • it writes the number of buckets and entries to the stream and rebuilds itself from those fields when deserialized
    • table itself is simply unnecessary in the serial form
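    • a sketch of that pattern (an illustrative class, not the real HashMap code):
    
    // keep the storage transient and serialize only what is needed to rebuild it
    class SketchMap<K, V> implements java.io.Serializable {
        private transient Object[] table = new Object[16];   // never written to the stream
        private transient int size;
    
        private void writeObject(java.io.ObjectOutputStream s) throws java.io.IOException {
            s.defaultWriteObject();
            s.writeInt(table.length);   // number of buckets
            s.writeInt(size);           // number of entries, followed by the entries themselves
        }
    
        private void readObject(java.io.ObjectInputStream s)
                throws java.io.IOException, ClassNotFoundException {
            s.defaultReadObject();
            table = new Object[s.readInt()];   // rebuild the table...
            size = s.readInt();                // ...then re-insert entries, re-hashing each key
        }
    }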
  4. hash function
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
    
    • XOR - cheapest possible way to reduce systematic loss
      • XOR has the best bit shuffling properties (on average produces 1/2 one-bits)
      • OR on average produces 3/4 one-bits
      • AND on average produces 3/4 zero-bits
    • >>> - so that the highest bits also take part in the bucket calculation: tab[(n - 1) & hash]
    • as a result, the computed index suffers fewer collisions
    h             | h (binary)                              | h % 32 | (h ^ h >>> 16) % 32
    ------------: | :-------------------------------------: | -----: | ------------------:
           65,537 | 0000 0000 0000 0001 0000 0000 0000 0001 |      1 |                   0
          131,073 | 0000 0000 0000 0010 0000 0000 0000 0001 |      1 |                   3
          262,145 | 0000 0000 0000 0100 0000 0000 0000 0001 |      1 |                   5
           524,289 | 0000 0000 0000 1000 0000 0000 0000 0001 |      1 |                   9
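    
    • the last column can be reproduced with a throwaway loop:
    
    // spread the high bits exactly like hash(), then index into 32 buckets
    for (int h : new int[]{65_537, 131_073, 262_145, 524_289}) {
        int spread = h ^ (h >>> 16);
        System.out.println(h + ": " + (h % 32) + " -> " + (spread % 32));
        // prints 1 -> 0, 1 -> 3, 1 -> 5, 1 -> 9
    }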
    
  5. a small number of collisions is virtually inevitable: https://github.com/mtumilowicz/hash-function
  6. note that instead of using a list/tree to resolve hash collisions, there is another approach: open addressing (see the sketch below)
    • open addressing basics
      • all elements are stored in the hash table itself
      • insert(k): find the slot for k
        • if empty - insert k
        • if not empty - probe to the right until an empty slot is found and insert k there
      • search(k): find the slot for k
        • if empty - not found
        • if not empty and not equal to k - keep probing to the right
        • if not empty and equal to k - found
      • delete(k): we cannot simply empty the slot, because that would introduce a gap and later searches could fail
        • mark such slots as "deleted" instead
        • we can insert an item into a deleted slot, but a search doesn't stop at a deleted slot
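    • a minimal sketch of linear probing with "deleted" markers (illustrative names; fixed capacity, no resizing, assumes the table never fills up):
    
    class OpenAddressingSet {
        private static final Object DELETED = new Object();   // tombstone marker
        private final Object[] slots = new Object[16];
    
        void insert(String k) {
            int i = indexFor(k);
            // probe right; stop at an empty slot, a reusable tombstone, or k itself
            // (a full implementation would also check k isn't already present further on)
            while (slots[i] != null && slots[i] != DELETED && !slots[i].equals(k)) {
                i = (i + 1) % slots.length;
            }
            slots[i] = k;
        }
    
        boolean search(String k) {
            int i = indexFor(k);
            while (slots[i] != null) {             // only a truly empty slot ends the search
                if (slots[i] != DELETED && slots[i].equals(k)) return true;
                i = (i + 1) % slots.length;        // tombstones don't stop the probing
            }
            return false;
        }
    
        void delete(String k) {
            int i = indexFor(k);
            while (slots[i] != null) {
                if (slots[i] != DELETED && slots[i].equals(k)) {
                    slots[i] = DELETED;            // mark, don't empty - no gaps in probe chains
                    return;
                }
                i = (i + 1) % slots.length;
            }
        }
    
        private int indexFor(String k) {
            return (k.hashCode() & 0x7fffffff) % slots.length;
        }
    }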
  7. getNode
    • tab[(n - 1) & hash]
    • the capacity has to be a power of two, so capacity - 1 in binary is all 1s
    • 16 - 1 = 15, binary representation: 1111
    • bitwise AND of any number with 1111 -> keeps only the last 4 bits of the number
    • it is equivalent to the modulo of the number by 16
    • division is usually an expensive operation
    • a bitwise operation is therefore preferred over division
    • last 4 bits will evaluate to any number from 0 to 15
    • suppose we use the size 17 instead
    • 17 - 1 = 16 which is 10000 in binary
    • bitwise AND with 16 (10000) -> loses all bits except the 5th bit from the end
    • regardless of the number, the bucket index will be either 16 or 0
    • it means a lot of collisions and poor performance
    • instead of O(1) for retrieval, you'd need O(log n)
    • in case of ConcurrentHashMap in a multithreaded environment, you'd experience a lot of contention and synchronization on those few buckets
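    • both effects are easy to check with a throwaway snippet:
    
    // power-of-two capacity: (n - 1) & hash equals hash % n for non-negative hash
    int h = 123_456;
    System.out.println(((16 - 1) & h) == (h % 16));   // true
    
    // capacity 17: the mask 16 = 10000 keeps a single bit, so only two buckets are used
    for (int x : new int[]{7, 123_456, 987_654_321}) {
        System.out.println((17 - 1) & x);             // prints 0, 0, 16
    }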
  8. resize
    • triggered when: countElementsInBuckets() > countBuckets() * LOAD_FACTOR
    • we cannot base the decision to resize on the size of a concrete bucket (for example, resizing when the number of elements in some bucket exceeds a given threshold), because for any such rule there exists a bad hash function that directs all elements into that very bucket
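    • the trigger as a self-contained check (0.75f is the default load factor of java.util.HashMap):
    
    static final float LOAD_FACTOR = 0.75f;
    
    // grow when the average bucket occupancy exceeds the load factor
    static boolean needsResize(int countElementsInBuckets, int countBuckets) {
        return countElementsInBuckets > countBuckets * LOAD_FACTOR;
    }
    
    // e.g. needsResize(13, 16) is true: 13 > 16 * 0.75 = 12, so double the bucket count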

performance

  • JEP 180: Handle Frequent HashMap Collisions with Balanced Trees: improve the performance of HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries
  • when a bucket becomes too big (TREEIFY_THRESHOLD = 8), HashMap dynamically replaces the list with a tree (using the hash code as a branching variable)
    • if two hashes are different but end up in the same bucket, one is considered bigger and goes to the right
    • if hashes are equal, HashMap compares the keys (if possible)
      • it is a good practice to use keys that are Comparable
      • if keys are not comparable, there is no performance improvement in case of heavy hash collisions - see the sketch below
  • in the cases mentioned above with heavy hash collisions, rather than a pessimistic O(n) we get a much better O(log n)
  • with a proper hash function (or even an ideal one), the LOAD_FACTOR guarantees that each bucket holds only a tiny, strictly bounded number of entries - therefore we may say that the complexity of retrieving an element from a HashMap is O(1)
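  • a quick way to provoke the treeified path (an illustrative key class, not from the JDK):
    
    // every instance hashes to one bucket; because the key is Comparable, the bucket
    // treeifies once it exceeds TREEIFY_THRESHOLD and get() stays O(log n)
    class CollidingKey implements Comparable<CollidingKey> {
        final int id;
        CollidingKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }   // deliberately terrible
        @Override public boolean equals(Object o) {
            return o instanceof CollidingKey && ((CollidingKey) o).id == id;
        }
        @Override public int compareTo(CollidingKey other) { return Integer.compare(id, other.id); }
        // drop the Comparable implementation and lookups in that bucket
        // degrade towards a linear scan under the same collisions
    }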

questions that should be asked

  1. initial capacity
  2. resize (load factor) - how to reasonably define the load factor?
    • assumption: every hash function at least makes a best attempt to distribute hash codes
    • growing based on total size keeps the collision lists at a reasonable length with a realistic, imperfect hash function
  3. what to do with null, any validations?
    • null key
    • null value
    • note that there is no significant difference between getting a key with a null value and getting a key that does not exist
  4. drop or replace an entry whose key already exists in the map?
  5. should get return an Optional?
  6. what to do in case of collision? list vs tree vs open addressing
    • any threshold?
  7. thread safe vs not thread safe
    • approaches to thread safety (global vs per-bucket synchronization)
  8. immutable or mutable?
  9. how to write tests
    • you could unit-test the implementation, but tests should be easy to throw away

digression

  • HashSet uses HashMap internally to store its elements
  • whenever we create a HashSet, a HashMap object associated with it is also created
    • elements you add to the HashSet are stored as keys of that HashMap; the value associated with those keys is a constant
  • from HashSet implementation
    private static final Object PRESENT = new Object();
    
    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }
    
    public boolean remove(Object o) {
        return map.remove(o) == PRESENT;
    }