Building a faster hash table for high performance SQL joins

senderista · 2023-12-20T18:26:23 1703096783

Since the blog post mentioned a PR to replace linear probing with Robin Hood, I just wanted to mention that I found bidirectional linear probing to outperform Robin Hood across the board in my Java integer set benchmarks:

https://github.com/senderista/hashtable-benchmarks/blob/mast...

https://github.com/senderista/hashtable-benchmarks/wiki/64-b...

mlochbaum · 2023-12-20T21:48:24 1703108904

Worth pointing out that this can depend a lot more on fiddly details than you might expect. In particular, you're dealing with a small fixed width allowing the hash to be stored in the table instead of the key. The article emphasizes variable-length keys, and I don't see any specialization on key sizes (if 4- and 8-byte keys aren't common then this makes sense; if they are then I'd expect dedicated table code for those sizes to be valuable). And set lookups are also just a bit different from value lookups. I think these cases are different enough that I have no idea if the results would carry over, although I can see how the bidirectional approach would reduce probing more than RH which seems good.

...and since I've done a lot of work with Robin Hood on small-key lookups, I can point out some little tweaks that have made a big difference for me. I have 8-byte lookups at just over 3ns/lookup[0], albeit at a very low load factor, typically <50%. A key step was to use the maximum possible hash as a sentinel value, handling it specially in case it shows up in the data. This way, instead of probing until finding an empty bucket or greater hash, probing just finds the first slot that's greater than or equal to the requested key's hash. So the lookup code[1] is very simple (the rest, not so much). The while loop is only needed on a hash collision, so at a low load factor a lookup is effectively branchless. However, these choices are specialized for a batched search where the number of insertions never has to be higher than the number of searches, and all the insertions can be done first. And focused on small-ish (under a million entries) tables.

[0] https://mlochbaum.github.io/bencharray/pages/search.html

[1] https://github.com/dzaima/CBQN/blob/5c7ab3f/src/singeli/src/...

senderista · 2023-12-20T22:14:39 1703110479

Thanks for the links; the BQN impl looks really interesting. I believe TFA deals with only hash codes and offsets in the hash table proper (keys and values are stored separately in a dynamic array), so fixed-width keys/values still apply. It's true that you can't use keys interchangeably with hash codes for variable-length keys like I do for integer keys, but I don't expect that to affect the relative performance of RH vs. BLP. (I'm curious how they handle colliding hash codes; 32-bit hashes mean you have a ~50% probability of at least one collision at 2^16 keys, which isn't much.)

mlochbaum · 2023-12-20T22:41:48 1703112108

Looks like full keys are always compared if hash codes test equal, which is what I'd expect. For example: https://github.com/questdb/questdb/blob/master/core/src/main...

puzpuzpuz-hn · 2023-12-21T08:01:55 1703145715

That's correct. In practice, there is an insignificant amount of hash collisions, so false comparisons are extremely rare.

And thanks for sharing your experience with RH and the links!

jerrinot · 2023-12-20T19:19:10 1703099950

A QuestDB engineer here: These are cool benchmarks! The idea to try Robin Hood probing came to me after receiving some feedback on Reddit. I ran initial experiments, and the results were promising, leading to its integration into our codebase. Thank you so much for sharing your repository. Perhaps one day we'll explore bidirectional probing as well!

A snapshot of my happiness after running first experiments with Robin Hood: https://twitter.com/jerrinot/status/1730147245285150743 :)

brandonasuncion · 2023-12-20T22:47:47 1703112467

Hi there!

I made the initial suggestion to look into Robin Hood hashing when it was first posted on Reddit.

Glad to see it make its way into the repo!

nhourcard · 2023-12-20T23:03:05 1703113385

indeed! thank you for that :)

apavlo · 2023-12-21T00:06:12 1703117172

> just wanted to mention that I found bidirectional linear probing to outperform Robin Hood across the board in my Java integer set benchmarks

Research results from the last five years shows that Robin Hood hashing performs better than the other approaches under the right conditions. See this eval paper:

https://15721.courses.cs.cmu.edu/spring2023/papers/11-hashjo...

mlochbaum · 2023-12-21T00:49:33 1703119773

Bidirectional isn't benchmarked here, and only mentioned once offhand:

> For example, we could already start searching for elements at the slot with expected (average) displacement from their perfect slot and probe bidirectional from there. In practice, this is not very efficient due to high branch misprediction rates and/or unfriendly access pattern.

I think this indicates a regular Robin Hood insertion and modified search, which doesn't sound that similar to Amble and Knuth's method. And anyway the relative costs of mispredictions and cache misses vary wildly based on workflow (paper studies 8-byte keys only). The paper also doesn't present Robin Hood as a clear winner, which is how I interpreted your comment. It's shown as one of five suggestions in the decision graph at the end, and only recommended for load factors between 50% and 80% among other conditions.

Edit: And the paper is from 2015, not the last five years. Is this the right link?

puzpuzpuz-hn · 2023-12-20T19:20:32 1703100032

Thanks! We're still benchmarking Robin Hood hashing and are open to further experiments. The benchmarks look promising.

hawk_ · 2023-12-20T21:53:55 1703109235

Can bidirectional linear probing be used for any key type? Or do the keys need to be of some integral type?

senderista · 2023-12-20T22:20:14 1703110814

The keys-as-hash codes trick only works for fixed-width integers; otherwise you have to explicitly store the hash codes. I assumed unique hash codes in my implementation, but I think you could easily adapt the algorithm to allow duplicate hash codes. Tolerating hash code collisions would avoid having to increase hash code size (for collisions in their case, you'd just need to probe multiple offsets in the key/value array).

mlochbaum · 2023-12-20T22:07:24 1703110044

Since the hashes are stored, you could order by hash. This would leave keys with the same hash unordered, so if you find an entry with equal hash but unequal key you have to keep probing, but that only matters on a full-hash collision.

gavinray · 2023-12-20T18:05:04 1703095504

I always enjoy reading stuff written by Andrey, he's a brilliant fellow for sure.

Can highly recommend his personal blog as well: https://puzpuzpuz.dev/

puzpuzpuz-hn · 2023-12-20T19:24:22 1703100262

Thanks, Gavin, I'm pleased to hear that. And thanks for recommending my blog!

_a_a_a_ · 2023-12-20T18:24:50 1703096690

From article

"Imagine that we run this query over a few hundred million rows. This means at least a few hundred million hash table operations. As you might imagine, a slow hash table would make for a slower query. A faster hash table? Faster queries!"

I'll read the article properly after this, this is just a quick skim, but I can't see this quote can be correct. Unless I'm missing something, hashing function is fast compared to random bouncing around inside ram – very much faster then random memory accesses. So I can't see how it make a difference.

Okay, I'll read the article now…

Edit:

"If you insert "John" and then "Jane" string keys into a FastMap, then that would later become the iteration order. While it doesn't sound like a big deal for most applications, this guarantee is important in the database world.

If the underlying table data or index-based access returns sorted data, then we may want to keep the order to avoid having to sort the result set. This is helpful in case of a query with an ORDER BY clause. Performance-wise, direct iteration over the heap is also beneficial as it means sequential memory access."

but "...if the underlying table data or index-based access returns sorted data..." Then you've got sorted data, in which case use a merge join instead of a hash join surely.

puzpuzpuz-hn · 2023-12-20T19:34:17 1703100857

> Unless I'm missing something, hashing function is fast compared to random bouncing around inside ram – very much faster then random memory accesses. So I can't see how it make a difference.

In a GROUP BY, you may have a few hundred million rows, but only a few hundred groups within them. A slow function would slow down things dramatically in that case since the hash table remain small and data access is potentially linear.

> Then you've got sorted data, in which case use a merge join instead of a hash join surely.

This property is beneficial for GROUP BY which includes a timestamp or a function over timestamp. QuestDB organizes data sorted by time, so relying on insertion order may help to avoid redundant sorting if there is an ORDER BY clause with the timestamp column.

As for merge join, we also use it in ASOF join: https://questdb.io/docs/reference/sql/join/#asof-join

_a_a_a_ · 2023-12-21T14:37:19 1703169439

> In a GROUP BY, you may have a few hundred million rows, but only a few hundred groups within them

ISWYM although that is rather a specific case. For your purposes though it may be a common case, I don't know.

> QuestDB organizes data sorted by time, so relying on insertion order may help to avoid redundant sorting if there is an ORDER BY clause with the timestamp column.

If data is already sorted and you have an 'order by' then just use the data directly – bingo, instant merge join, no hash table needed.

babol · 2023-12-21T22:13:10 1703196790

> > QuestDB organizes data sorted by time, so relying on insertion order may help to avoid redundant sorting if there is an ORDER BY clause with the timestamp column.

> If data is already sorted and you have an 'order by' then just use the data directly – bingo, instant merge join, no hash table needed.

I reckon keeping data on heap in insertion order isn't that useful for joins because hash table is used for lookups while iterating the other table (so the main table determines output order). Where it could help is e.g. storing results of GROUP BY. For query such as:

SELECT timestamp, key, sum(value) from data GROUP BY timestamp, key order by timestamp

if data table stores data ordered by timestamp and hash table maintains insertion order then sorting is not required after aggregating all rows because iterating heap produces the right order.

rkerno · 2023-12-20T20:04:56 1703102696

Hi, I'm curious how you deal with the potential for hash collisions across a large data set - is that a post-join check?

puzpuzpuz-hn · 2023-12-21T08:05:12 1703145912

Hi, if you're asking about the hash table itself, then currently we use linear probing, i.e. k/v pairs with a collision are inserted sequentially starting with the hash%capacity index.

pixelpoet · 2023-12-20T18:00:16 1703095216

Serious question, if performance is the lynchpin, why write it in Java?

Especially considering they use unsafe "heavily", for big joins they could easily just call out to some native code if the surrounding code reaaaaally must be Java (again, why?). It's the worst of both worlds using unsafe Java: you don't get native speed, there's loads of memory overhead from everything being an Object (besides the rest of the VM stuff), and get to "enjoy" GC pauses in the middle of your hot loops, and with fewer safety guarantees than something like Rust.

lukax · 2023-12-20T18:18:26 1703096306

They started with Java and tried to use as little GC as possible and started writing hot loops in C, C++ and Rust. Logic is in Java, other stuff is mostly native.

https://questdb.io/blog/leveraging-rust-in-our-high-performa...

dimgl · 2023-12-20T19:30:43 1703100643

Interestingly I recently saw an MQ broker written in Java and it seemed to have some pretty impressive performance. I'm not a huge fan of Java's DX but I have to admit, that's some serious performance. Can't remember the name but it was on the homepage this week I believe.

winrid · 2023-12-20T23:43:21 1703115801

Isn't javas DX pretty great? I mean not the huge XML heavy frameworks but Java itself and the IDEs work well...

dimgl · 2023-12-21T19:32:29 1703187149

Maybe it's inexperience but I never want to deal with Maven/Gradle again.

winrid · 2023-12-22T06:29:37 1703226577

Gradle's upgrades can be pretty terrible. I don't know why they keep changing it.

bluestreak · 2023-12-20T18:16:50 1703096210

this is THE number one question being asked about QuestDB :) There is a thread that might help https://news.ycombinator.com/item?id=37557880

gavinray · 2023-12-20T18:36:12 1703097372

In the most non-inflammatory way possible: I am not sure I'm convinced of the performance benefits of crossing a JNI boundary to invoke a method in another language, versus letting the JVM JIT and optimize native JVM code.

Would be interesting to see benchmarks here, do you know if there are any public ones available, specific to QuestDB's interop code?

bluestreak · 2023-12-20T19:13:43 1703099623

In certain situations, crossing the JNI boundary can be advantageous. When data resides in "native" memory, outside the Java heap, the coordination of concurrent execution logic can be handled in Java, while the actual execution occurs in C++ or Rust. In this context, the negligible penalty for crossing the JNI boundary once per 10 million rows pales in comparison to the substantial benefits achievable through C++ optimization or CPU-specific assembly.

pixelpoet · 2023-12-20T18:22:15 1703096535

I see... but that seems a little weak considering it's a funded product, the first adjective they use to describe it is "fast", and good old C++ would totally slay it. The author has a C++ background, maybe he could spend an afternoon trying that?

I couldn't even try to count the number of great posts I've read about fast hash tables from e.g. Paul Khuong alone...

toasted-subs · 2023-12-20T19:16:44 1703099804

Yeah, I genuinely don't understand why so many people reach for Java for everything.

marginalia_nu · 2023-12-20T19:32:25 1703100745

It's a very well rounded language is why, and whatever papercuts exist are either on the way out in newer/future java versions, or made up for by the tooling ecosystem.