Skip to content

davidvmckay/Fast1BRC

 
 

Repository files navigation

1️⃣🐝🏎️ The One Billion Row Challenge - .NET Edition

The One Billion Row Challenge (1BRC Original Java Challenge) is a fun exploration of how far modern .NET can be pushed for aggregating one billion rows from a text file. Grab all your (virtual) threads, reach out to SIMD, optimize your GC, or pull any other trick, and create the fastest implementation for solving this task!

Aggregated results for C#/F# at https://github.com/praeclarum/1brc

Fast1BRC is one of the fastest 😅 implementation in the .NET wild west. 🚀

Techniques used

  • Multiple threads and some SIMD
    • SIMD used here mostly for finding the index of ; separating the city name with the temperature
    • We can then keep the full city name in a single Vector256<byte> register (when the city name is <= 32)
    • For the threads, I'm keeping the main thread busy processing the last block
  • On Windows, no memory mapped file but RandomAccess reopening the same handle per thread
    • As I discovered that it is lowering OS contention on Windows
    • On Linux, it seems that it's less an issue, performance being relatively the same
    • On Mac, memory mapped file seem to perform much better
  • Initially, I implemented the dictionary to store the correspondance between city names and counters with a FNV-1A 64 bit hashing aligned on 8 bytes boundary, but I was told that this is not accepted as discussed here, so I change the implementation to 2 dictionaries:
    • One with a key of 32 bytes - implemented with Vector256<byte> that covers the default dataset (names below 32 chars)
    • One with a key of 128 bytes to support the limit of city names (100 characters)
    • Both are implemented with Vector256<byte>
    • Cache entries are aligned on 64 bytes to allow better cache usage
    • Dictionary implementation is taken from .NET BCL but reimplemented with unsafe and with using pointers directly instead of indices
    • For the 32 bytes key, the implementation inline things directly so that we maximize the codegen with everything being kept in registers.
    • For the hash of the dictionary (both 32 bytes and 128 bytes key), I'm hashing the first 16 bytes by XORing the 2 first ulong with an intermediate multiplication by 397 for the first ulong (prime number that gives good results). It is enough simple and efficient to not have any collisions.
  • No particular tricks for parsing the temperature, apart assuming that there is only 1 digit after the ..
    • When I changed the code to not parse the temperature, it was not changing meaningfully the results.

Results

Benchmark performed on 3 different machines with a different combination of OS, with the following top libraries:

Results

Some comments:

  • Results are varying a lot! 📊
  • My solution comes really really close 🥈 to Nietras solution (5% on average), but his solution is probably slightly more consistent and overall winning! 🥇
    • I discovered that it can be easy to mess things with a slight change that could work on one machine/platform but not work on another. Process/Threading priority tweaking are super sensitive and I stopped tweaking them in the end and I'm using the default. 😅
  • The results vary vastly between HW / OS 💾
    • The first aspect is the performance of the M2/SSD disk access
    • The second aspect is CPU performance (number of cores, caches, clock...etc.)
    • When looking at a profiler, on my Windows machine, the time passed in IO is almost equal to CPU.
    • Some machines could give very different results (e.g buybackoff own results on his repo)

Build

You need to have .NET 8 SDK installed

dotnet publish -c Release -r win-x64
.\bin\Release\net8.0\win-x64\publish\Fast1BRC measurements.txt

To fully test, you need to generate measurements.txt, easier on Ubuntu/MacOS:

License

This software is released under the BSD-2-Clause license.

Author

Alexandre Mutel aka xoofx.

Releases

No releases published

Packages

No packages published

Languages

  • C# 92.9%
  • PowerShell 7.1%