The One Billion Row Challenge in CUDA (tspeterkim.github.io)
241 points by tspeterkim 3 months ago | 74 comments



There are some good ideas for this type of problem here: https://github.com/dannyvankooten/1brc

After you deal with parsing and hashing, you are basically IO limited, so mmap helps. The C code takes less than 1.4s without touching CUDA at all. Because there is no compute to speak of beyond parsing and a hashmap, a reasonable guess is that even for an optimal CUDA implementation, kernel launches and data transfers to the GPU would add a noticeable bottleneck and make the optimal CUDA code slower than this pure C code.
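
Roughly, the mmap part boils down to something like this (a sketch under POSIX, error handling omitted - not the repo's exact code):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    // Map the whole file read-only and let the kernel's page cache serve the bytes,
    // instead of read()ing it into a user-space buffer.
    const char* map_file(const char* path, size_t* len) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        *len = (size_t)st.st_size;
        void* data = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
        madvise(data, *len, MADV_SEQUENTIAL);  // hint sequential access so the kernel prefetches
        return (const char*)data;              // workers parse disjoint slices of this buffer
    }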


The 1.4s is _after_ having the file loaded into RAM by the kernel. Because this is mostly I/O bound, it's not a fair comparison to skip the read time. If you were running on an M3 Mac you might get less than 100ms if the dataset were stored in RAM.

If you account for time loading from disk, the C implementation would be more like ~5s as reported in the blog post [1]. Speculating that their laptop's SSD may be in the 3GB/s range, perhaps there is another second or so of optimization left there (which would roughly work out to the 1.4s in-memory time).

Because you have a lot of variable-width row reads, this will be more difficult on a GPU than on a CPU.

[1] https://www.dannyvankooten.com/blog/2024/1brc/


The performance report followed the initial request: run 6 times and remove the best and worst outliers, so the mmap optimization is fair game. Agreed that the C code has room left for some additional optimization.


If we are going to consider it fair to use prior runs of the program to get the file loaded into RAM by the kernel, why stop there?

Let's say I create a "cache" where I store the min/mean/max output for each city, mmap it, and read it at least once to make sure it is in RAM. If the cache is available I simply write it to standard out. I use whatever method to compute the first run, and I persist it to disk and then mmap it. The first run could take 20 hours and gets discarded.

By technicality it might fit the rules of the original request but it isn't an interesting solution. Feel free to submit it :)


This actually doesn't fit the rules. I've designed the challenge so that disk I/O is not part of the measured runtime (initially, by relying on the fact that the first run which would pull the file into the page cache will be the slowest and thus discarded, later on by loading the file from a RAM disk). But keeping state between runs is explicitly ruled out as per the README, for the reason you describe.


Having write access to storage or spawning persistent daemons is an extra requirement, and one that is often not available in practice when evaluating contest code :-)

This is a fun project for learning CUDA and I enjoyed reading about it - I just wanted to point out that the performance tuning in this case is really about the parsing, hashing, memory transfers, and IO. Taking IO out of the picture using specialized hardware or Linux kernel caching still leaves an interesting problem to solve, and the focus should be on minimizing the memory transfers, parsing, and hashing.


Also, this uses 16 threads while the contest restricts runs to 8 cores. The benchmarks need to be run in the same environment to make a fair comparison.


The AMD Ryzen 4800U has 8 cores total, so the author follows the contest restriction. This CPU supports hyperthreading. (I’d be very interested in seeing hyperoptimized CUDA code using unlimited GPU cores, FWIW.)


Good to know. I didn’t know the contest had no limit on hyperthreading.


The 1brc contest machine had SMT disabled [0]. (Hyper-Threading is an Intel marketing name and trademark for their implementation of SMT, but the benchmark was run on an AMD CPU.)

[0] https://github.com/gunnarmorling/1brc/issues/189#issuecommen...


Ok. So it's indeed restricted to 8 cores (1 thread per core). Then the benchmark above using 16 threads was not really a fair comparison.


this is true for small datasets.

on large datasets, once loaded into GPU memory, cross-GPU shuffling with NVLink is going to be much faster than CPU to RAM.

on the H100 boxes with 8x400Gbps, IO with GDS is also pretty fast.

for truly IO-bound tasks I think a lot of GPUs beat almost anything :-)


Yes, GDS will accelerate the IO to the GPU. I’d love to see the above C code compared to hyperoptimized GPU code on the right hardware, but I don’t want to accidentally nerd snipe myself :-) The unfortunate part of this particular benchmark is that once you have the data in the right place in your hardware, there is very little compute left. The GPU code would probably have constant performance with an additional couple thousand operations on each row, whereas the CPU would slow down.

https://docs.nvidia.com/gpudirect-storage/overview-guide/ind...


This would be the code to beat. Ideally with only 8 cores but any number of cores is also very interesting.

https://github.com/gunnarmorling/1brc/discussions/710


so you'd rather nerd snipe others, gotcha ;) :D


Haha. Apologies. I hope I didn’t accidentally make anyone waste their time. If they did, I’m sure there are people interested in hiring such folks anyway, so maybe it’s time well invested in the end.


I am testing a GH200 and the speed at which you can access system memory is amazing. Assuming you have already encoded the station into a smallint, the dataset would be around 6 GB, which on such a system takes just 20 ms to transfer (I am sure about that because I'm observing a 9.5 GB transfer that takes about 33 ms right now).


> on large datasets, once loaded into GPU memory,

You're yada-yada-yadaing the best part.

If the disk can deliver the data at 1GB/s, the CPU can process the data at 2GB/s, and the GPU can process the data at 32GB/s, then in practice the CPU processes the data at 1GB/s and the GPU processes the data at 1GB/s.

(also, personally, "large dataset" is a short way to say "doesn't fit in memory". if it fits in memory it's small, if it doesn't fit in memory it's large. but that's just my opinion. I generally avoid calling something a "large" or "small" dataset because it's an overloaded term that means different things to different people.)


1 billion rows is "small dataset"?


The meaningful boundary between small data and large data is whether the whole dataset/processing is expected to fit in the RAM of a single common machine. From that perspective, 1 billion rows is a borderline case, which can be small or large depending on how large the rows are - and in this particular challenge the rows are tiny.


Just the click-stream data of four 15-year-olds chilling out in the living room while watching TikTok videos...


eh, it's just two columns. 12 or 13GB IIRC.


Try an array of FPGAs.


Why does that help with IO bandwidth?


Because every transceiver pair can do 32Gb/s or 56Gb/s and you have dozens of them.


I've never been a fan, plus good luck getting 800GB/s across.


Actually, on modern systems it’s purely compute plus memory bound. For this problem, hashmap lookup is the biggest bottleneck. Parsing takes time as well, but not as much as the hashmap.


From what I can tell the number of unique elements is pretty small. This would mean the hash map will sit in cache. Parsing is likely the bottleneck; I don't see any use of SIMD in the linked code.


The key can be up to 100 bytes. The key in the hashmap is in cache, but the key being compared comes from main memory: basically the whole file in memory is being scanned and compared. With 1 billion rows that’s about 16GB of data in memory for a 16-byte average row. That’s approaching the memory bandwidth limit for a one-second run, making it a memory-bound problem.


This really depends on the number of unique keys. If they sit in L1/L2, the bandwidth of those caches is an order of magnitude greater than main memory's.

This problem is quite a Nerd Snipe, so it's a good thing that I don't have a computer with more than 16 GB or I might end up trying it myself.


Well that's a very pessimistic C++ baseline - using iostreams. The CUDA solution presented here seems to me like a bit of an old-timey approach, treating the GPU as an accelerator. The implementation first prepares a million tasks, then submits them as kernel invocations which perform the work. The first phase already takes over two seconds, the second phase over ten seconds. The resulting runtime is 16s, 60x faster than the iostreams baseline.

This is similar to how GPU acceleration is bolted on in a ton of commercial software, you look at hotspots and then move just those over to the accelerator. For example outsourcing some linear algebra in a FEM solver. That's a sensible approach for trying to accelerate legacy software, but it's not very effective and leads to poor utilization ("copy mat, copy mat, copy mat copy copy copy mat" - Dave Cutler about GPU acceleration). I think a) you can do all of this on the GPU b) this should be limited by PCIe bandwidth for getting the file contents into VRAM. The 1BRC data set is 14 GB. So this is processing that file at about 1 GB/s using both CPU and GPU.


Yeah <iostream> has terrible performance, and also I find the use of a map rather than an unordered_map a bit suspect.


Standard library maps/unordered_maps are themselves notoriously slow anyway. A flat_hash_map from abseil or parallel-hashmaps[1] would be better.

[1] https://github.com/greg7mdp/parallel-hashmap
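
E.g. (untested sketch; the Stats struct here is just for illustration), it's close to a drop-in swap:

    #include <algorithm>
    #include <string>
    #include "absl/container/flat_hash_map.h"

    struct Stats { float min = 1e9f, max = -1e9f; double sum = 0; long long count = 0; };

    // was: std::map<std::string, Stats> stats;
    absl::flat_hash_map<std::string, Stats> stats;

    void update(const std::string& city, float temp) {
        Stats& s = stats[city];  // open addressing: no per-node allocation or pointer chasing
        s.min = std::min(s.min, temp);
        s.max = std::max(s.max, temp);
        s.sum += temp;
        ++s.count;
    }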


It's much easier to improve if the baseline is terrible.


Pretty cool. Shameless plug: My team at Anthropic is hiring people that can write accelerator kernels. Please reach out to <my username>@gmail.com if you want to make state of the art models faster :)


not looking for a job at this time, but I do this kind of work - what is the name of the team that you work on / that typically works on this?


Curious, do you guys use Triton? I was surprised to find that the PyTorch slowdown is palpable even compared to basic CUDA kernels.


qq: Would this be in CUDA?


Not sure what I can share publicly, but we use TPUs as well: https://cloud.google.com/blog/products/compute/announcing-cl...


It's funny - former Google devs (who still maintain a good relationship with their former employer) are ideally positioned to profitably take advantage of the arb of TPU over GPU.


Google should improve their abstractions until there isn’t room for that anymore, haha.


Instead of desiring atomic maps, would it work here to give each task its own map and then merge those after the parallel processing has ended? I have absolutely no experience of CUDA, so no idea how applicable that approach would be. However, it seemed pretty practical and effective when I tried a parallel implementation of this challenge on plain CPUs.


This is the possible optimization that I mention at the end of the blog - using a private map for each thread block.

The catch is that this map must fit in shared memory, which is pretty limited on all current hardware: ~100KB.

I originally thought that my map (stats array) was too big to fit into this shared memory. Now, however, I realize it can fit. It'll be interesting to see how much speedup (or not!) this optimization can bring.
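
Roughly what I have in mind (an untested sketch, not the post's actual kernel - it assumes the parsing step has already produced dictionary-encoded city ids and integer temperatures in tenths of a degree, and that NUM_CITIES bounds the number of distinct stations):

    #include <climits>

    constexpr int NUM_CITIES = 512;  // assumed upper bound on distinct stations

    struct Stats { int min; int max; unsigned int count; unsigned long long sum; };

    __global__ void aggregate(const int* city_ids, const int* temps, int n, Stats* global_stats) {
        __shared__ Stats local[NUM_CITIES];  // block-private copy; must fit in shared memory

        // Initialize the block-private copy.
        for (int c = threadIdx.x; c < NUM_CITIES; c += blockDim.x) {
            local[c] = Stats{ INT_MAX, INT_MIN, 0u, 0ull };
        }
        __syncthreads();

        // Each thread folds many rows into the cheap shared-memory copy.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
            int c = city_ids[i], t = temps[i];
            atomicMin(&local[c].min, t);
            atomicMax(&local[c].max, t);
            atomicAdd(&local[c].count, 1u);
            // Two's-complement trick: the unsigned add still yields the correct signed sum.
            atomicAdd(&local[c].sum, (unsigned long long)(long long)t);
        }
        __syncthreads();

        // One global commit per block per city, instead of one per row.
        // global_stats must be initialized to { INT_MAX, INT_MIN, 0, 0 } by the host beforehand.
        for (int c = threadIdx.x; c < NUM_CITIES; c += blockDim.x) {
            if (local[c].count == 0) continue;
            atomicMin(&global_stats[c].min, local[c].min);
            atomicMax(&global_stats[c].max, local[c].max);
            atomicAdd(&global_stats[c].count, local[c].count);
            atomicAdd(&global_stats[c].sum, local[c].sum);
        }
    }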


Looks like TFA already uses partitioning per thread.


You can definitely have thread local storage


> These offsets are obtained iteratively, stepping through the entire file buffer by the desired split size (= total file size / desired number of parts), and marking the position of a new line character:

In many cases it is better to just pass the raw byte offsets to the workers, then they can skip to the first newline and progress until the next newline after their "end offset". This way the launcher doesn't need to access the input data at all.

This can be ineffective when boundaries are not self-describing or when the minimum processable chunk is near the expected batch size (as you will have significant swings in batch size), but I think it would work quite well for this case.


By "launcher", do you mean the CUDA kernel? How can it avoid accessing the input data since it needs access to the characters based on the offsets?

I also already pass these offsets to the threads as `Part* parts`.

I also probably didn't understand your suggestion and am drawing a blank here. So pls feel free to elaborate and correct me.


He's proposing to skip the whole computation of the parts[] array and within the kernel, replace

    for (int i = 0; i < parts[bx].length; i++) {   // bx is the global thread index
      char c = buffer[parts[bx].offset-buffer_offset + i];
by something like

    long long split_size = size / num_parts;
    long long offset = bx * split_size;
    while (buffer[offset++] != '\n')
       ;
    for (int i = 0; buffer[offset+i] != '\n'; i++) {
      char c = buffer[offset + i];
(ignoring how to deal with buffer_offset for now).


  for (int i = 0; buffer[offset+i] != '\n'; i++) {
This would only process the current line, though. Here, each thread processes ~split_size bytes (multiple lines).

Even if were to read multiple lines, how would a thread know when to stop? (at which offset?)

And when it does, it should communicate where it stopped with the other threads to prevent re-reading the buffer. My brain's hurting now.


Yeah, this is a bug in their quick example code. The basic idea is that you simply give each thread a start and stop offset with simple byte division. Then each thread handles the first line that starts after its start offset up to but excluding the first line that starts after its end offset. So there will be a small amount of "overread" at the boundaries, but this is probably negligible.

    let chunk_size = buf.length / threads;
    let start_byte = chunk_size * bx;
    let end_byte = start_byte + chunk_size;

    let i = start_byte;
    if (bx > 0) {
        // Align to a row boundary: back up one byte so a row starting exactly at
        // start_byte is not skipped, then move past the first '\n'.
        i = start_byte - 1;
        while (buffer[i++] != '\n') {}
    }

    while (i < end_byte) {  // only process rows that begin inside this chunk
        let row_start = i;
        let row_end = i;
        while (buffer[row_end] != '\n') { row_end++; }
        process_row(buffer, row_start, row_end);  // row_end points at the '\n'
        i = row_end + 1;
    }
Basically you first roughly split the data with simple byte division. Then each thread aligns itself to the underlying data chunks. This alignment can be done in parallel across all threads rather than being part of a serial step that examines every byte before the parallel work starts. You need to take care that the alignment each thread does doesn't skip or duplicate any rows, but for simple data formats like this I don't think that should be a major difficulty.


I tried this out today. While it works (no pre-split step required any more), it makes the CUDA kernel run ridiculously slow. I believe it's because of the while loop:

  while (i < end_byte) {
Compared to my original solution, 50x more divergent branches are introduced (ncu profiling)!

The only difference between the two is that the for loop iterates a deterministic number of times, while this while loop runs for a number of iterations that is unknown at kernel launch time.

I admit, I don't perfectly understand the reason. But this is the most likely culprit.


It can compute both the starting offset for itself and next_offset for (bx+1)...


The query itself is a perfect hash (assuming the station is dictionary-encoded) and takes around 100 ms on a GH200 (the GPU is a single H100 with 132 SMs), with a query that's concurrency-constrained.

The same level of performance can be obtained using an Ada-class card like an L40S or an RTX 4090.

The transfer across the NVLink connecting the CPU and GPU on a GH200, after the parse and encoding of the source CSV, takes a negligible amount of time given the 500 GB/s of system memory bandwidth and the 900 GB/s interconnect between CPU and GPU.

So it's disk bandwidth that's going to limit the performance of the GPU kernel. The faster solution would be to parse the CSV with a GPU kernel using mmap and managed memory (?), encode the station into an integer or a small integer, use the min and max values to create a keyless perfect hash table for each SM to limit the contention on global memory, use 32-bit atomic operations for min, max, count and sum, and then do a final reduction on the GPU.

I don't think more than 1 modern GPU is needed for this, especially if you are on modern hardware like the GH200.

I've been running this kind of aggregate on a GH200 over 10 billion records, with the data in GPU or CPU memory, using our software (HeavyDB) for testing purposes over the last two weeks.


The 1brc benchmark has the data on a RAM disk [0]; there's no disk IO.

[0] https://github.com/gunnarmorling/1brc#evaluating-results


Great context for us GPGPU fans!

Do you think the L4 should have similar perf? Considering it has similar features and specs to the L40, and I doubt the lower 300 GB/s bandwidth would be an issue.


I have a hard time believing the basic C++ implementation took 16 minutes... I just did the most trivial non-parallel implementation in Python and it took less than 5 minutes with PyPy.


The basic Java implementation was also around 5 minutes


I think the author wanted to compare with a similar base algorithm. They should have stated that, however, instead of implying that a basic C++ algo would typically take that long.

Something else to consider is cost of cloud hardware, though it could be extrapolated from the author's results.


The Programming Massively Parallel Processors textbook has a chapter on histograms. You have basically mentioned the main trick though (privatization).

For this problem I think coarsening would also help (so everything isn’t queued up on the copy to global memory).
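
E.g. very roughly (a sketch, not the post's code - COARSEN and the already-parsed int temps are assumptions, and only a running min is shown):

    #include <climits>

    constexpr int COARSEN = 16;  // assumed coarsening factor

    // Launch with enough blocks that gridDim.x * blockDim.x * COARSEN >= n;
    // per_block_min must be initialized to INT_MAX by the host beforehand.
    __global__ void coarsened_min(const int* temps, int n, int* per_block_min) {
        // Each thread reduces up to COARSEN grid-strided elements in a register first...
        int local_min = INT_MAX;
        int stride = gridDim.x * blockDim.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for (int k = 0; k < COARSEN && i < n; k++, i += stride) {
            local_min = temps[i] < local_min ? temps[i] : local_min;
        }
        // ...so only one atomic lands on memory per thread instead of one per row.
        atomicMin(&per_block_min[blockIdx.x], local_min);
    }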


The PMPP book is great. I reread the histogram chapter after finishing the blog, and realized I could use privatization. You got me!

By coarsening, do you mean making the threads handle more file parts, and reducing the number of private copies (of histogram or stats here) to globally commit at the end?


> GPU Hash Table?

How badly would performance have suffered if you sha256'd the lines to build the map? I'm going to guess "badly"?

Maybe something like this in CUDA: https://github.com/Cyan4973/xxHash ?


So performance would increase since hashing is faster than binary-searching.

However, the problem of collisions across threads and dealing with concurrent map key insertions still remains. E.g. when two different cities produce the same hash (one at each thread), how can we atomically compare 100-byte city strings and correctly resolve the collision (using linear probing, for example - https://nosferalatu.com/SimpleGPUHashTable.html)?

Atomic operations are limited to 32-bits.
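
For reference, the insert in that linked approach boils down to something like this (rough sketch; hash_key() and kEmpty are made up here) - and it only works because the whole key fits in a single 32-bit atomicCAS:

    constexpr unsigned int kEmpty = 0xffffffffu;  // table[*].key must be initialized to kEmpty

    struct Slot { unsigned int key; int min_temp; int max_temp; int count; long long sum; };

    __device__ unsigned int hash_key(unsigned int k) {
        k ^= k >> 16; k *= 0x85ebca6bu; k ^= k >> 13;  // cheap integer mix, just for the sketch
        return k;
    }

    __device__ unsigned int insert(Slot* table, unsigned int capacity, unsigned int key) {
        unsigned int slot = hash_key(key) & (capacity - 1);  // capacity is a power of two
        while (true) {
            unsigned int prev = atomicCAS(&table[slot].key, kEmpty, key);
            if (prev == kEmpty || prev == key) {
                return slot;  // claimed an empty slot, or the key was already there
            }
            slot = (slot + 1) & (capacity - 1);  // collision: linear probe to the next slot
        }
    }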


> Atomic operations are limited to 32-bits.

I'm using 64-bit atomics at work; are you on an old version of CUDA, or are some operations only supported at 32 bits?


I misspoke. (got confused with the key limit in my link above)

Atomics work up to 128-bits (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomi...).

Regardless, it's still less than 100 bytes, which is the max length of city strings.


If the set of cities is known (as your binary search algorithm assumes), you can insert all of them first, on the CPU. That will resolve all collisions ahead of time, making the structure of the hash table essentially read-only. (Of course, you would still need atomics for the values.)


You could hash the city names using something like a djb2a algorithm?


I think this is very slow for some reason. I'll bookmark this and try it on an A100 and an H100 box, and then try multi-GPU on larger data.


Please do.

My hope with this blog was to rile up other CUDA enthusiasts. Making them wanna bring in bigger and better hardware that I don't have access to.

+ Someone in the CUDA MODE community got 6 seconds on a 4090.


Interesting approach.

I wonder if one could use a reduce operation across the cities or similar rather than atomic operations? GPUs are amazing at reduce operations.


Agreed. I tried reducing across cities first.

The problem was that the work of gathering all the temperatures for each city (before I could launch the reduction CUDA kernels) required a full parse of the input data.

My final solution would be slower than the C++ baseline since the baseline already does the full parsing anyways.


am I reading correctly that your AtomicMin stops only when CAS succeeds?

can you not stop early if the extracted value becomes smaller than val?
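
something like this is what I mean (rough sketch, not your code - a float min built on CAS that bails out as soon as the stored value is already <= val):

    __device__ float atomicMinFloat(float* addr, float val) {
        int* addr_as_int = (int*)addr;
        int old = *addr_as_int;
        // Early exit: if the value already stored is <= val there is nothing left to do.
        while (__int_as_float(old) > val) {
            int assumed = old;
            old = atomicCAS(addr_as_int, assumed, __float_as_int(val));
            if (old == assumed) break;  // our value went in
        }
        return __int_as_float(old);
    }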


also, iirc, floats' comparison is compatible with signed integers' comparison due to representation?

so just using CUDA's atomicMin for ints would be enough


Positive finite floats compare like ints. If you can have negative values, things get slightly more complicated, since they use sign-magnitude (you basically shift the sign bit to the right and use that as an XOR mask on all the other bits). If you need to care about -0.0 == +0.0 and NaN != NaN, most bets are off. :-)
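
In CUDA terms, the usual bit trick looks roughly like this (a sketch, ignoring NaN and signed zero), which lets you use the plain unsigned atomicMin/atomicMax:

    // Map float bits to an unsigned int whose ordering matches the float ordering.
    __device__ unsigned int float_to_ordered_uint(float f) {
        unsigned int u = __float_as_uint(f);
        // Negative: flip all bits (reverses their order); positive: flip only the sign bit.
        return u ^ ((u & 0x80000000u) ? 0xffffffffu : 0x80000000u);
    }

    __device__ float ordered_uint_to_float(unsigned int u) {
        // Inverse of the mapping above.
        return __uint_as_float(u ^ ((u & 0x80000000u) ? 0x80000000u : 0xffffffffu));
    }

    // Usage: keep the slot as an unsigned int, initialized to 0xffffffffu for a running min:
    //   atomicMin(&slot, float_to_ordered_uint(temp));
    //   float current_min = ordered_uint_to_float(slot);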



