After you deal with parsing and hashes, you are basically IO limited, so mmap helps. The C code takes less than 1.4s without any CUDA at all. Because there is no compute to speak of beyond parsing and a hashmap, a reasonable guess is that even for an optimal CUDA implementation, kernel launches and data transfers to the GPU would add enough overhead to make it slower than this pure C code.
The 1.4s is _after_ the kernel has loaded the file into RAM. Because this is mostly I/O bound, it's not a fair comparison to skip the read time. On an M3 Mac you might get under 100ms if the dataset were already in RAM.
If you account for time loading from disk, the C implementation would be more like ~5s as reported in the blog post [1]. Speculating that their laptop's SSD may be in the 3GB/s range, perhaps there is another second or so of optimization left there (which would roughly work out to the 1.4s in-memory time).
Because you have a lot of variable-width row reads, this will be more difficult on a GPU than on a CPU.
The performance report followed the initial request: run 6 times and remove the best and worst outliers, so the mmap optimization is fair game. Agreed that the C code has room left for some additional optimization.
If we are going to consider it fair that prior runs of the program left the file loaded in RAM by the kernel, why stop there?
Let's say I create a "cache" where I store the min/mean/max output for each city, mmap it, and read it at least once to make sure it is in RAM. If the cache is available I simply write it to standard out. I use whatever method to compute the first run, and I persist it to disk and then mmap it. The first run could take 20 hours and gets discarded.
By technicality it might fit the rules of the original request but it isn't an interesting solution. Feel free to submit it :)
This actually doesn't fit the rules. I've designed the challenge so that disk I/O is not part of the measured runtime (initially, by relying on the fact that the first run which would pull the file into the page cache will be the slowest and thus discarded, later on by loading the file from a RAM disk). But keeping state between runs is explicitly ruled out as per the README, for the reason you describe.
Having write access to storage or spawning persistent daemons is an extra requirement and that is often not available in practice when evaluating contest code :-)
This is a fun project for learning CUDA and I enjoyed reading about it. I just wanted to point out that the performance tuning in this case is really about the parsing, hashing, memory transfers, and IO. Taking IO out of the picture using specialized hardware or Linux kernel caching still leaves an interesting problem to solve, and the focus should be on minimizing the memory transfers, parsing, and hashing.
Also, this uses 16 threads while the contest restricts runs to 8 cores. The benchmarks need to be run in the same environment to make a fair comparison.
The AMD Ryzen 4800U has 8 cores total so the author follows the contest restriction. This CPU supports hyperthreading. (I’d be very interested in seeing hyperoptimized CUDA code using unlimited GPU cores FWIW.)
1brc in the contest had SMT disabled [0]. (Hyperthreading is an Intel marketing name and trademark for their implementation of SMT, but the benchmark was run on an AMD CPU.)
Yes, GDS will accelerate the IO to the GPU. I’d love to see the above C code compared to hyperoptimized GPU code on the right hardware, but I don’t want to accidentally nerd snipe myself :-) The unfortunate part of this particular benchmark is that once you have the data in the right place in your hardware there is very little compute left. The GPU code would probably have constant performance with an additional couple thousand operations on each row whereas CPU would slow down.
Haha, apologies. I hope I didn't accidentally make anyone waste their time. If they did, I'm sure there are people interested in hiring such people anyway, so maybe it's time well invested in the end.
I am testing a GH200 and the speed at which you can access system memory is amazing. Assuming you have already encoded the station into a smallint, the dataset would be around 6GB, which on such a system takes just 20ms to transfer (I am sure about that because I'm observing a 9.5GB transfer that takes about 33ms right now).
If the disk can deliver the data at 1GB/s, the CPU can process it at 2GB/s, and the GPU can process it at 32GB/s, then end to end the CPU pipeline runs at 1GB/s and the GPU pipeline runs at 1GB/s: both are limited by the slowest stage, the disk.
(also, personally, "large dataset" is a short way to say "doesn't fit in memory". if it fits in memory it's small, if it doesn't fit in memory it's large. but that's just my opinion. I generally avoid calling something a "large" or "small" dataset because it's an overloaded term that means different things to different people.)
The meaningful boundary between small data and large data is whether the whole dataset/processing is expected to fit in the RAM of a single common machine. From that perspective, 1 billion rows is a borderline case, which can be small or large depending on how large the rows are, and in this particular challenge the rows are tiny.
Actually on modern systems it's pure compute plus memory bound. For this problem the hashmap lookup is the biggest bottleneck. Parsing takes time as well, but not as much as the hashmap.
From what I can tell the number of unique elements is pretty small. This would mean the hash map will sit in cache. Parsing is likely the bottleneck; I don't see any use of SIMD in the linked code.
The key can be up to 100 bytes. The key stored in the hashmap is in cache, but the key being compared comes from main memory: essentially the whole file in memory is being streamed through and compared. With 1 billion rows at a 16-byte average row, that's about 16GB of data in memory. Processing that in about a second approaches the memory bandwidth limit, making this a memory-bound problem.
Well that's a very pessimistic C++ baseline - using iostreams. The CUDA solution presented here seems to me like it's a bit of an old-timey approach, treating the GPU as an accelerator. The implementation first prepares a million tasks, then submits them as kernel invocations which perform the work. The first phase already takes over two seconds, the second phase takes over ten seconds. Resulting runtime is 16s, 60x faster than iostreams.
This is similar to how GPU acceleration is bolted on in a ton of commercial software, you look at hotspots and then move just those over to the accelerator. For example outsourcing some linear algebra in a FEM solver. That's a sensible approach for trying to accelerate legacy software, but it's not very effective and leads to poor utilization ("copy mat, copy mat, copy mat copy copy copy mat" - Dave Cutler about GPU acceleration). I think a) you can do all of this on the GPU b) this should be limited by PCIe bandwidth for getting the file contents into VRAM. The 1BRC data set is 14 GB. So this is processing that file at about 1 GB/s using both CPU and GPU.
Pretty cool. Shameless plug: My team at Anthropic is hiring people that can write accelerator kernels. Please reach out to <my username>@gmail.com if you want to make state of the art models faster :)
It's funny: former Google devs who maintain a good relationship with their former employer are ideally positioned to profitably take advantage of the arbitrage of TPU over GPU.
Instead of desiring atomic maps, would it work here to give each task its own map and then merge those after the parallel processing has ended? I have absolutely no experience with CUDA, so no idea how applicable that approach would be. However, it seemed pretty practical and effective when I tried a parallel implementation of this challenge on plain CPUs.
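For the CPU version at least, the merge approach can be sketched like this (plain C; the struct layout, the scaled-integer temperatures, and the NUM_STATIONS bound are my assumptions, not from the thread):

```c
#include <limits.h>

#define NUM_STATIONS 512  /* hypothetical upper bound on distinct cities */

/* Per-worker accumulator: no atomics needed while each worker owns its copy. */
typedef struct {
    int min, max;        /* temperatures scaled to tenths of a degree */
    long long sum;
    long long count;
} Stats;

void stats_init(Stats *s, int n) {
    for (int i = 0; i < n; i++) {
        s[i].min = INT_MAX;
        s[i].max = INT_MIN;
        s[i].sum = 0;
        s[i].count = 0;
    }
}

void stats_add(Stats *s, int station, int temp) {
    if (temp < s[station].min) s[station].min = temp;
    if (temp > s[station].max) s[station].max = temp;
    s[station].sum += temp;
    s[station].count++;
}

/* Merge a worker's private map into the global one after the parallel phase. */
void stats_merge(Stats *global, const Stats *local, int n) {
    for (int i = 0; i < n; i++) {
        if (local[i].count == 0) continue;
        if (local[i].min < global[i].min) global[i].min = local[i].min;
        if (local[i].max > global[i].max) global[i].max = local[i].max;
        global[i].sum += local[i].sum;
        global[i].count += local[i].count;
    }
}
```

Because min, max, sum, and count all merge associatively, the merge order across workers doesn't matter.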
This is the possible optimization that I mention at the end of the blog - using a private map for each thread block.
The catch is that this map must fit in shared memory, which is pretty limited on all current hardware: ~100KB.
I originally thought that my map (stats array) was too big to fit into this shared memory. Now, however, I realize it can fit. It'll be interesting to see how much speedup (or not!) this optimization can bring.
> These offsets are obtained iteratively, stepping through the entire file buffer by the desired split size (= total file size / desired number of parts), and marking the position of a new line character:
In many cases it is better to just pass the raw byte offsets to the workers, then they can skip to the first newline and progress until the next newline after their "end offset". This way the launcher doesn't need to access the input data at all.
This can be ineffective when boundaries are not self-describing or when the minimum processable chunk is near the expected batch size (as you will have significant swings in batch size), but I think it would work quite well for this case.
He's proposing to skip the whole computation of the parts[] array and within the kernel, replace
    for (int i = 0; i < parts[bx].length; i++) { // bx is the global thread index
        char c = buffer[parts[bx].offset - buffer_offset + i];
by something like
    long long split_size = size / num_parts;
    long long offset = bx * split_size;
    while (buffer[offset++] != '\n')
        ;
    for (int i = 0; buffer[offset + i] != '\n'; i++) {
        char c = buffer[offset + i];
(ignoring how to deal with buffer_offset for now).
Yeah, this is a bug in their quick example code. Basically, you give each thread a start and stop offset by simple byte division. Then each thread handles the first line that starts after its start offset, up to but excluding the first line that starts after its end offset. So there will be a small amount of "overread" at the boundaries, but this is probably negligible.
let chunk_size = buf.length / threads;
let start_byte = chunk_size * bx;
let end_byte = start_byte + chunk_size;
let i = start_byte;
if (bx != 0) {
    // Skip the partial row at the start; it belongs to the previous chunk.
    // Back up one byte first, so a row starting exactly at start_byte is kept.
    i = start_byte - 1;
    while (buffer[i++] != '\n') {}
}
while (i < end_byte) {
    let row_start = i;
    let row_end = i;
    while (buffer[row_end++] != '\n') {} // assumes the buffer ends with '\n'
    process_row(buffer, row_start, row_end);
    i = row_end; // row_end already points one past the newline
}
Basically you first roughly split the data by simple byte division. Then each thread aligns itself to the underlying data rows. This alignment can be done in parallel across all threads, rather than being part of a serial step that examines every byte before the parallel work starts. You need to take care that the alignment doesn't skip or duplicate any rows, but for simple data formats like this I don't think that should be a major difficulty.
I tried this out today. While it works (no longer a pre-split step required), it makes the CUDA kernel run ridiculously slow. I believe it's because of the while loop:
while (i < end_byte) {
Comparing it to my original solution, 50X divergent branches are introduced! (ncu profiling)
The only difference between the two is that the for loop iterates a deterministic number of times, while this while loop iterates an unknown number of times (at kernel launch time).
I admit, I don't perfectly understand the reason. But this is the most likely culprit.
The query itself is a perfect hash (assuming the station is dictionary-encoded) and takes around 100ms on a GH200 (the GPU is a single H100 with 132 SMs), with a query that's concurrency constrained.
The same level of performance can be obtained using an Ada-generation card like an L40S or an RTX 4090.
The transfer across the NVLink connecting the CPU and GPU on a GH200, after parsing and encoding the source CSV, takes a negligible amount of time given the 500GB/s of system memory bandwidth and the 900GB/s interconnect between CPU and GPU.
So the problem is the disk bandwidth that's going to limit the performance of the GPU kernel. The faster solution would be to parse the CSV with a GPU kernel using mmap and managed memory (?), and encode the station into an integer or a small integer. The min and max id values can be used to create a keyless perfect hash table for each SM to limit contention on global memory, using 32-bit atomic operations for min, max, count and sum, and then do a final reduction on the GPU.
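A host-side C sketch of the keyless-perfect-hash idea (the names and the scaled-integer temperatures are my assumptions; on the device, the marked updates would be the 32-bit atomics mentioned above):

```c
#include <stdint.h>

/* Keyless perfect hash: with stations dictionary-encoded as integers in a
 * known range [min_id, max_id], the slot is just (id - min_id). No string
 * keys, no probing, no collisions. */
typedef struct {
    int32_t min, max;    /* temperature scaled to an integer, e.g. tenths */
    int64_t sum;
    int64_t count;
} Agg;

void agg_init(Agg *table, int32_t n) {
    for (int32_t i = 0; i < n; i++) {
        table[i].min = INT32_MAX;
        table[i].max = INT32_MIN;
        table[i].sum = 0;
        table[i].count = 0;
    }
}

void agg_update(Agg *table, int32_t min_id, int32_t station_id, int32_t temp) {
    Agg *a = &table[station_id - min_id];   /* the "hash" is a subtraction */
    if (temp < a->min) a->min = temp;       /* atomicMin on the device */
    if (temp > a->max) a->max = temp;       /* atomicMax on the device */
    a->sum += temp;                         /* atomicAdd on the device */
    a->count += 1;                          /* atomicAdd on the device */
}
```

With one such table per SM (in shared memory), the final reduction is just an element-wise merge of the per-SM tables.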
I don't think more than one modern GPU is needed for this, especially on modern hardware like the GH200.
I've been running this kind of aggregate on a GH200 over 10 billion records, with the data in GPU or CPU memory, using our software (HeavyDB) for testing purposes over the last two weeks.
Do you think the L4 should have similar perf? It has similar features and specs to the L40, and I doubt the lower 300 GB/s bandwidth would be an issue.
I have a hard time believing the basic C++ implementation took 16 minutes... I just did the most trivial non-parallel implementation in Python and it took less than 5 minutes with PyPy.
I think the author wanted to compare with a similar base algorithm. They should have stated that, however, instead of implying that a basic C++ algo would typically take that long.
Something else to consider is cost of cloud hardware, though it could be extrapolated from the author's results.
The Programming Massively Parallel Processors textbook has a chapter on histograms. You have basically mentioned the main trick though (privatization).
For this problem I think coarsening would also help (so everything isn't queued up on the copy to global memory).
The PMPP book is great. I reread the histogram chapter after finishing the blog, and realized I could use privatization. You got me!
By coarsening, do you mean making the threads handle more file parts, and reducing the number of private copies (of histogram or stats here) to globally commit at the end?
So performance would increase, since hashing is faster than binary search.
However, the problem of collisions across threads and concurrent map key insertions still remains. E.g. when two different cities produce the same hash (one in each thread), how can we atomically compare 100-byte city strings and correctly resolve the collision (using linear probing, for example: https://nosferalatu.com/SimpleGPUHashTable.html)?
If the set of cities is known (as your binary search algorithm assumes), you can insert all of them first, on the CPU. That will resolve all collisions ahead of time, making the structure of the hash table essentially read-only. (Of course, you would still need atomics for the values.)
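A minimal C sketch of that build-once, read-only layout (the FNV-1a hash, the table size, and the names are my choices, not from the thread):

```c
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 1024  /* power of two, assumed larger than the city count */

typedef struct {
    const char *key;     /* NULL marks an empty slot */
    int value_index;     /* slot in a separate values array; on the GPU the
                            values would still need atomic updates */
} Slot;

static unsigned hash_str(const char *s) {
    unsigned h = 2166136261u;                  /* FNV-1a */
    while (*s) h = (h ^ (unsigned char)*s++) * 16777619u;
    return h;
}

/* Build phase, done once on the CPU: linear probing resolves all collisions
 * ahead of time, so the key layout never changes afterwards. */
void table_insert(Slot *t, const char *key, int value_index) {
    unsigned i = hash_str(key) & (TABLE_SIZE - 1);
    while (t[i].key != NULL) i = (i + 1) & (TABLE_SIZE - 1);
    t[i].key = key;
    t[i].value_index = value_index;
}

/* Query phase (this is what would run in the kernel): the table is
 * read-only, so no synchronization is needed to find the slot. */
int table_lookup(const Slot *t, const char *key) {
    unsigned i = hash_str(key) & (TABLE_SIZE - 1);
    while (t[i].key != NULL) {
        if (strcmp(t[i].key, key) == 0) return t[i].value_index;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return -1;  /* only possible if the city set wasn't complete */
}
```

On the GPU the key strings would live in constant or global memory; since no thread ever writes a key, the probe sequence is deterministic and race-free.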
The problem was that the work of gathering all the temperatures for each city (before I could launch the reduction CUDA kernels) required a full parsing through the input data.
My final solution would be slower than the C++ baseline since the baseline already does the full parsing anyways.
Positive finite floats compare like ints. If you can have negative values, things get slightly more complicated, since they use sign-magnitude (you basically shift the sign bit to the right and use that as an XOR mask on all the other bits). If you need to care about -0.0 == +0.0 and NaN != NaN, most bets are off. :-)
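A small C sketch of the mapping described above (the function name is mine). It maps float bits to unsigned integers whose order matches the float order for all finite values, with the noted caveat that -0.0 sorts just below +0.0:

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to a uint32_t so that unsigned integer
 * comparison matches float comparison for all finite values. */
uint32_t float_to_ordered_uint(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* type-pun without aliasing issues */
    if (bits & 0x80000000u)
        return ~bits;                /* negative: flip all bits to reverse
                                        the sign-magnitude ordering */
    return bits | 0x80000000u;       /* positive: set the sign bit to sort
                                        above every negative value */
}
```

This is what makes radix sort and integer atomicMin/atomicMax usable on float data: transform on the way in, invert the transform on the way out.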