-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DeviceSegmentedRadixSort performance issue #7
Comments
This algorithm has not been optimized yet just like various other algorithms, as functional completeness had a higher priority. So no need to benchmark before the final delivery. By remaining on Master, your code will become faster automatically. |
For very short and fixed-size segmented segmented radix sort may not be the best choice. I'd recommend using Edit: Unless you're using rocPRIM via hipCUB. Then the best choice is |
Thank you for the answer. However, for my application, the segment size will eventually vary between 4 and 128 with single peaks larger than blocksize, being not fixed over the full array. I will wait for the final delivery with my tests, thank you. |
Performance of segmented radix sort with a fixed segment size of 32 on rocPRIM and CUB are comparable. Segmented radix sort just isn't designed for short fixed size segments. Vega
980
|
If it will be fixed and that you can use |
Here, I do not agree. For testing I have created a very naive warp based segmented radix sort that does the initial test case within less than 1 ms on Vega10. It is limited to segment sizes of 32 tho. |
If this means something like, for example, 90% segments are <= 256, 9% <= 2000, 1% > 2000, then we have a solution for this (I hope good enough, but probably not perfect, as there is always a balance in short-long sequences performance) . This solution is available here: https://github.com/ROCmSoftwarePlatform/rocPRIM/tree/v0.3.2 rocPRIM (not hipCUB!!!) supports custom configs for device-level functions: using config = rocprim::segmented_radix_sort_config<6, 5, rocprim::kernel_config<64, 4>>;
rocrpim::segmented_radix_sort_keys<config>(...); In this example sort segments longer than 64 * 4 (BlockSize * ItemsPerThread) will be sorted in multiple iterations (passes) using radix of 6 and 5 bits (BlockSize cannot be larger than 2^(radix bits) due to implementation restrictions). Shorter sequences will be sorted using as small ItemsPerThread as possible (by dividing ItemsPerThread by 2), i.e. 200 will be sorted using 4 items per thread, 100 - 2 items per thread, 10 - 1 item. You need to choose BlockSize and ItemsPerThread that have the best performance in most cases:
Here "* ms" column means time for 10 repeats of 4M, so divide ms by 10.
Is the performance enough for your task? As an opposite 256 has much lower performance for short segments:
Btw, don't forget that if you know in advance the range of key values then you can set |
ex-rzr thank you very much for your hint. With tuning the blocksize and items per thread performance looks much better. |
I am facing some performance issues with segmented radix sort functionality of rocPRIM. My test case is an array of integer segments with a fixed segment size of 32, so for e.g. an array size of 4M I have 125k segments with 32 integers each. Sorting all segments sequentially on the host (std::sort) takes roughly 55ms, sorting the segments using OpenMP is about 15ms, sorting the segments using a Vega10 card and rocPRIM is 65ms.
Decreasing the segment size further increases performance on the host, and decreases performance with rocPRIM. I need to have a segment size of at least 64 to outperform the sequential sorting on the host with rocPRIM, and a segment size of 256 to beat the OpenMP version.
Nico
The text was updated successfully, but these errors were encountered: