Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DeviceSegmentedRadixSort performance issue #7

Closed
ntrost57 opened this issue Jun 8, 2018 · 8 comments
Closed

DeviceSegmentedRadixSort performance issue #7

ntrost57 opened this issue Jun 8, 2018 · 8 comments
Labels

Comments

@ntrost57
Copy link

ntrost57 commented Jun 8, 2018

I am facing some performance issues with segmented radix sort functionality of rocPRIM. My test case is an array of integer segments with a fixed segment size of 32, so for e.g. an array size of 4M I have 125k segments with 32 integers each. Sorting all segments sequentially on the host (std::sort) takes roughly 55ms, sorting the segments using OpenMP is about 15ms, sorting the segments using a Vega10 card and rocPRIM is 65ms.
Decreasing the segment size further increases performance on the host, and decreases performance with rocPRIM. I need to have a segment size of at least 64 to outperform the sequential sorting on the host with rocPRIM, and a segment size of 256 to beat the OpenMP version.

Nico

@VincentSC
Copy link
Contributor

This algorithm has not been optimized yet just like various other algorithms, as functional completeness had a higher priority. So no need to benchmark before the final delivery. By remaining on Master, your code will become faster automatically.

@jszuppe
Copy link
Contributor

jszuppe commented Jun 8, 2018

For very short and fixed-size segmented segmented radix sort may not be the best choice. I'd recommend using warp_sort since 32 is smaller than max warp size. It should be the fastest method. You can check benchmark_hip_warp_sort.cpp to see how to use warp_sort in a kernel.

Edit:

Unless you're using rocPRIM via hipCUB. Then the best choice is block_radix_sort since CUB does not provide warp-level sort primitive.

@ntrost57
Copy link
Author

ntrost57 commented Jun 8, 2018

Thank you for the answer. However, for my application, the segment size will eventually vary between 4 and 128 with single peaks larger than blocksize, being not fixed over the full array. I will wait for the final delivery with my tests, thank you.

@jszuppe
Copy link
Contributor

jszuppe commented Jun 8, 2018

Performance of segmented radix sort with a fixed segment size of 32 on rocPRIM and CUB are comparable. Segmented radix sort just isn't designed for short fixed size segments.

Vega

sort_keys<int>(32)/manual_time                      3946 ms          0 ms          1   129.767MB/s   32.4418M items/s
sort_keys<long long>(32)/manual_time                9625 ms          1 ms          1   106.387MB/s   13.2984M items/s
sort_pairs<int, float>(32)/manual_time              4447 ms          0 ms          1   230.265MB/s   28.7832M items/s
sort_pairs<long long, double>(32)/manual_time      19015 ms          1 ms          1   107.707MB/s   6.73166M items/s

980

SortKeys<int>(32)/manual_time                      4381 ms       4382 ms          1   116.863MB/s   29.2157M items/s
SortKeys<long long>(32)/manual_time                6874 ms       6874 ms          1   148.972MB/s   18.6215M items/s
SortPairs<int, float>(32)/manual_time              6035 ms       6035 ms          1   169.678MB/s   21.2097M items/s
SortPairs<long long, double>(32)/manual_time      15661 ms      15662 ms          1   130.768MB/s   8.17303M items/s

@jszuppe
Copy link
Contributor

jszuppe commented Jun 8, 2018

the segment size will eventually vary between 4 and 128

If it will be fixed and that you can use warp_sort or block_sort either way.

@jszuppe jszuppe closed this as completed Jun 8, 2018
@jszuppe jszuppe added the wontfix label Jun 8, 2018
@ntrost57
Copy link
Author

ntrost57 commented Jun 8, 2018

Here, I do not agree. For testing I have created a very naive warp based segmented radix sort that does the initial test case within less than 1 ms on Vega10. It is limited to segment sizes of 32 tho.

@ex-rzr
Copy link
Contributor

ex-rzr commented Jun 13, 2018

@ntrost57

vary between 4 and 128 with single peaks larger than blocksize

If this means something like, for example, 90% segments are <= 256, 9% <= 2000, 1% > 2000, then we have a solution for this (I hope good enough, but probably not perfect, as there is always a balance in short-long sequences performance) .

This solution is available here: https://github.com/ROCmSoftwarePlatform/rocPRIM/tree/v0.3.2

rocPRIM (not hipCUB!!!) supports custom configs for device-level functions:

using config = rocprim::segmented_radix_sort_config<6, 5, rocprim::kernel_config<64, 4>>;
rocrpim::segmented_radix_sort_keys<config>(...);

In this example sort segments longer than 64 * 4 (BlockSize * ItemsPerThread) will be sorted in multiple iterations (passes) using radix of 6 and 5 bits (BlockSize cannot be larger than 2^(radix bits) due to implementation restrictions).

Shorter sequences will be sorted using as small ItemsPerThread as possible (by dividing ItemsPerThread by 2), i.e. 200 will be sorted using 4 items per thread, 100 - 2 items per thread, 10 - 1 item.

You need to choose BlockSize and ItemsPerThread that have the best performance in most cases:

  • BlockSize should be small (64) if there are many short sequences and large (256) if many long sequences.
  • ItemsPerThread usually should be large (10-15).

Here "* ms" column means time for 10 repeats of 4M, so divide ms by 10.

using config = rocprim::segmented_radix_sort_config<6, 5, rocprim::kernel_config<64, 4>>;
sort_keys<int>(32)/manual_time          14 ms          0 ms         49   10.8313GB/s   2.70781G items/s
sort_keys<int>(64)/manual_time           7 ms          0 ms         94   21.0067GB/s   5.25167G items/s
sort_keys<int>(128)/manual_time          6 ms          0 ms        128   27.5956GB/s    6.8989G items/s
sort_keys<int>(192)/manual_time          6 ms          0 ms        127   28.0194GB/s   7.00485G items/s
sort_keys<int>(256)/manual_time          4 ms          0 ms        160   35.7357GB/s   8.93392G items/s

Is the performance enough for your task?

As an opposite 256 has much lower performance for short segments:

using config = rocprim::segmented_radix_sort_config<7, 6, rocprim::kernel_config<256, 7>>;
sort_keys<int>(32)/manual_time           114 ms          0 ms          6   1.37003GB/s   350.728M items/s
sort_keys<int>(64)/manual_time            57 ms          0 ms         12   2.72697GB/s   698.105M items/s
sort_keys<int>(128)/manual_time           29 ms          0 ms         24   5.40673GB/s   1.35168G items/s
sort_keys<int>(192)/manual_time           19 ms          0 ms         35   8.06406GB/s   2.01601G items/s
sort_keys<int>(256)/manual_time           15 ms          0 ms         47   10.6808GB/s   2.67021G items/s
sort_keys<int>(1000)/manual_time           8 ms          0 ms         85   18.4266GB/s   4.60665G items/s
sort_keys<int>(10000)/manual_time         17 ms          0 ms         41   9.14953GB/s   2.28738G items/s

Btw, don't forget that if you know in advance the range of key values then you can set end_bit lower than sizeof(Key) * 8.

@ntrost57
Copy link
Author

ex-rzr thank you very much for your hint. With tuning the blocksize and items per thread performance looks much better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants