DeviceSegmentedRadixSort performance issue #7

ntrost57 · 2018-06-08T07:58:54Z

I am facing some performance issues with segmented radix sort functionality of rocPRIM. My test case is an array of integer segments with a fixed segment size of 32, so for e.g. an array size of 4M I have 125k segments with 32 integers each. Sorting all segments sequentially on the host (std::sort) takes roughly 55ms, sorting the segments using OpenMP is about 15ms, sorting the segments using a Vega10 card and rocPRIM is 65ms.
Decreasing the segment size further increases performance on the host, and decreases performance with rocPRIM. I need to have a segment size of at least 64 to outperform the sequential sorting on the host with rocPRIM, and a segment size of 256 to beat the OpenMP version.

Nico

VincentSC · 2018-06-08T08:17:14Z

This algorithm has not been optimized yet just like various other algorithms, as functional completeness had a higher priority. So no need to benchmark before the final delivery. By remaining on Master, your code will become faster automatically.

jszuppe · 2018-06-08T08:34:02Z

For very short and fixed-size segmented segmented radix sort may not be the best choice. I'd recommend using warp_sort since 32 is smaller than max warp size. It should be the fastest method. You can check benchmark_hip_warp_sort.cpp to see how to use warp_sort in a kernel.

Edit:

Unless you're using rocPRIM via hipCUB. Then the best choice is block_radix_sort since CUB does not provide warp-level sort primitive.

ntrost57 · 2018-06-08T08:42:03Z

Thank you for the answer. However, for my application, the segment size will eventually vary between 4 and 128 with single peaks larger than blocksize, being not fixed over the full array. I will wait for the final delivery with my tests, thank you.

jszuppe · 2018-06-08T08:42:36Z

Performance of segmented radix sort with a fixed segment size of 32 on rocPRIM and CUB are comparable. Segmented radix sort just isn't designed for short fixed size segments.

Vega

sort_keys<int>(32)/manual_time                      3946 ms          0 ms          1   129.767MB/s   32.4418M items/s
sort_keys<long long>(32)/manual_time                9625 ms          1 ms          1   106.387MB/s   13.2984M items/s
sort_pairs<int, float>(32)/manual_time              4447 ms          0 ms          1   230.265MB/s   28.7832M items/s
sort_pairs<long long, double>(32)/manual_time      19015 ms          1 ms          1   107.707MB/s   6.73166M items/s

980

SortKeys<int>(32)/manual_time                      4381 ms       4382 ms          1   116.863MB/s   29.2157M items/s
SortKeys<long long>(32)/manual_time                6874 ms       6874 ms          1   148.972MB/s   18.6215M items/s
SortPairs<int, float>(32)/manual_time              6035 ms       6035 ms          1   169.678MB/s   21.2097M items/s
SortPairs<long long, double>(32)/manual_time      15661 ms      15662 ms          1   130.768MB/s   8.17303M items/s

jszuppe · 2018-06-08T08:44:02Z

the segment size will eventually vary between 4 and 128

If it will be fixed and that you can use warp_sort or block_sort either way.

ntrost57 · 2018-06-08T08:45:56Z

Here, I do not agree. For testing I have created a very naive warp based segmented radix sort that does the initial test case within less than 1 ms on Vega10. It is limited to segment sizes of 32 tho.

ex-rzr · 2018-06-13T03:55:00Z

@ntrost57

vary between 4 and 128 with single peaks larger than blocksize

If this means something like, for example, 90% segments are <= 256, 9% <= 2000, 1% > 2000, then we have a solution for this (I hope good enough, but probably not perfect, as there is always a balance in short-long sequences performance) .

This solution is available here: https://github.com/ROCmSoftwarePlatform/rocPRIM/tree/v0.3.2

rocPRIM (not hipCUB!!!) supports custom configs for device-level functions:

using config = rocprim::segmented_radix_sort_config<6, 5, rocprim::kernel_config<64, 4>>;
rocrpim::segmented_radix_sort_keys<config>(...);

In this example sort segments longer than 64 * 4 (BlockSize * ItemsPerThread) will be sorted in multiple iterations (passes) using radix of 6 and 5 bits (BlockSize cannot be larger than 2^(radix bits) due to implementation restrictions).

Shorter sequences will be sorted using as small ItemsPerThread as possible (by dividing ItemsPerThread by 2), i.e. 200 will be sorted using 4 items per thread, 100 - 2 items per thread, 10 - 1 item.

You need to choose BlockSize and ItemsPerThread that have the best performance in most cases:

BlockSize should be small (64) if there are many short sequences and large (256) if many long sequences.
ItemsPerThread usually should be large (10-15).

Here "* ms" column means time for 10 repeats of 4M, so divide ms by 10.

using config = rocprim::segmented_radix_sort_config<6, 5, rocprim::kernel_config<64, 4>>;
sort_keys<int>(32)/manual_time          14 ms          0 ms         49   10.8313GB/s   2.70781G items/s
sort_keys<int>(64)/manual_time           7 ms          0 ms         94   21.0067GB/s   5.25167G items/s
sort_keys<int>(128)/manual_time          6 ms          0 ms        128   27.5956GB/s    6.8989G items/s
sort_keys<int>(192)/manual_time          6 ms          0 ms        127   28.0194GB/s   7.00485G items/s
sort_keys<int>(256)/manual_time          4 ms          0 ms        160   35.7357GB/s   8.93392G items/s

Is the performance enough for your task?

As an opposite 256 has much lower performance for short segments:

using config = rocprim::segmented_radix_sort_config<7, 6, rocprim::kernel_config<256, 7>>;
sort_keys<int>(32)/manual_time           114 ms          0 ms          6   1.37003GB/s   350.728M items/s
sort_keys<int>(64)/manual_time            57 ms          0 ms         12   2.72697GB/s   698.105M items/s
sort_keys<int>(128)/manual_time           29 ms          0 ms         24   5.40673GB/s   1.35168G items/s
sort_keys<int>(192)/manual_time           19 ms          0 ms         35   8.06406GB/s   2.01601G items/s
sort_keys<int>(256)/manual_time           15 ms          0 ms         47   10.6808GB/s   2.67021G items/s
sort_keys<int>(1000)/manual_time           8 ms          0 ms         85   18.4266GB/s   4.60665G items/s
sort_keys<int>(10000)/manual_time         17 ms          0 ms         41   9.14953GB/s   2.28738G items/s

Btw, don't forget that if you know in advance the range of key values then you can set end_bit lower than sizeof(Key) * 8.

ntrost57 · 2018-06-22T07:33:12Z

ex-rzr thank you very much for your hint. With tuning the blocksize and items per thread performance looks much better.

jszuppe closed this as completed Jun 8, 2018

jszuppe added the wontfix label Jun 8, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DeviceSegmentedRadixSort performance issue #7

DeviceSegmentedRadixSort performance issue #7

ntrost57 commented Jun 8, 2018

VincentSC commented Jun 8, 2018

jszuppe commented Jun 8, 2018 •

edited

Loading

ntrost57 commented Jun 8, 2018 •

edited

Loading

jszuppe commented Jun 8, 2018

jszuppe commented Jun 8, 2018

ntrost57 commented Jun 8, 2018 •

edited

Loading

ex-rzr commented Jun 13, 2018

ntrost57 commented Jun 22, 2018

DeviceSegmentedRadixSort performance issue #7

DeviceSegmentedRadixSort performance issue #7

Comments

ntrost57 commented Jun 8, 2018

VincentSC commented Jun 8, 2018

jszuppe commented Jun 8, 2018 • edited Loading

ntrost57 commented Jun 8, 2018 • edited Loading

jszuppe commented Jun 8, 2018

jszuppe commented Jun 8, 2018

ntrost57 commented Jun 8, 2018 • edited Loading

ex-rzr commented Jun 13, 2018

ntrost57 commented Jun 22, 2018

jszuppe commented Jun 8, 2018 •

edited

Loading

ntrost57 commented Jun 8, 2018 •

edited

Loading

ntrost57 commented Jun 8, 2018 •

edited

Loading