Origami: A High-Performance Mergesort Framework


This repository contains the implementation of the components and algorithms of the Origami framework described in the Origami paper. The framework was developed by the IRL at Texas A&M University; the paper was published in PVLDB Volume 15 and will be presented at VLDB 2022.

Summary

Mergesort is a popular algorithm for sorting real-world workloads as it is immune to data skewness, suitable for parallelization using vectorized intrinsics, and relatively simple to multithread. We introduce Origami, an in-memory mergesort framework that is optimized for scalar as well as all current SIMD (single-instruction, multiple-data) CPU architectures. For each vector-extension set (e.g., SSE, AVX2, AVX-512), we present an in-register sorter for small sequences that is up to 8X faster than prior methods and a branchless streaming merger that achieves up to a 1.5X speed-up over the naive merge. In addition, we introduce a cache-residing quad-merge tree to avoid bottlenecking on memory bandwidth and a parallel partitioning scheme to maximize thread-level concurrency. We develop an end-to-end sort with these components and produce a highly utilized mergesort pipeline by reducing the synchronization overhead between threads. Single-threaded Origami performs up to 2X faster than the closest competitor and achieves a nearly perfect speed-up in multi-core environments.
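To give a flavor of the branchless idea mentioned above, the sketch below shows a minimal scalar two-way merge written without data-dependent branches. This is an illustration only, not Origami's merger: the framework's bmerge operates on whole SSE/AVX2/AVX-512 registers and streams several keys per iteration, but the principle of replacing the unpredictable if/else with arithmetic on the comparison result is the same.

```cpp
#include <cstddef>
#include <cstdint>

// Minimal scalar sketch of a branchless two-way merge of sorted runs a and b.
// Illustrative only; Origami's vectorized mergers are in the repository.
void branchless_merge(const uint32_t* a, size_t na,
                      const uint32_t* b, size_t nb, uint32_t* out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        uint32_t va = a[i], vb = b[j];
        bool take_a = va <= vb;        // 0 or 1; typically compiles to cmov/setcc
        out[k++] = take_a ? va : vb;
        i += take_a;                   // advance exactly one input cursor
        j += !take_a;
    }
    while (i < na) out[k++] = a[i++];  // copy whichever tail remains
    while (j < nb) out[k++] = b[j++];
}
```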

Performance

A quick preview of Origami sort performance is given in the tables below; for a deeper analysis and discussion, please refer to the paper. All benchmarked code was compiled with Visual Studio 2019 and run on Windows Server 2016 on an 8-core Intel Skylake-X (i7-7820X) CPU with a fixed 4.7 GHz clock on all cores, 1 MB L2 cache, and 32 GB of DDR4-3200 quad-channel RAM. When AVX-512 was used, the BIOS defaulted to a clock 400 MHz lower (i.e., 4.3 GHz), the AVX offset implemented by many motherboard manufacturers to keep core temperatures under control. We enabled the maximum level of optimization and used the appropriate flags (e.g., /arch:AVX512) to ensure the compiler uses all available SIMD registers.

The following table compares Origami with prior work on the corresponding SIMD architectures, measuring the sort speed for different chunk sizes within a 1 GB array of 32-bit random keys.

Single Thread Chunked Sort Speed (M keys/s, 32-bit keys)

| Chunk size | AA-sort (SSE) | Origami (SSE) | ASPaS-sort (AVX2) | Peloton-sort (AVX2) | Origami (AVX2) | Watkins-sort (AVX-512) | Xiaochen-sort (AVX-512) | Yin-sort (AVX-512) | Origami (AVX-512) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 K | 63 | 176 | 53 | 139 | 228 | 40 | 198 | 140 | 295 |
| 256 K | 61 | 147 | 47 | 128 | 210 | 33 | 184 | 130 | 269 |
| 512 K | 59 | 138 | 44 | 120 | 195 | 30 | 172 | 113 | 249 |
| 1 M | 57 | 131 | 41 | 109 | 183 | 28 | 160 | 102 | 232 |
| 2 M | 55 | 124 | 39 | 92 | 174 | 25 | 150 | 95 | 216 |
| 4 M | 54 | 118 | 37 | 81 | 168 | 23 | 140 | 88 | 203 |
| 8 M | 52 | 112 | 35 | 77 | 162 | 21 | 131 | 83 | 191 |
| 16 M | 50 | 107 | 33 | 73 | 153 | 20 | 122 | 78 | 181 |
| 32 M | 48 | 102 | 32 | 70 | 145 | 19 | 115 | 72 | 172 |
| 64 M | 47 | 98 | 30 | 67 | 138 | 18 | 109 | 69 | 163 |
| 128 M | 45 | 95 | 29 | 65 | 132 | 17 | 103 | 66 | 156 |
| 256 M | 44 | 91 | 28 | 63 | 126 | 17 | 97 | 64 | 149 |


Origami is distribution-insensitive, i.e., it retains almost constant speed on all inputs. The next table shows this characteristic on 1 GB of data for the following distributions (a small generation sketch follows the list):

D1: Uniformly random, generated by Mersenne Twister
D2: All same keys
D3: Sorted
D4: Reverse sorted
D5: Almost sorted, where every 7th key is set to KEY_MAX
D6: Pareto keys, generated as min(ceil(beta(1/(1-u) - 1)), 10000), where beta = 7 and u ~ uniform[0, 1]
D7: Bursts of same keys, where the length of each subsequence is drawn from D6 and key from D1
D8: Random shuffle, generated by randomly permuting D7
D9: Fibonacci, wrapped around when it overflows the number of items to sort
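For concreteness, here is a minimal sketch of two of these generators (D5 and D6), written directly from the formulas above. It is illustrative only; the benchmark generators in the repository are authoritative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Sketch of D5 (almost sorted) and D6 (Pareto) from the definitions above.
std::vector<uint32_t> make_d5_almost_sorted(size_t n) {
    std::vector<uint32_t> v(n);
    for (size_t i = 0; i < n; ++i) v[i] = static_cast<uint32_t>(i);
    for (size_t i = 6; i < n; i += 7)                      // every 7th key ...
        v[i] = std::numeric_limits<uint32_t>::max();       // ... set to KEY_MAX
    return v;
}

std::vector<uint32_t> make_d6_pareto(size_t n, double beta = 7.0) {
    std::mt19937_64 gen(0);
    std::uniform_real_distribution<double> uni(0.0, 1.0);  // u ~ uniform[0, 1)
    std::vector<uint32_t> v(n);
    for (uint32_t& key : v) {
        double u = uni(gen);
        double x = std::ceil(beta * (1.0 / (1.0 - u) - 1.0));
        key = static_cast<uint32_t>(std::min(x, 10000.0)); // min(..., 10000)
    }
    return v;
}
```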

Single Thread Sort Speed for Different Distributions (M keys/s)

| Register type | Key size | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scalar | 32 | 47 | 52 | 52 | 52 | 47 | 47 | 49 | 47 | 47 |
| Scalar | 64 | 43 | 47 | 47 | 47 | 43 | 44 | 45 | 43 | 44 |
| Scalar | 64+64 | 25 | 27 | 27 | 27 | 25 | 25 | 25 | 25 | 25 |
| SSE | 32 | 91 | 90 | 90 | 90 | 91 | 91 | 92 | 91 | 90 |
| SSE | 64 | 50 | 49 | 49 | 50 | 50 | 50 | 50 | 50 | 49 |
| SSE | 64+64 | 35 | 34 | 34 | 34 | 35 | 35 | 35 | 35 | 34 |
| AVX2 | 32 | 126 | 126 | 125 | 124 | 126 | 126 | 127 | 126 | 125 |
| AVX2 | 64 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 47 |
| AVX2 | 64+64 | 34 | 34 | 34 | 34 | 34 | 34 | 34 | 34 | 34 |
| AVX-512 | 32 | 149 | 145 | 145 | 145 | 148 | 149 | 149 | 149 | 146 |
| AVX-512 | 64 | 65 | 63 | 64 | 63 | 64 | 64 | 65 | 64 | 63 |
| AVX-512 | 64+64 | 27 | 26 | 26 | 26 | 26 | 27 | 26 | 26 | 26 |


Origami achieves a near-perfect speed-up in multi-core environments. The next table shows its scalability on 1 GB of random data:

Parallel Sort Speed on Skylake-X (M keys/s)

| Register type | Key size | Speed 1C | Speed 2C | Speed 4C | Speed 8C | Speed-up 2C | Speed-up 4C | Speed-up 8C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scalar | 32 | 47 | 74 | 147 | 282 | 1.6 | 3.1 | 6.0 |
| Scalar | 64 | 43 | 75 | 149 | 273 | 1.7 | 3.5 | 6.4 |
| Scalar | 64+64 | 25 | 44 | 84 | 162 | 1.8 | 3.4 | 6.5 |
| SSE | 32 | 91 | 179 | 352 | 687 | 2.0 | 3.9 | 7.6 |
| SSE | 64 | 50 | 94 | 185 | 361 | 1.9 | 3.7 | 7.2 |
| SSE | 64+64 | 35 | 70 | 139 | 260 | 2.0 | 4.0 | 7.4 |
| AVX2 | 32 | 126 | 248 | 495 | 950 | 2.0 | 3.9 | 7.5 |
| AVX2 | 64 | 48 | 95 | 189 | 369 | 2.0 | 3.9 | 7.7 |
| AVX2 | 64+64 | 34 | 70 | 137 | 254 | 2.1 | 4.0 | 7.5 |
| AVX-512 | 32 | 149 | 286 | 565 | 1062 | 1.9 | 3.8 | 7.1 |
| AVX-512 | 64 | 65 | 122 | 242 | 462 | 1.9 | 3.7 | 7.1 |
| AVX-512 | 64+64 | 27 | 53 | 105 | 197 | 2.0 | 3.9 | 7.3 |

Getting Started

Recommended Setup:

  • OS: Windows Server 2019 or Server 2016
  • Compiler: MSVC++ 14.29 (Visual Studio 2019 16.11)
  1. Make sure the project is set to the x64 Release configuration.
  2. Set the C++ standard to C++17 or later under Project > Properties > Configuration Properties > General.
  3. Set the register type (Scalar, SSE, AVX2, or AVX-512) and the sort key type (uint32, int64, or <int64, int64> key-value pair) in config.h.
  4. Set the appropriate compiler flags (e.g., /arch:AVX2, /arch:AVX512) under Project > Properties > Configuration Properties > C/C++ > Command Line (a command-line sketch follows this list).
  5. Make sure no Windows update from 03/21 or later is installed, as they make the AVX2 and AVX-512 bmerge up to 20% slower.
  6. Update the parameters in config.h if necessary (details below). The currently tuned parameters are for the Skylake-X machine mentioned above.
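For reference, the same switches on the MSVC command line (outside the IDE) look roughly as follows for an AVX-512 build; the source file name is a placeholder, and the project is normally built through the Visual Studio solution rather than this way:

```
cl /O2 /std:c++17 /arch:AVX512 /EHsc bench_sort.cpp
```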

Parameters (a config.h sketch follows this list):

  1. _P1_NREG: Number of registers available, typically 16 or 32
  2. _P1_SWITCH: Switch point from mcmerge to mrmerge, get this from bench_sort_phases:phase1_matrix_merge_test()
  3. _P1_N: Switch point from in register sort to bmerge, get this from bench_sort_phases:phase2_switch_point_test()
  4. _P2_MERGE_UNROLL: In-cache bmerge unroll, get this from bench_bmerge:bmerge_test()
  5. _P2_MERGE_NREG_1x: Number of registers per stream in bmerge no unroll
  6. _P2_MERGE_NREG_2x: Number of registers per stream in bmerge 2X unroll
  7. _P2_MERGE_NREG_3x: Number of registers per stream in bmerge 3X unroll
  8. _P3_MERGE_UNROLL: Out-of-cache bmerge unroll, get this from bench_bmerge:bmerge_test()
  9. _P3_MERGE_NREG_1x: Number of registers per stream in bmerge no unroll
  10. _P3_MERGE_NREG_2x: Number of registers per stream in bmerge 2X unroll
  11. _P3_MERGE_NREG_3x: Number of registers per stream in bmerge 3X unroll
  12. _MTREE_NREG: Number of registers per stream while merging within the mtree, get this from bench_mtree:mtree_single_thread_test()
  13. _MT_L1_BUFF_N: Buffer size at the internal node of each 4-way node in mtree
  14. _MT_L2_BUFF_N: Buffer size at the root of each 4-way node in mtree
  15. _MIN_K: Minimum way of merge to avoid memory bandwidth bottleneck, get this from bench_mtree:mtree_multi_thread_test()
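To make the shape of this concrete, the tuned constants end up in config.h roughly as in the sketch below. This is a hypothetical fragment with placeholder values only (not the shipped Skylake-X tuning, and only a subset of the parameters listed above); the actual definitions in the repository's config.h are authoritative.

```cpp
// Hypothetical sketch of tuned constants in config.h; the values are
// placeholders to be replaced with the results of the benchmark routines
// named above.
#define _P1_NREG            32   // SIMD registers available (typically 16 or 32)
#define _P1_SWITCH          64   // mcmerge -> mrmerge switch point (placeholder)
#define _P1_N               256  // in-register sort -> bmerge switch point (placeholder)
#define _P2_MERGE_UNROLL    2    // in-cache bmerge unroll factor (placeholder)
#define _P3_MERGE_UNROLL    1    // out-of-cache bmerge unroll factor (placeholder)
#define _MTREE_NREG         4    // registers per stream in the mtree merge (placeholder)
#define _MIN_K              8    // minimum merge ways to avoid the bandwidth bottleneck (placeholder)
```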

Run:

Once all parameters are set, the Origami sort API can be used as shown in bench_sort:sort_bench().
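The exact entry point and its arguments are demonstrated in bench_sort:sort_bench(); the fragment below only sketches the surrounding setup, with origami_sort() as a hypothetical stand-in name rather than the real API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical call shape; see bench_sort:sort_bench() for the real API.
int main() {
    const size_t n = 1ull << 28;           // 2^28 32-bit keys = 1 GB
    std::vector<uint32_t> keys(n);
    std::mt19937 gen(1);                   // D1-style uniformly random input
    for (uint32_t& k : keys) k = gen();

    // origami_sort(keys.data(), keys.data() + n);        // stand-in for the real call
    // assert(std::is_sorted(keys.begin(), keys.end()));  // sanity check
    return 0;
}
```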

License

This project is licensed under the GPLv3.0 License - see the LICENSE file for details.

Authors

Arif Arman, Dmitri Loguinov