This repository contains the implementation of the components and algorithms of the Origami framework described in the Origami paper. The framework was developed by the IRL @ Texas A&M University, published in PVLDB Volume 15, and presented at VLDB 2022.
Mergesort is a popular algorithm for sorting real-world workloads as it is immune to data skew, suitable for parallelization using vectorized intrinsics, and relatively simple to multithread. We introduce Origami, an in-memory mergesort framework that is optimized for scalar execution as well as all current SIMD (single-instruction, multiple-data) CPU architectures. For each vector-extension set (e.g., SSE, AVX2, AVX-512), we present an in-register sorter for small sequences that is up to 8X faster than prior methods and a branchless streaming merger that achieves up to a 1.5X speed-up over the naive merge. In addition, we introduce a cache-residing quad-merge tree to avoid bottlenecking on memory bandwidth and a parallel partitioning scheme to maximize thread-level concurrency. We develop an end-to-end sort with these components and produce a highly utilized mergesort pipeline by reducing the synchronization overhead between threads. Single-threaded Origami performs up to 2X faster than the closest competitor and achieves a nearly perfect speed-up in multi-core environments.
A quick preview of Origami's sort performance is given in the tables below; for a deeper analysis and discussion, please refer to the paper. All benchmark code was compiled with Visual Studio 2019 and run on Windows Server 2016 on an 8-core Intel Skylake-X (i7-7820X) CPU with a fixed 4.7 GHz clock on all cores, 1 MB of L2 cache per core, and 32 GB of DDR4-3200 quad-channel RAM. When AVX-512 was used, the BIOS defaulted to a 400-MHz-lower clock (i.e., 4.3 GHz), known as the AVX offset, which many motherboard manufacturers implement to keep core temperatures under control. We enable the maximum level of optimization and use the appropriate flags (e.g., /arch:AVX512) to ensure the compiler uses all available SIMD registers.
The following table compares Origami against prior work on the corresponding SIMD architectures, showing sort speed for different chunk sizes within a 1 GB array of 32-bit random keys.
**Single Thread Chunked Sort Speed (M keys/s, 32-bit keys)**

| Chunk size | SSE: AA-sort | SSE: Origami | AVX2: ASPaS-sort | AVX2: Peloton-sort | AVX2: Origami | AVX-512: Watkins-sort | AVX-512: Xiaochen-sort | AVX-512: Yin-sort | AVX-512: Origami |
|---|---|---|---|---|---|---|---|---|---|
128 K | 63 | 176 | 53 | 139 | 228 | 40 | 198 | 140 | 295 |
256 K | 61 | 147 | 47 | 128 | 210 | 33 | 184 | 130 | 269 |
512 K | 59 | 138 | 44 | 120 | 195 | 30 | 172 | 113 | 249 |
1 M | 57 | 131 | 41 | 109 | 183 | 28 | 160 | 102 | 232 |
2 M | 55 | 124 | 39 | 92 | 174 | 25 | 150 | 95 | 216 |
4 M | 54 | 118 | 37 | 81 | 168 | 23 | 140 | 88 | 203 |
8 M | 52 | 112 | 35 | 77 | 162 | 21 | 131 | 83 | 191 |
16 M | 50 | 107 | 33 | 73 | 153 | 20 | 122 | 78 | 181 |
32 M | 48 | 102 | 32 | 70 | 145 | 19 | 115 | 72 | 172 |
64 M | 47 | 98 | 30 | 67 | 138 | 18 | 109 | 69 | 163 |
128 M | 45 | 95 | 29 | 65 | 132 | 17 | 103 | 66 | 156 |
256 M | 44 | 91 | 28 | 63 | 126 | 17 | 97 | 64 | 149 |
Origami is distribution-insensitive, i.e., it retains an almost constant speed on all inputs. The next table shows this characteristic on 1 GB of data for the following distributions:
D1: Uniformly random, generated by Mersenne Twister
D2: All same keys
D3: Sorted
D4: Reverse sorted
D5: Almost sorted, where every 7th key is set to KEY_MAX
D6: Pareto keys, generated as min(ceil(beta * (1/(1-u) - 1)), 10000), where beta = 7 and u ~ uniform[0, 1]
D7: Bursts of same keys, where the length of each subsequence is drawn from D6 and key from D1
D8: Random shuffle, generated by randomly permuting D7
D9: Fibonacci sequence, wrapped around when a value overflows the number of items to sort
**Single Thread Sort Speed for Different Distributions (M keys/s)**

| ISA | Key size | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Scalar | 32 | 47 | 52 | 52 | 52 | 47 | 47 | 49 | 47 | 47 |
| Scalar | 64 | 43 | 47 | 47 | 47 | 43 | 44 | 45 | 43 | 44 |
| Scalar | 64+64 | 25 | 27 | 27 | 27 | 25 | 25 | 25 | 25 | 25 |
| SSE | 32 | 91 | 90 | 90 | 90 | 91 | 91 | 92 | 91 | 90 |
| SSE | 64 | 50 | 49 | 49 | 50 | 50 | 50 | 50 | 50 | 49 |
| SSE | 64+64 | 35 | 34 | 34 | 34 | 35 | 35 | 35 | 35 | 34 |
| AVX2 | 32 | 126 | 126 | 125 | 124 | 126 | 126 | 127 | 126 | 125 |
| AVX2 | 64 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 47 |
| AVX2 | 64+64 | 34 | 34 | 34 | 34 | 34 | 34 | 34 | 34 | 34 |
| AVX-512 | 32 | 149 | 145 | 145 | 145 | 148 | 149 | 149 | 149 | 146 |
| AVX-512 | 64 | 65 | 63 | 64 | 63 | 64 | 64 | 65 | 64 | 63 |
| AVX-512 | 64+64 | 27 | 26 | 26 | 26 | 26 | 27 | 26 | 26 | 26 |
Origami achieves a near-perfect speed-up in multi-core environments. The next table shows its scalability for 1 GB of random data:
**Parallel Sort Speed on Skylake-X (1 GB random data)**

| ISA | Key size | Speed 1C (M/s) | Speed 2C (M/s) | Speed 4C (M/s) | Speed 8C (M/s) | Speed-up 2C | Speed-up 4C | Speed-up 8C |
|---|---|---|---|---|---|---|---|---|
| Scalar | 32 | 47 | 74 | 147 | 282 | 1.6 | 3.1 | 6.0 |
| Scalar | 64 | 43 | 75 | 149 | 273 | 1.7 | 3.5 | 6.4 |
| Scalar | 64+64 | 25 | 44 | 84 | 162 | 1.8 | 3.4 | 6.5 |
| SSE | 32 | 91 | 179 | 352 | 687 | 2.0 | 3.9 | 7.6 |
| SSE | 64 | 50 | 94 | 185 | 361 | 1.9 | 3.7 | 7.2 |
| SSE | 64+64 | 35 | 70 | 139 | 260 | 2.0 | 4.0 | 7.4 |
| AVX2 | 32 | 126 | 248 | 495 | 950 | 2.0 | 3.9 | 7.5 |
| AVX2 | 64 | 48 | 95 | 189 | 369 | 2.0 | 3.9 | 7.7 |
| AVX2 | 64+64 | 34 | 70 | 137 | 254 | 2.1 | 4.0 | 7.5 |
| AVX-512 | 32 | 149 | 286 | 565 | 1062 | 1.9 | 3.8 | 7.1 |
| AVX-512 | 64 | 65 | 122 | 242 | 462 | 1.9 | 3.7 | 7.1 |
| AVX-512 | 64+64 | 27 | 53 | 105 | 197 | 2.0 | 3.9 | 7.3 |
Recommended Setup:
- OS: Windows Server 2019 or Server 2016
- Compiler: MSVC++ 14.29 (Visual Studio 2019 16.11)
- Make sure to check that the project is set for x64 Release.
- Set C++ standard to C++17 or later under Project > Properties > Configuration Properties > General.
- Set register type (Scalar, SSE, AVX2 or AVX-512) and sort key type (uint32, int64 or <int64, int64> key-value pair) in config.h.
- Set appropriate compiler flags (e.g., /arch:AVX2, /arch:AVX512 etc.) under Project > Properties > Configuration Properties > C/C++ > Command Line.
- Make sure no Windows update from 03/21 or later is installed, as they make the AVX2 and AVX-512 bmerge up to 20% slower.
- Update parameters in config.h if necessary (details below). Current tuned parameters are for the Skylake-X mentioned above.
Parameters:
- _P1_NREG: Number of SIMD registers available, typically 16 or 32
- _P1_SWITCH: Switch point from mcmerge to mrmerge; obtain from bench_sort_phases:phase1_matrix_merge_test()
- _P1_N: Switch point from in-register sort to bmerge; obtain from bench_sort_phases:phase2_switch_point_test()
- _P2_MERGE_UNROLL: In-cache bmerge unroll factor; obtain from bench_bmerge:bmerge_test()
- _P2_MERGE_NREG_1x: Number of registers per stream in bmerge with no unroll
- _P2_MERGE_NREG_2x: Number of registers per stream in bmerge with 2X unroll
- _P2_MERGE_NREG_3x: Number of registers per stream in bmerge with 3X unroll
- _P3_MERGE_UNROLL: Out-of-cache bmerge unroll factor; obtain from bench_bmerge:bmerge_test()
- _P3_MERGE_NREG_1x: Number of registers per stream in bmerge with no unroll
- _P3_MERGE_NREG_2x: Number of registers per stream in bmerge with 2X unroll
- _P3_MERGE_NREG_3x: Number of registers per stream in bmerge with 3X unroll
- _MTREE_NREG: Number of registers per stream while merging within the mtree; obtain from bench_mtree:mtree_single_thread_test()
- _MT_L1_BUFF_N: Buffer size at the internal nodes of each 4-way node in the mtree
- _MT_L2_BUFF_N: Buffer size at the root of each 4-way node in the mtree
- _MIN_K: Minimum merge fan-in required to avoid the memory-bandwidth bottleneck; obtain from bench_mtree:mtree_multi_thread_test()
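To make the tuning workflow concrete, a config.h excerpt might look like the following. The macro names come from the parameter list above, but every numeric value here is an illustrative placeholder, not the shipped Skylake-X tuning; run the listed benchmark routines to find the right values for your machine.

```cpp
// Hypothetical config.h excerpt -- values are placeholders, not defaults.
#define _P1_NREG            16          // 16 SIMD registers (AVX2); AVX-512 exposes 32
#define _P1_SWITCH          64          // from bench_sort_phases:phase1_matrix_merge_test()
#define _P1_N               256         // from bench_sort_phases:phase2_switch_point_test()
#define _P2_MERGE_UNROLL    2           // in-cache unroll, from bench_bmerge:bmerge_test()
#define _P2_MERGE_NREG_2x   4           // registers per stream at the chosen unroll
#define _P3_MERGE_UNROLL    1           // out-of-cache unroll, tuned separately
#define _MTREE_NREG         4           // from bench_mtree:mtree_single_thread_test()
#define _MT_L1_BUFF_N       (1 << 10)   // internal-node buffer size of a 4-way node
#define _MT_L2_BUFF_N       (1 << 12)   // root buffer size of a 4-way node
#define _MIN_K              8           // from bench_mtree:mtree_multi_thread_test()
```

The in-cache (_P2_*) and out-of-cache (_P3_*) merge settings are tuned separately because the optimal unroll and register count differ once the streams spill past L2.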
Run:
Once all parameters are set, the Origami sort API can be used as shown in bench_sort:sort_bench().
This project is licensed under the GPLv3.0 License - see the LICENSE file for details.