This project is an HLSL compute shader implementation of the current state-of-the-art GPU sorting algorithm, Adinets and Merrill's OneSweep, an LSD radix sort that uses Merrill and Garland's Chained Scan with Decoupled Lookback to reduce the overall global data movement during a digit-binning pass from $3n$ to $2n$. Unfortunately, HLSL lacks `reinterpret_cast` functionality that would allow us to easily perform vectorized loading and then transposition of the keys to a striped format. The previous fastest compute shader sorting implementation I could find was by far Dondragmer's PrefixSort, which is also an 8-bit LSD radix sort and also incorporates Ashkiani et al.'s GpuMultiSplit, but uses the older reduce-then-scan approach rather than Chained Scan with Decoupled Lookback.
- Download or clone the repository.
- Drag the contents of `src` into a desired folder within a Unity project.
- Each sort has a compute shader and a dispatcher. Attach the desired sort's dispatcher to an empty game object. All sort dispatchers are named `SortNameHere.cs`.
- Attach the matching compute shader to the game object. All compute shaders are named `SortNameHere.compute`. The dispatcher will return an error if you attach the wrong shader.
- Ensure the slider is set to a nonzero value.
If you did this correctly, you should see this in the inspector:
Testing Suite
Every sort dispatcher has a suite of tests which can be controlled directly from the inspector.
- `Validate Sort` performs a sort at an input size of $2^{SizeExponent}$, with the input being the decreasing series of integers.
- `Validate Random` performs a sort at an input size of $2^{SizeExponent}$, with the input being a randomly generated set of integers.
- `Single Pass Timing Test` times the execution of a single digit-binning pass, at an input size of $2^{SizeExponent}$, with the input being the decreasing series of integers.
- `All Pass Timing Test` times the execution of the entire sort, at an input size of $2^{SizeExponent}$, with the input being the decreasing series of integers.
- `All Pass Timing Test Random` times the execution of the entire sort, with the input being a randomly generated set of integers. Because randomly generated integers have higher entropy than the decreasing series of integers, this test demonstrates significantly better performance.
- `Record Timing Data` performs 2000 iterations of `All Pass Timing Test Random`, then logs the results in a `.csv` file.
- `ValidateText` prints any errors found during a validation test in the debug log. This can be quite slow if there are many errors, so it is recommended to also have `QuickText` enabled.
- `QuickText` limits the number of errors printed during a validation test to 1024.
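Conceptually, the validation tests only need to confirm that sorting the input yields an ascending sequence; a hypothetical Python sketch of that check (the names here are illustrative, not the dispatcher's actual code):

```python
def validate_decreasing_input(size_exponent, sort_fn):
    """Sort the decreasing series of 2^size_exponent integers and collect error indices."""
    n = 1 << size_exponent
    keys = list(range(n, 0, -1))  # the decreasing series used by Validate Sort
    out = sort_fn(keys)
    # An error is any adjacent pair that is out of order; an empty list means the sort passed.
    return [i for i in range(n - 1) if out[i] > out[i + 1]]
```

A `QuickText`-style cap would simply truncate the returned error list to its first 1024 entries before printing.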
Currently, this project does not work on AMD or integrated graphics hardware.
Unfortunately, AMD, Nvidia, and integrated graphics hardware usually have different wave sizes, which means that code that synchronizes threads at the wave level, as we do, must be manually tuned for each hardware case. Because Unity does not support runtime compilation of compute shaders, we cannot poll the hardware at runtime to compile a targeted shader variant. Although Unity does have `multi_compile` functionality, it is a cumbersome solution because it means maintaining and compiling a copy of each kernel for each hardware case.
DX12 is required, as is a minimum Unity version of 2021.1 or later. Because we make heavy use of wave intrinsics, we need `#pragma use_dxc` to access shader model 6.0.
To the best of my knowledge, there is no way to time the execution of kernels in situ in HLSL, so we make do by issuing an `AsyncGPUReadbackRequest` on a single-element buffer, `b_timing`, whose sole purpose is to be the target of the readback request. Because `b_timing` is only a single element, any latency introduced by the readback is negligible. So we create a coroutine, take a timestamp, dispatch the kernels, make the readback request, then wait until the readback completes, like so:
private IEnumerator AllPassTimingTestRandom(int _size)
{
    breaker = false;
    ResetBuffersRandom();

    // Initial readback to ensure the initialization kernel has completed.
    AsyncGPUReadbackRequest request = AsyncGPUReadback.Request(passHistFour);
    yield return new WaitUntil(() => request.done);

    float time = Time.realtimeSinceStartup;
    compute.Dispatch(k_globalHist, globalHistThreadBlocks, 1, 1);
    compute.Dispatch(k_scatterOne, binningThreadBlocks, 1, 1);
    compute.Dispatch(k_scatterTwo, binningThreadBlocks, 1, 1);
    compute.Dispatch(k_scatterThree, binningThreadBlocks, 1, 1);
    compute.Dispatch(k_scatterFour, binningThreadBlocks, 1, 1);

    // The readback on the single-element timing buffer completes only
    // after the dispatched kernels finish.
    request = AsyncGPUReadback.Request(timingBuffer);
    yield return new WaitUntil(() => request.done);
    time = Time.realtimeSinceStartup - time;

    Debug.Log("Raw Time: " + time);
    Debug.Log("Estimated Speed: " + (_size / time) + " ele/sec.");
    breaker = true;
}
To make sure that our initialization kernel is complete, we make an initial readback request beforehand.
Accurately testing Dondragmer's PrefixSort was challenging because of its limited maximum input size.
To test PrefixSort, we made a modified version of its dispatcher as follows:
private IEnumerator RecordTimingData()
{
    breaker = false;
    Debug.Log("Beginning timing test.");
    List<string> csv = new List<string>();

    ProcessControlsAndEditorSettings();
    int numBlocks = Mathf.CeilToInt(m_numElements / 1024.0f);
    m_sortShader.SetInt("_NumElements", m_numElements);
    m_sortShader.SetInt("_NumBlocks", numBlocks);
    m_sortShader.SetBool("_ShouldSortPayload", m_shouldSortPayload);

    for (int k = 0; k < 500; ++k)
    {
        m_sortShader.SetInt("e_seed", k + 1);
        m_sortShader.SetBuffer(k_init, "KeyOutputBuffer", m_keysBufferA);
        m_sortShader.Dispatch(k_init, 256, 1, 1);
        AsyncGPUReadbackRequest request = AsyncGPUReadback.Request(m_keysBufferA);
        yield return new WaitUntil(() => request.done);

        float time = Time.realtimeSinceStartup;
        for (int i = 0; i < 4; i++)
        {
            m_sortShader.SetInt("_FirstBitToSort", i * 8);
            // Flip the buffers every other sort.
            if ((i % 2) == 0)
            {
                m_sortShader.SetBuffer(m_countTotalsKernel, "KeyInputBuffer", m_keysBufferA);
                m_sortShader.SetBuffer(m_finalSortKernel, "KeyInputBuffer", m_keysBufferA);
                m_sortShader.SetBuffer(m_finalSortKernel, "KeyOutputBuffer", m_keysBufferB);
                m_sortShader.SetBuffer(m_finalSortKernel, "PayloadInputBuffer", m_payloadBufferA);
                m_sortShader.SetBuffer(m_finalSortKernel, "PayloadOutputBuffer", m_payloadBufferB);
            }
            else
            {
                m_sortShader.SetBuffer(m_countTotalsKernel, "KeyInputBuffer", m_keysBufferB);
                m_sortShader.SetBuffer(m_finalSortKernel, "KeyInputBuffer", m_keysBufferB);
                m_sortShader.SetBuffer(m_finalSortKernel, "KeyOutputBuffer", m_keysBufferA);
                m_sortShader.SetBuffer(m_finalSortKernel, "PayloadInputBuffer", m_payloadBufferB);
                m_sortShader.SetBuffer(m_finalSortKernel, "PayloadOutputBuffer", m_payloadBufferA);
            }
            m_sortShader.Dispatch(m_countTotalsKernel, numBlocks, 1, 1);
            m_sortShader.Dispatch(m_blockPostfixKernel, 1, 64, 1);
            m_sortShader.Dispatch(m_calculateOffsetsKernel, numBlocks, 1, 1);
            m_sortShader.Dispatch(m_finalSortKernel, numBlocks, 1, 1);
        }
        request = AsyncGPUReadback.Request(timingBuffer);
        yield return new WaitUntil(() => request.done);
        time = Time.realtimeSinceStartup - time;

        if (k != 0)
            csv.Add("" + time);
        if ((k & 31) == 0)
            Debug.Log("Running");
    }

    StreamWriter sWriter = new StreamWriter("DonDragmer.csv");
    sWriter.WriteLine("Total Time DonDragmer");
    foreach (string str in csv)
        sWriter.WriteLine(str);
    sWriter.Close();
    Debug.Log("Test Complete");
    breaker = true;
}
Due to stalling issues with the original method of setting inputs from the CPU side, and to ensure a fair comparison between the algorithms, I added an initialization kernel that uses a PRNG identical to the one in my own implementation. Because we use identical PRNGs and identical seed values, the inputs for the tests are identical:
extern int e_seed; // Seed value set from the CPU.

#define TAUS_STEP_1 ((z1 & 4294967294U) << 12) ^ (((z1 << 13) ^ z1) >> 19)
#define TAUS_STEP_2 ((z2 & 4294967288U) << 4) ^ (((z2 << 2) ^ z2) >> 25)
#define TAUS_STEP_3 ((z3 & 4294967280U) << 17) ^ (((z3 << 3) ^ z3) >> 11)
#define LCG_STEP (z4 * 1664525 + 1013904223U)
#define HYBRID_TAUS ((z1 ^ z2 ^ z3 ^ z4) & 268435455)

[numthreads(1024, 1, 1)]
void Init(int3 id : SV_DispatchThreadID)
{
    // Give each thread its own generator state, perturbed by the seed.
    uint z1 = (id.x << 2) * e_seed;
    uint z2 = ((id.x << 2) + 1) * e_seed;
    uint z3 = ((id.x << 2) + 2) * e_seed;
    uint z4 = ((id.x << 2) + 3) * e_seed;

    // Grid-stride loop: 256 thread blocks of 1024 threads cover the input.
    for (int i = id.x; i < _NumElements; i += 1024 * 256)
    {
        z1 = TAUS_STEP_1;
        z2 = TAUS_STEP_2;
        z3 = TAUS_STEP_3;
        z4 = LCG_STEP;
        KeyOutputBuffer[i] = HYBRID_TAUS;
    }
}
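For reference, the generator above can be transcribed to Python. This is my own transcription (with the explicit 32-bit masking that HLSL's `uint` arithmetic performs implicitly), useful for reproducing one thread's test inputs on the CPU:

```python
MASK32 = 0xFFFFFFFF  # emulate HLSL's wrapping 32-bit uint arithmetic

def hybrid_taus(thread_id, seed, count):
    """Reproduce one thread's Hybrid Taus output stream from the Init kernel."""
    z1 = ((thread_id << 2) * seed) & MASK32
    z2 = (((thread_id << 2) + 1) * seed) & MASK32
    z3 = (((thread_id << 2) + 2) * seed) & MASK32
    z4 = (((thread_id << 2) + 3) * seed) & MASK32
    out = []
    for _ in range(count):
        # Three Tausworthe steps plus one LCG step, as in the macros above.
        z1 = (((z1 & 4294967294) << 12) ^ ((((z1 << 13) & MASK32) ^ z1) >> 19)) & MASK32
        z2 = (((z2 & 4294967288) << 4) ^ ((((z2 << 2) & MASK32) ^ z2) >> 25)) & MASK32
        z3 = (((z3 & 4294967280) << 17) ^ ((((z3 << 3) & MASK32) ^ z3) >> 11)) & MASK32
        z4 = (z4 * 1664525 + 1013904223) & MASK32
        # HYBRID_TAUS masks the combined state down to 28 bits.
        out.append((z1 ^ z2 ^ z3 ^ z4) & 268435455)
    return out
```

Identical `(thread_id, seed)` pairs always yield identical streams, which is exactly the property the fair-comparison argument relies on.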
Andy Adinets and Duane Merrill. "Onesweep: A Faster Least Significant Digit Radix Sort for GPUs". 2022. arXiv: 2206.01784 [cs.DC].

Duane Merrill and Michael Garland. "Single-pass Parallel Prefix Scan with Decoupled Look-back". 2016. url: https://research.nvidia.com/publication/2016-03_single-pass-parallel-prefix-scan-decoupled-look-back

Saman Ashkiani et al. "GPU Multisplit". In: SIGPLAN Not. 51.8 (Feb. 2016). issn: 0362-1340. doi: 10.1145/3016078.2851169.