-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
_mm512_bitshuffle_epi64_mask implementation #40
Conversation
Looking really good. Don't forget to run libmorton_test to easily spot errors. |
Implements fully implements encoding for 2D and 3D cases. Not particular optimized but passes all the tests. Uses BMI2 placeholder implementation for decoding for verification Current comparison against BMI2 ``` CPU: Topology: Quad Core model: Intel Core i7-1065G7 bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 834 MHz min/max: 400/3900 MHz Core speeds (MHz): 1: 969 2: 1018 3: 923 4: 1037 5: 708 6: 992 7: 613 8: 691 02.050 ms 1.818 ms : 64-bit BMI2 instruction set 04.213 ms 4.208 ms : 64-bit AVX512 instruction set 1.835 ms 1.803 ms : 32-bit BMI2 instruction set 4.222 ms 4.216 ms : 32-bit AVX512 instruction set 2.021 ms 5.489 ms : 64-bit BMI2 Instruction set 1.952 ms 5.462 ms : 64-bit AVX512 Instruction set 01.964 ms 5.285 ms : 32-bit BMI2 Instruction set 01.964 ms 5.517 ms : 32-bit AVX512 Instruction set 14.823 ms 14.787 ms : 64-bit BMI2 instruction set 33.683 ms 34.622 ms : 64-bit AVX512 instruction set 14.487 ms 14.675 ms : 32-bit BMI2 instruction set 33.629 ms 33.935 ms : 32-bit AVX512 instruction set 15.448 ms 41.621 ms : 64-bit BMI2 Instruction set 15.282 ms 43.666 ms : 64-bit AVX512 Instruction set 15.635 ms 42.059 ms : 32-bit BMI2 Instruction set 16.512 ms 44.433 ms : 32-bit AVX512 Instruction set 137.746 ms 135.549 ms : 64-bit BMI2 instruction set 314.947 ms 325.968 ms : 64-bit AVX512 instruction set 136.730 ms 132.937 ms : 32-bit BMI2 instruction set 315.256 ms 321.689 ms : 32-bit AVX512 instruction set 141.504 ms 374.482 ms : 64-bit BMI2 Instruction set 136.387 ms 381.539 ms : 64-bit AVX512 Instruction set 153.533 ms 370.826 ms : 32-bit BMI2 Instruction set 136.177 ms 377.794 ms : 32-bit AVX512 Instruction set ```
Updated benchmarks using a proper release-mode build ``` CPU: Topology: Quad Core model: Intel Core i7-1065G7 bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 950 MHz min/max: 400/3900 MHz Core speeds (MHz): 1: 864 2: 958 3: 615 4: 806 5: 977 6: 704 7: 702 8: 840 00.915 ms 0.854 ms : 64-bit BMI2 instruction set 01.060 ms 1.033 ms : 64-bit AVX512 instruction set 0.881 ms 0.841 ms : 32-bit BMI2 instruction set 1.065 ms 1.102 ms : 32-bit AVX512 instruction set 1.082 ms 4.436 ms : 64-bit BMI2 Instruction set 0.894 ms 4.216 ms : 64-bit AVX512 Instruction set 00.933 ms 5.062 ms : 32-bit BMI2 Instruction set 01.075 ms 4.701 ms : 32-bit AVX512 Instruction set 08.231 ms 6.995 ms : 64-bit BMI2 instruction set 08.235 ms 8.252 ms : 64-bit AVX512 instruction set 7.091 ms 6.755 ms : 32-bit BMI2 instruction set 8.238 ms 8.264 ms : 32-bit AVX512 instruction set 6.836 ms 33.431 ms : 64-bit BMI2 Instruction set 6.872 ms 33.619 ms : 64-bit AVX512 Instruction set 06.908 ms 33.352 ms : 32-bit BMI2 Instruction set 06.839 ms 33.265 ms : 32-bit AVX512 Instruction set 57.233 ms 57.425 ms : 64-bit BMI2 instruction set 65.779 ms 66.391 ms : 64-bit AVX512 instruction set 56.659 ms 55.211 ms : 32-bit BMI2 instruction set 65.746 ms 68.178 ms : 32-bit AVX512 instruction set 55.670 ms 268.382 ms : 64-bit BMI2 Instruction set 55.339 ms 269.114 ms : 64-bit AVX512 Instruction set 60.553 ms 291.424 ms : 32-bit BMI2 Instruction set 59.651 ms 293.627 ms : 32-bit AVX512 Instruction set ```
Got a first MVP implemented now for encoding and decoding. It's give or take in terms of its performance against pexp/pdep methods at the moment. So I'll be exhausting some optimization ideas and then this branch will be ready for a merge/review. I found that the usual |
Cool - tell me when you want me to review/merge |
Typing from an Icelake machine(
i7-1065G7
)!I can peck at this again, even if it ends up slower in the 2D case it'll still be interesting as it can also solve the more general case of trying to interleave the bits of three input variables in parallel possibly much faster than a series of
pext
/pdep
.Here's measured latency data from uops.info for
vpshufbitqmb
.compared to
pext
and
pdep