_mm512_bitshuffle_epi64_mask implementation #40

Wunkolo · 2020-04-02T23:56:13Z

Typing from an Ice Lake machine (i7-1065G7)!
I can peck at this again. Even if it ends up slower in the 2D case, it'll still be interesting: it can also solve the more general case of interleaving the bits of three input variables in parallel, possibly much faster than a series of pext/pdep.
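For readers unfamiliar with the pdep approach mentioned above, here is a minimal portable sketch of a 3D Morton encode. It is an illustration, not the code from this branch: `pdep_soft` and `morton3d_encode_sketch` are hypothetical names, the software loop stands in for the BMI2 `pdep` instruction, and placing x in the lowest bit of each triple is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for BMI2 PDEP (hypothetical helper): scatter the low
// bits of `src` into the positions of the set bits of `mask`, lowest first.
inline std::uint64_t pdep_soft(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1); // isolate lowest set bit
        if (src & bit) result |= lowest;
        mask ^= lowest; // consume that mask position
    }
    return result;
}

// 3D Morton encode sketch: bit i of x, y, z lands at bits 3i, 3i+1, 3i+2.
// Assumption: x occupies the lowest bit of each triple.
inline std::uint64_t morton3d_encode_sketch(std::uint32_t x, std::uint32_t y,
                                            std::uint32_t z) {
    // Three 21-bit coordinates fill 63 bits of the result.
    assert(x < (1u << 21) && y < (1u << 21) && z < (1u << 21));
    constexpr std::uint64_t x_mask = 0x1249249249249249ull; // bits 0, 3, 6, ..., 60
    return pdep_soft(x, x_mask)
         | pdep_soft(y, x_mask << 1)
         | pdep_soft(z, x_mask << 2);
}
```

With hardware BMI2, each `pdep_soft` call would be a single `pdep`, so a 3D encode costs three of them plus two ORs. That is the "series of pext/pdep" a bit-shuffle approach would be competing against.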

Here's measured latency data from uops.info for vpshufbitqmb.

| Instruction | ISA set | Lat | TP | Uops | Ports |
|---|---|---|---|---|---|
| VPSHUFBITQMB (K, K, XMM, M128) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 3 | `1*p0+1*p23+1*p5` |
| VPSHUFBITQMB (K, K, XMM, XMM) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 2 | `1*p0+1*p5` |
| VPSHUFBITQMB (K, K, YMM, M256) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 3 | `1*p0+1*p23+1*p5` |
| VPSHUFBITQMB (K, K, YMM, YMM) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 2 | `1*p0+1*p5` |
| VPSHUFBITQMB (K, K, ZMM, M512) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 3 | `1*p0+1*p23+1*p5` |
| VPSHUFBITQMB (K, K, ZMM, ZMM) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 2 | `1*p0+1*p5` |
| VPSHUFBITQMB (K, XMM, M128) | AVX512EVEX | 6 | 1.00 / 1.00 | 3 | `1*p0+1*p23+1*p5` |
| VPSHUFBITQMB (K, XMM, XMM) | AVX512EVEX | 6 | 1.00 / 1.00 | 2 | `1*p0+1*p5` |
| VPSHUFBITQMB (K, YMM, M256) | AVX512EVEX | 6 | 1.00 / 1.00 | 3 | `1*p0+1*p23+1*p5` |
| VPSHUFBITQMB (K, YMM, YMM) | AVX512EVEX | 6 | 1.00 / 1.00 | 2 | `1*p0+1*p5` |
| VPSHUFBITQMB (K, ZMM, M512) | AVX512EVEX | 6 | 1.00 / 1.00 | 3 | `1*p0+1*p23+1*p5` |
| VPSHUFBITQMB (K, ZMM, ZMM) | AVX512EVEX | 6 | 1.00 / 1.00 | 2 | `1*p0+1*p5` |

compared to pext

| Instruction | ISA set | Lat | TP | Uops | Ports |
|---|---|---|---|---|---|
| PEXT (R32, R32, M32) | BMI2 | [3;8] | 1.00 / 1.00 | 2 | `1*p1+1*p23` |
| PEXT (R32, R32, R32) | BMI2 | 3 | 1.00 / 1.00 | 1 | `1*p1` |
| PEXT (R64, R64, M64) | BMI2 | [3;8] | 1.00 / 1.00 | 2 | `1*p1+1*p23` |
| PEXT (R64, R64, R64) | BMI2 | 3 | 1.00 / 1.00 | 1 | `1*p1` |

and pdep

| Instruction | ISA set | Lat | TP | Uops | Ports |
|---|---|---|---|---|---|
| PDEP (R32, R32, M32) | BMI2 | [3;8] | 1.00 / 1.00 | 2 | `1*p1+1*p23` |
| PDEP (R32, R32, R32) | BMI2 | 3 | 1.00 / 1.00 | 1 | `1*p1` |
| PDEP (R64, R64, M64) | BMI2 | [3;8] | 1.00 / 1.00 | 2 | `1*p1+1*p23` |
| PDEP (R64, R64, R64) | BMI2 | 3 | 1.00 / 1.00 | 1 | `1*p1` |
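
For context on what the instruction in the first table computes, below is a scalar model of the 512-bit form of vpshufbitqmb (`_mm512_bitshuffle_epi64_mask`), written from the documented semantics. The `Zmm` alias and the function name are hypothetical, and the model is meant for checking results on machines without AVX512_BITALG, not for speed.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for a 512-bit register viewed as eight 64-bit lanes.
using Zmm = std::array<std::uint64_t, 8>;

// Scalar model of VPSHUFBITQMB on ZMM operands: for every byte of `control`,
// the low 6 bits select one bit out of the 64-bit lane of `data` that the
// control byte sits in. The 64 selected bits form the result mask.
inline std::uint64_t bitshuffle_epi64_mask_model(const Zmm& data,
                                                 const Zmm& control) {
    std::uint64_t mask = 0;
    for (int q = 0; q < 8; ++q) {         // 64-bit lane
        for (int b = 0; b < 8; ++b) {     // control byte within the lane
            const unsigned index = (control[q] >> (8 * b)) & 0x3F;
            const std::uint64_t bit = (data[q] >> index) & 1;
            mask |= bit << (8 * q + b);   // one mask bit per control byte
        }
    }
    return mask;
}
```

Each control byte can only pick bits from its own 64-bit lane, so one instruction yields 8 gathered bits per lane. Getting coordinate bits into the right lanes beforehand is a separate byte-shuffle step.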

Forceflow · 2020-04-03T02:08:40Z

Looking really good. Don't forget to run libmorton_test to easily spot errors.

Fully implements encoding for the 2D and 3D cases. Not particularly optimized, but it passes all the tests. Decoding uses the BMI2 implementation as a placeholder, for verification.

Current comparison against BMI2:

```
CPU:
  Topology: Quad Core model: Intel Core i7-1065G7 bits: 64 type: MT MCP L2 cache: 8192 KiB
  Speed: 834 MHz min/max: 400/3900 MHz
  Core speeds (MHz): 1: 969 2: 1018 3: 923 4: 1037 5: 708 6: 992 7: 613 8: 691

02.050 ms 1.818 ms : 64-bit BMI2 instruction set
04.213 ms 4.208 ms : 64-bit AVX512 instruction set
1.835 ms 1.803 ms : 32-bit BMI2 instruction set
4.222 ms 4.216 ms : 32-bit AVX512 instruction set

2.021 ms 5.489 ms : 64-bit BMI2 Instruction set
1.952 ms 5.462 ms : 64-bit AVX512 Instruction set
01.964 ms 5.285 ms : 32-bit BMI2 Instruction set
01.964 ms 5.517 ms : 32-bit AVX512 Instruction set

14.823 ms 14.787 ms : 64-bit BMI2 instruction set
33.683 ms 34.622 ms : 64-bit AVX512 instruction set
14.487 ms 14.675 ms : 32-bit BMI2 instruction set
33.629 ms 33.935 ms : 32-bit AVX512 instruction set

15.448 ms 41.621 ms : 64-bit BMI2 Instruction set
15.282 ms 43.666 ms : 64-bit AVX512 Instruction set
15.635 ms 42.059 ms : 32-bit BMI2 Instruction set
16.512 ms 44.433 ms : 32-bit AVX512 Instruction set

137.746 ms 135.549 ms : 64-bit BMI2 instruction set
314.947 ms 325.968 ms : 64-bit AVX512 instruction set
136.730 ms 132.937 ms : 32-bit BMI2 instruction set
315.256 ms 321.689 ms : 32-bit AVX512 instruction set

141.504 ms 374.482 ms : 64-bit BMI2 Instruction set
136.387 ms 381.539 ms : 64-bit AVX512 Instruction set
153.533 ms 370.826 ms : 32-bit BMI2 Instruction set
136.177 ms 377.794 ms : 32-bit AVX512 Instruction set
```

Updated benchmarks using a proper release-mode build:

```
CPU:
  Topology: Quad Core model: Intel Core i7-1065G7 bits: 64 type: MT MCP L2 cache: 8192 KiB
  Speed: 950 MHz min/max: 400/3900 MHz
  Core speeds (MHz): 1: 864 2: 958 3: 615 4: 806 5: 977 6: 704 7: 702 8: 840

00.915 ms 0.854 ms : 64-bit BMI2 instruction set
01.060 ms 1.033 ms : 64-bit AVX512 instruction set
0.881 ms 0.841 ms : 32-bit BMI2 instruction set
1.065 ms 1.102 ms : 32-bit AVX512 instruction set

1.082 ms 4.436 ms : 64-bit BMI2 Instruction set
0.894 ms 4.216 ms : 64-bit AVX512 Instruction set
00.933 ms 5.062 ms : 32-bit BMI2 Instruction set
01.075 ms 4.701 ms : 32-bit AVX512 Instruction set

08.231 ms 6.995 ms : 64-bit BMI2 instruction set
08.235 ms 8.252 ms : 64-bit AVX512 instruction set
7.091 ms 6.755 ms : 32-bit BMI2 instruction set
8.238 ms 8.264 ms : 32-bit AVX512 instruction set

6.836 ms 33.431 ms : 64-bit BMI2 Instruction set
6.872 ms 33.619 ms : 64-bit AVX512 Instruction set
06.908 ms 33.352 ms : 32-bit BMI2 Instruction set
06.839 ms 33.265 ms : 32-bit AVX512 Instruction set

57.233 ms 57.425 ms : 64-bit BMI2 instruction set
65.779 ms 66.391 ms : 64-bit AVX512 instruction set
56.659 ms 55.211 ms : 32-bit BMI2 instruction set
65.746 ms 68.178 ms : 32-bit AVX512 instruction set

55.670 ms 268.382 ms : 64-bit BMI2 Instruction set
55.339 ms 269.114 ms : 64-bit AVX512 Instruction set
60.553 ms 291.424 ms : 32-bit BMI2 Instruction set
59.651 ms 293.627 ms : 32-bit AVX512 Instruction set
```

Wunkolo · 2020-04-06T18:05:11Z

Got a first MVP implemented now for encoding and decoding. Its performance is give or take against the pext/pdep methods at the moment, so I'll exhaust some optimization ideas and then this branch will be ready for a merge/review.

I found that the 128-bit lane boundaries of the usual _mm512_shuffle_epi8 limited the 3D options, and I had to use AVX512_VBMI's _mm512_permutexvar_epi8 to get a true full-width shuffle, which adds to the AVX512 subset requirements. I'm going to try to limit some of this so that it works with the smallest possible subset of AVX512 features.
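
To make the lane-boundary limitation concrete, here is a scalar model of the two byte shuffles, written from their documented semantics. The `Bytes64` alias and both function names are hypothetical. It shows the general difference only and is not the shuffle tables used in this branch.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for a 512-bit register viewed as 64 bytes.
using Bytes64 = std::array<std::uint8_t, 64>;

// Model of _mm512_shuffle_epi8 (vpshufb): each output byte can only read
// from the 16-byte lane it lives in. Only the low 4 index bits are used,
// and a set top bit zeroes the output byte.
inline Bytes64 shuffle_epi8_model(const Bytes64& src, const Bytes64& idx) {
    Bytes64 out{};
    for (int i = 0; i < 64; ++i) {
        const int lane_base = i & ~15; // start of this byte's 128-bit lane
        out[i] = (idx[i] & 0x80)
                     ? std::uint8_t{0}
                     : src[lane_base + (idx[i] & 0x0F)];
    }
    return out;
}

// Model of _mm512_permutexvar_epi8 (vpermb, AVX512_VBMI): the low 6 index
// bits address any of the 64 source bytes, so lanes can be crossed.
inline Bytes64 permutexvar_epi8_model(const Bytes64& idx, const Bytes64& src) {
    Bytes64 out{};
    for (int i = 0; i < 64; ++i) {
        out[i] = src[idx[i] & 0x3F];
    }
    return out;
}
```

With an index of 40 in every position, the first model can only return byte 8 of each lane, while the second returns source byte 40 everywhere.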

Forceflow · 2020-04-08T16:19:23Z

Cool - tell me when you want me to review/merge

Wunkolo added 5 commits July 10, 2019 16:13

Init AVX512BITALG libmorton driver

328a536

First draft AVX512BITALG implementation

9bee922

First pass AVX512 morton encodes

1ba1717

Add AVX512 icelake-client target

bb580a0

Add AVX512 to testbed

149b802

Wunkolo added 5 commits April 4, 2020 20:04

Merge branch 'master' into avx512-icelake

921a5ee

Update testbed preprocessors

53f538d

Draft first pass AVX512BITALG decoding

614b78d

Wunkolo marked this pull request as ready for review May 12, 2020 18:40

Forceflow merged commit e7ed229 into Forceflow:master May 18, 2020
