_mm512_bitshuffle_epi64_mask implementation #40

Merged
merged 10 commits into from
May 18, 2020

Wunkolo
Contributor

@Wunkolo Wunkolo commented Apr 2, 2020

Typing from an Ice Lake machine (i7-1065G7)!
I can peck at this again. Even if it ends up slower in the 2D case, it will still be interesting, since it can also solve the more general case of interleaving the bits of three input variables in parallel, possibly much faster than a series of pext/pdep.
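To make the idea concrete, here is a minimal scalar sketch of what `_mm512_bitshuffle_epi64_mask` (vpshufbitqmb) computes, and how a single call could produce a 2D Morton code. It is a portable model only, not the code from this PR: the names `bitshuffle_epi64_mask_model` and `morton2d_encode_model` are hypothetical, and the packing of x and y into one broadcast qword is an assumption made for illustration.

```cpp
#include <array>
#include <cstdint>

// Scalar model of _mm512_bitshuffle_epi64_mask (vpshufbitqmb):
// for each of the 8 qwords, each of its 8 control bytes selects one bit
// (index = control & 63) of the matching data qword. The 64 selected
// bits form the resulting mask.
inline std::uint64_t bitshuffle_epi64_mask_model(
    const std::array<std::uint64_t, 8>& data,
    const std::array<std::uint8_t, 64>& control)
{
    std::uint64_t mask = 0;
    for (int q = 0; q < 8; ++q) {
        for (int b = 0; b < 8; ++b) {
            const unsigned bit = control[q * 8 + b] & 0x3Fu;
            const std::uint64_t selected = (data[q] >> bit) & 1u;
            mask |= selected << (q * 8 + b);
        }
    }
    return mask;
}

// Hypothetical helper (an assumption, not the PR's implementation):
// 2D Morton encode of two 32-bit coordinates with one bit shuffle.
inline std::uint64_t morton2d_encode_model(std::uint32_t x, std::uint32_t y)
{
    // Pack x into bits 0..31 and y into bits 32..63, then broadcast
    // the packed value to all 8 qwords.
    const std::uint64_t packed = (std::uint64_t{y} << 32) | x;
    std::array<std::uint64_t, 8> data;
    data.fill(packed);

    // Output bit n takes bit n/2 of x when n is even, of y when n is odd.
    std::array<std::uint8_t, 64> control{};
    for (int n = 0; n < 64; ++n)
        control[n] = static_cast<std::uint8_t>((n >> 1) + ((n & 1) << 5));

    return bitshuffle_epi64_mask_model(data, control);
}
```

In this model, `morton2d_encode_model(3, 0)` sets output bits 0 and 2. The same approach extends to three inputs by packing three fields into each qword and changing the control table.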

Here's measured latency data from uops.info for vpshufbitqmb.

| Instruction | Extension | Lat | TP | Uops | Ports |
|---|---|---|---|---|---|
| VPSHUFBITQMB (K, K, XMM, M128) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 3 | 1*p0+1*p23+1*p5 |
| VPSHUFBITQMB (K, K, XMM, XMM) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 2 | 1*p0+1*p5 |
| VPSHUFBITQMB (K, K, YMM, M256) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 3 | 1*p0+1*p23+1*p5 |
| VPSHUFBITQMB (K, K, YMM, YMM) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 2 | 1*p0+1*p5 |
| VPSHUFBITQMB (K, K, ZMM, M512) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 3 | 1*p0+1*p23+1*p5 |
| VPSHUFBITQMB (K, K, ZMM, ZMM) | AVX512EVEX | [1;8] | 1.00 / 1.00 | 2 | 1*p0+1*p5 |
| VPSHUFBITQMB (K, XMM, M128) | AVX512EVEX | 6 | 1.00 / 1.00 | 3 | 1*p0+1*p23+1*p5 |
| VPSHUFBITQMB (K, XMM, XMM) | AVX512EVEX | 6 | 1.00 / 1.00 | 2 | 1*p0+1*p5 |
| VPSHUFBITQMB (K, YMM, M256) | AVX512EVEX | 6 | 1.00 / 1.00 | 3 | 1*p0+1*p23+1*p5 |
| VPSHUFBITQMB (K, YMM, YMM) | AVX512EVEX | 6 | 1.00 / 1.00 | 2 | 1*p0+1*p5 |
| VPSHUFBITQMB (K, ZMM, M512) | AVX512EVEX | 6 | 1.00 / 1.00 | 3 | 1*p0+1*p23+1*p5 |
| VPSHUFBITQMB (K, ZMM, ZMM) | AVX512EVEX | 6 | 1.00 / 1.00 | 2 | 1*p0+1*p5 |

compared to pext

| Instruction | Extension | Lat | TP | Uops | Ports |
|---|---|---|---|---|---|
| PEXT (R32, R32, M32) | BMI2 | [3;8] | 1.00 / 1.00 | 2 | 1*p1+1*p23 |
| PEXT (R32, R32, R32) | BMI2 | 3 | 1.00 / 1.00 | 1 | 1*p1 |
| PEXT (R64, R64, M64) | BMI2 | [3;8] | 1.00 / 1.00 | 2 | 1*p1+1*p23 |
| PEXT (R64, R64, R64) | BMI2 | 3 | 1.00 / 1.00 | 1 | 1*p1 |

and pdep

| Instruction | Extension | Lat | TP | Uops | Ports |
|---|---|---|---|---|---|
| PDEP (R32, R32, M32) | BMI2 | [3;8] | 1.00 / 1.00 | 2 | 1*p1+1*p23 |
| PDEP (R32, R32, R32) | BMI2 | 3 | 1.00 / 1.00 | 1 | 1*p1 |
| PDEP (R64, R64, M64) | BMI2 | [3;8] | 1.00 / 1.00 | 2 | 1*p1+1*p23 |
| PDEP (R64, R64, R64) | BMI2 | 3 | 1.00 / 1.00 | 1 | 1*p1 |
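
For reference, the pext/pdep approach these numbers are compared against can be sketched with a portable software model of the two instructions. This is only an illustration: `pdep_model`, `pext_model` and the `morton2d_*` helpers are hypothetical names, and a real BMI2 build would call the `_pdep_u64` / `_pext_u64` intrinsics instead of these loops.

```cpp
#include <cstdint>

// Software model of BMI2 pdep: scatter the low bits of src to the
// positions of the set bits in mask, lowest position first.
inline std::uint64_t pdep_model(std::uint64_t src, std::uint64_t mask)
{
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1); // lowest set bit of mask
        if (src & bit) result |= lowest;
        mask ^= lowest; // clear that bit and continue
    }
    return result;
}

// Software model of BMI2 pext: gather the bits of src found at the set
// positions of mask into the low bits of the result.
inline std::uint64_t pext_model(std::uint64_t src, std::uint64_t mask)
{
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1);
        if (src & lowest) result |= bit;
        mask ^= lowest;
    }
    return result;
}

// 2D Morton encode: x goes to the even bits, y to the odd bits.
inline std::uint64_t morton2d_encode_pdep(std::uint32_t x, std::uint32_t y)
{
    return pdep_model(x, 0x5555555555555555ull) |
           pdep_model(y, 0xAAAAAAAAAAAAAAAAull);
}

// 2D Morton decode: one pext per coordinate.
inline std::uint32_t morton2d_decode_x(std::uint64_t code)
{
    return static_cast<std::uint32_t>(pext_model(code, 0x5555555555555555ull));
}

inline std::uint32_t morton2d_decode_y(std::uint64_t code)
{
    return static_cast<std::uint32_t>(pext_model(code, 0xAAAAAAAAAAAAAAAAull));
}
```

Each coordinate costs one pdep on encode and one pext on decode, so a 3D code needs three of each. That is the "series of pext/pdep" a single bit shuffle would try to replace.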

@Forceflow
Owner

Forceflow commented Apr 3, 2020

Looking really good. Don't forget to run libmorton_test to easily spot errors.

Fully implements encoding for the 2D and 3D cases.
Not particularly optimized, but passes all the tests.
Uses the BMI2 implementation as a placeholder for decoding, for verification.

Current comparison against BMI2:

```

CPU:       Topology: Quad Core model: Intel Core i7-1065G7 bits: 64 type: MT MCP L2 cache: 8192 KiB
           Speed: 834 MHz min/max: 400/3900 MHz Core speeds (MHz): 1: 969 2: 1018 3: 923 4: 1037 5: 708 6: 992 7: 613 8: 691

    02.050 ms 1.818 ms : 64-bit BMI2 instruction set
    04.213 ms 4.208 ms : 64-bit AVX512 instruction set
    1.835 ms 1.803 ms : 32-bit BMI2 instruction set
    4.222 ms 4.216 ms : 32-bit AVX512 instruction set
    2.021 ms 5.489 ms : 64-bit BMI2 Instruction set
    1.952 ms 5.462 ms : 64-bit AVX512 Instruction set
    01.964 ms 5.285 ms : 32-bit BMI2 Instruction set
    01.964 ms 5.517 ms : 32-bit AVX512 Instruction set
    14.823 ms 14.787 ms : 64-bit BMI2 instruction set
    33.683 ms 34.622 ms : 64-bit AVX512 instruction set
    14.487 ms 14.675 ms : 32-bit BMI2 instruction set
    33.629 ms 33.935 ms : 32-bit AVX512 instruction set
    15.448 ms 41.621 ms : 64-bit BMI2 Instruction set
    15.282 ms 43.666 ms : 64-bit AVX512 Instruction set
    15.635 ms 42.059 ms : 32-bit BMI2 Instruction set
    16.512 ms 44.433 ms : 32-bit AVX512 Instruction set
    137.746 ms 135.549 ms : 64-bit BMI2 instruction set
    314.947 ms 325.968 ms : 64-bit AVX512 instruction set
    136.730 ms 132.937 ms : 32-bit BMI2 instruction set
    315.256 ms 321.689 ms : 32-bit AVX512 instruction set
    141.504 ms 374.482 ms : 64-bit BMI2 Instruction set
    136.387 ms 381.539 ms : 64-bit AVX512 Instruction set
    153.533 ms 370.826 ms : 32-bit BMI2 Instruction set
    136.177 ms 377.794 ms : 32-bit AVX512 Instruction set
```
Updated benchmarks using a proper release-mode build:

```
CPU:       Topology: Quad Core model: Intel Core i7-1065G7 bits: 64 type: MT MCP L2 cache: 8192 KiB
           Speed: 950 MHz min/max: 400/3900 MHz Core speeds (MHz): 1: 864 2: 958 3: 615 4: 806 5: 977 6: 704 7: 702 8: 840

    00.915 ms 0.854 ms : 64-bit BMI2 instruction set
    01.060 ms 1.033 ms : 64-bit AVX512 instruction set
    0.881 ms 0.841 ms : 32-bit BMI2 instruction set
    1.065 ms 1.102 ms : 32-bit AVX512 instruction set
    1.082 ms 4.436 ms : 64-bit BMI2 Instruction set
    0.894 ms 4.216 ms : 64-bit AVX512 Instruction set
    00.933 ms 5.062 ms : 32-bit BMI2 Instruction set
    01.075 ms 4.701 ms : 32-bit AVX512 Instruction set
    08.231 ms 6.995 ms : 64-bit BMI2 instruction set
    08.235 ms 8.252 ms : 64-bit AVX512 instruction set
    7.091 ms 6.755 ms : 32-bit BMI2 instruction set
    8.238 ms 8.264 ms : 32-bit AVX512 instruction set
    6.836 ms 33.431 ms : 64-bit BMI2 Instruction set
    6.872 ms 33.619 ms : 64-bit AVX512 Instruction set
    06.908 ms 33.352 ms : 32-bit BMI2 Instruction set
    06.839 ms 33.265 ms : 32-bit AVX512 Instruction set
    57.233 ms 57.425 ms : 64-bit BMI2 instruction set
    65.779 ms 66.391 ms : 64-bit AVX512 instruction set
    56.659 ms 55.211 ms : 32-bit BMI2 instruction set
    65.746 ms 68.178 ms : 32-bit AVX512 instruction set
    55.670 ms 268.382 ms : 64-bit BMI2 Instruction set
    55.339 ms 269.114 ms : 64-bit AVX512 Instruction set
    60.553 ms 291.424 ms : 32-bit BMI2 Instruction set
    59.651 ms 293.627 ms : 32-bit AVX512 Instruction set
```
@Wunkolo
Contributor Author

Wunkolo commented Apr 6, 2020

Got a first MVP implemented now for encoding and decoding. Its performance is give-or-take against the pext/pdep methods at the moment, so I'll exhaust some optimization ideas, and then this branch will be ready for review and merge.

I found that the 128-bit lane boundaries of the usual _mm512_shuffle_epi8 limited the 3D options, so I had to use AVX512_VBMI's _mm512_permutexvar_epi8 to get a true cross-lane byte shuffle, which adds to the AVX512 subset requirements. I'm going to try to limit some of this so that it depends on the smallest possible subset of AVX512 features.
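
The lane restriction can be illustrated with a small scalar model of the two byte shuffles. This is a sketch for illustration only: `shuffle_epi8_model`, `permutexvar_epi8_model`, `iota_bytes` and `splat_bytes` are hypothetical names, and the loops only mirror the documented behaviour of the intrinsics rather than call them.

```cpp
#include <array>
#include <cstdint>

using Bytes64 = std::array<std::uint8_t, 64>;

// Model of _mm512_shuffle_epi8(a, idx): each output byte can only pick
// from the 16 bytes of its own 128-bit lane (low 4 index bits), and an
// index with the high bit set yields zero.
inline Bytes64 shuffle_epi8_model(const Bytes64& a, const Bytes64& idx)
{
    Bytes64 out{};
    for (int i = 0; i < 64; ++i) {
        if (idx[i] & 0x80) continue;   // high bit set: output byte stays zero
        const int lane_base = i & ~15; // first byte of this 128-bit lane
        out[i] = a[lane_base + (idx[i] & 0x0F)];
    }
    return out;
}

// Model of _mm512_permutexvar_epi8(idx, a) from AVX512_VBMI: each output
// byte can pick any of the 64 source bytes (low 6 index bits).
inline Bytes64 permutexvar_epi8_model(const Bytes64& idx, const Bytes64& a)
{
    Bytes64 out{};
    for (int i = 0; i < 64; ++i)
        out[i] = a[idx[i] & 0x3F];
    return out;
}

// Test helpers: bytes 0, 1, ..., 63 and a vector filled with one value.
inline Bytes64 iota_bytes()
{
    Bytes64 v{};
    for (int i = 0; i < 64; ++i) v[i] = static_cast<std::uint8_t>(i);
    return v;
}

inline Bytes64 splat_bytes(std::uint8_t value)
{
    Bytes64 v{};
    v.fill(value);
    return v;
}
```

With every index set to 63, the in-lane model returns byte 15 of each lane, while the full permute returns source byte 63 everywhere. Reaching across lanes like this is what a 3D layout needs when its source bytes are not confined to one 128-bit lane.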

@Forceflow
Owner

Cool - tell me when you want me to review/merge

@Wunkolo Wunkolo marked this pull request as ready for review May 12, 2020 18:40
@Forceflow Forceflow merged commit e7ed229 into Forceflow:master May 18, 2020