Hacky nonmult8 for VNNI #90

Open · wants to merge 13 commits into master
Conversation

XapaJIaMnu (Collaborator)
It's not a purr-fect implementation, but it is a start...
This patch implements the following:

  • PrepareB for matrices with an arbitrary number of columns, for all architectures. The last non-multiple-of-eight columns are prepared and compressed as a small independent width-by-8 matrix, and zeroed blocks of register_width are stripped. Unfortunately, this is not done in place in the current implementation and involves memory copying; this can be improved in the future. I am using some inlined functions that don't have CPU_ATTR set, as I was lazy. I hope that inlining means they will be generated with the proper ISA limitations. Regardless, so far only the VNNI multiply is implemented anyway.
  • Avx512VNNI multiplication of matrices with an arbitrary number of columns, plus tests. The multiplication proceeds as normal until it reaches the last non-multiple-of-eight columns, which are then handled in a separate loop.

Example: if A is a 2x64 matrix and B is 64x9, we first multiply 2x64 by 64x8, and then 2x64 by 64x1 to produce the last column.

Unfortunately, now that matrices can have a non-multiple-of-eight number of columns, we no longer write the output columns consecutively, so we hit unaligned memory accesses when writing and segfault. For this reason I have replaced the aligned store routine with storeu.

Preliminary performance benchmarks with the built-in benchmark, adding

newTimeAVX512VNNI += testNew<AVX512VNNI::Kernels8>(8, 256, 256);

to check for any performance regressions. (This does not include irregularly shaped non-multiple-of-8 matrices.)
This branch (n=1)

taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31014 seconds.
1000 iterations of SSSE3 took: 2.39446 seconds.
1000 iterations of Shifted SSSE3 took: 1.98965 seconds.
1000 iterations of AVX2 without bias took: 1.33628 seconds.
1000 iterations of AVX2 took: 1.33306 seconds.
1000 iterations of Shifted AVX2 took: 1.20668 seconds.
1000 iterations of AVX512 without bias took: 1.01728 seconds.
1000 iterations of AVX512 took: 1.04101 seconds.
1000 iterations of Shifted AVX512 took: 0.779364 seconds.
1000 iterations of AVX512VNNI without bias took: 0.754878 seconds.
1000 iterations of AVX512VNNI took: 0.771353 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.539761 seconds.

Master (n=1)

taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31003 seconds.
1000 iterations of SSSE3 took: 2.37843 seconds.
1000 iterations of Shifted SSSE3 took: 1.97674 seconds.
1000 iterations of AVX2 without bias took: 1.28795 seconds.
1000 iterations of AVX2 took: 1.33322 seconds.
1000 iterations of Shifted AVX2 took: 1.20815 seconds.
1000 iterations of AVX512 without bias took: 1.01804 seconds.
1000 iterations of AVX512 took: 1.06707 seconds.
1000 iterations of Shifted AVX512 took: 0.779698 seconds.
1000 iterations of AVX512VNNI without bias took: 0.776488 seconds.
1000 iterations of AVX512VNNI took: 0.772831 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.653334 seconds.

Speed seems to be even better, but I don't trust that; maybe some instruction reordering makes the benchmark perform better. I will have to test it in a real-world situation later on.

@XapaJIaMnu XapaJIaMnu requested a review from kpu July 17, 2021 00:45