Backport cpu: aarch64: matmul: Optimize (A^T)*(B^T) and fuse sum post-op in acl_matmul #1955
Description
This is a backport of #1889 and #1892
Backport cpu: aarch64: matmul: Optimize (A^T)*(B^T) in acl_matmul
Computes (B·A)^T instead of (A^T)·(B^T) when transposing (B·A) is cheaper. This improves performance by ~1.25x for square matrices, and by even more for tall-skinny/fat-short matrices.
It also reduces code duplication and moves the allocation of the dst accumulator from ACL into oneDNN scratchpad memory.
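The optimization rests on the transpose identity (B·A)^T = (A^T)·(B^T): instead of materializing two transposed inputs, the kernel can multiply the untransposed operands in swapped order and transpose the (often cheaper) result once. A minimal standalone Python sketch of the identity (illustrative only, not the oneDNN implementation):

```python
# Demonstrate (B*A)^T == (A^T)*(B^T), the identity behind the rewrite.

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[1, 2], [3, 4], [5, 6]]   # 3x2
B = [[7, 8, 9], [1, 2, 3]]     # 2x3

direct = matmul(transpose(A), transpose(B))  # (A^T)*(B^T): two input transposes
rewritten = transpose(matmul(B, A))          # (B*A)^T: one output transpose

assert direct == rewritten
```

The rewrite wins whenever transposing the single (B·A) result costs less than transposing both inputs, which is what the heuristic in the backported change estimates.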
Backport cpu: aarch64: matmul: fuse sum post op in acl matmul
Fuse the sum post op in acl matmul by setting the accumulate flag to true in arm_compute::GEMMInfo. This speeds up the post op and saves allocating a temporary dst-sized tensor.
We also added a `_for_sum` suffix to the `use_dst_acc` flag to stop it being confused with the `dst_acc` used for transposing. Change the way we deal with fused eltwise (as well as the new sum) to fix segfaults when binary ops followed fused ops.
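The effect of the accumulate flag can be sketched in plain Python (a hypothetical standalone demo, not the ACL API): with accumulation enabled, the GEMM adds into the existing dst in place, so the sum post-op no longer needs a temporary dst-sized buffer.

```python
# Sketch of sum post-op fusion via an accumulate flag (illustrative only).

def gemm(A, B, dst, accumulate=False):
    rows, inner, cols = len(A), len(B), len(B[0])
    for i in range(rows):
        for j in range(cols):
            acc = sum(A[i][k] * B[k][j] for k in range(inner))
            # accumulate=True: dst += A @ B, computed in place
            dst[i][j] = dst[i][j] + acc if accumulate else acc
    return dst

A = [[1, 0], [0, 1]]
B = [[2, 3], [4, 5]]

# Unfused sum post-op: matmul into a temporary, then add into dst.
dst_unfused = [[10, 10], [10, 10]]
tmp = gemm(A, B, [[0, 0], [0, 0]])
dst_unfused = [[d + t for d, t in zip(dr, tr)]
               for dr, tr in zip(dst_unfused, tmp)]

# Fused: accumulate directly into dst -- same result, no temporary.
dst_fused = gemm(A, B, [[10, 10], [10, 10]], accumulate=True)

assert dst_unfused == dst_fused
```

Both paths produce the same dst; the fused path simply skips the temporary allocation and the extra pass over memory.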
Fixes # (github issue)
Checklist
General
Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
Performance improvements
New features
Bug fixes
RFC PR