Backport cpu: aarch64: matmul: Optimize (A^T)*(B^T) and fuse sum post-op in acl_matmul #1955
Description
This is a backport of #1889 and #1892
Backport cpu: aarch64: matmul: Optimize (A^T)*(B^T) in acl_matmul
Computes (B·A)^T instead of (A^T)·(B^T) when transposing (B·A) is cheaper. This improves performance by ~1.25x for square matrices, and by even more for tall-skinny/fat-short matrices.
It also reduces code duplication and moves the allocation of the dst accumulator from ACL into oneDNN scratchpad memory.
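The optimization rests on the transpose identity (B·A)^T = (A^T)·(B^T): instead of materializing two transposed inputs, the kernel can multiply the untransposed operands in swapped order and transpose the (often cheaper) result once. A minimal standalone Python sketch of the identity (illustrative only, not the oneDNN implementation):

```python
# Demonstrate (B*A)^T == (A^T)*(B^T), the identity behind the rewrite.

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[1, 2], [3, 4], [5, 6]]   # 3x2
B = [[7, 8, 9], [1, 2, 3]]     # 2x3

direct = matmul(transpose(A), transpose(B))  # (A^T)*(B^T): two input transposes
rewritten = transpose(matmul(B, A))          # (B*A)^T: one output transpose

assert direct == rewritten
```

The rewrite wins whenever transposing the single (B·A) result costs less than transposing both inputs, which is what the heuristic in the backported change estimates.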
Backport cpu: aarch64: matmul: fuse sum post op in acl matmul
Fuse the sum post op in acl matmul by setting the accumulate flag to true in arm_compute::GEMMInfo. This speeds up the post op and saves allocating a temporary dst-sized tensor.
We also added a `_for_sum` suffix to the `use_dst_acc` flag to stop it being confused with the `dst_acc` used for transposing. Change the way we deal with fused eltwise (as well as the new sum) to fix segfaults when binary ops followed fused ops.
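The effect of the accumulate flag can be sketched in plain Python (a hypothetical standalone demo, not the ACL API): with accumulation enabled, the GEMM adds into the existing dst in place, so the sum post-op no longer needs a temporary dst-sized buffer.

```python
# Sketch of sum post-op fusion via an accumulate flag (illustrative only).

def gemm(A, B, dst, accumulate=False):
    rows, inner, cols = len(A), len(B), len(B[0])
    for i in range(rows):
        for j in range(cols):
            acc = sum(A[i][k] * B[k][j] for k in range(inner))
            # accumulate=True: dst += A @ B, computed in place
            dst[i][j] = dst[i][j] + acc if accumulate else acc
    return dst

A = [[1, 0], [0, 1]]
B = [[2, 3], [4, 5]]

# Unfused sum post-op: matmul into a temporary, then add into dst.
dst_unfused = [[10, 10], [10, 10]]
tmp = gemm(A, B, [[0, 0], [0, 0]])
dst_unfused = [[d + t for d, t in zip(dr, tr)]
               for dr, tr in zip(dst_unfused, tmp)]

# Fused: accumulate directly into dst -- same result, no temporary.
dst_fused = gemm(A, B, [[10, 10], [10, 10]], accumulate=True)

assert dst_unfused == dst_fused
```

Both paths produce the same dst; the fused path simply skips the temporary allocation and the extra pass over memory.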
Fixes # (github issue)
Checklist
General
Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
Performance improvements
New features
Bug fixes
RFC PR