
Backport cpu: aarch64: matmul: Optimize (A^T)*(B^T) and fuse sum post-op in acl_matmul #1955

Merged

merged 2 commits into oneapi-src:rls-v3.5 on Jun 7, 2024

Conversation

fadara01
Contributor

@fadara01 fadara01 commented Jun 7, 2024

Description

This is a backport of #1889 and #1892

Backport cpu: aarch64: matmul: Optimize (A^T)*(B^T) in acl_matmul
Computes (B*A)^T instead of (A^T)*(B^T) when transposing (B*A) is cheaper than transposing both inputs. This improves performance by ~1.25x for square matrices, and by even more for tall-skinny/fat-short matrices.

It also reduces code duplication and moves the allocation of the dst accumulator from ACL to oneDNN's scratchpad memory.
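
For context, the optimization relies on the matrix identity (A^T)*(B^T) = (B*A)^T, so one transpose of the (B*A) product replaces transposing both inputs. Below is a minimal, self-contained C++ sketch (purely illustrative, not oneDNN code) that checks the identity on small row-major matrices:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive row-major matmul: C(m x n) = X(m x k) * Y(k x n).
static std::vector<float> matmul(const std::vector<float> &X,
        const std::vector<float> &Y, size_t m, size_t k, size_t n) {
    std::vector<float> C(m * n, 0.f);
    for (size_t i = 0; i < m; ++i)
        for (size_t p = 0; p < k; ++p)
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] += X[i * k + p] * Y[p * n + j];
    return C;
}

// Row-major transpose: returns X^T given X(m x n).
static std::vector<float> transpose(
        const std::vector<float> &X, size_t m, size_t n) {
    std::vector<float> T(n * m);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            T[j * m + i] = X[i * n + j];
    return T;
}

int main() {
    // A is k x m and B is n x k, so A^T (m x k) * B^T (k x n) is m x n.
    const size_t m = 2, k = 3, n = 4;
    std::vector<float> A = {1, 2, 3, 4, 5, 6}; // k x m
    std::vector<float> B(n * k);               // n x k
    for (size_t i = 0; i < n * k; ++i)
        B[i] = 0.5f * float(i);

    // Direct: (A^T)*(B^T), i.e. transpose both inputs, then multiply.
    auto lhs = matmul(transpose(A, k, m), transpose(B, n, k), m, k, n);
    // Trick: compute B*A once, then transpose the product.
    auto rhs = transpose(matmul(B, A, n, k, m), n, m);
    for (size_t i = 0; i < m * n; ++i)
        assert(lhs[i] == rhs[i]);
    return 0;
}
```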

Backport cpu: aarch64: matmul: fuse sum post op in acl matmul
Fuse the sum post-op in acl_matmul by setting the accumulate flag to true in arm_compute::GEMMInfo. This speeds up the post-op and saves allocating a temporary dst-sized tensor.

We also appended `_for_sum` to the `use_dst_acc` flag to avoid confusing it with the `dst_acc` used for transposing.

We also changed the way fused eltwise (as well as the new sum) is handled, fixing segfaults when binary ops follow fused ops.
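
For reference, from the user's side the fused sum is requested through the standard oneDNN sum post-op; this change only affects how acl_matmul executes it. A hedged sketch against the oneDNN v3.x C++ API (the engine and descriptor names such as `src_md` are placeholders):

```cpp
#include "oneapi/dnnl/dnnl.hpp"

using namespace dnnl;

// Build a matmul primitive descriptor with a sum post-op, i.e.
// dst = matmul(src, weights) + dst. Per this PR, acl_matmul now
// implements the sum via ACL's accumulate flag instead of adding
// through a separate dst-sized temporary tensor.
matmul::primitive_desc make_matmul_with_sum(const engine &eng,
        const memory::desc &src_md, const memory::desc &wei_md,
        const memory::desc &dst_md) {
    post_ops po;
    po.append_sum(1.f); // scale applied to the existing dst values
    primitive_attr attr;
    attr.set_post_ops(po);
    return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
}
```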

Fixes # (github issue)

Checklist

General

  • [ YES ] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [ YES ] Have you formatted the code using clang-format?

Performance improvements

  • [ YES ] Have you submitted performance data that demonstrates performance improvements?

New features

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

Bug fixes

  • Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • Have you added relevant regression tests?

RFC PR

  • Does RFC document follow the template?
  • Have you added a link to the rendered document?

fadara01 and others added 2 commits on June 7, 2024 at 17:39
This PR computes (B*A)^T instead of (A^T)*(B^T) when transposing (B*A) is cheaper than transposing both inputs. This improves performance by ~1.25x for square matrices, and by even more for tall-skinny/fat-short matrices.

It also reduces code duplication and moves the allocation of the dst accumulator from ACL to oneDNN's scratchpad memory.

Co-authored-by: Annop Wongwathanarat <[email protected]>
Fuse the sum post-op in acl_matmul by setting the accumulate flag to true in arm_compute::GEMMInfo. This speeds up the post-op and saves allocating a temporary dst-sized tensor.

We also appended `_for_sum` to the `use_dst_acc` flag to avoid confusing it with the `dst_acc` used for transposing.

We also changed the way fused eltwise (as well as the new sum) is handled, fixing segfaults when binary ops follow fused ops.

Co-authored-by: Milos Puzovic <[email protected]>
Co-authored-by: Jonathan Deakin <[email protected]>
@fadara01 fadara01 changed the title Backport cpu: aarch64: matmul: Optimize (A^T)*(B^T) in acl_matmul Backport cpu: aarch64: matmul: Optimize (A^T)*(B^T) and fuse sum post-op in acl_matmul Jun 7, 2024
@vpirogov vpirogov added this to the v3.5 milestone Jun 7, 2024
@vpirogov vpirogov added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Jun 7, 2024
@vpirogov vpirogov merged commit 84adf42 into oneapi-src:rls-v3.5 Jun 7, 2024
8 of 10 checks passed