Add ARM32 kernel implementation #432

honglh · 2020-07-22T02:58:42Z

What do these changes do?

This PR adds the BgemmKernel<path=kNeon, LhsScalar=uint32_t, RhsScalar=uint32_t, AccumScalar=int32_t, DstScalar=float> implementation for ARM32 devices.
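For readers new to binary GEMM, the computation such a kernel performs can be sketched in plain scalar C++. This is only an illustration under stated assumptions, not the NEON code in this PR: `Popcount32` and `ReferenceBgemm` are hypothetical names, both operands are assumed to be row-major bitpacked `uint32_t` buffers, and the final mapping from mismatch count to a +/-1 dot product is an assumption about the output step.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of set bits in a 32-bit word (portable, no compiler builtins).
inline std::int32_t Popcount32(std::uint32_t x) {
  return static_cast<std::int32_t>(std::bitset<32>(x).count());
}

// Hypothetical scalar reference, not the kernel from this PR.
// lhs is rows x depth, rhs is cols x depth, both row-major and bitpacked
// (32 binary values per word). dst is rows x cols, row-major.
std::vector<float> ReferenceBgemm(const std::vector<std::uint32_t>& lhs,
                                  const std::vector<std::uint32_t>& rhs,
                                  int rows, int cols, int depth) {
  std::vector<float> dst(static_cast<std::size_t>(rows) * cols);
  for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
      // int32 accumulator: counts the bit positions where the operands differ.
      std::int32_t accum = 0;
      for (int k = 0; k < depth; ++k) {
        accum += Popcount32(lhs[i * depth + k] ^ rhs[j * depth + k]);
      }
      // Assumed output step: for +/-1 values, dot = bits - 2 * mismatches.
      dst[i * cols + j] = static_cast<float>(32 * depth - 2 * accum);
    }
  }
  return dst;
}
```

With a depth of one word, identical operands give 32 and fully opposite operands give -32.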

How Has This Been Tested?

The same code change was tested on the v0.3.1 tag on RPI2 and RPI3 (32-bit). I could not test on the master branch because of open issue #408. Although I tested on the v0.3.1 tag, I cannot create a pull request against a tag.

Benchmark Results

On RPI2 the inference time is about 100 to 200 ms, and on RPI3 it is 50 to 100 ms. (Without the kernel, the inference time is 500 ms or more.)

Related issue number

#431

lgeiger · 2020-07-22T09:47:12Z

Thank you for the great PR!

> I can see a couple CI/CD builds failed on github. Not sure how to run the tests offline. Please let me know if you have more details of the error or some instructions I can test the build offline in my local setup. Thanks!

It looks like linting fails on CI. We use clang-format to format our C++ code so running:

clang-format -i larq_compute_engine/core/*.h

should make linting on CI happy.

We run our ARM tests on CI using qemu. Reproducing the CI failure locally on an x86 Linux system or inside an Ubuntu docker container requires a bit of setup:

sudo apt-get install -y --no-install-recommends qemu-user
./configure.sh <<< $'n\n'  # or run ./configure.sh and answer with "n"
pip install numpy six

After that, you should be able to run the ARM32 unit tests in the emulator:

bazelisk test larq_compute_engine/tests:cc_tests_arm32_qemu --config=rpi3 --test_output=streamed

AdamHillier

Thank you for your contribution, this is brilliant :)

From taking a first look, I have a few suggestions below.

larq_compute_engine/core/bgemm_kernels_arm32.h

larq_compute_engine/core/bgemm_kernels_arm.h

honglh · 2020-07-30T05:54:10Z

Thanks all for your suggestions. I will review and make a new PR

lgeiger · 2020-07-30T09:40:25Z

> I will review and make a new PR

👍 Thanks for taking a look.
Feel free to push the updates to this branch, which might be easier for you than creating a new PR.

lgeiger · 2020-07-30T16:05:15Z

@honglh Thanks for addressing the comments. It looks like some of the kernel tests for the optimized code are still failing on CI. Could you take a look at them?
I think you should be able to reproduce the failures locally with the commands mentioned above. Please let us know if you run into any issues with that.

honglh · 2020-07-31T00:57:57Z

@lgeiger I have reproduced the ARM test failure on my local setup. I am looking into this issue.

honglh · 2020-08-02T04:09:40Z

With the change in bconv2d_test.cc, the test run

bazel test --cache_test_results=no larq_compute_engine/tests:cc_tests_arm32_qemu --config=rpi3 --test_output=streamed

passed in my local setup.

The earlier CI failure seems to have been caused by Register_BCONV_2D64_OPT() in ARM32 (--config=rpi3) mode. In this mode the kRuyOptimized path is selected. In EvalChooseKernelType(), the kRuyOptimized path then selects the ARM32 kernel as expected, but the prepare() function in Register_BCONV_2D64_OPT() used the variant for the 64-bit bitpacking bitwidth:

bconv2d::Prepare<bconv2d::KernelType::kRuyOptimized, 64>

This mismatch (64-bit bitpacking and 32-bit bgemm kernel) caused CI failures.
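To make the mismatch concrete, here is a small sketch of why the two bitwidths cannot be mixed. It is illustrative only: `PackedDepth` and `BitwidthsMatch` are hypothetical helpers and are not part of larq_compute_engine.

```cpp
#include <cstdint>

// Number of packed words needed to hold `channels` binary values
// when each word stores `bitwidth` of them (rounded up).
constexpr int PackedDepth(int channels, int bitwidth) {
  return (channels + bitwidth - 1) / bitwidth;
}

// Hypothetical consistency check: the bitwidth used when preparing the
// packed buffers must equal the word size the selected kernel consumes.
template <int PrepareBitwidth, typename KernelScalar>
constexpr bool BitwidthsMatch() {
  return PrepareBitwidth == 8 * static_cast<int>(sizeof(KernelScalar));
}

// A 32-bit kernel paired with 32-bit packing is consistent...
static_assert(BitwidthsMatch<32, std::uint32_t>(), "32-bit packing fits");
// ...while 64-bit packing feeding a uint32_t kernel is not.
static_assert(!BitwidthsMatch<64, std::uint32_t>(), "64-bit packing differs");
```

For 64 input channels, 64-bit packing yields one word per row while a 32-bit kernel expects two, so the buffer layout prepared for one does not match what the other reads.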

Another failure case is from OptimizedKernel16BitOverflowTest, which seems to be a 64-bit-only test case because the kernel Register_BCONV_2D64_OPT is explicitly specified. This test is therefore also disabled for 32-bit.

AdamHillier

It's great to see the tests passing now!

My only concern is that I think we still want to run the "BConv2D32OPT", "BConv2D64OPT", and "BConv2D32REF" tests on all architectures. For example, when the tests are running on Arm32, we still want the "BConv2D64OPT" tests to successfully run (falling back to unoptimised C++ kernels) and the tests to pass.

I did some local debugging and was able to get this working with the two small changes I've listed below. It turns out that the reason for the previous test failures on uint64 inputs was that the bgemm_runtime_path was getting set to ruy::Path::kNeon no matter what, and on Arm32 there's no Neon kernel for uint64 inputs. This meant that no bgemm kernel was run at all, and for some reason no error was thrown.
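The idea behind the fix can be sketched roughly as follows. This is an illustration under assumptions rather than the actual change to bgemm_impl_ruy.h: `Path` is a hypothetical stand-in for ruy::Path, and `SelectBgemmPath` and its `is_arm32` parameter are made-up names.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for ruy::Path; only the two values discussed here.
enum class Path { kStandardCpp, kNeon };

// Keep the detected path unless we are on Arm32, Neon was detected, and the
// input type is not uint32_t. In that case no Neon kernel exists, so fall
// back to the unoptimised C++ kernel instead of selecting nothing.
template <typename LhsScalar>
constexpr Path SelectBgemmPath(Path detected, bool is_arm32) {
  return (is_arm32 && detected == Path::kNeon &&
          !std::is_same<LhsScalar, std::uint32_t>::value)
             ? Path::kStandardCpp
             : detected;
}
```

With this shape, uint64 inputs on Arm32 resolve to the standard C++ path, while uint32 inputs keep the Neon path.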

I also have a few whitespace-related suggested changes for a couple of comments, but those are very minor and I'll list them in a separate review.

After making these changes this looks good to merge from my perspective 👍

larq_compute_engine/core/bgemm_impl_ruy.h

larq_compute_engine/tflite/tests/bconv2d_test.cc

AdamHillier

The whitespace suggestions I mentioned above:

larq_compute_engine/core/bgemm_kernels_arm32.h

larq_compute_engine/core/bgemm_kernels_arm.h

honglh · 2020-08-04T00:45:06Z

@AdamHillier Thanks. The changes look good on my side as well.

AdamHillier

Awesome, this is fantastic 🔥

Thank you so much for doing this @honglh.

Hong Li and others added 3 commits July 21, 2020 21:53

Add arm32 4x4 kernel impl

7494af9

Add arm32 4x4 kernel impl

d21344a

Add arm32 4x4 kernel impl

d94b1e6

honglh mentioned this pull request Jul 22, 2020

BgemmKernel for ARM32 #431

Closed

lgeiger added the feature New feature or request label Jul 22, 2020

AdamHillier self-requested a review July 22, 2020 08:48

lgeiger added this to the 0.4 milestone Jul 22, 2020

lgeiger mentioned this pull request Jul 22, 2020

Remove unused packbits_arm32 #437

Merged

AdamHillier reviewed Jul 22, 2020

Hong Li and others added 6 commits July 30, 2020 09:07

Update to clang-format-5.0 format

3263994

Remove redundant #if 1

f132e64

Undef constatns after use

62fe3f4

Corrected incorrect ifdef comment

695a4c7

Add bgemm_kernes_arm32.h as depdendent

cba31a5

Use vpaddl.u8 and u16 better because popcnt will not be negative

89aac02

Hong Li added 3 commits August 1, 2020 23:12

use proper kernel for compiled arch

229e94a

use proper kernel for compiled arch

9716065

use proper kernel for compiled arch

7cee40f

lgeiger requested a review from AdamHillier August 2, 2020 09:59

AdamHillier reviewed Aug 3, 2020

larq_compute_engine/core/bgemm_impl_ruy.h

larq_compute_engine/tflite/tests/bconv2d_test.cc

AdamHillier reviewed Aug 3, 2020

honglh and others added 3 commits August 3, 2020 15:22

Update larq_compute_engine/core/bgemm_kernels_arm32.h

762b55b

Co-authored-by: Adam Hillier <[email protected]>

Update larq_compute_engine/core/bgemm_kernels_arm32.h

73600a1

Co-authored-by: Adam Hillier <[email protected]>

Update larq_compute_engine/core/bgemm_kernels_arm32.h

6fa0901

Co-authored-by: Adam Hillier <[email protected]>

honglh and others added 8 commits August 3, 2020 15:24

Update larq_compute_engine/core/bgemm_kernels_arm32.h

0a375d4

Co-authored-by: Adam Hillier <[email protected]>

Update larq_compute_engine/core/bgemm_kernels_arm32.h

f7b65cc

Co-authored-by: Adam Hillier <[email protected]>

Update larq_compute_engine/core/bgemm_kernels_arm32.h

0b141a7

conform to clang format Co-authored-by: Adam Hillier <[email protected]>

Update larq_compute_engine/core/bgemm_kernels_arm32.h

9c32d6e

conform to clang format Co-authored-by: Adam Hillier <[email protected]>

Update larq_compute_engine/core/bgemm_kernels_arm.h

4332f55

conform to clang format Co-authored-by: Adam Hillier <[email protected]>

Update larq_compute_engine/core/bgemm_kernels_arm32.h

32c954c

conform to clang format Co-authored-by: Adam Hillier <[email protected]>

Use kNeon for 32-bit input only

bc4b7f5

Co-authored-by: Adam Hillier <[email protected]>

Use kNeon for 32-bit input only

d64ae7c

Co-authored-by: Adam Hillier <[email protected]>

fix clang-format incompliance

4daf896

AdamHillier approved these changes Aug 4, 2020

AdamHillier merged commit e9ca6be into larq:master Aug 4, 2020

lgeiger mentioned this pull request Aug 4, 2020

Prebuild AArch32 benchmark binary for new releases #453

Merged

jneeven mentioned this pull request Jan 4, 2021

LCE Inference of QuickNet slower than BenchMark larq/larq#607

Closed

lgeiger mentioned this pull request May 20, 2021

Add support for bitpacked activations in optimized ARM32 kernels #647

Open
