Add ARM32 kernel implementation #432

Merged
merged 24 commits into from
Aug 4, 2020


honglh
Contributor

@honglh honglh commented Jul 22, 2020

What do these changes do?

This PR adds the BgemmKernel<path=kNeon, LhsScalar=uint32_t, RhsScalar=uint32_t, AccumScalar=int32_t, DstScalar=float> implementation for ARM32 devices.
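
For readers unfamiliar with binary GEMM, the following is a minimal scalar sketch of what a kernel with these scalar types computes. It is not the NEON code from this PR. The function name `ReferenceBgemm`, the row-major layout, and the convention that a set bit encodes -1 are all assumptions made for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference, not the NEON kernel from this PR.
// Each uint32_t word holds 32 binary values; a clear bit stands for +1 and a
// set bit for -1 (an assumed convention for this sketch).
// lhs is rows x depth_words and rhs is cols x depth_words, both row-major.
std::vector<float> ReferenceBgemm(const std::vector<std::uint32_t>& lhs,
                                  const std::vector<std::uint32_t>& rhs,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t depth_words) {
  assert(lhs.size() == rows * depth_words);
  assert(rhs.size() == cols * depth_words);
  std::vector<float> dst(rows * cols);
  const std::int32_t depth_bits = static_cast<std::int32_t>(32 * depth_words);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      std::int32_t mismatches = 0;  // plays the role of AccumScalar
      for (std::size_t k = 0; k < depth_words; ++k) {
        // XOR marks the positions where the two binary vectors differ.
        const std::uint32_t diff =
            lhs[r * depth_words + k] ^ rhs[c * depth_words + k];
        mismatches += static_cast<std::int32_t>(std::bitset<32>(diff).count());
      }
      // Matching positions contribute +1 and mismatches -1 to the dot product.
      dst[r * cols + c] = static_cast<float>(depth_bits - 2 * mismatches);
    }
  }
  return dst;
}
```

With one 32-bit word per row, an all-zero row multiplied by an all-zero column gives 32, and an all-ones row multiplied by the same column gives -32. An optimised kernel would replace the inner loop with vectorised XOR and popcount instructions.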

How Has This Been Tested?

The same code change was tested on the v0.3.1 tag on RPI2 and RPI3 (32-bit). I could not test on the master branch because of open issue #408. Although I tested on the v0.3.1 tag, I cannot create a pull request against a tag.

Benchmark Results

On RPI2 the inference time is about 100 to 200 ms, and on RPI3 it is 50 to 100 ms. (Without the kernel, the inference time is over 500 ms.)

Related issue number

#431

@honglh honglh mentioned this pull request Jul 22, 2020
@lgeiger lgeiger added the feature New feature or request label Jul 22, 2020
@AdamHillier AdamHillier self-requested a review July 22, 2020 08:48
@lgeiger lgeiger added this to the 0.4 milestone Jul 22, 2020
@lgeiger
Member

lgeiger commented Jul 22, 2020

Thank you for the great PR!

> I can see a couple of CI/CD builds failed on GitHub. Not sure how to run the tests offline. Please let me know if you have more details of the error or some instructions so that I can test the build offline in my local setup. Thanks!

It looks like linting fails on CI. We use clang-format to format our C++ code so running:

clang-format -i larq_compute_engine/core/*.h

should make linting on CI happy.

We run our ARM tests on CI using qemu. Reproducing the CI failure locally on an x86 Linux system or inside an Ubuntu docker container requires a bit of setup:

sudo apt-get install -y --no-install-recommends qemu-user
./configure.sh <<< $'n\n'  # or run ./configure.sh and answer with "n"
pip install numpy six

After that, you should be able to run the ARM32 unit tests in the emulator:

bazelisk test larq_compute_engine/tests:cc_tests_arm32_qemu --config=rpi3 --test_output=streamed

Contributor

@AdamHillier AdamHillier left a comment

Thank you for your contribution, this is brilliant :)

From taking a first look, I have a few suggestions below.

larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm.h (resolved)
@honglh
Contributor Author

honglh commented Jul 30, 2020

Thanks all for your suggestions. I will review and make a new PR

@lgeiger
Member

lgeiger commented Jul 30, 2020

> I will review and make a new PR

👍 Thanks for taking a look.
Feel free to push the updates to this branch instead of creating a new PR, which might be easier for you.

@lgeiger
Member

lgeiger commented Jul 30, 2020

@honglh Thanks for addressing the comments. It looks like some of the kernel tests for the optimized code are still failing on CI, could you take a look at them?
I think you should be able to reproduce the failures locally with the commands mentioned above, please let us know if you run into any issues with that.

@honglh
Contributor Author

honglh commented Jul 31, 2020

@lgeiger I have reproduced the ARM test failure on my local setup. I am looking into this issue

@honglh
Contributor Author

honglh commented Aug 2, 2020

With the change in bconv2d_test.cc, the test execution

bazel test --cache_test_results=no larq_compute_engine/tests:cc_tests_arm32_qemu --config=rpi3 --test_output=streamed

passed in my local setup.

The earlier CI failure seems to have been caused by Register_BCONV_2D64_OPT() in ARM32 (--config=rpi3) mode. In this mode the kRuyOptimized path is selected. In EvalChooseKernelType() the kRuyOptimized path then selects the ARM32 kernel as expected, but the prepare() function in Register_BCONV_2D64_OPT() used the instantiation for a 64-bit bitpacking bitwidth:

bconv2d::Prepare<bconv2d::KernelType::kRuyOptimized, 64>

This mismatch (64-bit bitpacking combined with a 32-bit bgemm kernel) caused the CI failures.
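
To illustrate why the bitpacking width has to agree with the kernel's scalar type, here is a small sketch. The helpers `Bitpack` and `PackedRowBytes` are hypothetical and are not the library's bitpacking routines; the sign-bit convention is likewise an assumption. The sketch only shows that the same input produces buffers with a different word count, stride, and bit placement depending on the word type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: packs the sign of each float into words of type Word.
// A bit is set when the value is negative (an assumed convention).
template <typename Word>
std::vector<Word> Bitpack(const std::vector<float>& values) {
  constexpr std::size_t kBits = std::numeric_limits<Word>::digits;
  std::vector<Word> packed((values.size() + kBits - 1) / kBits, Word{0});
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (values[i] < 0.0f) {
      packed[i / kBits] |= Word{1} << (i % kBits);
    }
  }
  return packed;
}

// Hypothetical helper: bytes occupied by one packed row of `channels` values.
// Rows are padded up to a whole number of words.
template <typename Word>
std::size_t PackedRowBytes(std::size_t channels) {
  constexpr std::size_t kBits = std::numeric_limits<Word>::digits;
  return ((channels + kBits - 1) / kBits) * sizeof(Word);
}
```

For 96 channels, a row occupies 12 bytes when packed into `uint32_t` words but 16 bytes when packed into `uint64_t` words, so a kernel that assumes 32-bit words would read a buffer prepared for 64-bit words with the wrong stride.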

Another failure comes from OptimizedKernel16BitOverflowTest, which appears to be a 64-bit-only test case because it explicitly specifies the kernel Register_BCONV_2D64_OPT. This test is therefore also disabled for 32-bit.

Contributor

@AdamHillier AdamHillier left a comment

It's great to see the tests passing now!

My only concern is that I think we still want to run the "BConv2D32OPT", "BConv2D64OPT", and "BConv2D32REF" tests on all architectures. For example, when the tests are running on Arm32, we still want the "BConv2D64OPT" tests to successfully run (falling back to unoptimised C++ kernels) and the tests to pass.

I did some local debugging and was able to get this working with the two small changes I've listed below. It turns out that the reason for the previous test failures on uint64 inputs was that the bgemm_runtime_path was getting set to ruy::Path::kNeon no matter what, and on Arm32 there's no Neon kernel for uint64 inputs. This meant that no bgemm kernel was run at all, and for some reason no error was thrown.
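
The fallback described above can be sketched roughly as follows. This is not the change that was made in bgemm_impl_ruy.h; the enum `Path`, the trait `HasArm32NeonKernel`, and the function `SelectArm32BgemmPath` are hypothetical names used only to illustrate choosing the Neon path when a kernel exists for the input type and falling back to plain C++ otherwise.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a kernel path enum; not ruy's actual type.
enum class Path { kStandardCpp, kNeon };

// Assumption taken from the discussion: on Arm32 only uint32_t inputs have
// an optimised Neon kernel.
template <typename LhsScalar>
struct HasArm32NeonKernel : std::is_same<LhsScalar, std::uint32_t> {};

// Picks the Neon path only when the CPU supports Neon and a kernel exists for
// the input type; otherwise falls back to the unoptimised C++ path, so that
// some kernel always runs.
template <typename LhsScalar>
constexpr Path SelectArm32BgemmPath(bool cpu_has_neon) {
  return (cpu_has_neon && HasArm32NeonKernel<LhsScalar>::value)
             ? Path::kNeon
             : Path::kStandardCpp;
}

// The selection can be checked at compile time.
static_assert(SelectArm32BgemmPath<std::uint32_t>(true) == Path::kNeon,
              "uint32 inputs use the Neon kernel when Neon is available");
static_assert(SelectArm32BgemmPath<std::uint64_t>(true) == Path::kStandardCpp,
              "uint64 inputs fall back to the C++ kernel on Arm32");
```

Deciding on the scalar type at compile time means an unsupported combination cannot silently select a path for which no kernel is defined.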

I also have a few whitespace-related suggested changes for a couple of comments, but those are very minor and I'll list them in a separate review.

After making these changes this looks good to merge from my perspective 👍

larq_compute_engine/core/bgemm_impl_ruy.h (outdated, resolved)
larq_compute_engine/tflite/tests/bconv2d_test.cc (outdated, resolved)
Contributor

@AdamHillier AdamHillier left a comment

The whitespace suggestions I mentioned above:

larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm32.h (outdated, resolved)
larq_compute_engine/core/bgemm_kernels_arm.h (resolved)
@honglh
Contributor Author

honglh commented Aug 4, 2020

@AdamHillier Thanks. The changes look good on my side as well

Contributor

@AdamHillier AdamHillier left a comment

Awesome, this is fantastic 🔥

Thank you so much for doing this @honglh.

Labels
feature New feature or request
3 participants