Add RISC-V (RVV) implementation in DepthwiseConvKernel #127867
Conversation
Signed-off-by: Zhang fei <[email protected]>
Signed-off-by: Wang fan <[email protected]>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127867
Note: Links to docs will display an error until the doc builds have completed.
❗ Active SEVs: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (1 unrelated failure) As of commit 10f8b73 with merge base 7ef7c26:
UNSTABLE - the following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This is our attempt at implementing RVV. Based on the code, it does not negatively impact the existing PyTorch build. We also observed significant performance improvements when running inference on MobileNet_v2: with input data of [1, 3, 224, 224], we saw a 3x speedup.
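For context, a depthwise convolution applies one filter per input channel, which is the operation this PR vectorizes with RVV. A minimal pure-Python reference of the 3x3, stride-1, no-padding case (illustrative only; the function name and layout here are ours, not the PR's C++ symbols):

```python
def depthwise_conv2d(x, w):
    """Depthwise 3x3 convolution, stride 1, no padding.

    x: [C][H][W] nested lists (input feature map)
    w: [C][3][3] nested lists (one 3x3 filter per channel)
    returns: [C][H-2][W-2] output
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = []
    for c in range(C):                      # each channel uses its own filter
        rows = []
        for i in range(H - 2):
            row = []
            for j in range(W - 2):
                acc = 0.0
                for ki in range(3):         # this inner MAC loop is what the
                    for kj in range(3):     # RVV kernel vectorizes
                        acc += x[c][i + ki][j + kj] * w[c][ki][kj]
                row.append(acc)
            rows.append(row)
        out.append(rows)
    return out
```

Unlike a regular convolution, no summation happens across channels, which makes the per-channel inner loop a natural target for vector instructions.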
This would be the very first riscv kernel we'd add to PyTorch. I'm not sure if we want to commit to maintaining all of these kernels. @malfet for more comments.
@malfet do you have any suggestions?
cc @albanD |
This code is not tested by CI, but sure, why not (considering that the NEON code is there).
But one should probably start with the vec sublibrary if they are serious about RISC-V performance.
@pytorchbot merge -f "Lint is green, the rest is not compiled"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#127867
Approved by: https://github.com/malfet
Summary:
Add a RISC-V Vector (RVV) implementation to DepthwiseConvKernel.
Compile:
export USE_CUDA=0
export USE_DISTRIBUTED=0
export USE_MKLDNN=0
export MAX_JOBS=4
export CMAKE_CXX_COMPILER=clang++
export CMAKE_C_COMPILER=clang
export CMAKE_C_FLAGS=-march=rv64gcv
export CMAKE_CXX_FLAGS=-march=rv64gcv
python3 setup.py develop --cmake
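The `-march=rv64gcv` flag is what enables vector code generation. As a sketch, the ISA string can be decoded per the RISC-V ISA naming convention (the helper below is ours, not a PyTorch or toolchain API):

```python
# Decode a RISC-V -march ISA string into human-readable extension names.
ISA_EXT = {
    "i": "base integer",
    "m": "integer multiply/divide",
    "a": "atomics",
    "f": "single-precision float",
    "d": "double-precision float",
    "c": "compressed instructions",
    "v": "vector (RVV), which this PR relies on",
}

def decode_march(march):
    """e.g. 'rv64gcv' -> names of the single-letter extensions it enables."""
    assert march.startswith(("rv32", "rv64"))
    exts = march[4:]
    # 'g' is shorthand for 'imafd' (plus Zicsr/Zifencei, omitted here)
    exts = exts.replace("g", "imafd")
    return [ISA_EXT[e] for e in exts]

print(decode_march("rv64gcv"))
```

Without the `v` extension in `-march`, the compiler rejects the RVV intrinsics and the new kernel cannot be built.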
Test Plan:
Correctness - compare the results of running test_convolution.py before and after the change
python3 test/run_test.py --include nn/test_convolution --keep-going
Before:
===== 9 passed, 13 skipped, 564 deselected in 46.55s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32
After:
===== 9 passed, 13 skipped, 564 deselected in 48.13s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32
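The point of the before/after lists is that they are identical: every failure predates this patch, so the RVV kernel introduces no regressions. That check amounts to a set comparison (test names abbreviated here; the full lists are above):

```python
# Failing tests before the patch (abbreviated sample of the lists above).
before = {
    "TestConvolutionNN::test_Conv2d_backward_twice",
    "TestConvolutionNN::test_invalid_conv2d",
    "TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32",
}
# In the actual run the after-list was identical to the before-list.
after = set(before)

new_failures = after - before   # tests broken by the patch
fixed = before - after          # tests incidentally fixed
assert not new_failures, f"regressions introduced: {new_failures}"
print("regressions:", len(new_failures), "fixed:", len(fixed))
```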
Performance - compare mobilenet_v2 benchmark results before and after the change
python3 run.py mobilenet_v2 -d cpu -t eval
Before:
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 19590.647 milliseconds
CPU Wall Time: 19590.647 milliseconds
Time to first batch: 5271.3518 ms
CPU Peak Memory: 0.3809 GB
After:
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 13523.530 milliseconds
CPU Wall Time: 13523.530 milliseconds
Time to first batch: 2696.0304 ms
CPU Peak Memory: 0.3408 GB
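From the numbers above, the batch-16 wall-time speedup works out to roughly 1.45x (the 3x figure quoted earlier in the thread was for a [1, 3, 224, 224], i.e. batch-1, input):

```python
# Speedups implied by the mobilenet_v2 benchmark output above (batch 16, fp32).
before_ms = 19590.647  # CPU wall time per batch, before
after_ms = 13523.530   # CPU wall time per batch, after
speedup = before_ms / after_ms
print(f"wall-time speedup: {speedup:.2f}x")   # -> 1.45x

# Time to first batch also improves substantially.
first_batch_speedup = 5271.3518 / 2696.0304
print(f"first-batch speedup: {first_batch_speedup:.2f}x")   # -> 1.96x
```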
Versions:
Clang version: 17.0.2
Platform: CanMV-K230
Architecture: riscv64
OS: Ubuntu 23.10
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10