Add RISC-V (RVV) implementation in DepthwiseConvKernel #127867
Conversation
Signed-off-by: Zhang fei <[email protected]>
Signed-off-by: Wang fan <[email protected]>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127867
Note: Links to docs will display an error until the doc builds have completed.
❗ Active SEVs: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (1 unrelated failure) As of commit 10f8b73 with merge base 7ef7c26:
UNSTABLE - the following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This is our attempt at implementing RVV. Based on the code, it does not negatively impact the existing PyTorch build. We also observed significant performance improvements when running inference on MobileNet_v2: with input data of [1, 3, 224, 224], we saw a 3x speedup.
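For context, a depthwise convolution applies one filter per input channel, which is the operation this PR vectorizes with RVV. A minimal pure-Python reference of the 3x3, stride-1, no-padding case (illustrative only; the function name and layout here are ours, not the PR's C++ symbols):

```python
def depthwise_conv2d(x, w):
    """Depthwise 3x3 convolution, stride 1, no padding.

    x: [C][H][W] nested lists (input feature map)
    w: [C][3][3] nested lists (one 3x3 filter per channel)
    returns: [C][H-2][W-2] output
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = []
    for c in range(C):                      # each channel uses its own filter
        rows = []
        for i in range(H - 2):
            row = []
            for j in range(W - 2):
                acc = 0.0
                for ki in range(3):         # this inner MAC loop is what the
                    for kj in range(3):     # RVV kernel vectorizes
                        acc += x[c][i + ki][j + kj] * w[c][ki][kj]
                row.append(acc)
            rows.append(row)
        out.append(rows)
    return out
```

Unlike a regular convolution, no summation happens across channels, which makes the per-channel inner loop a natural target for vector instructions.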
This would be the very first riscv kernel we'd add to PyTorch. I'm not sure if we want to commit to maintaining all of these kernels. @malfet for more comments.
@malfet do you have any suggestions?
cc @albanD |
This code is not tested by CI, but sure, why not (considering that the NEON code is there).
But one should probably start with the vec sublibrary if they are serious about RISC-V performance.
@pytorchbot merge -f "Lint is green, the rest is not compiled"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#127867
Approved by: https://github.com/malfet
Summary:
Add a RISC-V Vector (RVV) implementation to DepthwiseConvKernel.
Compile:
export USE_CUDA=0
export USE_DISTRIBUTED=0
export USE_MKLDNN=0
export MAX_JOBS=4
export CMAKE_CXX_COMPILER=clang++
export CMAKE_C_COMPILER=clang
export CMAKE_C_FLAGS=-march=rv64gcv
export CMAKE_CXX_FLAGS=-march=rv64gcv
python3 setup.py develop --cmake
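The `-march=rv64gcv` flag is what enables vector code generation. As a sketch, the ISA string can be decoded per the RISC-V ISA naming convention (the helper below is ours, not a PyTorch or toolchain API):

```python
# Decode a RISC-V -march ISA string into human-readable extension names.
ISA_EXT = {
    "i": "base integer",
    "m": "integer multiply/divide",
    "a": "atomics",
    "f": "single-precision float",
    "d": "double-precision float",
    "c": "compressed instructions",
    "v": "vector (RVV), which this PR relies on",
}

def decode_march(march):
    """e.g. 'rv64gcv' -> names of the single-letter extensions it enables."""
    assert march.startswith(("rv32", "rv64"))
    exts = march[4:]
    # 'g' is shorthand for 'imafd' (plus Zicsr/Zifencei, omitted here)
    exts = exts.replace("g", "imafd")
    return [ISA_EXT[e] for e in exts]

print(decode_march("rv64gcv"))
```

Without the `v` extension in `-march`, the compiler rejects the RVV intrinsics and the new kernel cannot be built.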
Test Plan:
Correctness - compare the results of running test_convolution.py before and after the change
python3 test/run_test.py --include nn/test_convolution --keep-going
Before:
===== 9 passed, 13 skipped, 564 deselected in 46.55s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32
After:
===== 9 passed, 13 skipped, 564 deselected in 48.13s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32
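The point of the before/after lists is that they are identical: every failure predates this patch, so the RVV kernel introduces no regressions. That check amounts to a set comparison (test names abbreviated here; the full lists are above):

```python
# Failing tests before the patch (abbreviated sample of the lists above).
before = {
    "TestConvolutionNN::test_Conv2d_backward_twice",
    "TestConvolutionNN::test_invalid_conv2d",
    "TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32",
}
# In the actual run the after-list was identical to the before-list.
after = set(before)

new_failures = after - before   # tests broken by the patch
fixed = before - after          # tests incidentally fixed
assert not new_failures, f"regressions introduced: {new_failures}"
print("regressions:", len(new_failures), "fixed:", len(fixed))
```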
Performance - compare mobilenet_v2 benchmark results before and after the change
python3 run.py mobilenet_v2 -d cpu -t eval
Before:
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 19590.647 milliseconds
CPU Wall Time: 19590.647 milliseconds
Time to first batch: 5271.3518 ms
CPU Peak Memory: 0.3809 GB
After:
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 13523.530 milliseconds
CPU Wall Time: 13523.530 milliseconds
Time to first batch: 2696.0304 ms
CPU Peak Memory: 0.3408 GB
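From the numbers above, the batch-16 wall-time speedup works out to roughly 1.45x (the 3x figure quoted earlier in the thread was for a [1, 3, 224, 224], i.e. batch-1, input):

```python
# Speedups implied by the mobilenet_v2 benchmark output above (batch 16, fp32).
before_ms = 19590.647  # CPU wall time per batch, before
after_ms = 13523.530   # CPU wall time per batch, after
speedup = before_ms / after_ms
print(f"wall-time speedup: {speedup:.2f}x")   # -> 1.45x

# Time to first batch also improves substantially.
first_batch_speedup = 5271.3518 / 2696.0304
print(f"first-batch speedup: {first_batch_speedup:.2f}x")   # -> 1.96x
```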
Versions:
Clang version: 17.0.2
Platform: CanMV-K230
Architecture: riscv64
OS: Ubuntu 23.10
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10