Performance improvement in Normalize GPU Kernel #14139

sandeep-krishnamurthy · 2019-02-12T23:44:14Z

Description

Similar to the performance improvements made to the ToTensor GPU kernel in PR #14099.
In this PR, I wrote a separate CUDA kernel for the GPU path and moved the Normalize operator off the generic Kernel Launch/Map path.

Benchmarks below.

Benchmarks
Ran 500 Normalize operations on a (3, 512, 512) sample input.

GPU

Before: ('Average time per Normalize 3,512,512 - ', 38.19581985473633)
After: ('Average time per Normalize 3,512,512 - ', 0.5398507118225098)

CPU

Before: ('Average time per Normalize 3,512,512 - ', 1.8209707736968994)
After: ('Average time per Normalize 3,512,512 - ', 1.2644755840301514)

('Total time for CPU ToTensor - ', 1264.4755840301514)
('Average time per Normalize 3,512,512 - ', 1.2644755840301514)
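
For reference, the speedups implied by the averages above can be worked out directly. This is a minimal C++ sketch: the `speedup` helper and the constant names are hypothetical, and because the log lines do not state units, only the ratios are meaningful.

```cpp
#include <cassert>

// Hypothetical helper: ratio of the "before" average to the "after" average.
// The benchmark output does not state units, so only the ratio is meaningful.
double speedup(double before, double after) {
    assert(after > 0.0);
    return before / after;
}

// Averages copied verbatim from the benchmark output above.
const double kGpuSpeedup = speedup(38.19581985473633, 0.5398507118225098);   // roughly 70.8x
const double kCpuSpeedup = speedup(1.8209707736968994, 1.2644755840301514);  // roughly 1.44x
```

On these numbers the GPU path improves by about 70x, while the CPU path improves by about 1.4x.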

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Remove Kernel Launch/Map usage for the Normalize operator
Add an independent kernel for CPU Normalize
Add 2 separate CUDA kernels for the Normalize forward operator (3D and 4D inputs)
Add 2 separate CUDA kernels for the Normalize backward operator (3D and 4D inputs)
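
As a reading aid, here is a host-side C++ sketch of the per-channel arithmetic that the forward and backward operators listed above perform. It is not the code from this PR: the function names, the `std::vector` interface, and the single flat-index loop are assumptions made for illustration. The sketch assumes a contiguous CHW (3D) or NCHW (4D) layout, so one loop covers both cases because the channel index repeats every C*H*W elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward pass (illustrative): out[i] = (in[i] - mean[c]) / stdv[c],
// where c is the channel recovered from the flat index i.
std::vector<float> NormalizeForward(const std::vector<float>& in,
                                    std::size_t channels,
                                    std::size_t height,
                                    std::size_t width,
                                    const std::vector<float>& mean,
                                    const std::vector<float>& stdv) {
    const std::size_t spatial = height * width;
    assert(channels > 0 && spatial > 0);
    assert(mean.size() == channels && stdv.size() == channels);
    assert(in.size() % (channels * spatial) == 0);  // whole images only

    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::size_t c = (i / spatial) % channels;
        out[i] = (in[i] - mean[c]) / stdv[c];
    }
    return out;
}

// Backward pass (illustrative): d(out)/d(in) = 1 / stdv[c],
// so in_grad[i] = out_grad[i] / stdv[c]. The mean does not appear.
std::vector<float> NormalizeBackward(const std::vector<float>& out_grad,
                                     std::size_t channels,
                                     std::size_t height,
                                     std::size_t width,
                                     const std::vector<float>& stdv) {
    const std::size_t spatial = height * width;
    assert(channels > 0 && spatial > 0);
    assert(stdv.size() == channels);
    assert(out_grad.size() % (channels * spatial) == 0);

    std::vector<float> in_grad(out_grad.size());
    for (std::size_t i = 0; i < out_grad.size(); ++i) {
        const std::size_t c = (i / spatial) % channels;
        in_grad[i] = out_grad[i] / stdv[c];
    }
    return in_grad;
}
```

In a GPU version, each thread would typically handle one or more flat indices `i` instead of running this loop serially; the arithmetic per element stays the same.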

@stu1130 @zhreshold

zhreshold · 2019-02-13T23:51:12Z

src/operator/image/image_random.cu

 ToTensorCudaKernel<gpu, DType>
- <<<blocks, dim3(32, 32), 0, stream>>>(input, output,
+ <<<blocks, dim3(H, cuda::kMaxThreadsPerBlock / H), 0, stream>>>(input, output,


Please fix ToTensor similarly in a separate PR.

Sure, that is already a work in progress. Should this PR wait until then, or should I revert dim3(H, cuda::kMaxThreadsPerBlock / H) back to dim3(32, 32)? Let me know the next steps for this PR. Thanks!

Back to (32, 32); we can address it later.
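
The thread does not spell out why (32, 32) was preferred for now. One observable difference between the two shapes in the diff is that the height-derived block depends on the input size, while (32, 32) does not. The C++ sketch below illustrates that property only. The limit of 1024 threads per block is an assumption standing in for `cuda::kMaxThreadsPerBlock`, and `BlockDim`, `HeightDerivedBlock`, and `IsLaunchable` are hypothetical names.

```cpp
#include <cassert>

// Assumed limit for illustration; the real value comes from cuda::kMaxThreadsPerBlock.
constexpr int kAssumedMaxThreadsPerBlock = 1024;

// Hypothetical stand-in for the x and y components of a CUDA dim3.
struct BlockDim {
    int x;
    int y;
};

// Mirrors dim3(H, cuda::kMaxThreadsPerBlock / H) from the diff, using integer division.
BlockDim HeightDerivedBlock(int height) {
    assert(height > 0);
    return BlockDim{height, kAssumedMaxThreadsPerBlock / height};
}

// A block shape is usable only if both dimensions are at least 1 and the
// total thread count stays within the per-block limit.
bool IsLaunchable(BlockDim block) {
    return block.x >= 1 && block.y >= 1 &&
           block.x * block.y <= kAssumedMaxThreadsPerBlock;
}
```

Under these assumptions, H = 512 gives a (512, 2) block with exactly 1024 threads, whereas any H above 1024 makes the second dimension round down to 0. A fixed (32, 32) block always has 1024 threads regardless of H.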

zhreshold · 2019-02-14T00:08:17Z

Basically LGTM. Please make the minor revision, and once CI passes, we can merge.

sandeep-krishnamurthy · 2019-02-14T00:14:27Z

> Basically LGTM. Please make the minor revision, and once CI passes, we can merge.

Done. I will create the ToTensor refactoring PR in a day or two. Thanks again for your time and for the fast turnaround on all PR reviews.

* New CPU kernel for normalize
* New GPU kernel for Normalize
* Add launch bounds and increase threads to 32*32
* do not hardcode number of threads
* Try fix windows build failure
* make channels as int to fix windows build issues with omp
* Simplify cuda kernels with 1 D thread block
* Minor refactoring
* Revert thread dim for ToTensor operator

szha requested a review from zhreshold February 12, 2019 23:53

sandeep-krishnamurthy added the Operator and pr-awaiting-review (PR is waiting for code review) labels Feb 12, 2019

zhreshold reviewed Feb 13, 2019

zhreshold reviewed Feb 13, 2019

zhreshold suggested changes Feb 13, 2019

zhreshold suggested changes Feb 13, 2019

sandeep-krishnamurthy force-pushed the normalize_gpu_perf branch from 9ca1aec to 26a0532 Compare February 14, 2019 00:08

sandeep-krishnamurthy added 9 commits February 13, 2019 22:08

New CPU kernel for normalize

04fc2fd

New GPU kernel for Normalize

64d2b47

Add launch bounds and increase threads to 32*32

9c175f9

do not hardcode number of threads

94787e8

Try fix windows build failure

7b8039c

make channels as int to fix windows build issues with omp

a98f034

Simplify cuda kernels with 1 D thread block

eb8da43

Minor refactoring

652e1ca

Revert thread dim for ToTensor operator

f595dff

sandeep-krishnamurthy force-pushed the normalize_gpu_perf branch from cf05dd7 to f595dff Compare February 14, 2019 06:09

zhreshold approved these changes Feb 14, 2019

sandeep-krishnamurthy merged commit dad33b5 into apache:master Feb 14, 2019
