gpu: intel: Optimize reusable layer normalization using work-group based reductions #1990
Description
This pull request adds an alternative kernel that performs better under certain conditions than the sub-group reduction kernel previously implemented. The new kernel uses the work_group_reduce_add function to perform the mean and variance reductions instead of the sub-group based reductions. One benefit of this kernel is that it performs better for sizes that do not fully utilize the device when sub-group based reductions are used.
Optimizations
work-group based reductions vs sub-group based reductions
There are two kernels implemented for the reusable layer normalization layer. These kernels differ in how the summation is performed in the mean and variance calculations. The work-group kernel launches a work-item for each element along the lnorm axis, while the sub-group kernel launches one SIMD worth of work-items along the lnorm axis. The work-group kernel uses the work_group_reduce_add function and the sub-group version uses the sub_group_reduce_add function to perform the summation. Here is a heatmap of the two kernels and how they perform over the different shapes of the input tensor.
Use of fixed sized loops vs variable sized loops
Example: https://github.com/oneapi-src/oneDNN/compare/main...umar456:oneDNN:uarshad/reusable_vectorized_lnorm?expand=1#diff-399297f4e437e8a12e0e654089b8af8c938a7a8430c6efab4ea61a029a683f8cR53
There is a significant penalty when a runtime variable is used in the loop exit condition. Here are heatmaps comparing a runtime loop bound against a compile-time loop bound:
Use macro to avoid loops in work-group kernel
The ifdef here: https://github.com/oneapi-src/oneDNN/compare/main...umar456:oneDNN:uarshad/reusable_vectorized_lnorm?expand=1#diff-399297f4e437e8a12e0e654089b8af8c938a7a8430c6efab4ea61a029a683f8cR51
is used to avoid adding a loop in the work-group kernel. I had originally assumed that if the compiler knew the loop iterated only one time, it would be able to remove the overhead of the loop, but that turns out not to be the case. Here is a heatmap of using the macro to remove the loop in the work-group kernel.
Use large GRF for certain shapes in the sub-group based kernel
The large GRF flag can significantly improve the speed of the kernel under certain situations. The greatest speedup appears where the tensor is small enough to fit in the device cache and the lnorm axis is larger than 768. I suspect this is because it allows the device to queue more load transactions than without the flag. Conversely, there is a significant slowdown where the lnorm axis is small and the number of sub-groups launched is greater than one wave. I suspect this is because fewer sub-groups are active when the large GRF flag is used. Here is the heatmap of the sub-group kernel with and without the GRF flag.
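For reference, on Intel's OpenCL compiler large-GRF mode is typically requested via a kernel build option; the exact mechanism this PR uses is not shown here, and the flag below is an assumption based on Intel's documented compiler options rather than a quote from the patch:

```
# Hypothetical kernel build option selecting 256 GRF per thread
-cl-intel-256-GRF-per-thread
```

Doubling the register file halves the number of hardware threads per EU, which is consistent with the observed slowdown when many small sub-groups are in flight.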
Overall Speedup
512 EU PVC
Heatmap vs. Original Vectorized: