
[CUDA] Optimize Transpose3DKernel #10891

Merged: centwang merged 1 commit from weicwang/transpose into master on Mar 17, 2022
Conversation

centwang (Contributor) commented:
Optimize the CUDA Transpose3DKernel by handling more elements per thread.

Take Transpose([3072,64,512], perm=[0,2,1]) as an example (it comes from one of our models):

On A100, the kernel takes:

  • FP16: 631.52us before the change, 293.79us after, 2.15x faster
  • FP32: 650.62us before the change, 566.94us after, 1.15x faster

On V100:

  • FP16: 727.17us before the change, 518.30us after, 1.40x faster
  • FP32: 1.08ms before the change, 967.17us after, 1.12x faster

We also ran more perf tests on different input tensor sizes and observed no perf regression. The optimization helps FP16 more than FP32, and larger tensors more than smaller ones.
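
The diff itself is best read on GitHub, but as a rough illustration of the pattern the description refers to, here is a minimal sketch of a tiled [batch, rows, cols] → [batch, cols, rows] transpose (perm [0,2,1]) in which each thread handles NUM_ELE_PER_THREAD elements. The kernel and launcher names, signatures, and bounds handling below are illustrative assumptions, not the PR's actual code.

```cuda
#include <cuda_runtime.h>

// Same constants as the diff below.
constexpr unsigned int TILE_DIM = 16;
constexpr unsigned int NUM_ELE_PER_THREAD = 4;
// Thread blocks are TILE_DIM x (TILE_DIM / NUM_ELE_PER_THREAD), so each
// thread loads and stores NUM_ELE_PER_THREAD elements instead of one.
constexpr unsigned int BLOCK_ROWS = TILE_DIM / NUM_ELE_PER_THREAD;

// Hypothetical sketch kernel, not the PR's Transpose3DKernel.
template <typename T>
__global__ void Transpose3DSketch(T* out, const T* in, int rows, int cols) {
  // +1 column of padding avoids shared-memory bank conflicts on the
  // transposed read below.
  __shared__ T tile[TILE_DIM][TILE_DIM + 1];

  const size_t batch_offset = static_cast<size_t>(blockIdx.z) * rows * cols;

  // Cooperative load: each thread copies NUM_ELE_PER_THREAD tile rows,
  // with consecutive threads reading consecutive addresses (coalesced).
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  for (int j = 0; j < static_cast<int>(TILE_DIM); j += BLOCK_ROWS) {
    if (x < cols && y + j < rows) {
      tile[threadIdx.y + j][threadIdx.x] =
          in[batch_offset + static_cast<size_t>(y + j) * cols + x];
    }
  }
  __syncthreads();

  // Transposed store: swap the block coordinates and read the tile with
  // swapped thread indices, so global writes stay coalesced too.
  x = blockIdx.y * TILE_DIM + threadIdx.x;  // column in the [batch, cols, rows] output
  y = blockIdx.x * TILE_DIM + threadIdx.y;  // row in the output
  for (int j = 0; j < static_cast<int>(TILE_DIM); j += BLOCK_ROWS) {
    if (x < rows && y + j < cols) {
      out[batch_offset + static_cast<size_t>(y + j) * rows + x] =
          tile[threadIdx.x][threadIdx.y + j];
    }
  }
}

template <typename T>
void LaunchTranspose3DSketch(T* out, const T* in, int batch, int rows, int cols,
                             cudaStream_t stream) {
  dim3 block(TILE_DIM, BLOCK_ROWS);  // 16 x 4 = 64 threads per block
  dim3 grid((cols + TILE_DIM - 1) / TILE_DIM,
            (rows + TILE_DIM - 1) / TILE_DIM,
            batch);
  Transpose3DSketch<T><<<grid, block, 0, stream>>>(out, in, rows, cols);
}
```

With TILE_DIM = 16 and NUM_ELE_PER_THREAD = 4, each 64-thread block still moves a full 16x16 tile, so per-thread index arithmetic is amortized over four elements and the launch uses fewer, busier threads.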

@centwang added the training label (issues related to ONNX Runtime training; typically submitted using template) on Mar 16, 2022
@@ -7,44 +7,52 @@
namespace onnxruntime {
namespace cuda {

constexpr unsigned int TILE_DIM = 16;
constexpr unsigned int NUM_ELE_PER_THREAD = 4;
Contributor:
Have we ever tried other values like 2/8/16?

Contributor:

And do we expect the same improvement on ROCm?

Contributor (Author):
Tried 4 and 8; it doesn't make much difference. Most of our other kernels use 4, so we follow the same number here. We didn't check the perf on ROCm.

pengwa (Contributor) left a review:
LGTM.

@centwang centwang merged commit 6c0eff1 into master Mar 17, 2022
@centwang centwang deleted the weicwang/transpose branch March 17, 2022 10:09
lavanyax pushed a commit to intel/onnxruntime that referenced this pull request Mar 29, 2022
seddonm1 pushed a commit to seddonm1/onnxruntime that referenced this pull request May 15, 2022