
[CUDA] Optimize Transpose3DKernel #10891

Merged: centwang merged 1 commit from weicwang/transpose into master on Mar 17, 2022
Conversation

centwang (Contributor) commented:
Optimize the CUDA Transpose3DKernel by handling more elements per thread.

Take Transpose([3072,64,512], perm=[0,2,1]) as an example (it comes from one of our models):

On A100, the kernel takes:

  • FP16: 631.52us before the change, 293.79us after, 2.15x faster
  • FP32: 650.62us before the change, 566.94us after, 1.15x faster

On V100:

  • FP16: 727.17us before the change, 518.30us after, 1.40x faster
  • FP32: 1.08ms before the change, 967.17us after, 1.12x faster

We also ran more perf tests on different input tensor sizes and observed no perf regression. The optimization helps FP16 more than FP32, and larger tensors more than smaller ones.
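
The diff itself is best read on GitHub, but as a rough illustration of the pattern the description refers to, here is a minimal sketch of a tiled [batch, rows, cols] → [batch, cols, rows] transpose (perm [0,2,1]) in which each thread handles NUM_ELE_PER_THREAD elements. The kernel and launcher names, signatures, and bounds handling below are illustrative assumptions, not the PR's actual code.

```cuda
#include <cuda_runtime.h>

// Same constants as the diff below.
constexpr unsigned int TILE_DIM = 16;
constexpr unsigned int NUM_ELE_PER_THREAD = 4;
// Thread blocks are TILE_DIM x (TILE_DIM / NUM_ELE_PER_THREAD), so each
// thread loads and stores NUM_ELE_PER_THREAD elements instead of one.
constexpr unsigned int BLOCK_ROWS = TILE_DIM / NUM_ELE_PER_THREAD;

// Hypothetical sketch kernel, not the PR's Transpose3DKernel.
template <typename T>
__global__ void Transpose3DSketch(T* out, const T* in, int rows, int cols) {
  // +1 column of padding avoids shared-memory bank conflicts on the
  // transposed read below.
  __shared__ T tile[TILE_DIM][TILE_DIM + 1];

  const size_t batch_offset = static_cast<size_t>(blockIdx.z) * rows * cols;

  // Cooperative load: each thread copies NUM_ELE_PER_THREAD tile rows,
  // with consecutive threads reading consecutive addresses (coalesced).
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  for (int j = 0; j < static_cast<int>(TILE_DIM); j += BLOCK_ROWS) {
    if (x < cols && y + j < rows) {
      tile[threadIdx.y + j][threadIdx.x] =
          in[batch_offset + static_cast<size_t>(y + j) * cols + x];
    }
  }
  __syncthreads();

  // Transposed store: swap the block coordinates and read the tile with
  // swapped thread indices, so global writes stay coalesced too.
  x = blockIdx.y * TILE_DIM + threadIdx.x;  // column in the [batch, cols, rows] output
  y = blockIdx.x * TILE_DIM + threadIdx.y;  // row in the output
  for (int j = 0; j < static_cast<int>(TILE_DIM); j += BLOCK_ROWS) {
    if (x < rows && y + j < cols) {
      out[batch_offset + static_cast<size_t>(y + j) * rows + x] =
          tile[threadIdx.x][threadIdx.y + j];
    }
  }
}

template <typename T>
void LaunchTranspose3DSketch(T* out, const T* in, int batch, int rows, int cols,
                             cudaStream_t stream) {
  dim3 block(TILE_DIM, BLOCK_ROWS);  // 16 x 4 = 64 threads per block
  dim3 grid((cols + TILE_DIM - 1) / TILE_DIM,
            (rows + TILE_DIM - 1) / TILE_DIM,
            batch);
  Transpose3DSketch<T><<<grid, block, 0, stream>>>(out, in, rows, cols);
}
```

With TILE_DIM = 16 and NUM_ELE_PER_THREAD = 4, each 64-thread block still moves a full 16x16 tile, so per-thread index arithmetic is amortized over four elements and the launch uses fewer, busier threads.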

@centwang added the training label (issues related to ONNX Runtime training; typically submitted using template) on Mar 16, 2022
@@ -7,44 +7,52 @@
namespace onnxruntime {
namespace cuda {

constexpr unsigned int TILE_DIM = 16;
constexpr unsigned int NUM_ELE_PER_THREAD = 4;
Contributor:
Have we ever tried other values like 2/8/16?

Contributor:

And do we expect the same improvement on ROCm?

Contributor (Author):
Tried 4 and 8; it doesn't make much difference. Most of our other kernels use 4, so we follow the same number here. We didn't check the perf on ROCm.

pengwa (Contributor) left a review:
LGTM.

@centwang centwang merged commit 6c0eff1 into master Mar 17, 2022
@centwang centwang deleted the weicwang/transpose branch March 17, 2022 10:09
lavanyax pushed a commit to intel/onnxruntime that referenced this pull request Mar 29, 2022
seddonm1 pushed a commit to seddonm1/onnxruntime that referenced this pull request May 15, 2022