-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Conversation
Hey @ElaineBao , Thanks for submitting the PR
CI supported jobs: [unix-cpu, edge, centos-gpu, windows-cpu, miscellaneous, sanity, unix-gpu, windows-gpu, clang, website, centos-cpu] Note: |
@mxnet-bot run ci [all] |
Is tensor-level summation slow as the compiler fails to optimize the mshadow implementation? Or what is the reason? |
Not quite sure about the reason, but I think it may be related to temporary memory allocation |
@mxnet-bot run ci [unix-gpu, windows-gpu] |
Jenkins CI successfully triggered : [windows-gpu, unix-gpu] |
@ElaineBao do you expect this to be true at other places where tensor-level summation is used? Should these places be checked / fixed too? |
@leezu I cannot say all tensor-level summation is slow, after all I haven't run all the cases. But changing tensor-level summation to element-wise summation actually increases the amount of code and makes the code less readable, so if not for known efficiency issue, I think it's better to remain unchanged and using tensor-level summation. |
@ElaineBao Could you please share a benchmarking script so we can verify the effect of this optimization? opperf may help: https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf. |
@TaoLv OK, I'll work on it |
The CI issue should be already addressed. Please rebase your PR and resolve the comments. Thanks. |
3cd7560
to
794a324
Compare
Sorry for the late reply.
So I use mxnet profiler to validate the performance, I think it's also reasonable.
And the performance:
|
@mxnet-bot run ci [unix-gpu, unix-cpu] |
Jenkins CI successfully triggered : [unix-gpu, unix-cpu] |
@ElaineBao Thank you. Impressive speedup! |
FYI, @ChaiBapchya. |
Opperf to be fixed by #17894 |
Description
The function of
AddTakeGrad
is used in the backward pass of embedding operator. Originally it uses tensor-level summation, which is very slow. By replacing tensor-level summation to element-wise summation, this function can be faster (about 6X speedup for a dummy example).@xinyu-intel @zixuanweeei @TaoLv please review.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments