This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Optimize AddTakeGrad Tensor Sum #17906

Merged 1 commit on Apr 7, 2020

Conversation

ElaineBao
Contributor

@ElaineBao ElaineBao commented Mar 25, 2020

Description

The AddTakeGrad function is used in the backward pass of the Embedding operator. It originally used tensor-level summation, which is very slow. Replacing the tensor-level summation with element-wise summation makes the function faster (about a 6X speedup for a dummy example).
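
To make the two summation styles concrete, here is a minimal plain-Python sketch of the accumulation AddTakeGrad performs: each row src[i] is added into dst[index[i]]. It is only an illustration, not the mshadow/C++ code touched by this PR. The function names are hypothetical, index clipping is omitted, and the row-wise variant builds a temporary row purely to mirror the "tensor-level" style, not to demonstrate the actual cause of the slowdown.

```python
# Illustrative sketch only: a plain-Python stand-in for the scatter-add that
# AddTakeGrad performs. Function names are hypothetical, not MXNet APIs.

def add_take_grad_rowwise(dst, index, src):
    """Accumulate src[i] into dst[index[i]] one whole row at a time."""
    for i, row_id in enumerate(index):
        # Builds a temporary row and stores it back (tensor-level style).
        dst[row_id] = [d + s for d, s in zip(dst[row_id], src[i])]
    return dst


def add_take_grad_elementwise(dst, index, src):
    """Same accumulation, written as an explicit loop over elements."""
    for i, row_id in enumerate(index):
        row = dst[row_id]
        for j in range(len(row)):
            row[j] += src[i][j]  # in-place update, no temporary row
    return dst


index = [2, 0, 2]  # row 2 is looked up twice, so its gradients are summed
src = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
grad_a = add_take_grad_rowwise([[0.0, 0.0] for _ in range(3)], index, src)
grad_b = add_take_grad_elementwise([[0.0, 0.0] for _ in range(3)], index, src)
print(grad_a)
print(grad_b)
```

Both variants produce the same gradient; they differ only in how the per-row addition is expressed.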

@xinyu-intel @zixuanweeei @TaoLv please review.

@mxnet-bot

Hey @ElaineBao, thanks for submitting the PR.
Once your PR is ready for CI checks, invoke the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-cpu, edge, centos-gpu, windows-cpu, miscellaneous, sanity, unix-gpu, windows-gpu, clang, website, centos-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@ElaineBao
Contributor Author

@mxnet-bot run ci [all]

@leezu
Contributor

leezu commented Mar 25, 2020

Is tensor-level summation slow as the compiler fails to optimize the mshadow implementation? Or what is the reason?

@ElaineBao
Contributor Author

Is tensor-level summation slow as the compiler fails to optimize the mshadow implementation? Or what is the reason?

I'm not quite sure about the reason, but I think it may be related to temporary memory allocation.

@ElaineBao
Contributor Author

@mxnet-bot run ci [unix-gpu, windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu, unix-gpu]

@leezu
Contributor

leezu commented Mar 26, 2020

@ElaineBao do you expect this to be true at other places where tensor-level summation is used? Should these places be checked / fixed too?

@ElaineBao
Contributor Author

@leezu I can't say that all tensor-level summation is slow; I haven't run all the cases.

However, changing tensor-level summation to element-wise summation increases the amount of code and makes it less readable, so unless there is a known efficiency issue, I think it's better to leave the code unchanged and keep using tensor-level summation.

@TaoLv
Member

TaoLv commented Mar 26, 2020

@ElaineBao Could you please share a benchmarking script so we can verify the effect of this optimization? opperf may help: https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf.

@ElaineBao
Contributor Author

@TaoLv OK, I'll work on it

@TaoLv
Member

TaoLv commented Apr 4, 2020

The CI issue should already be addressed. Please rebase your PR and resolve the comments. Thanks.

@ElaineBao
Contributor Author

Sorry for the late reply.
I tried to use opperf, but it didn't work; it threw the following error:

#   File "/incubator-mxnet/benchmark/opperf/rules/default_params.py", line 606, in <module>
#     "axis_shape": DEFAULT_AXIS_SHAPE,
# NameError: name 'DEFAULT_AXIS_SHAPE' is not defined

So I used the MXNet profiler to validate the performance instead, which I think is also reasonable.
The script is as follows:

import random
import pandas as pd
import mxnet as mx
import numpy as np
from sklearn.model_selection import train_test_split

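batch_size = 1000
num_epoch = 5
model_prefix = 'drivethru_attention_d'
n_plus = 522   # number of distinct ids (used as the Embedding input_dim)
total = 40000  # number of synthetic records
profiling = True

# Build a synthetic dataset: 5 random ids per record plus a binary label.
records = []
for _ in range(total):
    pluids = [random.randint(0, n_plus - 1) for _ in range(5)]
    label = random.randint(0, 1)
    records.append((pluids, label))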

data = pd.DataFrame(records,
                    columns=['pluids','label'])
train, test = train_test_split(data, test_size=0.1, random_state=100)

X_train = mx.io.NDArrayIter(data={'pluids': np.array(train['pluids'].values.tolist(), dtype=int)},
                            label={'output_label': train['label'].values},
                            batch_size=batch_size,
                            shuffle=True)
X_eval = mx.io.NDArrayIter(data={'pluids': np.array(test['pluids'].values.tolist(), dtype=int)},
                            label={'output_label': test['label'].values},
                            batch_size=batch_size,
                            shuffle=True)
y_true = mx.symbol.Variable('output_label')


pluids = mx.symbol.Variable('pluids')
plu_embed = mx.symbol.Embedding(data=pluids, input_dim=n_plus, output_dim=50, name='plu_embed')

fc1 = mx.symbol.FullyConnected(data=plu_embed, num_hidden=int(n_plus), name='fc1')
rec_model = mx.symbol.SoftmaxOutput(data=fc1, label=y_true, name='output')

mod = mx.mod.Module(symbol=rec_model,
                    data_names=['pluids'],
                    label_names=['output_label'],
                    context=[mx.cpu()])
# enable profiler
mx.profiler.set_config(profile_symbolic=True, profile_imperative=True, profile_memory=False,
                       profile_api=True, filename='profile.json', aggregate_stats=True)
mx.profiler.set_state('run')

mod.fit(train_data=X_train,
        num_epoch=num_epoch,
        initializer=mx.init.Xavier(rnd_type="gaussian"),
        optimizer='adagrad',
        eval_metric=['accuracy'],
        validation_metric=['accuracy', mx.metric.TopKAccuracy(3)],
        eval_data=X_eval,
        batch_end_callback=mx.callback.Speedometer(batch_size, 2))

mx.profiler.set_state('stop')
print(mx.profiler.dumps())

And the performance:

  1. before optimization of _backward_Embedding:
operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_backward_Embedding                   180        2854.3340          12.3320          29.1350          15.8574
_mul_scalar                          1620         527.4130           0.0030           3.3110           0.3256
_backward_FullyConnected              180         162.2140           0.7430           1.6510           0.9012
SoftmaxOutput                         200         129.6620           0.1250           1.2650           0.6483
FullyConnected                        200         110.0570           0.2340          42.0660           0.5503
argmax                                200          49.5320           0.1840           0.4930           0.2477
broadcast_add                        1080          31.0420           0.0040           2.9860           0.0287
Embedding                             200          25.0530           0.0240           3.8110           0.1253
_backward_SoftmaxOutput               180          19.0860           0.0560           0.8680           0.1060
square                                540          18.5240           0.0030           2.8510           0.0343
sqrt                                  540          17.3870           0.0060           0.9440           0.0322
DeleteVariable                       3532          11.3070           0.0020           0.0330           0.0032
broadcast_sub                         540           8.2790           0.0040           0.0440           0.0153
broadcast_div                         540           7.4970           0.0050           0.0730           0.0139
_plus_scalar                          555           6.5850           0.0040           0.0650           0.0119
SetValueOp                              8           5.8160           0.0050           5.6540           0.7270
CopyCPU2CPU                           448           4.1040           0.0020           0.1030           0.0092
ResourceParallelRandomSetSeed               1           3.8440           3.8440           3.8440           3.8440
WaitForVar                            220           1.3150           0.0040           0.0120           0.0060
Cast                                   33           1.1590           0.0080           0.2680           0.0351
_random_normal                          2           0.7270           0.1210           0.6060           0.3635
_zeros                                  6           0.2650           0.0070           0.0720           0.0442
_div_scalar                            15           0.1400           0.0050           0.0190           0.0093
SetupExec                               6           0.0150           0.0010           0.0060           0.0025
_full                                   1           0.0060           0.0060           0.0060           0.0060
  2. after optimization of _backward_Embedding:
operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_mul_scalar                          1620         451.0970           0.0030           3.2960           0.2785
_backward_FullyConnected              180         195.9230           0.7440           2.3720           1.0885
SoftmaxOutput                         200         156.3020           0.1080           1.2910           0.7815
FullyConnected                        200         136.2320           0.2300          43.5920           0.6812
argmax                                200          54.5550           0.1710           0.4960           0.2728
_backward_SoftmaxOutput               180          39.8900           0.0570           0.8930           0.2216
Embedding                             200          27.0910           0.0270           3.1330           0.1355
broadcast_add                        1080          24.6370           0.0040           0.6560           0.0228
_backward_Embedding                   180          21.5230           0.0970           0.4120           0.1196
sqrt                                  540          20.1840           0.0060           0.1300           0.0374
square                                540          19.2420           0.0040           2.9200           0.0356
DeleteVariable                       3532          13.1160           0.0010           0.1310           0.0037
broadcast_sub                         540          11.0550           0.0040           0.0980           0.0205
broadcast_div                         540           9.3750           0.0050           0.1110           0.0174
_plus_scalar                          555           8.2280           0.0040           0.1140           0.0148
SetValueOp                              8           5.9090           0.0050           5.7620           0.7386
CopyCPU2CPU                           448           4.2760           0.0030           0.1040           0.0095
ResourceParallelRandomSetSeed               1           3.8370           3.8370           3.8370           3.8370
Cast                                   33           1.2800           0.0090           0.2670           0.0388
WaitForVar                            195           1.2160           0.0040           0.0180           0.0062
_random_normal                          2           0.7190           0.1200           0.5990           0.3595
_zeros                                  6           0.2710           0.0060           0.0790           0.0452
_div_scalar                            15           0.2610           0.0050           0.0790           0.0174
SetupExec                               6           0.0150           0.0020           0.0050           0.0025
_full                                   1           0.0070           0.0070           0.0070           0.0070
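
For reference, the two _backward_Embedding rows above work out to roughly a 130x reduction in that operator's time for this script. The short sketch below shows the arithmetic; the two row strings are copied from the tables above, and parse_row is a hypothetical helper written for this illustration, not part of the MXNet profiler API.

```python
# Rows copied verbatim from the aggregate profiler output above.
BEFORE = "_backward_Embedding                   180        2854.3340          12.3320          29.1350          15.8574"
AFTER = "_backward_Embedding                   180          21.5230           0.0970           0.4120           0.1196"


def parse_row(row):
    """Split one aggregate-stats row into name, call count, and timings (ms)."""
    name, count, total, t_min, t_max, avg = row.split()
    return {
        "name": name,
        "count": int(count),
        "total_ms": float(total),
        "min_ms": float(t_min),
        "max_ms": float(t_max),
        "avg_ms": float(avg),
    }


before = parse_row(BEFORE)
after = parse_row(AFTER)

# Ratio of time spent in the operator before and after the change.
total_speedup = before["total_ms"] / after["total_ms"]
avg_speedup = before["avg_ms"] / after["avg_ms"]
print(f"total: {total_speedup:.1f}x, average per call: {avg_speedup:.1f}x")
```

The call count is the same (180) in both runs, so the total and per-call ratios agree closely. The other operators vary by a few tens of percent between the two runs, which gives a sense of the run-to-run noise.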

@ElaineBao
Contributor Author

@mxnet-bot run ci [unix-gpu, unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu, unix-cpu]

@TaoLv
Member

TaoLv commented Apr 5, 2020

@ElaineBao Thank you. Impressive speedup!

@TaoLv
Member

TaoLv commented Apr 5, 2020

Sorry for the late reply.
I tried to use opperf, but it doesn't work, some error throwed out when I was using it:
File "/incubator-mxnet/benchmark/opperf/rules/default_params.py", line 606, in <module>
"axis_shape": DEFAULT_AXIS_SHAPE,
NameError: name 'DEFAULT_AXIS_SHAPE' is not defined

FYI, @ChaiBapchya.

@leezu
Contributor

leezu commented Apr 5, 2020

Opperf to be fixed by #17894

@leezu leezu merged commit c3c76a8 into apache:master Apr 7, 2020
mk-61 pushed a commit to mk-61/incubator-mxnet that referenced this pull request Apr 7, 2020
@ElaineBao ElaineBao deleted the opt-embedding-bwd branch April 14, 2020 00:42
ElaineBao added a commit to ElaineBao/incubator-mxnet that referenced this pull request Apr 14, 2020