
Optimize transpose operator with MKL-DNN #14545

Merged: 16 commits into apache:master, Apr 10, 2019

Conversation

@TaoLv (Member) commented Mar 27, 2019

Description

Take shapes from the GluonNLP BERT base model as an example:

import time
import numpy as np
import mxnet as mx

shapes = [(64, 512, 1024), (32, 128, 12, 64), (32, 12, 128, 64)]
axes = [(0, 2, 1), (0, 2, 1, 3), (0, 2, 1, 3)]

for idx, sh in enumerate(shapes):
    axis = axes[idx]
    a = np.random.rand(*sh)
    x = mx.nd.array(a)

    tic = time.time()
    for i in range(100):
        y = mx.nd.transpose(x, axis)
        y.wait_to_read()
    toc = time.time()

    print("transpose %s to %s, %f" % (sh, axis, toc - tic))
    # Verify against NumPy.
    b = np.transpose(a, axis)
    assert np.allclose(b, y.asnumpy())

Before optimization:

transpose (64, 512, 1024) to (0, 2, 1), 1.228018
transpose (32, 128, 12, 64) to (0, 2, 1, 3), 0.069434
transpose (32, 12, 128, 64) to (0, 2, 1, 3), 0.065088

After optimization:

transpose (64, 512, 1024) to (0, 2, 1), 0.7996
transpose (32, 128, 12, 64) to (0, 2, 1, 3), 0.0107
transpose (32, 12, 128, 64) to (0, 2, 1, 3), 0.0069
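That is roughly a 1.5x speedup for the 3D shape, and about 6.5x and 9.4x for the two 4D shapes.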

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@TaoLv (Member, Author) commented Mar 27, 2019

@abhinavs95 (Contributor) commented:
@mxnet-label-bot add [MKLDNN, Backend, pr-awaiting-review]

@marcoabreu added labels: Backend (Issues related to the backend of MXNet), MKLDNN, pr-awaiting-review (PR is waiting for code review) on Mar 27, 2019

CHECK_EQ(inputs.size(), 1U);
CHECK_EQ(outputs.size(), 1U);
if (SupportMKLDNNTranspose(param, inputs[0])) {
Contributor:
How about moving CHECK_EQ(req, kWriteTo) << "Transpose does not support inplace"; here? Then we would have an opportunity to provide fallback support instead of crashing.

Member (Author):
I'm afraid there is no way to fall back. The check is copied from the original implementation:
https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op-inl.h#L311

I will move the check here so the error happens at an earlier stage.


public:
MKLDNNTransposeForward(const TransposeParam& param,
const OpReqType &req,
Contributor:
Seems req is redundant.

Member (Author):
Yes. Will remove that.

if (data.IsMKLDNNData()) {
this->data_->set_data_handle(data.GetMKLDNNData()->get_data_handle());
} else {
this->data_->set_data_handle(data.data().dptr<float>());
Contributor:
Seems this code can be reused for other dtypes, so we'd better avoid explicitly using float here. How about using a template dtype, or simply using this->data_->set_data_handle(data.GetMKLDNNData()->get_data_handle());? The latter should always work even if data isn't MKL-DNN data.

Member (Author):
Will change that. However, almost all MKL-DNN floating-point operators are restricted by checking data.dtype() == mshadow::kFloat32. We need to revisit those checks one day if we want to support fp64.

@pengzhao-intel (Contributor):
cc @eric-haibin-lin

@pengzhao-intel (Contributor) left a review comment:
Do we need special test cases for MKL-DNN transpose?

/*!
* \file mkldnn_transpose.cc
* \brief
* \author
Contributor:
Add your name

const NDArray &data) {
auto data_ndim = data.shape().ndim();

if (data_ndim > 4 || data.dtype() != mshadow::kFloat32)
Contributor:
Does transpose work for INT8?

Member (Author):
It should work, but it's not tested and verified. So here I limited the dimensionality and data type, just like we did for other MKL-DNN operators. BTW, if we want to use INT8 transpose in a quantized network, we probably need a quantized transpose operator that accepts and outputs an additional scale argument.
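For illustration, a quick check (not part of the PR's test suite) that a dtype outside this guard still produces correct results via the fallback path:

import mxnet as mx
import numpy as np

# float64 does not satisfy the float32-only guard, so this transpose
# takes the default CPU kernel rather than the MKL-DNN reorder.
a = np.random.rand(4, 5, 6)
x = mx.nd.array(a, dtype='float64')
y = mx.nd.transpose(x, axes=(0, 2, 1))
assert np.allclose(y.asnumpy(), np.transpose(a, (0, 2, 1)))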

dst_fmt.layout_desc.blocking.strides[1][axes[i]] = 1;

total_stride *= shape[axes[i]];
}
Contributor:
Please add an explanation of the logic behind these stride settings.

Member (Author):
Added comments for that.
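For reference, the idea behind those settings can be modeled in plain Python (a sketch, not the PR's C++ code): walking the permuted axes from the innermost output axis outwards, each original axis gets a destination stride equal to the product of the sizes of all axes that end up inner to it, so a plain MKL-DNN reorder into that strided layout realizes the transpose.

import numpy as np

def transposed_strides(shape, axes):
    # Destination stride (in elements) for each ORIGINAL axis.
    strides = [0] * len(shape)
    total = 1
    for i in reversed(range(len(shape))):  # innermost output axis first
        strides[axes[i]] = total
        total *= shape[axes[i]]
    return strides

# Cross-check against a dense NumPy transpose.
shape, axes = (2, 3, 4), (0, 2, 1)
b = np.ascontiguousarray(np.arange(24).reshape(shape).transpose(axes))
expected = [b.strides[axes.index(k)] // b.itemsize for k in range(len(shape))]
assert transposed_strides(shape, axes) == expected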

@TaoLv (Member, Author) commented Mar 31, 2019

Do we need special test cases for MKL-DNN transpose?

Normal use cases should be covered by test_operator.py:test_transpose. I added one additional case to test inputs that have an MKL-DNN internal layout, e.g. nChw16c.
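A sketch of what such a case can look like (hypothetical, not the exact test added in the PR; the 1x1 convolution is just a convenient way to obtain an output stored in an MKL-DNN internal layout on MKLDNN builds):

import mxnet as mx
import numpy as np

data = mx.nd.random.uniform(shape=(2, 32, 8, 8))
weight = mx.nd.random.uniform(shape=(32, 32, 1, 1))
# On MKLDNN builds the convolution output typically carries an internal
# layout such as nChw16c instead of the default NCHW.
out = mx.nd.Convolution(data, weight=weight, no_bias=True,
                        kernel=(1, 1), num_filter=32)
t = mx.nd.transpose(out, axes=(0, 2, 3, 1))
assert np.allclose(t.asnumpy(), np.transpose(out.asnumpy(), (0, 2, 3, 1)))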

@pengzhao-intel (Contributor) left a review:
LGTM

Please also try the two issues below and see if MKL-DNN can accelerate them:

#14496
#14563

@ZhennanQin (Contributor) left a review:
LGTM.

@TaoLv (Member, Author) commented Apr 4, 2019

@pengzhao-intel It mitigates the performance issue of transpose in #14496 but doesn't help #14563. Possibly #14563 is not caused by a transpose regression. @apeforest @fhieber @samskalicky

(py3env) [lvtao@mlt-skx138 mxnet]$ numactl --physcpubind=0-27 --membind=0 python transpose_perf.py
10
50
100
200
500
mxnet version: 1.3.1
--------------------
size: 10
p50: 0.11 ms
p90: 0.11 ms
p99: 0.11 ms
--------------------
size: 50
p50: 0.06 ms
p90: 0.07 ms
p99: 0.07 ms
--------------------
size: 100
p50: 0.18 ms
p90: 0.18 ms
p99: 0.19 ms
--------------------
size: 200
p50: 2.89 ms
p90: 3.86 ms
p99: 4.22 ms
--------------------
size: 500
p50: 102.48 ms
p90: 102.74 ms
p99: 102.85 ms



mxnet version: 1.4.0
--------------------
size: 10
p50: 0.11 ms
p90: 0.12 ms
p99: 0.38 ms
--------------------
size: 50
p50: 0.09 ms
p90: 0.09 ms
p99: 0.10 ms
--------------------
size: 100
p50: 0.89 ms
p90: 0.93 ms
p99: 0.95 ms
--------------------
size: 200
p50: 5.31 ms
p90: 6.03 ms
p99: 6.52 ms
--------------------
size: 500
p50: 186.42 ms
p90: 186.93 ms
p99: 187.04 ms


(py3env) [lvtao@mlt-skx138 mxnet]$ numactl --physcpubind=0-27 --membind=0 python transpose_perf.py
10
50
100
200
500
mxnet version: 1.5.0 (this PR)
--------------------
size: 10
p50: 0.03 ms
p90: 0.03 ms
p99: 0.03 ms
--------------------
size: 50
p50: 0.08 ms
p90: 0.08 ms
p99: 0.09 ms
--------------------
size: 100
p50: 0.19 ms
p90: 0.33 ms
p99: 0.34 ms
--------------------
size: 200
p50: 2.44 ms
p90: 3.33 ms
p99: 3.50 ms
--------------------
size: 500
p50: 95.93 ms
p90: 96.31 ms
p99: 97.07 ms

@pengzhao-intel (Contributor) commented Apr 4, 2019

That's really good!

Regarding #14563, does the transpose actually take the MKL-DNN path?
I think it's enough to check the time of transpose with and without this PR.

@samskalicky (Contributor) left a review:
LGTM, agree with @TaoLv on the summary of the related issues above.

@azai91 (Contributor) commented Apr 6, 2019

@TaoLv Can you retrigger the build? One test is failing; it seems to be flaky, as it is also failing on master:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/499/pipeline

@zachgk mentioned this pull request Apr 9, 2019
@pengzhao-intel (Contributor):

Great improvement. Merging now.

@pengzhao-intel merged commit 2c5d7f7 into apache:master on Apr 10, 2019
larroy pushed a commit to larroy/mxnet that referenced this pull request Apr 15, 2019
* add mkldnn transpose

* general transpose

* support mkldnn format

* fix lint

* address comments

* add unit test

* add comments

* retrigger CI
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019 (same commit messages as above)