
Integrate MKLDNN Conv1d and support 3d layout #13530

Merged
14 commits merged into apache:master on Jan 2, 2019

Conversation

xinyu-intel
Contributor

@xinyu-intel xinyu-intel commented Dec 4, 2018

Description

This PR aims to integrate MKLDNN Conv1d and enable 3d layout for Conv and Activation.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • MKLDNN Conv1d
  • MKLDNN 3d layout for Conv and Activation
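
For context, a minimal sketch of the kind of user code this enables on the CPU path: a gluon Conv1D plus Activation forward pass on a 3d (NCW) input. The layer sizes and shapes below are illustrative, not taken from this PR.

```python
import mxnet as mx
from mxnet import gluon, nd

net = gluon.nn.Sequential()
net.add(gluon.nn.Conv1D(channels=32, kernel_size=3, padding=1))
net.add(gluon.nn.Activation('relu'))
net.initialize()

# 3d NCW input: (batch, channels, width). With this PR, both Conv1D and the
# following Activation can take the MKL-DNN path instead of the native CPU one.
x = nd.random.uniform(shape=(8, 16, 100))
y = net(x)
print(y.shape)  # (8, 32, 100)
```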

@pengzhao-intel @TaoLv


@TaoLv
Member

TaoLv commented Dec 4, 2018

Related issues: #11906, #11161

@TaoLv
Member

TaoLv commented Dec 5, 2018

@mxnet-label-bot add [MKLDNN, Operator, pr-awaiting-review]

@marcoabreu marcoabreu added MKLDNN Operator pr-awaiting-review PR is waiting for code review labels Dec 5, 2018
Member

@TaoLv TaoLv left a comment


Please point out where the unit tests for this feature are. Are they covered by the existing CI? Note that activation is also changed in this PR.

Review comments on:
src/operator/nn/activation.cc
src/operator/nn/mkldnn/mkldnn_act.cc
src/operator/nn/mkldnn/mkldnn_base-inl.h
src/operator/nn/mkldnn/mkldnn_base.cc
src/operator/nn/mkldnn/mkldnn_convolution.cc
@xinyu-intel
Contributor Author

@TaoLv Address Tao's comments. These changes are already covered by test_operator:test_convolution_grouping and test_operator:test_activation. I checked that they run into the MKL-DNN kernel by checking the mkldnn_verbose output.
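
For reference, a minimal sketch of that kind of check, assuming the build honors the MKLDNN_VERBOSE environment variable; the shapes are illustrative:

```python
import os
# Must be set before MXNet loads the MKL-DNN library.
os.environ["MKLDNN_VERBOSE"] = "1"

import mxnet as mx

# A 3d (NCW) input exercises the Conv1d path; mkldnn_verbose lines printed to
# stdout indicate that an MKL-DNN kernel was actually executed.
data = mx.nd.random.uniform(shape=(1, 16, 128))
weight = mx.nd.random.uniform(shape=(32, 16, 3))
bias = mx.nd.zeros(32)
out = mx.nd.Convolution(data, weight, bias, kernel=(3,), num_filter=32)
out.wait_to_read()
```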

} else if (arr.shape().ndim() == 3) {
tz = num_groups > 1
? mkldnn::memory::dims{num_groups,
static_cast<int>(arr.shape()[0] /
Member

Use O, I, H, W instead of 0, 1, 2 ...

dilates[0] = param.conv_param.dilate[0] - 1;
dilates[1] = param.conv_param.dilate[1] - 1;
} else {
LOG(FATAL) << "MKL-DNN currently supports 1d and 2d convolution";
Member

It would be better if we could mention the given dimension in the error message. The same applies to the other LOG(FATAL) messages.

@TaoLv
Member

TaoLv commented Dec 8, 2018

Do we have a unit test for 1D convolution without grouping? Please also make sure that this change works well for quantized convolution (which I think doesn't support 1D yet).

@xinyu-intel
Contributor Author

Address Tao's comments. I've skipped conv1d at both the C++ level and the Python level. @ZhennanQin Please help take a look. Maybe skipping all non-4d data layouts in quantization.py is not a good choice.

@@ -488,6 +488,9 @@ def quantize_model(sym, arg_params, aux_params,
A tuple of quantized symbol, quantized arg_params, and aux_params.
-------
"""
if ctx == cpu(0) and len(calib_data.provide_data[0].shape) != 3:
raise ValueError('MKL-DNN quantized OPs temporary support 4d layout.')
Contributor

Please don't check calib_data, as the quantization flow supports non-calib mode.

Contributor Author

done

@TaoLv
Member

TaoLv commented Dec 15, 2018

@zheng-da @ZhennanQin Please help review and approve if there are no further concerns.
@xinyu-intel Please rebase the code. I notice the last CI run was 4 days ago.
Thank you all.

@ZhennanQin
Contributor

LGTM.

@TaoLv
Member

TaoLv commented Dec 20, 2018

Seems there was a problem with CI. Do you mind re-triggering it with an empty commit? @xinyu-intel
Ping @zheng-da for review. Thank you.

Contributor

@pengzhao-intel pengzhao-intel left a comment


Could you add tests for 1D activation and quantized conv (to see if the messages are printed as expected)?
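
As an illustration of the kind of 1D activation check being requested, a minimal sketch that compares against a NumPy reference; the helper name and tolerances are assumptions, not the test that was actually added:

```python
import numpy as np
import mxnet as mx

def check_activation_1d(act_type="sigmoid", shape=(3, 8, 16)):
    # 3d (NCW) input so the MKL-DNN 3d activation path is exercised.
    x = mx.nd.random.uniform(-1, 1, shape=shape)
    y = mx.nd.Activation(x, act_type=act_type).asnumpy()
    if act_type == "sigmoid":
        ref = 1.0 / (1.0 + np.exp(-x.asnumpy()))
    elif act_type == "relu":
        ref = np.maximum(x.asnumpy(), 0)
    else:
        raise ValueError("unsupported act_type in this sketch")
    np.testing.assert_allclose(y, ref, rtol=1e-5, atol=1e-6)

check_activation_1d("sigmoid")
check_activation_1d("relu")
```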

dims.resize(shape.ndim() + 1);
dims[0] = 1;
for (size_t i = 0; i < shape.ndim(); i++)
dims[i + 1] = shape[i];
Contributor

Is there a performance difference between the 3D and 4D implementations?
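
As context for this question: the snippet above pads a shape to one higher rank by prepending a dimension of size 1 (e.g. 3d to 4d), which does not change the underlying data. A rough Python analogue of the idea, not the MKL-DNN code itself:

```python
import numpy as np

# A 3d buffer (e.g. NCW) viewed as 4d with a leading 1: the memory contents are
# identical and no copy is made, so the padding itself is free. Any performance
# difference would come from the kernels chosen for 3d vs. 4d descriptors.
x3 = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
x4 = x3.reshape((1,) + x3.shape)
assert np.shares_memory(x3, x4)
print(x3.shape, "->", x4.shape)  # (2, 3, 4) -> (1, 2, 3, 4)
```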

num_groups, static_cast<int>(arr.shape()[N] / num_groups),
static_cast<int>(arr.shape()[C]), static_cast<int>(arr.shape()[H]),
static_cast<int>(arr.shape()[W])};
return mkldnn::memory::desc{tz, get_mkldnn_type(arr.dtype()),
Contributor

This line is the common part of dim3 and dim4, right?

static_cast<int>(arr.shape()[3])};
return mkldnn::memory::desc{tz, get_mkldnn_type(arr.dtype()),
mkldnn::memory::format::any};
CHECK((arr.shape().ndim() == 3) || (arr.shape().ndim() == 4))
Contributor

Use a variable to save the value of arr.shape().ndim() to avoid calling it multiple times.

@xinyu-intel
Contributor Author

xinyu-intel commented Dec 25, 2018

@pengzhao-intel Conv1d fell back to the native CPU implementation before this optimization.

Total time (ms) for 100 iterations on one socket of a Xeon Skylake 8180:

| shape | before opt | after opt | speedup |
| --- | --- | --- | --- |
| (1,256,200) | 715.47 | 73.88 | 9.68x |
| (1,1024,512) | 1970.15 | 101.06 | 19.49x |
| (64,1024,512) | 131312.48 | 4196.24 | 31.29x |

I've added 1d, 3d, and 4d data shapes to the activation test. Regarding quantized conv, it will now return an error when the data shape is 3d, and users should exclude this layer:

  CHECK_EQ(param.full_conv_param.conv_param.kernel.ndim(), 2U)
      << "MKL-DNN only supports quantized conv2d.";

@pengzhao-intel pengzhao-intel mentioned this pull request Dec 25, 2018
Contributor

@pengzhao-intel pengzhao-intel left a comment


LGTM.
Thanks for the contributions :)

@xinyu-intel
Contributor Author

@zheng-da Please take a look. Thanks!

Member

@TaoLv TaoLv left a comment


Please rebase the code; then I think it's good to merge.

@TaoLv
Member

TaoLv commented Jan 2, 2019

@xinyu-intel Thank you for the contribution. Now merging.

@TaoLv TaoLv merged commit d7f9a07 into apache:master Jan 2, 2019
@bputrycz

bputrycz commented Jan 3, 2019

I noticed a very nice improvement with this change.
So, thank you.

Still, for my use case (conv1d, small batch size, small channel dimension, long sequence length),
I don't see much improvement when more cores are added to the computation.

The simplified snippet to reproduce (conv1d.py):

import mxnet as mx
from mxnet import gluon, nd

from mxnet import profiler
profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_output.json')
 
channels = 64

net = gluon.nn.Sequential()

conv = gluon.nn.Conv1D(channels, 4, padding=1)
act = gluon.nn.Activation('sigmoid')

for i in range(3):
    net.add(conv)
    net.add(act)

net.initialize()

data = nd.random.uniform(shape=(1, channels, 2**16))

# Warm-up
y = net(data)
nd.waitall()

profiler.set_state('run')
for i in range(10):
    y = net(data)
    nd.waitall()
profiler.set_state('stop')

print(profiler.dumps())

When run on a host with a lot of cores (AWS c4.8xlarge), this results in:

$ OMP_NUM_THREADS=1 python conv1d.py | grep "Convolution\|Activation"
Activation                             60         244.8700           3.3960           9.7510           4.0812
Convolution                            60        1648.7010          24.9280          40.6520          27.4783
$ OMP_NUM_THREADS=2 python conv1d.py | grep "Convolution\|Activation"
Activation                             60         127.4460           1.6600           5.4070           2.1241
Convolution                            60         866.8680          12.6670          22.9810          14.4478
$ OMP_NUM_THREADS=4 python conv1d.py | grep "Convolution\|Activation"
Activation                             60          65.3190           0.8940           2.9280           1.0886
Convolution                            60         854.2900          12.6230          20.3230          14.2382

There is no improvement when the number of threads is increased to 4 or more.

Playing more with this example, for higher 'channels' values it starts to parallelize a little better.
So it seems the parallelization is done only within a single sequence "point". Is that the case?
It seems quite natural to also parallelize along the sequence, especially when it is long, with different threads handling different parts of the sequence.
Then parallelization should scale roughly linearly with the number of threads.
Isn't it done like that?

Bartosz

@pengzhao-intel
Contributor

pengzhao-intel commented Jan 4, 2019

@bputrycz Thanks a lot for trying the new MKLDNN path and for the very useful feedback.
Nice analysis and a reproducible example. I will contact the Intel MKL-DNN team to see how to fix it.

In the short term, if this is an inference use case, you can try launching multiple instances (processes), with each one bound to 2 or 4 cores, to get the maximum throughput.
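
For illustration, a rough sketch of that multi-instance idea: launching several copies of the reproduction script, each pinned to its own small set of cores. The instance count, cores per instance, and use of taskset are assumptions for the sketch, not a tested recommendation from this thread.

```python
import os
import subprocess

NUM_INSTANCES = 4
CORES_PER_INSTANCE = 4  # e.g. 2 or 4 cores per process, as suggested above

procs = []
for i in range(NUM_INSTANCES):
    first = i * CORES_PER_INSTANCE
    last = first + CORES_PER_INSTANCE - 1
    env = dict(os.environ,
               OMP_NUM_THREADS=str(CORES_PER_INSTANCE),
               KMP_AFFINITY="granularity=fine,compact")
    # taskset pins each process to its own core range; conv1d.py is the
    # reproduction script from the comment above.
    cmd = ["taskset", "-c", "{}-{}".format(first, last), "python", "conv1d.py"]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```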

@pengzhao-intel
Contributor

@TaoLv is following this issue and will get back to you with more details soon.

@TaoLv
Member

TaoLv commented Jan 4, 2019

@bputrycz Cannot reproduce the problem. The code snippet you provided scales well on my machine:

$ OMP_NUM_THREADS=1 python conv1d.py | grep "Convolution\|Activation"
Activation                             60         183.5660           2.6600           6.1490           3.0594
Convolution                            60        1047.5291          14.2060          26.2510          17.4588
$
$ OMP_NUM_THREADS=2 python conv1d.py | grep "Convolution\|Activation"
Activation                             60         103.8990           1.4330           3.5280           1.7316
Convolution                            60         550.4830           7.5440          13.9890           9.1747
$
$ OMP_NUM_THREADS=4 python conv1d.py | grep "Convolution\|Activation"
Activation                             60          57.5180           0.7640           2.4380           0.9586
Convolution                            60         290.6110           3.8830           7.8330           4.8435
$
$ OMP_NUM_THREADS=8 python conv1d.py | grep "Convolution\|Activation"
Activation                             60          39.8030           0.4470           1.8200           0.6634
Convolution                            60         179.9880           2.2860           5.4140           2.9998

Have you tried setting the CPU affinity before running the multi-threaded case?

export KMP_AFFINITY=granularity=fine,compact

@TaoLv
Member

TaoLv commented Jan 4, 2019

Hmm, I just noticed that you're using c4.8xlarge, which I think has no AVX-512. I will have another try and come back to you later.

@TaoLv
Member

TaoLv commented Jan 4, 2019

Confirmed: this problem exists on machines without AVX-512, as the threading optimization for 1D convolution is implemented only for AVX-512. Would you mind giving it a try on AWS c5? @bputrycz
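
For anyone checking whether a machine exposes AVX-512 before trying this, a small Linux-only sketch (not part of MXNet) that reads /proc/cpuinfo; a c5 instance should list avx512 flags while c4 should not:

```python
# Linux-only: report whether the CPU advertises any AVX-512 feature flags.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())

avx512 = sorted(fl for fl in flags if fl.startswith("avx512"))
print("AVX-512 flags:", avx512 if avx512 else "none")
```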

@bputrycz

bputrycz commented Jan 4, 2019

Yes. I confirm.
It scales much better on AWS c5.

Thank you very much.

@pengzhao-intel
Contributor

> Yes. I confirm.
> It scales much better on AWS c5.
>
> Thank you very much.

It's good to see the problem can be resolved on AWS c5.
Feel free to ping us if you have any questions or issues :)
We will document this behavior in the MKLDNN README. @xinyu-intel

@TaoLv
Member

TaoLv commented Jan 4, 2019

@bputrycz Also feel free to let me know if the performance on c4 is critical for you.

@bputrycz

bputrycz commented Jan 4, 2019

@TaoLv Performance on c4 is not critical for me, as for now.

rondogency pushed a commit to rondogency/incubator-mxnet that referenced this pull request Jan 9, 2019
* add 3d layout support for MKLDNN Conv and Activation

* fix lint

* code refactor

* add testcase for group1 conv and skip quantization for conv1d

* fix lint

* avoid conv1d quantization

* code refactor and add activation ut

* del todo
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* add 3d layout support for MKLDNN Conv and Activation

* fix lint

* code refactor

* add testcase for group1 conv and skip quantization for conv1d

* fix lint

* avoid conv1d quantization

* code refactor and add activation ut

* del todo