This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[call for contribution] Improving CPU performance #2986

Open
hjk41 opened this issue Aug 10, 2016 · 37 comments

@hjk41
Contributor

hjk41 commented Aug 10, 2016

Currently we are still slow on CPU (#1222).

There are several things we can do:

  1. Do a good profile on a chosen set of benchmarks to understand the bottlenecks.
    Here are some candidates:
    1. MNIST CNN model: since the MNIST model is very small, performance will suffer if system overhead is high, and it will show us potential bottlenecks in non-CNN operations.
    2. CIFAR: this is much heavier than MNIST, so it will mostly show the performance of the underlying libraries we are using, i.e. OpenBLAS/MKL + MShadow. I think the configuration (number of threads) of these libraries has quite a lot of impact on overall performance.
  2. Integrate libraries like NNPACK (https://github.com/Maratyszcza/NNPACK) and MKLDNN (https://software.intel.com/en-us/articles/deep-neural-network-technical-preview-for-intel-math-kernel-library-intel-mkl).
  3. Improve the operators not included in NNPACK and MKLDNN. This would include some code in MShadow and some in mxnet operators.
@winstywang
Contributor

Can you first confirm that the bottleneck is in computation? As far as I know, caffe uses the same computational engine as ours, but is much faster than ours.

@Darwin2011

@winstywang @hjk41
I suspect that patch2col in mxnet is a bottleneck rather than the BLAS library.

@tqchen
Member

tqchen commented Aug 10, 2016

@Darwin2011 can you try making some changes to see if that is really the case? Thanks!

@xmchen1987

@hjk41 After uncommenting OpenMP in mshadow/mshadow/tensor_cpu-inl.h, mxnet is still slower than caffe. The detailed performance data on Intel(R) Xeon(R) CPU E5-4657L v2 is shown below:
[image: performance comparison table]
I can follow up on this issue:

  1. give a performance analysis of mxnet, showing which parts are slower.
  2. evaluate NNPACK and MKL DNN, to see any potential performance gains.
  3. integrate NNPACK or MKL DNN into mxnet.

@Maratyszcza

Maratyszcza commented Aug 11, 2016

Unlike MKL, NNPACK is not x86-only: it now includes kernels implemented with Clang vector extensions. The implementation was originally developed for PNaCl port of NNPACK, but can be compiled to any clang-supported architecture; LLVM will automatically lower the vector operations to the SIMD ISA of the target. To build NNPACK with portable SIMD implementation instead of x86-64 assembly, configure with python configure.py --enable-psimd. On Haswell+ targets, however, performance would be about 4 times slower than with x86-64 assembly.

@winstywang
Contributor

@xmchen1987 Could you help run a CPU profile? Let's figure out the bottleneck.

@xmchen1987

@hjk41 @winstywang @Darwin2011 I added a time measurement function in src/operator/convolution-inl.h and dumped the running time for AlexNet.
From the table below, we can see:

  1. convolution layers consume most of the running time.
  2. the time for patch2col/col2patch is about the same as the time for gemm. I remember that the time for patch2col/col2patch in Caffe is very small.
    [image: per-layer timing table]

Now there are two options to fix this problem:

  1. fix the patch2col/col2patch performance issue
  2. use MKL DNN to avoid using patch2col/col2patch.

What do you think?
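For reference, here is a minimal, self-contained sketch of the kind of timing harness described above, using std::chrono. The loops only stand in for the real patch2col and GEMM kernels; this is not mxnet code.

```cpp
// Illustrative only: a std::chrono timing harness for splitting convolution
// time into patch2col and GEMM phases. The lambdas stand in for real kernels.
#include <chrono>
#include <cstdio>
#include <vector>

template <typename F>
double time_ms(F&& work) {
  auto t0 = std::chrono::steady_clock::now();
  work();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  std::vector<float> buf(1 << 22, 1.0f);
  double unpack_ms = 0.0, gemm_ms = 0.0;

  // Stand-ins for the patch2col and GEMM calls inside the convolution operator.
  unpack_ms += time_ms([&] { for (float& x : buf) x += 1.0f; });
  gemm_ms   += time_ms([&] { for (float& x : buf) x *= 2.0f; });

  std::printf("patch2col: %.2f ms, gemm: %.2f ms\n", unpack_ms, gemm_ms);
  return 0;
}
```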

@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

@xmchen1987 great finding.
I guess we can use the same code as Caffe for patch2col? That should solve the problem.

Integrating MKL DNN is also a good idea. But I think if we decide to do that, we should do it systematically and replace every operator that MKL DNN has. @xmchen1987 are you interested in making this contribution?

@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

@ALL I just updated the issue. As some of you suggested, we should do some profiling first. And we should make sure we have comparable performance with Caffe on most of the critical applications. I just listed MNIST-CNN and Cifar as two of the candidate benchmarks. What else can you think of?

@xmchen1987

@hjk41 sure, I can do that.

Caffe uses memory copy to do patch2col, which is more efficient; no problem with using that method (see the sketch below).
I think we can add MKL DNN as another option, as MKL DNN can help us achieve better scalability across cores.
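For illustration, a minimal sketch of a memcpy-based patch2col of the kind described; this is not Caffe's or mxnet's actual code, and the function name and layout are illustrative. It handles the stride-1, no-dilation case, where each output row of the column buffer is a contiguous slice of an input row.

```cpp
// Illustrative patch2col (im2col): for stride 1 and no dilation, each output
// row of the column buffer is a contiguous slice of an input row, so it can be
// bulk-copied with memcpy instead of gathered element by element.
#include <algorithm>
#include <cstring>

// input: C x H x W, row-major. col: (C*KH*KW) x (OH*OW), row-major.
void patch2col_stride1(const float* input, int C, int H, int W,
                       int KH, int KW, int pad, float* col) {
  const int OH = H + 2 * pad - KH + 1;
  const int OW = W + 2 * pad - KW + 1;
  for (int c = 0; c < C; ++c) {
    for (int kh = 0; kh < KH; ++kh) {
      for (int kw = 0; kw < KW; ++kw) {
        float* dst = col + ((c * KH + kh) * KW + kw) * (OH * OW);
        for (int oh = 0; oh < OH; ++oh) {
          const int h = oh - pad + kh;        // source row in the input image
          float* drow = dst + oh * OW;
          if (h < 0 || h >= H) {              // whole row falls in the padding
            std::fill(drow, drow + OW, 0.0f);
            continue;
          }
          // For ow in [ow_lo, ow_hi) the source column w = ow - pad + kw is in
          // bounds, so that stretch is one contiguous memcpy from the input row.
          const int ow_lo = std::clamp(pad - kw, 0, OW);
          const int ow_hi = std::clamp(W + pad - kw, ow_lo, OW);
          std::fill(drow, drow + ow_lo, 0.0f);
          if (ow_hi > ow_lo) {
            const float* src = input + (c * H + h) * W + (ow_lo - pad + kw);
            std::memcpy(drow + ow_lo, src, sizeof(float) * (ow_hi - ow_lo));
          }
          std::fill(drow + ow_hi, drow + OW, 0.0f);
        }
      }
    }
  }
}
```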

@xmchen1987

@hjk41 Can we add AlexNet, since most DL framework benchmarks include AlexNet?

@tornadomeet
Contributor

@xmchen1987 Does MKL DNN only support Intel CPUs? Many users will run mxnet on mobile devices using ARM, so it would be better to use a general implementation.

@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

Good idea, @xmchen1987. AlexNet would show us the performance of the CNN implementation.
I guess MKL DNN would really show its benefit in this case.


@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

@tornadomeet MKL supports only x86 CPUs. I think we should add a compile option like we do for OpenBLAS/MKL. I believe MKL DNN will become the de-facto choice for DNN on x86, just like cuDNN for GPU.

@xmchen1987

@tornadomeet For ARM, I think Eigen is a good choice. If we want to get better performance, we can't use a general implementation, right? A general implementation usually means just OK performance.

@hjk41 So right now, I will integrate MKL DNN as a start to improve performance on x86 CPUs.

@Maratyszcza

@xmchen1987 Have you considered integrating NNPACK? It has some benefits over MKL DNN:

  1. NNPACK implements faster algorithms for convolutional layers based on Winograd transforms
  2. NNPACK is not restricted to x86-64
  3. NNPACK is open source

@tornadomeet
Contributor

tornadomeet commented Aug 12, 2016

@Maratyszcza I personally prefer NNPACK.

@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

I think both NNPACK and MKL DNN are good libraries; we can support both if we have enough resources, just like we support both OpenBLAS and MKL. As to which library to support first, I am fine with either.

I will leave it to @xmchen1987 to decide which library to use first. Whichever he chooses to integrate first, it will be a good contribution to the mxnet community. If anyone else would like to join the effort and integrate NNPACK, they are more than welcome.

@xmchen1987

xmchen1987 commented Aug 12, 2016

@Maratyszcza I agree. NNPACK is a promising library; it implements the latest algorithms for convolution. Compared with MKL DNN, it may be faster.

@hjk41 I think we can support both of them, and I will choose NNPACK as a start.

@xmchen1987

@Maratyszcza I find that the Caffe integration only implements forward, not backward:
https://github.com/ajtulloch/caffe/tree/nnpack-pr
Does it have some problems, or is it just not implemented yet?

@Maratyszcza

Maratyszcza commented Aug 12, 2016

@xmchen1987 Backward pass for convolution is implemented in NNPACK. Caffe bindings are mostly used for inference, which is why I think they don't wrap backward propagation functions. You may refer to nnpack.torch for examples of wrapping backward convolution functions.

@sbodenstein
Contributor

sbodenstein commented Aug 13, 2016

@xmchen1987 @tornadomeet: An argument for Intel DAAL over NNPACK:

  1. NNPACK doesn't support Windows
  2. Intel DAAL will receive vastly more developer support over the long run than NNPACK, and will eventually be the fastest implementation on standard deployment hardware (i.e. Xeon CPUs).
  3. Intel DAAL is also open source, like NNPACK

Agreed that it would be nice to have both in the long run.

Correction: the implementations of the DNN layers in Intel DAAL are not open source, as they come from MKL. The relevant pieces of MKL are included in the DAAL binaries, which are very permissively licensed (Apache License 2.0).

@sbodenstein
Contributor

Also, this issue is relevant for this discussion: #2435

@Maratyszcza

@sbodenstein Intel DAAL is irrelevant. It is a high-level library, similar to MXNet. For the actual implementation of compute-intensive operations, it leverages Intel MKL DNN functions.

@sbodenstein
Contributor

sbodenstein commented Aug 14, 2016

@Maratyszcza: you are correct, DAAL does indeed call MKL (I didn't know this). But:

Previous versions of Intel DAAL required separate installation of the Intel Math Kernel Library (Intel MKL) and Intel Integrated Performance Primitives (Intel IPP). The latest version of Intel DAAL actually comes with the necessary binary parts of Intel MKL (for BLAS and LAPACK) as well as Intel IPP (compression and decompression)

So DAAL is similar to cuDNN: the implementation is not open source, but it comes with a permissive license to use. I will correct this above. Also, you are right, we should probably use MKL directly (unless perhaps the license for DAAL is much more permissive?).

Also, DAAL is not similar to MXNet. It was designed to be usable from other frameworks, for example:

Intel DAAL has a flexible API that allows to make integration to deep learning frameworks on different levels. It is possible, for example, to replace the implementation of a particular layer of neural network by DAAL implementation, or to replace the group of layers, or to feed the model trained with Caffe, Theano or Torch as an input to scoring stage of neural network implemented with DAAL.

@xmchen1987

As a first step, I have added both NNPACK and MKL DNN to the forward convolution function. I have tested the prediction accuracy; it is the same as the original implementation.

As NNPACK can't support stride when the batch size is larger than 1, I compared the performance on the VGG-19 model. The forward performance (batch size 128) on Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz is shown below:
[image: VGG-19 forward performance comparison]

The MKL DNN implementation achieves 2.6x performance and has no problem with core-count scalability.
The NNPACK implementation achieves 50% higher performance than the MKL DNN implementation on a single core, but its core-count scalability is really poor. @Maratyszcza Do you have a plan to fix the scalability issue?

@xmchen1987

xmchen1987 commented Sep 6, 2016

After implementing MKL DNN in MXNet, I achieve much better performance with almost the same accuracy.
The performance data on Intel(R) Xeon(R) CPU E5-2699 v4 is shown in the chart below: a 2.5x - 3.4x speedup. Now the performance of our framework is competitive with others.
[image: performance comparison chart]
The chart below shows the training curve on cifar10. The result shows my implementation achieves almost the same accuracy.
[image: cifar10 training curve]

One problem we still need to figure out: when training cifar10, no matter whether I use the base MKL implementation or the MKLDNN implementation, the training speed is only 20+ images/sec. I find the CPU utilization is very low. @hjk41 @tqchen @mli, do you have any hints for this problem?

@sbodenstein
Contributor

@hjk41
Contributor Author

hjk41 commented Sep 12, 2016

@xmchen1987 I guess it is the OpenMP problem. MShadow uses OpenMP to support multi-threading, but it is turned off by default. If you use MShadow operators too much, that could hurt performance. Could you do some profiling and tell us if this is the problem? If it is, then turning on OpenMP would help.

Un-comment this line to enable OpenMP:
https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/tensor_cpu-inl.h#L148
Then set OMP_NUM_THREADS to different numbers to tune performance. Try 2, 4, 6, 8, and so on, until it reaches the number of cores on your machine. Usually setting it to the number of cores of a single CPU (in case you have multiple CPUs) gets you the best performance.
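For illustration, a rough, self-contained sketch of what the OpenMP-enabled elementwise map amounts to (this is not the actual tensor_cpu-inl.h code); the thread count is then controlled by OMP_NUM_THREADS as described above.

```cpp
// Sketch of an elementwise map over a CPU tensor with an OpenMP parallel-for.
// Compile with -fopenmp; run e.g. OMP_NUM_THREADS=8 ./a.out
#include <omp.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<float> dst(1 << 24), src(1 << 24, 2.0f);

  #pragma omp parallel for
  for (long i = 0; i < static_cast<long>(dst.size()); ++i) {
    dst[i] = std::sqrt(src[i]);  // stands in for the mapped mshadow expression
  }

  std::printf("max threads: %d, dst[0] = %f\n", omp_get_max_threads(), dst[0]);
  return 0;
}
```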

@xmchen1987

xmchen1987 commented Sep 12, 2016

@sbodenstein Thanks for the reminder. I will check whether there are any API changes compared with the beta version.

@hjk41 I have un-commented https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/tensor_cpu-inl.h#L148.
Indeed, I have found two problems here:

  1. As some of MapExpCPUEngine uses the SSE-enabled MapPacketPlan, we need to add omp before https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/packet-inl.h#L398.
    We also need to add omp before https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/tensor_cpu-inl.h#L215, which is needed by operators like sumall_except_dim, and before https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/tensor_cpu-inl.h#L245, to head off further problems.
    This problem affects performance the most.
  2. The expression at https://github.com/dmlc/mxnet/blob/master/src/operator/batch_norm-inl.h#L104 looks strange to me. I know this expression is easy to understand, but it hurts performance, as we do not need to take the square root after the broadcast. We should do the square root and similar ops before the broadcast, but right now I am having trouble doing this. Do you have a suggestion on how to modify that? (A rough sketch of the reordering is below.)
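For illustration, a plain-loop sketch (not mshadow expression code) of the reordering described in point 2: compute 1/sqrt(var + eps) once per channel and broadcast the resulting scalar, instead of broadcasting first and taking the square root element by element. Only the variance-scaling part of batch norm is shown, and all names are illustrative.

```cpp
// Take 1/sqrt(var + eps) per channel, then broadcast that scalar over the
// spatial elements; the sqrt is done C times instead of N*C*H*W times.
#include <cmath>
#include <cstddef>
#include <vector>

// x, y: N x C x H x W tensors flattened row-major; var: per-channel variance.
void bn_scale_by_inv_std(const std::vector<float>& x, const std::vector<float>& var,
                         int N, int C, int HW, float eps, std::vector<float>* y) {
  std::vector<float> inv_std(C);
  for (int c = 0; c < C; ++c) {
    inv_std[c] = 1.0f / std::sqrt(var[c] + eps);  // per-channel, before broadcast
  }
  for (int n = 0; n < N; ++n) {
    for (int c = 0; c < C; ++c) {
      const float s = inv_std[c];
      const std::size_t base = (static_cast<std::size_t>(n) * C + c) * HW;
      for (int i = 0; i < HW; ++i) {
        (*y)[base + i] = x[base + i] * s;  // broadcast of a precomputed scalar
      }
    }
  }
}
```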

@hjk41
Contributor Author

hjk41 commented Sep 12, 2016

@antinucleon @tqchen any suggestions?

@Maratyszcza

FYI, NNPACK now supports Android.

@tqchen
Member

tqchen commented Sep 18, 2016

@xmchen1987 You are more than welcome to propose a PR for the BatchNorm. Normally, omp won't provide much improvement for simple elementwise ops. We can enable some of them manually in the performance-critical ops by hand-crafting some of the loops.
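As an illustration of the hand-crafted-loop idea (a sketch, not mxnet code): keep small elementwise ops serial and only go parallel above a size threshold, so the OpenMP fork/join overhead is paid only where it can pay off.

```cpp
// Stay serial for small elementwise ops; parallelize only large tensors.
// The threshold is arbitrary and should be tuned per machine.
#include <cstddef>

void scale_inplace(float* data, std::size_t n, float alpha) {
  const std::size_t kParallelThreshold = 1 << 16;
  if (n < kParallelThreshold) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= alpha;
    return;
  }
  #pragma omp parallel for
  for (long long i = 0; i < static_cast<long long>(n); ++i) {
    data[i] *= alpha;
  }
}
```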

@xmchen1987

@tqchen Recently I have been busy tuning MKL performance and forgot to add the PR.
According to my measurements on Intel(R) Xeon(R) CPU E5-2699 v4, it brings a big performance gain for the cifar10 training example.
[image: cifar10 training speed comparison]

dmlc/mshadow@5f99e94

@pengzhao-intel
Contributor

FYI, MKL-DNN integration is done and will be merged soon in #9677.
You can find the performance data in #8302

@Maratyszcza

@xmchen1987 @hjk41 @tqchen I looked at NNPACK bindings in MXNet, and they have room for improvement:

  • NNPACK now includes CMake configuration scripts for all platforms. It is better to use those rather than stick to an old NNPACK version, as NNPACK is getting updates and performance improvements.
  • NNPACK supports using pre-allocated workspace buffers provided by the framework rather than allocating and de-allocating them inside NNPACK on each convolution call. This is a big cost, especially for small convolutions. See "How to use nnp_convolution in latest version?" (Maratyszcza/NNPACK#75) for details.
  • NNPACK supports pre-computing transformed coefficients for inference use cases (when weights do not change between forward runs). See transform_strategy (Maratyszcza/NNPACK#82) for details.
  • NNPACK can do fused Convolution+ReLU at the cost of a single convolution operation. See the activation parameter. A hedged sketch of the workspace and activation usage follows below.
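Here is a hedged sketch of the workspace-reuse and fused-ReLU pattern described above, based on my reading of NNPACK's nnp_convolution_inference at the time; the exact parameter order, enum names, and the NULL-workspace size query should be checked against the nnpack.h you build against, and the shapes below are arbitrary.

```cpp
// Sketch only: query the scratch size once, keep the buffer in the framework,
// and reuse it across calls while fusing ReLU into the convolution.
#include <nnpack.h>
#include <vector>

void conv3x3_relu_inference(const float* input, const float* kernel,
                            const float* bias, float* output) {
  nnp_initialize();  // once per process; error handling omitted throughout

  const size_t in_ch = 64, out_ch = 64;
  const nnp_size input_size = {56, 56};
  const nnp_padding pad = {1, 1, 1, 1};
  const nnp_size kernel_size = {3, 3};
  const nnp_size stride = {1, 1};

  // With a NULL workspace buffer, NNPACK reports the scratch size it needs.
  size_t workspace_size = 0;
  nnp_convolution_inference(
      nnp_convolution_algorithm_wt8x8, nnp_convolution_transform_strategy_compute,
      in_ch, out_ch, input_size, pad, kernel_size, stride,
      input, kernel, bias, output,
      /*workspace_buffer=*/nullptr, &workspace_size,
      nnp_activation_relu, /*activation_parameters=*/nullptr,
      /*threadpool=*/nullptr, /*profile=*/nullptr);

  // The framework keeps this buffer and reuses it across calls, instead of
  // letting NNPACK allocate and free scratch memory on every convolution.
  std::vector<char> workspace(workspace_size);
  nnp_convolution_inference(
      nnp_convolution_algorithm_wt8x8, nnp_convolution_transform_strategy_compute,
      in_ch, out_ch, input_size, pad, kernel_size, stride,
      input, kernel, bias, output,
      workspace.data(), &workspace_size,
      nnp_activation_relu, nullptr, nullptr, nullptr);  // ReLU fused into the conv
}
```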

@tqchen
Member

tqchen commented Feb 6, 2018

@Maratyszcza Thanks for pointing it out! I created #9719 for this.
