This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[call for contribution] Improving CPU performance #2986

Open
hjk41 opened this issue Aug 10, 2016 · 37 comments

@hjk41
Contributor

hjk41 commented Aug 10, 2016

Currently we are still slow on CPU (#1222).

There are several things we can do:

  1. Do a good profile on a chosen set of benchmarks to understand the bottlenecks.
    Here are some candidates:
    1. MNIST CNN model: since the MNIST model is very small, performance will suffer if system overhead is high, and it will show us potential bottlenecks in non-CNN operations.
    2. CIFAR: this is much heavier than MNIST, so it will mostly show the performance of the underlying libraries we are using, i.e. OpenBLAS/MKL + MShadow. I think the configuration (number of threads) of these libraries has quite a lot of impact on overall performance.
  2. Integrate libraries like NNPACK (https://github.com/Maratyszcza/NNPACK) and MKLDNN (https://software.intel.com/en-us/articles/deep-neural-network-technical-preview-for-intel-math-kernel-library-intel-mkl).
  3. Improve the operators not included in NNPACK and MKLDNN. This would include some code in MShadow and some in mxnet operators.
@winstywang
Contributor

Can you first confirm that the bottleneck is in computation? As far as I know, caffe uses the same computational engine as ours, but is much faster than ours.

@Darwin2011

@winstywang @hjk41
I suspect that patch2col in mxnet is a bottleneck rather than the BLAS library.

@tqchen
Member

tqchen commented Aug 10, 2016

@Darwin2011 can you try making some changes to see if that is really the case? Thanks!

@xmchen1987

@hjk41 After uncommenting OpenMP in mshadow/mshadow/tensor_cpu-inl.h, mxnet is still slower than caffe. The detailed performance data on Intel(R) Xeon(R) CPU E5-4657L v2 is shown below:
[image: performance comparison table]
I can follow up on this issue:

  1. give a performance analysis of mxnet, showing which parts are slower.
  2. evaluate NNPACK and MKL DNN, to see any potential performance gains.
  3. integrate NNPACK or MKL DNN into mxnet.

@Maratyszcza

Maratyszcza commented Aug 11, 2016

Unlike MKL, NNPACK is not x86-only: it now includes kernels implemented with Clang vector extensions. The implementation was originally developed for PNaCl port of NNPACK, but can be compiled to any clang-supported architecture; LLVM will automatically lower the vector operations to the SIMD ISA of the target. To build NNPACK with portable SIMD implementation instead of x86-64 assembly, configure with python configure.py --enable-psimd. On Haswell+ targets, however, performance would be about 4 times slower than with x86-64 assembly.

@winstywang
Contributor

@xmchen1987 Could you help run a CPU profile? Let's figure out the bottleneck.

@xmchen1987

@hjk41 @winstywang @Darwin2011 I added a time measurement function in src/operator/convolution-inl.h and dumped the running time for AlexNet.
From the table below, we can see:

  1. convolution layers consume most of the running time.
  2. the time for patch2col/col2patch is about the same as the time for gemm. I remember that the time for patch2col/col2patch in Caffe is very small.
    [image: per-layer timing table]

Now there are two options to fix this problem:

  1. fix the patch2col/col2patch performance issue
  2. use MKL DNN to avoid using patch2col/col2patch.

What do you think?
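For reference, here is a minimal, self-contained sketch of the kind of timing harness described above, using std::chrono. The loops only stand in for the real patch2col and GEMM kernels; this is not mxnet code.

```cpp
// Illustrative only: a std::chrono timing harness for splitting convolution
// time into patch2col and GEMM phases. The lambdas stand in for real kernels.
#include <chrono>
#include <cstdio>
#include <vector>

template <typename F>
double time_ms(F&& work) {
  auto t0 = std::chrono::steady_clock::now();
  work();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  std::vector<float> buf(1 << 22, 1.0f);
  double unpack_ms = 0.0, gemm_ms = 0.0;

  // Stand-ins for the patch2col and GEMM calls inside the convolution operator.
  unpack_ms += time_ms([&] { for (float& x : buf) x += 1.0f; });
  gemm_ms   += time_ms([&] { for (float& x : buf) x *= 2.0f; });

  std::printf("patch2col: %.2f ms, gemm: %.2f ms\n", unpack_ms, gemm_ms);
  return 0;
}
```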

@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

@xmchen1987 great finding.
I guess we can use the same code as Caffe for patch2col? That should solve the problem.

Integrating MKL DNN is also a good idea. But I think if we decide to do that, we should do it systematically and replace every operator that MKL DNN has. @xmchen1987 are you interested in making this contribution?

@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

@ALL I just updated the issue. As some of you suggested, we should do some profiling first. And we should make sure we have comparable performance with Caffe on most of the critical applications. I just listed MNIST-CNN and Cifar as two of the candidate benchmarks. What else can you think of?

@xmchen1987

@hjk41 sure, I can do that.

Caffe uses memory copy to do patch2col, which is more efficient; no problem with using that method (see the sketch below).
I think we can add MKL DNN as another option, as MKL DNN can help us achieve better scalability across cores.
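For illustration, a minimal sketch of a memcpy-based patch2col of the kind described; this is not Caffe's or mxnet's actual code, and the function name and layout are illustrative. It handles the stride-1, no-dilation case, where each output row of the column buffer is a contiguous slice of an input row.

```cpp
// Illustrative patch2col (im2col): for stride 1 and no dilation, each output
// row of the column buffer is a contiguous slice of an input row, so it can be
// bulk-copied with memcpy instead of gathered element by element.
#include <algorithm>
#include <cstring>

// input: C x H x W, row-major. col: (C*KH*KW) x (OH*OW), row-major.
void patch2col_stride1(const float* input, int C, int H, int W,
                       int KH, int KW, int pad, float* col) {
  const int OH = H + 2 * pad - KH + 1;
  const int OW = W + 2 * pad - KW + 1;
  for (int c = 0; c < C; ++c) {
    for (int kh = 0; kh < KH; ++kh) {
      for (int kw = 0; kw < KW; ++kw) {
        float* dst = col + ((c * KH + kh) * KW + kw) * (OH * OW);
        for (int oh = 0; oh < OH; ++oh) {
          const int h = oh - pad + kh;        // source row in the input image
          float* drow = dst + oh * OW;
          if (h < 0 || h >= H) {              // whole row falls in the padding
            std::fill(drow, drow + OW, 0.0f);
            continue;
          }
          // For ow in [ow_lo, ow_hi) the source column w = ow - pad + kw is in
          // bounds, so that stretch is one contiguous memcpy from the input row.
          const int ow_lo = std::clamp(pad - kw, 0, OW);
          const int ow_hi = std::clamp(W + pad - kw, ow_lo, OW);
          std::fill(drow, drow + ow_lo, 0.0f);
          if (ow_hi > ow_lo) {
            const float* src = input + (c * H + h) * W + (ow_lo - pad + kw);
            std::memcpy(drow + ow_lo, src, sizeof(float) * (ow_hi - ow_lo));
          }
          std::fill(drow + ow_hi, drow + OW, 0.0f);
        }
      }
    }
  }
}
```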

@xmchen1987

@hjk41 Can we add AlexNet, since most DL framework benchmarks include AlexNet?

@tornadomeet
Contributor

@xmchen1987 Does MKL DNN only support Intel CPUs? Many users will run mxnet on mobile devices using ARM, so it would be better to use a general implementation.

@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

Good idea, @xmchen1987. AlexNet would show us the performance of the CNN implementation.
I guess MKL DNN would really show its benefit in this case.


@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

@tornadomeet MKL supports only x86 CPUs. I think we should add a compile option like we do for OpenBLAS/MKL. I believe MKL DNN will become the de-facto choice for DNN on x86, just like cuDNN for GPU.

@xmchen1987

@tornadomeet For ARM, I think Eigen is a good choice. If we want to get better performance, we can't use a general implementation, right? A general implementation usually means just OK performance.

@hjk41 So right now, I will integrate MKL DNN as a start to improve performance on x86 CPUs.

@Maratyszcza

@xmchen1987 Have you considered integrating NNPACK? It has some benefits over MKL DNN:

  1. NNPACK implements faster algorithms for convolutional layers based on Winograd transforms
  2. NNPACK is not restricted to x86-64
  3. NNPACK is open source

@tornadomeet
Contributor

tornadomeet commented Aug 12, 2016

@Maratyszcza I personally prefer NNPACK.

@hjk41
Contributor Author

hjk41 commented Aug 12, 2016

I think both NNPACK and MKL DNN are good libraries; we can support both if we have enough resources, just like we support both OpenBLAS and MKL. As to which library to support first, I am fine with either.

I will leave it to @xmchen1987 to decide which library to use first. Whichever he chooses to integrate first, it will be a good contribution to the mxnet community. If anyone else would like to join the effort and integrate NNPACK, they are more than welcome.

@xmchen1987

xmchen1987 commented Aug 12, 2016

@Maratyszcza I agree. NNPACK is a promising library; it implements the latest algorithms for convolution. Compared with MKL DNN, it may be faster.

@hjk41 I think we can support both of them, and I will choose NNPACK as a start.

@xmchen1987

@Maratyszcza I find that the Caffe integration only implements forward, not backward:
https://github.com/ajtulloch/caffe/tree/nnpack-pr
Does it have some problems, or is it just not implemented yet?

@Maratyszcza

Maratyszcza commented Aug 12, 2016

@xmchen1987 Backward pass for convolution is implemented in NNPACK. Caffe bindings are mostly used for inference, which is why I think they don't wrap backward propagation functions. You may refer to nnpack.torch for examples of wrapping backward convolution functions.

@sbodenstein
Contributor

sbodenstein commented Aug 13, 2016

@xmchen1987 @tornadomeet: An argument for Intel DAAL over NNPACK:

  1. NNPACK doesn't support Windows
  2. Intel DAAL will receive vastly more developer support over the long run than NNPACK, and will eventually be the fastest implementation on standard deployment hardware (i.e. Xeon CPUs).
  3. Intel DAAL is also open source, like NNPACK

Agreed that it would be nice to have both in the long run.

Correction: the implementations of the DNN layers in Intel DAAL are not open source, as they come from MKL. The relevant pieces of MKL are included in the DAAL binaries, which are very permissively licensed (Apache License 2.0).

@sbodenstein
Contributor

Also, this issue is relevant for this discussion: #2435

@Maratyszcza

@sbodenstein Intel DAAL is irrelevant. It is a high-level library, similar to MXNet. For the actual implementation of compute-intensive operations, it leverages Intel MKL DNN functions.

@sbodenstein
Contributor

sbodenstein commented Aug 14, 2016

@Maratyszcza: you are correct, DAAL does indeed call MKL (I didn't know this). But:

Previous versions of Intel DAAL required separate installation of the Intel Math Kernel Library (Intel MKL) and Intel Integrated Performance Primitives (Intel IPP). The latest version of Intel DAAL actually comes with the necessary binary parts of Intel MKL (for BLAS and LAPACK) as well as Intel IPP (compression and decompression)

So DAAL is similar to cuDNN: the implementation is not open source, but it comes with a permissive license to use. I will correct this above. Also, you are right, we should probably use MKL directly (unless perhaps the license for DAAL is much more permissive?).

Also, DAAL is not similar to MXNet. It was designed to be usable from other frameworks, for example:

Intel DAAL has a flexible API that allows to make integration to deep learning frameworks on different levels. It is possible, for example, to replace the implementation of a particular layer of neural network by DAAL implementation, or to replace the group of layers, or to feed the model trained with Caffe, Theano or Torch as an input to scoring stage of neural network implemented with DAAL.

@xmchen1987

As a first step, I have added both NNPACK and MKL DNN to the forward convolution function. I have tested the prediction accuracy; it is the same as the original implementation.

As NNPACK can't support stride when the batch size is larger than 1, I compared the performance on the VGG-19 model. The forward performance (batch size 128) on Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz is shown below:
[image: VGG-19 forward performance comparison]

The MKL DNN implementation achieves 2.6x performance and has no problem with core-count scalability.
The NNPACK implementation achieves 50% higher performance than the MKL DNN implementation on a single core, but its core-count scalability is really poor. @Maratyszcza Do you have a plan to fix the scalability issue?

@xmchen1987

xmchen1987 commented Sep 6, 2016

After implementing MKL DNN in MXNet, I achieve much better performance with almost the same accuracy.
The performance data on Intel(R) Xeon(R) CPU E5-2699 v4 is shown in the chart below: a 2.5x - 3.4x speedup. Now the performance of our framework is competitive with others.
[image: performance comparison chart]
The chart below shows the training curve on cifar10. The result shows my implementation achieves almost the same accuracy.
[image: cifar10 training curve]

One problem we still need to figure out: when training cifar10, no matter whether I use the base MKL implementation or the MKLDNN implementation, the training speed is only 20+ images/sec. I find the CPU utilization is very low. @hjk41 @tqchen @mli, do you have any hints for this problem?

@sbodenstein
Contributor

@hjk41
Contributor Author

hjk41 commented Sep 12, 2016

@xmchen1987 I guess it is the OpenMP problem. MShadow uses OpenMP to support multi-threading, but it is turned off by default. If you use MShadow operators too much, that could hurt performance. Could you do some profiling and tell us if this is the problem? If it is, then turning on OpenMP would help.

Un-comment this line to enable OpenMP:
https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/tensor_cpu-inl.h#L148
Then set OMP_NUM_THREADS to different numbers to tune performance. Try 2, 4, 6, 8, and so on, until it reaches the number of cores on your machine. Usually setting it to the number of cores of a single CPU (in case you have multiple CPUs) gets you the best performance.
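For illustration, a rough, self-contained sketch of what the OpenMP-enabled elementwise map amounts to (this is not the actual tensor_cpu-inl.h code); the thread count is then controlled by OMP_NUM_THREADS as described above.

```cpp
// Sketch of an elementwise map over a CPU tensor with an OpenMP parallel-for.
// Compile with -fopenmp; run e.g. OMP_NUM_THREADS=8 ./a.out
#include <omp.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<float> dst(1 << 24), src(1 << 24, 2.0f);

  #pragma omp parallel for
  for (long i = 0; i < static_cast<long>(dst.size()); ++i) {
    dst[i] = std::sqrt(src[i]);  // stands in for the mapped mshadow expression
  }

  std::printf("max threads: %d, dst[0] = %f\n", omp_get_max_threads(), dst[0]);
  return 0;
}
```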

@xmchen1987

xmchen1987 commented Sep 12, 2016

@sbodenstein Thanks for the reminder. I will check whether there are any API changes compared with the beta version.

@hjk41 I have un-commented https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/tensor_cpu-inl.h#L148.
Indeed, I have found two problems here:

  1. As some of MapExpCPUEngine uses the SSE-enabled MapPacketPlan, we need to add omp before https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/packet-inl.h#L398.
    We also need to add omp before https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/tensor_cpu-inl.h#L215, which is needed by operators like sumall_except_dim, and before https://github.com/dmlc/mshadow/blob/478e5fdf13372121421250a987b77083c186f6fd/mshadow/tensor_cpu-inl.h#L245, to head off further problems.
    This problem affects performance the most.
  2. The expression at https://github.com/dmlc/mxnet/blob/master/src/operator/batch_norm-inl.h#L104 looks strange to me. I know this expression is easy to understand, but it hurts performance, as we do not need to take the square root after the broadcast. We should do the square root and similar ops before the broadcast, but right now I am having trouble doing this. Do you have a suggestion on how to modify that? (A rough sketch of the reordering is below.)
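For illustration, a plain-loop sketch (not mshadow expression code) of the reordering described in point 2: compute 1/sqrt(var + eps) once per channel and broadcast the resulting scalar, instead of broadcasting first and taking the square root element by element. Only the variance-scaling part of batch norm is shown, and all names are illustrative.

```cpp
// Take 1/sqrt(var + eps) per channel, then broadcast that scalar over the
// spatial elements; the sqrt is done C times instead of N*C*H*W times.
#include <cmath>
#include <cstddef>
#include <vector>

// x, y: N x C x H x W tensors flattened row-major; var: per-channel variance.
void bn_scale_by_inv_std(const std::vector<float>& x, const std::vector<float>& var,
                         int N, int C, int HW, float eps, std::vector<float>* y) {
  std::vector<float> inv_std(C);
  for (int c = 0; c < C; ++c) {
    inv_std[c] = 1.0f / std::sqrt(var[c] + eps);  // per-channel, before broadcast
  }
  for (int n = 0; n < N; ++n) {
    for (int c = 0; c < C; ++c) {
      const float s = inv_std[c];
      const std::size_t base = (static_cast<std::size_t>(n) * C + c) * HW;
      for (int i = 0; i < HW; ++i) {
        (*y)[base + i] = x[base + i] * s;  // broadcast of a precomputed scalar
      }
    }
  }
}
```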

@hjk41
Contributor Author

hjk41 commented Sep 12, 2016

@antinucleon @tqchen any suggestions?

@Maratyszcza

FYI, NNPACK now supports Android.

@tqchen
Member

tqchen commented Sep 18, 2016

@xmchen1987 You are more than welcome to propose a PR for the BatchNorm. Normally, omp won't provide much improvement for simple elementwise ops. We can enable some of them manually in the performance-critical ops by hand-crafting some of the loops.
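As an illustration of the hand-crafted-loop idea (a sketch, not mxnet code): keep small elementwise ops serial and only go parallel above a size threshold, so the OpenMP fork/join overhead is paid only where it can pay off.

```cpp
// Stay serial for small elementwise ops; parallelize only large tensors.
// The threshold is arbitrary and should be tuned per machine.
#include <cstddef>

void scale_inplace(float* data, std::size_t n, float alpha) {
  const std::size_t kParallelThreshold = 1 << 16;
  if (n < kParallelThreshold) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= alpha;
    return;
  }
  #pragma omp parallel for
  for (long long i = 0; i < static_cast<long long>(n); ++i) {
    data[i] *= alpha;
  }
}
```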

@xmchen1987

@tqchen Recently I have been busy tuning MKL performance and forgot to add the PR.
According to my measurements on Intel(R) Xeon(R) CPU E5-2699 v4, it brings a big performance gain for the cifar10 training example.
[image: cifar10 training speed comparison]

dmlc/mshadow@5f99e94

@pengzhao-intel
Contributor

FYI, MKL-DNN integration is done and will be merged soon in #9677.
You can find the performance data in #8302

@Maratyszcza

@xmchen1987 @hjk41 @tqchen I looked at NNPACK bindings in MXNet, and they have room for improvement:

  • NNPACK now includes CMake configuration scripts for all platforms. It is better to use those rather than stick to an old NNPACK version, as NNPACK is getting updates and performance improvements.
  • NNPACK supports using pre-allocated workspace buffers provided by the framework rather than allocating and de-allocating them inside NNPACK on each convolution call. This is a big cost, especially for small convolutions. See "How to use nnp_convolution in latest version?" (Maratyszcza/NNPACK#75) for details.
  • NNPACK supports pre-computing transformed coefficients for inference use cases (when weights do not change between forward runs). See transform_strategy (Maratyszcza/NNPACK#82) for details.
  • NNPACK can do fused Convolution+ReLU at the cost of a single convolution operation. See the activation parameter. A hedged sketch of the workspace and activation usage follows below.
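Here is a hedged sketch of the workspace-reuse and fused-ReLU pattern described above, based on my reading of NNPACK's nnp_convolution_inference at the time; the exact parameter order, enum names, and the NULL-workspace size query should be checked against the nnpack.h you build against, and the shapes below are arbitrary.

```cpp
// Sketch only: query the scratch size once, keep the buffer in the framework,
// and reuse it across calls while fusing ReLU into the convolution.
#include <nnpack.h>
#include <vector>

void conv3x3_relu_inference(const float* input, const float* kernel,
                            const float* bias, float* output) {
  nnp_initialize();  // once per process; error handling omitted throughout

  const size_t in_ch = 64, out_ch = 64;
  const nnp_size input_size = {56, 56};
  const nnp_padding pad = {1, 1, 1, 1};
  const nnp_size kernel_size = {3, 3};
  const nnp_size stride = {1, 1};

  // With a NULL workspace buffer, NNPACK reports the scratch size it needs.
  size_t workspace_size = 0;
  nnp_convolution_inference(
      nnp_convolution_algorithm_wt8x8, nnp_convolution_transform_strategy_compute,
      in_ch, out_ch, input_size, pad, kernel_size, stride,
      input, kernel, bias, output,
      /*workspace_buffer=*/nullptr, &workspace_size,
      nnp_activation_relu, /*activation_parameters=*/nullptr,
      /*threadpool=*/nullptr, /*profile=*/nullptr);

  // The framework keeps this buffer and reuses it across calls, instead of
  // letting NNPACK allocate and free scratch memory on every convolution.
  std::vector<char> workspace(workspace_size);
  nnp_convolution_inference(
      nnp_convolution_algorithm_wt8x8, nnp_convolution_transform_strategy_compute,
      in_ch, out_ch, input_size, pad, kernel_size, stride,
      input, kernel, bias, output,
      workspace.data(), &workspace_size,
      nnp_activation_relu, nullptr, nullptr, nullptr);  // ReLU fused into the conv
}
```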

@tqchen
Member

tqchen commented Feb 6, 2018

@Maratyszcza Thanks for pointing it out! I created #9719 for this.
