This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix for import mxnet taking long time if multiple process launched #13602

Merged: 4 commits, merged on Dec 13, 2018

Conversation

@Vikas-kum (Contributor) commented Dec 10, 2018

On a machine with many cores (e.g. 72 cores on a c5.18xl), doing import mxnet in multiple processes takes a very long time. Details here: #12255

One of the reasons is the OMP tuning code, which iterates to measure the OMP tuning overhead. We are reducing this iteration count to reduce the overhead of the tuning code. We also added an environment variable so users can set the number of cores that should be used for tuning; reducing the number of cores used for tuning also reduces the number of iterations.
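As a rough illustration of the environment-variable part of the fix (a sketch only; the real implementation is C++ in src/operator/operator_tune-inl.h, and the helper name here is hypothetical):

import os
import multiprocessing

def cores_for_tuning():
    # Hypothetical sketch: cap the core count the tuning pass exercises
    # using MXNET_USE_NUM_CORES_OPERATOR_TUNING (the variable this PR adds).
    hw_cores = multiprocessing.cpu_count()
    cap = int(os.environ.get("MXNET_USE_NUM_CORES_OPERATOR_TUNING", hw_cores))
    return max(1, min(hw_cores, cap))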

Successfully ran the operator tuning unit tests (OMP_TUNING*).
Ran the test code below on a c5.18xl with 72 cores to check whether import mxnet got faster:

import time
import multiprocessing
from os import getpid

def mxnet_worker():
    # Import mxnet in a fresh process and report how long the import takes.
    print("before import: pid:{}".format(getpid()))
    st_time = time.time()
    import mxnet
    end_time = time.time()
    print("after import: pid:{} time:{}".format(getpid(), end_time - st_time))

if __name__ == "__main__":
    # Launch 30 processes that all import mxnet concurrently.
    processes = [multiprocessing.Process(target=mxnet_worker) for _ in range(30)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

Loading 30 mxnet processes takes 41 seconds with no change to the number of cores used for tuning.

Ideally num_cores should be 3 here. When I run the above code after setting the environment variable (export MXNET_USE_NUM_CORES_OPERATOR_TUNING=3), the 30 mxnet processes finish loading in 2.2 seconds.
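The variable can also be set from Python rather than the shell; a minimal sketch, assuming it must be visible before import mxnet runs (tuning appears to happen at import time, and child processes inherit the parent's environment):

import os

# Set before any process imports mxnet, since tuning runs during import.
os.environ["MXNET_USE_NUM_CORES_OPERATOR_TUNING"] = "3"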


@marcoabreu (Contributor) left a comment:

Thanks a lot for this improvement. Great catch!

@yuxihu (Member) left a comment:

Nice catch!

@TaoLv (Member) left a comment:

I'm still not very sure about the mechanism of operator tuning and how it contributes to operator computation. Do we have any performance comparison between runs with and without operator tuning? Previously, I set USE_OPERATOR_TUNING=0 when building MXNet to avoid the import hang, following the tips here: #10560 (comment)

@Vikas-kum Vikas-kum requested a review from szha as a code owner December 11, 2018 19:24
@Vikas-kum (Contributor, Author) commented Dec 11, 2018:

> I'm still not very sure about the mechanism of operator tuning and how it contributes to operator computation. Do we have any performance comparison between runs with and without operator tuning? Previously, I set USE_OPERATOR_TUNING=0 when building MXNet to avoid the import hang, following the tips here: #10560 (comment)

Good point. The operator tuning code was introduced by this PR: #8686. I don't see any numbers comparing performance with and without the tuning code. I think you should start a discussion on the dev list; maybe we will be able to find some relevant historical information and then decide whether operator tuning should be enabled by default or not.

Regarding this PR: it optimizes the case where operator tuning is enabled by default, and unblocks the 4.0 release.

@yuxihu (Member) left a comment:

LGTM.

- Values: String representation of MXNET_ENABLE_OPERATOR_TUNING environment variable
- 0=disable all
- 1=enable all
- float32, float16, float32=list of types to enable, and disable those not listed
A reviewer (Member) commented:

Can we list the valid types here: "float32", "float16", "float64", "int8", "uint8", "int32", "int64"?

@@ -56,7 +56,7 @@ namespace op {
#endif
#endif // MXNET_NO_INLINE

#define OUTSIDE_COUNT_SHIFT 9
@anirudh2290 (Member) commented Dec 11, 2018:

Does changing this impact the IsOMPFaster selection in operator_tune.h? Do we need to tweak WORKLOAD_COUNT_SHIFT too?

@Vikas-kum (Contributor, Author) replied Dec 12, 2018:

WORKLOAD_COUNT_SHIFT is currently 11, which means the workload count will be 2048, i.e. the operation is run 2048 times. That number could be made smaller, but IsOMPFaster doesn't look like the bottleneck for the related issue; the function that calculates the OMP overhead is what causes the problem.
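To make the shift constants concrete, here is an illustrative Python sketch of the relationship under discussion; it is not a transcription of the C++ in operator_tune-inl.h, and estimate_omp_overhead / run_parallel_section are hypothetical names:

import time

OUTSIDE_COUNT_SHIFT = 9    # after this PR: 1 << 9 = 512 timing iterations
WORKLOAD_COUNT_SHIFT = 11  # unchanged: workload count = 1 << 11 = 2048

def estimate_omp_overhead(run_parallel_section):
    # Time many empty parallel sections and average them; shrinking the
    # iteration count (or the cores exercised) shrinks the import-time cost.
    outside_count = 1 << OUTSIDE_COUNT_SHIFT
    start = time.perf_counter()
    for _ in range(outside_count):
        run_parallel_section()  # stand-in for an empty OMP region
    return (time.perf_counter() - start) / outside_count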

@roywei (Member) commented Dec 12, 2018:

@mxnet-label-bot add[Environment Variables, Operator]

@marcoabreu added the Environment Variables and Operator labels on Dec 12, 2018
@pengzhao-intel (Contributor) commented Dec 12, 2018:

Based on the comment in #10560 ("It sometimes blocks execution in Ubuntu and always blocks execution in Windows"), several related issues, including the import hang, have been reported.
Could anyone help verify the functionality of this feature (auto-tuning)? @mseth10 @azai91 @lupesko
Maybe we should turn it off by default. Any ideas?
@cjolivier01 could you provide more background on how auto-tuning works?

@anirudh2290 (Member) commented:

I am wondering if this change is needed. For example, we already disable OpenMP for child processes, so why can't we also skip tuning for the child processes?

@Vikas-kum (Contributor, Author) commented Dec 12, 2018:

@anirudh2290

> I am wondering if this change is needed. For example, we already disable OpenMP for child processes, so why can't we also skip tuning for the child processes?

If the option to enable OMP tuning remains, I think this change is needed: if someone enables the option, it makes loading faster because the OMP overhead is calculated faster.

If we remove OMP tuning altogether, then this change goes away with it. But removing or disabling it will require a bigger discussion about the historical reasons it was made the default and whether there is any benchmarking.

@anirudh2290 (Member) commented:

For child processes we disable OpenMP: https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L66. This means the number of OMP threads will be 1 for all operators executed in the child process, so tuning doesn't help much there.
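The handler linked above is a C++ fork handler registered in src/initialize.cc; the following is a loose Python analogue of the same idea, not MXNet's actual mechanism:

import os

def _after_fork_in_child():
    # Analogue of MXNet's fork handler: force single-threaded OMP in the
    # child so its parallel regions don't oversubscribe the machine.
    os.environ["OMP_NUM_THREADS"] = "1"

# Runs in the child after os.fork() and fork-based multiprocessing.
os.register_at_fork(after_in_child=_after_fork_in_child)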

@Vikas-kum (Contributor, Author) commented:

@anirudh2290
I started a discussion on the dev list about disabling OpenMP tuning by default, just to make sure we are not missing any proven benefit. We will go either way, disabling it completely or continuing with this PR, depending on input from the community.

@anirudh2290 (Member) commented:

To summarize an offline discussion with @Vikas89: the auto-tuning feature has performance benefits, which are documented in #8686 in the attached tune_all.txt.

As we can see, for various operators and different inputs, auto-tune selects whether to use OMP or not, and for close to 90% of the tests it makes the right selection.

@Vikas89 also tried to remove tuning for child processes, since we disable OpenMP for them, but because tuning is triggered during static variable initialization as part of process startup, changes made in the fork handlers are not reflected in the tuning code. So we decided to stick with the existing implementation and reduce the number of iterations for the OMP overhead calculation.

@anirudh2290 (Member) left a comment:

Can we run the sanity performance tests again with ./mxnet_unit_tests --gtest_filter=OMP_TUNING.EvaluateTuneTestFloat --perf?

@pengzhao-intel (Contributor) left a comment:

LGTM for this PR.

It's nice to see the microbenchmark showing the benefit of auto-tuning; however, I wonder how much benefit we gain in real workloads. We have been asked about this issue multiple times by different customers, and it does not make for a good experience.

I will spend some time looking into the code and run some benchmarks later.

@Vikas-kum (Contributor, Author) commented:

tuning-perf-results.txt: test results.
@anirudh2290 I ran the test: ./build/tests/cpp/mxnet_unit_tests --gtest_filter=OMP_TUNING.EvaluateTuneTestFloat --perf

@anirudh2290 merged commit 090f222 into apache:master on Dec 13, 2018
@anirudh2290 (Member) commented:

Thanks, the results look good: 93 out of 96 tests make the correct selection.

@anirudh2290 (Member) commented:

@pengzhao-intel can you please point me to the performance issues related to OMP tuning?

Vikas-kum added a commit to Vikas-kum/incubator-mxnet that referenced this pull request Dec 13, 2018
Fix for import mxnet taking long time if multiple process launched (apache#13602)

* apache#12255: doing import mxnet in multiple processes takes a very long time (details: apache#12255). One of the reasons is the OMP tuning code, which iterates to measure the OMP tuning overhead; this commit reduces that iteration count to reduce the overhead of the tuning code, and adds an environment variable so users can set the number of cores that should be used for tuning.

* cpplint fix

* Adding new environment variable MXNET_USE_NUM_CORES_OPERATOR_TUNING to the docs

* Fixing formatting in the docs
anirudh2290 pushed a commit that referenced this pull request Dec 13, 2018
@pengzhao-intel (Contributor) commented:

#12255 (comment)

@anirudh2290 this is a good summary of the issue, especially on an Intel Xeon Phi system where 72 physical cores and 72x4 logical cores are available.

@anirudh2290 (Member) commented:

@pengzhao-intel yes, I saw that issue. That issue checks the OMP overhead serially, launching the different parallel sections (with 2, ..., 18 threads) one after another, so I think for 36 cores the OMP runtime would try to reuse the already-launched threads rather than launch 170 threads. This can still be a problem when we fork the process into many subprocesses; we tried to disable operator tuning in subprocesses, but it wasn't trivial. The solution implemented here is therefore a good intermediate solution for a reasonable number of forked processes. We should still revisit the long-term solution of disabling tuning in forked processes.

- 0=disable all
- 1=enable all
- float32, float16, float32=list of types to enable, and disable those not listed
- refer : https://github.com/apache/incubator-mxnet/blob/master/src/operator/operator_tune-inl.h#L444
A reviewer (Member) commented:

I'm not sure it's a good choice to put a code link here. Once operator_tune-inl.h changes, we would probably need to revise the line number here to avoid confusion.

@Vikas-kum (Contributor, Author) replied:

Ah, I forgot to add the diff where I listed all the data types. I will create a separate PR to correct this.
