
Revert "Improve FC perf when no_bias=False" #15099

Merged 1 commit into apache:master on May 30, 2019

Conversation

@roywei (Member) commented May 29, 2019

Reverts #15033
@stu1130 and I are running a binary search on the commits that may have caused #15084.

This seems to be the one. The root cause of the CI build failure is still unknown, as it only happens in CI environments; building locally with GPU + MKLDNN works fine on master.

I'm creating this PR to test whether the CI check can pass.

@szha requested a review from eric-haibin-lin on May 29, 2019 at 23:27
@szha (Member) commented May 29, 2019

cc @anirudh2290

@anirudh2290 (Member)

Really weird! The CI check passed even on my PR. So if the CI check doesn't fail, does that tell us much?

@szha (Member) commented May 30, 2019

We certainly need to find the root cause of why CI wasn't reliable in this case (e.g. could it have been just a cache issue?).

@larroy (Contributor) commented May 30, 2019

The root cause seems to be insufficient shared memory in the Docker container used in CI.

Using time ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda --docker-build-retries 3 --shm-size 8G /work/runtime_functions.sh build_ubuntu_gpu_mkldnn

fixes the build.

We have some files which are very slow to compile and produce big binaries; I think it is an excess of inlining.
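
For reference, a minimal sketch (assuming Docker is available locally; this is not part of the CI scripts) of how to confirm how much shared memory a container actually gets:

    # Docker's default /dev/shm is only 64M; --shm-size raises it.
    docker run --rm --shm-size 8G ubuntu:18.04 df -h /dev/shm
    docker run --rm ubuntu:18.04 df -h /dev/shm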

@larroy (Contributor) commented May 30, 2019

#15100

@szha (Member) commented May 30, 2019

Hmm, if this was required, how did the original PR pass the CI check?

@anirudh2290 (Member)

I was able to reproduce this with make on my local machine with the following command:

make DEV=1 ENABLE_TESTCOVERAGE=1 USE_CPP_PACKAGE=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 'CUDA_ARCH=-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_70,code=sm_70' USE_SIGNAL_HANDLER=1 -j$(nproc)

Note: I was able to reproduce this only with a clean build after make clean. For example, after building on top of the previous commit, I wasn't able to reproduce it.
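
A minimal sketch of the repro sequence under those conditions (assuming the CUDA toolchain lives at /usr/local/cuda, as in the command above):

    # The failure only shows up from a clean tree, so wipe build artifacts first.
    make clean
    make DEV=1 ENABLE_TESTCOVERAGE=1 USE_CPP_PACKAGE=1 USE_BLAS=openblas \
         USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 \
         'CUDA_ARCH=-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_70,code=sm_70' \
         USE_SIGNAL_HANDLER=1 -j$(nproc)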

@roywei (Member, Author) commented May 30, 2019

@stu1130 tried with ~2G and it failed. It also seems 8G in #15100 failed.
I just tried ~10G and it passed:

ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda --docker-build-retries 3 --shm-size 10000m /work/runtime_functions.sh build_ubuntu_gpu_mkldnn

We had tried 500m before #15033 and everything worked fine; everything after this commit fails with 500m.
For local tests, I'm also able to reproduce this now with @anirudh2290's command.
So this can't be only shared-memory related.

  1. Why did the code change in "Improve FC perf when no_bias=False" (#15033) cause CI to require more space? (It changed the additional temp-space requirement to no longer be limited to MKLDNN, but that should not affect the build?)
  2. Why did the CI check pass for the initial PR?
  3. How do we fix the local build?

I would prefer to merge this and keep #15084 open until we find the root cause.
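
To pin down the threshold more systematically, a rough sweep over --shm-size values could look like this (a hypothetical helper loop around the ci/build.py invocation quoted above, not part of the CI tooling):

    for SHM in 500m 2G 8G 10000m; do
        echo "=== trying --shm-size ${SHM} ==="
        ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda \
            --docker-build-retries 3 --shm-size "${SHM}" \
            /work/runtime_functions.sh build_ubuntu_gpu_mkldnn \
            && echo "PASS at ${SHM}" \
            || echo "FAIL at ${SHM}"
    done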

@anirudh2290 (Member)

@roywei sounds good to me.

@anirudh2290 merged commit 6543488 into apache:master on May 30, 2019
@aaronmarkham (Contributor)

Is there a next step for those of us with PRs that failed CI due to this issue?

@anirudh2290 (Member)

@aaronmarkham Rebase and retrigger CI and it should be fine.
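
A minimal sketch of that flow (assuming the apache repo is configured as the upstream remote and the PR branch is checked out):

    git fetch upstream
    git rebase upstream/master
    # Force-pushing the rebased branch updates the PR and retriggers CI.
    git push --force-with-lease origin HEAD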

@larroy (Contributor) commented May 30, 2019

We need to investigate what's going on. I thought it passed for me at 8G, but then my PR validation failed. It could be a linker bug or a link offset bigger than 2^32; it's still not 100% clear to me.

@larroy (Contributor) commented May 30, 2019

I'd suggest an strace dump of the linker.
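
A minimal sketch of how that could be captured (assuming strace is installed in the build container; the make target is illustrative, not the exact CI invocation):

    strace -f -o /tmp/link.trace make -j1 lib/libmxnet.so
    grep -E 'ENOMEM|ENOSPC|killed by' /tmp/link.trace | tail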

@larroy (Contributor) commented May 30, 2019

Just failed with 16G for me.

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019