
[CD] dynamic libmxnet pipeline fix + small fixes #16966

Merged

merged 4 commits into apache:master on Dec 10, 2019

Conversation

@perdasilva (Contributor) commented Dec 3, 2019

Description

MKL builds for the dynamic libmxnet are failing because of a change in a previous PR: it deleted mx_mkldnn_deps from the Jenkinsfile, but that variable is still needed by an underlying import.

I also noticed a couple of inconsistencies:

  1. USE_NVTX=1 was not set for the CUDA 9.0 make configuration (a quick runtime check for this kind of build flag is sketched below the list)
  2. ubuntu_gpu_cu101 was using an older cuDNN version
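
For what it's worth, a minimal sanity check along those lines (an illustration only, not part of this PR; whether NVTX shows up in the runtime feature list depends on the MXNet version):

    from mxnet.runtime import Features

    # List the compile-time feature flags baked into the loaded libmxnet
    # to confirm the build picked up the intended make options.
    features = Features()
    print(features.is_enabled('CUDA'))                          # True for a CUDA build
    print('NVTX' in features and features.is_enabled('NVTX'))  # guarded: NVTX may not be listed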

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@perdasilva requested a review from szha as a code owner on December 3, 2019 09:43
@TaoLv (Member) commented Dec 3, 2019

Thank you for the fix, @perdasilva. Where can I see the failing CD status?

@perdasilva (Author) commented:

@TaoLv sorry for the delay in responding. CD runs on a daily cadence here. If you have access, you can also test changes to the CD pipeline on the Jenkins dev instance: update the configuration for this job to point to your repository and change the specified branch to your branch. That would give you a dry run of CD.

@perdasilva (Author) commented Dec 5, 2019

@DickJC123, I'm trying to fix CD and I think it's been failing since the fuse op PR. Do you have any idea why it could be failing for the CUDA 9.0 builds?

======================================================================
ERROR: test_operator_gpu.test_batchnorm_training
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 1830, in test_batchnorm_training
    check_batchnorm_training('default')
  File "/work/mxnet/tests/python/gpu/../unittest/test_operator.py", line 1769, in check_batchnorm_training
    check_numeric_gradient(test, in_location, mean_std, numeric_eps=1e-2, rtol=0.16, atol=1e-2)
  File "/work/mxnet/python/mxnet/test_utils.py", line 1101, in check_numeric_gradient
    symbolic_grads = {k:executor.grad_dict[k].asnumpy() for k in grad_nodes}
  File "/work/mxnet/python/mxnet/test_utils.py", line 1101, in <dictcomp>
    symbolic_grads = {k:executor.grad_dict[k].asnumpy() for k in grad_nodes}
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 2532, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [21:10:06] src/operator/fusion/fused_op.cu:558: Check failed: compileResult == NVRTC_SUCCESS (6 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0.
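
Until the root cause is fixed, the error message itself points at a workaround: disabling pointwise fusion via the MXNET_USE_FUSION environment variable it names. A minimal sketch (this must run before mxnet is imported):

    import os

    # Per the error above, disable operator fusion so NVRTC compilation
    # of fused kernels is skipped entirely.
    os.environ['MXNET_USE_FUSION'] = '0'

    import mxnet as mx  # subsequent graph execution uses unfused kernels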

@perdasilva changed the title from "[CD] dynamic libmxnet pipeline fix" to "[WIP][CD] dynamic libmxnet pipeline fix" on Dec 5, 2019
@DickJC123 (Contributor) commented:

I'll work with @ptrendx to resolve this.

@perdasilva changed the title from "[WIP][CD] dynamic libmxnet pipeline fix" to "[CD] dynamic libmxnet pipeline fix + small fixes" on Dec 9, 2019
@perdasilva (Author) commented:

@DickJC123 I've created an issue to track this problem: #17020. Thanks again for looking into it.

@szha szha merged commit 60f77f5 into apache:master Dec 10, 2019