This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

fix test_depthwise_convolution for occasional CI failures #14016

Merged
merged 8 commits into apache:master from mseth10:fix-depthwise-conv
Feb 5, 2019

Conversation

mseth10
Contributor

@mseth10 mseth10 commented Jan 29, 2019

Description

This PR aims to fix #12203. test_depthwise_convolution fails occasionally on CI in the GPU context. The failure occurs because the test compares the GPU output of a convolution over a whole image against the CPU output obtained by convolving slices of the image and concatenating the results, and the error tolerances are sometimes exceeded. When the comparison is instead made between the GPU output of the convolution over the whole image and the GPU output of the sliced-and-concatenated convolution, the test passes.
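The slicing-and-concatenation comparison the test performs can be sketched in plain NumPy. This is an illustrative toy: `depthwise_conv2d` is a naive stand-in for the MXNet operator (which the test presumably exercises via `Convolution` with `num_group` set), and batching, padding, and strides are omitted:

```python
import numpy as np

def depthwise_conv2d(x, w):
    # Naive valid-mode depthwise convolution: one kernel per channel.
    # x: (C, H, W) input, w: (C, kh, kw) filters.
    C, H, W = x.shape
    _, kh, kw = w.shape
    out = np.zeros((C, H - kh + 1, W - kw + 1))
    for c in range(C):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(x[c, i:i + kh, j:j + kw] * w[c])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w = rng.standard_normal((4, 3, 3))

# Convolution over the whole input ...
full = depthwise_conv2d(x, w)
# ... versus convolving channel slices separately and concatenating,
# which is the second result the test compares against.
sliced = np.concatenate(
    [depthwise_conv2d(x[c:c + 1], w[c:c + 1]) for c in range(4)], axis=0)

np.testing.assert_allclose(full, sliced, rtol=1e-3, atol=1e-3)
```

When both results come from the same backend the arithmetic is effectively identical and the tolerances hold; the flakiness came from comparing two different backends (GPU vs. CPU), whose rounding can legitimately differ by more than the chosen tolerances.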

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Second context changed to match the first one, so that different implementations are compared on the same context

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Contributor

@apeforest apeforest left a comment

Can we also compare between cpu and cpu as context?

@apeforest
Contributor

Could you also create an issue for the discrepancy between the cpu and gpu implementations, with the input to reproduce it? Thanks!

@mseth10
Contributor Author

mseth10 commented Jan 29, 2019

@apeforest We already do that. We actually compare between default contexts. So when default context is cpu, we compare between cpu and cpu.

Contributor

@apeforest apeforest left a comment

LGTM. @mseth10 Thanks for your thorough analysis!

Contributor

@Vikas-kum Vikas-kum left a comment

Good catch!

I would also recommend noting the input you tested this fix on. Overall this looks good to me.

On a side note, you should also investigate whether there is a way to run this test in a non-cuDNN environment. I am not sure whether MXNet CI has one; if it does, this test should be part of the suite that runs test cases in the non-cuDNN environment.

@mseth10
Contributor Author

mseth10 commented Jan 29, 2019

@Vikas89 I checked. We do have a GPU no-cuDNN environment running on our CI. I will make sure this test runs in that environment.
I will also add the test input for which this test was failing.

@lebeg
Contributor

lebeg commented Jan 31, 2019

Unfortunately, the same test is failing in this PR:

======================================================================
ERROR: test_operator.test_depthwise_convolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/unittest/test_operator.py", line 1676, in test_depthwise_convolution
    np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [01:40:38] src/operator/nn/mkldnn/mkldnn_base.cc:576: Check failed: similar 

Not sure whether it's the same error though.
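For reference, the `assert_allclose` call at the bottom of the traceback passes only when `|actual - desired| <= atol + rtol * |desired|` holds elementwise. A small illustration of how those tolerances behave (toy values, not the test's actual data):

```python
import numpy as np

# assert_allclose passes when |actual - desired| <= atol + rtol * |desired|
# holds for every element; rtol=atol=1e-3 is what the test uses.
actual = np.array([1.0009, 100.09])
desired = np.array([1.0, 100.0])

# Both elements are within 1e-3 + 1e-3 * |desired|, so this passes silently.
np.testing.assert_allclose(actual, desired, rtol=1e-3, atol=1e-3)

# Tighten the tolerances and the same arrays fail the comparison.
try:
    np.testing.assert_allclose(actual, desired, rtol=1e-4, atol=1e-4)
except AssertionError:
    print("exceeded tolerance")
```

The failure above, however, is raised from the MKL-DNN C++ check in `mkldnn_base.cc` while materializing the array in `asnumpy()`, before the Python-level tolerance check even runs.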

@mseth10
Contributor Author

mseth10 commented Jan 31, 2019

@lebeg It's not the same error - the current failure is on unix-cpu, whereas earlier it was on unix-gpu. I also verified that this failure is not due to the fix; it is caused by some other code change that happened while this test was inactive.

@mseth10
Contributor Author

mseth10 commented Feb 1, 2019

@Vikas89 This test already runs on GPU no cuDNN environment. So we are good there.

Contributor

@lupesko lupesko left a comment

Nice.
However, since the test is still failing (albeit due to a different root cause), we should probably keep skipping this test.

@mseth10
Contributor Author

mseth10 commented Feb 3, 2019

Filed another issue #14052 for the MKL-DNN bug that causes the CI failure when the test is enabled. Keeping the test disabled for now.

@vandanavk
Contributor

@mxnet-label-bot add [pr-awaiting-merge, Flaky, Test]

@marcoabreu marcoabreu added the Flaky, pr-awaiting-merge (Review and CI is complete. Ready to Merge), and Test labels Feb 4, 2019
Member

@yuxihu yuxihu left a comment

Good catch! LGTM.

@Roshrini Roshrini merged commit 41ba014 into apache:master Feb 5, 2019
stephenrawls pushed a commit to stephenrawls/incubator-mxnet that referenced this pull request Feb 16, 2019
* keeping same contexts for comparison

* enabling test

* testing default context

* Revert "testing default context"

This reverts commit 1f95d02.

* Disabling test due to CI failure on MKL-DNN
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
@mseth10 mseth10 deleted the fix-depthwise-conv branch June 1, 2020 10:47
Labels
Flaky, pr-awaiting-merge (Review and CI is complete. Ready to Merge), Test
Development

Successfully merging this pull request may close these issues.

flaky test: test_operator_gpu.test_depthwise_convolution
9 participants