This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

fix test_depthwise_convolution for occasional CI failures #14016

Merged
merged 8 commits into apache:master from mseth10:fix-depthwise-conv
Feb 5, 2019

Conversation

mseth10
Contributor

@mseth10 mseth10 commented Jan 29, 2019

Description

This PR aims to fix #12203. test_depthwise_convolution fails occasionally on CI in the GPU context. The failure occurs because the test compares the GPU output of a convolution over a whole image against the CPU output obtained by convolving slices of the image and concatenating the results, and the error tolerances are sometimes exceeded. When the comparison is instead made between the GPU output of the convolution over the whole image and the GPU output of the sliced-and-concatenated convolution, the test passes.
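The slicing-and-concatenation comparison the test performs can be sketched in plain NumPy. This is an illustrative toy: `depthwise_conv2d` is a naive stand-in for the MXNet operator (which the test presumably exercises via `Convolution` with `num_group` set), and batching, padding, and strides are omitted:

```python
import numpy as np

def depthwise_conv2d(x, w):
    # Naive valid-mode depthwise convolution: one kernel per channel.
    # x: (C, H, W) input, w: (C, kh, kw) filters.
    C, H, W = x.shape
    _, kh, kw = w.shape
    out = np.zeros((C, H - kh + 1, W - kw + 1))
    for c in range(C):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(x[c, i:i + kh, j:j + kw] * w[c])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w = rng.standard_normal((4, 3, 3))

# Convolution over the whole input ...
full = depthwise_conv2d(x, w)
# ... versus convolving channel slices separately and concatenating,
# which is the second result the test compares against.
sliced = np.concatenate(
    [depthwise_conv2d(x[c:c + 1], w[c:c + 1]) for c in range(4)], axis=0)

np.testing.assert_allclose(full, sliced, rtol=1e-3, atol=1e-3)
```

When both results come from the same backend the arithmetic is effectively identical and the tolerances hold; the flakiness came from comparing two different backends (GPU vs. CPU), whose rounding can legitimately differ by more than the chosen tolerances.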

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Second context changed to match the first one, so that different implementations are compared on the same context

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Contributor

@apeforest apeforest left a comment

Can we also compare between cpu and cpu as context?

@apeforest
Contributor

Could you also create an issue for the discrepancy between the cpu and gpu implementations, with the input to reproduce it? Thanks!

@mseth10
Contributor Author

mseth10 commented Jan 29, 2019

@apeforest We already do that. We actually compare between default contexts. So when default context is cpu, we compare between cpu and cpu.

Contributor

@apeforest apeforest left a comment

LGTM. @mseth10 Thanks for your thorough analysis!

Contributor

@Vikas-kum Vikas-kum left a comment

Good catch!

I would also recommend noting the input you tested this fix on. Overall this looks good to me.

On a side note, you should also investigate whether there is a way to run this test in a non-cuDNN environment. I am not sure whether MXNet CI has one; if it does, this test should be part of the suite that runs test cases in the non-cuDNN environment.

@mseth10
Contributor Author

mseth10 commented Jan 29, 2019

@Vikas89 I checked. We do have a GPU no-cuDNN environment running on our CI. I will make sure this test runs in that environment.
I will also add the test input for which this test was failing.

@lebeg
Contributor

lebeg commented Jan 31, 2019

Unfortunately, the same test is failing in this PR:

======================================================================
ERROR: test_operator.test_depthwise_convolution
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 173, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/unittest/test_operator.py", line 1676, in test_depthwise_convolution
    np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/work/mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [01:40:38] src/operator/nn/mkldnn/mkldnn_base.cc:576: Check failed: similar 

Not sure whether it's the same error though.
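For reference, the `assert_allclose` call at the bottom of the traceback passes only when `|actual - desired| <= atol + rtol * |desired|` holds elementwise. A small illustration of how those tolerances behave (toy values, not the test's actual data):

```python
import numpy as np

# assert_allclose passes when |actual - desired| <= atol + rtol * |desired|
# holds for every element; rtol=atol=1e-3 is what the test uses.
actual = np.array([1.0009, 100.09])
desired = np.array([1.0, 100.0])

# Both elements are within 1e-3 + 1e-3 * |desired|, so this passes silently.
np.testing.assert_allclose(actual, desired, rtol=1e-3, atol=1e-3)

# Tighten the tolerances and the same arrays fail the comparison.
try:
    np.testing.assert_allclose(actual, desired, rtol=1e-4, atol=1e-4)
except AssertionError:
    print("exceeded tolerance")
```

The failure above, however, is raised from the MKL-DNN C++ check in `mkldnn_base.cc` while materializing the array in `asnumpy()`, before the Python-level tolerance check even runs.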

@mseth10
Contributor Author

mseth10 commented Jan 31, 2019

@lebeg It's not the same error - the current failure is on unix-cpu, whereas earlier it was on unix-gpu. I also verified that this failure is not due to the fix; it is caused by some other code change that happened while this test was inactive.

@mseth10
Contributor Author

mseth10 commented Feb 1, 2019

@Vikas89 This test already runs on GPU no cuDNN environment. So we are good there.

Contributor

@lupesko lupesko left a comment

Nice.
However, since the test is still failing (albeit due to a different root cause), we should probably keep skipping this test.

@mseth10
Contributor Author

mseth10 commented Feb 3, 2019

Filed another issue #14052 for the MKL-DNN bug that causes the CI failure when the test is enabled. Keeping the test disabled for now.

@vandanavk
Contributor

@mxnet-label-bot add [pr-awaiting-merge, Flaky, Test]

@marcoabreu marcoabreu added the Flaky, pr-awaiting-merge (Review and CI is complete. Ready to Merge), and Test labels Feb 4, 2019
Member

@yuxihu yuxihu left a comment

Good catch! LGTM.

@Roshrini Roshrini merged commit 41ba014 into apache:master Feb 5, 2019
stephenrawls pushed a commit to stephenrawls/incubator-mxnet that referenced this pull request Feb 16, 2019
* keeping same contexts for comparison

* enabling test

* testing default context

* Revert "testing default context"

This reverts commit 1f95d02.

* Disabling test due to CI failure on MKL-DNN
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
@mseth10 mseth10 deleted the fix-depthwise-conv branch June 1, 2020 10:47
Labels
Flaky, pr-awaiting-merge (Review and CI is complete. Ready to Merge), Test
Development

Successfully merging this pull request may close these issues.

flaky test: test_operator_gpu.test_depthwise_convolution
9 participants