
Fix MKLDNN sigmoid/softrelu issue #10336

Merged 5 commits into apache:master from jinhuang415:latest_master on Apr 2, 2018

Conversation

@jinhuang415 (Contributor) commented Mar 30, 2018

Description

This PR fixes the sigmoid and softrelu test failures for MKLDNN (related test case: tests/python/gpu/test_operator_gpu.py:test_activation_with_type). The MKLDNN eltwise backward primitive requires the activation's input data rather than its output data: it re-runs the activation forward on the input and computes the gradient from that result. The code is therefore changed to pass the input data. Tests pass for sigmoid/relu/softrelu after the fix.
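To illustrate the constraint, here is a minimal standalone sketch of the MKLDNN 0.x eltwise C++ API (my own example, not MXNet's actual wrapper code; the memory objects are assumed to be created by the caller):

#include <mkldnn.hpp>

// Sketch: sigmoid backward with MKLDNN 0.x. The key point is that the
// backward primitive consumes the forward *input* (src_mem), not the
// forward output.
void sigmoid_backward_sketch(const mkldnn::engine& eng,
                             const mkldnn::memory& src_mem,       // forward input x
                             const mkldnn::memory& diff_dst_mem,  // dL/dy
                             mkldnn::memory& diff_src_mem) {      // dL/dx (written)
  using namespace mkldnn;
  auto data_md = src_mem.get_primitive_desc().desc();
  // The forward primitive descriptor is required as a hint for backward.
  auto fwd_desc = eltwise_forward::desc(prop_kind::forward_training,
                                        algorithm::eltwise_logistic,
                                        data_md, 0.0f, 0.0f);
  auto fwd_pd = eltwise_forward::primitive_desc(fwd_desc, eng);
  auto bwd_desc = eltwise_backward::desc(algorithm::eltwise_logistic,
                                         data_md, data_md, 0.0f, 0.0f);
  auto bwd_pd = eltwise_backward::primitive_desc(bwd_desc, eng, fwd_pd);
  // Passing the forward output as src here silently yields wrong gradients
  // for non-idempotent activations like sigmoid and softrelu.
  eltwise_backward bwd(bwd_pd, src_mem, diff_dst_mem, diff_src_mem);
  stream(stream::kind::eager).submit({bwd}).wait();
}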

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@zheng-da (Contributor)

Is this PR meant to fix the problem in #10089?

@zheng-da (Contributor)

Does it fix this as well?

======================================================================
FAIL: test_loss.test_bce_loss
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/unittest/test_loss.py", line 100, in test_bce_loss
    assert mod.score(data_iter, eval_metric=mx.metric.Loss())[0][1] < 0.01
AssertionError:

@jinhuang415 (Author)

@zheng-da Yes, after this fix #10089 can be enabled.

@jinhuang415 (Author) commented Mar 30, 2018

@zheng-da test_bce_loss passes with the fix:
[jinhuang@mlt-gpu203 incubator-mxnet]$ nosetests -s tests/python/unittest/test_loss.py:test_bce_loss
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=26219310 to reproduce.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.
[15:14:52] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 40 bytes with malloc directly
[15:14:52] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 2560 bytes with malloc directly
.
----------------------------------------------------------------------
Ran 1 test in 0.939s

OK

@pengzhao-intel (Contributor)

@jinhuang415 I think we can enable all activations in this PR and close the previous #10089.
@zheng-da what's your opinion?

@jinhuang415 (Author)

@pengzhao-intel Added a change to enable all MKLDNN activations.

@zheng-da (Contributor)

Agreed. I have closed #10089.

@@ -82,7 +82,7 @@ void ActivationGradComputeExCPU(const nnvm::NodeAttrs& attrs,
   const ActivationParam& param = nnvm::get<ActivationParam>(attrs.parsed);
   if (SupportMKLDNN(inputs[0])) {
     MKLDNN_OPCHECK_INIT(true, outputs.size(), inputs, outputs);
-    MKLDNNActivationBackward(attrs, ctx, inputs[0], inputs[1], req[0],
+    MKLDNNActivationBackward(attrs, ctx, inputs[0], inputs[2], req[0],
                              outputs[0]);
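
For context, once ActivationGrad is updated as requested in the review below, the backward inputs are laid out roughly as follows (my annotation, following the MXNet convention):

// inputs[0] : out_grad          -- gradient w.r.t. the activation output
// inputs[1] : out_data (kOut)   -- forward output (what the old code passed on)
// inputs[2] : in_data  (kData)  -- forward input (what the MKLDNN primitive needs)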
Review comment (Contributor):
inputs[2] doesn't exist. You need to modify ActivationGrad to pass input data to backward.

@jinhuang415 (Author) replied:
Thanks for the comments. I have updated the diff to pass the input data, and updated a few other functions as well; see the sketch below.
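
A sketch of the kind of change involved (assumption: modeled on MXNet's ActivationGrad in src/operator/nn/activation.cc; the actual diff may differ in detail):

struct ActivationGrad {
  const char* op_name;
  std::vector<nnvm::NodeEntry> operator()(const nnvm::NodePtr& n,
      const std::vector<nnvm::NodeEntry>& ograds) const {
    // As before: heads = {out_grad, out_data}.
    std::vector<nnvm::NodeEntry> heads(ograds.begin(), ograds.end());
    heads.emplace_back(nnvm::NodeEntry{n, activation::kOut, 0});
#if MXNET_USE_MKLDNN == 1
    // New: also forward the activation's input data so the MKLDNN eltwise
    // backward primitive can recompute the forward pass from it.
    heads.push_back(n->inputs[activation::kData]);
#endif
    return MakeGradNode(op_name, n, heads, n->attrs.dict);
  }
};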

@zheng-da (Contributor)

Looks good to me now.
@piiswrong @szha Can you review the PR? It fixes a bug in the MKLDNN integration; we should merge it before the release.

   // problems.
-  return param.act_type == activation::kReLU;
-#if 0
   return param.act_type == activation::kReLU
       || param.act_type == activation::kSigmoid
       || param.act_type == activation::kSoftReLU;
@TaoLv (Member) commented:
Why is tanh not enabled?

@jinhuang415 (Author) replied:
@TaoLv I added support for tanh and it passes the unit tests.
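
Presumably the check above then ends up along these lines (my sketch; MXNet has a SupportMKLDNNAct helper, but the exact final form is in the diff):

bool SupportMKLDNNAct(const ActivationParam& param) {
  // All four eltwise activations are now routed to MKLDNN.
  return param.act_type == activation::kReLU
      || param.act_type == activation::kSigmoid
      || param.act_type == activation::kTanh
      || param.act_type == activation::kSoftReLU;
}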

@szha (Member) left a comment:

Tests for triggering previous bugs?

@piiswrong (Contributor)

There is no previous bug; MKLDNN was just disabled for these cases.

@szha (Member) commented Apr 1, 2018

OK. Feel free to dismiss my previous review in this PR if the change is already covered by tests.

@jinhuang415 (Author)

@szha @piiswrong The related test case is tests/python/gpu/test_operator_gpu.py:test_activation_with_type. Previously it would fail if we enabled sigmoid/softrelu/tanh for MKLDNN; after the fix it passes, so we can now enable sigmoid/softrelu/tanh for MKLDNN.

@zheng-da (Contributor) commented Apr 2, 2018

@piiswrong MKLDNN activation backward uses the input data (activation::kData) to compute in_grad, but the original code provided the output data (activation::kOut). That's why it worked only for relu (I'm not sure why it always works for relu). This PR fixes that bug, and now we can use MKLDNN for the other activation types.
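
A plausible explanation for why relu alone tolerates receiving the output (my observation, not stated in the thread): the sign mask the backward kernel needs is identical whether computed from the input or the output, since

\[ \frac{\partial}{\partial x}\,\mathrm{relu}(x) \;=\; \mathbb{1}[x > 0] \;=\; \mathbb{1}[\mathrm{relu}(x) > 0]. \]

For sigmoid or softrelu, re-running the forward pass on \(y\) instead of \(x\) yields a different value, hence wrong gradients.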

-        assert_almost_equal(out1.asnumpy(), out2.asnumpy())
-        assert_almost_equal(out1.asnumpy(), out3.asnumpy())
+        assert_almost_equal(out1.asnumpy(), out2.asnumpy(), rtol=1e-3)
+        assert_almost_equal(out1.asnumpy(), out3.asnumpy(), rtol=1e-3)
@zheng-da (Contributor) commented Apr 2, 2018:

Maybe we should set a smaller tolerance; changing from 1e-5 to 1e-3 seems like a big jump.

@jinhuang415 (Author) replied:

It would still fail intermittently at 1e-4. Checking a similar function, check_consistency(), the threshold there is set to 1e-3 for FP32. Since the data type in this test_lambda() case is FP32 (mx.nd.random.uniform() outputs FP32 by default), I set rtol to 1e-3 as well.
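
For reference, assert_almost_equal follows the usual allclose-style criterion (an assumption on my part; the exact formula lives in mxnet.test_utils):

\[ |a - b| \;\le\; \mathrm{atol} + \mathrm{rtol} \cdot |b|, \]

so rtol = 1e-3 tolerates roughly 0.1% relative error, in line with check_consistency()'s FP32 default.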

@piiswrong merged commit e0df25c into apache:master on Apr 2, 2018
lanking520 pushed a commit to lanking520/incubator-mxnet that referenced this pull request Apr 2, 2018
* Fix MKLDNN sigmoid/softrelu issue

* Enable Sigmoid and SoftRelu for MKLDNN

* Add activation kData for backward calculation for MKLDNN

* Add tanh support for MKLDNN activation

* Adjust rtol to pass tanh tests for MKLDNN
@jinhuang415 deleted the latest_master branch May 2, 2018 12:18
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018 (same five commits as above)
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018 (same five commits as above)