This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add NHWC layout support to Pooling (cpu, gpu cuda, gpu cuDNN) #13749

Merged: 33 commits merged into apache:master on Feb 16, 2019

Conversation

DickJC123 (Contributor)

Description

This is a continuation of PR #13362 started by @ptrendx titled "Add NHWC layout support to Pooling (cuDNN only)". We're in agreement that I'm the best one to push this work forward to a point of final acceptance.

Based on reviewer comments, I've expanded support for the newly introduced 'layout' parameter to include the cpu and cuda implementations, not just cudnn. I've also added 3D support in cuDNN and a graceful fallback for MKLDNN. I've re-enabled and expanded the testing of test_operator_gpu.py:test_pooling_versions. I understand and avoid the earlier flakiness of that test, which was due to inclusion of the pooling_v1 operator in some problem domains where its definition differs. The initial commit here is identical to the last one made by @ptrendx for PR #13362. Follow-up commits will add this new functionality and additional description.
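
For illustration, a minimal usage sketch of the new 'layout' parameter (the shapes, context, and variable names below are illustrative, not taken from this PR's tests):

    import mxnet as mx

    # Pool over an NHWC-ordered 4D input instead of the default NCHW.
    data = mx.sym.Variable('data')  # shape (N, H, W, C)
    pool = mx.sym.Pooling(data=data, kernel=(2, 2), stride=(2, 2),
                          pool_type='max', layout='NHWC')

    # Bind and run; with this PR the cpu, cuda and cuDNN backends all accept
    # layout='NHWC' (MKLDNN falls back gracefully to another implementation).
    exe = pool.simple_bind(ctx=mx.gpu(0), data=(1, 32, 32, 8))
    out = exe.forward(data=mx.nd.ones((1, 32, 32, 8)))[0]
    print(out.shape)  # (1, 16, 16, 8)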

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@sandeep-krishnamurthy added the CUDA and Backend (Issues related to the backend of MXNet) labels on Dec 31, 2018
@anirudhacharya (Member):

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu added the pr-awaiting-review (PR is waiting for code review) label on Jan 14, 2019
 }
 };

 template<typename DType>
 struct lp_grad<DType, 3> {
   static MSHADOW_XINLINE DType Map(const DType grad, const DType in_data, const DType out_data) {
-    return grad * in_data * in_data / (out_data * out_data);
+    // Avoid nan result if both grad and out_data are 0.
+    return (grad == DType(0.0)) ? DType(0.0) : grad * in_data * in_data / (out_data * out_data);
Contributor:

Why not also check out_data == DType(0.0) here if it can be zero, to avoid overflow?

DickJC123 (Contributor Author):

I introduced this to quiet some nans that were causing test failures. I'm reviewing the specifics to see if a better solution exists as suggested in your comment, so stay tuned.

DickJC123 (Contributor Author):

I've pushed my solution to your comment in commit 098bc49.

The checking of grad == 0.0 that you highlighted only succeeded because of the quirks of our check_consistency() routine in test_utils.py, which uses the symbol forward() output as the gradient. Per your suggestion, I'm now using a check of out_data == 0 as the more general way of quieting the test failures.
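
For context, a rough sketch of how check_consistency() is typically driven in these pooling tests (the symbol, shapes, and dtypes below are illustrative, not copied from test_pooling_versions):

    import numpy as np
    import mxnet as mx
    from mxnet.test_utils import check_consistency

    # Build one pooling symbol and compare forward/backward results across the
    # listed contexts/dtypes.  As noted above, check_consistency() feeds the
    # forward output back in as the head gradient.
    sym = mx.sym.Pooling(kernel=(2, 2), stride=(2, 2), pool_type='lp',
                         p_value=3, name='pool')
    ctx_list = [{'ctx': mx.gpu(0), 'pool_data': (1, 8, 16, 16),
                 'type_dict': {'pool_data': np.float16}},
                {'ctx': mx.cpu(0), 'pool_data': (1, 8, 16, 16),
                 'type_dict': {'pool_data': np.float32}}]
    check_consistency(sym, ctx_list)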

The test failures I was seeing often occurred in float16 lp-3 pooling. For example, take the case where a pool window of 2 has identical inputs 2^-9 and 2^-9. The forward output for this case is the cube root of (2^-9)^3 + (2^-9)^3. If this is calculated in float16, the 2^-27 terms underflow to 0 and the output is 0. The backward output is then grad * 2^-9 * 2^-9 / (0 * 0) = +inf (or nan if grad is also 0). When performed in float32, no underflow occurs in the forward op, and +infs are avoided in the backward op.

My conclusion: float16 is ill-equipped to perform the forward pooling operation for lp-2 and lp-3. Part of my solution here thus involves promoting the calculation to be in float32 for cpu and mxnet cuda implementations of float16-i/o pooling. This is consistent with other operators like float16-i/o Convolution and Batchnorm, which perform internal calculations in float32. I've run test_pooling_versions() thousands of times with no failures in this mode.
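
To make the failure mode concrete, here is a standalone numpy sketch (not MXNet code) of the window-of-2, input 2^-9 example above:

    import numpy as np

    def lp3_forward(x, dtype):
        """LP-3 pooling of a window holding two identical inputs x."""
        x = dtype(x)
        s = x * x * x + x * x * x  # each 2^-27 term underflows to 0 in float16
        return np.cbrt(s)

    x, grad = 2.0 ** -9, 1.0
    out16 = lp3_forward(x, np.float16)     # 0.0 due to underflow
    out32 = lp3_forward(x, np.float32)     # ~2.46e-3, no underflow
    print(grad * x * x / (out16 * out16))  # inf: the backward op blows up
    print(grad * x * x / (out32 * out32))  # finite, as expected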

#if MXNET_USE_MKLDNN == 1 && MXNET_USE_CUDA == 0 && MXNET_USE_CUDNN == 0
  return std::vector<std::pair<int, int> >{{1, 0}};
#else
  return std::vector<std::pair<int, int> >();
#endif
Member:

Ummm, do you have any case to validate this scenario? If MXNet is built with both MKL-DNN and cuDNN enabled (e.g. the mxnet-cu90mkl package) and the user gives ctx=cpu, is there any functionality or performance regression?

DickJC123 (Contributor Author):

So unfortunately, all enabled backend operator implementations must live with the same FInPlaceOptions. I voiced concern about this restriction to @eric-haibin-lin and he graciously put some thought into #13598. The issue has seen no activity in a month, but I'd like to see it reawakened.

Until then, the code you highlight enforces that the most conservative, "least common denominator" FInPlaceOptions (i.e. none) are in effect for the enabled Pooling implementations. Since an operator cannot rely on the in-place sharing actually occurring, there should be no functional regression. Performance-wise, a user must understand that, at this point, enabling multiple backends may have some performance consequences due to the FInPlaceOption implementation.

DickJC123 (Contributor Author):

Sorry for any confusion. Perhaps my comments in the code for FInplaceOption speak to the issue more directly:

  // Since this routine is not provided the cpu/gpu context info, only in the case
  // where CUDA and CUDNN implementations are not available can we be sure the MKLDNN
  // implementation will be employed.  The MKLDNN FInplaceOptions are not compatible
  // with the other (i.e. cpu, cuda and cudnn) implementations.

Contributor:

Thanks for the explanation. I understand the current limitation here.
My concern is a performance regression on the CPU side when the mxnet-cuXXmkl binary is used, which is actually the widely used binary for projects like Sockeye, GluonNLP, GluonCV, etc.
Until issue #13598 is resolved (btw, we're working on this now), I think we need to keep this line as-is.

@eric-haibin-lin (Member), Feb 12, 2019:

mxnet-cuXX-mkl pip binary statically links cudnn and will not be affected. @szha correct me if I am wrong.

The potentially affected cases are:

  • build mxnet with mkldnn and cuda, but without cudnn
  • build mxnet with neither mkldnn nor cuda.

As this only affects the operator's backward pass, which is used for training, I see the two cases above as rare. Therefore, I do not think this is a blocking issue for this PR.

Nonetheless, I suggest @DickJC123 create a GitHub issue for this change so that people are aware and it can be tracked later, and preferably run some performance testing to get a rough idea of how big an impact it makes in the above two cases.

Let's continue discussions on #13598 to avoid such problems in the future.

DickJC123 (Contributor Author):

I originally set out to create a test case (and separate issue) to demonstrate the need for the restrictive #if clause I put around the MKLDNN FInplaceOption. I was unable to do so, and in the process I came to realize that I had misunderstood the meaning of the {1,0} option. In fact, it allows the framework to share the backward '1' input (the unused workspace gradient) with the backward '0' output (the pooling input gradient) when the additional MKLDNN workspace is present. This sharing is compatible with the CUDA and cuDNN Pooling implementations, which do not use a workspace or its gradient! I'm therefore reverting to a simple code guard around the MKLDNN FInplaceOption, one whose logic mirrors the other MKLDNN/no-MKLDNN switches of behavior within nn/pooling.cc. No new issue is required. With my last commit I've also expanded the pooling tests to cover cases with in-place tensor sharing, which requires the pooling input to be the same size as the output (and thus also the workspace).
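
As a reference for the expanded tests mentioned above, a hypothetical ndarray snippet (not the actual test code) showing a pooling configuration whose output shape equals its input shape, i.e. the situation in which the backward in-place sharing can apply:

    import mxnet as mx

    x = mx.nd.random.uniform(shape=(2, 3, 8, 8))  # NCHW input
    x.attach_grad()
    with mx.autograd.record():
        # kernel (1,1), stride (1,1): output shape equals input shape, so the
        # backward '0' output (input gradient) may legally alias the backward
        # '1' input (the unused workspace gradient) on the MKLDNN path.
        y = mx.nd.Pooling(x, kernel=(1, 1), stride=(1, 1), pool_type='max')
    y.backward(mx.nd.ones_like(y))
    assert y.shape == x.shape  # (2, 3, 8, 8)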

Member:

Good to know MKLDNN impl is not affected

tests/python/gpu/test_operator_gpu.py (outdated review thread, resolved)
@@ -104,7 +104,8 @@ class MKLDNNPoolingBwd {
 inline bool SupportMKLDNNPooling(const PoolingParam &param) {
   return param.kernel.ndim() == 2 &&
          (param.pool_type == pool_enum::kMaxPooling ||
-          param.pool_type == pool_enum::kAvgPooling);
+          param.pool_type == pool_enum::kAvgPooling) &&
+         (!param.layout.has_value() || param.layout.value() == mshadow::kNCHW);
Member:

Thank you for changing this. Just to clarify, the MKL-DNN pooling primitive does support the NHWC input format. But when we integrated the MKL-DNN backend into MXNet, we assumed that user input data would always be NCHW for 4D tensors. We need to re-evaluate the workflow and integration to see whether we need to support NHWC here.

Member:

@TaoLv are you tracking this separately?

Member:

Yes. This change looks good to me for now. Two possible follow-ups:

  1. If the layout parameter is removed in the next major release, we possibly will not support NHWC.
  2. Otherwise, if needed, we will extend the MKL-DNN integration to support NHWC. Currently I don't see any necessity for it.

Do you think I can put the suggestion for removing layout into #9686?

@TaoLv (Member) commented on Feb 2, 2019:

Maybe it's out of the scope of this PR, but I would suggest removing the layout parameter from all operator APIs in the next major release. It's the framework that should take care of model performance. We cannot ask users to complicate their models to get some performance benefit from one specific hardware or backend. The change may become a nightmare if they want to switch back to other hardware or backends.

@DickJC123 (Contributor Author):

I agree with @TaoLv that a new approach to layout must wait until the next major (API-breaking) release. I look forward to the discussions of how to free users from coding layout considerations into their models and how to allow the framework to make the best backend-specific layout choices.

I've attempted to address reviewer comments to-date. Requesting further input, e.g. from those additionally listed reviewers @zheng-da @szha @zhreshold. Thanks in advance!

@TaoLv (Member) left a review comment:

I still want to see some performance numbers for different flag combinations. If there's no visible regression, this PR is good to go for me. Then the community should propose a clear solution to resolve the conflict and allow different backends to have different options in the future. We also need to discuss the API change for the next major release.

src/operator/nn/pooling.cc (outdated review thread, resolved)
@TaoLv (Member) left a review comment:

Thank you @DickJC123. It's approved now.

// For reference see Fixed Issues section in
// https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_721.html#rel_721
#if CUDNN_VERSION == 7104
is_supported = kernel_height <= 8 && kernel_width <= 8;
Member:

a CHECK here is probably more informative.

DickJC123 (Contributor Author):

We don't want to error-out here via CHECK() in this Supports() function, since there is a graceful fallback to the CUDA implementation.

@@ -81,7 +81,8 @@ class CuDNNPoolingOp {
   CHECK_EQ(s->dnn_handle_ownership_, mshadow::Stream<gpu>::OwnHandle);
   typename DataType<DType>::ScaleType alpha = 1.0f;
   typename DataType<DType>::ScaleType beta = 0.0f;
-  this->Init(s, in_data, out_data);
+  if (!this->Init(s, in_data, out_data))
+    LOG(FATAL) << "cuDNN Pooling invoked with unsupported parameters.";
Member:

e.g. otherwise it's hard to figure out where the problem is with this error message.

DickJC123 (Contributor Author):

Thanks for this tip. Updated per your suggestion.

DickJC123 (Contributor Author):

Any other issues you'd like me to look at?

@eric-haibin-lin merged commit 5adb6fc into apache:master on Feb 16, 2019
jessr92 pushed a commit to jessr92/incubator-mxnet that referenced this pull request Feb 19, 2019
…#13749)

* Adds layout support: mx.sym.Pooling(..., layout='NHWC',...) with tests.

* Docs changes

* Trigger

* Skip NHWC pooling tests on non-cuDNN platforms

* Fix pylint NHWC pooling

* Fixes from review

* Add CuDNNPoolingOp::Supports() in place of Forward()/Backward() bool return.

* Add layout support to cpu implementation of Pooling, with tests.

* Fix cpplint.

* Fix bug in cpu nhwc impl.

* Add MXNet CUDA pooling in NWC, NHWC and NDHWC.  Turn on 3D cuDNN pooling.  Tests.

* Add PoolingParam::GetLayout() for better default layout handling.

* Fix cpplint.

* Throw exception for quantization pooling not NCHW.

* Expand nhwc pooling test coverage.

* SupportMKLDNNPooling() to examine layout param.

* Compare 'std' and 'v1' pooling versions only when op definitions permit.

* Add pooling test diagnostic output.

* Fix syntax.

* Fix pooling FInplaceOption so it can be shared by all implementations.

* Add missing param definition.

* Fix #if logic.

* Temp switch to DickJC123/mshadow: shows effect of half round-to-nearest on cpu.

* Move back to dmlc/mshadow.git, now with float->half rounding.

* Avoid underflow of lp pooling calc for dtype=float16.

* Remove redundant pooling test.

* Minor variable naming fixes.

* Modify FInplaceOption handling per reviewer comments.  Expand testing.

* Correct gluon Pooling layout param description.

* Correct Symbol Pooling description.

* Use 'CHECK(x)' rather than 'if (x) LOG(FATAL)'.

* Empty commit to trigger CI.
drivanov pushed a commit to drivanov/incubator-mxnet that referenced this pull request Mar 4, 2019 (same commit list as above).

vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019 (same commit list as above).

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019 (same commit list as above).
Labels: Backend (Issues related to the backend of MXNet), CUDA, pr-awaiting-review (PR is waiting for code review)

10 participants