
fixing batch_norm and layer_norm for large tensor nightly test #17805

Merged
merged 1 commit into apache:master on Mar 16, 2020

Conversation

access2rohit
Contributor

@access2rohit access2rohit commented Mar 10, 2020

Description

Enables large tensor support for the following ops:

  1. batch_norm
  2. layer_norm

Fixes the nightly large tensor failure. A stricter input size check was recently added to layer_norm in PR #17683. That check hasn't been added to batch_norm yet, so batch_norm isn't failing at the moment, but its shape assignment is still incorrect, as shown in the GDB logs below.

Please look at the lines marked by arrows in the GDB logs.
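
As a side note on where the bogus channelCount value in the "before" logs comes from: assigning the 64-bit dimension 4300000000 to a 32-bit int keeps only the low 32 bits. Below is a minimal sketch of that arithmetic (plain Python, no MXNet needed; the values are taken from the GDB logs further down):

# Narrowing a 64-bit dimension into a 32-bit signed int, as the old code path did.
large_dim = 4_300_000_000          # dshape[channelAxis] reported by GDB

low_32 = large_dim % (1 << 32)     # keep only the low 32 bits
if low_32 >= (1 << 31):            # reinterpret the bit pattern as signed
    low_32 -= 1 << 32

print(low_32)      # 5032704    -- the incorrect channelCount before the fix
print(large_dim)   # 4300000000 -- what a 64-bit index_t preserves after the fix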

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Proof Of Correctness

batch_norm()

Before changes:

333	  const int channelCount = dshape[channelAxis];  <========
(gdb) info local
param = @0x555555cdb770: {<dmlc::Parameter<mxnet::op::BatchNormParam>> = {<No data fields>}, eps = 0.0010000000474974513, momentum = 0.899999976, fix_gamma = true, use_global_stats = false, output_mean_var = false, axis = 0,
  cudnn_off = false, min_calib_range = {is_none = true, val = {__data = "\000\000\000", __align = {<No data fields>}}}, max_calib_range = {is_none = true, val = {__data = "UU\000", __align = {<No data fields>}}}}
dshape = @0x5555572290a0: {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 1, num_heap_allocated_ = 0, data_stack_ = {4300000000, 1, 4300000000, 0}, data_heap_ = 0x0}, <No data fields>}
channelAxis = 0
channelCount = 21845 <--------
(gdb) p dshape[channelAxis]
$1 = (long &) @0x5555572290a8: 4300000000 <--------
(gdb) n
335	  if (!mxnet::ndim_is_known(dshape)) {
(gdb) p channelCount
$2 = 5032704

After changes:

Thread 1 "python3" hit Breakpoint 1, mxnet::op::BatchNormShape (attrs=..., in_shape=0x555556579d98, out_shape=0x555556579db0) at src/operator/nn/batch_norm.cc:333
333	  const index_t channelCount = dshape[channelAxis]; <========
(gdb) n
335	  if (!mxnet::ndim_is_known(dshape)) {
(gdb) info local
param = @0x555555cdb770: {<dmlc::Parameter<mxnet::op::BatchNormParam>> = {<No data fields>}, eps = 0.0010000000474974513, momentum = 0.899999976, fix_gamma = true, use_global_stats = false, output_mean_var = false, axis = 0,
  cudnn_off = false, min_calib_range = {is_none = true, val = {__data = "\000\000\000", __align = {<No data fields>}}}, max_calib_range = {is_none = true, val = {__data = "UU\000", __align = {<No data fields>}}}}
dshape = @0x5555572290a0: {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 1, num_heap_allocated_ = 0, data_stack_ = {4300000000, 1, 4300000000, 0}, data_heap_ = 0x0}, <No data fields>}
channelAxis = 0
channelCount = 4300000000 <--------
(gdb) p dshape[channelAxis]
$1 = (long &) @0x5555572290a8: 4300000000  <--------

layer_norm()

Before changes:

Thread 1 "python3" hit Breakpoint 1, mxnet::op::LayerNormShape (attrs=..., in_shape=0x555556579dc8, out_shape=0x555556579de0) at src/operator/nn/layer_norm.cc:50
50	  const int channelCount = dshape[axis]; <========
(gdb) n
52	  if (!mxnet::ndim_is_known(dshape)) {
(gdb) p channelCount
$3 = 5032704   <--------
(gdb) p dshape[0]
$4 = (long &) @0x555556c21f58: 4300000000 <--------
(gdb) info local
param = @0x7fffffff9418: {<dmlc::Parameter<mxnet::op::LayerNormParam>> = {<No data fields>}, axis = 0, eps = 9.99999975e-06, output_mean_var = false}
dshape = @0x555556c21f50: {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 1, num_heap_allocated_ = 0, data_stack_ = {4300000000, 0, 0, 0}, data_heap_ = 0x0}, <No data fields>}
axis = 0
channelCount = 5032704
moments_shape = {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = -29512, num_heap_allocated_ = 32767, data_stack_ = {140737488326480, 140737488325376, 93825019642720, 140737488325376},
    data_heap_ = 0x7fff936c4de7
     <std::_Rb_tree<dmlc::parameter::FieldAccessEntry*, dmlc::parameter::FieldAccessEntry*, std::_Identity<dmlc::parameter::FieldAccessEntry*>, std::less<dmlc::parameter::FieldAccessEntry*>, std::allocator<dmlc::parameter::FieldAccessEntry*> >::_Alloc_node::operator()<dmlc::parameter::FieldAccessEntry* const&>(dmlc::parameter::FieldAccessEntry* const&) const+49>}, <No data fields>}

After changes:

Thread 1 "python3" hit Breakpoint 2, mxnet::op::LayerNormShape (attrs=..., in_shape=0x555556578ff8, out_shape=0x555556579010) at src/operator/nn/layer_norm.cc:50
50	  const index_t channelCount = dshape[axis]; <========
(gdb) n
52	  if (!mxnet::ndim_is_known(dshape)) {
(gdb) info local
param = @0x7fffffff9438: {<dmlc::Parameter<mxnet::op::LayerNormParam>> = {<No data fields>}, axis = 0, eps = 9.99999975e-06, output_mean_var = false}
dshape = @0x5555565bc420: {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 1, num_heap_allocated_ = 0, data_stack_ = {4300000000, 6878235116697514089, 32088647312828786, 0},
    data_heap_ = 0x0}, <No data fields>}
axis = 0
channelCount = 4300000000 <--------
moments_shape = {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = -29480, num_heap_allocated_ = 32767, data_stack_ = {140737488326512, 140737488325408, 93825021150800, 140737488325408},
    data_heap_ = 0x7fff936c4de7
     <std::_Rb_tree<dmlc::parameter::FieldAccessEntry*, dmlc::parameter::FieldAccessEntry*, std::_Identity<dmlc::parameter::FieldAccessEntry*>, std::less<dmlc::parameter::FieldAccessEntry*>, std::allocator<dmlc::parameter::FieldAccessEntry*> >::_Alloc_node::operator()<dmlc::parameter::FieldAccessEntry* const&>(dmlc::parameter::FieldAccessEntry* const&) const+49>}, <No data fields>}
(gdb) p dshape[axis]
$1 = (long &) @0x5555565bc428: 4300000000 <--------

Testing

$ MXNET_TEST_COUNT=1 nosetests --logging-level=DEBUG --verbose -s tests/nightly/test_large_vector.py:test_nn
/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
test_large_vector.test_nn ... [18:14:51] src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[18:21:14] src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
ok

----------------------------------------------------------------------
Ran 1 test in 1017.457s

OK

@access2rohit
Contributor Author

access2rohit commented Mar 10, 2020

@apeforest @ChaiBapchya I don't know enough about layer_norm() or batch_norm() to add suitable shape checks to the tests. I have provided GDB outputs after fixing the code. Can you suggest proper shape checks that can be added to test_large_vector and test_large_array?

@access2rohit access2rohit changed the title fixing batch_norm and layer_norm for large tensors fixing batch_norm and layer_norm for large tensor nightly test Mar 10, 2020
@access2rohit
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Mar 10, 2020
@ChaiBapchya
Contributor

  1. How does the addition of SHAPE_ASSIGN_CHECK to layer_norm cause this failure?
    Layer norm/batch norm were passing before, and some change caused them to start failing, right? What's the root cause?

  2. Also, it turns out batch norm already has a shape check in test_large_array.py:
    https://github.com/apache/incubator-mxnet/blob/afb8742e6e1e987833b39c487dc892b5537196a1/tests/nightly/test_large_array.py#L327

Layer norm doesn't have such a check in test_large_array.py. Maybe you could add that.

Fundamentally, for both batch norm and layer norm, since the operation just performs normalization over the layer/batch, the input shape should equal the output shape.
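
A rough sketch of that kind of shape assertion (the constant LARGE_X, the helper name, and the use of mx.nd.LayerNorm with axis=0 are illustrative assumptions, not the actual nightly test code; running it needs a build with large tensor support and a lot of memory):

import mxnet as mx

LARGE_X = 4_300_000_000  # hypothetical vector length just above 2**32

def check_layer_norm_shape():
    # Normalization should not change the shape of its input.
    data = mx.nd.ones((LARGE_X,))
    gamma = mx.nd.ones((LARGE_X,))   # scale, same length as the normalized axis
    beta = mx.nd.zeros((LARGE_X,))   # shift, same length as the normalized axis
    out = mx.nd.LayerNorm(data, gamma, beta, axis=0)
    assert out.shape == data.shape
    assert out.shape[0] == LARGE_X

A batch_norm variant would be analogous, with moving_mean and moving_var inputs as well.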

@access2rohit
Contributor Author

access2rohit commented Mar 10, 2020

@ChaiBapchya

1. How does the addition of SHAPE_ASSIGN_CHECK to layer_norm cause this failure?
   Layer norm/batch norm were passing before, and some change caused them to start failing, right? What's the root cause?

It was already incorrect when that check was added; check my GDB logs.

2. Also, it turns out batch norm already has a shape check in test_large_array.py:
   https://github.com/apache/incubator-mxnet/blob/afb8742e6e1e987833b39c487dc892b5537196a1/tests/nightly/test_large_array.py#L327

It's still incorrect.

Layer norm doesn't have such a check in test_large_array.py. Maybe you could add that.

Actually, it's better to add the check introduced in this PR: #17683.
Currently I don't have cycles to work on this, since I will be occupied for the next 2 weeks. I have asked @sxjscience to see if he can add this check.

@apeforest
Contributor

apeforest commented Mar 10, 2020

It's very unlikely that the number of channels will be greater than 2^31, so this should not cause a problem in practice. @sxjscience please confirm.

@access2rohit I don't fully understand the GDB output in your description. The sessions seem to stop at different places; what do you want us to see?

@access2rohit
Contributor Author

@mxnet-label-bot update [pr-awaiting-merge]

@lanking520 lanking520 added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-review PR is waiting for code review labels Mar 16, 2020
@apeforest apeforest merged commit 66b21b5 into apache:master Mar 16, 2020
MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
ChaiBapchya added a commit to ChaiBapchya/mxnet that referenced this pull request May 8, 2020
TaoLv pushed a commit that referenced this pull request May 11, 2020
Co-authored-by: Rohit Kumar Srivastava <[email protected]>

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
ChaiBapchya pushed a commit to ChaiBapchya/mxnet that referenced this pull request Sep 20, 2020
@ChaiBapchya
Contributor

This needs to be cherry-picked into v1.x
Doing it now

samskalicky pushed a commit that referenced this pull request Sep 24, 2020
* fixing batch_norm and layer_norm for large tensors (#17805)

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* Fix nightly large_vector test caused by incorrect with_seed path (#18178)

* add back the missing environment function

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>
ChaiBapchya added a commit to ChaiBapchya/mxnet that referenced this pull request Sep 24, 2020
* fixing batch_norm and layer_norm for large tensors (apache#17805)

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* Fix nightly large_vector test caused by incorrect with_seed path (apache#18178)

* add back the missing environment function

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>
samskalicky pushed a commit that referenced this pull request Sep 24, 2020
* fixing batch_norm and layer_norm for large tensors (#17805)

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* Fix nightly large_vector test caused by incorrect with_seed path (#18178)

* add back the missing environment function

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>