
Fix InferAttr/InferShapeAttr not calling inference for all nodes in a graph #16836

Merged: 7 commits into apache:master on Nov 21, 2019

Conversation

@ptrendx (Member) commented on Nov 16, 2019

Description

Currently, an operator's type/shape inference is only called if some of its input and output attribute values are not yet known. This is a bug: it may lead to some operators never seeing the final values of those input/output attributes during the InferAttr pass, and so never validating them.

As an example, this code snippet runs without throwing any exception in current MXNet:

import mxnet as mx

v = mx.sym.Variable('V')
s = mx.sym.transpose(v)
x = mx.sym.Variable('x')
s2 = x + v
s3 = s + s2
e = s3.simple_bind(ctx=mx.cpu(), x=(2,3), grad_req='null')

and it errors out only when calling e.forward (in 1.5.1; in 1.6.0 the new 2D transpose implementation removed the shape check inside the operator's forward function, so it does not error out even then). This happens because the transpose node sits early in the topologically sorted graph and ran its shape inference early, on still-unknown input and output shapes, before the _Plus operators, which then set all of the shapes themselves. That prevented transpose from running its shape inference again and spotting that the shapes are not suitable for this operator.
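
With this fix, the mismatch should be caught when binding the executor. A hedged sketch of the expected behavior after this PR, continuing the snippet above (the exact exception type and message are not guaranteed):

try:
    e = s3.simple_bind(ctx=mx.cpu(), x=(2,3), grad_req='null')
except (mx.base.MXNetError, RuntimeError) as err:
    # transpose now re-validates its shapes once they are fully known,
    # so binding fails during shape inference instead of silently succeeding
    print('shape inference failed as expected:', err)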

In this PR, I track which operators have actually finished inferring their attributes (including validating the attributes they were given, once all of their inputs/outputs are known at the time their attribute inference runs).
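
A minimal Python sketch of the idea, not the actual implementation (the real pass is the C++ InferAttr/InferShapeAttr code; node.infer, node.slots, and the dict layout here are illustrative):

def infer_attrs(nodes, attrs):
    # nodes: operators in topological order
    # attrs: dict mapping each attribute slot (shape/dtype) to a value, or None if unknown
    finished = set()
    while True:
        before = (dict(attrs), len(finished))
        for node in nodes:
            if node in finished:
                continue
            # Before this PR, the call below was skipped whenever all of the
            # node's slots already had values, so the node never got a chance
            # to validate attributes that its neighbors filled in later.
            node.infer(attrs)  # fills in and/or validates this node's slots
            # A node counts as done only once every one of its slots is known.
            if all(attrs[slot] is not None for slot in node.slots):
                finished.add(node)
        if (dict(attrs), len(finished)) == before:
            return attrs  # fixed point reached: no further progress possible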

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)

Comments

  • Currently, an operator is considered to have finished inferring attributes when all of its input and output attributes are set after the call to the operator's InferShape/InferType (as in the sketch above). This should not be necessary, since those functions return a value that is supposed to convey the same information. However, because that return value was never actually checked, many operators lie and return true even when not everything was inferred correctly. This should be tackled in a separate PR.

@ptrendx ptrendx added the R1.6.0 label Nov 16, 2019
@anirudh2290 (Member) left a comment

Wow! Shocking :)

@ptrendx (Member, Author) commented on Nov 18, 2019

@junrushao1994 The failure of the test_contrib_control_flow.test_while_loop_simple_forward test in this PR is a legitimate bug in the WhileLoopType function (https://github.com/apache/incubator-mxnet/blob/master/src/operator/control_flow.cc#L764), where the output type of the condition is always set to float. I do not really understand how that function is supposed to work (e.g. cond_out_type is set but then never used?). Could you take a look at it?

@ptrendx (Member, Author) commented on Nov 18, 2019

@reminisce The failure of the test_numpy_op.test_np_transpose test in this PR is a legitimate bug: the Transpose op's TransposeShape function, used for shape inference, does not support scalar inputs (https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op-inl.h#L423). Please help fix this (or disable testing of scalar inputs, if Transpose is not expected to support scalars).
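
For reference (plain NumPy, not MXNet): transposing a 0-d (scalar) array is a well-defined no-op, which is the case TransposeShape needs to handle:

import numpy as np

a = np.array(3.0)         # a 0-d (scalar) array
print(a.ndim)             # 0
print(np.transpose(a))    # 3.0 -- transpose of a scalar is the scalar itself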

@ptrendx (Member, Author) commented on Nov 18, 2019

@samskalicky @apeforest FYI, see the last two comments.

@ptrendx (Member, Author) commented on Nov 18, 2019

@reminisce Actually, the required change to Transpose seems simple; I will do it in this PR.

@junrushao (Member) commented on Nov 19, 2019

At the time I wrote the control flow code, it was still infeasible to have int64 output for many operators, like == or <, which is why this line is hard-coded to float32... I think that behavior was changed recently in a breaking (but better) way.

@ptrendx (Member, Author) commented on Nov 19, 2019

Not sure I understand, @junrushao1994. I believe the problem here is that you are forcing the float32 output type even when the operator could output something else (and in the case of the test that fails in this PR, the operator previously never had a chance to verify that the type is correct). Why not leave the output type of the condition part as -1 so that it can be inferred?

@junrushao (Member) commented

@ptrendx At the time I was writing this piece of code, operators did return float32; see this.

@junrushao (Member) commented

@ptrendx I think leaving it as -1 is a nice proposal. Could you fix this? Thanks!
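
For illustration, a hedged sketch of what the fix amounts to (the real code is C++ in control_flow.cc; the names below are made up). In MXNet type inference, -1 denotes an unknown dtype, so leaving the condition's output type at -1 lets the regular InferType pass fill it in:

UNKNOWN_DTYPE = -1  # MXNet's sentinel for "dtype not yet inferred"

def while_loop_infer_cond_type(cond_out_type):
    # before the fix: cond_out_type[0] = FLOAT32  (hard-coded, masks the real dtype)
    cond_out_type[0] = UNKNOWN_DTYPE  # after: left unknown, inferred from the condition graph
    return True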

@DickJC123 (Contributor) left a comment

Thanks for making the attribute inference logic more robust. Nice catch! LGTM.

@ptrendx merged commit a8b31a2 into apache:master on Nov 21, 2019
ptrendx added a commit to ptrendx/mxnet that referenced this pull request Nov 21, 2019
Fix InferAttr/InferShapeAttr not calling inference for all nodes in a graph (apache#16836)

* Fix the attribute inference omitting nodes

* Add test

* Cleaning

* Fix lint

* Fix TransposeShape

* Fix WhileLoopType

* Changing a/b test for fusion to a/(b+1) to increase numerical stability
ptrendx added a commit that referenced this pull request Nov 22, 2019
* Add unoptimized symbol to executor for sharing (#16798)

* Add unoptimized symbol to executor for sharing

* Copy the symbol in Reshape

* Added test for multiple reshapes

* Mixed precision binary op backward (use in) for numpy (#16791)

* mixed precision binary op backward

* reduce unix cpu runtime

* USE_NVRTC -> ENABLE_CUDA_RTC to fix maven build.  Add compile-guard to fusion. (#16838)

* Rename USE_NVRTC -> ENABLE_CUDA_RTC to fix maven build.  Compile-guard fusion framework.

* Fix fusion-not-supported warning.

* Fix compile guards

* Fix cmake build so -DMXNET_ENABLE_CUDA_RTC=1 is passed to nvcc

* Minimize side-effects of prev change

* Fix InferAttr/InferShapeAttr not calling inference for all nodes in a graph (#16836)

* Fix the attribute inference omitting nodes

* Add test

* Cleaning

* Fix lint

* Fix TransposeShape

* Fix WhileLoopType

* Changing a/b test for fusion to a/(b+1) to increase numerical stability

* Revert "Mixed precision binary op backward (use in) for numpy (#16791)"

This reverts commit 8b58b78.