
Fix InferAttr/InferShapeAttr not calling inference for all nodes in a graph #16836

Merged: 7 commits into apache:master on Nov 21, 2019

Conversation

@ptrendx (Member) commented on Nov 16, 2019

Description

Currently, an operator's type/shape inference is only called if some of its input and output attribute values are not yet known. This is a bug: it may lead to some operators never seeing the final values of those input/output attributes during the InferAttr pass, and so never validating them.

As an example, this code snippet runs without throwing any exception in current MXNet:

import mxnet as mx

v = mx.sym.Variable('V')
s = mx.sym.transpose(v)
x = mx.sym.Variable('x')
s2 = x + v
s3 = s + s2
e = s3.simple_bind(ctx=mx.cpu(), x=(2,3), grad_req='null')

and it errors out only when calling e.forward (in 1.5.1; in 1.6.0 the new 2D transpose implementation removed the shape check inside the operator's forward function, so it does not error out even then). This happens because the transpose node sits early in the topologically sorted graph and ran its shape inference early, on still-unknown input and output shapes, before the _Plus operators, which then set all of the shapes themselves. That prevented transpose from running its shape inference again and spotting that the shapes are not suitable for this operator.
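
With this fix, the mismatch should be caught when binding the executor. A hedged sketch of the expected behavior after this PR, continuing the snippet above (the exact exception type and message are not guaranteed):

try:
    e = s3.simple_bind(ctx=mx.cpu(), x=(2,3), grad_req='null')
except (mx.base.MXNetError, RuntimeError) as err:
    # transpose now re-validates its shapes once they are fully known,
    # so binding fails during shape inference instead of silently succeeding
    print('shape inference failed as expected:', err)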

In this PR, I track which operators have actually finished inferring their attributes (including validating the attributes they were given, once all of their inputs/outputs are known at the time their attribute inference runs).
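
A minimal Python sketch of the idea, not the actual implementation (the real pass is the C++ InferAttr/InferShapeAttr code; node.infer, node.slots, and the dict layout here are illustrative):

def infer_attrs(nodes, attrs):
    # nodes: operators in topological order
    # attrs: dict mapping each attribute slot (shape/dtype) to a value, or None if unknown
    finished = set()
    while True:
        before = (dict(attrs), len(finished))
        for node in nodes:
            if node in finished:
                continue
            # Before this PR, the call below was skipped whenever all of the
            # node's slots already had values, so the node never got a chance
            # to validate attributes that its neighbors filled in later.
            node.infer(attrs)  # fills in and/or validates this node's slots
            # A node counts as done only once every one of its slots is known.
            if all(attrs[slot] is not None for slot in node.slots):
                finished.add(node)
        if (dict(attrs), len(finished)) == before:
            return attrs  # fixed point reached: no further progress possible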

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)

Comments

  • Currently, an operator is considered to have finished inferring attributes when all of its input and output attributes are set after the call to the operator's InferShape/InferType (as in the sketch above). This should not be necessary, since those functions return a value that is supposed to convey the same information. However, because that return value was never actually checked, many operators lie and return true even when not everything was inferred correctly. This should be tackled in a separate PR.

@ptrendx ptrendx added the R1.6.0 label Nov 16, 2019
@anirudh2290 (Member) left a comment

Wow! Shocking :)

@ptrendx (Member, Author) commented on Nov 18, 2019

@junrushao1994 The failure of the test_contrib_control_flow.test_while_loop_simple_forward test in this PR is a legitimate bug in the WhileLoopType function (https://github.com/apache/incubator-mxnet/blob/master/src/operator/control_flow.cc#L764), where the output type of the condition is always set to float. I do not really understand how that function is supposed to work (e.g. cond_out_type is set but then never used?). Could you take a look at it?

@ptrendx (Member, Author) commented on Nov 18, 2019

@reminisce The failure of the test_numpy_op.test_np_transpose test in this PR is a legitimate bug: the Transpose op's TransposeShape function, used for shape inference, does not support scalar inputs (https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op-inl.h#L423). Please help fix this (or disable testing of scalar inputs, if Transpose is not expected to support scalars).
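
For reference (plain NumPy, not MXNet): transposing a 0-d (scalar) array is a well-defined no-op, which is the case TransposeShape needs to handle:

import numpy as np

a = np.array(3.0)         # a 0-d (scalar) array
print(a.ndim)             # 0
print(np.transpose(a))    # 3.0 -- transpose of a scalar is the scalar itself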

@ptrendx (Member, Author) commented on Nov 18, 2019

@samskalicky @apeforest FYI, see the last two comments.

@ptrendx (Member, Author) commented on Nov 18, 2019

@reminisce Actually, the required change to Transpose seems simple; I will do it in this PR.

@junrushao (Member) commented on Nov 19, 2019

At the time I wrote the control flow code, it was still infeasible to have int64 output for many operators, like == or <, which is why this line is hard-coded to float32... I think that behavior was changed recently in a breaking (but better) way.

@ptrendx (Member, Author) commented on Nov 19, 2019

Not sure I understand, @junrushao1994. I believe the problem here is that you are forcing the float32 output type even when the operator could output something else (and in the case of the test that fails in this PR, the operator previously never had a chance to verify that the type is correct). Why not leave the output type of the condition part as -1 so that it can be inferred?

@junrushao (Member) commented

@ptrendx At the time I was writing this piece of code, operators did return float32; see this.

@junrushao (Member) commented

@ptrendx I think leaving it as -1 is a nice proposal. Could you fix this? Thanks!
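
For illustration, a hedged sketch of what the fix amounts to (the real code is C++ in control_flow.cc; the names below are made up). In MXNet type inference, -1 denotes an unknown dtype, so leaving the condition's output type at -1 lets the regular InferType pass fill it in:

UNKNOWN_DTYPE = -1  # MXNet's sentinel for "dtype not yet inferred"

def while_loop_infer_cond_type(cond_out_type):
    # before the fix: cond_out_type[0] = FLOAT32  (hard-coded, masks the real dtype)
    cond_out_type[0] = UNKNOWN_DTYPE  # after: left unknown, inferred from the condition graph
    return True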

@DickJC123 (Contributor) left a comment

Thanks for making the attribute inference logic more robust. Nice catch! LGTM.

@ptrendx merged commit a8b31a2 into apache:master on Nov 21, 2019
ptrendx added a commit to ptrendx/mxnet that referenced this pull request Nov 21, 2019
Fix InferAttr/InferShapeAttr not calling inference for all nodes in a graph (apache#16836)

* Fix the attribute inference omitting nodes

* Add test

* Cleaning

* Fix lint

* Fix TransposeShape

* Fix WhileLoopType

* Changing a/b test for fusion to a/(b+1) to increase numerical stability
ptrendx added a commit that referenced this pull request Nov 22, 2019
* Add unoptimized symbol to executor for sharing (#16798)

* Add unoptimized symbol to executor for sharing

* Copy the symbol in Reshape

* Added test for multiple reshapes

* Mixed precision binary op backward (use in) for numpy (#16791)

* mixed precision binary op backward

* reduce unix cpu runtime

* USE_NVRTC -> ENABLE_CUDA_RTC to fix maven build.  Add compile-guard to fusion. (#16838)

* Rename USE_NVRTC -> ENABLE_CUDA_RTC to fix maven build.  Compile-guard fusion framework.

* Fix fusion-not-supported warning.

* Fix compile guards

* Fix cmake build so -DMXNET_ENABLE_CUDA_RTC=1 is passed to nvcc

* Minimize side-effects of prev change

* Fix InferAttr/InferShapeAttr not calling inference for all nodes in a graph (#16836)

* Fix the attribute inference omitting nodes

* Add test

* Cleaning

* Fix lint

* Fix TransposeShape

* Fix WhileLoopType

* Changing a/b test for fusion to a/(b+1) to increase numerical stability

* Revert "Mixed precision binary op backward (use in) for numpy (#16791)"

This reverts commit 8b58b78.