
[MXNET-1233] Enable dynamic shape in CachedOp #13419

Closed
wants to merge 9 commits

Conversation

@junrushao (Member) commented Nov 27, 2018

Description

This PR enables dynamic shape in CachedOp, a.k.a. the backend supporting HybridBlock in Gluon.

In the forward pass, we have to invoke operators one by one, because dynamic shapes prevent us from allocating memory ahead of time.

The backward pass is unaffected: once the forward pass has run, the shape of everything is known.
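
To make the constraint concrete, here is a small, self-contained sketch (toy types only, not MXNet classes) of what node-by-node execution looks like when one operator's output size depends on the input values: its buffer, and everything downstream of it, can only be allocated once its inputs have actually been computed.

// Toy illustration, not MXNet code: a "dynamic shape" operator is one whose
// output size can only be determined from the concrete input values.
#include <cstddef>
#include <functional>
#include <initializer_list>
#include <iostream>
#include <vector>

struct ToyNode {
  // Returns the output size, given the concrete input values.
  std::function<size_t(const std::vector<float>&)> infer_size;
  // Fills a pre-allocated output buffer from the input.
  std::function<void(const std::vector<float>&, std::vector<float>*)> compute;
};

int main() {
  // Dynamic shape: the number of outputs depends on the data itself.
  ToyNode filter_positive{
      [](const std::vector<float>& in) {
        size_t n = 0;
        for (float v : in) n += (v > 0.0f);
        return n;
      },
      [](const std::vector<float>& in, std::vector<float>* out) {
        size_t j = 0;
        for (float v : in) if (v > 0.0f) (*out)[j++] = v;
      }};
  // Static shape: output size equals input size.
  ToyNode square{
      [](const std::vector<float>& in) { return in.size(); },
      [](const std::vector<float>& in, std::vector<float>* out) {
        for (size_t i = 0; i < in.size(); ++i) (*out)[i] = in[i] * in[i];
      }};

  std::vector<float> buf = {1.0f, -2.0f, 3.0f, -4.0f, 5.0f};

  // "Naive" execution: infer each node's output size, allocate, compute, and
  // only then move on to the next node. Pre-planning all allocations up front
  // is impossible because filter_positive's size is unknown until it runs.
  for (const ToyNode& node : {filter_positive, square}) {
    std::vector<float> out(node.infer_size(buf));
    node.compute(buf, &out);
    buf = std::move(out);
  }
  std::cout << "final output has " << buf.size() << " elements" << std::endl;  // prints 3
  return 0;
}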

CC: @zheng-da @szha @yidawang

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add NaiveRunGraph in imperative mode, in which operators are executed synchronously, one node at a time.
  • Add NaiveForward mode in CachedOp, which calls NaiveRunGraph.
  • Add CheckDynamicShapeExists in CachedOp, which tells whether the graph contains an operator returning dynamic shape.

Comments

  • Yet another mode in the executor of CachedOp. Perhaps we need to refactor some day.

@vandanavk (Contributor):

@mxnet-label-bot add [pr-work-in-progress]

@marcoabreu added the pr-work-in-progress (PR is still work in progress) label on Nov 27, 2018
@junrushao changed the title from "[WIP] Enable dynamic shape in CachedOp" to "[MXNET-1233] Enable dynamic shape in CachedOp" on Nov 27, 2018
@junrushao (Member, Author):

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu added the pr-awaiting-review (PR is waiting for code review) label on Dec 2, 2018
@junrushao (Member, Author):

@mxnet-label-bot remove [pr-work-in-progress]

@marcoabreu removed the pr-work-in-progress (PR is still work in progress) label on Dec 2, 2018
@@ -262,6 +262,29 @@ std::vector<nnvm::NodeEntry> CachedOp::Gradient(
return ret;
}

bool CachedOp::CheckDynamicShapeExists(const Context& default_ctx,
const std::vector<NDArray*>& inputs) {
Reviewer (Contributor):

I wonder if it's better to check for dynamic-shape operators directly. Right now, this assumes that if a computation graph can't infer shapes, it contains dynamic-shape operators. It would be better to write a check that works for both CachedOp and the symbol executor. Whether a graph contains dynamic shape is a property of the computation graph, and we can easily check it by traversing all operators in the graph.
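
As an illustration of the direct check being suggested, a self-contained toy (the graph representation and the registry are stand-ins, not the nnvm data structures; _contrib_boolean_mask is used only as an example of a dynamic-shape operator):

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Toy graph node: just the name of the operator it runs.
struct ToyGraphNode {
  std::string op_name;
};

// Stand-in for a registry of operators known to produce data-dependent shapes.
// In MXNet this would be an op attribute queried through nnvm, not a name list.
static const std::set<std::string> kDynamicShapeOps = {"_contrib_boolean_mask"};

// The property of the graph itself: does any node use a dynamic-shape operator?
bool ContainsDynamicShapeOp(const std::vector<ToyGraphNode>& graph) {
  for (const ToyGraphNode& node : graph) {
    if (kDynamicShapeOps.count(node.op_name)) return true;
  }
  return false;
}

int main() {
  std::vector<ToyGraphNode> graph = {{"Convolution"}, {"_contrib_boolean_mask"}, {"relu"}};
  std::cout << std::boolalpha << ContainsDynamicShapeOp(graph) << std::endl;  // prints true
  return 0;
}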

@junrushao (Member, Author):

Per our discussion last time, I think our solution should be a naive implementation first, and then graph partitioning later to speed things up.

arrays.reserve(num_entries);
for (auto& item : runtime.buff) {
arrays.push_back(&item);
}
Reviewer (Contributor):

I wonder if we should buffer the arrays from the previous run?

@junrushao (Member, Author):

Why buffer arrays from the previous run? To save memory allocation overhead?

Context ctx = GetContext(node.source->attrs, ndinputs, ndoutputs, default_ctx);
auto invoke = [&](const OpStatePtr &state) {
DispatchMode dispatch_mode = DispatchMode::kUndefined;
SetShapeType(ctx, node.source->attrs, ndinputs, ndoutputs, &dispatch_mode);
Reviewer (Contributor):

Do we still infer shape here?

@junrushao (Member, Author):

No. This function leverages the existing infer_shape to find out whether there is anything with a dynamic shape inside the graph.
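
A minimal sketch of that inference-based criterion, using a toy shape type in place of nnvm::TShape: if shape inference leaves any entry unresolved, the graph is assumed to contain dynamic shape.

#include <iostream>
#include <vector>

// Toy shape: a list of dimensions; an empty list stands for "could not be inferred".
using ToyShape = std::vector<int>;

// The criterion described above: run the existing shape inference, then look
// for entries it failed to resolve.
bool InferenceLeftUnknownShapes(const std::vector<ToyShape>& inferred_shapes) {
  for (const ToyShape& shape : inferred_shapes) {
    if (shape.empty()) return true;  // unresolved shape => assume dynamic shape
  }
  return false;
}

int main() {
  // Shapes produced by a hypothetical inference pass: the second one is unknown.
  std::vector<ToyShape> shapes = {{32, 128}, {}, {32, 10}};
  std::cout << std::boolalpha << InferenceLeftUnknownShapes(shapes) << std::endl;  // prints true
  return 0;
}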

auto fwd_node_id = idx.node_id(fwd_node);
cached_op->Backward(retain_graph, states[fwd_node_id], ndinputs, req, ndoutputs);
} else if (createop.count(node.source->op())) {
// case 2: node is in createop
Reviewer (Contributor):

I think this is to handle stateful operators.

@junrushao (Member, Author):

Yep, you are right. Should I change the comments?
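
A self-contained toy of the stateful-operator handling discussed in this thread (all names are stand-ins; in MXNet, createop is the registry of operators that register an FCreateOpState factory): stateful operators get a state object created and retained for the backward pass, while stateless ones are simply invoked.

#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for an operator's persistent state (e.g. an RNN workspace)
// that must outlive the forward call so backward can reuse it.
struct ToyOpState {
  std::string debug_name;
};

using ToyStateFactory = std::function<std::shared_ptr<ToyOpState>()>;

// Toy stand-in for the `createop` registry: only stateful operators appear here.
static const std::map<std::string, ToyStateFactory> kCreateOp = {
    {"RNN", [] { return std::make_shared<ToyOpState>(ToyOpState{"rnn-workspace"}); }},
};

void RunNode(const std::string& op_name,
             std::vector<std::shared_ptr<ToyOpState>>* retained_states) {
  auto it = kCreateOp.find(op_name);
  if (it != kCreateOp.end()) {
    // Case 2 in the hunk above: stateful operator. Create its state, run with
    // it, and retain it for the backward pass.
    std::shared_ptr<ToyOpState> state = it->second();
    retained_states->push_back(state);
    std::cout << op_name << ": ran with state " << state->debug_name << std::endl;
  } else {
    // Stateless operator: nothing to keep around.
    std::cout << op_name << ": ran statelessly" << std::endl;
  }
}

int main() {
  std::vector<std::shared_ptr<ToyOpState>> states;
  RunNode("relu", &states);
  RunNode("RNN", &states);
  std::cout << "retained states: " << states.size() << std::endl;  // prints 1
  return 0;
}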

@@ -1002,6 +1009,18 @@ void RunGraph(const bool retain_graph,
const DispatchModeVector &dispatch_modes,
bool recording);


void NaiveRunGraph(const bool retain_graph,
Reviewer (Member):

I think these new functions deserve documentation of their inputs/outputs and what they do, to keep the code readable.
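
A sketch of the kind of documentation being asked for; the wording is inferred from the PR description and the neighbouring RunGraph declaration, not taken from the actual code:

/*!
 * \brief Naive counterpart of RunGraph for graphs that contain dynamic-shape
 *        operators: nodes are executed one at a time, and each operator's
 *        output shapes are resolved and its memory allocated only after all of
 *        its inputs have been computed.
 *
 *        The parameter list mirrors RunGraph; in particular, retain_graph
 *        controls whether intermediate results are kept so backward can be run
 *        again, dispatch_modes carries the per-node dispatch decision, and
 *        recording indicates whether autograd is recording the imperative calls.
 */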

@anirudhacharya (Member):

@eric-haibin-lin another round of review?

@sandeep-krishnamurthy (Contributor):

@zheng-da - Can you please take a look at this PR again?

@junrushao (Member, Author):

I am going to close this PR, split it into small pieces, and PR again.

Labels: Operator, pr-awaiting-review (PR is waiting for code review)
7 participants