
Bulked op segments to allow Variable nodes #14200

Merged: 15 commits merged into apache:master on Mar 7, 2019

Conversation

DickJC123
Contributor

Description

Background: Operators are bulked into segments that define synchronization points between the CPU-based worker threads and the (often GPU-based) operator execution. While the default maximum segment size is 15, in practice segments end up much smaller because of the current restriction that 'Variable' nodes (e.g. weight inputs to Convolution) terminate segment formation. This restriction is unnecessary; removing it increases segment size and improves performance.

This PR changes the operator-bulking code to allow Variable nodes to be part of a segment without counting them toward the limit, which is currently set by the environment variable MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN (default = 15). Larger segments are desirable because they improve both CPU and GPU efficiency. However, practical considerations keep full forward- and backward-pass bulking from being optimal: for example, in a multi-GPU training scenario, gradients generated during the backward pass must wait until the end of a segment before gradient reduction can start. Because the forward and backward passes may therefore have different optima, this PR adds two additional environment-variable 'knobs':
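As a minimal sketch of the rule change (plain Python for illustration only, not the actual graph_executor.cc logic; the helper name, node labels, and list encoding are made up), using the residual-unit sequence discussed in the performance numbers further down:

```python
def bulk_segments(nodes, max_nodes, variables_end_segment):
    """Group compute nodes into bulked segments.

    nodes: topologically ordered list of ('op'|'var', name) tuples.
    variables_end_segment=True models the old rule; False models this PR.
    """
    segments, current = [], []
    for kind, name in nodes:
        if kind == 'var':
            if variables_end_segment and current:
                segments.append(current)   # old rule: a Variable ends the current segment
                current = []
            continue                       # Variables never count toward max_nodes
        current.append(name)
        if len(current) == max_nodes:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

# Residual-unit forward sequence; weight/BN-parameter Variables precede
# the ops that consume them in topological order.
resunit = [('var', 'w1'), ('op', 'Conv'), ('var', 'gamma1'), ('op', 'BN'),
           ('op', 'Relu'), ('var', 'w2'), ('op', 'Conv'), ('var', 'gamma2'),
           ('op', 'BN'), ('op', 'Add'), ('op', 'Relu')]

old = bulk_segments(resunit, max_nodes=15, variables_end_segment=True)
new = bulk_segments(resunit, max_nodes=15, variables_end_segment=False)
print(len(old), [len(s) for s in old])  # 4 segments -> 7/4 = 1.75 nodes/segment
print(len(new), [len(s) for s in new])  # 1 segment holding all 7 compute nodes
```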

MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD - sets the maximum size of bulked operator segments in the forward pass. If unset, MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN is used or, if that is also unset, the default of 15.

MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD - sets the maximum size of bulked operator segments in the backward pass. If unset, MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN is used or, if that is also unset, the default of 15.
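For illustration, here is a small Python sketch of the intended lookup order (pseudo-logic only; the real resolution happens in the C++ backend, and the helper name and train.py are hypothetical), followed by a typical way to set the knobs when launching training:

```python
import os

def forward_bulk_size(default=15):
    # The _FWD knob wins, then the generic knob, then the built-in default.
    val = os.environ.get('MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD')
    if val is None:
        val = os.environ.get('MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN')
    return int(val) if val is not None else default

# The knobs are normally set in the launching environment, e.g.:
#   MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD=20 \
#   MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD=8 python train.py
```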

While this PR is under review, I will make a further post detailing the resulting performance gains. A commit adding documentation for the new environment variables is also still to come.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR. Documentation update still pending.)
  • All changes have test coverage: In effect the entire CI now runs with the larger segments.
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change


@ptrendx
Member

ptrendx commented Feb 19, 2019

Just to clarify: this PR adds bulking support for symbolic execution; bulking in Gluon was handled in the already-merged PR #13890.

@roywei
Member

roywei commented Feb 21, 2019

@mxnet-jenkins add[Backend, pr-awaiting-review]

@DickJC123
Contributor Author

I measured the perf gains of this PR under 2 scenarios.

The first scenario was a run across 8 Volta GPUs of a mixed-precision NHWC Resnet50 v1b (also with Horovod and DALI in NVIDIA's MXNet container). To simulate upstream MXNet with its current bulking, I set the new PR's bulking limit to 2. I reasoned that a residual unit has the forward sequence Conv-BN-Relu-Conv-BN-Add-Relu, which forms 4 segments (due to the Variable inputs to the Convs and BNs) among 7 nodes, for an average of 1.75 nodes/segment. The speed-up measured from MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=2 to MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=15 was:

batchsize  32: 3.8% speedup
batchsize  64: 5.1% speedup
batchsize 128: 1.9% speedup
batchsize 256: 1.2% speedup

A second scenario was a run across 8 Volta GPUs of a mixed-precision NCHW Inception-v3 (with DALI in NVIDIA's MXNet container). The speed-up measured from MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=2 to MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=15 was:

batchsize  32: 6.3% speedup
batchsize  64: 7.4% speedup
batchsize 128: 3.8% speedup
batchsize 256: 2.1% speedup

@DickJC123
Contributor Author

The main change of this PR is in graph_executor.cc in an area often touched by @piiswrong @eric-haibin-lin @KellenSunderland. Not sure if any of them would like to weigh in.

Member

@eric-haibin-lin left a comment


Thanks for the improvement! A few questions.

docs/faq/env_var.md (review comment, resolved)
@@ -53,10 +53,12 @@ struct CachedOpConfig : public dmlc::Parameter<CachedOpConfig> {
.set_default(2)
.describe("Maximum number of operators that can be inlined.");
DMLC_DECLARE_FIELD(forward_bulk_size)
.set_default(dmlc::GetEnv("MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN", 15))
Member


I noticed that env vars like MXNET_EXEC_BULK_EXEC_TRAIN/MXNET_EXEC_BULK_EXEC_INFER=0 are not respected by the cached op. Would you have time to kindly fix this for the cached op as well?
https://github.com/apache/incubator-mxnet/blob/54fd288c7a4bf59d37f793c26ef9a98ed40b0c40/src/imperative/cached_op.cc#L593-L596

Contributor Author


Please review the latest commits, which address this valid concern about consistency. I also consolidated all op-bulking env var references to a central place and added timing-based tests for the perf impact of bulking. I'm happy with the PR now (assuming it passes CI). Anyone else you want to pull into the review @eric-haibin-lin?
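For context, a stripped-down sketch of what such a timing-based comparison can look like (an illustration assuming MXNet 1.x Gluon APIs, not the PR's actual test; the toy network, batch size, and iteration count are made up, and the real tests set the env vars in a freshly spawned process since the backend reads them when the cached graph is built):

```python
import os, time
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD'] = '15'  # set before any graph is built
import mxnet as mx
from mxnet import gluon, autograd

net = gluon.nn.HybridSequential()
for _ in range(8):
    net.add(gluon.nn.Dense(256, activation='relu'))
net.initialize(ctx=mx.cpu())
net.hybridize(static_alloc=True, static_shape=True)  # bulking applies to the cached graph

x = mx.nd.ones((32, 256))
with autograd.record():
    y = net(x)
y.backward()
mx.nd.waitall()  # warm-up: builds the cached graph and forms the bulked segments

start = time.time()
for _ in range(100):
    with autograd.record():
        y = net(x)
    y.backward()
mx.nd.waitall()  # execution is asynchronous, so drain the queue before stopping the clock
print('time per iteration: %.3f ms' % ((time.time() - start) * 10))
```

Running the same script with different bulk-size settings (e.g. 2 vs. 15) and comparing the per-iteration time gives the kind of perf signal the new tests check for.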

@roywei
Member

roywei commented Feb 23, 2019

@mxnet-label-bot add[Backend]

@marcoabreu added the Backend label (Issues related to the backend of MXNet) on Feb 23, 2019
@DickJC123
Contributor Author

@eric-haibin-lin Thanks for your comments thus far. The PR is back in a good state for you to complete your review, although it appears testing on Windows has been bypassed. Happy to respond to further comments from you or from others you may wish to pull into the review.

Member

@eric-haibin-lin left a comment


One minor comment. Otherwise looks good. cc @szha

src/executor/graph_executor.cc (outdated review comment, resolved)
@eric-haibin-lin merged commit 8beea18 into apache:master on Mar 7, 2019
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* Bulked op seg size to ignore Variable nodes, limited by MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_{FWD,BWD}.

* Document new env variables. Unify operation with imperative.

* Add timing-based tests of symbol and gluon op bulking.

* Rename test_in_separate_process -> run_in_spawned_process.

* Remove redundant util test_operator_gpu.py:_test_in_separate_process().

* Consolidate references to env vars that set op-bulking policy.

* Test for effect of MXNET_EXEC_BULK_EXEC_TRAIN=0.

* Fix python2 print() issue.

* Trigger CI.

* Consolidate similar op bulking routines.

* Trigger CI.

* Trigger CI.

* Add instrumentation to debug failing CI.
nswamy pushed a commit that referenced this pull request Apr 5, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
Labels: Backend (Issues related to the backend of MXNet)