
Bulked op segments to allow Variable nodes #14200

Merged: 15 commits merged into apache:master on Mar 7, 2019

Conversation

DickJC123
Contributor

Description

Background: Operators are bulked into segments that define synchronization points between the CPU-based worker threads and the (often GPU-based) operator execution. While the default maximum segment size is 15, in practice segments end up much smaller because of the current restriction that 'Variable' nodes (e.g. weight inputs to Convolution) terminate segment formation. This restriction is unnecessary; removing it increases segment size and improves performance.

This PR changes the operator-bulking code to allow Variable nodes to be part of a segment without counting them toward the limit, which is currently set by the environment variable MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN (default = 15). Larger segments are desirable because they improve both CPU and GPU efficiency. However, practical considerations keep full forward- and backward-pass bulking from being optimal: for example, in a multi-GPU training scenario, gradients generated during the backward pass must wait until the end of a segment before gradient reduction can start. Because the forward and backward passes may therefore have different optima, this PR adds two additional environment-variable 'knobs':
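As a minimal sketch of the rule change (plain Python for illustration only, not the actual graph_executor.cc logic; the helper name, node labels, and list encoding are made up), using the residual-unit sequence discussed in the performance numbers further down:

```python
def bulk_segments(nodes, max_nodes, variables_end_segment):
    """Group compute nodes into bulked segments.

    nodes: topologically ordered list of ('op'|'var', name) tuples.
    variables_end_segment=True models the old rule; False models this PR.
    """
    segments, current = [], []
    for kind, name in nodes:
        if kind == 'var':
            if variables_end_segment and current:
                segments.append(current)   # old rule: a Variable ends the current segment
                current = []
            continue                       # Variables never count toward max_nodes
        current.append(name)
        if len(current) == max_nodes:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

# Residual-unit forward sequence; weight/BN-parameter Variables precede
# the ops that consume them in topological order.
resunit = [('var', 'w1'), ('op', 'Conv'), ('var', 'gamma1'), ('op', 'BN'),
           ('op', 'Relu'), ('var', 'w2'), ('op', 'Conv'), ('var', 'gamma2'),
           ('op', 'BN'), ('op', 'Add'), ('op', 'Relu')]

old = bulk_segments(resunit, max_nodes=15, variables_end_segment=True)
new = bulk_segments(resunit, max_nodes=15, variables_end_segment=False)
print(len(old), [len(s) for s in old])  # 4 segments -> 7/4 = 1.75 nodes/segment
print(len(new), [len(s) for s in new])  # 1 segment holding all 7 compute nodes
```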

MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD - sets the maximum size of bulked operator segments in the forward pass. If unset, MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN is used or, if that is also unset, the default of 15.

MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD - sets the maximum size of bulked operator segments in the backward pass. If unset, MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN is used or, if that is also unset, the default of 15.
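For illustration, here is a small Python sketch of the intended lookup order (pseudo-logic only; the real resolution happens in the C++ backend, and the helper name and train.py are hypothetical), followed by a typical way to set the knobs when launching training:

```python
import os

def forward_bulk_size(default=15):
    # The _FWD knob wins, then the generic knob, then the built-in default.
    val = os.environ.get('MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD')
    if val is None:
        val = os.environ.get('MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN')
    return int(val) if val is not None else default

# The knobs are normally set in the launching environment, e.g.:
#   MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD=20 \
#   MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD=8 python train.py
```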

While this PR is under review, I will make a further post detailing the resulting performance gains. A commit adding documentation for the new environment variables is also still to come.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR. Documentation update still pending.)
  • All changes have test coverage: In effect the entire CI now runs with the larger segments.
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change


@ptrendx
Member

ptrendx commented Feb 19, 2019

Just to clarify: this PR adds bulking support for symbolic execution; bulking in Gluon was handled in the already-merged PR #13890.

@roywei
Member

roywei commented Feb 21, 2019

@mxnet-jenkins add[Backend, pr-awaiting-review]

@DickJC123
Contributor Author

I measured the perf gains of this PR under 2 scenarios.

The first scenario was a run across 8 Volta GPUs of a mixed-precision NHWC Resnet50 v1b (also with Horovod and DALI in NVIDIA's MXNet container). To simulate upstream MXNet with its current bulking, I set the new PR's bulking limit to 2. I reasoned that a residual unit has the forward sequence Conv-BN-Relu-Conv-BN-Add-Relu, which forms 4 segments (due to the Variable inputs to the Convs and BNs) among 7 nodes, for an average of 1.75 nodes/segment. The speed-up measured from MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=2 to MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=15 was:

batchsize  32: 3.8% speedup
batchsize  64: 5.1% speedup
batchsize 128: 1.9% speedup
batchsize 256: 1.2% speedup

A second scenario was a run across 8 Volta GPUs of a mixed-precision NCHW Inception-v3 (with DALI in NVIDIA's MXNet container). The speed-up measured from MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=2 to MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=15 was:

batchsize  32: 6.3% speedup
batchsize  64: 7.4% speedup
batchsize 128: 3.8% speedup
batchsize 256: 2.1% speedup

@DickJC123
Contributor Author

The main change of this PR is in graph_executor.cc in an area often touched by @piiswrong @eric-haibin-lin @KellenSunderland. Not sure if any of them would like to weigh in.

Member

@eric-haibin-lin left a comment


Thanks for the improvement! A few questions.

docs/faq/env_var.md (review comment, resolved)
@@ -53,10 +53,12 @@ struct CachedOpConfig : public dmlc::Parameter<CachedOpConfig> {
.set_default(2)
.describe("Maximum number of operators that can be inlined.");
DMLC_DECLARE_FIELD(forward_bulk_size)
.set_default(dmlc::GetEnv("MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN", 15))
Member


I noticed that env vars like MXNET_EXEC_BULK_EXEC_TRAIN/MXNET_EXEC_BULK_EXEC_INFER=0 are not respected by the cached op. Would you have time to kindly fix this for the cached op as well?
https://github.com/apache/incubator-mxnet/blob/54fd288c7a4bf59d37f793c26ef9a98ed40b0c40/src/imperative/cached_op.cc#L593-L596

Contributor Author


Please review the latest commits, which address this valid concern about consistency. I also consolidated all op-bulking env var references to a central place and added timing-based tests for the perf impact of bulking. I'm happy with the PR now (assuming it passes CI). Anyone else you want to pull into the review @eric-haibin-lin?
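For context, a stripped-down sketch of what such a timing-based comparison can look like (an illustration assuming MXNet 1.x Gluon APIs, not the PR's actual test; the toy network, batch size, and iteration count are made up, and the real tests set the env vars in a freshly spawned process since the backend reads them when the cached graph is built):

```python
import os, time
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD'] = '15'  # set before any graph is built
import mxnet as mx
from mxnet import gluon, autograd

net = gluon.nn.HybridSequential()
for _ in range(8):
    net.add(gluon.nn.Dense(256, activation='relu'))
net.initialize(ctx=mx.cpu())
net.hybridize(static_alloc=True, static_shape=True)  # bulking applies to the cached graph

x = mx.nd.ones((32, 256))
with autograd.record():
    y = net(x)
y.backward()
mx.nd.waitall()  # warm-up: builds the cached graph and forms the bulked segments

start = time.time()
for _ in range(100):
    with autograd.record():
        y = net(x)
    y.backward()
mx.nd.waitall()  # execution is asynchronous, so drain the queue before stopping the clock
print('time per iteration: %.3f ms' % ((time.time() - start) * 10))
```

Running the same script with different bulk-size settings (e.g. 2 vs. 15) and comparing the per-iteration time gives the kind of perf signal the new tests check for.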

@roywei
Member

roywei commented Feb 23, 2019

@mxnet-label-bot add[Backend]

@marcoabreu added the Backend label (Issues related to the backend of MXNet) on Feb 23, 2019
@DickJC123
Contributor Author

@eric-haibin-lin Thanks for your comments thus far. The PR is back in a good state for you to complete your review, although it appears testing on Windows has been bypassed. Happy to respond to further comments from you or from others you may wish to pull into the review.

Member

@eric-haibin-lin left a comment


One minor comment. Otherwise looks good. cc @szha

src/executor/graph_executor.cc (outdated review comment, resolved)
@eric-haibin-lin merged commit 8beea18 into apache:master on Mar 7, 2019
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* Bulked op seg size to ignore Variable nodes, limited by MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_{FWD,BWD}.

* Document new env variables. Unify operation with imperative.

* Add timing-based tests of symbol and gluon op bulking.

* Rename test_in_separate_process -> run_in_spawned_process.

* Remove redundant util test_operator_gpu.py:_test_in_separate_process().

* Consolidate references to env vars that set op-bulking policy.

* Test for effect of MXNET_EXEC_BULK_EXEC_TRAIN=0.

* Fix python2 print() issue.

* Trigger CI.

* Consolidate similar op bulking routines.

* Trigger CI.

* Trigger CI.

* Add instrumentation to debug failing CI.
nswamy pushed a commit that referenced this pull request Apr 5, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
Labels: Backend (Issues related to the backend of MXNet)