Simplify creation of NodeEntry instances and use emplace_back #14095

larroy · 2019-02-08T12:57:57Z

Description

Reduce copying of shared_ptr. In a microbenchmark this is 10% faster, but the main motivation is to simplify the creation of nnvm graphs in operators.

Optimize move semantics of NodeEntry
https://github.com/dmlc/tvm/pull/2576
Copying a shared_ptr is more expensive than moving it.
This PR reduces lock contention by using move semantics in NNVM nodes.
The added constructors also make it more convenient to construct
NodeEntry instances in the code.

Update NDarray with NodeEntry constructors and refine initializer lists.

Sync gradient.cc with tvm

imperative.cc will be addressed in a different refactor.
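
The constructor pattern described above can be sketched as follows. This is a minimal illustration, not the actual NNVM code: `Node`, `NodePtr`, `NodeEntry` and `MakeEntries` here are hypothetical stand-ins, and the real `nnvm::NodeEntry` may differ in its members and constructors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for nnvm::Node and nnvm::NodeEntry; the real
// definitions live in NNVM and may differ.
struct Node {};
using NodePtr = std::shared_ptr<Node>;

struct NodeEntry {
  // Take the pointer by value and move it into the member, so passing an
  // rvalue never touches the reference count.
  NodeEntry(NodePtr n, uint32_t idx, uint32_t ver)
      : node(std::move(n)), index(idx), version(ver) {}

  NodePtr node;
  uint32_t index;
  uint32_t version;
};

// Builds `count` entries that all refer to the same node.
std::vector<NodeEntry> MakeEntries(const NodePtr& node, std::size_t count) {
  assert(node != nullptr);
  std::vector<NodeEntry> entries;
  entries.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    // The arguments are forwarded to the NodeEntry constructor: one copy of
    // the shared_ptr into the by-value parameter, then a move into the member.
    entries.emplace_back(node, 0, 0);
  }
  return entries;
}
```

With a constructor like this, call sites can write `entries.emplace_back(node, 0, 0)` instead of spelling out a temporary `NodeEntry{node, 0, 0}`.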

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

vandanavk · 2019-02-08T17:10:17Z

@mxnet-label-bot add [pr-work-in-progress]

ptrendx · 2019-02-08T17:34:32Z

Does this change improve the time of scheduling launches to the engine?

larroy · 2019-02-08T20:45:11Z

@ptrendx Thanks for your comment. I would still need to measure that.

abhinavs95 · 2019-03-27T17:38:30Z

@larroy Could you please fix the CI failures?

piyushghai · 2019-04-02T21:36:36Z

@larroy Is this PR still WIP ?

larroy · 2019-04-17T08:48:04Z

@ptrendx suggestions on measurements?

szha

Thanks for the patch. Could you please report the difference that this change is making through experiment?

larroy · 2019-05-20T18:35:56Z

@szha what would you suggest?

larroy · 2019-05-20T18:46:39Z

@szha the initial motivation is to streamline node creation and to improve correctness, since there are hard-to-catch bugs that depend on Node initialization.

apeforest

Thanks for answering my question. LGTM!

larroy · 2019-05-20T20:00:35Z

@szha Performance is the same, so motivation is correctness and ease of graph building:

Tested in a tight loop:

// Needs <chrono>, <iostream>, <vector>, the gtest header and the nnvm headers.
TEST(NodeTestX, NodeTest) {
    using namespace nnvm;
    using namespace std;
    using namespace std::chrono;
    vector<nnvm::NodeEntry> v;
    nnvm::NodePtr ng = nnvm::Node::Create();
    ng->attrs.op = Op::Get("_zeros_without_dtype");
    ng->attrs.name = "zeros_without_dtype";
#if 0
    // Variant A (disabled): build a temporary NodeEntry and push_back it.
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    for (size_t i = 0; i < 10000000; ++i) {
        v.push_back(NodeEntry{ng, 0, 0});
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(t2 - t1).count();
    cout << duration;  // elapsed microseconds, printed without a newline
#endif
#if 1
    // Variant B (enabled): construct the NodeEntry in place with emplace_back.
    auto t1 = high_resolution_clock::now();
    for (size_t i = 0; i < 10000000; ++i) {
        v.emplace_back(ng, 0, 0);
    }
    auto t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(t2 - t1).count();
    cout << duration;  // elapsed microseconds, printed without a newline
#endif
}

piotr@ec2 cpu:0: ~/mxnet [master]> build/tests/mxnet_unit_tests --gtest_filter="NodeTest*"

Note: Google Test filter = NodeTest*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NodeTestX
[ RUN      ] NodeTestX.NodeTest
1659863[       OK ] NodeTestX.NodeTest (1918 ms)
[----------] 1 test from NodeTestX (1918 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1919 ms total)
[  PASSED  ] 1 test.


Note: Google Test filter = NodeTest*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NodeTestX
[ RUN      ] NodeTestX.NodeTest
1717020[       OK ] NodeTestX.NodeTest (1985 ms)
[----------] 1 test from NodeTestX (1985 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1985 ms total)
[  PASSED  ] 1 test.

szha

The PR description says it's optimization so I'm expecting some performance difference. I'm not sure what you mean by "correctness" or "ease". Was it not working as expected? If so, is there a test for it?

The "ease of building graph" is probably easier for people to appreciate. It would be great if the change in recommended way of constructing nodes could be documented somewhere.

larroy · 2019-05-20T20:46:17Z

Moving a shared pointer is faster than copying it. With move, the microbenchmark is 10% faster. Overall that is not much and won't make a difference in the big picture, but these small inefficiencies compound and make the code slower in general, in a way that you can't profile because it's scattered all over.

What changes exactly are you suggesting? A change to the title of the PR? It is not clear to me, please help clarify.

Copying a shared_ptr can also cause contention between threads because the reference count is updated atomically.
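
The difference between copying and moving a shared_ptr can be seen in the reference count. The sketch below uses only the standard library; the function names are hypothetical and chosen for illustration. A copy adds an owner, which typically means an atomic increment of the shared count, while a move hands the pointer over and leaves the count untouched.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Returns the use count observed after copying a shared_ptr.
long UseCountAfterCopy() {
  auto original = std::make_shared<int>(42);
  std::shared_ptr<int> copy = original;  // adds a second owner
  return copy.use_count();               // both pointers share one count
}

// Returns the use count observed after moving a shared_ptr.
long UseCountAfterMove() {
  auto original = std::make_shared<int>(42);
  std::shared_ptr<int> moved = std::move(original);  // ownership handoff
  assert(original == nullptr);  // the moved-from pointer is left empty
  return moved.use_count();     // still a single owner
}
```

The copy path reports two owners and the move path reports one, which is why moving avoids the reference-count traffic that a copy incurs.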

Note: Google Test filter = NodeTest*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NodeTestX
[ RUN      ] NodeTestX.NodeTest
1655650[       OK ] NodeTestX.NodeTest (1804 ms)
[----------] 1 test from NodeTestX (1804 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1804 ms total)
[  PASSED  ] 1 test.

szha · 2019-05-20T20:53:08Z

The information in the PR suggests this patch is premature optimization as it's not making any difference in speed. If any of the symptoms can be captured in a test, please do so. Otherwise, if it's a usability improvement, then documenting the suggested use would be helpful.

larroy · 2019-05-20T20:55:38Z

I just provided a microbenchmark where you can see it's 10% faster. Usually in C++ we try to avoid doing more work than necessary, such as copying a shared_ptr when it is not needed. Do you disagree with that? It creates compounded inefficiencies that don't show up in a profiler. The change also makes it easier and less verbose to add to an nnvm graph.

larroy · 2019-05-20T20:57:16Z

I don't think it is premature optimization. It is generally agreed that it is better to emplace_back than to push_back and create additional copies, and it also makes adding nodes less verbose. Why do you call it a premature optimization? What exact changes are you suggesting?
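
To make the push_back versus emplace_back comparison concrete, here is a small sketch with an instrumented type. `Counted`, `Tally` and the two counting functions are hypothetical names introduced for this illustration and are not part of MXNet or NNVM. With a temporary argument, push_back constructs the object and then moves it into the vector (or copies it, if the type has no move constructor), while emplace_back constructs it directly in the vector's storage.

```cpp
#include <cassert>
#include <vector>

// Hypothetical instrumented type: counts constructions, copies and moves.
struct Counted {
  static int constructed;
  static int copies;
  static int moves;

  explicit Counted(int v) : value(v) { ++constructed; }
  Counted(const Counted& other) : value(other.value) { ++copies; }
  Counted(Counted&& other) noexcept : value(other.value) { ++moves; }

  int value;
};

int Counted::constructed = 0;
int Counted::copies = 0;
int Counted::moves = 0;

// Snapshot of the counters after one insertion.
struct Tally {
  int constructed;
  int copies;
  int moves;
};

static void ResetCounters() {
  Counted::constructed = 0;
  Counted::copies = 0;
  Counted::moves = 0;
}

// Inserts one element via a temporary and push_back.
Tally CountPushBack() {
  ResetCounters();
  std::vector<Counted> v;
  v.reserve(1);             // avoid reallocation so only the insert is counted
  v.push_back(Counted{1});  // temporary is constructed, then moved in
  assert(v.size() == 1);
  return Tally{Counted::constructed, Counted::copies, Counted::moves};
}

// Inserts one element by constructing it in place.
Tally CountEmplaceBack() {
  ResetCounters();
  std::vector<Counted> v;
  v.reserve(1);
  v.emplace_back(1);  // constructor arguments are forwarded, no temporary
  assert(v.size() == 1);
  return Tally{Counted::constructed, Counted::copies, Counted::moves};
}
```

For a type with a cheap move constructor the extra move is small, which is consistent with the modest difference reported in this thread; the call site is still shorter with emplace_back.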

szha · 2019-05-20T21:01:00Z

I'm worried that it would be premature because of the lack of measurable improvements in the problem space that MXNet is concerned with.

larroy · 2019-05-20T21:26:49Z

If you don't find the performance improvement justified, would you be willing to merge since it simplifies creating graphs in operators?

#14992
#14613

larroy · 2019-05-20T21:28:27Z

@szha I changed the title according to your request, are we good to merge?

szha · 2019-05-21T18:09:25Z

Thanks for documenting the change. This PR needs one more rebase.

larroy · 2019-05-22T08:04:57Z

@szha Any more comments, or are we ready to go?

larroy requested a review from szha as a code owner February 8, 2019 12:57

marcoabreu added the pr-work-in-progress label Feb 8, 2019

larroy force-pushed the node_ptr branch from bfdc51a to 455e54d Compare February 11, 2019 21:57

larroy changed the title ~~[Don't merge] Optimize move semantics of NodeEntry~~ [Don't merge][Review] Optimize move semantics of NodeEntry Feb 11, 2019

larroy force-pushed the node_ptr branch 2 times, most recently from b4f1e57 to ca7dde0 Compare February 13, 2019 23:16

larroy force-pushed the node_ptr branch from ca7dde0 to 077ade6 Compare March 15, 2019 01:44

larroy force-pushed the node_ptr branch from 077ade6 to 140c809 Compare April 10, 2019 22:02

larroy requested a review from anirudh2290 as a code owner April 11, 2019 01:22

larroy force-pushed the node_ptr branch from 4edc749 to 12ce866 Compare April 11, 2019 21:19

larroy changed the title ~~[Don't merge][Review] Optimize move semantics of NodeEntry~~ Optimize move semantics of NodeEntry Apr 11, 2019

larroy force-pushed the node_ptr branch 2 times, most recently from 43b24ab to 9046556 Compare April 13, 2019 01:50

larroy changed the title ~~Optimize move semantics of NodeEntry~~ [Don't merge] Optimize move semantics of NodeEntry Apr 13, 2019

larroy force-pushed the node_ptr branch from e641c5a to fa37262 Compare April 15, 2019 23:01

larroy requested review from gigasquid, nswamy and yzhliu as code owners April 15, 2019 23:01

larroy force-pushed the node_ptr branch 2 times, most recently from 966fb9e to 287c7e3 Compare April 17, 2019 01:25

larroy force-pushed the node_ptr branch 2 times, most recently from aab4834 to 1b93548 Compare April 19, 2019 20:53

apeforest reviewed Apr 22, 2019

.gitmodules (outdated)

apeforest reviewed Apr 22, 2019

dev_menu.py (outdated)

szha reviewed May 18, 2019

larroy force-pushed the node_ptr branch from 015a0b0 to 5b24c82 Compare May 20, 2019 18:36

apeforest approved these changes May 20, 2019

szha suggested changes May 20, 2019

larroy changed the title ~~Optimize move semantics of NodeEntry~~ Simplify creation of NodeEntry instances and use emplace_back May 20, 2019

larroy force-pushed the node_ptr branch from 80ce4b6 to c2d28c0 Compare May 21, 2019 21:09

larroy added 2 commits May 21, 2019 14:16

Remove additional calls to NodeEntry in emplace_back

a79fdb0

larroy force-pushed the node_ptr branch from c2d28c0 to a79fdb0 Compare May 21, 2019 21:41

larroy added 2 commits May 21, 2019 14:47

refine patch

7ef8e84

Fix lint

d6798ce

szha approved these changes May 23, 2019

szha merged commit 038b9fb into apache:master May 23, 2019

kshitij12345 mentioned this pull request May 23, 2019

[MXNET-978] Support higher order gradient for log, log2, log10. #14992

Merged

7 tasks

larroy mentioned this pull request May 24, 2019

[MXNET-978] Second order gradient support for some unary operators #14613

Merged

7 tasks

larroy deleted the node_ptr branch May 25, 2019 01:58

larroy mentioned this pull request Jul 18, 2019

Refactor AGInfo and Imperative #14836

Closed

7 tasks
