Improve cached_op performance for static mode #14785

ZhennanQin · 2019-04-24T07:09:01Z

Description

@pengzhao-intel @TaoLv @xinyu-intel @junrushao1994
When a Gluon model is hybridized with static_shape=True and static_alloc=True, cached_op runs in static mode. In this case, we should cache operator state for better performance. This PR enables that caching to speed up Gluon inference, especially for small batch sizes.

The data below was collected on a 28-core SKX-8180. The SKX GLUON INT8 OPT column shows the performance with this PR; the baseline is SKX GLUON INT8.

GLUON ResNet50 V1(10 cores for rec decoder)	SKX GLUON FP32	SKX GLUON INT8	SKX GLUON INT8 OPT
Throughput(img/sec, bs=1)	64.6	11.89	144.53
Throughput(img/sec, bs=2)	79.74	23.1	226.04
Throughput(img/sec, bs=4)	111.9	43.45	302.4
Throughput(img/sec, bs=8)	134.86	78.72	347.6
Throughput(img/sec, bs=16)	143.52	129.09	362.93
Throughput(img/sec, bs=32)	146.63	197.63	381.2
Throughput(img/sec, bs=64)	153.89	261.33	380.79
Throughput(img/sec, bs=128)	156.82	326.23	408.38
Accuracy(5000 imgs)	77.21%/93.55%	76.86%/93.46%	76.86%/93.46%

GLUON MobileNet1.0(10 cores for rec decoder)	SKX GLUON FP32	SKX GLUON INT8	SKX GLUON INT8 OPT
Throughput(img/sec, bs=1)	166.23	38.77	281.28
Throughput(img/sec, bs=2)	238.79	75.81	518.52
Throughput(img/sec, bs=4)	333.2	143.83	987.63
Throughput(img/sec, bs=8)	397.46	262.47	1245.85
Throughput(img/sec, bs=16)	425.35	425.63	1332.25
Throughput(img/sec, bs=32)	451.89	653.8	1474.7
Throughput(img/sec, bs=64)	471.77	897.99	1528.63
Throughput(img/sec, bs=128)	468.67	1125.75	1557.16
Accuracy(5000 imgs)	73.28%/91.22%	72.85%/90.99%	72.85%/90.99%
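To make the "especially for small batch sizes" claim concrete, the speedup of INT8 OPT over INT8 can be computed from the two tables above. The sketch below only copies the throughput numbers from the tables; the helper name `speedups` and the dictionary layout are illustrative choices.

```python
# Throughput (img/sec) copied from the tables above, ordered by batch size.
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]
THROUGHPUT = {
    "ResNet50 V1": {
        "int8": [11.89, 23.1, 43.45, 78.72, 129.09, 197.63, 261.33, 326.23],
        "int8_opt": [144.53, 226.04, 302.4, 347.6, 362.93, 381.2, 380.79, 408.38],
    },
    "MobileNet1.0": {
        "int8": [38.77, 75.81, 143.83, 262.47, 425.63, 653.8, 897.99, 1125.75],
        "int8_opt": [281.28, 518.52, 987.63, 1245.85, 1332.25, 1474.7, 1528.63, 1557.16],
    },
}


def speedups(base, opt):
    """Return the per-batch-size ratio opt / base."""
    return [o / b for b, o in zip(base, opt)]


for model, data in THROUGHPUT.items():
    ratios = speedups(data["int8"], data["int8_opt"])
    for bs, ratio in zip(BATCH_SIZES, ratios):
        print(f"{model} bs={bs}: {ratio:.2f}x")
```

For both models the ratio is largest at bs=1 and smallest at bs=128; for ResNet50 V1 it falls from roughly 12x to about 1.25x.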

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

pengzhao-intel · 2019-04-24T07:52:56Z

cc @zhreshold

src/executor/attach_op_execs_pass.cc

zhreshold · 2019-04-24T18:03:41Z

Please fix CI and minor issue, this is awesome!!

larroy · 2019-04-24T19:34:40Z

Can the performance measurement scripts be shared?

pengzhao-intel · 2019-04-25T02:15:32Z

@larroy I think the test is already part of gluonCV
https://gluon-cv.mxnet.io/build/examples_deployment/int8_inference.html
@xinyu-intel am I right?

xinyu-intel · 2019-04-25T06:54:36Z

@pengzhao-intel @larroy Yes, we mainly use the ImageNet classification script verify_pretrained.py and eval ssd for GluonCV evaluation. BTW, once GluonCV #755 is merged along with this PR, the performance will improve.

pengzhao-intel · 2019-04-26T04:09:37Z

@zhreshold please help to review again :)

pengzhao-intel

LGTM

zhreshold

lgtm

Fix cached_op

93bccbd

xinyu-intel mentioned this pull request Apr 24, 2019

Set static alloc for quantized models dmlc/gluon-cv#755

Merged

pengzhao-intel mentioned this pull request Apr 24, 2019

[Discussion] 1.5.0 Roadmap #14619

Closed

try to fix ci

08b082c

zhreshold suggested changes Apr 24, 2019

src/executor/attach_op_execs_pass.cc (outdated, resolved)

Fix CI

2849633

szha added this to Review in progress in CPU Performance and Quantization Apr 25, 2019

szha moved this from Review in progress to In progress in CPU Performance and Quantization Apr 25, 2019

Fix ci

2e38a0a

pengzhao-intel moved this from In progress to Review in progress in CPU Performance and Quantization Apr 26, 2019

pengzhao-intel approved these changes Apr 26, 2019

zhreshold approved these changes Apr 26, 2019

CPU Performance and Quantization automation moved this from Review in progress to Reviewer approved Apr 26, 2019

zhreshold merged commit 369b66d into apache:master Apr 26, 2019

CPU Performance and Quantization automation moved this from Reviewer approved to Done Apr 26, 2019

anirudhacharya added a commit to anirudhacharya/mxnet that referenced this pull request May 3, 2019

Revert "Improve cached_op performance for static mode (apache#14785)"

41ba232

This reverts commit 369b66d.

anirudhacharya mentioned this pull request May 3, 2019

Revert "Improve cached_op performance for static mode" #14868

Merged

tlby mentioned this pull request May 3, 2019

[MXNet] - [BERT] dmlc/gluon-nlp#690

Closed

szha pushed a commit that referenced this pull request May 3, 2019

Revert "Improve cached_op performance for static mode (#14785)" (#14868)

204f3f2

This reverts commit 369b66d.

ZhennanQin mentioned this pull request May 11, 2019

Re-enable static cached_op optimization #14931

Merged

7 tasks

access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019

Improve cached_op performance for static mode (apache#14785)

d56e1fa

* Fix cached_op * try to fix ci * Fix CI * Fix ci

access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019

Revert "Improve cached_op performance for static mode (apache#14785)" (…

ab27a33

…apache#14868) This reverts commit 369b66d.

ZhennanQin deleted the static_cached_op branch May 31, 2019 02:07

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

Improve cached_op performance for static mode (apache#14785)

3c99bdc

* Fix cached_op * try to fix ci * Fix CI * Fix ci

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

Revert "Improve cached_op performance for static mode (apache#14785)" (…

a84bc75

…apache#14868) This reverts commit 369b66d.
