This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Improve cached_op performance for static mode #14785

Merged: 4 commits merged on Apr 26, 2019

Conversation

ZhennanQin
Contributor

Description

@pengzhao-intel @TaoLv @xinyu-intel @junrushao1994
When a Gluon model is hybridized with static_shape=True, static_alloc=True, cached_op runs in static mode. In this situation, we should cache operator state for better performance. This PR enables that caching to speed up Gluon inference, especially for small batch sizes.
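As a rough illustration of why caching operator state helps, here is a toy sketch in plain Python. It is not MXNet code and not the implementation in this PR: `ToyOp`, `ToyCachedOp`, and `count_state_builds` are hypothetical names, and the "state" is just a dictionary standing in for whatever an operator would otherwise rebuild on every call. The only assumption carried over from the description is that static mode (entered in Gluon via `net.hybridize(static_alloc=True, static_shape=True)`) lets state be created once and reused.

```python
class ToyOp:
    """Hypothetical stand-in for a stateful operator with costly state setup."""

    def __init__(self, name):
        self.name = name
        self.state_builds = 0  # how many times state was (re)created

    def create_state(self, shape):
        # In a real framework this could be expensive (e.g. primitive setup).
        self.state_builds += 1
        return {"op": self.name, "shape": shape}

    def forward(self, state, x):
        # Trivial computation; the state is unused beyond being required.
        return [v + 1 for v in x]


class ToyCachedOp:
    """Runs a fixed list of ops; in static mode it reuses per-op state."""

    def __init__(self, ops, static=False):
        self.ops = ops
        self.static = static
        self._states = {}  # (op index, input shape) -> cached state

    def __call__(self, x):
        shape = (len(x),)
        for i, op in enumerate(self.ops):
            if self.static:
                key = (i, shape)
                if key not in self._states:
                    self._states[key] = op.create_state(shape)
                state = self._states[key]
            else:
                # Non-static mode: state is rebuilt on every call.
                state = op.create_state(shape)
            x = op.forward(state, x)
        return x


def count_state_builds(static, calls=5):
    """Return (total state builds, last output) after `calls` forward passes."""
    ops = [ToyOp("conv"), ToyOp("relu")]
    net = ToyCachedOp(ops, static=static)
    out = None
    for _ in range(calls):
        out = net([0.0, 1.0])
    return sum(op.state_builds for op in ops), out


if __name__ == "__main__":
    print("non-static builds:", count_state_builds(static=False)[0])
    print("static builds:", count_state_builds(static=True)[0])
```

With five calls over two toy operators, the non-static path builds state 10 times and the static path builds it twice, while both produce the same output. The fixed per-call overhead saved this way matters most when each call does little work, which is consistent with the emphasis on small batch sizes.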

The data below was collected on an SKX-8180 with 28 cores. The SKX GLUON INT8 OPT column shows performance with this PR; the baseline is SKX GLUON INT8.

| GLUON ResNet50 V1 (10 cores for rec decoder) | SKX GLUON FP32 | SKX GLUON INT8 | SKX GLUON INT8 OPT |
|---|---|---|---|
| Throughput (img/sec, bs=1) | 64.6 | 11.89 | 144.53 |
| Throughput (img/sec, bs=2) | 79.74 | 23.1 | 226.04 |
| Throughput (img/sec, bs=4) | 111.9 | 43.45 | 302.4 |
| Throughput (img/sec, bs=8) | 134.86 | 78.72 | 347.6 |
| Throughput (img/sec, bs=16) | 143.52 | 129.09 | 362.93 |
| Throughput (img/sec, bs=32) | 146.63 | 197.63 | 381.2 |
| Throughput (img/sec, bs=64) | 153.89 | 261.33 | 380.79 |
| Throughput (img/sec, bs=128) | 156.82 | 326.23 | 408.38 |
| Accuracy (5000 imgs) | 77.21%/93.55% | 76.86%/93.46% | 76.86%/93.46% |

| GLUON MobileNet1.0 (10 cores for rec decoder) | SKX GLUON FP32 | SKX GLUON INT8 | SKX GLUON INT8 OPT |
|---|---|---|---|
| Throughput (img/sec, bs=1) | 166.23 | 38.77 | 281.28 |
| Throughput (img/sec, bs=2) | 238.79 | 75.81 | 518.52 |
| Throughput (img/sec, bs=4) | 333.2 | 143.83 | 987.63 |
| Throughput (img/sec, bs=8) | 397.46 | 262.47 | 1245.85 |
| Throughput (img/sec, bs=16) | 425.35 | 425.63 | 1332.25 |
| Throughput (img/sec, bs=32) | 451.89 | 653.8 | 1474.7 |
| Throughput (img/sec, bs=64) | 471.77 | 897.99 | 1528.63 |
| Throughput (img/sec, bs=128) | 468.67 | 1125.75 | 1557.16 |
| Accuracy (5000 imgs) | 73.28%/91.22% | 72.85%/90.99% | 72.85%/90.99% |
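To make the batch-size trend easier to read, the following short sketch computes the ratio of SKX GLUON INT8 OPT to SKX GLUON INT8 throughput for ResNet50 V1. The numbers are copied from the ResNet50 V1 results above; the helper name `speedups` is only illustrative.

```python
# Throughput (img/sec) for GLUON ResNet50 V1, copied from the results above.
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]
INT8 = [11.89, 23.1, 43.45, 78.72, 129.09, 197.63, 261.33, 326.23]
INT8_OPT = [144.53, 226.04, 302.4, 347.6, 362.93, 381.2, 380.79, 408.38]


def speedups(base, opt):
    """Element-wise ratio opt / base."""
    return [o / b for b, o in zip(base, opt)]


if __name__ == "__main__":
    for bs, s in zip(BATCH_SIZES, speedups(INT8, INT8_OPT)):
        print(f"bs={bs:<3d} INT8 OPT / INT8 = {s:.2f}x")
```

The ratio is largest at bs=1 and shrinks steadily as the batch size grows, which matches the description's point that the gain is most visible for small batches.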

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
      • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made?
  • Interesting edge cases to note here

@pengzhao-intel
Contributor

cc @zhreshold

src/executor/attach_op_execs_pass.cc (outdated review thread, resolved)
@zhreshold
Member

Please fix CI and the minor issue. This is awesome!!

@larroy
Contributor

larroy commented Apr 24, 2019

Can the performance measurement scripts be shared?

@pengzhao-intel
Contributor

@larroy I think the test is already part of GluonCV:
https://gluon-cv.mxnet.io/build/examples_deployment/int8_inference.html
@xinyu-intel am I right?

@xinyu-intel
Contributor

@pengzhao-intel @larroy Yes, we mainly use the ImageNet classification script verify_pretrained.py and eval ssd for GluonCV evaluation. BTW, once GluonCV #755 is merged along with this PR, performance will improve.

@szha szha added this to Review in progress in CPU Performance and Quantization Apr 25, 2019
@szha szha moved this from Review in progress to In progress in CPU Performance and Quantization Apr 25, 2019
@pengzhao-intel
Contributor

@zhreshold please help to review again :)

@pengzhao-intel pengzhao-intel moved this from In progress to Review in progress in CPU Performance and Quantization Apr 26, 2019
Contributor

@pengzhao-intel pengzhao-intel left a comment

LGTM

Member

@zhreshold zhreshold left a comment

lgtm

CPU Performance and Quantization automation moved this from Review in progress to Reviewer approved Apr 26, 2019
@zhreshold zhreshold merged commit 369b66d into apache:master Apr 26, 2019
CPU Performance and Quantization automation moved this from Reviewer approved to Done Apr 26, 2019
anirudhacharya added a commit to anirudhacharya/mxnet that referenced this pull request May 3, 2019
szha pushed a commit that referenced this pull request May 3, 2019
access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019
* Fix cached_op

* try to fix ci

* Fix CI

* Fix ci
@ZhennanQin ZhennanQin deleted the static_cached_op branch May 31, 2019 02:07
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Fix cached_op

* try to fix ci

* Fix CI

* Fix ci
5 participants