This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

update rcnn example #11373

Merged
merged 1 commit into from Jul 13, 2018

Conversation

ijkguo
Contributor

@ijkguo ijkguo commented Jun 23, 2018

Description

People complain about the Faster R-CNN example a lot.

  • Hard to install. Now we only require packages that can be installed via pip.
  • Too much code, some of it duplicated. The example is now reduced from ~9k to ~3k lines.
  • Too difficult to configure. Hyperparameters are now tunable through command line arguments instead of a central global config; just type python3 train.py -h to see all of them (see the argparse sketch after this list).
  • Not compatible with MXNet API changes. We reduced the code complexity to ease maintenance for API updates. The old released models and training scripts still work.
  • Not compatible with Python 3. The issues were mostly print statements and true division; they are fixed and tested now.
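
A minimal sketch of this command line pattern; the flag names below are hypothetical and only illustrate the idea, while the real list comes from python3 train.py -h.

import argparse

# Hypothetical flags for illustration only; run `python3 train.py -h`
# in example/rcnn for the real ones. The point is that each
# hyperparameter becomes a command line argument instead of an entry
# in a central global config.
parser = argparse.ArgumentParser(description='Train Faster R-CNN')
parser.add_argument('--lr', type=float, default=0.001,
                    help='base learning rate (hypothetical flag)')
parser.add_argument('--rpn-batch-rois', type=int, default=256,
                    help='RPN rois per batch (hypothetical flag)')
args = parser.parse_args()
print(args)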

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

All changes are limited to the example/rcnn folder. No new features are added.

  • Remove complex Cython, CUDA, and pycocotools code. Pure Python data processing is used instead.
  • Remove duplicate and unnecessary data processing code for two-stage Fast R-CNN training. Why still maintain two-stage Fast R-CNN when we have end-to-end training?
  • Remove shell scripts, since the demo, train, and test scripts can be invoked directly.
  • Reorganize the code into dataset loading, data processing, network construction, and user scripts.

Comments

Note that the data processing speed of the pure numpy code could be a limitation, especially for a crowded dataset like COCO. However, the numpy code is concise and accurate, and can serve as a reference for building better implementations.
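
As a taste of that pure numpy style, here is a minimal sketch of a pairwise box-overlap (IoU) routine of the kind that replaces the Cython version; the function name and the (x1, y1, x2, y2) box convention are illustrative, not the example's exact code.

import numpy as np

def bbox_overlaps(boxes, query_boxes):
    """Pairwise IoU between (x1, y1, x2, y2) boxes, in pure numpy."""
    n, k = boxes.shape[0], query_boxes.shape[0]
    overlaps = np.zeros((n, k), dtype=np.float32)
    areas = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1)
    for j in range(k):
        q = query_boxes[j]
        q_area = (q[2] - q[0] + 1) * (q[3] - q[1] + 1)
        # Intersection width/height against all boxes at once.
        iw = np.minimum(boxes[:, 2], q[2]) - np.maximum(boxes[:, 0], q[0]) + 1
        ih = np.minimum(boxes[:, 3], q[3]) - np.maximum(boxes[:, 1], q[1]) + 1
        inter = np.where((iw > 0) & (ih > 0), iw * ih, 0.0)
        overlaps[:, j] = inter / (areas + q_area - inter)
    return overlaps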

@ijkguo ijkguo requested a review from szha as a code owner June 23, 2018 02:27
@pengzhao-intel
Contributor

Do you mind providing a scoring script for performance testing, like benchmark_score.py with dummy data?
It would be very convenient for us :)
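
For reference, a dummy-data scoring loop in the spirit of benchmark_score.py could look like the sketch below; the network is a stand-in, and a real script would load the detection symbol and its trained parameters instead.

import time
import mxnet as mx

# Stand-in network; a real script would load the Faster R-CNN symbol.
data = mx.sym.var('data')
sym = mx.sym.Convolution(data, kernel=(3, 3), num_filter=64, pad=(1, 1))

mod = mx.mod.Module(sym, data_names=('data',), label_names=None)
mod.bind(data_shapes=[('data', (1, 3, 600, 800))], for_training=False)
mod.init_params()

# Score on random dummy data and report throughput.
batch = mx.io.DataBatch(data=[mx.nd.random.uniform(shape=(1, 3, 600, 800))])
tic = time.time()
for _ in range(100):
    mod.forward(batch)
mx.nd.waitall()  # wait for async compute before timing
print('%.1f samples/sec' % (100 / (time.time() - tic)))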

@ZiyueHuang
Member

Is the custom MutableModule still needed? I remember that it was used for variable-size input, but Module now supports data batches with different shapes. Please refer to https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/module/module.py#L572-L624. @ijkguo
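
For illustration, a minimal sketch of that behavior, assuming a fully convolutional stand-in network so parameter shapes do not depend on the input size:

import mxnet as mx

# Fully convolutional stand-in: its weights do not depend on the
# spatial size of the input.
data = mx.sym.var('data')
sym = mx.sym.Convolution(data, kernel=(3, 3), num_filter=8, pad=(1, 1))

mod = mx.mod.Module(sym, data_names=('data',), label_names=None)
mod.bind(data_shapes=[('data', (1, 3, 600, 800))], for_training=False)
mod.init_params()

# Module reshapes its buffers on the fly when a batch arrives with a
# shape different from the one it was bound with.
for shape in [(1, 3, 600, 800), (1, 3, 512, 640)]:
    mod.forward(mx.io.DataBatch(data=[mx.nd.zeros(shape)]))
    print(mod.get_outputs()[0].shape)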

@ijkguo
Contributor Author

ijkguo commented Jun 24, 2018

@ZiyueHuang Great updates to Module. Will update soon.

@pengzhao-intel The scope of this pull request is really to maintain the example code as a usable and accurate reference. Performance benchmarking, in addition to matching accuracy, will be part of the development of the new gluon-cv toolkit.

@pengzhao-intel
Contributor

Thanks for the information. It's fine for us :)

@ijkguo
Contributor Author

ijkguo commented Jun 26, 2018

@ZiyueHuang MutableModule is now gone.

@ijkguo
Contributor Author

ijkguo commented Jun 27, 2018

This conflicts with an open PR: #11013. This PR removes all Cython modules completely, so the Windows fix is no longer needed.

@chinakook
Contributor

chinakook commented Jul 5, 2018

Can we change the backbone network symbol definition to the gluon version (and then convert it to a symbol)?
It's simpler and more accurate.

@ijkguo
Contributor Author

ijkguo commented Jul 5, 2018

Could you please post a link to the gluon version of the backbone network definition?

@ijkguo
Contributor Author

ijkguo commented Jul 6, 2018

So you mean a gluon Block definition, gluon training, and gluon inference, which is basically a new example. Please see https://gluon-cv.mxnet.io/model_zoo/index.html#object-detection for that.
The purpose of this PR is to maintain this symbolic example as a usable reference and to resolve the issues mentioned above.

@chinakook
Contributor

Get a symbol as follows:

import mxnet as mx
import mxnet.gluon.model_zoo.vision as vision

# Load a pretrained gluon model and hybridize it so it can be called
# on a symbolic input.
net = vision.resnet50_v2(pretrained=True)
net.hybridize()

# Feeding a symbolic variable through the HybridBlock yields a symbol.
data = mx.sym.var('data')
sym = net(data)

@ijkguo
Contributor Author

ijkguo commented Jul 6, 2018

Thanks for the example. Unfortunately, the definition in gluon.model_zoo.vision is slightly different from the symbol model zoo (example/image-classification), which would break compatibility with all released models and thus require retraining them all.

To preserve the network definition, we could translate the current symbol definition to a gluon.Block and get a symbol as you suggested. However, the result would have no connection to gluon.model_zoo and would lose the connection to the symbol model zoo that it has now.

As mentioned above, dmlc/gluon-cv uses the gluon model zoo and trains models with the gluon API. So I think the better approach is to preserve the symbol definition, as in example/image-classification, and link to the pure gluon API implementation in dmlc/gluon-cv.
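
For context, a minimal sketch of why this matters, with a placeholder checkpoint prefix and epoch: released models ship as symbol-API checkpoints, which load directly as long as the symbol definition is preserved.

import mxnet as mx

# Placeholder prefix/epoch; a symbol-API checkpoint is a pair of files
# like model/resnet50-symbol.json and model/resnet50-0000.params.
sym, arg_params, aux_params = mx.model.load_checkpoint('model/resnet50', 0)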

- python3 compatible
- remove cuda operator, cython utility, pycocotools
- remove mutablemodule
- remove duplicate code
@wangzhegeek

I found it much slower than the old version, and the speed is not stable. I didn't change any parameters. Does anyone else have the same problem?
INFO:root:Epoch[0] Batch [20] Speed: 4.54 samples/sec RPNAcc=0.785101 RPNLogLoss=0.648121 RPNL1Loss=0.313158 RCNNAcc=0.812128 RCNNLogLoss=0.465562 RCNNL1Loss=2.561277
INFO:root:Epoch[0] Batch [40] Speed: 4.59 samples/sec RPNAcc=0.867065 RPNLogLoss=0.536473 RPNL1Loss=0.241447 RCNNAcc=0.832317 RCNNLogLoss=0.386772 RCNNL1Loss=2.580883
INFO:root:Epoch[0] Batch [60] Speed: 4.21 samples/sec RPNAcc=0.895782 RPNLogLoss=0.417571 RPNL1Loss=0.263271 RCNNAcc=0.838691 RCNNLogLoss=0.362099 RCNNL1Loss=2.577651
INFO:root:Epoch[0] Batch [80] Speed: 4.36 samples/sec RPNAcc=0.912193 RPNLogLoss=0.345723 RPNL1Loss=0.283261 RCNNAcc=0.846788 RCNNLogLoss=0.340564 RCNNL1Loss=2.602302
INFO:root:Epoch[0] Batch [100] Speed: 4.47 samples/sec RPNAcc=0.923360 RPNLogLoss=0.296611 RPNL1Loss=0.268125 RCNNAcc=0.857441 RCNNLogLoss=0.318911 RCNNL1Loss=2.584149
INFO:root:Epoch[0] Batch [120] Speed: 3.26 samples/sec RPNAcc=0.922157 RPNLogLoss=0.281377 RPNL1Loss=0.271422 RCNNAcc=0.852370 RCNNLogLoss=0.320729 RCNNL1Loss=2.568152
INFO:root:Epoch[0] Batch [140] Speed: 4.05 samples/sec RPNAcc=0.922641 RPNLogLoss=0.260944 RPNL1Loss=0.273159 RCNNAcc=0.856494 RCNNLogLoss=0.309234 RCNNL1Loss=2.548375
INFO:root:Epoch[0] Batch [160] Speed: 4.10 samples/sec RPNAcc=0.925493 RPNLogLoss=0.243349 RPNL1Loss=0.264450 RCNNAcc=0.859739 RCNNLogLoss=0.303794 RCNNL1Loss=2.547160
INFO:root:Epoch[0] Batch [180] Speed: 3.60 samples/sec RPNAcc=0.927458 RPNLogLoss=0.229557 RPNL1Loss=0.252770 RCNNAcc=0.859116 RCNNLogLoss=0.302227 RCNNL1Loss=2.534159
INFO:root:Epoch[0] Batch [200] Speed: 4.83 samples/sec RPNAcc=0.931191 RPNLogLoss=0.214644 RPNL1Loss=0.244530 RCNNAcc=0.865808 RCNNLogLoss=0.289673 RCNNL1Loss=2.486773
INFO:root:Epoch[0] Batch [220] Speed: 4.56 samples/sec RPNAcc=0.935750 RPNLogLoss=0.200768 RPNL1Loss=0.243567 RCNNAcc=0.871430 RCNNLogLoss=0.279419 RCNNL1Loss=2.471421
INFO:root:Epoch[0] Batch [240] Speed: 3.83 samples/sec RPNAcc=0.936934 RPNLogLoss=0.200039 RPNL1Loss=0.253596 RCNNAcc=0.865372 RCNNLogLoss=0.321236 RCNNL1Loss=2.468916
INFO:root:Epoch[0] Batch [260] Speed: 2.11 samples/sec RPNAcc=0.928309 RPNLogLoss=0.215350 RPNL1Loss=0.329817 RCNNAcc=0.860363 RCNNLogLoss=0.324518 RCNNL1Loss=2.454420
INFO:root:Epoch[0] Batch [280] Speed: 3.75 samples/sec RPNAcc=0.927166 RPNLogLoss=0.213307 RPNL1Loss=0.350316 RCNNAcc=0.861266 RCNNLogLoss=0.321035 RCNNL1Loss=2.443782
INFO:root:Epoch[0] Batch [300] Speed: 3.69 samples/sec RPNAcc=0.928488 RPNLogLoss=0.206919 RPNL1Loss=0.398218 RCNNAcc=0.861244 RCNNLogLoss=0.320578 RCNNL1Loss=2.437228
INFO:root:Epoch[0] Batch [320] Speed: 2.81 samples/sec RPNAcc=0.923032 RPNLogLoss=0.211846 RPNL1Loss=0.472409 RCNNAcc=0.863196 RCNNLogLoss=0.318522 RCNNL1Loss=2.403246
INFO:root:Epoch[0] Batch [340] Speed: 4.31 samples/sec RPNAcc=0.922865 RPNLogLoss=0.209912 RPNL1Loss=0.492010 RCNNAcc=0.866935 RCNNLogLoss=0.309043 RCNNL1Loss=2.405491
INFO:root:Epoch[0] Batch [360] Speed: 1.72 samples/sec RPNAcc=0.913796 RPNLogLoss=0.221485 RPNL1Loss=0.566435 RCNNAcc=0.863855 RCNNLogLoss=0.315080 RCNNL1Loss=2.389022
INFO:root:Epoch[0] Batch [380] Speed: 1.19 samples/sec RPNAcc=0.906069 RPNLogLoss=0.233500 RPNL1Loss=0.574996 RCNNAcc=0.857776 RCNNLogLoss=0.321402 RCNNL1Loss=2.370798
INFO:root:Epoch[0] Batch [400] Speed: 2.12 samples/sec RPNAcc=0.905679 RPNLogLoss=0.236498 RPNL1Loss=0.585648 RCNNAcc=0.858196 RCNNLogLoss=0.319997 RCNNL1Loss=2.343578
INFO:root:Epoch[0] Batch [420] Speed: 3.06 samples/sec RPNAcc=0.905055 RPNLogLoss=0.236932 RPNL1Loss=0.603299 RCNNAcc=0.861722 RCNNLogLoss=0.313066 RCNNL1Loss=2.327881
INFO:root:Epoch[0] Batch [440] Speed: 2.63 samples/sec RPNAcc=0.903932 RPNLogLoss=0.239366 RPNL1Loss=0.638583 RCNNAcc=0.863379 RCNNLogLoss=0.309315 RCNNL1Loss=2.309262
INFO:root:Epoch[0] Batch [460] Speed: 3.06 samples/sec RPNAcc=0.901677 RPNLogLoss=0.242633 RPNL1Loss=0.667797 RCNNAcc=0.864883 RCNNLogLoss=0.306521 RCNNL1Loss=2.303738
INFO:root:Epoch[0] Batch [480] Speed: 1.88 samples/sec RPNAcc=0.899889 RPNLogLoss=0.244968 RPNL1Loss=0.685555 RCNNAcc=0.868211 RCNNLogLoss=0.302564 RCNNL1Loss=2.253488
INFO:root:Epoch[0] Batch [500] Speed: 0.83 samples/sec RPNAcc=0.901091 RPNLogLoss=0.243805 RPNL1Loss=0.706072 RCNNAcc=0.870790 RCNNLogLoss=0.297232 RCNNL1Loss=2.199988
INFO:root:Epoch[0] Batch [520] Speed: 0.36 samples/sec RPNAcc=0.901085 RPNLogLoss=0.246106 RPNL1Loss=0.698609 RCNNAcc=0.873725 RCNNLogLoss=0.291516 RCNNL1Loss=2.106227
INFO:root:Epoch[0] Batch [540] Speed: 0.27 samples/sec RPNAcc=0.900652 RPNLogLoss=0.248750 RPNL1Loss=0.688534 RCNNAcc=0.874307 RCNNLogLoss=0.289432 RCNNL1Loss=2.034107
INFO:root:Epoch[0] Batch [560] Speed: 0.26 samples/sec RPNAcc=0.900248 RPNLogLoss=0.251010 RPNL1Loss=0.675872 RCNNAcc=0.874185 RCNNLogLoss=0.288821 RCNNL1Loss=1.979973
INFO:root:Epoch[0] Batch [580] Speed: 0.30 samples/sec RPNAcc=0.900159 RPNLogLoss=0.252489 RPNL1Loss=0.667776 RCNNAcc=0.874045 RCNNLogLoss=0.288114 RCNNL1Loss=1.946799
INFO:root:Epoch[0] Batch [600] Speed: 0.30 samples/sec RPNAcc=0.900122 RPNLogLoss=0.253738 RPNL1Loss=0.655861 RCNNAcc=0.874727 RCNNLogLoss=0.286391 RCNNL1Loss=1.919827
INFO:root:Epoch[0] Batch [620] Speed: 0.39 samples/sec RPNAcc=0.900299 RPNLogLoss=0.254222 RPNL1Loss=0.643061 RCNNAcc=0.876013 RCNNLogLoss=0.283613 RCNNL1Loss=1.894209
INFO:root:Epoch[0] Batch [640] Speed: 0.74 samples/sec RPNAcc=0.900762 RPNLogLoss=0.253900 RPNL1Loss=0.638863 RCNNAcc=0.877218 RCNNLogLoss=0.280870 RCNNL1Loss=1.873936
INFO:root:Epoch[0] Batch [660] Speed: 0.55 samples/sec RPNAcc=0.900775 RPNLogLoss=0.254636 RPNL1Loss=0.644612 RCNNAcc=0.877009 RCNNLogLoss=0.280893 RCNNL1Loss=1.845975
INFO:root:Epoch[0] Batch [680] Speed: 0.61 samples/sec RPNAcc=0.900931 RPNLogLoss=0.255214 RPNL1Loss=0.641204 RCNNAcc=0.878097 RCNNLogLoss=0.278319 RCNNL1Loss=1.812659
INFO:root:Epoch[0] Batch [700] Speed: 0.64 samples/sec RPNAcc=0.901258 RPNLogLoss=0.254986 RPNL1Loss=0.641988 RCNNAcc=0.880221 RCNNLogLoss=0.273679 RCNNL1Loss=1.764503
INFO:root:Epoch[0] Batch [720] Speed: 0.61 samples/sec RPNAcc=0.901873 RPNLogLoss=0.254332 RPNL1Loss=0.638303 RCNNAcc=0.882525 RCNNLogLoss=0.268834 RCNNL1Loss=1.710917
INFO:root:Epoch[0] Batch [740] Speed: 0.50 samples/sec RPNAcc=0.902021 RPNLogLoss=0.254266 RPNL1Loss=0.633941 RCNNAcc=0.885016 RCNNLogLoss=0.263351 RCNNL1Loss=1.660883
INFO:root:Epoch[0] Batch [760] Speed: 0.48 samples/sec RPNAcc=0.902311 RPNLogLoss=0.253982 RPNL1Loss=0.626943 RCNNAcc=0.887448 RCNNLogLoss=0.257938 RCNNL1Loss=1.620504
INFO:root:Epoch[0] Batch [780] Speed: 0.64 samples/sec RPNAcc=0.902429 RPNLogLoss=0.254169 RPNL1Loss=0.627015 RCNNAcc=0.889085 RCNNLogLoss=0.254580 RCNNL1Loss=1.591067
INFO:root:Epoch[0] Batch [800] Speed: 0.55 samples/sec RPNAcc=0.902554 RPNLogLoss=0.254300 RPNL1Loss=0.630958 RCNNAcc=0.889372 RCNNLogLoss=0.253660 RCNNL1Loss=1.574188
INFO:root:Epoch[0] Batch [820] Speed: 0.49 samples/sec RPNAcc=0.902548 RPNLogLoss=0.254546 RPNL1Loss=0.635334 RCNNAcc=0.890006 RCNNLogLoss=0.252298 RCNNL1Loss=1.557188
INFO:root:Epoch[0] Batch [840] Speed: 0.64 samples/sec RPNAcc=0.903094 RPNLogLoss=0.253755 RPNL1Loss=0.632950 RCNNAcc=0.891359 RCNNLogLoss=0.249445 RCNNL1Loss=1.547463
INFO:root:Epoch[0] Batch [860] Speed: 3.98 samples/sec RPNAcc=0.903791 RPNLogLoss=0.253039 RPNL1Loss=0.632308 RCNNAcc=0.890149 RCNNLogLoss=0.257711 RCNNL1Loss=1.561753
INFO:root:Epoch[0] Batch [880] Speed: 4.86 samples/sec RPNAcc=0.904745 RPNLogLoss=0.250062 RPNL1Loss=0.630511 RCNNAcc=0.890767 RCNNLogLoss=0.255843 RCNNL1Loss=1.574943
INFO:root:Epoch[0] Batch [900] Speed: 4.24 samples/sec RPNAcc=0.905877 RPNLogLoss=0.246774 RPNL1Loss=0.628148 RCNNAcc=0.891800 RCNNLogLoss=0.253487 RCNNL1Loss=1.587352
INFO:root:Epoch[0] Batch [920] Speed: 4.73 samples/sec RPNAcc=0.907210 RPNLogLoss=0.242976 RPNL1Loss=0.626454 RCNNAcc=0.893004 RCNNLogLoss=0.250774 RCNNL1Loss=1.598805
INFO:root:Epoch[0] Batch [940] Speed: 4.35 samples/sec RPNAcc=0.908581 RPNLogLoss=0.239367 RPNL1Loss=0.623204 RCNNAcc=0.894415 RCNNLogLoss=0.247416 RCNNL1Loss=1.610573
INFO:root:Epoch[0] Batch [960] Speed: 3.77 samples/sec RPNAcc=0.909108 RPNLogLoss=0.236802 RPNL1Loss=0.620865 RCNNAcc=0.894694 RCNNLogLoss=0.246824 RCNNL1Loss=1.623527
INFO:root:Epoch[0] Batch [980] Speed: 3.75 samples/sec RPNAcc=0.909375 RPNLogLoss=0.234438 RPNL1Loss=0.618544 RCNNAcc=0.895248 RCNNLogLoss=0.245278 RCNNL1Loss=1.632997
INFO:root:Epoch[0] Batch [1000] Speed: 4.33 samples/sec RPNAcc=0.910350 RPNLogLoss=0.231555 RPNL1Loss=0.615661 RCNNAcc=0.896198 RCNNLogLoss=0.243115 RCNNL1Loss=1.643480
INFO:root:Epoch[0] Batch [1020] Speed: 3.65 samples/sec RPNAcc=0.911275 RPNLogLoss=0.229012 RPNL1Loss=0.612912 RCNNAcc=0.896972 RCNNLogLoss=0.241464 RCNNL1Loss=1.654004
INFO:root:Epoch[0] Batch [1040] Speed: 4.40 samples/sec RPNAcc=0.912371 RPNLogLoss=0.225913 RPNL1Loss=0.609690 RCNNAcc=0.898115 RCNNLogLoss=0.238919 RCNNL1Loss=1.658991

@ijkguo
Contributor Author

ijkguo commented Jul 19, 2018

It could be caused by varying CPU and disk workload. Training speed is bottlenecked by single-threaded Python data loading.
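
One possible mitigation, sketched under the assumption that the loader is an ordinary DataIter (train_iter below is a placeholder), is to wrap it in mx.io.PrefetchingIter so that loading overlaps with computation:

import mxnet as mx

# Placeholder iterator standing in for the example's actual data loader.
train_iter = mx.io.NDArrayIter(mx.nd.zeros((8, 3, 224, 224)), batch_size=2)

# PrefetchingIter runs the wrapped iterator in a background thread, so
# the next batch is being loaded while the current one is computed.
prefetched = mx.io.PrefetchingIter(train_iter)
for batch in prefetched:
    pass  # feed the batch to the module as usual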

@ijkguo
Contributor Author

ijkguo commented Jul 19, 2018

This example was heavily criticized for its complicated engineering optimizations. This PR is meant to improve clarity and simplicity.

To start simple, use this example. To achieve state-of-the-art results, check out other MXNet research implementations based on the previous, heavily engineered version, which provide better performance. To get both simplicity and performance, stay tuned for https://github.com/dmlc/gluon-cv.

@pengzhao-intel
Contributor

Did you try the MKL-DNN backend?

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018