[MXNET-1333] Estimator and Fit API #14629

roywei · 2019-04-05T16:55:37Z

Description

This PR introduces an Estimator class in contrib with an easy-to-use fit method to help beginners with model training.
It has been developed on a branch, and we hope to merge it into contrib and gather feedback on this first iteration.

Design: https://cwiki.apache.org/confluence/display/MXNET/Gluon+Fit+API+-+Tech+Design
JIRA epics: https://issues.apache.org/jira/browse/MXNET-1333
Dev list discussion: https://lists.apache.org/thread.html/13e3dee0fc9dd8e45b6616f97d282096a1ee67cde78a93dada295577@%3Cdev.mxnet.apache.org%3E
Feedback: all feedback so far is captured in the cwiki comment section. We have created a JIRA issue for each item and will continue to work on them.
Follow-up PRs:
We currently have the following PRs to address feedback; we will create more and track them using JIRA issues.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with MXNET-1333
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Estimator class for fit and evaluate
EventHandler class for callbacks in fit methods
Unit tests
Integration/nightly tests
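
To make the relationship between the two classes concrete, here is a minimal pure-Python sketch of a fit loop that calls event handlers. It is not the code in this PR: `ToyEstimator`, `LossLogger`, the hook names, and the hook signatures are all assumptions chosen for illustration, and the "model" is a single weight fitted by gradient descent.

```python
class EventHandler:
    """Base class with no-op hooks (hook names are assumptions for this sketch)."""

    def train_begin(self, estimator):
        pass

    def epoch_end(self, estimator):
        pass

    def train_end(self, estimator):
        pass


class LossLogger(EventHandler):
    """Records the mean training loss after every epoch."""

    def __init__(self):
        self.history = []

    def epoch_end(self, estimator):
        self.history.append(estimator.epoch_loss)


class ToyEstimator:
    """Fits y = w * x with per-sample gradient descent on squared error."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.lr = lr
        self.epoch_loss = None

    def evaluate(self, data):
        # Mean squared error over the whole dataset.
        return sum((self.w * x - y) ** 2 for x, y in data) / len(data)

    def fit(self, train_data, epochs, event_handlers=()):
        for handler in event_handlers:
            handler.train_begin(self)
        for _ in range(epochs):
            for x, y in train_data:
                grad = 2 * (self.w * x - y) * x
                self.w -= self.lr * grad
            self.epoch_loss = self.evaluate(train_data)
            for handler in event_handlers:
                handler.epoch_end(self)
        for handler in event_handlers:
            handler.train_end(self)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
logger = LossLogger()
est = ToyEstimator()
est.fit(data, epochs=20, event_handlers=[logger])
```

The user only writes the handler; the loop itself stays inside `fit`, which is the convenience the description is aiming for.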

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

marcoabreu · 2019-04-05T17:34:03Z

Would it be possible to make these tests part of the training test suite instead of introducing new jobs?

roywei · 2019-04-05T18:00:27Z

@marcoabreu are you referring to the training test suite here? We want to test this only on nightly. Please point me to the correct test suite if I'm wrong. Thanks!

nswamy · 2019-04-05T20:10:17Z

I suggest moving it to contrib, as there is some feedback from Mu on dev@. We could also gather feedback from users to see what other changes are required.

Could you please break all the backlog items into JIRA tasks and paste the master ticket into this PR? Any contributor interested in contributing further could pick up those tasks.

roywei · 2019-04-05T20:18:57Z

@nswamy done, all JIRA tickets have a detailed description and a reference to the feedback (either from cwiki or the dev list discussion).

szha · 2019-04-05T21:04:25Z

python/mxnet/gluon/contrib/estimator/event_handler.py

+import numpy as np
+
+
+class EventHandler(object):


Please break this down into different event classes and use the same approach as Gluon's forward hook through weakref. This has two benefits:

People can mix what they need into a unified handler through multiple inheritance, without the unnecessary pass calls.

A handler can be detached at will.
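
As a rough illustration of the mixin idea in this suggestion, the sketch below defines one small class per event and groups handlers with `isinstance()` once, before the loop starts. The class names (`TrainBegin`, `EpochEnd`, `TrainEnd`), the `categorize` helper, and the hook signatures are assumptions for this example, not the API in this PR.

```python
class TrainBegin:
    def train_begin(self, estimator):
        raise NotImplementedError


class EpochEnd:
    def epoch_end(self, estimator):
        raise NotImplementedError


class TrainEnd:
    def train_end(self, estimator):
        raise NotImplementedError


class EpochCounter(EpochEnd):
    """Implements a single event, so it carries no empty methods."""

    def __init__(self):
        self.epochs_seen = 0

    def epoch_end(self, estimator):
        self.epochs_seen += 1


class Banner(TrainBegin, TrainEnd):
    """Mixes two events into one handler through multiple inheritance."""

    def __init__(self):
        self.messages = []

    def train_begin(self, estimator):
        self.messages.append("start")

    def train_end(self, estimator):
        self.messages.append("done")


def categorize(handlers):
    # Group handlers once so the loop only calls events that are implemented.
    return {
        "train_begin": [h for h in handlers if isinstance(h, TrainBegin)],
        "epoch_end": [h for h in handlers if isinstance(h, EpochEnd)],
        "train_end": [h for h in handlers if isinstance(h, TrainEnd)],
    }


def run(handlers, epochs):
    groups = categorize(handlers)
    for h in groups["train_begin"]:
        h.train_begin(None)  # no real estimator in this sketch
    for _ in range(epochs):
        for h in groups["epoch_end"]:
            h.epoch_end(None)
    for h in groups["train_end"]:
        h.train_end(None)


counter, banner = EpochCounter(), Banner()
run([counter, banner], epochs=3)
```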

I added a method to check whether a handler has implemented train_begin etc., to avoid unnecessary pass calls. The time it takes should be the same as using multiple inheritance and doing a bunch of isinstance() checks at the beginning. The benefit is that the user can override any event call without inheriting from multiple classes.

Still looking into how to use forward hooks to register different input args.

Thanks!
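
One way such a check could work, sketched under assumptions: keep a single base class with no-op methods and treat a handler as implementing an event only if its class overrides the base method. The `overrides` and `categorize` helpers and the event names below are hypothetical and are not taken from the PR's code.

```python
class EventHandler:
    """Single parent class; every event defaults to a no-op."""

    def train_begin(self):
        pass

    def epoch_end(self):
        pass

    def train_end(self):
        pass


EVENTS = ("train_begin", "epoch_end", "train_end")


def overrides(handler, event):
    # True only when the handler's class replaced the base-class method.
    return getattr(type(handler), event) is not getattr(EventHandler, event)


def categorize(handlers):
    # Done once before training, so empty methods are never called in the loop.
    return {event: [h for h in handlers if overrides(h, event)] for event in EVENTS}


class Checkpoint(EventHandler):
    def __init__(self):
        self.saved = 0

    def epoch_end(self):
        self.saved += 1


ckpt = Checkpoint()
groups = categorize([ckpt])
for _ in range(2):
    for h in groups["epoch_end"]:
        h.epoch_end()
```

Compared with one mixin per event, the user inherits from one class only, at the cost of a slightly less obvious dispatch rule.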

szha

Thanks for the contribution. See above. More comments to come.

piyushghai · 2019-04-05T21:50:41Z

Thanks for your contributions @roywei.
This is a very useful API for users and it simplifies a lot of pain points.

@mxnet-label-bot Add [Gluon, pr-awaiting-review]

roywei · 2019-04-05T22:14:14Z

Thanks @szha for the feedback! I'm noting the summary of the offline discussion here:

1. Currently event handlers have access to the entire estimator. The estimator has to maintain different stats/states (e.g. current epoch, number of steps, metrics) and do the bookkeeping to ensure event handlers have them when they are called.
2. If a user wants to add an event handler that needs some info the estimator does not have, they have to add it to the estimator and change the estimator code to do the bookkeeping. They also have to know that the current epoch is accessed as estimator.current_epoch, not some other variable name.
3. Separate event handlers into 6 classes (train begin class, train end class, etc.) instead of a single EventHandler parent class. Each class maintains its own info/state and knows what args to pass when called, so the estimator does not do the state management.
4. Use the Gluon forward hook approach to provide the ability to detach an event handler (e.g. remove some handler after n epochs/steps).
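
The last two points above (handlers that own their state, and handlers that can be detached) can be sketched together. This is an illustration only: `HandlerHandle`, `ToyEstimator`, `register_epoch_end`, and `EpochCounter` are hypothetical names, and the handle merely imitates the general idea of a hook handle that holds a weak reference to its owner.

```python
import weakref
from collections import OrderedDict


class HandlerHandle:
    """Returned on registration; detach() removes the handler again."""

    def __init__(self, owner, key):
        # Weak reference, so holding a handle does not keep the owner alive.
        self._owner_ref = weakref.ref(owner)
        self._key = key

    def detach(self):
        owner = self._owner_ref()
        if owner is not None:
            owner._epoch_end_handlers.pop(self._key, None)


class ToyEstimator:
    def __init__(self):
        self._epoch_end_handlers = OrderedDict()
        self._next_key = 0

    def register_epoch_end(self, handler):
        key = self._next_key
        self._next_key += 1
        self._epoch_end_handlers[key] = handler
        return HandlerHandle(self, key)

    def fit(self, epochs):
        for _ in range(epochs):
            # Iterate over a copy so a handler may detach itself mid-loop.
            for handler in list(self._epoch_end_handlers.values()):
                handler()


class EpochCounter:
    """Keeps its own epoch count instead of reading it from the estimator."""

    def __init__(self, detach_after=None):
        self.current_epoch = 0
        self.detach_after = detach_after
        self.handle = None

    def __call__(self):
        self.current_epoch += 1
        if self.handle is not None and self.current_epoch == self.detach_after:
            self.handle.detach()


est = ToyEstimator()
short_lived = EpochCounter(detach_after=2)
short_lived.handle = est.register_epoch_end(short_lived)
always_on = EpochCounter()
always_on.handle = est.register_epoch_end(always_on)
est.fit(epochs=5)
```

Here the estimator does no bookkeeping for the handlers, and `short_lived` stops receiving events after its second epoch.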

nswamy · 2019-04-05T22:23:37Z

I am not sure I understand the concerns,

What is the problem with 1) ?

For 2), the user can create a custom event handler that takes whatever objects it needs to keep track of.

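class MyEventHandler(EventHandler):
    def __init__(self, whatever1, whatever2):
        # keep whatever objects the handler needs to track
        self.whatever1 = whatever1
        self.whatever2 = whatever2

    def train_begin(self):
        # do whatever using self.whatever1
        pass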

Given that the fit API is targeted at novice users, I think 3) is going to make it unnecessarily cumbersome.

What is the benefit of using the Forward Hook approach?

nswamy · 2019-04-05T22:37:18Z

Wouldn't it be easier for the user to write the training loop themselves if they want more control, instead of having the loop split into 6 or more methods or hooks?
In my opinion, trying to make it more flexible would complicate the usage and add to the user's cognitive load.

nswamy

blocking the PR until my questions are answered.

I would like to understand why and how Sheng's proposal is better than the current design which was discussed offline and surfaced on dev@ months ago.
Last minute requests to fundamentally change the design should have very strong reasons.

roywei · 2019-04-17T22:53:16Z

@nswamy @szha I have addressed the concerns on callback in this doc: https://cwiki.apache.org/confluence/display/MXNET/Callback+Design+for+Fit+Loop
and created a PR here: #14685

Please help take a look, thanks!

roywei · 2019-05-06T01:04:52Z

@eric-haibin-lin @pinaraws @szha Thanks for the review, I addressed the comments here: https://github.com/apache/incubator-mxnet/pull/14885/files

eric-haibin-lin · 2019-05-15T05:31:51Z

My comments are addressed. Great work!!


roywei · 2019-05-18T01:39:55Z

@szha @eric-haibin-lin CI finally passed; the validation/miscellaneous job status was reported incorrectly (passed instead of pending). Could you help merge if it looks good?

szha · 2019-05-18T18:04:39Z

Nice work! Great job upholding the quality even at the cost of several iterations. Well done!

This reverts commit 9f451fb.

* [MXNet-1334][Fit API]base class for estimator and eventhandler (apache#14346)
  * base class for estimator and eventhandler
  * add license
  * add event handlers
  * fix pylint
  * improve arg check
  * fix pylint
  * add unit tests
* Fixed issue where the estimator was printing beyond the dataset size … (apache#14464)
  * Fixed issue where the estimator was printing beyond the dataset size for the last batch
  * Added comments
  * Nudge to CI
* [MXNet-1349][Fit API]Add validation support and unit tests for fit() API (apache#14442)
  * added estimator unittests
  * add more tests for estimator
  * added validation logic
  * added error handlers, unittests
  * improve val stats
  * fix pylint
  * fix pylint
  * update unit test
  * fix tests
  * fix tests
  * updated metrics, val logic
  * trigger ci
  * trigger ci
  * update metric, batch_fn error handler
  * update context logic, add default metric
* [MXNet-1340][Fit API]Update train stats (apache#14494)
  * add train history
  * update history
  * update test
  * avoid calling empty methods
  * remove train history object
  * fix pylint
  * add unit test
  * fix test
  * update categorize handlers
* [MXNet-1375][Fit API]Added RNN integration test for fit() API (apache#14547)
  * Added RNN integration test for fit() API
  * Addressed review comments: change in JenkinFile, tmp directory, ctx with condense if/else, renamed imports
  * CPU test doesn't require nvidiadocker container
  * Modified the structure by removing the redundant code
* [MXNet-1343][Fit API]Add CNN integration test for fit() API (apache#14405)
  * added cnn intg tests for fit api
  * updated cnn intg tests
  * added functions for nightly test
  * updated runtime_function
  * updated intg tests
  * updated init, datapath, refs
  * added validation data
  * update cpu test
  * refactor code
  * updated context
* [MXNET-1344, 1346][FIT API] Retrieve Batch size and Logging verbose support for Gluon fit() API (apache#14587)
  * Retrieve Batch size and Logging verbose support for Gluon fit() API
  * NIT changes
  * Addressed review comments: shifted the batch size code to a separate method, sentence correction
  * Modified unittest
  * removed redundant parameter
  * Resolve CI test failure
  * only support DataLoader for now, future PRs will include DataIter to DataLoader converter
  * Get the number of samples from shape attribute instead of length due to low space complexity
  * Simplified batch size retrieval code
  * removed batch_size parameter from fit() method and fixed the tests
  * Verbose exception handling
  * Assigning constant to a verbose
  * Modified exception message
  * Resolved undefined class reference
  * Addressed review comments: Modified verbose level names, docs, variable names
  * Update estimator.py
* move estimator to contrib (apache#14633)
* move to gluon contrib (apache#14635)
* [Fit API] improve event handlers (apache#14685)
  * improve event handlers
  * update tests
  * passing weakref of estimator
  * fix unit test
  * fix test
  * fix pylint
  * fix test
  * fix pylint
  * move default metric logic
  * combine nightly tests
* [MXNET-1396][Fit-API] Update default handler logic (apache#14765)
  * move to nightly for binaries
  * update default handler
  * fix pylint
  * trigger ci
  * trigger ci
* [Fit API] update estimator (apache#14849)
  * address comments
  * add comment
  * check available context
  * fix bug
  * change cpu check
* [Fit-API] Adress PR comments (apache#14885)
  * address comments
  * update checkpoint
  * test symbol save
  * address comments
  * add resume
  * update doc and resume checkpoint
  * update docs
  * trigger ci
  * trigger ci


roywei requested review from eric-haibin-lin, marcoabreu and szha as code owners April 5, 2019 16:55

roywei requested review from anirudh2290, gigasquid, iblislin, nswamy, sergeykolychev and yzhliu as code owners April 5, 2019 18:17

nswamy force-pushed the fit-api branch 2 times, most recently from c2e2f80 to d76234b Compare April 5, 2019 18:47

nswamy removed request for gigasquid, yzhliu, iblislin, anirudh2290 and sergeykolychev April 5, 2019 18:49

szha reviewed Apr 5, 2019


szha suggested changes Apr 5, 2019


marcoabreu added the Gluon and pr-awaiting-review (PR is waiting for code review) labels Apr 5, 2019

nswamy previously requested changes Apr 12, 2019


roywei mentioned this pull request May 6, 2019

[Fit-API] Adress PR comments #14885

Merged

7 tasks

roywei and others added 12 commits May 15, 2019 13:47

[MXNet-1334][Fit API]base class for estimator and eventhandler (#14346)

1b4e604

* base class for estimator and eventhandler * add license * add event handlers * fix pylint * improve arg check * fix pylint * add unit tests

Fixed issue where the estimator was printing beyond the dataset size … (#14464)

5b1eb20

* Fixed issue where the estimator was printing beyond the dataset size for the last batch * Added comments * Nudge to CI

[MXNet-1340][Fit API]Update train stats (#14494)

92c3c21

* add train history * update history * update test * avoid calling empty methods * remove train history object * fix pylint * add unit test * fix test * update categorize handlers

move estimator to contrib (#14633)

768470e

move to gluon contrib (#14635)

6c455ef

[Fit API] improve event handlers (#14685)

900b449

* improve event handlers * update tests * passing weakref of estimator * fix unit test * fix test * fix pylint * fix test * fix pylint * move default metric logic * combine nightly tests

[MXNET-1396][Fit-API] Update default handler logic (#14765)

5ac7751

* move to nightly for binaries * update default handler * fix pylint * trigger ci * trigger ci

[Fit API] update estimator (#14849)

d57a712

* address comments * add comment * check available context * fix bug * change cpu check

eric-haibin-lin force-pushed the fit-api branch from 588730c to 3b17837 Compare May 15, 2019 20:48

[Fit-API] Adress PR comments (#14885)

5c34df3

* address comments * update checkpoint * test symbol save * address comments * add resume * update doc and resume checkpoint * update docs * trigger ci * trigger ci

eric-haibin-lin force-pushed the fit-api branch from 3b17837 to 5c34df3 Compare May 17, 2019 16:37

szha approved these changes May 18, 2019


szha merged commit 9f451fb into master May 18, 2019

eric-haibin-lin mentioned this pull request May 18, 2019

Checkpoint management dmlc/gluon-nlp#305

Closed

roywei added a commit to roywei/incubator-mxnet that referenced this pull request May 20, 2019

Revert "[MXNET-1333] Estimator and Fit API (apache#14629)"

f99ba9b

This reverts commit 9f451fb.

roywei mentioned this pull request May 20, 2019

Revert "[MXNET-1333] Estimator and Fit API" #15008

Merged

eric-haibin-lin pushed a commit that referenced this pull request May 20, 2019

Revert "[MXNET-1333] Estimator and Fit API (#14629)" (#15008)

aac3cdb

This reverts commit 9f451fb.

roywei mentioned this pull request May 20, 2019

Fit api #15009

Merged

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

Revert "[MXNET-1333] Estimator and Fit API (apache#14629)" (apache#15008)

ca45029

This reverts commit 9f451fb.

szha deleted the fit-api branch September 8, 2019 03:40
