This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Correct update count with Gluon trainer and update_on_kvstore=False #14377

Merged 4 commits on Mar 17, 2019

Conversation

Member

@ptrendx ptrendx commented Mar 9, 2019

Description

This PR fixes the update count when a Gluon Trainer is created with multiple devices and update_on_kvstore=False.
Fixes #13752, fixes #12713

@eric-haibin-lin This is slightly hacky but should work for all cases. Thoughts?

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • The real fix here would be to move _index_update_count from the optimizer (which is shared by all updaters) to the Updater (which is per device), but that would require a breaking API change. FYI @szha
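
To illustrate the counting problem described above, here is a minimal toy model in plain Python. It is a sketch under stated assumptions, not MXNet's actual code: the classes `SharedCountOptimizer` and `PerDeviceUpdater` and the helper `run` are hypothetical names made up for this example. The assumption is that, with update_on_kvstore=False, every device's updater triggers one count increment on the same shared optimizer for the same parameter index.

```python
from collections import defaultdict


class SharedCountOptimizer:
    """Toy optimizer (hypothetical): one counter per parameter index,
    shared by the updaters of all devices."""

    def __init__(self):
        self.index_update_count = defaultdict(int)
        self.num_update = 0

    def update_count(self, index):
        # Every call bumps the shared counter, regardless of which device calls.
        self.index_update_count[index] += 1
        self.num_update = max(self.num_update, self.index_update_count[index])


class PerDeviceUpdater:
    """Toy updater (hypothetical): keeps its own counter, one instance per device."""

    def __init__(self):
        self.index_update_count = defaultdict(int)

    def update_count(self, index):
        self.index_update_count[index] += 1


def run(num_devices, num_steps, index=0):
    """Simulate num_steps training steps of one parameter on num_devices devices."""
    shared = SharedCountOptimizer()
    updaters = [PerDeviceUpdater() for _ in range(num_devices)]
    for _ in range(num_steps):
        for updater in updaters:
            shared.update_count(index)   # shared counter: incremented once per device
            updater.update_count(index)  # per-device counter: incremented once per step
    return shared.num_update, [u.index_update_count[index] for u in updaters]


shared_count, per_device_counts = run(num_devices=2, num_steps=3)
print(shared_count)       # 6: the shared counter advanced twice per step
print(per_device_counts)  # [3, 3]: each per-device counter matches the step count
```

In this toy model, anything derived from the shared count (for example a learning rate schedule, or a step-dependent correction term in an optimizer such as Adam) would see the step number grow in proportion to the number of devices, whereas a per-device counter stays equal to the number of training steps.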

@karan6181
Contributor

@mxnet-label-bot add [Bug, Gluon, Optimizer, pr-awaiting-review]

@karan6181
Contributor

@ptrendx Thank you for the contribution!

@eric-haibin-lin Can you please review this PR?

Member

@eric-haibin-lin eric-haibin-lin left a comment

@eric-haibin-lin eric-haibin-lin merged commit 63ed258 into apache:master Mar 17, 2019
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
…pache#14377)

* LRScheduler with update_on_kvstore=False

* Cleaning trainer.py

* Retrigger CI

* Fixes from review
nswamy pushed a commit that referenced this pull request Apr 5, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
Development

Successfully merging this pull request may close these issues.

  • Adam, AdaMax and FTML cannot be used with Trainer(update_on_kv=False)
  • distributed kvstore bug in MXNet
5 participants