
AdamW operator (Fixing Weight Decay Regularization in Adam) #13728

Merged
eric-haibin-lin merged 3 commits into apache:master on Dec 28, 2018

Conversation

eric-haibin-lin (Member) commented on Dec 25, 2018

Description

Implement a modification of Adam as described in "Fixing Weight Decay Regularization in Adam" (https://arxiv.org/abs/1711.05101).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
    • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
    • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
    • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
    • For user-facing API changes, the API doc string has been updated.
    • For new C++ functions in header files, their functionality and arguments are documented.
    • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

eric-haibin-lin (Member, Author) commented:

@sxjscience @szhengac could you guys help review this PR?

eric-haibin-lin changed the title from "[WIP] AdamW optimizer" to "AdamW optimizer (Fixing Weight Decay Regularization in Adam)" on Dec 25, 2018
rescaled_grad = clip(grad * rescale_grad, clip_gradient)
m = beta1 * m + (1 - beta1) * rescaled_grad
v = beta2 * v + (1 - beta2) * (rescaled_grad**2)
w = w - learning_rate * (m / (sqrt(v) + epsilon) + wd * w)
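
For reference, here is a minimal NumPy sketch of a single update step with decoupled weight decay, mirroring the pseudocode above; the function and argument names are illustrative only and are not the operator's API:

import numpy as np

def adamw_step(w, grad, m, v, learning_rate=1e-3, wd=0.01,
               beta1=0.9, beta2=0.999, epsilon=1e-8,
               rescale_grad=1.0, clip_gradient=None):
    # rescale and optionally clip the incoming gradient
    g = grad * rescale_grad
    if clip_gradient is not None:
        g = np.clip(g, -clip_gradient, clip_gradient)
    # exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # decoupled weight decay: wd * w is applied directly to the weight
    # instead of being added to the gradient before the moment updates
    w = w - learning_rate * (m / (np.sqrt(v) + epsilon) + wd * w)
    return w, m, v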
A Contributor commented:
According to the paper, there are two learning rates: an alpha should appear before m / (sqrt(v) + epsilon).

eric-haibin-lin (Member, Author) replied on Dec 26, 2018:
Good point. The issue is that the learning rate and the schedule multiplier are not decoupled in MXNet. Here learning_rate is effectively eta_t * alpha from the paper, and wd actually needs to be set to wd / alpha. In other words, wd can be rescaled so that the update does exactly the same thing as in the paper. Would this be acceptable? If so, maybe I can move this to contrib for the moment.
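
To make the rescaling concrete, a tiny sanity check with hypothetical numbers (none of these values come from the PR):

# alpha is the paper's base step size, eta_t its schedule multiplier,
# lam its decoupled weight decay; all values here are made up
alpha = 0.001
eta_t = 0.5
lam = 0.01

learning_rate = eta_t * alpha   # what MXNet's optimizer passes as lr
wd = lam / alpha                # rescaled weight decay passed to the op

# the decay applied per step then matches the paper: eta_t * lam * w
assert abs(learning_rate * wd - eta_t * lam) < 1e-12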

A Member replied:
I think it's acceptable as long as the wd is set correctly.

eric-haibin-lin (Member, Author) replied:
On second thought, I think it's better to keep it consistent with the paper.

sandeep-krishnamurthy (Contributor) left a comment:
Thanks.
Can you please provide the output of an end-to-end use case using the AdamW optimizer?

sandeep-krishnamurthy added the Optimizer and pr-awaiting-review labels on Dec 26, 2018
eric-haibin-lin (Member, Author) commented:

@sandeep-krishnamurthy Training/fine-tuning the BERT model in GluonNLP would be a use case for AdamW.

kwargs['clip_gradient'] = self.clip_gradient  # forward the clipping threshold to the operator

mean, var = state  # Adam's first and second moment estimates
adamw_update(weight, grad, mean, var, out=weight, lr=lr, wd=wd, **kwargs)
A Member commented:
Should we set wd to something like wd / self._original_lr?

eric-haibin-lin (Member, Author) commented:

@sxjscience @szhengac I took a step back, moved the operator to contrib, and used the same notation as in the paper. I think the optimizer API still needs more discussion, so I removed it from this PR.
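
A rough usage sketch of the resulting contrib operator, assuming it is exposed as mx.nd.contrib.adamw_update and takes the paper's schedule multiplier as a separate eta argument; the parameter names here are assumptions and should be checked against the merged API:

import mxnet as mx

# toy tensors, for illustration only
weight = mx.nd.ones((10,))
grad = mx.nd.ones((10,)) * 0.1
mean = mx.nd.zeros((10,))   # first moment state (m)
var = mx.nd.zeros((10,))    # second moment state (v)

# lr plays the role of alpha in the paper, eta the schedule multiplier eta_t,
# wd the decoupled weight decay lambda (names assumed, not confirmed here)
mx.nd.contrib.adamw_update(weight, grad, mean, var, out=weight,
                           lr=1e-3, eta=1.0, wd=0.01,
                           beta1=0.9, beta2=0.999, epsilon=1e-8)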

eric-haibin-lin changed the title from "AdamW optimizer (Fixing Weight Decay Regularization in Adam)" to "AdamW operator (Fixing Weight Decay Regularization in Adam)" on Dec 27, 2018
eric-haibin-lin merged commit 116d01e into apache:master on Dec 28, 2018
rondogency pushed a commit to rondogency/incubator-mxnet that referenced this pull request Jan 9, 2019
…3728)

* tests

* remove optimizer and move op to contrib

* rename parameter
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…3728)

* tests

* remove optimizer and move op to contrib

* rename parameter