This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fixes NAG optimizer #15543 #16053

Merged
merged 3 commits
Sep 11, 2019

Conversation

anirudhacharya
Member

@anirudhacharya anirudhacharya commented Aug 31, 2019

Description

Fixes #15543

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • fix update rule

For review - @zhanghang1989 @apeforest @eric-haibin-lin

@zhanghang1989
Contributor


mom = state
mom[:] *= self.momentum
weight[:] += lr * self.momentum * mom
weight[:] -= lr * (1 + self.momentum) * grad
mom[:] -= grad
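
For concreteness, here is a runnable NumPy sketch of this suggested update (the values of `lr` and `momentum` below are hypothetical, chosen only for illustration):

```python
import numpy as np

def nag_update(weight, mom, grad, lr, momentum):
    # Scale the momentum state, then take the Nesterov look-ahead step.
    # Net effect: weight += lr * (momentum * mom_new - grad),
    # where mom_new = momentum * mom - grad.
    mom *= momentum
    weight += lr * momentum * mom
    weight -= lr * (1 + momentum) * grad
    mom -= grad
    return weight, mom

w, m, g = np.array([1.0]), np.array([0.5]), np.array([0.2])
nag_update(w, m, g, lr=0.1, momentum=0.9)
print(w, m)  # w ≈ [1.0025], m == [0.25]
```

This matches the Keras-style NAG form (v = momentum * v - grad; weight += lr * (momentum * v - grad)) step for step.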

@wkcn
Member

wkcn commented Sep 4, 2019

Hi @zhanghang1989, is there any difference between a -= b and a[:] -= b?
a[:] -= b may call the extra function __getitem__.

import mxnet as mx
import time

T = 1000
N = 1000


while 1:
    ti = time.time()
    a = mx.nd.arange(N)
    for i in range(T):
        a += 1
    mx.nd.waitall()
    print('a += b: ', time.time() - ti)

    ti = time.time()
    a = mx.nd.arange(N)
    for i in range(T):
        a[:] += 1
    mx.nd.waitall()
    print('a[:] += b: ', time.time() - ti)

Output:

a += b:  0.06155872344970703
a[:] += b:  0.3492248058319092
a += b:  0.06215381622314453
a[:] += b:  0.30852508544921875
a += b:  0.07872796058654785
a[:] += b:  0.31493425369262695
a += b:  0.08103752136230469
a[:] += b:  0.3226127624511719
a += b:  0.05706977844238281
a[:] += b:  0.29704785346984863
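
The gap comes from how Python desugars the slice form. A minimal pure-Python sketch (plain Python, not MXNet) showing the extra calls:

```python
# Why `a[:] += b` costs more than `a += b`: Python expands the slice form
# into __getitem__, an in-place add on the result, and a final __setitem__,
# while `a += b` is a single __iadd__ call.
calls = []

class Tracked:
    def __iadd__(self, other):
        calls.append('__iadd__')
        return self

    def __getitem__(self, key):
        calls.append('__getitem__')
        return self

    def __setitem__(self, key, value):
        calls.append('__setitem__')

t = Tracked()
t += 1        # one special-method call
t[:] += 1     # three special-method calls
print(calls)  # ['__iadd__', '__getitem__', '__iadd__', '__setitem__']
```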

@anirudhacharya
Member Author


mom = state
mom[:] *= self.momentum
weight[:] += lr * self.momentum * mom
weight[:] -= lr * (1 + self.momentum) * grad
mom[:] -= grad

@zhanghang1989 The update rule in this PR is the following -

mom_data[i] = param_momentum*mom_data[i];
KERNEL_ASSIGN(out_data[i], req, weight_data[i]-mom_data[i]
                              +(param_momentum+1)*(mom_data[i]
                                -(param_lr*(param_rescale_grad*grad_data[i]+param_wd*weight_data[i]))));

This update rule is the same as the following pseudocode -

weight = (weight - momentum * mom) + (momentum+1)*(momentum * mom - lr*(grad + wd*weight))

which, when simplified, translates to

weight[:] += (momentum**2 * mom) - (momentum + 1) * lr * (grad + wd*weight)


(It is the same rule used in Keras as well - https://stats.stackexchange.com/questions/179915/whats-the-difference-between-momentum-based-gradient-descent-and-nesterovs-acc)
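
A quick NumPy check (with made-up values; lr, momentum, and wd here are illustrative only) that the kernel form and the simplified rule agree:

```python
import numpy as np

lr, momentum, wd = 0.1, 0.9, 0.01
weight = np.array([1.0, -2.0])
mom = np.array([0.3, 0.5])
grad = np.array([0.2, -0.1])

# Kernel form: mom is scaled first, then used twice in the weight update.
mom_k = momentum * mom
out_kernel = weight - mom_k + (momentum + 1) * (mom_k - lr * (grad + wd * weight))

# Simplified form: -mom_k + (momentum + 1) * mom_k == momentum**2 * mom.
out_simple = weight + momentum**2 * mom - (momentum + 1) * lr * (grad + wd * weight)

print(np.allclose(out_kernel, out_simple))  # True
```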

Contributor

@zhanghang1989 zhanghang1989 left a comment


The weight update is correct. Please fix the momentum update at the end.

@anirudhacharya
Member Author

The weight update is correct. Please fix the momentum update at the end.

Yes, will change the momentum state update.

@zhanghang1989
Contributor

Hi @zhanghang1989, is there any difference between a -= b and a[:] -= b?
a[:] -= b may call the extra function __getitem__.


I am not familiar with the symbol API. I just wrote some pseudocode to show how NAG works :)

@eric-haibin-lin eric-haibin-lin changed the title Fixes #15543 Fixes NAG optimizer #15543 Sep 5, 2019
@eric-haibin-lin
Member

Thanks @zhanghang1989 and @anirudhacharya

@Vikas-kum
Contributor

larroy pushed a commit to larroy/mxnet that referenced this pull request Sep 28, 2019
* fix update rules

* readable updates in unit test

* mom update

Successfully merging this pull request may close these issues.

Bug in NAG Optimizer
5 participants