
Fix NaN value comparisons in relu, max and min ops #14262

Merged
merged 2 commits into from
Mar 10, 2019

Conversation

anirudhacharya
Member

@anirudhacharya anirudhacharya commented Feb 27, 2019

Description

Fix NaN comparisons in relu, max and min ops

Fixes #14157
Fixes #11115

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Change how the max, min and relu operators handle NaN values so that NaNs are propagated consistently, and add tests. A minimal sketch of the intended behavior follows below.
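
A minimal sketch of the intended behavior after this change (expected outputs assume the fixed build; this is illustrative, not part of the test suite):

>>> import numpy as np
>>> from mxnet import nd
>>> a = np.NaN * nd.ones(1)
>>> b = nd.zeros(1)
>>> nd.relu(a)        # no longer clipped to 0
[nan]
<NDArray 1 @cpu(0)>
>>> nd.maximum(a, b)  # NaN propagates regardless of argument order
[nan]
<NDArray 1 @cpu(0)>
>>> nd.minimum(b, a)
[nan]
<NDArray 1 @cpu(0)>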

@anirudh2290 @apeforest

Contributor

@apeforest apeforest left a comment


Why do we need this? Don't we already have an is_nan() for all operators?

@anirudhacharya
Member Author

@apeforest we need it because a call like mx.nd.relu(np.NaN*nd.ones(1), out) returns [0.], whereas activation functions are expected to propagate NaN values.

The maximum and minimum operators also have inconsistent behavior w.r.t. NaN values, depending on argument order:

>>> a = np.NaN*nd.ones(1)
>>> b = nd.zeros(1)

>>> nd.maximum(a,b)
[0.]
<NDArray 1 @cpu(0)>

>>> nd.maximum(b,a)
[nan]
<NDArray 1 @cpu(0)>
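
For comparison, NumPy's maximum propagates NaN regardless of argument order, while np.fmax is the explicitly NaN-ignoring variant (added here only as a reference point):

>>> import numpy as np
>>> a = np.array([np.nan])
>>> b = np.zeros(1)
>>> np.maximum(a, b)
array([nan])
>>> np.maximum(b, a)
array([nan])
>>> np.fmax(a, b)
array([0.])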

@szha
Member

szha commented Feb 27, 2019

@anirudhacharya I'm not sure that answers the question. Why do we need to start supporting nan values, especially given the extra handling required for nan?

@anirudhacharya
Member Author

anirudhacharya commented Feb 27, 2019

@szha Ideally I do not want the relu operator to clip NaN values to zero, especially when I am trying to debug a model.
And with regard to maximum and minimum, it is not about 'starting to support' nan values but about fixing the inconsistent handling of nan values.

Pytorch's relu behavior -

>>> import torch
>>> import torch.nn as nn
>>> m = nn.ReLU()
>>> input = np.NaN * torch.ones(1)
>>> out = m(input)
>>> out
tensor([nan])

Also I found a related issue here - #14157

Edit - Another issue filed some time ago which had slipped from my memory - #11115

@anirudhacharya
Member Author

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Feb 27, 2019
@szha
Member

szha commented Feb 28, 2019

@anirudhacharya thanks for the explanation. should relu grad deal with nan in a special way?

@anirudhacharya
Member Author

@szha yes I think the relu grad should also be handled in a special way, thanks for pointing it out.

Currently relu grad at nan returns 0 by evaluating this expression:

MXNET_UNARY_MATH_OP_NC(relu_grad, a > DType(0) ? DType(1) : DType(0));

But max(NaN, 0) evaluates to NaN and that should translate to a relu grad value of 1 and not 0. I will make the changes to fix it.

FYI - Here is an in depth conversation on NaN handling - JuliaLang/julia#7866

@adrianloy

Nice PR! I also had a bug in my model, and because the relu activations removed the NaNs it took me much longer to realize there was a bug. The behaviour should definitely be changed!

@szha
Member

szha commented Mar 7, 2019

@anirudhacharya I'm not sure if relu grad should act like that. As a sanity check, consider if nan is larger than or smaller than 0.

szha
szha previously requested changes Mar 7, 2019
Member

@szha szha left a comment


seems that nan should be surfaced in relu grad instead of 1 when output is nan, because nan is not a number.

@wkcn
Member

wkcn commented Mar 7, 2019

Could we add a new operator to check whether there are nan values? It would be useful when debugging.
When nan appears in the model, it always means that the model has failed, and the output and gradient are not reliable.

@anirudhacharya
Member Author

@anirudhacharya I'm not sure if relu grad should act like that. As a sanity check, consider if nan is larger than or smaller than 0.

nan compared to any number is always False: nan > 0 -> False and nan < 0 -> False.

Ref - https://stackoverflow.com/questions/49011370/nan-propagation-and-ieee-754-standard/49040225

But there are languages and libraries which consider nan to be greater than any number, even np.inf.
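
A quick plain-Python illustration of these comparison rules:

>>> nan = float("nan")
>>> nan > 0
False
>>> nan < 0
False
>>> nan == nan
False
>>> import math
>>> math.isnan(nan)
True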

@szha
Member

szha commented Mar 7, 2019

@anirudhacharya it's not about comparison. nan is not in the domain of the function.

@anirudhacharya
Member Author

@szha What you say makes sense: nan is not a number and hence not in the realm of comparison, so any occurrence, either forward or backward, will have to be propagated.

Maybe PyTorch is doing it wrong, but just for comparison's sake, PyTorch seems to treat the gradient of relu at NaN as equal to 1 -

>>> import torch
>>> import numpy as np
>>> a = np.NaN * torch.ones(1)
>>> a.requires_grad_(True)
tensor([nan], requires_grad=True)
>>> m = torch.nn.ReLU()
>>> out = m(a)
>>> out.backward()
>>> a.grad
tensor([1.])

My main motivation when I first made changes to the relu forward behavior was that the operator silently clipping NaN values was very misleading while trying to build or debug models.

I am open to suggestions on how the relu gradient should behave; it would seem there is no single consensus on this and each community/library decides things for itself.

@anirudhacharya
Member Author

Could we add a new operator to check whether there are nan values? It would be useful when debugging.
When nan appears in the model, it always means that the model has failed, and the output and gradient are not reliable.

I think you are looking for this - https://mxnet.incubator.apache.org/api/python/ndarray/contrib.html?highlight=isnan#mxnet.ndarray.contrib.isnan
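
A short usage sketch of that operator (assuming mx.nd.contrib.isnan is available in your MXNet build):

>>> import mxnet as mx
>>> import numpy as np
>>> x = mx.nd.array([1.0, np.nan, -2.0])
>>> mx.nd.contrib.isnan(x)   # 1 where the entry is NaN, 0 elsewhere
[0. 1. 0.]
<NDArray 3 @cpu(0)>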

@anirudhacharya
Member Author

anirudhacharya commented Mar 8, 2019

I modified relu grad to also propagate NaN values. As discussed above, since NaN does not exist in the domain of the function, it cannot be mapped to any element in the range of the function; hence the gradient output is also NaN.
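
A small autograd check of the new gradient behavior (expected output assumes this fix; sketch only):

>>> import numpy as np
>>> from mxnet import nd, autograd
>>> x = np.NaN * nd.ones(1)
>>> x.attach_grad()
>>> with autograd.record():
...     y = nd.relu(x)
...
>>> y.backward()
>>> x.grad
[nan]
<NDArray 1 @cpu(0)>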

@szha szha dismissed their stale review March 8, 2019 03:59

concern addressed.

@szha
Member

szha commented Mar 8, 2019

@anirudhacharya one last thing: could you measure the performance before and after this change? The change is necessary nonetheless, but it would still be better if we could anticipate any performance impact from it. Thanks.

@anirudhacharya
Member Author

Run mode: before --> after (time in ms)

Whole CPU run:   0.843163 --> 0.864071
Forward CPU run: 0.016467 --> 0.043115
Whole GPU run:   0.460900 --> 0.480667
Forward GPU run: 0.058783 --> 0.059333

Script used:

import mxnet as mx
import numpy as np
from mxnet.test_utils import check_speed

ctx = mx.cpu()
# ctx = mx.gpu(0)  # switch to this context for the GPU numbers

# 3x500x500 input mixing negative, NaN and positive values
sample_data = mx.nd.ones((3, 500, 500), ctx=ctx)
sample_data[0] = -1.
sample_data[1] = np.NaN
sample = mx.sym.Variable("sample")
relu_sym = mx.sym.relu(data=sample)

print("Whole CPU run: ", check_speed(relu_sym, location={"sample": sample_data}, ctx=ctx, N=int(1e5), typ="whole"))
print("Forward CPU run: ", check_speed(relu_sym, location={"sample": sample_data}, ctx=ctx, N=int(1e5), typ="forward"))
# print("Whole GPU run: ", check_speed(relu_sym, location={"sample": sample_data}, ctx=ctx, N=int(1e5), typ="whole"))
# print("Forward GPU run: ", check_speed(relu_sym, location={"sample": sample_data}, ctx=ctx, N=int(1e5), typ="forward"))

@szha szha merged commit c645591 into apache:master Mar 10, 2019
@anirudhacharya anirudhacharya deleted the relu branch March 10, 2019 08:58
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
nswamy pushed a commit that referenced this pull request Apr 5, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
Labels
pr-awaiting-review PR is waiting for code review

Successfully merging this pull request may close these issues:

  • Inconsistent handling for nan
  • ReLU Clips NaNs to Zero