
attach_grad of intermediate variables causes the gradient graph to be lost #11865

szha opened this issue Jul 24, 2018 · 8 comments

@szha (Member) commented Jul 24, 2018

import mxnet as mx
from mxnet import gluon

net = gluon.model_zoo.vision.mobilenet0_25(pretrained=True)
loss = gluon.loss.SoftmaxCELoss()
with mx.autograd.record():
    output = net(mx.random.uniform(shape=(5,3,224,224)))
    output.attach_grad()
    l = loss(output, mx.nd.arange(5))
l.backward()
print(net.features[0].weight.grad()) # shouldn't be zeros
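
For comparison, a minimal sketch (my addition, not from the thread): the same network and loss without the attach_grad call on the intermediate output, which leaves the recorded graph connected from the loss back to the weights and yields the expected non-zero gradients.

import mxnet as mx
from mxnet import gluon

# Sketch: identical to the report above, minus output.attach_grad().
net = gluon.model_zoo.vision.mobilenet0_25(pretrained=True)
loss = gluon.loss.SoftmaxCELoss()
with mx.autograd.record():
    output = net(mx.random.uniform(shape=(5, 3, 224, 224)))
    l = loss(output, mx.nd.arange(5))
l.backward()
print(net.features[0].weight.grad())  # non-zero weight gradients
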
@anirudhacharya (Member) commented May 1, 2019

To make sure I understand this correctly: in scenarios such as the following

x = mx.nd.array([0, 7], ctx = mx.cpu())
x.attach_grad()
with mx.autograd.record():
    y = ((5 * (x**2)) + (13 * x) + 10)
    y.attach_grad()
    z = 2 * y
z.backward()
print(x.grad)

what you want is that we should still get non-zero values for x.grad even though the intermediate variable y has been marked with attach_grad.

In the above example, would you also want the result of y.grad to be retained? That would be a bit different, and more of a feature request than a bug. It would probably involve storing the intermediate gradients of non-leaf variables in some sort of buffer, exposed through a hook or function call that lets the user opt in, because storing them every time by default might be a waste of memory.

Have I understood this right? Which of the two above situations is your issue pointing to?
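
For reference, a minimal sketch (my addition) of the first situation: the same computation without y.attach_grad(), where x.grad comes out as the full chain-rule gradient dz/dx = 2 * (10x + 13).

import mxnet as mx

# Sketch: the example above without y.attach_grad(); the graph stays intact.
x = mx.nd.array([0, 7], ctx=mx.cpu())
x.attach_grad()
with mx.autograd.record():
    y = (5 * (x ** 2)) + (13 * x) + 10
    z = 2 * y
z.backward()
print(x.grad)  # expected: [26., 166.]
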

@szha (Member, Author) commented May 1, 2019

would you also want the result of y.grad to be retained

Yes, that's what I'd like to have. In the current implementation, instead of marking y's gradient as one of the outputs, the above code discards the previous graph in which y resides.

@anirudhacharya (Member) commented May 1, 2019

It would probably involve storing the intermediate gradients of non-leaf variables in some sort of buffer, exposed through a hook or function call that lets the user opt in, because storing them every time by default might be a waste of memory.

@szha do you agree with the part above, i.e. that these intermediate gradients should not be stored by default, and that we should instead provide a function call, something like persist_grad, which the user can call on a variable (like y.persist_grad()) to enable storing of intermediate gradients?
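
For concreteness, a purely hypothetical sketch (my addition) of how the proposed persist_grad might be used; the name and semantics come only from the suggestion above and are not an existing MXNet API.

import mxnet as mx

# Hypothetical API sketch -- persist_grad() does not exist in MXNet today.
x = mx.nd.array([0, 7], ctx=mx.cpu())
x.attach_grad()
with mx.autograd.record():
    y = (5 * (x ** 2)) + (13 * x) + 10
    y.persist_grad()  # opt in to keeping y's gradient without detaching y
    z = 2 * y
z.backward()
print(x.grad)  # leaf gradient, still flowing through y: 2 * (10x + 13) = [26., 166.]
print(y.grad)  # intermediate gradient, kept only because persist_grad() was called: [2., 2.]
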

@szha (Member, Author) commented May 1, 2019

Yes, there should be an explicit mechanism to mark new outputs. Whether it's reusing attach_grad or a new method is up for debate.

@anirudhacharya (Member) commented Jul 23, 2019

Here is another use case where using attach_grad() with intermediate variables gives erroneous results.

With the following example I would expect x.grad to be [10, 24, 42, 64], but using head gradients and the chain rule as per the autograd documentation gives me [5, 12, 21, 32]:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
v = u.detach()
v.attach_grad()
z = v * x
ag.set_recording(False)
z.backward()
u.backward(v.grad)
print(x.grad, y.grad)
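
For reference, the full chain rule here is dz/dx = d(x*(x*y))/dx = 2xy = [10, 24, 42, 64]: one x*y contribution from z.backward() (where v is held constant) and another v.grad * y = x*y from u.backward(v.grad). One possible reading (my assumption, not confirmed in this thread) is that the second backward pass overwrites x.grad rather than accumulating into it, since attach_grad defaults to grad_req='write'. Below is an untested sketch of the same computation with grad_req='add'.

from mxnet import ndarray as nd
from mxnet import autograd as ag

# Untested sketch: accumulate into x.grad / y.grad instead of overwriting,
# so the two backward passes can sum their contributions.
x = nd.array([1, 2, 3, 4])
x.attach_grad(grad_req='add')
y = nd.array([5, 6, 7, 8])
y.attach_grad(grad_req='add')

ag.set_recording(True)
u = x * y
v = u.detach()
v.attach_grad()
z = v * x
ag.set_recording(False)
z.backward()           # contributes v = x*y to x.grad
u.backward(v.grad)     # contributes v.grad * y = x*y to x.grad and v.grad * x to y.grad
print(x.grad, y.grad)  # hoped for: 2xy = [10, 24, 42, 64] and x**2 = [1, 4, 9, 16]
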

But when I do it without using head gradients, as follows, I get the correct gradients:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
z = u * x
ag.set_recording(False)
z.backward()
print(x.grad, y.grad)  # expected: x.grad = 2xy = [10, 24, 42, 64], y.grad = x**2 = [1, 4, 9, 16]

@anirudhacharya (Member) commented:

And as per the autograd documentation here - https://www.d2l.ai/chapter_crashcourse/autograd.html#attach-gradients-to-internal-variables - it would seem we expect the computation graph to be thrown away when we execute x.attach_grad(), because detach() is implicitly run every time attach_grad() is called.

We need to get a clear understanding of what the expected behavior is here.

@larroy (Contributor) commented Jul 30, 2019

Unfortunately this is how it's implemented. Why do you want to attach a gradient to the output again?

detach is not the cause, as far as I understand the code.

@KexinFeng (Contributor) commented Jun 17, 2021

Here are more test cases:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
u.attach_grad()  # implicitly runs u = u.detach()
z = u * x
ag.set_recording(False)

print('test1:')
z.backward()
print('x.grad', x.grad, '\ny.grad', y.grad) 
# expected: x.grad = u = x*y = [5, 12, 21, 32]; y.grad = [0, 0, 0, 0] (or x**2 ?)
print('u.grad', u.grad) # supposed to be: x = [1, 2, 3, 4]
print('')

print('test2:')
u.backward(u.grad, retain_graph=True)
print('x.grad', x.grad, '\ny.grad', y.grad) 
# expected: x.grad = x*y + u.grad*y = [10, 24, 42, 64]; y.grad = [0, 0, 0, 0] (or x**2 ?)
print('u.grad', u.grad)
print('')

print('test3:')
u.backward()
print('x.grad', x.grad, '\ny.grad', y.grad)
print('u.grad', u.grad) # supposed to be x = [1, 2, 3, 4]


Output
test1:
x.grad
[ 5. 12. 21. 32.]
<NDArray 4 @cpu(0)>
y.grad
[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
u.grad
[1. 2. 3. 4.]
<NDArray 4 @cpu(0)>

test2:
x.grad
[ 5. 12. 21. 32.]
<NDArray 4 @cpu(0)>
y.grad
[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
u.grad
[1. 2. 3. 4.]
<NDArray 4 @cpu(0)>

test3:
x.grad
[ 5. 12. 21. 32.]
<NDArray 4 @cpu(0)>
y.grad
[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
u.grad
[1. 1. 1. 1.]
<NDArray 4 @cpu(0)>

The task is to evaluate z = x * u = x * (x * y).

Test 1:
y.grad = [0, 0, 0, 0] is obtained because u.attach_grad() implicitly runs u = u.detach(). This value is what we would want if we really did intend to detach the gradient at the u node, since then dz/dy would be zero.

Test 2:
Both the x.grad and y.grad results are problematic. y.grad should be u.grad * x = [1, 4, 9, 16] instead of [0, 0, 0, 0]. x.grad is also wrong: it is missing the first term of x.grad = u + u.grad * y (i.e. x*y + x*y = [10, 24, 42, 64]).

My guess about the purpose of implicitly running u = u.detach() when calling u.attach_grad() is that, after attaching a gradient to the u node, it assumes users will call something like u.backward(u.grad) to manually connect the two computation subgraphs. But this behaviour may not be necessary.

Another way of fixing it would be for attach_grad() not to implicitly call u = u.detach(). Then, without manually calling u.backward(u.grad), the full gradients dz/dx, dz/dy and dz/du (the latter via something like u.retain_grad()) could be calculated.

A further use case is when the partial gradient dz/dx, with u fixed as a constant, is queried, as in this case. This can be achieved by explicitly calling v = u.detach(), which drops the graph rooted at u.
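
A minimal sketch (my addition) of that last use case: querying the partial dz/dx with u held constant by detaching explicitly.

from mxnet import ndarray as nd
from mxnet import autograd as ag

# Sketch: explicitly detach u so that z = v * x treats v as a constant.
x = nd.array([1, 2, 3, 4])
x.attach_grad()
y = nd.array([5, 6, 7, 8])
y.attach_grad()

ag.set_recording(True)
u = x * y
v = u.detach()   # cut the graph at u on purpose
z = v * x
ag.set_recording(False)
z.backward()
print(x.grad)  # partial dz/dx with u constant: v = x*y = [5., 12., 21., 32.]
print(y.grad)  # [0., 0., 0., 0.] -- no path from z back to y once u is detached
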
