
attach_grad of intermediate variables causes the gradient graph to be lost #11865

szha opened this issue Jul 24, 2018 · 8 comments

@szha (Member) commented Jul 24, 2018

import mxnet as mx
from mxnet import gluon

net = gluon.model_zoo.vision.mobilenet0_25(pretrained=True)
loss = gluon.loss.SoftmaxCELoss()
with mx.autograd.record():
    output = net(mx.random.uniform(shape=(5,3,224,224)))
    output.attach_grad()
    l = loss(output, mx.nd.arange(5))
l.backward()
print(net.features[0].weight.grad()) # shouldn't be zeros
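
For comparison, a minimal sketch (my addition, not from the thread): the same network and loss without the attach_grad call on the intermediate output, which leaves the recorded graph connected from the loss back to the weights and yields the expected non-zero gradients.

import mxnet as mx
from mxnet import gluon

# Sketch: identical to the report above, minus output.attach_grad().
net = gluon.model_zoo.vision.mobilenet0_25(pretrained=True)
loss = gluon.loss.SoftmaxCELoss()
with mx.autograd.record():
    output = net(mx.random.uniform(shape=(5, 3, 224, 224)))
    l = loss(output, mx.nd.arange(5))
l.backward()
print(net.features[0].weight.grad())  # non-zero weight gradients
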
@anirudhacharya (Member) commented May 1, 2019

To make sure I understand this correctly: in scenarios such as the following

x = mx.nd.array([0, 7], ctx = mx.cpu())
x.attach_grad()
with mx.autograd.record():
    y = ((5 * (x**2)) + (13 * x) + 10)
    y.attach_grad()
    z = 2 * y
z.backward()
print(x.grad)

what you want is that we should still get non-zero values for x.grad even though the intermediate variable y has been marked with attach_grad.

In the above example, would you also want the result of y.grad to be retained? That would be a bit different, and more of a feature request than a bug. It would probably involve storing the intermediate gradients of non-leaf variables in some sort of buffer, exposed through a hook or function call that lets the user opt in, because storing them every time by default might be a waste of memory.

Have I understood this right? Which of the two above situations is your issue pointing to?
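
For reference, a minimal sketch (my addition) of the first situation: the same computation without y.attach_grad(), where x.grad comes out as the full chain-rule gradient dz/dx = 2 * (10x + 13).

import mxnet as mx

# Sketch: the example above without y.attach_grad(); the graph stays intact.
x = mx.nd.array([0, 7], ctx=mx.cpu())
x.attach_grad()
with mx.autograd.record():
    y = (5 * (x ** 2)) + (13 * x) + 10
    z = 2 * y
z.backward()
print(x.grad)  # expected: [26., 166.]
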

@szha (Member, Author) commented May 1, 2019

would you also want the result of y.grad to be retained

Yes, that's what I'd like to have. In the current implementation, instead of marking y's gradient as one of the outputs, the above code discards the previous graph in which y resides.

@anirudhacharya (Member) commented May 1, 2019

It would probably involve storing the intermediate gradients of non-leaf variables in some sort of buffer, exposed through a hook or function call that lets the user opt in, because storing them every time by default might be a waste of memory.

@szha do you agree with the part above, i.e. that these intermediate gradients should not be stored by default, and that we should instead provide a function call, something like persist_grad, which the user can call on a variable (like y.persist_grad()) to enable storing of intermediate gradients?
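
For concreteness, a purely hypothetical sketch (my addition) of how the proposed persist_grad might be used; the name and semantics come only from the suggestion above and are not an existing MXNet API.

import mxnet as mx

# Hypothetical API sketch -- persist_grad() does not exist in MXNet today.
x = mx.nd.array([0, 7], ctx=mx.cpu())
x.attach_grad()
with mx.autograd.record():
    y = (5 * (x ** 2)) + (13 * x) + 10
    y.persist_grad()  # opt in to keeping y's gradient without detaching y
    z = 2 * y
z.backward()
print(x.grad)  # leaf gradient, still flowing through y: 2 * (10x + 13) = [26., 166.]
print(y.grad)  # intermediate gradient, kept only because persist_grad() was called: [2., 2.]
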

@szha (Member, Author) commented May 1, 2019

Yes, there should be an explicit mechanism to mark new outputs. Whether it's reusing attach_grad or a new method is up for debate.

@anirudhacharya (Member) commented Jul 23, 2019

Here is another use case where using attach_grad() with intermediate variables gives erroneous results.

With the following example I would expect x.grad to be [10, 24, 42, 64], but using head gradients and the chain rule as per the autograd documentation gives me [5, 12, 21, 32]:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
v = u.detach()
v.attach_grad()
z = v * x
ag.set_recording(False)
z.backward()
u.backward(v.grad)
print(x.grad, y.grad)
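
For reference, the full chain rule here is dz/dx = d(x*(x*y))/dx = 2xy = [10, 24, 42, 64]: one x*y contribution from z.backward() (where v is held constant) and another v.grad * y = x*y from u.backward(v.grad). One possible reading (my assumption, not confirmed in this thread) is that the second backward pass overwrites x.grad rather than accumulating into it, since attach_grad defaults to grad_req='write'. Below is an untested sketch of the same computation with grad_req='add'.

from mxnet import ndarray as nd
from mxnet import autograd as ag

# Untested sketch: accumulate into x.grad / y.grad instead of overwriting,
# so the two backward passes can sum their contributions.
x = nd.array([1, 2, 3, 4])
x.attach_grad(grad_req='add')
y = nd.array([5, 6, 7, 8])
y.attach_grad(grad_req='add')

ag.set_recording(True)
u = x * y
v = u.detach()
v.attach_grad()
z = v * x
ag.set_recording(False)
z.backward()           # contributes v = x*y to x.grad
u.backward(v.grad)     # contributes v.grad * y = x*y to x.grad and v.grad * x to y.grad
print(x.grad, y.grad)  # hoped for: 2xy = [10, 24, 42, 64] and x**2 = [1, 4, 9, 16]
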

But when I do it without using head gradients, as follows, I get the correct gradients:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
z = u * x
ag.set_recording(False)
z.backward()
print(x.grad, y.grad)  # expected: x.grad = 2xy = [10, 24, 42, 64], y.grad = x**2 = [1, 4, 9, 16]

@anirudhacharya (Member) commented:

And as per the autograd documentation here - https://www.d2l.ai/chapter_crashcourse/autograd.html#attach-gradients-to-internal-variables - it would seem we expect the computation graph to be thrown away when we execute x.attach_grad(), because detach() is implicitly run every time attach_grad() is called.

We need to get a clear understanding of what the expected behavior is here.

@larroy (Contributor) commented Jul 30, 2019

Unfortunately this is how it's implemented. Why do you want to attach a gradient to the output again?

detach is not the cause, as far as I understand the code.

@KexinFeng (Contributor) commented Jun 17, 2021

Here are more test cases:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
u.attach_grad()  # implicitly runs u = u.detach()
z = u * x
ag.set_recording(False)

print('test1:')
z.backward()
print('x.grad', x.grad, '\ny.grad', y.grad) 
# expected: x.grad = u = x*y = [5, 12, 21, 32]; y.grad = [0, 0, 0, 0] (or x**2 ?)
print('u.grad', u.grad) # supposed to be: x = [1, 2, 3, 4]
print('')

print('test2:')
u.backward(u.grad, retain_graph=True)
print('x.grad', x.grad, '\ny.grad', y.grad) 
# expected: x.grad = x*y + u.grad*y = [10, 24, 42, 64]; y.grad = [0, 0, 0, 0] (or x**2 ?)
print('u.grad', u.grad)
print('')

print('test3:')
u.backward()
print('x.grad', x.grad, '\ny.grad', y.grad)
print('u.grad', u.grad) # supposed to be x = [1, 2, 3, 4]


Output
test1:
x.grad
[ 5. 12. 21. 32.]
<NDArray 4 @cpu(0)>
y.grad
[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
u.grad
[1. 2. 3. 4.]
<NDArray 4 @cpu(0)>

test2:
x.grad
[ 5. 12. 21. 32.]
<NDArray 4 @cpu(0)>
y.grad
[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
u.grad
[1. 2. 3. 4.]
<NDArray 4 @cpu(0)>

test3:
x.grad
[ 5. 12. 21. 32.]
<NDArray 4 @cpu(0)>
y.grad
[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
u.grad
[1. 1. 1. 1.]
<NDArray 4 @cpu(0)>

The task is to evaluate z = x * u = x * (x * y).

Test 1:
y.grad = [0, 0, 0, 0] is obtained because u.attach_grad() implicitly runs u = u.detach(). This value is what we would want if we really did intend to detach the gradient at the u node, since then dz/dy would be zero.

Test 2:
Both the x.grad and y.grad results are problematic. y.grad should be u.grad * x = [1, 4, 9, 16] instead of [0, 0, 0, 0]. x.grad is also wrong: it is missing the first term of x.grad = u + u.grad * y (i.e. x*y + x*y = [10, 24, 42, 64]).

My guess about the purpose of implicitly running u = u.detach() when calling u.attach_grad() is that, after attaching a gradient to the u node, it assumes users will call something like u.backward(u.grad) to manually connect the two computation subgraphs. But this behaviour may not be necessary.

Another way of fixing it would be for attach_grad() not to implicitly call u = u.detach(). Then, without manually calling u.backward(u.grad), the full gradients dz/dx, dz/dy and dz/du (the latter via something like u.retain_grad()) could be calculated.

A further use case is when the partial gradient dz/dx, with u fixed as a constant, is queried, as in this case. This can be achieved by explicitly calling v = u.detach(), which drops the graph rooted at u.
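
A minimal sketch (my addition) of that last use case: querying the partial dz/dx with u held constant by detaching explicitly.

from mxnet import ndarray as nd
from mxnet import autograd as ag

# Sketch: explicitly detach u so that z = v * x treats v as a constant.
x = nd.array([1, 2, 3, 4])
x.attach_grad()
y = nd.array([5, 6, 7, 8])
y.attach_grad()

ag.set_recording(True)
u = x * y
v = u.detach()   # cut the graph at u on purpose
z = v * x
ag.set_recording(False)
z.backward()
print(x.grad)  # partial dz/dx with u constant: v = x*y = [5., 12., 21., 32.]
print(y.grad)  # [0., 0., 0., 0.] -- no path from z back to y once u is detached
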
