[WIP] Warn on future divergent behavior for conditional views #126129
base: gh/kurtamohler/30/base
Conversation
I'm getting an odd segfault that I'm having trouble understanding.

From the backtrace at the end of this comment, this happens when … I don't know why that would happen. Maybe the …

One way to work around this segfault, and prevent it in some cases, is to change

```cpp
if (allocator != nullptr) {
  return allocator->is_simple_data_ptr(data_ptr);
} else {
  return ctx == data;
}
```

to this:

```cpp
if (ctx == data) {
  // A data pointer whose context is the data pointer itself is treated
  // as simple, regardless of which allocator produced it.
  return true;
} else if (allocator != nullptr) {
  return allocator->is_simple_data_ptr(data_ptr);
} else {
  return false;
}
```

But that change just covers up the issue rather than really solving it. There could potentially be a case where we have a simple data pointer, but … I guess I'll just have to go with the workaround until I know more.

Backtrace:

…
Another problem is that this PR currently breaks sharing tensors between processes.

```python
import torch
import torch.multiprocessing as mp

def test(q):
    # Move the tensor's storage into shared memory in the child process,
    # then send it back through the managed queue.
    x = torch.randn(10)
    x.share_memory_()
    q.put(x)

if __name__ == "__main__":
    manager = mp.Manager()
    q = manager.Queue()
    p = mp.Process(target=test, args=(q,))
    p.start()
    p.join()
    print(q.get())
```

Output:

…
That failed check happens in … I'm not sure, but maybe we should just forbid sharing COW tensors between processes? So if …
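If we did forbid it, user code could still opt in by materializing before sharing. A minimal sketch, assuming that any in-place write materializes a COW storage; the `share_safely` helper is hypothetical and not part of this PR:

```python
import torch

def share_safely(t):
    # Hypothetical helper: force materialization before sharing. For a
    # lazily-cloned (COW) tensor, the in-place write triggers the actual
    # copy; for an ordinary tensor it is a cheap no-op.
    t.mul_(1)
    return t.share_memory_()
```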
Yet another problem:

```python
import torch
import numpy as np

a = torch.from_numpy(np.random.randn(10))
b = a._lazy_clone()
```
We could sort of support COW for numpy tensors, but the problem is that there would be no way to prevent numpy from mutating the data underneath, as far as I know. Should we do the same as I mentioned above for shared tensors?
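For concreteness, this is the aliasing in question; this much is standard `from_numpy` behavior, not specific to this PR:

```python
import torch
import numpy as np

arr = np.zeros(3)
t = torch.from_numpy(arr)  # t borrows arr's buffer; PyTorch does not own it
arr[0] = 1.0               # a write from the numpy side...
print(t)                   # ...is visible in the tensor:
                           # tensor([1., 0., 0.], dtype=torch.float64)
```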
Also, if we don't support COW for numpy-based tensors, what exactly should the warning message about the future behavior say? Ideally, it would say something like "This operation creates a conditional view of a numpy-based tensor. In the future it will unconditionally create a clone." That would give the user a clue about why this is happening. But from … And I'm sure there are going to be other kinds of cases where we don't have a simple data pointer. We could think about creating a special deleter and context to wrap around numpy data pointers and other non-simple data pointers, to add some information about what these data pointers are, for the sole purpose of making the warning messages emitted by … more informative.
Yeah, the safe thing seems to be to unconditionally clone, and then adjusting the warnings accordingly also seems good.
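In other words, something like the following; a minimal sketch, where `data_ptr_is_simple` is a hypothetical stand-in for whatever C++-side predicate ends up deciding this:

```python
import torch

def data_ptr_is_simple(t) -> bool:
    # Hypothetical stand-in for the real check; conservatively report
    # False so this sketch always takes the safe path.
    return False

def lazy_clone_or_copy(t):
    # COW is only safe when PyTorch controls the data pointer. Otherwise
    # (numpy-backed, shared memory, ...) eagerly clone instead.
    if data_ptr_is_simple(t):
        return t._lazy_clone()
    return t.clone()
```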
I'm getting failures for gradgradcheck that I don't understand. For instance:

…

If I change …
Is it possible we are accidentally deallocating the data too early?
I'm not sure yet. But I was able to write a minimal reproducer:

```python
import torch

y = torch.randn(6, dtype=torch.float64, requires_grad=True)
x = torch.randn(6, dtype=torch.float64, requires_grad=True)
inputs = (y, x)

torch.autograd.gradgradcheck(
    torch.trapezoid,
    inputs,
    check_batched_grad=True,
)
```

A few observations: …
I've still been trying to debug the failure for … I have the following script, which just runs gradgradcheck on …:

```python
import torch
import warnings
from itertools import product

warnings.simplefilter('always')
# Raw string so the regex's backslash escape survives Python string parsing
warnings.filterwarnings('ignore', r'.*torch\.vmap.*', FutureWarning)

for future, check_batched_grad in product([False, True], [False, True]):
    print('--------------------------------------------')
    print(f'future={future}, check_batched_grad={check_batched_grad}')
    torch.set_future_lazy_clone(future)
    y = torch.randn(6, dtype=torch.float64, requires_grad=True)
    x = torch.randn(6, dtype=torch.float64, requires_grad=True)
    try:
        torch.autograd.gradgradcheck(
            torch.trapezoid,
            (y, x),
            check_batched_grad=check_batched_grad)
    except RuntimeError:
        print('FAIL')
    else:
        print('PASS')
```

If I run it, I get the following output:

…
This shows that everything passes, except in the case where future lazy clone is turned off and batched grads are being checked. So then I thought: what if I change the implementation of …? This is the diff I applied on top of my current PR:

…
Here's what I get when I rerun the script above:

…
Even though the implementation is the same for …, I was wondering why the …

…
That did indeed avoid the extra warning, and it also made the script pass in all four cases. Of course, this is not a real solution, since we need to make sure this PR's changes are backward compatible. I also tried removing the alias annotations and other related things from …:

…
That made the script pass in all four cases. But this is also not an acceptable solution--the whole reason why I have a separate … I still have some avenues to investigate, but I'm wondering if anything sticks out to anyone. The specific question is: what is wrong with the way I've registered …?

EDIT: Here is the error message for the failing case of …:

…
I'm sorry, I've been swamped recently. I will read this carefully when I get a chance.
Stack from ghstack (oldest at bottom):
Part of #109833
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang