[Bug] Unexpected error when upgrading DGL version from 1.1.3 to 2.1.0 #7333
Comments
@Rhett-Ying could you take a look here? I see TODOs listed in this code to replace the numpy operations with torch ones.
I fixed the bug in a6505e8, but it's not merged into DGL 2.1. For now it's only in the master branch, so you could try the latest DGL nightly build. The fix will ship in the next release, DGL 2.2, which is due in early May.
The fix is included in the latest DGL 2.2.1. Please try it.
Hi @Rhett-Ying, I tried reproducing the example that @jalencato listed and still got a crash. This is on DGL 2.2.1.
@thvasilo, are you running on GPU with a backend other than nccl?
Correct.
Why not use nccl?
This seems to be a new bug and I don't know why it's triggered now.
I will try with nccl.
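For context, the backend is picked when the process group is initialized; a minimal sketch of the distinction this thread hinges on (the env-var rendezvous is an assumption, e.g. a torchrun-style launch):

```python
import torch.distributed as dist

# nccl can run collectives directly on CUDA tensors; gloo collectives
# operate on CPU tensors only, which is the constraint at issue here.
dist.init_process_group(backend="nccl")  # or backend="gloo"
```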
Could you try to figure out why the two tensors below are on different devices? Both of them are supposed to be on CPU, and this is the direct cause of the crash, right? See dgl/python/dgl/distributed/optim/pytorch/sparse_optim.py, lines 358 to 359 at commit 6475057.
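A throwaway probe one could drop just above those lines to confirm the mismatch (a debugging sketch; the tensor names passed in are placeholders, not the actual DGL variables):

```python
import torch

def log_devices(*tensors):
    # Print each tensor's device; a mix of cpu and cuda:N here would
    # confirm the mismatch that crashes the gloo collective below.
    for i, t in enumerate(tensors):
        print(f"tensor[{i}]: device={t.device}, shape={tuple(t.shape)}")
```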
On DGL master (almost the same as DGL 2.2.1) I tried to run https://github.com/dmlc/dgl/blob/master/examples/distributed/rgcn/node_classification.py and it crashes when using GPU with a non-nccl backend.
This seems to make sense, as we support all_to_all_cpu() only?
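For readers unfamiliar with the constraint: gloo has no native all-to-all, so a CPU fallback has to emulate it with one scatter per source rank, roughly along these lines (a simplified sketch, not the exact DGL source):

```python
import torch
import torch.distributed as dist

def alltoall_via_scatter(rank, world_size, output_list, input_list):
    # Emulate all_to_all: for each source rank, that rank scatters its
    # input tensors and every rank receives its slice into output_list.
    # gloo collectives require CPU tensors, hence the .cpu() staging;
    # the output_list tensors are assumed to live on CPU as well.
    input_list = [t.cpu() for t in input_list]
    for src in range(world_size):
        dist.scatter(
            output_list[src],
            scatter_list=input_list if src == rank else None,
            src=src,
        )
```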
I've reproduced it and found the commit to blame: 5dfaf99. This commit added `device` into `gather_list` and `idx_split_size`, but didn't take into account that all-to-all supports CPU tensors only when the backend is not nccl. And the device is overridden by `device = grads.device` in sparse_optim.py.
This change was merged after DGL 1.1.3, so we hit the issue in DGL 2.2.1. In short, the direct cause is that previously the tensors were always on CPU, so it worked well with gloo. But now the tensors are on GPU, while the underlying all-to-all with the gloo backend supports CPU tensors only.
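The fix described above amounts to not trusting `grads.device` when the backend cannot move GPU tensors; a minimal sketch of that logic (an illustration of the idea, not the actual patch in #7409):

```python
import torch
import torch.distributed as dist

def collective_device(grads: torch.Tensor) -> torch.device:
    # Only nccl can run collectives on CUDA tensors; any other backend
    # (gloo in this issue) must stage the exchanged tensors through CPU.
    if dist.get_backend() == "nccl":
        return grads.device
    return torch.device("cpu")
```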
Hi @Rhett-Ying, I ran the repro example that @jalencato posted with the code from #7409 and it works fine now. I think we can close this once that PR is merged.
@thvasilo, could you run more examples to make sure that PR does not trigger any other issue?
If we merge this, we can run our automated integration tests against the daily pip package; I don't have the bandwidth to run more manual tests on the PR code.
🐛 Bug
When switching the DGL version from 1.1.3 to 2.1.0, I hit the following problem:
My code snippet is like:
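A minimal illustrative sketch of this kind of setup (the names, sizes, and learning rate are placeholders, and it assumes `dgl.distributed.initialize()` has already been called in a running cluster):

```python
import torch as th
from dgl.distributed import DistEmbedding
from dgl.distributed.optim import SparseAdam

# Placeholder learnable node embeddings served by the DGL KVStore.
emb = DistEmbedding(num_embeddings=1000, embedding_dim=16, name="node_emb")
optimizer = SparseAdam([emb], lr=0.01)

# One training step over a batch of node IDs; optimizer.step() is where
# the sparse gradients are exchanged across trainers and the crash hits.
nids = th.arange(0, 32)
feats = emb(nids)
loss = feats.sum()
loss.backward()
optimizer.step()
```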
To Reproduce
Steps to reproduce the behavior:
We are getting this error when using GraphStorm.
Expected behavior
When running on DGL 1.1.3 I did not have this problem.
Environment
- How you installed DGL (conda, pip, source): pip

Additional context