Distributed training with model parallelism hangs with the recent PR #985
Comments
@honglu2875 I don't know the exact reason. Anyway, I will try distributed training after #979 is merged and let you know whether this bug is solved. Thanks.
I have reproduced this issue and confirmed that #979 does not fix it.
I tried to set up neox from an empty conda env again and used my own config (pythia-1b), making sure to keep the settings relevant to this issue. On the main branch it just errored out on the first training step instead of hanging. After applying #979, it can train normally (I watched it for about 10 training steps). I will take a look at the other items in the OP's config later, but I hope this helps narrow down the problem.
After conferring with @honglu2875, we discovered that I was failing to apply the fix to both nodes. @absol13 It should now work for you.
Now it works correctly. Thanks for your fast support.
Describe the bug
Hello, I found that distributed training with the setting "model-parallel-size": >1 hangs. This situation appears in the source after PR #958 was merged, and it did not appear in older sources, nor with the setting "model-parallel-size": 1, at all.

To Reproduce
I share my config file below for reproduction.
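For orientation, here is a minimal sketch of the kind of config excerpt involved. It is not the attached file: the key style follows the hyphenated names used above, and every value other than "model-parallel-size" being greater than 1 is an assumption for illustration only.

    # Hypothetical excerpt of a pythia-1b-style GPT-NeoX config.
    # Only "model-parallel-size" > 1 matters for triggering the hang;
    # all other keys and values here are assumed placeholders.
    {
      "pipe-parallel-size": 1,
      "model-parallel-size": 2,   # any value > 1 reproduces the hang
      "num-layers": 16,
      "hidden-size": 2048,
      "num-attention-heads": 8,
      "seq-length": 2048,
    }

With "model-parallel-size": 1 and the rest unchanged, training reportedly proceeds normally.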
Expected behavior
Training should proceed further.
Proposed solution
Screenshots
Specifically, training does not proceed at this point:
Also, I report logs from the nvidia-smi command. In a normal training run, GPU power usage reaches close to its capacity, but here I observed that it stays far below capacity, which suggests that the processes are hanging.
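As one way to observe this (the exact query fields below are an assumed example, not taken from the original report), power draw and utilization can be polled while the job is stuck:

    # Poll power draw and GPU utilization every 5 seconds.
    # On a hung model-parallel run, these stay far below the power limit.
    nvidia-smi --query-gpu=index,power.draw,power.limit,utilization.gpu --format=csv -l 5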
Environment (please complete the following information):
I share my config file for training to help reproduce this situation.
Additional context