Hangs up when finishing up a medium model training #473
Comments
This hang resulted in a segmentation fault at the end in multi-GPU settings; so far it has only been observed with medium.yml.
Hi @SamTube405, I think this is an issue with how DeepSpeed closes processes. Is this on a single node, or across multiple nodes? Could you post a config we can reproduce this with? Ideally with environment details (machines, pip freeze, etc.) and a small number of steps, so we don't have to wait too long to hit the error.
Hi @sdtblck, we discovered this issue in the single-node/multi-GPU setup with the default medium.yaml (set the number of iterations to a low value like 100). As you pointed out, it could be due to how DeepSpeed closes the processes at the end of training. The Python environment is the same as the one described in the installation guide, with apex installed in addition.
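For reference, a small helper along the following lines could produce a quick-repro copy of the medium config without editing the stock file. This is only a sketch: the configs/medium.yml path, the medium_repro.yml output name, and the train-iters key are assumptions about the config layout and may need adjusting for your checkout.

```python
# Sketch of a repro helper (assumed paths/keys): copy the stock medium
# config and lower the iteration count so training reaches the
# end-of-run shutdown phase quickly.
import yaml  # requires PyYAML

SRC = "configs/medium.yml"        # assumed location of the stock config
DST = "configs/medium_repro.yml"  # hypothetical output file

with open(SRC) as f:
    cfg = yaml.safe_load(f)

cfg["train-iters"] = 100  # assumed key name; low step count to hit the hang fast

with open(DST, "w") as f:
    yaml.safe_dump(cfg, f)

print(f"wrote {DST} with train-iters={cfg['train-iters']}")
```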
Huh, I've personally only ever experienced training hanging on multiple nodes.
This was resolved with an update to DeepSpeed.
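For anyone who lands here later: the thread does not name the exact DeepSpeed version that carries the fix, so after upgrading (e.g. pip install --upgrade deepspeed) it is worth confirming which build is actually active in the training environment. A minimal check, assuming a standard pip install:

```python
# Print the active DeepSpeed version to confirm the upgrade took effect
# (upgrade first with: pip install --upgrade deepspeed).
import deepspeed

print("deepspeed version:", deepspeed.__version__)
```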
Describe the bug
Training hangs when finishing up a run with the default medium.yaml. No issue was observed with small.yaml.