Training will get stuck and stop without reporting an error #37
Comments
Did you try training and testing with a single GPU first? Setting deterministic does not cause the hang. I ran into this problem before, and it was fixed after I updated PyTorch Lightning. A likely cause is the multi-GPU training stage, where the GPUs hang while waiting for each other to synchronize.
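For reference, a minimal single-GPU run in PyTorch Lightning 1.5.x looks roughly like the sketch below. The toy model and dataset are placeholders rather than this repo's classes; the point is only that with `gpus=1` there is no cross-GPU synchronization that could deadlock.

```python
# Minimal sketch: run the same training loop on a single GPU first,
# so inter-GPU synchronization can be ruled out as the cause of the hang.
# The tiny model and dataset below are placeholders, not this repo's code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    ds = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    trainer = pl.Trainer(
        gpus=1,               # single GPU: no cross-GPU sync involved
        max_epochs=2,
        deterministic=False,  # same flag as in the original report
    )
    trainer.fit(ToyModel(), DataLoader(ds, batch_size=64))
```

If a run like this completes but the multi-GPU run still stalls, the problem is most likely in the DDP synchronization rather than in the data or model code.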
Thank you for your reply. I will try testing with a single GPU. By the way, which versions of CUDA, PyTorch, and PyTorch Lightning did you end up using?
I use pytorch_lightning==1.5.9 and CUDA 10.1. But I think the current version of PyTorch Lightning also works; it just needs a bit of tweaking.
I set deterministic to False and training runs successfully, but at about 68% of epoch 1 it gets stuck without reporting an error and does not progress. How can I solve this?
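For debugging the multi-GPU case, one option is to enable NCCL's standard debug logging before constructing the Trainer, so the per-rank collective activity is visible when the hang occurs. The flags below are generic PyTorch Lightning 1.5.x options plus an NCCL environment variable, not this repo's exact launch configuration.

```python
# Sketch of the multi-GPU setup described above, with NCCL debug output
# enabled so a sync deadlock between GPUs shows up in the logs.
# Trainer flags are generic PyTorch Lightning 1.5.x options; the exact
# launch script of this repo may differ.
import os

# NCCL_DEBUG=INFO prints collective-communication activity per rank,
# which helps locate the point where one GPU stops waiting for the others.
os.environ.setdefault("NCCL_DEBUG", "INFO")

import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,               # multi-GPU run where the hang was observed
    strategy="ddp",       # DistributedDataParallel, one process per GPU
    max_epochs=5,
    deterministic=False,  # setting this flag does not cause the hang
)
# trainer.fit(model, datamodule=dm)  # model/datamodule as in the single-GPU sketch
```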