Error when using multi-GPU training #324
Multi-GPU training worked fine for me until today, when I pulled the newest code from master.
@luozhiping the only way is to revert back to a previous commit (this would probably work). Additional fixes have been added on the master branch for correctness/speed etc., but they require re-training the model since the architecture has changed. Not the most convenient thing, but variable lengths etc. make a large difference!
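(For anyone unsure how to do the revert: checking out the older commit quoted later in this thread, e.g. `git checkout 655cd58`, pins the repo to the pre-change architecture; any commit that predates the refactor should behave the same way.)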
I'll close this since the issue's resolved :)
@SeanNaren I got that, and I will retrain my model. But you didn't answer why I get an error when I train on multi-GPU. It does not appear when I use only one GPU. The error trace is above.
This error does not seem to be solved yet.
Hi, I used the same commit (655cd58) mentioned earlier by @SeanNaren. It works fine on a single GPU. Trying to run DDP across two GTX 1080 cards, I get the following error. PyTorch 0.4.0. Looking for fixes.
[ds2@blipp73 deepspeech.pytorch]$ python -m multiproc train.py --train-manifest ~/ds2_old_commit/deepspeech.pytorch/ted_train_manifest_sorted.txt --val-manifest ~/ds2_old_commit/deepspeech.pytorch/ted_dev_manifest_sorted.txt --cuda
Got the same issue. Running without the -m multiproc option on a single 1080 Ti card seems to work, but with -m multiproc I get the error below. Torch: 0.4.1.post2
DistributedDataParallel(
terminate called after throwing an instance of 'gloo::EnforceNotMet'
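For context on where a gloo::EnforceNotMet crash tends to come from, here is a minimal sketch of the per-process setup that a launcher like multiproc performs. This is an illustration under assumptions, not the repo's actual code: the address/port, the `run_worker` name, and the toy model are placeholders. The key point is that every process must call `init_process_group` with identical `init_method` and `world_size` before `DistributedDataParallel` is constructed; failures at that rendezvous surface as gloo exceptions.

```python
# Minimal sketch (illustrative, not the repo's actual code) of the
# per-process setup that a launcher such as `-m multiproc` performs.
# The address/port and the toy model below are placeholder assumptions.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def run_worker(rank, world_size):
    # The launcher runs this once per GPU. Rendezvous happens here:
    # every process must pass the same init_method and world_size,
    # and the master address must be reachable from all ranks.
    dist.init_process_group(
        backend='gloo',                       # 'nccl' is the usual choice for GPU training
        init_method='tcp://127.0.0.1:23456',  # must match across all ranks
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(rank)  # one process per GPU

    # Stand-in module; train.py would build the DeepSpeech model here.
    model = torch.nn.Linear(128, 128).cuda(rank)
    model = DistributedDataParallel(model, device_ids=[rank])

    # ... training loop as in train.py ...
```

If the ranks disagree on world_size, or a stale process is still bound to the port, the rendezvous can fail with this kind of gloo error, so checking that one process per GPU actually starts is a reasonable first diagnostic.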
I met the same error.
Same here.
Why can't I train on multi-GPU?