
Error when using multi-GPU training #324

Closed
luozhiping opened this issue Jul 14, 2018 · 9 comments
Comments

@luozhiping

Why can't I train on multi-GPU?

python -m multiproc train.py --train-manifest qkids/manifest/qkids_train_manifest_limit_250.csv --val-manifest qkids/manifest/qkids_test_manifest_limit_never_train.csv --cuda --model-path models/libri_final_and_limit.pth --epochs 50 --checkpoint --checkpoint-per-batch 1000 --batch-size 20 --tensorboard --log-params --id libri_final_and_limit
['train.py', '--train-manifest', 'qkids/manifest/qkids_train_manifest_limit_250.csv', '--val-manifest', 'qkids/manifest/qkids_test_manifest_limit_never_train.csv', '--cuda', '--model-path', 'models/libri_final_and_limit.pth', '--epochs', '50', '--checkpoint', '--checkpoint-per-batch', '1000', '--batch-size', '20', '--tensorboard', '--log-params', '--id', 'libri_final_and_limit', '--world-size', '2', '--rank', '0', '--gpu-rank', '0']
['train.py', '--train-manifest', 'qkids/manifest/qkids_train_manifest_limit_250.csv', '--val-manifest', 'qkids/manifest/qkids_test_manifest_limit_never_train.csv', '--cuda', '--model-path', 'models/libri_final_and_limit.pth', '--epochs', '50', '--checkpoint', '--checkpoint-per-batch', '1000', '--batch-size', '20', '--tensorboard', '--log-params', '--id', 'libri_final_and_limit', '--world-size', '2', '--rank', '1', '--gpu-rank', '1']
DistributedDataParallel(
(module): DeepSpeech(
(conv): MaskConv(
(seq_module): Sequential(
(0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): Hardtanh(min_val=0, max_val=20, inplace)
(3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5))
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Hardtanh(min_val=0, max_val=20, inplace)
)
)
(rnns): Sequential(
(0): BatchRNN(
(rnn): GRU(1312, 800, bidirectional=True)
)
(1): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(2): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(3): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(4): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
)
(fc): Sequential(
(0): SequenceWise (
Sequential(
(0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=800, out_features=29, bias=False)
))
)
(inference_softmax): InferenceBatchSoftmax()
)
)
Number of parameters: 41187968
/home/luozhiping/workspace/speech/deepspeech.pytorch/model.py:98: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
x, h = self.rnn(x)
Traceback (most recent call last):
File "train.py", line 246, in
out, output_sizes = model(inputs, input_sizes)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 217, in forward
return self.gather(outputs, self.output_device)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 226, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
return gather_map(outputs)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 55, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 186, in gather
"but expected {}".format(got, expected))
ValueError: gather got an input of invalid size: got 10x110x29, but expected 10x226x29
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249: driver shutting down

@luozhiping
Author

Multi-GPU training worked fine before; today I pulled the newest code from master.
There is another question: I see that the model's architecture has changed, so I can't continue-from my prior model. Is there a way to fix that?

@SeanNaren
Owner

@luozhiping the only way is to revert to a previous commit (that would probably work). Additional fixes have been added to the master branch for correctness/speed etc., but they require re-training the model because the architecture has changed. Not the most convenient thing on the planet, but variable lengths etc. make a large difference!
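
For anyone who needs to keep training from an old checkpoint, a minimal sketch of the revert described above, assuming your old checkpoint lives under models/ (the hash shown is just the commit cited later in this thread; substitute the commit your checkpoint was actually trained with):

git clone https://github.com/SeanNaren/deepspeech.pytorch.git
cd deepspeech.pytorch
git checkout -b old-architecture 655cd58    # substitute the commit your checkpoint matches
python train.py --cuda --continue-from models/libri_final_and_limit.pth --train-manifest qkids/manifest/qkids_train_manifest_limit_250.csv --val-manifest qkids/manifest/qkids_test_manifest_limit_never_train.csv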

@SeanNaren
Owner

I'll close this since the issue's resolved :)

@luozhiping
Author

@SeanNaren I got that and I will retrain my model. But you didn't answer why I get an error when I train on multi-GPU; it doesn't appear when I use only one GPU. The error trace is above:

ValueError: gather got an input of invalid size: got 10x110x29, but expected 10x226x29
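
For context on why the message names two sizes: DistributedDataParallel with multiple devices per process gathers each replica's output back onto one GPU by concatenating along the batch dimension, so every other dimension (here the time dimension, 110 vs. 226 frames) has to match across replicas, and with variable-length utterances it generally won't. A minimal CPU sketch of that shape constraint (illustrative only, not the project's code):

import torch
import torch.nn.functional as F

a = torch.randn(10, 110, 29)  # output of one replica: (batch, time, classes)
b = torch.randn(10, 226, 29)  # output of another replica with longer utterances

# Concatenating along dim 0, as gather does, fails here: sizes must match in every
# dimension except 0.
# torch.cat([a, b], dim=0)  # RuntimeError

# Padding the shorter output along the time dimension makes the shapes compatible:
a_padded = F.pad(a, (0, 0, 0, b.size(1) - a.size(1)))  # pad order: (classes_l, classes_r, time_l, time_r)
merged = torch.cat([a_padded, b], dim=0)
print(merged.shape)  # torch.Size([20, 226, 29])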

@fanlu

fanlu commented Sep 11, 2018

This error doesn't seem to be solved yet.

root@7096c8ab06ef:/workspace/deepspeech.pytorch# python -m multiproc train.py --cuda  --train-manifest /workspace/data/libri_train_manifest.csv --val-manifest /workspace/data/libri_val_manifest.csv                     
['train.py', '--cuda', '--train-manifest', '/workspace/data/libri_train_manifest.csv', '--val-manifest', '/workspace/data/libri_val_manifest.csv', '--world-size', '4', '--rank', '0', '--gpu-rank', '0']
['train.py', '--cuda', '--train-manifest', '/workspace/data/libri_train_manifest.csv', '--val-manifest', '/workspace/data/libri_val_manifest.csv', '--world-size', '4', '--rank', '1', '--gpu-rank', '1']
['train.py', '--cuda', '--train-manifest', '/workspace/data/libri_train_manifest.csv', '--val-manifest', '/workspace/data/libri_val_manifest.csv', '--world-size', '4', '--rank', '2', '--gpu-rank', '2']
['train.py', '--cuda', '--train-manifest', '/workspace/data/libri_train_manifest.csv', '--val-manifest', '/workspace/data/libri_val_manifest.csv', '--world-size', '4', '--rank', '3', '--gpu-rank', '3']
Model Save directory already exists.
Traceback (most recent call last):
  File "train.py", line 256, in <module>
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=(int(args.gpu_rank),) if args.rank else None)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 140, in __init__
    self._module_copies = replicate(self.module, self.device_ids, detach=True)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable (allocate at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingAllocator.cpp:510)
frame #0: THCStorage_resize + 0x123 (0x7faa44cf94e3 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #1: THCTensor_resizeNd + 0x30f (0x7faa44d0717f in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: THCudaTensor_newWithStorage + 0xfa (0x7faa44d0d65a in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: at::CUDAFloatType::th_tensor(at::ArrayRef<long>) const + 0xa5 (0x7faa44c2d745 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::native::tensor(at::Type const&, at::ArrayRef<long>) + 0x3a (0x7faa67da37da in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #5: at::Type::tensor(at::ArrayRef<long>) const + 0x9 (0x7faa67f91b69 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: torch::autograd::VariableType::tensor(at::ArrayRef<long>) const + 0x44 (0x7faa69c13d04 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: torch::cuda::broadcast(at::Tensor const&, at::ArrayRef<long>) + 0x194 (0x7faa6a0c5dc4 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #8: torch::cuda::broadcast_coalesced(at::ArrayRef<at::Tensor>, at::ArrayRef<long>, unsigned long) + 0xa10 (0x7faa6a0c7060 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0xc423cb (0x7faa6a0cb3cb in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x38a5cb (0x7faa698135cb in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #11: _PyCFunction_FastCallDict + 0x154 (0x556ffc3457c4 in /miniconda/envs/py36/bin/python)
frame #12: <unknown function> + 0x19c10c (0x556ffc3d310c in /miniconda/envs/py36/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x30a (0x556ffc3f741a in /miniconda/envs/py36/bin/python)
frame #14: <unknown function> + 0x1950a6 (0x556ffc3cc0a6 in /miniconda/envs/py36/bin/python)
frame #15: <unknown function> + 0x1960e1 (0x556ffc3cd0e1 in /miniconda/envs/py36/bin/python)
frame #16: <unknown function> + 0x19c1e5 (0x556ffc3d31e5 in /miniconda/envs/py36/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x30a (0x556ffc3f741a in /miniconda/envs/py36/bin/python)
frame #18: PyEval_EvalCodeEx + 0x329 (0x556ffc3cdbf9 in /miniconda/envs/py36/bin/python)
frame #19: <unknown function> + 0x197a14 (0x556ffc3cea14 in /miniconda/envs/py36/bin/python)
frame #20: PyObject_Call + 0x3e (0x556ffc3455ce in /miniconda/envs/py36/bin/python)
frame #21: THPFunction_apply(_object*, _object*) + 0x38f (0x7faa69bf1a2f in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #22: PyCFunction_Call + 0x5f (0x556ffc34879f in /miniconda/envs/py36/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x550c (0x556ffc3fc61c in /miniconda/envs/py36/bin/python)
frame #24: <unknown function> + 0x1954ce (0x556ffc3cc4ce in /miniconda/envs/py36/bin/python)
frame #25: <unknown function> + 0x1960e1 (0x556ffc3cd0e1 in /miniconda/envs/py36/bin/python)
frame #26: <unknown function> + 0x19c1e5 (0x556ffc3d31e5 in /miniconda/envs/py36/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x10bb (0x556ffc3f81cb in /miniconda/envs/py36/bin/python)
frame #28: <unknown function> + 0x195366 (0x556ffc3cc366 in /miniconda/envs/py36/bin/python)
frame #29: _PyFunction_FastCallDict + 0x3db (0x556ffc3cd7db in /miniconda/envs/py36/bin/python)
frame #30: _PyObject_FastCallDict + 0x26f (0x556ffc345b8f in /miniconda/envs/py36/bin/python)
frame #31: _PyObject_Call_Prepend + 0x63 (0x556ffc34a773 in /miniconda/envs/py36/bin/python)
frame #32: PyObject_Call + 0x3e (0x556ffc3455ce in /miniconda/envs/py36/bin/python)
frame #33: <unknown function> + 0x16996b (0x556ffc3a096b in /miniconda/envs/py36/bin/python)
frame #34: <unknown function> + 0x19c447 (0x556ffc3d3447 in /miniconda/envs/py36/bin/python)
frame #35: _PyObject_FastCallDict + 0x8b (0x556ffc3459ab in /miniconda/envs/py36/bin/python)
frame #36: _PyObject_FastCallKeywords + 0xaa (0x556ffc3cd3ca in /miniconda/envs/py36/bin/python)
frame #37: <unknown function> + 0x19c25e (0x556ffc3d325e in /miniconda/envs/py36/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x10bb (0x556ffc3f81cb in /miniconda/envs/py36/bin/python)
frame #39: PyEval_EvalCodeEx + 0x329 (0x556ffc3cdbf9 in /miniconda/envs/py36/bin/python)
frame #40: PyEval_EvalCode + 0x1c (0x556ffc3ce99c in /miniconda/envs/py36/bin/python)
frame #41: <unknown function> + 0x213e44 (0x556ffc44ae44 in /miniconda/envs/py36/bin/python)
frame #42: PyRun_FileExFlags + 0xa1 (0x556ffc44b241 in /miniconda/envs/py36/bin/python)
frame #43: PyRun_SimpleFileExFlags + 0x1c4 (0x556ffc44b444 in /miniconda/envs/py36/bin/python)
frame #44: Py_Main + 0x648 (0x556ffc44ef78 in /miniconda/envs/py36/bin/python)
frame #45: main + 0xee (0x556ffc316efe in /miniconda/envs/py36/bin/python)
frame #46: __libc_start_main + 0xf0 (0x7faa80cc3830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #47: <unknown function> + 0x1c6f25 (0x556ffc3fdf25 in /miniconda/envs/py36/bin/python)

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /opt/conda/conda-bld/pytorch_1535491974311/work/third_party/gloo/gloo/cuda.cu:267] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1535491974311/work/third_party/gloo/gloo/cuda.cu:267: driver shutting down
root@7096c8ab06ef:/workspace/deepspeech.pytorch# git log -5 
commit 655cd586de798b3656c9c52242a11cde3d8a6bd4
Merge: 51f742e de4280d
Author: Sean Naren <[email protected]>
Date:   Fri Jul 6 07:27:34 2018 +0100

    Merge pull request #318 from dalonlobo/master
    
    Update to readme and train.py for multiproc
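
This failure is different from the gather error above: it happens while DistributedDataParallel is still replicating the model (broadcast_coalesced), before any forward pass, with CUDA reporting every device as busy or unavailable. One thing worth checking in this kind of multiproc run is that each process pins itself to its own GPU before building and wrapping the model; a minimal sketch under that assumption (the backend, address, and GPU index are placeholders, not the project's verified fix):

import torch
import torch.distributed as dist

# Placeholder rendezvous settings; train.py takes these from its own arguments
# (--world-size, --rank, etc., as appended by multiproc in the logs above).
dist.init_process_group(backend='gloo', init_method='tcp://127.0.0.1:1550',
                        world_size=2, rank=0)

gpu_rank = 0                                     # this process's GPU (from --gpu-rank)
torch.cuda.set_device(gpu_rank)                  # pin the process to its GPU first
model = torch.nn.Linear(10, 10).cuda(gpu_rank)   # stand-in for the DeepSpeech model
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[gpu_rank],                       # a single id: one replica, no intra-process scatter
    output_device=gpu_rank)

Passing a single device id per process also sidesteps the gather-size problem entirely, since there is nothing to gather inside a process.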

@abuvaneswari

abuvaneswari commented Sep 19, 2018

Hi,

I used the same commit (655cd58) as mentioned earlier by @SeanNaren

Works fine on a single GPU. Trying to run DDP across two GTX 1080 cards and getting the following error.

PyTorch 0.4.0
CUDA 9

Looking for fixes.
thanks,
Buvana

[ds2@blipp73 deepspeech.pytorch]$ python -m multiproc train.py --train-manifest ~/ds2_old_commit/deepspeech.pytorch/ted_train_manifest_sorted.txt --val-manifest ~/ds2_old_commit/deepspeech.pytorch/ted_dev_manifest_sorted.txt --cuda
['train.py', '--train-manifest', '/home/ds2/ds2_old_commit/deepspeech.pytorch/ted_train_manifest_sorted.txt', '--val-manifest', '/home/ds2/ds2_old_commit/deepspeech.pytorch/ted_dev_manifest_sorted.txt', '--cuda', '--world-size', '2', '--rank', '0', '--gpu-rank', '0']
['train.py', '--train-manifest', '/home/ds2/ds2_old_commit/deepspeech.pytorch/ted_train_manifest_sorted.txt', '--val-manifest', '/home/ds2/ds2_old_commit/deepspeech.pytorch/ted_dev_manifest_sorted.txt', '--cuda', '--world-size', '2', '--rank', '1', '--gpu-rank', '1']
DistributedDataParallel(
(module): DeepSpeech(
(conv): MaskConv(
(seq_module): Sequential(
(0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): Hardtanh(min_val=0, max_val=20, inplace)
(3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5))
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Hardtanh(min_val=0, max_val=20, inplace)
)
)
(rnns): Sequential(
(0): BatchRNN(
(rnn): GRU(1312, 800, bidirectional=True)
)
(1): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(2): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(3): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(4): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
)
(fc): Sequential(
(0): SequenceWise (
Sequential(
(0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=800, out_features=29, bias=False)
))
)
(inference_softmax): InferenceBatchSoftmax()
)
)
Number of parameters: 41187968
/home/ds2/deepspeech.pytorch/model.py:98: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
x, h = self.rnn(x)
Traceback (most recent call last):
File "train.py", line 248, in
out, output_sizes = model(inputs, input_sizes)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 217, in forward
return self.gather(outputs, self.output_device)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 226, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
return gather_map(outputs)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 50, in forward
assert all(map(lambda i: i.is_cuda, inputs))
AssertionError
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/conda/conda-bld/pytorch_1524584710464/work/third_party/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524584710464/work/third_party/gloo/gloo/cuda.cu:249: driver shutting down

@spakhomov

Got the same issue. Running without the -m multiproc option on a single 1080 Ti card seems to work, but with the -m multiproc option I get the error below:

Torch: 0.4.1.post2
CUDA: 9.1
NVIDIA: 410.48

DistributedDataParallel(
(module): DeepSpeech(
(conv): MaskConv(
(seq_module): Sequential(
(0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): Hardtanh(min_val=0, max_val=20, inplace)
(3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5))
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Hardtanh(min_val=0, max_val=20, inplace)
)
)
(rnns): Sequential(
(0): BatchRNN(
(rnn): GRU(1312, 800, bidirectional=True)
)
(1): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(2): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(3): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(4): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
)
(fc): Sequential(
(0): SequenceWise (
Sequential(
(0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=800, out_features=29, bias=False)
))
)
(inference_softmax): InferenceBatchSoftmax()
)
)
Number of parameters: 41187968
/workspace/deepspeech.pytorch/model.py:98: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
x, h = self.rnn(x)
Traceback (most recent call last):
File "train.py", line 248, in
out, output_sizes = model(inputs, input_sizes)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 223, in forward
return self.gather(outputs, self.output_device)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in gather
return gather(outputs, output_device, dim=self.dim)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 65, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/cuda/comm.py", line 160, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [16, 7, 29], but expected [16, 16, 29] (gather at torch/csrc/cuda/comm.cpp:183)
frame #0: + 0xc41e6a (0x7f298cecee6a in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #1: + 0x38a5cb (0x7f298c6175cb in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: _PyCFunction_FastCallDict + 0x154 (0x7f29a55d07c4 in /miniconda/envs/py36/bin/python)
frame #3: + 0x19c10c (0x7f29a565e10c in /miniconda/envs/py36/bin/python)
frame #4: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #5: + 0x1950a6 (0x7f29a56570a6 in /miniconda/envs/py36/bin/python)
frame #6: + 0x1960e1 (0x7f29a56580e1 in /miniconda/envs/py36/bin/python)
frame #7: + 0x19c1e5 (0x7f29a565e1e5 in /miniconda/envs/py36/bin/python)
frame #8: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #9: PyEval_EvalCodeEx + 0x972 (0x7f29a5659242 in /miniconda/envs/py36/bin/python)
frame #10: + 0x197a14 (0x7f29a5659a14 in /miniconda/envs/py36/bin/python)
frame #11: PyObject_Call + 0x3e (0x7f29a55d05ce in /miniconda/envs/py36/bin/python)
frame #12: THPFunction_apply(_object*, _object*) + 0x38f (0x7f298c9f5a2f in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #13: PyCFunction_Call + 0x5f (0x7f29a55d379f in /miniconda/envs/py36/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x550c (0x7f29a568761c in /miniconda/envs/py36/bin/python)
frame #15: + 0x1954ce (0x7f29a56574ce in /miniconda/envs/py36/bin/python)
frame #16: _PyFunction_FastCallDict + 0x1bb (0x7f29a56585bb in /miniconda/envs/py36/bin/python)
frame #17: _PyObject_FastCallDict + 0x26f (0x7f29a55d0b8f in /miniconda/envs/py36/bin/python)
frame #18: + 0x129e32 (0x7f29a55ebe32 in /miniconda/envs/py36/bin/python)
frame #19: PyIter_Next + 0xe (0x7f29a561475e in /miniconda/envs/py36/bin/python)
frame #20: PySequence_Tuple + 0xf9 (0x7f29a5619519 in /miniconda/envs/py36/bin/python)
frame #21: + 0x17cfdd (0x7f29a563efdd in /miniconda/envs/py36/bin/python)
frame #22: + 0x19c3f5 (0x7f29a565e3f5 in /miniconda/envs/py36/bin/python)
frame #23: _PyObject_FastCallDict + 0x8b (0x7f29a55d09ab in /miniconda/envs/py36/bin/python)
frame #24: + 0x19c25e (0x7f29a565e25e in /miniconda/envs/py36/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #26: + 0x1954ce (0x7f29a56574ce in /miniconda/envs/py36/bin/python)
frame #27: + 0x1960e1 (0x7f29a56580e1 in /miniconda/envs/py36/bin/python)
frame #28: + 0x19c1e5 (0x7f29a565e1e5 in /miniconda/envs/py36/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #30: + 0x1954ce (0x7f29a56574ce in /miniconda/envs/py36/bin/python)
frame #31: + 0x1960e1 (0x7f29a56580e1 in /miniconda/envs/py36/bin/python)
frame #32: + 0x19c1e5 (0x7f29a565e1e5 in /miniconda/envs/py36/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x10bb (0x7f29a56831cb in /miniconda/envs/py36/bin/python)
frame #34: + 0x195eab (0x7f29a5657eab in /miniconda/envs/py36/bin/python)
frame #35: + 0x19c1e5 (0x7f29a565e1e5 in /miniconda/envs/py36/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #37: + 0x1950a6 (0x7f29a56570a6 in /miniconda/envs/py36/bin/python)
frame #38: _PyFunction_FastCallDict + 0x3db (0x7f29a56587db in /miniconda/envs/py36/bin/python)
frame #39: _PyObject_FastCallDict + 0x26f (0x7f29a55d0b8f in /miniconda/envs/py36/bin/python)
frame #40: _PyObject_Call_Prepend + 0x63 (0x7f29a55d5773 in /miniconda/envs/py36/bin/python)
frame #41: PyObject_Call + 0x3e (0x7f29a55d05ce in /miniconda/envs/py36/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x1a88 (0x7f29a5683b98 in /miniconda/envs/py36/bin/python)
frame #43: + 0x1950a6 (0x7f29a56570a6 in /miniconda/envs/py36/bin/python)
frame #44: _PyFunction_FastCallDict + 0x1bb (0x7f29a56585bb in /miniconda/envs/py36/bin/python)
frame #45: _PyObject_FastCallDict + 0x26f (0x7f29a55d0b8f in /miniconda/envs/py36/bin/python)
frame #46: _PyObject_Call_Prepend + 0x63 (0x7f29a55d5773 in /miniconda/envs/py36/bin/python)
frame #47: PyObject_Call + 0x3e (0x7f29a55d05ce in /miniconda/envs/py36/bin/python)
frame #48: + 0x16a307 (0x7f29a562c307 in /miniconda/envs/py36/bin/python)
frame #49: _PyObject_FastCallDict + 0x8b (0x7f29a55d09ab in /miniconda/envs/py36/bin/python)
frame #50: + 0x19c25e (0x7f29a565e25e in /miniconda/envs/py36/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #52: PyEval_EvalCodeEx + 0x329 (0x7f29a5658bf9 in /miniconda/envs/py36/bin/python)
frame #53: PyEval_EvalCode + 0x1c (0x7f29a565999c in /miniconda/envs/py36/bin/python)
frame #54: + 0x213e44 (0x7f29a56d5e44 in /miniconda/envs/py36/bin/python)
frame #55: PyRun_FileExFlags + 0xa1 (0x7f29a56d6241 in /miniconda/envs/py36/bin/python)
frame #56: PyRun_SimpleFileExFlags + 0x1c4 (0x7f29a56d6444 in /miniconda/envs/py36/bin/python)
frame #57: Py_Main + 0x648 (0x7f29a56d9f78 in /miniconda/envs/py36/bin/python)
frame #58: main + 0xee (0x7f29a55a1efe in /miniconda/envs/py36/bin/python)
frame #59: __libc_start_main + 0xf0 (0x7f29a4cd3830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #60: + 0x1c6f25 (0x7f29a5688f25 in /miniconda/envs/py36/bin/python)

terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/conda/conda-bld/pytorch_1535491974311/work/third_party/gloo/gloo/cuda.cu:267] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1535491974311/work/third_party/gloo/gloo/cuda.cu:267: driver shutting down
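
This is the same shape mismatch as the original report (got [16, 7, 29] vs. expected [16, 16, 29]): with several devices per DDP process, the replicas' outputs are gathered onto one GPU and must have identical shapes, but the output's time dimension follows the utterance lengths in each sub-batch. One illustrative workaround, sketched below and not the repository's fix, is to pad the output's time dimension to a fixed length inside the forward pass so every replica returns the same shape; the wrapper and its max_t argument are hypothetical, and the (batch, time, classes) layout is taken from the error messages above:

import torch
import torch.nn.functional as F

class PadOutputToLength(torch.nn.Module):
    """Hypothetical wrapper: pads the wrapped model's output along the time
    dimension so every replica returns a (batch, max_t, classes) tensor."""
    def __init__(self, model, max_t):
        super().__init__()
        self.model = model
        self.max_t = max_t

    def forward(self, inputs, input_sizes):
        out, output_sizes = self.model(inputs, input_sizes)
        pad_t = self.max_t - out.size(1)
        if pad_t > 0:
            # F.pad pads from the last dimension backwards: (classes_l, classes_r, time_l, time_r)
            out = F.pad(out, (0, 0, 0, pad_t))
        return out, output_sizes

The true frame counts are still returned in output_sizes, so the loss can ignore the padded frames.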

@linzehua

I met the same error.

@jeeyung

jeeyung commented Nov 12, 2018

Same here.
