
Error when using multi-GPU training #324

Closed
luozhiping opened this issue Jul 14, 2018 · 9 comments
Comments

@luozhiping

Why can't I train on multi-GPU?

python -m multiproc train.py --train-manifest qkids/manifest/qkids_train_manifest_limit_250.csv --val-manifest qkids/manifest/qkids_test_manifest_limit_never_train.csv --cuda --model-path models/libri_final_and_limit.pth --epochs 50 --checkpoint --checkpoint-per-batch 1000 --batch-size 20 --tensorboard --log-params --id libri_final_and_limit
['train.py', '--train-manifest', 'qkids/manifest/qkids_train_manifest_limit_250.csv', '--val-manifest', 'qkids/manifest/qkids_test_manifest_limit_never_train.csv', '--cuda', '--model-path', 'models/libri_final_and_limit.pth', '--epochs', '50', '--checkpoint', '--checkpoint-per-batch', '1000', '--batch-size', '20', '--tensorboard', '--log-params', '--id', 'libri_final_and_limit', '--world-size', '2', '--rank', '0', '--gpu-rank', '0']
['train.py', '--train-manifest', 'qkids/manifest/qkids_train_manifest_limit_250.csv', '--val-manifest', 'qkids/manifest/qkids_test_manifest_limit_never_train.csv', '--cuda', '--model-path', 'models/libri_final_and_limit.pth', '--epochs', '50', '--checkpoint', '--checkpoint-per-batch', '1000', '--batch-size', '20', '--tensorboard', '--log-params', '--id', 'libri_final_and_limit', '--world-size', '2', '--rank', '1', '--gpu-rank', '1']
DistributedDataParallel(
(module): DeepSpeech(
(conv): MaskConv(
(seq_module): Sequential(
(0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): Hardtanh(min_val=0, max_val=20, inplace)
(3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5))
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Hardtanh(min_val=0, max_val=20, inplace)
)
)
(rnns): Sequential(
(0): BatchRNN(
(rnn): GRU(1312, 800, bidirectional=True)
)
(1): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(2): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(3): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(4): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
)
(fc): Sequential(
(0): SequenceWise (
Sequential(
(0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=800, out_features=29, bias=False)
))
)
(inference_softmax): InferenceBatchSoftmax()
)
)
Number of parameters: 41187968
/home/luozhiping/workspace/speech/deepspeech.pytorch/model.py:98: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
x, h = self.rnn(x)
Traceback (most recent call last):
File "train.py", line 246, in
out, output_sizes = model(inputs, input_sizes)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 217, in forward
return self.gather(outputs, self.output_device)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 226, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
return gather_map(outputs)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 55, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 186, in gather
"but expected {}".format(got, expected))
ValueError: gather got an input of invalid size: got 10x110x29, but expected 10x226x29
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249: driver shutting down

@luozhiping
Author

Multi-GPU training worked fine before; today I pulled the newest code from master.
There is another question: I see that the model's architecture has changed, so I can't continue-from my prior model. Is there a way to fix that?

@SeanNaren
Owner

@luozhiping the only way is to revert to a previous commit (that would probably work). Additional fixes have been added to the master branch for correctness/speed etc., but they require re-training the model because the architecture has changed. Not the most convenient thing on the planet, but variable lengths etc. make a large difference!
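
For anyone who needs to keep training from an old checkpoint, a minimal sketch of the revert described above, assuming your old checkpoint lives under models/ (the hash shown is just the commit cited later in this thread; substitute the commit your checkpoint was actually trained with):

git clone https://github.com/SeanNaren/deepspeech.pytorch.git
cd deepspeech.pytorch
git checkout -b old-architecture 655cd58    # substitute the commit your checkpoint matches
python train.py --cuda --continue-from models/libri_final_and_limit.pth --train-manifest qkids/manifest/qkids_train_manifest_limit_250.csv --val-manifest qkids/manifest/qkids_test_manifest_limit_never_train.csv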

@SeanNaren
Owner

I'll close this since the issue's resolved :)

@luozhiping
Author

@SeanNaren I got that and I will retrain my model. But you didn't answer why I get an error when I train on multi-GPU; it doesn't appear when I use only one GPU. The error trace is above:

ValueError: gather got an input of invalid size: got 10x110x29, but expected 10x226x29
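
For context on why the message names two sizes: DistributedDataParallel with multiple devices per process gathers each replica's output back onto one GPU by concatenating along the batch dimension, so every other dimension (here the time dimension, 110 vs. 226 frames) has to match across replicas, and with variable-length utterances it generally won't. A minimal CPU sketch of that shape constraint (illustrative only, not the project's code):

import torch
import torch.nn.functional as F

a = torch.randn(10, 110, 29)  # output of one replica: (batch, time, classes)
b = torch.randn(10, 226, 29)  # output of another replica with longer utterances

# Concatenating along dim 0, as gather does, fails here: sizes must match in every
# dimension except 0.
# torch.cat([a, b], dim=0)  # RuntimeError

# Padding the shorter output along the time dimension makes the shapes compatible:
a_padded = F.pad(a, (0, 0, 0, b.size(1) - a.size(1)))  # pad order: (classes_l, classes_r, time_l, time_r)
merged = torch.cat([a_padded, b], dim=0)
print(merged.shape)  # torch.Size([20, 226, 29])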

@fanlu

fanlu commented Sep 11, 2018

This error doesn't seem to be solved yet.

root@7096c8ab06ef:/workspace/deepspeech.pytorch# python -m multiproc train.py --cuda  --train-manifest /workspace/data/libri_train_manifest.csv --val-manifest /workspace/data/libri_val_manifest.csv                     
['train.py', '--cuda', '--train-manifest', '/workspace/data/libri_train_manifest.csv', '--val-manifest', '/workspace/data/libri_val_manifest.csv', '--world-size', '4', '--rank', '0', '--gpu-rank', '0']
['train.py', '--cuda', '--train-manifest', '/workspace/data/libri_train_manifest.csv', '--val-manifest', '/workspace/data/libri_val_manifest.csv', '--world-size', '4', '--rank', '1', '--gpu-rank', '1']
['train.py', '--cuda', '--train-manifest', '/workspace/data/libri_train_manifest.csv', '--val-manifest', '/workspace/data/libri_val_manifest.csv', '--world-size', '4', '--rank', '2', '--gpu-rank', '2']
['train.py', '--cuda', '--train-manifest', '/workspace/data/libri_train_manifest.csv', '--val-manifest', '/workspace/data/libri_val_manifest.csv', '--world-size', '4', '--rank', '3', '--gpu-rank', '3']
Model Save directory already exists.
Traceback (most recent call last):
  File "train.py", line 256, in <module>
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=(int(args.gpu_rank),) if args.rank else None)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 140, in __init__
    self._module_copies = replicate(self.module, self.device_ids, detach=True)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable (allocate at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCCachingAllocator.cpp:510)
frame #0: THCStorage_resize + 0x123 (0x7faa44cf94e3 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #1: THCTensor_resizeNd + 0x30f (0x7faa44d0717f in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: THCudaTensor_newWithStorage + 0xfa (0x7faa44d0d65a in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: at::CUDAFloatType::th_tensor(at::ArrayRef<long>) const + 0xa5 (0x7faa44c2d745 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::native::tensor(at::Type const&, at::ArrayRef<long>) + 0x3a (0x7faa67da37da in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #5: at::Type::tensor(at::ArrayRef<long>) const + 0x9 (0x7faa67f91b69 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: torch::autograd::VariableType::tensor(at::ArrayRef<long>) const + 0x44 (0x7faa69c13d04 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: torch::cuda::broadcast(at::Tensor const&, at::ArrayRef<long>) + 0x194 (0x7faa6a0c5dc4 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #8: torch::cuda::broadcast_coalesced(at::ArrayRef<at::Tensor>, at::ArrayRef<long>, unsigned long) + 0xa10 (0x7faa6a0c7060 in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0xc423cb (0x7faa6a0cb3cb in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x38a5cb (0x7faa698135cb in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #11: _PyCFunction_FastCallDict + 0x154 (0x556ffc3457c4 in /miniconda/envs/py36/bin/python)
frame #12: <unknown function> + 0x19c10c (0x556ffc3d310c in /miniconda/envs/py36/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x30a (0x556ffc3f741a in /miniconda/envs/py36/bin/python)
frame #14: <unknown function> + 0x1950a6 (0x556ffc3cc0a6 in /miniconda/envs/py36/bin/python)
frame #15: <unknown function> + 0x1960e1 (0x556ffc3cd0e1 in /miniconda/envs/py36/bin/python)
frame #16: <unknown function> + 0x19c1e5 (0x556ffc3d31e5 in /miniconda/envs/py36/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x30a (0x556ffc3f741a in /miniconda/envs/py36/bin/python)
frame #18: PyEval_EvalCodeEx + 0x329 (0x556ffc3cdbf9 in /miniconda/envs/py36/bin/python)
frame #19: <unknown function> + 0x197a14 (0x556ffc3cea14 in /miniconda/envs/py36/bin/python)
frame #20: PyObject_Call + 0x3e (0x556ffc3455ce in /miniconda/envs/py36/bin/python)
frame #21: THPFunction_apply(_object*, _object*) + 0x38f (0x7faa69bf1a2f in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #22: PyCFunction_Call + 0x5f (0x556ffc34879f in /miniconda/envs/py36/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x550c (0x556ffc3fc61c in /miniconda/envs/py36/bin/python)
frame #24: <unknown function> + 0x1954ce (0x556ffc3cc4ce in /miniconda/envs/py36/bin/python)
frame #25: <unknown function> + 0x1960e1 (0x556ffc3cd0e1 in /miniconda/envs/py36/bin/python)
frame #26: <unknown function> + 0x19c1e5 (0x556ffc3d31e5 in /miniconda/envs/py36/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x10bb (0x556ffc3f81cb in /miniconda/envs/py36/bin/python)
frame #28: <unknown function> + 0x195366 (0x556ffc3cc366 in /miniconda/envs/py36/bin/python)
frame #29: _PyFunction_FastCallDict + 0x3db (0x556ffc3cd7db in /miniconda/envs/py36/bin/python)
frame #30: _PyObject_FastCallDict + 0x26f (0x556ffc345b8f in /miniconda/envs/py36/bin/python)
frame #31: _PyObject_Call_Prepend + 0x63 (0x556ffc34a773 in /miniconda/envs/py36/bin/python)
frame #32: PyObject_Call + 0x3e (0x556ffc3455ce in /miniconda/envs/py36/bin/python)
frame #33: <unknown function> + 0x16996b (0x556ffc3a096b in /miniconda/envs/py36/bin/python)
frame #34: <unknown function> + 0x19c447 (0x556ffc3d3447 in /miniconda/envs/py36/bin/python)
frame #35: _PyObject_FastCallDict + 0x8b (0x556ffc3459ab in /miniconda/envs/py36/bin/python)
frame #36: _PyObject_FastCallKeywords + 0xaa (0x556ffc3cd3ca in /miniconda/envs/py36/bin/python)
frame #37: <unknown function> + 0x19c25e (0x556ffc3d325e in /miniconda/envs/py36/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x10bb (0x556ffc3f81cb in /miniconda/envs/py36/bin/python)
frame #39: PyEval_EvalCodeEx + 0x329 (0x556ffc3cdbf9 in /miniconda/envs/py36/bin/python)
frame #40: PyEval_EvalCode + 0x1c (0x556ffc3ce99c in /miniconda/envs/py36/bin/python)
frame #41: <unknown function> + 0x213e44 (0x556ffc44ae44 in /miniconda/envs/py36/bin/python)
frame #42: PyRun_FileExFlags + 0xa1 (0x556ffc44b241 in /miniconda/envs/py36/bin/python)
frame #43: PyRun_SimpleFileExFlags + 0x1c4 (0x556ffc44b444 in /miniconda/envs/py36/bin/python)
frame #44: Py_Main + 0x648 (0x556ffc44ef78 in /miniconda/envs/py36/bin/python)
frame #45: main + 0xee (0x556ffc316efe in /miniconda/envs/py36/bin/python)
frame #46: __libc_start_main + 0xf0 (0x7faa80cc3830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #47: <unknown function> + 0x1c6f25 (0x556ffc3fdf25 in /miniconda/envs/py36/bin/python)

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /opt/conda/conda-bld/pytorch_1535491974311/work/third_party/gloo/gloo/cuda.cu:267] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1535491974311/work/third_party/gloo/gloo/cuda.cu:267: driver shutting down
root@7096c8ab06ef:/workspace/deepspeech.pytorch# git log -5 
commit 655cd586de798b3656c9c52242a11cde3d8a6bd4
Merge: 51f742e de4280d
Author: Sean Naren <[email protected]>
Date:   Fri Jul 6 07:27:34 2018 +0100

    Merge pull request #318 from dalonlobo/master
    
    Update to readme and train.py for multiproc
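
This failure is different from the gather error above: it happens while DistributedDataParallel is still replicating the model (broadcast_coalesced), before any forward pass, with CUDA reporting every device as busy or unavailable. One thing worth checking in this kind of multiproc run is that each process pins itself to its own GPU before building and wrapping the model; a minimal sketch under that assumption (the backend, address, and GPU index are placeholders, not the project's verified fix):

import torch
import torch.distributed as dist

# Placeholder rendezvous settings; train.py takes these from its own arguments
# (--world-size, --rank, etc., as appended by multiproc in the logs above).
dist.init_process_group(backend='gloo', init_method='tcp://127.0.0.1:1550',
                        world_size=2, rank=0)

gpu_rank = 0                                     # this process's GPU (from --gpu-rank)
torch.cuda.set_device(gpu_rank)                  # pin the process to its GPU first
model = torch.nn.Linear(10, 10).cuda(gpu_rank)   # stand-in for the DeepSpeech model
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[gpu_rank],                       # a single id: one replica, no intra-process scatter
    output_device=gpu_rank)

Passing a single device id per process also sidesteps the gather-size problem entirely, since there is nothing to gather inside a process.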

@abuvaneswari

abuvaneswari commented Sep 19, 2018

Hi,

I used the same commit (655cd58) as mentioned earlier by @SeanNaren

Works fine on a single GPU. Trying to run DDP across two GTX 1080 cards and getting the following error.

PyTorch 0.4.0
CUDA 9

Looking for fixes.
thanks,
Buvana

[ds2@blipp73 deepspeech.pytorch]$ python -m multiproc train.py --train-manifest ~/ds2_old_commit/deepspeech.pytorch/ted_train_manifest_sorted.txt --val-manifest ~/ds2_old_commit/deepspeech.pytorch/ted_dev_manifest_sorted.txt --cuda
['train.py', '--train-manifest', '/home/ds2/ds2_old_commit/deepspeech.pytorch/ted_train_manifest_sorted.txt', '--val-manifest', '/home/ds2/ds2_old_commit/deepspeech.pytorch/ted_dev_manifest_sorted.txt', '--cuda', '--world-size', '2', '--rank', '0', '--gpu-rank', '0']
['train.py', '--train-manifest', '/home/ds2/ds2_old_commit/deepspeech.pytorch/ted_train_manifest_sorted.txt', '--val-manifest', '/home/ds2/ds2_old_commit/deepspeech.pytorch/ted_dev_manifest_sorted.txt', '--cuda', '--world-size', '2', '--rank', '1', '--gpu-rank', '1']
DistributedDataParallel(
(module): DeepSpeech(
(conv): MaskConv(
(seq_module): Sequential(
(0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): Hardtanh(min_val=0, max_val=20, inplace)
(3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5))
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Hardtanh(min_val=0, max_val=20, inplace)
)
)
(rnns): Sequential(
(0): BatchRNN(
(rnn): GRU(1312, 800, bidirectional=True)
)
(1): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(2): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(3): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(4): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
)
(fc): Sequential(
(0): SequenceWise (
Sequential(
(0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=800, out_features=29, bias=False)
))
)
(inference_softmax): InferenceBatchSoftmax()
)
)
Number of parameters: 41187968
/home/ds2/deepspeech.pytorch/model.py:98: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
x, h = self.rnn(x)
Traceback (most recent call last):
File "train.py", line 248, in
out, output_sizes = model(inputs, input_sizes)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 217, in forward
return self.gather(outputs, self.output_device)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 226, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
return gather_map(outputs)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/ds2/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 50, in forward
assert all(map(lambda i: i.is_cuda, inputs))
AssertionError
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/conda/conda-bld/pytorch_1524584710464/work/third_party/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524584710464/work/third_party/gloo/gloo/cuda.cu:249: driver shutting down

@spakhomov

Got the same issue. Running without the -m multiproc option on a single 1080 Ti card seems to work, but with the -m multiproc option I get the error below:

Torch: 0.4.1.post2
CUDA: 9.1
NVIDIA: 410.48

DistributedDataParallel(
(module): DeepSpeech(
(conv): MaskConv(
(seq_module): Sequential(
(0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): Hardtanh(min_val=0, max_val=20, inplace)
(3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5))
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): Hardtanh(min_val=0, max_val=20, inplace)
)
)
(rnns): Sequential(
(0): BatchRNN(
(rnn): GRU(1312, 800, bidirectional=True)
)
(1): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(2): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(3): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
(4): BatchRNN(
(batch_norm): SequenceWise (
BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
(rnn): GRU(800, 800, bidirectional=True)
)
)
(fc): Sequential(
(0): SequenceWise (
Sequential(
(0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(1): Linear(in_features=800, out_features=29, bias=False)
))
)
(inference_softmax): InferenceBatchSoftmax()
)
)
Number of parameters: 41187968
/workspace/deepspeech.pytorch/model.py:98: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
x, h = self.rnn(x)
Traceback (most recent call last):
File "train.py", line 248, in
out, output_sizes = model(inputs, input_sizes)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 223, in forward
return self.gather(outputs, self.output_device)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 232, in gather
return gather(outputs, output_device, dim=self.dim)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 65, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/cuda/comm.py", line 160, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [16, 7, 29], but expected [16, 16, 29] (gather at torch/csrc/cuda/comm.cpp:183)
frame #0: + 0xc41e6a (0x7f298cecee6a in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #1: + 0x38a5cb (0x7f298c6175cb in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: _PyCFunction_FastCallDict + 0x154 (0x7f29a55d07c4 in /miniconda/envs/py36/bin/python)
frame #3: + 0x19c10c (0x7f29a565e10c in /miniconda/envs/py36/bin/python)
frame #4: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #5: + 0x1950a6 (0x7f29a56570a6 in /miniconda/envs/py36/bin/python)
frame #6: + 0x1960e1 (0x7f29a56580e1 in /miniconda/envs/py36/bin/python)
frame #7: + 0x19c1e5 (0x7f29a565e1e5 in /miniconda/envs/py36/bin/python)
frame #8: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #9: PyEval_EvalCodeEx + 0x972 (0x7f29a5659242 in /miniconda/envs/py36/bin/python)
frame #10: + 0x197a14 (0x7f29a5659a14 in /miniconda/envs/py36/bin/python)
frame #11: PyObject_Call + 0x3e (0x7f29a55d05ce in /miniconda/envs/py36/bin/python)
frame #12: THPFunction_apply(_object*, _object*) + 0x38f (0x7f298c9f5a2f in /miniconda/envs/py36/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #13: PyCFunction_Call + 0x5f (0x7f29a55d379f in /miniconda/envs/py36/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x550c (0x7f29a568761c in /miniconda/envs/py36/bin/python)
frame #15: + 0x1954ce (0x7f29a56574ce in /miniconda/envs/py36/bin/python)
frame #16: _PyFunction_FastCallDict + 0x1bb (0x7f29a56585bb in /miniconda/envs/py36/bin/python)
frame #17: _PyObject_FastCallDict + 0x26f (0x7f29a55d0b8f in /miniconda/envs/py36/bin/python)
frame #18: + 0x129e32 (0x7f29a55ebe32 in /miniconda/envs/py36/bin/python)
frame #19: PyIter_Next + 0xe (0x7f29a561475e in /miniconda/envs/py36/bin/python)
frame #20: PySequence_Tuple + 0xf9 (0x7f29a5619519 in /miniconda/envs/py36/bin/python)
frame #21: + 0x17cfdd (0x7f29a563efdd in /miniconda/envs/py36/bin/python)
frame #22: + 0x19c3f5 (0x7f29a565e3f5 in /miniconda/envs/py36/bin/python)
frame #23: _PyObject_FastCallDict + 0x8b (0x7f29a55d09ab in /miniconda/envs/py36/bin/python)
frame #24: + 0x19c25e (0x7f29a565e25e in /miniconda/envs/py36/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #26: + 0x1954ce (0x7f29a56574ce in /miniconda/envs/py36/bin/python)
frame #27: + 0x1960e1 (0x7f29a56580e1 in /miniconda/envs/py36/bin/python)
frame #28: + 0x19c1e5 (0x7f29a565e1e5 in /miniconda/envs/py36/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #30: + 0x1954ce (0x7f29a56574ce in /miniconda/envs/py36/bin/python)
frame #31: + 0x1960e1 (0x7f29a56580e1 in /miniconda/envs/py36/bin/python)
frame #32: + 0x19c1e5 (0x7f29a565e1e5 in /miniconda/envs/py36/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x10bb (0x7f29a56831cb in /miniconda/envs/py36/bin/python)
frame #34: + 0x195eab (0x7f29a5657eab in /miniconda/envs/py36/bin/python)
frame #35: + 0x19c1e5 (0x7f29a565e1e5 in /miniconda/envs/py36/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #37: + 0x1950a6 (0x7f29a56570a6 in /miniconda/envs/py36/bin/python)
frame #38: _PyFunction_FastCallDict + 0x3db (0x7f29a56587db in /miniconda/envs/py36/bin/python)
frame #39: _PyObject_FastCallDict + 0x26f (0x7f29a55d0b8f in /miniconda/envs/py36/bin/python)
frame #40: _PyObject_Call_Prepend + 0x63 (0x7f29a55d5773 in /miniconda/envs/py36/bin/python)
frame #41: PyObject_Call + 0x3e (0x7f29a55d05ce in /miniconda/envs/py36/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x1a88 (0x7f29a5683b98 in /miniconda/envs/py36/bin/python)
frame #43: + 0x1950a6 (0x7f29a56570a6 in /miniconda/envs/py36/bin/python)
frame #44: _PyFunction_FastCallDict + 0x1bb (0x7f29a56585bb in /miniconda/envs/py36/bin/python)
frame #45: _PyObject_FastCallDict + 0x26f (0x7f29a55d0b8f in /miniconda/envs/py36/bin/python)
frame #46: _PyObject_Call_Prepend + 0x63 (0x7f29a55d5773 in /miniconda/envs/py36/bin/python)
frame #47: PyObject_Call + 0x3e (0x7f29a55d05ce in /miniconda/envs/py36/bin/python)
frame #48: + 0x16a307 (0x7f29a562c307 in /miniconda/envs/py36/bin/python)
frame #49: _PyObject_FastCallDict + 0x8b (0x7f29a55d09ab in /miniconda/envs/py36/bin/python)
frame #50: + 0x19c25e (0x7f29a565e25e in /miniconda/envs/py36/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x30a (0x7f29a568241a in /miniconda/envs/py36/bin/python)
frame #52: PyEval_EvalCodeEx + 0x329 (0x7f29a5658bf9 in /miniconda/envs/py36/bin/python)
frame #53: PyEval_EvalCode + 0x1c (0x7f29a565999c in /miniconda/envs/py36/bin/python)
frame #54: + 0x213e44 (0x7f29a56d5e44 in /miniconda/envs/py36/bin/python)
frame #55: PyRun_FileExFlags + 0xa1 (0x7f29a56d6241 in /miniconda/envs/py36/bin/python)
frame #56: PyRun_SimpleFileExFlags + 0x1c4 (0x7f29a56d6444 in /miniconda/envs/py36/bin/python)
frame #57: Py_Main + 0x648 (0x7f29a56d9f78 in /miniconda/envs/py36/bin/python)
frame #58: main + 0xee (0x7f29a55a1efe in /miniconda/envs/py36/bin/python)
frame #59: __libc_start_main + 0xf0 (0x7f29a4cd3830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #60: + 0x1c6f25 (0x7f29a5688f25 in /miniconda/envs/py36/bin/python)

terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /opt/conda/conda-bld/pytorch_1535491974311/work/third_party/gloo/gloo/cuda.cu:267] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1535491974311/work/third_party/gloo/gloo/cuda.cu:267: driver shutting down
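
This is the same shape mismatch as the original report (got [16, 7, 29] vs. expected [16, 16, 29]): with several devices per DDP process, the replicas' outputs are gathered onto one GPU and must have identical shapes, but the output's time dimension follows the utterance lengths in each sub-batch. One illustrative workaround, sketched below and not the repository's fix, is to pad the output's time dimension to a fixed length inside the forward pass so every replica returns the same shape; the wrapper and its max_t argument are hypothetical, and the (batch, time, classes) layout is taken from the error messages above:

import torch
import torch.nn.functional as F

class PadOutputToLength(torch.nn.Module):
    """Hypothetical wrapper: pads the wrapped model's output along the time
    dimension so every replica returns a (batch, max_t, classes) tensor."""
    def __init__(self, model, max_t):
        super().__init__()
        self.model = model
        self.max_t = max_t

    def forward(self, inputs, input_sizes):
        out, output_sizes = self.model(inputs, input_sizes)
        pad_t = self.max_t - out.size(1)
        if pad_t > 0:
            # F.pad pads from the last dimension backwards: (classes_l, classes_r, time_l, time_r)
            out = F.pad(out, (0, 0, 0, pad_t))
        return out, output_sizes

The true frame counts are still returned in output_sizes, so the loss can ignore the padded frames.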

@linzehua

I met the same error.

@jeeyung

jeeyung commented Nov 12, 2018

Same here.
