
training rel detector using multi gpus #37

Open
wtliao opened this issue Nov 2, 2018 · 2 comments

wtliao commented Nov 2, 2018

Hi,
I have successfully trained the detector using multiple GPUs (8), but I hit the following error when training the rel detector on more than one GPU (tried on 1080 Ti, P100, and K40):

Traceback (most recent call last):
  File "/home/wtliao/work_space/neural-motifs-master-backup/models/train_rels.py", line 229, in <module>
    rez = train_epoch(epoch)
  File "/home/wtliao/work_space/neural-motifs-master-backup/models/train_rels.py", line 135, in train_epoch
    tr.append(train_batch(batch, verbose=b % (conf.print_interval*10) == 0)) #b == 0))
  File "/home/wtliao/work_space/neural-motifs-master-backup/models/train_rels.py", line 179, in train_batch
    loss.backward()
  File "/home/wtliao/anaconda2/envs/mofit/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/wtliao/anaconda2/envs/mofit/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: narrow is not implemented for type UndefinedType

The code works fine on a single GPU. I have no idea what causes this, and I can't find a solution by googling. Do you have any idea? Thanks.

rowanz (Owner) commented Nov 2, 2018

Sorry, I don't support training the relationship model on multiple GPUs right now (it's not what I used for these experiments). I found it doesn't actually help much in terms of speedup, as the LSTMs are kinda slow and hard to parallelize.

wtliao (Author) commented Nov 5, 2018

Thanks, got it. The issue happens in the backward pass of the LSTM.
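
For reference, one way this error can arise with nn.DataParallel on older PyTorch is when some gathered replica outputs never contribute to the loss, leaving their gradients undefined when backward() runs. Below is a minimal, hypothetical sketch (a toy model, not the neural-motifs code) of the usual workaround: compute the loss inside the replicated module so that only per-replica loss tensors get gathered.

import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Wrap a model and its criterion so each replica returns its own loss."""

    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, x, target):
        pred = self.model(x)
        # Return a 1-element tensor per replica; DataParallel concatenates
        # the replica outputs along dim 0, and .mean() below averages them.
        return self.criterion(pred, target).unsqueeze(0)

# Toy stand-in for the rel detector (for illustration only).
base = nn.Linear(16, 4)
wrapped = ModelWithLoss(base, nn.CrossEntropyLoss())

x = torch.randn(32, 16)
y = torch.randint(0, 4, (32,))

if torch.cuda.is_available():
    wrapped, x, y = wrapped.cuda(), x.cuda(), y.cuda()
    if torch.cuda.device_count() > 1:
        wrapped = nn.DataParallel(wrapped)

loss = wrapped(x, y).mean()   # average of per-GPU losses (or the single loss)
loss.backward()

Because the loss is formed inside forward(), every tensor crossing the DataParallel gather carries a gradient, which sidesteps the undefined-gradient case during backward().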
