
Using multiple GPUs for training the FARMReader #224

Closed · sbhttchryy opened this issue Jul 13, 2020 · 8 comments

sbhttchryy commented Jul 13, 2020

Hi, is there any way I can force Haystack to use all my GPUs?

tholor (Member) commented Jul 13, 2020

Hi @sbhttchryy,

Which exact part of the pipeline are you referring to here?
Training/fine-tuning a reader via FARMReader.train(..., use_gpu=True) should already leverage all available GPUs via PyTorch's DataParallel.
For inference, we don't have multi-GPU support implemented right now; in our deployments we rather spin up the REST API with multiple workers (each using a different GPU). Could you elaborate a bit on your use case here?
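
For context, a minimal sketch of what DataParallel-based multi-GPU training looks like in plain PyTorch (the toy model and dummy batch below are purely illustrative, not FARM internals):

```python
import torch
from torch import nn

# Toy model standing in for the reader's underlying network.
model = nn.Linear(128, 2)

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # DataParallel replicates the module on every visible GPU and
    # splits each incoming batch across them along dim 0.
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

batch = torch.randn(32, 128).to(device)  # dummy batch
logits = model(batch)                    # the forward pass runs on all visible GPUs
```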

sbhttchryy (Author) commented Jul 15, 2020

Hi @tholor, thank you for your quick response. This is the code I am using, and I aim to use all the GPUs:

```python
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers
from haystack.database.memory import InMemoryDocumentStore

# Fetch and index the Game of Thrones articles into an in-memory document store
document_store = InMemoryDocumentStore()
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
# print(dicts[:3])
document_store.write_documents(dicts)

# Set up the retriever and reader, then fine-tune the reader on SQuAD-style data
from haystack.retriever.sparse import TfidfRetriever
retriever = TfidfRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True, num_processes=4)
train_data = "/data/home/krystian/haystack_new/data/squad"
reader.train(data_dir=train_data, train_filename="dev-v2.0.json", use_gpu=True, n_epochs=1)
```

It does not use any of the GPUs, and it often logs this message (the same happens if I don't use the num_processes parameter):

```
07/15/2020 12:41:30 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
```

System:

  • OS: Ubuntu 16.04.6 LTS (Xenial Xerus)
  • GPU: Tesla K80
  • CUDA version: 10.1

Thank you.

tholor (Member) commented Jul 15, 2020

Ok, this training code should actually utilize all GPUs.
So let's debug:

  • Can you verify that your GPU is found by PyTorch? (See also the fuller check after this list.)

    import torch
    torch.cuda.is_available()

  • Are you running the latest Haystack master branch?
  • Are you using Colab or a local machine?
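
As a slightly fuller check (a minimal sketch using plain PyTorch, nothing Haystack-specific), this also lists how many GPUs are visible and what they are:

```python
import torch

# False here means PyTorch cannot see any GPU at all.
print("CUDA available:", torch.cuda.is_available())

# Number and names of the GPUs PyTorch can see.
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```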

tholor changed the title from "Forcing Haystack to use multiple GPUs" to "Using multiple GPUs for training the FARMReader" on Jul 15, 2020

sbhttchryy (Author) commented Jul 15, 2020

Thank you for your response.

  1. You are correct: indeed, the GPU is not being found by PyTorch.
  2. Yes, I am using the latest master branch.
  3. I am using a local machine.

tholor (Member) commented Jul 15, 2020

Ok, this sometimes happens if you install PyTorch via pip.
I would try:

    conda install pytorch cudatoolkit=10.2 -c pytorch

(with the CUDA version that you have installed locally; see also the official docs)

If that doesn't help, you will probably need to check the Nvidia drivers, CUDA, ...
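
After reinstalling, a quick sanity check (a minimal sketch, independent of Haystack) is to confirm that the PyTorch build reports a CUDA version and sees the expected number of GPUs:

```python
import torch

# CUDA version this PyTorch build was compiled against (None for CPU-only builds).
print("Built with CUDA:", torch.version.cuda)

# Should now be True, with a non-zero GPU count.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```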

sbhttchryy (Author) commented Jul 15, 2020

Indeed. Thank you very much. Now it utilizes all the GPUs. However, it runs into the following error.

RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:403

The full log is as follows:
```
07/15/2020 13:14:37 - INFO - haystack.indexing.utils - Found data stored in data/article_txt_got. Delete this first if you really want to fetch new data.
07/15/2020 13:14:37 - INFO - haystack.retriever.sparse - Found 2811 candidate paragraphs from 2811 docs in DB
07/15/2020 13:14:38 - INFO - farm.utils - device: cuda n_gpu: 4, distributed training: False, automatic mixed precision training: None
07/15/2020 13:14:38 - INFO - farm.infer - Could not find distilbert-base-uncased-distilled-squad locally. Try to download from model hub ...
07/15/2020 13:14:40 - WARNING - farm.modeling.language_model - Could not automatically detect from language model name what language it is.
We guess it's an ENGLISH model ...
If not: Init the language model by supplying the 'language' param.
07/15/2020 13:14:42 - WARNING - farm.modeling.prediction_head - Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {"loss_ignore_index": -1}
07/15/2020 13:14:46 - INFO - farm.utils - device: cuda n_gpu: 4, distributed training: False, automatic mixed precision training: None
07/15/2020 13:14:46 - INFO - farm.infer - Got ya 4 parallel workers to do inference ...
07/15/2020 13:14:46 - INFO - farm.infer -  0    0    0    0
07/15/2020 13:14:46 - INFO - farm.infer - /w\  /w\  /w\  /w\
07/15/2020 13:14:46 - INFO - farm.infer - /'\  / \  /'\  /'\
07/15/2020 13:14:46 - INFO - farm.infer -
07/15/2020 13:14:46 - INFO - farm.utils - device: cuda n_gpu: 4, distributed training: False, automatic mixed precision training: None
Preprocessing Dataset /data/home/krystian/haystack_new/data/squad/dev-v2.0.json: 100%|█████████████████████████████████████████████| 1204/1204 [00:03<00:00, 315.28 Dicts/s]
Train epoch 0/0 (Cur. train loss: 0.0000):   0%| | 0/1220 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/data/home/krystian/haystack_new/Tutorial1_Basic_QA_Pipeline_training.py", line 23, in <module>
    reader.train(data_dir=train_data, train_filename="dev-v2.0.json", use_gpu=True, n_epochs=1)
  File "/data/home/krystian/haystack_new/haystack/haystack/reader/farm.py", line 199, in train
    self.inferencer.model = trainer.train()
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/farm/train.py", line 290, in train
    logits = self.model.forward(**batch)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/farm/modeling/adaptive_model.py", line 397, in forward
    sequence_output, pooled_output = self.forward_lm(**kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/farm/modeling/adaptive_model.py", line 441, in forward_lm
    sequence_output, pooled_output = self.language_model(**kwargs, output_all_encoded_layers=False)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/farm/modeling/language_model.py", line 792, in forward
    attention_mask=padding_mask,
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/transformers/modeling_distilbert.py", line 466, in forward
    inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/transformers/modeling_distilbert.py", line 91, in forward
    word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, dim)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:403
```

Thank you again. :)

tholor (Member) commented Jul 15, 2020

That's indeed a bug and I was able to reproduce this. Adding a quick fix via #234. Thanks for reporting this!

tholor (Member) commented Jul 18, 2020

Closing this as it was fixed by #234.
Feel free to reopen if this didn't solve your problem, @sbhttchryy.
