
Using multiple GPUs for training the FARMReader #224

Closed · sbhttchryy opened this issue Jul 13, 2020 · 8 comments

sbhttchryy commented Jul 13, 2020

Hi, is there any way I can force Haystack to use all my GPUs?

tholor (Member) commented Jul 13, 2020

Hi @sbhttchryy,

Which exact part of the pipeline are you referring to here?
Training/fine-tuning a reader via FARMReader.train(..., use_gpu=True) should already leverage all available GPUs via PyTorch's DataParallel.
For inference, we don't have multi-GPU support implemented right now; in our deployments we rather spin up the REST API with multiple workers (each using a different GPU). Could you elaborate a bit on your use case here?
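
For context, a minimal sketch of what DataParallel-based multi-GPU training looks like in plain PyTorch (the toy model and dummy batch below are purely illustrative, not FARM internals):

```python
import torch
from torch import nn

# Toy model standing in for the reader's underlying network.
model = nn.Linear(128, 2)

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # DataParallel replicates the module on every visible GPU and
    # splits each incoming batch across them along dim 0.
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

batch = torch.randn(32, 128).to(device)  # dummy batch
logits = model(batch)                    # the forward pass runs on all visible GPUs
```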

sbhttchryy (Author) commented Jul 15, 2020

Hi @tholor, thank you for your quick response. This is the code I am using, and I aim to use all the GPUs:

```python
from haystack import Finder
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers
from haystack.database.memory import InMemoryDocumentStore

# Fetch and index the Game of Thrones articles into an in-memory document store
document_store = InMemoryDocumentStore()
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
# print(dicts[:3])
document_store.write_documents(dicts)

# Set up the retriever and reader, then fine-tune the reader on SQuAD-style data
from haystack.retriever.sparse import TfidfRetriever
retriever = TfidfRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True, num_processes=4)
train_data = "/data/home/krystian/haystack_new/data/squad"
reader.train(data_dir=train_data, train_filename="dev-v2.0.json", use_gpu=True, n_epochs=1)
```

It does not use any of the GPUs, and it often logs this message (the same happens if I don't use the num_processes parameter):

```
07/15/2020 12:41:30 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
```

System:

  • OS: Ubuntu 16.04.6 LTS (Xenial Xerus)
  • GPU: Tesla K80
  • CUDA version: 10.1

Thank you.

tholor (Member) commented Jul 15, 2020

Ok, this training code should actually utilize all GPUs.
So let's debug:

  • Can you verify that your GPU is found by PyTorch? (See also the fuller check after this list.)

    import torch
    torch.cuda.is_available()

  • Are you running the latest Haystack master branch?
  • Are you using Colab or a local machine?
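
As a slightly fuller check (a minimal sketch using plain PyTorch, nothing Haystack-specific), this also lists how many GPUs are visible and what they are:

```python
import torch

# False here means PyTorch cannot see any GPU at all.
print("CUDA available:", torch.cuda.is_available())

# Number and names of the GPUs PyTorch can see.
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```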

tholor changed the title from "Forcing Haystack to use multiple GPUs" to "Using multiple GPUs for training the FARMReader" on Jul 15, 2020

sbhttchryy (Author) commented Jul 15, 2020

Thank you for your response.

  1. You are correct: indeed, the GPU is not being found by PyTorch.
  2. Yes, I am using the latest master branch.
  3. I am using a local machine.

tholor (Member) commented Jul 15, 2020

Ok, this sometimes happens if you install PyTorch via pip.
I would try:

    conda install pytorch cudatoolkit=10.2 -c pytorch

(with the CUDA version that you have installed locally; see also the official docs)

If that doesn't help, you will probably need to check the Nvidia drivers, CUDA, ...
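
After reinstalling, a quick sanity check (a minimal sketch, independent of Haystack) is to confirm that the PyTorch build reports a CUDA version and sees the expected number of GPUs:

```python
import torch

# CUDA version this PyTorch build was compiled against (None for CPU-only builds).
print("Built with CUDA:", torch.version.cuda)

# Should now be True, with a non-zero GPU count.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```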

sbhttchryy (Author) commented Jul 15, 2020

Indeed. Thank you very much. Now it utilizes all the GPUs. However, it runs into the following error.

RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:403

The full log is as follows:
```
07/15/2020 13:14:37 - INFO - haystack.indexing.utils - Found data stored in data/article_txt_got. Delete this first if you really want to fetch new data.
07/15/2020 13:14:37 - INFO - haystack.retriever.sparse - Found 2811 candidate paragraphs from 2811 docs in DB
07/15/2020 13:14:38 - INFO - farm.utils - device: cuda n_gpu: 4, distributed training: False, automatic mixed precision training: None
07/15/2020 13:14:38 - INFO - farm.infer - Could not find distilbert-base-uncased-distilled-squad locally. Try to download from model hub ...
07/15/2020 13:14:40 - WARNING - farm.modeling.language_model - Could not automatically detect from language model name what language it is.
We guess it's an ENGLISH model ...
If not: Init the language model by supplying the 'language' param.
07/15/2020 13:14:42 - WARNING - farm.modeling.prediction_head - Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {"loss_ignore_index": -1}
07/15/2020 13:14:46 - INFO - farm.utils - device: cuda n_gpu: 4, distributed training: False, automatic mixed precision training: None
07/15/2020 13:14:46 - INFO - farm.infer - Got ya 4 parallel workers to do inference ...
07/15/2020 13:14:46 - INFO - farm.infer -  0    0    0    0
07/15/2020 13:14:46 - INFO - farm.infer - /w\  /w\  /w\  /w\
07/15/2020 13:14:46 - INFO - farm.infer - /'\  / \  /'\  /'\
07/15/2020 13:14:46 - INFO - farm.infer -
07/15/2020 13:14:46 - INFO - farm.utils - device: cuda n_gpu: 4, distributed training: False, automatic mixed precision training: None
Preprocessing Dataset /data/home/krystian/haystack_new/data/squad/dev-v2.0.json: 100%|█████████████████████████████████████████████| 1204/1204 [00:03<00:00, 315.28 Dicts/s]
Train epoch 0/0 (Cur. train loss: 0.0000):   0%| | 0/1220 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "/data/home/krystian/haystack_new/Tutorial1_Basic_QA_Pipeline_training.py", line 23, in <module>
    reader.train(data_dir=train_data, train_filename="dev-v2.0.json", use_gpu=True, n_epochs=1)
  File "/data/home/krystian/haystack_new/haystack/haystack/reader/farm.py", line 199, in train
    self.inferencer.model = trainer.train()
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/farm/train.py", line 290, in train
    logits = self.model.forward(**batch)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/farm/modeling/adaptive_model.py", line 397, in forward
    sequence_output, pooled_output = self.forward_lm(**kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/farm/modeling/adaptive_model.py", line 441, in forward_lm
    sequence_output, pooled_output = self.language_model(**kwargs, output_all_encoded_layers=False)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/farm/modeling/language_model.py", line 792, in forward
    attention_mask=padding_mask,
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/transformers/modeling_distilbert.py", line 466, in forward
    inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/transformers/modeling_distilbert.py", line 91, in forward
    word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, dim)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/data/home/krystian/test_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:403
```

Thank you again. :)

tholor (Member) commented Jul 15, 2020

That's indeed a bug and I was able to reproduce this. Adding a quick fix via #234. Thanks for reporting this!

tholor (Member) commented Jul 18, 2020

Closing this as it was fixed by #234.
Feel free to reopen if this didn't solve your problem, @sbhttchryy.
