RuntimeError when loading data using data_silo #1927

Closed
AlonEirew opened this issue Dec 24, 2021 · 0 comments · Fixed by #1928

AlonEirew commented Dec 24, 2021

Describe the bug
I get a RuntimeError when loading data with data_silo. The error seems related to the multiprocessing sharing strategy, which opens many file descriptors. Raising the ulimit on my machine to 2048 does not help, and I cannot raise it any further.

One possible fix is to raise the file descriptor limit further (as suggested in fastai/fastai#23).
Unfortunately, the hard limit on my machine is 2048.

See file descriptors (open files) limit on machine:
#>ulimit -n

Increase file descriptors:
#>ulimit -n 2048

See hard limits:
#>ulimit -H -a
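
For reference, the same check can be done from inside the Python process before training starts. This is only a minimal sketch using the standard-library resource module (Unix only); the helper name raise_nofile_soft_limit is hypothetical and not part of Haystack. Like ulimit -n, it can only raise the soft limit up to the hard limit, so it would not get past 2048 on the machine described here.

```python
import resource


def raise_nofile_soft_limit(target):
    """Raise the soft open-files limit toward `target`, capped at the hard limit.

    Hypothetical helper for illustration; returns the (soft, hard) pair in
    effect after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process cannot exceed the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Never lower the limit: leave it alone if it is already high enough.
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft, hard

    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_nofile_soft_limit(2048))
```

The change applies to the current process and any children it starts afterwards, so it would need to run before the worker pool is created.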

Error message

Traceback (most recent call last):
  File "/home/alon_nlp/miniconda3/envs/wec-es/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/home/alon_nlp/miniconda3/envs/wec-es/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alon_nlp/miniconda3/envs/wec-es/lib/python3.9/multiprocessing/pool.py", line 576, in _handle_results
    task = get()
  File "/home/alon_nlp/miniconda3/envs/wec-es/lib/python3.9/multiprocessing/connection.py", line 256, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/alon_nlp/miniconda3/envs/wec-es/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/home/alon_nlp/miniconda3/envs/wec-es/lib/python3.9/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/alon_nlp/miniconda3/envs/wec-es/lib/python3.9/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/alon_nlp/miniconda3/envs/wec-es/lib/python3.9/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata

Expected behavior
No Error

Additional context
Running on a machine with 88 CPUs

To Reproduce

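# Imports assume the Haystack v1.0.0 package layout.
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import DensePassageRetriever

# faiss_index_path, faiss_config_path, doc_dir, train_filename, dev_filename,
# n_epochs and save_dir are defined elsewhere.
document_store = FAISSDocumentStore.load(index_path=faiss_index_path, config_path=faiss_config_path)

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model='bert-base-uncased',
                                  passage_embedding_model='bert-base-uncased',
                                  infer_tokenizer_classes=True,
                                  max_seq_len_query=64,
                                  max_seq_len_passage=180
                                  )

# The error is raised while the training data is loaded.
retriever.train(data_dir=doc_dir,
                train_filename=train_filename,
                dev_filename=dev_filename,
                test_filename=dev_filename,
                n_epochs=n_epochs,
                batch_size=16,
                grad_acc_steps=8,
                save_dir=save_dir,
                evaluate_every=20,
                embed_title=False,
                num_positives=1,
                num_hard_negatives=1
                )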

FAQ Check

System:

  • OS: 16.04.1-Ubuntu
  • GPU/CPU: 4 * TITAN Xp / 87 * Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
  • Haystack version (commit or version number): v1.0.0
  • DocumentStore: FAISSDocumentStore
  • Reader: NA
  • Retriever: DensePassageRetriever
tholor pushed a commit that referenced this issue Jan 4, 2022
…y open file descriptors from multiprocessing (#1928)

* fix #1687

* fix RuntimeError: received 0 items of ancdata

* Add an arg multiprocessing_strategy to DataSilo and DPR.train()

* Add latest docstring and tutorial changes

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
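
As a rough illustration of what a sharing-strategy switch looks like at the PyTorch level, the sketch below selects the file_system strategy, which passes tensors between processes through named shared-memory files instead of keeping a file descriptor open per tensor. It uses only torch.multiprocessing calls; the helper name switch_sharing_strategy is hypothetical, and how the multiprocessing_strategy argument mentioned in the commit is wired up inside Haystack is not shown here.

```python
try:
    import torch.multiprocessing as torch_mp
except ImportError:  # PyTorch not installed; the helper then does nothing.
    torch_mp = None


def switch_sharing_strategy(strategy="file_system"):
    """Select a PyTorch tensor-sharing strategy and return the one in effect.

    Hypothetical helper for illustration. Returns None if PyTorch is missing.
    """
    if torch_mp is None:
        return None

    # Only strategies supported on this platform can be selected.
    available = torch_mp.get_all_sharing_strategies()
    if strategy not in available:
        raise ValueError(f"{strategy!r} not supported; choose from {sorted(available)}")

    torch_mp.set_sharing_strategy(strategy)
    return torch_mp.get_sharing_strategy()


if __name__ == "__main__":
    print("sharing strategy:", switch_sharing_strategy())
```

The call would have to happen before any worker processes are started. One known trade-off of file_system is that shared-memory files can be left behind if a process exits abnormally.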