
FARMReader slow #1077

Closed · bappctl opened this issue May 19, 2021 · 31 comments

@bappctl commented May 19, 2021

Question
I am running one of the samples in a K8s pod (GPU). It gets stuck in the FARMReader for a long time (30+ minutes) and times out. Any idea why? All I added were two .txt documents.

    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                        use_gpu=True, return_no_answer=True, no_ans_boost=0.7, context_window_size=200)

    retriever = ElasticsearchRetriever(document_store=document_store)

    pipe = ExtractiveQAPipeline(reader, retriever)

    # predict n answers
    prediction = pipe.run(query=question, top_k_retriever=10, top_k_reader=3)

[2021-05-19 23:34:10 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
05/19/2021 23:34:10 - INFO - farm.infer - Got ya 23 parallel workers to do inference ...
05/19/2021 23:34:10 - INFO - farm.infer - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
05/19/2021 23:34:10 - INFO - farm.infer - /w\ /w\ /w\ /w\ /w\ /w\ /w\ /|\ /w\ /w\ /w\ /w\ /w\ /w\ /|\ /w\ /|\ /|\ /|\ /|\ /w\ /w\ /|
05/19/2021 23:34:10 - INFO - farm.infer - /'\ / \ /'\ /'\ / \ / \ /'\ /'\ /'\ /'\ /'\ /'\ / \ /'\ /'\ / \ /'\ /'\ /'\ /'\ / \ / \ /'
05/19/2021 23:34:10 - INFO - farm.infer -
05/19/2021 23:34:10 - INFO - elasticsearch - POST http://10.x.x.x:8071/sidx/_search [status:200 request:0.003s]
05/19/2021 23:34:10 - WARNING - farm.data_handler.dataset - Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
05/19/2021 23:34:10 - WARNING - farm.data_handler.dataset - Could not determine type for feature 'labels'. Converting now to a tensor of default type long.
[2021-05-19 23:34:40 +0000] [8] [WARNING] Worker graceful timeout (pid:8)
[2021-05-19 23:34:42 +0000] [8] [INFO] Worker exiting (pid: 8)


@tholor (Member) commented May 21, 2021

Hey @bappctl,

  • How long were your two txt documents?
  • I assume you are running this on CPU nodes in k8?
  • What happened before [2021-05-19 23:34:10 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)?

One thing you can try to narrow down the root cause: disable multiprocessing for inference. You can do that by passing num_processes=0 to the FARMReader.
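
A minimal sketch of that change, reusing the snippet from the issue description (document_store and question are assumed to be defined as before):

    from haystack.reader.farm import FARMReader
    from haystack.retriever.sparse import ElasticsearchRetriever
    from haystack.pipeline import ExtractiveQAPipeline

    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                        use_gpu=True, return_no_answer=True, no_ans_boost=0.7,
                        context_window_size=200,
                        num_processes=0)  # disable multiprocessing for inference
    retriever = ElasticsearchRetriever(document_store=document_store)
    pipe = ExtractiveQAPipeline(reader, retriever)
    prediction = pipe.run(query=question, top_k_retriever=10, top_k_reader=3)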

As a side note: With only two docs the retriever is basically useless as top_k_retriever=10 will always return your two documents.

@bappctl (Author) commented May 25, 2021

@tholor

  • Just a one-page document (two of them)
  • Running on GPU nodes
  • It times out in the FARMReader
  • Agree on the side note; that's the next step after overcoming the above issue

TransformersReader works, but I couldn't get FARMReader to work. In the above code I set num_processes=0 as suggested, but it still gets stuck for almost 40 minutes (I had to kill the process):

[screenshot]

I notice a peculiar behavior with FARMReader: when I kill the pod, I see the correct result printed in the container log before the app exits; if I don't kill the pod, it stays stuck as mentioned above.

[screenshot]

I see this behaviour only with FARMReader; if I switch to TransformersReader, the app works fine as expected.

@bappctl (Author) commented May 26, 2021

@tholor With TransformersReader, how do I train on custom data (similar to FARMReader's train())?

@tholor (Member) commented May 26, 2021

@oryx1729 have you seen such an issue in our kubernetes deployment or have an idea what might cause the deadlock here?

@bappctl (Author) commented May 26, 2021

@tholor Finally made it work. It's something to do with Gunicorn threads. When I hit the issue I had it set to 3 threads (the bare minimum); once I removed that and went with workers and worker-connections instead, it started working with FARMReader. Threads didn't cause any issue with TransformersReader, though; it happens only with FARMReader.

@oryx1729 (Contributor):
Hi @bappctl, can you share the parameters you're using for running Gunicorn? We have been using this for our deployments. It's possible that you need a higher timeout value here.
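
For illustration, a Gunicorn invocation with a raised worker timeout might look like this (main:app is just a placeholder for the WSGI app object):

    gunicorn main:app --bind :8061 --workers 1 --timeout 300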

@bappctl (Author) commented May 26, 2021

Hi @oryx1729
[PREDICT]
Even with the Gunicorn config below I see issues: the first predict request returns fine, but subsequent requests stall and get stuck.

    reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True, num_processes=0)
    pipe = ExtractiveQAPipeline(reader, retriever)
    prediction = pipe.run(query=question, top_k_retriever=10, top_k_reader=3)

The deployment pod has 4 CPUs, 10 GB RAM, and 1 GPU. I tried with just 2 documents of no more than 2 pages each.

CMD ["gunicorn", "--name", "hs", "--timeout", "1800", "--workers", "5", "--worker-connections","2","--worker-class", "gevent", "--bind", ":8061", "main:app"]

The other thing I notice is that GPU memory is not freed after predict.

[screenshot]

[TRAIN]
In parallel, I have had no luck with FARMReader.train(); it stalls forever.

 reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True, num_processes=0)
 reader.train(data_dir=model, train_filename="squad.json", n_epochs=20, dev_split = 0, save_dir=model)

[screenshot]

@oryx1729 (Contributor):
@bappctl are you using FastAPI for the APIs? In that case, the Gunicorn worker class should be uvicorn.workers.UvicornWorker.

Can you share the complete code for your API endpoint?
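
For reference, serving a FastAPI app with Gunicorn would look something like this sketch (main:app standing in for your ASGI app object):

    gunicorn main:app --workers 1 --worker-class uvicorn.workers.UvicornWorker --bind :8061 --timeout 300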

@bappctl (Author) commented May 28, 2021

@oryx1729

I am not using FastAPI. No luck with either train or predict.

import json
import os
import config
import logging
from flask_cors import CORS
from flask import Flask, request, jsonify
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts
from haystack.reader.farm import FARMReader
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.file_converter.pdf import PDFToTextConverter
from haystack.retriever.dense import DensePassageRetriever
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.pipeline import ExtractiveQAPipeline
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

app = Flask(__name__)
CORS(app)

@app.route('/haystack/es', methods=['POST'])  # note: the route needs a leading slash
def es_store():
    if request.files:
        index = request.form['index']
        doc = request.files["doc"]
        eshost = request.form['host']
        esport = request.form['port']
        local_dir = '/home/data'

        file_path = os.path.join(local_dir, doc.filename)
        doc.save(file_path)
        document_store = ElasticsearchDocumentStore(host=eshost, port=esport, username='', password='', index=index)
        dicts = convert_files_to_dicts(
            local_dir,  # convert the files saved under local_dir
            clean_func=clean_wiki_text,
            split_paragraphs=False)
        document_store.write_documents(dicts)
        os.remove(file_path)
        return json.dumps({'code':200,'status':'success','message': 'File uploaded.'})   
    else:
        return json.dumps({'status':'Failed','message': 'File upload failed.'})

@app.route('/haystack/train', methods=['POST'])
def train():
    local_dir = '/home/data'
    reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True, num_processes=0)
    reader.train(data_dir=local_dir, train_filename="squad.json", n_epochs=20, dev_split = 0, save_dir=local_dir)
    return json.dumps({'code':'200','status':'success','message': 'Train successful.'})

@app.route('/haystack/predict', methods=['POST'])
def predict():
    question = request.form['question']
    index = request.form['index']
    eshost = request.form['host']
    esport = request.form['port']
    document_store = ElasticsearchDocumentStore(host=eshost, port=esport, username='', password='', index=index)
    retriever = ElasticsearchRetriever(document_store= document_store)
    reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad",  use_gpu=True)
    pipe = ExtractiveQAPipeline(reader, retriever)
    prediction = pipe.run(query=question, top_k_retriever=10, top_k_reader=3)
    answer = []
    for res in prediction['answers']:
        answer.append(res)
    return json.dumps({'code':200,'status':'success','message': 'Predict successful.', 'result': answer})   
    
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8061)    

@oryx1729 (Contributor) commented May 28, 2021

Hi @bappctl, in /haystack/predict the model is reloaded each time a question is asked. It would be more efficient to declare the pipeline outside the predict() method, so it is loaded only once when the Flask app starts. Something like this:

# configure host, port, and index once at startup (e.g. from environment variables)
document_store = ElasticsearchDocumentStore(host=eshost, port=esport, username='', password='', index=index)
retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
pipe = ExtractiveQAPipeline(reader, retriever)

@app.route('/haystack/predict', methods=['POST'])
def predict():
    question = request.form['question']
    prediction = pipe.run(query=question, top_k_retriever=10, top_k_reader=3)
    answer = []
    for res in prediction['answers']:
        answer.append(res)
    return json.dumps({'code': 200, 'status': 'success', 'message': 'Predict successful.', 'result': answer})

@bappctl (Author) commented May 28, 2021

@oryx1729
Irrespective of that, the same code (even with the model loaded on every predict) does not get stuck with TransformersReader. I will give your suggestion a try.

@bappctl (Author) commented May 29, 2021

@oryx1729
As a first step I am just trying to save the model locally and load it from there. Below is what I get; is there anything I'm missing? I'm doing nothing special, it's very straightforward.

docker
CMD ["gunicorn", "--name", "haystack", "--timeout", "1800", "--workers", "5", "--worker-connections","2","--worker-class", "gevent", "--bind", ":8091", "main:app"]

main.py

    reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True, num_processes=0)
    reader.train(data_dir=modeldir, train_filename="squad.json", n_epochs=20, dev_split = 0, save_dir=modeldir)

05/29/2021 21:19:32 - INFO - farm.utils - Using device: CUDA
05/29/2021 21:19:32 - INFO - farm.utils - Number of GPUs: 1
05/29/2021 21:19:32 - INFO - farm.utils - Distributed Training: False
05/29/2021 21:19:32 - INFO - farm.utils - Automatic Mixed Precision: None
05/29/2021 21:19:32 - INFO - filelock - Lock 139883311981008 acquired on /root/.cache/huggingface/transformers/ab70e5f489e00bb2df55e4bae145e9b1c7dc794cfa0fd8228e1299d400613429.f3874c2af5400915dc843c97f502c5d30edc728e5ec3b60c4bd6958e87970f75.lock
Downloading: 100%|██████████| 451/451 [00:00<00:00, 793kB/s]
05/29/2021 21:19:33 - INFO - filelock - Lock 139883311981008 released on /root/.cache/huggingface/transformers/ab70e5f489e00bb2df55e4bae145e9b1c7dc794cfa0fd8228e1299d400613429.f3874c2af5400915dc843c97f502c5d30edc728e5ec3b60c4bd6958e87970f75.lock
05/29/2021 21:19:34 - INFO - filelock - Lock 139879304556368 acquired on /root/.cache/huggingface/transformers/b00ff18397f70f871bd8f11949a3c5ffd5fb18fd6d4e3df947dc386950b8d59d.69a963759b72d26fb77afa9b7d43c9107b99dfe7ca78af52e0237c8d001c7dcf.lock
Downloading: 100%|██████████| 265M/265M [00:25<00:00, 10.5MB/s]
05/29/2021 21:20:00 - INFO - filelock - Lock 139879304556368 released on /root/.cache/huggingface/transformers/b00ff18397f70f871bd8f11949a3c5ffd5fb18fd6d4e3df947dc386950b8d59d.69a963759b72d26fb77afa9b7d43c9107b99dfe7ca78af52e0237c8d001c7dcf.lock
05/29/2021 21:20:40 - INFO - filelock - Lock 139879093416848 acquired on /root/.cache/huggingface/transformers/e12f02d630da91a0982ce6db1ad595231d155a2b725ab106971898276d842ecc.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
Downloading: 100%|██████████| 232k/232k [00:00<00:00, 1.93MB/s]
05/29/2021 21:20:41 - INFO - filelock - Lock 139879093416848 released on /root/.cache/huggingface/transformers/e12f02d630da91a0982ce6db1ad595231d155a2b725ab106971898276d842ecc.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
05/29/2021 21:20:41 - INFO - filelock - Lock 139879090118160 acquired on /root/.cache/huggingface/transformers/475d46024228961ca8770cead39e1079f135fd2441d14cf216727ffac8d41d78.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
Downloading: 100%|██████████| 466k/466k [00:00<00:00, 2.22MB/s]
05/29/2021 21:20:42 - INFO - filelock - Lock 139879090118160 released on /root/.cache/huggingface/transformers/475d46024228961ca8770cead39e1079f135fd2441d14cf216727ffac8d41d78.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
05/29/2021 21:20:42 - WARNING - farm.utils - ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
05/29/2021 21:20:42 - INFO - farm.utils - Using device: CUDA
05/29/2021 21:20:42 - INFO - farm.utils - Number of GPUs: 1
05/29/2021 21:20:42 - INFO - farm.utils - Distributed Training: False
05/29/2021 21:20:42 - INFO - farm.utils - Automatic Mixed Precision: None

[2021-05-29 21:26:50 +0000] [1] [INFO] Handling signal: term
[2021-05-29 21:26:51 +0000] [9] [INFO] Worker exiting (pid: 9)
[2021-05-29 21:26:51 +0000] [10] [INFO] Worker exiting (pid: 10)
[2021-05-29 21:26:51 +0000] [35] [INFO] Worker exiting (pid: 35)
[2021-05-29 21:26:51 +0000] [8] [INFO] Worker exiting (pid: 8)
[2021-05-29 21:27:21 +0000] [1] [INFO] Shutting down: Master


If I reduce it to 1 worker:

[2021-05-29 22:10:44 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
05/29/2021 22:10:44 - INFO - farm.infer - Got ya 23 parallel workers to do inference ...
[2021-05-29 22:10:45 +0000] [1] [WARNING] Worker with pid 8 was terminated due to signal 9
[2021-05-29 22:10:45 +0000] [97] [INFO] Booting worker with pid: 97
05/29/2021 22:10:46 - INFO - faiss.loader - Loading faiss with AVX2 support.
05/29/2021 22:10:46 - INFO - faiss.loader - Loading faiss.
05/29/2021 22:10:46 - INFO - farm.modeling.prediction_head - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


tholor assigned oryx1729 and unassigned tholor on Jun 1, 2021
@oryx1729 (Contributor):
Hi @bappctl, apologies for the delay in getting back on this.

A few observations in the case above:

  • --workers set to 5 might spin up 5 instances of the model, so ideally start with 1 worker first and check that everything works
  • in the run with 1 worker, I see an error in the logs: Worker with pid 8 was terminated due to signal 9. Could it be that the pod is running out of memory? Can you ensure that sufficient memory resources are available?
  • the logs say INFO - farm.infer - Got ya 23 parallel workers to do inference ..., however, in the code snippet the FARMReader has num_processes=0. Could it be that you were looking at the wrong logs, or that a different version of the code was deployed?
  • is there a specific reason you want to use Flask over FastAPI? There's a rest_api module in Haystack written with FastAPI; you could extend/adapt it for your use case (see the sketch below).
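
For illustration only, a minimal FastAPI sketch in the spirit of that suggestion (not the actual rest_api module; host, port, and index values are placeholders, and form parsing needs python-multipart installed), with the pipeline built once at startup:

    from fastapi import FastAPI, Form
    from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
    from haystack.retriever.sparse import ElasticsearchRetriever
    from haystack.reader.farm import FARMReader
    from haystack.pipeline import ExtractiveQAPipeline

    app = FastAPI()

    # build the pipeline once at startup; every request reuses it
    document_store = ElasticsearchDocumentStore(host="elasticsearch", port=9200,
                                                username='', password='', index="document")
    retriever = ElasticsearchRetriever(document_store=document_store)
    reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad",
                        use_gpu=True, num_processes=0)
    pipe = ExtractiveQAPipeline(reader, retriever)

    @app.post("/haystack/predict")
    def predict(question: str = Form(...)):
        prediction = pipe.run(query=question, top_k_retriever=10, top_k_reader=3)
        return {"code": 200, "status": "success", "result": prediction["answers"]}

Such an app would then be served with Gunicorn and the uvicorn.workers.UvicornWorker class, as mentioned earlier in the thread.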

@bappctl (Author) commented Jun 26, 2021

@oryx1729 No worries. I tried with 1 worker too; it didn't help. After the failed tries I dismantled that instance and moved on, so unfortunately I can't verify it now. I will look into the last option; I have no preference between Flask and FastAPI, all I want is to get it running successfully. So far no luck. I will give it another try and update you.

@bappctl (Author) commented Jul 16, 2021

@oryx1729

I went with your FastAPI suggestion. It works.

I have a couple of questions:

  1. Does FARMReader train() auto-save the model, or should save() be explicitly invoked to save the trained model?
  2. Is there a way to instantiate ElasticsearchDocumentStore without an index and then assign the index later?
  3. After a few sequential predict queries the GPU freezes. Is Haystack not releasing the memory after a query completes successfully?
     I get [CRITICAL] WORKER TIMEOUT (pid:8)
     [WARNING] Worker with pid 8 was terminated due to signal 9
     BUT if I set num_processes=0 I don't hit the issue. Even a small increase of num_processes to 2 causes the issue (the subsequent query doesn't get processed and stays stuck sending the request until it is cancelled and resent).

@oryx1729 (Contributor):
Hi @bappctl,

does FarmReader train() auto save the model or should .save() be explicitly invoked to save the trained model?

save() is called inside train(), so you do not need an explicit call.

Is there a way to instantiate ElasticsearchDocumentStore without index and then assign index later?

Can you provide more details of the use case here? An instance of ElasticsearchDocumentStore must have an index assigned. Based on the create_index parameter, it will try to create a new index or connect to an existing one.
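
For illustration, a minimal sketch with placeholder host and index values:

    from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

    # connects to the "document" index, creating it first if it does not exist yet
    document_store = ElasticsearchDocumentStore(host="localhost", port=9200,
                                                username='', password='',
                                                index="document", create_index=True)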

after few sequential predict queries the GPU freezes, is haystack not releasing the memory after successful query completion?

It seems like you're running into memory issues when using multiprocessing (num_processes > 0). You could try increasing the memory if possible, or use num_processes=0.

@bappctl (Author) commented Jul 28, 2021

@oryx1729
Ignore my question about the index; it's no longer relevant.

The other question I have: sometimes during training the process gets killed for some reason and the model save() fails; I then run into the issue below, and it no longer trains and keeps throwing the same error.

OSError: Unable to load weights from pytorch checkpoint file for language_model.bin

If I replace language_model.bin with the old file, the error goes away, but that's not the right approach. How can I overcome it?

@oryx1729 (Contributor):
Hi @bappctl, can you share the full error stack trace that you get when the process is killed? What version of PyTorch are you using?

@bappctl (Author) commented Jul 28, 2021

@oryx1729

pytorch 1.7.1 + cu110

----- error stack ---

04:58:03 +0000] [282] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 398, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.7/dist-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/fastapi/applications.py", line 199, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/dist-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/middleware/cors.py", line 78, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/dist-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.7/dist-packages/starlette/routing.py", line 580, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/routing.py", line 241, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/routing.py", line 52, in app
    response = await func(request)
  File "/usr/local/lib/python3.7/dist-packages/fastapi/routing.py", line 202, in app
    dependant=dependant, values=values, is_coroutine=is_coroutine
  File "/usr/local/lib/python3.7/dist-packages/fastapi/routing.py", line 148, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/controller/train.py", line 109, in _start_train
    reader = FARMReader(model_name_or_path=model_path)
  File "/usr/local/lib/python3.7/dist-packages/haystack/reader/farm.py", line 112, in __init__
    strict=False)
  File "/usr/local/lib/python3.7/dist-packages/farm/infer.py", line 252, in load
    model = BaseAdaptiveModel.load(load_dir=model_name_or_path, device=device, strict=strict)
  File "/usr/local/lib/python3.7/dist-packages/farm/modeling/adaptive_model.py", line 53, in load
    model = cls.subclasses["AdaptiveModel"].load(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/farm/modeling/adaptive_model.py", line 338, in load
    language_model = LanguageModel.load(load_dir)
  File "/usr/local/lib/python3.7/dist-packages/farm/modeling/language_model.py", line 142, in load
    language_model = cls.subclasses[config["name"]].load(pretrained_model_name_or_path)
  File "/usr/local/lib/python3.7/dist-packages/farm/modeling/language_model.py", line 830, in load
    distilbert.model = DistilBertModel.from_pretrained(farm_lm_model, config=config, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py", line 1208, in from_pretrained
    f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for '/model/language_model.bin' at '/model/language_model.bin'If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. 

@oryx1729 (Contributor):
Hi @bappctl, could it be possible that you're using an older PyTorch version? Can you try with torch v1.8.1?
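
For example, upgrading could look like this (the default PyPI wheel; pick a +cu* build matching your CUDA version if you need a specific one):

    pip install torch==1.8.1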

@bappctl (Author) commented Jul 29, 2021

@oryx1729 I don't see it happening frequently if I reduce the number of workers to 1. Though infrequent, at times I still see the process get killed during save. I can try with 1.8.1.

@oryx1729 (Contributor) commented Jul 30, 2021

Hi @bappctl, model training is a long-running task, so doing it within the REST API is not a recommended approach.

Could you share more details about the use case for triggering model training via an API?

An alternative approach is to train the model with a separate script and later use the trained model for inference with the API.
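
As an illustration, a minimal standalone training script along those lines (the data directory, filename, epoch count, and save path are placeholders):

    from haystack.reader.farm import FARMReader

    # train once, offline; the API process then only loads the saved model for inference
    reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad",
                        use_gpu=True, num_processes=0)
    reader.train(data_dir="/home/data", train_filename="squad.json",
                 n_epochs=2, dev_split=0, save_dir="/home/data/my_model")

    # later, in the API:
    # reader = FARMReader(model_name_or_path="/home/data/my_model", use_gpu=True)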

@ZanSara (Contributor) commented Oct 14, 2021

Closing as it seems that the issue was solved using PyTorch 1.8.1. Feel free to open a new issue if you still face problems.

ZanSara closed this as completed Oct 14, 2021
@clharman commented Sep 22, 2022

Hey @tholor, loving the FARMReader interface. However, for a single prediction I'm seeing FARMReader run ~6x slower than both TransformersReader and the Hugging Face QA pipeline with num_processes=0 or 1, and ~7.5x slower with num_processes=None. Is there something obvious I'm missing here? Should we expect inference-time parity?

[screenshot]

Using the latest farm-haystack and transformers, with PyTorch 1.12.1. Colab notebook: https://colab.research.google.com/drive/1DmbqWaFw9U4NLzn2dI_u1ypGScKdrGqp?usp=sharing

@tholor (Member) commented Sep 23, 2022

Hey @clharman,

We'd expect some time difference, as the two readers have quite different postprocessing (e.g. tokenizers, handling of no_answers, and aggregating logits across multiple passages; see the docs for more info).

However, the diff here is totally out of the expected range and unacceptable.
I tried to reproduce it in your notebook. For me, the diff is similar in the very first execution of the FARMReader cell but diminishes in the second execution, suggesting that there's an initial warm-up step in the FARMReader costing extra time. Can you confirm that you see the same behaviour?

[screenshot]

There's still a diff between both readers, but this doesn't seem like a "critical bug" to me and might rather be a topic for some thorough profiling + refactoring.
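
For anyone wanting to separate the warm-up cost from the steady-state cost, a minimal timing sketch (assuming a Haystack 1.x install; model, document, and query are placeholders):

    import time
    from haystack.nodes import FARMReader
    from haystack.schema import Document

    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                        use_gpu=False, num_processes=0)
    docs = [Document(content="Berlin is the capital of Germany.")]

    for i in range(3):
        start = time.perf_counter()
        reader.predict(query="What is the capital of Germany?", documents=docs, top_k=1)
        print(f"run {i}: {time.perf_counter() - start:.2f}s")  # run 0 includes warm-up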

cc @ZanSara

@clharman:
Thanks for the follow-up @tholor. When I ran with a GPU I saw results matching yours, with roughly a 1.5-2x slowdown.

However, running the notebook CPU-only (which is how I was doing it originally), the 6x slowdown persisted after any warm-up period.

Also, leaving num_processes=None on GPU seems to make the gap even wider; I'm seeing it 16x slower. Just FYI, I've been keeping multiprocessing turned off, but it seemed odd.

@tholor (Member) commented Sep 27, 2022

Ok, thanks for the clarification! The diff on CPU might be related to multiprocessing.

@ZanSara @vblagoje can one of you please take over here and try to replicate this? If the gap is consistently that huge on CPU, it might make sense to open a new issue about it.

@vblagoje Weren't you investigating getting rid of multiprocessing anyway?

@vblagoje (Member) commented Sep 27, 2022

Yes @tholor, we got rid of it in preprocessing via #3089, and it is now pending for inference via #3283. For inference there is some internal discussion about keeping multiprocessing after all as a non-default option. Thoughts?

@clharman:
FYI, in case it wasn't clear: the ~6x slowdown I measured was with multiprocessing turned off (num_processes=0 or 1). Maybe your change affects more than just that argument, though. I also tried some basic profiling and found that all the compute time was spent during inference, rather than postprocessing.

[screenshot]

@sjrl (Contributor) commented Oct 5, 2022

Hi @clharman, we believe we resolved the time difference between the two reader types by finishing PR #3283.

There is also further discussion on this in issues #3289 and #3272.

@vblagoje (Member) commented Oct 5, 2022

Hey @clharman, please also have a look at this Colab notebook; I am not getting such high variance in performance results (CPU only). Let us know either way.
