Haystack with Albert is awesome! XLNet question #23

Closed
ahotrod opened this issue Feb 15, 2020 · 22 comments

ahotrod commented Feb 15, 2020

I am in the midst of evaluating Haystack with Albert and so far it looks awesome. Loving it, thanks for sharing.

I missed the whole Game of Thrones fantasy/drama phenomenon, so for a tutorial I could understand and relate to, I went looking for other content to use with your Tutorial1_Basic_QA_Pipeline.ipynb notebook. Being a Porschephile, I settled on:

import os
import wikipedia

porsche_wikis = wikipedia.search("Porsche", results=25)
doc_dir = "data/porsche/"
os.makedirs(doc_dir, exist_ok=True)

for wiki in porsche_wikis:
    html_page = wikipedia.page(title=wiki, auto_suggest=False)
    # Write each article's plain-text content to its own .txt file
    with open(doc_dir + wiki.replace('/', ' ') + ".txt", "w") as text_file:
        text_file.write(html_page.content)
    print(wiki)

I can relate to the above content and can ask relevant questions of it "all day long". All other code in your notebook remains the same, except that I use my Albert model for QA, and it works well:

reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512",
                    use_gpu=True)

For my application/project, I would like to also evaluate XLNet performance with Haystack but I am having trouble loading my XLNet model:

reader = FARMReader(model_name_or_path="ahotrod/xlnet_large_squad2_512",
                    use_gpu=True)

Attached is the complete terminal output, but the bottom-line error I get is:

AttributeError: 'XLNetForQuestionAnswering' object has no attribute 'qa_outputs'

output_term.txt

This XLNet model was fine-tuned on Transformers v2.1.1 and is the best I have, because I and others are having problems fine-tuning XLNet_large under Transformers v2.4.1 (huggingface/transformers#2651).

Perhaps this XLNet model, fine-tuned with Transformers v2.1.1, is not compatible or is missing the attribute mentioned in the error message?

Looking forward to additional FARM/Haystack QA capabilities you have in the works, thanks for your efforts!


ahotrod commented Feb 16, 2020

I am familiar with Transformers' null_score_diff_threshold, which, when set to zero, results in predictions = compute_predictions_logits(examples, features, ...) from transformers.data.metrics.squad_metrics returning a null string in predictions when there is no answer.

With FARMReader, what is the best indicator of a "No Answer" output when there is no "reasonable" probability of an answer? Is the argument no_ans_threshold for FARMReader the means to accomplish this? I do not see any change or pattern in the answers when varying this arg from -100 (default) to +100.

Must I implement "No Answer" functionality by analyzing the 'probability' & 'score' of the returned answers in prediction = finder.get_answers? If so, what would you recommend as thresholds/limits for either or both of 'probability' & 'score'?
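In the meantime, here is the rough post-filter I am experimenting with (just a sketch on my side, not a built-in mechanism; the threshold value is arbitrary and the dict keys are the ones returned by finder.get_answers):

prediction = finder.get_answers(question="Who is Hercules?",
                                top_k_retriever=10, top_k_reader=2)

MIN_ANSWER_PROB = 0.75   # arbitrary cut-off, to be tuned
kept = [a for a in prediction["answers"] if a["probability"] >= MIN_ANSWER_PROB]
if not kept:
    print("No answer")
else:
    print(kept[0]["answer"])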


tholor commented Feb 17, 2020

Hi @ahotrod

Nice example of searching Porsche wikis via the wikipedia package! We actually had quite a bit of discussion about which domain to pick for the tutorial :D

XLNet

Perhaps this fine-tuned XLNet model & Transformers v2.1.1 is missing the attribute mentioned in the error message?

It seems like there's a difference in naming the FF module (= "QA head") between the two available XLNet implementations:

  • XLNetForQuestionAnsweringSimple: it's called qa_outputs here (same as in other architectures)
  • XLNetForQuestionAnswering: there are separate modules for start_logits and end_logits

From a first look, XLNetForQuestionAnswering seems to be the original implementation from the paper, with some kind of beam search. These differences in attributes were already present in v2.1.1, so I am not sure whether that is really related to your performance issue. Do you only have trouble with training a new model, or also with using your existing model for inference?
https://github.com/huggingface/transformers/blob/v2.1.1/transformers/modeling_xlnet.py#L1295
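For illustration, here is a minimal sketch (not from the linked code, just to show the layout; submodule names as of transformers v2.x) that instantiates both classes from a bare config and prints their QA-head attributes:

from transformers import (XLNetConfig,
                          XLNetForQuestionAnswering,
                          XLNetForQuestionAnsweringSimple)

config = XLNetConfig()  # randomly initialized, we only want to inspect the module layout

simple = XLNetForQuestionAnsweringSimple(config)
print(simple.qa_outputs)   # single Linear head producing start/end logits (what FARM expects)

full = XLNetForQuestionAnswering(config)
print(full.start_logits)   # separate pooler module for start logits
print(full.end_logits)     # separate pooler module for end logits (used together with beam search)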

Regarding usage of this model in haystack:
a) FARMReader: As a quick fix we could load the weights from start_logits and end_logits manually into our QA prediction head. However, this would not allow you to benefit from beam search (or other extras implemented in XLNetForQuestionAnswering). For that, we would probably need another prediction head that implements beam search. Do you have any experience with the performance difference between those two XLNet implementations? If the boost is significant, we could also implement it in FARM and make it available for all other model architectures, too.

b) TransformersReader:
If using your model for inference in a transformers pipeline works, you could also load this in haystack via

reader = TransformersReader(model="ahotrod/xlnet_large_squad2_512", tokenizer="ahotrod/xlnet_large_squad2_512", use_gpu=-1)

However, I guess transformers' pipeline does not support this model type yet either.
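A quick way to check that is a plain transformers smoke test along these lines (a sketch; if the pipeline does not support the architecture yet, this will fail with an error):

from transformers import pipeline

qa = pipeline("question-answering",
              model="ahotrod/xlnet_large_squad2_512",
              tokenizer="ahotrod/xlnet_large_squad2_512")

# If this runs without errors, TransformersReader should work with the same model.
print(qa(question="Who makes the 911?",
         context="The Porsche 911 is a sports car made since 1963 by Porsche AG of Stuttgart."))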

No Answer

Is the argument "no_ans_threshold" for FarmReader the means to accomplish this?

Yes, that's the right argument to change. However, you are totally right that the current implementation is missing a step to include the no answer option in the final result - so you can't see any difference. Working on this now in #24.


ahotrod commented Feb 17, 2020

Hello @tholor, thanks for looking into these issues.

Yes, the latest fine-tuned XLNet model I have was fine-tuned with Transformers v2.1.1 and is shared at https://huggingface.co/ahotrod/xlnet_large_squad2_512. Fine-tuning this XLNet LM with Transformers v2.3.0 through v2.4.1 is currently not possible, perhaps related to the fact that Transformers' run_squad.py was quite heavily refactored in December 2019.

Loading my shared XLNet model, fine-tuned with Transformers v2.1.1, as you suggested with:

reader = TransformersReader(model="ahotrod/xlnet_large_squad2_512", tokenizer="ahotrod/xlnet_large_squad2_512", use_gpu=-1)

is successful; however, subsequent inference with prediction = finder.get_answers produces this error:

ValueError: too many values to unpack (expected 2)

output_term.txt

Don't sweat the XLNet issues at this juncture on my account. I'm good with Albert xxlarge v1 fine-tuned on SQuAD2, which is a sufficient model for the foreseeable future. Perhaps the XLNet issues will sort out with subsequent Transformers releases, as their run_squad.py code matures after having been in significant flux over the past few months.

Thanks for your PR #24 - you're all over it!

I've enjoyed and learned a good bit going through the FARM & Haystack code over the last few weeks. If you need any particular feedback or whatnot, be sure to ask. My time may be limited, but I will contribute where I can. Best regards!


tholor commented Feb 18, 2020

Hey @ahotrod,

Ok sounds good! We are currently discussing a few options on how to aggregate the no_answer results, as there are some tradeoffs between "simplicity of interpretation" and "theoretical soundness". However, it should be merged soon.

ValueError: too many values to unpack (expected 2)

Yes, this is a bug on the Transformers side. It should work once their pipeline object supports XLNet. There are similar issues with other QA model types (e.g. RoBERTa: huggingface/transformers#2788).

I'm good with Albert xxlarge v1 fine-tuned on SQuAD2 which is a sufficient model for the foreseeable future.

Ok, great! I am still a bit curious: did you observe any performance differences in training an XLNetForQuestionAnswering vs. XLNetForQuestionAnsweringSimple? Have you tried training one in FARM?

I've enjoyed and learned a good bit going through the FARM & Haystack code over the last few weeks. If you need any particular feedback or whatnot, be sure to ask.

This would actually be great. We are always keen to hear direct user feedback and would love to understand which parts of the framework were helpful and where confusion or problems occur. Let's continue this conversation via email and maybe have a quick call if you are interested. You can reach me at malte.pietsch [at] deepset.ai

@tholor tholor self-assigned this Feb 18, 2020
@tholor tholor added the type:bug Something isn't working label Feb 18, 2020

tholor commented Feb 19, 2020

We just merged #24, which now allows returning "no_answers" from the Finder. We also made the argument in the FARMReader more intuitive by renaming it to no_ans_boost: a positive value "boosts" the no_answer option towards the top, while a negative value "penalizes" it.
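A minimal usage sketch (illustrative only; the boost value is arbitrary and the import path assumes the layout used in the current tutorials):

from haystack.reader.farm import FARMReader

# Positive values push the "no answer" option towards the top of the results,
# negative values penalize it. The value 10 is purely illustrative.
reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512",
                    no_ans_boost=10,
                    use_gpu=True)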


ahotrod commented Feb 20, 2020

@tholor

Ok, great! I am still a bit curious: did you observe any performance differences in training an XLNetForQuestionAnswering vs. XLNetForQuestionAnsweringSimple? Have you tried training one in FARM?

Haven't tried training either in FARM, only with Transformers. Will do so when time permits.

Thanks, I'm starting to go through the merged #24 PR changes now and run some examples. Will post comments after more experimentation.


ahotrod commented Feb 21, 2020

I have been testing with the single Porsche 911 wiki (which is similar in size to my domain app) to identify limitations, plus trying to "fool" the FARMReader.

Looking good so far!

Here's an example with basic structure & elements for my app:

reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512",
                    context_window_size=40, no_ans_boost=0, batch_size=48,
                    use_gpu=True, n_candidates_per_passage=2)

question = "911?"  # seed retriever

# Apply retriever to get all the paragraphs from one wiki; use a high top_k
# to be sure to get all the paragraphs of that wiki
paragraphs, meta_data = retriever.retrieve(question, top_k=80,
                                           candidate_doc_ids=None)

questions = [
    "When was the 911 introduced?",
    "When was the 911 first introduced?",
    "When was the 911 first produced?",
    "Why did Porsche develop the Type 964 RS America?",
    "What was the last year for air cooled 911s?",
    "Who is Hercules?",  # <-- obvious no_answer question
]

for question in questions:  # Apply reader to get granular answer(s)
    start = time.time()
    prediction = reader.predict(question, paragrahps=paragraphs,  # "paragrahps" is misspelled in farm.py
                                meta_data_paragraphs=meta_data, top_k=2)
    end = time.time()
    print_answers(prediction, details="all")
    print('\n', "Total Prediction time(M:S): ",
          time.strftime('%M:%S', time.gmtime(end - start)), '\n')

Output:

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.08s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:21 

{   'adjust_no_ans_boost': 14.835111737251282,
    'answers': [   {   'answer': '1963',
                       'context': 'ts car made since 1963 by Porsche AG of ',
                       'document_id': '3',
                       'offset_end': 23,
                       'offset_start': 18,
                       'probability': 0.8803529149319752,
                       'score': 15.966211318969727},
                   {   'answer': 'February 1964',
                       'context': 'rking unit in February 1964.It originall',
                       'document_id': '11',
                       'offset_end': 27,
                       'offset_start': 14,
                       'probability': 0.8782006146536698,
                       'score': 15.803997993469238}],
    'question': 'When was the 911 introduced?'}

 Total Prediction time(M:S):  00:41 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.06s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:22 

{   'adjust_no_ans_boost': 16.003353014588356,
    'answers': [   {   'answer': '1963',
                       'context': 'ts car made since 1963 by Porsche AG of ',
                       'document_id': '3',
                       'offset_end': 23,
                       'offset_start': 18,
                       'probability': 0.8828941958574763,
                       'score': 16.161020278930664},
                   {   'answer': 'February 1964',
                       'context': 'rking unit in February 1964.It originall',
                       'document_id': '11',
                       'offset_end': 27,
                       'offset_start': 14,
                       'probability': 0.8748021832534769,
                       'score': 15.55282211303711}],
    'question': 'When was the 911 first introduced?'}

 Total Prediction time(M:S):  00:43 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.11s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:21 

{   'adjust_no_ans_boost': 16.136614687740803,
    'answers': [   {   'answer': '1963',
                       'context': 'ts car made since 1963 by Porsche AG of ',
                       'document_id': '3',
                       'offset_end': 23,
                       'offset_start': 18,
                       'probability': 0.8818734992209446,
                       'score': 16.082340240478516},
                   {   'answer': 'September 1964',
                       'context': 'ion began in September 1964, with the fi',
                       'document_id': '11',
                       'offset_end': 27,
                       'offset_start': 13,
                       'probability': 0.8808546905697762,
                       'score': 16.004390716552734}],
    'question': 'When was the 911 first produced?'}

 Total Prediction time(M:S):  00:42 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.11s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:24 

{   'adjust_no_ans_boost': 11.999503135681152,
    'answers': [   {   'answer': 'appeals from American customers',
                       'context': '93, appeals from American customers resu',
                       'document_id': '23',
                       'offset_end': 36,
                       'offset_start': 4,
                       'probability': 0.8635460031826985,
                       'score': 14.760477066040039},
                   {   'answer': 'because the desirable 930 was not available',
                       'context': 'ecause the desirable 930 was not availab',
                       'document_id': '14',
                       'offset_end': 42,
                       'offset_start': -1,
                       'probability': 0.8081948272857139,
                       'score': 11.506584167480469}],
    'question': 'Why did Porsche develop the Type 964 RS America?'}

 Total Prediction time(M:S):  00:45 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.14s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:24 

{   'adjust_no_ans_boost': 10.78890323638916,
    'answers': [   {   'answer': '1998',
                       'context': ' between 1996 and 1998.  [Production num',
                       'document_id': '13',
                       'offset_end': 22,
                       'offset_start': 18,
                       'probability': 0.833244043870347,
                       'score': 12.870361328125},
                   {   'answer': '(1995–1998)',
                       'context': '1\nPorsche 993 (1995–1998) the last air-c',
                       'document_id': '0',
                       'offset_end': 26,
                       'offset_start': 14,
                       'probability': 0.8261399496013931,
                       'score': 12.468108177185059}],
    'question': 'What was the last year for air cooled 911s?'}

 Total Prediction time(M:S):  00:45 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.14s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:25 

{   'adjust_no_ans_boost': -19.488919258117676,
    'answers': [   {   'answer': '[computer says no answer is likely]',
                       'context': '',
                       'document_id': None,
                       'offset_end': 0,
                       'offset_start': 0,
                       'probability': 0.9306377792184659,
                       'score': 20.77222228050232},
                   {   'answer': 'Michael Mauer',
                       'context': 'was headed by Michael Mauer.\nAt the fron',
                       'document_id': '22',
                       'offset_end': 27,
                       'offset_start': 14,
                       'probability': 0.5400174445173579,
                       'score': 1.2833030223846436}],
    'question': 'Who is Hercules?'}

 Total Prediction time(M:S):  00:46 

Reducing the reader.predict execution time would be nice. The example above uses retriever top_k=80 to ensure that all of the one wiki's contents go through the reader, so nothing is overlooked from the retriever's fast pass. I need to determine a proper balance between retriever top_k, execution time, and answer accuracy. Current execution time is about 42-46 seconds per question in this example, partly a result of running on a single NVIDIA 1080 Ti at maximum batch size. Any suggestions, other than a smaller model (which sacrifices EM/F1 performance) or a 4x V100 cloud solution?
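For now I'm planning a rough sweep along these lines (same retrieve/predict calls as above; the top_k values are just examples) to see where timing and answer quality settle:

import time

question = "When was the 911 introduced?"
for top_k in (10, 20, 40, 80):  # illustrative values
    paragraphs, meta_data = retriever.retrieve(question, top_k=top_k,
                                               candidate_doc_ids=None)
    start = time.time()
    prediction = reader.predict(question, paragrahps=paragraphs,  # kwarg spelled as in farm.py
                                meta_data_paragraphs=meta_data, top_k=2)
    print(f"top_k={top_k}: {len(paragraphs)} paragraphs, {time.time() - start:.1f}s, "
          f"top answer: {prediction['answers'][0]['answer']}")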

BTW, a next step is architecting a cloud solution. I'm leaning towards GCP or Azure as the best platforms for Speech-To-Text and NLP/ML hosting.

Timoeller commented:

Hey @ahotrod, thanks for using the cutting-edge haystack version. It is really rewarding to see that it works and that people are already using it!

Concerning the execution time: I see you have already increased the batch_size parameter in the FARMReader to 48. Is this really the maximum value that fits onto your 1080 Ti during inference? Could you try setting it to 100 or even 200, so that all data coming from the retriever fits into one batch? GPU memory consumption is very different in training vs. inference, so even these high batch sizes should work.
Another option would be to use the CPU, if you have enough memory to fit everything into one batch. I know this sounds mad, but it might be worth trying :D
I would be highly interested in whether these suggestions speed things up, so please report back if you find anything useful.

We will merge the new changes into master today and will likely introduce more breaking changes around the FARM inference + haystack interaction in the coming days. So please be prepared that some of your old code won't work with upcoming haystack versions.

Concerning a cloud solution: yes, something that scales inference automatically like AWS SageMaker would be interesting. Do you know of any good solutions like that on GCP or Azure?
Btw, Malte will interact with you more personally soon - maybe it is better to discuss ways forward there.


ahotrod commented Feb 25, 2020

@Timoeller

Concerning the execution time. I have seen you already increased the batch_size parameter in the FARMReader to 48. Is this really the maximum value that fits onto your 1080Ti during inference?

Yes, maximum batch size on the 1080Ti GPU with Albert_xxlarge is 50, which uses 11GB of the available 11.2GB of its memory. Inferencing in the cloud for production will require more capable resources.

Another option would be to use the CPU, if you have enough memory to fit everything into one batch. I know this sounds mad, but it might be worth trying :D

Using today's new master changes:

reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512",
                    context_window_size=40, no_ans_boost=100, batch_size=500,
                    use_gpu=False, n_candidates_per_paragraph=1)
Inferencing: 100%|██████████| 1/1 [06:00<00:00, 360.78s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 06:00
_get_predictions()->   aggregate_preds  - > model_formatted_preds() time(M:S): 00:06 

{   'answers': [   {   'answer': 'appeals from American customers',
                       'context': '93, appeals from American customers resu',
                       'document_id': '23',
                       'offset_end': 36,
                       'offset_start': 4,
                       'probability': 0.8635471550279814,
                       'score': 14.760555267333984},
                   {   'answer': 'because the desirable 930 was not available',
                       'context': 'ecause the desirable 930 was not availab',
                       'document_id': '14',
                       'offset_end': 42,
                       'offset_start': -1,
                       'probability': 0.8081962132324763,
                       'score': 11.5066556930542}],
    'no_ans_gap': 22.80221176147461,
    'question': 'Why did Porsche develop the Type 964 RS America?'}

 Total Prediction time(M:S):  06:08 

Inferencing: 100%|██████████| 1/1 [05:35<00:00, 335.17s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 05:35
_get_predictions()->   aggregate_preds  - > model_formatted_preds() time(M:S): 00:06 

{   'answers': [   {   'answer': 'Michael Mauer',
                       'context': 'was headed by Michael Mauer.\nAt the fron',
                       'document_id': '22',
                       'offset_end': 27,
                       'offset_start': 14,
                       'probability': 0.5400200540227474,
                       'score': 1.2833870649337769},
                   {   'answer': 'Goldilocks',
                       'context': 'version of the Goldilocks tale is the 99',
                       'document_id': '13',
                       'offset_end': 26,
                       'offset_start': 15,
                       'probability': 0.4770406639968783,
                       'score': -0.7352157831192017}],
    'no_ans_gap': 23.601887702941895,
    'question': 'Who is Hercules?'}

 Total Prediction time(M:S):  05:42 

This CPU-only config runs inference in one pass, occupying a peak of 18GB of CPU memory with 100% utilization on six of 12 CPU threads. Inference times increase about 16-17x compared to running on the GPU. The interesting thing is that the aggregate_preds portion of the _get_predictions() code in infer.py drops from ~21-25 secs on the GPU to 6 secs on the CPU. More CPU-friendly operations/code, I guess?

Unfortunately, the "no_answer" functionality doesn't work for me now. I tried no_ans_boost = 10, 100, 1000, -10, -100, -1000 and never received a no-answer prediction, as before, for the unanswerable question "Who is Hercules?". I don't see where no_ans_boost plays a role in the predict and new _calc_no_answer code. In def _calc_no_answer, why the if-else structure when no_ans_score gets the same assignment in both branches?

Concerning cloud solution: yes, something to scale inference automatically like AWS sagemaker would be interesting. Do you know any good solutions like that on GCP or Azure?

Yes, GCP & Azure both have provisions to scale with demand.


Timoeller commented Feb 25, 2020

Good morning @ahotrod

Thanks for posting the inference times. Having one batch take that long is strange. We will look into PyTorch inference in more detail soon, because it seems to be very different from normal training. We will update you once we find a good solution.

More CPU-friendly operations/code I guess?

You guessed correctly: this code doesn't run on the GPU. This is mainly due to working with strings to get the actual answers. PyTorch doesn't support strings, so a pure GPU solution seems difficult. There are nevertheless many operations in this function that we could improve.

Concerning no_ans_boost on the newest master: very good point. We need to update the requirements for FARM, since the latest FARM master needs to be installed, too.


ahotrod commented Feb 25, 2020

PyTorch doesn't support strings, so a pure GPU solution seems difficult. Do you have experience there?

Just in passing, as I noted some time ago on Transformers, RAPIDS offers a standalone CUDA string library, cuStrings, with the Python wrapper nvStrings: https://github.com/rapidsai/custrings

It provides high-speed loading & processing of textual dataframes on the GPU with CUDA. Moving pandas dataframes to the GPU takes several lines of code, or data can perhaps be loaded straight onto the GPU.

It might be applicable for "pure GPU solutions" and for GPU-accelerated word tokenization, as touched on in this basic example:
https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801

@Timoeller Timoeller assigned Timoeller and unassigned tholor Feb 26, 2020
Timoeller commented:

Nice, thanks for the links. I didn't know about RAPIDS, though they seem super active in bringing all kinds of useful code to GPUs.

I will move this conversation to email, since I don't think it will be too useful for the community.

Timoeller commented:

Following up on the latest developments:
We have been optimizing FARM QA inference, see deepset-ai/FARM/pull/268.
If you update FARM to the latest master, you should be able to increase the batch size a lot, @ahotrod.


ahotrod commented Feb 28, 2020

If you update FARM on latest master you should be able to increase the batch size a lot @Timoeller

Yes, batch_size=160 now accommodates 73kB of context/paragraphs, easily fitting on a single GPU:

[Screenshot from 2020-02-28 13-29-43]

Total prediction times have dropped from about 42-46 seconds per question to 27 seconds.
No_answer predictions are now functional.
Good work!

Timoeller commented:

Nice! Thanks for the kudos - I forwarded it to the team : )

We ran some experiments ourselves and got roughly the same speedup.
And there is still so much to optimize:

  • vectorization and PyTorch operations across the whole inference pipeline (merging answers back to strings only as the very final operation)
  • more efficient interaction of CPU data preprocessing and GPU model inference
  • efficient multi-GPU support
  • ...

Will keep you updated.


ahotrod commented Mar 7, 2020

And there is still so much to optimize:

Microsoft open sources breakthrough optimizations for transformer inference on GPU and CPU
January 21, 2020

"AI developers can now easily productionize large transformer models with high performance across both CPU and GPU hardware". To get started:

  1. Train a model with or load a pre-trained model from popular frameworks such as PyTorch or TensorFlow.

  2. Prepare your model for optimized inferencing by exporting from PyTorch or converting from TensorFlow/Keras to ONNX format.
    https://pytorch.org/docs/stable/onnx.html
    https://github.com/onnx/onnx-docker/tree/master/onnx-ecosystem

  3. Inference across multiple platforms and hardware with ONNX Runtime with high performance.

https://onnx.ai/get-started.html

Install the ONNX runtime locally:

https://microsoft.github.io/onnxruntime/
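For example, a minimal export-and-run sketch (my reading of the workflow, not taken from the Microsoft post; the model name is the ALBERT model from earlier in this thread and the input names/dynamic axes are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import onnxruntime as ort

model_name = "ahotrod/albert_xxlargev1_squad2_512"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model.eval()

inputs = tokenizer.encode_plus("Who makes the 911?",
                               "The 911 is made by Porsche AG of Stuttgart.",
                               return_tensors="pt")

# Export to ONNX; dynamic axes let batch size and sequence length vary at inference time.
torch.onnx.export(model,
                  (inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"]),
                  "qa_model.onnx",
                  input_names=["input_ids", "attention_mask", "token_type_ids"],
                  output_names=["start_logits", "end_logits"],
                  dynamic_axes={name: {0: "batch", 1: "seq"}
                                for name in ["input_ids", "attention_mask", "token_type_ids"]},
                  opset_version=11)

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession("qa_model.onnx")
ort_start, ort_end = session.run(None, {k: v.numpy() for k, v in inputs.items()})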


tholor commented Mar 9, 2020

Interesting! We will definitely have a look and see whether this could be applicable for inference in haystack and FARM.

Timoeller commented:

It looks very promising indeed.
Have you tried ONNX runtime yourself on a normal Bert-base model, something like they showcase in this notebook?

In the table they report 3-layer BERT performance with very low batch sizes. These settings seem rather atypical.


ahotrod commented Mar 9, 2020

Have you tried ONNX runtime yourself on a normal Bert-base model, something like they showcase in this notebook?

I plan to go through that PyTorch-based notebook today. I will try some iterations with my more common settings.

There are several Bert-based model ONNX examples out there:

Tensorflow Bert model to ONNX

Earlier Bert Squad ONNX model

I noticed the earlier model has the dependency run_onnx_squad.py, which will be interesting to go through for comparison with FARM's & Transformers' run_squad.py.


ahotrod commented Mar 10, 2020

Went with onnx-ecosystem, which is a recent release (a couple of weeks old). I found nvidia-cuda-docker was not initializing, so I ditched Docker for now and ran the notebook in an environment with PyTorch v1.4.0, Transformers v2.5.1, and ONNX Runtime v1.2.1 (CPU & GPU).

With the variables (max_seq_length=128, etc.) as originally specified, here is the result on GPU:

ONNX Runtime inference time:  0.00811

PyTorch Inference time =  0.02096
***** Verifying correctness *****
PyTorch and ORT matching numbers: True
PyTorch and ORT matching numbers: True

With max_seq_length=384, everything else the same, here is the result:

ONNX Runtime inference time:  0.0193

PyTorch Inference time =  0.0273
***** Verifying correctness *****
PyTorch and ORT matching numbers: True
PyTorch and ORT matching numbers: True

Should have more time tomorrow to examine these preliminary results and to further iterate & characterize the differences, including the notebook's variables per_gpu_eval_batch_size and eval_batch_size, both originally set to 1.

At this point I am more familiar with ALBERT_xxlarge inference performance, so eventually I may try to implement it in ONNX for an inference comparison on a larger model.

Here's another max_seq_length=384 run:
Inference-PyTorch-Bert-Model-for-High-Performance-in-ONNX-Runtime_WIP - Jupyter Notebook.pdf

Timoeller commented:

Very interesting results.
I looked at the notebooks, and they measure only the model.forward pass. We have seen that during inference a bottleneck can come from the input + model output transformations - do you know how the frameworks compare there? Though honestly, if ONNX is much better at forward passes, there will always be a way to stick everything into that function and precompute the input transformations.

Looking forward to more results, especially for batch size and per-GPU batch size - I had the impression that multi-GPU utilization at inference is not really optimal in PyTorch.
Are you going to publish these and more results in a blog article? I guess a lot of people would be interested in independent test runs!

tanaysoni commented:

@ahotrod moving this to #39.
