Haystack with Albert is awesome! XLNet question #23

Closed
ahotrod opened this issue Feb 15, 2020 · 22 comments

ahotrod commented Feb 15, 2020

I am in the midst of evaluating Haystack with Albert and so far it looks awesome. Loving it, thanks for sharing.

I missed the whole Game of Thrones fantasy/drama phenomenon, so for a tutorial I could understand and relate to, I went looking for other content to use with your Tutorial1_Basic_QA_Pipeline.ipynb notebook. Being a Porschephile, I settled on:

import os
import wikipedia

porsche_wikis = wikipedia.search("Porsche", results=25)
doc_dir = "data/porsche/"
os.makedirs(doc_dir, exist_ok=True)

for wiki in porsche_wikis:
    html_page = wikipedia.page(title=wiki, auto_suggest=False)
    # Write each article's plain-text content to its own .txt file
    with open(doc_dir + wiki.replace('/', ' ') + ".txt", "w") as text_file:
        text_file.write(html_page.content)
    print(wiki)

I can relate to the above content and can ask relevant questions of it "all day long". All other code in your notebook remains the same, except that I use my Albert model for QA, and it works well:

reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512",
                    use_gpu=True)

For my application/project, I would like to also evaluate XLNet performance with Haystack but I am having trouble loading my XLNet model:

reader = FARMReader(model_name_or_path="ahotrod/xlnet_large_squad2_512",
                    use_gpu=True)

Attached is the complete terminal output, but the bottom-line error I get is:

AttributeError: 'XLNetForQuestionAnswering' object has no attribute 'qa_outputs'

output_term.txt

This XLNet model was fine-tuned on Transformers v2.1.1 and is the best I have, because I and others are having problems fine-tuning XLNet_large under Transformers v2.4.1 (huggingface/transformers#2651).

Perhaps this XLNet model, fine-tuned with Transformers v2.1.1, is not compatible or is missing the attribute mentioned in the error message?

Looking forward to additional FARM/Haystack QA capabilities you have in the works, thanks for your efforts!


ahotrod commented Feb 16, 2020

I am familiar with Transformers' null_score_diff_threshold, which, when set to zero, results in predictions = compute_predictions_logits(examples, features, ...) from transformers.data.metrics.squad_metrics returning a null string in predictions when there is no answer.

With FARMReader, what is the best indicator of a "No Answer" output when there is no "reasonable" probability of an answer? Is the argument no_ans_threshold for FARMReader the means to accomplish this? I do not see any change or pattern in the answers when varying this arg from -100 (default) to +100.

Must I implement "No Answer" functionality by analyzing the 'probability' & 'score' of the returned answers in prediction = finder.get_answers? If so, what would you recommend as thresholds/limits for either or both of 'probability' & 'score'?
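In the meantime, here is the rough post-filter I am experimenting with (just a sketch on my side, not a built-in mechanism; the threshold value is arbitrary and the dict keys are the ones returned by finder.get_answers):

prediction = finder.get_answers(question="Who is Hercules?",
                                top_k_retriever=10, top_k_reader=2)

MIN_ANSWER_PROB = 0.75   # arbitrary cut-off, to be tuned
kept = [a for a in prediction["answers"] if a["probability"] >= MIN_ANSWER_PROB]
if not kept:
    print("No answer")
else:
    print(kept[0]["answer"])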


tholor commented Feb 17, 2020

Hi @ahotrod

Nice example of searching Porsche wikis via the wikipedia package! We actually had quite a bit of discussion about which domain to pick for the tutorial :D

XLNet

Perhaps this fine-tuned XLNet model & Transformers v2.1.1 is missing the attribute mentioned in the error message?

It seems like there's a difference in naming the FF module (= "QA head") between the two available XLNet implementations:

  • XLNetForQuestionAnsweringSimple: it's called qa_outputs here (same as in other architectures)
  • XLNetForQuestionAnswering: there are separate modules for start_logits and end_logits

From a first look, XLNetForQuestionAnswering seems to be the original implementation from the paper, with some kind of beam search. These differences in attributes were already present in v2.1.1, so I am not sure whether that is really related to your performance issue. Do you only have trouble with training a new model, or also with using your existing model for inference?
https://github.com/huggingface/transformers/blob/v2.1.1/transformers/modeling_xlnet.py#L1295
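For illustration, here is a minimal sketch (not from the linked code, just to show the layout; submodule names as of transformers v2.x) that instantiates both classes from a bare config and prints their QA-head attributes:

from transformers import (XLNetConfig,
                          XLNetForQuestionAnswering,
                          XLNetForQuestionAnsweringSimple)

config = XLNetConfig()  # randomly initialized, we only want to inspect the module layout

simple = XLNetForQuestionAnsweringSimple(config)
print(simple.qa_outputs)   # single Linear head producing start/end logits (what FARM expects)

full = XLNetForQuestionAnswering(config)
print(full.start_logits)   # separate pooler module for start logits
print(full.end_logits)     # separate pooler module for end logits (used together with beam search)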

Regarding usage of this model in haystack:
a) FARMReader: As a quick fix we could load the weights from start_logits and end_logits manually into our QA prediction head. However, this would not allow you to benefit from beam search (or other extras implemented in XLNetForQuestionAnswering). For that, we would probably need another prediction head that implements beam search. Do you have any experience with the performance difference between those two XLNet implementations? If the boost is significant, we could also implement it in FARM and make it available for all other model architectures, too.

b) TransformersReader:
If using your model for inference in a transformers pipeline works, you could also load this in haystack via

reader = TransformersReader(model="ahotrod/xlnet_large_squad2_512", tokenizer="ahotrod/xlnet_large_squad2_512", use_gpu=-1)

However, I guess transformers' pipeline does not support this model type yet either.
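A quick way to check that is a plain transformers smoke test along these lines (a sketch; if the pipeline does not support the architecture yet, this will fail with an error):

from transformers import pipeline

qa = pipeline("question-answering",
              model="ahotrod/xlnet_large_squad2_512",
              tokenizer="ahotrod/xlnet_large_squad2_512")

# If this runs without errors, TransformersReader should work with the same model.
print(qa(question="Who makes the 911?",
         context="The Porsche 911 is a sports car made since 1963 by Porsche AG of Stuttgart."))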

No Answer

Is the argument "no_ans_threshold" for FarmReader the means to accomplish this?

Yes, that's the right argument to change. However, you are totally right that the current implementation is missing a step to include the no answer option in the final result - so you can't see any difference. Working on this now in #24.


ahotrod commented Feb 17, 2020

Hello @tholor, thanks for looking into these issues.

Yes, the latest fine-tuned XLNet model I have was fine-tuned with Transformers v2.1.1 and is shared at https://huggingface.co/ahotrod/xlnet_large_squad2_512. Fine-tuning this XLNet LM with Transformers v2.3.0 through v2.4.1 is currently not possible, perhaps related to the fact that Transformers' run_squad.py was quite heavily refactored in December 2019.

Loading my shared XLNet model, fine-tuned with Transformers v2.1.1, as you suggested with:

reader = TransformersReader(model="ahotrod/xlnet_large_squad2_512", tokenizer="ahotrod/xlnet_large_squad2_512", use_gpu=-1)

is successful; however, subsequent inference with prediction = finder.get_answers produces this error:

ValueError: too many values to unpack (expected 2)

output_term.txt

Don't sweat the XLNet issues at this juncture on my account. I'm good with Albert xxlarge v1 fine-tuned on SQuAD2, which is a sufficient model for the foreseeable future. Perhaps the XLNet issues will sort out with subsequent Transformers releases, as their run_squad.py code matures after having been in significant flux over the past few months.

Thanks for your PR #24 - you're all over it!

I've enjoyed and learned a good bit going through the FARM & Haystack code over the last few weeks. If you need any particular feedback or whatnot, be sure to ask. My time may be limited, but I will contribute where I can. Best regards!


tholor commented Feb 18, 2020

Hey @ahotrod,

Ok sounds good! We are currently discussing a few options on how to aggregate the no_answer results, as there are some tradeoffs between "simplicity of interpretation" and "theoretical soundness". However, it should be merged soon.

ValueError: too many values to unpack (expected 2)

Yes, this is a bug on the Transformers side. It should work once their pipeline object supports XLNet. There are similar issues with other QA model types (e.g. RoBERTa: huggingface/transformers#2788).

I'm good with Albert xxlarge v1 fine-tuned on SQuAD2 which is a sufficient model for the foreseeable future.

Ok, great! I am still a bit curious: did you observe any performance differences in training an XLNetForQuestionAnswering vs. XLNetForQuestionAnsweringSimple? Have you tried training one in FARM?

I've enjoyed and learned a good bit going through the FARM & Haystack code over the last few weeks. If you need any particular feedback or whatnot, be sure to ask.

This would actually be great. We are always keen to hear direct user feedback and would love to understand which parts of the framework were helpful and where confusion or problems occur. Let's continue this conversation via email and maybe have a quick call if you are interested. You can reach me at malte.pietsch [at] deepset.ai

@tholor tholor self-assigned this Feb 18, 2020
@tholor tholor added the type:bug Something isn't working label Feb 18, 2020

tholor commented Feb 19, 2020

We just merged #24, which now allows returning "no_answers" from the Finder. We also made the argument in the FARMReader more intuitive by renaming it to no_ans_boost: a positive value "boosts" the no_answer option towards the top, while a negative value "penalizes" it.
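A minimal usage sketch (illustrative only; the boost value is arbitrary and the import path assumes the layout used in the current tutorials):

from haystack.reader.farm import FARMReader

# Positive values push the "no answer" option towards the top of the results,
# negative values penalize it. The value 10 is purely illustrative.
reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512",
                    no_ans_boost=10,
                    use_gpu=True)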


ahotrod commented Feb 20, 2020

@tholor

Ok, great! I am still a bit curious: did you observe any performance differences in training an XLNetForQuestionAnswering vs. XLNetForQuestionAnsweringSimple? Have you tried training one in FARM?

Haven't tried training either in FARM, only with Transformers. Will do so when time permits.

Thanks, I'm starting to go through the merged #24 PR changes now and run some examples. Will post comments after more experimentation.


ahotrod commented Feb 21, 2020

I have been testing with the single Porsche 911 wiki (which is similar in size to my domain app) to identify limitations, plus trying to "fool" the FARMReader.

Looking good so far!

Here's an example with basic structure & elements for my app:

reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512",
                    context_window_size=40, no_ans_boost=0, batch_size=48,
                    use_gpu=True, n_candidates_per_passage=2)

question = "911?"  # seed retriever

# Apply retriever to get all the paragraphs from one wiki; use a high top_k
# to be sure to get all the paragraphs of that wiki
paragraphs, meta_data = retriever.retrieve(question, top_k=80,
                                           candidate_doc_ids=None)

questions = [
    "When was the 911 introduced?",
    "When was the 911 first introduced?",
    "When was the 911 first produced?",
    "Why did Porsche develop the Type 964 RS America?",
    "What was the last year for air cooled 911s?",
    "Who is Hercules?",  # <-- obvious no_answer question
]

for question in questions:  # Apply reader to get granular answer(s)
    start = time.time()
    prediction = reader.predict(question, paragrahps=paragraphs,  # "paragrahps" is misspelled in farm.py
                                meta_data_paragraphs=meta_data, top_k=2)
    end = time.time()
    print_answers(prediction, details="all")
    print('\n', "Total Prediction time(M:S): ",
          time.strftime('%M:%S', time.gmtime(end - start)), '\n')

Output:

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.08s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:21 

{   'adjust_no_ans_boost': 14.835111737251282,
    'answers': [   {   'answer': '1963',
                       'context': 'ts car made since 1963 by Porsche AG of ',
                       'document_id': '3',
                       'offset_end': 23,
                       'offset_start': 18,
                       'probability': 0.8803529149319752,
                       'score': 15.966211318969727},
                   {   'answer': 'February 1964',
                       'context': 'rking unit in February 1964.It originall',
                       'document_id': '11',
                       'offset_end': 27,
                       'offset_start': 14,
                       'probability': 0.8782006146536698,
                       'score': 15.803997993469238}],
    'question': 'When was the 911 introduced?'}

 Total Prediction time(M:S):  00:41 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.06s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:22 

{   'adjust_no_ans_boost': 16.003353014588356,
    'answers': [   {   'answer': '1963',
                       'context': 'ts car made since 1963 by Porsche AG of ',
                       'document_id': '3',
                       'offset_end': 23,
                       'offset_start': 18,
                       'probability': 0.8828941958574763,
                       'score': 16.161020278930664},
                   {   'answer': 'February 1964',
                       'context': 'rking unit in February 1964.It originall',
                       'document_id': '11',
                       'offset_end': 27,
                       'offset_start': 14,
                       'probability': 0.8748021832534769,
                       'score': 15.55282211303711}],
    'question': 'When was the 911 first introduced?'}

 Total Prediction time(M:S):  00:43 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.11s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:21 

{   'adjust_no_ans_boost': 16.136614687740803,
    'answers': [   {   'answer': '1963',
                       'context': 'ts car made since 1963 by Porsche AG of ',
                       'document_id': '3',
                       'offset_end': 23,
                       'offset_start': 18,
                       'probability': 0.8818734992209446,
                       'score': 16.082340240478516},
                   {   'answer': 'September 1964',
                       'context': 'ion began in September 1964, with the fi',
                       'document_id': '11',
                       'offset_end': 27,
                       'offset_start': 13,
                       'probability': 0.8808546905697762,
                       'score': 16.004390716552734}],
    'question': 'When was the 911 first produced?'}

 Total Prediction time(M:S):  00:42 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.11s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:24 

{   'adjust_no_ans_boost': 11.999503135681152,
    'answers': [   {   'answer': 'appeals from American customers',
                       'context': '93, appeals from American customers resu',
                       'document_id': '23',
                       'offset_end': 36,
                       'offset_start': 4,
                       'probability': 0.8635460031826985,
                       'score': 14.760477066040039},
                   {   'answer': 'because the desirable 930 was not available',
                       'context': 'ecause the desirable 930 was not availab',
                       'document_id': '14',
                       'offset_end': 42,
                       'offset_start': -1,
                       'probability': 0.8081948272857139,
                       'score': 11.506584167480469}],
    'question': 'Why did Porsche develop the Type 964 RS America?'}

 Total Prediction time(M:S):  00:45 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.14s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:24 

{   'adjust_no_ans_boost': 10.78890323638916,
    'answers': [   {   'answer': '1998',
                       'context': ' between 1996 and 1998.  [Production num',
                       'document_id': '13',
                       'offset_end': 22,
                       'offset_start': 18,
                       'probability': 0.833244043870347,
                       'score': 12.870361328125},
                   {   'answer': '(1995–1998)',
                       'context': '1\nPorsche 993 (1995–1998) the last air-c',
                       'document_id': '0',
                       'offset_end': 26,
                       'offset_start': 14,
                       'probability': 0.8261399496013931,
                       'score': 12.468108177185059}],
    'question': 'What was the last year for air cooled 911s?'}

 Total Prediction time(M:S):  00:45 

Inferencing: 100%|██████████| 2/2 [00:20<00:00, 10.14s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 00:20
_get_predictions()->   aggregate_preds    ->model_formatted_preds() time(M:S): 00:25 

{   'adjust_no_ans_boost': -19.488919258117676,
    'answers': [   {   'answer': '[computer says no answer is likely]',
                       'context': '',
                       'document_id': None,
                       'offset_end': 0,
                       'offset_start': 0,
                       'probability': 0.9306377792184659,
                       'score': 20.77222228050232},
                   {   'answer': 'Michael Mauer',
                       'context': 'was headed by Michael Mauer.\nAt the fron',
                       'document_id': '22',
                       'offset_end': 27,
                       'offset_start': 14,
                       'probability': 0.5400174445173579,
                       'score': 1.2833030223846436}],
    'question': 'Who is Hercules?'}

 Total Prediction time(M:S):  00:46 

Reducing the reader.predict execution time would be nice. The example above uses retriever top_k=80 to ensure that all of the one wiki's contents go through the reader, so nothing is overlooked from the retriever's fast pass. I need to determine a proper balance between retriever top_k, execution time, and answer accuracy. Current execution time is about 42-46 seconds per question in this example, partly a result of running on a single NVIDIA 1080 Ti at maximum batch size. Any suggestions, other than a smaller model (which sacrifices EM/F1 performance) or a 4x V100 cloud solution?
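For now I'm planning a rough sweep along these lines (same retrieve/predict calls as above; the top_k values are just examples) to see where timing and answer quality settle:

import time

question = "When was the 911 introduced?"
for top_k in (10, 20, 40, 80):  # illustrative values
    paragraphs, meta_data = retriever.retrieve(question, top_k=top_k,
                                               candidate_doc_ids=None)
    start = time.time()
    prediction = reader.predict(question, paragrahps=paragraphs,  # kwarg spelled as in farm.py
                                meta_data_paragraphs=meta_data, top_k=2)
    print(f"top_k={top_k}: {len(paragraphs)} paragraphs, {time.time() - start:.1f}s, "
          f"top answer: {prediction['answers'][0]['answer']}")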

BTW, a next step is architecting a cloud solution. I'm leaning towards GCP or Azure as the best platforms for Speech-To-Text and NLP/ML hosting.

Timoeller commented:

Hey @ahotrod, thanks for using the cutting-edge haystack version. It is really rewarding to see that it works and that people are already using it!

Concerning the execution time: I see you have already increased the batch_size parameter in the FARMReader to 48. Is this really the maximum value that fits onto your 1080 Ti during inference? Could you try setting it to 100 or even 200, so that all data coming from the retriever fits into one batch? GPU memory consumption is very different in training vs. inference, so even these high batch sizes should work.
Another option would be to use the CPU, if you have enough memory to fit everything into one batch. I know this sounds mad, but it might be worth trying :D
I would be highly interested in whether these suggestions speed things up, so please report back if you find anything useful.

We will merge the new changes into master today and will likely introduce more breaking changes around the FARM inference + haystack interaction in the coming days. So please be prepared that some of your old code won't work with upcoming haystack versions.

Concerning a cloud solution: yes, something that scales inference automatically like AWS SageMaker would be interesting. Do you know of any good solutions like that on GCP or Azure?
Btw, Malte will interact with you more personally soon - maybe it is better to discuss ways forward there.


ahotrod commented Feb 25, 2020

@Timoeller

Concerning the execution time. I have seen you already increased the batch_size parameter in the FARMReader to 48. Is this really the maximum value that fits onto your 1080Ti during inference?

Yes, maximum batch size on the 1080Ti GPU with Albert_xxlarge is 50, which uses 11GB of the available 11.2GB of its memory. Inferencing in the cloud for production will require more capable resources.

Another option would be to use the CPU, if you have enough memory to fit everything into one batch. I know this sounds mad, but it might be worth trying :D

Using today's new master changes:

reader = FARMReader(model_name_or_path="ahotrod/albert_xxlargev1_squad2_512",
                    context_window_size=40, no_ans_boost=100, batch_size=500,
                    use_gpu=False, n_candidates_per_paragraph=1)
Inferencing: 100%|██████████| 1/1 [06:00<00:00, 360.78s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 06:00
_get_predictions()->   aggregate_preds  - > model_formatted_preds() time(M:S): 00:06 

{   'answers': [   {   'answer': 'appeals from American customers',
                       'context': '93, appeals from American customers resu',
                       'document_id': '23',
                       'offset_end': 36,
                       'offset_start': 4,
                       'probability': 0.8635471550279814,
                       'score': 14.760555267333984},
                   {   'answer': 'because the desirable 930 was not available',
                       'context': 'ecause the desirable 930 was not availab',
                       'document_id': '14',
                       'offset_end': 42,
                       'offset_start': -1,
                       'probability': 0.8081962132324763,
                       'score': 11.5066556930542}],
    'no_ans_gap': 22.80221176147461,
    'question': 'Why did Porsche develop the Type 964 RS America?'}

 Total Prediction time(M:S):  06:08 

Inferencing: 100%|██████████| 1/1 [05:35<00:00, 335.17s/it]
_get_predictions()->get actual predictions->model_formatted_preds() time(M:S): 05:35
_get_predictions()->   aggregate_preds  - > model_formatted_preds() time(M:S): 00:06 

{   'answers': [   {   'answer': 'Michael Mauer',
                       'context': 'was headed by Michael Mauer.\nAt the fron',
                       'document_id': '22',
                       'offset_end': 27,
                       'offset_start': 14,
                       'probability': 0.5400200540227474,
                       'score': 1.2833870649337769},
                   {   'answer': 'Goldilocks',
                       'context': 'version of the Goldilocks tale is the 99',
                       'document_id': '13',
                       'offset_end': 26,
                       'offset_start': 15,
                       'probability': 0.4770406639968783,
                       'score': -0.7352157831192017}],
    'no_ans_gap': 23.601887702941895,
    'question': 'Who is Hercules?'}

 Total Prediction time(M:S):  05:42 

This CPU-only config runs inference in one pass, occupying a peak of 18GB of CPU memory with 100% utilization on six of 12 CPU threads. Inference times increase about 16-17x compared to running on the GPU. The interesting thing is that the aggregate_preds portion of the _get_predictions() code in infer.py drops from ~21-25 secs on the GPU to 6 secs on the CPU. More CPU-friendly operations/code, I guess?

Unfortunately, the "no_answer" functionality doesn't work for me now. I tried no_ans_boost = 10, 100, 1000, -10, -100, -1000 and never received a no-answer prediction, as before, for the unanswerable question "Who is Hercules?". I don't see where no_ans_boost plays a role in the predict and new _calc_no_answer code. In def _calc_no_answer, why the if-else structure when no_ans_score gets the same assignment in both branches?

Concerning cloud solution: yes, something to scale inference automatically like AWS sagemaker would be interesting. Do you know any good solutions like that on GCP or Azure?

Yes, GCP & Azure both have provisions to scale with demand.


Timoeller commented Feb 25, 2020

Good morning @ahotrod

Thanks for posting the inference times. Having one batch take that long is strange. We will look into PyTorch inference in more detail soon, because it seems to be very different from normal training. We will update you once we find a good solution.

More CPU-friendly operations/code I guess?

You guessed correctly: this code doesn't run on the GPU. This is mainly due to working with strings to get the actual answers. PyTorch doesn't support strings, so a pure GPU solution seems difficult. There are nevertheless many operations in this function that we could improve.

Concerning no_ans_boost on the newest master: very good point. We need to update the requirements for FARM, since the latest FARM master needs to be installed, too.


ahotrod commented Feb 25, 2020

PyTorch doesn't support strings, so a pure GPU solution seems difficult. Do you have experience there?

Just in passing, as I noted some time ago on Transformers, RAPIDS offers a standalone CUDA string library, cuStrings, with the Python wrapper nvStrings: https://github.com/rapidsai/custrings

It provides high-speed loading & processing of textual dataframes on the GPU with CUDA. Moving pandas dataframes to the GPU takes several lines of code, or data can perhaps be loaded straight onto the GPU.

It might be applicable for "pure GPU solutions" and for GPU-accelerated word tokenization, as touched on in this basic example:
https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801

@Timoeller Timoeller assigned Timoeller and unassigned tholor Feb 26, 2020
Timoeller commented:

Nice, thanks for the links. I didn't know about RAPIDS, though they seem super active in bringing all kinds of useful code to GPUs.

I will move this conversation to email, since I don't think it will be too useful for the community.

Timoeller commented:

Following up on the latest developments:
We have been optimizing FARM QA inference, see deepset-ai/FARM/pull/268.
If you update FARM to the latest master, you should be able to increase the batch size a lot, @ahotrod.


ahotrod commented Feb 28, 2020

If you update FARM on latest master you should be able to increase the batch size a lot @Timoeller

Yes, batch_size=160 now accommodates 73kB of context/paragraphs, easily fitting on a single GPU:

[Screenshot from 2020-02-28 13-29-43]

Total prediction times have dropped from about 42-46 seconds per question to 27 seconds.
No_answer predictions are now functional.
Good work!

Timoeller commented:

Nice! Thanks for the kudos - I forwarded it to the team : )

We ran some experiments ourselves and got roughly the same speedup.
And there is still so much to optimize:

  • vectorization and PyTorch operations across the whole inference pipeline (merging answers back to strings only as the very final operation)
  • more efficient interaction of CPU data preprocessing and GPU model inference
  • efficient multi-GPU support
  • ...

Will keep you updated.


ahotrod commented Mar 7, 2020

And there is still so much to optimize:

Microsoft open sources breakthrough optimizations for transformer inference on GPU and CPU
January 21, 2020

"AI developers can now easily productionize large transformer models with high performance across both CPU and GPU hardware". To get started:

  1. Train a model with or load a pre-trained model from popular frameworks such as PyTorch or TensorFlow.

  2. Prepare your model for optimized inferencing by exporting from PyTorch or converting from TensorFlow/Keras to ONNX format.
    https://pytorch.org/docs/stable/onnx.html
    https://github.com/onnx/onnx-docker/tree/master/onnx-ecosystem

  3. Inference across multiple platforms and hardware with ONNX Runtime with high performance.

https://onnx.ai/get-started.html

Install the ONNX runtime locally:

https://microsoft.github.io/onnxruntime/
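For example, a minimal export-and-run sketch (my reading of the workflow, not taken from the Microsoft post; the model name is the ALBERT model from earlier in this thread and the input names/dynamic axes are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import onnxruntime as ort

model_name = "ahotrod/albert_xxlargev1_squad2_512"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model.eval()

inputs = tokenizer.encode_plus("Who makes the 911?",
                               "The 911 is made by Porsche AG of Stuttgart.",
                               return_tensors="pt")

# Export to ONNX; dynamic axes let batch size and sequence length vary at inference time.
torch.onnx.export(model,
                  (inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"]),
                  "qa_model.onnx",
                  input_names=["input_ids", "attention_mask", "token_type_ids"],
                  output_names=["start_logits", "end_logits"],
                  dynamic_axes={name: {0: "batch", 1: "seq"}
                                for name in ["input_ids", "attention_mask", "token_type_ids"]},
                  opset_version=11)

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession("qa_model.onnx")
ort_start, ort_end = session.run(None, {k: v.numpy() for k, v in inputs.items()})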


tholor commented Mar 9, 2020

Interesting! We will definitely have a look and see whether this could be applicable for inference in haystack and FARM.

Timoeller commented:

It looks very promising indeed.
Have you tried ONNX runtime yourself on a normal Bert-base model, something like they showcase in this notebook?

In the table they report 3-layer BERT performance with very low batch sizes. These settings seem rather atypical.


ahotrod commented Mar 9, 2020

Have you tried ONNX runtime yourself on a normal Bert-base model, something like they showcase in this notebook?

I plan to go through that PyTorch-based notebook today. I will try some iterations with my more common settings.

There are several Bert-based model ONNX examples out there:

Tensorflow Bert model to ONNX

Earlier Bert Squad ONNX model

I noticed the earlier model has the dependency run_onnx_squad.py, which will be interesting to go through for comparison with FARM's & Transformers' run_squad.py.


ahotrod commented Mar 10, 2020

Went with onnx-ecosystem, which is a recent release (a couple of weeks old). I found nvidia-cuda-docker was not initializing, so I ditched Docker for now and ran the notebook in an environment with PyTorch v1.4.0, Transformers v2.5.1, and ONNX Runtime v1.2.1 (CPU & GPU).

With the variables (max_seq_length=128, etc.) as originally specified, here is the result on GPU:

ONNX Runtime inference time:  0.00811

PyTorch Inference time =  0.02096
***** Verifying correctness *****
PyTorch and ORT matching numbers: True
PyTorch and ORT matching numbers: True

With max_seq_length=384, everything else the same, here is the result:

ONNX Runtime inference time:  0.0193

PyTorch Inference time =  0.0273
***** Verifying correctness *****
PyTorch and ORT matching numbers: True
PyTorch and ORT matching numbers: True

Should have more time tomorrow to examine these preliminary results and to further iterate & characterize the differences, including the notebook's variables per_gpu_eval_batch_size and eval_batch_size, both originally set to 1.

At this point I am more familiar with ALBERT_xxlarge inference performance, so eventually I may try to implement it in ONNX for an inference comparison on a larger model.

Here's another max_seq_length=384 run:
Inference-PyTorch-Bert-Model-for-High-Performance-in-ONNX-Runtime_WIP - Jupyter Notebook.pdf

Timoeller commented:

Very interesting results.
I looked at the notebooks, and they measure only the model.forward pass. We have seen that during inference a bottleneck can come from the input + model output transformations - do you know how the frameworks compare there? Though honestly, if ONNX is much better at forward passes, there will always be a way to stick everything into that function and precompute the input transformations.

Looking forward to more results, especially for batch size and per-GPU batch size - I had the impression that multi-GPU utilization at inference is not really optimal in PyTorch.
Are you going to publish these and more results in a blog article? I guess a lot of people would be interested in independent test runs!

tanaysoni commented:

@ahotrod moving this to #39.
