
Integrate Ai2 Longformer into Haystack? #61

Closed
ahotrod opened this issue Apr 14, 2020 · 14 comments
Labels
topic:speed type:feature New feature or request

Comments

@ahotrod

ahotrod commented Apr 14, 2020

Longformer from Allen Institute for AI (Ai2), a scalable transformer model for long-document NLP tasks without chunking/truncation. Longformer "scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer".

Paper: https://arxiv.org/abs/2004.05150

Code and pretrained model: https://github.com/allenai/longformer

TriviaQA data: https://github.com/allenai/document-qa

@tholor
Member

tholor commented Apr 15, 2020

I also came across the paper over the weekend and it seems like an interesting direction to explore.
Scaling transformers to longer contexts would definitely be useful in Haystack, and I see two potential benefits of processing a large context at once (vs. splitting it into chunks):
a) speed
b) accuracy

While the paper suggests that the Longformer gives some (small) improvements on b) (Table 8), it's not clear to me how much faster this really is for QA. Improving the speed of readers and improving the accuracy of retrievers are the two biggest levers we see right now for improving production-ready QA systems.

In the paper they mention:

We use the sliding window attention with window size of 512 on all layers.
This matches RoBERTa’s sequence length, and therefore uses the same amount of computation
as RoBERTa.
(from Section 5)

Have you maybe already tried the code-base and have some rough indication on speed improvements?

@ahotrod
Author

ahotrod commented Apr 15, 2020

@tholor thanks for your commentary on the paper and on Longformer's applicability to Haystack.

I have replicated the paper's leaderboard results using their pretrained TriviaQA Longformer-large model's pytorch-lightning checkpoint. I duplicated their environment per their README and bumped the default batch size up to 32, while "breaking in" my new Titan RTX GPU.

[Screenshots: TriviaQA evaluation output replicating the leaderboard results]

@ahotrod
Author

ahotrod commented Apr 18, 2020

My current impression is that the Longformer models' increased overall size, a result of the larger max_sequence_length, will require significantly more GPU/TPU "horsepower", and therefore a more expensive GPU/TPU instance for production. Present Haystack using my RoBERTa-large model is much improved speed-wise and is adequate (accuracy/speed/production-instance cost) for my use case. I'm always looking for improvements, but I can go forward with present capabilities.

I'll be keeping an eye on Longformer and will evaluate it if it's added to the Transformers framework.

@tholor tholor added the type:feature New feature or request label May 6, 2020
@Utomo88

Utomo88 commented Jun 18, 2020

It would be great if Haystack could handle long documents without problems.
Check this:
https://arxiv.org/abs/2006.03701

Also, please note there is an update on Longformer:

https://github.com/allenai/longformer

***** New June 2nd, 2020: Integrating with Huggingface + Train your own long model + Gradient checkpointing *****

Longformer is now integrated in the huggingface/transformers release v2.11.0. Now you can do
model = AutoModel.from_pretrained("allenai/longformer-base-4096")
The release also includes LongformerForQA and other LongformerForTaskName with automatic setting of global attention.

We added a notebook to show how to convert an existing pretrained model into its "long" version.

Gradient checkpointing is in progress (check PR), but in the meantime, you can use this branch https://github.com/ibeltagy/transformers/tree/grad_checkpointing. Gradient checkpointing can reduce memory usage significantly (5x for longformer-base-4096), allowing longer sequences on smaller GPUs.
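
As a minimal sketch (assuming a recent transformers release; the QA head there is named LongformerForQuestionAnswering), loading the base model and the QA variant looks roughly like this:

from transformers import AutoModel, AutoTokenizer, LongformerForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
base_model = AutoModel.from_pretrained("allenai/longformer-base-4096")

# QA head; global attention on the question tokens is set automatically
qa_model = LongformerForQuestionAnswering.from_pretrained("allenai/longformer-base-4096")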

@tholor
Member

tholor commented Jun 29, 2020

Thanks for the pointer @Utomo88. Have you tried the Longformer for QA yet? We are closely following the developments around more efficient attention mechanisms (Longformer, Reformer ...), but haven't seen super impressive metrics on their performance for QA yet (speed + accuracy). From a very quick test (on a Tesla V100), I found:

Model                                           batch_size  seq_length  time (sec/batch)
deepset/bert-base-squad2                        32          256         0.2
deepset/bert-base-squad2                        32          512         0.4
mrm8488/longformer-base-4096-finetuned-squadv2  32          256         0.8
mrm8488/longformer-base-4096-finetuned-squadv2  32          512         0.8
mrm8488/longformer-base-4096-finetuned-squadv2  32          1024        1.6
mrm8488/longformer-base-4096-finetuned-squadv2  32          2048        3.5
mrm8488/longformer-base-4096-finetuned-squadv2  32          4096        6.8
  • Longformer scales roughly linearly, but is considerably slower than BERT at seq_length 512
  • So for a doc with 4096 tokens, it will still be faster to chunk it into 8 parts and process those via BERT
  • Longformer could be interesting though if model accuracy is considerably higher on long docs (to be tested)
  • Interestingly, the Longformer needs quite a few "warmup iterations" before reaching decent speed
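
For reference, a rough sketch of how such a per-batch timing can be measured (model name, batch size and sequence length are taken from the table above; the measurement loop itself is an assumption, not the exact script used here):

import time
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "mrm8488/longformer-base-4096-finetuned-squadv2"
).eval().cuda()

batch_size, seq_length = 32, 512
input_ids = torch.randint(0, 1000, (batch_size, seq_length)).cuda()
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    for _ in range(5):  # warmup iterations (the Longformer needs a few, see above)
        model(input_ids=input_ids, attention_mask=attention_mask)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        model(input_ids=input_ids, attention_mask=attention_mask)
    torch.cuda.synchronize()

print(f"{(time.time() - start) / 10:.2f} sec/batch")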

My impression from that quick evaluation hasn't changed much from before: an interesting model and certainly a nice-to-have in Haystack, but not the highest priority, as it probably won't help much with speeding up QA on long docs.

@Utomo88 If you are interested in following this direction and creating a PR, we are of course happy to support.

@tholor
Member

tholor commented Jul 14, 2020

Closing this for now. Feel free to reopen in case new insights come up or anybody from the community wants to work on this :)

@tholor tholor closed this as completed Jul 14, 2020
@Krak91
Contributor

Krak91 commented Sep 1, 2020

Hi, I was also in the process of trying to compare the Longformer with RoBERTa, which made me think about ways to improve speed and about potential problems with splitting, but I can't figure out how the chunking works. Can somebody elaborate on how exactly the text is being chunked before it is fed to the model? Is there any chance that a potential answer could span across these chunks and be missed? Thanks!

@brandenchan
Contributor

Hi @Krak91, our current chunking method is the sliding window approach, which is quite standard in QA systems at the moment. Our implementation can be found here.

The idea is that we divide a document up into chunks that are approximately Y tokens long, where Y is mostly determined by the max_seq_len variable. If our doc_stride is Z tokens, then the document will be divided into passages where:

passage_1 = document_tokens[0: Y]
passage_2 = document_tokens[Z: Z+Y]
passage_3 = document_tokens[2Z: 2Z+Y]
...

More intuitively, we have a window of Y tokens that we use to extract chunks from the document. This window slides forward by Z tokens each time to cover more of the document.

To answer your question, it is possible that an answer is missed because of this. It would have to extend beyond both the start and the end of the overlapping region between two chunks. We usually set Z=128 and Y≈384, so the overlap is about 256 tokens and an answer would have to be at least ~256 tokens long in order to be undetectable by the model. In most use cases, especially in SQuAD-style QA, this pretty much never happens. While this method is effective, it can certainly be improved upon!
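
In code, a minimal sketch of this sliding window looks like the following (here window corresponds to Y ≈ max_seq_len and stride to Z = doc_stride; the actual implementation linked above also handles the question and special tokens):

def chunk_tokens(document_tokens, window=384, stride=128):
    # Overlapping passages of `window` tokens, sliding forward by `stride` tokens
    passages = []
    start = 0
    while True:
        passages.append(document_tokens[start:start + window])
        if start + window >= len(document_tokens):
            break
        start += stride
    return passages

# e.g. a 1000-token document yields passages starting at tokens 0, 128, 256, 384, 512, 640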

I hope this makes some sense. Let me know if you'd like any follow up explanation!

@Krak91
Contributor

Krak91 commented Sep 1, 2020

Hi @brandenchan. Yes, it makes perfect sense. Just to make sure I understand correctly: you basically apply a rolling window of size = max_seq_len and slide it by doc_stride indices (I completely missed that). Thanks for the detailed explanation.

Although this almost fixes the problem, I believe the real solution is to always feed the model chunks of a size that it can process as a whole, without risking missing something or processing more data than needed.

However, the tokenizer produces more tokens than words, and the ratio is not consistent. From a few tests I've found that it ranges between 1.35 and 1.72 tokens per word, which corresponds to roughly 225-286 words for max_seq_len=384. So, say, 200 words per chunk might be a safe number.

My approach was to create a function that first splits the documents into paragraphs and then splits each paragraph that exceeds 200 words into smaller parts while keeping sentences whole. More precisely, if a paragraph is more than 200 words, it gets split into sentences and the sentences are grouped into chunks of less than 200 words (a rough sketch follows below). This might leave the last chunk being just one sentence, but since it is still a whole sentence, the semantics at least stay intact. If someone knows an efficient algorithm that minimizes the number of chunks while maximizing the length of each chunk to avoid this, I'd like to know. :) Thanks!
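
Roughly, the splitting I have in mind looks like this (just a sketch; the regex sentence split is a placeholder for a proper sentence tokenizer):

import re

def split_long_paragraph(paragraph, max_words=200):
    # Paragraphs within the word budget are kept whole
    if len(paragraph.split()) <= max_words:
        return [paragraph]
    # Otherwise split into sentences and greedily group them into chunks of <= max_words
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks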

@brandenchan
Contributor

Yes, it makes perfect sense and just to make sure I understand correctly, you basically apply a rolling window of size = max_seq_len, and slide it by doc_stride indices (completely missed that).

Yup that's exactly right!

I can see the motivation behind your solution but there are a few things worth mentioning:

Your method assumes that answers can't cross sentence boundaries. This may be true in SQuAD, but I can imagine there being more complex questions that can only be answered by picking a two-sentence answer. Nonetheless, it might still be worth it for your use case.

Perhaps more of an issue is the fact that it is generally good to fill the chunk with as much text as possible. Consider the case where a chunk is just a single sentence. If the sentence talks about "previously mentioned concepts..." or "the concepts that follow...", the model has no chance of figuring out what they are, since it only sees the tokens in that sentence. For tougher questions, surrounding context might be crucial to finding the answer.

This is just what my intuition is telling me and I could be wrong, especially since I don't know your exact use case. Please do update us if you manage to implement your method and get good results. I would be very happy to be proven wrong!

@Krak91
Contributor

Krak91 commented Sep 1, 2020

Thanks for the reply @brandenchan. These are perfectly valid points, a few of which I hadn't taken into consideration, like the example you gave. Even if it is rare, and even though SQuAD trains the model to find the shortest possible span, some answers might be missed in those cases, and we do need to pick up answers that span across sentences.
Since paragraphs are, most of the time, naturally coherent semantically, do you think applying a rolling window over the sentences of just the paragraphs that exceed the length limit could tackle this and maybe also improve efficiency?

@brandenchan
Contributor

Sure, I can imagine that this might reduce the number of samples that you feed the model without jeopardising the model's ability to pick spans.

@brandenchan
Contributor

I have just one more thing to add actually. We'd recommend that you split up your documents into smaller chunks before indexing them into the document store. Paragraphs might be around the right size. This means the retriever will pass smaller texts on to the reader and thus result in a speed-up.

Smaller chunks also mean that you could use Dense Passage Retrieval. DPR works best when the text it retrieves is less than 512 tokens, since that is the maximum length it can encode.
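
A minimal sketch of that pre-splitting step (the dict layout with "text" and "meta" follows Haystack's write_documents format, but treat it as an assumption and check it against the version you are running):

def to_paragraph_docs(doc_text, doc_name):
    # Split on blank lines into paragraph-sized chunks and attach some metadata
    paragraphs = [p.strip() for p in doc_text.split("\n\n") if p.strip()]
    return [
        {"text": p, "meta": {"name": doc_name, "paragraph_id": i}}
        for i, p in enumerate(paragraphs)
    ]

# docs = to_paragraph_docs(open("my_doc.txt").read(), "my_doc.txt")
# document_store.write_documents(docs)  # e.g. an ElasticsearchDocumentStore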

@Krak91
Contributor

Krak91 commented Sep 2, 2020

Hi @brandenchan, that is exactly what I'm trying to do. The fact that you suggested it gives me confidence that I'm on the right track. Also, I believe that Elasticsearch makes this approach even more attractive. My thinking is to add the text spans that my sliding window produces as meta fields on paragraphs that exceed the limit. Then, if the retriever returns a paragraph that's more than ~200 words, give its chunks to the model as 'samples', or pick the best-matching chunk as a sample instead of the whole paragraph.
