
Maximum length of sentence input #70

Open
tuanton-ai opened this issue Oct 26, 2021 · 9 comments

Comments

@tuanton-ai

I want to know the maximum length of sentence that we can input into KeyBERT. (I have a very long document that I need to extract keywords from; it is more than 20,000 words.)

@MaartenGr
Owner

That depends; there are quite a number of embedding models that you can use in KeyBERT, and reading through the embedding guide might help. Typically, though, extracting keywords from such a large document is not the use case for KeyBERT. With 20,000 words, it is actually quite difficult to extract keywords at all, since the document would likely contain hundreds of different topics. You could also split the document up into sentences to get more fine-grained results, but I am not sure whether that would suit your use case.
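
For reference, a minimal sketch of the sentence-splitting route suggested above. `long_doc` is a placeholder for the 20,000-word text, and nltk is used only for sentence splitting (any splitter would do):

```python
# Minimal sketch: split a very long document into sentences and extract
# keywords per sentence. `long_doc` is a placeholder for the full text.
import nltk
from keybert import KeyBERT

nltk.download("punkt")
sentences = nltk.sent_tokenize(long_doc)

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

# extract_keywords accepts a list of documents and returns one keyword
# list per sentence, which gives more fine-grained results.
keywords_per_sentence = kw_model.extract_keywords(sentences)
```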

@millnerryan

I had the same question. I was adding additional text to my doc to analyze, and I noticed that it didn't change the keywords at all, but adding that additional text at the beginning of the doc did.

I'm using "all-MiniLM-L6-v2", and even just knowing the max length so that I can plan around it would be helpful.

@MaartenGr
Owner

There is an overview here of pre-trained models in sentence-transformers. For each model, you can see the indicated max sequence length, which relates to the length of the documents it can process. If you are looking for models that can process longer documents, the embedding guide gives some options. For instance, you can average word embeddings using Flair to create a full representation of the document. Moreover, Flair tends to average embeddings instead of truncating them, which might be helpful in your case.
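
A hedged sketch of both points, with `long_doc` as a placeholder and the Flair word embeddings chosen only as an example:

```python
# Check the truncation limit of a sentence-transformers model.
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("all-MiniLM-L6-v2")
print(st_model.max_seq_length)  # tokens beyond this limit are silently truncated

# Alternative from the embedding guide: pool word embeddings with Flair,
# which averages over all words instead of truncating the document.
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from keybert import KeyBERT

word_embs = WordEmbeddings("crawl")          # example word embeddings
doc_embedder = DocumentPoolEmbeddings([word_embs])

kw_model = KeyBERT(model=doc_embedder)
keywords = kw_model.extract_keywords(long_doc)  # long_doc: the full document text
```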

@millnerryan

awesome, thanks!

@venkatesh-kulkarni

I want to use KeyBERT to extract keywords from a document containing 1,000-2,000 words. I am using the all-MiniLM-L6-v2 model, which has a maximum sequence length of 512 tokens. I tried the averaged embeddings mentioned in the embedding guide, but they give poor results. How can I modify the code to use the all-MiniLM-L6-v2 model while still overcoming the maximum sequence length issue?

@MaartenGr
Owner

@venkatesh-kulkarni Since all-MiniLM-L6-v2 is a sentence model, I would opt for splitting your data up into either sentences or paragraphs. That model is not meant for long documents, and I would advise against using it for that purpose. However, when you split up your data, the results should be much better!
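
One possible way to act on this, sketched with `doc` as a placeholder and a blank-line paragraph split as an assumption; the max-score aggregation at the end is just one simple way to combine the per-paragraph results:

```python
# Sketch: split a 1,000-2,000 word document into paragraphs, extract keywords
# per paragraph with all-MiniLM-L6-v2, then keep the highest-scoring keywords
# across paragraphs. `doc` is a placeholder for the full text.
from collections import defaultdict
from keybert import KeyBERT

paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]

kw_model = KeyBERT(model="all-MiniLM-L6-v2")
per_paragraph = kw_model.extract_keywords(paragraphs)

# Simple aggregation: keep the best score observed for each keyword.
best_scores = defaultdict(float)
for keywords in per_paragraph:
    for word, score in keywords:
        best_scores[word] = max(best_scores[word], score)

top_keywords = sorted(best_scores.items(), key=lambda item: item[1], reverse=True)[:10]
```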

@adityashukzy

@MaartenGr Firstly, thank you so much for your brilliant work. It helps out the entire community. Kudos!

Do you have a tutorial/guide on how one would go about doing what you have suggested here? I'm currently using all-MiniLM-L6-v2 as well, and am experiencing the same issue of the embeddings being created after truncating the excess input text towards the end.

While I'm also exploring other avenues for embeddings such as Doc2Vec or Longformer, I would like to try out the approach you mentioned: namely, splitting a long document up into paragraphs, embedding each of them, and averaging those to derive a single overall doc_embedding to extract keywords against.

Would the procedure be something as follows?

  1. Break the document up into paragraphs.
  2. For each paragraph, encode the paragraph text to get its embeddings.
  3. Repeat for each paragraph.
  4. Collate all these paragraph-level embeddings into a list and run np.mean() on it.

Also, how would you compare this approach to using Flair instead (i.e. DocumentPoolEmbeddings) which would presumably perform this at a word-level rather than a paragraph-level?

My use case is of extracting keywords and keyphrases out from 7-10 page research papers, which often run into about 2000-5000 words.

@MaartenGr
Owner

@adityashukzy

Firstly, thank you so much for your brilliant work. It helps out the entire community. Kudos!

Thank you for your kind words!

Would the procedure be something as follows?

Break the document up into paragraphs.
For each paragraph, encode the paragraph text to get its embeddings.
Repeat for each paragraph.
Collate all these paragraph-level embeddings into a list and run np.mean() on it.

The main difficulty with this approach is that when you have a long text that contains many different topics, using np.mean() might not be helpful, as it will treat each paragraph as equally important whilst that might not be the case. Having said that, if the paragraphs are reasonably similar, then it might be worthwhile to try out! Also, to account for different keywords, using something like diversity=0.5 could be helpful, as it considers a wider range of keywords and focuses less on the single embedding created through np.mean().
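
A sketch of that averaging approach, assuming paragraphs are split on blank lines (`doc` is a placeholder for the full paper text) and that the installed KeyBERT version exposes the doc_embeddings argument of extract_keywords (an assumption worth verifying for your version); note that diversity only takes effect together with use_mmr=True:

```python
# Sketch: embed each paragraph, average with np.mean, and rank keywords
# against that single document embedding.
import numpy as np
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=st_model)

# `doc` is a placeholder for the full paper text, split here on blank lines.
paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
paragraph_embeddings = st_model.encode(paragraphs)

# One embedding for the whole document: the mean over paragraph embeddings.
doc_embedding = np.mean(paragraph_embeddings, axis=0, keepdims=True)

# diversity is only used when MMR is enabled.
keywords = kw_model.extract_keywords(
    doc,
    doc_embeddings=doc_embedding,
    use_mmr=True,
    diversity=0.5,
)
```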

Also, how would you compare this approach to using Flair instead (i.e. DocumentPoolEmbeddings) which would presumably perform this at a word-level rather than a paragraph-level?

I have not tried it out myself, but I think if you were to aggregate at a word level, there is a higher chance of creating an embedding that tells you little, since it aggregates quite a large number of words across a 7-10 page research paper.

@bakachan19

Hi @adityashukzy.
Did you manage to find a good solution for your problem? I am also interested in doing this, and I was wondering if you could share any insights, tips, or tricks that might help.

Thanks.
