
Maximum length of sentence input #70

Open
tuanton-ai opened this issue Oct 26, 2021 · 9 comments

Comments

@tuanton-ai

I want to know the maximum length of sentence that we can input into KeyBERT. (I have a very long document that I need to extract keywords from; it is more than 20,000 words.)

@MaartenGr
Owner

That depends; there are quite a number of embedding models that you can use in KeyBERT, and reading through the embedding guide might help. Typically, though, extracting keywords from such a large document is not the use case for KeyBERT. With 20,000 words, it is actually quite difficult to extract keywords at all, since the document would likely contain hundreds of different topics. You could also split the document up into sentences to get more fine-grained results, but I am not sure whether that would suit your use case.
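
For reference, a minimal sketch of the sentence-splitting route suggested above. `long_doc` is a placeholder for the 20,000-word text, and nltk is used only for sentence splitting (any splitter would do):

```python
# Minimal sketch: split a very long document into sentences and extract
# keywords per sentence. `long_doc` is a placeholder for the full text.
import nltk
from keybert import KeyBERT

nltk.download("punkt")
sentences = nltk.sent_tokenize(long_doc)

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

# extract_keywords accepts a list of documents and returns one keyword
# list per sentence, which gives more fine-grained results.
keywords_per_sentence = kw_model.extract_keywords(sentences)
```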

@millnerryan

I had the same question. I was adding additional text to my doc to analyze, and I noticed that it didn't change the keywords at all, but adding that additional text at the beginning of the doc did.

I'm using "all-MiniLM-L6-v2", and even just knowing the max length so that I can plan around it would be helpful.

@MaartenGr
Owner

There is an overview here of pre-trained models in sentence-transformers. For each model, you can see the indicated max sequence length, which relates to the length of the documents it can process. If you are looking for models that can process longer documents, the embedding guide gives some options. For instance, you can average word embeddings using Flair to create a full representation of the document. Moreover, Flair tends to average embeddings instead of truncating them, which might be helpful in your case.
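
A hedged sketch of both points, with `long_doc` as a placeholder and the Flair word embeddings chosen only as an example:

```python
# Check the truncation limit of a sentence-transformers model.
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("all-MiniLM-L6-v2")
print(st_model.max_seq_length)  # tokens beyond this limit are silently truncated

# Alternative from the embedding guide: pool word embeddings with Flair,
# which averages over all words instead of truncating the document.
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
from keybert import KeyBERT

word_embs = WordEmbeddings("crawl")          # example word embeddings
doc_embedder = DocumentPoolEmbeddings([word_embs])

kw_model = KeyBERT(model=doc_embedder)
keywords = kw_model.extract_keywords(long_doc)  # long_doc: the full document text
```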

@millnerryan

awesome, thanks!

@venkatesh-kulkarni

I want to use KeyBERT to extract keywords from a document containing 1,000-2,000 words. I am using the all-MiniLM-L6-v2 model, which has a maximum sequence length of 512 tokens. I tried the averaged embeddings mentioned in the embedding guide, but they give poor results. How can I modify the code to use the all-MiniLM-L6-v2 model while still overcoming the maximum sequence length issue?

@MaartenGr
Owner

@venkatesh-kulkarni Since all-MiniLM-L6-v2 is a sentence model, I would opt for splitting your data up into either sentences or paragraphs. That model is not meant for long documents, and I would advise against using it for that purpose. However, when you split up your data, the results should be much better!
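
One possible way to act on this, sketched with `doc` as a placeholder and a blank-line paragraph split as an assumption; the max-score aggregation at the end is just one simple way to combine the per-paragraph results:

```python
# Sketch: split a 1,000-2,000 word document into paragraphs, extract keywords
# per paragraph with all-MiniLM-L6-v2, then keep the highest-scoring keywords
# across paragraphs. `doc` is a placeholder for the full text.
from collections import defaultdict
from keybert import KeyBERT

paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]

kw_model = KeyBERT(model="all-MiniLM-L6-v2")
per_paragraph = kw_model.extract_keywords(paragraphs)

# Simple aggregation: keep the best score observed for each keyword.
best_scores = defaultdict(float)
for keywords in per_paragraph:
    for word, score in keywords:
        best_scores[word] = max(best_scores[word], score)

top_keywords = sorted(best_scores.items(), key=lambda item: item[1], reverse=True)[:10]
```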

@adityashukzy

@MaartenGr Firstly, thank you so much for your brilliant work. It helps out the entire community. Kudos!

Do you have a tutorial/guide on how one would go about doing what you have suggested here? I'm currently using all-MiniLM-L6-v2 as well, and am experiencing the same issue of the embeddings being created after truncating the excess input text towards the end.

While I'm also exploring other avenues for embeddings such as Doc2Vec or Longformer, I would like to try out the approach you mentioned: namely, splitting a long document up into paragraphs, embedding each of them, and averaging those to derive a single overall doc_embedding to extract keywords against.

Would the procedure be something as follows?

  1. Break the document up into paragraphs.
  2. For each paragraph, encode the paragraph text to get its embeddings.
  3. Repeat for each paragraph.
  4. Collate all these paragraph-level embeddings into a list and run np.mean() on it.

Also, how would you compare this approach to using Flair instead (i.e. DocumentPoolEmbeddings) which would presumably perform this at a word-level rather than a paragraph-level?

My use case is of extracting keywords and keyphrases out from 7-10 page research papers, which often run into about 2000-5000 words.

@MaartenGr
Owner

@adityashukzy

Firstly, thank you so much for your brilliant work. It helps out the entire community. Kudos!

Thank you for your kind words!

Would the procedure be something as follows?

Break the document up into paragraphs.
For each paragraph, encode the paragraph text to get its embeddings.
Repeat for each paragraph.
Collate all these paragraph-level embeddings into a list and run np.mean() on it.

The main difficulty with this approach is that when you have a long text that contains many different topics, using np.mean() might not be helpful, as it will treat each paragraph as equally important whilst that might not be the case. Having said that, if the paragraphs are reasonably similar, then it might be worthwhile to try out! Also, to account for different keywords, using something like diversity=0.5 could be helpful, as it considers a wider range of keywords and focuses less on the single embedding created through np.mean().
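
A sketch of that averaging approach, assuming paragraphs are split on blank lines (`doc` is a placeholder for the full paper text) and that the installed KeyBERT version exposes the doc_embeddings argument of extract_keywords (an assumption worth verifying for your version); note that diversity only takes effect together with use_mmr=True:

```python
# Sketch: embed each paragraph, average with np.mean, and rank keywords
# against that single document embedding.
import numpy as np
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=st_model)

# `doc` is a placeholder for the full paper text, split here on blank lines.
paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
paragraph_embeddings = st_model.encode(paragraphs)

# One embedding for the whole document: the mean over paragraph embeddings.
doc_embedding = np.mean(paragraph_embeddings, axis=0, keepdims=True)

# diversity is only used when MMR is enabled.
keywords = kw_model.extract_keywords(
    doc,
    doc_embeddings=doc_embedding,
    use_mmr=True,
    diversity=0.5,
)
```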

Also, how would you compare this approach to using Flair instead (i.e. DocumentPoolEmbeddings) which would presumably perform this at a word-level rather than a paragraph-level?

I have not tried it out myself, but I think if you were to aggregate at a word level, there is a higher chance of creating an embedding that tells you little, since it aggregates quite a large number of words across a 7-10 page research paper.

@bakachan19

Hi @adityashukzy.
Did you manage to find a good solution for your problem? I am also interested in doing this, and I was wondering if you could share any insights, tips, or tricks that might help.

Thanks.
