Maximum length of sentence input #70
Comments
That depends; there are quite a number of embedding models that you can use in KeyBERT. Reading through the embedding guide might help. Typically though, extracting keywords from such a large document is not the use case for KeyBERT. With 20,000 words, it is actually quite difficult to extract keywords at all, since the document would likely contain hundreds of different topics. You could also split the document up into sentences to get more fine-grained results, but I am not sure if that would suit your use case.
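A minimal sketch of the splitting idea mentioned above. The regex-based splitter is an illustrative assumption, not part of KeyBERT; in practice you might prefer a proper sentence tokenizer such as NLTK's:

```python
# Hedged sketch: split a long document into sentences before keyword
# extraction, so each piece fits comfortably within a model's input limit.
# The naive regex splitter below is an assumption for illustration only.
import re

def split_into_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = "KeyBERT extracts keywords. Long documents cover many topics! Split them first."
sentences = split_into_sentences(doc)

# Each piece could then be passed to KeyBERT separately, e.g. (not run here):
# kw_model = KeyBERT()
# keywords_per_sentence = [kw_model.extract_keywords(s) for s in sentences]
```

This gives more fine-grained, per-sentence keywords rather than one keyword set for the whole document, which may or may not suit a given use case.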
I had the same question here. I was adding additional text to my doc to analyze, and I noticed that it didn't change the keywords at all, but adding that additional text at the beginning of the doc did. I'm using "all-MiniLM-L6-v2", and even just knowing the max length so I'm aware of it would be helpful.
There is an overview here of pre-trained models using sentence-transformers. For each model, you can see the indicated max sequence length, which relates to the length of documents it can process. If you are looking for models that can process longer documents, the embedding guide gives some options. For instance, you can average word embeddings using Flair to create a full representation of the document. Moreover, Flair tends to average embeddings instead of truncating them, which might be helpful in your case.
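To make the averaging idea concrete, here is a stdlib-only sketch of why mean-pooling word embeddings sidesteps a max-sequence limit: every word contributes a vector, and the document vector is the element-wise mean, so nothing is truncated. The toy vectors and the function name are illustrative assumptions; in Flair this kind of pooling is what document pool embeddings provide:

```python
# Hedged sketch: mean-pooling word vectors into one document vector.
# Because each word is embedded independently and then averaged, the
# document's length is never capped by a 512-token model limit.
def average_embeddings(word_vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(word_vectors[0])
    n = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / n for i in range(dim)]

# Toy 2-dimensional "word embeddings" for three words:
vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
doc_vec = average_embeddings(vecs)  # [3.0, 4.0]
```

The trade-off, discussed further down in this thread, is that averaging over a very long, multi-topic document can dilute the representation.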
awesome, thanks!
I want to use KeyBERT to extract keywords from a document containing 1000-2000 words. I am using the all-MiniLM-L6-v2 model, which has a maximum sequence length of 512 tokens. I tried the averaged embeddings mentioned in the embedding guide, but it gives poor results. How can I modify the code to keep using the all-MiniLM-L6-v2 model while overcoming the maximum sequence length limit?
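One common workaround for this situation (an assumption on my part, not an official KeyBERT recipe) is to chunk the document below the 512-token limit, embed each chunk, and average the chunk embeddings into a single document vector. The word-count-based chunker below is a rough proxy for token counting:

```python
# Hedged sketch: split a document into chunks that fit under a model's
# sequence limit, then average per-chunk embeddings into one document vector.
# chunk size (in words) is an illustrative stand-in for real token counting.
def chunk_words(text, max_words=256):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

doc = " ".join(f"word{i}" for i in range(600))
chunks = chunk_words(doc)  # 600 words -> 3 chunks of 256/256/88 words

# With sentence-transformers (assumed installed, not run here):
# embs = model.encode(chunks)          # shape: (n_chunks, dim)
# doc_emb = embs.mean(axis=0)          # one vector for the whole document
```

Whether the averaged vector is useful depends on how topically uniform the document is, as the maintainer's replies below this comment note.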
@venkatesh-kulkarni Since the …
@MaartenGr Firstly, thank you so much for your brilliant work. It helps out the entire community. Kudos! Do you have a tutorial/guide on how one would go about doing what you have suggested here? I'm currently using …, while I'm also exploring other avenues for embeddings, such as …. Would the procedure be something as follows?
Also, how would you compare this approach to using Flair instead (i.e. …)? My use case is extracting keywords and keyphrases from 7-10 page research papers, which often run to about 2000-5000 words.
Thank you for your kind words!
The main difficulty with this approach is that when you have long text that contains many different topics, using …
I have not tried it out myself, but I think if you were to aggregate on a word level, there is a higher chance of creating an embedding that tells you little, since it aggregates quite a number of words across a 7-10 page research paper.
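An alternative that avoids diluting one global embedding (my suggestion, not something proposed by the maintainer here) is to extract keywords per section or chunk and then merge the results, keeping each keyword's best score. A stdlib-only sketch of the merge step:

```python
# Hedged sketch: merge per-section keyword lists by keeping each keyword's
# highest similarity score, so keywords from minority topics are not washed
# out by averaging the whole document into a single embedding.
from collections import defaultdict

def merge_keyword_scores(per_section_keywords):
    """per_section_keywords: list of [(keyword, score), ...], one per section."""
    best = defaultdict(float)
    for section in per_section_keywords:
        for kw, score in section:
            best[kw] = max(best[kw], score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Toy output from running KeyBERT on two sections separately:
merged = merge_keyword_scores([
    [("bert", 0.71), ("nlp", 0.55)],
    [("nlp", 0.62), ("embeddings", 0.48)],
])
```

The per-section lists would come from calling `extract_keywords` on each section; only the merging logic is shown here.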
Hi @adityashukzy. Thanks.
I want to know: what is the maximum length of sentence that we can input into KeyBERT? (I have a very long text that I need to extract keywords from; it's more than 20,000 words.)