Implement keyword extraction with KeyBERT #7083

Open
JohnnyRacer opened this issue Feb 23, 2024 · 1 comment · May be fixed by Amnah199/haystack#1
Labels
Contributions wanted! (Looking for external contributions), type:feature (New feature or request)

Comments

@JohnnyRacer

Hello, I think adding a keyword extractor based on KeyBERT would be quite useful. The extracted keywords could be used with logit_bias when paraphrasing or summarizing, to allow for more consistent word usage in those tasks. Alternatively, the keywords could be used to locate the documents in a collection that match given keywords or phrases (sparse, keyword-based retrieval). I have created the snippet below based on the usage section of KeyBERT's README.md and the NamedEntityExtractor example.

from haystack.dataclasses import Document
from haystack.components.extractors import KeywordExtractor

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """

extractor = KeywordExtractor(model="all-MiniLM-L6-v2")  # Compatible with all 'sentence-transformers' models

documents = [Document(content=doc)]

extractor.warm_up()
keywords_output = extractor.run(documents, keyphrase_ngram_range=(1, 1), stop_words=None)
# Longer phrases instead of keywords can be extracted by altering the keyphrase_ngram_range parameter
print(keywords_output)

"""
Expected output from the 'keywords_output':

[
    Document(
        id=97fc47fdd6aeb2540d0b015b234088b7386abea93671a7d336a80c244387457a,
        content: doc,  # The input document from above
        meta: {  # The keywords are added to the document metadata
            'keywords': [
                ('learning', 0.4604),
                ('algorithm', 0.4556),
                ('training', 0.4487),
                ('class', 0.4086),
                ('mapping', 0.3700)
            ]
        }
    )
]

"""
@masci added the type:feature (New feature or request) label on Mar 22, 2024
@masci
Contributor

masci commented Mar 22, 2024

Hi @JohnnyRacer and thanks for the details you put in the issue!

We talked about this internally and, while we understand the use case, we couldn't figure out a way to prioritise this work. I imagine writing a custom component would be a good workaround in the meantime, but I'm also labelling this issue as contributions wanted in case anybody wants to give it a shot.
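
As a rough idea of what the core of such a workaround might do, here is a library-free sketch. It assumes documents expose `content` and `meta` attributes (as Haystack's `Document` does) and that the extractor is any callable returning `(keyword, score)` pairs, which is the shape shown in the issue above. The `attach_keywords` helper and the `frequency_extractor` stand-in are hypothetical names; wrapping the logic in an actual Haystack component is left out.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Callable, Iterable, List, Tuple

Keyword = Tuple[str, float]


def attach_keywords(
    documents: Iterable[Any],
    extract: Callable[[str], List[Keyword]],
    top_n: int = 5,
) -> List[Any]:
    """Store the top_n (keyword, score) pairs under meta['keywords'] of each document."""
    processed = []
    for doc in documents:
        keywords = extract(doc.content or "")
        ranked = sorted(keywords, key=lambda pair: pair[1], reverse=True)
        doc.meta["keywords"] = ranked[:top_n]
        processed.append(doc)
    return processed


def frequency_extractor(text: str) -> List[Keyword]:
    """Stand-in extractor for illustration only: scores words by relative frequency."""
    words = [w.strip(".,()'\"").lower() for w in text.split()]
    words = [w for w in words if len(w) > 3]
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return [(word, count / total) for word, count in counts.items()]


# With KeyBERT installed, the extractor could instead be (per its README usage):
#   kw_model = KeyBERT(model="all-MiniLM-L6-v2")
#   extract = kw_model.extract_keywords
doc = SimpleNamespace(content="training data and training examples", meta={})
attach_keywords([doc], frequency_extractor, top_n=2)
print(doc.meta["keywords"])
```

Passing the extractor in as a callable keeps the model choice outside the helper, so the same code can be exercised without downloading a model.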

@masci added the Contributions wanted! (Looking for external contributions) label on Mar 22, 2024
2 participants