v0.5 #58

Merged
merged 5 commits into master
Sep 28, 2021

Conversation

@MaartenGr (Owner) commented Aug 27, 2021

Highlights:

  • Added Guided KeyBERT
    • kw_model.extract_keywords(doc, seed_keywords=seed_keywords)
    • thanks to @zolekode for the inspiration!
  • Use the newest all-* models from SBERT

Guided KeyBERT

Guided KeyBERT is similar to Guided Topic Modeling in that it tries to steer the training towards a set of seeded terms. When applying KeyBERT, it automatically extracts the keywords most related to a specific document. However, there are times when stakeholders and users are looking for specific types of keywords. For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article. However, there might be a specific topic in the article that you would like to have extracted through the keywords. To achieve this, we simply give KeyBERT a set of related seeded keywords (it can also be a single one!) and search for keywords that are similar to both the document and the seeded keywords.

Using this feature is as simple as defining a list of seeded keywords and passing them to KeyBERT:

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs.[1] It infers a
         function from labeled training data consisting of a set of training examples.[2]
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """

kw_model = KeyBERT()
seed_keywords = ["information"]
keywords = kw_model.extract_keywords(doc, use_mmr=True, diversity=0.1, seed_keywords=seed_keywords)
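
Since seed_keywords accepts a list, several seed terms can also be passed at once to steer the extraction toward a broader theme. A minimal sketch, with illustrative seed terms that are not taken from the PR:

seed_keywords = ["information", "supervised", "labels"]  # illustrative seeds
keywords = kw_model.extract_keywords(doc, seed_keywords=seed_keywords)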

@zolekode

This is really awesome @MaartenGr

# Guided KeyBERT with seed keywords
if seed_keywords is not None:
    # Embed the joined seed keywords once and average the result with the
    # document embedding, weighting the document 3:1 so the seeds only nudge it.
    seed_embeddings = self.model.embed([" ".join(seed_keywords)])
    doc_embedding = np.average([doc_embedding, seed_embeddings], axis=0, weights=[3, 1])

Why are we giving 3 times more weight to doc_embedding? Where does the number 3 come from?

@MaartenGr (Owner, Author) replied:

We compute a weighted average to make sure that the seeded keywords only nudge the keyword extraction. For example, if we were to give the seeded keywords more weight than the actual document, we would overfit on the seeded keywords.

The actual value of 3 was found through experimentation and will most likely become a separate hyperparameter that the user can set. Thus far, however, 3 seems to strike a nice balance: the document embedding remains the main entity to compare against, while the seeded keywords nudge (rather than push) the keyword extraction.
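
For intuition, here is a minimal sketch of that 3:1 weighted average using made-up toy vectors (real sentence embeddings have hundreds of dimensions):

import numpy as np

doc_embedding = np.array([0.9, 0.1, 0.0, 0.4])   # hypothetical document embedding
seed_embedding = np.array([0.0, 0.8, 0.6, 0.4])  # hypothetical seed-keyword embedding

# Weighting the document 3:1 keeps the result close to the document
# while nudging it toward the seed keywords.
nudged = np.average([doc_embedding, seed_embedding], axis=0, weights=[3, 1])
print(nudged)  # [0.675 0.275 0.15 0.4]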

@MaartenGr merged commit 6ab9af1 into master on Sep 28, 2021