YAKE backend #461

Merged: juhoinkinen merged 57 commits into master from yake-integration on May 14, 2021

juhoinkinen (Member) commented Jan 12, 2021

This PR adds a new backend to Annif by integrating the YAKE library.

YAKE performs unsupervised automatic keyword extraction. In the Annif backend, the keywords found by YAKE are looked up among the SKOS vocabulary labels, and the matches are returned as subject suggestions. The search can be targeted at prefLabels, altLabels and/or hiddenLabels, as set in the project configuration.
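
The lookup described above can be sketched roughly as follows. This is only a minimal illustration, not the code in this PR: the vocabulary contents, the example URIs and the function names `normalize`, `build_label_index` and `match_keyphrases` are all hypothetical, and the normalization is plain case folding rather than anything an Annif analyzer would do.

```python
from collections import defaultdict

# Hypothetical vocabulary: each concept URI has labels grouped by SKOS label type.
VOCAB = {
    "http://example.org/c1": {"prefLabel": ["information retrieval"],
                              "altLabel": ["document retrieval"]},
    "http://example.org/c2": {"prefLabel": ["libraries"],
                              "hiddenLabel": ["librarys"]},
}


def normalize(label):
    """Lowercase and collapse whitespace so lookups ignore trivial differences."""
    return " ".join(label.lower().split())


def build_label_index(vocab, label_types):
    """Map each normalized label of the selected types to the URIs that use it."""
    index = defaultdict(set)
    for uri, labels_by_type in vocab.items():
        for label_type in label_types:
            for label in labels_by_type.get(label_type, []):
                index[normalize(label)].add(uri)
    return index


def match_keyphrases(keyphrases, index):
    """Return (uri, keyphrase) pairs for keyphrases that equal a vocabulary label."""
    matches = []
    for phrase in keyphrases:
        for uri in sorted(index.get(normalize(phrase), ())):
            matches.append((uri, phrase))
    return matches


# Only prefLabels and altLabels are indexed here, so the hiddenLabel is ignored.
index = build_label_index(VOCAB, ["prefLabel", "altLabel"])
matches = match_keyphrases(["Document  retrieval", "librarys", "metadata"], index)
print(matches)
```

Adding `"hiddenLabel"` to the list of label types would make the misspelled keyphrase match as well, which is the kind of targeting the project configuration is meant to control.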

The YAKE backend follows a lexical approach, but according to evaluation results it does not perform as well as the other lexical backends (MLLM, STWFSA or Maui). However, the (free) keyword extraction operation makes it possible to add new features to Annif, in particular suggesting new terms for a vocabulary (the keywords not found in the vocabulary); see #224. The unsupervised approach can also be useful in some cases, since no training data is needed.

codecov bot commented Jan 12, 2021

Codecov Report

Merging #461 (9a2127a) into master (57a2a38) will increase coverage by 0.00%.
The diff coverage is 99.59%.

@@           Coverage Diff            @@
##           master     #461    +/-   ##
========================================
  Coverage   99.46%   99.47%            
========================================
  Files          73       76     +3     
  Lines        5280     5513   +233     
========================================
+ Hits         5252     5484   +232     
- Misses         28       29     +1     
Impacted Files                Coverage Δ
annif/backend/maui.py         100.00% <ø> (ø)
annif/project.py              99.34% <ø> (ø)
annif/vocab.py                98.30% <92.85%> (-1.70%) ⬇️
annif/backend/yake.py         98.24% <98.24%> (ø)
annif/analyzer/__init__.py    100.00% <100.00%> (+7.40%) ⬆️
annif/analyzer/analyzer.py    100.00% <100.00%> (ø)
annif/analyzer/simple.py      100.00% <100.00%> (ø)
annif/analyzer/snowball.py    100.00% <100.00%> (ø)
annif/analyzer/voikko.py      94.73% <100.00%> (+0.29%) ⬆️
annif/backend/__init__.py     100.00% <100.00%> (ø)
... and 32 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 03fdd78...9a2127a.

juhoinkinen marked this pull request as ready for review February 1, 2021 08:20

osma (Member) left a comment

I gave some comments on some possible improvements. Overall I think this is looking very promising.

How to express the licensing information (GPLv3) needs some more thought. It shouldn't be terribly complicated but needs a different frame of thinking so I mostly just looked at the code now :)

annif/backend/yake.py
not_matched.append((kp, self._transform_score(score)))
# Remove duplicate uris, conflating the scores
suggestions = self._combine_suggestions(suggestions)
self.debug('Keyphrases not matched:\n' + '\t'.join(

In a future version, I think these non-matched keyphrases should be propagated back to the user as well, but it could be done in a subsequent PR as it requires a lot more scaffolding.
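
To make the idea concrete, here is a minimal sketch of keeping matched and non-matched keyphrases apart while merging duplicate URIs. It is not the implementation under review: `transform_score`, `combine_scores` and `split_and_combine` are hypothetical names, the score transformation is just one plausible way to turn YAKE scores (where lower means more relevant) into higher-is-better values, and the noisy-OR rule used to merge duplicates is an assumption chosen for illustration.

```python
def transform_score(yake_score):
    """Map a YAKE score (lower is better) into (0, 1], where higher is better."""
    return 1.0 / (max(yake_score, 0.0) + 1.0)


def combine_scores(a, b):
    """Noisy-OR merge of two scores in [0, 1]; never smaller than either input."""
    return 1.0 - (1.0 - a) * (1.0 - b)


def split_and_combine(keyphrases, label_index):
    """Return ({uri: score}, [(keyphrase, score), ...]) for matched and unmatched phrases."""
    suggestions = {}
    not_matched = []
    for phrase, yake_score in keyphrases:
        score = transform_score(yake_score)
        uris = label_index.get(phrase.lower(), [])
        if not uris:
            # Candidate new term: keep it so it could be reported to the user.
            not_matched.append((phrase, score))
        for uri in uris:
            if uri in suggestions:
                suggestions[uri] = combine_scores(suggestions[uri], score)
            else:
                suggestions[uri] = score
    return suggestions, not_matched


# Hypothetical label index and YAKE output, for illustration only.
label_index = {"cats": ["u1"], "felines": ["u1"]}
keyphrases = [("cats", 1.0), ("felines", 3.0), ("zebras", 0.0)]
suggestions, not_matched = split_and_combine(keyphrases, label_index)
print(suggestions, not_matched)
```

Returning the unmatched list alongside the suggestions, instead of only logging it, is the part that would need the extra scaffolding mentioned in the comment.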

juhoinkinen linked an issue Feb 9, 2021 that may be closed by this pull request

sonarcloud bot commented Mar 17, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 2 (rating A)

Coverage: no information
Duplication: 0.0%

juhoinkinen (Member, Author) commented

Rebased & force-pushed

juhoinkinen added this to the 0.53 milestone May 6, 2021
juhoinkinen changed the title from "Yake integration" to "YAKE backend" May 6, 2021

juhoinkinen (Member, Author) commented

I updated the PR description and put evaluation results in the results table for comparison with Maui and MLLM.

Still to do: a Wiki page for the backend.

There are some inline comments/questions in the PR from me. @osma can you take a look at this again?

juhoinkinen (Member, Author) commented

A draft Wiki page: https://github.com/NatLibFi/Annif/wiki/Backend:-YAKE

Another Wiki edit still to do: add installation instructions to the Optional features and dependencies page.

osma (Member) left a comment

Looks good. I gave a few suggestions for small changes. You can decide if you want to address them or not, then it's OK to merge this.

I will write a separate comment about the wiki documentation.

osma (Member) commented May 11, 2021

Regarding the wiki page:

  • there were a couple of comments above about how to handle vocabulary changes etc. (the need for annif clear)
  • the parameters are not very well explained currently; the descriptions could be expanded a bit (currently only the default values are mentioned), and perhaps there could be a paragraph or two explaining how to configure the backend in common situations
  • the Optional features and dependencies page needs a section on YAKE; this could include the same note about the GPL as in the README
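
As an illustration of the kind of configuration example the second point asks for, the sketch below parses a project definition and reads a label type setting from it. Everything specific here is an assumption made for the example: the project id `yake-en`, the vocabulary id, the option name `label_types`, its comma-separated format, and the fallback value. The wiki page for the backend is the place to check the real parameter names and defaults.

```python
import configparser

# Hypothetical project definition; "label_types" and its value format are assumptions.
CONFIG_TEXT = """
[yake-en]
name=YAKE English
language=en
backend=yake
vocab=yso-en
label_types=prefLabel,altLabel
"""

# The three SKOS label properties mentioned in the PR description.
ALLOWED_LABEL_TYPES = {"prefLabel", "altLabel", "hiddenLabel"}


def read_label_types(config_text, project_id):
    """Return the list of label types configured for a project, validating each one."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # The fallback used when the option is missing is an assumption of this sketch.
    raw = parser[project_id].get("label_types", "prefLabel,altLabel")
    label_types = [item.strip() for item in raw.split(",") if item.strip()]
    unknown = set(label_types) - ALLOWED_LABEL_TYPES
    if unknown:
        raise ValueError(f"Unknown label types: {sorted(unknown)}")
    return label_types


label_types = read_label_types(CONFIG_TEXT, "yake-en")
print(label_types)
```

A documentation paragraph could pair a snippet like the configuration text above with a sentence on when each label type is worth enabling.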

sonarcloud bot commented May 12, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 2 (rating A)

Coverage: no information
Duplication: 0.0%

juhoinkinen merged commit 322e014 into master May 14, 2021
juhoinkinen deleted the yake-integration branch May 14, 2021 12:33
osma mentioned this pull request Feb 3, 2022

Successfully merging this pull request may close these issues.

RAKE-like backend