Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue274 retain subject ids when loading vocabulary over existing one #383

Conversation

juhoinkinen
Copy link
Member

Closes #274.

@juhoinkinen juhoinkinen added this to the 0.46 milestone Jan 31, 2020
@codecov
Copy link

codecov bot commented Jan 31, 2020

Codecov Report

Merging #383 into master will increase coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #383      +/-   ##
==========================================
+ Coverage   99.38%   99.40%   +0.01%     
==========================================
  Files          59       60       +1     
  Lines        3738     4026     +288     
==========================================
+ Hits         3715     4002     +287     
- Misses         23       24       +1     
Impacted Files Coverage Δ
annif/cli.py 99.15% <0.00%> (-0.40%) ⬇️
annif/eval.py 100.00% <0.00%> (ø) ⬆️
tests/test_cli.py 100.00% <0.00%> (ø) ⬆️
annif/backend/nn_ensemble.py 100.00% <0.00%> (ø) ⬆️
tests/test_backend_nn_ensemble.py 100.00% <0.00%> (ø) ⬆️
tests/test_vocab.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 21d92bf...a8c102c. Read the comment docs.

@juhoinkinen
Copy link
Member Author

Demo run:

rm data/vocabs/yso-fi/subjects
annif loadvoc tfidf-fi tests/corpora/archaeology/subjects.tsv
annif train tfidf-fi tests/corpora/archaeology/documents.tsv 
cat removed_subjects.txt | annif suggest tfidf-fi
annif loadvoc tfidf-fi tests/corpora/archaeology/subjects_updated.tsv 
cat removed_subjects.txt | annif suggest tfidf-fi

In subjects_updated.tsv the following subject labels have been removed and are in removed_subjects.txt:

mykeneläinen kulttuuri
fosfaattianalyysi
kykladinen kulttuuri
rahalöydöt
aarrelöydöt
neoliittinen kausi
hylyt
foinikialainen kulttuuri
laivalöydöt
vedenalainen arkeologia

Also, the label meriarkeologia is renamed to suurten vesien arkeologia in the updated vocabulary.

The suggestion results with original vocabulary:

<http:https://www.yso.fi/onto/yso/p10295>	kykladinen kulttuuri	0.33801543712615967
<http:https://www.yso.fi/onto/yso/p8867>	meriarkeologia	0.2033890336751938
<http:https://www.yso.fi/onto/yso/p8993>	hylyt	0.19928325712680817
<http:https://www.yso.fi/onto/yso/p8888>	foinikialainen kulttuuri	0.18877731263637543
<http:https://www.yso.fi/onto/yso/p8869>	laivalöydöt	0.17182369530200958
<http:https://www.yso.fi/onto/yso/p8868>	vedenalainen arkeologia	0.16133852303028107
<http:https://www.yso.fi/onto/yso/p9285>	neoliittinen kausi	0.15398944914340973
<http:https://www.yso.fi/onto/yso/p10073>	mykeneläinen kulttuuri	0.14665688574314117
<http:https://www.yso.fi/onto/yso/p10415>	rahalöydöt	0.1455041617155075
<http:https://www.yso.fi/onto/yso/p7429>	aigeialainen kulttuuri	0.10021767020225525

After loading updated vocabulary, the suggestion resuls are:

<http:https://www.yso.fi/onto/yso/p8867>	suurten vesien arkeologia	0.2033890336751938
<http:https://www.yso.fi/onto/yso/p7429>	aigeialainen kulttuuri	0.10021767020225525
<http:https://www.yso.fi/onto/yso/p12484>	paleoliittinen kausi	0.08510526269674301
<http:https://www.yso.fi/onto/yso/p4739>	Komsan kulttuuri	0.0794997438788414
<http:https://www.yso.fi/onto/yso/p12651>	Kundan kulttuuri	0.07667402923107147
<http:https://www.yso.fi/onto/yso/p19258>	Suomusjärven kulttuuri	0.07460267096757889
<http:https://www.yso.fi/onto/yso/p20471>	mesoliittinen kausi	0.06100035458803177
<http:https://www.yso.fi/onto/yso/p10826>	Askolan kulttuuri	0.051676034927368164
<http:https://www.yso.fi/onto/yso/p6780>	Indus-kulttuuri	0.05137034133076668
<http:https://www.yso.fi/onto/yso/p7428>	minolainen kulttuuri	0.05129198357462883

@juhoinkinen juhoinkinen marked this pull request as ready for review February 3, 2020 14:30
annif/vocab.py Outdated Show resolved Hide resolved
Comment on lines +20 to 22
uri, _, label = line.strip().partition('\t')
clean_uri = annif.util.cleanup_uri(uri)
yield Subject(uri=clean_uri, label=label, text=None)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is different in #385.

Comment on lines +52 to +57
def append(self, uri, label):
subject_id = len(self._uris)
self._uris.append(uri)
self._labels.append(label)
self._uri_idx[uri] = subject_id
self._label_idx[label] = subject_id
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method duplicates much of the code in __init__(...), so if this is kept, maybe there should be a helper method _append(self, uri, label) that is used in both? (Also, #385 adds notation along uri and label here.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a reasonable refactoring to extract the common code into a method.

Copy link
Member

@osma osma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, I only have a few stylistic remarks.

annif/vocab.py Outdated Show resolved Hide resolved
annif/suggestion.py Outdated Show resolved Hide resolved
annif/corpus/subject.py Outdated Show resolved Hide resolved
annif/suggestion.py Outdated Show resolved Hide resolved
@sonarcloud
Copy link

sonarcloud bot commented Feb 11, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 22 Security Hotspots to review)
Code Smell A 5 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

Comment on lines +52 to +57
def append(self, uri, label):
subject_id = len(self._uris)
self._uris.append(uri)
self._labels.append(label)
self._uri_idx[uri] = subject_id
self._label_idx[label] = subject_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a reasonable refactoring to extract the common code into a method.

@juhoinkinen juhoinkinen merged commit 4627590 into master Feb 14, 2020
@juhoinkinen juhoinkinen deleted the issue274-retain-subject-ids-when-loading-vocabulary-over-existing-one branch February 14, 2020 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retain subject IDs when loading vocabulary over existing one
2 participants