"UK" doesn't seem to be registered as input #456

Closed
mo-fu opened this issue Dec 8, 2020 · 13 comments · Fixed by #468

@mo-fu
Contributor

mo-fu commented Dec 8, 2020

During our 2020 evaluation we discovered that the STW concept for United Kingdom was often not assigned, even though UK was present in the input. Changing it to U.K. helped. The same happens with the YSO English project in the web UI at https://ai.finto.fi/. Just try these two short strings ("UK" and "U.K."). Longer sentences show the same effect. I suppose something happens during preprocessing that removes the two-letter word.

@osma
Member

osma commented Dec 8, 2020

The minimum token length is set to 3 in the Analyzer base class, from which other analyzers such as simple and snowball inherit the functionality.

Of course it's possible to change this, but it could add a lot of noise and increase the size of models. Any ideas?
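
A minimal sketch of the kind of length filter described here, to show why "UK" disappears while "U.K." survives. This is not the Annif source: the function names, the whitespace tokenizer and the "must contain a letter" check are assumptions made for illustration; only the minimum length of 3 comes from the comment above.

```python
import unicodedata

TOKEN_MIN_LENGTH = 3  # the value reported in the comment above


def is_valid_token(word, min_length=TOKEN_MIN_LENGTH):
    """Keep a token only if it is long enough and contains a letter.

    The letter check is an assumption of this sketch.
    """
    if len(word) < min_length:
        return False
    return any(unicodedata.category(ch).startswith("L") for ch in word)


def filter_tokens(text, min_length=TOKEN_MIN_LENGTH):
    # Whitespace splitting stands in for a real tokenizer here.
    return [tok for tok in text.split() if is_valid_token(tok, min_length)]


print(filter_tokens("UK economy"))                # "UK" is dropped
print(filter_tokens("U.K. economy"))              # "U.K." has 4 characters and is kept
print(filter_tokens("UK economy", min_length=2))  # "UK" is kept
```

With a minimum length of 3, the two-character token never reaches the vectorizer, which matches the behaviour reported in the issue.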

@mo-fu
Contributor Author

mo-fu commented Dec 8, 2020

Makes sense.
My first idea was to preprocess the input to insert dots between the letters of all-caps words, similar to how STWFSA detects both variants. This hack basically increases the size of the token.
After your response I also thought about allowing only specific two-letter tokens, but this would probably increase processing time, if it is possible at all.
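
The dot-insertion idea could look roughly like the following. It is only a sketch of the workaround described above, not something Annif or STWFSA ships; the function name `dot_all_caps` and the ASCII-only pattern are assumptions.

```python
import re

# Runs of two or more ASCII capitals standing alone as a word (assumption:
# non-ASCII capitals and mixed-case words are left untouched).
ALL_CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")


def dot_all_caps(text):
    """Rewrite all-caps words such as 'UK' as 'U.K.' before tokenization."""
    return ALL_CAPS_WORD.sub(lambda m: ".".join(m.group(0)) + ".", text)


print(dot_all_caps("The UK and the EU"))  # The U.K. and the E.U.
```

Note that this also rewrites longer acronyms (for example "NATO" becomes "N.A.T.O."), which may or may not match the labels in a given vocabulary.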

@osma
Member

osma commented Dec 8, 2020

It would also be possible to tweak the is_valid_token method, for example so that all caps words are treated differently (e.g. minimum length 2 for all caps).

Heuristics like this are bound to be imperfect - even very short words may have important meaning. For example in Swedish "ö" means island and "å" means river :)
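
As an illustration of the all-caps heuristic, here is a standalone sketch. The method name `is_valid_token` comes from the comment above, but the signature, the parameter names and the body are assumptions, not the actual Annif implementation.

```python
def is_valid_token(word, min_length=3, min_length_caps=2):
    """Accept shorter tokens when they are written in all caps.

    Hypothetical variant: `min_length_caps` is an illustrative parameter.
    """
    limit = min_length_caps if word.isupper() else min_length
    if len(word) < limit:
        return False
    return any(ch.isalpha() for ch in word)


for token in ["UK", "uk", "of", "USA", "ö"]:
    print(token, is_valid_token(token))
```

The last case shows the limitation mentioned above: a meaningful one-letter word such as Swedish "ö" is still rejected by this rule.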

@mo-fu
Contributor Author

mo-fu commented Dec 8, 2020

I feared that making exceptions would open the flood gates for even more exceptions.

@osma
Member

osma commented Dec 8, 2020

I think that the best way to approach this is to try to make an adjustment, then benchmark the results before and after the change, for example with an Omikuji model, on a couple of different data sets. In this case important metrics could be model size, training time and RAM, precision, recall, F1 score etc.

@mo-fu
Contributor Author

mo-fu commented Jan 26, 2021

I ran some experiments on timing and memory use for minimum token sizes two and three. I ran them while no other programs were running and with turbo boost disabled, to reduce the impact of thermal throttling.
The project config was:

[omikuji-yso-en]
name=omikujiy YSO english
language=en
backend=omikuji
vocab=yso
analyzer=snowball(english)
cluster_k=100
max_depth=3 

I used the YSO-Finna title data set from the tutorial.

Here are the major stats:

| Stat | Min token size 2 | Min token size 3 |
| --- | --- | --- |
| NDCG | 0.6692629691171502 | 0.668092981381023 |
| NDCG@5 | 0.6590049807281738 | 0.6577488968685695 |
| NDCG@10 | 0.6794627912775455 | 0.6782623266499979 |
| Maximum resident set size, training (kbytes) | 3020180 | 3211152 |
| Maximum resident set size, evaluation (kbytes) | 13768836 | 13963352 |
| User time, training (seconds) | 15162.35 | 14756.73 |
| User time, evaluation (seconds) | 10189.39 | 9877.84 |
| System time, training (seconds) | 60.28 | 58.54 |
| System time, evaluation (seconds) | 1304.42 | 1325.07 |
| Wall time, training (h:mm:ss) | 0:54:26.69 | 0:52:43.70 |
| Wall time, evaluation (h:mm:ss) | 2:20:19 | 2:17:42 |

I need to retrain for three as I did not save the files.

| File name | Size in bytes (min token size 2) | Size in bytes (min token size 3) |
| --- | --- | --- |
| omikuji-train.txt | 424508792 | 377233735 |
| vectorizer | 6191720 | 6168414 |
| tree0.cbor | 237622688 | 232768228 |
| tree2.cbor | 238238260 | 232715201 |
| tree1.cbor | 238085203 | 232933647 |

More Details

Min Token Size 3

Training

User time (seconds): 14756.73
System time (seconds): 58.54
Percent of CPU this job got: 468% 
Elapsed (wall clock) time (h:mm:ss or m:ss): 52:43.70
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3211152
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1514901
Voluntary context switches: 0
Involuntary context switches: 0
Swaps: 0
File system inputs: 0 
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Evaluation Metrics

Precision (doc avg):            0.317233
Recall (doc avg):               0.6995525585342507
F1 score (doc avg):             0.39827965665506854
Precision (subj avg):           0.22211907332309463
Recall (subj avg):              0.33122881513924446
F1 score (subj avg):            0.24864102469050758
Precision (weighted subj avg):  0.3441752072875698
Recall (weighted subj avg):     0.6418459446718368
F1 score (weighted subj avg):   0.4363464807191685
Precision (microavg):           0.3177159282108805
Recall (microavg):              0.6418459446718368
F1 score (microavg):            0.4250370629403422
F1@5:                           0.46643648366301016
NDCG:                           0.668092981381023
NDCG@5:                         0.6577488968685695
NDCG@10:                        0.6782623266499979
Precision@1:                    0.72995
Precision@3:                    0.5790166666666666
Precision@5:                    0.473354
LRAP:                           0.5618558703035785
True positives:                 317233
False positives:                681247
False negatives:                177018
Documents evaluated:            100000

Evaluation Resources

User time (seconds): 9877.84
System time (seconds): 1325.07
Percent of CPU this job got: 135%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:17:42
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 13963352
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 117227180
Voluntary context switches: 0
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Min Token Size 2

Training

User time (seconds): 15162.35
System time (seconds): 60.28
Percent of CPU this job got: 465%
Elapsed (wall clock) time (h:mm:ss or m:ss): 54:26.69
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3020180
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1792410
Voluntary context switches: 0
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Evaluation Metrics

Precision (doc avg):            0.31795
Recall (doc avg):               0.7011969488468128
F1 score (doc avg):             0.3991430168983095 
Precision (subj avg):           0.2220694137771722
Recall (subj avg):              0.33060479807678983
F1 score (subj avg):            0.24835069745711372
Precision (weighted subj avg):  0.3451724837793228
Recall (weighted subj avg):     0.6432966245895304
F1 score (weighted subj avg):   0.4372320524925233
Precision (microavg):           0.3183798127472087
Recall (microavg):              0.6432966245895304
F1 score (microavg):            0.42594920895625366
F1@5:                           0.4676404781666284
NDCG:                           0.6692629691171502
NDCG@5:                         0.6590049807281738
NDCG@10:                        0.6794627912775455
Precision@1:                    0.73003
Precision@3:                    0.57999
Precision@5:                    0.47441
LRAP:                           0.5630602939268746
True positives:                 317950
False positives:                680700
False negatives:                176301
Documents evaluated:            100000

Evaluation Resources

User time (seconds): 10189.39
System time (seconds): 1304.42
Percent of CPU this job got: 136%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:20:19
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 13768836
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 120041611
Voluntary context switches: 0
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
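
As a sanity check on the Omikuji numbers above, the micro-averaged scores can be recomputed from the reported true positive, false positive and false negative counts. This is a small sketch using the standard definitions; the helper name `micro_scores` is made up for illustration and is not part of Annif.

```python
def micro_scores(tp, fp, fn):
    """Micro-averaged precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (true positives, false positives, false negatives) as reported above,
# keyed by minimum token size.
COUNTS = {
    3: (317233, 681247, 177018),
    2: (317950, 680700, 176301),
}

for min_size, (tp, fp, fn) in COUNTS.items():
    p, r, f1 = micro_scores(tp, fp, fn)
    print(f"min token size {min_size}: P={p:.6f} R={r:.6f} F1={f1:.6f}")
```

The recomputed values agree with the "microavg" lines in both evaluation listings.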

@osma
Member

osma commented Feb 15, 2021

It looks like there was a small improvement in precision/recall/F1 and a small increase in train and eval times with minimum token length set to 2 instead of 3. Somewhat surprisingly, memory use (both train and eval) decreased, but this can vary between runs and the difference was very small anyway.
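
To put "small" into numbers, the relative differences can be computed directly from the values reported for the Omikuji runs. A minimal sketch; the dictionary layout and the helper name `relative_change` are only illustrative.

```python
# Values copied from the Omikuji results earlier in this thread:
# (min token size 2, min token size 3).
REPORTED = {
    "NDCG": (0.6692629691171502, 0.668092981381023),
    "User time training (s)": (15162.35, 14756.73),
    "User time evaluation (s)": (10189.39, 9877.84),
    "Max RSS training (kB)": (3020180, 3211152),
    "Max RSS evaluation (kB)": (13768836, 13963352),
}


def relative_change(new, old):
    """Percent change going from `old` to `new`."""
    return (new - old) / old * 100.0


# Change when moving from min token size 3 to min token size 2.
changes = {name: relative_change(v2, v3) for name, (v2, v3) in REPORTED.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```

On these single runs NDCG rises by roughly 0.2%, user time by about 3%, and peak memory drops by between roughly 1% and 6%. Run-to-run variance was not measured, so the differences should be read with that in mind.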

Based on this, do you think it would be a good idea to set the token size to 2 globally @mo-fu? Should we do more tests first?

It would also be possible to

  • make the token length configurable (e.g. analyzer=snowball(english,min_token=2))
  • try to make it smarter e.g. min token length 2 for all-caps tokens, 3 otherwise
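
A rough sketch of how a specification string such as `snowball(english,min_token=2)` could be split into a name, positional arguments and keyword arguments. The `min_token` option is only the proposal from the list above, and `parse_analyzer_spec` is a hypothetical helper, not Annif's actual parser.

```python
import re

# name, optionally followed by a parenthesised, comma-separated argument list
SPEC_RE = re.compile(r"^(?P<name>\w+)(?:\((?P<args>[^)]*)\))?$")


def parse_analyzer_spec(spec):
    """Split 'name(arg,key=value)' into (name, positional, keyword)."""
    match = SPEC_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"Invalid analyzer specification: {spec!r}")
    positional, keyword = [], {}
    args = match.group("args")
    if args:
        for part in args.split(","):
            part = part.strip()
            if "=" in part:
                key, value = part.split("=", 1)
                keyword[key.strip()] = value.strip()
            else:
                positional.append(part)
    return match.group("name"), positional, keyword


name, positional, keyword = parse_analyzer_spec("snowball(english,min_token=2)")
min_token = int(keyword.get("min_token", 3))  # fall back to the current default of 3
print(name, positional, min_token)
```

Keeping 3 as the fallback would leave existing project configurations unchanged unless they opt in.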

@mo-fu
Contributor Author

mo-fu commented Feb 15, 2021

According to the numbers for omikuji I would say the token size could be set to two globally. But maybe TFIDF should also be checked, for two reasons:

  1. It is the most basic backend and is used as the starting point in the tutorials.
  2. AFAIK the similarity computation needs the vectorized training instances. Therefore changing the token size should have the biggest effect here.

I will try to run these.

@osma
Member

osma commented Feb 15, 2021

Yes, it's a good idea to test with tfidf as well, exactly for the reasons you mentioned!

@mo-fu
Contributor Author

mo-fu commented Feb 17, 2021

Here are the training results for tfidf:
They are again very similar. I still need to rerun the eval timings for min token size two in the off hours. I doubt the results will differ much from three, as most of the time is spent computing predictive metrics. Based on this I would say there won't be much harm in changing the min token size to two.

Most Important Metrics:

| Metric | Min Token Size 2 | Min Token Size 3 |
| --- | --- | --- |
| User time (seconds) | 576.17 | 550.37 |
| System time (seconds) | 42.51 | 41.96 |
| Elapsed (wall clock) time (mm:ss) | 10:23.62 | 9:52.63 |
| Maximum resident set size (kbytes) | 1164416 | 1120176 |
| Model size (bytes) | 87542937 | 85319705 |
| Vectorizer size (bytes) | 4922899 | 4915028 |

Min Size 2

Backend tfidf: transforming subject corpus
Backend tfidf: creating vectorizer
Backend tfidf: creating similarity index
Command being timed: "annif train tfidf-yso-en /home/fuer/Annif-tutorial/data-sets/yso-nlf/yso-finna-diff.tsv"
User time (seconds): 576.17
System time (seconds): 42.51
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 10:23.62
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1164416
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 691350
Voluntary context switches: 0
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Min Size 3

Command being timed: "annif train tfidf-yso-en /home/fuer/Annif-tutorial/data-sets/yso-nlf/yso-finna-diff.tsv"
User time (seconds): 550.37
System time (seconds): 41.96
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:52.63
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1120176
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 690005
Voluntary context switches: 0
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0 

@mo-fu
Contributor Author

mo-fu commented Feb 18, 2021

The evaluation results for tfidf are mostly the same for both settings:

NDCG scores

| Metric | Min Token Size 2 | Min Token Size 3 |
| --- | --- | --- |
| NDCG | 0.3603475844646917 | 0.36007724884192016 |
| NDCG@5 | 0.3527443880475419 | 0.3531684376569817 |
| NDCG@10 | 0.36583862767979836 | 0.3655562506839312 |

Timing and Memory

| Metric | Min Token Size 2 | Min Token Size 3 |
| --- | --- | --- |
| User time (seconds) | 23574.71 | 23692.96 |
| System time (seconds) | 10106.96 | 9692.78 |
| Elapsed (wall clock) time (h:mm:ss) | 9:21:42 | 9:16:46 |
| Maximum resident set size (kbytes) | 14061768 | 14105380 |

Min Size 2

Precision (doc avg):          	0.16309787301587303
Recall (doc avg):             	0.3829147699472592
F1 score (doc avg):           	0.2080784200918946
Precision (subj avg):         	0.12387980373550003
Recall (subj avg):            	0.29377425983532357
F1 score (subj avg):          	0.14219666170034267
Precision (weighted subj avg):	0.3772837373489077
Recall (weighted subj avg):   	0.3289057584102005
F1 score (weighted subj avg): 	0.29102282144820796
Precision (microavg):         	0.16291144504963145
Recall (microavg):            	0.3289057584102005
F1 score (microavg):          	0.21789604759983539
F1@5:                         	0.23928756144076146
NDCG:                         	0.3603475844646917
NDCG@5:                       	0.3527443880475419
NDCG@10:                      	0.36583862767979836
Precision@1:                  	0.43717
Precision@3:                  	0.30638166666666666
Precision@5:                  	0.2421935
LRAP:                         	0.24861273849384033
True positives:               	162562
False positives:              	835293
False negatives:              	331689
Documents evaluated:          	100000
	Command being timed: "annif eval tfidf-yso-en /home/fuer/Annif-tutorial/data-sets/yso-nlf/yso-finna-small.tsv"
	User time (seconds): 23574.71
	System time (seconds): 10106.96
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 9:21:42
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 14061768
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1785697283
	Voluntary context switches: 0
	Involuntary context switches: 0
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Min Size 3

Precision (doc avg):          	0.16219287301587304
Recall (doc avg):             	0.3812029573809712
F1 score (doc avg):           	0.20698367345311616
Precision (subj avg):         	0.12402065061868699
Recall (subj avg):            	0.2926094512585602
F1 score (subj avg):          	0.14159969434580164
Precision (weighted subj avg):	0.3799152945466671
Recall (weighted subj avg):   	0.3270564955862507
F1 score (weighted subj avg): 	0.29032813801949814
Precision (microavg):         	0.16202535705658477
Recall (microavg):            	0.3270564955862507
F1 score (microavg):          	0.21669765577557001
F1@5:                         	0.23888041401231402
NDCG:                         	0.36007724884192016
NDCG@5:                       	0.3531684376569817
NDCG@10:                      	0.3655562506839312
Precision@1:                  	0.4403
Precision@3:                  	0.30684833333333333
Precision@5:                  	0.24174616666666668
LRAP:                         	0.24856661927128268
True positives:               	161648
False positives:              	836023
False negatives:              	332603
Documents evaluated:          	100000
	Command being timed: "annif eval tfidf-yso-en /home/fuer/Annif-tutorial/data-sets/yso-nlf/yso-finna-small.tsv"
	User time (seconds): 23692.96
	System time (seconds): 9692.78
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 9:16:46
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 14105380
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1743172444
	Voluntary context switches: 0
	Involuntary context switches: 0
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@osma
Member

osma commented Feb 18, 2021

Looks promising! Care to make a PR @mo-fu ? It should be pretty trivial...

@juhoinkinen
Member

Closed by (already merged) #468.

@juhoinkinen juhoinkinen added this to the 0.52 milestone Mar 30, 2021