LSI backend #201

osma · 2018-11-12T08:08:00Z

We are currently using Gensim only for the basic TF-IDF backend. It should be almost trivial to create an LSI backend: it's just one extra LsiModel layer and a single parameter (the number of dimensions).

LDA would be possible too, but I'll leave that for another issue.

osma · 2018-12-14T10:02:31Z

Evaluation results with the code in #219 were so bad that I don't think it makes sense to continue in this direction. LSI makes more sense when there are no predefined subjects. It might still be useful for small classifications though.

osma · 2018-12-14T10:40:49Z

Here are the evaluation results:

2018-11-27 LSI model for Annif

Created first implementation of LSI model.
Set up four projects with num_topics = (100, 200, 400, 800).
Loaded yso-fi vocab and trained each model (in parallel, on 4 CPU cores) using yso-finna-fi corpus.
Had to kill the 800 topic one because system started swapping.

lsi-fi-100 model built in ~35 min CPU time (with some parallel processing)
lsi-fi-200 model built in ~41 min CPU time
lsi-fi-400 model built in ~60 min CPU time, peak memory usage ~6.8 GB but usually ~5.4 GB

Evaluated on kirjastonhoitaja (tfidf f1@5=0.22):
lsi-fi-100 F1@5 0.05287335527720144
lsi-fi-200 F1@5 0.07323910064294681
lsi-fi-400 F1@5 0.09448403253996848 peak mem ~2.5GB

Not very promising…

Results improve with more topics, but not by much.
LSI models with >400 topics are probably not realistic.
Could be tested on classifications instead of YSO.
Could explore how limiting the vocabulary affects resource usage and results.
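For reading the F1@5 figures above, here is a small sketch of one common way to compute the metric: per-document F1 over the top 5 suggestions, averaged over documents. This is an assumption about the definition. The evaluation code that produced the numbers may average differently, and the function names are hypothetical.

```python
def f1_at_k(suggested, gold, k=5):
    """F1 score of the top-k suggested subjects against the gold subjects."""
    top_k = suggested[:k]
    gold = set(gold)
    hits = sum(1 for subject in top_k if subject in gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1_at_k(results, k=5):
    """Average F1@k over (suggested, gold) pairs, one pair per document."""
    scores = [f1_at_k(suggested, gold, k) for suggested, gold in results]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a document with two gold subjects where exactly one appears among the five suggestions gets precision 1/5 and recall 1/2.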

osma added the enhancement label Nov 12, 2018

osma added this to the Short term milestone Nov 12, 2018

osma added a commit that referenced this issue Nov 28, 2018

First implementation of LSI backend, with tests. Fixes #201

9d2e7cf

osma mentioned this issue Nov 28, 2018

First implementation of LSI backend, with tests. Fixes #201 #219

Closed

osma modified the milestones: Short term, Blue Sky Dec 14, 2018

osma mentioned this issue Oct 4, 2019

Optimizations to tfidf backend training #335

Merged
