LSI backend #201

Open
osma opened this issue Nov 12, 2018 · 2 comments

osma commented Nov 12, 2018

We are currently using Gensim only for the basic TF-IDF backend. It should be almost trivial to create an LSI backend: it needs just one extra LsiModel layer and a single parameter (the number of dimensions).

LDA would be possible too, but I'll leave that for another issue.

osma commented Dec 14, 2018

Evaluation results with the code in #219 were so bad that I don't think it makes sense to continue in this direction. LSI makes more sense when there are no predefined subjects. It might still be useful for small classifications though.

osma commented Dec 14, 2018

Here are the evaluation results:

2018-11-27 LSI model for Annif

Created first implementation of LSI model.
Set up four projects with num_topics = (100, 200, 400, 800).
Loaded yso-fi vocab and trained each model (in parallel, on 4 CPU cores) using yso-finna-fi corpus.
Had to kill the 800 topic one because system started swapping.

lsi-fi-100 model built in ~35 min CPU time (with some parallel processing)
lsi-fi-200 model built in ~41 min CPU time
lsi-fi-400 model built in ~60 min CPU time, peak memory usage ~6.8GB but usually ~5.4GB

Evaluated on kirjastonhoitaja (tfidf f1@5=0.22):
lsi-fi-100 F1@5 0.05287335527720144
lsi-fi-200 F1@5 0.07323910064294681
lsi-fi-400 F1@5 0.09448403253996848 peak mem ~2.5GB

Not very promising…

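  • Results improve with more topics, but not by much: even with 400 topics, F1@5 stays well below the tfidf baseline of 0.22.
  • LSI models with >400 topics are probably not realistic (the 800-topic run had to be killed because the system started swapping).
  • Could be tested on classifications instead of YSO.
  • Could explore how limiting the vocabulary affects resource usage & results.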
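
For reference, a rough sketch of how an F1@5 figure like the ones above can be computed. It assumes the score is calculated per document from the top 5 suggestions and then averaged over documents; the function names are hypothetical and this is not taken from the actual evaluation code.

```python
def f1_at_k(suggested, gold, k=5):
    """F1 score of the top-k suggested subjects against the gold subjects."""
    top_k = set(suggested[:k])
    gold = set(gold)
    true_positives = len(top_k & gold)
    if true_positives == 0:
        return 0.0  # also covers empty suggestions or empty gold standard
    precision = true_positives / len(top_k)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1_at_k(documents, k=5):
    """Average F1@k over (suggested, gold) pairs, one pair per document."""
    scores = [f1_at_k(suggested, gold, k) for suggested, gold in documents]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a document with three gold subjects where two of them appear among the top 5 suggestions has precision 2/5 and recall 2/3.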
