Incremental learning in tfidf backend #226

Open
osma opened this issue Dec 14, 2018 · 0 comments
osma commented Dec 14, 2018

The tfidf backend could support incremental learning with a few adjustments (inspired by SimIndex in gensim.simserver):

  • switch from MatrixSimilarity to Similarity, which allows additions of documents (i.e. subjects for us)
  • maintain mappings between subjects and document IDs in the index; this would have to be persisted along with the index (e.g. using SQLite as in SimIndex)
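
A minimal sketch of the subject-to-document-ID mapping, using SQLite as suggested above. The table name `subject_docs` and the helper functions are hypothetical, and the sketch assumes that document IDs are simply sequential positions in the order subjects are added to the index:

```python
import sqlite3


def open_mapping(path=":memory:"):
    """Open (or create) the mapping database; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS subject_docs ("
        "subject_uri TEXT PRIMARY KEY, "
        "doc_id INTEGER UNIQUE NOT NULL)"
    )
    return conn


def add_subject(conn, subject_uri):
    """Return the document ID for a subject, assigning the next free one if new."""
    row = conn.execute(
        "SELECT doc_id FROM subject_docs WHERE subject_uri = ?", (subject_uri,)
    ).fetchone()
    if row is not None:
        return row[0]
    # Assumed: IDs follow insertion order, starting from 0.
    next_id = conn.execute(
        "SELECT COALESCE(MAX(doc_id) + 1, 0) FROM subject_docs"
    ).fetchone()[0]
    conn.execute("INSERT INTO subject_docs VALUES (?, ?)", (subject_uri, next_id))
    conn.commit()
    return next_id


def subject_for(conn, doc_id):
    """Look up the subject stored at a given document ID, or None."""
    row = conn.execute(
        "SELECT subject_uri FROM subject_docs WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None
```

Passing a file path instead of `":memory:"` would keep the mapping on disk next to the index files, so both can be reloaded together.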

The more challenging problem is figuring out how learning operations should affect the "documents" (representations of subjects) in the index. This could probably be handled by vector operations, along these lines:

  • for correcting false positives, take the existing subject vector and subtract the vector of the current document (multiplied by a small factor such as 0.1 or 0.01); replace the old subject vector with the result
  • for correcting false negatives, take the existing subject vector and add the vector of the current document (multiplied by a small factor such as 0.1 or 0.01); replace the old subject vector with the result
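
The two corrections above could be sketched roughly like this, with vectors in gensim's sparse format (lists of `(term_id, weight)` pairs). The function name `adjust_subject_vector` and its parameters are hypothetical. Whether negative weights should be clipped after a subtraction is an open question that this sketch does not address:

```python
def adjust_subject_vector(subject_vec, doc_vec, factor=0.1, false_positive=False):
    """Move a subject vector toward or away from a document vector.

    Both vectors are lists of (term_id, weight) pairs. For a false
    positive the scaled document vector is subtracted; for a false
    negative it is added. Returns a new sparse vector sorted by term_id.
    """
    sign = -1.0 if false_positive else 1.0
    weights = dict(subject_vec)
    for term_id, weight in doc_vec:
        weights[term_id] = weights.get(term_id, 0.0) + sign * factor * weight
    # Drop terms whose weight cancelled out exactly, to keep the vector sparse.
    return sorted((t, w) for t, w in weights.items() if w != 0.0)


# Example: the subject was missed for this document (false negative),
# so the subject vector is nudged toward the document.
subject = [(0, 0.5), (2, 0.5)]
document = [(2, 1.0), (3, 1.0)]
updated = adjust_subject_vector(subject, document, factor=0.1)
```

The result would then replace the old subject vector. How best to write it back into the index is not covered here.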

The existing subject vectors can be retrieved from the index with Similarity.vector_by_id, and the learned document can be converted to a vector using the tf-idf transformation.

Requires #225

osma added this to the Short term milestone Dec 14, 2018
osma modified the milestones: Short term, Long term Jul 22, 2020