
Tfidf backend should ignore subjects that are not part of the training data #531

Open
thomaslow opened this issue Oct 4, 2021 · 4 comments


@thomaslow

Hi, I'm currently looking into various subject classification algorithms supporting subject hierarchies and did some initial tests with Annif and its backends. I discovered a minor problem with the tfidf vectorization implementation.

I first observed the issue when comparing evaluation results of the tfidf backend after loading a subject vocabulary either from a TSV file or from a SKOS Turtle file. The evaluation results were not exactly the same, even though the training and test data were the same in both cases.

It seems that unused subjects, i.e., subjects that are not part of the training data but are present in the vocabulary (e.g., because of their broader/narrower/related relationships to other subjects), are still added as empty buffers to the scikit-learn TfidfVectorizer. The resulting tfidf vector is a zero vector, so all predictions (cosine similarities) for such a subject will be 0. However, the inverse document frequency of each term is calculated over the higher number of subjects, including the unused ones.

To my knowledge, this will (slightly) reduce the effectiveness of the tfidf backend in distinguishing rare terms from frequent terms, and, in the case of very large SKOS files with many thousands of unused subjects, might even negatively impact its predictive performance.
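
To make the effect concrete outside of Annif, here is a small scikit-learn-only demonstration (this is not Annif code; it just reproduces the IDF behavior described above):

```python
# Adding empty documents to the corpus changes the IDF values, because
# scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 with
# n = total number of documents, which includes the empty ones.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cat dog", "dog bird"]

vec = TfidfVectorizer().fit(docs)
print(dict(zip(vec.get_feature_names_out(), vec.idf_)))  # needs scikit-learn >= 1.0

# The same corpus padded with two empty "subjects": all IDF values shift,
# and by different relative amounts depending on each term's document
# frequency, even though no term occurrences were added.
vec_padded = TfidfVectorizer().fit(docs + ["", ""])
print(dict(zip(vec_padded.get_feature_names_out(), vec_padded.idf_)))
```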

A possible solution would be to filter out empty subjects before calling fit_transform. However, an additional index would then need to be kept in order to remember which score (vector index) belongs to which subject.
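
A rough sketch of what I have in mind, with hypothetical names (not Annif's actual internals):

```python
# Drop subjects that have no training text before fitting, and keep a
# row-index -> subject-index mapping so that scores can still be reported
# against the full vocabulary.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def train(subject_texts):
    """subject_texts[i] holds the training text for subject i ('' if unused)."""
    used = [i for i, text in enumerate(subject_texts) if text.strip()]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(subject_texts[i] for i in used)
    return vectorizer, matrix, used

def predict(vectorizer, matrix, used, document, n_subjects):
    query = vectorizer.transform([document])
    scores = np.zeros(n_subjects)
    # Map each matrix row back to its original subject index; unused
    # subjects simply keep a score of 0, as they effectively do today.
    scores[used] = cosine_similarity(query, matrix)[0]
    return scores
```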

Cheers,
Thomas

@osma
Member

osma commented Oct 4, 2021

Hi @thomaslow , thank you for the issue report. You're right that the tfidf backend builds a model with all the subjects, even those not referenced in training data.

The tfidf backend is really quite simple and intended to be a first stepping stone towards more advanced backends. It's easy to set up and fast to train, but not really expected to give very good results in terms of quality.

Would you by any chance be interested in implementing a change to the tfidf backend with your proposed solution to the problem (filtering out empty subjects and maintaining a mapping between index IDs and subject IDs)? We're always very happy to accept pull requests.

@thomaslow
Author

Hi @osma, I'm sorry, I don't think I will have the time. As you said, tfidf is just a first step. I only mentioned it because I was surprised that the tfidf backend did not produce exactly the same results for the same training and test data.

At the moment I'm mostly experimenting with different algorithms and approaches that can learn from a hierarchy of subjects. Annif helped a lot to get an overview of the different backends and to run some first experiments. I even wrote a small Python script and a custom AnnifProject class to evaluate and compare multiple Annif backends with other approaches.

Unfortunately, many features that are important for my use case are still missing in Annif (cross-validation, metrics that consider subject hierarchies, document metadata, etc.). So, at the moment, I'm working on putting these pieces together in a separate Python module.

@osma
Member

osma commented Oct 5, 2021

Thanks, I understand. Good to hear that you're also experimenting with other backends. I recommend taking a close look at Omikuji, since at least for us it has consistently achieved good results in very different scenarios (multiclass or multilabel, small or large vocabularies, etc.).

I'd be curious to hear more about the features you are missing in Annif. Would it be possible for you to open new issues requesting them to be added? I can't promise we will implement them (and PRs are very welcome, as I said above!), but just defining the feature would be an important first step in that process. There may be others in the community who have similar needs and could also chime in and perhaps help out.

For cross-validation, I've thought that a CLI command like `annif xval my-project --folds 5 path/to/corpus` could be implemented. I seem to remember that Maui had a command like this.

Regarding metrics for hierarchies, there is an open issue #466 "Implement hierarchical precision, recall and F1 scores" - would that be relevant to you? Perhaps you could comment on the issue itself?
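
For reference, one common way to define such metrics (not necessarily what #466 will end up adopting) is to extend both the predicted and the gold-standard subject sets with all of their ancestors before computing ordinary set-based precision and recall. A quick sketch, where `ancestors` is a hypothetical helper:

```python
# Hierarchical precision/recall in the style of Kiritchenko et al.;
# ancestors(s) is assumed to return the set of ancestors of subject s
# in the vocabulary hierarchy.
def extend_with_ancestors(subjects, ancestors):
    extended = set(subjects)
    for s in subjects:
        extended |= ancestors(s)
    return extended

def hierarchical_precision_recall(predicted, gold, ancestors):
    p_ext = extend_with_ancestors(predicted, ancestors)
    g_ext = extend_with_ancestors(gold, ancestors)
    overlap = len(p_ext & g_ext)
    return overlap / len(p_ext), overlap / len(g_ext)
```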

What do you mean by document metadata?

Also, if you discover algorithms that work well in your use case (learning from a hierarchy of subjects), it would be great if you could tell us more about your findings and perhaps suggest including them as Annif backends.

@thomaslow
Author

Ok, I'll add a few issues about said features.
