Multi-label Cross Validation #535

Open
thomaslow opened this issue Oct 11, 2021 · 4 comments

@thomaslow

Currently, the eval command makes it possible to evaluate the predictive performance of a backend for a single training and test split. Unfortunately, splitting multi-class multi-label data into training and test sets is not trivial, especially when there are only a few examples for some classes. Also, relying on the same training and test split when testing various backends and model parameters can lead to overfitting.

Implementing a multi-label cross validation method would help to evaluate various classification approaches.

> For cross validation, I've thought that a CLI command like annif xval my-project --folds 5 path/to/corpus could be possible to implement. I seem to remember that Maui had a command like this.

Yes, a CLI command like this would be useful. Personally, as a developer, I would prefer a Python module API that allows putting together custom pipelines, similar to sklearn.pipeline.Pipeline.

There is an implementation of cross validation for multi-label data which seems promising, but I haven't had a chance to test it; see scikit.ml or the code.
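
For illustration, here is a minimal sketch of how the iterative stratification from scikit-multilearn (the scikit.ml project mentioned above) could be used to build label-balanced folds. The feature matrix X and the binary label matrix Y are random placeholders; in practice they would come from the actual corpus and vocabulary.

```python
# Minimal sketch using scikit-multilearn's iterative stratification to build
# label-balanced cross-validation folds. X and Y below are random placeholders
# standing in for real document features and a binary subject indicator matrix.
import numpy as np
from skmultilearn.model_selection import IterativeStratification

X = np.random.rand(100, 20)                       # placeholder document features
Y = (np.random.rand(100, 15) > 0.8).astype(int)   # placeholder multi-label matrix

k_fold = IterativeStratification(n_splits=5, order=1)
for fold, (train_idx, test_idx) in enumerate(k_fold.split(X, Y)):
    # Each fold approximately preserves the per-label frequencies, so rare
    # subjects should show up in both the training and the test part.
    print(f"fold {fold}: {len(train_idx)} train docs, {len(test_idx)} test docs")
```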

@osma (Member) commented Oct 11, 2021

Thanks @thomaslow!

There is a method for cross-validation in the Maui Server REST API:

> URL pattern: /{tagger-id}/xvalidate
> This works similar to training, but instead of training and storing a Maui model from training data, it will evaluate the training process, computing precision and recall by cross-validation. The number of cross-validation passes can be set in the configuration resource.

I'm not sure how Maui Server splits the data - is it done intelligently, trying to ensure that rare labels are evenly split, or just randomly?

> Yes, a CLI command like this would be useful. Personally, as a developer, I would prefer a Python module API that allows putting together custom pipelines, similar to sklearn.pipeline.Pipeline.

I see your point. Annif isn't primarily a Python library, though - all functionality is provided either via CLI or REST API (or both). But of course it can be used as a Python module and ideally this functionality would also be available as a module with a reasonable API.

> There is an implementation of cross validation for multi-label data which seems promising, but I haven't had a chance to test it; see scikit.ml or the code.

Thanks for the tip!

@osma (Member) commented Oct 12, 2021

> I'm not sure how Maui Server splits the data - is it done intelligently, trying to ensure that rare labels are evenly split, or just randomly?

Responding to myself: it seems that Maui Server doesn't do anything particularly intelligent: it just splits the corpus into equal size batches without looking at the distribution of labels.

https://github.com/TopQuadrant/MauiServer/blob/49af8afa26bfbf9f1d0d8456c4b8c4efc5ec631c/src/main/java/org/topbraid/mauiserver/tagger/CrossValidationJob.java#L38-L48
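
In other words, the split amounts to roughly the following (my own Python sketch of the idea described above, not a translation of the Java code):

```python
def equal_size_batches(documents, n_batches):
    """Cut the corpus into roughly equal-sized batches in order, without
    considering which subjects each document has (as described above)."""
    docs = list(documents)
    size = -(-len(docs) // n_batches)  # ceiling division
    return [docs[i * size:(i + 1) * size] for i in range(n_batches)]

# With this kind of split, a subject that occurs only a few times can easily
# end up concentrated in a single batch or missing from some folds entirely.
```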

@mfakaehler

I cannot contribute a definitive answer here, but I have some input that might help direct the discussion. I found the following paper quite valuable:
https://link.springer.com/chapter/10.1007/978-3-030-75765-6_27
They discuss a splitting method that ensures balance with respect to the label distribution across the splits. It seems that not taking care of the splitting can lead to a bias that favours algorithms with strong head-label performance. Some algorithms might even completely ignore tail labels if they are not represented in the test and validation sets.
There is also an approach to correcting for this sort of bias by using other metrics, called propensity-scored measures:
https://dl.acm.org/doi/10.1145/2939672.2939756
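
As a rough illustration, propensity-scored precision@k could look something like the sketch below. The propensity model and the default parameters A=0.55 and B=1.5 follow my reading of the linked paper and should be checked against the original; the paper also normalizes the score by the best achievable value, which I omit here.

```python
# Hedged sketch of propensity-scored precision@k, following my reading of the
# linked paper (Jain et al. 2016). The constants A and B are dataset-dependent;
# the values below are only the defaults suggested there.
import numpy as np

def label_propensities(label_counts, n_docs, A=0.55, B=1.5):
    """Estimate the propensity p_l of each label from its training frequency."""
    C = (np.log(n_docs) - 1) * (B + 1) ** A
    return 1.0 / (1.0 + C * np.exp(-A * np.log(label_counts + B)))

def psp_at_k(y_true, y_scores, propensities, k=5):
    """Unnormalized propensity-scored precision@k for a single document.
    y_true: binary gold-label vector, y_scores: predicted scores per label."""
    top_k = np.argsort(y_scores)[::-1][:k]
    return np.sum(y_true[top_k] / propensities[top_k]) / k
```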

As for cross validation, I personally wouldn't opt for that. With these extreme multi-label problems (that not all Annif users may have), I fear that for large vocabularies and models even a single run with moderate computing power can take a while. Looping that through a CV pipeline would be out of computational scope for the resources that I can access (of course, other users might have better resources).

@osma (Member) commented Aug 1, 2022

Thank you for the references to interesting papers, @mfakaehler!

Currently the splitting of data sets is always performed outside Annif. I think it could be useful to provide tools that help perform the split in some reasonable way, and the method in the first paper looks particularly useful, although I am unsure how difficult it would be to implement (for example as a small Python script under a directory such as tools/) and whether the implementation provided by the authors can be reused.
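
As a very rough idea of what such a tools/ script could look like, here is a sketch that splits a TSV document corpus using scikit-multilearn's iterative stratification. The assumed layout (one document per line: text, a tab, then whitespace-separated subject URIs) and the helper name split_tsv_corpus are illustrative only and would need to be adapted to the actual corpus formats.

```python
# Illustrative sketch only: stratified train/test split of a TSV corpus where
# each line is "text<TAB>subject URIs separated by whitespace" (an assumption).
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.model_selection import iterative_train_test_split

def split_tsv_corpus(path, test_size=0.2):
    lines, subject_sets = [], []
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            _, _, subjects = line.rstrip("\n").partition("\t")
            lines.append(line)
            subject_sets.append(subjects.split())

    Y = MultiLabelBinarizer().fit_transform(subject_sets)
    idx = np.arange(len(lines)).reshape(-1, 1)   # stratify over line indices
    train_idx, _, test_idx, _ = iterative_train_test_split(idx, Y, test_size=test_size)
    return ([lines[i] for i in train_idx.ravel()],
            [lines[i] for i in test_idx.ravel()])

# The two returned lists could then be written out as train and test TSV files.
```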

The propensity-scored measures also look useful - in particular, the propensity-scored nDCG variant could be more informative than standard nDCG. OTOH, it doesn't seem to be a widely used metric beyond the original paper. I encourage anyone who feels the need for this kind of metric to open a new issue here.

> As for cross validation, I personally wouldn't opt for that. With these extreme multi-label problems (that not all Annif users may have), I fear that for large vocabularies and models even a single run with moderate computing power can take a while.

That's a fair point, and certainly CV can increase the amount of computation by an order of magnitude. OTOH, it can also be useful in scenarios with small data sets even if it's not always practical.
