Support for topic hierarchies #316

Open
wetneb opened this issue Aug 14, 2019 · 7 comments
Labels: classification, enhancement
Milestone: Blue Sky

Comments

@wetneb

wetneb commented Aug 14, 2019

Many topic classification systems (such as the Dewey Decimal Classification or HAL's topic hierarchy) are organized into trees of classes rather than flat lists.

Are you aware of any subject prediction models which take into account this hierarchical structure? Do you have any plans to add support for them?

We can use models for flat classifications by taking only the leaves of the hierarchical classification, for instance. But as a user, I would like the system to also be able to predict coarser classes (that is, internal nodes in the classification tree) when it is not confident enough to pick a precise leaf.
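
One way to picture this back-off behaviour is the following minimal sketch. It assumes a flat model that returns one probability per leaf; the toy hierarchy, the function names and the 0.6 threshold are all hypothetical and not part of Annif. Leaf scores are summed up the tree, and the deepest node whose total reaches the threshold is returned.

```python
# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "science": None,
    "physics": "science",
    "biology": "science",
    "optics": "physics",
    "mechanics": "physics",
}


def ancestors(node):
    """Yield the node itself, then each ancestor up to the root."""
    while node is not None:
        yield node
        node = PARENT[node]


def depth(node):
    """Number of edges between the node and the root."""
    return sum(1 for _ in ancestors(node)) - 1


def predict_with_backoff(leaf_scores, threshold=0.6):
    """Return the deepest node whose aggregated score reaches the threshold.

    leaf_scores maps leaf classes to probabilities from a flat classifier.
    Returns None if no node is confident enough.
    """
    totals = {}
    for leaf, score in leaf_scores.items():
        for node in ancestors(leaf):
            totals[node] = totals.get(node, 0.0) + score
    confident = [n for n, s in totals.items() if s >= threshold]
    if not confident:
        return None
    # Prefer deeper (more specific) nodes; break ties by total score.
    return max(confident, key=lambda n: (depth(n), totals[n]))


# The model cannot decide between two sibling leaves, so their parent is returned.
print(predict_with_backoff({"optics": 0.45, "mechanics": 0.40, "biology": 0.15}))
# The model is sure about one leaf, so that leaf is returned.
print(predict_with_backoff({"optics": 0.90, "mechanics": 0.05, "biology": 0.05}))
```

This treats the hierarchy purely as post-processing on top of a flat model; a model that uses the hierarchy during training would be a different design.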

@osma
Member

osma commented Aug 14, 2019

You are absolutely right. Currently Annif stores vocabularies as flat lists, so the hierarchy (e.g. from a SKOS file) is lost. That could be fixed, but the larger issue is that most algorithms for subject indexing and document classification only consider a flat list of categories/subjects/classes. I'm sure there are some that can make use of a hierarchical structure, but I haven't come across anything that would be suitable for integration with Annif. If you have some specific models in mind, please add a comment here.

Maui has some support for hierarchies, but only at a very rudimentary level. It takes into account broader/narrower and related links between concepts when it tries to decide which subjects are most relevant for a particular document. Subject candidates that are related to other candidates (with any type of relationship) may be scored higher, though this depends on how well this heuristic worked in the model building phase.
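
To make the idea concrete, here is a rough sketch of that kind of heuristic. It is not Maui's actual implementation: the link table, the `rerank` function and the fixed `boost` value are assumptions for illustration, whereas in Maui the weight given to this signal is determined during model building.

```python
# Hypothetical links between concepts (broader, narrower or related),
# stored symmetrically for simplicity.
LINKS = {
    "cats": {"mammals", "pets"},
    "mammals": {"cats"},
    "pets": {"cats"},
    "taxation": set(),
}


def rerank(candidates, boost=0.1):
    """Return candidate concepts ordered by boosted score, best first.

    Each candidate's score is raised by `boost` (as a fraction) for every
    linked concept that is also among the candidates.
    """
    reranked = {}
    for concept, score in candidates.items():
        linked = sum(1 for other in LINKS.get(concept, ()) if other in candidates)
        reranked[concept] = score * (1 + boost * linked)
    return sorted(reranked, key=reranked.get, reverse=True)


# "cats" starts slightly below "taxation" but is linked to another candidate.
print(rerank({"taxation": 0.52, "cats": 0.50, "mammals": 0.30}))
```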

@osma added the classification and enhancement labels Aug 14, 2019
@osma added this to the Blue Sky milestone Aug 14, 2019
@wetneb
Author

wetneb commented Aug 14, 2019

I am not aware of any model that does that. Thanks for the pointer to the Maui tools, it is interesting!

@osma
Member

osma commented Aug 14, 2019

Maui is a separate project, but there is MauiService which can be used from Annif:
https://github.com/NatLibFi/Annif/wiki/Backend%3A-Maui

@wetneb
Author

wetneb commented Aug 14, 2019

If I had time I would be interested to review the literature to see if there is any nice probabilistic model for this sort of setting.

It might just be that there is no real benefit in using a hierarchy in this sort of model: the simplicity of assuming a flat list of topics might outweigh the benefits of handling the hierarchy.

@osma
Member

osma commented Aug 16, 2019

Based on a quick scan of the paper, this seems like a relevant and sensible survey of approaches to hierarchical classification:

Silla, C. N., & Freitas, A. A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1-2), 31-72.
https://doi.org/10.1007/s10618-010-0175-9

I'd be interested in hearing about practical implementations, especially open source software projects, preferably in Python (so they're easy to integrate with Annif).

Of course it would be possible to implement one or more of the methods described in the above paper using e.g. sklearn, but it's a lot more work that way.
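
As a rough illustration of the effort involved, here is a minimal sklearn sketch of one approach the survey describes, the local classifier per parent node: one flat classifier is trained for each internal node to choose among its children, and prediction walks down from the root. The toy hierarchy, the training texts and the function names are hypothetical, and the sketch assumes every internal node has at least two children with training examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy hierarchy: parent -> children.
CHILDREN = {
    "root": ["science", "arts"],
    "science": ["physics", "biology"],
    "arts": ["music", "painting"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def path_to(leaf):
    """Return the nodes from just below the root down to the given leaf."""
    path = [leaf]
    while PARENT[path[0]] != "root":
        path.insert(0, PARENT[path[0]])
    return path


def fit_per_parent(texts, leaves):
    """Train one flat classifier per internal node, choosing among its children."""
    models = {}
    for parent, kids in CHILDREN.items():
        X, y = [], []
        for text, leaf in zip(texts, leaves):
            for node in path_to(leaf):
                if node in kids:
                    X.append(text)
                    y.append(node)
        model = make_pipeline(TfidfVectorizer(), LogisticRegression())
        models[parent] = model.fit(X, y)
    return models


def predict_path(models, text):
    """Walk down from the root, letting each local classifier pick a child."""
    node, path = "root", []
    while node in models:
        node = str(models[node].predict([text])[0])
        path.append(node)
    return path


texts = [
    "quantum particles and forces", "energy momentum relativity",
    "cells genes and evolution", "plants animals ecosystems",
    "symphony orchestra melody", "guitar chords rhythm",
    "canvas oil brush portrait", "watercolour landscape gallery",
]
leaves = ["physics", "physics", "biology", "biology",
          "music", "music", "painting", "painting"]

models = fit_per_parent(texts, leaves)
print(predict_path(models, "orchestra melody"))
```

Even this small version leaves out things a real integration would need, such as stopping at an internal node when a local classifier is unsure and handling documents with several subjects.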

@osma
Member

osma commented Aug 20, 2019

The sklearn-hierarchical-classification project seems to be exactly what would be needed here. It's an open source Python module (Apache license), implemented with sklearn and based on the above-mentioned paper by Silla & Freitas.

Would you like to give it a spin @wetneb, using your own data sets? It would be good to know if it works for you, and then we could consider integrating it with Annif.

@wetneb
Author

wetneb commented Aug 20, 2019

@osma many thanks for the pointer, it looks perfect indeed! I would be very interested to give it a go (but for my own curiosity mainly, so it might not happen very soon).
