Taxoref is a methodology for Taxonomy Refinement via word embeddings. In order to select the best embedding for the refinement task, several vector models are generted and evaluated on the basis of their ability to represent taxonomic similarity relations. TaxoRef has been implemented in the following article: "TaxoRef: Embeddings Evaluation for AI-drivenTaxonomy Refinement" - ECML 2021
Plese cite TaxoRef as: Malandri, L., Mercorio, F., Mezzanzanica, M., & Nobani, N. TaxoRef: Embeddings Evaluation for AI-driven Taxonomy Refinement. Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2021).
@inproceedings{malandri2021taxoref,
title={TaxoRef: Embeddings Evaluation for AI-drivenTaxonomy Refinement},
author={Malandri, Lorenzo and Mercorio, Fabio and Mezzanzanica, Mario and Nobani, Navid},
booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
year={2021},
organization={Springer}
}
TaxoRef is a general tool that can be apllied to any taxonomy. The only pre-requisite is having a text corpus to train word embeddings. In our case, it has been used in the Labour Market domain to propose a refinement of the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy, training the embeddings on a corpus of 2.2M+ Online Job Vacancies (OJVs) related to the ICT market and collected in UK in 2018.
The notebook TaxoRef.ipynb contains the code, complete and commented. You just need to clone this repository and run the notebook TaxoRef.ipynb.
The embedding generation and selection phase, thorougly described in the article, is not included, because it can be built with any pair corpus-taxonomy following the description in the paper.
The vector model and the taxonomic groups, together with their belonging group in ESCO, denominated as IscoGroup, are imported. A new table is created a row for each taxonomic terms in the embeddings vocaboulary. The other columns are, in order, the "IscoGroup", the "word vector" and the "sample" column, a representation that we will use as input for the refinement, obtained concatenating the vector representation of the term with its class (IscoGroup).
We perform the refinement by computing class probability through bayes formula and we produce a table comparing the origin class (the one in ESCo) and the destination one (the one with highest class probability, i.e. the TaxoRef suggestion)
In this sectionwe select a single ESCO group to better observe the refinement suggested by taxoref: