This is the official repository for the paper "MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification".
- Create a virtual environment
# We recommend using Anaconda to create a conda environment
conda create --name matchxml python=3.8
conda activate matchxml
- Install the required software:
pip install -r requirements.txt
# eurlex-4k, wiki10-31k, amazoncat-31k, wiki-500k, amazon-670k, amazon-3m
- Download the six XMC datasets from XR-Transformer.
- Download our trained label embeddings from Google Drive and save them to xmc-base/{dataset}.
- Download our static text features (static sentence embeddings + TF-IDF features) from Google Drive and save them to xmc-base/{dataset}/tfidf-attnxml, replacing the original TF-IDF features.
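Before training, it can help to confirm that the downloads landed in the directories named above. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and it checks only the two directories mentioned in the steps above, since the file names inside them are not assumed here.

```python
from pathlib import Path

# Dataset names as listed in this README.
DATASETS = ("eurlex-4k", "wiki10-31k", "amazoncat-31k",
            "wiki-500k", "amazon-670k", "amazon-3m")


def missing_dirs(dataset, root="xmc-base"):
    """Return the expected directories for `dataset` that do not exist yet.

    Hypothetical helper: it only looks for xmc-base/{dataset} and
    xmc-base/{dataset}/tfidf-attnxml, and does not inspect their contents.
    """
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    base = Path(root) / dataset
    expected = [base, base / "tfidf-attnxml"]
    return [str(p) for p in expected if not p.is_dir()]


if __name__ == "__main__":
    # Report the status of every dataset, relative to the current directory.
    for name in DATASETS:
        gaps = missing_dirs(name)
        status = "ok" if not gaps else "missing " + ", ".join(gaps)
        print(f"{name}: {status}")
```

Run it from the repository root so that the relative path xmc-base resolves correctly. An empty result for a dataset means both directories are present; it does not verify that the files inside are complete.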
# eurlex-4k, wiki10-31k, amazoncat-31k, wiki-500k, amazon-670k, amazon-3m
bash run.sh {dataset}
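To run several datasets in sequence, the command above can be wrapped in a small driver. This is only a sketch under stated assumptions: the helper names `run_command` and `run_all` are hypothetical and not part of the repository, and the script is assumed to be launched from the repository root, where run.sh lives.

```python
import subprocess

# Dataset names as listed in this README.
DATASETS = ("eurlex-4k", "wiki10-31k", "amazoncat-31k",
            "wiki-500k", "amazon-670k", "amazon-3m")


def run_command(dataset, script="run.sh"):
    """Build the argument list equivalent to `bash run.sh {dataset}`."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return ["bash", script, dataset]


def run_all(datasets=DATASETS):
    """Run the script once per dataset, stopping at the first failure."""
    for name in datasets:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(run_command(name), check=True)
```

For example, `run_all(["eurlex-4k", "wiki10-31k"])` would run the two smaller datasets one after the other. Validating the name first catches typos before any long-running job starts.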
# eurlex-4k, wiki10-31k, amazoncat-31k, wiki-500k, amazon-670k, amazon-3m
bash ./label2vec_run/{dataset}.sh
python sentence_embedding.py
- Our pre-trained models can be downloaded from Google Drive.
If you find this work useful in your research, please consider citing:
@article{ye2024matchxml,
title={MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification},
author={Ye, Hui and Sunderraman, Rajshekhar and Ji, Shihao},
journal={IEEE Transactions on Knowledge and Data Engineering},
year={2024},
publisher={IEEE}
}