SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

dadelani/sib-200


This repository contains the annotated English dataset, the script that extends the annotations to other languages, and code to run baseline text classification models.

Required dependencies

  • python
    • transformers: state-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.
    • sklearn (distributed on PyPI as scikit-learn)
    • evaluate
    • datasets
    • pandas

Install them with:

pip install -r code/requirements.txt

Create SIB dataset

sh get_flores_and_annotate.sh

Alternatively, download it from the Hugging Face Hub, where it is published as the dataset Davlan/sib200.
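
For readers taking the Hugging Face route, the following is a minimal sketch of loading one language and checking its topic distribution. It assumes that each language is exposed as a separate configuration named by its FLORES-200 code (for example eng_Latn) and that the topic label lives in a column named category; neither detail is stated in this README, and load_sib200 and category_counts are hypothetical helper names.

```python
from collections import Counter


def load_sib200(lang="eng_Latn", split="train"):
    """Load one language split of SIB-200 from the Hugging Face Hub.

    Assumption: each language is a separate configuration named by its
    FLORES-200 code (for example "eng_Latn").
    """
    # Imported lazily so category_counts works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("Davlan/sib200", lang, split=split)


def category_counts(rows, label_column="category"):
    """Count how many examples carry each topic label.

    Assumption: the topic label is stored in a column called "category".
    """
    return Counter(row[label_column] for row in rows)
```

Calling category_counts(load_sib200("eng_Latn")) should then give a per-topic count for the English training split, which is a quick way to confirm the download worked before running any model.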

Run our baseline model using XLM-R

cd code/
sh xlmr_all.sh
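
The contents of xlmr_all.sh are not reproduced here, so the following is only a rough sketch of the setup step a baseline like this typically needs: mapping topic names to integer ids and loading XLM-R with a classification head. The checkpoint name xlm-roberta-base and the helper names build_label_maps and load_xlmr_classifier are assumptions for illustration; the script may use a different checkpoint and different hyperparameters.

```python
def build_label_maps(categories):
    """Map topic names to integer ids and back, in sorted order."""
    names = sorted(set(categories))
    label2id = {name: idx for idx, name in enumerate(names)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


def load_xlmr_classifier(label2id, id2label, checkpoint="xlm-roberta-base"):
    """Load a tokenizer and an XLM-R model with a fresh classification head.

    Assumption: "xlm-roberta-base" is chosen for illustration only; the
    checkpoint used by xlmr_all.sh may differ.
    """
    # Imported lazily so build_label_maps works without `transformers`.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Sorting the label names keeps the id assignment stable across runs and languages, so predictions from models trained on different languages stay comparable.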

BibTeX entry and citation info

@misc{adelani2023sib200,
      title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects}, 
      author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
      year={2023},
      eprint={2309.07445},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
