This work is licensed under a Creative Commons Attribution 4.0 International License.
AfriQA is the first cross-lingual question answering (QA) dataset with a focus on African languages. The dataset includes over 12,000 XOR QA examples across 10 African languages, making it an invaluable resource for developing more equitable QA technology. African languages have historically been underserved in the digital landscape, with far less in-language content available online. This makes it difficult for QA systems to provide accurate information to users in their native language. However, cross-lingual open-retrieval question answering (XOR QA) systems can help fill this gap by retrieving answer content from other languages. AfriQA focuses specifically on African languages where cross-lingual answer content is the only high-coverage source of information. Previous datasets have primarily focused on languages where cross-lingual QA augments coverage from the target language, but AfriQA highlights the importance of African languages as a realistic use case for XOR QA.
There are currently 10 languages covered in AfriQA:
- Bemba (bem)
- Fon (fon)
- Hausa (hau)
- Igbo (ibo)
- Kinyarwanda (kin)
- Swahili (swa)
- Twi (twi)
- Wolof (wol)
- Yorùbá (yor)
- Zulu (zul)
Question-answer pairs for each language and train-dev-test split are in the data directory in JSON Lines format.
Dataset naming convention:

queries.afriqa.{lang_code}.{en/fr}.{split}.json
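For illustration, the naming convention can be applied programmatically. A minimal sketch, assuming a small helper (the function name is ours, not part of the repo):

```python
# Sketch: build an AfriQA query-file name from the naming convention
# queries.afriqa.{lang_code}.{en/fr}.{split}.json
# The helper name is an assumption for illustration only.

def afriqa_filename(lang_code: str, pivot: str, split: str) -> str:
    """Return the query file name for a language, pivot language, and split."""
    assert pivot in ("en", "fr"), "pivot language is English or French"
    assert split in ("train", "dev", "test")
    return f"queries.afriqa.{lang_code}.{pivot}.{split}.json"

print(afriqa_filename("bem", "en", "dev"))
# queries.afriqa.bem.en.dev.json
```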
Data format:
- id: question ID
- question: question in the African language
- translated_question: question translated into a pivot language (English/French)
- answers: answer in the African language
- lang: language of the datapoint (an African language), e.g. bem
- split: dataset split
- translated_answer: answer in the pivot language
- translation_type: translation type of the question and answers
```json
{
  "id": 0,
  "question": "Bushe icaalo ca Egypt caali tekwapo ne caalo cimbi?",
  "translated_question": "Has the country of Egypt been colonized before?",
  "answers": "['Emukwai']",
  "lang": "bem",
  "split": "dev",
  "translated_answer": "['yes']",
  "translation_type": "human_translation"
}
```
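Since each file is in JSON Lines format and the example record stores answers as a string-encoded Python list (e.g. "['Emukwai']"), loading a file might look like the sketch below. The helper names are assumptions, not part of the repo:

```python
# Sketch: read an AfriQA JSON Lines file and decode the string-encoded
# answer lists seen in the example record above. Helper names are
# assumptions; point `path` at a file in your local data directory.
import ast
import json

def parse_answers(field: str) -> list:
    """Turn a string like "['Emukwai']" into an actual Python list."""
    return ast.literal_eval(field)

def load_afriqa(path: str) -> list:
    """Load one queries.afriqa.*.json file into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            rec["answers"] = parse_answers(rec["answers"])
            rec["translated_answer"] = parse_answers(rec["translated_answer"])
            records.append(rec)
    return records
```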
Set up a virtual environment using Conda or Virtualenv:

```shell
conda create -n xor_qa_venv python=3.9 anaconda
conda activate xor_qa_venv
```

or

```shell
python3 -m venv xor_qa_venv
source xor_qa_venv/bin/activate
```
Clone the repo:

```shell
git clone https://github.com/ToluClassics/masakhane_xqa --recurse-submodules
```

Install the requirements:

```shell
pip install -r requirements.txt
```
The already processed dumps are available on Hugging Face 😊. It is recommended to use these exact corpora in order to reproduce the baseline results. Alternatively, to prepare the Wikipedia retrieval corpus yourself, consult docs/process_wiki_dumps.md.
For all languages, there are three splits. The original splits were named train, dev, and test, and they correspond to the train, validation, and test splits. The splits have the following sizes:
Language | train | dev | test |
---|---|---|---|
Bemba | 502 | 503 | 314 |
Fon | 427 | 428 | 386 |
Hausa | 435 | 436 | 300 |
Igbo | 417 | 418 | 409 |
Kinyarwanda | 407 | 409 | 347 |
Swahili | 415 | 417 | 302 |
Twi | 451 | 452 | 490 |
Wolof | 503 | 504 | 334 |
Yoruba | 360 | 361 | 332 |
Zulu | 387 | 388 | 325 |
Total | 4333 | 4346 | 3560 |
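Given the naming convention above, split sizes like those in the table can be recomputed by counting JSON Lines rows per file. A minimal sketch, assuming the flat directory layout described earlier (the helper name is ours):

```python
# Sketch: count examples per split by counting JSON Lines rows in each
# queries.afriqa.{lang}.{pivot}.{split}.json file. The directory layout
# is an assumption; point data_dir at your local copy of the data folder.
import glob
import os
from collections import Counter

def split_sizes(data_dir: str) -> Counter:
    """Return a Counter mapping split name -> number of examples."""
    sizes = Counter()
    pattern = os.path.join(data_dir, "queries.afriqa.*.json")
    for path in glob.glob(pattern):
        # File names look like queries.afriqa.bem.en.dev.json,
        # so the split is the second-to-last dot-separated part.
        split = os.path.basename(path).split(".")[-2]
        with open(path, encoding="utf-8") as f:
            sizes[split] += sum(1 for line in f if line.strip())
    return sizes
```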
Coming soon...