- Set up a virtual environment using Conda or Virtualenv

  ```bash
  conda create -n xor_qa_venv python=3.9 anaconda
  conda activate xor_qa_venv
  ```

  or

  ```bash
  python3 -m venv xor_qa_venv
  source xor_qa_venv/bin/activate
  ```
- Clone the repo

  ```bash
  git clone https://github.com/ToluClassics/masakhane_xqa --recurse-submodules
  ```
- Install the requirements

  ```bash
  pip install -r requirements.txt
  ```
The English and French passages for this project are drawn from Wikipedia snapshots of 2022-05-01 and 2022-04-20 respectively, and are downloaded from the Internet Archive to enable open-domain experiments. The raw dumps can be downloaded from the following URLs:
- https://archive.org/download/enwiki-20220501/enwiki-20220501-pages-articles-multistream.xml.bz2
- https://archive.org/download/frwiki-20220420/frwiki-20220420-pages-articles-multistream.xml.bz2
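The Internet Archive URLs above follow a predictable pattern based on the language code and snapshot date. As a small sketch (not part of the repo's scripts), the following Python builds the download URL for each snapshot used in this project:

```python
# Snapshot dates used in this project (language code -> dump date).
SNAPSHOTS = {"en": "20220501", "fr": "20220420"}

def dump_url(lang: str, date: str) -> str:
    """Return the Internet Archive URL for a pages-articles multistream dump."""
    name = f"{lang}wiki-{date}"
    return (
        f"https://archive.org/download/{name}/"
        f"{name}-pages-articles-multistream.xml.bz2"
    )

for lang, date in SNAPSHOTS.items():
    print(dump_url(lang, date))
```

The printed URLs match the two links listed above, so the same pattern can be reused if you want to experiment with other snapshot dates.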
The already-processed dumps are available on HuggingFace 😊. It is recommended to use these exact corpora in order to reproduce the baseline results. To download:
Alternatively, you can run the processing pipeline yourself. We adopt the same processing used in the Dense Passage Retriever (DPR) paper, and the pipeline has been bundled into a single script, which you can run as follows:

```bash
bash scripts/generate_process_dumps.sh /path/to/dir_containing_dumps
```
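To illustrate the kind of work the first stage of such a pipeline does, here is a minimal Python sketch (not the bundled script itself) that streams page titles out of a compressed `pages-articles` dump without loading the whole file into memory:

```python
import bz2
import xml.etree.ElementTree as ET

def iter_page_titles(dump_path: str):
    """Stream <title> text from a MediaWiki XML dump (.xml.bz2).

    Uses incremental parsing so the multi-GB dump is never fully
    loaded into memory.
    """
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            # MediaWiki export XML may be namespaced; match the local tag name.
            if elem.tag == "title" or elem.tag.endswith("}title"):
                yield elem.text
                elem.clear()  # release parsed elements as we go
```

The real processing additionally extracts and cleans article text and splits it into passages, following the DPR preprocessing; this sketch only shows the streaming-parse pattern.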
This document also provides a detailed breakdown of the individual steps.