A large-scale multilingual dataset for Information Retrieval. Thorough human-annotations across 18 diverse languages.

onbncbjocp68898/miracl

 
 

🙌 MIRACL

MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. The website for the event can be found at miracl.ai. This repo provides pointers to access the actual dataset.

For more details, check out our arXiv paper: Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages.

🙌 Corpora

The Wikipedia corpora used in MIRACL are available as a HuggingFace Dataset. So far, we have released corpora for the 16 "known languages"; the remaining 2 "surprise languages" will be revealed later!

  • 🤗 = direct link to HuggingFace Dataset
  • 🌏 = link to raw wiki dumps

| Language | # of Passages | # of Articles | Links |
|---|---|---|---|
| Arabic (ar) | 2,061,414 | 656,982 | 🤗 🌏 |
| Bengali (bn) | 297,265 | 63,762 | 🤗 🌏 |
| English (en) | 32,893,221 | 5,758,285 | 🤗 🌏 |
| Spanish (es) | 10,373,953 | 1,669,181 | 🤗 🌏 |
| Persian (fa) | 2,207,172 | 857,827 | 🤗 🌏 |
| Finnish (fi) | 1,883,509 | 447,815 | 🤗 🌏 |
| French (fr) | 14,636,953 | 2,325,608 | 🤗 🌏 |
| Hindi (hi) | 506,264 | 148,107 | 🤗 🌏 |
| Indonesian (id) | 1,446,315 | 446,330 | 🤗 🌏 |
| Japanese (ja) | 6,953,614 | 1,133,444 | 🤗 🌏 |
| Korean (ko) | 1,486,752 | 437,373 | 🤗 🌏 |
| Russian (ru) | 9,543,918 | 1,476,045 | 🤗 🌏 |
| Swahili (sw) | 131,924 | 47,793 | 🤗 🌏 |
| Telugu (te) | 518,079 | 66,353 | 🤗 🌏 |
| Thai (th) | 542,166 | 128,179 | 🤗 🌏 |
| Chinese (zh) | 4,934,368 | 1,246,389 | 🤗 🌏 |

The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., \n\n in the wiki markup). Each of these passages constitutes a "document", or unit of retrieval. We preserve the Wikipedia article title of each passage.
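
The segmentation idea can be pictured with a minimal Python sketch. This is not WikiExtractor's actual logic; it only assumes that the plain text of an article uses blank lines (\n\n) as passage boundaries. The function name split_into_passages is hypothetical and not part of this repository.

```python
def split_into_passages(title, article_text):
    """Split plain article text on blank lines, keeping the title with each passage.

    Illustrative only: the real corpora were produced with WikiExtractor.
    """
    passages = []
    for chunk in article_text.split("\n\n"):
        text = chunk.strip()
        if text:  # skip empty chunks produced by runs of blank lines
            passages.append({"title": title, "text": text})
    return passages
```

Each returned dictionary corresponds to one unit of retrieval and carries the article title alongside the passage text.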

The corpus data files are in JSON lines format, compressed with gzip. Each line in the file corresponds to a passage. Consider an example from the English corpus:

{
    "docid": "39#0",
    "title": "Albedo", 
    "text": "Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation)."
}

The docid has the schema X#Y, where all passages with the same X come from the same Wikipedia article, whereas Y denotes the passage within that article, numbered sequentially. The text field contains the text of the passage. The title field contains the name of the article the passage comes from.
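
As a rough illustration of the format described above, the following Python sketch reads a gzip-compressed JSON lines file and regroups passages by article using the X#Y docid schema. The function names and any file path passed to iter_passages are assumptions made for this example; they are not part of the MIRACL tooling.

```python
import gzip
import json
from collections import defaultdict


def parse_docid(docid):
    """Split a docid of the form 'X#Y' into (article id X, passage index Y)."""
    article_id, passage_idx = docid.rsplit("#", 1)
    return article_id, int(passage_idx)


def iter_passages(path):
    """Yield one passage dict per non-empty line of a gzipped JSON lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def group_by_article(passages):
    """Map each article id to its passage texts, ordered by passage index."""
    articles = defaultdict(list)
    for passage in passages:
        article_id, idx = parse_docid(passage["docid"])
        articles[article_id].append((idx, passage["text"]))
    return {a: [text for _, text in sorted(ps)] for a, ps in articles.items()}
```

Because iter_passages is a generator, a large corpus such as the English one can be streamed line by line rather than loaded into memory at once.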

🙌 Topics and Relevance Judgments

Topics (= queries) and relevance judgments (= relevance labels) for the MIRACL training and development sets in each of the 16 known languages are available as a HuggingFace Dataset.

🤗 = direct link to HuggingFace Dataset

| Language | Train #Q | Train #J | Dev #Q | Dev #J | Links |
|---|---|---|---|---|---|
| Arabic (ar) | 3,495 | 25,382 | 2,896 | 29,197 | 🤗 |
| Bengali (bn) | 1,631 | 16,754 | 411 | 4,206 | 🤗 |
| English (en) | 2,863 | 29,416 | 799 | 8,350 | 🤗 |
| Spanish (es) | 2,162 | 21,531 | 648 | 6,443 | 🤗 |
| Persian (fa) | 2,107 | 21,844 | 632 | 6,571 | 🤗 |
| Finnish (fi) | 2,897 | 20,350 | 1,271 | 12,008 | 🤗 |
| French (fr) | 1,143 | 11,426 | 343 | 3,429 | 🤗 |
| Hindi (hi) | 1,169 | 11,668 | 350 | 3,494 | 🤗 |
| Indonesian (id) | 4,071 | 41,358 | 960 | 9,668 | 🤗 |
| Japanese (ja) | 3,477 | 34,387 | 860 | 8,354 | 🤗 |
| Korean (ko) | 868 | 12,767 | 213 | 3,057 | 🤗 |
| Russian (ru) | 4,683 | 33,921 | 1,252 | 13,100 | 🤗 |
| Swahili (sw) | 1,901 | 9,359 | 482 | 5,092 | 🤗 |
| Telugu (te) | 3,452 | 18,608 | 828 | 1,606 | 🤗 |
| Thai (th) | 2,972 | 21,293 | 733 | 7,573 | 🤗 |
| Chinese (zh) | 1,312 | 13,113 | 393 | 3,928 | 🤗 |
| Total | 40,203 | 343,177 | 13,071 | 126,076 | |

The above table shows the number of queries (#Q) and the number of judgments (#J) in each (language, split) combination, where the judgments include both positive and negative labels.

The topics are formatted in TSV, with each line organized as follows:

qid\tquery

The relevance judgments are formatted in standard TREC qrels format, as follows:

qid Q0 docid relevance
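
A minimal Python sketch of parsing both formats follows. The function names are hypothetical, and the sketch assumes exactly the layouts shown above: two tab-separated fields per topic line and four whitespace-separated fields per qrels line.

```python
from collections import defaultdict


def parse_topics(lines):
    """Parse 'qid<TAB>query' lines into a {qid: query} dictionary."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        qid, query = line.split("\t", 1)
        topics[qid] = query
    return topics


def parse_qrels(lines):
    """Parse 'qid Q0 docid relevance' lines into {qid: {docid: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        qid, _q0, docid, relevance = parts
        qrels[qid][docid] = int(relevance)
    return dict(qrels)


def count_queries_and_judgments(qrels):
    """Return (#Q, #J): judged queries and total judged (query, passage) pairs."""
    return len(qrels), sum(len(docs) for docs in qrels.values())
```

Both parsers accept any iterable of lines, so an open file handle can be passed directly, for example parse_qrels(open(path, encoding="utf-8")) with a path of your choosing. The counts returned by count_queries_and_judgments include positive and negative labels alike.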

🙌 Baselines

Reproduce the results with Pyserini:

We have released baselines using BM25, mDPR, and a hybrid of the two, as described in our arXiv paper. The BM25 and mDPR results can be reproduced using Pyserini.

To reproduce our baselines:

  1. Install the development version of Pyserini following these instructions. (To run baselines on the surprise languages, you'll need to rebuild both Anserini and Pyserini.)
  2. Manually place all topics and qrels files under tools/topics-and-qrels. The topics and qrels files can be found under miracl-v1.0-${lang}/topics and miracl-v1.0-${lang}/qrels in the HuggingFace dataset.
    git clone https://huggingface.co/datasets/miracl/miracl
    mv miracl/*/*/* $PYSERINI_PATH/tools/topics-and-qrels/
    
  3. Follow the commands on our 2-click-reproduction (2CR) website.

Checkpoints for dense models:

  • mDPR (w/o fine-tuning on MIRACL): castorini/mdpr-tied-pft-msmarco
  • mContriever (w/o fine-tuning on MIRACL): facebook/mcontriever-msmarco
  • mDPR (fine-tuned on MIRACL): castorini/mdpr-tied-pft-msmarco-ft-miracl-{lang}, where {lang} is the two-letter ISO code (e.g., ar, bn, ...)
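
The naming pattern for the fine-tuned checkpoints can be expanded with a small Python helper. This is only a sketch: the helper name is hypothetical, the language list is taken from the 16 known languages in the tables above, and the code does not check that a given checkpoint is actually published.

```python
# Two-letter codes of the 16 known languages listed in this README.
KNOWN_LANGS = [
    "ar", "bn", "en", "es", "fa", "fi", "fr", "hi",
    "id", "ja", "ko", "ru", "sw", "te", "th", "zh",
]

# Naming pattern quoted from the checkpoint list above.
FINETUNED_TEMPLATE = "castorini/mdpr-tied-pft-msmarco-ft-miracl-{lang}"


def finetuned_mdpr_checkpoint(lang):
    """Return the fine-tuned mDPR checkpoint name for a known language code."""
    if lang not in KNOWN_LANGS:
        raise ValueError(f"unknown language code: {lang!r}")
    return FINETUNED_TEMPLATE.format(lang=lang)
```

For instance, finetuned_mdpr_checkpoint("ar") yields castorini/mdpr-tied-pft-msmarco-ft-miracl-ar.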

🙌 Citation

If you find this dataset and repository helpful, please cite MIRACL as follows:

@article{10.1162/tacl_a_00595,
    author = {Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy},
    title = "{MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {11},
    pages = {1114-1131},
    year = {2023},
    month = {09},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00595},
    url = {https://doi.org/10.1162/tacl\_a\_00595},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00595/2157340/tacl\_a\_00595.pdf},
}

🙌 Contact

If you have any questions, feel free to email us (project.miracl [at] gmail.com) or open a GitHub issue in this repository.
