GitHub - onbncbjocp68898/miracl: A large-scale multilingual dataset for Information Retrieval. Thorough human-annotations across 18 diverse languages.

Paper | Baselines | HuggingFace | Leaderboard

🙌 MIRACL

MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. The website for the event can be found at miracl.ai. This repo provides pointers to the dataset itself.

For more details, check out our arXiv paper: Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages.

🙌 Corpora

The Wikipedia corpora used in MIRACL are available as a HuggingFace Dataset. So far, we have released corpora for the 16 "known languages"; the remaining 2 "surprise languages" will be revealed later!

🤗 = direct link to HuggingFace Dataset
🌏 = link to raw wiki dumps

Language	# of Passages	# of Articles	Links
Arabic (ar)	2,061,414	656,982	🤗 🌏
Bengali (bn)	297,265	63,762	🤗 🌏
English (en)	32,893,221	5,758,285	🤗 🌏
Spanish (es)	10,373,953	1,669,181	🤗 🌏
Persian (fa)	2,207,172	857,827	🤗 🌏
Finnish (fi)	1,883,509	447,815	🤗 🌏
French (fr)	14,636,953	2,325,608	🤗 🌏
Hindi (hi)	506,264	148,107	🤗 🌏
Indonesian (id)	1,446,315	446,330	🤗 🌏
Japanese (ja)	6,953,614	1,133,444	🤗 🌏
Korean (ko)	1,486,752	437,373	🤗 🌏
Russian (ru)	9,543,918	1,476,045	🤗 🌏
Swahili (sw)	131,924	47,793	🤗 🌏
Telugu (te)	518,079	66,353	🤗 🌏
Thai (th)	542,166	128,179	🤗 🌏
Chinese (zh)	4,934,368	1,246,389	🤗 🌏

The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., \n\n in the wiki markup). Each of these passages constitutes a "document", i.e., a unit of retrieval. We preserve the Wikipedia article title of each passage.

The corpus data files are in JSON lines format, compressed with gzip. Each line in the file corresponds to a passage. Consider an example from the English corpus:

{
    "docid": "39#0",
    "title": "Albedo", 
    "text": "Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation)."
}

The docid has the schema X#Y, where all passages with the same X come from the same Wikipedia article, whereas Y denotes the passage within that article, numbered sequentially. The text field contains the text of the passage. The title field contains the name of the article the passage comes from.
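The format described above can be read with the Python standard library alone. The following is a minimal sketch, not part of the MIRACL tooling: the function names (`iter_passages`, `split_docid`, `group_by_article`) are hypothetical, and the file path you pass in depends on where you downloaded the corpus.

```python
import gzip
import json
from collections import defaultdict
from typing import Iterable, Iterator


def iter_passages(path: str) -> Iterator[dict]:
    """Yield one passage dict per non-empty line of a gzipped JSON lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def split_docid(docid: str) -> tuple[str, int]:
    """Split a docid of the form X#Y into (article id X, passage index Y)."""
    article_id, passage_idx = docid.rsplit("#", 1)
    return article_id, int(passage_idx)


def group_by_article(passages: Iterable[dict]) -> dict[str, list[dict]]:
    """Group passages by article id, ordered by passage index within each article."""
    articles = defaultdict(list)
    for passage in passages:
        article_id, _ = split_docid(passage["docid"])
        articles[article_id].append(passage)
    for plist in articles.values():
        plist.sort(key=lambda p: split_docid(p["docid"])[1])
    return dict(articles)


# The docid from the English example above:
print(split_docid("39#0"))  # ('39', 0)
```

Streaming with a generator avoids loading a whole corpus file into memory, which matters for the larger languages (the English corpus has over 32 million passages). `group_by_article` does hold everything it is given, so it is better suited to a filtered subset.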

🙌 Topics and Relevance Judgments

Topics (= queries) and relevance judgments (= relevance labels) of the MIRACL training sets and development sets for each of the 16 known languages are available on HuggingFace Dataset!

🤗 = direct link to HuggingFace Dataset

	Train		Dev
Language	#Q	#J	#Q	#J	Links
Arabic (ar)	3,495	25,382	2,896	29,197	🤗
Bengali (bn)	1,631	16,754	411	4,206	🤗
English (en)	2,863	29,416	799	8,350	🤗
Spanish (es)	2,162	21,531	648	6,443	🤗
Persian (fa)	2,107	21,844	632	6,571	🤗
Finnish (fi)	2,897	20,350	1,271	12,008	🤗
French (fr)	1,143	11,426	343	3,429	🤗
Hindi (hi)	1,169	11,668	350	3,494	🤗
Indonesian (id)	4,071	41,358	960	9,668	🤗
Japanese (ja)	3,477	34,387	860	8,354	🤗
Korean (ko)	868	12,767	213	3,057	🤗
Russian (ru)	4,683	33,921	1,252	13,100	🤗
Swahili (sw)	1,901	9,359	482	5,092	🤗
Telugu (te)	3,452	18,608	828	1,606	🤗
Thai (th)	2,972	21,293	733	7,573	🤗
Chinese (zh)	1,312	13,113	393	3,928	🤗
Total	40,203	343,177	13,071	126,076

The above table shows the number of queries (#Q) and the number of judgments (#J) in each (language, split) combination, where the judgments include both positive and negative labels.

The topics are formatted in TSV, with each line organized as follows:

qid\tquery

The relevance judgments are formatted in standard TREC qrels format, as follows:

qid Q0 docid relevance
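Both formats are simple enough to parse by hand. The sketch below is an illustration under stated assumptions rather than the official loader: the function names are hypothetical, the query ID `q1` and the labels in the sample lines are made up, and the qrels parser assumes whitespace-separated fields as in the usual TREC convention.

```python
from collections import defaultdict
from typing import Iterable


def parse_topics(lines: Iterable[str]) -> dict[str, str]:
    """Parse 'qid<TAB>query' lines into a {qid: query} mapping."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, in case the query text contains tabs.
        qid, query = line.split("\t", 1)
        topics[qid] = query
    return topics


def parse_qrels(lines: Iterable[str]) -> dict[str, dict[str, int]]:
    """Parse 'qid Q0 docid relevance' lines into {qid: {docid: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        qid, _q0, docid, relevance = parts
        qrels[qid][docid] = int(relevance)
    return dict(qrels)


# Made-up sample lines, for illustration only.
topics = parse_topics(["q1\tWhat is albedo?\n"])
qrels = parse_qrels(["q1 Q0 39#0 1\n", "q1 Q0 39#1 0\n"])
print(topics["q1"], qrels["q1"])
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, e.g. `parse_qrels(open(path, encoding="utf-8"))`.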

🙌 Baselines

Reproduce the results with Pyserini:

We have released baselines using BM25, mDPR, and a hybrid of the two, as described in our arXiv paper. The BM25 and mDPR results can be reproduced using Pyserini.

To reproduce our baselines:

Install the development version of Pyserini following these instructions. (To run baselines on the surprise languages, you'll need to rebuild both Anserini and Pyserini.)
Manually place all topics and qrels files under tools/topics-and-qrels. The topics and qrels files can be found under miracl-v1.0-${lang}/topics and miracl-v1.0-${lang}/qrels in the HuggingFace dataset.
```
git clone https://huggingface.co/datasets/miracl/miracl
mv miracl/*/*/* $PYSERINI_PATH/tools/topics-and-qrels/
```
Follow the commands on our 2-click-reproduction (2CR) website.

Checkpoints for dense models:

mDPR (w/o fine-tuning on MIRACL): castorini/mdpr-tied-pft-msmarco
mContriever (w/o fine-tuning on MIRACL): facebook/mcontriever-msmarco
mDPR (fine-tuned on MIRACL): castorini/mdpr-tied-pft-msmarco-ft-miracl-{lang}, where {lang} is the two-letter ISO code (e.g., ar, bn, ...)
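As a small convenience, the naming pattern for the fine-tuned mDPR checkpoints can be expanded programmatically. This is a minimal sketch: the helper name `finetuned_mdpr_checkpoint` is hypothetical, the language codes are taken from the corpus table above, and the code only builds the name string. It does not check that a checkpoint is actually published for a given language.

```python
# Two-letter codes of the 16 known languages, as listed in the corpus table.
KNOWN_LANGS = (
    "ar", "bn", "en", "es", "fa", "fi", "fr", "hi",
    "id", "ja", "ko", "ru", "sw", "te", "th", "zh",
)


def finetuned_mdpr_checkpoint(lang: str) -> str:
    """Build the checkpoint name for mDPR fine-tuned on MIRACL for one language."""
    if lang not in KNOWN_LANGS:
        raise ValueError(f"unknown language code: {lang!r}")
    return f"castorini/mdpr-tied-pft-msmarco-ft-miracl-{lang}"


print(finetuned_mdpr_checkpoint("ar"))
# castorini/mdpr-tied-pft-msmarco-ft-miracl-ar
```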

🙌 Citation

If you find this dataset and repository helpful, please cite MIRACL as follows:

@article{10.1162/tacl_a_00595,
    author = {Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy},
    title = "{MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {11},
    pages = {1114-1131},
    year = {2023},
    month = {09},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00595},
    url = {https://doi.org/10.1162/tacl\_a\_00595},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00595/2157340/tacl\_a\_00595.pdf},
}

🙌 Contact

If you have any questions, feel free to email us (project.miracl [at] gmail.com) or open a GitHub issue in this repository.
