This repository contains MayanV, a collection of parallel corpora between several Mayan languages and Spanish. MayanV is introduced in the paper "Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars", accepted at the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024).
MayanV includes curated parallel corpora for Spanish and the following Mayan languages spoken in Guatemala and Southern Mexico:
ISO Code | Language | Words (Mayan) | Words (Spanish) | Sentences |
---|---|---|---|---|
acr | Achi | 6,994 | 7,657 | 1,343 |
agu | Awakatec | 7,325 | 9,700 | 1,930 |
cac | Chuj | 9,398 | 10,916 | 2,299 |
itz | Itza’ | 6,069 | 7,512 | 1,539 |
ixl | Ixil | 10,888 | 12,137 | 2,325 |
kek | Q’eqchi’ | 18,529 | 21,835 | 4,133 |
kjb | Q’anjob’al | 18,035 | 18,238 | 3,014 |
mam | Mam | 15,453 | 19,117 | 3,093 |
poc | Poqomam | 18,039 | 21,744 | 3,583 |
poh | Poqomchi’ | 6,479 | 7,149 | 1,787 |
quc | K’iche’ | 14,468 | 15,474 | 2,632 |
qum | Sipakapense | 9,780 | 9,328 | 1,356 |
ttc | Tektitek | 23,571 | 24,896 | 4,022 |
tzh | Tzeltal | 103,309 | 128,659 | 19,846 |
tzj | Tz’utujil | 12,283 | 11,404 | 2,519 |
Sources for each corpus are discussed in the article. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language.
Each language corpus is organized into its respective folder within the repository. Additionally, each language folder contains its own README file providing details about the resources used to create the corpus. Language folders: Achi, Awakatec, Chuj, Itza’, Ixil, Q’eqchi’, Q’anjob’al, Mam, Poqomam, Poqomchi’, K’iche’, Sipakapense, Tektitek, Tzeltal, Tz’utujil.
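As a minimal sketch of how a corpus from one of these folders might be read, the Python helper below pairs up two line-aligned text files. It assumes that each corpus is stored as one file of Mayan sentences and one file of Spanish sentences with matching line order. That layout, the function name `load_parallel`, and any file names you pass to it are assumptions made for illustration, so check the README in each language folder for the actual file layout.

```python
from pathlib import Path


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(Path(path), encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(mayan_path, spanish_path):
    """Pair line i of the Mayan file with line i of the Spanish file.

    Assumes (hypothetically) that both files are line-aligned; raises
    ValueError if the line counts differ, since the pairs would then be
    misaligned.
    """
    mayan = read_lines(mayan_path)
    spanish = read_lines(spanish_path)
    if len(mayan) != len(spanish):
        raise ValueError(
            f"Line count mismatch: {len(mayan)} Mayan vs {len(spanish)} Spanish"
        )
    return list(zip(mayan, spanish))
```

Calling `load_parallel` with the two file paths of a language folder would return a list of `(mayan_sentence, spanish_sentence)` tuples. If the files are line-aligned as assumed, the length of that list should match the sentence count given for that language in the table above.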
MayanV has been produced as part of the R+D+i project "Lightweight neural translation technologies for low-resource languages" (LiLowLa) (PID2021-127999NB-I00), funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033) and the European Regional Development Fund "A way to make Europe".
These data are released under this licensing scheme:
- We do not own any of the text from which these data have been extracted.
- We license the actual packaging of these parallel data under the Creative Commons CC0 license ("no rights reserved").
If you use this dataset as part of your developments, please cite it as follows:
@inproceedings{lou-etal-2024-curated,
title = "Curated Datasets and Neural Models for Machine Translation of Informal Registers between {M}ayan and {S}panish Vernaculars",
author = "Lou, Andr{\'e}s and
P{\'e}rez-Ortiz, Juan Antonio and
S{\'a}nchez-Mart{\'\i}nez, Felipe and
S{\'a}nchez-Cartagena, V{\'\i}ctor",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.156",
pages = "2838--2850",
}
A CITATION.cff file is also included in this repository.