This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

We release a dataset based on Wikipedia sentences and the corresponding translations in 6 different languages, along with scores (on a scale of 1 to 100) generated through human evaluations that represent the quality of the translations. Paper title: Unsupervised Quality Estimation for Neural Machine Translation.

facebookresearch/mlqe

MultiLingual Quality Estimation (MLQE) Dataset

This repository contains data for the 2020 Quality Estimation Shared Task:
https://www.statmt.org/wmt20/quality-estimation-task.html

Training and development data

Check the 'data' folder
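
As a rough illustration of working with the human quality scores, the sketch below reads a tab-separated file with a header row and averages one score column. The file layout and the column names (`original`, `translation`, `mean`) are assumptions made for this example and are not confirmed by this README; inspect the files in the 'data' folder and adjust `score_column` accordingly.

```python
import csv
import io
from statistics import mean


def read_scores(handle, score_column="mean"):
    """Return one column of a tab-separated file (with header row) as floats.

    The default column name "mean" is a hypothetical placeholder.
    """
    # QUOTE_NONE keeps quotation marks inside sentences from being
    # interpreted as CSV field delimiters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [float(row[score_column]) for row in reader]


# Tiny in-memory sample standing in for a real file (made-up rows).
sample = (
    "original\ttranslation\tmean\n"
    "Hello.\tHallo.\t87.5\n"
    "Good day.\tGuten Tag.\t62.5\n"
)

scores = read_scores(io.StringIO(sample))
average = mean(scores)
print(f"{len(scores)} segments, average score {average:.1f}")
```

To read an actual file, pass an open file handle instead of the in-memory sample, for example `open(path, encoding="utf-8", newline="")`, where `path` points to whichever file you extracted.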

NMT models

Check the 'nmt-models' folder

Parallel data used to train the NMT models

The corpora used for each language pair are listed below; see 'https://www.statmt.org/wmt20/quality-estimation-task.html' for details.

German-English

Europarl v9
ParaCrawl v3
Common Crawl corpus
News Commentary v14
Wiki Titles v1
Document-split Rapid corpus

Chinese-English

News Commentary v14
Wiki Titles v1
UN Parallel Corpus V1.0
CWMT Corpus (casia2015, datum2015, datum2017, NEU)

Romanian-English

SETIMES2
Europarl v8

Estonian-English

Europarl v8
Rapid corpus of EU press releases

Sinhala-English

Flores Iterative Back Translation

Nepali-English

Flores Iterative Back Translation

Citation

If you use this data in your work, please cite:

@article{tacl2020,
    title = {Unsupervised Quality Estimation for Neural Machine Translation},
    author = {Fomicheva, Marina and Sun, Shuo and Yankovskaya, Lisa and Blain, Frédéric and Guzmán, Francisco and Fishel, Mark and Aletras, Nikolaos and Chaudhary, Vishrav and Specia, Lucia},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {8},
    pages = {539-555},
    year = {2020}
}

Changelog

  • 2020-03-15: Adding details about training data for NMT models
  • 2020-03-19: Releasing dataset

License

The dataset is licensed under CC-BY-SA, see the LICENSE file for details.
