Machine-Translation Evaluation: Comparing Traditional and Neural Machine-Translation Evaluation Metrics for English→Russian
Machine translation (MT) has become increasingly popular in recent years due to advances in technology and growing globalization. As the quality of MT continues to improve, more and more companies are turning to it over human translation to save time and money. However, the increasing reliance on MT has also highlighted the need for automatic evaluation algorithms that can accurately measure its quality. Developing such algorithms is essential for ensuring that MT can effectively meet the needs of businesses and individuals in the global marketplace, as well as for comparing different MT systems against each other and tracking their improvement over time. MT evaluation metrics are an indispensable component of these automatic evaluation algorithms.
This repository is part of the thesis project for the Master's Degree in "Linguistics: Text Mining" at the Vrije Universiteit Amsterdam (2022-2023). The project focuses on replicating selected research conducted at the WMT21 Metrics Shared Task. The replication involves evaluating the traditional metrics (SacreBLEU, TER, CHRF2) alongside the best-performing reference-based (BLEURT-20, COMET-MQM_2021) and reference-free (COMET-QE-MQM_2021) neural metrics. The evaluation is conducted across two domains: news articles and TED talks translated from English into Russian. By examining the performance of these metrics, we aim to understand their effectiveness and suitability in different translation contexts. Furthermore, the thesis project goes beyond the initial evaluation and explores the applicability of reference-free neural metrics, with a particular focus on COMET-QE-MQM_2021, for professional human translators. This extended evaluation is performed on a distinct domain, namely scientific articles, translated in the same direction (English→Russian) as the primary data.
Creator: Natalia Khaidanova
Supervisor: Sophie Arnoult
The Data folder contains:
- all_TED_data.tsv stores all source sentences, reference translations, and MTs presented at the WMT21 Metrics Task for the TED talks domain.
- all_news_data.tsv stores all source sentences, reference translations, and MTs presented at the WMT21 Metrics Task for the news domain (see the loading sketch after this list).
- create_data_files.py creates the all_TED_data.tsv and all_news_data.tsv files and converts the WMT21 Metrics Task human judgments per type (MQM, raw DA, and z-normalized DA) and domain (news and TED talks) into .tsv files. The files are stored in human_judgments_seg (segment-level human judgments) and human_judgments_sys (system-level human judgments).
- WMT21-data stores source sentences (sources), reference translations (references), MTs (system-outputs), and human judgment scores (evaluation) for each domain (news and TED talks).
- newstest2021 contains segment-level scores for each implemented neural metric (BLEURT-20, COMET-MQM_2021, and COMET-QE-MQM_2021) and all utilized traditional metrics (traditional metrics), as well as system-level scores for each of these metrics. Domain: news.
- tedtalks contains segment-level scores for each implemented neural metric (BLEURT-20, COMET-MQM_2021, and COMET-QE-MQM_2021) and all utilized traditional metrics (traditional metrics), as well as system-level scores for each of these metrics. Domain: TED talks.
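As a quick illustration, the per-domain .tsv files can be inspected with pandas. This is a minimal sketch only; the column names used below are assumptions for illustration and may not match the actual file headers.

```python
import pandas as pd

# Load one per-domain data file (tab-separated).
data = pd.read_csv("Data/all_news_data.tsv", sep="\t")

# Inspect the layout; the column name "system" is an illustrative
# assumption, not guaranteed to match the real header.
print(data.columns.tolist())
print(len(data), "rows covering", data["system"].nunique(), "MT systems")
```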
The eval folder contains:
- get_nr_annotations.py checks the number of annotated segments in the WMT21 Metrics Task data per type of human judgment (MQM, raw DA, or z-normalized DA).
- seg_eval.py runs a segment-level evaluation of the implemented neural (BLEURT-20, COMET-MQM_2021, and COMET-QE-MQM_2021) and traditional (SacreBLEU, TER, and CHRF2) metrics (a minimal correlation sketch follows this list).
- sys_eval.py runs a system-level evaluation of the implemented neural (BLEURT-20, COMET-MQM_2021, and COMET-QE-MQM_2021) and traditional (SacreBLEU, TER, and CHRF2) metrics.
- human_judgments_seg stores segment-level human judgment scores of each type (MQM, raw DA, or z-normalized DA) in separate .tsv files. The scores are presented for both news and TED talks.
- human_judgments_sys stores system-level human judgment scores of each type (MQM, raw DA, or z-normalized DA) in separate .tsv files. The scores are presented for both news and TED talks.
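To illustrate what the two evaluation scripts compute, the sketch below correlates metric scores with human judgments using scipy: Kendall's tau at the segment level and Pearson's r at the system level. The function names and the toy scores are placeholders, and the exact correlation variants used in seg_eval.py and sys_eval.py may differ (WMT21 also reports Kendall-tau-like statistics of its own).

```python
from scipy.stats import kendalltau, pearsonr

def segment_level_correlation(metric_scores, human_scores):
    """Kendall's tau between per-segment metric scores and human judgments."""
    tau, _ = kendalltau(metric_scores, human_scores)
    return tau

def system_level_correlation(metric_scores, human_scores):
    """Pearson's r between per-system metric scores and human judgments."""
    r, _ = pearsonr(metric_scores, human_scores)
    return r

# Toy example with made-up scores for three segments and three systems.
print(segment_level_correlation([0.71, 0.64, 0.80], [-2.0, -5.0, -1.0]))
print(system_level_correlation([0.70, 0.66, 0.74], [-3.1, -4.0, -2.2]))
```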
The metrics folder contains:
- BLEURT-20.py computes segment-level scores for the reference-based neural metric BLEURT-20 on the WMT21 Metrics Task data and calculates the metric's runtime per MT system. The resulting segment-level scores and runtime are stored in BLEURT-20 (newstest2021) and BLEURT-20 (tedtalks) (a scoring sketch for the neural metrics follows this list).
- COMET-MQM_2021.py computes segment-level scores for the reference-based neural metric COMET-MQM_2021 on the WMT21 Metrics Task data and calculates the metric's runtime per MT system. The resulting segment-level scores and runtime are stored in COMET-MQM_2021 (newstest2021) and COMET-MQM_2021 (tedtalks).
- COMET-QE-MQM_2021.py computes segment-level scores for the reference-free neural metric COMET-QE-MQM_2021 on the WMT21 Metrics Task data and calculates the metric's runtime per MT system. The resulting segment-level scores and runtime are stored in COMET-QE-MQM_2021 (newstest2021) and COMET-QE-MQM_2021 (tedtalks).
- get_sys_scores.py computes system-level scores for the neural metrics (BLEURT-20, COMET-MQM_2021, and COMET-QE-MQM_2021) on the WMT21 Metrics Task data. The system-level scores are obtained by averaging the metrics' segment-level scores. The resulting system-level scores are stored in sys (newstest2021) and sys (tedtalks).
- traditional_metrics_seg.py computes segment-level scores for the traditional metrics (SacreBLEU, TER, and CHRF2) on the WMT21 Metrics Task data. The resulting segment-level scores are stored in traditional_metrics (newstest2021) and traditional_metrics (tedtalks) (a sacrebleu-based sketch follows this list).
- traditional_metrics_sys.py computes system-level scores for the traditional metrics (SacreBLEU, TER, and CHRF2) on the WMT21 Metrics Task data. The resulting system-level scores are stored in sys (newstest2021) and sys (tedtalks).
- traditional_metrics_runtime.py calculates the traditional metrics' (SacreBLEU, TER, and CHRF2) runtimes for segment-level evaluation. The runtimes are determined for all MT systems per domain (news and TED talks). The metrics' runtimes are stored in traditional_metrics (newstest2021) and traditional_metrics (tedtalks).
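For the traditional metrics, a minimal sketch of segment- and system-level scoring (plus runtime measurement) with the sacrebleu Python API is shown below; the sentences are placeholders, and the actual scripts may batch and store the scores differently.

```python
import time
from sacrebleu.metrics import BLEU, CHRF, TER

hypothesis = "Кошка сидела на коврике."   # placeholder MT output
reference = "Кошка сидела на коврике."    # placeholder reference translation

# CHRF defaults to beta=2, i.e. CHRF2; effective_order avoids zero n-gram
# counts when BLEU is applied to single segments.
metrics = {"SacreBLEU": BLEU(effective_order=True), "TER": TER(), "CHRF2": CHRF()}

for name, metric in metrics.items():
    start = time.perf_counter()
    seg_score = metric.sentence_score(hypothesis, [reference])    # segment level
    sys_score = metric.corpus_score([hypothesis], [[reference]])  # system level
    runtime = time.perf_counter() - start
    print(f"{name}: segment={seg_score.score:.2f} system={sys_score.score:.2f} time={runtime:.4f}s")
```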
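For the neural metrics, the sketch below assumes the google-research bleurt package with a locally unpacked BLEURT-20 checkpoint and an unbabel-comet version that still ships the wmt21-comet-mqm checkpoint; model identifiers, checkpoint paths, and return types vary across library versions, so treat this as an outline rather than the exact implementation.

```python
from bleurt import score as bleurt_score
from comet import download_model, load_from_checkpoint

srcs = ["The cat sat on the mat."]      # placeholder source sentence
mts  = ["Кошка сидела на коврике."]     # placeholder MT output
refs = ["Кошка сидела на коврике."]     # placeholder reference translation

# BLEURT-20: reference-based scoring from a local checkpoint directory.
bleurt = bleurt_score.BleurtScorer("BLEURT-20")  # path to the unpacked checkpoint
bleurt_seg = bleurt.score(references=refs, candidates=mts)

# System-level scores are the average of the segment-level scores (cf. get_sys_scores.py).
bleurt_sys = sum(bleurt_seg) / len(bleurt_seg)
print(bleurt_seg, bleurt_sys)

# COMET-MQM_2021: reference-based scoring; the model name assumes unbabel-comet 1.x.
comet = load_from_checkpoint(download_model("wmt21-comet-mqm"))
comet_out = comet.predict(
    [{"src": s, "mt": m, "ref": r} for s, m, r in zip(srcs, mts, refs)],
    batch_size=8, gpus=0,
)
print(comet_out)  # segment-level scores and a system-level score; exact shape depends on the version
```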
The reference-free_eval folder contains:
- COMET-QE-MQM_2021.py computes segment- and system-level scores of the reference-free neural metric COMET-QE-MQM_2021 on the additional data comprising two scientific articles (Baby K and A Beautiful Mind). The metric evaluates both human and machine translations. Note that the source sentences and their human translations were added to the files manually.
- add_opus_mt_translations.py adds MTs produced by the opus-mt-en-ru MT system to the data comprising two scientific articles (Baby K and A Beautiful Mind); a translation and scoring sketch follows this list.
- get_mean_length.py counts the mean character length of the source sentences and their human translations in the Baby K and A Beautiful Mind articles.
- Data contains two scientific articles (Baby K and A Beautiful Mind), each comprising English source sentences, their corresponding Russian human translations and MTs produced by the opus-mt-en-ru MT system. The files were created with the aim of evaluating the applicability of reference-free neural metrics, specifically COMET-QE-MQM_2021, for professional human translators. The subfolder also stores the segment- and system-level scores produced by COMET-QE-MQM_2021 for both human and machine translations.
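As a rough outline of the reference-free pipeline, the sketch below translates English sentences with the Helsinki-NLP/opus-mt-en-ru model via transformers and scores them without a reference using COMET-QE-MQM_2021 (the "ref" field is simply omitted). The model identifiers assume unbabel-comet 1.x and the transformers MarianMT classes; the actual scripts may differ in batching and file handling.

```python
from transformers import MarianMTModel, MarianTokenizer
from comet import download_model, load_from_checkpoint

# Translate English source sentences into Russian with opus-mt-en-ru.
model_name = "Helsinki-NLP/opus-mt-en-ru"
tokenizer = MarianTokenizer.from_pretrained(model_name)
mt_model = MarianMTModel.from_pretrained(model_name)

sources = ["The results of the experiment were inconclusive."]  # placeholder sentence
batch = tokenizer(sources, return_tensors="pt", padding=True)
translations = tokenizer.batch_decode(mt_model.generate(**batch), skip_special_tokens=True)

# Score the translations without a reference (quality estimation).
qe_model = load_from_checkpoint(download_model("wmt21-comet-qe-mqm"))
qe_out = qe_model.predict(
    [{"src": s, "mt": t} for s, t in zip(sources, translations)],
    batch_size=8, gpus=0,
)
print(translations, qe_out)
```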
The requirements.txt file contains information about the packages and models required to run and evaluate the implemented traditional (SacreBLEU, TER, and CHRF2) and neural (BLEURT-20, COMET-MQM_2021, and COMET-QE-MQM_2021) metrics. It also lists additional packages needed to run all the .py files in the repository.
The Natalia_Khaidanova_Thesis.pdf file contains the thesis report outlining the results of the research.
- Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021. Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774, Online. Association for Computational Linguistics.