Data and code for replicating LIUM WMT17 News Translation Systems

lium-lst/wmt17-newstask

LIUM WMT17 Systems for News Translation Task

Below you will find the data and nmtpy configurations for LIUM's WMT17 News Translation systems (see paper):

@InProceedings{garciamartinez-EtAl:2017:WMT,
  author    = {Garc\'{i}a-Mart\'{i}nez, Mercedes  and
               Caglayan, Ozan  and  Aransa, Walid
               and  Bardet, Adrien  and  Bougares, Fethi
               and  Barrault, Lo\"{i}c},
  title     = {LIUM Machine Translation Systems for WMT17 News Translation Task},
  booktitle = {Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  pages     = {288--295},
  url       = {https://www.aclweb.org/anthology/W17-4726.pdf}
}

En->Tr Systems

Data

(Note: The Turkish side of the corpora below is tokenized with a slightly modified version of the Moses tokenizer that handles Turkish apostrophes correctly.)

  • Download (13M) our normalized, tokenized and length-filtered version of the officially provided SETIMES2 corpus (~200K sentences).

  • Download the joint BPE model (16K merge operations) trained on the bitext.

  • The exact incremental subsamples of 150K, 700K, 1M and 1.7M sentences (~all news2016) of the parallel back-translation corpora used in the paper. The target (TR) side is sampled from the monolingual Turkish data news.2016.shuffled, and the sentences are translated into EN with a single TR->EN NMT system (~14 BLEU on newstest2016).

  • Ready-to-use BPE-ized subsamples, exactly as used in the paper (cf. Table 3):

    • (System B0) BPE-ized, (only) SETIMES2-200K (~200K total) corpora (14M)
    • (System B1) BPE-ized, (only) BT-1M (~1M total) corpora (58M)
    • (System B2) BPE-ized, SETIMES2-200K+BT-150K (~350K total) corpora (21M)
    • (System B4) BPE-ized, SETIMES2-200K+BT-700K (~900K total) corpora (51M)
    • (System B6) BPE-ized, SETIMES2-200K+BT-1M (~1.2M total) corpora (72M)
    • (System B8) BPE-ized, SETIMES2-200K+BT-1.7M (~1.9M total) corpora (112M)
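As a rough illustration of how mixed corpora like the ones listed above could be assembled, here is a minimal Python sketch. It is not the authors' script: the function names `incremental_subsamples` and `mix` and the toy data are hypothetical, and it assumes that "incremental" means each larger back-translation subsample contains every smaller one (which taking prefixes of an already-shuffled corpus guarantees).

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (EN source, TR target)


def incremental_subsamples(
    pairs: Sequence[Pair], sizes: Iterable[int]
) -> Dict[int, List[Pair]]:
    """Cut nested prefixes out of an already-shuffled parallel corpus.

    Assumption: "incremental" means each larger subsample contains
    every smaller one, which taking prefixes guarantees.
    """
    subsamples = {}
    for n in sorted(sizes):
        if n > len(pairs):
            raise ValueError(f"requested {n} pairs, only {len(pairs)} available")
        subsamples[n] = list(pairs[:n])
    return subsamples


def mix(bitext: Sequence[Pair], back_translated: Sequence[Pair]) -> List[Pair]:
    """Concatenate genuine bitext with synthetic back-translated pairs."""
    return list(bitext) + list(back_translated)


# Toy stand-ins for SETIMES2 (bitext) and the back-translated news data.
bitext = [(f"en-{i}", f"tr-{i}") for i in range(4)]
bt_pool = [(f"bt-en-{i}", f"news-tr-{i}") for i in range(10)]

bt = incremental_subsamples(bt_pool, sizes=[3, 7, 10])
b2_like = mix(bitext, bt[3])   # analogous to SETIMES2-200K+BT-150K
b8_like = mix(bitext, bt[10])  # analogous to SETIMES2-200K+BT-1.7M
```

With the real data, `bt_pool` would hold the back-translated pairs and the sizes would be the 150K, 700K, 1M and 1.7M figures above. The downloadable subsamples remain the authoritative versions for replicating the paper.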
