Word-level AutoCompletion (WLAC)

This is a new shared task, WLAC, in WMT 2022. This year, the task involves two language pairs, German-English (De-En) and Chinese-English (Zh-En), each in both directions. If you have any questions, please join the mailing list. If you want to participate in this shared task, please sign up here. For any further information, please contact Lemao Liu.

Test datasets are available

The test datasets for all four directions are in test-data.

Result Submission

The results must be sent by email to [email protected] before 23:59 on July 7th, 2022 (Anywhere on Earth).

Evaluation results for submissions

The evaluation results for submissions are available at evaluation-results.

Important Dates

Release of training data: April 20th, 2022
Release of test data: 23:59 on July 1st, 2022 (Anywhere on Earth)
Result submission deadline: 23:59 on July 7th, 2022 (Anywhere on Earth)
System paper submission deadline: September 7th, 2022
Paper notification: October 9th, 2022
Camera-ready version due: October 16th, 2022

Key Steps

  • Download the datasets for De-En and Zh-En (see the details in the next section).
    ATTENTION!!
    The training data contains up to 10M sentence pairs; the files in data/train-sample are samples of it.
    Participants must use only the bilingual data provided here.

    Note that pretrained language models such as BERT are allowed, as is additional monolingual data.

  • Download the scripts in the directory scripts/ to preprocess the data.

  • Run the scripts to obtain the simulated training data for the WLAC task from the bilingual data.

Data Preparation

De-En Bilingual Data

The bilingual data is from WMT 14, preprocessed by the Stanford NLP Group: train.de and train.en.

Zh-En Bilingual Data

The bilingual data is the "UN Parallel Corpus V1.0" from WMT 17. To obtain the data, one can follow these steps:

  • Concatenate the downloaded archive parts and extract them:

cat UNv1.0.en-zh.tar.gz.* | tar -xzf -

  • en-zh/UNv1.0.en-zh.en and en-zh/UNv1.0.en-zh.zh are the source and target files. Note that both files should be preprocessed (word segmentation for zh and tokenization for en) by scripts/tokenizer.perl (from the Moses project) and scripts/word_seg.py as follows:

pip3 install jieba
perl scripts/tokenizer.perl -l en < UNv1.0.en-zh.en > UNv1.0.en-zh.tok.en
python3 scripts/word_seg.py UNv1.0.en-zh.zh > UNv1.0.en-zh.tok.zh
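
After preprocessing, it is worth checking that the two tokenized files are still parallel. The following small Python check is an optional suggestion, not part of the official pipeline:

def count_lines(path):
    # Count the lines of a text file without loading it into memory.
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

n_zh = count_lines("UNv1.0.en-zh.tok.zh")
n_en = count_lines("UNv1.0.en-zh.tok.en")
assert n_zh == n_en, f"files are no longer parallel: {n_zh} vs {n_en} lines"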

Preparing the Simulated Training Data for WLAC

Bilingual data cannot be used to train WLAC models directly. Instead, one can obtain training data (as well as development data) for WLAC from bilingual data via simulation, following reference [1] (see Section 3.2 of that paper). For example, this can be done by running the following command for the zh->en subtask:

pip3 install pypinyin tqdm
python3 scripts/generate_samples.py --source-lang zh --target-lang en --file-prefix UNv1.0.en-zh.tok

Then UNv1.0.en-zh.tok.samples.json is the simulated training data for WLAC, whose format is as follows:

{
    "src":"The Security Council ,",
    "context_type":"zero_context",
    "left_context":"",
    "right_context":"",
    "typed_seq":"a",
    "target":"安全"
}
{
    "src":"安全 理事会 ,",
    "context_type":"prefix",
    "left_context":"The Security",
    "right_context":"",
    "typed_seq":"Coun",
    "target":"Council"
}

where "typed_seq" denotes the sequence typed so far for the target word: for Chinese, "a" is a prefix of the pinyin "anquan" of the target word "安全"; for English (or German), "Coun" is a prefix of the target word "Council" itself. "context_type" indicates the position of the target word relative to left_context and right_context, and takes a value from {"prefix", "zero_context", "suffix", "bi_context"} (see reference [1] for more details).
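
As an illustration of how such a typed sequence can be simulated, here is a minimal Python sketch. It is an assumption for illustration only, not the logic of the official scripts/generate_samples.py; it uses the pypinyin package installed above.

import random
from pypinyin import lazy_pinyin  # installed via pip3 install pypinyin

def simulate_typed_seq(target, lang):
    # Build the full typing sequence for the target word: for Chinese the
    # user types pinyin ("安全" -> "anquan"); for English or German the
    # user types the word's characters directly.
    if lang == "zh":
        full = "".join(lazy_pinyin(target))
    else:
        full = target
    # A random non-empty prefix serves as the simulated typed sequence.
    k = random.randint(1, len(full))
    return full[:k]

print(simulate_typed_seq("安全", "zh"))     # e.g. "a" or "anq"
print(simulate_typed_seq("Council", "en"))  # e.g. "Coun"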

Simulated Development Data for WLAC

The simulated dev data can be obtained in the same way as the simulated training data mentioned above.
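
Both the simulated training and development files follow the JSON format shown above. Here is a minimal, hypothetical Python loader; it uses json.JSONDecoder.raw_decode so that it works whether the file stores one JSON object per line or concatenated pretty-printed objects (the exact on-disk layout produced by scripts/generate_samples.py is an assumption here):

import json

def load_samples(path):
    # Stream concatenated JSON objects out of the file, tolerating both
    # one-object-per-line and pretty-printed layouts.
    dec = json.JSONDecoder()
    with open(path, encoding="utf-8") as f:
        buf = f.read()
    samples, i = [], 0
    while i < len(buf):
        while i < len(buf) and buf[i].isspace():
            i += 1  # skip whitespace between objects
        if i >= len(buf):
            break
        obj, i = dec.raw_decode(buf, i)
        samples.append(obj)
    return samples

samples = load_samples("UNv1.0.en-zh.tok.samples.json")
print(samples[0]["context_type"], samples[0]["typed_seq"])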

Test Data for WLAC

ATTENTION!! The test data are available now, and the results should be submitted by July 7th, 2022.

Reference

  • [1] Huayang Li, Lemao Liu, Guoping Huang, and Shuming Shi. 2021. GWLAN: General Word-Level AutocompletioN for Computer-Aided Translation. In Proceedings of ACL.
