WLAC is a new shared task at WMT 2022. This year, the shared task involves two language pairs, German-English (De-En) and Chinese-English (Zh-En), each in both directions. If you have any questions, please join the mailing list. If you want to participate in this shared task, please sign up here. For any further information, please contact Lemao Liu.
The test datasets for all four directions are in test-data.
The results must be sent to [email protected] before 23:59 on July 7th, 2022 (Anywhere-on-earth).
The evaluation results for submissions are available at evaluation-results.
Release of training data: April 20th, 2022
Release of test data: 23:59 on July 1st, 2022 (Anywhere-on-earth)
Result submission deadline: 23:59 on July 7th, 2022 (Anywhere-on-earth)
System paper submission deadline: September 7th, 2022
Paper notification: October 9th, 2022
Camera-ready version due: October 16th, 2022
- Download the datasets for De-En and Zh-En (see the details in the next section).
ATTENTION!!
The training data contains up to 10M sentence pairs; the files in data/train-sample are samples of it.
Participants must use only the bilingual data provided here.
Note that pretrained language models such as BERT are allowed, as is additional monolingual data.
- Download the scripts in the directory scripts/ to preprocess the data.
- Run the scripts to obtain the simulated training data for the WLAC task from the bilingual data.
For De-En, the bilingual data is from WMT 14 and preprocessed by the Stanford NLP Group: train.de and train.en.
For Zh-En, the bilingual data is the "UN Parallel Corpus V1.0" from WMT 17. To obtain the data, follow these three steps:
- Download the two files UNv1.0.en-zh.tar.gz.00 and UNv1.0.en-zh.tar.gz.01. You may also find both files on the webpage.
- Run the following command to combine the two files and decompress them:
cat UNv1.0.en-zh.tar.gz.* | tar -xzf -
- en-zh/UNv1.0.en-zh.en and en-zh/UNv1.0.en-zh.zh are the source and target files. Note that both files should be preprocessed (word segmentation for zh and tokenization for en) with scripts/tokenizer.perl (from the Moses project) and scripts/word_seg.py as follows:
pip3 install jieba
perl scripts/tokenizer.perl -l en < UNv1.0.en-zh.en > UNv1.0.en-zh.tok.en
python3 scripts/word_seg.py UNv1.0.en-zh.zh > UNv1.0.en-zh.tok.zh
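For reference, here is a minimal sketch of what a jieba-based segmenter like scripts/word_seg.py might do; this is an assumption about the provided script, which may differ in detail. It reads the raw Chinese file given as the first argument and prints space-separated tokens, one sentence per line:

import sys
import jieba

# Hypothetical sketch; the provided scripts/word_seg.py may differ.
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        # Segment the sentence and join tokens with spaces,
        # mirroring the tokenized English side.
        print(" ".join(jieba.cut(line.strip())))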
Bilingual data cannot be used to train WLAC models directly. Instead, one can obtain training data (as well as development data) for WLAC from bilingual data via simulation, following reference [1] (see Section 3.2 of that paper). For example, this can be done by running the following command for the zh->en subtask:
pip3 install pypinyin tqdm
python3 scripts/generate_samples.py --source-lang zh --target-lang en --file-prefix UNv1.0.en-zh.tok
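For intuition only, below is a simplified Python sketch of the kind of simulation described in [1] (Section 3.2): pick a target word from the target sentence, keep context words according to a sampled context type, and type a prefix of the target. All function and variable names are illustrative; the real scripts/generate_samples.py differs (e.g., it draws pinyin typed sequences for Chinese target words).

import random

CONTEXT_TYPES = ["zero_context", "prefix", "suffix", "bi_context"]

def simulate_sample(src_tokens, tgt_tokens):
    # Pick the target word to be completed.
    i = random.randrange(len(tgt_tokens))
    target = tgt_tokens[i]
    # Sample a context type and keep the corresponding context words.
    ctx = random.choice(CONTEXT_TYPES)
    left = " ".join(tgt_tokens[:i]) if ctx in ("prefix", "bi_context") else ""
    right = " ".join(tgt_tokens[i + 1:]) if ctx in ("suffix", "bi_context") else ""
    # Type a non-empty character prefix of the target word
    # (for a Chinese target the real script would use a pinyin prefix).
    typed = target[:random.randint(1, len(target))]
    return {"src": " ".join(src_tokens), "context_type": ctx,
            "left_context": left, "right_context": right,
            "typed_seq": typed, "target": target}

# Example: one simulated sample for the zh->en direction.
print(simulate_sample("安全 理事会 ,".split(), "The Security Council ,".split()))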
After running the command above, UNv1.0.en-zh.tok.samples.json contains the simulated training data for WLAC, whose format is as follows:
{
    "src": "The Security Council ,",
    "context_type": "zero_context",
    "left_context": "",
    "right_context": "",
    "typed_seq": "a",
    "target": "安全"
}
{
    "src": "安全 理事会 ,",
    "context_type": "prefix",
    "left_context": "The Security",
    "right_context": "",
    "typed_seq": "Coun",
    "target": "Council"
}
where "typed_seq" denotes the sequence typed for the target word: "a" is a prefix of "anquan", the pinyin of the Chinese word "安全", and "Coun" is a prefix of the English (or German) target word "Council". "context_type" indicates the position of the target word relative to left_context and right_context; it takes a value from {"prefix", "zero_context", "suffix", "bi_context"} (see reference [1] for more details).
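To make the pinyin convention concrete, the following snippet (an illustration, not part of the provided scripts) uses pypinyin, installed in the step above, to recover the string whose prefixes serve as typed sequences for a Chinese target word:

from pypinyin import lazy_pinyin

# lazy_pinyin("安全") returns ["an", "quan"]; joined, it gives "anquan",
# so "a" (as in the first sample above) is a valid typed_seq.
pinyin = "".join(lazy_pinyin("安全"))
print(pinyin)                  # anquan
print(pinyin.startswith("a"))  # True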
The simulated dev data can be obtained in the same way as the simulated training data mentioned above.
ATTENTION!! Test data are available now, and the test results should be submitted by July 7.
- [1] Huayang Li, Lemao Liu, Guoping Huang, and Shuming Shi. 2021. GWLAN: General Word-Level AutocompletioN for Computer-Aided Translation. In Proceedings of ACL.