Code for Learning to Generate Word- and Phrase-Embeddings for Efficient Phrase-Based Neural Machine Translation (Park and Tsvetkov, 2019)
This code is adapted from an earlier version of Sachin Kumar's seq2seq-con code.
Requirements for phrase extraction:
- Python 3.5
- nltk
- fasttext

Requirements for model training:
- PyTorch 0.3.0
- Python 2.7
There are four steps required to run the experiments:
- Preprocessing
- Word alignment extraction
- Phrase embedding extraction
- Training/evaluation
- Tokenization and Truecasing (Using Moses Scripts)
Assuming you have (train.de, train.en, test.de, test.en, valid.de, valid.en) under data/, you can obtain tokenized and truecased files by running the following script:
./scripts/preprocess.sh data de en ./path/to/mosesdecoder
- Word alignments (Using fast_align)
By running the following command, you will get the {train,test,valid}.align files:
./scripts/get_align.sh data de en ./path/to/fast_align
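For reference, fast_align writes one line per sentence pair in Pharaoh format: space-separated "i-j" links, where i indexes a source token and j a target token. The following is a minimal, repository-independent sketch of reading such a file in Python; read_alignments is a hypothetical helper, not part of this codebase.

# Minimal sketch (not part of this repo): read Pharaoh-format alignments,
# where each line holds space-separated "i-j" links between source token i
# and target token j.
def read_alignments(path):
    with open(path) as f:
        for line in f:
            # e.g. "0-0 1-2 2-1" -> [(0, 0), (1, 2), (2, 1)]
            yield [tuple(int(x) for x in link.split("-")) for link in line.split()]

# Example: count how many target tokens each source token aligns to.
for pairs in read_alignments("data/train.align"):
    fertility = {}
    for i, j in pairs:
        fertility[i] = fertility.get(i, 0) + 1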
For the phrase and word embedding tables, you need fastText embeddings trained with the same dimensionality. In our experiments, we first used the parallel corpus and word alignments to extract a phrase list, then used that list to join the words of each phrase in a large monolingual corpus, and finally trained fastText embeddings on the resulting corpus.
You can train your own embeddings with the same method, or download our trained model and extracted embeddings from here.
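If you train your own embeddings, the sketch below illustrates the procedure described above under explicit assumptions: phrases from a hypothetical data/mwe_list.txt (one phrase per line) are joined with an underscore in a monolingual corpus mono.en, and fastText skip-gram embeddings are trained on the result with the official fasttext Python module. The file names, the joiner symbol, and the naive string replacement are illustrative only; the actual phrase handling lives in src/get_phrases.py.

# Illustrative sketch: join extracted phrases in a monolingual corpus, then
# train 300-dimensional fastText embeddings on it. File names and the "_"
# joiner are assumptions, not the repo's exact conventions.
import fasttext

with open("data/mwe_list.txt") as f:
    phrases = [line.strip() for line in f if line.strip()]
# Replace longer phrases first so they are not broken up by shorter ones.
phrases.sort(key=lambda p: len(p.split()), reverse=True)

with open("mono.en") as fin, open("mono.mwe.en", "w") as fout:
    for line in fin:
        for p in phrases:
            line = line.replace(p, p.replace(" ", "_"))
        fout.write(line)

model = fasttext.train_unsupervised("mono.mwe.en", model="skipgram", dim=300)
model.save_model("embs/fasttext.phrase.300.en.bin")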
Once you have the fastText embeddings and model, run the following command to produce the concatenated target text files and the phrase embeddings. Since this step uses the fasttext Python module, it requires Python 3.5+.
python src/get_phrases.py data de en embs/fasttext.phrase.300.en.bin
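As a quick sanity check, you can query the trained fastText model for individual word and phrase vectors. The sketch below assumes phrases are stored as single underscore-joined tokens (as in the earlier sketch); the exact joiner used by src/get_phrases.py may differ.

# Sanity check (illustrative): look up vectors in the fastText .bin model.
import fasttext

model = fasttext.load_model("embs/fasttext.phrase.300.en.bin")
print(model.get_dimension())                 # expected: 300
print(model.get_word_vector("house").shape)  # (300,)
print(model.get_word_vector("new_york").shape)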
Note that for model training and evaluation, we use PyTorch 0.3.0.post4 and Python 2.7.
Creating the preprocessed data object
python src/prepare_data.py -train_src data/train.tok.true.de -train_tgt data/train.tok.true.mwe.en -train_align data/train.mwe.align \
-valid_src data/valid.tok.true.de -valid_tgt data/valid.tok.true.mwe.en -valid_align data/valid.mwe.align -save_data data/deen.pconmt \
-src_vocab_size 50000 -tgt_vocab_size 100000 -tgt_emb embs/fasttext.mwe.word.en.vec -tgt_emb_phrase data/mwe_list.mwe.vec -emb_dim 300 -normalize
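Before training, you can optionally inspect the saved object. A minimal sketch, assuming prepare_data.py writes a plain Python dict with torch.save (the exact keys depend on src/prepare_data.py); run it in the training environment (Python 2.7, PyTorch 0.3.0).

# Optional sanity check (illustrative): peek at the preprocessed data object.
# Assumes it is a dict written with torch.save; keys depend on prepare_data.py.
import torch

data = torch.load("data/deen.pconmt.train.pt")
if isinstance(data, dict):
    for key in data:
        print(key)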
Training a model
python src/train.py -data data/deen.pconmt.train.pt -layers 2 -rnn_size 1024 -word_vec_size 512 -output_emb_size 300 -brnn -loss nllvmf -optim adam -dropout 0.0 -learning_rate 0.0005 -log_interval 500 -save_model models/deen -batch_size 16 -tie_emb -gpus 0 -pre_ep 7 -fert_ep 10 -epochs 17 -fert_mode emh -uni_ep 0 -fert_dim 4
Evaluating a model without fertility prediction
python src/translate.py -loss nllvmf -gpu 0 -replace_unk -model models/deen_bestmodel_pre.pt -src data/test.tok.true.de -tgt data/test.tok.true.en -output deen.out -batch_size 512 -beam_size 1
Evaluating a model with fertility prediction
python src/translate_fert.py -loss nllvmf -gpu 0 -replace_unk -model models/deen_bestmodel_fert.pt -src data/test.tok.true.de -tgt data/test.tok.true.en -output deen.fert.out -batch_size 512 -beam_size 1
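To score either output file, one option (not part of this repo's scripts) is nltk's corpus BLEU, since nltk is already a listed dependency; note that whitespace-tokenized BLEU will not exactly match Moses multi-bleu or sacreBLEU.

# Optional, illustrative BLEU check with nltk (works the same for deen.fert.out).
from nltk.translate.bleu_score import corpus_bleu

with open("deen.out") as f:
    hyps = [line.split() for line in f]
with open("data/test.tok.true.en") as f:
    refs = [[line.split()] for line in f]

print(corpus_bleu(refs, hyps))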