chan0park/PCoNMT


Code for Learning to Generate Word- and Phrase-Embeddings for Efficient Phrase-Based Neural Machine Translation (Park and Tsvetkov, 2019)

This code is adapted from an earlier version of Sachin Kumar's seq2seq-con code.


Dependencies

  • Phrase extraction

    • Python 3.5
    • nltk
    • fasttext
  • Model training

    • PyTorch 0.3.0
    • Python 2.7
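A possible setup for the two environments (a sketch only; the package names below are the standard PyPI/conda ones, and the exact PyTorch 0.3.0 install command depends on your platform and CUDA version):

# Phrase-extraction environment (Python 3.5+)
pip install nltk fasttext

# Model-training environment (Python 2.7)
conda install pytorch=0.3.0 -c pytorch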

Running Experiments

There are four steps required to run the experiments:

  1. Preprocessing
  2. Word alignments extraction
  3. Phrase embeddings extraction
  4. Training/evaluation

1. Preprocessing

Assuming you have (train.de, train.en, test.de, test.en, valid.de, valid.en) under data/, you can obtain tokenized and truecased files by running the following script:

./scripts/preprocess.sh data de en ./path/to/mosesdecoder
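The script is expected to amount to the standard Moses tokenization and truecasing pipeline. For reference, a rough sketch of the equivalent manual commands for a single file (the English training side), assuming the usual mosesdecoder layout:

MOSES=./path/to/mosesdecoder
# Tokenize
perl $MOSES/scripts/tokenizer/tokenizer.perl -l en < data/train.en > data/train.tok.en
# Train a truecasing model on the tokenized text, then apply it
perl $MOSES/scripts/recaser/train-truecaser.perl --model data/truecase-model.en --corpus data/train.tok.en
perl $MOSES/scripts/recaser/truecase.perl --model data/truecase-model.en < data/train.tok.en > data/train.tok.true.en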

2. Word Alignments Extraction

By running the following command, you will get the {train,test,valid}.align files:

./scripts/get_align.sh data de en ./path/to/fast_align
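For reference, a minimal sketch of the usual fast_align recipe for the training split (the intermediate file names and the symmetrization heuristic are assumptions, not necessarily what get_align.sh uses):

FAST_ALIGN=./path/to/fast_align/build
# fast_align expects one sentence pair per line in "source ||| target" format
paste data/train.tok.true.de data/train.tok.true.en | awk -F'\t' '{print $1" ||| "$2}' > data/train.de-en
# Forward and reverse alignments
$FAST_ALIGN/fast_align -i data/train.de-en -d -o -v > data/train.fwd.align
$FAST_ALIGN/fast_align -i data/train.de-en -d -o -v -r > data/train.rev.align
# Symmetrize the two directions
$FAST_ALIGN/atools -i data/train.fwd.align -j data/train.rev.align -c grow-diag-final-and > data/train.align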

3. Phrase Embeddings

For the phrase and word embedding tables, you need fasttext embeddings trained with the same dimension. In our experiments, we first used the parallel corpus and the word alignments to extract a phrase list, then used that list to concatenate the corresponding words in a large monolingual corpus, and finally trained fasttext embeddings on the resulting corpus.

You can train your own embeddings using the same method, or download our trained model and extracted embeddings from here.
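If you train your own, the fasttext command-line tool can be run on the phrase-concatenated monolingual corpus; a sketch (the corpus file name and all hyperparameters other than the 300-dimensional vectors are assumptions):

# mono.mwe.en: hypothetical monolingual English corpus in which extracted
# phrases have already been merged into single tokens
./fasttext skipgram -input mono.mwe.en -output embs/fasttext.phrase.300.en -dim 300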

Once you have the fasttext embeddings and the model, run the following command to get the concatenated target text files and the phrase embeddings (the *.mwe.* files referenced in step 4 below). Since the fasttext Python module is used here, Python 3.5+ is required.

python src/get_phrases.py data de en embs/fasttext.phrase.300.en.bin

4. Model Training/evaluation

Note that for model training and evaluation we use PyTorch 0.3.0.post4 and Python 2.7.

Creating the preprocessed data object

python src/prepare_data.py -train_src data/train.tok.true.de -train_tgt data/train.tok.true.mwe.en -train_align data/train.mwe.align \
-valid_src data/valid.tok.true.de -valid_tgt data/valid.tok.true.mwe.en -valid_align data/valid.mwe.align -save_data data/deen.pconmt \
-src_vocab_size 50000 -tgt_vocab_size 100000 -tgt_emb embs/fasttext.mwe.word.en.vec -tgt_emb_phrase data/mwe_list.mwe.vec -emb_dim 300 -normalize

Training a model

python src/train.py -data data/deen.pconmt.train.pt -layers 2 -rnn_size 1024 -word_vec_size 512 -output_emb_size 300 -brnn -loss nllvmf -optim adam -dropout 0.0 -learning_rate 0.0005 -log_interval 500 -save_model models/deen -batch_size 16 -tie_emb -gpus 0 -pre_ep 7 -fert_ep 10 -epochs 17 -fert_mode emh -uni_ep 0 -fert_dim 4

Evaluating a model without the fertility prediction

python src/translate.py -loss nllvmf -gpu 0 -replace_unk -model models/deen_bestmodel_pre.pt -src data/test.tok.true.de -tgt data/test.tok.true.en -output deen.out -batch_size 512 -beam_size 1

Evaluating a model with the fertility prediction

python src/translate_fert.py -loss nllvmf -gpu 0 -replace_unk -model models/deen_bestmodel_fert.pt -src data/test.tok.true.de -tgt data/test.tok.true.en -output deen.fert.out -batch_size 512 -beam_size 1
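Since the references and outputs are tokenized and truecased, the generated translations can be scored with, e.g., the Moses multi-bleu script (a sketch; this is not part of this repository's scripts):

perl ./path/to/mosesdecoder/scripts/generic/multi-bleu.perl data/test.tok.true.en < deen.fert.out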

Pointers for baselines in the paper
