Training CoVe embeddings from scratch using a biLSTM with attention, built on the TensorFlow Neural Machine Translation API.

EsterHlav/Contextualized-Word-Vectors-CoVe-Learned-in-Translation

Contextualized Word Vectors (CoVe)

Training from Scratch in Tensorflow

Objective

Replicate the training of CoVe embeddings in TensorFlow (the official implementation is in PyTorch).

Running the project

Download the data and preprocess the raw files into tokens (e.g. .xml -> .tok):

This will take a while, as it downloads and preprocesses MT-M and MT-L. It also downloads the GloVe and Kazuma character embeddings.

sh ./download_and_preprocess_to_tokens.sh

The 3 datasets are:

  • MT-S: WMT'16 Multimodal Translation: Multi30k (de-en) - A corpus of 30,000 sentence pairs that briefly describe Flickr images (generally referred to as Multi30K).

  • MT-M: IWSLT'16 (de-en) - A corpus of 209,772 sentence pairs from transcribed TED presentations that cover a wide variety of topics.

  • MT-L: WMT'17 (de-en) - A corpus of 7 million sentence pairs that comes from web crawl data, a news and commentary corpus, European Parliament proceedings, and European Union press releases.

Two input embeddings are used, one at the word level (GloVe) and one at the character level (Kazuma character embeddings):

  • GloVe: word embedding of dimension 300.
  • Kazuma char-emb: character-level embedding (up to 4-grams) of dimension 100.
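
To illustrate how the two embeddings combine, here is a minimal NumPy sketch. The lookup tables are hypothetical stand-ins for the real pretrained files, and the concatenation shown is one common way to join word-level and character-level vectors, not necessarily the project's exact loading code:

```python
import numpy as np

# Hypothetical lookup tables standing in for the pretrained embedding files.
glove = {"hello": np.random.rand(300)}    # GloVe: 300-d word vectors
char_emb = {"hello": np.random.rand(100)} # Kazuma: 100-d character n-gram vectors

def embed(word):
    """Concatenate word-level and character-level vectors (300 + 100 = 400 dims)."""
    return np.concatenate([glove[word], char_emb[word]])

vec = embed("hello")
assert vec.shape == (400,)
```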
Training of MT-LSTM

The MT-LSTM is trained as a two-layer bidirectional LSTM encoder of an attentional sequence-to-sequence model on a machine translation task; the training code can be found in CoVe_training_MT_S.ipynb, CoVe_training_MT_M.ipynb, and CoVe_training_MT_L.ipynb.
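
As a shape-level sketch of the encoder (the repo itself uses the TensorFlow NMT API, not Keras; the hidden size of 300 per direction, giving 600-d CoVe vectors, follows the original paper and is an assumption here):

```python
import numpy as np
import tensorflow as tf

HIDDEN = 300  # hidden units per direction; 2 * 300 = 600-d output vectors

# Two stacked bidirectional LSTM layers, returning one vector per token.
encoder = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(HIDDEN, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(HIDDEN, return_sequences=True)),
])

# A batch of 2 sentences, 7 tokens each, with 300-d GloVe inputs.
glove_inputs = np.random.rand(2, 7, 300).astype("float32")
cove = encoder(glove_inputs)  # per-token contextualized vectors
assert tuple(cove.shape) == (2, 7, 2 * HIDDEN)
```

After training on translation, the encoder's per-token outputs are what the paper calls CoVe vectors.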

Running the code

Each notebook (CoVe_training_MT_S.ipynb, CoVe_training_MT_M.ipynb, and CoVe_training_MT_L.ipynb, for CoVe-S, CoVe-M, and CoVe-L respectively) preprocesses the data, builds and trains an MT-LSTM model, evaluates translation quality on the validation and test sets, and finally shows how to compute a CoVe embedding.
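
For downstream tasks, the CoVe paper feeds models the concatenation of each token's GloVe and CoVe vectors. A shape-level sketch with placeholder random vectors (the real ones come from the pretrained GloVe file and the trained encoder):

```python
import numpy as np

# Hypothetical per-token vectors for a 7-token sentence.
glove_seq = np.random.rand(7, 300)  # GloVe: 300-d per token
cove_seq = np.random.rand(7, 600)   # CoVe: 600-d per token (biLSTM encoder output)

# Downstream models consume the concatenation [GloVe(w); CoVe(w)].
inputs = np.concatenate([glove_seq, cove_seq], axis=-1)
assert inputs.shape == (7, 900)
```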

Warning: this should be run on AWS/GCP with a GPU.
  • Downloading the data can take a long time, on average 30 minutes.
  • Running one epoch on a recent MacBook Pro:
    • CoVe-S takes on average 1 min
    • CoVe-M takes on average 5 min
    • CoVe-L takes on average 30 min
