Training CoVe embeddings from scratch using a biLSTM with attention, built on the TensorFlow Neural Machine Translation API.

EsterHlav/Contextualized-Word-Vectors-CoVe-Learned-in-Translation

Contextualized Word Vectors (CoVe)

Training from Scratch in Tensorflow

Objective

Replicate the training of CoVe embeddings in TensorFlow (the official implementation is in PyTorch).

Running the project

Download the data and preprocess the raw files into tokens (e.g. .xml -> .tok):

This will take a while, as it downloads and preprocesses MT-M and MT-L. It also downloads the GloVe and Kazuma character embeddings.

sh ./download_and_preprocess_to_tokens.sh

The 3 datasets are:

  • MT-S: WMT'16 Multimodal Translation: Multi30k (de-en) - A corpus of 30,000 sentence pairs that briefly describe Flickr images (generally referred to as Multi30K).

  • MT-M: IWSLT'16 (de-en) - A corpus of 209,772 sentence pairs from transcribed TED presentations that cover a wide variety of topics.

  • MT-L: WMT'17 (de-en) - A corpus of 7 million sentence pairs that comes from web crawl data, a news and commentary corpus, European Parliament proceedings, and European Union press releases.

Two input embeddings are used, one at the word level (GloVe) and one at the character level (Kazuma character embeddings):

  • GloVe: word embedding of dimension 300.
  • Kazuma char-emb: character-level embedding (up to 4-grams) of dimension 100.
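
To illustrate how the two embeddings combine, here is a minimal NumPy sketch. The lookup tables are hypothetical stand-ins for the real pretrained files, and the concatenation shown is one common way to join word-level and character-level vectors, not necessarily the project's exact loading code:

```python
import numpy as np

# Hypothetical lookup tables standing in for the pretrained embedding files.
glove = {"hello": np.random.rand(300)}    # GloVe: 300-d word vectors
char_emb = {"hello": np.random.rand(100)} # Kazuma: 100-d character n-gram vectors

def embed(word):
    """Concatenate word-level and character-level vectors (300 + 100 = 400 dims)."""
    return np.concatenate([glove[word], char_emb[word]])

vec = embed("hello")
assert vec.shape == (400,)
```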
Training of MT-LSTM

The MT-LSTM is trained as a two-layer bidirectional LSTM encoder of an attentional sequence-to-sequence model on a machine translation task; the training code can be found in CoVe_training_MT_S.ipynb, CoVe_training_MT_M.ipynb, and CoVe_training_MT_L.ipynb.
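
As a shape-level sketch of the encoder (the repo itself uses the TensorFlow NMT API, not Keras; the hidden size of 300 per direction, giving 600-d CoVe vectors, follows the original paper and is an assumption here):

```python
import numpy as np
import tensorflow as tf

HIDDEN = 300  # hidden units per direction; 2 * 300 = 600-d output vectors

# Two stacked bidirectional LSTM layers, returning one vector per token.
encoder = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(HIDDEN, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(HIDDEN, return_sequences=True)),
])

# A batch of 2 sentences, 7 tokens each, with 300-d GloVe inputs.
glove_inputs = np.random.rand(2, 7, 300).astype("float32")
cove = encoder(glove_inputs)  # per-token contextualized vectors
assert tuple(cove.shape) == (2, 7, 2 * HIDDEN)
```

After training on translation, the encoder's per-token outputs are what the paper calls CoVe vectors.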

Running the code

Each notebook (CoVe_training_MT_S.ipynb, CoVe_training_MT_M.ipynb, and CoVe_training_MT_L.ipynb, for CoVe-S, CoVe-M, and CoVe-L respectively) preprocesses the data, builds and trains an MT-LSTM model, evaluates translation quality on the validation and test sets, and finally shows how to compute a CoVe embedding.
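
For downstream tasks, the CoVe paper feeds models the concatenation of each token's GloVe and CoVe vectors. A shape-level sketch with placeholder random vectors (the real ones come from the pretrained GloVe file and the trained encoder):

```python
import numpy as np

# Hypothetical per-token vectors for a 7-token sentence.
glove_seq = np.random.rand(7, 300)  # GloVe: 300-d per token
cove_seq = np.random.rand(7, 600)   # CoVe: 600-d per token (biLSTM encoder output)

# Downstream models consume the concatenation [GloVe(w); CoVe(w)].
inputs = np.concatenate([glove_seq, cove_seq], axis=-1)
assert inputs.shape == (7, 900)
```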

Warning: this should be run on AWS/GCP with a GPU.
  • Downloading the data can take a long time, on average 30 minutes.
  • Running one epoch on a recent MacBook Pro:
    • CoVe-S takes on average 1 min
    • CoVe-M takes on average 5 min
    • CoVe-L takes on average 30 min
