Supplement for the submission to the LoResMT workshop from the NL Processing team

Workshop on Technologies for MT of Low Resource Languages (LoResMT)

The corpora extracts

#	Language	Sentence
1	En	Nieuwmarkt as well as Kalverstraat, Hermitage and Rembrandtplein are within 15 minutes walking.
	Ru	До площади Ниумаркт, торговой улицы Калверштрат, музея Эрмитаж на Амстеле и площади Рембрандта можно дойти менее чем за 15 минут.
2	En	The Danube Delta is a natural reserve where you can go fishing, bird watching, take boat rides or simply unwind surrounded by nature.
	Ru	Дельта Дуная является природным заповедником, в котором можно заняться рыбной ловлей, понаблюдать за птицами, покататься на лодке или просто расслабиться в окружении природы.
3	Lv	Vienīgais pētījums, kurā Latvija tikusi kritizēta, saņemts no starptautisko aizdevēju (viens no tiem Eiropas Komisija) pēcprogrammas uzraudzības misijas.
	En	The only survey that had Latvia criticized came from the post-supervisory mission of international creditors.
4	Lv	Es nezinu visas detaļas, bet par tām, kurām zinu, man šķiet, ka visam jābeidzas labi.
	En	I don’t know all the details, but facts I do know make me believe everything will be alright.
5	En	Free wired internet, an LCD TV and an en suite bathroom are included in this air-conditioned room.
	Ko	이 객실은 무료 유선 인터넷, LCD TV, 실내 욕실과 에어컨을 갖추고 있습니다.
6	En	Air-conditioned room with heating and features a hairdryer and towels.
	Ko	에어컨과 난방 시설이 완비된 이 객실에는 헤어드라이어와 수건이 비치되어 있습니다.

Technical setup

This setup requires Ubuntu 16.04 LTS.

Docker configuration

Docker CE must be installed on the server that trains the models. Installation instructions can be found in the official Docker documentation.

NVIDIA driver setup

Install the NVIDIA drivers as described in the instructions.

NVIDIA-Docker runtime

The NVIDIA-Docker runtime must be installed to run the Docker images; the commands in this document request it with `--runtime=nvidia`.

Translation evaluation

To evaluate a translation, run the scorer script on the reference file and the translation output:

./scorer/scorer.py data/En-Ko/ref_output.txt data/En-Ko/translation_output.txt
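
Before scoring, it can help to confirm that the two files line up. The following is a minimal sketch that assumes both files hold one sentence per line in the same order; the helper names `count_lines` and `check_line_alignment` are hypothetical and not part of the scorer.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_line_alignment(ref_path, hyp_path):
    """Return the shared sentence count, or raise if the files differ in length."""
    n_ref = count_lines(ref_path)
    n_hyp = count_lines(hyp_path)
    if n_ref != n_hyp:
        raise ValueError(
            f"{ref_path} has {n_ref} lines, but {hyp_path} has {n_hyp}"
        )
    return n_ref
```

Calling `check_line_alignment("data/En-Ko/ref_output.txt", "data/En-Ko/translation_output.txt")` before the scorer would catch a truncated output file early.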

Run training

docker run \
    -e INPUT="/data/input.txt" \
    -e OUTPUT="/output/output.txt" \
    -v <input_file>:/data/input.txt:ro \
    -v <corpus1.txt>:/data/corpus1.txt:ro \
    -v <corpus2.txt>:/data/corpus2.txt:ro \
    -v <parallel_corpus.txt>:/data/parallel_corpus.txt:ro \
    -v <output_dir>:/output:rw \
    -v <workspace>:/workspace_<runid> \
    --workdir /workspace_<runid> \
    --network=none \
    --memory=8g \
    --cpuset-cpus=0-4 \
    --pids-limit=530 \
    --runtime=nvidia \
    --ipc=host \
    <image> <entry_point>
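
The command above has many placeholders, so it may be convenient to assemble it programmatically. This is a rough sketch only: the function `build_docker_command` and its parameter names are hypothetical, and it uses no flags beyond those already shown in the command.

```python
import shlex


def build_docker_command(image, entry_point, input_file, corpus1, corpus2,
                         parallel_corpus, output_dir, workspace, run_id):
    """Assemble the `docker run` argument list shown above (hypothetical helper)."""
    workdir = f"/workspace_{run_id}"
    # (host path, container path, access mode); None keeps Docker's default mode.
    mounts = [
        (input_file, "/data/input.txt", "ro"),
        (corpus1, "/data/corpus1.txt", "ro"),
        (corpus2, "/data/corpus2.txt", "ro"),
        (parallel_corpus, "/data/parallel_corpus.txt", "ro"),
        (output_dir, "/output", "rw"),
        (workspace, workdir, None),
    ]
    cmd = ["docker", "run",
           "-e", "INPUT=/data/input.txt",
           "-e", "OUTPUT=/output/output.txt"]
    for host, container, mode in mounts:
        spec = f"{host}:{container}" + (f":{mode}" if mode else "")
        cmd += ["-v", spec]
    # Resource limits and runtime flags copied from the command above.
    cmd += [
        "--workdir", workdir,
        "--network=none",
        "--memory=8g",
        "--cpuset-cpus=0-4",
        "--pids-limit=530",
        "--runtime=nvidia",
        "--ipc=host",
        image, entry_point,
    ]
    return cmd


# Example with made-up paths; print the command instead of running it.
example = build_docker_command(
    image="<image>", entry_point="<entry_point>",
    input_file="/tmp/input.txt", corpus1="/tmp/corpus1.txt",
    corpus2="/tmp/corpus2.txt", parallel_corpus="/tmp/parallel_corpus.txt",
    output_dir="/tmp/output", workspace="/tmp/workspace", run_id="1",
)
print(shlex.join(example))
```

Returning a list rather than a single string means the result can be passed to `subprocess.run` without shell quoting issues.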

Results

Supervised model description

The supervised model is the default OpenNMT-py encoder-decoder architecture. N_SRC and N_TGT denote the vocabulary sizes of the source and target languages, respectively.

NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(N_SRC, 300, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(300, 400, num_layers=3, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(N_TGT, 300, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(700, 400)
        (1): LSTMCell(400, 400)
        (2): LSTMCell(400, 400)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=400, out_features=400)
      (linear_out): Linear(in_features=800, out_features=400)
      (sm): Softmax()
      (tanh): Tanh()
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=400, out_features=N_TGT)
    (1): LogSoftmax()
  )
)
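
Several layer sizes in the printout follow from one another. The sketch below re-derives them under the assumption that the decoder uses input feeding (as the name InputFeedRNNDecoder suggests), meaning the previous attention output is concatenated with the current target embedding. The variable names are illustrative only.

```python
# Sizes read from the model printout above.
EMB_DIM = 300     # Embedding(N_SRC, 300) and Embedding(N_TGT, 300)
RNN_SIZE = 400    # LSTM(300, 400, num_layers=3)
NUM_LAYERS = 3

# Input feeding: the first decoder cell sees [embedding ; previous attention output].
first_cell_input = EMB_DIM + RNN_SIZE
# Deeper cells consume the hidden state of the layer below.
decoder_cell_inputs = [first_cell_input] + [RNN_SIZE] * (NUM_LAYERS - 1)
# The attention output layer sees [context vector ; decoder hidden state].
attn_linear_out_inputs = RNN_SIZE + RNN_SIZE

print(decoder_cell_inputs)      # [700, 400, 400]
print(attn_linear_out_inputs)   # 800
```

These values match LSTMCell(700, 400), the two LSTMCell(400, 400) layers, and Linear(in_features=800, out_features=400) in the printout.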

Supervised model trained on a 50K parallel corpus

docker run -it \
    -v /LoResMT/Lv-En/test_parallel_corpus.txt:/data/parallel_corpus.txt:ro \
    -v /LoResMT/Lv-En/test_input.txt:/data/input.txt:ro \
    -v /LoResMT/output:/output:rw \
    --memory=8g \
    --runtime=nvidia \
    kwakinalabs/deephack-finals-v1

Pair	Score	Description
En-Ru	0.02123	IN -> OUT
En-Ru	0.10783	`supervised-gpu` - GPU, 3 RNN layers, RNN size 400, 1 epoch
En-Ru	0.25747	`supervised-gpu` - GPU, 3 RNN layers, RNN size 400, 5 epochs
En-Ru	0.28915	`supervised-gpu` - GPU, 3 RNN layers, RNN size 400, 10 epochs
Lv-En	0.02075	IN -> OUT
Lv-En	0.01142	`supervised-gpu` - GPU, 3 RNN layers, RNN size 400, 1 epoch
Lv-En	0.04766	`supervised-gpu` - GPU, 3 RNN layers, RNN size 400, 5 epochs
Lv-En	0.05756	`supervised-gpu` - GPU, 3 RNN layers, RNN size 400, 10 epochs
En-Ko	0.02759	IN -> OUT
En-Ko	0.11179	`supervised-gpu` - GPU, 3 RNN layers, RNN size 400, 1 epoch
En-Ko	0.22945	`supervised-gpu` - GPU, 3 RNN layers, RNN size 400, 5 epochs
En-Ko	0.25418	`supervised-gpu` - GPU, 3 RNN layers, RNN size 400, 10 epochs

UNMT, as described in "Unsupervised Machine Translation Using Monolingual Corpora Only"

docker run -it \
    -v /LoResMT/Lv-En/test_parallel_corpus.txt:/data/parallel_corpus.txt:ro \
    -v /LoResMT/Lv-En/test_input.txt:/data/input.txt:ro \
    -v /LoResMT/Lv-En/test_corpus1.txt:/data/corpus1.txt:ro \
    -v /LoResMT/Lv-En/test_corpus2.txt:/data/corpus2.txt:ro \
    -v /LoResMT/output:/output:rw \
    --memory=8g \
    --runtime=nvidia \
    kwakinalabs/deephack-finals-v2

UNMT model description

An encoder-decoder architecture with attention. The encoder and decoder embedding tables both have N_BIDI entries because UNMT uses a single shared dictionary. N_BIDI combines N_SRC and N_TGT, the vocabulary sizes of the source and target languages, respectively.

UNMT(
  (encoder): EncoderRNN(
    (embedding): Embedding(N_BIDI, 300)
    (rnn): LSTM(300, 50, num_layers=3, dropout=0.1, bidirectional=True)
  )
  (decoder): AttnDecoderRNN(
    (embedding): Embedding(N_BIDI, 300)
    (attn): Attn(
      (attn): Linear(in_features=100, out_features=100)
      (sm): Softmax()
      (out): Linear(in_features=200, out_features=100)
      (tanh): Tanh()
    )
    (rnn): LSTM(400, 100, num_layers=3, dropout=0.1)
  )
  (generator): Generator(
    (out): Linear(in_features=100, out_features=N_BIDI)
    (sm): LogSoftmax()
  )
)
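
To illustrate the shared dictionary, here is a minimal sketch of how two monolingual vocabularies might be merged into one index space. Whether identical tokens in both languages share one entry (as here) or are kept separate is an assumption; the function `build_shared_vocab` and the toy tokens are hypothetical. The second half re-derives the layer widths in the printout, assuming the attention context is concatenated with the decoder embedding.

```python
def build_shared_vocab(src_tokens, tgt_tokens):
    """Map every token of either language to one index in a shared dictionary.

    Assumption: a token that occurs in both languages gets a single entry.
    """
    vocab = {}
    for token in list(src_tokens) + list(tgt_tokens):
        vocab.setdefault(token, len(vocab))
    return vocab


shared = build_shared_vocab(["the", "room", "tv"], ["istaba", "tv"])
n_bidi = len(shared)  # number of rows in the shared embedding table

# Dimension bookkeeping for the printout above.
EMB_DIM = 300
ENC_HIDDEN = 50
DEC_HIDDEN = 100
enc_output = 2 * ENC_HIDDEN              # bidirectional: forward + backward states
dec_rnn_input = EMB_DIM + enc_output     # consistent with LSTM(400, 100)
attn_out_input = enc_output + DEC_HIDDEN # consistent with Linear(in_features=200, ...)

print(n_bidi, enc_output, dec_rnn_input, attn_out_input)
```

With merged entries, N_BIDI is at most N_SRC + N_TGT and smaller whenever the two vocabularies overlap.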

UNMT discriminator description

The discriminator takes a vector of the latent space size LS_SIZE as input and predicts which language the encoded sentence came from. The encoder is trained adversarially to fool it, so that the latent representations of the two languages become hard to tell apart.

(discriminator): Discriminator(
  (layers): ModuleList(
    (0): Linear(in_features=LS_SIZE, out_features=1024)
    (1): Linear(in_features=1024, out_features=1024)
    (2): Linear(in_features=1024, out_features=1024)
  )
  (out): Linear(in_features=1024, out_features=1)
)
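
The adversarial objective can be sketched in a few lines. This is not taken from the repository code: it assumes a two-language setup in which the single output unit is a logit for "language 1", and it follows the general adversarial scheme of the referenced paper. All function names are hypothetical.

```python
import math


def sigmoid(logit):
    """Squash a logit into a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


def discriminator_loss(logit, lang):
    """Binary cross-entropy for the discriminator; lang (0 or 1) is the true language."""
    p_lang1 = sigmoid(logit)
    p_true = p_lang1 if lang == 1 else 1.0 - p_lang1
    return -math.log(p_true)


def encoder_adversarial_loss(logit, lang):
    """Encoder objective: reward the discriminator for predicting the other language."""
    return discriminator_loss(logit, 1 - lang)


# A confident, correct discriminator: low loss for it, high loss for the encoder.
print(discriminator_loss(2.0, lang=1))
print(encoder_adversarial_loss(2.0, lang=1))
```

The two losses pull in opposite directions on the same logit, which is what drives the encoder toward language-independent latent vectors.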
