Workshop on Technologies for MT of Low Resource Languages (LoResMT)
# | Lang | Sentence
---|---|---
1 | En | Nieuwmarkt as well as Kalverstraat, Hermitage and Rembrandtplein are within 15 minutes walking.
 | Ru | До площади Ниумаркт, торговой улицы Калверштрат, музея Эрмитаж на Амстеле и площади Рембрандта можно дойти менее чем за 15 минут.
2 | En | The Danube Delta is a natural reserve where you can go fishing, bird watching, take boat rides or simply unwind surrounded by nature.
 | Ru | Дельта Дуная является природным заповедником, в котором можно заняться рыбной ловлей, понаблюдать за птицами, покататься на лодке или просто расслабиться в окружении природы.
3 | Lv | Vienīgais pētījums, kurā Latvija tikusi kritizēta, saņemts no starptautisko aizdevēju (viens no tiem Eiropas Komisija) pēcprogrammas uzraudzības misijas.
 | En | The only survey that had Latvia criticized came from the post-supervisory mission of international creditors.
4 | Lv | Es nezinu visas detaļas, bet par tām, kurām zinu, man šķiet, ka visam jābeidzas labi.
 | En | I don’t know all the details, but facts I do know make me believe everything will be alright.
5 | En | Free wired internet, an LCD TV and an en suite bathroom are included in this air-conditioned room.
 | Ko | 이 객실은 무료 유선 인터넷, LCD TV, 실내 욕실과 에어컨을 갖추고 있습니다.
6 | En | Air-conditioned room with heating and features a hairdryer and towels.
 | Ko | 에어컨과 난방 시설이 완비된 이 객실에는 헤어드라이어와 수건이 비치되어 있습니다.
Ubuntu 16.04 LTS is required to run this setup.
Docker CE must be installed on the server that runs model training; installation instructions can be found in the official documentation.
NVIDIA drivers must be installed as described in the official instructions.
The NVIDIA container runtime (nvidia-docker) must be installed in order to run the dockerized GPU images.
To run the evaluation, execute the scorer script:
./scorer/scorer.py data/En-Ko/ref_output.txt data/En-Ko/translation_output.txt
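The internals of scorer.py are not shown here. As a purely hypothetical illustration with the same command-line interface, a corpus-BLEU scorer could look like the sketch below (it assumes NLTK; the real script may use a different metric or tokenization):

```python
#!/usr/bin/env python
"""Hypothetical stand-in for scorer.py: corpus BLEU over whitespace tokens.
Same CLI: scorer.py <reference_file> <translation_file>."""
import sys
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def score(ref_path, hyp_path):
    with open(ref_path, encoding="utf-8") as f:
        refs = [[line.split()] for line in f]   # one reference per segment
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.split() for line in f]
    # Smoothing keeps sparse n-gram overlaps from collapsing the score to 0.
    return corpus_bleu(refs, hyps,
                       smoothing_function=SmoothingFunction().method1)

if __name__ == "__main__":
    print("%.5f" % score(sys.argv[1], sys.argv[2]))
```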
Submissions are executed with the following docker run template. The input file, the two monolingual corpora and the parallel corpus are mounted read-only and the output directory read-write; the container gets no network access and runs on the NVIDIA runtime under memory, CPU and process limits:

docker run \
-e INPUT="/data/input.txt" \
-e OUTPUT="/output/output.txt" \
-v <input_file>:/data/input.txt:ro \
-v <corpus1.txt>:/data/corpus1.txt:ro \
-v <corpus2.txt>:/data/corpus2.txt:ro \
-v <parallel_corpus.txt>:/data/parallel_corpus.txt:ro \
-v <output_dir>:/output:rw \
-v <workspace>:/workspace_<runid> \
--workdir /workspace_<runid> \
--network=none \
--memory=8g \
--cpuset-cpus=0-4 \
--pids-limit=530 \
--runtime=nvidia \
--ipc=host \
<image> <entry_point>
A default OpenNMT-py encoder-decoder architecture is used, with the N_SRC and N_TGT parameters specifying the vocabulary sizes of the source and target languages respectively:
NMTModel(
(encoder): RNNEncoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(N_SRC, 300, padding_idx=1)
)
)
)
(rnn): LSTM(300, 400, num_layers=3, dropout=0.3)
)
(decoder): InputFeedRNNDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(N_TGT, 300, padding_idx=1)
)
)
)
(dropout): Dropout(p=0.3)
(rnn): StackedLSTM(
(dropout): Dropout(p=0.3)
(layers): ModuleList(
(0): LSTMCell(700, 400)
(1): LSTMCell(400, 400)
(2): LSTMCell(400, 400)
)
)
(attn): GlobalAttention(
(linear_in): Linear(in_features=400, out_features=400)
(linear_out): Linear(in_features=800, out_features=400)
(sm): Softmax()
(tanh): Tanh()
)
)
(generator): Sequential(
(0): Linear(in_features=400, out_features=N_TGT)
(1): LogSoftmax()
)
)
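A plain-PyTorch sketch of the same stack is shown below. It is an illustration only, not the submission's actual code: OpenNMT-py's real RNNEncoder and InputFeedRNNDecoder also handle masking, state bridging and beam search, and the concrete sizes here are hypothetical stand-ins for N_SRC and N_TGT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SRC, N_TGT, EMB, HID = 10000, 12000, 300, 400  # placeholder sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_SRC, EMB, padding_idx=1)
        self.rnn = nn.LSTM(EMB, HID, num_layers=3, dropout=0.3)

    def forward(self, src):                      # src: (src_len, batch)
        memory, state = self.rnn(self.emb(src))  # memory: (src_len, batch, HID)
        return memory, state

class DecoderStep(nn.Module):
    """One input-feeding decoder step with Luong 'general' attention."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_TGT, EMB, padding_idx=1)
        self.dropout = nn.Dropout(0.3)
        # The first cell sees [token embedding; previous attentional state],
        # hence the 300 + 400 = 700 input size in the printed StackedLSTM.
        self.cells = nn.ModuleList([nn.LSTMCell(EMB + HID, HID),
                                    nn.LSTMCell(HID, HID),
                                    nn.LSTMCell(HID, HID)])
        self.linear_in = nn.Linear(HID, HID)       # scores h_t against memory
        self.linear_out = nn.Linear(2 * HID, HID)  # mixes [context; h_t]
        self.generator = nn.Linear(HID, N_TGT)

    def forward(self, tok, input_feed, hidden, memory):
        x = torch.cat([self.emb(tok), input_feed], dim=-1)
        new_hidden = []
        for i, (cell, hc) in enumerate(zip(self.cells, hidden)):
            h, c = cell(x, hc)
            new_hidden.append((h, c))
            x = self.dropout(h) if i < len(self.cells) - 1 else h
        mem = memory.transpose(0, 1)                            # (B, len, HID)
        scores = torch.bmm(mem, self.linear_in(x).unsqueeze(2)).squeeze(2)
        align = F.softmax(scores, dim=-1)                       # (B, len)
        context = torch.bmm(align.unsqueeze(1), mem).squeeze(1)
        attn_h = torch.tanh(self.linear_out(torch.cat([context, x], dim=-1)))
        # attn_h is fed back as input_feed at the next step.
        return F.log_softmax(self.generator(attn_h), dim=-1), attn_h, new_hidden

# Smoke test: encode a 7-token batch of 4 and take one decoding step.
enc, dec = Encoder(), DecoderStep()
memory, _ = enc(torch.randint(2, N_SRC, (7, 4)))
hidden = [(torch.zeros(4, HID), torch.zeros(4, HID)) for _ in range(3)]
log_probs, feed, hidden = dec(torch.randint(2, N_TGT, (4,)),
                              torch.zeros(4, HID), hidden, memory)
```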
The supervised model (image kwakinalabs/deephack-finals-v1) is run as follows:

docker run -it \
-v /LoResMT/Lv-En/test_parallel_corpus.txt:/data/parallel_corpus.txt:ro \
-v /LoResMT/Lv-En/test_input.txt:/data/input.txt:ro \
-v /LoResMT/output:/output:rw \
--memory=8g \
--runtime=nvidia \
kwakinalabs/deephack-finals-v1
Scores for each language pair are listed below. The IN -> OUT rows are the trivial baseline that copies the input to the output unchanged; the remaining rows are the supervised model trained for an increasing number of epochs.

Pair | Score | Description
---|---|---
En-Ru | 0.02123 | IN -> OUT |
En-Ru | 0.10783 | supervised-gpu - GPU, 3 RNN layers, RNN size 400, 1 epoch |
En-Ru | 0.25747 | supervised-gpu - GPU, 3 RNN layers, RNN size 400, 5 epochs |
En-Ru | 0.28915 | supervised-gpu - GPU, 3 RNN layers, RNN size 400, 10 epochs |
Lv-En | 0.02075 | IN -> OUT |
Lv-En | 0.01142 | supervised-gpu - GPU, 3 RNN layers, RNN size 400, 1 epoch |
Lv-En | 0.04766 | supervised-gpu - GPU, 3 RNN layers, RNN size 400, 5 epochs |
Lv-En | 0.05756 | supervised-gpu - GPU, 3 RNN layers, RNN size 400, 10 epochs |
En-Ko | 0.02759 | IN -> OUT |
En-Ko | 0.11179 | supervised-gpu - GPU, 3 RNN layers, RNN size 400, 1 epoch |
En-Ko | 0.22945 | supervised-gpu - GPU, 3 RNN layers, RNN size 400, 5 epochs |
En-Ko | 0.25418 | supervised-gpu - GPU, 3 RNN layers, RNN size 400, 10 epochs |
UNMT, as described in Unsupervised Machine Translation Using Monolingual Corpora Only (Lample et al., 2018), additionally requires the two monolingual corpora to be mounted:
docker run -it \
-v /LoResMT/Lv-En/test_parallel_corpus.txt:/data/parallel_corpus.txt:ro \
-v /LoResMT/Lv-En/test_input.txt:/data/input.txt:ro \
-v /LoResMT/Lv-En/test_corpus1.txt:/data/corpus1.txt:ro \
-v /LoResMT/Lv-En/test_corpus2.txt:/data/corpus2.txt:ro \
-v /LoResMT/output:/output:rw \
--memory=8g \
--runtime=nvidia \
kwakinalabs/deephack-finals-v2
The UNMT model is an encoder-decoder architecture with attention in which the encoder and the decoder share a single embedding table of N_BIDI entries, because UNMT uses one common dictionary for both languages. N_BIDI combines the N_SRC and N_TGT vocabularies, which are the source and target vocabulary sizes respectively:
UNMT(
(encoder): EncoderRNN(
(embedding): Embedding(N_BIDI, 300)
(rnn): LSTM(300, 50, num_layers=3, dropout=0.1, bidirectional=True)
)
(decoder): AttnDecoderRNN(
(embedding): Embedding(N_BIDI, 300)
(attn): Attn(
(attn): Linear(in_features=100, out_features=100)
(sm): Softmax()
(out): Linear(in_features=200, out_features=100)
(tanh): Tanh()
)
(rnn): LSTM(400, 100, num_layers=3, dropout=0.1)
)
(generator): Generator(
(out): Linear(in_features=100, out_features=N_BIDI)
(sm): LogSoftmax()
)
)
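The shared dictionary can be illustrated with a small, hypothetical snippet. How exactly the vocabularies are merged is not shown above: a union that merges tokens common to both languages is one option, a plain concatenation with N_BIDI = N_SRC + N_TGT is another.

```python
import torch.nn as nn

# Toy vocabularies; real ones come from the two monolingual corpora.
src_vocab = ["the", "delta", "danube"]
tgt_vocab = ["дельта", "дуная", "the"]

# One shared dictionary for both languages ("the" is merged here).
joint_vocab = sorted(set(src_vocab) | set(tgt_vocab))
N_BIDI = len(joint_vocab)

# The same table serves both EncoderRNN and AttnDecoderRNN above.
shared_embedding = nn.Embedding(N_BIDI, 300)
```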
The discriminator takes the latent space size LS_SIZE as its input dimension and tries to predict which language a given encoder output came from; the encoder is trained adversarially to fool it, pushing both languages into a shared latent space.
(discriminator): Discriminator(
(layers): ModuleList(
(0): Linear(in_features=LS_SIZE, out_features=1024)
(1): Linear(in_features=1024, out_features=1024)
(2): Linear(in_features=1024, out_features=1024)
)
(out): Linear(in_features=1024, out_features=1)
)
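A minimal sketch of this adversarial objective follows. The activation (LeakyReLU) and the binary language logit follow Lample et al. (2018) and are assumptions: the printed dump above does not show them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LS_SIZE = 100  # hypothetical latent size, matching the encoder output dim

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(LS_SIZE, 1024),
                                     nn.Linear(1024, 1024),
                                     nn.Linear(1024, 1024)])
        self.out = nn.Linear(1024, 1)

    def forward(self, z):                    # z: (batch, LS_SIZE) encoder states
        for layer in self.layers:
            z = F.leaky_relu(layer(z), 0.2)  # activation assumed, not printed
        return self.out(z)                   # logit: which language produced z

disc = Discriminator()
z = torch.randn(8, LS_SIZE)                  # stand-in for encoder outputs
lang = torch.randint(0, 2, (8, 1)).float()   # 0 = language 1, 1 = language 2
# The discriminator learns to tell the two languages apart ...
d_loss = F.binary_cross_entropy_with_logits(disc(z), lang)
# ... while the encoder is trained with flipped labels to fool it.
enc_adv_loss = F.binary_cross_entropy_with_logits(disc(z), 1.0 - lang)
```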