
How to reproduce your results on WMT'14 ENDE datasets of "Attention is All You Need"? #637

Closed
SkyAndCloud opened this issue Mar 26, 2018 · 47 comments

@SkyAndCloud

Hi, I want to reproduce the results of the "Attention Is All You Need" paper on the WMT'14 EN-DE dataset. I have read the OpenNMT-FAQ and would like to know the exact details of your experiments:

  1. Did you use BPE or word pieces?
  2. What is the exact BLEU score of your experiments on the WMT'14 EN-DE dataset?
  3. Are there any other differences between the README's steps and the transformer experiments? If so, please provide a complete tutorial to help us reproduce your results.

Thank you very much! @srush

@srush
Contributor

srush commented Mar 26, 2018

Yes, happy to. We are planning on posting these models and more details.

It would be great if you could reproduce the results.

  1. Preprocessing is the same as here: https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt
  2. 26.7
  3. It should be the same. My command:
python train.py -data /tmp/de2/data -save_model /tmp/extra -gpuid 1 \
        -layers 6 -rnn_size 512 -word_vec_size 512 -batch_type tokens -batch_size 4096 \
        -epochs 50 -max_generator_batches 32 -normalization tokens -dropout 0.1 -accum_count 4 \
        -max_grad_norm 0 -optim adam -encoder_type transformer -decoder_type transformer \
        -position_encoding -param_init 0 -warmup_steps 16000 -learning_rate 2 -param_init_glorot \
        -start_checkpoint_at 5 -decay_method noam -label_smoothing 0.1 -adam_beta2 0.998 -report_every 1000
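
(For reference, the -decay_method noam / -warmup_steps / -learning_rate flags above select the inverse-square-root schedule from the paper. A minimal sketch of the resulting learning rate, assuming -rnn_size is used as the model dimension by the schedule:)

# Sketch of the Noam schedule selected by -decay_method noam.
# lr_scale ~ -learning_rate, d_model ~ -rnn_size, warmup ~ -warmup_steps.
def noam_lr(step, lr_scale=2.0, d_model=512, warmup=16000):
    # Linear warmup, then decay proportional to 1/sqrt(step).
    return lr_scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(noam_lr(1000), noam_lr(16000), noam_lr(100000))  # rises, peaks at warmup, then decays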

@SkyAndCloud
Author

Thanks for your quick reply. I'll try to reproduce this result on OpenNMT-py.

@vince62s
Member

If you want to compare, here is the link of the trained model:
https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
And as @srush says, here is the dataset link:
https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz

When preprocessing, make sure you use a sequence length of 100 and -share_vocab

Cheers.

@Epsilon-Lee

Hi Vincent @vince62s, I wonder how you preprocessed the WMT14 EN-DE corpus? I downloaded the data from the link you provided.

  1. Why does every word begin with a ▁? E.g. in the commoncrawl.de-en.de.sp file: ▁Auf ▁den ▁beide n ▁Projekt seiten ▁( ▁v b B K M ▁und ▁ py B K M ▁) ▁gibt ▁es ▁mehr ▁Details .
  2. What is the meaning of the .sp suffix on each file?
  3. How did you tokenize the English and German text? Are you using Moses' tokenizer?
  4. What is the wmtende.model file?
  5. What is the wmtende.vocab file? Is this the learned joint BPE vocab?
  6. What is the merge number for BPE? Is it 32k?

Thanks very much!

@vince62s
Member

vince62s commented Apr 7, 2018

I used SentencePiece to tokenize the corpus (instead of BPE), which is why there is this separator.
Look at the scripts here: https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt
I wrote these for the onmt-tf version, but the preparation is similar.
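
(For readers with the same questions: wmtende.model is the trained SentencePiece model, wmtende.vocab is its vocabulary, and the *.sp files are the corpora after encoding. A minimal sketch with the sentencepiece Python API; the joint training on both languages and the 32k vocab size are assumptions based on the usual recipe, so check the linked script for the exact options:)

import sentencepiece as spm

# Train one joint (shared EN+DE) model; this produces wmtende.model and
# wmtende.vocab. The input files and the 32k vocab size are assumptions here.
spm.SentencePieceTrainer.Train(
    "--input=train.en,train.de --model_prefix=wmtende "
    "--vocab_size=32000 --character_coverage=1.0"
)

# Encode raw text into pieces; the U+2581 character marks where a space was
# in the original text, which is why most pieces start with it.
sp = spm.SentencePieceProcessor()
sp.Load("wmtende.model")
print(sp.EncodeAsPieces("Auf den beiden Projektseiten gibt es mehr Details."))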

@taoleicn
Contributor

taoleicn commented Apr 9, 2018

Hi @vince62s

I'm getting a bit lower BLEU after following the suggestions in this thread. Could you take a look and see if we are using the same config?

After 21 epochs, my 6-layer transformer model gets 26.02 / 27.21 on the valid and test sets. I guess you got ~26.4 / 27.8?

Here are the commands I used for preprocessing, training, and evaluation:

preprocessing (-share_vocab, sequence length 100):

python preprocess.py -train_src ../wmt-en-de/train.en.shuf \
    -train_tgt ../wmt-en-de/train.de.shuf \
    -valid_src ../wmt-en-de/valid.en -valid_tgt ../wmt-en-de/valid.de \
    -save_data ../wmt-en-de/processed \
    -src_seq_length 100 -tgt_seq_length 100 \
    -max_shard_size 200000000 -share_vocab

training (bs=20k, warmup=16k):

python train.py -gpuid 0 -rnn_size 512 -word_vec_size 512 -batch_type tokens \
    -batch_size 5120 -accum_count 4 -epochs 50 -max_generator_batches 32 \
    -normalization tokens -dropout 0.1 -max_grad_norm 0 -optim adam \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -param_init 0 -warmup_steps 16000 -learning_rate 2 -start_checkpoint_at 10 \
    -decay_method noam -label_smoothing 0.1 -adam_beta2 0.998 \
    -data ../wmt-en-de/processed -param_init_glorot -layers 6 -report_every 1000

translate (alpha=0.6):

python translate.py -gpu 0 -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu

evaluate:

perl tools/multi-bleu-detok.perl ~/wmt-en-de/valid.de.detok < t2048.e21.1.out.detok
perl tools/multi-bleu-detok.perl ~/wmt-en-de/newstest2017-ende-ref.de < t2048.e21.2.out.detok

I'm using the SentencePiece model wmtende.model for de-tokenization.

Thanks!

@vince62s
Member

vince62s commented Apr 9, 2018

I used warmup 8000 and optim sparseadam.
Other settings seem similar, unless I am missing something.
Just one thing though: during my run, the end epoch was 20, and then I continued with train_from.

However, as you can see in the issues/PRs, there is a bug where Adam states are reset during a train_from.

It may have a very slight impact.

I think if you let it run, you will end up with results similar to mine (go up to 40 epochs and average the last 10).
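
(On the "average the last 10" part: OpenNMT-py ships a checkpoint-averaging tool, but the idea is simply an element-wise mean of the saved weights. A minimal illustrative sketch with plain PyTorch; the file names and the 'model'/'generator' checkpoint keys are assumptions, so adapt them to your checkpoints:)

import torch

# Average the last N checkpoints element-wise (file names are hypothetical).
# Assumption: OpenNMT-py checkpoints keep weights under 'model' and 'generator'.
paths = ["extra_e41.pt", "extra_e42.pt", "extra_e43.pt"]  # ... up to the last 10
avg = {"model": {}, "generator": {}}
for p in paths:
    ckpt = torch.load(p, map_location="cpu")
    for key in ("model", "generator"):
        for name, tensor in ckpt[key].items():
            avg[key][name] = avg[key].get(name, 0) + tensor.float() / len(paths)

merged = torch.load(paths[-1], map_location="cpu")  # keep vocab/opt from the last checkpoint
merged.update(avg)
torch.save(merged, "extra_avg.pt")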

@taoleicn
Contributor

taoleicn commented Apr 9, 2018

Thanks @vince62s.
Warmup 8000 and/or sparseadam could be the reason, as I'm getting lower BLEU before epoch 20 as well.

@SkyAndCloud
Author

Thanks @srush @vince62s.
I have a naive question: do we need to tokenize the plain corpus before applying BPE or SentencePiece? I read this script, and it seems that opennmt-tf doesn't do any tokenization before SentencePiece. That makes sense to me, because both BPE and word piece are sub-word-unit methods that view the text as a Unicode sequence and build up from single characters, so they should work well on an untokenized corpus. However, I have also seen tokenization applied before BPE/word-piece, which puzzles me. Could you please help me, or tell me which is better?

@vince62s
Member

Read the SP doc: https://github.com/google/sentencepiece
It's not obvious, but SP does whitespace pre-tokenization by default.
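
(In practice that means you can feed raw, untokenized sentences straight to SentencePiece; a quick check with the Python API, reusing the wmtende.model from the dataset above:)

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("wmtende.model")

# Raw, untokenized input: SentencePiece only splits on whitespace internally
# and learns how to segment punctuation and rare words itself.
print(sp.EncodeAsPieces("There's no need to run a word tokenizer first."))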

@livenletdie

Thanks for the detailed instructions. I was able to train the transformer model to ~4 validation perplexity. However, the translate command errors out for me. Any pointers on what could be going wrong would be helpful.

Command I ran:

python translate.py -model model_acc_70.55_ppl_4.00_e50.pt -src ../../wmt14-ende/test.en -tgt ../../wmt14-ende/test.de -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu

Log:

Loading model parameters.
average src size 25.79294274300932 3004
/home/gavenkatesh/anaconda3/lib/python3.6/site-packages/torchtext/data/field.py:321: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
return Variable(arr, volatile=not train), lengths
/home/gavenkatesh/anaconda3/lib/python3.6/site-packages/torchtext/data/field.py:322: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
return Variable(arr, volatile=not train)
/raid/gavenkatesh/deepLearning/networks/opennmtPyT04/OpenNMT-py/onmt/decoders/transformer.py:266: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
volatile=True)
Traceback (most recent call last):
  File "translate.py", line 32, in <module>
    main(opt)
  File "translate.py", line 21, in main
    attn_debug=opt.attn_debug)
  File "/raid/gavenkatesh/deepLearning/networks/opennmtPyT04/OpenNMT-py/onmt/translate/translator.py", line 176, in translate
    batch_data = self.translate_batch(batch, data)
  File "/raid/gavenkatesh/deepLearning/networks/opennmtPyT04/OpenNMT-py/onmt/translate/translator.py", line 356, in translate_batch
    ret["gold_score"] = self._run_target(batch, data)
  File "/raid/gavenkatesh/deepLearning/networks/opennmtPyT04/OpenNMT-py/onmt/translate/translator.py", line 405, in _run_target
    gold_scores += scores
RuntimeError: expand(torch.FloatTensor{[30, 1]}, size=[30]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)

@livenletdie

Update: When I don't specify -tgt, it runs to completion fine.

@vince62s vince62s closed this as completed Aug 2, 2018
@xuanqing94

xuanqing94 commented Oct 25, 2018

Yes, happy to. We are planning on posting these models and more details

It would be great if you could reproduce.

  1. Preprocessing is the same as here: https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt
  2. 26.7
  3. It should be the same. My command:
python train.py -data /tmp/de2/data -save_model /tmp/extra -gpuid 1 \
        -layers 6 -rnn_size 512 -word_vec_size 512 -batch_type tokens -batch_size 4096 \
        -epochs 50 -max_generator_batches 32 -normalization tokens -dropout 0.1 -accum_count 4 \
        -max_grad_norm 0 -optim adam -encoder_type transformer -decoder_type transformer \
        -position_encoding -param_init 0 -warmup_steps 16000 -learning_rate 2 -param_init_glorot \
        -start_checkpoint_at 5 -decay_method noam -label_smoothing 0.1 -adam_beta2 0.998 -report_every 1000

@srush I think in the original implementation (tensorflow_models and tensor2tensor), they use -share_embeddings as well as -share_decoder_embeddings?

@mjc14

mjc14 commented Nov 4, 2018

@vince62s
Hi, I used your dataset to reproduce your results, but I got a better result, BLEU = 30.31 on the test set. Is something wrong on my side? I use Slurm to control my jobs, so I changed some of your code for distributed training, but I did not change the model structure or the gradient update method.

dataset:
https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz

my commands:
python preprocess.py -train_src ./data/wmt_ende/train.en \
    -train_tgt ./data/wmt_ende/train.de \
    -valid_src ./data/wmt_ende/valid.en -valid_tgt ./data/wmt_ende/valid.de \
    -save_data ./data/wmt_ende/processed \
    -src_seq_length 100 -tgt_seq_length 100 \
    -shard_size 200000000 -share_vocab

python train.py -data ./data/wmt_ende/processed -save_model ./models/wmt_ende \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 8 -gpu_ranks 0 1 2 3

python translate.py -model models/wmt_ende_step_30000.pt -src data/wmt_ende/test.en -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu -report_bleu -tgt data/wmt_ende/test.de
result:
PRED AVG SCORE: -0.1594, PRED PPL: 1.1728
GOLD AVG SCORE: -1.6642, GOLD PPL: 5.2814

BLEU = 30.31, 59.5/36.3/24.8/17.5 (BP=0.975, ratio=0.975, hyp_len=81681, ref_len=83752)

@mjc14

mjc14 commented Nov 4, 2018

you are scoring pieces not words.

I am new to translation. How can I do this? Thanks.

@mjc14

mjc14 commented Nov 4, 2018

you are so nice, thx

@vince62s
Member

vince62s commented Nov 4, 2018

You are scoring pieces, not words.
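
(Concretely: decode the SentencePiece pieces back to plain text first, then score against the raw, detokenized reference, e.g. with tools/multi-bleu-detok.perl as earlier in this thread. A minimal decoding sketch; the file names are placeholders:)

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("wmtende.model")

# Turn piece-level hypotheses back into normal words before computing BLEU.
with open("pred.sp") as fin, open("pred.detok", "w") as fout:
    for line in fin:
        fout.write(sp.DecodePieces(line.split()) + "\n")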

@ZacharyWaseda

ZacharyWaseda commented Dec 27, 2018

@vince62s
Hi, I used your dataset to reproduce your results, but I got a better result, BLEU = 30.31 on the test set. Is something wrong on my side? I use Slurm to control my jobs, so I changed some of your code for distributed training, but I did not change the model structure or the gradient update method.

dataset:
https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz

my commands:
python preprocess.py -train_src ./data/wmt_ende/train.en
-train_tgt ./data/wmt_ende/train.de
-valid_src ./data/wmt_ende/valid.en -valid_tgt ./data/wmt_ende/valid.de
-save_data ./data/wmt_ende/processed
-src_seq_length 100 -tgt_seq_length 100
-shard_size 200000000 -share_vocab

python train.py -data ./data/wmt_ende/processed -save_model ./models/wmt_ende
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8
-encoder_type transformer -decoder_type transformer -position_encoding
-train_steps 200000 -max_generator_batches 2 -dropout 0.1
-batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2
-max_grad_norm 0 -param_init 0 -param_init_glorot
-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000
-world_size 8 -gpu_ranks 0 1 2 3

python translate.py -model models/wmt_ende_step_30000.pt -src data/wmt_ende/test.en -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu -report_bleu -tgt data/wmt_ende/test.de
result:
PRED AVG SCORE: -0.1594, PRED PPL: 1.1728
GOLD AVG SCORE: -1.6642, GOLD PPL: 5.2814

BLEU = 30.31, 59.5/36.3/24.8/17.5 (BP=0.975, ratio=0.975, hyp_len=81681, ref_len=83752)

Hi @mjc14, may I ask what your new BLEU result is after scoring words with your listed params? Thanks a lot!

@ZacharyWaseda

ZacharyWaseda commented Dec 27, 2018

https://github.com/OpenNMT/OpenNMT-tf/blob/master/scripts/wmt/eval_wmt_ende.sh#L35-L41

@vince62s Can I just detokenize the translation result by
line = line.replace(" ", "").replace("_", " ") ?
thx!

@vince62s
Member

No, you need to use spm_decode:
https://github.com/OpenNMT/OpenNMT-tf/blob/master/scripts/wmt/eval_wmt_ende.sh

@ZacharyWaseda

no you need to use spm_decode
https://github.com/OpenNMT/OpenNMT-tf/blob/master/scripts/wmt/eval_wmt_ende.sh

you are so nice, many thx.

@ZacharyWaseda

ZacharyWaseda commented Dec 27, 2018

no you need to use spm_decode
https://github.com/OpenNMT/OpenNMT-tf/blob/master/scripts/wmt/eval_wmt_ende.sh

Hi @vince62s, I tried spm_decode, but it did not solve my problem.
I want to ask how many steps one epoch corresponds to with batch_size = 4096, using your processed training dataset.

I trained the transformer model with the following params, but my BLEU is only around 21. The only difference is that I set train_steps to 200000, while you set epochs to 50. Maybe 1 epoch = 20000 steps, so 200000 steps only equals 10 epochs? I am not sure about this.

python train.py -data ./data/wmt_ende/processed -save_model ./models/wmt_ende \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 1 -gpu_ranks 0

Btw, did you shuffle when you preprocessed the training dataset?
I cannot figure out why my result is so bad. Please help! Thanks...
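
(A rough way to estimate steps per epoch with token batching, for anyone wondering the same thing; the corpus token count below is only a placeholder, measure it on your own tokenized files, e.g. with wc -w:)

# Rough steps-per-epoch estimate when batching by tokens.
corpus_tokens = 200_000_000   # placeholder; count the tokens in your *.sp files
batch_size    = 4096          # -batch_size (tokens per GPU per forward pass)
accum_count   = 4             # -accum_count
world_size    = 1             # number of GPUs in sync training

tokens_per_update = batch_size * accum_count * world_size
steps_per_epoch = corpus_tokens // tokens_per_update
print(steps_per_epoch, "optimizer updates per epoch (approx.)")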

@ZacharyWaseda

Hi @vince62s

I'm getting a bit lower BLEU after following the suggestions in this thread. Could you take a look and see if we are using the same config?

After 21 epochs, my 6-layer transformer model gets 26.02 / 27.21 on valid and test set. and I guess you got ~26.4 / 27.8?

Here are the commands I used for preprocessing, training and evaluation

preprocessing (-share_vocab, sequence length 100):

python preprocess.py -train_src ../wmt-en-de/train.en.shuf 
    -train_tgt ../wmt-en-de/train.de.shuf 
    -valid_src ../wmt-en-de/valid.en -valid_tgt ../wmt-en-de/valid.de 
    -save_data ../wmt-en-de/processed 
    -src_seq_length 100 -tgt_seq_length 100 
    -max_shard_size 200000000 -share_vocab

training (bs=20k, warmup=16k):

python train.py -gpuid 0 -rnn_size 512 -word_vec_size 512 -batch_type tokens 
    -batch_size 5120 -accum_count 4 -epochs 50  -max_generator_batches 32 
    -normalization tokens -dropout 0.1 -max_grad_norm 0 -optim adam 
    -encoder_type transformer -decoder_type transformer -position_encoding 
    -param_init 0 -warmup_steps 16000 -learning_rate 2 -start_checkpoint_at 10 
    -decay_method noam -label_smoothing 0.1 -adam_beta2 0.998 
    -data ../wmt-en-de/processed -param_init_glorot -layers 6 -report_every 1000

translate (alpha=0.6):

python translate.py -gpu 0 -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu

evaluate:

perl tools/multi-bleu-detok.perl ~/wmt-en-de/valid.de.detok < t2048.e21.1.out.detok
perl tools/multi-bleu-detok.perl ~/wmt-en-de/newstest2017-ende-ref.de < t2048.e21.2.out.detok

I'm using the sentence piece model wmtende.model for de-tokenization.

Thanks!

Hi @taoleicn, what does the bs in bs=20k mean? Btw, how many steps do 50 epochs correspond to with batch_size = 5120? Thanks.

@ZacharyWaseda

ZacharyWaseda commented Dec 28, 2018

Hi all, up to now I can only get to BLEU 25.00. Could someone help?
My params are as follows:

python train.py -world_size 1 -gpu_ranks 0 -rnn_size 512 -word_vec_size 512 -batch_type tokens \
    -batch_size 4096 -accum_count 4 -train_steps 70000 -max_generator_batches 32 \
    -normalization tokens -dropout 0.1 -max_grad_norm 0 -optim sparseadam \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -param_init 0 -warmup_steps 8000 -learning_rate 2 -decay_method noam -label_smoothing 0.1 \
    -adam_beta2 0.998 -data ../wmt-en-de/processed -param_init_glorot -layers 6 \
    -transformer_ff 2048 -heads 8

@ZacharyWaseda

ZacharyWaseda commented Dec 29, 2018

Hi @vince62s, I tried your uploaded pre-trained model (https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz) and got BLEU = 28.0.
I trained a Transformer with the same params except epoch = 15, and I ensembled the last 5 checkpoints for evaluation; my BLEU is only 25.69. However, in my case BLEU had already reached 25.00 at epoch = 7.
Do you think I can reach your BLEU value if I train the model for 50 epochs? Maybe it will take me another 2 days lol. Waiting for your kind reply, thanks!

@vince62s
Member

What test set are you scoring?

@ZacharyWaseda

ZacharyWaseda commented Dec 29, 2018

What test set are you scoring ?

@vince62s I use test.en & test.de from your link (https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz) as the test dataset.
I think it should be newstest2017, which has 3004 sentence pairs.

@vince62s
Member

At 70000 steps it should be above 27 already.
I'll see if I can run it.

@ZacharyWaseda

At 70000 steps it should be above 27 already.
I'll see if I can run it.

@vince62s I shuffled the training dataset before running preprocess.py. Besides, I set seq_length to 150.
Do you think that could be the reason?

python preprocess.py -train_src ./data/wmt_ende/train.en.shuf \
    -train_tgt ./data/wmt_ende/train.de.shuf \
    -valid_src ./data/wmt_ende/valid.en -valid_tgt ./data/wmt_ende/valid.de \
    -save_data ./data/wmt_ende/processed \
    -src_seq_length 150 -tgt_seq_length 150 -share_vocab

@ZacharyWaseda

ZacharyWaseda commented Dec 29, 2018

At 70000 steps it should be above 27 already.
I'll see if I can run it.

BUT in #1093, using a seq_length of 150 instead of 100 led to a much better result, which confuses me.
So maybe seq_length is not the reason? I am not sure.

@ZacharyWaseda

At 70000 steps it should be above 27 already.
I'll see if I can run it.

Maybe there are some bugs in the translate process. I observed that most translated sentences end with .., :: or ".", and sometimes several words are repeated 2 or 3 times in one sentence.

@vince62s
Member

Are you on master?

@ZacharyWaseda

are you on master ?

Yes, just not up to date.
I will git pull and train again.
Thanks for your patience and help!!!

@vince62s
Member

I am currently running the following system:
EN-DE with Europarl + CommonCrawl + NewsDiscuss-v13
6 GPUs with accum_count 2, batch_size 4096 tokens => 49152 tokens per true batch
(whereas you have accum_count 4 on 1 GPU).
warmup_steps 6k
All other params are the same.
At 30k steps: NT2017: 27.15, NT2018: 40.18
My 30k steps should be roughly equivalent to 90k steps of your run.
I switched to the onmt-tokenizer, but that does not make any difference compared to SentencePiece.

My translate command line is:
python3 translate.py -gpu 0 -fast -model exp/$model \
    -src $TEST_SRC -output $TEST_HYP -beam_size $bs -batch_size 32

@ZacharyWaseda

ZacharyWaseda commented Jan 2, 2019

I am currently running the following system:
EN-DE with Europarl + CommonCrawl + NewsDiscuss-v13
6 GPUs with accum_count 2, batch_size 4096 tokens => 49152 tokens per true batch
(whereas you have accum_count 4 on 1 GPU).
warmup_steps 6k
All other params are the same.
At 30k steps: NT2017: 27.15, NT2018: 40.18
My 30k steps should be roughly equivalent to 90k steps of your run.
I switched to the onmt-tokenizer, but that does not make any difference compared to SentencePiece.

My translate command line is:
python3 translate.py -gpu 0 -fast -model exp/$model \
    -src $TEST_SRC -output $TEST_HYP -beam_size $bs -batch_size 32

@vince62s Many thanks for verifying the code.

  • I used the link you gave: https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz
    The training dataset contains Europarl-v7 + CommonCrawl + NewsCommentary-v11, and the last one is different from your latest experiment.
  • I use 1 GPU with accum_count 4, batch_size 4096 tokens => 16384 tokens per true batch.
    warmup_steps 8k
  • These are my experiment results:
    At 90k steps: NT2017: 26.16
    At 100k steps: NT2017: 26.36
    At 150k steps: NT2017: 26.03
    At 200k steps: NT2017: 26.79
    At 250k steps: NT2017: 26.71
    At 300k steps: NT2017: 27.44
    At 390k steps: NT2017: 27.49
  • Now our differences only lie in warmup_steps, true batch size, and training corpus. Do you think I should set accum_count to 12 so that my true batch also contains 49152 tokens? In my experience, true batch size greatly affects the result.
  • Btw, when you ran preprocess.py, did you shuffle? What did you set src_seq_length & tgt_seq_length to, 100 or 150?

@vince62s
Member

vince62s commented Jan 2, 2019

NewsC-v11 and v13 are not so different; the impact should be minimal.
I use 100 as the seq_length.
Your results are reasonable. I also struggled a bit when using only one GPU to find the proper learning rate and warmup steps setup.
As a sanity check, you can compare the output of my uploaded model with yours to see if there is any obvious issue.

@ZacharyWaseda

I am currently running the following system:
EN-DE with Europarl + CommonCrawl + NewsDiscuss-v13
6 GPUs with accum_count 2, batch_size 4096 tokens => 49152 tokens per true batch
(whereas you have accum_count 4 on 1 GPU).
warmup_steps 6k
All other params are the same.
At 30k steps: NT2017: 27.15, NT2018: 40.18
My 30k steps should be roughly equivalent to 90k steps of your run.
I switched to the onmt-tokenizer, but that does not make any difference compared to SentencePiece.

My translate command line is:
python3 translate.py -gpu 0 -fast -model exp/$model \
    -src $TEST_SRC -output $TEST_HYP -beam_size $bs -batch_size 32

Hi @vince62s, I am not familiar with the multi-GPU scheduler, but why "6 GPUs with accum_count 2, batch_size 4096 tokens => 49152 tokens per true batch"? I thought the true batch should be 4096 * 2 tokens, as determined by accum_count, and that 6 GPUs only speed up the computation without affecting the true batch size. Is my understanding right? Thanks.

@vince62s
Member

vince62s commented Jan 11, 2019

No, in sync training we send 4096 tokens to each GPU, calculate the gradients, and gather everything before updating the parameters.
When accum=2, we wait for twice the above before updating, hence the true batch is 6 x 4096 x 2.
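
(In other words, a tiny sketch of the arithmetic:)

# "True" batch size in synchronous multi-GPU training: each GPU contributes
# batch_size tokens per forward/backward pass, and gradients are accumulated
# accum_count times before every parameter update.
def true_batch_tokens(batch_size, accum_count, n_gpu):
    return batch_size * accum_count * n_gpu

print(true_batch_tokens(4096, 2, 6))  # 49152, the 6-GPU run above
print(true_batch_tokens(4096, 4, 1))  # 16384, the single-GPU run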

@ZacharyWaseda

ZacharyWaseda commented Jan 17, 2019

no, in sync training, we send 4096 tokens on each gpu, calculate the gradients, and gather everything before updating parameters.
when accum=2, we wait twice the above before updating, hence true batch is 6x4096x2.

Hi @vince62s, how many hours does it take to train for 30,000 steps on 6 GPUs with the params you mentioned? Thanks. Does distributing and gathering the gradients cost a lot of time?

@vince62s
Member

About 2 1/2 hours per 10K steps

@ZacharyWaseda

ZacharyWaseda commented Jan 17, 2019

About 2 1/2 hours per 10K steps

So 50 steps only take about 45 seconds. That's fast.
Can I ask what GPU model you use, Tesla P40, P100, or some other type?

@XinDongol

If you want to compare, here is the link of the trained model:
https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
And as @srush says, here is the dataset link:
https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz

When preprocessing, make sure you use a sequence length of 100 and -share_vocab

Cheers.

https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz.
Are these the output files of OpenNMT-tf/scripts/wmt/prepare_data.sh?

@ZacharyWaseda

If you want to compare, here is the link of the trained model:
https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
And as @srush says, here is the dataset link:
https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz
When preprocessing, make sure you use a sequence length of 100 and -share_vocab
Cheers.

https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz.
Is it the output files of OpenNMT-tf/scripts/wmt/prepare_data.sh ?

I think so.

@XinDongol

If you want to compare, here is the link of the trained model:
https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
And as @srush says, here is the dataset link:
https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz
When preprocessing, make sure you use a sequence length of 100 and -share_vocab
Cheers.

https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz.
Is it the output files of OpenNMT-tf/scripts/wmt/prepare_data.sh ?

I think so.

Then, we can do pre-processing to get the ready-to-use dataset.

python /OpenNMT-py/preprocess.py -train_src /datasets/wmt/train.en \
    -train_tgt /datasets/wmt/train.de \
    -valid_src /datasets/wmt/valid.en -valid_tgt /datasets/wmt/valid.de \
    -save_data /datasets/wmt/processed \
    -src_seq_length 100 -tgt_seq_length 100 \
    -shard_size 200000000 -share_vocab

Is it correct ?

@ZacharyWaseda

If you want to compare, here is the link of the trained model:
https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
And as @srush says, here is the dataset link:
https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz
When preprocessing, make sure you use a sequence length of 100 and -share_vocab
Cheers.

https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz.
Is it the output files of OpenNMT-tf/scripts/wmt/prepare_data.sh ?

I think so.

Then, we can do pre-processing to get the ready-to-use dataset.

python /OpenNMT-py/preprocess.py -train_src /datasets/wmt/train.en \
    -train_tgt /datasets/wmt/train.de \
    -valid_src /datasets/wmt/valid.en -valid_tgt /datasets/wmt/valid.de \
    -save_data /datasets/wmt/processed \
    -src_seq_length 100 -tgt_seq_length 100 \
    -shard_size 200000000 -share_vocab

Is it correct ?

I did not set this param "-shard_size 200000000"

@vince62s
Member

@Dhanasekar-S I answered you by email.
You need to fully understand the concepts, not just apply recipes.
Do not mix up data preparation and preprocessing.
Data preparation is the same as in the -tf script: https://github.com/OpenNMT/OpenNMT-tf/blob/master/scripts/wmt/prepare_data.sh
It will build a SentencePiece model and tokenize your data with that SP model.

Once you have tokenized data, you can preprocess them to prepare the .pt pickle files.
As written in the email, use a sequence length of 100 for both src and tgt, and use share_vocab.
