Commit a1d8ab0: merge the develop

Jackwaterveg committed Dec 31, 2021
2 parents: c907a8d + 6272496

Showing 25 changed files with 452 additions and 331 deletions.
README.md (1 addition & 1 deletion)

@@ -530,7 +530,7 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
 ## Acknowledgement
 
 
-- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling) for years of attention, constructive advice and great help.
+- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help.
 - Many thanks to [AK391](https://github.com/AK391) for TTS web demo on Huggingface Spaces using Gradio.
 - Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR upon [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
 - Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function.
README_cn.md (1 addition & 2 deletions)

@@ -497,7 +497,6 @@ year={2021}
 <a name="欢迎贡献"></a>
 ## 参与 PaddleSpeech 的开发
 
-
 热烈欢迎您在[Discussions](https://github.com/PaddlePaddle/PaddleSpeech/discussions) 中提交问题,并在[Issues](https://github.com/PaddlePaddle/PaddleSpeech/issues) 中指出发现的 bug。此外,我们非常希望您参与到 PaddleSpeech 的开发中!
 
 ### 贡献者
@@ -539,7 +538,7 @@ year={2021}
 
 ## 致谢
 
-- 非常感谢 [yeyupiaoling](https://github.com/yeyupiaoling) 多年来的关注和建议,以及在诸多问题上的帮助。
+- 非常感谢 [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) 多年来的关注和建议,以及在诸多问题上的帮助。
 - 非常感谢 [AK391](https://github.com/AK391) 在 Huggingface Spaces 上使用 Gradio 对我们的语音合成功能进行网页版演示。
 - 非常感谢 [mymagicpower](https://github.com/mymagicpower) 采用PaddleSpeech 对 ASR 的[短语音](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk)[长语音](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk)进行 Java 实现。
 - 非常感谢 [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) 采用 PaddleSpeech 语音合成功能实现 Virtual Uploader(VUP)/Virtual YouTuber(VTuber) 虚拟主播。
examples/aishell3/voc1/conf/default.yaml (1 addition & 4 deletions)

@@ -72,10 +72,7 @@ lambda_adv: 4.0             # Loss balancing coefficient.
 ###########################################################
 batch_size: 8               # Batch size.
 batch_max_steps: 24000      # Length of each audio in batch. Make sure dividable by n_shift.
-pin_memory: true            # Whether to pin memory in Pytorch DataLoader.
-num_workers: 4              # Number of workers in Pytorch DataLoader.
-remove_short_samples: true  # Whether to remove samples the length of which are less than batch_max_steps.
-allow_cache: true           # Whether to allow cache in dataset. If true, it requires cpu memory.
+num_workers: 2              # Number of workers in DataLoader.
 
 ###########################################################
 #             OPTIMIZER & SCHEDULER SETTING               #
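The vocoder configs above stress that `batch_max_steps` must be divisible by `n_shift` (the hop size). A minimal Python sketch of that sanity check; the function name is ours, and the values 24000 and 300 are taken from the aishell3 config and the csmsc voc4 comment, respectively:

```python
# Sanity-check that batch_max_steps is divisible by the hop size (n_shift),
# so each training slice covers a whole number of spectrogram frames.
def check_batch_max_steps(batch_max_steps: int, n_shift: int) -> int:
    """Return the number of frames per slice, or raise if inconsistent."""
    if batch_max_steps % n_shift != 0:
        raise ValueError(
            f"batch_max_steps={batch_max_steps} is not divisible by n_shift={n_shift}"
        )
    return batch_max_steps // n_shift

print(check_batch_max_steps(24000, 300))  # 24000 / 300 -> 80 frames
```

With a mismatched pair such as (24001, 300) the check raises instead of silently truncating a frame.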
examples/csmsc/voc1/conf/default.yaml (1 addition & 4 deletions)

@@ -79,10 +79,7 @@ lambda_adv: 4.0             # Loss balancing coefficient.
 ###########################################################
 batch_size: 8               # Batch size.
 batch_max_steps: 25500      # Length of each audio in batch. Make sure dividable by n_shift.
-pin_memory: true            # Whether to pin memory in Pytorch DataLoader.
-num_workers: 2              # Number of workers in Pytorch DataLoader.
-remove_short_samples: true  # Whether to remove samples the length of which are less than batch_max_steps.
-allow_cache: true           # Whether to allow cache in dataset. If true, it requires cpu memory.
+num_workers: 2              # Number of workers in DataLoader.
 
 ###########################################################
 #             OPTIMIZER & SCHEDULER SETTING               #
examples/csmsc/voc4/conf/default.yaml (1 addition & 1 deletion)

@@ -88,7 +88,7 @@ discriminator_adv_loss_params:
 batch_size: 32              # Batch size.
 # batch_max_steps(24000) == prod(noise_upsample_scales)(80) * prod(upsample_scales)(300, n_shift)
 batch_max_steps: 24000      # Length of each audio in batch. Make sure dividable by n_shift.
-num_workers: 2              # Number of workers in Pytorch DataLoader.
+num_workers: 2              # Number of workers in DataLoader.
 
 ###########################################################
 #             OPTIMIZER & SCHEDULER SETTING               #
examples/csmsc/voc5/conf/default.yaml (1 addition & 1 deletion)

@@ -119,7 +119,7 @@ lambda_feat_match: 2.0      # Loss balancing coefficient for feat match loss.
 ###########################################################
 batch_size: 16              # Batch size.
 batch_max_steps: 8400       # Length of each audio in batch. Make sure dividable by hop_size.
-num_workers: 2              # Number of workers in Pytorch DataLoader.
+num_workers: 2              # Number of workers in DataLoader.
 
 ###########################################################
 #             OPTIMIZER & SCHEDULER SETTING               #
examples/csmsc/voc5/conf/finetune.yaml (1 addition & 1 deletion)

@@ -119,7 +119,7 @@ lambda_feat_match: 2.0      # Loss balancing coefficient for feat match loss.
 ###########################################################
 batch_size: 16              # Batch size.
 batch_max_steps: 8400       # Length of each audio in batch. Make sure dividable by hop_size.
-num_workers: 2              # Number of workers in Pytorch DataLoader.
+num_workers: 2              # Number of workers in DataLoader.
 
 ###########################################################
 #             OPTIMIZER & SCHEDULER SETTING               #
examples/ljspeech/voc1/conf/default.yaml (1 addition & 4 deletions)

@@ -72,10 +72,7 @@ lambda_adv: 4.0             # Loss balancing coefficient.
 ###########################################################
 batch_size: 8               # Batch size.
 batch_max_steps: 25600      # Length of each audio in batch. Make sure dividable by n_shift.
-pin_memory: true            # Whether to pin memory in Pytorch DataLoader.
-num_workers: 4              # Number of workers in Pytorch DataLoader.
-remove_short_samples: true  # Whether to remove samples the length of which are less than batch_max_steps.
-allow_cache: true           # Whether to allow cache in dataset. If true, it requires cpu memory.
+num_workers: 2              # Number of workers in DataLoader.
 
 ###########################################################
 #             OPTIMIZER & SCHEDULER SETTING               #
examples/ted_en_zh/st0/conf/transformer.yaml (9 additions & 7 deletions)

@@ -2,7 +2,7 @@
 ###########################################
 # Data                                    #
 ###########################################
-train_manifest: data/manifest.train.tiny
+train_manifest: data/manifest.train
 dev_manifest: data/manifest.dev
 test_manifest: data/manifest.test
 min_input_len: 0.05  # second
@@ -19,8 +19,10 @@ vocab_filepath: data/lang_char/vocab.txt
 unit_type: 'spm'
 spm_model_prefix: data/lang_char/bpe_unigram_8000
 mean_std_filepath: ""
-# augmentation_config: conf/augmentation.json
-batch_size: 10
+augmentation_config: conf/preprocess.yaml
+batch_size: 16
+maxlen_in: 5     # if input length > maxlen-in, batchsize is automatically reduced
+maxlen_out: 150  # if output length > maxlen-out, batchsize is automatically reduced
 raw_wav: True  # use raw_wav or kaldi feature
 spectrum_type: fbank  # linear, mfcc, fbank
 feat_dim: 80
@@ -84,13 +86,13 @@ accum_grad: 2
 global_grad_clip: 5.0
 optim: adam
 optim_conf:
-  lr: 0.004
-  weight_decay: 1.0e-06
-scheduler: warmuplr
+  lr: 2.5
+  weight_decay: 1e-06
+scheduler: noam
 scheduler_conf:
   warmup_steps: 25000
   lr_decay: 1.0
-log_interval: 5
+log_interval: 50
 checkpoint:
   kbest_n: 50
   latest_n: 5
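The switch from `warmuplr` to the `noam` scheduler explains the jump from `lr: 0.004` to `lr: 2.5`: under the Noam schedule, `lr` acts as a dimensionless scale factor rather than a literal learning rate. A sketch of the standard Noam formula (linear warmup, then inverse-square-root decay); the `d_model=256` default here is an illustrative assumption, not a value read from this config:

```python
def noam_lr(step: int, scale: float = 2.5, d_model: int = 256,
            warmup_steps: int = 25000) -> float:
    """Noam schedule: lr = scale * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The effective rate peaks exactly at step == warmup_steps and decays afterwards.
peak = noam_lr(25000)
assert noam_lr(1000) < peak and noam_lr(100000) < peak
```

With `scale=2.5` and these assumed values the peak rate is about 2.5 / (sqrt(256) * sqrt(25000)), i.e. roughly 1e-3, which is why the raw `lr` value looks so large.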
examples/ted_en_zh/st0/conf/transformer_mtl_noam.yaml (4 additions & 2 deletions)

@@ -19,8 +19,10 @@ vocab_filepath: data/lang_char/vocab.txt
 unit_type: 'spm'
 spm_model_prefix: data/lang_char/bpe_unigram_8000
 mean_std_filepath: ""
-# augmentation_config: conf/augmentation.json
-batch_size: 10
+augmentation_config: conf/preprocess.yaml
+batch_size: 16
+maxlen_in: 5     # if input length > maxlen-in, batchsize is automatically reduced
+maxlen_out: 150  # if output length > maxlen-out, batchsize is automatically reduced
 raw_wav: True  # use raw_wav or kaldi feature
 spectrum_type: fbank  # linear, mfcc, fbank
 feat_dim: 80
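The new `maxlen_in` / `maxlen_out` keys note that over-long utterances automatically shrink the effective batch. One illustrative way such a rule can work, as a sketch with simple proportional scaling; the exact batching logic inside PaddleSpeech may differ:

```python
def reduced_batch_size(batch_size: int, ilen: int, olen: int,
                       maxlen_in: int = 512, maxlen_out: int = 150) -> int:
    """Shrink the batch in proportion to how far the longest sample
    exceeds maxlen_in / maxlen_out; never drop below one sample."""
    factor = max(ilen / maxlen_in, olen / maxlen_out, 1.0)
    return max(1, int(batch_size / factor))

print(reduced_batch_size(20, ilen=512, olen=100))   # within limits -> 20
print(reduced_batch_size(20, ilen=1024, olen=100))  # input 2x over -> 10
```

The point of the rule is memory safety: batch cost grows with sequence length, so halving the batch when the longest input doubles keeps the per-batch footprint roughly constant.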
examples/ted_en_zh/st0/local/test.sh (0 additions & 2 deletions)

@@ -14,15 +14,13 @@ ckpt_prefix=$3
 
 for type in fullsentence; do
     echo "decoding ${type}"
-    batch_size=32
     python3 -u ${BIN_DIR}/test.py \
     --ngpu ${ngpu} \
     --config ${config_path} \
     --decode_cfg ${decode_config_path} \
     --result_file ${ckpt_prefix}.${type}.rsl \
     --checkpoint_path ${ckpt_prefix} \
     --opts decode.decoding_method ${type} \
-    --opts decode.decode_batch_size ${batch_size}
 
     if [ $? -ne 0 ]; then
         echo "Failed in evaluation!"
examples/ted_en_zh/st1/RESULTS.md (1 addition & 1 deletion)

@@ -12,5 +12,5 @@
 ## Transformer
 | Model | Params | Config | Val loss | Char-BLEU |
 | --- | --- | --- | --- | --- |
-| FAT + Transformer+ASR MTL | 50.26M | conf/transformer_mtl_noam.yaml | 62.86 | 19.45 |
+| FAT + Transformer+ASR MTL | 50.26M | conf/transformer_mtl_noam.yaml | 69.91 | 20.26 |
 | FAT + Transformer+ASR MTL with word reward | 50.26M | conf/transformer_mtl_noam.yaml | 62.86 | 20.80 |
examples/ted_en_zh/st1/conf/transformer.yaml (22 additions & 29 deletions)

@@ -2,42 +2,35 @@
 ###########################################
 # Data                                    #
 ###########################################
-train_manifest: data/manifest.train.tiny
+train_manifest: data/manifest.train
 dev_manifest: data/manifest.dev
 test_manifest: data/manifest.test
-min_input_len: 5.0      # frame
-max_input_len: 3000.0   # frame
-min_output_len: 0.0     # tokens
-max_output_len: 400.0   # tokens
-min_output_input_ratio: 0.01
-max_output_input_ratio: 20.0
 
 ###########################################
 # Dataloader                              #
 ###########################################
-vocab_filepath: data/lang_char/vocab.txt
+vocab_filepath: data/lang_char/ted_en_zh_bpe8000.txt
 unit_type: 'spm'
-spm_model_prefix: data/lang_char/bpe_unigram_8000
+spm_model_prefix: data/lang_char/ted_en_zh_bpe8000
 mean_std_filepath: ""
-# augmentation_config: conf/augmentation.json
-batch_size: 10
-raw_wav: True  # use raw_wav or kaldi feature
-spectrum_type: fbank  # linear, mfcc, fbank
+batch_size: 20
 feat_dim: 83
-delta_delta: False
-dither: 1.0
-target_sample_rate: 16000
-max_freq: None
-n_fft: None
-stride_ms: 10.0
-window_ms: 25.0
-use_dB_normalization: True
-target_dB: -20
-random_seed: 0
-keep_transcription_text: False
-sortagrad: True
-shuffle_method: batch_shuffle
-num_workers: 2
+sortagrad: 0     # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
+maxlen_in: 512   # if input length > maxlen-in, batchsize is automatically reduced
+maxlen_out: 150  # if output length > maxlen-out, batchsize is automatically reduced
+minibatches: 0   # for debug
+batch_count: auto
+batch_bins: 0
+batch_frames_in: 0
+batch_frames_out: 0
+batch_frames_inout: 0
+augmentation_config:
+num_workers: 0
+subsampling_factor: 1
+num_encs: 1
 
 
 ############################################
@@ -80,18 +73,18 @@ model_conf:
 ###########################################
 # Training                                #
 ###########################################
-n_epoch: 20
+n_epoch: 40
 accum_grad: 2
 global_grad_clip: 5.0
 optim: adam
 optim_conf:
-  lr: 0.004
-  weight_decay: 1.0e-06
-scheduler: warmuplr
+  lr: 2.5
+  weight_decay: 0.
+scheduler: noam
 scheduler_conf:
   warmup_steps: 25000
   lr_decay: 1.0
-log_interval: 5
+log_interval: 50
 checkpoint:
   kbest_n: 50
   latest_n: 5
examples/ted_en_zh/st1/conf/transformer_mtl_noam.yaml (17 additions & 24 deletions)

@@ -5,12 +5,6 @@
 train_manifest: data/manifest.train
 dev_manifest: data/manifest.dev
 test_manifest: data/manifest.test
-min_input_len: 5.0      # frame
-max_input_len: 3000.0   # frame
-min_output_len: 0.0     # tokens
-max_output_len: 400.0   # tokens
-min_output_input_ratio: 0.01
-max_output_input_ratio: 20.0
 
 ###########################################
 # Dataloader                              #
@@ -20,24 +14,23 @@ unit_type: 'spm'
 spm_model_prefix: data/lang_char/ted_en_zh_bpe8000
 mean_std_filepath: ""
 # augmentation_config: conf/augmentation.json
-batch_size: 10
-raw_wav: True  # use raw_wav or kaldi feature
-spectrum_type: fbank  # linear, mfcc, fbank
+batch_size: 20
 feat_dim: 83
-delta_delta: False
-dither: 1.0
-target_sample_rate: 16000
-max_freq: None
-n_fft: None
-stride_ms: 10.0
-window_ms: 25.0
-use_dB_normalization: True
-target_dB: -20
-random_seed: 0
-keep_transcription_text: False
-sortagrad: True
-shuffle_method: batch_shuffle
-num_workers: 2
+sortagrad: 0     # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
+maxlen_in: 512   # if input length > maxlen-in, batchsize is automatically reduced
+maxlen_out: 150  # if output length > maxlen-out, batchsize is automatically reduced
+minibatches: 0   # for debug
+batch_count: auto
+batch_bins: 0
+batch_frames_in: 0
+batch_frames_out: 0
+batch_frames_inout: 0
+augmentation_config:
+num_workers: 0
+subsampling_factor: 1
+num_encs: 1
 
 
 ############################################
@@ -80,18 +73,18 @@ model_conf:
 ###########################################
 # Training                                #
 ###########################################
-n_epoch: 20
+n_epoch: 40
 accum_grad: 2
 global_grad_clip: 5.0
 optim: adam
 optim_conf:
   lr: 2.5
-  weight_decay: 1.0e-06
+  weight_decay: 0.
 scheduler: noam
 scheduler_conf:
   warmup_steps: 25000
   lr_decay: 1.0
-log_interval: 5
+log_interval: 50
 checkpoint:
   kbest_n: 50
   latest_n: 5
examples/ted_en_zh/st1/local/test.sh (4 additions & 1 deletion)

@@ -14,15 +14,18 @@ ckpt_prefix=$3
 
 for type in fullsentence; do
     echo "decoding ${type}"
     batch_size=32
     python3 -u ${BIN_DIR}/test.py \
     --ngpu ${ngpu} \
     --config ${config_path} \
     --decode_cfg ${decode_config_path} \
     --result_file ${ckpt_prefix}.${type}.rsl \
     --checkpoint_path ${ckpt_prefix} \
+<<<<<<< HEAD
     --opts decode.decoding_method ${type} \
     --opts decode.decode_batch_size ${batch_size}
+=======
+    --opts decoding.decoding_method ${type}
+>>>>>>> 6272496d9c26736750b577fd832ea9dd4ddc4e6e
 
     if [ $? -ne 0 ]; then
         echo "Failed in evaluation!"
examples/vctk/voc1/conf/default.yaml (1 addition & 4 deletions)

@@ -72,10 +72,7 @@ lambda_adv: 4.0             # Loss balancing coefficient.
 ###########################################################
 batch_size: 8               # Batch size.
 batch_max_steps: 24000      # Length of each audio in batch. Make sure dividable by n_shift.
-pin_memory: true            # Whether to pin memory in Pytorch DataLoader.
-num_workers: 4              # Number of workers in Pytorch DataLoader.
-remove_short_samples: true  # Whether to remove samples the length of which are less than batch_max_steps.
-allow_cache: true           # Whether to allow cache in dataset. If true, it requires cpu memory.
+num_workers: 2              # Number of workers in DataLoader.
 
 ###########################################################
 #             OPTIMIZER & SCHEDULER SETTING               #