
Learning When to Translate for Streaming Speech

This is a PyTorch implementation of the ACL 2022 main conference paper Learning When to Translate for Streaming Speech.

Data Processing

Take English-German as an example. First, download the MuST-C v1.0 archive MUSTC_v1.0_en-de.tar.gz to the ${MUSTC_ROOT} path and uncompress it:

LANG=de
MUSTC_ROOT=/path/data/en-${LANG}
tar -xzvf MUSTC_v1.0_en-de.tar.gz

Then, run the script to prepare the data manifests:

python3 examples/speech_to_text/prep_mustc_data_raw.py --data-root ${MUSTC_ROOT} \
  --tgt-lang ${LANG}

The generated .tsv manifests are expanded with a source-language text field and doubled with ASR examples (rows whose tgt_text repeats the English source). Here are some example rows from a .tsv file:

id      audio   n_frames        tgt_text        speaker tgt_lang        src_text        src_lang
ted_2529_66     /xxx/en-de/data/train/wav/ted_2529.wav:9517120:61760      61760   Ich hatte den Vorteil einer Perspektive von dieser Breite.  spk.2529        de      I had the benefit of a spectrum this wide.      en
ted_1257_134    /xxx/en-de/data/train/wav/ted_1257.wav:13876160:80960     80960   And outside the library, I wanted to make a place to cultivate your mind.   spk.1257        en      And outside the library, I wanted to make a place to cultivate your mind.       en
ted_362_30      /xxx/en-de/data/train/wav/ted_362.wav:488959:156960       156960  Ich lebe genau hier im West Village, die Rauchwolke wurde zum Glück westwärts geweht, weg von uns.  spk.362 de      I live right there in the West Village, so the plume was luckily blowing west, away from us.        en
...
ted_526_7       /xxx/en-de/data/train/wav/ted_526.wav:16538720:19360      19360   It can also happen in the brain.    spk.526 en      It can also happen in the brain.        en
ted_190_62      /xxx/en-de/data/train/wav/ted_190.wav:7045920:47360       47360   Simple question: if you can't read and write, how do you manage your contact information?   spk.190 en      Simple question: if you can't read and write, how do you manage your contact information?   en
ted_1771_81     /xxx/en-de/data/train/wav/ted_1771.wav:9624320:25600      25600   This is my message to you. spk.1771 en      This is my message to you.      en
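The audio column packs the wav path, a sample offset, and a length as path:offset:n_frames (an assumption based on the rows above, mirroring fairseq's speech_to_text slicing convention). A minimal sketch of parsing one such row with only the standard library:

```python
# Hypothetical parser for one manifest row (tab-separated, header as above).
# The audio field packs `path:offset:n_frames`; splitting from the right
# keeps any colons that might appear earlier in the path.
FIELDS = ["id", "audio", "n_frames", "tgt_text", "speaker",
          "tgt_lang", "src_text", "src_lang"]

def parse_manifest_row(line):
    row = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    path, offset, length = row["audio"].rsplit(":", 2)
    row["audio"] = {"path": path, "offset": int(offset), "n_frames": int(length)}
    row["n_frames"] = int(row["n_frames"])
    return row

row = parse_manifest_row(
    "ted_2529_66\t/xxx/en-de/data/train/wav/ted_2529.wav:9517120:61760\t61760\t"
    "Ich hatte den Vorteil einer Perspektive von dieser Breite.\tspk.2529\tde\t"
    "I had the benefit of a spectrum this wide.\ten"
)
```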

The preprocessed directory ${MUSTC_ROOT} should look as follows:

.
├── en-de
│   ├── config_wave.yaml
│   ├── data
│   ├── dev_wavecif_joint.tsv
│   ├── docs
│   ├── segment
│   ├── spm_unigram10000_st.model
│   ├── spm_unigram10000_st.txt
│   ├── spm_unigram10000_st.vocab
│   ├── train_wavecif_joint.tsv
│   ├── tst-COMMON_wavecif_joint.tsv
│   ├── tst-HE_wavecif_joint.tsv
└── MUSTC_v1.0_en-de.tar.gz

The sentencepiece model and vocabulary files for En-De can be downloaded here: spm_unigram10000_st.model, spm_unigram10000_st.txt, spm_unigram10000_st.vocab.

The sentencepiece model and vocabulary files for En-Fr can be downloaded here: spm_unigram10000_st.model, spm_unigram10000_st.txt, spm_unigram10000_st.vocab.

The sentencepiece model for generating the MSM's labels can be downloaded here: spm_unigram5000_asr.model, which should be placed at /path/spm_unigram5000_asr.model.

The generated config_wave.yaml should look as follows:

bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: spm_unigram10000_st.model
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
vocab_filename: spm_unigram10000_st.txt
use_audio_input: true
prepend_tgt_lang_tag: true

Training

  • Training with multitask learning.
fairseq-train ${MUSTC_ROOT} \
  --config-yaml config_wave.yaml \
  --train-subset train_wave_joint \
  --valid-subset dev_wave_joint \
  --save-dir /path/${LANG}/pretrain \
  --max-tokens 3200000  \
  --update-freq 1 \
  --max-update 3200000 \
  --task speech_to_text_wav2vec \
  --criterion label_smoothed_cross_entropy \
  --report-accuracy \
  --arch convtransformer_espnet_wav2vec \
  --w2v2-model-path /path/wav2vec_small.pt \
  --optimizer adam \
  --lr 0.0001 \
  --lr-scheduler inverse_sqrt \
  --warmup-updates 25000 \
  --clip-norm 10.0 \
  --seed 1 \
  --ddp-backend=no_c10d \
  --keep-best-checkpoints 10 \
  --best-checkpoint-metric accuracy \
  --maximize-best-checkpoint-metric \
  --patience 15 \
  --max-source-positions 3200000 \
  --skip-invalid-size-inputs-valid-test \
  --dropout 0.0 --activation-dropout 0.1 --attention-dropout 0.1 \
  --encoder-layers 8 \
  --empty-cache-freq 100 \
  --ignore-prefix-size 1 \
  --fp16
Here are example rows from the train_wave_joint manifest used for multitask learning:

id      audio   n_frames        tgt_text        speaker tgt_lang
ted_878_142     /xxx/en-de/data/train/wav/ted_878.wav:1216800:161760      161760  But we too rarely articulate and defend and argue about those big moral questions in our politics.   spk.878 en
ted_1776_86     /xxx/en-de/data/train/wav/ted_1776.wav:8300639:39040      39040   Ich bin also so etwas wie ein Humoranalyst.  spk.1776        de
ted_1312_6      /xxx/en-de/data/train/wav/ted_1312.wav:1980000:31200      31200   And I just finished a couple of months ago.  spk.1312        en
ted_2889_24     /xxx/en-de/data/train/wav/ted_2889.wav:3703360:139840     139840  One reason is the stigma, with 63 percent of black Americans mistaking depression for a weakness.    spk.2889        en
ted_445_163     /xxx/en-de/data/train/wav/ted_445.wav:14420960:88160      88160   They all have the same virus, but they're different enough that there's reason to believe that they've been independently acquired.  spk.445 en
ted_424_60      /xxx/en-de/data/train/wav/ted_424.wav:9106080:83840       83840   Lem Sen: "I would've made this money, too, but I spent all this time looking for the American man who stole my recipe.       spk.424 en
ted_1489_67     /xxx/en-de/data/train/wav/ted_1489.wav:12616000:39519     39519   India has the youngest growing population in the world.      spk.1489        en
ted_1258_76     /xxx/en-de/data/train/wav/ted_1258.wav:7939040:18400      18400   I spend a lot of time on the road.   spk.1258        en
ted_1513_11     /xxx/en-de/data/train/wav/ted_1513.wav:2869919:28000      28000   It's active in the Gulf of Guinea.   spk.1513        en

We use the pre-trained Wav2vec 2.0 as the acoustic encoder.

  • Fine-tuning with monotonic segmentation module.
fairseq-train ${MUSTC_ROOT} \
  --config-yaml config_wave.yaml \
  --train-subset train_wavecif_joint \
  --valid-subset dev_wavecif_joint \
  --save-dir /path/${LANG}/finetune/ \
  --max-tokens 3200000  \
  --update-freq 1 \
  --max-update 3200000 \
  --task speech_to_text_wav2vec_cif \
  --criterion qua_ce_acc_v2 \
  --arch convtransformer_espnet_wav2vec_cif \
  --w2v2-model-path /path/wav2vec_small.pt \
  --optimizer adam \
  --lr 0.0001 \
  --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 \
  --clip-norm 10.0 \
  --seed 1 \
  --ddp-backend=no_c10d \
  --keep-best-checkpoints 10 \
  --best-checkpoint-metric accuracy \
  --maximize-best-checkpoint-metric \
  --patience 15 \
  --max-source-positions 3200000 \
  --skip-invalid-size-inputs-valid-test \
  --dropout 0.0 --activation-dropout 0.1 --attention-dropout 0.1 \
  --encoder-layers 8 \
  --ignore-prefix-size 1 --log-interval 20  --fp16 \
  --load-pretrained-encoder-from /path/${LANG}/pretrain/checkpoint.pt \
  --load-pretrained-decoder-from /path/${LANG}/pretrain/checkpoint.pt
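The monotonic segmentation module builds on a continuous integrate-and-fire (CIF) style mechanism: per-frame weights are accumulated, and a segment boundary "fires" once the running sum crosses a threshold. A minimal, illustrative sketch of that firing rule (not the repository's actual implementation):

```python
# Illustrative CIF-style firing rule: `alphas` are per-frame weights in
# [0, 1]; a boundary fires whenever the running sum reaches the threshold,
# and the surplus is carried over into the next segment.
def cif_boundaries(alphas, threshold=1.0):
    boundaries, acc = [], 0.0
    for t, a in enumerate(alphas):
        acc += a
        if acc >= threshold:
            boundaries.append(t)
            acc -= threshold
    return boundaries
```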

Evaluation

Offline Translation

Our released models (En-De and En-Fr) can be downloaded and evaluated directly.

fairseq-generate ${MUSTC_ROOT} \
  --config-yaml config_wave.yaml \
  --gen-subset tst-COMMON_wavecif_joint_st \
  --task speech_to_text_wav2vec_cif \
  --path /path/${LANG}/finetune/checkpoint.pt \
  --max-tokens 3200000 \
  --beam 5 \
  --scoring sacrebleu \
  --max-source-positions 3200000 \
  --prefix-size 1
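fairseq-generate prints detokenized hypotheses on D-&lt;id&gt; lines and references on T-&lt;id&gt; lines. A small sketch for pairing them up before scoring (this follows the standard fairseq output convention; adjust if your log format differs):

```python
# Pair hypotheses ("D-<id>\t<score>\t<text>") with references
# ("T-<id>\t<text>") from fairseq-generate output lines.
def collect_pairs(lines):
    hyps, refs = {}, {}
    for line in lines:
        if line.startswith("D-"):
            idx, _score, text = line.rstrip("\n").split("\t", 2)
            hyps[idx[2:]] = text
        elif line.startswith("T-"):
            idx, text = line.rstrip("\n").split("\t", 1)
            refs[idx[2:]] = text
    return [(hyps[i], refs[i]) for i in sorted(hyps, key=int) if i in refs]

pairs = collect_pairs([
    "T-0\tEin Beispielsatz.",
    "D-0\t-0.5\tEin Beispielsatz.",
])
```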

Streaming Translation

Note that the offline models need to be converted to support the streaming translation task. Our En-De model can be downloaded to test streaming translation.

  • Prefix-decision
lagging=5
fixed_pre_decision_ratio=7
simuleval --agent mosst/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent_wav2vec.py \
  --source /path/data/tst-COMMON.wavurl \
  --target /path/data/tst-COMMON.${LANG} \
  --data-bin /path/data/en-${LANG}/ \
  --config config_wave.yaml \
  --model-path /path/${LANG}/finetune/checkpoint.pt \
  --output /path/${LANG}/finetune/simuleval/ \
  --waitk-lagging ${lagging} \
  --fixed-pre-decision-ratio ${fixed_pre_decision_ratio} \
  --scores \
  --port 1234
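The prefix-decision agent follows a wait-k policy over pre-decision chunks: it first reads k chunks (each covering fixed_pre_decision_ratio encoder states), then alternates reads and writes. A toy sketch of the decision rule (a hypothetical helper, not the agent's actual API):

```python
# Toy wait-k decision rule: write once the number of source chunks read
# is at least k ahead of the number of target tokens written; else read.
def waitk_action(num_chunks_read, num_written, k):
    return "write" if num_chunks_read - num_written >= k else "read"
```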
  • Dynamic-decision
simuleval --agent mosst/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent_wav2vec_cif.py \
  --source /path/data/tst-COMMON.wavurl \
  --target /path/data/tst-COMMON.${LANG} \
  --data-bin /path/data/en-${LANG}/ \
  --config config_wave.yaml \
  --model-path /path/${LANG}/finetune/checkpoint.pt \
  --output /path/${LANG}/finetune/simuleval/ \
  --scores \
  --max-source-positions 3200000 \
  --port 1234
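With --scores, SimulEval reports translation quality alongside latency metrics such as Average Lagging (AL). A simplified, text-level sketch of AL (delays counted in source tokens; SimulEval's speech variant works in milliseconds):

```python
# Simplified Average Lagging: delays[t] is the amount of source consumed
# when target token t is emitted; the average stops at the first token
# emitted after the full source has been read.
def average_lagging(delays, src_len, tgt_len):
    gamma = tgt_len / src_len
    tau = next((t + 1 for t, d in enumerate(delays) if d >= src_len), tgt_len)
    return sum(delays[t] - t / gamma for t in range(tau)) / tau
```

For a wait-k policy on equal-length sequences, AL comes out to roughly k, which is a quick sanity check for the formula.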

Citation

Please consider citing our paper in your publications if this project helps your research. The BibTeX reference is as follows.

@inproceedings{dong-etal-2022-Learning,
	title = {Learning When to Translate for Streaming Speech},
	author = {Dong, Qianqian and Zhu, Yaoming and Wang, Mingxuan and Li, Lei},
	booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
	year = {2022},
}
