Automatic speech-to-text translation (AST) consists in translating a speech utterance in a source language to a text in a target language. Here we are interested in translating directly from speech in French to text in another language. In the following, we describe the steps to reproduce our AST results presented in the paper (Section 5.3).
1. AST results
2. Dataset and installation
2.1. Dataset
2.2. Installation
3. Feature preparation
3.1. Task-agnostic pre-training
3.1.1. log-Mel filterbank features
3.1.2. wav2vec features
3.2. Self-supervised fine-tuning on mTEDx
3.2.1. Perform self-supervised fine-tuning on mTEDx
3.2.2. Extract features from obtained wav2vec models
3.3. Supervised fine-tuning for ASR on mTEDx
3.3.1. Perform supervised fine-tuning for ASR on mTEDx
3.3.2. Extract features from obtained wav2vec models
4. Training ST models
5. Decoding
The following table (corresponding to Table 5 in the paper) shows the BLEU scores on the valid and test sets of multilingual TEDx (mTEDx) using different types of speech features. Note that these results are obtained from bilingual ST models trained on the respective datasets.
The baselines in our experiments are models using log-Mel filterbank features (MFB
). For models using wav2vec features, there are 3 main blocks corresponding to features extracted from (a) task-agnostic pre-training, (b) self-supervised fine-tuning on mTEDx, and (c) supervised fine-tuning for ASR on mTEDx. The two latter methods belong to the task-specific pre-training category. The highest value in each block is underlined, while the best value in each column is highlighted in bold.
Valid data | Test data | Links to models | ||||||
---|---|---|---|---|---|---|---|---|
Input features | fr-en | fr-es | fr-pt | fr-en | fr-es | fr-pt | wav2vec | ST model |
MFB | 1.15 | 0.67 | 0.61 | 1.10 | 0.87 | 0.32 | Na | fr-en,fr-es,fr-pt |
(a) Task agnostic pre-training | ||||||||
En-base | 5.54 | 1.30 | 0.54 | 5.20 | 1.47 | 0.38 | Download | fr-en,fr-es,fr-pt |
En-large | 4.11 | 1.67 | 0.32 | 3.56 | 2.29 | 0.43 | Download | fr-en,fr-es,fr-pt |
Fr-1K-base | 9.18 | 5.09 | 0.39 | 8.98 | 5.64 | 0.49 | Download | fr-en,fr-es,fr-pt |
Fr-1K-large | 15.31 | 13.74 | 8.29 | 14.46 | 14.77 | 9.37 | Download | fr-en,fr-es,fr-pt |
Fr-2.6K-base | 15.09 | 13.27 | 4.72 | 14.69 | 14.04 | 5.51 | Download | fr-en,fr-es,fr-pt |
Fr-3K-base | 15.05 | 13.19 | 4.44 | 14.80 | 14.27 | 4.72 | Download | fr-en,fr-es,fr-pt |
Fr-3K-large | 17.94 | 16.40 | 8.64 | 18.00 | 18.12 | 9.55 | Download | fr-en,fr-es,fr-pt |
Fr-7K-base | 15.13 | 12.78 | 2.65 | 14.50 | 13.61 | 2.66 | Download | fr-en,fr-es,fr-pt |
Fr-7K-large | 19.23 | 17.59 | 9.68 | 19.04 | 18.24 | 10.98 | Download | fr-en,fr-es,fr-pt |
XLSR-53-large | 7.81 | 0.49 | 0.43 | 6.75 | 0.52 | 0.36 | Download | fr-en,fr-es,fr-pt |
(b) Task specific pre-training (self-supervised on mTEDX) | ||||||||
Fr-3K-large | 18.54 | 16.40 | 8.81 | 18.38 | 17.84 | 10.57 | Download | fr-en,fr-es,fr-pt |
Fr-7K-large | 19.65 | 17.53 | 9.35 | 19.36 | 18.95 | 10.94 | Download | fr-en,fr-es,fr-pt |
XLSR-53-large | 6.83 | 0.54 | 0.34 | 6.75 | 0.34 | 0.29 | Download | fr-en,fr-es,fr-pt |
(c) Task specific pre-training (fine-tuned for ASR on mTEDX) | ||||||||
Fr-3K-large | 21.09 | 19.28 | 14.40 | 21.34 | 21.18 | 16.66 | Download | fr-en,fr-es,fr-pt |
Fr-7K-large | 21.41 | 20.32 | 15.14 | 21.69 | 21.57 | 17.43 | Download | fr-en,fr-es,fr-pt |
XLSR-53-large | 21.09 | 20.38 | 14.56 | 20.68 | 21.14 | 17.21 | Download | fr-en,fr-es,fr-pt |
En-base/large
and XLSR-53
are off-the-shelf wav2vec models trained on English and multilingual speech, respectively. The ones whose prefixes are Fr
are the wav2vec models that we trained on our collected French datasets of different sizes (1K, 2.6K, 3K, and 7K). Except for the one trained on 2.6K hours, each model has both base
and large
configurations.
NOTE: For the two task-specific pre-training methods (self-supervised and supervised fine-tuning on mTEDx), since the French speech is overlapped between the language pairs, we selected the pair having the most speech data (fr-en
) to perform task-specific pre-training and used the obtained models to extract features for the remaining pairs (fr-es
and fr-pt
). For a fair comparison, we did not use additional data augmentation technique nor ASR encoder pre-training in the experiments.
We selected subsets having French as the source language in the large multilingual speech-to-text dataset multilingual TEDx.
Our benchmark covers translation directions from French (fr
) to three target languages: English (en
), Portugese (pt
), and Spanish (es
).
The training sizes (in hours) are shown in the following table.
Dataset | fr-en | fr-es | fr-pt | Link |
---|---|---|---|---|
mTEDx | 50 | 38 | 25 | Download |
After downloading data, please unzip and save them under ${MTEDX_ROOT}
.
The experiments are performed using Python 3.8.2
, torch 1.8.1
, torchaudio 0.8.1
. Our implementation is based on fairseq S2T.
Please clone our fork (LeBenchmark
branch)
as there are modifications made for LeBenchmark:
git clone https://github.com/formiel/fairseq.git
cd fairseq
git checkout LeBenchmark
Then install it in your environment:
pip install -e .
(remove -e
in the above if you don't want to install in the editable mode).
In addition, please also install NVIDIA's apex
library as instructed in fairseq
.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" --global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
Finally, the following libraries are necessary for this recipe: sndfile, ffmpeg, pandas, soundfile, sentencepiece and torchaudio. These can be installed as follows.
- With sudo privileges:
sudo apt-get install libsndfile1-dev ffmpeg
pip install pandas soundfile sentencepiece torchaudio
- On a virtual environment:
conda install -y libsndfile ffmpeg
pip install pandas soundfile sentencepiece torchaudio
In the following, please have separate ${MTEDX_ROOT}
for each type of features since the output files (features, manifest, dictionary etc.) will be overwitten if they are under the same folder. For example, below is the structure of the downloaded data after extracting:
${DOWNLOAD_DIR}
└──fr-en
└──data
└──train
└──valid
└──test
└──fr-es
└──data
└──train
└──valid
└──test
...
Then you can create a folder named ${MTEDX_ROOT}
having similar structure as above for each type of feature and create symlinks to the data downloaded for each ${MTEDX_ROOT} folder as below.
${MTEDX_ROOT}
└──fr-en
└──data -> ${DOWNLOAD_DIR}/fr-en/data
└──fr-es
└──data -> ${DOWNLOAD_DIR}/fr-es/data
...
python examples/speech_to_text/prep_mtedx_data.py --data-root ${MTEDX_ROOT} \
--vocab-type unigram \
--vocab-size 1000 \
--task st
Before extracting features from wav2vec models, it is necessary to convert .flac
files to .wav
files. You can use ffmpeg
for such conversion under the AST/tools
directory in this repo.
bash tools/flac2wav.sh $FLAC_DIR ${MTEDX_ROOT}/wav
where $FLAC_DIR
is path to the directory containing .flac
files.
python examples/speech_to_text/prep_mtedx_data_w2v_feats.py \
--data-root ${MTEDX_ROOT} \
--vocab-type unigram --vocab-size 1000 --task st \
--use-w2v-feats \
--w2v-path ${W2V2_PATH} \
--src fr --tgt ${TGT_LANG}
where:
${W2V2_PATH}
is path to the wav2vec 2.0 model from which you want to extract features,${TGT_LANG}
is chosen among[en, es, pt]
.
IMPORTANT: If you extract features from large
models, please add --normalize-signal
to the above command line.
The input to wav2vec models needs to be single channel with sampling rate of 16kHz. Therefore, we first need to downsample the audio files before training. Similar to Section 3.1.2, you can run the following command to convert .flac
files to .wav
files.
bash tools/flac2wav.sh $FLAC_DIR ${MTEDX_ROOT}/wav
where $FLAC_DIR
is path to the directory containing .flac
files.
Since it is recommended to split each file into separate files each having smaller length when training wav2vec models, we first split the audio files (for each talk) into smaller files (each containing one sentence) based on the segment information provided in the released mTEDx dataset.
bash examples/speech_to_text/split_wav_files.sh ${INPUT_DIR} ${OUTPUT_SPLIT_DIR} ${SEGMENT_FILE}
where
${INPUT_DIR}
is path to the folder containing audio files to be split,${OUTPUT_SPLIT_DIR}
is where you want to store the resulting split files,${SEGMENT_FILE}
is path to the segment file.
The input .tsv
file to wav2vec training has the following format:
/path/to/audio/folder
filename0.wav nframes
filename1.wav nframes
...
To prepare data according to this format, please run the following command to first obtain the manifest files
python examples/speech_to_text/prep_mtedx_data_w2v_feats.py \
--data-root ${MTEDX_ROOT} \
--vocab-type unigram --vocab-size 1000 \
--task st --src fr --tgt en \
--get-manifest-only
Then run
python examples/speech_to_text/prep_ft_w2v2.py --audio-root ${AUDIO_ROOT} --tsv-path ${TSV_PATH} --dest ${DATA_DIR}
where
${AUDIO_ROOT}
is path to the folder where split audio files from (1) are saved,${TSV_PATH}
is path to the.tsv
files (train_st.tsv
,valid_st.tsv
, andtest_st.tsv
) obtained as above,${DATA_DIR}
is where you want to store the output files, including the.tsv
file having the above format, the.ltr
and.wrd
files which include the transcripts pre-tokenized at the letter and word level, repectively. This${DATA_DIR}
will be the input folder for task-specific pre-training (both self-supervised and supervised one).
NOTE: The self-supervised fine-tuning on mTEDx is resumed from the last optimizer's state of the corresponding pre-trained model, hence the number of updates will be picked up from where it left off previously. For example, your self-supervised fine-tuning should start at step around 180k for Fr-1K-base
, 158k for Fr-1K-large
, and around 496K or 500K for the remaining wav2vec Fr
models. The max_update
in the configuration file is hence the sum of previous training steps in the pre-trained model and the training steps to be performed on the task data. All of the self-supervised fine-tuned models in our experiments were trained for an additional 20K steps on fr-en
pair of mTEDx.
To perform self-supervised fine-tuning on mTEDx, please run the following command:
fairseq-hydra-train \
common.tensorboard_logdir=${TENSORBOARD_DIR} \
checkpoint.save_dir=${SAVE_DIR} \
checkpoint.restore_file=${PRETRAINED_W2V2_PATH} \
checkpoint.reset_meters=true \
task.data=${DATA_DIR} \
--config-dir NeurIPS2021/AST/configs \
--config-name ${MODEL_CONFIG}
where
$TENSORBOARD_DIR$
is path to save the tensorboard,${SAVE_DIR}
is path to save the checkpoints,${MODEL_CONFIG}
is the training configuration. The main hyperparameters are the same as inexample/wav2vec/config/pretraining/wav2vec2_large_librivox.yaml
for self-supervised fine-tuning andexample/wav2vec/config/pretraining/vox100h.yaml
for supervised fine-tuning. Please refer toNeurIPS2021/AST/configs
for the configuration files that we used in our experiments.
NOTE: For XLSR-53
model, please add
checkpoint.reset_meters=true \
checkpoint.reset_dataloader=true \
to the above command.
IMPORTANT: If your are resuming the training from a previous job (in case the previous run is stopped for some reason, such as time limit etc.), please modify the checkpoint.restore_file
to be the last checkpoint (checkpoint_last.pt
) of the previous training so that the model continues to train properly. Otherwise, it will load the pre-trained model from ${PRETRAINED_W2V2_PATH}
and run the previous training again.
Please follow Section 3.1.2 to extract features from the obtained self-supervised fine-tuned wav2vec models. $W2V2_PATH
is the path to the best checkpoint (checkpoint_best.pt
) obtained from the training in Section 3.2.1.
Please follow step (1) and (2) in Section 3.2.1 to split the audio and prepare the .tsv
and .ltr
files for training.
For supervised fine-tuning, we also need to have the dictionary. To learn the dictionary on the transcripts, please run the following command:
fairseq-preprocess --dataset-impl mmap --trainpref ${DATA_DIR}/train.ltr --only-source --thresholdsrc 0
then copy the obtained dictionary to $DATA_DIR/dict.ltr.txt
.
To perform supervised fine-tuning for ASR on mTEDx, please run the following command:
fairseq-hydra-train \
common.tensorboard_logdir=${TENSORBOARD_DIR} \
checkpoint.save_dir=${SAVE_DIR} \
task.data=${DATA_DIR} \
model.w2v_path=${PRETRAINED_W2V2_PATH} \
--config-dir NeurIPS2021/AST/configs \
--config-name ${MODEL_CONFIG}
Please refer to Section 3.1.2 for the feature extraction step.
NOTE: Please add --w2v-ctc
to the command line in Section 3.1.2 to extract features from supervised fine-tuned wav2vec models.
To train a speech-to-text translation model on the extracted features, run the following command:
fairseq-train ${MTEDX_ROOT}/${LANG_PAIR} \
--train-subset train_st \
--valid-subset valid_st\
--config-yaml config_st.yaml \
--save-dir ${ST_SAVE_DIR} \
--num-workers 4 \
--max-tokens 40000 \
--max-source-positions 150000 \
--max-target-positions 8192 \
--task speech_to_text \
--criterion label_smoothed_cross_entropy \
--report-accuracy \
--max-epoch 500 \
--arch s2t_transformer_xs \
--optimizer adam \
--lr 2e-3 \
--lr-scheduler inverse_sqrt \
--warmup-updates 10000 \
--clip-norm 10.0 \
--seed 1 \
--log-interval 1000 \
--update-freq 8 \
--tensorboard-logdir ${TENSORBOARD_DIR}
where
${LANG_PAIR}
is the language pair (for example,fr-en
,fr-es
, orfr-pt
) on which to train the models.${ST_SAVE_DIR}
is the path to save checkpoints.
IMPORTANT:
- Please add
--use-linear-before-cnn
when training ST models using features extracted from wav2vec models. - Multi-GPU training: Training on multiple GPUs requires some modifications of the above command:
- Replace
fairseq-train
withpython -u -m torch.distributed.launch --nproc_per_node=${NGPUS_PER_NODE} $(which fairseq-train)
where${NGPUS_PER_NODE}
is the number of GPUs. - Scale the effective batch size accordingly. For example, on 4 GPUs, you can set
--update-freq 2
(instead of--update-freq 8
).
To decode using a trained model (with weight-averaging over the last 10 checkpoints), run the following commands:
CHECKPOINT_FILENAME=avg_last_10_checkpoint.pt
python scripts/average_checkpoints.py \
--inputs ${ST_SAVE_DIR} \
--num-epoch-checkpoints 10 \
--output "${ST_SAVE_DIR}/${CHECKPOINT_FILENAME}"
fairseq-generate ${MTEDX_ROOT}/${LANG_PAIR} \
--config-yaml config_st.yaml \
--gen-subset ${GEN_SUBSET} \
--task speech_to_text \
--path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} \
--max-tokens 50000 --beam 5 --scoring sacrebleu \
--results-path ${RESULT_PATH} \
--max-source-positions 50000
where:
${GEN_SUBSET}
is the name of the subset you want to decode.${RESULT_PATH}
is the path where you want to save the decoding results.