This repository contains code for building an RNN transducer model for Automatic Speech Recognition (ASR) [1]. We currently support LSTM- and Conformer-based ASR.
- torch >= 1.11.0
- torchaudio >= 0.11.0
- speechbrain >= 0.5.11
- pandas
- tqdm
- transformers >= 4.18.0 URL (NOT NEEDED FOR PLAIN ASR TRAINING.)
- conformer >= 0.2.5 URL
`run_debug.sh` is the script for debugging, usually run on a single node. In this script:
- `--batch-size`: the total batch size after which a gradient descent update is made.
- `--bsz-small`: the batch size per GPU. If the total batch size across all GPUs (#gpu * `--bsz-small`) is not equal to `--batch-size`, gradients are accumulated.
- `--save-path`: where to save checkpoints. A checkpoint is saved after every epoch by default; edit `--checkpoint-after` to change this.
- `--ckpt-path`: path to the checkpoint to be loaded to continue training.
- `--train-path`: path to the training file. It should be a csv that follows the template defined at URL.
- `--enc-type`: 'lstm' or 'conf'.
- `--hid-tr`: hidden units in the transcription network.
- `--hid-pr`: hidden units in the prediction network.
- `--unidirectional`: set this flag if training a unidirectional LSTM as the transcription network. Useful for streaming ASR.
- `--dont-fix-path`: set this flag if your csv contains the absolute path to the audio. Otherwise, leave it unset and edit the `fix()` function in `data.py` accordingly.
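The relationship between the total batch size and the per-GPU batch size can be sketched as below. This is only an illustration of the arithmetic, not the repository's code: the function name `accumulation_steps` is hypothetical, and the requirement that the total be an exact multiple of the per-pass batch is an assumption.

```python
def accumulation_steps(batch_size: int, bsz_small: int, num_gpus: int) -> int:
    """Return how many forward/backward passes precede one optimizer update.

    batch_size : total batch size per update (the --batch-size value)
    bsz_small  : batch size per GPU (the --bsz-small value)
    num_gpus   : number of GPUs taking part in training
    """
    # Samples processed across all GPUs in a single forward/backward pass.
    per_pass = bsz_small * num_gpus
    # Assumption: the total batch size divides evenly into per-pass batches.
    if batch_size % per_pass != 0:
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of "
            f"bsz_small * num_gpus = {per_pass}"
        )
    return batch_size // per_pass


# 4 GPUs with 4 samples each give 16 samples per pass,
# so a total batch size of 64 needs 4 accumulated passes.
print(accumulation_steps(batch_size=64, bsz_small=4, num_gpus=4))  # 4
```

When the result is 1, the per-pass batch already equals the total batch size and no accumulation is needed.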
Run the sbatch script: `sbatch job_submit.sh`.
- `--nnodes`: number of nodes to request.
- `--gpus`: number of GPUs per node.
The folder `sync` is required for distributed training (DDP), as we use a shared file system to synchronize training. Always remember to DELETE `sync/shared` BEFORE STARTING A NEW DDP INSTANCE; otherwise the training won't start.
We use a beam search variant proposed in [2].
- Run `mkdir asr_log` in the current path if running for the first time.
- `sbatch run_asr.sh` runs the decoding on 100 parallel nodes, each node decoding 1/100 of the test set.
- `bash run_decode.sh` is the single-node variant of the above, which can be used for debugging.
In the above scripts:
- `--test-path`: the folder containing 100 csv files numbered {0..99}.csv, in the same format as URL.
- `--decode-path`: where to write the decodes. It should be a folder (it will be created if it does not exist).
- `--unidirectional`: set this flag if the transcription network is a unidirectional LSTM.
- `--dont-fix-path`: set this flag if your csv contains the absolute path to the audio. Otherwise, leave it unset.
Other hyperparameters in the training and decoding scripts are self-explanatory. See the `--help` text in the argument definitions in `main.py` and `decode.py` for more details.
- In `compute_wer.sh`, change `PTH` to the path of the folder containing the decodes (see above).
- Run `bash compute_wer.sh`.

The Word Error Rate will be computed and written to the end of the file `${PTH}/full.txt`, which also contains "ground truth ----> hypothesis" for all utterances in the test set.
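For reference, Word Error Rate is the word-level edit distance (substitutions, insertions and deletions) between hypothesis and ground truth, summed over the test set and divided by the total number of ground-truth words. The sketch below illustrates that definition only. It is not the implementation used by `compute_wer.sh`, and the function names and whitespace tokenization are assumptions.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    # prev[j] = distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]  # deleting all i reference words to match an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = cur
    return prev[-1]


def word_error_rate(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(refs, hyps):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# One substitution ("sat" -> "sit") in a three-word reference gives WER = 1/3.
print(word_error_rate(["the cat sat"], ["the cat sit"]))
```

Note that WER can exceed 1.0 when the hypotheses contain many insertions.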
[1] Alex Graves, "Sequence transduction with recurrent neural networks", Representation Learning Workshop, ICML 2012.
[2] George Saon, Zoltán Tüske and Kartik Audhkhasi, "Alignment-length synchronous decoding for RNN transducer", ICASSP 2020.