This directory contains scripts and tools for training custom speech recognition models with the Kaldi toolkit. It covers acoustic model training, language model creation, and decoding pipelines.
.
├── cmd.sh                    # Command configuration for training and decoding
├── conf/
│   ├── mfcc.conf             # Configuration for MFCC feature extraction
│   └── online_cmvn.conf      # Online Cepstral Mean Variance Normalization (currently empty)
├── local/
│   ├── chain/
│   │   ├── run_ivector_common.sh   # Script for i-vector extraction during chain model training
│   │   └── run_tdnn.sh             # Script for training a TDNN model
│   ├── data_prep.sh          # Data preparation script for creating Kaldi data directories
│   ├── download_and_untar.sh # Script for downloading and extracting datasets
│   ├── download_lm.sh        # Downloads language models
│   ├── prepare_dict.sh       # Prepares the pronunciation dictionary
│   └── score.sh              # Scoring script for evaluation
├── path.sh                   # Script for setting Kaldi paths
├── RESULTS                   # Script for printing the best WER results
├── RESULTS.txt               # Contains WER results from decoding
├── run.sh                    # Main script for the entire training pipeline
├── steps -> ../../wsj/s5/steps/   # Link to Kaldi’s WSJ steps for acoustic model training
└── utils -> ../../wsj/s5/utils/   # Link to Kaldi’s utility scripts
- cmd.sh: Defines commands for running training and decoding tasks.
- path.sh: Sets up paths for Kaldi binaries and scripts.
- run.sh: Main entry point for the training pipeline, running tasks in stages.
- RESULTS: Displays Word Error Rate (WER) for the trained models.
- Kaldi: The Kaldi toolkit must be installed and configured.
- Required tools: ffmpeg, sox, and sctk for data preparation and scoring (a quick availability check is sketched below).
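To confirm these tools are on your PATH before starting, a quick check such as the one below can be used (sctk's scoring binary is sclite; this snippet is a convenience, not part of the recipe):

```bash
# Fail fast if any required tool is missing from PATH
for tool in ffmpeg sox sclite; do
  command -v "$tool" >/dev/null 2>&1 || echo "Missing required tool: $tool"
done
```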
- Clone the Vosk API repository.
- Install Kaldi and ensure that KALDI_ROOT is set correctly in path.sh.
- Set environment variables using cmd.sh and path.sh (typical contents are sketched below).
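For reference, path.sh and cmd.sh in Kaldi recipes usually follow the pattern below; the KALDI_ROOT path and parallelization settings are illustrative and depend on your installation:

```bash
# path.sh (sketch): point Kaldi binaries and scripts at your installation
export KALDI_ROOT=`pwd`/../../..          # adjust to where Kaldi is installed
[ -f $KALDI_ROOT/tools/env.sh ] && . $KALDI_ROOT/tools/env.sh
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=C

# cmd.sh (sketch): run.pl runs jobs locally; use queue.pl or slurm.pl on a cluster
export train_cmd="run.pl --mem 2G"
export decode_cmd="run.pl --mem 4G"
export mkgraph_cmd="run.pl --mem 8G"
```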
Run the data preparation stage in run.sh:
bash run.sh --stage 0 --stop_stage 0
This stage downloads and prepares the LibriSpeech dataset.
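The data directories created here follow the standard Kaldi layout (wav.scp, text, utt2spk, spk2utt). A minimal sketch, with illustrative utterance/speaker IDs and a data/train directory name:

```bash
# data/train/wav.scp : utterance-id -> audio file, or a command producing WAV on stdout
#   spk1-utt1 flac -c -d -s /path/to/spk1-utt1.flac |
# data/train/text    : utterance-id -> transcript
#   spk1-utt1 HELLO WORLD
# data/train/utt2spk : utterance-id -> speaker-id
#   spk1-utt1 spk1

# Sort/repair the directory and check it is consistent before feature extraction:
utils/fix_data_dir.sh data/train
utils/validate_data_dir.sh --no-feats data/train
```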
Prepare the pronunciation dictionary with:
bash run.sh --stage 1 --stop_stage 1
This step generates the necessary files for Kaldi's prepare_lang.sh script.
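The generated dictionary directory is then consumed by Kaldi's standard utils/prepare_lang.sh; a typical invocation (the directory names and the <UNK> symbol are illustrative) looks like:

```bash
# data/local/dict must contain lexicon.txt, nonsilence_phones.txt,
# silence_phones.txt and optional_silence.txt, e.g. a lexicon entry:
#   hello  HH AH0 L OW1

# Build the lang directory used by all later stages:
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang_tmp data/lang
```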
Run the MFCC extraction process:
bash run.sh --stage 2 --stop_stage 2
This step extracts Mel-frequency cepstral coefficient (MFCC) features and computes Cepstral Mean and Variance Normalization (CMVN) statistics.
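Under the hood this stage calls the standard Kaldi feature scripts; a sketch of the calls and a minimal conf/mfcc.conf (job counts and directory names are illustrative):

```bash
# conf/mfcc.conf (example contents):
#   --use-energy=false
#   --sample-frequency=16000

# Extract MFCCs, then per-speaker CMVN statistics:
steps/make_mfcc.sh --cmd "$train_cmd" --nj 20 data/train exp/make_mfcc/train mfcc
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
utils/fix_data_dir.sh data/train
```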
Train monophone, LDA+MLLT, and SAT models:
bash run.sh --stage 3 --stop_stage 3
This stage trains GMM-based models and aligns the data for TDNN training.
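This follows the usual Kaldi GMM bootstrap sequence; a condensed sketch (leaf/Gaussian counts, job numbers, and directory names are illustrative):

```bash
# Monophone system, then alignments for the next model
steps/train_mono.sh --nj 20 --cmd "$train_cmd" data/train data/lang exp/mono
steps/align_si.sh --nj 20 --cmd "$train_cmd" data/train data/lang exp/mono exp/mono_ali

# LDA+MLLT triphone system
steps/train_lda_mllt.sh --cmd "$train_cmd" 2500 15000 data/train data/lang exp/mono_ali exp/tri1
steps/align_si.sh --nj 20 --cmd "$train_cmd" data/train data/lang exp/tri1 exp/tri1_ali

# Speaker-adapted (SAT) system, plus fMLLR alignments for the chain recipe
steps/train_sat.sh --cmd "$train_cmd" 2500 15000 data/train data/lang exp/tri1_ali exp/tri2
steps/align_fmllr.sh --nj 20 --cmd "$train_cmd" data/train data/lang exp/tri2 exp/tri2_ali
```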
Train a Time-Delay Neural Network (TDNN) chain model:
bash run.sh --stage 4 --stop_stage 4
The chain model uses i-vectors for speaker adaptation.
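The heavy lifting is done by the two scripts under local/chain/: run_ivector_common.sh builds the i-vector extractor and extracts online i-vectors, and run_tdnn.sh trains the chain TDNN on top of them. The core Kaldi call for i-vector extraction looks roughly like the sketch below (directory names are illustrative):

```bash
# Extract online i-vectors for the hi-res training data with a trained extractor
steps/online/nnet2/extract_ivectors_online.sh --cmd "$train_cmd" --nj 20 \
  data/train_hires exp/nnet3/extractor exp/nnet3/ivectors_train

# Then build the chain topology, the tree, and the TDNN itself:
local/chain/run_tdnn.sh
```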
After training, decode the test data:
bash run.sh --stage 5 --stop_stage 5
This step decodes using the trained model and evaluates the Word Error Rate (WER).
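For chain models, decoding usually builds a graph with utils/mkgraph.sh and decodes with steps/nnet3/decode.sh; a sketch of the underlying calls (LM/graph and directory names are illustrative, and the _rescore results in RESULTS.txt typically come from rescoring the lattices with a larger language model):

```bash
# Build the decoding graph; chain models use --self-loop-scale 1.0
utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test exp/chain/tdnn exp/chain/tdnn/graph

# Decode the test set with online i-vectors; scoring produces the wer_* files
steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 --cmd "$decode_cmd" --nj 20 \
  --online-ivector-dir exp/nnet3/ivectors_test \
  exp/chain/tdnn/graph data/test_hires exp/chain/tdnn/decode_test
```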
WER can be evaluated by running:
bash RESULTS
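The RESULTS script is typically a thin wrapper that greps the scoring output and keeps the best WER per decode directory; a sketch of the usual pattern:

```bash
# Print the best WER found in each decode directory under exp/
for d in exp/*/decode* exp/*/*/decode*; do
  [ -d "$d" ] && grep WER "$d"/wer_* 2>/dev/null | utils/best_wer.sh
done
```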
Example of RESULTS.txt:
%WER 14.10 [ 2839 / 20138, 214 ins, 487 del, 2138 sub ] exp/chain/tdnn/decode_test/wer_11_0.0
%WER 12.67 [ 2552 / 20138, 215 ins, 406 del, 1931 sub ] exp/chain/tdnn/decode_test_rescore/wer_11_0.0