Quick Start of Speech-to-Text

Several shell scripts provided in ./examples/tiny/local let you quickly try most major modules, including data preparation, model training, case inference, and model evaluation, with a few public datasets (e.g. LibriSpeech, Aishell). Reading these examples will also help you understand how to make it work with your own data.

Some of the scripts in ./examples are not configured with GPUs. To train with 8 GPUs, set CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7. If you don't have a GPU available, set CUDA_VISIBLE_DEVICES= (empty) to use CPUs instead. If an out-of-memory error occurs, reduce batch_size to fit.

Let's take a tiny sampled subset of the LibriSpeech dataset as an example.

Go to directory
```
cd examples/tiny
```
Note that this is only a toy example using a tiny sampled subset of LibriSpeech. To try the complete dataset (training takes several days), go to examples/librispeech instead.
Source env
```
source path.sh
```
Do this before anything else. The script sets MAIN_ROOT to the project directory and uses the default deepspeech2 model as MODEL; you can change this in the script.
Main entrypoint
```
bash run.sh
```
This is just a demo; please make sure each step works before moving on to the next.

More detailed information is provided in the following sections. We wish you a happy journey with the DeepSpeech on PaddlePaddle ASR engine!

Training a model

The key steps of training for Mandarin are the same as those for English, and we also provide an example for Mandarin training with Aishell in examples/aishell/local. As mentioned above, please execute sh data.sh, sh train.sh and sh test.sh to do data preparation, training, and testing, respectively.

Evaluate a Model

To evaluate a model's performance quantitatively, please run:

CUDA_VISIBLE_DEVICES=0 bash local/test.sh

The error rate (default: word error rate; can be set with error_rate_type) will be printed.
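
To make the printed number concrete, here is a minimal sketch of how a word error rate is conventionally computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function names are hypothetical and this is not the project's own implementation; the evaluation script may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Illustrative WER: word-level edits divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.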

We provide two types of CTC decoders: a CTC greedy decoder and a CTC beam search decoder. The CTC greedy decoder implements the simple best-path decoding algorithm: it selects the most likely token at each timestep, so it is greedy and only locally optimal. The CTC beam search decoder, by contrast, uses a heuristic breadth-first graph search to reach a near-globally-optimal result; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the argument decoding_method.
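
The best-path idea behind the greedy decoder can be sketched in a few lines: take the argmax token at every timestep, merge consecutive repeats, then drop blanks. This is an illustrative sketch, not the project's decoder; the function name, the toy vocabulary, and the choice of blank index are assumptions made for the example.

```python
def ctc_greedy_decode(probs, vocabulary, blank_id):
    """Best-path CTC decoding over a list of per-timestep probability rows.

    probs:      list of rows, one per timestep, each a list of token probabilities
    vocabulary: list mapping non-blank token ids to characters
    blank_id:   index of the CTC blank token (assumed here; check your model)
    """
    # Step 1: pick the most likely token id at each timestep.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]

    # Step 2: collapse consecutive repeats, then remove blanks.
    tokens = []
    prev = None
    for idx in best_path:
        if idx != prev and idx != blank_id:
            tokens.append(vocabulary[idx])
        prev = idx
    return "".join(tokens)


if __name__ == "__main__":
    vocab = ["a", "b"]  # toy vocabulary; index 2 is treated as blank
    probs = [
        [0.6, 0.3, 0.1],  # a
        [0.7, 0.2, 0.1],  # a (repeat, merged with the previous step)
        [0.1, 0.1, 0.8],  # blank (separates the two runs of "a")
        [0.8, 0.1, 0.1],  # a
        [0.2, 0.7, 0.1],  # b
    ]
    print(ctc_greedy_decode(probs, vocab, blank_id=2))
```

The blank between the two runs of "a" is what lets CTC emit a doubled character; without it, the repeats would collapse into a single "a".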
