
Commit

new engines
kenarsa authored Jan 26, 2022
1 parent 74207b4 commit 8f8a6b5
Showing 24 changed files with 754 additions and 367 deletions.
4 changes: 2 additions & 2 deletions .gitignore
```diff
@@ -1,4 +1,4 @@
 .idea/
 __pycache__/
-resources/data/
-resources/deepspeech/
+res/deepspeech/
+res/LibriSpeech
```
267 changes: 119 additions & 148 deletions README.md

Made in Vancouver, Canada by [Picovoice](https://picovoice.ai)

This repository is a minimalist and extensible framework for benchmarking different speech-to-text engines.

## Table of Contents

- [Data](#data)
- [Metrics](#metrics)
- [Engines](#engines)
- [Usage](#usage)
- [Results](#results)

## Data

The following datasets are used for benchmarking:

- [LibriSpeech](https://www.openslr.org/12/)
- [TED-LIUM](https://www.openslr.org/7/)
- [Common Voice](https://commonvoice.mozilla.org/en)

## Metrics

This benchmark considers three metrics: word error rate, real-time factor, and model size.

### Word Error Rate

Word error rate (WER) is the ratio of the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
between the words in a reference transcript and the words in the output of the speech-to-text engine to the number of
words in the reference transcript.
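
Concretely, the WER is a word-level Levenshtein distance divided by the reference length. A minimal sketch (not the benchmark's actual implementation, just an illustration of the metric):

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j] (rolling row).
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[len(hyp)] / len(ref)
```

For example, `word_error_rate("the cat sat", "the cat sit")` is 1/3: one substitution over three reference words.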

### Real Time Factor

Real-time factor (RTF) is the ratio of CPU (processing) time to the length of the input speech file. A speech-to-text
engine with a lower RTF is more computationally efficient. We omit this metric for cloud-based engines.
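
The measurement can be sketched as follows; `transcribe` stands in for any engine's processing function and is an assumption for illustration, not part of this repo's API:

```python
import time
import wave

def real_time_factor(transcribe, wav_path):
    """RTF = CPU time spent transcribing / duration of the audio."""
    with wave.open(wav_path, "rb") as f:
        audio_sec = f.getnframes() / f.getframerate()
    start = time.process_time()   # CPU time, not wall-clock time
    transcribe(wav_path)
    cpu_sec = time.process_time() - start
    return cpu_sec / audio_sec
```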

### Model Size

The aggregate size of models (acoustic and language), in MB. We omit this metric for cloud-based engines.
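
As an illustration, the aggregate size is simply the sum of the model files' sizes on disk (the file paths passed in are hypothetical):

```python
import os

def aggregate_model_size_mb(paths):
    """Sum the on-disk sizes of model files, reported in (decimal) megabytes."""
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return total_bytes / 1e6
```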

## Engines

- [Amazon Transcribe](https://aws.amazon.com/transcribe/)
- [Azure Speech-to-Text](https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/)
- [Google Speech-to-Text](https://cloud.google.com/speech-to-text)
- [IBM Watson Speech-to-Text](https://www.ibm.com/ca-en/cloud/watson-speech-to-text)
- [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech)
- [Picovoice Cheetah](https://picovoice.ai/)
- [Picovoice Leopard](https://picovoice.ai/)

## Usage

This benchmark has been developed and tested on `Ubuntu 20.04`.

- Install [FFmpeg](https://www.ffmpeg.org/).
- Download the datasets.
- Install the requirements:

```console
pip3 install -r requirements.txt
```
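
For example, the LibriSpeech [test-clean](https://www.openslr.org/resources/12/test-clean.tar.gz) portion can be fetched and unpacked with a short Python sketch; the `res` target folder is an assumption (it matches the `res/LibriSpeech` entry in this repo's `.gitignore`):

```python
import tarfile
import urllib.request

# URL of the LibriSpeech test-clean split (see https://www.openslr.org/12/)
TEST_CLEAN_URL = "https://www.openslr.org/resources/12/test-clean.tar.gz"

def download_and_extract(url, target_dir):
    """Fetch a .tar.gz archive and unpack it under target_dir."""
    archive, _ = urllib.request.urlretrieve(url)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)

# download_and_extract(TEST_CLEAN_URL, "res")  # yields res/LibriSpeech/test-clean
```

Pass the extracted path to the `--dataset-folder` argument in the commands below.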

### Amazon Transcribe Instructions

Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and
`${AWS_PROFILE}` with the name of the AWS profile you wish to use.

```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AMAZON_TRANSCRIBE \
--aws-profile ${AWS_PROFILE}
```

### Azure Speech-to-Text Instructions

Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and
`${AZURE_SPEECH_KEY}` and `${AZURE_SPEECH_LOCATION}` with the credentials from your Azure account.

```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-speech-key ${AZURE_SPEECH_KEY} \
--azure-speech-location ${AZURE_SPEECH_LOCATION}
```

### Google Speech-to-Text Instructions

Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and
`${GOOGLE_APPLICATION_CREDENTIALS}` with the credentials file downloaded from Google Cloud Platform.

```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--google-application-credentials ${GOOGLE_APPLICATION_CREDENTIALS}
```

### IBM Watson Speech-to-Text Instructions

Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and
`${WATSON_SPEECH_TO_TEXT_API_KEY}` / `${WATSON_SPEECH_TO_TEXT_URL}` with the credentials from your IBM account.

```console
python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine IBM_WATSON_SPEECH_TO_TEXT \
--watson-speech-to-text-api-key ${WATSON_SPEECH_TO_TEXT_API_KEY} \
--watson-speech-to-text-url ${WATSON_SPEECH_TO_TEXT_URL}
```

### Mozilla DeepSpeech Instructions

Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder,
`${DEEP_SPEECH_MODEL}` with the path to the DeepSpeech model file (`.pbmm`), and `${DEEP_SPEECH_SCORER}` with the path
to the DeepSpeech scorer file (`.scorer`).

```console
python3 benchmark.py \
--engine MOZILLA_DEEP_SPEECH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--deepspeech-pbmm ${DEEP_SPEECH_MODEL} \
--deepspeech-scorer ${DEEP_SPEECH_SCORER}
```

### Picovoice Cheetah Instructions

Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and
`${PICOVOICE_ACCESS_KEY}` with the AccessKey obtained from [Picovoice Console](https://console.picovoice.ai/).

```console
python3 benchmark.py \
--engine PICOVOICE_CHEETAH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}
```

### Picovoice Leopard Instructions

Replace `${DATASET}` with one of the supported datasets, `${DATASET_FOLDER}` with the path to the dataset folder, and
`${PICOVOICE_ACCESS_KEY}` with the AccessKey obtained from [Picovoice Console](https://console.picovoice.ai/).

```console
python3 benchmark.py \
--engine PICOVOICE_LEOPARD \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}
```

## Results

### Accuracy

The table below reports the word error rate of each engine on each dataset.
| Engine | LibriSpeech test-clean | LibriSpeech test-other | TED-LIUM | CommonVoice | Average |
|:--------------------------------:|:----------------------:|:----------------------:|:--------:|:-----------:|:-------:|
| Amazon Transcribe | 5.20% | 9.58% | 4.25% | 15.94% | 8.74% |
| Azure Speech-to-Text | 4.96% | 9.66% | 4.99% | 12.09% | 7.93% |
| Google Speech-to-Text | 11.23% | 24.94% | 15.00% | 30.68% | 20.46% |
| Google Speech-to-Text (Enhanced) | 6.62% | 13.59% | 6.68% | 18.39% | 11.32% |
| IBM Watson Speech-to-Text | 11.08% | 26.38% | 11.89% | 38.81% | 22.04% |
| Mozilla DeepSpeech | 7.27% | 21.45% | 18.90% | 43.82% | 22.86% |
| Picovoice Cheetah | 7.08% | 16.28% | 10.89% | 23.10% | 14.34% |
| Picovoice Leopard | 5.39% | 12.45% | 9.04% | 17.13% | 11.00% |
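
The `Average` column appears to be the unweighted mean of the four per-dataset error rates; a quick check for the Amazon Transcribe row:

```python
# Per-dataset WER for Amazon Transcribe, copied from the table above.
wer = {
    "LibriSpeech test-clean": 5.20,
    "LibriSpeech test-other": 9.58,
    "TED-LIUM": 4.25,
    "CommonVoice": 15.94,
}
average = sum(wer.values()) / len(wer)
print(f"{average:.2f}%")  # 8.74%, matching the Average column
```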

### RTF

Measurements were carried out on an Ubuntu 20.04 machine with an Intel CPU (`Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz`),
64 GB of RAM, and NVMe storage.

| Engine | RTF | Model Size |
|:------------------:|:----:|:----------:|
| Mozilla DeepSpeech | 0.46 | 1142 MB |
| Picovoice Cheetah | 0.07 | 19 MB |
| Picovoice Leopard | 0.05 | 19 MB |

![](res/summary.png)