- LibriSpeech: Common English speech benchmark
- FLEURS: Multilingual speech
- AMI Meetings: Meeting recordings
- https://github.com/openai/whisper
- Community Features:
- Transcription and diarization (speaker identification)
- Streaming (real-time)
- C++ port (lightweight)
- Limitation:
- Hallucination: openai/whisper#679
- WER benchmarking
- https://github.com/Picovoice/speech-to-text-benchmark
Word error rate (WER) is the word-level edit distance between the reference transcript and the output of the speech-to-text engine, divided by the number of words in the reference transcript. A lower WER means a more accurate transcript.
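The WER definition can be illustrated with a minimal sketch. It assumes whitespace tokenization and no text normalization (casing, punctuation), which real benchmarks usually apply first; the function name `word_error_rate` is hypothetical and not taken from any of the linked repositories.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deletion against six reference words, giving 1/6. Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference.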
Real-time factor (RTF) is the ratio of CPU (processing) time to the length of the input speech file. A speech-to-text engine with lower RTF is more computationally efficient. We omit this metric for cloud-based engines.
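As a rough sketch of how RTF could be measured for a local engine: the helper below times an arbitrary callable with `time.process_time()`, which reports CPU time for the current process. The names `real_time_factor` and `transcribe` are hypothetical, and the sketch assumes the engine runs in-process (CPU time spent in subprocesses or on a GPU would not be captured).

```python
import time
from typing import Callable


def real_time_factor(transcribe: Callable[[], object], audio_seconds: float) -> float:
    """CPU time spent in `transcribe` divided by the audio duration in seconds."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.process_time()  # CPU time (user + system) of this process
    transcribe()                 # assumption: runs the engine on one audio file
    cpu_seconds = time.process_time() - start
    return cpu_seconds / audio_seconds
```

An RTF below 1 means the engine needs less CPU time than the audio lasts; for instance, 6 seconds of CPU time on a 60-second file is an RTF of 0.1.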
Model size is the aggregate size of the models (acoustic and language), in MB. We omit this metric for cloud-based engines.