New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856

Open. Wants to merge 145 commits into base: master.

Conversation

@Jiltseb commented May 24, 2024

Hello everyone,

This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!

Speed improvements:

  • Batching support: Inspired by whisper-x, this update introduces batching support, allowing for roughly a 3x speed increase. The implementation builds on whisper-x and supports additional run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.

  • Faster feature extraction: We've incorporated torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing an additional speed boost. With the enable_ta_fe flag, the final version achieves 104x real-time speed, up to 12.5x faster on average than the OpenAI implementation (a minimal sketch of this style of feature extraction follows the list).
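
For context, a torchaudio front end computes Whisper-style log-mel features with a batched STFT. The following is a minimal sketch of the idea, not the PR's exact code; the parameter values follow the standard Whisper front end (16 kHz audio, 25 ms window, 10 ms hop, 80 mel bins):

import torch
import torchaudio

# Whisper front-end parameters: 16 kHz, n_fft=400 (25 ms), hop_length=160 (10 ms)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)

def log_mel(audio: torch.Tensor) -> torch.Tensor:
    # audio: (batch, num_samples) float tensor sampled at 16 kHz
    mel = mel_transform(audio)                       # (batch, 80, frames)
    log_spec = torch.clamp(mel, min=1e-10).log10()   # log compression
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)  # clip dynamic range
    return (log_spec + 4.0) / 4.0                    # Whisper-style normalization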

Using the batched version is straightforward:

from faster_whisper import WhisperModel, BatchedInferencePipeline

# load the faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16")

# wrap it in the batched inference pipeline
batched_model = BatchedInferencePipeline(model=model)

# transcribe returns the segments and the transcription info
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Quality improvements:

  1. Consistency across runs: Setting the model seed makes results consistent across runs.
  2. Reducing hallucinations: Stricter checks in the inference pipeline reduce unstructured or repeated phrases.
  3. Reliable language detection: A new function detects the language more reliably by considering highly confident and randomly sampled segments, breaking ties to determine the majority language (a sketch of the voting idea follows this list).
  4. Code-switching support: Handles audio with multiple languages by detecting the language every 30 seconds and dynamically directing the data flow. Since the exact switching position is unknown, the detected switch point can be off by up to one 30-second segment.
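
As an illustration of the voting idea in item 3, here is a hypothetical sketch (not the PR's actual code): detect a language per segment, keep the confident predictions, and pick the most frequent one.

from collections import Counter

def majority_language(segment_predictions, confidence_threshold=0.8):
    # segment_predictions: list of (language_code, confidence) pairs, one per segment
    confident = [lang for lang, conf in segment_predictions if conf >= confidence_threshold]
    # fall back to all predictions if none pass the threshold
    votes = Counter(confident or [lang for lang, _ in segment_predictions])
    return votes.most_common(1)[0][0]

print(majority_language([("en", 0.95), ("en", 0.91), ("de", 0.55)]))  # -> "en"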

Language detection usage:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
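
The returned language_info contains the detected language and its confidence; the field names below assume the PR's implementation and may differ in the merged version:

print(language_info["language_code"], language_info["language_confidence"])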

Benchmarking:

A. Open source benchmarking:

Open_asr_eval consists solely of short-form audio, with an average duration generally under 10 seconds. Hence, we've tested more complex, long-form use cases using a subset of the YouTube-Commons dataset. The Whisper-medium model is used (with batch size = 8 for the batched versions) in these experiments. The dataset card for youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.

Speed (x real-time):

System                   GPU speed   CPU speed
OpenAI Whisper           8.2x        4.5x
faster-whisper           20.1x       5.6x
HF Whisper (batched)     59.3x       8.4x
Batched Faster-Whisper   104x        14.6x

WER:

System                   WER
OpenAI Whisper           15.1
faster-whisper           14.6
HF Whisper (batched)     16.8
Batched Faster-Whisper   13.1
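
For reference, WER is the standard word error rate: substitutions, deletions, and insertions divided by the number of reference words. A quick way to compute it independently (with the jiwer package, which is not part of this PR) is:

import jiwer

# one substituted word out of four reference words -> WER = 0.25
print(jiwer.wer("the quick brown fox", "the quick brown dog"))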

B. Internal dataset:

Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. The test set contains 9 audio files ranging from 3 to 13 minutes and covering various audio types.

System                   WER   Speed
OpenAI Whisper           6.8   9.1x
faster-whisper           6.1   17.4x
HF Whisper (batched)     8.2   42.8x
Batched Faster-Whisper   6.5   86.6x

Batched processing speeds up long-form audio without increasing WER. Users can easily switch between the sequential and batched Faster Whisper versions based on their specific requirements.

Thank you in advance!

Acknowledgements

This work was done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.

Jiltseb added 30 commits June 9, 2023 13:52

  • PR: Changes to faster-whisper project for asr v2.1 based on latest faster_whisper (0.9.0)
  • SDK v3.0 does not work with latest numpy version (1.26.0) and faster-whisper won't work if numpy < 1.21.6
  • Updating the base faster-whisper to 0.10.0
  • Support for batched inference and language detection from multiple segments in faster-whisper
  • Updating the base directory
@trungkienbkhn (Collaborator) left a comment

LGTM, please fix a few minor warnings about coding conventions.

(Resolved review threads on faster_whisper/vad.py and faster_whisper/transcribe.py.)
@Jiltseb (Author) commented Jul 5, 2024

Done, we are good to go!

@felixthekraut commented:

Are there any licensing concerns bringing this in from whisper-x? The whisper-x license is more restrictive than the MIT license faster whisper is under.

@Jiltseb (Author) commented Jul 8, 2024

> Are there any licensing concerns bringing this in from whisper-x? The whisper-x license is more restrictive than the MIT license faster whisper is under.

Using batching or an HF pipeline is a generic idea for any whisper model that supports batching; only some portions of the VAD segmentation are specific to whisper-x. @trungkienbkhn Could you please let us know SYSTRAN's response on this? It would be great if the author of whisper-X could provide a waiver for this.

If there are legal issues, we can switch to Silero or NVIDIA-based open-source VAD models.

@trungkienbkhn (Collaborator) commented:

> Are there any licensing concerns bringing this in from whisper-x? The whisper-x license is more restrictive than the MIT license faster whisper is under.
>
> Using batching or HF pipeline is a generic idea on whisper model that supports batching. Only some portions of VAD segmentation are specific to whisper-x @trungkienbkhn Could you please let us know the response from SYSTRAN on this? Would be great if the author of whisper-X can provide a waiver to this.
>
> If there are legal issues, we can switch to Silero or nvidia based open source VAD models.

I've been researching the BSD-4-Clause license. This license allows the use, copying, and modification of code for development purposes, but you should add the license at the beginning of any code file that uses whisper-x's code (vad.py):

# Copyright (c) 2022, Max Bain
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright
...

# The code below is copied from whisper-x (https://github.com/m-bain/whisperX)
# and adapted for faster_whisper.

class SegmentX:
   ...

(Resolved review thread on README.md.)
@Jiltseb (Author) commented Jul 9, 2024

The developer Max Bain informed us via email that:
"Sure you can just use the modified version, just put some attribution in the VAD chunking file / batching section of the readme."

@felixthekraut commented:

> The developer Max Bain informed via email that: "Sure you can just use the modified version, just put some attribution in the VAD chunking file / batching section of the readme."

Doesn't the license carry forward to users of faster-whisper as well, i.e., won't the attribution clause be needed for anyone using this project?

# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# 3. All advertising materials mentioning features or use of this software


This is going to have to be carried forward into the Faster Whisper license, which is currently MIT. I don't see how this is compatible with MIT.

@Jiltseb (Author) commented Jul 11, 2024

> The developer Max Bain informed via email that: "Sure you can just use the modified version, just put some attribution in the VAD chunking file / batching section of the readme."
>
> Doesn't the license carry forward to users of faster whisper as well, i.e. the attribution clause will be needed for anyone using this project?

Please note that the author of whisper-x has changed the license to BSD-2-Clause. So, with proper attribution in the code, it is possible to use it. Under this license, you don't have to mention the name of the developer or the software in advertising or marketing materials. I will modify the doc accordingly and update here.

@felixthekraut commented:

> The developer Max Bain informed via email that: "Sure you can just use the modified version, just put some attribution in the VAD chunking file / batching section of the readme."
>
> Doesn't the license carry forward to users of faster whisper as well, i.e. the attribution clause will be needed for anyone using this project?
>
> Please note that the author of whisper-x has changed the license to BSD-Clause-2. So, with the proper attribution in the code, it will be possible to use it. As per this license, you don't have to mention the name of developer or software in advertising or marketing materials. I will modify the doc accordingly and update here.

Thank you for working through this!
