New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856
base: master
Conversation
PR: Changes to faster-whisper project for asr v2.1 based on latest faster_whisper (0.9.0)
SDK v3.0 does not work with the latest numpy version (1.26.0), and faster-whisper won't work with numpy < 1.21.6
Updating the base faster-whisper to 0.10.0
Support for Batched inference and language detection from multiple segments in faster-whisper
Updating the base directory
LGTM; please fix a few minor coding-convention warnings.
Done, we are good to go!
fix usage with english-only models
Are there any licensing concerns with bringing this in from whisper-x? The whisper-x license is more restrictive than the MIT license faster-whisper is under.
Using batching or the HF pipeline is a generic idea for any Whisper model that supports batching; only some portions of the VAD segmentation are specific to whisper-x. @trungkienbkhn Could you please let us know SYSTRAN's response on this? It would be great if the author of whisper-X could provide a waiver. If there are legal issues, we can switch to Silero- or NVIDIA-based open-source VAD models.
I've been researching the BSD-4-Clause license. This license allows the use, copying, and modification of code for development purposes, but you should add this license at the beginning of any code file that uses whisper-x's code (vad.py): # Copyright (c) 2022, Max Bain
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright
...
# The code below is copied from whisper-x (https://github.com/m-bain/whisperX)
# and adapted for faster_whisper.
class SegmentX:
...
The developer Max Bain informed via email that:
Doesn't the license carry forward to users of faster-whisper as well, i.e. won't the attribution clause be needed by anyone using this project?
added licensing comments in the doc and the code
faster_whisper/vad.py (outdated)
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# 3. All advertising materials mentioning features or use of this software
This is going to have to be carried forward into the Faster Whisper license - which is currently MIT. I don't see how this is compatible with MIT.
added formatting checks
Please note that the author of whisper-x has changed the license to BSD-2-Clause. So, with proper attribution in the code, it will be possible to use it. Under this license, you don't have to mention the name of the developer or the software in advertising or marketing materials. I will modify the doc accordingly and update here.
Thank you for working through this!
update license info
Implement changes in review request
Hello everyone,
This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!
Speed improvements:
Batching support: Inspired by whisper-x, this update introduces batching support, allowing for a 3x speed increase. The implementation builds on whisper-x and supports more run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.
Faster feature extraction: We've incorporated a torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing additional speed boosts. With the `enable_ta_fe` flag, the final version achieves an impressive 104x real-time speed. This is up to 12.5x faster on average than the OpenAI implementation! Using the batched version is straightforward:
Quality Improvements
Language detection usage:
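The usage snippet is omitted above. Conceptually, the PR detects the language from multiple segments rather than only the first 30 seconds; a minimal sketch of that aggregation step (hypothetical function and data, not the PR's actual code) is to average per-segment language probabilities and pick the top language:

```python
# Hypothetical sketch: aggregate language detection across several segments
# instead of trusting a single 30-second window.
from collections import Counter

def detect_language(per_segment_probs):
    """Average per-segment language probabilities, return the top language."""
    totals = Counter()
    for probs in per_segment_probs:
        totals.update(probs)  # Counter adds values per language key
    n = len(per_segment_probs)
    return max(totals, key=lambda lang: totals[lang] / n)

# Three segments; the middle one is briefly misdetected as German:
per_segment = [
    {"en": 0.9, "de": 0.1},
    {"en": 0.4, "de": 0.6},
    {"en": 0.8, "de": 0.2},
]
print(detect_language(per_segment))  # en
```

Averaging over several segments makes the detection robust to a single noisy or code-switched chunk.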
Benchmarking:
A. Open source benchmarking:
Open_asr_eval consists solely of short-form audio, with an average duration generally under 10 seconds. Hence, using a subset of the YouTube-Commons dataset, we've tested more complex use cases with long-form audio. The Whisper-medium model is used (with batch size = 8 for batched versions) for the experiments. The dataset card of youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.
Speed (x real-time):
WER:
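The WER figures themselves are not reproduced here. For reference, word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words; a minimal stdlib implementation (production benchmarks typically use a library such as jiwer) looks like:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[-1][-1] / len(r)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```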
B. Internal dataset:
Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. The test set contains 9 audio files ranging from 3 to 13 minutes, covering various audio types.
Batched processing speeds up long-form audio without causing an increase in WER. Users can easily switch between sequential and batched Faster Whisper versions based on specific requirements.
Thank you in advance!
Acknowledgements
This is the work done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.