Faster Whisper Server

faster-whisper-server is an OpenAI API-compatible transcription server which uses faster-whisper as its backend. Features:

  • GPU and CPU support.
  • Easily deployable using Docker.
  • Configurable through environment variables (see config.py).
  • OpenAI API compatible.
  • Streaming support: the transcription is sent via SSE (server-sent events) as the audio is transcribed, so you don't need to wait for the whole file to be processed before receiving text.
  • Live transcription support: audio is sent over a WebSocket as it is generated.
  • Dynamic model loading / offloading. Just specify which model you want to use in the request and it will be loaded automatically. It will then be unloaded after a period of inactivity.
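
The load-on-request, unload-when-idle behaviour described in the last bullet can be pictured with a small Python sketch. This is only an illustration of the idea, not the server's actual implementation: `IdleModelCache`, its `loader` callback, and the `ttl_seconds` parameter are hypothetical names introduced here.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class IdleModelCache:
    """Hypothetical cache: loads a model on first use, drops it once idle."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader      # called with the model name to load it
        self._ttl = ttl_seconds    # idle time after which a model is dropped
        self._clock = clock        # injectable so the logic can be tested
        self._models: Dict[str, Tuple[Any, float]] = {}

    def get(self, name: str) -> Any:
        """Return the named model, loading it if absent or already evicted."""
        self.evict_idle()
        entry = self._models.get(name)
        model = entry[0] if entry is not None else self._loader(name)
        # Record the time of this use so the idle timer restarts.
        self._models[name] = (model, self._clock())
        return model

    def evict_idle(self) -> List[str]:
        """Drop every model unused for at least ttl_seconds; return their names."""
        now = self._clock()
        expired = [
            name
            for name, (_, last_used) in self._models.items()
            if now - last_used >= self._ttl
        ]
        for name in expired:
            del self._models[name]
        return expired
```

In this sketch eviction only happens when `get` or `evict_idle` is called; a real server would more likely run the check on a timer.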

Please create an issue if you find a bug, have a question, or want to suggest a feature.

OpenAI API Compatibility ++

See OpenAI API reference for more information.

  • Audio file transcription via POST /v1/audio/transcriptions endpoint.
    • Unlike OpenAI's API, faster-whisper-server also supports streaming transcriptions (and translations). This is useful when you want to process large audio files and receive the transcription in chunks as they are produced, rather than waiting for the whole file to be transcribed. It works much like streamed chat responses from an LLM.
  • Audio file translation via POST /v1/audio/translations endpoint.
  • Live audio transcription via WS /v1/audio/transcriptions endpoint.
    • The LocalAgreement2 algorithm (paper | original implementation) is used for live transcription.
    • Only single-channel, raw, 16-bit little-endian audio at a 16 kHz sample rate is supported.
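
The audio format required for live transcription is the same sample layout that a mono, 16 kHz, 16-bit PCM WAV file stores after its header. As a rough sketch, the Python standard library's `wave` module can check a recording and strip the header before the samples are sent. The helper name `wav_to_raw_pcm` is hypothetical and not part of the server.

```python
import io
import wave


def wav_to_raw_pcm(wav_bytes: bytes) -> bytes:
    """Return the raw sample bytes of a WAV file, or raise if the format differs.

    Accepts only mono, 16 kHz, 16-bit PCM, whose samples are stored as
    signed little-endian integers.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected a single channel")
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16000 Hz sample rate")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        # readframes returns the sample data without the WAV header.
        return wav.readframes(wav.getnframes())
```

Recordings in any other format would need resampling or downmixing first, for example with ffmpeg.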

Quick Start

Hugging Face Space

Using Docker

docker run --gpus=all --publish 8000:8000 --volume ~/.cache/huggingface:/root/.cache/huggingface fedirz/faster-whisper-server:latest-cuda
# or
docker run --publish 8000:8000 --volume ~/.cache/huggingface:/root/.cache/huggingface fedirz/faster-whisper-server:latest-cpu

Using Docker Compose

curl -sO https://raw.githubusercontent.com/fedirz/faster-whisper-server/master/compose.yaml
docker compose up --detach faster-whisper-server-cuda
# or
docker compose up --detach faster-whisper-server-cpu

Using Kubernetes: tutorial

Usage

If you are looking for a step-by-step walkthrough, check out this YouTube video.

OpenAI API CLI

export OPENAI_API_KEY="cant-be-empty"
export OPENAI_BASE_URL=http://localhost:8000/v1/
openai api audio.transcriptions.create -m Systran/faster-distil-whisper-large-v3 -f audio.wav --response-format text

openai api audio.translations.create -m Systran/faster-distil-whisper-large-v3 -f audio.wav --response-format verbose_json

OpenAI API Python SDK

from openai import OpenAI

# Placeholder key: the SDK rejects an empty value.
client = OpenAI(api_key="cant-be-empty", base_url="http://localhost:8000/v1/")

# Open the file in a context manager so it is closed after the request.
with open("audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="Systran/faster-distil-whisper-large-v3", file=audio_file
    )
print(transcript.text)

cURL

# If `model` isn't specified, the default model is used
curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav"
# Stream the transcription back as it is produced
curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "stream=true"
curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "model=Systran/faster-distil-whisper-large-v3"
# It's recommended that you always specify the language, as that reduces the transcription time
curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "language=en"

curl http://localhost:8000/v1/audio/translations -F "file=@audio.wav"
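
With `stream=true`, the response arrives as server-sent events rather than a single body. The following Python sketch shows one way to pull the `data` payloads out of such a stream. `iter_sse_data` is a hypothetical helper, not part of the server or the OpenAI SDK. Whether each payload is plain text or JSON is an assumption to check against your server's actual output.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event in a line stream.

    Events are separated by blank lines; multiple `data:` lines within one
    event are joined with newlines; lines starting with ':' are comments.
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data":
            buffer.append(value)
    # Lenient choice: also emit a final event that lacks a trailing blank line.
    if buffer:
        yield "\n".join(buffer)
```

The `lines` argument can come from any HTTP client that iterates over the response body line by line, for example `requests` with `stream=True` and `response.iter_lines(decode_unicode=True)`.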

Live Transcription (using WebSocket)

From live-audio example

Live transcription of audio from a microphone. This requires websocat to be installed.

ffmpeg -loglevel quiet -f alsa -i default -ac 1 -ar 16000 -f s16le - | websocat --binary ws://localhost:8000/v1/audio/transcriptions
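
A Python client would do the same job as the pipeline above: send raw 16 kHz, 16-bit mono samples to the WebSocket endpoint as binary frames. The sketch below covers only the chunking step and uses no third-party packages. The helper name `chunk_pcm` and the 100 ms default are assumptions made for illustration, since the source does not state a required frame size.

```python
from typing import Iterator

SAMPLE_RATE = 16_000   # samples per second, as required by the endpoint
BYTES_PER_SAMPLE = 2   # 16-bit samples


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Split raw PCM bytes into chunks of roughly chunk_ms milliseconds each.

    The final chunk may be shorter than the others.
    """
    chunk_size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    if chunk_size <= 0:
        raise ValueError("chunk_ms must be positive")
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]


# Each chunk would then be sent as one binary WebSocket message, for
# example with a client library such as `websockets` (an assumption; any
# WebSocket client that can send bytes should work).
```

Smaller chunks reduce latency but produce more WebSocket messages, so a suitable size depends on the use case.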