End-to-end text to speech system using gruut and onnx. There are 40 voices available across 8 languages.
$ docker run -it -p 5002:5002 rhasspy/larynx:en-us
Larynx's goals are:
- "Good enough" synthesis to avoid using a cloud service
- Faster than realtime performance on a Raspberry Pi 4 (with low quality vocoder)
- Broad language support (8 languages)
- Voices trained purely from public datasets
Listen to voice samples from all of the pre-trained models.
Pre-built Docker images for each language are available for the following platforms:
linux/amd64
- desktop/laptop/serverlinux/arm64
- Raspberry Pi 64-bitlinux/arm/v7
- Raspberry Pi 32-bit
Run the Larynx web server with:
$ docker run -it -p 5002:5002 rhasspy/larynx:<LANG>
where <LANG>
is one of:
de-de
- Germanen-us
- U.S. Englishes-es
- Spanishfr-fr
- Frenchit-it
- Italiannl
- Dutchru-ru
- Russiansv-se
- Swedish
Visit https://localhost:5002 for the test page. See https://localhost:5002/openapi/ for HTTP endpoint documentation.
A larger docker image with all languages is also available as rhasspy/larynx
Pre-built Debian packages are available for download.
There are three different kinds of packages, so you can install exactly what you want and no more:
larynx-tts_<VERSION>_<ARCH>.deb
- Base Larynx code and dependencies (always required)
ARCH
is one ofamd64
(most desktops, laptops),armhf
(32-bit Raspberry Pi),arm64
(64-bit Raspberry Pi)
larynx-tts-lang-<LANG>_<VERSION>_all.deb
- Language-specific data files (at least one required)
- See above for a list of languages
larynx-tts-voice-<VOICE>_<VERSION>_all.deb
- Voice-specific model files (at least one required)
- See samples to decide which voice(s) to choose
As an example, let's say you want to use the "harvard-glow_tts" voice for English on an amd64
laptop for Larynx version 0.4.0.
You would need to download these files:
larynx-tts_0.4.0_amd64.deb
larynx-tts-lang-en-us_0.4.0_all.deb
larynx-tts-voice-en-us-harvard-glow-tts_0.4.0_all.deb
Once downloaded, you can install the packages all at once with:
sudo apt install \
./larynx-tts_0.4.0_amd64.deb \
./larynx-tts-lang-en-us_0.4.0_all.deb \
./larynx-tts-voice-en-us-harvard-glow-tts_0.4.0_all.deb
From there, you may run the larynx
command or larynx-server
to start the web server.
$ pip install larynx
For Raspberry Pi (ARM), you will first need to manually install phonetisaurus.
For 32-bit ARM systems, a pre-built onnxruntime wheel is available (official 64-bit wheels are available in PyPI).
Larynx uses gruut to transform text into phonemes. You must install the appropriate gruut language before using Larynx. U.S. English is included with gruut, but for other languages:
$ python3 -m gruut <LANGUAGE> download
Voices and vocoders are available to download from the release page. They can be extracted anywhere, and the directory simply needs to be referenced in the command-line (e,g, --voices-dir /path/to/voices
).
You can run a local web server with:
$ python3 -m larynx.server --voices-dir /path/to/voices
Visit https://localhost:5002 to view the site and try out voices. See https://localhost/5002/openapi for documentation on the available HTTP endpoints.
The following default settings can be applied (for when they're not provided in an API call):
--quality
- vocoder quality (high/medium/low, default: high)--noise-scale
- voice volatility (0-1, default: 0.333)--length-scale
- voice speed (<1 is faster, default: 1.0)
You may also set --voices-dir
to change where your voices/vocoders are stored. The directory structure should be <language>/<voice>
.
See --help
for more options.
To use Larynx as a drop-in replacement for a MaryTTS server (e.g., for use with Home Assistant), run:
$ docker run -it -p 59125:5002 rhasspy/larynx:<LANG>
The /process
HTTP endpoint should now work for voices formatted as <LANG>/<VOICE>
such as en-us/harvard-glow_tts
.
You can specify the vocoder by adding ;<VOCODER>
to the MaryTTS voice.
For example: en-us/harvard-glow_tts;hifi_gan:vctk_small
will use the lowest quality (but fastest) vocoder. This is usually necessary to get decent performance on a Raspberry Pi.
Available vocoders are:
hifi_gan:universal_large
(best quality, slowest, default)hifi_gan:vctk_medium
(medium quality)hifi_gan:vctk_small
(lowest quality, fastest)
The command below synthesizes multiple sentences and saves them to a directory. The --csv
command-line flag indicates that each sentence is of the form id|text
where id
will be the name of the WAV file.
$ cat << EOF |
s01|The birch canoe slid on the smooth planks.
s02|Glue the sheet to the dark blue background.
s03|It's easy to tell the depth of a well.
s04|These days a chicken leg is a rare dish.
s05|Rice is often served in round bowls.
s06|The juice of lemons makes fine punch.
s07|The box was thrown beside the parked truck.
s08|The hogs were fed chopped corn and garbage.
s09|Four hours of steady work faced us.
s10|Large size in stockings is hard to sell.
EOF
larynx \
--debug \
--csv \
--voice harvard-glow_tts \
--quality high \
--output-dir wavs \
--denoiser-strength 0.001
You can use the --interactive
flag instead of --output-dir
to type sentences and have the audio played immediately using the play
command from sox
.
The GlowTTS voices support two additional parameters:
--noise-scale
- determines the speaker volatility during synthesis (0-1, default is 0.333)--length-scale
- makes the voice speaker slower (> 1) or faster (< 1)
--denoiser-strength
- runs the denoiser if > 0; a small value like 0.005 is recommended.
$ larynx --list
- GlowTTS (40 voices)
- English (
en-us
, 21 voices)- blizzard_fls (F, accent, Blizzard)
- cmu_aew (M, Arctic)
- cmu_ahw (M, Arctic)
- cmu_aup (M, accent, Arctic)
- cmu_bdl (M, Arctic)
- cmu_clb (F, Arctic)
- cmu_eey (F, Arctic)
- cmu_fem (M, Arctic)
- cmu_jmk (M, Arctic)
- cmu_ksp (M, accent, Arctic)
- cmu_ljm (F, Arctic)
- cmu_lnh (F, Arctic)
- cmu_rms (M, Arctic)
- cmu_rxr (M, Arctic)
- cmu_slp (F, accent, Arctic)
- cmu_slt (F, Arctic)
- ek (F, accent, M-AILabs)
- harvard (F, accent, CC/Attr/NC)
- kathleen (F, CC0)
- ljspeech (F, Public Domain)
- mary_ann (F, M-AILabs)
- German (
de-de
, 5 voices)- thorsten (M, CC0)
- eva_k (F, M-AILabs)
- karlsson (M, M-AILabs)
- rebecca_braunert_plunkett (F, M-AILabs)
- pavoque (M, CC4/BY/NC/SA)
- French (
fr-fr
, 3 voices) - Spanish (
es-es
, 2 voices)- carlfm (M, public domain)
- karen_savage (F, M-AILabs)
- Dutch (
nl
, 3 voices) - Italian (
it-it
, 2 voices) - Swedish (
sv-se
, 1 voice)- talesyntese (M, CC0)
- Russian (
ru-ru
, 3 voices)
- English (
- Tacotron2
- Coming soon