- Make a custom text-to-speech (TTS) model out of your own voice samples
- Make a custom TTS model out of any existing voice dataset
- Make a custom TTS model by converting a generic voice dataset into another voice using an RVC model.
- Rapidly train high quality TTS by using pretrained checkpoint files
- Preview your voice model while it is training and choose the best version of your voice.
- https://www.tomshardware.com/raspberry-pi/add-any-voice-to-your-raspberry-pi-project-with-textymcspeechy
- https://www.hackster.io/news/erik-bjorgan-makes-voice-cloning-easy-with-the-applio-and-piper-based-textymcspeechy-e9bcef4246fb
- This is probably the fastest possible way to record a dataset, and is ideal for making a clone of one's own voice.
- `dataset_recorder.sh` takes any `metadata.csv` file as input and interactively records voice samples for every phrase it references under the proper file name.
- An AI-generated sample `metadata.csv` file is included, but you will get better results using a `metadata.csv` file from a public-domain academic dataset. Phrases should be phonetically diverse, with a mix of short and longer phrases (ideally the phrase lengths should follow a normal distribution), and should include a mix of tonal patterns (e.g. statements, questions, exclamations). If there are expressions or names you want to use in your target application, including variations of those phrases in the dataset multiple times is very beneficial to making your model sound as natural as possible.
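- As a rough illustration (these filenames and phrases are made up, using the single-speaker `FILENAME|transcript` format described below), a few lines of a `metadata.csv` used for recording might look like this:

sample_0001|Is the back door locked?
sample_0002|Turn off the kitchen lights.
sample_0003|Wow, that was fast!
sample_0004|Remind me to call Stella about the garden party on Saturday.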
- The quick-start guide explains how to get TextyMcSpeechy set up.
- Do not use the manual installation steps described in the legacy guide. Most of them have been automated.
- A PC running Linux with an NVIDIA GPU is highly recommended, and effectively required, for training.
- Once TTS models are trained, they will run on much less powerful hardware, but Piper uses a neural network to generate speech. This is a computationally expensive process which benefits from fast hardware.
- A Raspberry Pi 5 should be considered the absolute minimum for realtime applications like smart speakers or voice assistants.
- My Raspberry Pi 5 (8GB) based offline smart speaker project had a delay of about 8 seconds between saying a voice command (transcribed locally on the Pi with faster-whisper) and the start of a spoken response from a low-quality Piper TTS voice. At least 3/4 of that delay was waiting for Piper to finish generating speech. If real-time responses with minimal lag are needed, I would strongly recommend planning to run Piper on something faster than a Pi 5. If you use low-quality Piper models, keep spoken phrases short, or cache commonly used phrases as pre-rendered audio files (a quick sketch of this caching approach appears below), a Pi 5 (or even a Pi 4) may be adequate for your needs. Personally I'll be waiting for the Pi 6 or Pi 7 before I try to do this again.
- Performance improves dramatically when rendering speech on more powerful hardware. When running both STT transcription and Piper TTS on the CPU of my desktop computer (Core i7-13700), the delay between command and spoken response fell to less than 3 seconds, which was comparable to what I get with Google Assistant on my Nest Mini. Whisperlive and Piper both include HTTP servers, which makes it easy to move STT and TTS processes to a network computer or self-managed cloud server. I'm planning to use this approach in my smart speaker project, which will now be based on a Raspberry Pi Zero 2W.
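- Here is a rough sketch of the phrase-caching idea mentioned above. The model path, cache directory, and phrases are placeholders, and it assumes `piper` and `aplay` are installed on the Pi:

# Pre-render commonly used responses once so they never need to be synthesized live
mkdir -p ~/tts_cache
echo "Sorry, I didn't catch that." | piper -m ~/models/myvoice.onnx --output_file ~/tts_cache/didnt_catch_that.wav
echo "The timer is done." | piper -m ~/models/myvoice.onnx --output_file ~/tts_cache/timer_done.wav

# At runtime, play the cached file instead of calling piper
aplay ~/tts_cache/timer_done.wav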
- Install Piper
- Install Applio
- Experiment with Applio until you find the voice you wish to turn into a text-to-speech model
- Get a training dataset (wav files with text transcriptions) for a person similar in tone/accent to the target voice.
- Use Applio to batch-convert the audio files in the training dataset so that they sound like the target voice.
- Prepare the converted dataset for use with Piper
- Get a checkpoint (`.ckpt`) file for an existing text-to-speech model similar in tone/accent to the target voice.
- Use Piper to fine-tune the existing text-to-speech model using the converted dataset.
- Convert the fine-tuned `.ckpt` file to a `.onnx` file that can be used by Piper directly to generate speech from text.
- Test the new text-to-speech model.
- Note: The main advantage of this method is that it is less work, provided you have a decent RVC model for your target voice. Most of the intonation will come from the voice in the original dataset, which means this method isn't great for creating characters with dramatic intonation. If you don't have access to samples of the target voice, you will get much more natural intonation by recording a base dataset in which you do your best impression of the target voice before applying the RVC model.
- Install Piper (skip steps 2, 3, 4, and 5)
- Prepare your dataset for use with Piper
- Get a checkpoint (`.ckpt`) file for an existing text-to-speech model similar in tone/accent to the target voice.
- Use Piper to fine-tune the existing text-to-speech model using your dataset.
- Convert the fine-tuned `.ckpt` file to a `.onnx` file that can be used by Piper directly to generate speech from text.
- Test the new text-to-speech model.
- Note: The disadvantage of this method is that building a dataset can be a lot of work, but the quality of the resulting voices makes this work worthwhile. I have built usable voice models with as few as 50 samples of the target voice. This is the best method for building voices for characters that have very distinctive intonation. If catchphrases and distinctive expressions are part of the training dataset, when those phrases are used in the TTS engine they will sound almost exactly like the original voice. If building a model out of your own voice, make sure to include the names of people and places that are significant to you in your dataset, as well as any phrases you are likely to use in your application.
Note: This guide was written before the TTS dojo existed. It's still useful for understanding the steps involved, but the TTS Dojo automates many of these steps.
Follow the quick-start guide to install and configure the TTS dojo when you have a dataset and are ready to start training.
I will be taking this guide out of the main docs when I have time to write a replacement guide just for creating datasets.
sudo apt-get install python3-dev
git clone https://github.com/rhasspy/piper.git
cd piper/src/python
(You will have an easier time if you put your venv in this directory)
python3.10 -m venv .venv
note: Torch needs Python 3.10 and won't work on 3.11 without extra steps.
source ./.venv/bin/activate
pip install pip wheel setuptools -U
pip install piper-tts
pip install build
python -m build
pip install -e .
pip3 install -r requirements.txt
bash ./build_monotonic_align.sh
sudo apt-get install espeak-ng
pip install torchmetrics==0.11.4
(this is a downgrade to avoid an error)
- Note: Installing the correct version of CUDA can be a hassle. The easiest way to get an environment that works for Piper is to activate the .venv and run:
python3 -m pip install tensorflow[and-cuda]
(enter this command exactly as it appears here, including the square brackets). It's also possible to install tensorflow without GPU support by running `python3 -m pip install tensorflow`. I haven't tried training on CPU but would be interested to hear from anyone who has tried it.
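- To confirm that the GPU is actually visible to PyTorch before training, a quick sanity check you can run from inside the activated .venv (assuming torch was installed by the steps above):
python3 -c "import torch; print(torch.cuda.is_available())"
- If this prints False, GPU training won't work until your CUDA setup is fixed.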
- Follow instructions here: https://github.com/IAHispano/Applio
- Applio has a nice gui - it isn't very hard to figure out
- Todo - write or link to a more comprehensive guide
Update: For an easy way to do this, see VCTK_dataset_tools/using_vctk_dataset.md
- A dataset is a collection of audio clips with matching text transcriptions. There are many options available in the public domain, or you can record and transcribe your own voice. A repo with many public domain datasets can be found here: https://github.com/jim-schwoebel/voice_datasets
- I have found https://vocalremover.org/ to be a useful tool for extracting voices from background music and other sounds.
- For voice cloning, it is best if the person speaking in the dataset has a voice similar in tone and accent to the target voice. Keep in mind that some datasets include audio from multiple speakers.
- Piper requires transcription data to be gathered into a single `metadata.csv` file, with one line per wav file, in one of the following formats:
- `FILENAME|transcript` is the format for single-speaker datasets (if you are making your own transcripts, this is the format you should use).
- `FILENAME|SPEAKER ID|transcript` is the format for multi-speaker datasets. This format will also work for a single-speaker dataset if the speaker IDs are all the same.
- I use a spreadsheet to create my csv file when transcribing my own datasets, but you can also create this file manually in any text editor.
- It is not necessary to include the path or file extension (e.g. `.wav`) in the file names listed in `metadata.csv`.
This is what the metadata.csv file I created from the VCTK dataset looks like.
p316_001_mic1_output|Please call Stella.
p316_002_mic1_output|Ask her to bring these things with her from the store.
p316_003_mic1_output|Six spoons of fresh snow peas, five thick slabs of blue cheese
- see VCTK_dataset_tools for some helpful scripts for generating metadata.csv if your dataset uses individual text files
- see the dataset recorder for a tool that will let you quickly record a dataset using your own voice.
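- Before moving on, it can save time to confirm that every file named in `metadata.csv` actually exists in your `wav` folder. A minimal sketch (assuming the single-speaker `FILENAME|transcript` format and a folder named `wav` next to `metadata.csv`):

# Print any metadata entries whose wav file is missing
while IFS='|' read -r name transcript; do
    [ -f "wav/${name}.wav" ] || echo "missing: wav/${name}.wav"
done < metadata.csv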
- Convert all of the audio files in the dataset into the target voice using Applio.
- Batch conversion of a generic dataset into another voice can be done on the "Inference" tab.
- Select the Batch tab, choose a voice model, specify the location of your input and output folders, then click "Convert"
note: The TTS Dojo provides tools that automate this step.
- Ensure that your audio files are all in `wav` format with an appropriate sampling rate (22050 Hz or 16000 Hz) and kept together in a folder named `wav`. Batch conversion and resampling of flac files can be done with the following bash script:
# Convert every .flac file in the current directory to a 22050 Hz .wav file
for file in *.flac; do
    ffmpeg -i "$file" -ar 22050 "${file%.flac}.wav"
done
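- If your files are already in `wav` format but you aren't sure of their sampling rate or channel count, `ffprobe` (installed alongside ffmpeg) can report them, e.g.:

for file in *.wav; do
    echo -n "$file: "
    ffprobe -v error -show_entries stream=sample_rate,channels -of csv=p=0 "$file"
done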
- Find the piper repository you cloned earlier and `cd /path/to/piper/src/python`
- Make sure your virtual environment is activated, e.g. `source ./.venv/bin/activate`
- Make a directory for your dataset, e.g. `elvis_dataset`
- Copy `metadata.csv` from step 4 and your `wav` folder into `elvis_dataset`
- Make a directory for the training files, e.g. `elvis_training`
- Run the following:
python3 -m piper_train.preprocess \
--language en-us \
--input-dir /path/to/elvis_dataset \
--output-dir /path/to/elvis_training \
--dataset-format ljspeech \
--single-speaker \
--sample-rate 22050
- If preprocessing is successful, it will generate `config.json`, `dataset.jsonl`, and audio files in `elvis_training`
- Note: If preprocessing fails with a "not enough columns" error, this is usually because your `.csv` file has blank lines at the end (see the snippet below for an easy way to remove them).
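- A quick way to strip blank lines (and stray carriage returns) from `metadata.csv`, shown here as a sketch using GNU sed:
sed -i -e 's/\r$//' -e '/^[[:space:]]*$/d' metadata.csv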
note: The TTS Dojo provides tools that automate this step.
- This model should be similar in tone and accent to the target voice, as well as being in the target language.
- It must also use the same sampling rate as your training data
- The file you need will have a name like `epoch=2164-step=135554.ckpt`
- Checkpoint files for piper's built-in models can be found at https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main
- Here's a link to the lessac medium quality voice which I used in testing https://huggingface.co/datasets/rhasspy/piper-checkpoints/blob/main/en/en_US/lessac/medium/epoch%3D2164-step%3D1355540.ckpt
- Copy this checkpoint file into your `elvis_training` directory
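- One way to download it from the command line is with wget, swapping `blob` for `resolve` in the Hugging Face URL so the raw file is fetched (shown here for the lessac medium checkpoint linked above):

cd /path/to/elvis_training
wget "https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/en/en_US/lessac/medium/epoch%3D2164-step%3D1355540.ckpt"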
note: The TTS Dojo provides tools that automate this step.
- Change to the `/path/to/piper/src/python` directory and ensure your venv is activated.
- Run the following command (but change the paths for `--dataset-dir` and `--resume_from_checkpoint` first!)
python3 -m piper_train \
--dataset-dir /path/to/elvis_training/ \
--accelerator gpu \
--devices 1 \
--batch-size 4 \
--validation-split 0.0 \
--num-test-examples 0 \
--max_epochs 30000 \
--resume_from_checkpoint /path/to/epoch=2164-step=135554.ckpt \
--checkpoint-epochs 1 \
--precision 32
- You may need to adjust your batch size if your GPU runs out of memory (a quick way to watch GPU memory usage is shown below).
- Training has started! You will know it is working if the epoch number starts going up.
- You can monitor training progress with tensorboard, e.g.:
tensorboard --logdir /path/to/elvis_training/lightning_logs
- When tensorboard's graph for "loss_disc_all" levels off, you can abort the training process with CTRL-C in the terminal window where training is happening.
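- To find the largest batch size your GPU can handle, it helps to watch memory usage in a second terminal while training runs. Assuming the NVIDIA driver tools are installed:
watch -n 1 nvidia-smi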
note: The TTS Dojo provides tools that automate this step.
- Create a new directory for your text-to-speech model, e.g. `elvisTTS`
- Locate your fine-tuned checkpoint file for this training session. It will be found in `/path/to/elvis_training/lightning_logs/version_<N>/checkpoints/`
- This file can be converted into `.onnx` format as follows:
python3 -m piper_train.export_onnx \
/path/to/elvis_training/lightning_logs/version_<N>/checkpoints/<EPOCH____>.ckpt \
/path/to/elvisTTS/elvis.onnx
- Copy `config.json` from `elvis_training` to `elvisTTS`. It needs to be renamed to match the `.onnx` file, e.g. `elvis.onnx.json`
cp /path/to/elvis_training/config.json \
/path/to/elvisTTS/elvis.onnx.json
note: The TTS Dojo provides tools that automate this step.
echo 'Thank you. Thank you very much!' | piper -m /path/to/elvisTTS/elvis.onnx --output_file testaudio.wav
- Play the wav file either by double-clicking it in your file manager or with `aplay testaudio.wav`
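- If you want to hear speech without writing a wav file first, and your piper build supports the `--output-raw` flag, you can also stream raw audio straight to aplay (adjust the sample rate to match your model; this example assumes 22050 Hz):
echo 'Thank you. Thank you very much!' | piper -m /path/to/elvisTTS/elvis.onnx --output-raw | aplay -r 22050 -f S16_LE -t raw -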