Ideally, you want to keep all your datasets under the same directory. All preprocessing scripts will, by default, output the clean data to a new `SV2TTS` directory created in your datasets root directory. Inside this directory, a subdirectory will be created for each model: the encoder, the synthesizer and the vocoder (see the example layout after the dataset list below).
You will need the following datasets:
For the encoder:
- LibriSpeech: train-other-500 (extract as `LibriSpeech/train-other-500`)
- VoxCeleb1: Dev A - D as well as the metadata file (extract as `VoxCeleb1/wav` and `VoxCeleb1/vox1_meta.csv`)
- VoxCeleb2: Dev A - H (extract as `VoxCeleb2/dev`)
For the synthesizer and the vocoder:
- LibriSpeech: train-clean-100, train-clean-360 (extract as `LibriSpeech/train-clean-100` and `LibriSpeech/train-clean-360`)
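For reference, assuming a datasets root directory named `<datasets_root>` (the actual name and location are up to you), the layout after extracting the archives and running the preprocessing scripts should look roughly like this:

```
<datasets_root>
├── LibriSpeech
│   ├── train-clean-100
│   ├── train-clean-360
│   └── train-other-500
├── VoxCeleb1
│   ├── wav
│   └── vox1_meta.csv
├── VoxCeleb2
│   └── dev
└── SV2TTS              # created by the preprocessing scripts
    ├── encoder
    ├── synthesizer
    └── vocoder
```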
Feel free to adapt the code to your needs; other interesting datasets could also be used. Below is a list of current limitations and improvements I have in mind:
- There is no noise removal algorithm implemented to clean the data for the synthesizer and the vocoder. I've found Audacity's noise removal algorithm, which uses Fourier analysis, to be quite good, but it's too much work to reimplement (a rough sketch of that kind of approach is included after this list).
- The hyperparameters for both the encoder and the vocoder are not exposed as command-line arguments. I still have to decide how I want to handle this.
- I've tried filtering the non-English speakers out of VoxCeleb1 using the metadata file (a sketch of this filtering is included after this list). However, there is no such file for VoxCeleb2, so its non-English speakers are currently unfiltered (hopefully, they're still a minority in the dataset). It's hard to tell whether this really has a negative impact on the model.
- I'd like to eventually merge the `audio` module of each package into `utils/audio`.
- I'm looking for a Tacotron framework that is PyTorch-based but as good as the TensorFlow implementation of Rayhane Mamah.
- Let the user decide whether they want to use speaker embeddings or utterance embeddings for training the synthesizer (see the last sketch after this list for how the two relate).
- The toolbox can always be improved.
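On the noise removal point: the snippet below is a minimal sketch of spectral gating (thresholding STFT magnitudes against a noise profile), in the spirit of Fourier-based noise reduction. It is not Audacity's algorithm; the function name, parameters and use of `librosa` are assumptions made for illustration.

```python
import numpy as np
import librosa


def spectral_gate(wav, noise_clip, n_fft=2048, hop=512, margin_db=6.0):
    """Crude spectral-gating denoiser: mute STFT bins that stay below a
    noise profile estimated from a noise-only clip. Illustrative only."""
    # Per-frequency noise profile from a segment assumed to contain only noise
    noise_mag = np.abs(librosa.stft(noise_clip, n_fft=n_fft, hop_length=hop))
    noise_profile = noise_mag.mean(axis=1, keepdims=True)

    # Binary mask: keep only bins whose magnitude exceeds the profile plus a margin
    spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)
    mask = mag >= noise_profile * (10.0 ** (margin_db / 20.0))

    cleaned = (mag * mask) * np.exp(1j * phase)
    return librosa.istft(cleaned, hop_length=hop, length=len(wav))
```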
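For the VoxCeleb1 language filtering mentioned above, a sketch could look like the following. It assumes `vox1_meta.csv` is a tab-separated table with a nationality column keyed by speaker ID (check the exact column names in the file you downloaded), and the list of anglophone nationalities is only an example.

```python
import csv
from pathlib import Path

# Nationalities treated as anglophone; adjust to taste (illustrative list only)
ANGLOPHONE = {"USA", "UK", "Canada", "Australia", "Ireland", "New Zealand"}


def english_speaker_ids(datasets_root):
    """Return the VoxCeleb1 speaker IDs whose listed nationality is anglophone."""
    meta_path = Path(datasets_root, "VoxCeleb1", "vox1_meta.csv")
    keep = set()
    with open(meta_path, newline="") as f:
        # The metadata file is assumed to be tab-separated with a header row
        for row in csv.DictReader(f, delimiter="\t"):
            if row["Nationality"] in ANGLOPHONE:
                keep.add(row["VoxCeleb1 ID"])
    return keep
```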
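Regarding speaker vs. utterance embeddings: an utterance embedding is computed from a single utterance, while a speaker embedding can be obtained by averaging a speaker's unit-norm utterance embeddings and renormalizing. Below is a minimal sketch; the `embed_utterance` callable standing in for the encoder's inference function is a hypothetical name.

```python
import numpy as np


def speaker_embedding(utterance_wavs, embed_utterance):
    """Average per-utterance embeddings into a single speaker embedding.

    `embed_utterance` is assumed to map a waveform to a unit-norm vector;
    its name and signature here are illustrative.
    """
    embeds = np.stack([embed_utterance(wav) for wav in utterance_wavs])
    mean = embeds.mean(axis=0)
    return mean / np.linalg.norm(mean)  # renormalize to unit length
```

The trade-off, roughly, is that utterance embeddings keep a one-to-one pairing between embedding and target audio during synthesizer training, while speaker embeddings average out per-utterance variation.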