OriginPrince/Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Datasets and preprocessing

Ideally, you want to keep all your datasets under the same directory. All preprocessing scripts will, by default, output the cleaned data to a new directory, SV2TTS, created in your datasets root directory. Inside this directory, a subdirectory is created for each model: the encoder, the synthesizer and the vocoder.
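For illustration, here is a minimal Python sketch of that layout (the ~/datasets root is an assumption; substitute wherever you keep your datasets):

```python
# Sketch of the directories the preprocessing scripts create for the clean data.
# "~/datasets" is an assumed datasets root, not a path the repo requires.
from pathlib import Path

datasets_root = Path("~/datasets").expanduser()
for model in ("encoder", "synthesizer", "vocoder"):
    print(datasets_root / "SV2TTS" / model)  # e.g. ~/datasets/SV2TTS/encoder
```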

You will need the following datasets (a quick sanity check for the extracted paths is sketched after the lists):

For the encoder:

  • LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500)
  • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
  • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)

For the synthesizer and the vocoder:

  • LibriSpeech: train-clean-100, train-clean-360 (extract as LibriSpeech/train-clean-100 and LibriSpeech/train-clean-360)
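Before running the preprocessing scripts, you can verify that everything was extracted to the locations above. A minimal sketch (again assuming a ~/datasets root):

```python
# Verify that each dataset sits where the preprocessing scripts expect it.
from pathlib import Path

datasets_root = Path("~/datasets").expanduser()  # assumed root; use your own
expected = [
    "LibriSpeech/train-other-500",  # encoder
    "VoxCeleb1/wav",                # encoder
    "VoxCeleb1/vox1_meta.csv",      # encoder metadata
    "VoxCeleb2/dev",                # encoder
    "LibriSpeech/train-clean-100",  # synthesizer + vocoder
    "LibriSpeech/train-clean-360",  # synthesizer + vocoder
]
for rel in expected:
    status = "ok" if (datasets_root / rel).exists() else "MISSING"
    print(f"{status:8}{rel}")
```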

Feel free to adapt the code to your needs; other interesting datasets can also be used.

Known issues

  • There is no noise removal algorithm implemented to clean the data of the synthesizer and the vocoder. I've found the noise removal algorithm from Audacity, which uses Fourier analysis, to be quite good, but it's too much work to reimplement (a rough sketch of the idea appears after this list).
  • The hyperparameters for the encoder and the vocoder are not exposed as command-line arguments. I still have to decide how I want to do this; one possible direction is sketched after this list.
  • I've tried filtering the non-English speakers out of VoxCeleb1 using the metadata file. However, there is no such file for VoxCeleb2. Right now, the non-English speakers of VoxCeleb2 are unfiltered (hopefully, they're still a minority in the dataset). It's hard to tell if this really has a negative impact on the model.
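For reference, Audacity-style noise removal boils down to spectral gating: estimate a per-frequency noise floor from the quietest frames and attenuate time-frequency bins that don't rise above it. A toy sketch of the idea (not Audacity's actual algorithm, and not code from this repo):

```python
# Toy spectral-gating denoiser; illustrates the idea, nothing more.
import numpy as np
import librosa

def spectral_gate(wav, n_fft=2048, hop=512, gate_db=6.0):
    spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise floor per frequency bin from the quietest 10% of frames
    frame_energy = mag.mean(axis=0)
    quiet = mag[:, frame_energy <= np.percentile(frame_energy, 10)]
    noise_floor = quiet.mean(axis=1, keepdims=True)
    # Silence the bins that don't exceed the noise floor by at least gate_db
    threshold = noise_floor * 10 ** (gate_db / 20)
    gated = np.where(mag >= threshold, mag, 0.0)
    return librosa.istft(gated * np.exp(1j * phase), hop_length=hop, length=len(wav))
```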
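As for the hyperparameter issue, one possible direction (purely a sketch; the names and defaults below are made up) is to derive command-line overrides from a dictionary of defaults:

```python
# Sketch: turn a dict of hyperparameter defaults into overridable CLI flags.
# The names and values are illustrative, not the repo's actual hparams.
import argparse

defaults = {"learning_rate": 1e-4, "batch_size": 64, "n_epochs": 100}

parser = argparse.ArgumentParser()
for name, value in defaults.items():
    parser.add_argument("--" + name, type=type(value), default=value)
hparams = vars(parser.parse_args())
print(hparams)  # the effective hyperparameters for this run
```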

TODO

  • I'd like to eventually merge the audio module of each package into utils/audio.
  • I'm looking for a PyTorch-based Tacotron implementation that is as good as Rayhane Mamah's TensorFlow one.
  • Let the user decide whether to use speaker embeddings or utterance embeddings for training the synthesizer (the difference is sketched after this list).
  • The toolbox can always be improved.
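On the embedding question above: an utterance embedding is computed from a single utterance, while a speaker embedding can be taken as the re-normalized mean of that speaker's utterance embeddings. A small sketch (the per-utterance embedding function is a hypothetical stand-in):

```python
# Sketch of the two training targets; `utterance_embeds` holds one d-vector
# per utterance of a speaker, produced by whatever encoder you use.
import numpy as np

def speaker_embedding(utterance_embeds):
    # Average the per-utterance d-vectors, then L2-normalize back to unit length
    mean = utterance_embeds.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Training on utterance embeddings uses each row of `utterance_embeds` as-is;
# training on speaker embeddings uses one such averaged vector per speaker.
```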
