audio-security-awesome

A collection of audio security related resources

Voice Cloning Papers

GUI Python toolbox that boasts the ability to clone a voice with 5 seconds of sample data
Uses PyTorch and requires a GPU
Video presentation of toolbox features

Here is a collection of audio datasets for training new models

13,100 short audio clips of a single speaker reading passages from 7 non-fiction books
Sourced from the LibriVox project

The CSTR VCTK Corpus includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent.
All speech data was recorded using an identical recording setup: an omni-directional head-mounted microphone (DPA 4035) and a 96 kHz sampling frequency at 24 bits, in a hemi-anechoic chamber at the University of Edinburgh.
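Before training on any of these datasets it can help to confirm the sample rate and bit depth of the files you actually downloaded, since distributed audio may have been converted from the original recording settings. The following is a minimal sketch using only the Python standard library. The helper names `describe_wav` and `write_silence` and the file `demo.wav` are hypothetical, and the silent clip is a stand-in for a real recording, not data from any corpus.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bit_depth, channels, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8      # sample width is reported in bytes
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, bits, channels, duration


def write_silence(path, rate=96000, sampwidth=3, seconds=0.5):
    """Write a mono clip of silence (stand-in data for this example only)."""
    frames = int(rate * seconds)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sampwidth)        # 3 bytes per sample = 24-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (frames * sampwidth))


# Create a synthetic 96 kHz / 24-bit file and read its header back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silence(demo_path)
info = describe_wav(demo_path)
print(info)
```

To inspect a real file, call `describe_wav` with its path. Many model training pipelines expect a lower rate such as 16 kHz or 22.05 kHz, so a check like this shows whether resampling is needed first.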

Based on the paper Tacotron: Towards End-to-End Speech Synthesis, published by Google in April 2017, which presents a neural text-to-speech model that learns to synthesize speech directly from (text, audio) pairs.
An implementation of Tacotron speech synthesis in TensorFlow

WaveNet is a deep neural network for generating raw audio. It was created by researchers at the London-based artificial intelligence firm DeepMind. The technique, outlined in a paper in September 2016, generates more realistic-sounding, human-like voices by sampling real human speech and directly modelling waveforms.
Keras implementation of WaveNet
TensorFlow implementation of WaveNet
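Modelling waveforms directly means predicting one audio sample at a time. To make that tractable, the WaveNet paper describes compressing each sample with mu-law companding and quantizing it to 256 discrete values, so the network can treat the next sample as a classification target. The following is a minimal sketch of that preprocessing step only, using the standard library. It is not taken from either implementation listed above, and the function names are hypothetical.

```python
import math

MU = 255  # 256 quantization levels, as described in the WaveNet paper


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1.0, 1.0] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    # Logarithmic compression gives quiet samples finer resolution.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest level.
    return int((compressed + 1) / 2 * mu + 0.5)


def mu_law_decode(index, mu=MU):
    """Approximately invert mu_law_encode, returning a float in [-1.0, 1.0]."""
    compressed = 2 * (index / mu) - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


samples = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes = [mu_law_encode(s) for s in samples]
restored = [mu_law_decode(c) for c in codes]
print(codes)
print(restored)
```

The round trip is lossy by design: each decoded value is the centre of a quantization bin, so it is close to the original sample but not identical.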