If you have questions or want to help, you can find us in the #audio-generation channel on the LAION Discord server.
An unofficial PyTorch implementation of SPEAR-TTS.
We are not aiming for an exact copy – to speed up training we want to use existing open-source models as bases: the Whisper encoder to generate semantic tokens and EnCodec for acoustic modeling.
Following Google Brain, we'll train on the LibriLight dataset. Ultimately we want to target multiple languages (Whisper and EnCodec are both multilingual).
UPDATE 2023-04-13: We have trained a preliminary T->S model and a new 3 kbps S->A model that improves the speech quality. Both models are still far from perfect, but we are clearly moving in the right direction (to the moon 🚀🌖!).
End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search); see collabora#9 for more details:
(don't forget to unmute the video)
test-e2e-jfk-T0.7.mp4
Ground truth:
we-choose.mp4
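The samples above were decoded with plain temperature-scaled multinomial sampling (no beam search). A minimal sketch of that decoding step – the helper name is illustrative, not this repo's actual API:

```python
import numpy as np

def sample_multinomial(logits, temperature=0.7, rng=None):
    """Draw one token id from temperature-scaled logits."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled -= scaled.max()        # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    # sample a token index with probability proportional to exp(logit / T)
    return int(rng.choice(len(probs), p=probs))
```

Lower temperatures sharpen the distribution toward the argmax token; T = 1.0 samples from the model's raw distribution.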
UPDATE 2023-04-03: We have trained a working S->A model. It does not sound amazing, but that is mostly due to the EnCodec quality at 1.5 kbps.
Validation set ground truth (don't forget to unmute):
ground-truth.mov
The generated output from the S->A model (multinomial sampling, temperature 0.8):
saar-1300hr-2l-20e-T0.8.mov
- Extract acoustic tokens
- Extract Whisper embeddings and quantize them to semantic tokens
- Semantic token to acoustic token (S->A) model
- Text token to semantic token (T->S) model
- Improve the EnCodec speech quality
- Gather a bigger emotive speech dataset
- Train final high-quality models
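The "quantize Whisper embeddings to semantic tokens" step in the roadmap above boils down to nearest-neighbor assignment against a learned codebook (e.g. k-means centroids). A hedged numpy sketch – function and variable names are illustrative, not this repo's API:

```python
import numpy as np

def quantize(embeddings, centroids):
    """Map continuous encoder states to discrete semantic token ids.

    embeddings: (T, D) float array of per-frame Whisper encoder vectors
    centroids:  (K, D) float array, a learned k-means codebook
    returns:    (T,) int array of token ids in [0, K)
    """
    # pairwise squared distances, shape (T, K)
    d2 = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    # each frame becomes the id of its nearest centroid
    return d2.argmin(axis=1)
```

The same idea works for any vector-quantization codebook; only the source of the centroids changes.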
Using the Whisper encoder for semantic tokens:

Pros:
- Whisper training should be a lot better at extracting semantic information than a masked language model with contrastive loss (w2v-BERT)
- it's pretrained on 600k hours of multilingual speech (vs. 60k for w2v-BERT used in the paper)
- freely available
Cons:
- A 2x higher "symbol rate" (50 vectors/s vs. 25 vectors/s for w2v-BERT), which means training the semantic->acoustic transformer may take longer (in practice this turned out not to matter – 30 seconds of audio yields only 1500 semantic tokens vs. 4500 acoustic tokens)
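The token-budget claim above is simple arithmetic; the EnCodec side assumes its 24 kHz configuration of 75 frames/s with 2 residual codebooks at 1.5 kbps:

```python
# Token budget for a 30-second clip.
seconds = 30
semantic = 50 * seconds        # Whisper-derived tokens at 50 vectors/s
acoustic = 75 * seconds * 2    # EnCodec frames x codebooks at 1.5 kbps
print(semantic, acoustic)      # 1500 4500
```

So even at twice the w2v-BERT rate, the semantic sequence stays 3x shorter than the acoustic one.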
Using EnCodec for acoustic modeling:

Pros:
- High-quality pretrained model is available
Cons:
- Judging by the SPEAR-TTS speech samples, EnCodec needs 6 kbps to reach the same quality (a SoundStream retrained only on speech seems to work at 1.5 kbps)
- CC-BY-NC license
We may switch to the open-source SoundStream re-implementation or train a new speech-only model.
This work would not be possible without the generous sponsorships from:
We are available to help you with both Open Source and proprietary AI projects. You can reach us via the Collabora website or on Discord.
@article{SpearTTS,
title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
url = {https://arxiv.org/abs/2302.03540},
author = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
publisher = {arXiv},
year = {2023},
}
@article{Whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
url = {https://arxiv.org/abs/2212.04356},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
publisher = {arXiv},
year = {2022},
}
@article{EnCodec,
title = {High Fidelity Neural Audio Compression},
url = {https://arxiv.org/abs/2210.13438},
author = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
publisher = {arXiv},
year = {2022},
}