
The demo page of InstructTTS

Paper: https://arxiv.org/abs/2301.13662
Demo: https://dongchaoyang.top/InstructTTS/

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Introduction

For the first time, we study the modelling of expressive TTS with a style prompt in natural language, which raises the following research problems: (1) how to train a language model that captures semantic information from the natural-language prompt and controls the speaking style of the generated speech; (2) how to design an acoustic model that effectively handles the challenging one-to-many learning problem of expressive TTS. In this paper, we address these two challenges.

The main contributions of this study are summarized as follows:
(1) For the first time, we study the modelling of expressive TTS with a natural language prompt, which brings us a step closer to achieving user-controllable expressive TTS.
(2) We introduce a novel three-stage training strategy to obtain a robust sentence embedding model, which can effectively capture semantic information from the style prompts.
(3) Inspired by the success of large-scale language models, e.g., GPT-3 (Brown et al., 2020) and ChatGPT, we propose to model acoustic features in a discrete latent space and cast speech synthesis as a language modelling task. Specifically, we train a novel discrete diffusion model to generate vector-quantized (VQ) acoustic features rather than predict the commonly used mel-spectrogram.
(4) We explore modelling two types of VQ acoustic features: mel-spectrogram-based VQ features and waveform-based VQ features. We show that both types can be effectively modelled by our proposed discrete diffusion model. Notably, our waveform-based modelling method needs only one-stage training and is non-autoregressive, which sets it apart from the concurrent works AudioLM (Borsos et al., 2022), VALL-E (Wang et al., 2023), and MusicLM (Borsos et al., 2023).
(5) We jointly apply mutual information (MI) estimation and minimization during acoustic model training to minimize style-speaker and style-content MI, which avoids possible content and speaker information leakage from the style prompt (a minimal sketch follows this list).
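To make contribution (5) concrete, below is a minimal sketch of one common way to estimate and minimize MI between two embeddings: a CLUB-style variational upper bound on I(style; speaker). The module names, dimensions, and the choice of a CLUB-style estimator are illustrative assumptions for this README, not the exact implementation used in the paper.

```python
# Minimal sketch (assumption, not the paper's exact implementation):
# a CLUB-style variational upper bound on I(style; speaker).
# The variational network q(speaker | style) is first fit by maximizing
# log-likelihood; the resulting MI upper bound is then added to the
# acoustic-model loss so the style encoder is pushed to drop speaker info.
import torch
import torch.nn as nn


class CLUBEstimator(nn.Module):
    """Estimates an upper bound on I(x; y) with a diagonal-Gaussian q(y | x)."""

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

    def log_likelihood(self, x, y):
        # log q(y | x) under a diagonal Gaussian; maximized to fit q.
        mu, logvar = self.mu(x), self.logvar(x)
        return (-((y - mu) ** 2) / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_upper_bound(self, x, y):
        # CLUB bound: E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)],
        # where the second expectation is approximated by shuffling y in the batch.
        mu, logvar = self.mu(x), self.logvar(x)
        positive = -((y - mu) ** 2) / logvar.exp()
        negative = -((y[torch.randperm(y.size(0))] - mu) ** 2) / logvar.exp()
        return (positive - negative).sum(dim=1).mean() / 2.0


# Toy usage with embeddings from hypothetical style / speaker encoders.
style = torch.randn(8, 128)      # style-prompt embeddings
speaker = torch.randn(8, 64)     # speaker embeddings
club = CLUBEstimator(128, 64)
nll = -club.log_likelihood(style, speaker)       # step 1: fit q(y | x)
mi_loss = club.mi_upper_bound(style, speaker)    # step 2: add to the TTS loss to minimize MI
```

In practice the two steps alternate: the estimator is updated on the log-likelihood objective while the encoders are updated on the main TTS loss plus the weighted MI term.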
