AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling

Nguyen, Bac; Cardinaux, Fabien; Uhlich, Stefan

arXiv:2203.11049 (cs)

[Submitted on 21 Mar 2022 (v1), last revised 7 Mar 2023 (this version, v2)]

Abstract: Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis. However, they typically require external alignment models, which are not necessarily optimized for the decoder because they are not jointly trained with it. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a soft-duration mechanism that optimizes a stochastic process in expectation. Using this differentiable duration method, we introduce AutoTTS, a direct text-to-waveform speech synthesis model. AutoTTS enables high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration. Experimental results show that our model obtains competitive results while enjoying a much simpler training pipeline. Audio samples are available online.
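
To make the idea of a differentiable duration model more concrete, here is a minimal NumPy sketch of one generic way to turn real-valued token durations into a soft monotonic alignment, together with a total-duration matching term. This is not the AutoTTS algorithm: the abstract describes a soft-duration mechanism that optimizes a stochastic process in expectation, whereas this sketch swaps in a simpler Gaussian-kernel upsampling for illustration. The function names `soft_alignment` and `total_duration_loss`, the `temperature` parameter, and the example durations are all assumptions introduced here.

```python
import numpy as np


def soft_alignment(durations, num_frames, temperature=1.0):
    """Illustrative soft alignment (assumption: Gaussian-kernel upsampling,
    not the mechanism proposed in the paper).

    Returns a (num_frames, num_tokens) matrix whose rows sum to 1.
    """
    durations = np.asarray(durations, dtype=float)
    ends = np.cumsum(durations)            # end position of each token
    centers = ends - 0.5 * durations       # center position of each token
    frames = np.arange(num_frames) + 0.5   # center position of each output frame

    # Each frame attends most strongly to the token whose center is nearest.
    logits = -((frames[:, None] - centers[None, :]) ** 2) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def total_duration_loss(durations, num_frames):
    """Absolute gap between the summed durations and the target length."""
    return abs(np.sum(durations) - num_frames)


if __name__ == "__main__":
    durations = [2.0, 3.0, 1.0]  # hypothetical per-token durations, in frames
    alignment = soft_alignment(durations, num_frames=6)
    print(alignment.round(3))
    print(total_duration_loss(durations, 6))
```

Every operation above is smooth in `durations`, so the same computation written in an automatic-differentiation framework would let gradients from a downstream loss reach the duration predictor, which is the property that removes the need for an external aligner.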

Comments: ICASSP 2023
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2203.11049 [cs.SD] (or arXiv:2203.11049v2 [cs.SD] for this version)
DOI: https://doi.org/10.48550/arXiv.2203.11049

Submission history

From: Bac Nguyen
[v1] Mon, 21 Mar 2022 15:14:44 UTC (447 KB)
[v2] Tue, 7 Mar 2023 07:44:11 UTC (435 KB)
