DiffWave: A Versatile Diffusion Model for Audio Synthesis

Kong, Zhifeng; Ping, Wei; Huang, Jiaji; Zhao, Kexin; Catanzaro, Bryan

arXiv:2009.09761 (eess)

[Submitted on 21 Sep 2020 (v1), last revised 30 Mar 2021 (this version, v3)]

Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts a white noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity, according to various automatic and human evaluations.
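The synthesis procedure described in the abstract (white noise refined into a waveform through a Markov chain with a fixed number of steps) can be illustrated with a minimal sketch of a standard DDPM-style reverse sampler. This is not the authors' implementation: the function names, the linear noise schedule and its endpoint values, and the zero-output `dummy_noise_predictor` standing in for a trained network are all assumptions made for illustration.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    # Illustrative noise schedule; the endpoint values are assumptions.
    return np.linspace(beta_start, beta_end, num_steps)


def dummy_noise_predictor(x, t):
    # Hypothetical stand-in for a trained network that predicts the noise
    # contained in x at step t. It always returns zeros.
    return np.zeros_like(x)


def reverse_diffusion(noise_predictor, num_samples, betas, rng):
    """Run a fixed-length reverse Markov chain starting from white noise."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from Gaussian white noise of the desired waveform length.
    x = rng.standard_normal(num_samples)

    # The number of steps is len(betas), independent of the waveform length.
    for t in range(len(betas) - 1, -1, -1):
        eps = noise_predictor(x, t)
        # Mean of the reverse transition given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise on every step except the last one.
            sigma = np.sqrt((1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t])
            x = mean + sigma * rng.standard_normal(num_samples)
        else:
            x = mean
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas = linear_beta_schedule(50)
    waveform = reverse_diffusion(dummy_noise_predictor, 16000, betas, rng)
    print(waveform.shape)
```

Every step updates all samples at once, so the cost is a fixed number of predictor calls regardless of the waveform length. That is the sense in which such a sampler is non-autoregressive.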

Comments:	ICLR 2021 (oral)
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Machine Learning (stat.ML)
Cite as:	arXiv:2009.09761 [eess.AS]
	(or arXiv:2009.09761v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2009.09761

Submission history

From: Wei Ping
[v1] Mon, 21 Sep 2020 11:20:38 UTC (300 KB)
[v2] Tue, 24 Nov 2020 09:47:28 UTC (365 KB)
[v3] Tue, 30 Mar 2021 19:48:38 UTC (1,145 KB)
