This repository is an official PyTorch implementation of the paper:
Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung. "FreGrad: lightweight and fast frequency-aware diffusion vocoder." ICASSP (2024). [PDF] [Demo]
This repository contains a vocoder model (mel-spectrogram conditional waveform synthesis) presented in FreGrad.
The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of three key components: (1) we employ a discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which lets FreGrad operate on a simple and concise feature space; (2) we design a frequency-aware dilated convolution that elevates frequency awareness, resulting in speech with accurate frequency information; and (3) we introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad trains and runs significantly faster than the baseline with a smaller model, without sacrificing output quality.
Refer to the demo page for samples from the model.
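As a rough illustration of component (1) above, the snippet below uses PyWavelets to split a waveform into low- and high-frequency sub-bands and reconstruct it. This is only a sketch of the idea, not the repository's implementation, and the `haar` wavelet choice is an assumption:

```python
# Illustrative only: sub-band decomposition with a discrete wavelet transform.
# PyWavelets (pip install PyWavelets) is used purely for this demo; the repo may use its own DWT.
import numpy as np
import pywt

waveform = np.random.randn(22050).astype(np.float32)  # 1 second of dummy audio

# Single-level DWT: low-frequency (approximation) and high-frequency (detail) sub-bands,
# each at half the original length, so the model works on shorter, simpler sequences.
low, high = pywt.dwt(waveform, "haar")
print(low.shape, high.shape)          # (11025,), (11025,)

# The inverse DWT recombines the sub-bands into the original-resolution waveform.
reconstructed = pywt.idwt(low, high, "haar")
print(np.allclose(waveform, reconstructed))  # True: Haar gives perfect reconstruction
```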
We recommend using the VSCode Better Comments extension to easily find the comments that highlight our contributions as described in the paper.
1. Navigate to the FreGrad root and install dependencies:

   ```bash
   # the codebase has been tested on Python 3.8 with PyTorch 1.8.2 LTS and 1.10.2 conda binaries
   pip install -r requirements.txt
   chmod +x train.sh inference.sh
   ```
2. Modify `filelists/train.txt`, `filelists/valid.txt`, and `filelists/test.txt` so that they point to the absolute paths of the wav files. The codebase provides the LJSpeech dataset template. We also provide the randomly generated filelists we used to train the model reported in the paper.
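   For reference, a hypothetical filelist might look like the lines below, assuming one absolute wav path per line; check the provided LJSpeech template for the exact format expected by the codebase:

   ```
   /path/to/your/LJSpeech-1.1/wavs/LJ001-0001.wav
   /path/to/your/LJSpeech-1.1/wavs/LJ001-0002.wav
   ```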
3. Train FreGrad (the training code supports multi-GPU training):

   - Review the default parameters defined in `params.py` and change them if needed.
   - Specify the CUDA devices before training:

   ```bash
   CUDA_VISIBLE_DEVICES=0,1,2,3 ./train.sh
   ```
   The training script first builds the training set statistics and saves them to a `stats_priorgrad` folder created at `data_root` (`/path/to/your/LJSpeech-1.1` in the above example). It also automatically saves the hyperparameter file (`params.py`), renamed as `params_saved.py`, to `model_dir` at runtime to be used for inference.
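   As a rough, hedged illustration of what "training set statistics" could mean here (the folder name suggests PriorGrad-style data-dependent statistics), the sketch below computes per-mel-bin mean and standard deviation over a stack of mel-spectrograms. The function name, saved file names, and even the choice of statistics are assumptions, not the repo's actual code:

   ```python
   # Hypothetical sketch: per-mel-bin statistics over a training set.
   # NOT the repository's implementation; names and the saved format are assumptions.
   import numpy as np

   def compute_mel_stats(mel_list):
       """mel_list: iterable of [n_mels, n_frames] arrays from the training set."""
       mels = np.concatenate(mel_list, axis=1)   # stack frames from all utterances
       mean = mels.mean(axis=1)                  # per-frequency-bin mean
       std = mels.std(axis=1) + 1e-5             # per-frequency-bin std (avoid zeros)
       return mean, std

   # Example with random placeholders standing in for real mel-spectrograms.
   dummy_mels = [np.random.randn(80, 200), np.random.randn(80, 150)]
   mean, std = compute_mel_stats(dummy_mels)
   np.save("stats_mean.npy", mean)               # the real script writes into stats_priorgrad/
   np.save("stats_std.npy", std)
   ```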
4. Inference (fast mode with `T=6`):

   ```bash
   CUDA_VISIBLE_DEVICES=0 ./inference.sh
   ```

   Uncomment or comment out options in `inference.sh` to control the inference process. We provide the following:

   - `--fast --fast_iter 6` uses the fast inference noise schedule with `--fast_iter` reverse diffusion steps.
   - If `--fast` is not provided, the model performs slow sampling with the same `T` steps as the forward diffusion used in training.

   Samples are saved to `sample_fast` if `--fast` is used, or `sample_slow` if not, created at the parent directory of the model (`checkpoints` in the above example).
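   For intuition only, the sketch below shows a generic DDPM-style reverse loop and how a short, user-chosen noise schedule ("fast") differs from reusing the full training schedule ("slow"). It is not `inference.py`; the model signature, shapes, and all beta values are illustrative assumptions:

   ```python
   # Generic DDPM-style ancestral sampling, only to contrast slow vs. fast schedules.
   import torch

   def reverse_diffusion(model, cond, betas, shape):
       alphas = 1.0 - betas
       alphas_cum = torch.cumprod(alphas, dim=0)
       x = torch.randn(shape)                            # start from pure Gaussian noise
       for t in reversed(range(len(betas))):
           eps = model(x, cond, torch.tensor([t]))       # predicted noise at step t
           coef = betas[t] / torch.sqrt(1.0 - alphas_cum[t])
           x = (x - coef * eps) / torch.sqrt(alphas[t])  # posterior mean
           if t > 0:                                     # add noise except at the final step
               x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
       return x

   # "Slow" mode would reuse the full training schedule, e.g. T linearly spaced betas:
   slow_betas = torch.linspace(1e-4, 0.05, 50)
   # "Fast" mode uses a short schedule with only 6 reverse steps (placeholder values):
   fast_betas = torch.tensor([1e-4, 1e-3, 1e-2, 5e-2, 2e-1, 5e-1])
   ```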
We release the pretrained weights of the FreGrad model trained on LJSpeech for 1M steps at this link. Please download and extract the file into the `checkpoints` directory so that it looks as follows:

```
checkpoints/
└── fregrad/
    ├── weights-1000000.pt
    ├── params_saved.py
    └── stats_priorgrad/
```
The `stats_priorgrad` folder saved at `data_root` is required to use the checkpoint for training and inference. Refer to step 3 of the Quick Start and Examples above.

The codebase defines `weights.pt` as a symbolic link to the latest checkpoint. Restore the link with `ln -s weights-1000000.pt weights.pt` to continue training (`__main__.py`) or to perform inference (`inference.py`) without specifying `--step`.
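If you want to verify which checkpoint the symlink resolves to, a small generic PyTorch snippet like the one below can help; it is not part of the FreGrad codebase and makes no assumptions about the checkpoint's contents beyond it being loadable with `torch.load`:

```python
# Generic checkpoint inspection; not part of the FreGrad codebase.
import os
import torch

link = "checkpoints/fregrad/weights.pt"
print(os.path.realpath(link))                # which weights-*.pt the symlink points to

ckpt = torch.load(link, map_location="cpu")  # load on CPU just to inspect
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else None)
```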
Our backbone code is based on the following repository:
If you find FreGrad useful to your work, please consider citing the paper below:
```
@inproceedings{nguyen2024fregrad,
  author={Nguyen, Tan Dat and Kim, Ji-Hoon and Jang, Youngjoon and Kim, Jaehun and Chung, Joon Son},
  booktitle={International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={FreGrad: Lightweight and fast frequency-aware diffusion vocoder},
  year={2024},
}
```