PlayVoice/whisper-vits-svc

Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS


  • This project targets deep learning beginners; basic knowledge of Python and PyTorch is the only prerequisite.
  • This project aims to help deep learning beginners move beyond boring, purely theoretical study and master the basics of deep learning through hands-on practice.
  • This project does not support real-time voice conversion (replace whisper if real-time conversion is what you are looking for).
  • This project will not provide one-click packages for other purposes.

[Architecture diagram: vits-5.0-frame]

Model properties

| Feature | From | Function |
| --- | --- | --- |
| whisper | OpenAI | strong noise immunity |
| bigvgan | NVIDIA | anti-aliasing and Snake activation |
| natural speech | Microsoft | reduces mispronunciation |
| neural source-filter | NII | solves the problem of F0 discontinuity in audio |
| speaker encoder | Google | timbre encoding and clustering |
| GRL for speaker | Ubisoft | prevents the encoder from leaking timbre |
| SNAC | Samsung | one-shot cloning of VITS |
| SCLN | Microsoft | improves cloning |
| Diffusion | Huawei | improves sound quality |
| PPG perturbation | this project | improves noise immunity and de-timbre |
| HuBERT perturbation | this project | improves noise immunity and de-timbre |
| VAE perturbation | this project | improves sound quality |
| MIX encoder | this project | improves conversion stability |
| USP infer | this project | improves conversion stability |
| VITS2 | SK Telecom | overuse of resources |
| HiFTNet | Columbia University | NSF-iSTFTNet for speed-up |
| RoFormer | Zhuiyi Technology | rotary positional embeddings |

Setup Environment

  1. Install PyTorch.
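
    For example (pick the exact command for your CUDA or CPU setup at https://pytorch.org; the line below installs the default build):

    pip install torch torchaudio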

  2. Install project dependencies (the -i flag selects the Tsinghua PyPI mirror; omit it to use the default index)

    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt

    Note: whisper is already built in; do not install it separately, otherwise it will cause conflicts and errors.

  3. Download the Timbre Encoder (Speaker-Encoder by @mueller91) and put best_model.pth.tar into speaker_pretrain/.

  4. Download the whisper model whisper-large-v2. Make sure to download large-v2.pt and put it into whisper_pretrain/.

  5. Download the hubert_soft model and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.

  6. Download the pretrained model sovits5.0.pretrain.pth, put it into vits_pretrain/, and run a quick inference test:

    python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav

Dataset preparation

Necessary pre-processing:

  1. Separate voice and accompaniment with UVR (skip if there is no accompaniment).
  2. Cut the audio into shorter clips with slicer; whisper only accepts input shorter than 30 seconds.
  3. Manually check the generated clips and remove any shorter than 2 seconds or with obvious noise (an automated check is sketched after the directory tree below).
  4. Put the dataset into the dataset_raw directory following the structure below.
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
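
A quick way to automate the duration check from steps 2-3 (a minimal sketch; soundfile is an assumed extra dependency, and the 2-30 second bounds come from the steps above):

    import soundfile as sf
    from pathlib import Path

    # flag clips outside the 2-30 second range recommended above
    for wav in sorted(Path("dataset_raw").rglob("*.wav")):
        info = sf.info(str(wav))
        seconds = info.frames / info.samplerate
        if not 2.0 <= seconds <= 30.0:
            print(f"{wav}: {seconds:.1f}s")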

Data preprocessing

  1. Re-sampling

    • Generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
    
    • Generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
    
  2. Use 16K audio to extract pitch

    python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
    
  3. Use 16K audio to extract ppg

    python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
    
  4. Use 16K audio to extract hubert

    python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
    
  5. Use 16k audio to extract timbre code

    python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
    
  6. Extract the average of the timbre codes for inference (conceptually a mean over the per-utterance embeddings; see the sketch after this list)

    python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
    
  7. Use 32k audio to extract the linear spectrum

    python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
    
  8. Use 32k audio to generate training index

    python prepare/preprocess_train.py
    
  9. Training file debugging

    python prepare/preprocess_zzz.py
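
Conceptually, step 6 just averages each speaker's per-utterance embeddings into a single voice vector. A minimal sketch of that idea (the actual logic lives in prepare/preprocess_speaker_ave.py; the one-.npy-per-utterance layout and file names here are assumptions for illustration):

    import numpy as np
    from pathlib import Path

    # hypothetical layout: one .npy embedding per utterance, one folder per speaker
    spk_dir = Path("data_svc/speaker/speaker0")
    embeds = [np.load(f) for f in sorted(spk_dir.glob("*.npy"))]
    # one averaged vector per speaker, usable as --spk at inference time
    np.save("data_svc/singer/speaker0.spk.npy", np.stack(embeds).mean(axis=0))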
    

Train

  1. If fine-tuning based on the pre-trained model, you need to download the pre-trained model sovits5.0.pretrain.pth, put it under the project root, and set this line

    pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
    

    in configs/base.yaml, and adjust the learning rate appropriately, e.g. 5e-5.

    batch_size: for a GPU with 6 GB VRAM, 6 is the recommended value; 8 will work but each step will be much slower.

  2. Start training

    python svc_trainer.py -c configs/base.yaml -n sovits5.0
    
  3. Resume training

    python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/sovits5.0_***.pt
    
  4. Log visualization

    tensorboard --logdir logs/
    


Inference

  1. Export the inference model: text encoder, flow network, and decoder network

    python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
    
  2. Inference: just run the following command.

    python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
    

    The output file svc_out.wav is generated in the current directory.

  3. Argument reference

    | argument | meaning |
    | --- | --- |
    | --config | config path |
    | --model | model path |
    | --spk | speaker file |
    | --wave | input wave |
    | --ppg | ppg of the input wave |
    | --vec | hubert features of the input wave |
    | --pit | pitch of the input wave |
    | --shift | pitch shift |
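
    For example, to convert test.wav with the pitch raised slightly (treating --shift as semitones is an assumption):

    python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 3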

Code sources and references

https://github.com/facebookresearch/speech-resynthesis

https://github.com/jaywalnut310/vits

https://github.com/openai/whisper

https://github.com/mindslab-ai/univnet

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/shivammehta25/Matcha-TTS

https://github.com/mozilla/TTS

https://github.com/bshall/soft-vc

https://github.com/yl4579/HiFTNet

SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech

VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

RoFormer: Enhanced Transformer with Rotary Position Embedding

Contributors

Thanks to

https://github.com/Francis-Komizu/Sovits