PlayVoice/whisper-vits-svc

Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS


  • This project targets deep learning beginners; basic knowledge of Python and PyTorch is the only prerequisite.
  • This project aims to help deep learning beginners move beyond boring, purely theoretical study and master the basics of deep learning through hands-on practice.
  • This project does not support real-time voice conversion (replace whisper if real-time conversion is what you are looking for).
  • This project will not provide one-click packages for other purposes.

[Architecture diagram: vits-5.0-frame]

Model properties

| Feature | From | Function |
| --- | --- | --- |
| whisper | OpenAI | strong noise immunity |
| bigvgan | NVIDIA | anti-aliasing and Snake activation |
| natural speech | Microsoft | reduces mispronunciation |
| neural source-filter | NII | solves the problem of F0 discontinuity in audio |
| speaker encoder | Google | timbre encoding and clustering |
| GRL for speaker | Ubisoft | prevents the encoder from leaking timbre |
| SNAC | Samsung | one-shot cloning of VITS |
| SCLN | Microsoft | improves cloning |
| Diffusion | Huawei | improves sound quality |
| PPG perturbation | this project | improves noise immunity and de-timbre |
| HuBERT perturbation | this project | improves noise immunity and de-timbre |
| VAE perturbation | this project | improves sound quality |
| MIX encoder | this project | improves conversion stability |
| USP infer | this project | improves conversion stability |
| VITS2 | SK Telecom | overuse of resources |
| HiFTNet | Columbia University | NSF-iSTFTNet for speed-up |
| RoFormer | Zhuiyi Technology | rotary positional embeddings |

Setup Environment

  1. Install PyTorch.
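
    For example (pick the exact command for your CUDA or CPU setup at https://pytorch.org; the line below installs the default build):

    pip install torch torchaudio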

  2. Install project dependencies (the -i flag selects the Tsinghua PyPI mirror; omit it to use the default index)

    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt

    Note: whisper is already built in; do not install it separately, otherwise it will cause conflicts and errors.

  3. Download the Timbre Encoder (Speaker-Encoder by @mueller91) and put best_model.pth.tar into speaker_pretrain/.

  4. Download the whisper model whisper-large-v2. Make sure to download large-v2.pt and put it into whisper_pretrain/.

  5. Download the hubert_soft model and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.

  6. Download the pretrained model sovits5.0.pretrain.pth, put it into vits_pretrain/, and run a quick inference test:

    python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav

Dataset preparation

Necessary pre-processing:

  1. Separate voice and accompaniment with UVR (skip if there is no accompaniment).
  2. Cut the audio into shorter clips with slicer; whisper only accepts input shorter than 30 seconds.
  3. Manually check the generated clips and remove any shorter than 2 seconds or with obvious noise (an automated check is sketched after the directory tree below).
  4. Put the dataset into the dataset_raw directory following the structure below.
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
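
A quick way to automate the duration check from steps 2-3 (a minimal sketch; soundfile is an assumed extra dependency, and the 2-30 second bounds come from the steps above):

    import soundfile as sf
    from pathlib import Path

    # flag clips outside the 2-30 second range recommended above
    for wav in sorted(Path("dataset_raw").rglob("*.wav")):
        info = sf.info(str(wav))
        seconds = info.frames / info.samplerate
        if not 2.0 <= seconds <= 30.0:
            print(f"{wav}: {seconds:.1f}s")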

Data preprocessing

  1. Re-sampling

    • Generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
    
    • Generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k
    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
    
  2. Use 16K audio to extract pitch

    python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
    
  3. Use 16K audio to extract ppg

    python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
    
  4. Use 16K audio to extract hubert

    python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
    
  5. Use 16k audio to extract timbre code

    python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
    
  6. Extract the average of the timbre codes for inference (conceptually a mean over the per-utterance embeddings; see the sketch after this list)

    python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
    
  7. Use 32k audio to extract the linear spectrum

    python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
    
  8. Use 32k audio to generate training index

    python prepare/preprocess_train.py
    
  9. Training file debugging

    python prepare/preprocess_zzz.py
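
Conceptually, step 6 just averages each speaker's per-utterance embeddings into a single voice vector. A minimal sketch of that idea (the actual logic lives in prepare/preprocess_speaker_ave.py; the one-.npy-per-utterance layout and file names here are assumptions for illustration):

    import numpy as np
    from pathlib import Path

    # hypothetical layout: one .npy embedding per utterance, one folder per speaker
    spk_dir = Path("data_svc/speaker/speaker0")
    embeds = [np.load(f) for f in sorted(spk_dir.glob("*.npy"))]
    # one averaged vector per speaker, usable as --spk at inference time
    np.save("data_svc/singer/speaker0.spk.npy", np.stack(embeds).mean(axis=0))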
    

Train

  1. If fine-tuning based on the pre-trained model, you need to download the pre-trained model sovits5.0.pretrain.pth, put it under the project root, and set this line

    pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
    

    in configs/base.yaml, and adjust the learning rate appropriately, e.g. 5e-5.

    batch_size: for a GPU with 6 GB VRAM, 6 is the recommended value; 8 will work but each step will be much slower.

  2. Start training

    python svc_trainer.py -c configs/base.yaml -n sovits5.0
    
  3. Resume training

    python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/sovits5.0_***.pt
    
  4. Log visualization

    tensorboard --logdir logs/
    


Inference

  1. Export the inference model: text encoder, flow network, and decoder network

    python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
    
  2. Inference: just run the following command.

    python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0
    

    The output file svc_out.wav is generated in the current directory.

  3. Argument reference

    | argument | meaning |
    | --- | --- |
    | --config | config path |
    | --model | model path |
    | --spk | speaker file |
    | --wave | input wave |
    | --ppg | ppg of the input wave |
    | --vec | hubert features of the input wave |
    | --pit | pitch of the input wave |
    | --shift | pitch shift |
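
    For example, to convert test.wav with the pitch raised slightly (treating --shift as semitones is an assumption):

    python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 3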

Code sources and references

https://github.com/facebookresearch/speech-resynthesis

https://github.com/jaywalnut310/vits

https://github.com/openai/whisper

https://github.com/mindslab-ai/univnet

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/shivammehta25/Matcha-TTS

https://github.com/mozilla/TTS

https://github.com/bshall/soft-vc

https://github.com/yl4579/HiFTNet

SNAC : Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech

VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

RoFormer: Enhanced Transformer with Rotary Position Embedding

Contributors

Thanks to

https://github.com/Francis-Komizu/Sovits