
fgsoap/so-vits-svc

 
 


SoftVC VITS Singing Voice Conversion

English | 中文简体

✨ A fork with a greatly improved interface: 34j/so-vits-svc-fork

✨ A client that supports real-time conversion: w-okada/voice-changer

Warning!!!

This project is an open source, offline project, and all members of SvcDevelopTeam and all developers and maintainers of this project (hereinafter referred to as contributors) have no control over this project. The contributors of this project have never provided any organization or individual with any form of assistance, including but not limited to dataset extraction, dataset processing, computing support, training support, inference, etc. Contributors to the project do not and cannot know what users are using the project for. Therefore, all AI models and synthesized audio based on the training of this project have nothing to do with the contributors of this project. All problems arising therefrom shall be borne by the user.

📏 Terms of Use

Warning: Please resolve dataset authorization issues on your own. You shall be solely responsible for any problems caused by the use of unauthorized datasets for training and all consequences thereof. The repository and its maintainer, svc develop team, have nothing to do with the consequences!

  1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments.
  2. Any videos based on sovits that are published on video platforms must clearly indicate in the description that they are used for voice changing and specify the input source of the voice or audio. For example, if video or audio published by others is used, with the vocals separated as the input source for conversion, a clear link to the original video or music must be provided. If your own voice or voices synthesized by commercial vocal synthesis software are used as the input source for conversion, you must also explain this in the description.
  3. You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as the input source, please ensure that you comply with the terms of use of the software. Note that many vocal synthesis engines clearly state in their terms of use that they cannot be used for input source conversion.
  4. It is forbidden to use the project to engage in illegal activities or in religious and political activities. The project developers firmly oppose such activities. If you do not agree with this clause, you are prohibited from using the project.
  5. Continuing to use this project is deemed as agreeing to the relevant provisions stated in this repository README. This README has fulfilled its obligation to advise, and is not responsible for any subsequent problems that may arise.
  6. If you intend to use this project for any other purpose, please contact and inform the author of this repository in advance. Thank you very much.

🆕 Update!

The 4.0-v2 model has been released; the entire process is the same as for 4.0. Compared to 4.0, it improves in some scenarios but regresses in others. Please refer to the 4.0-v2 branch for more information.

📝 Model Introduction

The singing voice conversion model uses the SoftVC content encoder to extract speech features from the source audio. The feature vectors are fed directly into VITS instead of being converted to a text-based intermediate, so pitch and intonation are preserved. Additionally, the vocoder is changed to NSF HiFiGAN to solve the problem of sound interruption.

🆕 4.0 Version Update Content

  • Feature input is changed to ContentVec
  • The sampling rate is unified to 44100 Hz
  • Due to changes in hop size and other parameters, as well as the streamlining of some model structures, the GPU memory required for inference is significantly reduced. The 44 kHz GPU memory usage of version 4.0 is even lower than the 32 kHz usage of version 3.0.
  • Some code structures have been adjusted
  • The dataset creation and training process are consistent with version 3.0, but models are not interchangeable between versions, and the dataset needs to be fully preprocessed again.
  • Added option 1: automatic pitch prediction for vc mode, which means that you don't need to manually enter the pitch key when converting speech, and the pitch of male and female voices can be converted automatically. However, this mode causes pitch shift when converting songs.
  • Added option 2: reduce timbre leakage through a k-means clustering scheme, making the timbre more similar to the target timbre.
  • Added option 3: the NSF-HIFIGAN enhancer, which improves sound quality for some models trained on small datasets but degrades well-trained models, so it is disabled by default.

💬 About Python Version

After conducting tests, we believe that the project runs stably on Python 3.8.9.

📥 Pre-trained Model Files

Required

# contentvec
wget -P hubert/ https://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
# Alternatively, you can manually download and place it in the hubert directory

Optional (strongly recommended)

  • Pre-trained model files: G_0.pth D_0.pth
    • Place them under the logs/44k directory

Get them from svc-develop-team (TBD) or from another source.

Although the pretrained model generally does not cause any copyright problems, please pay attention to it. For example, ask the author in advance, or check that the author has clearly stated the permitted uses in the description.

Optional (select as required)

If you use the NSF-HIFIGAN enhancer, you need to download the pre-trained NSF-HIFIGAN model; otherwise you can skip this step.

  • Pre-trained NSF-HIFIGAN Vocoder: nsf_hifigan_20221211.zip
    • Unzip and place the four files under the pretrain/nsf_hifigan directory
# nsf_hifigan
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
# Then unzip the archive so that the four files end up in the pretrain/nsf_hifigan directory
# Alternatively, you can manually download and place it in the pretrain/nsf_hifigan directory
# URL:https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
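
Before moving on, it can help to confirm that the downloaded files sit where this section says they should. The following is a minimal sketch, not part of the repository: the helper name `check_pretrained` is hypothetical, and the paths are simply the ones listed above, resolved relative to the project root.

```python
from pathlib import Path

# Paths taken from the download instructions in this section.
# The required/optional grouping mirrors the headings above.
REQUIRED = ["hubert/checkpoint_best_legacy_500.pt"]
OPTIONAL = ["logs/44k/G_0.pth", "logs/44k/D_0.pth", "pretrain/nsf_hifigan"]


def check_pretrained(root="."):
    """Return (missing_required, missing_optional) relative to `root`."""
    root = Path(root)
    missing_required = [p for p in REQUIRED if not (root / p).exists()]
    missing_optional = [p for p in OPTIONAL if not (root / p).exists()]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_pretrained()
    for path in required:
        print(f"MISSING (required): {path}")
    for path in optional:
        print(f"missing (optional): {path}")
    if not required:
        print("All required pre-trained files are in place.")
```

Run it from the project root. Entries reported as optional only matter if you use the corresponding feature.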

📊 Dataset Preparation

Simply place the dataset in the dataset_raw directory with the following file structure.

dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav

You can customize the speaker name.

dataset_raw
└───suijiSUI
    ├───1.wav
    ├───...
    └───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
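
To double-check the layout before preprocessing, a short script can list each speaker folder and count its clips. This is a sketch under the assumption that every direct subfolder of dataset_raw is one speaker, as in the trees above; the function name `summarize_dataset` is hypothetical and not part of the repository.

```python
from pathlib import Path


def summarize_dataset(root="dataset_raw"):
    """Map each speaker folder under `root` to its number of .wav files."""
    root = Path(root)
    if not root.is_dir():
        return {}
    summary = {}
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        wavs = [f for f in speaker_dir.iterdir() if f.suffix.lower() == ".wav"]
        summary[speaker_dir.name] = len(wavs)
    return summary


if __name__ == "__main__":
    for speaker, count in summarize_dataset().items():
        note = "" if count else "  <- no .wav files found"
        print(f"{speaker}: {count} clips{note}")
```

A speaker folder that reports zero clips usually means the audio was placed one level too deep or is not in .wav format.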

🛠️ Preprocessing

0. Slice audio

Slice the audio into clips of 5 s to 15 s. Slightly longer clips are fine, but clips that are too long may cause torch.cuda.OutOfMemoryError during training or even preprocessing.

You can use audio-slicer-GUI or audio-slicer-CLI for slicing.

In general, only the Minimum Interval needs to be adjusted. For ordinary speech audio, the default usually works. For singing audio, it can be set to 100 or even 50.

After slicing, delete any audio that is too long or too short.
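
Finding the clips to delete by hand is tedious, so the check can be scripted. The sketch below uses only the Python standard library and is not part of the repository. The bounds `MIN_SECONDS` and `MAX_SECONDS` are assumptions loosely based on the 5 s to 15 s guidance, and the `wave` module only reads uncompressed PCM .wav files, so other encodings will raise an error.

```python
import wave
from pathlib import Path

# Assumed bounds, loosely based on the 5 s to 15 s guidance; adjust as needed.
MIN_SECONDS = 2.0
MAX_SECONDS = 20.0


def wav_duration(path):
    """Duration in seconds of an uncompressed PCM .wav file."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def find_outliers(root="dataset_raw", min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """Return (path, duration) pairs for clips outside [min_s, max_s]."""
    root = Path(root)
    if not root.is_dir():
        return []
    outliers = []
    for path in sorted(root.rglob("*.wav")):
        duration = wav_duration(path)
        if duration < min_s or duration > max_s:
            outliers.append((path, duration))
    return outliers


if __name__ == "__main__":
    for path, duration in find_outliers():
        print(f"{duration:6.1f} s  {path}")
```

The script only reports candidates; review the list and delete the files yourself.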

1. Resample to 44100Hz and mono

python resample.py
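
If you want to confirm the result of this step, the sample rate and channel count of each file can be read from the .wav header. This is a minimal sketch, not the repository's code: the function name `find_nonconforming` is hypothetical, and the default directory `dataset` in the example run is an assumption about where the resampled files are written.

```python
import wave
from pathlib import Path

TARGET_RATE = 44100  # Hz, as stated in step 1
TARGET_CHANNELS = 1  # mono


def find_nonconforming(root):
    """Return (path, sample_rate, channels) for files that are not 44100 Hz mono."""
    root = Path(root)
    if not root.is_dir():
        return []
    bad = []
    for path in sorted(root.rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            rate, channels = wav.getframerate(), wav.getnchannels()
        if rate != TARGET_RATE or channels != TARGET_CHANNELS:
            bad.append((path, rate, channels))
    return bad


if __name__ == "__main__":
    # "dataset" is an assumed output location; pass your own path if it differs.
    for path, rate, channels in find_nonconforming("dataset"):
        print(f"{path}: {rate} Hz, {channels} channel(s)")
```

An empty report means every readable PCM file under the given directory is 44100 Hz mono.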

2. Automatically split the dataset into training and validation sets, and generate configuration files.

python preprocess_flist_config.py
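
For readers who want to see what such a split involves, the idea can be sketched as follows. This is an illustration only and not the logic of preprocess_flist_config.py: the function names, the hold-out size `val_count`, and the fixed `seed` are all assumptions.

```python
import random
from pathlib import Path


def split_speaker(files, val_count=2, seed=0):
    """Shuffle `files` reproducibly and hold out `val_count` for validation."""
    files = sorted(files)
    random.Random(seed).shuffle(files)
    return files[val_count:], files[:val_count]


def build_filelists(root, val_count=2, seed=0):
    """Split every speaker folder under `root`; return (train, val) path lists."""
    train, val = [], []
    root = Path(root)
    if not root.is_dir():
        return train, val
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        wavs = [str(p) for p in speaker_dir.glob("*.wav")]
        speaker_train, speaker_val = split_speaker(wavs, val_count, seed)
        train += speaker_train
        val += speaker_val
    return train, val
```

Splitting per speaker rather than over the whole pool keeps every speaker represented in the validation set, which matters when speakers have very different amounts of data.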

3. Generate hubert and f0

python preprocess_hubert_f0.py

After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.

You can modify some parameters in the generated config.json