This is the Codec Enhancement model of the Carnival system. This repository contains the open-source code for the project "Development of voice quality enhancement technology for remote multi-party video conferencing" (2021.05–2024.12), funded by the Ministry of Science and ICT and supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP).
It provides the training and inference code for HILCodec, an end-to-end neural audio codec developed in this project.
[paper] [samples] [code]
We tested under CUDA 11.7 with torch 1.13, and under CUDA 10.2 with torch 1.12.
It may work in other environments, but this is not guaranteed.
First, install PyTorch along with torchaudio.
Then, install the other requirements as follows:
conda install librosa -c conda-forge
conda install jupyter notebook matplotlib scipy tensorboard tqdm pyyaml
pip install pesq pystoi
Finally, install ONNXRuntime for CPU.
Optionally, install ViSQOL.
For testing, you only need to install ONNXRuntime, librosa, and soundfile.
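As a quick sanity check that the environment is set up, you can run a minimal sketch like the following (it only verifies that the core packages import and that CUDA is visible to PyTorch):

```python
# Sanity check: the core packages import and CUDA is visible to PyTorch.
import torch
import torchaudio
import onnxruntime

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchaudio:", torchaudio.__version__)
print("onnxruntime:", onnxruntime.__version__)
```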
Download the VCTK, DNS-Challenge4, and Jamendo datasets for training.
For validation, we used p225, p226, p227, and p228 from VCTK for clean speech, real noisy speech recordings from DNS-Challenge4 for noisy speech, and Jamendo/99 for music.
Downsample all audio files to 24 kHz before training (see scripts/Resampling.ipynb).
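If you prefer a plain script over the notebook, a minimal resampling sketch using librosa and soundfile could look like this (the directory layout is illustrative, not necessarily what scripts/Resampling.ipynb does):

```python
# Resample every wav under SRC_DIR to 24 kHz, mirroring the tree under DST_DIR.
# Paths are illustrative; adapt them to your dataset layout.
from pathlib import Path

import librosa
import soundfile as sf

SRC_DIR = Path("data/raw")     # hypothetical source directory
DST_DIR = Path("data/24khz")   # hypothetical output directory
TARGET_SR = 24000

for src in SRC_DIR.rglob("*.wav"):
    audio, _ = librosa.load(src, sr=TARGET_SR, mono=True)  # librosa resamples on load
    dst = DST_DIR / src.relative_to(SRC_DIR)
    dst.parent.mkdir(parents=True, exist_ok=True)
    sf.write(dst, audio, TARGET_SR)
```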
Use a configs/...yaml file to change configurations. Modify directories_to_include, directories_to_exclude, and wav_dir.
Also, modify filelists/infer_24khz.txt or filelists/infer_speech.txt, which contain the audio files used for inference in tensorboard.
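To build your own inference filelist, a sketch like the following may help (it assumes one audio path per line; verify the exact format against the shipped filelists):

```python
# Write one audio path per line (assumed format; check the shipped filelists).
from pathlib import Path

wav_dir = Path("data/24khz/vctk")  # illustrative directory
paths = sorted(str(p) for p in wav_dir.rglob("*.wav"))

with open("filelists/my_infer.txt", "w") as f:
    f.write("\n".join(paths) + "\n")
```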
Use either train.py or train_torchrun.py for training. Examples:
CUDA_VISIBLE_DEVICES=0,1 python train.py -c configs/hilcodec_music.yaml -n first_exp -p train.batch_size=16 train.seed=1234 -f
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc_per_node=2 train_torchrun.py -c configs/hilcodec_music.yaml -n first_exp -p train.batch_size=16 train.seed=1234 -f
Arguments:
-n: (Required) Directory name to save checkpoints, the configuration file, and tensorboard logs.
-c: (Optional) Configuration file path. If not given, the configuration file already present in the directory is used.
-p: (Optional) Parameters after this flag override the corresponding configuration values.
-f: (Optional) By default, if the directory already exists, an exception is raised to avoid overwriting its configuration file. This option forces overwriting.
Pre-trained model parameters are provided in the onnx directory. Two versions are available:
- hil_music: trained on a general audio dataset (clean speech, noisy speech, and music).
- hil_speech: trained only on a clean speech dataset.
Modify the variable PATH in test_onnx.py as needed, and run the following:
python test_onnx.py -n hil_speech --enc --dec
The output will be saved at onnx/hil_speech_output.wav.
Use python test_onnx.py --help for information about each argument.
Note that for AudioDec, you must set -H 300.
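test_onnx.py wraps the full pipeline; at its core, running an exported model with ONNXRuntime is a session call like the sketch below (the file name, input names, and shapes are illustrative, so query the session for the real interface):

```python
# Generic ONNXRuntime inference sketch. The model file name and the input
# shape are illustrative; inspect the session for the model's real interface.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx/hil_speech_encoder.onnx",  # hypothetical file name
                            providers=["CPUExecutionProvider"])

for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Feed a dummy waveform if the model takes a single input (shape is illustrative).
if len(sess.get_inputs()) == 1:
    x = np.zeros((1, 1, 24000), dtype=np.float32)
    outputs = sess.run(None, {sess.get_inputs()[0].name: x})
```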
You can convert your own trained HILCodec to ONNXRuntime using scripts/HILCodec Onnx.ipynb.
You can also convert Encodec and AudioDec to ONNXRuntime for comparison.
Download checkpoints from the official repositories and use scripts/Encodec Onnx.ipynb or scripts/AudioDec Onnx.ipynb.
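The notebooks handle the codec-specific details; the core of any such conversion is torch.onnx.export. A generic sketch (the module and shapes below are placeholders, not the notebooks' actual code):

```python
# Generic PyTorch-to-ONNX export sketch with a placeholder module.
import torch

model = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1)  # stand-in for a codec module
model.eval()

dummy = torch.zeros(1, 1, 24000)  # illustrative input shape
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["audio"],
    output_names=["out"],
    dynamic_axes={"audio": {2: "time"}, "out": {2: "time"}},
)
```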
Our training code includes objective metrics calculation; set pesq appropriately in a config file.
Note that on our server the calculation occasionally crashes (especially for ViSQOL), so it is turned off in the default config.
To calculate metrics after training, you can use scripts/pesq.ipynb.
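For a quick standalone check outside the notebook, the pesq and pystoi packages installed above can be used directly. A minimal sketch (file names are illustrative; PESQ only accepts 8 kHz or 16 kHz input, so 24 kHz audio is resampled first):

```python
# Standalone PESQ/STOI computation. PESQ supports only 8 kHz ('nb') or
# 16 kHz ('wb'), so load both files resampled to 16 kHz.
import librosa
from pesq import pesq
from pystoi import stoi

SR = 16000
ref, _ = librosa.load("reference.wav", sr=SR, mono=True)  # illustrative paths
deg, _ = librosa.load("decoded.wav", sr=SR, mono=True)
n = min(len(ref), len(deg))  # trim to a common length
ref, deg = ref[:n], deg[:n]

print("PESQ (wb):", pesq(SR, ref, deg, "wb"))
print("STOI:", stoi(ref, deg, SR, extended=False))
```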