
kNN-VC

ColabBadge PaperBadge

Clone of the official kNN-VC, a simple kNN-based voice conversion method.

kNN-VC method

Demo

Official demo.

Usage

Install

# Python >=3.10
pip install "torch>=2" "torchaudio>=2" numpy

No kNN-VC installation is needed; torch.hub handles everything 😉

Inference

import torch, torchaudio

src_wav_path = '<path to arbitrary 16kHz waveform>.wav'
ref_wav_paths = ['<path to arbitrary 16kHz waveform from target speaker>.wav', '<path to 2nd utterance from target speaker>.wav', ...]

knn_vc = torch.hub.load('tarepan/knn-vc-official', 'knn_vc', prematched=True, trust_repo=True, pretrained=True)

query_seq = knn_vc.get_features(src_wav_path)
matching_set = knn_vc.get_matching_set(ref_wav_paths)

out_wav = knn_vc.match(query_seq, matching_set, topk=4)
# out_wav is a (T,) tensor: the converted 16kHz output waveform, using k=4 for the kNN matching.
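
The converted waveform can then be written to disk with torchaudio, e.g. (the output filename here is arbitrary):

torchaudio.save('knnvc_out.wav', out_wav[None], 16000)  # add a channel dim: (1, T)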

Options:

  • knn_vc.match
    • topk: int - number of nearest neighbors averaged per query frame
  • torch.hub.load
    • prematched: bool - whether to load the vocoder trained on prematched features or the one trained on raw features
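
For intuition, topk is the k of the kNN regression: each source frame is replaced by the mean of its k nearest target-speaker frames under cosine similarity. A minimal illustrative sketch of that matching step (not the repo's actual implementation; knn_match is a hypothetical helper):

import torch

def knn_match(query: torch.Tensor, pool: torch.Tensor, k: int = 4) -> torch.Tensor:
    # query: (T, D) source features, pool: (N, D) target-speaker features
    q = torch.nn.functional.normalize(query, dim=-1)
    p = torch.nn.functional.normalize(pool, dim=-1)
    sims = q @ p.T                       # (T, N) cosine similarities
    idx = sims.topk(k, dim=-1).indices   # (T, k) nearest-neighbor indices per frame
    return pool[idx].mean(dim=1)         # (T, D) kNN-averaged features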

Checkpoints

Under the releases tab of this repo we provide three checkpoints:

  • Encoder: WavLM (taken from official WavLM)
  • Vocoder 1: HiFiGAN w/ raw-WavLM-L6
  • Vocoder 2: HiFiGAN w/ prematched-WavLM-L6

For the HiFiGAN models we provide both the generator-only inference checkpoint and the full training checkpoint with optimizer states.
For performance figures, see the paper.

Train

Install librosa, tensorboard, matplotlib, fastprogress and scipy.
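
For example:

pip install librosa tensorboard matplotlib fastprogress scipy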

  1. Precompute WavLM features of the vocoder dataset. We provide a utility for this for the LibriSpeech dataset in prematch_dataset.py (for what prematching does conceptually, see the sketch after this list):

    usage: prematch_dataset.py [-h] --librispeech_path LIBRISPEECH_PATH
                               [--seed SEED] --out_path OUT_PATH [--device DEVICE]
                               [--topk TOPK] [--matching_layer MATCHING_LAYER]
                               [--synthesis_layer SYNTHESIS_LAYER] [--prematch]
                               [--resume]

    e.g. (prematch): python prematch_dataset.py --librispeech_path /path/to/librispeech/root --out_path /path/where/you/want/outputs/to/go --topk 4 --matching_layer 6 --synthesis_layer 6 --prematch

  2. Train HiFiGAN (up to 2.5M steps):

    python -m hifigan.train --audio_root_path /path/to/librispeech/root/ --feature_root_path /path/to/the/output/of/previous/step/ --input_training_file data_splits/wavlm-hifigan-train.csv --input_validation_file data_splits/wavlm-hifigan-valid.csv --checkpoint_path /path/where/you/want/to/save/checkpoint --fp16 False --config hifigan/config_v1_wavlm.json --stdout_interval 25 --training_epochs 1800 --fine_tuning
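
Conceptually, prematching replaces each frame of a training utterance with the mean of its k nearest neighbors drawn from the same speaker's other utterances, so the vocoder is trained on the same kind of kNN-averaged features it sees at inference. A rough sketch in terms of the illustrative knn_match above (paths are hypothetical):

feats = knn_vc.get_features('speaker1-utt1.wav')                             # (T, D) query features
pool  = knn_vc.get_matching_set(['speaker1-utt2.wav', 'speaker1-utt3.wav'])  # pool from the same speaker
prematched = knn_match(feats, pool, k=4)  # features the vocoder is trained to invert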

Results

Sample

Demo

Performance

  • training
    • xx [iter/sec] @ NVIDIA A100 on a Paperspace Gradient Notebook (ConvTF32+/AMP+)
    • takes about xx days for the whole training
  • inference
    • z.z [sec/sample] @ xx

Official info

References

Original paper

PaperBadge

@misc{2305.18975,
  author = {Matthew Baas and Benjamin van Niekerk and Herman Kamper},
  title  = {Voice Conversion With Just Nearest Neighbors},
  year   = {2023},
  eprint = {arXiv:2305.18975},
}

Acknowledgements
