A PyTorch demo of the paper Voice Separation with an Unknown Number of Multiple Speakers using gradio and Nvidia NEMO ASR model.

muhammad-ahmed-ghani/svoice_demo

Speaker Voice Separation using Neural Nets

Installation

git clone https://github.com/Muhammad-Ahmad-Ghani/svoice_demo.git
cd svoice_demo
conda create -n svoice python=3.7 -y
conda activate svoice
# CUDA 11.3
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch -y
# CPU only
pip install torch==1.12.0+cpu torchvision==0.13.0+cpu torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

Pretrained Model   Dataset                      Epochs   Train Loss   Valid Loss
checkpoint.th      Librimix-7 (16k-mix_clean)   31       0.04         0.64

This is an intermediate checkpoint, provided for demo purposes only.

Create the directory outputs/exp_ and save the checkpoint there:

svoice_demo
├── outputs
│   └── exp_
│       └── checkpoint.th
...
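
The placement step above can also be scripted. The following is a minimal sketch, assuming the checkpoint has already been downloaded to some local path; the helper name `install_checkpoint` is hypothetical and not part of this repository.

```python
import shutil
from pathlib import Path


def install_checkpoint(src, repo_root="."):
    """Copy a downloaded checkpoint to <repo_root>/outputs/exp_/checkpoint.th.

    `src` is wherever the checkpoint was downloaded (an assumption; the
    README does not prescribe a download location).
    """
    target_dir = Path(repo_root) / "outputs" / "exp_"
    target_dir.mkdir(parents=True, exist_ok=True)  # create outputs/exp_ if missing
    target = target_dir / "checkpoint.th"
    shutil.copy2(src, target)  # keep the file name expected by the layout above
    return target
```

Called as `install_checkpoint("/path/to/downloaded.th")` from the repository root, it should produce the layout shown in the tree above.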

Run Gradio Demo

conda activate svoice
python demo.py

Training

Create the mix_clean dataset at a 16 kHz sample rate using the LibriMix repository.

Dataset Structure

svoice_demo
├── Libri{NUM_OF_SPEAKERS}Mix_Dataset -> Libri7Mix_Dataset
│   └── wav{SAMPLE_RATE_VALUE}k -> wav16k
│       └── min
│           ├── dev
│           │   └── ...
│           ├── test
│           │   └── ...
│           └── train-360
│               └── ...
...
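
Before generating metadata, it can help to confirm that the dataset sits where the tree above expects it. This is a small sketch under the assumption that the three split folders shown (dev, test, train-360) are the ones required; the function name `missing_splits` is illustrative only.

```python
from pathlib import Path

# Split folders taken from the tree above.
EXPECTED_SPLITS = ("dev", "test", "train-360")


def missing_splits(repo_root=".", num_speakers=7, sample_rate_khz=16):
    """Return the expected split directories that do not exist yet."""
    base = (
        Path(repo_root)
        / f"Libri{num_speakers}Mix_Dataset"
        / f"wav{sample_rate_khz}k"
        / "min"
    )
    return [str(base / split) for split in EXPECTED_SPLITS
            if not (base / split).is_dir()]
```

An empty list means every expected split folder is present; otherwise the returned paths show what still needs to be generated or moved.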

Create metadata files

Optionally, run one of the predefined scripts:

# for 7 speakers
bash create_metadata_librimix7.sh
# for 10 speakers
bash create_metadata_librimix10.sh
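
For a custom dataset, the metadata can be produced by hand. The sketch below assumes the format used by the original svoice repository, where each JSON file is a sorted list of [path, length in samples] pairs; check the scripts above to confirm this for your version. It uses only the standard library, and the names `list_wavs` and `write_metadata` are hypothetical.

```python
import json
import wave
from pathlib import Path


def list_wavs(folder):
    """Return sorted [path, num_frames] pairs for every .wav file under folder.

    Note: the stdlib `wave` module reads PCM WAV files only.
    """
    entries = []
    for wav_path in Path(folder).rglob("*.wav"):
        with wave.open(str(wav_path), "rb") as handle:
            entries.append([str(wav_path), handle.getnframes()])
    return sorted(entries)


def write_metadata(folder, out_json):
    """Write the listing for one folder (e.g. the mixtures or one speaker)."""
    entries = list_wavs(folder)
    Path(out_json).write_text(json.dumps(entries, indent=4))
    return entries
```

One call per folder would then yield mix.json for the mixtures and one s<ID>.json per separated speaker channel.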

Edit conf/config.yaml to match your setup. To set the number of speakers, change the value of C (C: NUM_OF_SPEAKERS, at line 66). Then start training:

python train.py

This automatically reads all settings from conf/config.yaml. For more details on training, refer to the original svoice repository.

Distributed Training

python train.py ddp=1

Evaluating

python -m svoice.evaluate <path to the model> <path to folder containing mix.json and all target separated channels json files s<ID>.json>
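
The evaluation folder must hold mix.json plus one s<ID>.json per target channel. A quick pre-flight check might look like the sketch below; it assumes speaker IDs run from 1 to the number of speakers (s1.json, s2.json, ...), which should be verified against your metadata, and `missing_eval_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_eval_files(folder, num_speakers):
    """Return the metadata files (mix.json, s1.json, ...) absent from folder."""
    expected = ["mix.json"] + [f"s{i}.json" for i in range(1, num_speakers + 1)]
    return [name for name in expected if not (Path(folder) / name).is_file()]
```

An empty result suggests the folder is ready to pass as the second argument of the evaluate command.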

Citation

The svoice code is borrowed from the original svoice repository. All rights to the code are reserved by Meta Research.

@inproceedings{nachmani2020voice,
  title={Voice Separation with an Unknown Number of Multiple Speakers},
  author={Nachmani, Eliya and Adi, Yossi and Wolf, Lior},
  booktitle={Proceedings of the 37th International Conference on Machine Learning},
  year={2020}
}
@misc{cosentino2020librimix,
    title={LibriMix: An Open-Source Dataset for Generalizable Speech Separation},
    author={Joris Cosentino and Manuel Pariente and Samuele Cornell and Antoine Deleforge and Emmanuel Vincent},
    year={2020},
    eprint={2005.11262},
    archivePrefix={arXiv},
    primaryClass={eess.AS}
}

License

This repository is released under the CC-BY-NC-SA 4.0 license, as found in the LICENSE file.

The files svoice/models/sisnr_loss.py and svoice/data/preprocess.py were adapted from the kaituoxu/Conv-TasNet repository, an unofficial implementation of the paper Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation, released under the MIT License. Additionally, several input manipulation functions were borrowed and modified from the yluo42/TAC repository, released under the CC BY-NC-SA 3.0 License.
