
Speaker Diarization using GRU in PyTorch

Installation

Conda environment

To set up the conda environment, install Miniconda, then run this command:

conda env create -f environment.yml

Install OpenVINO (Optional)

If you would like to use OpenVINO for inference, please check the official OpenVINO documentation for installation instructions.

Step 1: Training 💪

Simply run the code in the Part 1 - Training.ipynb notebook and you are good to go.

Here is an overview of the training process if you would like to learn more:

  • Dataset Preparation
    You should structure your data in this format:

    data
        train
            speaker_1
                audio_1.wav
                audio_2.wav
            speaker_2
                audio_1.wav
                audio_2.wav            
        valid
            speaker_1
                audio_1.wav
                audio_2.wav
            speaker_2
                audio_1.wav
                audio_2.wav
        test
            both_speaker_speaking.wav      
    

    You can use this sample dataset on Kaggle to help you understand how to structure your own data.

  • Dataset & Dataloader
    I have created an AudioFolder PyTorch Dataset object that builds the dataset automatically (a usage sketch follows this list). Here is a quick look:

    AudioFolder(folder_path, n_data=500, slice_dur=5, sr=16000, n_mel=128, n_fft=2048, win_length=2048, hop_length=1024, freq_range=(0, 8000))
    

    In a nutshell, here is what it does:

    • Randomly chooses an audio file from folder_path, then samples it at the sr sampling rate
    • Randomly slices a slice_dur-second chunk of the audio, quantile-normalizes it, then converts it to a log mel spectrogram using the specified parameters (n_mel, n_fft, and so on)
    • Uses the zero-crossing rate as a VAD to label the spectrogram frames: each frame is labeled with its subfolder name, or <NONE> if it is silent
    • Repeats these steps n_data times to create the dataset
  • Architecture & Config
    Simply create the model architecture you would like to use. I have made a sample for you in src/model.py, which you can use directly by simply changing the hyperparameters in the config (a minimal architecture sketch follows this list). Please follow the notebook if you are confused.

    ❗ Note: If you would like to use a custom architecture in OpenVINO, make sure it is compatible with the OpenVINO Model Optimizer and Inference Engine.

  • Training Preparation
    Set up the model, criterion, optimizer, and callbacks here (a typical setup is sketched after this list).

  • Training
    As the name says, running this code trains the model.
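
Before moving on to Step 2, here is a minimal sketch of how AudioFolder might be wired into a standard PyTorch DataLoader. The AudioFolder name and parameters come from the signature above; the src.dataset import path and the batch size are assumptions, so check the notebook for the actual usage.

from torch.utils.data import DataLoader

from src.dataset import AudioFolder  # assumed import path; may differ in this repo

# Build train/valid datasets from the folder structure shown above
train_set = AudioFolder("data/train", n_data=500, slice_dur=5, sr=16000,
                        n_mel=128, n_fft=2048, win_length=2048,
                        hop_length=1024, freq_range=(0, 8000))
valid_set = AudioFolder("data/valid", n_data=100)

# Standard PyTorch dataloaders; batch_size is an example value
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=16)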
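
Here is also a minimal sketch of what a GRU frame classifier for this task could look like. This is an illustration under assumed hyperparameters, not the actual model in src/model.py: it consumes log mel spectrogram frames (n_mel features per frame) and predicts a label per frame, with one extra class for the <NONE> (silence) label.

import torch
from torch import nn

class GRUDiarizer(nn.Module):
    # Illustrative GRU frame classifier; see src/model.py for the real one
    def __init__(self, n_mel=128, hidden_size=256, num_layers=2, n_classes=3):
        super().__init__()
        # n_classes = number of speakers + 1 for the <NONE> (silence) label
        self.gru = nn.GRU(n_mel, hidden_size, num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x: (batch, time, n_mel) log mel spectrogram frames
        out, _ = self.gru(x)         # (batch, time, hidden_size)
        return self.classifier(out)  # (batch, time, n_classes) logits

logits = GRUDiarizer()(torch.randn(16, 80, 128))  # dummy batch: 16 clips x 80 frames
print(logits.shape)                               # torch.Size([16, 80, 3])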
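
And a typical training setup and loop, assuming the per-frame labels from AudioFolder come as (batch, time) integer tensors; the repo's actual criterion, optimizer, and callback choices live in the notebook.

import torch
from torch import nn

model = GRUDiarizer()              # from the sketch above
criterion = nn.CrossEntropyLoss()  # per-frame classification loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for mel, label in train_loader:             # tensors from AudioFolder
    optimizer.zero_grad()
    logits = model(mel)                     # (batch, time, n_classes)
    loss = criterion(logits.flatten(0, 1),  # -> (batch*time, n_classes)
                     label.flatten())       # -> (batch*time,)
    loss.backward()
    optimizer.step()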

Step 2: Inference 🤖

I have prepared PyTorchPredictor and OpenVINOPredictor objects. Here is a sneak peek:

PyTorchPredictor(weights_path, config_path, chunk_dur)
OpenVINOPredictor(model_xml, model_bin, config_path, chunk_dur, CPU_EXT)

To use one, simply pass in the arguments and call .predict(wav_path); it will return the diarization timestamps. For example:

p = PyTorchPredictor("model/weights_best.pth", "model/configs.pth", chunk_dur=2)
timestamp = p.predict("dataset/test/test.wav", plot=True)

Setting plot=True also displays a visualization of the diarization result (the Visualized Diarization figure in the repository).

If you would like to use OpenVINO, use .to_onnx(fname, outdir) to convert the model to ONNX format, then run the result through the OpenVINO Model Optimizer to get the IR files that OpenVINOPredictor expects.

p.to_onnx("speaker_diarization.onnx", "model")
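
The Model Optimizer produces .xml and .bin IR files from the ONNX model; refer to the OpenVINO documentation for the exact command for your version. Once you have them, the OpenVINOPredictor from the sneak peek above can be used the same way as the PyTorch one. The file paths here are hypothetical, and whether CPU_EXT can be left empty depends on your OpenVINO build:

p = OpenVINOPredictor("model/speaker_diarization.xml",  # hypothetical IR paths
                      "model/speaker_diarization.bin",
                      "model/configs.pth",
                      chunk_dur=2,
                      CPU_EXT=None)  # set to your CPU extension library path if required
timestamp = p.predict("dataset/test/test.wav")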

End notes ❤️

I hope you like this project. I made this repo for educational purposes, so it might need further tweaking to reach production quality. Feel free to create an issue if you need help, and I hope I'll have time to assist. Thank you.
