A Minimal Implementation of Music Tagging

This is a minimal implementation of music tagging with PyTorch. We use the GTZAN dataset, which contains 1,000 30-second audio clips covering 10 genres. We use 900 audio files for training and 100 for validation, and train a convolutional neural network (CNN) as the classifier.
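
As an illustrative sketch only (the layer sizes, names, and pooling below are assumptions, not the repo's exact Cnn architecture), a minimal PyTorch CNN classifier over log mel spectrograms might look like:

import torch
import torch.nn as nn

class MiniCnn(nn.Module):
    """Illustrative minimal CNN tagger; all layer sizes are assumptions."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # input: (B, 1, mel, time)
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.conv(x)        # (B, 64, mel/4, time/4)
        x = x.mean(dim=[2, 3])  # global average pooling over mel and time
        return self.fc(x)       # (B, num_classes) logits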

0. Download dataset

The original dataset link (https://marsyas.info/index.html) is no longer available; please download the dataset from other sources. Below are log mel spectrograms of audio clips from different genres.

[Figure: log mel spectrograms of different genres]

The downloaded dataset looks like:

dataset_root (1.3 GB)
└── genres
    ├── blues (100 files)
    ├── classical (100 files)
    ├── country (100 files)
    ├── disco (100 files)
    ├── hiphop (100 files)
    ├── jazz (100 files)
    ├── metal (100 files)
    ├── pop (100 files)
    ├── reggae (100 files)
    └── rock (100 files)
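
As a rough sketch of turning a clip into log mel spectrogram features (torchaudio, the file path, and all parameter values below are assumptions; the repo may use different settings):

import torch
import torchaudio

# The path and feature parameters are illustrative.
waveform, sr = torchaudio.load("dataset_root/genres/blues/blues.00000.wav")
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr,
    n_fft=2048,
    hop_length=512,
    n_mels=64,
)(waveform)
log_mel = torch.log(mel + 1e-10)  # (1, n_mels, time), ready for a CNN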

1. Install dependencies

git clone https://github.com/qiuqiangkong/mini_music_tagging

# Create a Python environment.
conda create --name music_tagging python=3.8

# Activate environment.
conda activate music_tagging

# Install Python package dependencies.
sh env.sh

2. Single-GPU training

We use the Wandb toolkit for logging. You may set wandb_log to False or use other loggers.

CUDA_VISIBLE_DEVICES=0 python train.py
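
A minimal sketch of such a logging toggle (the wandb_log flag, project name, and logged keys are assumptions, not necessarily the repo's exact code):

import random
import wandb

wandb_log = True  # set to False to disable Wandb logging

if wandb_log:
    wandb.init(project="mini_music_tagging")  # project name is illustrative

for step in range(0, 10001, 200):
    loss = random.random()  # stand-in for the real training loss
    if wandb_log:
        wandb.log({"loss": loss}, step=step)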

3. Multi-GPU training

We use the Hugging Face Accelerate toolkit for multi-GPU training. Here is an example of training with 4 GPUs.

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --multi_gpu --num_processes 4 train_accelerate.py
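
Under the hood, Accelerate wraps an ordinary PyTorch training loop. A minimal sketch of the pattern (the model, data, and optimizer below are stand-ins, not the repo's actual train_accelerate.py):

import torch
import torch.nn as nn
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # handles device placement and process setup

# Stand-in model and data; the real script trains a CNN on GTZAN features.
model = nn.Linear(64, 10)
dataset = TensorDataset(torch.randn(900, 64), torch.randint(0, 10, (900,)))
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()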

Training for 10,000 steps takes around 20 minutes on a single RTX 4090 GPU. The output looks like:

0it [00:00, ?it/s]step: 0, loss: 0.865           
Accuracy: 0.1                                 
Save model to checkpoints/train/Cnn/step=0.pth   
Save model to checkpoints/train/Cnn/latest.pth
200it [00:31,  7.80it/s]step: 200, loss: 0.159   
Accuracy: 0.48                                
Save model to checkpoints/train/Cnn/step=200.pth 
Save model to checkpoints/train/Cnn/latest.pth
...
Accuracy: 0.64
Save model to checkpoints/train/Cnn/step=10000.pth
Save model to checkpoints/train/Cnn/latest.pth

The validation accuracy during training looks like:

[Figure: validation accuracy during training]

4. Inference

You can use a trained checkpoint for inference.

CUDA_VISIBLE_DEVICES=0 python inference.py
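
As a sketch of what inference could involve (reusing the illustrative MiniCnn and feature code from above; the checkpoint path comes from the training log, but the state-dict format and audio path are assumptions):

import torch
import torchaudio

model = MiniCnn(num_classes=10)  # the illustrative model sketched earlier
checkpoint = torch.load("checkpoints/train/Cnn/latest.pth", map_location="cpu")
model.load_state_dict(checkpoint)  # assumes the checkpoint is a plain state dict
model.eval()

waveform, sr = torchaudio.load("some_clip.wav")  # path is illustrative
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=64)(waveform)
log_mel = torch.log(mel + 1e-10).unsqueeze(0)  # (1, 1, n_mels, time)

with torch.no_grad():
    logits = model(log_mel)

genres = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]
print("Predicted genre:", genres[logits.argmax(dim=-1).item()])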

For example, we test on fold 0 and get the following result:

Accuracy: 0.670

Reference

@article{kong2020panns,
  title={{PANNs}: Large-scale pretrained audio neural networks for audio pattern recognition},
  author={Kong, Qiuqiang and Cao, Yin and Iqbal, Turab and Wang, Yuxuan and Wang, Wenwu and Plumbley, Mark D},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={28},
  pages={2880--2894},
  year={2020},
}
