A Minimal Implementation of Music Tagging

This is a minimal implementation of music tagging with PyTorch. We use the GTZAN dataset, which contains 1,000 30-second audio clips covering 10 genres. We use 900 audio files for training and 100 for validation, and train a convolutional neural network (CNN) as the classifier.
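
As an illustrative sketch only (the layer sizes, names, and pooling below are assumptions, not the repo's exact Cnn architecture), a minimal PyTorch CNN classifier over log mel spectrograms might look like:

import torch
import torch.nn as nn

class MiniCnn(nn.Module):
    """Illustrative minimal CNN tagger; all layer sizes are assumptions."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # input: (B, 1, mel, time)
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.conv(x)        # (B, 64, mel/4, time/4)
        x = x.mean(dim=[2, 3])  # global average pooling over mel and time
        return self.fc(x)       # (B, num_classes) logits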

0. Download dataset

The original dataset link (https://marsyas.info/index.html) is no longer available; please download the dataset from other sources. Below are log mel spectrograms of audio clips from different genres.

[Figure: log mel spectrograms of different genres]

The downloaded dataset looks like:

dataset_root (1.3 GB)
└── genres
    ├── blues (100 files)
    ├── classical (100 files)
    ├── country (100 files)
    ├── disco (100 files)
    ├── hiphop (100 files)
    ├── jazz (100 files)
    ├── metal (100 files)
    ├── pop (100 files)
    ├── reggae (100 files)
    └── rock (100 files)
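
As a rough sketch of turning a clip into log mel spectrogram features (torchaudio, the file path, and all parameter values below are assumptions; the repo may use different settings):

import torch
import torchaudio

# The path and feature parameters are illustrative.
waveform, sr = torchaudio.load("dataset_root/genres/blues/blues.00000.wav")
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr,
    n_fft=2048,
    hop_length=512,
    n_mels=64,
)(waveform)
log_mel = torch.log(mel + 1e-10)  # (1, n_mels, time), ready for a CNN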

1. Install dependencies

git clone https://github.com/qiuqiangkong/mini_music_tagging

# Create a Python environment.
conda create --name music_tagging python=3.8

# Activate environment.
conda activate music_tagging

# Install Python package dependencies.
sh env.sh

2. Single-GPU training

We use the Wandb toolkit for logging. You may set wandb_log to False or use other loggers.

CUDA_VISIBLE_DEVICES=0 python train.py
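
A minimal sketch of such a logging toggle (the wandb_log flag, project name, and logged keys are assumptions, not necessarily the repo's exact code):

import random
import wandb

wandb_log = True  # set to False to disable Wandb logging

if wandb_log:
    wandb.init(project="mini_music_tagging")  # project name is illustrative

for step in range(0, 10001, 200):
    loss = random.random()  # stand-in for the real training loss
    if wandb_log:
        wandb.log({"loss": loss}, step=step)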

3. Multi-GPU training

We use the Hugging Face Accelerate toolkit for multi-GPU training. Here is an example of training with 4 GPUs.

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --multi_gpu --num_processes 4 train_accelerate.py
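
Under the hood, Accelerate wraps an ordinary PyTorch training loop. A minimal sketch of the pattern (the model, data, and optimizer below are stand-ins, not the repo's actual train_accelerate.py):

import torch
import torch.nn as nn
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()  # handles device placement and process setup

# Stand-in model and data; the real script trains a CNN on GTZAN features.
model = nn.Linear(64, 10)
dataset = TensorDataset(torch.randn(900, 64), torch.randint(0, 10, (900,)))
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()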

Training for 10,000 steps takes around 20 minutes on a single RTX 4090 GPU. The output looks like:

0it [00:00, ?it/s]step: 0, loss: 0.865           
Accuracy: 0.1                                 
Save model to checkpoints/train/Cnn/step=0.pth   
Save model to checkpoints/train/Cnn/latest.pth
200it [00:31,  7.80it/s]step: 200, loss: 0.159   
Accuracy: 0.48                                
Save model to checkpoints/train/Cnn/step=200.pth 
Save model to checkpoints/train/Cnn/latest.pth
...
Accuracy: 0.64
Save model to checkpoints/train/Cnn/step=10000.pth
Save model to checkpoints/train/Cnn/latest.pth

The validation accuracy during training looks like:

[Figure: validation accuracy during training]

4. Inference

You can use a trained checkpoint for inference.

CUDA_VISIBLE_DEVICES=0 python inference.py
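
As a sketch of what inference could involve (reusing the illustrative MiniCnn and feature code from above; the checkpoint path comes from the training log, but the state-dict format and audio path are assumptions):

import torch
import torchaudio

model = MiniCnn(num_classes=10)  # the illustrative model sketched earlier
checkpoint = torch.load("checkpoints/train/Cnn/latest.pth", map_location="cpu")
model.load_state_dict(checkpoint)  # assumes the checkpoint is a plain state dict
model.eval()

waveform, sr = torchaudio.load("some_clip.wav")  # path is illustrative
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=64)(waveform)
log_mel = torch.log(mel + 1e-10).unsqueeze(0)  # (1, 1, n_mels, time)

with torch.no_grad():
    logits = model(log_mel)

genres = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]
print("Predicted genre:", genres[logits.argmax(dim=-1).item()])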

For example, we test on fold 0 and get the following result:

Accuracy: 0.670

Reference

@article{kong2020panns,
  title={{PANNs}: Large-scale pretrained audio neural networks for audio pattern recognition},
  author={Kong, Qiuqiang and Cao, Yin and Iqbal, Turab and Wang, Yuxuan and Wang, Wenwu and Plumbley, Mark D},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={28},
  pages={2880--2894},
  year={2020},
}
