This is the PyTorch implementation of the paper: Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training.
In this paper, we present a novel self-supervised learning method for transformer-based audio models, called masked spectrogram prediction (MaskSpec), which learns powerful audio representations from unlabeled audio data (AudioSet in this paper). Our method masks random patches of the input spectrogram and reconstructs the masked regions with an encoder-decoder architecture. Without using extra model weights or supervision, experimental results on multiple downstream datasets demonstrate that MaskSpec achieves a significant performance gain over supervised methods and outperforms previous pre-trained models. In particular, our best model reaches 0.471 mAP on AudioSet, 0.854 mAP on OpenMIC2018, 0.982 accuracy on ESC-50, 0.976 accuracy on SCV2, and 0.823 accuracy on DCASE2019 Task1A.
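For intuition, the core operation is MAE-style random masking of spectrogram patches before the encoder. The sketch below is a minimal PyTorch illustration; the function name, tensor shapes, and the 0.75 masking ratio are assumptions for the example, not the repository's actual API.

```python
import torch

def random_patch_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking of spectrogram patches (illustrative only).

    patches: (batch, num_patches, patch_dim), the spectrogram already split
    into non-overlapping patches. Returns the visible patches, a binary mask
    (1 = masked, to be reconstructed by the decoder), and the indices that
    restore the original patch order.
    """
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))

    noise = torch.rand(b, n)                         # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation per example
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(b, n)
    mask[:, :num_keep] = 0                           # 0 = kept, 1 = masked
    mask = torch.gather(mask, 1, ids_restore)        # back to original patch order
    return visible, mask, ids_restore
```

For a `(4, 512, 256)` batch, `random_patch_mask` keeps 128 visible patches per example at the default ratio; only those are fed to the encoder, and the decoder reconstructs the rest.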
Continuously Updating :)
Our experiments are based on CUDA 11.5 and Python 3.7.10:

```bash
conda create -n maskspec python=3.7.10
conda activate maskspec
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```
- Change `base_dir` in `audioset/dataset.py` to your own path.
- Change `hdf5_file` in `audioset/get_mean_std.py` to your own path; it should point to the unbalanced training data of AudioSet.
- Run `python audioset/get_mean_std.py` to compute the mean and std over 10000 random samples; this produces a file named `mean_std_128.npy` in your working directory (a conceptual sketch of this computation follows the list).
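Conceptually, that script samples clips at random and accumulates per-mel-bin statistics. Below is a minimal sketch under the assumption of 128 mel bins and a `(2, 128)` `[mean, std]` output layout; the real script's HDF5 handling and exact output format may differ.

```python
import numpy as np

def estimate_mean_std(spectrograms, num_samples=10000, seed=0):
    """Estimate per-mel-bin mean/std from a random subset of clips.

    spectrograms: a sequence of log-mel arrays, each shaped (time, 128).
    Saves a (2, 128) array [mean, std] to mean_std_128.npy.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(spectrograms),
                     size=min(num_samples, len(spectrograms)), replace=False)
    # Concatenate all frames of the sampled clips, then reduce over time.
    frames = np.concatenate([np.asarray(spectrograms[i]) for i in idx], axis=0)
    mean, std = frames.mean(axis=0), frames.std(axis=0)
    np.save('mean_std_128.npy', np.stack([mean, std]))
    return mean, std
```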
We provide a simple script to extract embeddings and obtain audio tagging results; a toy post-processing sketch follows the sample output below. Our trained model (mAP of 0.471 on AudioSet) can be found on our Google Drive page. The pretrained model can also be found on our Google Drive page. Feel free to download and use them.
```bash
bash scripts/test.sh
```
Test results:

```
[04:34:44.007448] Top 8 sound events: ['Cat', 'Animal', 'Domestic animals pets', 'Caterwaul', 'Meow', 'Speech', 'Music', 'Inside, small room']
[04:34:44.007499] Top 1 sound event: Cat
[04:34:44.007517] embedding: (1, 768)
```
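To make the post-processing concrete, here is a toy, self-contained sketch that reproduces the shape of the log above. The label list, logits, and embedding are random stand-ins, and the `(logits, embedding)` interface is an assumption; `scripts/test.sh` is the real entry point.

```python
import torch

# Toy reproduction of the log above; everything here is a placeholder.
labels = ['Cat', 'Animal', 'Meow', 'Speech', 'Music']  # stand-in for the 527 AudioSet classes
logits = torch.randn(1, len(labels))                   # pretend clip-level logits from the model
embedding = torch.randn(1, 768)                        # pretend clip-level embedding

probs = torch.sigmoid(logits)                          # multi-label tagging: sigmoid, not softmax
topk = torch.topk(probs[0], k=min(8, len(labels)))
print('Top sound events:', [labels[i] for i in topk.indices])
print('Top 1 sound event:', labels[topk.indices[0]])
print('embedding:', tuple(embedding.shape))
```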
If you want to train your own MaskSpec models, we provide the following training scripts.
Download and prepare the dataset as explained on the audioset page. You can use the audio files provided by PANNs: https://pan.baidu.com/s/13WnzI1XDSvqXZQTS-Kqujg (password: 0vc2). A hedged sketch of one possible HDF5 packing step follows.
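For reference, a PANNs-style preparation step packs the raw clips into an HDF5 file. The sketch below is only illustrative; the dataset names (`waveform`, `target`) and dtypes are assumptions, not a guaranteed match for this repository's loaders.

```python
import h5py
import numpy as np

def pack_waveforms(waveforms, targets, out_path='unbalanced_train.h5'):
    """Pack clips into one HDF5 file (illustrative layout).

    waveforms: (N, num_samples) int16 audio; targets: (N, 527) multi-hot labels.
    """
    with h5py.File(out_path, 'w') as hf:
        hf.create_dataset('waveform', data=np.asarray(waveforms, dtype=np.int16))
        hf.create_dataset('target', data=np.asarray(targets, dtype=bool))
```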
The base ViT model can be trained from scratch, for example (using 8 GPUs):

```bash
bash scripts/train_from_scratch_vit.sh
```
The base ViT model can be pretrained, for example (using 8 GPUs):

```bash
bash scripts/pretrain_vit.sh
```
The base ViT model can also be pretrained by submitting a cluster job via submitit, for example (using 8 GPUs):

```bash
bash scripts/submitit_pretrain.sh
```
The base ViT model can be fine-tuned, for example (using 8 GPUs):

```bash
bash scripts/finetune_vit.sh
```
The base ViT model can also be fine-tuned on the DCASE downstream task, for example (the first three commands prepare the data; a sketch of the mp3 conversion step follows the commands):

```bash
python dcase18/convert_to_mp3.py
python dcase18/create_h5pymp3_dataset.py
python dcase18/get_mean_std.py
bash dcase18/run.sh
```
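As a rough, self-contained equivalent of the conversion step, the sketch below shells out to ffmpeg; the codec settings and output layout are assumptions, so check `dcase18/convert_to_mp3.py` for the parameters actually used.

```python
import subprocess
from pathlib import Path

def convert_to_mp3(wav_path: str, out_dir: str = 'mp3') -> Path:
    """Convert one wav file to mp3 with ffmpeg (illustrative settings)."""
    out = Path(out_dir) / (Path(wav_path).stem + '.mp3')
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(['ffmpeg', '-y', '-i', wav_path,
                    '-codec:a', 'libmp3lame', '-b:a', '64k', str(out)],
                   check=True)
    return out
```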
- PaSST: Efficient Training of Audio Transformers with Patchout
- Masked Autoencoders Are Scalable Vision Learners
If you have any problems with our code, feel free to contact us or describe your problem in Issues.