This is an implementation of the VGGishish model proposed in "Taming Visually Guided Audio Generation" by Vladimir Iashin and Esa Rahtu. The code closely follows the official SpecVQGAN implementation. In this repo, the model is used to classify between hit and scratch sounds from the Greatest Hits dataset.
Install the conda environment from the environment.yml file:
conda env create -f environment.yml
To download the data, go to the official website of the Greatest Hits dataset: https://andrewowens.com/vis/
After downloading the data, preprocess it with the provided wav_to_melspec.py script:
python wav_to_melspec.py --data_path=path/to/data --save_path=path/to/save
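The preprocessing step converts each WAV file into a log-mel spectrogram. A minimal numpy-only sketch of what that conversion involves (the parameter values here — sample rate, FFT size, hop length, 80 mel bins — are illustrative assumptions, not necessarily the script's actual settings):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def wav_to_melspec(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame the signal, window each frame, take the power spectrum,
    # project onto the mel filterbank, and compress with a log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T  # (n_frames, n_mels)
    return np.log(np.clip(mel, 1e-5, None))
```

In practice a library such as librosa performs this conversion with well-tested defaults; the sketch only shows the structure of the computation.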
To train the model, run:
python train.py config=configs/vggishish.yaml
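The model trained by the command above is a VGG-style convolutional classifier over log-mel spectrograms. A minimal PyTorch sketch of that architecture shape (layer widths and depths here are illustrative assumptions, not the repo's actual configuration):

```python
import torch
import torch.nn as nn

class VGGishishSketch(nn.Module):
    """Hypothetical VGG-style binary classifier over log-mel spectrograms.

    Input: (batch, 1, n_mels, n_frames). Global average pooling makes the
    head independent of the spectrogram's time length.
    """
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse freq/time to one vector
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)  # (batch, 64)
        return self.classifier(h)        # (batch, n_classes) logits
```

With two classes (hit vs. scratch), the logits would typically be trained with cross-entropy loss.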
To test the model, run:
python test.py config=configs/vggishish.yaml ckpt_path=path/to/ckpt/file.pt
To view training results in TensorBoard, run:
tensorboard --logdir=logs