Frequently Asked Questions (FAQs)

  • Can I run musicnn on a CPU? Yes: the models are already trained, so you only need to run inference, which works fine on a CPU (see the sketch below).
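
A minimal inference sketch, assuming the top_tags helper described in the musicnn README; no GPU is required to run a pre-trained model:

```python
from musicnn.tagger import top_tags

# Predict the top 3 tags for one of the example files shipped with the
# repository, using the pre-trained MTT model; this runs on a plain CPU.
top_tags('./audio/TRWJAZW128F42760DD_test.mp3', model='MTT_musicnn', topN=3)
```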

  • I'm missing a functionality. How can I get it? musicnn is fairly simple, so feel free to extend it as you wish! And tell us if you think the new functionality would be useful for the rest of us.

  • Why does musicnn contain VGG models? Because they are a nice baseline, and because people like to use computer vision models on spectrograms. Hence, in this repository you can find both musicnn-based models (musically motivated convolutional neural networks) and VGGs (a computer vision architecture applied to audio).

  • What architecture do the musicnn-based models employ? They use a musically motivated CNN front-end, some dense layers in the mid-end, and a temporal-pooling back-end. In this jupyter notebook we provide further details about the model; a conceptual sketch follows below.
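
To make the front-end / mid-end / back-end split concrete, here is a rough conceptual sketch in Keras. This is not the actual musicnn code, and every layer size below is an illustrative placeholder:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conceptual_musicnn(n_frames=187, n_mels=96, n_tags=50):
    spec = layers.Input(shape=(n_frames, n_mels, 1))  # log-mel spectrogram

    # Front-end: musically motivated filters. Tall filters span many mel
    # bands (timbre); wide filters span many frames (temporal patterns).
    timbral = layers.Conv2D(32, (7, int(0.9 * n_mels)), padding='same',
                            activation='relu')(spec)
    temporal = layers.Conv2D(32, (int(0.9 * n_frames), 7), padding='same',
                             activation='relu')(spec)
    frontend = layers.Concatenate()([timbral, temporal])

    # Mid-end: dense (1x1-convolution-like) layers mixing front-end features.
    mid = layers.Conv2D(64, (1, 1), activation='relu')(frontend)

    # Back-end: temporal pooling summarises the whole excerpt, followed by
    # a classifier over the 50-tag vocabulary.
    pooled = layers.GlobalMaxPooling2D()(mid)
    out = layers.Dense(n_tags, activation='sigmoid')(pooled)
    return tf.keras.Model(spec, out)
```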

  • Which musicnn layer output is best for transfer learning? Although we haven't run exhaustive tests, throughout our visualisations and preliminary experiments we found the taggram and the max_pool layer to be the best for this purpose: the taggram because it already provides high-level music information, and the max_pool layer because it provides a relatively sparse acoustic representation of the music audio. See the extraction sketch below.
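
A minimal extraction sketch, assuming the extractor interface from the musicnn README and that the returned features dictionary exposes a 'max_pool' key:

```python
from musicnn.extractor import extractor

# Run the pre-trained MTT model and keep the intermediate representations.
taggram, tags, features = extractor('./audio/TRWJAZW128F42760DD_test.mp3',
                                    model='MTT_musicnn',
                                    extract_features=True)

print(taggram.shape)               # high-level tag activations over time
print(features['max_pool'].shape)  # sparse acoustic representation
```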

  • Which 50 tags does the MTT model predict? These are determined by the MagnaTagATune dataset, which is used for training the MTT models: guitar, classical, slow, techno, strings, drums, electronic, rock, fast, piano, ambient, beat, violin, vocal, synth, female, indian, opera, male, singing, vocals, no vocals, harpsichord, loud, quiet, flute, woman, male vocal, no vocal, pop, soft, sitar, solo, man, classic, choir, voice, new age, dance, male voice, female vocal, beats, harp, cello, no voice, weird, country, metal, female voice, choral.

  • Which 50 tags does the MSD model predict? These are determined by the Million Song Dataset, which is used for training the MSD models: rock, pop, alternative, indie, electronic, female vocalists, dance, 00s, alternative rock, jazz, beautiful, metal, chillout, male vocalists, classic rock, soul, indie rock, Mellow, electronica, 80s, folk, 90s, chill, instrumental, punk, oldies, blues, hard rock, ambient, acoustic, experimental, female vocalist, guitar, Hip-Hop, 70s, party, country, easy listening, sexy, catchy, funk, electro, heavy metal, Progressive rock, 60s, rnb, indie pop, sad, House, happy.

  • What are the typical cases where the model fails? When the input audio has content that is outside the 50-tag vocabulary. Although in these cases the predictions are still consistent and reasonable, the model cannot predict, for example, bass if that tag is not part of its vocabulary.

  • Why do the MTT models predict both no vocals and no vocal? Because the vocabulary of the model is determined by the MagnaTagATune dataset, and we used it as is.

  • My model is slow, even with a GPU. Can I do something? Yes! In ./musicnn/configuration.py you can set a bigger batch size. The default is BATCH_SIZE = 1, which can be slow but is computationally safe. See the sketch below.
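
A minimal sketch of the change in ./musicnn/configuration.py; the value 16 is only an illustrative choice, and the right setting depends on how much memory your machine has:

```python
# ./musicnn/configuration.py
BATCH_SIZE = 16  # default is 1: slow, but safe on low-memory machines
```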

  • What are these songs you include in the repository?
    ./audio/joram-moments_of_clarity-08-solipsism-59-88.mp3 is an electronic music song from the test set of the MagnaTagATune dataset.
    ./audio/TRWJAZW128F42760DD_test.mp3 is an instrumental excerpt of the Muddy Waters song Screamin' And Cryin' (Live In Warsaw 1976), from the test set of the Million Song Dataset.

  • Which audio formats does the musicnn library support? We rely on librosa to read audio files. librosa uses soundfile and audioread for reading audio. As of v0.7, librosa uses soundfile by default and falls back on audioread only when dealing with codecs unsupported by soundfile (notably MP3, and some variants of WAV). For a list of codecs supported by soundfile, see the libsndfile documentation. See the loading sketch below.
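
A minimal loading sketch; librosa dispatches to soundfile or audioread depending on the codec, and the 16 kHz target matches musicnn's front-end:

```python
import librosa

# Decode and resample to 16 kHz mono; for MP3 this goes through audioread,
# for WAV/FLAC through soundfile, but the call is the same either way.
audio, sr = librosa.load('./audio/TRWJAZW128F42760DD_test.mp3', sr=16000)
print(audio.shape, sr)
```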

  • Which sampling rate, window, and hop size were used to compute the log-mel spectrograms? We compute the STFT of the signal downsampled to 16 kHz, with a Hann window of length 512 and 50% overlap (hop size of 256 samples). We use 96 mel bands and apply a logarithmic compression to the result (np.log10(10000·x + 1)). See the sketch below.
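
A minimal sketch reproducing the described front-end with librosa; whether a magnitude or power mel spectrogram is used is not stated here, so power=1.0 is an assumption:

```python
import librosa
import numpy as np

# Decode and downsample to 16 kHz, as described above.
audio, sr = librosa.load('./audio/TRWJAZW128F42760DD_test.mp3', sr=16000)

mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                     n_fft=512,       # Hann window, length 512
                                     hop_length=256,  # 50% overlap
                                     n_mels=96,
                                     power=1.0)       # assumption: magnitude

# Logarithmic compression, as described above.
log_mel = np.log10(10000 * mel + 1)
print(log_mel.shape)  # (96, n_frames)
```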

  • I love this library! How can I send you money? First, you will need to contact me! www.jordipons.me/about-me/

Are you using musicnn?

If you use it for academic work, please cite us:

@inproceedings{pons2018atscale,
  title={End-to-end learning for music audio tagging at scale},
  author={Pons, Jordi and Nieto, Oriol and Prockup, Matthew and Schmidt, Erik M. and Ehmann, Andreas F. and Serra, Xavier},
  booktitle={19th International Society for Music Information Retrieval Conference (ISMIR2018)},
  year={2018},
}

@inproceedings{pons2019musicnn,
  title={musicnn: pre-trained convolutional neural networks for music audio tagging},
  author={Pons, Jordi and Serra, Xavier},
  booktitle={Late-breaking/demo session in 20th International Society for Music Information Retrieval Conference (LBD-ISMIR2019)},
  year={2019},
}

If you use it for other purposes, let us know!