1 Sony Computer Science Laboratories, Paris, France
2 Queen Mary University, London, UK
Bass Accompaniment Generation via Latent Diffusion
Abstract
Automatically generating music that appropriately matches an arbitrary input track is a challenging task. We present a novel controllable system for generating single stems to accompany musical mixes of arbitrary length. At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations, and a conditional latent diffusion model that takes as input the latent encoding of a mix and generates the latent encoding of a corresponding stem. To provide control over the timbre of generated samples, we introduce a technique to ground the latent space to a user-provided reference style during diffusion sampling. To further improve audio quality, we adapt classifier-free guidance to avoid distortions at high guidance strengths when generating in an unbounded latent space. We train our model on a dataset of pairs of mixes and matching bass stems. Quantitative experiments demonstrate that, given an input mix, the proposed system can generate basslines with user-specified timbres. Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
Index Terms— music, accompaniment, diffusion, generation, bass
1 Introduction
Musical accompaniment is an integral part of music composition and performance. The ability to automatically generate an accompaniment that complements and matches the style of existing instrument parts (stems) in a music track has the potential both to enhance the creativity of artists, by proposing novel musical material for them to work with, and to make it easier and more efficient for them to realize their artistic visions. In recent years, deep learning techniques have shown promising results in the field of music and (to a much lesser extent) accompaniment generation. Many approaches use a symbolic representation of music as the medium [1, 2, 3], while more recently a number of models that directly generate waveform audio have also been proposed [4, 5, 6]. Diffusion models [7, 8, 9] have emerged as a powerful class of generative models capable of producing high-quality samples, although they usually require a computationally expensive iterative sampling procedure. Latent diffusion models [10] have been introduced to increase model inference speed by generating a latent, low-dimensional representation of the data from a pretrained autoencoder model, usually a Variational AutoEncoder [11].
In this work, we propose a general latent generative model for the task of accompaniment generation, and apply it to the generation of basslines. Given an input stem of arbitrary length such as a vocal melody or an input mix of arbitrary numbers of stems, our model is able to generate a complementary bass stem that musically matches the conditioning. Furthermore, we propose controllability features, such as style conditioning and conditioning guidance control, to make our system a more useful tool for artists. The key contributions of our work are:
- The design of an efficient audio autoencoder to encode samples to compressed invertible representations.
- The design of a general conditional latent diffusion model that takes a music mix as input and produces a coherent track, while being able to handle inputs and outputs of arbitrary length.
- The application of both audio autoencoder and latent diffusion model to the task of encoding and generating basslines given an arbitrary input mix.
- The use of style conditioning during the diffusion sampling process to force the generation of a user-defined bass style.
2 Related Work
Accompaniment generation is a type of music generation that involves an additional input conditioning. In this work we focus on audio-based music generation. Autoregressive models such as WaveNet [12], SampleRNN [13], Jukebox [4], MusicLM [5] and MusicGen [14] can generate high quality samples but suffer from slow sequential sampling. Non-autoregressive models based on generative adversarial networks (GANs) [15] such as WaveGAN [16] and GANSynth [17] achieve parallel sampling but are limited to generating fixed-length audio clips. Musika [18], on the other hand, generates invertible latent representations of audio of arbitrary length in parallel, but the context available to the model is limited. Relevant to our work, BassNet [19] generates bass tracks while offering user control via a latent space variable.
More recently, models such as DiffWave [20] and WaveGrad [21] introduce diffusion to audio modeling for speech synthesis applications. For musical audio generation, Riffusion [22] fine-tunes Stable Diffusion [10] on audio spectrograms to generate music clips. Moûsai [23] trains a latent diffusion model on compressed representations and can generate minute-long coherent music. JEN-1 [24] introduces a large-scale conditional latent diffusion model that can generate long-form music both autoregressively and non-autoregressively. Finally, [6] proposes a multi-source diffusion model trained on single source waveforms that achieves both generation and separation of individual sources.
3 Method
Let $w_m = \{w_{m,1}, \dots, w_{m,T_w}\}$ be the waveform of a mix of arbitrary stems of length $T_w$, where $w_{m,i} \in \mathbb{R}^2$ is the $i$-th stereo frame, and let $w_s$ be the waveform of a single-stem audio sample with the same length. To sample $w_s$ given $w_m$, we aim to model the conditional distribution $p(w_s \mid w_m)$, but since the waveforms are typically very high-dimensional (i.e. $T_w$ is large), we encode both $w_m$ and $w_s$ into latent representations $l_m = \{l_{m,1}, \dots, l_{m,T_w/r}\}$ and $l_s = \{l_{s,1}, \dots, l_{s,T_w/r}\}$ respectively using audio autoencoders, and model $p(l_s \mid l_m)$ instead. Here, $r$ is the time compression ratio of the autoencoders, and we refer to the dimensionality of vectors $l_{m,i}$ and $l_{s,i}$ as $d_m$ and $d_s$, respectively.
3.1 Audio Autoencoder
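Our goal is to design an efficient audio autoencoder that can reach high compression ratios while reconstructing samples with reasonable accuracy. To achieve this, we start from the audio autoencoder architecture proposed in Musika [18], where a model is used to reconstruct the magnitude and phase components of a spectrogram instead of the full waveform, which results in faster inference. However, instead of using the original two-stage design and two-phase training process, we train a single encoder and decoder in a fully end-to-end fashion. We first use an L1 loss between a log-magnitude spectrogram $s$ and the magnitude output of the model:

$$\mathcal{L}_{\text{rec}} = \mathbb{E}_{s}\left[\lVert s - \mathrm{Dec}_{\text{mag}}(\mathrm{Enc}(s)) \rVert_1\right],$$

where $\mathrm{Enc}$ and $\mathrm{Dec}$ are the encoder and decoder, and $\mathrm{Dec}_{\text{mag}}$ is the magnitude component of the decoder output. We also use the multi-scale spectral distance [25, 26] between the original and the reconstructed waveforms $w$ and $\hat{w}$:

$$\mathcal{L}_{\text{ms}} = \mathbb{E}_{w}\left[\sum_{(h, n) \in P} d\big(\mathrm{STFT}_{h,n}(w), \mathrm{STFT}_{h,n}(\hat{w})\big)\right],$$

where $P$ is a set of pairs of hop size $h$ and window length $n$, and $d$ is the spectral distance of [25, 26]. The phase component is modelled implicitly by the multi-scale spectral distance loss and the adversarial loss on the log-magnitude spectrogram $\hat{s}$ of the reconstructed waveform:

$$\mathcal{L}_{\text{adv}} = -\mathbb{E}_{w}\left[C(\hat{s})\right],$$

where $C$ is the critic. The final objective used to jointly train encoder and decoder is the following:

$$\mathcal{L} = \mathcal{L}_{\text{adv}} + \lambda_{\text{rec}} \mathcal{L}_{\text{rec}} + \lambda_{\text{ms}} \mathcal{L}_{\text{ms}}.$$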
Unlike [18], we add a second critic that receives mel-spectrograms. This addition encourages the autoencoder to reconstruct spectral information more accurately in the regions where human pitch perception is more precise.
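To make the multi-scale spectral term more concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the choice of a mean absolute difference between log-magnitudes, the Hann window, the naive DFT, and the toy `scales` values are all assumptions made for illustration.

```python
import cmath
import math


def stft_magnitudes(signal, hop, win):
    """Magnitude spectra of Hann-windowed frames, computed with a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win) for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(win)]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win) for n in range(win)))
            for k in range(win // 2 + 1)
        ])
    return frames


def multiscale_spectral_distance(x, y, scales, eps=1e-5):
    """Sum over (hop, window) pairs of the mean absolute log-magnitude difference.

    The per-scale distance used here is an assumption, not taken from the paper.
    """
    total = 0.0
    for hop, win in scales:
        spec_x = stft_magnitudes(x, hop, win)
        spec_y = stft_magnitudes(y, hop, win)
        diffs = [
            abs(math.log(a + eps) - math.log(b + eps))
            for frame_x, frame_y in zip(spec_x, spec_y)
            for a, b in zip(frame_x, frame_y)
        ]
        total += sum(diffs) / len(diffs)
    return total


# Toy signals: a sine, and the same sine with a weak overtone added.
clean = [math.sin(2 * math.pi * 5 * n / 128) for n in range(128)]
coloured = [c + 0.3 * math.sin(2 * math.pi * 20 * n / 128) for n, c in enumerate(clean)]
scales = [(4, 16), (8, 32)]  # hypothetical (hop size, window length) pairs

print(multiscale_spectral_distance(clean, clean, scales))
print(multiscale_spectral_distance(clean, coloured, scales))
```

Comparing spectra at several resolutions penalizes both coarse and fine time-frequency errors, which is why the distance is zero only when the two signals agree at every scale.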
3.2 Latent Diffusion Model
Diffusion models are trained to reverse a sequential corruption process of samples, and thus are able to retrieve samples from the data distribution by starting from a known distribution and iteratively denoising it. We choose to briefly introduce them with their score-based interpretation [27].
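Our goal is to model the score of the conditional target stem latent distribution, given the input mix latent:

$$s_\theta(l_s, l_m) \approx \nabla_{l_s} \log p(l_s \mid l_m),$$

where $s_\theta$ is a neural network with parameters $\theta$.
To achieve this, we minimize the Fisher divergence between the output of the model and the score:

$$\mathbb{E}_{p(l_s, l_m)}\left[\lVert s_\theta(l_s, l_m) - \nabla_{l_s} \log p(l_s \mid l_m) \rVert_2^2\right].$$

Finally, we can use Langevin dynamics to iteratively generate real samples with a sufficiently large number of iterations $K$.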
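As an illustration of score-based sampling, the sketch below runs Langevin dynamics on a one-dimensional Gaussian whose score is known in closed form. This is a toy stand-in: in the system described here the score would come from a trained network, and the step size, iteration count and target distribution are assumptions chosen only to keep the example small.

```python
import random
import statistics


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1-D Gaussian."""
    return -(x - mean) / std ** 2


def langevin_sample(score_fn, steps, step_size, rng):
    """Start from a standard normal and follow the score plus injected noise."""
    x = rng.gauss(0.0, 1.0)
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step_size * score_fn(x) + (2.0 * step_size) ** 0.5 * noise
    return x


rng = random.Random(0)
target_mean, target_std = 2.0, 0.5  # hypothetical target distribution


def target_score(x):
    return gaussian_score(x, target_mean, target_std)


samples = [langevin_sample(target_score, 300, 0.01, rng) for _ in range(1000)]
print(statistics.mean(samples), statistics.pstdev(samples))
```

With enough iterations the empirical mean and standard deviation of the chains approach those of the target, even though every chain starts from a different distribution.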
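In practice, we train our model to denoise noisy latent samples of the target stem $l_{s,t} = \alpha_t l_s + \sigma_t \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{l_s, l_m, \epsilon, t}\left[\lambda_t \lVert f_\theta(l_{s,t}, l_m, t) - l_s \rVert_2^2\right],$$

where $f_\theta$ is the denoising network, $\alpha_t$ and $\sigma_t$ are the signal and noise rates, $l_m$ is the latent representation of the corresponding input mix and $\lambda_t$ is the loss weight at timestep $t$.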
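The training step described above can be sketched as follows. This is a simplified illustration rather than the actual training code: latents are flat lists of floats, the cosine form of the signal and noise rates is an assumption, and `identity_denoiser` and `oracle_denoiser` are hypothetical stand-ins for the network.

```python
import math
import random


def cosine_rates(t):
    """Signal and noise rates for t in [0, 1] under a cosine schedule (assumed form)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def add_noise(clean, noise, t):
    """Mix the clean latent with Gaussian noise according to the rates at time t."""
    alpha, sigma = cosine_rates(t)
    return [alpha * c + sigma * n for c, n in zip(clean, noise)]


def weighted_denoising_loss(denoiser, clean, mix, noise, t, weight):
    """Weighted mean squared error between the denoiser output and the clean latent."""
    noisy = add_noise(clean, noise, t)
    prediction = denoiser(noisy, mix, t)
    mse = sum((p - c) ** 2 for p, c in zip(prediction, clean)) / len(clean)
    return weight * mse


rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(8)]  # toy stem latent
mix = [rng.gauss(0.0, 1.0) for _ in range(8)]    # toy mix latent (conditioning)
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]


def identity_denoiser(noisy, mix_latent, t):
    return noisy  # does no denoising at all


def oracle_denoiser(noisy, mix_latent, t):
    return clean  # cheats by returning the ground truth


print(weighted_denoising_loss(identity_denoiser, clean, mix, noise, 0.5, 1.0))
print(weighted_denoising_loss(oracle_denoiser, clean, mix, noise, 0.5, 1.0))
```

The loss vanishes only for a denoiser that recovers the clean latent, and the signal and noise rates keep the noisy sample at unit scale because their squares sum to one.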
The model is based on a U-Net architecture [28], with the addition of self-attention [29] in the lower resolution layers. However, the vanilla self-attention mechanism does not allow the model to generalize to arbitrarily long inputs and outputs [30], which is crucial for a flexible real-world use of the system. To achieve generalization to lengths that are unseen during training, we equip the attention layers with Dynamic Positional Bias (DPB), a technique introduced for the task of arbitrarily-sized image classification [31, 32], which consists of adding a learnable Relative Positional Bias (RPB) matrix $B \in \mathbb{R}^{T \times T}$, where $T$ is the temporal length of the feature map:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^\top}{\sqrt{d}} + B\right)V,$$

where $Q, K, V \in \mathbb{R}^{T \times d}$ are query, key and value matrices. Each entry $B_{i,j}$ is learned with a Multi-Layer Perceptron (MLP) on the relative difference between positions $i$ and $j$:

$$B_{i,j} = \mathrm{MLP}(i - j).$$
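The sketch below illustrates why a positional bias computed by an MLP on relative positions extends to unseen lengths: the same small network can fill a bias matrix of any size. It is an assumption-laden toy, not the model's attention layer: the MLP has fixed random weights in place of learned ones, there is a single head, and `make_position_mlp`, `bias_matrix` and `attention_with_bias` are hypothetical helper names.

```python
import math
import random


def make_position_mlp(hidden=8, seed=0):
    """Tiny one-hidden-layer MLP mapping a relative offset to a scalar bias.

    Random weights stand in for learned parameters.
    """
    rng = random.Random(seed)
    w1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]

    def mlp(offset):
        activations = [max(0.0, w * offset + b) for w, b in zip(w1, b1)]
        return sum(w * a for w, a in zip(w2, activations))

    return mlp


def bias_matrix(mlp, length):
    """Entry (i, j) depends only on the relative position i - j."""
    return [[mlp(float(i - j)) for j in range(length)] for i in range(length)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_bias(q, k, v, bias):
    """Scaled dot-product attention with an additive positional bias."""
    dim = len(q[0])
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        logits = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(dim) + bias[i][j]
            for j, k_j in enumerate(k)
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))])
    return outputs, weights


mlp = make_position_mlp()
rng = random.Random(1)
for length in (4, 9):  # the same MLP serves both sequence lengths
    x = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(length)]
    out, attn = attention_with_bias(x, x, x, bias_matrix(mlp, length))
    print(length, len(out), len(out[0]))
```

Because the bias is a function of the offset rather than a table indexed by absolute position, no parameter is tied to a particular sequence length.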
3.3 Style Grounding
To maximize its utility as a creative tool for music artists, our objective is a generation system that is controllable by the user. To this end, we design a technique that enables the generation of single-stem samples with user-specified timbre characteristics and style. Given a reference audio waveform $w_{\text{ref}}$ provided by the user to indicate their desired style, we first encode it to a compressed latent representation $l_{\text{ref}}$ with the corresponding audio autoencoder. Then, we simply average the latent representation over the timesteps to obtain a single $d_s$-dimensional vector $\bar{l}_{\text{ref}}$, where $\bar{\cdot}$ indicates the average across all timesteps. Finally, during the diffusion model sampling process, we force the generated latent samples at each reverse diffusion timestep to have an average across time that remains close to $\bar{l}_{\text{ref}}$. We weigh this re-centering by the square of the timestep-specific noise rate, so that the effect is stronger at earlier iterations while keeping the model free to deviate when generating the lower-level details of the sample. Given the denoised output $\hat{l}_{s,t}$ of the diffusion model at sampling iteration $t$ we calculate:

$$\hat{l}_{s,t} \leftarrow \hat{l}_{s,t} - \sigma_t^2\left(\bar{\hat{l}}_{s,t} - \bar{l}_{\text{ref}}\right).$$

This technique exploits the semantically rich latent space produced by the autoencoder to enforce distinct timbre features captured in $\bar{l}_{\text{ref}}$ onto the output of the diffusion model.
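A minimal sketch of the re-centering step described above is given below. It assumes the update shifts every frame of the denoised latent by the squared noise rate times the gap between the reference average and the current time average; `time_average` and `ground_style` are hypothetical names, and latents are represented as lists of per-timestep vectors.

```python
def time_average(latent):
    """Average a latent (list of timesteps, each a list of channels) over time."""
    steps = len(latent)
    return [sum(frame[c] for frame in latent) / steps for c in range(len(latent[0]))]


def ground_style(denoised, style_mean, noise_rate):
    """Shift the time average of `denoised` toward `style_mean`.

    The shift is weighted by noise_rate squared, so it is strong early in
    sampling (noise rate near 1) and fades out as the noise rate goes to 0.
    """
    current_mean = time_average(denoised)
    shift = [noise_rate ** 2 * (s - m) for s, m in zip(style_mean, current_mean)]
    return [[value + delta for value, delta in zip(frame, shift)] for frame in denoised]


# Toy latent with 3 timesteps and 2 channels; its time average is [2.0, 3.0].
latent = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
style_mean = [10.0, -1.0]  # hypothetical time-averaged reference latent

for rate in (1.0, 0.5, 0.0):
    grounded = ground_style(latent, style_mean, rate)
    print(rate, time_average(grounded))
```

Since the same offset is added to every timestep, the temporal shape of the latent is left intact and only its average moves toward the reference.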
3.4 Classifier-Free Guidance
Classifier-Free Guidance (CFG) [33] is a technique that allows a conditional diffusion model to generate samples that more closely adhere to the provided input:

$$\tilde{l}_{s,t} = \hat{l}^{\,u}_{s,t} + w\left(\hat{l}_{s,t} - \hat{l}^{\,u}_{s,t}\right),$$

where $\hat{l}_{s,t}$ is the conditionally-generated sample, $w$ is the guidance weight and $\hat{l}^{\,u}_{s,t}$ is an unconditionally-generated sample at timestep $t$. However, when high guidance weights are used, image generation models are known to generate overly saturated and exposed images [34]. We experience a similar issue in our latent audio generation scenario, with highly distorted and saturated samples being generated. Solutions such as clipping the guided samples to a defined range of values or dynamic thresholding [34] are not applicable in our case, since our latent space is not bounded. We thus use the technique proposed by [35] for guiding generation in arbitrary spaces, which controls the increase in standard deviation of the guided samples with a hyperparameter $\phi$, and allows us to reduce artifacts at higher guidance weights.
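The guidance rescaling idea can be sketched as below, following the general recipe of [35]: rescale the guided sample so that its standard deviation matches that of the conditional sample, then blend the rescaled and plain guided samples. This is a simplified illustration that treats each latent as a flat list and computes one standard deviation over all of its entries; the function names and toy values are assumptions.

```python
import statistics


def classifier_free_guidance(cond, uncond, weight):
    """Extrapolate from the unconditional toward the conditional prediction."""
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


def rescaled_guidance(cond, uncond, weight, phi):
    """CFG followed by a standard-deviation rescale, blended with factor phi.

    phi = 0 gives plain CFG; phi = 1 fully matches the conditional sample's
    standard deviation. Assumes neither sample is constant.
    """
    guided = classifier_free_guidance(cond, uncond, weight)
    scale = statistics.pstdev(cond) / statistics.pstdev(guided)
    rescaled = [g * scale for g in guided]
    return [phi * r + (1.0 - phi) * g for r, g in zip(rescaled, guided)]


cond = [0.5, -1.0, 2.0, 0.1]    # toy conditional prediction
uncond = [0.1, -0.2, 0.5, 0.0]  # toy unconditional prediction

plain = classifier_free_guidance(cond, uncond, 7.0)
tamed = rescaled_guidance(cond, uncond, 7.0, 0.7)
print(statistics.pstdev(cond), statistics.pstdev(plain), statistics.pstdev(tamed))
```

Unlike clipping, the rescale needs no fixed value range, which is what makes it usable when the latent space is unbounded.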
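Table 1: Cosine and Euclidean distance between embeddings [40] of the target style sample and of samples generated with (Grounded) and without (Not Grounded) style grounding.

Distance | Grounded | Not Grounded
---|---|---
Cosine Distance | 0.269 | 0.644
Euclidean Distance | 0.407 | 0.836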
4 Implementation Details
We train the audio autoencoders on random crops of seconds to produce representations with and , while is kept the same for both models. Input log-magnitude spectrograms for both the autoencoder and the critics are calculated using and . mel-bins are used for the second critic. The architecture of both autoencoder and critics consists of residual convolutional blocks. We choose , and the multi-scale spectral distance loss is calculated using . We always choose . The autoencoders consist of M parameters and are trained using Adam [36] with and for 500k iterations at a batch size of . The latent diffusion model is trained on (mix, stem) pairs, where both samples are 23 seconds long and are first encoded to 256-timestep-long latent representations. For a given track, the mix is obtained by mixing a non-empty random subset of stems from the track. The latent diffusion model consists of residual convolutional blocks, with self-attention layers at the lower resolution levels. The latent representation of the conditioning mix is concatenated with the noisy input, while the diffusion timestep information is expressed through sinusoidal embeddings [29] which are concatenated with the feature maps before every block. of input latent representations are zeroed out to train the model unconditionally, thus allowing CFG. The latent diffusion model consists of M parameters and is trained using AdamW [37] with and for 500k iterations at a batch size of . To train the model we use the v-objective [38] with a cosine schedule, while at inference we use the DDIM sampler [9].
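For readers unfamiliar with the v-objective [38], the sketch below shows the regression target it defines and how the clean sample and the noise can be recovered from a prediction of it. The cosine form of the signal and noise rates is the standard one and is assumed here; latents are flat lists and all values are toy data.

```python
import math


def cosine_rates(t):
    """Signal and noise rates for t in [0, 1] under a cosine schedule (assumed form)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def v_target(clean, noise, t):
    """Velocity target: alpha * noise - sigma * clean."""
    alpha, sigma = cosine_rates(t)
    return [alpha * n - sigma * c for c, n in zip(clean, noise)]


def clean_from_v(noisy, v, t):
    """Recover the clean sample: alpha * noisy - sigma * v."""
    alpha, sigma = cosine_rates(t)
    return [alpha * z - sigma * u for z, u in zip(noisy, v)]


def noise_from_v(noisy, v, t):
    """Recover the noise: sigma * noisy + alpha * v."""
    alpha, sigma = cosine_rates(t)
    return [sigma * z + alpha * u for z, u in zip(noisy, v)]


clean = [0.3, -1.2, 0.8]
noise = [1.1, 0.4, -0.6]
t = 0.35
alpha, sigma = cosine_rates(t)
noisy = [alpha * c + sigma * n for c, n in zip(clean, noise)]
v = v_target(clean, noise, t)

print(clean_from_v(noisy, v, t))
print(noise_from_v(noisy, v, t))
```

Both recoveries rely on the squared signal and noise rates summing to one, which holds for the cosine schedule.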
5 Experiments and Results
We train the proposed accompaniment generation system on the task of conditional bassline generation, using an internal dataset of 20k songs with available stems, including bass guitar. 1,500 of the tracks are used as the test set. We first train the audio autoencoder used to encode the input mixes on the MTG-Jamendo dataset [39]. The autoencoder used to encode the bass samples is trained on bass stems from our internal dataset and the latent diffusion model is trained on (mix, bass stem) pairs from the same dataset. We first evaluate the quality of unconditionally generated samples with respect to the number of DDIM steps in Fig. 2 (right). We show in Fig. 2 (left) how the CFG rescaling technique can improve the FAD of generated samples for high CFG weights. To evaluate the ability of the system to generate samples that musically match the input mix, we train a contrastive model to assign high scores to matching (mix, bass stem) pairs and low scores to non-matching ones using the same internal dataset. In Fig. 3, we visualize the scores assigned by that model to pairs of random segments of mixes from the test set, and 25 bass stems generated conditionally for each of those segments. A high value on the diagonal means the bass stem generated for that mix matches that mix better than the bass stems generated for the other mixes. To quantitatively evaluate the efficacy of the proposed style grounding technique, we use an off-the-shelf audio classification model [40] to extract embeddings of generated samples with and without style grounding (using the same input mix as conditioning), and compare them in Table 1 to embeddings of the target style sample via the Cosine and Euclidean distance. Readers can listen to samples generated by our system at: https://sonycslparis.github.io/bass_accompaniment_demo/
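The two evaluation measures used above are straightforward to compute; the sketch below shows one way to do so. It is illustrative only: the embeddings and scores are toy values, whether embeddings are normalized before comparison is not specified in the text, and `diagonal_match_rate` assumes that row $i$ of the score matrix holds the scores of mix $i$ against every generated stem.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diagonal_match_rate(scores):
    """Fraction of rows whose diagonal entry is the row maximum."""
    hits = sum(1 for i, row in enumerate(scores) if row[i] == max(row))
    return hits / len(scores)


target = [0.2, 0.9, 0.1]        # toy embedding of the reference style sample
grounded = [0.25, 0.8, 0.15]    # toy embedding of a grounded generation
ungrounded = [0.9, 0.1, 0.3]    # toy embedding of an ungrounded generation

print(cosine_distance(target, grounded), cosine_distance(target, ungrounded))
print(euclidean_distance(target, grounded), euclidean_distance(target, ungrounded))
print(diagonal_match_rate([[0.9, 0.1], [0.8, 0.2]]))
```

Lower distances indicate that a generated sample sits closer to the reference style in embedding space, while a high diagonal match rate indicates that generated stems are specific to their own mixes.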
6 Conclusion
We have presented a novel controllable system for music accompaniment generation using latent diffusion models. When trained on bass stems, our model is able to generate basslines that musically match an arbitrary input mix. We propose the design of an efficient audio autoencoder for producing compressed invertible latent representations, the adaptation of latent diffusion models to handle inputs and outputs of arbitrary length, and a latent-specific style grounding technique to control the timbre of generated samples. Experiments demonstrate that our model can generate basslines that musically match the input mix and that can be grounded with user-provided timbres. A limitation of our system is that it does not offer user control over the exact notes of the generated accompaniment. Future work involves training the model to generate other instruments besides bass. We believe our system can enhance the creative workflow of music artists, creating a variety of bass accompaniments to fit their existing material, while also offering control over the creation process.
This work was supported by UKRI [grant EP/S022694/1].
References
- [1] Gaëtan Hadjeres, François Pachet and Frank Nielsen “DeepBach: a Steerable Model for Bach Chorales Generation” In ICML, 2017
- [2] Cheng-Zhi Anna Huang et al. “Music Transformer: Generating Music with Long-Term Structure” In ICLR, 2019
- [3] Dimitri von Rütte, Luca Biggio, Yannic Kilcher and Thomas Hofmann “FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control” In arXiv preprint arXiv:2201.10936, 2022
- [4] Prafulla Dhariwal et al. “Jukebox: A generative model for music” In arXiv preprint arXiv:2005.00341, 2020
- [5] Andrea Agostinelli et al. “MusicLM: Generating Music From Text”, 2023 arXiv:2301.11325 [cs.SD]
- [6] Giorgio Mariani et al. “Multi-Source Diffusion Models for Simultaneous Music Generation and Separation”, 2023 arXiv:2302.02257 [cs.SD]
- [7] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan and Surya Ganguli “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” In ICML, 2015
- [8] Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising Diffusion Probabilistic Models” In NeurIPS, 2020
- [9] Jiaming Song, Chenlin Meng and Stefano Ermon “Denoising Diffusion Implicit Models” In ICLR, 2021
- [10] Robin Rombach et al. “High-resolution image synthesis with latent diffusion models” In CVPR, 2022
- [11] Diederik P. Kingma and Max Welling “Auto-Encoding Variational Bayes” In ICLR, 2014
- [12] Aäron van den Oord et al. “WaveNet: A Generative Model for Raw Audio” In The 9th ISCA Speech Synthesis Workshop, 2016
- [13] Soroush Mehri et al. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model” In ICLR, 2017
- [14] Jade Copet et al. “Simple and Controllable Music Generation” In arXiv preprint arXiv:2306.05284, 2023
- [15] Ian J. Goodfellow et al. “Generative Adversarial Nets” In NeurIPS, 2014
- [16] Chris Donahue, Julian J. McAuley and Miller S. Puckette “Adversarial Audio Synthesis” In ICLR, 2019
- [17] Jesse H. Engel et al. “GANSynth: Adversarial Neural Audio Synthesis” In ICLR, 2019
- [18] Marco Pasini and Jan Schlüter “Musika! Fast Infinite Waveform Music Generation” In ISMIR, 2022
- [19] Maarten Grachten, Stefan Lattner and Emmanuel Deruty “BassNet: A Variational Gated Autoencoder for Conditional Generation of Bass Guitar Tracks with Learned Interactive Control” In Applied Sciences, 2020
- [20] Zhifeng Kong et al. “DiffWave: A Versatile Diffusion Model for Audio Synthesis” In ICLR, 2021
- [21] Nanxin Chen et al. “WaveGrad: Estimating Gradients for Waveform Generation” In ICLR, 2021
- [22] Seth Forsgren and Hayk Martiros “Riffusion - Stable diffusion for real-time music generation”, 2022 URL: https://riffusion.com/about
- [23] Flavio Schneider, Zhijing Jin and Bernhard Schölkopf “Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion” In arXiv preprint arXiv:2301.11757, 2023
- [24] Peike Li et al. “JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models” In arXiv preprint arXiv:2308.04729, 2023
- [25] Jesse H. Engel, Lamtharn Hantrakul, Chenjie Gu and Adam Roberts “DDSP: Differentiable Digital Signal Processing” In ICLR, 2020
- [26] Antoine Caillon and Philippe Esling “RAVE: A variational autoencoder for fast and high-quality neural audio synthesis” In arXiv preprint arXiv:2111.05011, 2021
- [27] Yang Song, Conor Durkan, Iain Murray and Stefano Ermon “Maximum Likelihood Training of Score-Based Diffusion Models” In NeurIPS, 2021
- [28] Olaf Ronneberger, Philipp Fischer and Thomas Brox “U-Net: Convolutional Networks for Biomedical Image Segmentation” In MICCAI, 2015
- [29] Ashish Vaswani et al. “Attention is All you Need” In NeurIPS, 2017
- [30] Yutao Sun et al. “A Length-Extrapolatable Transformer” In ACL, 2023
- [31] Wenxiao Wang et al. “CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention” In ICLR, 2022
- [32] Ze Liu et al. “Swin Transformer V2: Scaling Up Capacity and Resolution” In CVPR, 2022
- [33] Jonathan Ho and Tim Salimans “Classifier-free diffusion guidance” In arXiv preprint arXiv:2207.12598, 2022
- [34] Chitwan Saharia et al. “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding” In arXiv preprint arXiv:2205.11487, 2022
- [35] Shanchuan Lin, Bingchen Liu, Jiashi Li and Xiao Yang “Common Diffusion Noise Schedules and Sample Steps are Flawed” In arXiv preprint arXiv:2305.08891, 2023
- [36] Diederik P. Kingma and Jimmy Ba “Adam: A Method for Stochastic Optimization” In ICLR, 2015
- [37] Ilya Loshchilov and Frank Hutter “Decoupled Weight Decay Regularization” In ICLR, 2019
- [38] Tim Salimans and Jonathan Ho “Progressive Distillation for Fast Sampling of Diffusion Models” In ICLR, 2022
- [39] Dmitry Bogdanov et al. “The MTG-Jamendo Dataset for Automatic Music Tagging” In Machine Learning for Music Discovery Workshop, ICML, 2019
- [40] Qiuqiang Kong et al. “PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition” In ACM Trans. Audio Speech Lang. Process., 2020