License: CC BY 4.0
arXiv:2402.01412v1 [cs.SD] 02 Feb 2024

¹Sony Computer Science Laboratories, Paris, France   ²Queen Mary University, London, UK

Bass Accompaniment Generation via Latent Diffusion

Abstract

The ability to automatically generate music that appropriately matches an arbitrary input track is a challenging task. We present a novel controllable system for generating single stems to accompany musical mixes of arbitrary length. At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations, and a conditional latent diffusion model that takes as input the latent encoding of a mix and generates the latent encoding of a corresponding stem. To provide control over the timbre of generated samples, we introduce a technique to ground the latent space to a user-provided reference style during diffusion sampling. To further improve audio quality, we adapt classifier-free guidance to avoid distortions at high guidance strengths when generating in an unbounded latent space. We train our model on a dataset of pairs of mixes and matching bass stems. Quantitative experiments demonstrate that, given an input mix, the proposed system can generate basslines with user-specified timbres. Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.

Index Terms—  music, accompaniment, diffusion, generation, bass

1 Introduction

Musical accompaniment is an integral part of music composition and performance. The ability to automatically generate an accompaniment that complements and matches the style of existing instrument parts (stems) in a music track has the potential both to enhance the creativity of artists, by proposing novel musical material for them to work with, and to make it easier and more efficient for them to realize their artistic visions. In recent years, deep learning techniques have shown promising results in the field of music and (to a much lesser extent) accompaniment generation. Many approaches use a symbolic representation of music as the medium [1, 2, 3], while more recently a number of models that directly generate waveform audio have also been proposed [4, 5, 6]. Diffusion models [7, 8, 9] have emerged as a powerful class of generative models capable of producing high-quality samples, although they usually require a computationally expensive iterative sampling procedure. Latent diffusion models [10] have been introduced to increase model inference speed by generating a latent, low-dimensional representation of the data from a pretrained autoencoder model, usually a Variational AutoEncoder [11].

In this work, we propose a general latent generative model for the task of accompaniment generation, and apply it to the generation of basslines. Given an input stem of arbitrary length, such as a vocal melody, or an input mix of an arbitrary number of stems, our model is able to generate a complementary bass stem that musically matches the conditioning. Furthermore, we propose controllability features, such as style conditioning and conditioning guidance control, to make our system a more useful tool for artists. The key contributions of our work are:

  • The design of an efficient audio autoencoder to encode samples to compressed invertible representations

  • The design of a general conditional latent diffusion model that takes a music mix as input and produces a coherent track, while being able to handle inputs and outputs of arbitrary length

  • The application of both audio autoencoder and latent diffusion model to the task of encoding and generating basslines given an arbitrary input mix

  • The use of style conditioning during the diffusion sampling process to force the generation of a user-defined bass style.

2 Related Work

Accompaniment generation is a type of music generation that involves an additional input conditioning. In this work we focus on audio-based music generation. Autoregressive models such as WaveNet [12], SampleRNN [13], Jukebox [4], MusicLM [5] and MusicGen [14] can generate high quality samples but suffer from slow sequential sampling. Non-autoregressive models based on generative adversarial networks (GANs) [15] such as WaveGAN [16] and GANSynth [17] achieve parallel sampling but are limited to generating fixed-length audio clips. On the other hand, Musika [18] generates invertible latent representations of audio of arbitrary length in parallel, but the context available to the model is limited. Relevant to our work, BassNet [19] generates bass tracks while offering user control via a latent space variable.

More recently, models such as DiffWave [20] and WaveGrad [21] introduce diffusion to audio modeling for speech synthesis applications. For musical audio generation, Riffusion [22] fine-tunes Stable Diffusion [10] on audio spectrograms to generate music clips. Moûsai [23] trains a latent diffusion model on compressed representations and can generate minute-long coherent music. JEN-1 [24] introduces a large-scale conditional latent diffusion model that can generate long-form music both autoregressively and non-autoregressively. Finally, [6] proposes a multi-source diffusion model trained on single source waveforms that achieves both generation and separation of individual sources.

3 Method

Let $\mathbf{x}=\{x_1,\dots,x_T\}$ be the waveform of a mix of arbitrary stems of length $T$, where $x_i$ is the $i$-th stereo frame, and let $\mathbf{y}=\{y_1,\dots,y_T\}$ be the waveform of a single-stem audio sample with the same length. To sample $\mathbf{y}$ given $\mathbf{x}$, we aim to model the conditional distribution $p(\mathbf{y}|\mathbf{x})$, but since the waveforms are typically very high-dimensional (i.e. $T$ is large), we encode both $\mathbf{x}$ and $\mathbf{y}$ into latent representations $\mathbf{c_x}=\{c_{x,1},\dots,c_{x,T/r_{\mathit{time}}}\}$ and $\mathbf{c_y}=\{c_{y,1},\dots,c_{y,T/r_{\mathit{time}}}\}$ respectively using audio autoencoders, and model $p(\mathbf{c_y}|\mathbf{c_x})$ instead. Here, $r_{\mathit{time}}$ is the time compression ratio of the autoencoders, and we refer to the dimensionality of vectors $c_{x,i}$ and $c_{y,i}$ as $\mathit{dim}_x$ and $\mathit{dim}_y$, respectively.

3.1 Audio Autoencoder

Our goal is to design an efficient audio autoencoder that can reach high compression ratios while reconstructing samples with reasonable accuracy. To achieve this, we start from the audio autoencoder architecture proposed in Musika [18], where a model is used to reconstruct the magnitude and phase components of a spectrogram $s$ instead of the full waveform, which results in faster inference. However, instead of using the original two-stage design and two-phase training process, we train a single encoder and decoder in a fully end-to-end fashion. We first use an L1 loss between the log-magnitude spectrogram $s$ and the magnitude output of the model:

$$\mathcal{L}_{E,D,\mathit{rec}}=\mathbb{E}_{s\sim p(s)}\,\|D(E(s))_{\mathit{mag}}-s\|_1$$

where $E$ and $D$ are the encoder and decoder, and $D(E(s))_{\mathit{mag}}$ is the magnitude component of the decoder output. We also use the multi-scale spectral distance [25, 26] between the original and the reconstructed waveforms:

$$\begin{aligned}
\tilde{w} &= \mathrm{iSTFT}(D(E(s))) \\
\mathcal{L}_{E,D,\mathit{mssd}} &= \mathbb{E}_{w\sim p(w)}\sum_{h\in\mathcal{H}}\|\,\mathrm{STFT}_h(w)^2-\mathrm{STFT}_h(\tilde{w})^2\,\|_1
\end{aligned}$$

where $\mathcal{H}$ is a set of pairs of hop size and window length. The phase component is modelled implicitly by the multi-scale spectral distance loss and the adversarial loss on the log-magnitude spectrogram of the reconstructed waveform:

$$\begin{aligned}
\tilde{s} &= \log(\mathrm{STFT}(\tilde{w})^2+\epsilon) \\
\mathcal{L}_C &= -\mathbb{E}_{s\sim p(s)}\left[\min(0,\,-1+C(s))\right]-\mathbb{E}_{s\sim p(s)}\left[\min(0,\,-1-C(\tilde{s}))\right] \\
\mathcal{L}_{E,D,\mathit{adv}} &= -\mathbb{E}_{s\sim p(s)}\,C(\tilde{s})
\end{aligned}$$

where $C$ is the critic. The final objective used to jointly train encoder and decoder is the following:

$$\mathcal{L}_{E,D}=\mathcal{L}_{E,D,\mathit{adv}}+\lambda_{\mathit{rec}}\mathcal{L}_{E,D,\mathit{rec}}+\lambda_{\mathit{mssd}}\mathcal{L}_{E,D,\mathit{mssd}}$$
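As a minimal sketch of how the critic and autoencoder losses combine, the snippet below evaluates the hinge terms on plain lists of critic scores. It relies on the identity $-\min(0,a)=\max(0,-a)$ to rewrite the critic loss. The function names are illustrative, the scores are made-up numbers, and the default weights are the $\lambda_{\mathit{rec}}$ and $\lambda_{\mathit{mssd}}$ values reported in Section 4; this is not the authors' training code.

```python
def critic_hinge_loss(real_scores, fake_scores):
    """L_C = E[max(0, 1 - C(s))] + E[max(0, 1 + C(s_tilde))].

    Same quantity as the -min(0, .) form, because -min(0, a) == max(0, -a).
    """
    real_term = sum(max(0.0, 1.0 - c) for c in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + c) for c in fake_scores) / len(fake_scores)
    return real_term + fake_term


def adversarial_loss(fake_scores):
    """L_{E,D,adv} = -E[C(s_tilde)]: the autoencoder tries to raise critic scores."""
    return -sum(fake_scores) / len(fake_scores)


def autoencoder_objective(l_adv, l_rec, l_mssd, lambda_rec=25.0, lambda_mssd=0.002):
    """Weighted sum of the adversarial, reconstruction and spectral-distance terms."""
    return l_adv + lambda_rec * l_rec + lambda_mssd * l_mssd


# Made-up critic outputs for two real and two reconstructed spectrograms.
real_scores = [2.0, 0.5]
fake_scores = [-2.0, 0.0]
loss_c = critic_hinge_loss(real_scores, fake_scores)
loss_adv = adversarial_loss(fake_scores)
loss_total = autoencoder_objective(loss_adv, l_rec=0.1, l_mssd=50.0)
```

The hinge form means the critic stops receiving gradient from real samples scored above 1 and from reconstructions scored below -1, while the large ratio between the two weights reflects the very different scales of the L1 and spectral-distance terms.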

Unlike [18], we add a second critic that receives mel-spectrograms. This addition encourages the autoencoder to reconstruct spectral information more accurately in the regions where human pitch perception is more precise.
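The multi-scale spectral distance used in this section can be illustrated with a small standard-library sketch. It assumes that $\mathrm{STFT}_h(\cdot)^2$ denotes the squared magnitude of a Hann-windowed STFT, uses a naive DFT for readability, and runs on toy hop sizes that are much smaller than those listed in Section 4. All function names are hypothetical.

```python
import cmath
import math


def power_spectrogram(signal, hop_len, win_len):
    """Squared-magnitude STFT with a periodic Hann window (naive DFT)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop_len):
        frame = [signal[start + n] * window[n] for n in range(win_len)]
        bins = []
        for k in range(win_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                      for n in range(win_len))
            bins.append(abs(acc) ** 2)
        frames.append(bins)
    return frames


def multi_scale_spectral_distance(w, w_tilde, hop_lens):
    """Sum over scales of the L1 distance between power spectrograms."""
    total = 0.0
    for hop_len in hop_lens:
        win_len = 4 * hop_len  # window length tied to hop size, as in Section 4
        spec = power_spectrogram(w, hop_len, win_len)
        spec_tilde = power_spectrogram(w_tilde, hop_len, win_len)
        total += sum(abs(a - b)
                     for row, row_tilde in zip(spec, spec_tilde)
                     for a, b in zip(row, row_tilde))
    return total


# Toy signals: a sine, a slightly phase-shifted copy, and a different pitch.
N = 256
HOPS = (4, 8, 16)
w_ref = [math.sin(2 * math.pi * 8 * n / N) for n in range(N)]
w_shift = [math.sin(2 * math.pi * 8 * n / N + 0.3) for n in range(N)]
w_other = [math.sin(2 * math.pi * 20 * n / N) for n in range(N)]

d_same = multi_scale_spectral_distance(w_ref, w_ref, HOPS)
d_shift = multi_scale_spectral_distance(w_ref, w_shift, HOPS)
d_other = multi_scale_spectral_distance(w_ref, w_other, HOPS)
```

In this toy setting a small phase shift changes the distance far less than a change of pitch does, which is the behaviour one wants from a loss that compares spectral content at several time resolutions.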

Fig. 1: Inference of the system. Noise is concatenated to the latent representation of the conditioning waveform $\mathbf{c_x}$, and $K$ denoising steps are performed to generate $\mathbf{\hat{c}_y}$, which is then decoded to a waveform. The representation of a user-specified style sample $\mathbf{c_{style}}$ can be used to ground the generated output to a specific style.

3.2 Latent Diffusion Model

Diffusion models are trained to reverse a sequential corruption process of samples, and thus are able to retrieve samples from the data distribution by starting from a known distribution and iteratively denoising it. We choose to briefly introduce them with their score-based interpretation [27].

Our goal is to model the score of the conditional target stem latent distribution, given the input mix latent:

$$G_\theta(\mathbf{c_y},\mathbf{c_x})\approx\nabla_{\mathbf{c_y}}\log p(\mathbf{c_y}|\mathbf{c_x})$$

where $G_\theta(\mathbf{c_y},\mathbf{c_x})$ is a neural network with parameters $\theta$.

To achieve this, we minimize the Fisher divergence between the output of the model and the score:

$$\mathbb{E}_{p(\mathbf{c_y},\mathbf{c_x})}\left[\left\|G_\theta(\mathbf{c_y},\mathbf{c_x})-\nabla_{\mathbf{c_y}}\log p(\mathbf{c_y}|\mathbf{c_x})\right\|_2^2\right]$$

Finally, we can use Langevin dynamics to iteratively generate real samples with a sufficiently large number of iterations $K$.
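To make the sampling idea concrete, here is a minimal one-dimensional sketch of Langevin dynamics. It assumes the score is known in closed form (a Gaussian with mean 3 and unit standard deviation) rather than learned by a network, and the step size and number of iterations are arbitrary illustrative choices.

```python
import math
import random


def gaussian_score(z, mean, std):
    """Score (gradient of the log-density) of N(mean, std^2) evaluated at z."""
    return (mean - z) / (std ** 2)


def langevin_sample(score_fn, z0, num_steps, step_size, rng):
    """Run K = num_steps Langevin updates starting from z0."""
    z = z0
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        z = z + 0.5 * step_size * score_fn(z) + math.sqrt(step_size) * noise
    return z


rng = random.Random(0)
samples = [
    langevin_sample(lambda z: gaussian_score(z, 3.0, 1.0), 0.0, 200, 0.1, rng)
    for _ in range(2000)
]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
```

Although every chain starts at 0, the drift along the score pulls the samples toward the target distribution, so the empirical mean and standard deviation end up close to 3 and 1. A finite step size leaves a small bias in the variance, which is one reason many iterations with small steps are preferred.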

In practice, we train our model to denoise noisy latent samples of the target stem $\mathbf{z}_t=\alpha_t\mathbf{c_y}+\beta_t\epsilon$, with $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$:

$$\mathcal{L}_{G_\theta}=\mathbb{E}_{\mathbf{c_y},\mathbf{c_x}\sim p(\mathbf{c_y},\mathbf{c_x}),\,t\sim[0,1]}\;w_t\,\|G_\theta(\mathbf{z}_t,t,\mathbf{c_x})-\mathbf{c_y}\|_2^2$$

where $\alpha_t$ and $\beta_t$ are the signal and noise rates, $\mathbf{c_x}$ is the latent representation of the corresponding input mix and $w_t$ is the loss weight at timestep $t$.
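The forward noising step and its relation to the v-objective mentioned in Section 4 can be sketched as follows. The parametrisation $\alpha_t=\cos(\pi t/2)$, $\beta_t=\sin(\pi t/2)$ is an assumption (a common form of the cosine schedule), and the function names are illustrative. With these rates $\alpha_t^2+\beta_t^2=1$, so the clean latent can be recovered from a velocity prediction as $\alpha_t\mathbf{z}_t-\beta_t\mathbf{v}$.

```python
import math
import random


def cosine_rates(t):
    """Signal rate alpha_t and noise rate beta_t for t in [0, 1] (assumed schedule)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def add_noise(c_y, eps, t):
    """z_t = alpha_t * c_y + beta_t * eps, applied element-wise."""
    alpha_t, beta_t = cosine_rates(t)
    return [alpha_t * c + beta_t * e for c, e in zip(c_y, eps)]


def v_target(c_y, eps, t):
    """Velocity target v = alpha_t * eps - beta_t * c_y."""
    alpha_t, beta_t = cosine_rates(t)
    return [alpha_t * e - beta_t * c for c, e in zip(c_y, eps)]


def denoised_from_v(z_t, v, t):
    """Recover the clean latent: c_y = alpha_t * z_t - beta_t * v."""
    alpha_t, beta_t = cosine_rates(t)
    return [alpha_t * z - beta_t * vv for z, vv in zip(z_t, v)]


rng = random.Random(0)
c_y = [0.2, -1.3, 0.7, 2.1]                 # toy flattened target latent
eps = [rng.gauss(0.0, 1.0) for _ in c_y]    # Gaussian noise
t = 0.3
z_t = add_noise(c_y, eps, t)
recovered = denoised_from_v(z_t, v_target(c_y, eps, t), t)
```

A network trained to predict the velocity therefore yields a denoised estimate of $\mathbf{c_y}$ at every step, which is the quantity the loss above compares against the clean latent.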

The model is based on a U-Net architecture [28], with the addition of self-attention [29] in the lower resolution layers. However, the vanilla self-attention mechanism does not allow the model to generalize to arbitrarily long inputs and outputs [30], which is crucial for a flexible real-world use of the system. To achieve generalization to lengths that are unseen during training, we equip the attention layers with Dynamic Positional Bias (DPB), a technique introduced for the task of arbitrarily-sized image classification [31, 32], which consists of adding a learnable Relative Positional Bias (RPB) matrix $\mathbf{B}\in\mathbb{R}^{L\times L}$, where $L$ is the temporal length of the feature map:

$$\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{SoftMax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}+\mathbf{B}\right)\mathbf{V}$$

where $\mathbf{Q},\mathbf{K},\mathbf{V}\in\mathbb{R}^{L\times d}$ are query, key and value matrices. Each entry $\mathbf{B}_{i,j}$ is learned with a Multi-Layer Perceptron (MLP) on the relative difference between positions $i$ and $j$:

$$\mathbf{B}_{i,j}=\text{MLP}(i-j)$$
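The following sketch shows why a bias computed from the relative offset $i-j$ extends to any sequence length: the same small MLP is simply evaluated on more offsets. The MLP here has one hidden layer with fixed, hypothetical weights (in the model they would be learned), and the attention function is a plain single-head implementation written with lists.

```python
import math


def bias_mlp(rel_pos, hidden_w, hidden_b, out_w, out_b):
    """Tiny one-hidden-layer MLP with ReLU, mapping an offset to a scalar bias."""
    hidden = [max(0.0, w * rel_pos + b) for w, b in zip(hidden_w, hidden_b)]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b


def positional_bias(length, params):
    """B[i][j] = MLP(i - j) for an arbitrary temporal length."""
    return [[bias_mlp(float(i - j), *params) for j in range(length)]
            for i in range(length)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, bias):
    """SoftMax(Q K^T / sqrt(d) + B) V for lists of row vectors."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) + bias[i][j]
                  for j, k_j in enumerate(k)]
        weights = softmax(logits)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Hypothetical fixed weights standing in for learned MLP parameters.
PARAMS = ([0.5, -0.3], [0.1, 0.2], [1.0, -1.0], 0.05)

bias_short = positional_bias(4, PARAMS)   # length seen "during training"
bias_long = positional_bias(7, PARAMS)    # longer, unseen length
features = [[0.1 * t, 1.0] for t in range(7)]
attended = attention(features, features, features, bias_long)
```

Because the bias depends only on the offset, `bias_long` contains the same values as `bias_short` along shared diagonals, and no parameter is tied to an absolute position or to a maximum length.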

3.3 Style Grounding

To maximize its utility as a creative tool for music artists, our objective is a generation system that is controllable by the user. To this end, we design a technique that enables the generation of single-stem samples with user-specified timbre characteristics and style. Given a reference audio waveform $\mathbf{y}$ provided by the user to indicate their desired style, we first encode it to a compressed latent representation $\mathbf{c}_{\mathit{style}}$ with the corresponding audio autoencoder. Then, we simply average the latent representation over the timesteps to obtain a single $\mathit{dim}_y$-dimensional vector $\mu_t(\mathbf{c}_{\mathit{style}})$, where $\mu_t(\cdot)$ indicates the average across all timesteps. Finally, during the diffusion model sampling process, we force the generated latent samples at each reverse diffusion timestep to have an average across time that remains close to $\mu_t(\mathbf{c}_{\mathit{style}})$. We weight this re-centering by the square of the timestep-specific noise rate, so that the effect is stronger at earlier iterations while keeping the model free to deviate when generating the lower-level details of the sample.
Given the denoised output of the diffusion model $\mathbf{\hat{c}}_{y,k}\in\mathbb{R}^{T\times\mathit{dim}_y}$ at sampling iteration $k$, we calculate:

$$\mathbf{\hat{c}}_{y,k,\mathit{ground}}=\mathbf{\hat{c}}_{y,k}-\mu_t(\mathbf{\hat{c}}_{y,k})+\beta_k^2\,\mu_t(\mathbf{c}_{\mathit{style}})+(1-\beta_k^2)\,\mu_t(\mathbf{\hat{c}}_{y,k})$$

This technique exploits the semantically rich latent space produced by the autoencoder to impose the distinct timbre features captured in $\mu_t(\mathbf{c}_{\mathit{style}})$ on the output of the diffusion model.
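The grounding update can be sketched in a few lines. Collecting the terms of the equation above gives the equivalent form $\mathbf{\hat{c}}_{y,k}+\beta_k^2\,(\mu_t(\mathbf{c}_{\mathit{style}})-\mu_t(\mathbf{\hat{c}}_{y,k}))$, which is what the code implements. Latents are represented as lists of per-timestep vectors, and the function names and toy values are illustrative only.

```python
def time_mean(latent):
    """Average a (timesteps x dim) latent over the time axis."""
    num_steps = len(latent)
    dim = len(latent[0])
    return [sum(frame[d] for frame in latent) / num_steps for d in range(dim)]


def ground_to_style(c_hat, c_style, beta_k):
    """Shift the time-average of c_hat toward that of c_style by a factor beta_k^2."""
    mu_hat = time_mean(c_hat)
    mu_style = time_mean(c_style)
    weight = beta_k ** 2
    shift = [weight * (s - m) for s, m in zip(mu_style, mu_hat)]
    return [[value + shift[d] for d, value in enumerate(frame)] for frame in c_hat]


# Toy latents with dim_y = 2; the style reference may have a different length.
c_hat = [[0.0, 2.0], [2.0, 4.0]]
c_style = [[5.0, -1.0], [7.0, 1.0], [6.0, 0.0]]

early = ground_to_style(c_hat, c_style, beta_k=1.0)   # high noise rate: full re-centering
middle = ground_to_style(c_hat, c_style, beta_k=0.5)  # partial re-centering
late = ground_to_style(c_hat, c_style, beta_k=0.0)    # no noise left: unchanged
```

Only the per-dimension mean is moved; the deviations of each frame from that mean are left untouched, so the temporal structure produced by the diffusion model is preserved at every noise level.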

3.4 Classifier-Free Guidance

Classifier-Free Guidance (CFG) [33] is a technique that allows a conditional diffusion model to generate samples that more closely adhere to the provided input:

$$\mathbf{\hat{c}}_{k,\mathit{cfg}}=G_\theta(\mathbf{\hat{z}}_k,k,\mathbf{c_x})+\lambda_{\mathit{cfg}}\left(G_\theta(\mathbf{\hat{z}}_k,k,\mathbf{c_x})-G_\theta(\mathbf{\hat{z}}_k,k)\right)$$

where $G_\theta(\mathbf{\hat{z}}_k,k)$ is an unconditionally-generated sample at timestep $k$. However, when high guidance weights $\lambda_{\mathit{cfg}}$ are used, image generation models are known to generate overly saturated and exposed images [34]. We experience a similar issue in our latent audio generation scenario, with highly distorted and saturated samples being generated. Solutions such as clipping the guided samples to a defined range of values or dynamic thresholding [34] are not applicable in our case, since our latent space is not bounded. We thus use the technique proposed by [35] for guiding generation in arbitrary spaces, which controls the increase in standard deviation of the guided samples with a hyperparameter $\phi\in[0,1]$, and allows us to reduce artifacts at higher guidance weights.
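A minimal sketch of guidance with standard-deviation rescaling follows. It uses the standard CFG form from [33], conditional plus $\lambda_{\mathit{cfg}}$ times (conditional minus unconditional). The rescaling step is one common formulation (rescale the guided output to the standard deviation of the conditional output, then blend with weight $\phi$); whether it matches [35] in every detail is an assumption. Latents are flattened to plain lists for simplicity.

```python
import math


def std(values):
    """Population standard deviation of a flat list."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def guided_prediction(cond, uncond, guidance_weight, phi):
    """Classifier-free guidance followed by a phi-weighted std rescaling.

    Assumes the guided output is not constant, so its std is non-zero.
    """
    guided = [c + guidance_weight * (c - u) for c, u in zip(cond, uncond)]
    scale = std(cond) / std(guided)
    rescaled = [g * scale for g in guided]
    return [phi * r + (1.0 - phi) * g for r, g in zip(rescaled, guided)]


# Toy conditional and unconditional model outputs.
cond = [1.0, -1.0, 2.0, -2.0]
uncond = [0.5, -0.5, 1.0, -1.0]

plain_cfg = guided_prediction(cond, uncond, guidance_weight=3.0, phi=0.0)
fully_rescaled = guided_prediction(cond, uncond, guidance_weight=3.0, phi=1.0)
```

With $\phi=0$ the result is ordinary CFG, whose standard deviation grows with the guidance weight. With $\phi=1$ the guided output is brought back to the spread of the conditional prediction, which is what keeps an unbounded latent from drifting into ranges the decoder never saw.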

Fig. 2: Left: FAD evaluation of unconditional samples with respect to the number of DDIM inference steps. 64 steps result in the lowest FAD, and we use $K=64$ in all subsequent experiments. Right: FAD evaluation of conditional samples with respect to CFG weights and with varying $\phi$. When higher CFG weights ($>2.5$) are used, the latent rescaling technique results in lower FAD.

Fig. 3: Soft assignments of 25 random input mixes and corresponding generated basslines by a contrastive model (Section 5). High diagonal values indicate the generated basslines best match their respective conditional inputs.

| | Grounded | Not Grounded |
|---|---|---|
| Cosine Distance | 0.269 | 0.644 |
| Euclidean Distance | 0.407 | 0.836 |

Table 1: Average Euclidean and Cosine distance between embeddings of style samples from the test set and embeddings of generated samples both using the proposed grounding technique and not using it.

4 Implementation Details

We train the audio autoencoders on random crops of 1.5 seconds to produce representations with $\mathit{dim}_x=64$ and $\mathit{dim}_y=32$, while $r_{\mathit{time}}=4096$ is kept the same for both models. Input log-magnitude spectrograms for both the autoencoder and the critics are calculated using $\mathit{hop\_len}=256$ and $\mathit{win\_len}=4\cdot\mathit{hop\_len}$. 128 mel-bins are used for the second critic. The architecture of both autoencoder and critics consists of residual convolutional blocks. We choose $\lambda_{\mathit{rec}}=25$ and $\lambda_{\mathit{mssd}}=0.002$, and the multi-scale spectral distance loss is calculated using $\mathit{hop\_len}\in[2^5,2^6,2^7,2^8,2^9,2^{11},2^{12}]$, again with $\mathit{win\_len}=4\cdot\mathit{hop\_len}$ at every scale.
The autoencoders consist of 37M parameters and are trained using Adam [36] with $\beta_1=0.5$ and $\beta_2=0.9$ for 500k iterations at a batch size of 32. The latent diffusion model is trained on (mix, stem) pairs, where both samples are $\sim$23 seconds long and are first encoded to latent representations that are 256 timesteps long. For a given track, the mix is obtained by mixing a non-empty random subset of stems from the track. The latent diffusion model consists of residual convolutional blocks, with self-attention layers at the lower resolution levels. The latent representation of the conditioning mix is concatenated with the noisy input, while the diffusion timestep information is expressed through sinusoidal embeddings [29] which are concatenated with the feature maps before every block. 15% of input latent representations are zeroed out to train the model unconditionally, thus allowing CFG. The latent diffusion model consists of 42M parameters and is trained using AdamW [37] with $\beta_1=0.9$ and $\beta_2=0.999$ for 500k iterations at a batch size of 128. To train the model we use the v-objective [38] with a cosine schedule, while at inference we use the DDIM sampler [9].
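The numbers above can be cross-checked with a little arithmetic. The sketch below relates the latent length to the waveform length and computes the resulting compression ratios. The sample rate is not stated in this section, so 44.1 kHz is an assumption, chosen because it is consistent with 256 latent timesteps lasting roughly 23 seconds. The stereo channel count follows the definition of $x_i$ as a stereo frame in Section 3.

```python
R_TIME = 4096                  # time compression ratio of both autoencoders
LATENT_STEPS = 256             # latent length used to train the diffusion model
DIM_X, DIM_Y = 64, 32          # latent dimensionality for mixes and bass stems
CHANNELS = 2                   # stereo frames
ASSUMED_SAMPLE_RATE = 44_100   # assumption: not stated in the text


def waveform_frames(latent_steps, r_time=R_TIME):
    """Number of waveform frames represented by a latent sequence."""
    return latent_steps * r_time


def duration_seconds(frames, sample_rate=ASSUMED_SAMPLE_RATE):
    return frames / sample_rate


def compression_ratio(latent_dim, r_time=R_TIME, channels=CHANNELS):
    """Waveform values per latent value: r_time frames of `channels` samples each."""
    return (r_time * channels) / latent_dim


frames = waveform_frames(LATENT_STEPS)    # 256 * 4096 waveform frames
seconds = duration_seconds(frames)        # a little under 24 s at the assumed rate
mix_ratio = compression_ratio(DIM_X)
bass_ratio = compression_ratio(DIM_Y)
```

Under these assumptions the mix autoencoder stores one latent value per 128 waveform values and the bass autoencoder one per 256, which is why the diffusion model can work on sequences of only 256 timesteps.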

5 Experiments and Results

We train the proposed accompaniment generation system on the task of conditional bassline generation, using an internal dataset of 20k songs with available stems, including the bass guitar. 1,500 of the tracks are used as the test set. We first train the audio autoencoder used to encode the input mixes on the MTG-Jamendo dataset [39]. The autoencoder used to encode the bass samples is trained on bass stems from our internal dataset, and the latent diffusion model is trained on (mix, bass stem) pairs from the same dataset. We evaluate the quality of unconditionally generated samples with respect to the number of DDIM steps in Fig. 2 (left). We show in Fig. 2 (right) how the CFG rescaling technique can improve the FAD of generated samples for high CFG weights. To evaluate the ability of the system to generate samples that musically match the input mix, we train a contrastive model to assign high scores to matching (mix, bass stem) pairs and low scores to non-matching ones using the same internal dataset. In Fig. 3, we visualize the scores assigned by that model to 25 pairs of random segments of mixes from the test set, and 25 bass stems generated conditionally for each of those segments. A high value on the diagonal means the bass stem generated for that mix matches that mix better than the bass stems generated for the other mixes. To quantitatively evaluate the efficacy of the proposed style grounding technique, we use an off-the-shelf audio classification model [40] to extract embeddings of generated samples with and without style grounding (using the same input mix as conditioning), and compare them in Table 1 to embeddings of the target style sample via the Cosine and Euclidean distance. Readers can listen to samples generated by our system at: https://sonycslparis.github.io/bass_accompaniment_demo/
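The two kinds of measurement described above can be sketched with standard-library Python. The distance functions follow the usual definitions of cosine and Euclidean distance. Treating the "soft assignments" of Fig. 3 as a row-wise softmax over contrastive scores is an assumption, and the score matrix and embeddings below are made-up toy values, not results from the paper.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def soft_assignments(scores):
    """Row-wise softmax: row i distributes mix i over all generated stems."""
    result = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def diagonal_hit_rate(assignments):
    """Fraction of mixes whose highest-scoring stem is the one generated for them."""
    hits = sum(1 for i, row in enumerate(assignments) if row.index(max(row)) == i)
    return hits / len(assignments)


# Toy contrastive scores for 3 mixes (rows) and 3 generated stems (columns).
toy_scores = [[4.0, 1.0, 0.5],
              [0.2, 3.0, 1.0],
              [1.0, 0.5, 2.5]]
toy_assignments = soft_assignments(toy_scores)
hit_rate = diagonal_hit_rate(toy_assignments)

# Toy embeddings of a style reference and a generated sample.
style_embedding = [1.0, 2.0, 0.0]
generated_embedding = [2.0, 4.0, 0.0]
cos_d = cosine_distance(style_embedding, generated_embedding)
euc_d = euclidean_distance(style_embedding, generated_embedding)
```

The toy embeddings point in the same direction but differ in length, so their cosine distance is zero while their Euclidean distance is not. This is why reporting both metrics, as Table 1 does, gives a fuller picture than either one alone.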

6 Conclusion

We have presented a novel controllable system for music accompaniment generation using latent diffusion models. When trained on bass stems, our model is able to generate basslines that musically match an arbitrary input mix. We propose the design of an efficient audio autoencoder for producing compressed invertible latent representations, the adaptation of latent diffusion models to handle inputs and outputs of arbitrary length, and a latent-specific style grounding technique to control the timbre of generated samples. Experiments demonstrate that our model can generate basslines that musically match the input mix and that can be grounded with user-provided timbres. A limitation of our system is that it does not offer user control over the exact notes of the generated accompaniment. Future work involves training the model to generate other instruments besides bass. We believe our system can enhance the creative workflow of music artists, creating a variety of bass accompaniments to fit their existing material, while also offering control over the creation process.

This work was supported by UKRI [grant EP/S022694/1].

References

  • [1] Gaëtan Hadjeres, François Pachet and Frank Nielsen “DeepBach: a Steerable Model for Bach Chorales Generation” In ICML, 2017
  • [2] Cheng-Zhi Anna Huang et al. “Music Transformer: Generating Music with Long-Term Structure” In ICLR, 2019
  • [3] Dimitri von Rütte, Luca Biggio, Yannic Kilcher and Thomas Hofmann “FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control” In arXiv preprint arXiv:2201.10936, 2022
  • [4] Prafulla Dhariwal et al. “Jukebox: A generative model for music” In arXiv preprint arXiv:2005.00341, 2020
  • [5] Andrea Agostinelli et al. “MusicLM: Generating Music From Text”, 2023 arXiv:2301.11325 [cs.SD]
  • [6] Giorgio Mariani et al. “Multi-Source Diffusion Models for Simultaneous Music Generation and Separation”, 2023 arXiv:2302.02257 [cs.SD]
  • [7] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan and Surya Ganguli “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” In ICML, 2015
  • [8] Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising Diffusion Probabilistic Models” In NeurIPS, 2020
  • [9] Jiaming Song, Chenlin Meng and Stefano Ermon “Denoising Diffusion Implicit Models” In ICLR, 2021
  • [10] Robin Rombach et al. “High-resolution image synthesis with latent diffusion models” In CVPR, 2022
  • [11] Diederik P. Kingma and Max Welling “Auto-Encoding Variational Bayes” In ICLR, 2014
  • [12] Aäron van den Oord et al. “WaveNet: A Generative Model for Raw Audio” In The 9th ISCA Speech Synthesis Workshop, 2016
  • [13] Soroush Mehri et al. “SampleRNN: An Unconditional End-to-End Neural Audio Generation Model” In ICLR, 2017
  • [14] Jade Copet et al. “Simple and Controllable Music Generation” In arXiv preprint arXiv:2306.05284, 2023
  • [15] Ian J. Goodfellow et al. “Generative Adversarial Nets” In NeurIPS, 2014
  • [16] Chris Donahue, Julian J. McAuley and Miller S. Puckette “Adversarial Audio Synthesis” In ICLR, 2019
  • [17] Jesse H. Engel et al. “GANSynth: Adversarial Neural Audio Synthesis” In ICLR, 2019
  • [18] Marco Pasini and Jan Schlüter “Musika! Fast Infinite Waveform Music Generation” In ISMIR, 2022
  • [19] Maarten Grachten, Stefan Lattner and Emmanuel Deruty “BassNet: A Variational Gated Autoencoder for Conditional Generation of Bass Guitar Tracks with Learned Interactive Control” In Applied Sciences, 2020
  • [20] Zhifeng Kong et al. “DiffWave: A Versatile Diffusion Model for Audio Synthesis” In ICLR, 2021
  • [21] Nanxin Chen et al. “WaveGrad: Estimating Gradients for Waveform Generation” In ICLR, 2021
  • [22] Seth Forsgren and Hayk Martiros “Riffusion - Stable diffusion for real-time music generation”, 2022 URL: https://riffusion.com/about
  • [23] Flavio Schneider, Zhijing Jin and Bernhard Schölkopf “Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion” In arXiv preprint arXiv:2301.11757, 2023
  • [24] Peike Li et al. “JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models” In arXiv preprint arXiv:2308.04729, 2023
  • [25] Jesse H. Engel, Lamtharn Hantrakul, Chenjie Gu and Adam Roberts “DDSP: Differentiable Digital Signal Processing” In ICLR, 2020
  • [26] Antoine Caillon and Philippe Esling “RAVE: A variational autoencoder for fast and high-quality neural audio synthesis” In arXiv preprint arXiv:2111.05011, 2021
  • [27] Yang Song, Conor Durkan, Iain Murray and Stefano Ermon “Maximum Likelihood Training of Score-Based Diffusion Models” In NeurIPS, 2021
  • [28] Olaf Ronneberger, Philipp Fischer and Thomas Brox “U-Net: Convolutional Networks for Biomedical Image Segmentation” In MICCAI, 2015
  • [29] Ashish Vaswani et al. “Attention is All you Need” In NeurIPS, 2017
  • [30] Yutao Sun et al. “A Length-Extrapolatable Transformer” In ACL, 2023
  • [31] Wenxiao Wang et al. “CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention” In ICLR, 2022
  • [32] Ze Liu et al. “Swin Transformer V2: Scaling Up Capacity and Resolution” In CVPR, 2022
  • [33] Jonathan Ho and Tim Salimans “Classifier-free diffusion guidance” In arXiv preprint arXiv:2207.12598, 2022
  • [34] Chitwan Saharia et al. “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding” In arXiv preprint arXiv:2205.11487, 2022
  • [35] Shanchuan Lin, Bingchen Liu, Jiashi Li and Xiao Yang “Common Diffusion Noise Schedules and Sample Steps are Flawed” In arXiv:2305.08891, 2023
  • [36] Diederik P. Kingma and Jimmy Ba “Adam: A Method for Stochastic Optimization” In ICLR, 2015
  • [37] Ilya Loshchilov and Frank Hutter “Decoupled Weight Decay Regularization” In ICLR, 2019
  • [38] Tim Salimans and Jonathan Ho “Progressive Distillation for Fast Sampling of Diffusion Models” In ICLR, 2022
  • [39] Dmitry Bogdanov et al. “The MTG-Jamendo Dataset for Automatic Music Tagging” In Machine Learning for Music Discovery Workshop, ICML, 2019
  • [40] Qiuqiang Kong et al. “PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition” In ACM Trans. Audio Speech Lang. Process., 2020