FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates

Nicola Pia1 (equal contribution to this work), Martin Strauss2, Markus Multrus1, Bernd Edler2
1 Fraunhofer IIS, Erlangen, Germany
2 International Audio Laboratories Erlangen, Erlangen, Germany. A joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer IIS.
{nicola.pia,markus.multrus}@iis.fraunhofer.de, {martin.strauss,bernd.edler}@audiolabs-erlangen.de
Abstract

This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, quantizer and decoder. At inference time the decoder integrates a continuous normalizing flow via an ODE solver to generate a high-quality mel spectrogram. This is the first time that a CFM-based approach is applied to general audio coding, enabling scalable, simple and memory-efficient training. Our subjective evaluations show that FlowMAC at 3 kbps achieves quality similar to that of state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC offers a tunable inference pipeline that allows trading off complexity against quality. This enables real-time coding on CPU while maintaining high perceptual quality.

Index Terms:
neural audio coding, conditional flow matching, low bit rate coding

I Introduction

In the modern digital world, audio codecs are used on a day-to-day basis, so every technological advancement can have a large impact. In recent years, deep neural networks (DNNs) revolutionized the field of audio compression. Early approaches [1, 2, 3] control the compression via entropy-based losses and ensure good quality via reconstruction losses. With the advent of deep generative models the quality of neural codecs at bit rates lower than 12 kbps greatly improved.

While for speech coding many different approaches were proven to be successful [4, 5, 6, 7], the general audio codec SoundStream [8] established a new paradigm of training a residual VQ-VAE [9] via an additional GAN loss end-to-end (e2e). For this, a DNN-encoder extracts a learned latent, a residual VQ generates the bit stream, and a DNN-decoder synthesizes the audio. All the modules are jointly learned via a combination of multiple spectral reconstruction, VQ-VAE codebook and commitment and adversarial losses.

Various improvements on the design of SoundStream were proposed afterwards. EnCodec [10] used recurrent networks and improved the compression capability via entropy coding based on language models in the quantizer. The Descript-Audio-Codec (DAC) [11] achieved high quality by increasing the model size, using innovative audio-specific activations [12], and scaling up the discriminator architecture.

The e2e VQ-GAN approach offers great flexibility in the design and complexity of the codec [13, 14, 15]. However, it often entails a complicated and unstable training pipeline, which sometimes fails to meet quality expectations for challenging signal types, particularly at bit rates lower than 6 kbps.

Denoising Diffusion Probabilistic Models (DDPMs) were proposed recently for speech [16] and general audio [17, 18]. While [18] targets semantic coding at ultra low bit rates, MultiBandDiffusion (MBD) [17] is a decoder model that enables high-quality synthesis of the EnCodec latent at 1.5, 3 and 6 kbps for general audio. This model uses a time-domain subband-based decoding scheme and achieves state-of-the-art quality for music. The high complexity of this model makes it hard to use in embedded devices and its dependency on a pre-trained bit stream might limit its compression capabilities.

VQ-GANs entail a highly involved training pipeline and the existing DDPMs are computationally heavy models. This demonstrates the need for a solution that is easy to train, while offering high quality performance at acceptable complexity.

Recently, a new paradigm to train continuous normalizing flows (CNFs) called conditional flow matching (CFM) emerged [19] and demonstrated state-of-the-art quality for both image [20] and audio generation [21, 22, 23]. This approach offers a simple training pipeline at much lower inference and training costs compared to DDPMs.

In this work, we present the Flow Matching Audio Codec (FlowMAC), a new audio compression model for low bit rate coding of general audio at 24 kHz based on CFM. Our proposed approach learns a mel spectrogram encoder, residual VQ, and decoder via a combination of a simple reconstruction loss and the CFM objective. The CFM-based decoder generates realistic mel spectrograms from the discrete latent, which are then converted to the waveform domain via an efficient version of BigVGAN [24]. The model design is simple and the training pipeline is stable and efficient.

Our contributions can be summarized as follows:

  • We introduce FlowMAC, a CFM-based mel spectrogram codec offering a simple and efficient training pipeline.

  • Our listening test results demonstrate that FlowMAC achieves state-of-the-art quality at 3 kbps matching GAN-based and DDPM-based solutions at double the bit rate.

  • We propose an efficient version of FlowMAC capable of coding at high quality and faster than real time on a CPU.

[Figure 1: block diagram. Pipeline labels: mel, Encoder, bitstream, Decoder, CFM, ODE solver, mel, BigVGAN. mel-encoder and mel-decoder labels: 1×1 Conv, RVQ enc, RVQ dec, and N× blocks of MHA, Dropout + LN, FFN, Dropout + LN. CFM module labels: Conv1D, Concat, $\mathbf{x}_t$, $t$ embedding, decoded mel spect, Transformer blocks, 1D ResNet blocks, 1×1 Conv.]

Figure 1: FlowMAC architecture. The top illustrates the high level pipeline. The bottom left shows the structure of the mel spectrogram encoder and decoder. The bottom right denotes the details on the CFM module.

II Flow Matching fundamentals

For neural audio coding, we learn an encoder-decoder architecture that compresses input mel spectrograms into a quantized bit stream. We then use the information from this bit stream to condition a CFM-based mel spectrogram decoder for high-quality mel spectrogram generation. To this end, we consider the distribution $q$ of mel spectrograms of the input audio signals and we learn a time-dependent vector field $\mathbf{u}_t$, whose flow transforms a Gaussian prior $p_0$ into $q$.

Flow matching [19] describes a method to fit a time-dependent probability density path $p_t : [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^{\geq 0}$ between a simple sampling distribution $p_0(\mathbf{x})$ and the target data distribution $q(\mathbf{x})$, where $t \in [0,1]$ and $\mathbf{x} \in \mathbb{R}^d$. More precisely, it defines a framework to train a CNF $\phi_t$ via learning its associated vector field $\mathbf{u}_t$ directly.

Following Section 4.1 in [19] we define

$$p_t(\mathbf{x}|\mathbf{x}_1) = \mathcal{N}\left(\mathbf{x}; \mu_t(\mathbf{x}_1), \sigma_t(\mathbf{x}_1)^2 \mathbf{I}\right), \tag{1}$$

where $\mathbf{x}_1 \sim q(\mathbf{x}_1)$ is sampled from the training set, $\mu_t(\mathbf{x}_1) = t\mathbf{x}_1$, and $\sigma_t(\mathbf{x}_1) = 1 - (1 - \sigma_{\text{min}})t$ with $\sigma_{\text{min}} \ll 1$. This defines a Gaussian path where $p_0$ is the standard Gaussian and $p_1$ is a Gaussian centered at $\mathbf{x}_1$ with small variance. Theorem 3 in [19] shows that this probability path is generated by the Optimal Transport Conditional Vector Field

$$\mathbf{u}_t(\mathbf{x}|\mathbf{x}_1) = \frac{\mathbf{x}_1 - (1 - \sigma_{\text{min}})\mathbf{x}}{1 - (1 - \sigma_{\text{min}})t}. \tag{2}$$

This yields the conditional flow matching objective

$$\begin{aligned}
\mathcal{L}_{\textup{CFM}}(\theta) &= \mathbb{E}_{t, q(\mathbf{x}_1), p_t(\mathbf{x}|\mathbf{x}_1)} \left\lVert \mathbf{v}_t(\mathbf{x};\theta) - \mathbf{u}_t(\mathbf{x}|\mathbf{x}_1) \right\rVert^2 \\
&= \mathbb{E}_{t, q(\mathbf{x}_1), p_0(\mathbf{x}_0)} \left\lVert \mathbf{v}_t(\mathbf{x};\theta) - \left(\mathbf{x}_1 - (1 - \sigma_{\text{min}})\mathbf{x}_0\right) \right\rVert^2
\end{aligned}$$

where $\mathbf{v}_t(\mathbf{x};\theta)$ denotes a DNN parametrized by $\theta$ and the time step $t \sim \mathbb{U}[0,1]$ is sampled from a uniform distribution. In the second expression, $\mathbf{x} = (1 - (1 - \sigma_{\text{min}})t)\mathbf{x}_0 + t\mathbf{x}_1$ is the point on the conditional path obtained from the prior sample $\mathbf{x}_0 \sim p_0$; substituting it into Eq. (2) gives the target $\mathbf{x}_1 - (1 - \sigma_{\text{min}})\mathbf{x}_0$.
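The construction of one training pair for this objective can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the value of `SIGMA_MIN`, the helper names, and the use of plain Python lists in place of tensors are all assumptions made here for readability.

```python
import random

SIGMA_MIN = 1e-4  # assumed value; the text only requires sigma_min << 1


def cfm_training_pair(x1, t, sigma_min=SIGMA_MIN, rng=random):
    """Return (x_t, target) for one data sample x1 and a time t in [0, 1]."""
    # Draw x0 from the standard Gaussian prior p_0.
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    sigma_t = 1.0 - (1.0 - sigma_min) * t
    # Point on the conditional Gaussian path of Eq. (1): mean t * x1, std sigma_t.
    x_t = [sigma_t * a + t * b for a, b in zip(x0, x1)]
    # Regression target from the second form of the CFM objective.
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target


def ot_vector_field(x, x1, t, sigma_min=SIGMA_MIN):
    """Evaluate the conditional vector field of Eq. (2) at the point x."""
    denom = 1.0 - (1.0 - sigma_min) * t
    return [(b - (1.0 - sigma_min) * a) / denom for a, b in zip(x, x1)]


def squared_error(pred, target):
    """Squared L2 distance between a network output and the target."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))
```

Evaluating `ot_vector_field` at the returned `x_t` reproduces `target` up to rounding, which is the equality between the two forms of the objective. In training, `squared_error` would be applied to the network output at `x_t` and averaged over a mini-batch.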

For our system the neural network $\mathbf{v}_t(\mathbf{x};\theta)$ is additionally conditioned on the decoded bit stream $c$ obtained from a learned mel spectrogram compression network. During inference, $\mathbf{v}_t$ takes $c$ and a Gaussian noise sample $\mathbf{x}_0$ as input and outputs the derivatives of the corresponding CNF. This flow is then integrated using an ODE solver, e.g. the Euler method.
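The inference procedure described above amounts to fixed-step Euler integration from $t=0$ to $t=1$. The sketch below illustrates it under stated assumptions: `toy_field` is a hypothetical stand-in for the trained network (it evaluates Eq. (2) with the condition used as the target), and the function names and list-based vectors are illustrative only.

```python
import random


def euler_sample(vector_field, cond, dim, n_steps=32, rng=random):
    """Integrate dx/dt = v_t(x, c) from t = 0 to t = 1 with fixed-step Euler."""
    # Start from a Gaussian noise sample x_0 ~ N(0, I).
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = vector_field(x, t, cond)  # one function evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_field(x, t, cond, sigma_min=1e-4):
    """Hypothetical stand-in for the network: Eq. (2) with x1 set to cond."""
    denom = 1.0 - (1.0 - sigma_min) * t
    return [(c - (1.0 - sigma_min) * xi) / denom for xi, c in zip(x, cond)]


sample = euler_sample(toy_field, cond=[0.5, -1.0, 2.0], dim=3)
```

With this idealized field the trajectory is a straight line, so the result lands within roughly `sigma_min` times the noise magnitude of `cond` for any step count. A learned vector field is not guaranteed to be this well behaved, which is why the number of steps remains a tunable choice.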

III Proposed Architecture

The architecture of FlowMAC is illustrated in Figure 1.

III-A Mel Encoder-Decoder

The 128 mel spectrogram bands are calculated on the input 24 kHz audio with a hop size of 512 and a window of 2048 samples, yielding approximately 47 frames per second. Means and standard deviations are calculated offline for the whole dataset and used as fixed normalization factors for the input. The normalized mel spectrogram passes through a 1×1 convolutional layer with 128 channels to extract features for the encoder. The encoder is a sequence of multi-head attention (MHA), dropout, layer normalization, feed-forward and dropout layers, producing a latent vector to be quantized. The network block is repeated $N=6$ times.

The decoder architecture follows the same structure as the encoder. Finally, a 1×1 convolutional layer serves as a final projection layer to generate the decoded quantized mel spectrogram. The sum of MSE and MAE losses ($\mathcal{L}_{prior}$) serves as reconstruction loss for the input mel spectrogram.

For quantization we use a learned residual VQ based on VQ-VAE [9], with projections to low-dimensional spaces similar to [11]. FlowMAC uses a codebook size of 256, 8 quantizer stages, and a downsampled dimension of 16 for the 128-dimensional latent. Using 8 bits per level with 47 frames per second results in a rounded total of 3 kbps.
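The bit rate arithmetic and the residual quantization step can be sketched as follows. The constants are taken from the text; the function names, the tiny `TOY_CODEBOOKS`, and the mapping of 1.5 kbps to 4 stages are assumptions made for illustration. A real system would use learned codebooks of size 256 in a 16-dimensional projected space.

```python
import math

SAMPLE_RATE_HZ = 24_000
HOP_SIZE = 512
CODEBOOK_SIZE = 256
NUM_STAGES = 8


def frame_rate(sample_rate=SAMPLE_RATE_HZ, hop=HOP_SIZE):
    """Mel frames per second."""
    return sample_rate / hop


def bit_rate(num_stages, codebook_size=CODEBOOK_SIZE):
    """Bits per second when each stage sends one codebook index per frame."""
    bits_per_stage = math.log2(codebook_size)
    return num_stages * bits_per_stage * frame_rate()


def residual_vq(vec, codebooks, num_stages=None):
    """Quantize vec with the first num_stages codebooks.

    Returns (indices, reconstruction). Using fewer stages lowers the bit rate.
    """
    stages = codebooks if num_stages is None else codebooks[:num_stages]
    residual = list(vec)
    recon = [0.0] * len(vec)
    indices = []
    for codebook in stages:
        # Nearest codeword to the current residual (squared Euclidean distance).
        idx = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        indices.append(idx)
        recon = [a + c for a, c in zip(recon, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, recon


# Two toy stages with two 2-D codewords each (illustrative values only).
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, -0.5]],
]

print(frame_rate())   # 46.875
print(bit_rate(8))    # 3000.0
print(bit_rate(4))    # 1500.0 (assuming half of the stages are kept)
```

The exact frame rate is 46.875 frames per second, so 8 stages of 8 bits give exactly 3000 bit/s, consistent with the rounded figures quoted above.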

III-B CFM Module

The CFM architecture follows [21] and uses a U-Net with residual 1D convolutional blocks and transformer blocks with snakebeta activations [24]. Finally, the output of the U-Net passes through a 1D block consisting of a 1D convolution, group normalization and a Mish activation [25], after which a 1×1 convolutional layer creates the final output. The corresponding time-step embeddings use a RoPE-Embedding as in [26].

The CFM decoder is conditioned on the decoded quantized mel spectrogram via concatenation to the input Gaussian noise to estimate the corresponding vector field. The optimization criterion $\mathcal{L}_{\text{CFM}}$ is defined in Section II.

Overall, the training objective for the whole system is then

$$\mathcal{L} = \lambda_p \mathcal{L}_{prior} + \lambda_v \mathcal{L}_q + \mathcal{L}_{\text{CFM}}, \tag{3}$$

where $\lambda_p = 0.01$ and $\lambda_v = 0.25$ denote weighting factors for the prior and VQ-VAE loss ($\mathcal{L}_q$). To improve the CFM training, we sample the timestep $t$ according to a logit normal distribution [20] for each mini-batch. In addition, we train our model with a classifier-free guidance (CFG) technique [27], where the decoded mel spectrogram condition is set to zero with a probability of $p_g = 0.2$, which improves signal quality.
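The three training ingredients named above (weighted loss, logit-normal timestep sampling, and condition dropout for CFG) can be sketched in a few lines. The weights and the drop probability come from the text; the logit-normal parameters `mean=0.0` and `std=1.0` are assumptions, since the text does not state them, and the function names are illustrative.

```python
import math
import random

LAMBDA_P = 0.01  # weight of the prior (reconstruction) loss, from the text
LAMBDA_V = 0.25  # weight of the VQ-VAE loss, from the text
P_G = 0.2        # probability of zeroing the condition, from the text


def sample_logit_normal(rng=random, mean=0.0, std=1.0):
    """Draw t in (0, 1) as sigmoid(z) with z ~ N(mean, std^2).

    mean and std are assumed values, not taken from the paper.
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def maybe_drop_condition(cond, rng=random, p_drop=P_G):
    """Replace the condition by zeros with probability p_drop (CFG training)."""
    if rng.random() < p_drop:
        return [0.0] * len(cond)
    return list(cond)


def total_loss(l_prior, l_q, l_cfm):
    """Combine the three loss terms as in Eq. (3)."""
    return LAMBDA_P * l_prior + LAMBDA_V * l_q + l_cfm
```

Compared with uniform sampling, a logit-normal draw with these parameters concentrates timesteps around the middle of the interval; whether that matches the exact setting used for FlowMAC is not stated in the text.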

III-C Mel-to-Audio Module

As the mel-to-audio module, we re-train a smaller version of BigVGAN [24] on our data: We adapt the mel spectrogram calculation to fit the setting described in Section III-A. Then, we decrease the decoder initial channels to 1024 and use an additional upsampling layer. This yields a smaller architecture than the original BigVGAN.

Note that because the final audio synthesis depends on this mel-to-audio module, the highest achievable quality of our system is bounded by BigVGAN’s performance. Our mel spectrogram codec saturates this bound, and our subjective evaluations confirm this phenomenon.

III-D FlowMAC inference

Thanks to the residual vector quantizer we achieve bit rate scalability via dropping out codebook levels at inference time. Moreover, the iterative nature of the Euler method used for inference enables some freedom on the number of function evaluations (NFE) for the CFM decoder. FlowMAC works at 1.5 and 3 kbps, uses 32 steps for the ODE solver and factor 1 for the CFG, hence, leading to a total of 64 NFE.

Early experimentation showed that the quality of the mel coder subsystem quickly saturates. To test this, we introduce FlowMAC-CQ, a separately trained model at 6 kbps. For this we use the same hyperparameters and NFE as for FlowMAC. Finally, we test the quality-complexity trade-off via using a single step for the Euler method and no CFG, hence, obtaining FlowMAC-LC and using 1 NFE.

Informal listening showed that using more than 64 NFE did not bring significant improvement in quality. Careful attention needs to be paid to the choice of the CFG factor: values smaller than 0.2 usually lead to noisy signals (except for the single-step Euler method) and values bigger than 2 overestimate the energy and introduce unwanted artifacts.
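The NFE counts quoted in this section can be reproduced with a small sketch of a guided Euler loop. Two points are assumptions rather than statements from the text: that each guided step costs one conditional and one unconditional evaluation, and that the guided field is formed as `v_cond + w * (v_cond - v_uncond)`. The function name and the list-based vectors are illustrative.

```python
def guided_euler(vector_field, x0, cond, n_steps=32, cfg_factor=1.0):
    """Fixed-step Euler with optional classifier-free guidance.

    Returns (x, nfe), where nfe counts calls to vector_field.
    """
    x = list(x0)
    zero_cond = [0.0] * len(cond)  # unconditional branch uses a zeroed condition
    dt = 1.0 / n_steps
    nfe = 0
    for i in range(n_steps):
        t = i * dt
        v = vector_field(x, t, cond)
        nfe += 1
        if cfg_factor:
            v_uncond = vector_field(x, t, zero_cond)
            nfe += 1
            # Assumed guidance rule: move the conditional estimate away from
            # the unconditional one, scaled by cfg_factor.
            v = [vc + cfg_factor * (vc - vu) for vc, vu in zip(v, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, nfe


def dummy_field(x, t, cond):
    """Hypothetical placeholder for the trained network."""
    return [c - xi for xi, c in zip(x, cond)]


_, nfe_full = guided_euler(dummy_field, [0.0, 0.0], [1.0, 1.0], n_steps=32, cfg_factor=1.0)
_, nfe_lc = guided_euler(dummy_field, [0.0, 0.0], [1.0, 1.0], n_steps=1, cfg_factor=0.0)
print(nfe_full, nfe_lc)  # 64 1
```

Under these assumptions, 32 guided steps give the 64 NFE of FlowMAC and a single unguided step gives the 1 NFE of FlowMAC-LC.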

IV Evaluation

IV-A Experimental setup

We train both FlowMAC and BigVGAN on a combination of the full LibriTTS [28] clean and dev train subsets as in [24] and an internal music database consisting of 640 hours of high-quality music of various genres. The sampling rate for all datapoints was 24 kHz.

BigVGAN was trained following the official implementation [29] for 1M iterations on a single A100 GPU. FlowMAC was trained with the Adam optimizer with learning rate $10^{-4}$, a segment length of 2 s and batch size of 128 for 700k iterations on a single RTX3080.

TABLE I: Complexity measurements. Numbers for FlowMAC and FlowMAC-LC include BigVGAN. RTF is the ratio between the inference and the input duration measured on a notebook with Intel Core i7-10850H CPU @ 2.70GHz.

Model        Nr. of Params   RTF
DAC          75 M            1.24
MBD          15 M            50.55
FlowMAC      80 M            3.38
FlowMAC-LC   80 M            0.78

IV-B Subjective Evaluation

To evaluate the proposed system, we perform a P.808 DCR [30] listening test with naive listeners and a MUSHRA [31] listening test with expert listeners. To this end, we design a test set of 12 items carefully selected to represent typical challenging signals for audio codecs. The test set includes 4 clean and noisy speech samples (male, female, child and speech over music, including fast and emotional speech with varying prosody), 5 music items of various genres (including rock, pop, classical and electronic), and 3 out-of-distribution items (castanets, harpsichord and glockenspiel).

For the P.808 DCR test we compare FlowMAC to state-of-the-art DNN-based audio codecs and a well-known legacy codec. We select the GAN-based audio codecs DAC [11] and EnCodec [10], and the DDPM-baseline MBD [17]. We use the official implementations and pre-trained weights for all those models [32, 33]. We recognize that the training sets vary strongly between the conditions. Still, we consider it useful to compare our system with well-established and robust codecs.

As a measure of the highest achievable quality with FlowMAC we include the copy-synthesis of the signals via BigVGAN. As a benchmark legacy-condition we use an internal implementation of the MPEG-D USAC Standard [34]. This works on full-band audio, but we downsample the decoded signal to 24 kHz to more closely measure the differences in the codecs at this sample rate. This puts USAC at a disadvantage and may result in lower scores for it. As a lower anchor a low-pass filter with cutoff frequency of 3.5 kHz was used.

The P.808 DCR offers a good overall idea of the average quality of the different conditions in the test. The MUSHRA test provides finer comparisons between a subset of the most promising conditions. Therefore, BigVGAN, FlowMAC at 3 kbps, DAC at 6 kbps, MBD at 6 kbps and USAC at 8 kbps are selected for the MUSHRA test.

IV-C Complexity

We measure the complexity of the DNN-based codec systems included in the MUSHRA listening test and FlowMAC-LC in terms of number of parameters and real-time factor (RTF). Table I summarizes the results. The only condition able to generate the audio faster than real time is FlowMAC-LC. We do not report the complexity figures for USAC, but we notice it is significantly faster than the DNN-based codecs. The implementations of the DNN codecs are not optimized for synthesis speed. We notice that none of the codecs is able to operate at low algorithmic delay, hence faster than real time generation would not enable the application of these codecs in telecommunications.
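The RTF definition used here (inference time divided by input duration) can be sketched as below. The helper names are illustrative, the timing method is an assumption about how such a measurement could be taken, and the dictionary simply restates the RTF values from Table I.

```python
import time


def real_time_factor(process, audio_seconds):
    """Wall-clock time of process() divided by the duration of the input audio."""
    start = time.perf_counter()
    process()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def is_faster_than_real_time(rtf):
    """An RTF below 1 means the codec runs faster than real time."""
    return rtf < 1.0


# RTF values as reported in Table I.
TABLE_I_RTF = {"DAC": 1.24, "MBD": 50.55, "FlowMAC": 3.38, "FlowMAC-LC": 0.78}

faster = [name for name, rtf in TABLE_I_RTF.items() if is_faster_than_real_time(rtf)]
print(faster)  # ['FlowMAC-LC']
```

In practice `process` would wrap a full encode and decode pass over a test item, and the measurement would be repeated over several items to reduce timing noise.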

V Results and Discussion

Figure 2: Results for P.808 DCR test with 46 listeners and 95% CI.
Figure 3: Results for MUSHRA test with 14 listeners and 95% CI.

Figure 2 illustrates the results of the P.808 DCR listening test with 46 listeners. The results from the naive listeners confirm that both FlowMAC and FlowMAC-LC are the best models at 3 kbps, being on average on par with EnCodec and MBD at 6 kbps. FlowMAC at 1.5 kbps shows a significant quality drop, while no significant quality improvement is achieved by the 6 kbps version FlowMAC-CQ. As expected, the copy-synthesis with BigVGAN offers the highest achievable quality for our system. FlowMAC-LC’s average ratings are lower than those of the high-complexity version. Still, the test confirms that it is a competitive baseline. For naive listeners, the higher frequency resolution of DAC 44.1 kHz at 8 kbps does not offer an advantage over the 24 kHz model.

Overall we notice that all the DNN conditions achieve quality comparable to the legacy USAC condition at similar bit rates, the only exception being FlowMAC at 3 kbps.

The results of the MUSHRA test with 14 listeners are illustrated in Figure 3. While USAC 8 kbps has a quality advantage on average over the other codecs here, this test demonstrates that FlowMAC at 3 kbps performs similarly to DAC 6 kbps and both conditions outperform MBD 6 kbps. We notice that the performance of the DNN-based codecs varies strongly across the items in the test set. In particular, FlowMAC performs poorly on the out-of-distribution test items, while its performance is on average comparable with DAC for speech and music. The copy-synthesis from BigVGAN performs best on average and offers a measurement of the highest quality achievable with FlowMAC. We notice that these results more clearly highlight fine differences between the codecs, but are overall in accordance with our P.808 test results.

VI Conclusions

This work proposed FlowMAC, a low bit rate neural audio codec for high-quality coding at 3 kbps. We present a novel approach for training and synthesis for a neural audio codec based on CFM. Our subjective evaluations demonstrate that FlowMAC outperforms strong baselines while offering manageable complexity.

Acknowledgment

We want to thank Dr. Andreas Brendel for reviewing the manuscript. We thankfully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU).

References

  • [1] S. Morishima, H. Harashima, and Y. Katayama, “Speech coding based on a multi-layer neural network,” in IEEE International Conference on Communications, Including Supercomm Technical Sessions.   IEEE, 1990, pp. 429–433.
  • [2] S. Kankanahalli, “End-to-end optimized speech coding with deep neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2521–2525.
  • [3] K. Zhen, M. S. Lee, J. Sung, S. Beack, and M. Kim, “Efficient and scalable neural residual waveform coding with collaborative quantization,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 361–365.
  • [4] W. B. Kleijn, F. S. C. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “WaveNet Based Low Rate Speech Coding,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 676–680.
  • [5] J. Valin and J. Skoglund, “A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet,” in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2019, pp. 3406–3410.
  • [6] A. Mustafa, J. Büthe, S. Korse, K. Gupta, G. Fuchs, and N. Pia, “A Streamwise Gan Vocoder for Wideband Speech Coding at Very Low Bit Rate,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021, pp. 66–70.
  • [7] K. Zhen, J. Sung, M. Lee, S. Beack, and M. Kim, “Cascaded Cross-Module Residual Learning Towards Lightweight End-to-End Speech Coding,” in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2019, pp. 3396–3400.
  • [8] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An End-to-End Neural Audio Codec,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 30, pp. 495–507, 2021.
  • [9] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural Discrete Representation Learning,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [10] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High Fidelity Neural Audio Compression,” Transactions on Machine Learning Research, 2023.
  • [11] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-Fidelity Audio Compression with Improved RVQGAN,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 27 980–27 993.
  • [12] L. Ziyin, T. Hartwig, and M. Ueda, “Neural Networks Fail to Learn Periodic Functions and How to Fix It,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1583–1594.
  • [13] X. Jiang, X. Peng, C. Zheng, H. Xue, Y. Zhang, and Y. Lu, “End-to-end neural speech coding for real-time communications,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022.
  • [14] N. Pia, K. Gupta, S. Korse, M. Multrus, and G. Fuchs, “NESC: Robust Neural End-2-End Speech Coding with GANs,” in 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH), 2022, pp. 4212–4216.
  • [15] Z. Du, S. Zhang, K. Hu, and S. Zheng, “FunCodec: A Fundamental, Reproducible and Integrable Open-Source Toolkit for Neural Speech Codec,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 591–595.
  • [16] H. Yang, I. Jang, and M. Kim, “Generative de-quantization for neural speech codec via latent diffusion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1251–1255.
  • [17] R. San Roman, Y. Adi, A. Deleforge, R. Serizel, G. Synnaeve, and A. Defossez, “From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 1526–1538.
  • [18] H. Liu, X. Xu, Y. Yuan, M. Wu, W. Wang, and M. D. Plumbley, “SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound,” 2024. [Online]. Available: https://arxiv.org/abs/2405.00233
  • [19] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow Matching for Generative Modeling,” in International Conference on Learning Representations (ICLR), 2023.
  • [20] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y. Marek, and R. Rombach, “Scaling Rectified Flow Transformers for High-Resolution Image Synthesis,” 2024. [Online]. Available: https://arxiv.org/abs/2403.03206
  • [21] S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, “Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11 341–11 345.
  • [22] S. Kim, K. Shih, R. Badlani, J. F. Santos, E. Bakhturina, M. Desta, R. Valle, S. Yoon, and B. Catanzaro, “P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting,” in Advances in Neural Information Processing Systems, 2023, pp. 74 213–74 228.
  • [23] M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz, M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar, and W.-N. Hsu, “Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale,” in Advances in Neural Information Processing Systems, 2023.
  • [24] S. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A Universal Neural Vocoder with Large-Scale Training,” in International Conference on Learning Representations (ICLR), 2023.
  • [25] D. Misra, “Mish: A self regularized non-monotonic activation function,” 2020. [Online]. Available: https://arxiv.org/abs/1908.08681
  • [26] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, “Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139.   PMLR, 18–24 Jul 2021, pp. 8599–8608.
  • [27] J. Ho and T. Salimans, “Classifier-Free Diffusion Guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • [28] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,” pp. 1526–1530, 2019.
  • [29] S. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “Codebase: BigVGAN: A Universal Neural Vocoder with Large-Scale Training,” 2023. [Online]. Available: https://github.com/NVIDIA/BigVGAN
  • [30] International Telecommunication Union, “Recommendation ITU–T P.808 Subjective evaluation of speech quality with a crowdsourcing approach,” 2021.
  • [31] ——, “Recommendation ITU–R BS.1534-3 Method for the subjective assessment of intermediate quality level of audio systems,” 2015.
  • [32] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and Controllable Music Generation,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [33] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “Codebase: Descript Audio Codec (.dac): High-Fidelity Audio Compression with Improved RVQGAN,” 2023. [Online]. Available: https://github.com/descriptinc/descript-audio-codec
  • [34] J. Lecomte, M. Neuendorf, M. Multrus, N. Rettelbach, G. Fuchs, J. Robilliard, J. Lecomte, S. Wilde, S. Bayer, S. Disch, C. Helmrich, R. Lefebvre, P. Gournay, B. Bessette, J. Lapierre, K. Kjörling, H. Purnhagen, L. Villemoes, W. Oomen, and B. Grill, “The ISO/MPEG Unified Speech and Audio Coding Standard—Consistent High Quality for All Content Types and at All Bit Rates,” Journal of the Audio Engineering Society, vol. 61, pp. 956–977, 2013.