License: CC BY-NC-ND 4.0
arXiv:2403.07319v1 [cs.CV] 12 Mar 2024

Efficient Diffusion Model for Image Restoration by Residual Shifting

Zongsheng Yue, Jianyi Wang, Chen Change Loy Z. Yue, J. Wang, and C. C. Loy are with S-Lab, Nanyang Technological University (NTU), Singapore (E-mail: [email protected], {jianyi001,ccloy}@ntu.edu.sg).C. C. Loy is the corresponding author.
Abstract

While diffusion-based image restoration (IR) methods have achieved remarkable success, they are still limited by the low inference speed attributed to the necessity of executing hundreds or even thousands of sampling steps. Existing acceleration sampling techniques, though seeking to expedite the process, inevitably sacrifice performance to some extent, resulting in over-blurry restored outcomes. To address this issue, this study proposes a novel and efficient diffusion model for IR that significantly reduces the required number of diffusion steps. Our method avoids the need for post-acceleration during inference, thereby avoiding the associated performance deterioration. Specifically, our proposed method establishes a Markov chain that facilitates the transitions between the high-quality and low-quality images by shifting their residuals, substantially improving the transition efficiency. A carefully formulated noise schedule is devised to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experimental evaluations demonstrate that the proposed method achieves superior or comparable performance to current state-of-the-art methods on three classical IR tasks, namely image super-resolution, image inpainting, and blind face restoration, even only with four sampling steps. Our code and model are publicly available at https://github.com/zsyOAOA/ResShift.

Index Terms:
Markov chain, noise schedule, image super-resolution, image inpainting, face restoration.

I Introduction

Image restoration (IR) is a fundamental problem in low-level vision, aiming at recovering a high-quality (HQ) image given its corresponding low-quality (LQ) counterpart. It encompasses various sub-tasks upon the type of degradation, including but not limited to image denoising [1, 2], image super-resolution [3, 4] and image inpainting [5, 6]. Notably, the degradation model is usually complicated in some cases, e.g., real-world super-resolution, making the IR problem severely ill-posed and challenging.

The diffusion model [7], a newly emerged generative model, has demonstrated remarkable success in the realm of image generation [8, 9], surpassing the previous GAN-based approaches [10, 11]. The main idea of diffusion models is to construct a hidden Markov chain that gradually degrades an image to white Gaussian noise in the forward process, and then approximate the reverse process using a deep neural network to generate this image. Attributed to its powerful generative capability, diffusion model has also exhibited great promise in solving the IR tasks, including image denoising [12, 13], deblurring [14, 15], inpainting [16, 17, 18, 19], colorization [20, 21, 22], among others. The exploration of the potential of diffusion models in IR still remains an active and ongoing research pursuit.

Existing diffusion-based IR methods can be broadly categorized into two classes. One straightforward approach [23, 24, 21, 25, 26, 27] is to insert the LQ image into the input of current diffusion model, e.g., DDPM [8], as a condition. Then, the model is retrained specifically for the task of IR. After training, this model is able to generate a desirable HQ image from Gaussian noise and the observed LQ image via the reverse sampling process. Another popular way, as explored in [28, 29, 30, 31, 32, 33, 34], attempts to harness a pre-trained unconditional diffusion model as a prior to facilitate the IR problem. The primary idea involves modifying the reverse sampling procedure to align the generated result with the given LQ observation via the degradation model at each step. Unfortunately, both strategies inherit the Markov chain underlying DDPM, which can be inefficient during inference, i.e., often necessitating hundreds or even thousands of sampling steps. Although some acceleration techniques [35, 36, 37] have been developed to reduce the sampling steps in inference, they inevitably lead to a significant drop in performance, resulting in over-smooth results as shown in Fig. 1(e)-(j), in which the DDIM [36] algorithm is employed to speed up the inference. Therefore, there is a need to devise a new diffusion model tailored for IR, capable of achieving a harmonious balance between efficiency and performance, without sacrificing one for the other.

Let us revisit the diffusion model within the context of image generation. In the forward process, it constructs a Markov chain to gradually transform the observed data into a pre-specified prior distribution, typically a standard Gaussian distribution, over a large number of steps. Conversely, the reverse process trains a deep neural network to approximate the reverse trajectory of this Markov chain. The well-trained neural network facilitates the random image generation by drawing samples from the reverse Markov chain, initiating at the Gaussian distribution. While the Gaussian distribution is well-suited for the task of image generation, its optimality is questioned in the domain of IR, where the LQ images are available. In this paper, we argue that a reasonable diffusion model for IR should start from a prior distribution centered around the LQ image, enabling an iterative recovery of the HQ image from its LQ counterpart instead of Gaussian white noise. Notably, this design not only aligns with the specifics of IR but also holds the potential to reduce the requisite number of diffusion steps for sampling, thereby enhancing overall inference efficiency.

Refer to caption
Figure 1: Qualitative comparisons on one typical real-world example of the proposed method and recent state of the arts, including RealESRGAN [38], BSRGAN [39], SwinIR [40], LDM [25], and StableSR [29]. As for the diffusion-based approaches, namely LDM, StableSR, and our proposed method, we annotate the number of sampling steps with the format of “Method-A” for more intuitive visualization, where “A” denotes the number of sampling steps. Note that LDM and StableSR both contain 1000 diffusion steps in training and are accelerated to “A” steps using DDIM [36] during inference. Please zoom in for a better view.

Following the aforementioned motivation, we propose an efficient diffusion model involving a shorter Markov chain for the transition between the HQ image and its corresponding LQ one. The initial state of the Markov chain converges towards an approximate distribution of the HQ image, while the final state converges towards an approximate distribution of the LQ image. To achieve this, we carefully design a transition kernel that gradually shifts the residual information between the HQ/LQ image pair. This approach exhibits superior efficiency in comparison to existing diffusion-based IR methods, given its ability to expeditiously transfer the residual information in several steps. Moreover, our design also allows for an analytical and concise expression for the evidence lower bound, thereby simplifying the formulation of the optimization objective for training purposes. Building upon the constructed diffusion kernel, we further develop a highly flexible noise schedule that controls the speed of residual shifting and the strength of the added noise in each step. This schedule facilitates a fidelity-realism trade-off of the recovered results by tuning its hyper-parameters.

In summary, the main contributions of this work are as follows:

  • We propose an efficient diffusion model specifically for IR. It builds up a short Markov chain between the HQ/LQ images, rendering a fast reverse sampling trajectory during inference. Extensive experiments show that our approach requires only four sampling steps to achieve appealing results, outperforming or at least being comparable to current state-of-the-art methods. A preview of the comparison results of the proposed method to recent approaches is shown in Fig. 1.

  • A highly flexible noise schedule is designed for the proposed diffusion model, capable of controlling the transition properties more precisely, including the shifting speed and the noise level. Through tuning the hyper-parameters, our method offers a more graceful solution to the widely acknowledged perception-distortion trade-off in IR.

  • The proposed method is a general diffusion-based framework for IR, being able to handle various IR tasks. This study has thoroughly substantiated its effectiveness and superiority on three typical and challenging IR tasks, namely image super-resolution, image inpainting, and blind face restoration.

In summary, this work endeavors to formulate an efficient diffusion model customized for IR, with the intention of breaking down the limitation of prevailing approaches on inference efficiency. A preliminary version of this work has been published in NeurIPS 2023 [41], focusing only on the task of image super-resolution. This study makes substantial improvements in both model design and empirical evaluation across diverse IR tasks compared with the conference version. Concretely, we incorporate the perceptual loss into the model optimization and substitute the self-attention layer with Swin transformer in the network architecture. The former modification can further reduce the diffusion steps from 15 to 4, and the latter endows our model with graceful adaptability to handle arbitrary resolutions during inference.

The remainder of the manuscript is organized as follows: Sec. II introduces the related work. Sec. III presents our designed diffusion model for IR. In Sec. IV and Sec. V, extensive experiments are conducted to evaluate the performance of our proposed method on the task of image super-resolution and image inpainting, respectively. Sec. VI finally concludes the paper.

II Related Work

In this section, we briefly review the literature on image restoration, traversing from conventional non-diffusion methodologies to recent diffusion-based approaches.

II-A Conventional Image Restoration Approaches

Most of the conventional IR methods can be cast into the Maximum a Posteriori (MAP) framework, a Bayesian paradigm encompassing a likelihood (loss) term and a prior (regularization) term. The likelihood reflects the underlying noise distribution of the LQ image. The commonly used L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT or L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss indeed corresponds to a Gaussian or Laplacian assumption on image noise. To more accurately depict the noise distribution, some robust alternatives were introduced, such as Poissonian-Gaussian [42], MoG [43], MoEP [44], Dirichlet MoG [45, 46] and so on. Simultaneously, there has been an increased focus on employing image priors to address the inherent ill-posedness of IR over recent decades. Typical image priors encompass total variation [47], wavelet coring [48], non-local similarity [49, 3], sparsity [50, 51], low-rankness [1, 52], dark channel [53, 54], among others. These conventional methods are mainly limited by the model capacity and the subjectivity inherited from the manually designed assumptions on image noise and prior.

In recent years, the landscape of IR has been dominated by deep learning (DL)-based methodologies. The seminal works [4, 55, 2] proposed to solve the IR problem using a convolution neural network, outperforming traditional model-based methods on the tasks of image denoising, super-resolution, and deblurring. Then, many studies [56, 57, 58, 59, 60, 61, 62, 63, 64, 6] have emerged, mainly concentrating on designing more delicate network architectures to further improve the restoration performance. Besides, there have been some discernible investigations that seek to combine current DL tools and classical IR modeling ideas. Notable works include the plug-and play or unfolding paradigm [65, 66, 67], learnable image priors [68, 69, 70, 71], and the loss-oriented methods [72, 73, 74]. The infusion of deep neural networks, owing to their large model capacity, has substantively extended the frontiers of IR tasks.

II-B Diffusion-based Image Restoration Approaches

Inspired by principles from non-equilibrium statistical physics, Sohl-Dickstein et al. [7] proposed the diffusion model to fit complex distributions. Subsequent advancements by Ho et al. [8] and Song et al. [20] further improve its theoretical foundation by integrating denoising score matching and stochastic differential equation, thereby achieving impressive success in image generation [25]. Owing to its powerful generative capability, diffusion models have also found successful applications in the field of IR. Next, we provide a comprehensive overview of recent developments in diffusion-based IR methods.

The most straightforward solution to solve the IR problem using diffusion models is to introduce the LQ image as an addition condition in each timestep. Pioneering this approach, Saharia et al. [23] have successfully trained a diffusion model for image super-resolution. Subsequent studies [21, 14, 27] further expanded upon this approach, exploring its applicability in image deblurring, colorization, and denoising. To circumvent the resource-intensive process of training from scratch, an alternative strategy involves harnessing a pre-trained diffusion model to facilitate IR tasks. Numerous investigations, such as [18, 24, 28, 17, 32, 33, 15], reformulated the reverse sampling procedure of a pre-trained diffusion model into an optimization problem by incorporating the degradation model, enabling solving the IR problem in a zero-shot manner. Most of these methods, however, cannot deal with the blind IR problem, as they rely on a pre-defined degradation model. In contrast, some other works [29, 75, 30] directly introduced a trainable module that takes the LQ image as input. This module modulates the feature maps of the pre-trained diffusion model, steering it toward the direction of generating a desirable HQ image. Such a paradigm eliminates the reliance on a degradation model in the test phase, rendering it capable of handling the blind IR tasks.

The methodologies mentioned above are grounded in the foundational diffusion model initially crafted for image generation, necessitating a large number of sampling steps. This inefficiency presents a constraint on their application in real scenarios. The primary goal of our investigation is to devise a new diffusion model customized for IR, which facilitates a swift transition between the LQ/HQ image pair, thereby enhancing efficiency during inference.

III Methodology

In this section, we present our proposed diffusion model tailored for IR. For ease of presentation, the LQ image and the corresponding HQ image are denoted as 𝒚0subscript𝒚0\bm{y}_{0}bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and 𝒙0subscript𝒙0\bm{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, respectively. Notably, we further assume 𝒚0subscript𝒚0\bm{y}_{0}bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and 𝒙0subscript𝒙0\bm{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT have identical spatial resolution, which can be easily achieved by pre-upsampling the LQ image 𝒚0subscript𝒚0\bm{y}_{0}bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT using nearest neighbor interpolation if necessary.

Refer to caption
Figure 2: Overview of the proposed method. Our method builds up a Markov chain between the HQ/LQ image pair by shifting their residuals. To alleviate the computational burden of this transition, it can be optionally moved to the latent space of VQGAN [76].

III-A Model Design

The iterative sampling paradigm of diffusion models has proven highly effective in generating intricate and vivid image details, inspiring us to design an iterative approach to address the IR problem as well. Our proposed method builds up a Markov chain to facilitate a transition from the HQ image to its LQ counterpart as shown in Fig. 2. In this way, the restoration of the desirable HQ image can be achieved by sampling along the reverse trajectory of this Markov chain that starts at the given LQ image. Next, we will detail how to construct such a Markov chain specifically for IR.

Forward Process. Let’s denote the residual between the LQ image and its corresponding HQ counterpart as 𝒆0subscript𝒆0\bm{e}_{0}bold_italic_e start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, i.e., 𝒆0=𝒚0𝒙0subscript𝒆0subscript𝒚0subscript𝒙0\bm{e}_{0}=\bm{y}_{0}-\bm{x}_{0}bold_italic_e start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. Our core idea is to construct a transition from 𝒙0subscript𝒙0\bm{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to 𝒚0subscript𝒚0\bm{y}_{0}bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT by gradually shifting their residual 𝒆0subscript𝒆0\bm{e}_{0}bold_italic_e start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT through a Markov chain with length T𝑇Titalic_T. Before that, we first introduce a shifting sequence {ηt}t=1Tsuperscriptsubscriptsubscript𝜂𝑡𝑡1𝑇\{\eta_{t}\}_{t=1}^{T}{ italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, which increases monotonically with respect to timestep t𝑡titalic_t and adheres to the condition of η10subscript𝜂10\eta_{1}\to 0italic_η start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT → 0 and ηT1subscript𝜂𝑇1\eta_{T}\to 1italic_η start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT → 1. Then, the transition distribution is formulated based on this shifting sequence as follows:

q(𝒙t|𝒙t1,𝒚0)=𝒩(𝒙t;𝒙t1+αt𝒆0,κ2αt𝑰),t=1,,T,formulae-sequence𝑞conditionalsubscript𝒙𝑡subscript𝒙𝑡1subscript𝒚0𝒩subscript𝒙𝑡subscript𝒙𝑡1subscript𝛼𝑡subscript𝒆0superscript𝜅2subscript𝛼𝑡𝑰𝑡1𝑇q(\bm{x}_{t}|\bm{x}_{t-1},\bm{y}_{0})=\mathcal{N}(\bm{x}_{t};\bm{x}_{t-1}+% \alpha_{t}\bm{e}_{0},\kappa^{2}\alpha_{t}\bm{I}),~{}t=1,\cdots,T,italic_q ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = caligraphic_N ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT + italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_e start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_κ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_I ) , italic_t = 1 , ⋯ , italic_T , (1)

where α1=η1subscript𝛼1subscript𝜂1\alpha_{1}=\eta_{1}italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_η start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and αt=ηtηt1subscript𝛼𝑡subscript𝜂𝑡subscript𝜂𝑡1\alpha_{t}=\eta_{t}-\eta_{t-1}italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_η start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT for t>1𝑡1t>1italic_t > 1, κ𝜅\kappaitalic_κ is a hyper-parameter controlling the noise variance, 𝑰𝑰\bm{I}bold_italic_I is the identity matrix. Notably, we show that the marginal distribution at any timestep t𝑡titalic_t is analytically integrable, namely

q(𝒙t|𝒙0,𝒚0)=𝒩(𝒙t;𝒙0+ηt𝒆0,κ2ηt𝑰),t=1,,T.formulae-sequence𝑞conditionalsubscript𝒙𝑡subscript𝒙0subscript𝒚0𝒩subscript𝒙𝑡subscript𝒙0subscript𝜂𝑡subscript𝒆0superscript𝜅2subscript𝜂𝑡𝑰𝑡1𝑇q(\bm{x}_{t}|\bm{x}_{0},\bm{y}_{0})=\mathcal{N}(\bm{x}_{t};\bm{x}_{0}+\eta_{t}% \bm{e}_{0},\kappa^{2}\eta_{t}\bm{I}),~{}t=1,\cdots,T.italic_q ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = caligraphic_N ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_e start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_κ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_I ) , italic_t = 1 , ⋯ , italic_T . (2)

The design of the transition distribution presented in Eq. (1) is guided by two primary principles. The first principle concerns the standard deviation, i.e., καt𝜅subscript𝛼𝑡\kappa\sqrt{\alpha_{t}}italic_κ square-root start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG, which aims to facilitate a smooth transition between 𝒙tsubscript𝒙𝑡\bm{x}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and 𝒙t1subscript𝒙𝑡1\bm{x}_{t-1}bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT. This is achieved by bounding the expected distance between 𝒙tsubscript𝒙𝑡\bm{x}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and 𝒙t1subscript𝒙𝑡1\bm{x}_{t-1}bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT with αtsubscript𝛼𝑡\sqrt{\alpha_{t}}square-root start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG, given that the image data falls within the range of [0,1]01[0,1][ 0 , 1 ]. Mathematically, this is expressed as:

max[(𝒙0+ηt𝒆0)(𝒙0+ηt1𝒆0)]=max[αt𝒆0]<αt<αt,maxdelimited-[]subscript𝒙0subscript𝜂𝑡subscript𝒆0subscript𝒙0subscript𝜂𝑡1subscript𝒆0maxdelimited-[]subscript𝛼𝑡subscript𝒆0subscript𝛼𝑡subscript𝛼𝑡\text{max}[(\bm{x}_{0}+\eta_{t}\bm{e}_{0})-(\bm{x}_{0}+\eta_{t-1}\bm{e}_{0})]=% \text{max}[\alpha_{t}\bm{e}_{0}]<\alpha_{t}<\sqrt{\alpha_{t}},max [ ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_e start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_η start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT bold_italic_e start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ] = max [ italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_e start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ] < italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT < square-root start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG , (3)

where max[]maxdelimited-[]\text{max}[\cdot]max [ ⋅ ] represents the pixel-wise maximizing operation. The hyper-parameter κ𝜅\kappaitalic_κ is introduced to increase the flexibility of this design. The second principle pertains to the mean parameter, i.e., 𝒙0+αt𝒆0subscript𝒙0subscript𝛼𝑡subscript𝒆0\bm{x}_{0}+\alpha_{t}\bm{e}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_e start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. Combining with the definition of αtsubscript𝛼𝑡\alpha_{t}italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, namely αt=ηtηt1subscript𝛼𝑡subscript𝜂𝑡subscript𝜂𝑡1\alpha_{t}=\eta_{t}-\eta_{t-1}italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_η start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT, it induces the marginal distribution in Eq. (2). Furthermore, the marginal distributions of 𝒙1subscript𝒙1\bm{x}_{1}bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝒙Tsubscript𝒙𝑇\bm{x}_{T}bold_italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT converges to δ𝒙0(𝒙)subscript𝛿subscript𝒙0𝒙\delta_{\bm{x}_{0}}(\bm{x})italic_δ start_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_x )111δ𝝁(𝒙)subscript𝛿𝝁𝒙\delta_{\bm{\mu}}(\bm{x})italic_δ start_POSTSUBSCRIPT bold_italic_μ end_POSTSUBSCRIPT ( bold_italic_x ) denotes the Dirac distribution centered at 𝝁𝝁\bm{\mu}bold_italic_μ. and 𝒩(𝒙;𝒚0,κ2𝑰)𝒩𝒙subscript𝒚0superscript𝜅2𝑰\mathcal{N}(\bm{x};\bm{y}_{0},\kappa^{2}\bm{I})caligraphic_N ( bold_italic_x ; bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_κ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_italic_I ), respectively, serving as approximations for the HQ image and the LQ image. By constructing the Markov chain in such a thoughtful way, it is possible to handle the IR task by inversely sampling from it given the LQ image 𝒚0subscript𝒚0\bm{y}_{0}bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT.

Reverse Process. The reverse process endeavors to estimate the posterior distribution p(𝒙0|𝒚0)𝑝conditionalsubscript𝒙0subscript𝒚0p(\bm{x}_{0}|\bm{y}_{0})italic_p ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) through the following formalization:

p(𝒙0|𝒚0)=p(𝒙T|𝒚0)t=1Tp𝜽(𝒙t1|𝒙t,𝒚0)d𝒙1:T,𝑝conditionalsubscript𝒙0subscript𝒚0𝑝conditionalsubscript𝒙𝑇subscript𝒚0superscriptsubscriptproduct𝑡1𝑇subscript𝑝𝜽conditionalsubscript𝒙𝑡1subscript𝒙𝑡subscript𝒚0dsubscript𝒙:1𝑇p(\bm{x}_{0}|\bm{y}_{0})=\int p(\bm{x}_{T}|\bm{y}_{0})\prod_{t=1}^{T}p_{\bm{% \theta}}(\bm{x}_{t-1}|\bm{x}_{t},\bm{y}_{0})\mathrm{d}\bm{x}_{1:T},italic_p ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = ∫ italic_p ( bold_italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT | bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) roman_d bold_italic_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT , (4)

where p(𝒙T|𝒚0)𝒩(𝒙T|𝒚0,κ2𝑰)𝑝conditionalsubscript𝒙𝑇subscript𝒚0𝒩conditionalsubscript𝒙𝑇subscript𝒚0superscript𝜅2𝑰p(\bm{x}_{T}|\bm{y}_{0})\approx\mathcal{N}(\bm{x}_{T}|\bm{y}_{0},\kappa^{2}\bm% {I})italic_p ( bold_italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT | bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ≈ caligraphic_N ( bold_italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT | bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_κ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_italic_I ), p𝜽(𝒙t1|𝒙t,𝒚0)subscript𝑝𝜽conditionalsubscript𝒙𝑡1subscript𝒙𝑡subscript𝒚0p_{\bm{\theta}}(\bm{x}_{t-1}|\bm{x}_{t},\bm{y}_{0})italic_p start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) represents the desirable inverse transition kernel from 𝒙tsubscript𝒙𝑡\bm{x}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to 𝒙t1subscript𝒙𝑡1\bm{x}_{t-1}bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT, parameterized by a learnable parameter 𝜽𝜽\bm{\theta}bold_italic_θ. Consistent with prevalent literature of diffusion model [7, 8, 20], we adopt the following Gaussian assumption:

p𝜽(𝒙t1|𝒙t,𝒚0)=𝒩(𝒙t1;𝝁𝜽(𝒙t,𝒚0,t),𝚺𝜽(𝒙t,𝒚0,t)).subscript𝑝𝜽conditionalsubscript𝒙𝑡1subscript𝒙𝑡subscript𝒚0𝒩subscript𝒙𝑡1subscript𝝁𝜽subscript𝒙𝑡subscript𝒚0𝑡subscript𝚺𝜽subscript𝒙𝑡subscript𝒚0𝑡p_{\bm{\theta}}(\bm{x}_{t-1}|\bm{x}_{t},\bm{y}_{0})=\mathcal{N}\left(\bm{x}_{t% -1};\bm{\mu}_{\bm{\theta}}(\bm{x}_{t},\bm{y}_{0},t),\bm{\Sigma}_{\bm{\theta}}(% \bm{x}_{t},\bm{y}_{0},t)\right).italic_p start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = caligraphic_N ( bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ; bold_italic_μ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t ) , bold_Σ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t ) ) . (5)

The optimization for 𝜽𝜽\bm{\theta}bold_italic_θ is achieved by minimizing the negative evidence lower bound, namely,

min𝜽tDKL[q(𝒙t1|𝒙t,𝒙0,𝒚0)p𝜽(𝒙t1|𝒙t,𝒚0)],\min_{\bm{\theta}}\sum_{t}D_{\text{KL}}\left[q(\bm{x}_{t-1}|\bm{x}_{t},\bm{x}_% {0},\bm{y}_{0})\|p_{\bm{\theta}}(\bm{x}_{t-1}|\bm{x}_{t},\bm{y}_{0})\right],roman_min start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_q ( bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ∥ italic_p start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ] , (6)

where DKL[]D_{\text{KL}}[\cdot\|\cdot]italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ ⋅ ∥ ⋅ ] denotes the Kullback-Leibler (KL) divergence. More mathematical details can be found in [7] or [8].

By combining Eq. (1) and Eq. (2), the target distribution q(𝒙t1|𝒙t,𝒙0,𝒚0)𝑞conditionalsubscript𝒙𝑡1subscript𝒙𝑡subscript𝒙0subscript𝒚0q(\bm{x}_{t-1}|\bm{x}_{t},\bm{x}_{0},\bm{y}_{0})italic_q ( bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) in Eq. (6) can be rendered tractable and expressed in an explicit form given below:

q(𝒙t1|𝒙t,𝒙0,𝒚0)=𝒩(𝒙t1|ηt1ηt𝒙t+αtηt𝒙0,κ2ηt1ηtαt𝑰).𝑞conditionalsubscript𝒙𝑡1subscript𝒙𝑡subscript𝒙0subscript𝒚0𝒩conditionalsubscript𝒙𝑡1subscript𝜂𝑡1subscript𝜂𝑡subscript𝒙𝑡subscript𝛼𝑡subscript𝜂𝑡subscript𝒙0superscript𝜅2subscript𝜂𝑡1subscript𝜂𝑡subscript𝛼𝑡𝑰q(\bm{x}_{t-1}|\bm{x}_{t},\bm{x}_{0},\bm{y}_{0})=\mathcal{N}\left(\bm{x}_{t-1}% \bigg{|}\frac{\eta_{t-1}}{\eta_{t}}\bm{x}_{t}+\frac{\alpha_{t}}{\eta_{t}}\bm{x% }_{0},\kappa^{2}\frac{\eta_{t-1}}{\eta_{t}}\alpha_{t}\bm{I}\right).italic_q ( bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = caligraphic_N ( bold_italic_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | divide start_ARG italic_η start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + divide start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_κ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT divide start_ARG italic_η start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_I ) . (7)

The detailed calculation of this derivation can be found in Appendix -A. Considering that the variance parameter is independent of 𝒙tsubscript𝒙𝑡\bm{x}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and 𝒚0subscript𝒚0\bm{y}_{0}bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, we thus set it to be the true variance, i.e.,

𝚺𝜽(𝒙t,𝒚0,t)=κ2ηt1ηtαt𝑰.subscript𝚺𝜽subscript𝒙𝑡subscript𝒚0𝑡superscript𝜅2subscript𝜂𝑡1subscript𝜂𝑡subscript𝛼𝑡𝑰\bm{\Sigma}_{\bm{\theta}}(\bm{x}_{t},\bm{y}_{0},t)=\kappa^{2}\frac{\eta_{t-1}}% {\eta_{t}}\alpha_{t}\bm{I}.bold_Σ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t ) = italic_κ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT divide start_ARG italic_η start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_italic_I . (8)

The mean parameter 𝝁𝜽(𝒙t,𝒚0,t)subscript𝝁𝜽subscript𝒙𝑡subscript𝒚0𝑡\bm{\mu}_{\bm{\theta}}(\bm{x}_{t},\bm{y}_{0},t)bold_italic_μ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t ) is reparameterized as:

𝝁𝜽(𝒙t,𝒚0,t)=ηt1ηt𝒙t+αtηtf𝜽(𝒙t,𝒚0,t),subscript𝝁𝜽subscript𝒙𝑡subscript𝒚0𝑡subscript𝜂𝑡1subscript𝜂𝑡subscript𝒙𝑡subscript𝛼𝑡subscript𝜂𝑡subscript𝑓𝜽subscript𝒙𝑡subscript𝒚0𝑡\bm{\mu}_{\bm{\theta}}(\bm{x}_{t},\bm{y}_{0},t)=\frac{\eta_{t-1}}{\eta_{t}}\bm% {x}_{t}+\frac{\alpha_{t}}{\eta_{t}}f_{\bm{\theta}}(\bm{x}_{t},\bm{y}_{0},t),bold_italic_μ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t ) = divide start_ARG italic_η start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + divide start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG italic_f start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t ) , (9)

where f𝜽()subscript𝑓𝜽f_{\bm{\theta}}(\cdot)italic_f start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( ⋅ ) is a deep neural network with parameter 𝜽𝜽\bm{\theta}bold_italic_θ, aiming to predict 𝒙0subscript𝒙0\bm{x}_{0}bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. We explored different parameterization forms on 𝝁θsubscript𝝁𝜃\bm{\mu}_{\theta}bold_italic_μ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT and found that Eq. (9) exhibits superior stability and performance.

Based on Eq. (9), the objective function in Eq. (6) is then simplified as:

min𝜽twtf𝜽(𝒙t,𝒚0,t)𝒙022,subscript𝜽subscript𝑡subscript𝑤𝑡superscriptsubscriptnormsubscript𝑓𝜽subscript𝒙𝑡subscript𝒚0𝑡subscript𝒙022\min_{\bm{\theta}}\sum\nolimits_{t}w_{t}\|f_{\bm{\theta}}(\bm{x}_{t},\bm{y}_{0% },t)-\bm{x}_{0}\|_{2}^{2},roman_min start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ italic_f start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t ) - bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (10)

where wt=αt2κ2ηtηt1subscript𝑤𝑡subscript𝛼𝑡2superscript𝜅2subscript𝜂𝑡subscript𝜂𝑡1w_{t}=\frac{\alpha_{t}}{2\kappa^{2}\eta_{t}\eta_{t-1}}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = divide start_ARG italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG start_ARG 2 italic_κ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_η start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT end_ARG. In practice, we empirically find that the omission of weight wtsubscript𝑤𝑡w_{t}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT results in a notable performance improvement, aligning with the conclusion in [8].

Perceptual Regularization. As presented above, our proposed method facilitates an iterative restoration process starting from the LQ image, in contrast to prior methods that initialize from Gaussian noise. This approach effectively reduces the number of diffusion steps. A comprehensive experimental analysis in Sec. IV-B substantiates that our method yields promising results with a mere 15 sampling steps, demonstrating a notable acceleration compared to established methodologies [25, 29].

Unfortunately, attempts at further acceleration, particularly with fewer than 5 sampling steps, tend to produce over-smooth results. This phenomenon is primarily attributed to that the L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT-based loss in Eq. (10) favors the prediction of an average over plausible solutions. To overcome this limitation, we introduce the perceptual regularization [77] to impose an additional constraint on the solution space, namely,

min𝜽tf𝜽(𝒙t,𝒚0,t)𝒙022+λlp(f𝜽(𝒙t,𝒚0,t),𝒙0),subscript𝜽subscript𝑡superscriptsubscriptnormsubscript𝑓𝜽subscript𝒙𝑡subscript𝒚0𝑡subscript𝒙022𝜆subscript𝑙𝑝subscript𝑓𝜽subscript𝒙𝑡subscript𝒚0𝑡subscript𝒙0\min_{\bm{\theta}}\sum\nolimits_{t}\|f_{\bm{\theta}}(\bm{x}_{t},\bm{y}_{0},t)-% \bm{x}_{0}\|_{2}^{2}+\lambda l_{p}\left(f_{\bm{\theta}}(\bm{x}_{t},\bm{y}_{0},% t),\bm{x}_{0}\right),roman_min start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ italic_f start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t ) - bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ italic_l start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_t ) , bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) , (11)

where lp(,)subscript𝑙𝑝l_{p}(\cdot,\cdot)italic_l start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( ⋅ , ⋅ ) denotes the pre-trained LPIPS metric, λ𝜆\lambdaitalic_λ is a hyper-parameter controlling the relative importance of these two constraints. This simple solution effectively curtails the sampling trajectory to fewer steps, e.g., 4 steps in this study, as shown in Sec. IV-B, while concurrently maintaining superior performance.

Extension to Latent Space. To alleviate the computational overhead in training, our proposed model can be optionally moved into the latent space of VQGAN [76], where the original image is compressed by a factor of four in spatial dimensions. This does not require any modifications on our model other than substituting 𝒙𝟎subscript𝒙0\bm{x_{0}}bold_italic_x start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT and 𝒚0subscript𝒚0\bm{y}_{0}bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT with their latent codes. An intuitive illustration is shown in Fig. 2.

Refer to caption
Figure 3: Illustration of the proposed noise schedule. (a) HQ image. (b) Zoomed LQ image. (c)-(d) Diffused images of the proposed noise schedule in timesteps of 1, 3, 5, 7, 9, 12, and 15 under different values of κ𝜅\kappaitalic_κ by fixing p=0.3𝑝0.3p=0.3italic_p = 0.3 and T=15𝑇15T=15italic_T = 15. (e)-(f) Diffused images of our method with a specified configuration of κ=40,p=0.8,T=1000formulae-sequence𝜅40formulae-sequence𝑝0.8𝑇1000\kappa=40,p=0.8,T=1000italic_κ = 40 , italic_p = 0.8 , italic_T = 1000 and LDM [25] in timesteps of 100, 200, 400, 600, 800, 900, and 1000. (g) The relative noise intensity (vertical axes, measured by 1/λsnr1subscript𝜆snr\sqrt{\nicefrac{{1}}{{\lambda_{\text{snr}}}}}square-root start_ARG / start_ARG 1 end_ARG start_ARG italic_λ start_POSTSUBSCRIPT snr end_POSTSUBSCRIPT end_ARG end_ARG, where λsnrsubscript𝜆snr\lambda_{\text{snr}}italic_λ start_POSTSUBSCRIPT snr end_POSTSUBSCRIPT denotes the signal-to-noise ratio) of the schedules in (d) and (e) w.r.t. the timesteps (horizontal axes). (h) The shifting speed ηtsubscript𝜂𝑡\sqrt{\eta_{t}}square-root start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG (vertical axes) w.r.t. to the timesteps (horizontal axes) across various configurations of p𝑝pitalic_p. Note that the diffusion processes in this figure are implemented in the latent space, but we display the intermediate results after decoding back to the image space for the purpose of easy visualization.

III-B Noise Schedule

The proposed method employs a hyper-parameter κ𝜅\kappaitalic_κ and a shifting sequence {ηt}t=1Tsuperscriptsubscriptsubscript𝜂𝑡𝑡1𝑇\{\eta_{t}\}_{t=1}^{T}{ italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT to determine the noise schedule of the diffusion process. In particular, the hyper-parameter κ𝜅\kappaitalic_κ regulates the overall noise intensity during the transition, and its influence on performance is empirically discussed in Sec. IV-B. The subsequent exposition mainly revolves around the construction of the shifting sequence {ηt}t=1Tsuperscriptsubscriptsubscript𝜂𝑡𝑡1𝑇\{\eta_{t}\}_{t=1}^{T}{ italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT.

Equation (2) indicates that the stochastic perturbation in state 𝒙tsubscript𝒙𝑡\bm{x}_{t}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is proportional to ηtsubscript𝜂𝑡\sqrt{\eta_{t}}square-root start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG, incorporating a scaling factor κ𝜅\kappaitalic_κ. This observation motivates us to focus on the design of ηtsubscript𝜂𝑡\sqrt{\eta_{t}}square-root start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG rather than ηtsubscript𝜂𝑡\eta_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Previous work by Song et al. [78] has suggested that maintaining a sufficiently small value for κη1𝜅subscript𝜂1\kappa\sqrt{\eta_{1}}italic_κ square-root start_ARG italic_η start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG, such as 0.04 in LDM [25], is imperative to ensure q(𝒙1|𝒙0,𝒚0)q(𝒙0)𝑞conditionalsubscript𝒙1subscript𝒙0subscript𝒚0𝑞subscript𝒙0q(\bm{x}_{1}|\bm{x}_{0},\bm{y}_{0})\approx q(\bm{x}_{0})italic_q ( bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ≈ italic_q ( bold_italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ). Further considering η10subscript𝜂10\eta_{1}\to 0italic_η start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT → 0, we set η1subscript𝜂1\eta_{1}italic_η start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT to be the minimum value between (0.04/κ)2superscript0.04𝜅2(\nicefrac{{0.04}}{{\kappa}})^{2}( / start_ARG 0.04 end_ARG start_ARG italic_κ end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and 0.0010.0010.0010.001. For the terminal step T𝑇Titalic_T, we set ηTsubscript𝜂𝑇\eta_{T}italic_η start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT as 0.999, guaranteeing ηT1subscript𝜂𝑇1\eta_{T}\to 1italic_η start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT → 1. For the intermediate timesteps t[2,T1]𝑡2𝑇1t\in[2,T-1]italic_t ∈ [ 2 , italic_T - 1 ], we propose a non-uniform geometric schedule for ηtsubscript𝜂𝑡\sqrt{\eta_{t}}square-root start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG as follows:

ηt=η1×b0βt,t=2,,T1,formulae-sequencesubscript𝜂𝑡subscript𝜂1superscriptsubscript𝑏0subscript𝛽𝑡𝑡2𝑇1\sqrt{\eta_{t}}=\sqrt{\eta_{1}}\times b_{0}^{\beta_{t}},~{}t=2,\cdots,T-1,square-root start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG = square-root start_ARG italic_η start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG × italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , italic_t = 2 , ⋯ , italic_T - 1 , (12)

where

βtsubscript𝛽𝑡\displaystyle\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT =(t1T1)p×(T1),absentsuperscript𝑡1𝑇1𝑝𝑇1\displaystyle=\left(\frac{t-1}{T-1}\right)^{p}\times(T-1),= ( divide start_ARG italic_t - 1 end_ARG start_ARG italic_T - 1 end_ARG ) start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT × ( italic_T - 1 ) , (13a)
b0subscript𝑏0\displaystyle b_{0}italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT =exp[12(T1)logηTη1].absent12𝑇1subscript𝜂𝑇subscript𝜂1\displaystyle=\exp\left[\frac{1}{2(T-1)}\log{\frac{\eta_{T}}{\eta_{1}}}\right].= roman_exp [ divide start_ARG 1 end_ARG start_ARG 2 ( italic_T - 1 ) end_ARG roman_log divide start_ARG italic_η start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_ARG start_ARG italic_η start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG ] . (13b)

It should be noted that the choice of βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and b0subscript𝑏0b_{0}italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is grounded in the assumption of β1=0subscript𝛽10\beta_{1}=0italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0, βT=T1subscript𝛽𝑇𝑇1\beta_{T}=T-1italic_β start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = italic_T - 1, and ηT=η1×b0T1subscript𝜂𝑇subscript𝜂1superscriptsubscript𝑏0𝑇1\sqrt{\eta_{T}}=\sqrt{\eta_{1}}\times b_{0}^{T-1}square-root start_ARG italic_η start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_ARG = square-root start_ARG italic_η start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG × italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T - 1 end_POSTSUPERSCRIPT. The hyper-parameter p𝑝pitalic_p controls the growth rate of ηtsubscript𝜂𝑡\sqrt{\eta_{t}}square-root start_ARG italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG, as depicted in Fig. 3(h).

The proposed noise schedule exhibits a high degree of flexibility in three key aspects. First, in the case of small values of κ𝜅\kappaitalic_κ, the final state 𝒙Tsubscript𝒙𝑇\bm{x}_{T}bold_italic_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT converges towards a perturbation around the LR image, as illustrated in Fig. 3(c)-(d). Compared to the diffusion process ended at Gaussian noise, this design significantly shortens the length of the Markov chain, thereby improving the inference efficiency. Second, the hyper-parameter p𝑝pitalic_p provides precise control over the shifting speed, enabling a fidelity-realism trade-off in the SR results as analyzed in Sec. IV-B. Third, by setting κ=40𝜅40\kappa=40italic_κ = 40 and p=0.8𝑝0.8p=0.8italic_p = 0.8, our method achieves a diffusion process that remarkably resembles to LDM [25]. This is clearly demonstrated by the visual results of the diffusion process presented in Fig. 3(e)-(f), and further supported by the comparisons on the relative noise strength as shown in Fig. 3(g).

Refer to caption
Figure 4: Visual comparison of two different models containing some self-attention layers (denoted as model-1) or Swin transformers (denoted as model-2). (a1) and (a2): zoomed LQ images with resolutions of 64×64646464\times 6464 × 64 or 128×128128128128\times 128128 × 128. (b1) and (b2): super-resolved results by model-1. (c1) and (c2): visualized attention maps extracted from the first self-attention layer of model-1. Note that these visualized results are obtained by first calculating the first principal component of the attention map via PCA and then reshaping it to the targeted size. In the left-upper corner, we annotate the average entropy values of these attention maps. (d1) and (d2): super-resolved results by model-2.

@articleliang2024control, title=Control Color: Multimodal Diffusion-based Interactive Image Colorization, author=Liang, Zhexin and Li, Zhaochen and Zhou, Shangchen and Li, Chongyi and Loy, Chen Change, journal=arXiv preprint arXiv:2402.10855, year=2024

III-C Discussion on Arbitrary Resolution

It is widely acknowledged that the self-attention layer [79], a pivotal component in recent diffusion architectures, plays a crucial role in capturing global information in image generation. In the field of IR, however, it causes a blurring issue in handling the test images with arbitrary resolutions, particularly when the test resolution largely diverges from the training resolution. One typical example is provided in Figure 4, considering two LQ images with different resolutions. The baseline model with multiple self-attention layers, which is trained on a resolution of 64×64646464\times 6464 × 64, performs well when the LQ image aligns with the training resolution but yields blurred results when confronted with a mismatched resolution of 128×128128128128\times 128128 × 128.

To analyze the underlying reason, we visualize the attention maps extracted from the first attention layer in this baseline network, as shown in Fig. 4(c1) and (c2). Note that these two attention maps are both interpolated to the resolution of 256×256256256256\times 256256 × 256 for ease of comparison. Evidently, the example with a larger resolution tends to generate a more uniformly distributed attention map, i.e., Fig. 4(c2), being consistent with the entropy values annotated on the left-upper corner222The principle of maximum entropy posits that it achieves the maximum entropy when the attention map conforms to a uniform distribution. Consequently, a uniformly distributed attention map often leads to an over-smooth outcome, introducing undesirable distortions in performance.

To address this issue, some recent studies [14, 12] have chosen to discard the self-attention layers, a strategy that typically results in a noticeable decline in performance. Inspired by Liang et al. [40], we propose a solution by substituting the self-attention layer with that in Swin Transformer [80]. This straightforward replacement not only alleviates the blurring problem but also maintains the promised performance, as shown in Fig. 4(d1) and (d2). This is because the Swin Transformer computes the attention map in a local window, thus being independent of the image resolution.

IV Experiments on Image Super-resolution

This section offers an evaluation of the proposed method on the task of image super-resolution (SR), with a particular focus on the setting of ×4absent4\times 4× 4 SR following existing studies [39, 38]. We first provide an in-depth analysis on the proposed model and then conduct a thorough comparison against recent state-of-the-art methods (SotAs) on one synthetic and two real-world datasets. For the sake of brevity in presentation, our method is herein referred to ResShift or ResShiftL. The former is trained based on the primary loss of diffusion model in Eq. (10), while the latter further introduces the perceptual regularization as shown in Eq. (11). To facilitate a more profound understanding of our designed diffusion model and the noise schedule, the model analysis part of Sec. IV-B mainly centers around ResShift.

Refer to caption
Figure 5: Qualitative comparisons of ResShift under different combinations of (T𝑇Titalic_T, p𝑝pitalic_p, κ𝜅\kappaitalic_κ) on the task of image super-resolution. For example, “(15, 0.3, 2.0)” represents the recovered result with T=15𝑇15T=15italic_T = 15, p=0.3𝑝0.3p=0.3italic_p = 0.3, and κ=2.0𝜅2.0\kappa=2.0italic_κ = 2.0. Please zoom in for a better view.
TABLE I: Performance comparison of ResShift on the ImageNet-Test dataset for image super-resolution under different configurations.
  Configurations Metrics

 

T𝑇Titalic_T

p𝑝pitalic_p

κ𝜅\kappaitalic_κ

PSNR\uparrow

SSIM\uparrow

LPIPS\downarrow

 

4

0.3 2.0

25.64

0.6903

0.3242

10

25.20

0.6828

0.2517

15

25.01

0.6769

0.2312

30

24.52

0.6585

0.2253

50

24.22

0.6483

0.2212

15

0.3

2.0

25.01

0.6769

0.2312

0.5

25.05

0.6745

0.2387

1.0

25.12

0.6780

0.2613

2.0

25.32

0.6827

0.3050

3.0

25.39

0.5813

0.3432

15 0.3

0.5

24.90

0.6709

0.2437

1.0

24.84

0.6699

0.2354

2.0

25.01

0.6769

0.2312

8.0

25.31

0.6858

0.2592

16.0

24.46

0.6891

0.2772

 

IV-A Experimental Setup

Training Details. The HQ images in our training data, with a resolution of 256×256256256256\times 256256 × 256, are randomly cropped from the training set of ImageNet [81] like LDM [25]. The LQ images are synthesized using the degradation pipeline of RealESRGAN [38]. To train our model, we adopted the Adam [82] algorithm with its default settings in PyTorch [83] and set the mini-batch size as 64. The learning rate is gradually decayed from 5e-55e-55\text{e-}55 e- 5 to 2e-52e-52\text{e-}52 e- 5 according to the annealing cosine schedule [84], and a total of 500K iterations are implemented throughout the training. As for the network architecture, the UNet backbone in DDPM [8] is employed. To increase our model’s adaptability to arbitrary image resolutions, we replace the self-attention layer in UNet with Swin Transformer [80] as explained in Sec. III-C.

Refer to caption
Figure 6: Perception-distortion trade-off of ResShift and LDM. The vertical and horizontal axes represent the strength of the perception and distortion, measured by LPIPS and MSE, respectively.

Test Datasets. We randomly select 3000 images from the validation set of ImageNet [81] as our synthetic test data, denoted as ImageNet-Test for convenience. The LQ images are generated based on the commonly-used degradation model:

𝒚=(𝒙*𝒌)+𝒏,𝒚𝒙𝒌𝒏\bm{y}=(\bm{x}*\bm{k})\downarrow+\bm{n},bold_italic_y = ( bold_italic_x * bold_italic_k ) ↓ + bold_italic_n , (14)

where 𝒌𝒌\bm{k}bold_italic_k is the blurring kernel, 𝒏𝒏\bm{n}bold_italic_n is the noise, 𝒚𝒚\bm{y}bold_italic_y and 𝒙𝒙\bm{x}bold_italic_x denote the LQ image and HQ image, respectively. To comprehensively evaluate the performance of our model, we consider more complicated types of blurring kernel, downsampling operator, and noise type. The detailed settings on them can be found in Appendix -C1. It should be noted that we select the HQ images from ImageNet [81] instead of the prevailing datasets in SR, such as Set5 [85], Set14 [86], and Urban100 [87]. The rationale behind this setting is rooted in the fact that these prevailing datasets only contain very few source images, which fail to thoroughly evaluate the performance of various methods under a multitude of different degradation types.

Two real-world datasets are adopted to evaluate the efficacy of our method. The first is RealSR-V3 [88], containing 100 real images captured by Canon 5D3 and Nikon D810 cameras. Additionally, we collect another real-world dataset named RealSet80. It comprises 50 LR images widely used in recent literature [89, 90, 91, 92, 38, 93]. The remaining 30 images are downloaded from the internet by ourselves.

TABLE II: Quantitative comparison on performance, running time, and the number of parameters of different methods on ImageNet-Test dataset for Image Super-resolution. The results of the diffusion-based methods are denoted as “Method-A”, where “A” represents the number of sampling steps. Running time is tested on NVIDIA Tesla A100 GPU on the x4 (64\rightarrow256) SR task. The best and second best results are highlighted in bold and underline. The non-trainable parameters, such as the parameters of VQGAN in LDM, are marked with \textcolor[gray]0.5gray color for clarity.

 

Methods

Metrics
6 

PSNR\uparrow

SSIM\uparrow

LPIPS\downarrow

Runtime (s)

#Params (M)

 

ESRGAN [72]

20.67

0.448

0.485

0.038

16.70

RealSR-JPEG [94]

23.11

0.591

0.326

0.038

16.70

BSRGAN [39]

24.42

0.659

0.259

0.038

16.70

SwinIR [40]

23.99

0.667

0.238

0.107

28.01

RealESRGAN [38]

24.04

0.665

0.254

0.038

16.70

DASR [95]

24.75

0.675

0.250

0.022

8.06

 

LDM-15 [25]

24.89

0.670

0.269

0.247

113.60+\textcolor[gray]0.555.32

LDM-4 [25]

24.74

0.657

0.345

0.077

 

StableSR-15 [29]

23.37

0.631

0.262

1.070

52.49+\textcolor[gray]0.51422.49

StableSR-4 [29]

24.11

0.658

0.287

0.399

 

ResShift-15

25.01 0.677 0.231

0.682

118.59+\textcolor[gray]0.555.32

ResShiftL-4

25.02 0.683 0.208

0.186

 
Refer to caption
Figure 7: Qualitative results of different methods on the synthetic ImageNet-Test dataset for image super-resolution. Note that we only display the comparison results to the recent five SotA methods in (b)-(f) due to the page limitation, and the whole results can be found in Fig. 14 of the appendix. Please zoom in for a better view.
TABLE III: Quantitative results of different methods on two real-world datasets for image super-resolution. Note that the results of diffusion-based methods are denoted as “Method-A”, where “A” represents the number of sampling steps. The best and second best results are highlighted in bold and underline.

 

Methods

Datasets
5 
RealSR-V3 [88] RealSet80
5 

CLIPIQA\uparrow

MUSIQ\uparrow

CLIPIQA\uparrow

MUSIQ\uparrow

 

ESRGAN [72]

0.2362

29.048

0.4165

48.153

RealSR-JPEG [94]

0.3615

36.076

0.5828

57.379

BSRGAN [39]

0.5439

63.586

0.6263

66.629

SwinIR [40]

0.4654

59.632

0.6072

64.739

RealESRGAN [38]

0.4898

59.678

0.6189

64.496

DASR [95]

0.3629

45.825

0.5311

58.974

 

LDM-50 [25]

0.4907

54.391

0.5511

55.826

LDM-15 [25]

0.3836

49.317

0.4592

50.972

 

StableSR-50 [29]

0.5208

60.177

0.6214

62.761

StableSR-15 [29]

0.4974

59.099

0.5975

61.476

 

ResShift-15

0.6014

58.979

0.6645

62.782

ResShiftL-4 0.5854

56.184

0.6483

61.017

 

Compared Methods. We evaluate the effectiveness of ResShift and ResShiftL in comparison to seven recent SR methods, namely RealSR-JPEG [94], BSRGAN [39], RealESRGAN [38], SwinIR [40], DASR [95], LDM [25], and StableSR [29]. LDM and StableSR are both diffusion-based methods trained with 1,000 diffusion steps. For a fair comparison, we accelerate them to 15 or 4 steps using the DDIM [36] algorithm during inference in this study. The hyper-parameter η𝜂\etaitalic_η in DDIM is set to be 1.0 as this value yields the most realistic results. For clarity, the results of LDM and StableSR are denoted as “Method-A”, where “A” represents the number of inference steps.

Evaluation Metrics. We evaluate the efficacy of different methods using three widely used metrics in most of the SR literature, namely PSNR, SSIM [96], and LPIPS [77], on the synthetic dataset. Notably, PSNR and SSIM are calculated in the luminance (Y) channel of YCbCr space, while LPIPS is directly computed in the standard RGB (sRGB) space. For assessments on real-world datasets, we mainly rely on CLIPIQA [97] and MUSIQ [98] as evaluative metrics to compare the performance of various methods. CLIPIQA and MUSIQ are both non-reference metrics specifically designed for assessing the realism of images. Of particular note, CLIPIQA leverages the CLIP [99] model, pre-trained on the extensive Laion400M [100] dataset, thereby demonstrating strong generalization ability.