The Repository of SEGA-FURN
Important Disclaimer Note
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Face super-resolution is a domain-specific single-image super-resolution task that aims to generate High-Resolution (HR) face images from their Low-Resolution (LR) counterparts. In this paper, we propose a novel face super-resolution method, namely the Semantic Encoder guided Generative Adversarial Face Ultra-Resolution Network (SEGA-FURN), to ultra-resolve an unaligned tiny LR face image to its HR counterpart with multiple ultra-upscaling factors (e.g., 4× and 8×). In our method, the proposed semantic encoder captures the embedded semantics to guide adversarial learning, and the novel architecture named Residual in Internal Dense Block (RIDB) extracts hierarchical features for the generator. Moreover, we propose a joint discriminator which discriminates not only image data but also embedded semantics, learning the joint probability distribution of the image space and latent space. We use a Relativistic average Least Squares (RaLS) loss as the adversarial loss, which alleviates the gradient vanishing problem and enhances the stability of training. Extensive experiments on large face datasets show that the proposed method achieves superior super-resolution results and significantly outperforms state-of-the-art methods in both qualitative and quantitative comparisons. The overview of our method is shown in Fig. 1.
Figure 1. Proposed SEGA-FURN and its components: Semantic Encoder.
We propose a novel GAN-based SR method, namely the Semantic Encoder guided Generative Adversarial Face Ultra-Resolution Network (SEGA-FURN), as shown in Fig. 1. The main contributions of this paper can be summarized as follows:
- Our proposed method is able to ultra-resolve an unaligned tiny face image to a Super-Resolved (SR) face image with multiple upscaling factors (e.g., 4× and 8×). In addition, our method does not need any prior information or facial landmark points.
- We design a semantic encoder to map the image information back to embedded semantics reflecting facial semantic attributes. The embedded semantics, combined with the image data, are fed into a joint discriminator. This lets the semantic encoder guide the discriminative process, which enhances the discriminative ability of the discriminator.
- We propose a Residual in Internal Dense Block (RIDB) as the basic architecture for the generator. This provides an effective way to take advantage of hierarchical features, increasing the feature extraction capability of SEGA-FURN.
- We propose a joint discriminator capable of learning the joint probability distribution constructed from embedded semantics and visual information (HR and LR images), resulting in a powerful discriminative ability. Furthermore, to remedy the vanishing-gradient problem and improve model stability, we use the RaLS objective loss to optimize the training process.
In this section, we present the proposed method SEGA-FURN in detail. First, we describe the novel architecture of SEGA-FURN through its four main parts: generator, semantic encoder, joint discriminator, and feature extractor. Next, we introduce the RaLS objective loss for optimizing the generator and the discriminator, respectively. Finally, we provide the overall perceptual loss used in SEGA-FURN. The overview of our method is shown in Fig. 1. In addition, the architectures of the generator and the joint discriminator are shown in Fig. 2, and the structures of the proposed DNB and RIDB are presented in Fig. 3.
Figure 2. Red dotted rectangle: The architecture of Generator. Blue dotted rectangle: The architecture of the Joint Discriminator.
As shown at the top of Fig. 2, the proposed generator mainly consists of three stages: Shallow Feature Module (SFM), Multi-level Dense Block Module (MDBM), and Upsampling Module (UM).
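The three-stage pipeline above can be sketched as follows. This is an illustrative stand-in, not the paper's actual layers: the per-stage operations, feature widths, and the nearest-neighbour upsampling are all assumptions made only to show how the stages compose.

```python
import numpy as np

def shallow_feature_module(lr_img):
    """SFM: extract shallow features (placeholder for an initial conv layer)."""
    return lr_img + 0.1

def multi_level_dense_block_module(feat, n_blocks=3):
    """MDBM: stack of dense blocks; each block here is a toy residual refinement."""
    for _ in range(n_blocks):
        feat = feat + 0.5 * np.tanh(feat)  # placeholder for a DNB
    return feat

def upsampling_module(feat, scale=4):
    """UM: nearest-neighbour upsampling as a stand-in for learned upsampling."""
    return feat.repeat(scale, axis=0).repeat(scale, axis=1)

def generator(lr_img, scale=4):
    """Compose the three stages: SFM -> MDBM -> UM."""
    return upsampling_module(
        multi_level_dense_block_module(shallow_feature_module(lr_img)), scale)

lr = np.random.rand(32, 32)
sr = generator(lr, scale=4)  # a 32x32 input becomes 128x128 at 4x
```

In the real network each placeholder would be a learned convolutional stage; only the SFM → MDBM → UM composition mirrors the description above.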
As mentioned in Sec. 1, the novel RIDB architecture is proposed for the generator and is used to form the DNB (as shown in Fig. 3).
Figure 3. Top: Dense Nested Block (DNB) consists of multiple RIDBs. Bottom: The proposed Residual in Internal Dense Block (RIDB).
The proposed RIDB is able to extract hierarchical features and address the vanishing-gradient problem, a commonly encountered issue in SRGAN, ESRGAN, SRDenseNet, URDGN, and RDN. The RIDB is made up of four internal dense blocks, and all the internal dense blocks are cascaded through residual connections performing identity mapping.
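A minimal sketch of this structure is given below. The number of layers per internal dense block, the tanh "convolutions", and the mean-based feature fusion are hypothetical stand-ins; only the dense connectivity inside each block and the residual (identity) cascade across the four blocks follow the description above.

```python
import numpy as np

def internal_dense_block(x, n_layers=3):
    """Dense connectivity: each layer sees the concatenation of all preceding
    feature maps; a final fusion restores the input width (1x1-conv stand-in)."""
    width = x.shape[-1]
    feats = [x]
    for _ in range(n_layers):
        cat = np.concatenate(feats, axis=-1)
        feats.append(np.tanh(cat[..., :width]))  # placeholder for conv + activation
    fused = np.concatenate(feats, axis=-1).mean(axis=-1, keepdims=True)
    return np.repeat(fused, width, axis=-1)     # fuse back to the input width

def ridb(x, n_internal=4):
    """RIDB: four internal dense blocks cascaded via residual connections
    performing identity mapping; the outer skip is an assumption."""
    out = x
    for _ in range(n_internal):
        out = out + internal_dense_block(out)   # identity mapping per block
    return x + out

y = ridb(np.random.rand(8, 8, 16))  # shape is preserved through the block
```

Because every connection is additive or concatenative, gradients have short paths back to the input, which is the property the RIDB relies on to counter vanishing gradients.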
The proposed semantic encoder is designed to extract embedded semantics (as shown in Fig. 1), projecting visual information (HR, LR images) back to the latent space. The motivation is that existing GAN-based SR models (SRGAN, ESRGAN, URDGN) exploit only visual information during the discriminative procedure, ignoring the semantic information reflected by the latent representation. The proposed semantic encoder therefore supplies this missing property. Previous GAN work (BiGAN, ALI) has shown that semantic representations are beneficial to the discriminator.
Based on this observation, the proposed semantic encoder inversely maps the image to the embedded semantics. The most important advantage of the semantic encoder is that it can guide the discriminative process, since the embedded semantics reflect semantic attributes such as facial features (shape and gender) and the spatial relationships between facial components (eyes, mouth). Notably, the embedded semantics are fed into the joint discriminator along with the HR and LR images. Thanks to this property, the semantic encoder guides the optimization of the discriminator, thereby enhancing its discriminative ability. More details can be found in the supplementary material.
As shown in Fig. 1, the proposed joint discriminator takes a tuple incorporating both visual information and embedded semantics as input: the Embedded Semantics-Level Discriminative Sub-Net (ESLDSN) receives the embedded semantics, while the image information is sent to the Image-Level Discriminative Sub-Net (ILDSN). Next, the Fully Connected Module (FCM) operates on the concatenated feature vector to predict the final probability. Thus, the joint discriminator learns the joint probability distribution of image data and embedded semantics.
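The data flow through the two sub-nets and the FCM can be sketched as follows. The feature dimensions, the tanh sub-net stand-ins, and the single linear layer for the FCM are all assumptions for illustration; only the concatenate-then-score structure follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature widths; the paper does not specify them here.
IMG_FEAT, SEM_FEAT = 64, 32

def ildsn(image):
    """Image-Level Discriminative Sub-Net: image -> feature vector (stand-in)."""
    return np.tanh(image.reshape(-1)[:IMG_FEAT])

def esldsn(semantics):
    """Embedded Semantics-Level Discriminative Sub-Net: semantics -> features."""
    return np.tanh(semantics[:SEM_FEAT])

# Stand-in weights for the Fully Connected Module (FCM).
W = rng.standard_normal(IMG_FEAT + SEM_FEAT)

def joint_discriminator(image, semantics):
    """Concatenate both feature vectors and map them to a single raw score."""
    joint = np.concatenate([ildsn(image), esldsn(semantics)])
    return float(joint @ W)

score = joint_discriminator(rng.standard_normal((16, 16)),
                            rng.standard_normal(32))
```

The score is left as a raw value rather than a sigmoid probability because the RaLS loss below operates on unbounded discriminator outputs.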
Moreover, in order to alleviate the problem of gradient vanishing and enhance model stability, we adopt the Relativistic average Least Squares GAN (RaLSGAN) objective loss for the joint discriminator by applying the Relativistic average Discriminator (RaD) to the least squares loss function (LSGAN). We denote the real tuple by $X_{real}$ and the fake tuple by $X_{fake}$:
$L_{D}^{RaLS}=\mathbb{E}_{I^{HR}\sim p_{(I^{HR})}}[(\tilde{C}(X_{real})-1)^{2}]+\mathbb{E}_{I^{SR}\sim p_{(I^{SR})}}[(\tilde{C}(X_{fake})+1)^{2}]$
$L_{G}^{RaLS}=\mathbb{E}_{I^{SR}\sim p_{(I^{SR})}}[(\tilde{C}(X_{fake})-1)^{2}]+\mathbb{E}_{I^{HR}\sim p_{(I^{HR})}}[(\tilde{C}(X_{real})+1)^{2}]$
where $\tilde{C}(X_{real})$ and $\tilde{C}(X_{fake})$ are the relativistic average discriminator outputs, i.e., the raw discriminator output on one tuple minus the average output on the opposite type of tuple.
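The two losses can be computed directly from batches of raw discriminator scores. The sketch below implements the RaLS formulas as written, with the relativistic average terms formed by subtracting the mean score of the opposite batch; the example score values are arbitrary.

```python
import numpy as np

def rals_losses(c_real, c_fake):
    """Relativistic average Least Squares losses.
    c_real, c_fake: raw discriminator scores on real/fake tuples (1-D arrays)."""
    c_real = np.asarray(c_real, dtype=float)
    c_fake = np.asarray(c_fake, dtype=float)
    rel_real = c_real - c_fake.mean()   # C~(X_real)
    rel_fake = c_fake - c_real.mean()   # C~(X_fake)
    # L_D pushes C~(X_real) toward +1 and C~(X_fake) toward -1; L_G reverses them.
    d_loss = np.mean((rel_real - 1) ** 2) + np.mean((rel_fake + 1) ** 2)
    g_loss = np.mean((rel_fake - 1) ** 2) + np.mean((rel_real + 1) ** 2)
    return d_loss, g_loss

# With real scores exactly one unit above fake scores on average,
# the discriminator loss reaches its minimum of zero.
d, g = rals_losses([0.5, 0.5], [-0.5, -0.5])
```

Because each term is a squared distance rather than a log-sigmoid, gradients do not saturate when the discriminator is confident, which is the stability property the RaLS choice targets.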
We further exploit the pre-trained VGG19 network as the feature extractor. We involve the content loss $L_{content}$ and the adversarial loss $L_{G}^{RaLS}$ in the overall perceptual loss.

Content Loss
$L_{content}=\frac{1}{WH}\sum_{q=1}^{W}\sum_{r=1}^{H}(\phi_{i,j}(I^{HR})_{q,r}-\phi_{i,j}(I^{SR})_{q,r})^{2}$
where $\phi_{i,j}$ denotes the feature map obtained from the $j$-th convolution before the $i$-th max-pooling layer of the VGG19 network, and $W$ and $H$ are the dimensions of the feature map.
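Numerically, the content loss is a mean squared error over the spatial positions of the two feature maps. In this sketch the "feature maps" are constant arrays used purely to illustrate the formula; in practice they would come from $\phi_{i,j}$ of a pre-trained VGG19.

```python
import numpy as np

def content_loss(feat_hr, feat_sr):
    """Squared feature differences summed over the W x H positions,
    divided by W*H, matching the content-loss formula."""
    w, h = feat_hr.shape[:2]
    return np.sum((feat_hr - feat_sr) ** 2) / (w * h)

# Stand-in feature maps: every position differs by exactly 1.0,
# so the loss is (W*H * 1.0) / (W*H) = 1.0.
hr_feat = np.ones((4, 4))
sr_feat = np.zeros((4, 4))
loss = content_loss(hr_feat, sr_feat)
```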
Perceptual Loss
The total loss function $L_{perceptual}$ is defined as the weighted sum of the content loss and the adversarial loss:
$L_{perceptual} = \lambda_{con}L_{content} + \lambda_{adv}L_{G}^{RaLS}$
where $\lambda_{con}$ and $\lambda_{adv}$ are the coefficients balancing the content loss and the adversarial loss, respectively.
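Combining the two terms is a simple weighted sum. The weight values below are placeholders, not the paper's settings (the excerpt does not give them); the 1e-3 adversarial weight is a common convention in GAN-based SR work.

```python
# Hypothetical trade-off weights; the paper's lambda values are not
# given in this excerpt.
LAMBDA_CON, LAMBDA_ADV = 1.0, 1e-3

def perceptual_loss(l_content, l_g_rals):
    """L_perceptual = lambda_con * L_content + lambda_adv * L_G^RaLS."""
    return LAMBDA_CON * l_content + LAMBDA_ADV * l_g_rals

total = perceptual_loss(0.5, 8.0)  # 1.0*0.5 + 1e-3*8.0 = 0.508
```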
In this section, we first present the details of dataset and training implementation. Then, we demonstrate the experiments and evaluation results. We further compare our method with state-of-the-art methods. Moreover, in order to prove the effectiveness of SEGA-FURN, we conduct ablation experiments to verify the contributions of the components proposed in this work. More experimental details and visual results are presented in the supplementary material.
We conducted experiments on the public large-scale CelebFaces Attributes dataset, CelebA, which consists of 202,599 face images of 10,177 celebrities. We randomly selected 162,048 HR face images as the training set, and used the remaining 40,511 images as the testing set.
To verify the effectiveness of our method, we conducted experiments with the upscaling factors 4× and 8×, respectively. We resized and cropped the images to 256×256 pixels as our HR face images without any alignment operation. To obtain the two groups of LR downsampled face images, we used bicubic interpolation with downsampling factors of 4× and 8×.
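The HR-to-LR pair generation can be sketched as below. Note this uses simple block averaging as a stand-in downsampler, whereas the paper uses bicubic interpolation (typically via a library such as Pillow or OpenCV, not reproduced here); the sketch only illustrates producing the two LR resolutions from a 256×256 HR crop.

```python
import numpy as np

def downsample(hr, factor):
    """Block-averaging downsampler (stand-in for bicubic interpolation):
    groups pixels into factor x factor blocks and averages each block."""
    h, w = hr.shape[:2]
    hr = hr[:h - h % factor, :w - w % factor]  # trim to a multiple of factor
    return hr.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

hr = np.random.rand(256, 256)
lr4 = downsample(hr, 4)   # 4x group: 64 x 64 LR input
lr8 = downsample(hr, 8)   # 8x group: 32 x 32 LR input
```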
We trained our network for 30k epochs using the Adam optimizer.
We compare the proposed method with the state-of-the-art SR methods.
Qualitative Comparison
The 4× and 8× qualitative results are depicted in Fig. 4. The top two rows show the 4× visual results. Bicubic interpolation produces over-smooth results. SRGAN improves on Bicubic interpolation, but it still fails to generate fine details, especially in facial components such as the eyes and mouth. ESRGAN also produces overly smoothed visual results and misses specific textures. In contrast, the 4× SR images produced by our method retain face-specific details and are faithful to their HR counterparts.
Figure 4. Qualitative comparison against state-of-the-art methods. The top two rows: the results of the 4× upscaling factor from 32×32 pixels to 256×256 pixels. The bottom two rows: the results of the 8× upscaling factor from 16×16 pixels to 256×256 pixels. From left to right: (a) HR images, (b) LR inputs, (c) Bicubic interpolation, (d) Results of SRGAN, (e) Results of ESRGAN, and (f) Our method.
To reveal the super-resolution ability of our proposed method, we further conduct experiments with the 8× ultra-upscaling factor. As shown in the bottom two rows of Fig. 4, the SR visual quality obtained by Bicubic interpolation, SRGAN, and ESRGAN decreases as the magnification increases, since the correspondence between HR and LR images becomes weaker. Bicubic interpolation produces unpleasant noise. SRGAN encounters the mode collapse problem during super-resolution and produces severe distortions in the SR images. ESRGAN produces SR images with broken textures and noticeable artifacts around facial components. In contrast, our method produces photo-realistic SR images that preserve perceptually sharper edges and fine facial textures.
Quantitative Comparison
Table 1. Quantitative comparison on the CelebA dataset for upscaling factors 4× and 8×, in terms of average PSNR (dB) and SSIM. Numbers in bold are the best evaluation results among state-of-the-art methods.
The quantitative results with upscaling factors 4× and 8× are shown in Table 1. Our method attains the best PSNR and SSIM among all methods: 30.14dB/0.87 for 4× and 25.21dB/0.73 for 8×. FaceAttr is the second-best method for 4× (29.78dB/0.82); however, it degrades dramatically when the upscaling factor increases to 8×, obtaining 21.82dB/0.62. In contrast, our proposed method ranks first for both upscaling factors, reflecting the robustness of the proposed SEGA-FURN. Moreover, our method not only boosts PSNR/SSIM by a large margin of 1.04dB/0.08 over the classic method URDGN at the 4× upscaling factor, but also outperforms URDGN, the second best, at 8×. This shows the stability of our method across upscaling factors. In addition, compared with SRGAN and ESRGAN, which also use a generative adversarial structure, our method improves SR image quality both perceptually and numerically.
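For reference, the PSNR figures in Table 1 follow the standard definition below; SSIM is typically computed with a library such as scikit-image rather than by hand. The example images here are synthetic arrays used only to exercise the formula.

```python
import numpy as np

def psnr(hr, sr, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB (higher is better):
    10 * log10(MAX^2 / MSE)."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 16 gray levels gives MSE = 256,
# i.e. 10*log10(255^2 / 256) ~ 24.05 dB.
hr = np.full((8, 8), 128.0)
sr = hr + 16.0
val = psnr(hr, sr)
```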
We further conducted ablation studies to investigate the contribution of each proposed component. As shown in Table 2, we list several variants based on different proposed components. First, RIDB-Net serves as the baseline variant, containing only the RIDB component. Second, RIDB-RaLS-Net is constructed by removing the Semantic Encoder (SE) from SEGA-FURN. Next, RIDB-SE-Net removes the RaLS loss from SEGA-FURN, and RIDB-SE-RaLS-Net is equivalent to SEGA-FURN, including all three components. In addition, we provide the visual results of these variants in Fig. 5 and the quantitative comparison in Table 3.
Figure 5. Qualitative comparison of ablation studies. The top two rows: the results of upscaling factor 4×. The bottom two rows: the results of upscaling factor 8×. From left to right: (a) HR images, (b) LR inputs, (c) Results of RIDB-Net, (d) Results of RIDB-RaLS-Net (e) Results of RIDB-SE-Net, and (f) Results of RIDB-SE-RaLS-Net (SEGA-FURN).
Table 2. Description of SEGA-FURN variants with different components in experiments.
Table 3. Quantitative comparison of different variants on the CelebA dataset for upscaling factors 4× and 8×.
Effect of RIDB
We compare the proposed RIDB with other feature extraction blocks: the Residual Block (RB) of SRGAN and the Residual in Residual Dense Block (RRDB) of ESRGAN. As shown in Fig. 4 and Table 1, the SR results generated by our generator with the RIDB outperform those of SRGAN (RB) and ESRGAN (RRDB) in both qualitative and quantitative comparisons. The reason is that the RIDB introduces a densely connected structure to combine features of different levels, whereas the RB has no dense connections. In addition, unlike the RRDB, the proposed RIDB applies multi-level residual learning within each internal dense block, which boosts the flow of features through the generator and provides hierarchical features for the super-resolution process. These observations validate the effectiveness of the proposed RIDB.
Effect of SE
Ablations (A) and (C), performed with RIDB-Net and RIDB-SE-Net, illustrate the advantage of the SE and verify the effectiveness of the joint discriminator. RIDB-SE-Net obtains embedded semantics extracted by the SE and feeds them, along with the image data, to the joint discriminator. During training, the embedded semantics provide useful semantic information for the joint discriminator, enhancing its discriminative ability. Compared with RIDB-Net, which employs neither the SE nor the joint discriminator, RIDB-SE-Net achieves significant quantitative improvements. Furthermore, as shown in Fig. 5, there is also a noticeable refinement in detailed texture. This verifies that the extracted embedded semantics benefit the SR results, and that the SE and the joint discriminator play a critical role in the proposed method.
Effect of RaLS
Ablations (A) and (B) demonstrate the effect of the RaLS loss. In RIDB-Net, we replace the RaLS loss with the generic GAN loss, Binary Cross Entropy (BCE), and keep all other components the same. As shown in Table 3, once the RaLS loss is removed, the quantitative results are lower than those of RIDB-RaLS-Net, which retains it. As expected, the BCE-based RIDB-Net shows unrefined textures, whereas the variant with RaLS produces perceptually pleasing results with more natural textures and edges. This demonstrates that the RaLS loss greatly improves super-resolution performance.
Final Effect
Comparing ablation (D) with the other studies, a large enhancement is noticeable when all three components are integrated. Finally, we refer to RIDB-SE-RaLS-Net as SEGA-FURN, the ultimate proposed method.
In this paper, we proposed a novel Semantic Encoder guided Generative Adversarial Face Ultra-Resolution Network (SEGA-FURN) to super-resolve a tiny unaligned LR face image to its HR version with multiple large ultra-upscaling factors (e.g., 4× and 8×). Owing to the proposed Semantic Encoder, the Residual in Internal Dense Block, and the Joint Discriminator with the RaLS loss, our method successfully produces photo-realistic SR face images. Extensive experiments and analysis demonstrated that SEGA-FURN is superior to the state-of-the-art methods.