1 Introduction

Increasing attention is devoted to the detection of small faces with an image resolution as low as 10 pixels of height [1]. Meanwhile, facial analysis techniques, such as face alignment [2, 3] and verification [4, 5], have seen rapid progress. However, the performance of most existing techniques degrades when given a low-resolution facial image, because the input naturally carries less information, and images corrupted by down-sampling and blur interfere with the facial analysis procedure. Face hallucination [6–13], a task that super-resolves facial images, provides a viable means for improving low-res face processing and analysis, e.g. person identification in surveillance videos and facial image enhancement.

Fig. 1. (a) The original high-res image. (b) The low-res input with a size of 5pxIOD. (c) The result of bicubic interpolation. (d) An overview of the proposed face hallucination framework. The solid arrows indicate the hallucination step, which hallucinates the face with spatial cues, i.e. the dense correspondence field. The dashed arrows indicate the spatial prediction step, which estimates the dense correspondence field.

A prior on the face structure, or face spatial configuration, is pivotal for face hallucination [6, 7, 12]. The availability of such a prior distinguishes face hallucination from the general image super-resolution problem [14–21], which lacks a global prior to facilitate the inference. In this study, we extend the notion of prior to a pixel-wise dense face correspondence field. We observe that an informative prior provides strong semantic guidance that enables face hallucination even from a very low resolution. Here a dense correspondence field is necessary for describing the spatial configuration, because it is pixel-wise (unlike facial landmarks) and establishes correspondence (unlike face parsing). The importance of the dense field is reflected in Sect. 3.2. An example is shown in Fig. 1 – even when an eye is visible in only a few pixels of a low-res image, one can still recover its qualitative details by inferring from the global face structure.

Nevertheless, obtaining an accurate high-res pixel-wise correspondence field is non-trivial given only the low-res input. First, the definition of the high-res dense field is by itself ill-posed, because the gray-scale value of each pixel is spread over adjacent pixels in the interpolated image (Fig. 1(c)). Second, the blur causes difficulties for many existing face alignment or parsing algorithms [3, 22–24], because most of them rely on sharp edge information. Consequently, we face a chicken-and-egg problem: face hallucination is better guided by the face spatial configuration, while the latter requires a high-resolution face. This issue, however, has been mostly ignored or bypassed in previous works (Sect. 2).

In this study, we propose to address the aforementioned problem with a novel task-alternating cascaded framework, as shown in Fig. 1(d). The two tasks at hand, high-level face correspondence estimation and low-level face hallucination, are complementary and can be alternately refined under the guidance of each other. Specifically, motivated by the fact that both tasks are typically performed in a cascaded manner [15, 23, 25], they can be naturally and seamlessly integrated into an alternating refinement process. During the cascade iterations, the dense correspondence field is progressively refined as the face resolution increases, while the image resolution is adaptively upscaled under the guidance of the finer dense correspondence field.

To better recover different levels of texture detail on faces, we propose a new gated deep bi-network architecture for the face hallucination step in each cascade. Deep convolutional neural networks have demonstrated state-of-the-art results for image super-resolution [14, 15, 17, 18]. In contrast to the aforementioned studies, the proposed network consists of two functionality-specialized branches, which are trained end-to-end. The first branch, referred to as the common branch, conservatively recovers texture details that are detectable from the low-res input alone, as in general super-resolution. The other branch, referred to as the high-frequency branch, super-resolves faces with an additional high-frequency prior warped by the face correspondence field estimated in the current cascade. Thanks to the guidance of the prior, this branch is capable of recovering and synthesizing texture details that are not revealed in the overly low-res input image. A pixel-wise gate network is learned to fuse the results from the two branches. Figure 2 demonstrates the properties of the gated deep bi-network. As can be observed, the two branches are complementary. Although the high-frequency branch synthesizes facial parts that are occluded (the eyes behind sunglasses), the gate network automatically favours the results from the common branch during fusion.

Fig. 2. Examples visualizing the effects of the proposed gated deep bi-network. (a) The bicubic interpolation of the input. (b) Results with only the common branches enabled. (c) Results with only the high-frequency branches enabled. (d) Results of the proposed CBN with both branches enabled. (e) The original high-res image. Best viewed by zooming in on the electronic version.

We refer to the proposed framework as the Cascaded Bi-Network (CBN) hereafter. We summarize our contributions as follows:

1. While both face hallucination and dense face correspondence estimation are difficult on low-res images, we circumvent this chicken-and-egg problem through a novel task-alternating cascade framework. In comparison to existing approaches, this framework has the appealing property of not assuming pre-aligned inputs or the availability of any spatial information (e.g. landmarks, parsing maps).

2. We propose a gated deep bi-network that can effectively exploit the face spatial prior to recover and synthesize texture details that are not even explicitly present in the low-resolution input.

3. We provide extensive results and discussions to demonstrate and analyze the effectiveness of the proposed approach.

We perform extensive experiments against general super-resolution and face hallucination approaches on various benchmarks. Our method not only achieves a high Peak Signal-to-Noise Ratio (PSNR), but also superior perceptual quality. Demo code will be available on our project page http://mmlab.ie.cuhk.edu.hk/projects/CBN.html.

2 Related Work

Face hallucination and spatial cues. There is a rich literature on face hallucination [6–13]. Spatial cues have proven essential in most previous works and are utilized in various forms. For example, Liu et al. [7, 12] and Jin et al. [6] devised a warping function to connect the local face reconstruction with the high-res faces in the training set. However, a low-res correspondence field (see footnote 1) may not be sufficient for aiding the high-res face reconstruction process, while obtaining the high-res correspondence field is ill-posed given only a low-res face. Yang et al. [8] assumed that facial landmarks can be accurately estimated from the low-res face image. This does not hold when the low-res face is rather small (e.g. 5pxIOD), since the gray-scale values are severely spread over adjacent pixels (Fig. 1(c)). Wang et al. [10] and Kolouri et al. [9] only aligned the input low-res faces with a single similarity transform (e.g. the same scaling and rotation); hence these approaches can only handle canonical-view low-res faces. Zhou et al. [26] pointed out the difficulty of predicting the spatial configuration from a low-res input and did not take any spatial cues into account for hallucination. In contrast to all the aforementioned approaches, we adaptively and alternately estimate the dense correspondence field and hallucinate the faces in a cascaded framework, so that the two mutually dependent tasks aid and refine each other.

Cascaded prediction. The cascaded framework has proven effective for both image super-resolution (SR) [15, 25] and facial landmark detection [2, 3, 22, 23, 27–30]. For image SR, Wang et al. [15] showed that two rounds of \(2\times \) upscaling are better than a single round of \(4\times \) upscaling in their framework. For facial landmark detection, the cascaded regression framework has revolutionized the accuracy and has been extended to other areas [31]. The key to the success of cascaded regression lies in the coarse-to-fine nature of its residual prediction. As pointed out by Zhang et al. [28], this coarse-to-fine nature can be better achieved by increasing the facial resolution across the cascades. To our knowledge, no existing work has integrated these two related tasks into a unified framework.

The bi-network architecture. The bi-network architecture [32–34] has been explored in various forms, such as bilinear networks [35, 36] and the two-stream convolutional network [37]. In [35], the two factors, namely object identification and localization, are modeled by the two branches respectively. This differs from our model, where the two factors, the low-res face and the prior, are jointly modeled in one branch (the high-frequency branch), while the other branch (the common branch) models only the low-res face. In addition, the two branches in our model are joined via the gate network, rather than by the outer product as in [35]. In [37], both spatial and temporal information are modeled by the network, whereas our model incorporates no temporal information. Our architecture also differs from [26]: in [26], the output is a weighted average, with a scalar weight, between the result of one branch and the low-res input, and neither of the two branches utilizes any spatial cues or prior.

3 Cascaded Bi-Network (CBN)

3.1 Overview

Problem and notation. Given a low-resolution input facial image, our goal is to predict its high-resolution counterpart. We introduce the two main entities involved in our framework:

The facial image is denoted as a matrix \(\mathbf {I}\). We use \(\mathbf {x} \in \mathbb {R}^2\) to denote the (x, y) coordinates of a pixel on \(\mathbf {I}\).

The dense face correspondence field defines a pixel-wise correspondence mapping from \(M \subset \mathbb {R}^2\) (the 2D face region in the mean face template) to the face region in image \(\mathbf {I}\). We represent the dense field with a warping function [38], \(\mathbf {x} = W(\mathbf {z}): M \rightarrow \mathbb {R}^2\), which maps the coordinates \(\mathbf {z} \in M\) from the mean shape template domain to the target coordinates \(\mathbf {x} \in \mathbb {R}^2\). See Fig. 3(a,b) for a clear illustration. Following [39], we model the warping residual \(W(\mathbf {z}) - \mathbf {z}\) as a linear combination of the dense facial deformation bases, i.e.

$$\begin{aligned} W(\mathbf {z}) = \mathbf {z} + \mathbf {B}(\mathbf {z}) \mathbf {p} \end{aligned}$$
(1)

where \(\mathbf {p} = [p_1 \dots p_N]^\top \in \mathbb {R}^{N \times 1}\) denotes the deformation coefficients and \(\mathbf {B}(\mathbf {z}) = [\mathbf {b}_1(\mathbf {z}) \dots \mathbf {b}_N(\mathbf {z})] \in \mathbb {R}^{2 \times N}\) denotes the deformation bases. The N bases are chosen in the AAM manner [40]: 4 of the N bases correspond to the similarity transform and the remaining ones to non-rigid deformations. Note that the bases are pre-defined and shared by all samples; hence the dense field is controlled by the per-sample deformation coefficients \(\mathbf {p}\). When \(\mathbf {p} = \mathbf {0}\), the dense field equals the mean face template.
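To make the warping concrete, the following NumPy sketch evaluates Eq. 1 for all template pixels at once. The shapes, the value of N and the helper name `warp` are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

N_BASES = 20  # N; 4 similarity bases + non-rigid bases (value assumed)

def warp(z, B, p):
    """Evaluate W(z) = z + B(z) p (Eq. 1) for all template pixels.

    z : (M, 2) coordinates on the mean face template
    B : (M, 2, N) pre-defined deformation bases B(z), one 2xN block per pixel
    p : (N,) deformation coefficients, shared by all pixels of a sample
    """
    return z + B @ p  # broadcasting: (M, 2, N) @ (N,) -> (M, 2)

# With p = 0 the field degenerates to the mean face template:
z = np.random.rand(100, 2)
B = np.random.randn(100, 2, N_BASES)
assert np.allclose(warp(z, B, np.zeros(N_BASES)), z)
```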

We use the hat notation (\(~\hat{ }~\)) to represent ground-truth in the learning step. For example, we denote the high-resolution training image as \(\hat{\mathbf {I}}\).

Framework overview. We propose a principled framework that alternately refines the face resolution and the dense correspondence field. Our framework consists of K iterations (Fig. 1(d)). Each iteration updates the prediction via

$$\begin{aligned}&\mathbf {p}_k = \mathbf {p}_{k-1} + f_k(\mathbf {I}_{k-1}; ~\mathbf {p}_{k-1}); W_k(\mathbf {z}) = \mathbf {z} + \mathbf {B}_k(\mathbf {z}) \mathbf {p}_k; \end{aligned}$$
(2)
$$\begin{aligned}&\mathbf {I}_k = {\uparrow } \mathbf {I}_{k-1} + g_k({\uparrow } \mathbf {I}_{k-1}; ~W_k(\mathbf {z})); ~~~~~~~~~~~ (\forall \mathbf {z} \in M_k), \end{aligned}$$
(3)

where k iterates from 1 to K. Here, Eq. 2 represents the dense field updating step, while Eq. 3 stands for the spatially guided face hallucination step in each cascade. ‘\({\uparrow }\)’ denotes the upscaling process (\(2\times \) upscaling with bicubic interpolation in our implementation). All the notations are now appended with the index k to indicate the iteration. A larger k in the notation of \(\mathbf {I}_k\), \(W_k\), \(\mathbf {B}_k\) and \(M_k\) (see footnote 2) indicates a higher resolution, and the same k indicates the same resolution. The framework starts from \(\mathbf {I}_0\) and \(\mathbf {p}_0\), where \(\mathbf {I}_0\) denotes the input low-res facial image and \(\mathbf {p}_0\) is a zero vector representing the deformation coefficients of the mean face template. The final hallucinated facial image output is \(\mathbf {I}_K\).
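The following Python sketch summarizes the resulting inference loop (Eqs. 2–3). The callables `f`, `g`, `upscale2x` and `build_field` stand in for the learned dense-field regressor, the gated bi-network, bicubic upscaling and Eq. 1 respectively; they are placeholders for illustration, not the released implementation.

```python
import numpy as np

def hallucinate(I0, f, g, B, upscale2x, build_field, K=4, N=20):
    """One forward pass of the alternating cascade (Eqs. 2-3).

    f[k](I, p)          -> coefficient update (dense-field regressor, Eq. 2)
    g[k](I_up, W)       -> hallucinated residual (gated bi-network, Eq. 3)
    upscale2x(I)        -> bicubic 2x upscaling
    build_field(B_k, p) -> warping function W_k(z) = z + B_k(z) p (Eq. 1)
    """
    I, p = I0, np.zeros(N)             # p_0 = 0: start from the mean template
    for k in range(1, K + 1):
        p = p + f[k](I, p)             # Eq. 2: refine deformation coefficients
        W = build_field(B[k], p)       # finer dense correspondence field
        I_up = upscale2x(I)
        I = I_up + g[k](I_up, W)       # Eq. 3: spatially guided hallucination
    return I                           # I_K: the final hallucinated face
```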

Fig. 3. (a, b) Illustration of the mean face template M and the facial image \(\mathbf {I}\). The grid denotes the dense correspondence field \(W(\mathbf {z})\). The warping from \(\mathbf {z}\) to \(\mathbf {x}\) is determined by the warping function \(W(\mathbf {z})\). (c, d) Illustration of the high-frequency prior \(\mathbf {E}\) and the prior after warping, \(\mathbf {E}^W\), for the sample image in (b). Note that both \(\mathbf {E}\) and \(\mathbf {E}^W\) have C channels, each containing one ‘contour line’. For visualization, in this figure we reduce the channel dimension to one channel with a \(\max \) operation. We leave out all indices k for clarity. Best viewed in the electronic version.

Fig. 4. Architecture of the proposed gated deep bi-network (for the k-th cascade). It consists of a common branch (blue), a high-frequency branch (red) and the gate network (cyan). (Color figure online)

Model, inference and learning. Our model is composed of the functions \(f_k\) (dense field estimation) and \(g_k\) (face hallucination with spatial cues). The deformation bases \(\mathbf {B}_k\) are pre-defined for each cascade and fixed during the whole training and testing procedure. During testing, we repeatedly update the image \(\mathbf {I}_k\) and the dense correspondence field \(W_k(\mathbf {z})\) (essentially the coefficients \(\mathbf {p}_k\)) with Eqs. 2–3. The learning procedure works similarly to the inference, but additionally incorporates the learning of the two functions: \(g_k\) for hallucination and \(f_k\) for predicting the dense field coefficients. We present their learning procedures in Sects. 3.2 and 3.3 respectively.

3.2 \(g_k\) - Gated Deep Bi-Network: Face Hallucination with Spatial Cues

We propose a gated deep bi-network architecture for face hallucination with guidance from spatial cues. We train one gated bi-network for each cascade. For the k-th iteration, we take in the input image \({\uparrow } \mathbf {I}_{k-1}\) and the currently estimated dense correspondence field \(W_k(\mathbf {z})\) to predict the image residual \(\mathbf {G} = \mathbf {I}_k-{\uparrow }\mathbf {I}_{k-1}\).

As the name indicates, our gated bi-network contains two branches. In contrast to [35], where the two branches are joined with an outer product, we combine the two branches with a gate network. More precisely, denoting the outputs from the common branch (A) and the high-frequency branch (B) as \(\mathbf {G}_A\) and \(\mathbf {G}_B\) respectively, we combine them with

$$\begin{aligned} g_k({\uparrow } \mathbf {I}_{k-1}; ~W_k(\mathbf {z})) = \mathbf {G} = (\mathbf {1} - \mathbf {G}_\lambda ) \otimes \mathbf {G}_A + \mathbf {G}_\lambda \otimes \mathbf {G}_B, \end{aligned}$$
(4)

where \(\mathbf {G}\) denotes our predicted image residual \(\mathbf {I}_k - {\uparrow } \mathbf {I}_{k-1}\) (i.e. the result of \(g_k\)), and \(\mathbf {G}_\lambda \) denotes the pixel-wise soft gate map that controls the combination of the two outputs \(\mathbf {G}_A\) and \(\mathbf {G}_B\). We use \(\otimes \) to denote element-wise multiplication.

Figure 4 provides an overview of the gated bi-network architecture. Three convolutional sub-networks are designed to predict \(\mathbf {G}_A\), \(\mathbf {G}_B\) and \(\mathbf {G}_\lambda \) respectively. The common branch sub-network (blue in Fig. 4) takes in only the interpolated low-res image \({\uparrow } \mathbf {I}_{k-1}\) to predict \(\mathbf {G}_A\), while the high-frequency branch sub-network (red in Fig. 4) takes in both \({\uparrow } \mathbf {I}_{k-1}\) and the warped high-frequency prior \(\mathbf {E}^{W_k}\) (warped according to the estimated dense correspondence field). All the inputs (\({\uparrow } \mathbf {I}_{k-1}\) and \(\mathbf {E}^{W_k}\)) as well as \(\mathbf {G}_A\) and \(\mathbf {G}_B\) are fed into the gate sub-network (cyan in Fig. 4) to predict \(\mathbf {G}_\lambda \) and the final high-res output \(\mathbf {G}\).
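The PyTorch sketch below shows how the three sub-networks could be wired together and fused via Eq. 4. Single convolution layers stand in for the actual (deeper) branches, so this is a structural sketch under our own assumptions, not the authors' network definition.

```python
import torch
import torch.nn as nn

class GatedBiNetwork(nn.Module):
    """Structural sketch of one cascade's bi-network (Fig. 4, Eq. 4)."""

    def __init__(self, C=10):                              # C prior channels
        super().__init__()
        self.common = nn.Conv2d(1, 1, 3, padding=1)        # predicts G_A
        self.highfreq = nn.Conv2d(1 + C, 1, 3, padding=1)  # predicts G_B
        self.gate = nn.Conv2d(C + 3, 1, 3, padding=1)      # predicts G_lambda

    def forward(self, I_up, E_w):
        # I_up: interpolated low-res image, (B, 1, H, W)
        # E_w : warped high-frequency prior, (B, C, H, W)
        G_A = self.common(I_up)
        G_B = self.highfreq(torch.cat([I_up, E_w], dim=1))
        gate_in = torch.cat([I_up, E_w, G_A, G_B], dim=1)
        G_lam = torch.sigmoid(self.gate(gate_in))          # pixel-wise soft gate
        return (1 - G_lam) * G_A + G_lam * G_B             # Eq. 4: residual G

G = GatedBiNetwork()(torch.randn(2, 1, 64, 64), torch.randn(2, 10, 64, 64))
```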

We now introduce the high-frequency prior and the training procedure of the proposed gated bi-network.

High-frequency prior. We define the high-frequency prior as an indication of locations with high-frequency details. In this work, we generate high-frequency prior maps to provide spatial guidance for hallucination. The prior maps are defined in the mean face template domain. More precisely, for each training image, we compute the residual image between the original image \(\hat{\mathbf {I}}\) and the bicubic interpolation of \(\mathbf {I}_{0}\), and then warp the residual map into the mean face template domain. We average the magnitude of the warped residual maps over all training images to form the preliminary high-frequency map. To suppress noise and provide a semantically meaningful prior, we cluster the preliminary high-frequency map into C continuous contours (C = 10 in our implementation) and form a C-channel map, with each channel carrying one contour. We refer to this C-channel map as the high-frequency prior and denote it as \(E_k(\mathbf {z}): M_k \rightarrow \mathbb {R}^C\). We use \(\mathbf {E}_k\) to represent \(E_k(\mathbf {z})\) for all \(\mathbf {z} \in M_k\). An illustration of the prior is shown in Fig. 3(c).
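A minimal sketch of this construction is given below, assuming grayscale NumPy images. `warp_to_template` is a hypothetical helper, and the quantile-band split is a simple stand-in for the paper's contour clustering.

```python
import numpy as np

def build_prior(hi_res_imgs, bicubic_imgs, warp_to_template, C=10):
    """Average warped residual magnitudes, then split into C contour channels."""
    acc = None
    for hi, lo_up in zip(hi_res_imgs, bicubic_imgs):
        r = warp_to_template(np.abs(hi - lo_up))   # residual in template domain
        acc = r if acc is None else acc + r
    mean_map = acc / len(hi_res_imgs)              # preliminary high-frequency map
    # Split into C intensity bands, one per channel (stand-in for the
    # contour clustering used in the paper).
    edges = np.quantile(mean_map, np.linspace(0.0, 1.0, C + 1))
    band = np.digitize(mean_map, edges[1:-1])      # band index 0 .. C-1
    prior = np.stack([np.where(band == c, mean_map, 0.0) for c in range(C)],
                     axis=-1)
    return prior                                   # (H, W, C): the prior E
```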

Learning the gated bi-network. We train the three convolutional sub-networks to predict \(\mathbf {G}_A\), \(\mathbf {G}_B\) and \(\mathbf {G}_\lambda \) in our unified bi-network architecture. Each part of the network has a distinct training loss. For training the common branch, we use the following loss over all training samples

$$\begin{aligned} L_A = \Vert \hat{\mathbf {I}}_k - {\uparrow } \mathbf {I}_{k-1} - \mathbf {G}_A\Vert _F^2. \end{aligned}$$
(5)

The high-frequency branch has two inputs, \({\uparrow } \mathbf {I}_{k-1}\) and the warped high-frequency prior \(\mathbf {E}^{W_k}\) (see Fig. 3(d) for an illustration), from which it predicts the output \(\mathbf {G}_B\). The two inputs are concatenated along the channel dimension to form a \((1+C)\)-channel input. We use the following loss over all training samples

$$\begin{aligned} L_B = \sum _{c=1}^C \Vert (\mathbf {E}^{W_k})_c \otimes (\hat{\mathbf {I}}_k - {\uparrow } \mathbf {I}_{k-1} - \mathbf {G}_B)\Vert _F^2, \end{aligned}$$
(6)

where \((\mathbf {E}^{W_k})_c\) denotes the c-th channel of the warped high-frequency prior maps. Compared to the common branch, we additionally utilize the prior knowledge as input and penalize only over the high-frequency area. Learning to predict the gate map \(\mathbf {G}_\lambda \) is supervised by the final loss

$$\begin{aligned} L = \Vert \hat{\mathbf {I}}_k - {\uparrow } \mathbf {I}_{k-1} - \mathbf {G}\Vert _F^2. \end{aligned}$$
(7)

We train the proposed gated bi-network in three steps. Step i: we enable only the supervision from \(L_A\) (Eq. 5) to pre-train the common branch. Step ii: we enable only \(L_B\) (Eq. 6) to pre-train the high-frequency branch. Step iii: we fine-tune the whole gated bi-network with the supervision from L (Eq. 7). In the last step, we set the learning rate of the parameters related to the gate map to 10 times that of the parameters in the two branches. Note that the whole bi-network remains trainable with back-propagation in this last step.
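Assuming PyTorch tensors shaped as in the sketch above, the three losses (Eqs. 5–7) can be written as follows; the function names are ours.

```python
def loss_common(I_hat, I_up, G_A):           # Eq. 5: squared Frobenius norm
    return ((I_hat - I_up - G_A) ** 2).sum()

def loss_highfreq(I_hat, I_up, G_B, E_w):    # Eq. 6: penalize only where the
    resid = I_hat - I_up - G_B               # warped prior is active; E_w has
    return ((E_w * resid) ** 2).sum()        # C channels, resid broadcasts

def loss_final(I_hat, I_up, G):              # Eq. 7: supervises the fused output
    return ((I_hat - I_up - G) ** 2).sum()

# Training schedule (Sect. 3.2): (i) pre-train the common branch with
# loss_common; (ii) pre-train the high-frequency branch with loss_highfreq;
# (iii) fine-tune everything with loss_final, gate parameters at 10x the
# learning rate of the branch parameters.
```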

3.3 \(f_k\) - Dense Field Deformation Coefficients Prediction

We apply a simple yet effective strategy to update the correspondence field coefficient estimation (\(f_k\)). Observing that predicting a sparse set of facial landmarks is more robust and accurate at low resolution, we transfer the facial landmark deformation coefficients to the dense correspondence field. More precisely, we simultaneously obtain two sets of N deformation bases: \(\mathbf {B}_k(\mathbf {z}) \in \mathbb {R}^{2 \times N}\) for the dense field, and \(\mathbf {S}_k(l) \in \mathbb {R}^{2 \times N}\) for the landmarks, where l is the landmark index. The bases for the dense field and the landmarks are one-to-one related, i.e. both \(\mathbf {B}_k(\mathbf {z})\) and \(\mathbf {S}_k(l)\) share the same deformation coefficients \(\mathbf {p}_k \in \mathbb {R}^N\):

$$\begin{aligned} W_k(\mathbf {z}) = \mathbf {z} + \mathbf {B}_k(\mathbf {z}) \mathbf {p}_k; ~ \mathbf {x}_k(l) = \bar{\mathbf {x}}_k(l) + \mathbf {S}_k(l) \mathbf {p}_k, \end{aligned}$$
(8)

where \(\mathbf {x}_k(l) \in \mathbb {R}^2\) denotes the coordinates of the l-th landmark, and \(\bar{\mathbf {x}}_k(l)\) denotes its mean location.

To predict the deformation coefficients \(\mathbf {p}_k\) in each cascade k, we utilize the powerful cascaded regression approach [23]. A Gauss-Newton steepest descent regression matrix \(\mathbf {R}_k\) is learned in each iteration k to map the observed appearance to the deformation coefficient update:

$$\begin{aligned} \mathbf {p}_k = \mathbf {p}_{k-1} + f_k(\mathbf {I}_{k-1}; \mathbf {p}_{k-1}) = \mathbf {p}_{k-1} + \mathbf {R}_k (\phi (\mathbf {I}_{k-1}; \mathbf {x}_{k-1}(l)|_{l=1,...,L}) - \bar{\phi }), \end{aligned}$$
(9)

where \(\phi \) is the shape-indexed feature [2, 27] that concatenates the local appearance from all L landmarks, and \(\bar{\phi }\) is its average over all the training samples.

To learn the Gauss-Newton steepest descent regression matrix \(\mathbf {R}_k\), we follow [23] to learn the Jacobian \(\mathbf {J}_k\) and then obtain \(\mathbf {R}_k\) via the project-out Hessian: \(\mathbf {R}_k = (\mathbf {J}_k^\top \mathbf {J}_k)^{-1} \mathbf {J}_k^\top \). We refer readers to [23] for more details.
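A NumPy sketch of the regressor construction and the coefficient update (Eq. 9) follows. Here `phi_obs` and `phi_mean` denote the shape-indexed feature of the current sample and its training-set average, and the Jacobian `J` is assumed to be already learned.

```python
import numpy as np

def learn_regressor(J):
    """R_k = (J^T J)^{-1} J^T from the learned Jacobian J, shape (F, N)."""
    return np.linalg.solve(J.T @ J, J.T)   # (N, F); avoids an explicit inverse

def update_coeffs(p, R, phi_obs, phi_mean):
    """Eq. 9: p_k = p_{k-1} + R_k (phi - phi_bar)."""
    return p + R @ (phi_obs - phi_mean)
```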

It is worth mentioning that the face flow method [39], which applies a landmark-regularized Lucas-Kanade variational minimization [38], is also a good alternative for our problem. Since we have obtained satisfactory results with the deformation coefficient transfer strategy introduced above, which is purely discriminative and much faster than face flow (8 ms per cascade in our approach vs. 1.4 s for face flow), we use the coefficient transfer approach in our experiments.

4 Experiments

Datasets. Following [6, 8], we choose the following datasets that contain both in-the-wild and lab-constrained faces with various poses and illuminations.

1. MultiPIE [41] was originally proposed for face recognition. More than 750,000 faces from 337 identities were collected in a lab-constrained environment. We use the same 351 images as in [8] for evaluation.

2. BioID [42] contains 1521 faces, also collected in constrained settings. We use the same 100 faces as in [6] for evaluation.

3. PubFig [43] contains 42461 faces (the evaluation subset) from 140 identities, originally collected for evaluating face verification and later used for evaluating face hallucination [8]. The faces are collected from the web and hence in-the-wild. Due to invalid URLs, we use a total of 20991 faces for evaluation. Further, following [6], we use PubFig83 [44], a subset of PubFig with 13838 images, to experiment with inputs blurred by an unknown Gaussian kernel. Similar to [6], we test with the same 100-image subset of PubFig83.

4. Helen [45] contains 2330 in-the-wild faces with high resolution. The mean face size is as large as 275pxIOD. We evaluate with the 330-image test set.

Metric. We follow existing studies [6, 8, 12, 14, 15] and adopt PSNR (dB), evaluated only on the luminance channel of the facial region. The definition of the facial region is the same as in [6]. As in [6], SSIM is not reported for in-the-wild faces due to their irregular facial shape.
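For reference, a minimal sketch of the metric, assuming 8-bit luminance images and a boolean mask of the facial region:

```python
import numpy as np

def psnr_luminance(y_pred, y_true, face_mask):
    """PSNR (dB) over the luminance channel, restricted to the facial region."""
    diff = (y_pred[face_mask].astype(np.float64)
            - y_true[face_mask].astype(np.float64))
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```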

Implementation details. Our framework consists of \(K=4\) cascades, and each cascade has its own learned network parameters and Gauss-Newton steepest descent regression matrix. During training, our model requires two parts of training data: one for training the cascaded dense face correspondence field, and the other for training the cascaded gated bi-networks for hallucination. The model is trained by iterating between these two parts of the training data. For the former, we use the training set from 300-W [46] (the same 2811 images used in [23]) for estimating the deformation coefficients and the BU4D dataset [47, 48] for obtaining the dense face correspondence bases (following [39]). For the latter, as no manual labeling is required, we leverage the existing large face database CelebA [49] for training the gated bi-networks.

4.1 Comparison with State-of-the-Art Methods

We compare our approach with two types of methods: (I) general super-resolution (SR) approaches and (II) face hallucination approaches. For SR methods, we compare with recent state-of-the-art approaches [14, 15, 19, 50] based on the originally released code. For face hallucination methods, we report the results of [6, 12, 51] by directly referring to the literature [6], and compare with [8, 52] by following the implementation of [8]. We re-transform the input face to the canonical view if a method assumes aligned inputs; hence, such methods enjoy an extra advantage in the comparison. If a method requires exemplars, we feed in the same in-the-wild samples as in our training set. We observe that such in-the-wild exemplars improve the exemplar-based baselines compared to their original implementations. Code for [7] is not publicly available; similar to [6], we provide a qualitative comparison with [7].

We conduct the comparison in two settings: 1. the input is the down-sampled version of the original high-res image, as in the evaluation of many previous SR methods [7, 14, 15, 19, 50] (referred to as the conventional SR setting, Sect. 4.1); 2. the input is additionally blurred with an unknown Gaussian kernel before down-sampling, as in [6, 8, 12] (referred to as the Gaussian-blurred setting, Sect. 4.1).

The Conventional SR Evaluation Setting. We experiment with two scenarios based on two different types of input face size configuration:

1. Fixed up-scaling factors – The input image is generated by resizing the original image with a fixed factor. For MultiPIE, following [8], we choose the fixed factor to be 4. For the in-the-wild datasets (PubFig and Helen), we evaluate scaling factors of 2, 3 and 4 as in [14, 15, 19, 50] (denoted as \(2\times , 3\times , 4\times \) respectively in Table 1). In this case, different inputs may have different face sizes. The proposed CBN is flexible enough to handle this scenario, whereas other existing face hallucination approaches [8, 12, 51, 52] cannot handle varying input face sizes, so their results in this scenario are omitted.

2. Fixed input face sizes – Similar to the face hallucination setting, the input image is generated by resizing the original image so that the input face size is fixed (e.g. 5 or 8 pxIOD, denoted as 5/8px in Table 1). Hence, the required up-scaling factor differs for each input. Among the baselines, [15] can naturally handle any up-scaling factor; for the other approaches, we train a set of models for different up-scaling factors and, during testing, pick the most suitable model based on the required up-scaling factor.

We point out that the latter scenario is more challenging and more appropriate for evaluating a face hallucination algorithm, because recovering the details of a face as small as 5/8pxIOD is closer to real low-res face processing applications. In the former scenario, the input face is not small enough (as revealed by the bicubic PSNR in Table 1), so the task is closer to facial image enhancement than to the challenging face hallucination task.

Table 1. Results under the conventional SR setting (for Sect. 4.1). Numbers in parentheses indicate SSIM and the remaining numbers represent PSNR (dB). The first part of the results is from Scenario 1, where each method super-resolves by a fixed factor (2\(\times \), 3\(\times \) or 4\(\times \)), while the latter part is from Scenario 2, where each method begins from the same face size (5 or 8 pxIOD, i.e. an inter-ocular distance of 5 or 8 pixels). The omitted results (-) are due to the methods' incapability of handling varying input face sizes.
Table 2. Results under the Gaussian-blur setting (for Sect. 4.1). Numbers in parentheses indicate SSIM and the remaining numbers represent PSNR (dB). Settings adhere to [6]. For a fair comparison, we feed in the same number of in-the-wild exemplars from CelebA when evaluating [8], instead of the MultiPIE exemplars originally used in the released code.
Table 3. PSNR results (dB) of in-house comparison of the proposed CBN (for Sect. 4.2).

We report the results in Table 1 and provide qualitative results in Fig. 5. As can be seen, the proposed CBN outperforms all general SR and face hallucination methods in both scenarios. The improvement is especially significant in the latter scenario, because the incorporated face prior is more critical when hallucinating faces from very low resolution. We observe that the general SR algorithms do not obtain satisfying results because they devote their full effort to recovering only the detectable high-frequency details, which inevitably contain noise.

In contrast, our approach recovers the details according to the high-frequency prior as well as the estimated dense correspondence field, thus achieving better performance. The existing face hallucination approaches do not perform well either: in comparison to evaluations under constrained or canonical-view conditions (e.g. [8]), we find that these algorithms are more likely to fail in the in-the-wild setting with substantial shape deformation and appearance variation.

Fig. 5. Qualitative results from PubFig/HELEN with input size 5pxIOD (for Sect. 4.1; see Table 1 for detailed results). Best viewed by zooming in on the electronic version.

Fig. 6. Qualitative results from the PubFig83 dataset (for Sect. 4.1; see Table 2 for detailed results). The six test samples presented are chosen by strictly following [6].

Fig. 7. Qualitative results for real surveillance videos (for Sect. 4.1). The test samples are directly imported from [6]. Best viewed by zooming in on the electronic version.

The Gaussian-Blur Evaluation Setting. It is also important to explore the capability of handling blurred input images [53]. Our method demonstrates a certain degree of robustness to unknown Gaussian blur. Specifically, in this section, we still adopt the same model as in Sect. 4.1, with no extra effort spent in training to specifically cope with blurring. To compare with [6], we add Gaussian blur to the input facial image in the same way as [6]. The experimental settings are precisely the same as in [6]: the input faces have the same size (around 8pxIOD); the up-scaling factor is set to 4; and the \(\sigma \) of the Gaussian blur kernel is set to 1.6 for PubFig83 and 2.4 for BioID. Additional Gaussian noise with \(\eta =2\) is added for BioID. We note that our approach uses only a single frame for inference, unlike the multiple frames in [6].

We summarize the results in Table 2, with qualitative results shown in Fig. 6. The results show that CBN again significantly outperforms all the compared approaches. We attribute the robustness to unknown Gaussian blur to the spatial guidance provided by the face high-frequency prior.

Taking advantage of this robustness, we further test the proposed algorithm on faces from real surveillance videos. In Fig. 7, we compare our results with [6, 15]. Note that the presented test cases are directly imported from [6]. Again, our results demonstrate the most appealing visual quality among existing state-of-the-art approaches, suggesting the potential of the proposed framework in real-world applications.

Run Time. The major time cost of our approach lies in the forward passes of the gated deep bi-networks. On a single core of an i7-4790 CPU, the face hallucination steps for the four cascades (from 5pxIOD to 80pxIOD) require 0.13 s, 0.17 s, 0.70 s and 2.76 s, respectively. The time cost of the dense field prediction steps is negligible compared to the hallucination steps. Our framework consumes 3.84 s in total, which is significantly faster than existing face hallucination approaches (for example, 15–20 min for [6], 1 min for [8] and 8 min for [12]), thanks to CBN's purely discriminative inference procedure and its non-exemplar, parametric model structure.

4.2 An Ablation Study

We investigate the effects of three important components in our framework:

1. Effects of the gated bi-network. (a) We explore the results if we replace the cascaded gated bi-network with a vanilla cascaded CNN, in which only the common branch (the blue branch in Fig. 4) is retained. In this case, the spatial information, i.e. the dense face correspondence field, is not considered or optimized at all. (b) We also explore the case where only the high-frequency branch (the red branch in Fig. 4) is retained.

2. Effects of the progressively updated dense correspondence field. In our framework, the pixel-level correspondence field is refined progressively to better facilitate the subsequent hallucination process. We explore the results if we only use the correspondence estimated from the input low-res image (see footnote 3). In this case, the spatial configuration estimate is not updated as the resolution grows.

3. Effects of the cascade. The cascaded alternating framework is the core of our framework. We explore the results if we train one network that directly super-resolves the input to the required size. The high-frequency prior is still used in this baseline; we observe an even worse result without it.

We present the results in Table 3. The experimental settings follow Sect. 4.1: the PubFig and HELEN datasets are super-resolved from 5pxIOD, while the PubFig83 dataset is up-scaled 4 times with unknown Gaussian blur. The results suggest that all components are important to the proposed approach.

4.3 Discussion

Despite the effectiveness of our method, we still observe a small set of failure cases. Figure 8 illustrates three typical types of failure: (1) over-synthesis of occluded facial parts, e.g. the eyes in Fig. 8(a); in this case, the gate network might have been misled by the light-colored sunglasses and therefore favours the results from the high-frequency branch; (2) ghosting, caused by inaccurate spatial prediction at low resolution; it is rather challenging to localize facial parts under very large head poses in a low-res image; (3) incorrect details such as gaze direction; we find that almost no reliable gaze direction information is present in the input, so our method synthesizes the eyes with the most probable gaze direction. We leave addressing these drawbacks as future work.

Fig. 8. Three representative types of failure cases of our approach (for Sect. 4.3).

5 Conclusion

We have presented a novel framework for hallucinating faces under substantial shape deformation and appearance variation. Owing to its capability to adaptively refine the dense correspondence field and hallucinate faces in an alternating manner, our approach achieves state-of-the-art performance and visually appealing qualitative results. Guided by the high-frequency prior, our framework leverages spatial cues throughout the hallucination process.