
1 Introduction

The rapid development of video and image services calls for high-resolution face images, but existing imaging and video equipment often cannot meet this requirement. Single-image super-resolution research targeting sensitive regions such as vehicle plates and faces has therefore attracted much attention. Face hallucination refers to the technique of reconstructing the latent high-resolution (HR) face from a single low-resolution (LR) face. Recent state-of-the-art face hallucination methods are mostly learning-based. These methods [1] utilize an HR-LR dictionary pair to capture the similar local geometry between HR and LR samples, and achieve satisfactory results under stationary conditions.

Freeman et al. [2] first proposed a patch-based Markov network to model the local geometric relationship between LR patches and their HR counterparts, which is time-consuming. Baker and Kanade [4] employed a Bayesian model to estimate the latent high-frequency components. Inspired by locally linear embedding (LLE), Chang et al. [7] obtained the high-resolution image through a neighbor embedding (NE) method, but the fixed number of neighbors is unstable and difficult to estimate. Considering that face images have a stable structure, Ma et al. [8] took this as a prior and proposed a position-patch based method, which uses only the patches at the same position. However, when the number of training faces is much larger than the LR patch dimension, unconstrained least square estimation (LSE) [8, 9] leads to inaccurate solutions. Yang et al. [12] and Jung et al. [13] introduced sparse representation (SR) and convex optimization, respectively, to address this problem. Although these SR based methods achieve good results, their performance is strictly limited by the noise level and the degradation process. Figure 1 shows the framework of these typical position-patch based methods. To better adapt to real situations, constrained SR methods have been proposed. By attaching a similarity constraint factor to the optimization coefficients [15, 16], adding illumination compensation to the constraint factor [17], and introducing an adaptive \( \ell_{q} \) norm into the sparse representation [18], these constrained SR methods improve robustness to noise. However, the computational load of the commonly used SLEP toolbox [23], which solves the optimization problem, is significant.

Fig. 1. Outline of the traditional position-patch based face hallucination framework.

In this paper, we show that the common face databases used for face hallucination are over-complete and that not all samples are necessary. We also establish a novel two-layer framework with a naturally parallel structure to infer HR patches. Inspired by the constrained sparse representation methods, we observe that some samples in the training set are very different from the input face; these correspond to the nearly-zero coefficients obtained during optimization. Such faces are less important and can be directly discarded to improve robustness and reduce computation. Indeed, we find that reducing the number of training samples according to their similarity to the input LR face hardly affects the result but saves considerable time, which supports this assumption. A sample selection strategy is therefore needed. Since the samples in the database are all well-aligned faces captured under nearly the same conditions, we intuitively choose eigenfaces to perform the sample reduction. Finally, by using global similarity selection via eigenfaces to speed up reconstruction, and local similarity representation between same-position patches and neighboring patches to synthesize HR patches, our proposed similarity selection and representation (SSR) method achieves better performance when the input image is corrupted by noise or blurred by a different kernel. As a general framework, the global comparison and local representation algorithms can also be conveniently replaced by other face recognition and representation methods. The proposed method has the following features.

  • When the surrounding (neighbor) patches are considered to improve robustness against face position changes, the computation increases roughly tenfold. Our two-layer parallel framework offsets this by boosting the reconstruction speed.

  • By excluding insignificant or dissimilar samples from the training set before reconstruction, the algorithm is further accelerated with no decrease in quality.

  • Using constrained exponential distances as coefficients guarantees good structural similarity between patches and avoids the over-fitting caused by noise in sparse representation, making the method robust to noise and blur-kernel changes in real scenarios.

2 Existing Position-Patch Based Approaches

Let \( N \) be the dimension of a patch (usually a patch of size \( \sqrt N \times \sqrt N \)) and \( M \) be the number of samples in the training set. Given a test patch \( {\mathbf{x}}{ \in }R^{N \times 1} \), the patches at the same position in the LR training set are represented as \( {\mathbf{Y}}{ \in }R^{N \times M} \), with the mth column \( {\mathbf{Y}}^{\text{m}} \) (m = 1, …, M) being the patch from sample m. The input patch can then be represented by the sample patches as:

$$ {\mathbf{x}} = {\mathbf{Yw}} + {\mathbf{e}} $$
(1)

where \( {\mathbf{w}}{ \in }R^{M \times 1} \) is the coefficients vector with entry \( w_{m} \) and \( {\mathbf{e}} \) is the reconstruction error vector.

Obviously, solving for the coefficients \( {\mathbf{w}} \) is the key problem in patch-based methods. In [8, 9], the reconstruction coefficients are obtained by constrained least square estimation (LSE) as

$$ {\mathbf{w}}^{*} = \arg \min_{{\mathbf{w}}} \left\| {{\mathbf{x}} - {\mathbf{Yw}}} \right\|_{2}^{2} \quad {\text{s.t.}}\quad \sum\nolimits_{m = 1}^{M} {w_{m} = 1} $$
(2)

This least squares problem has a closed-form solution computable via the Gram matrix, but it becomes unstable when the number of samples M is much larger than the patch dimension N. Jung et al. [13] introduced sparse representation into face hallucination and converted (2) into a standard SR problem:

$$ \min_{{\mathbf{w}}} \left\| {\mathbf{w}} \right\|_{0} \quad {\text{s.t.}}\quad \left\| {{\mathbf{x}} - {\mathbf{Yw}}} \right\|_{2}^{2} \le \varepsilon $$
(3)

where the \( \ell_{0} \) norm counts the number of non-zero entries in \( {\mathbf{w}} \) and \( \varepsilon \) is the error tolerance. Yang et al. [12] and Zhang et al. [14] replace the \( \ell_{0} \) norm with the squared \( \ell_{2} \) norm \( \left\| {\mathbf{w}} \right\|_{2}^{2} \) and the \( \ell_{1} \) norm \( \left\| {\mathbf{w}} \right\|_{1} \), respectively, which corresponds to constraining the coefficient statistics with Gaussian and Laplacian priors.
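For concreteness, the sum-to-one constrained LSE of Eq. (2) admits the following closed-form solution via the Gram matrix. This is a minimal NumPy sketch, not the authors' implementation; the small ridge term `reg` is our own addition to illustrate the instability when M ≫ N:

```python
import numpy as np

def constrained_lse(x, Y, reg=1e-8):
    """Solve Eq. (2): min ||x - Yw||^2 subject to sum(w) = 1.

    x: test LR patch, shape (N,); Y: same-position sample patches, (N, M).
    reg: ridge term (our own, hypothetical) that conditions the Gram
    matrix, which becomes near-singular when M >> N.
    """
    M = Y.shape[1]
    d = Y - x[:, None]                       # difference vectors, (N, M)
    G = d.T @ d                              # Gram matrix, (M, M)
    G = G + reg * np.trace(G) / M * np.eye(M)
    w = np.linalg.solve(G, np.ones(M))       # unnormalized weights
    return w / w.sum()                       # enforce the sum-to-one constraint
```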

Similarity constrained SR methods proposed in [15–18] can be formulated as

$$ {\mathbf{w}}^{*} = \arg \min_{{\mathbf{w}}} \left\{ {\left\| {{\mathbf{x}} - {\mathbf{Yw}}} \right\|_{2}^{2} + \lambda \left\| {{\mathbf{Dw}}} \right\|_{q} } \right\}, $$
$$ d_{mm} = \left\| {g{\mathbf{x}} - {\mathbf{Y}}^{\text{m}} } \right\|_{2} ,\quad 1 \le m \le M $$
(4)

where \( {\mathbf{D}}{ \in }R^{M \times M} \) is a diagonal matrix that controls the similarity constraints placed on the coefficients. The entries \( d_{mm} \) on the main diagonal of \( {\mathbf{D}} \) are Euclidean distances scaled by a gain factor \( g \). In [15–17], \( q \) is fixed to 1 or 2, while in [18] an adaptively selected \( \ell_{q} \) norm scheme is introduced to improve robustness under varying conditions.
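For the \( q = 2 \) (squared) case, Eq. (4) reduces to a Tikhonov-style problem with a closed-form solution, sketched below with our own illustrative parameter values (`lam` and `gain` are not taken from the cited papers):

```python
import numpy as np

def similarity_constrained_sr(x, Y, lam=0.04, gain=1.0):
    """Closed form of Eq. (4) when the penalty is lam * ||Dw||_2^2:
    w* = (Y^T Y + lam * D^2)^{-1} Y^T x.

    The diagonal of D holds d_mm = ||gain * x - Y[:, m]||_2 as in Eq. (4),
    so coefficients of dissimilar sample patches are heavily penalized.
    """
    d = np.linalg.norm(gain * x[:, None] - Y, axis=0)   # d_mm for all m
    return np.linalg.solve(Y.T @ Y + lam * np.diag(d ** 2), Y.T @ x)
```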

After the coefficients are obtained, the reconstructed HR test patch \( {\mathbf{x}}_{\text{H}} \) can be represented as

$$ {\mathbf{x}}_{\text{H}} = {\mathbf{Y}}_{\text{H}} {\mathbf{w}} $$
(5)

where the coefficients are directly applied to the corresponding HR sample patches \( {\mathbf{Y}}_{\text{H}} \).

3 Proposed Method

The methods introduced in Sect. 2 show impressive results on noise-free experimental faces. But when the noise level, blur kernel, or degradation process changes, their performance drops dramatically. This is mainly because noise is not sparse, and because the local geometries of the high-dimensional and low-dimensional manifolds are no longer coherent once the degradation process changes. To overcome this problem, we propose global similarity selection and local similarity representation to improve robustness.

3.1 Global Similarity Selection

In SR based methods, a test LR patch is represented by a large number of LR sample patches through sparse coefficients. Heavy computation is therefore spent enforcing sparsity even though a corrupted face is not sparse. We thus argue that not all samples in the over-complete face database are necessary. Some faces in the training set are very different from the input face and correspond to very small weights during optimization; they have very limited impact on the result and are not worth the computation they consume.

Therefore, for noise-corrupted faces, we no longer seek a sparse representation but instead represent the LR patch directly by similarity. To exclude the dissimilar samples that correspond to nearly-zero coefficients, a similarity comparison strategy is needed. We choose a global similarity selection scheme, rather than local (patch-wise) comparison, before reconstruction. This is mainly because the face databases we use contain same-sized, well-aligned faces under the same lighting condition, so a global comparison is both reliable and fast.

We intuitively apply Turk and Pentland's [10] eigenface method, which projects the test face image into the eigenface space and selects the most similar \( M \) faces according to Euclidean distance. Given the LR face database \( {\mathbf{F}} \), with mth column being sample \( {\mathbf{F}}^{\text{m}} \), and after normalization by subtracting the mean, the covariance matrix \( {\mathbf{C}} \) is obtained by (6), where \( {\text{M}} \) is the number of samples in \( {\mathbf{F}} \). The eigenfaces \( {\mathbf{P}} \) are then easy to compute by singular value decomposition, and the projected database \( {\mathbf{F^{\prime}}} \) is given by (7). Before reconstruction, the input LR face \( {\mathbf{x}} \) is first projected into the eigenface space by (8), and the similar faces are selected by the Euclidean distance between the samples and \( {\mathbf{x}} \), according to \( \left\| {{\mathbf{F^{\prime}}}_{m} - {\mathbf{x^{\prime}}}} \right\|_{2}^{2} \).

$$ {\mathbf{C}} = \frac{1}{M}{\mathbf{F}}_{norm} {\mathbf{F}}_{norm}^{T} ,\,{\mathbf{F}}_{norm} = {\mathbf{F}} - mean\left( {\mathbf{F}} \right). $$
(6)
$$ {\mathbf{F^{\prime}}} = {\mathbf{F}}_{norm}^{T} {\mathbf{P}}. $$
(7)
$$ {\mathbf{x^{\prime}}} = \left[ {{\mathbf{x}} - mean\left( {\mathbf{F}} \right)} \right]^{T} {\mathbf{P}}. $$
(8)
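A minimal NumPy sketch of this selection stage, assuming faces are supplied as flattened columns of `F` (the function name and the full-rank PCA are our own illustrative choices):

```python
import numpy as np

def select_similar_faces(x, F, M_keep):
    """Global similarity selection via eigenfaces, Eqs. (6)-(8).

    x: input LR face, flattened, shape (P,); F: LR training faces as
    columns, shape (P, M_total). Returns indices of the M_keep faces
    closest to x in eigenface space.
    """
    mean = F.mean(axis=1, keepdims=True)
    F_norm = F - mean                          # normalization in Eq. (6)
    # Eigenfaces P: left singular vectors of F_norm, i.e. the
    # eigenvectors of the covariance matrix C in Eq. (6)
    P, _, _ = np.linalg.svd(F_norm, full_matrices=False)
    F_proj = F_norm.T @ P                      # Eq. (7): projected database
    x_proj = (x - mean[:, 0]) @ P              # Eq. (8): projected input
    dists = np.linalg.norm(F_proj - x_proj, axis=1)
    return np.argsort(dists)[:M_keep]          # indices of the M_keep nearest
```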

Results shown in Fig. 3 in Sect. 4 support this assumption. By halving the sample set before reconstruction, the global similarity selection saves about half of the traditional method's computation.

3.2 Local Similarity Representation

After picking out the similar faces from the entire training set, we make full use of the information in neighboring patches by establishing a parallel two-layer framework that integrates the patches surrounding the test LR patch, as shown in Fig. 2. Instead of using all the sample patches at the same position to estimate the test patch (as in Fig. 1), we change the traditional structure into a two-layer mode in which every sample outputs a middle-layer HR patch before the final HR patch is synthesized.

Fig. 2. Proposed parallel two-layer face hallucination framework.

In Fig. 2, the test LR patch's neighbor patches in the training samples are marked with dark lines. Let M be the number of selected samples and S (S is 9 in Fig. 2) be the number of neighbor patches used around the center patch. The test LR patch is still \( {\mathbf{x}}{ \in }R^{N \times 1} \), and its neighbor patches in sample \( {\text{m}} \) (m = 1, …, M) are represented as \( {\mathbf{Y}}_{\text{m}}^{S} { \in }R^{N \times S} \), with sth column \( {\mathbf{Y}}_{\text{m}}^{\text{s}} \) (s = 1, …, S).

For every sample in the LR training set, a weight vector \( {\mathbf{w}}_{\text{m}} { \in }R^{S \times 1} \) representing the similarity between the test patch and its neighbor patches is obtained; its entries \( {\text{w}}_{\text{m}}^{\text{s}} \) are computed as

$$ {\text{w}}_{\text{m}}^{\text{s}} = \exp \left( { - \left\| {{\mathbf{x}} - {\mathbf{Y}}_{\text{m}}^{\text{s}} } \right\|_{2} - \left\| {D\left( {\text{s}} \right) - D\left( {\mathbf{x}} \right)} \right\|_{2} } \right), \quad 1 \le {\text{s}} \le S. $$
(9)

The function \( {\text{D}}\left( \cdot \right) \) measures the spatial distance from a patch to the center patch position, so the second term penalizes neighbors that lie far from the test patch's position. The middle-layer HR patch of sample m can then be represented as

$$ {\mathbf{x}}_{{{\text{H}}1{\text{m}}}} = \frac{1}{{{\text{G}}_{\text{m}} }}{\mathbf{Y}}_{\text{m}}^{\text{S}} {\mathbf{w}}_{\text{m}} , $$
$$ {\text{G}}_{\text{m}} = \sum\nolimits_{{{\text{s}} = 1}}^{\text{S}} {{\text{w}}_{\text{m}}^{\text{s}} } $$
(10)

After every sample outputs its corresponding middle-layer HR patch, the final HR patch is synthesized simply as

$$ {\mathbf{x}}_{{{\text{H}}2}} = \frac{1}{{\mathop \sum \nolimits_{{{\text{m}} = 1}}^{M} {\text{G}}_{\text{m}} }} \sum\nolimits_{{{\text{m}} = 1}}^{M} {{\text{G}}_{\text{m}} {\mathbf{x}}_{{{\text{H}}1{\text{m}}}} } . $$
(11)

Finally, the estimated HR face is obtained by assembling all reconstructed HR patches at their corresponding positions and averaging the pixels in overlapping regions. This completes the similarity selection and representation (SSR) face hallucination pipeline.
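The two-layer synthesis of Eqs. (9)-(11) and the final patch assembly can be sketched as follows. This is a minimal NumPy illustration under our own data-layout assumptions (pre-extracted, flattened patches; `offsets[s]` standing in for the spatial term \( D(s) - D({\mathbf{x}}) \)), not the authors' code:

```python
import numpy as np

def two_layer_patch(x, Y_L, Y_H, offsets):
    """Synthesize one HR patch from M selected samples, Eqs. (9)-(11).

    Y_L[m][s] / Y_H[m][s]: flattened LR/HR neighbor patch s of sample m;
    offsets[s]: spatial offset of neighbor s from the center position.
    """
    num, den = 0.0, 0.0
    for m in range(len(Y_L)):
        # Eq. (9): exponential similarity weights over the S neighbors
        w = np.array([np.exp(-np.linalg.norm(x - Y_L[m][s])
                             - np.linalg.norm(offsets[s]))
                      for s in range(len(offsets))])
        G_m = w.sum()                                   # normalizer, Eq. (10)
        x_h1 = np.stack(Y_H[m], axis=1) @ (w / G_m)     # middle-layer HR patch
        num, den = num + G_m * x_h1, den + G_m          # accumulate Eq. (11)
    return num / den

def assemble(patches, positions, out_shape, psize):
    """Average overlapping HR patches into the final face (the last step above)."""
    acc, cnt = np.zeros(out_shape), np.zeros(out_shape)
    for p, (r, c) in zip(patches, positions):
        acc[r:r + psize, c:c + psize] += p.reshape(psize, psize)
        cnt[r:r + psize, c:c + psize] += 1
    return acc / np.maximum(cnt, 1)
```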

4 Experiments and Results

In this section, we conduct face hallucination experiments to verify the performance of the proposed SSR method under degradation conditions that are inconsistent between the test face image and the training face images. We use the FEI face database [24], which contains 400 faces from 200 adults (100 men and 100 women). We randomly select 380 faces for training and the remaining 20 faces for testing. All samples are well-aligned and of the same size, \( 360 \times 260 \). The LR samples are smoothed and down-sampled by a factor of 4. The LR patch size is set to \( 3 \times 3 \) pixels with an overlap of 1 pixel, and the HR patch size is \( 12 \times 12 \) pixels with 4 overlapping pixels. The smoothing kernel used in the training phase is a fixed \( 20 \times 20 \) Gaussian low-pass filter with standard deviation 4.
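This degradation pipeline can be reproduced roughly as below; a sketch assuming SciPy's Gaussian filter, whose default kernel truncation only approximates the fixed \( 20 \times 20 \) support stated above:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, sigma=4.0, factor=4):
    """Generate an LR sample: Gaussian smoothing (std 4, as in the paper)
    followed by down-sampling by a factor of 4 via simple decimation."""
    smoothed = gaussian_filter(hr.astype(np.float64), sigma=sigma)
    return smoothed[::factor, ::factor]
```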

In Fig. 3, we show how the number of similar faces M kept in the global similarity selection stage affects the reconstruction quality. The second half of both curves is almost flat, meaning the quality is essentially unchanged even when the number of faces used for reconstruction is reduced to half of the entire set. We therefore set M to 190 (half of the entire set) in the following tests without affecting quality, while the other methods still use the entire set of 380 training faces.

Fig. 3. Average PSNR and SSIM as the number of similar faces M changes.

We conduct the following experiments under two mismatched degradation conditions: noise corruption and smoothing by a different kernel. The NE [7], LSR [8], and LcR [15] methods are tested for comparison.

4.1 Robustness Against Noise

We add zero-mean Gaussian noise (\( \upsigma = 1,2 \ldots 15 \)) to the test faces to simulate noisy inputs; the smoothing kernel is the same as in the training phase. Results for some randomly selected subjects are shown in Figs. 4 and 5 for \( \upsigma \) set to 5 and 10, and the corresponding average PSNR and SSIM values are listed in Table 1. As the noise level increases, the input face is no longer sparse, so the traditional SR methods cannot separate the noise component from the original image: their performance drops dramatically, and many texture artifacts appear due to the Gaussian noise. Our SSR method is much less affected by noise and restores a much clearer face. Relative to the second-best method, LcR, the PSNR gains of SSR reach 0.23 dB and 1.83 dB for noise levels 5 and 10, respectively. More results under different noise levels are shown in Fig. 6: as the noise level grows, SSR continues to widen the gap over the traditional methods.
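For reference, the noisy inputs can be simulated as follows (a sketch; the fixed seed is our own choice for reproducibility and not part of the original protocol):

```python
import numpy as np

def add_noise(lr_face, sigma, seed=0):
    """Simulate a noisy LR input by adding zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    return lr_face + rng.normal(0.0, sigma, lr_face.shape)
```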

Fig. 4. Comparison of results on the FEI face database for noisy images (\( \upsigma = 5 \)). (a) Bicubic. (b) Chang's NE [7]. (c) Ma's LSR [8]. (d) Jiang's LcR [15]. (e) Proposed SSR. (f) Original HR faces.

Fig. 5. Comparison of results on the FEI face database for noisy images (\( \upsigma = 10 \)). (a) Bicubic. (b) Chang's NE [7]. (c) Ma's LSR [8]. (d) Jiang's LcR [15]. (e) Proposed SSR. (f) Original HR faces.

Table 1. PSNR and SSIM comparison of different methods
Fig. 6. Comparison of different methods: average PSNR and SSIM as the noise level \( \upsigma \) grows.

4.2 Robustness Against Kernel Changes

We find that the performance of SR methods decreases significantly even when the kernel used in the reconstruction phase differs only slightly from the one used in the training phase, yet this situation is the most common in practical applications. We therefore test the methods with mismatched kernels in the training and reconstruction phases. As mentioned above, a fixed \( 20 \times 20 \) Gaussian low-pass filter with standard deviation 4 is used in the training phase; we change the standard deviation from 4 to 8 to make the input face more blurred than the faces in the LR set.

According to Table 1, although the SSR method does not beat LSR and LcR in PSNR and SSIM, the results in Fig. 7 show intuitively that SSR creates more details and sharper edges than the others, which is more valuable in practical use. The first three methods are strongly affected by the changed kernel: only the eye region is artificially enhanced, while the other regions remain as blurred as the input face. SSR, in contrast, generates much clearer facial edges such as the eyes, nose, and facial contours, which demonstrates its superiority in subjective visual quality.

Fig. 7. Comparison of results with a changed smoothing kernel (best viewed with adequate zoom, where each face is shown at its original size \( 360 \times 260 \)). (a) Bicubic. (b) Chang's NE [7]. (c) Ma's LSR [8]. (d) Jiang's LcR [15]. (e) Proposed SSR. (f) Original HR faces.

5 Conclusion

In this paper, we have proposed a general parallel two-layer face hallucination framework that boosts reconstruction speed and improves robustness against noise. Our method excludes unnecessary samples from the over-complete training set via global similarity selection without quality loss, and the local similarity representation stage then produces satisfactory results under severe conditions. Experiments on the FEI face database demonstrate that the method achieves better results under heavy noise and good visual quality when the degradation process changes. As a general framework, many other existing face hallucination schemes can also be conveniently incorporated into the SSR method.