# Blur kernel estimation using sparse representation and cross-scale self-similarity


## Abstract

Blind image deconvolution, i.e., estimating both the latent image and the blur kernel from the only observed blurry image, is a severely ill-posed inverse problem. In this paper, we propose a blur kernel estimation method for blind motion deblurring using sparse representation and cross-scale self-similarity of image patches as priors to recover the latent sharp image from a single blurry image. Sparse representation indicates that image patches can always be represented well as a sparse linear combination of atoms in an appropriate dictionary. Cross-scale self-similarity results in that any image patch can in some way be well approximated by a number of other similar patches across different image scales. Our method is based on the observations that almost any image patch in a natural image has multiple similar patches in down-sampled versions of the image, and down-sampling produces image patches that are sharper than those in the blurry image itself. In our method, the dictionary for sparse representation is trained adaptively from sharper patches sampled from the down-sampled latent image estimate to make the similar patches of the latent sharp image well represented sparsely, and meanwhile, all patches from the latent image estimate are optimized to be as close to the sharper similar patches searched from the down-sampled version to enforce the sharp recovery of the latent image by constructing a non-local regularization. Experimental results on both simulated and real blurry images demonstrate that our method outperforms state-of-the-art blind deblurring methods.

## Keywords

Blind deconvolution · Deblurring · Sparse representation · Self-similarity · Cross-scale

## 1 Introduction

Motion blur caused by camera shake has been one of the most common artifacts in digital imaging. Blind image deconvolution aims to recover the latent (unblurred) image from the only observed blurry image when the blur kernel is unknown. Despite over three decades of research in the field, blind deconvolution still remains a challenge for real-world photographs with unknown blurs. Recently, blind deconvolution has received renewed attention since Fergus et al.’s work [6] that removes the motion blur from a single image.

The blur process is commonly modeled as the convolution of a blur kernel with the latent image:

\( \boldsymbol{y} = \boldsymbol{h} \otimes \boldsymbol{x} + \boldsymbol{n}, \)

where **y** is the observed blurry image, **h** is the blur kernel (or point spread function), **x** is the latent image, and **n** is noise. When the blur kernel is unknown, removing the motion blur from the observed blurry image becomes the so-called blind deconvolution operation, and the recovery of the latent image is a severely ill-posed inverse problem. The key to solving this ill-posed inverse problem is the proper incorporation of various priors about the latent image into the blind deconvolution process.

In recent years, impressive progress has been made in removing motion blur given only a single blurry image. Some methods explicitly or implicitly exploit edges for kernel estimation [3, 8, 10, 31]. This idea was introduced by Jia [8], who used an alpha matte to estimate the transparency of blurred object boundaries and performed the kernel estimation using transparency alone. Joshi et al. [10] predict sharp edges using edge profiles and estimate the blur kernel from the predicted edges. However, their goal is to remove small blurs, since it is not trivial to directly restore sharp edges from a severely blurred image. In [3, 31], strong edges are predicted from the latent image estimate using a shock filter and gradient thresholding, and then used for kernel estimation. Unfortunately, the shock filter can over-sharpen image edges and is sensitive to noise, leading to an unstable estimate.
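The convolution model above is easy to sketch numerically. The following minimal illustration (our own, not part of the paper's implementation) synthesizes a blurry observation from a latent image with an assumed box kernel and additive Gaussian noise:

```python
import numpy as np

def blur(x, h):
    """Valid-region 2-D convolution y = h * x (direct sum, fine for small kernels)."""
    kh, kw = h.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # convolution flips the kernel before the weighted sum
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * h[::-1, ::-1])
    return out

rng = np.random.default_rng(1)
x = rng.random((32, 32))           # latent image (illustrative)
h = np.ones((5, 5)) / 25.0         # box blur kernel, normalized to sum to 1
n = 0.01 * rng.standard_normal((28, 28))
y = blur(x, h) + n                 # observed blurry image
```

Blind deconvolution asks for both `x` and `h` given only `y`, which is what makes the problem severely ill-posed.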

Many methods impose priors on the latent image **x** or the motion blur kernel **h**, or both, and formulate blind deconvolution as a joint optimization problem with regularizations on both **x** and **h** [6, 14, 15, 22, 23, 26]:

\( \min_{\boldsymbol{x},\boldsymbol{h}} \sum_{\partial_{\ast}} \omega_{\ast} \left\| \boldsymbol{h} \otimes \partial_{\ast}\boldsymbol{x} - \partial_{\ast}\boldsymbol{y} \right\|_{2}^{2} + \lambda_{x}\,\rho(\boldsymbol{x}) + \lambda_{h}\,\rho(\boldsymbol{h}), \)

where *∂*_{∗} ∈ {*∂*_{0}, *∂*_{x}, *∂*_{y}, *∂*_{xx}, *∂*_{xy}, *∂*_{yy},⋯ } denotes the partial derivative operator in different directions and orders, *ω*_{∗} is a weight for each partial derivative, *ρ*(**x**) is a regularization term on the latent sharp image **x**, *ρ*(**h**) is a regularization term on the blur kernel **h**, and *λ*_{x} and *λ*_{h} are regularization weights. The first term in the objective function uses image derivatives for reducing ringing artifacts. Many techniques based on sparsity priors of image gradients have been proposed to deal with motion blur. Most previous methods assume that gradient magnitudes of natural images follow a heavy-tailed distribution. Fergus et al. [6] represent the heavy-tailed distribution over gradient magnitudes with a zero-mean mixture of Gaussians based on natural image statistics. Levin et al. [13] propose a hyper-Laplacian prior to fit the heavy-tailed distribution of natural image gradients. Shan et al. [26] construct a natural gradient prior for the latent image by concatenating two piece-wise continuous convex functions. However, sparse gradient priors always prefer trivial solutions, that is, the delta kernel and exactly the blurry image as the latent image estimate, because the blur reduces the overall gradient magnitude. To tackle this problem, there are mainly two streams of research on blind deconvolution. They either use maximum marginal probability estimation of **h** alone (marginalizing over **x**) to recover the true kernel [6, 14, 15], or directly optimize the joint posterior probability of both **x** and **h** with empirical strategies or heuristics to avoid the trivial solution during the minimization [22, 26]. Levin et al. [14, 15] suggest that a maximum-a-posteriori (MAP) estimation of **h** alone is well conditioned and recovers an accurate kernel, while a simultaneous MAP estimation that jointly optimizes **x** and **h** would fail because it favors the trivial solution. Perrone and Favaro [22, 23] confirm the analysis of Levin et al. [14, 15], but also show that total variation-based blind deconvolution can work well with a carefully designed implementation. In their work, the total variation regularization parameter is initialized with a large value to help avoid the trivial solution, and is iteratively reduced to allow for the recovery of more details. Blind deconvolution is in general achieved through an alternating optimization scheme; in [22, 23], the projected alternating minimization (PAM) algorithm for total variation blind deconvolution successfully converges to the desired solution.

More recent works often involve priors over larger neighborhoods or image patches, as in image super resolution [34], image denoising [30], non-blind image deblurring [9] and more. Gradient priors consider only two or three neighboring pixels, which is not sufficient for modeling larger image structures. Patch priors that consider larger neighborhoods (*e.g.*, 5 × 5 or 7 × 7 image patches) model more complex structures and dependencies. Sun et al. [27] use a patch prior learned from an external collection of sharp natural images to restore sharp edges. Michaeli and Irani [17] construct a cross-scale patch recurrence prior for the estimation of the blur kernel. Lai et al. [12] obtain two color centers for every image patch and build a normalized color-line prior for blur kernel estimation. More recently, Pan et al. [21] introduce the dark channel prior, based on statistics of image patches, to kernel estimation, while Yan et al. [32] propose a patch-based bright channel prior for kernel estimation.

Recent work suggests that image patches can always be well represented sparsely with respect to an appropriate dictionary, and that the sparsity of image patches over the dictionary can be used as a prior to regularize the ill-posed inverse problem. Zhang et al. [36] use sparse representation of image patches as a prior and train the dictionary from an external collection of natural images, or from the blurry image itself, via the K-SVD algorithm [1]. Li et al. [16] combine a dictionary pair with the sparse gradient prior, assuming that the blurry image and the sharp image have the same sparse coefficients under the blurry dictionary and the sharp dictionary respectively, and restore the sharp image via sparse reconstruction using the blurry image's sparse coefficients on the sharp dictionary. The key issue of sparse representation is to identify a specific dictionary that represents latent image patches in a sparse manner. Most methods learn a universal dictionary from an external collection of numerous images used as training samples. To make all latent image patches sparsely representable over such a universal dictionary, the collection needs to provide massive training samples, which may lead to inefficient learning and a potentially unstable dictionary. Meanwhile, the collection must contain patches similar to those of the latent image, which does not hold all the time. Alternatively, the blurry image itself can be used as training samples, but this cannot consistently guarantee the sparsity of sharp image patches over the learned dictionary.

In this paper, we focus on the regularization approach using patch priors for blind image deblurring. In our previous work, sparse representation and self-similarity were combined for image super resolution (SR) [21, 19]. Super resolution algorithms typically assume that the blur kernel is known (either the point spread function of the camera, or some default low-pass filter, *e.g.* a Gaussian), while blind deblurring refers to the task of estimating the unknown blur kernel; Michaeli and Irani [17] have shown that super resolution algorithms cannot be applied directly to blind deblurring. We propose a blur kernel estimation method for blind motion deblurring using sparse representation and cross-scale self-similarity of image patches as priors to guide the recovery of the latent image. Our method is based on the observations that almost any image patch in a natural image has multiple similar patches in down-sampled versions of the image, and that down-sampling produces image patches that are sharper than those in the blurry image itself. The additional information is thoroughly explored from the abundant patch repetitions of cross-scale self-similar structures of the same image. On the one hand, we incorporate cross-scale self-similarity into sparse representation via cross-scale dictionary learning, which uses sharper patches sampled from the down-sampled version as training samples so that the similar patches of the latent sharp image are better represented over the learned dictionary. On the other hand, we construct a cross-scale non-local regularization that optimizes all patches from the latent image estimate to be as close as possible to the sharper similar patches searched from the down-sampled version, thereby sharpening edges and details of the latent image estimate.
Finally, we take an approximate iterative approach to solve the resulting minimization problem by alternately optimizing the blur kernel and the latent image in a coarse-to-fine framework.

The remainder of this paper is organized as follows. Section 2 describes the background on sparse representation and multi-scale self-similarity. Section 3 makes detailed description on the proposed method, including our blind deconvolution model and the solution to our model. Section 4 presents experimental results on both simulated and real blurry images. Section 5 draws the conclusion.

## 2 Sparse representation and multi-scale self-similarity

### 2.1 Sparse representation

Let **Q**_{j} **X** denote an image patch, where **Q**_{j} is a matrix extracting the *j*th patch from an image **X** ordered lexicographically by stacking either the rows or the columns of the image into a vector. Sparse representation assumes that the image patch \(\mathbf {Q}_{j}\boldsymbol {X}\in \mathbb {R}^{n}\) can be represented sparsely over a dictionary \(\mathbf {D}\in \mathbb {R}^{n\times t}\), that is:

\( \mathbf{Q}_{j}\boldsymbol{X} = \mathbf{D}\boldsymbol{\alpha}_{j}, \quad \|\boldsymbol{\alpha}_{j}\|_{0} \ll n, \qquad (3) \)

where the *t* columns of **D** are the dictionary atoms, \({\boldsymbol {\alpha }}_{j} =[\alpha _{1},\cdots ,\alpha _{t}]^{\mathrm {T}} \in \mathbb {R}^{t} \) is the sparse representation coefficient of **Q**_{j} **X**, and ∥**α**_{j}∥_{0} counts the nonzero entries in **α**_{j}.

Given a set of training samples, where *m* is the number of training samples \(\boldsymbol{x}_{i}\), dictionary learning attempts to find a dictionary **D** that yields sparse representation coefficients **α**_{i}, *i* = 1,⋯ ,*m* for the training samples by jointly optimizing **D** and **α**_{i}, *i* = 1,⋯ ,*m* as follows:

\( \min_{\mathbf{D},\{\boldsymbol{\alpha}_{i}\}} \sum_{i=1}^{m} \left\|\boldsymbol{x}_{i} - \mathbf{D}\boldsymbol{\alpha}_{i}\right\|_{2}^{2} \quad \text{s.t.} \quad \|\boldsymbol{\alpha}_{i}\|_{0} \le T, \; i = 1,\cdots,m, \qquad (4) \)

where *T* ≪ *n* controls the sparsity of **α**_{i} for *i* = 1,⋯ ,*m*. The K-SVD algorithm [1] is an effective dictionary learning method which solves (4) by alternately optimizing **D** and **α**_{i}, *i* = 1,⋯ ,*m*. The precision of the K-SVD algorithm can be controlled either by constraining the representation error or by constraining the number of nonzero entries in **α**_{i}. We use the latter, as formulated in (4), because it is required by the orthogonal matching pursuit (OMP) algorithm [29], which obtains an approximate solution for (3).

Suppose we have learned the dictionary **D**. Then, for the patch **Q**_{j} **X**, we have to derive the sparse representation coefficient. Equation (3) can be formulated as the following *ℓ*_{0}-norm minimization problem:

\( \min_{\boldsymbol{\alpha}_{j}} \left\|\mathbf{Q}_{j}\boldsymbol{X} - \mathbf{D}\boldsymbol{\alpha}_{j}\right\|_{2}^{2} \quad \text{s.t.} \quad \|\boldsymbol{\alpha}_{j}\|_{0} \le T, \qquad (5) \)

where *T* is the sparsity constraint parameter. In our method, we obtain an approximate solution \(\boldsymbol {\hat {\alpha }}_{j}\) for (5) by using the OMP algorithm [29]. The OMP algorithm is a greedy iterative algorithm for approximately solving the above *ℓ*_{0}-minimization problem; it works by finding the locations of the nonzeros in **α**_{j} one at a time. After \(\boldsymbol {\hat {\alpha }}_{j}\) is derived, the reconstructed image patch \(\mathbf {Q}_{j}\hat {\boldsymbol {X}}\) is represented sparsely over **D** through \(\mathbf {Q}_{j} \boldsymbol {\hat {X}} = \mathbf {D} {\boldsymbol {\hat \alpha }}_{j}\).
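To make the greedy procedure concrete, here is a minimal OMP sketch (our own illustration, not the authors' implementation; dictionary sizes and the test signal are arbitrary). It assumes unit-norm atoms:

```python
import numpy as np

def omp(D, x, T):
    """Orthogonal matching pursuit: greedily select at most T atoms of D,
    then least-squares fit x on the selected support."""
    residual = x.astype(float).copy()
    support = []
    alpha = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(T):
        # atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # orthogonal projection onto the span of the selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    alpha[support] = coef
    return alpha
```

The orthogonal projection step (re-fitting all selected atoms at every iteration) is what distinguishes OMP from plain matching pursuit, and is why the residual stays orthogonal to the chosen atoms.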

### 2.2 Multi-scale self-similarity and non-local means

Self-similarity means that structures repeat within a natural image; *e.g.*, one part of a road, building or natural landscape resembles another part of the same object. Multi-scale self-similarity refers to explicit or implicit repetitions of structures at various sizes in the same scene, and many such multi-scale similar structures can be observed in a natural image. Figure 1 schematically illustrates patch repetitions of multi-scale self-similar structures both within the same scale and across different scales of a single image. For a patch marked with a red box in Fig. 1a, we search for its 5 most similar patches, marked with blue boxes, in this image. Figure 1b shows close-ups of the similar patches within the same scale. In this example, the image is down-sampled by a factor of *a* = 2, as shown in Fig. 1c. For the patch marked by a red box in Fig. 1a at the original scale, we also search for its 5 most similar patches in the down-sampled image, marked by blue boxes. Figure 1d shows close-ups of the similar patches in the down-sampled image, i.e. cross-scale similar patches. When small image patches are used, *e.g.*, 5 × 5 or 7 × 7 image patches, patch repetitions occur abundantly both within the same scale and across different scales of a natural image, even when we do not visually perceive any obvious repetitive structure. This is due to the fact that very small patches often contain only an edge, a corner, *etc.*, and thus such patch repetitions are found abundantly in multiple image scales of almost any natural image [7]. Glasner et al. [7] perform a test to quantify the amount of multi-scale similar patches in natural images, and conclude that there are plenty of similar patches both within the same scale and across different scales in a single image.

The non-local means was first introduced for image denoising based on this self-similarity property of natural images in the seminal work of Buades [2]; since then, it has been extended successfully to other inverse problems such as image super resolution and non-blind image deblurring [5, 25]. The non-local means is based on the observation that similar image patches within the same scale are likely to appear in a single image, and these same-scale similar patches can provide additional information. For any patch **Q**_{j} **X** in the sharp image, its similar patches can be obtained using block matching, where similarity is measured by the distance between **Q**_{j} **X** and any other patch of the image. The *p* most similar patches **Q**_{i} **X**, *i* = 1,⋯ ,*p* of **Q**_{j} **X** are used to estimate **Q**_{j} **X**, and the difference between **Q**_{j} **X** and its estimate forms the non-local regularization. In our previous work [21], we use cross-scale similar patches as well as same-scale similar patches to construct a multi-scale non-local regularization for super resolution reconstruction.

## 3 Blind deconvolution

### 3.1 Use of cross-scale self-similarity

In our blind deblurring model, we effectively exploit the additional information provided by cross-scale similar patches at down-sampled scales through a cross-scale non-local regularization and cross-scale dictionary learning. In the cross-scale non-local regularization, all patches from the latent image estimate are optimized to be as close as possible to their sharper similar patches searched from the down-sampled version, which enforces the sharp recovery of the latent image. The cross-scale dictionary learning, meanwhile, uses the down-sampled version of the latent image estimate as training samples so that the similar patches of the latent sharp image have sparse representations over the learned dictionary.

Let *N* be the size of the latent image and *a* the down-scaling factor. A latent image patch and a patch of the down-sampled version can be represented as **Q**_{j} **X** and **R**_{i} **X**^{a}, where \({\mathbf Q}_{j}\in \mathbb {R}^{n\times N}\) and \({\mathbf R}_{i}\in \mathbb {R}^{n\times N/a^{2}}\) are matrices extracting the *j*th and *i*th patch from **X** and **X**^{a} respectively, and *n* is the size of the image patch. For each patch **Q**_{j} **X** in the latent image **X**, we can search for its *p* most similar patches **R**_{i} **X**^{a}, *i* = 1,⋯ ,*p* in **X**^{a} using block matching. The linear combination of the *p* most similar patches of **Q**_{j} **X** (collected in the set \(\mathcal {S}_{j}\)) is used to predict **Q**_{j} **X**, that is:

\( \mathbf{Q}_{j}\boldsymbol{X} \approx \sum_{i\in\mathcal{S}_{j}} {w_{i}^{j}}\,\mathbf{R}_{i}\boldsymbol{X}^{a}, \qquad {w_{i}^{j}} = \frac{\exp\left(-\left\|\mathbf{Q}_{j}\boldsymbol{X}-\mathbf{R}_{i}\boldsymbol{X}^{a}\right\|_{2}^{2}/h\right)}{\sum_{i\in\mathcal{S}_{j}}\exp\left(-\left\|\mathbf{Q}_{j}\boldsymbol{X}-\mathbf{R}_{i}\boldsymbol{X}^{a}\right\|_{2}^{2}/h\right)}, \)

where *h* is the control parameter of the weight. The prediction error should be small, and is used as a regularization term in our blind deblurring model [25].

The choice of training samples is very important for the dictionary learning problem. Ideally, the sharp image itself would be used as training samples, but it is precisely the unknown quantity to be recovered. In our single-image super resolution work, the low-resolution image itself is used as training samples to learn an adaptive over-complete dictionary. For blind deblurring, however, using the input blurry image itself as training samples is a poor choice, because patches from the blurry image cannot guarantee the sparsity of sharp image patches over the learned dictionary. Since down-sampling the blurry image provides sharper patches that are more similar to patches from the latent sharp image, we used the down-sampled version of the blurry image as training samples to obtain the dictionary **D** for sparse representation in our previous work [35]. In the proposed method, we present an improvement to the dictionary learning (see Section 3.4 for detail). Because of the use of cross-scale (i.e. down-sampled) similar patches, we call it *cross-scale dictionary learning*.

Suppose *f*(**ξ**) and *f*(**ξ**/*a*) are cross-scale similar patches, *f*(**ξ**/*a*) being an *a*-times larger patch in the sharp image, where **ξ** denotes the spatial coordinate. Accordingly, their blurry counterparts *q*(**ξ**) and *r*(**ξ**) are similar across image scales, and the size of *r*(**ξ**) is *a* times as large as that of *q*(**ξ**) in the blurry image. In Fig. 3, the blurry image is *a* times the size of its down-sampled version. Down-scaling the blurry patch *r*(**ξ**) by a factor of *a* generates an *a*-times smaller patch *r*^{a}(**ξ**). Then, *q*(**ξ**) and *r*^{a}(**ξ**) are of the same size, and the patch *r*^{a}(**ξ**) from the down-sampled image is exactly an *a*-times sharper version of the patch *q*(**ξ**) in the blurry image. In such a case, *r*^{a}(**ξ**) offers rather exact prior information for the recovery of *q*(**ξ**). Figure 3 schematically demonstrates that patches at coarser image scales can serve as a good prior, although it depicts an ideal case.

We now show why *r*^{a}(**ξ**) is *a*-times sharper than *q*(**ξ**). Consider a small patch *f*(**ξ**) in the sharp image and the blur kernel *h*(**ξ**); then we have

\( q(\boldsymbol{\xi}) = \left(f \otimes h\right)(\boldsymbol{\xi}), \)

where *q*(**ξ**) is the blurry counterpart of *f*(**ξ**). Since there are abundant cross-scale similar patches in a single image, we assume there is a patch similar to *f*(**ξ**) elsewhere whose size is *a* times as large as that of *f*(**ξ**), denoted by *f*(**ξ**/*a*). This *a*-times larger patch *f*(**ξ**/*a*) is convolved with the blur *h*(**ξ**), and then we have

\( r(\boldsymbol{\xi}) = \left(f(\cdot/a) \otimes h\right)(\boldsymbol{\xi}), \)

where *r*(**ξ**) is the blurry counterpart of *f*(**ξ**/*a*). Now, if we down-scale the blurry image by a factor of *a*, then this patch *r*(**ξ**) becomes (up to a normalization constant):

\( r^{a}(\boldsymbol{\xi}) = r(a\boldsymbol{\xi}) = \left(f \otimes h(a\cdot)\right)(\boldsymbol{\xi}), \)

that is, *r*^{a}(**ξ**) corresponds to the same patch *f*(**ξ**), but convolved with the *a*-times narrower kernel *h*(*a***ξ**) rather than with *h*(**ξ**). This implies that the patch *r*^{a}(**ξ**) in the down-scaled image is an *a*-times sharper version of the patch *q*(**ξ**) in the blurry image, as visualized in Fig. 3. The above argument shows that down-scaling an image by a factor of *a* produces *a*-times sharper patches of the same size that are more similar to patches from the latent sharp image.

### 3.2 Our model

Combining the priors above, we formulate blind deconvolution as the following minimization problem over the latent image **x** and the blur kernel **h**:

\( \min_{\boldsymbol{x},\boldsymbol{h}} \left\|\boldsymbol{h}\otimes\boldsymbol{x}-\boldsymbol{y}\right\|_{2}^{2} + \lambda_{c}\sum_{j}\left\|\mathbf{Q}_{j}\boldsymbol{X}-\mathbf{D}\boldsymbol{\alpha}_{j}\right\|_{2}^{2} + \lambda_{s}\sum_{j}\Big\|\mathbf{Q}_{j}\boldsymbol{X}-\sum_{i\in\mathcal{S}_{j}}{w_{i}^{j}}\mathbf{R}_{i}\boldsymbol{X}^{a}\Big\|_{2}^{2} + \lambda_{g}\sum_{\partial_{\ast}\in\{\partial_{x},\partial_{y}\}}\left\|\partial_{\ast}\boldsymbol{x}\right\|_{2}^{2} + \lambda_{h}\left\|\boldsymbol{h}\right\|_{2}^{2} \quad \text{s.t.} \quad \|\boldsymbol{\alpha}_{j}\|_{0}\le T \;\; \forall j, \qquad (11) \)

where *∂*_{∗} ∈ {*∂*_{x}, *∂*_{y}} denotes the spatial derivative operator in the two directions, **D** is the learned dictionary for sparse representation, **X** is the vector notation of the latent image **x**, **X**^{a} is the down-sampled version of **X** by a factor of *a*, and *λ*_{c}, *λ*_{s}, *λ*_{g} and *λ*_{h} are regularization weights. Our blind deconvolution method is thus formulated as a constrained optimization problem in which the objective is minimized subject to a constraint on the number of nonzero entries in the sparse representation coefficients. In (11), the first term is the constraint of the observation model (i.e. the data fidelity term), the second term is the sparsity prior, the third term is the cross-scale self-similarity prior (i.e. the cross-scale non-local regularization), the fourth term is the smoothness constraint on the latent image, and the last term is the constraint on the blur kernel.

Blind deblurring in general involves two stages. The motion blur kernel **h** is first estimated by solving (11), via an iterative process that alternately optimizes the motion blur kernel **h** and the latent image **x**. Then, the final deblurring result \(\boldsymbol {\hat x}\) is recovered from the given blurry image **y** with the blur kernel estimate \(\boldsymbol {\hat h}\) by performing any of various non-blind deconvolution methods, such as fast TV-*ℓ*_{1} deconvolution [31], sparse deconvolution [14], EPLL [37], *etc.*

### 3.3 Optimization

Equation (11) is a non-convex minimization problem and cannot be solved in closed form. Instead, it is solved by an approximate iterative optimization procedure which alternates between optimizing the kernel **h** and the latent image **x**. We discuss these two steps separately.

#### 3.3.1 Optimizing *h*

Given the current latent image estimate \(\boldsymbol{\hat x}_{k}\), we minimize (11) with respect to **h**, which has a closed-form solution for \(\hat {\boldsymbol {h}}_{k + 1}\) computed in the Fourier domain:

\( \hat{\boldsymbol{h}}_{k+1} = \mathcal{F}^{-1}\left(\frac{\overline{\mathcal{F}(\boldsymbol{\hat x}_{k})}\,\mathcal{F}(\boldsymbol{y})}{\left|\mathcal{F}(\boldsymbol{\hat x}_{k})\right|^{2} + \lambda_{h}}\right), \)

where \(\mathcal{F}\) denotes the Fourier transform and \(\overline{\mathcal{F}(\cdot)}\) its complex conjugate.

#### 3.3.2 Optimizing *x*

Writing the blurry image **y** in vector form, denoted by \(\boldsymbol {Y}\in \mathbb {R}^{N}\), and rewriting the convolution of the blur kernel and the latent image in matrix-vector form, the minimization over **x** in (14) can be expressed as

\( \min_{\boldsymbol{X}} \left\|\mathbf{H}_{k+1}\boldsymbol{X}-\boldsymbol{Y}\right\|_{2}^{2} + \lambda_{c}\sum_{j}\left\|\mathbf{Q}_{j}\boldsymbol{X}-\mathbf{D}\boldsymbol{\alpha}_{j}\right\|_{2}^{2} + \lambda_{s}\sum_{j}\Big\|\mathbf{Q}_{j}\boldsymbol{X}-\sum_{i\in\mathcal{S}_{j}}{w_{i}^{j}}\mathbf{R}_{i}\boldsymbol{X}^{a}\Big\|_{2}^{2} + \lambda_{g}\left(\left\|\mathbf{G}_{x}\boldsymbol{X}\right\|_{2}^{2}+\left\|\mathbf{G}_{y}\boldsymbol{X}\right\|_{2}^{2}\right), \qquad (15) \)

where **G**_{x} and \(\mathbf {G}_{y}\in \mathbb {R}^{N\times N}\) are the matrix forms of the partial derivative operators *∂*_{x} and *∂*_{y} in the two directions respectively, and \(\mathbf {H}_{k + 1}\in \mathbb {R}^{N\times N}\) is the blur matrix. Setting the derivative of (15) with respect to **X** to zero and letting \(\mathbf {G}=\mathbf {G}_{x}^{\mathrm {T}}\mathbf {G}_{x} + \mathbf {G}_{y}^{\mathrm {T}}\mathbf {G}_{y}\), we derive

\( \left(\mathbf{H}_{k+1}^{\mathrm{T}}\mathbf{H}_{k+1} + \lambda_{g}\mathbf{G} + (\lambda_{c}+\lambda_{s})\sum_{j}\mathbf{Q}_{j}^{\mathrm{T}}\mathbf{Q}_{j}\right)\boldsymbol{\hat X}_{k+1} = \mathbf{H}_{k+1}^{\mathrm{T}}\boldsymbol{Y} + \lambda_{c}\sum_{j}\mathbf{Q}_{j}^{\mathrm{T}}\mathbf{D}\boldsymbol{\alpha}_{j} + \lambda_{s}\sum_{j}\mathbf{Q}_{j}^{\mathrm{T}}\sum_{i\in\mathcal{S}_{j}}{w_{i}^{j}}\mathbf{R}_{i}\boldsymbol{\hat X}^{a}_{k+1}. \qquad (16) \)

Since the sparse representation coefficients **α**_{j} and the down-sampled image \(\boldsymbol {\hat X}^{a}_{k + 1}\) on the right-hand side of (16) depend on the unknown \(\boldsymbol {\hat X}_{k + 1}\), no closed-form solution is available for (16). We solve (16) approximately with the following procedure:

- 1) **Reconstruct** **Z**_{c} **through sparse reconstruction**

We compute the sparse representation coefficients **α**_{j} over the dictionary **D** by approximately solving (5) with the OMP algorithm. Other algorithms solve a convex relaxation of the problem obtained by replacing the *ℓ*_{0}-norm with an *ℓ*_{1}-norm, called *ℓ*_{1}-minimization algorithms. Yang et al. have shown through experiments that the OMP algorithm outperforms all *ℓ*_{1}-minimization algorithms in terms of success rate in the ideal scenario where the data noise is low, and is still effective for signals with high sparsity when the data are noisy [33]. In our method, we have three reasons for directly solving the *ℓ*_{0} minimization: first, the sparse representation problem is solved separately, as formulated in (17); second, the sparsity constraint parameter is extremely low relative to the size of the dictionary; and third, the OMP algorithm has simple, fast implementations [28].

Since **α**_{j} on the right-hand side of (16) depends on the unknown \(\boldsymbol {\hat X}_{k + 1}\), we approximate \({\boldsymbol {\hat X}}_{k + 1}\) using \({\boldsymbol {\hat X}}_{k}\) and solve for the sparse representation coefficient \( {\boldsymbol {\hat \alpha }}_{j} \) over the dictionary **D** as follows:

\( \min_{\boldsymbol{\alpha}_{j}} \left\|\mathbf{Q}_{j}\boldsymbol{\hat X}_{k} - \mathbf{D}\boldsymbol{\alpha}_{j}\right\|_{2}^{2} \quad \text{s.t.} \quad \|\boldsymbol{\alpha}_{j}\|_{0} \le T. \qquad (17) \)

Each patch \(\mathbf {Q}_{j}\hat {\boldsymbol {X}}_{k}\) is then represented sparsely over **D** with the coefficient \(\boldsymbol {\hat \alpha }_{j}\), that is, \(\mathbf {Q}_{j}\hat {\boldsymbol {X}}_{k}=\mathbf {D}\boldsymbol {\hat \alpha }_{j}\). The whole image can be reconstructed by averaging all reconstructed image patches \(\mathbf {D}\boldsymbol {\hat \alpha }_{j}\), such that

\( \mathbf{Z}_{c} = \frac{1}{n}\sum_{j}\mathbf{Q}_{j}^{\mathrm{T}}\mathbf{D}\boldsymbol{\hat \alpha}_{j}, \qquad (18) \)

where **Z**_{c} is the latent image reconstructed through sparse reconstruction alone.

- 2) **Reconstruct** **Z**_{s} **through cross-scale non-local regularization**

Similarly, since \(\boldsymbol {\hat X}^{a}_{k + 1}\) is unknown, we approximate it by the down-sampled version \(\boldsymbol {\hat X}^{a}_{k}\) of the current estimate, search for the *p* most similar cross-scale patches of each patch, and average the non-local predictions over all patches:

\( \mathbf{Z}_{s} = \frac{1}{n}\sum_{j}\mathbf{Q}_{j}^{\mathrm{T}}\sum_{i\in\mathcal{S}_{j}}{w_{i}^{j}}\mathbf{R}_{i}\boldsymbol{\hat X}^{a}_{k}, \qquad (20) \)

where **Z**_{s} is the latent image reconstructed through cross-scale non-local regularization alone.

- 3) **Given** **Z**_{c} **and** **Z**_{s}, **solve** \(\hat {\boldsymbol {x}}_{k + 1}\)

Noting that each pixel appears in *n* patches, we have \({\sum }_{j}\mathbf {Q}_{j}^{\mathrm {T}}\mathbf {Q}_{j}=n\mathbf {I}\). Replacing \({\sum }_{j}\mathbf {Q}_{j}^{\mathrm {T}}\mathbf {D}\boldsymbol {\hat \alpha }_{j}\) with *n***Z**_{c} and \({\sum }_{j}\mathbf {Q}_{j}^{\mathrm {T}}{\sum }_{i\in {\mathcal {S}}_{j}}{w_{i}^{j}}\mathbf {R}_{i}\boldsymbol {\hat X}^{a}_{k + 1}\) with *n***Z**_{s} as an approximation, (16) can be rewritten as:

\( \left(\mathbf{H}_{k+1}^{\mathrm{T}}\mathbf{H}_{k+1} + \lambda_{g}\mathbf{G} + (\lambda_{c}+\lambda_{s})\,n\,\mathbf{I}\right)\boldsymbol{\hat X}_{k+1} = \mathbf{H}_{k+1}^{\mathrm{T}}\boldsymbol{Y} + \lambda_{c}\,n\,\mathbf{Z}_{c} + \lambda_{s}\,n\,\mathbf{Z}_{s}, \qquad (21) \)

where *n* is the size of the image patch and **I** is the identity matrix of size *N*; substituting (18), (20) and \({\sum }_{j}\mathbf {Q}_{j}^{\mathrm {T}}\mathbf {Q}_{j}=n\mathbf {I}\) into (16) leads to (21). Since (21) is a linear equation with respect to \({\boldsymbol {\hat X}}_{k + 1}\), it can be solved by direct matrix inversion or by the conjugate gradient method. We solve it in the frequency domain, and the closed-form solution is given by:

\( \boldsymbol{\hat x}_{k+1} = \mathcal{F}^{-1}\left(\frac{\overline{\mathcal{F}(\boldsymbol{\hat h}_{k+1})}\,\mathcal{F}(\boldsymbol{y}) + \lambda_{c}\,n\,\mathcal{F}(\mathbf{z}_{c}) + \lambda_{s}\,n\,\mathcal{F}(\mathbf{z}_{s})}{\left|\mathcal{F}(\boldsymbol{\hat h}_{k+1})\right|^{2} + \lambda_{g}\left(\left|\mathcal{F}(\partial_{x})\right|^{2}+\left|\mathcal{F}(\partial_{y})\right|^{2}\right) + (\lambda_{c}+\lambda_{s})\,n}\right), \)

where **z**_{c} and **z**_{s} represent **Z**_{c} and **Z**_{s} in 2-D image form, respectively.
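The frequency-domain solve of a (21)-type linear system can be sketched as follows (our own illustration, assuming circular boundary conditions; `psf2otf` is a hypothetical helper that pads the kernel to the image size and centers it at the origin):

```python
import numpy as np

def psf2otf(psf, shape):
    """Zero-pad the kernel to the image size and circularly center it at the origin."""
    pad = np.zeros(shape)
    pad[:psf.shape[0], :psf.shape[1]] = psf
    pad = np.roll(pad, (-(psf.shape[0] // 2), -(psf.shape[1] // 2)), axis=(0, 1))
    return np.fft.fft2(pad)

def solve_x(y, h, z_c, z_s, lam_c, lam_s, lam_g, n):
    """Closed-form frequency-domain solution of
    (H^T H + lam_g G + (lam_c + lam_s) n I) X = H^T Y + lam_c n Z_c + lam_s n Z_s."""
    H = psf2otf(h, y.shape)
    Dx = psf2otf(np.array([[1.0, -1.0]]), y.shape)    # horizontal derivative filter
    Dy = psf2otf(np.array([[1.0], [-1.0]]), y.shape)  # vertical derivative filter
    num = (np.conj(H) * np.fft.fft2(y)
           + lam_c * n * np.fft.fft2(z_c)
           + lam_s * n * np.fft.fft2(z_s))
    den = (np.abs(H) ** 2
           + lam_g * (np.abs(Dx) ** 2 + np.abs(Dy) ** 2)
           + (lam_c + lam_s) * n)
    return np.real(np.fft.ifft2(num / den))
```

Because all operators are diagonalized by the FFT under circular boundary conditions, the per-frequency division replaces an \(N \times N\) matrix inversion.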

### 3.4 Implementation

To speed up convergence and handle large blurs, following most existing methods, we estimate the blur kernel in a coarse-to-fine framework. That is, we apply our blind deconvolution model, solved as in Section 3.3 by the approximate alternating iterative optimization procedure, to each level of the image pyramid constructed from the blurry image **y**. At the coarsest scale level, the latent image estimate is initialized with the observed blurry image. The intermediate latent image estimated at each coarser level is interpolated and then propagated to the next finer level as an initial estimate of the latent image, progressively refining the blur kernel estimate at higher resolutions. The intermediate latent images estimated during the iterations have no direct influence on the final deblurring result; they affect it only indirectly by contributing to the refinement of the blur kernel estimate \(\boldsymbol {\hat h}\).

At the coarsest scale level, the dictionary learning uses the down-sampled blurry image as training samples. To better represent the latent image over the learned dictionary, we update the learned dictionary using the down-sampled intermediate latent image estimate as training samples. In the implementation of our coarse-to-fine iterative framework for estimating the blur kernel, the intermediate latent image estimated at the coarser scale is directly used for training the dictionary and the dictionary is iteratively updated once for each image scale during the solution.

We estimate the blur kernel **h** following the pseudo-code outlined in Algorithm 1. We construct an image pyramid with *L* levels from the input blurry image **y**. The number of pyramid levels is chosen such that, at the coarsest scale level, the size of the blur is smaller than that of the patch used in the blur kernel estimation stage. We use the notation \({\boldsymbol {\hat x}}_{k}^{l}\) for the intermediate latent image estimate, where the superscript *l* indicates the *l*th level in the image pyramid and the subscript *k* indicates the *k*th iteration at a scale level. The blur kernel estimation starts from the coarsest scale level *l* = 1 of the image pyramid with the latent image initialized as \({\boldsymbol {\hat x}}_{0}^{1} = \boldsymbol {y}\). At each scale level *l* ∈ {1,⋯ ,*L*}, we run the iterative procedure that alternately optimizes the motion blur kernel **h** and the latent image **x**, repeated until convergence or for a fixed number of iterations. The outcome of updating the latent image at the *l*th level is then upsampled by interpolation and used as the initial estimate of the latent image for the next finer level *l* + 1 to progressively refine the motion blur kernel estimate, until the final blur kernel estimate \( {\boldsymbol {\hat h}} \) is obtained at the finest level.
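The rule for choosing the number of pyramid levels can be sketched as follows (our own illustration of the criterion stated above, assuming the effective kernel size shrinks by the scale-gap at each coarser level):

```python
def num_pyramid_levels(kernel_size, patch_size, scale_gap=4 / 3):
    """Smallest L such that the kernel, shrunk by scale_gap per level,
    becomes smaller than the patch at the coarsest level."""
    L = 1
    while kernel_size / scale_gap ** (L - 1) >= patch_size:
        L += 1
    return L
```

With the paper's settings (51 × 51 initial kernel, 5 × 5 patches, scale-gap 4/3), this criterion yields a pyramid of ten levels.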

In the blur kernel estimation process, we use the gray-scale versions of the blurry image **y** and the intermediate latent image estimate \(\boldsymbol {\hat x}\). Once the blur kernel estimate \(\boldsymbol {\hat h}\) has been obtained at the original image scale, we perform the final non-blind deconvolution with \(\boldsymbol {\hat h}\) on each color channel of **y** to obtain the deblurring result.

Finally, our method needs to perform deconvolution in the Fourier domain. To avoid ringing artifacts at the image boundaries, we process the image near the boundaries using the simple *edgetaper* command in Matlab.

## 4 Experiments

Several experiments are conducted to demonstrate the performance of our method. We first test our method on the widely used datasets introduced in [14] and [27], and make qualitative and quantitative comparisons with state-of-the-art blind deblurring methods. Then we show visual comparisons on real blurry photographs with unknown blurs. The relevant parameters of our method are set as follows: the dictionary **D** is of size *t* = 100, the sparsity constraint parameter is *T* = 4, and image patches are of size *n* = 5 × 5; the maximum number of iterations maxIters is fixed at 14 for the inner loop, and the regularization weights are empirically set to *λ*_{c} = 0.15/*n*, *λ*_{s} = 0.15/*n*, *λ*_{g} = 0.001 and *λ*_{h} = 0.0015*N*. As the down-scaling factor increases, the patches at the down-sampled scale become sharper, but fewer similar patches exist at that scale. Following the setting of [17], the image pyramid is constructed with scale-gaps of *a* = 4/3 using down-scaling with a sinc function. An additional speed-up is obtained by using the fast approximate nearest neighbor (NN) search of [18] in the blur kernel estimation stage, working with a single NN for every patch.

An additional parameter is the size of the blur kernel. Small blurs are hard to solve if the estimation is initialized with a very large kernel; conversely, large blurs will be truncated if too small a kernel is used [6]. Following the setting of [27], we do not assume that the size of the kernel is known and initialize it to 51 × 51. Experimental results on both simulated and real blurry images show that the blur kernel is generally no larger than 51 × 51 for most blurry images. Even when the input blurry image contains only a small blur, our method is still able to obtain a good deblurring result, as it is relatively insensitive to the initial setting of the kernel size.

### 4.1 Quantitative evaluation on synthetic datasets

We test our method on two publicly available datasets. The first, provided by Levin et al. [14], contains 32 images of size 255 × 255 blurred with 8 different kernels, ranging in size from 13 × 13 to 27 × 27. The blurred images with spatially invariant blur and the ground-truth kernels were captured simultaneously by locking the Z-axis rotation handle of the tripod but loosening its X and Y handles. The second dataset, provided by Sun et al. [27], comprises 640 large natural images of diverse scenes, obtained by synthetically blurring 80 high-quality images with the 8 blur kernels from [14] and adding 1% white Gaussian noise to the blurred images. We present qualitative and quantitative comparisons with the state-of-the-art blind deblurring methods [3, 4, 6, 11, 15, 17, 22, 24, 27, 31].

Deblurring quality is measured by the error ratio (ER) of the estimated kernel **h**. The smaller the ER, the better the quality. In principle, if ER = 1, the recovered kernel yields a deblurring result as good as the one obtained with the ground-truth kernel.

Quantitative comparison of various methods over the dataset of [14]

Quantitative comparison of various methods over the dataset of [27]
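The ER measure can be written down directly. The sketch below assumes the standard definition from [14], i.e. the ratio of the sum of squared differences (SSD) between each deblurred result and the ground-truth image; the variable names are ours.

```python
import numpy as np

def error_ratio(deblur_est, deblur_gt, truth):
    """Error ratio (ER): SSD of the image deblurred with the estimated
    kernel over SSD of the image deblurred with the ground-truth kernel.
    ER = 1 means the estimated kernel performs as well as the true one."""
    ssd_est = np.sum((deblur_est - truth) ** 2)
    ssd_gt = np.sum((deblur_gt - truth) ** 2)
    return ssd_est / ssd_gt
```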

### 4.2 Qualitative comparison on real images

## 5 Conclusion

In this paper, we have presented a blur kernel estimation method for blind motion deblurring that uses sparse representation and cross-scale self-similarity of image patches as priors to regularize the inverse problem of recovering the latent image. Since patches recur across scales in a single image, our priors exploit the additional information provided by cross-scale similar patches at down-sampled scales of the intermediate latent image, which are sharper and more similar to patches from the latent sharp image, through cross-scale dictionary learning and cross-scale non-local regularization. On the one hand, the cross-scale dictionary learning uses patches from the intermediate latent image estimated at the coarser level of the image pyramid as training samples and updates the dictionary once per image scale to ensure the sparsity of the latent image over this dictionary. On the other hand, the cross-scale non-local regularization constrains every patch of the intermediate latent image estimate to be as close as possible to its similar patches found in the down-sampled version, enforcing a sharp recovery of the latent image. We have extensively validated the performance of our method through experiments on both simulated and real blurry images, demonstrating that, thanks to the use of cross-scale similar patches, it effectively removes complex motion blurs from natural images and obtains satisfactory deblurring results.

## Notes

### Funding Information

This study was funded by National Natural Science Foundation of China (61501008) and Beijing Municipal Natural Science Foundation (4172002).

### Compliance with Ethical Standards

### **Conflict of interests**

The authors declare that they have no conflicts of interest.

## References

- 1. Aharon M, Elad M, Bruckstein A (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans Signal Process 54(11):4311–4322
- 2. Buades A, Coll B, Morel J-M (2005) A non-local algorithm for image denoising. In: IEEE conference on computer vision and pattern recognition (CVPR), San Diego, pp 60–65
- 3. Cho S, Lee S (2009) Fast motion deblurring. ACM Trans Graph 28(5):89–97
- 4. Cho TS, Paris S, Horn BKP, Freeman WT (2011) Blur kernel estimation using the Radon transform. In: IEEE conference on computer vision and pattern recognition (CVPR), Providence, pp 241–248
- 5. Dong W, Zhang L, Shi G, Wu X (2011) Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Trans Image Process 20(7):1838–1857
- 6. Fergus R, Singh B, Hertzmann A, Roweis ST, Freeman WT (2006) Removing camera shake from a single photograph. ACM Trans Graph 25(3):787–794
- 7. Glasner D, Bagon S, Irani M (2009) Super-resolution from a single image. In: IEEE international conference on computer vision (ICCV), Kyoto, pp 349–356
- 8. Jia J (2007) Single image motion deblurring using transparency. In: IEEE conference on computer vision and pattern recognition (CVPR), Minneapolis, pp 1–8
- 9. Jia C, Evans BL (2011) Patch-based image deconvolution via joint modeling of sparse priors. In: IEEE international conference on image processing (ICIP), Brussels, pp 681–684
- 10. Joshi N, Szeliski R, Kriegman D (2008) PSF estimation using sharp edge prediction. In: IEEE conference on computer vision and pattern recognition (CVPR), Anchorage, pp 1–8
- 11. Krishnan D, Tay T, Fergus R (2011) Blind deconvolution using a normalized sparsity measure. In: IEEE conference on computer vision and pattern recognition (CVPR), Providence, pp 233–240
- 12. Lai WS, Ding JJ, Lin YY, Chuang YY (2015) Blur kernel estimation using normalized color-line priors. In: IEEE conference on computer vision and pattern recognition (CVPR), Boston, pp 64–72
- 13. Levin A, Fergus R, Durand F, Freeman WT (2007) Image and depth from a conventional camera with a coded aperture. ACM Trans Graph 26(3)
- 14. Levin A, Weiss Y, Durand F, Freeman WT (2009) Understanding and evaluating blind deconvolution algorithms. In: IEEE conference on computer vision and pattern recognition (CVPR), Miami, pp 1964–1971
- 15. Levin A, Weiss Y, Durand F, Freeman WT (2011) Efficient marginal likelihood optimization in blind deconvolution. In: IEEE conference on computer vision and pattern recognition (CVPR), Providence, pp 2657–2664
- 16. Li H, Zhang Y, Zhang H, Zhu Y, Sun J (2012) Blind image deblurring based on sparse prior of dictionary pair. In: International conference on pattern recognition (ICPR), Tsukuba, pp 3054–3057
- 17. Michaeli T, Irani M (2014) Blind deblurring using internal patch recurrence. In: European conference on computer vision (ECCV), Zurich, pp 783–798
- 18. Olonetsky I, Avidan S (2012) TreeCANN: k-d tree coherence approximate nearest neighbor algorithm. In: European conference on computer vision (ECCV), Springer, Berlin, pp 602–615
- 19. Pan Z, Yu J, Huang H, Hu S, Zhang A, Ma H, Sun W (2013) Super-resolution based on compressive sensing and structural self-similarity for remote sensing images. IEEE Trans Geosci Remote Sens 51(9):4864–4876
- 20. Pan Z, Yu J, Hu S, Sun W (2014) Single image super resolution based on multi-scale structural self-similarity. Acta Automatica Sinica 40(4):594–603
- 21. Pan J, Sun D, Pfister H, Yang MH (2016) Blind image deblurring using dark channel prior. In: IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, pp 1628–1636
- 22. Perrone D, Favaro P (2014) Total variation blind deconvolution: the devil is in the details. In: IEEE conference on computer vision and pattern recognition (CVPR), Columbus, pp 2909–2916
- 23. Perrone D, Favaro P (2016) A clearer picture of total variation blind deconvolution. IEEE Trans Pattern Anal Mach Intell 38(6):1041–1055
- 24. Perrone D, Diethelm R, Favaro P (2015) Blind deconvolution via lower-bounded logarithmic image priors. In: International conference on energy minimization methods in computer vision and pattern recognition (EMMCVPR), Springer International Publishing, Hong Kong
- 25. Protter M, Elad M, Takeda H, Milanfar P (2009) Generalizing the nonlocal-means to super-resolution reconstruction. IEEE Trans Image Process 18(1):36–51
- 26. Shan Q, Jia J, Agarwala A (2008) High-quality motion deblurring from a single image. ACM Trans Graph 27(3):15–19
- 27. Sun L, Cho S, Wang J, Hays J (2013) Edge-based blur kernel estimation using patch priors. In: IEEE international conference on computational photography (ICCP), Cambridge, pp 1–8
- 28. Tropp JA (2004) Greed is good: algorithmic results for sparse approximation. IEEE Trans Inf Theory 50(10):2231–2242. https://doi.org/10.1109/TIT.2004.834793
- 29. Tropp JA, Gilbert AC (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans Inf Theory 53(12):4655–4666
- 30. Wang M, Yu J, Sun W (2015) Group-based hyperspectral image denoising using low rank representation. In: IEEE international conference on image processing (ICIP), pp 1623–1627
- 31. Xu L, Jia J (2010) Two-phase kernel estimation for robust motion deblurring. In: European conference on computer vision (ECCV), Part I, Springer, Berlin Heidelberg, pp 157–170
- 32. Yan Y, Ren W, Guo Y, Wang R, Cao X (2017) Image deblurring via extreme channels prior. In: IEEE conference on computer vision and pattern recognition (CVPR), Honolulu, pp 6978–6986
- 33. Yang A, Ganesh A, Sastry S, Ma Y (2010) Fast l1-minimization algorithms and an application in robust face recognition: a review. Tech. Rep. UCB/EECS-2010-13, EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-13.html
- 34. Yang J, Wright J, Huang TS, Ma Y (2010) Image super-resolution via sparse representation. IEEE Trans Image Process 19(11):2861–2873
- 35. Yu J, Chang Z, Xiao C, Sun W (2017) Blind image deblurring based on sparse representation and structural self-similarity. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), New Orleans, pp 1328–1332
- 36. Zhang H, Yang J, Zhang Y, Huang TS (2011) Sparse representation based blind image deblurring. In: IEEE international conference on multimedia and expo (ICME), Barcelona, pp 1–6
- 37. Zoran D, Weiss Y (2011) From learning models of natural image patches to whole image restoration. In: IEEE international conference on computer vision (ICCV), Barcelona, pp 479–486

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.