1 Introduction

Inverse problems are prevalent in science because science is rooted in observations, and observations are nearly always indirect and never perfect: noise is inevitable, and instruments suffer from limited bandwidth, various artifacts, and far-from-faultless acquisition components. Yet there is widespread interest in getting the most out of whatever observations one can make. Indeed, observations at the frontiers of science are typically the faintest, most blurred, noisiest and most imperfect ones.

Imaging science is no exception, and this is why, for many decades [37], imaging communities have worked tirelessly to develop effective methods of noise removal, deconvolution, tomography, segmentation, inpainting, and more, in order to extract useful measurements from imperfect data.

Fig. 1. An inverse problem: we observe the blurry, noisy data on the left. We would like to measure data from the image on the right.

Typically, inverse problems are ill-posed [26]: a solution may not exist, may not be unique, or may not depend continuously on the input data, so that small perturbations such as noise can produce wildly different solutions. A useful approach for solving inverse problems is to inject some prior information about the data, a strategy called regularization. To formalize this, we turn to statistics.

2 Statistical Interpretation of Inverse Problems

We want to estimate a statistical parameter \(\mathsf {y}\) on the basis of an observation vector \(\mathsf {x}\).

2.1 Maximum Likelihood

If f is the sampling distribution, \(f(\mathsf {x}|\mathsf {y})\) is the probability of \(\mathsf {x}\) when the population parameter is \(\mathsf {y}\). The function

$$\begin{aligned} \mathsf {y} \mapsto f(\mathsf {x}|\mathsf {y}) \end{aligned}$$
(1)

is called the likelihood. The Maximum Likelihood (ML) estimate is

$$\begin{aligned} \widehat{\mathsf {y}}_{\text {ML}} (\mathsf {x}) = \underset{\mathsf {y}}{{\text {argmax}}} ~ f(\mathsf {x}|\mathsf {y}) \end{aligned}$$
(2)

As a very common example, assume a linear observation operator \(\mathbf {H}\) (in matrix form) and additive Gaussian deviates. Maximizing the likelihood is then equivalent to maximizing

$$\begin{aligned} q(\mathsf {y}) = - \Vert \mathbf {H} \mathsf {y} - \mathsf {x} \Vert ^2_2 = - \mathsf {y}^{\intercal } \mathbf {H}^{\intercal }\mathbf {H}\mathsf {y} + 2 \mathsf {x}^{\intercal } \mathbf {H} \mathsf {y} - \mathsf {x}^{\intercal } \mathsf {x}, \end{aligned}$$
(3)

which is a quadratic form with a unique maximum (provided \(\mathbf {H}^{\intercal }\mathbf {H}\) is invertible), given by the normal equations:

$$\begin{aligned} \nabla q (\mathsf {y}) = - 2 \mathbf {H}^{\intercal }\mathbf {H} \mathsf {y} + 2 \mathbf {H}^{\intercal } \mathsf {x} = 0 \Rightarrow \widehat{\mathsf {y}}_{\text {ML}} = (\mathbf {H}^{\intercal }\mathbf {H})^{-1} \mathbf {H}^{\intercal }\mathsf {x} \end{aligned}$$
(4)

This simple least-squares formulation is very general: versions of the Maximum Likelihood solution correspond to a large class of problems in statistics and imaging, from simple linear regression to the Wiener filter [28, 39] used in signal and image restoration [38], tomography with the filtered back-projection method [36], and many others. When it can be used, the ML solution is fast and effective. However, it requires a descriptive model with few degrees of freedom and a lot of data, which is often not the case for images, because we lack such a compact model of natural images. When these hypotheses do not hold, the Bayesian Maximum A Posteriori approach can sometimes be used instead.
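As a minimal numerical sketch of (4) (the operator \(\mathbf {H}\), noise level and sizes below are illustrative stand-ins, not taken from the text), the ML estimate can be obtained with a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in linear operator (e.g. a blur matrix) and ground-truth parameter.
Q, N = 120, 100
H = rng.normal(size=(Q, N))
y_true = rng.normal(size=N)

# Observation corrupted by additive Gaussian noise.
x = H @ y_true + 0.1 * rng.normal(size=Q)

# ML estimate (4): argmin_y ||H y - x||_2^2, i.e. the normal equations.
# lstsq is preferred over forming (H^T H)^{-1} explicitly, for numerical stability.
y_ml, *_ = np.linalg.lstsq(H, x, rcond=None)

print(np.linalg.norm(y_ml - y_true))
```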

2.2 Maximum a Posteriori

Assume now that we know a prior distribution g over \(\mathsf {y}\), i.e. some a-priori information. Following Bayesian statistics, we can treat \(\mathsf {y}\) as a random variable and compute its posterior distribution via Bayes' theorem:

$$\begin{aligned} \mathsf {y} \mapsto f(\mathsf {y}|\mathsf {x}) = \frac{f(\mathsf {x}|\mathsf {y})g(\mathsf {y})}{\int _{\vartheta \in \varTheta } f(\mathsf {x}|\vartheta ) g(\vartheta )\, d\vartheta } \end{aligned}$$
(5)

The Maximum a Posteriori (MAP) estimate is then

$$\begin{aligned} \widehat{\mathsf {y}}_{\text {MAP}} (\mathsf {x}) = \underset{\mathsf {y}}{{\text {argmax}}} ~ f(\mathsf {y} | \mathsf {x}) = \underset{\mathsf {y}}{{\text {argmax}}} ~ f(\mathsf {x}|\mathsf {y})g(\mathsf {y}) \end{aligned}$$
(6)

We say that the MAP estimate is a regularization of ML: the only difference between ML and MAP is the multiplicative term \(g(\mathsf {y})\). For easier handling, we take the logarithm, which does not change the estimator since \(\log \) is monotonically increasing:

$$\begin{aligned} \widehat{\mathsf {y}}_{\text {MAP}} (\mathsf {x}) = \underset{\mathsf {y}}{{\text {argmax}}} ~ \log f(\mathsf {x}|\mathsf {y}) + \log g(\mathsf {y}). \end{aligned}$$
(7)

The first term of the right-hand side is called the log-likelihood, and the second term is the regularization. In optimization theory, a minimization is usually preferred, so we simply multiply (7) by \(-1\). In particular, the log-likelihood becomes the negative log-likelihood [4, chap. 7].
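For instance, assuming additive Gaussian noise of variance \(\sigma ^2\) and a zero-mean Gaussian prior on \(\mathbf {\Gamma } \mathsf {y}\) for some linear operator \(\mathbf {\Gamma }\) (the symbols \(\sigma \), \(\tau \) and \(\mathbf {\Gamma }\) are introduced here purely for illustration), (7) becomes a penalized least-squares problem:

$$\begin{aligned} \widehat{\mathsf {y}}_{\text {MAP}} (\mathsf {x}) = \underset{\mathsf {y}}{{\text {argmin}}} ~ \frac{1}{2\sigma ^2}\Vert \mathbf {H} \mathsf {y} - \mathsf {x} \Vert ^2_2 + \tau \Vert \mathbf {\Gamma } \mathsf {y}\Vert _2^2, \end{aligned}$$

which is, up to a rescaling of the multipliers, the Tikhonov functional of the next section.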

3 Imaging Formulations

The very brief exposition of the previous section covers the basic principle of many statistical methods, including PCA, LDA, EM, Markov Random Fields, Hidden Markov Models, up to graph-cut type methods in imaging [5]. Many details can be found in classical texts on pattern analysis [22].

In the case of imaging, the log-likelihood term is often called the data fidelity. If we assume, for instance, that an image \(\overline{\mathsf {y}} \in \mathbb {R}^N\) is corrupted by blur and noise, we observe the data \(\mathsf {x} \in \mathbb {R}^Q\) and can write:

$$\begin{aligned} \mathsf {x} = \mathbf {H} \overline{\mathsf {y}} + \mathsf {u},\qquad \mathbf {H} \in \mathbb {R}^{Q\times N}, \end{aligned}$$
(8)

where \(\mathsf {u}\) is the noise. \(\mathbf {H}\) can typically be a camera model including blur and defocus, a tomography projection matrix, an MRI analysis matrix (i.e. a Fourier transform), etc. The noise model here is additive, but with some work it is possible to express the likelihood of more complex noise models: Rician, Poisson, Poisson-Gauss [27], etc. To recover an estimate \(\widehat{\mathsf {y}}\) of \(\overline{\mathsf {y}}\), the ML estimator is, in the additive Gaussian noise case, the least-squares estimate:

$$\begin{aligned} \widehat{\mathsf {y}} = \underset{\mathsf {y}}{{\text {argmin}}} ~ \Vert \mathbf {H} \mathsf {y} - \mathsf {x} \Vert ^2_2. \end{aligned}$$
(9)

Often this estimate is not robust, in particular when \(\mathbf {H}\) is ill-conditioned. A number of MAP regularizations can be proposed. The simplest is Tikhonov regularization:

$$\begin{aligned} \widehat{\mathsf {y}} = \underset{\mathsf {y}}{{\text {argmin}}} ~ \Vert \mathbf {\Gamma } \mathsf {y}\Vert _2^2 + \lambda \Vert \mathbf {H} \mathsf {y} - \mathsf {x} \Vert ^2_2, \end{aligned}$$
(10)

where \(\lambda \) is a Lagrange multiplier, and \(\mathbf {\Gamma }\) is a linear operator, which can be the identity or a spatial gradient for instance. The corresponding quadratic prior term expresses the belief that \(\mathbf {\Gamma }\mathsf {y}\) follows a zero-centered Gaussian distribution, i.e. that \(\mathsf {y}\) is typically smooth when \(\mathbf {\Gamma }\) is a gradient. This model is very easy to optimize but not realistic for most images, although it can be related to anisotropic diffusion for denoising [3, 33] and to the Random Walker for segmentation [24].
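Setting the gradient of (10) to zero gives a closed-form solution. The following sketch illustrates it on a 1-D deconvolution toy problem (the 3-tap blur, noise level and value of \(\lambda \) are illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in 1-D "image" and a simple 3-tap blur operator H.
N = 200
y_true = np.repeat(rng.normal(size=10), 20)            # piecewise-constant signal
H = (np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)) / 3

x = H @ y_true + 0.05 * rng.normal(size=N)             # blurred, noisy observation (8)

# Gamma = forward-difference (discrete gradient) operator.
G = np.eye(N, k=1) - np.eye(N)

# Tikhonov estimate (10): argmin_y ||G y||_2^2 + lam ||H y - x||_2^2,
# whose normal equations read (G^T G + lam H^T H) y = lam H^T x.
lam = 20.0
y_tik = np.linalg.solve(G.T @ G + lam * H.T @ H, lam * H.T @ x)

print(np.linalg.norm(y_tik - y_true) / np.linalg.norm(y_true))
```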

A more interesting approach for imaging is to define a sparsity prior. If we assume \(\mathsf {y}\) to be sparsely representable, for instance in a wavelet basis, then it might be interesting to use this in a regularization prior. Ideally, one would like to use the \(\ell _0\) pseudo-norm to enforce sparsity. However, this pseudo-norm is both non-convex and non-differentiable, which makes it difficult to use in practice. A key element of compressive sensing [7] is the observation that the \(\ell _1\) norm, which is convex, is nearly as effective at promoting sparsity; this yields:

$$\begin{aligned} \widehat{\mathsf {y}} = \underset{\mathsf {y}}{{\text {argmin}}} ~ \Vert \mathbf {\Gamma } \mathsf {y}\Vert _1 + \lambda \Vert \mathbf {H} \mathsf {y} - \mathsf {x} \Vert ^2_2. \end{aligned}$$
(11)

We will now explore some of these priors. Before we can do that, we need to define linear operators that are flexible enough and well suited to imaging.

4 Linear Operators

The classical operators in continuous-domain formulations of the problems we have seen so far are the gradient and its adjoint the divergence. These can be easily discretized using finite-difference schemes [20]. Continuous and discrete versions of wavelet operators can also be considered. In the sequel, we choose to define our operators on arbitrary graphs, in the framework of discrete calculus [25].

4.1 The Incidence Matrix

Given a directed graph with N vertices and M edges, we can define the \(M \times N\) incidence matrix \(\mathbf {A}\), whose rows contain zeros except for exactly one \(+1\) and one \(-1\): \(a_{i,k} = -1\) and \(a_{i,l} = +1\) if \(e_i\) is the edge (k, l). An illustrative example is best at this point (see Fig. 3). The matrix \(\mathbf {A}\) describes the graph but can also be thought of as an operator. If \(\mathsf {p}\) is a vector of values associated to the vertices, then \(\mathbf {A}\mathsf {p}\) is its gradient, associating a value to every edge. The transpose matrix \(\mathbf {A}^\intercal \) is the adjoint operator, corresponding to the negative divergence (Fig. 2).
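A minimal sketch of the incidence matrix used as a gradient operator (the 4-vertex graph below is a made-up example, not the graph of Fig. 3):

```python
import numpy as np

# Edges of a small directed graph on 4 vertices: edge i goes from vertex k to vertex l.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
N, M = 4, len(edges)

# Incidence matrix: row i has -1 at the origin k and +1 at the destination l.
A = np.zeros((M, N))
for i, (k, l) in enumerate(edges):
    A[i, k], A[i, l] = -1.0, +1.0

p = np.array([3.0, 5.0, 4.0, 1.0])   # values attached to the vertices

grad_p = A @ p                       # one value per edge: p[l] - p[k]
div_q = -(A.T @ grad_p)              # A^T is the adjoint, i.e. the negative divergence

print(grad_p)                        # [ 2. -1. -3. -2.]
print(div_q)
```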

Fig. 2. Ideal and observed image.

Fig. 3. A small graph and its associated incidence matrix.

4.2 The Dual-Constrained Total Variation Model

Among the interesting regularizations, the Total Variation (TV) [35], or ROF model after the initials of its inventors, promotes sparsity of the gradient. In other words, it corresponds to a piecewise-constant image prior. This is of interest for various reasons, one being that it is an acceptable model for texture-free natural images. Simplified versions of the Mumford-Shah model [30] for image segmentation typically use a TV term instead of the more complex piecewise-smooth prior. In [8], the authors introduce TV formulations for image restoration in a MAP framework.

A weighted version of the TV model can be written in the following way [23], in the continuous framework:

$$\begin{aligned} \min _{u}~ \int _{\varOmega } \sqrt{w(x)}\, \vert \nabla u(x) \vert \, dx + \varPhi (u), \end{aligned}$$
(12)

with \(\varPhi \) a data-fidelity term, for instance \(\varPhi (u) = \lambda \Vert u - \mathsf {x}\Vert _2^2\) with \(\lambda \) a Lagrange multiplier. It is equivalent to the following min-max problem [10]

$$\begin{aligned} \min _{u}~ \max _{\Vert p\Vert _{\infty }\le 1}~ \int _{\varOmega } p(x) \cdot \big (\sqrt{w(x)}\, \nabla u(x)\big )\, dx + \varPhi (u), \end{aligned}$$
(13)

with p a projection (dual) vector field. Such min-max formulations are called primal-dual in optimization. The field p is introduced because the TV term is non-smooth, and the primal-dual form lends itself to fast algorithms. Constraining p can promote better results, as we will see in Sect. 6.

In discrete calculus form, we can write the same problem in this way:

$$\begin{aligned} \min _{\mathsf {u}}{\max _{\Vert \mathsf {p}\Vert _{\infty }\le 1,~\mathsf {p} \in \mathbb {R}^M}{ \mathsf {p}^{\intercal }((\mathbf {A}\mathsf {u})\cdot \sqrt{\mathsf {w}})}+\varPhi (\mathsf {u})}. \end{aligned}$$
(14)

Introducing the projection vector \(\mathbf {F} = \mathsf {p}\cdot \sqrt{\mathsf {w}} \in \mathbb {R}^M\), we can constrain \(\mathbf {F}\) to belong to a convex set \(C = \cap _{i=1}^{m-1} C_i\ne \emptyset \), where \(C_1, \ldots , C_{m-1}\) are closed convex sets of \(\mathbb {R}^M\). Given \(\mathsf {g} \in \mathbb {R}^N\), \(\mathsf {\theta }_i\in \mathbb {R}^M\) and \(\alpha \ge 1\), we take \(C_i = \{\mathbf {F}\in \mathbb {R}^M \mid \Vert \mathsf {\theta }_{i}\cdot \mathbf {F}\Vert _\alpha \le g_{i}\}\) and solve

$$\begin{aligned} \min _{\mathsf {u}\in \mathbb {R}^N}{ \sup _{\mathbf {F}\in {C}}~~\mathbf {F}^\intercal (\mathbf {A}\mathsf {u})} +\varPhi (\mathsf {u}). \end{aligned}$$
(15)

The constraints on \(\mathbf {F}\) can be interpreted as flow constraints on the vertices of the graph. For image denoising, we can for example let \(g_i\) (the i-th component of \(\mathsf {g} \in \mathbb {R}^N\)) be a weight on vertex i that decreases with the gradient magnitude of the observed image at node i. In this case:

  • Over flat areas: a weak gradient implies a large \(g_i\), hence a potentially strong flow \(F_{i,j}\) \(\rightarrow \) weak local variations of u.

  • Near contours: a strong gradient implies a small \(g_i\), hence a weak flow \(F_{i,j}\) \(\rightarrow \) large local variations of u are allowed.

This model is the dual-constrained total variation (DCTV) [17]. To optimize it, we require algorithms capable of dealing with non-differentiable convex functionals.
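A minimal sketch of this weighting for a 2-D image (the functional form of g and the parameters beta and eps are illustrative choices; the text only states that \(g_i\) decreases with the local gradient). The projection below handles a single constraint \(C_i\) with \(\alpha = 2\); the intersection of all overlapping constraints is handled by the iterative algorithms of the next section:

```python
import numpy as np

def vertex_bounds(x_img, beta=1.0, eps=1e-3):
    """Per-vertex flow bounds g_i, decreasing with the local gradient
    magnitude of the observed image (illustrative choice)."""
    gy, gx = np.gradient(x_img.astype(float))
    grad_mag = np.hypot(gx, gy)
    return beta / (eps + grad_mag)       # large g_i on flat areas, small near contours

def project_single_constraint(F_i, g_i):
    """Euclidean projection of the flows incident to one vertex onto
    the single ball ||F_i||_2 <= g_i (the alpha = 2 case)."""
    norm = np.linalg.norm(F_i)
    return F_i if norm <= g_i else F_i * (g_i / norm)
```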

5 Algorithms

Optimization algorithms are numerous, but research has mostly focused on differentiable methods: gradient descent, conjugate gradient, etc. [4], with the exception of the simplex method for linear programming [21]. However, non-differentiable optimization methods have been available at least since the 1960s. The main tool for optimizing non-differentiable convex functionals is the proximity operator [1, 14, 29, 34]. We recall here the main points.

5.1 Proximity Operator

Let \(\varGamma _0(\mathbb {R}^N)\) be the set of proper (i.e. not everywhere equal to \(+\infty \)), lower semi-continuous, convex functionals from \(\mathbb {R}^N\) to \((-\infty , +\infty ]\). Such functions are quite regular on the interior of their domain: there, they are continuous and almost everywhere differentiable. The subdifferential of \(f \in \varGamma _0(\mathbb {R}^N)\) at \(\mathsf {x}\) is given by

$$\begin{aligned} \partial f(\mathsf {x}) = \{\mathsf {u} \in \mathbb {R}^N \mid \forall \mathsf {y} \in \mathbb {R}^N, (\mathsf {y}-\mathsf {x})^\intercal \mathsf {u} + f(\mathsf {x}) \le f(\mathsf {y})\} \end{aligned}$$
(16)

This definition extends the notion of tangent, and thus of derivative, to the non-differentiable case. Where f is differentiable, the subdifferential reduces to the singleton \(\{\nabla f(\mathsf {x})\}\). At non-differentiable points, the subdifferential is a genuine set, not a single scalar or vector.
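For instance, for the absolute value on \(\mathbb {R}\), the subdifferential at the non-differentiable point 0 is a whole interval:

$$\begin{aligned} \partial |\cdot |(x) = \{-1\} \text { if } x<0, \qquad [-1, 1] \text { if } x=0, \qquad \{+1\} \text { if } x>0. \end{aligned}$$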

The proximity operator of f at \(\mathsf {x}\) is the operator \(\mathbb {R}^N \rightarrow \mathbb {R}^N\) defined by

$$\begin{aligned} {\text {prox}}_f(\mathsf {x}) = \underset{\mathsf {y} \in {\text {dom}}(f)}{{\text {argmin}}}\, f(\mathsf {y}) + \frac{1}{2}\Vert \mathsf {y} - \mathsf {x}\Vert ^2_2 \end{aligned}$$
(17)

We have the following property

$$\begin{aligned} \mathsf {p} = {\text {prox}}_f(\mathsf {x}) \Leftrightarrow \mathsf {x}-\mathsf {p} \in \partial f(\mathsf {p}), \forall (\mathsf {x},\mathsf {p}) \in \mathbb {R}^N \times \mathbb {R}^N \end{aligned}$$
(18)
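As a small sketch, the proximity operator of \(f = \gamma \Vert \cdot \Vert _1\) has the well-known closed form of soft-thresholding; the check below verifies the characterization (18) componentwise (the test vector is arbitrary):

```python
import numpy as np

def prox_l1(x, gamma):
    """Proximity operator of f = gamma * ||.||_1, i.e. soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

x = np.array([2.5, -0.3, 0.0, -4.0, 1.0])
gamma = 1.0
p = prox_l1(x, gamma)

# Property (18): x - p must belong to the subdifferential of gamma*|.| at p,
# i.e. |x_i - p_i| <= gamma everywhere, with |x_i - p_i| = gamma where p_i != 0.
assert np.all(np.abs(x - p) <= gamma + 1e-12)
assert np.allclose(np.abs((x - p)[p != 0]), gamma)
print(p)
```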

5.2 Splitting

One of the simplest cases is when one wants to optimize the sum of two functions, one of which is smooth. Let \(f_1 \in \varGamma _0(\mathbb {R}^N)\) and let \(f_2: \mathbb {R}^N \rightarrow \mathbb {R}\) be convex and differentiable with a \(\beta \)-Lipschitz gradient \(\nabla f_2\), i.e.

$$\begin{aligned} \forall (\mathsf {x},\mathsf {y}) \in \mathbb {R}^N \times \mathbb {R}^N ,\Vert \nabla f_2(\mathsf {x}) - \nabla f_2(\mathsf {y})\Vert \le \beta \Vert \mathsf {x} - \mathsf {y}\Vert , \end{aligned}$$
(19)

with \(\beta > 0\). If \(f_1(\mathsf {x}) + f_2(\mathsf {x}) \rightarrow +\infty \) when \(\Vert \mathsf {x}\Vert \rightarrow +\infty \) (i.e. \(f_1 + f_2\) is coercive), we wish to

$$\begin{aligned} \underset{\mathsf {x}\in \mathbb {R}^N}{{\text {minimize}}} ~ f_1(\mathsf {x}) + f_2(\mathsf {x}). \end{aligned}$$
(20)

It can be shown that this problem admits a solution, and that for any \(\gamma > 0\) its solutions are characterized by the fixed-point equation

$$\begin{aligned} \mathsf {x} = {\text {prox}}_{\gamma f_1} (\mathsf {x} - \gamma \nabla f_2(\mathsf {x})). \end{aligned}$$
(21)

This suggests the following explicit-implicit algorithm

$$\begin{aligned} \mathsf {x}_{n+1} = {\text {prox}}_{\gamma f_1} (\mathsf {x}_n - \gamma \nabla f_2(\mathsf {x}_n)). \end{aligned}$$
(22)

This is the forward-backward algorithm, alternating an explicit (forward) gradient-descent step with an implicit (backward) proximity-operator step. It can be shown [15] that, for step sizes \(\gamma \in (0, 2/\beta )\), this algorithm converges to a solution of (20).
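As a sketch, applying iteration (22) to the sparse recovery problem (11) with \(\mathbf {\Gamma } = \mathbf {I}\), i.e. \(f_1 = \Vert \cdot \Vert _1\) and \(f_2 = \lambda \Vert \mathbf {H}\,\cdot - \mathsf {x} \Vert _2^2\), gives the classical iterative soft-thresholding scheme. The operator, data and parameters below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in instance of problem (11) with Gamma = I.
Q, N = 80, 200
H = rng.normal(size=(Q, N)) / np.sqrt(Q)
y_true = np.zeros(N)
y_true[rng.choice(N, 10, replace=False)] = rng.normal(size=10)   # sparse signal
x = H @ y_true + 0.01 * rng.normal(size=Q)

lam = 50.0                                      # illustrative data-fidelity weight
beta = 2 * lam * np.linalg.norm(H, 2) ** 2      # Lipschitz constant of grad f2
gamma = 1.0 / beta                              # valid step size (gamma < 2/beta)

def soft(v, t):                                 # prox of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

y = np.zeros(N)
for _ in range(500):                            # forward-backward iteration (22)
    grad = 2 * lam * H.T @ (H @ y - x)          # explicit gradient step on f2
    y = soft(y - gamma * grad, gamma)           # implicit proximal step on f1
print(np.count_nonzero(np.abs(y) > 1e-6), "nonzero coefficients recovered")
```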

This fairly simple method extends well-known ones such as gradient descent and the proximal point algorithm. It can be improved, for instance by replacing the gradient-descent scheme with Nesterov's method [32], which in this case yields an optimal convergence rate [31]. The corresponding method is the Beck-Teboulle proximal gradient method [2].

5.3 Primal-Dual Methods

Many splitting methods exist, involving sums of two or more functions; they are detailed in [14]. In the case of (15), the presence of explicit constraints makes the analysis more difficult. Using convex analysis, and in particular Fenchel-Rockafellar duality, and if the graph is regular, we can optimize it using the Parallel Proximal Algorithm (PPXA) [13]. In the more interesting case of an irregular graph, a primal-dual method is necessary [9]. We actually used the algorithm detailed in [6], which has since been generalized [12, 16].

We now show some results obtained from solving (15) in various contexts.

Fig. 4. (a) Original MRI image, (b) noisy, blurred image, SNR = 12.3 dB, (c) DCTV result, SNR = 17.2 dB. (d) Original house image, (e) noisy image, PSNR = 28.1 dB, (f) DCTV result, PSNR = 35 dB. (g) Original mesh, (h) noisy mesh, (i) restored mesh.

6 Results

DCTV is a flexible framework. In Fig. 4(a,b,c) we restore a blurry, noisy version of an MRI scan using a local regular graph. This is the same image as in Fig. 1. In Fig. 4(d,e,f) we restore an image using an irregular, non-local graph. The fine texture of the brick wall has been restored to a high degree. In Fig. 4(g,h,i) we restore a noisy 3D mesh, with the same framework. Only the graph changes.

7 Discussion

The results presented here are interesting in part because we have kept the spatial part of the formulation fully discrete, with a graph representation of the numerical operators at its heart. An important caveat, however, is our assumption that image values follow a continuous distribution. In practice this is not the case, and our approach is a relaxation of reality, since images are typically quantized to 8-bit or 16-bit values. If discrete values must be kept throughout the formulation, for instance to deal with labeled images, then the approach proposed here does not apply. In this case, MRF formulations could be used [18, 19].

We have also kept the discussion within the convex framework. Many important problems are not convex, for instance blind deblurring, where the degradation kernel must be estimated at the same time as the restoration. There exist methods for dealing with non-convex image restoration problems, for instance [11], but dealing with non-convexity and non-differentiability together remains a challenge in the general case.

8 Conclusion

In this short overview article, we have introduced inverse problems in imaging and a statistical interpretation: the MAP principle for solving inverse problems such as image restoration using a-priori information. We have shown how we can use a graph formulation of numerical operators using discrete calculus to propose a general framework for image restoration. This DCTV framework can be optimized using non-differentiable convex optimization techniques, in particular the proximity operator. We have illustrated this approach on several examples.

DCTV is by no means the only framework available but it is one of the most flexible, fast and effective. With small changes we can tackle very different problems such as mesh or point cloud regularization. In general, the combination of powerful optimization methods, graph representations of spatial information and fast algorithms is a compelling approach for many applications.