1 Introduction

1.1 What Is an Inverse Problem?

Generally speaking, inverse problems consist in the reconstruction of causes for observed effects. In imaging applications the cause is usually a probe and the effect is the observed data. The corresponding forward problem then consists in predicting the experimental data given perfect knowledge of the probe. In some sense solving an inverse problem means “computing backwards”, which is usually more difficult than solving the forward problem.

To model this kind of problem mathematically we describe the imaging system or experimental setup by a forward operator \(F:\mathbb {X}\rightarrow \mathbb {Y}\) between Banach spaces \(\mathbb {X}, \mathbb {Y}\), which maps a probe \(f\in \mathbb {X}\) to the corresponding effect \(g\in \mathbb {Y}\). Then the inverse problem is, given data \(g\in \mathbb {Y}\), to find a solution \(f\in \mathbb {X}\) to the equation

$$\begin{aligned} F(f)=g. \end{aligned}$$
(5.1)

1.2 Ill-Posedness and Regularization

A first obvious question to ask is whether or not a probe f is uniquely determined by the data g, i.e. if the operator F is injective. Such questions are addressed e.g. in Sect. 13.6.6. Given uniqueness, one might try to find the solution \(f\) to (5.1) by just applying the inverse operator \(F^{-1}\), which gives another reason to call the problem an inverse problem. However, in practice this can cause several difficulties, since the problem is typically ill-posed.

According to J. Hadamard a problem is called well-posed if the following conditions are satisfied:

  1. There exists a solution.

  2. The solution is unique.

  3. The solution depends continuously on the data (stability).

Otherwise the problem is called ill-posed.

An inverse problem in the form of an operator equation (5.1) is well-posed if F is surjective (such that for all g there exists a solution), injective (such that the solution is unique) and if \(F^{-1}\) is continuous (guaranteeing stability). For many inverse problems in practice only the third condition is violated, and ill-posedness in the narrower sense often refers to this situation: The reconstruction of causes from observed effects is unstable since very different causes may have similar effects.

The remedy against ill-posedness is regularization: To obtain stable solutions of inverse problems one constructs a family of continuous operators \(R_{\alpha }:\mathbb {Y}\rightarrow \mathbb {X}\) parameterized by a parameter \(\alpha >0\) converging pointwise to the discontinuous inverse \(F^{-1}\):

$$\begin{aligned} R_{\alpha }\approx F^{-1},\qquad \lim _{\alpha \rightarrow 0}R_\alpha (F(f)) = f\qquad \text{ for } \text{ all } f\in \mathbb {X}. \end{aligned}$$
(5.2)

We will discuss several generic constructions of such families of stable approximate inverses \(R_{\alpha }\) in Sect. 5.2.

1.3 Examples

1.3.1 Numerical Differentiation

In our first example we consider the forward operator given by integration. We fix the free integration constant by working in spaces of functions with mean zero, \(L^2_\diamond ([0,1]):=\{f\in L^2([0,1]):\int _0^1 f(x)\mathop {}\!\mathrm {d}x =0\}\). Let \(F:L^2_\diamond ([0,1])\rightarrow L^2_\diamond ([0,1])\) be given by

$$\begin{aligned} F(f)(y)=\int _0^y f(x)\mathop {}\!\mathrm {d}x +c(f), \qquad y\in [0,1], \end{aligned}$$
(5.3)

where c(f) is such that \(F(f)\in L^2_\diamond \). The corresponding inverse problem described by the operator equation (5.1) is to compute the derivative \(g'\). If \(g=F(f)\), then the existence of a solution to the inverse problem is guaranteed and this solution is unique. Now assume that instead of the exact data g we are given noisy data \({g^\mathrm {obs}}\) that fulfill

$$\begin{aligned} \Vert {{g^\mathrm {obs}}-g}\Vert _{L^2}\le \delta , \end{aligned}$$

for some small noise level \(\delta >0\). For example this could be one of the functions

$$\begin{aligned} {g_n^\mathrm {obs}}(x)=g(x)+\delta \sqrt{2}\sin (\pi nx). \end{aligned}$$

As the derivatives are given by \(({g_n^\mathrm {obs}})'(x)=g'(x)+\pi n\delta \sqrt{2}\cos (\pi nx)\), we have

$$\begin{aligned} \Vert {({g_n^\mathrm {obs}})'-g'}\Vert _{L^2} = \pi n\delta . \end{aligned}$$

This illustrates the typical ill-posedness of inverse problems: amplification of noise may be arbitrarily large for naive application of the inverse \(F^{-1}\). This is the main difficulty one has to cope with. Our toy example illustrates another typical feature of inverse problems: Each function in the image of the operator F defined above is at least once differentiable. Also for many other inverse problems the forward operator is smoothing in the sense that the output function has higher smoothness than the input function, and this property causes the instability of the inverse problem.
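The following short computation illustrates this noise amplification numerically. It is only a sketch: the grid, the choice \(f(x)=\cos (2\pi x)\) and the use of finite differences in place of the exact derivative are illustrative assumptions, not part of the discussion above.

```python
import numpy as np

# Sketch: noise amplification in numerical differentiation.
# We perturb exact data g = F(f) by delta*sqrt(2)*sin(pi*n*x) and differentiate
# numerically; the L2 error of the derivative grows roughly like pi*n*delta.
x = np.linspace(0.0, 1.0, 2001)
f = np.cos(2 * np.pi * x)                  # "true" cause (mean zero)
g = np.sin(2 * np.pi * x) / (2 * np.pi)    # exact effect g = F(f)

delta = 1e-3
for n in (1, 10, 100):
    g_obs = g + delta * np.sqrt(2) * np.sin(np.pi * n * x)
    d_obs = np.gradient(g_obs, x)          # naive application of F^{-1}
    err = np.sqrt(np.mean((d_obs - f) ** 2))   # discrete L2([0,1]) norm (uniform grid)
    print(f"n = {n:3d}:  L2 error of derivative = {err:.3e}   (pi*n*delta = {np.pi * n * delta:.3e})")
```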

1.3.2 Fluorescence Microscopy

In fluorescence microscopy one is interested in recovering the density \(f\) of fluorescent markers in some specimen in \(\mathbb {R}^3\). The probe is scanned by a laser beam, and one detects fluorescent photons. In confocal microscopy, spatial resolution is achieved by focusing the laser beam with a lens and collecting the fluorescent photons through the same lens, such that out-of-focus fluorescent photons can be blocked by a pinhole.

Let \(y\in \mathbb {R}^3\) be the focal point and assume that the probability (density) of detecting a fluorescent photon emitted by a marker at the point \(x\in \mathbb {R}^3\) when the laser is focused at \(y\in \mathbb {R}^3\) is \(k(x-y)\). The function k is called the point-spread function, and we assume here that it is spatially invariant. Then our problem is described by the operator equation

$$\begin{aligned} g(y)=F(f)(y)=\int k(y-x)f(x)\mathop {}\!\mathrm {d}x, \end{aligned}$$

i.e. the observation \(g\) is given by a convolution of the marker density \(f\) with point spread function k. As convolution will usually blur an image, the forward operator is smoothing. Smoother kernels will lead to stronger smoothing.
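A one-dimensional toy computation may help to visualize this smoothing effect. It is only a sketch: the Gaussian point-spread function, its width and the box-shaped marker density are illustrative assumptions, and the actual confocal model is of course three-dimensional.

```python
import numpy as np

# Toy 1D sketch of the convolution forward operator F(f) = k * f.
x = np.linspace(-1.0, 1.0, 1001)
dx = x[1] - x[0]
k = np.exp(-x**2 / (2 * 0.05**2))
k /= (k.sum() * dx)                          # normalize the kernel to integral 1

f = (np.abs(x) < 0.2).astype(float)          # marker density with sharp edges
g = dx * np.convolve(f, k, mode="same")      # blurred observation g = F(f)

# the jump in f is smeared out: the maximal slope drops dramatically
print("max |f'| ~", np.max(np.abs(np.gradient(f, x))))
print("max |g'| ~", np.max(np.abs(np.gradient(g, x))))
```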

1.4 Choice of Regularization Parameters and Convergence Concepts

Due to the ill-posedness discussed above it is essential to take into account the effects of noise in the observed data. Let \(f^{\dagger }\in \mathbb {X}\) denote the unknown exact solution, and first assume data \(g^{\delta }\) with deterministic errors such that

$$\begin{aligned} \Vert g^{\delta }-F(f^{\dagger })\Vert _{\mathbb {Y}}\le \delta . \end{aligned}$$
(5.4)

As mentioned above, regularization of an ill-posed operator equation (5.1) with an injective operator F consists in approximating the discontinuous inverse operator \(F^{-1}\) by a pointwise convergent family of continuous operators \(R_\alpha :\mathbb {Y}\rightarrow \mathbb {X}\), \(\alpha >0\). This immediately gives rise to the question of which operator in the family should be chosen for the reconstruction, i.e. how to choose the parameter \(\alpha \). Usually the starting point of deterministic error analysis in regularization theory is the following splitting of the reconstruction error:

$$\begin{aligned} \Vert R_{\alpha }(g^{\delta })-f^{\dagger }\Vert \le \Vert R_{\alpha }(g^{\delta })-R_{\alpha }(F(f^{\dagger }))\Vert + \Vert R_{\alpha }(F(f^{\dagger }))-f^{\dagger }\Vert . \end{aligned}$$
(5.5)

The first term on the right hand side is called propagated data noise error, and the second term is referred to as approximation error or bias.  Due to pointwise convergence (see (5.2)), the bias tends to 0 as \(\alpha \rightarrow 0\). Hence, to control this error term, we should choose \(\alpha \) as small as possible. However, as \(R_{\alpha }\) converges pointwise to the discontinuous operator \(F^{-1}\), the Lipschitz constant (or operator norm in the linear case) of \(R_{\alpha }\) will explode as \(\alpha \rightarrow 0\), and hence also the propagated data noise error. Therefore, \(\alpha \) must not be chosen too small. This indicates that the choice of the regularization parameter must be a crucial ingredient of a reliable regularization method. Probably the most well-known parameter choice rule in the deterministic setting is Morozov’s discrepancy principle:

$$\begin{aligned} \overline{\alpha }_{\mathrm {DP}}(\delta ,g^{\delta }) := \sup \{\alpha >0 : \Vert F(R_{\alpha }(g^{\delta }))-g^{\delta }\Vert \le \tau \delta \} \end{aligned}$$
(5.6)

with some parameter \(\tau \ge 1\). In other words, among all estimators \(R_{\alpha }(g^{\delta })\) which can explain the data within the noise level (times \(\tau \)), we choose the most stable one.
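As a concrete illustration, the following sketch applies the discrepancy principle to quadratic Tikhonov regularization \(R_{\alpha }(g)=(T^*T+\alpha I)^{-1}T^*g\) (cf. (5.10) in Sect. 5.2.1 with \(\mathbf {f}_0=0\)) of a small linear toy problem. The matrix, the noise level and \(\tau =1.1\) are illustrative choices, and the supremum in (5.6) is approximated by scanning a geometric grid of values of \(\alpha \).

```python
import numpy as np

# Sketch of Morozov's discrepancy principle (5.6) for quadratic Tikhonov
# regularization of an ill-conditioned linear toy problem.
rng = np.random.default_rng(0)
n = 50
T = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])  # Hilbert matrix
f_true = np.sin(np.linspace(0, np.pi, n))
delta = 1e-3
noise = rng.standard_normal(n)
g_obs = T @ f_true + delta * noise / np.linalg.norm(noise)

def R_alpha(alpha):                          # Tikhonov reconstruction operator
    return np.linalg.solve(T.T @ T + alpha * np.eye(n), T.T @ g_obs)

tau, alpha = 1.1, 1.0
while np.linalg.norm(T @ R_alpha(alpha) - g_obs) > tau * delta and alpha > 1e-14:
    alpha *= 0.5                             # largest alpha explaining the data within tau*delta
print("discrepancy principle selects alpha ~", alpha)
print("reconstruction error:", np.linalg.norm(R_alpha(alpha) - f_true))
```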

Definition 5.1

A family of operators \(R_{\alpha }:\mathbb {Y}\rightarrow \mathbb {X}\) parameterized by a parameter \(\alpha >0\) together with some rule \(\overline{\alpha }:[0,\infty )\times \mathbb {Y}\rightarrow (0,\infty )\) how to choose this parameter depending on the noise level \(\delta \) and the data \(g^{\delta }\) is called a regularization method if the worst case error tends to 0 with the noise level in the sense that

$$\begin{aligned} \lim _{\delta \rightarrow 0} \sup \left\{ \Vert R_{\overline{\alpha }(\delta ,g^{\delta })}(g^{\delta })-f^{\dagger }\Vert _{\mathbb {X}}\,:\, g^{\delta }\in \mathbb {Y}, \Vert g^{\delta }-F(f^{\dagger })\Vert _\mathbb {Y}\le \delta \right\} =0 \end{aligned}$$
(5.7)

for all \(f^{\dagger }\in \mathbb {X}\).

The convergence (5.7) is a minimal requirement that one expects from a regularization method. However, it can be shown that for ill-posed problems this convergence may be arbitrarily slow depending on \(f^{\dagger }\). This is of course not satisfactory. Fortunately, a-priori information on the solution \(f^{\dagger }\), which is often available, may help. If we know a-priori that \(f^{\dagger }\) belongs to some set \(\mathcal {K}\subset \mathbb {X}\), then it is often possible to derive explicit error bounds

$$ \sup \left\{ \Vert R_{\overline{\alpha }(\delta ,g^{\delta })}(g^{\delta })-f^{\dagger }\Vert _{\mathbb {X}}\,:\, g^{\delta }\in \mathbb {Y}, \Vert g^{\delta }-F(f^{\dagger })\Vert _\mathbb {Y}\le \delta \right\} \le \psi (\delta ) $$

for all \(f^{\dagger }\in \mathcal {K}\) with a function \(\psi \in C([0,\infty ))\) satisfying \(\psi (0)=0\).

Let us now consider statistical noise models instead of the deterministic noise model (5.4). Often statistical data \(G_t\) belong to a different space \(\mathbb {Y}'\), e.g. a space of distributions. The distribution depends on some parameter t, and we assume that \(G_t\rightarrow F(f^{\dagger })\) in some sense as \(t\rightarrow \infty \). For example, t may denote the number of observations or, in photonic imaging, the expected number of photons. As the estimator \(R_{\alpha }(g^{\mathrm {obs}})\) (where now \(R_{\alpha }\) is a mapping from \(\mathbb {Y}'\) to \(\mathbb {X}\)) will be a random variable, we have to use stochastic concepts of convergence, e.g. convergence in expectation. Other convergence concepts, in particular convergence in probability, are also used frequently.

Definition 5.2

In the setting above, a family of operators \(R_{\alpha }:\mathbb {Y}'\rightarrow \mathbb {X}\) parameterized by a parameter \(\alpha >0\) together with some parameter choice rule \(\overline{\alpha }:[0,\infty )\times \mathbb {Y}'\rightarrow (0,\infty )\) is called a consistent estimator if

$$ \lim _{t\rightarrow \infty } \mathbb {E}\left[ \Vert R_{\overline{\alpha }(t,G_t)}(G_t)-f^{\dagger }\Vert _{\mathbb {X}}^2\right] =0 $$

for all \(f^{\dagger }\in \mathbb {X}\).

Again one may further ask not only for convergence, but even rates of convergence as \(t\rightarrow \infty \) on certain subsets \(\mathcal {K}\subset \mathbb {X}\).

2 Regularization Methods

In this section we will discuss generalized Tikhonov regularization, which is given by finding the minimum of

$$\begin{aligned} \hat{f}_\alpha \in {{\,\mathrm{argmin\,}\,}}_{f\in \mathbb {X}} \left[ \mathcal {S}_{{g^\mathrm {obs}}} (F(f))+\alpha \mathcal {R}(f)\right] , \end{aligned}$$
(5.8)

where \(\mathcal {S}_{{g^\mathrm {obs}}}\) is the data fidelity functional, which measures some kind of distance between \(F(f)\) and the data \({g^\mathrm {obs}}\) and causes the minimizer \(\hat{f}_\alpha \) to still explain the data well, whereas \(\mathcal {R}\) is the penalty functional which penalizes undesired properties of the minimizer. This approach is called variational regularization, as our regularized solution is found by minimization of a functional (usually an integral functional).

2.1 Variational Regularization

We start with a probabilistic motivation of generalized Tikhonov regularization. From here until the end of this section we will consider the finite-dimensional setting \(\mathbb {X}=\mathbb {R}^n\), \(\mathbb {Y}=\mathbb {R}^m\) and use boldface symbols to denote finite-dimensional vectors/mappings. We start from (5.1), where \(\mathbf {f}\in \mathbb {R}^n\), \(\mathbf {g}\in \mathbb {R}^m\) and \(F:\mathbb {R}^n\rightarrow \mathbb {R}^m\) is some injective function. Given the data \(\mathbf {g}\) we want to find the solution \(\mathbf {f}\) of \(F(\mathbf {f})=\mathbf {g}\), but recall that we cannot just apply the inverse \(F^{-1}\) as discussed in Sect. 5.1. Instead we might estimate \(\mathbf {f}\) by maximizing the likelihood function \(\mathcal {L}(\mathbf {f})=P(\mathbf {g}|\mathbf {f})\), i.e. the probability that the data \(\mathbf {g}\) occur for a given preimage \(\mathbf {f}\). If we assume that our data are normally distributed with covariance matrix \(\sigma ^2I\), then we can rearrange the problem by using the monotonicity of the logarithm, as well as the fact that neither additive nor multiplicative constants change the extremal point:

$$\begin{aligned} \hat{\mathbf {f}}_\mathrm{{ML}}&\in {{\,\mathrm{argmax\,}\,}}_{\mathbf {f}\in \mathbb {R}^n} P(\mathbf {g}|\mathbf {f}) ={{\,\mathrm{argmax\,}\,}}_{\mathbf {f}\in \mathbb {R}^n} \log \left( {P(\mathbf {g}|\mathbf {f})}\right) \\&={{\,\mathrm{argmax\,}\,}}_{\mathbf {f}\in \mathbb {R}^n} \log \left( {\prod _{i=1}^m\frac{1}{\sqrt{2\pi }\sigma }\exp \left( {\frac{-(\mathbf {g}_i-F(\mathbf {f})_i)^2}{2\sigma ^2}}\right) }\right) \\&={{\,\mathrm{argmin\,}\,}}_{\mathbf {f}\in \mathbb {R}^n} \frac{1}{2\sigma ^2} \sum _{i=1}^m(\mathbf {g}_i-F(\mathbf {f})_i)^2 ={{\,\mathrm{argmin\,}\,}}_{\mathbf {f}\in \mathbb {R}^n} \frac{1}{2\sigma ^2}\Vert {\mathbf {g}-F(\mathbf {f})}\Vert _2^2. \end{aligned}$$

This demonstrates the well-known fact that the maximum likelihood approach for Gaussian noise yields the least squares method, first used by the young Gauss to predict the path of the asteroid Ceres in 1801. However, as \(\hat{\mathbf {f}}_\mathrm{{ML}}= F^{-1}(\mathbf {g})\) for \(\mathbf {g}\) in the range of F, this approach has no regularizing effect. In fact it is more reasonable to maximize \(P(\mathbf {f}|\mathbf {g})\) instead of \(P(\mathbf {g}|\mathbf {f})\), as our goal should be to find the solution \(\mathbf {f}\) which is most likely to have caused the observation \(\mathbf {g}\), instead of just finding any \(\mathbf {f}\) which causes the observation \(\mathbf {g}\) with maximal probability. This leads to the Bayesian perspective on inverse problems, with the characteristic feature that prior to the measurements a probability distribution (the so-called prior distribution) is assigned to the solution space \(\mathbb {X}\), modeling our prior knowledge on \(\mathbf {f}\). By Bayes’ theorem we have

$$\begin{aligned} P({\mathbf {f}}|{\mathbf {g}})=\frac{P({\mathbf {g}}|{\mathbf {f}})P({\mathbf {f}})}{P({\mathbf {g}})} \quad \Leftrightarrow \quad {\mathrm {posterior}}=\frac{{\mathrm {likelihood}}\cdot {\mathrm {prior}}}{\mathrm{{evidence}}}. \end{aligned}$$

Estimating \(\mathbf {f}\) by maximizing the posterior \(P(\mathbf {f}|\mathbf {g})\) is called the maximum a posteriori probability (MAP) estimate. To use this approach we have to model the prior \(P(\mathbf {f})\). If we assume that \(\mathbf {f}\) is normally distributed with mean \(\mathbf {f}_0\in \mathbb {R}^n\) and covariance matrix \(\tau ^2 I\), then we find

$$\begin{aligned} \hat{\mathbf {f}}_\mathrm{{MAP}}&\in {{\,\mathrm{argmax\,}\,}}_{\mathbf {f}\in \mathbb {R}^n} P(\mathbf {f}|\mathbf {g}) = {{\,\mathrm{argmax\,}\,}}_{\mathbf {f}\in \mathbb {R}^n}\left[ { \log \left( {P(\mathbf {g}|\mathbf {f})}\right) +\log \left( {P(\mathbf {f})}\right) }\right] \\&= {{\,\mathrm{argmin\,}\,}}_{\mathbf {f}\in \mathbb {R}^n} \frac{1}{2\sigma ^2}\left[ {\sum _{i=1}^m(\mathbf {g}_i-F(\mathbf {f})_i)^2+\frac{\sigma ^2}{\tau ^2}\sum _{j=1}^n(\mathbf {f}_j-(\mathbf {f}_0)_j)^2}\right] \\&={{\,\mathrm{argmin\,}\,}}_{\mathbf {f}\in \mathbb {R}^n} \left[ {\underbrace{\tfrac{1}{2}\Vert {\mathbf {g}-F(\mathbf {f})}\Vert _2^2+\tfrac{\alpha }{2}\Vert {\mathbf {f}-\mathbf {f}_0}\Vert _2^2}_{J_\alpha (\mathbf {f})}}\right] \end{aligned}$$

where \(\alpha =\frac{\sigma ^2}{\tau ^2}\). The functional \(J_\alpha (\mathbf {f})\) is the standard (quadratic) Tikhonov functional, and therefore MAP and Tikhonov regularization coincide in this setting.

In photonic imaging the data is often given by photon counts, and these are typically Poisson distributed in the absence of read-out error (compare Chap. 4). Recall that a random variable \(Z\in \mathbb {N}_0\) is called Poisson distributed with mean \(\lambda >0\), short \(Z\sim \mathrm {Pois}(\lambda )\), if \(P(Z=g) = e^{-\lambda }\lambda ^{g}/(g!)\) for all \(g\in \mathbb {N}_0\). Hence the negative log-likelihood function is given by

$$ -\log P(Z=g|\lambda ) = \lambda - g\log (\lambda )+\log g! = \lambda -g + g\log \left( \frac{g}{\lambda }\right) + C_g $$

where \(C_g\) is a constant independent of \(\lambda \). Now assume that \(\mathbf {g}\) is a vector of independent Poisson distributed random variables such that \(\mathbf {g}_i \sim \mathrm {Pois}(F(\mathbf {f})_i)\). It follows that the negative log-likelihood \(-\log (P(\mathbf {g}|\mathbf {f})) = -\sum _i \log P(\mathbf {g}_i|F(\mathbf {f})_i)\) is given (up to an additive constant independent of \(\mathbf {f}\)) by the Kullback-Leibler divergence

$$\begin{aligned} {{\,\mathrm{KL}\,}}(\mathbf {g},F(\mathbf {f})):=\sum _{i=1}^m \left[ F(\mathbf {f})_i -\mathbf {g}_i+ \mathbf {g}_i\log \left( {\frac{\mathbf {g}_i}{F(\mathbf {f})_i}}\right) \right] . \end{aligned}$$

If \(\mathbf {f}\) has a probability distribution \(P(\mathbf {f}) = c\exp (-\mathcal {R}(\mathbf {f})/\tau ^2)\), this leads to generalized Tikhonov regularization

$$\begin{aligned} \hat{\mathbf {f}}_\alpha \in {{\,\mathrm{argmin\,}\,}}_{\mathbf {f}\in \mathbb {R}^n} \left[ \mathcal {S}_{\mathbf {g}}(F(\mathbf {f}))+\alpha \mathcal {R}(\mathbf {f})\right] \end{aligned}$$
(5.9)

with fidelity term \(\mathcal {S}_{\mathbf {g}}=\Vert {\mathbf {g}-\cdot }\Vert _2^2\) for normally distributed data, \(\mathcal {S}_{\mathbf {g}}={{\,\mathrm{KL}\,}}(\mathbf {g},\cdot )\) for Poisson distributed data and the penalty term \(\alpha \mathcal {R}\) with regularization parameter \(\alpha = \tau ^{-2}\) for Poisson data and \(\alpha = \sigma ^2/\tau ^2\) for Gaussian white noise.
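The relation between the Kullback-Leibler divergence and the Poisson negative log-likelihood can be checked numerically. In the following sketch the counts \(\mathbf {g}\) and the predicted intensities \(F(\mathbf {f})\) are made-up numbers, and the f-independent constant is evaluated as \(-\log P(\mathbf {g}\mid \lambda =\mathbf {g})\).

```python
import numpy as np
from scipy.stats import poisson

# KL(g, F(f)) equals the negative Poisson log-likelihood up to a constant
# that does not depend on f (with the convention 0*log(0/x) = 0).
def kl_div(g, Ff):
    g, Ff = np.asarray(g, float), np.asarray(Ff, float)
    safe_g = np.where(g > 0, g, 1.0)                 # avoid log(0); those terms are zeroed below
    return np.sum(Ff - g + np.where(g > 0, g * np.log(safe_g / Ff), 0.0))

g  = np.array([3, 0, 7, 2])            # observed photon counts (made up)
Ff = np.array([2.5, 0.4, 6.0, 3.0])    # predicted intensities F(f)_i (made up)

neg_loglik = -np.sum(poisson.logpmf(g, Ff))
constant   = -np.sum(poisson.logpmf(g, np.maximum(g, 1e-12)))   # sum of the constants C_g
print(kl_div(g, Ff), "=", neg_loglik - constant)
```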

Note that in the Bayesian setting above the regularization parameter \(\alpha \) is uniquely determined by the prior distribution and the likelihood functional. However, often only qualitative prior knowledge on the solution is available, but the parameter \(\tau \) is unknown. Then \(\alpha \) has to be determined by a-posteriori parameter choice rules analogous to the discrepancy principle (5.6) for deterministic errors.

Let us discuss a few popular choices of the penalty functional \(\mathcal {R}\), which allows the incorporation of prior information on the solution or is simply the negative logarithm of the density of the prior in the above Bayesian setting. For a-priori known sparsity of the solution one should choose the sparsity enforcing penalty \(\mathcal {R}(\mathbf {f})=\Vert {\mathbf {f}}\Vert _1\). If \(\mathbf {f}\) is an image with sharp edges, then the total variation seminorm is a good choice of \(\mathcal {R}\). However, we point out that for the total variation seminorm in a Bayesian setting there exists no straightforward useful infinite dimensional limit (see [16]). Bayesian prior modelling in infinite dimensional settings is often considerably more involved.

If \(\mathbf {f}\) is a probability density, a frequent choice of the penalty functional is \(\mathcal {R}(\mathbf {f})={{\,\mathrm{KL}\,}}(\mathbf {f},\mathbf {f}_0)\), which naturally enforces nonnegativity of the solution because of the logarithm. Alternatively, more general inequality constraints \(N(\mathbf {f})\le 0\) for some function N can be incorporated into the penalty function by replacing \(\mathcal {R}\) by

$$\begin{aligned} \widetilde{\mathcal {R}}(\mathbf {f})= {\left\{ \begin{array}{ll} \mathcal {R}(\mathbf {f}), &{}\text { if } N(\mathbf {f})\le 0 \\ \infty , &{}\text { else.} \end{array}\right. } \end{aligned}$$

2.1.1 Implementation

In this paragraph we will discuss several possibilities to compute the minimizer \(\hat{\mathbf {f}}_\alpha \) of (5.9) for a linear forward operator denoted by \(F=T\). In the case of quadratic Tikhonov regularization it follows from the first order optimality conditions that the Tikhonov functional \(J_\alpha \) has the unique minimizer

$$\begin{aligned} \hat{\mathbf {f}}_\alpha =(T^*T+\alpha I)^{-1}\left( {T^*\mathbf {g}+\alpha \mathbf {f}_0}\right) \end{aligned}$$
(5.10)

for all \(\alpha >0\). So in order to compute the regularized solution we have to solve the linear system of equations

$$\begin{aligned} A\mathbf {f}=\mathbf {b}\qquad \text{ with }\quad A:=T^*T+\alpha I \text{ and } \mathbf {b}=T^*\mathbf {g}+\alpha \mathbf {f}_0. \end{aligned}$$

Solving this directly, for example by Gauss-Jordan elimination, requires \(\mathcal {O}\left( {n^3}\right) \) operations, and we have to store the full matrix A. In imaging applications n is typically the number of pixels or voxels, which can be so large that storing A is impossible. Therefore, we have to resort to iterative methods which access A only via matrix-vector products. Such matrix-vector products can often be implemented efficiently without setting up the matrix A, e.g. by the fast Fourier transform (FFT) or by solving a partial differential equation. As A is positive definite, the most common method for solving \(A\mathbf {f}=\mathbf {b}\) is the conjugate gradient (CG) method:

Algorithm 5.1

(CG iteration for solving \(A\mathbf {f}=\mathbf {b}, A>0\) )

Initialization. Choose initial guess \(\mathbf {f}_0\in \mathbb {R}^n\). Set \(\mathbf {s}_0=\mathbf {b}-A\mathbf {f}_0; \quad \mathbf {d}_0=\mathbf {s}_0\).

General Step ( \(l = 0 , 1 , \ldots \) )

$$\begin{aligned} \gamma _l&=\frac{\Vert {\mathbf {s}_l}\Vert _2^2}{\langle \mathbf {d}_l,A \mathbf {d}_l\rangle } \\ \mathbf {f}_{l+1}&=\mathbf {f}_{l}+\gamma _l\mathbf {d}_l \\ \mathbf {s}_{l+1}&=\mathbf {s}_{l}-\gamma _l A\mathbf {d}_l \\ \beta _l&=\frac{\Vert {\mathbf {s}_{l+1}}\Vert _2^2}{\Vert {\mathbf {s}_l}\Vert _2^2} \\ \mathbf {d}_{l+1}&=\mathbf {s}_{l+1}+\beta _l \mathbf {d}_l. \end{aligned}$$

If \(A= T^*T+\alpha I\) as in Tikhonov regularization, the stopping criterion \(\mathbf {s}_l\ne 0\) may be replaced by

$$ \Vert {\mathbf {s}_l}\Vert >\alpha \,\mathrm {TOL} $$

for some tolerance parameter \(\mathrm {TOL}>0\). It can be shown that \(\mathbf {s}_l = \mathbf {b}-A\mathbf {f}_l\) for all l. As \(\Vert A^{-1}\Vert \le 1/\alpha \), this guarantees the error bound \( \Vert {\bar{\mathbf {f}}-\mathbf {f}_L}\Vert \le \mathrm {TOL} \) to the exact minimum \(\bar{\mathbf {f}}=A^{-1}b\) of the Tikhonov functional.
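To make Algorithm 5.1 concrete, here is a minimal implementation applied to the Tikhonov normal equations, accessing \(A=T^*T+\alpha I\) only through matrix-vector products. The operator T (a discrete Gaussian blur), the data and the tolerance are illustrative choices, and the stopping rule follows the modified criterion above.

```python
import numpy as np

def cg(apply_A, b, f_init, tol):
    """Algorithm 5.1: CG iteration for A f = b with A symmetric positive definite."""
    f = f_init.copy()
    s = b - apply_A(f)                     # residual s_0 = b - A f_0
    d = s.copy()
    while np.linalg.norm(s) > tol:
        Ad = apply_A(d)
        gamma = (s @ s) / (d @ Ad)
        f = f + gamma * d
        s_new = s - gamma * Ad             # residual update
        beta = (s_new @ s_new) / (s @ s)
        d = s_new + beta * d
        s = s_new
    return f

# Tikhonov normal equations (T*T + alpha I) f = T* g + alpha f0 for an
# illustrative discrete blurring operator T.
n, alpha, TOL = 200, 1e-2, 1e-6
idx = np.arange(n)
T = np.exp(-((idx[:, None] - idx[None, :]) / 5.0) ** 2)
g = T @ np.sin(np.linspace(0, 3, n)) + 1e-3 * np.random.default_rng(1).standard_normal(n)
f0 = np.zeros(n)
apply_A = lambda f: T.T @ (T @ f) + alpha * f
b = T.T @ g + alpha * f0
f_alpha = cg(apply_A, b, f0, tol=alpha * TOL)      # stop once ||s_l|| <= alpha * TOL
print("guaranteed accuracy of the Tikhonov minimizer:", TOL)
```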

In the case of more general data fidelity terms \(\mathcal {S}_\mathbf {g}\) and penalty terms \(\mathcal {R}\) one can use a primal-dual algorithm suggested by Chambolle and Pock [3]. To formulate this algorithm, recall that for a functional \(\mathcal {F}:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{\infty \}\) the conjugate functional is given by

$$\begin{aligned} \mathcal {F}^*(\mathbf {s}) := \sup _{\mathbf {x}\in \mathbb {R}^n}\left[ \langle \mathbf {s},\mathbf {x}\rangle -\mathcal {F}(\mathbf {x})\right] . \end{aligned}$$
(5.11)

If \(\mathcal {F}\) is convex and continuous, then \(\mathcal {F}^{**}=\mathcal {F}\). For more information on this and other basic notions of convex analysis used in the following we refer, e.g. to [17]. The algorithm is based on the saddle point formulation

$$ \min _{\mathbf {f}\in \mathbb {R}^n} \max _{\mathbf {z}\in \mathbb {R}^m} \left[ \langle T\mathbf {f},\mathbf {z}\rangle + \alpha \mathcal {R}(\mathbf {f}) - \mathcal {S}_{\mathbf {g}}^*(\mathbf {z}) \right] = \max _{\mathbf {z}\in \mathbb {R}^m}\min _{\mathbf {f}\in \mathbb {R}^n} \left[ \langle T\mathbf {f},\mathbf {z}\rangle + \alpha \mathcal {R}(\mathbf {f}) - \mathcal {S}_{\mathbf {g}}^*(\mathbf {z})\right] . $$

Note that an analytic computation of the maximum leads to the original problem (5.9) whereas a computation of the minimum leads to the dual problem

$$\begin{aligned} \max _{\mathbf {z}\in \mathbb {R}^m} -\left[ \mathcal {S}_{\mathbf {g}}^*(\mathbf {z}) + \alpha \mathcal {R}^*\left( -\frac{1}{\alpha }T^*\mathbf {z} \right) \right] \end{aligned}$$
(5.12)

The algorithm requires the computation of so called proximity operators (see Chap. 6). For a functional \(G:\mathbb {R}^n\rightarrow \mathbb {R}\) and a scalar \(\lambda >0\) the proximity operator \({{\,\mathrm{prox}\,}}_{G,\lambda }:\mathbb {R}^n\rightarrow \mathbb {R}^n\) is defined by

$$\begin{aligned} {{\,\mathrm{prox}\,}}_{G,\lambda }(\mathbf {z}):={{\,\mathrm{argmin\,}\,}}_{\mathbf {x}\in \mathbb {R}^n}\left[ {\tfrac{1}{2}\Vert {\mathbf {z}-\mathbf {x}}\Vert _2^2+\lambda G(\mathbf {x})}\right] . \end{aligned}$$

For many popular choices of \(\mathcal {S}_{\mathbf {g}}\) and \(\mathcal {R}\) the proximity operator can either be calculated directly via a closed expression, or there are efficient algorithms for its computation. To give a simple example, for \(G(\mathbf {x})=\tfrac{1}{2}\Vert {\mathbf {x}}\Vert _2^2\) one can calculate the proximity operator just as in (5.10) (with \(T=I\) and \(\mathbf {f}_0=0\)) to be \((1+\lambda )^{-1}\mathbf {z}\).

One also needs to evaluate the proximity operator of the Fenchel convex conjugate \(G^*\), which can be done by Moreau’s identity

$$\begin{aligned} {{\,\mathrm{prox}\,}}_{G^*,\lambda }(\mathbf {z})=\mathbf {z}-\lambda {{\,\mathrm{prox}\,}}_{G,1/\lambda }\left( {\tfrac{1}{\lambda }\mathbf {z}}\right) . \end{aligned}$$
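Moreau's identity can be verified numerically. The following sketch does so for \(G=\Vert \cdot \Vert _1\), whose proximity operator is soft-thresholding, so that \({{\,\mathrm{prox}\,}}_{G^*,\lambda }\) must be the projection onto the unit ball of the maximum norm; the test vector and \(\lambda \) are arbitrary.

```python
import numpy as np

def prox_l1(z, lam):                        # prox_{G,lam} for G = ||.||_1 (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_l1_conj(z, lam):                   # prox_{G*,lam} computed via Moreau's identity
    return z - lam * prox_l1(z / lam, 1.0 / lam)

z, lam = np.array([2.0, -0.3, 0.7]), 0.5
# G* is the indicator of {|x_i| <= 1}, so its prox is the projection onto that box:
print(prox_l1_conj(z, lam))                 # expected output: [ 1.  -0.3  0.7]
```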

Algorithm 5.2

(Chambolle-Pock primal dual algorithm)

Initialization. Choose  \((\mathbf {f}_0, \mathbf {p}_0)\in \mathbb {R}^n\times \mathbb {R}^m\) and set \(\tilde{\mathbf {f}}_0=\mathbf {f}_0\).

General Step (\(k = 0 , 1 , \ldots \)) Choose parameters \(\tau _X^k,\tau _Y^k>0\), \(\theta _k\in [0,1]\) and let

$$\begin{aligned} \mathbf {p}_{k+1}&={{\,\mathrm{prox}\,}}_{\mathcal {S}_{\mathbf {g}}^*,\tau _Y^k}\left( {\mathbf {p}_k+\tau _Y^kT\tilde{\mathbf {f}}_k}\right) \\ \mathbf {f}_{k+1}&={{\,\mathrm{prox}\,}}_{\alpha \mathcal {R},\tau _X^k}\left( {\mathbf {f}_k-\tau _X^kT^*\mathbf {p}_{k+1}}\right) \\ \tilde{\mathbf {f}}_{k+1}&=\mathbf {f}_{k+1}+\theta _k\left( {\mathbf {f}_{k+1}-\mathbf {f}_{k}}\right) . \end{aligned}$$

For constant parameters \(\tau _X^k,\tau _Y^k>0\) and \(\theta _k=1\) it has been shown [3] that \(\mathbf {f}_k\) converges to a solution of (5.9) and \(\mathbf {p}_k\) converges to a solution of the corresponding dual problem. Under certain assumptions on \(\mathcal {S}_{\mathbf {g}}\) and \(\mathcal {R}\) special choices for the parameters \(\tau _X^k,\tau _Y^k>0\) and \(\theta _k\) will speed up the convergence.
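The following sketch implements Algorithm 5.2 for the quadratic case \(\mathcal {S}_{\mathbf {g}}=\tfrac12\Vert \cdot -\mathbf {g}\Vert _2^2\), \(\mathcal {R}=\tfrac12\Vert \cdot \Vert _2^2\), for which both proximity operators have closed forms and the result can be compared with the closed-form Tikhonov solution (5.10) with \(\mathbf {f}_0=0\). The matrix, the data, \(\alpha \), the step sizes and the iteration count are illustrative choices (the step sizes satisfy \(\tau _X\tau _Y\Vert T\Vert ^2<1\)).

```python
import numpy as np

rng = np.random.default_rng(0)
T, g, alpha = rng.standard_normal((30, 20)), rng.standard_normal(30), 0.1

prox_S_conj = lambda p, tau: (p - tau * g) / (1.0 + tau)    # prox of S_g^* for S_g = 0.5*||.-g||^2
prox_alphaR = lambda f, tau: f / (1.0 + tau * alpha)        # prox of alpha*R for R = 0.5*||.||^2

L = np.linalg.norm(T, 2)
tau_X = tau_Y = 0.95 / L                                    # tau_X * tau_Y * ||T||^2 < 1
theta = 1.0

f, p = np.zeros(20), np.zeros(30)
f_tilde = f.copy()
for _ in range(2000):
    p = prox_S_conj(p + tau_Y * (T @ f_tilde), tau_Y)
    f_new = prox_alphaR(f - tau_X * (T.T @ p), tau_X)
    f_tilde = f_new + theta * (f_new - f)
    f = f_new

f_tik = np.linalg.solve(T.T @ T + alpha * np.eye(20), T.T @ g)   # closed form (5.10), f_0 = 0
print("max deviation from the Tikhonov solution:", np.max(np.abs(f - f_tik)))
```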

To compute the minimizer of (5.9) with additional inequality constraints one can apply semismooth Newton methods for which we refer to [19].

2.2 Iterative Regularization

Whereas in the previous subsection we have discussed iterative methods to compute the minimum of a generalized Tikhonov functional, in the following we will discuss iterative methods for the solution of \(F(\mathbf {f})=\mathbf {g}\) without prior regularization. Typically the choice of the stopping index plays the role of the choice of the regularization parameter \(\alpha \). A motivation for the use of iterative regularization methods is the fact that the Tikhonov functional is not convex in general for nonlinear operators. Therefore, it cannot be guaranteed that an approximation to the global minimum of the Tikhonov functional can be computed. For further information on iterative regularization methods we refer to the monograph [15].

2.2.1 Landweber Iteration

Landweber iteration can be derived as a method of steepest descent for the cost functional \(J(\mathbf {f})=\tfrac{1}{2}\Vert {F(\mathbf {f})-\mathbf {g}}\Vert _2^2\). As the direction of steepest descent is given by the negative gradient \(-J'(\mathbf {f})=-F'(\mathbf {f})^*\left( {F(\mathbf {f})-\mathbf {g}}\right) \), this leads to the iteration formula

$$\begin{aligned} \mathbf {f}_{k+1}=\mathbf {f}_k-\mu F'(\mathbf {f}_k)^*\left( {F(\mathbf {f}_k)-\mathbf {g}}\right) \end{aligned}$$
(5.13)

with a step size parameter \(\mu \). This parameter should be chosen such that \(\mu \Vert F'(\mathbf {f})^*F'(\mathbf {f})\Vert <1\) for all \(\mathbf {f}\). Since choosing \(\mu \) too small slows down convergence considerably, it is advisable to estimate the operator norm of \(F'(\mathbf {f})^*F'(\mathbf {f})\) by a few iterations of the power method.
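A minimal sketch of Landweber iteration for a linear toy operator, with the step size obtained from a few power iterations, might look as follows. The matrix, the noise level and the fixed number of 500 steps are illustrative assumptions; in practice the iteration would be stopped by the discrepancy principle, as discussed in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.standard_normal((40, 25)) / np.sqrt(40)
f_true = rng.standard_normal(25)
g = T @ f_true + 1e-2 * rng.standard_normal(40)

# estimate ||T* T|| by a few power iterations and choose mu accordingly
v = rng.standard_normal(25)
for _ in range(20):
    v = T.T @ (T @ v)
    v /= np.linalg.norm(v)
mu = 0.9 / (v @ (T.T @ (T @ v)))           # ensures mu * ||T* T|| < 1 (approximately)

f = np.zeros(25)
for _ in range(500):
    f = f - mu * (T.T @ (T @ f - g))       # steepest descent step for 0.5*||T f - g||^2
print("data misfit after 500 Landweber steps:", np.linalg.norm(T @ f - g))
```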

Under certain conditions on the operator F it has been shown in [7] in a Hilbert space setting that Landweber iteration with the discrepancy principle as stopping rule is a regularization method in a sense analogous to Definition 5.1, i.e. the worst case error tends to 0 with the noise level for a sufficiently good initial guess.

2.2.2 Regularized Newton Methods

Although Landweber iteration often makes good progress in the first few iterations, asymptotic convergence is very slow. Faster convergence may be expected from Newton-type methods, which solve a linear system of equations or some minimization problem using the first order Taylor approximation

$$\begin{aligned} F(\mathbf {f})\approx F(\mathbf {f}_{k})+F'(\mathbf {f}_{k})(\mathbf {f}-\mathbf {f}_k) \end{aligned}$$
(5.14)

around a current iterate \(\mathbf {f}_k\). Plugging this approximation into the quadratic Tikhonov functional and using the last iterate as initial guess leads to the Levenberg-Marquardt-algorithm

$$\begin{aligned} \mathbf {f}_{k+1} ={{\,\mathrm{argmin\,}\,}}_{\mathbf {f}\in \mathbb {R}^n}\left[ {\tfrac{1}{2}\Vert {F'(\mathbf {f}_{k})(\mathbf {f}-\mathbf {f}_k)+F(\mathbf {f}_{k})-\mathbf {g}}\Vert _2^2+\tfrac{\alpha _k}{2}\Vert {\mathbf {f}-\mathbf {f}_k}\Vert _2^2}\right] . \end{aligned}$$

For a convergence analysis we refer to [6]. The minimization problems can be solved efficiently by Algorithm 5.1 without the need to set up the full Jacobi matrices \(F'(\mathbf {f}_k)\) in each step. Newton-type methods converge considerably faster than Landweber iteration. In fact the number of Landweber steps which is necessary to achieve an accuracy comparable to k Newton steps increases exponentially with k (cf. [4]). On the other hand, each iteration step is more expensive. Which of the two methods is more efficient may depend on the size of the noise level. Newton-type methods are typically favorable for small noise levels. Plugging the first order Taylor approximation (5.14) into the generalized Tikhonov functional (5.8) leads to the iteration formula

$$\begin{aligned} \mathbf {f}_{k+1} \in {{\,\mathrm{argmin\,}\,}}_{\mathbf {f}}\left[ \mathcal {S}_{{g^\mathrm {obs}}} (F(\mathbf {f}_{k})+F'(\mathbf {f}_{k})(\mathbf {f}-\mathbf {f}_k))+\alpha _k\mathcal {R}(\mathbf {f}) \right] . \end{aligned}$$
(5.15)

If \(\mathcal {S}_{{g^\mathrm {obs}}}(\mathbf {g})=\frac{1}{2}\Vert \mathbf {g}-{g^\mathrm {obs}}\Vert ^2\) and \(\mathcal {R}(\mathbf {f}) =\frac{1}{2}\Vert \mathbf {f}-\mathbf {f}_0\Vert ^2\), this leads to the commonly used iteratively regularized Gauss-Newton method, where compared to the Levenberg-Marquardt method the penalty term is replaced by \(\tfrac{\alpha _k}{2}\Vert {\mathbf {f}-\mathbf {f}_0}\Vert _2^2\). Note that if \(\mathcal {S}_{{g^\mathrm {obs}}}\) and \(\mathcal {R}\) are convex, a convex optimization problem has to be solved in each Newton step, which can be done, e.g., by Algorithm 5.2. For a convergence analysis of (5.15) including the case of Poisson data we refer to [12].
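For illustration, here is a sketch of the iteratively regularized Gauss-Newton method for a simple smooth nonlinear toy operator \(F(\mathbf {f})=A\mathbf {f}+0.1\,(B\mathbf {f})^2\) (componentwise square). The operator, the geometric decay of \(\alpha _k\) and the fixed number of Newton steps are illustrative assumptions, and each quadratic subproblem is solved in closed form instead of by Algorithm 5.1.

```python
import numpy as np

rng = np.random.default_rng(3)
A, B = rng.standard_normal((30, 10)), rng.standard_normal((30, 10))
F = lambda f: A @ f + 0.1 * (B @ f) ** 2
Fprime = lambda f: A + 0.2 * (B @ f)[:, None] * B      # Jacobian F'(f)

f_true = 0.3 * rng.standard_normal(10)
g_obs = F(f_true) + 1e-3 * rng.standard_normal(30)

f0 = np.zeros(10)                                      # initial guess and penalty center
f, alpha = f0.copy(), 1.0
for k in range(15):
    J = Fprime(f)
    rhs = J.T @ (g_obs - F(f) + J @ f) + alpha * f0
    f = np.linalg.solve(J.T @ J + alpha * np.eye(10), rhs)
    alpha *= 0.5                                       # decrease the regularization parameter
print("final data misfit:", np.linalg.norm(F(f) - g_obs))
print("final reconstruction error:", np.linalg.norm(f - f_true))
```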

3 Error Estimates

3.1 General Error Bounds for Variational Regularization

Under the deterministic noise model (5.4) we are now looking for error bounds of the form

$$\begin{aligned} \Vert {\hat{f}_\alpha -f^\dagger }\Vert ^2\le \phi \left( {\delta ,\alpha }\right) \end{aligned}$$
(5.16)

for the Tikhonov estimator \(\hat{f}_\alpha \). In a second step the right hand side may be minimized over \(\alpha \). As discussed in Sect. 5.1.4 such estimates can only be obtained under additional conditions on \(f^\dagger \), which are called source conditions. There are several forms of such conditions. Nowadays, starting with [8] such conditions are often formulated in the form of variational inequalities. For the sake of simplicity we confine ourselves to the case of Hilbert spaces with quadratic functionals \(\mathcal {R}\) and \(\mathcal {S}_{{g^\mathrm {obs}}}\) here, although the concepts can be generalized with little additional effort to general convex \(\mathcal {R}\) and \(\mathcal {S}_{{g^\mathrm {obs}}}\). For a concave, monotonically increasing and continuous function \(\psi :[0,\infty )\rightarrow [0,\infty )\) with \(\psi (0)=0\) we require that

$$\begin{aligned} \forall f\in \mathbb {X}\qquad \frac{1}{4}\Vert {f-f^\dagger }\Vert ^2 \le \frac{1}{2}\Vert {f}\Vert ^2-\frac{1}{2}\Vert {f^\dagger }\Vert ^2+\psi \left( {\Vert {F(f)-F(f^\dagger )}\Vert ^2}\right) . \end{aligned}$$
(5.17)

Such conditions are not easy to interpret at first sight, and we will come back to this in Sect. 5.3.2. However, they have been shown to be necessary for certain convergence rates of Tikhonov regularization and other regularization methods (see [11]), and sufficiency can be shown quite easily:

Theorem 5.3

If \(f^\dagger \) fulfills (5.17), then the Tikhonov estimator \(\hat{f}_\alpha \) in (5.8) satisfies the error estimate

$$\begin{aligned} \frac{1}{4} \Vert {\hat{f}_\alpha -f^\dagger }\Vert ^2\le \frac{\delta ^2}{\alpha }+ (-\psi )^*\left( {-\frac{1}{4\alpha }}\right) \end{aligned}$$
(5.18)

where \(\psi ^*(t):=\sup _{s\ge 0} \big ({st-\psi (s)}\big )\) denotes the conjugate function (see (5.11)).

Proof

By definition of \(\hat{f}_\alpha \) and our noise model we have

$$\begin{aligned} \frac{1}{2}\Vert {F(\hat{f}_\alpha )-{g^\mathrm {obs}}}\Vert ^2+\frac{\alpha }{2}\Vert {\hat{f}_\alpha }\Vert ^2\le \frac{1}{2}\Vert {F(f^\dagger )-{g^\mathrm {obs}}}\Vert ^2+\frac{\alpha }{2}\Vert {f^\dagger }\Vert ^2 \le \frac{\delta ^2}{2}+\frac{\alpha }{2}\Vert {f^\dagger }\Vert ^2, \end{aligned}$$

so together with our assumption (5.17) we find

$$\begin{aligned} \frac{1}{4}\Vert {\hat{f}_\alpha -f^\dagger }\Vert ^2&\le \frac{1}{2}\Vert {\hat{f}_\alpha }\Vert ^2-\frac{1}{2}\Vert {f^\dagger }\Vert ^2+\psi \left( {\Vert {F(\hat{f}_\alpha )-F(f^\dagger )}\Vert ^2}\right) \\&\le \frac{\delta ^2}{2\alpha } -\frac{1}{2\alpha }\Vert {F(\hat{f}_\alpha )-{g^\mathrm {obs}}}\Vert ^2 +\psi \left( {\Vert {F(\hat{f}_\alpha )-F(f^\dagger )}\Vert ^2}\right) . \end{aligned}$$

By the parallelogram law we have for all \(x,y,z\in \mathbb {X}\) that

$$\begin{aligned} \Vert {x-y}\Vert ^2\le 2\Vert {x-z}\Vert ^2+2\Vert {z-y}\Vert ^2-\Vert {x+y-2z}\Vert ^2\le 2\Vert {x-z}\Vert ^2+2\Vert {z-y}\Vert ^2. \end{aligned}$$

Apply this with \(x=F(\hat{f}_\alpha ), y=F(f^\dagger )\) and \(z={g^\mathrm {obs}}\) to find

$$\begin{aligned} \frac{1}{4\alpha }\Vert {F(\hat{f}_\alpha )-F(f^\dagger )}\Vert ^2\le \frac{\delta ^2}{2\alpha }+\frac{1}{2\alpha }\Vert {F(\hat{f}_\alpha )-{g^\mathrm {obs}}}\Vert ^2, \end{aligned}$$

so that finally we have

$$\begin{aligned} \frac{1}{4}\Vert {\hat{f}_\alpha -f^\dagger }\Vert ^2&\le \frac{\delta ^2}{\alpha } -\frac{1}{4\alpha }\Vert {F(\hat{f}_\alpha )-F(f^\dagger )}\Vert ^2+\psi \left( {\Vert {F(\hat{f}_\alpha )-F(f^\dagger )}\Vert ^2}\right) \\&\le \frac{\delta ^2}{\alpha } +\sup _{s\ge 0}\left[ {\frac{-s}{4\alpha }+\psi (s)}\right] =\frac{\delta ^2}{\alpha }+ (-\psi )^*\left( {-\frac{1}{4\alpha }}\right) .\qquad \qquad {\square } \end{aligned}$$

The two most commonly used types of functions \(\psi \) in the literature,

$$\begin{aligned} \psi _\nu (t)=t^{\nu /2}\qquad \text {and} \qquad \psi _p^{\mathrm {log}}(t)= (-\log t)^p \big ({1+o(1)}\big ),\quad \text {as } t\rightarrow 0 \end{aligned}$$

are referred to as Hölder and logarithmic source function,  respectively. For these functions we obtain

$$\begin{aligned} (-\psi _\nu )^*\left( {-\frac{1}{t}}\right)&=c t^\frac{\nu }{2-\nu } \\ \left( {-\psi _p^{\mathrm {log}}}\right) ^*\left( {-\frac{1}{t}}\right)&= (-\log t)^p \big ({1+o(1)}\big ),\quad \text {as } t\rightarrow 0. \end{aligned}$$
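For the Hölder case the conjugate can be computed explicitly by elementary calculus; the following short verification, including the explicit constant \(c\), is not part of the original argument but may be helpful. For \(\nu \in (0,2)\),

$$\begin{aligned} (-\psi _\nu )^*\left( {-\frac{1}{t}}\right) =\sup _{s\ge 0}\left[ {s^{\nu /2}-\frac{s}{t}}\right] , \qquad \frac{\nu }{2}\,s^{\nu /2-1}=\frac{1}{t}\;\Leftrightarrow \; s_*=\left( {\frac{\nu t}{2}}\right) ^{2/(2-\nu )}, \end{aligned}$$

and inserting the maximizer \(s_*\) gives

$$\begin{aligned} (-\psi _\nu )^*\left( {-\frac{1}{t}}\right) = s_*^{\nu /2}\left( {1-\frac{\nu }{2}}\right) = \left( {1-\frac{\nu }{2}}\right) \left( {\frac{\nu }{2}}\right) ^{\frac{\nu }{2-\nu }} t^{\frac{\nu }{2-\nu }}, \end{aligned}$$

i.e. the constant above is \(c=\big (1-\tfrac{\nu }{2}\big )\big (\tfrac{\nu }{2}\big )^{\nu /(2-\nu )}\).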

Note that the two terms on the right hand side of (5.18) correspond to the error splitting (5.5). The following theorem gives an optimal choice of \(\alpha \) balancing these two terms:

Theorem 5.4

If under the assumptions of Theorem 5.3 \(\psi \) is differentiable, the infimum of the right hand side of (5.18) is attained at \(\bar{\alpha }(\delta )\) if and only if

$$\begin{aligned} \frac{1}{4\bar{\alpha }(\delta )} = \psi ' \left( {4 \delta ^2}\right) , \end{aligned}$$

and

$$\begin{aligned} \frac{1}{4} \Vert {\hat{f}_{\bar{\alpha }} -f^\dagger }\Vert ^2 \le \psi \left( {4\delta ^2}\right) . \end{aligned}$$

Proof

Note from the definition of \((-\psi )^*\) that \((-\psi )^*(t^*)\ge tt^*+\psi (t)\) for all \(t\ge 0\), \(t^*\in \mathbb {R}\). Further, equality holds true if and only if \(t^*= -\psi '(t)\), as for this choice of \(t^*\) the concave function \(\tilde{t}\mapsto \tilde{t}t^*+\psi (\tilde{t})\) attains its unique maximum at t and thus in particular \(tt^*+\psi (t)\ge (-\psi )^*(t^*)\). Therefore we have with \(t=4\delta ^2\) and \(t^*=-\frac{1}{4\alpha }\) that

$$ \frac{\delta ^2}{\alpha }+ (-\psi )^*\left( {-\frac{1}{4\alpha }}\right) \ge \psi (4\delta ^2) \qquad \text{ for } \text{ all } \alpha >0 $$

and \( \frac{\delta ^2}{\alpha }+ (-\psi )^*\left( {-\frac{1}{4\alpha }}\right) = \psi (4\delta ^2) \) if and only if \(-\frac{1}{4 \alpha } = -\psi ' \left( {4 \delta ^2}\right) \). \(\square \)

It can be shown that the same type of error bound can be obtained by a version of the discrepancy principle (5.6) which does not require knowledge of the function \(\psi \) describing abstract smoothness of the unknown solution \(f^{\dagger }\) [2]. This is an advantage in practice, because such knowledge is often unrealistic.

3.2 Interpretation of Variational Source Conditions

3.2.1 Connection to Stability Estimates

Variational source conditions (5.17) are closely related to so called stability estimates. In fact if (5.17) holds true for all \(f^\dagger \in \mathcal {K}\subset \mathbb {X}\), then all \(f_1,f_2\in \mathcal K\) satisfy the stability estimate

$$\begin{aligned} \frac{1}{4}\Vert {f_1-f_2}\Vert ^2 \le \psi \left( { \Vert {F(f_1)-F(f_2)}\Vert ^2}\right) , \end{aligned}$$

with the same function \(\psi \), since one of the terms \(\pm \left( {\Vert {f_1}\Vert ^2-\Vert {f_2}\Vert ^2}\right) \) will be non-positive. There exists a considerable literature on such stability estimates (see e.g. [1, 14]). However, it is unclear whether stability estimates also imply variational source conditions, as two difficulties have to be overcome: firstly, the term \(\Vert {f}\Vert ^2-\Vert {f^\dagger }\Vert ^2\) might be negative, and secondly, one would have to extend the estimate from the set \(\mathcal K\) to the whole space \(\mathbb {X}\).

3.2.2 General Strategy for the Verification of Variational Source Conditions

In general the rate at which the error of reconstruction methods converges to 0 as the noise level tends to 0 in inverse problems depends on two factors: The smoothness of the solution \(f^{\dagger }\) and the degree of ill-posedness of F. We will describe both in terms of a family of finite dimensional subspaces \(V_n\subset \mathbb {X}\) or the corresponding orthogonal projections \(P_n:\mathbb {X}\rightarrow V_n\). The smoothness of \(f^{\dagger }\) will be measured by how fast the best approximations \(P_nf^{\dagger }\) in \(V_n\) converge to \(f^{\dagger }\):

$$\begin{aligned}&\Vert {(I-P_n) f^{\dagger }}\Vert _{\mathbb {X}}\le \kappa _n \end{aligned}$$
(5.19a)
$$\begin{aligned}&\lim _{n\rightarrow \infty } \kappa _n=0 \end{aligned}$$
(5.19b)

Inequalities of this type are called Jackson inequalities in approximation theory, and they are well studied for many types of subspaces \(V_n\) such as spaces of polynomials, trigonometric polynomials, splines, or finite elements. We will illustrate this for the case of trigonometric polynomials below.

Concerning the degree of ill-posedness, recall that any linear mapping on a finite dimensional space is continuous. Therefore, a linear, injective operator T restricted to a finite dimensional space \(V_n\) has a continuous inverse \((T|_{V_n})^{-1}\) defined on \(T(V_n)\). However, the norm of these operators will grow with n, and the rate of growth may be used to measure the degree of ill-posedness. In the nonlinear case we may look at Lipschitz constants \(\sigma _n\) such that \(\Vert P_nf^{\dagger }-P_nf\Vert \le \sigma _n \Vert F(P_nf^{\dagger })-F(P_nf)\Vert \). However, to obtain optimal results it turns out that we need estimates of inner products of \(P_nf^{\dagger }-P_nf\) with \(f^{\dagger }\). Moreover, on the right hand side we have to deal with \(F(f^{\dagger })-F(f)\) rather than \(F(P_nf^{\dagger })-F(P_nf)\):

$$\begin{aligned}&\begin{aligned}&\left\langle P_n f^{\dagger },f^{\dagger }-f\right\rangle \le \sigma _n \Vert {F(f^{\dagger })-F(f)}\Vert _{\mathbb {Y}} \end{aligned} \end{aligned}$$
(5.19c)

The growth rate of \(\sigma _n\) describes what we will call local degree of ill-posedness of F at \(f^{\dagger }\).

Theorem 5.5

Let \(\mathbb {X}\) and \(\mathbb {Y}\) be Hilbert spaces and suppose that there exists a sequence of projection operators \(P_n:\mathbb {X}\rightarrow \mathbb {X}\) and sequences \((\kappa _n)_{n\in \mathbb {N}}\), \((\sigma _n)_{n\in \mathbb {N}}\) of positive numbers such that (5.19) holds true for all \(n\in \mathbb {N}\).

Then \(f^{\dagger }\) fulfills a variational source condition (5.17) with the concave, continuous, increasing function

$$\begin{aligned} \psi (\tau ):=\inf _{n\in \mathbb {N}} \left[ {\sigma _n\sqrt{\tau }+\kappa _n^{2}}\right] , \end{aligned}$$
(5.20)

which satisfies \(\psi (0)=0\).

Proof

By straightforward computations we see that the variational source condition (5.17) has the equivalent form

$$ \forall f\in \mathbb {X}:\qquad \left\langle f^\dagger , f^\dagger -f\right\rangle \le \frac{1}{4}\Vert {f-f^\dagger }\Vert ^2+\psi \left( {\Vert {F(f)-F(f^\dagger )}\Vert ^2}\right) . $$

Using (5.19a), (5.19c), and the Cauchy-Schwarz inequality we get for each \(n\in \mathbb {N}\) that

$$\begin{aligned}\left\langle f^{\dagger },f^{\dagger }-f\right\rangle =\,&\left\langle P_nf^{\dagger },f^{\dagger }-f\right\rangle + \left\langle (I-P_n)f^{\dagger },f^{\dagger }-f\right\rangle \\ \le \,&\sigma _n \Vert {F(f^{\dagger })-F(f)}\Vert _{\mathbb {Y}}+ \kappa _n \Vert {f^{\dagger }-f}\Vert _{\mathbb {X}}\\ \le \,&\sigma _n \Vert {F(f^{\dagger })-F(f)}\Vert _{\mathbb {Y}}+ \kappa _n^{2} + \frac{1}{4}\Vert {f^{\dagger }-f}\Vert _{\mathbb {X}}^2\\ \le \,&\sigma _n \Vert {F(f^{\dagger })-F(f)}\Vert _{\mathbb {Y}} + \kappa _n^{2} + \frac{1}{4}\Vert {f-f^{\dagger }}\Vert ^2. \end{aligned}$$

Taking the infimum over the right hand side with respect to \(n\in \mathbb {N}\) yields (5.20) with \(\tau =\Vert {F(f^{\dagger })-F(f)}\Vert _{\mathbb {Y}}^2\). As \(\psi \) is defined by an infimum over concave and increasing functions, it is also increasing and concave. Moreover, (5.19b) implies \(\psi (0)=0\). \(\square \)

3.2.3 Example: Numerical Differentiation

Recall that the trigonometric monomials \(e_n(x):= \mathrm {e}^{2\pi \mathrm {i} nx}\) form an orthonormal basis \(\{e_n :n\in \mathbb Z\setminus \{0\}\}\) of \(L^2_\diamond ([0,1])\), i.e. every function \(f\in L^2_\diamond ([0,1])\) can be expressed as \(f(x)=\sum _{n\in \mathbb Z\setminus \{0\}}\widehat{f}(n) e_n(x)\) with \(\widehat{f}(n)=\int _0^1 f(x) \overline{e_n(x)} \mathop {}\!\mathrm {d}x\). Further note that from the definition (5.3) of the forward operator we get

$$ F(e_n) =\frac{1}{2\pi i n} e_n,\qquad n\in \mathbb {Z}\setminus \{0\}, $$

and that the kth derivative of \(f\) has Fourier coefficients \(\widehat{f^{(k)}}(n) = (2\pi i n)^k\widehat{f}(n)\). Therefore, the norm

$$ \Vert {f}\Vert _{H^s}:=\left( {\sum _{n\in \mathbb {Z}\setminus \{0\}} (2\pi n)^{2s}\left| {\widehat{f}(n)}\right| ^2}\right) ^{1/2} $$

(called Sobolev norm of order \(s\ge 0\)) fulfills \(\Vert {f}\Vert _{H^k} = \Vert f^{(k)}\Vert _{L^2}\) for \(k\in \mathbb {N}_0\), but it also allows one to measure non-integer degrees of smoothness of \(f\).

We choose \(P_n\) as the orthogonal projection

$$ P_nf:= \sum _{0<|m|\le n} \widehat{f}(m) e_m. $$

Suppose that \(\Vert f^{\dagger }\Vert _{H^s}<\infty \) for some \(s\in (0,1]\). Then

$$\begin{aligned} \Vert (I-P_n)f^{\dagger }\Vert _{L^2}^2&= \sum _{|m|>n}\left| {\widehat{f^{\dagger }}(m)}\right| ^2 = \sum _{|m|>n}(2\pi m)^{-2s}(2\pi m)^{2s}\left| {\widehat{f^{\dagger }}(m)}\right| ^2 \\&\le (2\pi n)^{-2s} \Vert f^{\dagger }\Vert _{H^s}^2, \end{aligned}$$

which shows that (5.19a) and (5.19b) are satisfied with \(\kappa _n = (2\pi n)^{-s}\Vert f^{\dagger }\Vert _{H^s}\). Moreover, we have that

$$\begin{aligned} \left\langle P_nf^{\dagger },f^{\dagger }-f\right\rangle&=\sum _{0<|m|\le n} -2\pi \mathrm i m\, \widehat{f^\dagger }(m)\, \overline{\left( {\frac{1}{2\pi \mathrm i m}\left( {\widehat{f^\dagger }(m)-\widehat{f}(m)}\right) }\right) } \\&\le \Big ({\sum _{0<|m|\le n} (2\pi m)^2|\widehat{f^\dagger }(m)|^2}\Big )^{1/2} \Big \Vert {F(f^\dagger )-F(f)}\Big \Vert \\&\le (2\pi n)^{1-s} \Vert {f^\dagger }\Vert _{H^s} \Big \Vert {F(f^\dagger )-F(f)}\Big \Vert , \end{aligned}$$

so (5.19c) is satisfied with \(\sigma _n := (2\pi n)^{1-s}\Vert {f^\dagger }\Vert _{H^s}\). Choosing the index \(n \approx \Vert {f^\dagger }\Vert _{H^s}^{1/(1+s)} \tau ^{-1/(2(1+s))}\) at which the infimum in (5.20) is attained approximately, we see that there exists a constant \(C>0\) such that the variational source condition (5.17) is satisfied with \( \psi (t)=C \Vert {f^\dagger }\Vert _{H^s}^{2/(1+s)} t^{s/(1+s)}, \) and from Theorem 5.4 we obtain the error bound

$$\begin{aligned} \Vert \hat{f}_{\alpha }-f^{\dagger }\Vert _{L^2} = \mathcal {O}\left( {\delta ^{\frac{s}{s+1}}}\right) . \end{aligned}$$
(5.21)
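For completeness, here is the balancing computation behind this choice of the index; it is a routine verification under the assumptions above rather than an additional result. Equating the two terms in (5.20),

$$\begin{aligned} \sigma _n\sqrt{\tau }\approx \kappa _n^2 \;\Leftrightarrow \; (2\pi n)^{1-s}\Vert {f^\dagger }\Vert _{H^s}\sqrt{\tau }\approx (2\pi n)^{-2s}\Vert {f^\dagger }\Vert _{H^s}^2 \;\Leftrightarrow \; 2\pi n \approx \left( {\frac{\Vert {f^\dagger }\Vert _{H^s}}{\sqrt{\tau }}}\right) ^{1/(1+s)}, \end{aligned}$$

and inserting this choice into either term yields, up to constants,

$$\begin{aligned} \psi (\tau )\le \sigma _n\sqrt{\tau }+\kappa _n^2 \approx 2\,\kappa _n^2 \approx 2\,\Vert {f^\dagger }\Vert _{H^s}^{\frac{2}{1+s}}\,\tau ^{\frac{s}{1+s}}. \end{aligned}$$

With \(\tau =4\delta ^2\), Theorem 5.4 then gives \(\tfrac14\Vert \hat{f}_{\bar{\alpha }}-f^\dagger \Vert ^2\le \psi (4\delta ^2)=\mathcal {O}\big (\delta ^{2s/(1+s)}\big )\), which is (5.21).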

It can be shown that this rate is optimal in the sense that there exists no reconstruction method \(R\) for which

$$\begin{aligned} \sup \left\{ \Big \Vert {R({g^\mathrm {obs}})-f^\dagger }\Big \Vert :\Vert {f^\dagger }\Vert _{H^s}\le 1, \Big \Vert {F(f^\dagger )-{g^\mathrm {obs}}}\Big \Vert \le \delta \right\} = o\left( {\delta ^\frac{s}{s+1}}\right) . \end{aligned}$$

3.2.4 Example: Fluorescence Microscopy

Similarly one can proceed for the example of fluorescence microscopy. As one has to work here with \(L^2\) spaces rather than \(L^2_{\diamond }\) spaces, the Sobolev norm is defined by \(\Vert f\Vert _{H^s}^2 = \sum _{n\in \mathbb {Z}} (1+n^2)^s \left| {\widehat{f}(n)}\right| ^2\). Assuming that the convolution kernel is a-times smoothing (\(a>0\)) in the sense that \(\Vert F(f)\Vert _{H^a}\sim \Vert f\Vert _{H^0}\), which is equivalent to the existence of two constants \(0<c<C\) such that the Fourier transform of the convolution kernel k fulfills

$$\begin{aligned} c (1+| \xi |^2)^{-a/2} \le |\widehat{k}(\xi )|\le C (1+| \xi |^2)^{-a/2} \end{aligned}$$

one can show that \(\Vert f^\dagger \Vert _{H^s}<\infty \) for \(s\in (0,a]\) implies a variational source condition with \(\psi (t)\sim t^\frac{s}{s+a}\) and an error bound

$$\begin{aligned} \Vert \hat{f}_{\alpha }-f^{\dagger }\Vert _{L^2} = \mathcal {O}\left( {\delta ^{\frac{s}{s+a}}}\right) . \end{aligned}$$
(5.22)

Again this estimate is optimal in the sense explained above.

3.2.5 Extensions

Variational source conditions with a given Hölder source function actually hold true on a slightly larger set than the corresponding Sobolev space. If, as is typical, the marker density of the investigated specimen is piecewise constant (or piecewise smooth) with jump discontinuities, then it fulfills the variational source condition corresponding to \(s=1/2\), although \(f^\dagger \in H^s\) holds if and only if \(s<1/2\). The sets on which a variational source condition is satisfied can be characterized in terms of Besov spaces \(B^s_{2,\infty }\), and bounded subsets of such spaces are also the largest sets on which Hölder-type error bounds like (5.21) and (5.22) are satisfied with uniform constants (see [11]).

In the case where the convolution kernel is infinitely smoothing, e.g. if the kernel is a Gaussian, then we cannot expect to get a variational source condition with a Hölder source function under Sobolev smoothness assumptions. Instead one obtains logarithmic source functions \(\psi _p^{\mathrm {log}}\) as introduced above, which will again be optimal and lead to very slowly decaying error estimates as \(\delta \searrow 0\) [11].

Note that the rates (5.21) and (5.22) are restricted to smoothness indices \(s\in (0,1]\) and \(s\in (0,a]\), respectively. This restriction to low order rates is a well-known shortcoming of variational source conditions. Higher order rates can be obtained by imposing a variational source condition on the dual problem (5.12), which can again be verified by Theorem 5.5 (see [5, 18]).

With some small modifications the strategy in Theorem 5.5 can be extended to Banach space settings [9, 21] and nonlinear forward operators F, in particular those arising in inverse scattering problems [10, 20].

3.3 Error Bounds for Poisson Data

We already briefly discussed discrete versions of inverse problems with Poisson data in Sect. 5.2.1. Such problems arise in many photonic imaging modalities such as fluorescence microscopy, coherent x-ray imaging, positron emission tomography, but also in electron microscopy. In the following we briefly discuss a continuous setting for such problems.

We consider a forward operator \(F:\mathbb {X}\rightarrow \mathbb {Y}\) that maps an unknown sample \(f^{\dagger }\in \mathbb {X}\) to the photon density \({\text {g}}^\dagger \in \mathbb {Y}= L^1(\mathbb {M})\) generated by \(f^{\dagger }\) on some measurement manifold \(\mathbb {M}\subset \mathbb {R}^d\). The given data are modeled by a Poisson process

$$\begin{aligned} G_t=\sum _{k=1}^N\delta _{x_k}, \end{aligned}$$

with density \(tg^\dagger \). Here \(\{x_1,\dots ,x_N\}\subset \mathbb {M}\) denote the positions of the detected photons and \(t>0\) can be interpreted as exposure time. Note that \(t\sim \mathbf {E}(N)\), i.e. the exposure time is proportional to the expected number of detected photons.
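To make the data model concrete, the following sketch draws a realization of \(G_t\) on \(\mathbb {M}=[0,1]\) by thinning a homogeneous Poisson process; the density \(g^\dagger \), the exposure time and the interval are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
t = 1000.0
g_dagger = lambda x: 1.0 + 0.5 * np.cos(2 * np.pi * x)     # photon density with integral 1
g_max = 1.5                                                # upper bound used for thinning

N_hom = rng.poisson(t * g_max)                             # homogeneous candidate process
x_cand = rng.uniform(0.0, 1.0, N_hom)
keep = rng.uniform(0.0, 1.0, N_hom) < g_dagger(x_cand) / g_max
photons = x_cand[keep]                                     # detected positions x_1, ..., x_N
print("E[N] =", t, "  observed N =", photons.size)
```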

Now to discuss error bounds we first need some notion of noise level. But what is the “noise” in our setting? Our data \(G_t\) do not belong to \(\mathbb {Y}= L^1(\mathbb {M})\) and the “noise” is not additive. However, it follows from the properties of Poisson processes that

$$ \mathbf {E}\left[ {\left\langle \tfrac{1}{t}G_t,h\right\rangle }\right] = \int _{\mathbb {M}}h g^{\dagger }\,dx, $$

and the variance of \(\langle \frac{1}{t}G_t,h\rangle \) is proportional to \(\tfrac{1}{t}\). This suggests that \(\tfrac{1}{\sqrt{t}}\) plays the role of the noise level. More precisely, it is possible to derive concentration inequalities

$$\begin{aligned} \mathbf {P}\left( d(\tfrac{1}{t}G_t,{\text {g}}^\dagger ) >\frac{r+C}{\sqrt{t}} \right) \le \exp \left( -cr\right) \end{aligned}$$

where the distance function d is defined by a negative Besov norm (see [13, 22] for similar results).

As a reconstruction method we consider generalized Tikhonov regularization as in (5.8) with \(\mathcal {S}_{G_t} \) given by a negative (quasi-)log-likelihood corresponding to the Poisson data. As discussed in Sect. 5.2 this amounts to taking the Kullback-Leibler divergence as data fidelity term in the finite dimensional case, and in particular in the implementations of this method. (Sometimes a small shift is introduced in the Kullback-Leibler divergence to “regularize” this term.) Assuming the variational source condition (5.17) with \(\psi (\Vert F(f^{\dagger })-F(f)\Vert ^2)\) replaced by \(\psi \left( {{{\,\mathrm{KL}\,}}(F(f^{\dagger }),F(f))}\right) \) and using an optimal parameter choice \(\bar{\alpha }\), one can then show the following error estimate in expectation:

$$\begin{aligned} \mathbf {E}\left( {\Vert {\hat{f}_{\bar{\alpha }}-f^\dagger }\Vert _{\mathbb {X}}}\right) =\mathcal {O}\left( {\psi \left( {\frac{1}{\sqrt{t}}}\right) }\right) ,\qquad t\rightarrow \infty . \end{aligned}$$