1 Introduction

Reproducibility and robustness are important concerns in image processing and pattern recognition, and in various applications such as medical image analysis [18, 26]. While the former refers to the replicable reuse of a method (and generally of a code) by associating input image data with the method's outputs [17], the latter is generally understood as the ability of an algorithm to withstand uncontrolled phenomena and data uncertainties, such as image noise [29]. This article focuses on the evaluation of robustness, a crucial matter in machine learning and computer vision [2, 21], and increasingly so with the emergence of deep learning algorithms [3, 6] and big data [22, 25]. In the field of image processing, however, this definition of robustness and its evaluation have received little dedicated study. The definition we proposed at RRPR 2016 [28] (called \(\alpha \)-robustness) was a first attempt at measuring robustness over multiple scales of noise, applied to two tasks: still image denoising and background subtraction in videos. In that previous work, image data was assumed to be altered by an additive Gaussian (or equivalent) noise, a common hypothesis when referring to noisy image content. The robustness measurement consisted in calculating the worst quality loss (the \(\alpha \) value) of a given algorithm over a set of noise scales (e.g. an increasing standard deviation of a Gaussian noise).

In the present article, we introduce in Sect. 2 a novel quality-scale definition of robustness, still dedicated to image processing algorithms, based on a generalized model of the perturbing phenomenon under consideration. Instead of representing only additive Gaussian noise, we can consider more complex image data uncertainties. To evaluate robustness, we only need to measure data uncertainty with a monotonically increasing function. Moreover, together with the \(\alpha \) value presented earlier, we also calculate the scale of uncertainty (\(\sigma \)) at which an algorithm's worst loss of quality occurs. We then apply this definition (called \((\alpha ,\sigma )\)-robustness), first by revisiting image enhancement and denoising, where noise is again represented in a multi-scale manner (Sect. 3), as we did in [28]. In this context the uncertainty is modeled as a classic Gaussian noise. Second, we study the impact of shape variability in liver volume segmentation from medical images (Sect. 4). Here, we also propose to measure the uncertainty (liver variability) by a monotonic function, thus suited to our test of robustness. In Sect. 5, we describe the code, publicly downloadable from [24], that reproduces the results of this paper, so that any reader may evaluate the robustness of image processing methods. We conclude and broaden the viewpoint of this paper by proposing future axes of progress for this research in Sect. 6.

2 A Novel Definition of Robustness for Image Processing

We first consider that an algorithm designed for image processing may be perturbed because its input data is altered by a given uncertainty. Extending the notations of [20, 28], we write:

$$\begin{aligned} \widehat{y}_i=y_i^0\odot \delta y_i,\ y_i\in \mathbb {R}^q,\ i=1,\dots ,n, \end{aligned}$$
(1)

which will be shortened to \(\mathbf {\widehat{Y}}=\mathbf {Y^0}\odot \delta \mathbf {Y}\) when the context allows it, i.e. when the subscripts are not necessary. The measurement \(\mathbf {\widehat{Y}}\) is obtained from a perfect value \(\mathbf {Y^0}\), corrupted by the alteration \(\delta \mathbf {Y}\). Classically, \(\delta \mathbf {Y}\) may be considered as a Gaussian noise by supposing that \(\delta y_i\simeq GI(0, \sigma ^2C_y)\), where \(\sigma ^2C_y\) is the covariance of the errors at a known noise scale \(\sigma \) (e.g. standard deviation or std.). This noise is generally added to the input data so that \(\mathbf {\widehat{Y}}=\mathbf {Y^0}+\delta \mathbf {Y}\). Section 3 explores this classic scenario of additive noise modeling.
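As a minimal illustration of this additive scenario (a sketch of ours, not part of the code of [24]; it assumes a grayscale image stored as a NumPy array), the corrupted measurements \(\mathbf {\widehat{Y}}\) can be simulated for several noise scales as follows:

import numpy as np

def corrupt_additive_gaussian(y0, sigmas, seed=0):
    # Simple additive case Y_hat = Y0 + delta_Y, with delta_Y a Gaussian
    # noise of standard deviation sigma; one corrupted copy per scale.
    rng = np.random.default_rng(seed)
    return {s: y0 + rng.normal(0.0, s, size=y0.shape) for s in sigmas}

# Perfect image Y0 (here a constant toy image) and the scales used in Sect. 3.
y0 = np.full((64, 64), 128.0)
noisy = corrupt_additive_gaussian(y0, sigmas=[5, 10, 15, 20, 25])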

Fig. 1. Evaluation of the \((\alpha ,\sigma )\)-robustness on a synthetic example: two algorithms are compared by graphical inspection (a), where the most severe decrease of quality is observed for Algorithm 1, and this is confirmed by the numerical evaluation (b).

In this article, we also consider more complex phenomena that do not follow this model. In such difficult situations, the alteration \(\delta \mathbf {Y}\) and the operator \(\odot \) cannot be modeled theoretically or evaluated numerically, and we only know the measurements \(\mathbf {\widehat{Y}}\) and the perfect case \(\mathbf {Y^0}\). A way to model the uncertainty is then to define a variability scale \(\sigma \) between a given sample \(\mathbf {\widehat{Y}}\) and the perfect, standard case \(\mathbf {Y^0}\). In Sect. 4, we study shape variability from this viewpoint.

Let A be an algorithm dedicated to image processing, producing an output \(\mathbf {X}=\{x_i\}_{i=1,n}\) (in general the image resulting from the algorithm). Let N be an uncertainty specific to the target application of this algorithm, and \(\{\sigma _k\}_{k=1,m}\) the scales of N. The outputs of A for every scale of N are \(\mathbb {X}=\{\mathbf {X_k}\}_{k=1,m}\). The ground truth is denoted by \(\mathbb {Y}^0=\big \{\mathbf {Y_k^0}\big \}_{k=1,m}\). Let \(Q({\mathbf {X_k}},\mathbf {Y_k^0})\) be a quality measure of A for scale k of N; its parameters are the result of A and the ground truth for noise scale k. An example is the F-measure, which combines true and false positive and negative detections for a binary decision (binary segmentation for instance). Our new definition of robustness can be formalized as follows:

Definition 1

(\((\alpha ,\sigma )\)-robustness). Algorithm A is considered robust if the difference between the output \(\mathbb {X}\) and the ground truth \(\mathbb {Y}^0\) is bounded through a Lipschitz continuity condition on the Q function:

$$\begin{aligned} d_Y\left( Q(\mathbf {X_k},\mathbf {Y_k^0}),Q(\mathbf {X_{k+1}},\mathbf {Y_{k+1}^0})\right) \le \alpha d_X(\sigma _{k+1}, \sigma _k),\ 1\le k< m, \end{aligned}$$
(2)

where

$$\begin{aligned} \nonumber&d_Y\left( Q(\mathbf {X_k},\mathbf {Y_k^0}),Q(\mathbf {X_{k+1}},\mathbf {Y_{k+1}^0})\right) = Q(\mathbf {X_{k+1}},\mathbf {Y_{k+1}^0})-Q(\mathbf {X_k},\mathbf {Y_k^0}), \\&d_X(\sigma _{k+1}, \sigma _k) = |\sigma _{k+1} - \sigma _k|. \end{aligned}$$
(3)

We calculate the robustness measure \((\alpha ,\sigma )\) of A as the \(\alpha \) value obtained and the scale \(\sigma =\sigma _k\) at which this value is reached.

In other words, \(\alpha \) measures the worst drop in quality through the scales of uncertainty \(\{\sigma _{k}\}\), and \(\sigma \) records the uncertainty scale at which this drop occurs. The most robust algorithm should have a low \(\alpha \) value and a very high \(\sigma \) value. Figure 1 shows a synthetic example of the evaluation of two algorithms with this definition. It illustrates the better robustness of Algorithm 2, whose \(\alpha \) value is smaller than that of Algorithm 1. Moreover, the \(\sigma \) value indicates that this worst quality drop is only reached at a larger scale of uncertainty.
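To make the computation concrete, the following sketch (our own illustration, not the reference code of [24]) derives the \((\alpha ,\sigma )\) pair of Definition 1 from a list of uncertainty scales and the corresponding quality values; the quality variation is taken in absolute value, consistent with reading \(\alpha \) as the worst quality variation per unit of scale.

def alpha_sigma(scales, quality):
    # scales : increasing uncertainty scales sigma_1 < ... < sigma_m
    # quality: Q(X_k, Y0_k) for each scale (e.g. SSIM or Dice values)
    # Returns the largest quality variation per unit of scale (alpha) and
    # the scale sigma_k at which this worst variation is reached.
    assert len(scales) == len(quality) and len(scales) >= 2
    alpha, sigma = 0.0, scales[0]
    for k in range(len(scales) - 1):
        ratio = abs(quality[k + 1] - quality[k]) / abs(scales[k + 1] - scales[k])
        if ratio > alpha:
            alpha, sigma = ratio, scales[k]
    return alpha, sigma

# Hypothetical quality values for one algorithm (not the data of Fig. 1):
print(alpha_sigma([5, 10, 15, 20, 25], [0.95, 0.90, 0.70, 0.65, 0.62]))  # roughly (0.04, 10)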

3 Application to Image Enhancement and Denoising

Image denoising has been addressed by a wide range of methodologies; see [16] for a general overview. The shock filter [23] is a PDE scheme that employs morphological operators depending on the sign of the Laplacian operator. The original algorithm is not able to reduce image noise accurately, but several authors have improved it for this purpose. As summarized in Fig. 2-b, our test of robustness covers these shock-scheme-based approaches [1, 7, 30]; another PDE-based algorithm named coherence filtering [32]; the classic median [12] and bilateral [27] filters; and an improved version of the median filter [14]. We use 13 well-known images (Barbara, Airplane, etc.), altered by additive white Gaussian noise of varying kernel std., considering the scales of noise \(\{\sigma _k\}=\{5,10,15,20,25\}\). The quality measure is the SSIM (structural similarity) originally introduced in [31].
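As an illustration of this protocol (a sketch under our own assumptions, using a plain median filter as a stand-in for the algorithms listed above, which we do not reimplement here; it assumes SciPy and scikit-image are available and 8-bit grayscale intensities), the SSIM curve of one denoiser over the noise scales can be collected as follows:

import numpy as np
from scipy.ndimage import median_filter
from skimage.metrics import structural_similarity

def ssim_over_noise_scales(y0, sigmas, seed=0):
    # Quality curve Q(sigma) for a simple median-filter denoiser under
    # additive white Gaussian noise of standard deviation sigma.
    y0 = np.asarray(y0, dtype=float)
    rng = np.random.default_rng(seed)
    curve = []
    for s in sigmas:
        noisy = np.clip(y0 + rng.normal(0.0, s, size=y0.shape), 0, 255)
        denoised = median_filter(noisy, size=3)
        curve.append(structural_similarity(y0, denoised, data_range=255.0))
    return curve

# e.g. quality = ssim_over_noise_scales(barbara, [5, 10, 15, 20, 25]),
# then alpha, sigma = alpha_sigma([5, 10, 15, 20, 25], quality).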

Fig. 2. Evaluation of \((\alpha ,\sigma )\)-robustness for image denoising algorithms, by studying the decrease of the quality function through the scales of noise (a), or numerically by inspecting the \((\alpha ,\sigma )\) values for each algorithm (b).

Fig. 3. Illustrations of the results obtained for all the image denoising algorithms of our test.

Thanks to Definition 1, we are able to evaluate the robustness of the various algorithms (Fig. 2), either visually with the graph in Fig. 2(a), or numerically with the \((\alpha ,\sigma )\) values in (b).

Since we consider an additive noise (\(\mathbf {\widehat{Y}}=\mathbf {Y^0}+\delta \mathbf {Y}\) with our notations), the quality functions decrease monotonically over the set of noise scales, revealing that the tested algorithms progressively lose their efficiency. We can appreciate the good behavior of the SmoothedMedian and SmoothedShock algorithms, which have a lower \(\alpha \) value and a larger \(\sigma \) scale than the other approaches: their worst quality decrease is only observed when an aggressive Gaussian noise is applied to the images.

Figure 3 presents the outputs obtained for all algorithms of our test. This confirms the good image enhancement achieved by the most robust methods, SmoothedMedian and SmoothedShock.

4 Application to Liver Volume Segmentation

Liver segmentation has been addressed by various approaches in the literature [11], mostly oriented towards the CT (computerized tomography) modality (see e.g. [10]). We compare two liver extraction approaches in this test of robustness. The automatic model-based algorithm presented in [15] (hereafter MultiVarSeg) relies on a prior 3-D representation of any patient's liver built by accumulating images from diverse public datasets. We compare MultiVarSeg with a freely available semi-automatic segmentation software called SmartPaint [19], which allows a fully interactive segmentation of medical volume images based on region growing.

Fig. 4. Evaluation of \((\alpha ,\sigma )\)-robustness for liver segmentation algorithms, by studying the fluctuations of the quality function through the scales of variability (a), or numerically by inspecting the \((\alpha ,\sigma )\) values for each algorithm (b). The scale \(\sigma \) obtained for each algorithm in (b) is depicted with a vertical dotted line in (a).

Fig. 5. Illustration of the results obtained by the two algorithms of our test.

To compare these methods, we employ the datasets provided by the Research Institute against Digestive Cancer (IRCAD) [13] and by the SLIVER benchmark [11]. We propose to study the uncertainty of liver shape variability, which reflects this organ's complex and variable geometry. First, we construct a bounding box (BB) with standard liver dimensions certified by an expert and computed from the mean values of our database. We then measure the liver variability of a given binary image (object of interest, the liver, vs. background) by the following function:

$$\begin{aligned} \sigma =\frac{\#(L\setminus BB)}{\#(L)}\times 100, \end{aligned}$$
(4)

where L is the set of pixels that belong to the liver in a binary segmentation, and \(L\setminus BB\) represents the liver pixels located outside the standard box BB. The operator \(\#(.)\) stands for the cardinality of a set. To compare the tested algorithms, we use the Dice coefficient, a very common way to measure the accuracy of any segmentation method.
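As a small sketch of these two measures (ours, assuming 3-D binary masks stored as NumPy boolean arrays, with BB given as a mask of the standard bounding box), Eq. 4 and the Dice coefficient can be computed as follows:

import numpy as np

def variability_scale(liver, bb):
    # Eq. 4: percentage of liver voxels lying outside the standard bounding box BB.
    outside = np.logical_and(liver, np.logical_not(bb))
    return 100.0 * outside.sum() / liver.sum()

def dice(seg, gt):
    # Dice coefficient between a binary segmentation and the ground truth.
    inter = np.logical_and(seg, gt).sum()
    return 2.0 * inter / (seg.sum() + gt.sum())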

In Fig. 4, we present the result of the test of robustness, considering the scales of variability defined by Eq. 4.

We consider here a more complex phenomenon producing uncertainty in the image data (general formalism \(\mathbf {\widehat{Y}}=\mathbf {Y^0}\odot \delta \mathbf {Y}\)), measured by a variability function. It leads to non-linear quality functions for both algorithms; our definition of robustness nevertheless enables the assessment in this case. We can thus observe the better robustness of MultiVarSeg compared to SmartPaint, with a lower \(\alpha \) and a larger \(\sigma \) value.

Figure 5 depicts the segmentation results obtained by each tested method. This visual inspection confirms the accuracy of the model-based approach MultiVarSeg.

5 Reproducibility

We have developed a Python code, publicly provided in [24], which permits a visual and numerical assessment of the robustness of image processing techniques. The reader can thus reproduce the plots and tables of Figs. 1, 2 and 4 of this paper. These elements are automatically created from input data files such as those presented in Fig. 6.

Fig. 6. Input data files for the three tests of robustness presented in this article (a–c). Comments (lines starting with '#') have been removed in this figure for the sake of clarity. In (d), the outputs are a table written in LaTeX summarizing the robustness test (left), and a figure for a visual inspection of robustness together with an output in the console (right).

Such files are composed of: the quality measure on the first line; on the second line, the name of the noise or uncertainty to be studied, followed by the values of the scales; then one line per tested algorithm until the end of the file, containing the algorithm's name in first position followed by its quality values.
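A minimal parser for this layout could look like the sketch below (our own illustration, assuming whitespace-separated fields and '#' comment lines; the actual reader shipped with [24] may differ):

def read_robustness_file(path):
    # Line 1: quality measure name; line 2: uncertainty name followed by scales;
    # remaining lines: algorithm name followed by its quality values.
    with open(path) as f:
        lines = [l.split() for l in f
                 if l.strip() and not l.lstrip().startswith('#')]
    quality_name = lines[0][0]
    uncertainty_name, scales = lines[1][0], [float(v) for v in lines[1][1:]]
    algorithms = {l[0]: [float(v) for v in l[1:]] for l in lines[2:]}
    return quality_name, uncertainty_name, scales, algorithms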

Once any user runs:

figure a

for instance, our program provides a plot, displayed and saved as 'fig_rob.pdf'; a LaTeX file named 'tab_rob.tex' containing the table of \((\alpha ,\sigma )\)-robustness values in decreasing order of \(\alpha \); and it also prints these values in the console (see Fig. 6-d).

To obtain these measures, our program first calculates \(\alpha \) according to Definition 1. To do so, we can rewrite Eq. 2 to determine \(\alpha \) as:

$$\begin{aligned} \alpha \ge \frac{d_Y\left( Q(\mathbf {X_k},\mathbf {Y_k^0}),Q(\mathbf {X_{k+1}},\mathbf {Y_{k+1}^0})\right) }{d_X(\sigma _{k+1}, \sigma _k)},\ 1\le k< m. \end{aligned}$$
(5)

The denominator \(d_X(\sigma _{k+1}, \sigma _k)\) never equals zero; this is easily ensured by always considering distinct scales of uncertainty, i.e. by assuming w.l.o.g. that \(\sigma _{k+1}>\sigma _k,\ 1\le k< m\). We could select any value of \(\alpha \) satisfying this inequality; however, we prefer a reproducible strategy and compute the maximal value:

$$\begin{aligned} \alpha = \max _{1\le k< m}\left\{ \frac{d_Y\left( Q(\mathbf {X_k},\mathbf {Y_k^0}),Q(\mathbf {X_{k+1}},\mathbf {Y_{k+1}^0})\right) }{d_X(\sigma _{k+1}, \sigma _k)}\right\} . \end{aligned}$$
(6)

During this process, we also store the uncertainty scale \(\sigma \) at which this \(\alpha \) value is reached.
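Combining the hypothetical helpers sketched in the previous sections, the ranking in decreasing order of \(\alpha \) can be reproduced as follows (again an illustrative sketch with a made-up file name, not the program distributed in [24]):

quality_name, uncertainty_name, scales, algorithms = read_robustness_file('denoising.txt')
results = {name: alpha_sigma(scales, q) for name, q in algorithms.items()}
for name, (alpha, sigma) in sorted(results.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: alpha={alpha:.4f}, reached at {uncertainty_name} scale sigma={sigma}")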

Fig. 7. Simulations of FDG-PET (FluoroDeoxyGlucose - Positron Emission Tomography) and CT (right and left respectively) by VIP, from [8].

6 Discussion

In this paper, we have introduced a novel approach to measure the robustness of image processing algorithms. We first proposed a model of image uncertainty that encompasses the classic additive Gaussian noise alteration. Second, we refined the quantities calculated for a given algorithm: besides the quality loss obtained by considering Lipschitz continuity over the scales of uncertainty, we also keep the scale at which this worst decrease appears. This permits studying the weakness of a method and the kind of image data for which it may occur in a concrete application.

As future work, we would like to compare our measure with other approaches, such as calculating the area under the curve or summing the successive quality variations. For both image enhancement and segmentation, we have conducted our study with datasets of limited size, and we will have to confirm our results with larger image collections. We also hope that the code freely downloadable at [24] will help researchers and engineers address this problem of robustness for image processing more easily in their activity.

Furthermore, it would be interesting to study noises inherent to acquisition machines from a multi-scale point of view, such as Rician noise in MRI (magnetic resonance imaging) [5, 9]. Relating organ shape variability to robust image processing is another important question that has not been studied in this way in the literature. Our first measure of variability can obviously be applied to organs other than the liver, and should be enhanced by further research. More precisely, we could increase the number of parameters used to represent complex organic shapes by relying on more sophisticated models, such as [4] for instance. Robustness could thus be studied in a (slightly) higher dimension, to better understand the variation of image processing outcomes.

Whatever the uncertainty studied, it is necessary to acquire a voluminous amount of data and to annotate it in order to determine algorithms' robustness. To build such a database, we could use simulation, such as the VIP (Virtual Imaging Platform), which generates images with various parameters related to the acquisition machine and to the target organ's anatomy [8] (see Fig. 7). To do so, we would have to feed this simulator with data from the target modality (CT, MRI, ultrasound) and from the organ localization (e.g. binary masks of the liver volume).