1 Introduction

Fully automatic semantic segmentation of medical images is a major challenge. Over the last few years, Deep Learning and Convolutional Neural Networks (ConvNets) have reached outstanding performance on various visual recognition tasks [9]. For semantic segmentation of natural images, state-of-the-art results are currently obtained with Fully Convolutional Networks (FCNs) [1, 3]. Consequently, several attempts have been made to apply these methods to medical images [11, 15, 16]. In challenges like the Liver Tumor Segmentation Challenge (LiTS), the leading methods are based on FCNs [5, 10].

However, training deep ConvNets requires large amounts of data with clean annotations. For semantic segmentation, which requires pixel-level labeling, annotation is an extremely time-consuming task. This challenge is amplified in the medical field, where highly qualified professionals are needed. In this paper, we focus on abdominal 3D CT-scans from an internal dataset of more than 1000 patients, each volume containing about a hundred \(512 \times 512\) images. The segmentation masks were produced by clinical experts who focused on specific organs or anatomical structures, e.g. liver pathologies. As a consequence, the collected labels intrinsically contain missing annotations, as illustrated in Fig. 1.

Fig. 1. Our 3D CT-scan dataset is labeled by clinical experts who focused on certain organ pathologies, e.g. liver. The ground truth annotations are therefore incomplete. We define ambiguity maps to train binary class predictors, which ignore incorrect background labels.

Several learning methodologies can be used to address the aforementioned missing-annotation issue. Weakly Supervised Learning (WSL) can leverage coarse annotations, e.g. global image or volume labels. WSL is generally closely connected to Multiple Instance Learning [4], and has been used for WSL segmentation of natural images [13, 14] and medical data [7]. However, performing pixel-wise prediction from global labels is known to be challenging, making WSL approaches generally substantially inferior to their fully supervised counterparts. Since missing annotations are absorbed into the background class, another option is to design models robust to noisy labels, which have recently been applied to semantic segmentation [8, 12]. Although interesting, most of these methods rely on the assumption that the ratio of noisy labels remains relatively low, whereas more than 50% of the organs are commonly missing in our context.

In this paper, we introduce SMILE, a new method for Semantic segmentation with MIssing Labels and ConvNEts. Firstly, we design a learning scheme which converts the segmentation of K organ classes into K binary problems, and we define ambiguity maps which allow training the model with 100% clean labels (see Fig. 1), while retaining a largely sufficient number of negative samples. The model trained at this first stage is then used to automatically predict labels for missing organs, using a curriculum strategy [2] (SMILEr). We perform extensive experiments on a subset of our dataset for the segmentation of three organ classes: liver, pancreas and stomach. We show that our approach significantly outperforms a strong FCN baseline based on DeepLab [3], especially when the number of missing organs is large. The final model (SMILEr), trained with only 30% of the organs present, performs similarly to a baseline trained with complete ground truth annotations.

2 SMILE Model

The SMILE model is dedicated to semantic segmentation with missing labels using ConvNets. The missing organ annotations are labeled as “background”, as shown in Fig. 1.

SMILE is based on the strong DeepLab baseline [3], which shows impressive results on natural and medical images [5]. The DeepLab backbone is a Fully Convolutional Network (FCN), e.g. ResNet [6], as shown in Fig. 2. In DeepLab, \(1 \times 1\) convolutions and a soft-max are applied to classify each pixel into K (+1, i.e. background) classes.

2.1 Handling Missing Annotations

In our context, the main limitation of DeepLab is that background labels sometimes correspond to missing organs. Back-propagating these background labels may therefore harm training by conflicting with pixels where the organ is properly annotated.

SMILE Architecture. To address this problem, we start from the (\(K+1\)) multi-class classification formulation and classify each organ independently using K binary classifiers. The SMILE architecture is shown in Fig. 2. We use \(1\times 1\) convolutions, as in DeepLab, but we apply a sigmoid activation function to predict the presence/absence of each organ at each pixel.
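For concreteness, below is a minimal PyTorch sketch of such a head; the module name SMILEHead, the tensor layout, and the absence of an explicit background channel are illustrative assumptions, not released code.

```python
import torch
import torch.nn as nn

class SMILEHead(nn.Module):
    """K independent per-pixel binary classifiers (sigmoid instead of softmax)."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # As in DeepLab, a 1x1 convolution maps backbone features to class
        # scores; here, one logit map per organ class.
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Sigmoid yields K independent presence probabilities at each pixel,
        # instead of one (K+1)-way softmax distribution.
        return torch.sigmoid(self.classifier(features))
```

Dropping the softmax coupling lets each organ be supervised independently, which is what makes the per-class ambiguity maps introduced below possible.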

SMILE Training. During training, the K binary models generate K losses at each pixel by computing the binary cross-entropy: \(L_k(\hat{y_k}, y_k^*) = -(y_k^*\,\log (\hat{y_k}) + (1-y_k^*)\,\log (1-\hat{y_k}))\). The final loss aggregates these K losses through summation:

$$\begin{aligned} L(\hat{y},y^*) = \sum _{k=1}^K w_k ~L_k(\hat{y_k}, y_k^*) \end{aligned}$$
(1)

where \(w_k \in \{0,1\}\) is a binary weight map which selects or ignores pixels for class k.

The \(w_k\) weights are the core of the SMILE model: they are used to ignore ambiguous annotations during training. We illustrate the rationale of our approach in Fig. 2. Consider a volume where only one organ is annotated. In the baseline DeepLab model, the pixels of the other organs in each slice are incorrectly labeled as background and back-propagated as such. In contrast, SMILE only back-propagates labels which are certain. In this example, we can back-propagate positive/negative labels for the annotated organ a at every pixel p: we thus have \(w_{a}=1~\forall p\). For an unannotated organ u, on the other hand, we only train the binary classifier on pixels which certainly do not belong to that class: \(w_{u}=1\) for all pixels of the annotated organs. All other pixels are ignored during training, i.e. \(w_{u}=0\).
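The following sketch illustrates how the ambiguity maps and the masked loss of Eq. (1) could be computed for one slice; the tensor layout, the function names, and the normalization by the number of selected pixels are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def ambiguity_maps(gt_masks: torch.Tensor, annotated: torch.Tensor) -> torch.Tensor:
    """Build the binary weight maps w_k of Eq. (1) for one slice.

    gt_masks:  (K, H, W) float binary ground-truth masks (zero where unannotated).
    annotated: (K,) bool, True if organ k is annotated in this volume.
    """
    # Union of the pixels of all annotated organs: these are certain
    # negatives for any unannotated class.
    annotated_fg = (gt_masks[annotated].sum(dim=0) > 0).float()  # (H, W)
    w = torch.zeros_like(gt_masks)
    for k in range(gt_masks.shape[0]):
        if annotated[k]:
            w[k] = 1.0            # organ k annotated: every pixel label is certain
        else:
            w[k] = annotated_fg   # only pixels of annotated organs are certain negatives
    return w

def smile_loss(pred: torch.Tensor, gt_masks: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Pixel-wise binary cross-entropy, masked by w_k and summed over the K
    # classes (Eq. 1); normalizing by the number of selected pixels is a
    # practical choice, not a detail specified above.
    bce = F.binary_cross_entropy(pred, gt_masks, reduction="none")
    return (w * bce).sum() / w.sum().clamp(min=1.0)
```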

Fig. 2. SMILE architecture and training. The presence of an organ at each pixel is determined by K independent binary classifiers. During training, a per-class weight map \(w_k\) allows ambiguous pixels to be ignored.

The idea behind SMILE is to use only true positive and true negative labels during training. To formalize this, consider a given organ class k with its associated binary classification problem. We denote by \(\beta _k\) the ratio of pixels belonging to the organ over all image slices, and by \(\alpha \) the ratio of missing labels for this organ in the dataset. Table 1 shows the confusion matrix for the labels used by SMILE and by the DeepLab baseline. Both use the same amount of true positives: \(TP=(1-\alpha )\cdot \beta _k\). For negative examples, however, the baseline uses \(FN=\alpha \cdot \beta _k\) false negatives, i.e. the ratio of unannotated pixels belonging to the organ. The ratio \(\frac{TP}{FN}=\frac{1-\alpha }{\alpha }\) gives a good indication of the influence of the wrong information: with \(\alpha > 0.5\), \(\frac{TP}{FN} < 1\), which means that the model incorporates more wrong labels than correct ones, dramatically deteriorating its performance.

On the other hand, the baseline learns with more true negatives (\(1-\beta _k\)) than SMILE (\((1-\alpha)(1-\beta _k) + \epsilon \), where \(\epsilon = \sum _{k'\ne k}\beta _{k'}\) corresponds to the labels of the other organs, see Fig. 2). However, we take advantage of the class imbalance: generally \(\beta _k \ll 1\), e.g. \(\beta _k=0.05\), since the organs represent a small proportion of the total volume. As a consequence, even after removing some background examples, we still have largely enough information to learn the background class properly.
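A small numeric illustration of these ratios, using \(\alpha =0.7\) and the example value \(\beta _k=0.05\) from the text:

```python
# Illustrative arithmetic for the label ratios of Table 1.
alpha, beta_k = 0.7, 0.05

tp = (1 - alpha) * beta_k              # 0.015, identical for baseline and SMILE
baseline_fn = alpha * beta_k           # 0.035
baseline_tn = 1 - beta_k               # 0.95
smile_tn = (1 - alpha) * (1 - beta_k)  # 0.285 (plus epsilon from the other organs)

print(tp / baseline_fn)  # ~0.43 < 1: the baseline sees more wrong labels than correct ones
```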

Table 1. Training label analysis. GT: Ground Truth.

Labels used    Baseline                          SMILE
TP             \((1-\alpha )\cdot \beta _k\)     \((1-\alpha )\cdot \beta _k\)
FN             \(\alpha \cdot \beta _k\)         0
TN             \(1-\beta _k\)                    \((1-\alpha )(1-\beta _k) + \epsilon \)

2.2 Incremental Self-supervision and Relabeling

The number of true positives (TP) decreases linearly with the ratio of missing organ annotations \(\alpha \) (Table 1). SMILE can thus be improved by recovering TP in unannotated training images. We propose a self-supervised approach to achieve this goal, called SMILEr (SMILE with relabeling). The idea of SMILEr is to iteratively produce, for each class k, new positive target labels \(y_{i,t}^*=1\) in images with missing annotations \(\mathbf {x}_i\), using a curriculum strategy [2].

SMILEr is initialized with SMILE, which has been trained with correct positive labels only (Table 1); these can be regarded as “easy positive samples”. Let us denote by \(\hat{y_{i}}^+\) the pixels predicted as positive by SMILE in a given unannotated image \(\mathbf {x}_i\). SMILEr then adds new \(\oplus \) labels \(y_{i,t}^{*,+}\) by selecting the top-scoring pixels among \(\hat{y_{i}}^{+}\). The model is then retrained with the augmented training set, and the process is iterated T times, selecting an increasing ratio \(\gamma _t= \frac{t}{T} \gamma _{max}\) of top-scoring pixels among the positives.

The new \(\oplus \) labels \(y_{i,t}^{*}\) incorporated at each curriculum iteration are “harder examples”, since they are incrementally determined by a model trained with a growing set of self-supervised positives.
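A sketch of one relabeling step is shown below; the 0.5 decision threshold and the helper name are illustrative assumptions.

```python
import torch

def relabel_positives(probs: torch.Tensor, known_negative: torch.Tensor,
                      gamma_t: float) -> torch.Tensor:
    """One SMILEr relabeling step for class k in an unannotated volume.

    probs:          (H, W) sigmoid outputs of the current model.
    known_negative: (H, W) bool, pixels of other annotated organs (never relabeled).
    gamma_t:        (t / T) * gamma_max, ratio of predicted positives to promote.
    Returns a bool map of the new (+) target labels.
    """
    candidates = (probs > 0.5) & ~known_negative  # predicted positives y_hat+
    scores = probs[candidates]
    if scores.numel() == 0:
        return torch.zeros_like(candidates)
    k = max(1, int(gamma_t * scores.numel()))
    threshold = scores.topk(k).values.min()       # score of the k-th best pixel
    return candidates & (probs >= threshold)      # keep only the top-scoring positives
```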

Fig. 3. Dice score versus the proportion of missing annotations \(\alpha \). The baseline is shown in blue, SMILE in red and SMILEr in green. (Color figure online)

3 Experiments and Results

We perform experiments on a subset of our dataset with complete ground truth annotations for three organs: liver, pancreas and stomach, comprising 72 3D CT volumes. We generate a partially annotated dataset by randomly removing the annotations of \(\alpha \%\) of the organs, independently for each organ.
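A sketch of this corruption procedure; the data layout and helper name are illustrative assumptions.

```python
import numpy as np

def drop_annotations(volumes, alpha: float, num_classes: int, seed: int = 0):
    """Randomly hide a ratio alpha of organ annotations, independently per organ.

    volumes: list of dicts with "masks" (K, D, H, W) arrays and an
             "annotated" list of K booleans.
    """
    rng = np.random.default_rng(seed)
    for vol in volumes:
        for k in range(num_classes):
            if rng.random() < alpha:
                vol["masks"][k] = 0          # the organ now looks like background
                vol["annotated"][k] = False  # flag used to build the ambiguity maps
    return volumes
```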

Quantitative Evaluations. We compare our approach to the DeepLab baseline [3] with a varying ratio of missing annotations \(\alpha \). We randomly split the data into training (\(80\%\)) and testing (\(20\%\)) sets K times, and report the mean and standard deviation of the Dice score over the K runs. For SMILEr, we fix \(T=2\) and \(\gamma _{max}=0.66\).
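For reference, the Dice score between binary masks can be computed as follows (the smoothing constant is a common implementation convention, assumed here):

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Standard Dice coefficient between binary masks (eps avoids 0/0)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```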

Figure 3 shows the results for the baseline, SMILE and SMILEr, for each organ and on average. As expected, the maximum scores are reached when 100% of the annotations are kept, i.e. \(\alpha =0\). When \(\alpha \) increases, the performance of the baseline drops dramatically, whereas our approach continues to perform well. For example, SMILE performs similarly to the model trained with complete annotations at \(\alpha =40\%\), whereas the baseline loses about 20 points. The gain is even more pronounced for SMILEr, whose results remain comparable to the fully annotated model at \(\alpha =70\%\), a regime where the baseline performs very poorly.

Fig. 4. SMILEr behaviour with \(T=3\) iterations, \(\gamma _{max}=1.0\) and \(\alpha =50\%\). SMILEr predictions in red, selected \(\oplus \) pixels for the next iteration in blue. (Color figure online)

SMILEr Analysis. Figure 3 shows that the Dice score is higher for larger organs. Regarding SMILEr, its improvement is especially pronounced for small organs; see for example the large performance boost for the pancreas and the stomach.

Figure 4 shows how training evolves during the \(T=3\) curriculum iterations of SMILEr, with \(\gamma _{max}=1\). At \(t=0\), we show the segmentation of SMILE, with blue pixels indicating the new positive labels added for the next step. We can see how the segmentation is refined and is nearly perfect at \(\gamma _2 = 0.66\) (\(t=2\)). It is also interesting to see how the model tends to over-predict some labels at \(\gamma _3 = 1.0\).

Finally, Fig. 5 shows the final segmentation of the three organ classes in a test image for SMILEr and the baseline, at \(\alpha =70\%\). The baseline fails almost completely, whereas SMILEr successfully segments all organs.

Fig. 5. Segmentation results for the baseline and SMILEr, with \(\alpha = 70\%\). The liver is in blue, the pancreas in red and the stomach in green. (Color figure online)

4 Conclusions

We have introduced a new model, SMILE, dedicated to semantic segmentation with incomplete ground truth. SMILE trains a first model on certain labels only, which is subsequently used to incrementally relabel positive pixels. Experiments show that SMILE achieves performance comparable to a model trained with complete annotations using only \(30\%\) of the labels. Future work includes applying SMILE to other organ classes, and incorporating uncertainty into the selection of target pixel labels in our curriculum approach.