1 Introduction

Gliomas are the most common family of brain tumors and are among the brain cancers with the highest mortality and economic cost [1,2,3]. Diagnosis relies heavily on manual segmentation and analysis of multi-modal MRI scans by biomedical experts. This approach is severely limited by the labor-intensive nature of manual segmentation and by disagreements and mistakes among annotators. Consequently, there is a great need for a fast and robust automated segmentation algorithm. Convolutional neural networks (CNNs) have proven extremely effective for a variety of semantic segmentation tasks [4].

While CNN segmentation algorithms are abundant in biomedical imaging, only very few make use of nested-topological prior information. Among the few that do [5,6,7,8,9,10,11], we find three different approaches. First, cascaded algorithms, in which the network consists of successive segmentation networks. Second, the information on the nested classes is incorporated into the loss function, imposing penalties on solutions that do not respect the nested geometric relations. Third, Markov random fields are used to formalize class relationships in the post-processing of the network output. Here, we make use of a new activation function [12] that directly implements the class hierarchy in network training, and we generalize it to 3 nested classes. For the glioma labels, we assume that active tumor regions are always contained in the tumor core, which is surrounded by the tumor edema, resulting in a hierarchical three-class model. In sharp contrast to the nested-class method, the softmax-based multi-class method ignores the geometric prior between classes and assumes that the classes are mutually exclusive, meaning that one pixel cannot belong to several classes at the same time. It therefore discards the topological information and sometimes produces unreasonable segmentation results. We compare the Dice scores of the two methods and find that the nested-class method achieves higher accuracy than the softmax-based method, especially for the inner classes.

In the following, we give a brief overview of the state-of-the-art 3D residual U-net architecture and of the multi-class-nested activation and loss functions. We then propose and evaluate our model architecture for BRATS tumor segmentation. Finally, we compare the two approaches and show that the multi-level activation performs better, especially for the inner classes.

2 Methodology

2.1 Network Architecture

The nested-class relationship between the labels is shown in Fig. 2. The general network structure shown in Fig. 1 stems from the glioma segmentation network previously used by Isensee [13] and processes large 3D input blocks of \(144\times 144\times 144\) voxels. The original network is inspired by the U-net [14], which allows the network to intrinsically recombine different scales throughout the entire network. The vertical depth is set to 5, which balances spatial resolution against feature representation. The context module is a pre-activation residual block, and context modules are connected by \(3\times 3\times 3\) convolutions with input stride 2. The purpose of the localization pathway is to extract features from the lower levels of the network and transform them to a high spatial resolution by means of simple upscaling. The upsampled features and the context aggregation features of the corresponding level are recombined via concatenation. Furthermore, the localization module, consisting of a \(3\times 3\times 3\) convolution followed by a \(1\times 1\times 1\) convolution, is designed to gather these features.

Fig. 1. Network architecture from [13]: the context pathway (left) aggregates high-level information; the localization pathway (right) localizes precisely

Fig. 2. Schematic description of the nesting of classes in the BRATS challenge, which respects the following hierarchy: Enhancing Tumor (ET) \(\subset \) Tumor core \(\subset \) Tumor

Deep supervision is introduced in the localization pathway by integrating segmentation layers at different levels of the network and combining them via elementwise summation to form the final network output. The output activation layer is a multi-level sigmoid layer, which converts the multi-class problem into binary ones, instead of the softmax layer used in Isensee’s network. Intrinsically, the multi-level activation is an assembly of several sigmoid functions that maps straightforwardly to a multi-class segmentation incorporating the topological prior. Consequently, it overcomes the shortcoming of the softmax-based method, which is blind to the geometric prior.

2.2 Crop Preprocessing

For a 3D network architecture, a larger training patch contains more continuous context and localization information, which helps to improve segmentation accuracy. To obtain a large cubic patch from each 3D image, we keep as much of the informative content of the MRI as possible and crop away the uninformative background. The maximum cubic patch size is chosen as [144, 144, 144].

The crop preprocessing equation is defined as:

$$\begin{aligned} \begin{aligned} array&=[a_{min}-(b_{size}-a)/2:a_{min}+(b_{size}+a)/2]\\ a&=a_{max}-a_{min} \end{aligned} \end{aligned}$$
(1)

where \(a_{min}\) and \(a_{max}\) are the minimum and maximum indices of the non-zero voxels of the MRI image along a given axis, a is the extent of the non-zero region, and \(b_{size}\) is the cube patch size, set to 144.

The crop indices are recorded and used in the post-processing stage to restore the output to the original shape [155, 240, 240]. A small amount of meaningful information that lies outside the 144-voxel cube is unavoidably discarded, but this has little effect on the segmentation result. To compare the softmax-based and multi-level methods on equal terms, no data augmentation is used in the pre-processing stage.
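As a minimal sketch of Eq. (1), the window along one axis could be computed as follows. The function names, the integer rounding, the clamping of the window to the image bounds and the example extents are assumptions made for illustration; they are not taken from the original pipeline.

```python
PATCH_SIZE = 144  # b_size in Eq. (1)


def crop_bounds(nonzero_min, nonzero_max, axis_len, patch=PATCH_SIZE):
    """Return (start, stop) of a `patch`-long window centred on the non-zero extent."""
    extent = nonzero_max - nonzero_min            # a in Eq. (1)
    start = nonzero_min - (patch - extent) // 2   # lower bound of Eq. (1), rounded down
    # Assumption: slide the window back inside the axis if it would overflow.
    start = min(max(start, 0), axis_len - patch)
    return start, start + patch


def restore_axis(cropped, start, axis_len, fill=0):
    """Zero-pad a cropped 1D profile back to its original length."""
    tail = axis_len - start - len(cropped)
    return [fill] * start + list(cropped) + [fill] * tail


# Hypothetical non-zero extents for a volume of shape [155, 240, 240].
shape = (155, 240, 240)
extents = [(10, 140), (40, 200), (30, 180)]
windows = [crop_bounds(lo, hi, n) for (lo, hi), n in zip(extents, shape)]
```

When the non-zero extent exceeds 144 voxels, as for the second and third axes in this example, the window stays centred and trims both sides equally. The recorded `start` values are what the restoration step needs to pad the prediction back to the original shape.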

Fig. 3. Multi-class activation function, Eq. (2) with m + 1 = 4, h = 0.8 and k = 10

2.3 Multi-level Method

Here, we use one output channel and a multi-class-nested activation function, as first proposed in [12]. The multi-level method is inspired by continuous regression and generalizes logistic regression to hierarchically nested classes. The activation is shown in Fig. 3 and defined as

$$\begin{aligned} \begin{aligned} a(x)=\sum ^m_{n=1}\sigma (k[x+h(n-\frac{m+1}{2})]) \end{aligned} \end{aligned}$$
(2)

where \(\sigma \) is the sigmoid function, k is the steepness and h is the spacing between consecutive sigmoids. For the four nested labels of the brain tumor segmentation challenge, we have m + 1 = 4, and we take h = 0.5 and k = 10. The corresponding loss function, called Modified Cross-Entropy (MCE) in [12], is defined as

$$\begin{aligned} \begin{aligned} L_{MCE} = - \frac{1}{N_{tot}}\sum _{pixel\,i} \sum _{classes\,c}{y^c_i w^c}\log (P^c [a(x_i)]) \end{aligned} \end{aligned}$$
(3)

where \(w^c\) is the weight of class c, which we take as \(w^{c}=\left( \frac{N_{tot}}{N_c}\right) ^{\alpha }\), \(N_{tot}\) is the total number of pixels, \(N_{c}\) is the number of pixels in class c, and \(y^c_i=1\) if c is the ground-truth label of pixel i and \(y^c_i=0\) otherwise. Furthermore, the mapping function \(P^c\) is defined as

$$\begin{aligned} \begin{aligned}&P^{c=0}(a)=1-\frac{a}{3} \\&P^{c=1}(a)=a\,\varTheta (1-a)+\frac{3-a}{2}\,\varTheta (a-1) \\&P^{c=2}(a)=\frac{a}{2}\,\varTheta (2-a)+(3-a)\,\varTheta (a-2) \\&P^{c=3}(a)=\frac{a}{3}. \end{aligned} \end{aligned}$$
(4)

where \(\varTheta (x)\) is the Heaviside function. The second loss function, called Normalized Cross-Entropy (NCE) in [12], is defined as

$$\begin{aligned} \begin{aligned} L_{NCE} = - \frac{1}{N_{tot}}\sum _{pixel\,i} \sum _{classes\,c}{y^c_i w^c}\log (Q^c [a(x_i)]) \end{aligned} \end{aligned}$$
(5)

where the mapping function \(Q^c\) is defined as

$$\begin{aligned} \begin{aligned}&Q^{c=0}(a)=s(1-a)\\&Q^{c=1}(a)=a\varTheta (1-a)+s(2-a)\varTheta (a-2)\\&Q^{c=2}(a)=s(a-1)\varTheta (2-a)+(3-a)\varTheta (a-2)\\&Q^{c=3}(a)=s(a-2). \end{aligned} \end{aligned}$$
(6)

where s is the softplus function, and \(\varTheta (x)\) is the Heaviside function.

The weighted modified and normalized cross-entropy losses combine naturally with the standard cross-entropy loss and mitigate the class imbalance problem. They can also encode any hierarchical or mutually exclusive topological relationship between classes in a network architecture.
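The activation of Eq. (2), the mapping \(P^c\) of Eq. (4) and the weighted MCE loss of Eq. (3) can be sketched in plain Python as follows. This is an illustrative sketch on scalar inputs, not the training code. The function names, the value \(\varTheta (0)=0.5\), the clipping constant `eps` and the computation of the class weights from the given labels are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multi_level_activation(x, m=3, h=0.5, k=10.0):
    """Eq. (2): sum of m shifted sigmoids, with values in (0, m)."""
    return sum(sigmoid(k * (x + h * (n - (m + 1) / 2))) for n in range(1, m + 1))


def theta(z):
    """Heaviside step; the value 0.5 at z = 0 is an assumption that keeps P^c continuous."""
    return 1.0 if z > 0 else (0.5 if z == 0 else 0.0)


def class_scores(a):
    """Eq. (4): P^c(a) for c = 0, 1, 2, 3."""
    return [
        1 - a / 3,
        a * theta(1 - a) + (3 - a) / 2 * theta(a - 1),
        a / 2 * theta(2 - a) + (3 - a) * theta(a - 2),
        a / 3,
    ]


def mce_loss(logits, labels, alpha=0.4, eps=1e-7):
    """Eq. (3) with weights w^c = (N_tot / N_c) ** alpha."""
    n_tot = len(labels)
    counts = [labels.count(c) for c in range(4)]
    weights = [(n_tot / n_c) ** alpha if n_c else 0.0 for n_c in counts]
    total = 0.0
    for x, c in zip(logits, labels):
        p = class_scores(multi_level_activation(x))[c]
        total += weights[c] * math.log(min(max(p, eps), 1.0))  # clip to avoid log(0)
    return -total / n_tot


logits = [-3.0, -0.25, 0.25, 3.0]
loss_matching = mce_loss(logits, [0, 1, 2, 3])
loss_reversed = mce_loss(logits, [3, 2, 1, 0])
```

With h = 0.5 and k = 10 the activation has plateaus near 0, 1, 2 and 3, and each \(P^c\) peaks at a = c, so the loss is small when the plateau matches the label and large otherwise.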

2.4 Evaluation Metrics

In the BRATS task, the numbers of positives and negatives are highly unbalanced. Consequently, the organizers use four different metrics to evaluate the performance of the algorithms and to rank the teams.

Given a ground-truth segmentation map G and a segmentation map P generated by the algorithm for one class, the four evaluation criteria are defined as follows.

Dice similarity coefficient (DSC):

$$\begin{aligned} \begin{aligned} DSC=\frac{2|G\cap {P}|}{|G|+|P|} \end{aligned} \end{aligned}$$
(7)

The Dice similarity coefficient measures the relative overlap between G and P.
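For illustration, Eq. (7) can be evaluated on two flattened binary masks as in the sketch below. The function name and the convention of returning 1.0 when both masks are empty are assumptions, not part of the challenge definition.

```python
def dice_coefficient(ground_truth, prediction):
    """Eq. (7) for two equally long sequences of 0/1 (or False/True) values."""
    g = [bool(v) for v in ground_truth]
    p = [bool(v) for v in prediction]
    overlap = sum(1 for a, b in zip(g, p) if a and b)  # |G ∩ P|
    total = sum(g) + sum(p)                            # |G| + |P|
    # Assumption: two empty masks count as perfect agreement.
    return 2.0 * overlap / total if total else 1.0


example_dice = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```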

Hausdorff distance (95th percentile) is defined as:

$$\begin{aligned} \begin{aligned} H(G,P)=\max \left( \sup _{x\in G}\inf _{y\in P}d(x,y),\ \sup _{y\in P}\inf _{x\in G}d(x,y)\right) \end{aligned} \end{aligned}$$
(8)

where d(x, y) denotes the distance between x and y, sup denotes the supremum and inf the infimum. It measures how far two subsets of a metric space are from each other. As used in this challenge, it is made more robust by using the 95th percentile instead of the maximum (100th percentile) distance.
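A brute-force sketch of Eq. (8) on small point sets is given below, with an optional percentile in place of the supremum. The exact computation used by the challenge organizers (for example how surface points are extracted and which percentile convention is applied) is not specified here. The linear-interpolation percentile and all function names are assumptions.

```python
import math


def directed_distances(a_pts, b_pts):
    """For every point in a_pts, the distance to its nearest point in b_pts."""
    return [min(math.dist(a, b) for b in b_pts) for a in a_pts]


def percentile(values, q):
    """Linear-interpolation percentile (one of several common conventions)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def hausdorff(g_pts, p_pts, q=100.0):
    """Eq. (8) for q = 100; a robust variant for q = 95."""
    return max(
        percentile(directed_distances(g_pts, p_pts), q),
        percentile(directed_distances(p_pts, g_pts), q),
    )


g_pts = [(0.0, 0.0), (1.0, 0.0)]
p_pts = [(0.0, 0.0), (4.0, 0.0)]
hd_max = hausdorff(g_pts, p_pts)
hd_95 = hausdorff(g_pts, p_pts, q=95.0)
```

The pairwise search is quadratic in the number of points, so it is only meant to make the definition concrete.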

Sensitivity (also called the true positive rate) measures the proportion of actual positives that are correctly identified. Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified. Here, P denotes the number of actual lesion (positive) pixels and N the number of actual non-lesion (negative) pixels. The condition positives P consist of true positives TP and false negatives FN, and the condition negatives N consist of true negatives TN and false positives FP.

Sensitivity and Specificity are then defined as:

$$\begin{aligned} \begin{aligned} Sensitivity=\frac{TP}{P}=\frac{TP}{TP+FN} \end{aligned} \end{aligned}$$
(9)
$$\begin{aligned} \begin{aligned} Specificity=\frac{TN}{N}=\frac{TN}{TN+FP} \end{aligned} \end{aligned}$$
(10)

The values of these four metrics were computed independently by the organizers and made available on the validation leaderboard.
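Eqs. (9) and (10) can be illustrated on flattened binary masks as in the sketch below. The function name is hypothetical, and the sketch assumes that the ground truth contains at least one positive and one negative pixel.

```python
def sensitivity_specificity(ground_truth, prediction):
    """Return (TP / (TP + FN), TN / (TN + FP)) for two 0/1 sequences."""
    pairs = [(bool(g), bool(p)) for g, p in zip(ground_truth, prediction)]
    tp = sum(1 for g, p in pairs if g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    return tp / (tp + fn), tn / (tn + fp)


sens, spec = sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```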

Fig. 4. Segmentation results for five different validation cases. The tumor class is depicted in red, tumor core in green and enhancing tumor in blue. (Color figure online)

3 Experiment Results

The BRATS 2018 dataset [15,16,17,18,19] contains four tissue types, necrotic core, edema, non-enhancing core and enhancing core, which form the three tumor classes in Fig. 2. It provides four MRI modalities, native T1-weighted (T1), post-contrast T1-weighted (T1Gd), T2-weighted (T2) and T2 Fluid Attenuated Inversion Recovery (FLAIR), which are all used as separate input channels. We train the networks using the ADAM optimizer with an initial learning rate of 0.0005. To regularize the network, we use early stopping when the precision on the 20% of the training dataset reserved for validation no longer improves, and dropout (with rate 0.3) in all residual blocks before the multi-class sigmoid function. Some slices of segmentation results containing the tumor, tumor core and enhancing core are shown in Fig. 4. We observe that the topology of the labels is constrained to the nested-class relationship, which avoids errors stemming from the lack of a topological prior.

Table 1. Validation results presented on the leaderboard
Table 2. Quantitative evaluation of Dice score

The segmentation result is severely affected by the strong class imbalance in the BRATS dataset. As class imbalance in a dataset increases, the performance of a neural network trained on that data has been shown to decrease dramatically [20]. Many methods [21,22,23] modify the loss function to alleviate this problem. Here, a weighted cross-entropy incorporating the nested-class information is proposed and investigated. We experimented with different weighting schemes (\(\alpha =1,0.5,0.4,0.3\)) and with the two losses (MCE and NCE) proposed in [12]. The best performing combination turned out to be \(\alpha =0.4\) with the MCE loss function. The segmentation thresholds that determine the boundaries between classes were set to [0.95, 1.65, 2.2] during validation. For this final configuration, we reached Dice scores of 86% for the complete tumor, 77% for the tumor core and 72% for the enhancing core, as presented in Table 1. The weighted modified cross-entropy performs much better than the normalized cross-entropy, and the weighting scheme strongly affects the segmentation result because of the severe class imbalance. We also compare with the softmax-based method built on the same network architecture proposed by Isensee, without ensembling, complicated image pre-processing or post-processing steps, or extra training data. The Dice score of the innermost nested class (enhancing core) improves from 0.691 to 0.719, while the Dice scores of the whole tumor and tumor core remain almost unchanged. The quantitative evaluation (mean, standard deviation, median, 25% and 75% quantiles) of the Dice scores of the enhancing core, whole tumor and tumor core is shown in Table 2. The other evaluation metrics (Sensitivity, Specificity and Hausdorff95) are listed in Table 3.

Table 3. Sensitivity, Specificity and Hausdorff95 results presented on the leaderboard

3.1 Threshold Scheme Definition and Analysis

Setting the optimal thresholds is an important component of the multi-class segmentation task, since they directly determine the segmentation boundaries. With the activation function of Fig. 3 (a sum of sigmoids for 4 nested classes), the 4-class segmentation problem corresponds to a threshold scheme with 3 parameters [Threshold-1, Threshold-2, Threshold-3]. The threshold scheme is chosen during the validation procedure and then fixed and applied to the test dataset.
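The following sketch shows how a threshold scheme could turn an activation value into one of the four nested levels. The function names, the level ordering (0 for background, 1 for tumor outside the core, 2 for non-enhancing core, 3 for enhancing tumor) and the treatment of values exactly on a threshold are assumptions.

```python
from bisect import bisect_right

THRESHOLDS = (0.95, 1.65, 2.2)  # [Threshold-1, Threshold-2, Threshold-3]


def activation_to_level(a, thresholds=THRESHOLDS):
    """Number of thresholds that the activation value reaches (0 to 3)."""
    return bisect_right(thresholds, a)


def nested_membership(level):
    """Nested regions implied by a level: each inner class lies inside the outer ones."""
    return {
        "whole_tumor": level >= 1,
        "tumor_core": level >= 2,
        "enhancing_tumor": level >= 3,
    }


levels = [activation_to_level(a) for a in (0.5, 1.0, 2.0, 2.5)]
```

Because the regions are derived from a single ordered level, a pixel labelled as enhancing tumor is automatically part of the tumor core and of the whole tumor.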

To analyze how the thresholds affect segmentation accuracy, the relationship between the boundary thresholds and the Dice score is illustrated in Fig. 5. One threshold at a time is varied over an interval likely to contain its optimum, while the other thresholds are fixed at their optimal values. The Dice scores of the three classes are much more sensitive to Threshold-3 than to the other two thresholds, and may drop into a valley within the interval [2.2, 2.4]. Threshold-2 has little impact on the Dice scores of all classes except for values greater than 1.8. Consequently, it is easier to find an optimal threshold scheme after first determining Threshold-3 and Threshold-2. After experiment and optimization, the threshold scheme selected for the BRATS challenge is [0.95, 1.65, 2.2].

Fig. 5. Boundary division of threshold scheme

4 Conclusions

In this paper, we applied multi-level activation to the segmentation of the nested classes of glioma. The results of our experiments indicate that the multi-level activation function and its corresponding loss function are effective compared to a softmax output layer on the same network framework. Using the MCE loss function and a reweighting scheme with power-law exponent \(\alpha = 0.4\), we obtain Dice scores of 86% for the complete tumor, 77% for the tumor core and 72% for the enhancing core on the validation leaderboard of the 2018 BRATS challenge, demonstrating the applicability of the multi-level activation scheme. Finally, this activation could be combined with other network architectures. Using it with the best performing architecture of the BRATS challenge could lead to further improved results.