
1 Introduction

Brain imaging studies using magnetic resonance imaging (MRI) or computed tomography (CT) provide important information for disease diagnosis and treatment planning [6]. One of the major challenges in brain tumor segmentation is imbalanced training data, in which the vast majority of voxels belong to healthy tissue and only a few voxels are non-healthy (tumor). A model learned from class-imbalanced training data is biased towards the majority class. The predictions of such networks have low sensitivity, i.e., they fail to correctly predict the non-healthy classes. In medical applications, the cost of misclassifying the minority class can be much higher than the cost of misclassifying the majority class: for example, the risk of missing a tumor is far greater than that of referring a healthy subject to a doctor.

The problem of class imbalance has recently been addressed in disease classification, tumor recognition, and tumor segmentation. Two types of approaches have been proposed in the literature: data-level and algorithm-level approaches.

At the data level, the objective is to balance the class distribution by re-sampling the data space [21], either by over-sampling the positive class with SMOTE (Synthetic Minority Over-sampling Technique) [10] or by under-sampling the negative class [19]. However, these approaches often remove important samples from, or add redundant samples to, the training set.

Algorithm-level solutions address the class imbalance problem by modifying the learning algorithm to alleviate the bias towards the majority class. Examples are cascade training [8, 33, 36] and training with cost-sensitive functions [40], such as the Dice coefficient loss [12, 35, 38] and the asymmetric similarity loss [16], which reweight the training distribution according to the misclassification cost.

Here, we study the advantage of mixing an adversarial loss with weighted categorical cross-entropy and weighted \(\ell 1\) losses in order to mitigate the negative impact of class imbalance. Moreover, we train voxel-GAN simultaneously with semantic segmentation masks and inverse-class-frequency segmentation masks, named complementary segmentation labels. Assume Y is the true segmentation label annotated by an expert and \(\bar{Y}\) is the complementary label with \(P(\bar{Y} = i\mid Y=j), i\ne j \in \{0,1,...,c-1\}\), where c is the number of semantic segmentation class labels. The complementary label \(\bar{Y}\) reverses the background labels. Our network is then trained with both the true segmentation mask Y and the complementary segmentation mask \(\bar{Y}\) at the same time.
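To make the idea concrete, the following is a minimal sketch of building such a complementary mask for the binary case; the helper name is hypothetical and not taken from the released code, and the multi-class case would follow the inverse-class-frequency scheme described above.

```python
import numpy as np

def complementary_mask(y):
    """Hypothetical sketch of a complementary label map (binary case).

    Background voxels (label 0) are flipped to foreground and every tumor
    voxel is flipped to background, so the rare class becomes the majority
    class in the complementary mask.
    """
    return (y == 0).astype(y.dtype)
```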

Automating brain tumor segmentation is a challenging task due to the high diversity in the appearance of tissues among different patients and, in many cases, the similarity between healthy and non-healthy tissues. Numerous automatic approaches have been developed to speed up medical image segmentation [6, 25]. We can roughly divide the current automated algorithms into two categories: those based on generative models and those based on discriminative models.

Generative probabilistic approaches build the model from prior domain knowledge about the appearance and spatial distribution of the different tissue types. Traditionally, generative probabilistic models have been popular, using simple conditionally independent Gaussian models [13] or Bayesian learning [32] for tissue appearance. In contrast, discriminative probabilistic models directly learn the relationship between local image features and segmentation labels without any domain knowledge. Traditional discriminative approaches such as SVMs [2, 9], random forests [23], and guided random walks [11] have been used in medical image segmentation. Deep neural networks (DNNs) are among the most popular discriminative approaches, in which the machine learns a hierarchical representation of features without any handcrafted features [22]. In the field of medical image segmentation, Ronneberger et al. [37] presented a fully convolutional neural network, named UNet, for segmenting neuronal structures in electron microscopic stacks.

Recently, GANs [15] have gained a lot of momentum in the research community. Mirza et al. [26] extended the GAN framework to the conditional setting by making both the generator and the discriminator class conditional. Conditional GANs (cGANs) have the advantage of providing better representations for multi-modal data generation, since there is control over the modes of the generated data. This makes cGANs well suited for the image semantic segmentation task, where we condition on an observed image and generate a corresponding output image.

Unlike previous works on cGANs [18, 27, 34, 36, 41], we map 3D MR or CT images to 3D semantic segmentations. Summarizing, the main contributions of this paper are:

  • We introduce voxel-GAN, a new adversarial framework that improves semantic segmentation accuracy.

  • Our proposed method mitigates imbalanced training data by using biased complementary labels in the task of semantic segmentation.

  • We study the effect of different losses and architectural choices that improve semantic segmentation.

The rest of the paper is organized as follows: in the next section, we explain our proposed method for learning brain tumor segmentation, while the detailed experimental results are presented in Sect. 3. We conclude the paper and give an outlook on future research in Sect. 4.

2 voxel-GAN

In a conventional generative adversarial network, the generative model G tries to learn a mapping from a random noise vector z to an output image y, \(G: z \rightarrow y\). Meanwhile, the discriminative model D estimates the probability that a sample comes from the training data \(x_{real}\) rather than from the generator \(x_{fake}\). The GAN objective is the two-player mini-max game in Eq. (1).

$$\begin{aligned} \min _{G} \max _{D} V(D, G) = E_{y} [\log D(y)] + E_{z} [\log (1-D(G(z)))] \end{aligned}$$
(1)

Similar to the conditional GAN [26], in our proposed voxel-GAN the segmentor network takes 3D multi-modal MR or CT images x and a Gaussian noise vector z and outputs a 3D semantic segmentation. The discriminator takes either the segmentor output S(x, z) or the ground truth \(y_{seg}\) annotated by an expert, together with the condition x, and outputs a confidence value of whether its 3D input is real or synthetic. The training procedure is the two-player mini-max game shown in Eq. (2).

$$\begin{aligned} \mathcal {L}_{adv} \leftarrow \min _{S} \max _{D} V(D, S) = E_{x,y_{seg}} [\log D(x,y_{seg})] + E_{x,z} [\log (1-D(x, S(x,z)))] \end{aligned}$$
(2)

In this work, similar to Isola et al. [18], we feed Gaussian noise z into the generator alongside the input data x. As discussed by Isola et al. [18], when training a conditional generative model of the distribution P(y|x), it is preferable that the model can produce more than one sample y for each input x. When the generator G takes a random vector z in addition to the input image x, G(x, z) can generate as many different outputs for each x as there are values of z. Especially in medical image segmentation, the diversity of image acquisition methods (e.g., MRI, fMRI, CT, ultrasound), their settings (e.g., echo time, repetition time), geometry (2D vs. 3D), and differences in hardware (e.g., field strength, gradient performance) can cause variations in the appearance of body organs and tumour shapes [17]; thus, learning with a random vector z alongside the input image x makes the network more robust against noise and improves the output samples. This has been confirmed by our experimental results on datasets with a large range of variation.

To mitigate the problem of imbalanced training samples, the segmentor loss is weighted as in Eq. (3) to reduce the effect of class voxel frequencies over the whole training dataset.

$$\begin{aligned} w_i = {\left\{ \begin{array}{ll} \mathrm {avg}\{f_i\}_{0<i<c} / f_{max}, & \text {if } i \text { is the max-frequency class} \\ 1, & \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)
$$\begin{aligned} \mathcal {L}_{L1}(S) = E_{x,z} \Vert y_{seg} - S(x \cdot w, z) \Vert _1 \end{aligned}$$
(4)
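Equation (3) leaves all minority (tumor) classes at weight 1 and down-weights only the most frequent (healthy) class. A minimal sketch of computing such weights from voxel counts over the training masks is given below; the helper name and the use of NumPy are our assumptions.

```python
import numpy as np

def class_weights(label_volumes, num_classes):
    """Per-class weights in the spirit of Eq. (3): the most frequent
    (healthy) class is down-weighted by the ratio of the average minority
    frequency to its own frequency; all other classes keep weight 1."""
    freqs = np.zeros(num_classes)
    for y in label_volumes:  # accumulate voxel counts over the training masks
        freqs += np.bincount(y.ravel().astype(np.int64), minlength=num_classes)
    w = np.ones(num_classes)
    i_max = int(freqs.argmax())                # majority (healthy) class
    minority = np.delete(freqs, i_max)
    w[i_max] = minority.mean() / freqs[i_max]  # avg{f_i} / f_max
    return w
```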

The segmentor loss in Eq. (4) mixes in an \(\ell 1\) term to minimize the absolute difference between the predicted and ground-truth values. Previous studies [36, 41] on cGANs have shown the benefit of mixing the cGAN objective with an \(\ell 1\) distance. The \(\ell 1\) objective takes the CNN feature differences between the predicted segmentation and the ground-truth segmentation into account, resulting in less noise and smoother boundaries.

The final objective for semantic segmentation of brain tumors, \(\mathcal {L}_{seg}\), is computed from the adversarial loss and the additional segmentor \(\ell 1\) loss as follows:

$$\begin{aligned} \mathcal {L}_{seg} (D, S) = \mathcal {L}_{adv} (D, S) + \mathcal {L}_{L1}(S) \end{aligned}$$
(5)
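As a rough illustration, the combined objective of Eq. (5) could be wired up as follows. This is a minimal TensorFlow sketch, not the released implementation; the binary real/fake form of the adversarial term and the equal weighting of the two terms are our assumptions.

```python
import tensorflow as tf

def segmentor_loss(d_fake_logits, y_true, y_pred):
    """Sketch of Eq. (5): the segmentor is pushed to fool the discriminator
    (adversarial term) while also minimizing the l1 distance to the expert
    mask. y_pred is the (class-weighted) segmentor output S(x*w, z)."""
    adv = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.ones_like(d_fake_logits), logits=d_fake_logits))
    l1 = tf.reduce_mean(tf.abs(y_true - y_pred))
    return adv + l1
```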
Fig. 1.

The architecture of the proposed voxel-GAN consists of a segmentor network S and a discriminator network D. S takes 3D multi-modal images as the condition and generates the 3D semantic segmentation as output; D determines whether this output is real or fake. We use two modified 3D UNet architectures as the segmentor network in order to capture the local and global features extracted in the bottleneck and the last convolutional decoder layer. Here, D is a 3D fully convolutional encoder.

2.1 Segmentor Network

As shown in Fig. 1, the segmentor architecture consists of two 3D fully convolutional encoder-decoder networks that predict a label for each voxel. The first encoder takes the \(64 \times 64 \times 64\) multi-modal MRI or CT images at the same time as different input channels. The last decoder outputs a 3D volume of size \(64 \times 64 \times 64\). Similar to UNet [37], we add skip connections between each layer i and layer \(n-i\), where n is the total number of layers in each encoder and decoder part; each skip connection simply concatenates all channels at layer i with those at layer \(n-i\). Moreover, we concatenate the bottleneck features with the last convolutional decoder layer to capture a better feature representation.
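For illustration, a strongly simplified Keras sketch of one such 3D encoder-decoder path with skip connections is given below; the depth, filter counts, and layer choices are our assumptions and do not reproduce the released model, which uses two such paths with shared bottleneck and last decoder features.

```python
from tensorflow.keras import layers, Model

def unet3d_segmentor(in_channels=4, num_classes=4, base_filters=16):
    """Simplified 3D UNet-style encoder-decoder: 64x64x64 multi-modal input,
    skip connections between mirrored layers, softmax voxel classifier."""
    inp = layers.Input((64, 64, 64, in_channels))

    # encoder
    e1 = layers.Conv3D(base_filters, 3, padding='same', activation='relu')(inp)
    p1 = layers.MaxPooling3D(2)(e1)
    e2 = layers.Conv3D(base_filters * 2, 3, padding='same', activation='relu')(p1)
    p2 = layers.MaxPooling3D(2)(e2)

    # bottleneck (local features)
    b = layers.Conv3D(base_filters * 4, 3, padding='same', activation='relu')(p2)

    # decoder with UNet-style skip connections
    u2 = layers.Conv3DTranspose(base_filters * 2, 2, strides=2, padding='same')(b)
    d2 = layers.Conv3D(base_filters * 2, 3, padding='same', activation='relu')(
        layers.Concatenate()([u2, e2]))
    u1 = layers.Conv3DTranspose(base_filters, 2, strides=2, padding='same')(d2)
    d1 = layers.Conv3D(base_filters, 3, padding='same', activation='relu')(
        layers.Concatenate()([u1, e1]))

    out = layers.Conv3D(num_classes, 1, activation='softmax')(d1)
    return Model(inp, out)
```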

2.2 Discriminator Network

As depicted in Fig. 1, the discriminator is a 3D fully convolutional encoder network that classifies whether a predicted voxel label belongs to the correct class. Similar to pix2pix [18], we use a PatchGAN discriminator configured for voxel-to-voxel analysis. More specifically, the discriminator is trained to minimize the average negative cross-entropy between the predicted and the true labels.
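A rough Keras sketch of such a 3D fully convolutional, PatchGAN-style discriminator is shown below; it is conditioned on the input image, and the depth and filter counts are our assumptions rather than the released configuration.

```python
from tensorflow.keras import layers, Model

def patchgan3d_discriminator(in_channels=4, num_classes=4, base_filters=16):
    """3D fully convolutional discriminator: receives the image condition
    together with a segmentation mask and emits a grid of real/fake scores
    (one per local 3D patch) instead of a single scalar."""
    image = layers.Input((64, 64, 64, in_channels))
    mask = layers.Input((64, 64, 64, num_classes))
    h = layers.Concatenate()([image, mask])
    for mult in (1, 2, 4):  # strided 3D convolutional encoder
        h = layers.Conv3D(base_filters * mult, 4, strides=2, padding='same')(h)
        h = layers.LeakyReLU(0.2)(h)
    out = layers.Conv3D(1, 4, padding='same', activation='sigmoid')(h)
    return Model([image, mask], out)
```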

The segmentor and discriminator networks are then trained through back-propagation in the manner of a two-player mini-max game. We use categorical cross-entropy [29] as the adversarial loss. As mentioned before, we weight the loss so as to attenuate only the impact of healthy voxels at training and testing time.

3 Experiments

We validated the performance of our voxel-GAN on two recent medical imaging challenges: real patient data from the MICCAI 2018 MRI brain tumor segmentation challenge (BraTS) [3,4,5, 25] and the CT brain lesion segmentation challenge (ISLES 2018) [20, 24].

3.1 Datasets and Pre-processing

The first experiment is carried out on real patient data obtained from the BraTS 2018 challenge [3,4,5, 25]. BraTS 2018 released data in three subsets, train, validation, and test, comprising 289, 68, and 191 MR images respectively in four multi-site modalities (T1, T2, T1ce, and FLAIR); annotations are provided only for the training set. The challenge is the semantic segmentation of complex and heterogeneously located tumor(s) on highly imbalanced data. Pre-processing is an important step to bring all subjects to similar distributions; we applied z-score normalization to the four modalities, computing the mean and standard deviation of the brain intensities. We also applied the bias field correction introduced by Nyúl et al. [30]. Figure 2 shows a 2D slice of the pre-processed images (our network takes 3D volumes).
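A minimal sketch of the per-modality z-score normalization is given below, assuming a brain mask is available; the helper name is ours.

```python
import numpy as np

def zscore_normalize(volume, brain_mask):
    """Z-score normalization of one modality: mean and standard deviation
    are computed over brain voxels only, then applied to the whole volume."""
    brain = volume[brain_mask > 0]
    return (volume - brain.mean()) / (brain.std() + 1e-8)
```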

In the second experiment, we used the ISLES 2018 benchmark, which contains 94 computed tomography (CT) and MRI training cases in six modalities (CT, 4DPWI, CBF, CBV, MTT, Tmax) together with annotated ground truth. The examined patients were suffering from different brain lesions. The challenging part is the binary segmentation of imbalanced labels. Here, pre-processing is carried out in a slice-wise fashion. We windowed the Hounsfield unit (HU) values to the range [30, 100] to obtain soft-tissue contrast. Furthermore, we applied histogram equalization to increase the contrast for better differentiation of abnormal lesion tissue.
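A minimal slice-wise sketch of this CT pre-processing is given below; the helper name and the use of scikit-image for histogram equalization are our assumptions.

```python
import numpy as np
from skimage import exposure

def preprocess_ct_slice(ct_slice, hu_min=30, hu_max=100):
    """Window the Hounsfield units to [30, 100] to emphasize soft tissue,
    rescale to [0, 1], then apply histogram equalization."""
    windowed = np.clip(ct_slice, hu_min, hu_max)
    scaled = (windowed - hu_min) / (hu_max - hu_min)
    return exposure.equalize_hist(scaled)
```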

To prevent over-fitting, we applied data augmentation to each dataset: random cropping, re-sizing, scaling, rotation between \(-10\) and 10 degrees, and Gaussian noise, applied at training and testing time for both datasets (a partial sketch of this pipeline is shown below).
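The sketch covers rotation and additive Gaussian noise only; random cropping, re-sizing, and scaling are omitted for brevity, and the noise parameters and the use of SciPy are our assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def augment(volume, mask):
    """Random rotation in [-10, 10] degrees plus additive Gaussian noise,
    applied consistently to the image volume and its mask (nearest-neighbour
    interpolation is used for the mask to keep labels discrete)."""
    angle = np.random.uniform(-10, 10)
    volume = rotate(volume, angle, axes=(0, 1), reshape=False, order=1)
    mask = rotate(mask, angle, axes=(0, 1), reshape=False, order=0)
    volume = volume + np.random.normal(0.0, 0.01, size=volume.shape)
    return volume, mask
```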

Fig. 2.

A brain MR image from BraTS 2018 after pre-processing. The complementary mask, extracted as the inverse of the ground-truth file annotated by a medical expert, is shown in the first column. The other binary masks extracted from the ground-truth file in columns 2–4 are the whole tumor, enhanced tumor, and tumor core, which are used by the discriminator. Columns 5–8 show a slice of an example 3D input to the segmentor.

3.2 Implementation

Configuration: Our proposed method is implemented with the Keras library [7] using the TensorFlow back-end [1] with support for 3D convolutions, and is publicly available (see footnote 1). All training and experiments were conducted on a workstation equipped with multiple GPUs. The learning rate was initially set to 0.0001. The Adadelta optimizer is used for both the segmentor and the discriminator, as it continues learning even after many updates have been done. The model is trained for up to 200 epochs on each dataset separately.
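The stated configuration could be wired up roughly as follows, reusing the sketches from Sect. 2; this is illustrative only and not the released training script.

```python
from tensorflow.keras.optimizers import Adadelta

# Hypothetical wiring of the stated setup: Adadelta for both networks,
# initial learning rate 0.0001, training for up to 200 epochs per dataset.
segmentor = unet3d_segmentor()
discriminator = patchgan3d_discriminator()
seg_opt = Adadelta(learning_rate=0.0001)
disc_opt = Adadelta(learning_rate=0.0001)
```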

Network Architecture: The segmentor is a modified UNet architecture: we designed two UNet networks that share the bottleneck and the last fully convolutional layer of the decoder. The UNet architecture allows low-level features to shortcut across the network. Motivated by previous studies on interpreting encoder-decoder networks [31], which show that the bottleneck features carry local features while the fully convolutional up-sampling decoder represents global features, we concatenate the shared bottleneck and the last fully convolutional layer to capture the most important features.

Our discriminator is a fully convolutional Markovian PatchGAN classifier [18], which only penalizes structure at the scale of image patches. Unlike the PatchGAN discriminator introduced by Isola et al. [18], which classifies each \(N \times N\) patch as real or fake, we achieved better results for the semantic segmentation task at the voxel level with \(1 \times 1 \times d\) patches, i.e., N = 1 and d = 64, 32, 16, and 8. We used categorical cross-entropy [29] as the adversarial loss in combination with the \(\ell 1\) loss in the generator network.

Fig. 3.

The number of pixels for each tumor class, illustrating how imbalanced the training data is, shown separately for the two BraTS 2018 subsets: high- and low-grade glioma brain tumors.

Given the highly imbalanced datasets shown in Fig. 3, minority voxels with lesion labels are not trained as well as majority voxels with non-lesion labels. Therefore, we weight only the non-lesion classes so that they are on the same average level as the lesion or tumor classes. Tables 1 and 2 report our results with and without the weighted loss on BraTS 2018.

3.3 Evaluation

We followed the evaluation criteria introduced by the BraTS (footnote 2) and ISLES (footnote 3) challenge organizers.

Segmentation of brain tumors or lesions from medical images is of great interest for surgical planning and treatment monitoring. As noted by Menze et al. [25], the goal of segmentation is to delineate different tumor structures such as the active tumor core, the enhanced tumor, and the whole tumor region.

Fig. 4.

Accuracy obtained by voxel-GAN in terms of Dice and sensitivity at training and validation time on BraTS 2018.

Figure 4 shows a good trade-off between Dice and sensitivity at training and validation time, which indicates success in tackling the imbalanced data.

As shown in Table 1, the proposed voxel-GAN achieves better results in terms of Dice than the 2D-cGAN. One likely explanation is that the voxel-GAN architecture is trained on 3D convolutional features and that the segmentor loss is weighted for imbalanced data.

Table 1. Comparison of the segmentation accuracy achieved by voxel-GAN (trained with the weighted loss and complementary labels) with related work and the top-ranked teams, in terms of Dice, sensitivity, specificity, and Hausdorff distance, using five-fold cross-validation after 80 epochs; the results reported in the second and third rows are after 200 epochs. WT, ET, and TC are abbreviations of the whole tumor, enhanced tumor, and tumor core regions respectively.

Unlike previous works [14, 28, 39], we train from scratch, and even after 200 epochs our results are not as good as those of the top-ranked teams. As shown in Table 1, the two top-ranked teams used ensembles of pre-trained models. Ensemble networks provide a good solution for imbalanced data by modifying the training data distribution with regard to the different misclassification costs. In future work we will focus on training voxel-GAN with one segmentor from scratch and several different pre-trained discriminators.

Table 2. Segmentation accuracy achieved by the 3D-GAN in terms of Dice and Hausdorff distance after 80 epochs. Here, the model is trained with a 3D UNet as segmentor and a 3D fully convolutional network as discriminator. WT, ET, and TC are abbreviations of the whole tumor, enhanced tumor, and tumor core respectively.
Fig. 5.

Visual results from our model on axial views of Brats18-2013-37-1, Brats18-CBICA-AAC-1, and Brats18-CBICA-AAK-1 from the test set, overlaid on the T1C modality. Green codes the whole tumor (WT) region, while blue and yellow represent the enhanced tumor (ET) and the tumor core (TC) respectively. (Color figure online)

Table 3. Segmentation accuracy achieved by voxel-GAN on the ISLES dataset in terms of Dice, Hausdorff distance, precision, and recall, using five-fold cross-validation after 200 epochs.

4 Conclusion

In this paper, we presented a new 3D conditional generative adversarial architecture, named voxel-GAN, that mitigates the issue of imbalanced data in brain lesion and tumor segmentation. To this end, we proposed a segmentor network and a discriminator network, where the former segments the voxel labels and the latter classifies whether the segmented output is real or fake. Moreover, we analyzed the effects of different losses and architectural choices that help to improve semantic segmentation results. We validated our framework on CT images from ISLES 2018 and MR images from BraTS 2018 for lesion and tumor semantic segmentation. In the future, we plan to investigate an ensemble network based on voxel-GAN with many pre-trained discriminator networks for the semantic segmentation task.