1 Introduction

Brain tumors are among the most fatal cancers and are characterized by the uncontrolled, abnormal growth and division of cells in brain tissue [1]. The most frequent brain tumors in adults are gliomas, which arise from glial cells and infiltrate the surrounding tissues [2]. According to their degree of malignancy and their origin, these neoplasms can be categorized into Low Grade Gliomas (LGG) and High Grade Gliomas (HGG) [2, 3]. The former are slower-growing and come with a life expectancy of several years, while the latter are more aggressive and infiltrative, have a shorter survival period, and require immediate treatment [2]. Timely and automatic segmentation of brain tumors is therefore of critical importance for assisting doctors in diagnosis, surgery, and treatment planning.

In recent years, convolutional neural networks (CNNs) have been widely applied to automatic brain tumor segmentation. Pereira et al. [15] and Havaei et al. [13] each trained a CNN to predict only the label of the central voxel of a patch, which leads to high computational cost and long inference times. To reduce the computational burden, Kamnitsas et al. [5] proposed an efficient model named DeepMedic that predicts the labels of multiple voxels within a patch simultaneously, thereby achieving dense prediction. More recently, fully convolutional networks (FCNs) have achieved promising results. Shen et al. [6] and Zhao et al. [11] use FCNs that allow end-to-end dense training and testing at the slice level, which improves computational efficiency. With the large variety of CNN architectures proposed, the performance of automatic brain tumor segmentation from Magnetic Resonance Imaging (MRI) images has improved greatly.

In this work, we construct multiple CNN architectures and ensemble their predictions in order to obtain stable and robust segmentation performance. We evaluate our approach on the validation set of the 2018 Brain Tumor Segmentation (BraTS) challenge, where we obtain average Dice scores of 0.8136, 0.9095 and 0.8651 for enhancing tumor, whole tumor and tumor core, respectively. The corresponding scores on the BraTS 2018 testing set are 0.7775, 0.8842 and 0.7960, respectively.

2 Data

We use the dataset of the 2018 Brain Tumor Segmentation challenge [2, 4, 7, 8, 21] for our experiments. It consists of a training set, a validation set and a testing set. The training set contains 210 HGG and 75 LGG cases, for which manual segmentations are provided. As shown in Fig. 1, the manual segmentations include four labels: 1 for necrotic (NCR) and non-enhancing (NET) tumor, 2 for edema (ED), 4 for enhancing tumor (ET), and 0 for everything else, i.e. normal tissue and background (black padding). The validation set and testing set contain 66 and 191 cases, respectively, with unknown grade and hidden segmentations. Each case has four MRI sequences: T1, T1 contrast enhanced (T1ce), T2 and FLAIR. The data are provided after pre-processing, i.e. they are co-registered to the same anatomical template, interpolated to the same resolution (1 mm\(^3\)) and skull-stripped, and each MRI sequence has dimensions 240 \(\times \) 240 \(\times \) 155. The official evaluation merges the predicted labels into three regions: whole tumor (1, 2, 4), tumor core (1, 4) and enhancing tumor (4). The evaluation on the validation set is conducted via an online system (Footnote 1).
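The merging of labels into the three evaluated regions, and the Dice score computed on each region, can be illustrated with a minimal NumPy sketch. The function names (`region_mask`, `dice_score`) are hypothetical, and returning 1.0 when both masks are empty is an assumed convention, not necessarily the one used by the official online system.

```python
import numpy as np

# Label values as described in the text: 1 = NCR/NET, 2 = ED, 4 = ET, 0 = other.
REGIONS = {
    "whole_tumor": (1, 2, 4),
    "tumor_core": (1, 4),
    "enhancing_tumor": (4,),
}


def region_mask(label_map, region):
    """Return a boolean mask of the voxels that belong to the named region."""
    return np.isin(label_map, REGIONS[region])


def dice_score(pred_mask, true_mask):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two boolean masks."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    denominator = pred_mask.sum() + true_mask.sum()
    if denominator == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return float(2.0 * intersection / denominator)


# Toy example on a flattened six-voxel "volume".
truth = np.array([0, 1, 2, 4, 4, 0])
prediction = np.array([0, 1, 1, 4, 2, 0])
scores = {
    name: dice_score(region_mask(prediction, name), region_mask(truth, name))
    for name in REGIONS
}
```

In this toy case the whole tumor is recovered exactly, while confusing labels inside the tumor lowers the tumor core and enhancing tumor scores, which is why the three regions are reported separately.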

Fig. 1. Example of images from the BraTS 2018 dataset. From left to right: FLAIR, T1, T1ce, T2 and manual annotation overlaid on the FLAIR image: edema (green), necrosis and non-enhancing (yellow), and enhancing (red). (Color figure online)

3 Methods

3.1 Basic Networks

Brain tumor segmentation from MRI images is a challenging task, largely because of the severe class imbalance problem. Following [14], we decompose the multi-class brain tumor segmentation into three different but related sub-tasks to deal with this imbalance. (1) Coarse segmentation to detect the whole tumor. In this sub-task, the region of the whole tumor is located. To reduce overfitting, we define the first task as a five-class segmentation problem. (2) Refined segmentation for the whole tumor and its intra-tumoral classes. The coarse tumor mask obtained above is dilated by 5 voxels and used as the ROI for the second task. In this sub-task, the precise classes of all voxels within the dilated region are predicted. (3) Precise segmentation for the enhancing tumor. We design the third sub-task specifically to segment the enhancing tumor, because it is particularly difficult to segment.
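The dilation of the coarse tumor mask in sub-task (2) can be sketched as follows. This is only an illustration: the text does not specify the structuring element, so the sketch assumes 6-connectivity (face neighbours) applied for 5 iterations, and the name `dilate_mask` is hypothetical.

```python
import numpy as np


def dilate_mask(mask, iterations=5):
    """Grow a boolean mask by one face-neighbour layer per iteration."""
    out = np.asarray(mask, dtype=bool)
    interior = (slice(1, -1),) * out.ndim
    for _ in range(iterations):
        # Pad with False so that shifting never wraps tumor voxels around.
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = out.copy()
        for axis in range(out.ndim):
            for shift in (-1, 1):
                grown |= np.roll(padded, shift, axis=axis)[interior]
        out = grown
    return out


# A single tumor voxel in the middle of a small volume.
coarse = np.zeros((13, 13, 13), dtype=bool)
coarse[6, 6, 6] = True
roi = dilate_mask(coarse, iterations=5)
```

With this choice of structuring element the ROI extends 5 voxels along each axis from every coarse tumor voxel, so tumor tissue missed near the boundary of the coarse mask is still passed to the second sub-task.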

Model Cascade. Given the above three sub-tasks, a natural approach is to train a CNN individually for each sub-task, which is the currently popular Model Cascade (MC) strategy. We use a 3D variant of FusionNet [10], as illustrated in Fig. 2. The network architecture consists of an encoding path (upper half of the network) that extracts complex semantic features and a symmetric decoding path (lower half of the network) that recovers the resolution of the input to achieve voxel-to-voxel predictions. The network is constructed from four types of basic building blocks, as shown in Fig. 2. In addition to the short shortcuts in the residual blocks, the network has three long skip connections that merge, by voxel-wise addition, the feature maps from the same level of the encoding path during decoding. We employ the identical network architecture for each sub-task, except for the final convolutional classification layer, whose number of channels is 5, 5 and 2 for the first, second and third sub-tasks, respectively. The size of the input patches is 32 \(\times \) 32 \(\times \) 16 \(\times \) 4, where the number 4 indicates the four MRI modalities. During inference, we adopt the overlap-tile strategy of [9]: we discard the predictions in the border region of each patch and retain only the predictions in the center region (\(20 \times 20 \times 5\)). This trick is also used in the following models. In contrast to [20], a typical example of the model cascade strategy, we dilate the coarse tumor mask to avoid missing tumor voxels in the second sub-task, and we adopt the same 3D basic network architecture for each sub-task instead of designing different networks for different sub-tasks.
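The arithmetic behind the overlap-tile strategy can be made concrete with a short sketch. It uses the patch size (32, 32, 16), the retained center (20, 20, 5) and the volume size (240, 240, 155) from the text. Two points are assumptions rather than details stated in the paper: since 16 - 5 is odd, the sketch places the extra border voxel on the far side of the third axis, and it assumes the retained centers tile the volume without gaps. The helper names are hypothetical.

```python
import math

import numpy as np

PATCH_SHAPE = (32, 32, 16)
CENTER_SHAPE = (20, 20, 5)
VOLUME_SHAPE = (240, 240, 155)


def center_slices(patch_shape, center_shape):
    """Slices that select the retained center region of a patch prediction."""
    slices = []
    for patch_len, center_len in zip(patch_shape, center_shape):
        # Assumption: for an odd margin, the extra voxel goes to the far side.
        start = (patch_len - center_len) // 2
        slices.append(slice(start, start + center_len))
    return tuple(slices)


def tiles_per_axis(volume_shape, center_shape):
    """Number of patches needed per axis if retained centers tile the volume."""
    return tuple(math.ceil(v / c) for v, c in zip(volume_shape, center_shape))


patch_prediction = np.zeros(PATCH_SHAPE)
kept = patch_prediction[center_slices(PATCH_SHAPE, CENTER_SHAPE)]
tiles = tiles_per_axis(VOLUME_SHAPE, CENTER_SHAPE)
total_tiles = tiles[0] * tiles[1] * tiles[2]
```

Under these assumptions each axis needs 12, 12 and 31 patches, so a full volume is covered by 4464 forward passes, each contributing only its 20 \(\times \) 20 \(\times \) 5 center to the final prediction.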

Fig. 2. Network architecture used in each sub-task. The building blocks are represented by colored cubes, with the numbers nearby giving the number of feature maps. C equals 5, 5, and 2 for the first, second, and third task, respectively. This figure is reproduced from [14]. (Best viewed in color)

One-Pass Multi-task Network. The model cascade approach described above obtains promising segmentation performance and, to a certain extent, alleviates the problem of class imbalance. However, it requires training a series of deep models individually for the three sub-tasks, which leads to large memory cost and system complexity during training and testing. In addition, the networks used for the three sub-tasks are almost the same except for the training data and the classification layer, and the three sub-tasks are clearly related to each other.

Therefore, we employ the one-pass multi-task network (OM-Net) proposed in [14], a multi-task learning framework that incorporates the three sub-tasks into an end-to-end holistic network, in order to reduce the number of parameters and exploit the underlying relevance among the three sub-tasks. OM-Net is depicted in Fig. 3 and is composed of shared parameters and task-specific parameters. Specifically, the shared backbone model refers to the network layers outlined by the yellow dashed line in Fig. 2, while three separate branches for the different sub-tasks are placed after the shared parts.

Fig. 3. Architecture of OM-Net. Data-i, Feature-i, and Output-i denote training data, feature, and classification layer for the i-th task, respectively. The shared backbone model refers to the network layers outlined by the yellow dashed line in Fig. 2. This figure is reproduced from [14].

In addition, we adopt the curriculum learning-based training strategy of [14] to train OM-Net more effectively. It is inspired by the curriculum learning theory of Bengio et al. [12], according to which humans learn a set of concepts much better when the concepts are presented with gradually increasing difficulty. The training strategy is to start training the network on the first, easiest sub-task and then gradually add the more difficult sub-tasks and their corresponding training data to the model. This progression from easy to difficult is consistent with the way a tumor is segmented manually. Moreover, training data that conforms to the sampling strategy of another sub-task can be transferred to that sub-task, which achieves data sharing. Eventually, OM-Net is a single deep model that solves the three sub-tasks simultaneously in one pass. It has significantly fewer trainable parameters than the model cascade strategy and can be trained end-to-end using stochastic gradient descent, achieving both data sharing and parameter sharing in a holistic network.

3.2 Extended Networks

In this section, we extend MC-baseline and OM-Net in four respects to further improve performance. Each is described below.

Deeper OM-Net. We deepen OM-Net by appending a residual block (the violet block in Fig. 2, right) after each existing residual block of OM-Net, which is a simple and direct way to boost performance.

Dense Connections. Inspired by [17], the basic 3D network of MC-baseline is modified by adding a series of nested and dense skip connections to form a more powerful architecture. The purpose of the re-designed skip connections is to reduce the semantic gap between the feature maps of the encoder and decoder [17].

Attention Mechanisms. Attention mechanisms have been shown to improve performance across a range of tasks, which is attributed to their ability to focus on the more informative components and suppress less useful ones. In particular, the “Squeeze-and-Excitation” (SE) block was proposed in [16] to adaptively perform channel-wise feature recalibration by explicitly modelling interdependencies between channels, in order to boost the representational power of CNNs.

Fig. 4. The adopted “Squeeze-and-Excitation” (SE) block.

Inspired by this, we introduce SE blocks into OM-Net in order to recalibrate the feature maps and further improve its learning and representational properties. The SE block is described in Fig. 4. As in [16], the SE block adaptively recalibrates channel-wise feature responses in two steps, squeeze and excitation. It helps the network to increase its sensitivity to informative features and suppress less useful ones.
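The two steps of an SE block can be sketched in NumPy for a single 3D feature map. This follows the generic form of [16] (global average pooling, then a two-layer bottleneck with ReLU and sigmoid) and is not necessarily the exact block of Fig. 4. The function name `se_block`, the weight names and the bottleneck width are assumptions made for illustration.

```python
import numpy as np


def se_block(features, w1, b1, w2, b2):
    """Apply squeeze-and-excitation to features of shape (C, D, H, W).

    w1: (C_hidden, C), b1: (C_hidden,), w2: (C, C_hidden), b2: (C,).
    Returns the recalibrated features and the per-channel gates.
    """
    # Squeeze: global average pooling over the three spatial axes -> (C,).
    squeezed = features.mean(axis=(1, 2, 3))
    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))
    # Recalibration: scale every channel by its gate.
    return features * gates[:, None, None, None], gates


# Toy example: 2 channels, bottleneck of width 1, hand-picked biases.
features = np.arange(16, dtype=np.float64).reshape(2, 2, 2, 2)
w1, b1 = np.zeros((1, 2)), np.zeros(1)
w2, b2 = np.zeros((2, 1)), np.array([100.0, -100.0])
recalibrated, gates = se_block(features, w1, b1, w2, b2)
```

In the toy example the biases push the first gate towards 1 and the second towards 0, so the first channel passes through almost unchanged while the second is suppressed. In a trained network the gates depend on the squeezed input through the learned weights.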

Multi-scale Contextual Information. To deal with 3D medical scans, the 3D CNNs above process small 3D patches. However, small patches allow the network to learn only limited contextual information. Larger patches are needed to provide larger receptive fields and more contextual information to the network. Therefore, inspired by [5], we design a two-pathway parallel architecture that processes input patches at two scales simultaneously. As shown in Fig. 5, the model incorporates both local and larger contextual information: it extracts semantic features at a higher resolution and also considers larger contextual information from the lower resolution level. This provides rich information for discriminating voxels that appear very similar when only local appearance is considered, which helps to avoid wrong predictions.

Fig. 5. The proposed network architecture to introduce multi-scale contextual information.

Table 1. Mean values of Dice and Hausdorff95 measurements on BraTS 2018 validation set (submission id DL-86-61).
Table 2. Mean values of Sensitivity and Specificity measurements on BraTS 2018 validation set.
Table 3. The segmentation results of our proposed method on BraTS 2018 testing set.
Fig. 6. Example segmentation results on the validation set of BraTS 2018. From left to right: FLAIR, T1ce, segmentation results using MC-Net only overlaid on the FLAIR image, and segmentation results using the proposed method overlaid on the FLAIR image; edema (green), necrosis and non-enhancing (blue), and enhancing (red). (Color figure online)

3.3 Ensembles of the Above Multiple Models

Model ensembling is an effective method to improve performance, e.g. Kamnitsas et al. [19] ensembled DeepMedic [5], 3D FCN [18], and 3D U-Net [9] into EMMA. In this paper, we also adopt model ensembling to obtain more robust segmentation results. The models described above, including MC-Net, OM-Net and their variants, are trained separately, and their predicted probabilities are averaged at testing time. Additionally, a simple yet effective post-processing method [14] is adopted to improve segmentation performance.
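Averaging the predicted probabilities can be sketched as follows. The sketch assumes every model outputs a probability map of shape (C, ...) with the class axis first and uses uniform weights. The name `ensemble_average` is hypothetical, and the returned labels are channel indices, which would still need to be mapped to the BraTS label values.

```python
import numpy as np


def ensemble_average(probability_maps):
    """Average per-class probability maps from several models.

    Each entry has shape (C, ...) with the class axis first. Returns the
    mean probabilities and the index of the most probable class per voxel.
    """
    stacked = np.stack(probability_maps, axis=0)
    mean_probabilities = stacked.mean(axis=0)
    return mean_probabilities, mean_probabilities.argmax(axis=0)


# Two hypothetical models, two classes, two voxels.
model_a = np.array([[0.9, 0.2],
                    [0.1, 0.8]])
model_b = np.array([[0.3, 0.4],
                    [0.7, 0.6]])
mean_probabilities, labels = ensemble_average([model_a, model_b])
```

For the first voxel the two models disagree, and the average follows the more confident model. This is the sense in which averaging probabilities, rather than voting on hard labels, takes each model's confidence into account.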

4 Experiments and Results

Pre-processing. We apply only minimal pre-processing to the BraTS 2018 data: each sequence is individually normalized by subtracting the mean and dividing by the standard deviation of the intensities within the brain area of that sequence.
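A minimal sketch of this normalization is given below. It assumes that the brain area can be taken as the non-zero voxels of the skull-stripped sequence and that the transform is applied to every voxel using the brain statistics. The function name `normalize_sequence` and the small guard against a zero standard deviation are additions for illustration.

```python
import numpy as np


def normalize_sequence(volume, brain_mask=None):
    """Z-score one MRI sequence using statistics from the brain area only."""
    volume = np.asarray(volume, dtype=np.float64)
    if brain_mask is None:
        # Assumption: skull-stripped background is exactly zero.
        brain_mask = volume != 0
    brain_voxels = volume[brain_mask]
    mean = brain_voxels.mean()
    # Guard against division by zero for a constant brain region.
    std = max(brain_voxels.std(), 1e-8)
    return (volume - mean) / std


# Toy sequence: two background voxels and two brain voxels.
sequence = np.array([[0.0, 0.0], [1.0, 3.0]])
normalized = normalize_sequence(sequence)
```

Each of the four sequences (T1, T1ce, T2, FLAIR) would be passed through such a function separately, so that their intensity ranges become comparable before they are stacked as input channels.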

Segmentation Results. Table 1 presents the mean Dice and Hausdorff95 measurements of the different models on the BraTS 2018 validation set, and Table 2 presents the corresponding mean Sensitivity and Specificity measurements. OM-Net outperforms MC-Net despite having fewer trainable parameters. The extended networks, including MC-Net (Dense connections), MC-Net (Multi-scale), OM-Net (Attention), Deeper OM-Net and Deeper OM-Net (Attention), improve the segmentation performance to some extent. Finally, the proposed method achieves average Dice scores of 0.8136, 0.9095 and 0.8651 for enhancing tumor, whole tumor and tumor core, respectively. Fig. 6 provides a qualitative comparison, in which the ensemble produces visibly better segmentations than MC-Net alone, supporting the effectiveness of the proposed method.

Table 3 presents the segmentation results of our proposed method on the BraTS 2018 testing set. The proposed method won third place in the BraTS 2018 competition.

5 Conclusion

In this work, we employ OM-Net to obtain strong baseline results, and then extend MC-baseline and OM-Net in multiple respects to further improve performance. Finally, the predictions of these models are ensembled to produce robust performance for brain tumor segmentation. The proposed method yields promising results, winning third place in the final testing stage of the BraTS 2018 challenge.