1 Introduction

Multi-spectral MRI is the usual imaging modality used to detect, localize and grade brain tumors [1]. Considerable effort has recently been invested in the development of automatic MRI data processing techniques [2, 3]. A wide range of algorithms has been developed, covering the whole arsenal of decision making methods. Most solutions rely on supervised and semi-supervised machine learning techniques supported by advanced image segmentation methods, such as random forest ensembles [4,5,6,7], discrete and real AdaBoost [8], extremely random trees [9], support vector machines [10], convolutional neural networks [11, 12], deep neural networks [13,14,15], Gaussian mixture models [16, 17], fuzzy c-means clustering in semi-supervised context [18, 19], tumor growth models [20], cellular automata combined with level sets [21], active contour models combined with texture features [22], and graph cut based segmentation [23]. Earlier brain tumor segmentation solutions were comprehensively summarized by Gordillo et al. [24].

In this study we built a framework to evaluate ensemble learning algorithms in segmenting brain tumors from volumetric MRI data. We compare the accuracy and efficiency achieved by various decision making techniques, employed within the same scenario and working with the very same pre-processed data originating from the BraTS 2016 database. The rest of the paper is structured as follows: Sect. 2 provides the technical details of the framework and the algorithms included in the evaluation. Section 3 analyses and discusses the obtained results. Section 4 concludes the investigation.

Fig. 1. Block diagram of the evaluation framework.

2 Materials and Methods

2.1 Framework

Data. This study is based on the whole set of 220 high-grade (HG) tumor records of the BraTS 2016 train dataset [2]. Each record contains four data channels (T1, T2, T1C, FLAIR). All channels are registered to the T1 channel. Volumes consist of \(155\times 240 \times 240\) isovolumetric voxels. Each voxel represents one cubic millimeter of brain tissue. An average volume contains approximately 1.5 million brain voxels. The annotations made by human experts and provided by BraTS are used as ground truth within this study.

Processing Steps. The main steps of this application are presented in Fig. 1. Data records need pre-processing to provide uniform histograms and to generate further features for the classification. Data originating from train records are sampled for the training of ensembles. Trained ensembles are evaluated using the whole test volumes. Post-processing is applied to the prediction provided by the ensembles, to regularize the shape of the tumor and improve the segmentation quality. Finally, the precision of the segmentation is evaluated using statistical tools.

Pre-processing. There are three main pre-processing problems to handle when working with MRI data: (1) the intensity non-uniformity [25,26,27]; (2) the great variety of MR image histograms; (3) generating further features. The HG tumor volumes of the BraTS dataset contain no relevant inhomogeneity [2], so its compensation can be omitted. Uniform histograms are provided for each data channel of each MR record, using a context dependent linear transform, which assigns the 25th and 75th percentiles to intensity levels 600 and 800, respectively, and forces all transformed intensities to be situated in the predefined range of 200 to 1200. Details of this transform are presented in our previous paper [28]. Besides the 4 observed data channels, 100 further features are generated using morphological, gradient, and Gabor wavelet based techniques [28, 29].
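The histogram normalization described above can be illustrated with a minimal sketch. It assumes that the percentiles are computed over the voxel intensities of one channel of one record, that percentiles are linearly interpolated, and that out-of-range values are clipped; the function names `percentile` and `normalize_channel` are hypothetical, and the exact transform is the one given in [28], which may differ in these details.

```python
def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0..100) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("empty input")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def normalize_channel(intensities, p25_target=600.0, p75_target=800.0,
                      low=200.0, high=1200.0):
    """Map the 25th/75th percentiles to 600/800 and clip to [200, 1200].

    Assumption: a single linear map per channel, followed by clipping.
    """
    ordered = sorted(intensities)
    p25 = percentile(ordered, 25)
    p75 = percentile(ordered, 75)
    if p75 == p25:
        raise ValueError("25th and 75th percentiles coincide")
    scale = (p75_target - p25_target) / (p75 - p25)
    # Linear transform anchored at the 25th percentile, then clipping.
    return [min(high, max(low, p25_target + (v - p25) * scale))
            for v in intensities]
```

Anchoring the transform at two percentiles rather than at the minimum and maximum makes it insensitive to a small number of extreme intensities, which are simply clipped to the limits of the target range.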

Decision Making. The 220 HG tumor records were randomly divided into two equal groups, which served as train and test data during the two-round cross validation. Thus we obtain a segmentation accuracy benchmark for each MRI record using ensembles trained with data from the complementary group. Ensemble units were trained using the feature vectors of 10,000 randomly selected voxels from the train records, of which 93% were negatives and 7% positives, as described in our previous study [28]. All ensembles were trained to separate two classes: normal tissues and whole tumor lesions.
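The data split and the sampling of training voxels can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `seed` parameter and the representation of voxels as list items are hypothetical, and the actual sampling procedure is the one described in [28].

```python
import random


def two_fold_split(record_ids, seed=None):
    """Randomly divide the records into two equal groups (assumed uniform shuffle)."""
    rng = random.Random(seed)
    ids = list(record_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def sample_training_voxels(negatives, positives, n_samples=10_000,
                           positive_rate=0.07, seed=None):
    """Draw one training set with the stated 93% / 7% class proportions.

    negatives, positives: lists of feature vectors (or voxel indices)
    collected from the train records. Returns (item, label) pairs.
    """
    rng = random.Random(seed)
    n_pos = round(n_samples * positive_rate)
    n_neg = n_samples - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough voxels to draw the requested sample")
    sample = ([(v, 1) for v in rng.sample(positives, n_pos)] +
              [(v, 0) for v in rng.sample(negatives, n_neg)])
    rng.shuffle(sample)  # avoid presenting all positives first
    return sample
```

In a two-round cross validation, each group would serve once as the source of training samples and once as test data, so that every record is segmented by ensembles that never saw it during training.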

Post-processing. Our post-processing step relabels each voxel based on the rate of predicted positives situated within an \(11\times 11\times 11\) cubic neighborhood. The threshold was set empirically at 35%.
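A brute-force sketch of this neighborhood-based relabeling is given below. Two details are assumptions not stated in the text: the neighborhood is truncated at the volume borders, and a voxel is labeled positive when the rate strictly exceeds the threshold. The function name `regularize_mask` is hypothetical, and a practical implementation would use a faster scheme such as summed-volume tables.

```python
def regularize_mask(pred, size=11, threshold=0.35):
    """Relabel each voxel by the rate of predicted positives around it.

    pred: 3D nested list of 0/1 predictions, indexed [z][y][x].
    Assumption: neighborhoods are truncated at the volume borders.
    """
    r = size // 2
    nz, ny, nx = len(pred), len(pred[0]), len(pred[0][0])
    out = [[[0] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                positives = total = 0
                lo, hi = max(0, x - r), min(nx, x + r + 1)
                for zz in range(max(0, z - r), min(nz, z + r + 1)):
                    for yy in range(max(0, y - r), min(ny, y + r + 1)):
                        positives += sum(pred[zz][yy][lo:hi])
                        total += hi - lo
                # Positive only if the local positive rate exceeds the threshold.
                out[z][y][x] = 1 if positives / total > threshold else 0
    return out
```

With this rule, isolated false positives are removed and small holes inside a predicted lesion are filled, which is the intended regularizing effect.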

Evaluation Criteria. The accuracy indicators involved in this study are based on the amount of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The main accuracy indicators derived from these numbers, namely the Dice score (DS), sensitivity (true positive rate, TPR), specificity (true negative rate, TNR), and accuracy (ACC), are presented in Table 1. These indicators are established for each individual HG tumor record, and then average and median values are computed to characterize the overall accuracy. The evaluation criterion of algorithm efficiency is the average runtime of the whole processing of individual MR records.
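For reference, the four indicators can be computed from the confusion counts as sketched below. The formulas are the standard definitions of the Dice score, sensitivity, specificity and accuracy, assumed here to match those listed in Table 1; the function names are hypothetical.

```python
def confusion_counts(predicted, truth):
    """Count TP, TN, FP, FN over two aligned sequences of 0/1 labels."""
    tp = tn = fp = fn = 0
    for p, t in zip(predicted, truth):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def segmentation_scores(tp, tn, fp, fn):
    """Standard accuracy indicators; assumes every denominator is non-zero."""
    return {
        "DS": 2 * tp / (2 * tp + fp + fn),        # Dice score
        "TPR": tp / (tp + fn),                    # sensitivity
        "TNR": tn / (tn + fp),                    # specificity
        "ACC": (tp + tn) / (tp + tn + fp + fn),   # accuracy
    }
```

Because tumor voxels form a small minority of each volume, TNR and ACC are dominated by the true negatives, which is why the Dice score is the more informative indicator of segmentation quality.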

2.2 Algorithms

Ensemble learning methods achieve high classification accuracy by the majority voting of several weak classifiers. In this study we investigate the following algorithms:

  • Random forest (RF) classifier, as implemented in OpenCV ver. 3.4.0. RF is an ensemble of binary decision trees. The main parameters are the number of trees and the maximum tree depth. Train data sets of 10,000 items were best learned using maximum depth set to seven.

  • Ensemble of real AdaBoost classifiers, as implemented in OpenCV ver. 3.4.0.

  • Ensemble of perceptron networks (ANN), as implemented in OpenCV ver. 3.4.0, using four layers of sizes 104, 15, 7, and 1, respectively.

  • Ensemble of binary decision trees (BDT), using our own implementation [28]. BDTs can be trained to perfectly separate negative from positive samples unless there exist coincident feature vectors with different ground truth. The maximum depth of BDTs was \(20.6 \pm 3.4\) (\(\mathrm{AVG} \pm \mathrm{SD}\)), but decisions were made at an average depth of \(7.71\pm 2.89\).
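The majority voting that combines the units of an ensemble can be sketched as follows. This is an illustration only: the function name `majority_vote` is hypothetical, the unit classifiers are represented by their already computed 0/1 predictions, and resolving ties as negative is an assumption (ties cannot occur with an odd number of units).

```python
def majority_vote(unit_predictions):
    """Combine per-unit 0/1 predictions into one label per voxel.

    unit_predictions: list with one entry per ensemble unit, each a list
    of 0/1 labels of equal length (one label per voxel).
    Assumption: a tie is resolved as negative.
    """
    n_units = len(unit_predictions)
    return [1 if 2 * sum(votes) > n_units else 0
            for votes in zip(*unit_predictions)]
```

For example, `majority_vote([[1, 0, 1], [1, 0, 0], [0, 0, 1]])` labels the first and third voxel positive, since two of the three units vote positive for each of them.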

Table 1. Criteria to evaluate segmentation quality

3 Results and Discussion

The ensemble learning techniques listed above were tested using the 220 high-grade tumor records of the BraTS 2016 database. Four ensemble sizes ranging from 5 to 255 were evaluated. The quality indicators shown in Table 1 were extracted for each algorithm and each MRI record separately, together with the average and median value of each indicator for overall accuracy evaluation. Group comparisons involving all algorithms and one-against-one tests were carried out, using individual data records and the whole HG data set as well.

Overall average and median values of the four main quality indicators are exhibited in Table 2, for all evaluated ensemble learning algorithms and various ensemble sizes. Median values were greater than the averages for all indicators and scenarios, because there are a few records of reduced or damaged quality that are likely to be segmented considerably worse than all others. The highest values highlighted in each column of the table indicate that the random forest achieved slightly better results than any other evaluated technique. The accuracy of segmentation rises together with the ensemble size up to 125 units, above which it seems to stabilize or fall slightly. The highest achieved average Dice scores approached 81%, while median values surpassed 86%. The accuracy of all evaluated ensemble learning techniques is around 98%, meaning that approximately one voxel out of 50 is misclassified.

Table 2. Various statistical accuracy indicator values achieved by tested techniques and ensemble sizes, expressed in percentage (%). Best performance is highlighted in all columns. AVG stands for average, MED stands for median.
Fig. 2. Main quality indicator values obtained for individual HG tumor volumes, using the random forest method in ensemble of 125, sorted in increasing order.

Table 3. Comparison of the tested ensemble learning techniques using the DS obtained for the 220 individual HG tumor records. Best scores were identified and highlighted in each row of the table.

Figure 2 exhibits the indicator values obtained by the random forest using an ensemble of 125 units, which was identified as the most accurately performing algorithm: the Dice score and sensitivity are shown in the left panel, the specificity and accuracy in the right panel. Approximately 10% of the records led to mediocre results. In these cases the classification methods failed to capture the main specific characteristics of the data, probably because the recorded images were of low quality.

Table 3 shows for each test scenario (algorithm and ensemble size) the number of successfully segmented records, where the Dice score exceeded predefined threshold values ranging from 50% to 92%. Highest values highlighted for each threshold value indicate again that random forest achieved the best segmentation quality.
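The counting behind such a comparison is straightforward and can be sketched as below, assuming that a record counts as successfully segmented when its Dice score strictly exceeds the threshold. The function name `count_above_thresholds` is hypothetical.

```python
def count_above_thresholds(dice_scores, thresholds):
    """For each threshold, count the records whose Dice score exceeds it.

    dice_scores: per-record Dice scores of one test scenario.
    thresholds: threshold values on the same scale as the scores.
    """
    return {t: sum(1 for d in dice_scores if d > t) for t in thresholds}
```

Applied to the per-record Dice scores of one algorithm and ensemble size, this yields one column of counts; repeating it for every scenario gives a table in which the scenarios can be compared threshold by threshold.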

Figure 3 presents the outcome of the one-against-one comparison of the tested algorithms, each using ensembles of 125 units. The Dice scores shown here were obtained on each individual HG tumor record. Each cross (\(\times \)) in the graph shows the Dice scores achieved by two ensemble learning techniques on the very same data. Most crosses are situated in the proximity of the diagonal, indicating that both algorithms obtained nearly the same accuracy. There are also crosses far from the diagonal, representing scenarios where one of the methods led to significantly better segmentation quality.
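One simple way to summarize such one-against-one comparisons numerically is to count, for each pair of algorithms, the records on which one achieved a higher Dice score than the other. The sketch below is an assumption about how such a tally could be made, not a description of the procedure used in this study; the function name `pairwise_wins` and the `margin` parameter are hypothetical.

```python
from itertools import combinations


def pairwise_wins(dice_by_algorithm, margin=0.0):
    """Tally per-record wins for every pair of algorithms.

    dice_by_algorithm: dict mapping an algorithm name to its list of
    per-record Dice scores, with records in the same order for all.
    margin: minimum Dice difference counted as a win (assumed 0 by default).
    Returns {(a, b): (wins_a, wins_b, ties)}.
    """
    results = {}
    for a, b in combinations(dice_by_algorithm, 2):
        wins_a = wins_b = ties = 0
        for da, db in zip(dice_by_algorithm[a], dice_by_algorithm[b]):
            if da - db > margin:
                wins_a += 1
            elif db - da > margin:
                wins_b += 1
            else:
                ties += 1
        results[(a, b)] = (wins_a, wins_b, ties)
    return results
```

With four algorithms the function returns six pairs, corresponding to the six possible one-against-one combinations.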

Fig. 3. Dice scores obtained for individual volumes by the four algorithms using ensembles of size 125, plotted one algorithm vs. another, in all six possible combinations.

Table 4 exhibits the same results as Fig. 3, but here the one-against-one outcome of tests is organized in a tournament format. The tournament was won by the random forest algorithm, followed by BDT, AdaBoost, and ANN. Figure 4 compares the efficiency of the four evaluated algorithms. The total runtimes exhibited here include the duration of histogram normalization, feature generation, segmentation and post-processing of a previously unseen MR data volume of average size. All tests were performed on a notebook computer, using a single core of a quad-core i7 processor that runs at 3.4 GHz. AdaBoost and ANN proved to be significantly less efficient than RF and BDT.

Table 4. Dice score tournament using the 220 HG volumes: algorithms against each other, each using ensembles of size 125. Here ANN proved to be the weakest.
Fig. 4. Runtime benchmarks of the four classification algorithms: the average value of the total processing time in a single record testing problem.

4 Conclusions

This study compared the accuracy and efficiency of various ensemble learning algorithms applied to brain tumor segmentation based on multispectral magnetic resonance image data. The performed investigation indicates that publicly available implementations of ensemble learning methods are all capable of detecting and segmenting the tumor with acceptable accuracy. The small differences in accuracy, together with the larger ones in efficiency, revealed that random forest is the best decision making algorithm among those investigated. Future work will aim at involving more data sets and more machine learning algorithms in the comparative study.