1 Introduction

Breast cancer accounts for about \(12\%\) of all new tumour cases and about \(25\%\) of all cancers in women. With these numbers, it is the most common tumour in women worldwide and the second most common overall, with almost 2 million new cases diagnosed per year [5]. The key factor in improving breast neoplasm prognosis is early detection, especially for cancer (malignant tumours). The World Health Organisation indicates x-ray mammography as the standard diagnostic tool [26], owing to its high resolution and low operational costs. However, its main disadvantages include the use of ionising radiation (x-rays) and its low specificity, especially for radiographically dense breast tissue (as in women under forty) or when patients have scars or breast implants.

In recent years, Dynamic Contrast Enhanced-Magnetic Resonance Imaging (DCE-MRI) has demonstrated great potential in screening different tumour tissues, gaining increasing popularity as an important complementary diagnostic methodology for early detection of breast cancer [12]. DCE-MRI advantages include its ability to acquire 3D high-resolution dynamic (functional) information, not available with conventional x-ray imaging [23], and its limited invasiveness, since it does not make use of any ionising radiation or radiocontrast. It has been successfully used for women under forty and for high-risk patients [4], both for assessing therapy effects and for staging newly diagnosed breast cancer [17].

DCE-MRI consists of 4-dimensional data, obtained by combining different 3D volumes acquired before (pre) and after (post) the intravenous injection of a paramagnetic contrast agent (usually Gadolinium-based), as depicted in Fig. 1a. Each voxel is associated with a Time Intensity Curve (TIC) representative of the temporal dynamics of the acquired signal (see Fig. 1b) that reflects the absorption and the release of the contrast agent, following the vascularisation characteristics of the tissue under analysis [22].
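As a purely illustrative sketch of this data layout (the array name, shape and values below are assumptions for exposition, not the acquisition protocol described in Sect. 2), the TIC of a voxel can be read directly from the 4D array and normalised with respect to the pre-contrast baseline:

```python
import numpy as np

# Hypothetical 4D DCE-MRI volume with layout (time, slices, rows, cols):
# one pre-contrast series (t0) followed by nine post-contrast series (t1..t9).
n_times, n_slices, n_rows, n_cols = 10, 80, 256, 128
dce = np.random.rand(n_times, n_slices, n_rows, n_cols).astype(np.float32)

def time_intensity_curve(volume, z, y, x):
    """Return the TIC of a single voxel across all acquired series."""
    return volume[:, z, y, x]

def relative_enhancement(tic):
    """Signal enhancement relative to the pre-contrast baseline t0."""
    baseline = tic[0]
    return (tic - baseline) / (baseline + 1e-6)

tic = time_intensity_curve(dce, z=40, y=120, x=60)
print(relative_enhancement(tic))
```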

Fig. 1. DCE-MRI and time intensity curves. (a) A representation of the four dimensions (3 spatial + 1 temporal) of a typical breast DCE-MRI scan; (b) some examples of time intensity curves.

While the use of DCE-MRI has been proven to improve breast cancer diagnosis [14], its reading is a very time-consuming and error-prone task that involves the analysis of a huge amount of data [16]. A visual assessment of lesion malignancy can be performed by referring to Fig. 1b. Type I corresponds to a straight (Ia) or curved (Ib) line where contrast absorption continues over the entire dynamic study (typical of healthy tissues or benign neoplasms); Type II represents a plateau curve with a sharp bend after the initial upstroke (typical of probably malignant lesions); finally, Type III shows a washout time course (typical of malignant lesions). It follows that radiologists can hardly inspect DCE-MRI data without the use of a Computer Aided Detection/Diagnosis (CAD) system designed to reduce such an amount of data, allowing them to focus their attention only on regions of interest.
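A rough, purely illustrative way of discretising the three curve types is to compare the late-phase signal with the early post-contrast peak; the thresholds below are arbitrary assumptions and not part of any clinical protocol:

```python
import numpy as np

def classify_tic(tic, plateau_tol=0.10):
    """Rough three-class labelling of a time intensity curve.

    Compares the late-phase signal with the early post-contrast peak:
    continued rise -> Type I, roughly flat -> Type II (plateau),
    clear decrease -> Type III (washout). Thresholds are illustrative.
    """
    tic = np.asarray(tic, dtype=float)
    early_peak = tic[1:4].max()          # initial upstroke
    late = tic[-3:].mean()               # end of the dynamic study
    change = (late - early_peak) / (early_peak + 1e-6)
    if change > plateau_tol:
        return "Type I (persistent enhancement)"
    if change < -plateau_tol:
        return "Type III (washout)"
    return "Type II (plateau)"

print(classify_tic([100, 180, 200, 210, 220, 230, 240, 250, 255, 260]))
print(classify_tic([100, 200, 210, 205, 200, 195, 190, 185, 180, 175]))
```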

In particular, in this paper we focus on a CAD system for automatic lesion diagnosis, in order to better support the physician's decision by reducing inter/intra-operator variability.

When lesion diagnosis is performed by means of classifier systems [8], many features have been proposed so far, which can be roughly grouped into Clinical [9] (such as age and parental history), Dynamic [7, 9] (directly extracted from the TIC), Textural [6] (such as variance, kurtosis and skewness), Pharmacokinetic [15] (extracted by means of mathematical models of the contrast agent and tissue), Spatio-temporal [25] (such as DFT coefficient maps, margin and radial gradients) and Morphological [7] (such as lesion eccentricity, compactness and perimeter).
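As a small example of the Textural group (a hedged sketch: the feature choice and the synthetic ROI below are illustrative, not those of [6]), first-order statistics can be computed directly from the ROI intensities:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def textural_features(roi_intensities):
    """First-order textural statistics of the voxel intensities in an ROI."""
    x = np.asarray(roi_intensities, dtype=float).ravel()
    return {
        "mean": x.mean(),
        "variance": x.var(),
        "skewness": skew(x),
        "kurtosis": kurtosis(x),
    }

roi = np.random.gamma(shape=2.0, scale=50.0, size=(32, 32))  # synthetic ROI
print(textural_features(roi))
```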

While newer hand-crafted features are continuously being proposed by domain experts, in recent years deep learning approaches have gained popularity in many pattern recognition tasks, being able to outperform classical machine learning techniques in different fields [10]. Among these approaches, we can cite Convolutional Neural Networks (CNNs), which are composed of different convolutional layers stacked in a deep architecture (as in Fig. 2) meant to automatically learn the best data representation as a composition of simpler concepts. They usually perform better than classifiers trained on hand-crafted features because they are able to learn a compact hierarchical representation of an image which well fits the specific task to be solved.

Deep approaches have been used in brain DCE-MRI, both for lesion and anatomical segmentation [18, 24], and in prostate tissue analysis, by using deep auto-encoders for tumour grading and diagnosis [13, 20], while breast DCE-MRI cancer lesion detection had never been addressed.

In a recent publication, Antropova et al. [2] first proposed to apply deep learning to the breast cancer lesion diagnosis task, by feeding a pre-trained CNN with the ROIs extracted from DCE-MRI slices containing a lesion. They adopted the AlexNet architecture pre-trained on the ImageNet dataset [11] as a feature extractor, using a Support Vector Machine (SVM) for the malignant/benign classification task. To the best of our knowledge, this is the first time a deep learning approach has been used in breast DCE-MRI analysis. Nevertheless, the results reported in [2] refer to a single training modality (the CNN used as a feature extractor); moreover, they are obtained by using cross-validation (CV) for performance evaluation, which in this case can provide an overestimation of the actual performance. Finally, no comparisons with other approaches on the same data are reported.

The aim of this work is thus to assess CNN capabilities for automatic lesion classification in breast DCE-MRI, in order to provide broader results on their applicability and effectiveness for a very specific task such as the one considered here. In addition to the training modality proposed in [2], we also explored the fine-tuning of a pre-trained AlexNet and the complete training from scratch of the same network. All the results have also been compared with those achieved by other proposals, using a Leave-One-Patient-Out CV evaluation in order to ensure fair and more reliable findings.

The rest of the paper is organized as follows: Sect. 2 gives some information on the proposed methodology and on the literature proposals used to validate the approach considered in this study, and presents the dataset used for comparing the results. In Sect. 3, we present the results obtained by the different CNN-based approaches, together with those achieved by the other methods under comparison. Finally, in Sect. 4, we discuss these results and provide some conclusions.

2 Materials and Methods

2.1 Dataset

Patients. The dataset consists of breast DCE-MRI 4D data from 42 women (average age 40 years, range 16–69) with histopathologically proven benign or malignant lesions: 42 regions of interest (ROIs) were malignant and 25 were benign, for a total of 67 ROIs.

Data Acquisition. All patients underwent imaging with a 1.5 T scanner (Magnetom Symphony, Siemens Medical System, Erlangen, Germany) equipped with a breast coil. DCE T1-weighted FLASH 3D coronal images were acquired (TR/TE: 9.8/4.76 ms; flip angle: 25\(^\circ \); field of view: 370 \(\times \) 185 mm\(^2\); matrix: 256 \(\times \) 128; thickness: 2 mm; gap: 0; acquisition time: 56 s; 80 slices spanning the entire breast volume). One series (\(t_{0}\)) was acquired before and 9 series (\(t_{1}\)–\(t_{9}\)) after the intravenous injection of 0.1 mmol/kg of a positive paramagnetic contrast agent (gadoteric acid, Gd-DOTA, Dotarem, Guerbet, Roissy CdG Cedex, France). An automatic injection system was used (Spectris Solaris EP MR, MEDRAD, Inc., Indianola, PA), with the injection flow rate set to 2 ml/s followed by a flush of 10 ml of saline solution at the same rate.

Gold Standard. An experienced radiologist (A.P.) delineated suspect ROIs using the T1-weighted and subtractive image series. Starting from the acquired DCE-MRI data, the subtractive image series is defined by subtracting the \(t_0\) series from the \(t_4\) series. In subtractive images, any tissue that does not absorb the contrast agent is suppressed. The manual segmentation stage was performed in OsiriX [21], which allows the user to define ROIs at a sub-pixel level. Each ROI was histopathologically proven, and the evidence of malignancy was used as the Gold Standard (GS) for the ROI classification problem.

2.2 Lesion Diagnosis in DCE-MRI

The breast lesion diagnosis problem has mainly been addressed within the pattern recognition framework. The most recent literature proposals mostly differ in the feature vector used to describe the subject of the classification. In the following we briefly review those we considered for comparison; the CNN-based approach is then described in the next sub-section. The choice of the approaches used for comparison is motivated by the attempt to cover most of the feature taxonomy presented in the introduction.

In [9], Glaßer et al. proposed to use a Decision Tree trained on Clinical and Dynamic features, in order to consider both the patient's high-level information and the contrast agent perfusion parameters derived from the signal's temporal dynamics.

Fusco et al. [7] suggested using both Dynamic and Morphological features, combining them by means of a Multiple Classifier System, in order to take into account both the contrast agent concentration and the lesion shape.

In [19], trying to exploit both spatial and spatio-temporal information, a Random Forest classifier (made up of 10 Random Trees, each one using a random subset of features with no limitation on its maximum depth) was trained on the spatio-temporal version of Local Binary Patterns on Three Orthogonal Planes (LBP-TOP [1]).
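The following is a much simplified sketch of the idea behind LBP-TOP, not the implementation used in [19] or defined in [1]: a basic 8-neighbour LBP is computed on a single central plane for each of the three orientations (XY, XT, YT) and the normalised histograms are concatenated.

```python
import numpy as np

def lbp_image(img):
    """Basic 8-neighbour LBP codes for a 2D image (borders excluded)."""
    c = img[1:-1, 1:-1]
    codes = np.zeros(c.shape, dtype=np.int64)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(shifts):
        neigh = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes += (neigh >= c).astype(np.int64) << bit
    return codes

def lbp_top(volume):
    """Concatenated LBP histograms over the XY, XT and YT planes.

    `volume` is a (time, rows, cols) stack of one lesion ROI; only the
    central plane of each orientation is used here for brevity, whereas
    the full descriptor accumulates codes over all planes.
    """
    t, r, c = volume.shape
    planes = [volume[t // 2, :, :],      # XY plane (spatial texture)
              volume[:, r // 2, :],      # XT plane (rows vs. time)
              volume[:, :, c // 2]]      # YT plane (columns vs. time)
    hists = [np.bincount(lbp_image(p).ravel(), minlength=256) for p in planes]
    hists = [h / max(h.sum(), 1) for h in hists]
    return np.concatenate(hists)         # 768-dimensional descriptor

vol = np.random.rand(10, 64, 64)         # toy ROI stack over the dynamic series
print(lbp_top(vol).shape)
```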

2.3 CNN for Lesion Diagnosis

As stated in the introduction, Antropova et al. [2] investigated deep learning for the lesion diagnosis task using DCE-MRI data. They proposed to apply transfer learning from a pre-trained CNN (see Fig. 2).

Fig. 2. The considered convolutional neural network.

Their approach can be summarised as follows (a minimal sketch of the feature extraction and classification steps is given after the list):

1. Take only the second post-contrast series from the 4D DCE-MRI data.

2. Depending on the size of the tumour, a tile around the lesion is extracted from each lesion slice. The tile size varies between 1 and 1.5 times the maximum diameter of the observed lesion.

3. Images are up-sampled to yield 256\(\,\times \,\)256 pixel ROIs.

4. AlexNet [11] pre-trained on the ImageNet [3] database is used to extract a feature vector from the last internal fully connected layer (fc7 in Fig. 2); 4096 features are extracted per slice.

5. Performance is evaluated using a 10-fold cross-validation, considering an SVM as classifier (no information about the kernel or other hyper-parameters is provided).
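The sketch below illustrates steps 4–5. It is neither the original implementation of [2] nor the one used in our experiments (which relies on the MATLAB Neural Network Toolbox, see Sect. 3): the library choices (a recent torchvision and scikit-learn), the fc7 cut point and the toy data are assumptions made for exposition.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

# Load AlexNet pre-trained on ImageNet and drop the final classification layer,
# so that the forward pass ends at the 4096-dimensional fc7 activations.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])
alexnet.eval()

preprocess = T.Compose([
    T.ToTensor(),                       # ROI tile assumed as a float image in [0, 1]
    T.Resize((224, 224)),               # AlexNet input size
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fc7_features(roi_tile):
    """Extract a 4096-d feature vector from a single-channel 256x256 ROI tile."""
    rgb = np.repeat(roi_tile[..., None], 3, axis=-1).astype(np.float32)
    with torch.no_grad():
        return alexnet(preprocess(rgb).unsqueeze(0)).squeeze(0).numpy()

# Toy usage: random tiles and labels stand in for the real slices.
tiles = [np.random.rand(256, 256) for _ in range(8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]
features = np.stack([fc7_features(t) for t in tiles])
svm = SVC(kernel="poly", degree=3, C=1.0).fit(features, labels)
```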

To better investigate the capabilities of the Antropova et al. [2] proposal, we also explored two other training modalities, i.e., the fine-tuning and the training from scratch of AlexNet.

Fine-tuning a pre-trained CNN consists of replacing the last trained fully connected layers with untrained ones of the same shape. Retraining the network modified in this way softly adapts the weights of the pre-trained layers to the new task, while strongly training the new fully connected layers on the new data. Fine-tuning allows us to use a reduced number of training images and to achieve a stable result in fewer epochs. We chose the number of epochs needed to avoid overfitting by monitoring the loss function values during the training phase.
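A possible fine-tuning loop is sketched below; this is an illustrative assumption (again based on torchvision, not on our MATLAB implementation). The learning rate and number of epochs mirror the values reported in Sect. 3, and the data loader yielding (tile, label) batches is a placeholder.

```python
import torch
import torchvision.models as models

# Start from ImageNet weights and replace the final fully connected layer
# with an untrained one sized for the two-class (benign/malignant) problem.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.classifier[6] = torch.nn.Linear(4096, 2)

# Small learning rate so that the pre-trained layers are only softly adapted,
# while the new head is trained from its random initialisation.
optimizer = torch.optim.SGD(net.parameters(), lr=1e-5, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def fine_tune(loader, epochs=75):
    """One possible fine-tuning loop; `loader` yields (tile, label) batches."""
    net.train()
    for _ in range(epochs):
        for tiles, labels in loader:
            optimizer.zero_grad()
            loss = criterion(net(tiles), labels)
            loss.backward()
            optimizer.step()   # monitor the loss to pick the stopping epoch

# Toy batch standing in for a real data loader of 224x224 RGB lesion tiles.
toy_loader = [(torch.rand(4, 3, 224, 224), torch.randint(0, 2, (4,)))]
fine_tune(toy_loader, epochs=1)
```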

In order to perform training from scratch, we deployed exactly the same network architecture as AlexNet [11], performing a completely new supervised training. This approach is usually more demanding and needs a larger number of images to achieve a valuable solution.

2.4 Performance Evaluation

To obtain a fair performance evaluation, k-fold cross-validation (CV) is commonly used. In our case, however, even if each lesion is composed of different slices, the lesion diagnosis task has to predict a single class for the whole lesion. For this reason, it is very important to perform a Leave-One-Patient-Out Cross-Validation (LOPO-CV) instead of a slice-based k-fold CV, in order to reliably compare different models while avoiding mixing intra-patient slices in the evaluation phase. Therefore, in the next Section all the results are reported by performing a LOPO-CV and comparing each described approach in terms of Accuracy (ACC), Sensitivity (SEN), Specificity (SPE) and Area Under the ROC Curve (AUC).
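The following sketch shows how slices can be grouped by patient so that a whole patient is left out at each fold; the toy data and the scikit-learn API are illustrative assumptions, not our experimental pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# Toy slice-level data: each slice carries the id of the patient it belongs to,
# so that all slices of a patient fall into the same test fold (LOPO-CV).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))           # hypothetical slice feature vectors
y = rng.integers(0, 2, size=60)         # benign (0) / malignant (1)
patients = np.repeat(np.arange(12), 5)  # 12 patients, 5 slices each

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=patients):
    clf = SVC(kernel="poly", degree=3, C=1.0).fit(X[train_idx], y[train_idx])
    slice_predictions = clf.predict(X[test_idx])
    # slice_predictions are then combined into a single per-lesion label
    # using one of the strategies listed below.
```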

It is worth noting that, since classification is always performed at the slice level, a combining strategy has to be applied in order to provide a unique class label for each lesion. Among the possible strategies for combining the slice-level results provided by the classifier, we chose to investigate the following ones (a sketch of some of them is given after the list):

Majority Voting: The class of the lesion corresponds to the most voted class over all the slices.

Weighted Majority Voting: As for the majority voting, but each slice contribution is weighted by its class probability provided by the classifier.

Weighted Majority Voting by Slice Area: As in the previous case, but each slice contribution is proportionally weighted by using its area.

Naïve Bayes: Predicted classes of the slices are combined according to the Naïve Bayes approach.

Max Prob. Value per Slice: The class of the lesion corresponds to the class of the slice with the highest class probability value as provided by the classifier.

Biggest Slice: The class of the lesion is that of the biggest slice.
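A minimal sketch of three of these strategies on toy per-slice outputs (hypothetical values, for illustration only) is given below:

```python
import numpy as np

def majority_voting(slice_labels):
    """Lesion label = most voted class over the slices."""
    return int(np.bincount(slice_labels).argmax())

def weighted_majority_voting(slice_labels, slice_probs):
    """Each slice votes with the class probability given by the classifier."""
    votes = np.zeros(2)
    for label, prob in zip(slice_labels, slice_probs):
        votes[label] += prob
    return int(votes.argmax())

def biggest_slice(slice_labels, slice_areas):
    """Lesion label = label of the slice with the largest ROI area."""
    return int(slice_labels[int(np.argmax(slice_areas))])

labels = np.array([1, 0, 1, 1])          # per-slice predictions
probs = np.array([0.9, 0.6, 0.7, 0.55])  # classifier confidence per slice
areas = np.array([120, 340, 80, 60])     # ROI area per slice (pixels)
print(majority_voting(labels), weighted_majority_voting(labels, probs),
      biggest_slice(labels, areas))
```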

3 Experimental Results

This Section presents the results obtained by applying deep approaches to lesion diagnosis. All the results are reported here without comment; a discussion is provided in the next Section.

Table 1 compares all the CNN-based approaches presented so far, with the training performed to the best of our capability in order to achieve a fair comparison. Antropova et al. [2], in fact, proposed to use AlexNet as a feature extractor, but did not provide enough information about the SVM hyper-parameter settings. Therefore, we performed an optimization of the classification stage: the best results were obtained by using an SVM with a polynomial kernel of degree 3 and \(C=1\).
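A hedged sketch of such an optimization is given below: the grid, the toy data and the scikit-learn API are illustrative assumptions and not the exact selection procedure we used; the comment simply records the configuration that performed best in our experiments.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
train_features = rng.normal(size=(40, 4096))   # stand-in for per-slice fc7 features
train_labels = rng.integers(0, 2, size=40)

# Illustrative grid; the configuration that performed best in our experiments
# was a polynomial kernel of degree 3 with C = 1.
param_grid = {"kernel": ["poly", "rbf"], "degree": [2, 3], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="roc_auc")
search.fit(train_features, train_labels)
print(search.best_params_)
```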

The fine-tuning of AlexNet has been performed as discussed in Sect. 2; in this case, the best results were achieved with 75 epochs and a learning rate of \(10^{-5}\).

Table 1. Comparing different AlexNet training modalities, by varying the slice combining strategy. Average values obtained in Leave-One-Patient-Out CV over 42 patients are reported.

Finally, we also investigated whether training AlexNet from scratch could improve the lesion diagnosis with respect to the previously described approaches. In this case, we found that the learning rate strongly influences the learning curve. Looking at the loss function values per epoch (Fig. 3), we can see that the network suffers from under-fitting: the loss function settles at a certain value and, even if further training steps are performed, the network does not learn any more.

Fig. 3. Training from scratch: loss function values for different learning rates (LR), varying the number of epochs.

For the training from scratch approach, in the considered case a learning rate of \(10^{-2}\) and 100 epochs are enough to reach the best working point of the network. Unfortunately, only very poor results can be achieved, with an accuracy of 54.76% and an AUC of 68.55%.

Table 2 compares the best results obtained by a deep approach with those obtained by applying the methods proposed in [7, 9, 19].

Table 2. Comparison of the best results obtained by a CNN-based lesion diagnosis approach with those achieved by other state-of-the-art approaches. Average values obtained in Leave-One-Patient-Out CV over 42 patients are reported.

Finally, in order to closely replicate the results reported in [2], we also performed a slice-based 10-fold CV of the SVM (using the same parameters that gave us the results reported in the previous tables) fed with the 4096 features extracted by means of AlexNet, obtaining 91.75% and 96.90% in terms of Accuracy and AUC, respectively.
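For completeness, the following toy sketch shows what such a slice-based 10-fold CV looks like (an illustrative assumption, not our pipeline): slices of the same patient can end up in both the training and test folds, which is what inflates the scores with respect to LOPO-CV.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
slice_features = rng.normal(size=(120, 4096))   # stand-in for per-slice fc7 features
slice_labels = rng.integers(0, 2, size=120)

# Slice-level 10-fold CV: no grouping by patient is enforced between folds.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="poly", degree=3, C=1.0),
                         slice_features, slice_labels, cv=skf, scoring="roc_auc")
print(scores.mean())
```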

All the results have been obtained by using the MATLAB Neural Network Toolbox from MathWorks, over our University's computing infrastructure (SCoPE - www.scope.unina.it), where three DELL R720 servers are equipped with two Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80 GHz and 128 GB of RAM, and a cluster of five NVIDIA Tesla K20m GPUs is available.

4 Discussion and Conclusions

The aim of this paper was to investigate automatic lesion malignancy classification in breast DCE-MRI by means of Convolutional Neural Networks (CNNs), analyzing a literature proposal and comparing it with different training modalities and combining strategies. The evaluations were performed on ROIs manually segmented by an experienced radiologist (A.P.) for all the patients in our database. The obtained results were compared with previous findings in the literature, by using a Leave-One-Patient-Out cross-validation (LOPO-CV) in order to ensure a fair comparison and more reliable findings.

Antropova et al. [2] presented the first use of deep learning for the lesion classification task in DCE-MRI data, proposing to apply transfer learning from a pre-trained CNN. The results presented in Table 1 show that a CNN pre-trained on natural images (the ImageNet dataset) and used as a feature extractor performs better than the fine-tuning modality in terms of Accuracy, Sensitivity, Specificity and AUC. The same table also shows that the most effective slice combining technique is to assign to the lesion the class predicted for the slice containing the biggest ROI. This is reasonable, since the biggest ROI in a lesion is likely to carry most of the lesion malignancy information. The reported results also confirm that the training from scratch approach is not feasible with the reduced number of biomedical images usually available.

Table 2 compares the best CNN-based approach with other approaches presented so far in the literature, showing that, even if deep learning can outperform two of them [7, 9], it cannot overcome the method based on the LBP-TOP descriptor [19]. This result seems to suggest that, while CNNs show promising results in processing biomedical images, they have to be carefully designed and tuned in order to outperform approaches specifically designed to suitably exploit the data information relevant to the task. It is worth recalling, in fact, that the considered CNN, differently from the LBP-TOP descriptor, uses neither dynamic nor spatio-temporal information. Moreover, the training times of our best CNN-based solution are about two orders of magnitude greater than those needed by the other approaches.

Our results also confirm that it is important to compare all the approaches on a per-patient basis, in order to obtain a more reliable outcome. The results obtained by using a 10-fold CV are in fact significantly higher than those obtained by using a LOPO-CV. The former are, however, intrinsically unfair because they exploit intra-patient knowledge, thus biasing the cross-validation results.

As a final remark, we would like to highlight that one problem concerning the use of a CNN is the lack of a clear physiological interpretation of the classifier's operation, which can make it very difficult to motivate the results from the physician's point of view. Another limit of this study is the population size: our findings should therefore be confirmed on a larger dataset. Future work will focus on these aspects and on the study of network design choices able to suitably exploit the dynamic or spatio-temporal information coming from DCE-MRI data.