1 Introduction

Breast cancer affects millions of women worldwide every year and is the leading cause of cancer deaths among women [1]. Survival rates vary widely across countries: in North America the rate exceeds 80%, in Sweden and Japan it is around 60%, while in low-income countries it is below 40% [1]. The main reasons for the low survival rate in low-income countries are the lack of early-detection programs and the shortage of adequate diagnostic and healthcare facilities. It is therefore vital to detect breast cancer at an early stage to minimize mortality. Mammography and ultrasound imaging are the common tools for identifying cancers, but interpreting them requires expert radiologists, and the manual process can produce high numbers of false positives and false negatives. Therefore, computer-aided diagnosis (CAD) systems are now widely used to support radiologists during decision making, reducing their workload and lowering the false positive and false negative rates in diagnosis.

Machine learning techniques have shown significant performance gains over traditional computer-aided systems in various healthcare applications, such as disease diagnosis and patient monitoring [2]. However, traditional machine learning techniques involve a hand-crafted feature-extraction step, which is often difficult and requires both domain knowledge and an expert radiologist. Deep learning (DL) models, in contrast, learn adaptively and can automatically extract features from the input dataset with respect to the target output [3, 4]. DL methods greatly reduce the exhaustive process of data engineering and feature extraction while enabling model reusability. Numerous studies [5] have examined breast cancer images from various perspectives, and machine learning (ML), convolutional neural networks (CNNs), and deep learning methods are now widely used to classify breast cancers from breast images.

CNN models have been used effectively across a wide range of computer vision fields for years [6, 7]. Over the last few years, numerous studies have applied CNN-based deep architectures to disease diagnosis. CNN-based image recognition and classification models arguably first drew wide attention in the ImageNet competition [8]. Since then, CNN-based models have been adopted in many applications, for example, image segmentation in medical image processing, feature extraction from images, region-of-interest detection, object detection, and natural language processing.

A CNN has a very large number of trainable parameters across its layers, which are used to extract important features at various levels of abstraction [9]. However, a CNN model needs a huge dataset for training, and in the medical field such datasets are not always available. Moreover, a CNN model requires high-speed computing resources to train and to tune its hyperparameters. To overcome data scarcity, transfer learning techniques are now widely applied to medical image classification. With transfer learning, a model can reuse the knowledge of pre-trained models (e.g., VGG16 [10], AlexNet [11], DenseNet [12], etc.) that were trained on a huge dataset, which lessens the amount of data required for the problem at hand. The pre-trained models are often used as feature extractors, capturing image features from abstract to more detailed levels. Transfer learning with pre-trained models has shown promising results in various medical diagnosis tasks, such as chest X-ray analysis for identifying pneumonia and COVID-19 patients [13], retina image analysis for blindness classification, and MRI analysis for brain tumor classification. Deep learning models leveraging CNNs are widely used to classify breast cancers. We now discuss some of the promising studies that have been proposed using CNNs.

The authors in [14] proposed a framework leveraging a deep learning architecture that learns features automatically from mammography images in order to identify cancer. The framework was tested on the BCDR-FM dataset. Although they reported improved results, they did not compare against pre-trained models. The authors in [15] used AlexNet as a feature extractor for mass diagnosis in mammography images, with a Support Vector Machine (SVM) applied as the classifier on the features AlexNet generates. The proposed model outperformed the analytical feature-extraction method. In our approach, we consider eight different pre-trained models and report their performance on ultrasound images. The authors in [16] adopted a transfer learning approach using the GoogleNet [17] and AlexNet pre-trained models together with several preprocessing techniques. The model was applied to mammogram images in which the cancers were already segmented, and the authors claim it outperforms methods involving human intervention. The authors in [18] proposed a convolutional neural network leveraging the Inception-v3 pre-trained model to classify breast cancer from breast ultrasound images. The model supports multiview feature extraction. Trained on only 316 images, it achieved 0.9468 AUC, 0.886 sensitivity, and 0.876 specificity.

The authors in [19] developed an ensembled CNN model leveraging the VGG19 and ResNet152 pre-trained models with fine tuning. They considered a dataset managed by JABTS containing 1536 breast masses, including 897 malignant and 639 benign cases. The model achieved 0.951 AUC, 90.9% sensitivity, and 87.0% specificity. The authors in [20] developed another ensemble-based computer-aided diagnosis (CAD) system combining the VGGNet, ResNet, and DenseNet pre-trained models. They used a private database of 1687 images, including 953 benign and 734 malignant cases, on which the model achieved 91.0% accuracy and a 0.9697 AUC score. The model was also tested on the BUSI dataset, where it achieved 94.62% accuracy and a 0.9711 AUC score. The authors in [21] implemented two approaches, (1) a CNN and (2) transfer learning, to classify breast cancer from a combination of two datasets, one containing 780 images and the other 163 images. The model performed better when traditional and generative adversarial network augmentation techniques were combined. In the transfer learning approach, the authors compared four pre-trained models, namely VGG16, Inception [22], ResNet, and NASNet [23]; on the combined dataset, NASNet achieved the highest accuracy of 99%. The authors in [24] compared three CNN-based transfer learning models, ResNet50, Xception, and InceptionV3, and proposed a base model consisting of three convolutional layers to classify breast cancers from a breast ultrasound image dataset of 2058 images, including 1370 benign and 688 malignant cases. According to their analysis, InceptionV3 showed the best accuracy of 85.13% with an AUC score of 0.91. The authors in [25] analyzed four pre-trained models, VGG16, VGG19, InceptionV3, and ResNet50, on a dataset of 5000 breast images comprising 2500 benign and 2500 malignant cases; the InceptionV3 model achieved the highest AUC of 0.905.

The authors in [26] proposed a CNN model for breast cancer classification that considers local and frequency-domain information in histopathological images. The objective is to exploit the important image information carried by the local and frequency domains, which can sometimes improve a model's accuracy. Applied to the BreakHis dataset, the model obtained 94.94% accuracy.

The authors in [27] proposed a novel deep neural network combining a clustering method and a CNN for breast cancer classification using histopathological images. The model is based on a CNN, a Long Short-Term Memory (LSTM) network, and a mixture of the CNN and LSTM models, with both Softmax and SVM applied at the classifier layer. The model achieved 91% accuracy.

From the above discussion, it is evident that researchers are still searching for a better model to classify breast cancers. To overcome the scarcity of datasets, this research combines two publicly available ultrasound image datasets. Eight different pre-trained models are then fine-tuned and applied to the combined dataset to observe their breast cancer classification performance. However, the pre-trained models did not achieve the expected results, so we also develop a shallow CNN-based model. The proposed model outperforms all the fine-tuned pre-trained models on every performance metric and is also faster to train. We employ several evaluation techniques to substantiate the better outcome of the proposed model. The details of the methods, evaluation results, and discussion are presented in the following sections.

The paper is organized as follows: Sect. 2 discusses the materials and methods used for breast cancer classification. Section 3 proposes the custom CNN model. Section 4 describes the performance metrics. Section 5 discusses the evaluation results of the pre-trained models and the proposed custom model. Finally, the paper concludes in Sect. 6.

2 Materials and methods

In this research, we consider two publicly available breast ultrasound image datasets [28, 29]. The two datasets are combined mainly for two reasons: (1) to increase the size of the training set in order to avoid overfitting and bias, and (2) to cover three classes (benign, malignant, and normal). Combining the datasets also improves the reliability of the model. The dataset in [28] contains 250 images in two categories, malignant and benign. The images vary in size, with the smallest at 57 × 75 pixels and the largest at 61 × 199 pixels, in both grayscale and RGB color; therefore, all the images are transformed into grayscale to fit the model. The dataset in [29] contains 780 images in three categories: malignant, benign, and normal. The average image size is 500 × 500 pixels. These breast ultrasound images were collected from 600 women in 2018, aged between 25 and 75 years. Table 1 shows the class distribution of the images in the two datasets, and Fig. 1 demonstrates example ultrasound images of the different cases in the two datasets.
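As a rough illustration of this step, the following is a minimal Python sketch of the kind of preprocessing described above (grayscale conversion and resizing). It is not the authors' code; the folder path, file format, target size, and use of PIL are our assumptions.

```python
# Hypothetical preprocessing sketch: unify color mode and size across the
# two datasets before feeding the images to a model.
import os
from PIL import Image

def load_and_standardize(path, size=(227, 227)):
    """Load one ultrasound image, convert it to grayscale, and resize it."""
    img = Image.open(path).convert("L")        # "L" = single-channel grayscale
    img = img.resize(size, Image.BILINEAR)     # common input size for the model
    return img

folder = "ultrasound_images"                   # hypothetical dataset folder
images = [load_and_standardize(os.path.join(folder, f))
          for f in os.listdir(folder) if f.lower().endswith(".png")]
```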

Table 1 Datasets and distribution of classes
Fig. 1

Sample breast ultrasound images of the different cases in the two datasets

Data normalization is an important pre-processing step before feeding data into a model for training; with proper pre-processing, the data features become easier for the model to interpret, whereas its absence makes training slow and unstable. Generally, standardization and normalization techniques are used to scale data, where normalization rescales the data values to the range 0 to 1. Since the datasets considered in this research contain both grayscale and color images, the pixel values lie between 0 and 255. We adopt a zero-centering approach that shifts each feature so that its mean becomes zero and, as defined below, also scales it by its standard deviation. Assume a dataset D consisting of N samples and M features, where D[:, i] denotes the ith feature and D[j, :] denotes sample j. The equation below defines zero-centering.

$$\hat{D}[:, i] = \frac{D[:, i] - \mu_{i}}{\sigma_{i}}, \quad \text{where} \quad \mu_{i} = \frac{1}{N}\sum_{k = 1}^{N} D[k, i] \quad \text{and} \quad \sigma_{i} = \sqrt{\frac{1}{N - 1}\sum_{k = 1}^{N} \left( D[k, i] - \mu_{i} \right)^{2}}$$
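As a quick illustration (not the authors' code), the equation translates directly into a few lines of NumPy; note that `ddof=1` reproduces the N − 1 denominator:

```python
import numpy as np

def zero_center(D):
    """Per-feature zero-centering and scaling, as in the equation above."""
    mu = D.mean(axis=0)               # mu_i: per-feature mean
    sigma = D.std(axis=0, ddof=1)     # sigma_i: per-feature std (N - 1 denominator)
    return (D - mu) / sigma

# Example: pixel intensities in [0, 255] become zero-mean features.
D = np.random.randint(0, 256, size=(100, 50)).astype(float)
print(zero_center(D).mean(axis=0).round(6))   # ~0 for every feature
```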

In this research, we employ k-fold (k = 5) cross-validation on the dataset to mitigate overfitting during model training. In k-fold cross-validation, the data are partitioned into k folds of equal size; each fold in turn validates the model while the remaining k − 1 folds are used for training. This ensures that the model produces reliable accuracy estimates. Cross-validation is a widely used resampling mechanism for evaluating machine learning models when the sample size is small; it approximates the skill of a model on data it has not seen previously, and the resulting estimate is normally less biased and less optimistic than that of a single train/test split. Table 2 shows how fivefold cross-validation generates five different splits of the ultrasound images from the two datasets.
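A minimal scikit-learn sketch of this protocol (illustrative only; the authors' actual splits are summarized in Table 2):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(250 + 780)   # indices of the 1030 combined images
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # four folds train the model, the held-out fold validates it
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```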

Table 2 Data distribution in the five folds after applying fivefold cross-validation

2.1 Fine-tuned pre-trained CNN models and the proposed custom model

Over the last few years, transfer learning algorithms have been widely used across machine learning research problems; they concentrate on preserving the knowledge acquired while solving one problem and employing it in a different but related problem. For example, an algorithm trained to recognize dogs can be adapted to recognize horses. The authors in [30] formally define transfer learning in terms of domain and task as follows:

Let D = {X, P(X)} be an arbitrary domain, where X denotes a feature space with instances {x1, x2, …, xn} and P(X) the probability distribution over X. A task T on D is defined as T = {Y, F(·)}, where Y = {y1, y2, …, yn} denotes the label space of X and F(·) denotes a predictive function learned from pairs {xi, yi} ∈ {X, Y}. F(·) is then applied to predict the label F(x′) of an unseen instance x′.

Hence, transfer learning is defined as follows: consider two domain–task pairs (DA, TA) and (DB, TB), where DA is a source domain and DB a target domain, and TA and TB are their respective learning tasks. The goal of a transfer learning technique is to enhance the learning of the predictive function FB(·) in DB by applying the knowledge learned from DA and TA, where DA ≠ DB or TA ≠ TB.

Transfer learning algorithms are typically used when only a small dataset is available to train a custom model but the goal is still an accurate model. A custom model employing transfer learning applies the knowledge of pre-trained models that were trained over a huge dataset for a long time. There are mainly two approaches to transfer learning: (i) developing a model from scratch and (ii) using pre-trained models; the pre-trained model approach is the one widely used in deep learning. Considering the importance of pre-trained models as feature extractors, this research implements eight pre-trained models using the weights of their convolutional layers; these weights act as feature extractors for classifying breast cancers in the ultrasound images. Table 3 shows the pre-trained models considered in this research. All of them are built on convolutional neural networks and were trained on the ImageNet database [31], which contains over a million images, to classify 1000 object categories (mouse, keyboard, pencil, many animals, etc.). The models have therefore learned rich feature representations from a large number of images.

Table 3 Pre-trained models and their image input size

From Table 3, we see that the models use different input sizes; the images in the dataset are therefore resized accordingly before being fed into the models. The pre-trained models are fine-tuned at their top layers: since the dataset contains three classes (normal, malignant, and benign), the last three layers of each model are replaced with (i) a fully connected layer, (ii) a Softmax activation layer, and (iii) a custom three-class classifier.
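As a rough illustration of this recipe, the following Keras sketch loads one of the pre-trained backbones and attaches a new three-class head. It is not the authors' implementation; the pooling layer, the size of the fully connected layer, and the frozen backbone are our assumptions.

```python
import tensorflow as tf

# Load a pre-trained backbone without its original ImageNet classifier.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # keep the convolutional weights as feature extractors

# Replace the top layers: fully connected layer -> Softmax over 3 classes.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),    # assumed FC layer size
    tf.keras.layers.Dense(3, activation="softmax"),   # benign / malignant / normal
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```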

We considered three different optimizers to train the models and determine which produces the best results. A brief description of each optimizer is given below.

Stochastic Gradient Descent with Momentum (SGDM) is the fundamental optimizer used for the convergence of neural networks, i.e., for moving in the direction that minimizes the cost function. The basic gradient-descent update of the network parameters, using the gradient \(\nabla\) of the cost function, is given below; SGDM additionally accumulates a momentum term over past gradients to damp oscillations.

$$\theta = \theta - \mu \nabla_{\theta} J\left( \theta ; x, y \right)$$

here \(\theta\) denotes a parameter (weights and biases), \(\mu\) the learning rate, \(\nabla\) the gradient, and \(J\) the cost function. Root Mean Square Propagation (RMSprop), proposed by Geoffrey Hinton, attempts to reduce these oscillations. It adjusts the learning rate automatically and, in addition, selects a different learning rate for each parameter. In RMSprop, an update is performed according to the equations below.

For each parameter \(w_{j}\)

$$v_{t} = \rho v_{t - 1} + \left( 1 - \rho \right) g_{t}^{2}, \quad \Delta w_{t} = -\frac{\mu}{\sqrt{v_{t} + \varepsilon}}\, g_{t}, \quad w_{t + 1} = w_{t} + \Delta w_{t}$$

here \(\mu\) is the initial learning rate, \(v_{t}\) the exponential average of squared gradients, and \(g_{t}\) the gradient at time t along \(w_{j}\). The Adam optimizer combines the heuristics of momentum and RMSprop; its update equations are given below.

For each parameter \(w_{j}\)

$$\begin{aligned} & v_{t} = \beta_{1} v_{t - 1} + \left( 1 - \beta_{1} \right) g_{t}, \quad s_{t} = \beta_{2} s_{t - 1} + \left( 1 - \beta_{2} \right) g_{t}^{2}, \\ & \Delta w_{t} = -\mu \frac{v_{t}}{\sqrt{s_{t} + \varepsilon}}, \quad w_{t + 1} = w_{t} + \Delta w_{t} \end{aligned}$$

here \(\mu\) is the initial learning rate, \(v_{t}\) the exponential average of gradients along \(w_{j}\), \(g_{t}\) the gradient at time t along \(w_{j}\), \(s_{t}\) the exponential average of squared gradients along \(w_{j}\), and \(\beta_{1}, \beta_{2}\) are hyperparameters.
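For reference, the three optimizers map directly onto standard Keras configurations; the hyperparameter values below are illustrative defaults rather than the authors' settings (the initial learning rate of 1e−4 follows Sect. 3):

```python
import tensorflow as tf

sgdm = tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9)     # SGDM
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=1e-4, rho=0.9)   # rho is the decay term above
adam = tf.keras.optimizers.Adam(learning_rate=1e-4,
                                beta_1=0.9, beta_2=0.999)            # beta_1, beta_2 as above
```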

The fine-tuned pre-trained models use the Softmax activation function to convert the class scores of an input image into probabilities between 0 and 1. Using a Softmax activation at the end of a CNN to turn its output scores into a normalized probability distribution is a very common practice. The Softmax function is defined by the following equation:

$$\sigma \left( \vec{z} \right)_{i} = \frac{e^{z_{i}}}{\sum_{j = 1}^{k} e^{z_{j}}},$$

where \(\vec{z}\) is the input vector, \(z_{i}\) are its elements, k is the number of classes, and \(\sum_{j = 1}^{k} e^{z_{j}}\) is the normalization term that makes the outputs sum to 1.
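A numerically stable NumPy version of this function (illustrative only; subtracting max(z) leaves the result unchanged but prevents overflow in the exponential):

```python
import numpy as np

def softmax(z):
    """Softmax as defined above, in a numerically stable form."""
    e = np.exp(z - np.max(z))      # shift by max(z) to avoid overflow
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # -> [0.659, 0.242, 0.099], sums to 1
```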

3 Proposed custom model

Figure 2 shows the proposed custom model. The model consists of one convolutional layer with 20 filters of 5 × 3 × 3, stride [1 1], and padding [0 0 0 0]. The input image size is 227 × 227 × 3, and the images are normalized with the zero-centering approach. The model applies batch normalization with 20 channels and also contains one max-pooling layer and one fully connected layer, with dropout regularization added after the fully connected layer. Finally, a Softmax activation function is applied, since the model must classify three classes. An initial learning rate of 1e−4 and a mini-batch size of 8 are used during training. The model is trained with the same three optimizers as the pre-trained models, and is trained and validated using the configuration of Table 4.
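A minimal Keras sketch of this shallow architecture is given below. It is an approximation, not the authors' implementation: the kernel size (the paper lists filters of 5 × 3 × 3), the activation, the pooling size, the dropout rate, and the exact position of dropout relative to the Softmax are assumptions.

```python
import tensorflow as tf

custom_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(227, 227, 3)),       # input size from the paper
    tf.keras.layers.Conv2D(20, kernel_size=3,
                           strides=1, padding="valid"),  # 20 filters, stride [1 1], no padding
    tf.keras.layers.BatchNormalization(),              # batch norm over 20 channels
    tf.keras.layers.ReLU(),
    tf.keras.layers.MaxPooling2D(pool_size=2),          # the single max-pooling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                       # dropout regularization (rate assumed)
    tf.keras.layers.Dense(3, activation="softmax"),     # single FC layer + Softmax, 3 classes
])
custom_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # initial LR from the paper
    loss="categorical_crossentropy",
    metrics=["accuracy"])
# Training would use mini-batches of 8, e.g.:
# custom_model.fit(x_train, y_train, batch_size=8, epochs=...)
```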

Fig. 2

The architecture of the custom model

Table 4 Training parameters of the pre-trained models

4 Performance measure

The performance of the fine-tuned pre-trained models is evaluated with several standard performance metrics: accuracy (ACC), Area Under the Curve (AUC), precision, recall, sensitivity, specificity, and F1-score. A confusion matrix is also generated for each model to observe the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts for the normal, malignant, and benign cases. The TP score (e.g., for malignant) represents how often the model correctly classifies real malignant cases as malignant, while the FP score represents how often it wrongly classifies benign cases as malignant. Similarly, the TN score (e.g., for benign) represents how often the model correctly classifies benign cases as benign, and the FN score represents how often it wrongly classifies malignant cases as benign.

Precision indicates the proportion of cases classified as malignant, benign, or normal that truly belong to that class. Sensitivity, or recall, is the proportion of cases of a class (e.g., malignant) that the model correctly identifies, while specificity is the percentage of negative cases (e.g., benign) that the model classifies correctly. The F1-score combines precision and recall into a single score via their harmonic mean. The formulas of the metrics are given below.

$$\begin{aligned} & {\text{ACC}} = \frac{{\text{TP}} + {\text{TN}}}{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}, \quad {\text{recall}} = \frac{\text{TP}}{{\text{TP}} + {\text{FN}}}, \quad {\text{precision}} = \frac{\text{TP}}{{\text{TP}} + {\text{FP}}} \\ & {\text{specificity}} = \frac{\text{TN}}{{\text{TN}} + {\text{FP}}}, \quad F1 = 2 \times \frac{{\text{precision}} \times {\text{recall}}}{{\text{precision}} + {\text{recall}}} \end{aligned}$$
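These per-class (one-vs-rest) quantities can be read off a multi-class confusion matrix; the following scikit-learn sketch uses a small hypothetical example, not the paper's results:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["benign", "malignant", "normal"]
y_true = ["benign", "malignant", "normal", "benign", "malignant", "benign"]
y_pred = ["benign", "malignant", "normal", "malignant", "malignant", "benign"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

i = labels.index("malignant")      # one-vs-rest scores for the malignant class
TP = cm[i, i]
FP = cm[:, i].sum() - TP
FN = cm[i, :].sum() - TP
TN = cm.sum() - TP - FP - FN

precision = TP / (TP + FP)
recall = TP / (TP + FN)                               # sensitivity
specificity = TN / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / cm.sum()
print(precision, recall, specificity, f1, accuracy)
```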

5 Results and discussion

The performance scores of the fine-tuned pre-trained models and of the custom model are shown in Table 5. Among the pre-trained models, ResNet50 demonstrates the highest accuracy score of 92.04 with the Adam optimizer and the highest precision score of 92.73 with the RMSprop optimizer. VGG16 achieves the highest AUC score of 0.972 with Adam. GoogleNet achieves the highest sensitivity/recall score of 93.85 with RMSprop, and ResNet18 achieves the highest specificity score of 92.29. Compared with all the pre-trained models, the proposed custom model performs best on every evaluation metric, obtaining 100% accuracy and a 1.0 AUC score with the Adam optimizer; it also shows better results on the other metrics. Table 6 summarizes the best scores of the models on the different evaluation metrics and compares them with the proposed CNN model.

Table 5 Evaluation results of the pre-trained models and the custom model
Table 6 Comparison results of the best pre-trained models and the custom model

Figure 3 shows the confusion matrices of the proposed custom model and of the best pre-trained models as identified in Table 6. From the confusion matrix of the custom model, we observe high scores on all breast cancer classes: with the Adam optimizer, the model classifies 100% of the benign, 100% of the malignant, and 100% of the normal cases correctly, outperforming the pre-trained models. Table 7 shows the classification results of the models.

Fig. 3

Confusion matrix of different pre-trained models and the custom model

Table 7 Comparison of classification results (the custom model (Adam optimizer) and best pre-trained models)

Table 8 compares the custom model with the pre-trained models. The custom model outperforms all the pre-trained models in accuracy, prediction time, and number of parameters. It also trains much faster than all the fine-tuned pre-trained models, because it has only one fully connected layer and requires a very small number of trainable parameters compared with the other models. All the models are trained on a GPU (NVIDIA® GeForce GTX 1660 Ti with Max-Q design and 6 GB RAM) with a mini-batch size of 8. Figure 4 shows the execution time and accuracy score of each model; to obtain accurate timings, we ran the code four times. The area of each marker in Fig. 4 indicates the number of parameters in the network, and prediction times are reported relative to the fastest network. From the plot, it is evident that the custom model is fast in training and produces higher accuracy than the other pre-trained models. Figure 5 shows the accuracy and loss values as the custom model is trained and validated; the graph confirms the very high accuracy claimed in Table 8.

Table 8 Comparison of the pre-trained models and the custom model (accuracy, prediction time, and parameters)
Fig. 4

Performance comparison of different pre-trained models and the custom model

Fig. 5

Accuracy and loss of the custom model during the training phase

5.1 Heat map visualization

The custom model's performance is also evaluated by generating heat-map visualizations with the Grad-CAM tool [32] to see how the model identifies the region of interest and how well it distinguishes cancer classes. Grad-CAM is used to judge whether a model attends to the key areas of an image when making a prediction: it visualizes, as a heatmap for a given class label, the portion of the image the model focuses on. Figure 6 shows a sample Grad-CAM output for benign and malignant classes together with the prediction probabilities. From the output, we observe that the model focuses precisely on the key areas of the images when classifying cancers.
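For reference, Grad-CAM can be reproduced in a few lines of Keras; the sketch below follows the standard formulation in [32], where `model` and `last_conv_name` are placeholders for a trained network and the name of its final convolutional layer (this is illustrative, not the authors' pipeline).

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_idx):
    """Return a [0, 1] heatmap of the regions driving the class_idx score."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)            # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))      # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                          # keep only positive influence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```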

Fig. 6

Heat-map visualization of sample images using the custom model

6 Conclusion

This study implemented eight fine-tuned pre-trained CNN models, leveraging transfer learning, to observe their breast cancer classification performance on ultrasound images combined from two different datasets. We evaluated the fine-tuned models with the Adam, RMSprop, and SGDM optimizers; the highest accuracy of 92.4% was achieved by ResNet50 with the Adam optimizer, and the highest AUC score of 0.97 by VGG16. Because the pre-trained models did not show the expected results, and because they have many convolutional layers and long training times, we also proposed a shallow custom model consisting of only one convolutional layer as the feature extractor. The custom model achieved 100% accuracy and a 1.0 AUC value, trains faster than any other model, and needs only a small number of trainable parameters. Our future plan is to validate the model on other datasets containing new ultrasound images.