1 Introduction

Melanoma is the skin cancer responsible for 75% of all deaths from cutaneous cancers. According to the World Health Organization, there are an estimated 160,000 new cases of melanoma and 48,000 deaths per year. Dermatologists perform a biopsy to confirm whether or not a melanoma is present. Since a biopsy is an invasive procedure, the process becomes considerably more complex when the patient has a large number of suspicious moles. In addition, the detection of melanoma depends heavily on the training of the dermatologist. It is therefore important to have a tool that can detect the presence of a melanoma without invading the human body and whose classification effectiveness is superior to that of the average dermatologist, reducing the subjectivity of human vision.

In recent years the use of automatic processes in image classification has increased notably, and melanoma classification is no exception. One of the first approaches was the ABCD rule [1], a score computed from a combination of characteristics of the mole: asymmetry, border, color and diameter. Depending on the score obtained, the lesion is classified. Later, the Menzies method [2] defined one set of features associated with benign moles and another strictly associated with melanomas; depending on the features found in the lesion, it is considered malignant or benign. Another technique is the seven-point checklist (ELM7) [3], in which seven fundamental characteristics of the mole are evaluated. For the segmentation of the lesion, the Otsu method [4] was mainly used. Based on these three rules and the Otsu method, different automatic melanoma detection systems were developed, as can be seen in Table 1. Neural networks were used at first, and research then migrated to other data mining techniques. However, from 2012/2013 onwards, with the increase in hardware speed, neural networks began to be used again, this time with more layers. Deep convolutional networks then emerged, with very good performance in image classification; today they define the state of the art in this field. Classification and segmentation were generally addressed in separate publications. Table 2 lists the most important works in this field.

The paper is organized as follows. Section 2 describes the methodology. Section 3 presents the experimental setup and results, and finally concluding remarks are given in Sect. 4.

Table 1. Previous background to the use of deep learning methods
Table 2. Reference works using deep learning methods

2 Model Description

In the present work, images from the Buenos Aires Italian Hospital¹, the ISIC Archive² and Dermnet³ were used. We classify melanoma automatically with convolutional neural networks (CNNs), introduced by LeCun et al. [18]. Even though CNNs perform very well in image classification, they have the disadvantage of lacking invariance to transformations: rotations, changes of scale and changes of orientation. One way to mitigate this problem is to combine CNN features (local descriptors) with Fisher vectors [19]. In this work we propose an ensemble of the hybrid system introduced by Yu et al. [17] with a CNN network to improve the performance of the resulting classifier. Our model includes image preprocessing, segmentation and classification modules. In the preprocessing module the images were rescaled and Max-Constancy was applied [20]. Max-Constancy is a technique used to filter out the effects of the light source, which can distort the image (similar to the correction that the human visual system performs automatically). The segmentation was performed with the U-NET network [12]. To train the segmentation network, only the images from the ISIC Archive were used, since they come with the corresponding masks. The classification was solved by means of the VGG-16 network [21] trained on ImageNet⁴. In addition, heat maps were visualized on the classified images to evaluate the performance of the VGG-16 network during the training phase. The heat maps highlight the areas that the CNN uses to perform the classification, and are obtained with GRAD-CAM [22].
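
The name Max-Constancy suggests the Max-RGB (white-patch) color constancy algorithm, which rescales each channel so that its brightest value maps to full intensity; whether this matches the exact method of [20] is an assumption. A minimal sketch:

```python
import numpy as np

def max_rgb_constancy(image: np.ndarray) -> np.ndarray:
    """White-patch (Max-RGB) color constancy: divide each channel by its
    maximum, so the brightest pixel is treated as the illuminant color
    and the color cast of the light source is removed.

    `image` is an H x W x 3 float array with values in [0, 1].
    """
    illuminant = image.reshape(-1, 3).max(axis=0)   # per-channel illuminant estimate
    illuminant = np.clip(illuminant, 1e-6, None)    # avoid division by zero
    return np.clip(image / illuminant, 0.0, 1.0)
```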

For the CNN model, once the dataset was created, data augmentation was carried out to increase the number of images in the training set: the images were rotated by 0°, 90° and 180° and simultaneously zoomed, producing several versions of each image.
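
The following sketch illustrates this augmentation under stated assumptions: fixed rotations of 0°, 90° and 180° combined with a center-crop zoom, whose factor of 1.2 is a placeholder (the paper does not report it).

```python
import numpy as np
from PIL import Image

def augment(image: np.ndarray, zoom: float = 1.2) -> list:
    """Return rotated-and-zoomed versions of an image (uint8, H x W x 3):
    for each rotation (0, 90, 180 degrees), center-crop to 1/zoom of the
    size and resize back, which zooms in."""
    versions = []
    for k in (0, 1, 2):                       # quarter turns: 0, 90, 180 degrees
        rotated = np.rot90(image, k)
        rh, rw = rotated.shape[:2]
        ch, cw = int(rh / zoom), int(rw / zoom)
        top, left = (rh - ch) // 2, (rw - cw) // 2
        cropped = rotated[top:top + ch, left:left + cw]
        versions.append(np.array(Image.fromarray(cropped).resize((rw, rh))))
    return versions
```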

Fine-tuning was performed for both the CNN model and the hybrid model. First, the fully-connected layers were removed and the images were passed through the convolutional layers; the output of this stage is referred to as the "deep features". With these descriptors, a mini-network formed by the last block of the CNN and the fully-connected layers was trained. For the hybrid model, on the other hand, the CNN descriptors were encoded as Fisher vectors, which are the inputs of an SVM that is trained to produce the hybrid classifier. The final classifier is an ensemble of the two hybrid models and a CNN model (VGG-16): the final class probability is estimated as the average of the probabilities provided by each model.
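
A minimal fine-tuning sketch in Keras, assuming 224 x 224 RGB inputs; the size of the dense head and the optimizer settings are assumptions, since the paper only states that the last block and the fully-connected layers are retrained.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze everything except the last convolutional block (block5).
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

# The output of the convolutional stack plays the role of the "deep features".
x = layers.Flatten()(base.output)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(1, activation="sigmoid")(x)    # melanoma probability

model = models.Model(base.input, out)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```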

Classifier 1 is the simplest model. It is composed of a VGG-16 network pre-trained on ImageNet, in which the weights of the last block of the network and of the fully-connected layers are retrained on the training and validation datasets. The hybrid model [17] (Classifier 2) takes as input the deep features, whose dimensionality is reduced by Principal Component Analysis (PCA). A Gaussian mixture model (GMM) is first learned from images sampled from the training set, and the Fisher vector representation is then calculated for each image. The numbers of PCA and GMM components are chosen to maximize the area under the ROC curve on the validation dataset. Note that the input of this classifier corresponds to the descriptors produced by passing the images through the last convolutional layer of Classifier 1. This model is shown in Fig. 1. Classifier 3 (see the top chart in Fig. 3) takes as input the descriptors of the segmented images, for which we train a U-NET network to segment the lesions. Figure 2 shows an example of the application of the segmenter to an image; the grayscale area is not considered part of the lesion by the segmenter. Note that Classifier 3 is analogous to Classifier 2, but its descriptors are obtained from the segmented images instead of the original ones. Our proposed model, the ensemble of the three classifiers, can be seen at the bottom of Fig. 3.
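
The sketch below outlines the Fisher vector pipeline of Classifiers 2 and 3 under stated assumptions: `deep_features` (one array of local CNN descriptors per image) and `labels` are placeholders, and the PCA/GMM component counts are arbitrary, since the paper tunes them on the validation set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def fisher_vector(descriptors, gmm):
    """Improved Fisher vector of local descriptors (N x D) under a
    diagonal-covariance GMM: gradients of the log-likelihood w.r.t. the
    means and variances, power- and L2-normalized."""
    n = descriptors.shape[0]
    gamma = gmm.predict_proba(descriptors)                  # N x K posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (descriptors[:, None, :] - mu) / np.sqrt(var)    # N x K x D
    d_mu = (gamma[:, :, None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    d_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])
    fv = np.hstack([d_mu.ravel(), d_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                  # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                # L2 normalization

pca = PCA(n_components=64).fit(np.vstack(deep_features))
reduced = [pca.transform(f) for f in deep_features]
gmm = GaussianMixture(n_components=16, covariance_type="diag").fit(np.vstack(reduced))
X = np.array([fisher_vector(f, gmm) for f in reduced])
svm = SVC(kernel="linear", probability=True).fit(X, labels)
```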

Fig. 1. Scheme of hybrid models

Fig. 2. Segmented melanoma with a gray scale outside the lesion

Fig. 3. Hybrid model and the ensemble of three classifiers

Finally, the ensemble of classifiers is applied to the testing dataset, which contains both the original images and their segmented versions. The segmentation of the testing images is needed to obtain the input of Classifier 3, while the inputs of Classifiers 1 and 2 are the original images.
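
Since the final probability is a plain average (Sect. 2), the ensemble step reduces to a few lines; `p_cnn`, `p_hybrid` and `p_hybrid_seg` are assumed to hold the melanoma probabilities produced by Classifiers 1, 2 and 3.

```python
import numpy as np

def ensemble_predict(p_cnn, p_hybrid, p_hybrid_seg, threshold=0.5):
    """Average the class probabilities of the three classifiers and
    threshold the result to obtain the predicted label."""
    p = (np.asarray(p_cnn) + np.asarray(p_hybrid) + np.asarray(p_hybrid_seg)) / 3.0
    return p, (p >= threshold).astype(int)    # probability and 0/1 prediction
```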

3 Results

Several testing datasets were generated, and the proposed model was applied to each of them. The original testing dataset, which we will call the ISIC testing dataset, is composed of 100 images: 50 melanomas and 50 benign moles. The XY testing dataset is obtained by translating the original images along the X and Y axes, yielding 250 melanomas and 250 benign moles. The ROT testing dataset is obtained by rotating and reflecting the original images, also yielding 250 melanomas and 250 benign moles. Finally, we have the testing dataset of the Italian Hospital (HI), whose images are of lower quality than those used in training and in the ISIC, ROT and XY testing datasets. As mentioned in Sect. 2, Max-Constancy was applied to all the images. Experiments were executed on a machine with a Core i7-6700HQ processor, 16 GB of memory and a GTX 950M GPU.
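
As an illustration, the XY versions can be generated as sketched below; the shift magnitudes are assumptions, chosen only so that each of the 100 originals yields five versions (500 test images in total), since the paper does not report them.

```python
from scipy.ndimage import shift

def make_xy_versions(image, offsets=((0, 0), (-20, 0), (20, 0), (0, -20), (0, 20))):
    """Translate an RGB image (H x W x 3) along the Y and X axes.
    mode="nearest" fills the uncovered border by edge replication."""
    return [shift(image, (dy, dx, 0), mode="nearest") for dy, dx in offsets]
```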

The parameters that achieve the best performance for each model are shown in Table 3.

Table 3. Training parameters for each model.

It should be noted that in the training of the VGG-16 network, the learning rate is halved if the ROC value does not improve over 5 epochs. Binary cross-entropy is used as the loss function, while in the U-NET network the Jaccard coefficient is used to measure the error between the predicted mask and the ground-truth mask.
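
Both choices map directly onto standard Keras components, as sketched below; monitoring the validation AUC is an assumption consistent with the description above.

```python
import tensorflow as tf
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate when the monitored value plateaus for 5 epochs.
lr_schedule = ReduceLROnPlateau(monitor="val_auc", mode="max",
                                factor=0.5, patience=5, verbose=1)

def jaccard_coefficient(y_true, y_pred, smooth=1.0):
    """Jaccard (intersection over union) between predicted and real masks."""
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection
    return (intersection + smooth) / (union + smooth)

def jaccard_loss(y_true, y_pred):
    # Minimizing 1 - Jaccard maximizes the overlap between the masks.
    return 1.0 - jaccard_coefficient(y_true, y_pred)
```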

Fig. 4. (a) PR curves on the testing datasets (Ensemble model); (b) PR curves on the testing datasets (Classifier 1); (c) PR curves on the testing datasets (Classifier 2).

Although the hybrid model alone yields the lowest performance, the best performance is obtained when it is combined with Classifier 1 in the ensemble. This can be seen by comparing Fig. 4(a) with Figs. 4(b) and (c), which show Classifier 1 and the hybrid model without segmentation (Classifier 2), respectively.
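
Curves like those in Fig. 4 can be computed as follows, assuming `y_true` holds the 0/1 labels and `y_score` the melanoma probabilities of one model on one testing dataset.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, precision_recall_curve

precision, recall, _ = precision_recall_curve(y_true, y_score)
plt.plot(recall, precision, label="AUC = %.3f" % auc(recall, precision))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```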

4 Conclusions

In this work we apply data mining and deep learning techniques to classify images of moles as 'Melanoma' or 'Non-Melanoma'. An ensemble of three classifiers is used for this purpose, combining a convolutional neural network (VGG-16), Fisher vector encoding, and image segmentation by means of a U-NET network. We found that when the testing dataset is composed of images generated by translating the original images along the X and Y axes, the performance of the CNN decreases more rapidly than that of the hybrid model. This is because the data augmentation performed on the training dataset does not include translations. We conclude that the CNN is less invariant than the hybrid model to this type of operation when it is not covered by the data augmentation of the training set.