1 Introduction

Malignant skin diseases take thousands of lives around the world. For example, in 2016 in the United States, 83510 new cases of skin cancer were diagnosed, and 13650 people died [11]. The detection of this cancer is performed by clinical analysis, and the best clinical method used is ABCD [3, 7]. This method analyzes the morphology of the lesion and its evolution. However, it requires a manual procedure and a high level of proficiency. To address this problem, some researchers have proposed computer-assisted methods based on statistics, pattern recognition, machine learning, and deep learning, among others [2].

According to the state of the art, some studies achieve good results in distinguishing malignant from benign lesions. However, this could be insufficient in real scenarios because diseases of different classes are correlated: it is common to find benign lesions that become malignant over time. Finding the subclass of a sample could give a specialist more information to make a successful diagnosis. In addition, the datasets analyzed consist of dermatoscopic images, and such samples are inaccessible to people who do not have dermatoscopes. In contrast, we have built a dataset of non-dermatoscopic images, that is, samples of skin lesions taken with a conventional camera.

Many methods have been used for skin disease detection, but currently the best results have been obtained with Convolutional Neural Networks (CNNs), as demonstrated in [1]. In this work, we use the VGG-19 CNN architecture as in [6], but we propose the use of Autoencoders (AEs) instead of fully connected networks. Further, we have tested this approach on a dataset with 3 classes and 11 sub-classes. The main contribution of our work is the use of AEs as a classification method to identify the kind of skin disease to which a sample belongs. As a result, samples are classified as benign, premalignant or malignant diseases.

This paper is organized as follows: Sect. 2 presents the concepts needed for the development of the proposal, specifically CNNs and AEs; Sect. 3 describes the datasets used; Sect. 4 presents the proposed method; and Sect. 5 shows the experimental results. Finally, Sect. 6 presents the conclusions of the paper.

2 Background

The methods for detecting skin diseases are based on feature extraction. There are two approaches: a clinical analysis method [3] based on the specialist’s experience, and a computer-aided method that uses Machine Learning to process samples [6, 9].

2.1 Convolutional Neural Network (CNN)

Originally, a CNN requires a lot of training to obtain good results, depending on the complexity of the training data. To reduce the time required and improve accuracy, some works, such as [4], use transfer learning to initialize the filters of the network. This helps the feature extraction performed by the convolutional layers. The fully connected layers, on the other hand, are reinitialized in order to fine-tune the CNN and set the number of classes.

2.2 Classification by Reconstruction

An autoencoder (AE) can be seen as a neural network that tries to reconstruct its input data; AEs are known as a class of unsupervised learning algorithms [5]. Unlike supervised algorithms, they do not need labels or class information. AEs have also been used as a method to pre-train a network and initialize its weights. Based on [8], this research introduces the use of AEs as a classification method.

3 Datasets

In this section, we present a new dataset of non-dermatoscopic images, built from different sources (Footnote 1). This dataset consists of 2360 unsegmented images of medium and high quality, divided into three main classes (benign, pre-malignant and malignant); each class is divided into subclasses, as shown in Table 1.

Table 1. Skin diseases dataset

The sub-classes considered in this dataset were selected because they are the most common, the most lethal, and the most easily confused with less severe lesions. This selection was made with the help of a specialist in dermatology and oncology, a professional at the National Institute of Neoplastic Diseases of Peru (INEN, Footnote 2). Finally, we picked 1554 samples for the final dataset, which was split into training (1169) and testing (385) samples.

In addition, we used 4 different datasets to validate our proposal: MNIST (Footnote 3), CIFAR-10 (Footnote 4), SVHN (Footnote 5) and the ISBI Challenge 2016 Dataset (Footnote 6).

4 Proposed Method

Our proposed method is based on the use of a Convolutional Neural Network (CNN) and Autoencoders (AEs). For the evaluation, we measure accuracy, precision, recall and the \(F_\beta\) metric.
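As a point of reference for the evaluation metrics named above, the following minimal sketch computes them from raw counts of a binary confusion matrix. The function names and the choice of returning 0 for undefined ratios are illustrative assumptions, not part of the original evaluation code.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Compute precision, recall and F-beta from raw counts.

    tp, fp, fn: true positives, false positives, false negatives.
    beta > 1 weights recall more heavily; beta < 1 favors precision.
    Undefined ratios (zero denominator) are reported as 0.0 (assumption).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)
```

In a diagnostic setting, a value of beta greater than 1 may be preferable, since it penalizes missed malignant cases (false negatives) more than false alarms.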

4.1 Data Preprocessing

Using a semi-automated process (Fig. 1), we first segment the images using thresholding techniques [9] available in the Python library scikit-image (Footnote 7), and then fine-tune the segmentation using a hand-crafted tool available on GitHub (Footnote 8). Then, we generate the image dataset at a resolution of 224 \(\times \) 224 pixels.

Fig. 1. Image segmentation process
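The automatic thresholding step of the segmentation process can be illustrated with a short sketch. The text does not state which thresholding technique was used, so Otsu's method is assumed here purely for illustration and is implemented with NumPy only (scikit-image offers an equivalent in `skimage.filters.threshold_otsu`). The function names, and the assumption that the lesion is darker than the surrounding skin, are hypothetical.

```python
import numpy as np


def otsu_threshold(gray, nbins=256):
    """Return the Otsu threshold of a grayscale image (illustrative choice)."""
    values = np.asarray(gray, dtype=float).ravel()
    hist, edges = np.histogram(values, bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    hist = hist.astype(float)
    # Pixel counts and mean intensities of the two candidate groups
    weight_low = np.cumsum(hist)
    weight_high = np.cumsum(hist[::-1])[::-1]
    mean_low = np.cumsum(hist * centers) / np.maximum(weight_low, 1e-12)
    mean_high = (np.cumsum((hist * centers)[::-1])
                 / np.maximum(weight_high[::-1], 1e-12))[::-1]
    # Between-class variance for every possible split position
    between = weight_low[:-1] * weight_high[1:] * (mean_low[:-1] - mean_high[1:]) ** 2
    return centers[np.argmax(between)]


def segment_lesion(gray):
    """Boolean mask of the lesion, assuming it is darker than the skin."""
    return np.asarray(gray, dtype=float) < otsu_threshold(gray)
```

A mask produced this way would still need the manual refinement described above, since a global threshold is sensitive to hair, shadows and uneven lighting.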

Then, we generate synthetic data to increase the number of samples. For this, through clinical analysis and specialist assistance (INEN), we picked the best samples of each subclass and rotated them by 0, 90, 180 and 270 degrees, increasing the training data by 33%. The testing data is left unchanged.
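The rotation step can be sketched as follows. This is a minimal illustration that assumes images are stored as NumPy arrays of shape (height, width, channels); the function name `rotate_augment` is hypothetical, and the selection of which samples to augment is not modeled.

```python
import numpy as np


def rotate_augment(image):
    """Return the image rotated by 0, 90, 180 and 270 degrees.

    Rotation is applied in the (height, width) plane, so the channel
    axis of a color image is left untouched.
    """
    return [np.rot90(image, k) for k in range(4)]
```

Because the images are square (224 \(\times \) 224), all four rotations keep the same shape and can be fed to the network without further resizing.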

4.2 Features Extractor and Classifier

We use the VGG19 network architecture [6], which consists of 19 layers, as shown in Fig. 2; this scheme forms the feature extractor that we use. According to [10], the most general-purpose representation for learning is obtained from the output of the last convolutional layer of the CNN. The original classifier is modified so that our network can classify three classes. The network weights are pre-trained on ImageNet (Footnote 9), and the CNN is fine-tuned to the target dataset by transfer learning.

Fig. 2. Convolutional Neural Network VGG-19

4.3 Classification with Autoencoders

Before using AEs as a classification method, we save the feature vectors from the first fully connected layer of the network, as shown in Fig. 2; each vector is the new representation of an image and consists of 4096 values. Then, we train our first AE [5], which we call global-AE. The global-AE is trained with the entire training dataset until the reconstruction error is minimized. Additionally, we generate n autoencoders (n-AEs) cloned from the global-AE, where n is the number of classes in the dataset (see Fig. 3).

Then, in the training phase, each training sample feeds only the AE associated with its class. As a result, each AE specializes in reconstructing the data of its own class.

In the test phase, each sample is passed through the n-AEs to generate one reconstruction error vector per sample, as shown in Fig. 3. Finally, the class of the sample is the one whose AE yields the minimum reconstruction error in that vector.

Fig. 3. Each AE consists of 5 layers (1 to feed, 2 for encoding and 2 for decoding)
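The decision rule of the test phase (one reconstruction error per class-specific model, then the minimum) can be sketched in a few lines. To keep the example self-contained, each 5-layer AE is replaced by a linear reconstructor based on PCA, and the global-AE cloning step is omitted; this is a stand-in assumed only for illustration, not the architecture used in this work. All names (`LinearReconstructor`, `fit_per_class`, `make_toy_features`) and the synthetic 16-dimensional features are hypothetical.

```python
import numpy as np


class LinearReconstructor:
    """PCA-based stand-in for a trained autoencoder (illustrative assumption)."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def reconstruct(self, X):
        codes = (X - self.mean_) @ self.components_.T   # encode
        return codes @ self.components_ + self.mean_    # decode


def fit_per_class(X, y, n_classes, n_components=2):
    """Train one reconstructor per class, using only that class's samples."""
    return [LinearReconstructor(n_components).fit(X[y == c])
            for c in range(n_classes)]


def reconstruction_errors(models, X):
    """Matrix of shape (n_samples, n_classes) with mean squared errors."""
    return np.stack(
        [np.mean((X - m.reconstruct(X)) ** 2, axis=1) for m in models], axis=1
    )


def classify(models, X):
    """Assign each sample to the class whose model reconstructs it best."""
    return np.argmin(reconstruction_errors(models, X), axis=1)


def make_toy_features(n_per_class=50, n_classes=3, dim=16, seed=0):
    """Synthetic, well-separated feature vectors standing in for CNN features."""
    rng = np.random.default_rng(seed)
    centers = 5.0 * np.eye(n_classes, dim)
    X = np.concatenate(
        [centers[c] + 0.5 * rng.standard_normal((n_per_class, dim))
         for c in range(n_classes)]
    )
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y
```

With real data, `X` would hold the 4096-value feature vectors saved from the CNN, and each reconstructor would be a nonlinear AE; the argmin over the error vector stays the same.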

5 Results

To validate the classification model with AEs, we train the CNN with different datasets (MNIST, CIFAR-10, SVHN and ISBI) to obtain the classification accuracy and the feature vectors, as described in Sect. 4.2. These feature vectors feed our AE-based classification method, named CNN-AE and described in Sect. 4.3. We train our CNN with a minibatch of 30 samples and a learning rate between \(10^{-3}\) and \(10^{-5}\). Table 2 shows the results obtained by CNN and CNN-AE. CNN-AE is comparable to CNN, although if we focus only on accuracy, its results are in general slightly worse.

Table 2. Comparison of CNN and CNN-AE classification models

In addition, we compared our results with the results of the ISBI contest, which are available on the ISBI-2016 webpage (Footnote 10). Table 3 shows these results. The winner of the contest has an accuracy 1% higher than our model, while our average precision is slightly higher. Moreover, according to the sensitivity metric, our model is better at identifying true positives, which correspond to cases of skin cancer.

Table 3. Proposed method (CNN, CNN-AE) vs the winner of ISBI contest

Finally, we perform a comprehensive evaluation on the dataset presented in this work, which comprises 3 classes and 11 sub-classes. First, we train the VGG19 CNN to classify the 3 classes; we call this network CNN-3. Fig. 4 shows the confusion matrix for the CNN-3 network.

Fig. 4. Confusion Matrix for CNN-3 network for our own dataset

Knowing only the class of a skin lesion is not enough for an adequate diagnosis. It is important to know the sub-class (kind of disease) that is being detected. To achieve this, we use the CNN-3 network as a feature extractor (Sect. 4.2). We then train AEs on these features to classify the 11 sub-classes, as described in Sect. 4.3; we call this model CNN-AE-11.

Fig. 5. Confusion Matrix for CNN-AE-11 and CNN-AE-11/3 for our own dataset

In Fig. 5(a), we see the confusion matrix of CNN-AE-11, whose accuracy (0.722) is lower than that of CNN-3 (0.841). However, if we analyze the results of CNN-AE-11, we can see that some samples were assigned to the wrong sub-class but to the correct class, according to Table 1. Therefore, we group the results of CNN-AE-11 by class (benign, pre-malignant and malignant) and call the result CNN-AE-11/3, as shown in Fig. 5(b). The accuracy of CNN-AE-11/3 is 85.71%, which improves on the accuracy of CNN-3 (84.15%). The results obtained in this test are shown in Table 4.

Table 4. Evaluation on our own dataset: CNN-3, CNN-AE-11, and CNN-AE-11/3.
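The grouping of sub-class predictions into class-level results (CNN-AE-11/3) amounts to mapping both the true and the predicted sub-class to its parent class before comparing them. The sketch below illustrates this; the mapping array is hypothetical (the actual assignment of the 11 sub-classes to the 3 classes is the one given in Table 1), as is the function name.

```python
import numpy as np

# Hypothetical sub-class -> class assignment (0 = benign, 1 = pre-malignant,
# 2 = malignant); the real assignment is defined by Table 1.
SUBCLASS_TO_CLASS = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])


def subclass_and_class_accuracy(y_true_sub, y_pred_sub, mapping=SUBCLASS_TO_CLASS):
    """Return (sub-class accuracy, class accuracy after grouping)."""
    y_true_sub = np.asarray(y_true_sub)
    y_pred_sub = np.asarray(y_pred_sub)
    sub_acc = float(np.mean(y_true_sub == y_pred_sub))
    # A wrong sub-class still counts as a class hit when both
    # sub-classes belong to the same parent class.
    class_acc = float(np.mean(mapping[y_true_sub] == mapping[y_pred_sub]))
    return sub_acc, class_acc
```

By construction, the grouped accuracy can never be lower than the sub-class accuracy, which is consistent with the gap between CNN-AE-11 and CNN-AE-11/3.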

6 Conclusions

  • In terms of average precision, this work surpasses the first place of the ISBI-2016 Contest 3 (Table 3). Moreover, according to the sensitivity metric, our model is better at identifying true positives, which correspond to cases of skin cancer. This matters because the greater risk lies with sick people who are classified as healthy.

  • Classification using autoencoders is a novel method for the diagnosis of malignant diseases. It has shown comparable results, as demonstrated in Table 2, even with unbalanced datasets. This property is important for this kind of research, since large image collections for building datasets are not available.

  • Finally, the detection of malignant diseases requires the analysis of all the information that we can obtain from a diagnostic method; even the errors provide information, patterns, and behavior, which resemble the clinical diagnosis that is performed by specialists.