1 Introduction

To build classification systems capable of reliable performance, an adequate image representation is necessary. Adopting multimodal image features, as presented in [10, 12, 13], has been shown to achieve higher classification accuracies for biomedical images, as it contributes to a more complete image representation. However, some classification tasks, such as the ImageCLEF 2015 Medical Clustering Task [8], as well as real clinical cases, lack corresponding text representations.

Hence, this paper utilizes the automatically generated keywords proposed in [14] as a substitute text representation for the classification of radiographs into body regions, focusing on a different feature extraction method. The obtained keywords are combined with visual features to form a multi-modal image representation. The generated text information can also be applied further for semantic tagging and image retrieval purposes.

We show that adopting the multi-modal image representation and classification methods described in Subsects. 2.2 and 2.3 increases the overall prediction accuracy, as demonstrated in Sect. 3 by evaluating model performance on the dataset presented in Subsect. 2.1.

2 Materials and Methods

2.1 Dataset

The Medical Clustering Task was held at ImageCLEF 2015, an evaluation campaign organized by the CLEF Initiative. For this task, 750 high-resolution x-ray images collected from a hospital in Dhaka, Bangladesh [1] were distributed. The training set included 500 images and the test set 250 images, annotated with the following classes: ‘Body’, ‘Head-Neck’, ‘Upper-Limb’, ‘Lower-Limb’ and ‘True-Negative’. An excerpt of the x-rays is displayed in Fig. 1.

Fig. 1. An excerpt of images from the CVC digital x-ray dataset, Medical Clustering Task, ImageCLEF 2015. Original data is available from www.cvcrbd.org.

For the creation of the keyword generative model, the dataset distributed for the ImageCLEF Caption Prediction Task [7] was used, as presented in [14].

2.2 Image Representation

For the visual representation, two methods are applied for comparison purposes: deep convolutional activation features (DeCaf) [6] and Bag-of-Keypoints [5] computed with dense SIFT descriptors [11]. The deep visual features are taken from the average pooling layer of the deep learning system Inception_V3 [18], which is pre-trained on ImageNet [15]. The activation features were extracted using the neural network API Keras 2.2.0 [4]. The Bag-of-Keypoints visual features were created using the VLFeat library [19].
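The DeCaf extraction can be sketched as follows; this is a minimal example assuming Keras 2.2.0 with a pre-trained Inception_V3, where the helper decaf_features and the variable xray_paths are illustrative and not part of the original setup:

```python
# Minimal sketch: DeCaf-style activation features from the average
# pooling layer of Inception_V3, pre-trained on ImageNet.
import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing import image

# pooling='avg' returns the 2,048-dimensional average-pool output
# instead of the ImageNet class predictions.
model = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

def decaf_features(img_path):
    # Inception_V3 expects 299x299 RGB input
    img = image.load_img(img_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return model.predict(x)[0]  # shape: (2048,)

# features = np.vstack([decaf_features(p) for p in xray_paths])
```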

To obtain a multi-modal image representation, text information was created: the keyword generative model proposed in [14] was used to automatically generate keywords for all 750 images of the training and test sets. A compact text representation was then achieved by applying vector quantization with a Bag-of-Words [17] codebook and Term Frequency-Inverse Document Frequency (TF-IDF) weighting [16].
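A hedged sketch of such a compact text representation is given below, using scikit-learn's TfidfVectorizer as a stand-in for the Bag-of-Words codebook with TF-IDF weighting; the keyword strings and the codebook size of 150 are illustrative assumptions:

```python
# Minimal sketch: Bag-of-Words codebook over generated keywords with
# TF-IDF weighting, yielding one compact text vector per image.
from sklearn.feature_extraction.text import TfidfVectorizer

# One whitespace-separated keyword string per image (illustrative values)
train_keywords = ["chest rib lung", "hand finger wrist", "skull jaw"]
test_keywords  = ["knee femur tibia"]

vectorizer = TfidfVectorizer(max_features=150)  # limit the codebook size
text_train = vectorizer.fit_transform(train_keywords).toarray()
text_test  = vectorizer.transform(test_keywords).toarray()
```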

2.3 Classification Models

Random forest (RF) [2] models with 1,000 trees were created as image classification models. These RF models were trained using either visual or multi-modal image representations. Principal Component Analysis (PCA) [9] was applied to reduce computational time, feature dimensionality and noise: the vector size was reduced from 2,048 to 50 for the visual features and from 150 to 50 for the text features. For comparison, multi-class Support Vector Machines (SVM) [3] using the same multi-modal image representations as the RF models were trained with the following parameters: kernel = radial basis function, cost parameter = 10 and gamma = 1/num_of_features.
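A minimal sketch of this classifier setup with scikit-learn is shown below; the arrays visual_train, text_train and y_train are placeholders for the extracted feature matrices and body-region labels, and the exact training pipeline of the original experiments may differ:

```python
# Minimal sketch: PCA reduction per modality, feature concatenation,
# and the RF / SVM classifiers with the parameters stated above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Reduce each modality to 50 dimensions before concatenation
pca_visual = PCA(n_components=50).fit(visual_train)   # 2,048 -> 50
pca_text   = PCA(n_components=50).fit(text_train)     # 150 -> 50
X_train = np.hstack([pca_visual.transform(visual_train),
                     pca_text.transform(text_train)])

# Random forest with 1,000 trees
rf = RandomForestClassifier(n_estimators=1000).fit(X_train, y_train)

# Multi-class SVM: RBF kernel, cost C=10, gamma = 1/num_of_features
svm = SVC(kernel='rbf', C=10,
          gamma=1.0 / X_train.shape[1]).fit(X_train, y_train)
```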

3 Results

The achieved prediction accuracies using either visual or multi-modal image representation are listed in Table 1. For comparison purposes, the different classifier setups used for training are shown in the first column.

Table 1. Prediction accuracies obtained using the different visual and text representations and classifier setups. Evaluation was done on the ImageCLEF Medical Clustering test set of 250 x-rays.

Figure 2 displays a word cloud created with the automatically generated keywords from the ImageCLEF Medical Clustering Training Set.

Fig. 2. Word cloud of automatically generated keywords for images in the ImageCLEF 2015 Medical Clustering Training Set.

4 Discussion

Adopting multi-modal representations for the classification task yields higher prediction accuracies, as listed in Table 1. This is the case for both the Random Forest and Support Vector Machine classification models. The prediction rate is further improved by applying DeCaf as the visual representation, in comparison to the traditional Bag-of-Keypoints features. It can be seen from Fig. 2 that the generated keywords contribute to a more adequate representation, as information on body regions is captured.

5 Conclusions

An approach for optimizing prediction accuracies using deep convolutional activation features combined with automatically generated keywords was presented. Following the results shown in Table 1, using multi-modal image representations achieves higher classification accuracies than visual features alone. This is observed for the different classification models and visual feature extraction methods. As the prediction models trained with deep convolutional activation features outperform those trained with traditional Bag-of-Keypoints features, future work can focus on evaluating several image enhancement techniques.