1 Introduction

Lung cancer is the most common cause of cancer death in the world, responsible for nearly 1.6 million deaths annually [10]. The main contributing factor to the high death rate of lung cancer is late diagnosis [19]. Once diagnosed, lung cancer is often at an advanced stage, with a 5-year survival chance of 15% or less [15]. At that stage, tumours are already composed of multiple clonal subpopulations of cancer cells and, consequently, the treatment must be shaped according to the individual tumour heterogeneity. Precision medicine is the medical field that tailors practices and/or therapies to individual patients by taking into account the individual variability of genes. The traditional method of analysing a tumour is to extract tumour tissue in a biopsy, which is then characterised using genomic-based approaches. Although this is a successful approach in clinical oncology, repeated biopsies tend to increase medical complications. Further, tumour characterisation usually demands several biopsies, since the results can vary depending on the part of the tumour that is analysed [23].

In non-small cell lung cancer (NSCLC), which accounts for 85% of all lung cancers [5], mutational testing of selected genes is standard practice to determine how affected patients will respond to targeted therapy [9]. This includes determining the mutation status of epidermal growth factor receptor (EGFR), a cell receptor that activates growth and survival [24], and Kirsten rat sarcoma viral oncogene homolog (KRAS), which activates the same pathway as EGFR when mutated [21]. Patients with mutant EGFR are sensitive to the tyrosine kinase inhibitors (TKIs) gefitinib and erlotinib. Hence, patients with EGFR-mutated lung cancer who receive treatment with targeted TKIs are expected to have a longer progression-free survival than those treated with chemotherapy. However, if gefitinib is administered in cases with non-mutated EGFR, the patient will experience a shorter progression-free survival [18]. KRAS mutation status is also helpful for treatment planning. It has been shown to correlate with response to chemotherapy, since patients with mutated KRAS who undergo chemotherapy have shown inferior responses and shorter survival compared to patients with no KRAS mutation [13, 26]. On that basis, identifying patients with mutated EGFR and KRAS is highly important in precision medicine.

As a less invasive technique compared to biopsy, radiographic medical imaging opens new opportunities for tumour characterisation. Images exhibit strong phenotypic differences between tumours, such as tumour size and the presence of emphysema and/or fibrosis. Such differences often go unrecognised by the naked eye, yet they may be valuable predictors of therapeutic benefit. Moreover, a great advantage of medical imaging is its ability to provide a full visualisation of the state of a tumour at a macroscopic level. Therefore, radiogenomics, the fusion of medical images and genomics, offers attractive opportunities for non-invasive treatment planning.

Given the relevance of the problem, in this paper, we propose predictive models for EGFR and KRAS mutation status, using a set of clinical and radiologist-observed qualitative imaging features, taking advantage of the learning capabilities of machine learning techniques.

The remainder of this paper is organised as follows. In Sect. 2, we present the related work. In Sect. 3, we present the proposed approach, while in Sect. 4 we detail and discuss the experimental results. We conclude the paper with Sect. 5, summarising its main contributions and findings.

2 Related Work

A thorough search of the relevant literature yielded only one related article investigating whether EGFR and KRAS mutation status can be predicted using qualitative features obtained from imaging data. Gevaert et al. [11] used 89 qualitative image features of NSCLC patients' tumours, annotated by a thoracic oncologist, to create models to predict EGFR and KRAS mutation status. A univariate correlation study was performed between mutation status and the qualitative imaging features and, afterwards, the most correlated features were used in a multivariate analysis using decision trees. Emphysema, airway abnormality, the percentage of ground glass component and the type of tumour margin reached the significance threshold of correlation with EGFR mutation status and were used to build a decision tree model, which achieved an area under the ROC curve (AUC) of 0.89. With regard to KRAS mutation, no features reached the significance threshold of correlation and, consequently, the models built for KRAS were not considered useful (AUC = 0.55). Furthermore, some studies have investigated the association between EGFR mutation status and quantitative features, rather than qualitative features [7, 16, 17].

With a view on the advances that have been made, in this study, we propose different experimental methodologies that take advantage of powerful machine learning techniques to create predictive models using a new set of features. By using a different cohort of patients, as well as different features, one can further evaluate the relation between EGFR and KRAS mutation status and radiographic imaging data.

3 Methodology

This study aimed to investigate whether clinical and qualitative imaging features are useful predictors of mutation status and to build predictive models using two algorithms: Random Forest (RF) [3] and Multi-layer Perceptron (MLP) networks [22]. The data was divided into a training set (80%) and a test set (20%), ensuring that each set maintains an equal proportion of instances of each class. It is important to mention that the sets of subjects used for training and testing were kept constant for all the performed experiments. Hyper-parameters were chosen by applying grid search with 5-fold cross-validation to the training data and selecting the set of hyper-parameters of the model with the highest F-measure. The developed code and the data used are available on GitHub.
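The protocol described above (stratified 80/20 split, then grid search with 5-fold cross-validation scored by F-measure) could be sketched with scikit-learn as follows. This is only an illustrative sketch: the data is synthetic, the hyper-parameter grid is hypothetical, and the random seeds are assumptions rather than the values used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for an encoded feature matrix (NOT the study data).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(158, 10))
y = X[:, 0] * X[:, 1]  # toy label, positive in roughly a quarter of the rows

# 80/20 split that preserves the class proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid, chosen only to keep the example small.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # F-measure of the positive class
    cv=5,          # 5-fold cross-validation on the training data only
)
search.fit(X_train, y_train)
best_params = search.best_params_
```

Fixing `random_state` in the split is one simple way to keep the training and test subjects constant across experiments.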

3.1 The Dataset

The study included a subset of 158 NSCLC patients tested for EGFR mutation status and 157 NSCLC patients tested for KRAS mutation status, characterised by qualitative and clinical features. The data was obtained from the open-access NSCLC-Radiogenomics dataset available at The Cancer Imaging Archive (TCIA) database [2, 6, 12].

The qualitative features were obtained from an analysis of pre-treatment computed tomography (CT) images using a controlled vocabulary. The terms are commonly used in radiology clinical practice and derive from descriptions in the radiology literature [1]. Definitions of some of the terms used in this description can be found in [14]. The template of semantic terms was developed exclusively for nodules, since they are the most prevalent expression of lung cancer. Therefore, other manifestations of lung cancer besides nodules (e.g. central obstructive tumours) are not included in this study.

Of the 30 qualitative features available in the NSCLC-Radiogenomics dataset, some were discarded due to a large number of not applicable values (e.g. the fibrosis type field in a patient in whom fibrosis is absent); thus, a subset of 18 qualitative features was used in this study. The set includes nodule and parenchymal features, which describe the nodule's geometry, location, internal features and other related findings. Additionally, the patient's gender and smoking status were considered due to their significant association with mutation status prevalence, confirmed in recent studies [8, 20, 25]. From this point forward, gender and smoking status are designated as clinical features and the qualitative features extracted from the images as imaging features. Table 1 shows detailed information regarding the data distribution and the nomenclature used to classify the tumours.

The dataset comprises 26% and 25% mutated cases for EGFR and KRAS, respectively. Before feeding the data into the model, features were converted to binary vectors following a one-hot encoding strategy. As a result, the number of features increased from 20 to 73.
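As an illustration of how one-hot encoding expands the feature count, the sketch below encodes three categorical columns with pandas. The column names and category values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical feature names and categories, for illustration only.
df = pd.DataFrame(
    {
        "emphysema": ["present", "absent", "absent"],
        "margin": ["smooth", "spiculated", "lobulated"],
        "gender": ["female", "male", "female"],
    }
)

# Each categorical column becomes one binary column per observed category.
encoded = pd.get_dummies(df, dtype=int)
print(encoded.shape)
print(list(encoded.columns))
```

Here three categorical features with 2, 3 and 2 categories become 7 binary columns, which is the same mechanism by which the feature count in the study grows after encoding.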

3.2 Random Forest

RF models were implemented for predicting EGFR and KRAS mutation status. As an algorithm based on ensemble learning, RF makes the predictions taking advantage of a group of models, instead of a single model. Random Forest samples both observations and features of training data in order to build independent decision trees which contribute by voting for the ensemble prediction. Bearing in mind that a decision tree is an unstable algorithm, by averaging the results of all decision trees, the variance component of the model will be minimised, which approximates the ensemble to an ideal model.

Since RF can estimate the importance of each feature for the problem at hand, an analysis of the most valuable features was also conducted.

3.3 Multi-layer Perceptron

By virtue of the remarkable ability of MLP to extract patterns and detect trends, its performance in predicting the genes' mutation status was tested.

The MLP assumes the distribution of classes is similar, which in this case would result in a model biased towards the negative class. However, in this study, the correct classification of both classes is equally important, since the classification of a patient with the wrong mutation status could lead to the administration of a less suitable treatment and, consequently, to shorter progression-free survival. To overcome class imbalance, the Synthetic Minority Over-sampling Technique for Nominal and Continuous data (SMOTE-NC) was applied, in which new instances are created based on the 5 nearest neighbours in the feature space that belong to the minority class. In comparison to traditional over-sampling, SMOTE-NC has the advantage of building a more general decision region for the minority class [4]. After applying SMOTE-NC, the training set contains the same number of mutated and wildtype samples. To keep the input data identical for the two algorithms, SMOTE-NC was applied in all the performed experiments.
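The nominal part of this over-sampling idea can be illustrated with a small NumPy sketch. It is a deliberately simplified version and not the full SMOTE-NC algorithm: it assumes all features are integer-coded nominal values, uses the Hamming distance to find the 5 nearest minority neighbours, and gives each synthetic feature the most frequent value among those neighbours. The function name `oversample_nominal` and the toy data are hypothetical. In practice, a library implementation such as the `SMOTENC` class in imbalanced-learn may be preferable.

```python
import numpy as np

def oversample_nominal(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows from integer-coded nominal features."""
    rng = np.random.default_rng(seed)
    n_features = X_min.shape[1]
    synthetic = np.empty((n_new, n_features), dtype=X_min.dtype)
    for i in range(n_new):
        base = rng.integers(len(X_min))
        # Hamming distance from the base sample to every minority sample.
        dist = (X_min != X_min[base]).sum(axis=1)
        dist[base] = n_features + 1  # exclude the base sample itself
        neighbours = X_min[np.argsort(dist, kind="stable")[:k]]
        # Each nominal feature takes the most frequent value among the neighbours.
        for j in range(n_features):
            values, counts = np.unique(neighbours[:, j], return_counts=True)
            synthetic[i, j] = values[np.argmax(counts)]
    return synthetic

# Toy data: 10 minority (mutated) and 30 majority (wildtype) training rows.
rng = np.random.default_rng(1)
X_minority = rng.integers(0, 3, size=(10, 4))
X_majority = rng.integers(0, 3, size=(30, 4))

X_new = oversample_nominal(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, X_new])
```

Over-sampling should be applied to the training set only, so that the test set keeps the original class distribution.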

Table 1. Distribution of the data used for KRAS and EGFR mutation status prediction experiments.

4 Results and Discussion

The same experimental set-up was followed for EGFR and KRAS mutations. Interestingly, the average results were quite different, in the sense that it was possible to obtain models that reliably predict EGFR mutations but not KRAS mutations. From the experiments conducted, the best-performing predictive model for KRAS mutation had an AUC of 0.56, with RF as the classifier. Therefore, the following results concern EGFR exclusively.

4.1 EGFR Mutation Status Prediction

Three experiments were conducted in order to identify the best set of features to predict EGFR mutation status: using clinical features, imaging features, and both imaging and clinical features. Clinical features alone were only tested with MLP, since RF is not an ideal model for a set of 2 features, due to the lack of feature combinations. The range of values used in the grid search for RF and MLP for each hyper-parameter is presented in Tables 2 and 3, respectively.

Table 2. Set of hyper-parameter values used in the grid search for RF.
Table 3. Set of hyper-parameter values used in the grid search for MLP.

The hyper-parameters of the models which achieved the highest F-measure in the 5-fold cross-validation for RF and MLP are included in Tables 4 and 5, respectively, for each experiment.

Table 4. Hyper-parameters of the RF model which achieved the highest F-measure in the 5-fold cross-validation.
Table 5. Hyper-parameters of the MLP model which achieved the highest F-measure in the 5-fold cross-validation.

Table 6 shows the results obtained on the test set by the RF and MLP models which provided the best results on the training set, in the three performed experiments.

Table 6. Results obtained in the different experiments.

On average, a better performance was achieved when both clinical and imaging features were used. The RF model achieved an equal performance using both clinical and imaging features or just imaging features; however, MLP reached a higher performance (AUC = 0.96) when clinical features were added to the feature set.

Focusing on the MLP experiments, clinical features achieved a satisfactory performance when predicting EGFR mutation status (AUC = 0.68), whereas a reliable predictive model was obtained when using imaging data (AUC = 0.94). Further, combining imaging and clinical data produced a model that improves on the performance of imaging data alone (AUC = 0.96). The RF model achieved the same performance in the two experiments, potentially due to its intrinsic feature subsampling.

The ROC curves for the RF and MLP models using imaging and clinical features are presented in Fig. 1. The confusion matrix obtained by the two models using imaging and clinical features was identical, and it is presented in Fig. 2.
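For reference, the AUC and a confusion matrix can be computed from test-set predictions with scikit-learn as sketched below. The labels and scores are made-up values used only to show the calls, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical test labels (1 = mutated) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.2, 0.6, 0.8, 0.7, 0.4, 0.5]

# AUC is computed from the continuous scores, not from thresholded labels.
auc = roc_auc_score(y_true, y_score)

# The confusion matrix needs hard predictions; 0.5 is an assumed threshold.
y_pred = [int(s >= 0.5) for s in y_score]
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
```

Two models can therefore share the same confusion matrix at a given threshold while having different AUC values, because the AUC depends on the full ranking of the scores.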

Fig. 1. ROC curve of the RF model (left) and MLP model (right).

Fig. 2. Confusion matrix of the RF and MLP models.

Feature Importance. Taking advantage of the ability of RF to estimate feature importance, an analysis was conducted to find the most valuable predictors amongst the set of features used. The RF model outputs a score for each feature, describing the average decrease in impurity over the trees; the scores sum to one.

Since the one-hot encoding approach was used to convert from categorical data to binary vectors, there is an importance score associated with each feature value and not a single score to the feature itself. Therefore, the feature score was considered to be the sum of the scores of each of its values. For instance, if having emphysema present had a score of 0.10 and emphysema absent a score of 0.08, the feature emphysema has a score of 0.18. Figure 3 shows the importance scores as a result of this analysis. Emphysema is plainly the feature with the highest score (0.18) followed by lung parenchymal features (0.16).
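The aggregation described above can be sketched in a few lines of Python. The emphysema scores are the illustrative values from the text; the other column names and scores are hypothetical, as is the assumption that encoded columns are named `feature_value`.

```python
from collections import defaultdict

# Per-column importance scores for one-hot encoded columns (illustrative values).
column_scores = {
    "emphysema_present": 0.10,
    "emphysema_absent": 0.08,
    "margin_smooth": 0.05,
    "margin_spiculated": 0.07,
}

# Sum the scores of all columns that derive from the same original feature.
feature_scores = defaultdict(float)
for column, score in column_scores.items():
    feature, _, _value = column.rpartition("_")  # split off the category suffix
    feature_scores[feature] += score

ranked = sorted(feature_scores.items(), key=lambda kv: kv[1], reverse=True)
```

Note that summing favours features with many categories, since each category contributes its own non-negative score, which is worth keeping in mind when comparing features.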

Fig. 3. Feature importance scores.

5 Conclusions and Future Work

In this work, we report a radiogenomics model that is able to predict the mutation status of EGFR (AUC = 0.96) from CT scans in a less invasive procedure, compared to repeated biopsies during treatment. To the best of our knowledge, this work was the second attempt to create predictive models for EGFR and KRAS mutation status using qualitative radiographic image features. Even though our model outperforms the model created by Gevaert et al. [11] (AUC = 0.89), the present work did not use the same dataset as the first attempt and, consequently, results cannot be directly compared.

The results of this study suggest that an image signature exists that accurately predicts EGFR mutation status but not KRAS. Considering that class proportions are similar in EGFR (26% mutated, 74% wildtype) and KRAS (25% mutated, 75% wildtype), the most reasonable hypothesis for this result is that KRAS mutations are not evident through radiographic qualitative features to the same extent as EGFR mutations, which appear to have particular patterns. Gevaert et al. [11], who were also able to create a predictive model for EGFR mutations but not for KRAS, hypothesised that their results might stem from different class proportions between EGFR and KRAS; however, in the present study class proportions are similar, which strengthens the hypothesis that KRAS mutations do not manifest through qualitative imaging features.

When clinical and imaging data were combined, the performance slightly increased, on average. These results show that, despite the limited amount of data, images and clinical data combined are potential predictors of EGFR mutation status. However, it is important to further validate the results with a larger dataset, to clarify the importance of these features.

From the total set of features used in this work, emphysema and lung parenchymal features were the ones most strongly associated with EGFR mutation status, according to the RF importance scores.

The model created in this study for the prediction of EGFR mutation status opens interesting opportunities for better treatment planning and supports oncologists and radiologists with additional information at diagnosis. A key issue to be investigated in the future is whether quantitative features extracted directly from the image have the same predictive ability as qualitative features, or even exceed it.