# Comprehensive anticancer drug response prediction based on a simple cell line-drug complex network model

## Abstract

### Background

Accurate prediction of anticancer drug responses in cell lines is a crucial step toward precision medicine in oncology. Although many computational models have been proposed for this non-trivial problem, there is still room to improve prediction performance by combining multiple types of genome-wide molecular data.

### Results

We first demonstrated an observation on the CCLE and GDSC datasets: genetically similar cell lines exhibit higher response correlations to structurally related drugs. Based on this observation, we built a cell line-drug complex network model, named the CDCN model. It captures the different contributions of all available cell line-drug responses through cell line similarities and drug similarities. We performed anticancer drug response prediction on CCLE and GDSC independently. The results are significantly superior to those of some existing studies. More importantly, our model could predict the response of a new drug to a new cell line with considerable performance. We also divided all cell lines into “sensitive” and “resistant” groups by their response values to a given drug; the prediction accuracy, sensitivity, specificity and goodness of fit are also very promising.

### Conclusion

The CDCN model is a comprehensive tool for predicting anticancer drug responses. Compared with existing methods, it provides more satisfactory prediction results at lower computational cost.

## Keywords

Anticancer drug response; Cell line-drug complex network; Computational prediction model; Cell line; Precision medicine

## Abbreviations

- AUC: Area under the ROC curve
- CCLE: Cancer cell line encyclopedia
- CDCN: Cell line-drug complex network
- CSN: Cell line similarity network
- DSN: Drug similarity network
- GDSC: Genomics of drug sensitivity in cancer
- IC50: The concentration of an anticancer drug required to kill half of the cancer cells
- NCI: National cancer institute
- NRMSE: Normalized root mean square error
- RF: Random forest
- RMSE: Root mean square error
- ROC: Receiver operating characteristic
- SVR: Support vector regression

## Background

The inherent heterogeneity of cancers means that patients with the same cancer often exhibit different responses to anticancer drugs, which is a major difficulty in cancer treatment. It is critical to accurately predict the therapy responses of patients based on their molecular and clinical profiles [1, 2]. With the rapid development of high-throughput technology, large research agencies have generated a huge amount of publicly available cancer genomic data. This offers a golden opportunity to translate massive data into knowledge of tumor biology and thereby improve anticancer drug response prediction. Many computational methods have contributed to this non-trivial problem [3, 4, 5, 6]. Supervised learning is one of the most widely used approaches, and its models can be broadly divided into regression and classification models [7]. The former generate numerical estimates of drug sensitivity represented by activity area or IC50 [3, 8], and the latter predict high or low sensitivity according to predetermined response levels [9, 10]. Machine learning tools used to implement these methods include support vector machines [11], random forests [12], neural networks [4] and logistic ridge regression [13]. Comparative analysis suggested that regression models, such as elastic net and ridge regression, exhibit good and robust performance in different settings [9, 14].

Besides the above two types of methods, network-based models have also gained much attention [15, 16, 17, 18, 19]. One of the earliest attempts can be traced back to Zhang et al. [20], who presented a dual-layer integrated cell line-drug network model that combines the predictions from the individual layers. Readers may refer to [7, 9, 21] for further computational approaches.

Although they achieve promising results for certain drugs, most models focus on predicting three types of responses, i.e., ‘old drug to old cell line’, ‘old drug to new cell line’ and ‘new drug to old cell line’ (here ‘old’ means tested or existing, and ‘new’ means untested), and pay less attention to the response prediction of ‘new drug to new cell line’. Updating an existing cancer screen with the latest available drugs and cell lines is not trivial, because it requires the same expertise, infrastructure and conditions as when the screen was first carried out. In addition, comprehensive prediction might make potential cancer screens more accurate and experimental design more flexible, as well as accelerate early drug evaluation. Such efforts should be greatly aided by accurate preclinical computational methods.

To predict the response of ‘new drug to new cell line’, we should take advantage of all observed (tested or existing) cell line-drug response values. Two questions need to be asked. The first is whether the observed response values have the statistical power to predict the response of ‘new drug to new cell line’. The second is how to evaluate the prediction performance of the proposed model. We aim to answer both questions.

Shivakumar et al. found that structural similarity between drug pairs in the NCI-60 dataset highly correlates with the similarity between their activities across the cancer cell lines [22]. Zhang et al. showed that genetically similar cell lines may also respond very similarly to a given drug, and structurally related drugs may have similar responses to a given cell line [20]. We asked whether these ideas could be extended to a more general circumstance, that is, whether genetically similar cell lines exhibit higher response correlations to structurally related drugs. If this holds, we aim to construct a cell line-drug complex network (CDCN) model that incorporates cell line similarity and drug similarity information, as well as cell line-drug responses. To answer the second question, we ran the CDCN model on the Cancer Cell Line Encyclopedia (CCLE) [23] and the Genomics of Drug Sensitivity in Cancer (GDSC) [24] datasets separately and obtained satisfactory prediction results. Besides imputing missing values of drug response data, we also classified cell lines into a sensitive group and a resistant group according to the observed response to a given drug. The prediction accuracy, sensitivity, specificity and goodness of fit further support the good performance of our model.

## Methods

### Data and preprocessing

### Generalized observation

For the first question, we want to know whether the available cell line-drug response values have the statistical power to predict the response of ‘new drug to new cell line’. Motivated by [20, 22], we first examined the response correlations between genetically similar cell lines and structurally similar drugs.

Cell line similarities are measured by the Pearson correlation coefficients between the corresponding gene expression profiles. The correlations of most cell line pairs (around 92% for CCLE, 70% for GDSC) are larger than 0.8. We assigned all cell line pairs with correlation coefficients higher than 0.9 to the high-similarity group ‘Hc’, and the other pairs to the low-similarity group ‘Lc’.
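The grouping step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `pearson` and `split_cell_line_pairs` and the toy expression profiles are hypothetical, and only the 0.9 cutoff comes from the text.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_cell_line_pairs(expression, threshold=0.9):
    """Split all cell line pairs into the 'Hc' and 'Lc' groups.

    expression: dict mapping a cell line name to its gene expression profile.
    Pairs with correlation above `threshold` go to 'Hc', the rest to 'Lc'.
    """
    groups = {"Hc": [], "Lc": []}
    for a, b in combinations(sorted(expression), 2):
        rho = pearson(expression[a], expression[b])
        groups["Hc" if rho > threshold else "Lc"].append((a, b, rho))
    return groups


# Toy profiles (made-up numbers, not CCLE or GDSC data).
toy_expression = {
    "C1": [1.0, 2.0, 3.0, 4.0],
    "C2": [1.1, 2.1, 2.9, 4.2],
    "C3": [4.0, 1.0, 3.5, 0.5],
}
groups = split_cell_line_pairs(toy_expression)
```

With real data, each profile would be the genome-wide expression vector of one cell line, and the number of pairs grows quadratically with the number of cell lines.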

The distance between drugs D_{i} and D_{j} is defined as *d*(D_{i}, D_{j}) = 1 − *T*(D_{i}, D_{j}), where *T*(D_{i}, D_{j}) is the Tanimoto coefficient between drugs D_{i} and D_{j}. Based on the drug distance matrix (see Additional file 1: Table S1 and Additional file 2: Table S2), we clustered all drugs using the “complete” method in R. Drugs with high distances tend to be in different clusters, while drugs with similar structures are expected to be clustered together (see Fig. 2a and c). For the CCLE dataset, we assigned the drug pairs from Fig. 2a with Tanimoto coefficient greater than 0.5 and distance less than 0.49 to the high-similarity group ‘Hd’: {17-AAG, Paclitaxel, AZD6244, PD-0325901, Nilotinib, PD-0332991, AEW541, PF2341066, Erlotinib, ZD-6474, AZD0530, TAE684, Lapatinib, PLX4720, PHA-665752, Irinotecan, Topotecan}. The other drug pairs were assigned to the low-similarity group ‘Ld’. For the GDSC dataset, we assigned the drug pairs from Fig. 2c with Tanimoto coefficient greater than 0.5 and distance less than 0.45 to the high-similarity group ‘Hd’: {Tipifarnib, PLX4720, Dasatinib, Sunitinib, PHA-665752, AZ628, Imatinib, AMG-706, BMS-754807, PF-02341066, Bosutinib, A-770041, PD-173074, AZD6244, CI-1040, PD-0325901, Erlotinib, AZD-0530, Gefitinib, BIBW2992, NVP-TAE684, WH-4023}. The other drug pairs were assigned to the low-similarity group ‘Ld’. From Fig. 2b and d, we found that more similar cell lines show higher response correlations to more similar drugs; this holds for both the CCLE and GDSC datasets.
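The drug distance used above, one minus the Tanimoto coefficient, can be illustrated with a small sketch. It assumes that each drug is represented by a binary structural fingerprint stored as the set of its on-bit positions; the fingerprints, the function names and the dictionary layout are hypothetical, and the clustering step is not reproduced here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def drug_distance_matrix(fingerprints):
    """Return {(Di, Dj): 1 - T(Di, Dj)} for every ordered pair of drugs."""
    return {
        (a, b): 1.0 - tanimoto(fingerprints[a], fingerprints[b])
        for a in fingerprints
        for b in fingerprints
    }


# Hypothetical fingerprints (made-up bit positions, not real structures).
toy_fingerprints = {
    "D1": {1, 2, 3, 4},
    "D2": {2, 3, 4, 5},
    "D3": {7, 8, 9},
}
distances = drug_distance_matrix(toy_fingerprints)
```

Here D1 and D2 share three of five distinct bits, so their Tanimoto coefficient is 0.6 and their distance is 0.4, while D1 and D3 share no bits and sit at the maximum distance of 1.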

### Construction of cell line-drug complex network model

Denote *ρ*(C, C_{i}) as the Pearson correlation coefficient between cell lines C and C_{i}, and *T*(D, D_{j}) as the Tanimoto coefficient between drugs D and D_{j}. Meanwhile, we use *R*(C, D) to represent the observed response value of the pair (C, D) ∈ Ω. Define C_{i} and C_{j} as adjacent if *ρ*(C_{i}, C_{j}) ≠ 0, and the weight of this edge as *ρ*(C_{i}, C_{j}). Similarly, D_{i} and D_{j} are called adjacent if their weight *T*(D_{i}, D_{j}) > 0. Define C_{i} and D_{j} as adjacent if *R*(C_{i}, D_{j}) is available. Obviously, the resulting network involves cell line similarity and drug similarity information, as well as cell line-drug response situations, so we call it the cell line-drug complex network (CDCN). In fact, this network is the dual-layer integrated cell line-drug network in [20]. Figure 3b shows a CDCN corresponding to the cell line-drug response matrix described in Fig. 3a.

Define \( w\left(\mathrm{C},{\mathrm{C}}_i\right)={e}^{-\frac{{\left(1-\rho \left(\mathrm{C},{\mathrm{C}}_i\right)\right)}^2}{2{\alpha}^2}} \) as a weight function of cell lines. It increases with respect to *ρ*(C, C_{i}), where the parameter *α* measures the decay rate with the decrease of *ρ*(C, C_{i}). Similarly, define a weight function of drugs \( w\left(\mathrm{D},{\mathrm{D}}_j\right)={e}^{-\frac{{\left(1-T\left(\mathrm{D},{\mathrm{D}}_j\right)\right)}^2}{2{\tau}^2}} \) with decay parameter *τ*.
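The two weight functions can be written down directly from their definitions. The sketch below only transcribes the formulas; the function names are hypothetical, and both *α* and *τ* must be nonzero because they appear in the denominator of the exponent.

```python
from math import exp


def cell_line_weight(rho, alpha):
    """w(C, C_i) = exp(-(1 - rho)^2 / (2 * alpha^2)), rho = Pearson correlation."""
    return exp(-((1.0 - rho) ** 2) / (2.0 * alpha ** 2))


def drug_weight(tanimoto, tau):
    """w(D, D_j) = exp(-(1 - T)^2 / (2 * tau^2)), T = Tanimoto coefficient."""
    return exp(-((1.0 - tanimoto) ** 2) / (2.0 * tau ** 2))
```

A similarity of 1 always gives weight 1, and a smaller *α* or *τ* makes the weight fall off faster as the similarity decreases, so the prediction leans more heavily on the closest cell lines or drugs.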

(C_{i}, D_{j}) besides (C, D). Based on the generalized observation, we are able to make a prediction by dealing with all possible observed response values *R*(C_{i}, D_{j}) as the following,

*w*(C, C_{i})*w*(D, D_{j}) reflects the contribution of *R*(C_{i}, D_{j}) to \( \widehat{R}\left(\mathrm{C},\mathrm{D}\right) \).

*R*(C, D_{j}) and *R*(C_{i}, D) are not known for any existing drug D_{j} and any existing cell line C_{i}). In this circumstance, the cell line-drug response matrix and the corresponding cell line-drug complex network shown in Fig. 3 would be changed into the ones depicted in Fig. 4. Formula (1) also has a ‘little variation’ in the assignment of the pair (C_{i}, D_{j}), that is

This ‘little variation’ is crucial for predicting the response of ‘new drug to new cell line’. To highlight the difference between the two formulas, we call formula (1) CDCN model I and formula (2) CDCN model II.
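To make the difference between the two models concrete, the following sketch implements one plausible reading of them. It assumes that the prediction is a normalized weighted average of observed responses, with each response *R*(C_{i}, D_{j}) weighted by *w*(C, C_{i})*w*(D, D_{j}); the normalization, the function names and the toy numbers are assumptions rather than the authors' stated formulas. In this reading, model I uses every observed pair other than (C, D), and model II additionally skips every pair that shares the cell line C or the drug D.

```python
from math import exp


def kernel(similarity, width):
    """Gaussian weight exp(-(1 - s)^2 / (2 * width^2)) from the text."""
    return exp(-((1.0 - similarity) ** 2) / (2.0 * width ** 2))


def predict(cell, drug, responses, rho, tanimoto, alpha, tau, model="I"):
    """Predict the response of (cell, drug) from observed responses.

    responses: {(cell_i, drug_j): R} observed values.
    rho: {(cell, cell_i): Pearson}; tanimoto: {(drug, drug_j): Tanimoto}.
    ASSUMPTION: normalized weighted average; not confirmed by the source.
    """
    num = den = 0.0
    for (c_i, d_j), r in responses.items():
        if (c_i, d_j) == (cell, drug):
            continue  # never use the target pair itself
        if model == "II" and (c_i == cell or d_j == drug):
            continue  # 'new drug to new cell line': same row/column unknown
        w = kernel(rho[cell, c_i], alpha) * kernel(tanimoto[drug, d_j], tau)
        num += w * r
        den += w
    return num / den if den > 0 else float("nan")


# Toy data (made-up numbers, not CCLE or GDSC values).
toy_responses = {
    ("C1", "D1"): 1.0, ("C1", "D2"): 2.0,
    ("C2", "D1"): 3.0, ("C2", "D2"): 4.0,
}
toy_rho = {("C1", "C1"): 1.0, ("C1", "C2"): 0.8,
           ("C2", "C1"): 0.8, ("C2", "C2"): 1.0}
toy_tanimoto = {("D1", "D1"): 1.0, ("D1", "D2"): 0.6,
                ("D2", "D1"): 0.6, ("D2", "D2"): 1.0}

pred_model_1 = predict("C1", "D1", toy_responses, toy_rho, toy_tanimoto,
                       alpha=0.5, tau=0.5, model="I")
pred_model_2 = predict("C1", "D1", toy_responses, toy_rho, toy_tanimoto,
                       alpha=0.5, tau=0.5, model="II")
```

In the toy example, model II can only draw on the pair (C2, D2), so its prediction equals that single response, whereas model I blends the three remaining observations.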

The parameter pair (*α*, *τ*) could be optimized by minimizing the following overall error function

*α* and *τ* range from 0 to 1 with increment 0.01, respectively, and the pair (*α*, *τ*) takes all possible combinations.
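The exhaustive search over (*α*, *τ*) can be sketched generically. Because the overall error function is not reproduced here, the sketch takes it as a hypothetical callable `error(alpha, tau)`; the toy error surface exists only to exercise the search. The grid starts at the first step rather than at 0, an assumption made because the Gaussian weights divide by *α*² and *τ*².

```python
def grid_search(error, step=0.01):
    """Return (best_error, alpha, tau) over the grid {step, 2*step, ..., 1}^2.

    error: hypothetical callable (alpha, tau) -> overall prediction error.
    The value 0 is skipped because the weights divide by alpha^2 and tau^2.
    """
    n_steps = round(1.0 / step)
    grid = [round(k * step, 10) for k in range(1, n_steps + 1)]
    return min((error(a, t), a, t) for a in grid for t in grid)


# Toy error surface with a known minimum, only to exercise the search.
best_error, best_alpha, best_tau = grid_search(
    lambda a, t: (a - 0.30) ** 2 + (t - 0.70) ** 2
)
```

With the default step, the search evaluates the error at 100 × 100 parameter pairs, which stays cheap as long as a single evaluation of the error is cheap.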

Here, C ranges over all cell lines for which *R*(C, D) is known, and *n* is the number of such cell lines.
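A per-drug error of this kind can be sketched as follows, assuming the standard root mean square error over the *n* cell lines with a known response to the drug. The function name and the dictionary layout are hypothetical, and the exact normalization behind NRMSE is not assumed here.

```python
from math import sqrt


def rmse_for_drug(observed, predicted):
    """RMSE over the cell lines whose response to one drug is known.

    observed, predicted: {cell_line: value}; cell lines absent from
    `observed` are ignored, so n counts only the known responses.
    """
    cells = [c for c in observed if c in predicted]
    n = len(cells)
    return sqrt(sum((observed[c] - predicted[c]) ** 2 for c in cells) / n)


# Toy values (made-up): C3 has a prediction but no observed response.
toy_observed = {"C1": 1.0, "C2": 3.0}
toy_predicted = {"C1": 2.0, "C2": 1.0, "C3": 5.0}
toy_rmse = rmse_for_drug(toy_observed, toy_predicted)
```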

## Results

We carried out the following four experiments. (1) We used CDCN model I to predict general responses for the CCLE and GDSC datasets and compared it with six popular computational models. (2) Taking each existing cell line-drug pair as a ‘new drug-new cell line’ pair, we used CDCN model II to predict the special responses of these ‘new pairs’ and compared them with the general predictions of model I. (3) We used the two models independently to impute missing data in GDSC. (4) We evaluated the model accuracy, sensitivity, specificity and goodness of fit by classifying cell lines into sensitive and resistant groups for a given drug.

### General response prediction

### Special response prediction

*R*(C_{i}, D) and *R*(C, D_{j}) (see Fig. 11). However, their prediction tendencies are consistent except for a few drugs, so model II is a reliable tool for predicting the response of ‘new drug-new cell line’.

### Imputing missing data in drug response matrix

*P*-value by t.test to illustrate the “consistent pattern” statistically. As shown in Fig. 12, the observed response values of wild type cell lines are significantly higher than those of BRAF mutated cell lines for three MEK inhibitors: AZD6244 (fold-change = 1.26 and *P* = 3.75e-6), RDEA119 (fold-change = 2.02 and *P* = 3.02e-11) and PD-0325901 (fold-change = 1.40 and *P* = 1.61e-9). Consistently, the predicted response values of wild type cell lines are also higher than those of BRAF mutated cell lines for AZD6244 (fold-change = 1.09 and *P* = 6.64e-5 for CDCN model I; fold-change = 0.98 and *P* = 6.07e-7 for CDCN model II), RDEA119 (fold-change = 1.10 and *P* = 4.79e-3 for CDCN model I; fold-change = 1.29 and *P* = 2.91e-5 for CDCN model II) and PD-0325901 (fold-change = 1.35 and *P* = 9.41e-6 for CDCN model I; fold-change = 1.17 and *P* = 3.90e-3 for CDCN model II). In summary, BRAF-mutated cell lines are more sensitive to MEK inhibitors, which is in accordance with previously published work [20]. Similarly, we looked at the response difference of the dual kinase inhibitor Lapatinib between EGFR mutated and wild type cell lines, for which more than half of the response values are missing. We found that EGFR-mutated cell lines are more sensitive to Lapatinib (see Fig. 13), which is in agreement with the study [6]. These results indicate that our model could correctly predict the drug responses of missing data in the GDSC dataset.
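The group comparison reported above can be sketched in a few lines. The sketch assumes that the fold-change is the ratio of the mean response of wild type cell lines to that of mutated cell lines, and it uses Welch's unequal-variance statistic, which is the default of `t.test` in R. The function names and the toy values are hypothetical, and turning the statistic into a *P*-value additionally requires the Student t distribution, which is not implemented here.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(wild_type, mutated):
    """Ratio of mean responses (wild type over mutated); assumed definition."""
    return mean(wild_type) / mean(mutated)


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Toy response values (made-up numbers, not GDSC measurements).
toy_wild_type = [2.0, 2.5, 3.0, 3.5]
toy_mutated = [1.0, 1.5, 2.0, 1.5]
toy_fold_change = fold_change(toy_wild_type, toy_mutated)
toy_t, toy_df = welch_t(toy_wild_type, toy_mutated)
```

A fold-change above 1 together with a large positive statistic corresponds to the pattern described in the text, namely higher response values in wild type than in mutated cell lines.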

The distance between drugs D_{i} and D_{j} is defined as 1 − *T*(D_{i}, D_{j}). We repeated the above procedure five times, and used the mean of the five Pearson correlation coefficients between predicted and observed response values as the model accuracy. As shown in Fig. 14a, our model significantly outperforms kNN methods at different values of k. To further verify the robustness of our model, we also randomly deleted 20% of the response values in the CCLE dataset and obtained results similar to the 10% case (see Fig. 14b).

### Prediction accuracy, sensitivity, specificity and goodness of fit

Here we should point out that the goodness of fit is relatively small (lower than 0.2) for around half of the drugs in both CCLE and GDSC. This is possible even if our model is satisfactory: because CCLE and GDSC are both cross-sectional datasets, the goodness of fit may be low owing to the variation among the observed values.
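The classification-based evaluation in this section can be illustrated with a short sketch. It assumes that cell lines are labeled sensitive or resistant by comparing their response to a cutoff; the cutoff, the flag `sensitive_if_high` and the toy values are hypothetical. Which direction counts as sensitive depends on the response measure: a larger activity area indicates higher sensitivity, whereas a larger IC50 indicates lower sensitivity.

```python
def classify(values, threshold, sensitive_if_high=True):
    """Label each cell line True (sensitive) or False (resistant)."""
    return {c: (v > threshold) == sensitive_if_high for c, v in values.items()}


def confusion_metrics(observed_labels, predicted_labels):
    """Accuracy, sensitivity and specificity, treating 'sensitive' as positive."""
    tp = fp = tn = fn = 0
    for cell, obs in observed_labels.items():
        pred = predicted_labels[cell]
        if obs and pred:
            tp += 1
        elif obs and not pred:
            fn += 1
        elif not obs and pred:
            fp += 1
        else:
            tn += 1
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
    }


# Toy observed and predicted responses to one drug (made-up numbers).
toy_observed = {"C1": 3.0, "C2": 1.0, "C3": 2.5, "C4": 0.5}
toy_predicted = {"C1": 2.8, "C2": 2.2, "C3": 2.4, "C4": 0.9}
metrics = confusion_metrics(
    classify(toy_observed, threshold=2.0),
    classify(toy_predicted, threshold=2.0),
)
```

In the toy example, both truly sensitive cell lines are recovered, but one resistant cell line (C2) is predicted as sensitive, which lowers the specificity and the accuracy.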

## Discussion

There are two key steps in network-based methods: the construction of cell line and drug similarity networks from different types of data, and an effective model to perform the prediction. Our method improves on both steps through an intuitive weighted model that captures the different contributions of all available cell line-drug responses. Instead of selecting a large number of genomic features and making predictions for each drug independently, our model uses only two parameters to predict responses for all drugs. This not only decreases the risk of overfitting, but also significantly reduces the computational cost.

A main challenge for computational prediction models is how to achieve good performance at low computational cost. The following efforts may further improve the performance of the model. First, we can integrate other important information, such as copy numbers, gene mutations, drug resistance and transcriptomic signatures of drug sensitivity, into the cell line-drug network to gain new knowledge. Second, we could further decrease the computational cost by constructing the cell line similarity network from a few genes that are informative with respect to drug response, instead of using all genes.

## Conclusion

We built a simple computational model to comprehensively predict anticancer drug responses. One of its main contributions is a technique for predicting the response of “new drug to new cell line”. Moreover, besides imputing missing values of drug response data, our model could also predict the responses of a new drug to existing patients (cell lines), of available drugs to a new patient, or even of new drugs to new patients. These capabilities are more helpful in real clinical practice.

## Notes

### Acknowledgements

The authors thank the reviewers for their helpful comments.

### Funding

This work was supported by the National Natural Science Foundation of China (61572327 to XZ).

### Availability of data and materials

Our data and software are publicly available at https://zenodo.org/record/1403638#.W4FzDthKjBI. Gene expression profiles and drug response measures (activity area) for the CCLE dataset are available from the website (http://www.broadinstitute.org/ccle). Gene expression levels and drug response measures (IC50) for the GDSC dataset are available from the website (http://www.cancerrxgene.org/downloads). Chemical structure data for drugs are available from PubChem (http://pubchem.ncbi.nlm.nih.gov).

### Authors’ contributions

DW performed the computational experiments and wrote the original draft. CL contributed to data interpretation. XZ revised the manuscript critically and provided the funding. YL designed the framework of the model, analyzed experiment results and modified the manuscript. All authors read and approved the final manuscript.

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary material

## References

- 1. Rubin MA. Health: make precision medicine work for cancer care. Nature. 2015;520(7547):290–1.
- 2. Kohane IS. Health Care Policy. Ten things we have to do to achieve precision medicine. Science. 2015;349(6243):37–8.
- 3. Falgreen S, et al. Predicting response to multidrug regimens in cancer patients using cell line experiments and regularised regression models. BMC Cancer. 2015;15:235.
- 4. Menden MP, et al. Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One. 2013;8(4):e61318.
- 5. Bayer I, Groth P, Schneckener S. Prediction errors in learning drug response from gene expression data - influence of labeling, sample size, and machine learning algorithm. PLoS One. 2013;8(7):e70294.
- 6. Wang L, et al. Improved anticancer drug response prediction in cell lines using matrix factorization with similarity regularization. BMC Cancer. 2017;17(1):513.
- 7. Azuaje F. Computational models for predicting drug responses in cancer research. Brief Bioinform. 2017;18(5):820–9.
- 8. Neto EC, et al. The stream algorithm: computationally efficient ridge-regression via Bayesian model averaging, and applications to pharmacogenomic prediction of cancer cell line sensitivity. Pac Symp Biocomput. 2014:27–38.
- 9. Jang IS, et al. Systematic assessment of analytical methods for drug sensitivity prediction from cancer cell line data. Pac Symp Biocomput. 2014:63–74.
- 10. Fersini E, Messina E, Archetti F. A p-median approach for predicting drug response in tumour cells. BMC Bioinformatics. 2014;15:353.
- 11. Dong Z, et al. Anticancer drug sensitivity prediction in cell lines from baseline gene expression through recursive feature selection. BMC Cancer. 2015;15:489.
- 12. Daemen A, et al. Modeling precision treatment of breast cancer. Genome Biol. 2013;14(10):R110.
- 13. Geeleher P, Cox NJ, Huang RS. Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines. Genome Biol. 2014;15(3):R47.
- 14. Cancer Cell Line Encyclopedia Consortium, Genomics of Drug Sensitivity in Cancer Consortium. Pharmacogenomic agreement between two cancer cell line data sets. Nature. 2015;528(7580):84–7.
- 15. Niepel M, et al. Profiles of basal and stimulated receptor signaling networks predict drug response in breast cancer lines. Sci Signal. 2013;6(294):ra84.
- 16. Fey D, et al. Signaling pathway models as biomarkers: patient-specific simulations of JNK activity predict the survival of neuroblastoma patients. Sci Signal. 2015;8(408):ra130.
- 17. Ceol A, et al. Genome and network visualization facilitates the analyses of the effects of drugs and mutations on protein-protein and drug-protein networks. BMC Bioinformatics. 2016;17(Suppl 4):54.
- 18. Lee S, et al. Building the process-drug-side effect network to discover the relationship between biological processes and side effects. BMC Bioinformatics. 2011;12(Suppl 2):S2.
- 19. Stanfield Z, Coskun M, Koyuturk M. Drug response prediction as a link prediction problem. Sci Rep. 2017;7:40321.
- 20. Zhang N, et al. Predicting anticancer drug responses using a dual-layer integrated cell line-drug network model. PLoS Comput Biol. 2015;11(9):e1004498.
- 21. Costello JC, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat Biotechnol. 2014;32(12):1202–12.
- 22. Shivakumar P, Krauthammer M. Structural similarity assessment for drug sensitivity prediction in cancer. BMC Bioinformatics. 2009;10(Suppl 9):S17.
- 23. Barretina J, et al. The Cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7391):603–7.
- 24. Garnett MJ, et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature. 2012;483(7391):570–5.
- 25. O'Boyle NM, et al. Open babel: an open chemical toolbox. J Cheminform. 2011;3:33.
- 26. Willett P, Barnard JM, Downs GM. Chemical Similarity Searching. J Chem Inf Comput Sci. 1998;38(6):983–96.
- 27. Swamidass SJ, et al. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics. 2005;21(Suppl 1):i359–68.
- 28. Haibe-Kains B, et al. Inconsistency in large pharmacogenomic studies. Nature. 2013;504(7480):389–93.
- 29. Safikhani Z, et al. Revisiting inconsistency in large pharmacogenomic studies. F1000Res. 2016;5:2333.
- 30. Beretta L, Santaniello A. Nearest neighbor imputation algorithms: a critical evaluation. BMC Med Inform Decis Mak. 2016;16(Suppl 3):74.
- 31. Staunton JE, et al. Chemosensitivity prediction by transcriptional profiling. Proc Natl Acad Sci U S A. 2001;98(19):10787–92.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.