Application of transfer learning for cancer drug sensitivity prediction
Abstract
Background
In precision medicine, the scarcity of suitable biological data often hinders the design of an appropriate predictive model. In this regard, large-scale pharmacogenomic studies, such as CCLE and GDSC, hold the promise of mitigating the issue. However, one cannot directly employ data from multiple sources together due to the distribution shift between datasets. One way to solve this problem is to utilize transfer learning methodologies tailored to this specific context.
Results
In this paper, we present two novel approaches for incorporating information from a secondary database to improve prediction in a target database. The first approach is based on latent variable cost optimization and the second considers polynomial mapping between the two databases. Utilizing the CCLE and GDSC databases, we illustrate that the proposed approaches accomplish better prediction of drug sensitivities across different scenarios as compared to the existing approaches.
Conclusion
We have compared the performance of the proposed predictive models with database-specific individual models as well as existing transfer learning approaches. We note that our proposed approaches exhibit superior performance compared to the above-mentioned alternative techniques for predicting sensitivity to different anticancer compounds, with the nonlinear mapping model showing the best overall performance.
Keywords
Drug sensitivity prediction; Pharmacogenomic studies; CCLE; GDSC; Transfer learning; Nonlinear mapping; Latent variable; Cost optimization
Abbreviations
AUC: Area under the curve
CCLE: Cancer cell line encyclopedia
CLP: Combined latent prediction
GDSC: Genomics of drug sensitivity in cancer
LLP: Latent-latent prediction
LRP: Latent regression prediction
MP: Mapped prediction
NRMSE: Normalized root mean squared error
RF: Random forest
TL: Transfer learning
Background
A consistent challenge in precision medicine is to design appropriate models for predicting the sensitivity of a tumor to an anticancer compound with high accuracy. In this aspect, large-scale pharmacogenomic studies of cancer genomes have provided unprecedented insights for studying anticancer therapeutics and determining putative predictors of drug sensitivity. The Genomics of Drug Sensitivity in Cancer (GDSC) [1] of the Cancer Genome Project and the Cancer Cell Line Encyclopedia (CCLE) [2] from the Broad Institute are two such studies, where drug sensitivity profiles and genomic information across hundreds of compounds and cancer cell lines have been systematically gathered. There exist significant overlaps between the two databases, which can be further utilized in designing more accurate sensitivity prediction models. Biological data for designing suitable predictive models are frequently scarce, and therefore the availability of a secondary dataset often holds the promise of better model development. However, the majority of machine learning approaches used in drug sensitivity prediction follow the inherent assumption that both training data and test data are in the same feature space with the same distribution. When training and test data, despite being in the same feature space, exhibit different distributions, one needs to take the distribution shift into account. This is where transfer learning (TL) methodologies come into play [3].
Often in a TL environment, the source and target domains can be considered as linked subspaces that are part of a high-level common domain space [4]. We therefore need to assume that there exists some consistency between the different datasets to be utilized in TL. Haibe-Kains et al. [5] first pointed out that, although the gene expression data from the CCLE and GDSC databases are well correlated with each other, the measured pharmacological drug responses using common estimators such as IC_{50} and the area under the curve (AUC) are unexpectedly highly discordant. In response, the CCLE and GDSC investigators performed their own analysis [6] and presented results opposing the conclusions in [5]. They pointed out that for the majority of drugs, the exhibited AUC and IC_{50} distributions are dominated by drug-insensitive lines with a much smaller number of outliers, and postulated that differences in cell line biology between the studies resulted in the poor correlation. Accounting for these factors, they demonstrated significantly improved correlation for most of the drugs. In any event, the fact that both databases provide information about the same biological process makes them suitable candidates for applying transfer learning methodologies.
In the case of inconsistent data with different distributions for training and test sets, various TL approaches [3] have been attempted to handle dataset shift. Unsupervised methods such as INSPIRE (INferring Shared modules from multiPle gene expREssion datasets) [7] focus primarily on expression datasets to extract a low-dimensional representation and predict tumor phenotypes using regularized regression approaches. Inductive transfer learning (ITL) approaches, as in [8], tackle the problem of prediction from scarce primary data using a secondary dataset through importance sampling, i.e., reweighting the secondary distribution toward the primary one. While the primary data size is assumed to be significantly smaller than the secondary, for a large number of unlabeled samples one has to adapt to covariate shift along with ITL. Boosting-based approaches such as Dynamic-TrAdaBoost [9] apply ensemble methods to both source and target instances and then employ an update mechanism incorporating only the source instances useful for the target task, with an additional dynamic correction factor. Kernel-based ITL methods [10, 11] focus on finding an appropriate kernel for the newly available data, modeling the difference with the existing data as a problem of finding a suitable bias.

In this paper, we consider the following two TL approaches:

Cost optimization-based approach, where we employ latent variable models to extract the variables underlying the different datasets. In this case, TL can be applied only to the output (Fig. 1(a)), as in the parameter transfer approach [12, 13], or to both model input and output (Fig. 1(b)), as in [14, 15].

Domain transfer approach, where we design maps between databases to transfer data from the primary domain to the secondary, and utilize the secondary data to improve the prediction model. Here, TL is applied to both input and output (Fig. 1(c)), as in the instance transfer approach [14, 15].
To summarize, the key contribution of this paper is the implementation of two TL-based approaches, in which the target (primary) data are either transferred to a common latent variable space along with the source (secondary) data, or transferred to the source domain through nonlinear mapping, in order to improve prediction for the limited primary data using the available secondary data.
Results
To evaluate the performance of our transfer learning algorithms, we have initially retrieved the data common to both CCLE and GDSC. Between GDSC (v6.0) and CCLE, there are 15,664 common genes available in 623 common cell lines, along with 15 common drugs. We have performed a drug-wise analysis and found that the number of cell lines decreases from 623 after incorporating the available drug sensitivity values, resulting in datasets with between 91 and 310 cell lines, along with the 15,664 genes and corresponding sensitivity measures. For analyses involving gene expression, we have used ReliefF [16] to select the top 200 genes from each dataset and taken the intersection as the final feature set. For the drug sensitivity measure, we have used the AUC values, as they show more concordance between databases (median ρ_{s}=0.34) than IC_{50} (median ρ_{s}=0.28) [5]. Note that, in spite of our discussion of the inconsistencies between databases, the main goal here is to consider the scenario where only a small portion of database 1 (i.e., GDSC) is available while the entire database 2 (i.e., CCLE) is available, and we would like to use database 2 to improve the prediction performance for the rest of database 1. Thus, for evaluation, we will use the GDSC experimental AUCs as the gold standard and compare them with the predicted AUCs.
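The two figures of merit reported in the tables below, Pearson correlation and NRMSE, can be computed as in the following minimal NumPy sketch. The function names are ours, and normalizing the RMSE by the observed range is one common convention; the paper does not spell out its exact normalizer.

```python
import numpy as np

def pearson(y_true, y_pred):
    # Pearson correlation between observed and predicted sensitivities
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float(yt @ yp / np.sqrt((yt @ yt) * (yp @ yp)))

def nrmse(y_true, y_pred):
    # RMSE normalized by the range of the observed values (an assumed
    # convention; the paper does not state its normalizer explicitly)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (y_true.max() - y_true.min()))
```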
Latent variable cost optimization approach
Table 1 Comparison of K-fold cross-validation performance for four GDSC drug sensitivity prediction approaches – Latent Regression Prediction (LRP), Latent-Latent Prediction (LLP), Combined Latent Prediction (CLP) and Direct Prediction (DP), using data from CCLE
Drug  Pearson Correlation (LRP / LLP / CLP / DP)  NRMSE (LRP / LLP / CLP / DP)
17-AAG  0.5441  0.4691  0.6382  0.4591  0.2117  0.2147  0.1930  0.2164
AZD6244  0.3988  0.4155  0.4524  0.4008  0.1833  0.1718  0.1684  0.1703 
Nilotinib  0.9053  0.3886  0.8768  0.4524  0.0728  0.1295  0.0888  0.1242 
Nutlin-3  0.4093  0.5473  0.5646  0.5108  0.1965  0.1756  0.1745  0.1799
PD0325901  0.6448  0.4502  0.6606  0.4465  0.1614  0.1870  0.1585  0.1878 
PD0332991  0.2497  0.0912  0.2540  0.0884  0.1695  0.1729  0.1672  0.1733 
PLX4720  0.5682  0.5040  0.6384  0.5001  0.1237  0.1290  0.1173  0.1291 
Domain transfer approach
Table 2 Comparison of K-fold cross-validation performance for three GDSC drug sensitivity prediction approaches – Mapped Prediction (MP), CCLE model Prediction (CP) and Direct Prediction (DP), using data from CCLE
Drug  Pearson Correlation (MP / CP / DP)  NRMSE (MP / CP / DP)
17-AAG  0.6062  0.4354  0.4591  0.2112  0.3073  0.2164
AZD6244  0.4692  0.3580  0.3579  0.1683  0.2173  0.1743 
Nilotinib  0.8698  0.7957  0.4524  0.1093  0.1323  0.1242 
Nutlin-3  0.5606  0.3102  0.5114  0.1852  0.2180  0.1808
PD0325901  0.6132  0.5731  0.4224  0.1689  0.1875  0.1865 
PD0332991  0.0923  0.0305  0.0802  0.1748  0.1764  0.1755 
PLX4720  0.6335  0.6135  0.5001  0.1242  0.1590  0.1291
Discussion
From Table 1, it is evident that the CLP method yields the best performance. Additionally, even though the LLP method often yields better results than DP, it frequently underperforms LRP. Overall, six out of the seven drugs yield the best performance with the CLP method, while only Nilotinib performs best with LRP. The prediction performance is similar in the reverse direction (i.e., CCLE as the primary set and GDSC as the secondary), where five out of seven drugs show the best performance with CLP.
For the domain transfer approach, it is evident from Table 2 that the MP approach performs significantly better than both CP and DP. Furthermore, the performance of the CP approach is much worse compared to either MP or DP, which can be attributed to the distribution shift that exists between the CCLE and GDSC data in general. Note that among the seven drugs, 17-AAG and PD0325901 have moderate concordance (0.5≤ρ_{s}<0.6) while AZD6244, Nutlin-3 and PD0332991 have poor concordance (ρ_{s}<0.4) between databases. For PLX4720 and Nilotinib, there exists moderate to high consistency in terms of Pearson correlation (ρ=0.57 and ρ=0.88, respectively), although the rank correlation is low (ρ_{s}=0.29 and ρ_{s}≈0.1, respectively). We have also implemented a model that uses the ensemble of available CCLE and GDSC data directly for training and predicts for the unlabeled GDSC expression data, referred to as the Combined Model Prediction. An additional section provides a detailed description and comparative analysis of this model against the MP approach [see Additional file 1].
Comparison with inductive transfer learning
Table 3 Comparison of prediction performance for the DITL-KRR approach with the Mapped Prediction (MP) and Direct Prediction (DP) approaches for four common drugs
Drug  Number of features  Pearson Correlation (MP / DP / DITL-KRR)  NRMSE (MP / DP / DITL-KRR)
17-AAG  47  0.6319  0.4749  0.2885  0.1942  0.2167  0.4056
AZD6244  49  0.4407  0.4016  0.1468  0.1554  0.1570  0.2042 
Nilotinib  35  0.9338  0.4674  0.1701  0.1003  0.1257  0.1410 
Nutlin-3  48  0.5921  0.5207  0.1500  0.1881  0.1903  0.2697
Conclusions
In precision medicine, data from multiple large pharmacological studies can be utilized to design better predictive models. In this regard, transfer learning is employed to account for the distribution shift between the primary and secondary datasets. In this paper, we have proposed two different TL approaches to incorporate data from two large studies, i.e., CCLE and GDSC, for designing a better predictive model. In the first approach, we have used a latent variable model and then optimized appropriate cost functions to obtain a pertinent prediction model. The second method uses a nonlinear mapping of both genomic and sensitivity data to transfer the primary data to the secondary domain space and performs prediction utilizing the secondary datasets. Both methods show marked improvement in drug sensitivity prediction compared to direct prediction and existing TL approaches, while the mapping approach shows the best overall performance.
Applicability of drug sensitivity prediction approaches for matched and unmatched pairs of sets between databases
Prediction Approach  Applicability (Matched / Unmatched)
Direct Prediction  Yes  Yes 
Latent Regression Prediction  Yes  No 
Latent-Latent Prediction  Yes  Yes 
Combined Latent Prediction  Yes  No 
Mapped Prediction (Domain Transfer)  Yes  Yes 
Direct Inductive Transfer Learning  Yes  Yes 
Furthermore, in Mapped Prediction, the drug sensitivity mapping between databases using polynomials is drug-dependent and thus vulnerable to user error. One potential next step is to model the map to be robust against outliers. Another development would be investigating the effect of model stacking with the proposed approaches.
Methods
Latent variable cost optimization approach
In this section, our goal is to analyze the transfer learning approach from the viewpoint of cost function optimization. Here, the assumption is that if there exists a way to transfer data from both CCLE and GDSC to a common space, then the information available in both databases can be incorporated together to yield a better overall performance [3]. Therefore, it can be inferred that in a suitable common space, the individual concordance between the common set (i.e., the underlying latent variable) and each dataset will be maximized, and the reconstruction errors from the common set will be minimized. This is the rationale behind the cost function optimization approach.
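As a toy illustration of this rationale (not the paper's actual cost function, whose form is given in (1)), one can fit a shared latent vector w and dataset-specific regression coefficients a_1, a_2 to two matched sensitivity vectors y_1, y_2 by alternating least squares on the simplified objective ||y_1 − a_1·w||² + ||y_2 − a_2·w||²:

```python
import numpy as np

def shared_latent(y1, y2, n_iter=100):
    # Alternating least squares for the illustrative cost
    #   ||y1 - a1*w||^2 + ||y2 - a2*w||^2
    # (a toy stand-in for cost (1); not the paper's exact objective)
    w = (y1 + y2) / 2.0
    for _ in range(n_iter):
        a1 = float(w @ y1 / (w @ w))  # optimal coefficient for dataset 1
        a2 = float(w @ y2 / (w @ w))  # optimal coefficient for dataset 2
        w = (a1 * y1 + a2 * y2) / (a1 ** 2 + a2 ** 2)  # optimal shared latent
    return w, a1, a2
```

When the two vectors are exactly proportional, the fitted reconstructions a_1·w and a_2·w recover y_1 and y_2 up to the inherent scale ambiguity between w and the coefficients.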
Drug sensitivity prediction via cost optimization of sensitivity data
If only a part of the CCLE drug sensitivity response is known along with a larger portion of the GDSC sensitivity set, then this whole process can be utilized for the prediction of CCLE responses by interchanging the GDSC and CCLE values.
We have also implemented a kNN regression based transfer learning approach for sensitivity prediction [see Additional file 1], which is computationally inexpensive but often underperforms the LRP approach. We then applied an iterative update scheme to improve the performance of the kNN approach and combined the updated kNN model with the LRP model [see Additional file 1]. The combined model shows performance similar to the LRP model.
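The actual kNN scheme is described in Additional file 1; as a simplified stand-in, the following sketch predicts the response of a primary-domain expression profile by averaging the responses of its k nearest neighbours in the secondary dataset:

```python
import numpy as np

def knn_transfer_predict(x_query, X_sec, y_sec, k=5):
    # Average the responses of the k nearest secondary-dataset
    # neighbours (Euclidean distance in gene expression space);
    # a simplified stand-in for the scheme in Additional file 1
    d = np.linalg.norm(X_sec - x_query, axis=1)
    nearest = np.argsort(d)[:k]
    return float(y_sec[nearest].mean())
```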
Drug sensitivity prediction via cost optimization of genomic and sensitivity data
In this section, we have utilized both gene expression and AUC data in the cost optimization to improve drug sensitivity prediction. Here, the goal is to establish a relationship between the two underlying latent variables corresponding to the gene expression and AUC datasets, respectively, and then exploit this relationship for the prediction of unknown AUC values. This method is termed Latent-Latent Prediction (LLP) since it involves the prediction of one latent variable from another. Figure 3 illustrates the use of the LLP method for drug sensitivity prediction. Again, we assume that only a small portion, y_{11}, of the GDSC AUC set, y_{1}, is known. Then, the corresponding CCLE AUC set, y_{21}, is used with y_{11} to perform the cost optimization in (1) to generate the latent vector w_{1} and the regression coefficients a_{1}, a_{2}.
where V_{k} = [1 v_{k}] and v_{k} = [1 x_{1k} x_{2k}]λ_{k}, with 1 denoting a column of ones.
We have used Random Forest (RF) [18, 20] as our prediction model here. If only a part of the CCLE drug sensitivity response is known along with a larger portion of the GDSC sensitivity set, then this whole process can be utilized for the prediction of CCLE responses by interchanging the GDSC and CCLE values.
Combined latent drug sensitivity prediction
The whole process is referred to as Combined Latent Prediction (CLP). Comparison among the three optimization-based approaches shows that the combined method performs the best, while the LLP approach often underperforms LRP.
Domain transfer approach
In this section, our goal is to analyze whether the dependency structure between CCLE and GDSC can be modeled using a common mapping across different cell lines. The hypothesis is that if there exists a common mapping by which the data from one domain can be shifted to the other, then the additional information available in the second database can easily be transferred to the first to produce an overall better performance [3]. For the analysis, we have considered a polynomial regression mapping [21] and selected the polynomial order by utilizing the Spearman rank correlation (ρ_{s}) between each pair of datasets from the two databases. This analysis indicates high concordance for gene expression data between databases but poor consistency for drug sensitivity measures such as AUC or IC_{50} [5].
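For reference, the Spearman rank correlation used in this concordance analysis is simply the Pearson correlation of rank-transformed values. A minimal sketch (assuming distinct values, so no tie handling is needed; production code should use average ranks for ties):

```python
import numpy as np

def spearman(x, y):
    # Rank-transform each vector (argsort of argsort gives 0-based
    # ranks for distinct values), then take the Pearson correlation
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because it depends only on ranks, this measure is invariant to any monotone transformation of either dataset, which is why it is a natural criterion for judging concordance between databases with different response scales.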
Gene expression mapping
Comparison of performance of gene expression mapping for two common drugs
Drug  Number of genes  Number of Test cell lines  Pearson Correlation with CCLE (Original GDSC / Mapped GDSC)  Reconstruction MSE
17AAG  371  259  0.8729  0.9406  0.8256 
AZD6244  383  245  0.8486  0.9405  0.6297 
Drug sensitivity mapping
where d̂_{1j} denotes the mapped drug sensitivity dataset for the jth drug, D_{2j} = [1 d_{2j} d_{2j}^{2}] is the design matrix (with 1 denoting a column of ones), β contains the regression coefficients quantifying the strength of the association, and ε_{n×1} is the mapping error.
Note that, out of the 15 common drugs, three have moderate consistency (0.5≤ρ_{s}<0.6) between databases, three have fair consistency (0.4≤ρ_{s}<0.5), and the rest have poor consistency (ρ_{s}<0.4). Figure 5 illustrates the effect of mapping the AUC values from the CCLE to the GDSC space for the drug AZD6244, which has poor consistency between databases (ρ_{s}=0.26).
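Given a design matrix of this order-2 form, the map can be fit by ordinary least squares. A minimal NumPy sketch (the function names are ours, not from the paper's code):

```python
import numpy as np

def fit_quadratic_map(d2, d1):
    # Least-squares fit of d1 ~ D2 @ beta with D2 = [1, d2, d2^2],
    # matching the order-2 design matrix described in the text
    D2 = np.column_stack([np.ones_like(d2), d2, d2 ** 2])
    beta, *_ = np.linalg.lstsq(D2, d1, rcond=None)
    return beta

def apply_map(beta, d2):
    # Transfer secondary-domain sensitivities to the primary domain
    D2 = np.column_stack([np.ones_like(d2), d2, d2 ** 2])
    return D2 @ beta
```

In practice the map is fit on the cell lines common to both databases and then applied to the remaining secondary-database sensitivities.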
Comparison of performance of drug sensitivity (AUC) mapping for two common drugs
Drug  Number of Test cell lines  Pearson Correlation with GDSC (Original CCLE / Mapped CCLE)  Reconstruction MSE
17-AAG  259  0.5176  0.5232  0.0330
AZD6244  245  0.4022  0.3267  0.0177 
Drug sensitivity prediction using nonlinear mapping
where A_{0}, A_{1} are defined in (16).
The whole process is referred to as the Mapped Prediction (MP) of GDSC data. Furthermore, if only a part of the CCLE gene expression data is available with corresponding drug sensitivity values, along with a larger portion of labeled GDSC data, then this whole process can be utilized for the prediction of CCLE sensitivity by interchanging the GDSC and CCLE values. For prediction using gene expression, we have used a Bias-Corrected Random Forest (BCRF) [19, 22] model, where the effect of bias correction is measured using the residual angle [23].
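The specific BCRF construction follows [19, 22]; as a generic illustration of the idea of linear bias correction, one can regress the observed responses on a base learner's held-out predictions (e.g., out-of-bag RF predictions) and apply the fitted line to new predictions. The base learner itself is abstracted away in this sketch:

```python
import numpy as np

def bias_correct(y_train, yhat_train, yhat_test):
    # Fit y ~ c0 + c1 * yhat on training predictions, then apply the
    # fitted line to new predictions. This removes systematic linear
    # bias (e.g., RF shrinkage toward the mean); it is a generic
    # illustration, not the exact BCRF procedure of [19, 22].
    A = np.column_stack([np.ones_like(yhat_train), yhat_train])
    c, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return c[0] + c[1] * yhat_test
```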
Notes
Acknowledgments
Not applicable.
Funding
This work was supported by NIH grant R01GM122084-01. The publication costs of this article were funded by NIH grant R01GM122084.
Availability of data and materials
For the analysis of transfer learning, the MATLAB codes are available in the following link: https://github.com/dhruba018/Transfer_Learning_Precision_Medicine, while the primary and secondary gene expression and area under the curve data are from the Genomics of Drug Sensitivity in Cancer repository, http://www.cancerrxgene.org/ and Cancer Cell Line Encyclopedia https://portals.broadinstitute.org/ccle, respectively.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 17, 2018: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2018: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-17.
Authors’ contributions
SRD, RR, SG and RP conceived of and designed the experiments. SRD and RR performed the experiments. SRD and RP analyzed the data. SRD, RR, KM, SG and RP wrote the paper. All authors have read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
References
1. Yang W, Soares J, Greninger P, Edelman EJ, Lightfoot H, Forbes S, Bindal N, Beare D, Smith JA, Thompson IR, et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 2013;41(D1):955–61.
2. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, Wilson CJ, Lehár J, Kryukov GV, Sonkin D, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7391):603–7.
3. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2010;22(10):1345–59.
4. Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3(1):9.
5. Haibe-Kains B, El-Hachem N, Birkbak NJ, Jin AC, Beck AH, Aerts HJ, Quackenbush J. Inconsistency in large pharmacogenomic studies. Nature. 2013;504(7480):389–93.
6. The Cancer Cell Line Encyclopedia Consortium, The Genomics of Drug Sensitivity in Cancer Consortium. Pharmacogenomic agreement between two cancer cell line data sets. Nature. 2015;528(7580):84–87.
7. Celik S, Logsdon BA, Battle S, Drescher CW, Rendi M, Hawkins RD, Lee SI. Extracting a low-dimensional description of multiple gene expression datasets reveals a potential driver for tumor-associated stroma in ovarian cancer. Genome Med. 2016;8(1):66.
8. Garcke J, Vanck T. Importance weighted inductive transfer learning for regression. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Berlin: Springer; 2014. p. 466–81.
9. Al-Stouhi S, Reddy C. Adaptive boosting for transfer learning using dynamic updates. In: Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD'11), Part I. Berlin: Springer-Verlag; 2011. p. 60–75.
10. Rückert U, Kramer S. Kernel-based inductive transfer. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Berlin: Springer; 2008. p. 220–33.
11. Sugiyama M, Kawanabe M. Machine Learning in Non-stationary Environments: Introduction to Covariate Shift Adaptation. Cambridge: MIT Press; 2012. p. 48–71.
12. Bonilla EV, Chai KM, Williams C. Multi-task Gaussian process prediction. In: Advances in Neural Information Processing Systems. USA: Curran Associates Inc.; 2008. p. 153–60.
13. Gao J, Fan W, Jiang J, Han J. Knowledge transfer via multiple model local structure mapping. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2008. p. 283–91.
14. Jiang J, Zhai C. Instance weighting for domain adaptation in NLP. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, vol. 7. Prague: Association for Computational Linguistics; 2007. p. 264–71.
15. Liao X, Xue Y, Carin L. Logistic regression with an auxiliary data source. In: Proceedings of the 22nd International Conference on Machine Learning. New York: ACM; 2005. p. 505–12.
16. Kira K, Rendell LA. The feature selection problem: traditional methods and a new algorithm. In: Proceedings of the 10th National Conference on Artificial Intelligence, vol. 2. San Jose: AAAI Press / The MIT Press; 1992. p. 129–34.
17. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
18. Rahman R, Otridge J, Pal R. IntegratedMRF: random forest-based framework for integrating prediction from different data types. Bioinformatics. 2017;33(9):1407–10.
19. Song J. Bias corrections for random forest in regression using residual rotation. J Korean Stat Soc. 2015;44(2):321–6.
20. Rahman R, Haider S, Ghosh S, Pal R. Design of probabilistic random forests with applications to anticancer drug sensitivity prediction. Cancer Inform. 2015;14(Suppl 5):57.
21. Draper NR, Smith H. Applied Regression Analysis. New York: Wiley; 1966.
22. Zhang G, Lu Y. Bias-corrected random forests in regression. J Appl Stat. 2012;39(1):151–60.
23. Matlock K, De Niz C, Rahman R, Ghosh S, Pal R. Investigation of model stacking for drug sensitivity prediction. In: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM; 2017. p. 772.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.