Abstract
Machine learning (ML) methods are still rarely used for gene expression- or mutation-based prediction of individual tumor responses to anticancer chemotherapy, because clinical case histories supplemented with high-throughput molecular data are relatively rare. This scarcity makes most ML methods highly vulnerable to overtraining. Recently, we proposed a novel hybrid global-local approach to ML, termed FLOating Window Projective Separator (FloWPS), that avoids extrapolation in the feature space and may improve the robustness of classifiers even for datasets with a limited number of preceding cases. FloWPS has been validated for the support vector machine (SVM) method, where it significantly improved the quality of classifiers. The core property of FloWPS is data trimming, i.e. sample-specific removal of features. Features that do not have a significant number of neighboring hits in the training dataset are considered irrelevant for a given sample and are removed from further analysis. In addition, for each point of a validation dataset, only the proximal points of the training dataset are taken into account. Thus, for every point of a validation dataset, the training dataset is adjusted to form a floating window. Here, we applied this approach to seven popular ML methods: SVM, k nearest neighbors (kNN), random forest (RF), Tikhonov (ridge) regression (RR), binomial naïve Bayes (BNB), adaptive boosting (ADA) and multi-layer perceptron (MLP). We performed computational experiments on 21 high-throughput, clinically annotated gene expression datasets comprising a total of 1778 cancer patients who either responded or did not respond to chemotherapy. The largest dataset contained 235 individual cases and the smallest 41. For global ML methods, such as SVM, RF, BNB, ADA and MLP, FloWPS substantially improved classifier quality. Namely, the area under the receiver operating characteristic curve (ROC AUC) for the responder vs non-responder classifier increased from a typical range of 0.65–0.85 to 0.80–0.95. On the other hand, FloWPS showed no benefit for purely local ML techniques such as kNN or RR. However, both of these local methods exhibited low sensitivity or low specificity in settings where false positive or false negative errors, respectively, should be avoided. According to the sensitivity-specificity criterion, across all the datasets tested, the best performance in combination with FloWPS data trimming was shown by the binomial naïve Bayes method, which can be valuable for the further development of predictors in personalized oncology.
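The trimming and floating-window steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `floating_window`, the parameters `m` and `k`, and the rule that a feature counts as relevant when at least `m` training values lie on each side of the sample's value are assumptions made for illustration, since the abstract does not specify them.

```python
import math


def floating_window(train, x, m, k):
    """Return (features, neighbors) for one validation sample x.

    train: list of training samples, each a list of floats
    x:     validation sample, a list of floats of the same length
    m:     assumed minimum number of training values required both below
           and above x[j] for feature j to be kept (hypothetical rule)
    k:     number of proximal training samples kept in the window
    """
    # Step 1: sample-specific feature removal. A feature is dropped when
    # x[j] sits at the edge of, or outside, the training values, so that
    # using it would require extrapolation.
    features = []
    for j in range(len(x)):
        below = sum(1 for row in train if row[j] < x[j])
        above = sum(1 for row in train if row[j] > x[j])
        if below >= m and above >= m:
            features.append(j)
    if not features:
        return [], []

    # Step 2: keep only the k training samples closest to x, measuring
    # Euclidean distance in the trimmed feature space.
    x_trimmed = [x[j] for j in features]

    def distance(row):
        return math.dist([row[j] for j in features], x_trimmed)

    order = sorted(range(len(train)), key=lambda i: distance(train[i]))
    return features, order[:k]


# Toy data: feature 1 of the sample lies far above every training value.
train = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [4.0, 13.0]]
features, neighbors = floating_window(train, [2.5, 20.0], m=1, k=2)
print(features, neighbors)  # [0] [1, 2]
```

The returned feature and sample indices define the reduced training set on which any of the classifiers listed above could then be fitted for that single validation point.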
Acknowledgements
The study was supported by the Russian Foundation for Basic Research, grant 19-29-01108.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Tkachev, V., Buzdin, A., Borisov, N. (2019). Flexible Data Trimming for Different Machine Learning Methods in Omics-Based Personalized Oncology. In: Bebis, G., Benos, T., Chen, K., Jahn, K., Lima, E. (eds) Mathematical and Computational Oncology. ISMCO 2019. Lecture Notes in Computer Science, vol 11826. Springer, Cham. https://doi.org/10.1007/978-3-030-35210-3_5
DOI: https://doi.org/10.1007/978-3-030-35210-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35209-7
Online ISBN: 978-3-030-35210-3
eBook Packages: Computer Science, Computer Science (R0)