Abstract
Powerful machine learning tools exist to extract biological patterns for diagnosis or prediction from high-dimensional datasets. Simultaneous advances in high-throughput profiling technologies have led to a rapid acceleration of biomarker discovery investigations across all areas of medicine. However, the translation of biomarker signatures into clinically useful tools has thus far been difficult. In this chapter, several important considerations are discussed that influence such translation in the context of classifier design. These include aspects of variable selection that go beyond classification accuracy, as well as effects of variability on assay stability and sample size. The consideration of such factors may lead to an adaptation of biomarker discovery approaches, aimed at an optimal balance of performance and clinical translatability.
Acknowledgments
This study was supported by the DFG Emmy-Noether-Program SCHW 1768/1-1.
Copyright information
© 2017 Springer Science+Business Media LLC
About this protocol
Cite this protocol
Schwarz, E. (2017). Identification and Clinical Translation of Biomarker Signatures: Statistical Considerations. In: Guest, P.C. (eds) Multiplex Biomarker Techniques. Methods in Molecular Biology, vol 1546. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-6730-8_6
DOI: https://doi.org/10.1007/978-1-4939-6730-8_6
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-6729-2
Online ISBN: 978-1-4939-6730-8
eBook Packages: Springer Protocols