Abstract
Recently introduced high-throughput technologies are producing unprecedented volumes of biomedical data available for mining and analysis. However, early predictions of imminent breakthroughs in our understanding of human diseases, and of easy predictive diagnostics, turned out to be largely overoptimistic.
We argue that this situation is not coincidental but is caused by the statistical properties of the collected data. A typical high-throughput biological dataset is deeply imbalanced: the data matrix includes many measured quantities, or “levels,” in a relatively small number of subjects. Thus, any attempt to analyze these datasets is undermined by the so-called “curse of dimensionality,” which may be addressed by removing a majority of the variables. Feature selection aimed at increasing classification power may be done using data mining or correlation-based approaches. In this chapter, both theory-driven and data-driven approaches to dealing with complexity in biological systems are discussed in detail.
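The imbalance described above can be illustrated with a small simulation. The following is a minimal sketch, not the chapter's own method: it assumes pure Gaussian noise as the "measurements" and a balanced case/control label, and the helper names `pearson` and `best_noise_correlation` are hypothetical. It shows that when the number of measured quantities far exceeds the number of subjects, some noise feature will tend to correlate strongly with the label by chance alone.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_noise_correlation(n_subjects, n_features, rng):
    """Largest |correlation| between the label and any pure-noise feature."""
    # Assumption: balanced case/control labels (0, 1, 0, 1, ...).
    labels = [i % 2 for i in range(n_subjects)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is independent noise, so it carries no real signal.
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        best = max(best, abs(pearson(feature, labels)))
    return best


rng = random.Random(0)  # fixed seed so the run is repeatable
few = best_noise_correlation(n_subjects=20, n_features=10, rng=rng)
many = best_noise_correlation(n_subjects=20, n_features=5000, rng=rng)
print(f"best |r| among 10 noise features:   {few:.2f}")
print(f"best |r| among 5000 noise features: {many:.2f}")
```

With only 20 subjects, the best of thousands of noise features would typically look like a promising marker, which is why a feature chosen this way needs validation on independent subjects.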
Acknowledgment
The authors are grateful for the general support provided by the College of Science, George Mason University, by State Contract 14.607.21.0098 dated November 27, 2014 (Ministry of Science and Education, Russia), and by the Human Proteome Scientific Program of the Federal Agency of Scientific Organizations, Russia.
Copyright information
© 2015 Springer Science+Business Media Dordrecht
About this entry
Cite this entry
Veytsman, B., Baranova, A. (2015). High-Throughput Approaches to Biomarker Discovery and Challenges of Subsequent Validation. In: Preedy, V., Patel, V. (eds) General Methods in Biomarker Research and their Applications. Biomarkers in Disease: Methods, Discoveries and Applications. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-7696-8_20
DOI: https://doi.org/10.1007/978-94-007-7696-8_20
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-7695-1
Online ISBN: 978-94-007-7696-8
eBook Packages: Biomedical and Life Sciences; Reference Module Biomedical and Life Sciences