Abstract
Support vector machines (SVMs) are widely used, state-of-the-art classifiers in molecular diagnostics. However, little work has been done on analyzing their overfitting, although such analysis is needed to avoid deceptive diagnostic results. In this work, we investigate this problem and prove that an SVM classifier inevitably overfits gene expression array data under a standard Gaussian kernel, because of the large data variations built in by the DNA amplification mechanism in transcriptional profiling. We find that SVM exhibits its own special overfitting characteristics on array data. We also show that feature selection algorithms may not help overcome overfitting, and we discuss overfitting in biomarker discovery algorithms.
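The abstract states the Gaussian-kernel result without giving the proof. As a rough illustration only, and not the paper's own argument, the sketch below shows one well-known way a Gaussian kernel can degenerate on data with large variations: when squared distances between samples are huge relative to the bandwidth, the kernel matrix collapses to the identity, so each training sample is similar only to itself. The sample counts, the expression scale (mean 1000, standard deviation 500), the bandwidth values, and all function names are assumptions chosen for the demonstration.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(samples, sigma=1.0):
    """Pairwise kernel values for a list of samples."""
    return [[gaussian_kernel(a, b, sigma) for b in samples] for a in samples]


def max_off_diagonal(K):
    """Largest similarity between two distinct samples."""
    n = len(K)
    return max(K[i][j] for i in range(n) for j in range(n) if i != j)


# Synthetic stand-in for raw array intensities (hypothetical scale, not real data).
rng = random.Random(0)
n_samples, n_genes = 10, 200
raw = [[rng.gauss(1000.0, 500.0) for _ in range(n_genes)]
       for _ in range(n_samples)]

# Unit bandwidth: squared distances are enormous, so exp(...) underflows to 0.
K_unit = kernel_matrix(raw, sigma=1.0)

# Bandwidth comparable to the typical pairwise distance (assumed value).
K_wide = kernel_matrix(raw, sigma=1.0e4)

print("max off-diagonal, sigma=1:  ", max_off_diagonal(K_unit))
print("max off-diagonal, sigma=1e4:", max_off_diagonal(K_wide))
```

With the unit bandwidth, every off-diagonal entry is numerically zero, so a kernel classifier can separate the training samples perfectly while carrying almost no information about unseen samples. With the wider bandwidth, the similarities become non-trivial again. Whether this matches the mechanism proved in the paper cannot be confirmed from the abstract alone.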
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Han, H. (2014). Analyzing Support Vector Machine Overfitting on Microarray Data. In: Huang, DS., Han, K., Gromiha, M. (eds) Intelligent Computing in Bioinformatics. ICIC 2014. Lecture Notes in Computer Science, vol 8590. Springer, Cham. https://doi.org/10.1007/978-3-319-09330-7_19
DOI: https://doi.org/10.1007/978-3-319-09330-7_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09329-1
Online ISBN: 978-3-319-09330-7
eBook Packages: Computer Science, Computer Science (R0)