Abstract
Microarrays are being increasingly used in cancer research for a better understanding of the molecular variations among tumours or other biological conditions. They allow for the measurement of tens of thousands of transcripts simultaneously in one single experiment. The problem of analysing these data sets becomes non-standard and represents a challenge for both statisticians and biologists, as the dimension of the feature space (the number of genes or transcripts) is much greater than the number of tissues. Therefore, the selection of marker genes among thousands to diagnose a cancer type is of crucial importance and can help clinicians to develop gene-expression-based diagnostic tests to guide therapy in cancer patients. In this chapter, we focus on the classification and the prediction of a sample given some carefully chosen gene expression profiles. We review some state-of-the-art machine learning approaches to perform gene selection: recursive feature elimination, nearest-shrunken centroids and random forests. We discuss the difficulties that can be encountered when dealing with microarray data, such as selection bias, multiclass and unbalanced problems. The three approaches are then applied and compared on a typical cancer gene expression study.
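The selection bias mentioned above can be illustrated with a minimal sketch; it is not taken from the chapter. It assumes pure-noise expression data (no gene carries class information), a simple mean-difference ranking in place of the three reviewed methods, and a nearest-centroid rule without shrinkage. All function names and parameter values are hypothetical. When genes are selected once on the full data set and the error is then cross-validated, the estimate tends to be optimistic; when selection is repeated inside each training fold, the estimate should stay close to chance level.

```python
import random


def top_genes(X, y, k):
    """Rank genes by absolute difference of class means; return the k best indices."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


def centroid_predict(X_train, y_train, x, genes):
    """Nearest-centroid rule (no shrinkage) restricted to the selected genes."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = 0.0
        for g in genes:
            centroid = sum(r[g] for r in rows) / len(rows)
            dist += (x[g] - centroid) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(X, y, k, n_folds, select_inside):
    """Cross-validated error rate.

    select_inside=False: genes are chosen once using all samples (biased).
    select_inside=True:  genes are re-selected on each training fold.
    """
    n = len(X)
    folds = [list(range(i, n, n_folds)) for i in range(n_folds)]
    genes_all = None if select_inside else top_genes(X, y, k)
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        genes = top_genes(X_tr, y_tr, k) if select_inside else genes_all
        errors += sum(
            centroid_predict(X_tr, y_tr, X[i], genes) != y[i] for i in test_idx
        )
    return errors / n


# Simulated noise data: 40 tissues, 1000 genes, two balanced classes.
rng = random.Random(0)
n_samples, n_genes = 40, 1000
X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]

biased = cv_error(X, y, k=10, n_folds=5, select_inside=False)
honest = cv_error(X, y, k=10, n_folds=5, select_inside=True)
print(f"selection outside CV: {biased:.2f}")
print(f"selection inside CV:  {honest:.2f}")
```

Because the labels here are unrelated to the data, any error rate clearly below 0.5 from the first estimate reflects the selection step having already seen the held-out samples, not real predictive ability.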
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Cao, KA.L., McLachlan, G.J. (2009). Statistical Analysis on Microarray Data: Selection of Gene Prognosis Signatures. In: Pham, T. (eds) Computational Biology. Applied Bioinformatics and Biostatistics in Cancer Research. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-0811-7_3
DOI: https://doi.org/10.1007/978-1-4419-0811-7_3
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-0810-0
Online ISBN: 978-1-4419-0811-7
eBook Packages: Biomedical and Life Sciences (R0)