Statistical Analysis on Microarray Data: Selection of Gene Prognosis Signatures

Cao, Kim-Anh Lê; McLachlan, Geoffrey J.

doi:10.1007/978-1-4419-0811-7_3

Kim-Anh Lê Cao & Geoffrey J. McLachlan

Part of the book series: Applied Bioinformatics and Biostatistics in Cancer Research (ABB)

Abstract

Microarrays are increasingly used in cancer research to better understand the molecular variations among tumours or other biological conditions. They allow tens of thousands of transcripts to be measured simultaneously in a single experiment. Analysing these data sets is a non-standard problem and a challenge for both statisticians and biologists, because the dimension of the feature space (the number of genes or transcripts) is much greater than the number of tissues. The selection of marker genes among thousands to diagnose a cancer type is therefore of crucial importance and can help clinicians develop gene-expression-based diagnostic tests to guide therapy in cancer patients. In this chapter, we focus on the classification and prediction of a sample given some carefully chosen gene expression profiles. We review some state-of-the-art machine learning approaches to gene selection: recursive feature elimination, nearest shrunken centroids and random forests. We discuss the difficulties that can be encountered when dealing with microarray data, such as selection bias and multiclass and unbalanced problems. The three approaches are then applied and compared on a typical cancer gene expression study.
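The selection bias mentioned in the abstract can be illustrated with a small sketch. It is not the chapter's own procedure: the data are synthetic noise with arbitrary class labels, the gene score is a simple scaled mean difference, the classifier is a plain (unshrunken) nearest-centroid rule, and all function names (`gene_score`, `select_genes`, `nearest_centroid_predict`, `loo_error`) are hypothetical. The sketch compares leave-one-out error when genes are selected once on all samples with the error when selection is repeated inside each fold.

```python
import random
import statistics


def gene_score(values, labels):
    """Absolute difference of the two class means, scaled by the overall spread."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = statistics.pstdev(values) + 1e-9  # guard against zero spread
    return abs(statistics.fmean(a) - statistics.fmean(b)) / spread


def select_genes(X, y, k):
    """Return the indices of the k top-scoring genes; X is samples x genes."""
    n_genes = len(X[0])
    scores = [gene_score([row[g] for row in X], y) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]


def nearest_centroid_predict(X_train, y_train, x_new, genes):
    """Assign x_new to the class whose centroid is closest on the chosen genes."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        centroid = [statistics.fmean([r[g] for r in rows]) for g in genes]
        dist = sum((x_new[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_error(X, y, k, select_inside):
    """Leave-one-out error rate; select_inside controls where genes are chosen."""
    genes_all = select_genes(X, y, k)  # sees every sample, including the test one
    errors = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_tr, y_tr, k) if select_inside else genes_all
        errors += nearest_centroid_predict(X_tr, y_tr, X[i], genes) != y[i]
    return errors / len(X)


# Synthetic example (assumption): pure noise, so no gene truly separates the classes.
random.seed(0)
n_samples, n_genes, k = 20, 300, 10
X = [[random.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
y = [0] * 10 + [1] * 10

biased = loo_error(X, y, k, select_inside=False)
unbiased = loo_error(X, y, k, select_inside=True)
print(f"selection outside cross-validation: error = {biased:.2f}")
print(f"selection inside cross-validation:  error = {unbiased:.2f}")
```

Because the labels carry no signal here, an honest error estimate should sit near chance level. Selecting genes on all samples before cross-validation lets the held-out sample influence the gene list, which tends to make the estimate optimistic.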

Author information

Authors and Affiliations

Department of Mathematics and Institute for Molecular Bioscience, University of Queensland, 4072 St Lucia, Queensland, Australia
Geoffrey J. McLachlan

Corresponding author

Correspondence to Geoffrey J. McLachlan.

Editor information

Editors and Affiliations

Discipline of Information Technology, James Cook University, Townsville, 4811, Australia
Tuan Pham

About this chapter

Cite this chapter

Cao, KA.L., McLachlan, G.J. (2009). Statistical Analysis on Microarray Data: Selection of Gene Prognosis Signatures. In: Pham, T. (ed) Computational Biology. Applied Bioinformatics and Biostatistics in Cancer Research. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-0811-7_3

DOI: https://doi.org/10.1007/978-1-4419-0811-7_3
Published: 01 September 2009
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-0810-0
Online ISBN: 978-1-4419-0811-7
eBook Packages: Biomedical and Life Sciences (R0)
