Statistical Analysis on Microarray Data: Selection of Gene Prognosis Signatures

Computational Biology

Abstract

Microarrays are being increasingly used in cancer research for a better understanding of the molecular variations among tumours or other biological conditions. They allow for the measurement of tens of thousands of transcripts simultaneously in one single experiment. The problem of analysing these data sets becomes non-standard and represents a challenge for both statisticians and biologists, as the dimension of the feature space (the number of genes or transcripts) is much greater than the number of tissues. Therefore, the selection of marker genes among thousands to diagnose a cancer type is of crucial importance and can help clinicians to develop gene-expression-based diagnostic tests to guide therapy in cancer patients. In this chapter, we focus on the classification and the prediction of a sample given some carefully chosen gene expression profiles. We review some state-of-the-art machine learning approaches to perform gene selection: recursive feature elimination, nearest-shrunken centroids and random forests. We discuss the difficulties that can be encountered when dealing with microarray data, such as selection bias, multiclass and unbalanced problems. The three approaches are then applied and compared on a typical cancer gene expression study.
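
The selection bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the chapter's own procedure: it assumes pure-noise data, a simple mean-difference filter in place of the three methods reviewed, and a plain nearest-centroid rule, and all function names (top_genes, predict_nearest_centroid, loo_accuracy) are hypothetical. It compares leave-one-out accuracy when genes are selected once on all samples with the accuracy obtained when selection is repeated inside each fold.

```python
import random


def top_genes(X, y, k):
    """Return indices of the k genes with the largest absolute difference
    between the two class means (a simple filter score, assumed here)."""
    rows0 = [x for x, c in zip(X, y) if c == 0]
    rows1 = [x for x, c in zip(X, y) if c == 1]
    scores = []
    for g in range(len(X[0])):
        m0 = sum(r[g] for r in rows0) / len(rows0)
        m1 = sum(r[g] for r in rows1) / len(rows1)
        scores.append(abs(m0 - m1))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]


def predict_nearest_centroid(X, y, genes, x_new):
    """Assign x_new to the class whose centroid, restricted to the
    selected genes, is closest in squared Euclidean distance."""
    best_class, best_dist = None, float("inf")
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        centroid = [sum(r[g] for r in rows) / len(rows) for g in genes]
        dist = sum((x_new[g] - m) ** 2 for g, m in zip(genes, centroid))
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class


def loo_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy. If select_inside is False, genes are chosen
    once using all samples, so the held-out sample influences the selection."""
    fixed = None if select_inside else top_genes(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_genes(X_tr, y_tr, k) if select_inside else fixed
        if predict_nearest_centroid(X_tr, y_tr, genes, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Pure-noise data: no gene carries any real class information.
rng = random.Random(0)
n, p, k = 30, 1000, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [0] * (n // 2) + [1] * (n // 2)

biased = loo_accuracy(X, y, k, select_inside=False)
unbiased = loo_accuracy(X, y, k, select_inside=True)
print("selection outside CV:", biased)
print("selection inside CV: ", unbiased)
```

Because the data contain no signal, an honest estimate should stay near chance level, whereas selecting genes before cross-validation tends to give a much more optimistic figure. The gap between the two numbers is the selection bias.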

Corresponding author

Correspondence to Geoffrey J. McLachlan.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Cao, KA.L., McLachlan, G.J. (2009). Statistical Analysis on Microarray Data: Selection of Gene Prognosis Signatures. In: Pham, T. (eds) Computational Biology. Applied Bioinformatics and Biostatistics in Cancer Research. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-0811-7_3
