Classification Approaches for Microarray Gene Expression Data Analysis

  • Leo Wang-Kit Cheung
Part of the Methods in Molecular Biology book series (MIMB, volume 802)


Classification approaches have been developed, adopted, and applied to distinguish disease classes at the molecular level using microarray data. Recently, a novel class of hierarchical probabilistic models based on a kernel-imbedding technique has become one of the best classification tools for microarray data analysis. These models were first developed as kernel-imbedded Gaussian processes (KIGPs) for binary classification problems using microarray gene expression data, and were then extended to multiclass classification problems under a unifying Bayesian framework. Specifically, an adaptive algorithm with a cascading structure was designed to find appropriate featuring kernels, to discover potentially significant genes, and to make optimal disease (e.g., tumor/cancer) class predictions with associated Bayesian posterior probabilities. Simulation studies and applications to published real data showed that KIGPs performed very close to the Bayesian bound and either outperformed or ranked among the best of many state-of-the-art methods. The distinctive advantage of the KIGP approach is its ability to explore both linear and nonlinear underlying relationships between the target features of a given disease classification problem and the explanatory gene expression data involved. This line of research has shed light on the broader usability of the KIGP approach for the analysis of other high-throughput omics data and of omics data collected as time series, especially when linear-model-based methods fail to work.
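
The contrast between linear and nonlinear relationships mentioned above can be illustrated with a small Python sketch. It is not the KIGP algorithm: it involves no Gaussian process prior, no Bayesian inference, and no gene selection. It only assumes a toy XOR-like data set for two hypothetical genes and a simple class-mean kernel score (the function names and the value of gamma are illustrative choices, not taken from the chapter), to show why the choice of kernel matters when the class structure is nonlinear.

```python
import math

def linear_kernel(x, z):
    # Inner product: can only capture linear structure.
    return sum(a * b for a, b in zip(x, z))

def gaussian_kernel(x, z, gamma=0.5):
    # Gaussian (RBF) kernel; gamma is an assumed, untuned width parameter.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def kernel_mean_score(x, samples, labels, kernel):
    # Mean similarity to class +1 minus mean similarity to class -1.
    # A positive score votes for class +1, a negative score for class -1.
    pos = [kernel(x, s) for s, y in zip(samples, labels) if y == 1]
    neg = [kernel(x, s) for s, y in zip(samples, labels) if y == -1]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

# Toy XOR-like "expression" values for two hypothetical genes:
# class +1 when both genes move together, class -1 when they move apart.
samples = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]

linear_scores = [kernel_mean_score(x, samples, labels, linear_kernel)
                 for x in samples]
gaussian_scores = [kernel_mean_score(x, samples, labels, gaussian_kernel)
                   for x in samples]

for x, y, ls, gs in zip(samples, labels, linear_scores, gaussian_scores):
    print(x, "label", y, "linear", ls, "gaussian", round(gs, 3))
```

With the linear kernel every score is zero, because both class means coincide at the origin, so the classes cannot be told apart. With the Gaussian kernel the sign of each score matches its label. A method that can adapt its kernel, as the abstract describes for KIGPs, is aimed at handling both situations.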

Key words

Microarray gene expression, Kernel-imbedding, Gaussian processes, Markov chain Monte Carlo methods, Nonlinear systems



This work was partially supported by the Loyola University Medical Center Research Development Funds and the Sun Microsystems Academic Equipment Grant for Bioinformatics. The author would like to thank Dr. Xin Zhao at Sanjole Inc. for his involvement in the KIGP work.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Bioinformatics Core, Department of Preventive Medicine and Epidemiology, Stritch School of Medicine, Loyola University Medical Center, Chicago, USA
