A Survey of Classification Techniques for Microarray Data Analysis

Yip, Wai-Ki; Amin, Samir B.; Li, Cheng

doi:10.1007/978-3-642-16345-6_10

Part of the book series: Springer Handbooks of Computational Statistics (SHCS)

Abstract

Recent advances in biomedical technology have made it possible to collect large amounts of 'omic' data from the genomic, transcriptomic, and proteomic domains quickly and cheaply. One such technology is the microarray, which allows researchers to measure the expression of thousands of genes simultaneously. This volume of data raises a new problem: how to extract useful information from it. Data mining and machine learning techniques have long been applied in many computing applications, and it is natural to use them to help draw inferences from the information gathered in microarray experiments. This chapter surveys common classification techniques, together with related methods for improving their accuracy, for microarray analysis based on data mining methodology. Publicly available datasets are used to evaluate their performance.

Author information

Authors and Affiliations

Dana Farber Cancer Institute, 44 Binney Street, CLSB 11036, Boston, Massachusetts, 02115, USA
Cheng Li
Harvard School of Public Health, 677 Huntington Avenue, Boston, Massachusetts, 02115, USA
Wai-Ki Yip
Dana Farber Cancer Institute, 44 Binney Street, Boston, Massachusetts, 02115, USA
Samir B. Amin

Corresponding author

Correspondence to Cheng Li.

Editor information

Editors and Affiliations

Institute of Statistics, National Chiao Tung University, Ta Hsueh Road 1001, Hsinchu, 30050, Taiwan, R.O.C.
Henry Horng-Shing Lu
Department of Empirical Inference, MPI for Intelligent Systems, Spemannstraße 38, Tübingen, 72076, Germany
Bernhard Schölkopf
School of Medicine, Dept. Epidemiology & Public Health, Yale University, College Street 60, New Haven, 06520, Connecticut, USA
Hongyu Zhao

About this chapter

Cite this chapter

Yip, WK., Amin, S.B., Li, C. (2011). A Survey of Classification Techniques for Microarray Data Analysis. In: Lu, HS., Schölkopf, B., Zhao, H. (eds) Handbook of Statistical Bioinformatics. Springer Handbooks of Computational Statistics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16345-6_10

DOI: https://doi.org/10.1007/978-3-642-16345-6_10
Published: 09 April 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16344-9
Online ISBN: 978-3-642-16345-6
eBook Packages: Mathematics and Statistics (R0)
