Summary
Gene expression levels are useful for discriminating between cancer and normal examples and/or between different types of cancer. In this chapter, ensembles of k-nearest neighbor (k-NN) classifiers are employed for gene expression based cancer classification. The ensembles are created by randomly sampling subsets of genes, assigning each subset to a k-NN classifier to perform classification, and finally combining the k-NN predictions by majority vote. Selection of subsets is governed by the statistical dependence between dataset complexity and classification error, confirmed by the copula method, so that the least complex subsets are preferred, since they are associated with more accurate predictions. Experiments carried out on six gene expression datasets show that our ensemble scheme is superior both to the single best classifier in the ensemble and to the redundancy-based filter, which is specifically designed to remove irrelevant genes.
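The scheme summarized above (sample random gene subsets, prefer the least complex ones, give each to a k-NN classifier, combine by majority vote) can be sketched roughly as follows. This is a minimal illustration and not the chapter's implementation: the summary does not name the complexity measure, so a Fisher-ratio-based score is used here purely as a stand-in assumption, and the copula-based dependence analysis is not reproduced. All function names and parameters (`knn_predict`, `fisher_complexity`, `select_subsets`, `ensemble_predict`, `n_candidates`, `n_members`, `subset_size`) are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, genes, k=3):
    """Classify sample x by k-NN, using only the gene indices in `genes`."""
    # Squared Euclidean distance restricted to the chosen genes.
    dists = sorted(
        (sum((row[g] - x[g]) ** 2 for g in genes), i)
        for i, row in enumerate(train_X)
    )
    votes = Counter(train_y[i] for _, i in dists[:k])
    return votes.most_common(1)[0][0]


def fisher_complexity(train_X, train_y, genes):
    """Stand-in complexity score (an assumption, not the chapter's measure).

    Computes, per gene, the ratio of between-class to within-class scatter
    and maps the best ratio to (0, 1]; lower values mean a simpler subset.
    """
    best = 0.0
    for g in genes:
        vals = [row[g] for row in train_X]
        mu = sum(vals) / len(vals)
        between = within = 0.0
        for c in set(train_y):
            cv = [v for v, lab in zip(vals, train_y) if lab == c]
            mc = sum(cv) / len(cv)
            between += len(cv) * (mc - mu) ** 2
            within += sum((v - mc) ** 2 for v in cv)
        ratio = between / within if within > 0 else float("inf")
        best = max(best, ratio)
    return 1.0 / (1.0 + best)


def select_subsets(train_X, train_y, n_candidates, n_members, subset_size, seed=0):
    """Sample random gene subsets and keep the n_members least complex ones."""
    rng = random.Random(seed)
    n_genes = len(train_X[0])
    candidates = [rng.sample(range(n_genes), subset_size)
                  for _ in range(n_candidates)]
    candidates.sort(key=lambda genes: fisher_complexity(train_X, train_y, genes))
    return candidates[:n_members]


def ensemble_predict(train_X, train_y, subsets, x, k=3):
    """Majority vote over one k-NN prediction per gene subset."""
    votes = Counter(knn_predict(train_X, train_y, x, genes, k)
                    for genes in subsets)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data (made up for illustration): 6 samples, 4 "genes", 2 classes.
    X = [[0.0, 0.1, 0.0, 0.2], [0.1, 0.0, 0.1, 0.0], [0.2, 0.1, 0.0, 0.1],
         [1.0, 0.9, 1.0, 1.1], [0.9, 1.0, 1.1, 1.0], [1.1, 1.0, 0.9, 1.0]]
    y = [0, 0, 0, 1, 1, 1]
    subsets = select_subsets(X, y, n_candidates=10, n_members=3, subset_size=2)
    print(subsets, ensemble_predict(X, y, subsets, [0.05, 0.05, 0.05, 0.05]))
```

Using an odd number of ensemble members (and an odd k) avoids ties in two-class problems; in this sketch a tie would simply be resolved in favor of the label encountered first.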
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Okun, O., Priisalu, H. (2008). Ensembles of Nearest Neighbors for Gene Expression Based Cancer Classification. In: Okun, O., Valentini, G. (eds) Supervised and Unsupervised Ensemble Methods and their Applications. Studies in Computational Intelligence, vol 126. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78981-9_6
DOI: https://doi.org/10.1007/978-3-540-78981-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78980-2
Online ISBN: 978-3-540-78981-9
eBook Packages: Engineering, Engineering (R0)