Comparing Algorithms for Clustering of Expression Data: How to Assess Gene Clusters
Clustering is commonly used to search for groups of similarly expressed genes in mRNA expression data. Many clustering algorithms exist, and each usually produces different results on the same data. Without additional evaluation, it is difficult to determine which solutions are better.
In this chapter we discuss methods to assess algorithms for clustering of gene expression data. In particular, we present a new method that uses two elements: an internal index of validity based on the MDL principle and an external index of validity that measures the consistency with experimental data. Each index alone suggests a set of effective models, but only the combination of the two can pinpoint the best model overall. Our method can be used to compare different clustering algorithms and pick the one that maximizes the correlation with functional links in gene networks while minimizing the error rate. We test our method on several popular clustering algorithms as well as on clustering algorithms that are specially tailored to deal with noisy data. Finally, we propose methods for assessing the significance of individual clusters and study the correspondence between gene clusters and biochemical pathways.
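To make the idea of an external index of validity more concrete, the following is a minimal sketch of one common way to score a clustering against experimental evidence: pairwise agreement with a set of known functional links. It is an illustrative assumption, not the index defined in this chapter; the function name `pairwise_external_index`, the gene identifiers, and the example links are all hypothetical.

```python
from itertools import combinations


def pairwise_external_index(labels, known_links):
    """Score a clustering against known functional links (illustrative only).

    labels      : dict mapping gene name -> cluster id
    known_links : iterable of (gene_a, gene_b) pairs assumed to be
                  functionally linked according to external evidence

    Returns (precision, recall):
      precision = fraction of co-clustered gene pairs that are known links
      recall    = fraction of known links whose genes share a cluster
    """
    # Keep only links between two distinct genes that were actually clustered.
    links = {
        frozenset((a, b))
        for a, b in known_links
        if a in labels and b in labels and a != b
    }
    # Every unordered pair of genes assigned to the same cluster.
    co_clustered = {
        frozenset((a, b))
        for a, b in combinations(labels, 2)
        if labels[a] == labels[b]
    }
    hits = len(links & co_clustered)
    precision = hits / len(co_clustered) if co_clustered else 0.0
    recall = hits / len(links) if links else 0.0
    return precision, recall


# Hypothetical toy data: five genes in two clusters, three known links.
labels = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 1}
known_links = [("g1", "g2"), ("g3", "g4"), ("g1", "g3")]
precision, recall = pairwise_external_index(labels, known_links)
```

Under this sketch, a clustering that places many linked genes together without also grouping unrelated genes scores high on both values. An external index of this kind would be combined with an internal, model-based criterion rather than used alone.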
Key words: Microarrays, mRNA expression, clustering, evaluation
This work is supported by the National Science Foundation under Grant No. 0218521, as part of the NSF/NIH Collaborative Research in Computational Neuroscience Program.