Comparing Algorithms for Clustering of Expression Data: How to Assess Gene Clusters

Yona, Golan; Dirks, William; Rahman, Shafquat

doi:10.1007/978-1-59745-243-4_21


Part of the book series: Methods in Molecular Biology (MIMB, volume 541)


Abstract

Clustering is a popular technique for finding groups of similarly expressed genes in mRNA expression data. There are many clustering algorithms, and applying each one to the same data will usually produce different results. Without additional evaluation, it is difficult to determine which solutions are better.

In this chapter we discuss methods to assess algorithms for clustering of gene expression data. In particular, we present a new method that uses two elements: an internal index of validity based on the MDL principle and an external index of validity that measures the consistency with experimental data. Each one is used to suggest an effective set of models, but it is only the combination of both that is capable of pinpointing the best model overall. Our method can be used to compare different clustering algorithms and pick the one that maximizes the correlation with functional links in gene networks while minimizing the error rate. We test our methods on several popular clustering algorithms as well as on clustering algorithms that are specially tailored to deal with noisy data. Finally, we propose methods for assessing the significance of individual clusters and study the correspondence between gene clusters and biochemical pathways.
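To make the idea of pairing an internal index with an external one concrete, the following is a minimal sketch and not the chapter's actual method. It assumes a BIC-style two-part description length with one spherical Gaussian per cluster and hard assignments as a stand-in for the MDL-based index, and it assumes the external index is simply the fraction of known functionally linked gene pairs that land in the same cluster. The function names, the toy expression profiles, and the `linked_pairs` list are all hypothetical.

```python
import math


def description_length(points, labels):
    """BIC-style two-part cost in nats (assumed stand-in for an MDL index).

    Data cost: negative log-likelihood under one spherical Gaussian per
    cluster with a shared variance. Model cost: a penalty that grows with
    the number of free parameters. Lower is better.
    """
    n, d = len(points), len(points[0])
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)

    # Residual sum of squares of every point around its cluster centroid.
    rss = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        rss += sum((x - m) ** 2 for p in members for x, m in zip(p, centroid))

    variance = max(rss / (n * d), 1e-12)  # floor avoids log(0)
    data_cost = 0.5 * n * d * math.log(2 * math.pi * math.e * variance)
    # k centroids with d coordinates each, plus one shared variance.
    model_cost = 0.5 * (len(clusters) * d + 1) * math.log(n)
    return data_cost + model_cost


def external_agreement(labels, linked_pairs):
    """Fraction of known linked gene pairs placed in the same cluster."""
    if not linked_pairs:
        return 0.0
    same = sum(1 for i, j in linked_pairs if labels[i] == labels[j])
    return same / len(linked_pairs)


# Hypothetical toy data: six genes, two expression conditions each.
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
linked_pairs = [(0, 1), (3, 4)]  # assumed known functional links

solution_a = [0, 0, 0, 1, 1, 1]  # groups nearby profiles together
solution_b = [0, 1, 0, 1, 0, 1]  # mixes the two groups

for name, labels in (("A", solution_a), ("B", solution_b)):
    dl = description_length(profiles, labels)
    agree = external_agreement(labels, linked_pairs)
    print(f"solution {name}: description length {dl:.2f}, link agreement {agree:.2f}")
```

In this toy setting, solution A has both the shorter description length and the higher agreement with the assumed links. On real data the two indices can disagree, which is why the abstract stresses that only their combination pinpoints the best model.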



Acknowledgments

This work is supported by the National Science Foundation under Grant No. 0218521, as part of the NSF/NIH Collaborative Research in Computational Neuroscience Program.

Author information

Authors and Affiliations

Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, USA
Golan Yona
Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel
Golan Yona
Center for Integrative Genomics, University of California, Berkeley, Berkeley, CA, USA
William Dirks
Mathworks Inc., Natick, MA, USA
Shafquat Rahman



About this protocol

Cite this protocol

Yona, G., Dirks, W., Rahman, S. (2009). Comparing Algorithms for Clustering of Expression Data: How to Assess Gene Clusters. In: Ireton, R., Montgomery, K., Bumgarner, R., Samudrala, R., McDermott, J. (eds) Computational Systems Biology. Methods in Molecular Biology, vol 541. Humana Press. https://doi.org/10.1007/978-1-59745-243-4_21


DOI: https://doi.org/10.1007/978-1-59745-243-4_21
Published: 10 March 2009
Publisher Name: Humana Press
Print ISBN: 978-1-58829-905-5
Online ISBN: 978-1-59745-243-4
eBook Packages: Springer Protocols
