Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of Information

  • Saket Navlakha
  • James White
  • Niranjan Nagarajan
  • Mihai Pop
  • Carl Kingsford
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5541)


Hierarchical clustering is a popular method for grouping together similar elements based on a distance measure between them. In many cases, annotations for some elements are known beforehand, which can aid the clustering process. We present a novel approach for decomposing a hierarchical clustering into the clusters that optimally match a set of known annotations, as measured by the variation of information metric. Our approach is general and does not require the user to enter the number of clusters desired. We apply it to two biological domains: finding protein complexes within protein interaction networks and identifying species within metagenomic DNA samples. For these two applications, we test the quality of our clusters by using them to predict complex and species membership, respectively. We find that our approach generally outperforms the commonly used heuristic methods.
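As an illustration of the comparison measure named in the abstract, the variation of information between two clusterings A and B of the same elements is commonly defined as VI(A, B) = H(A) + H(B) − 2 I(A; B), or equivalently H(A|B) + H(B|A). The following is a minimal sketch of that standard definition only; it is not the authors' tree-decomposition procedure, and the function name and the label-list input format are assumptions made for this example.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A|B) + H(B|A) in nats.

    labels_a and labels_b are equal-length sequences; position i holds the
    cluster label of element i in each clustering (hypothetical input format).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same elements")
    n = len(labels_a)
    if n == 0:
        return 0.0

    count_a = Counter(labels_a)                   # cluster sizes in A
    count_b = Counter(labels_b)                   # cluster sizes in B
    count_ab = Counter(zip(labels_a, labels_b))   # overlap sizes

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Each overlapping pair of clusters contributes to H(A|B) + H(B|A).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


if __name__ == "__main__":
    known = ["c1", "c1", "c2", "c2"]      # e.g. known annotations
    candidate = ["x", "x", "y", "y"]      # same grouping, different names
    print(variation_of_information(known, candidate))
```

A lower value means the candidate clustering agrees more closely with the annotations; two clusterings that group the elements identically have a distance of zero regardless of how the clusters are named.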


Keywords: Hierarchical Tree Decompositions; Variation of Information; Clustering; Protein Interaction Networks; Metagenomics; OTUs




Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Saket Navlakha (1, 2)
  • James White (2)
  • Niranjan Nagarajan (1, 2)
  • Mihai Pop (1, 2)
  • Carl Kingsford (1, 2)
  1. Department of Computer Science, USA
  2. Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, USA
