Ortholog Clustering on a Multipartite Graph

  • Akshay Vashist
  • Casimir Kulikowski
  • Ilya Muchnik
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3692)


We present a method for automatically extracting groups of orthologous genes from a large set of genomes through the development of a new clustering method on a weighted multipartite graph. The method assigns a score to an arbitrary subset of genes from multiple genomes to assess the orthologous relationships between genes in the subset. This score is computed using sequence similarities between the member genes and the phylogenetic relationship between the corresponding genomes. An ortholog cluster is found as the subset with the highest score, so ortholog clustering is formulated as a combinatorial optimization problem. The algorithm for finding an ortholog cluster runs in time O(|E| + |V| log |V|), where V and E are the sets of vertices and edges, respectively, in the graph. However, if we discretize the similarity scores into a constant number of bins, the run time improves to O(|E| + |V|). The proposed method was applied to seven complete eukaryote genomes for which manually curated ortholog clusters, KOG (eukaryotic ortholog clusters), have been constructed. A comparison of our results with the manually curated ortholog clusters shows that our clusters are well correlated with the existing clusters. Finally, we demonstrate how gene order information can be incorporated in the proposed method to improve ortholog detection.
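The abstract does not give the scoring function, so the following is only a minimal sketch of how a highest-scoring subset might be extracted under stated assumptions. It assumes the score of a subset H is the smallest "linkage" of any member, where a gene's linkage is the sum of its similarity weights to members of H in other genomes, and it ignores the phylogenetic weighting entirely. Under that assumption, repeatedly removing the weakest gene and remembering the best intermediate subset finds a maximizer. The names `best_cluster`, `edges`, and `genome_of` are hypothetical, and the binary heap used here gives O(|E| log |V|) rather than the bound stated in the abstract.

```python
import heapq
from collections import defaultdict


def best_cluster(edges, genome_of):
    """Greedy peeling sketch (assumed score: minimum linkage over the subset).

    edges: iterable of (gene_u, gene_v, similarity) with positive similarity.
    genome_of: dict mapping each gene to its genome label.
    Returns (cluster, score).
    """
    # Multipartite graph: keep only edges between genes of different genomes.
    adj = defaultdict(dict)
    for u, v, w in edges:
        if genome_of[u] != genome_of[v]:
            adj[u][v] = w
            adj[v][u] = w

    # Linkage of each gene with respect to the full gene set.
    link = {g: sum(adj[g].values()) for g in genome_of}
    heap = [(s, g) for g, s in link.items()]
    heapq.heapify(heap)

    alive = set(genome_of)
    order = []  # genes in the order they are removed
    best_score, best_index = float("-inf"), 0

    while heap:
        s, g = heapq.heappop(heap)
        if g not in alive or s != link[g]:
            continue  # stale heap entry, skip it
        # s is the minimum linkage of the current subset, i.e. its score.
        if s > best_score:
            best_score, best_index = s, len(order)
        alive.remove(g)
        order.append(g)
        # Removing g lowers the linkage of its remaining neighbours.
        for nb, w in adj[g].items():
            if nb in alive:
                link[nb] -= w
                heapq.heappush(heap, (link[nb], nb))

    # The best subset is everything not yet removed at the best step.
    return set(order[best_index:]), best_score
```

If the similarity scores were discretized into a constant number of bins, the heap could in principle be replaced by a bucket queue indexed by linkage value, which is one plausible route to the linear bound mentioned above.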


Linkage Function, Orthologous Relationship, Multiple Genome, Pfam Family, Ortholog Cluster
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Akshay Vashist (1)
  • Casimir Kulikowski (1)
  • Ilya Muchnik (1, 2)
  1. Department of Computer Science
  2. DIMACS, Rutgers, The State University of New Jersey, Piscataway, USA
