Understanding the “Horizontal Dimension” of Molecular Evolution to Annotate, Classify, and Discover Proteins with Functional Domains

  • Survey
Journal of Computer Science and Technology

Abstract

Protein evolution proceeds by two distinct processes: 1) individual mutations, with selection for those that are adaptive, and 2) rearrangement of entire domains within proteins into novel combinations, producing new protein families that combine functional properties in ways that did not previously exist. Domain rearrangement poses a challenge to sequence alignment-based search methods, such as BLAST, in predicting homology, since these methods implicitly assume that related proteins differ from each other primarily by individual mutations. Moreover, there is ample evidence that the evolutionary process has used (and continues to use) domains as building blocks; it therefore seems fitting to use computational, domain-based methods to reconstruct that process. A challenge and an opportunity for computational biology is to use knowledge of evolutionary domain recombination to characterize families of proteins whose evolutionary history includes such recombination, to discover novel proteins, and to infer protein-protein interactions. In this paper we review techniques and databases that exploit our growing knowledge of “horizontal” protein evolution, and suggest possible areas of future development. We illustrate the power of domain-based methods, and the possible directions of their future development, with a case history in progress that aims to facilitate a particular approach to understanding microbial pathogenicity.
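
To make the contrast with alignment-based search concrete, the following is a minimal sketch, not a method taken from the paper. It assumes each protein has already been reduced to an ordered list of domain labels (for example by a domain annotation tool); the function names, protein names, and domain labels are all hypothetical. It shows how two proteins can share most of their domains while carrying them in a different order, which is the situation a linear alignment handles poorly.

```python
def domain_jaccard(arch_a, arch_b):
    """Jaccard similarity between the sets of domains in two architectures.

    Each architecture is a list of domain labels in N-to-C terminal order.
    Order and copy number are deliberately ignored here.
    """
    set_a, set_b = set(arch_a), set(arch_b)
    if not set_a and not set_b:
        return 0.0  # convention chosen for this sketch: two empty architectures score 0
    return len(set_a & set_b) / len(set_a | set_b)


def same_order(arch_a, arch_b):
    """True if the domains shared by both architectures occur in the same order."""
    shared = set(arch_a) & set(arch_b)
    return [d for d in arch_a if d in shared] == [d for d in arch_b if d in shared]


# Hypothetical architectures: the two proteins share two domains,
# but the shared domains are arranged in a different order.
protein_x = ["kinase", "SH2", "SH3"]
protein_y = ["SH3", "kinase", "zinc_finger"]

similarity = domain_jaccard(protein_x, protein_y)
collinear = same_order(protein_x, protein_y)
print(f"shared-domain similarity: {similarity:.2f}, same order: {collinear}")
```

A high set similarity combined with a broken domain order is a simple signal of possible domain rearrangement; a real analysis would also need to account for repeated domains and for the confidence of each domain assignment.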

Author information

Corresponding author

Correspondence to Gloria Rendon.

Additional information

This work is supported by the NSF of the USA under Grant Nos. 0835718 and 0235792, the NIH under Grant Nos. 5PN2EY016570-06 and 5R01NS063405-02, the Beckman Institute for Advanced Science and Technology, the National Center for Supercomputing Applications, and the Renaissance Computing Institute.

About this article

Cite this article

Rendon, G., Ger, MF., Kantorovitz, R. et al. Understanding the “Horizontal Dimension” of Molecular Evolution to Annotate, Classify, and Discover Proteins with Functional Domains. J. Comput. Sci. Technol. 25, 82–94 (2010). https://doi.org/10.1007/s11390-010-9307-3

  • DOI: https://doi.org/10.1007/s11390-010-9307-3
