Abstract
Background
Details of functional speciation within gene families can be difficult to identify using standard multiple sequence alignment (MSA) methods. The evolutionary trace (ET) was developed as a visualization tool to combine MSA, phylogenetic and structural data for identification of functional sites in proteins. The method has been successful in extracting evolutionary details of functional surfaces in a number of biological systems and modifications of the method are useful in creating hypotheses about the function of previously unannotated genes. We wish to facilitate the graphical interpretation of disparate data types through the creation of flexible software implementations.
Results
We have implemented the ET method in a JAVA graphical interface, JEvTrace. Users can analyze and visualize ET input and output with respect to protein phylogeny, sequence and structure. Function discovery with JEvTrace is demonstrated on two proteins with recently determined crystal structures: YlxR from Streptococcus pneumoniae, with a predicted RNA-binding function, and YbaK, a Haemophilus influenzae protein of unknown function. To facilitate analysis and storage of results, we propose an MSA coloring data structure. The sequence coloring format readily captures evolutionary, biological, functional and structural features of MSAs.
Conclusions
Protein families and phylogeny represent complex data with statistical outliers and special cases. The JEvTrace implementation of the ET method allows detailed mining and graphical visualization of evolutionary sequence relationships.
Acknowledgements
We are deeply grateful for the help of Dietlind Gerloff, Dirk Walther, Jonathan Blake, John-Marc Chandonia, Wally Novak, Anthony Lau and Chern-Sing Goh during the development of the application. Anthony Lau and Elaine Meng provided invaluable comments on the manuscript.
Cite this article
Joachimiak, M.P., Cohen, F.E. JEvTrace: refinement and variations of the evolutionary trace in JAVA. Genome Biol 3, research0077.1 (2002). https://doi.org/10.1186/gb-2002-3-12-research0077
DOI: https://doi.org/10.1186/gb-2002-3-12-research0077