Abstract
The molecular function of a protein can be deduced by analysing the homology that arises from common evolutionary ancestry across different organisms, while its cellular function can be inferred by focussing on the interactions between specific proteins. Molecular function can be predicted by comparing a sequence with another sequence of known function, because proteins with similar sequences are usually homologous and perform similar functions. To detect remote homologs, that is, sequences that have diverged considerably, sequence-profile comparison methods were developed that use profile hidden Markov models (HMMs). A profile HMM is generated from an alignment of multiple sequences and carries more information than a single sequence. More advanced methods use profile-profile comparison to detect homology among sequences with very low sequence identity. In general, given a protein sequence of unknown function, these methods are applied hierarchically to identify the function, and they serve as powerful annotation tools for novel proteins. With many genomes currently being sequenced, knowledge of these annotation methods is becoming increasingly important.
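To illustrate why a profile built from several aligned sequences can recognise a divergent family member that a single-sequence comparison scores poorly, the following is a minimal sketch. It is not a profile HMM and not the chapter's method: it uses a simplified, ungapped position-specific log-odds profile with no insert or delete states. The toy alignment, the uniform background frequency, the pseudocount of 1 and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped alignment.

    Assumes a uniform background frequency (1/20), which is a
    simplification; real tools use empirical backgrounds and priors.
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def profile_score(profile, seq):
    """Sum the column scores for a sequence of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical five-residue family alignment (illustrative, not real data).
family = ["GDSLS", "GDSVS", "ADSLT", "GESIS"]
profile = build_profile(family)

divergent = "AESIT"   # each residue is seen somewhere in its column
unrelated = "WKPWC"   # no residue is seen in any column

print(identity(divergent, family[0]))     # low pairwise identity to one member
print(profile_score(profile, divergent))  # positive profile score
print(profile_score(profile, unrelated))  # negative profile score
```

In this toy case the divergent sequence shares only one of five positions with the first family member, yet every one of its residues is observed in the corresponding alignment column, so the profile scores it above the background while the unrelated sequence scores below it.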
Acknowledgements
This work was supported by grants from Jawaharlal Nehru University and the Open Source Drug Discovery, Council of Scientific and Industrial Research (OSDD-CSIR) project.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Sinha, S., Eisenhaber, B., Lynn, A.M. (2018). Predicting Protein Function Using Homology-Based Methods. In: Shanker, A. (ed.) Bioinformatics: Sequences, Structures, Phylogeny. Springer, Singapore. https://doi.org/10.1007/978-981-13-1562-6_13
DOI: https://doi.org/10.1007/978-981-13-1562-6_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1561-9
Online ISBN: 978-981-13-1562-6
eBook Packages: Biomedical and Life Sciences (R0)