Predicting Protein Function Using Homology-Based Methods

  • Swati Sinha
  • Birgit Eisenhaber
  • Andrew M. Lynn


The molecular function of a protein can be deduced by analysing the homology that arises from common evolutionary ancestry among different organisms, while its cellular function can be inferred by focussing on the interactions between specific proteins. Molecular function can be predicted by comparing a query sequence with a sequence of known function, since proteins with similar sequences are usually homologous and tend to perform similar functions. To detect remote homologs, that is, sequences that have diverged considerably, sequence-profile comparison methods based on profile hidden Markov models (HMMs) were developed. A profile HMM is built from a multiple sequence alignment and therefore carries more information than a single sequence. More advanced methods compare profiles with profiles to detect homology among sequences with very low sequence identity. In general, given a protein sequence of unknown function, these methods are applied hierarchically to identify its function, and they serve as powerful tools for annotating novel proteins. With many genomes currently being sequenced, knowledge of these annotation methods is becoming increasingly important.
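As a rough illustration of why a profile built from a multiple alignment carries more information than a single sequence, the Python sketch below derives position-specific log-odds scores from a toy alignment and scores two queries against them. It is a deliberately simplified position-specific scoring matrix, not a profile HMM: it has no insert or delete states and no transition probabilities. The toy sequences, the function names, the pseudocount and the uniform background distribution are all assumptions made for this example and are not taken from the chapter.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column.

    Assumptions: uniform background frequencies and a flat pseudocount.
    Gap characters ('-') are ignored when counting residues.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, query):
    """Sum the column scores for an ungapped query of the same length."""
    if len(query) != len(profile):
        raise ValueError("query length must match profile length")
    # Residues outside ALPHABET contribute a neutral score of 0.
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, query))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical toy family: position 5 varies, the other positions are conserved.
family = ["GDSGGP", "GDSGAP", "GDSGSP", "GESGGP"]
profile = build_profile(family)

related = "GDSGTP"    # unseen residue at the variable position only
unrelated = "AKLVTE"  # shares no residue with any family member

print("related score:  ", round(score(profile, related), 2))
print("unrelated score:", round(score(profile, unrelated), 2))
```

The profile penalises a mismatch at the variable fifth position less than a mismatch at a conserved position would be penalised, which a single reference sequence cannot express. Production tools such as HMMER or HH-suite add the gap modelling, sequence weighting and statistics that this sketch leaves out.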


Keywords: Protein function prediction; Sequence analysis; Annotation; Profile HMM; HMM-ModE; ANNOTATOR



This work was supported by grants from Jawaharlal Nehru University and the Open Source Drug Discovery project of the Council of Scientific and Industrial Research (OSDD-CSIR).



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Swati Sinha (1, corresponding author)
  • Birgit Eisenhaber (1)
  • Andrew M. Lynn (2)

  1. Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  2. School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
