Predicting Protein Function Using Homology-Based Methods

Sinha, Swati; Eisenhaber, Birgit; Lynn, Andrew M.

doi:10.1007/978-981-13-1562-6_13

Swati Sinha²,
Birgit Eisenhaber² &
Andrew M. Lynn³

2474 Accesses
3 Citations

Abstract

The molecular function of a protein can be deduced by analysing the ‘homology’ that exists due to common evolutionary ancestry among different organisms, while the cellular function can be inferred by focussing on the interactions between specific proteins. The molecular function could be predicted based on methods that rely on comparing a sequence to another sequence of known function as proteins having similar sequences are usually homologous performing similar function. On the other hand, in order to detect remote homologs or sequences which are very divergent, sequence-profile comparison methods were developed which use profile hidden Markov model (HMM). A profile HMM is generated from an alignment of multiple sequences and inherits more information than a single sequence. More advanced methods use profile-profile comparison methods to detect homology among sequences having very low sequence identity. In general, given a protein sequence with unknown function, these methods are used in a hierarchical manner to identify the function and serve as powerful annotation tools for predicting the function of a novel protein. With many genomes currently being sequenced, knowledge of these methods for annotation is increasingly becoming important.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 149.00; Price excludes VAT (USA)

Softcover Book: USD 199.99; Price excludes VAT (USA)

Hardcover Book: USD 199.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C (2011) OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res 39:D289–D294. https://doi.org/10.1093/nar/gkq1238
Article CAS PubMed Google Scholar
Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
Article CAS PubMed Google Scholar
Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402. https://doi.org/10.1093/nar/25.17.3389
Article CAS PubMed PubMed Central Google Scholar
Berezovsky IN, Grosberg AY, Trifonov EN (2000) Closed loops of nearly standard size: common basic element of protein structure. FEBS Lett 466:283–286
Article CAS PubMed Google Scholar
Biegert A, Söding J (2009) Sequence context-specific profiles for homology searching. Proc Natl Acad Sci U S A 106:3770–3775. https://doi.org/10.1073/pnas.0810767106
Article PubMed PubMed Central Google Scholar
Brendel V, Bucher P, Nourbakhsh IR et al (1992) Methods and algorithms for statistical analysis of protein sequences. Proc Natl Acad Sci U S A 89:2002–2006
Article CAS PubMed PubMed Central Google Scholar
Brenner SE (1999) Errors in genome annotation. Trends Genet 15:132–133. https://doi.org/10.1016/S0168-9525(99)01706-0
Article CAS PubMed Google Scholar
Claros MG, von Heijne G (1994) TopPred II: an improved software for membrane protein structure predictions. Comput Appl Biosci 10:685–686
CAS PubMed Google Scholar
Claverie J-M, States DJ (1993) Information enhancement methods for large scale sequence analysis. Comput Chem 17:191–201. https://doi.org/10.1016/0097-8485(93)85010-A
Article CAS Google Scholar
Cserzö M, Eisenhaber F, Eisenhaber B, Simon I (2002) On filtering false positive transmembrane protein predictions. Protein Eng 15:745–752
Article PubMed Google Scholar
Cserzo M, Eisenhaber F, Eisenhaber B, Simon I (2004) TM or not TM: transmembrane protein prediction with low false positive rate using DAS-TMfilter. Bioinformatics 20:136–137
Article CAS PubMed Google Scholar
Desai DK, Nandi S, Srivastava PK, Lynn AM (2011) Mod Enz a: accurate identification of metabolic enzymes using function specific profile HMMs with optimised discrimination threshold and modified emission probabilities. Adv Bioinforma 2011:743782. https://doi.org/10.1155/2011/743782
Article CAS Google Scholar
Devos D, Valencia A (2001) Intrinsic errors in genome annotation. Trends Genet 17:429–431. https://doi.org/10.1016/S0168-9525(01)02348-4
Article CAS PubMed Google Scholar
Di Tommaso P, Moretti S, Xenarios I et al (2011) T-coffee: a web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension. Nucleic Acids Res 39:W13–W17. https://doi.org/10.1093/nar/gkr245
Article CAS PubMed PubMed Central Google Scholar
Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S (2005) Prob cons: probabilistic consistency-based multiple sequence alignment. Genome Res 15:330–340. https://doi.org/10.1101/gr.2821705
Article CAS PubMed PubMed Central Google Scholar
Dosztányi Z (2018) Prediction of protein disorder based on IUPred. Protein Sci 27:331–340. https://doi.org/10.1002/pro.3334
Article CAS PubMed Google Scholar
Dosztanyi Z, Csizmok V, Tompa P, Simon I (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21:3433–3434. https://doi.org/10.1093/bioinformatics/bti541
Article CAS PubMed Google Scholar
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755–763. https://doi.org/10.1093/bioinformatics/14.9.755
Article CAS PubMed Google Scholar
Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform 23:205–211
PubMed Google Scholar
Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195. https://doi.org/10.1371/journal.pcbi.1002195
Article CAS PubMed PubMed Central Google Scholar
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797. https://doi.org/10.1093/nar/gkh340
Article CAS PubMed PubMed Central Google Scholar
Eisenhaber B, Eisenhaber F (2007) Posttranslational modifications and subcellular localization signals: indicators of sequence regions without inherent 3D structure? Curr Protein Pept Sci 8:197–203
Article CAS PubMed Google Scholar
Eisenhaber F, Frömmel C, Argos P (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. II The paradox with secondary structural class. Proteins 25:169–179. https://doi.org/10.1002/(SICI)1097-0134(199606)25:2<169::AID-PROT3>3.0.CO;2-D
Article CAS PubMed Google Scholar
Eisenhaber B, Bork P, Eisenhaber F (1999) Prediction of potential GPI-modification sites in proprotein sequences. J Mol Biol 292:741–758. https://doi.org/10.1006/jmbi.1999.3069
Article CAS PubMed Google Scholar
Eisenhaber B, Eisenhaber F, Maurer-Stroh S, Neuberger G (2004) Prediction of sequence signals for lipid post-translational modifications: insights from case studies. Proteomics 4:1614–1625. https://doi.org/10.1002/pmic.200300781
Article CAS PubMed Google Scholar
Eisenhaber B, Kuchibhatla D, Sherman W et al (2016) The recipe for protein sequence-based function prediction and its implementation in the ANNOTATOR software environment. Methods Mol Biol 1415:477–506. https://doi.org/10.1007/978-1-4939-3572-7_25
Article CAS PubMed Google Scholar
Finn RD, Bateman A, Clements J et al (2014) Pfam: the protein families database. Nucleic Acids Res 42:222–230. https://doi.org/10.1093/nar/gkt1223
Article CAS Google Scholar
Frishman D, Argos P (1996) Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng 9:133–142
Article CAS PubMed Google Scholar
Frishman D, Argos P (1997) Seventy-five percent accuracy in protein secondary structure prediction. Proteins 27:329–335
Article CAS PubMed Google Scholar
Hannenhalli SS, Russell RB (2000) Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol 303:61–76. https://doi.org/10.1006/jmbi.2000.4036
Article CAS PubMed Google Scholar
Hargbo J, Elofsson A (1999) Hidden Markov models that use predicted secondary structures for fold recognition. Proteins 36:68–76
Article CAS PubMed Google Scholar
Huynen M, Snel B, Lathe W, Bork P (2000) Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 10:1204–1210. https://doi.org/10.1101/gr.10.8.1204
Article CAS PubMed PubMed Central Google Scholar
Jaakkola T, Diekhans M, Haussler D (2000) A discriminative framework for detecting remote protein homologies. J Comput Biol 7:95–114. https://doi.org/10.1089/10665270050081405
Article CAS PubMed Google Scholar
Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, Bork P (2008) egg NOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res 36(Database issue):D250–D254 Epub 2007 Oct 16
CAS PubMed Google Scholar
Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637. https://doi.org/10.1002/bip.360221211
Article CAS PubMed Google Scholar
Käll L, Krogh A, Sonnhammer EL (2004) A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338:1027–1036. https://doi.org/10.1016/j.jmb.2004.03.016
Article CAS PubMed Google Scholar
Kamran M, Sinha S, Dubey P et al (2016) Identification of putative Z-ring-associated proteins, involved in cell division in human pathogenic bacteria Helicobacter pylori. FEBS Lett 590:2158–2171. https://doi.org/10.1002/1873-3468.12230
Article CAS PubMed Google Scholar
Karchin R, Karplus K, Haussler D (2002) Classifying G-protein coupled receptors with support vector machines. Bioinformatics 18:147–159
Article CAS PubMed Google Scholar
Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30:772–780. https://doi.org/10.1093/molbev/mst010
Article CAS PubMed PubMed Central Google Scholar
Kawabata T, Nishikawa K (2000) Protein structure comparison using the markov transition model of evolution. Proteins 41:108–122
Article CAS PubMed Google Scholar
Kelley LA, MacCallum RM, Sternberg MJ (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 299:499–520. https://doi.org/10.1006/jmbi.2000.3741
Article CAS PubMed Google Scholar
Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567–580. https://doi.org/10.1006/jmbi.2000.4315
Article CAS PubMed Google Scholar
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659. https://doi.org/10.1093/bioinformatics/btl158
Article CAS PubMed Google Scholar
Linding R, Jensen LJ, Diella F et al (2003a) Protein disorder prediction: implications for structural proteomics. Structure 11:1453–1459
Article CAS PubMed Google Scholar
Linding R, Russell RB, Neduva V, Gibson TJ (2003b) GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res 31:3701–3708
Article CAS PubMed PubMed Central Google Scholar
Liu J, Hegyi H, Acton TB et al (2004) Automatic target selection for structural genomics on eukaryotes. Proteins 56:188. https://doi.org/10.1002/prot.20012
Article CAS PubMed Google Scholar
Mamitsuka H (1996) A learning method of hidden Markov models for sequence discrimination. J Comput Biol 3:361–373
Article CAS PubMed Google Scholar
Marchler-Bauer A, Lu S, Anderson JB et al (2011) CDD: A conserved domain database for the functional annotation of proteins. Nucleic Acids Res 39:D225–D229. https://doi.org/10.1093/nar/gkq1189
Article CAS PubMed Google Scholar
Marcotte EM, Pellegrini M, Thompson MJ et al (1999) A combined algorithm for genome-wide prediction of protein function. Nature 402:83–86. https://doi.org/10.1038/47048
Article CAS PubMed Google Scholar
Marcotte EM, Xenarios I, van der Bliek AM, Eisenberg D (2000) Localizing proteins in the cell from their phylogenetic profiles. Proc Natl Acad Sci 97:12115–12120. https://doi.org/10.1073/pnas.220399497
Article CAS PubMed PubMed Central Google Scholar
Maurer-Stroh S, Eisenhaber F (2004) Myristoylation of viral and bacterial proteins. Trends Microbiol 12:178–185. https://doi.org/10.1016/j.tim.2004.02.006
Article CAS PubMed Google Scholar
Maurer-Stroh S, Washietl S, Eisenhaber F (2003a) Protein Prenyltransferases: Anchor Size, Pseudogenes and Parasites. Biol Chem 384:977–989. https://doi.org/10.1515/BC.2003.110
Article CAS PubMed Google Scholar
Maurer-Stroh S, Washietl S, Eisenhaber F (2003b) Protein prenyltransferases. Genome Biol 4:212. https://doi.org/10.1186/GB-2003-4-4-212
Article PubMed PubMed Central Google Scholar
Mott R (2000) Accurate formula for P-values of gapped local sequence and profile alignments. J Mol Biol 300:649–659. https://doi.org/10.1006/jmbi.2000.3875
Article CAS PubMed Google Scholar
Neuberger G, Maurer-Stroh S, Eisenhaber B et al (2003) Prediction of peroxisomal targeting signal 1 containing proteins from amino acid sequence. J Mol Biol 328:581–592
Article CAS PubMed Google Scholar
Nielsen H (2017) Predicting secretory proteins with SignalP. In: Methods in molecular biology. Humana Press, Clifton, pp 59–73
Google Scholar
Ofran Y, Punta M, Schneider R, Rost B (2005) Beyond annotation transfer by homology: novel protein-function prediction methods to assist drug discovery. Drug Discov Today 10:1475–1482. https://doi.org/10.1016/S1359-6446(05)03621-4
Article CAS PubMed Google Scholar
Pearson WR (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132:185–219
CAS PubMed Google Scholar
Pearson WR (2013) An introduction to sequence similarity (“homology”) searching. Curr Protoc Bioinformatics 42:3.1.1–3.1.8. https://doi.org/10.1002/0471250953.bi0301s42
Article Google Scholar
Pellegrini M, Marcotte EM, Thompson MJ et al (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96:4285–4288
Article CAS PubMed PubMed Central Google Scholar
Powell S, Forslund K, Szklarczyk D et al (2014) EggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res 42:231–239. https://doi.org/10.1093/nar/gkt1253
Article CAS Google Scholar
Promponas VJ, Enright AJ, Tsoka S et al (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics 16:915–922
Article CAS PubMed Google Scholar
Puntervoll P, Linding R, Gemünd C et al (2003) ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 31:3625–3630
Article CAS PubMed PubMed Central Google Scholar
Remmert M, Biegert A, Hauser A, Söding J (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175. https://doi.org/10.1038/nmeth.1818
Article CAS Google Scholar
Schäffer AA, Wolf YI, Ponting CP et al (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000–1011
Article PubMed Google Scholar
Schneider G, Wildpaner M, Sirota FL et al (2010) Integrated tools for biomolecular sequence-based function prediction as exemplified by the ANNOTATOR software environment. Methods Mol Biol 609:257–267. https://doi.org/10.1007/978-1-60327-241-4_15
Article CAS PubMed Google Scholar
Sigrist CJA, Cerutti L, Hulo N et al (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 3:265–274
Article CAS PubMed Google Scholar
Sinha S, Lynn AM (2014) HMM-ModE: implementation, benchmarking and validation with HMMER3. BMC Res Notes 7:483. https://doi.org/10.1186/1756-0500-7-483
Article PubMed PubMed Central Google Scholar
Sirota FL, Ooi H-S, Gattermayer T et al (2010) Parameterization of disorder predictors for large-scale applications requiring high specificity by using an extended benchmark dataset. BMC Genomics 11:S15. https://doi.org/10.1186/1471-2164-11-S1-S15
Article CAS PubMed PubMed Central Google Scholar
Snel B, Lehmann G, Bork P, Huynen MA (2000) STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res 28:3442–3444. https://doi.org/10.1093/nar/28.18.3442
Article CAS PubMed PubMed Central Google Scholar
Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951–960. https://doi.org/10.1093/bioinformatics/bti125
Article PubMed Google Scholar
Soding J, Biegert A, Lupas AN (2005) The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 33:W244–W248. https://doi.org/10.1093/nar/gki408
Article CAS PubMed PubMed Central Google Scholar
Srivastava PK, Desai DK, Nandi S, Lynn AM (2007) HMM-ModE--improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences. BMC Bioinformatics 8:104. https://doi.org/10.1186/1471-2105-8-104
Article CAS PubMed PubMed Central Google Scholar
Stein L (2001) Genome annotation: from sequence to biology. Nat Rev Genet 2:493–503. https://doi.org/10.1038/35080529
Article CAS PubMed Google Scholar
Tusnády GE, Simon I (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics 17:849–850
Article PubMed Google Scholar
van Dongen SM (2000) Graph clustering by flow simulation. PhD thesis, Utrecht University Repository
Google Scholar
von Heijne G (1986) A new method for predicting signal sequence cleavage sites. Nucleic Acids Res 14:4683–4690
Article Google Scholar
Ward JJ, McGuffin LJ, Bryson K et al (2004) The DISOPRED server for the prediction of protein disorder. Bioinformatics 20:2138–2139. https://doi.org/10.1093/bioinformatics/bth195
Article CAS PubMed Google Scholar
Wistrand M, Sonnhammer ELL (2004) Improving profile HMM discrimination by adapting transition probabilities. J Mol Biol 338:847–854. https://doi.org/10.1016/j.jmb.2004.03.023
Article CAS PubMed Google Scholar
Wong W-C, Maurer-Stroh S, Eisenhaber F (2010) More than 1,001 problems with protein domain databases: transmembrane regions, signal peptides and the issue of sequence homology. PLoS Comput Biol 6:e1000867. https://doi.org/10.1371/journal.pcbi.1000867
Article CAS PubMed PubMed Central Google Scholar
Wong W-C, Maurer-Stroh S, Schneider G, Eisenhaber F (2012) Transmembrane helix: simple or complex. Nucleic Acids Res 40:W370–W375. https://doi.org/10.1093/nar/gks379
Article CAS PubMed PubMed Central Google Scholar
Wootton JC (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 18:269–285. https://doi.org/10.1016/0097-8485(94)85023-2
Article CAS PubMed Google Scholar
Yoon B-J (2009) Hidden Markov models and their applications in biological sequence analysis. Curr Genomics 10:402–415
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This work was supported by grants from Jawaharlal Nehru University and Open source drug discovery, Council of scientific and industrial research (OSDD-CSIR) project.

Author information

Authors and Affiliations

Bioinformatics Institute (BII) Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
Swati Sinha & Birgit Eisenhaber
School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
Andrew M. Lynn

Authors

Swati Sinha
View author publications
You can also search for this author in PubMed Google Scholar
Birgit Eisenhaber
View author publications
You can also search for this author in PubMed Google Scholar
Andrew M. Lynn
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Swati Sinha .

Editor information

Editors and Affiliations

Department of Bioinformatics, Central University of South Bihar, Gaya, Bihar, India
Asheesh Shanker

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Sinha, S., Eisenhaber, B., Lynn, A.M. (2018). Predicting Protein Function Using Homology-Based Methods. In: Shanker, A. (eds) Bioinformatics: Sequences, Structures, Phylogeny . Springer, Singapore. https://doi.org/10.1007/978-981-13-1562-6_13

Download citation

DOI: https://doi.org/10.1007/978-981-13-1562-6_13
Published: 14 October 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1561-9
Online ISBN: 978-981-13-1562-6
eBook Packages: Biomedical and Life SciencesBiomedical and Life Sciences (R0)

Publish with us

Policies and ethics