Homology-Based Annotation of Large Protein Datasets

Part of the book series: Methods in Molecular Biology (MIMB, volume 1415)

Abstract

Advances in DNA sequencing technologies have led to an increasing amount of protein sequence data being generated. Only a small fraction of these data will have experimental annotation associated with them. Here, we describe a protocol for in silico homology-based annotation of large protein datasets that makes extensive use of manually curated collections of protein families. We focus on annotations provided by the Pfam database and suggest ways to identify family outliers and family variations. This protocol may be useful to people who are new to protein data analysis or who are unfamiliar with the computational tools currently available.

Acknowledgements

The authors would like to thank Stijn van Dongen (European Bioinformatics Institute) for some important clarifications concerning the clustering method MCL.

Corresponding author

Correspondence to Marco Punta.

Copyright information

© 2016 Springer Science+Business Media New York

About this protocol

Cite this protocol

Punta, M., Mistry, J. (2016). Homology-Based Annotation of Large Protein Datasets. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_8

  • DOI: https://doi.org/10.1007/978-1-4939-3572-7_8

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-3570-3

  • Online ISBN: 978-1-4939-3572-7

  • eBook Packages: Springer Protocols
