The Classification of Protein Domains

  • Protocol

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1525))

Abstract

The significant expansion in protein sequence and structure data that we are now witnessing creates a pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution of proteins, their function, and the extent to which the functional repertoire can vary across the three kingdoms of life. This has led to the creation of a wide range of protein family classifications that aim to group proteins based upon their evolutionary relationships.

In this chapter we discuss the approaches and methods that are frequently used in the classification of proteins, with a specific emphasis on the classification of protein domains. The construction of both domain sequence and domain structure databases is considered, and we show how the use of domain family annotations to assign structural and functional information is enhancing our understanding of genomes.

Corresponding author

Correspondence to Natalie Dawson.

Copyright information

© 2017 Springer Science+Business Media New York

About this protocol

Cite this protocol

Dawson, N., Sillitoe, I., Marsden, R.L., Orengo, C.A. (2017). The Classification of Protein Domains. In: Keith, J. (eds) Bioinformatics. Methods in Molecular Biology, vol 1525. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6622-6_7

  • DOI: https://doi.org/10.1007/978-1-4939-6622-6_7

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-6620-2

  • Online ISBN: 978-1-4939-6622-6

  • eBook Packages: Springer Protocols
