Bioinformatics, pp 209-223

Protein Sequence Databases

  • Terry Clark


The near-exponential growth in protein sequence data underlies transformations in biological research and related technological developments. Protein sequence data are used widely in fields including agronomy, biochemistry, ecology, entomology, evolution, genetics, genetic engineering, genomics, molecular phylogenetics and systematics, pharmacology, and toxicology. The remarkable increase in available protein sequences will most likely continue as genome sequencing projects proliferate, enabled by ongoing improvements in DNA sequencing technology. Along with opportunities, protein sequence data bring scientifically challenging problems.


Protein Data Bank; Protein Sequence Database; European Molecular Biology Laboratory; Gene Ontology Consortium; Protein Sequence Data



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. University of Queensland, Brisbane, Australia
