Protein Sequence Databases

Abstract

The near-exponential growth in protein sequence data underlies transformations in biological research and related technological developments. Protein sequence data are used widely in fields including agronomy, biochemistry, ecology, entomology, evolution, genetics, genetic engineering, genomics, molecular phylogenetics and systematics, pharmacology, and toxicology. The remarkable increase in available protein sequences will most likely continue as genome sequencing projects proliferate, enabled by ongoing improvements in DNA sequencing technology (Note 1). Along with opportunities, protein sequence data bring scientifically challenging problems.

Notes

  1. The number of bases in GenBank has doubled approximately every 18 months since 1982 (Fig. 10.1). This growth keeps pace with the doubling of integrated-circuit component density at approximately 18–24 month intervals, commonly referred to as Moore’s Law (Moore 1965).
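
    The doubling-time arithmetic behind this note can be made concrete with a small sketch. It assumes idealized exponential growth with a fixed doubling period; the function names `growth_factor` and `annual_growth_rate` are illustrative and do not come from the chapter, and no actual GenBank counts are used.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Factor by which a quantity grows over `years`, assuming it
    doubles every `doubling_months` (idealized exponential growth)."""
    return 2.0 ** (years * 12.0 / doubling_months)


def annual_growth_rate(doubling_months: float = 18.0) -> float:
    """Equivalent fractional growth per year for a given doubling period."""
    return 2.0 ** (12.0 / doubling_months) - 1.0


if __name__ == "__main__":
    # One doubling period (18 months = 1.5 years) gives a factor of 2.
    print(growth_factor(1.5))
    # Yearly growth implied by 18-month and 24-month doubling periods.
    print(f"{annual_growth_rate(18.0):.1%}", f"{annual_growth_rate(24.0):.1%}")
    # Cumulative factor over the 26 years from 1982 to 2008 under the 18-month assumption.
    print(f"{growth_factor(26.0):.3g}")
```

    Under these assumptions, an 18-month doubling period corresponds to roughly 59% growth per year and a 24-month period to roughly 41%. Sustained over 26 years, the 18-month rate compounds to a factor on the order of 10^5.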

References

  • Altschul SF (1991) Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 219:555–565

  • Altschul SF, Gish W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410

  • Altschul SF, Boguski MS, Gish W, Wootton JC (1994) Issues in searching molecular sequence databases. Nat Genet 6:119–129

  • Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402

  • Altschul SF, Wootton JC, Gertz EM et al (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272(20):5101–5109

  • Andreeva A, Howorth D, Brenner SE et al (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 32:D226–D229

  • Apweiler R (2001) Functional information in Swiss-Prot: the basis for large-scale characterisation of protein sequences. Brief Bioinform 2:9–18

  • Bairoch A, Boeckmann B, Ferro S, Gasteiger E (2004) Swiss-Prot: Juggling between evolution and stability. Brief Bioinform 5(1):39–55

  • Balaji S, Sujatha SN et al (2001) PALI-a database of alignments and phylogeny of homologous protein structures. Nucleic Acids Res 29:61–65

  • Barker WC, Garavelli JS, Haft DH et al (1998) The PIR-International protein sequence database. Nucleic Acids Res 26:27–32

  • Bateman A, Birney E, Cerruti L et al (2002) The Pfam protein families database. Nucleic Acids Res 30:276–280

  • Benson DA, Karsch-Mizrachi I, Lipman DJ et al (2000) GenBank. Nucleic Acids Res 28(1):15–18

  • Benson DA, Karsch-Mizrachi I, Lipman DJ et al (2007) GenBank. Nucleic Acids Res 36:D25–D30

  • Benson DA, Karsch-Mizrachi I et al (2006) GenBank. Nucleic Acids Res 35:D21–D25

  • Berman HM, Westbrook J, Feng Z et al (2000) The protein data bank. Nucleic Acids Res 28:235–242

  • Berman HM, Henrick K, Nakamura H et al (2007) The Worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35:D301–D303

  • Biswas M, O’Rourke JF, Camon E et al (2002) Applications of InterPro in protein annotation and genome analysis. Brief Bioinform 3(3):285–295

  • Boeckmann B, Bairoch A, Apweiler R et al (2003) The Swiss-Prot protein knowledgebase and its supplement TrEMBL. Nucleic Acids Res 31:365–370

  • Boeckmann B, Blatter MC, Famiglietti L et al (2005) Protein variety and functional diversity: Swiss-Prot annotation in its biological context. C R Biol 328:882–899

  • Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78–94

  • Burke J, Davison D, Hide W (1999) d2_cluster: A validated method for clustering EST and full-length cDNA sequences. Genome Res 9:1135–1142

  • Camon E, Magrane M, Barrell D et al (2004) The Gene Ontology Annotation (GOA) database: sharing knowledge in UniProt with gene ontology. Nucleic Acids Res 32:D262–D266

  • Cantor CR, Schimmel PR (1980) Biophysical chemistry, Part I: The conformation of biological macromolecules. WH Freeman, San Francisco and Oxford

  • Dayhoff MO, Eck RV, Chang M et al (1965) Atlas of protein sequence and structure, Vol 1. National Biomedical Research Foundation, Silver Spring, MD

  • de Castro E, Sigrist CJA, Gattiker A et al (2006) ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Res 34:W362–W365

  • Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis. Cambridge University Press, Cambridge UK

  • Eddy SR (1996) Hidden Markov models. Curr Opin Struct Biol 6:361–365

  • Finn RD, Mistry J, Schuster-Bockler B et al (2006) Pfam: clans, web tools and services. Nucleic Acids Res 34:D247–D251

  • Fitch WM (1970) Distinguishing homologous from analogous proteins. Syst Zool 19:99–113

  • Friedberg I (2006) Automated protein function prediction–the genomic challenge. Brief Bioinform 7(3):225–242

  • Ganfornina MD, Sánchez D (1999) Generation of evolutionary novelty by functional shift. BioEssays 21:432–439

  • Geer RC, Sayers EW (2003) Entrez: Making use of its power. Brief Bioinform 4(2):179–184

  • Gerlt JA, Babbitt PC (2001) Divergent evolution of enzymatic function: Mechanistically and functionally distinct suprafamilies. Annu Rev Biochem 70:209–246

  • Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: Detection of distantly related proteins. Proc Natl Acad Sci USA 84:4355–4358

  • Gribskov M, Fana F, Harper J et al (2001) PlantsP: a functional genomics database for plant phosphorylation. Nucleic Acids Res 29:111–113

  • Henikoff S, Greene SA, Pietrokovski S et al (1997) Gene families: The taxonomy of protein paralogs and chimeras. Science 278(5338):609–614

  • Henrick K, Feng Z, Bluhm WF (2008) Remediation of the protein data bank archive. Nucleic Acids Res 36:D426–D433

  • Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87:2264–2268

  • Koonin EV, Galperin MY (2002) Principles and methods of sequence analysis. In: Sequence – Evolution – Function, 1st edn. Kluwer, Waltham, MA

  • Kunin V, Cases I, Anton J et al (2003) Myriads of protein families, and still counting. Genome Biol 4:401

  • Leinonen R, Diez FG, Binns D et al (2004) UniProt Archive. Bioinformatics 20:3236–3237

  • Lesk AM (2001) Introduction to protein architecture. Oxford University Press, Oxford

  • Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659

  • Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227:1435–1441

  • Moeller S, Leser U, Fleischmann W, Apweiler R (1999) EDITtoTrEMBL: a distributed approach to high-quality automated protein sequence annotation. Bioinformatics 15:219–227

  • Moore GE (1965) Cramming more components onto integrated circuits. Electron Mag 38:8

  • Mulder NJ (2007) Protein family databases. Encyclopedia of life sciences. Wiley, New York

  • Mulder NJ, Apweiler R, Attwood TK et al (2003) The InterPro Database brings increased coverage and new features. Nucleic Acids Res 31(1):315–318

  • Mulder NJ, Apweiler R, Attwood TK et al (2007) New developments in the InterPro database. Nucleic Acids Res 35:D224–D228

  • Mushegian AR (2007) Foundations of comparative genomics. Academic, Burlington, MA

  • Myers G (1999) A fast bit-vector algorithm for approximate string matching based on dynamic programming. J ACM 46:395–415

  • Natale DA, Vinayaka CR, Wu CH (2005) Large-scale, classification-driven, rule-based functional annotation of proteins. Encyclopedia Genet, Genomics, Proteomics Bioinform. doi:10.1002/047001153X.g403314

  • NC-IUBMB (2008) Enzyme Nomenclature. http://www.chem.qmul.ac.uk/iubmb/enzyme/. Accessed 30 Apr 2008

  • Orengo CA, Pearl FMG, Bray JE et al (1999) Assigning genomic sequences to CATH. Nucleic Acids Res 28(1):277–282

  • Ouzounis CA, Coulson RMR, Enright AJ et al (2003) Classification schemes for protein structure and function. Nat Rev Genet 4:508–519

  • Pearson WR (1995) Comparison of methods for searching protein sequence databases. Prot Sci 4:1145–1160

  • Pearson WR, Lipman DJ (1988) Improved tools for biological sequence analysis. Proc Natl Acad Sci USA 85:2444–2448

  • Pearson WR, Wood TC (2001) Statistical significance of biological sequence comparison. In: Bourne BE, Weissig H (eds) Handbook of statistical genetics. Wiley, West Sussex, England

  • PlantsP (2008) Functional genomics of plant phosphorylation. http://plantsp.genomics.purdue.edu/. Accessed 1 March 2008

  • Ponting CP (2001) Issues in predicting protein function from sequence. Brief Bioinform 2(1):19–29

  • PRF (2008) Protein Research Foundation. http://www.prf.or.jp/en/dbi.shtml/. Accessed 26 Oct 2008

  • Pruitt KD, Tatusova T, Maglott DR et al (2007) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35:D61–D65

  • Raes J, Harrington ED, Singh AH et al (2007) Protein function space: viewing the limits or limited by our view. Curr Opin Struct Biol 17:362–369

  • Reddy BVB, Bourne PE (2003) Protein structure evolution and the SCOP database. In: Bourne PE, Weissig H (eds) Structural bioinformatics, 1st edn. Wiley-Liss, Hoboken, NJ

  • RefSeq (2008) The National Center for Biotechnology Information: Reference Sequence database. http://www.ncbi.nlm.nih.gov/RefSeq/key.html#status/. Accessed 26 Feb 2008

  • Rost B, Valencia A (1996) Pitfalls of protein sequence analysis. Curr Opin Biotechnol 7:457–461

  • Rusch DB, Halpern AL, Sutton G et al (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through Eastern tropical Pacific. PLoS Biol 5:398–431

  • Sangar V, Blankenberg DJ, Altman N et al (2007) Quantitative sequence-function relationship in proteins based on gene ontology. BMC Bioinform 8:294

  • Schneider M, Bairoch A, Wu CH et al (2005) Plant protein annotation in the UniProt Knowledgebase. Plant Physiol 138:59–66

  • Sigrist CJ, Cerutti L, Hulo N et al (2002) PROSITE: A documented database using patterns and profiles as motif descriptors. Brief Bioinform 3:265–274

  • Suzek BE, Huang H, McGarvey P et al (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23:1282–1288

  • The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nat Genet 25:25–29

  • The UniProt Consortium (2007) The Universal Protein Resource (UniProt). Nucleic Acids Res 35:D193–D197

  • The UniProt Consortium (2008a) The Universal Protein Resource (UniProt). Nucleic Acids Res 35:D190–D195

  • The UniProt Consortium (2008b) The Universal Protein Resource (UniProt). Nucleic Acids Res 36:D190–D195

  • UniProt (2008) http://www.uniprot.org/. Accessed 30 Apr 2008

  • Ware D, Jaiswal P, Ni J et al (2002) Gramene: a resource for comparative grass genomics. Nucleic Acids Res 30:103–105

  • Whisstock JC, Lesk AM (2003) Prediction of protein function from protein sequence and structure. Q Rev Biophys 36:307–340

  • Wieser D, Kretschmann E, Apweiler R (2004) Filtering erroneous protein annotation. Bioinformatics 20(1):i342–i347

  • Wilson CA, Kreychman J, Gerstein M (2000) Assessing annotation transfer for genomics: Quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 297:233–249

  • Wu CH, Nikolskaya A, Huang H et al (2004) PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res 32:D112–D114

  • Wu CH, Apweiler R, Bairoch A et al (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34:D187–D191

  • wwPDB (2008) Worldwide Protein Data Bank. http://www.wwpdb.org/. Accessed 8 Sept 2008

  • Yosef N, Sharan R, Noble WS (2008) Improved network-based identification of protein orthologs. Bioinformatics 24(16):i200–i206

Corresponding author

Correspondence to Terry Clark.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Clark, T. (2009). Protein Sequence Databases. In: Edwards, D., Stajich, J., Hansen, D. (eds) Bioinformatics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92738-1_10
