Protein Sequence Databases

Clark, Terry

doi:10.1007/978-0-387-92738-1_10

Abstract

The near-exponential growth in protein sequence data underlies transformations in biological research and related technological developments. Protein sequence data are used widely in fields including agronomy, biochemistry, ecology, entomology, evolution, genetics, genetic engineering, genomics, molecular phylogenetics and systematics, pharmacology, and toxicology. The remarkable increase in available protein sequences will most likely continue with the proliferation of genome sequencing projects, which are in turn enabled by ongoing improvements in DNA sequencing technology.¹ Along with opportunities, protein sequence data bring scientifically challenging problems.

Notes

1. The number of bases in GenBank has doubled approximately every 18 months since 1982 (Fig. 10.1). This growth keeps pace with the doubling of integrated-circuit component density at approximately 18–24 month intervals, commonly referred to as Moore’s Law (Moore 1965).

References

Altschul SF (1991) Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 219:555–565
Altschul SF, Gish W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
Altschul SF, Boguski MS, Gish W, Wootton JC (1994) Issues in searching molecular sequence databases. Nat Genet 6:119–129
Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402
Altschul SF, Wootton JC, Gertz EM et al (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272(20):5101–5109
Andreeva A, Howorth D, Brenner SE et al (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 32:D226–D229
Apweiler R (2001) Functional information in Swiss-Prot: the basis for large-scale characterisation of protein sequences. Brief Bioinform 2:9–18
Bairoch A, Boeckmann B, Ferro S, Gasteiger E (2004) Swiss-Prot: Juggling between evolution and stability. Brief Bioinform 5(1):39–55
Balaji S, Sujatha SN et al (2001) PALI-a database of alignments and phylogeny of homologous protein structures. Nucleic Acids Res 29:61–65
Barker WC, Garavelli JS, Haft DH et al (1998) The PIR-International protein sequence database. Nucleic Acids Res 26:27–32
Bateman A, Birney E, Cerruti L et al (2002) The Pfam protein families database. Nucleic Acids Res 30:276–280
Benson DA, Karsch-Mizrachi I, Lipman DJ et al (2000) GenBank. Nucleic Acids Res 28(1):15–18
Benson DA, Karsch-Mizrachi I, Lipman DJ et al (2007) GenBank. Nucleic Acids Res 36:D25–D30
Benson DA, Karsch-Mizrachi I, Lipman DJ et al (2006) GenBank. Nucleic Acids Res 35:D21–D25
Berman HM, Westbrook J, Feng Z et al (2000) The protein data bank. Nucleic Acids Res 28:235–242
Berman HM, Henrick K, Nakamura H et al (2007) The Worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35:D301–D303
Biswas M, O’Rourke JF, Camon E et al (2002) Applications of InterPro in protein annotation and genome analysis. Brief Bioinform 3(3):285–295
Boeckmann B, Bairoch A, Apweiler R et al (2003) The Swiss-Prot protein knowledgebase and its supplement TrEMBL. Nucleic Acids Res 31:365–370
Boeckmann B, Blatter MC, Famiglietti L et al (2005) Protein variety and functional diversity: Swiss-Prot annotation in its biological context. CR Biol 328:882–899
Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78–94
Burke J, Davison D, Hide W (1999) d2_cluster: A validated method for clustering EST and full-length cDNA sequences. Genome Res 9:1135–1142
Camon E, Magrane M, Barrell D et al (2004) The Gene Ontology Annotation (GOA) Database: sharing knowledge in UniProt with gene ontology. Nucleic Acids Res 32:D262–D266
Cantor CR, Schimmel PR (1980) Biophysical chemistry, Part I: The conformation of biological macromolecules. WH Freeman, San Francisco and Oxford
Dayhoff MO, Eck RV, Chang M et al (1965) Atlas of protein sequence and structure, Vol 1. National Biomedical Research Foundation, Silver Spring, MD
de Castro E, Sigrist CJA, Gattiker A et al (2006) ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Res 34:W362–W365
Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis. Cambridge University Press, Cambridge UK
Eddy SR (1996) Hidden Markov models. Curr Opin Struct Biol 6:361–365
Finn RD, Mistry J, Schuster-Böckler B et al (2006) Pfam: clans, web tools and services. Nucleic Acids Res 34:D247–D251
Fitch WM (1970) Distinguishing homologous from analogous proteins. Syst Zool 19:99–113
Friedberg I (2006) Automated protein function prediction–the genomic challenge. Brief Bioinform 7(3):225–242
Ganfornina MD, Sánchez D (1999) Generation of evolutionary novelty by functional shift. BioEssays 21:432–439
Geer RC, Sayers EW (2003) Entrez: Making use of its power. Brief Bioinform 4(2):179–184
Gerlt JA, Babbitt PC (2001) Divergent evolution of enzymatic function: Mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu Rev Biochem 70:209–246
Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: Detection of distantly related proteins. Proc Natl Acad Sci USA 84:4355–4358
Gribskov M, Fana F, Harper J et al (2001) PlantsP: a functional genomics database for plant phosphorylation. Nucleic Acids Res 29:111–113
Henikoff S, Greene SA, Pietrokovski S et al (1997) Gene families: The taxonomy of protein paralogs and chimeras. Science 278(5338):609–614
Henrick K, Feng Z, Bluhm WF (2008) Remediation of the protein data bank archive. Nucleic Acids Res 36:D426–D433
Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 87:2264–2268
Koonin EV, Galperin MY (2002) Principles and methods of sequence analysis. In: Sequence – Evolution – Function, 1st edn. Kluwer, Waltham, MA
Kunin V, Cases I, Anton J et al (2003) Myriads of protein families, and still counting. Genome Biol 4:401
Leinonen R, Diez FG, Binns D et al (2004) UniProt Archive. Bioinformatics 20:3236–3237
Lesk AM (2001) Introduction to protein architecture. Oxford University Press, Oxford
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659
Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227:1435–1441
Moeller S, Leser U, Fleischmann W, Apweiler R (1999) EDITtoTrEMBL: a distributed approach to high-quality automated protein sequence annotation. Bioinformatics 15:219–227
Moore GE (1965) Cramming more components onto integrated circuits. Electron Mag 38:8
Mulder NJ (2007) Protein family databases. Encyclopedia of life sciences. Wiley, New York
Mulder NJ, Apweiler R, Attwood TK et al (2003) The InterPro Database brings increased coverage and new features. Nucleic Acids Res 31(1):315–318
Mulder NJ, Apweiler R, Attwood TK et al (2007) New developments in the InterPro database. Nucleic Acids Res 35:D224–D228
Mushegian AR (2007) Foundations of comparative genomics. Academic, Burlington, MA
Myers G (1999) A fast bit-vector algorithm for approximate string matching based on dynamic programming. J ACM 46:395–415
Natale DA, Vinayaka CR, Wu CH (2005) Large-scale, classification-driven, rule-based functional annotation of proteins. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. doi:10.1002/047001153X.g403314
NC-IUBMB (2008) Enzyme Nomenclature. http://www.chem.qmul.ac.uk/iubmb/enzyme/. Accessed 30 Apr 2008
Orengo CA, Pearl FMG, Bray JE et al (1999) Assigning genomic sequences to CATH. Nucleic Acids Res 28(1):277–282
Ouzounis CA, Coulson RMR, Enright AJ et al (2003) Classification schemes for protein structure and function. Nat Rev Genet 4:508–519
Pearson WR (1995) Comparison of methods for searching protein sequence databases. Prot Sci 4:1145–1160
Pearson WR, Lipman DJ (1988) Improved tools for biological sequence analysis. Proc Natl Acad Sci USA 85:2444–2448
Pearson WR, Wood TC (2001) Statistical significance of biological sequence comparison. In: Bourne BE, Weissig H (eds) Handbook of statistical genetics. Wiley, West Sussex, England
PlantsP (2008) Functional genomics of plant phosphorylation. http://plantsp.genomics.purdue.edu/. Accessed 1 March 2008
Ponting CP (2001) Issues in predicting protein function from sequence. Brief Bioinform 2(1):19–29
PRF (2008) Protein Research Foundation. http://www.prf.or.jp/en/dbi.shtml/. Accessed 26 Oct 2008
Pruitt KD, Tatusova T, Maglott DR et al (2007) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35:D61–D65
Raes J, Harrington ED, Singh AH et al (2007) Protein function space: viewing the limits or limited by our view. Curr Opin Struct Biol 17:362–369
Reddy BVB, Bourne PE (2003) Protein structure evolution and the SCOP database. In: Bourne PE, Weissig H (eds) Structural bioinformatics, 1st edn. Wiley-Liss, Hoboken, NJ
RefSeq (2008) The National Center for Biotechnology Information: Reference Sequence database. http://www.ncbi.nlm.nih.gov/RefSeq/key.html#status/. Accessed 26 Feb 2008
Rost B, Valencia A (1996) Pitfalls of protein sequence analysis. Curr Opin Biotechnol 7:457–461
Rusch DB, Halpern AL, Sutton G et al (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through Eastern tropical Pacific. PLoS Biol 5:398–431
Sangar V, Blankenberg DJ, Altman N et al (2007) Quantitative sequence-function relationship in proteins based on gene ontology. BMC Bioinform 8:294
Schneider M, Bairoch A, Wu CH et al (2005) Plant protein annotation in the UniProt Knowledgebase. Plant Physiol 138:59–66
Sigrist CJ, Cerutti L, Hulo N et al (2002) PROSITE: A documented database using patterns and profiles as motif descriptors. Brief Bioinform 3:265–274
Suzek BE, Huang H, McGarvey P et al (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23:1282–1288
The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nat Genet 25:25–29
The UniProt Consortium (2007) The Universal Protein Resource (UniProt). Nucleic Acids Res 35:D193–D197
The UniProt Consortium (2008a) The Universal Protein Resource (UniProt). Nucleic Acids Res 35:D190–D195
The UniProt Consortium (2008b) The Universal Protein Resource (UniProt). Nucleic Acids Res 36:D190–D195
UniProt (2008) http://www.uniprot.org/. Accessed 30 Apr 2008
Ware D, Jaiswal P, Ni J et al (2002) Gramene: a resource for comparative grass genomics. Nucleic Acids Res 30:103–105
Whisstock JC, Lesk AM (2003) Prediction of protein function from protein sequence and structure. Q Rev Biophys 36:307–340
Wieser D, Kretschmann E, Apweiler R (2004) Filtering erroneous protein annotation. Bioinformatics 20(1):i342–i347
Wilson CA, Kreychman J, Gerstein M (2000) Assessing annotation transfer for genomics: Quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 297:233–249
Wu CH, Nikolskaya A, Huang H et al (2004) PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res 32:D112–D114
Wu CH, Apweiler R, Bairoch A et al (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34:D187–D191
wwPDB (2008) Worldwide Protein Data Bank. http://www.wwpdb.org/. Accessed 8 Sept 2008
Yosef N, Sharan R, Noble WS (2008) Improved network-based identification of protein orthologs. Bioinformatics 24(16):i200–i206

Author information

Authors and Affiliations

University of Queensland, Brisbane, QLD, Australia
Terry Clark

Corresponding author

Correspondence to Terry Clark.

Editor information

Editors and Affiliations

Inst. Molecular Bioscience, University of Queensland, St. Lucia, 4072, Australia
David Edwards
Dept. Plant & Microbial Biology, University of California, Berkeley, Koshland Hall 111, Berkeley, 94720, U.S.A.
Jason Stajich
e-Health Research Centre, Adelaide St. 300, Brisbane, 4000, Australia
David Hansen

About this chapter

Cite this chapter

Clark, T. (2009). Protein Sequence Databases. In: Edwards, D., Stajich, J., Hansen, D. (eds) Bioinformatics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92738-1_10

DOI: https://doi.org/10.1007/978-0-387-92738-1_10
Published: 05 August 2009
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-92737-4
Online ISBN: 978-0-387-92738-1
eBook Packages: Biomedical and Life Sciences (R0)