Abstract
The near-exponential growth in protein sequence data is at the foundation of transformations in biological research and related technological developments. The use of protein sequence data is widespread in fields including agronomy, biochemistry, ecology, entomology, evolution, genetics, genetic engineering, genomics, molecular phylogenetics and systematics, pharmacology, and toxicology. The remarkable increase in available protein sequences will most likely continue with the proliferation of genome sequencing projects, the latter enabled by ongoing improvements in DNA sequencing technology. Along with opportunities, protein sequence data bring scientifically challenging problems.
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Clark, T. (2009). Protein Sequence Databases. In: Edwards, D., Stajich, J., Hansen, D. (eds) Bioinformatics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92738-1_10
DOI: https://doi.org/10.1007/978-0-387-92738-1_10
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-92737-4
Online ISBN: 978-0-387-92738-1
eBook Packages: Biomedical and Life Sciences (R0)