Abstract
Protein sequence databases contain not only the sequence of a protein but also annotation that reflects our knowledge of its function and of the residues that contribute to it. This chapter discusses various public protein sequence databases, with a focus on those that are generally applicable. Special attention is paid to the reliability of both sequence and annotation, as these are fundamental to many of the questions researchers ask. Using both well-annotated and scarcely annotated human proteins as examples, it shows what information about a target can be collected from freely available Internet resources and how this information can be used. The results are summarized in a simple graphical model of the protein’s sequence architecture that highlights its structural and functional modules.
Copyright information
© 2010 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Rebhan, M. (2010). Protein Sequence Databases. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 609. Humana Press. https://doi.org/10.1007/978-1-60327-241-4_3
DOI: https://doi.org/10.1007/978-1-60327-241-4_3
Publisher Name: Humana Press
Print ISBN: 978-1-60327-240-7
Online ISBN: 978-1-60327-241-4
eBook Packages: Springer Protocols