Databases and Protein Structures

  • Henrik Christensen
  • Lisbeth E. de Vries
Part of the Learning Materials in Biosciences book series (LMB)


Bioinformatics databases contain biological data from scientific experiments, most importantly DNA and protein sequences and protein structures. Databases of published literature, computational analyses of primary data, and metadata are also important. The terms primary and secondary database refer to the type and source of the stored data. Primary databases, such as GenBank and ENA, are also called archives or repositories. They take information directly from individual researchers, and the data are owned by the submitter, who keeps the privilege to change them. Coding DNA sequences in the nucleotide databases DDBJ, EMBL, and GenBank are automatically translated to the protein level. Secondary databases (e.g., Swiss-Prot and PDB) are curated: they perform quality control and sort the information before it is made accessible to the public. These databases have a better chance of reducing redundancy. They can also bypass the submitters of primary database entries that are no longer updated. Domains are compact units of proteins that may behave independently and be associated with certain functions. Motifs are conserved regions of proteins that may be part of domains. Domains can be predicted from single motifs, multiple motifs, or full domains, or with procedures that combine different methods. The function of a protein can be predicted even from rather low identity to other known proteins over a rather short length of comparison, and from rather low similarity to known protein structures. Proteomics deals with the prediction of proteins based on measurements of mass-to-charge ratios (m/z). The prediction is then done with programs like Mascot, where the m/z values from an analysis are compared against the reviewed part of UniProt.
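The idea of holding measured m/z values up against a protein database can be illustrated with a minimal Python sketch. It assumes a simplified peptide mass fingerprinting approach: each database protein is digested in silico with trypsin rules (cleavage after K or R, not before P, no missed cleavages), theoretical m/z values are computed from standard monoisotopic residue masses, and proteins are ranked by the number of observed values they explain. This is not how Mascot scores matches (Mascot uses a probability-based score); the function names, the toy sequences, and the 0.01 tolerance are hypothetical choices made for illustration only.

```python
# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # added once per peptide (free N- and C-terminus)
PROTON = 1.007276   # mass of the charge carrier


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mz(peptide, charge=1):
    """Theoretical m/z of a peptide carrying `charge` protons."""
    mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge


def count_matches(observed, protein, tol=0.01):
    """Number of observed m/z values explained by the protein's peptides."""
    theoretical = [peptide_mz(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


def rank_proteins(observed, database, tol=0.01):
    """Return (name, hits) pairs sorted by descending number of hits."""
    hits = [(name, count_matches(observed, seq, tol))
            for name, seq in database.items()]
    return sorted(hits, key=lambda pair: -pair[1])


# Hypothetical toy database and a simulated measurement of protA.
database = {"protA": "AGKGASR", "protB": "MKWVTFR"}
observed = [peptide_mz(p) for p in tryptic_peptides(database["protA"])]
print(rank_proteins(observed, database))
```

Counting hits is the simplest possible ranking. A real search engine also has to handle missed cleavages, modifications, multiple charge states, and the chance that a match occurs at random, which is why probabilistic scores are used in practice.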



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Veterinary Animal Sciences, University of Copenhagen, Copenhagen, Denmark
  2. Københavns Professionshøjskole, Copenhagen, Denmark
