Columba: Multidimensional Data Integration of Protein Annotations

  • Conference paper
Data Integration in the Life Sciences (DILS 2004)

Abstract

We present COLUMBA, an integrated database of protein annotations. COLUMBA is centered on proteins whose structure has been resolved and adds as many annotations as possible to those proteins, describing properties such as function, sequence, classification, textual description, and participation in pathways. Annotations are extracted from seven (soon eleven) external data sources. In this paper we describe the motivation for building COLUMBA, its integration architecture, and the software tools we developed for integrating the data sources and keeping COLUMBA up to date. We put special focus on two aspects. First, COLUMBA does not try to remove redundancies and overlaps between data sources, but views each data source as a dimension in its own right describing a protein. We explain the advantages of this approach compared to the tighter semantic integration pursued in many other projects. Second, we highlight our current investigations into the quality of data in COLUMBA, which aim to identify hot spots of poor data quality.
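The multidimensional idea in the abstract can be illustrated with a minimal sketch. It is not COLUMBA's actual schema or code: the class `ProteinEntry`, the function `select`, the source names, the identifiers and the record fields are all hypothetical and chosen only for illustration. Each source contributes its own dimension to a structure-centred entry, records from different sources are never merged, and a query is a conjunction of conditions on individual dimensions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ProteinEntry:
    """One structure-centred entry (hypothetical model, not COLUMBA's schema)."""
    pdb_id: str
    # One key per data source; each source keeps its own list of records.
    dimensions: Dict[str, List[dict]] = field(default_factory=dict)

    def annotate(self, source: str, record: dict) -> None:
        # Records are appended per source and never merged across sources,
        # so overlapping or conflicting annotations remain visible.
        self.dimensions.setdefault(source, []).append(record)


def select(entries: List[ProteinEntry],
           **conditions: Callable[[dict], bool]) -> List[ProteinEntry]:
    """Keep entries that have, in every named dimension, a matching record."""
    return [
        e for e in entries
        if all(any(pred(r) for r in e.dimensions.get(src, []))
               for src, pred in conditions.items())
    ]


# Hypothetical data: identifiers and values are made up for illustration.
a = ProteinEntry("1abc")
a.annotate("SCOP", {"class": "alpha"})
a.annotate("SCOP", {"class": "alpha"})   # redundant record is kept as-is
a.annotate("ENZYME", {"ec": "0.0.0.0"})

b = ProteinEntry("2xyz")
b.annotate("SCOP", {"class": "alpha"})

hits = select(
    [a, b],
    SCOP=lambda r: r["class"] == "alpha",
    ENZYME=lambda r: True,               # any enzyme annotation at all
)
print([e.pdb_id for e in hits])
```

In this sketch, an entry lacking a dimension simply fails any condition on that dimension, which mirrors the design choice of treating sources as independent views rather than reconciling them into one merged record.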

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rother, K. et al. (2004). Columba: Multidimensional Data Integration of Protein Annotations. In: Rahm, E. (eds) Data Integration in the Life Sciences. DILS 2004. Lecture Notes in Computer Science, vol 2994. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24745-6_11

  • DOI: https://doi.org/10.1007/978-3-540-24745-6_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21300-0

  • Online ISBN: 978-3-540-24745-6

  • eBook Packages: Springer Book Archive
