A Survey on Integrating Data in Bioinformatics

Part of the book series: Studies in Computational Intelligence (SCI, volume 375)

Abstract

Data integration remains an open challenge in bioinformatics. Querying and retrieving data from remote and/or local sources, and then analyzing them, are very time-consuming tasks for biologists. Data integration allows biologists to combine knowledge from multiple disciplines, and it has become a critical issue in biological research in recent years. Advances in technology have paved the way for a huge and growing amount of available biological data. However, the distinctive difficulty in integrating biological data lies not mainly in their volume but in their complexity: biological data sources are strongly heterogeneous in many respects. Several approaches and systems, based on different technologies and techniques, have been proposed in the literature to address the problem of integrating biological sources. Nevertheless, no approach yet exists that solves all of these problems. This chapter provides a survey of data integration for biological sources.

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Manconi, A., Rodriguez-Tomé, P. (2011). A Survey on Integrating Data in Bioinformatics. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_19

  • DOI: https://doi.org/10.1007/978-3-642-22913-8_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22912-1

  • Online ISBN: 978-3-642-22913-8

  • eBook Packages: Engineering, Engineering (R0)
