Abstract
Data integration is an open challenge in bioinformatics. Querying and retrieving data from remote and local sources, and then analyzing them, are very time-consuming tasks for biologists. Data integration allows biologists to combine knowledge from multiple disciplines, and it has become a critical issue in biological research in recent years. Advances in technology have paved the way to a huge and growing amount of available biological data. However, the distinctive difficulty in integrating biological data lies not mainly in the amount of data but in their complexity: biological data sources are strongly heterogeneous in many respects. Several approaches and systems, based on different technologies and techniques, have been proposed in the literature to deal with the problem of integrating biological sources. Nevertheless, no approach yet exists that solves all of these problems. This chapter provides a survey of data integration for biological sources.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Manconi, A., Rodriguez-Tomé, P. (2011). A Survey on Integrating Data in Bioinformatics. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_19
DOI: https://doi.org/10.1007/978-3-642-22913-8_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22912-1
Online ISBN: 978-3-642-22913-8
eBook Packages: Engineering, Engineering (R0)