Abstract
Data integration is the process of combining (also called “merging” or “joining”) multiple, distinct data objects into a single unified data object. The motivation for integrating data is usually to bring together the information needed to jointly analyze or model some phenomenon. Producing a single, consistently structured object greatly simplifies further manipulation of those data and clarifies the presumed relationships among them.
Data integration is essential for many scientific disciplines, and especially for ecology and the environmental sciences, where processes and patterns of interest often emerge from interactions among numerous complex physical phenomena. Observations of these distinct phenomena are often collected by disparate parties in uncoordinated ways, using different data systems. These data must then be gathered and appropriately integrated so that further modeling and analysis can clarify the nature and strength of any relationships among them. Synthesis studies, in particular, often require finding disparate data and bringing them together in an integrated form in order to reveal new insights.
This chapter describes aspects of data that are critical for determining whether and how data can be integrated, and discusses some of the theoretical considerations and common mechanisms for integrating data.
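As a rough illustration of the “merging” or “joining” described above, the following minimal sketch combines two small tables into one unified table using a shared key. The table names, field names, and values are hypothetical and are not taken from the chapter; the `inner_join` helper is an illustrative assumption, not a method the chapter prescribes.

```python
# Hypothetical example data: all names, keys, and values are illustrative only.
temperature = [
    {"site": "A", "date": "2016-06-01", "temp_c": 11.5},
    {"site": "B", "date": "2016-06-01", "temp_c": 9.0},
    {"site": "C", "date": "2016-06-01", "temp_c": 14.2},
]
counts = [
    {"site": "A", "date": "2016-06-01", "species_count": 12},
    {"site": "B", "date": "2016-06-01", "species_count": 7},
    {"site": "D", "date": "2016-06-01", "species_count": 3},
]


def inner_join(left, right, keys):
    """Combine rows from two tables that agree on every key field."""
    # Index the right-hand table by its key values for fast lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined = []
    for row in left:
        # Rows without a partner in the other table are dropped (inner join).
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


integrated = inner_join(temperature, counts, keys=("site", "date"))
for row in integrated:
    print(row)
```

Only sites A and B appear in the result, because sites C and D each lack a matching record in the other table. Whether such unmatched observations should be dropped or retained is one of the decisions that depends on the analysis the integrated data are meant to support.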
Acknowledgements
I would like to thank Julien Brun for suggesting several useful changes and corrections to the text. Shawn Bowers was a stalwart companion while dissecting the structure of numerous scientific datasets. I especially want to acknowledge many years of fruitful and stimulating discussions with Matthew B. Jones on the nature of ecological data, and on the need for better software tools and cyberinfrastructure to support synthesis and collaboration in the environmental sciences. The National Center for Ecological Analysis and Synthesis (NCEAS) has provided a strongly supportive environment for advancing ecoinformatics practice, and still represents, to my mind, a beacon for promoting and facilitating synthesis in the ecological and conservation sciences. Finally, I want to thank colleagues from several past and ongoing NSF-sponsored cyberinfrastructure projects, including DataONE (NSF #1430508), SEEK (NSF #0225676), SONet (NSF #0753144), and the KNB (NSF #9980154). It has been a continual and pleasurable collaborative learning process with many bright and selfless colleagues.
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Schildhauer, M. (2018). Data Integration: Principles and Practice. In: Recknagel, F., Michener, W. (eds) Ecological Informatics. Springer, Cham. https://doi.org/10.1007/978-3-319-59928-1_8
DOI: https://doi.org/10.1007/978-3-319-59928-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59926-7
Online ISBN: 978-3-319-59928-1
eBook Packages: Earth and Environmental Science, Earth and Environmental Science (R0)