Synonyms
Data fusion; Duplicate detection; LSID; Object identification
Definition
An entity is the representation of a (not necessarily physical) real-world object, such as a gene, a protein, or a disease, within a database. To integrate information about the same entities from different databases, these representations must be analyzed to uncover the corresponding underlying objects. This process is called entity identification. A variation of entity identification is duplicate detection, which analyzes two or more entities to determine whether they represent the same real-world object. Finally, data fusion is the process of generating a single, homogeneous representation from multiple, possibly inconsistent entities that represent the same real-world object.
When entities have globally unique keys, such as ISBNs in the case of books, entity identification and duplicate detection are simple. However, in life science databases, one usually has only descriptive...
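The distinction drawn above between key-based matching, duplicate detection, and data fusion can be illustrated with a minimal Python sketch. It is not the method of this entry: the `Entity` fields, the name-similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, and the "longest value wins" fusion rule are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Sequence


@dataclass(frozen=True)
class Entity:
    """Hypothetical database record describing one real-world object."""
    source: str            # name of the originating database (assumed field)
    key: Optional[str]     # globally unique key, if the source provides one
    name: str              # descriptive attribute
    description: str = ""  # further descriptive text


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(a: Entity, b: Entity, threshold: float = 0.8) -> bool:
    """Duplicate detection: do a and b represent the same object?"""
    # With globally unique keys on both sides, the decision is a simple comparison.
    if a.key is not None and b.key is not None:
        return a.key == b.key
    # Otherwise fall back to comparing descriptive attributes (assumed heuristic).
    return similarity(a.name, b.name) >= threshold


def fuse(entities: Sequence[Entity]) -> Entity:
    """Data fusion: build one representation from several duplicates.

    Conflict resolution here is deliberately naive: take the first available
    key and the longest name and description.
    """
    key = next((e.key for e in entities if e.key is not None), None)
    name = max((e.name for e in entities), key=len)
    description = max((e.description for e in entities), key=len)
    return Entity(source="fused", key=key, name=name, description=description)


if __name__ == "__main__":
    p1 = Entity("db_a", None, "tumor protein p53", "tumor suppressor")
    p2 = Entity("db_b", None, "Tumor Protein P53")
    if is_duplicate(p1, p2):
        print(fuse([p1, p2]))
```

In practice, the similarity function and the conflict-resolution rule would be chosen per attribute type; the single string comparison here only marks where such choices enter.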
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Leser, U. (2018). Semantic Data Integration for Life Science Entities. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_627
DOI: https://doi.org/10.1007/978-1-4614-8265-9_627
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering