Encyclopedia of Database Systems

2018 Edition
| Editors: Ling Liu, M. Tamer Özsu

Semantic Data Integration for Life Science Entities

  • Ulf Leser
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_627


Data fusion; Duplicate detection; LSID; Object identification


An entity is the representation of a (not necessarily physical) real-world object, such as a gene, a protein, or a disease, within a database. To integrate information about the same entities from different databases, these representations must be analyzed to uncover the corresponding underlying objects. This process is called entity identification. A variation of entity identification is duplicate detection, which analyzes two or more entities to determine whether they represent the same real-world object. Finally, data fusion is the process of generating a single, homogeneous representation from multiple, possibly inconsistent entities that represent the same real-world object.
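The distinction between duplicate detection and data fusion can be illustrated with a minimal Python sketch. It is not the method described in this entry: the record layout, the name normalization, the similarity measure (the standard library's `difflib.SequenceMatcher`), the threshold of 0.85, and the helper names `name_similarity`, `is_duplicate`, and `fuse` are all assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher


def _normalize(name: str) -> str:
    """Lowercase a name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def is_duplicate(e1: dict, e2: dict, threshold: float = 0.85) -> bool:
    """Duplicate detection (illustrative rule, not a validated one):
    same organism and sufficiently similar names."""
    if e1.get("organism") != e2.get("organism"):
        return False
    return name_similarity(e1["name"], e2["name"]) >= threshold


def fuse(entities: list[dict]) -> dict:
    """Data fusion: build one record from several entities.
    Agreeing values are kept once; conflicting values are kept
    together as a sorted list rather than silently resolved."""
    fused = {}
    for key in sorted({k for e in entities for k in e}):
        values = sorted({e[key] for e in entities if e.get(key) is not None})
        fused[key] = values[0] if len(values) == 1 else values
    return fused


# Hypothetical entries, as they might appear in two different databases.
db1_entry = {"name": "BRCA-1", "organism": "Homo sapiens", "chromosome": "17"}
db2_entry = {"name": "brca1", "organism": "Homo sapiens", "function": "DNA repair"}
other_entry = {"name": "TP53", "organism": "Homo sapiens", "chromosome": "17"}

if is_duplicate(db1_entry, db2_entry):
    print(fuse([db1_entry, db2_entry]))
print(is_duplicate(db1_entry, other_entry))
```

In this sketch the two spellings of the gene name are judged to be duplicates, and the fused record keeps both spellings because the sketch does not decide which one is authoritative. A realistic system would need a conflict resolution strategy and more robust evidence than name similarity alone.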

When entities have globally unique keys, such as ISBN numbers in the case of books, entity identification and duplicate detection are simple. However, in life science databases, one usually has only descriptive...



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Humboldt University of Berlin, Berlin, Germany

Section editors and affiliations

  • Louiqa Raschid (1)
  1. Robert H. Smith School of Business, University of Maryland, College Park, USA