Abstract
This chapter discusses the concepts and methods of entity resolution (ER) and how they can be applied in practice to eliminate redundant data records and support master data management programs. The chapter is organized into two main parts. The first part discusses the components of ER, with particular emphasis on approximate matching algorithms and the activities that comprise identity information management. The second part provides a step-by-step guide to building an ER process, including data profiling, data preparation, identity attribute selection, rule development, ER algorithm considerations, deciding on an identity management strategy, results analysis, and rule refinement. Each step in the process is illustrated with an actual example using the OYSTER open-source entity resolution system.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Talburt, J.R., Zhou, Y. (2013). A Practical Guide to Entity Resolution with OYSTER. In: Sadiq, S. (eds) Handbook of Data Quality. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_11
DOI: https://doi.org/10.1007/978-3-642-36257-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36256-9
Online ISBN: 978-3-642-36257-6
eBook Packages: Computer Science (R0)