A Practical Guide to Entity Resolution with OYSTER

Talburt, John R.; Zhou, Yinle

doi:10.1007/978-3-642-36257-6_11

Abstract

This chapter discusses the concepts and methods of entity resolution (ER) and how they can be applied in practice to eliminate redundant data records and support master data management programs. The chapter is organized into two main parts. The first part discusses the components of ER, with particular emphasis on approximate matching algorithms and the activities that comprise identity information management. The second part provides a step-by-step guide to building an ER process, including data profiling, data preparation, identity attribute selection, rule development, ER algorithm considerations, deciding on an identity management strategy, results analysis, and rule refinement. Each step in the process is illustrated with an actual example using OYSTER, an open-source entity resolution system.

Author information

Authors and Affiliations

University of Arkansas at Little Rock, Little Rock, AR, USA
John R. Talburt & Yinle Zhou

Authors

John R. Talburt
Yinle Zhou

Corresponding author

Correspondence to John R. Talburt.

Editor information

Editors and Affiliations

University of Queensland, Brisbane, Australia
Shazia Sadiq

About this chapter

Cite this chapter

Talburt, J.R., Zhou, Y. (2013). A Practical Guide to Entity Resolution with OYSTER. In: Sadiq, S. (ed) Handbook of Data Quality. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_11

DOI: https://doi.org/10.1007/978-3-642-36257-6_11
Published: 13 February 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36256-9
Online ISBN: 978-3-642-36257-6
eBook Packages: Computer Science, Computer Science (R0)
