A Practical Guide to Entity Resolution with OYSTER

  • Chapter
Handbook of Data Quality

Abstract

This chapter discusses the concepts and methods of entity resolution (ER) and how they can be applied in practice to eliminate redundant data records and support master data management programs. The chapter is organized into two main parts. The first part discusses the components of ER, with particular emphasis on approximate matching algorithms and the activities that comprise identity information management. The second part provides a step-by-step guide to building an ER process, including data profiling, data preparation, identity attribute selection, rule development, ER algorithm considerations, deciding on an identity management strategy, results analysis, and rule refinement. Each step in the process is illustrated with an actual example using OYSTER, an open-source entity resolution system.



Author information


Correspondence to John R. Talburt.


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Talburt, J.R., Zhou, Y. (2013). A Practical Guide to Entity Resolution with OYSTER. In: Sadiq, S. (ed.) Handbook of Data Quality. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36257-6_11

  • DOI: https://doi.org/10.1007/978-3-642-36257-6_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36256-9

  • Online ISBN: 978-3-642-36257-6

  • eBook Packages: Computer Science, Computer Science (R0)
