Record Matching

Arasu, Arvind; Domingo-Ferrer, Josep

doi:10.1007/978-1-4614-8265-9_594

Synonyms

Deduplication in Data Cleaning; Duplicate detection; Entity resolution; Instance identification; Merge-purge; Name matching; Record linkage

Definition

Record matching is the problem of identifying whether two records in a database refer to the same real-world entity. For example, in Fig. 1, the customer record A1 in Table A and the record B1 in Table B probably refer to the same customer and should therefore be matched. (The example in Fig. 1 is adapted from an example in [21].) As Fig. 1 suggests, the same entity can be encoded in different ways in a database; this is fairly common and arises for a variety of reasons, such as different formatting conventions, abbreviations, and typographic errors. Record matching is often studied in the following setting: given two relations A and B, identify all pairs of matching records, one from each relation. For the two tables in Fig. 1, a reasonable output might be the pairs (A1, B1) and (A2, B2). In some settings of...
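The two-relation setting described above can be illustrated with a minimal sketch. It assumes a token-based Jaccard similarity and a fixed threshold, which is only one of many possible matching criteria and is not necessarily the method the entry goes on to describe. The records, field names, function names, and the threshold of 0.5 are hypothetical and do not reproduce the contents of Fig. 1.

```python
import re


def tokens(record):
    """Lowercase all field values of a record and split them into a set of word tokens."""
    text = " ".join(str(value) for value in record.values())
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_records(table_a, table_b, threshold=0.5):
    """Return (id_a, id_b, score) for every cross-table pair with score >= threshold.

    This compares all pairs, so it is quadratic in the table sizes; it is meant
    to show the problem statement, not an efficient similarity join.
    """
    matches = []
    for id_a, rec_a in table_a.items():
        for id_b, rec_b in table_b.items():
            score = jaccard(tokens(rec_a), tokens(rec_b))
            if score >= threshold:
                matches.append((id_a, id_b, round(score, 2)))
    return matches


# Hypothetical customer records (assumed for illustration; not the data of Fig. 1).
table_a = {
    "A1": {"name": "John Smith", "city": "Seattle", "state": "WA"},
    "A2": {"name": "Maria Garcia", "city": "Portland", "state": "OR"},
}
table_b = {
    "B1": {"name": "Smith, John", "city": "Seattle", "state": "WA"},
    "B2": {"name": "Maria Garcia", "city": "Portland", "state": "Oregon"},
}

print(match_records(table_a, table_b))
```

With these hypothetical tables, the pair (A1, B1) differs only in name formatting and scores 1.0, while (A2, B2) differs in the state abbreviation and scores 0.6; both clear the assumed threshold, and the two cross pairs share no tokens and are rejected.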

Recommended Reading

Arasu A, Chaudhuri S, Kaushik R. Transformation-based framework for record matching. In: Proceedings of the 24th International Conference on Data Engineering; 2008. p. 40–9.
Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases; 2006. p. 918–29.
Bilenko M, Mooney RJ. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2004. p. 39–48.
Chaudhuri S, Chen BC, Ganti V, Kaushik R. Example-driven design of efficient record matching queries. In: Proceedings of the 33rd International Conference on Very Large Data Bases; 2007. p. 327–38.
Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2003. p. 313–24.
Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data Engineering; 2006.
Cochinwala M, Kurien V, Lalk G, Shasha D. Efficient data reconciliation. Inf Sci. 2001;137(1–4):1–15.
Cohen WW. Data integration using similarity joins and a word-based information representation language. ACM Trans Inf Syst. 2000;18(3):288–321.
Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng. 2007;19(1):1–16.
Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64(328):1183–210.
Hernandez M, Stolfo S. The merge/purge problem for large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1995. p. 127–38.
Jaro MA. Unimatch: a record linkage system: user’s manual. Technical Report. Washington, DC: US Bureau of the Census; 1976.
Jaro MA. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989;84(406):414–20.
Koudas N, Sarawagi S, Srivastava D. Record linkage: similarity measures and algorithms. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2006. p. 802–3.
McCallum A, Nigam K, Ungar LH. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2000. p. 169–78.
Newcombe HB, Kennedy JM, Axford SJ, James AP. Automatic linkage of vital records. Science. 1959;130(3381):954–9.
Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002. p. 269–78.
Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2004. p. 743–54.
Torra V, Domingo-Ferrer J. Record linkage methods for multidatabase data mining. In: Torra V, editor. Information fusion in data mining. Springer; 2003. p. 101–32.
Winkler W. Improved decision rules in the Fellegi-Sunter model of record linkage. Technical Report. Washington, DC: Statistical Research Division/US Bureau of the Census; 1993.
Winkler W. The state of record linkage and current research problems. Technical Report. Washington, DC: Statistical Research Division/US Bureau of the Census; 1999.

Author information

Authors and Affiliations

Microsoft Research, Redmond, WA, USA
Arvind Arasu
Universitat Rovira i Virgili, Tarragona, Catalonia, Spain
Josep Domingo-Ferrer

Corresponding author

Correspondence to Arvind Arasu.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Section Editor information

Microsoft Research, Microsoft Corporation, Redmond, WA, USA
Venkatesh Ganti

Cite this entry

Arasu, A., Domingo-Ferrer, J. (2018). Record Matching. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_594

DOI: https://doi.org/10.1007/978-1-4614-8265-9_594
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering