Decision Models for Record Linkage

Gu, Lifang; Baxter, Rohan

doi:10.1007/11677437_12


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3755)


Abstract

Record linkage, the process of identifying record pairs that represent the same real-world entity across multiple databases, is an important step in many data mining applications. In this paper, we address one sub-task of record linkage: assigning an appropriate matching status to each record pair. Techniques for solving this problem are referred to as decision models. Most existing decision models rely on good training data, which is seldom available in real-world applications. Decision models based on unsupervised machine learning techniques have recently been proposed. We review several existing decision models and then propose an enhancement to cluster-based decision models. Experimental results show that our proposed decision model achieves the same accuracy as existing models while significantly reducing the number of record pairs required for manual review. The proposed model also provides a mechanism to trade off accuracy against the number of record pairs required for clerical review.
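
To make the notion of a decision model concrete, the following is a minimal sketch of a generic two-threshold rule that assigns each record pair a matching status. It is not the cluster-based model proposed in the chapter; the similarity scores, the threshold values, and the function names (`classify_pair`, `classify_pairs`, `review_count`) are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def classify_pair(score: float, lower: float, upper: float) -> str:
    """Assign a matching status to one record pair from its similarity score.

    Assumption: scores at or above `upper` are matches, scores at or below
    `lower` are non-matches, and everything in between is sent to clerical
    (manual) review as a possible match.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"


def classify_pairs(scores: Dict[Pair, float], lower: float, upper: float) -> Dict[Pair, str]:
    """Apply the two-threshold rule to every record pair."""
    return {pair: classify_pair(s, lower, upper) for pair, s in scores.items()}


def review_count(labels: Dict[Pair, str]) -> int:
    """Count the pairs that would need clerical review."""
    return sum(1 for label in labels.values() if label == "possible match")


# Hypothetical similarity scores for four candidate record pairs.
scores = {
    ("a1", "b1"): 0.95,
    ("a2", "b2"): 0.62,
    ("a3", "b3"): 0.48,
    ("a4", "b4"): 0.10,
}

wide = classify_pairs(scores, lower=0.3, upper=0.9)    # cautious: wide review band
narrow = classify_pairs(scores, lower=0.5, upper=0.6)  # aggressive: narrow review band

print(review_count(wide), review_count(narrow))
```

Moving the two thresholds closer together sends fewer pairs to clerical review, at the risk of more automatic misclassifications. This is the kind of trade-off between accuracy and review workload that the abstract refers to.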



Author information

Authors and Affiliations

CSIRO ICT Centre, GPO Box 664, Canberra, ACT, 2601, Australia
Lifang Gu
Australian Taxation Office, 2 Constitution Avenue, Canberra, ACT, 2601, Australia
Rohan Baxter


Editor information

Editors and Affiliations

The Australian Taxation Office,
Graham J. Williams
School of Computing and Mathematics, University of Western Sydney, Sydney, NSW, Australia
Simeon J. Simoff


About this chapter

Cite this chapter

Gu, L., Baxter, R. (2006). Decision Models for Record Linkage. In: Williams, G.J., Simoff, S.J. (eds) Data Mining. Lecture Notes in Computer Science, vol 3755. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11677437_12


DOI: https://doi.org/10.1007/11677437_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32547-5
Online ISBN: 978-3-540-32548-2
eBook Packages: Computer Science; Computer Science (R0)
