
Decision Models for Record Linkage

  • Chapter
Data Mining

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3755)

Abstract

Record linkage, the process of identifying record pairs that represent the same real-world entity across multiple databases, is an important step in many data mining applications. In this paper, we address one sub-task of record linkage: assigning an appropriate matching status to each record pair. Techniques for solving this problem are referred to as decision models. Most existing decision models rely on good training data, which is often unavailable in real-world applications. Decision models based on unsupervised machine learning techniques have recently been proposed. We review several existing decision models and then propose an enhancement to cluster-based decision models. Experimental results show that the proposed decision model achieves the same accuracy as existing models while significantly reducing the number of record pairs that require manual review. The proposed model also provides a mechanism for trading off accuracy against the number of record pairs sent for clerical review.



Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Gu, L., Baxter, R. (2006). Decision Models for Record Linkage. In: Williams, G.J., Simoff, S.J. (eds) Data Mining. Lecture Notes in Computer Science, vol 3755. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11677437_12

  • DOI: https://doi.org/10.1007/11677437_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-32547-5

  • Online ISBN: 978-3-540-32548-2

  • eBook Packages: Computer Science, Computer Science (R0)
