Heuristic Supervised Approach for Record Linkage

  • Conference paper
Modeling Decisions for Artificial Intelligence (MDAI 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7647)

Abstract

Record linkage is a well-known technique for linking records in one database to records in another database that refer to the same individuals. Although it is most commonly used in database integration, it is also used in data privacy to evaluate the disclosure risk of protected datasets. In this paper we compare two supervised algorithms based on distance-based record linkage, both of which use the Choquet integral, a fuzzy integral, to compute the distance between records. The first approach solves a linear optimization problem to determine the optimal fuzzy measure for the linkage, while the second is a constrained gradient-type algorithm for identifying the fuzzy measure. We show the advantages and drawbacks of both algorithms and identify the situations in which each works better.
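
As an illustration of the distance-based linkage described in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes that the distance between two numeric records is the Choquet integral of their per-attribute squared differences with respect to a fuzzy measure stored as a dictionary over attribute subsets. The function names (choquet_integral, choquet_distance, additive_measure, link) are hypothetical, and the fuzzy measure is supplied by hand here instead of being learned by linear optimization or a gradient method.

```python
from itertools import combinations


def choquet_integral(values, mu):
    """Discrete Choquet integral of non-negative values w.r.t. fuzzy measure mu.

    mu maps frozenset(attribute indices) -> measure, with mu[all indices] = 1.
    """
    # Visit attributes in ascending order of their values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    total, previous = 0.0, 0.0
    for pos, idx in enumerate(order):
        # Coalition of attributes whose value is at least values[idx].
        coalition = frozenset(order[pos:])
        total += (values[idx] - previous) * mu[coalition]
        previous = values[idx]
    return total


def choquet_distance(a, b, mu):
    """Assumed distance: Choquet integral of squared attribute differences."""
    diffs = [(x - y) ** 2 for x, y in zip(a, b)]
    return choquet_integral(diffs, mu)


def additive_measure(weights):
    """Build an additive fuzzy measure; the integral then is a weighted mean."""
    n = len(weights)
    return {
        frozenset(subset): sum(weights[i] for i in subset)
        for r in range(n + 1)
        for subset in combinations(range(n), r)
    }


def link(original, masked, mu):
    """Link each original record to the index of its nearest masked record."""
    return [
        min(range(len(masked)), key=lambda j: choquet_distance(rec, masked[j], mu))
        for rec in original
    ]
```

With an additive measure the sketch reduces to a weighted Euclidean-style distance; a non-additive measure lets the distance reward or penalize particular combinations of attributes, which is the extra flexibility a supervised method would try to exploit when identifying the measure.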

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Murillo, J., Abril, D., Torra, V. (2012). Heuristic Supervised Approach for Record Linkage. In: Torra, V., Narukawa, Y., López, B., Villaret, M. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2012. Lecture Notes in Computer Science, vol 7647. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34620-0_20

  • DOI: https://doi.org/10.1007/978-3-642-34620-0_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34619-4

  • Online ISBN: 978-3-642-34620-0

  • eBook Packages: Computer Science, Computer Science (R0)
