Using Metric Space Indexing for Complete and Efficient Record Linkage

Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10939)

Abstract

Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
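The idea of folding indexing, comparison and classification into one similarity query can be illustrated with a small sketch. It is not the authors' implementation: a BK-tree stands in here as a compact metric index over Levenshtein distance, records are reduced to single strings, and the names `BKTree`, `link` and the explicit `threshold` argument are assumptions made for this example only. Because the tree prunes using the triangle inequality alone, a query returns every record within the threshold, which is the sense in which such a linkage is complete.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming; it satisfies the metric axioms."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


class _Node:
    def __init__(self, value: str) -> None:
        self.value = value
        self.children: Dict[int, "_Node"] = {}  # keyed by distance to value


class BKTree:
    """Hypothetical stand-in for a metric index (illustration only)."""

    def __init__(self) -> None:
        self.root = None

    def add(self, value: str) -> None:
        if self.root is None:
            self.root = _Node(value)
            return
        node = self.root
        while True:
            d = levenshtein(value, node.value)
            child = node.children.get(d)
            if child is None:
                node.children[d] = _Node(value)
                return
            node = child

    def within(self, query: str, threshold: int) -> List[Tuple[str, int]]:
        """Return every stored value whose distance to query is <= threshold."""
        results: List[Tuple[str, int]] = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            d = levenshtein(query, node.value)
            if d <= threshold:
                results.append((node.value, d))
            # Triangle inequality: a match in the subtree reached by `edge`
            # requires d - threshold <= edge <= d + threshold.
            for edge, child in node.children.items():
                if d - threshold <= edge <= d + threshold:
                    stack.append(child)
        return results


def link(records_a: List[str], records_b: List[str],
         threshold: int) -> List[Tuple[str, str, int]]:
    """Classify as a link every pair (a, b) with distance <= threshold."""
    index = BKTree()
    for b in records_b:
        index.add(b)
    return [(a, b, d) for a in records_a for b, d in index.within(a, threshold)]
```

Calling `link(["smith"], ["smyth", "jones"], 1)` queries the index once per record in the first list; any pair returned is treated as a link, so no separate comparison or classification pass is needed in this sketch.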

Notes

  1. Experimental data, additional figures and source code can be downloaded from: http://github.com/digitisingscotland/pakdd2018-metric-linkage.

  2. Relatively high Levenshtein edit distances are included since Cora contains a number of low-similarity true matches.

  3. Note that recent research identifies some problematic aspects of using the F-measure to compare record linkage procedures at different similarity thresholds [14].

References

  1. Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: learning to scale up record linkage. In: IEEE ICDM, Hong Kong, pp. 87–96 (2006)

  2. Bo, L., Yujian, L.: A normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1091–1095 (2007). https://doi.org/10.1109/TPAMI.2007.1078

  3. Broder, A.: On the resemblance and containment of documents. In: IEEE Compression and Complexity of Sequences, Salerno, Italy, pp. 21–29 (1997)

  4. Christen, P.: Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31164-2

  5. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE TKDE 24(9), 1537–1555 (2012)

  6. Ciaccia, P., Patella, M., Rabitti, F., Zezula, P.: Indexing metric spaces with M-tree. In: Italian Symposium on Advanced Database Systems 1997, pp. 67–86 (1997)

  7. Connor, R.: A tale of four metrics. In: Amsaleg, L., Houle, M.E., Schubert, E. (eds.) SISAP 2016. LNCS, vol. 9939, pp. 210–217. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46759-7_16

  8. Connor, R., Vadicamo, L., Rabitti, F.: High-dimensional simplexes for supermetric search. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds.) SISAP 2017. LNCS, pp. 96–109. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68474-1_7

  9. Dibben, C., Williamson, L., Huang, Z.: Digitising Scotland (2012). http://gtr.rcuk.ac.uk/projects?ref=ES/K00574X/2

  10. Dong, X.L., Srivastava, D.: Big data integration. Synth. Lect. Data Manag. 7(1), 1–198 (2015)

  11. Draisbach, U., Naumann, F., Szott, S., Wonneberg, O.: Adaptive windows for duplicate detection. In: IEEE ICDE, Washington, DC, pp. 1073–1083 (2012)

  12. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

  13. Fisher, J., Wang, Q.: Unsupervised measuring of entity resolution consistency. In: IEEE ICDM DINA Workshop, pp. 218–221 (2015)

  14. Hand, D., Christen, P.: A note on using the F-measure for evaluating record linkage algorithms. Stat. Comput. 28(3), 539–547 (2018)

  15. Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. SIGMOD Rec. 27(2), 237–248 (1998)

  16. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM STOC, Dallas, pp. 604–613 (1998)

  17. Kejriwal, M., Miranker, D.P.: An unsupervised algorithm for learning blocking schemes. In: IEEE ICDM, Dallas, pp. 340–349 (2013)

  18. Kim, H., Lee, D.: HARRA: fast iterative hashed record linkage for large-scale data collections. In: EDBT, Lausanne, pp. 525–536 (2010)

  19. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Cybern. Control Theory 10, 707–710 (1966)

  20. Li, C., Jin, L., Mehrotra, S.: Supporting efficient record linkage for large data sets using mapping techniques. World Wide Web 9(4), 557–584 (2006)

  21. McCallum, A.: Cora dataset: cora.csv (2017). https://doi.org/10.3886/E4728V1

  22. Michelson, M., Knoblock, C.A.: Learning blocking schemes for record linkage. In: AAAI, Boston (2006)

  23. Monge, A.E., Elkan, C.P.: The field-matching problem: algorithm and applications. In: ACM SIGKDD, Portland, pp. 267–270 (1996)

  24. Newcombe, H., Kennedy, J., Axford, S., James, A.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)

  25. Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9(9), 684–695 (2016)

  26. Ramadan, B., Christen, P.: Unsupervised blocking key selection for real-time entity resolution. In: Cao, T., Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D., Motoda, H. (eds.) PAKDD 2015. LNCS (LNAI), vol. 9078, pp. 574–585. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18032-8_45

  27. Reid, A., Garrett, E., Davies, R., Blaikie, A.: Scottish census enumerators’ books: Skye, Kilmarnock, Rothiemay and Torthorwald, 1861–1901. Economic and Social Data Service (2006)

  28. Reid, A., Davies, R., Garrett, E.: Nineteenth-century Scottish demography from linked censuses and civil registers: a ‘sets of related individuals’ approach. History Comput. 14(1–2), 61–86 (2002)

  29. Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E.: A comparison of blocking methods for record linkage. In: Domingo-Ferrer, J. (ed.) PSD 2014. LNCS, vol. 8744, pp. 253–268. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11257-2_20

  30. Wang, Q., Vatsalan, D., Christen, P.: Efficient interactive training selection for large-scale entity resolution. In: Cao, T., Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D., Motoda, H. (eds.) PAKDD 2015. LNCS (LNAI), vol. 9078, pp. 562–573. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18032-8_44

  31. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach. Springer, Boston (2010). https://doi.org/10.1007/0-387-29151-2

Acknowledgements

This work was supported by ESRC grants ES/K00574X/2 “Digitising Scotland” and ES/L007487/1 “Administrative Data Research Centre—Scotland”.

We thank Alice Reid of the University of Cambridge and her colleagues, especially Ros Davies and Eilidh Garrett, for the work undertaken on the Kilmarnock and Isle of Skye databases.

Author information

Corresponding author

Correspondence to Özgür Akgün.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Akgün, Ö., Dearle, A., Kirby, G., Christen, P. (2018). Using Metric Space Indexing for Complete and Efficient Record Linkage. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science (LNAI), vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_8

  • DOI: https://doi.org/10.1007/978-3-319-93040-4_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93039-8

  • Online ISBN: 978-3-319-93040-4

  • eBook Packages: Computer Science, Computer Science (R0)
