Using Metric Space Indexing for Complete and Efficient Record Linkage

Akgün, Özgür; Dearle, Alan; Kirby, Graham; Christen, Peter

doi:10.1007/978-3-319-93040-4_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10939)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
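The idea of folding indexing, comparison and classification into one step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which metric index is used, so a BK-tree over Levenshtein edit distance stands in here purely for brevity. The names `BKTree`, `link` and `threshold`, the threshold value and the sample names are all illustrative assumptions. Because edit distance obeys the triangle inequality, a range query can prune subtrees without missing any record within the threshold, which is what makes the result complete.

```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance; it satisfies the metric axioms."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class BKTree:
    """Hypothetical stand-in for a metric index (not the index used in the paper)."""

    def __init__(self) -> None:
        # Each node is (value, children), where children maps distance -> node.
        self.root: Optional[Tuple[str, Dict[int, tuple]]] = None

    def add(self, value: str) -> None:
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = levenshtein(value, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (value, {})
                return
            node = child

    def range_query(self, query: str, threshold: int) -> List[Tuple[str, int]]:
        """Return every stored value within `threshold` of `query`."""
        results: List[Tuple[str, int]] = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = levenshtein(query, value)
            if d <= threshold:
                results.append((value, d))
            # Triangle inequality: a match can only sit under an edge whose
            # label lies in [d - threshold, d + threshold].
            for edge, child in children.items():
                if d - threshold <= edge <= d + threshold:
                    stack.append(child)
        return results


def link(records_a: List[str], records_b: List[str],
         threshold: int) -> List[Tuple[str, str, int]]:
    """Index records_a, then classify as links all pairs within the threshold."""
    tree = BKTree()
    for record in records_a:
        tree.add(record)
    return [(query, match, d)
            for query in records_b
            for match, d in sorted(tree.range_query(query, threshold))]


if __name__ == "__main__":
    a = ["john smith", "jon smith", "mary jones", "alan dearle"]
    b = ["john smyth", "mary jone"]
    for pair in link(a, b, threshold=2):
        print(pair)
```

The range query returns exactly the pairs a brute-force comparison of all pairs would return at the same threshold, but it skips subtrees that the triangle inequality rules out. A production index would also need to handle multi-attribute records and much larger data sets, which this sketch does not attempt.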

Notes

1. Experimental data, additional figures and source code can be downloaded from: http://github.com/digitisingscotland/pakdd2018-metric-linkage.
2. Relatively high Levenshtein edit distances are included since Cora contains a number of low-similarity true matches.
3. Note that recent research identifies some problems with using the F-measure to compare record linkage procedures at different similarity thresholds [14].

References

1. Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: learning to scale up record linkage. In: IEEE ICDM, Hong Kong, pp. 87–96 (2006)
2. Bo, L., Yujian, L.: A normalized Levenshtein distance metric. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1091–1095 (2007). https://doi.org/10.1109/TPAMI.2007.1078
3. Broder, A.: On the resemblance and containment of documents. In: IEEE Compression and Complexity of Sequences, Salerno, Italy, pp. 21–29 (1997)
4. Christen, P.: Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31164-2
5. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE TKDE 24(9), 1537–1555 (2012)
6. Ciaccia, P., Patella, M., Rabitti, F., Zezula, P.: Indexing metric spaces with M-tree. In: Italian Symposium on Advanced Database Systems 1997, pp. 67–86 (1997)
7. Connor, R.: A tale of four metrics. In: Amsaleg, L., Houle, M.E., Schubert, E. (eds.) SISAP 2016. LNCS, vol. 9939, pp. 210–217. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46759-7_16
8. Connor, R., Vadicamo, L., Rabitti, F.: High-dimensional simplexes for supermetric search. In: Beecks, C., Borutta, F., Kröger, P., Seidl, T. (eds.) SISAP 2017. LNCS, pp. 96–109. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68474-1_7
9. Dibben, C., Williamson, L., Huang, Z.: Digitising Scotland (2012). http://gtr.rcuk.ac.uk/projects?ref=ES/K00574X/2
10. Dong, X.L., Srivastava, D.: Big data integration. Synth. Lect. Data Manag. 7(1), 1–198 (2015)
11. Draisbach, U., Naumann, F., Szott, S., Wonneberg, O.: Adaptive windows for duplicate detection. In: IEEE ICDE, Washington, DC, pp. 1073–1083 (2012)
12. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
13. Fisher, J., Wang, Q.: Unsupervised measuring of entity resolution consistency. In: IEEE ICDM DINA Workshop, pp. 218–221 (2015)
14. Hand, D., Christen, P.: A note on using the F-measure for evaluating record linkage algorithms. Stat. Comput. 28(3), 539–547 (2018)
15. Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. SIGMOD Rec. 27(2), 237–248 (1998)
16. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM TOC, Dallas, pp. 604–613 (1998)
17. Kejriwal, M., Miranker, D.P.: An unsupervised algorithm for learning blocking schemes. In: IEEE ICDM, Dallas, pp. 340–349 (2013)
18. Kim, H., Lee, D.: HARRA: fast iterative hashed record linkage for large-scale data collections. In: EDBT, Lausanne, pp. 525–536 (2010)
19. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Cybern. Control Theory 10, 707–710 (1966)
20. Li, C., Jin, L., Mehrotra, S.: Supporting efficient record linkage for large data sets using mapping techniques. World Wide Web 9(4), 557–584 (2006)
21. McCallum, A.: Cora dataset: cora.csv (2017). https://doi.org/10.3886/E4728V1
22. Michelson, M., Knoblock, C.A.: Learning blocking schemes for record linkage. In: AAAI, Boston (2006)
23. Monge, A.E., Elkan, C.P.: The field-matching problem: algorithm and applications. In: ACM SIGKDD, Portland, pp. 267–270 (1996)
24. Newcombe, H., Kennedy, J., Axford, S., James, A.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)
25. Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9(9), 684–695 (2016)
26. Ramadan, B., Christen, P.: Unsupervised blocking key selection for real-time entity resolution. In: Cao, T., Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D., Motoda, H. (eds.) PAKDD 2015. LNCS (LNAI), vol. 9078, pp. 574–585. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18032-8_45
27. Reid, A., Garrett, E., Davies, R., Blaikie, A.: Scottish census enumerators’ books: Skye, Kilmarnock, Rothiemay and Torthorwald, 1861–1901. Economic and Social Data Service (2006)
28. Reid, A., Davies, R., Garrett, E.: Nineteenth-century Scottish demography from linked censuses and civil registers: a ‘sets of related individuals’ approach. History Comput. 14(1–2), 61–86 (2002)
29. Steorts, R.C., Ventura, S.L., Sadinle, M., Fienberg, S.E.: A comparison of blocking methods for record linkage. In: Domingo-Ferrer, J. (ed.) PSD 2014. LNCS, vol. 8744, pp. 253–268. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11257-2_20
30. Wang, Q., Vatsalan, D., Christen, P.: Efficient interactive training selection for large-scale entity resolution. In: Cao, T., Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D., Motoda, H. (eds.) PAKDD 2015. LNCS (LNAI), vol. 9078, pp. 562–573. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18032-8_44
31. Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach. Springer, Boston (2010). https://doi.org/10.1007/0-387-29151-2

Acknowledgements

This work was supported by ESRC grants ES/K00574X/2 “Digitising Scotland” and ES/L007487/1 “Administrative Data Research Centre—Scotland”.

We thank Alice Reid of the University of Cambridge and her colleagues, especially Ros Davies and Eilidh Garrett, for the work undertaken on the Kilmarnock and Isle of Skye databases.

Author information

Authors and Affiliations

School of Computer Science, University of St Andrews, St Andrews, Scotland
Özgür Akgün, Alan Dearle & Graham Kirby
Research School of Computer Science, The Australian National University, Canberra, Australia
Peter Christen

Corresponding author

Correspondence to Özgür Akgün.

Editor information

Editors and Affiliations

Deakin University, Geelong, Victoria, Australia
Dinh Phung
National Chiao Tung University, Hsinchu City, Taiwan
Vincent S. Tseng
Monash University, Clayton, Victoria, Australia
Geoffrey I. Webb
Japan Advanced Institute of Science and Technology, Nomi, Ishikawa, Japan
Bao Ho
University of Melbourne, Melbourne, Victoria, Australia
Mohadeseh Ganji
University of Melbourne, Melbourne, Victoria, Australia
Lida Rashidi

About this paper

Cite this paper

Akgün, Ö., Dearle, A., Kirby, G., Christen, P. (2018). Using Metric Space Indexing for Complete and Efficient Record Linkage. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_8

DOI: https://doi.org/10.1007/978-3-319-93040-4_8
Published: 17 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer Science, Computer Science (R0)
