Scaling Record Linkage to Non-uniform Distributed Class Sizes

Rendle, Steffen; Schmidt-Thieme, Lars

doi:10.1007/978-3-540-68125-0_28


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5012)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Record linkage is a central task when information from different sources is integrated. Record linkage models use so-called blockers to reduce the search space by discarding obviously different record pairs. In practice, important problems have Zipf-distributed class sizes, with some large classes for which blocking is no longer applicable. We therefore propose two novel meta-algorithms for scaling arbitrary record linkage models to such data sets. The first parallelizes a problem by creating overlapping subproblems; the second effectively reduces the search space for large classes. Our evaluation shows that both scaling techniques are effective and can scale state-of-the-art models to challenging data sets.
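To make the role of a blocker concrete, the following is a minimal Python sketch of key-based blocking on toy data. It is an illustration only and not the authors' method: the helper `candidate_pairs`, the blocking key `first_token`, and the product strings are all assumptions made up for this example. It shows why large classes limit blocking: a class with n records contains n(n-1)/2 matching pairs, and a blocker that keeps all true matches cannot discard any of them.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key):
    """Return the record-id pairs that share a blocking key (hypothetical helper)."""
    blocks = defaultdict(list)
    for rec_id, value in records.items():
        blocks[key(value)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def first_token(name):
    # Toy blocking key (an assumption): the lower-cased first token of the string.
    return name.lower().split()[0]


# Made-up data: one large class (five variants of one product) and two singletons.
records = {
    1: "canon eos 400d",
    2: "Canon EOS 400 D",
    3: "canon eos400d body",
    4: "Canon 400D EOS",
    5: "canon eos 400d kit",
    6: "nikon d80",
    7: "sony alpha 100",
}

all_pairs = len(records) * (len(records) - 1) // 2  # pairs without any blocking
blocked = candidate_pairs(records, first_token)     # pairs left after blocking
print(all_pairs, len(blocked))
```

In this toy setting blocking removes every pair that crosses classes, but all pairs inside the five-record class remain, so the number of remaining comparisons still grows quadratically with the size of the largest class.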

Author information

Authors and Affiliations

Machine Learning Lab, University of Hildesheim, Samelsonplatz 1, D-31141, Hildesheim, Germany
Steffen Rendle & Lars Schmidt-Thieme

Editor information

Takashi Washio, Einoshin Suzuki, Kai Ming Ting, Akihiro Inokuchi

Cite this paper

Rendle, S., Schmidt-Thieme, L. (2008). Scaling Record Linkage to Non-uniform Distributed Class Sizes. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2008. Lecture Notes in Computer Science, vol 5012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68125-0_28

DOI: https://doi.org/10.1007/978-3-540-68125-0_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68124-3
Online ISBN: 978-3-540-68125-0
eBook Packages: Computer Science, Computer Science (R0)