Abstract
Record linkage is a central task when information from different sources is integrated. Record linkage models use so-called blockers for reducing the search space by discarding obviously different record pairs. In practice, important problems have Zipf distributed class sizes with some large classes where blocking is not applicable any more. Therefore we propose two novel meta algorithms for scaling arbitrary record linkage models to such data sets. The first one parallelizes problems by creating overlapping subproblems and the second one reduces the search space for large classes effectively. Our evaluation shows that both scaling techniques are effective and are able to scale state-of-the-art models to challenging datasets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. Journal of the American Statistical Association 64, 1183–1210 (1969)
Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Proceedings of the 8th International Conference on Database Systems for Advanced Applications (DASFAA) (2003)
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), Washington, DC (2003)
Culotta, A., McCallum, A.: Joint deduplication of multiple record types in relational data. In: CIKM 2005: Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 257–258. ACM Press, New York (2005)
Rendle, S., Schmidt-Thieme, L.: Object identification with constraints. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006), Hong Kong (2006)
Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), pp. 475–480. Edmonton, Alberta (2002)
Singla, P., Domingos, P.: Entity resolution with markov logic. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006), Hong Kong (2006)
Newcombe, H., Kennedy, J., Axford, S., James, A.: Automatic linkage of vital records. Science 130, 954–959 (1959)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD-1995), San Jose, CA, pp. 127–138 (1995)
McCallum, A.K., Nigam, K., Ungar, L.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th International Conference On Knowledge Discovery and Data Mining (KDD-2000), Boston, MA, pp. 169–178 (2000)
Bilenko, M., Kamath, B., Mooney, R.J.: Adaptive blocking: Learning to scale up record linkage. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006), Hong Kong (2006)
Michelson, M., Knoblock, C.A.: Learning blocking schemes for record linkage. In: Proceedings of AAAI 2006 (2006)
Baxter, R., Christen, P., Churches, T.: A comparison of fast blocking methods for record linkage. In: Proceedings of the 2003 ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington, DC, pp. 25–27 (2003)
Christen, P., Churches, T., Hegland, M.: A parallel open source data linkage system. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, Springer, Heidelberg (2004)
Karypis, G., Kumar, V.: Parallel multilevel graph partitioning. In: Proceedings of the 10th International Parallel Processing Symposium (IPPS 1996) (1996)
Bilenko, M., Basu, S., Sahami, M.: Adaptive product normalization: Using online learning for record linkage in comparison shopping. In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005) (2005)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rendle, S., Schmidt-Thieme, L. (2008). Scaling Record Linkage to Non-uniform Distributed Class Sizes. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2008. Lecture Notes in Computer Science(), vol 5012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68125-0_28
Download citation
DOI: https://doi.org/10.1007/978-3-540-68125-0_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68124-3
Online ISBN: 978-3-540-68125-0
eBook Packages: Computer ScienceComputer Science (R0)