Scaling Record Linkage to Non-uniform Distributed Class Sizes

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5012)

Abstract

Record linkage is a central task when information from different sources is integrated. Record linkage models use so-called blockers to reduce the search space by discarding obviously different record pairs. In practice, important problems have Zipf-distributed class sizes, with some large classes for which blocking is no longer applicable. We therefore propose two novel meta-algorithms for scaling arbitrary record linkage models to such data sets. The first parallelizes problems by creating overlapping subproblems; the second effectively reduces the search space for large classes. Our evaluation shows that both scaling techniques are effective and can scale state-of-the-art models to challenging data sets.
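
To make the notion of a blocker concrete, the following is a minimal sketch of generic key-based blocking, not the method proposed in the paper. The `blocking_key` function, the sample records, and the three-character prefix key are all hypothetical choices made for illustration. Only records that share a key are compared, so most pairs are discarded up front; a large class whose members share one key still produces a number of candidate pairs that grows quadratically with the class size, which is the situation the abstract describes.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three characters of the lowercased name.
    return record["name"].lower()[:3]


def candidate_pairs(records):
    # Group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)

    # Only pairs inside the same block survive; all others are discarded.
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


# Illustrative records (made up for this sketch).
records = [
    {"name": "Canon EOS 400D"},
    {"name": "canon eos 400 d"},
    {"name": "Canon PowerShot A540"},
    {"name": "Nikon D80"},
    {"name": "nikon d-80 body"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(len(pairs), "of", all_pairs, "pairs survive blocking")
```

In this toy input the three "can" records form one block and the two "nik" records another, so 4 of the 10 possible pairs remain. A block of n records always contributes n(n-1)/2 pairs, which is why a few very large classes can dominate the cost even after blocking.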

Editor information

Takashi Washio, Einoshin Suzuki, Kai Ming Ting, Akihiro Inokuchi

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rendle, S., Schmidt-Thieme, L. (2008). Scaling Record Linkage to Non-uniform Distributed Class Sizes. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2008. Lecture Notes in Computer Science, vol 5012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68125-0_28

  • DOI: https://doi.org/10.1007/978-3-540-68125-0_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68124-3

  • Online ISBN: 978-3-540-68125-0

  • eBook Packages: Computer Science (R0)