
A Bag Reconstruction Method for Multiple Instance Classification and Group Record Linkage

  • Conference paper
Advanced Data Mining and Applications (ADMA 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7713)


Abstract

Record linkage is the task of detecting records in several databases that refer to the same entity. It aims to uncover relationships between entities that normally lack common identifiers across heterogeneous datasets. When entities contain multiple relational records, linking them across datasets can be made more accurate by treating the records as groups, which leads to group linking methods. Even so, individual record links may still be needed for the final group linking step. This problem can be solved by multiple instance learning, in which group links are modelled as bags and record links are treated as instances. In this paper, we propose a novel method for instance classification and group record linkage via bag reconstruction from instances. The bag reconstruction is based on modelling the distribution of negative instances in the training bags via kernel density estimation. We evaluate this approach on both synthetic and real-world data. Our results show that the proposed method can outperform several baseline methods.
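The abstract names the main ingredient, kernel density estimation over negative instances, without giving details. The following minimal sketch shows one way such a step might look. It is not the authors' algorithm: the isotropic Gaussian kernel, the fixed bandwidth, the density threshold, the toy data, and the function names `gaussian_kde_density` and `reconstruct_bag` are all assumptions made for illustration.

```python
import math


def gaussian_kde_density(point, samples, bandwidth):
    """Isotropic Gaussian kernel density estimate at `point` (assumed kernel)."""
    d = len(point)
    # Normalising constant of a d-dimensional Gaussian with variance bandwidth^2.
    norm = (2.0 * math.pi * bandwidth ** 2) ** (d / 2.0)
    total = 0.0
    for s in samples:
        sq_dist = sum((p - q) ** 2 for p, q in zip(point, s))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(samples) * norm)


def reconstruct_bag(bag, negative_instances, bandwidth, threshold):
    """Keep the instances of `bag` that look unlikely under the negative model.

    Hypothetical rule: an instance is kept when its estimated density under
    the negative-instance distribution falls below `threshold`.
    """
    return [
        x for x in bag
        if gaussian_kde_density(x, negative_instances, bandwidth) < threshold
    ]


# Toy feature vectors (made up): instances drawn from negative training bags.
negative_instances = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.3, 0.2)]

# A positive bag mixing negative-looking instances with one outlier.
positive_bag = [(0.2, 0.2), (0.9, 0.95), (0.25, 0.15)]

reconstructed = reconstruct_bag(
    positive_bag, negative_instances, bandwidth=0.1, threshold=1.0
)
print(reconstructed)
```

In this toy setting only the instance far from the negative samples survives the filter. In practice the bandwidth and threshold would have to be chosen from data, for example by cross-validation.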




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fu, Z., Zhou, J., Peng, F., Christen, P. (2012). A Bag Reconstruction Method for Multiple Instance Classification and Group Record Linkage. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science, vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_21


  • DOI: https://doi.org/10.1007/978-3-642-35527-1_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35526-4

  • Online ISBN: 978-3-642-35527-1

  • eBook Packages: Computer Science, Computer Science (R0)
