Skip to main content

Generic Entity Resolution in Relational Databases

  • Conference paper
Book cover Advances in Databases and Information Systems (ADBIS 2009)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5739))

Abstract

Entity Resolution (ER) covers the problem of identifying distinct representations of real-world entities in heterogeneous databases. We consider the generic formulation of ER problems (GER) with exact outcome. In practice, input data usually resides in relational databases and can grow to huge volumes. Yet, typical solutions described in the literature employ standalone memory resident algorithms. In this paper we utilize facilities of standard, unmodified relational database management systems (RDBMS) to enhance the efficiency of GER algorithms. We study and revise the problem formulation, and propose practical and efficient algorithms optimized for RDBMS external memory processing. We outline a real-world scenario and demonstrate the advantage of algorithms by performing experiments on insurance customer data.

This work was supported by grants OTKA NK 72845 and NKFP-07-A2 TEXTREND.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. ISO-ANSI SQL-2 Database Language Standard, X3H2-92-154 (1992)

    Google Scholar 

  2. Benjelloun, O., Garcia-Molina, H., Gong, H., Kawai, H., Larson, T.E., Menestrina, D., Thavisomboon, S.: D-swoosh: A family of algorithms for generic, distributed entity resolution. In: ICDCS 2007: Proceedings of the 27th International Conference on Distributed Computing Systems, Washington, DC, USA, p. 37. IEEE Computer Society, Los Alamitos (2007)

    Google Scholar 

  3. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)

    Article  Google Scholar 

  4. Bhattacharya, I., Getoor, L.: A Latent dirichlet model for unsupervised entity resolution. In: SIAM International Conference on Data Mining, pp. 47–58 (2006)

    Google Scholar 

  5. Bhattacharya, I., Getoor, L., Licamele, L.: Query-time entity resolution. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 529–534 (2006)

    Google Scholar 

  6. Bhattacharya, I., Godbole, S., Joshi, S.: Structured entity identification and document categorization: two tasks with one joint model. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 25–33. ACM, New York (2008)

    Google Scholar 

  7. Bilenko, M., Mooney, R.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 39–48 (2003)

    Google Scholar 

  8. Chaudhuri, S., Sarma, A.D., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: SIGMOD 2007: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 437–448. ACM, New York (2007)

    Chapter  Google Scholar 

  9. Christen, P.: Automatic record linkage using seeded nearest neighbour and support vector machine classification. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 151–159. ACM, New York (2008)

    Google Scholar 

  10. Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 1–16 (2007)

    Google Scholar 

  11. Fellegi, I., Sunter, A.: A theory for record linkage. Journal of the American Statistical Association 64(328), 1183–1210 (1969)

    Article  MATH  Google Scholar 

  12. Getoor, L., Diehl, C.: Link mining: a survey. ACM SIGKDD Explorations Newsletter 7(2), 3–12 (2005)

    Article  Google Scholar 

  13. Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 491–500 (2001)

    Google Scholar 

  14. Hall, R., Sutton, C., McCallum, A.: Unsupervised deduplication using cross-field dependencies. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 310–317. ACM, New York (2008)

    Google Scholar 

  15. Han, H., Xu, W., Zha, H., Giles, C.: A hierarchical naive Bayes mixture model for name disambiguation in author citations. In: Proceedings of the 2005 ACM symposium on Applied computing, pp. 1065–1069 (2005)

    Google Scholar 

  16. Köpcke, H., Rahm, E.: Training selection for tuning entity matching. In: QDB/MUD, pp. 3–12 (2008)

    Google Scholar 

  17. McCarthy, J., Lehnert, W.: Using decision trees for coreference resolution. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence, pp. 1050–1055 (1995)

    Google Scholar 

  18. Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with data confidences. In: CleanDB Workshop, pp. 25–32 (2006)

    Google Scholar 

  19. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–278 (2002)

    Google Scholar 

  20. Wick, M.L., Rohanimanesh, K., Schultz, K., McCallum, A.: A unified approach for schema matching, coreference and canonicalization. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 722–730. ACM, New York (2008)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sidló, C.I. (2009). Generic Entity Resolution in Relational Databases. In: Grundspenkis, J., Morzy, T., Vossen, G. (eds) Advances in Databases and Information Systems. ADBIS 2009. Lecture Notes in Computer Science, vol 5739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03973-7_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-03973-7_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03972-0

  • Online ISBN: 978-3-642-03973-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics