Generic Entity Resolution in Relational Databases

Sidló, Csaba István

doi:10.1007/978-3-642-03973-7_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5739)

Included in the following conference series:

East European Conference on Advances in Databases and Information Systems

Abstract

Entity Resolution (ER) covers the problem of identifying distinct representations of real-world entities in heterogeneous databases. We consider the generic formulation of ER problems (GER) with exact outcome. In practice, input data usually resides in relational databases and can grow to huge volumes. Yet typical solutions described in the literature employ standalone, memory-resident algorithms. In this paper we utilize the facilities of standard, unmodified relational database management systems (RDBMS) to enhance the efficiency of GER algorithms. We study and revise the problem formulation, and propose practical and efficient algorithms optimized for RDBMS external-memory processing. We outline a real-world scenario and demonstrate the advantage of the algorithms through experiments on insurance customer data.
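
As a rough illustration of the generic formulation mentioned in the abstract, the following Python sketch treats matching and merging as black-box functions and keeps merging matching records until no two remaining records match. It is a memory-resident toy of the kind the abstract contrasts with RDBMS-based processing, not the algorithms proposed in the paper. The record layout, the match rule (a shared email or phone value), and all names in the code are assumptions made for this example.

```python
# Hypothetical sketch of generic entity resolution with black-box
# match and merge functions. Not the paper's RDBMS-based algorithms.

# Assumption: a record is a frozenset of (attribute, value) pairs.
KEY_ATTRIBUTES = {"email", "phone"}


def match(a, b):
    """Two records match if they share an identical email or phone pair."""
    return any(attr in KEY_ATTRIBUTES for attr, _ in a & b)


def merge(a, b):
    """Merging keeps every attribute value seen in either record."""
    return a | b


def resolve(records, match, merge):
    """Merge matching records until no two resolved records match."""
    pending = list(records)
    resolved = []
    while pending:
        current = pending.pop()
        # Look for an already resolved record that matches the current one.
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record may match others, so it is re-examined.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


# Made-up customer records for illustration only.
r1 = frozenset({("name", "J. Smith"), ("email", "js@example.com")})
r2 = frozenset({("name", "John Smith"), ("email", "js@example.com"),
                ("phone", "555-0100")})
r3 = frozenset({("name", "John Smith"), ("phone", "555-0100")})
r4 = frozenset({("name", "Ann Lee"), ("email", "ann@example.com")})

entities = resolve([r1, r2, r3, r4], match, merge)
for entity in entities:
    print(sorted(entity))
```

Here r1 and r3 share no key value directly, yet they end up in the same entity because each matches r2. This transitive effect is why a merged record is put back into the pending list rather than being accepted immediately.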

This work was supported by grants OTKA NK 72845 and NKFP-07-A2 TEXTREND.

Author information

Authors and Affiliations

Data Mining and Web Search Research Group, Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17, 1111, Budapest, Hungary
Csaba István Sidló

Editor information

Editors and Affiliations

Institute of Applied Computer Systems, Riga Technical University, Kalku iela 1, LV 1658, Riga, Latvia
Janis Grundspenkis
Institute of Computing Science, University of Technology, Piotrowo 2, 60-965, Poznań, Poland
Tadeusz Morzy
European Research Center for Information Systems, University of Münster, Leonardo Campus 3, 48149, Münster, Germany
Gottfried Vossen

Cite this paper

Sidló, C.I. (2009). Generic Entity Resolution in Relational Databases. In: Grundspenkis, J., Morzy, T., Vossen, G. (eds) Advances in Databases and Information Systems. ADBIS 2009. Lecture Notes in Computer Science, vol 5739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03973-7_6

DOI: https://doi.org/10.1007/978-3-642-03973-7_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03972-0
Online ISBN: 978-3-642-03973-7
eBook Packages: Computer Science (R0)
