Abstract
Entity Resolution (ER) covers the problem of identifying distinct representations of real-world entities in heterogeneous databases. We consider the generic formulation of ER problems (GER) with exact outcome. In practice, input data usually resides in relational databases and can grow to huge volumes. Yet, typical solutions described in the literature employ standalone memory resident algorithms. In this paper we utilize facilities of standard, unmodified relational database management systems (RDBMS) to enhance the efficiency of GER algorithms. We study and revise the problem formulation, and propose practical and efficient algorithms optimized for RDBMS external memory processing. We outline a real-world scenario and demonstrate the advantage of algorithms by performing experiments on insurance customer data.
This work was supported by grants OTKA NK 72845 and NKFP-07-A2 TEXTREND.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
ISO-ANSI SQL-2 Database Language Standard, X3H2-92-154 (1992)
Benjelloun, O., Garcia-Molina, H., Gong, H., Kawai, H., Larson, T.E., Menestrina, D., Thavisomboon, S.: D-swoosh: A family of algorithms for generic, distributed entity resolution. In: ICDCS 2007: Proceedings of the 27th International Conference on Distributed Computing Systems, Washington, DC, USA, p. 37. IEEE Computer Society, Los Alamitos (2007)
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)
Bhattacharya, I., Getoor, L.: A Latent dirichlet model for unsupervised entity resolution. In: SIAM International Conference on Data Mining, pp. 47–58 (2006)
Bhattacharya, I., Getoor, L., Licamele, L.: Query-time entity resolution. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 529–534 (2006)
Bhattacharya, I., Godbole, S., Joshi, S.: Structured entity identification and document categorization: two tasks with one joint model. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 25–33. ACM, New York (2008)
Bilenko, M., Mooney, R.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 39–48 (2003)
Chaudhuri, S., Sarma, A.D., Ganti, V., Kaushik, R.: Leveraging aggregate constraints for deduplication. In: SIGMOD 2007: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 437–448. ACM, New York (2007)
Christen, P.: Automatic record linkage using seeded nearest neighbour and support vector machine classification. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 151–159. ACM, New York (2008)
Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 1–16 (2007)
Fellegi, I., Sunter, A.: A theory for record linkage. Journal of the American Statistical Association 64(328), 1183–1210 (1969)
Getoor, L., Diehl, C.: Link mining: a survey. ACM SIGKDD Explorations Newsletter 7(2), 3–12 (2005)
Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 491–500 (2001)
Hall, R., Sutton, C., McCallum, A.: Unsupervised deduplication using cross-field dependencies. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 310–317. ACM, New York (2008)
Han, H., Xu, W., Zha, H., Giles, C.: A hierarchical naive Bayes mixture model for name disambiguation in author citations. In: Proceedings of the 2005 ACM symposium on Applied computing, pp. 1065–1069 (2005)
Köpcke, H., Rahm, E.: Training selection for tuning entity matching. In: QDB/MUD, pp. 3–12 (2008)
McCarthy, J., Lehnert, W.: Using decision trees for coreference resolution. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence, pp. 1050–1055 (1995)
Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with data confidences. In: CleanDB Workshop, pp. 25–32 (2006)
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–278 (2002)
Wick, M.L., Rohanimanesh, K., Schultz, K., McCallum, A.: A unified approach for schema matching, coreference and canonicalization. In: KDD 2008: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 722–730. ACM, New York (2008)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sidló, C.I. (2009). Generic Entity Resolution in Relational Databases. In: Grundspenkis, J., Morzy, T., Vossen, G. (eds) Advances in Databases and Information Systems. ADBIS 2009. Lecture Notes in Computer Science, vol 5739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03973-7_6
Download citation
DOI: https://doi.org/10.1007/978-3-642-03973-7_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03972-0
Online ISBN: 978-3-642-03973-7
eBook Packages: Computer ScienceComputer Science (R0)