Flexible and Efficient Distributed Resolution of Large Entities

Molnár, András J.; Benczúr, András A.; Sidló, Csaba István

doi:10.1007/978-3-642-28472-4_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7153)

Included in the following conference series:

International Symposium on Foundations of Information and Knowledge Systems

Abstract

Entity resolution (ER) is a computationally hard problem in data integration scenarios, where database records have to be grouped according to the real-world entities they belong to. In practice, these entities may consist of only a few records from different data sources, containing typos or historical data. In other cases they may contain significantly more records, especially when we search for entities at a higher level of a concept hierarchy than individual records.

In this paper we give a theoretical foundation for a variety of practically important match functions. We show that under these formulations, ER with large entities can be solved efficiently with algorithms based on MapReduce, a distributed computing paradigm. Our algorithm can efficiently incorporate probabilistic and similarity-based record matching, enabling flexible match function definitions. We demonstrate the usability of our model and algorithm in a real-world insurance ER scenario, where we identify household groups of client records.
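As a rough illustration of the general idea (not the algorithm from the paper, which is not reproduced on this page), the sketch below simulates a map step and a reduce step in a single process and then groups matching records into entities. Everything in it is an assumption made for the example: the record fields, the use of a postal code as the map key, the 0.85 similarity threshold, and the helper names `map_phase`, `reduce_phase`, `connected_entities`, and `resolve`. It treats an entity as a connected component of pairwise matches, which is one simple way to obtain large entities such as households.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def map_phase(records, key_fn):
    """Map step (simulated): group record ids by a key, as a shuffle would."""
    groups = defaultdict(list)
    for rid, rec in records.items():
        groups[key_fn(rec)].append(rid)
    return groups


def reduce_phase(groups, records, match):
    """Reduce step (simulated): compare records pairwise within each key group."""
    pairs = []
    for ids in groups.values():
        for a, b in combinations(sorted(ids), 2):
            if match(records[a], records[b]):
                pairs.append((a, b))
    return pairs


def connected_entities(record_ids, pairs):
    """Merge matching pairs into entities with a union-find structure."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    entities = defaultdict(set)
    for r in record_ids:
        entities[find(r)].add(r)
    return sorted(sorted(e) for e in entities.values())


def same_household(r1, r2, threshold=0.85):
    """Hypothetical similarity-based match: near-identical address strings."""
    a, b = r1["address"].lower(), r2["address"].lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def resolve(records):
    groups = map_phase(records, key_fn=lambda rec: rec["zip"])
    pairs = reduce_phase(groups, records, same_household)
    return connected_entities(records.keys(), pairs)


# Made-up client records for illustration only.
records = {
    1: {"name": "A. Kovacs", "zip": "1111", "address": "12 Main St"},
    2: {"name": "B. Kovacs", "zip": "1111", "address": "12 Main St."},
    3: {"name": "C. Nagy", "zip": "1111", "address": "98 Oak Ave"},
    4: {"name": "D. Toth", "zip": "2222", "address": "12 Main St"},
}

if __name__ == "__main__":
    print(resolve(records))
```

In this toy data, records 1 and 2 end up in one household, while record 4 stays separate despite the identical street address because its key differs. A real distributed setting would also have to merge components that span several keys, which this single-pass sketch does not attempt.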

This work was supported by the EU FP7 SEC project SCIIMS (Ref. 218223).

Author information

Authors and Affiliations

Data Mining and Web Search Group, Informatics Laboratory, Institute for Computer Science and Control, Hungarian Academy of Sciences, Hungary
András J. Molnár, András A. Benczúr & Csaba István Sidló

Editor information

Editors and Affiliations

Department of Computer Science, University of Oxford, Wolfson Building, Parks Road, OX1 3QD, Oxford, UK
Thomas Lukasiewicz
Alfréd Rényi Institute, Hungarian Academy of Sciences, Reáltanoda utca 13-15, 1053, Budapest, Hungary
Attila Sali

About this paper

Cite this paper

Molnár, A.J., Benczúr, A.A., Sidló, C.I. (2012). Flexible and Efficient Distributed Resolution of Large Entities. In: Lukasiewicz, T., Sali, A. (eds) Foundations of Information and Knowledge Systems. FoIKS 2012. Lecture Notes in Computer Science, vol 7153. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28472-4_14

DOI: https://doi.org/10.1007/978-3-642-28472-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28471-7
Online ISBN: 978-3-642-28472-4
eBook Packages: Computer Science, Computer Science (R0)
