Abstract
Entity resolution (ER) is the problem of identifying which records in a database represent the same entity. Often, records of different types are involved (e.g., authors, publications, institutions, venues), and resolving records of one type can affect the resolution of other types. In this paper, we propose a flexible, modular resolution framework in which existing ER algorithms developed for a given record type can be plugged in and used in concert with other ER algorithms. Our approach also makes it possible to run ER on subsets of similar records at a time, which is important when the full data are too large to resolve together. We study how to schedule and coordinate the individual ER algorithms in order to resolve the full dataset, and we show the scalability of our approach. We also introduce a “state-based” training technique in which each ER algorithm is trained for the particular execution context (relative to other types of records) in which it will be used.
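To make the idea of plugging per-type ER algorithms into one coordinated run more concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes each resolver is a function that reads the current partitions of all record types and returns a new partition for its own type, and that a fixed schedule is repeated until no partition changes. The names `run_joint_er`, `resolve_venues`, `resolve_papers`, and the toy records are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

# A partition is a set of clusters; each cluster is a set of record ids.
Partition = FrozenSet[FrozenSet[str]]
# A resolver sees the partitions of every type and returns one for its own type.
Resolver = Callable[[Dict[str, Partition]], Partition]

# Toy data (hypothetical): two spellings of one venue, one paper listed twice.
VENUES = {"v1": "VLDB", "v2": "vldb"}
PAPERS = {
    "p1": {"title": "Joint ER", "venue": "v1"},
    "p2": {"title": "joint er", "venue": "v2"},
}


def singletons(ids: Iterable[str]) -> Partition:
    """Start with every record in its own cluster."""
    return frozenset(frozenset([i]) for i in ids)


def cluster_of(partition: Partition, rid: str) -> FrozenSet[str]:
    """Return the cluster that contains record id `rid`."""
    return next(c for c in partition if rid in c)


def group_by(ids: Iterable[str], key: Callable[[str], object]) -> Partition:
    """Cluster together all ids that share the same key."""
    groups: Dict[object, set] = {}
    for i in ids:
        groups.setdefault(key(i), set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())


def resolve_venues(state: Dict[str, Partition]) -> Partition:
    # Assumed rule: venues match when their names agree ignoring case.
    return group_by(VENUES, lambda v: VENUES[v].lower())


def resolve_papers(state: Dict[str, Partition]) -> Partition:
    # Assumed rule: papers match when titles agree ignoring case AND their
    # venues are currently in the same venue cluster.
    return group_by(
        PAPERS,
        lambda p: (PAPERS[p]["title"].lower(),
                   cluster_of(state["venue"], PAPERS[p]["venue"])),
    )


def run_joint_er(resolvers: Dict[str, Resolver],
                 initial: Dict[str, Partition],
                 schedule: List[str],
                 max_rounds: int = 10) -> Dict[str, Partition]:
    """Run the resolvers in schedule order until no partition changes."""
    state = dict(initial)
    for _ in range(max_rounds):
        changed = False
        for rtype in schedule:
            new_partition = resolvers[rtype](state)
            if new_partition != state[rtype]:
                state[rtype] = new_partition
                changed = True
        if not changed:
            break
    return state


result = run_joint_er(
    resolvers={"venue": resolve_venues, "paper": resolve_papers},
    initial={"venue": singletons(VENUES), "paper": singletons(PAPERS)},
    schedule=["paper", "venue"],
)
```

In this toy run the papers cannot be merged on the first pass because their venues are still separate; once the venue resolver merges the two venue records, a second pass over the paper resolver merges the papers. This illustrates why the order and repetition of the individual resolvers matter when one record type influences another.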
Notes
Spock was unable to give us all the data for legal reasons.
References
Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools, 2nd edn. Addison Wesley, Boston (2006)
Arasu, A., Ré, C., Suciu, D.: Large-scale deduplication with constraints using Dedupalog. In: ICDE, pp. 952–963 (2009)
Azevedo, A., Santos, M.F.: KDD, SEMMA and CRISP-DM: a parallel overview. In: IADIS European Conference Data Mining, pp. 182–185 (2008)
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)
Bhattacharya, I., Getoor, L.: A latent Dirichlet model for unsupervised entity resolution. In: SDM (2006)
Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. TKDD 1(1) Article No. 5 (2007)
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: KDD, pp. 39–48 (2003)
Brucker, P.: Scheduling Algorithms, 4th edn. Springer, Berlin (2004)
Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer, Berlin (2012)
Culotta, A., McCallum, A.: A conditional model of deduplication for multi-type relational data. Technical report, University of Massachusetts (2005)
Culotta, A., McCallum, A.: Joint deduplication of multiple record types in relational data. In: CIKM, pp. 257–258 (2005)
Dong, X., Halevy, A.Y., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD, pp. 85–96 (2005)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
Garey, M.R., Johnson, D.S., Sethi, R.: The complexity of flowshop and jobshop scheduling. Math. Oper. Res. 1, 117–129 (1976). doi:10.1287/moor.1.2.117
Graham, R.L.: Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math. 17, 416–429 (1969)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: SIGMOD, pp. 127–138 (1995)
Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 3(1), 484–493 (2010)
Newcombe, H.B., Kennedy, J.M.: Record linkage: making maximum use of the discriminating power of identifying information. Commun. ACM 5(11), 563–566 (1962)
Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)
Parag, S., Domingos, P.: Multi-relational record linkage. In: KDD-2004 Workshop on Multi-Relational Data Mining, pp. 31–48 (2004)
Poon, H., Domingos, P.: Joint inference in information extraction. In: AAAI, pp. 913–918 (2007)
Rastogi, V., Dalvi, N.N., Garofalakis, M.N.: Large-scale collective entity matching. PVLDB 4(4), 208–218 (2011)
Sadinle, M., Hall, R., Fienberg, S.E.: Approaches to multiple record linkage. In: ISI (2011)
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: KDD, pp. 269–278 (2002)
Singla, P., Domingos, P.: Entity resolution with Markov logic. In: ICDM, pp. 572–582 (2006)
Spock. http://spock.com
Tarjan, R.E.: Edge-disjoint spanning trees and depth-first search. Acta Inf. 6, 171–185 (1976)
Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Inf. Syst. J. 26(8), 635–656 (2001)
Whang, S.E., Garcia-Molina, H.: Entity resolution with evolving rules. PVLDB 3(1), 1326–1337 (2010)
Whang, S.E., Garcia-Molina, H.: Joint entity resolution. In: ICDE, pp. 294–305 (2012)
Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: SIGMOD, pp. 219–232 (2009)
Winkler, W.: Overview of record linkage and current research directions. Technical report, Statistical Research Division, US Bureau of the Census, Washington, DC (2006)
Acknowledgments
We thank Makoto Tachibana and David Menestrina for their early support on the project.
Cite this article
Whang, S.E., Garcia-Molina, H. Joint entity resolution on multiple datasets. The VLDB Journal 22, 773–795 (2013). https://doi.org/10.1007/s00778-013-0308-z
DOI: https://doi.org/10.1007/s00778-013-0308-z