Joint entity resolution on multiple datasets

  • Regular Paper
The VLDB Journal

Abstract

Entity resolution (ER) is the problem of identifying which records in a database represent the same entity. Often, records of different types are involved (e.g., authors, publications, institutions, venues), and resolving records of one type can impact the resolution of other types of records. In this paper we propose a flexible, modular resolution framework where existing ER algorithms developed for a given record type can be plugged in and used in concert with other ER algorithms. Our approach also makes it possible to run ER on subsets of similar records at a time, important when the full data are too large to resolve together. We study the scheduling and coordination of the individual ER algorithms, in order to resolve the full dataset, and show the scalability of our approach. We also introduce a “state-based” training technique where each ER algorithm is trained for the particular execution context (relative to other types of records) where it will be used.
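The idea of plugging per-type ER algorithms into one framework and coordinating them can be illustrated with a minimal sketch. This is not the authors' algorithm or scheduler: the toy data, the two resolver rules, and the names `joint_resolve`, `resolve_papers`, and `resolve_venues` are all hypothetical, and the schedule is a plain round-robin that stops when no partition changes.

```python
from typing import Callable, Dict, FrozenSet, Set

Partition = Set[FrozenSet[str]]          # clusters of record ids
State = Dict[str, Partition]             # record type -> current clusters
Resolver = Callable[[State], Partition]  # a pluggable per-type ER algorithm

# Hypothetical toy data: paper id -> (title, venue id).
PAPERS = {
    "p1": ("joint er", "v1"),
    "p2": ("joint er", "v2"),
    "p3": ("swoosh", "v1"),
}


def resolve_papers(state: State) -> Partition:
    """Toy rule (assumption): papers with identical titles match."""
    by_title: Dict[str, Set[str]] = {}
    for pid, (title, _venue) in PAPERS.items():
        by_title.setdefault(title, set()).add(pid)
    return {frozenset(ids) for ids in by_title.values()}


def resolve_venues(state: State) -> Partition:
    """Toy rule (assumption): venues hosting papers already resolved
    to the same entity are merged."""
    clusters = set(state["venue"])
    for paper_cluster in state["paper"]:
        venues = {PAPERS[p][1] for p in paper_cluster}
        touched = {c for c in clusters if c & venues}
        if len(touched) > 1:
            clusters -= touched
            clusters.add(frozenset().union(*touched))
    return clusters


def joint_resolve(state: State, resolvers: Dict[str, Resolver],
                  max_rounds: int = 10) -> State:
    """Run each resolver in turn until a full round changes nothing."""
    for _ in range(max_rounds):
        changed = False
        for rtype, resolve in resolvers.items():
            new_partition = resolve(state)
            if new_partition != state[rtype]:
                state = {**state, rtype: new_partition}
                changed = True
        if not changed:
            break
    return state


initial: State = {
    "paper": {frozenset({p}) for p in PAPERS},
    "venue": {frozenset({"v1"}), frozenset({"v2"})},
}
# Venues are deliberately scheduled first: they cannot merge until the
# paper resolver has run, so a second round is needed.
result = joint_resolve(initial, {"venue": resolve_venues, "paper": resolve_papers})
print(sorted(sorted(c) for c in result["paper"]))
print(sorted(sorted(c) for c in result["venue"]))
```

In this toy run the venue resolver finds nothing to merge until the paper resolver has matched `p1` and `p2`, which is the kind of cross-type influence the abstract describes; the order in which resolvers run affects how many rounds are needed, not the final result in this small case.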

Notes

  1. Spock was unable to give us all the data for legal reasons.

References

  1. Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools, 2nd edn. Addison Wesley, Boston (2006)

  2. Arasu, A., Ré, C., Suciu, D.: Large-scale deduplication with constraints using dedupalog. In: ICDE, pp. 952–963 (2009)

  3. Azevedo, A., Santos, M.F.: KDD, SEMMA and CRISP-DM: a parallel overview. In: IADIS European Conference Data Mining, pp. 182–185 (2008)

  4. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S.E., Widom, J.: Swoosh: a generic approach to entity resolution. VLDB J. 18(1), 255–276 (2009)

  5. Bhattacharya, I., Getoor, L.: A latent Dirichlet model for unsupervised entity resolution. In: SDM (2006)

  6. Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. TKDD 1(1) Article No. 5 (2007)

  7. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: KDD, pp. 39–48 (2003)

  8. Brucker, P.: Scheduling Algorithms, 4th edn. Springer, Berlin (2004)

  9. Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer, Berlin (2012)

  10. Culotta, A., McCallum, A.: A conditional model of deduplication for multi-type relational data. Technical report, University of Massachusetts (2005)

  11. Culotta, A., McCallum, A.: Joint deduplication of multiple record types in relational data. In: CIKM, pp. 257–258 (2005)

  12. Dong, X., Halevy, A.Y., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD, pp. 85–96 (2005)

  13. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

  14. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

  15. Garey, M.R., Johnson, D.S., Sethi, R.: The complexity of flowshop and jobshop scheduling. Math. Oper. Res. 1, 117–129 (1976). doi:10.1287/moor.1.2.117

  16. Graham, R.L.: Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math. 17, 416–429 (1969)

  17. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: SIGMOD, pp. 127–138 (1995)

  18. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. PVLDB 3(1), 484–493 (2010)

  19. LibSVM. http://www.csie.ntu.edu.tw/cjlin/libsvm/

  20. Newcombe, H.B., Kennedy, J.M.: Record linkage: making maximum use of the discriminating power of identifying information. Commun. ACM 5(11), 563–566 (1962)

  21. Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)

  22. Parag S., Domingos, P.: Multi-relational record linkage. In: KDD-2004 Workshop on Multi-Relational Data Mining, pp. 31–48 (2004)

  23. Poon, H., Domingos, P.: Joint inference in information extraction. In: AAAI, pp. 913–918 (2007)

  24. Rastogi, V., Dalvi, N.N., Garofalakis, M.N.: Large-scale collective entity matching. PVLDB 4(4), 208–218 (2011)

  25. Sadinle, M., Hall, R., Fienberg, S.E.: Approaches to multiple record linkage. In: ISI (2011)

  26. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: KDD, pp. 269–278 (2002)

  27. Singla, P., Domingos, P.: Entity resolution with Markov logic. In: ICDM, pp. 572–582 (2006)

  28. Spock. http://spock.com

  29. Tarjan, R.E.: Edge-disjoint spanning trees and depth-first search. Acta Inf. 6, 171–185 (1976)

  30. Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Inf. Syst. J. 26(8), 635–656 (2001)

  31. Whang, S.E., Garcia-Molina, H.: Entity resolution with evolving rules. PVLDB 3(1), 1326–1337 (2010)

  32. Whang, S.E., Garcia-Molina, H.: Joint entity resolution. In: ICDE, pp. 294–305 (2012)

  33. Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: SIGMOD, pp. 219–232 (2009)

  34. Winkler, W.: Overview of record linkage and current research directions. Technical report, Statistical Research Division, US Bureau of the Census, Washington, DC (2006)

Acknowledgments

We thank Makoto Tachibana and David Menestrina for their early support on the project.

Author information

Corresponding author

Correspondence to Steven Euijong Whang.

About this article

Cite this article

Whang, S.E., Garcia-Molina, H. Joint entity resolution on multiple datasets. The VLDB Journal 22, 773–795 (2013). https://doi.org/10.1007/s00778-013-0308-z

  • DOI: https://doi.org/10.1007/s00778-013-0308-z
