Collective entity resolution in multi-relational familial networks

  • Pigi Kouki
  • Jay Pujara
  • Christopher Marcum
  • Laura Koehly
  • Lise Getoor
Regular Paper

Abstract

Entity resolution in settings with rich relational structure often introduces complex dependencies between co-references. Exploiting these dependencies is challenging—it requires seamlessly combining statistical, relational, and logical dependencies. One task of particular interest is entity resolution in familial networks. In this setting, multiple partial representations of a family tree are provided, from the perspective of different family members, and the challenge is to reconstruct a family tree from these multiple, noisy, partial views. This reconstruction is crucial for applications such as understanding genetic inheritance, tracking disease contagion, and performing census surveys. Here, we design a model that incorporates statistical signals (such as name similarity), relational information (such as sibling overlap), logical constraints (such as transitivity and bijective matching), and predictions from other algorithms (such as logistic regression and support vector machines), in a collective model. We show how to integrate these features using probabilistic soft logic, a scalable probabilistic programming framework. In experiments on real-world data, our model significantly outperforms state-of-the-art classifiers that use relational features but are incapable of collective reasoning.
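As a rough illustration of how a logical constraint such as transitivity can be softened, the sketch below scores the rule Same(A, B) ∧ Same(B, C) → Same(A, C) with the Łukasiewicz relaxation used by probabilistic soft logic, where truth values lie in [0, 1] and a rule is penalized by its distance to satisfaction. The function names, the example truth values, and the weight are assumptions made for illustration; this is not the model or the rule set from the paper.

```python
def lukasiewicz_and(a: float, b: float) -> float:
    """Lukasiewicz t-norm: soft conjunction of two truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def distance_to_satisfaction(body: float, head: float) -> float:
    """Hinge penalty for the rule body -> head; zero once head >= body."""
    return max(0.0, body - head)


def transitivity_penalty(same_ab: float, same_bc: float, same_ac: float,
                         weight: float = 1.0) -> float:
    """Weighted penalty for Same(A,B) & Same(B,C) -> Same(A,C).

    The weight is a hypothetical rule weight; in practice it would be
    learned from data or set by the modeler.
    """
    body = lukasiewicz_and(same_ab, same_bc)
    return weight * distance_to_satisfaction(body, same_ac)


if __name__ == "__main__":
    # Two confident matches with a weak third link violate transitivity.
    print(transitivity_penalty(0.9, 0.8, 0.2))
    # Raising the third link's score removes the penalty.
    print(transitivity_penalty(0.9, 0.8, 0.9))
```

Because the penalty is a hinge over a linear function of the truth values, summing such terms over all candidate triples gives a convex objective, which is what lets inference over many co-reference decisions be carried out jointly rather than pair by pair.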

Keywords

Entity resolution, Data integration, Familial networks, Multi-relational networks, Collective classification, Family reconstruction, Probabilistic soft logic

Notes

Acknowledgements

We would like to thank Peter Christen and Jon Berry for insightful comments on this paper. This work was partially supported by the National Science Foundation Grants IIS-1218488, CCF-1740850, and IIS-1703331 and by the National Human Genome Research Institute Division of Intramural Research at the National Institutes of Health (ZIA HG2000397 and ZIA HG200395, Koehly PI). We would also like to thank the Sandia LDRD (Laboratory-Directed Research and Development) program for support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the National Institutes of Health, or the Sandia Labs.

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. School of Engineering, University of California Santa Cruz, Santa Cruz, USA
  2. National Human Genome Research Institute, National Institutes of Health, Bethesda, USA
