Frontiers of Computer Science, Volume 13, Issue 1, pp 157–169

EnAli: entity alignment across multiple heterogeneous data sources

  • Chao Kong
  • Ming Gao (Email author)
  • Chen Xu
  • Yunbin Fu
  • Weining Qian
  • Aoying Zhou
Research Article

Abstract

Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity as entities in other data sources. Aligning entities across heterogeneous data sources is important to many research fields, such as data cleaning, data integration, information retrieval and machine learning. The alignment process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but also needs to handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes via the exponential family and handles missing values, and it uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We illustrate the performance of EnAli on re-identifying entities from the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
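
To illustrate how locality-sensitive hashing can cut down the candidate tuples mentioned in the abstract, the following is a minimal sketch and not EnAli's actual scheme. The abstract does not say which hash family is used, so MinHash over character 3-grams with banding is an assumption here, and all names (shingles, minhash_signature, candidate_pairs) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lower-cased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes):
    """MinHash signature: the minimum hash value of the token set per seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def candidate_pairs(records, bands=16, rows=2):
    """Return id pairs whose signatures agree on at least one band.

    records maps a record id to a string (an assumed, simplified tuple format).
    Only these pairs would be passed on to a more expensive matching model.
    """
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            # Records land in the same bucket if this whole band is identical.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    demo = {"a1": "Chao Kong", "b1": "Chao Kong", "c1": "zzzz qqqq"}
    print(candidate_pairs(demo))
```

With more bands and fewer rows per band, more pairs become candidates (higher recall, more comparisons); the values 16 and 2 are illustrative only.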

Keywords

entity alignment, exponential family, locality sensitive hashing, EM-algorithm

Notes

Acknowledgements

This work has been supported by the National Key Research and Development Program of China (2016YFB1000905) and the National Natural Science Foundation of China (Grant Nos. U1401256, 61402177, 61672234, 61402180 and 61232002). This work was also supported by the NSF of Shanghai (14ZR1412600).

Supplementary material

11704_2017_6561_MOESM1_ESM.ppt (286 kb)
Supplementary material, approximately 284 KB.

Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  • Chao Kong (1)
  • Ming Gao (1, Email author)
  • Chen Xu (2)
  • Yunbin Fu (1)
  • Weining Qian (1)
  • Aoying Zhou (1)
  1. School of Data Science and Engineering, East China Normal University, Shanghai, China
  2. Technische Universität Berlin, Berlin, Germany