ReLiC: entity profiling using random forest and trustworthiness of a source

Abstract

The digital revolution has brought most of the world's data onto the World Wide Web, and the volume of data available there has grown manyfold over the past decade. Social networks, online clubs and similar venues have come into existence. Expert systems are required to extract information from these venues about real-world entities such as persons, organisations and events. However, this information may change over time, so the extracted data must be maintained. It is therefore desirable to have an intelligent model that extracts relevant data items from different sources and merges them to build a complete profile of an entity (entity profiling). This model should also be able to handle incorrect or obsolete data items. In this paper, we propose a novel two-phase method for completing a profile. (1) The first phase (resolution phase) links records to the queries. We studied the performance of various classifiers for this purpose and observed that the random forest is best suited for entity resolution. We also propose “trustworthiness of a source” and use it as a feature for the random forest. (2) The second phase selects the appropriate values from records to complete a profile, based on our proposed selection criteria. We use the concept of assigning authority to a reliable source in entity profiling, and our results establish that using an authoritative source significantly improves the performance of the proposed system. Experimental results show that our proposed system, ReLiC, outperforms COMET.
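
The second phase can be illustrated with a small sketch. The abstract does not spell out the selection criteria, so the rule used here (an authoritative source wins outright, otherwise the value with the highest summed source trustworthiness is chosen) is only an assumption for illustration, not the authors' method. The function name `complete_profile`, the source names and the trust scores are all hypothetical.

```python
from collections import defaultdict


def complete_profile(query, linked_records, trust, authoritative=None):
    """Fill missing attributes of `query` from records already linked to it.

    query:          dict of attribute -> value, with None marking a missing value
    linked_records: list of (source_name, record_dict) pairs from the resolution phase
    trust:          dict of source_name -> trustworthiness score (assumed to lie in [0, 1])
    authoritative:  optional source name whose values override all others
    """
    profile = dict(query)
    for attr, value in query.items():
        if value is not None:
            continue  # keep what the query already states

        # Assumed rule 1: a value from the authoritative source is taken as-is.
        chosen = None
        for source, record in linked_records:
            if source == authoritative and record.get(attr) is not None:
                chosen = record[attr]
                break

        # Assumed rule 2: otherwise pick the value with the largest total trust.
        if chosen is None:
            votes = defaultdict(float)
            for source, record in linked_records:
                candidate = record.get(attr)
                if candidate is not None:
                    votes[candidate] += trust.get(source, 0.0)
            if votes:
                chosen = max(votes, key=votes.get)

        profile[attr] = chosen
    return profile


# Hypothetical data: three sources that disagree on city and phone.
query = {"name": "A. Rao", "city": None, "phone": None}
records = [
    ("site_a", {"name": "A. Rao", "city": "Pune", "phone": "555-0101"}),
    ("site_b", {"name": "A Rao", "city": "Mumbai"}),
    ("site_c", {"name": "A. Rao", "city": "Mumbai", "phone": "555-0199"}),
]
trust = {"site_a": 0.9, "site_b": 0.4, "site_c": 0.3}

by_trust = complete_profile(query, records, trust)
by_authority = complete_profile(query, records, trust, authoritative="site_c")
```

With these made-up scores, trust-weighted voting prefers the values from `site_a`, while naming `site_c` as authoritative makes its values win regardless of the scores. In the system described above, the linked records would come from the random-forest resolution phase, which this sketch does not model.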

Notes

  1. https://www.instantcheckmate.com/

  2. GoogleNews-vectors-negative300.bin.gz

  3. http://code.google.com/archive/p/word2vec

  4. Note that this value is computed using Word2Vec.

  5. www.yellowpages.com

  6. https://github.com/ShubhamVarma/ReLiC

Author information

Correspondence to C Ravindranath Chowdary.

Cite this article

Varma, S., Sameer, N. & Chowdary, C.R. ReLiC: entity profiling using random forest and trustworthiness of a source. Sādhanā 44, 200 (2019). https://doi.org/10.1007/s12046-019-1178-x

Keywords

  • Entity profiling
  • entity resolution
  • record linkage
  • authoritative source