ReLiC: entity profiling using random forest and trustworthiness of a source


The digital revolution has brought most of the world's data to the World Wide Web, and the volume of data available there has increased manyfold in the past decade. Social networks, online clubs, and similar venues have come into existence. Expert systems are required to extract information from these venues about real-world entities such as persons, organisations, and events. However, this information may change over time, so the extracted data must be maintained. It is therefore desirable to have an intelligent model that extracts relevant data items from different sources and merges them to build a complete profile of an entity (entity profiling). This model should also be able to handle incorrect or obsolete data items. In this paper, we propose a novel two-phase method for completing a profile. (1) The first phase (resolution phase) links records to the queries. We studied the performance of various classifiers for this purpose and observed that the random forest is the best suited for entity resolution. We also propose and use “trustworthiness of a source” as a feature for the random forest. (2) The second phase selects the appropriate values from the linked records to complete a profile, based on our proposed selection criteria. We use the concept of assigning authority to a reliable source in entity profiling, and our results show that using an authoritative source significantly improves the performance of the proposed system. Experimental results show that our proposed system, ReLiC, outperforms COMET.
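To make the second phase concrete, the following is a minimal sketch of how values from already-linked records could be merged into one profile. It is not the selection criteria of ReLiC, which the abstract does not spell out. It assumes a simple rule: a value from the authoritative source wins outright, and otherwise the value with the highest summed source trust is chosen. The function name `complete_profile`, the source names, and the trust scores are all hypothetical.

```python
from collections import defaultdict


def complete_profile(linked_records, trust, authoritative=None):
    """Merge records already linked to one entity into a single profile.

    linked_records: list of (source_name, {attribute: value}) pairs.
    trust: {source_name: score in [0, 1]} (hypothetical trust scores).
    authoritative: optional source whose values are taken as-is.
    """
    votes = defaultdict(lambda: defaultdict(float))
    profile = {}
    for source, record in linked_records:
        for attr, value in record.items():
            if value is None:
                continue  # missing values carry no evidence
            if source == authoritative:
                profile[attr] = value  # assumed rule: authority wins outright
            votes[attr][value] += trust.get(source, 0.0)
    # For attributes the authoritative source did not cover,
    # pick the value with the largest accumulated trust.
    for attr, candidates in votes.items():
        if attr not in profile:
            profile[attr] = max(candidates, key=candidates.get)
    return profile


# Illustrative data only; none of these sources or scores come from the paper.
records = [
    ("site_a", {"affiliation": "Univ X", "city": "Delhi"}),
    ("site_b", {"affiliation": "Univ Y", "city": "Delhi"}),
    ("site_c", {"affiliation": "Univ Y", "email": "a@example.org"}),
    ("registry", {"affiliation": "Univ Z"}),
]
trust = {"site_a": 0.9, "site_b": 0.4, "site_c": 0.3, "registry": 1.0}
profile = complete_profile(records, trust, authoritative="registry")
```

In this sketch the authoritative source settles `affiliation`, while `city` and `email` fall back to trust-weighted voting among the remaining sources. Obsolete or incorrect values from low-trust sources are outvoted rather than explicitly detected, which is a simplification.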

    Note that this value is computed using Word2Vec.

  Entity profiling
  entity resolution
  record linkage
  authoritative source