Abstract
In today’s interconnected world, there is an endless 24/7 stream of new articles appearing online. Faced with these overwhelming amounts of data, it is often helpful to consider only the key entities and concepts and their relationships. This is challenging, as relevant connections may be spread across a number of disparate articles and sources. In this paper, we propose a unified framework that helps users quickly discern salient connections and facts from a set of related documents and presents the resulting information in a graph-based visualization. Specifically, given a set of relevant documents as input, we first extract candidate facts from these sources using Open Information Extraction (Open IE) approaches. Then, we design a Two-Stage Candidate Triple Filtering (TCTF) approach based on a self-training framework to retain only those candidate facts that are coherent with the specified document topic, and connect them to form an initial graph. We further refine this graph with a heuristic to ensure that the final conceptual graph consists only of facts likely to represent meaningful and salient relationships, which users may explore graphically. Experiments on two real-world datasets show that our extraction approach achieves an average F-score 2.4% higher than several Open IE baselines. We also present an empirical evaluation of the quality of the final conceptual graph for different topics in terms of its coverage rate of topic entities and concepts, its confidence score, and the compatibility of the facts involved. The experimental results demonstrate the effectiveness of the proposed approach.
Notes
- 1.
- 2.
- 3.
- 4.
During extraction, we applied a few straightforward transformations to correct two common types of errors, wrong boundaries and uninformative extractions, which were caused by the syntactic analysis in the extraction approaches.
- 5.
- 6.
- 7.
A triple extracted by OLLIE, ClausIE, or MinIE is judged to be a correct extraction only if its confidence is greater than 0.85.
- 8.
The similarity score between two candidate facts \(f_{i}\) and \(f_{j}\) is computed as \(sim(f_{i}, f_{j}) = \gamma \cdot s_{k} + (1-\gamma ) \cdot l_{k}\), where \(s_{k}\) and \(l_{k}\) denote the semantic similarity and literal similarity scores between the facts, respectively. We compute \(s_{k}\) using the Align, Disambiguate and Walk algorithm [2], while \(l_{k}\) is computed using the Jaccard index. The weight \(\gamma =0.8\) sets the relative degree to which the semantic similarity, as opposed to the literal similarity, contributes to the overall similarity score.
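The weighted similarity in this note can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the source: facts are represented as plain strings, the Jaccard index is taken over lowercase whitespace tokens, and `semantic_sim` is a hypothetical callable standing in for the Align, Disambiguate and Walk score, which is not reimplemented here.

```python
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Jaccard index over lowercase whitespace tokens (tokenization is an assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty facts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def fact_similarity(
    fact_i: str,
    fact_j: str,
    semantic_sim: Callable[[str, str], float],
    gamma: float = 0.8,
) -> float:
    """sim(f_i, f_j) = gamma * s_k + (1 - gamma) * l_k."""
    s_k = semantic_sim(fact_i, fact_j)  # hypothetical stand-in for the ADW score
    l_k = jaccard(fact_i, fact_j)       # literal similarity
    return gamma * s_k + (1 - gamma) * l_k
```

With \(\gamma = 0.8\), the semantic score dominates: two facts that share few tokens can still be rated as similar when the semantic measure judges them close.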
- 9.
Kappa implementation: https://gist.github.com/ShinNoNoir/9687179.
- 10.
- 11.
- 12.
A triple is annotated as correct if the following conditions are met: (i) it is entailed by its corresponding clause; (ii) it is reasonable or meaningful without any context; and (iii) all three annotators mark it as correct. The inter-annotator agreement was 82% (\(\kappa \) = 0.60).
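The note does not say which kappa variant was used. With three annotators, Fleiss' kappa is one common choice, and the sketch below assumes it. The input layout (one row per triple, holding the number of annotators who chose each label) and the function name `fleiss_kappa` are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j; every row must sum to the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item must be rated by the same number of annotators")

    # Share of all assignments that fall into each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return 1.0  # all ratings fall into a single category
    return (p_bar - p_e) / (1 - p_e)
```

For example, each row could be `[correct_votes, incorrect_votes]` for one triple, so a row of `[3, 0]` is a triple that all three annotators marked as correct and that therefore satisfies condition (iii).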
- 13.
An entity or concept is regarded as a topic concept when it occurs in the topic word list.
- 14.
For popular Open IE systems such as OLLIE, ClausIE, and MinIE, we use the confidence value computed by each system itself as the confidence score of each fact.
- 15.
We mark the top-2 F-score results in boldface.
- 16.
Once the model has converged, the random forest classifier achieves an average F-score above 87% across the different topics.
References
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: ACL, pp. 55–60 (2014)
Pilehvar, M.T., Jurgens, D., Navigli, R.: Align, disambiguate and walk: A unified approach for measuring semantic similarity. In: ACL (Volume 1: Long Papers), pp. 1341–1351 (2013)
Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI, pp. 2670–2676 (2007)
Kochtchi, A., Landesberger, T.V., Biemann, C.: Networks of names: visual exploration and semi-automatic tagging of social networks from newspaper articles. In: Computer Graphics Forum, pp. 211–220 (2014)
Spitkovsky, V.I., Chang, A.X.: A cross-lingual dictionary for English Wikipedia concepts. In: LREC, pp. 3168–3175 (2012)
Mausam, M.: Open information extraction systems and downstream applications. In: IJCAI, pp. 4074–4077 (2016)
Schmitz, M., Bart, R., Soderland, S., Etzioni, O.: Open language learning for information extraction. In: EMNLP-CoNLL, pp. 523–534 (2012)
Del Corro, L., Gemulla, R.: ClausIE: clause-based open information extraction. In: WWW, pp. 355–366 (2013)
Gashteovski, K., Gemulla, R., Del Corro, L.: MinIE: minimizing facts in open information extraction. In: EMNLP, pp. 2630–2640 (2017)
Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: EMNLP (2004)
Mann, G.: Multi-document relationship fusion via constraints on probabilistic databases. In: Human Language Technologies 2007: NAACL, pp. 332–339 (2007)
Ji, H., Favre, B., Lin, W.P., Gillick, D., Hakkani-Tur, D., Grishman, R.: Open-domain multi-document summarization via information extraction: challenges and prospects. In: Multi-source, multilingual information extraction and summarization, pp. 177–201 (2013)
Fuchs, C.A., Peres, A.: Quantum-state disturbance versus information gain: uncertainty relations for quantum information. Phys. Rev. A 53(4), 2038 (1996)
Yu, D., Huang, L., Ji, H.: Open relation extraction and grounding. In: IJCNLP (Volume 1: Long Papers), pp. 854–864 (2017)
Sheng, Y., Xu, Z., Wang, Y., Zhang, X., Jia, J., You, Z., de Melo, G.: Visualizing multi-document semantics via open domain information extraction. In: ECML-PKDD, pp. 695–699 (2018)
Acknowledgments
This paper was partially supported by the National Natural Science Foundation of China (Nos. 61572111 and 61876034) and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sheng, Y., Xu, Z. (2019). Coherence and Salience-Based Multi-Document Relationship Mining. In: Shao, J., Yiu, M., Toyoda, M., Zhang, D., Wang, W., Cui, B. (eds) Web and Big Data. APWeb-WAIM 2019. Lecture Notes in Computer Science, vol 11641. Springer, Cham. https://doi.org/10.1007/978-3-030-26072-9_30
DOI: https://doi.org/10.1007/978-3-030-26072-9_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26071-2
Online ISBN: 978-3-030-26072-9
eBook Packages: Computer Science, Computer Science (R0)