Coherence and Salience-Based Multi-Document Relationship Mining

Sheng, Yongpan; Xu, Zenglin

doi:10.1007/978-3-030-26072-9_30

Conference paper
First Online: 18 July 2019

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11641)

Abstract

In today’s interconnected world, there is an endless 24/7 stream of new articles appearing online. Faced with these overwhelming amounts of data, it is often helpful to consider only the key entities and concepts and their relationships. This is challenging, as relevant connections may be spread across a number of disparate articles and sources. In this paper, we propose a unified framework that helps users quickly discern salient connections and facts from a set of related documents and presents the resulting information in a graph-based visualization. Specifically, given a set of relevant documents as input, we first extract candidate facts from these sources using Open Information Extraction (Open IE) approaches. Then, we design a Two-Stage Candidate Triple Filtering (TCTF) approach based on a self-training framework to retain only those candidate facts that are coherent with the specified document topic, and connect them in the form of an initial graph. We further refine this graph with a heuristic to ensure that the final conceptual graph consists only of facts likely to represent meaningful and salient relationships, which users may explore graphically. Experiments on two real-world datasets show that our extraction approach achieves an average F-score 2.4% higher than several Open IE baselines. We also present an empirical evaluation of the quality of the final conceptual graph for different topics in terms of its coverage rate of topic entities and concepts, its confidence score, and the compatibility of the facts involved. Experimental results show the effectiveness of our proposed approach.

Notes

1. https://safetyapp.shinyapps.io/GoWvis/.
2. http://tagesnetzwerk.de.
3. http://www.newsleak.io/.
4. During the extraction, we applied a few straightforward transformations to correct two common types of errors, wrong boundaries and uninformative extractions, which were caused by the syntactic analysis in the extraction approaches.
5. https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
6. https://github.com/letiantian/TextRank4ZH/blob/master/README.md.
7. A triple extracted by OLLIE, ClausIE, or MinIE is judged to be a correct extraction only when its confidence is greater than 0.85.
8. The similarity score between two candidate facts \(f_{i}\), \(f_{j}\) is computed as \(sim(f_{i}, f_{j}) = \gamma \cdot s_{k} + (1-\gamma ) \cdot l_{k}\), where \(s_{k}\) and \(l_{k}\) denote the semantic similarity and literal similarity scores between the facts, respectively. We compute \(s_{k}\) using the Align, Disambiguate and Walk algorithm [2], while \(l_{k}\) is computed using the Jaccard index. \(\gamma =0.8\) denotes the relative degree to which the semantic similarity contributes to the overall similarity score, as opposed to the literal similarity.
9. Kappa implementation: https://gist.github.com/ShinNoNoir/9687179.
10. https://duc.nist.gov/.
11. http://research.signalmedia.co/newsir16/signal-dataset.html.
12. A triple is annotated as correct if the following conditions are met: (i) it is entailed by its corresponding clause; (ii) it is reasonable or meaningful without any context; and (iii) all three annotators mark it as correct (the inter-annotator agreement was 82%, \(\kappa = 0.60\)).
13. An entity or concept is regarded as a topic concept when it occurs in the topic word list.
14. For popular OpenIE systems such as OLLIE, ClausIE, and MinIE, we use the confidence value computed by each system itself as the confidence score of each fact.
15. We mark the top-2 F-score results in boldface.
16. Once the model has converged, the random forest classifier reaches an average absolute F-score above 87% across the different topics.
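
The weighted similarity in note 8 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the representation of a fact as a plain string, and the whitespace tokenization behind the Jaccard index are all assumptions made for illustration. The semantic score is passed in as a callable, since the Align, Disambiguate and Walk algorithm is not reproduced here.

```python
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Jaccard index over lowercase whitespace tokens (tokenization is an assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty facts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def combined_similarity(
    fact_i: str,
    fact_j: str,
    semantic_sim: Callable[[str, str], float],  # hypothetical stand-in for the semantic scorer
    gamma: float = 0.8,                         # weight of the semantic score, as in note 8
) -> float:
    """sim(f_i, f_j) = gamma * s_k + (1 - gamma) * l_k."""
    s_k = semantic_sim(fact_i, fact_j)  # semantic similarity in [0, 1]
    l_k = jaccard(fact_i, fact_j)       # literal similarity in [0, 1]
    return gamma * s_k + (1 - gamma) * l_k
```

With \(\gamma = 0.8\), two facts that share few words but are judged semantically close still receive a high score, which reflects the stated intent that semantic similarity dominates literal overlap.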

References

1. Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: ACL, pp. 55–60 (2014)
2. Pilehvar, M.T., Jurgens, D., Navigli, R.: Align, disambiguate and walk: a unified approach for measuring semantic similarity. In: ACL (Volume 1: Long Papers), pp. 1341–1351 (2013)
3. Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI, pp. 2670–2676 (2007)
4. Kochtchi, A., Landesberger, T.V., Biemann, C.: Networks of names: visual exploration and semi-automatic tagging of social networks from newspaper articles. In: Computer Graphics Forum, pp. 211–220 (2014)
5. Spitkovsky, V.I., Chang, A.X.: A cross-lingual dictionary for English Wikipedia concepts. In: LREC, pp. 3168–3175 (2012)
6. Mausam, M.: Open information extraction systems and downstream applications. In: IJCAI, pp. 4074–4077 (2016)
7. Schmitz, M., Bart, R., Soderland, S., Etzioni, O.: Open language learning for information extraction. In: EMNLP-CoNLL, pp. 523–534 (2012)
8. Del Corro, L., Gemulla, R.: ClausIE: clause-based open information extraction. In: WWW, pp. 355–366 (2013)
9. Gashteovski, K., Gemulla, R., Del Corro, L.: MinIE: minimizing facts in open information extraction. In: EMNLP, pp. 2630–2640 (2017)
10. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: EMNLP (2004)
11. Mann, G.: Multi-document relationship fusion via constraints on probabilistic databases. In: Human Language Technologies 2007: NAACL, pp. 332–339 (2007)
12. Ji, H., Favre, B., Lin, W.P., Gillick, D., Hakkani-Tur, D., Grishman, R.: Open-domain multi-document summarization via information extraction: challenges and prospects. In: Multi-source, multilingual information extraction and summarization, pp. 177–201 (2013)
13. Fuchs, C.A., Peres, A.: Quantum-state disturbance versus information gain: uncertainty relations for quantum information. Phys. Rev. A 53(4), 2038–2045 (1996)
14. Yu, D., Huang, L., Ji, H.: Open relation extraction and grounding. In: IJCNLP (Volume 1: Long Papers), pp. 854–864 (2017)
15. Sheng, Y., Xu, Z., Wang, Y., Zhang, X., Jia, J., You, Z., de Melo, G.: Visualizing multi-document semantics via open domain information extraction. In: ECML-PKDD, pp. 695–699 (2018)

Acknowledgments

This paper was partially supported by the National Natural Science Foundation of China (Nos. 61572111 and 61876034) and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003).

Author information

Authors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Yongpan Sheng & Zenglin Xu

Corresponding author

Correspondence to Zenglin Xu.

Editor information

Editors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Jie Shao
Hong Kong Polytechnic University, Hong Kong, China
Man Lung Yiu
The University of Tokyo, Tokyo, Japan
Masashi Toyoda
Zhejiang University, Hangzhou, China
Dongxiang Zhang
National University of Singapore, Singapore, Singapore
Wei Wang
Peking University, Beijing, China
Bin Cui

About this paper

Cite this paper

Sheng, Y., Xu, Z. (2019). Coherence and Salience-Based Multi-Document Relationship Mining. In: Shao, J., Yiu, M., Toyoda, M., Zhang, D., Wang, W., Cui, B. (eds) Web and Big Data. APWeb-WAIM 2019. Lecture Notes in Computer Science, vol 11641. Springer, Cham. https://doi.org/10.1007/978-3-030-26072-9_30

DOI: https://doi.org/10.1007/978-3-030-26072-9_30
Published: 18 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26071-2
Online ISBN: 978-3-030-26072-9
eBook Packages: Computer Science, Computer Science (R0)
