Coherence and Salience-Based Multi-Document Relationship Mining

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11641)

Abstract

In today’s interconnected world, there is an endless 24/7 stream of new articles appearing online. Faced with these overwhelming amounts of data, it is often helpful to consider only the key entities and concepts and their relationships. This is challenging, as relevant connections may be spread across a number of disparate articles and sources. In this paper, we propose a unified framework that helps users quickly discern salient connections and facts from a set of related documents and presents the resulting information in a graph-based visualization. Specifically, given a set of relevant documents as input, we first extract candidate facts from these sources using Open Information Extraction (Open IE) approaches. Then, we design a Two-Stage Candidate Triple Filtering (TCTF) approach, based on a self-training framework, that retains only those candidate facts that are coherent with the specified document topic and connects them to form an initial graph. We further refine this graph with a heuristic to ensure that the final conceptual graph consists only of facts likely to represent meaningful and salient relationships, which users may explore graphically. Experiments on two real-world datasets show that our extraction approach achieves an average F-score 2.4% higher than several Open IE baselines. We also present an empirical evaluation of the quality of the final conceptual graph across different topics in terms of its coverage of topic entities and concepts, its confidence score, and the compatibility of the facts it contains. Experimental results show the effectiveness of our proposed approach.

Notes

  1. https://safetyapp.shinyapps.io/GoWvis/.

  2. http://tagesnetzwerk.de.

  3. http://www.newsleak.io/.

  4. During the extraction, we applied a few straightforward transformations to correct two common types of errors, wrong boundaries and uninformative extractions, which were caused by the syntactic analysis in the extraction approaches.

  5. https://en.wikipedia.org/wiki/Tf%E2%80%93idf.

  6. https://github.com/letiantian/TextRank4ZH/blob/master/README.md.

  7. A triple extracted by OLLIE, ClausIE, or MinIE is judged to be a correct extraction only when its confidence is greater than 0.85.

  8. The similarity score between two candidate facts \(f_{i}\), \(f_{j}\) is computed as \(sim(f_{i}, f_{j}) = \gamma \cdot s_{k} + (1-\gamma ) \cdot l_{k}\), where \(s_{k}\) and \(l_{k}\) denote the semantic similarity and literal similarity scores between the facts, respectively. We compute \(s_{k}\) using the Align, Disambiguate and Walk algorithm [2], while \(l_{k}\) is computed using the Jaccard index. \(\gamma =0.8\) denotes the relative degree to which the semantic similarity contributes to the overall similarity score, as opposed to the literal similarity.

  9. Kappa implementation: https://gist.github.com/ShinNoNoir/9687179.

  10. https://duc.nist.gov/.

  11. http://research.signalmedia.co/newsir16/signal-dataset.html.

  12. A triple is annotated as correct if the following conditions are met: (i) it is entailed by its corresponding clause; (ii) it is reasonable or meaningful without any context; and (iii) all three annotators mark it as correct (the inter-annotator agreement was 82%, \(\kappa = 0.60\)).

  13. An entity or concept is regarded as a topic concept when it occurs in the topic word list.

  14. For popular OpenIE systems such as OLLIE, ClausIE, and MinIE, we use the confidence value computed by each system itself as the confidence score of each fact.

  15. We mark the top-2 F-score results in boldface.

  16. The random forest classifier has an average absolute 87% higher on the F-score metric for different topics when the model has converged.
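
The weighted combination described in note 8, sim = γ · s + (1 − γ) · l with γ = 0.8, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Triple`, `tokens`, `jaccard`, and `combined_similarity` are hypothetical, the literal similarity is assumed to be a Jaccard index over lower-cased whitespace tokens of the whole triple, and the semantic similarity (Align, Disambiguate and Walk in the paper) is left as a caller-supplied function because the note does not specify how it is invoked.

```python
from typing import Callable, Set, Tuple

# Hypothetical representation of a fact: (subject, relation, object).
Triple = Tuple[str, str, str]


def tokens(fact: Triple) -> Set[str]:
    """Lower-cased whitespace tokens drawn from all three slots of a triple."""
    return set(" ".join(fact).lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; taken to be 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(
    f_i: Triple,
    f_j: Triple,
    semantic_sim: Callable[[Triple, Triple], float],
    gamma: float = 0.8,
) -> float:
    """Weighted sum gamma * s + (1 - gamma) * l from note 8.

    semantic_sim stands in for the semantic score s (assumed to lie in [0, 1]);
    the literal score l is the token-level Jaccard index (an assumption).
    """
    s = semantic_sim(f_i, f_j)
    l = jaccard(tokens(f_i), tokens(f_j))
    return gamma * s + (1.0 - gamma) * l
```

For example, `combined_similarity(("Obama", "visited", "Paris"), ("Obama", "visited", "Berlin"), my_semantic_sim)` blends whatever score the hypothetical `my_semantic_sim` returns with a literal overlap of 2 shared tokens out of 4 distinct ones. With γ = 0.8 the semantic term dominates, so two facts that share few words can still score as similar when their meanings are close.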

References

  1. Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: ACL, pp. 55–60 (2014)

  2. Pilehvar, M.T., Jurgens, D., Navigli, R.: Align, disambiguate and walk: a unified approach for measuring semantic similarity. In: ACL (Volume 1: Long Papers), pp. 1341–1351 (2013)

  3. Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI, pp. 2670–2676 (2007)

  4. Kochtchi, A., Landesberger, T.V., Biemann, C.: Networks of names: visual exploration and semi-automatic tagging of social networks from newspaper articles. In: Computer Graphics Forum, pp. 211–220 (2014)

  5. Spitkovsky, V.I., Chang, A.X.: A cross-lingual dictionary for English Wikipedia concepts. In: LREC, pp. 3168–3175 (2012)

  6. Mausam, M.: Open information extraction systems and downstream applications. In: IJCAI, pp. 4074–4077 (2016)

  7. Schmitz, M., Bart, R., Soderland, S., Etzioni, O.: Open language learning for information extraction. In: EMNLP-CoNLL, pp. 523–534 (2012)

  8. Del Corro, L., Gemulla, R.: ClausIE: clause-based open information extraction. In: WWW, pp. 355–366 (2013)

  9. Gashteovski, K., Gemulla, R., Del Corro, L.: MinIE: minimizing facts in open information extraction. In: EMNLP, pp. 2630–2640 (2017)

  10. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: EMNLP (2004)

  11. Mann, G.: Multi-document relationship fusion via constraints on probabilistic databases. In: Human Language Technologies 2007: NAACL, pp. 332–339 (2007)

  12. Ji, H., Favre, B., Lin, W.P., Gillick, D., Hakkani-Tur, D., Grishman, R.: Open-domain multi-document summarization via information extraction: challenges and prospects. In: Multi-source, multilingual information extraction and summarization, pp. 177–201 (2013)

  13. Fuchs, C.A., Peres, A.: Quantum-state disturbance versus information. Uncertainty Relat. Quantum Inf. Phys. Rev. A 53(4), 20–38 (1996)

  14. Yu, D., Huang, L., Ji, H.: Open relation extraction and grounding. In: IJCNLP (Volume 1: Long Papers), pp. 854–864 (2017)

  15. Sheng, Y., Xu, Z., Wang, Y., Zhang, X., Jia, J., You, Z., de Melo, G.: Visualizing multi-document semantics via open domain information extraction. In: ECML-PKDD, pp. 695–699 (2018)

Acknowledgments

This paper was partially supported by the National Natural Science Foundation of China (Nos. 61572111 and 61876034) and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003).

Author information

Corresponding author

Correspondence to Zenglin Xu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Sheng, Y., Xu, Z. (2019). Coherence and Salience-Based Multi-Document Relationship Mining. In: Shao, J., Yiu, M., Toyoda, M., Zhang, D., Wang, W., Cui, B. (eds) Web and Big Data. APWeb-WAIM 2019. Lecture Notes in Computer Science, vol 11641. Springer, Cham. https://doi.org/10.1007/978-3-030-26072-9_30

  • DOI: https://doi.org/10.1007/978-3-030-26072-9_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26071-2

  • Online ISBN: 978-3-030-26072-9

  • eBook Packages: Computer Science, Computer Science (R0)
