Interactive Topic Graph Extraction and Exploration of Web Content

Neumann, Günter; Schmeier, Sven

doi:10.1007/978-3-642-28569-1_7


Chapter
First Online: 01 January 2012


Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

We present an approach that uses interactive topic graph extraction for the exploration of Web content. The initial information request, in the form of a query topic description, is issued online by a user to the system. The topic graph is then constructed from N Web snippets that are produced by a standard search engine. We consider the extraction of a topic graph to be a specific empirical collocation extraction task, where collocations are extracted between chunks. Our measure of association strength is based on the pointwise mutual information between chunk pairs, which explicitly takes their distance into account. Users can then analyze the topic graph further and request additional background information with the help of interesting nodes and pairs of nodes in the graph. Examples include explicit relationships extracted from Wikipedia or automatically extracted from additional Web content, as well as conceptual information about the topic in the form of semantically oriented clusters of descriptive phrases. This information is presented to the users, who can investigate the identified information nuggets to refine their information search. An initial user evaluation shows that our approach is especially helpful for finding new, interesting information on topics about which the user has only a vague idea or no idea at all.
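The association measure described in the abstract can be illustrated with a minimal Python sketch. The exact distance weighting used in the chapter is not given on this page, so the sketch assumes that each co-occurrence of two chunks at distance d within a snippet contributes 1/d to the joint count. The function name `distance_weighted_pmi`, the parameter `max_dist`, and the sample snippets are hypothetical and serve illustration only.

```python
import math
from collections import Counter


def distance_weighted_pmi(snippets, max_dist=3):
    """Score chunk pairs with a PMI-style measure.

    snippets: list of snippets, each a list of chunk strings in text order.
    Assumption (not from the chapter): a co-occurrence at chunk distance d
    adds 1/d to the pair's joint weight, so closer pairs count more.
    """
    chunk_counts = Counter()
    pair_weights = Counter()
    for chunks in snippets:
        chunk_counts.update(chunks)
        for i, left in enumerate(chunks):
            for d in range(1, max_dist + 1):
                j = i + d
                if j >= len(chunks):
                    break
                right = chunks[j]
                if left == right:
                    continue  # ignore a chunk paired with itself
                pair = tuple(sorted((left, right)))
                pair_weights[pair] += 1.0 / d

    total_chunks = sum(chunk_counts.values())
    total_pairs = sum(pair_weights.values())
    scores = {}
    for (a, b), weight in pair_weights.items():
        p_ab = weight / total_pairs
        p_a = chunk_counts[a] / total_chunks
        p_b = chunk_counts[b] / total_chunks
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


if __name__ == "__main__":
    # Hypothetical, already chunked snippets for the query "Jim Clark".
    snippets = [
        ["jim clark", "formula one", "lotus"],
        ["jim clark", "formula one"],
        ["netscape", "jim clark"],
    ]
    ranked = sorted(distance_weighted_pmi(snippets).items(),
                    key=lambda item: item[1], reverse=True)
    for pair, score in ranked:
        print(pair, round(score, 2))
```

In a topic graph built this way, chunks would become nodes and the highest-scoring pairs would become edges. How the chapter thresholds or ranks edges is not stated here, so that step is left out.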


Notes

1. Actually, both languages are only supported in the i–GNSSMM mode. In the i–MILREX mode, we currently support only the English Wikipedia.
2. See, for example, http://nlp.uned.es/weps/ for more information about the problem space.
3. The screenshots show relations retrieved from Wikipedia infoboxes only. The component for detecting missing relationships is not yet integrated into the running system.
4. For the remainder of the paper, N = 1000. We use Bing (http://www.bing.com/) for Web search.
5. Concerning the English PoS tags, “word/PoS” expressions that match the following regular expression are considered extended noun tags: “/(N(N|P))|/VB(N|G)|/IN|/DT”. The English verbs are those whose PoS tags start with VB. We use the tag sets from the Penn Treebank (English) and the NEGRA treebank (German).
6. Currently, the main purpose of recognizing verb chunks is to improve the recognition of noun groups. The verb chunks are ignored when building the topic graph.
7. In fact, we used the polynomials of the Taylor series for ln(1 + x). Note also that k is actually restricted by the number of chunks in a snippet.
8. For “Jim Clark”, for example, Wikipedia’s infoboxes do not provide information for the relations birthplace, place_of_death, or cause_of_death.
9. The classification of NP chunks into argument types such as times and dates is currently done with simple regular expressions.
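
The tag tests from note 5 can be sketched in Python as follows. The regular expression is the one quoted in the note; the helper names (`split_token`, `is_extended_noun`, `is_verb`) and the sample sentence are hypothetical. Tags such as VBN and VBG satisfy both tests, and the note does not say how that overlap is resolved, so the sketch simply reports both.

```python
import re

# Regular expression quoted in note 5, applied to the "/PoS" part of a token.
EXTENDED_NOUN_TAG = re.compile(r"/(N(N|P))|/VB(N|G)|/IN|/DT")


def split_token(token):
    """Split a 'word/PoS' token at its last slash into (word, tag)."""
    word, _, tag = token.rpartition("/")
    return word, tag


def is_extended_noun(token):
    """True if the token's Penn Treebank tag matches the note 5 pattern."""
    _, tag = split_token(token)
    return EXTENDED_NOUN_TAG.match("/" + tag) is not None


def is_verb(token):
    """True if the token's tag starts with VB, as stated in note 5."""
    _, tag = split_token(token)
    return tag.startswith("VB")


if __name__ == "__main__":
    # Hypothetical tagged sentence, for illustration only.
    tagged = "the/DT fastest/JJS driver/NN of/IN Lotus/NNP won/VBD".split()
    for token in tagged:
        print(token, is_extended_noun(token), is_verb(token))
```

Matching against the tag alone, and not the whole token, avoids false hits when the word itself contains a slash.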


Acknowledgements

The presented work was partially supported by grants from the German Federal Ministry of Economics and Technology (BMWi) to the Theseus project (FKZ: 01MQ07016).

Author information

Authors and Affiliations

German Research Center for Artificial Intelligence GmbH (DFKI), Stuhlsatzenhausweg 3, D-66123, Saarbrücken, Germany
Günter Neumann & Sven Schmeier

Authors

Günter Neumann
Sven Schmeier

Corresponding author

Correspondence to Günter Neumann.

Editor information

Editors and Affiliations

Universite Sorbonne Nouvelle, LATTICE-CNRS, Ecole Normale Superieure, rue d'Ulm 45, Paris, 75005, France
Thierry Poibeau
Information & Communication Technologies, Universitat Pompeu Fabra, C/ Tanger 122-140, Barcelona, 08018, Spain
Horacio Saggion
Institute for Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Jakub Piskorski
Department of Computer Science, University of Helsinki, Gustaf Hällströmin katu 2, Helsinki, 00014, Finland
Roman Yangarber

About this chapter

Cite this chapter

Neumann, G., Schmeier, S. (2013). Interactive Topic Graph Extraction and Exploration of Web Content. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_7

DOI: https://doi.org/10.1007/978-3-642-28569-1_7
Published: 12 July 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28568-4
Online ISBN: 978-3-642-28569-1
eBook Packages: Computer Science, Computer Science (R0)
