Interactive Topic Graph Extraction and Exploration of Web Content

Abstract

We present an approach that uses interactive topic graph extraction for the exploration of Web content. A user issues the initial information request online, in the form of a query topic description. The system then constructs a topic graph from N Web snippets returned by a standard search engine. We treat topic graph extraction as a specific empirical collocation extraction task in which collocations are extracted between chunks. Our measure of association strength is based on the pointwise mutual information between chunk pairs and explicitly takes their distance into account. Users can analyze the topic graph further and, by selecting interesting nodes or pairs of nodes, request additional background information: explicit relationships extracted from Wikipedia or automatically extracted from additional Web content, as well as conceptual information about the topic in the form of semantically oriented clusters of descriptive phrases. This information is presented to the users, who can investigate the identified information nuggets to refine their search. An initial user evaluation shows that the approach is especially helpful for finding new, interesting information on topics about which the user has only a vague idea or no idea at all.
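To make the idea of a distance-aware association measure concrete, the following is a minimal sketch, not the formula used in the chapter. It assumes that each snippet has already been split into an ordered list of chunks, that the distance between two chunks is the difference of their positions within a snippet, and that the score of a triple (chunk, chunk, distance) is the base-2 logarithm of its relative frequency divided by the product of the relative frequencies of the two chunks. The function names and the `max_distance` cutoff are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def distance_aware_pmi(snippets, max_distance=3):
    """Score (chunk_i, chunk_j, distance) triples with a PMI-style measure.

    snippets: list of snippets, each a list of chunk strings in text order.
    max_distance: assumed cutoff; pairs further apart are ignored.
    """
    chunk_counts = Counter()
    triple_counts = Counter()
    for chunks in snippets:
        chunk_counts.update(chunks)
        # Enumerate ordered chunk pairs inside one snippet together with
        # their distance measured in chunk positions.
        for (i, left), (j, right) in combinations(enumerate(chunks), 2):
            distance = j - i
            if distance <= max_distance:
                triple_counts[(left, right, distance)] += 1

    total_chunks = sum(chunk_counts.values())
    total_triples = sum(triple_counts.values())
    scores = {}
    for (left, right, distance), count in triple_counts.items():
        p_triple = count / total_triples
        p_left = chunk_counts[left] / total_chunks
        p_right = chunk_counts[right] / total_chunks
        scores[(left, right, distance)] = math.log2(p_triple / (p_left * p_right))
    return scores


def topic_graph_edges(scores):
    """Keep one edge per chunk pair: the distance with the highest score."""
    best = {}
    for (left, right, distance), score in scores.items():
        if (left, right) not in best or score > best[(left, right)][1]:
            best[(left, right)] = (distance, score)
    return best


snippets = [
    ["Jim Clark", "Formula One", "Lotus"],
    ["Jim Clark", "Lotus"],
]
scores = distance_aware_pmi(snippets)
edges = topic_graph_edges(scores)
```

In this sketch the nodes of the topic graph are the chunks and each edge carries the best-scoring distance for its pair; a real system would additionally filter edges by a score threshold.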

Notes

  1. Actually, both languages are only supported in the i–GNSSMM mode. In the case of the i–MILREX mode, we currently only support the English Wikipedia.

  2. Consult, for example, the Web page http://nlp.uned.es/weps/ for more information about the problem space.

  3. The screenshots show relations retrieved from Wikipedia infoboxes only. The component for detecting missing relationships is not yet integrated into the running system.

  4. For the remainder of the paper, N = 1000. We use Bing (http://www.bing.com/) for Web search.

  5. Concerning the English PoS tags, “word/PoS” expressions that match the following regular expression are considered extended noun tags: “/(N(N|P))|/VB(N|G)|/IN|/DT”. The English verbs are those whose PoS tags start with VB. We use the tag sets from the Penn treebank (English) and the Negra treebank (German).

  6. Currently, the main purpose of recognizing verb chunks is to improve proper recognition of noun groups. The verb chunks are ignored when building the topic graph.

  7. In fact, we used the polynomials of the Taylor series for ln(1 + x). Note also that k is actually restricted by the number of chunks in a snippet.

  8. For “Jim Clark”, e.g., Wikipedia’s infoboxes do not provide information for the relations birthplace, place_of_death, or cause_of_death.

  9. The classification of NP chunks to argument types like times and dates is currently done by using simple regular expressions.
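The regular expression quoted in note 5 can be tried out directly. The sketch below is an illustration under stated assumptions rather than the authors' chunker: it assumes tokens of the form “word/PoS” with Penn treebank tags, writes the alternation operator as “|”, and anchors the pattern at the start of the tag. The helper names are hypothetical.

```python
import re

# Pattern from note 5, written with "|" as the alternation operator.
EXTENDED_NOUN_TAG = re.compile(r"/(N(N|P))|/VB(N|G)|/IN|/DT")


def pos_tag(token):
    """Return the PoS part of a "word/PoS" token (text after the last slash)."""
    return token.rsplit("/", 1)[1]


def is_extended_noun(token):
    """True if the token's tag matches the extended noun tag pattern."""
    # Match against "/TAG" only, so slashes inside the word cannot interfere.
    return EXTENDED_NOUN_TAG.match("/" + pos_tag(token)) is not None


def is_verb(token):
    """True if the token's tag starts with VB, as stated in note 5."""
    return pos_tag(token).startswith("VB")


tagged = ["the/DT", "house/NN", "of/IN", "cards/NNS", "runs/VBZ", "built/VBN"]
noun_like = [t for t in tagged if is_extended_noun(t)]
verbs = [t for t in tagged if is_verb(t)]
```

Participles and gerunds (VBN, VBG) satisfy both tests in this sketch; the note does not say how that overlap is resolved, so the code leaves it to the caller.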

Acknowledgements

The presented work was partially supported by grants from the German Federal Ministry of Economics and Technology (BMWi) to the Theseus project (FKZ: 01MQ07016).

Author information

Corresponding author

Correspondence to Günter Neumann.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Neumann, G., Schmeier, S. (2013). Interactive Topic Graph Extraction and Exploration of Web Content. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_7

  • DOI: https://doi.org/10.1007/978-3-642-28569-1_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28568-4

  • Online ISBN: 978-3-642-28569-1

  • eBook Packages: Computer Science (R0)
