Information Access via Topic Hierarchies and Thematic Annotations from Document Collections

  • Conference paper

Abstract

With the development and availability of large textual corpora, there is a need to enrich and organize these corpora so as to make searching and navigating among the documents easier. Semantic Web research focuses on augmenting ordinary Web pages with semantics. Indeed, although a wealth of information exists today in electronic form, it cannot be easily processed by computers because it lacks external semantics. Furthermore, added semantics help users locate and process information and compare document contents. So far, Semantic Web research has focused on standardization, the internal structuring of pages, and the sharing of ontologies in a variety of domains. Concerning external structuring, the hypertext and information retrieval communities propose to indicate relations between documents via hyperlinks or by organizing documents into concept hierarchies, both of which are developed manually. We consider here the problem of automatically structuring and organizing corpora in a way that reflects semantic relations between documents. We propose an algorithm for automatically inferring concept hierarchies from a corpus. We then show how this method may be used to create specialization/generalization links between documents, leading to document hierarchies. As a byproduct, documents are annotated with keywords giving the main concepts present in them. We also introduce numerical criteria for measuring the relevance of the automatically generated hierarchies and describe some experiments performed on data from the LookSmart and New Scientist web sites.
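The abstract does not spell out how the concept hierarchy is inferred, so the following is only an illustrative sketch of one generic way such a hierarchy could be built, not the algorithm proposed in the paper. It assumes a simple co-occurrence subsumption heuristic: a term x is treated as more general than a term y when x appears in most documents that contain y, but not the other way round. The function name `subsumption_hierarchy`, the `threshold` value of 0.8, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations


def subsumption_hierarchy(docs, threshold=0.8):
    """Return {child_term: parent_term} from document co-occurrence.

    Illustrative heuristic only (an assumption, not the paper's method):
    x is a candidate parent of y when P(x | y) >= threshold and P(y | x) < 1.
    """
    # Map each term to the set of document indices that contain it.
    postings = defaultdict(set)
    for idx, terms in enumerate(docs):
        for term in set(terms):
            postings[term].add(idx)

    parents = {}
    for x, y in permutations(postings, 2):
        both = len(postings[x] & postings[y])
        p_x_given_y = both / len(postings[y])
        p_y_given_x = both / len(postings[x])
        if p_x_given_y >= threshold and p_y_given_x < 1.0:
            # Among candidate parents, keep the most specific one,
            # i.e. the one that occurs in the fewest documents.
            best = parents.get(y)
            if best is None or len(postings[x]) < len(postings[best]):
                parents[y] = x
    return parents


# Toy corpus: each document is reduced to its set of index terms.
docs = [
    {"science", "physics", "quantum"},
    {"science", "physics"},
    {"science", "biology"},
    {"science", "biology", "genetics"},
    {"science"},
]
hierarchy = subsumption_hierarchy(docs)
print(hierarchy)
```

On this toy corpus, "science" ends up as the root, with "physics" and "biology" below it and "quantum" and "genetics" one level further down. A document could then be attached to the most specific term it contains, which gives a rough analogue of the specialization/generalization links and keyword annotations mentioned in the abstract.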

Copyright information

© 2006 Springer

About this paper

Cite this paper

Fotzo, H.N., Gallinari, P. (2006). Information Access via Topic Hierarchies and Thematic Annotations from Document Collections. In: Seruca, I., Cordeiro, J., Hammoudi, S., Filipe, J. (eds) Enterprise Information Systems VI. Springer, Dordrecht. https://doi.org/10.1007/1-4020-3675-2_17

  • DOI: https://doi.org/10.1007/1-4020-3675-2_17

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-3674-3

  • Online ISBN: 978-1-4020-3675-0

  • eBook Packages: Computer Science, Computer Science (R0)
