The Utrecht Blend: Basic Ingredients for an XML Retrieval System

van Zwol, Roelof; Wiering, Frans; Dignum, Virginia

doi:10.1007/11424550_12

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3493)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

Exploiting the structure of a document allows for more powerful information retrieval techniques. This article discusses a basic approach to the retrieval of XML document fragments. Starting from a vector-space model for text retrieval, we investigate various strategies that influence the retrieval performance of an XML-based IR system.

The first extension of the system uses a schema-based approach, which assumes that authors tag their text to emphasise particular pieces of content that are important. From the schema used by the document collection, the system can easily derive the children of mixed-content nodes. Our hypothesis is that those child nodes are more important than other nodes.
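As a rough illustration of what "children of mixed-content nodes" means, the sketch below finds them in a single document instance using Python's standard `xml.etree.ElementTree`. Note the assumptions: the system described here derives these nodes from the collection's schema, whereas this sketch inspects an instance instead; the function name `mixed_content_children` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def mixed_content_children(root):
    """Return the child elements of every element that mixes text with child elements.

    Assumption: mixed content is detected from the instance (non-whitespace text
    next to child elements), not from a schema as in the approach described above.
    """
    found = []
    for parent in root.iter():
        children = list(parent)
        if not children:
            continue
        # Text before the first child, or text following any child, makes the parent mixed.
        has_text = bool((parent.text or "").strip()) or any(
            (child.tail or "").strip() for child in children
        )
        if has_text:
            found.extend(children)
    return found

# Hypothetical sample fragment.
doc = ET.fromstring(
    "<sec><title>Intro</title><p>We use <b>XML</b> and <it>tf-idf</it> weights.</p></sec>"
)
tags = [element.tag for element in mixed_content_children(doc)]
```

Here only `<p>` has mixed content, so the nodes that would receive extra weight under the hypothesis are its children `<b>` and `<it>`.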

The second approach is based on a horizontal fragmentation of the inverse document frequencies used by the vector-space model. The underlying assumption is that the distribution of terms is related to the semantic structure of the document. However, we observed that the IEEE collection is not a good example of semantic tagging.
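One possible reading of this fragmentation, sketched below, is to keep a separate set of inverse document frequencies per element name instead of one global set. This is an assumption made for illustration only: the abstract does not specify how the fragments are formed, and the function name `fragmented_idf`, the unsmoothed `log(N / df)` formula, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict

def fragmented_idf(fragments):
    """Compute idf separately for each element name.

    fragments: iterable of (tag, text) pairs, one pair per XML element.
    Returns {tag: {term: idf}}, where idf = log(N_tag / df_tag(term)).
    """
    counts = defaultdict(int)                    # number of fragments per tag
    df = defaultdict(lambda: defaultdict(int))   # tag -> term -> fragment frequency
    for tag, text in fragments:
        counts[tag] += 1
        for term in set(text.lower().split()):   # count each term once per fragment
            df[tag][term] += 1
    return {
        tag: {term: math.log(counts[tag] / n) for term, n in terms.items()}
        for tag, terms in df.items()
    }

# Hypothetical toy collection.
fragments = [
    ("title", "xml retrieval"),
    ("title", "vector space model"),
    ("p", "xml retrieval with the vector space model"),
    ("p", "xml fragments"),
]
idf = fragmented_idf(fragments)
```

In this toy data the term "xml" occurs in every `p` fragment but in only one of the two `title` fragments, so it is uninformative within `p` and discriminative within `title`. A split of this kind only pays off when element names carry meaning, which is the concern the abstract raises about the IEEE collection.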

The third approach investigates how retrieval performance on the ‘Content Only’ task can be improved by using a set of a priori defined cut-off nodes, which delimit ‘logical’ document fragments that are of interest to a user.
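A minimal sketch of the cut-off idea follows, assuming that the cut-off nodes are given as a fixed set of element names and that only elements with those names are treated as retrievable fragments. The set `CUT_OFF_TAGS`, the function `logical_fragments`, and the sample markup are hypothetical and not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical a priori cut-off set; the actual node names are not given in the abstract.
CUT_OFF_TAGS = {"article", "sec", "p"}

def logical_fragments(root, cut_off_tags=CUT_OFF_TAGS):
    """Return (tag, text) for every element whose name is in the cut-off set.

    Elements outside the set (for example inline markup) are never returned on
    their own; their text still contributes to the enclosing fragment.
    """
    fragments = []
    for element in root.iter():
        if element.tag in cut_off_tags:
            text = " ".join("".join(element.itertext()).split())
            fragments.append((element.tag, text))
    return fragments

doc = ET.fromstring(
    "<article><sec><title>Intro</title><p>We use <b>XML</b>.</p></sec></article>"
)
frags = logical_fragments(doc)
```

With this sample, `<title>` and `<b>` are never candidate answers by themselves, while the paragraph that contains `<b>` is returned as one fragment.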

Author information

Authors and Affiliations

Centre for Content and Knowledge Engineering, Utrecht University, Utrecht, The Netherlands
Roelof van Zwol, Frans Wiering & Virginia Dignum

Editor information

Editors and Affiliations

University of Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
Queen Mary, University of London, London, UK
Mounia Lalmas
University of Duisburg-Essen, Germany
Saadia Malik
Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17, H-1111, Budapest, Hungary
Zoltán Szlávik

Cite this paper

van Zwol, R., Wiering, F., Dignum, V. (2005). The Utrecht Blend: Basic Ingredients for an XML Retrieval System. In: Fuhr, N., Lalmas, M., Malik, S., Szlávik, Z. (eds) Advances in XML Information Retrieval. INEX 2004. Lecture Notes in Computer Science, vol 3493. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424550_12

DOI: https://doi.org/10.1007/11424550_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26166-7
Online ISBN: 978-3-540-32053-1
eBook Packages: Computer Science, Computer Science (R0)