Abstract
Many applications, and not only since the advent of XML, call for efficient structured document retrieval, which challenges both Information Retrieval (IR) and database (DB) research. Most approaches that combine indexing techniques from both fields still match paths and content separately and merge the hits in an expensive join. This paper shows that retrieval is significantly accelerated by processing text and structure simultaneously. The Content-Aware DataGuide (CADG) interleaves IR and DB indexing techniques to minimize path matching and suppress joins at query time, which also saves needless I/O operations during retrieval. Extensive experiments show that the CADG outperforms the DataGuide [11,14] by a factor of 5 to 200 on average. For structurally unselective queries, it is over 400 times faster than the DataGuide. The best results were achieved on large collections of heterogeneously structured textual documents.
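The contrast between separate and simultaneous matching can be illustrated with a toy sketch. It is not the CADG data structure from the paper: the corpus, the function names, and the per-path term sets are assumptions made only for illustration. The first query matches paths and content independently and joins the results; the second consults content information attached to each path before fetching that path's node list, so paths that cannot contain the keyword are skipped. It shows the pruning idea only, not the suppression of joins.

```python
from collections import defaultdict

# Hypothetical toy corpus: (document id, label path, text) per text node.
NODES = [
    (1, "/book/title", "xml retrieval"),
    (1, "/book/chapter/para", "index structures for xml"),
    (2, "/article/title", "signature files"),
    (2, "/article/section/para", "inverted files and retrieval"),
]

def build_indexes(nodes):
    """Build a path summary, an inverted file, and per-path term sets."""
    path_nodes = defaultdict(set)   # label path -> node ids (structure only)
    term_nodes = defaultdict(set)   # term -> node ids (content only)
    path_terms = defaultdict(set)   # label path -> terms seen under that path
    for node_id, (_doc, path, text) in enumerate(nodes):
        path_nodes[path].add(node_id)
        for term in text.split():
            term_nodes[term].add(node_id)
            path_terms[path].add(term)
    return path_nodes, term_nodes, path_terms

def query_with_join(path_nodes, term_nodes, suffix, term):
    """Match structure and content separately, then join the hit sets."""
    fetched = 0                     # number of per-path node lists read
    structural_hits = set()
    for path, ids in path_nodes.items():
        if path.endswith(suffix):
            structural_hits |= ids
            fetched += 1
    return structural_hits & term_nodes.get(term, set()), fetched

def query_content_aware(path_nodes, term_nodes, path_terms, suffix, term):
    """Skip matching paths whose term set rules out the keyword."""
    fetched = 0
    hits = set()
    for path, ids in path_nodes.items():
        if path.endswith(suffix) and term in path_terms[path]:
            hits |= ids & term_nodes[term]
            fetched += 1
    return hits, fetched

path_nodes, term_nodes, path_terms = build_indexes(NODES)
print(query_with_join(path_nodes, term_nodes, "/para", "retrieval"))
print(query_content_aware(path_nodes, term_nodes, path_terms, "/para", "retrieval"))
```

Both queries return the same node, but the content-aware variant reads one node list instead of two, because the path `/book/chapter/para` never contains the keyword. Exact term sets are used here for simplicity; a compact summary such as a signature would trade precision for space.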
References
1. Amer-Yahia, S., Case, P.: XQuery and XPath Full-Text Use Cases. W3C Working Draft (2003), See http://www.w3.org/TR/xmlquery-full-text-use-cases
2. Baeza-Yates, R., Navarro, G.: Integrating Contents and Structure in Text Retrieval. SIGMOD Record 25(1), 67–79 (1996)
3. Barg, M., Wong, R.K.: A Fast and Versatile Path Index for Querying Semi-Structured Data. In: Proc. 8th Int. Conf. on DBS for Advanced Applications (2003)
4. Buxton, S., Rys, M.: XQuery and XPath Full-Text Requirements. W3C Working Draft (2003), See http://www.w3.org/TR/xquery-full-text-requirements
5. Chen, Y., Aberer, K.: Combining Pat-Trees and Signature Files for Query Eval. in Document DBs. In: Proc. 10th Int. Conf. on DB & Expert Systems Applic. (1999)
6. Cooper, B., Sample, N., Franklin, M.J., Hjaltason, G.R., Shadmon, M.: A Fast Index for Semistructured Data. In: Proc. 27th Int. Conf. on Very Large DB (2001)
7. Cui, H., Wen, J.-R., Chua, T.-S.: Hier. Indexing and Flexible Element Retrieval for Struct. Document. In: Proc. 25th Europ. Conf. on IR Research, pp. 73–87 (2003)
8. Faloutsos, C.: Signature Files: Design and Performance Comparison of Some Signature Extraction Methods. In: Proc. ACM-SIGIR Int. Conf. on Research and Development in IR, pp. 63–82 (1985)
9. Frakes, W.B. (ed.): IR. Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs (1992)
10. Fuhr, N., Großjohann, K.: XIRQL: A Query Language for IR in XML Documents. Research and Development in IR, pp. 172–180 (2001)
11. Goldman, R., Widom, J.: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In: Proc. 23rd Int. Conf. on Very Large DB (1997)
12. Kaushik, R., Krishnamurthy, R., Naughton, J.F., Ramakrishnan, R.: On the Integration of Structure Indexes and Inverted Lists. In: Proc. 20th Int. Conf. on Data Engineering (2004) (to appear)
13. Li, Q., Moon, B.: Indexing and Querying XML Data for Regular Path Expressions. In: Proc. 27th Int. Conf. on Very Large DB, pp. 361–370 (2001)
14. McHugh, J., Abiteboul, S., Goldman, R., Quass, D., Widom, J.: Lore: A DB Management System for Semistructured Data. SIGMOD Rec. 26(3), 54–66 (1997)
15. Meuss, H., Schulz, K., Bry, F.: Visual Querying and Explor. of Large Answers in XML DBs with X2. In: Proc. 19th Int. Conf. on DB Engin., pp. 777–779 (2003)
16. Meuss, H., Strohmaier, C.: Improving Index Structures for Structured Document Retrieval. In: Proc. 21st Ann. Colloquium on IR Research (1999)
17. Oesterle, J., Maier-Meyer, P.: The GNoP (German Noun Phrase) Treebank. In: Proc. 1st Int. Conf. on Language Resources and Evaluation (1998)
18. Schlieder, T., Meuss, H.: Querying and Ranking XML Documents. JASIS Spec. Top. XML/IR 53(6), 489–503 (2002)
19. Shin, D., Jang, H., Jin, H.: BUS: An Effective Indexing and Retrieval Scheme in Structured Documents. In: Proc. 3rd ACM Int. Conf. on Digital Libraries (1998)
20. Weigel, F.: A Survey of Indexing Techniques for Semistructured Documents. Technical report, Dept. of Computer Science, University of Munich, Germany (2002)
21. Weigel, F.: Content-Aware DataGuides for Indexing Semi-Structured Data. Master’s thesis, Dept. of Computer Science, University of Munich, Germany (2003)
22. Wolff, J.E., Flörke, H., Cremers, A.B.: Searching and Browsing Collections of Structural Information. In: Advances in Digital Libraries, pp. 141–150 (2000)
23. XML Benchmark Project. A benchmark suite for evaluating XML repositories, See http://monetdb.cwi.nl/xml
24. Zobel, J., Moffat, A., Ramamohanarao, K.: Inverted Files Versus Signature Files for Text Indexing. ACM Transactions on DB Systems 23(4), 453–490 (1998)
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Weigel, F., Meuss, H., Bry, F., Schulz, K.U. (2004). Content-Aware DataGuides: Interleaving IR and DB Indexing Techniques for Efficient Retrieval of Textual XML Data. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_28
DOI: https://doi.org/10.1007/978-3-540-24752-4_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21382-6
Online ISBN: 978-3-540-24752-4
eBook Packages: Springer Book Archive