Retrieving Relevant Portions from Structured Digital Documents

  • Sujeet Pradhan
  • Katsumi Tanaka
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3180)


Retrieving relevant portions from structured documents consisting of logical components has been a challenging task in both the database and the information retrieval communities, since an answer to a query may be split across multiple components. In this paper, we propose a query mechanism that applies database-style query evaluation to IR-style keyword-based queries in order to retrieve relevant answers from a logically structured document. We first define an appropriate semantics for keyword-based queries and then propose an algebra capable of computing every relevant portion of a document that can be considered an answer to an arbitrary set of keywords. The ordering and structural relationships among the components are preserved in the answer. We also introduce several practically useful filters that save users from having to deal with an overwhelming number of answers.
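As a rough illustration of the idea only, the sketch below treats a document as a tree of components and builds, for each choice of one matching component per keyword, the smallest connected portion of the tree that contains them all. This is not the algebra defined in the paper: the toy document `DOC`, the functions `join`, `fragment` and `answer`, and the `max_size` filter are all hypothetical names and choices made for this example.

```python
from itertools import product

# Hypothetical toy document: node id -> (parent id, text).
# Node ids are assigned in document order, so sorting ids preserves ordering.
DOC = {
    0: (None, "article"),
    1: (0, "section on query evaluation"),
    2: (1, "paragraph about keyword search"),
    3: (1, "paragraph about algebra"),
    4: (0, "section on filters"),
    5: (4, "paragraph about keyword ranking"),
}

def path_to_root(node):
    """Return the node ids from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = DOC[node][0]
    return path

def join(a, b):
    """Smallest connected set of nodes containing both a and b."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    shared = set(up_b)
    # Lowest common ancestor: first node on a's upward path that also lies above b.
    lca = next(n for n in up_a if n in shared)
    return set(up_a[:up_a.index(lca) + 1]) | set(up_b[:up_b.index(lca) + 1])

def fragment(nodes):
    """Smallest connected portion containing all given nodes, in document order."""
    covered = {nodes[0]}
    for n in nodes[1:]:
        covered |= join(nodes[0], n)
    return tuple(sorted(covered))

def answer(keywords, max_size=None):
    """All fragments built from one matching node per keyword.

    `max_size` is an assumed, simplistic filter that drops large fragments.
    """
    if not keywords:
        return []
    matches = [[n for n, (_, text) in DOC.items() if kw in text.split()]
               for kw in keywords]
    fragments = {fragment(combo) for combo in product(*matches)}
    if max_size is not None:
        fragments = {f for f in fragments if len(f) <= max_size}
    return sorted(fragments, key=lambda f: (len(f), f))

print(answer(["keyword", "algebra"]))
print(answer(["keyword", "algebra"], max_size=3))
```

In this toy setting, the two keywords never occur in the same component, so each answer is a fragment spanning several components. The size filter shows how a simple restriction can cut down the number of candidate answers.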


Keywords (machine-generated, not supplied by the authors): Query Term, Query Evaluation, Structure Document, Keyword Query, Query Formulation





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Sujeet Pradhan (1)
  • Katsumi Tanaka (2)
  1. Kurashiki University of Science and the Arts, Kurashiki, Japan
  2. Kyoto University, Kyoto, Japan
