Skip to main content

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 8066))

Abstract

Document retrieval is a special type of pattern matching that is closely related to information retrieval and web searching. In this problem, the data consist of a collection of text documents, and given a query pattern P, we are required to report all the documents (not all the occurrences) in which this pattern occurs. In addition, the notion of relevance is commonly applied to rank all the documents that satisfy the query, and only those documents with the highest relevance are returned. Such a concept of relevance has been central in the effectiveness and usability of present day search engines like Google, Bing, Yahoo, or Ask. When relevance is considered, the query has an additional input parameter k, and the task is to report only the k documents with the highest relevance to P, instead of finding all the documents that contains P. For example, one such relevance function could be the frequency of the query pattern in the document. In the information retrieval literature, this task is best achieved by using inverted indexes. However, if the query consists of an arbitrary string—which can be a partial word, multiword phrase, or more generally any sequence of characters—we cannot take advantages of the word boundaries and we need a different approach.

This leads to one of the active research topics in string matching and text indexing community in recent years, and various aspects of the problem have been studied, such as space-time tradeoffs, practical solutions, multipattern queries, and I/O-efficiency. In this article, we review some of the initial frameworks for designing such indexes and also summarize the developments in this area.

This work is supported in part by Taiwan NSC Grant 99-2221-E-007-123 (W. Hon) and US NSF Grant CCF–1017623 (R. Shah and J. S. Vitter) and CCF–1218904 (R. Shah).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Afshani, P.: On dominance reporting in 3D. In: Halperin, D., Mehlhorn, K. (eds.) ESA 2008. LNCS, vol. 5193, pp. 41–51. Springer, Heidelberg (2008)

    Chapter  Google Scholar 

  2. Afshani, P., Brodal, G.S., Zeh, N.: Ordered and unordered top-k range reporting in large data sets. In: SODA, pp. 390–400 (2011)

    Google Scholar 

  3. Aggarwal, A., Vitter, J.S.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)

    Article  MathSciNet  Google Scholar 

  4. Arge, L., Samoladas, V., Vitter, J.S.: On two-dimensional indexability and optimal range search indexing. In: Proc. 18th Symposium on Principles of Database Systems (PODS), pp. 346–357 (1999)

    Google Scholar 

  5. Belazzougui, D., Navarro, G.: Improved compressed indexes for full-text document retrieval. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 386–397. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  6. Chazelle, B.: Lower bounds for orthogonal range searching: I. the reporting case. J. ACM 37(2), 200–212 (1990)

    Article  MathSciNet  MATH  Google Scholar 

  7. Chien, Y.-F., Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: Geometric burrows-wheeler transform: Compressed text indexing via sparse suffixes and range searching. Algorithmica (2013)

    Google Scholar 

  8. Cohen, H., Porat, E.: Fast set intersection and two-patterns matching. Theor. Comput. Sci. 411(40-42), 3795–3800 (2010)

    Article  MathSciNet  MATH  Google Scholar 

  9. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: STOC, pp. 91–100 (2004)

    Google Scholar 

  10. Culpepper, J.S., Navarro, G., Puglisi, S.J., Turpin, A.: Top-k ranked document search in general text databases. In: de Berg, M., Meyer, U. (eds.) ESA 2010, Part II. LNCS, vol. 6347, pp. 194–205. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  11. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  12. Fischer, J., Gagie, T., Kopelowitz, T., Lewenstein, M., Mäkinen, V., Salmela, L., Välimäki, N.: Forbidden patterns. In: Fernández-Baca, D. (ed.) LATIN 2012. LNCS, vol. 7256, pp. 327–337. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  13. Gagie, T., Karhu, K., Navarro, G., Puglisi, S.J., Sirén, J.: Document listing on repetitive collections. In: Fischer, J., Sanders, P. (eds.) CPM 2013. LNCS, vol. 7922, pp. 107–119. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  14. Gagie, T., Navarro, G., Puglisi, S.J.: Colored range queries and document retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 67–81. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  15. Gagie, T., Navarro, G., Puglisi, S.J.: New algorithms on wavelet trees and applications to information retrieval. Theor. Comput. Sci. 426, 25–41 (2012)

    Article  MathSciNet  MATH  Google Scholar 

  16. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  17. Hon, W.-K., Patil, M., Shah, R., Wu, S.-B.: Efficient index for retrieving top-k most frequent documents. J. Discrete Algorithms 8(4), 402–417 (2010)

    Article  MathSciNet  MATH  Google Scholar 

  18. Hon, W.-K., Shah, R., Thankachan, S.V.: Towards an optimal space-and-query-time index for top-k document retrieval. In: Kärkkäinen, J., Stoye, J. (eds.) CPM 2012. LNCS, vol. 7354, pp. 173–184. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  19. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: String retrieval for multi-pattern queries. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 55–66. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  20. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: Document listing for queries with excluded pattern. In: Kärkkäinen, J., Stoye, J. (eds.) CPM 2012. LNCS, vol. 7354, pp. 185–195. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  21. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: Faster compressed top-k document retrieval. In: DCC (2013)

    Google Scholar 

  22. Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-k string retrieval problems. In: FOCS 2009, pp. 713–722 (2009)

    Google Scholar 

  23. Hon, W.-K., Shah, R., Vitter, J.S.: Compression, indexing, and retrieval for massive string data. In: Amir, A., Parida, L. (eds.) CPM 2010. LNCS, vol. 6129, pp. 260–274. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  24. Culpepper, M.P.J.S., Scholer, F.: Efficient in-memory top-k document retrieval. In: SIGIR (2012)

    Google Scholar 

  25. Karpinski, M., Nekrich, Y.: Top-k color queries for document retrieval. In: SODA, pp. 401–411 (2011)

    Google Scholar 

  26. Konow, R., Navarro, G.: Faster Compact Top-k Document Retrieval. In: DCC (2013)

    Google Scholar 

  27. Matias, Y., Muthukrishnan, S.M., Şahinalp, S.C., Ziv, J.: Augmenting suffix trees, with applications. In: Bilardi, G., Pietracaprina, A., Italiano, G.F., Pucci, G. (eds.) ESA 1998. LNCS, vol. 1461, pp. 67–78. Springer, Heidelberg (1998)

    Google Scholar 

  28. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: SODA, pp. 657–666 (2002)

    Google Scholar 

  29. Navarro, G.: Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences. CoRR, abs/1304.6023 (2013)

    Google Scholar 

  30. Navarro, G., Nekrich, Y.: Top-k document retrieval in optimal time and linear space. In: SODA, pp. 1066–1077 (2012)

    Google Scholar 

  31. Navarro, G., Puglisi, S.J.: Dual-sorted inverted lists. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 309–321. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  32. Navarro, G., Puglisi, S.J., Valenzuela, D.: Practical compressed document retrieval. In: Pardalos, P.M., Rebennack, S. (eds.) SEA 2011. LNCS, vol. 6630, pp. 193–205. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  33. Navarro, G., Thankachan, S.V.: Faster top-k document retrieval in optimal space (submitted)

    Google Scholar 

  34. Navarro, G., Valenzuela, D.: Space-efficient top-k document retrieval. In: Klasing, R. (ed.) SEA 2012. LNCS, vol. 7276, pp. 307–319. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  35. Nekrich, Y., Patil, M., Shah, R., Thankachan, S.V., Vitter, J.S.: Top-k categorical range maxima queries (submitted)

    Google Scholar 

  36. Patil, M., Thankachan, S.V., Shah, R., Hon, W.-K., Vitter, J.S., Chandrasekaran, S.: Inverted indexes for phrases and strings. In: SIGIR, pp. 555–564 (2011)

    Google Scholar 

  37. Sadakane, K.: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1), 12–22 (2007)

    Article  MathSciNet  MATH  Google Scholar 

  38. Shah, R., Sheng, C., Thankachan, S.V., Vitter, J.S.: On optimal top-k string retrieval. CoRR, abs/1207.2632 (2012)

    Google Scholar 

  39. Tsur, D.: Top-k document retrieval in optimal space. Inf. Process. Lett. 113(12), 440–443 (2013)

    Article  MathSciNet  MATH  Google Scholar 

  40. Välimäki, N., Mäkinen, V.: Space-efficient algorithms for document retrieval. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 205–215. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  41. Vitter, J.S.: Compressed data structures with relevance. In: CIKM, pp. 4–5 (2012)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Hon, WK., Patil, M., Shah, R., Thankachan, S.V., Vitter, J.S. (2013). Indexes for Document Retrieval with Relevance. In: Brodnik, A., López-Ortiz, A., Raman, V., Viola, A. (eds) Space-Efficient Data Structures, Streams, and Algorithms. Lecture Notes in Computer Science, vol 8066. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40273-9_22

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-40273-9_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40272-2

  • Online ISBN: 978-3-642-40273-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics