Indexes for Document Retrieval with Relevance

Hon, Wing-Kai; Patil, Manish; Shah, Rahul; Thankachan, Sharma V.; Vitter, Jeffrey Scott

doi:10.1007/978-3-642-40273-9_22

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8066)

Abstract

Document retrieval is a special type of pattern matching that is closely related to information retrieval and web searching. In this problem, the data consist of a collection of text documents, and given a query pattern P, we are required to report all the documents (not all the occurrences) in which this pattern occurs. In addition, the notion of relevance is commonly applied to rank all the documents that satisfy the query, and only those documents with the highest relevance are returned. Such a concept of relevance has been central to the effectiveness and usability of present-day search engines like Google, Bing, Yahoo, or Ask. When relevance is considered, the query has an additional input parameter k, and the task is to report only the k documents with the highest relevance to P, instead of finding all the documents that contain P. For example, one such relevance function could be the frequency of the query pattern in the document. In the information retrieval literature, this task is best achieved by using inverted indexes. However, if the query consists of an arbitrary string (a partial word, a multiword phrase, or more generally any sequence of characters), we cannot take advantage of word boundaries and we need a different approach.
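
To make the problem statement concrete, the following is a minimal Python sketch of the top-k query with frequency as the relevance function. It is a naive scan over every document, not one of the indexes the chapter reviews, and the function names (`count_occurrences`, `top_k_documents`), the choice to count overlapping occurrences, and the tie-breaking rule (smaller document number first) are assumptions made for illustration only.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(docs, pattern, k):
    """Return up to k (frequency, doc_id) pairs, most frequent first.

    Documents that do not contain the pattern are never reported.
    Ties are broken by the smaller doc_id (an illustrative assumption).
    """
    scored = []
    for doc_id, text in enumerate(docs):
        freq = count_occurrences(text, pattern)
        if freq > 0:
            scored.append((freq, doc_id))
    return heapq.nsmallest(k, scored, key=lambda pair: (-pair[0], pair[1]))


docs = ["banana bandana", "an apple", "cab", "ananas and bananas"]
print(top_k_documents(docs, "an", 2))
```

This baseline reads the whole collection for every query, so its cost grows with the total text length regardless of k or of how many documents match. An index for this problem aims to answer the same query in time that depends mainly on the pattern length and k.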

This has become one of the active research topics in the string matching and text indexing community in recent years, and various aspects of the problem have been studied, such as space-time tradeoffs, practical solutions, multipattern queries, and I/O-efficiency. In this article, we review some of the initial frameworks for designing such indexes and also summarize the developments in this area.

This work is supported in part by Taiwan NSC Grant 99-2221-E-007-123 (W. Hon) and by US NSF Grants CCF-1017623 (R. Shah and J. S. Vitter) and CCF-1218904 (R. Shah).

Author information

Authors and Affiliations

National Tsing Hua University, Taiwan
Wing-Kai Hon
Louisiana State University, USA
Manish Patil, Rahul Shah & Sharma V. Thankachan
The University of Kansas, USA
Jeffrey Scott Vitter

Editor information

Editors and Affiliations

Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Andrej Brodnik
Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
Alejandro López-Ortiz
The Institute of Mathematical Sciences, Chennai, India
Venkatesh Raman
Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
Alfredo Viola

About this chapter

Cite this chapter

Hon, W.-K., Patil, M., Shah, R., Thankachan, S.V., Vitter, J.S. (2013). Indexes for Document Retrieval with Relevance. In: Brodnik, A., López-Ortiz, A., Raman, V., Viola, A. (eds) Space-Efficient Data Structures, Streams, and Algorithms. Lecture Notes in Computer Science, vol 8066. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40273-9_22

DOI: https://doi.org/10.1007/978-3-642-40273-9_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40272-2
Online ISBN: 978-3-642-40273-9
eBook Packages: Computer Science, Computer Science (R0)