Space-Efficient Data Structures for Flexible Text Retrieval Systems

Sadakane, Kunihiko

doi:10.1007/3-540-36136-7_2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2518)

Included in the following conference series:

International Symposium on Algorithms and Computation

Abstract

We propose space-efficient data structures for text retrieval systems that combine the merits of theoretical data structures such as suffix trees with those of practical ones such as inverted files. Traditional text retrieval systems use inverted files and support ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of the documents that contain the given keywords; such queries cannot be answered using suffix trees alone. A drawback of these systems is that scores can be computed only for predetermined keywords. We extend the data structure so that scores can be computed efficiently for any pattern while keeping the size of the data structures moderate. The size is comparable to the text size, an improvement over existing methods, which use O(n log n) bits of space for a text collection of length n.
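
To make the query in the abstract concrete, the following is a minimal sketch of tf*idf scoring for an arbitrary pattern rather than a predetermined keyword. It is not the paper's data structure: it uses a naive, uncompressed suffix array over the concatenated documents, and one common tf*idf variant (raw occurrence count times log(N/df)), both of which are assumptions made for illustration. The function names `build_index`, `pattern_range`, and `tfidf_scores` are hypothetical.

```python
import math
from collections import Counter

SEP = "\x00"  # assumed separator that occurs in no document and no pattern


def build_index(docs):
    """Concatenate the documents and build a naive suffix array.

    Illustration only: sorting whole suffixes is slow and the index is
    far larger than the space-efficient structures the paper targets.
    """
    text = SEP.join(docs) + SEP
    doc_of = []  # doc_of[i] = index of the document covering text position i
    for d, doc in enumerate(docs):
        doc_of.extend([d] * (len(doc) + 1))  # +1 covers the separator
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of


def pattern_range(text, sa, pattern):
    """Return [left, right) of suffix-array entries that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


def tfidf_scores(docs, pattern):
    """Map document index -> tf*idf score for documents containing pattern."""
    text, sa, doc_of = build_index(docs)
    left, right = pattern_range(text, sa, pattern)
    # tf: occurrences per document (overlapping occurrences are counted)
    tf = Counter(doc_of[sa[k]] for k in range(left, right))
    df = len(tf)  # number of documents containing the pattern
    if df == 0:
        return {}
    idf = math.log(len(docs) / df)
    return {d: count * idf for d, count in tf.items()}


docs = ["abracadabra", "cadabra", "xyz"]
scores = tfidf_scores(docs, "abra")
for d in sorted(scores, key=scores.get, reverse=True):
    print(d, round(scores[d], 4))
```

Because the pattern is located by binary search over suffixes, any substring can be scored, not just terms chosen at indexing time. The cost of this sketch is the O(n log n)-bit suffix array (and worse, given the naive construction), which is the kind of overhead the paper aims to avoid.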

Author information

Authors and Affiliations

Graduate School of Information Sciences, Tohoku University, Aramaki Aza Aoba09, Aoba-ku, Sendai, 980-8579, Japan
Kunihiko Sadakane

Editor information

Editors and Affiliations

School of Computer Science, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario, Canada, K1S 5B6
Prosenjit Bose & Pat Morin

About this paper

Cite this paper

Sadakane, K. (2002). Space-Efficient Data Structures for Flexible Text Retrieval Systems. In: Bose, P., Morin, P. (eds) Algorithms and Computation. ISAAC 2002. Lecture Notes in Computer Science, vol 2518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36136-7_2

DOI: https://doi.org/10.1007/3-540-36136-7_2
Published: 08 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00142-3
Online ISBN: 978-3-540-36136-7
eBook Packages: Springer Book Archive
