Abstract
We propose space-efficient data structures for text retrieval systems that combine the merits of theoretical data structures such as suffix trees with those of practical ones such as inverted files. Traditional text retrieval systems use inverted files and support ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of the documents that contain given keywords; such queries cannot be answered using suffix trees alone. A drawback of these systems is that the scores can be computed only for predetermined keywords. We extend the data structures so that the scores can be computed efficiently for any pattern while keeping the size of the data structures moderate. The size is comparable to that of the text, which improves on existing methods that use O(n log n) bits of space for a text collection of length n.
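To make the idea of scoring an arbitrary pattern concrete, the following is a minimal sketch and not the paper's data structure. It uses a naive sorted list of all suffixes (quadratic space, for illustration only) in place of a compressed suffix array, and it assumes the common definition idf = log(N / df), which the abstract does not specify. The names `build_suffix_index` and `tf_idf` are hypothetical.

```python
import math
from bisect import bisect_left
from collections import Counter


def build_suffix_index(docs):
    """Sort every suffix of every document, remembering its source document.

    Illustration only: storing whole suffixes takes quadratic space,
    which is exactly what a space-efficient index avoids.
    """
    pairs = sorted(
        (text[i:], doc_id)
        for doc_id, text in enumerate(docs)
        for i in range(len(text))
    )
    keys = [suffix for suffix, _ in pairs]
    owners = [doc_id for _, doc_id in pairs]
    return keys, owners


def tf_idf(pattern, index, num_docs):
    """Return {doc_id: tf * idf} for an arbitrary pattern.

    Assumption: idf = log(num_docs / df), where df is the number of
    documents that contain the pattern at least once.
    """
    keys, owners = index
    # Suffixes that start with the pattern form one contiguous run
    # in sorted order, beginning at the leftmost insertion point.
    lo = bisect_left(keys, pattern)
    hi = lo
    while hi < len(keys) and keys[hi].startswith(pattern):
        hi += 1
    tf = Counter(owners[lo:hi])  # occurrences per document
    if not tf:
        return {}
    idf = math.log(num_docs / len(tf))
    return {doc_id: count * idf for doc_id, count in tf.items()}


docs = ["abracadabra", "cadabra", "banana"]
index = build_suffix_index(docs)
scores = tf_idf("abra", index, len(docs))
```

Because the pattern is located by searching the suffixes rather than by looking up a precomputed keyword, any substring can be scored, including ones that are not whole words. An inverted file, by contrast, can score only the terms chosen at indexing time.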
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sadakane, K. (2002). Space-Efficient Data Structures for Flexible Text Retrieval Systems. In: Bose, P., Morin, P. (eds) Algorithms and Computation. ISAAC 2002. Lecture Notes in Computer Science, vol 2518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36136-7_2
DOI: https://doi.org/10.1007/3-540-36136-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00142-3
Online ISBN: 978-3-540-36136-7
eBook Packages: Springer Book Archive