Space-Efficient Data Structures for Flexible Text Retrieval Systems

  • Conference paper
Algorithms and Computation (ISAAC 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2518)

Abstract

We propose space-efficient data structures for text retrieval systems that combine the merits of theoretical data structures such as suffix trees with those of practical ones such as inverted files. Traditional text retrieval systems use inverted files and support ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents that contain given keywords; such queries cannot be answered using suffix trees alone. A drawback of these systems is that the scores can be computed only for predetermined keywords. We extend the data structure so that the scores can be computed efficiently for any pattern while keeping the size of the data structures moderate. The size is comparable to the text size, which is an improvement over existing methods that use O(n log n) bits of space for a text collection of length n.
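
To make the scoring in the abstract concrete, here is a minimal Python sketch that computes tf*idf scores for an arbitrary pattern, not only for predetermined keywords. It is not the paper's data structure: it scans every document linearly and makes no attempt at space efficiency. The function names `count_occurrences` and `tf_idf_scores` are hypothetical, and the idf variant log(N/df) is an assumption; the paper may define the score differently.

```python
import math


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are also found.
        start = text.find(pattern, start + 1)
    return count


def tf_idf_scores(documents, pattern):
    """Return one tf*idf score per document for an arbitrary pattern.

    Assumed definitions (illustrative, not taken from the paper):
      tf  = number of occurrences of the pattern in the document
      df  = number of documents containing the pattern
      idf = log(N / df), where N is the number of documents
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    tf = [count_occurrences(doc, pattern) for doc in documents]
    df = sum(1 for t in tf if t > 0)
    if df == 0:
        # The pattern occurs nowhere, so every document scores zero.
        return [0.0] * len(documents)
    idf = math.log(len(documents) / df)
    return [t * idf for t in tf]


# Usage: the pattern "ab" is an arbitrary substring, not a dictionary word.
docs = ["abracadabra", "banana", "cabbage"]
for doc, score in zip(docs, tf_idf_scores(docs, "ab")):
    print(doc, score)
```

Each query in this sketch costs time proportional to the total text length. The abstract's claim is that the same kind of score can be obtained efficiently from an index whose size is comparable to the text itself.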

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sadakane, K. (2002). Space-Efficient Data Structures for Flexible Text Retrieval Systems. In: Bose, P., Morin, P. (eds) Algorithms and Computation. ISAAC 2002. Lecture Notes in Computer Science, vol 2518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36136-7_2

  • DOI: https://doi.org/10.1007/3-540-36136-7_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00142-3

  • Online ISBN: 978-3-540-36136-7

  • eBook Packages: Springer Book Archive
