Abstract
The construction of full-text indexes on very large text collections is currently an actively studied problem. The suffix array [16] is one of the most attractive full-text indexing data structures because of its simplicity, its space efficiency, and the powerful and fast search operations it supports. In this paper we analyze, both theoretically and experimentally, the I/O complexity and the working space of six algorithms for constructing large suffix arrays. Additionally, we design a new external-memory algorithm that follows the basic philosophy underlying the algorithm in [13] but in a significantly different manner, thus combining its good practical qualities with efficient worst-case performance. To the best of our knowledge, this is the first study that provides a wide spectrum of possible approaches to the construction of suffix arrays in external memory, and thus it should be helpful to anyone who is interested in building full-text indexes on very large text collections.
Part of this work was done while the second author had a Post-Doctoral fellowship at the Max-Planck-Institut für Informatik, Saarbrücken, Germany. The work has been supported by EU ESPRIT LTR Project N. 20244 (ALCOM-IT).
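For readers unfamiliar with the data structure named in the abstract, the following is a minimal in-memory sketch of what a suffix array is and how it supports substring search. It is not one of the six external-memory algorithms studied in the paper: it simply sorts suffix start positions, which is only practical for small texts. The function names `build_suffix_array` and `find_occurrences` are hypothetical and chosen for illustration.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order."""
    # Naive construction (an assumption for illustration): sort positions
    # by the suffix that starts there. Fine for small in-memory texts only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of `pattern` in `text` via binary search on `sa`."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

Because the suffixes are kept in sorted order, all occurrences of a pattern occupy one contiguous range of the array, which is why two binary searches suffice. The difficulty addressed by the paper arises when the text and the array no longer fit in internal memory.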
References
L. Arge, P. Ferragina, R. Grossi and J. S. Vitter. On sorting Strings in External Memory. In ACM Symp. on Theory of Computing, pp. 540–548, 1997.
R. Ahuja, K. Mehlhorn, J. B. Orlin and R. E. Tarjan. Faster Algorithms for the Shortest Path Problem. Journal of the ACM (2), pp. 213–223, 1990.
A. Andersson and S. Nilsson. Efficient implementation of Suffix Trees. Software Practice and Experience, 2(25): 129–141, 1995.
S. Burkhardt, A. Crauser, P. Ferragina, H. Lenhof, E. Rivals and M. Vingron. q-gram based database searching using a suffix array (QUASAR). International Conference on Computational Molecular Biology, 1999.
D. R. Clark and J. I. Munro. Efficient Suffix Trees on Secondary Storage. In ACM-SIAM Symp. on Discrete Algorithms, pp. 383–391, 1996.
A. Crauser, P. Ferragina and U. Meyer. Practical and Efficient Priority Queues for External Memory. Technical Report MPI, see WEB pages of the authors.
A. Crauser and K. Mehlhorn. LEDA-SM: A Library Prototype for Computing in Secondary Memory. Technical Report MPI, see WEB pages of the authors.
C. Faloutsos. Access Methods for text. ACM Computing Surveys, 17, pp. 49–74, March 1985.
M. Farach. Optimal suffix tree construction with large alphabets. In IEEE Foundations of Computer Science, pp. 137–143, 1997.
M. Farach, P. Ferragina and S. Muthukrishnan. Overcoming the Memory Bottleneck in Suffix Tree Construction. In IEEE Foundations of Computer Science, 1998.
C. L. Feng. PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval. ACM SIGIR, pp. 50–58, 1997.
P. Ferragina and R. Grossi. A Fully-Dynamic Data Structure for External Substring Search. In ACM Symp. Theory of Computing, pp. 693–702, 1995. Also Journal of the ACM (to appear).
G. H. Gonnet, R. A. Baeza-Yates and T. Snider. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates Editors, pp. 66–82, Prentice-Hall, 1992.
D. E. Knuth. The Art of Computer Programming: Sorting and Searching. Vol. 3, Addison-Wesley Publishing Co. 1973.
S. Kurtz. Reducing the Space Requirement of Suffix Trees. Technical Report 98-03, University of Bielefeld, 1998.
U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5, pp. 935–948, 1993.
E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM 23, 2, pp. 262–272, 1976.
G. Navarro, J. P. Kitajima, B. A. Ribeiro-Neto and N. Ziviani. Distributed Generation of Suffix Arrays. In Combinatorial Pattern Matching Conference, pp. 103–115, 1997.
S. Näher and K. Mehlhorn. LEDA: A Platform for Combinatorial and Geometric Computing. Communications of the ACM (38), 1995.
C. Ruemmler and J. Wilkes. An introduction to disk drive modeling. IEEE Computer, 27(3):17–29, 1994.
E. A. Shriver and J. S. Vitter. Algorithms for parallel memory I: two-level memories. Algorithmica, 12(2-3), pp. 110–147, 1994.
D. E. Vengroff and J. S. Vitter. I/O-efficient scientific computing using TPIE. In IEEE Symposium on Parallel and Distributed Computing, 1995.
J. Vitter. External memory algorithms. Invited Tutorial in 17th Ann. ACM Symp. on Principles of Database Systems (PODS ’98), 1998. Also Invited Paper in European Symposium on Algorithms (ESA ’98), 1998.
J. Zobel, A. Moffat and K. Ramamohanarao. Guidelines for presentation and comparison of indexing techniques. SIGMOD Record 25, 3:10–15, 1996.
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Crauser, A., Ferragina, P. (1999). On Constructing Suffix Arrays in External Memory. In: Nešetřil, J. (eds) Algorithms - ESA’ 99. ESA 1999. Lecture Notes in Computer Science, vol 1643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48481-7_20
DOI: https://doi.org/10.1007/3-540-48481-7_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66251-8
Online ISBN: 978-3-540-48481-3
eBook Packages: Springer Book Archive