Constructing Compressed Suffix Arrays with Large Alphabets

Hon, Wing-Kai; Lam, Tak-Wah; Sadakane, Kunihiko; Sung, Wing-Kin

doi:10.1007/978-3-540-24587-2_26

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2906)

Included in the following conference series:

International Symposium on Algorithms and Computation

Abstract

Recent research on compressing suffix arrays has produced two breakthrough indexing data structures: the compressed suffix array (CSA) [7] and the FM-index [5]. Either of them makes it feasible to store a full-text index in main memory, even for a text of a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without first constructing the suffix array) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Σ, this algorithm requires O(|Σ| n log n) time and (2H₀ + 1 + ε)n bits of working space, where H₀ is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |Σ| is small, but it is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters.
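As background for the two indexes named above, here is a minimal Python sketch of the objects they are built on: the suffix array SA, the function Ψ that a CSA stores in compressed form, and the Burrows-Wheeler transform (BWT) that underlies the FM-index. It is illustrative only and is not the construction algorithm of this paper: it sorts suffixes naively and holds the full suffix array in memory, which is exactly what a space-efficient construction avoids. The function names and the `$` terminator are assumptions made for the example.

```python
def suffix_array(t):
    # Naive construction: sort all suffix start positions lexicographically.
    # Assumes t ends with a unique terminator smaller than every other character.
    return sorted(range(len(t)), key=lambda i: t[i:])

def psi_from_sa(sa):
    # Psi[i] is the rank of the suffix that starts one position after SA[i]
    # (wrapping around at the end of the text).
    n = len(sa)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    return [rank[(p + 1) % n] for p in sa]

def bwt_from_psi(psi, first_col):
    # Uses the identity BWT[Psi[j]] = F[j], where F is the sorted text
    # (the first character of each suffix in rank order). One pass, O(n).
    bwt = [None] * len(psi)
    for j, target in enumerate(psi):
        bwt[target] = first_col[j]
    return "".join(bwt)

text = "banana$"
sa = suffix_array(text)
psi = psi_from_sa(sa)
first_col = "".join(sorted(text))
bwt = bwt_from_psi(psi, first_col)
print(sa, psi, bwt)
```

The last function shows why the two indexes are closely related: given Ψ and the sorted characters of the text, the BWT follows from a single scan.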

The main contribution of this paper is a new algorithm that constructs CSA in O(n log n) time using (H₀ + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size, and the space requirement is smaller whenever H₀ > 1, which is likely to be the case. This paper also contributes to the space-efficient construction of FM-index: we show that FM-index can indeed be constructed from CSA directly in O(n) time.
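The space comparison can be checked numerically. The sketch below computes the 0-th order empirical entropy H₀ of a string and evaluates the two working-space bounds quoted in the abstract, (2H₀ + 1 + ε)n for the earlier algorithm [15] and (H₀ + 2 + ε)n for the new one. The function names and the sample strings are assumptions chosen for illustration; the bounds are taken at face value from the abstract, not measured from any implementation.

```python
import math
from collections import Counter

def h0(text):
    # 0-th order empirical entropy in bits per character:
    # sum over characters c of (n_c / n) * log2(n / n_c).
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in Counter(text).values())

def working_space_bits(text, eps):
    # Returns (earlier bound, new bound) in bits for the given text and eps.
    n = len(text)
    h = h0(text)
    earlier = (2 * h + 1 + eps) * n
    new = (h + 2 + eps) * n
    return earlier, new

# Four equally frequent characters give H0 = 2 bits per character.
uniform4 = "ACGT" * 2
print(h0(uniform4), working_space_bits(uniform4, eps=0.5))

# A single repeated character gives H0 = 0, where the earlier bound is smaller.
constant = "aaaa"
print(h0(constant), working_space_bits(constant, eps=0.5))
```

The difference between the two bounds is (H₀ − 1)n bits, so the new bound is the smaller one exactly when H₀ > 1.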

This work was supported in part by the Hong Kong RGC Grant HKU-7024/01E; by the Grant-in-Aid of the Ministry of Education, Science, Sports and Culture of Japan; and by the NUS Academic Research Grant R-252-000-119-112.

References

1. Burrows, M., Wheeler, D.J.: A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California (1994)
2. Clark, D.R., Munro, J.I.: Efficient Suffix Trees on Secondary Storage. In: Proc. ACM-SIAM SODA, pp. 383–391 (1996)
3. Farach, M.: Optimal Suffix Tree Construction with Large Alphabets. In: Proc. IEEE FOCS, pp. 137–143 (1997)
4. Ferragina, P., Manzini, G.: An experimental study of an opportunistic index. In: Proc. ACM-SIAM SODA, pp. 269–278 (2001)
5. Ferragina, P., Manzini, G.: Opportunistic Data Structures with Applications. In: Proc. IEEE FOCS, pp. 390–398 (2000)
6. Graff, D., Chen, K.: Chinese Gigaword (2003), http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T09
7. Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In: Proc. ACM STOC, pp. 397–406 (2000)
8. Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. Manuscript (2001)
9. Grossman, D.A., Frieder, O.: Information Retrieval: Algorithms and Heuristics. Kluwer Academic Publishers, Boston (1998)
10. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York (1997)
11. Hon, W.K., Sadakane, K., Sung, W.K.: Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. In: Proc. IEEE FOCS (2003) (to appear)
12. Hunt, E., Atkinson, M.P., Irving, R.W.: A database index to large biological sequences. In: Proc. VLDB, pp. 410–421 (2000)
13. Ko, P., Aluru, S.: Space Efficient Linear Time Construction of Suffix Arrays. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 200–210. Springer, Heidelberg (2003)
14. Kurtz, S.: Reducing the Space Requirement of Suffix Trees. Software: Practice and Experience 29, 1149–1171 (1999)
15. Lam, T.W., Sadakane, K., Sung, W.K., Yiu, S.M.: A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays. In: Ibarra, O.H., Zhang, L. (eds.) COCOON 2002. LNCS, vol. 2387, pp. 401–410. Springer, Heidelberg (2002)
16. Larsson, J., Sadakane, K.: Faster Suffix Sorting. Technical Report LU-CS-TR:99-214, LUNDFD6/(NFCS-3140)/1-43/(1999), Lund University (1999)
17. Manber, U., Myers, G.: Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing 22(5), 935–948 (1993)
18. Manzini, G.: An Analysis of the Burrows-Wheeler Transform. Journal of the ACM 48(3), 407–430 (2001)
19. McCreight, E.M.: A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM 23(2), 262–272 (1976)
20. Ong, T.H., Chen, H.: Updateable PAT-Tree Approach to Chinese Key Phrase Extraction using Mutual Information: A Linguistic Foundation for Knowledge Management. In: Proceedings of Asian Digital Library Conference (1999)
21. Rice, R.F.: Some practical universal noiseless coding techniques. Technical Report JPL-79-22, Jet Propulsion Laboratory, Pasadena, California (1979)
22. Sadakane, K.: New Text Indexing Functionalities of the Compressed Suffix Arrays. Journal of Algorithms (in press)
23. Seward, J.: The bzip2 and libbzip2 official home page (1996), http://sources.redhat.com/bzip2/
24. Shimozono, S., Arimura, H., Arikawa, S.: Efficient Discovery of Optimal Word Association Patterns in Large Text Databases. New Generation Computing 18, 49–60 (2000)

Author information

Authors and Affiliations

Department of Computer Science and Information Systems, The University of Hong Kong, Hong Kong
Wing-Kai Hon & Tak-Wah Lam
Department of Computer Science and Communication Engineering, Kyushu University, Japan
Kunihiko Sadakane
School of Computing, National University of Singapore, Singapore
Wing-Kin Sung

Editor information

Editors and Affiliations

School of Science and Technology, Kwansei Gakuin University, Sanda, Japan
Toshihide Ibaraki
Department of Architecture and Architectural Engineering, Kyoto University, 615-8540, Nishikyo-ku, Kyoto, Japan
Naoki Katoh
Department of Computer Science and Communication Engineering, Kyushu University, 812-8581, Fukuoka, Japan
Hirotaka Ono

Cite this paper

Hon, WK., Lam, TW., Sadakane, K., Sung, WK. (2003). Constructing Compressed Suffix Arrays with Large Alphabets. In: Ibaraki, T., Katoh, N., Ono, H. (eds) Algorithms and Computation. ISAAC 2003. Lecture Notes in Computer Science, vol 2906. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24587-2_26

DOI: https://doi.org/10.1007/978-3-540-24587-2_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20695-8
Online ISBN: 978-3-540-24587-2
eBook Packages: Springer Book Archive
