Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes

Ohlebusch, Enno; Gog, Simon; Kügel, Adrian

doi:10.1007/978-3-642-16321-0_36

Enno Ohlebusch¹⁸,
Simon Gog¹⁸ &
Adrian Kügel¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 6393))

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

1282 Accesses
26 Citations

Abstract

Exact string matching is a problem that computer programmers face on a regular basis, and full-text indexes like the suffix tree or the suffix array provide fast string search over large texts. In the last decade, research on compressed indexes has flourished because the main problem in large-scale applications is the space consumption of the index. Nowadays, the most successful compressed indexes are able to obtain almost optimal space and search time simultaneously. It is known that a myriad of sequence analysis and comparison problems can be solved efficiently with established data structures like the suffix tree or the suffix array, but algorithms on compressed indexes that solve these problem are still lacking at present. Here, we show that matching statistics and maximal exact matches between two strings S ¹ and S ² can be computed efficiently by matching S ² backwards against a compressed index of S ¹.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Weiner, P.: Linear pattern matching algorithms. Proc. 14th IEEE Annual Symposium on Switching and Automata Theory. 1–11 (1973)
Google Scholar
Apostolico, A.: The myriad virtues of subword trees. In: Combinatorial Algorithms on Words, pp. 85–96. Springer, Heidelberg (1985)
Chapter Google Scholar
Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, New York (1997)
Book MATH Google Scholar
Manber, U., Myers, E.: Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
Article MathSciNet MATH Google Scholar
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. IEEE Symposium on Foundations of Computer Science, pp. 390–398 (2000)
Google Scholar
Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Research Report 124, Digital Systems Research Center (1994)
Google Scholar
Chang, W., Lawler, E.: Sublinear approximate string matching and biological applications. Algorithmica 12(4/5), 327–344 (1994)
Article MathSciNet MATH Google Scholar
Teo, C., Vishwanathan, S.: Fast and space efficient string kernels using suffix arrays. In: Proc. 23rd Conference on Machine Learning, pp. 929–936. ACM Press, New York (2003)
Google Scholar
Rahmann, S.: Fast and sensitive probe selection for DNA chips using jumps in matching statistics. In: Proc. 2nd IEEE Computer Society Bioinformatics Conference, pp. 57–64 (2003)
Google Scholar
Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway, M., Antonescu, C., Salzberg, S.: Versatile and open software for comparing large genomes. Genome Biology 5, R12 (2004)
Google Scholar
Abouelhoda, M., Kurtz, S., Ohlebusch, E.: CoCoNUT: An efficient system for the comparison and analysis of genomes. BMC Bioinformatics 9, 476 (2008)
Article Google Scholar
Puglisi, S., Smyth, W., Turpin, A.: A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39(2), 1–31 (2007)
Article Google Scholar
Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proc. 14th ACM-SIAM Symposium on Discrete Algorithms, pp. 841–850 (2003)
Google Scholar
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), Article 2 (2007)
Google Scholar
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
Chapter Google Scholar
Sadakane, K.: Compressed suffix trees with full functionality. Theory of Computing Systems 41, 589–607 (2007)
Article MathSciNet MATH Google Scholar
Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004)
Article MathSciNet MATH Google Scholar
Ohlebusch, E., Gog, S.: A compressed enhanced suffix array supporting fast string matching. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 51–62. Springer, Heidelberg (2009)
Google Scholar
Fischer, J., Mäkinen, V., Navarro, G.: Faster entropy-bounded compressed suffix trees. Theoretical Computer Science 410(51), 5354–5364 (2009)
Article MathSciNet MATH Google Scholar
Russo, L., Navarro, G., Oliveira, A.: Parallel and distributed compressed indexes. In: Amir, A., Parida, L. (eds.) CPM 2010. LNCS, vol. 6129, pp. 348–360. Springer, Heidelberg (2010)
Google Scholar
Khan, Z., Bloom, J., Kruglyak, L., Singh, M.: A practical algorithm for finding maximal exact matches in large sequence data sets using sparse suffix arrays. Bioinformatics 25, 1609–1616 (2009)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Institute of Theoretical Computer Science, University of Ulm, D-89069, Germany
Enno Ohlebusch, Simon Gog & Adrian Kügel

Authors

Enno Ohlebusch
View author publications
You can also search for this author in PubMed Google Scholar
Simon Gog
View author publications
You can also search for this author in PubMed Google Scholar
Adrian Kügel
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

School of Physics and Mathematics, Edificio "B", Universidad Michoacana, Ciudad Universitaria, 5800, Morelia, Mich., Mexico
Edgar Chavez
Dept. of Computer Science and Enginerring, University of California, 92521, Riverside, CA, USA
Stefano Lonardi

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ohlebusch, E., Gog, S., Kügel, A. (2010). Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes. In: Chavez, E., Lonardi, S. (eds) String Processing and Information Retrieval. SPIRE 2010. Lecture Notes in Computer Science, vol 6393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16321-0_36

Download citation

DOI: https://doi.org/10.1007/978-3-642-16321-0_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16320-3
Online ISBN: 978-3-642-16321-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics