Longest Common Prefix with Mismatches

Manzini, Giovanni

doi:10.1007/978-3-319-23826-5_29

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9309)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

The Longest Common Prefix (LCP) array is a data structure commonly used in combination with the Suffix Array. However, in some settings we are interested in the LCP values per se since they provide useful information on the repetitiveness of the underlying sequence.

Since sequences can contain alterations, which can be either malicious (plagiarism attempts) or pseudo-random (as in sequencing experiments), it makes sense to allow for a small number of errors when using LCP values to measure repetitiveness. In this paper we formalize this notion by considering the longest common prefix in the presence of mismatches. In particular, we propose an algorithm that computes, for each text suffix, the length of its longest prefix that occurs elsewhere in the text with at most one mismatch. For a sequence of length n our algorithm uses \(\Theta(n\log n)\) bits and runs in \(\mathcal{O}(n\,\text{L}_{ave}\log n/\log\log n)\) time, where \(\text{L}_{ave}\) is the average LCP of the input sequence. Although \(\text{L}_{ave}\) is \(\Theta(n)\) in the worst case, recent analyses of real-world data show that it usually grows logarithmically with the input size. We then describe and analyse a second algorithm that uses a greedy strategy to reduce the amount of computation and that can be turned into an even faster algorithm if we allow an additive one-sided error.
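
To make the quantity being computed concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm proposed in the paper: it compares every pair of suffixes directly and takes cubic time in the worst case. The function names are hypothetical, and the sketch assumes that the mismatched position counts towards the prefix length and that the extension stops at the second mismatch or at the end of either suffix.

```python
def lcp_with_mismatches(text: str, i: int, j: int, k: int = 1) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]
    when up to k mismatching positions are tolerated (assumed definition)."""
    n = len(text)
    mismatches = 0
    length = 0
    while i + length < n and j + length < n:
        if text[i + length] != text[j + length]:
            mismatches += 1
            if mismatches > k:
                break  # the (k+1)-th mismatch ends the prefix
        length += 1
    return length


def one_mismatch_lcp_array(text: str) -> list[int]:
    """For each suffix i, the longest prefix that occurs at some other
    position j != i with at most one mismatch (brute-force reference)."""
    n = len(text)
    result = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                result[i] = max(result[i], lcp_with_mismatches(text, i, j, 1))
    return result


if __name__ == "__main__":
    print(one_mismatch_lcp_array("banana"))
```

Under these assumptions, the input "banana" yields [4, 3, 4, 3, 2, 1]: for example, the suffix "banana" matches "nana" over four characters with a single mismatch in the first position. A reference of this kind is mainly useful for checking a faster implementation on small inputs.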

Finally, we consider the related problem of computing the 1-mappability of a sequence. In this problem we are asked to compute, for each length-m substring of the input sequence, the number of other substrings which are at Hamming distance one. For this problem we propose an algorithm that takes \(\mathcal{O}(mn\log n/\log\log n)\) time using \(\Theta(n\log n)\) bits of space.
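
The 1-mappability problem can likewise be illustrated with a small brute-force sketch in Python. Again, this is not the paper's algorithm: it compares all pairs of length-m windows and takes time quadratic in the number of windows, times m. The function names are hypothetical, and the sketch assumes that "at Hamming distance one" means exactly one mismatching position and that "other substrings" means length-m windows starting at a different position.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def one_mappability(text: str, m: int) -> list[int]:
    """For each length-m window of text, count the windows at other
    starting positions that differ from it in exactly one position."""
    if m <= 0:
        raise ValueError("m must be positive")
    windows = [text[i:i + m] for i in range(len(text) - m + 1)]
    counts = [0] * len(windows)
    for i, u in enumerate(windows):
        for j, v in enumerate(windows):
            if i != j and hamming_distance(u, v) == 1:
                counts[i] += 1
    return counts


if __name__ == "__main__":
    print(one_mappability("aabab", 2))
```

For the input "aabab" with m = 2 the windows are "aa", "ab", "ba", "ab", and the sketch returns [3, 1, 1, 1]: "aa" is one substitution away from each of the other three windows, while each of those is one substitution away from "aa" only.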

Author information

Authors and Affiliations

Computer Science Institute, University of Eastern Piedmont, Vercelli, Italy
Giovanni Manzini

Corresponding author

Correspondence to Giovanni Manzini.

Editor information

Editors and Affiliations

King's College London, London, United Kingdom
Costas Iliopoulos
University of Helsinki, Helsinki, Finland
Simon Puglisi
University College London, London, United Kingdom
Emine Yilmaz

About this paper

Cite this paper

Manzini, G. (2015). Longest Common Prefix with Mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_29

DOI: https://doi.org/10.1007/978-3-319-23826-5_29
Published: 05 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23825-8
Online ISBN: 978-3-319-23826-5
eBook Packages: Computer Science, Computer Science (R0)
