Abstract
The Longest Common Prefix (LCP) array is a data structure commonly used in combination with the Suffix Array. However, in some settings we are interested in the LCP values per se since they provide useful information on the repetitiveness of the underlying sequence.
Since sequences can contain alterations, which can be either malicious (plagiarism attempts) or pseudo-random (as in sequencing experiments), when using LCP values to measure repetitiveness it makes sense to allow for a small number of errors. In this paper we formalize this notion by considering the longest common prefix in the presence of mismatches. In particular, we propose an algorithm that computes, for each text suffix, the length of its longest prefix that occurs elsewhere in the text with at most one mismatch. For a sequence of length n our algorithm uses \(\Theta (n\log n)\) bits and runs in \(\mathcal {O}(n \text{L}_{ave}\log n/\log \log n)\) time where \(\text{L}_{ave}\) is the average LCP of the input sequence. Although \(\text{L}_{ave}\) is \(\Theta (n)\) in the worst case, recent analyses of real world data show that it usually grows logarithmically with the input size. We then describe and analyse a second algorithm that uses a greedy strategy to reduce the amount of computation and that can be turned into an even faster algorithm if allow an additive one-sided error.
Finally, we consider the related problem of computing the 1-mappability of a sequence. In this problem we are asked to compute, for each length-m substring of the input sequence, the number of other substrings which are at Hamming distance one. For this problem we propose an algorithm that takes \(\mathcal {O}(m n \log n/\log \log n)\) time using \(\Theta (n \log n)\) bits of space.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Amir, A., Keselman, D., Landau, G., Lewenstein, M., Lewenstein, N., Rodeh, M.: Text indexing and dictionary matching with one error. J. Algorithms 37, 309–325 (2000)
Barbay, J., Claude, F., Navarro, G.: Compact binary relation representations with rich functionality. Information and Computation 232, 19–37 (2013)
Beller, T., Gog, S., Ohlebusch, E., Schnattinger, T.: Computing the longest common prefix array based on the Burrows-Wheeler transform. J. Discrete Algorithms 18, 22–31 (2013)
Bille, P.: Gørtz, I.L.: Substring range reporting. Algorithmica 69, 384–396 (2014)
Bingmann, T., Fischer, J., Osipov, V.: Inducing suffix and LCP arrays in external memory. In: Proc. 15th Meeting on Algorithm Engineering and Experiments (ALENEX 2013), pp. 88–102. SIAM (2013)
Bose, P., He, M., Maheshwari, A., Morin, P.: Succinct Orthogonal Range Search Structures on a Grid with Applications to Text Indexing. In: Dehne, F., Gavrilova, M., Sack, J.-R., Tóth, C.D. (eds.) WADS 2009. LNCS, vol. 5664, pp. 98–109. Springer, Heidelberg (2009)
Crochemore, M., Langiu, A., Rahman, M.S.: Indexing a sequence for mapping reads with a single mismatch. Phil. Trans. R. Soc. A 372 (2014)
Derrien, T., Estell, J., Marco-Sola, S., Knowles, D.G., Raineri, E., Guig, R., Ribeca, P.: Fast computation and applications of genome mappability. PLoS One 7 (2012)
Fischer, J.: Inducing the LCP-Array. In: Dehne, F., Iacono, J., Sack, J.-R. (eds.) WADS 2011. LNCS, vol. 6844, pp. 374–385. Springer, Heidelberg (2011)
Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40, 465–492 (2011)
Gawrychowski, P., Lewenstein, M., Nicholson, P.K.: Weighted Ancestors in Suffix Trees. In: Schulz, A.S., Wagner, D. (eds.) ESA 2014. LNCS, vol. 8737, pp. 455–466. Springer, Heidelberg (2014)
Gog, S., Ohlebusch, E.: Compressed suffix trees: Efficient computation and storage of LCP-values. ACM Journal of Experimental Algorithmics 18 (2013)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences : Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Iliopoulos, C.S., Rahman, M.S.: Indexing factors with gaps. Algorithmica 55, 60–70 (2009)
Kärkkäinen, J., Manzini, G., Puglisi, S.J.: Permuted Longest-Common-Prefix Array. In: Kucherov, G., Ukkonen, E. (eds.) CPM 2009 Lille. LNCS, vol. 5577, pp. 181–192. Springer, Heidelberg (2009)
Kärkkäinen, J., Kempa, D.: LCP Array Construction in External Memory. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 412–423. Springer, Heidelberg (2014)
Khmelev, D.V., Teahan, W.J.: A repetition based measure for verification of text collections and for text categorization. In: Proc. 26th Int. Conference on Research and Development in Information Retrieval (SIGIR), pp. 104–110. ACM (2003)
Lam, T.W., Li, R., Tam, A., Wong, S., Wu, E., Yiu, S.M.: High throughput short read alignment via bi-directional BWT. In: Proc. IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2009), pp. 31–36 (2009)
Léonard, M., Mouchard, L., Salson, M.: On the number of elements to reorder when updating a suffix array. J. Discrete Algorithms 11, 87–99 (2012)
Manber, U., Myers, G.W.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 935–948 (1993)
Marco-Sola, S., Sammeth, M., Guigó, R., Ribeca, P.: The GEM mapper: Fast, accurate and versatile alignment by filtration. Nature Methods 9, 1185–1188 (2012)
Sadakane, K.: Succinct representations of LCP information and improvements in the compressed suffix arrays. In: Proc. 13th Symposium on Discrete Algorithms (SODA 2002), pp. 225–232. ACM/SIAM (2002)
Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41, 589–607 (2007)
Schnattinger, T., Ohlebusch, E., Gog, S.: Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput. 213, 13–22 (2012)
Sirén, J.: Sampled Longest Common Prefix Array. In: Amir, A., Parida, L. (eds.) CPM 2010. LNCS, vol. 6129, pp. 227–237. Springer, Heidelberg (2010)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Manzini, G. (2015). Longest Common Prefix with Mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science(), vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_29
Download citation
DOI: https://doi.org/10.1007/978-3-319-23826-5_29
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23825-8
Online ISBN: 978-3-319-23826-5
eBook Packages: Computer ScienceComputer Science (R0)