Longest Common Prefix with Mismatches

  • Conference paper
String Processing and Information Retrieval (SPIRE 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9309)

Included in the following conference series:

  • International Symposium on String Processing and Information Retrieval

Abstract

The Longest Common Prefix (LCP) array is a data structure commonly used in combination with the Suffix Array. However, in some settings we are interested in the LCP values per se since they provide useful information on the repetitiveness of the underlying sequence.

Since sequences can contain alterations, which can be either malicious (plagiarism attempts) or pseudo-random (as in sequencing experiments), when using LCP values to measure repetitiveness it makes sense to allow for a small number of errors. In this paper we formalize this notion by considering the longest common prefix in the presence of mismatches. In particular, we propose an algorithm that computes, for each text suffix, the length of its longest prefix that occurs elsewhere in the text with at most one mismatch. For a sequence of length n our algorithm uses \(\Theta (n\log n)\) bits and runs in \(\mathcal {O}(n \text{L}_{ave}\log n/\log \log n)\) time, where \(\text{L}_{ave}\) is the average LCP of the input sequence. Although \(\text{L}_{ave}\) is \(\Theta (n)\) in the worst case, recent analyses of real-world data show that it usually grows logarithmically with the input size. We then describe and analyse a second algorithm that uses a greedy strategy to reduce the amount of computation and that can be turned into an even faster algorithm if we allow an additive one-sided error.
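To make the quantity being computed concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm proposed in the paper: it simply compares every pair of suffixes and therefore takes roughly cubic time in the worst case. The function names `lcp_with_mismatches` and `naive_lcp1` are hypothetical, and the sketch assumes that "occurs elsewhere" means "starts at a different text position".

```python
from typing import List


def lcp_with_mismatches(text: str, i: int, j: int, k: int = 1) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]
    when up to k mismatching positions are tolerated."""
    n = len(text)
    length = 0
    mismatches = 0
    while i + length < n and j + length < n:
        if text[i + length] != text[j + length]:
            mismatches += 1
            if mismatches > k:
                break  # the (k+1)-th mismatch ends the prefix
        length += 1
    return length


def naive_lcp1(text: str, k: int = 1) -> List[int]:
    """For each suffix i, the best value over all other start positions j.

    Assumption: 'elsewhere' is read as 'at any position j != i'.
    """
    n = len(text)
    return [
        max(
            (lcp_with_mismatches(text, i, j, k) for j in range(n) if j != i),
            default=0,  # a text of length 1 has no other suffix
        )
        for i in range(n)
    ]


if __name__ == "__main__":
    print(naive_lcp1("banana", k=0))  # exact matching
    print(naive_lcp1("banana", k=1))  # one mismatch allowed
```

Setting `k=0` gives the exact-match values, which makes it easy to see how much allowing a single mismatch lengthens the repeated prefixes on a small input.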

Finally, we consider the related problem of computing the 1-mappability of a sequence. In this problem we are asked to compute, for each length-m substring of the input sequence, the number of other substrings which are at Hamming distance one. For this problem we propose an algorithm that takes \(\mathcal {O}(m n \log n/\log \log n)\) time using \(\Theta (n \log n)\) bits of space.
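The 1-mappability definition above can be illustrated with a small quadratic-time sketch in Python. This is only a direct transcription of the definition, not the \(\mathcal {O}(m n \log n/\log \log n)\) algorithm of the paper. The name `naive_one_mappability` is hypothetical, and the sketch reads "at Hamming distance one" literally as exactly one mismatching position, which is an assumption about the intended definition.

```python
from typing import List


def naive_one_mappability(text: str, m: int) -> List[int]:
    """For each length-m substring of text, count the other length-m
    substrings (other start positions) at Hamming distance exactly 1.

    Assumption: 'distance one' means exactly one mismatch, not 'at most one'.
    """
    n = len(text)
    starts = range(n - m + 1)  # empty when m > n

    def hamming(i: int, j: int) -> int:
        # Number of positions where the two length-m windows differ.
        return sum(text[i + t] != text[j + t] for t in range(m))

    return [
        sum(1 for j in starts if j != i and hamming(i, j) == 1)
        for i in starts
    ]


if __name__ == "__main__":
    # Windows of "aabab" with m = 2 are: aa, ab, ba, ab
    print(naive_one_mappability("aabab", 2))
```

For the input `"aabab"` with `m = 2`, the window `aa` differs in exactly one position from each of the other three windows, while each remaining window has only `aa` at distance one.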

Author information

Corresponding author

Correspondence to Giovanni Manzini.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Manzini, G. (2015). Longest Common Prefix with Mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_29

  • DOI: https://doi.org/10.1007/978-3-319-23826-5_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23825-8

  • Online ISBN: 978-3-319-23826-5

  • eBook Packages: Computer Science, Computer Science (R0)
