pp 1–20 | Cite as

Longest Common Substring with Approximately k Mismatches

  • Tomasz Kociumaka
  • Jakub RadoszewskiEmail author
  • Tatiana Starikovskaya


In the longest common substring problem, we are given two strings of length n and must find a substring of maximal length that occurs in both strings. It is well known that the problem can be solved in linear time, but the solution is not robust and can vary greatly when the input strings are changed even by one character. To circumvent this, Leimeister and Morgenstern introduced the problem of the longest common substring with k mismatches. Lately, this problem has received a lot of attention in the literature. In this paper, we first show a conditional lower bound based on the SETH hypothesis implying that there is little hope to improve existing solutions. We then introduce a new but closely related problem of the longest common substring with approximately k mismatches and use locality-sensitive hashing to show that it admits a solution with strongly subquadratic running time. We also apply these results to obtain a strongly subquadratic-time 2-approximation algorithm for the longest common substring with k mismatches problem and show conditional hardness of improving its approximation ratio.


Randomised algorithms String similarity measures Longest common substring Sketching Locality-sensitive hashing Binary jumbled indexing 



Jakub Radoszewski was supported by the “Algorithms for text processing with errors and uncertainties” project carried out within the HOMING programme of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund. Funding was provided by Foundation for Polish Science (Grant No. Homing/2016-2/16).


  1. 1.
    Abboud, A., Williams, R.R., Yu, H.: More applications of the polynomial method to algorithm design. In: Indyk P. (ed.) 26th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pp. 218–230. SIAM (2015).
  2. 2.
    Agrawal, M., Kayal, N., Saxena, N.: PRIMES is in P. Ann. Math. 160(2), 781–793 (2004). MathSciNetCrossRefzbMATHGoogle Scholar
  3. 3.
    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990). CrossRefGoogle Scholar
  4. 4.
    Andoni, A., Indyk, P.: Efficient algorithms for substring near neighbor problem. In: 17th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, pp. 1203–1212. SIAM (2006).
  5. 5.
    Babenko, M.A., Starikovskaya, T.: Computing longest common substrings via suffix arrays. In: Hirsch, E.A., Razborov, A.A., Semenov, A.L., Slissenko, A. (eds.) Computer Science Symposium in Russia, CSR 2008, LNCS, vol. 5010, pp. 64–75. Springer (2008).
  6. 6.
    Babenko, M.A., Starikovskaya, T.: Computing the longest common substring with one mismatch. Probl. Inf. Transm. 47(1), 28–33 (2011). MathSciNetCrossRefzbMATHGoogle Scholar
  7. 7.
    Bille, P., Gørtz, I.L., Kristensen, J.: Longest common extensions via fingerprinting. In: Dediu, A., Martín-Vide, C. (eds.) Language and Automata Theory and Applications, LATA 2012, LNCS, vol. 7183, pp. 119–130. Springer (2012).
  8. 8.
    Bille, P., Gørtz, I.L., Sach, B., Vildhøj, H.W.: Time-space trade-offs for longest common extensions. J. Discrete Algorithms 25, 42–50 (2014). MathSciNetCrossRefzbMATHGoogle Scholar
  9. 9.
    Chan, T.M., Lewenstein, M.: Clustered integer 3SUM via additive combinatorics. In: Servedio, R.A., Rubinfeld, R. (eds.) 47th Annual ACM Symposium on Theory of Computing, STOC 2015, pp. 31–40. ACM (2015).
  10. 10.
    Charalampopoulos, P., Crochemore, M., Iliopoulos, C.S., Kociumaka, T., Pissis, S.P., Radoszewski, J., Rytter, W., Waleń, T.: Linear-time algorithm for long LCF with \(k\) mismatches. In: Navarro, G., Sankoff, D., Zhu, B. (eds.) Combinatorial Pattern Matching, CPM 2018, LIPIcs, vol. 105, pp. 23:1–23:16. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2018).
  11. 11.
    Cygan, M., Fomin, F.V., Kowalik, Ł., Lokshtanov, D., Marx, D., Pilipczuk, M., Pilipczuk, M., Saurabh, S.: Parameterized Algorithms. Springer (2015).
  12. 12.
    Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011). MathSciNetCrossRefzbMATHGoogle Scholar
  13. 13.
    Fischer, M.J., Paterson, M.S.: String matching and other products. In: Karp, R.M. (ed.) Complexity of Computation, SIAM-AMS Proceedings, vol. 7, pp. 113–125. AMS, Providence, RI (1974)Google Scholar
  14. 14.
    Flouri, T., Giaquinta, E., Kobert, K., Ukkonen, E.: Longest common substrings with \(k\) mismatches. Inf. Process. Lett. 115(6–8), 643–647 (2015). MathSciNetCrossRefzbMATHGoogle Scholar
  15. 15.
    Galil, Z., Giancarlo, R.: Parallel string matching with \(k\) mismatches. Theor. Comput. Sci. 51, 341–348 (1987). MathSciNetCrossRefzbMATHGoogle Scholar
  16. 16.
    Grabowski, S.: A note on the longest common substring with \(k\)-mismatches problem. Inf. Process. Lett. 115(6–8), 640–642 (2015). MathSciNetCrossRefzbMATHGoogle Scholar
  17. 17.
    Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997). CrossRefzbMATHGoogle Scholar
  18. 18.
    Har-Peled, S., Indyk, P., Motwani, R.: Approximate nearest neighbor: towards removing the curse of dimensionality. Theory Comput. 8(1), 321–350 (2012). MathSciNetCrossRefzbMATHGoogle Scholar
  19. 19.
    Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13(2), 338–355 (1984). MathSciNetCrossRefzbMATHGoogle Scholar
  20. 20.
    Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963). MathSciNetCrossRefzbMATHGoogle Scholar
  21. 21.
    Hui, L.C.K.: Color set size problem with application to string matching. In: Apostolico, A.,  Crochemore, M., Galil, Z., Manber, U. (eds.) Combinatorial Pattern Matching, CPM 1992, LNCS, vol. 644, pp. 230–243. Springer (1992).
  22. 22.
    Ilie, L., Navarro, G., Tinta, L.: The longest common extension problem revisited and applications to approximate string searching. J. Discrete Algorithms 8(4), 418–428 (2010). MathSciNetCrossRefzbMATHGoogle Scholar
  23. 23.
    Impagliazzo, R., Paturi, R.: On the complexity of \(k\)-SAT. J. Comput. Syst. Sci. 62(2), 367–375 (2001). MathSciNetCrossRefzbMATHGoogle Scholar
  24. 24.
    Impagliazzo, R., Paturi, R., Zane, F.: Which problems have strongly exponential complexity? J. Comput. Syst. Sci. 63(4), 512–530 (2001). MathSciNetCrossRefzbMATHGoogle Scholar
  25. 25.
    Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987). MathSciNetCrossRefzbMATHGoogle Scholar
  26. 26.
    Kociumaka, T., Starikovskaya, T., Vildhøj, H.W.: Sublinear space algorithms for the longest common substring problem. In: Schulz, A.S., Wagner, D. (eds.) Algorithms, ESA 2014, LNCS, vol. 8737, pp. 605–617. Springer (2014).
  27. 27.
    Kushilevitz, E., Ostrovsky, R., Rabani, Y.: Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput. 30(2), 457–474 (2000). MathSciNetCrossRefzbMATHGoogle Scholar
  28. 28.
    Landau, G.M., Vishkin, U.: Efficient string matching with \(k\) mismatches. Theor. Comput. Sci. 43, 239–249 (1986). MathSciNetCrossRefzbMATHGoogle Scholar
  29. 29.
    Leimeister, C., Morgenstern, B.: kmacs: the \(k\)-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30(14), 2000–2008 (2014). CrossRefGoogle Scholar
  30. 30.
    Porat, B., Porat, E.: Exact and approximate pattern matching in the streaming model. In: 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, pp. 315–323. IEEE Computer Society (2009).
  31. 31.
    Starikovskaya, T.: Longest common substring with approximately \(k\) mismatches. In: Grossi, R., Lewenstein, M. (eds.) Combinatorial Pattern Matching, CPM 2016, LIPIcs, vol. 54, pp. 21:1–21:11. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2016).
  32. 32.
    Starikovskaya, T., Vildhøj, H.W.: Time-space trade-offs for the longest common substring problem. In: Fischer, J., Sanders, P., (eds.) Combinatorial Pattern Matching, CPM 2013, LNCS, vol. 7922, pp. 223–234. Springer (2013).
  33. 33.
    Tao, T., Croot III, E., Helfgott, H.: Deterministic methods to find primes. Math. Comput. 81(278), 1233–1246 (2012). MathSciNetCrossRefzbMATHGoogle Scholar
  34. 34.
    Thankachan, S.V., Aluru, C., Chockalingam, S.P., Aluru, S.: Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In: Raphael, B.J. (ed.) Research in Computational Molecular Biology, RECOMB 2018, LNCS, vol. 10812, pp. 211–224. Springer (2018).
  35. 35.
    Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016). MathSciNetCrossRefGoogle Scholar
  36. 36.
    Weiner, P.: Linear pattern matching algorithms. In: 14th Annual Symposium on Switching and Automata Theory, SWAT 1973, pp. 1–11. IEEE Computer Society, Washington, DC, USA (1973).
  37. 37.
    Williams, R.: A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci. 348(2–3), 357–365 (2005). MathSciNetCrossRefzbMATHGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019
corrected publication 2019

Authors and Affiliations

  1. 1.Institute of InformaticsUniversity of WarsawWarsawPoland
  2. 2.DIENS, École Normale SupérieurePSL UniversityParisFrance

Personalised recommendations