Algorithms for Indexing Highly Similar DNA Sequences

Abstract

The volume of available digital data grows at an extraordinary rate from one day to the next. This is the case for DNA sequences produced by high-throughput Next Generation Sequencing (NGS) technologies. Hence, it is now possible to sequence the genomes of many organisms, and one project (http://www.1000genomes.org) now provides about 2500 individual human genomes, each a sequence of more than three billion characters (A, C, G, T).

Notes

  1. http://www.1000genomes.org.

  2. In this exposition all logarithms are in base 2 unless stated otherwise.

Author information

Corresponding author

Correspondence to Nadia Ben Nsira.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Nsira, N.B., Lecroq, T., Elloumi, M. (2017). Algorithms for Indexing Highly Similar DNA Sequences. In: Elloumi, M. (eds) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_1

  • DOI: https://doi.org/10.1007/978-3-319-59826-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59824-6

  • Online ISBN: 978-3-319-59826-0

  • eBook Packages: Computer Science, Computer Science (R0)
