Skip to main content

Full-Text Indexes for High-Throughput Sequencing

  • Chapter
  • First Online:
Algorithms for Next-Generation Sequencing Data
  • 1806 Accesses

Abstract

Recent advances in High-Throughput Sequencing demand for novel algorithms working on efficient data structures specifically designed for the analysis of large volumes of sequence data. This chapter describes such data structures, called full-text indexes, to represent all substrings (or substrings up to a certain length) contained in a given text (or text collection).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

eBook
USD 16.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 159.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)

    Article  MathSciNet  MATH  Google Scholar 

  2. Adjeroh, D., Bell, T., Mukherjee, A.: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer Science & Business Media, Berlin (2008)

    Book  Google Scholar 

  3. Arlazarov, V., Dinic, E., Kronrod, M., Faradzev, I.: On economical construction of the transitive closure of a directed graph. Dokl. Akad. Nauk 11, 194 (1970)

    Google Scholar 

  4. Bauer, M.J., Cox, A.J., Rosone, G., Sciortino, M.: Lightweight LCP construction for next-generation sequencing datasets. In: Algorithms in Bioinformatics, pp. 326–337. Springer, Berlin (2012)

    Google Scholar 

  5. Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight algorithms for constructing and inverting the bwt of string collections. Theor. Comput. Sci. 483, 134–148 (2013)

    Article  MathSciNet  MATH  Google Scholar 

  6. Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fund. Inform. 56(1,2), 51–70 (2003)

    Google Scholar 

  7. Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.P., Rivals, E., Vingron, M.: q-gram based database searching using suffix arrays. In: Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB-99), pp. 77–83 (1999)

    Google Scholar 

  8. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital SRC Research Report (1994)

    Google Scholar 

  9. Cazaux, B., Lecroq, T., Rivals, E.: From indexing data structures to de Bruijn graphs. In: Combinatorial Pattern Matching, pp. 89–99. Springer, Berlin (2014)

    Google Scholar 

  10. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT, Cambridge, MA (2001)

    MATH  Google Scholar 

  11. Crochemore, M., Grossi, R., Kärkkäinen, J., Landau, G.M.: A constant-space comparison-based algorithm for computing the Burrows–Wheeler transform. In: Combinatorial Pattern Matching, pp. 74–82. Springer, Berlin (2013)

    Google Scholar 

  12. Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinf. 9, 11 (2008)

    Article  Google Scholar 

  13. Emde, A.K., Grunert, M., Weese, D., Reinert, K., Sperling, S.R.: MicroRazerS: rapid alignment of small RNA reads. Bioinformatics 26(1), 123–124 (2010)

    Article  Google Scholar 

  14. Emde, A.K., Schulz, M.H., Weese, D., Sun, R., Vingron, M., Kalscheuer, V.M., Haas, S.A., Reinert, K.: Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using splazers. Bioinformatics 28(5), 619–627 (2012)

    Article  Google Scholar 

  15. Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987–1011 (2000)

    Article  MathSciNet  MATH  Google Scholar 

  16. Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13 (2013)

    Article  MATH  Google Scholar 

  17. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  18. Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. Algorithmica 63(3), 707–730 (2012)

    Article  MathSciNet  MATH  Google Scholar 

  19. Galil, Z., Giancarlo, R.: Data structures and algorithms for approximate string matching. J. Complexity 4(1), 33–72 (1988)

    Article  MathSciNet  MATH  Google Scholar 

  20. Giegerich, R., Kurtz, S.: A comparison of imperative and purely functional suffix tree constructions. Sci. Comput. Program. 25, 187–218 (1995)

    Article  MathSciNet  MATH  Google Scholar 

  21. Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Softw. Pract. Exp. 33(11), 1035–1049 (2003)

    Article  Google Scholar 

  22. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Experimental Algorithms, pp. 326–337. Springer, Berlin (2014)

    Google Scholar 

  23. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  24. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pp. 841–850. Society for Industrial and Applied Mathematics, Philadelphia, PA (2003)

    Google Scholar 

  25. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York (1997)

    Book  MATH  Google Scholar 

  26. Hamming, R.W.: Error detecting and error correcting codes. Syst. Tech. J. 29, 147–160 (1950)

    Article  MathSciNet  Google Scholar 

  27. Hon, W.K., Lam, T.W., Sadakane, K., Sung, W.K., Yiu, S.M.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48(1), 23–36 (2007)

    Article  MathSciNet  MATH  Google Scholar 

  28. Intel: Intel®; 64 and IA-32 Architectures Optimization Reference Manual. Intel Corporation, Santa Clara, CA (2011)

    Google Scholar 

  29. Jacobson, G.: Space-efficient static trees and graphs. In: 30th Annual Symposium on Foundations of Computer Science, 1989, pp. 549–554. IEEE, New York (1989)

    Google Scholar 

  30. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, CPM ’01, pp. 181–192. Springer, Berlin (2001)

    Google Scholar 

  31. Kehr, B., Weese, D., Reinert, K.: Stellar: fast and exact local alignments. BMC Bioinf. 12(Suppl. 9), S15 (2011)

    Article  Google Scholar 

  32. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3), R25 (2009)

    Article  Google Scholar 

  33. Li, H., Durbin, R.: Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)

    Article  Google Scholar 

  34. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics 25(16), 2078–2079 (2009)

    Article  Google Scholar 

  35. Louza, F.A., Telles, G.P., Ciferri, C.D.D.A.: External memory generalized suffix and LCP arrays construction. In: Combinatorial Pattern Matching, pp. 201–210. Springer, Berlin (2013)

    Google Scholar 

  36. Manber, U., Myers, E.: Suffix arrays: a new method for on-line string searches. In: SODA ’90, pp. 319–327. SIAM, Philadelphia (1990)

    Google Scholar 

  37. Manber, U., Myers, E.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

    Article  MathSciNet  MATH  Google Scholar 

  38. Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the Burrows–Wheeler transform. Theor. Comput. Sci. 387(3), 298–312 (2007)

    Article  MathSciNet  MATH  Google Scholar 

  39. Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)

    Article  MathSciNet  MATH  Google Scholar 

  40. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)

    Article  MathSciNet  MATH  Google Scholar 

  41. Morrison, D.R.: Patricia – practical algorithm to retrieve information coded in alphanumeric. J. ACM 15(4), 514–534 (1968)

    Article  Google Scholar 

  42. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2:1–2:61 (2007)

    Google Scholar 

  43. Ohlebusch, E.: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch, Bremen (2013)

    MATH  Google Scholar 

  44. Puglisi, S., Smyth, W., Turpin, A.: A taxonomy of suffix array construction algorithms. In: Holub, J. (ed.) Proceedings of Prague Stringology Conference ’05, Prague, pp. 1–30 (2005)

    Google Scholar 

  45. Rausch, T., Emde, A.K., Weese, D., Döring, A., Notredame, C., Reinert, K.: Segment-based multiple sequence alignment. Bioinformatics 24(16), i187–192 (2008)

    Article  Google Scholar 

  46. Rausch, T., Koren, S., Denisov, G., Weese, D., Emde, A.K., Döring, A., Reinert, K.: A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics 25(9), 1118–1124 (2009)

    Article  Google Scholar 

  47. Schulz, M.H., Weese, D., Rausch, T., Döring, A., Reinert, K., Vingron, M.: Fast and adaptive variable order Markov chain construction. In: Crandall, K., Lagergren, J. (eds.) Algorithms in Bioinformatics. Lecture Notes in Computer Science, vol. 5251, pp. 306–317. Springer, Berlin (2008)

    Chapter  Google Scholar 

  48. Schulz, M.H., Weese, D., Holtgrewe, M., Dimitrova, V., Niu, S., Reinert, K., Richard, H.: Fiona: a parallel and automatic strategy for read error correction. Bioinformatics 30(17), i356–i363 (2014)

    Article  Google Scholar 

  49. Shi, F.: Suffix arrays for multiple strings: a method for on-line multiple string searches. In: Jaffar, J., Yap, R. (eds.) Concurrency and Parallelism, Programming, Networking, and Security. Lecture Notes in Computer Science, vol. 1179, pp. 11–22. Springer, Berlin (1996). DOI 10.1007/BFb0027775. http://dx.doi.org/10.1007/BFb0027775

    Chapter  Google Scholar 

  50. Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549–556 (2012)

    Article  Google Scholar 

  51. Siragusa, E.: Approximate string matching for high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2015)

    Google Scholar 

  52. Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41(7), e78 (2013)

    Article  Google Scholar 

  53. Siragusa, E., Weese, D., Reinert, K.: Scalable string similarity search/join with approximate seeds and multiple backtracking. In: Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 370–374. ACM, New York (2013)

    Google Scholar 

  54. Ukkonen, E.: Approximate string-matching over suffix trees. In: Combinatorial Pattern Matching, pp. 228–242. Springer, Berlin (1993)

    Google Scholar 

  55. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

    Article  MathSciNet  MATH  Google Scholar 

  56. Weese, D.: Indices and applications in high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2013)

    Google Scholar 

  57. Weese, D., Schulz, M.H.: Efficient string mining under constraints via the deferred frequency index. In: Proceedings of the 8th Industrial Conference on Data Mining (ICDM ’08). LNAI, vol. 5077, pp. 374–388. Springer, Berlin (2008)

    Google Scholar 

  58. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Symposium on Switching and Automata Theory, SWAT ’73, pp. 1–11. IEEE Computer Society, Washington (1973)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to David Weese .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Weese, D., Siragusa, E. (2017). Full-Text Indexes for High-Throughput Sequencing. In: Elloumi, M. (eds) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-59826-0_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59824-6

  • Online ISBN: 978-3-319-59826-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics