Full-Text Indexes for High-Throughput Sequencing

Weese, David; Siragusa, Enrico

doi:10.1007/978-3-319-59826-0_2

David Weese² &
Enrico Siragusa²

1806 Accesses

Abstract

Recent advances in High-Throughput Sequencing demand for novel algorithms working on efficient data structures specifically designed for the analysis of large volumes of sequence data. This chapter describes such data structures, called full-text indexes, to represent all substrings (or substrings up to a certain length) contained in a given text (or text collection).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

eBook: USD 16.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Hardcover Book: USD 159.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)
Article MathSciNet MATH Google Scholar
Adjeroh, D., Bell, T., Mukherjee, A.: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer Science & Business Media, Berlin (2008)
Book Google Scholar
Arlazarov, V., Dinic, E., Kronrod, M., Faradzev, I.: On economical construction of the transitive closure of a directed graph. Dokl. Akad. Nauk 11, 194 (1970)
Google Scholar
Bauer, M.J., Cox, A.J., Rosone, G., Sciortino, M.: Lightweight LCP construction for next-generation sequencing datasets. In: Algorithms in Bioinformatics, pp. 326–337. Springer, Berlin (2012)
Google Scholar
Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight algorithms for constructing and inverting the bwt of string collections. Theor. Comput. Sci. 483, 134–148 (2013)
Article MathSciNet MATH Google Scholar
Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fund. Inform. 56(1,2), 51–70 (2003)
Google Scholar
Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.P., Rivals, E., Vingron, M.: q-gram based database searching using suffix arrays. In: Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB-99), pp. 77–83 (1999)
Google Scholar
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital SRC Research Report (1994)
Google Scholar
Cazaux, B., Lecroq, T., Rivals, E.: From indexing data structures to de Bruijn graphs. In: Combinatorial Pattern Matching, pp. 89–99. Springer, Berlin (2014)
Google Scholar
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT, Cambridge, MA (2001)
MATH Google Scholar
Crochemore, M., Grossi, R., Kärkkäinen, J., Landau, G.M.: A constant-space comparison-based algorithm for computing the Burrows–Wheeler transform. In: Combinatorial Pattern Matching, pp. 74–82. Springer, Berlin (2013)
Google Scholar
Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinf. 9, 11 (2008)
Article Google Scholar
Emde, A.K., Grunert, M., Weese, D., Reinert, K., Sperling, S.R.: MicroRazerS: rapid alignment of small RNA reads. Bioinformatics 26(1), 123–124 (2010)
Article Google Scholar
Emde, A.K., Schulz, M.H., Weese, D., Sun, R., Vingron, M., Kalscheuer, V.M., Haas, S.A., Reinert, K.: Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using splazers. Bioinformatics 28(5), 619–627 (2012)
Article Google Scholar
Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987–1011 (2000)
Article MathSciNet MATH Google Scholar
Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13 (2013)
Article MATH Google Scholar
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Article MathSciNet MATH Google Scholar
Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. Algorithmica 63(3), 707–730 (2012)
Article MathSciNet MATH Google Scholar
Galil, Z., Giancarlo, R.: Data structures and algorithms for approximate string matching. J. Complexity 4(1), 33–72 (1988)
Article MathSciNet MATH Google Scholar
Giegerich, R., Kurtz, S.: A comparison of imperative and purely functional suffix tree constructions. Sci. Comput. Program. 25, 187–218 (1995)
Article MathSciNet MATH Google Scholar
Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Softw. Pract. Exp. 33(11), 1035–1049 (2003)
Article Google Scholar
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Experimental Algorithms, pp. 326–337. Springer, Berlin (2014)
Google Scholar
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)
Article MathSciNet MATH Google Scholar
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pp. 841–850. Society for Industrial and Applied Mathematics, Philadelphia, PA (2003)
Google Scholar
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York (1997)
Book MATH Google Scholar
Hamming, R.W.: Error detecting and error correcting codes. Syst. Tech. J. 29, 147–160 (1950)
Article MathSciNet Google Scholar
Hon, W.K., Lam, T.W., Sadakane, K., Sung, W.K., Yiu, S.M.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48(1), 23–36 (2007)
Article MathSciNet MATH Google Scholar
Intel: Intel^®; 64 and IA-32 Architectures Optimization Reference Manual. Intel Corporation, Santa Clara, CA (2011)
Google Scholar
Jacobson, G.: Space-efficient static trees and graphs. In: 30th Annual Symposium on Foundations of Computer Science, 1989, pp. 549–554. IEEE, New York (1989)
Google Scholar
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, CPM ’01, pp. 181–192. Springer, Berlin (2001)
Google Scholar
Kehr, B., Weese, D., Reinert, K.: Stellar: fast and exact local alignments. BMC Bioinf. 12(Suppl. 9), S15 (2011)
Article Google Scholar
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3), R25 (2009)
Article Google Scholar
Li, H., Durbin, R.: Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)
Article Google Scholar
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics 25(16), 2078–2079 (2009)
Article Google Scholar
Louza, F.A., Telles, G.P., Ciferri, C.D.D.A.: External memory generalized suffix and LCP arrays construction. In: Combinatorial Pattern Matching, pp. 201–210. Springer, Berlin (2013)
Google Scholar
Manber, U., Myers, E.: Suffix arrays: a new method for on-line string searches. In: SODA ’90, pp. 319–327. SIAM, Philadelphia (1990)
Google Scholar
Manber, U., Myers, E.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Article MathSciNet MATH Google Scholar
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the Burrows–Wheeler transform. Theor. Comput. Sci. 387(3), 298–312 (2007)
Article MathSciNet MATH Google Scholar
Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)
Article MathSciNet MATH Google Scholar
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Article MathSciNet MATH Google Scholar
Morrison, D.R.: Patricia – practical algorithm to retrieve information coded in alphanumeric. J. ACM 15(4), 514–534 (1968)
Article Google Scholar
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2:1–2:61 (2007)
Google Scholar
Ohlebusch, E.: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch, Bremen (2013)
MATH Google Scholar
Puglisi, S., Smyth, W., Turpin, A.: A taxonomy of suffix array construction algorithms. In: Holub, J. (ed.) Proceedings of Prague Stringology Conference ’05, Prague, pp. 1–30 (2005)
Google Scholar
Rausch, T., Emde, A.K., Weese, D., Döring, A., Notredame, C., Reinert, K.: Segment-based multiple sequence alignment. Bioinformatics 24(16), i187–192 (2008)
Article Google Scholar
Rausch, T., Koren, S., Denisov, G., Weese, D., Emde, A.K., Döring, A., Reinert, K.: A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics 25(9), 1118–1124 (2009)
Article Google Scholar
Schulz, M.H., Weese, D., Rausch, T., Döring, A., Reinert, K., Vingron, M.: Fast and adaptive variable order Markov chain construction. In: Crandall, K., Lagergren, J. (eds.) Algorithms in Bioinformatics. Lecture Notes in Computer Science, vol. 5251, pp. 306–317. Springer, Berlin (2008)
Chapter Google Scholar
Schulz, M.H., Weese, D., Holtgrewe, M., Dimitrova, V., Niu, S., Reinert, K., Richard, H.: Fiona: a parallel and automatic strategy for read error correction. Bioinformatics 30(17), i356–i363 (2014)
Article Google Scholar
Shi, F.: Suffix arrays for multiple strings: a method for on-line multiple string searches. In: Jaffar, J., Yap, R. (eds.) Concurrency and Parallelism, Programming, Networking, and Security. Lecture Notes in Computer Science, vol. 1179, pp. 11–22. Springer, Berlin (1996). DOI 10.1007/BFb0027775. http://dx.doi.org/10.1007/BFb0027775
Chapter Google Scholar
Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549–556 (2012)
Article Google Scholar
Siragusa, E.: Approximate string matching for high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2015)
Google Scholar
Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41(7), e78 (2013)
Article Google Scholar
Siragusa, E., Weese, D., Reinert, K.: Scalable string similarity search/join with approximate seeds and multiple backtracking. In: Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 370–374. ACM, New York (2013)
Google Scholar
Ukkonen, E.: Approximate string-matching over suffix trees. In: Combinatorial Pattern Matching, pp. 228–242. Springer, Berlin (1993)
Google Scholar
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Article MathSciNet MATH Google Scholar
Weese, D.: Indices and applications in high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2013)
Google Scholar
Weese, D., Schulz, M.H.: Efficient string mining under constraints via the deferred frequency index. In: Proceedings of the 8th Industrial Conference on Data Mining (ICDM ’08). LNAI, vol. 5077, pp. 374–388. Springer, Berlin (2008)
Google Scholar
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Symposium on Switching and Automata Theory, SWAT ’73, pp. 1–11. IEEE Computer Society, Washington (1973)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
David Weese & Enrico Siragusa

Authors

David Weese
View author publications
You can also search for this author in PubMed Google Scholar
Enrico Siragusa
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to David Weese .

Editor information

Editors and Affiliations

LaTICE, Tunis, Tunisia
Mourad Elloumi

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Weese, D., Siragusa, E. (2017). Full-Text Indexes for High-Throughput Sequencing. In: Elloumi, M. (eds) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_2

Download citation

DOI: https://doi.org/10.1007/978-3-319-59826-0_2
Published: 19 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59824-6
Online ISBN: 978-3-319-59826-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics