Compressed Indexes for Repetitive Textual Datasets

Gagie, Travis; Navarro, Gonzalo

doi:10.1007/978-3-319-63962-8_53-1

Definition

Given a text or collection of texts containing long repeated substrings, we are asked to build an index that takes space bounded in terms of the size of a well-compressed encoding of the text or texts and that, given an arbitrary pattern, can quickly report the occurrences of that pattern in the dataset.
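The query interface in this definition can be illustrated with a minimal sketch. It assumes a plain, uncompressed suffix array, which is not one of the compressed indexes this entry concerns and whose space is not bounded by the compressed size of the text. It only shows what "report the occurrences of that pattern" means operationally. The names `build_suffix_array`, `locate`, and `_boundary` are hypothetical and chosen for illustration.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted lexicographically.

    Naive construction for clarity only; it is far too slow and too large
    for the massive datasets the definition has in mind.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _boundary(text, sa, pattern, after_equal):
    """Binary search over the suffix array on the first len(pattern) characters.

    Returns the first index whose suffix prefix is >= pattern
    (or > pattern when after_equal is True).
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (after_equal and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def locate(text, sa, pattern):
    """Report every position where pattern occurs in text, in increasing order."""
    # Suffixes that start with the pattern form one contiguous range of sa.
    first = _boundary(text, sa, pattern, after_equal=False)
    last = _boundary(text, sa, pattern, after_equal=True)
    return sorted(sa[first:last])


# A small repetitive "collection": three copies of the same string.
text = "GATTACA" * 3
sa = build_suffix_array(text)
print(locate(text, sa, "TTA"))
```

Counting occurrences is the size of the same range, `last - first`, without listing positions. The indexes surveyed in this entry aim to answer the same queries while storing far less than one integer per text position.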

Overview

Humanity now stores as much data in a year as we did in our whole history until the turn of the millennium. Most applications that use this data need to query it efficiently, and one of the most important kinds of queries is pattern matching in texts. It is usually impractical to scan massive textual datasets every time we want to count or find the occurrences of a pattern, so we must index them. Until fairly recently, indexing a text often took much more memory than simply storing the dataset. In 2000, however, Ferragina and Manzini (2000, 2005) showed how to simultaneously compress and index a text, with the index itself supporting access to the text and thus...

References

Abeliuk A, Cánovas R, Navarro G (2013) Practical compressed suffix trees. Algorithms 6(2):319–351
Belazzougui D, Cunial F (2017) Representing the suffix tree with the CDAWG. In: Proceedings of the 28th symposium on combinatorial pattern matching (CPM), pp 7:1–7:13
Belazzougui D, Cunial F, Gagie T, Prezza N, Raffinot M (2015) Composite repetition-aware data structures. In: Proceedings of the 26th symposium on combinatorial pattern matching (CPM), pp 26–39
Belazzougui D, Cunial F, Gagie T, Prezza N, Raffinot M (2017) Flexible indexing of repetitive collections. In: Proceedings of the 13th conference on computability in Europe (CiE), pp 162–174
Bille P, Ettienne MB, Gørtz IL, Vildhøj HW (2017) Time-space trade-offs for Lempel-Ziv compressed indexing. In: Proceedings of the 28th symposium on combinatorial pattern matching (CPM), pp 16:1–16:17
Blumer A, Blumer J, Haussler D, McConnell RM, Ehrenfeucht A (1987) Complete inverted files for efficient text retrieval and analysis. J ACM 34(3):578–595
Bowe A, Onodera T, Sadakane K, Shibuya T (2012) Succinct de Bruijn graphs. In: Proceedings of the 12th workshop on algorithms in bioinformatics (WABI), pp 225–235
Claude F, Navarro G (2011) Self-indexed grammar-based compression. Fundamenta Informaticae 111(3):313–337
Claude F, Navarro G (2012) Improved grammar-based compressed indexes. In: Proceedings of the 19th symposium on string processing and information retrieval (SPIRE), pp 180–192
Claude F, Fariña A, Martínez-Prieto MA, Navarro G (2016) Universal indexes for highly repetitive document collections. Inf Syst 61:1–23
Danek A, Deorowicz S, Grabowski S (2014) Indexes of large genome collections on a PC. PLoS One 9(10):e109384
Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G (2015) Improved genome inference in the MHC using a population reference graph. Nat Genet 47(6):682–688
Do HH, Jansson J, Sadakane K, Sung W (2014) Fast relative Lempel-Ziv self-index for similar sequences. Theor Comput Sci 532:14–30
Eggertsson HP et al (2017) Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet 49(11):1654–1660
Ferrada H, Gagie T, Hirvola T, Puglisi SJ (2014) Hybrid indexes for repetitive datasets. Phil Trans R Soc A 372(2016):20130137
Ferrada H, Kempa D, Puglisi SJ (2018) Hybrid indexing revisited. In: Proceedings of the 20th workshop on algorithm engineering and experiments (ALENEX), pp 1–8
Ferragina P, Manzini G (2000) Opportunistic data structures with applications. In: Proceedings of the 41st symposium on foundations of computer science (FOCS), pp 390–398
Ferragina P, Manzini G (2005) Indexing compressed text. J ACM 52(4):552–581
Ferragina P, Luccio F, Manzini G, Muthukrishnan S (2009) Compressing and indexing labeled trees, with applications. J ACM 57(1):4:1–4:33
Gagie T, Puglisi SJ (2015) Searching and indexing genomic databases via kernelization. Front Bioeng Biotechnol 3:12
Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ (2012) A faster grammar-based self-index. In: Proceedings of the 6th conference on language and automata theory and applications (LATA), pp 240–251
Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ (2014) LZ77-based self-indexing with faster pattern matching. In: Proceedings of the 11th Latin American symposium on theoretical informatics (LATIN), pp 731–742
Gagie T, Manzini G, Sirén J (2017a) Wheeler graphs: a framework for BWT-based data structures. Theor Comput Sci 698:67–78
Gagie T, Navarro G, Prezza N (2017b) Optimal-time text indexing in BWT-runs bounded space. Technical report 1705.10382, arXiv.org
Gagie T, Navarro G, Prezza N (2018) Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the 29th symposium on discrete algorithms (SODA), pp 1459–1477
Grossi R, Vitter JS (2000) Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In: Proceedings of the 32nd symposium on theory of computing (STOC), pp 397–406
Grossi R, Vitter JS (2005) Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J Comput 35(2):378–407
Kärkkäinen J, Ukkonen E (1996) Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American workshop on string processing (WSP), pp 141–155
Kempa D, Prezza N (2017) At the roots of dictionary compression: string attractors. In: Proceedings of the 50th symposium on theory of computing (STOC), 2018. CoRR abs/1710.10964
Kreft S, Navarro G (2013) On compressing and indexing repetitive sequences. Theor Comput Sci 483:115–133
Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760
Maciuca S, del Ojo Elias C, McVean G, Iqbal Z (2016) A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th workshop on algorithms in bioinformatics (WABI), pp 222–233
Mäkinen V, Navarro G, Sirén J, Välimäki N (2010) Storage and retrieval of highly repetitive sequence collections. J Comput Biol 17(3):281–308
Mäkinen V, Belazzougui D, Cunial F, Tomescu AI (2015) Genome-scale algorithm design: biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press, Cambridge
Maruyama S, Nakahara M, Kishiue N, Sakamoto H (2013) ESP-index: a compressed index based on edit-sensitive parsing. J Discrete Algorithms 18:100–112
Na JC, Park H, Crochemore M, Holub J, Iliopoulos CS, Mouchard L, Park K (2013a) Suffix tree of alignment: an efficient index for similar data. In: Proceedings of the 24th international workshop on combinatorial algorithms (IWOCA), pp 337–348
Na JC, Park H, Lee S, Hong M, Lecroq T, Mouchard L, Park K (2013b) Suffix array of alignment: a practical index for similar data. In: Proceedings of the 20th symposium on string processing and information retrieval (SPIRE), pp 243–254
Na JC, Kim H, Park H, Lecroq T, Léonard M, Mouchard L, Park K (2016) FM-index of alignment: a compressed index for similar strings. Theor Comput Sci 638:159–170
Na JC, Kim H, Min S, Park H, Lecroq T, Léonard M, Mouchard L, Park K (2018) FM-index of alignment with gaps. Theor Comput Sci. https://doi.org/10.1016/j.tcs.2017.02.020
Navarro G (2017) A self-index on block trees. In: Proceedings of the 17th symposium on string processing and information retrieval (SPIRE), pp 278–289
Navarro G, Ordóñez A (2016) Faster compressed suffix trees for repetitive text collections. J Exp Algorithmics 21(1):article 1.8
Navarro G, Raffinot M (2002) Flexible pattern matching in strings – practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge, UK
Nishimoto T, Tomohiro I, Inenaga S, Bannai H, Takeda M (2016) Dynamic index and LZ factorization in compressed space. In: Proceedings of the Prague stringology conference (PSC), pp 158–170
Novak AM, Garrison E, Paten B (2017a) A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms Mol Biol 12(1):18:1–18:12
Novak AM et al (2017b) Genome graphs. Technical report 101378, bioRxiv
Ohlebusch E (2013) Bioinformatics algorithms: sequence analysis, genome rearrangements, and phylogenetic reconstruction. Oldenbusch Verlag, Bremen, Germany
Paten B, Novak AM, Eizenga JM, Garrison E (2017) Genome graphs and the evolution of genome inference. Genome Res 27(5):665–676
Procházka P, Holub J (2014) Compressing similar biological sequences using FM-index. In: Proceedings of the data compression conference (DCC), pp 312–321
Rahn R, Weese D, Reinert K (2014) Journaled string tree – a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics 30(24):3499–3505
Sadakane K (2000) Compressed text databases with efficient query algorithms based on the compressed suffix array. In: Proceedings of the 11th international symposium on algorithms and computation (ISAAC), pp 410–421
Sadakane K (2003) New text indexing functionalities of the compressed suffix arrays. J Algorithms 48(2):294–313
Schneeberger K, Hagmann J, Ossowski S, Warthmann N, Gesing S, Kohlbacher O, Weigel D (2009) Simultaneous alignment of short reads against multiple genomes. Genome Biol 10(9):R98
Sirén J (2017) Indexing variation graphs. In: Proceedings of the 19th workshop on algorithm engineering and experiments (ALENEX), pp 13–27
Sirén J, Välimäki N, Mäkinen V (2011) Indexing finite language representation of population genotypes. In: Proceedings of the 11th workshop on algorithms in bioinformatics (WABI), pp 270–281
Sirén J, Välimäki N, Mäkinen V (2014) Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans Comput Biol Bioinform 11(2):375–388
Takabatake Y, Tabei Y, Sakamoto H (2014) Improved ESP-index: a practical self-index for highly repetitive texts. In: Proceedings of the 13th symposium on experimental algorithms (SEA), pp 338–350
Takabatake Y, Nakashima K, Kuboyama T, Tabei Y, Sakamoto H (2016) siEDM: an efficient string index and search algorithm for edit distance with moves. Algorithms 9(2):26
Takagi T, Goto K, Fujishige Y, Inenaga S, Arimura H (2017) Linear-size CDAWG: new repetition-aware indexing and grammar compression. In: Proceedings of the 24th symposium on string processing and information retrieval (SPIRE), pp 304–316
Valenzuela D (2016) CHICO: a compressed hybrid index for repetitive collections. In: Proceedings of the 15th symposium on experimental algorithms (SEA), pp 326–338
Valenzuela D, Mäkinen V (2017) CHIC: a short read aligner for pan-genomic references. Technical report 178129, bioRxiv.org
Wandelt S, Leser U (2015) MRCSI: compressing and searching string collections with multiple references. Proc VLDB Endowment 8(5):461–472
Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. Proc VLDB Endowment 6(13):1534–1545
Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343

Author information

Authors and Affiliations

EIT, Diego Portales University, Santiago, Chile
Travis Gagie
Department of Computer Science, University of Chile, Santiago, Chile
Gonzalo Navarro

Corresponding authors

Correspondence to Travis Gagie or Gonzalo Navarro.

Editor information

Editors and Affiliations

School of Comp. Sci. and Engineering, University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr
Sch of Info Techno, Building J12, University of Sydney, Sydney, Australia
Albert Zomaya

Section Editor information

Department of Computer Science, University of Pisa, Largo B. Pontecorvo 3, 56127, Pisa, Italy
Paolo Ferragina

About this entry

Cite this entry

Gagie, T., Navarro, G. (2018). Compressed Indexes for Repetitive Textual Datasets. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_53-1

DOI: https://doi.org/10.1007/978-3-319-63962-8_53-1
Published: 10 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering

Chapter history

Latest
Compressed Indexes for Repetitive Textual Datasets

Published:

10 February 2018

DOI: https://doi.org/10.1007/978-3-319-63962-8_53-1
Original
Compressed Indexes for Repetitive Textual Datasets

Published:

24 February 2012

DOI: https://doi.org/10.1007/978-3-319-63962-8_53-2