Compressed Indexes for Repetitive Textual Datasets

  • Living reference work entry
Encyclopedia of Big Data Technologies

Definition

Given a text or collection of texts containing long repeated substrings, we are asked to build an index that takes space bounded in terms of the size of a well-compressed encoding of the text or texts and that, given an arbitrary pattern, can quickly report the occurrences of that pattern in the dataset.
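As a rough illustration of the two ingredients in this definition, the sketch below reports pattern occurrences by plain scanning and counts the phrases of a greedy LZ77-style parse (Ziv and Lempel 1977) as one possible proxy for the size of a well-compressed encoding. The function names `occurrences` and `lz77_phrase_count` are hypothetical, the choice of LZ77 as the compression measure is an assumption, and the quadratic-time parsing is for illustration only, not how a real index would be built.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Report every starting position of pattern in text by scanning (no index)."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)  # allow overlapping matches
    return hits


def lz77_phrase_count(text: str) -> int:
    """Count phrases of a greedy LZ77-style parse (illustrative, quadratic time).

    Each phrase is the longest prefix of the remaining text that also occurs
    starting at an earlier position (overlaps allowed), plus one more character.
    """
    n = len(text)
    i = 0
    phrases = 0
    while i < n:
        length = 0
        # An earlier occurrence of text[i:i+length+1] must start before i,
        # so it must lie entirely inside text[0:i+length].
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += length + 1  # copied part plus one explicit character
        phrases += 1
    return phrases


repetitive = "GATTACA" * 50   # 350 characters with long repeated substrings
print(len(repetitive), lz77_phrase_count(repetitive))
print(occurrences("abracadabra", "abra"))
```

On the repetitive string the phrase count stays small even though the text is hundreds of characters long, which is the kind of gap between text length and compressed size that the indexes discussed in this entry aim to exploit.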

Overview

Humanity now stores as much data in a year as we did in our whole history until the turn of the millennium. Most applications that use this data need to query it efficiently, and one of the most important kinds of queries is pattern matching in texts. It is usually impractical to scan massive textual datasets every time we want to count or find the occurrences of a pattern, so we must index them. Until fairly recently, indexing a text often took much more memory than simply storing the dataset. In 2000, however, Ferragina and Manzini (2000, 2005) showed how to simultaneously compress and index a text, with the index itself supporting access to the text and thus...
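The index of Ferragina and Manzini is built on the Burrows-Wheeler transform (BWT) of the text and answers counting queries by backward search. The following is a minimal, uncompressed sketch of that idea, assuming a sentinel character `"\0"` that does not occur in the text; the names `build_index`, `count` and `locate` are hypothetical. A real compressed index would replace the plain BWT string with compressed rank structures and keep only a sample of the suffix array, neither of which is shown here.

```python
def build_index(text: str):
    """Build a suffix array, the BWT, and the C table for text (naive construction)."""
    text += "\0"  # sentinel, assumed absent from text and smaller than every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total  # number of characters in text strictly smaller than ch
        total += counts[ch]
    return sa, bwt, c_table


def backward_search(bwt: str, c_table: dict, pattern: str):
    """Return the half-open suffix array interval [lo, hi) of suffixes prefixed by pattern."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0, 0
        # bwt.count(ch, 0, k) plays the role of rank(ch, k); a real index answers it quickly
        lo = c_table[ch] + bwt.count(ch, 0, lo)
        hi = c_table[ch] + bwt.count(ch, 0, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(bwt: str, c_table: dict, pattern: str) -> int:
    lo, hi = backward_search(bwt, c_table, pattern)
    return hi - lo


def locate(sa: list, bwt: str, c_table: dict, pattern: str) -> list:
    lo, hi = backward_search(bwt, c_table, pattern)
    return sorted(sa[lo:hi])


sa, bwt, c_table = build_index("abracadabra")
print(count(bwt, c_table, "abra"), locate(sa, bwt, c_table, "abra"))
```

Counting touches only the BWT and the small C table, one step per pattern character, which is why the pattern length rather than the text length governs the number of steps once rank queries are fast.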

References

  • Abeliuk A, Cánovas R, Navarro G (2013) Practical compressed suffix trees. Algorithms 6(2):319–351

  • Belazzougui D, Cunial F (2017) Representing the suffix tree with the CDAWG. In: Proceedings of the 28th symposium on combinatorial pattern matching (CPM), pp 7:1–7:13

  • Belazzougui D, Cunial F, Gagie T, Prezza N, Raffinot M (2015) Composite repetition-aware data structures. In: Proceedings of the 26th symposium on combinatorial pattern matching (CPM), pp 26–39

  • Belazzougui D, Cunial F, Gagie T, Prezza N, Raffinot M (2017) Flexible indexing of repetitive collections. In: Proceedings of the 13th conference on computability in Europe (CiE), pp 162–174

  • Bille P, Ettienne MB, Gørtz IL, Vildhøj HW (2017) Time-space trade-offs for Lempel-Ziv compressed indexing. In: Proceedings of the 28th symposium on combinatorial pattern matching (CPM), pp 16:1–16:17

  • Blumer A, Blumer J, Haussler D, McConnell RM, Ehrenfeucht A (1987) Complete inverted files for efficient text retrieval and analysis. J ACM 34(3):578–595

  • Bowe A, Onodera T, Sadakane K, Shibuya T (2012) Succinct de Bruijn graphs. In: Proceedings of the 12th workshop on algorithms in bioinformatics (WABI), pp 225–235

  • Claude F, Navarro G (2011) Self-indexed grammar-based compression. Fundamenta Informaticae 111(3):313–337

  • Claude F, Navarro G (2012) Improved grammar-based compressed indexes. In: Proceedings of the 19th symposium on string processing and information retrieval (SPIRE), pp 180–192

  • Claude F, Fariña A, Martínez-Prieto MA, Navarro G (2016) Universal indexes for highly repetitive document collections. Inf Syst 61:1–23

  • Danek A, Deorowicz S, Grabowski S (2014) Indexes of large genome collections on a PC. PLoS One 9(10):e109384

  • Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G (2015) Improved genome inference in the MHC using a population reference graph. Nat Genet 47(6):682–688

  • Do HH, Jansson J, Sadakane K, Sung W (2014) Fast relative Lempel-Ziv self-index for similar sequences. Theor Comput Sci 532:14–30

  • Eggertsson HP et al (2017) Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet 49(11):1654–1660

  • Ferrada H, Gagie T, Hirvola T, Puglisi SJ (2014) Hybrid indexes for repetitive datasets. Phil Trans R Soc A 372(2016):20130137

  • Ferrada H, Kempa D, Puglisi SJ (2018) Hybrid indexing revisited. In: Proceedings of the 20th workshop on algorithm engineering and experiments (ALENEX), pp 1–8

  • Ferragina P, Manzini G (2000) Opportunistic data structures with applications. In: Proceedings of the 41st symposium on foundations of computer science (FOCS), pp 390–398

  • Ferragina P, Manzini G (2005) Indexing compressed text. J ACM 52(4):552–581

  • Ferragina P, Luccio F, Manzini G, Muthukrishnan S (2009) Compressing and indexing labeled trees, with applications. J ACM 57(1):4:1–4:33

  • Gagie T, Puglisi SJ (2015) Searching and indexing genomic databases via kernelization. Front Bioeng Biotechnol 3:12

  • Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ (2012) A faster grammar-based self-index. In: Proceedings of the 6th conference on language and automata theory and applications (LATA), pp 240–251

  • Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ (2014) LZ77-based self-indexing with faster pattern matching. In: Proceedings of the 11th Latin American symposium on theoretical informatics (LATIN), pp 731–742

  • Gagie T, Manzini G, Sirén J (2017a) Wheeler graphs: a framework for BWT-based data structures. Theor Comput Sci 698:67–78

  • Gagie T, Navarro G, Prezza N (2017b) Optimal-time text indexing in BWT-runs bounded space. Technical report 1705.10382, arXiv.org

  • Gagie T, Navarro G, Prezza N (2018) Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the 29th symposium on discrete algorithms (SODA), pp 1459–1477

  • Grossi R, Vitter JS (2000) Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In: Proceedings of the 32nd symposium on theory of computing (STOC), pp 397–406

  • Grossi R, Vitter JS (2005) Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J Comput 35(2):378–407

  • Kärkkäinen J, Ukkonen E (1996) Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American workshop on string processing (WSP), pp 141–155

  • Kempa D, Prezza N (2017) At the roots of dictionary compression: string attractors. In: Proceedings of the 50th symposium on theory of computing (STOC), 2018. CoRR abs/1710.10964

  • Kreft S, Navarro G (2013) On compressing and indexing repetitive sequences. Theor Comput Sci 483:115–133

  • Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25

  • Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760

  • Maciuca S, del Ojo Elias C, McVean G, Iqbal Z (2016) A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th workshop on algorithms in bioinformatics (WABI), pp 222–233

  • Mäkinen V, Navarro G, Sirén J, Välimäki N (2010) Storage and retrieval of highly repetitive sequence collections. J Comput Biol 17(3):281–308

  • Mäkinen V, Belazzougui D, Cunial F, Tomescu AI (2015) Genome-scale algorithm design: biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press, Cambridge

  • Maruyama S, Nakahara M, Kishiue N, Sakamoto H (2013) ESP-index: a compressed index based on edit-sensitive parsing. J Discrete Algorithms 18:100–112

  • Na JC, Park H, Crochemore M, Holub J, Iliopoulos CS, Mouchard L, Park K (2013a) Suffix tree of alignment: an efficient index for similar data. In: Proceedings of the 24th international workshop on combinatorial algorithms (IWOCA), pp 337–348

  • Na JC, Park H, Lee S, Hong M, Lecroq T, Mouchard L, Park K (2013b) Suffix array of alignment: a practical index for similar data. In: Proceedings of the 20th symposium on string processing and information retrieval (SPIRE), pp 243–254

  • Na JC, Kim H, Park H, Lecroq T, Léonard M, Mouchard L, Park K (2016) FM-index of alignment: a compressed index for similar strings. Theor Comput Sci 638:159–170

  • Na JC, Kim H, Min S, Park H, Lecroq T, Léonard M, Mouchard L, Park K (2018) FM-index of alignment with gaps. Theor Comput Sci. https://doi.org/10.1016/j.tcs.2017.02.020

  • Navarro G (2017) A self-index on block trees. In: Proceedings of the 17th symposium on string processing and information retrieval (SPIRE), pp 278–289

  • Navarro G, Ordóñez A (2016) Faster compressed suffix trees for repetitive text collections. J Exp Algorithmics 21(1):article 1.8

  • Navarro G, Raffinot M (2002) Flexible pattern matching in strings – practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge, UK

  • Nishimoto T, Tomohiro I, Inenaga S, Bannai H, Takeda M (2016) Dynamic index and LZ factorization in compressed space. In: Proceedings of the Prague stringology conference (PSC), pp 158–170

  • Novak AM, Garrison E, Paten B (2017a) A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms Mol Biol 12(1):18:1–18:12

  • Novak AM et al (2017b) Genome graphs. Technical report 101378, bioRxiv

  • Ohlebusch E (2013) Bioinformatics algorithms: sequence analysis, genome rearrangements, and phylogenetic reconstruction. Oldenbusch Verlag, Bremen, Germany

  • Paten B, Novak AM, Eizenga JM, Garrison E (2017) Genome graphs and the evolution of genome inference. Genome Res 27(5):665–676

  • Procházka P, Holub J (2014) Compressing similar biological sequences using FM-index. In: Proceedings of the data compression conference (DCC), pp 312–321

  • Rahn R, Weese D, Reinert K (2014) Journaled string tree – a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics 30(24):3499–3505

  • Sadakane K (2000) Compressed text databases with efficient query algorithms based on the compressed suffix array. In: Proceedings of the 11th international symposium on algorithms and computation (ISAAC), pp 410–421

  • Sadakane K (2003) New text indexing functionalities of the compressed suffix arrays. J Algorithms 48(2):294–313

  • Schneeberger K, Hagmann J, Ossowski S, Warthmann N, Gesing S, Kohlbacher O, Weigel D (2009) Simultaneous alignment of short reads against multiple genomes. Genome Biol 10(9):R98

  • Sirén J (2017) Indexing variation graphs. In: Proceedings of the 19th workshop on algorithm engineering and experiments (ALENEX), pp 13–27

  • Sirén J, Välimäki N, Mäkinen V (2011) Indexing finite language representation of population genotypes. In: Proceedings of the 11th workshop on algorithms in bioinformatics (WABI), pp 270–281

  • Sirén J, Välimäki N, Mäkinen V (2014) Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans Comput Biol Bioinform 11(2):375–388

  • Takabatake Y, Tabei Y, Sakamoto H (2014) Improved ESP-index: a practical self-index for highly repetitive texts. In: Proceedings of the 13th symposium on experimental algorithms (SEA), pp 338–350

  • Takabatake Y, Nakashima K, Kuboyama T, Tabei Y, Sakamoto H (2016) siEDM: an efficient string index and search algorithm for edit distance with moves. Algorithms 9(2):26

  • Takagi T, Goto K, Fujishige Y, Inenaga S, Arimura H (2017) Linear-size CDAWG: new repetition-aware indexing and grammar compression. In: Proceedings of the 24th symposium on string processing and information retrieval (SPIRE), pp 304–316

  • Valenzuela D (2016) CHICO: a compressed hybrid index for repetitive collections. In: Proceedings of the 15th symposium on experimental algorithms (SEA), pp 326–338

  • Valenzuela D, Mäkinen V (2017) CHIC: a short read aligner for pan-genomic references. Technical report 178129, bioRxiv

  • Wandelt S, Leser U (2015) MRCSI: compressing and searching string collections with multiple references. Proc VLDB Endowment 8(5):461–472

  • Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. Proc VLDB Endowment 6(13):1534–1545

  • Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343

Author information

Corresponding authors

Correspondence to Travis Gagie or Gonzalo Navarro.

Copyright information

© 2018 Springer International Publishing AG

About this entry

Cite this entry

Gagie, T., Navarro, G. (2018). Compressed Indexes for Repetitive Textual Datasets. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_53-1

  • DOI: https://doi.org/10.1007/978-3-319-63962-8_53-1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63962-8

  • Online ISBN: 978-3-319-63962-8

  • eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering

Chapter history

  1. Latest

    Compressed Indexes for Repetitive Textual Datasets
    Published: 10 February 2018

    DOI: https://doi.org/10.1007/978-3-319-63962-8_53-1

  2. Original

    Compressed Indexes for Repetitive Textual Datasets
    Published: 24 February 2012

    DOI: https://doi.org/10.1007/978-3-319-63962-8_53-2