Encyclopedia of Big Data Technologies

2019 Edition | Editors: Sherif Sakr, Albert Y. Zomaya

Compressed Indexes for Repetitive Textual Datasets

  • Travis Gagie
  • Gonzalo Navarro
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_53

Definitions

Given a text or collection of texts containing long repeated substrings, we are asked to build an index that takes space bounded in terms of the size of a well-compressed encoding of the text or texts and that, given an arbitrary pattern, can quickly report the occurrences of that pattern in the dataset.
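To make the query interface in this definition concrete, the following is a minimal sketch of an uncompressed baseline: a plain suffix array with binary search. It is not one of the compressed indexes this entry surveys, and it does not meet the space bound in the definition, since it stores one integer per text position. The function names build_suffix_array and locate are hypothetical, chosen only for illustration.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction for illustration only: it materializes every suffix
    while sorting, so it is far too slow and large for massive datasets.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def locate(text: str, sa: List[int], pattern: str) -> List[int]:
    """Report every position where pattern occurs in text, in increasing order."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Suffixes in sa[start:lo] are exactly those that begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    # A tiny "repetitive" text: the same word stored twice.
    text = "abracadabra abracadabra"
    sa = build_suffix_array(text)
    print(locate(text, sa, "abra"))  # → [0, 7, 12, 19]
```

The suffix array answers queries without scanning the text, but its size grows with the length of the text regardless of how repetitive the text is; the indexes discussed in this entry aim instead for space bounded by the size of a compressed encoding.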

Overview

Humanity now stores as much data in a year as we did in our whole history until the turn of the millennium. Most applications that use this data need to query it efficiently, and one of the most important kinds of queries is pattern matching in texts. It is usually impractical to scan massive textual datasets every time we want to count or find the occurrences of a pattern, so we must index them. Until fairly recently, indexing a text often took much more memory than simply storing the dataset. In 2000, however, Ferragina and Manzini (2000, 2005) showed how to simultaneously compress and index a text, with the index itself supporting access to the text and thus...

References

  1. Abeliuk A, Cánovas R, Navarro G (2013) Practical compressed suffix trees. Algorithms 6(2):319–351
  2. Belazzougui D, Cunial F (2017) Representing the suffix tree with the CDAWG. In: Proceedings of the 28th symposium on combinatorial pattern matching (CPM), pp 7:1–7:13
  3. Belazzougui D, Cunial F, Gagie T, Prezza N, Raffinot M (2015) Composite repetition-aware data structures. In: Proceedings of the 26th symposium on combinatorial pattern matching (CPM), pp 26–39
  4. Belazzougui D, Cunial F, Gagie T, Prezza N, Raffinot M (2017) Flexible indexing of repetitive collections. In: Proceedings of the 13th conference on computability in Europe (CiE), pp 162–174
  5. Bille P, Ettienne MB, Gørtz IL, Vildhøj HW (2017) Time-space trade-offs for Lempel-Ziv compressed indexing. In: Proceedings of the 28th symposium on combinatorial pattern matching (CPM), pp 16:1–16:17
  6. Blumer A, Blumer J, Haussler D, McConnell RM, Ehrenfeucht A (1987) Complete inverted files for efficient text retrieval and analysis. J ACM 34(3):578–595
  7. Bowe A, Onodera T, Sadakane K, Shibuya T (2012) Succinct de Bruijn graphs. In: Proceedings of the 12th workshop on algorithms in bioinformatics (WABI), pp 225–235
  8. Claude F, Navarro G (2011) Self-indexed grammar-based compression. Fundamenta Informaticae 111(3):313–337
  9. Claude F, Navarro G (2012) Improved grammar-based compressed indexes. In: Proceedings of the 19th symposium on string processing and information retrieval (SPIRE), pp 180–192
  10. Claude F, Fariña A, Martínez-Prieto MA, Navarro G (2016) Universal indexes for highly repetitive document collections. Inf Syst 61:1–23
  11. Danek A, Deorowicz S, Grabowski S (2014) Indexes of large genome collections on a PC. PLoS One 9(10):e109384
  12. Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G (2015) Improved genome inference in the MHC using a population reference graph. Nat Genet 47(6):682–688
  13. Do HH, Jansson J, Sadakane K, Sung W (2014) Fast relative Lempel-Ziv self-index for similar sequences. Theor Comput Sci 532:14–30
  14. Eggertsson HP et al (2017) Graphtyper enables population-scale genotyping using pangenome graphs. Nat Genet 49(11):1654–1660
  15. Ferrada H, Gagie T, Hirvola T, Puglisi SJ (2014) Hybrid indexes for repetitive datasets. Phil Trans R Soc A 372(2016):20130137
  16. Ferrada H, Kempa D, Puglisi SJ (2018) Hybrid indexing revisited. In: Proceedings of the 20th workshop on algorithm engineering and experiments (ALENEX), pp 1–8
  17. Ferragina P, Manzini G (2000) Opportunistic data structures with applications. In: Proceedings of the 41st symposium on foundations of computer science (FOCS), pp 390–398
  18. Ferragina P, Manzini G (2005) Indexing compressed text. J ACM 52(4):552–581
  19. Ferragina P, Luccio F, Manzini G, Muthukrishnan S (2009) Compressing and indexing labeled trees, with applications. J ACM 57(1):4:1–4:33
  20. Gagie T, Puglisi SJ (2015) Searching and indexing genomic databases via kernelization. Front Bioeng Biotechnol 3:12
  21. Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ (2012) A faster grammar-based self-index. In: Proceedings of the 6th conference on language and automata theory and applications (LATA), pp 240–251
  22. Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ (2014) LZ77-based self-indexing with faster pattern matching. In: Proceedings of the 11th Latin American symposium on theoretical informatics (LATIN), pp 731–742
  23. Gagie T, Manzini G, Sirén J (2017a) Wheeler graphs: a framework for BWT-based data structures. Theor Comput Sci 698:67–78
  24. Gagie T, Navarro G, Prezza N (2017b) Optimal-time text indexing in BWT-runs bounded space. Technical report 1705.10382, arXiv.org
  25. Gagie T, Navarro G, Prezza N (2018) Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the 29th symposium on discrete algorithms (SODA), pp 1459–1477
  26. Grossi R, Vitter JS (2000) Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In: Proceedings of the 32nd symposium on theory of computing (STOC), pp 397–406
  27. Grossi R, Vitter JS (2005) Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J Comput 35(2):378–407
  28. Kärkkäinen J, Ukkonen E (1996) Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American workshop on string processing (WSP), pp 141–155
  29. Kempa D, Prezza N (2017) At the roots of dictionary compression: string attractors. In: Proceedings of the 50th symposium on theory of computing (STOC), 2018. CoRR abs/1710.10964
  30. Kreft S, Navarro G (2013) On compressing and indexing repetitive sequences. Theor Comput Sci 483:115–133
  31. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25
  32. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760
  33. Maciuca S, del Ojo Elias C, McVean G, Iqbal Z (2016) A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th workshop on algorithms in bioinformatics (WABI), pp 222–233
  34. Mäkinen V, Navarro G, Sirén J, Välimäki N (2010) Storage and retrieval of highly repetitive sequence collections. J Comput Biol 17(3):281–308
  35. Mäkinen V, Belazzougui D, Cunial F, Tomescu AI (2015) Genome-scale algorithm design: biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press, Cambridge
  36. Maruyama S, Nakahara M, Kishiue N, Sakamoto H (2013) ESP-index: a compressed index based on edit-sensitive parsing. J Discrete Algorithms 18:100–112
  37. Na JC, Park H, Crochemore M, Holub J, Iliopoulos CS, Mouchard L, Park K (2013a) Suffix tree of alignment: an efficient index for similar data. In: Proceedings of the 24th international workshop on combinatorial algorithms (IWOCA), pp 337–348
  38. Na JC, Park H, Lee S, Hong M, Lecroq T, Mouchard L, Park K (2013b) Suffix array of alignment: a practical index for similar data. In: Proceedings of the 20th symposium on string processing and information retrieval (SPIRE), pp 243–254
  39. Na JC, Kim H, Park H, Lecroq T, Léonard M, Mouchard L, Park K (2016) FM-index of alignment: a compressed index for similar strings. Theor Comput Sci 638:159–170
  40. Na JC, Kim H, Min S, Park H, Lecroq T, Léonard M, Mouchard L, Park K (2018) FM-index of alignment with gaps. Theor Comput Sci. https://doi.org/10.1016/j.tcs.2017.02.020
  41. Navarro G (2017) A self-index on block trees. In: Proceedings of the 17th symposium on string processing and information retrieval (SPIRE), pp 278–289
  42. Navarro G, Ordóñez A (2016) Faster compressed suffix trees for repetitive text collections. J Exp Algorithmics 21(1):article 1.8
  43. Navarro G, Raffinot M (2002) Flexible pattern matching in strings – practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge, UK
  44. Nishimoto T, Tomohiro I, Inenaga S, Bannai H, Takeda M (2016) Dynamic index and LZ factorization in compressed space. In: Proceedings of the Prague stringology conference (PSC), pp 158–170
  45. Novak AM, Garrison E, Paten B (2017a) A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms Mol Biol 12(1):18:1–18:12
  46. Novak AM et al (2017b) Genome graphs. Technical report 101378, bioRxiv
  47. Ohlebusch E (2013) Bioinformatics algorithms: sequence analysis, genome rearrangements, and phylogenetic reconstruction. Oldenbusch Verlag, Bremen, Germany
  48. Paten B, Novak AM, Eizenga JM, Garrison E (2017) Genome graphs and the evolution of genome inference. Genome Res 27(5):665–676
  49. Procházka P, Holub J (2014) Compressing similar biological sequences using FM-index. In: Proceedings of the data compression conference (DCC), pp 312–321
  50. Rahn R, Weese D, Reinert K (2014) Journaled string tree – a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics 30(24):3499–3505
  51. Sadakane K (2000) Compressed text databases with efficient query algorithms based on the compressed suffix array. In: Proceedings of the 11th international symposium on algorithms and computations (ISAAC), pp 410–421
  52. Sadakane K (2003) New text indexing functionalities of the compressed suffix arrays. J Algorithms 48(2):294–313
  53. Schneeberger K, Hagmann J, Ossowski S, Warthmann N, Gesing S, Kohlbacher O, Weigel D (2009) Simultaneous alignment of short reads against multiple genomes. Genome Biol 10(9):R98
  54. Sirén J (2017) Indexing variation graphs. In: Proceedings of the 19th workshop on algorithm engineering and experiments (ALENEX), pp 13–27
  55. Sirén J, Välimäki N, Mäkinen V (2011) Indexing finite language representation of population genotypes. In: Proceedings of the 11th workshop on algorithms in bioinformatics (WABI), pp 270–281
  56. Sirén J, Välimäki N, Mäkinen V (2014) Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans Comput Biol Bioinform 11(2):375–388
  57. Takabatake Y, Tabei Y, Sakamoto H (2014) Improved ESP-index: a practical self-index for highly repetitive texts. In: Proceedings of the 13th symposium on experimental algorithms (SEA), pp 338–350
  58. Takabatake Y, Nakashima K, Kuboyama T, Tabei Y, Sakamoto H (2016) siEDM: an efficient string index and search algorithm for edit distance with moves. Algorithms 9(2):26
  59. Takagi T, Goto K, Fujishige Y, Inenaga S, Arimura H (2017) Linear-size CDAWG: new repetition-aware indexing and grammar compression. In: Proceedings of the 24th symposium on string processing and information retrieval (SPIRE), pp 304–316
  60. Valenzuela D (2016) CHICO: a compressed hybrid index for repetitive collections. In: Proceedings of the 15th symposium on experimental algorithms (SEA), pp 326–338
  61. Valenzuela D, Mäkinen V (2017) CHIC: a short read aligner for pan-genomic references. Technical report 178129, bioRxiv.org
  62. Wandelt S, Leser U (2015) MRCSI: compressing and searching string collections with multiple references. Proc VLDB Endowment 8(5):461–472
  63. Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. Proc VLDB Endowment 6(13):1534–1545
  64. Ziv J, Lempel A (1977) A universal algorithm for sequential data compression. IEEE Trans Inf Theory 23(3):337–343

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. EIT, Diego Portales University, Santiago, Chile
  2. Department of Computer Science, University of Chile, Santiago, Chile