Skip to main content
Log in

Fixed Block Compression Boosting in FM-Indexes: Theory and Practice

  • Published:
Algorithmica Aims and scope Submit manuscript

Abstract

The FM index (Ferragina and Manzini in J ACM 52(4):552–581, 2005) is a widely-used compressed data structure that stores a string T in a compressed form and also supports fast pattern matching queries. In this paper, we describe new FM-index variants that combine nice theoretical properties, simple implementation and improved practical performance. Our main theoretical result is a new technique called fixed block compression boosting, which is a simpler and faster alternative to optimal compression boosting and implicit compression boosting used in previous FM-indexes. We also describe several new techniques for implementing fixed-block boosting efficiently, including a new, fast, and space-efficient implementation of wavelet trees. Our extensive experiments show the new indexes to be consistently fast and small relative to the state-of-the-art, and thus they make a good “off-the-shelf” choice for many applications.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. https://github.com/simongog/sdsl-lite.

References

  1. Belazzougui, D., Gagie, T., Mäkinen, V., Previtali, M., Puglisi, S.J.: Bidirectional variable-order de Bruijn graphs. In: 12th Latin American Theoretical Informatics Symposium (LATIN 2016), Ensenada, 11–15 Apr 2016. LNCS, vol. 9644, pp. 164–178. Springer (2016)

  2. Boucher, C., Bowe, A., Gagie, T., Puglisi, S.J., Sadakane, K.: Variable-order de Bruijn graphs. In: 2015 Data Compression Conference (DCC 2015), Snowbird, 7–9 Apr 2015, pp. 383–392. IEEE (2015)

  3. Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In: 12th Workshop on Algorithms in Bioinformatics (WABI 2012), Ljubljana, 10–12 Sept 2012. LNCS, vol. 7534, pp. 225–235. Springer (2012)

  4. Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, California (1994)

  5. Claude, F., Navarro, G., Pereira, A.O.: The wavelet matrix: an efficient wavelet tree for large alphabets. Inf. Syst. 47, 15–32 (2015)

    Article  Google Scholar 

  6. Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. J. ACM 52(4), 688–713 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  7. Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 1.12–1.31 (2009)

    MathSciNet  MATH  Google Scholar 

  8. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  9. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: An alphabet-friendly FM-index. In: 11th International Symposium on String Processing and Information Retrieval (SPIRE 2004), Padova, 5–8 Oct 2004. LNCS, vol. 3246, pp. 150–160. Springer (2004)

  10. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), Article 20 (2007)

  11. Ferragina, P., Nitto, I., Venturini, R.: On optimally partitioning a text to improve its compression. Algorithmica 61(1), 51–74 (2011)

    Article  MathSciNet  MATH  Google Scholar 

  12. Flicek, P., Birney, E.: Sense from sequence reads: methods for alignment and assembly. Nat. Methods 6(11 Suppl), S6–S12 (2009)

    Article  Google Scholar 

  13. Gog, S., Kärkkäinen, J., Kempa, D., Petri, M., Puglisi, S.J.: Faster, minuter. In: 2016 Data Compression Conference (DCC 2016), Snowbird, 30 Mar–1 Apr, 2016, pp. 53–62. IEEE (2016)

  14. Gog, S., Petri, M.: Optimized succinct data structures for massive data. Softw. Pract. Exp. 44(11), 1287–1314 (2014)

    Article  Google Scholar 

  15. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: 14th Annual Symposium on Discrete Algorithms (SODA 2003), Baltimore, 12–14 Jan 2003, pp. 841–850. SIAM (2003)

  16. Grossi, R., Gupta, A., Vitter, J.S.: When indexing equals compression: experiments with compressing suffix arrays and applications. In: 15th Annual Symposium on Discrete Algorithms (SODA 2004), New Orleans, 11–14 Jan 2004, pp. 636–645. SIAM (2004)

  17. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Hybrid compression of bitvectors for the FM-index. In: 2014 Data Compression Conference (DCC 2014), Snowbird, 26–28 Mar 2014, pp. 302–311. IEEE (2014)

  18. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lempel–Ziv parsing in external memory. In: 2014 Data Compression Conference (DCC 2014), Snowbird, 26–28 Mar 2014, pp. 153–162. IEEE (2014)

  19. Kärkkäinen, J., Puglisi, S.J.: Fixed block compression boosting in FM-indexes. In: 18th International Symposium on String Processing and Information Retrieval (SPIRE 2011), Pisa, 17–21 Oct 2011. LNCS, vol. 7024, pp. 174–184. Springer (2011)

  20. Kempa, D., Puglisi, S.J.: Lempel-Ziv factorization: simple, fast, practical. In: 15th Meeting on Algorithm Engineering and Experiments (ALENEX 2013), New Orleans, 7 Jan 2013, pp. 103–112. SIAM (2013)

  21. Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967 (2009)

    Article  Google Scholar 

  22. Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, Cambridge (2015)

    Book  Google Scholar 

  23. Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nord. J. Comput. 12(1), 40–66 (2005)

    MathSciNet  MATH  Google Scholar 

  24. Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: 14th International Symposium on String Processing and Information Retrieval (SPIRE 2007), Santiago, 29–31 Oct 2007. LNCS, vol. 4726, pp. 229–241. Springer (2007)

  25. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)

    Article  MathSciNet  Google Scholar 

  26. Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)

    Article  MathSciNet  MATH  Google Scholar 

  27. Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: 11th International Symposium on Experimental Algorithms (SEA 2012), Bordeaux, 7–9 June 2012. LNCS, vol. 7276, pp. 295–306. Springer (2012)

  28. Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: 9th Meeting on Algorithm Engineering and Experiments (ALENEX 2007), New Orleans, 6 Jan 2007, pp. 60–70. SIAM (2007)

  29. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007)

    Article  MathSciNet  MATH  Google Scholar 

  30. Shareghi, E., Petri, M., Haffari, G., Cohn, T.: Compact, efficient and unlimited capacity: language modeling with compressed suffix trees. In: 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, 17–21 Sept 2015, pp. 2409–2418. The Association for Computational Linguistics (2015)

Download references

Acknowledgements

Preliminary versions of this article appeared as J. Kärkkäinen and S. J. Puglisi, “Fixed block compression boosting in FM-indexes”, Proc. SPIRE 2011 [19], and S. Gog, J. Kärkkäinen, D. Kempa, M. Petri, and S. J. Puglisi, “Faster, minuter”, Proc. DCC 2016 [13].

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dominik Kempa.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supported by the Academy of Finland via Grant 294143.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Gog, S., Kärkkäinen, J., Kempa, D. et al. Fixed Block Compression Boosting in FM-Indexes: Theory and Practice. Algorithmica 81, 1370–1391 (2019). https://doi.org/10.1007/s00453-018-0475-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00453-018-0475-9

Keywords

Navigation