FM-index for Dummies

Grabowski, Szymon; Raniszewski, Marcin; Deorowicz, Sebastian

doi:10.1007/978-3-319-58274-0_16


Part of the book series: Communications in Computer and Information Science (CCIS, volume 716)

Included in the following conference series:

International Conference: Beyond Databases, Architectures and Structures


Abstract

Full-text search refers to techniques for searching a document, or a document collection, in a full-text database. To speed up such searches, the given text should be indexed. The FM-index is a celebrated compressed data structure for full-text pattern searching. After the first wave of interest in its theoretical developments, the last few years have seen a surge of interest in practical FM-index variants. These enhancements often rely on a bit-vector representation augmented with an efficient data structure for rank queries. In this work, we propose a new, cache-friendly implementation of the rank primitive and advocate a very simple FM-index architecture that trades compression ratio for speed. Experimental results show that our variants are 2–3 times faster than the fastest known ones, at the price of typically using 1.5–5 times more space.
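
To make the role of the rank primitive concrete, the following is a minimal sketch of pattern counting with an FM-index built on bit vectors. It is not the authors' implementation and does not reproduce their cache-friendly layout. The names `RankBitVector`, `FMIndex`, and `BLOCK`, the choice of one bit vector per symbol, and the naive suffix sorting are all assumptions made for illustration.

```python
BLOCK = 64  # bits per block; an assumed value, not taken from the paper


class RankBitVector:
    """Bit vector storing one precomputed counter per BLOCK-bit block."""

    def __init__(self, bits):
        self.words = []   # bits of each block packed into one integer
        self.counts = []  # number of 1s strictly before each block
        total = 0
        for start in range(0, len(bits), BLOCK):
            word = 0
            for j, b in enumerate(bits[start:start + BLOCK]):
                if b:
                    word |= 1 << j
            self.counts.append(total)
            self.words.append(word)
            total += bin(word).count("1")
        # Sentinel block so that rank1(len(bits)) is always answerable.
        self.counts.append(total)
        self.words.append(0)

    def rank1(self, i):
        """Return the number of 1s in bits[0:i]."""
        blk, off = divmod(i, BLOCK)
        mask = (1 << off) - 1
        return self.counts[blk] + bin(self.words[blk] & mask).count("1")


class FMIndex:
    """Toy FM-index: one rank-enabled bit vector per alphabet symbol."""

    def __init__(self, text):
        # Terminator assumed to be absent from text and smaller than any symbol.
        text += "\0"
        # Naive suffix sorting; fine for a demo, far too slow for real inputs.
        sa = sorted(range(len(text)), key=lambda i: text[i:])
        bwt = [text[i - 1] for i in sa]  # Burrows-Wheeler transform
        self.n = len(text)
        self.C = {}  # C[c] = number of symbols in text smaller than c
        total = 0
        for c in sorted(set(text)):
            self.C[c] = total
            total += text.count(c)
        self.occ = {c: RankBitVector([x == c for x in bwt]) for c in self.C}

    def count(self, pattern):
        """Count occurrences of a non-empty pattern via backward search."""
        lo, hi = 0, self.n  # half-open range of suffix array rows
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c].rank1(lo)
            hi = self.C[c] + self.occ[c].rank1(hi)
            if lo >= hi:
                return 0
        return hi - lo


index = FMIndex("abracadabra")
print(index.count("abra"), index.count("a"), index.count("xyz"))
```

Each pattern symbol costs two rank queries, which is why the speed of rank dominates search time. Keeping one uncompressed bit vector per symbol is a simple layout that spends space to keep those queries cheap; it only illustrates the general space-for-speed trade-off mentioned in the abstract.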



Acknowledgments

We thank Simon Gog for providing the FM-FB-V5 and FM-hybrid-FB_8 sources and helping us in running sdsl-lite, and Shaun D. Jackman for a remark concerning the ABySS de novo genome assembler.

The work was supported by the Polish National Science Centre upon decision DEC-2013/09/B/ST6/03117.

Author information

Authors and Affiliations

Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924, Łódź, Poland
Szymon Grabowski & Marcin Raniszewski
Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland
Sebastian Deorowicz


Corresponding author

Correspondence to Sebastian Deorowicz.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Dariusz Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Paweł Kasprowski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Bożena Małysiak-Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Daniel Kostrzewa


About this paper

Cite this paper

Grabowski, S., Raniszewski, M., Deorowicz, S. (2017). FM-index for Dummies. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_16


DOI: https://doi.org/10.1007/978-3-319-58274-0_16
Published: 27 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58273-3
Online ISBN: 978-3-319-58274-0
eBook Packages: Computer Science (R0)
