EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices

Pockrandt, Christopher; Ehrhardt, Marcel; Reinert, Knut

doi:10.1007/978-3-319-56970-3_12

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If $\sigma $ is the size of the alphabet, then the method of Lam et al. can conduct one step in time $\mathcal {O}(\sigma )$ while needing space $\mathcal {O}(\sigma \cdot n)$, using constant time rank queries on bit vectors. Schnattinger and colleagues improved this time to $\mathcal {O}(\log \sigma )$ while using $\mathcal {O}(\log \sigma \cdot n)$ bits of space for both the FM and the 2FM index. This is achieved by the use of binary wavelet trees.

In this paper we introduce a new, practical method for conducting an exact search in a uni- and bidirectional FM index in $\mathcal {O}(1)$ time per step while using $\mathcal {O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)$ bits of space. This is done by replacing the binary wavelet tree with a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dictionary).

We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition, we compared our implementation with other freely available implementations of bidirectional indices and show that ours is roughly $2.2$ to $4.2$ times faster. This will have a large impact on many bioinformatics applications that rely on practical implementations of (2)FM indices, e.g., for read mapping. To our knowledge this is the first implementation of a constant time method for a search step in 2FM indices.

References

1. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 133–144. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40450-4_12
2. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11, 31 (2015)
3. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report (1994)
4. Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinform. 9, 11 (2008). https://doi.org/10.1186/1471-2105-9-11
5. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Annual Symposium on Foundations of Computer Science (2000). https://doi.org/10.1109/SFCS.2000.892127
6. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms (TALG) 3, 20 (2007)
7. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_28
8. Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2003)
9. Hauswedell, H., Singer, J., Reinert, K.: Lambda: the local aligner for massive biological data. Bioinformatics (Oxford, England) 30, i349–i355 (2014). https://doi.org/10.1093/bioinformatics/btu439
10. Jacobson, G.J.: Succinct static data structures (1988)
11. Lam, T., Li, R., Tam, A., Wong, S., Wu, E.: High throughput short read alignment via bi-directional BWT. In: Proceedings of BIBM, pp. 31–36 (2009). https://doi.org/10.1109/BIBM.2009.42
12. Lam, T., Sung, W., Tam, S., Wong, C., Yiu, S.: Compressed indexing and local alignment of DNA. Bioinformatics 24, 791–797 (2008). https://doi.org/10.1093/bioinformatics/btn032
13. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012)
14. Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (2013)
15. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009). https://doi.org/10.1093/bioinformatics/btp324
16. Meyer, F., Kurtz, S., Backofen, R., Will, S., Beckstette, M.: Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinform. 12, 214 (2011). https://doi.org/10.1186/1471-2105-12-214
17. Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: International Symposium on Experimental Algorithms (2012). https://doi.org/10.1007/978-3-642-30850-5_26
18. Santiago, M., Sammeth, M., Guigo, R., Ribeca, P.: The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods 9, 1185–1188 (2012). https://doi.org/10.1038/nmeth.2221
19. Schnattinger, T., Ohlebusch, E., Gog, S.: Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput. 213, 13–22 (2012). https://doi.org/10.1016/j.ic.2011.03.007
20. Siragusa, E.: Approximate string matching for high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2015)
21. Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41, e78–e78 (2013). https://doi.org/10.1093/nar/gkt005
22. Ye, Y., Choi, J.-H., Tang, H.: Rapsearch: a fast protein similarity search tool for short reads. BMC Bioinform. 12, 1 (2011)

Acknowledgments

We would like to acknowledge Enrico Siragusa for his previous implementations of the FM index in SeqAn. The first author acknowledges the support of the International Max-Planck Research School for Computational Biology and Scientific Computing (IMPRS-CBSC). We also thank Veli Mäkinen and Simon Gog for very helpful remarks on a previous version of this manuscript during the Dagstuhl seminar 16351 “Next Generation Sequencing - Algorithms, and Software For Biomedical Applications”.

Author information

Authors and Affiliations

Department of Computer Science and Mathematics, Freie Universität Berlin, Berlin, Germany
Christopher Pockrandt, Marcel Ehrhardt & Knut Reinert
International Max Planck Research School for Computational Biology and Scientific Computation, Berlin, Germany
Christopher Pockrandt

Corresponding author

Correspondence to Christopher Pockrandt.

Editor information

Editors and Affiliations

Indiana University Bloomington, Bloomington, Indiana, USA
S. Cenk Sahinalp

Appendix

This appendix gives a short introduction for readers not familiar with FM and 2FM indices.

1.1 Introduction to the FM and 2FM Index

Given a text T of length n over an ordered, finite alphabet $\varSigma = \{c_1, \dots , c_{\sigma }\}$ with $\forall \, 1 \le i< \sigma : c_{i} <_{lex} c_{i+1}$, let T[i] denote the character at position i in T, $\cdot $ the concatenation operator and T[1..i] the prefix of T up to the character at position i. $T^{rev}$ represents the reversed text. We assume that T ends with a sentinel character $\$\notin \varSigma $ that does not occur in any other position in T and is lexicographically smaller than any character in $\varSigma $. The FM index needs the Burrows-Wheeler transform (BWT) of T. The BWT is the concatenation of characters in the last column of all lexicographically sorted cyclic permutations of the string T (see Fig. 3 for an example). We will refer to the BWT as L.
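The definition of the BWT above can be illustrated with a short Python sketch. It is a naive construction that sorts all cyclic rotations explicitly, which is only practical for tiny examples; the function name `bwt` and the example text are assumptions made for illustration and are not taken from the paper or from SeqAn.

```python
def bwt(text: str) -> str:
    """Return the BWT L of `text` by sorting all cyclic rotations.

    Assumes `text` ends with the sentinel '$', which occurs nowhere else
    and compares smaller than every other character (true for ASCII letters).
    """
    assert text.endswith("$") and text.count("$") == 1
    n = len(text)
    # All n cyclic rotations of the text, in lexicographic order.
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    # L is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana$"))  # annb$aa
```

Real indices build L from a suffix array instead of materialising the rotations, which would need quadratic space here.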

In contrast to suffix trees or suffix arrays, where a prefix P of a pattern is extended by characters to the right (referred to as forward search $P \rightarrow Pc$ for $c \in \varSigma $), the FM index can only be searched using backward search, i.e., extending a suffix $P'$ by characters to the left, $P' \rightarrow cP'$. Performing a single-character backward search of c in the FM index requires two pieces of information: first, C(c), the number of characters in L that are lexicographically smaller than c; second, Occ(c, i), the number of c’s in L[1..i]. Given a range [a, b] for P, i.e., the range in the sorted list of cyclic permutations whose entries start with P, we can compute the range $[a^\prime , b^\prime ]$ for cP as follows: $[a^\prime , b^\prime ]$ = $[C(c) + Occ(c, a - 1) + 1, C(c) + Occ(c, b)]$. We will refer to the BWT together with tables C and Occ as FM index $\mathcal {I}$ (see Fig. 3 for an example of one search step).
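The backward search step can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions: Occ is computed by a naive scan rather than a rank dictionary, ranges are 1-based and inclusive as in the text, and the helper names `c_table`, `occ` and `backward_search` are hypothetical.

```python
from collections import Counter


def c_table(L):
    """C[c] = number of characters in L that are lexicographically smaller than c."""
    C, total = {}, 0
    for ch, count in sorted(Counter(L).items()):
        C[ch] = total
        total += count
    return C


def occ(L, c, i):
    """Occ(c, i): number of c's in L[1..i] (1-based); naive scan for illustration."""
    return L[:i].count(c)


def backward_search(L, pattern):
    """Return the 1-based range [a, b] of rotations starting with `pattern`, or None."""
    C = c_table(L)
    a, b = 1, len(L)  # range of the empty pattern
    for c in reversed(pattern):  # extend to the left, one character at a time
        if c not in C:
            return None
        a, b = C[c] + occ(L, c, a - 1) + 1, C[c] + occ(L, c, b)
        if a > b:
            return None
    return a, b


# "annb$aa" is the BWT of "banana$".
print(backward_search("annb$aa", "ana"))  # (3, 4)
```

The range size b - a + 1 is the number of occurrences of the pattern in the text.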

The 2FM index maintains two FM indices $\mathcal {I}$ and $\mathcal {I}^{rev}$, one for the original text T and one for the reversed text $T^{rev}$. Searching a pattern left to right on the original text (i.e., conducting a forward search) corresponds to a backward search in $\mathcal {I}^{rev}$; searching a pattern right to left in the original text corresponds to a backward search in $\mathcal {I}$. The difficulty is to keep both indices synchronized whenever a search step is performed. W.l.o.g. we assume that we want to extend the pattern to the right, i.e., perform a forward search $P \rightarrow Pc_j$ for some character $c_j$. First, the backward search $P^{rev} \rightarrow c_jP^{rev}$ is carried out on $\mathcal {I}^{rev}$ and its new range $[a^\prime _{rev}, b^\prime _{rev}] = [C(c_j) + Occ(c_j, a_{rev} - 1) + 1, C(c_j) + Occ(c_j, b_{rev})]$ is computed. The new range in $\mathcal {I}$ can be calculated from the interval [a, b] for P in $\mathcal {I}$ and the range size in the reversed text's index: $[a^\prime , b^\prime ] = [a+smaller,a+smaller+ b^\prime _{rev} - a^\prime _{rev}]$. To compute smaller, Lam et al. [11] propose to perform $\mathcal {O}(\sigma )$ backward searches $P^{rev} \rightarrow c_iP^{rev}$ for all $1 \le i < j$ and sum up the range sizes, i.e., $smaller = \sum _{1 \le i< j} Occ(c_i, b_{rev}) - \sum _{1 \le i < j} Occ(c_i, a_{rev} - 1)$, leading to a total running time of $\mathcal {O}(\sigma )$ (see Fig. 4 for an illustration).
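The synchronized forward step can be sketched in Python along the lines of the $\mathcal {O}(\sigma )$ approach described above. This is an illustrative sketch, not the paper's implementation: all function names are hypothetical, Occ is a naive scan, and the BWT is built by sorting rotations. One deliberate choice is labeled in the code: the sum for smaller also covers the sentinel, because an occurrence of P that is followed by the sentinel sorts before every occurrence of $Pc_j$.

```python
from collections import Counter


def bwt(text):
    """BWT via sorted rotations; `text` must end with the sentinel '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)


def c_table(L):
    """C[c] = number of characters in L smaller than c (identical for T and T^rev)."""
    C, total = {}, 0
    for ch, count in sorted(Counter(L).items()):
        C[ch] = total
        total += count
    return C


def occ(L, c, i):
    """Occ(c, i) on a BWT string, 1-based; naive scan for illustration."""
    return L[:i].count(c)


def extend_right(L_rev, C, rng, rng_rev, c):
    """One forward step P -> Pc; returns the new (range in I, range in I^rev) or None."""
    a, _ = rng
    a_rev, b_rev = rng_rev
    if c not in C:
        return None
    # Backward search P^rev -> c P^rev on the index of the reversed text.
    new_a_rev = C[c] + occ(L_rev, c, a_rev - 1) + 1
    new_b_rev = C[c] + occ(L_rev, c, b_rev)
    if new_a_rev > new_b_rev:
        return None
    # Occurrences of P followed by a smaller symbol.
    # Assumption of this sketch: the sentinel '$' is included in the sum.
    smaller = sum(occ(L_rev, d, b_rev) - occ(L_rev, d, a_rev - 1) for d in C if d < c)
    new_a = a + smaller
    return (new_a, new_a + new_b_rev - new_a_rev), (new_a_rev, new_b_rev)


def forward_search(text, pattern):
    """Search `pattern` left to right; `text` ends with '$'."""
    L_rev = bwt(text[:-1][::-1] + "$")  # BWT of the reversed text
    C = c_table(L_rev)
    rng = rng_rev = (1, len(text))  # ranges of the empty pattern
    for c in pattern:
        step = extend_right(L_rev, C, rng, rng_rev, c)
        if step is None:
            return None
        rng, rng_rev = step
    return rng, rng_rev


print(forward_search("banana$", "ana"))  # ((3, 4), (3, 4))
```

Both ranges always have the same size, since each counts the occurrences of the same pattern, once in T and once reversed in $T^{rev}$.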

The occurrence table Occ is usually not implemented by explicitly storing all of its values. Instead of storing the entire $Occ : \varSigma \times \{1,\dots ,n\} \rightarrow \{1,\dots ,n\}$ one uses the more space-efficient constant time rank dictionary: for every $c \in \varSigma $ a bit vector $B_c[1..n]$ is constructed such that $B_c[i] = 1$ if and only if $L[i] = c$. Thus the occurrence value equals the number of 1’s in $B_c[1..i]$, i.e., $Occ(c, i) = rank(B_c, i)$. Jacobson [10] showed that rank queries can be answered in constant time using only $o(n)$ additional space per bit vector. Since then many other constant time rank query data structures have been proposed. For an overview we refer the reader to [17], which contains a comparison of various implementations. Since we also make use of this technique, we explain the simplest idea, namely the one for 2-level rank dictionaries, in the following paragraph.
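The reduction from Occ to rank queries on indicator bit vectors can be made concrete with a small Python sketch. The names `char_bitvectors`, `rank` and `occ` are hypothetical, bit vectors are plain Python lists, and `rank` is a linear scan; it only shows the correspondence $Occ(c, i) = rank(B_c, i)$, not a constant time structure.

```python
def char_bitvectors(L):
    """Map every character c occurring in L to its indicator bit vector B_c."""
    return {c: [1 if x == c else 0 for x in L] for c in sorted(set(L))}


def rank(B, i):
    """Number of 1's in B[1..i] (1-based, inclusive); naive O(i) scan."""
    return sum(B[:i])


def occ(bitvectors, c, i):
    """Occ(c, i) expressed as a rank query on B_c."""
    return rank(bitvectors[c], i)


L = "annb$aa"  # BWT of "banana$"
B = char_bitvectors(L)
print(B["a"])          # [1, 0, 0, 0, 0, 1, 1]
print(occ(B, "n", 4))  # 2
```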

1.2 Constant Time Rank Queries

In order to store partial prefix sums, the technique uses two levels of lookup tables, called blocks and superblocks. Given a bit vector B of length n we divide it into blocks of length $\ell $ and superblocks of length $\ell ^2$ where

$$\begin{aligned} \ell =\left\lfloor \frac{\log n}{2}\right\rfloor . \end{aligned}$$

For blocks and superblocks we allocate arrays M and $M'$ of sizes $\left\lfloor \frac{n}{\ell }\right\rfloor $ and $\left\lfloor \frac{n}{\ell ^2}\right\rfloor $, respectively (see Fig. 5 for an illustration).

For the m-th superblock we store the number of 1’s from the beginning of B to the end of the superblock in $M'[m]=rank(B, m \cdot \ell ^2)$. As there are $\left\lfloor \frac{n}{\ell ^2}\right\rfloor $ superblocks, $M'$ can be stored in $\mathcal {O}\left( \frac{n}{\ell ^2} \cdot \log n\right) =\mathcal {O}\left( \frac{n}{\log n}\right) =o(n)$ bits. For the m-th block we store the number of 1’s from the beginning of the overlapping superblock to the end of the block in $M[m]=rank\big (B[1+k\ell ..n],(m-k) \cdot \ell \big )$, where $k=\left\lfloor \frac{m-1}{\ell }\right\rfloor \ell $ is the total number of blocks in all superblocks left of the current superblock. M has $\left\lfloor \frac{n}{\ell }\right\rfloor $ entries and can be stored in $\mathcal {O}\left( \frac{n}{\ell } \cdot \log \ell ^2\right) =\mathcal {O}\left( \frac{ n\cdot \log \log n}{\log n}\right) =o(n)$ bits.

Given a rank query rank(B, i), one can now add the corresponding superblock and block values. But we still have to account for the 1’s in the block covering position i (in case i is not at the end of a block). Let P be a precomputed lookup table such that for each possible bit vector V of length $\ell $ and each $i\in [1..\ell ]$ we have $P[V][i]=rank(V,i)$. P has $2^\ell \cdot \ell $ entries with values of at most $\ell $ and thus can be stored in

$$\begin{aligned} \mathcal {O}\left( 2^\ell \cdot \ell \cdot \log \ell \right) =\mathcal {O}\left( 2^{\frac{\log n}{2}} \cdot \log n \cdot \log \log n \right) = \mathcal {O}\left( \sqrt{n} \cdot \log n \cdot \log \log n \right) =o(n) \end{aligned}$$

bits. We now decompose a rank query into three subqueries using the precomputed tables. For a position i we determine the index $p=\left\lfloor \frac{i-1}{\ell }\right\rfloor $ of the next block left of i and the index $q=\left\lfloor \frac{p-1}{\ell }\right\rfloor $ of the next superblock left of block p. Then the following holds:

$$rank(B,i)=M'[q]+M[p]+P\big [B[1+p\ell ..(p+1)\ell ]\big ]\big [i-p\ell \big ].$$

Since the text T of length n has to be addressed, we assume that a register has at least size $\left\lceil \log n\right\rceil $. Thus $B[1+p\ell ..(p+1)\ell ]$ fits into a single register and can be determined in $\mathcal {O}(1)$ time. Therefore a rank query can be answered in $\mathcal {O}(1)$ time. In practice the precomputed lookup table P is replaced by a popcount operation on the CPU register and we have only two lookup operations.
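The two-level scheme can be sketched in Python as follows. This is a hedged illustration rather than the layout used in the paper or in SeqAn: the class name `TwoLevelRank` is hypothetical, indices are 0-based, and each table entry stores the count up to the start of its block or superblock (the text above stores counts up to the end), which avoids special cases at the boundaries. The in-block count is a sum over a short slice and stands in for the lookup table P or a hardware popcount.

```python
class TwoLevelRank:
    """Two-level rank dictionary over a list of 0/1 values (illustrative sketch)."""

    def __init__(self, bits):
        n = len(bits)
        self.bits = bits
        # Block length ell = floor(log2(n) / 2), at least 1; superblocks span ell blocks.
        self.ell = max(1, (n.bit_length() - 1) // 2)
        ell = self.ell
        num_blocks = n // ell + 1
        # block[p]: number of 1's from the start of the enclosing superblock
        # to the start of block p.
        self.block = [0] * num_blocks
        # superblock[q]: number of 1's before the start of superblock q.
        self.superblock = [0] * (num_blocks // ell + 1)
        total = 0     # 1's seen so far in the whole vector
        in_super = 0  # 1's seen since the current superblock started
        for p in range(num_blocks):
            if p % ell == 0:
                self.superblock[p // ell] = total
                in_super = 0
            self.block[p] = in_super
            ones = sum(bits[p * ell:(p + 1) * ell])
            total += ones
            in_super += ones

    def rank(self, i):
        """Number of 1's among the first i bits, for 0 <= i <= n."""
        p, r = divmod(i, self.ell)
        start = p * self.ell
        in_block = sum(self.bits[start:start + r])  # stands in for popcount / table P
        return self.superblock[p // self.ell] + self.block[p] + in_block


rd = TwoLevelRank([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
print(rd.rank(10))  # 6
```

Each query touches one superblock entry, one block entry and at most $\ell $ bits of B, which mirrors the three subqueries in the formula above.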

One can now replace the occurrence table by this 2-level dictionary, i.e., by creating a bit vector $B_c$ for every $c \in \varSigma $ and setting $B_c[i] = 1$ if and only if $L[i] = c$. This results in a space requirement of $\mathcal {O}(\sigma \cdot n)+o(\sigma \cdot n)$ bits.

Cite this paper

Pockrandt, C., Ehrhardt, M., Reinert, K. (2017). EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_12

DOI: https://doi.org/10.1007/978-3-319-56970-3_12
Published: 12 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56969-7
Online ISBN: 978-3-319-56970-3
eBook Packages: Computer Science, Computer Science (R0)
