Abstract
The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If \(\sigma \) is the size of the alphabet, then the method of Lam et al. can conduct one step in time \(\mathcal {O}(\sigma )\) while needing space \(\mathcal {O}(\sigma \cdot n)\), using constant time rank queries on bit vectors. Schnattinger and colleagues improved this time to \(\mathcal {O}(\log \sigma )\) while using \(\mathcal {O}(\log \sigma \cdot n)\) bits of space for both the FM and the 2FM index. This is achieved by the use of binary wavelet trees.
In this paper we introduce a new, practical method for conducting an exact search in a uni- and bidirectional FM index in \(\mathcal {O}(1)\) time per step while using \(\mathcal {O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)\) bits of space. This is done by replacing the binary wavelet tree by a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dictionary).
We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition, we compared our implementation with other freely available implementations of bidirectional indices and show that ours is faster by a factor of \(\approx 2.2\) to \(4.2\). This will have a large impact on many bioinformatics applications that rely on practical implementations of (2)FM indices, e.g., for read mapping. To our knowledge, this is the first implementation of a constant time method for a search step in 2FM indices.
References
Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 133–144. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40450-4_12
Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11, 31 (2015)
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report (1994)
Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinform. 9, 11 (2008). https://doi.org/10.1186/1471-2105-9-11
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Annual Symposium on Foundations of Computer Science (2000). https://doi.org/10.1109/SFCS.2000.892127
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms (TALG) 3, 20 (2007)
Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_28
Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2003)
Hauswedell, H., Singer, J., Reinert, K.: Lambda: the local aligner for massive biological data. Bioinformatics (Oxford, England) 30, i349–i355 (2014). https://doi.org/10.1093/bioinformatics/btu439
Jacobson, G.J.: Succinct static data structures (1988)
Lam, T., Li, R., Tam, A., Wong, S., Wu, E.: High throughput short read alignment via bi-directional BWT. In: Proceedings of BIBM, pp. 31–36 (2009). https://doi.org/10.1109/BIBM.2009.42
Lam, T., Sung, W., Tam, S., Wong, C., Yiu, S.: Compressed indexing and local alignment of DNA. Bioinformatics 24, 791–797 (2008). https://doi.org/10.1093/bioinformatics/btn032
Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012)
Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (2013)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009). https://doi.org/10.1093/bioinformatics/btp324
Meyer, F., Kurtz, S., Backofen, R., Will, S., Beckstette, M.: Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinform. 12, 214 (2011). https://doi.org/10.1186/1471-2105-12-214
Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: International Symposium on Experimental Algorithms (2012). https://doi.org/10.1007/978-3-642-30850-5_26
Marco-Sola, S., Sammeth, M., Guigó, R., Ribeca, P.: The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods 9, 1185–1188 (2012). https://doi.org/10.1038/nmeth.2221
Schnattinger, T., Ohlebusch, E., Gog, S.: Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput. 213, 13–22 (2012). https://doi.org/10.1016/j.ic.2011.03.007
Siragusa, E.: Approximate string matching for high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2015)
Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41, e78–e78 (2013). https://doi.org/10.1093/nar/gkt005
Ye, Y., Choi, J.-H., Tang, H.: Rapsearch: a fast protein similarity search tool for short reads. BMC Bioinform. 12, 1 (2011)
Acknowledgments
We would like to acknowledge Enrico Siragusa for his previous implementations of the FM index in SeqAn. The first author acknowledges the support of the International Max-Planck Research School for Computational Biology and Scientific Computing (IMPRS-CBSC). We also thank Veli Mäkinen and Simon Gog for very helpful remarks on a previous version of this manuscript during the Dagstuhl seminar 16351 “Next Generation Sequencing - Algorithms, and Software For Biomedical Applications”.
Appendix
For readers not familiar with FM and 2FM indices, this appendix gives a short introduction.
1.1 Introduction to the FM and 2FM Index
Given a text T of length n over an ordered, finite alphabet \(\varSigma = \{c_1, \dots , c_{\sigma }\}\) with \(\forall \, 1 \le i< \sigma : c_{i} <_{lex} c_{i+1}\), let T[i] denote the character at position i in T, \(\cdot \) the concatenation operator and T[1..i] the prefix of T up to the character at position i. \(T^{rev}\) represents the reversed text. We assume that T ends with a sentinel character \(\$\notin \varSigma \) that does not occur in any other position in T and is lexicographically smaller than any character in \(\varSigma \). The FM index needs the Burrows-Wheeler transform (BWT) of T. The BWT is the concatenation of characters in the last column of all lexicographically sorted cyclic permutations of the string T (see Fig. 3 for an example). We will refer to the BWT as L.
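The definition of the BWT above can be illustrated with a minimal sketch. It builds L naively by sorting all cyclic rotations, which takes far more time and space than the suffix-array-based constructions used in practice; the function name `bwt` is our own choice for illustration and not part of SeqAn.

```python
def bwt(text: str) -> str:
    """Return the BWT L of `text`, which must end with the unique sentinel '$'."""
    if not text.endswith("$") or text.count("$") != 1:
        raise ValueError("text must end with a unique sentinel '$'")
    n = len(text)
    # All cyclic rotations of the text, sorted lexicographically.
    # '$' (ASCII 36) sorts before letters and digits, which matches the
    # convention that the sentinel is smaller than every alphabet character.
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    # L is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)
```

For example, `bwt("mississippi$")` yields `"ipssm$pissii"`.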
In contrast to suffix trees or suffix arrays, where a prefix P of a pattern is extended by characters to the right (referred to as forward search \(P \rightarrow Pc\) for \(c \in \varSigma \)), the FM index can only be searched using backward search, i.e., by extending a suffix \(P'\) by characters to the left, \(P' \rightarrow cP'\). Performing a single-character backward search of c in the FM index requires two pieces of information: first, C(c), the number of characters in L that are lexicographically smaller than c, and second, Occ(c, i), the number of c’s in L[1..i]. Given a range [a, b] for P, i.e., the range of rows in the sorted list of cyclic permutations that start with P, we can compute the range \([a^\prime , b^\prime ]\) for cP as follows: \([a^\prime , b^\prime ] = [C(c) + Occ(c, a - 1) + 1, C(c) + Occ(c, b)]\). We will refer to the BWT together with the tables C and Occ as the FM index \(\mathcal {I}\) (see Fig. 3 for an example of one search step).
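The backward search step can be sketched as follows. This is an illustrative sketch only: Occ is computed by a naive linear scan rather than a rank dictionary, ranges are 1-based and inclusive as in the text, and the helper names (`build_C`, `occ`, `backward_step`, `backward_search`) are hypothetical.

```python
def build_C(L: str) -> dict:
    """C[c] = number of characters in L that are lexicographically smaller than c."""
    C, total = {}, 0
    for ch in sorted(set(L)):
        C[ch] = total
        total += L.count(ch)
    return C

def occ(L: str, c: str, i: int) -> int:
    """Occ(c, i): number of c's in L[1..i] (1-based, inclusive). Naive O(i) scan."""
    return L[:i].count(c)

def backward_step(L: str, C: dict, a: int, b: int, c: str):
    """Map the range [a, b] of P to the range [a', b'] of cP."""
    a_new = C[c] + occ(L, c, a - 1) + 1
    b_new = C[c] + occ(L, c, b)
    return a_new, b_new  # the range is empty if a_new > b_new

def backward_search(L: str, pattern: str):
    """Return the range of `pattern`, or None if it does not occur."""
    C = build_C(L)
    a, b = 1, len(L)  # the empty pattern matches every row
    for c in reversed(pattern):  # extend the matched suffix to the left
        if c not in C:
            return None
        a, b = backward_step(L, C, a, b, c)
        if a > b:
            return None
    return a, b
```

With `L = "ipssm$pissii"` (the BWT of `mississippi$`), `backward_search(L, "ssi")` returns `(11, 12)`, i.e., the pattern occurs twice.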
The 2FM index maintains two FM indices \(\mathcal {I}\) and \(\mathcal {I}^{rev}\), one for the original text T and one for the reversed text \(T^{rev}\). Searching a pattern left to right on the original text (i.e., conducting a forward search) corresponds to a backward search in \(\mathcal {I}^{rev}\); searching a pattern right to left in the original text corresponds to a backward search in \(\mathcal {I}\). The difficulty is to keep both indices synchronized whenever a search step is performed. W.l.o.g. we assume that we want to extend the pattern to the right, i.e., perform a forward search \(P \rightarrow Pc_j\) for some character \(c_j\). First, the backward search \(P^{rev} \rightarrow c_jP^{rev}\) is carried out on \(\mathcal {I}^{rev}\) and its new range \([a^\prime _{rev}, b^\prime _{rev}] = [C(c_j) + Occ(c_j, a_{rev} - 1) + 1, C(c_j) + Occ(c_j, b_{rev})]\) is computed. The new range in \(\mathcal {I}\) can then be calculated from the interval [a, b] for P in \(\mathcal {I}\) and the size of the new range in the reversed text's index: \([a^\prime , b^\prime ] = [a+smaller,a+smaller+ b^\prime _{rev} - a^\prime _{rev}]\). To compute smaller, Lam et al. [11] propose to perform \(\mathcal {O}(\sigma )\) backward searches \(P^{rev} \rightarrow c_iP^{rev}\) for all \(1 \le i < j\) and sum up the range sizes, i.e., \(smaller = \sum _{1 \le i< j} Occ(c_i, b_{rev}) - \sum _{1 \le i < j} Occ(c_i, a_{rev} - 1)\), leading to a total running time of \(\mathcal {O}(\sigma )\) (see Fig. 4 for an illustration).
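The synchronization of both ranges can be illustrated with the following sketch of the \(\mathcal {O}(\sigma )\) method. It is a simplified illustration under several assumptions of ours: the BWT is built naively, Occ is a linear scan, \(T^{rev}\) is taken to be the reversed text with the sentinel moved to the end, and all names (`FMIndex`, `extend`, `extend_right`, `extend_left`) are hypothetical. Note that the sum for smaller also counts the sentinel, since it sorts before every character and may follow P in the text.

```python
class FMIndex:
    """Naive FM index: BWT L, table C, and Occ by linear scan (illustration only)."""
    def __init__(self, text: str):
        n = len(text)
        rotations = sorted(text[i:] + text[:i] for i in range(n))
        self.L = "".join(r[-1] for r in rotations)
        self.C, total = {}, 0
        for ch in sorted(set(text)):
            self.C[ch] = total
            total += text.count(ch)

    def occ(self, c: str, i: int) -> int:
        return self.L[:i].count(c)  # number of c's in L[1..i]

def extend(primary, rng_primary, rng_other, c):
    """Backward step with c on `primary`; synchronize the range of the other index.

    Returns (new_primary_range, new_other_range) or None if no match remains.
    """
    a, b = rng_primary
    if c not in primary.C:
        return None
    a_new = primary.C[c] + primary.occ(c, a - 1) + 1
    b_new = primary.C[c] + primary.occ(c, b)
    if a_new > b_new:
        return None
    # smaller: occurrences in L[a..b] of characters smaller than c,
    # including the sentinel '$'. This loop is the O(sigma) part.
    smaller = sum(primary.occ(d, b) - primary.occ(d, a - 1)
                  for d in primary.C if d < c)
    a_other = rng_other[0] + smaller
    return (a_new, b_new), (a_other, a_other + (b_new - a_new))

def extend_right(fwd, rev, rng, rng_rev, c):
    """Forward search P -> Pc: backward step on the reversed-text index."""
    result = extend(rev, rng_rev, rng, c)
    if result is None:
        return None
    new_rev, new_fwd = result
    return new_fwd, new_rev

def extend_left(fwd, rev, rng, rng_rev, c):
    """Backward search P -> cP: backward step on the original-text index."""
    return extend(fwd, rng, rng_rev, c)
```

A search starts with the full range in both indices, e.g., `(1, 12)` for `fwd = FMIndex("mississippi$")` and `rev = FMIndex("ippississim$")`. Extending the empty pattern to the right by `s` and `i` and then to the left by `s` yields the range `(11, 12)` for `ssi` in the original index and `(4, 5)` for `iss` in the reversed one.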
The occurrence table Occ is usually not implemented by explicitly storing all of its values. Instead of storing the entire \(Occ : \varSigma \times \{1,\dots ,n\} \rightarrow \{1,\dots ,n\}\) one uses the more space-efficient constant time rank dictionary: for every \(c \in \varSigma \) a bit vector \(B_c[1..n]\) is constructed such that \(B_c[i] = 1\) if and only if \(L[i] = c\). Thus the occurrence value equals the number of 1’s in \(B_c[1..i]\), i.e., \(Occ(c, i) = rank(B_c, i)\). Jacobson [10] showed that rank queries can be answered in constant time using only \(o(n)\) additional space per bit vector. Since then many other constant time rank query data structures have been proposed. For an overview we refer the reader to [17], which contains a comparison of various implementations. Since we also make use of this technique, we explain the simplest variant, the 2-level rank dictionary, in the following section.
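The reduction of Occ to rank queries on per-character bit vectors can be sketched as follows. The rank function here is a naive linear scan and only serves to show the correspondence \(Occ(c, i) = rank(B_c, i)\); the names `char_bitvectors`, `rank`, and `occ` are chosen for illustration.

```python
def char_bitvectors(L: str) -> dict:
    """For every character c in L, build B_c with B_c[i] = 1 iff L[i] = c."""
    return {c: [1 if x == c else 0 for x in L] for c in set(L)}

def rank(B: list, i: int) -> int:
    """Number of 1's in B[1..i] (1-based, inclusive). Naive O(i) scan."""
    return sum(B[:i])

def occ(bitvectors: dict, c: str, i: int) -> int:
    """Occ(c, i) expressed as a rank query on the bit vector of c."""
    return rank(bitvectors[c], i)
```

Storing one bit vector per character is what leads to the factor \(\sigma \) in the space requirement of this approach.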
1.2 Constant Time Rank Queries
In order to store partial prefix sums, the technique uses two levels of lookup tables, called blocks and superblocks. Given a bit vector B of length n, we divide it into blocks of length \(\ell \) and superblocks of length \(\ell ^2\), where
\[\ell = \left\lfloor \frac{\log n}{2}\right\rfloor .\]
For blocks and superblocks we allocate arrays M and \(M'\) of sizes \(\left\lfloor \frac{n}{\ell }\right\rfloor \) and \(\left\lfloor \frac{n}{\ell ^2}\right\rfloor \), respectively (see Fig. 5 for an illustration).
For the m-th superblock we store the number of 1’s from the beginning of B to the end of the superblock in \(M'[m]=rank(B, m \cdot \ell ^2)\). As there are \(\left\lfloor \frac{n}{\ell ^2}\right\rfloor \) superblocks, \(M'\) can be stored in \(\mathcal {O}\left( \frac{n}{\ell ^2} \cdot \log n\right) =\mathcal {O}\left( \frac{n}{\log n}\right) =o(n)\) bits. For the m-th block we store the number of 1’s from the beginning of the overlapping superblock to the end of the block in \(M[m]=rank\big (B[1+k\ell ..n],(m-k) \cdot \ell \big )\), where \(k=\left\lfloor \frac{m-1}{\ell }\right\rfloor \ell \) is the total number of blocks in all superblocks left of the current superblock. M has \(\left\lfloor \frac{n}{\ell }\right\rfloor \) entries and can be stored in \(\mathcal {O}\left( \frac{n}{\ell } \cdot \log \ell ^2\right) =\mathcal {O}\left( \frac{ n\cdot \log \log n}{\log n}\right) =o(n)\) bits.
Given a rank query rank(B, i), one can now add the corresponding superblock and block values. But we still have to account for the 1’s in the block covering position i (in case i is not at the end of a block). Let P be a precomputed lookup table such that for each possible bit vector V of length \(\ell \) and each \(i\in [1..\ell ]\), \(P[V][i]=rank(V,i)\) holds. P has \(2^\ell \cdot \ell \) entries of values at most \(\ell \) and thus, since \(\ell \le \frac{\log n}{2}\), can be stored in
\[\mathcal {O}\left( 2^\ell \cdot \ell \cdot \log \ell \right) =\mathcal {O}\left( \sqrt{n} \cdot \log n \cdot \log \log n\right) =o(n)\]
bits. We now decompose a rank query into 3 subqueries using the precomputed tables. For a position i we determine the index \(p=\left\lfloor \frac{i-1}{\ell }\right\rfloor \) of the next block left of i and the index \(q=\left\lfloor \frac{p-1}{\ell }\right\rfloor \) of the next superblock left of block p. Then it holds:
\[rank(B, i) = M'[q] + M[p] + P\big [B[1+p\ell ..(p+1)\ell ]\big ][i - p\ell ].\]
Since the text T of length n has to be addressed, we assume that a register has at least size \(\left\lceil \log n\right\rceil \). Thus \(B[1+p\ell ..(p+1)\ell ]\) fits into a single register and can be determined in \(\mathcal {O}(1)\) time. Therefore a rank query can be answered in \(\mathcal {O}(1)\) time. In practice the precomputed lookup table P is replaced by a popcount operation on the CPU register and we have only two lookup operations.
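The 2-level scheme can be sketched as follows. This is a simplified illustration rather than the SeqAn implementation: the tables are built naively, the in-block count is a plain sum that stands in for the lookup table P or a hardware popcount, and we assume the conventions \(M[0] = M'[0] = 0\) and \(q = 0\) for \(p = 0\), which the text does not state explicitly. The class name `TwoLevelRank` and the optional parameter `ell` are hypothetical.

```python
class TwoLevelRank:
    """Rank dictionary with blocks of length ell and superblocks of length ell^2."""
    def __init__(self, bits, ell=None):
        self.bits = list(bits)
        n = len(self.bits)
        # Default block length floor(log2(n) / 2), at least 1.
        self.ell = ell or max(1, (n.bit_length() - 1) // 2)
        ell = self.ell
        # M_super[m] = rank(B, m * ell^2), with M_super[0] = 0.
        self.M_super = [0] * (n // (ell * ell) + 1)
        for m in range(1, len(self.M_super)):
            self.M_super[m] = sum(self.bits[:m * ell * ell])
        # M_block[m] = number of 1's from the start of the superblock that
        # contains block m up to the end of block m, with M_block[0] = 0.
        self.M_block = [0] * (n // ell + 1)
        for m in range(1, len(self.M_block)):
            k = ((m - 1) // ell) * ell  # blocks in all earlier superblocks
            self.M_block[m] = sum(self.bits[k * ell:m * ell])

    def rank(self, i: int) -> int:
        """Number of 1's in B[1..i] (1-based, inclusive)."""
        if i == 0:
            return 0
        ell = self.ell
        p = (i - 1) // ell                     # next block left of i
        q = (p - 1) // ell if p > 0 else 0     # next superblock left of block p
        in_block = sum(self.bits[p * ell:i])   # stand-in for P[...] or popcount
        return self.M_super[q] + self.M_block[p] + in_block
```

Each query reads one superblock entry and one block entry and inspects at most \(\ell \) bits of B, which corresponds to the three subqueries described above.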
One can now replace the occurrence table by this 2-level dictionary, i.e., by creating a bit vector \(B_c\) for every \(c \in \varSigma \) and setting \(B_c[i]\) to 1 if and only if \(L[i] = c\). This results in a space requirement of \(\mathcal {O}(\sigma \cdot n)+o(\sigma \cdot n)\) bits.
© 2017 Springer International Publishing AG
Cite this paper
Pockrandt, C., Ehrhardt, M., Reinert, K. (2017). EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices. In: Sahinalp, S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science(), vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_12
DOI: https://doi.org/10.1007/978-3-319-56970-3_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56969-7
Online ISBN: 978-3-319-56970-3