EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2017)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10229)

Abstract

The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If \(\sigma \) is the size of the alphabet, then the method of Lam et al. can conduct one step in time \(\mathcal {O}(\sigma )\) while needing space \(\mathcal {O}(\sigma \cdot n)\) using constant time rank queries on bit vectors. Schnattinger and colleagues improved this time to \(\mathcal {O}(\log \sigma )\) while using \(\mathcal {O}(\log \sigma \cdot n)\) bits of space for both the FM and the 2FM index. This is achieved by the use of binary wavelet trees.

In this paper we introduce a new, practical method for conducting an exact search in a uni- and bidirectional FM index in \(\mathcal {O}(1)\) time per step while using \(\mathcal {O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)\) bits of space. This is done by replacing the binary wavelet tree by a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dictionary).

We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition, we compared our implementation with other freely available implementations of bidirectional indices and show that ours is approximately 2.2 to 4.2 times faster. This will have a large impact on many bioinformatics applications that rely on practical implementations of (2)FM indices, e.g., for read mapping. To our knowledge this is the first implementation of a constant time method for a search step in 2FM indices.

References

  1. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 133–144. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40450-4_12

  2. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11, 31 (2015)

  3. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report (1994)

  4. Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinform. 9, 11 (2008). https://doi.org/10.1186/1471-2105-9-11

  5. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Annual Symposium on Foundations of Computer Science (2000). https://doi.org/10.1109/SFCS.2000.892127

  6. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms (TALG) 3, 20 (2007)

  7. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_28

  8. Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2003)

  9. Hauswedell, H., Singer, J., Reinert, K.: Lambda: the local aligner for massive biological data. Bioinformatics (Oxford, England) 30, i349–i355 (2014). https://doi.org/10.1093/bioinformatics/btu439

  10. Jacobson, G.J.: Succinct static data structures (1988)

  11. Lam, T., Li, R., Tam, A., Wong, S., Wu, E.: High throughput short read alignment via bi-directional BWT. In: Proceedings of BIBM, pp. 31–36 (2009). https://doi.org/10.1109/BIBM.2009.42

  12. Lam, T., Sung, W., Tam, S., Wong, C., Yiu, S.: Compressed indexing and local alignment of DNA. Bioinformatics 24, 791–797 (2008). https://doi.org/10.1093/bioinformatics/btn032

  13. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012)

  14. Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (2013)

  15. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009). https://doi.org/10.1093/bioinformatics/btp324

  16. Meyer, F., Kurtz, S., Backofen, R., Will, S., Beckstette, M.: Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinform. 12, 214 (2011). https://doi.org/10.1186/1471-2105-12-214

  17. Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: International Symposium on Experimental Algorithms (2012). https://doi.org/10.1007/978-3-642-30850-5_26

  18. Santiago, M., Sammeth, M., Guigo, R., Ribeca, P.: The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods 9, 1185–1188 (2012). https://doi.org/10.1038/nmeth.2221

  19. Schnattinger, T., Ohlebusch, E., Gog, S.: Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput. 213, 13–22 (2012). https://doi.org/10.1016/j.ic.2011.03.007

  20. Siragusa, E.: Approximate string matching for high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2015)

  21. Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41, e78–e78 (2013). https://doi.org/10.1093/nar/gkt005

  22. Ye, Y., Choi, J.-H., Tang, H.: Rapsearch: a fast protein similarity search tool for short reads. BMC Bioinform. 12, 1 (2011)

Acknowledgments

We would like to acknowledge Enrico Siragusa for his previous implementations of the FM index in SeqAn. The first author acknowledges the support of the International Max-Planck Research School for Computational Biology and Scientific Computing (IMPRS-CBSC). We also thank Veli Mäkinen and Simon Gog for very helpful remarks on a previous version of this manuscript during the Dagstuhl seminar 16351 “Next Generation Sequencing - Algorithms, and Software For Biomedical Applications”.

Corresponding author

Correspondence to Christopher Pockrandt.

Appendix

In this appendix we give a short introduction for readers who are not familiar with FM and 2FM indices.

1.1 Introduction to the FM and 2FM Index

Given a text T of length n over an ordered, finite alphabet \(\varSigma = \{c_1, \dots , c_{\sigma }\}\) with \(\forall \, 1 \le i< \sigma : c_{i} <_{lex} c_{i+1}\), let T[i] denote the character at position i in T, \(\cdot \) the concatenation operator and T[1..i] the prefix of T up to the character at position i. \(T^{rev}\) represents the reversed text. We assume that T ends with a sentinel character \(\$\notin \varSigma \) that does not occur in any other position in T and is lexicographically smaller than any character in \(\varSigma \). The FM index needs the Burrows-Wheeler transform (BWT) of T. The BWT is the concatenation of characters in the last column of all lexicographically sorted cyclic permutations of the string T (see Fig. 3 for an example). We will refer to the BWT as L.
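The definition of the BWT above can be illustrated with a minimal Python sketch. It builds all cyclic rotations explicitly, which takes quadratic space and is meant only as an illustration, not as the construction used in practice. The function name `bwt` is a hypothetical helper, and the sketch assumes that the sentinel `$` compares smaller than every other character of the text, which holds for ASCII letters.

```python
def bwt(text: str) -> str:
    """Return the BWT of `text`, which must end with a unique sentinel '$'."""
    n = len(text)
    # All cyclic rotations of the text in lexicographic order.
    # '$' sorts before every letter, matching the sentinel assumption.
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    # L is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


L = bwt("mississippi$")
print(L)  # ipssm$pissii
```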

Fig. 3. First step of the backward search for \(P=\texttt {ssi}\) in the FM index for the text \(T=\texttt {mississippi\$}\). The first interval [a, b] is the whole range [1, 12]. Among all matrix rows we search for those beginning with the last pattern character \(P[3]=i\). From \(Occ(i,0)=0\) and \(Occ(i,12)=4\) it follows that \(a'=C(i)+0+1=2\) and \(b'=C(i)+4=5\).

In contrast to suffix trees or suffix arrays, where a prefix P of a pattern is extended by characters to the right (referred to as forward search \(P \rightarrow Pc\) for \(c \in \varSigma \)), the FM index can only be searched using backward search, i.e., extending a suffix \(P'\) by characters to the left, \(P' \rightarrow cP'\). Performing a single-character backward search of c in the FM index requires two pieces of information: first, C(c), the number of characters in L that are lexicographically smaller than c, and second, Occ(c, i), the number of c's in L[1..i]. Given a range [a, b] for P, i.e., the range of rows in the sorted list of cyclic permutations that start with P, we can compute the range \([a^\prime , b^\prime ]\) for cP as follows: \([a^\prime , b^\prime ]\) = \([C(c) + Occ(c, a - 1) + 1, C(c) + Occ(c, b)]\). We will refer to the BWT together with tables C and Occ as FM index \(\mathcal {I}\) (see Fig. 3 for an example of one search step).
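The update rule for \([a^\prime , b^\prime ]\) can be sketched in a few lines of Python. This is a naive illustration under stated assumptions: the tables C and Occ are stored explicitly (the full Occ table is exactly what a rank dictionary avoids), ranges are 1-based and inclusive as in the text, and the names `build_fm_index` and `backward_search` are hypothetical helpers.

```python
def build_fm_index(text):
    """Naively build L, C and Occ for a text ending in a unique '$'."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    L = "".join(rotation[-1] for rotation in rotations)
    alphabet = sorted(set(text))
    # C[c]: number of characters in L that are lexicographically smaller than c.
    C = {c: sum(1 for x in L if x < c) for c in alphabet}
    # Occ[c][i]: number of occurrences of c in L[1..i]; Occ[c][0] = 0.
    Occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(L, start=1):
        for c in alphabet:
            Occ[c][i] = Occ[c][i - 1] + (1 if ch == c else 0)
    return L, C, Occ


def backward_search(pattern, C, Occ, n):
    """Return the 1-based inclusive range [a, b] for `pattern`, or None."""
    a, b = 1, n  # the empty pattern matches every row
    for c in reversed(pattern):  # extend the matched suffix to the left
        if c not in C:
            return None
        a, b = C[c] + Occ[c][a - 1] + 1, C[c] + Occ[c][b]
        if a > b:
            return None
    return a, b


text = "mississippi$"
L, C, Occ = build_fm_index(text)
print(backward_search("i", C, Occ, len(text)))    # (2, 5)
print(backward_search("ssi", C, Occ, len(text)))  # (11, 12)
```

The first call reproduces the single step for the character i on the whole range [1, 12]; the second call completes the search for ssi.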

The 2FM index maintains two FM indices \(\mathcal {I}\) and \(\mathcal {I}^{rev}\), one for the original text T and one for the reversed text \(T^{rev}\). Searching a pattern left to right on the original text (i.e., conducting a forward search) corresponds to a backward search in \(\mathcal {I}^{rev}\); searching a pattern right to left in the original text corresponds to a backward search in \(\mathcal {I}\). The difficulty is to keep both indices synchronized whenever a search step is performed. W.l.o.g. we assume that we want to extend the pattern to the right, i.e., perform a forward search \(P \rightarrow Pc_j\) for some character \(c_j\). First, the backward search \(P^{rev} \rightarrow c_jP^{rev}\) is carried out on \(\mathcal {I}^{rev}\) and its new range \([a^\prime _{rev}, b^\prime _{rev}] = [C(c_j) + Occ(c_j, a_{rev} - 1) + 1, C(c_j) + Occ(c_j, b_{rev})]\) is computed. The new range in \(\mathcal {I}\) can then be calculated from the interval [a, b] for P in \(\mathcal {I}\) and the size of the new range in the reversed text's index: \([a^\prime , b^\prime ] = [a+smaller,a+smaller+ b^\prime _{rev} - a^\prime _{rev}]\). To compute smaller, Lam et al. [11] propose to perform \(\mathcal {O}(\sigma )\) backward searches \(P^{rev} \rightarrow c_iP^{rev}\) for all \(1 \le i < j\) and sum up the range sizes, i.e., \(smaller = \sum _{1 \le i< j} Occ(c_i, b_{rev}) - \sum _{1 \le i < j} Occ(c_i, a_{rev} - 1)\), leading to a total running time of \(\mathcal {O}(\sigma )\) (see Fig. 4 for an illustration).
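The synchronized forward step with the \(\mathcal {O}(\sigma )\) computation of smaller can be sketched as follows. This is an illustration under several assumptions that are not taken from the paper: the reversed text is formed by reversing T without its sentinel and appending `$` again, the sum for smaller also counts the sentinel (so that an occurrence of P at the very end of T is accounted for), Occ is stored as a full table, and the names `build_tables`, `extend_right` and `forward_search` are hypothetical.

```python
def build_tables(text):
    """Naively build C and Occ for a text ending in a unique '$'."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    L = "".join(rotation[-1] for rotation in rotations)
    alphabet = sorted(set(text))  # includes the sentinel '$'
    C = {c: sum(1 for x in L if x < c) for c in alphabet}
    Occ = {c: [0] * (n + 1) for c in alphabet}  # Occ[c][i]: c's in L[1..i]
    for i, ch in enumerate(L, start=1):
        for c in alphabet:
            Occ[c][i] = Occ[c][i - 1] + (1 if ch == c else 0)
    return C, Occ, alphabet


def extend_right(c_j, rng, rng_rev, C_rev, Occ_rev, alphabet):
    """One forward step P -> P c_j; returns the new ranges in I and I^rev."""
    if c_j not in C_rev:
        return None
    (a, b), (a_rev, b_rev) = rng, rng_rev
    # Backward search step P^rev -> c_j P^rev on the reversed index.
    new_a_rev = C_rev[c_j] + Occ_rev[c_j][a_rev - 1] + 1
    new_b_rev = C_rev[c_j] + Occ_rev[c_j][b_rev]
    if new_a_rev > new_b_rev:
        return None
    # O(sigma) loop: total range size of all characters smaller than c_j.
    smaller = sum(Occ_rev[c][b_rev] - Occ_rev[c][a_rev - 1]
                  for c in alphabet if c < c_j)
    new_a = a + smaller
    return (new_a, new_a + new_b_rev - new_a_rev), (new_a_rev, new_b_rev)


def forward_search(pattern, text):
    """Search `pattern` left to right; returns (range in I, range in I^rev)."""
    text_rev = text[:-1][::-1] + "$"  # assumption: sentinel stays at the end
    C_rev, Occ_rev, alphabet = build_tables(text_rev)
    rng = rng_rev = (1, len(text))
    for c in pattern:
        step = extend_right(c, rng, rng_rev, C_rev, Occ_rev, alphabet)
        if step is None:
            return None
        rng, rng_rev = step
    return rng, rng_rev


print(forward_search("ssi", "mississippi$"))  # ((11, 12), (4, 5))
```

Only the tables of \(\mathcal {I}^{rev}\) are consulted in a forward step; the range in \(\mathcal {I}\) is derived purely from smaller and the size of the new reversed range. The range (11, 12) in \(\mathcal {I}\) is the same one that a backward search for ssi on \(\mathcal {I}\) yields.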

Fig. 4. When conducting a forward search \(P \rightarrow Pc_j\) we need to determine the subinterval of the suffix array interval for P, which is depicted on the left. In order to determine its start, we can compute in \(\mathcal {I}^{rev}\) the sizes of the intervals for all characters smaller than \(c_j\), depicted in dark gray on the right. The sum of all those sizes is exactly the needed offset from the beginning of the interval for P in \(\mathcal {I}\).

The occurrence table Occ is usually not implemented by storing the values of the entire table explicitly. Instead of storing the entire \(Occ : \varSigma \times \{1,\dots ,n\} \rightarrow \{0,\dots ,n\}\), one uses the more space-efficient constant time rank dictionary: for every \(c \in \varSigma \) a bit vector \(B_c[1..n]\) is constructed such that \(B_c[i] = 1\) if and only if \(L[i] = c\). Thus the occurrence value equals the number of 1's in \(B_c[1..i]\), i.e., \(Occ(c, i) = rank(B_c, i)\). Jacobson [10] showed that rank queries can be answered in constant time using only \(o(n)\) additional space per bit vector. Since then many other constant time rank query data structures have been proposed. For an overview we refer the reader to [17], which contains a comparison of various implementations. Since we will also make use of this technique, we explain the simplest idea, namely the one for 2-level rank dictionaries, in the following section.
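The correspondence \(Occ(c, i) = rank(B_c, i)\) can be checked on a small example. The sketch below uses a naive linear-time rank on Python lists, so it shows only the reduction from Occ to rank, not the constant time data structure; `char_bitvectors` and `rank` are hypothetical helper names, and the string used for L is the BWT of mississippi$.

```python
def char_bitvectors(L):
    """One 0/1 list B_c per character c, with B_c[i] = 1 iff L[i] = c."""
    return {c: [1 if ch == c else 0 for ch in L] for c in sorted(set(L))}


def rank(B, i):
    """Number of 1s in B[1..i] (1-based; i = 0 gives 0). Naive O(i) scan."""
    return sum(B[:i])


L = "ipssm$pissii"  # BWT of mississippi$
B = char_bitvectors(L)
print(B["i"])            # [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
print(rank(B["i"], 12))  # 4, i.e. Occ(i, 12)
print(rank(B["s"], 5))   # 2, i.e. Occ(s, 5)
```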

1.2 Constant Time Rank Queries

In order to store partial prefix sums, the technique uses two levels of lookup tables, called blocks and superblocks. Given a bit vector B of length n, we divide it into blocks of length \(\ell \) and superblocks of length \(\ell ^2\), where

$$\begin{aligned} \ell =\left\lfloor \frac{\log n}{2}\right\rfloor . \end{aligned}$$

For blocks and superblocks we allocate arrays M and \(M'\) of sizes \(\left\lfloor \frac{n}{\ell }\right\rfloor \) and \(\left\lfloor \frac{n}{\ell ^2}\right\rfloor \), respectively (see Fig. 5 for an illustration).

For the m-th superblock we store the number of 1’s from the beginning of B to the end of the superblock in \(M'[m]=rank(B, m \cdot \ell ^2)\). As there are \(\left\lfloor \frac{n}{\ell ^2}\right\rfloor \) superblocks, \(M'\) can be stored in \(\mathcal {O}\left( \frac{n}{\ell ^2} \cdot \log n\right) =\mathcal {O}\left( \frac{n}{\log n}\right) =o(n)\) bits. For the m-th block we store the number of 1’s from the beginning of the overlapping superblock to the end of the block in \(M[m]=rank\big (B[1+k\ell ..n],(m-k) \cdot \ell \big )\), where \(k=\left\lfloor \frac{m-1}{\ell }\right\rfloor \ell \) is the total number of blocks in all superblocks left of the current superblock. M has \(\left\lfloor \frac{n}{\ell }\right\rfloor \) entries and can be stored in \(\mathcal {O}\left( \frac{n}{\ell } \cdot \log \ell ^2\right) =\mathcal {O}\left( \frac{ n\cdot \log \log n}{\log n}\right) =o(n)\) bits.

Fig. 5. 2-level dictionary. Blocks and superblocks are allocated for each character (only one shown).

Given a rank query rank(B, i), one can now add the corresponding superblock and block values. But we still have to account for the 1's in the block covering position i (in case i is not at the end of a block). Let P be a precomputed lookup table such that for each possible bit vector V of length \(\ell \) and each \(i\in [1..\ell ]\), \(P[V][i]=rank(V,i)\) holds. P has \(2^\ell \cdot \ell \) entries with values of at most \(\ell \) and thus can be stored in

$$\begin{aligned} \mathcal {O}\left( 2^\ell \cdot \ell \cdot \log \ell \right) =\mathcal {O}\left( 2^{\frac{\log n}{2}} \cdot \log n \cdot \log \log n \right) = \mathcal {O}\left( \sqrt{n} \cdot \log n \cdot \log \log n \right) =o(n) \end{aligned}$$

bits. We now decompose a rank query into 3 subqueries using the precomputed tables. For a position i we determine the index \(p=\left\lfloor \frac{i-1}{\ell }\right\rfloor \) of the next block left of i and the index \(q=\left\lfloor \frac{p-1}{\ell }\right\rfloor \) of the next superblock left of block p. Then it holds:

$$rank(B,i)=M'[q]+M[p]+P\big [B[1+p\ell ..(p+1)\ell ]\big ]\big [i-p\ell \big ].$$

Since the text T of length n has to be addressed, we assume that a register has at least size \(\left\lceil \log n\right\rceil \). Thus \(B[1+p\ell ..(p+1)\ell ]\) fits into a single register and can be determined in \(\mathcal {O}(1)\) time. Therefore a rank query can be answered in \(\mathcal {O}(1)\) time. In practice the precomputed lookup table P is replaced by a popcount operation on the CPU register and we have only two lookup operations.
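A compact Python sketch of such a 2-level dictionary is given below. It is a variant under stated assumptions rather than a transcription of the formulas above: indices are 0-based, each counter stores the number of 1's before its (super)block starts instead of up to its end, every block is packed into one integer that plays the role of the register, and the lookup table P is replaced by a popcount on the masked block, as described for practical implementations. The class name `TwoLevelRank` is hypothetical.

```python
class TwoLevelRank:
    """Rank dictionary with superblock and block counters (0-based sketch)."""

    def __init__(self, bits):
        n = len(bits)
        # Block length l = floor(log2(n) / 2), at least 1; superblocks have l^2 bits.
        self.l = max(1, (n.bit_length() - 1) // 2)
        self.superblock = []  # number of 1s before the superblock starts
        self.block = []       # number of 1s from the superblock start to the block start
        self.words = []       # each block packed into one integer ("register")
        total = in_super = 0
        for start in range(0, n, self.l):
            if start % (self.l * self.l) == 0:  # a new superblock begins here
                self.superblock.append(total)
                in_super = 0
            self.block.append(in_super)
            chunk = bits[start:start + self.l]
            self.words.append(sum(bit << k for k, bit in enumerate(chunk)))
            ones = sum(chunk)
            total += ones
            in_super += ones

    def rank(self, i):
        """Number of 1s among the first i bits, for 0 <= i <= n."""
        if i == 0:
            return 0
        p = (i - 1) // self.l  # block containing the i-th bit
        q = p // self.l        # its superblock (l blocks per superblock)
        mask = (1 << (i - p * self.l)) - 1
        # Two table lookups plus one popcount on the masked block word.
        return (self.superblock[q] + self.block[p]
                + bin(self.words[p] & mask).count("1"))


bits = [1 if (i * i) % 7 in (1, 2) else 0 for i in range(200)]
rd = TwoLevelRank(bits)
print(rd.l, rd.rank(200) == sum(bits))  # 3 True
```

Storing exclusive prefix counts avoids the special case for the first block and superblock; the asymptotic space bounds derived above are not affected by this choice.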

One can now replace the occurrence table by this 2-level dictionary, i.e., by creating a bit vector \(B_c\) for every \(c \in \varSigma \) and setting \(B_c[i]\) to 1 if and only if \(L[i] = c\). This results in a space requirement of \(\mathcal {O}(\sigma \cdot n)+o(\sigma \cdot n)\) bits.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Pockrandt, C., Ehrhardt, M., Reinert, K. (2017). EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices. In: Sahinalp, S. (ed.) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham. https://doi.org/10.1007/978-3-319-56970-3_12

  • DOI: https://doi.org/10.1007/978-3-319-56970-3_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56969-7

  • Online ISBN: 978-3-319-56970-3

  • eBook Packages: Computer Science, Computer Science (R0)
