A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets

Ekim, Barış; Berger, Bonnie; Orenstein, Yaron

doi:10.1007/978-3-030-45257-5_3

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12074)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

As the volume of next-generation sequencing data increases, an urgent need for algorithms to efficiently process the data arises. Universal hitting sets (UHS) were recently introduced as an alternative to the central idea of minimizers in sequence analysis with the hopes that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A UHS is a set of k-mers that hits every sequence of length L, and can thus serve as indices to L-long sequences. Unfortunately, methods for computing small UHSs are not yet practical for real-world sequencing instances due to their serial and deterministic nature, which leads to long runtimes and high memory demands when handling typical values of k (e.g. $k > 13$). To address this bottleneck, we present two algorithmic innovations to significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize and decrease memory usage in calculating k-mer hitting numbers; and (ii) we build upon techniques from randomized Set Cover to select universal k-mers much faster. We implemented these innovations in PASHA, the first randomized parallel algorithm for generating near-optimal UHSs, which newly handles $k > 13$. We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small constant factor of the optimal size. PASHA’s runtime and memory usage improve on those of the current best algorithms by orders of magnitude. We expect our newly-practical construction of UHSs to be adopted in many high-throughput sequence analysis pipelines.

References

1. Berger, B., Peng, J., Singh, M.: Computational solutions for omics data. Nat. Rev. Genet. 14(5), 333 (2013)
2. Berger, B., Rompel, J., Shor, P.W.: Efficient NC algorithms for set cover with applications to learning and geometry. J. Comput. Syst. Sci. 49(3), 454–477 (1994)
3. DeBlasio, D., Gbosibo, F., Kingsford, C., Marçais, G.: Practical universal k-mer sets for minimizer schemes. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 167–176. ACM (2019)
4. Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015)
5. Johnson, D.S.: Approximation algorithms for combinatorial problems. J. Comput. Syst. Sci. 9(3), 256–278 (1974)
6. Kawulok, J., Deorowicz, S.: CoMeta: classification of metagenomes using k-mers. PLoS ONE 10(4), e0121453 (2015)
7. Kucherov, G.: Evolution of biosequence search algorithms: a brief survey. Bioinformatics 35(19), 3547–3552 (2019)
8. Leinonen, R., Sugawara, H., Shumway, M., International Nucleotide Sequence Database Collaboration: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2010)
9. Lovász, L.: On the ratio of optimal integral and fractional covers. Discret. Math. 13(4), 383–390 (1975)
10. Marçais, G., DeBlasio, D., Kingsford, C.: Asymptotically optimal minimizers schemes. Bioinformatics 34(13), i13–i22 (2018)
11. Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., Kingsford, C.: Improving the performance of minimizers and winnowing schemes. Bioinformatics 33(14), i110–i117 (2017)
12. Marçais, G., Solomon, B., Patro, R., Kingsford, C.: Sketching and sublinear data structures in genomics. Ann. Rev. Biomed. Data Sci. 2, 93–118 (2019)
13. Mykkeltveit, J.: A proof of Golomb’s conjecture for the de Bruijn graph. J. Comb. Theory 13(1), 40–45 (1972)
14. Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C.: Compact universal k-mer hitting sets. In: Frith, M., Storm Pedersen, C.N. (eds.) WABI 2016. LNCS, vol. 9838, pp. 257–268. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43681-4_21
15. Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C.: Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Comput. Biol. 13(10), e1005777 (2017)
16. Paindavoine, M., Vialla, B.: Minimizing the number of bootstrappings in fully homomorphic encryption. In: Dunkelman, O., Keliher, L. (eds.) SAC 2015. LNCS, vol. 9566, pp. 25–43. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31301-6_2
17. Qin, J., et al.: A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464(7285), 59 (2010)
18. Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)
19. Turnbaugh, P.J., Ley, R.E., Hamady, M., Fraser-Liggett, C.M., Knight, R., Gordon, J.I.: The human microbiome project. Nature 449(7164), 804 (2007)
20. Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Yu, D.W.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(6), S1 (2012)

Acknowledgments

This work was supported by NIH grant R01GM081871 to B.B. B.E. was supported by the MISTI MIT-Israel program at MIT and Ben-Gurion University of the Negev. We gratefully acknowledge the support of Intel Corporation for giving access to the Intel® AI DevCloud platform used for part of this work.

Author information

Authors and Affiliations

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
Barış Ekim & Bonnie Berger
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
Barış Ekim & Bonnie Berger
School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, 8410501, Beer-Sheva, Israel
Yaron Orenstein

Corresponding authors

Correspondence to Bonnie Berger or Yaron Orenstein.

Editor information

Editors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, USA
Russell Schwartz

Appendices

A Emulating the Greedy Algorithm

The greedy Set Cover algorithm was developed independently by Johnson and Lovász for unweighted vertices [5, 9]. Lovász [9] proved:

Theorem 1

The greedy algorithm for Set Cover outputs cover R with $|R| \le (1 + \log T_{max})|OPT|$, where $T_{max}$ is the maximum cardinality of a set.
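As a concrete illustration of the algorithm Theorem 1 refers to, the following is a minimal sketch of greedy Set Cover on a toy instance. The instance, the function name `greedy_set_cover`, and the use of the natural logarithm are assumptions made for illustration; none of them come from the paper.

```python
import math

def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset that covers the most still-uncovered elements."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Index of the subset with the largest overlap with the uncovered elements.
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("subsets do not cover the universe")
        cover.append(best)
        uncovered -= subsets[best]
    return cover

# Hypothetical toy instance (not from the paper).
universe = range(1, 8)
subsets = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}, {5, 6, 7}]
cover = greedy_set_cover(universe, subsets)

# T_max is the maximum cardinality of a set; the natural log is assumed here.
t_max = max(len(s) for s in subsets)
bound_factor = 1 + math.log(t_max)
```

On this instance no single subset covers all seven elements, so any cover has at least two subsets, and the greedy cover can be compared against `bound_factor` times that optimum.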

We adapt a definition for an algorithm emulating the greedy algorithm for the Set Cover problem to the second phase of DOCKS [2]. We say that an algorithm for the second phase of DOCKS $\alpha $-emulates the greedy algorithm if it outputs a set of vertices serially, during which it selects a vertex set A such that

$$\frac{|A|}{|P_A|} \le \frac{\alpha }{T_{max}},$$

where $P_A$ is the set of $\ell $-long paths covered by A. Using this definition, we obtain a near-optimal approximation guarantee via the following theorem:

Theorem 2

An algorithm for the second phase of DOCKS that $\alpha $-emulates the greedy algorithm produces cover $R \subseteq V$ with $|R| \le \alpha (1 + \log T_{max})|OPT|$, where OPT is the optimal cover.
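The $\alpha $-emulation condition $|A|/|P_A| \le \alpha /T_{max}$ can be checked directly for a candidate batch. The sketch below assumes that paths are given as tuples of vertex labels, that the list of uncovered paths is non-empty, and that $T_{max}$ is the largest hitting number among the currently uncovered paths; the function name `emulation_ratio_ok` and the toy data are hypothetical.

```python
from collections import Counter

def emulation_ratio_ok(batch, uncovered_paths, alpha):
    """Return True if |A| / |P_A| <= alpha / T_max for the batch A."""
    # Hitting number of each vertex: how many uncovered paths pass through it.
    hits = Counter(v for path in uncovered_paths for v in set(path))
    t_max = max(hits.values())
    # P_A: the uncovered paths that contain at least one vertex of the batch.
    batch = set(batch)
    covered = [p for p in uncovered_paths if batch & set(p)]
    if not covered:
        return False  # a batch that covers nothing cannot satisfy the ratio
    return len(batch) / len(covered) <= alpha / t_max

# Hypothetical toy paths on vertices a..e (not from the paper).
paths = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
good = emulation_ratio_ok({"b", "d"}, paths, alpha=1)   # covers all 4 paths
weak = emulation_ratio_ok({"a", "e"}, paths, alpha=1)   # covers only 2 paths
```

In this toy case the batch {b, d} meets the condition with $\alpha = 1$, while {a, e} needs a larger $\alpha $ because each of its vertices hits only one path.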

Proof

We define the cost of covering path p as $\mathcal {C}(p) = \frac{|S|}{|P_S|}$, where S is the set of vertices selected in the selection step in which p was covered, and $P_S$ the set of $\ell $-long paths covered by S. Then, $\sum _{p \in P_S} \mathcal {C}(p) = |S|$.

Let $P_\ell $ be the set of all $\ell $-long paths in G. A fractional cover of graph $G = (V, E)$ is a function $\mathcal {F}: V \rightarrow [0, 1]$ s.t. for all $p \in P_\ell $, $\sum _{v \in p} \mathcal {F}(v) \ge 1$. The optimal cover $\mathcal {F}_{OPT}$ has minimum $\sum _{v \in V} \mathcal {F}_{OPT}(v)$.

Let $\mathcal {F}$ be such an optimal fractional cover. The size of the cover produced is

$$|R| = \sum _{p \in P_{\ell }}\mathcal {C}(p) \le \sum _{v \in V} \Big ( \mathcal {F}(v) \sum _{p \in P_v}\mathcal {C}(p) \Big )$$

where $P_v$ is the set of all $\ell $-long paths through vertex v.

Lemma 1

There are at most $\frac{\alpha }{k}$ paths $p \in P_v$ such that $\mathcal {C}(p) \ge k$ for any v, k.

Proof

Assume the contrary: Before such a path p is covered, $T(v, \ell ) > \frac{\alpha }{k}$. Thus,

$$\begin{aligned} \frac{|S|}{|P_S|}&\ge k > \alpha /{T(v, \ell )} \ge {\alpha }/{T_{max}}, \end{aligned}$$

contradicting the definition.

Suppose we rank the $T(v, \ell )$ paths $p \in P_v$ by decreasing order of $\mathcal {C}(p)$. By Lemma 1, if the ith path has cost k, then $i \le \alpha /k$. Then, we can write

$$\begin{aligned} \sum _{p \in P_v} \mathcal {C}(p)&\le \sum _{i=1}^{T(v, \ell )} \alpha /i \le \alpha \sum _{i=1}^{T(v, \ell )} 1/i \le \alpha (1 + \log T(v, \ell )) \le \alpha (1 + \log T_{max}) \end{aligned}$$

Then,

$$\sum _{p \in P_{\ell }}\mathcal {C}(p) \le \sum _{v \in V}\mathcal {F}(v)\alpha (1+\log T_{max})$$

and finally

$$|R| \le \alpha (1+\log T_{max})|OPT|.$$
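The harmonic-sum step in the proof above, $\sum _{i=1}^{n} 1/i \le 1 + \log n$, can be sanity-checked numerically. This is a small sketch that assumes $\log $ denotes the natural logarithm; the helper name `harmonic` is hypothetical.

```python
import math

def harmonic(n):
    """Return the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

# Compare H_n with 1 + ln(n) for a few sample sizes.
checks = {n: harmonic(n) <= 1 + math.log(n) for n in (1, 10, 1000)}
```

Every entry of `checks` is expected to be true, with equality at $n = 1$.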

In PASHA, we ensure that in step t, the sum of vertex hitting numbers of selected vertex set $V_t$ is at least $|V_t|(1+\varepsilon )^t(1 - 4\delta -2\varepsilon )$. We now show that this is satisfied with high probability in each step.

Theorem 3

With probability at least 1/2, the sum of vertex hitting numbers of selected vertex set $V_t$ at step t is at least $|V_t|(1+\varepsilon )^t (1 - 4\delta - 2\varepsilon )$.

Proof

For any vertex v in selected vertex set $V_t$ at step t, let $X_v$ be an indicator variable for the random event that vertex v is picked, and $f(X) = \sum _{v \in V_t} X_v$.

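Note that $\mathrm {Var}{[f(X)]} \le |V_t| \cdot \delta /\ell $ and $|V_t| \ge \ell /\delta ^3$, since we are given that no vertex covers a $\delta ^3$ fraction of the $\ell $-long paths covered by the vertices in $V_t$. Combining the two bounds gives $\mathrm {Var}{[f(X)]} \le |V_t|^2 \cdot \delta ^4/\ell ^2$, so the standard deviation of $f(X)$ is at most $|V_t| \cdot \delta ^2/\ell $. By Chebyshev’s inequality, for any $k > 0$,

$$\mathrm {Pr}[|f(X) - \mathrm {E}[f(X)]| \ge k(|V_t|\cdot \delta ^2 / \ell )] \le \frac{1}{k^2}$$

and, taking $k = 2$, with probability at least 3/4,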

$$(f(X) - \mathrm {E}[f(X)])^2 \le 4|V_t|^2\cdot \frac{\delta ^4}{\ell ^2}$$

and

$$|f(X) - \mathrm {E}[f(X)]| \le 2|V_t|\cdot \frac{\delta ^2}{\ell }.$$
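The concentration bound $|f(X) - \mathrm {E}[f(X)]| \le 2|V_t|\cdot \delta ^2/\ell $ can be illustrated with a small Monte Carlo experiment. The sketch assumes that each vertex is picked independently with probability $\delta /\ell $ (the probability implied by the expectation of $X_uX_v$ used later in this proof); the parameter values, the seed, and the function name `fraction_within_bound` are hypothetical and chosen so that $|V_t| \ge \ell /\delta ^3$.

```python
import random

def fraction_within_bound(n_vertices, delta, ell, trials=1000, seed=0):
    """Estimate how often |f(X) - E[f(X)]| <= 2 * |V_t| * delta^2 / ell holds."""
    rng = random.Random(seed)
    p = delta / ell                          # assumed per-vertex pick probability
    mean = n_vertices * p                    # E[f(X)]
    bound = 2 * n_vertices * delta**2 / ell  # deviation allowed by the proof
    within = 0
    for _ in range(trials):
        # f(X): number of vertices picked in this trial.
        picked = sum(rng.random() < p for _ in range(n_vertices))
        if abs(picked - mean) <= bound:
            within += 1
    return within / trials

# |V_t| = 625 equals ell / delta^3 for delta = 0.2 and ell = 5.
frac = fraction_within_bound(n_vertices=625, delta=0.2, ell=5)
```

Chebyshev's inequality only guarantees a fraction of at least 3/4; the empirical fraction can be considerably higher, because the inequality is loose for sums of independent indicators.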

Let $P_{V_t}$ denote the set of $\ell $-long paths covered by vertex set $V_t$. Then,

$$|P_{V_t}| \ge \sum _{u \in V_t} T(u, \ell ) X_u - \sum _{p \in P_{V_t}} \sum _{u, v \in p} X_uX_v$$

We know that $\sum _{u \in V_t} T(u, \ell ) X_u \ge |V_t|(1+\varepsilon )^{t-1}$, which is bounded below by $((\delta - 2\delta ^2) \cdot |V_t|(1+\varepsilon )^{t-1})/\ell $. Let $g(X) = \sum _{p \in P_{V_t}} \sum _{u, v \in p} X_uX_v$. Then,

$$\begin{aligned} \mathrm {E}[g(X)]&= \sum _{p \in P_{V_t}} \mathrm {E}\Big [\sum _{u, v \in p} X_uX_v\Big ] = \sum _{p \in P_{V_t}} \binom{\ell }{2} (\delta /\ell )^2 =\sum _{p \in P_{V_t}} \frac{(\ell -1)\cdot \delta ^2}{2\ell } \le \sum _{p \in P_{V_t}}\frac{\delta ^2}{2}. \end{aligned}$$

Hence, with probability at least 3/4,

$$g(X) \le 4\mathrm {E}[g(X)] \le 2\delta ^2\cdot |V_t|(1+\varepsilon )^t$$

Both events hold with probability at least 1/2, and the sum of vertex hitting numbers is at least

$$\begin{aligned} ((\delta - 2\delta ^2) \cdot |V_t|(1+\varepsilon )^{t-1}) \cdot \ell - 2\delta ^2\cdot |V_t|(1+\varepsilon )^t&\ge |V_t|(1+\varepsilon )^{t-1}(\delta \ell -2\delta ^2\ell -2\delta ^2-2\delta ^2\varepsilon )\\&\ge |V_t|(1+\varepsilon )^{t}(\delta \ell -2\delta ^2\ell -2\delta ^2-2\delta ^2\varepsilon )/(1+\varepsilon )\\&\ge |V_t|(1+\varepsilon )^{t}(1-4\delta -2\varepsilon ). \end{aligned}$$
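The per-path term in the expectation of $g(X)$ computed in the proof above, $\binom{\ell }{2}(\delta /\ell )^2 = (\ell -1)\delta ^2/(2\ell ) \le \delta ^2/2$, can be verified with exact rational arithmetic. The value of $\delta $ and the function name `expected_pairs_per_path` below are hypothetical choices for illustration.

```python
from fractions import Fraction
from math import comb

def expected_pairs_per_path(ell, delta):
    """C(ell, 2) vertex pairs on a path, each picked together w.p. (delta/ell)^2."""
    return comb(ell, 2) * (delta / ell) ** 2

delta = Fraction(1, 20)  # hypothetical value, kept exact with Fraction
results = {}
for ell in (2, 10, 100):
    value = expected_pairs_per_path(ell, delta)
    closed_form = Fraction(ell - 1, 2 * ell) * delta**2
    # Store whether the identity and the upper bound delta^2 / 2 both hold.
    results[ell] = (value == closed_form, value <= delta**2 / 2)
```

Using `Fraction` avoids floating-point rounding, so the equality is checked exactly rather than up to a tolerance.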

B Runtime Analysis

Here, we bound the number of selection steps and the average-time asymptotic complexity of PASHA.

Lemma 2

The number of selection steps is $O(\log |V| \log |P_\ell |/(\varepsilon \delta ^3m))$.

Proof

The number of steps is $O(\log |V|/\varepsilon )$, and within each step, there are $O(\log |P_S|/(\delta ^3m))$ selection steps (where $P_S$ is the sum of vertex hitting numbers of the vertex set S for that step and m the number of threads used), since we are guaranteed to remove at least $\delta ^3$ fraction of the paths during that step. Overall, there are $O(\log |V| \log |P_{\ell }|/(\varepsilon \delta ^3m))$ selection steps.

Theorem 4

For $\varphi < 1$, there is an approximation algorithm for the second phase of DOCKS that runs in $O((L^2 \cdot |\varSigma |^{k+1}\cdot \log ^2 (|\varSigma |^{k}))/(\varepsilon \delta ^3m))$ average time, where m is the number of threads used, and produces a cover of size at most $(1+\varphi )(1 + \log T_{max})$ times the optimal size, where $1+ \varphi = 1/(1-4\delta -2\varepsilon )$.

Proof

Follows immediately from Theorem 2 and Lemma 2.
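To see how the parameters in Theorem 4 interact, the relation $1+\varphi = 1/(1-4\delta -2\varepsilon )$ can be evaluated for sample values. This is a sketch only: the values of $\delta $, $\varepsilon $, and $T_{max}$ are hypothetical, the natural logarithm is assumed, and the function name `approximation_factor` does not come from the paper.

```python
import math

def approximation_factor(delta, epsilon, t_max):
    """Return (phi, (1 + phi) * (1 + ln T_max)) for the given parameters."""
    slack = 1 - 4 * delta - 2 * epsilon
    # phi < 1 is equivalent to 1 / slack < 2, i.e. slack > 1/2.
    if slack <= 0.5:
        raise ValueError("need 1 - 4*delta - 2*epsilon > 1/2 so that phi < 1")
    phi = 1 / slack - 1
    return phi, (1 + phi) * (1 + math.log(t_max))

# Hypothetical parameter choice: delta = epsilon = 0.05, T_max = 1000.
phi, factor = approximation_factor(delta=0.05, epsilon=0.05, t_max=1000)
```

With these values $\varphi $ is about 0.43. Shrinking $\delta $ and $\varepsilon $ tightens the approximation factor but, by the runtime bound in Theorem 4, increases the running time through the $1/(\varepsilon \delta ^3)$ term.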

Cite this paper

Ekim, B., Berger, B., Orenstein, Y. (2020). A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets. In: Schwartz, R. (eds) Research in Computational Molecular Biology. RECOMB 2020. Lecture Notes in Computer Science, vol 12074. Springer, Cham. https://doi.org/10.1007/978-3-030-45257-5_3

DOI: https://doi.org/10.1007/978-3-030-45257-5_3
Published: 21 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45256-8
Online ISBN: 978-3-030-45257-5
eBook Packages: Computer Science, Computer Science (R0)
