A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2020)

Abstract

As the volume of next-generation sequencing data increases, the need for algorithms that process the data efficiently becomes urgent. Universal hitting sets (UHS) were recently introduced as an alternative to the central idea of minimizers in sequence analysis, with the hope that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A UHS is a set of k-mers that hits every sequence of length L, and can thus serve as a set of indices to L-long sequences. Unfortunately, methods for computing small UHSs are not yet practical for real-world sequencing instances due to their serial and deterministic nature, which leads to long runtimes and high memory demands when handling typical values of k (e.g. \(k > 13\)). To address this bottleneck, we present two algorithmic innovations that significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize and decrease memory usage in calculating k-mer hitting numbers; and (ii) we build upon techniques from randomized Set Cover to select universal k-mers much faster. We implemented these innovations in PASHA, the first randomized parallel algorithm for generating near-optimal UHSs, which newly handles \(k > 13\). We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small constant factor of the optimal size. PASHA improves runtime and memory usage by orders of magnitude over the current best algorithms. We expect our newly practical construction of UHSs to be adopted in many high-throughput sequence analysis pipelines.

References

  1. Berger, B., Peng, J., Singh, M.: Computational solutions for omics data. Nat. Rev. Genet. 14(5), 333 (2013)

  2. Berger, B., Rompel, J., Shor, P.W.: Efficient NC algorithms for set cover with applications to learning and geometry. J. Comput. Syst. Sci. 49(3), 454–477 (1994)

  3. DeBlasio, D., Gbosibo, F., Kingsford, C., Marçais, G.: Practical universal k-mer sets for minimizer schemes. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 167–176. ACM (2019)

  4. Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015)

  5. Johnson, D.S.: Approximation algorithms for combinatorial problems. J. Comput. Syst. Sci. 9(3), 256–278 (1974)

  6. Kawulok, J., Deorowicz, S.: CoMeta: classification of metagenomes using k-mers. PLoS ONE 10(4), e0121453 (2015)

  7. Kucherov, G.: Evolution of biosequence search algorithms: a brief survey. Bioinformatics 35(19), 3547–3552 (2019)

  8. Leinonen, R., Sugawara, H., Shumway, M., International Nucleotide Sequence Database Collaboration: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2010)

  9. Lovász, L.: On the ratio of optimal integral and fractional covers. Discret. Math. 13(4), 383–390 (1975)

  10. Marçais, G., DeBlasio, D., Kingsford, C.: Asymptotically optimal minimizers schemes. Bioinformatics 34(13), i13–i22 (2018)

  11. Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., Kingsford, C.: Improving the performance of minimizers and winnowing schemes. Bioinformatics 33(14), i110–i117 (2017)

  12. Marçais, G., Solomon, B., Patro, R., Kingsford, C.: Sketching and sublinear data structures in genomics. Ann. Rev. Biomed. Data Sci. 2, 93–118 (2019)

  13. Mykkeltveit, J.: A proof of Golomb’s conjecture for the de Bruijn graph. J. Comb. Theory 13(1), 40–45 (1972)

  14. Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C.: Compact universal k-mer hitting sets. In: Frith, M., Storm Pedersen, C.N. (eds.) WABI 2016. LNCS, vol. 9838, pp. 257–268. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43681-4_21

  15. Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C.: Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Comput. Biol. 13(10), e1005777 (2017)

  16. Paindavoine, M., Vialla, B.: Minimizing the number of bootstrappings in fully homomorphic encryption. In: Dunkelman, O., Keliher, L. (eds.) SAC 2015. LNCS, vol. 9566, pp. 25–43. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31301-6_2

  17. Qin, J., et al.: A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464(7285), 59 (2010)

  18. Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)

  19. Turnbaugh, P.J., Ley, R.E., Hamady, M., Fraser-Liggett, C.M., Knight, R., Gordon, J.I.: The human microbiome project. Nature 449(7164), 804 (2007)

  20. Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Yu, D.W.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(6), S1 (2012)

Acknowledgments

This work was supported by NIH grant R01GM081871 to B.B. B.E. was supported by the MISTI MIT-Israel program at MIT and Ben-Gurion University of the Negev. We gratefully acknowledge the support of Intel Corporation for giving access to the Intel® AI DevCloud platform used for part of this work.

Author information

Correspondence to Bonnie Berger or Yaron Orenstein.

Appendices

A Emulating the Greedy Algorithm

The greedy Set Cover algorithm was developed independently by Johnson and Lovász for unweighted vertices [5, 9]. Lovász [9] proved:

Theorem 1

The greedy algorithm for Set Cover outputs cover R with \(|R| \le (1 + \log T_{max})|OPT|\), where \(T_{max}\) is the maximum cardinality of a set.
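
As an illustration of Theorem 1, the following is a minimal sketch of the greedy Set Cover rule on a toy instance. The instance, the function names, and the brute-force optimum are illustrative assumptions and not part of the original work; the logarithm is assumed to be natural.

```python
import math
from itertools import combinations


def greedy_set_cover(universe, sets):
    """Repeatedly pick the set that covers the most still-uncovered elements."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the sets do not cover the universe")
        cover.append(best)
        uncovered -= sets[best]
    return cover


def optimal_cover_size(universe, sets):
    """Brute-force optimum; only feasible for tiny instances."""
    names = list(sets)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            if set().union(*(sets[n] for n in combo)) >= set(universe):
                return size
    raise ValueError("the sets do not cover the universe")


# Hypothetical toy instance: greedy is lured by the large set "C".
universe = range(1, 13)
sets = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {7, 8, 9, 10, 11, 12},
    "C": {1, 2, 3, 7, 8, 9, 10},
    "D": {4, 5, 11},
    "E": {6, 12},
}

cover = greedy_set_cover(universe, sets)
opt = optimal_cover_size(universe, sets)
t_max = max(len(s) for s in sets.values())
bound = (1 + math.log(t_max)) * opt
print(cover, opt, bound)
```

On this instance greedy selects three sets while the optimum uses two, which stays within the \((1 + \log T_{max})|OPT|\) guarantee.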

We adapt the definition of an algorithm emulating the greedy algorithm for the Set Cover problem [2] to the second phase of DOCKS. We say that an algorithm for the second phase of DOCKS \(\alpha \)-emulates the greedy algorithm if it outputs vertex sets serially, and each vertex set A that it selects satisfies

$$\frac{|A|}{|P_A|} \le \frac{\alpha }{T_{max}},$$

where \(P_A\) is the set of \(\ell \)-long paths covered by A. Using this definition, we obtain a near-optimal approximation guarantee:

Theorem 2

An algorithm for the second phase of DOCKS that \(\alpha \)-emulates the greedy algorithm produces cover \(R \subseteq V\) with \(|R| \le \alpha (1 + \log T_{max})|OPT|\), where OPT is the optimal cover.

Proof

We define the cost of covering path p as \(\mathcal {C}(p) = \frac{|S|}{|P_S|}\), where S is the set of vertices selected in the selection step in which p was covered, and \(P_S\) the set of \(\ell \)-long paths covered by S. Then, \(\sum _{p \in P_S} \mathcal {C}(p) = |S|\).

Let \(P_\ell \) be the set of all \(\ell \)-long paths in G. A fractional cover of graph \(G = (V, E)\) is a function \(\mathcal {F}: V \rightarrow [0, 1]\) s.t. for all \(p \in P_\ell \), \(\sum _{v \in p} \mathcal {F}(v) \ge 1\). The optimal fractional cover \(\mathcal {F}_{OPT}\) has minimum \(\sum _{v \in V} \mathcal {F}_{OPT}(v)\).

Let \(\mathcal {F}\) be such an optimal fractional cover. The size of the cover produced is

$$|R| = \sum _{p \in P_{\ell }}\mathcal {C}(p) \le \sum _{v \in V} \Big ( \mathcal {F}(v) \sum _{p \in P_v}\mathcal {C}(p) \Big )$$

where \(P_v\) is the set of all \(\ell \)-long paths through vertex v.

Lemma 1

For any vertex v and any k, there are at most \(\frac{\alpha }{k}\) paths \(p \in P_v\) such that \(\mathcal {C}(p) \ge k\).

Proof

Assume the contrary: Before such a path p is covered, \(T(v, \ell ) > \frac{\alpha }{k}\). Thus,

$$\begin{aligned} \frac{|S|}{|P_S|}&\ge k > \alpha /{T(v, \ell )} \ge \alpha /{T_{max}}, \end{aligned}$$

contradicting the definition of \(\alpha \)-emulation.

Suppose we rank the \(T(v, \ell )\) paths \(p \in P_v\) by decreasing order of \(\mathcal {C}(p)\). By Lemma 1, if the ith path has cost k, then \(i \le \alpha /k\), i.e., \(k \le \alpha /i\). Then, we can write

$$\begin{aligned} \sum _{p \in P_v} \mathcal {C}(p)&\le \sum _{i=1}^{T(v, \ell )} \alpha /i \le \alpha \sum _{i=1}^{T(v, \ell )} 1/i \le \alpha (1 + \log T(v, \ell )) \le \alpha (1 + \log T_{max}) \end{aligned}$$

Then,

$$\sum _{p \in P_{\ell }}\mathcal {C}(p) \le \sum _{v \in V}\mathcal {F}(v)\alpha (1+\log T_{max})$$

and finally

$$|R| \le \alpha (1+\log T_{max})|OPT|.$$
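
The cost accounting used in this proof can be checked numerically. The sketch below covers the special case in which one vertex is selected per step (so \(|S| = 1\) and \(\alpha = 1\)); the toy instance and all names are hypothetical, with `paths_through[v]` standing in for \(P_v\). It charges each newly covered path \(\mathcal {C}(p) = |S|/|P_S|\) and confirms that the costs sum to \(|R|\) and that each vertex's paths carry total cost at most \(1 + \log T(v, \ell )\), assuming a natural logarithm.

```python
import math
from fractions import Fraction


def greedy_with_costs(paths_through):
    """Greedy cover selecting one vertex per step; returns (cover, cost per path)."""
    uncovered = set().union(*paths_through.values())
    cover, cost = [], {}
    while uncovered:
        v = max(paths_through, key=lambda u: len(paths_through[u] & uncovered))
        newly = paths_through[v] & uncovered
        for p in newly:
            # C(p) = |S| / |P_S| with |S| = 1 and P_S the newly covered paths.
            cost[p] = Fraction(1, len(newly))
        cover.append(v)
        uncovered -= newly
    return cover, cost


# Hypothetical instance: vertex name -> identifiers of the paths through it.
paths_through = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {7, 8, 9, 10, 11, 12},
    "C": {1, 2, 3, 7, 8, 9, 10},
    "D": {4, 5, 11},
    "E": {6, 12},
}

cover, cost = greedy_with_costs(paths_through)
total_cost = sum(cost.values())  # equals |R| exactly
per_vertex = {v: sum(cost[p] for p in ps) for v, ps in paths_through.items()}
print(cover, total_cost, per_vertex)
```

Exact fractions are used so that the identity \(\sum _{p} \mathcal {C}(p) = |R|\) holds without floating-point error.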

In PASHA, we ensure that in step t, the sum of vertex hitting numbers of selected vertex set \(V_t\) is at least \(|V_t|(1+\varepsilon )^t(1 - 4\delta -2\varepsilon )\). We now show that this is satisfied with high probability in each step.

Theorem 3

With probability at least 1/2, the sum of vertex hitting numbers of selected vertex set \(V_t\) at step t is at least \(|V_t|(1+\varepsilon )^t (1 - 4\delta - 2\varepsilon )\).

Proof

For any vertex v in the selected vertex set \(V_t\) at step t, let \(X_v\) be an indicator variable for the random event that vertex v is picked, and let \(f(X) = \sum _{v \in V_t} X_v\).

Note that \(\mathrm {Var}{[f(X)]} \le |V_t| \cdot \delta /\ell \), and \(|V_t| \ge \ell /\delta ^3\), since we are given that no vertex covers a \(\delta ^3\) fraction of the \(\ell \)-long paths covered by the vertices in \(V_t\). By Chebyshev’s inequality, for any \(k \ge 0\),

$$\mathrm {Pr}[|f(X) - \mathrm {E}[f(X)]| \ge k(|V_t|\cdot \delta / \ell )] \le \frac{1}{k^2}$$

and with probability at least 3/4,

$$(f(X) - \mathrm {E}[f(X)])^2 \le 4|V_t|^2\cdot \frac{\delta ^4}{\ell ^2}$$

and

$$|f(X) - \mathrm {E}[f(X)]| \le 2|V_t|\cdot \frac{\delta ^2}{\ell }.$$

Let \(P_{V_t}\) denote the set of \(\ell \)-long paths covered by vertex set \(V_t\). Then,

$$|P_{V_t}| \ge \sum _{u \in V_t} T(u, \ell ) X_u - \sum _{p \in P_{V_t}} \sum _{u, v \in p} X_uX_v$$

We know that \(\sum _{u \in V_t} T(u, \ell ) X_u \ge |V_t|(1+\varepsilon )^{t-1}\), which is bounded below by \(((\delta - 2\delta ^2) \cdot |V_t|(1+\varepsilon )^{t-1})/\ell \). Let \(g(X) = \sum _{p \in P_{V_t}} \sum _{u, v \in p} X_uX_v\). Then,

$$\begin{aligned} \mathrm {E}[g(X)]&= \sum _{p \in P_{V_t}} \mathrm {E}\Big [\sum _{u, v \in p} X_uX_v\Big ] = \sum _{p \in P_{V_t}} \binom{\ell }{2} (\delta /\ell )^2 =\sum _{p \in P_{V_t}} \frac{(\ell -1)\cdot \delta ^2}{2\ell } \le \sum _{p \in P_{V_t}}\frac{\delta ^2}{2}. \end{aligned}$$

Hence, with probability at least 3/4,

$$g(X) \le 4\mathrm {E}[g(X)] \le 2\delta ^2\cdot |V_t|(1+\varepsilon )^t$$

Both events hold with probability at least 1/2, and the sum of vertex hitting numbers is at least

$$\begin{aligned} ((\delta - 2\delta ^2) \cdot |V_t|(1+\varepsilon )^{t-1}) \cdot \ell - 2\delta ^2\cdot |V_t|(1+\varepsilon )^t&\ge |V_t|(1+\varepsilon )^{t-1}(\delta \ell -2\delta ^2\ell -2\delta ^2-2\delta ^2\varepsilon )\\&\ge |V_t|(1+\varepsilon )^{t}(\delta \ell -2\delta ^2\ell -2\delta ^2-2\delta ^2\varepsilon )/(1+\varepsilon )\\&\ge |V_t|(1+\varepsilon )^{t}(1-4\delta -2\varepsilon ). \end{aligned}$$
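
The concentration step in this proof can be illustrated with a small simulation. The sketch below assumes that each vertex of \(V_t\) is picked independently with probability \(\delta /\ell \), as the expectation computation above suggests, and applies Chebyshev's inequality in its standard form: a deviation of more than \(2\sqrt{\mathrm {Var}[f(X)]}\) occurs with probability at most 1/4. The parameter values and function name are illustrative assumptions, not values from the original work.

```python
import random


def sample_selection_counts(n_vertices, delta, ell, trials, seed=0):
    """Draw f(X) = number of picked vertices, each picked with prob. delta / ell."""
    rng = random.Random(seed)
    p = delta / ell
    return [
        sum(rng.random() < p for _ in range(n_vertices))
        for _ in range(trials)
    ]


# Hypothetical parameters, chosen only to make the simulation quick.
n_vertices, delta, ell = 1000, 0.5, 10
trials = 300

counts = sample_selection_counts(n_vertices, delta, ell, trials)
mean = n_vertices * delta / ell        # E[f(X)]
var_bound = n_vertices * delta / ell   # Var[f(X)] <= |V_t| * delta / ell
radius = 2 * var_bound ** 0.5          # two standard deviations (upper bound)

# Fraction of trials in which f(X) stays within the Chebyshev radius.
within = sum(abs(c - mean) <= radius for c in counts) / trials
print(within)
```

Chebyshev's inequality guarantees that `within` is at least 3/4 in expectation; for independent picks the observed fraction is typically much higher, since the bound is distribution-free.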

B Runtime Analysis

Here, we bound the number of selection steps and the average-time asymptotic complexity of PASHA.

Lemma 2

The number of selection steps is \(O(\log |V| \log |P_\ell |/(\varepsilon \delta ^3m))\).

Proof

The number of steps is \(O(\log |V|/\varepsilon )\). Within each step, there are \(O(\log |P_S|/(\delta ^3m))\) selection steps, where \(|P_S|\) is the sum of vertex hitting numbers of the vertex set S for that step and m is the number of threads used, since we are guaranteed to remove at least a \(\delta ^3\) fraction of the paths during that step. Overall, there are \(O(\log |V| \log |P_{\ell }|/(\varepsilon \delta ^3m))\) selection steps.
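
To give a feel for the \(1/\varepsilon \) dependence, the sketch below counts geometric thresholds. It assumes one plausible reading of the step structure, namely that step t handles vertices whose hitting number is near \((1+\varepsilon )^t\), so the number of steps is roughly the base-\((1+\varepsilon )\) logarithm of the largest hitting number. The function name and the example values are hypothetical, and constant factors are ignored.

```python
import math


def num_threshold_steps(t_max, eps):
    """Smallest t such that (1 + eps) ** t >= t_max (assumed step count)."""
    if t_max < 1 or eps <= 0:
        raise ValueError("need t_max >= 1 and eps > 0")
    return math.ceil(math.log(t_max) / math.log(1 + eps))


# Hypothetical example: a largest hitting number of one million.
for eps in (0.5, 0.1, 0.05):
    print(eps, num_threshold_steps(10**6, eps))
```

Halving \(\varepsilon \) roughly doubles the count, in line with the \(O(\log |V|/\varepsilon )\) term in the proof.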

Theorem 4

For \(\varphi < 1\), there is an approximation algorithm for the second phase of DOCKS that runs in \(O((L^2 \cdot |\varSigma |^{k+1}\cdot \log ^2 (|\varSigma |^{k}))/(\varepsilon \delta ^3m))\) average time, where m is the number of threads used, and produces a cover of size at most \((1+\varphi )(1 + \log T_{max})\) times the optimal size, where \(1+ \varphi = 1/(1-4\delta -2\varepsilon )\).

Proof

Follows immediately from Theorem 2 and Lemma 2.
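
As a worked instance of the relation between \(\varphi \), \(\delta \), and \(\varepsilon \), take the illustrative values \(\delta = \varepsilon = 0.05\) (chosen here for the example only, not taken from the experiments):

$$1 + \varphi = \frac{1}{1 - 4\delta - 2\varepsilon } = \frac{1}{1 - 0.2 - 0.1} = \frac{1}{0.7} \approx 1.43,$$

so \(\varphi \approx 0.43\) and the cover is at most about \(1.43\,(1 + \log T_{max})\) times the optimal size. More generally, the condition \(\varphi < 1\) is equivalent to \(1 - 4\delta - 2\varepsilon > 1/2\), that is, \(4\delta + 2\varepsilon < 1/2\).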

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ekim, B., Berger, B., Orenstein, Y. (2020). A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets. In: Schwartz, R. (ed.) Research in Computational Molecular Biology. RECOMB 2020. Lecture Notes in Computer Science, vol 12074. Springer, Cham. https://doi.org/10.1007/978-3-030-45257-5_3

  • DOI: https://doi.org/10.1007/978-3-030-45257-5_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-45256-8

  • Online ISBN: 978-3-030-45257-5

  • eBook Packages: Computer Science, Computer Science (R0)
