Abstract
As the volume of next-generation sequencing data increases, there is an urgent need for algorithms that process it efficiently. Universal hitting sets (UHS) were recently introduced as an alternative to the central idea of minimizers in sequence analysis, in the hope that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A UHS is a set of k-mers that hits every sequence of length L, and its k-mers can thus serve as indices to L-long sequences. Unfortunately, methods for computing small UHSs are not yet practical for real-world sequencing instances because of their serial and deterministic nature, which leads to long runtimes and high memory demands when handling typical values of k (e.g. \(k > 13\)). To address this bottleneck, we present two algorithmic innovations that significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize and decrease memory usage in calculating k-mer hitting numbers; and (ii) we build upon techniques from randomized Set Cover to select universal k-mers much faster. We implemented these innovations in PASHA, the first randomized parallel algorithm for generating near-optimal UHSs, which newly handles \(k > 13\). We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small constant factor of the optimal size. PASHA improves on the runtime and memory usage of the current best algorithms by orders of magnitude. We expect our newly practical construction of UHSs to be adopted in many high-throughput sequence analysis pipelines.
References
Berger, B., Peng, J., Singh, M.: Computational solutions for omics data. Nat. Rev. Genet. 14(5), 333 (2013)
Berger, B., Rompel, J., Shor, P.W.: Efficient NC algorithms for set cover with applications to learning and geometry. J. Comput. Syst. Sci. 49(3), 454–477 (1994)
DeBlasio, D., Gbosibo, F., Kingsford, C., Marçais, G.: Practical universal k-mer sets for minimizer schemes. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 167–176. ACM (2019)
Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015)
Johnson, D.S.: Approximation algorithms for combinatorial problems. J. Comput. Syst. Sci. 9(3), 256–278 (1974)
Kawulok, J., Deorowicz, S.: CoMeta: classification of metagenomes using k-mers. PLoS ONE 10(4), e0121453 (2015)
Kucherov, G.: Evolution of biosequence search algorithms: a brief survey. Bioinformatics 35(19), 3547–3552 (2019)
Leinonen, R., Sugawara, H., Shumway, M., Collaboration, I.N.S.D.: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2010)
Lovász, L.: On the ratio of optimal integral and fractional covers. Discret. Math. 13(4), 383–390 (1975)
Marçais, G., DeBlasio, D., Kingsford, C.: Asymptotically optimal minimizers schemes. Bioinformatics 34(13), i13–i22 (2018)
Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., Kingsford, C.: Improving the performance of minimizers and winnowing schemes. Bioinformatics 33(14), i110–i117 (2017)
Marçais, G., Solomon, B., Patro, R., Kingsford, C.: Sketching and sublinear data structures in genomics. Ann. Rev. Biomed. Data Sci. 2, 93–118 (2019)
Mykkeltveit, J.: A proof of Golomb’s conjecture for the de Bruijn graph. J. Comb. Theory 13(1), 40–45 (1972)
Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C.: Compact universal k-mer hitting sets. In: Frith, M., Storm Pedersen, C.N. (eds.) WABI 2016. LNCS, vol. 9838, pp. 257–268. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43681-4_21
Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C.: Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Comput. Biol. 13(10), e1005777 (2017)
Paindavoine, M., Vialla, B.: Minimizing the number of bootstrappings in fully homomorphic encryption. In: Dunkelman, O., Keliher, L. (eds.) SAC 2015. LNCS, vol. 9566, pp. 25–43. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31301-6_2
Qin, J., et al.: A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464(7285), 59 (2010)
Roberts, M., Hayes, W., Hunt, B.R., Mount, S.M., Yorke, J.A.: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)
Turnbaugh, P.J., Ley, R.E., Hamady, M., Fraser-Liggett, C.M., Knight, R., Gordon, J.I.: The human microbiome project. Nature 449(7164), 804 (2007)
Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Douglas, W.Y.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(6), S1 (2012)
Acknowledgments
This work was supported by NIH grant R01GM081871 to B.B. B.E. was supported by the MISTI MIT-Israel program at MIT and Ben-Gurion University of the Negev. We gratefully acknowledge the support of Intel Corporation for giving access to the Intel® AI DevCloud platform used for part of this work.
Appendices
A Emulating the Greedy Algorithm
The greedy Set Cover algorithm was developed independently by Johnson and Lovász for unweighted vertices [5, 9]. Lovász [9] proved:
Theorem 1
The greedy algorithm for Set Cover outputs cover R with \(|R| \le (1 + \log T_{max})|OPT|\), where \(T_{max}\) is the maximum cardinality of a set.
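For orientation, the greedy rule behind Theorem 1 can be sketched as follows. This is a minimal illustration on a hypothetical toy instance, not the DOCKS or PASHA implementation; the function name `greedy_set_cover` and the data are assumptions, and `log` is taken to be the natural logarithm.

```python
import math

def greedy_set_cover(universe, sets):
    """Return indices of sets chosen greedily until `universe` is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the most still-uncovered elements
        # (ties go to the lowest index).
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Hypothetical toy instance: 7 elements, 5 candidate sets.
universe = range(1, 8)
sets = [{1, 2, 3, 4}, {4, 5, 6}, {6, 7}, {1, 5}, {2, 7}]
cover = greedy_set_cover(universe, sets)

# Theorem 1 factor for this instance, with T_max the largest set size.
t_max = max(len(s) for s in sets)
bound_factor = 1 + math.log(t_max)
```

On this instance the greedy cover uses three sets, which is also optimal here, so the factor \(1 + \log T_{max}\) is far from tight.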
We adapt a definition of an algorithm emulating the greedy algorithm for the Set Cover problem to the second phase of DOCKS [2]. We say that an algorithm for the second phase of DOCKS \(\alpha \)-emulates the greedy algorithm if it outputs a set of vertices serially, during which it selects a vertex set A such that
\[ \frac{|A|}{|P_A|} \le \frac{\alpha }{\max _{v \in V} T(v, \ell )}, \]
where \(P_A\) is the set of \(\ell \)-long paths covered by A and \(T(v, \ell )\) is the hitting number of vertex v, i.e. the number of \(\ell \)-long paths through v that are still uncovered. Using this definition, we obtain a near-optimal approximation guarantee through the following theorem:
Theorem 2
An algorithm for the second phase of DOCKS that \(\alpha \)-emulates the greedy algorithm produces cover \(R \subseteq V\) with \(|R| \le \alpha (1 + \log T_{max})|OPT|\), where OPT is the optimal cover.
Proof
We define the cost of covering path p as \(\mathcal {C}(p) = \frac{|S|}{|P_S|}\), where S is the set of vertices selected in the selection step in which p was covered, and \(P_S\) the set of \(\ell \)-long paths covered by S. Then, \(\sum _{p \in P_S} \mathcal {C}(p) = |S|\).
Let \(P_\ell \) be the set of all \(\ell \)-long paths in G. A fractional cover of graph \(G = (V, E)\) is function \(\mathcal {F}: V \rightarrow \{0, 1\}\) s.t. for all \(p \in P_\ell \), \(\sum _{v \in p} \mathcal {F}(v) \ge 1\). The optimal cover \(\mathcal {F}_{OPT}\) has minimum \(\sum _{v \in V} \mathcal {F}_{OPT}(v)\).
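Let \(\mathcal {F}\) be such an optimal fractional cover. Since every path p satisfies \(\sum _{v \in p} \mathcal {F}(v) \ge 1\), the size of the cover produced is
\[ |R| = \sum _{p \in P_\ell } \mathcal {C}(p) \le \sum _{v \in V} \mathcal {F}(v) \sum _{p \in P_v} \mathcal {C}(p), \]
where \(P_v\) is the set of all \(\ell \)-long paths through vertex v.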
Lemma 1
There are at most \(\frac{\alpha }{k}\) paths \(p \in P_v\) such that \(\mathcal {C}(p) \ge k\) for any v, k.
Proof
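Assume the contrary: before the first such path p is covered, \(T(v, \ell ) > \frac{\alpha }{k}\). Thus, for the set S selected in the step that covers p,
\[ \frac{|S|}{|P_S|} = \mathcal {C}(p) \ge k > \frac{\alpha }{T(v, \ell )}, \]
contradicting the definition.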
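Suppose we rank the \(T(v, \ell )\) paths \(p \in P_v\) by decreasing order of \(\mathcal {C}(p)\). By Lemma 1, if the ith path has cost k, then \(i \le \alpha /k\), i.e. its cost is at most \(\alpha /i\). Then, we can write
\[ \sum _{p \in P_v} \mathcal {C}(p) \le \sum _{i=1}^{T(v, \ell )} \frac{\alpha }{i}. \]
Then,
\[ \sum _{p \in P_v} \mathcal {C}(p) \le \alpha (1 + \log T(v, \ell )) \le \alpha (1 + \log T_{max}), \]
and finally
\[ |R| \le \sum _{v \in V} \mathcal {F}(v)\, \alpha (1 + \log T_{max}) \le \alpha (1 + \log T_{max})|OPT|. \]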
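The ranking argument charges the ith most expensive path through a vertex at most \(\alpha /i\), so the total cost through that vertex is at most a scaled harmonic sum. The short check below illustrates numerically that this sum stays below \(\alpha (1 + \log T)\) for T ranked paths. It is only a sanity check, assuming `log` is the natural logarithm; the value of `alpha` and the function name are hypothetical.

```python
import math

def ranked_cost_bound(alpha, num_paths):
    """Sum of alpha / i for i = 1..num_paths: the largest total cost
    the ranking argument allows for the paths through one vertex."""
    return sum(alpha / i for i in range(1, num_paths + 1))

alpha = 1.25  # hypothetical emulation factor
for t in (1, 10, 1000):
    total = ranked_cost_bound(alpha, t)
    limit = alpha * (1 + math.log(t))
    # Harmonic-sum bound: H_t <= 1 + ln t (small tolerance for rounding).
    assert total <= limit + 1e-12
```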
In PASHA, we ensure that in step t, the sum of vertex hitting numbers of selected vertex set \(V_t\) is at least \(|V_t|(1+\varepsilon )^t(1 - 4\delta -2\varepsilon )\). We now show that this is satisfied with high probability in each step.
Theorem 3
With probability at least 1/2, the sum of vertex hitting numbers of selected vertex set \(V_t\) at step t is at least \(|V_t|(1+\varepsilon )^t (1 - 4\delta - 2\varepsilon )\).
Proof
For any vertex v in selected vertex set \(V_t\) at step t, let \(X_v\) be an indicator variable for the random event that vertex v is picked, and \(f(X) = \sum _{v \in V_t} X_v\).
Note that \(\mathrm {Var}{[f(X)]} \le |V_t| \cdot \delta /\ell \), and \(|V_t| \ge \ell /\delta ^3\), since we are given that no vertex covers a \(\delta ^3\) fraction of the \(\ell \)-long paths covered by the vertices in \(V_t\). By Chebyshev’s inequality, for any \(k \ge 0\),
and with probability 3/4,
and
Let \(P_{V_t}\) denote the set of \(\ell \)-long paths covered by vertex set \(V_t\). Then,
We know that \(\sum _{u \in V_t} T(u, \ell ) X_u \ge |V_t|(1+\varepsilon )^{t-1}\), which is bounded below by \(((\delta - 2\delta ^2) \cdot |V_t|(1+\varepsilon )^{t-1})/\ell \). Let \(g(X) = \sum _{p \in P_{V_t}} \sum _{u, v \in p} X_uX_v\). Then,
Hence, with probability at least 3/4,
Both events hold simultaneously with probability at least 1/2, and the sum of vertex hitting numbers is then at least the bound stated in Theorem 3.
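The variance bound \(\mathrm {Var}[f(X)] \le |V_t| \cdot \delta /\ell \) is consistent with each candidate vertex being picked with probability \(\delta /\ell \). The sketch below makes that assumption explicit (independent picks with probability \(\delta /\ell \), which is an assumption of this example rather than a statement about PASHA's code) and shows how Chebyshev's inequality turns the variance into a deviation radius that holds with probability at least 3/4. All parameter values and function names are hypothetical.

```python
import math
import random

def pick_vertices(vertices, delta, ell, rng):
    """Pick each vertex independently with probability delta / ell (assumed)."""
    p = delta / ell
    return [v for v in vertices if rng.random() < p]

def chebyshev_radius(variance, failure_prob):
    """Smallest k with Var / k^2 <= failure_prob, from
    Pr[|f - E[f]| >= k] <= Var / k^2."""
    return math.sqrt(variance / failure_prob)

delta, ell = 0.05, 20          # hypothetical parameters
n = 200_000                    # candidate set size, chosen >= ell / delta^3
p = delta / ell
mean = n * p                   # E[f(X)]
variance = n * p * (1 - p)     # Var[f(X)] for independent indicators
radius = chebyshev_radius(variance, 0.25)

rng = random.Random(0)
picked = pick_vertices(range(n), delta, ell, rng)
# With probability at least 3/4, len(picked) lies within `radius` of `mean`.
```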
B Runtime Analysis
Here, we bound the number of selection steps and the average-case asymptotic running time of PASHA.
Lemma 2
The number of selection steps is \(O(\log |V| \log |P_\ell |/(\varepsilon \delta ^3m))\).
Proof
The number of steps is \(O(\log |V|/\varepsilon )\), and within each step, there are \(O(\log |P_S|/(\delta ^3m))\) selection steps (where \(P_S\) is the sum of vertex hitting numbers of the vertex set S for that step and m the number of threads used), since we are guaranteed to remove at least \(\delta ^3\) fraction of the paths during that step. Overall, there are \(O(\log |V| \log |P_{\ell }|/(\varepsilon \delta ^3m))\) selection steps.
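To get a feel for how these quantities scale, the sketch below counts the geometric thresholds \((1+\varepsilon )^t\) between 1 and a maximum hitting number, and evaluates the expression inside the O-notation of Lemma 2. It assumes that each outer step corresponds to one such threshold, and it ignores the hidden constants, so the second function gives a scaling only, not a step count. Function names and inputs are hypothetical.

```python
import math

def num_threshold_steps(max_hitting_number, eps):
    """Count thresholds (1 + eps)^t needed to climb from 1 to max_hitting_number."""
    steps, threshold = 0, 1.0
    while threshold < max_hitting_number:
        threshold *= 1 + eps
        steps += 1
    return steps

def selection_step_scale(num_vertices, num_paths, eps, delta, threads):
    """Expression inside the O(.) of Lemma 2 (constants omitted)."""
    return (math.log(num_vertices) * math.log(num_paths)
            / (eps * delta ** 3 * threads))

# A smaller eps gives more thresholds; more threads give a smaller scale.
coarse = num_threshold_steps(1000, 0.5)
fine = num_threshold_steps(1000, 0.05)
one_thread = selection_step_scale(4 ** 10, 10 ** 9, 0.05, 0.05, 1)
eight_threads = selection_step_scale(4 ** 10, 10 ** 9, 0.05, 0.05, 8)
```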
Theorem 4
For \(\varphi < 1\), there is an approximation algorithm for the second phase of DOCKS that runs in \(O((L^2 \cdot |\varSigma |^{k+1}\cdot \log ^2 (|\varSigma |^{k}))/(\varepsilon \delta ^3m))\) average time, where m is the number of threads used, and produces a cover of size at most \((1+\varphi )(1 + \log T_{max})\) times the optimal size, where \(1+ \varphi = 1/(1-4\delta -2\varepsilon )\).
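As a worked example of the guarantee in Theorem 4, the sketch below evaluates \(1 + \varphi = 1/(1 - 4\delta - 2\varepsilon )\) and the resulting factor \((1+\varphi )(1 + \log T_{max})\). The parameter values are hypothetical, `log` is assumed to be the natural logarithm, and the function names are illustrative only.

```python
import math

def approximation_factor(delta, eps):
    """Return 1 + phi = 1 / (1 - 4*delta - 2*eps)."""
    slack = 1 - 4 * delta - 2 * eps
    if slack <= 0:
        raise ValueError("need 4*delta + 2*eps < 1")
    return 1 / slack

def cover_size_factor(delta, eps, t_max):
    """Return (1 + phi) * (1 + log T_max), the factor relative to the optimum."""
    return approximation_factor(delta, eps) * (1 + math.log(t_max))

# Hypothetical parameters: delta = eps = 0.05 gives 1 + phi = 1 / 0.7.
one_plus_phi = approximation_factor(0.05, 0.05)
phi = one_plus_phi - 1
```

With these values \(\varphi \) is roughly 0.43, which satisfies the condition \(\varphi < 1\) of the theorem.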
Proof
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ekim, B., Berger, B., Orenstein, Y. (2020). A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets. In: Schwartz, R. (eds) Research in Computational Molecular Biology. RECOMB 2020. Lecture Notes in Computer Science, vol 12074. Springer, Cham. https://doi.org/10.1007/978-3-030-45257-5_3
DOI: https://doi.org/10.1007/978-3-030-45257-5_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45256-8
Online ISBN: 978-3-030-45257-5
eBook Packages: Computer Science (R0)