DDP-B: A Distributed Dynamic Parallel Framework for Meta-genomics Binary Similarity

  • Mengxian Chi
  • Xu Jin
  • Feng Li
  • Hong An
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11783)

Abstract

Over the past decades, considerable effort has gone into meta-genomics as a means of exploring new species. With the development of next-generation sequencing technology, meta-genomics datasets now reach dozens or hundreds of gigabytes, or even several terabytes, which poses a severe challenge for data analysis. In addition, conventional meta-genomics comparison algorithms lack parallelism and therefore cannot take full advantage of the computing capacity that parallel computing techniques provide. In this paper, we propose DDP-B, a distributed dynamic parallel framework for meta-genomics binary similarity analysis, to overcome these limitations. In this framework, we introduce a binary distance algorithm for measuring meta-genomics similarity and parallelize it at several levels of granularity using MPI, OpenMP, and SIMD techniques. Moreover, we establish a dynamic scheduling method to dispatch asynchronous parallel computing tasks and design a distributed cluster on which to deploy the dynamic parallel system. Under optimal conditions, the system completes 2.97K pairwise comparisons of meta-genomics vectors per second and achieves a 134.79x speedup over the baseline. Our framework scales stably when assigned larger workloads.
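As a rough illustration of what a binary distance between two meta-genomics vectors can look like, the sketch below computes a Jaccard distance over bit-packed presence/absence vectors. The abstract does not say which binary measure DDP-B uses, so the choice of Jaccard, the function names `pack_bits` and `jaccard_distance`, and the sample vectors are all assumptions made for this example only.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 flags into one Python int (bit i holds bits[i])."""
    value = 0
    for i, b in enumerate(bits):
        if b:
            value |= 1 << i
    return value


def jaccard_distance(a, b):
    """Jaccard distance between two bit-packed presence/absence vectors.

    Hypothetical helper: Jaccard is assumed here, not taken from the paper.
    """
    union = bin(a | b).count("1")  # features present in either sample
    if union == 0:
        return 0.0  # two empty vectors are treated as identical
    intersection = bin(a & b).count("1")  # features present in both samples
    return 1.0 - intersection / union


# Two made-up 8-feature samples (1 = feature present, 0 = absent).
x = pack_bits([1, 1, 0, 1, 0, 0, 1, 0])
y = pack_bits([1, 0, 0, 1, 1, 0, 1, 0])
d = jaccard_distance(x, y)  # 3 shared features out of 5 present in total
print(d)
```

Measures of this kind reduce to bitwise AND/OR followed by a population count, which is one plausible reason they suit word-level and SIMD parallelism; how DDP-B actually maps the computation onto MPI, OpenMP, and SIMD is not described in the abstract.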

Keywords

Meta-genomics, Big data, Parallel computing, Binary distance, Dynamic scheduling, Distributed scalability

Copyright information

© IFIP International Federation for Information Processing 2019

Authors and Affiliations

  1. University of Science and Technology of China, Hefei, China
