CRiSPy-CUDA: Computing Species Richness in 16S rRNA Pyrosequencing Datasets with CUDA

  • Zejun Zheng
  • Thuy-Diem Nguyen
  • Bertil Schmidt
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7036)

Abstract

Pyrosequencing technologies are frequently used for sequencing the 16S rRNA marker gene for metagenomic studies of microbial communities. Computing a pairwise genetic distance matrix from the produced reads is an important but highly time-consuming task. In this paper, we present a parallelized tool (called CRiSPy) for scalable pairwise genetic distance matrix computation and clustering that is based on the processing pipeline of the popular ESPRIT software package. To achieve high computational efficiency, we have designed massively parallel CUDA algorithms for pairwise k-mer distance and pairwise genetic distance computation. We have also implemented a memory-efficient sparse matrix clustering program to process the distance matrix. On a single GPU, CRiSPy achieves speedups of around two orders of magnitude compared to the sequential ESPRIT program for both the time-consuming pairwise genetic distance module and the whole processing pipeline, thus making CRiSPy particularly suitable for high-throughput microbial studies.
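
The abstract does not spell out the k-mer distance, so the following is only a minimal sequential sketch of one common definition (the fraction of shared k-mers, as used in the k-mer distance literature), written in Python rather than CUDA for readability. The function names, the default k = 6, and the threshold of 0.5 are illustrative assumptions, not values or interfaces taken from CRiSPy or ESPRIT. The sketch shows how a cheap k-mer distance can pre-filter read pairs so that only the surviving pairs would need a full alignment-based genetic distance.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=6):
    """Assumed definition: 1 - (shared k-mers) / (k-mers in the shorter read).

    k=6 is an illustrative default, not a value stated in the abstract.
    """
    shorter = min(len(x), len(y))
    if shorter < k:
        raise ValueError("both sequences must have length >= k")
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # A k-mer is shared as many times as it occurs in both reads.
    shared = sum(min(n, cy[word]) for word, n in cx.items())
    return 1.0 - shared / (shorter - k + 1)


def candidate_pairs(reads, k=6, threshold=0.5):
    """Return index pairs whose k-mer distance is at most threshold.

    The threshold is a hypothetical filter setting; only these pairs
    would be passed on to the expensive alignment step.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(reads), 2)
        if kmer_distance(a, b, k) <= threshold
    ]
```

Every pair is independent of every other pair, which is the property that makes this stage a natural fit for a massively parallel implementation; the sketch itself makes no claim about how the CUDA kernels are organized.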

Keywords

Metagenomics, Pyrosequencing, Alignment, CUDA, MPI

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Zejun Zheng (1)
  • Thuy-Diem Nguyen (1)
  • Bertil Schmidt (2)
  1. School of Computer Engineering, Nanyang Technological University, Singapore
  2. Institut für Informatik, Johannes Gutenberg University, Mainz, Germany