Long Reads Enable Accurate Estimates of Complexity of Metagenomes

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2018)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10812)

Abstract

Although reduced microbiome diversity has been linked to various diseases, estimating the diversity of bacterial communities (the number and the total length of distinct genomes within a metagenome) remains an open problem in microbial ecology. We describe the first analysis of microbial diversity using long reads without any assumption on the frequencies of genomes within a metagenome (parametric methods) and without requiring a large database that covers the total diversity (non-parametric methods). Long-read technologies provide new insights into the diversity of metagenomes by interrogating rare species that remained below the radar of previous approaches based on short reads. We present a novel approach for estimating the diversity of metagenomes based on joint analysis of short and long reads and benchmark it on various datasets. We estimate that genomes comprising a human gut metagenome have total length varying from 1.3 to 3.5 billion nucleotides, with genomes responsible for \(50\%\) of total abundance having total length varying from only 40 to 60 million nucleotides. In contrast, genomes comprising an aquifer sediment metagenome have a total length that is more than two orders of magnitude larger (\({\approx }840\) billion nucleotides).

References

  1. Amann, R., Rosselló-Móra, R.: After all, only millions? mBio 7(4), e00999-16 (2016)

  2. Bankevich, A., Nurk, S., Antipov, D., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)

  3. Bankevich, A., Pevzner, P.A.: TruSPAdes: barcode assembly of TruSeq synthetic long reads. Nat. Methods 13, 248–250 (2016). https://doi.org/10.1038/nmeth.3737

  4. Capo, E., Debroas, D., Arnaud, F., Domaizon, I.: Is planktonic diversity well recorded in sedimentary DNA? Toward the reconstruction of past protistan diversity. Microb. Ecol. 70(4), 865–875 (2015)

  5. Chao, A., Bunge, J.: Estimating the number of species in a stochastic abundance model. Biometrics 58(3), 531–539 (2002). https://doi.org/10.1111/j.0006-341X.2002.00531.x

  6. Chen, Y., Kuang, J., Jia, P., Cadotte, M.W., Huang, L., Li, J., Liao, B., Wang, P., Shu, W.: Effect of environmental variation on estimating the bacterial species richness. Front. Microbiol. 8, 690 (2017)

  7. Compeau, P.E.C., Pevzner, P.A., Tesler, G.: How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. 29(11), 987–991 (2011). https://doi.org/10.1038/nbt.2023

  8. Curtis, T.P., Sloan, W.T., Scannell, J.W.: Estimating prokaryotic diversity and its limits. Proc. Natl. Acad. Sci. U.S.A. 99(16), 10494–10499 (2002). https://doi.org/10.1073/pnas.142680199

  9. Driscoll, C.B., Otten, T.G., Brown, N.M., Dreher, T.W.: Towards long-read metagenomics: complete assembly of three novel genomes from bacteria dependent on a diazotrophic cyanobacterium in a freshwater lake co-culture. Stand. Genomic Sci. 12(1), 9 (2017)

  10. Dykhuizen, D.E.: Santa Rosalia revisited: why are there so many species of bacteria? Antonie Van Leeuwenhoek 73(1), 25–33 (1998)

  11. Ellegaard, K.M., Engel, P.: Beyond 16S rRNA community profiling: intra-species diversity in the gut microbiota. Front. Microbiol. 7, 1475 (2016)

  12. Frisli, T., Haverkamp, T.H.A., Jakobsen, K.S., Stenseth, N.C., Rudi, K.: Estimation of metagenome size and structure in an experimental soil microbiota from low coverage next-generation sequence data. J. Appl. Microbiol. 114(1), 141–151 (2013). https://doi.org/10.1111/jam.12035

  13. Gao, W., Weng, J., Gao, Y., Chen, X.: Comparison of the vaginal microbiota diversity of women with and without human papillomavirus infection: a cross-sectional study. BMC Infect. Dis. 13(1), 271 (2013)

  14. Gurevich, A., Saveliev, V., Vyahhi, N., Tesler, G.: QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8), 1072–1075 (2013). https://doi.org/10.1093/bioinformatics/btt086

  15. Haegeman, B., Hamelin, J., Moriarty, J., Neal, P., Dushoff, J., Weitz, J.S.: Robust estimation of microbial diversity in theory and in practice. ISME J. 7(6), 1092–1101 (2013). https://doi.org/10.1038/ismej.2013.10

  16. Haider, B., Ahn, T.H., Bushnell, B., et al.: Omega: an overlap-graph de novo assembler for metagenomics. Bioinformatics 30(19), 2717–2722 (2014). https://doi.org/10.1093/bioinformatics/btu395

  17. Hong, S.H., Bunge, J., Jeon, S.O., Epstein, S.S.: Predicting microbial species richness. Proc. Natl. Acad. Sci. U.S.A. 103(1), 117–122 (2006). https://doi.org/10.1073/pnas.0507245102

  18. Hooper, S.D., Dalevi, D., Pati, A., Mavromatis, K., Ivanova, N.N., Kyrpides, N.C.: Estimating DNA coverage and abundance in metagenomes using a gamma approximation. Bioinformatics 26(3), 295–301 (2010). https://doi.org/10.1093/bioinformatics/btp687

  19. Hughes, J.B., Hellmann, J.J., Ricketts, T.H., Bohannan, B.J.: Counting the uncountable: statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol. 67(10), 4399–4406 (2001)

  20. Jousset, A., Bienhold, C., Chatzinotas, A., et al.: Where less may be more: how the rare biosphere pulls ecosystems strings. ISME J. 11(4), 853–862 (2017)

  21. Kashtan, N., Roggensack, S.E., Rodrigue, S., et al.: Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science 344(6182), 416–420 (2014)

  22. Kemp, P.F., Aller, J.Y.: Bacterial diversity in aquatic and other environments: what 16S rDNA libraries can tell us. FEMS Microbiol. Ecol. 47(2), 161–177 (2004). https://doi.org/10.1016/S0168-6496(03)00257-5

  23. Kuleshov, V., Jiang, C., Zhou, W., Jahanbani, F., Batzoglou, S., Snyder, M.: Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nat. Biotechnol. 34(1), 64–69 (2015). https://doi.org/10.1038/nbt.3416

  24. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357–359 (2012)

  25. Lennon, J.T., Locey, K.J.: The underestimation of global microbial diversity. mBio 7(5), e01298-16 (2016). https://doi.org/10.1128/mBio.01298-16

  26. Lennon, J.T., Placella, S.A., Muscarella, M.E.: Relic DNA contributes minimally to estimates of microbial diversity. bioRxiv, p. 131284 (2017)

  27. Li, R., Hsieh, C.L., Young, A., et al.: Illumina synthetic long read sequencing allows recovery of missing sequences even in the “finished” C. elegans genome. Sci. Rep. 5, 10814 (2015). https://doi.org/10.1038/srep10814

  28. Lladser, M.E., Gouet, R., Reeder, J.: Extrapolation of urn models via poissonization: accurate measurements of the microbial unknown. PLoS ONE 6(6), e21105 (2011). https://doi.org/10.1371/journal.pone.0021105

  29. Locey, K.J., Lennon, J.T.: Scaling laws predict global microbial diversity. Proc. Natl. Acad. Sci. U.S.A. 113(21), 5970–5975 (2016)

  30. Loose, M., Malla, S., Stout, M.: Real-time selective sequencing using nanopore technology. Nat. Methods 13(9), 751–754 (2016)

  31. Lynch, M.D.J., Neufeld, J.D.: Ecology and exploration of the rare biosphere. Nat. Rev. Microbiol. 13(4), 217–229 (2015). https://doi.org/10.1038/nrmicro3400

  32. McCoy, R.C., Taylor, R.W., Blauwkamp, T.A., et al.: Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS ONE 9(9), e106689 (2014). https://doi.org/10.1371/journal.pone.0106689

  33. McDonald, D., et al.: American gut: an open platform for citizen-science microbiome research (2018, submitted)

  34. Miller, C.S., Baker, B.J., Thomas, B.C., Singer, S.W., Banfield, J.F.: EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biol. 12(5), R44 (2011). https://doi.org/10.1186/gb-2011-12-5-r44

  35. Pedrós-Alió, C., Manrubia, S.: The vast unknown microbial biosphere. Proc. Natl. Acad. Sci. U.S.A. 113(24), 6585–6587 (2016). https://doi.org/10.1073/pnas.1606105113

  36. Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K.S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., et al.: A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464(7285), 59–65 (2010)

  37. Rodriguez-R, L.M., Konstantinidis, K.T.: Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics 30(5), 629–635 (2014). https://doi.org/10.1093/bioinformatics/btt584

  38. Roesch, L.F.W., Fulthorpe, R.R., Riva, A., Casella, G., Hadwin, A.K.M., Kent, A.D., Daroub, S.H., Camargo, F.A.O., Farmerie, W.G., Triplett, E.W.: Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J. 1(4), 283–290 (2007). https://doi.org/10.1038/ismej.2007.53

  39. Rosselli, R., Romoli, O., Vitulo, N., et al.: Direct 16S rRNA-SEQ from bacterial communities: a PCR-independent approach to simultaneously assess microbial diversity and functional activity potential of each taxon. Sci. Rep. 6, 32165 (2016)

  40. Rozov, R., Brown Kav, A., Bogumil, D., Shterzer, N., Halperin, E., Mizrahi, I., Shamir, R.: Recycler: an algorithm for detecting plasmids from de novo assembly graphs. Bioinformatics 33(4), 475–482 (2017)

  41. Scher, J.U., Ubeda, C., Artacho, A., et al.: Decreased bacterial diversity characterizes the altered gut microbiota in patients with psoriatic arthritis, resembling dysbiosis in inflammatory bowel disease. Arthritis Rheumatol. 67(1), 128–139 (2015). https://doi.org/10.1002/art.38892

  42. Schloss, P.D., Girard, R.A., Martin, T., Edwards, J., Thrash, J.C.: Status of the archaeal and bacterial census: an update. mBio 7(3), e00201-16 (2016). https://doi.org/10.1128/mBio.00201-16

  43. Schloss, P.D., Handelsman, J.: Status of the microbial census. Microbiol. Mol. Biol. Rev. 68(4), 686–691 (2004). https://doi.org/10.1128/MMBR.68.4.686-691.2004

  44. Shade, A.: Diversity is the question, not the answer. ISME J. 11(1), 1–6 (2016). https://doi.org/10.1038/ismej.2016.118

  45. Shakya, M., Quince, C., Campbell, J.H., Yang, Z.K., Schadt, C.W., Podar, M.: Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities. Environ. Microbiol. 15(6), 1882–1899 (2013). https://doi.org/10.1111/1462-2920.12086

  46. Sharon, I., Kertesz, M., Hug, L.A., et al.: Accurate, multi-kb reads resolve complex populations and detect rare microorganisms. Genome Res. 25(4), 534–543 (2015). https://doi.org/10.1101/gr.183012.114

  47. Sharpton, T.J., Riesenfeld, S.J., Kembel, S.W., et al.: PhyLOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data. PLoS Comput. Biol. 7(1), e1001061 (2011)

  48. Sogin, M.L., Morrison, H.G., Huber, J.A., et al.: Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc. Natl. Acad. Sci. U.S.A. 103(32), 12115–12120 (2006). https://doi.org/10.1073/pnas.0605127103

  49. Sunagawa, S., DeSantis, T.Z., Piceno, Y.M., et al.: Bacterial diversity and White Plague Disease-associated community changes in the Caribbean coral Montastraea faveolata. ISME J. 3(5), 512–521 (2009). https://doi.org/10.1038/ismej.2008.131

  50. Taur, Y., Jenq, R.R., Perales, M.A., et al.: The effects of intestinal tract bacterial diversity on mortality following allogeneic hematopoietic stem cell transplantation. Blood 124, 1174–1182 (2014). https://doi.org/10.1182/blood-2014-02-554725

  51. Tiedje, J.: Microbial diversity: of value to whom? ASM News 60, 524–525 (1994)

  52. Voskoboynik, A., Neff, N.F., Sahoo, D., et al.: The genome sequence of the colonial chordate, Botryllus schlosseri. eLife 2, e00569 (2013). https://doi.org/10.7554/eLife.00569

  53. White, R.A., Bottos, E.M., Roy Chowdhury, T., et al.: Moleculo long-read sequencing facilitates assembly and genomic binning from complex soil metagenomes. mSystems 1(3) (2016). https://doi.org/10.1128/mSystems.00045-16

  54. Williamson, M., Gaston, K.J.: The lognormal distribution is not an appropriate null hypothesis for the species-abundance distribution. J. Anim. Ecol. 74(3), 409–422 (2005). https://doi.org/10.1111/j.1365-2656.2005.00936.x

  55. Willis, A.: Extrapolating abundance curves has no predictive power for estimating microbial biodiversity. Proc. Natl. Acad. Sci. U.S.A. 113(35), E5096 (2016). https://doi.org/10.1073/pnas.1608281113

Acknowledgements

We are indebted to Chris Dupont, Rob Knight, and Glenn Tesler for providing numerous comments. Glenn Tesler also suggested using exponential integrals for analyzing the bias of our estimator. We are grateful to Yana Safonova, Andrey Bzikadse, Sergey Bankevich, Sergey Nurk, Alon Orlitsky, Ivan Tolstoganov, and Aleksandr Shlemov for many helpful discussions and help with preparation of this paper. This study was funded by the Russian Science Foundation (award 14-50-00069) and by the National Science Foundation (MCB-BSF award 1715911).

Corresponding author

Correspondence to Anton Bankevich.

Appendix

TruSPAdes Assemblies of MOCK, GUT, and SEDI Datasets. The TruSeq SLR technology generates accurate, long virtual reads derived from pools of short reads [27, 32, 52]. It is based on fragmenting genomic DNA into large segments (\({\approx }10\) kb long) and forming random pools of the resulting segments (each pool contains \({\approx }300\) segments). Next, these segments are amplified, sheared, and marked with a barcode that is unique to the pool. Afterwards, they are sequenced using standard Illumina short-read technology. All short reads originating from the same barcode are assembled together, resulting in a set of long contigs (this step is called the SLR barcode assembly). Ideally, the result of such a sequencing effort for a single barcode is a collection of approximately 300 fragments (each \({\approx }10\) kb long) from a genome, forming 300 long virtual reads. SLRs have a low mismatch rate (about \(0.1\%\)), an extremely low indel rate, and few misassemblies [3].

Table 2 presents results of barcode assembly of MOCK, GUT and SEDI datasets with truSPAdes.

Table 2. Results of truSPAdes assemblies of MOCK, GUT and SEDI datasets. Long SLRs are defined as SLRs longer than 6 kb.

Analyzing the CAMI and CROHN Datasets. In addition to the datasets described in the main text, we also analyzed a larger synthetic dataset and four human microbiome datasets from a patient suffering from Crohn’s disease.

The CAMI dataset is a simulated dataset generated by the “Critical Assessment of Metagenome Interpretation” (CAMI) initiative, which aims to evaluate various approaches to analyzing metagenomes (http://www.cami-challenge.org/). We used a CAMI dataset simulated from 225 genomes and containing 150 million 100 bp paired-end reads with a mean insert size of 180 bp (the errors in simulated reads are modelled after Illumina HiSeq). We simulated 50 thousand SLRs in the same way as for the SYNTH dataset. The total length of the reference genomes for this dataset is \({\approx }820\) Mb and its de Bruijn complexity is \({\approx }770\) Mb. Figure 4 shows that our estimator works well for the CAMI dataset.
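As a rough sanity check on the figures quoted above, the following sketch computes the average sequencing depth and the ratio of de Bruijn complexity to total reference length. It assumes that "150 million" counts individual reads rather than read pairs (the text does not say which), and the function and constant names are illustrative only.

```python
# Back-of-the-envelope arithmetic for the CAMI dataset figures quoted in the text.
# Assumption: 150 million is the number of individual 100 bp reads, not read pairs.

def mean_coverage(num_reads: int, read_length_bp: int, reference_length_bp: int) -> float:
    """Average depth = total sequenced bases / total reference length."""
    return num_reads * read_length_bp / reference_length_bp

NUM_READS = 150_000_000
READ_LENGTH_BP = 100
REFERENCE_LENGTH_BP = 820_000_000      # approximate total length of the 225 references
DE_BRUIJN_COMPLEXITY_BP = 770_000_000  # approximate de Bruijn complexity

coverage = mean_coverage(NUM_READS, READ_LENGTH_BP, REFERENCE_LENGTH_BP)
complexity_ratio = DE_BRUIJN_COMPLEXITY_BP / REFERENCE_LENGTH_BP

print(f"mean coverage: about {coverage:.1f}x")
print(f"de Bruijn complexity / total reference length: about {complexity_ratio:.2f}")
```

Under this assumption the average depth comes out at roughly 18x; it would double if the 150 million figure refers to pairs. The average says nothing about individual genomes, whose depths depend on their abundances.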

The CROHN datasets are four human gut microbiome datasets from a patient with Crohn’s disease. These datasets (CROHN1, CROHN2, CROHN3, CROHN4) represent a metagenomics time series collected on 12-28-2011, 04-29-2013, 11-16-2014, and 06-29-2015, respectively. Each of these datasets includes one Illumina paired-end library and one SLR library. The number of short reads in these datasets ranges from 150 to 230 million, with a mean insert size of \({\approx }400\) bp for all datasets. The number of SLRs ranges from 17 to 50 thousand. Assembly efforts for these datasets resulted in contigs of total length 242, 172, 225, and 275 Mb for the CROHN1, CROHN2, CROHN3, and CROHN4 datasets, respectively.

We estimated the metagenome capacity for the CROHN1, CROHN2, CROHN3, and CROHN4 datasets as 3.5, 2.0, 2.4, and 3.2 Gb, respectively. Values of M50 were estimated as 41, 61, 25, and 45 Mb, respectively, while values of M90 were estimated as 230, 490, 240, and 250 Mb, respectively. These estimates reveal large variations in metagenome capacity during the course of the disease that go well beyond what can be estimated using short-read assemblies. The correlation between metagenome capacity and antibiotic treatments for this metagenomics time series will be discussed elsewhere.
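To make the M50 and M90 values above easier to interpret, here is a minimal sketch of how such a statistic could be computed when the genomes and their depths are known. It assumes that Mx is the total length of the most abundant genomes that together account for x% of total abundance (as the abstract suggests for M50), and that a genome's abundance is its length times its coverage depth. Both assumptions, the function name `m_statistic`, and the toy data are illustrative; this is not the estimator used in the paper, which works without knowing the genomes.

```python
from typing import List, Tuple

# (length in bp, coverage depth); toy inputs, not taken from the paper
Genome = Tuple[int, float]

def m_statistic(genomes: List[Genome], percent: float) -> int:
    """Total length of the most abundant genomes that together reach
    `percent` of total abundance (abundance assumed = length * depth)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total_mass = sum(length * depth for length, depth in genomes)
    if total_mass <= 0:
        raise ValueError("total abundance must be positive")
    target = total_mass * percent / 100.0

    covered_mass = 0.0
    covered_length = 0
    # Walk through genomes from the highest depth to the lowest.
    for length, depth in sorted(genomes, key=lambda g: g[1], reverse=True):
        covered_mass += length * depth
        covered_length += length
        if covered_mass >= target:
            break
    return covered_length

toy_metagenome = [(5_000_000, 20), (4_000_000, 10), (3_000_000, 2), (6_000_000, 1)]
m50 = m_statistic(toy_metagenome, 50)
m90 = m_statistic(toy_metagenome, 90)
print(m50, m90)
```

In the toy example a single dominant genome already carries more than half of the abundance, so M50 is much smaller than the total length of 18 Mb. The CROHN estimates show the same pattern, with M50 in the tens of megabases against a capacity of several gigabases.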

Fig. 4. Estimated frequency histograms and abundance plots for the CAMI dataset (left) and the CROHN1, CROHN2, CROHN3, and CROHN4 datasets (right). The distribution of heights (frequencies) of individual genomes within a metagenome was obtained based on alignments of short reads to SLRs. For the CAMI dataset, we compared the constructed plots with the blue plot representing the reference genomes with known abundances.

Estimating Metagenome Capacity Using Long Error-Prone SMS Reads. Although SMS reads (e.g., reads generated using Pacific Biosciences and Oxford Nanopore technologies) are still rarely used for analyzing metagenomes [9], they have the potential to be widely used in future metagenomics projects once their cost decreases and once the Read Until technology [30] developed by Oxford Nanopore becomes widely available. Below we show how to extend our approach to estimating the metagenome capacity using SMS reads.

SMS reads present an attractive alternative to TSLRs since their average length is higher and since they feature a uniform coverage depth that is not affected by the GC content. However, aligning short Illumina reads against error-prone SMS reads is more challenging than aligning them against accurate TSLRs. We addressed this complication using the bowtie2 alignment tool [24] with parameters specially selected for aligning short Illumina reads against error-prone SMS reads (-D 40 -R 3 -N 0 -L 17 -i S,1,0.50 --rdg 1,3 --rfg 1,3 --score-min L,-0.6,-1 -a). However, even with these custom parameters, bowtie2 fails to detect alignments for \({\approx }20\%\) of Illumina reads, resulting in an underestimation of the heights of long reads. To compensate for this effect, we applied an adjustment factor \(\frac{100}{100-20}=1.25\) to artificially inflate the heights in our formula for estimating the metagenome capacity.
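The alignment settings and the height correction described above can be sketched as follows. The alignment flags are copied from the text; the input and output options (`-x`, `-U`, `-S`), the file names, and the helper names are assumptions added for illustration, since the paper does not describe how the command was invoked. The sketch only builds the command line and does not run bowtie2.

```python
import shlex
from typing import List

def bowtie2_command(index_prefix: str, reads_fastq: str, out_sam: str) -> List[str]:
    """Build a bowtie2 call with the alignment parameters quoted in the text.
    The index is assumed to have been built from the SMS reads beforehand."""
    return [
        "bowtie2",
        "-D", "40", "-R", "3", "-N", "0", "-L", "17",
        "-i", "S,1,0.50",
        "--rdg", "1,3", "--rfg", "1,3",
        "--score-min", "L,-0.6,-1",
        "-a",                # report all alignments
        "-x", index_prefix,  # hypothetical index name
        "-U", reads_fastq,   # hypothetical unpaired short-read file
        "-S", out_sam,       # hypothetical output file
    ]

def adjustment_factor(missed_percent: float) -> float:
    """Factor that compensates for the share of short reads left unaligned."""
    if not 0 <= missed_percent < 100:
        raise ValueError("missed_percent must be in [0, 100)")
    return 100.0 / (100.0 - missed_percent)

def adjusted_height(observed_height: float, missed_percent: float = 20.0) -> float:
    """Inflate an observed long-read height by the adjustment factor."""
    return observed_height * adjustment_factor(missed_percent)

command = bowtie2_command("sms_reads_index", "illumina_reads.fastq", "alignments.sam")
print(shlex.join(command))
print(adjustment_factor(20.0))
```

With 20% of reads missed, the factor is 100 / 80 = 1.25, matching the value in the text. A single global factor treats the missed alignments as evenly spread across long reads, which is a simplification.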

Currently, there is a shortage of publicly available hybrid metagenomics datasets (containing both Illumina and SMS reads). Ideally, Illumina and SMS reads for such datasets should be generated at the same time so that the abundances of individual genomes within a metagenome are the same for both read types, implying that the depth of coverage by Illumina reads is proportional to the depth of coverage by SMS reads. In practice, since the SMS reads for these datasets were often generated as an afterthought, the Illumina and SMS reads for the publicly available hybrid metagenomics datasets were generated at different time points and prepared for sequencing using different sample preparation protocols. Thus, since metagenome composition changes over time and is subject to blooms [33], the existing hybrid datasets do not necessarily feature proportional depths of coverage by Illumina and SMS reads. Our analysis revealed that the fractions of Illumina and SMS reads aligned to each of the reference genomes for a publicly available hybrid synthetic metagenomic dataset may differ by two orders of magnitude. This difference in genome coverage between short and long reads makes our approach inapplicable to the currently available hybrid metagenomics datasets.
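The proportionality check described above can be illustrated with a short sketch. Given per-genome counts of aligned reads for each technology, it converts the counts to fractions and reports the largest discrepancy in orders of magnitude. The function names and the toy counts are hypothetical and are not data from the paper.

```python
import math
from typing import Dict

def read_fractions(aligned_counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of aligned reads assigned to each reference genome."""
    total = sum(aligned_counts.values())
    if total <= 0:
        raise ValueError("no aligned reads")
    return {genome: count / total for genome, count in aligned_counts.items()}

def max_log10_discrepancy(short_counts: Dict[str, int],
                          long_counts: Dict[str, int]) -> float:
    """Largest |log10(short fraction / long fraction)| over genomes that
    received reads from both technologies."""
    short_frac = read_fractions(short_counts)
    long_frac = read_fractions(long_counts)
    shared = [g for g in short_frac
              if long_frac.get(g, 0.0) > 0 and short_frac[g] > 0]
    if not shared:
        raise ValueError("no genome is covered by both read types")
    return max(abs(math.log10(short_frac[g] / long_frac[g])) for g in shared)

# Toy counts of aligned reads per reference genome (hypothetical).
illumina_counts = {"genome_A": 9000, "genome_B": 900, "genome_C": 100}
sms_counts = {"genome_A": 500, "genome_B": 400, "genome_C": 100}

discrepancy = max_log10_discrepancy(illumina_counts, sms_counts)
print(f"largest discrepancy: {discrepancy:.2f} orders of magnitude")
```

A value near zero would indicate proportional coverage by the two read types. A value near two corresponds to the hundredfold differences reported in the text.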

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Bankevich, A., Pevzner, P. (2018). Long Reads Enable Accurate Estimates of Complexity of Metagenomes. In: Raphael, B. (eds) Research in Computational Molecular Biology. RECOMB 2018. Lecture Notes in Computer Science, vol 10812. Springer, Cham. https://doi.org/10.1007/978-3-319-89929-9_1

  • DOI: https://doi.org/10.1007/978-3-319-89929-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-89928-2

  • Online ISBN: 978-3-319-89929-9

  • eBook Packages: Computer Science, Computer Science (R0)
