Towards precision quantification of contamination in metagenomic sequencing experiments
Metagenomic next-generation sequencing (mNGS) experiments involving small amounts of nucleic acid input are highly susceptible to erroneous conclusions resulting from unintentional sequencing of occult contaminants, especially those derived from molecular biology reagents. Recent work suggests that, for any given microbe detected by mNGS, an inverse linear relationship between microbial sequencing reads and sample mass implicates that microbe as a contaminant. By associating sequencing read output with the mass of a spike-in control, we demonstrate that contaminant nucleic acid can be quantified in order to identify the mass contributions of each constituent. In an experiment using a high-resolution (n = 96) dilution series of HeLa RNA spanning 3 logs of RNA mass input, we identified a complex set of contaminants totaling 9.1 ± 2.0 attograms. Given the competition between contamination and the true microbiome in ultra-low biomass samples such as respiratory fluid, quantification of the contamination within a given batch of biological samples can be used to determine a minimum mass input below which sequencing results may be distorted. Rather than completely censoring contaminant taxa from downstream analyses, we propose here a statistical approach that allows separation of the true microbial components from the actual contribution due to contamination. We demonstrate this approach using a batch of n = 97 human serum samples and note that despite E. coli contamination throughout the dataset, we are able to identify a patient sample with significantly more E. coli than expected from contamination alone. Importantly, our method assumes no prior understanding of possible contaminants, does not rely on any prior collection of environmental or reagent-only sequencing samples, and does not censor potentially clinically relevant taxa, thus making it a generalized approach to any kind of metagenomic sequencing, for any purpose, clinical or otherwise.
Keywords: Metagenomics, Sequence analysis, DNA, DNA contamination, Regression analysis, Microbiota
Abbreviations
ERCC: External RNA Controls Consortium
mNGS: Metagenomic next-generation sequencing
SRA: Sequence Read Archive
Metagenomic next-generation sequencing (mNGS) is a highly sensitive tool capable of detecting even single fragments of nucleic acid. While this sensitivity allows for detection of rare organisms within a much larger host background, sensitivity is a double-edged sword, as reagent and environmental contamination, ubiquitous in sequencing experiments, will also be detected and potentially misinterpreted. Contamination can be introduced by the environment, reagents, handlers, or machines at any point during the collection of the sample, the extraction of nucleic acid, or the preparation of libraries [1, 2, 3, 4, 5, 6]. This can lead to results that vary widely between laboratories, reagent kits, or extraction batches [6, 7, 8], can result in false-negative or false-positive assessments [9, 10, 11, 12], and can provide misleading information about microbiological niches [13, 14, 15, 16]. While steps can be taken to minimize contamination, existing best practices are unable to completely prevent it or control for it; therefore, it is critical that contamination is addressed during sequencing analyses in order to prevent misleading results, particularly from low biomass samples [4, 8, 10, 12, 16, 17, 18, 19, 20, 21].
In the December 2018 issue of Microbiome, Davis et al. present an elegant approach to the identification of contamination in metagenomic sequencing results. Their approach relies on two core principles: first, that contaminant sequences are inversely correlated with total sequencing reads (the frequency-based approach), and second, that contaminant sequences are present in more controls than samples (the prevalence-based approach). Their work employs several statistical methods that culminate in a classification threshold ranging from 0 to 1. Once the threshold is set (the authors recommend 0.1 to start), a list of contaminant DNA can be compiled. Analyzing sequences according to these principles eliminates the need to assign an arbitrary threshold for removing sequences and reduces reliance on an a priori set list of known contaminants. Davis et al. then provide a user-friendly R package entitled decontam and validate their approach on multiple datasets to demonstrate robust detection of contaminating sequences in both shotgun and 16S sequencing results [22, 23].
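The frequency-based principle can be illustrated with a short Python sketch. This is a simplified illustration and not the decontam implementation: it assumes that each sample has a measure of total input (the name total_input is hypothetical) and compares two fixed-slope models in log-log space, one in which a taxon's read fraction falls in inverse proportion to input (slope -1, contaminant-like) and one in which it is independent of input (slope 0, sample-like). The function name and the use of a plain ratio of residual sums of squares are choices made for this example only.

```python
import numpy as np

def frequency_rss_ratio(taxon_fraction, total_input):
    """Ratio of residual sums of squares: contaminant model / sample model.

    Values well below 1 mean the slope -1 (contaminant-like) model fits better.
    """
    log_f = np.log(np.asarray(taxon_fraction, dtype=float))
    log_m = np.log(np.asarray(total_input, dtype=float))
    # With the slope fixed, the least-squares intercept is the mean of
    # (y - slope * x), so the residuals are deviations from that mean.
    contaminant_target = log_f + log_m      # slope fixed at -1
    sample_target = log_f                   # slope fixed at 0
    rss_contaminant = np.sum((contaminant_target - contaminant_target.mean()) ** 2)
    rss_sample = np.sum((sample_target - sample_target.mean()) ** 2)
    if rss_sample == 0.0:
        return float("inf")
    return float(rss_contaminant / rss_sample)

# Hypothetical values for illustration only.
total_input = [1.0, 10.0, 100.0, 1000.0]
contaminant_like = [0.01 / m for m in total_input]   # fraction falls as input rises
sample_like = [0.10, 0.12, 0.09, 0.11]               # fraction roughly constant

print(frequency_rss_ratio(contaminant_like, total_input))
print(frequency_rss_ratio(sample_like, total_input))
```

A small ratio flags a taxon whose abundance tracks input inversely. Converting such a ratio into a score between 0 and 1, as the classification threshold described above requires, needs an additional statistical step that this sketch omits.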
The approach employed by Davis et al. is particularly useful in identifying contamination in low biomass samples, and the authors rightly point out that the assumptions of their approach break down when the contaminant mass (C) approaches the total input sample mass (S). For any given sample in any given mNGS experiment, the exact limit at which input sample mass becomes so small that contamination dominates the results remains unknown.
In our own work in the area of clinical mNGS, this issue has been a cause of constant concern [16, 24]. Paralleling the work of Davis et al., we sought better methods to characterize the lower limit of sample input so that the contribution of each contaminating component can be identified and quantified automatically. Here, we suggest an amendment to the method of Davis et al. This improvement relies on determining an association between sequencing read output and input mass, made possible through the incorporation of a series of precise spike-in controls. With this association in hand, a straightforward statistical method allows the identification and separation of contaminating components from those inherent to the sample itself, without censoring.
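The association between read output and input mass can be sketched as a simple proportion. The Python example below is a minimal sketch under stated assumptions: reads are taken to be proportional to nucleic acid mass, the spike-in and the sample are assumed to pass through library preparation with equal efficiency, and the function name, the read counts, and the 25 pg spike-in mass are hypothetical values chosen for illustration rather than figures from this article.

```python
def mass_from_spike_in(reads, spike_in_reads, spike_in_mass_pg):
    """Estimate the mass (pg) behind a read count by scaling to a spike-in.

    Assumes reads are proportional to input mass, so that
    mass / reads is the same for the spike-in and for everything else.
    """
    if spike_in_reads <= 0:
        raise ValueError("spike-in reads must be positive to scale by mass")
    return spike_in_mass_pg * reads / spike_in_reads

# Hypothetical library: a 25 pg spike-in that yields 50,000 reads.
spike_in_mass_pg = 25.0
spike_in_reads = 50_000

# Total input mass estimated from all non-spike-in reads.
sample_mass_pg = mass_from_spike_in(2_000_000, spike_in_reads, spike_in_mass_pg)

# Mass attributable to one taxon, estimated from that taxon's reads.
taxon_mass_pg = mass_from_spike_in(20, spike_in_reads, spike_in_mass_pg)

print(sample_mass_pg, taxon_mass_pg)
```

Applying the same scaling to each taxon's reads is what allows a contaminant to be expressed as a mass rather than as a read count, and expressing it as a mass is what makes it comparable across samples with different inputs.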
After microbial taxa are binned as either contaminant or true constituent of the microbiome, Davis et al. propose that contaminant taxa are censored from the dataset and nicely demonstrate a reduction in batch effect and other experimental improvements. However, as described by the authors, one significant limitation of the approach is that “decontam assumes that contaminants and true community members are distinct from one another.” In our view, such binary assignments are not realistic for a number of important microbes in numerous experimental situations. Consider the example of a human patient harboring an Escherichia coli bloodstream infection. As E. coli appears to be a ubiquitous laboratory contaminant, attempts to sequence the metagenome from a blood sample would produce a final E. coli sequencing count with contributions from both reagent contamination and the true microbiome. Disregarding E. coli as a component of the microbiome based on its identification as a contaminant would result in a false negative report, which could be disastrous in the field of clinical metagenomics for infectious disease diagnostic purposes. Similar vignettes can be described for numerous microbes that are both pathogenic and common laboratory contaminants, including Staphylococcus aureus and Pseudomonas aeruginosa.
In summary, Davis et al. present an intuitive and straightforward approach to identifying contamination in metagenomic sequencing experiments. When microbe sequencing quantity is inversely proportional to total sample input mass, it is suspicious for contamination; we thus suggest that assessing the studentized residual for each sample can provide a probabilistic assessment of the degree to which a contaminant might also be present in the true sample metagenome. The inclusion of ERCC controls provides the additional benefit of allowing sample input mass to be calculated even for picogram-level samples. In short, this statistical approach allows an investigator to separate the estimated contribution from contamination from the true sample-derived component without censoring the organism from all further analyses. Importantly, our method assumes no prior understanding of possible contaminants and does not rely on any prior collection of environmental or reagent-only sequencing samples, thus making it a generalized approach to any kind of metagenomic sequencing, for any purpose, clinical or otherwise.
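The studentized residual assessment can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it fits an ordinary least-squares line to the log10 read fraction of one taxon against log10 input mass across a batch, then computes externally studentized residuals with NumPy. The function name, the synthetic data, and the choice of sample 7 as the elevated sample are all hypothetical.

```python
import numpy as np

def studentized_residuals(x, y):
    """Externally studentized residuals for a straight-line fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    design = np.column_stack([np.ones(n), x])          # intercept and slope
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    # Leverage h_i = x_i^T (X^T X)^{-1} x_i for each sample.
    leverage = np.sum((design @ np.linalg.inv(design.T @ design)) * design, axis=1)
    p = design.shape[1]
    s2 = resid @ resid / (n - p)
    internal = resid / np.sqrt(s2 * (1.0 - leverage))
    # Convert to external form, which excludes each sample from its own
    # variance estimate so that an outlier cannot mask itself.
    return internal * np.sqrt((n - p - 1) / (n - p - internal ** 2))

# Synthetic batch: contaminant-like trend (slope -1) with small deterministic
# scatter, plus one sample carrying tenfold more of the taxon than the trend.
log_mass = np.linspace(0.0, 3.0, 12)                   # log10 input mass
scatter = 0.05 * np.sin(1.7 * np.arange(12))
log_fraction = -2.0 - log_mass + scatter
log_fraction[7] += 1.0

t_values = studentized_residuals(log_mass, log_fraction)
flagged = int(np.argmax(t_values))
print(flagged, t_values[flagged])
```

A large positive residual marks a sample with more of the taxon than the batch-wide contamination trend predicts. Because the taxon is kept in the analysis rather than censored, the excess remains available for interpretation.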
The presence of E. coli and S. maltophilia RNA in the original sera was confirmed using custom PCR primers for the following inserts with 100% BLAST homology for their respective species. E. coli insert: TCAGCACGATTTCAGTCTGAGTCGGACATTCAGCAGTGATACCCGCAGGCAGCTGATGGTCAACAGGATGAGAGAAACCCAGAGACAGGTTAATCACATTGCCTTTAACCGCTGCACGGTAACCTACACCAACCAGCTGCAGCTTCTTAGTGAAGCCTTCGGTAACACCGATAACCATTGAGTTCAGCAGGGCACGCGCGGTACCAGCCTGTGCCCAACCGTCTGCGTAACCATCACGCGGACCGAAGGTCAGGGTATTATCTGCATGTTTAACTTCAACAGCATCGTT. S. maltophilia insert: ATAGCCCTGTATCTGAAAGGGCCATTTCAGTGAAGACGAGTAGGGCGGGGCACGTGAAACCCTGTCTGAACATGGGGGGACCATCCTCCAAGGCTAAATACTACTGACCGACCGATAGTGAACCAGTACCGTGAGGGAAAGGCGAAAAGAACCCCGGAGAGGGGAGTGAAATAGAACCTGAAACCGTGTGCGTACAAGCAGTAGGAGCTCCGCAAGGAGTGACTGCGTACCTTTTGTATAATGGGTCAGCGACTTACTG
Zinter MS received funding from the Eunice Kennedy Shriver National Institute of Child Health & Development K12HD000850. Ryckman KK and Jelliffe-Pawlowski LL received funding from the University of California, San Francisco California Preterm Birth Initiative (PTBi-CA). DeRisi JL received funding from the Chan Zuckerberg Biohub.
Availability of data and materials
Data files are available in the Sequence Read Archive as BioProjects PRJNA516238 and PRJNA516235. The IDseq bioinformatics pipeline for microbial taxa detection within metagenomic samples is freely available at: https://github.com/chanzuckerberg/idseq-web
All authors participated in the conception, planning, and writing of this manuscript. All authors approved of the final version.
Ethics approval and consent to participate
Methods and protocols for the study were approved by the Committee for the Protection of Human Subjects within the Health and Human Services Agency of the State of California (#12-090702).
Consent for publication
Competing interests
The authors declare that they have no competing interests.
9. Lee D, Das Gupta J, Gaughan C, Steffen I, Tang N, Luk KC, Qiu X, Urisman A, Fischer N, Molinaro R, Broz M, Schochetman G, Klein EA, Ganem D, Derisi JL, Simmons G, Hackett J Jr, Silverman RH, Chiu CY. In-depth investigation of archival and prospectively collected samples reveals no evidence for XMRV infection in prostate cancer. PLoS One. 2012;7(9):e44954.
10. Bittinger K, Charlson ES, Loy E, Shirley DJ, Haas AR, Laughlin A, Yi Y, Wu GD, Lewis JD, Frank I, Cantu E, Diamond JM, Christie JD, Collman RG, Bushman FD. Improved characterization of medically relevant fungi in the human respiratory tract using next-generation sequencing. Genome Biol. 2014;15(10):487.
16. Wilson MR, O'Donovan BD, Gelfand JM, Sample HA, Chow FC, Betjemann JP, Shah MP, Richie MB, Gorman MP, Hajj-Ali RA, Calabrese LH, Zorn KC, Chow ED, Greenlee JE, Blum JH, Green G, Khan LM, Banerji D, Langelier C, Bryson-Cahn C, Harrington W, Lingappa JR, Shanbhag NM, Green AJ, Brew BJ, Soldatos A, Strnad L, Doernberg SB, Jay CA, Douglas V, Josephson SA, DeRisi JL. Chronic Meningitis Investigated via Metagenomic Next-Generation Sequencing. JAMA Neurol. 2018.
20. Minich JJ, Zhu Q, Janssen S, Hendrickson R, Amir A, Vetter R, Hyde J, Doty MM, Stillwell K, Benardini J, Kim JH, Allen EE, Venkateswaran K, Knight R. KatharoSeq enables high-throughput microbiome analysis from low-biomass samples. mSystems. 2018;3(3). https://doi.org/10.1128/mSystems.00218-17. eCollection 2018 May-Jun.
23. Karstens L, Asquith M, Davin S, Fair D, Gregory WT, Wolfe AJ, Braun J, McWeeney S. Controlling for contaminants in low biomass 16S rRNA gene sequencing experiments. bioRxiv.
24. Zinter MS, Dvorak CC, Mayday MY, Iwanaga K, Ly NP, McGarry ME, Church GD, Faricy LE, Rowan CM, Hume JR, Steiner ME, Crawford ED, Langelier C, Kalantar K, Chow ED, Miller S, Shimano K, Melton A, Yanik GA, Sapru A, DeRisi JL. Pulmonary Metagenomic Sequencing Suggests Missed Infections in Immunocompromised Children. Clin Infect Dis. 2018.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.