RiboTagger: fast and unbiased 16S/18S profiling using whole community shotgun metagenomic or metatranscriptome surveys
- 1.7k Downloads
Taxonomic profiling of microbial communities is often performed using small subunit ribosomal RNA (SSU) amplicon sequencing (16S or 18S), while environmental shotgun sequencing is often focused on functional analysis. Large shotgun datasets contain a significant number of SSU sequences and these can be exploited to perform an unbiased SSU--based taxonomic analysis.
Here we present a new program called RiboTagger that identifies and extracts taxonomically informative ribotags located in a specified variable region of the SSU gene in a high-throughput fashion.
RiboTagger permits fast recovery of SSU-RNA sequences from shotgun nucleic acid surveys of complex microbial communities. The program targets all three domains of life, exhibits high sensitivity and specificity and is substantially faster than comparable programs.
KeywordsMicrobial ecology Short subunit ribosomal RNA Microbial community profiling Sequence analysis
Operational taxonomic unit
Position specific scoring matrix
Studying the composition and dynamics of microbial communities is a key problem in microbiome research and microbial ecology . Traditionally, these studies have been based on isolating and sequencing short subunits of the 16S and 18S genes, present in bacteria or archea, and eukaryotes, respectively. Typically most studies now make use of amplicon sequencing to obtain such data from complex microbial communities [2, 3]. The PCR amplicon technique was very useful when sequencing power was limited, however, with the increasing power and complexity of the new generation of sequencing technologies, the broad advantages of amplicon sequencing are starting to be balanced by major limitations, which include PCR primer selection and amplification bias [4, 5]. In particular, no PCR primers are able to amplify all known bacterial taxonomic groups efficiently and uniformly , which leads to biased rRNA profiling analysis, and the use of short read technologies, notably Illumina, results in a complex, interdependent chain of technical decisions, that can heavily influence the subsequent community profiling results .
Within microbiome research, there is increasing use of whole community gDNA surveys (i.e. shotgun metagenomics), which offer, at theoretically, a less biased view of community composition than using from amplicon based methods, by eliminating dependency on 16S primers . In practice however, the interpretation of shotgun metagenome data is heavily dependent on having access to reference genomes of community members, without which substantial limitations of interpretation may arise . While the intended use of these shotgun data is typically to capture functional capacity of a community , or to permit member genome recovery , it has been recognized that whole commnuity shotgun surveys will of course contain a substantial number of reads derived from SSU-rRNA genes, and these can be exploited for the purposes of community profiling [10, 11, 12, 13, 14, 15, 16, 17]. Similarly, when using total RNA metatranscriptome sequencing, rRNA often account for 95% of reads sequenced, and thus provides coverage of SSU diversity to great depth. Within this general area, a number of examples of this approach have been undertaken and several software implementations of this approach are now available [13, 16, 17], mostly based on the use of Hidden Markov Models to capture reads of SSU-origin from the total read population [10, 11, 12, 13, 14, 16, 17].
Here we present a new open source software package, RiboTagger, (https://github.com/xiechaos/ribotagger) to analyze rRNA data from shotgun sequencing reads. The software takes raw metagenome or metatranscriptome sequencing reads in FASTQ or FASTA files as input, and is able to process billions of Illumina HiSeq reads under an hour. RiboTagger produces a BIOM formatted files for downstream analysis in standard packages like QIIME  or MEGAN . It is equally sensitive over all known bacterial and archaeal phyla and classes, and highly specific in not classifying non-rRNA sequences as rRNA.
Results and discussion
Application of our universal profiles to the Greengenes database gave rise to a set of 90,061 ribotags. Each ribotag was assigned a taxon based on the set of SSU sequences that contain it, using the majority taxon in the case of discordance. The sensitivity of each of the profiles is over 95% percent (Fig. 2), as established by applying the profiles to all SSU sequences in the SILVA database . To test the false positive rate, we ran the profiles on 5.6 billion faux reads, obtained by sampling all 80 nt non-overlapping fragments in the RefSeq database (release 66), calculating the false positive rate to be at most 2 × 10−6 in all cases (Fig. 2). To address the problem of the extent to which sequencing-error can generate false ribotags, RiboTagger estimates the expected frequency of each ribotag due to sequencing error, based on the observed frequencies of ribotags that differ by one letter and a simple probabilistic model.
Due to the short length of the 16S tags, we need to consider the possible influence of sequencing errors explicitly. PCR artifacts during library generation would lead to abnormally high numbers of reads with identical starting position and sequence content. Therefore, simply counting sequencing reads with different starting position can detect possible PCR artifacts. Using stringent quality value filtering for each nucleotide that covers 16S tags can also the impact of limit potential sequencing errors, which results extremely low error probability for each sequencing read considered. For example, with quality value threshold 30, the probability of sequencing error is only 0.1%. In addition, we also calculate the number of expected number of occurrence of any 16S tag, assuming the tag is not present at all. Assuming the tag of interest is not present, given the observed average sequencing error probability at each nucleotide position along the tag and the abundance of all other tag sequences with one nucleotide difference in the data, we can calculate the expected abundance of the tag of interest by sequencing error. Firstly, for the tag under study, all tag sequences with one nucleotide difference from the tag of interest are collected. Each of the 1 nt neighbors will contribute E = n × e/3 false positive counts to the tag of interest, where n is the number of reads covering the 1 nt neighbour and e is the observed error probability for the difference position on the tag of interest. The sum of E for all 1 nt neighbours gives the expected false positive tag of interest due to one sequencing error. Multiple sequencing errors can be also considered in a similar fashion but due to the quality-value filtering step, we can consider the probability of observing multiple sequencing errors in ribotag sequences as negligible.
To illustrate the results obtained from RiboTagger, we ran RiboTagger against a coupled DNA-Seq and RNA-Seq dataset generated from an activated sludge community of an operational wastewater treatment (Ulu Pandan Water Reclamation Plant, Northworks, Tank 3C), operated by the Public Utilities Board, Republic of Singapore; samples obtained between 10/08/2012 and 17/08/2012). Genomic DNA and total RNA sequencing were each performed on a HiSeq2500 Rapid run using 250 bp paired end sequencing for DNA samples, and 150 bp paired end read sequencing for RNA samples. We observed a total of 4686 V4-region ribotags in the entire dataset. From the DNA dataset, we can estimate whether the number of observed 16S tags is as expected using the following rough approximation: if we assume there are up to 5000 genes in a typical free-living bacteria , and if we neglect the likely variation in 16S copy number across the member species in the community, and assume that one of those is a 16S gene with roughly 9 equally sized V-regions and 10 equally sized conserved regions, then we would expect approximately 308,505,950/(19 × 5000) = 3247 reads to originate from a single V-region in our analysis, which is a conservative underestimate to our observed number. For the RNA data, we report only results for one sample to for purposes of illustration, specifically with a total of 41,523,808 RNA reads available after QC filtering., we obtained a total of 4867 tags, of which the top 11 accounted for 20%, 80% and 95% of community membership (as described by total number of reads). Collectively there are 327 unique genera detected, 203 families, 128 orders and 51 phyla, The majority of annotated tags at kingdom level (1450) were attributable to bacteria, as expected in this community, with 34 and 40 tags being assigned to archaea and eukaryota, respectively. Approximately 3343 tags could not be assigned any annotation using SILVA v119 , including 1 and 41 in the top 20% and 50% of the community, respectively. These latter results highlight the substantial numbers of unknown taxa residing in complex microbial communities.
Comparison to related programs
Using the RNA dataset described above, we compared the computational time of RiboTagger against two other existing search tools, RiboFrame  and SSUSearch , which are designed for retrieval of 16S sequences from whole metagenomic sequencing datasets. The comparison was carried out on a standalone server with the following specifications—2 Intel Xeon X7542 (18 M Cache, 2.66 GHz) CPUs, 128 GB memory, internal 146 GB HDD and external 8 TB RAID HDD. RiboTagger took approximately 90 min to complete, and we observed RiboTagger to be approximately 6 times faster than RiboFrame (~9 h) and 6.6 times faster than SSUSearch (~10 h). As RiboTagger and SSUsearch both used SILVA annotations, we compared the identity and relative abundance of detected phlya between both programs. In total 112 phyla were detected (51 with Ribotagger and 86 with SSUsearch) of which 25 were common between the two methods. The number of unclassifiable sequences differed, with 21.6% being called in SSUsearch and 47.5% in RiboTagger, however, of the 25 common phyla, 7 accounted for 95% of community composition in each set of results. The overall correlation between relative abundance was 0.99 (with 25 common phyla) and 0.88 (using all 112 phyla with non-detects set as zero). We note that while some of these differences may be accounted for by different database versions (SILVA v115 and SILVA v119 for SSUearch and Ribotagger, respectively) and/or handling of unclassifible sequences, these results suggest that further investigation of the differences between detection methodologies are warranted.
RiboTagger is a implemented as a platform independent Perl program. It can be executed in a single command in several modes with are briefly described here, along with their respective outputs. Full details can be found on the RiboTagger project page on Github (see Availability and Requirements)
A. Single input file mode
tag: the tag sequence for the variable region
n: the number of reads that contains this tag
npos: the number of different locations of the tag on their source reads (a large value of n and a small value of npos indicates the presence of duplicated reads or would be observed if applied to amplicon sequencing reads).
fp: the number of reads you would expect to see this tag due to sequencing errors alone
long.total.count: the number of reads containing a longer sequence of this tag (see the-long option)
long1.count, long2.count, long3.count: number of reads containing the most abundant variants of this tag’s long sequences (low long1.count/long.total.count ratio indicates that this tag is very likely representing a mixture of “species”)
long1, long2: the most abundant long representive sequences of this tag
B. Multiple input file mode
tag: the ribotag sequence
use: “tag” or “long”, whether the annotation was based on the short tag or long representative sequence
taxon_level: taxa rank of this annotation of this tag
taxon_data: taxa rank of the most specific annotation appeared in the database (SILVA or Greengenes) for this tag
long: the long representative sequence of this tag
long_total: the number of samples having any long representative sequence
long_this: the number of samples having this long sequence as its major representative of this tag
support: the number of database sequences having this tag or long sequence
confidence: the proportion of the database sequences agreed on this annotation
k, p, c, o, f, g, s: annotation for each of the taxa ranks, namely kingdom/domain, phylum, class, order, family, genus, and species
The *.xls file is an Excel file combining data from the *.tab and *.anno files. Using biom.pl, a *.biom file can be subsequently used by QIIME . We have implemented options for generating annotations to either SILVA  or Greengenes . All 4 files types can automatically be generated if RiboTagger is run in batch mode.
Here, we have developed software for the fast recovery of SSU-RNA sequences from shotgun nucleic acid surveys of complex microbial communities. Our code is fast, completing an analysis of about 40 M reads within 1.5 h, and will output an annotated matrix of read counts that can be used for downstream community profiling analysis with minimal further processing. Our implementation executes in a single line, avoiding the complications and the lack of robustness inherent in combination-type pipelines and is at least 6 times faster than SSUsearch  and RiboFrame . Additional, we also note that our approach avoids the use of OTU generation, which recent analyses suggest may carry significant advantages in resolving intra-community dynamics for some classes of experimental design, such as time series experiments .
Availability and requirements
Project name: RiboTagger
Project home page: https://github.com/xiechaos/ribotagger
Operating system (s): Platform independent
Programming language: Perl
Other requirements: None
Any restrictions to use by non-academics: none.
We thank our colleagues Larry Liew, Stefan Wuertz, Yehuda Cohen, Uma Shankari d/o Chanda Segaran, Stephan C. Schuster, Daniela Moses for providing access to data, and Siao Yun Chang, Choon Leng Lau and their colleagues from the Public Utilities Board (Republic of Singapore) for assistance with access to facilities and sampling operations.
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 19, 2016. 15th International Conference On Bioinformatics (INCOB 2016): bioinformatics. The full contents of the supplement are available online https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-19
Publication charges for this article were supported in full by the Singapore National Research Foundation and the Ministry of Education under the Research Centre of Excellence Programme.
Availability of data and material
The sequencing datasets analyzed during the current study are not yet publicly available due to their forming part of a large set of data being released as part of a primary manuscript. They are available on request to R.B.H.W.
C.X. conceived and designed the approach and implemented methods. C.X., C. L. W.G and R.B.H.W analyzed example data, and all authors interpreted the results. The paper was written by C.X, D.H.H and R.B.H.W with specific contributions from C.L.W.G and P.F.R.L. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
Consent for publication
- 12.Bengtsson J, et al. Metaxa: a software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequencing datasets. Antonie Van Leeuwenhoek. 2011;100:471–5.CrossRefPubMedGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.