Accurate Genome Relative Abundance Estimation Based on Shotgun Metagenomic Reads
Keywords: Metagenomic Sequencing; Human Gastrointestinal Tract; Metagenomic Sequencing Data; Bacteroides Genus; Local Similarity Analysis
Accurate estimation of microbial community composition from metagenomic sequencing data is fundamental for subsequent metagenomic analysis. It is also a challenging computational problem, because a metagenome is a mixture of many genomes and only a small fraction of that mixture is actually sequenced.
With the advent of next-generation sequencing (NGS) technologies, sequencing capacity has increased significantly while the length of a single read has decreased. This shift in sequencing technologies has affected downstream analyses. Specifically, identifying the origin of a read has become more difficult for two reasons. First, a large number of short reads cannot be uniquely mapped to a specific location of one genome. Instead, they map to multiple locations of one or multiple genomes. These ambiguities are directly associated with the reduction in read length in NGS technologies. Second, communities usually consist of many microbes with similar genomes that differ only in some regions, which can make it impossible to determine the origin of a particular short read based solely on its sequence.
Despite these difficulties, NGS read sets carry richer information about the abundance of microbial communities than traditional datasets because they contain far more reads. Along with the growth in read set size, efforts to assemble more reference genomes are ongoing. In addition, new experimental techniques, such as single-cell sequencing approaches, are being developed to sequence reference genomes directly from environmental samples. In the face of the challenges posed by short reads and the opportunities offered by fast-expanding reference genome databases, GRAMMy was developed as a statistical framework to accurately and efficiently estimate the relative abundance of microbial organisms within a community (Xia et al. 2011).
The GRAMMy Framework
In the typical GRAMMy workflow, which is shown in Fig. 2, the end user starts with the metagenomic read set and the reference genome set and then chooses between the mapping-based (“map”) and k-mer composition-based (“k-mer”) assignment options (He and Xia 2007). With either option, the assignment procedure produces an intermediate matrix describing the probability that each read is assigned to each of the reference genomes. This matrix, along with the read set and reference genome set, is passed to the expectation-maximization (EM) algorithm module, which estimates the genome relative abundance (GRA) levels. After the calculation, GRAMMy outputs the GRA estimates as a numerical vector, together with the log-likelihood and the standard errors of the estimates. If taxonomy information for the input reference genomes is available, strain (genome) level GRA estimates can be combined to calculate abundances at higher taxonomic levels, such as species- and genus-level estimates.
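The estimation step described above can be illustrated with a minimal Python sketch. It is not the GRAMMy implementation or its API: the function names (`em_mixing_proportions`, `length_normalize`, `aggregate_by_taxon`) are hypothetical, the input is assumed to be a read-by-genome matrix of assignment probabilities, and dividing the mixing proportions by genome length to obtain GRA values is an assumption made here for illustration. Standard errors and the log-likelihood are omitted.

```python
def em_mixing_proportions(read_probs, n_iter=500, tol=1e-12):
    """EM for the mixing proportions of a finite mixture.

    read_probs[i][j] is the assumed probability of read i given genome j
    (the intermediate matrix produced by the assignment step).
    """
    n_genomes = len(read_probs[0])
    pi = [1.0 / n_genomes] * n_genomes  # start from a uniform guess
    for _ in range(n_iter):
        counts = [0.0] * n_genomes
        for row in read_probs:
            # E-step: posterior responsibility of each genome for this read
            weights = [p * r for p, r in zip(pi, row)]
            total = sum(weights)
            if total == 0.0:
                continue  # read matches no reference genome; ignore it
            for j, w in enumerate(weights):
                counts[j] += w / total
        used = sum(counts)
        if used == 0.0:
            raise ValueError("no read could be assigned to any genome")
        # M-step: new proportions are the normalized expected read counts
        new_pi = [c / used for c in counts]
        delta = max(abs(a - b) for a, b in zip(pi, new_pi))
        pi = new_pi
        if delta < tol:
            break
    return pi


def length_normalize(pi, genome_lengths):
    """Convert read-level proportions to genome-level relative abundance.

    Assumption for this sketch: longer genomes yield proportionally more
    reads, so each proportion is divided by genome length and renormalized.
    """
    scaled = [p / length for p, length in zip(pi, genome_lengths)]
    total = sum(scaled)
    return [s / total for s in scaled]


def aggregate_by_taxon(gra, taxon_labels):
    """Sum strain-level GRA values that share a higher-level taxon label."""
    combined = {}
    for value, label in zip(gra, taxon_labels):
        combined[label] = combined.get(label, 0.0) + value
    return combined


# Toy input: three reads unique to genome 0, one unique to genome 1,
# and one ambiguous read that matches both genomes equally well.
toy_matrix = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_pi = em_mixing_proportions(toy_matrix)
toy_gra = length_normalize(toy_pi, [2.0e6, 1.0e6])
toy_genus = aggregate_by_taxon(toy_gra, ["Bacteroides", "Bacteroides"])
```

In this toy input the ambiguous read is split between the two genomes in proportion to the current estimates rather than being discarded or assigned arbitrarily, which is the sense in which an EM approach handles read assignment ambiguity.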
Accurate GRAMMy Estimates with EM Algorithm
GRAMMy Estimates for Human Gut Metagenomes
GRAMMy is a rigorous probabilistic framework for accurately and efficiently estimating genome relative abundance (GRA) based on shotgun metagenomic reads. Users have a wide choice of mapping and alignment tools to assign reads to references. The method is particularly suitable for NGS short read datasets due to its better handling of read assignment ambiguities. GRAMMy tools are packaged as a C++ extension to Python, which can be downloaded freely from GRAMMy’s homepage: http://meta.usc.edu/softs/grammy.