A De Novo Metagenomic Assembly Program for Shotgun DNA Reads
Contig: a set of overlapping DNA segments that together represent a consensus region of DNA. Assembly (also genome assembly): the process of taking a large number of short DNA sequencing reads and putting them back together into contigs that reconstruct the DNA from which the reads originated.
MAP (metagenomic assembly program) is a de novo assembler designed for shotgun DNA reads (recommended length >200 bp) from metagenome sequencing projects (Lai et al. 2012). The program focuses on the metagenomic assembly problem for longer reads produced by, for example, Sanger sequencing (typically 700–1,000 bp) and 454 sequencing (typically 200–500 bp). It also makes use of mate-pair information, that is, reads from both ends of a DNA fragment of a given size (e.g., an insert in a vector plasmid in Sanger sequencing or a mate-pair template in 454 sequencing). Such information is commonly available in Sanger sequencing and in most of the new sequencing technologies, including 454 sequencing.
Although processing of shotgun metagenomic sequence data usually does not have a fixed end point, such as the recovery of one or more complete genomes as for isolated microbial genomes, assembly tools, which aim to combine sequence reads into contigs, are still expected to play an important role in sequence processing because of the more valuable genomic content they can provide (Tyson et al. 2004; Venter et al. 2004). In the past decade, many assembly algorithms have been proposed for the sequence assembly problem. Among them are early algorithms targeted at Sanger sequencing technology, such as Phrap (http://www.phrap.org), Celera (Myers et al. 2000; Miller et al. 2008), and PCAP (Huang et al. 2003), and more recent algorithms targeted at next-generation technologies, such as Velvet (Zerbino and Birney 2008) and SOAPdenovo (Li et al. 2010). However, these methods do not target metagenome sequencing, even though they are still commonly used to assemble metagenomic sequencing reads.
Compared with the isolated genome assembly problem, the metagenomic assembly problem is more complicated because of two challenges (Kunin et al. 2008): (1) large numbers of mixed short DNA reads belong to many different species (for some environmental samples, little is even known about the population structure), so genomic repeats may originate from either the same genome or different genomes; and (2) the inhomogeneous coverage distribution and the low abundance of organisms provide limited information for handling repeats. Because of these challenges, traditional assembly methods developed for the single genome assembly problem usually generate poor-quality draft assemblies on metagenomic data (Mavromatis et al. 2007). Thus, there is a need for highly efficient assembly methods developed specifically for metagenomic data.
Moreover, compared with Sanger and 454 sequencing, the newer sequencing platforms currently produce shorter reads (<200 bp, typically 25–100 bp) with higher error rates, which limits their utility for metagenomic analyses because of the difficulty of phylogenetic study or gene function inference. In fact, short-read technologies have not been widely used in metagenome sequencing, while technologies producing longer reads, such as Sanger (usually 700–1,000 bp) and 454 sequencing (usually 200–500 bp), are still overwhelmingly recommended and thus remain the major source of metagenomic sequence data. Therefore, it remains important to emphasize the value of longer reads for metagenomic analyses, including read assembly tools designed specifically for them.
Algorithm of MAP
MAP follows the overlap-layout-consensus (OLC) approach, in which an overlap graph is used to facilitate the assembly process. Conceptually, reads and overlaps are represented in the graph G by nodes and bidirected edges, respectively. The arrows at the two ends of an edge are determined by the way in which the two reads overlap. Herein, a dovetail path is defined as an acyclic path in which each node has only one outward arrow and one inward arrow. A dovetail path therefore determines a contig by threading together the reads corresponding to the nodes in the path, and the goal of the layout stage is to separate the graph into disconnected dovetail paths. However, the graph may contain many misleading edges that represent false overlaps, mainly originating from two repetitive DNA regions or from similar fragments of different genomes, which makes this goal a formidable task. To this end, MAP is designed to determine the optimal dovetail paths with the aid of the clues given by mate pairs (Lai et al. 2012).
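The idea of a dovetail path can be illustrated with a minimal Python sketch. It is not MAP's implementation: it assumes error-free reads on a single strand (so plain directed edges stand in for bidirected ones), exact suffix-prefix matches, and no removal of transitive edges. All function names and the toy reads are hypothetical.

```python
from collections import defaultdict


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len):
    """Return successor and predecessor maps: succ[i][j] = overlap length of read i -> read j."""
    succ, pred = defaultdict(dict), defaultdict(dict)
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    succ[i][j] = k
                    pred[j][i] = k
    return succ, pred


def dovetail_paths(n, succ, pred):
    """Split the graph into maximal paths whose internal links are unambiguous."""

    def unique_next(u):
        # The link u -> v is unambiguous if u has one successor and v has one predecessor.
        if len(succ[u]) == 1:
            v = next(iter(succ[u]))
            if len(pred[v]) == 1:
                return v
        return None

    def unique_prev(v):
        if len(pred[v]) == 1:
            u = next(iter(pred[v]))
            if len(succ[u]) == 1:
                return u
        return None

    paths, seen = [], set()
    for start in range(n):
        # Begin only at nodes that cannot be extended backwards unambiguously.
        # Nodes lying on a pure cycle are never chosen as a start and are skipped.
        if start in seen or unique_prev(start) is not None:
            continue
        path = [start]
        seen.add(start)
        v = unique_next(start)
        while v is not None and v not in seen:
            path.append(v)
            seen.add(v)
            v = unique_next(v)
        paths.append(path)
    return paths


def thread(path, reads, succ):
    """Thread the reads along a path into one contig, skipping each overlapping prefix."""
    contig = reads[path[0]]
    for u, v in zip(path, path[1:]):
        contig += reads[v][succ[u][v]:]
    return contig


# Toy reads: three tile one short sequence, one is unrelated.
toy_reads = ["GTACGTTA", "CCCCGGGG", "ATGCGTAC", "GTTAGC"]
toy_succ, toy_pred = build_overlap_graph(toy_reads, min_len=3)
toy_contigs = [thread(p, toy_reads, toy_succ)
               for p in dovetail_paths(len(toy_reads), toy_succ, toy_pred)]
print(toy_contigs)
```

In this sketch a branching node simply ends a path. That is the situation in which, according to the text, MAP turns to mate pairs to decide how the path should continue.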
Compared with other assemblers, the MAP algorithm has several distinct features. First, MAP does not rely on additional information such as genome length or sequencing coverage, which is often used by assemblers targeting isolated genomes, because such information is clearly not applicable to metagenomic assembly. More importantly, MAP uses mate-pair information differently from other assemblers. For example, the Celera Assembler (Myers et al. 2000) used mate-pair information in scaffold construction. A newer pipeline, CABOG, was later developed for the Celera Assembler; it finds the best overlap graph in the unitigger module (Miller et al. 2008). In this algorithm, mate pairs are used to correct misassemblies by breaking the unitigs that are found to violate the mate-pair constraints. PCAP (Huang et al. 2003) used mate-pair information to correct contigs and to link contigs into scaffolds. Unlike these assemblers, MAP uses mate pairs as a core measure for constructing contigs when repeats hamper the assembly. Based on mate-pair information, MAP applies a series of procedures to implement the layout stage.
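How a mate pair might resolve an ambiguous extension can be sketched as follows. This is an illustrative assumption only, not the series of procedures used by MAP (see Lai et al. 2012 for those): the function name, its parameters, and the rule of accepting a candidate only when exactly one is supported are all hypothetical, and read orientation is ignored.

```python
def mate_supported_successor(candidates, placed_start, mate_of, read_len,
                             insert_range, next_start):
    """Pick the single candidate whose mate is already placed at a plausible distance.

    candidates   -- read ids competing to extend the path at a branch
    placed_start -- start coordinate of each read already laid out in the contig
    mate_of      -- read id -> id of its mate (reads without a mate are absent)
    read_len     -- read id -> read length in bp
    insert_range -- (lo, hi) accepted fragment sizes in bp
    next_start   -- coordinate at which the chosen candidate would start
    Returns the supported candidate, or None if zero or several are supported.
    """
    lo, hi = insert_range
    supported = []
    for c in candidates:
        m = mate_of.get(c)
        if m is None or m not in placed_start:
            continue  # no mate evidence for this candidate
        # Implied fragment span: from the start of the mate to the end of the candidate.
        span = next_start + read_len[c] - placed_start[m]
        if lo <= span <= hi:
            supported.append(c)
    return supported[0] if len(supported) == 1 else None


# Hypothetical branch: reads "a" and "b" both overlap the end of the current path.
choice = mate_supported_successor(
    candidates=["a", "b"],
    placed_start={"r1": 0},
    mate_of={"a": "r1", "b": "x9"},   # the mate of "b" is not in this contig
    read_len={"a": 800, "b": 800},
    insert_range=(2000, 4000),
    next_start=1800,
)
print(choice)
```

Returning None when several candidates are supported keeps the sketch conservative: an extension is made only when the mate-pair evidence is unambiguous.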
Performance of MAP
MAP is designed for metagenomic assembly of long-read data with mate pairs, such as Sanger reads (700–1,000 bp) and 454 reads (200–500 bp). The method was assessed on simulated data against widely used assemblers for long reads. Specifically, the results on a simulated dataset with 800 bp reads demonstrate that the overall assembly performance of MAP can be superior to that of both Celera and Phrap for the longer reads typical of Sanger sequencing, and the results on a simulated dataset with 200 bp reads show that MAP has an evident advantage over Celera, Newbler (Margulies et al. 2005), and Genovo (Laserson et al. 2011) for the shorter reads typical of 454 sequencing (Lai et al. 2012).
MAP is written in C++, and the source code is freely available under the GNU GPL license at http://bioinfo.ctb.pku.edu.cn/MAP/.
- Myers EW, Sutton GG, et al. A whole-genome assembly of Drosophila. Science. 2000;287:2196–204.