Encyclopedia of Metagenomics

Living Edition
Editors: Karen E. Nelson

A De Novo Metagenomic Assembly Program for Shotgun DNA Reads

  • Huaiqiu Zhu
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4614-6418-1_726-2

Synonyms

MAP: metagenomic assembly program

Definition

Contig: a set of overlapping DNA segments (reads) that together represent a consensus region of DNA. Assembly (also genome assembly): the process of taking a large number of short DNA sequencing reads and putting them back together into contigs that reconstruct the DNA from which the reads originated.
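
To make the two definitions concrete, the following minimal sketch merges three toy reads into one contig using exact suffix-prefix overlaps. The read sequences, the function names, and the `min_overlap` parameter are illustrative assumptions; real assemblers, including MAP, tolerate sequencing errors and do not know the read order in advance.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_reads(reads: list[str], min_overlap: int = 3) -> str:
    """Merge reads that are already in left-to-right order into one contig."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap_length(contig, read, min_overlap)
        if k == 0:
            raise ValueError("adjacent reads do not overlap")
        contig += read[k:]  # append only the part not covered by the overlap
    return contig


# Hypothetical toy reads (assumption: error-free and given in order).
reads = ["ATGGCGT", "GCGTACC", "ACCTTGA"]
contig = merge_reads(reads)
print(contig)  # ATGGCGTACCTTGA
```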

Introduction

MAP (metagenomic assembly program) is a de novo assembler designed for shotgun DNA reads (recommended length >200 bp) from metagenome sequencing projects (Lai et al. 2012). The program focuses on the metagenomic assembly problem for longer reads produced by, for example, Sanger sequencing (typically 700–1,000 bp) and 454 sequencing (typically 200–500 bp). MAP also exploits mate-pair information, that is, reads from both ends of a DNA fragment of a given size (e.g., an insert in a vector plasmid in Sanger sequencing or a mate-pair template in 454 sequencing). Such information is commonly available in Sanger sequencing and in most of the new sequencing technologies, including 454 sequencing.

Although processing of shotgun metagenomic sequence data usually does not have a fixed end point, such as the recovery of one or more complete genomes in the case of isolated microbial genomes, assembly tools, which aim to combine sequence reads into contigs, are still expected to play an important role in sequence processing because of the more valuable genomic content they can provide (Tyson et al. 2004; Venter et al. 2004). In the past decade, many assembly algorithms have been proposed for the sequence assembly problem. Among them are early algorithms targeted at Sanger sequencing technology, such as Phrap (http://www.phrap.org), Celera (Myers et al. 2000; Miller et al. 2008), and PCAP (Huang et al. 2003), and more recent algorithms targeted at next-generation technologies, such as Velvet (Zerbino and Birney 2008) and SOAPdenovo (Li et al. 2010). However, none of these methods was designed for metagenome sequencing, even though they are still commonly used to assemble metagenomic sequencing reads.

Compared with the assembly of isolated genomes, the metagenomic assembly problem is more complicated because of two challenges (Kunin et al. 2008): (1) genomic repeats may originate either from the same genome or from different genomes, since large numbers of short DNA reads from many different species are mixed together (for some environmental samples, little is even known about the population structure); and (2) the inhomogeneous coverage distribution and the low abundance of some organisms provide limited information for handling repeats. Because of these specific challenges, traditional assembly methods developed for single genomes usually generate poor-quality draft assemblies on metagenomic data (Mavromatis et al. 2007). Thus, there is a need to develop highly efficient assembly methods specifically for metagenomic data.
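
As a rough illustration of the second challenge, the expected coverage of each genome in a mixed sample can be estimated as (number of reads × fraction of reads from that genome × read length) / genome size. The sample composition, genome sizes, and function name below are hypothetical and are not taken from the MAP study; the sketch only shows how a low-abundance organism ends up with far less than onefold coverage while a dominant one is deeply covered.

```python
import math


def expected_coverage(total_reads: int, read_length: int,
                      read_fraction: float, genome_size: int) -> float:
    """Expected fold coverage of one genome in a mixed sample."""
    return total_reads * read_fraction * read_length / genome_size


# Hypothetical sample: 100,000 Sanger-like reads of 800 bp.
TOTAL_READS = 100_000
READ_LENGTH = 800

dominant = expected_coverage(TOTAL_READS, READ_LENGTH, 0.90, 2_000_000)
rare = expected_coverage(TOTAL_READS, READ_LENGTH, 0.01, 4_000_000)

print(f"dominant genome: {dominant:.1f}x")  # about 36x
print(f"rare genome:     {rare:.1f}x")      # about 0.2x
```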

Moreover, compared with Sanger and 454 sequencing, the shorter reads (<200 bp, typically 25–100 bp) and higher error rates of the newer sequencing platforms currently limit their utility for metagenomic analyses, because such reads make phylogenetic study and gene function inference difficult. In fact, short-read technologies have not been widely used in metagenome sequencing, while technologies producing longer reads, such as Sanger (usually 700–1,000 bp) and 454 sequencing (usually 200–500 bp), are still predominantly recommended and thus remain the major source of metagenomic sequence data. Therefore, it remains important to emphasize the value of longer reads for metagenomic analyses, including the need for read assembly tools designed specifically for them.

Algorithm of MAP

MAP implements an improved version of the classical overlap/layout/consensus (OLC) strategy, in which several special algorithms are incorporated into the individual stages. Its aim is to compute correct contigs by connecting fragments linked by mate pairs, thereby preventing false merges of unrelated reads. MAP deploys a series of algorithms in three stages, as shown in Fig. 1. In the overlap stage, a filter algorithm based on q-grams (Mullikin et al. 2003) is used to select the read pairs that are likely to overlap, and a seed-and-extend alignment approach, similar to that used by BLAST (Altschul et al. 1990), is employed for the pairwise alignment calculation. In the consensus stage, a consistency-based consensus algorithm is used (Rausch et al. 2009); it is based on a multi-read alignment algorithm that aligns the reads using a consistency-enhanced alignment graph of shared sequence segments identified in advance. The most important innovation of MAP is the layout stage, which applies mate-pair information to deal with the repeat problem and is described below.

Fig. 1 The flowchart of the MAP algorithm
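
The idea behind a q-gram filter in the overlap stage can be sketched as follows: reads are indexed by their length-q substrings, and only read pairs sharing enough q-grams are passed on to the more expensive pairwise alignment. This is a simplified illustration, not MAP's actual implementation; the function names, the values of `q` and `min_shared`, and the toy reads are assumptions, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations


def qgrams(seq: str, q: int) -> set[str]:
    """All distinct substrings of length q in seq."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}


def candidate_pairs(reads: dict[str, str], q: int = 4,
                    min_shared: int = 2) -> set[tuple[str, str]]:
    """Read pairs that share at least `min_shared` distinct q-grams."""
    # Inverted index: q-gram -> names of the reads containing it.
    index = defaultdict(set)
    for name, seq in reads.items():
        for gram in qgrams(seq, q):
            index[gram].add(name)

    # Count shared q-grams for every pair that co-occurs in the index.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair for pair, count in shared.items() if count >= min_shared}


# Hypothetical toy reads: r1 and r2 overlap, r3 is unrelated.
reads = {
    "r1": "ACGTTGCAAC",
    "r2": "TGCAACGGAT",
    "r3": "CCCTTTAAAG",
}
pairs = candidate_pairs(reads)
print(pairs)  # {('r1', 'r2')}
```

Only the pair (r1, r2) would then be aligned with a seed-and-extend procedure; r3 is never compared against the other reads.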

In the OLC approach of MAP, the overlap graph is used to facilitate the assembly process. Conceptually, reads and overlaps are represented in the graph G by nodes and bidirected edges, respectively. The arrows at the two ends of an edge are determined by the way in which the two reads overlap. Herein, a dovetail path is defined as an acyclic path in which each node has only one outward arrow and one inward arrow. A dovetail path therefore determines a contig by threading together the reads corresponding to the nodes on the path, and the goal of the layout stage is to separate the graph into disconnected dovetail paths. However, the graph may contain many misleading edges representing false overlaps, which mainly originate from repetitive DNA regions or from similar fragments of different genomes, so this goal is a formidable task. To address it, MAP determines the optimal dovetail paths with the aid of the clues given by mate pairs (Lai et al. 2012).
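
The effect of misleading edges can be seen in a small sketch. For simplicity it assumes a plain directed graph, in which an edge (u, v) means that the end of read u overlaps the start of read v, rather than the bidirected graph described above, and it assumes the graph is acyclic. It extracts the maximal paths in which every link is unambiguous. The function name and the toy graph are hypothetical and do not reproduce MAP's layout procedure.

```python
from collections import defaultdict


def unambiguous_paths(nodes: list[str],
                      edges: list[tuple[str, str]]) -> list[list[str]]:
    """Maximal paths whose links are the only exit of u and only entry of v."""
    succ = defaultdict(list)
    pred = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def next_node(u):
        # Follow u -> v only if it is u's sole outgoing and v's sole incoming edge.
        if len(succ[u]) == 1 and len(pred[succ[u][0]]) == 1:
            return succ[u][0]
        return None

    def prev_node(v):
        if len(pred[v]) == 1 and len(succ[pred[v][0]]) == 1:
            return pred[v][0]
        return None

    paths = []
    for node in nodes:
        if prev_node(node) is not None:
            continue  # node lies inside a path, so it is not a starting point
        path = [node]
        nxt = next_node(node)
        while nxt is not None:
            path.append(nxt)
            nxt = next_node(nxt)
        paths.append(path)
    return paths


# Hypothetical graph: read R comes from a repeat shared by two regions,
# so it has two incoming and two outgoing overlaps.
nodes = ["A", "B", "R", "C", "D", "E"]
edges = [("A", "B"), ("B", "R"), ("D", "R"), ("R", "C"), ("R", "E")]
paths = unambiguous_paths(nodes, edges)
print(paths)  # [['A', 'B'], ['R'], ['C'], ['D'], ['E']]
```

Without additional evidence the repeat read R breaks the layout into five fragments, because the graph alone cannot tell whether B continues into C or into E.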

Compared with other assemblers, MAP has several distinct features. First, MAP does not rely on additional information such as genome length or sequencing coverage, which is often used by assemblers targeting isolated genomes, because such information is clearly not applicable to metagenomic assembly. More importantly, MAP employs mate-pair information differently from other assemblers. For example, the Celera Assembler (Myers et al. 2000) used mate-pair information in scaffold construction. A later pipeline of the Celera Assembler, CABOG, finds the best overlap graph in its unitigger module (Miller et al. 2008); in this algorithm, mate pairs are used to correct misassemblies by breaking unitigs that violate the mate-pair constraints. PCAP (Huang et al. 2003) used mate-pair information to correct contigs and to link contigs into scaffolds. In contrast to these assemblers, MAP uses mate pairs as a core measure for constructing contigs when repeats hamper the assembly. Based on mate-pair information, MAP applies a series of procedures to implement the layout stage.
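
How mate pairs can guide contig construction at a repeat may be illustrated with a deliberately simple sketch. It assumes that a repeat has two incoming and two outgoing branches and matches each incoming branch to the outgoing branch that shares the most mate-pair links with it. This greedy rule, the function name, and the toy data are assumptions made for illustration only; the actual layout procedures of MAP are described in Lai et al. (2012), and the sketch ignores insert sizes and read orientation.

```python
from collections import Counter


def resolve_repeat(incoming: list[set[str]], outgoing: list[set[str]],
                   mate_pairs: list[tuple[str, str]]) -> dict[int, int]:
    """Map each incoming branch index to its best-supported outgoing branch."""
    # Count mate pairs whose first read lies on an incoming branch
    # and whose second read lies on an outgoing branch.
    support = Counter()
    for left, right in mate_pairs:
        for i, in_reads in enumerate(incoming):
            for j, out_reads in enumerate(outgoing):
                if left in in_reads and right in out_reads:
                    support[(i, j)] += 1

    # Greedily accept the best-supported matches first.
    matches = {}
    used_out = set()
    for (i, j), _count in support.most_common():
        if i not in matches and j not in used_out:
            matches[i] = j
            used_out.add(j)
    return matches


# Hypothetical repeat: branches {A, B} and {D} enter it, {C} and {E} leave it.
incoming = [{"A", "B"}, {"D"}]
outgoing = [{"C"}, {"E"}]
mate_pairs = [("A", "C"), ("B", "C"), ("D", "E")]

matches = resolve_repeat(incoming, outgoing, mate_pairs)
print(matches)  # {0: 0, 1: 1}
```

In this toy case the mate pairs indicate that the branch containing A and B continues into C, and the branch containing D continues into E, so two contigs can be threaded through the repeat instead of being broken at it.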

Performance of MAP

MAP is designed for metagenomic assembly of long-read data with mate pairs, such as Sanger reads (700–1,000 bp) and 454 reads (200–500 bp). MAP was assessed on simulated data in comparison with widely used assemblers for long reads. Specifically, results on a simulated dataset with 800 bp reads demonstrate that the overall assembly performance of MAP can be superior to that of both Celera and Phrap for the longer reads typical of Sanger sequencing, and results on a simulated dataset with 200 bp reads show that MAP has a clear advantage over Celera, Newbler (Margulies et al. 2005), and Genovo (Laserson et al. 2011) for the shorter reads typical of 454 sequencing (Lai et al. 2012).

Availability

MAP is written in C++, and its source code is freely available under the GNU GPL license at http://bioinfo.ctb.pku.edu.cn/MAP/.

References

  1. Altschul SF, Gish W, et al. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
  2. Huang X, Wang J, et al. PCAP: a whole-genome assembly program. Genome Res. 2003;13:2164–70.
  3. Kunin V, Copeland A, et al. A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev. 2008;72:557–78.
  4. Lai B, Ding R, et al. A de novo metagenomic assembly program for shotgun DNA reads. Bioinformatics. 2012;28(11):1455–62.
  5. Laserson J, Jojic V, et al. Genovo: de novo assembly for metagenomes. J Comput Biol. 2011;18:429–43.
  6. Li R, Zhu H, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20:265–72.
  7. Margulies M, Egholm M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80.
  8. Mavromatis K, Ivanova N, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4:495–500.
  9. Miller JR, Delcher AL, et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24:2818–24.
  10. Mullikin JC, Ning Z, et al. The phusion assembler. Genome Res. 2003;13:81–90.
  11. Myers EW, Sutton GG, et al. A whole-genome assembly of Drosophila. Science. 2000;287:2196–204.
  12. Rausch T, Koren S, et al. A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics. 2009;25:1118–24.
  13. Tyson GW, Chapman J, et al. Genomic structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.
  14. Venter JC, Remington K, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66–74.
  15. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Biomedical Engineering and Center for Theoretical Biology, Peking University, Beijing, China