A Sequel to Sanger: amplicon sequencing that scales
- 3.1k Downloads
Although high-throughput sequencers (HTS) have largely displaced their Sanger counterparts, the short read lengths and high error rates of most platforms constrain their utility for amplicon sequencing. The present study tests the capacity of single molecule, real-time (SMRT) sequencing implemented on the SEQUEL platform to overcome these limitations, employing 658 bp amplicons of the mitochondrial cytochrome c oxidase I gene as a model system.
By examining templates from more than 5000 species and 20,000 specimens, the performance of SMRT sequencing was tested with amplicons showing wide variation in GC composition and varied sequence attributes. SMRT and Sanger sequences were very similar, but SMRT sequencing provided more complete coverage, especially for amplicons with homopolymer tracts. Because it can characterize amplicon pools from 10,000 DNA extracts in a single run, the SEQUEL can reduce greatly reduce sequencing costs in comparison to first (Sanger) and second generation platforms (Illumina, Ion).
SMRT analysis generates high-fidelity sequences from amplicons with varying GC content and is resilient to homopolymer tracts. Analytical costs are low, substantially less than those for first or second generation sequencers. When implemented on the SEQUEL platform, SMRT analysis enables massive amplicon characterization because each instrument can recover sequences from more than 5 million DNA extracts a year.
KeywordsSMRT sequencing Mitochondrial DNA Nuclear DNA Phylogenetics DNA barcoding PCR Nucleotide composition Homopolymer
Barcode of Life Datasystems
Carbamoylphosphate synthetase/aspartate transcarbamylase/dihydroorotase
Circular consensus sequence
Cytochrome c oxidase I
Ultra-pure and sterile water
Deoxyribose nucleoside triphosphate
Multiplex Barcoding Research And Visualization Environment
Molecular operational taxonomic unit
Nuclear mitochondrial DNA segment
Polymerase chain reaction
Personal Genome Machine
Single molecule, real-time
Solid Phase Reversible Immobilization
Unique molecular identifier
Unique molecular identifier forward
Unique molecular identifier reverse
High-throughput sequencers are doubling their analytical capacity every 9 months [1, 2], but their reads are generally short (< 400 bp) and error rates reach 0.8%–1.7% . These limitations are an important constraint in three contexts: de novo genome assemblies are difficult , complex regions of well-known genomes can be intractable , and sequencing long amplicons is inefficient. Because of the latter constraint, Sanger sequencing is still widely used for amplicon characterization [6, 7, 8, 9] despite its relatively high cost . While recent studies have established that Illumina [11, 12] and Ion Torrent  platforms can analyze 1 kb amplicons with good accuracy, their need to concatenate short reads creates risks to data quality linked to the recovery of chimeras and pseudogenes. As well, because of their relatively complex workflows, costs are only three to four times less than those for Sanger analysis.
In contrast to the short reads delivered by other HTS platforms, the SEQUEL from Pacific Biosciences generates up to 60 kb reads [13, 14]. Despite its high error (13%) in single base calls , its long reads permit the generation of circular consensus sequences (CCSs) . For example, presuming a 12 kb read, each nucleotide position in a 1 kb amplicon is reanalyzed 10 times, allowing its accurate characterization. Since each run generates about 200,000 CCSs, the SEQUEL has the potential to analyze a diverse pool of amplicons. However, because each CCS reflects the characterization of a single molecule, SMRT analysis can recover heterogeneous sequences from a DNA extract, reflecting both variation in the target gene and diversity introduced by polymerase error. This contrasts with Sanger sequencing where the base call at each position integrates the signal from many amplicons so variants that comprise less than 10% of the amplicon pool have no impact on the base call at a particular position. Given this difference, empirical studies are needed to understand the complexities that arise when SMRT sequencing is employed to characterize amplicons.
A rigorous performance test demands the examination of amplicons with varied GC content and with substantial genetic divergence to reveal error biases dependent on sequence context . For example, long homopolymer runs are a challenge for Sanger sequencing [17, 18], while Illumina platforms are subject to GC bias . Specimens employed to create DNA barcode reference libraries provide an ideal test system as Sanger reference sequences are available for a 648 bp region of the mitochondrial cytochrome c oxidase I (COI) gene from more than 500,000 species . Because of its considerable variation in GC composition (15–45%) and substantial sequence divergence, COI is sometimes challenging for Sanger analysis, usually as a result of homopolymer tracts. Consequently, this gene region provides a stringent test for the capacity of a sequencing platform to support amplicon analysis.
The present study tests the performance of SMRT sequencing in the characterization of COI amplicons from 20,000 DNA extracts, each from a different specimen. It compares the performance of SMRT and Sanger analysis in three key metrics: sequence length, sequence quality, and recovery success. It also ascertains the diversity of COI amplicons that can be characterized by a SMRT cell, the analytical chip for SEQUEL. Although this limit will depend on the number of reads and on the capacity to standardize amplicon abundances, the present study provides a first sense of the upper bound.
The geographic origin, number of specimens analyzed, and number of species represented in the four libraries analyzed by Sanger and SMRT sequencing
Insecta + Arachnida
Insecta + Arachnida
Insecta + Arachnida + Collembola
Para, Brazil & Selangor, Malaysia
Guanacaste, Costa Rica
The DNA extracts used to construct libraries #1/2 were selected from a study that examined arthropod biodiversity from the rare Charitable Research Reserve, near Cambridge, Ontario, Canada (Lat/Lon: 43.37, − 80.364). It recovered COI sequences from 44,000 arthropods representing 5600 species . From this array, 948 specimens were selected for SMRT analysis. Library #1 included 95 taxa – a single specimen from 82 insect species (Coleoptera–18, Diptera–12, Hemiptera–13, Hymenoptera–24, Lepidoptera–15) and 13 arachnid species (Mesostigmata–9, Sarcoptiformes–1, Trombidiformes–3). Library #2 included 948 specimens, all belonging to a different species with two exceptions: one dipteran was represented by three and one hymenopteran by two specimens to test the impact of replication on the number of reads for a species. The 945 species in library #2 included 837 from the five major insect orders (Coleoptera–159, Diptera–245, Hemiptera–94, Hymenoptera–244, Lepidoptera–95) and 108 from six orders of arachnids (Araneae–23, Mesostigmata–38, Opiliones–3, Pseudoscorpiones–1, Sarcoptiformes–3, Trombidiformes–40). Aside from ensuring taxonomic diversity, the primary criterion for the selection of a species was its possession of > 6% sequence divergence at COI from any species already included in the library, barring four species pairs (2 dipterans, 1 hemipteran, 1 mite) with low divergence (1.6–3.2%) that were included to verify that SMRT sequencing could discriminate them. The resultant array of species showed substantial variation (20.4–44.1%) in the GC content of their COI amplicons.
Libraries #3/4 were used to test the number of different amplicons that could be analyzed with a SMRT cell. Library #3 included amplicons from 9120 arthropods from Brazil and Malaysia, while library #4 included amplicons from 9830 arthropods from Costa Rica. All three of these specimen collections were obtained through the deployment of a single Malaise trap for a year-long interval. Specimens collected by the trap were preserved in 95% ethanol at their time of capture and the catch from each trap was harvested weekly. It was then stored at − 20C until DNA was extracted from each specimen. Half (4560) of the specimens in library #3 derived from a Malaise trap deployed in Caxiuana National Forest, Para (Lat/Lon: − 1.73717, − 51.4546) while the other half came from a trap placed in the Botanic Garden at the University of Malaya in Kuala Lumpur (Lat/Lon: 3.1295, 101.6560). Library #4 derived from a Malaise trap deployed in the Sector Pailas Dos, Area de Conservacion Guanacaste, Costa Rica (Lat/Lon: 10.7642,− 85.3350). Because some species in both libraries were represented by two or more specimens, the species count was lower than the sample size (4764 species versus 18,950 specimens). The results from this analysis also permitted a comparison of the relative success of Sanger and SMRT sequencing in recovering COI from a diverse array of taxa.
Each DNA extract was used as a template for COI amplification without normalization of its concentration. Two PCR protocols, both targeting the same 658 bp segment of COI, were employed to generate amplicons for Sanger and SMRT sequencing. PCR1 was used when amplicon tagging was not required to link sequence records to their source, either because the amplicons were analyzed individually (i.e. Sanger sequencing) or because deep COI divergences among samples permitted post hoc taxonomic assignments (i.e. SMRT sequencing of libraries #1/2). PCR2 was used for SMRT sequencing of libraries #3/4 to enable the association of each CCS to its source well/specimen. Each PCR reaction was composed of 5% trehalose (Fluka Analytical), 1× Platinum Taq reaction buffer (Invitrogen), 2.5 mM MgCl2 (Invitrogen), 0.1 μM of each primer (Integrated DNA Technologies), 50 μM of each dNTP (KAPA Biosystems), 0.15 units of Platinum Taq (Invitrogen), 1 μl of DNA extract, and Hyclone ultra-pure water (Thermo Scientific) for a final volume of 6 μl.
PCR1 employed a single primer cocktail, C_LepFolF, C_LepFolR , and the following thermocycling protocol (initial denaturation for 2 min at 94 °C, then 5 cycles of denaturation for 40 s at 94 °C followed by annealing for 40 s at 45 °C and extension for 1 min at 72 °C, then 35 cycles of denaturation for 40 s at 94 °C followed by annealing for 40 s at 51 °C and extension for 1 min at 72 °C, followed by final extension for 5 min at 72 °C). Most reactions generated a 709 bp product (658 bp of COI plus 51 bp of forward and reverse primers), but some were slightly shorter because certain species had a 3–15 bp deletion in COI. To minimize the number of analytical steps, PCR amplicons were not purified and their concentrations were not normalized prior to sequence characterization, but the success of PCR was evaluated by scoring E-gels (Thermo Fisher) to confirm the presence of an amplification product. Products for Sanger sequencing were diluted 1:4 with ddH2O before 2 μl was used as template for a cycle sequencing reaction. As well, 2 μl from each of the 95/948 amplicons in libraries #1/2 were pooled to create two amplicon mixtures that were submitted for SMRT sequencing.
PCR2 involved two rounds of amplification (PCR2a, PCR2b). The first round (PCR2a) used a single primer cocktail (C_LepFolF, C_LepFolR) tailed with 30 bp adapter sequences (AF = gcagtcgaacatgtagctgactcaggtcac; AR = tggatcacttgtgcaagcatcacatcgta). These tails provided binding sites for primers tagged with unique molecular identifiers (UMIs) that were introduced in PCR2b [24, 25]. PCR2a employed the following thermocycling regime (initial denaturation for 2 min at 94 °C, then 20 cycles of denaturation for 40 s at 94 °C followed by annealing for 1 min at 51 °C and extension for 1 min at 72 °C, followed by final extension for 5 min at 72 °C). After a 1:1 dilution with ddH2O, the products were used as template for PCR2b which employed PCR primers consisting of a terminal 5 bp pad sequence (GGTAG), a 16 bp UMI, and a 30 bp AF or AR adapter to match the primer tails from PCR2a. Because the SEQUEL platform is well suited for asymmetric UMI tagging, 100 forward and 100 reverse primers, each with a different UMI, permitted 10,000 pairwise combinations, making it possible to attribute every sequence to its source by deploying a unique primer combination in each well. For example, discrimination of the 96 negative controls and 9120 samples in library #3 required 96 UMI-F and 96 UMI-R primers to create 9216 primer combinations. A Biomek FX liquid handler with a 384-channel head was employed to avoid errors in dispensing the designated primer cocktail into each well. PCR2b employed a thermocycling regime identical to PCR2a except the annealing temperature was raised to 64 °C. Although the amplicons generated by PCR2b were ordinarily 811 bp long (10 bp pad, 32 bp UMIs, 60 bp AF/AR adaptors, 51 bp COI primers, 658 bp COI), some were 3–15 bp shorter because of deletions in the COI gene itself. A 2 μl aliquot from each of the 9216 PCRs for library #3 (9120 samples, 96 negative controls) and from each of the 9932 PCRs for library #4 (9830 samples, 102 negative controls) were pooled to create two libraries for SMRT analysis.
Each product from PCR1 was Sanger sequenced using BigDye v3.1 from Life Technologies (Thermo Fisher). Sequencing reactions were performed by adding 0.5 μl of each diluted PCR product (95 for #1, 948 for #2, 9216 for #3, 9932 for #4) into 384-well plates prefilled with 5 μl of sequencing reaction mix following the manufacturer’s protocol. All cycle sequencing products were purified using an automated SPRI method  and sequenced on an ABI 3730XL. Libraries #1/2 were sequenced in both directions while libraries #3/4 were sequenced in one direction using the C_LepFolR primer. Trace files were submitted to the Barcode of Life Datasystems (BOLD; www.boldsystems.org) where they underwent quality trimming and filtering to produce barcode sequences that subsequently gained a Barcode Index Number (=species) assignment .
DNA quantity was evaluated for each amplicon pool using a Bioanalyzer and Nanodrop system before a 1 μg aliquot from each pool was used to prepare a SMRTbell library . Prior to ligation of the hairpin adapters that bind the sequencing primer and DNA polymerase, amplicons underwent damage- and end-repair to create double-stranded amplicon fragments with blunt ends. The resulting SMRTbell libraries were purified with AMPure® PB magnetic beads and combined with a sequencing primer and polymerase before each was loaded into a single SMRT cell to quantify amplicon diversity. Each of the resultant CCSs represented the analysis of a concatenated set of sub-reads, each corresponding to a single pass through a particular SMRTbell. Although the reads generated by each SMRT cell were output as a single fastq file, they varied in quality, reflecting, in part, variation in read length. Following convention, the number of CCS reads was determined for three data partitions (99%, 99.9%, 99.99%) where the percentage value indicates the proportion of bases in each CCS that is predicted to match its template based on Pacific Bioscience’s model of the circular sequencing process.
Analysis of SMRT data
Slightly different analytical paths were required to analyze the results from libraries #1/2 versus #3/4 because SMRT sequences from the latter libraries included UMI tags. However, once each UMI-tagged CCS was assigned to its source well (details below), the four datasets were analyzed on mBRAVE (Multiplex Barcoding Research and Visualization Environment; www.mbrave.net) using a standard pipeline which involved sequence trimming, quality filtering, de-replication, identification, and OTU generation. Because of its integration with BOLD , mBRAVE has direct access to the reference libraries needed for data interpretation.
A key step in data analysis involved the categorization of each CCS as ‘target’ or ‘non-target’. This assignment required pairwise comparisons against the reference Sanger sequences for each library (available as DS-PBBC1, DS-PBBC2, DS-PBBC3, DS-PBBC4 on BOLD). Prior to assignment, each CCS was trimmed by excising 30 bp from its 5′ and 3′ termini to ensure removal of the 25/26 bp primers and it was then further truncated to 648 bp. After trimming, the quality of each CCS was assessed; those with a mean QV < 40, length < 500 bp or > 1% of their bases with a QV < 20 were excluded. All remaining sequences were de-replicated based on perfect string identity and each distinct CCS was examined for similarity to the Sanger reference sequences in its library. In the case of libraries #1/2, each CCS with > 98% similarity to any one of the 95/948 Sanger references was assigned to the ‘target’ category while those with lower similarity were ‘non-target’. The same sequence similarity value was employed for libraries #3/4, but it was implemented on a well-by-well basis. For each library, the pairwise distance value used to categorize each CCS employed k-mer searches followed by verification through global pairwise alignment using the Needleman-Wunsch algorithm  to the appropriate reference sequence array. Sequences were only assessed for a match when there was > 80% overlap between the CCS and the Sanger reference, but less than 0.01% of all sequences were excluded based on this filter.
The fastq files for libraries #3/4 required an initial step to assign each CCS to its source well, work that was completed with a pipeline constructed from open source tools and python scripts. Pad sequences were trimmed from both ends of each CCS using cutadapt . To de-multiplex the fastq files, each CCS was first split at the UMI-F using a barcode splitter from the fastx toolkit (2017; http://hannonlab.cshl.edu/fastx_toolkit/index.html), and then split at the UMI-R. Because sequencing of a SMRTbell template can start from either forward or reverse strands of the amplicon, about 50% of the reads had to be reverse complemented prior to scoring the UMIs. Only CCSs with a perfect match to a UMI were retained. This stringent criterion meant that many CCSs (38.4% for #3, 40.7% for #4) were not assigned to a well. Most of these cases reflected erosion of the pad and the adjacent UMI tag that likely occurred during the damage- and end-repair stages of SMRTbell ligation. All sequences with a perfect UMI match were trimmed with a hard cut of 46 bp at both ends using the fastx trimmer to remove the UMI (16 bp) and the AF/AR adaptor (30 bp). This process produced sequence records for each COI amplicon and the primers employed in its PCR, each assigned to its source well by examining its UMI-F/UMI-R. In the usual situation where multiple sequence records were recovered from a well, their congruence was determined. Whenever the CCSs from a well possessed > 2% divergence from any sequence in the reference library, they were assigned to a new molecular operational taxonomic unit (MOTU). Each MOTU was compared with all sequences on BOLD to ascertain if it derived from a bacterial endosymbiont such as Wolbachia or from another arthropod, reflecting contamination.
Comparison of sequence recovery by Sanger and SEQUEL
All specimens in libraries #3/4 were analyzed on both platforms, permitting comparison of the success in sequence recovery via Sanger and SMRT sequencing. Success in sequence recovery required meeting four criteria: 1) mean QV > 35; 2) < 1% of bases with QV < 10; 3) < 5% of bases with QV < 20; and 4) minimum length = > 493 bp (75% of the barcode region).
Sanger sequence metrics
SMRT sequence metrics
Mean read length and number of circular consensus sequences (CCSs) for COI from the four libraries in three data partitions. The percentage values (99%, 99.9%, 99.99%) indicate the proportion of bases in each CCS that is predicted to match its template based on Pacific Bioscience’s model of the circular sequencing process
Number of Reads
Comparison of Sanger and SMRT sequences
A sliding window analysis of the SMRT sequences indicated that indels were generally infrequent, averaging 0.1% per base pair (Fig. 4c), but their incidence rose 30-fold in the COI segment with the long homopolymer tract. The number of indels was not linked to the overall GC content of a COI sequence (Additional file 1: Figure S2), but there was evidence for consistent differences among taxa as evidenced by the strong correlation in indel counts for the 95 taxa that were shared by libraries #1/2 (Additional file 1: Figure S3).
Sequence recovery via Sanger and SMRT analysis
Screening amplicon libraries with SMRT sequencing
Incidence of non-target sequences with SMRT analysis
Few sequences derived from the negative control wells in libraries #3/4. In fact, just two sequences were recovered from the 96 controls in #3, and five from the 102 controls in #4, each involving a single CCS per well. More non-target sequences were recovered from wells with a DNA extract. They comprised 2.2% of the sequences from libraries #1/2, although their source could not be identified because the samples were not UMI-tagged. Non-targets were slightly more abundant for libraries #3/4, reaching 3.1 and 4.0% respectively with most deriving from arthropods (84.4% in #3, 91.9% in #4) or endosymbiotic bacteria (14.8% in #3, 7.7% in #4). Non-target sequences were only present in higher abundance than the target sequence in 1.6% of the wells in library #3 (149/9210) and in 1.5% of those in #4 (193/9830). The mean CCS count for wells where non-targets predominated was significantly lower than those where the target was dominant (3.5 versus 8.9 reads for #3, 2.9 versus 7.8 reads for #4), and these wells often possessed low quality Sanger reads (Additional file 1: Figure S7).
The present study has shown that the SEQUEL is a highly effective platform for amplicon sequencing. Although the number of CCSs generated by each SMRT cell was modest (167,000 in the 99.9% partition), these sequences showed close congruence to their Sanger counterparts. Earlier studies have reported similar results, but they focused on 16S rRNA and examined far fewer templates [32, 33]. The present analysis sequenced amplicons from more than 5000 species, taxa whose COI amplicons showed wide variation in GC content (15–45%) and diverse compositional attributes. Across this range of templates, SMRT sequencing showed no points of failure. SMRT sequences also had a major advantage over their Sanger counterparts as they regularly provided complete coverage for the target amplicon. By comparison, the generation of complete coverage through Sanger sequencing demanded bidirectional analysis to escape interpretational complexities introduced by ‘dye blobs’ and uncertainty in base calling at the initiation of each sequence . As a consequence of these factors, unidirectional Sanger reads were invariably truncated and even bidirectional reads showed more length variation than SMRT sequences, often reflecting barriers created by homopolymers.
Although single CCS records from a particular DNA extract corresponded closely with their Sanger counterpart, they were often not identical to it, averaging 0.6% divergence reflecting the presence of about three substitutions and 0.75 indels. Nucleotide substitutions occurred at similar frequency at each position in the COI amplicon, but indels showed site-specific variation, increasing 30-fold in regions with homopolymer tracts. The SMRT sequences recovered from such taxa typically showed several length variants, likely reflecting polymerase slippage in the homopolymer region during PCR . Because of its protein-coding function, SMRT sequences for COI could be readily aligned, allowing the excision of indels created through polymerase slippage. By comparison, the Sanger traces from taxa with long homopolymer runs were often uninterpretable. Viewed from this perspective, homopolymers disrupt Sanger analysis [18, 35], but only lead to readily resolved length variation in SMRT sequences.
If the sequence variation noted among the CCS reads from individual DNA extracts derived from heteroplasmy in its source organism, one would expect a small number of variants . Instead, there was usually a single dominant sequence and many low-frequency variants with an indel and/or a substitution at varied positions, a pattern consistent with PCR errors . Presuming COI homoplasy in the source specimen, the frequency of errors in the final amplicon pool can be predicted from polymerase fidelity, sequence length, and the number of PCR cycles . As all reactions employed Platinum Taq (error rate = 2.28 × 10− 5) and 40 cycles of amplification, 60% of the final amplicons should possess a PCR error (658 bp × 40 cycles × 2.28 × 10− 5 = 60.01%), while 40% should match the source. This theoretical expectation coincided with the observed results; the dominant sequence represented approximately 40% of the sequences from each DNA extract, and perfectly matched its Sanger counterpart. The adoption of high fidelity DNA polymerases [39, 40] would allow 99% of amplicons to match their source molecule, but their high cost and requirement for additional cleanup steps mean their use will only be justified in limited contexts.
This study has established that simple PCR protocols allow sufficient standardization of amplicon concentrations to permit a single SMRT cell to recover sequences from nearly 10,000 DNA extracts. Although the mean number of reads per taxon declined by more than three orders of magnitude (1515 to 8 as library complexity increased from 95 to 9830 templates), success in sequence recovery remained high (100% for 95 and 945, 90% for 9120, 86% for 9830). Because SMRT sequencing supports asymmetric UMI tagging, high levels of multiplexing could be implemented cost-effectively since just 200 primers were required to discriminate 10,000 samples (versus 50 times that many with symmetric tagging). Despite the modest number of reads generated by each run, SMRT sequencing supports high levels of multiplexing because the fidelity of each sequence is high. Although some samples failed to gain a sequence, the cost ($0.20) per sample was least at the highest level of multiplexing employed in this study and it could be further reduced. For example, the analysis of a library with amplicons from 40,000 DNA extracts on a single SMRT cell should generate an average of two reads and recover sequences from about 25,000 of them. Extracts without a sequence in the first run could then be pooled for secondary analysis, a strategy that would reduce sequencing costs to $0.05 per specimen. If deployed in all combinations, the 384 UMI tags currently available (https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Barcoding) can discriminate 73,536 amplicons, a capacity that will be useful once SMRT cells generate more CCSs. The high fidelity of SMRT sequencing has the additional advantage of minimizing the risk that UMIs will be misread, an error that leads sequences to be assigned to the incorrect source, a frequent problem with other HTS platforms [41, 42].
Prior studies have shown that other HTS platforms can sequence circa 1 kb amplicons, but their workflows are more complex and more expensive. For example, Shokralla et al. [11, 43] demonstrated that the Illumina MiSeq could recover 648 bp COI amplicons from hundreds of taxa at a time. However, because it only generates 300 bp sequences (or 500 bp with its paired end approach), workflows required the acquisition of multiple reads of several amplicons to obtain a high-fidelity 648 bp sequence. As a consequence, the cost ($1.50) per extract is substantially higher than with SMRT analysis. Diekstra et al.  employed the Ion Torrent PGM platform to characterize up to 900 bp templates, but the amplicons had to be sheared to < 300 bp before sequencing and reassembly. The resulting sequences corresponded closely with their Sanger counterparts, but analytical costs ($1.96) were only three-fold less than those for Sanger analysis ($6.00). Illumina and Ion platforms also run the risk of recovering chimeras when short reads are assembled to characterize longer amplicons. This is a particular risk for studies on mitochondrial genes as NUMTs are prevalent  and are readily recovered by PCR . Because most NUMTs are < 300 bp , any strategy which relies on the amplification and assembly of short amplicons creates the risk of generating chimeras which combine a segment of the authentic gene with a NUMT. Perhaps reflecting this fact, Craud et al.  found that 5% of the COI sequences generated by MiSeq analysis diverged from their Sanger reference.
While Sanger analysis is effective for sequencing amplicons up to 1 kb [14, 47], the characterization of longer templates requires primer walking or shotgun sequencing, protocols that introduce complexities and raise costs [48, 49]. The capacity of SMRT sequencing to analyze long amplicons is a key advantage in such situations . For example, the 5 kb CAD gene is valuable for phylogenetic studies, but its recovery via Sanger sequencing requires the analysis of six overlapping amplicons , and difficulties are often encountered in recovering the full set. Similarly, Zhang et al.  identified 1083 genes with high potential for phylogenetic studies on plants, but many exceeded 1500 bp in length. Because the average gene length across animals, fungi, plants, and protists ranges from 1200 to 1500 bp and from 1700 to 9500 bp when introns are included [53, 54], the capacity of SMRT sequencing to analyze long amplicons is a general asset.
In summary, this study indicates that SMRT analysis is a powerful approach for amplicon sequencing. It can characterize templates with large divergence in GC content and long homopolymer tracts. SMRT sequences are congruent with those obtained through Sanger analysis, but analytical costs are greatly reduced when many samples are multiplexed. Because SMRT analysis supports massive multiplexing, a single SEQUEL platform can characterize millions of DNA extracts annually. Exploitation of this capacity is aided by the fact that data processing is simple. While Sanger sequencing requires the visual inspection of trace files to optimize data quality, SMRT sequences can be processed with an automated pipeline that is much simpler than those required for similar analyses on short-read HTS platforms.
We thank staff at Pacific Biosciences, particularly Jonas Korlach and Cheryl Heiner for aiding data acquisition. We are also very grateful to John Wilson for providing specimens from Malaysia and to Marlucia Martins from providing those from Brazil. We thank Suzanne Bateson for her aid with graphics. We are extremely grateful to Ann McCain Evans and Chris Evans for their ongoing support.
This work was supported by grants from the Canada First Research Excellence Fund Ontario and from the Ontario Ministry of Research, Innovation and Science. It represents a contribution to the ‘Food From Thought’ research program.
Availability of data and materials
Details on the specimens in each library, coupled with Sanger sequences and trace files are available in four datasets on BOLD, the Barcode of Life Datasystems (www.boldsystems.org) accessible through the following DOIs: Library #1 – https://doi.org/10.5883/DS-PBBC1; library #2 – https://doi.org/10.5883/DS-PBBC2, library #3 – https://doi.org/10.5883/DS-PBBC3, library #4 – https://doi.org/10.5883/DS-PBBC4. The 12 datasets holding the CCS records for each partition (99%, 99.9%, 99.99%) from libraries #1 to #4 are available in DRYAD under doi: https://doi.org/10.5061/dryad.4368h. The same doi includes the 12 output files from mBRAVE (www.mBRAVE.net) resulting from the analyses of the 12 CCS files.
PDNH and EVZ conceived and designed the study. TWAB, SWJP, SR, JRD, NVI, SN, JES and EVZ performed the experiments. TWAB, SWJP, SR, JRD, NVI and EVZ analyzed the data. TWAB, SWJP, SR, DHJ and WH contributed reagents/materials/analysis tools. PDNH, TWAB, SWJP and EVZ wrote the paper. All authors read, edited and approved the final manuscript.
Ethics approval and consent to participate
As this study only examined arthropods, no ethics approval was required.
Consent for publication
The authors declare they have no competing interests.
- 7.Beck TF, Mullikin JC, Biesecker LG. Systematic evaluation of Sanger validation of next-generation sequencing variants. Clin Chem. 2016;62:7647–54.Google Scholar
- 34.Kieleczawa J. DNA sequencing: optimizing the process and the analysis. Sudbury: Jones and Bartlett Publishers; 2005.Google Scholar
- 35.Yang A. Solutions for sequencing difficult regions. In: Kieleczawa J, editor. DNA sequencing III: dealing with difficult templates. Sudbury: Jones and Bartlett Publishers; 2008. p. 65–90.Google Scholar
- 43.Shokralla S, Gibson JF, Nikbakht H, Janzen DH, Hallwachs W, Hajibabaei M. Next-generation DNA barcoding: using next-generation sequencing to enhance and accelerate DNA barcode capture from single specimens. Mol Ecol Res. 2014;14:892–901.Google Scholar
- 45.Song H, Buhay JE, Whiting MF, Crandall KA. Many species in one: DNA barcoding overestimates the number of species when nuclear mitochondrial genes are coamplified. Proc Nat Acad Sci USA. 2008;105:13846–91.Google Scholar
- 53.Kaplunovsky A, Ivaschenko A, Bolshoy A. Statistical analysis of exon lengths in various eukaryotes. Open Access Bioinformatics. 2011;3:1–15.Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.