Findings

The development of next-generation sequencing (NGS) technologies has dramatically increased the size and scale of sequencing experiments. It is now possible to produce several gigabases of DNA sequence in a short period of time [1]. To date, the cost of whole human genome sequencing remains prohibitive. Focusing NGS experiments to specific genomic regions is an alternative, cost-effective approach to whole genome sequencing but it requires the enrichment of the targeted regions before library construction. Several hybridization-based methods - each with its own strengths and weaknesses - have been developed [2]. While these DNA enrichment methods continue to be developed and improved, a simple and inexpensive alternative is PCR. PCR is a robust, well-understood, very accessible and flexible strategy for DNA enrichment. It also allows for the very specific amplification of targeted regions without the high background found in hybridization-based methods. Hundreds or even thousands of PCR amplicons that span selected genomic regions of interest can be pooled together and used as input material for the sequencing reaction. PCR has been traditionally used for classic Sanger chemistry-based sequencing of few genes. However, the throughput of next-generation DNA re-sequencing is such that new tools need to be developed to facilitate the implementation of PCR as enrichment strategy for these new sequencing methods.

Enrichment of exonic regions is of particular interest as the functionality of variations within these regions can be more easily inferred than variations in non-coding DNA. Surveying genetic variation by NGS for all exons in candidate genes, such as those identified in genome wide association studies (GWAS), may contribute to the identification of the causal genes and variants, and therefore the underlying biology of the disease.

In order to utilize PCR in the context of NGS, where hundreds of genes are resequencd, a major challenge facing researchers is how to generate a very large number of functional PCR primers that will successfully generate useable amplicons. For instance, in order to target the re-sequencing of 100 genes, each with 10 exons, this requires 1,000 pairs of PCR primers. In fact, the reality is often more complex as each gene might have several isoforms and the enrichment strategy must accommodate large exons that require overlapping PCR products to maintain the desired amplicon size. Thus, to address this significant bioinformatic need in setting up robust NGS enrichment pipelines, we have developed the Optimus Primer (OP) website http://op.pgx.ca where users can import a gene list with no need to provide genome coordinates to generate a complete list of PCR primer pairs (see figure 1 and table 1 for example). Other groups have developed pipelines to facilitate PCR primer design, [35]. These programs often cannot accommodate primer design from the gene names alone and require significant background research to obtain sequence information. Furthermore, most of these programs are only capable of processing a limited list of genes at a time and lastly, all of these programs do not possess the ability to handle multiple design criteria to design multiple iterations of the same amplicon in a single run, thus maximizing the likelihood that a successful primer pair will be designed. Therefore these algorithms are not suitable for the large number of PCR primers required for NGS enrichment. In contrast, OP is able to handle many genes simultaneously and requires only the gene name as input. The user of the program has the ability to set the design criteria for up to four iterations of primer designing parameters that allows the user to progressively modify the design iterations to make them less stringent in order to design primers for all submitted genes or regions. This makes this program especially useful for the rapid design of the large number of primers needed in targeted resequencing experiments.

Table 1 Example PCR Primer Pairs Designed After Submitting TCF7L2
Figure 1
figure 1

Example of Exon Coverage of Optimus Primer Designed Amplicons. Image from the UCSC genome browser showing Optimus Primer design results for the gene TCF7L2 [10]. The black boxes (highlighted by the red arrow) indicate the positions of all isPCR validated amplicons designed by OP. Note that only the gene name, TCF7L2, was submitted into the pipeline yet amplicons were designed spanning the exons (indicated by blue boxed) for all 6 known isoforms of the gene.

Implementation

Optimus Primer (OP) is a web-based automated pipeline that requires the user to submit only a gene list, or list of regions of interest, and primer design parameters. The pipeline consists of four steps. First, all exons for all known isoforms for each gene submitted are identified using the RefSeq database [6]. This step is skipped if regions of interest are submitted. A list of all unique exons for each gene is then generated with exons/regions that are in close proximity to each other (< 25 base pairs for example) merged into a single element in the list. Second, the pipeline extracts the desired genomic sequences from the current build of the human genome (currently hg18/NCBI36), plus additional flanking sequence at a length defined by the user to facilitate the design of the PCR primers. OP will prioritize the design of the primers to these flanking regions to ensure complete coverage of the specific exonic regions. The user has the option of including or excluding sequence that has been masked with RepeatMasker [7]. Additionally, polymorphisms from the current build of dbSNP (currently build 130) can be masked to ensure that primers are not designed to locations with underlying SNPs [8]. Primer3 has been integrated into the pipeline to design PCR primers using user defined parameters [9]. Exons/regions that are larger than the specified amplicon size will be automatically split into smaller amplicons, with a minimum 25 bp overlap to ensure that every base can be amplified and sequenced. Exons for which no PCR primer design is possible using the initial parameters are passed on to a second iteration of Primer3 with modified design criteria defined by the user.

Currently, the pipeline allows the user to define up to four iterations of Primer3 design criteria in a single pass to attempt to design PCR primers for all amplicons with up to 5 primer pairs for each amplicon. The final step of the pipeline is to run all designed PCR primers through the UCSC Genome Browser in-silico PCR (isPCR) utility as a validation step for the primer pairs selected [10]. The isPCR utility allows the user to check the human genome for the presence of unique primer pairs, ensures that they are designed correctly on opposite strands, that they are the correct distance apart and generates a report of the theoretical amplicons produced by the primer pair. OP then uses this data to generate a report for all primer designs as well as the percent coverage for each exon/region for each gene for all isPCR validated primer pairs. Primers designed with OP can then be used to amplify genes of interest as the enrichment step prior to library construction for NGS experiments. In particular, because PCR is flexible and easily implementable, OP will be ideal to target for NGS genes that are difficult to enrich using solid- or liquid-based capture reagents and for genes that are very polymorphic. Additionally, for genes whose annotation is dynamic from one build of the human genome to the other, PCR can be easily adapted whereas probes-based capture reagents will need to be re-synthesized.

Results

As part of a high-throughput DNA re-sequencing project to identify genetic risk factors for ulcerative colitis, we targeted 77 genes for NGS. PCR amplification was selected as enrichment strategy and we opted for pooled sequencing. This corresponds to 867 unique exons and a total of 237 kb of sequence. OP divided the exons into 993 amplicons (average size of 586 bp). After a single pass of OP, primer pairs were designed for 861 (87%) of amplicons using our stringent design parameters (see table 2). PCR primers could not be designed for all desired amplicons for reasons such as low complexity DNA and underlying variation. However with additional effort, it is possible to modify (loosen) Primer3 parameters to allow for the design of PCR primers in these regions. 714 of the OP designed primer designs (861) passed our isPCR evaluation and validation process (representing 72.5% of desired exons). In the lab, we tested 508 of the isPCR validated primer pairs using 7 different genomic DNA samples. We demonstrated successful PCR products for 472 amplicons (92.9% success rate) using agarose gel electrophoresis and/or PicoGreen quantification. Of the 508 primer pairs tested, 338, 18, 80 and 72 were designed in the first, second, third and fourth iterations of Primer3 respectively. By including three additional design iterations with different design parameters with different stringencies, it resulted in an additional 170 successful primer pairs or an increase of 34% more designs in a single pass of OP than by conventional methods. The success rates were 93%, 84%, 99% and 97% for the four iterations. Therefore, we feel that our approach of using multiple design iterations with progressively looser design parameters is a valid approach to successful primer design. In our specific experiment, amplicons were pooled and libraries were constructed according to Illumina's recommendations. We re-sequenced the targeted exons using a single-end 36-base pair protocol on our Genome Analyzer II. We generated 4.5 gigabases of DNA sequences, 89% of which was on target. This allowed us to generate 40× coverage and identify over 700 coding DNA sequence variants. Specific details from genetic risk factors for UC will be presented elsewhere.

Table 2 Primer Design Parameters Used in the Four Passes of Primer3.

Conclusion

PCR is currently the cheapest, simplest, most flexible approach for sample enrichment prior to NGS experiments. It also has some distinctive advantages over the less specific enrichment methodologies currently used for targeted next generation resequencing. In order to capitalize on PCR-based methodologies, hundreds if not thousands of PCR primers need to be designed. To address this gap in bioinformatic tools, we have developed Optimus Primer (OP), a web-based automated PCR design pipeline that facilitates the simultaneous design of PCR primers for the enrichment of exonic regions in multiple genes. This tool can be useful not only for the enrichment of exonic regions for NGS experiments, but it also has much more general applicability to other experiments that require the rapid design of PCR primers for multiple regions of interest such as genotyping, Sanger sequencing and real time PCR. With only a gene list and PCR parameters, a user can design hundreds of PCR primers in a very short timeframe.