Full Sequencing of Viral Genomes: Practical Strategies Used for the Amplification and Characterization of Foot-and-Mouth Disease Virus
Nucleic acid sequencing is now commonplace in most research and diagnostic virology laboratories. The data generated can be used to compare novel strains with other viruses and allow the genetic basis of important phenotypic characteristics, such as antigenic determinants, to be elucidated. Furthermore, virus sequence data can also be used to address more fundamental questions relating to the evolution of viruses. Recent advances in laboratory methodologies allow rapid sequencing of virus genomes. For the first time, this opens up the potential for using genome sequencing to reconstruct virus transmission trees with extremely high resolution and to quickly reveal and identify the origin of unresolved transmission events within discrete infection clusters. Using foot-and-mouth disease virus as an example, this chapter describes strategies that can be successfully used to amplify and sequence the full genomes of RNA viruses. Practical considerations for protocol design and optimization are discussed, with particular emphasis on the software programs used to assemble large contigs and analyze the sequence data for high-resolution epidemiology.
Key words: Complete genome, foot-and-mouth disease virus, nucleotide sequence, virus
During the past 15 years, a number of incremental improvements have been made to methods used to generate nucleotide sequence data. The principle underpinning the most widely used sequencing approaches is based on the dideoxynucleotide chain-termination method initially devised by Fred Sanger in the 1970s (1). The throughput and robustness of these methods have been improved by the use of fluorescent dyes and capillary separation technologies, such that the routine assembly of large fragments of genomic DNA (>10 kb) is now achievable by many modestly equipped laboratories. For the most part, protocols developed to sequence large fragments of nucleic acid can also be adapted to characterize the genomes of RNA viruses, which typically are 15 kb or less. Full-genome sequences of viruses can be used to address fundamental questions relating to evolution, identification of critical antigenic determinants, and viral molecular epidemiology. Although sequencing small numbers of some viral genomes can be straightforward, specific protocols and work flows are required to effectively manage projects that aim to characterize the molecular epidemiology of viral transmission.
Using foot-and-mouth disease virus (FMDV) as an example, this chapter describes strategies that can be successfully used to amplify and sequence the complete genomes of RNA viruses. Foot-and-mouth disease (FMD) is a highly contagious disease affecting cloven-hoofed livestock (cattle, sheep, pigs, goats, and water buffalo). The causative agent is a virus belonging to the genus Aphthovirus (family: Picornaviridae) that exists as seven antigenically distinct serotypes, each comprising numerous and constantly evolving variants (2). The genome of FMDV is approximately 8,300 nucleotides in length. It comprises a polyadenylated positive-sense RNA that encodes a single polyprotein, which is posttranslationally cleaved into constituent capsid proteins and nonstructural proteins involved in viral replication.
In common with most other RNA viruses, the enzyme (RNA-dependent RNA polymerase) responsible for replication of the FMDV genome has poor fidelity, such that changes to the nucleotide sequence frequently occur and are inherited by progeny viruses. This rapid evolution rate of FMDV allows virus transmission trees to be reconstructed with extremely high resolution, opening up the possibility of using these data to retrospectively reveal and identify the origin of unresolved transmission events (3,4). In addition to forensic molecular epidemiology, full-genome sequence data have also recently contributed to our understanding of a number of aspects of FMDV evolution, including (i) evolutionary rates (5); (ii) sites and importance of recombination (6,7); (iii) identification of ordered RNA structures (8); and (iv) contribution and significance of the quasi-species phenomenon to evolution (9). Sequence data from a wide variety of FMDV isolates also play an important role in the reiterative design of oligonucleotide primers used for molecular assays for routine diagnostic use in reference laboratories (for pan-reactive and serotype-specific detection and strain characterization).
1.1 Amplification Strategies: Design and Targeting of Polymerase Chain Reaction Primers
The extent of the run length obtained by capillary sequencers places a limit on the maximum distance between oligonucleotide primers (either in the polymerase chain reaction [PCR] amplification or cycle sequencing setup stages). In contrast to DNA targets, which are relatively stable, researchers who study RNA viruses, such as picornaviruses, are familiar with the plasticity of viral genomes. This high variability poses particular challenges for the design of pan-reactive oligonucleotide primers to reliably amplify complete viral complementary DNAs (cDNAs). For viruses such as FMDV, the existence of multiple serotypes (whose nucleotide sequences may vary by as much as 50% in some genome regions) can further complicate the identification of suitable target sequences.
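The run-length constraint described above can be illustrated as a simple tiling calculation. The sketch below is only an illustration: the usable read length (600 nt) and the overlap between neighboring fragments (100 nt) are assumed values chosen for the example, not figures from this chapter, and the function name is hypothetical. Only the genome length (approximately 8,300 nt) is taken from the text.

```python
def tile_genome(genome_length, read_length, overlap):
    """Return 1-based (start, end) intervals that tile a genome.

    Each interval is at most `read_length` long and overlaps the
    previous interval by `overlap` nucleotides.
    """
    if overlap >= read_length:
        raise ValueError("overlap must be smaller than read_length")
    step = read_length - overlap
    tiles = []
    start = 1
    while True:
        end = min(start + read_length - 1, genome_length)
        tiles.append((start, end))
        if end == genome_length:
            break
        start += step
    return tiles


if __name__ == "__main__":
    # Assumed values: 600-nt usable reads, 100-nt overlaps.
    tiles = tile_genome(genome_length=8300, read_length=600, overlap=100)
    print(len(tiles), "fragments; first", tiles[0], "last", tiles[-1])
```

In practice, primer positions would also be constrained by sequence conservation between serotypes, which this sketch does not model.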
Alternative approaches, such as shotgun cloning (for example, Fig. 1c) are also being considered for full-genome sequencing. Initially, these use long-range PCR to amplify large fragments of the virus genome (possibly even encompassing entire genomic sequences). These PCR products are subsequently fragmented and cloned into plasmid vectors prior to sequencing and reconstruction of the viral sequence. Since this approach uses only two viral-specific primers (which can be targeted to highly conserved regions) and is not reliant on internal virus-specific primers, this method may provide a more suitable approach that has a broader sensitivity to different viral variants. However, these methods need to balance the advantages in diagnostic sensitivity that are gained from using a smaller number of primers with the drawback of lower analytical sensitivity that may arise from amplifying large PCR products (in comparison to shorter fragments).
1.2 Overview of Approaches Used for FMDV Sequencing
In this chapter, a guide protocol that has been successfully used to sequence FMDV is presented. Although some of the finer details are specific to FMDV, the general approaches described are broadly applicable to other RNA viruses. Indeed, similar methods have been described recently to characterize the genomes of other viruses that infect humans, livestock, and plants (16, 17, 18, 19, 20, 21, 22, 23).
2.1 RNA Extraction
0.04M phosphate-buffered saline: 35 mM Na2HPO4, 5.7 mM KH2PO4, pH 7.6. Store at room temperature.
Sterile sand (Fine Sifted, BDH). Small aliquots (∼3 g) are prepared and autoclaved prior to use. Store at room temperature.
Sterile pestle and mortar (Fisher); autoclave prior to use.
TRIzol Reagent (Invitrogen). Store at +2 to 8°C. This solution contains phenol and guanidine isothiocyanate; care should be taken to minimize skin contact and inhalation.
Chloroform (AnalaR Grade, BDH) (toxic and probable carcinogen; care should be taken to minimize inhalation and ingestion).
0.2M glycogen (Roche).
Isopropanol (propan-2-ol) (AnalaR Grade, BDH).
Ethanol (AnalaR Grade, BDH). Store at +2 to 8°C.
Nuclease-free water (deoxyribonuclease [DNase] and ribonuclease [RNase] free) (Invitrogen). Store at room temperature.
2.2 Reverse Transcription and PCR Amplification
Random hexamers (Promega). Store at −20°C.
Deoxynucleotide 5′-triphosphate (dNTP) mixture (Promega). The dNTP mix is a premixed solution containing sodium salts of dATP (deoxyadenosine 5′-triphosphate), dCTP (deoxycytosine 5′-triphosphate), dGTP (deoxyguanosine 5′-triphosphate), and dTTP (deoxythymidine 5′-triphosphate), each at 10 mM in water. Store at −20°C.
Reverse transcription kit: SuperScript™ III RT (Invitrogen). Enzyme is supplied with a vial of 5X first-strand buffer (250 mM Tris-HCl, pH 8.3, 375 mM KCl, 15 mM MgCl2) and a vial of 100 mM dithiothreitol (DTT). Store at −20°C.
RNaseOUT (Invitrogen); store at −20°C.
GFX™ PCR DNA and Gel Band Purification Kit (GE Healthcare).
2.3 PCR Cleanup Prior to Setting Up Sequencing Reactions
Agarose (UltraPure™, Invitrogen). Store at room temperature.
Tris-borate-ethylenediaminetetraacetic acid (EDTA) buffer (National Diagnostics). 10X solution: When diluted, the 1X solution contains 89 mM Tris-base, 89 mM boric acid (pH 8.3), and 2 mM Na2EDTA. Store at room temperature.
Ethidium bromide (UltraPure). Add 2 μL of a 10 mg/mL stock solution to 100-mL gels to visualize PCR bands. Ethidium bromide is a potent mutagen; care should therefore be taken to minimize exposure and to ensure correct disposal of material (solutions and gels) containing ethidium bromide. Store at room temperature.
6X loading buffer for samples to be tested by agarose gel electrophoresis (Invitrogen).
DNA standards, if required (Invitrogen).
3.1 RNA Extraction
Using sterile sand and a pestle and mortar, prepare a 10% (w/v) suspension of the tissue sample in phosphate-buffered saline. Liquid samples (such as serum) can be processed straight to step 3. Depending on application and nature of the sample to be tested (see Note 3), alternative RNA extraction protocols can also be used (such as commercially available silica-based spin columns).
Centrifuge at 300g for 10 min.
Add 200 μL of the sample supernatant to 1 mL of TRIzol reagent in a microfuge tube (see Note 4).
Add 240 μL of chloroform directly to the tube.
Mix the tube by inversion and centrifuge for 15 min at 10,000g at 2–8°C.
Transfer the top phase to a fresh microfuge tube and add 1 μL of 0.2M glycogen.
Add an equivalent volume of isopropanol.
Mix the tube by inversion and centrifuge for 15 min at 10,000g at 2–8 °C.
Carefully wash the pellet (containing the RNA) with ice-cold 70% ethanol and recentrifuge for 15 min (10,000g at 2–8 °C).
Air-dry the pellet and resuspend RNA in nuclease-free water.
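Two of the preparation steps above involve simple proportions: the 10% (w/v) tissue suspension and the 70% ethanol wash. The sketch below shows the arithmetic under stated assumptions. The tissue mass and wash volume in the example are hypothetical, the function names are illustrative, and the ethanol calculation treats volumes as additive, which is only an approximation for ethanol and water mixtures.

```python
def suspension_volume_ml(tissue_mass_g, percent_w_v=10.0):
    """Final volume (mL) giving the requested % (w/v), i.e. g per 100 mL."""
    return tissue_mass_g * 100.0 / percent_w_v


def ethanol_dilution_ml(final_volume_ml, stock_percent=100.0, target_percent=70.0):
    """Return (ethanol_ml, water_ml) for a target % (v/v), assuming additive volumes."""
    ethanol_ml = final_volume_ml * target_percent / stock_percent
    return ethanol_ml, final_volume_ml - ethanol_ml


if __name__ == "__main__":
    # Hypothetical example: 0.5 g of epithelium, 10 mL of 70% ethanol.
    print("Suspension volume (mL):", suspension_volume_ml(0.5))
    print("Ethanol and water (mL):", ethanol_dilution_ml(10.0))
```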
3.2 Reverse Transcription and PCR Amplification
Prepare primer mix (9 µL) containing 30 pmol reverse primer [5′-GGC GGC CGC TTT TTT TTT TTT TTT-3′], 50 ng random hexamers, and 30 nmol of each dNTP (3 µL of a 10 mM solution).
Add to 12 µL prepared RNA.
Denature the RNA by incubating the RNA/primer mix at 70°C for 3 min and place on ice for 3 min.
Add 17 µL of reverse-transcription (RT) mix containing 8 µL first-strand buffer, 2 µL 0.1 M DTT, 2 µL RNaseOUT, 5 µL nuclease-free water.
Add 2 µL Superscript III Reverse Transcriptase.
Incubate at 42°C for 1–4 h followed by 85°C for 5 min. A specific PCR amplifying the 5′ end of the genome can be used to test that complete first-strand cDNAs have been generated.
Clean up cDNA using GFX PCR DNA and Gel Band Purification kit according to the manufacturer's instructions and elute in 50 µL. This step removes unincorporated primers and dNTPs from the RT reaction.
Set up a PCR master mix in a clean room using the primer sets required for amplification of the genomic fragments.
Add 2.5 μL cDNA to each reaction in a separate area away from the PCR clean room (see Note 5).
Run thermocycling program (as described in refs. 2, 3, 10; see Note 6).
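The volumes given in the RT steps above can be checked with a short calculation. The sketch below uses only the volumes and stock concentrations stated in the protocol (9 µL primer mix, 12 µL RNA, 17 µL RT mix, 2 µL enzyme, dNTPs at 10 mM each, 5X first-strand buffer). The variable and function names are illustrative, and the derived final concentrations are simple arithmetic rather than values quoted by the authors.

```python
# Volumes in microlitres, copied from the RT steps above.
PRIMER_MIX_UL = 9.0
RNA_UL = 12.0
RT_MIX_UL = {
    "5X first-strand buffer": 8.0,
    "DTT": 2.0,
    "RNaseOUT": 2.0,
    "nuclease-free water": 5.0,
}
ENZYME_UL = 2.0
DNTP_STOCK_MM = 10.0  # concentration of each dNTP in the premixed stock
DNTP_STOCK_UL = 3.0   # volume of dNTP stock included in the primer mix


def rt_reaction_summary():
    """Return the total volume and derived final concentrations."""
    rt_mix_total = sum(RT_MIX_UL.values())
    total = PRIMER_MIX_UL + RNA_UL + rt_mix_total + ENZYME_UL
    dntp_nmol = DNTP_STOCK_UL * DNTP_STOCK_MM  # µL x mM = nmol
    return {
        "rt_mix_ul": rt_mix_total,
        "total_ul": total,
        "dntp_nmol_each": dntp_nmol,
        "dntp_final_mM_each": dntp_nmol / total,  # nmol per µL = mM
        "buffer_final_x": 5.0 * RT_MIX_UL["5X first-strand buffer"] / total,
    }


if __name__ == "__main__":
    for key, value in rt_reaction_summary().items():
        print(f"{key}: {value}")
```

The check confirms that the RT mix components sum to 17 µL, that 3 µL of a 10 mM stock supplies 30 nmol of each dNTP, and that 8 µL of 5X buffer in the full reaction volume gives a 1X final concentration.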
3.3 PCR Cleanup Prior to Setting Up Sequencing Reactions
Run 2 μL of PCR product on 1.2% (w/v) agarose gel at 105 V for 30 min to check that the reaction has worked.
Clean up cDNA using GFX PCR DNA and Gel Band Purification kit according to the manufacturer's instructions.
Quantify DNA concentration in purified PCR product. This can be done using a spectrophotometer (e.g., Nanodrop, Thermo Fisher Scientific) or by agarose gel electrophoresis using DNA standards (see Subheading 2.3).
Dilute products to give appropriate concentrations for sequencing.
Prepare sequencing reaction using diluted PCR product.
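The dilution step above follows the usual C1V1 = C2V2 relationship. The chapter does not state a target concentration, so the numbers in the sketch below (a 50 ng/µL purified product diluted to 10 ng/µL in 20 µL) are purely hypothetical, as is the function name. The appropriate target depends on the sequencing chemistry and fragment length.

```python
def dilution_volumes(stock_ng_per_ul, target_ng_per_ul, final_volume_ul):
    """Return (product_ul, water_ul) needed to reach the target concentration."""
    if target_ng_per_ul > stock_ng_per_ul:
        raise ValueError("target concentration exceeds the stock concentration")
    product_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return product_ul, final_volume_ul - product_ul


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    product_ul, water_ul = dilution_volumes(50.0, 10.0, 20.0)
    print(f"Mix {product_ul} µL PCR product with {water_ul} µL water")
```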
3.4 Analysis of Sequence Data
Sequencing viral genomes can quickly accumulate a large amount of data (see Note 7). Software programs (such as Lasergene, http://www.dnastar.com/) can be used to simplify the alignment of individual sequences and to rapidly assemble large contigs. The minimum criterion for acceptance of a final sequence is that each nucleotide position should be determined by sequencing reactions in both directions (forward and reverse).
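The acceptance criterion above (every position confirmed by a forward and a reverse read) is easy to check programmatically once read coordinates on the contig are known. The following is a minimal sketch, assuming reads are supplied as hypothetical (start, end, strand) tuples with 1-based inclusive coordinates. It is not tied to the output format of Lasergene or any other assembly package.

```python
def positions_lacking_both_strands(contig_length, reads):
    """Return positions not covered by at least one forward and one reverse read.

    reads: iterable of (start, end, strand), 1-based inclusive,
           with strand given as '+' (forward) or '-' (reverse).
    """
    forward = [False] * (contig_length + 1)
    reverse = [False] * (contig_length + 1)
    for start, end, strand in reads:
        covered = forward if strand == "+" else reverse
        for pos in range(max(1, start), min(end, contig_length) + 1):
            covered[pos] = True
    return [
        pos
        for pos in range(1, contig_length + 1)
        if not (forward[pos] and reverse[pos])
    ]


if __name__ == "__main__":
    # Toy contig of 12 nt with one forward and one reverse read.
    example_reads = [(1, 10, "+"), (5, 12, "-")]
    print(positions_lacking_both_strands(12, example_reads))
```

Positions returned by the function would need additional sequencing reactions before the contig is accepted.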
Currently, the genetic evolution and relationships of viruses are studied by analyzing their genetic sequence data by phylogenetic methods. Phylogenetic trees are constructed and used to deduce the genetic relatedness of the viruses. There are different methods for constructing phylogenetic trees; the first approach developed was the maximum parsimony methodology, but more recently maximum likelihood (24) and Bayesian methods (25) are the preferred techniques for tree construction. Other methods based on distance matrices, such as neighbor-joining (26) or unweighted pair-group method with arithmetic mean (UPGMA) (27), which calculate genetic distance from multiple sequence alignments, are simpler to implement but do not invoke an evolutionary model.
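Distance-based methods such as neighbor-joining and UPGMA start from a matrix of pairwise distances. As a minimal sketch of that first step, the code below computes the uncorrected proportion of differing sites (the p-distance) for every pair of aligned sequences. The isolate names and sequences are invented for illustration, and the tree-building step itself is not shown.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a == "-" or base_b == "-":
            continue
        compared += 1
        if base_a != base_b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_matrix(alignment):
    """Return {(name_a, name_b): p-distance} for every pair of sequences."""
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    # Hypothetical aligned fragments.
    toy = {"iso1": "ACGTACGT", "iso2": "ACGTACGA", "iso3": "ACGAACGA"}
    for pair, dist in distance_matrix(toy).items():
        print(pair, dist)
```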
Maximum parsimony determines the most parsimonious tree requiring the fewest evolutionary steps. This method is simple and as such makes very few assumptions about the evolutionary process. However, certain features of genetic evolution of organisms present problems when using this method of tree construction. First, inaccuracies can occur as a result of the existence of homoplasy. Homoplasy describes processes, such as convergent evolution, by which a single mutation can occur twice on independent branches of a tree. Hence, it implies that two sequences sharing a mutation were not necessarily derived from a common ancestor that also contained this mutation. Another hurdle to overcome is back-mutation, by which a mutation reverts to its original genotype. This can cause the specific sequence to appear more ancestral than is necessarily the case. A further drawback to the method of maximum parsimony is that it takes no account of the rate at which mutations arise and the varying probabilities of different mutations occurring (i.e., transversions vs. transitions).
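To make the idea of counting evolutionary steps concrete, the sketch below scores a fixed tree with Fitch's algorithm, a standard way of computing the minimum number of changes a given topology requires. This is an illustrative choice rather than a method named in this chapter, and the four sequences and two candidate topologies are invented. Trees are written as nested pairs of leaf names.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree:   a leaf name (str) or a pair (left_subtree, right_subtree)
    states: dict mapping leaf name -> nucleotide at this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No shared state: one substitution is needed on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


if __name__ == "__main__":
    toy = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACA"}
    tree_1 = (("A", "B"), ("C", "D"))
    tree_2 = (("A", "C"), ("B", "D"))
    print("tree_1 steps:", parsimony_score(tree_1, toy))
    print("tree_2 steps:", parsimony_score(tree_2, toy))
```

On the second topology the substitution at the middle column has to be counted on two independent branches, which is how homoplasy appears when a tree is scored in this way.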
For these reasons, the parametric method of maximum likelihood is usually preferred as it provides the most probable tree that suits a specific determined evolutionary model. Providing that the model employed is a reasonable approximation of the evolutionary processes that gave rise to the observed genetic data, this analysis is potentially more powerful than other methods. The evolutionary model may include a large number of parameters accounting for differences in the probabilities of various character states, differences in the occurrence of particular substitutions/mutations, and differences in the probabilities of change among characters. With the sophisticated models such as the Hasegawa-Kishino-Yano (HKY) model (28) and the general time reversible (GTR) model (29), an improved idea of phylogeny is achieved, although fitting an incorrect model can give incorrect results. The suitability of models can be tested using a program such as Modeltest (30).
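The principle of maximum likelihood can be shown on the smallest possible case: estimating the distance between two sequences. The sketch below deliberately uses the Jukes-Cantor (JC69) model, which is far simpler than the HKY and GTR models named above, because its likelihood can be written in a few lines and its maximum is known in closed form. The site and difference counts are hypothetical, and the grid search stands in for the numerical optimizers used by real phylogenetics software.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diffs):
    """Log-likelihood of distance d (substitutions per site) under JC69."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    return (
        n_diffs * math.log(p_diff / 3.0)
        + (n_sites - n_diffs) * math.log(1.0 - p_diff)
        + n_sites * math.log(0.25)
    )


def jc69_ml_distance(n_sites, n_diffs, step=1e-4, max_d=2.0):
    """Find the distance with the highest likelihood by a simple grid search."""
    best_d, best_ll = step, float("-inf")
    for i in range(1, int(round(max_d / step)) + 1):
        d = i * step
        ll = jc69_log_likelihood(d, n_sites, n_diffs)
        if ll > best_ll:
            best_d, best_ll = d, ll
    return best_d


def jc69_closed_form(n_sites, n_diffs):
    """Analytical JC69 distance, used here to check the grid search."""
    p = n_diffs / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Hypothetical comparison: 50 differences over 1,000 aligned sites.
    print("grid search:", jc69_ml_distance(1000, 50))
    print("closed form:", jc69_closed_form(1000, 50))
```

The estimated distance is slightly larger than the observed proportion of differences (0.05) because the model allows for multiple substitutions at the same site.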
Maximum likelihood estimation of tree phylogeny is generally preferable to maximum parsimony because it is statistically consistent, has a better statistical foundation, and allows complex modeling of evolutionary processes. However, the maximum likelihood method becomes computationally limiting for large numbers of sequences. To infer statistical confidence in phylogenies constructed by either maximum parsimony or maximum likelihood, bootstrap analyses (31) are performed. A further method to infer phylogenies is that of Bayesian inference, which generates a posterior distribution for a parameter based on the prior for that parameter and the likelihood of the data (represented by the sequence alignment). In other words, whereas maximum likelihood analysis investigates the probability of the observed data given a specific evolutionary model, Bayesian inference looks at the probability that a model is correct given the observed data set. With the availability of Markov chain Monte Carlo methods (32), Bayesian inference can be a preferred choice for tree estimation because it can be faster than maximum likelihood, and no bootstrapping is required as the posterior probabilities determine the statistical confidence in the tree.
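The resampling step at the heart of a bootstrap analysis can be sketched briefly. Each pseudoreplicate is built by drawing alignment columns at random, with replacement, until the replicate has the same length as the original alignment. A tree is then inferred from every replicate, and the support for a grouping is the fraction of replicate trees that contain it. Only the resampling is shown below; the alignment is invented and the function name is illustrative.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudoreplicate built by sampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    toy = {"iso1": "ACGTACGT", "iso2": "ACGTACGA", "iso3": "ACGAACGA"}
    rng = random.Random(2009)  # fixed seed so that the example is repeatable
    for _ in range(3):
        print(bootstrap_alignment(toy, rng))
```

Because whole columns are sampled, each site keeps its pattern across all sequences, and only the number of times a site is represented varies between replicates.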
Although in the majority of instances maximum likelihood or Bayesian inference is preferable for tree construction, in certain situations maximum parsimony can be a viable alternative. When studying closely related sequences over a short time period, the likelihood of back-mutation is relatively low, and hence maximum parsimony tree construction is likely to give an accurate estimation of tree phylogeny. Phylogenetic analysis of virus sequences is often performed with the aim of tracing specific virus history, and in these cases the method of statistical parsimony can be used. The distances depicted by parsimony trees represent the actual number of differences between sequences, whereas for a maximum likelihood tree the probability of change is shown (Fig. 3).
3.5 Future Technologies
Newer technologies are currently being developed that offer the potential to eliminate the use of capillary electrophoresis and to provide even greater throughput. Resequencing microarrays have been developed and used to determine the sequence of the severe acute respiratory syndrome (SARS) coronavirus (35,36). However, development of specific arrays is heavily resource dependent and currently likely to be deployed only in niche markets. Of the newer technologies, sequential ligation systems (SOLiD), solid-phase primer amplification (Solexa), and bead-and-well-based pyrosequencing methods (such as the 454 platform) have the capacity to generate reads of 4–20 Mb in a single run. Although this might be considered excessive for characterization of individual viral genomes, these approaches may allow infrequent mutations within a viral population to be detected. Thus, these methods may be ideal for dissecting the genetic variability within viral populations.
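A rough calculation shows why this output might be considered excessive for a single viral genome and, at the same time, useful for detecting rare variants. The sketch below reads the 4–20 Mb figure as the total sequence obtained per run and assumes that every base maps evenly onto an 8,300-nt FMDV genome. Both are simplifying assumptions, and the 1% variant frequency is a hypothetical example.

```python
def mean_depth(total_bases, genome_length):
    """Average number of times each genome position is read."""
    return total_bases / genome_length


def expected_variant_reads(depth, variant_frequency):
    """Expected number of reads carrying a variant present at a given frequency."""
    return depth * variant_frequency


if __name__ == "__main__":
    genome_length = 8300  # approximate FMDV genome length (nt)
    for total_bases in (4_000_000, 20_000_000):
        depth = mean_depth(total_bases, genome_length)
        rare = expected_variant_reads(depth, 0.01)  # hypothetical 1% variant
        print(f"{total_bases} bases: depth ~{depth:.0f}x, "
              f"~{rare:.1f} reads expected for a 1% variant")
```

Under these assumptions a run yields roughly 480-fold to 2,400-fold coverage, so a variant present in 1% of genomes would be expected in several reads at each position.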
In addition to ensuring that all solutions used for RNA extraction are RNase free, pipets and work surfaces should be cleaned using 10% bleach followed by DNAzap (Ambion) prior to and between each sample processed.
A logical work flow for processing the samples for sequencing projects is highly recommended (see Fig. 4 for an example). This is particularly important for high-resolution molecular epidemiological studies since the discrimination of samples may be dependent on the accurate determination of only a few nucleotide differences in the complete genome length (3). Therefore, it is important that care is taken to minimize cross-contamination between samples (particularly post-PCR products). If possible, samples should be processed independently (including suitable negative control material), and the study should be organized to attempt to maximize the differences between successive samples tested.
A variety of sample types (including blood, tissues, esophageal-pharyngeal fluid, and cell culture supernatant) can be tested; however, it is usually preferable to test primary material (such as clinical samples) since it is possible that cell culture passage or molecular cloning of viruses can introduce nucleotide changes that can influence the interpretation of results.
Once placed in TRIzol reagent, samples can be stored for extended periods (at a wide range of temperatures, −70 to +4°C).
The requirement to perform a high number of downstream sequencing reactions may necessitate that a relatively large volume of PCR product is generated, requiring pooling of RNA, cDNA, or post-PCR products. An additional practical consideration is the fidelity of the DNA polymerase used for the PCR amplification step; if possible, one of the widely available proofreading enzymes should be used.
In common with other long PCR methods, the parameters of the protocol used for amplification of viral genomes should be optimized prior to routine use. Steps to be considered include the components of the RT or PCR mixes and the cycling times used for amplification. In initial experiments, a PCR targeting a fragment of the 5′ end of the genome can be used to confirm that full-length cDNA has been produced in the RT reaction.
In general, these methods provide an accurate estimation of the viral consensus sequence. However, it is important to recognize that this sequence will be a composite of the component variability that, to a greater or lesser extent, may be present. In spite of concerns that it is theoretically possible that the sequence generated will not represent an actual virus species present in the sample, studies with FMDV indicated that the majority of molecular clones have identical sequences to the consensus (37). Testing of duplicate samples can generate identical results (4), demonstrating that these methods are accurate, and as long as the viral concentrations are relatively high, consensus sequences obtained will mask any individual proofreading errors that might arise due to low fidelity of reverse transcriptase and polymerase enzymes. These aspects relating to accurate determination of the sequences of specific viral genomes (rather than consensus sequences) will be of particular concern in studies that aim to characterize the genetic population structure within samples (i.e., the quasi-species nature of a virus). New technologies and approaches (see Subheading 3.5) may be utilized to address these important questions that underpin our understanding of viral evolution.
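The theoretical concern that a consensus may not correspond to any genome actually present can be illustrated with a toy example. The sketch below takes the most common nucleotide at each aligned position. The three clone sequences are invented and constructed so that each differs from the consensus at one position. As noted above, studies with FMDV found that most clones match the consensus, so this illustrates the principle rather than a typical result.

```python
from collections import Counter


def consensus(sequences):
    """Return the most common nucleotide at each position of aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(seq[i] for seq in sequences).most_common(1)[0][0]
        for i in range(length)
    )


if __name__ == "__main__":
    # Hypothetical clones: each carries a different single change.
    clones = ["TCGA", "ATGA", "ACTA"]
    result = consensus(clones)
    print("consensus:", result)
    print("present among clones:", result in clones)
```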
This work was funded by Defra research project SE2936. We acknowledge the assistance of colleagues Guido König, Sasmita Upadhyaya, Nigel Ferris, and Geoff Hutchings and Michael Quail from the Wellcome Trust Sanger Institute, Cambridge, for collaboration with the shotgun sequencing approach.
12. Carrillo, C., Tulman, E. R., Delhon, G., Lu, Z., Carreno, A., Vagnozzi, A., et al. (2006). High throughput sequencing and comparative genomics of foot-and-mouth disease virus. Dev. Biol. (Basel) 126, 23–30.
18. Marston, D. A., McElhinney, L. M., Johnson, N., Müller, T., Conzelmann, K. K., Tordo, N., et al. (2007). Comparative analysis of the full genome sequence of European bat lyssavirus type 1 and type 2 with other lyssaviruses and evidence for a conserved transcription termination and polyadenylation motif in the G-L 3′ non-translated region. J. Gen. Virol. 88, 1302–1314.
33. Galtier, N., Gouy, M., and Gautier, C. (1996). SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny. Comput. Applic. Biosci. 12, 543–548.
35. Sulaiman, I. M., Liu, X., Frace, M., Sulaiman, N., Olsen-Rasmussen, M., Neuhaus, E., et al. (2006). Evaluation of Affymetrix severe acute respiratory syndrome resequencing GeneChips in characterization of the genomes of two strains of coronavirus infecting humans. Appl. Environ. Microbiol. 72, 207–211.
37. Cottam, E. M., King, D. P., Wilson, A., Paton, D. J., and Haydon, D. T. (2009). Analysis of foot-and-mouth disease virus nucleotide sequence variation within naturally infected epithelium. Virus Res. In press.