Introduction

In recent decades, many infectious diseases have significantly increased in incidence and/or geographic range, in some cases impacting heavily on human, animal or plant populations. Some of these ‘emerging infectious diseases’ are associated with pathogens that have appeared in populations for the first time as a result of cross-species transmission (e.g. human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS), severe acute respiratory syndrome (SARS)), while others were previously known but are rapidly increasing in incidence or geographic range as a result of underlying epidemiological changes (e.g. methicillin-resistant Staphylococcus aureus (MRSA) infection, dengue, West Nile encephalitis, foot and mouth disease, cassava mosaic disease). The latter include prominent diseases such as tuberculosis, malaria and yellow fever that were once on the decline but are now re-emerging.

Factors underlying emergence may be broadly grouped into (1) ‘ecological’ changes (such as environmental, agricultural, socio-economic, demographic and behavioural changes) that increase the probability of exposure of susceptible individuals/populations to infected reservoir hosts or vectors, (2) evolutionary changes that lead to increased pathogen virulence, drug resistance, host range or transmissibility and (3) changes in host population susceptibility (e.g. due to malnutrition and HIV-associated immunodeficiency in human populations). In human populations, the majority of disease emergence is driven by ecological factors (Jones et al. 2008; Morens et al. 2004; Taylor et al. 2001; Weiss and McMichael 2004; Woolhouse and Gaunt 2007; Woolhouse 2002; Woolhouse et al. 2005). In particular, anthropogenic factors such as deforestation, habitat fragmentation, urbanisation and modern agricultural practices provide increased opportunities for human interaction with infected reservoirs and vectors, and the existence of rapid global transport networks, and high-density human and animal populations facilitate the spread of pathogens at an unprecedented rate, often over very large distances (Jones et al. 2008; Morens et al. 2004; Taylor et al. 2001; Weiss and McMichael 2004; Woolhouse and Gaunt 2007; Woolhouse 2002; Woolhouse et al. 2005).

Although emerging and re-emerging diseases are associated with all types of microbes, viruses (and in particular RNA viruses) predominate (Taylor et al. 2001). This is considered a consequence of their large population sizes and capacity for rapid evolutionary change (Woolhouse 2002) which together can produce large pools of phenotypic variants including viruses with altered virulence, transmissibility or host range that have increased epidemic potential in their original hosts or are able to jump species boundaries and establish themselves in new hosts. Emergence that leads to successful host switching may be classified into three stages: (1) initial single infection of a new host with no onward transmission (i.e. spillovers into ‘dead-end’ hosts), (2) spillovers that go on to cause local chains of transmission in the new host population before epidemic fade-out (i.e. outbreaks) and (3) epidemic or sustained endemic host-to-host disease transmission in the new host population (Parrish et al. 2008).

In human populations, the majority of emerging diseases are caused by viruses that originate in wildlife populations and spill over into humans either directly or via domestic animals (Taylor et al. 2001). Wolfe et al. (2007) defined five stages through which animal-only (i.e. stage 1) viruses progress to become human-only (stage 5) viruses such as measles, smallpox and mumps (see Table 7.1). For the majority of emerging viruses, humans apparently represent dead-end hosts and only a few proceed to stage 3 and beyond to achieve human-to-human transmission and cause epidemics or sustained endemic transmission (Parrish et al. 2008; Wolfe et al. 2007). Nonetheless, when this does occur, the impact in terms of morbidity, mortality and economic costs can be immense, as has been well demonstrated by the emergence of HIV, SARS coronavirus and H1N1 influenza.

Table 7.1 Stages of viral emergence into human populations (Wolfe et al. 2007)

In terms of understanding and eventually controlling viral disease emergence, the challenge lies in identifying and quantifying the factors that determine which viruses may make the species jump and whether a new disease will progress to epidemic stage or not. At the other end of the spectrum, there is the ever-present challenge of developing effective therapies and vaccines against rapidly evolving viral pathogens. In this regard, emerging viruses, the nature and extent of their diversity, their evolutionary processes and disease mechanisms need to be fully characterised and understood.

Viruses were the first organisms to have their genomes completely sequenced (Fiers et al. 1976), and because of their small size, this could be done relatively quickly and cheaply even prior to the advent of ‘next-generation’ sequencing technologies. There is no doubt, however, that the latter has opened the floodgates since viral genomes can now be generated at lower cost and much more rapidly than was possible using conventional sequencing approaches. The number of viral genomes available in public databases continues to increase exponentially. This wealth of data has led to significant progress in terms of rapid identification and characterisation of emerging viruses, as well as knowledge about their biodiversity and evolution. In terms of evolutionary biology, the beauty of working in viral genomics lies in the ability to study evolutionary changes on the same time scales as the events that shape them. For several viruses, historical samples are available for retrospective study, and their analysis has contributed to our understanding of viral evolutionary and epidemiological factors/events accompanying their emergence, maintenance and spatial diffusion (reviewed in (Pybus and Rambaut 2009)).

In addition to enabling an exploitation of existing virus collections, the new sequencing technologies and accompanying bioinformatic tools provide the potential for comprehensive tracking of viral evolution and population dynamics in real time. Unfortunately, much less progress has been made in areas that impact directly on virus control and treatment (Holmes 2009). This is largely a consequence of the lack of appropriate clinical and epidemiological data to accompany the wealth of sequences (Holmes 2009). The other challenge that cannot be ignored is the ability of current computational approaches to deal with the huge volume of sequence data being generated.

In this chapter, I discuss how viral genomics has contributed to our understanding of each of the stages of viral emergence and how it might contribute to disease prevention and control in the future. Although disease emergence in other species can be of equal importance and ultimately impacts on human development, for the purpose of brevity, I concentrate primarily on diseases that have emerged in human populations and draw examples from those that have most deeply affected the developing world. While there is no apparent relationship between the tendency for new human pathogens to be reported and a country’s geographic location or level of development (Woolhouse and Gaunt 2007), inadequate public health surveillance and response systems in developing countries coupled with the existence of underlying disease conditions have meant that disease burden is usually greater in developing than developed countries (as well illustrated by the recent H1N1 pandemic, (Archer et al. 2009)). Additionally, prevention and control strategies that are effective in more developed countries often fall short in resource-limited settings, which can then act as pockets of refuge where pathogens persist, and may serve as future source populations for outbreaks in other regions.

Investigating the Cross-Species Transmission Interface

Identification and Characterisation of Potentially Emergent Viruses in Animal Populations

It has been suggested that preventing viral disease emergence in human populations begins with a systematic survey of viral diversity in animal populations (Wolfe et al. 2007). Such knowledge would enable identification of animal populations harbouring viruses that have previously infected humans or that are likely to do so by virtue of their relatedness to known human pathogens, or perhaps their ability to infect human cell lines (Holmes and Rambaut 2004). Zoonotic viruses generally cause little or no apparent disease in their original hosts; thus animal reservoirs are often not obvious. Since it is clearly impossible to survey all animal species, the focus of animal surveillance should be on species that are more likely to harbour potentially emergent viruses, for example species with large and/or dense populations, and in particular those that live in close proximity to and are more closely related to humans and their domestic mammals, such as rodents, bats and birds (Holmes and Rambaut 2004). Non-human primate populations (regardless of their size or population density) are also worth surveying because of their close evolutionary relationship with humans and the fact that a number of important human pathogens have emerged from them (e.g. dengue virus (DENV), chikungunya virus (CHIKV), yellow fever virus (YFV), human T-cell leukaemia virus (HTLV) and HIV). Finally, any other species having direct (e.g. bushmeat, livestock) or indirect contact (e.g. vector-mediated contact) with humans that could have led to human infections in the past should also be included (Wolfe et al. 2007).

Traditional approaches to virus discovery such as electron microscopy, cell culture, animal inoculation studies and serology (Storch 2007) have a number of limitations, the most important being that not all viruses can be cultured in the laboratory (Amann et al. 1995). There is now a range of sensitive molecular approaches to virus discovery that circumvent this problem by relying on detection and characterisation of viral genomes rather than targeting viral particles, antigens or their cytopathic effects (reviewed in Bexfield 2011). These include hybridisation-, PCR- and sequence-based approaches that have varying levels of reliance on sequence information from known pathogens and thus differ in terms of the range of pathogens they would be expected to detect. For example, hybridisation-based techniques (such as microarray (Wang et al. 2002) and subtractive hybridisation (Lisitsyn et al. 1993)) require sequence information from known pathogens to detect related pathogens and are unable to detect completely novel virus families. Likewise, PCR-based approaches using degenerate primers are limited to amplification and detection of related viruses. However, there are also sequence-independent PCR approaches that facilitate detection of completely novel pathogens. These include sequence-independent single primer amplification (SISPA), degenerate oligonucleotide primed PCR, random PCR and rolling circle amplification (reviewed in Bexfield 2011). When these approaches are coupled with ‘next-generation’ sequencing technology (Margulies et al. 2005) such as 454 pyrosequencing (Roche), Illumina (Solexa) and SOLiD™ (Applied Biosystems) for definitive identification of amplified fragments, they very efficiently generate large amounts of sequence data that can then be analysed using bioinformatic tools.
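The reason degenerate primers reach only related viruses can be illustrated with a minimal sketch. A degenerate primer is shorthand for a small set of concrete sequences, and only templates carrying one of those sequences will be amplified. The primer and template below are hypothetical and chosen purely for illustration; exact string matching stands in for primer annealing, which in practice tolerates some mismatches.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_degenerate(primer):
    """Return every concrete sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

def primer_matches(primer, target):
    """True if any expansion of the primer occurs exactly in the target.

    Exact matching is a simplifying assumption made for this sketch.
    """
    return any(seq in target for seq in expand_degenerate(primer))

# Hypothetical primer and template, for illustration only.
variants = expand_degenerate("ATRYG")
print(variants)
print(primer_matches("ATRYG", "CCATGCGTT"))
```

The more ambiguity codes a primer carries, the wider the set of related genomes it can detect, but a genome with no member of that set in the target region remains invisible, which is the limitation the sequence-independent methods address.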

Next-generation sequencing also obviates the need for amplification prior to sequencing and has opened the field of metagenomics, i.e. the culture-independent study of microbial communities in environmental or biological samples by analysing the sample’s nucleotide content. First applied to environmental samples such as sea water (Angly et al. 2006; Breitbart et al. 2002; Williamson et al. 2008), fresh water (Breitbart et al. 2009; Djikeng et al. 2009), soil (Fierer et al. 2007) and marine sediments (Breitbart et al. 2004), this approach has now been used to define the ‘microbiomes’ of a range of biological samples including human nasopharyngeal swabs (Bogaert et al. 2011), termite gut (Hongoh 2011) and cow rumen (Hess et al. 2011). It has also been adapted to specifically target viral metagenomes or ‘viromes’, by enriching samples for intact virions and then treating with nucleases to remove naked DNA and RNA that is not protected within virion particles (Djikeng et al. 2008). In terms of targeting potential reservoirs or vectors for emerging diseases, studies have been performed on faecal, oral, urine and tissue samples from bats (Donaldson et al. 2010; Li et al. 2010a), insect pools (Victoria et al. 2008), chimpanzees and farm animals (Li et al. 2010b).

The metagenomic approach has also been used for the identification and characterisation of 2009 pandemic H1N1 influenza A virus from nasopharyngeal swabs (Greninger et al. 2010), to study previously ‘uncharacterisable’ viruses that have been isolated through culture (Victoria et al. 2008), to explore within-host diversity of HIV and SIV (Bimber et al. 2010) and in comparative studies to identify viruses found in diseased versus healthy tissues from a variety of species (Blomström et al. 2010; Ng et al. 2009a; Ng et al. 2009b; Willner et al. 2009). However, one important limitation of this approach to detecting novel viruses is that the protocol currently used to enrich samples for viruses prior to sequencing includes a filtration step designed to exclude cells, cell debris and bacteria, which may also exclude very large viruses (‘giant viruses’) such as mimiviruses. Also, nuclease treatment eliminates the genomes of any viruses whose integrity has been disrupted by the enrichment process, and depending on the titre of remaining intact virions, these may not be efficiently sequenced (Djikeng et al. 2008).

One intriguing new approach to virus discovery that is worth noting in terms of its ability to characterise viral diversity in insects (which can be important viral vectors) is ‘virus discovery in invertebrates by deep sequencing and assembly of total small RNAs’ or vdSAR (Kreuze et al. 2009; Wu et al. 2010). This approach involves deep sequencing of viral small interfering RNAs (vsiRNA) produced by host immune machinery in response to infection. vsiRNAs are produced by cutting up viral genomes, so piecing their sequences together recovers the virus sequence. In addition to being a sequence independent approach, the process is expected to be more efficient since only a small proportion of host small RNAs need to be sequenced and data-mined (Wu et al. 2010). Additionally, since vdSAR assembles viral genomes from the products of an active host immune response to infection, only replicating and infectious viruses that induce the immune response are identified by this approach (Wu et al. 2010).
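The idea of piecing overlapping small RNA sequences back together can be sketched with a toy greedy overlap assembler. This is an illustration of the principle only and is not the assembly software used in the cited studies. The reads and the minimum overlap are made-up values, the reads are far shorter than real vsiRNAs, and the sketch ignores sequencing errors and the fact that small RNAs derive from both strands.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return reads  # nothing left to merge: these are the contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]

# Toy reads cut from a hypothetical 15-nucleotide sequence.
contigs = greedy_assemble(["ATGGCGTA", "GCGTACGT", "ACGTTAGC"])
print(contigs)
```

Because the reads tile the toy sequence with overlaps of at least three bases, the three fragments collapse into a single contig; with sparser coverage the function would return several shorter contigs instead.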

Identification and Characterisation of Newly Emerged Viruses

In addition to facilitating surveys of animal reservoirs and vectors, all of the techniques described above (with the exception of vdSAR) may be used to rapidly detect and characterise newly emerged viruses in human populations. This is usually the primary research focus when an apparently new infectious disease first appears, as it facilitates the development of screening tests for early detection and epidemiological investigations aimed at identifying risk groups, reservoirs and possible transmission routes. Such information can then be used to inform control and prevention strategies, including the development of vaccines and antiviral therapies.

The role that viral genomics can play in this regard was well demonstrated during the emergence of SARS, the first cases of which appeared in November 2002 in southern China. In March 2003, traditional cell culture resulted in the isolation of a novel virus from patient specimens (Drosten et al. 2003; Ksiazek et al. 2003; Peiris et al. 2003). Within days of this, the virus was identified as a coronavirus through the use of a pan viral microarray and confirmed by sequencing using two parallel approaches. The first involved designing primers based on known coronaviruses and amplifying regions of the novel virus, and in the second, viral sequences were directly recovered from the surface of the microarray to which they were hybridised, cloned and sequenced without the need to design specific primers (Wang et al. 2003). Comparison with previously characterised coronavirus strains demonstrated that the virus identified was distinct from all known human pathogens (Wang et al. 2003). Thus within 24 h, an unknown virus was identified as a coronavirus and within days partial genome sequences had been generated. Comparative genomics and evolutionary analyses also played the major role in pinpointing bats as the source of the precursor to the SARS virus and the primary reservoirs for SARS-like coronaviruses (Dominguez et al. 2007; Gloza-Rausch et al. 2008; Lau et al. 2005; Poon et al. 2005; Tang et al. 2006; Tong 2009; Woo et al. 2006; Carrington et al. 2008).

As sequencing costs continue to fall and computing capacity improves, metagenomic approaches to virus detection and characterisation will no doubt become more and more routine aspects of public health activities. Researchers have demonstrated the potential utility of high-throughput pyrosequencing for the detection of viruses in human clinical specimens such as stool (Nakamura et al. 2009), nasopharyngeal swabs (Bogaert et al. 2011; Nakamura et al. 2009), autopsy-derived liver and kidney tissues (Palacios et al. 2008) and serum (Briese et al. 2009). This includes identification of novel viruses associated with high mortality outbreaks of unknown aetiology (Briese et al. 2009) and in tissues from individuals who died following organ transplantation from the same donor (Palacios et al. 2008). Others have demonstrated the potential usefulness of metagenomic sequencing in field surveillance for arboviruses by applying the technique to mosquitoes experimentally infected with dengue virus (Bishop-Lilly et al. 2010). It has even been suggested that metagenomic sequencing may be used for continual surveillance of large human populations for known and unknown viral pathogens (Anderson et al. 2003). The suggestion is that large pooled samples of human serum and plasma (possibly discarded specimens from diagnostic laboratories) could be enriched for viral particles and then subjected to metagenomic sequencing on a routine basis. Such large-scale continual surveillance could allow identification of viruses that have entered the human population even before the usual detection thresholds (which would normally depend on several people being infected) have been reached. According to the authors, this approach could be used to ‘monitor the levels of known viruses, rapidly detect outbreaks and systematically discover novel or variant human viruses’ (Anderson et al. 2003).

Understanding Factors Involved in Cross-Species Transmission and Adaptation to New Hosts

Evidence suggests that transmission of viruses from animal reservoirs to humans is not uncommon (Hahn et al. 2000; Wolfe et al. 2005; Wolfe et al. 2004). However, in the majority of cases, humans are dead-end hosts or even when they are not, the zoonotic virus cannot be sustained in prolonged transmission chains such that outbreaks are small and die out quickly. The barriers to onward transmission are primarily biological (Woolhouse and Gaunt 2007). For example, tissue tropism or viral titres achieved might not allow for efficient human-to-human transmission, or transmission might be restricted by reliance on a vector that does not commonly interact with humans or in which the virus does not achieve high enough titres to efficiently infect humans. In an apparent minority of cases, viruses surmount these barriers and can be maintained in the human population and may even lose their ability to replicate in the animal species they originated from.

The evolutionary events that enable cross-species transmission and subsequent adaptation to the new host are poorly understood. However, they are more likely to be the result of viral rather than human evolutionary changes since the time scale of human evolution is so much longer than the time frame implied by the frequency with which these events occur (Holmes and Rambaut 2004; Schliekelman et al. 2001). Studying viral evolution and comparative genomics applied to viruses before and after a transition, or to phylogenetically related human–animal pathogen pairs, can help us to understand the changes involved in adaptation to humans and other aspects of successful emergence.

This type of approach, coupled with in vitro and in vivo studies, was used to identify a single amino acid change in the envelope glycoprotein that is responsible for enzootic strains of Venezuelan equine encephalitis virus (VEEV) gaining the ability to cause epidemics of neurological and potentially fatal disease in horses, with humans as spill-over hosts (Anishchenko et al. 2006). VEEV, an arbovirus belonging to the genus Alphavirus, is usually maintained in an enzootic rodent-mosquito-rodent cycle. An amino acid change (Thr → Arg) at position 213 in the E2 glycoprotein confers the ability to cause high titre viraemia in horses, whereas the wild type is either unable to replicate in horses or does so at very low titres (Anishchenko et al. 2006). Likewise, the dramatic emergence of CHIKV (another mosquito-borne alphavirus) in Asia has been linked to a single amino acid change in the envelope 1 glycoprotein (E1-A226V) of the Indian Ocean lineage (IOL) responsible (de Lamballerie et al. 2008; Hapuarachchi et al. 2010; Kumar et al. 2008; Ng et al. 2009c; Sam et al. 2009; Schuffenecker et al. 2006). This change results in increased infectivity and transmissibility by Aedes albopictus (Tsetsarkin et al. 2007; Vazeille et al. 2007), previously considered as only a secondary vector in human-mosquito-human cycles (urban epidemic cycles), which typically involve Ae. aegypti.
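The first comparative step in such studies, listing the residues that differ between an ancestral-type and an emergent-type protein, can be sketched as follows. The function assumes the two sequences have already been aligned, with '-' marking gaps. The short fragments below are invented for illustration and are not real VEEV E2 sequence; only the Thr to Arg change at position 213 comes from the text.

```python
def substitutions(reference, variant, start=1):
    """List amino acid substitutions between two aligned protein sequences.

    Positions are numbered along the reference, beginning at `start`.
    Gaps ('-') are skipped rather than reported as substitutions.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    position = start - 1
    for ref, var in zip(reference, variant):
        if ref == "-":
            continue  # insertion relative to the reference: no reference position
        position += 1
        if var != "-" and var != ref:
            changes.append(f"{ref}{position}{var}")
    return changes

# Hypothetical aligned fragments covering reference positions 210-216.
enzootic = "KLHTSVA"
epizootic = "KLHRSVA"
print(substitutions(enzootic, epizootic, start=210))
```

A list of this kind only nominates candidates. As the VEEV and CHIKV work shows, establishing that a given substitution is responsible for the change in host or vector range requires experimental confirmation.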

While these findings in VEEV and CHIKV provide proof of concept, they are both unusual in that only one amino acid change resulted in adaptation to a new host/vector. This may be because in both cases, the viruses already had the ability to infect the ‘new’ host, albeit inefficiently. In the case of viruses entering a new species for the first time, the scenario is expected to be much more complicated. This may be why mutations associated with emergence remain unknown for other zoonoses including intensely studied viruses like HIV. Also, more recent work on CHIKV has shown that the effect of the E1-A226V mutation is lineage specific, working only in the IOL genomic background, with endemic Asian CHIKV strains requiring a second mutation (E1-98T) to become Ae. albopictus adapted (Tsetsarkin et al. 2011).

Next-generation sequencing technology allows for rapid and comprehensive surveys of the extent and nature of viral diversity within and amongst animal reservoir hosts, vectors and human populations. This would provide a basis for investigating the fitness distribution and relevance of mutations produced. The latter, coupled with good ecological, epidemiological, immunological and experimental data from in vitro and in vivo systems, is crucial if we are to understand the mechanisms involved in adaptation.

Understanding the Spatiotemporal Dynamics of Emerging Viruses

Phylogenetic inference may be used to reconstruct the demographic history of a population from molecular sequences sampled from the population (Drummond et al. 2005). The approach is based on a population genetic model known as the coalescent, which describes the relationship between the shape of the genealogical tree of sampled sequences and the demographic history of the population from which they were sampled (i.e. rates of population growth and decline, extent of population subdivision and patterns of migration) (Kingman 1982; Griffiths and Tavare 1994). In the case of RNA viruses, exploiting this link between population dynamics and molecular evolution, i.e. exploring their ‘phylodynamics’ (Holmes 2009; Grenfell et al. 2004), is particularly attractive since their high mutation rates, short generation times and large population sizes can result in significant genetic differences between sequences sampled within years, months or even days of each other. Additionally, the relatively short time frames involved mean that evolutionary and demographic events may be temporally aligned with the immunological, transmission and ecological events that shaped them. Given date and location stamped sequences, and depending on the nature and spatiotemporal resolution of the sampling, it is then possible to estimate when and where a given epidemic began or particular lineages arose, the order and timing of transmission events, the timing of changes in population growth rates, and the pattern and rate of virus movement between geographic regions, epidemiological risk groups, individuals and even tissues within an individual (reviewed in Pybus and Rambaut 2009). This is highly pertinent given that ecological and immunological rather than genetic factors are thought to be the main determinants of viral emergence (Holmes 2006).
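The link between tree shape and population size can be made concrete with a minimal simulation of the simplest coalescent, assuming a constant haploid population of size N with time measured in generations (the scaling is an assumption of this sketch, not a statement about any particular virus). While k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/(2N), so the expected time to the most recent common ancestor of n samples is 2N(1 - 1/n).

```python
import random

def expected_tmrca(n_samples, pop_size):
    """Expected time to the most recent common ancestor, in generations."""
    return 2.0 * pop_size * (1.0 - 1.0 / n_samples)

def simulate_coalescent_times(n_samples, pop_size, rng):
    """Draw one set of coalescent event times (generations before present)."""
    times = []
    now = 0.0
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / (2.0 * pop_size)  # any of k(k-1)/2 pairs may merge
        now += rng.expovariate(rate)
        times.append(now)
    return times

# Illustrative values only: 10 sampled sequences, population size 1000.
rng = random.Random(1)
replicates = [simulate_coalescent_times(10, 1000, rng)[-1] for _ in range(5000)]
mean_tmrca = sum(replicates) / len(replicates)
print(round(expected_tmrca(10, 1000)), round(mean_tmrca))
```

Larger populations produce longer waiting times between coalescent events, and it is this dependence that allows population size to be estimated from the branching times of a genealogy.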

One of the potential pitfalls of this approach is that inferences are based on estimated genealogies that have been derived with a level of uncertainty, as the reconstructed genealogy is in fact only one of many that can be derived from the data. While it may be the best estimate, the true genealogy is rarely, if ever, known with absolute certainty. One solution is to account for this uncertainty by using probabilistic models to estimate parameters over many plausible genealogies, thereby providing a more rigorous statistical framework. The most commonly used model is the Bayesian skyline plot (Drummond et al. 2005) incorporated into the BEAST software package (Drummond and Rambaut 2007). This approach uses a Markov chain Monte Carlo (MCMC) sampling procedure to derive a distribution of trees from which a distribution of population size estimates is determined at intervals going back to the most recent common ancestor of the gene sequences (Drummond and Rambaut 2007; Drummond et al. 2002). The result is a plot of the estimated effective population size over time with credibility intervals that represent both phylogenetic and coalescent uncertainty (see Boxes 7.1 and 7.2). BEAST also jointly estimates substitution rates and divergence times (i.e. times to the most recent common ancestors of individual lineages and the genealogy as a whole) with credibility intervals and provides the option of using relaxed molecular clock models that allow for substitution rate variation across lineages in a tree (i.e. it does not assume a strict molecular clock) (Drummond et al. 2006). Several models that assume a particular pattern of population growth (e.g. exponential growth, constant population size) are also available for comparison (Drummond and Rambaut 2007). Boxes 7.1 and 7.2 describe results from two studies in which the demographic histories of dengue viruses were reconstructed from molecular sequences using the skyline plot in BEAST (Bennett et al. 2010; Carrington et al. 2005).
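The intuition behind a skyline plot can be sketched with the simpler, non-Bayesian 'classic' skyline estimator applied to a single fixed genealogy. This is a deliberate simplification and not what BEAST does: the Bayesian skyline groups intervals and averages over many genealogies sampled by MCMC. Here, each interval during which k lineages coexist, of duration t, yields the estimate t × k(k-1)/2, in the same units as the branch times. The coalescent times used are invented and assume all sequences were sampled at the same time.

```python
def classic_skyline(coalescent_times):
    """Piecewise population size estimates from one genealogy.

    coalescent_times: increasing times before present of the n-1
    coalescent events in a tree whose n tips were all sampled at time 0.
    Returns a list of (interval_start, interval_end, estimate) tuples.
    """
    n = len(coalescent_times) + 1
    estimates = []
    previous = 0.0
    for index, t in enumerate(coalescent_times):
        k = n - index  # lineages present during this interval
        duration = t - previous
        estimates.append((previous, t, duration * k * (k - 1) / 2))
        previous = t
    return estimates

# Hypothetical coalescent times (e.g. in years) for a four-tip tree.
for start, end, size in classic_skyline([1.0, 3.0, 9.0]):
    print(f"{start:4.1f} to {end:4.1f}: {size:.1f}")
```

In this toy tree every interval gives the same estimate, the signature of a constant population. Intervals that are short relative to the number of lineages present would instead indicate a smaller population at that time.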

The BEAST programme was also recently extended to allow for inference, visualisation and hypothesis testing of phylogeographic history (Lemey et al. 2009). In the first model implemented, the geographic locations from which sequences were derived are considered as discrete states (Lemey et al. 2009). The spatial diffusion of the virus is then reconstructed using the coalescent approach to infer when and where direct ancestors of the sampled sequences existed. Different scenarios and models of spatial diffusion can be investigated and compared by specifying different prior distributions for the diffusion rates amongst the sampling locations (Lemey et al. 2009; Auguste et al. 2010; Talbi et al. 2010; Allicock et al. 2012). Phylogeographic inferences may be summarised using virtual globe software (Google Earth) such that spread over time may be visualised as an interactive animation. Examples of virtual globe projections demonstrating the diffusion dynamics through time are available online at http://www.phylogeography.org.
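Treating locations as discrete states amounts to modelling movement along each branch as a continuous-time Markov chain. A minimal sketch follows, assuming the simplest special case in which every pair of K locations exchanges lineages at the same rate r (the published model allows pair-specific rates and averages over trees, which this sketch does not). Under that assumption the probability of ending a branch of length t in the starting location is 1/K + ((K-1)/K)exp(-Krt). The location names, rate and branch lengths are illustrative values only.

```python
import math

def transition_prob(same, n_locations, rate, time):
    """Probability of ending in the same (or one given other) location."""
    decay = math.exp(-n_locations * rate * time)
    if same:
        return 1.0 / n_locations + (n_locations - 1) / n_locations * decay
    return 1.0 / n_locations - decay / n_locations

def root_location_posterior(locations, tip_a, tip_b, rate, t_a, t_b):
    """Posterior over the ancestor's location for a two-tip tree.

    Assumes a uniform prior on the ancestral location, which cancels
    when the weights are normalised.
    """
    k = len(locations)
    weights = {}
    for root in locations:
        p_a = transition_prob(root == tip_a, k, rate, t_a)
        p_b = transition_prob(root == tip_b, k, rate, t_b)
        weights[root] = p_a * p_b
    total = sum(weights.values())
    return {loc: w / total for loc, w in weights.items()}

# Hypothetical example: two sequences sampled in regions A and B.
posterior = root_location_posterior(["A", "B", "C"], "A", "B",
                                    rate=0.1, t_a=1.0, t_b=5.0)
print({loc: round(p, 3) for loc, p in posterior.items()})
```

The tip on the shorter branch carries more information about the ancestor, so region A receives the highest posterior weight here. As branches lengthen, the posterior flattens towards the uniform prior.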

The above-mentioned discrete model, however, requires the assumption that at any point along the phylogeny, the samples existed in one of the sampled locations. To address this limitation, a more realistic ‘continuous trait’ model that allows for diffusion over a continuous landscape was recently implemented (Lemey et al. 2010). Box 7.3 illustrates the spatial spread of rabies virus amongst raccoons in North America reconstructed using this model (Lemey et al. 2010).

In addition to those shown in Boxes 7.1–7.3, there are numerous other examples where a ‘phylodynamic’ approach to viral evolutionary analysis has been successfully applied. They include, for example, the reconstruction of the origin and global dissemination of HIV-1 (Gilbert et al. 2007; Korber et al. 2000; Vidal et al. 2000; Zhu et al. 1998), reconstruction of the spread of rabies virus in North Africa with an investigation of factors underlying the patterns observed (Talbi et al. 2010), inference of YFV and DENV spatial diffusion in the Americas (Auguste et al. 2010; Allicock et al. 2012), investigation of the mechanism by which YFV is maintained between epidemics (Auguste et al. 2010) and elucidation of the role of natural selection and global migration in influenza A epidemic patterns (Nelson et al. 2007; Rambaut et al. 2008; Russell et al. 2008). Although this approach cannot replace good epidemiological data, it complements traditional epidemiological approaches and provides insights into the evolutionary dynamics underlying epidemic behaviour. The ability of the approach to recover information not available in census data may be particularly useful in regions that have flawed monitoring and surveillance systems, such as in the developing world. For example, in the analysis described in Box 7.3, the geographic area where the raccoon rabies virus is estimated to have spread by 1973 includes the location where the first raccoon rabies case was reported in 1977, even though the data did not include a sequence for this case (Lemey et al. 2010).

Antiviral Therapies, Prognostic Markers and Vaccines

The rapid viral evolution that facilitates species jumps and emergence also underlies viruses’ ability to escape our immune systems and often presents a challenge in terms of developing effective vaccines and antiviral therapies. As described above, analysis of genomic data can provide valuable insights into virus evolution and epidemiology. The plethora of genomic sequence data being generated and the availability of rapid high-throughput sequencing technology therefore represent valuable resources that have already impacted on the way vaccine and therapeutic development is approached. In particular, they present the opportunity to understand the scope and distribution of the genomic diversity that must be tackled for a given virus and facilitate monitoring of the spatiotemporal dynamics of this biodiversity, thereby underpinning reverse vaccinology, pan-genomic and comparative genomic approaches to identifying vaccine and/or drug targets (Seib et al. 2009) (see Box 7.4). Genomic approaches are also expected to accelerate the identification of genetic and other molecular markers of prognostic and therapeutic relevance, such as markers of disease severity and drug resistance. Access to genomic data also enables researchers to go beyond genomics to transcriptomics, proteomics and other ‘omics’ approaches to studying emerging viruses.

However, despite the immense potential, with a few exceptions such as the use of pyrosequencing to screen for mutations associated with antiviral resistance in influenza (Deyde et al. 2009; Deyde et al. 2010; Bright et al. 2005; Deng et al. 2011; Dharan et al. 2009; Duwe and Schweiger 2008; Hurt et al. 2009; Lackenby et al. 2008) and resources such as the Stanford HIV drug resistance database (http://hivdb.stanford.edu/index.html), genomic developments of direct relevance to clinical care have been slow in coming (Holmes 2009). This is likely to be a consequence of the fact that genomic data are not often associated with data on clinical manifestations and host immunological responses that would enable them to be fully exploited (Holmes 2009). Notable exceptions are the aforementioned Stanford HIV drug resistance database and the Los Alamos HIV databases (http://www.hiv.lanl.gov/content/index), and more recently, large-scale whole genome sequencing projects such as the Broad Institute’s Genome Resources in Dengue Consortium (GRID) project (http://www.broadinstitute.org/annotation/viral/Dengue/) and the influenza genome sequencing projects (IGSP) by The Institute for Genomic Research (TIGR) (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html) have sought to incorporate these and other metadata.

The Broad dengue sequencing initiative, for example, aims to sequence over 3,500 dengue genomes tagged with information on geographic origin and disease severity (i.e. whether the disease outcome is dengue fever (DF) or the more severe, life-threatening dengue haemorrhagic fever (DHF) and dengue shock syndrome (DSS)) in an attempt to determine the impact of introduced strains versus indigenous evolution on disease outcomes, understand genomic correlates of disease severity and provide a map of genomic distributions with reference to DF, DHF and DSS (http://www.broadinstitute.org/annotation/viral/Dengue/projects.html). Dengue sequence diversity within individual patients with well-characterised disease outcomes, and for whom time courses for viraemia and status as primary or secondary infections are available, will also be investigated, in order to determine how intra-host diversity drives viraemia and disease and how it correlates with disease severity and primary versus secondary infection.

At the time of writing, the GRID project, which was initiated in 2005, had sequenced 2,372 dengue genomes, IGSP (launched in the same year) had generated over 3,400 of approximately 7,400 planned genome sequences and there were tens of thousands of HIV sequences available in the Los Alamos database, of which 2,788 were HIV-1 complete genomes. The current and potential impact of these and other dengue, HIV and influenza sequencing initiatives is well reviewed in Holmes 2009. For dengue, in addition to the previously detailed insights into evolution and epidemiology, analyses suggest that some genotypes differ in virulence and/or fitness (Armstrong and Rico-Hesse 2001; Bennett et al. 2003; Cologna et al. 2005; Cologna and Rico-Hesse 2003; Klungthong et al. 2004; Leitmeyer et al. 1999; Rico-Hesse et al. 1997; Sittisombut et al. 1997; Thu et al. 2004; Wittke et al. 2002; Zhang et al. 2005) and that immune-mediated natural selection may determine which genotypes survive (Adams et al. 2006). Thus the fitness of a given genotype may vary with the changing immunological landscape, which has major implications for vaccine development since tetravalent vaccines designed to induce immunity to all four DENV serotypes are unlikely to provide complete cross-protection (Whitehead et al. 2007). For influenza virus, analysis of IGSP data has already altered basic concepts of influenza virus evolution, shed light on the evolution of drug resistance, identified important source and sink populations and provided data on genomic diversity that will improve and accelerate the process of choosing which strains to incorporate into annual vaccines (reviewed in Holmes 2009). HIV is perhaps the greatest disappointment in terms of our inability to arrive at a vaccine despite a wealth of genomic data on the virus. In this regard, the major lesson learned from viral genomics is that HIV is immensely diverse both within and between individual hosts (Rambaut et al. 2004) and vaccines are likely to have to be location/population specific and require regular updating (Holmes 2009).

Conclusions

The ability to generate viral genomes increasingly rapidly and cheaply and the development of bioinformatic tools for analysing these data have transformed the study of emerging viruses. Metagenomic sequencing and evolutionary analyses will soon become routine diagnostic and surveillance tools, allowing us to detect and visualise viral emergence and spatiotemporal dynamics in real time. In addition to enabling rapid responses in terms of development of pathogen-specific screening tests, identification of source populations and disease tracking, this will facilitate generation of hypotheses about evolutionary mechanisms and ecological factors underlying the patterns observed. However, despite the immense potential, addressing prevention and control issues of more direct clinical relevance such as the development of vaccines and therapeutics will only be possible if genomic data are accompanied by relevant clinical, immunological, phenotypic, host genomic and epidemiological data, with biological measures from in vitro and in vivo experimental studies incorporated as they arise. The development and maintenance of widely accessible and flexible genomic databases are therefore key in this regard. Furthermore, if we are to avoid the limitations of past efforts, it is essential that data from across the clinical spectrum be included so that the all too common bias towards symptomatic and/or severe cases is avoided. An ideal database would also include viral genomic and corresponding metadata from animal reservoir and/or vector populations, particularly if our goal is to predict future viral emergence. In addition to traditional sources, these data might be derived from programmes and early warning systems such as the Global Viral Forecasting Initiative (http://www.gvfi.org/), USAID PREDICT (http://www.vetmed.ucdavis.edu/ohi/predict/index.cfm) and the WHO/FAO/OIE Global Early Warning and Response System (GLEWS; http://www.glews.net/), which focus on identification and control of potentially emergent pathogens through surveillance at the animal–human interface.

This is a tall order in terms of the level of coordination and collaboration required to bring all of these data together: public health practitioners, field epidemiologists, clinicians, veterinarians and researchers would all have to work together. More important, however, is the computational challenge. There is no shortage of good ideas, but many of the analyses involved are very computationally intensive, and this is already a limiting factor. Bioinformatic and computational tools will therefore have to evolve further to handle the volumes of genomic data and associated metadata generated.

Given our level of globalisation and population mobility (which is only going to increase), it is also essential that all affected geographic regions and populations be represented in these efforts. In addition to providing a complete picture of viral biodiversity against the full span of existing host genomic backgrounds, this will ensure that needs are addressed where the burden of disease is often greatest. It will also reduce the number of surveillance and control ‘blind spots’ where viruses might take refuge and eventually re-emerge. It is therefore essential that developing countries be fully integrated into the genomic age, through collaboration, technology transfer and in-country capacity building. The availability of open source databases, computational tools and scientific literature will also go a long way in this regard.