
Viruses are notorious for infecting all forms of life, from bacteria to chordates. In humans, viruses cause infectious diseases such as influenza, hepatitis, AIDS, diarrhoea, encephalitis, dengue fever and, more recently, severe acute respiratory syndrome (SARS), Ebola (Singh et al. 2017a) and Zika (Singh et al. 2017b). Despite the availability of vaccines and treatments for many of these diseases, viral infections continue to cause morbidity and mortality. Viral diseases of animals not only affect production but also pose a threat to humans (Saminathan et al. 2016). Sequencing methods have rapidly become more available, and a vast amount of viral sequence data has been generated in recent times. It is therefore imperative to decipher these data using advanced tools such as bioinformatics resources. A large number of bioinformatics tools that aid the analysis of viral genomes and the development of preventive and therapeutic strategies have been developed for human as well as animal viruses. This chapter introduces virologists to some of the common as well as virus-specific bioinformatics tools that researchers can use to analyse viral sequence data and to elucidate viral dynamics, evolution and preventive therapeutics.

1 Applications of Bioinformatics in Virology

Analysis of viral sequences involves tools that can be applied to any novel sequence, for example, for gene identification, ORF identification, functional annotation and phylogeny. However, because of their small genome size, viruses use complex strategies to maximize the coding potential of their genomes. Many viruses utilize overlapping reading frames or translational frameshifts to code for multiple proteins from a limited genome sequence. In addition, high rates of mutation and of recombination between related viruses make accurate phylogenetic and evolutionary analysis difficult with general-purpose software. The volume and diversity of viral sequences in the databases have grown enormously in recent years. It has therefore become imperative to organize these sequence data in virus family-specific resources tailored for accurate analysis of a specific virus.

1.1 Phylogeny and Molecular Epidemiology

One of the most common applications of bioinformatics in virology is phylogenetic analysis of viral isolates to aid the epidemiological analysis of viral outbreaks. General-purpose phylogeny programs such as PHYLIP (Felsenstein 1989) have been used extensively for the phylogeny and molecular epidemiology of viruses. A comprehensive list of these packages and web servers is maintained by Joe Felsenstein at http://evolution.genetics.washington.edu/phylip/software.html.
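Many phylogeny packages start from pairwise distances between aligned sequences. The following Python sketch illustrates that first step only, assuming the isolates have already been aligned to equal length. It computes the proportion of differing sites and applies the Jukes-Cantor correction; the function names are illustrative and are not taken from PHYLIP or any other package.

```python
import math
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance; returned as infinity when p >= 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Corrected distances for every pair of named, aligned sequences."""
    return {
        (m, n): jukes_cantor(p_distance(aligned[m], aligned[n]))
        for m, n in combinations(sorted(aligned), 2)
    }
```

A distance matrix of this kind could then be passed to a tree-building method such as neighbour joining in a dedicated package.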

1.2 ORF/Gene Discovery

An open reading frame (ORF) is a part of the genome that can be translated into a protein. Finding ORFs is one of the key steps in viral genome analysis. It forms the basis for further analyses such as homology searches, protein prediction, functional analysis, and discovery of viral vaccine and antiviral targets. If an ORF encodes a surface protein that is unique to the virus, the protein may elicit immune responses and could be a vaccine candidate. ORF Finder by NCBI is an ORF prediction program (Rombel et al. 2002). The program outputs the range of each ORF along with its protein translation in the six possible reading frames of the input DNA sequence. It can be used to search newly sequenced DNA for potential protein-encoding sequences and to verify predicted proteins using SMART BLAST or BLASTP (Altschul et al. 1990). However, the web version of the program is limited to a query sequence length of 50 kb. A standalone version has no length limitation but is available only for 64-bit Linux. NEG8, a 167-codon novel ORF in segment 8 of influenza virus, was visualized using ORF Finder (Clifford et al. 2009). Using ORF Finder together with the basic local alignment search tool (BLAST), 154 ORFs were found in the Hz-1 virus genome (Cheng et al. 2002). Because of their small genome size, viruses employ multiple strategies to maximize coding potential, including frameshifts and alternative codon usage. Virus-specific programs have therefore been developed to overcome these challenges. GeneMark (http://opal.biology.gatech.edu/GeneMark/genemarks.cgi) provides gene prediction tools for viruses (Besemer and Borodovsky 2005). Viral Genome Organizer (VGO), a Java-based web tool, identifies genes and ORFs in viral sequences (Upton et al. 2000).
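The six-frame scan described above can be illustrated with a minimal Python sketch. It is not the ORF Finder implementation: it assumes the standard stop codons, requires an ATG start, ignores ambiguous bases, and does not handle frameshifts or alternative start codons. The names `Orf` and `find_orfs` and the `min_codons` threshold are illustrative.

```python
from typing import NamedTuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


class Orf(NamedTuple):
    strand: str  # "+" for the input strand, "-" for its reverse complement
    frame: int   # 0, 1 or 2
    start: int   # 0-based offset of ATG on the scanned strand
    end: int     # exclusive offset just past the stop codon


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> list[Orf]:
    """Scan all six reading frames for ATG...stop ORFs.

    Coordinates on the "-" strand refer to the reverse-complemented
    sequence, not to the input sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append(Orf(strand, frame, start, i + 3))
                    start = None
    return orfs
```

In practice the threshold would be tuned to the virus under study, and candidate ORFs would be checked by similarity searches as described above.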

1.3 Epitope Recognition

Identification of immune epitopes is important in designing new vaccine candidates and in diagnostics. An epitope is the part of an antigen that is recognized by the receptors of immune system components such as antibodies, B cells or T cells. Epitopes are generally classified as either linear or conformational. T cells recognize linear epitopes, short continuous strings of amino acids derived from a protein antigen and presented by MHC molecules. B cells and antibodies, on the other hand, recognize conformational epitopes, which are formed by amino acids from multiple discontinuous segments that come together in the three-dimensional structure of the antigen (Barlow et al. 1986). Owing to the simple linear structure of T cell epitopes, their interaction with receptors can be modelled with high accuracy (DeLisi and Berzofsky 1985). A large number of databases and servers are therefore available for linear epitope prediction. MHCPEP (Brusic et al. 1998), SYFPEITHI (Rammensee et al. 1999), FIMM (Schonbach et al. 2005), MHCBN (Bhasin et al. 2003) and EPIMHC (Reche et al. 2005) are some of the commonly used T cell epitope resources. The Immune Epitope Database and Analysis Resource (IEDB; https://www.iedb.org) (Vita et al. 2015) offers the most comprehensive set of tools for epitope analysis and prediction, covering HLA-A and HLA-B for humans as well as chimpanzee, macaque, gorilla, cow, pig and mouse. It is one of the few databases that cover such a variety of organisms. Since 2011, IEDB has used NetMHCpan as its prediction method. The NetMHC server uses an artificial neural network method to predict binding of peptides to different alleles from human as well as 41 animals including cattle and pig (38 from core). The database also contains curated data for many viruses, including influenza and herpesviruses. Interactions between B cell receptors and epitopes are more complex than those of the linear T cell epitopes; thus, the accuracy of B cell epitope prediction is relatively low.
Furthermore, most of the current databases are centred on linear rather than conformational epitopes. Bcipep is a tool developed for predicting linear B cell epitopes (Saha et al. 2005). Epitome is a database of structure-inferred antigenic residues in proteins (Schlessinger et al. 2006). Epitome is especially useful in the prediction of antibody-antigen complex interactions. The database is available at http://www.rostlab.org/services/epitome/. AntiJen is a detailed database with entries on both T cell and B cell epitopes. It emphasizes the integration of kinetic, thermodynamic, functional and cellular data within the context of immunology and vaccinology (Toseland et al. 2005) (Fig. 23.1a).
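Because linear T cell epitopes are short continuous peptides, prediction servers typically score every overlapping window of a protein. The Python sketch below shows only that enumeration step, assuming a window length of nine residues (a common choice for MHC class I ligands, stated here as an assumption). It does not score binding; the function name is illustrative and no IEDB or NetMHC interface is used.

```python
def candidate_peptides(protein: str, length: int = 9) -> list[tuple[int, str]]:
    """Return (1-based start position, peptide) for every overlapping window."""
    if length <= 0:
        raise ValueError("length must be positive")
    protein = protein.upper()
    return [
        (i + 1, protein[i:i + length])
        for i in range(len(protein) - length + 1)
    ]
```

Each peptide in the returned list could then be submitted to a binding predictor of the user's choice for the alleles of interest.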

Fig. 23.1a The online tool PredictProtein predicts various secondary structures in a given viral protein. The amino acid sequence of the viral protein must be supplied in FASTA format

1.4 Structural Modelling

Three-dimensional structure prediction of viral proteins can be used to relate protein structure to antigenic sites, folding surfaces and functional motifs. Such structural modelling tools may be applied to identify and design novel candidates for antiviral inhibitors and vaccine targets. Secondary structures may be predicted using the tool PredictProtein (http://www.predictprotein.org/) (Rost et al. 2004). Along with secondary structures, this online tool predicts solvent accessibility and possible transmembrane helices. It also reports the expected accuracy of its prediction methods. SWISS-MODEL (http://swissmodel.expasy.org/) is a popular tool for predicting the 3-D structure of a protein. 3-D structure prediction programs usually employ homology searching, using similar and known protein structures as templates. One of the most commonly used databases for such templates is the Protein Data Bank (PDB) (Reddy et al. 2001). Output from the SWISS-MODEL program includes the template selected, the alignment between the query sequence and the template, and the predicted 3-D model. Results of SWISS-MODEL are, however, only sent by email (Figs. 23.1b, 23.1c, 23.1d and 23.1e).

Fig. 23.1b Prediction of various secondary structures (helical) in a given viral protein using the online tool PredictProtein

Fig. 23.1c Prediction of various secondary structures (strand) in a given viral protein using the online tool PredictProtein

Fig. 23.1d Prediction of various secondary structures (helical transmembrane region) in a given viral protein using the online tool PredictProtein

Fig. 23.1e Prediction of various secondary structures (buried sequence motifs) in a given viral protein using the online tool PredictProtein

2 Virus-Centred Bioinformatics Tools

For a long time, bioinformatic analysis of viruses relied on common bioinformatics tools developed for other organisms. However, analysing viral genomes with general-purpose tools can compromise the accuracy and sensitivity of the analysis. Virus genomes are often too small (e.g. < 10 kb) to yield reliable codon usage statistics. To maximize their coding potential, viruses use unusual codon usage patterns and overlapping coding and non-coding functional elements. Viruses also rely on other translational mechanisms such as stop codon read-through, frameshifting, leaky scanning and internal ribosome entry sites. Comparative genomic analysis of viruses is complicated by the fact that highly conserved sequences may not code for anything. Where CDSs overlap with each other or with non-coding functional elements, the overlap may be indicated by unusually high sequence conservation. Novel virus types may contain new CDSs that differ from previously known CDSs. Multiple databases and tools are available for the analysis of human viruses; however, only a limited number of resources are designed specifically for veterinary viruses. In this section, some of the databases and resources useful for the analysis of veterinary viruses are discussed (Table 23.1).
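The point about codon usage statistics can be made concrete with a short Python sketch. A 10 kb genome contains at most about 3,333 codons in any one reading frame, so each of the 64 codons is observed only about 52 times on average, and far fewer times within a single gene. The helper names below are illustrative and assume an in-frame coding sequence as input.

```python
from collections import Counter


def codon_counts(cds: str) -> Counter:
    """Count codons in an in-frame coding sequence.

    Trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def relative_frequencies(counts: Counter) -> dict[str, float]:
    """Convert raw codon counts into fractions of all observed codons."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}
```

With so few observations per codon, frequency estimates for a single small viral gene carry wide uncertainty, which is one reason gene finders trained on larger genomes may perform poorly on viruses.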

Table 23.1 Virus-specific bioinformatics tools

2.1 Comparative and Diversity Analysis of Viral Sequences

Viruses are among the most diverse and dynamic microorganisms. As viral genome sequencing increased, bioinformatics tools were needed to compare and analyse the voluminous data. One downloadable software package that meets this requirement is Base-By-Base, which aids the analysis of whole viral genome alignments at the single-nucleotide level (Brodie et al. 2004). With the online resource Genome Information Broker for Viruses (GIB-V), comparative studies can be made using generic tools such as ClustalW, BLAST and keyword search (Hirahata et al. 2007). Another downloadable web server tool, ViroBLAST, is a dedicated BLAST tool that can be used for queries against multiple databases (Deng et al. 2007). Sequences from a variety of viral strains can be analysed simultaneously using the Alvira software, a multiple sequence alignment tool that also provides a graphical representation (Enault et al. 2007). Furthermore, comparative analysis of coronavirus genes and genomes can be carried out using CoVDB (coronavirus database) (Huang et al. 2008).

The digital resource ViralZone is designed specifically to help users understand viral diversity and to provide information on viral molecular biology, hosts, taxonomy, epidemiology and structures (Hulo et al. 2011). The Simmonics program was upgraded to the simple sequence editor (SSE) software package, in which user-supplied sequences can be aligned, annotated and analysed for diversity and phylogeny (Simmonds 2012). Evolutionary changes in the viral genome lead to polymorphisms in viral proteins, which in turn result in changes in viral phenotype such as virulence and virus-host interactions. The digital database ViralORFeome not only stores variants and mutants of viral ORFs but also provides tools to design ORF-specific cloning primers (Pellet et al. 2010). Further, degenerate primer pairs can be selected and matched to amplify user-defined viral genomes using the online tool PriSM (Yu et al. 2011). Recent advances in next-generation sequencing technologies have made it possible to study viral populations at an advanced level. Viral population biodiversity and dynamics can be studied using the first such tool developed, PHACCS (Phage Communities from Contig Spectrum), which analyses shotgun sequence data to estimate the structure and diversity of phage communities (Angly et al. 2005). Later, more tools and resources were developed to analyse viral metagenomic sequences, such as Viral Informatics Resource for Metagenomic Exploration (VIROME), Viral MetaGenome Annotation Pipeline (VMGAP) and Metavir (Lorenzi et al. 2011; Roux et al. 2011; Wommack et al. 2012). Novel viruses can be identified from a pool of specimen types using a dedicated computational pipeline, VirusHunter (Zhao et al. 2013).
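A degenerate primer, as mentioned above, is a compact way of writing a set of related primer sequences using IUPAC ambiguity codes. The Python sketch below expands a degenerate primer into its concrete sequences and reports its degeneracy. It illustrates the concept only and makes no claim about how PriSM selects primers; the function names are illustrative.

```python
import math
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct sequences encoded by a degenerate primer."""
    return math.prod(len(IUPAC[base]) for base in primer.upper())


def expand(primer: str) -> list[str]:
    """List every concrete sequence encoded by a degenerate primer."""
    options = (IUPAC[base] for base in primer.upper())
    return ["".join(bases) for bases in product(*options)]
```

Degeneracy grows multiplicatively with each ambiguous position, so primers for highly variable viral genomes are usually placed in the most conserved regions available.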

2.2 Viral Recombination and Integration-Specific Resources

Genetic recombination in viruses is responsible for the emergence of new viruses, increased virulence and host range, immune evasion and the development of antiviral resistance. Viral recombination can be detected by two bioinformatics tools, jpHMM (Jumping Profile Hidden Markov Model) and ViReMa (Virus Recombination Mapper) (Schultz et al. 2009; Routh and Johnson 2014). jpHMM, a web server, can be used for predicting recombination in HIV-1 and HBV, whereas ViReMa, a downloadable program, can be used to analyse next-generation sequencing data. Another program, VIPR HMM (Viral Identification with a PRobabilistic algorithm incorporating hidden Markov model), can detect recombinant and non-recombinant viruses using microbial detection microarrays (Allred et al. 2012). Further, viral genome sequences can be searched for degenerate locus of recombination (lox)-like sites by a web server called SeLOX (Surendranath et al. 2010). The downloadable program VIRAPOPS is a forward simulator of RNA virus populations (Petitjean and Vanet 2014). With this program, features of rapidly evolving RNA viruses such as mutability, recombination, variation and covariation can be simulated to predict their effects on viral populations. SeqMap is a tool that identifies viral integration sites (VIS) from ligation-mediated PCR (LM-PCR), linear amplification-mediated PCR (LAM-PCR) and nonrestrictive LAM-PCR (nrLAM-PCR) reactions and maps short sequences to the genome (Hawkins et al. 2011). VIS can also be detected by three more tools, VirusSeq, ViralFusionSeq and VirusFinder (Chen et al. 2013; Li et al. 2013; Wang et al. 2013). For more precise VIS prediction, virologists can use all four tools together.
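The idea behind recombination detection can be illustrated with a deliberately simple sliding-window comparison in Python. This is not the algorithm used by jpHMM or ViReMa, which rely on hidden Markov models and read mapping respectively. The sketch assumes that the query and the candidate parental sequences are already aligned to the same length, and all names are illustrative.

```python
def window_identity(query: str, ref: str, start: int, size: int) -> float:
    """Fraction of identical positions between query and ref in one window."""
    q = query[start:start + size]
    r = ref[start:start + size]
    return sum(a == b for a, b in zip(q, r)) / size


def closest_parent_by_window(
    query: str, parents: dict[str, str], size: int, step: int
) -> list[tuple[int, str, float]]:
    """For each window, report (start, best-matching parent, identity).

    Ties are resolved in favour of the parent listed first.
    """
    n = len(query)
    if any(len(seq) != n for seq in parents.values()):
        raise ValueError("all sequences must be aligned to the same length")
    result = []
    for start in range(0, n - size + 1, step):
        scores = {
            name: window_identity(query, seq, start, size)
            for name, seq in parents.items()
        }
        best = max(scores, key=scores.get)
        result.append((start, best, scores[best]))
    return result
```

A switch in the best-matching parent between adjacent windows suggests a candidate breakpoint, which would then need confirmation with a dedicated statistical method.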

2.3 Small-RNA Analysis Tools

miRNAs: A microRNA (miRNA) is a small, regulatory, non-coding RNA molecule that regulates the translation or stability of viral and host target mRNAs, thereby affecting viral pathogenesis. This host-viral regulatory relationship can be investigated by a database called ViTa, capable of curating known viral miRNA genes and known/putative target sites of host miRNA (Hsu et al. 2007). ViTa exploits miRanda and TargetScan to scan viral genomes and determine miRNA targets. ViTa is also capable of annotating the viruses, virus-infected tissues and tissue specificity of host miRNAs. Subtypes of viruses, for example, influenza viruses, and the conserved regions in various viruses can also be compared using the ViTa database. Viral miRNA candidate hairpins can be predicted using the database Vir-Mir. It serves as a platform to query the predicted viral miRNA hairpins (based on taxonomic classification) and host target genes (based on the use of the RNAhybrid program) in human, mouse, rat, zebrafish, rice and Arabidopsis (Li et al. 2008).

siRNA: A small interfering RNA (siRNA) is similar to an miRNA and operates within the RNA interference (RNAi) pathway. It interferes with the expression of specific genes and is therefore used in post-transcriptional gene silencing. VIRsiRNAdb is an online curated repository of experimentally validated siRNA and short hairpin RNA (shRNA) data targeting diverse genes of 42 important human viruses, including influenza virus (Tyagi et al. 2011; Thakur et al. 2012). The current database includes experimental information on siRNA sequence, virus subtype, target gene, GenBank accession, design algorithm, cell type, test object, method, efficacy, etc. The web-based program siVirus is an antiviral siRNA design tool that supports influenza virus, HIV-1, HCV and SARS coronavirus (Naito et al. 2006). Further, viral siRNA sequence data sets can be analysed using the programs Visitor and VIROME (Antoniewski 2011; Watson et al. 2013). A Perl script called Paparazzi enables reconstitution of a viral genome from the viral siRNAs in a given sample (Vodovar et al. 2011).
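One consideration in antiviral siRNA design is that the target site should be conserved across strains of the virus. The Python sketch below finds subsequences of a fixed length that occur in every supplied strain and filters them by GC content. The 21-nucleotide length and the GC bounds of 30% to 52% are assumptions chosen for illustration; the sketch does not reproduce the selection rules of siVirus or any other tool.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct subsequences of length k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def conserved_targets(
    strains: list[str], k: int = 21, gc_min: float = 0.30, gc_max: float = 0.52
) -> list[str]:
    """k-mers shared by every strain whose GC content lies within bounds."""
    if not strains:
        return []
    shared = set.intersection(*(kmers(s, k) for s in strains))
    return sorted(t for t in shared if gc_min <= gc_fraction(t) <= gc_max)
```

Candidates returned by such a filter would still need to be checked for off-target matches in the host transcriptome and validated experimentally.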

2.4 Virus-Host Interaction and Miscellaneous Software

Host-pathogen interactions play an important role in determining the pathogenicity of a pathogen and the immune evasion mechanisms it uses against the host. Various databases and programs are available to study such interactions between viral and host cellular proteins. One such database is PhEVER, which allows users to explore virus-virus and virus-host lateral gene transfers by providing evolutionary and phylogenetic information (Palmeira et al. 2011). This database catalogues homologous families between different viral sequences and between viral and host sequences. It compiles extensive data from completely sequenced genomes (2426 non-redundant viral genomes, 1007 non-redundant prokaryotic genomes and 43 eukaryotic genomes ranging from plants to vertebrates). Proteins are grouped into homologous families that each contain at least one viral sequence, and related alignments and phylogenies are provided for each family.

With increasing availability of viral genome sequences, data mining, curation and genome annotation have become essential components to better comprehend the structure and function of genome components. This information can further be exploited to develop diagnostics, vaccines and therapeutics.

There are a number of tools available capable of annotation and classification of viral sequences, such as NCBI genotyping tool (Rozanov et al. 2004), VIGOR (Viral Genome ORF Reader) (Wang et al. 2010), Viral Genome Organizer (VGO) (Upton et al. 2000), Genome Annotation Transfer Utility (GATU) (Tcherepanov et al. 2006), Virus Genotyping Tools (Alcantara et al. 2009), ZCURVE_V (Guo and Zhang 2006) and STAR (Subtype Analyser) (Myers et al. 2005).

VGO is a web-based genome browser for viewing and predicting genes and ORFs in one or more viral genomes. It also supports searches within viral genomes and retrieval of information about a genome, such as the locations of genes, ORFs and start/stop codons. Within a genome, sequences can be searched by regular expression, fuzzy motif pattern, genes with the highest AT composition, etc. VGO supports comparative analyses between different viral genomes. It uses a graphical user interface (GUI) for constructing alignments and displaying orthologues in a set of genomes. It also allows the translated genome to be searched for matches to mass spectrometry peptides.
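A regular-expression search of a genome, as described for VGO, can be sketched in Python as follows. The sketch is a generic illustration and not VGO code: it scans both strands, reports overlapping hits, and gives positions relative to the strand that was scanned. The function name and the example motif are illustrative.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def search_motif(genome: str, pattern: str) -> list[tuple[str, int, str]]:
    """Find a regex motif on both strands.

    Returns (strand, 0-based start on the scanned strand, matched text).
    A lookahead is used so that overlapping matches are also reported.
    """
    genome = genome.upper()
    hits = []
    lookahead = re.compile(f"(?=({pattern}))")
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for match in lookahead.finditer(seq):
            hits.append((strand, match.start(), match.group(1)))
    return hits
```

For example, `search_motif(genome, "TATA[AT]A")` would report every occurrence of that degenerate motif on either strand.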

VIGOR is an online gene prediction tool that was developed by the J. Craig Venter Institute in 2010. It started with gene prediction in small viral genomes such as coronavirus, influenza, rhinovirus and rotavirus. With the updated version in 2012 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3394299/), VIGOR is capable of gene prediction in 12 more viruses: measles virus, mumps virus, rubella virus, respiratory syncytial virus, alphavirus, Venezuelan equine encephalitis virus, norovirus, metapneumovirus, yellow fever virus, Japanese encephalitis virus, parainfluenza virus and Sendai virus. Using sequence similarity searches, VIGOR predicts protein-coding regions, start and stop codons and complex gene features such as RNA editing, stop codon leakage and ribosomal shunting. Features such as frameshifts, overlapping genes and embedded genes can also be predicted in the virus genome. Additionally, mature peptides can be predicted within a given polypeptide open reading frame. VIGOR is also capable of genotyping influenza virus and rotavirus. VIGOR produces four output files: a gene prediction file, a complementary DNA file, an alignment file and a gene feature table file. The gene feature table can be used directly for GenBank submission.

Genome Annotation Transfer Utility (GATU) facilitates quick and efficient annotation of a target genome using similar reference genomes that have already been annotated. The user can then manually curate the annotated genome. Newly annotated genomes can be saved in GenBank, EMBL or XML format. Although it does not provide a complete annotation system, GATU is a very useful tool for preliminary work in genome annotation. GATU uses the tBLASTn and BLASTn algorithms to map genes from an annotated reference genome onto the new target genome. As a result, the majority of the new genome's genes are annotated in a single step. With GATU, users can also identify open reading frames that are present in the target genome but absent from the reference genome. These ORFs can be scrutinized further with other bioinformatics tools such as BLAST and VGO to determine whether they should be included in the annotation. Multiple-exon genes and mature peptides can also be analysed using GATU.

A primer design tool, PrimerHunter, designs highly sensitive and specific primers for virus subtyping by PCR (Duitama et al. 2009). PrimerHunter predicts specific forward and reverse primers with respect to a given set of DNA sequences. PhyloType is a web-based as well as downloadable program that uses parsimony to reconstruct ancestral traits and to select phylotypes (Chevenet et al. 2013). RotaC is an automated genotyping tool for group A rotaviruses (Maes et al. 2009). It works by comparing a complete ORF of interest to other complete ORFs of cognate genes available in the GenBank database by performing BLAST searches.

VirOligo is a database that acts as a repository of virus-specific oligonucleotides for virus detection (Onodera and Melcher 2002). The database comprises the Oligo data and Common data tables. The Oligo data table lists PCR primers and hybridization probes that are used for viral nucleic acid detection, while the Common data table contains the PCR and hybridization experimental conditions used in their detection. Each Oligo data entry provides the name of the oligonucleotide, the oligonucleotide sequence, the target region, the type of usage (PCR primer, PCR probe, hybridization or other), a note and the direction of the PCR oligonucleotide (forward or reverse). Each oligonucleotide entry also contains direct links to PubMed, GenBank, the NCBI Taxonomy database and BLAST. In the updated version of VirOligo as of September 2015, the database contains a complete listing of oligonucleotides specific to various animal viruses. The viruses are vaccinia virus; canine parvovirus; porcine parvovirus; rodent parvovirus; tobamovirus; potyvirus; borna virus; bovine herpesvirus types 1, 3, 4 and 5; bovine viral diarrhoea virus; bovine parainfluenza 3 virus; bovine respiratory syncytial virus; bovine adenovirus; bovine rhinovirus; bovine coronavirus; bovine reovirus; bovine enterovirus; foot-and-mouth disease (FMD) virus; and alcelaphine herpesvirus.
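The two-table layout described above can be pictured as a small relational model. The Python sketch below is a hypothetical reconstruction for illustration only: the field names follow the description in the text, but the linking key `experiment_id` and the class and function names are assumptions and do not reflect the actual VirOligo schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommonData:
    experiment_id: str  # hypothetical key linking oligos to conditions
    conditions: str     # PCR or hybridization experimental conditions


@dataclass(frozen=True)
class OligoData:
    name: str
    sequence: str
    target_region: str
    usage: str                # "PCR primer", "PCR probe", "hybridization" or "other"
    direction: Optional[str]  # "forward" or "reverse" for PCR oligonucleotides
    note: str
    experiment_id: str        # hypothetical reference to a CommonData row


def oligos_with_conditions(
    oligos: list[OligoData], common: list[CommonData]
) -> list[tuple[OligoData, Optional[CommonData]]]:
    """Pair each oligonucleotide with its experimental conditions, if any."""
    by_id = {row.experiment_id: row for row in common}
    return [(oligo, by_id.get(oligo.experiment_id)) for oligo in oligos]
```

Separating shared experimental conditions from individual oligonucleotides avoids repeating the same conditions for every primer and probe used in one experiment.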

Virus-PLoc is a web server for predicting the subcellular localization of viral proteins within host and virus-infected cells (Shen and Chou 2007). Another web server developed a little later, iLoc-Virus, is a multi-label learning classifier that predicts the subcellular locations of viral proteins with single and multiple sites (Xiao et al. 2011). Similarly, a more recent web server, pLoc-mVirus (Cheng et al. 2017), is a predictor that identifies the subcellular localization of viral proteins with both single and multiple location sites. It works by extracting information from the Gene Ontology (GO) database and is claimed to be more successful than the previous state-of-the-art method, iLoc-Virus, in predicting subcellular localization of viral proteins. AVPpred is an antiviral peptide prediction algorithm built on peptides with experimentally proven antiviral activity (Thakur et al. 2012). The prediction is based on peptide sequence features, peptide motifs, sequence alignment, amino acid composition and physicochemical properties. VIPS is a viral internal ribosomal entry site (IRES) prediction system that can predict IRES secondary structures (Hong et al. 2013). VIPS uses the RNA fold program, which predicts local RNA secondary structures; the RNA align program, which compares predicted structures; and the pknotsRG program (Reeder et al. 2007), which calculates pseudoknot structures. VaZyMolO, a database that deals with viral sequences at the protein level, defines and classifies viral protein modularity (Ferron et al. 2005). It extracts information on complete genome sequences of various viruses from GenBank and RefSeq and organizes the acquired information about the modularity of viral ORFs (Fig. 23.1f).

Fig. 23.1f Representation of amino acid composition in a given viral protein using the online tool PredictProtein
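Amino acid composition, one of the sequence features mentioned above for antiviral peptide prediction, is simple to compute. The Python sketch below returns the fraction of each of the 20 standard amino acids in a protein or peptide. It is a generic illustration, not the feature extraction code of AVPpred or PredictProtein, and the function name is illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(protein: str) -> dict[str, float]:
    """Fraction of each standard amino acid; other symbols are ignored."""
    counts = Counter(r for r in protein.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids in input")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}
```

The resulting 20-element vector is a common input for machine learning classifiers because it has the same length for peptides of any size.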

Web-based tools are available to predict and analyse structural aspects of viruses. LearnCoil-VMF is a computational tool that predicts coiled-coil-like regions in viral membrane fusion proteins (Singh et al. 1999). Membrane fusion proteins are known to be diverse and share no sequence similarity between most pairs of viruses in the same or different families. LearnCoil-VMF is also capable of characterizing the core structure of these membrane fusion proteins.

VIPERdb (Virus Particle Explorer database) is a web-based database of manually curated icosahedral virus capsid structures (Carrillo-Tripp et al. 2009). This database serves as a comprehensive resource for the specific needs of structural virology and for comparing data derived from structural and computational analyses of capsids. With the updated version, VIPERdb2, capsid protein residues in the icosahedral asymmetric unit (IAU) can be displayed using Phi-Psi (Φ-Ψ) diagrams (azimuthal polar orthographic projections) (Ref: https://www.ncbi.nlm.nih.gov/pubmed/18981051). These diagrams can dynamically depict interface and surface residues or interface and core residues, and can be mapped to the database using a new application programming interface (API). This aids in identifying family-wide conserved residues at the interfaces. Additionally, Jmol and STRAP are built into the system to visualize interactive models of viral molecular structures.

VIDA is a database that organizes animal virus open reading frames from partial and complete genomic sequences (Alba et al. 2001). Presently, VIDA includes a complete collection of homologous protein families from GenBank for Herpesviridae, Papillomaviridae, Poxviridae, Coronaviridae and Arteriviridae. The homologous proteins in VIDA include both orthologous and paralogous sequences. VIDA retrieves virus sequences from GenBank and parses the files into subfields. The parsed fields contain information such as GenBank accession number, GenBank identifier (GI number), protein sequence source, sequence length, gene name and gene product. To eliminate fully redundant (100% identical) entries, the retrieved virus protein sequences are filtered and a list of synonymous GIs is created for reference. The ORFs from complete and partial virus genomes are then organized into homologous protein families on the basis of sequence similarity. Furthermore, the structures of known viral proteins, or of proteins homologous to viral proteins, are mapped onto the homologous protein families. VIDA also classifies virus proteins into broad functional classes based on typical virus processes such as DNA and RNA replication, virus structural proteins, nucleotide and nucleic acid metabolism, transcription, glycoproteins and others. The database also provides alignments of conserved regions of potential functional importance. Apart from functional classification, VIDA provides a taxonomic classification of the proteins and protein families. The protein families serve as a tool for functional and evolutionary studies, whereas alignments of conserved sequences provide crucial information on conserved amino acids and for the construction of sequence profiles.
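The redundancy filtering step described above, in which identical protein sequences are collapsed and their identifiers recorded as synonyms, can be sketched in a few lines of Python. This is an illustration of the idea under simple assumptions (exact string identity, first identifier kept as representative) and is not VIDA's own code; the function name and example identifiers are hypothetical.

```python
def collapse_identical(records: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Collapse records with identical sequences.

    `records` holds (identifier, protein sequence) pairs. The result maps
    the first identifier seen for each distinct sequence to the list of
    synonymous identifiers that share exactly the same sequence.
    """
    groups: dict[str, list[str]] = {}
    for identifier, sequence in records:
        groups.setdefault(sequence.upper(), []).append(identifier)
    return {ids[0]: ids[1:] for ids in groups.values()}
```

For instance, three records of which two share a sequence would yield two representatives, one of them carrying a single synonym.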

3 Virus Bioinformatics Databases

3.1 Viral Bioinformatics Resource Center (VBRC)

The Viral Bioinformatics Resource Center (VBRC) is one of eight NIH-sponsored Bioinformatics Resource Centers (http://www.oxfordjournals.org/nar/database/summary/798). It is an online platform that provides informational and analytical tools and resources to the scientific community. The VBRC supports basic and applied research to improve understanding of the viruses included on the NIH/NIAID list of priority pathogens. These viruses are selected because they are potential bioterrorism threats or cause emerging or re-emerging infectious diseases. The VBRC focuses specifically on large DNA viruses. It includes viruses that belong to the Arenaviridae, Bunyaviridae, Filoviridae, Flaviviridae, Paramyxoviridae, Poxviridae and Togaviridae families. It serves as a relational database and web application tool that allows data storage, annotation, analysis and information exchange. The current version (V 4.2) consists of 369 complete genomic sequences.

Using the VBRC, each viral gene and genome can be curated. The result is a comprehensive and searchable summary that describes the genotype and phenotype of the genes. These curations also emphasize the role of the genes in host-pathogen relationships. Additionally, the VBRC houses multiple analytical tools, such as tools for genome annotation, comparative analysis, whole genome alignment and phylogenetic analysis. The database also aims to include high-throughput data derived from other studies, such as microarray gene expression data, proteomic analyses and population genetics data.

3.2 Poxvirus Bioinformatics Resource Center (PBRC)

The Poxvirus Bioinformatics Resource Center (PBRC, now merged into VBRC) is an online platform that serves as an informational and analytical resource to better comprehend the Poxviridae family of viruses. It allows data storage, annotation, analysis and information exchange of the data.

3.3 Influenza Virus Database (IVDB)

Influenza virus is one of the major global health concerns. It gained attention after the emergence of pandemic influenza A virus (H1N1, swine flu) in 2009. A total of 11 web portals and tools focus only on influenza virus. These include the Influenza Virus Database (IVDB), Influenza Research Database (IRD) and NCBI Influenza Virus Resource (NCBI-IVR) (Chang et al. 2007; Bao et al. 2008; Squires et al. 2008). Researchers can use all three websites for sequence databases as well as for basic tools such as BLAST, multiple sequence alignment and phylogenetic tree construction.

IVDB provides access to additional tools such as (i) the Sequence Distribution Tool, which shows the global geographical distribution of a given viral genotype and correlates its genomic data with epidemiological data, and (ii) the Quality Filter System, which assigns a given viral nucleotide sequence to one of seven categories (C1 to C4 and P1 to P3) according to its sequence content (coding sequence [CDS], 5′ untranslated region [5′UTR] and 3′UTR) and integrity (complete [C] or partial [P]). NCBI-IVR is the most widely used and cited online resource. With NCBI-IVR, viral genomic sequences can be annotated using a genome annotation tool and the Flu ANnotation (FLAN) tool. Additionally, large phylogenetic trees may be constructed and visualized in aggregated form with sub-scale details (Bao et al. 2007; Bao et al. 2008; Zaslavsky et al. 2008). IRD provides tools for genomic and proteomic intervention, immune epitope prediction and surveillance data for viral nucleotide sequences (Squires et al. 2012). This resource is also equipped with tools that provide insight into host-pathogen interactions, type of virulence and host range, and into the correlation of sequence variation with these processes. Other repositories are also available: the Global Initiative on Sharing Avian Influenza Data (GISAID) consortium, which mediates the EpiFlu database, and the FluGenome database, which exclusively provides genotyping of influenza A virus and aids in detecting reassortments between divergent lineages (Lu et al. 2007). Reassortment events specifically in influenza viruses can also be identified with GiRaF (Graph-incompatibility-based Reassortment Finder), a downloadable program (Nagarajan and Kingsford 2011).
Another distinct repository, the Influenza Sequence and Epitope Database (ISED), provides viral sequences and epitopes from Asian countries; this information can be used to study the evolutionary divergence and migration of strains (Yang et al. 2009). The web server ATIVS (Analytical Tool for Influenza Virus Surveillance) provides an antigenic map for surveillance and for the selection of vaccine strains by analysing serological data and haemagglutinin sequence data of influenza A/H3N2 viruses and other influenza subtypes (Liao et al. 2009). Another online repository, OpenFluDB (an isolate-centred inventory), provides information on an isolate such as virus type, host, date of isolation, geographical distribution, predicted antiviral resistance, enhanced pathogenicity or human adaptation propensity (Liechti et al. 2010). Primers and probes for influenza viruses can be designed using the Influenza Primer Design Resource (IPDR) (Bose et al. 2008). Further, prospective seasonal influenza epidemics or pandemics can be predicted using a stochastic model, FluTE (Chao et al. 2010) (Table 23.2).
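To give a feel for what a stochastic epidemic model does, the following is a minimal chain-binomial SIR sketch in Python. It is not FluTE, which is a far more detailed individual-based simulator; the function name `simulate_sir` and the parameters `beta` (daily transmission rate) and `gamma` (daily recovery rate) are illustrative assumptions, and the values used are placeholders rather than estimates for any real influenza strain.

```python
import math
import random


def simulate_sir(population, initial_infected, beta, gamma, days, seed=None):
    """Run one stochastic (chain-binomial) SIR trajectory.

    Returns a list of (susceptible, infected, recovered) tuples, one per day,
    starting with the initial state. All parameters are illustrative.
    """
    rng = random.Random(seed)
    s, i, r = population - initial_infected, initial_infected, 0
    history = [(s, i, r)]
    for _ in range(days):
        # Per-day probability that one susceptible person becomes infected.
        p_infect = 1.0 - math.exp(-beta * i / population)
        # Per-day probability that one infected person recovers.
        p_recover = 1.0 - math.exp(-gamma)
        # Draw binomial counts by testing each individual independently.
        new_infections = sum(rng.random() < p_infect for _ in range(s))
        new_recoveries = sum(rng.random() < p_recover for _ in range(i))
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Placeholder parameters, chosen only to produce a visible outbreak.
trajectory = simulate_sir(population=1000, initial_infected=5,
                          beta=0.4, gamma=0.2, days=60, seed=1)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][1])
print("day of peak infections in this run:", peak_day)
```

Because each run draws random numbers, repeating the simulation with different seeds gives a distribution of outcomes rather than a single curve, which is the main reason stochastic models are used for forecasting.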

Table 23.2 Virus-specific online databases/repositories

3.4 Virus Variation Resource (NCBI-VVR)

The NCBI Virus Variation Resource (NCBI-VVR) is a web-based database covering a set of viruses, viz. influenza virus, dengue virus, rotavirus, West Nile virus, Ebola virus, Zika virus and MERS coronavirus (Resch et al. 2009). It enables users to submit their viral sequences along with relevant metadata such as sample collection time, isolation source, geographic location, host, disease severity, etc. It further allows the viral sequences to be integrated and analysed using generic tools such as multiple sequence alignment and phylogenetic tree construction.
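The kind of metadata listed above can be pictured as a simple record attached to each sequence. The Python sketch below is only an illustration of that idea: the class `SequenceRecord`, its field names, and the accession strings are hypothetical and do not reflect the actual NCBI-VVR submission schema or any real database entry.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SequenceRecord:
    """Hypothetical container for a viral sequence and its metadata."""
    accession: str                 # placeholder identifier, not a real accession
    virus: str
    collection_date: str           # ISO 8601 date string, e.g. "2016-03-14"
    isolation_source: str
    country: str
    host: str
    disease_severity: Optional[str] = None


def count_by(records, field):
    """Count records by the value of one metadata field."""
    return Counter(getattr(record, field) for record in records)


# Made-up example records used only to exercise the helper.
records = [
    SequenceRecord("EXAMPLE0001", "Zika virus", "2016-03-14", "serum",
                   "Brazil", "Homo sapiens", "mild"),
    SequenceRecord("EXAMPLE0002", "Zika virus", "2016-05-02", "urine",
                   "Colombia", "Homo sapiens"),
    SequenceRecord("EXAMPLE0003", "West Nile virus", "2012-08-21", "brain",
                   "USA", "Corvus brachyrhynchos", "fatal"),
]

print(count_by(records, "virus"))
print(count_by(records, "host"))
```

Keeping the metadata in a structured form like this is what makes it possible to select subsets (for example, by host or country) before running an alignment or building a tree.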

3.5 Web-Based Genotyping Tools

Rotavirus A (RVA) is the most frequent cause of severe diarrhoea in human infants and young animals worldwide and remains a major global threat for childhood morbidity and mortality (Minakshi et al. 2005; Basera et al. 2010). In recent years, extensive research efforts have been made to develop live, orally administered vaccines. In India, the orally administered vaccine ROTAVAC was introduced after successful clinical trials in 2014 and became available to clinicians in 2016. Such vaccines, however, will have to be scrutinized and updated regularly to accommodate emerging rotavirus genotype variations, which makes molecular and genetic characterization of new circulating and emerging genotypes of rotavirus strains in humans and animals necessary. Recently, the Rotavirus Classification Working Group (RCWG) described a classification system for RVAs in which each of the 11 genomic RNA segments is assigned a particular letter followed by the genotype number. This classification system helps to explain the importance of genetic reassortment among RVAs, host range, transfer of gene segments between two different genotypes and adaptation to different hosts. To differentiate between different gene segments of RVAs, a web-based tool, RotaC, was developed by researchers from the Rega Institute, KU Leuven, Belgium, in 2009 (Table 23.3). It is an easy-to-use and reliable classification tool for RVAs and works in agreement with the RCWG. It is a platform-independent tool that runs in any web browser at its URL (http://rotac.regatools.be/) and has been released without any restriction on use by academicians or anyone else. According to its developers, RotaC will be updated regularly to reflect the established as well as newly emerging genotypes announced by the RCWG from time to time.
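The letter-plus-number notation can be made concrete with a short Python sketch. It assumes the commonly cited RCWG convention in which the segments encoding VP7-VP4-VP6-VP1-VP2-VP3-NSP1-NSP2-NSP3-NSP4-NSP5/6 are written as Gx-P[x]-Ix-Rx-Cx-Mx-Ax-Nx-Tx-Ex-Hx. The function names are hypothetical, this is not how RotaC itself is implemented, and the third constellation below is invented purely to illustrate a single-segment difference.

```python
import re

# Gene and letter code for each of the 11 segments, in constellation order
# (assumed from the commonly cited RCWG notation).
SEGMENTS = [
    ("VP7", "G"), ("VP4", "P"), ("VP6", "I"), ("VP1", "R"),
    ("VP2", "C"), ("VP3", "M"), ("NSP1", "A"), ("NSP2", "N"),
    ("NSP3", "T"), ("NSP4", "E"), ("NSP5/6", "H"),
]

# One token is a letter plus a number; P genotypes are written as P[number].
_TOKEN = re.compile(r"([A-Z])(?:\[(\d+)\]|(\d+))")


def parse_constellation(text):
    """Parse 'G1-P[8]-I1-...' into a {gene: genotype_number} dictionary."""
    parts = text.strip().split("-")
    if len(parts) != len(SEGMENTS):
        raise ValueError(f"expected {len(SEGMENTS)} segments, got {len(parts)}")
    genotypes = {}
    for part, (gene, letter) in zip(parts, SEGMENTS):
        match = _TOKEN.fullmatch(part)
        if match is None or match.group(1) != letter:
            raise ValueError(f"unexpected token {part!r} for {gene}")
        genotypes[gene] = int(match.group(2) or match.group(3))
    return genotypes


def differing_segments(first, second):
    """List the genes whose genotype numbers differ between two strains."""
    return [gene for gene, _ in SEGMENTS if first[gene] != second[gene]]


wa_like = parse_constellation("G1-P[8]-I1-R1-C1-M1-A1-N1-T1-E1-H1")
ds1_like = parse_constellation("G2-P[4]-I2-R2-C2-M2-A2-N2-T2-E2-H2")
# Invented constellation: Wa-like backbone carrying a different VP7 genotype.
example = parse_constellation("G2-P[8]-I1-R1-C1-M1-A1-N1-T1-E1-H1")

print(differing_segments(wa_like, example))
print(len(differing_segments(wa_like, ds1_like)))
```

Comparing constellations segment by segment in this way is a simple first check for possible reassortment: a strain that matches one backbone in most segments but differs in one or two is a candidate for closer phylogenetic analysis.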

Table 23.3 A list of virus database and sequence-based typing tools that are used globally

4 Conclusions and Future Prospects

Much research on animal viral diseases is being conducted at the genomic level. Handling the enormous volume of data obtained from sequencing is often daunting to researchers. This chapter provides a categorized list of bioinformatics approaches that are useful in data mining. Its tables list these bioinformatics programs according to their applications, as well as databases that organize information on human and animal viruses, such as genomic data, ORFs, oligonucleotides, etc. The chapter also includes an illustration of the application of the tool PredictProtein, which is used for the prediction of three-dimensional structures of viral proteins. The major goal of the chapter has been to provide a roadmap to bioinformatics approaches in the field of animal viral diseases.

Although the chapter elaborates on virus-specific bioinformatics programs, most of these programs are designed for human viruses. Bioinformatics tools that are specific to animal viruses do exist, but they are limited in number. Hence, in many cases, researchers have to switch to either human virus-specific tools or other generic tools. In many situations, applying such tools to animal viruses or animal diseases may not be as accurate as using specialized tools. Users should take care when choosing the settings of such tools, and the results obtained also need to be scrutinized. Therefore, the development of new bioinformatics programs and tools specifically designed for animal viruses and diseases should be taken up vigorously. Specialized tools will provide more accurate results and predictions, thereby accelerating bioinformatics research in the field of animal viral diseases.