Minimum InDel pattern analysis of the Zika virus
The Zika virus (ZIKV) can cause microcephaly and congenital abnormalities in the foetus. Recent studies have provided insights into the evolution of ZIKV from the current and previous outbreaks, but the types have not been determined.
We analysed the insertions and deletions (InDels) in 212 ZIKV polyproteins and 5 Dengue virus (DENV) reference sequences. Spearman correlation tests for the minimum InDel (minInDel) patterns were used to assess the type of polyprotein. Using the minInDel frequencies calculated from polyproteins with 11 elements, likelihood estimation was conducted to correct the evolutionary distance. The minInDel-corrected tree topology clearly distinguished between the ZIKV types (I and II) with a unique minInDel character in the E protein. From the 10-year average genetic distance, the African and Asian lineages of ZIKV-II were estimated to have diverged ~270 years ago, which is unlikely for ZIKV-I.
The minInDel pattern analysis showed that the minInDel in the E protein is targetable for the rapid detection and determination of the virus types.
Keywords: Zika virus; Minimum InDel; Polyprotein; Envelope protein; Virus typing
minInDel: Minimum insertion and deletion
MSA: Multiple sequence alignment
Zika virus (ZIKV) is a single-stranded, positive-sense RNA virus species that belongs to the genus Flavivirus. It was first isolated from a sentinel monkey in the Zika forest of Uganda in 1947, and subsequent investigations revealed a mosquito-borne arbovirus that had been transmitted to animals and humans by infected Aedes mosquitoes. Most human infections with ZIKV are asymptomatic (~80%), but some patients have mild fever, rash, joint pain, red eyes, and headache. Over the past decade, Asian ZIKV outbreaks in the Pacific islands have been reported to be associated with Guillain-Barré syndrome. The spread of ZIKV in Brazil and several countries of South and Central America and the Caribbean is of great concern because it can cause microcephaly and congenital abnormalities in babies whose mothers were infected during pregnancy [4, 5, 6]. Recent genomic studies have provided insights into the evolutionary patterns of ZIKV strains from the current and previous outbreaks [4, 7, 8, 9, 10]; however, the virus types were not determined.
ZIKV genomes have differentiated into African and Asian lineages, as determined by phylogenetic studies [7, 8, 11]. The African strains include relatively old strains and recombinant strains derived from co-circulating viruses in different lineages of East and West Africa [8, 10, 11, 12], whereas the Asian strains similar to the prototype Malaysia/1966 strain P6–740 (GenBank accession no. KX377336) are the most prevalent viruses in Asian Pacific American cases, with morbidity including Guillain-Barré syndrome and brain disorder. Previous phylogenetic studies showed that the Asian lineage has been gradually evolving and spreading through the Asian Pacific American outbreaks, after its divergence from an African lineage ~50–180 years ago [10, 12, 14]. Recent phylogenetic trees of ZIKV showed a strong bias towards a greater number of Asian strains than African strains. The molecular patterns show that recombination is likely to occur in the envelope E protein, and that the nonstructural proteins NS2B and NS5 are from different viruses that co-circulate in Africa [8, 12, 15], but firm conclusions remain difficult to draw because recombination in ZIKV is a rare event across the entire genome sequence. Although there has been a significant increase in the number of ZIKV genomes, we still lack an understanding of the evolutionary relationships between the types, which were produced by insertion and deletion (InDel) in the viral polyprotein sequences. InDels occur to a lesser degree than substitutions in viruses as well as in cells [17, 18]. Their occurrence represents a loss of function, independent of the evolutionary time for all of the sequences, and is likely to increase an active subspace (length), where their orthologues are separated prior to the high-frequency substitutions under selective constraints. InDel analysis is one of the most difficult problems in phylogenetics because it requires computing the length of a sequence or retrieving its elements.
Therefore, we must reduce the distribution of the composition bias between the true InDels and the alignment of finite sequences.
This study aims at analysing the minimum InDel (minInDel) patterns of closely related ZIKV and Dengue virus (DENV) on the basis of a consensus polyprotein sequence to reduce the burdens of ambiguous InDels and unreliable substitutions in aligned sequences for a divergent population. To compare the minInDel patterns, independent of the parametric information on the lengths of the sequences, we performed the Spearman correlation test, after normalising the length of each protein component with the maximum length of the consensus. We introduced the minInDel frequency to adjust a timer for inferring the evolutionary distance between the sequences by the maximum likelihood method, and we showed that the virus types differentiate from each other with a unique minInDel pattern in the envelope E protein.
We collected 212 ZIKV genomes that coded for the full polyproteins from the National Center for Biotechnology Information up to May 9th, 2018 (Additional file 1). Five DENV genomes were included as reference sequences that are closely related to ZIKV. Annotations of protein families were generated using the Virus Pathogen Resource (ViPR) at www.viprbrc.org. The maximum length of the viral polyprotein was determined as the sum of the maximum lengths of the individual protein components in the datasets (Additional file 2). Using the unweighted consensus, the minInDel positions were searched by Clustal, and the minInDel maps were produced with Geneious version 10.2.3. The alignment was evaluated with the BLOSUM62 matrix to mitigate any query with low similarity (<62% conserved amino acid residues) in a protein homology group across species. Evolutionary analyses were conducted in MEGA7.
Spearman correlation between normalised protein lengths
The minInDel maps of the collected ZIKV and DENV strains were based on a consensus polyprotein sequence of 3425 amino acids. A full polyprotein is composed of 11 concatenated protein components: anchored capsid protein ancC, membrane glycoprotein prM, envelope protein E, nonstructural protein NS1, nonstructural protein NS2A, nonstructural protein NS2B, nonstructural protein NS3, nonstructural protein NS4A, protein 2K, nonstructural protein NS4B, and RNA polymerase NS5. The ancC and prM proteins are precursors of capsid C, protein pr, and membrane glycoprotein M. Because some amino acid residues are eliminated by the maturation processes of capsid C in both ZIKV and DENV and of membrane glycoprotein M in DENV, the precursor sequences of ancC and prM were used in the minInDel pattern analysis. Because an InDel produces a gapped sequence, it can violate the assumptions of the molecular clock. To resolve the phylogeny, we analysed the minInDel pattern on the basis of the unweighted protein consensus sequence, which is defined as the average by a non-parametric method using the Spearman correlation. To compare the rankings presented by the minInDel pattern, the protein length was normalised to a range between 0 and 1 by dividing it by the consensus length, which corresponds to the maximum length. Spearman’s correlation coefficient (ρ) for the average ranks in a set of normalised length data was calculated with the P-value to assess the statistical evidence for the monotonic relationship between the consensus rank and the ranks produced by the minInDel.
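The normalisation and rank comparison described above can be illustrated with a minimal Python sketch. It divides each protein length by its consensus (maximum) length and computes Spearman's ρ using average ranks for ties. The function names and all component lengths are hypothetical placeholders rather than values from the datasets. Correlating two normalised length profiles is only one plausible reading of the procedure, and the P-value calculation is omitted.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long vectors with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when all ranks are tied")
    return cov / sqrt(var_x * var_y)


def normalise(lengths, consensus_lengths):
    """Divide each component length by the consensus (maximum) length."""
    return [l / c for l, c in zip(lengths, consensus_lengths)]


if __name__ == "__main__":
    # Hypothetical component lengths for illustration only.
    consensus = [100, 150, 504, 350, 900]
    strain_a = [100, 148, 500, 350, 899]
    strain_b = [99, 150, 497, 349, 900]
    rho = spearman_rho(normalise(strain_a, consensus),
                       normalise(strain_b, consensus))
    print(f"rho = {rho:.4f}")
```

Because only ranks enter the calculation, the result does not depend on the absolute lengths of the components, which is the property the non-parametric comparison relies on.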
Estimation of the minInDel frequency and evolutionary distance for the polyprotein
This value is characteristic of discrete minInDel variables in gapped sequences. The valid value ranges between zero and one, where a value of zero indicates no minInDel amino acid in the aligned sequences, and increasing values of the minInDel frequency imply that the lengths are unequal. However, identical sequences, which result in both IA and SA values of zero, give no information on the evolution as well as the minInDel frequency. Although the minInDel value is not obtained from a conventional computer program for the construction of a phylogenetic tree based on a multiple sequence alignment (MSA) of all existing sequences in the given dataset, it is simple to calculate the number of pure InDel amino acids by the differences in the lengths of the sequences. The measurement of realistic gaps in a pairwise alignment (only where the minInDel length was non-zero) could minimise the frequency of independent random InDel variables on the evolutionary path through the accumulation of mutations as well as the burden of computation.
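As a rough sketch of the bookkeeping described above, the following Python derives the number of pure InDel amino acids from the difference in sequence lengths and turns it into a frequency between zero and one. The defining equation is not reproduced in this text, so the form I_A / (I_A + S_A) is an assumption, as is reading IA as the number of minInDel amino acids and SA as the number of substituted amino acids. The function names are hypothetical.

```python
def min_indel_count(length_a, length_b):
    """Pure InDel residues implied by the length difference of two sequences."""
    return abs(length_a - length_b)


def min_indel_frequency(indel_residues, substituted_residues):
    """Frequency in [0, 1] under the ASSUMED form I_A / (I_A + S_A).

    Returns None when both counts are zero (identical sequences), because
    such a pair carries no information about the frequency.
    """
    if indel_residues < 0 or substituted_residues < 0:
        raise ValueError("counts must be non-negative")
    total = indel_residues + substituted_residues
    if total == 0:
        return None
    return indel_residues / total


if __name__ == "__main__":
    # E protein lengths of 504 and 500 residues are quoted in the Results;
    # the substitution count of 12 is a hypothetical placeholder.
    gap = min_indel_count(504, 500)
    print(gap, min_indel_frequency(gap, 12))
```

Returning None for identical sequences mirrors the statement that such pairs give no information on either the evolution or the minInDel frequency.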
Evaluation of Findel values for inferring the evolutionary distance between genes
The consensus coding sequence was generated from Geneious 10.2.3 (Biomatters Ltd., Auckland, New Zealand) using a BLOSUM62 matrix, in which the third nucleotide positions of the stop codons (UAA, UAG, and UGA) were replaced with the ambiguous “n” character. The pairwise alignment was performed using Clustal W in the TranslatorX server. After removing InDels (gaps) and ambiguous codons, including the “n” character, the number of non-synonymous substitutions (SN) was determined as one-half of the sum of the binary numbers (1, non-synonymous substitution; and 0, synonymous substitution) in the aligned codon positions. When these values were substituted into Eq. (1), appropriately modified Eqs. (4, 5) were used to evaluate the genetic properties of the mutations, including minInDel, nonsynonymous substitution rate, and potential biases in inferring the evolutionary distance computed across the coding region.
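The codon-level classification described above can be sketched in Python as follows. The example walks a pairwise codon alignment, skips codons that contain gaps or ambiguous characters such as “n”, and labels each remaining differing codon pair as non-synonymous or synonymous using the standard genetic code. The function name is hypothetical, and the sketch returns raw counts only: it does not apply the one-half factor or Eqs. (4, 5), whose exact forms are not shown in this text.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_substitutions(aligned_a, aligned_b):
    """Return (non_synonymous, synonymous) counts for two aligned coding seqs.

    Codons containing gaps or ambiguous characters are skipped entirely.
    """
    if len(aligned_a) != len(aligned_b) or len(aligned_a) % 3 != 0:
        raise ValueError("sequences must be aligned in frame, equal length")
    non_syn = syn = 0
    for pos in range(0, len(aligned_a), 3):
        codon_a = aligned_a[pos:pos + 3].upper().replace("U", "T")
        codon_b = aligned_b[pos:pos + 3].upper().replace("U", "T")
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap ('-') or ambiguous codon (e.g. containing 'n')
        if codon_a == codon_b:
            continue  # no substitution at this codon
        if CODON_TABLE[codon_a] != CODON_TABLE[codon_b]:
            non_syn += 1  # binary value 1 in the text
        else:
            syn += 1      # binary value 0 in the text
    return non_syn, syn
```

For instance, comparing `ATGGCTAAA` with `ATGGCCAGA` yields one synonymous change (GCT to GCC, both alanine) and one non-synonymous change (AAA lysine to AGA arginine).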
Minimum InDel patterns of the Zika virus and dengue virus
Spearman correlation coefficients of minInDel patterns
Spearman correlation coefficients for the minInDel patterns of DENV and ZIKV types
Simpson’s index of Diversity
In contrast, the ZIKV strains isolated from the Zika forest of Uganda in 1947 and from China in 2016 had four distinct minInDel patterns in the E proteins that were homologous to that of DENV. The effect of the minInDel length scale on characterisation of the ZIKV types was evaluated by varying the observed E protein length from 504 to 497 amino acid residues. The Spearman correlation results of each of the full polyproteins compared to the consensus sequence of DENV and ZIKV indicated that a single or multiple amino acid loss from the 504-amino acid E protein of ZIKV-II (ρ = 0.8750 and P = 0.0004) displays a strong relationship with the generation of various subtypes of ZIKV-I (ρ = 0.7727 and P = 0.0053 for an E protein less than 503 a.a.), including the ZIKV subtypes Ia (500 a.a.), Ib (498 a.a.), and Ic (497 a.a.). The subtypes Ia and Ib co-circulated in the Zika forest of Uganda in 1947, and the new subtype Ic was isolated from China in 2016. The probability was calculated to have a strong monotonic pattern only with the consensus (504 a.a.), but increased with a decrease in the E protein length. This finding implied that the various subtypes of ZIKV could derive from insertion and deletion of single or multiple amino acids in the E protein. However, the measured Simpson’s index of diversity was low (0.099) for the two types, ZIKV-I and -II. The reason is that the reported ZIKV cases reflect the greater attention paid to Asian/American strains than to African strains.
It was previously suggested that the ancestral ZIKV strain that originated from Africa has spread to Asia, the Pacific Islands and now to the Americas and beyond; however, this history has not been reflected in studies of the patterns of mutation accumulation in the viral gene [23, 24]. The investigators did not encompass the entire range of each ZIKV type at spatial and temporal scales because of the very low capacity for detecting emergent conditions in most of the African continent [23, 31, 32, 33]. Before the twenty-first century, ZIKV infections had been reported only sporadically, because false-negative results could arise from the clinical similarity and cross-reactive response of ZIKV with DENV. The prediction of diversity would be further complicated by viral population changes due to epidemic Asian/American strains and the risk to human health, in which repeated bottleneck effects, genetic drift, and sampling errors can be evoked in their interactions with different host environments. In addition, sequencing errors could confound the phylogenetic analyses among the large collection of ZIKV. To reduce any bias that might arise from simple random mutations and sequencing errors, the minInDel pattern analysis on the consensus sequence was useful in finding a unique minInDel pattern in a divergent population, owing to its robustness to non-additive noise at the expense of losing the parametric length. The resulting two main types, ZIKV-I and -II, now encompass all of the different ZIKV strains.
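To make the low diversity value concrete, the following Python sketch computes Simpson's index of diversity in its finite-sample form, 1 − Σn(n − 1) / (N(N − 1)). The choice of this form and the type counts are assumptions: the counts of 11 ZIKV-I and 201 ZIKV-II strains are inferred from the 212 strains and the 11 ZIKV-I strains mentioned in the Discussion, not taken from a table. Under these assumptions the result rounds to the reported 0.099.

```python
def simpson_diversity(counts):
    """Simpson's index of diversity, 1 - sum(n*(n-1)) / (N*(N-1)).

    counts: number of individuals (here, strains) observed in each group.
    """
    total = sum(counts)
    if total < 2:
        raise ValueError("need at least two individuals")
    same_group_pairs = sum(n * (n - 1) for n in counts)
    return 1 - same_group_pairs / (total * (total - 1))


if __name__ == "__main__":
    # Assumed split of the 212 strains: 11 ZIKV-I and 201 ZIKV-II.
    print(round(simpson_diversity([11, 201]), 3))
```

The index is the probability that two strains drawn at random without replacement belong to different types, so a heavily unbalanced sample such as this one yields a value close to zero regardless of how distinct the types are.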
Estimating the minInDel frequency and evolutionary distance corrects the phylogenetic tree of ZIKV
A minInDel-corrected tree topology showed that distinct ZIKV (sub)types of African strains evolved steadily over time before the year 2000. Apart from the minInDel value, the amino acid substitution rates of ZIKV-II strains increased significantly after its spread to Asian Pacific American countries. This change was first detected in the Malaysian strains in 1966, but it was difficult to track the original source and global expansion of the Asian strains because of the lack of genome information and of knowledge on the distribution and diversity of the various ZIKV types, which vary greatly among the lineages by host range and geographical area. The substitution rate of the Asian ZIKV-II strains increased sharply from the Malaysian strains in 1966 and gradually decreased during epidemics in the past 10 years. Based on a molecular clock, a linear evolution model assumes that mutations accumulate one after the other, but it is not valid for genetic drift, which causes shifts in alleles and phenotypes. The biased substitution rate of ZIKV-II in the Asian lineage could be due to the spread of a new strain with a high proportion of slightly deleterious mutations, which contribute to adaptive molecular evolution in the vector host environments of the virus, as a general mechanism of evolution [23, 34, 36, 37, 38]. The Asian lineage of ZIKV-II has therefore been divided into two sublineages from different hosts, i.e., mosquitoes and humans, as mentioned previously. The observed evolutionary distance between the two Asian sublineages of ZIKV-II is large, but it is not as large as that between the Asian and African lineages. Except for the new African II lineage strain (KF383118), which has undergone recombination with distantly related strains, the evolutionary rates of the Asian and African ZIKV-II lineages were independently estimated from the 10-year average genetic distance plots of 200 strains, as shown in Fig. 2d.
When the calculated genetic distance data were extrapolated, the time of their origin was estimated at approximately 270 years ago based on the full polyprotein sequence, in which mutation accumulation in the E protein was relatively slow in both lineages. This inference is approximately 100–200 years older than those estimated based on a rate of ~10⁻³ substitutions per site per year in the full genome sequences of the Asian strains [5, 23, 24], and 0.212 substitutions per site per year in the E gene sequences of the two African ZIKV-II strains, which were generated by recombination. These results indicate the genetic variation in ZIKV-II among the different lineages in the assessed regions.
The minInDel frequency per mutation between ZIKV-I and -II (findel = 0.0191) was invariable, which leads to a monotonic pattern. This value was approximately 3 times higher than the average minInDel frequency (0.0059) among the four DENV types, within a range of 0.0019 to 0.0101. The minInDel frequencies between the ZIKV and DENV types were also high, which caused the maximum likelihood estimation of the distance between the different virus species to vary largely. Most likely, the minInDel frequency of the viral polyprotein continuously varies within a species, but not between different species. Although the minInDel patterns of the full polyproteins are highly diverse between ZIKV and DENV, the minInDel pattern of the E protein is relatively conserved through evolution.
Effects of minInDels on gene sequences
Excluding the minInDel positions, nonsynonymous and synonymous substitutions have occurred at similar levels in most coding regions of the two ZIKV types (Fig. 4c). The findings are consistent with the results of the previous studies, which suggest that most gene elements in the ZIKV genome are likely under stabilising selection [15, 23]. This means that most of the mutations accumulated in the ZIKV strains since the Uganda prototypes in 1947 give rise to conservative substitutions and are evolutionarily neutral. This tendency can be expressed by changes in the generation time or mutation rate rather than by changes in the effective population size or patterns of natural selection that will mainly alter the nonsynonymous substitution rates. The gapless nucleotide alignments of ZIKV and the related virus genomes provided no genetic markers with regard to the evolution of the ZIKV types, although the different lineages arose from changes in the substitution rates, recombination sites, and secondary RNA structure of the 3′-untranslated regions [15, 24, 35]. Among the lineages, the extent of difference in the nucleotide substitution rates can be accounted for by temporal changes in nucleotide composition or differences between gene and genome phylogenies that operate on a short time scale. In contrast, a unique minInDel pattern, which is validated by the Spearman correlation test and likelihood estimation of evolutionary distance between different types, is considered to be a long-term tendency that is relatively independent of the substitution rate change.
We demonstrated that the minInDel pattern in the E protein of ZIKV is a targetable sequence for the rapid detection and typing of closely related viruses. The Spearman correlation test and likelihood estimation between the minInDel patterns were evaluated not only for strain typing of divergent genotypes but also for evolutionary distance correction. This nonparametric approach is similar to multi-locus sequence typing, but the minInDel analysis determines the minInDel frequency expressed as an absolute InDel number per mutation that resulted in an amino acid change in a protein or polyprotein.
When a unique minInDel, regardless of its length, is fixed in an essential protein that is involved in pathogenesis, it is likely to produce different phenotypes under different environmental conditions. In this respect, the unique minInDel found in the E protein, including the surface protein N-glycosylation site (Asn154) of ZIKV, provides new insights into the origins and evolution of different types, thus possibly affecting the host infection and spreading of the virion. Without consideration of the minInDel, the amino acid changes are more conservative in the E protein than elsewhere in the polyprotein sequence of ZIKV-II. The origin time of all ZIKV lineages varied from ~50 to 180 years ago based on substitution models, such as maximum likelihood and Bayesian inference, without considering the viral protein divergence with the distribution of minInDel [10, 12, 14, 39]. In our dataset for virus types, ZIKV strains share a high sequence identity, with an average of 98.2% between the African ZIKV-I strains, 98.9% between the African ZIKV-II strains, 99.7% between the Asian ZIKV-I strains, and 99.7% between the Asian ZIKV-II strains. The Asian and African lineages of ZIKV-II have 97.1% identity, with a range between 94.6 and 99.9% identities obtained from aligned sequences. In contrast, the pairwise sequence alignments of ZIKV-II strains with ZIKV-I strains exhibit an average identity of 96.9% (range 95.1–100%), which is similar to that between the Asian and African lineages of ZIKV-II. We thought that divergence of the virus lineages would be influenced by the distribution of different virus types, with different mutation rates depending on molecular diversity, geographic distribution and host range of mosquito-specific viruses. Among the 212 ZIKV strains studied, 11 ZIKV-I strains and a recombinant ZIKV-II strain from Senegal in 2001 were excluded because their numbers were insufficient to estimate genetic distance by the consensus.
Based on the consensus polyprotein sequence of ZIKV-II, the divergence time of the African and Asian lineages was estimated at 270 years ago, earlier than previously thought.
Unlike a phylogenetic tree based on the substitution rate, the minInDel-corrected evolutionary tree topology clearly showed that the branching off of ZIKV-II towards the Asian and African lineages was seemingly independent of the African lineage of ZIKV-I. However, it is still difficult to assess the origin and evolution of the other types and lineages with the unique minInDel pattern in the E protein because there are a limited number of reported ZIKV genomes. There is still much to be learned about the evolution of the different virus types with the unique minInDel character in the E protein, in particular about its dynamics across hosts, species, and different lineages of the virus types, and how selection acts on phenotypic traits created by mutations.
In this study, we included 212 polyprotein sequences of ZIKV and 5 reference sequences of DENV in order to determine the virus types and correct evolutionary distances between the viruses by the minInDel pattern analysis. Using the consensus sequence of each protein component, the minInDel map was constructed based on MSA of polyprotein sequences and used to evaluate virus types by Spearman correlation tests after normalization of the protein length. We used the minInDel frequency to calculate likelihood estimates of evolutionary distance between the virus types and to correct the maximum likelihood tree. Through this work we presented a new method of the minInDel pattern analysis for determination and validation of the virus types with a unique minInDel character in the E protein. This method can be useful in developing a rapid detection method to improve the global maps in suitable environments for infection and transmission.
This study was supported by National Research Foundation of Korea (NRF-2016R1A2B2014493).
Availability of data and materials
The consensus sequence datasets, minInDel frequencies and likelihood estimates analyzed in the paper are available at https://github.com/yonghakkim/minInDel and on request from author YHK.
HL: data collection and statistical analysis plan. MPN: data analysis and result interpretation. YC: data analysis and result interpretation. YHK: experimental design, supervision of result interpretation, and manuscript drafting. All authors read and approved the final version of the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 13. Oehler E, Watrin L, Larre P, Leparc-Goffart I, Lastere S, Valour F, et al. Zika virus infection complicated by Guillain-Barre syndrome–case report, French Polynesia, December 2013. Euro Surveill. 2014;19:20720. https://doi.org/10.2807/1560-7917.ES2014.19.9.20720.
- 22. McDonald JH. Handbook of biological statistics. 3rd ed. Baltimore: Sparky House Publishing; 2015.
- 25. Grubaugh ND, Weger-Lucarelli J, Murrieta RA, Fauver JR, Garcia-Luna SM, Prasad AN, et al. Genetic drift during systematic arbovirus infection of mosquito vectors leads to decreased relative fitness during host switching. Cell Host Microbe. 2016;19:481–492.
- 28. Nei M, Kumar S. Molecular Evolution and Phylogenetics. New York: Oxford University Press; 2000.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.