Background

Hepatitis B virus (HBV) infection is a major public health problem worldwide. At least two billion people (one-third of the global population) have been infected at some point during their lives, and more than 240 million suffer from chronic HBV infection, putting them at an increased risk for liver cirrhosis and hepatocellular carcinoma [1].

The HBV genome is a partially double-stranded DNA molecule of approximately 3.2 kilobases in length. It has a highly compact coding structure consisting of four overlapping reading frames, which are designated P (polymerase), S (surface), C (core) and X (HBx protein). HBV is a DNA virus that employs the error-prone polymerase reverse transcriptase as part of its replication process [2]. For this reason, most estimates of the nucleotide substitution rate for HBV range between 10-4 to 10−6 substitutions per site per year (s/s/y) [38], which is far higher than those observed in other dsDNA viruses.

Eight HBV genotypes (A-H) have been recognized based on a sequence divergence of more than 8% throughout the genome [911]. More recently, two additional genotypes (I and J) were tentatively proposed [1214]. There is a great deal of diversity within genotypes, leading to the division of some genotypes into subgenotypes [15, 16]. The evolution of HBV is strikingly highlighted by the geographical distribution of the genotypes: Genotypes A (HBV/A) and HBV/D have worldwide distributions; HBV/B and HBV/C are found essentially in Asia; HBV/E is confined to West and Central Africa; HBV/G has been found in Europe, the USA and Japan (and may be ubiquitous); and HBV/H is found in Central America and the southern part of the USA [10, 17].

The evolutionary history of HBV/F, which is one of the most divergent HBV genotypes [18], is not well understood due to the relative lack of appropriate studies. It is closely related to HBV/H [9], and probably originated in Amerindian populations, as it has been found in the native populations of Alaska, Central America and South America [9, 1925]. HBV/F has been divided into four genetically distinct subgenotypes (HBV/F1-F4), two of which have been further subdivided. HBV/F1 is found in Central America (HBV/F1a) and Alaska and South America (HBV/F1b). HBV/F3 is found in Central America and in northern South America, whereas HBV/F2 (HBV/F2a and HBV/F2b) [20] and HBV/F4 are found in South America [26].

Brazil is the largest country in the Southern Hemisphere, corresponding to almost half of the area of South America, and is divided broadly into five geographic regions: north, northeast, central west, southeast and south. Brazil has a highly miscegenated population, with HBV/A, HBV/D and HBV/F circulating among Brazilian HBV carriers [23, 2729]. A large-scale study on the geographic distribution of HBV genotypes in Brazil previously showed a low countrywide prevalence of HBV/F (13%) [23]; this differs from other Latin American countries, where HBV/F prevails [20, 3032]. However, a few studies on isolated indigenous communities of Brazil have shown a predominance of HBV/F [23, 33, 34]. Additionally, the Brazilian Amazon region is a highly endemic area for HBV and hepatitis delta virus (HDV); in this region, HBV/F has a higher frequency among HBV-HDV co-infected patients than in the general HBV-monoinfected population [35, 36]. Previously, only two Brazilian HBV/F complete genome sequences, GenBank accession numbers X69798 [37] and HE981181 [38] were available, limiting the contribution of Brazilian isolates to the phylogenetic studies of HBV/F.

Here, we examined the proportion and geographic distributions of HBV/F subgenotypes in Brazil, determined the full-length genomic sequences of 12 HBV/F isolates from different Brazilian geographic regions, and used the Bayesian coalescent and phylogeographic framework to investigate the origin and spread of HBV/F subgenotypes in Brazil.

Results

Phylogenetic analysis of HBV/F S-gene sequences from Brazil

To identify the HBV/F subgenotypes circulating in Brazil and examine their proportion and geographic distributions, we initially performed a phylogenetic analysis using all Brazilian HBV/F S-gene sequences available in GenBank (n=53) (Figure 1). We focused on S-gene sequences because only two full-length Brazilian HBV/F genomes were available prior to the present study. Of these 53 Brazilian HBV/F S-gene sequences, 29 (54.7%) are from the north, 16 (30.2%) from the southeast, 4 (7.5%) from the central west, 3 (5.7%) from the northeast, and 1 (1.9%) from the south of Brazil.

Figure 1
figure 1

Phylogenetic tree of partial S-gene (403 bp) sequences inferred by using the neighbor-joining method. Values at internal nodes indicate percentage of 1000 bootstrap replicates that support the group. The horizontal bar indicates the number of nucleotide substitutions per site. Brazilian sequences available in GenBank are designated by their accession numbers, followed by the geographic region of origin. The sequences generated in this study are denoted BR, followed by the sample number (1 to 12) and the geographic region of origin, and are identified with ♦. HBV/F reference sequences are indicated by their accession numbers, followed by the respective genotype/subgenotype and the abbreviation of the origin country (ARG: Argentina; BOL: Bolivia; COL: Colombia; ESa: El Salvador; FRA: France; JAP: Japan; NIC: Nicaragua; PAN: Panama; PER: Peru; RIC: Costa Rica; VEN: Venezuela), and are identified with *. Reference sequences for the other genotypes are indicated by their accession numbers, followed by the HBV genotype.

The results of our phylogenetic analysis revealed that HBV/F2a is the most prevalent HBV/F subgenotype in Brazil, with 92.5% of the sequences clustering together with HBV/F2a reference sequences. HBV/F4 (5.7%) and HBV/F1b (1.9%) were also found among the Brazilian sequences (Figure 1).

Concerning the geographic distributions of these subgenotypes, HBV/F2a was found in all five Brazilian geographic regions and corresponded to 100% (29/29) of the sequences from the north, 100% (1/1) from the south, 87.5% (14/16) from the southeast, 75% (3/4) from the central west, and 66.7% (2/3) from the northeast. HBV/F1b was found in 33.3% (1/3) of the sequences from the northeast, whereas HBV/F4 was found in 25% (1/4) of the sequences from the central west and 12.5% (2/16) of the sequences from the southeast (Figure 2).

Figure 2
figure 2

Distribution of HBV/F subgenotypes in the Brazilian geographic regions (N - north; NE - northeast; CW - central west; SE - southeast; S - south), based on 53 HBV/F S-gene sequences from Brazil available in GenBank.

Thus, we obtained 12 full-length HBV sequences from different Brazilian geographic regions representing the three HBV/F subgenotypes circulating in Brazil: HBV/F2a (north, n=1; northeast, n=3; central west, n=1; southeast, n=4), HBV/F1b (northeast, n=1; southeast, n=1) and HBV/F4 (southeast, n=1) (Additional file 1: Table S1). Interestingly, one HBV/F1b complete genome sequence was found in the southeast, where this subgenotype had not been previously detected. These sequences were used in our Bayesian coalescent analysis.

Bayesian analyses of HBV/F complete genome sequences

To examine the time and starting point of diversification of HBV/F subgenotypes in Brazil, we conducted Bayesian Markov chain Monte Carlo (MCMC) analyses on 117 complete genome sequences (12 from our study and 105 from GenBank). We estimated the time of the most recent common ancestor (tMRCA) of all internal nodes of the maximum clade credibility (MCC) tree under a relaxed molecular clock model. Since there is not a consensus on the HBV substitution rate, we used three previously published substitution rates: (i) 7.72 × 10-4 s/s/y obtained from Bayesian approaches using complete genome sequences [6], (ii) an “intermediate” rate in the range of previous estimates of 1.0 × 10-5 s/s/y that was chosen by Torres and co-authors to test one of the possible scenarios for the evolutionary history of HBV/F [26] and (iii) a slower nucleotide substitution rate of 2.2 × 10-6 s/s/y recently estimated using deep calibration ages [8]. Because substitution rates ranged among 10-4 to 10-6, a great variation of tMRCAs was found for HBV/F and its subgenotypes (Table 1). However, an older median ancestral age for the Brazilian HBV/F2a (HBV/F2a-BR) subgenotype compared to HBV/F1b-BR and HBV/F4-BR could be estimated (Table 1). Figure 3 shows the time-scaled Bayesian MCC tree, in which the sequences clustered into highly supported monophyletic clades corresponding to the four HBV/F subgenotypes (HBV/F1-F4). Our analyses indicated that most of the posterior root state probability (PRSP) mass of HBV/F1b-BR and HBV/F4-BR was placed in Argentina (PRSP = 0.91 and PRSP = 0.99, respectively), while HBV/F2a-BR root location was most likely Venezuela (PRSP = 0.75) (Figure 3). Therefore, these data suggest that Argentina and Venezuela were the most probable locations from where HBV/F1b and HBV/F4, and HBV/F2a, respectively, were introduced into Brazil.

Table 1 Estimated tMRCAs for HBV/F complete genome sequences (GenBank accession numbers available in Supplementary file 2
Figure 3
figure 3

Time-scaled Bayesian Maximum Clade Credibility tree of HBV/F full length genome sequences calibrated with the substitution rate of 1.0 × 10-5s/s/y. The branches are colored according to the most probable location of their parental node (see the color legend in the figure). The numbers on the internal nodes represent posterior probabilities (pp), and the branch lengths of the tree correspond to length of time (see the time scale bar at the bottom of the tree). The symbol “▲” indicates the nodes corresponding to the posterior root state probability (PRSP) values reported for HBV/F1b-BR, HBV/F2a-BR and HBV/F4-BR (0.91, 0.75, and 0.99, respectively). The two Brazilian HBV/F complete genome sequences previously available in GenBank are identified with *.

Discussion

HBV/F is thought to be the original genotype of the aboriginal populations of the Americas, since it has been found in high frequencies in several countries of South and Central America [19, 24, 3032, 39], as well as in native Alaskan populations [21, 22]. HBV/H is also considered an Amerindian genotype, as it is found in Central America, primarily in Mexico and Nicaragua [9, 40]. Since HBV/F and HBV/H are closely related, it has been suggested that these genotypes likely split off from each other within the New World, via division of an ancestral HBV strain carried by the first settlers that entered the continent across the Bering Strait around 15,000 years ago [9, 41]. It has been estimated that the origin of HBV/F and its subgenotypes precedes the European discovery of the Americas by around 500 years ago [8, 26]. Remarkably, over the first century and a half after the conquest of the Americas, the native population plummeted by an estimated 80% (from around 50 million in 1492 to 8 million in 1650) [42]. This considerable reduction in the population size may have led to the disappearance of some HBV/F strains that were previously endemic in the continent. However, the increase of the Latin American population from the early 1800’s may have favoured transmission of HBV/F in a highly endemic manner, fixing the HBV/F1-F4 subgenotypes in the continent [26].

Unlike the other Latin American countries, where HBV/F prevails [20, 3032], Brazil has a low countrywide prevalence of HBV/F [23]. Even in the northern region of Brazil (which includes the Amazon basin, where the native Amerindian population predominates more than in other regions of the country), low frequencies of HBV/F have been detected [23, 4346]. Instead, a predominance of HBV/A, mainly HBV/A1, which was introduced to the Americas probably during the slave trade [28], has been observed. This may indicate that there has been a change in the native genotypic profile of the region, possibly due to the migratory influx that has occurred in western Amazonia since the late 19th and early 20th centuries, and the gradual reduction of indigenous communities over time [47]. Although the low prevalence of HBV/F countrywide, our survey of HBV/F S-gene sequences from Brazil available at GenBank showed that HBV/F is found in all Brazilian geographic regions, with most of the sequences (54.7%) isolated in the north region, and there is circulation of three subgenotypes (HBV/F2a, HBV/F4 and HBV/F1b).

Because there is no consensus about HBV evolutionary rate, with most estimates ranging between 10-4 to 10−6 s/s/y, we decided to estimate the tMRCAs of the HBV/F subgenotypes circulating in Brazil using three previously published substitution rates, 7.72 × 10-4[6], 1.0 × 10-5[26] and 2.2 × 10-6 s/s/y [8]. Although the substitution rate of 2.2 × 10-6 s/s/y was not obtained for the complete genome, most tMRCA estimations from our study and Paraskevis et al. [8] displayed an overlap of 95% HPD intervals. Among the HBV/F subgenotypes circulating in Brazil, our analysis suggested an older ancestral age for HBV/F2a-BR compared to HBV/F1b-BR and HBV/F4-BR. The median tMRCAs of HBV/F1b-BR and HBV/F4-BR were very similar, which may suggest that they appeared almost simultaneously. Inversely, HBV/F2a has a lower tMRCA than HBV/F1b and HBV/F4 when all HBV/F sequences were compared (Table 1). This fact indicates that HBV/F2a probably emerged in the American continent at a later time than the other two clades. Despite this, HBV/F2a probably arrived and started to spread in Brazil at an earlier time than HBV/F1b and HBV/F4. Since Brazil was not pointed as the epicenter of dissemination of any HBV/F clade, the time at which each clade started to spread in South America is different from the time at which each HBV/F clade started to spread in Brazil.

The tMRCA estimations using the substitution rates of 1.0 × 10-5 and 2.2 × 10-6 s/s/y (but not 7.72 × 10-4 s/s/y), provided epidemiologically realistic scenarios for the evolutionary history of HBV/F and its subgenotypes. Both evolutionary rates supported the hypothesis that the starting point of diversification of the strains ancestral to HBV/F2a-BR in Brazil would have occurred in the pre-Columbian Americas. However, some cautions are necessary when interpreting these results, since the number of HBV/F2a-BR sequences is greater than HBV/F1b-BR and HBV/F4-BR sequences and this may introduce a bias in tMRCA estimates. Noteworthy, HBV/F2a has been found in all geographic regions of Brazil, and is much more prevalent (92.5%) than HBV/F4 (5.7%) and HBV/F1b (1.9%). In addition, all HBV/F sequences from the Brazilian Amazon region correspond to HBV/F2a, including five sequences (EF690487, EF690488, EF690489, EF690490, and EF690491) from the native Amerindian Apurinã Tribe, thus corroborating the idea that HBV/F2a is the oldest HBV/F subgenotype in Brazil and may represent the original native HBV of the Brazilian population.

HBV/F subgenotypes have a distinct geographical distribution in the American continent. HBV/F1 has been further subdivided into two sub-clusters, HBV/F1a and HBV/F1b [39, 48]. HBV/F1a remained restricted to Central America, whereas HBV/F1b had a more complex distribution, being found mainly in South America (Argentina, Brazil, Chile, Peru, and Venezuela). Curiously, HBV/F1b is also found in the native populations of Alaska. In our analysis, the Alaskan sequences grouped far from the HBV/F1 root, suggesting a relatively recent migration. This notion is supported by a paper that also suggested a relatively recent introduction of this genotype in Alaska [21]. Likewise, HBV/F2 separated into two sub-clusters, HBV/F2a and HBV/F2b [20]. HBV/F2b remained restricted to Venezuela, whereas HBV/F2a is found almost exclusively in Brazil and Venezuela, with the exception of one HBV/F2a sequence (GenBank accession number AY090455) isolated in Nicaragua. HBV/F3 is found in Central America, Colombia and Venezuela, while HBV/F4 in Argentina, Bolivia and Brazil. Our data suggest a plausible route of introduction of HBV/F subgenotypes in Brazil. The phylogeographic patterns indicated that Venezuela was the most probable origin for HBV/F2a that has subsequently spread countrywide, indicating a north-to-south net viral flow in Brazil. On the other hand, although the limited number of Brazilian HBV/F1b and HBV/F4 sequences, these strains appeared to have spread from the south to the north, being Argentina the most probable epicenter of introduction of these subgenotypes in Brazil.

Conclusions

We herein employed the Bayesian coalescent and phylogeographic framework to investigate the evolutionary history of HBV/F and the complex factors underlying its origin in Brazil. Our study suggests a plausible route of introduction of HBV/F subgenotypes in Brazil and provides full-length genomic sequences of HBV/F isolates from different Brazilian geographic regions, which will increase the contribution of Brazilian isolates to further phylogenetic studies of HBV/F.

Methods

DNA extraction and molecular assays

HBV-DNA were extracted from a total of 29 HBeAg positive serum samples taken from different geographic regions of Brazil, previously characterized as having HBV/F strains. However, the amplification and sequencing of the complete genome were successfully performed in 12 samples: southeast (n=6), northeast (n=4), central west (n=1), and north (n=1) (Additional file 1: Table S1). The Ethics Committee of Oswaldo Cruz Institute reviewed and approved the research protocol. All participating individuals gave written informed consent.

HBV DNA was extracted from 0.2 mL of serum using a High Pure Viral Nucleic Acid kit (Roche Diagnostics, Germany) according to the manufacturer’s instructions. The full-length HBV genome was amplified as previously described [49] and purified using the Wizard SV Gel and PCR Clean-Up System (Promega, WI, USA). HBV nucleotide sequences were determined using a BigDye Terminator kit (Applied Biosystems, CA, USA) and the previously described primers [50]. Sequencing reactions were analyzed on an ABI3730 automated sequencer (Applied Biosystems). The nucleotide sequences obtained during this study were deposited in the GenBank database under accession numbers KC494394-KC494405.

HBV/F S-gene sequence dataset from Brazil

All HBV/F S-gene sequences from Brazil available in GenBank by October 2012 (n=53) were selected for phylogenetic analysis. Reference sequences for HBV genotypes A-J were also included in the analysis. The phylogenetic relationships among the S-gene sequences were evaluated with the neighbor-joining method (1000 bootstrap replicates) using MEGA version 5.1 [51].

Bayesian phylogenetic and phylogeographic analyses

The 12 complete HBV/F genome sequences generated in this study were combined with 105 full-length HBV/F sequences from North, Central and South America and Japan (GenBank accession numbers available in Additional file 2: Table S2). All selected sequences had country data available and were classified as non-recombinant. The sampling locations included Argentina (n=24), Bolivia (n=7), Brazil (n=2), Chile (n=21), Colombia (n=2), Japan (n=2), Peru (n=2), USA (n=11) and Venezuela (n=26). The sequences from Costa Rica (n=2), El Salvador (n=2), Nicaragua (n=2), and Panama (n=2) were grouped as representing Central America. A reference sequence from HBV/H was also included in the analysis.

The tMRCA and the spatial diffusion for the full-length sequences of HBV/F and its subgenotypes were jointly estimated using the Bayesian MCMC statistical framework implemented in the BEAST v1.7.4 package [52, 53]. A matrix of geographic locations was constructed based on the sampling location for each sequence. A full model was used in which all possible reversible exchange rates between locations were equally likely (flat prior) [54]. Analyses were carried out with a Bayesian Skyline coalescent tree [55] under the model of nucleotide substitution TVM + Γ + I, which was selected as the best-fit model by the jModeltest program [56]. We used a relaxed (uncorrelated Lognormal) molecular clock model [57] that was chosen over a strict molecular clock by calculating the Bayes Factor from the posterior output of each model using TRACER v1.4 [58]. The time-scale of the Bayesian tree was calibrated using three previously published substitution rates, 7.72 × 10-4[6], 1.0 × 10-5[26] and 2.2 × 10-6 s/s/y [8]. MCMC analysis was run for 5 × 107 generations to achieve the convergence of parameters, which was assessed after a 10% burn-in and calculation of the Effective Sample Size (ESS) using TRACER v1.4. All parameter estimates showed ESS values >200, and their uncertainty was reflected in the 95% Highest Posterior Density (HPD) intervals. The MCC tree was visualized using the FigTree v.1.3.1 program after the posterior tree distribution was summarized using the TreeAnnotator v.1.7.4 program.