Background

The LPS of Helicobacter pylori is a primary host-interacting structure, and several lines of evidence indicate that it is involved in host adaptation and virulence. Strains lacking the O-antigen have a reduced capacity to colonize the murine stomach [1], and various alternative extensions form structural motifs that are identical to the human blood group antigens Lewis X, Y, and b [2–6], which may be involved in immune evasion and the ability to establish long-term colonization [7, 8]. LPS structures are also the context in which surface proteins are presented, and different LPS phenotypes can affect the presentation and function of these proteins, for example by influencing flagellum-mediated motility and urease activity [9].

Almost half of the LPS biosynthetic genes of H. pylori are phase variable. Phase variability is a consistent marker for genes involved in niche adaptation and immune evasion [10, 11]. The phase variable repertoire of this species has recently been re-evaluated in a population-based survey [12], and the variability of these genes has been studied in a model of prolonged gastric colonization [13]. These studies have shown that the LPS biosynthetic genes are amongst the most abundant phase variable genes, are amongst those that display the greatest inter-strain differences in their repeats, and are the most dynamically variable during colonization. The sequencing used to assess the phase variable status of the genes in these previous studies was focussed upon the regions containing the repeats. This revealed additional inter-strain polymorphisms within the sequences flanking the repeats, suggesting that frequent recombination is a significant and previously uncharacterized source of inter-strain variability within these genes. This observation was the stimulus for the study described here.

The selection of sequences used for bacterial population studies and typing is essentially pragmatic. The selected target must show sufficient diversity to provide discrimination between unrelated strains. It must also be sufficiently stable to indicate when two strains are related over a biologically relevant period of evolutionary time. The dominant methodology used for studies of bacterial population genetics is currently MLST, and this has recently been applied successfully to H. pylori [14, 15], revealing clear correlations between the population structure of the bacteria and the human migratory populations with which they are associated. H. pylori is one of the most recombinogenic species to have been studied to date [16–18], and it exchanges relatively small sections of DNA. The MLST studies focus upon genes associated with core metabolic processes, which are believed to be under minimal selective pressure for change.

What is currently undefined is the degree to which these sequences, which have been selected as a basis for strain typing because of their relative stability, reflect the patterns of exchange and recombination of more diverse genes that are potentially subject to functional selection for change. It may be that the population divisions do not reflect functional barriers to the free exchange of other genes in the global population, and that sequences directly affecting strain behaviour move more freely through the population. Equally, it may be the case that these populations that have been relatively geographically isolated are broadly similar in other ways and that this would be reflected in their allelic-variant gene complements as a whole. While the latter possibility is generally assumed to be more likely, it is important to specifically address this issue, because when strains are selected for functional and mutagenesis-based experiments it is necessary to know to what extent the findings are likely to be applicable to the wider bacterial population, especially in a highly panmictic species such as H. pylori. In this study, we have focussed upon eight LPS biosynthetic genes in which we have previously seen polymorphisms, seven of which are phase variable, to determine their diversity and the extent to which they are mobile within and between the major population subdivisions, using strains representative of the previously defined population structure.

Results and Discussion

Variations in gene complement and gene location

Our previous study of H. pylori phase variable genes used primer pairs designed to amplify repeats mediating phase variation that were located within and immediately adjacent to the coding sequences. This revealed some differences in gene complement between the strains and the presence of polymorphisms indicative of recombination [12]. The current study sought to investigate the diversity within the whole of the coding regions of the selected genes, so primers were designed against the flanking regions of these genes, normally within the coding sequences of the adjacent genes. This led to the unanticipated finding that some of these genes, whilst previously shown to be present, are located in different, yet to be determined, chromosomal locations. This is not a consequence of failed amplification due to polymorphisms in primer target sites, or of local reorganizations, because in some instances the sites in which the genes are located in the sequenced strains can be amplified, sequenced, and shown to be 'empty'. The data on gene complements and locations are summarized in Table 1.

Table 1 Presence and absence of the 8 LPS genes studied among 23 H. pylori strains

Gene jhp0820, an rfaJ homologue, is not present in strain 26695, nor in B225, JP9, and 162, which is consistent with our previous findings. In the current study the gene was not amplified using the flanking primers from a further 8 strains, although we know from our previous study that these strains possess this gene [12]. In these 8 cases short PCR products were obtained from this chromosomal locus corresponding to an empty site, as was obtained from strain 26695. Sequences of two representative products show the expected flanking regions and the absence of jhp0820 in this location. This indicates that although this gene is present in these strains, it is located at a different chromosomal position.

In two strains (L133 and C164) sequencing of the complete jhp0820-equivalent genes from this genomic location, using primers located within the flanking genes, revealed intact homopolymeric tracts sufficiently long to mediate phase variation, whereas in our previous study [12] these were found to be stabilized with an internal thymidine. This indicates that there are at least two copies of this gene in these strains, and that the copy from which the internal region was previously sequenced is located in a different genomic location. In strain J99 this gene is located between one of the vacA genes (jhp0819) and fecE (jhp0821, an iron transport associated gene). The primer pair used previously extended into the vacA gene because the homopolymeric tract is located close to the 5' end of the coding sequence. The vacA gene is present in multiple copies, with one gene and four and three paralogues annotated in strains 26695 and J99, respectively. One possible scenario is that vacA is variably associated with different rfaJ homologues in two or more of its different locations. Interestingly, the absence of gene jhp0820 in this location, or its replacement with another allele, is a feature shared by the strains of the hpLadakh and hpEastAsia populations. Because the PCR and sequencing data for this gene were obtained from only 11 of the 23 strains selected for this study, jhp0820 was not included in the population structure analyses that follow.

HP0217, encoding a transferase involved in LPS biosynthesis, was not amplified from the predicted location in two strains: NQ367 (hpEurope) and CC42C (hpAfrica1), although we know this gene to be present on the basis of amplification and sequencing of its repeat-containing region in our previous study [12]. Primers located in the flanking genes, which encode hypothetical proteins, generated short products that did not contain this gene. This indicates that, in these two strains, the gene is absent from this genomic location and is instead located at a different, currently unknown, site.

HP0619 is a homologue of lex2B, which is phase variable. Adjacent to this gene in strain J99, but not in the published sequence of strain 26695, is a related gene, jhp0562, which is not phase variable. The population structure revealed by the gene jhp0562 is broadly similar to the one described on the basis of the MLST analysis. At least 16 strains possess both genes, and strain 26695 is unique within the studied collection in having only HP0619. Three strains, L67 (hpLadakh), VZ21 (hpEurope), and GU5 (hpEastAsia), contained only an orthologue of jhp0562. However, the situation in some strains is more complex. In strain JP9 (hpEastAsia), two PCR products were obtained. Sequencing of both products revealed two lex2B homologues in one locus (HP0619 and jhp0562 homologues), reflecting the more commonly reported state, and an additional, different jhp0562 homologue in the other product. The sequences of all three genes are clearly distinct on the basis of polymorphisms. Taken together, this indicates that there are at least two loci containing these homologues, suggesting a potential for substantial diversity in the synthesis of different LPS structures.

Comparison of the population structures revealed by MLST and phase variable LPS biosynthetic gene sequences

The LPS genes are much more diverse than the housekeeping genes (Table 2). Higher genetic distances and lower nucleotide identities were found for the LPS genes than for the housekeeping genes. Moreover, the higher rate of non-synonymous substitution (Ka) found in the LPS genes shows that the purifying selection exerted on the housekeeping genes is much stronger than that exerted on the LPS genes. The main difference observed between the MLST tree, built with the sequences of the 7 housekeeping genes (a total of 3.4 kb of sequence), and the LPS tree, built on the 7 LPS genes (the 8 LPS genes excluding jhp0820, representing a total of 7.3 kb of sequence), is the depth of the branches. This higher genetic diversity in the LPS biosynthesis genes is consistent with these genes being under different selective pressures to the housekeeping genes used for MLST.

Table 2 Diversity within H. pylori LPS genes and housekeeping genes under study

The general pattern for all of the genes studied is that there are strong associations between the described MLST-based population subdivisions (Figure 1) and the differences in the seven phase variation-associated LPS biosynthetic genes (Figure 2). This pattern is clearest once the lengths of the phase variation-mediating repeats have been standardized and major indels removed, but is consistent for complete, repeat and indel standardized, and amino acid based analyses. It should be noted that the MLST tree presented here is different from the tree published previously [12] because the vacA sequence has been excluded from the current analysis. This is in part because we wished to compare the population structure based upon the core housekeeping genes under minimal selection, with that based upon the phase varied LPS biosynthetic genes. It is also because the vacA gene fragment used for MLST is strongly biased by the so-called s-region encoding the gene's leader peptide, and there is a strong correlation between the s-region (1 or 2) and the cagPAI status (positive or negative respectively) leading to a false phylogeny of the resulting supergene (Bodo Linz – personal communication).
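
The repeat standardization mentioned above can be illustrated with a minimal Python sketch. It assumes that the phase variation-mediating repeats are homopolymeric tracts and that tracts of at least 8 bases are to be set to a uniform length of 8; the function name, the threshold, and the fixed length are hypothetical choices made for illustration and are not taken from the study.

```python
import re


def standardize_homopolymers(seq: str, min_len: int = 8, fixed_len: int = 8) -> str:
    """Set every homopolymeric run of at least `min_len` bases to `fixed_len` bases.

    `min_len` and `fixed_len` are illustrative assumptions, not values
    reported in the study.
    """
    # ([ACGT]) captures one base; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return pattern.sub(lambda m: m.group(1) * fixed_len, seq.upper())


# Two hypothetical alleles that differ only in the length of a poly-C tract
allele_a = "ATG" + "C" * 13 + "GAT"
allele_b = "ATG" + "C" * 9 + "GAT"
print(standardize_homopolymers(allele_a) == standardize_homopolymers(allele_b))
```

After such a step, alleles that differ only in tract length no longer contribute to the distance estimates, so the remaining differences reflect substitutions rather than repeat-length switching.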

Figure 1

Neighbour-joining tree based on MLST data. All distances are based on Kimura two-parameter estimates and are to scale (scale bar, lower left). Populations are coded as follows: ▲ hpEastAsia; ▼ hpLadakh; ● hpEurope; ■ hpAfrica1 and ◆ hpAfrica2.

Figure 2

Neighbour-joining tree based on sequencing data of the LPS biosynthetic genes. All distances are based on Kimura two-parameter estimates and are to scale (scale bar, lower left). Populations are coded as in Figure 1.

There are clear areas of sequence divergence indicative of recombination, and the clustering shown within the trees of this study indicates that the predominant movement of sequences is within, rather than between, the population subdivisions. This indicates that the population structures, as defined using MLST, are likely to be broadly associated with relatively conserved similarities and differences between the strains with regard to their general characteristics, and that these subdivisions do, indeed, form a sound basis for the investigation of biological differences between different H. pylori strains.

The housekeeping gene-based MLST tree distinguishes 5 populations: hpEastAsia (bootstrap value = 100), hpLadakh (bootstrap value = 90), hpAfrica1 (bootstrap value = 93), and hpAfrica2 (bootstrap value = 100), the last of which is rooted in hpEurope (Figure 1). This is consistent with the previous findings using these sequences. The tree based upon the combined LPS biosynthesis gene sequences groups the strains from the previously defined sub-populations similarly (Figure 2), although the branch orders differ, and generally associates strains in the previously assigned sub-populations most closely. However, there are some differences between the two trees. Specifically, in the LPS biosynthesis gene-based tree, hpLadakh appears as a subpopulation of hpEastAsia, as illustrated by the relative position of strain JP9, and hpAfrica2 is rooted between hpAfrica1 and hpEurope, rather than from the midst of the hpEurope population. From both trees, based upon the Maynard-Smith et al. model [19], H. pylori has a population consistent with a clonal epidemic structure, with clonal groupings (hpEastAsia, hpLadakh, hpAfrica1 and hpAfrica2) and a recombinant structure in hpEurope. When the sequence data of the 7 LPS genes and the 7 housekeeping genes were combined, the resulting neighbour-joining tree had the same overall shape as the tree obtained with the MLST data combining the sequence of the 7 housekeeping genes with the vacA gene fragment.

With the exception of HP0379, all of the single gene-derived trees indicate that the hpAfrica2 population is the most divergent from the others, and that the hpEurope and hpAfrica1, and the hpEastAsia and hpLadakh populations are the most closely associated pairs of population subdivisions. This contrasts with the previously described attribution of the hpLadakh strains as part of the hpEurope population [14].

The behaviour of individual genes in the population

The orthologues of HP0651, an alpha-1,3-fucosyltransferase, show inter-strain differences in the presence of either a long homopolymeric cytidine repeat or a shorter hexamer that is not predicted to mediate phase variation. The presence of the longer and shorter repeats shows a clear division (bootstrap value = 100) within the population. This indicates that the emergence of longer repeats mediating phase variation is probably a relatively uncommon evolutionary event. It may also reflect the gene complements of a specific group of strains, in which phase variation of a particular gene may be less adaptively advantageous. The separation of the different alleles remains after the variable length repeats have been adjusted to a uniform length. The population grouping roughly follows the population grouping obtained with the MLST study, with hpAfrica1 strains constituting a cluster (bootstrap value = 80) and hpAfrica2 strains constituting a different cluster rooted in the hpEurope population.
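
The division between long-repeat and short-repeat alleles can be sketched in Python as a simple classification by the longest poly-C tract. This is only an illustration: the threshold of 8 bases is an assumption (the text states only that a hexamer is not predicted to mediate phase variation), the tract is assumed to read as C on the strand supplied, and the strain labels are hypothetical.

```python
from itertools import groupby


def longest_tract(seq: str, base: str = "C") -> int:
    """Return the length of the longest uninterrupted run of `base` in `seq`."""
    runs = [len(list(group)) for b, group in groupby(seq.upper()) if b == base]
    return max(runs, default=0)


def group_by_repeat_class(alleles: dict, base: str = "C", threshold: int = 8) -> dict:
    """Split strains into 'long' and 'short' repeat classes.

    `threshold` is an assumed cut-off for a tract long enough to mediate
    phase variation; it is not a value reported in the study.
    """
    groups = {"long": [], "short": []}
    for strain, seq in alleles.items():
        key = "long" if longest_tract(seq, base) >= threshold else "short"
        groups[key].append(strain)
    return groups


# Hypothetical alleles: one with a 12-base poly-C tract, one with a hexamer
alleles = {"strainA": "ATG" + "C" * 12 + "GAT", "strainB": "ATGCCCCCCGAT"}
print(group_by_repeat_class(alleles))
```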

The other alpha-1,3-fucosyltransferase sequenced, the orthologues of HP0379, shows the only example in this study of integration of alleles from hpAfrica2 with another part of the population. In this case the hpAfrica1 and hpAfrica2 alleles cluster together (bootstrap value = 100) indicating an exchange of this allele between these two populations, and that opportunities for exchange, either directly or indirectly, with this most-divergent of H. pylori populations do occur. HpLadakh strains constitute a cluster (bootstrap value = 100) distinct from the hpEastAsia population and the hpEurope population. This illustrates that even though the previous MLST-based population study robustly demonstrated that this population had a prolonged genetic isolation from the rest of the H. pylori population [14], these strains still have a degree of genetic connectedness with the wider H. pylori gene pool.

When data for the genes HP0651 and HP0379 are combined (Figure 3), the comparatively recent transfer of HP0379 within the African populations is readily apparent. While in the wider non-African populations both genes have been present for sufficient time for their distribution to show a common strain-associated pattern of divergence, the strain associations break down in the African populations, where these genes are clustered primarily on the basis of the similarities between the genes rather than on their divergence since acquisition by individual strains. HpAfrica1 and hpAfrica2 strains are part of a single cluster (bootstrap value = 100), hpEastAsia strains constitute a cluster (bootstrap value = 97) distinct from hpLadakh (bootstrap value = 94), whereas hpEurope strains are placed all around the tree, without forming a true group. It is usually, but not always, the case that one of the two alpha-1,3-fucosyltransferase genes is phase variable, while the other is not. Among the 8 strains of the hpAfrica1 and hpAfrica2 populations, HP0379 is not phase variable and HP0651 is phase variable. In two strains of the hpLadakh population (L7 and L67) neither of these genes is phase variable, and in four strains of the hpEurope population (26695, B225, H1413, 111UK) both are phase variable. The functional consequences of which of these genes, or both, is phase variable have yet to be determined.

Figure 3

Neighbour-joining tree based on combined sequencing data of the two alpha-1,3-fucosyltransferase genes (HP0651 and HP0379). All distances are based on Kimura two-parameter estimates and are to scale (scale bar, lower left). Populations are coded as in Figure 1; solid shapes indicate phase variable alleles and empty shapes indicate non-phase variable alleles.

The neighbour-joining trees for HP0651 and HP0379 show different patterns of evolution in the hpAfrica1 and hpAfrica2 populations compared to the patterns observed in the other populations. In order to better understand the differences, these populations were studied separately. Considering the relatively small number of strains used in this study from each population, hpEastAsia and hpLadakh were grouped together (referred to as "hpEastAsia-Ladakh"), a grouping supported by the trees built using the other LPS gene sequences. On the same basis, the hpAfrica1 and hpAfrica2 populations were grouped together (referred to as "hpAfrica1-2"). The genetic distances in the "hpAfrica1-2" and "hpEastAsia-Ladakh" groups are shorter than those in the hpEurope population, suggesting that the strains in the so-called hpEurope population are more divergent than the strains from hpEastAsia and hpLadakh, or than the strains from hpAfrica1 and hpAfrica2. The genetic distances in hpEurope (dHP0619-HP0379 = 0.11 ± 0.03) are as high as the genetic distances in all of the populations considered together (dHP0619-HP0379 = 0.12 ± 0.03). This is consistent with a recombinant population structure for hpEurope, whereas the shorter genetic distances in both the hpAfrica1-2 (dHP0619-HP0379 = 0.09 ± 0.04) and hpEastAsia-Ladakh (dHP0619-HP0379 = 0.09 ± 0.03) groupings are consistent with clonal populations. Nearly identical synonymous substitution (Ks) frequencies were found for HP0651 and HP0379 in hpEurope (KsHP0619 = 0.46 ± 0.07; KsHP0379 = 0.42 ± 0.16) and hpEastAsia-Ladakh (KsHP0619 = 0.34 ± 0.07; KsHP0379 = 0.35 ± 0.09), suggesting that the interval since their divergence from an ancestor is similar. In contrast, the different Ks frequencies found in hpAfrica1-2 (KsHP0619 = 0.37 ± 0.20; KsHP0379 = 0.29 ± 0.17) suggest that HP0651 and HP0379 have a different pattern of evolution.
Non-synonymous/synonymous substitution rate analysis of HP0651 and HP0379 suggests that hpAfrica1-2 has undergone stronger purifying selection than hpEurope and hpEastAsia-Ladakh (Figure 4). Ka/Ks analysis also shows that, in the hpAfrica1-2 population, the selective pressure is greater on HP0379, the phase variable alpha-1,3-fucosyltransferase gene, than on HP0651, the non-phase variable allele, whereas in hpEurope and hpEastAsia-Ladakh the selective pressure exerted on HP0651 and HP0379 is similar.
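
The distinction between synonymous and non-synonymous changes that underlies the Ka/Ks comparison can be illustrated with a short Python sketch. This is a deliberately simplified count and not the method used in this study: it assumes two codon-aligned, gap-free coding sequences and the standard genetic code, considers only codons that differ at a single position, and neither normalizes by the number of synonymous and non-synonymous sites nor corrects for multiple substitutions, both of which a full Ka/Ks estimate requires. The function name and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(seq_a: str, seq_b: str) -> tuple:
    """Count (synonymous, non-synonymous) single-base codon differences."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        ca, cb = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue  # identical codons, or codons with ambiguous bases
        if sum(x != y for x, y in zip(ca, cb)) != 1:
            continue  # codons with several changes are ignored in this sketch
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# CTT -> CTC keeps leucine (synonymous); AAA -> AGA changes lysine to arginine
print(count_codon_changes("ATGCTTAAA", "ATGCTCAGA"))
```

A Ka/Ks ratio below 1 is conventionally read as purifying selection, which is the sense in which the ratios for the different population groupings are compared above.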

Figure 4

Non-synonymous/synonymous substitution rate (Ka/Ks) analysis of the two alpha-1,3-fucosyltransferase genes (HP0651, HP0379). Considering the small number of strains in each population, the hpAfrica1 and hpAfrica2 populations were grouped together in a "hpAfrica1-2" cluster, and the hpEastAsia and hpLadakh populations were grouped together in a "hpEastAsia-Ladakh" cluster.

HP0208 shows a different pattern of subpopulation exchange, in that it has a pattern which indicates recombination of this gene between the hpAfrica1 and hpEurope lineages. This gene is absent from the hpLadakh strains, but otherwise has a pattern reflecting the general population subdivisions, with hpEastAsia (bootstrap value = 98) and hpAfrica2 constituting two clusters whereas hpAfrica1 and hpEurope strains constitute a background population. Similarly, HP0217 and HP0093 show a generally typical pattern, but with some evidence of exchange from the hpEastAsia and hpLadakh populations to some hpEurope strains. HP0619 and jhp0562 when considered independently show a generally typical pattern as well, but when these two genes are considered together, they show evidence of genetic exchange between hpLadakh and hpEastAsia.

Considering the overall patterns of all of the genes (Figure 2), and the different patterns illustrated by HP0379, HP0208, HP0217, and HP0093, there is a general picture in which these populations, hpEurope, hpEastAsia, hpLadakh, hpAfrica1 and hpAfrica2, are genuinely ecologically separate in many regards, as reflected by their general allelic compositions, but there are occasional bridging points which have allowed for the exchange of genes between these subpopulations. The nature of these exchanges, and the fact that they differ on a gene-by-gene basis, is of relevance to those designing and interpreting functional studies, and highlights the need to specifically address these issues of relatedness at both a population and gene-specific level when selecting representative strains for such projects.

Conclusion

The MLST-based population subdivisions appear to broadly represent the allelic gene complements of the population in genes other than those associated with core 'housekeeping' metabolic functions. Therefore the described population subdivisions are a good basis for studying the biology of H. pylori, although the relatedness of the population subdivisions may be slightly different from that currently reported. The branches of the neighbour-joining tree are deeper for the LPS biosynthesis genes than for the housekeeping genes, reflecting their faster evolution. The stable association of long and short repeats within the population suggests that emergence of phase variable repeats is probably a relatively uncommon evolutionary event. Exchange of genes between hpAfrica2 and the more closely related populations can occur, or these populations can acquire genes from or through a common source. However, there are indications that there is considerable additional complexity in the repertoire and number of the LPS biosynthetic genes, and their genetic locations, which can vary significantly between individual strains. Notably, this type of diversity is not reflected in the MLST or comparative genome hybridization studies that have been reported to date. This diversity is poorly reflected by the limited number of currently available genome sequences, and it would be wise to check the presence, sequence, and distribution of any gene of interest when working in non-sequenced strains.

Methods

Bacterial strains and growth conditions

Twenty-three H. pylori strains were selected representing diverse ethnic groups and countries of origin [14], including the sequenced strains 26695 [20] and J99 [21], and the mouse-adapted strain SS1 [22] (Table 1). Culture conditions were as described previously [12].

Amplification and sequencing of the phase variable LPS genes

DNA was prepared from plate cultures using the AquaPure Genomic DNA Isolation Kit (BioRad) according to the manufacturer's instructions. Seven phase variable LPS biosynthetic genes, encoding two alpha-1,3-fucosyltransferases (HP0651 and HP0379), an alpha-1,2-fucosyltransferase (HP0093), a glycosyltransferase homologue of Lex2B (HP0619), two alpha-1,2-glycosyltransferase homologues of RfaJ (HP0208 and jhp0820), and a beta-1,4-N-acetylgalactosamyl transferase (HP0217), and a non-phase variable LPS biosynthetic gene (jhp0562, related to HP0619) were amplified and sequenced. Primers were designed using the published sequences of H. pylori strains 26695 [20] and J99 [21] [See Additional file 1]. PCRs were carried out using Taq DNA polymerase (Invitrogen) according to the manufacturer's instructions. PCR products were cleaned up and sequenced directly on both strands using the primers used for PCR. Automated sequencing used ABI Prism BigDye™ Terminator cycle sequencing, version 3.0 (Applied Biosystems) and was resolved on an ABI Prism 3100 DNA sequencer (Applied Biosystems).

Phylogenetic and sequence analysis

Sequences were assembled using the program Gap4 from the Staden package [23]. Multiple alignments were performed using ClustalW [24] and then manually edited using Seqlab from the Wisconsin Package, version 10.2 (Genetics Computer Group, Madison, Wisconsin) through the Oxford University Bioinformatics Centre. Analyses of nucleotide identity, genetic distances, and synonymous/nonsynonymous substitutions [25] were performed on multiple alignments using Swaap 1.0.0 (Pride, D.T. (2001). Swaap 1.0.0: a tool for analysing substitutions and similarity in multiple alignments. Distributed by the author, available at http://www.bacteriamuseum.org/SWAAP/SwaapPage.htm) [26]. Neighbour-joining trees [27] based on distances with Kimura two-parameter estimates [28] were constructed using MEGA version 2.1 [30], and their robustness was assessed using a bootstrapping procedure (500 repetitions) [29].
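
For readers unfamiliar with the distance measure, the following Python sketch shows how a Kimura two-parameter distance is obtained from a pair of aligned sequences, together with the column resampling that underlies a bootstrap replicate. It is an illustration only and is not the MEGA implementation: the function names are hypothetical, and sites with gaps or ambiguous bases are simply skipped, which is an assumption. With P the proportion of transitions and Q the proportion of transversions, the distance is d = -0.5 ln(1 - 2P - Q) - 0.25 ln(1 - 2Q).

```python
import math
import random

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases (an assumption of this sketch)
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p, q = transitions / sites, transversions / sites
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def bootstrap_columns(alignment: dict, rng: random.Random) -> dict:
    """Build one bootstrap replicate by resampling alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# One transition (A->G) and one transversion (A->C) over 10 sites
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAC"), 4))
```

In a bootstrap analysis such as the one described above, a tree would be rebuilt from the distances of each resampled alignment, and the bootstrap value of a cluster is the percentage of replicates in which that cluster is recovered.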