Genetic diversity of Escherichia coli in gut microbiota of patients with Crohn’s disease discovered using metagenomic and genomic analyses
Crohn’s disease is associated with gut dysbiosis. Independent studies have shown an increase in the abundance of certain bacterial species, particularly Escherichia coli with the adherent-invasive pathotype, in the gut. The role of these species in this disease needs to be elucidated.
We performed a metagenomic study investigating the gut microbiota of patients with Crohn’s disease. A metagenomic reconstruction of the consensus genome content of the species was used to assess the genetic variability.
The abnormal shifts in the microbial community structures in Crohn’s disease were heterogeneous among the patients. The metagenomic data suggested the existence of multiple E. coli strains within individual patients. We discovered that the genetic diversity of the species was high and that only a few samples manifested similarity to the adherent-invasive varieties. The other species demonstrated genetic diversity comparable to that observed in the healthy subjects. Our results were supported by a comparison of the sequenced genomes of isolates from the same microbiota samples and a meta-analysis of published gut metagenomes.
The genomic diversity of Crohn’s disease-associated E. coli within and among the patients paves the way towards an understanding of the microbial mechanisms underlying the onset and progression of the Crohn’s disease and the development of new strategies for the prevention and treatment of this disease.
KeywordsGut microbiota Crohn’s disease Escherichia coli Gene content Inflammatory bowel diseases Metagenomics Pangenome
Adherent-invasive Escherichia coli
Inflammatory bowel diseases
Metagenomic single-nucleotide polymorphism
Pediatric Crohn’s Disease Activity Index
Gut microbiota disequilibrium is among the accepted hallmarks of Crohn’s disease (CD). Although the aetiology of this disease is still not completely clear, in addition to genetics and lifestyle, the microbial community structure is among the factors contributing to its pathology . The importance of understanding the role of microbiota in CD and the promising opportunity of the therapeutic modulation of its composition have been recently demonstrated by successful CD treatment via faecal transplantation in pilot studies (for example, in adult  and paediatric  patients; a recent meta-analysis includes 11 studies ).
Using metagenomic high-throughput sequencing, the changes in the structure and function of gut microbiota in CD can be assessed. Surveys of 16S rRNA and “shotgun” formats have revealed a decrease in the total species diversity and abundance of butyrate-producing species accompanied by the growth of Proteobacteria and other opportunist species [5, 6, 7, 8]. An analysis of “shotgun” sequencing data has provided evidence of shifts in the functional composition, including the increased presence of genes linked to inflammation, oxidative stress (including glutathione transport) and amino acid degradation . Changes in microbial metabolism in the ileum of CD patients have also been reported based on proteomic studies . Correlation analyses of microbial gene abundance levels have shown that the decrease in community richness in CD is likely due to the extinction of understudied species without representative reference genomes . In addition to the changes in the microbial community, the gut virome composition has been found to be abnormal in patients with CD [11, 12].
The metagenomic profiling of microbiota in patients with CD is complicated by certain methodological challenges. Antimicrobial, anti-inflammatory and other types of treatments represent independent factors that strongly influence the community composition [7, 13]. To distinguish the effect of medical prescriptions from the effect caused by the disease, the metagenome has been examined in treatment-naive patients, e.g., children . However, in the above-mentioned study, the ileal microbiota was profiled using 16S rRNA sequencing of biopsy samples, which limited the assessment of the gene content (biopsy samples are not directly applicable due to the absolute dominance of human DNA over the microbial fraction ). Furthermore, species with highly similar 16S rRNA gene sequences might carry substantially different sets of accessory genes, including groups of genes with special clinical significance, such as antibiotic resistance determinants and virulence factors .
To date, no single microbial species has been identified as an unambiguously strong factor contributing to the onset of Crohn’s disease. However, Escherichia coli, particularly the adherent-invasive pathotype (AIEC), is among the species often detected in increased abundance in CD patients [17, 18]. This species is a beneficiary of inflammation that is able to persist in the gut mucosa in inflammatory tissue. Specific varieties isolated from CD patients with the AIEC pathotype are able to invade epithelial cells, survive in macrophages [19, 20] and form biofilms . The increase in Proteobacteria in the ileum has been shown to be correlated with the Pediatric Crohn’s Disease Activity Index (PCDAI) . Physiologically, the survival of E. coli is linked to an increase in the reactive oxygen species levels and a decline in the populations of commensal bacteria which normally provide a protective effect in the host’s intestine. E. coli is not associated with the underlying pathology of CD but rather a marker of the inflammatory process . Although E. coli is among the most examined model organisms, the enormous genomic variability of this species poses many questions regarding its functioning in vivo. Its identity and role in CD are widely discussed. In CD, are the E. coli present clonal or are there distinct genotypes of the organism present? What are the specific features distinguishing E. coli in CD patients from E. coli in healthy subjects? Finally, is the species a primary cause of Crohn’s disease contributing to the onset of the disease via interplay with the other risk factors or a secondary “first available” species emerging in an environment of general gut imbalance that exacerbates disease progression?
Here, we attempted to decipher the microbial causal factors of CD at multiple levels, i.e., from the gut community level to the single species level (E. coli), by integrating cultivation-independent and -dependent methods. We conducted a “shotgun” metagenomic analysis of gut microbiota in CD using stool and ileal content samples collected from patients with CD at two clinical centres.
Results and discussion
Microbiota in Crohn’s disease patients exhibits a range of abnormal community structures
Metagenomic analysis of E. coli genome suggests the existence of multiple strains within the same patients
One of the most obvious differences between the CD patients and the healthy populations from Russia and other countries was the order of magnitude increase in the Escherichia coli relative abundance in the stool samples (2.4% ± 16.5% vs. 0.2% ± 7.7%, one-tailed Mann-Whitney test p = 0.00167). The ileal E. coli levels were also high (2.8% ± 4.1%) but were not correlated with the respective stool levels in the same patients (n = 5, Pearson correlation r = − 0.22). Previously, we performed a cultivation-dependent study of this species associated with CD via the isolation and genome sequencing of a subset of the same microbiota samples . To deepen our understanding of the specifics of CD-associated E. coli, here, we compared the gene content of this bacterial species reconstructed from “shotgun” metagenomic data with genomes isolated from gut microbiota samples from the same patients (Additional file 1: Table S2).
Overall, the analysis of the metagenomes confirmed the presence of the species with genomes highly similar to that of the respective isolates. The mapping of the metagenomic reads to the corresponding E. coli isolate genomes and analysis of the read coverage using the maximum expected coverage (MEC) algorithm (see Materials and methods) showed that for each of the seven metagenomes analyzed, the corresponding E. coli isolate genome was the most similar and that the similarity was significantly higher than the similarity between random “genome-metagenome” pairs (MEC = 0.99 ± 0.01 vs. 0.81 ± 0.08, one-tailed Mann-Whitney test, p ≤ 10− 7) (Additional file 1: Figure S1).
The metagenomic data also supported certain important extra-chromosomal specifics of E. coli identified via the genomic analysis of the isolates. The mapping of the metagenomic reads to the sequence of the plasmid pLF82 (present in the CD-associated E. coli strain LF82 ) yielded 4 samples with a high coverage (17–73% of the plasmid sequence was covered), and all four of the corresponding E. coli isolate genomes were shown to have sequences with a high level of identity to pLF82  (Additional file 1: Table S3A). Moreover, some subjects from the healthy Russian population also had metagenomic sequences with identity to this plasmid (Additional file 1: Table S3B); interestingly, these samples had increased fractions of E. coli and other opportunistic bacteria (Streptococcus, Enterococcus and Klebsiella).
For a more precise comparison of the genomes and metagenomic profiles of E. coli from the same samples, we performed an analysis of metagenomic single-nucleotide polymorphisms (mSNPs) (see Materials and methods). In total, there were 5 “genome-metagenome” pairs in which the metagenome produced sufficient coverage of the E. coli genome to perform this analysis. In 3 of these pairs, the distance was orders of magnitude lower than the distance within the mixed “genome-metagenome” pairs (distance d = 0.00026–0.00520 vs. 0.018 ± 0.011, p = 0.002, one-tailed Mann-Whitney test; see Additional file 1: Table S4). However, in the other 2 pairs, the distance was significantly higher (d = 0.016 and 0.010). To examine the possibility of subspecies-level diversity in the discrepant pairs, we estimated the mSNP allele frequency. Interestingly, in each of these two pairs, the fraction of mSNPs with a second major allele identical to the letter present in the respective genome was higher than that of any of the mixed “genome-metagenome” pairs involving the same genome (95% vs. 84% ± 4 and 75% vs. 62% ± 5%, respectively). Our results suggest the existence of more than one E. coli strain in the microbiota samples and that the sequenced genome corresponds to one of the dominant strains. The analysis of the subspecies-level diversity of the metagenomes using an alternative approach, i.e., the ConStrains tool , also revealed the presence of two or more strains of E. coli in 7 of the 28 analysed metagenomes (see Additional file 2: Table S5).
CD-associated E. coli is genetically diverse
To assess the gene composition landscape of E. coli as reflected by the metagenomic data, we performed pre-mapping to a global gut microbial gene catalogue  and selected the genes belonging to E. coli from the catalogue. Initially, the E. coli pangenome was identified using a representative set of various E. coli genomes, including commensal, pathogenic and other strains; then, based on the sequence similarity, the genes corresponding to the E. coli pangenome were extracted from the gene catalogue (see Materials and methods). The analysis of the presence/absence of genes was performed on the level of orthology groups (OGs). We separately analysed the part of the pangenome corresponding to an accessory genome (AG) containing 2993 of the 5598 OGs.
Our previous genomic analysis of strains isolated from gut microbiota of CD patients showed that the functions that mostly differentiate CD-associated E. coli from commensal strains include the utilization of propanediol (and other sugar alcohols) and iron uptake . We decided to explore how this understanding conforms to the results obtained from the “shotgun” metagenomic profiling of the same and an additional set of samples from CD patients. To analyse the metagenomic profile of the E. coli gene content in the context of the available genomic data, we converted a representative set of E. coli genomes to an accessory genome presence/absence profile (see Materials and methods) and compared this profile to metagenomic E. coli images of CD patients from Russia, the USA  and Denmark  and healthy populations worldwide. Using the AG profiles, the metagenomic E. coli images appeared to be more similar to the genomes of the strains isolated from the same samples (0.25 ± 0.09, n = 9 comparisons, binary metric) than all unrelated AG profiles of the worldwide populations were to each other (0.37 ± 0.09, one-tailed Welch t-test, p = 0.001); the same result was observed when the comparison was performed using only the E. coli virulence genes (17 genes selected as described in the Methods; 0.20 ± 0.28 vs. 0.5 ± 0.19, respectively; p = 0.04082). The paired AG profiles of the stool and ileal samples from the same patients were highly similar (d = 0.06 and 0.17, n = 2 pairs).
To identify the genotype of the most recent common ancestor of the pathogen-enriched Clade 1, we identified the genes prevalent in this clade (n = 1128; OGs detected in > 80% of the metagenomes) and at the same time rare in Clade 2 (n = 798; OGs detected in < 20% of the metagenomes). Forty-seven OGs were obtained. Furthermore, we refined the list by excluding the OGs with a similar OG detected in most Clade 2 members (see Materials and methods), ultimately yielding 34 genes.
The list representing the signature of Clade 1 (Additional file 1: Table S6) contains several genes that are remarkable from the pathobiont arsenal perspective. The proteins coded by some of these genes are known to play important roles in biofilm formation as follows: holin-like protein CidA facilitates the release of DNA that fortifies the biofilm , and toxin-antitoxin biofilm protein TabA influences biofilm dispersal . Putative neuraminidase (sialidase) is another gene involved in mucus degradation that potentially contributes to the mucus adherence of E. coli as follows: sialidases are known factors involved in the pathogenesis of microbial infections by helping bacteria survive in the outer mucosa layer by cleaving mucin glycoproteins [46, 47]. The levels of its particular type, intramolecular trans-sialidase, have been shown to increase in microbiota of IBD patients . Several genes are associated with cell adhesion and fimbriae facilitating successful colonization of the gut, including long polar fimbrial protein LpfD and putative fimbrial-like protein YcbV precursor. The presence of the metallo-beta-lactamase superfamily proteins suggests that resistance to clinically important antibiotics is prevalent among the members of Clade 1. Finally, the presence of microcin-E7 immunity protein (conferring immunity to toxin colicin E7 produced by bacteria) suggests that such a toxin that is normally carried on a plasmid is also prevalent in CD-associated E. coli belonging to this clade.
For further validation, a cluster analysis of the reconstructed E. coli gene content was additionally performed using the gene-family profiles obtained using an alternative approach, i.e., PanPhlAn . Once the samples from the Russian CD patients were clustered, a tree visually similar to that produced using our approach was obtained (Additional file 1: Figure S3A); this similarity was confirmed by comparing the distance matrices yielded using our method and PanPhlAn (Mantel test R = 0.928, p = 0.001). Once this set was extended (as an example) by 6 metagenomes from USA treatment-naive CD patients and 7 metagenomes from the Chinese control group, the resulting tree also included 2 main branches that were generally similar to those in the tree based on the AG profiles (Additional file 1: Figure S3B; Mantel test R = 0.8824, p = 0.001).
Comparison of accessory genomes to matched controls shows the lack of genetic commonalities in CD-associated E. coli
Remarkably, most Russian CD metagenomes were located in Clade 2 (36 genomes and 68 metagenomes; see Fig. 4). Clade 2 contains few pathogenic (or disease-associated) strains. Therefore, we analysed the contents of Clade 2 in another way. E. coli in each particular case of CD can possess a specific set of genes that allow it to thrive in the gut environment and produce inflammation. To examine this question in greater detail, we compared the AG profile of each CD metagenome in Clade 2 (11 Russian and 8 non-Russian patients) with the closest match to the metagenome of a healthy subject. While most subjects were not examined endoscopically, at the time of sampling, the subjects did not manifest signs of intestinal discomfort.
Nineteen matched control comparisons were performed. The matched controls closest to CD in terms of the AG profile were the healthy Russian (n = 10 comparisons) and Chinese (n = 9) populations; the samples within the compared pairs were 1.4 times closer than random pairs (0.23 ± 0.03 vs. 0.33 ± 0.05, p = 2.268E-14). In each comparison, 227 ± 76 genes per comparison were identified to be CD-specific (present in CD but lacking in the non-CD control; 1704 unique genes in all comparisons in total). This distribution was found to be generally comparable with that obtained in the inverse comparison (261 ± 165 genes per comparison present in the E. coli profile of a control but not in that of a matching CD patient; paired Mann-Whitney test p = 0.52; n = 1763 unique genes in total).
We analysed the genes that are commonly present in E. coli in CD patients but lacking from one of the matched controls (identified in ≥5 of the 19 comparisons). In total, 175 and 205 genes were identified in the CD and control metagenomes, respectively (80 and 137 after discarding the genes with unknown functions, see Additional file 1: Table S7). The lists of CD- and control-specific genes contained a similar fraction of genes related to bacteriophages (15% vs. 19%) and horizontal gene transfer (2.5% vs. 3.6%). No genes related to biofilm formation were included in the lists. However, the CD-specific list was enriched in mercury metabolism genes (5% vs 0%). The E. coli accessory genome diversity results showed that most CD metagenomes harbouring E. coli strains are genomically diverse and distant from the AIEC pathotype typically described as CD-associated (contained in Clade 1).
Overall, the increasing incidence of inflammatory bowel diseases, including Crohn’s disease, in developed countries is a frightening prospect in terms of morbidity and a challenge for the modern global healthcare system. A meta-analysis showed that the annual incidence is as high as 12.7 and 20.2 per 100,000 person-years in Europe and North America, respectively . The Western lifestyle includes not only the acquisition of material wealth and social security but also the following undesirable additions: underexposure to bacteria during birth and at an early age, the consumption of industrial foods rich in preservatives and artificial additives, a sedentary lifestyle, and decreased intestinal motility. A plethora of these previously underestimated factors contributes to the imbalance of the “host-microbiota” system and is associated with an increased risk of IBD. Given the rapid globalization, Crohn’s disease is a potential threat for all individuals.
To date, studies attempting to identify the ultimate causative factors of this disease have not been successful. In particular, although microbiota may be involved in the pathogenesis , its participation is not completely elucidated. The available results suggest that no single reference healthy gut microbiota exists . The species-level composition is known to vary widely but still provides functional homeostasis . We suggest that following external perturbations, the emerging pathological process might easily resolve in some subjects, but in other subjects this process may transform into chronic inflammation and may be due to an imbalance in the microbiota . An important question is where to search for markers of this imbalance, i.e., at the global community, subcommunities or individual opportunist pathogen levels. Several studies have shown associations between particular gut bacterial species and Crohn’s disease. Some of these species have been hypothesized to be causative agents of the disease [17, 55]. We approached this problem from different scales of the gut system, i.e., at the single species level, particularly E. coli, which is among the most referenced species in this context [56, 57, 58], and the whole gut community level.
Although E. coli is the first and most well-studied human gut species, its levels are normally relatively low compared with species that are members of the dominant phyla (Bacteroidetes, Firmicutes and Actinobacteria). The low levels are likely linked to the fact that E. coli has a specific ecological niche as follows: this species does not metabolize polysaccharides , which is the main source of energy in the colon. While Proteobacteria may normally prevail in the neonate gut to prepare the niche for the further colonization of commensals , in adults, the increased abundance of E. coli is associated with various diseases, including colon cancer , liver diseases , cystic fibrosis , and alcoholic dependence syndrome  probably due to the inflammation that supports the growth of E. coli via the selective suppression of commensal taxa ; this increase can also occur following treatment with proton pump inhibitors .
In our analysis of the gut communities in patients with Crohn’s disease, we revealed significant dysbiosis in stool and ileal metagenomes. One of the most abundant genera was Bacteroides. The increase of Bacteroides has been associated with the risk for the development of intestinal inflammation . In general, among all commensals, Bacteroides fragilis had the highest abundance. Enterotoxigenic B. fragilis which possess the bft gene has been hypothesized to be associated with IBD ; however, the two bft gene sequences from the gene catalogue were not detected in any of the CD metagenomes (data not shown). The observed increase in opportunistic pathogens is consistent with previously published results [8, 14]. Interestingly, the disruption of the community structure appears to be due to an increase in different opportunist species in each patient and, thus, lacks a single direction. However, the significantly increased fraction of E. coli in comparison with that in the healthy populations from Russia and other countries was a universal feature. The identification of the specific gene content of CD-associated E. coli could enhance our understanding of the gene determinants of the active role of the species in inflammation and, based on variability, differentiate possible scenarios related to pathogenesis. We previously demonstrated the broad variability of CD-associated E. coli using genomic sequencing of isolates of gut microbiota from patients . Here, we extended the scope of the analysis by approximating the consensus composition of the accessory genome of the species using metagenomic data. Our approach allowed us to avoid cultivation-associated biases and include the extensive volume of published human metagenomic datasets in a comparative analysis.
CD-associated E. coli has been previously suggested to be predominantly represented by AIEC strains, including LF82 and O83:H1 . However, our results show that a large number of E. coli genotypes were detected in CD patient metagenomes. Genotypes similar to those of the AIEC strains were only detected in a small fraction of CD patients from the Russian and other populations worldwide. The respective Clade 1 E. coli strains from CD patient metagenomes possessed genes associated with biofilm formation, antibiotic resistance, mucus and cell adhesion. The presence of genes associated with virulence does not unambiguously imply the pathogenicity of a strain [68, 69]. Thus, a pathogenicity score based on the number of such genes in a genome is unreasonable. However, the genes identified can be considered genetic determinants conferring a high level of fitness to CD-associated strains in the human gut (based on their high abundance). This hypothesis is supported by the fact that in addition to AIEC and several other pathogenic strains, the probiotic strain Nissle 1917 is included in Clade 1. Most CD-associated E. coli were found to be located in the Clade 2. An alternative approach for identifying genetic commonalities among E. coli in this clade, i.e., comparing the AG profiles with those of matched healthy subjects, only yielded an enrichment in mercury metabolism genes. As heavy metal resistance genes are often carried on plasmids along with antibiotic resistance and virulence factors [70, 71], we speculate that their prevalence is linked to the higher prevalence of mobile elements (particularly plasmids) in CD-associated E. coli than in E. coli inhabiting the gut of healthy subjects. Interestingly, the richness of the gene repertoire in CD-associated E. coli was comparable to those that are in non-CD subjects, although it is generally accepted that pathogenic strains of E. coli tend to possess larger genomes than commensal strains . The lack of specific virulence-associated genes commonly distinguishing CD-associated E. coli in Clade 2 from the matched control E. coli suggests that there are no universal genetic determinants conferring Clade 2 E. coli virulence in CD. Generally, our observations suggest that CD-associated E. coli do not possess a specifically defined gene composition but is represented by a broad range of biotypes varying in gene content; the species is likely involved in disease progression rather than onset. These results are consistent with a recent cultivation-based study investigating AIEC  that failed to identify a molecular property exclusive to this phenotype. Our conclusions support the concept that Crohn’s disease is a syndrome, i.e., a disease in which similar manifestations in multiple cases are caused by different factors in each case. It is possible that some genes detected in our prediction of the E. coli accessory gene content from the metagenomic data in fact originated from other species (i.e., had been obtained from Escherichia/Shigella spp. via horizontal transfer). However, the general validity of our method of assessing the E. coli accessory genome content from the metagenome is supported by the high similarity with the core genomic data of the isolates from the same microbiota samples.
Furthermore, our results extended the previous findings as follows: at a subtler level, the analysis of the E. coli subspecies-level diversity within a microbiota of CD patients performed using our metagenomic SNP profiling algorithm and the ConStrains algorithm revealed that some CD patients host multiple strains of E. coli in their gut microbiota. Although the data was derived from only a few samples, the considerable size of the effect is supported by the fact that in each of the samples, we identified thousands of sufficiently covered mixed-type mSNPs supporting the heterogeneity. The effect should be further examined via the extensive isolation of multiple strains using a variety of growth media and genome sequencing. Such diversity might reflect microevolutionary snapshots of E. coli or its co-existence at different loci in the gut. Previous genomic analyses of isolates demonstrated the possibility that multiple strains of the same species co-exist within the microbiota of the same subjects, such as, Helicobacter pylori in the stomach [74, 75] and Propionibacterium acnes on the skin . The metagenomics data also provided evidence for such a phenomenon in the microbiota of healthy subjects . Here, for the first time, we demonstrated the genomic heterogeneity of E. coli within the same CD patients. This understanding is eye-opening for clinicians because it implies the need to carefully revise the treatment for Crohn’s disease as follows: an antimicrobial therapy scheme targeted against one biotype of E. coli might be ineffective against the other biotypes associated with the condition and may cause collateral damage to both the commensal microbiota and organism of a patient. In particular, there is a need to use a diverse set of organisms as part of probiotic treatment scheme as the experimental evidence shows that multiple E. coli pathotypes occupy significantly different ecological niches .
Our results expand the current understanding of how microbial components contribute to Crohn’s disease onset and progression and pave the way for the development of better strategies for personalized diagnostics and treatment of inflammatory bowel diseases.
The samples were collected from a cohort of patients with CD (n = 19) enrolled at two clinical centres (Moscow Clinical Scientific Center and State Scientific Center of Coloproctology, Moscow, Russian Federation). The inclusion criteria were as follows: patients aged 18 years or older who were endoscopically and radiologically diagnosed with and histologically confirmed to have Crohn’s disease. The exclusion criteria included signs of indeterminate colitis, infectious diseases, anamnesis of total colectomy, the presence of stoma, and antibiotic treatment. The faecal samples were collected prior to the preparation for endoscopy. The bowel preparation was performed with a polyethylene glycol solution. The patients underwent ileocolonoscopy at the clinical centres. During this procedure, the ileum liquid content was aspirated. In total, 9 stool and 15 ileal samples were obtained (Additional file 1: Table S8).
Sample preparation protocol
The stool samples were prepared according to the following protocol. A portion of 0.3–0.5 g of faeces was added to 10 ml of PBS, vigorously vortexed for 2–5 min and repeatedly cooled on ice. Human cells and debris were pelleted by centrifugation for 10 min at 0.1 g. The supernatants were collected and placed on ice. The resuspension and centrifugation of the pellet were repeated 3–5 times. The combined supernatant was centrifuged for 20 min at 13.5 g to collect the microbial fraction. The pellet was resuspended, washed twice with PBS and stored at − 20 °C.
The ileal samples were collected as ileal lavage fluid collected during endoscopy by washing the ileal wall with saline. In total, 1 ml of fluid was used for the DNA extraction from such samples. Unlike the stool sample preparation, a lower dilution (3–5 times) and a lower number of repeats of resuspension and centrifugation (no more than twice) were performed.
Preparation of metagenomic libraries and “shotgun” sequencing
The “shotgun” libraries were prepared for the sequencing platforms SOLiD 4, SOLiD 5500 and Ion Torrent (Life Technologies, USA). The sequencing was performed according to the manufacturer’s instructions. For SOLiD 4, the SOLiD Fragment Library Construction Kit, SOLiD Fragment Library Barcoding Module 1–16, SOLiD EZ Bead TM E80 System Consumables, SOLiD ToP Sequencing Kit and MM50/5 (Life Technologies, USA) were used. For SOLiD 5500, the 5500 SOLiD Fragment Library Core Kit, SOLiD Fragment Library Barcoding Kit, SOLiD FlowChip Kit, SOLiD FWD SR S50 Kit and SOLiD Run Cycle Buffer Kit (Life Technologies, USA) were used. For Ion Torrent PGM, the Ion Xpress Plus Fragment Library Kit, Ion Sequencing Kit, Ion PGM Template OT2 200 Kit, Ion PGM Sequencing 200 Kit and Ion 318 Chip Kit (Life Technologies, USA) were used.
Three of the samples were sequenced using more than one platform; in total, 28 metagenomic read sets were obtained (Additional file 1: Table S8).
Taxonomic and functional profiling of metagenomes
All analytical steps of the study are summarized in the Additional file 1: Figure S4. The taxonomic composition analysis was performed by mapping the metagenomic reads against a representative non-redundant reference catalogue of 353 gut microbial genomes  and estimating the relative abundance levels based on the obtained coverage profiles as previously described . The functional composition was assessed in a similar way as previously described by using a reference gene catalogue containing 3.3 mln gut microbial genes . For the comparative analysis, stool metagenomic datasets from 3 studies involving the following healthy populations worldwide were used: China (n = 68) , Denmark (n = 85) , and USA (n = 138) ; in addition, Crohn’s disease patients from the USA (treatment-naive paediatric patients; n = 17)  and Spain (clinical remission, n = 13) [10, 40] were included.
Analysis of maximum-expected coverage
Analysis of metagenomic SNPs
To assess the subspecies level genomic variability of E. coli using “shotgun” metagenomes, an algorithm for metagenomic SNP (mSNP) calling was developed. This approach is similar to SNP calling for a single sequenced genome and uses heuristically selected parameters. Essentially, the mapping of a single metagenomic readset to a reference genome catalogue yields an alignment file (in BAM format) containing the coverage profile of each of the genomes. In sufficiently covered genomes (with MEC ≥ 0.5 and mean coverage ≥5×), only the nucleotide positions with coverage ≥4× are considered. For each such position, a consensus letter is determined using ad hoc scripts as a letter supported by the highest number of reads (which should not be less than 4); if two major alleles are supported by an identical number of reads, one allele is randomly selected. A metagenomic SNP is defined as a position in a consensus sequence with an alternate letter compared to the reference genome. As a genomic dissimilarity measure of E. coli species between two metagenomes, we measured the Hamming distance between the two respective consensus sequences adjusted based on the coverage information; only the genome fragments belonging to core genes were used. Specifically during this analysis focused on metagenomic SNPs, each metagenome was mapped to a reference genome of the E. coli strain K-12 alone. To compare the E. coli genomes with the metagenomic profiles of E. coli, each isolated genome was randomly fragmented into reads (mean targeted coverage 100×) before being subjected to identical consensus sequence identification (mapping to a reference genome catalogue, etc.).
Selection of E. coli pangenome and accessory genes in the global gene catalogue
A representative set of 81 diverse E. coli genomes was formed (among them, 28 genomes belonged to commensal strains and 43 genomes belonged to pathogenic strains, including 4 known CD-associated strains [23, 42, 43, 82] and 10 strains previously isolated from Russian CD patients from the examined cohort ) (Additional file 1: Table S9). Using this set, gene orthology groups (OGs) were constructed using OrthoMCL  by 50% identity of amino acid sequences. As some of the produced OGs were divided by the clustering algorithm into multiple groups due to the incompleteness of the gene sequences, to compensate for this effect, the potentially divided groups were merged using the following heuristic: for each two OGs A and B, pairwise BLASTp alignments of a gene from A to a gene from B were considered for such possible pairs; the score S = (% identity) × (% match) was computed for each pair; if the average S across all gene pairs exceeded 6400 (80 × 80), the ОGs were merged into a single OG.
The nucleotide sequences of all genes of the obtained OGs were aligned against the 3.3 mln gut microbial reference gene catalogue to yield highly similar matches (similarity criterion: > 80% identity and > 80% of both subject and query sequence lengths). The obtained 9125 matching genes in the catalogue belonged to 5598 OGs; these genes were further used as a template to assess the presence of the E. coli pangenome in the gut metagenomes (the detailed statistics used for the pangenome construction are described in Additional file 1: Table S10). Core genes (genes occurring in all 81 genomes) were subtracted from the pangenome to yield a set of E. coli accessory genes (AG) (2993 OGs).
Virulence-associated genes were identified among the accessory genes of E. coli by aligning the sequences of the latter to a published list of 76 E. coli virulence-associated genes  (BLASTn search was performed with the following thresholds: e-value <1E-5, sequence coverage > 80%, identity > 80%). The search yielded 17 genes listed in the Additional file 1: Table S11.
Profiling of E. coli gene content in metagenomes and genomes
The gene content of E. coli in the gut metagenomes was estimated in the form of a binary vector of the presence/absence of each gene included in the pangenome of the species. A gene was considered present if at least 1 read was mapped to the gene during the mapping of all reads to the reference gene catalogue. Here, to adjust for variation in the sequencing coverage and E. coli relative abundance across the metagenomes, for each metagenome, a random subsampling was simulated such that the total number of reads mapped to the pangenome was 80,000. Metagenomes with a lower number of pangenomic reads or less than 50% coverage of the pangenome were not considered in the analysis of the pangenome and accessory genome. The accessory gene (AG) profile was obtained from the pangenome presence profile by filtering genes corresponding to the core genome. A pangenome and accessory profiles were also produced for genomes in the same format based on the alignment of the genomes against the reference gene catalogue  using BLASTn (similarity criterion: > 80% identity for > 80% of the gene length). The pairwise dissimilarity between the AG profiles was calculated using a binary metric (using the function dist in R package stats). Hierarchical clustering was performed using the average method.
During the stage of refining the orthology groups specific to Clade 1, the candidate signature OGs of Clade 1 with an OG with an identical function description detected in ≥20% of the AG profiles in Clade 2 were excluded. Here, the OG similarity score was defined as a product of the percent sequence identity and percent query matching length averaged over all possible pairs of genes between the two groups.
Alternative profiling of E. coli gene content
To validate the results of the clustering based on the AG profiles, we used PanPhlAn tool , which maps the metagenomic reads against the reference genomes of the target species and then, based on the gene coverage levels, reconstructs the unique gene set of a strain present in the sample. The resulting gene content of E. coli in the metagenome is also in the form of a binary vector of the presence/absence of each gene family. The pre-processed database of the E. coli pangenome contained 34,881 gene families. The threshold for a gene family to be considered present in the final presence/absence gene profile was set to a coverage depth > 0.5× times the mean coverage.
This study was supported by the Russian Science Foundation (project 16-15-00258).
Availability of data and materials
The metagenomic reads generated during the study were linked under BioProject PRJNA296946 and deposited in the Sequence Read Archive (SRA) under project ID SRP064372.
VMG conceived the study. VMG, DGA, PLS, ILK, IVM, ESK and TVG supervised the work. MVS, AYK, DNA, EAB and YMB collected the samples. DVR, JPB, VGL, IYK, TAS, AKL, MNS and SYM performed the sample preparations and sequencing. AVP performed the study coordination. AVT, AIM, AVK, DSI, ASP, AVE and BAK analysed the data. AVT, VMG, DGA, AIM, AVE and BAK prepared the manuscript. VMG, AIM and DGA revised the manuscript. All authors have read and approved the final manuscript.
Ethics approval and consent to participate
This study was approved by the ethics committees of the Moscow Clinical Scientific Center and State Scientific Center of Coloproctology, Moscow, Russian Federation and carried out following the rules of the Declaration of Helsinki of 1975. All patients provided written informed consent for the sample collection and personal data processing.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 4.Paramsothy S, Paramsothy R, Trainee G, Rubin DT, Kamm MA, Kaakoush NO, et al. Faecal Microbiota Transplantation for Inflammatory Bowel Disease: A Systematic Review and Meta-analysis. J Crohns Colitis. 11(10):1180-1199.Google Scholar
- 8.Pascal V, Pozuelo M, Borruel N, Casellas F, Campos D, Santiago A, et al. A microbial signature for Crohn ’ s disease; 2017. p. 813–22.Google Scholar
- 10.Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S, et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol. 2014;32:822–8 https://doi.org/10.1038/nbt.2939%5Cnhttp://www.cbs.dtu.dk/databases/CAG/.CrossRefGoogle Scholar
- 12.Pérez-Brocal V, García-López R, Nos P, Beltrán B, Moret I, Moya A. Metagenomic analysis of Crohn’s disease patients identifies changes in the Virome and microbiome related to disease status and therapy, and detects potential interactions and biomarkers. Inflamm Bowel Dis. 2015;21:2515–32. https://doi.org/10.1097/MIB.0000000000000549.CrossRefPubMedGoogle Scholar
- 14.Gevers D, Kugathasan S, Denson L A, Vázquez-Baeza Y, Van Treuren W, Ren B, et al. the treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 2014;15:382–392. doi: https://doi.org/10.1016/j.chom.2014.02.005.
- 18.Lapaquette P, Glasser A-L, Huett A, Xavier RJ, Darfeuille-Michaud A. Crohn’s disease-associated adherent-invasive E. coli are selectively favoured by impaired autophagy to replicate intracellularly. Cell Microbiol. 2010;12:99–113. https://doi.org/10.1111/j.1462-5822.2009.01381.x.CrossRefPubMedGoogle Scholar
- 21.Martinez-Medina M, Aldeguer X, Lopez-Siles M, González-Huix F, López-Oliu C, Dahbi G, et al. Molecular diversity of Escherichia coli in the human gut: new ecological evidence supporting the role of adherent-invasive E. coli (AIEC) in Crohn’s disease. Inflamm Bowel Dis. 2009;15:872–82.CrossRefGoogle Scholar
- 23.Miquel S, Peyretaillade E, Claret L, de Vallée A, Dossat C, Vacherie B, et al. Complete genome sequence of Crohn’s disease-associated adherent-invasive E. coli strain LF82. PLoS One. 2010;5. https://doi.org/10.1371/journal.pone.0012714.
- 25.La Carbona S, Sauvageot N, Giard J-C, Benachour A, Posteraro B, Auffray Y, et al. Comparative study of the physiological roles of three peroxidases (NADH peroxidase, alkyl hydroperoxide reductase and thiol peroxidase) in oxidative stress response, survival inside macrophages and virulence of enterococcus faecalis. Mol Microbiol. 2007;66:1148–63. https://doi.org/10.1111/j.1365-2958.2007.05987.x.CrossRefPubMedGoogle Scholar
- 26.Palmer KL, Godfrey P, Griggs A, Kos VN, Zucker J, Desjardins C, et al. Comparative Genomics of Enterococci: Variation in Enterococcus faecalis, Clade Structure in E. faecium, and Defining Characteristics of E. gallinarum and E. casseliflavus. MBio. 2012;3:e00318–1. https://doi.org/10.1128/mBio.00318-11.CrossRefPubMedPubMedCentralGoogle Scholar
- 27.Lobatón T, Hoffman I, Vermeire S, Ferrante M, Verhaegen J, Van Assche G. Aeromonas species: an opportunistic enteropathogen in patients with inflammatory bowel diseases? A single center cohort study. Inflamm Bowel Dis. 2015;21:71–8. https://doi.org/10.1097/MIB.0000000000000247.CrossRefPubMedGoogle Scholar
- 33.Wang W, Chen L, Zhou R, Wang X, Song L, Huang S, et al. Increased proportions of Bifidobacterium and the lactobacillus group and loss of butyrate-producing Bacteria in inflammatory bowel disease. J Clin Microbiol. 2014;52:398–406. https://doi.org/10.1128/JCM.01500-13.CrossRefPubMedPubMedCentralGoogle Scholar
- 37.Machiels K, Joossens M, Sabino J, De Preter V, Arijs I, Eeckhaut V, et al. A decrease of the butyrate-producing species Roseburia hominis and Faecalibacterium prausnitzii defines dysbiosis in patients with ulcerative colitis. Gut. 2014;63:1275–83. https://doi.org/10.1136/gutjnl-2013-304833.CrossRefPubMedGoogle Scholar
- 39.Luo C, Knight R, Siljander H, Knip M, Xavier RJ, Gevers D. ConStrains identifies microbial strains in metagenomic datasets. Nat Biotechnol. 2015. https://doi.org/10.1038/nbt.3319.
- 51.Sommer F, Rühlemann MC, Bang C, Höppner M, Rehman A, Kaleta C, et al. Microbiomarkers in inflammatory bowel diseases : caveats come with caviar. Gut. 2017;:1–6.Google Scholar
- 52.Tyakht AV, Alexeev DG, Popenko AS, Govorun VM. Rural and urban microbiota: To be or not to be. Gut Microbes. 2014;5 https://doi.org/10.4161/gmic.28685.
- 57.Viladomiu M, Kivolowitz C, Abdulhamid A, Dogan B, Victorio D, Castellanos JG, et al. IgA-coated E. coli enriched in Crohn’s disease spondyloarthritis promote T H 17-dependent inflammation. Sci Transl Med. 2017;9:eaaf9655. https://doi.org/10.1126/scitranslmed.aaf9655.CrossRefPubMedPubMedCentralGoogle Scholar
- 70.Osborn AM, Bruce KD, Strike P, Ritchie DA, Gadd GM, Hughes MN, et al. Distribution, diversity and evolution of the bacterial mercury resistance (mer) operon. FEMS Microbiol Rev. 1997;19:239–62. https://doi.org/10.1111/j.1574-6976.1997.tb00300.x.CrossRefPubMedGoogle Scholar
- 76.Tomida S, Nguyen L, Chiu B-H, Liu J, Sodergren E, Weinstock GM, et al. Pan-genome and comparative genome analyses of propionibacterium acnes reveal its genomic diversity in the healthy and diseased human skin microbiome. MBio. 2013;4:e00003–13. https://doi.org/10.1128/mBio.00003-13.CrossRefPubMedPubMedCentralGoogle Scholar
- 79.Yarygin K, Tyakht A, Larin A, Kostryukova E, Kolchenko S, Bitner V, et al. Abundance profiling of specific gene groups using precomputed gut metagenomes yields novel biological hypotheses. PLoS One. 2017;12:e0176154. https://doi.org/10.1371/journal.pone.0176154.CrossRefPubMedPubMedCentralGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.