Dynamics of microbial populations mediating biogeochemical cycling in a freshwater lake
Microbial processes are intricately linked to the depletion of oxygen in in-land and coastal water bodies, with devastating economic and ecological consequences. Microorganisms deplete oxygen during biomass decomposition, degrading the habitat of many economically important aquatic animals. Microbes then turn to alternative electron acceptors, which alter nutrient cycling and generate potent greenhouse gases. As oxygen depletion is expected to worsen with altered land use and climate change, understanding how chemical and microbial dynamics impact dead zones will aid modeling efforts to guide remediation strategies. More work is needed to understand the complex interplay between microbial genes, populations, and biogeochemistry during oxygen depletion.
Here, we used 16S rRNA gene surveys, shotgun metagenomic sequencing, and a previously developed biogeochemical model to identify genes and microbial populations implicated in major biogeochemical transformations in a model lake ecosystem. Shotgun metagenomic sequencing was done for one time point in Aug., 2013, and 16S rRNA gene sequencing was done for a 5-month time series (Mar.–Aug., 2013) to capture the spatiotemporal dynamics of genes and microorganisms mediating the modeled processes. Metagenomic binning analysis resulted in many metagenome-assembled genomes (MAGs) that are implicated in the modeled processes through gene content similarity to cultured organism and the presence of key genes involved in these pathways. The MAGs suggested some populations are capable of methane and sulfide oxidation coupled to nitrate reduction. Using the model, we observe that modulating these processes has a substantial impact on overall lake biogeochemistry. Additionally, 16S rRNA gene sequences from the metagenomic and amplicon libraries were linked to processes through the MAGs. We compared the dynamics of microbial populations in the water column to the model predictions. Many microbial populations involved in primary carbon oxidation had dynamics similar to the model, while those associated with secondary oxidation processes deviated substantially.
This work demonstrates that the unique capabilities of resident microbial populations will substantially impact the concentration and speciation of chemicals in the water column, unless other microbial processes adjust to compensate for these differences. It further highlights the importance of the biological aspects of biogeochemical processes, such as fluctuations in microbial population dynamics. Integrating gene and population dynamics into biogeochemical models has the potential to improve predictions of the community response under altered scenarios to guide remediation efforts.
KeywordsBiogeochemical model 16S rRNA gene sequencing Metagenome-assembled genome
Hidden Markov Model
Kyoto Encyclopedia of Genes and Genomes
Mean absolute error
Operational taxonomic unit
Biogeochemical processes in hypoxic and anoxic environments can feedback on changes to local and global ecosystems, so understanding how they are regulated will serve to aid remediation efforts, model climate change, and protect public health. Oxygen depletion is a major impairment to water bodies in the USA  and is likely to become more severe in a warming climate . Oxygen depletion deprives economically important aquatic organisms of vital habitat. Often remediation efforts designed to improve the ecosystem are not sufficient to restore esthetics, biodiversity, and functionality of impacted water bodies . Oxidation-reduction processes initiated upon oxygen depletion can influence the mobility of toxic substances within the water body, such as uranium  and arsenic . Migration of these substances into surface or groundwater can have adverse effects on human health. Low-oxygen concentrations can also alter the availability of key nutrients, nitrogen and phosphorous [6, 7, 8, 9], that can impact the productivity of the ecosystem. Finally, microbial processes such as methanogenesis, methane oxidation, and denitrification mediate the production and consumption of methane and nitrous oxide that could impact atmospheric levels . Improving predictive models of microbial processes in low-oxygen environments will provide a way to test the community response to altered environmental conditions.
Population dynamics are not typically used to improve predictive models of microbial processes because of the difficulty of determining in situ relationships between populations and processes. Microbial processes, especially in anoxic environments, are traditionally thought of as a “black-box,” where the identity of the microorganisms involved is ignored [e.g., ]. The high degree of functional redundancy was thought to make identity irrelevant . A number of studies of host-associated microbiomes (algae , sponges , and human ) note large changes in taxonomic composition associated with relative invariance in functional gene content. This suggests that changes in community membership do not yield meaningful differences in community function. Functional biomarker genes have been used to drive biogeochemical models of an oxygen minimum zone  and a eutrophic estuary . An underlying assumption in such efforts is that the rest of the genomic repertoire of mostly uncultivated populations is irrelevant for understanding their effect on biogeochemical process.
However, there is consensus in macroecology that species’ functional characteristics significantly alter ecosystem energy and material flow via positive and negative interactions with other species . Examples of such events include the introduction of the actinorhizal nitrogen fixer Myrica faya in Hawaii  and the reestablishment of wolf populations in Yellowstone . In these examples, the spatiotemporal dynamics of invasive species or keystone predators drive change to a greater extent than abiotic factors.
In microbial ecology of low-oxygen environments, less is known about how the unique characteristics of microbial populations influence energy and material flow. Yet, examples suggest microbial population functional properties can be important. Phytoplankton functional types (e.g., diatoms, cyanobacteria, chlorophytes, and coccolithophores) impact carbon cycling differently enough to warrant individual parameters in many biogeochemical models . Competition for electron donors between Rhodoferax and Geobacter may influence uranium bioremediation because Geobacter can reduce uranium, while Rhodoferax cannot . Since syntrophic and competitive interactions are common between microbial populations in anoxic environments [23, 24], both interaction and successional dynamics could have a strong impact on overall biogeochemistry. More work is needed to identify these unique population characteristics and understand how they affect predictive models.
The first hurdle in understanding the impact of population-specific traits and interactions on ecosystems is identifying them in natural environments. This often requires inference rather than direction observation. Directly observing the functional capabilities and interactions of microbial populations in the environment provides the most accurate way to probe interactions , especially when organisms are physically associated  and metabolites can be traced with the use of isotopic signatures (e.g., [26, 27]). However, direct evidence requires substantial prior knowledge, can be limited to certain types of interactions and often requires experimental manipulation. Thus, functional capabilities and ecological interactions are often inferred through indirect evidence. Partial genome reconstruction facilitated by metagenomic binning has vastly improved our understanding of the genetic potential of uncultured microorganisms (e.g., [28, 29]). Competition and syntrophic interactions can also be inferred through co-occurrence patterns  and time-series analysis . Bioinformatic inference allows for the investigation of a less well-characterized functions and interactions in ecosystems.
The Upper Mystic Lake’s unique chemistry, simple hydrology, and previously characterized microbial community make it amenable to investigations of microbial gene and population dynamics. The chemical environment of the Upper Mystic Lake is impacted by pollution from the Industri-Plex and Wells G & H Superfund sites (Woburn, MA). Contaminants from the Superfund site, such as nitrogen, iron, and arsenic, have complex interconnections in the lake. Specifically, nitrate controls arsenic  and phosphate  mobilization through microbially mediated iron oxidation with nitrate. Methane concentrations are also elevated in the lake, resulting in methane release to the atmosphere through bubbling that is correlated with changes in hydrostatic pressure . Microorganisms mediating oxidation-reduction reactions of nitrogen, iron, and methane were previously characterized using a 16S rRNA gene survey and a biogeochemical model . The microorganisms mediating biogeochemical processes in the lake co-occur with other microbes not directly mediating the modeled processes, suggesting syntrophic interactions play an important part in these processes as well.
Here, we compare gene and population spatiotemporal dynamics to a biogeochemical model in Upper Mystic Lake. We use 16S rRNA gene sequencing and metagenomic binning to infer capabilities. Population dynamics and implications of potential interactions are compared with a computational model capturing the ecosystem-level biogeochemistry of the lake. This framework allows us to identify capabilities and interactions that could substantially alter the predicted chemistry of the lake and observe the dynamics of dominant populations related to predicted biogeochemical cycling. Elucidating the gene organization within and between populations could improve our ecosystem-level understanding of biogeochemical processes and the factors that control them.
Biogeochemical process predictions from a previous model largely explain gene distribution
To better understand the relationship between biogeochemical processes and the distribution of genes within the lake, we compared the distributions of key genes to the rates of the corresponding processes predicted by a biogeochemical model. The biogeochemical model was developed from a reactive transport model coupling major chemical cycles  and adapted for this lake ecosystem . We assume that dispersion below the thermocline distribute cells evenly throughout water column. We assume that the presence of diagnostic genes mediating biogeochemical processes in the model allow cells to actively reproduce in proportion to energy availability and all other physical or ecological processes (e.g., predation, mortality) impact microorganisms evenly throughout the water column. This would result in a relative gene distribution proportional to the relative rates of the biogeochemical processes they mediate. While other recent biogeochemical models use gene abundances to drive biogeochemical rates [16, 17], our model was calibrated independent from observed gene abundances. Previously, the model predictions explained the distribution of populations inferred to have specific capabilities through phylogenetic similarity to cultured populations or through a novel single-cell method to fuse functional and phylogenetically informative genes, Emulsion, Paired Isolation, and Concatenation PCR (epicPCR; ). In this analysis, we compare model predictions to distribution of genes in shotgun metagenomic data involved in methane, sulfur, nitrogen, and iron cycling in the lake.
Metagenome-assembled genomes provide insight into population capabilities
The coupling of genes within genomes of dominant microorganisms in the lake can provide insight into the factors that influence gene distribution. It can provide an estimate of the metabolic capabilities of abundant microorganisms and identify metabolic versatility and redox processes that were not previously included in the model. We used metagenomic binning to partially reconstruct genomes of lake microorganisms from Aug. 2013. Metagenomic assembly and binning resulted in 87 metagenomic bins, or metagenome-assembled genomes (MAGs) with completeness greater than 70% (median 84.4%) and less than 10% contamination (Additional file 1). The metabolic capabilities associated with MAGs were further characterized by clustering them to each other and to genomes in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Additional files 2 and 3) according to shared gene content. Finally, the presence of key genes and pathways in energy metabolism was used to confirm the metabolic capabilities associated with each MAG (Additional file 4).
Insight into the spatiotemporal dynamics of populations [i.e., operational taxonomic units (OTUs) from distribution-based clustering] over the course of the entire season (March–Aug., 2013) was assessed by amplicon libraries of the 16S rRNA gene. To gain insight into how metabolic capabilities shape population dynamics in relation to the model predictions, we matched OTUs to MAGs to provide confirmation about the role of specific OTUs in the processes described in the model. During the normal course of de-novo assembly and the binning, highly conserved genes with uniquely divergent tetranucleotide frequencies, such as 16S rRNA genes , are prone to being misassembled or misplaced during binning. Thus, an additional algorithm was needed to integrate the functional repertoire of MAGs to the extended spatiotemporal dynamics of OTUs. We used Euclidean distances between normalized abundance vectors and the shared fraction of the taxonomic classification hierarchy to connect OTUs to MAGs. The unification of data types facilitated comparisons between the changing abundances of populations in time and space and related modeled processes over the same scope.
Populations (OTUs) and associated genomes (MAGs) implicated in mediating processes in the biogeochemical model
Metabolic versatility (genes)
Sulfuricella denitrificans/Sideroxydans lithotrophicus ES-1
Sulfur cycling (dsrAB); Denitrification (nosZ)
Denitrification (nxrAB, nirK, norBC)
Rhodoferax ferrireducens T118
Denitrification (nxrAB, nirS)
Desulfatirhabdium butyrativorans strain HB1
Population capabilities inferred from partial genome reconstructions with metagenomic binning
Other OTU sequences matching MAGs with genes and genomic characteristics supportive of their role in the modeled processes were also found in the lake. Four MAGs (bin.46, bin.48, bin.81, bin.12/bin.60) with matching Bacteroidetes OTUs (seq280, seq433, seq0, seq214, resp.) had at least one gene involved in denitrification (i.e., nosZ) and clustered near other heterotrophic KEGG genomes (Additional file 2). None of these or any other assembled MAGs had a complete denitrification pathway. The two other MAGs associated with denitrification coupled to sulfur oxidation also match two OTUs classified as Betaproteobacteria (bin.15->seq335; bin.73->seq1088). A MAG implicated in methanotrophy (bin.78) aligns with OTUs similar to a methanotroph Methylobacter tundripaludum (seq172). Another MAG (bin.71), clustering near methylotrophs genomes, contained genes involved in methanol metabolism and carbon fixation and matched an OTU (seq3) similar to Methyloterna versitalis.
All bins passing our quality control metrics were classified as Bacteria. The dominance of bacterial bins in the final dataset agrees with overall classification from the shotgun metagenomic data. Nearly 63.8% of assembled contigs could not be matched to known sequences or were matched to uninformative sequences. The rest were classified as Bacteria (96.8%), Eukaryota (2.8%), Archaea (0.1%), or viral (0.3%). Some preliminary bins, which failed our quality control measures, contained non-bacterial sequences. A single preliminary bin, annotated only at the level of kingdom as Archaea, fell just short of the completeness criterion. Upon further inspection of this bin, none of the functional genes used in this study were detected in the bin. Five preliminary bins were annotated as primarily eukaryotic, but the phyla assigned to individual contigs within those bins varied widely. Another small preliminary bin contained three 40–50 kbp viral contigs, two of which were classified as Podoviridae. All three peaked in abundance between 9 m and 13 m and were twice as abundant as the most abundant bacterial MAGs.
Identification of potentially important pathways missing from the model
From the metagenomic bins, we were able to identify two metabolic processes that were not previously included in the model. A number of MAGs (bin.59, bin.15, bin.73) clustered to the KEGG genomes of freshwater sulfur-oxidizing autotrophs capable of denitrification, Sulfuritalea hydrogenivorans , and Sulfuricella denitrificans . These MAGs contained the diagnostic genes for carbon fixation (rbcLS), sulfur cycling (dsrAB), and denitrification (nosZ). One MAG (bin.59) also clustered with iron oxidizing autotroph Sideroxydans lithotrophicus ES-1. Bin.59 is the most relatively abundant bin from 17 to 21 m depth. Thus, if this MAG is associated with iron oxidation, it also contains sulfur-cycling genes that add to metabolic flexibility, which was previously observed . The model did not include sulfide oxidation with nitrate, so it is unclear from the current model predictions where this process is expected to occur within the water column to compare to the MAG distributions.
Another metagenomic bin provides evidence for coupling methane oxidation to nitrate and/or nitrite reduction, which was not included in the model. A number of MAGs cluster near KEGG genome Methylomicrobium alcaliphilum [T01649; ]. One of these MAGs (bin.78) contains methane-oxidizing genes (pmoA) along with genes involved in nitrate (nxr), nitrite (nir), and nitric oxide (nor) reduction. Coupling methane oxidation with nitrate reduction under hypoxic conditions has been demonstrated in other Methylomicrobium populations . The model does not include this process, so we cannot compare where this process would be expected, given the chemistry.
Although this analysis does not confirm that sulfur or methane oxidation is coupled to denitrification within this system, the model can be used to determine whether these processes are favorable given the chemical environment, and if so, what the impact would be to overall biogeochemistry. Sulfur and methane oxidation with nitrate were the two processes not included in the original reactive transport model . However, both processes have since been shown to be widespread and important [45, 46, 47].
Competing denitrification processes could substantially alter lake biogeochemistry
To determine the potential impact of sulfur and methane oxidation coupled to denitrification in the lake, the model was updated and calibrated to match the chemical and gene observations. We calibrated model parameters to match both chemistry and key genes involved in each processes, an approach used previously . Using previously published values, the model fit with a mean absolute error (MAE) of 0.205 to the concentrations of sulfate, dissolved oxygen, nitrate, ferrous iron, and gene abundance across the 3-month time series. This decreased slightly to average MAE of 0.176 using the gene distributions as a calibration objective during optimization (Additional file 6).
While the fit to the observed chemistry alone worsened slightly as a result of the calibration (ΔMAEaverage = + 6.9e−3), the fit to observed gene profiles improved substantially (ΔMAEaverage = − 5.4e−2). The majority of this improvement stemmed from a much better accordance between the calibrated model’s predictions of sulfur oxidation and sulfate reduction and the observed abundances of the dsrAB gene (ΔMAE = − 0.31).
Spatial and temporal dynamics of OTUs compared to model predictions
Methanotroph (Fig. 4l) and methylotrophs (Fig. 4p) displayed bloom-like dynamics distinct from the constant rate of methane oxidation predicted by the model (Fig. 4k and n). When these organisms did bloom, they were found at a depth that largely corresponded to the model prediction for methane metabolism. Shifts in the ratio of relative abundance of these C1-oxidizing populations from the amplicon data correspond to deviations between the model and observations for nitrate and iron (Additional file 8). Although these MAGs differ in the presence of methane monooxygenase (pmoABC) and mxa (methanol dehydrogenase) genes, they are both involved in metabolizing C1 compounds generated from methane. Interestingly, both nitrate and iron deviate from model predictions most dramatically during shifts in the ratio of these populations in the water column. Thus, their dynamics in the lake could represent shifts in the efficiency or use of nitrate as an electron acceptor in C1 compound oxidation. More work is needed to determine the cause of population shifts and their relationship to chemical deviations from the model.
This work provides insight into the relationship between microbial genes, populations, and a biogeochemical model of the processes they mediate to elucidate factors driving population dynamics and advance the use of population dynamics and MAG data in the development of predictive models. A biogeochemical model was used as our hypothesis of the factors influencing the dynamics of microbial populations mediating processes in the lake. It was also used to understand how adding or removing processes from the community would alter lake chemistry. We assumed that the presence of genes mediating the modeled processes would influence population dynamics and distribution; thus, the overall distribution of these genes would correspond to the modeled predictions. However, we also checked whether genes associated with different modeled processes co-occur within the MAG as a result of metabolic versatility, which would also influence gene distribution. Finally, we compared biogeochemical model predictions to the dynamics of key populations. We found that the dynamics of some microorganisms, such as those mediating denitrification, iron reduction, and sulfate reduction seem to be qualitatively captured by a dynamic model of energy availability. However, the model poorly captures the dynamics of key populations mediating methane oxidation, ammonia oxidation, and sulfur/iron oxidation.
There are multiple possible reasons for the lack of agreement between the diagnostic gene distributions and the associated processes predicted by the model. The model could be missing important processes, as suggested by the presence of sulfur-oxidizing denitrifiers. The original model only included sulfur oxidation with oxygen, shifting this process higher up in the water column than was observed for the genes. Pairing sulfur oxidation to denitrification would allow sulfur oxidizers to inhabit lower depths than pairing with oxygen since nitrate is found below where oxygen is depleted. Additionally, although the genes were present, they may not have been active. Genes could also exhibit substrate promiscuity, with genes assigned to one process mediating other processes. The ability of the well-known ferredoxin-nitrite reductase nirA to perform sulfite reduction is one such example . An alternative explanation, especially applicable to bidirectional enzymes such as oxioreductases, is that the reverse reaction is occurring , as a result of the intracellular redox state or subtle sequence mutations affecting the reaction center. Our attempt to align genes to processes in the model relies on our ability to identify the genes involved. Thus, for environments such as this, where 63% of the genes are unknown or unclassified, many genes important to these processes could be unidentified.
The co-occurrence of autotrophic and denitrification genes in relatively abundant populations highlights the expanding view of nitrogen cycling. The original model coupled denitrification with primary carbon oxidation reactions , reflecting the idea that denitrification is largely associated with heterotrophic processes. However, this view has expanded as many researchers have identified the importance of denitrification in oxidation of methane , iron , and sulfur . The combination of genes within autotrophic MAGs also supports this expanded role of denitrification within this system. Our original implementation of the biogeochemical model included iron oxidation coupled to denitrification based on previous observations from the lake [32, 50]. The similarity of a MAG to Sideroxydans lithotrophicus ES-1 supports this process. But the clustering of MAGs with other autotrophic denitrifiers and presence of nosZ genes suggests additional autotrophic denitrifying pathways should be considered.
Given the evidence for autotrophic processes competing for nitrate, these processes could directly impact greenhouse gases, nutrient cycling, and contaminant mobilization. Using the biogeochemical model, we demonstrate that competition for nitrate among autotrophic denitrifying populations could dramatically change iron speciation within the lake. Because of the importance of nitrate and iron in determining arsenic  and phosphate  speciation and mobility, these chemical species will likely be impacted as well. Additionally, nitrifiers have the denitrification gene nirK, providing it the potential for nitrifier denitrification, which can be a significant source of the potent greenhouse gas nitrous oxide in some ecosystems . More work is needed to determine the rules governing these interactions and how the outcome of these competitive interactions between denitrifying populations could impact the overall chemistry of the lake.
While we could successfully identify MAGs linking functional capabilities to 16S rRNA gene sequences in our amplicon dataset, the development of more targeted methods to consistently associate function to 16S rRNA gene sequences would extend this analysis to more of the community. 16S rRNA gene sequencing can more efficiently provide population dynamics because many more samples can be sequenced with sufficient coverage as compared to shotgun metagenomics. Yet, limited functional information is available from 16S rRNA gene sequencing. Some currently available techniques, such as stable isotope analysis , or epicPCR , can help identify the 16S rRNA gene sequence associated with organisms containing specific capabilities or functional genes, but this targeted approach has limited use in creating a comprehensive understanding of community function and interactions. Although 16S rRNA gene sequences cannot always discriminate between closely related but functionally distinct organisms , the importance of these functional differences will depend on the process of interest. More phylogenetically constrained forms of energy conservation, like methane oxidation and sulfate reduction, will be less sensitive to this impact than processes such as denitrification . More effective methods to tie 16S rRNA genes to associated function and MAGs would make the most efficient use of both types of sequencing and better inform dynamics and function.
The biogeochemical model serves multiple purposes in this analysis. First, it can be used to generate hypotheses about where processes are expected to occur given the current understanding of processes and the observed chemistry. It can then be used to compare to observations of genes and populations mediating the modeled processes. However, observations can help generate new ideas about the flow of energy and matter through the ecosystem, requiring updates to the model in an iterative approach. By comparing the previous model predictions with gene observations, we were able to identify major differences, which were supported by the MAG gene data. When we updated the model to reflect these observations, it allowed us to gain insight about potentially competing processes, which could substantially change iron chemistry in the lake. Finally, the model can be used to test unobserved conditions, such as the removal of methane and sulfide oxidation coupled to nitrate reduction from the ecosystem and observe the potential consequences. Not only can computational models facilitate the analysis of microbial communities, but it can also result in a usable product that may predict the response of the community to future environmental change scenarios.
While this analysis provides insight into potential capabilities and interactions between microorganisms, more work is needed to confirm activity and interactions. In general, the metabolic model and associated MAGs support the potential for these processes to occur given the chemical environment. Although we assume reproduction would concentrate cells at the various locations in the lake, the presence of microorganisms does not necessarily translate into metabolic activity. Microorganisms or their genes may not be active, even though there is an overall positive relationship in the water column between the presence (from DNA) and activity (from RNA) [17, 55]. Meta-transcriptomics could provide more insight into when and where organisms express the genes they carry. However, transcription of these key genes may only provide a snapshot of metabolic activities if they vary substantially between sampling periods . Population abundances and their associated functional capabilities may integrate energy availability better than transcription over longer time-scales, especially if activity is intermittent within the specified time. On the other hand, population abundances are also determined by a complicated set of factors that might not be related to energy availability. Factors regulating population abundances may be influenced by other processes, such as immigration  and complex hydrodynamics, such as seiching . However, within this lake, we have no evidence to suggest that hydrodynamic processes are the main drivers of the position of microbial populations.
This analysis uses a combination of 16S rRNA gene surveys, metagenomic analysis, and biogeochemical modeling to gain insight into how energy availability shapes the distribution and dynamics of genes and microbial populations in the lake. Partial genome reconstruction through metagenomic binning provided insight into the capabilities of microbial populations mediating major biogeochemical cycles in the lake, including populations with genes supporting sulfur or methane oxidation coupled to denitrification. Modulating these processes in a biogeochemical model of the lake substantially changed predicted iron chemistry, which is recognized to impact both arsenic and phosphate mobilization. Secondary redox processes are not well captured by the model; thus, more work is needed to better understand the factors governing these processes. Overall, the relationship of the biogeochemical model predictions to the dynamics of populations mediating the primary redox processes identifies populations with dynamics largely controlled by energy availability as opposed to other factors. Additional observations of this ecosystem could shed light on how population dynamics are shaped by factors like phage predation, especially within methane oxidizers with potential for denitrification, which could impact ecosystem energy and material flow.
Sampling site and sampling protocol
Samples were collected from the middle of Upper Mystic Lake located in Medford, MA (42° 26.155′ N, 71° 08.961′ W), which has an approximate depth of 23 m. Samples were taken on March 26, May 10, June 17, July 17, and August 15, 2013, from the surface to 1 m off the bottom at 1 to 2 m intervals. For each sample, a peristaltic pump was used to pump 50 mL of water through a sterile 0.22-μm filter (Millipore), which was immediately placed on dry ice and stored at − 80 °C until DNA extraction. The filtrate was collected in a sterile 50-ml conical tube, placed immediate on dry ice, and stored at − 80 °C for anion analysis with ion chromatography. One milliliter of filtered and unfiltered water was added to 63 μl of concentrated hydrochloric acid for iron analysis with ferrozine . An in situ probe was used to measure oxygen, temperature, and depth (Hydrolab MiniSonde). Water chemistry for the entire period and 16S rRNA gene libraries from DNA from Aug. 15, 2013, were used in a previous analysis .
16S rRNA gene library preparation and operational taxonomic unit (OTU) generation
DNA was extracted as previously described  using the PowerWater DNA extraction kit (MoBio Laboratories) following the manufacturer’s protocol with an alternative lysis step with the addition of 20 μl of proteinase K and incubation at 65 °C for 10 min before bead beating. 16S rRNA gene libraries of the V4 region of the 16S rRNA gene were generated using a two-step amplification method as described previously . Universal primers targeting the region between the U515 and E786 positions were used (PE16S-V4-U515-F, 5′-ACACG ACGCT CTTCC GATCT YRYRG TGCCA GCMGC CGCGG TAA-3′ and PE16S-V4-E786-R, 5′-CGGCA TTCCT GCTGA ACCGC TCTTC CGATC TGGAC TACHV GGGTW TCTAA T-3′). The second set of primers contained Illumina adapter sequences and a 9-bp barcode for demultiplexing indicated by Ns (PE-III-PCR-F, 5′-AATGA TACGG CGACC ACCGA GATCT ACACT CTTTC CCTAC ACGAC GCTCT TCCGA TCT-3′ and PEIII-PCR-##-R, 5′-CAAGC AGAAG ACGGC ATACG AGATN NNNNN NNNCG GTCTC GGCAT TCCTG CTGAA CCGCT CTTCC GATCT-3′). Negative controls were also included to determine the contamination from the tubing and reagents. Positive control samples, including a mock community and replicate samples, were also included. Libraries were sequenced at Massachusetts Institute of Technology BioMicro Center (BMC) on an Illumina MiSeq with 250 bases for each of the forward and reverse reads and 8-base indexing read. One hundred two base pairs of the forward read were quality filtered, trimmed, and clustered into OTUs as previously described  using USEARCH , mothur  distribution-based clustering , and UCHIME  for quality filtering, sequence alignment, OTU calling, and chimera checking, respectively. Replicate samples were combined by their mean. Although the forward and reverse reads overlap in the amplicon library, 16S rRNA reads do not commonly overlap in the shotgun metagenomic library, described below. Since a direct comparison between 16S rRNA gene sequences generated in both datasets was used for quality control, the amplicon library sequence length was limited to 102 bps, a length used previously .
Analysis of amplicon library control samples
To determine the influence of reagent contamination and methodological error on 16S rRNA gene sequence data, OTU tables and representative sequences were imported into QIIME2  for diversity and principle coordinate analysis plots. The average read count for all negative samples was 1515 as compared to an average read count of 21,111 for positive samples and 62,838 for environmental samples. Negative samples were distinct from environmental samples (Additional file 9a), and negative samples typically contained a large proportion of OTUs classified as Halomonas while most of the other samples did not. Samples that did were discarded. Of the OTUs mentioned in the text, six were observed in the negative, but at much lower abundance (17 reads total) than in the environmental samples (1,194,496 reads total), suggesting contamination of negatives by samples, making it unlikely that these OTUs arose from contamination from reagents or other sources. Four mock community samples matched the known mock community sequences as expected, with little contamination. The sequenced reads matched the mock community with an average correlation coefficient of 0.77 and 0.89 with and without sequences with mismatches in the primer-binding site, respectively (Additional file 9b). Technical and biological replicates had an average Bray-Curtis distance of 0.19 (standard deviation +/− 0.09) compared to the average pairwise distance across the entire dataset of 0.69 (standard deviation +/− 0.2), suggesting that methodological variability was low compared to natural variability (Additional file 9d). Samples were processed and sequenced in random groups and did not cluster according to process group (Additional file 9c).
Shotgun metagenomic library preparation and sequencing
Metagenomic libraries were made from samples taken at every other depth from 1 to 21 m, plus 20 and 22 m from the Aug. 15, 2013. Positive control libraries were made from E. coli (B strain) and Microcystis aeruginosa stock cultures (Carolina Biological Supply, Burlington, NC). Libraries were prepared using the Illumina Nextera XT kit (Illumina, San Diego, CA) according to the manufacturer’s protocol, except libraries were pooled manually. Libraries were sequenced at the Johns Hopkins University Genetic Resources Core Facility on an Illumina HiSeq 2500 (Illumina, San Diego, CA), generating 150 base pair paired-end sequences.
The requisite metadata information recommended by the Genomic Standards Consortium for metagenome-assembled genome  is shown in Additional file 1. The following approaches apply to all bins derived. Taxonomic classification was performed using megaBLAST  and the taxator-tk algorithm . Reassembly of bins and initial co-assembly were performed using SPAdes 3.11 . The initial assembly was completed in “metagenomic” mode, and reassembly was done for each bin in “careful” mode using the first pass contigs as “untrusted contigs.”
Contigs were binned according to their coverage and tetramer frequencies. A set of consensus bins were derived from the bins produced by the maxbin2 , metabat2 , and concoct  algorithms. Completeness and contamination assessment were performed using the lineage workflow in CheckM . The preliminary bins mentioned above were those produced by maxbin2. Bins with < 70% completion and > 10% contamination were discarded. The bioinformatics pipeline from QC, to assembly, to binning, to refinement, to reassembly, and taxonomic classification was done within the metaWRAP software . Prokka  was used to facilitate gene calling and preliminary annotations. Prodigal , barrnap, SSUsearch , and Aragorn v1.2  were used to call open reading frames, ribosomal RNA, and transfer RNA. All protein sequences were generated using translation Table 11 within Prodigal.
To investigate the consistency of the binning pipeline output as compared to contemporary alternatives , the published raw reads collected from a study in the Gulf of Mexico dead zone were processed in parallel with our samples. The bins produced by our pipeline were compared to metagenomic-assembled genomes uploaded and published to the IMG database. Using mash  to compare bins, the bins produced by maxbin2 most closely matched those created in a previous analysis . It is notable that even those bins that were manually refined into multiple smaller bins and presented as distinct organisms (43–1/2 and 45–1/2) were matched to unique bins produced by maxbin2.
Mash was used to determine the closest matches between bin sets . Shared hashes are the units of homology produced by mash. Despite having greater sensitivity to mismatches and less accuracy than alignment-based comparisons, it was suitable for rapid similarity assessment. The degree of “concentration” or “specificity” of a match was derived from these data and was calculated as the ratio of shared hashes between the query genome and the single best reference genome, divided by the sum of hashes shared with all references. This value was calculated for members of a match to determine the degree of dispersion of a given MAG across the entire other bin set (Additional file 10). After establishing 1 to 1 pairing between MAG sets, Mummer was used to perform genome-to-genome alignments . The “--mum” tag was used to ensure only unique alignments on both reference and query were produced. Alignments were filtered to only include regions with longer than 1 kbp of homology. The genome size and the fraction of the genome aligned for both members of a match are shown in the file mentioned above.
Identification of genes involved in modeled metabolic processes
Protein annotations were conducted using KEGG BlastKOALA and GhostKOALA annotation web service . Any classifications with a GHOSTX score below 100 were removed, as previously recommended . C-type-cytochromes, involved in extracellular electron transfer to iron in Geobacter  and Rhodoferax , were identified by BLASTP and TBLASTX , respectively, with e value cutoff of 1E−190.
Functional gene distribution and corresponding modeled processes
The abundance of select genes was annotated with biogeochemical process-related KEGG Ids. Salmon was used to determine the relative abundances of these sequences in each library . The nucleotide sequence of each gene was bookended by 50–100 bp on each end to ensure that near identical gene sequences on distinct contigs were not conflated. Uniqueness was confirmed during indexing, as Salmon detects and removes duplicates prior to mapping.
Gene abundances values produced by Salmon are presented in sample-normalized units (i.e., copies per million). They were then L1 normalized by gene. These steps were taken as an alternative to normalizing according to average single-copy gene abundances . The rationale for this approach stems from the fact that uncultivable organisms are not ensured to contain single copies of all single-copy genes. Furthermore, the ratio of binned to unbinned nucleotides is approximately 3:4 in our analysis. The majority of these unbinned contigs were contained in the 182 bins that were discarded based on the absence of > 30% of their single-copy genes. Taken together, these imply that the average single-copy gene abundance will likely underestimate the contribution of functional genes from unbinned fragments.
Protein annotations do not always correspond to a unique modeled process; thus, multiple modeled processes were compared to the gene distribution during model comparison and calibration. For these processes, the sum of all possible modeled reactions mediated by the gene was normalized and compared to the gene distribution. Ammonia oxidation, methane oxidation with oxygen, and methane oxidation with sulfate were used in the comparison with pmoABC genes, as these genes can mediate either methane or ammonia oxidation . Sulfur oxidation and sulfate reduction were used in the comparison with dsrAB genes, as these could be involved in either sulfate reduction or sulfur oxidation . Heterotrophic and autotrophic denitrification was used in the comparison with nosDZ. Methane oxidation with oxygen and sulfate were used in the comparison with mxaCGKL. Nitrification was represented with hoa and iron reduction with Rhodoferax and Geobacter iron reduction genes, matched by BLAST .
Characterizing metabolic capabilities of metagenome-assembled genomes
Metabolic capabilities of MAGs were determined from gene content similarity to KEGG genomes with known functions and the presence of key genes and pathways within the MAG. KEGG ids extracted from the MGENE set associated with each KEGG genome (5647 total comparisons) were added to a matrix of all MAG KEGG ids individually. Normalization and complete linkage hierarchical clustering using Spearman Rank correlation placed the KEGG genome within the MAG tree using Cluster 3.0 . The functional capabilities of up to four of the closest KEGG genomes were used to assess MAG functional capabilities. Additionally, the presence of specific genes within the MAG was used as validation for specific functional capabilities as described above for methane and ammonia oxidation (pmoABC; K10944, K10945, and K10946), nitrification (hao; K10535), methane oxidation (mxaCGKL; K16255, K16257, K16258, K16259), sulfur cycling (dsrAB; K11180 and K11181), denitrification (nosDZ; K07218, K00376), iron reduction (Geobacter and Rhodoferax genes), or carbon fixation (rbcSL; K01601, K01602).
OTUs matching metagenomic bins
Both OTUs derived from shotgun libraries, as well as those observed in amplicon libraries were used assigned 16S rRNA gene sequences to bins based on similarities in abundance and taxonomic classification. The top 500 amplicon OTUs by the sum of their abundances in comparable samples were used in the analysis. The 11 overlapping samples in both the metagenomic shotgun dataset and the amplicon library dataset were used in the comparison. Abundances were normalized using the L1 norm along both axes, within samples first and then within each amplicon/bin.
The product of the length of contig (li) and its coverage (cij) in sample j was summed across all N contigs assigned to the bin and then divided by total length of the bin genome. These raw values were divided by the number of reads in each library before following the same normalization steps applied to OTU abundances.
OTU taxonomy was assigned using the RDP classifier and the 16S rRNA gene Training Set 16 . Confidence scores below 50% and Incertae Sedis classifications were removed from the “fixed rank” output. Each level in the hierarchy from kingdom to genus was treated as an independent feature to match the format of the taxonomic assignments produced for each bin. Pairwise Euclidean distances between abundance vectors for bins, and OTUs were calculated and seen to range from 0 to 1 on L1 normalized vectors. The fraction of matching taxonomic hierarchical levels (using the bin taxonomy as the denominator) was then subtracted from the distance to produce a combined distance metric ranging from 1 to −1, where the latter represents a perfect match. Eight bins were used as a positive control for the matching procedure, as five bins contained assembled 16S rRNA genes and three bins contained known positive control organisms. The tolerance was trained on these bins such that the minimum distance needed to capture all the correct assignments was accepted. This tolerance value was observed to be different for the full time series of amplicon libraries (0.0251), as compared to shotgun library data (0.176).
Incorporating additional processes into the biogeochemical model
Gene- and concentration-driven model calibration
Model parameters were calibrated to fit the gene and chemical observations. The calibrated parameters include rate constants, precipitation constants, and scalars applied to initial concentration profiles of unmeasured chemicals. Calibrated parameters were constrained using existing literature evidence when available.
Model calibration was achieved using a stochastic, sequential, and bounded search algorithm (https://github.com/spacocha/MWMW/blob/master/Scripts/calibration_script.py). The algorithm sequentially optimizes and fixes the value of each parameter. For each parameter, a fixed number (100–1000) of simulations is performed. Parameter values are randomly sampled from Gaussian distributions whose boundaries are the limits drawn from the literature. At the end of an iteration, an F test is performed with the sampled parameter values as features and the observed fit metric as a response. The parameter with the lowest p value (i.e., the most sensitive) is fixed at its best-fitting value. Next, sampling distributions for the values of the unfixed parameters in the next set of simulations are constructed. Linear regression models are fit individually using the performance metric to reset the centers of the Gaussian distributions for unfixed parameters. The distance between the current and previous centers for these unfixed parameters is used to define the shape of the Gaussian. A burn-in iteration in which no parameter is fixed is used to calculate the initial centers for subsequent iterations steps. Once all parameters are fixed, additional polishing is achieved using the L-BFGS-B algorithm  initiated at the point of convergence.
The performance metric utilized was the average mean absolute error between observed and modeled concentrations and process rates. Observed concentrations of some oxidized species were used directly for this purpose, and the average normalized abundances of specific gene sets were used as a proxy for microbial process. A list of parameters, calibration search ranges, and final value selections can be found in the script repository (https://github.com/spacocha/MWMW/tree/master/Data/calibration_data).
We thank the Winchester Boat club for the use of their facilities and boat launch and Eric Sakowski for thoughtful comments on this manuscript. Part of this research project was conducted using computational resources at the Maryland Advanced Research Computing Center (MARCC).
This work was supported by Johns Hopkins University, Whiting School of Engineering.
Availability of data and materials
The datasets generated and analyzed during the current study are available in NCBI’s SRA. The raw data from 16S rRNA gene sequencing and shotgun metagenomic library sequencing is available from NCBI’s SRA under accession numbers SRR3358790-SRR3359255 and SRR6237649-SRR6237662, respectively. Scripts and code used to process OTUs and MAGs are available from SmileTrain (https://github.com/almlab/SmileTrain) and MWMW (https://github.com/spacocha/MWMW), respectively. FASTA files containing the MAG sequences produced by metaWRAP are also available (https://github.com/spacocha/MWMW/tree/master/Data/Bin_Sequence_Data/bin_fastas).
KAW designed the analysis and controls; generated, analyzed, and interpreted the data; and helped write the manuscript. SWO designed the analysis, helped generate data, and helped write the manuscript. BPS, KD, SJS, EMM, SA, and AS helped generate data and helped write the manuscript. SPP designed the analysis, generated, analyzed and interpreted the data, and wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1.U.S. Environmental Protection Agency, O.o.W. National water quality inventory: report to congress, 2004 reporting cycle. Washington, D.C: E.P. Agency, Editor; 2004.Google Scholar
- 21.Rousseaux CS, Gregg WW. Interannual variation in phytoplankton primary production at a global scale. Remote Sens-Basel. 2014;6:1–19.Google Scholar
- 23.Canfield DE, Kristensen E, Thamdrup B. Heterotrophic carbon metabolism. Advances in Marine Biology. 2005;48:129–66.Google Scholar
- 40.Emerson D, et al. Comparative genomics of freshwater Fe-oxidizing bacteria: implications for physiology, ecology, and systematics. Front Microbiol; 2013. https://doi.org/10.3389/fmicb.2013.00254.
- 44.Kits KD, et al. Diverse electron sources support denitrification under hypoxia in the obligate methanotroph Methylomicrobium album strain BG8. Front Microbiol; 2015. https://doi.org/10.3389/fmicb.2015.01072.
- 49.Thorup C, et al. Disguised as a sulfate reducer: growth of the deltaproteobacterium Desulfurivibrio alkaliphilus by sulfide oxidation with nitrate. MBio; 2017. https://doi.org/10.1128/mBio.00671-17.
- 50.Senn DB. Coupled arsenic, iron, and nitrogen cycling in arsenic-contaminated Upper Mystic Lake; 2001. http://hdl.handle.net/1721.1/8750. Accessed 12 Sep 2018.
- 72.Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. 2018. https://doi.org/10.1101/277442.
- 77.Ondov BD, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol; 2016. https://doi.org/10.1186/s13059-016-0997-x.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.