Comparative genomics of Bacteria commonly identified in the built environment
The microbial community of the built environment (BE) can impact the lives of people and has been studied for a variety of indoor, outdoor, underground, and extreme locations. Thus far, these microorganisms have mainly been investigated by culture-based methods or amplicon sequencing. However, both methods have limitations, complicating multi-study comparisons and limiting the knowledge gained regarding in-situ microbial lifestyles. A greater understanding of BE microorganisms can be achieved through basic information derived from the complete genome. Here, we investigate the level of diversity and genomic features (genome size, GC content, replication strand skew, and codon usage bias) from complete genomes of bacteria commonly identified in the BE, providing a first step towards understanding these bacterial lifestyles.
Here, we selected bacterial genera commonly identified in the BE (or “Common BE genomes”) and compared them against other prokaryotic genera (“Other genomes”). The “Common BE genomes” were identified in various climates and in indoor, outdoor, underground, or extreme built environments. The diversity level of the 16S rRNA varied greatly between genera. The genome size, GC content and GC skew strength of the “Common BE genomes” were statistically larger than those of the “Other genomes” but were not practically significant. In contrast, the strength of selected codon usage bias (S value) was statistically higher with a large effect size in the “Common BE genomes” compared to the “Other genomes.”
Of the four genomic features tested, the S value could play a more important role in understanding the lifestyles of bacteria living in the BE. This parameter could be indicative of bacterial growth rates, gene expression, and other factors, potentially affected by BE growth conditions (e.g., temperature, humidity, and nutrients). However, further experimental evidence, species-level BE studies, and classification by BE location are needed to define the relationship between genomic features and the lifestyles of BE bacteria more robustly.
Keywords: Built environment, Bacteria, Diversity, Genomic features, Genome size, GC content, Replication strand skew, Codon usage bias
GC: Guanine and cytosine
GCSI: GC skew index
S value: Strength of selected codon usage bias
Dmean: Mean distance between all pairs of bacteria as a diversity index
The microbial community of the built environment (BE) is an important player in human-microbe interactions. As such, in order to build urban environments that benefit human well-being, it is necessary to study the relationship between the BE and microbial communities. As of 2016, about 54% of the world’s population is living in urban areas , and by 2050, this number is expected to increase to 66% . Moreover, people spend about 87% of their time indoors and about 6% in cars , suggesting that the indoor microbial community can play an important role in the lives of individuals. In fact, the indoor microbial community has already been shown to affect occupant health (e.g., respiratory health  and asthma ), including adverse effects on mental health , and can be influenced by building design (e.g., ventilation), occupants, and usage [7, 8, 9]. In turn, individuals can easily influence the surrounding microbial community with their own personal microbiome, especially through physical contact [10, 11, 12] and movement , leaving a microbial fingerprint in the built environment [9, 14, 15]. The microbial community of the BE also extends to the outdoor (e.g., green roofs  and parks ), underground (e.g., transit systems [18, 19, 20]), and extreme environments (e.g., cleanrooms  and space [21, 22]).
The BE microbiome is slightly influenced by environmental conditions, mainly temperature, humidity, and lighting [23, 24, 25, 26, 27, 28]. Several other building parameters have been tested previously (e.g., room pressure, CO2 concentration, surface material) but were not found to play a significant role in the microbial community composition [29, 30]. Moisture levels are widely known to affect microbial abundances and activity, especially when water damage occurs (e.g., flooded homes had higher abundances of Penicillium ). However, many indoor built environments are largely devoid of water and nutrients, and it is likely that geographical location, on the scale of cities or even at larger scales , plays a more important role in the microbiome composition .
Research on the relationship between humans and microorganisms in the BE has moved from investigations limited to culture-based methods to approaches involving next-generation sequencing. One of the first publications on an indoor microbial community appeared in 1887 and reported a positive correlation between the presence of indoor microorganisms and death rate. Since the advent of high-throughput sequencing, several studies have used amplicon sequencing to gain more information about the microbial community of the BE, targeting the ribosomal RNA region (e.g., 16S rRNA) for Bacteria and Archaea and the internal transcribed spacer (ITS) region for Fungi. The microbial communities of a variety of locations have been analyzed, such as cleanrooms, operating rooms, plumbing systems, universities, and transit systems [18, 19, 20]. While these studies have enhanced our understanding of the relationship between humans, microorganisms, and the built environment [25, 29, 37], amplicon sequencing has limitations, including biases from sequencing primers, the targeted amplicon region, DNA extraction protocols, and sequencing platforms, which make multi-study comparisons difficult.
Improving our understanding of microbial communities in the BE can be achieved by analyzing draft or complete genomes derived from genomic and metagenomic studies. There have been several published genomes of bacteria collected from the BE, such as Dermacoccus nishinomiyaensis, Arthrobacter sp., and Gordonia sp., among others [43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53]. These data provide detailed information on individual bacterial genomes and can be indicative of a bacterium’s lifestyle or ecological niche [54, 55]. For example, comparative genomics of Lactobacillus species, common microorganisms in the human vagina that are mostly absent from other habitats, revealed that the genomes of the vaginal species were smaller with lower GC (guanine and cytosine) content compared to the non-vaginal species. The observed genome size reduction suggests that the vaginal Lactobacillus species have “some degree of adaptation to a host-dependent lifestyle,” and such reduction is commonly observed in symbiotic microorganisms. However, individual organismal genome information (e.g., genome size and nucleotide composition) has not been investigated in depth for microorganisms in the BE.
In the present study, we performed genome sequence analyses for bacteria that have been commonly identified in BEs, and focused on genomic features, including genome size, GC content, replication strand skew, and codon usage bias. This information could be useful for the characterization of the microbial members present in BEs, and in the future, these basic features might be useful to help predict the microorganisms likely to adapt to BE conditions.
Bacteria commonly identified in the built environment
Built environments (BEs) are occupied by various microorganisms and are also important transitions that link the natural world, humans, and the urban environment. The indoor microbiome has already been shown to influence human health [4, 5, 6], and a building’s design and operation can play a major role in the spread of microorganisms, including pathogens . For example, air and water via ventilation and plumbing systems, respectively, are major routes for microbial dispersal throughout a BE . Since BEs are designed to improve the lives of the individuals cohabiting them, it is important to understand the relationship between the BEs and the microorganisms therein.
Table 1 Locations in the BE where “Common BE genera” were identified. The locations are listed for the 28 genera. This list is based on the 54 publications used for this study (see Additional file 1: Table S2)
Environment Type in BE
Clinical (e.g., hospitals), Residential (e.g., bathroom), Extreme (e.g., spacecraft, cleanroom, ISS), Subway (e.g., underground touchscreens), Public recreation (e.g., gym), Hotel bathroom, Office workspace, University (e.g., classroom)
Extreme (e.g., cleanroom, ISS), Residential dust, Subway air
Clinical (e.g., hospitals), Residential (e.g., bathroom), Extreme (e.g., spacecraft, cleanroom, ISS), Subway, Public recreation (e.g., gym), Hotel bathroom, Office workspace
Extreme (e.g., spacecraft, cleanroom, ISS), Residential (e.g., wall surfaces), Clinical (e.g., hospital bathroom), Office workspace, Hotel bathroom
Clinical (e.g., hospital), Extreme (e.g., spacecraft, cleanroom, ISS), Subway, University classroom
Extreme (e.g., spacecraft, cleanroom, ISS), Residential (e.g., bathroom), Clinical (e.g., hospital), Hotel bathroom, Public recreation (e.g., park, gym)
Residential (e.g., kitchen), Extreme (e.g., cleanroom, ISS), Subway
Clinical (e.g., hospitals), Residential (e.g., dust), Extreme (e.g., spacecraft, cleanroom, ISS), Subway (e.g., ticketing machines, underground touchscreens), Office workspace, University (e.g., classroom, dormitory)
Extreme (e.g., spacecraft, cleanroom, ISS), Clinical (e.g., hospital)
Extreme (e.g., spacecraft, ISS), Subway (e.g., outdoor and underground surfaces), University (e.g., classroom)
Extreme (e.g., cleanroom, ISS), Clinical (e.g., hospital), Subway (e.g., outdoor and underground surfaces), Public recreation (e.g., park)
Clinical (e.g., hospitals), Residential (e.g., kitchen, bathroom), Extreme (e.g., ISS), Subway (e.g., passenger area), Public recreation (e.g., gym), Hotel bathroom
Residential (e.g., indoor surface), Extreme (e.g., cleanroom, ISS), Subway (e.g., underground air), Clinical (e.g., hospitals)
Clinical (e.g., nursing home), Residential (e.g., indoor air, surface dust), Extreme (e.g., cleanroom, ISS), Subway (e.g., touchscreens), Office workspace, University (e.g., classroom, dormitory, bathroom)
Clinical (e.g., hospitals), Residential (e.g., bathroom), Extreme (e.g., spacecraft, cleanroom, ISS), Subway (e.g., touchscreens), Office (e.g., dust), University (e.g., door handle), Hotel bathroom
Extreme (e.g., spacecraft, cleanroom, ISS), Subway (e.g., underground air)
Clinical (e.g., hospitals), Residential (e.g., indoor air, surface), Extreme (e.g., spacecraft, cleanroom, ISS), Subway (e.g., underground air)
Clinical (e.g., hospitals), Residential (e.g., indoor air, surface), Extreme (e.g., cleanroom), Subway (e.g., outdoor air), Hotel (e.g., showerhead), Public recreation (e.g., gym)
Clinical (e.g., hospitals), Residential (e.g., dust), Extreme (e.g., ISS), Hotel (e.g., showerhead), Public recreation (e.g., gym), Office workspace
Extreme (e.g., space station, ISS), Subway (e.g., underground air)
Residential (e.g., wall surface, dust), Extreme (e.g., ISS), Office workspace, University (e.g., dormitory)
Clinical (e.g., nursing home), Residential (e.g., kitchen, bathroom), Extreme (e.g., cleanroom, space station), Subway (e.g., indoor air), University (e.g., classroom, door handle)
Clinical (e.g., hospitals), Residential (e.g., kitchen, bathroom), Extreme (e.g., cleanroom, space station, ISS), Subway (e.g., underground air), University (e.g., door handle), Hotel (e.g., showerhead), Public recreation (e.g., gym), Office (workspace)
Clinical (e.g., hospitals), Residential (e.g., indoor air), Extreme (e.g., cleanroom, space station, ISS)
Clinical (e.g., hospitals), Residential (e.g., bathroom), Extreme (e.g., cleanroom, space station, ISS), Subway (e.g., ticketing machines, underground touchscreens), University (e.g., classroom), Hotel (e.g., showerhead), Public recreation (e.g., gym, park, parking lot), Office (e.g., dust)
Clinical (e.g., hospitals), Residential (e.g., bathroom), Extreme (e.g., cleanroom, space station, ISS), Subway (e.g., air), University (e.g., classroom), Hotel (e.g., showerhead), Public recreation (e.g., gym), Office workspace
Clinical (e.g., hospitals), Extreme (e.g., cleanroom, space station, ISS), Subway (e.g., ticketing machines, underground touchscreens)
Clinical (e.g., hospitals), Residential (e.g., bathroom, wall surface), Extreme (e.g., cleanroom, ISS), Subway (e.g., indoor air, touchscreens), University (e.g., classroom, door handle), Hotel (e.g., showerhead), Public recreation (e.g., gym), Office (e.g., dust, workspace)
From the 54 publications used in this study, many of the “Common BE genera” (Table 1) were identified around the world (Additional file 2: Figure S1). For example, Acinetobacter was found in five countries, spanning eight different climates, and in the ISS. Unsurprisingly, all 28 genera had some association with humans, as analyzed by MetaMetaDB (Additional file 1: Table S6) , further demonstrating the influence that humans have on the BE microbiome [29, 37]. Due to the limitations of this study, the prevalence of these “Common BE genera” cannot yet be associated with BE selection pressures. For example, while there are several other human-associated genera (e.g., Haemophilus, Veillonella, Alistipes, Rothia), the microbial community abundances could be affected by different abundance levels and shedding rates across the human body. Other limitations are listed in the section “Robustness and limitations.”
Diversity among common BE genera
To assess the diversity of the “Common BE genera,” we calculated the mean distance (Dmean) between all pairs of taxa within each genus based on 16S rRNA gene sequences available in the LTP datasets of the SILVA v128 release. The SILVA database was selected over other 16S rRNA databases (e.g., Greengenes [59, 60] and RDP) because of its greater alignment quality and because it is continuously updated. The Dmean was also selected over the phylogenetic diversity index (PD) [64, 65] because it is less affected by the number of taxa (N) available in the LTP database, as demonstrated by a smaller Pearson correlation coefficient between N and Dmean (r = 0.0017) than between N and PD (r = 0.7248) (Additional file 2: Figure S2).
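As an illustration of the Dmean idea, the following minimal sketch averages the distances over all unordered pairs of taxa in one genus. The function name, the square-matrix input, and the example distances are assumptions made for illustration; how the 16S rRNA distances themselves were derived from the LTP dataset is not specified in this section.

```python
def mean_pairwise_distance(matrix):
    """Mean distance over all unordered pairs of taxa.

    `matrix` is a symmetric square list of lists, where matrix[i][j]
    is the distance between taxon i and taxon j (hypothetical input format).
    """
    n = len(matrix)
    if n < 2:
        raise ValueError("at least two taxa are needed to form a pair")
    # Sum only the upper triangle so that each pair is counted once.
    total = sum(matrix[i][j] for i in range(n) for j in range(i + 1, n))
    number_of_pairs = n * (n - 1) / 2
    return total / number_of_pairs


# Hypothetical distances between three species of a single genus.
example_matrix = [
    [0.00, 0.02, 0.04],
    [0.02, 0.00, 0.03],
    [0.04, 0.03, 0.00],
]
d_mean = mean_pairwise_distance(example_matrix)
```

Because the sum is divided by the number of pairs, the value does not grow simply because more taxa are available, which fits the weak correlation with N reported above.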
Genome size, GC content, and GC skew
We compared the genomic features (genome size, GC content, GC skew, and codon usage bias) of 2580 complete prokaryotic genomes from the NCBI RefSeq database, of which 717 genomes are from bacteria commonly identified in the BE (“Common BE genera”) and 1863 are other genomes (“Other genera”) (Additional file 1: Tables S8–S9). The “Other genomes” belong to genera that were identified in fewer than six of the publications used for this study (six publications correspond to about 10% of the total).
Genomic features, including genome size, GC content, and GC skew, can provide information about the bacterial lifestyle as well as phylogeny . For example, genome size can reflect genome streamlining, symbiosis, or genome expansion [71, 72]. GC content has been shown to relate to both the phylogeny and ecological adaptations of a microbial species, as demonstrated by Reichenberger and co-workers . GC content can range from 15 to 75% and can be influenced by environmental factors such as temperature , oxygen levels , and nucleotide availability . Furthermore, GC skew, as quantified by the GC skew index (GCSI), measures the strength of replication strand skew  and could indicate variation in mutational and selective pressures between leading and lagging strands of DNA replication . Indeed, the leading strand tends to be biased with G and T while the lagging strand is rich in A and C . Strand composition bias has been shown to especially occur in obligate intracellular microorganisms that permanently live within a host, resulting in the loss of some DNA repair genes and the accumulation of mutations . Replication, repair, and transcription enzymes are thought to influence strand composition, where different genes are involved in transcribing the leading and lagging strand . Each enzyme will have different mutational and selective pressures, and thus, GCSI informs DNA repair capabilities and provides insight into the metabolism and lifestyle of bacteria .
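The two base-composition quantities discussed above can be sketched in a few lines. This is a minimal illustration that assumes the conventional definitions, GC content as (G + C) / (A + C + G + T) and GC skew as (G − C) / (G + C); it does not compute the GCSI, whose exact definition is not given in this section. The function names and the window size are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    total = g + c + seq.count("A") + seq.count("T")
    return (g + c) / total if total else 0.0


def gc_skew(seq):
    """GC skew (G - C) / (G + C); positive values mean an excess of G."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq, window):
    """GC skew in consecutive non-overlapping windows along one strand."""
    return [
        gc_skew(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


# Toy sequence: a G-rich half followed by a C-rich half.
skews = windowed_gc_skew("GGGGCCCC", 4)
```

On a real chromosome, a change of sign in the windowed skew along the sequence is the kind of signal that an index of replication strand skew is meant to summarize.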
Codon usage bias
The genetic code of each “Common BE genus” can also provide information about codon usage bias, which has further implications on evolutionary processes, such as selection, mutation , and even horizontal gene transfer [83, 84, 85]. Many amino acids can be encoded by more than one codon, also known as synonymous codons, due to the redundancy of the genetic code, and there is generally a preference for one synonymous codon over another . The pattern of synonymous codon usage can vary between organisms (e.g., some organisms use a set of synonymous codons more frequently) and across genes within a genome [82, 87]. It is hypothesized that codons are selected based on their impact on translation, influencing bacterial growth [88, 89], and that codon usage bias can be derived from highly expressed genes [90, 91]. Several studies have demonstrated that codon usage bias correlates with bacterial growth rates, likely suggesting a selection towards efficient translation machinery [87, 89, 92, 93]. Codons may also be selected to optimize protein production speed . For example, the codon usage bias of Salmonella enterica serovar Typhimurium, a fast-growing bacterium, correlates well with gene expression levels . Thus, it is imperative to determine the codon usage bias in order to further surmise the lifestyles of bacteria that have been commonly identified in the BE.
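To make the notion of a preference among synonymous codons concrete, the sketch below counts codons in a coding sequence and reports the share of each synonymous codon within an amino acid. The three two-codon amino acids, the function names, and the toy sequence are chosen only for illustration; this is not the S value calculation.

```python
from collections import Counter

# Three amino acids that are each encoded by exactly two codons
# in the standard genetic code (illustrative subset).
SYNONYMOUS = {
    "Phe": ("TTT", "TTC"),
    "Tyr": ("TAT", "TAC"),
    "Asn": ("AAT", "AAC"),
}


def codon_counts(cds):
    """Count in-frame codons of a coding sequence, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def synonymous_fractions(cds, families=SYNONYMOUS):
    """Share of each synonymous codon within its amino acid family."""
    counts = codon_counts(cds)
    fractions = {}
    for amino_acid, codons in families.items():
        total = sum(counts[codon] for codon in codons)
        if total:  # skip amino acids that do not occur in the sequence
            fractions[amino_acid] = {codon: counts[codon] / total for codon in codons}
    return fractions


# Toy coding sequence: ATG TTC TTC TTT TAC TAA
fractions = synonymous_fractions("ATGTTCTTCTTTTACTAA")
```

Comparing such fractions between highly expressed genes and the whole genome is the general idea behind measures of selected codon usage bias.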
Here, we determined the strength of selected codon usage bias (S value) (Fig. 2d), as discussed by Sharp and co-workers . The S value is based on a comparison of codon usage between constitutively highly expressed genes and the entire genome (see Methods for details) . The median S value of the “Common BE genomes” (1.32) was higher than that of the “Other genomes” (0.50), with a large effect size (Cliff’s delta of 0.574). Moreover, the Wilcoxon rank sum test provided a significant result with a q-value of 1.22e-111, suggesting that the S value could be more indicative of the type of bacteria commonly observed in the BE compared to other genomic features described previously (genome size, GC content, and GC skew).
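The effect size reported above can be illustrated with a short sketch of Cliff’s delta, which is the proportion of pairs in which a value from the first group exceeds a value from the second, minus the proportion in which it is smaller. The magnitude thresholds (0.147, 0.33, 0.474) are commonly cited cut-offs and are an assumption here; the section does not state which thresholds the authors applied. The example values are hypothetical.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    if not xs or not ys:
        raise ValueError("both groups must contain at least one value")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Verbal label for |delta| using commonly cited thresholds (assumed, not from the text)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Hypothetical S values for two small groups of genomes.
delta = cliffs_delta([3, 4, 5], [1, 2, 3])
```

Under these assumed thresholds, the values quoted in the text (0.574 and 0.516 as large, 0.451 as medium) receive the same labels as in the article.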
Further categorization of the environments (MetaMetaDB) indicates that the S value is higher for the “Common BE genomes” associated with the human microbiome than for the other “Common BE genomes” (Additional file 1: Table S10 and Additional file 2: Figure S7). Among the 517 “Common BE genomes” for which species were categorized according to environments in MetaMetaDB, the S value tended to be lower in compost-associated “Common BE genomes” than in the other “Common BE genomes” (Cliff’s delta = −0.647; q-value = 1.01e-21). In contrast, the median S value for the “Common BE genomes” also associated with the category “human” by MetaMetaDB (n = 454; median S value = 1.45) was higher than that for the other “Common BE genomes” (n = 63; median S value = 0.71). The difference was large based on the effect size (Cliff’s delta = 0.516) and was statistically significant based on the Wilcoxon rank sum test (q-value = 2.53e-10). This trend also holds when examining only the top bacterial genera found in the human microbiome (list taken from Lloyd-Price, Mahurkar, et al.). The top human microbiome genera that are also commonly found in the BE (n = 301 genomes; median S value = 1.50) had significantly higher S values than those not commonly found in the BE (n = 28 genomes; median S value = 1.08), with a medium effect size (Cliff’s delta of 0.451) and a q-value of 0.0009. This suggests that the human and BE microbiomes are interconnected, with bacterial genera trending towards larger S values. However, owing to the limitations of this study (see section “Robustness and limitations”), the “Common BE genera” cannot yet be associated with BE selection pressures.
When examining each “Common BE genus,” the S value was found to cover a wide range (e.g., Enterococcus, Mycobacterium, and Bacillus) (Fig. 3d). Future reports of BE microbial communities could help to resolve the importance of the S value by accurately identifying taxa to the species level and by unifying metadata collection and method protocols. Indeed, the S value has been shown to vary across species, especially for those that are not closely related ; e.g., Clostridium has the largest S value range (Fig. 3d) and also has the largest Dmean (0.038) (Fig. 1).
Case study: Mycobacterium
As a case study for one of the “Common BE genera,” we further discuss Mycobacterium and describe how the four genomic features can be used to surmise the potential lifestyle of bacteria. Mycobacterium, a genus with well-known pathogenic species (e.g., Mycobacterium tuberculosis and Mycobacterium bovis), has one of the largest genome size ranges, from 3.3 Mb [Mycobacterium leprae Br4923 (NC_011896)] to 7.0 Mb [Mycobacterium smegmatis strain MC2 155 (NC_008596)], with a median of 4.5 Mb (Fig. 3a). Mycobacterium has been found in several locations, including hospitals, therapy pools, showerheads, water-damaged homes, and cleanrooms (Table 1). One of the major factors determining the presence of Mycobacterium in water-damaged homes may be transmission from human and pet occupants. The GC content in Mycobacterium was relatively high (57.8–69.3%) compared to other “Common BE genera” (27.4–73.0%) (Fig. 3b), where the outlier group (57.8%) was the species M. leprae (Additional file 1: Table S8). The smaller genome size and lower GC content of M. leprae, an obligate pathogen, result from genome reduction, which has been well documented. The GCSI ranged from 0.025 [M. avium subsp. paratuberculosis K-10 (NC_002944); Additional file 2: Figure S8A] to 0.167 [M. leprae Br4923 (NC_011896); Additional file 2: Figure S8B]. The S value for Mycobacterium ranged from 0.36 to 1.30, suggesting either that the growth rate of different Mycobacterium species present in the BE varies drastically or that some Mycobacterium species have more “volatile” codons, as discussed below. For example, M. tuberculosis and M. leprae have S values in the lower range (0.36–0.45) and also have slow generation times of ~1 and 14 d, respectively [87, 98, 99]. In comparison, one of the highest S values (1.3) corresponded to M. abscessus, which has a generation time of 4–5 h.
Relationship between genomic features and the potential lifestyle of bacteria commonly identified in the built environment
To further understand the 28 “Common BE genera,” we analyzed four genomic features: genome size, GC content, GC skew, and codon bias. While our study based itself on the results of previous studies to retrieve the “Common BE genera,” we aimed to demonstrate the potential of using genomic features to provide insight into microbial lifestyles and to describe the trends found in the “Common BE genera” . The “Common BE genomes” tended to have larger genome sizes, higher GC contents, higher GCSI, and larger S values compared to the “Other genomes.” While the differences for all the genomic features were statistically significant based on the Wilcoxon rank sum test, further analysis by the Cliff’s delta effect size demonstrated that the S value is likely a more important genomic feature for bacteria commonly identified in the BE compared to the “Others” analyzed in this study.
This initial analysis could help begin to surmise certain lifestyles of the bacteria commonly found in the BE. For example, the S value has implications for the growth rates of bacteria found in the BE, which may be higher than those found in other environments, and could also be related to higher levels of gene expression [90, 91]. A stronger codon usage bias in the “Common BE genera” may have resulted from a long-term relationship with humans (e.g., genome reduction in bacteria was associated with the “Neolithic revolution,” and “Common BE genera” were found on nineteenth century documents [102, 103]), but further analysis is needed.
Moreover, the preference for certain codons may be related to either directional mutation or specific selection. In the case of directional mutation, it is hypothesized that some codons are more prone to mutation, resulting in lower S values. For example, Mycobacterium tuberculosis, a pathogen belonging to one of the “Common BE genera” with S values (0.41–0.45) below the “Common BE” and “Other” genome medians (Fig. 3d), has more “volatile” codons relating to antigens, surface proteins, or antibodies, which are likely to mutate more than other codons. These help M. tuberculosis prevent host-immune system interactions. Specific selection, in turn, is thought to lead to efficient translation and accurate protein synthesis because highly expressed genes use the more frequent codons. This can reflect an organism’s adaptation to an environment, and it is likely that the “Common BE genomes” share “synchronized regulation mechanisms of translational optimization.” Indeed, this has been shown for 11 distinct metagenomes from various environments, where, for example, microorganisms living with an abundant food source (whale fall carcass) have translationally optimized genes for energy production and conversion.
The trend towards larger S values in the “Common BE genera” also suggests that these genera can inhabit a wide range of environments . The “Common BE genera” must also contend with chemicals derived from the daily use of personal care and household products (e.g., avobenzone from sunscreen, laureth sulfate from shampoo, and amlodipine from medication used to treat high blood pressure), in addition to human-derived chemicals (e.g., acyl glycerols, which make up the membrane of human cells) [108, 109, 110]. For example, Propionibacterium has been shown to metabolize triglyceride triolein, a human acylated glycerol, and was found to be co-localized with acylated glycerols on the human body . Since these chemicals can be found in the BE and may be associated with an occupant’s chemical signature , future studies are needed to determine how these chemicals may affect the BE microbial community composition (e.g., rural vs. urban environments, change in a product’s formula, etc.).
While not as important as the S value in this study, larger genome sizes could be attributed to the incorporation of regulatory and secondary metabolic genes, which may be important for survival in the BE (e.g., aromatics degradation and regulation in response to environmental stresses). Indeed, the top three major functional pathways annotated for the microbial community found in ambulances were (1) biosynthesis of cofactors, prosthetic groups, and electron carriers, (2) secondary metabolite biosynthesis, and (3) aromatic compound degradation.
Robustness and limitations
This study demonstrates the potential of using the four genomic features (genome size, GC content, GCSI, and S value) to surmise the lifestyle of bacteria. The “Common BE genera” selected in this study have only been commonly identified by culture-based and amplicon-based sequencing studies, which have limitations as described in the Introduction. Although the “Common BE genera” have been detected in multiple BE studies (≥ 6), these bacteria may not be active in the BE. Moreover, although this study is based on completed genomes from the NCBI RefSeq database, the genomes could have been derived from environments not related to the BE. Thus, the conclusions derived from this study serve as a hypothesis for the potential lifestyles of commonly identified BE bacterial genera. Further studies are needed to accurately determine the typical BE genera and the association of BE genera with BE selection pressures.
It is important to note that the results remained similar when different data sets were compared (Additional file 1: Table S9). We tested the robustness to the composition of the genome data set by testing different subsets of bacteria (e.g., the phyla Proteobacteria, Firmicutes, and Actinobacteria), and also by randomly selecting one representative for species that have multiple strains sequenced. Of the four genomic features (genome size, GC content, GCSI, and S value), only the S value showed consistent results and tended to be higher in the “Common BE genera” compared to the “Others.” This indicates that the selected codon usage bias tends to be stronger in the “Common BE genera” than in the “Other genera,” regardless of the datasets used, and that our results were less affected by biases in the available sequenced genomes. We also tested different minimum numbers of publications (n = 1, 2, 3, 4, 5, and 6) to select for BE genera. The corresponding numbers of the selected “Common BE genomes” were 1208, 1029, 922, 825, 739, and 717. Even when genera observed in at least 1 out of 54 publications were defined as the “Common BE genera,” the median S value for the “Common BE genomes” (1.14) was higher than that for the “Other genomes” (0.35) with a large effect size (Cliff’s delta of 0.548), and the Wilcoxon rank sum test returned a significant result with a q-value of 2.59e-126. This is consistent with the results obtained when larger numbers of publications (n > 1) were used to define the “Common BE genera.” Thus, selected codon usage bias tends to be larger in the “Common BE genomes” than in the “Other genomes,” regardless of the genome data set used and the criteria used to define BE genera.
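One of the robustness checks described above, randomly keeping a single representative for species with several sequenced strains, could be sketched as follows. The record format (species name paired with an S value), the function name, and the fixed seed are assumptions for illustration and do not describe the authors’ actual procedure.

```python
import random


def one_per_species(genomes, seed=0):
    """Pick one randomly chosen value per species.

    `genomes` is a list of (species, s_value) tuples (hypothetical format).
    A fixed seed makes the random choice reproducible.
    """
    rng = random.Random(seed)
    by_species = {}
    for species, value in genomes:
        by_species.setdefault(species, []).append(value)
    # Iterate in sorted order so the result does not depend on input order of species.
    return {species: rng.choice(values) for species, values in sorted(by_species.items())}


# Hypothetical records: two strains of one species, one strain of another.
records = [("species_1", 1.0), ("species_1", 1.2), ("species_2", 0.4)]
representatives = one_per_species(records, seed=1)
```

Repeating the comparison on such a reduced set shows whether a result is driven by heavily sequenced species rather than by the genera as a whole.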
Our selection of the 28 common bacterial genera is likely biased towards the genera found in certain locations (e.g. fewer publications sampling outdoors and subways compared to indoors and extreme; more publications sampling locations with mild temperate climates) (Additional file 1: Table S3–S5) and sampling type (e.g., fewer publications conducted microbial community analysis of water samples compared to surface and air samples) (Additional file 1: Table S3). In addition, 16S rRNA amplicon sequencing was the dominant method used to determine the microbial community amongst the 54 publications used in this study. Some publications also conducted culture-based studies (e.g. study on airborne bacteria in Tokyo ). This introduces bias from the range of protocols used across publications, including sample collection methods (e.g. swab, wipe, air, and storage method), DNA extraction methods, primers used, 16S rRNA target region (e.g. V3–V4, V4, V6–V8), and sequencing methods [113, 114, 115]. With advances in sequencing for 16S rRNA (e.g., full-length ), genomes, and metagenomes (e.g., longer contigs, accurate base calling) and increased global research collaboration (e.g., MetaSUB ), more specific classification of BE microorganisms can be obtained at the species level, allowing for more accurate descriptions in future studies.
After obtaining the 28 “Common BE genera,” we then used the NCBI RefSeq database to obtain completed genomes. Another level of bias arises from using sequenced genomes from the public database (e.g., towards medically and industrially important microorganisms), although there are ongoing “efforts to expand the bacterial and archaeal reference genomes…to maximize sequence coverage of phylogenetic space” . However, this study aimed to demonstrate the capability of using genomic features to characterize the “Common BE genera,” providing a first step towards understanding the potential lifestyles of these bacteria. As more genomes from the BE microbial community are sequenced (e.g., efforts by the MetaSUB International Consortium ), much more accurate analyses can be carried out to appropriately examine the microbial lifestyles based on genomic features and functional annotation.
Twenty-eight bacterial genera were selected to represent the bacteria commonly identified in the BE. Although geographical location, temperature, and humidity are important factors in shaping the BE microbial composition, many of the “Common BE genera” were identified around the world. All the genera have also been observed in the human microbiome. Here, we used genomic features to demonstrate the potential of understanding the lifestyle of bacteria from the genome. Together, the genome size, GC content, and GC skew for the “Common BE genomes” showed trends similar to (i.e., did not strongly deviate from) those for the entire data set of completed prokaryotic genomes obtained from the NCBI database. On the other hand, the strength of selected codon usage bias (S value) for the “Common BE genomes” tended to be significantly higher than that of the “Other genomes.” As such, the S value could be indicative of bacterial growth rates, gene expression, and other evolutionary processes that may play a role in the bacteria present in the BE. Further insights could be gained through more BE studies that analyze locations with fewer publications (e.g., rural, tropical climates, and outdoor), identify microbial communities at the species level, and minimize cross-study biases.
Selection of common BE bacterial genera, metadata, and genome sequence data
Bacteria commonly identified in the BE are listed in Additional file 1: Table S1 and Table 1. Since most currently available BE studies conducted 16S rRNA amplicon sequencing, the identification was largely limited to the genus level. In this study, a total of 54 publications (published between 2003 and 2017) were compiled with metadata, including the bacterial genera, BE location identified, sample type, temperature (°C), humidity (%), and approximate climate (Additional file 1: Table S2). These publications either conducted 16S rRNA amplicon sequencing or isolated bacteria from the BE. If the temperature or humidity was not described by the publication, the average over a certain period of time (either the timeframe stated in the publication or the publication year) was obtained from online sources (see Additional file 1: Table S2 for references and timeframe). To assign a climate to each study, we applied the Köppen climate classification scheme (1981–2010) using the latitude and longitude closest to the study location described in each publication (Additional file 1: Table S4). To identify the “Common BE genera,” we selected bacterial genera that were identified in more than about 10% of the publications (n ≥ 6 publications) and had at least one genome sequenced in the National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov) RefSeq database [120, 121] (Additional file 1: Table S8). Based on these criteria, 28 genera were retained (Additional file 1: Table S1). These were denoted as “Common BE genomes” or “Common BE genera,” while the bacterial genera not selected were denoted as “Other genomes” or “Other genera.”
To further understand the potential associated environments of each BE genus, we used MetaMetaDB (data as of November 6, 2014, at http://mmdb.aori.u-tokyo.ac.jp) (Additional file 1: Table S6). MetaMetaDB is a database for searching the possible habitats in which a microorganism could live and was built by collecting 16S rRNA sequences. Hits for environmental categories for each common BE genus were based on an identity threshold of 97%, corresponding to the species taxonomic level. Environmental categories in MetaMetaDB are based on the classification used by the NCBI taxonomy and include aquatic, soil, human, compost, and others. While these categories are not well defined or controlled (e.g., there are several categories for human, including human, human gut, human oral, human skin, and others), we used MetaMetaDB to gain insight into the associated environments of each BE genus.
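The selection rule described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' pipeline: the function name `select_common_genera`, the input layout (one set of genus names per publication), and the toy data are assumptions; only the criteria themselves (a genus reported in at least 6 of the 54 publications and represented by at least one RefSeq genome) come from the text.

```python
from collections import Counter


def select_common_genera(publications, genera_with_genomes, min_publications=6):
    """Return genera reported in at least `min_publications` publications
    that also have at least one sequenced genome.

    publications: iterable of collections of genus names, one per publication
                  (hypothetical input layout).
    genera_with_genomes: collection of genera with a genome in the database.
    """
    counts = Counter()
    for genera in publications:
        # Count each genus at most once per publication.
        counts.update(set(genera))
    frequent = {genus for genus, n in counts.items() if n >= min_publications}
    return sorted(frequent & set(genera_with_genomes))


# Toy data (hypothetical): 54 publications, three genera with different frequencies.
toy_publications = (
    [{"Bacillus", "Staphylococcus"}] * 6   # both genera reach 6 publications
    + [{"Bacillus", "Rareus"}] * 5         # "Rareus" (made-up name) reaches only 5
    + [set()] * 43                         # publications reporting none of the three
)
toy_refseq = {"Bacillus", "Rareus"}        # "Staphylococcus" has no genome in this toy set
selected = select_common_genera(toy_publications, toy_refseq)
```

With 54 publications, the cut-off of 6 corresponds to 6/54, or about 11%, which matches the "more than about 10%" wording.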
RefSeq chromosome sequence accessions with the NC_ prefix were obtained from the NCBI prokaryotic genome list (ftp://ftp.ncbi.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt), and complete sequences of prokaryotic chromosomes (GenBank format) were downloaded with the RefSeq accessions using E-utilities on 2018-01-27. In cases where the organism has multiple replicons (chromosomes and plasmids), only the largest chromosome was used for the analysis as a representative replicon of the organism. The final data set included 2580 prokaryotic genomes (142 Archaea and 2438 Bacteria), including 717 genomes of bacteria belonging to the 28 genera commonly found in the BE (“Common BE genomes”) and 1863 other prokaryotic genomes (“Other genomes”). The 717 “Common BE genomes” belonged to 4 phyla: Firmicutes (370), Proteobacteria (222), Actinobacteria (123), and Bacteroidetes (2). The 1863 “Other genomes” belonged to 644 genera from 36 phyla, including Proteobacteria (875), Firmicutes (192), Actinobacteria (115), and Chlamydiae (110). The “Common BE genomes” and “Other genomes” were linked to the 18 environmental categories in MetaMetaDB: Aquatic, Biofilm, Compost, Food, Freshwater, Hot_springs, Human, Human_gut, Human_lung, Human_nasal_pharyngeal, Human_oral, Human_skin, Marine, Rhizosphere, Rock, Root, Sediment, and Soil. Complete listings of the genomes used in this study, along with the genomic features, are shown in Additional file 1: Table S8.
To measure the genetic diversity among taxa within a genus, the mean distance (Dmean) between all pairs of bacteria was calculated. The genetic distance between a pair of bacteria was calculated with the K80 model using the ‘dist.dna’ function of the ‘ape’ package of R (https://cran.r-project.org/web/packages/ape). We used a nucleotide sequence alignment of the 16S rRNA genes in ‘The All-Species Living Tree’ Project (LTP; https://www.arb-silva.de/projects/living-tree/). LTP datasets based on SILVA release 128 were downloaded from the Download page.
Genome size was calculated as the total number of nucleotides (A + T + G + C) in the whole nucleotide sequence of each chromosome.
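To make the Dmean calculation concrete, the sketch below computes the K80 (Kimura two-parameter) distance, d = −½ ln[(1 − 2P − Q)√(1 − 2Q)], where P and Q are the proportions of transitions and transversions, and then averages it over all pairs. It is a plain-Python illustration, not a reimplementation of `dist.dna`: the function names are hypothetical, and skipping alignment columns with gaps or ambiguous bases pair by pair is an assumption made here for simplicity.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k80_distance(seq1, seq2):
    """K80 distance between two aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    sites = transitions = transversions = 0
    for a, b in zip(s1, s2):
        # Assumption: ignore columns where either sequence has a gap or ambiguity code.
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError if the sequences are too divergent (saturation).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def mean_pairwise_distance(sequences):
    """Mean K80 distance over all pairs of aligned sequences (Dmean)."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(k80_distance(a, b) for a, b in pairs) / len(pairs)
```

For example, two 10-site sequences differing by a single transition give P = 0.1 and Q = 0, so d = −½ ln(0.8), roughly 0.112.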
GC content (%)
The relative frequency (percentage) of guanine and cytosine (G + C)/(A + T + G + C) was calculated from the whole nucleotide sequence of each chromosome.
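The two definitions above (genome size and GC content) reduce to simple base counts. The following is a minimal sketch with hypothetical function names; it assumes the chromosome is available as a plain string and that characters other than A, T, G, and C (e.g., N) are excluded from both counts, as the formulas in the text suggest.

```python
def genome_size(sequence):
    """Total number of A, T, G, and C nucleotides in the sequence."""
    seq = sequence.upper()
    return sum(seq.count(base) for base in "ATGC")


def gc_content(sequence):
    """GC content in percent: 100 * (G + C) / (A + T + G + C)."""
    seq = sequence.upper()
    total = genome_size(seq)
    if total == 0:
        raise ValueError("sequence contains no A, T, G, or C")
    return 100.0 * (seq.count("G") + seq.count("C")) / total
```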
GC skew index (GCSI)
The asymmetry in nucleotide composition between leading and lagging strands of DNA replication is represented by GC skew (C − G)/(C + G). The strength of GC skew was measured by the GC skew index, or GCSI, with a window number of 4096. This fixed window number was used to prevent any effects from biased nucleotide composition in coding regions and is based on an average gene length of 1 kb and a genome size of 2–4 Mb. The GCSI values can range from 0 (no GC skew) to approximately 1 (strong GC skew).
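As an illustration of the quantity that the GCSI summarizes, the sketch below computes GC skew, (C − G)/(C + G), in a fixed number of windows along a chromosome. It does not compute the GCSI itself, whose definition is given in the cited reference; the function name and the choice to report 0.0 for windows without G or C are assumptions.

```python
def gc_skew_windows(sequence, n_windows=4096):
    """GC skew, (C - G) / (C + G), in `n_windows` consecutive windows."""
    seq = sequence.upper()
    if n_windows < 1 or len(seq) < n_windows:
        raise ValueError("sequence must contain at least one base per window")
    skews = []
    for i in range(n_windows):
        # Integer arithmetic spreads any remainder evenly across the windows.
        start = i * len(seq) // n_windows
        end = (i + 1) * len(seq) // n_windows
        window = seq[start:end]
        c = window.count("C")
        g = window.count("G")
        # Assumption: a window with no G or C is reported as zero skew.
        skews.append((c - g) / (c + g) if c + g else 0.0)
    return skews
```

With 4096 windows, a 4 Mb chromosome yields windows of about 1 kb, consistent with the average gene length mentioned in the text.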
Strength of selected codon usage bias (S value)
As a measure of translationally selected codon usage bias, the S value was calculated for each chromosome, as described by Sharp and co-workers and by Vieira-Silva and Rocha, using the codon usage for four amino acids: Phe (TTC and TTT), Tyr (TAC and TAT), Ile (ATC and ATT), and Asn (AAC and AAT). For each of these amino acids, the two codons are recognized by the same tRNA species, and the C-ending codon is recognized more efficiently than the T-ending codon. The S value is based on a comparison of codon usage within these synonymous groups between constitutively highly expressed genes (those encoding ribosomal proteins and translation elongation factors) and the entire genome [87, 89].
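The comparison described above can be sketched as a log odds ratio of C-ending to T-ending codons in highly expressed genes versus all genes. This is a hedged illustration only: the log-odds form, the weighting of the four amino acids by their codon counts in the highly expressed genes, and the skipping of amino acids with a zero count are assumptions made here, and the cited papers should be consulted for the exact definition. All function names are hypothetical.

```python
import math
from collections import Counter

# (C-ending codon, T-ending codon) for the four amino acids named in the text.
CODON_PAIRS = {
    "Phe": ("TTC", "TTT"),
    "Tyr": ("TAC", "TAT"),
    "Ile": ("ATC", "ATT"),
    "Asn": ("AAC", "AAT"),
}


def count_codons(coding_sequences):
    """Count in-frame codons over a list of coding sequences."""
    counts = Counter()
    for cds in coding_sequences:
        seq = cds.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    return counts


def s_value(highly_expressed_cds, all_cds):
    """Weighted mean log odds of C- vs T-ending codons (assumed form of S)."""
    high = count_codons(highly_expressed_cds)
    genome = count_codons(all_cds)
    weighted_sum = 0.0
    total_weight = 0
    for c_codon, t_codon in CODON_PAIRS.values():
        observed = (high[c_codon], high[t_codon], genome[c_codon], genome[t_codon])
        if min(observed) == 0:
            continue  # assumption: skip amino acids where the log odds is undefined
        log_odds = math.log(
            (high[c_codon] / high[t_codon]) / (genome[c_codon] / genome[t_codon])
        )
        weight = high[c_codon] + high[t_codon]
        weighted_sum += weight * log_odds
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no usable codon pairs")
    return weighted_sum / total_weight
```

Under this form, a positive value means that C-ending codons are enriched in the highly expressed genes relative to the genome as a whole, and a value near zero means no detectable difference.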
We performed several statistical analyses to compare the values of the genomic features (genome size, GC content, GCSI, and S value) between two groups of genomes: e.g., “Common BE genomes” versus “Other genomes”; and MetaMetaDB environment-associated “Common BE genomes” (e.g., “Human”) versus other “Common BE genomes” (e.g., not associated with “Human”).
Wilcoxon rank sum test
We performed the Wilcoxon rank sum test (also called the Mann-Whitney U test) as a non-parametric statistical hypothesis test to compare the values between two groups. The p-value obtained by the statistical test was adjusted for multiple comparisons by controlling for the false discovery rate (FDR). An FDR-adjusted p-value (q-value) of 0.05 was used as a threshold for statistical significance.
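The FDR adjustment step can be sketched as follows. The text does not name the adjustment procedure, so the Benjamini-Hochberg step-up method used here is an assumption, and the function name and example p-values are hypothetical. The p-values themselves would come from a rank sum test (for instance `scipy.stats.mannwhitneyu` in Python or `wilcox.test` in R).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues


# Hypothetical p-values for four genomic features.
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in example_q]
```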
Cliff’s delta effect size
We calculated Cliff’s delta statistic as a non-parametric effect size to estimate the degree of overlap between two distributions. A Cliff’s delta of 0.0 indicates that the group distributions overlap completely, whereas a value of 1.0 or −1.0 indicates the absence of overlap between the two groups. A positive Cliff’s delta close to 1.0 indicates that the genomic feature values tended to be higher in the “Common BE genomes” than in the “Other genomes.” A negative Cliff’s delta close to −1.0 indicates that the genomic feature values tended to be lower in the “Common BE genomes” than in the “Other genomes.” Three thresholds were used to determine the magnitude: |d| < 0.147 “negligible,” |d| < 0.33 “small,” |d| < 0.474 “medium,” and otherwise “large”. These thresholds are used for two normal distributions, equivalent to the original thresholds used by Cliff (1993) to scale the effect size indices to observable phenomena.
Genome sequence analyses (e.g., calculating genome size, GC content, GCSI, and S value) were performed using the G-language Genome Analysis Environment version 1.9.1 (http://www.g-language.org). Statistical computing and graph drawing were conducted with R version 3.3.3 (https://www.R-project.org/).
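Cliff’s delta is the proportion of pairs in which a value from the first group exceeds a value from the second, minus the proportion in which it is smaller. The sketch below is a direct, unoptimized illustration with hypothetical function names; it reads the thresholds in the text as successive cut-offs, with “large” assigned to any |d| of 0.474 or more, which is an interpretation.

```python
def cliffs_delta(group_x, group_y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    if not group_x or not group_y:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in group_x for y in group_y if x > y)
    less = sum(1 for x in group_x for y in group_y if x < y)
    return (greater - less) / (len(group_x) * len(group_y))


def magnitude(delta):
    """Label the effect size using the thresholds quoted in the text."""
    size = abs(delta)
    if size < 0.147:
        return "negligible"
    if size < 0.33:
        return "small"
    if size < 0.474:
        return "medium"
    return "large"
```

For the group sizes in this study (717 versus 1863 genomes), the double loop makes about 1.3 million comparisons, which is still quick to run.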
We gratefully thank Professor Christopher E. Mason from Weill Cornell Medicine and the MetaSUB International Consortium for their support of our microbial community of the BE research projects. Computational resources were provided by the Data Integration and Analysis Facility, National Institute for Basic Biology. We would like to thank Editage (www.editage.jp) for English language editing.
NM received the Earth-Life Science Institute Origin of Life (EON) Postdoctoral Fellowship, which is supported by a grant from the John Templeton Foundation. The opinions expressed in this publication are those of the author(s) and do not necessarily reflect the views of the John Templeton Foundation. Research funds were received from Keio University, the Yamagata prefectural government and the City of Tsuruoka.
Availability of data and materials
The genomes used in this study were obtained from the NCBI RefSeq database. All data analyzed during this study are included in this published article (see also Supplementary Tables and Figures).
NM, SZ, and HS contributed to analyzing the data and writing the manuscript. HS conducted bioinformatics analysis on all the genomes. MT managed bioinformatics environments and helped write the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
- 1. The World Bank. Urban population (% of total). 2018. https://data.worldbank.org/indicator/SP.URB.TOTL.IN.ZS. Accessed 30 Nov 2018.
- 2. United Nations. World urbanization prospects: The 2014 revision, highlights. Department of Economic and Social Affairs, Population Division, United Nations. 2014. https://esa.un.org/unpd/wup/publications/files/wup2014-highlights.pdf. Accessed 30 Nov 2018.
- 24. Stephens B. What Have We Learned about the Microbiomes of Indoor Environments? mSystems. 2016;1.
- 27. Tang JW. The effect of environmental parameters on the survival of airborne infectious agents. J Royal Soc Interface. 2009;6:S737–46.
- 33. Thos C, Haldane JS, Anderson AM. The carbonic acid, organic matter, and micro-organisms in air, more especially of dwellings and schools. Philos Trans R Soc Lond Ser B Biol Sci. 1887;178:61–111.
- 40. Klein BA, Lemon KP, Gajare P, Jospin G, Eisen JA, Coil DA. Draft genome sequences of Dermacoccus nishinomiyaensis strains UCD-KPL2534 and UCD-KPL2528 isolated from an indoor track facility. Genome Announc. 2017;5.
- 41. Kincheloe GN, Eisen JA, Coil DA. Draft Genome Sequence of Arthrobacter sp. Strain UCD-GKA (Phylum Actinobacteria). Genome Announc. 2017;5.
- 42. Koenigsaecker TM, Eisen JA, Coil DA. Draft Genome Sequence of Gordonia sp. Strain UCD-TK1 (Phylum Actinobacteria). Genome Announc. 2016;4.
- 43. Klein BA, Lemon KP, Faller LL, Jospin G, Eisen JA, Coil DA. Draft Genome Sequence of Curtobacterium sp. Strain UCD-KPL2560 (Phylum Actinobacteria). Genome Announc. 2016;4.
- 44. Coil DA, Benardini JN, Eisen JA. Draft genome sequence of Bacillus safensis JPL-MERTA-8-2, isolated from a Mars-bound spacecraft. Genome Announc. 2015;3.
- 45. Coil DA, Eisen JA. Draft Genome Sequence of Porphyrobacter mercurialis (sp. nov.) Strain Coronado. Genome Announc. 2015;3.
- 46. Betts MN, Jospin G, Eisen JA, Coil DA. Draft genome sequence of Planomicrobium glaciei UCD-HAM (phylum Firmicutes). Genome Announc. 2015;3.
- 47. Lymperopoulou DS, Coil DA, Schichnes D, Lindow SE, Jospin G, Eisen JA, et al. Draft genome sequences of eight bacteria isolated from the indoor environment: Staphylococcus capitis strain H36, S. capitis strain H65, S. cohnii strain H62, S. hominis strain H69, Microbacterium sp. strain H83, Mycobacterium iranicum strain H39, Plantibacter sp. strain H53, and Pseudomonas oryzihabitans strain H72. Stand Genomic Sci. 2017;12:17.
- 48. Lo JR, Lang JM, Darling AE, Eisen JA, Coil DA. Draft genome sequence of an Actinobacterium, Brachybacterium muris strain UCD-AY4. Genome Announc. 2013;1.
- 49. Bendiks ZA, Lang JM, Darling AE, Eisen JA, Coil DA. Draft Genome Sequence of Microbacterium sp. Strain UCD-TDU (Phylum Actinobacteria). Genome Announc. 2013;1.
- 50. Coil DA, Doctor JI, Lang JM, Darling AE, Eisen JA. Draft Genome Sequence of Kocuria sp. Strain UCD-OTCP (Phylum Actinobacteria). Genome Announc. 2013;1.
- 51. Holland-Moritz HE, Bevans DR, Lang JM, Darling AE, Eisen JA, Coil DA. Draft Genome Sequence of Leucobacter sp. Strain UCD-THU (Phylum Actinobacteria). Genome Announc. 2013;1.
- 52. Flanagan JC, Lang JM, Darling AE, Eisen JA, Coil DA. Draft genome sequence of Curtobacterium flaccumfaciens strain UCD-AKU (phylum Actinobacteria). Genome Announc. 2013;1.
- 53. Diep AL, Lang JM, Darling AE, Eisen JA, Coil DA. Draft Genome Sequence of Dietzia sp. Strain UCD-THP (Phylum Actinobacteria). Genome Announc. 2013;1.
- 62. Pollock J, Glendinning L, Wisedchanwet T, Watson M. The madness of microbiome: attempting to find consensus “best practice” for 16S microbiome studies. Appl Environ Microbiol. 2018. https://doi.org/10.1128/aem.02627-17.
- 77. Arakawa K, Tomita M. The GC skew index: a measure of genomic compositional asymmetry and the degree of replicational selection. Evol Bioinformatics Online. 2007;3:159–68.
- 89. Vieira-Silva S, Rocha EP. The systemic imprint of growth and its uses in ecological (meta)genomics. PLoS Genet. 2010;6.
- 100. Cortes MAM, Nessar R, Singh AK. Laboratory maintenance of Mycobacterium abscessus. In: Curr Protoc Microbiol: Wiley; 2005. https://doi.org/10.1002/9780471729259.mc10d01s18.
- 103. Karakasidou K, Nikolouli K, Amoutzias GD, Pournou A, Manassis C, Tsiamis G, et al. Microbial diversity in biodeteriorated Greek historical documents dating back to the 19th and 20th century: a case study. Microbiology Open. e00596. https://doi.org/10.1002/mbo3.596.
- 104. Błażej P, Mackiewicz D, Wnętrzak M, Mackiewicz P. The impact of selection at the amino acid level on the usage of synonymous codons. G3 Genes Genom Genet. 2017;7:967–81.
- 125. SILVA. High quality ribosomal RNA databases. 2017. https://www.arb-silva.de/no_cache/download/archive/living_tree/LTP_release_128/. Accessed 30 Nov 2018.
- 127. Weaver KF, Morales VC, Dunn SL, Godde K, Weaver PF. An introduction to statistical analysis in research: with applications in the biological and life sciences: Wiley; 2017.
- 128. Sundarraman S. Recent advances in biostatistics: false discovery rates, survival analysis, and related topics: World Scientific; 2011.
- 130. Romano J, Kromrey JD, Coraggio J, Skowronek J. Appropriate statistics for ordinal level data: should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys. In: Annual Meeting of the Florida Association of Institutional Research; 2006. p. 1–33.
- 132. R Project. R: A language and environment for statistical computing. 2010. ISBN 3-900051-07-0. https://www.r-project.org. Accessed 30 Nov 2018.
- 145. Barberán A, Dunn RR, Reich BJ, Pacifici K, Laber EB, Menninger HL, et al. The ecology of microscopic life in household dust. Proc R Soc B. 2015;282. https://doi.org/10.1098/rspb.2015.1139.
- 146. Bruce RJ, Ott CM, Skuratov VM, Pierson DL. Microbial surveillance of potable water sources of the International Space Station: SAE Technical Paper; 2005. http://papers.sae.org/2005-01-2886/. Accessed 16 Oct 2016.
- 155. Pierson DL. Microbial contamination of spacecraft. Gravitational and Space Research. 2001;14. http://www.gravitationalandspacebiology.org/index.php/journal/article/view/261. Accessed 16 Oct 2016.
- 157. La Duc MT, Sumner R, Pierson D, Venkateswaran K. Characterization and Monitoring of Microbes in the International Space Station Drinking Water. Vancouver, British Columbia, Canada: International Conference for Environmental Systems; 2003.
- 159. Soto-Giron MJ, Rodriguez-R LM, Luo C, Elk M, Ryu H, Hoelle J, et al. Characterization of biofilms developing on hospital shower hoses and implications for nosocomial infections. Appl Environ Microbiol. 2016;AEM:03529–15.
- 163. Farias PG, Gama F, Reis D, Alarico S, Empadinhas N, Martins JC, et al. Hospital microbial surface colonization revealed during monitoring of Klebsiella spp., Pseudomonas aeruginosa, and non-tuberculous mycobacteria. Antonie van Leeuwenhoek. 2017:1–14.
- 175. Oubre CM, Birmele MN, Castro VA, Venkateswaran KJ, Vaishampayan PA, Jones KU, et al. Microbial Monitoring of Common Opportunistic Pathogens by Comparing Multiple Real-Time PCR Platforms for Potential Space Applications. Am Instit Aeronaut Astronaut. 2013. https://doi.org/10.2514/6.2013-3314.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.