1 Introduction

Natronobacterium gregoryi (N. gregoryi) SP2 (NCIMB2189/ATCC43098), a member of the genus Natronobacterium [1], was isolated from Lake Magadi, Kenya. N. gregoryi SP2 is an obligate halophile and alkaliphile with a cell size of 0.5–0.7 × 10–15 μm that thrives at an optimum salt concentration of 20%, a temperature of 37 ℃ and a pH of 9.5 [2]. Strain SP2 is a Gram-stain-negative, catalase-positive, oxidase-positive, red-pigmented, aerobic archaeon that shares a sequence similarity of 97.3% with Natronobacterium texcoconense sp. nov. strain B23 based on phylogenetic analysis of 16S rRNA gene sequences [2]. In this study, N. gregoryi SP2 was grown in the provided culture medium with 20% NaCl for almost 5 days before genome sequencing.

The strain has been in the spotlight since the Chunyu Han group published a report on Natronobacterium gregoryi Argonaute (NgAgo)-mediated genome editing in 2016 [3], which suggested that the NgAgo/gDNA system might offer an attractive alternative for genome manipulation. Soon afterwards, several research groups questioned the genome-editing ability of NgAgo because they found that it did not work in eukaryotic cells [4,5,6,7]. The controversy over the genome-editing functionality of NgAgo continued for more than a year [8,9,10,11], and the paper from the Han group was finally retracted. However, a very recent study revealed that NgAgo enhances homologous sequence-guided gene editing in bacteria [12]. Another recent report shows that NgAgo is a novel DNA endonuclease defined by a characteristic repA domain, and that it cleaves DNA through both a conserved catalytic tetrad in the PIWI domain and the novel repA domain. These results provide insight into the poorly characterized NgAgo for the development of subsequent gene-editing tools [13].

Here, after whole-genome sequencing, we submitted the sequences of 149 contigs to GenBank and the raw data to the Sequence Read Archive (SRA). Three hypothetical proteins from the ‘.gbff’ file provided by the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/855/455/GCA_002855455.1_ASM285545v1) were annotated. COG, GO and KEGG pathway classification analyses were performed for N. gregoryi SP2. In addition, we compared the genome statistics of the three submissions from different research groups, and drew a phylogenetic tree based on 16S rRNA sequences as well as circular genome maps of the three submissions.

2 Whole-Genome Shotgun Sequence

Genome Statistics

Here, we present the whole-genome sequence of N. gregoryi SP2. Archaeal colonies were identified by amplifying the 16S rRNA gene of each colony and aligning it with the 16S rRNA gene of the reference genome. The 16S rRNA gene sequences of N. gregoryi SP2 were amplified using the primers archaeal 344F (5ʹ-ACGGGGYGCAGCAGGCGCGA-3ʹ) and archaeal 915R (5ʹ-GTGCTCCCCCGCCAATTCCT-3ʹ) and sequenced from the 344F end (Shanghai Majorbio Bio-pharm Technology Co., Ltd). DNA for sequencing was extracted from pure colonies with the Wizard® Genomic DNA Purification Kit (Promega, Madison, WI) following the manufacturer’s recommended protocol.

A paired-end library of genome fragments (Covaris M220) with an average insert size of 440 bp was prepared with the TruSeq™ DNA Sample Prep Kit (Illumina, Inc.). PCR was performed with the HiSeq PE Cluster Kit v4 cBot (Illumina, Inc.). The genome of strain SP2 was sequenced on the Illumina HiSeq 4000 platform using 2 × 150 bp paired-end reads with HiSeq 3000/4000 SBS Kits according to the manufacturer’s protocol. The raw reads were trimmed with readfq.v5 to obtain high-quality sequences; trimming removed reads with adapter contamination, reads with a certain proportion of low-quality bases (default setting) and reads with a certain proportion of Ns (default setting) [14]. Quality control was then performed using FastQC v0.11.7 [15]. In total, 8,252,309 paired-end reads were produced.

Assembly was performed with SOAPdenovo v.2.04 [16] at a genome coverage of 640.0×. The draft genome sequence of N. gregoryi SP2 consists of 3,695,310 bp with a G + C content of 62.4%. The assembly has 112 scaffolds, with a scaffold N50 of 77,395 bp and a maximum scaffold size of 309,005 bp. It comprises 149 contigs, with a contig N50 of 68,510 bp and a largest contig of 172,207 bp.

These contigs were then annotated using the GeneMarkS+ v4.4 suite [17] implemented in the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) [18] with default settings. We identified 3,568 coding genes, 200 pseudogenes and 51 RNAs (3 rRNAs, 46 tRNAs and 2 ncRNAs). In addition, 21 simple sequence repeats (SSRs), 21 interspersed repeats and 118 tandem repeats were predicted.
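The assembly metrics reported above follow standard definitions, which the minimal Python sketch below illustrates. It is not the pipeline used in this study: the function names are hypothetical, the contig lengths and the short sequence in the usage lines are toy values, and the coverage figure is only a raw upper bound that assumes the 8,252,309 reads are read pairs in which every read keeps its full 150 bp after trimming.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total assembly."""
    if not lengths:
        raise ValueError("empty length list")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def gc_content(seq):
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def raw_coverage(n_pairs, read_len, genome_size):
    """Raw sequencing depth, assuming two full-length reads per pair."""
    return 2 * n_pairs * read_len / genome_size


# Toy values, for illustration only.
toy_n50 = n50([100, 80, 60, 40, 20])   # half of 300 is reached at the 80 bp contig
toy_gc = gc_content("GGCCAT")          # 4 of 6 bases are G or C

# Figures taken from the text; the interpretation of the read count is an assumption.
depth = raw_coverage(8_252_309, 150, 3_695_310)
print(toy_n50, round(toy_gc, 1), round(depth, 1))
```

Under these assumptions the raw depth comes out near 670×, the same order as the reported 640.0×; because trimmed reads can be shorter than 150 bp, a somewhat lower effective value would be expected.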

At the time of writing, there are three whole-genome sequences of strain SP2 in NCBI, submitted by JGI, UC and UESTC and sequenced on the 454/Illumina, Illumina GAIIX/HiSeq and Illumina HiSeq 4000 platforms, respectively. Their genome statistics are summarized in Table 1. The submission with GenBank accession number CP003377.1 is the only complete genome; it was updated from GenBank accession number FORO00000000, whose assembly level was contig. The assembly level of the other two submissions is contig. The UESTC submission has the highest genome coverage, 640.0× (Table 1). The UC and UESTC genomes have 128 and 149 contigs, with contig N50 values of 48,343 and 68,510 bp, respectively. Although the three genomes were assembled to different levels, their genome features are essentially the same apart from minor differences. Comparative genome analysis shows that the gene organization of the three N. gregoryi SP2 genomes is similar.

Table 1. Genome statistics of three submissions for N. gregoryi SP2.

COG, GO, KEGG Classification Analysis

Based on a BLAST search (E-value ≤ 1E-5) against the STRING database (v9.05), 4020 ORFs were searched against the Clusters of Orthologous Groups of proteins (COG) database to predict and classify their possible functions based on conserved domain alignment. In total, 1,332 genes were successfully annotated and grouped into 21 COG functional categories, including ‘Amino acid transport and metabolism’, ‘Translation, ribosomal structure and biogenesis’, ‘Energy production and conversion’, and ‘Inorganic ion transport and metabolism’. The ‘Amino acid transport and metabolism’ cluster was the largest (164 genes; 12.3%), followed by ‘Translation, ribosomal structure and biogenesis’ (134 genes; 10.1%) (Fig. 1). The columns in Fig. 1 represent the number of genes in each subcategory.

Fig. 1. Histogram presentation of COG classification of all unigenes. A total of 1332 genes were successfully annotated and grouped into 21 COG functional categories.
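The percentages quoted for the two largest COG categories are shares of the 1,332 annotated genes. The short Python check below reproduces that arithmetic from the counts given in the text; the variable and function names are illustrative only.

```python
# Counts taken from the text; names are illustrative.
TOTAL_COG_ANNOTATED = 1332
cog_counts = {
    "Amino acid transport and metabolism": 164,
    "Translation, ribosomal structure and biogenesis": 134,
}


def share(count, total):
    """Percentage of annotated genes in one category, to one decimal place."""
    return round(100 * count / total, 1)


cog_shares = {name: share(n, TOTAL_COG_ANNOTATED) for name, n in cog_counts.items()}
for name, pct in cog_shares.items():
    print(f"{name}: {pct}%")
```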

All genes were also subjected to Gene Ontology (GO) classification analysis to predict their potential biological functions. Genes were annotated with GO terms by Blast2GO against the NCBI-nr protein database with an E-value threshold of 1E-5, and 1294 genes were successfully annotated. These genes fall into three categories: cellular component (473 genes), molecular function (1150 genes) and biological process (1195 genes) (Fig. 2). ‘Cell’ or ‘cell part’, ‘binding’, ‘catalytic activity’, ‘cellular process’, ‘single-organism process’ and ‘metabolic process’ account for the majority of genes in their respective GO categories, indicating that genes related to cellular structure, molecular interaction and metabolism are the most involved in the life cycle of this strain. In contrast, the subcategories with the fewest members were ‘immune system process’, ‘reproduction’ and ‘multi-organism process’ in the biological process ontology, and ‘virion part’ and ‘virion’ in the cellular component ontology.

Fig. 2. GO classification of all unigenes. A total of 1294 genes were annotated and grouped into three GO categories: biological process, cellular component and molecular function. The x-axis shows the GO terms. The left y-axis indicates the percentage of genes in a given term, and the right y-axis indicates the number of genes in that term.

Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was also performed. In the KEGG classification, genes were annotated and grouped into 170 KEGG pathways, of which the 29 pathways containing more than 20 genes are shown in Fig. 3. The metabolic pathway (ko01100) was the largest group, with 476 genes. In addition, the KEGG pathways ‘Biosynthesis of secondary metabolites’ (ko01110), ‘Microbial metabolism in diverse environments’ (ko01120), ‘Biosynthesis of amino acids’ (ko01230) and ‘Carbon metabolism’ (ko01200) were also well represented. Pathway-based analysis helps to further understand the biological pathways in which the genes participate.

Fig. 3. Histogram of KEGG classification. The genes were annotated and grouped into 170 KEGG pathways. The bar chart shows the 29 KEGG pathways that include more than 20 genes.

Annotation of the Hypothetical Proteins

The N. gregoryi SP2 genome that we submitted was annotated with Prokaryotic Genome Annotation Pipeline 4.4, which predicted 3,568 proteins. Of these, 1091 were annotated as hypothetical proteins. Because the structural and functional domains of proteins are conserved in evolution, hypothetical proteins can be compared with known protein sequences or conserved domains to infer their families and potential functions. We searched the 1091 proteins with BLAST against the non-redundant protein sequences of Halobacteria and obtained 8 hits with E-value < 1e-20 and identity > 90%. These 8 hypothetical proteins have no specific annotation in the reference genome annotation file of N. gregoryi SP2.
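The hit selection described above (E-value < 1e-20 and identity > 90%) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: it assumes the BLAST results were saved in the standard 12-column tabular format (`-outfmt 6`), the function name `filter_hits` is hypothetical, and the three example rows are invented identifiers and scores, not data from this study.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(handle, max_evalue=1e-20, min_identity=90.0):
    """Keep rows with E-value below max_evalue and identity above min_identity."""
    kept = []
    reader = csv.DictReader(handle, fieldnames=BLAST6_FIELDS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        pident = float(row["pident"])
        if evalue < max_evalue and pident > min_identity:
            kept.append((row["qseqid"], row["sseqid"], pident, evalue))
    return kept


# Invented example rows (hypothetical identifiers and scores).
example = (
    "hypo_0001\tsubj_A\t95.2\t310\t15\t0\t1\t310\t1\t310\t1e-150\t600\n"
    "hypo_0002\tsubj_B\t88.0\t200\t24\t0\t1\t200\t1\t200\t1e-60\t250\n"
    "hypo_0003\tsubj_C\t97.0\t40\t1\t0\t1\t40\t1\t40\t1e-10\t70\n"
)
hits = filter_hits(io.StringIO(example))
print(hits)
```

In this toy input only the first row passes both thresholds: the second fails on identity and the third on E-value.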

CD-Search [19] was then used to search for conserved domains or functional units within the 8 protein sequences against the CDD v3.17 database. As shown in Table 2 and Fig. 4, we identified 3 conserved domains, i.e. VirB11, TusA and UPF0126, in 3 hypothetical proteins within an E-value threshold of 0.01. These 3 hypothetical proteins of N. gregoryi SP2 could therefore be annotated as a type II secretion system protein, a SirA family protein and the uncharacterized membrane protein YeiH, respectively: the first carries the conserved domain VirB11, similar to that of Halobiforma lacisalsi; the second carries TusA, similar to that of Halobiforma nitratireducens; and the third carries UPF0126, similar to that of Natronobacterium texcoconense.

Table 2. The annotations of the 3 hypothetical proteins.

Fig. 4. Conserved domains found in the hypothetical proteins. All search results are of the type “specific hits” in the concise display. The “Query seq” track represents the length and sequence of the corresponding query protein. The small triangles under the sequence represent conserved motifs, such as catalytic and binding sites. Specific hits in CD-Search represent a high confidence level for the inferred function of the hypothetical proteins, as indicated by their E-values.

3 Three Submissions of N. gregoryi SP2 at NCBI

As described above, NCBI now holds three whole-genome sequencing results for N. gregoryi SP2, submitted by JGI, UC and UESTC, respectively. The three corresponding circular genome maps (Fig. 5) were drawn using DNAPlotter (http://www.sanger.ac.uk/science/tools/dnaplotter). The genome size of the JGI submission is 3,788,356 bp (Fig. 5A); the UC submission has 128 contigs with 3,694,030 bp in total (Fig. 5B); and the UESTC submission has 149 contigs with 3,695,310 bp in total (Fig. 5C). The complete genome from JGI contains 3,770 genes in total, including 3,588 proteins, 122 pseudogenes, 2 ncRNAs, 49 tRNAs and 9 rRNAs. Its G + C content is 62.2% and its coding regions cover 82.4% of the complete genome. The genome submitted by UC has a G + C content of 62.3%, and 3,766 genes are annotated, including 3,524 proteins, 188 pseudogenes, 2 ncRNAs, 47 tRNAs and 5 rRNAs. Our submission has a G + C content of 62.4%, and 3,819 genes are annotated, including 3,568 proteins, 200 pseudogenes, 2 ncRNAs, 46 tRNAs and 3 rRNAs. Although the three submissions differ slightly in gene numbers, their G + C contents are very similar.
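As a quick consistency check on the figures above, the per-category counts of each submission should add up to its reported gene total. The Python sketch below verifies this using only the numbers quoted in the text; the dictionary layout and key names are illustrative.

```python
# Gene counts quoted in the text for the three submissions.
submissions = {
    "JGI":   {"total": 3770, "proteins": 3588, "pseudogenes": 122,
              "ncRNAs": 2, "tRNAs": 49, "rRNAs": 9},
    "UC":    {"total": 3766, "proteins": 3524, "pseudogenes": 188,
              "ncRNAs": 2, "tRNAs": 47, "rRNAs": 5},
    "UESTC": {"total": 3819, "proteins": 3568, "pseudogenes": 200,
              "ncRNAs": 2, "tRNAs": 46, "rRNAs": 3},
}

# For each submission, compare the sum of the categories with the stated total.
consistent = {}
for name, counts in submissions.items():
    parts = sum(value for key, value in counts.items() if key != "total")
    consistent[name] = (parts == counts["total"])
    print(name, parts, counts["total"], consistent[name])
```

All three totals match the sum of their categories.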

Fig. 5. Circular genome maps of the three submissions of N. gregoryi SP2 at NCBI. (A) Complete genome submitted by JGI. The outer scale indicates the coordinates in base pairs. The open reading frames (ORFs) are shown on the first two rings: the first ring (red) shows forward ORFs and the second ring (bright blue) shows reverse ORFs. The third and fourth circles show the tRNA (blue) and rRNA (black) genes. The next circle shows the GC content, with purple and green indicating negative and positive values, respectively. The innermost circle shows the GC skew, with gray indicating negative values and yellow indicating positive values. (B) The UC submission. The first ring is composed of 128 contigs. The next circle displays the GC percentage plot (deep yellow above average, purple below average). The inner circle displays the GC skew. (C) The UESTC submission. The legend is the same as for Fig. 5B, except that the first ring is composed of 149 contigs. (Color figure online)

A phylogenetic tree based on 16S rRNA sequences of Halobacteria was constructed with MEGA X [20] (Fig. 6). The evolutionary history was inferred using the Neighbor-Joining method, and the optimal tree, with a sum of branch lengths of 0.41783639, is shown. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1000 replicates) is shown next to the branches. The tree is drawn to scale, with branch lengths in the same units as the evolutionary distances used to infer it. The evolutionary distances were computed using the Maximum Composite Likelihood method [21] and are expressed as the number of base substitutions per site. The analysis involved 27 nucleotide sequences of 16S rRNA. The tree shows N. gregoryi SP2 on a branch with other strains and reveals that its closest relative is Natronobacterium texcoconense strain B23, with a similarity of 97.6%.
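The text does not state how the 97.6% similarity was calculated. One common convention, assumed here purely for illustration, is percent identity over the aligned columns of a pairwise alignment, skipping columns in which either sequence has a gap. The Python sketch below implements that convention; the function name and the two short aligned sequences are invented and are not 16S rRNA data from this study.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip gapped columns (an assumed convention)
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Invented toy alignment: 8 ungapped columns, 7 of them identical.
toy_identity = percent_identity("ACG-TACGT", "ACGATACGA")
print(toy_identity)
```

Other conventions, such as counting gaps as mismatches, give different values, which is one possible reason why similarity figures for the same pair of strains can differ slightly between analyses.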

Fig. 6. Phylogenetic tree of N. gregoryi SP2. The tree was constructed with MEGA X using the neighbor-joining method based on 16S rRNA gene sequences. The parameters were set to ‘Maximum Composite Likelihood method’ for the substitution model and ‘1000 bootstrap replications’ for the phylogeny test. The scale bar represents 0.010 substitutions per site.

4 Conclusion

In this study, we sequenced the whole genome of N. gregoryi SP2. We compared our genome data with the two other N. gregoryi SP2 genome submissions from different groups and found that they have similar genome statistics. COG, GO and KEGG pathway classifications were performed based on the whole-genome sequence of N. gregoryi SP2. Using the BLASTP program and CD-Search, we also identified and annotated three proteins in the genome as type II secretion system protein E, SirA family protein and uncharacterized membrane protein YeiH. These 3 genes were annotated only as hypothetical proteins by the automatic PGAP workflow in all the submissions. In addition, our evolutionary analysis indicated that Natronobacterium texcoconense strain B23 is the closest relative of N. gregoryi SP2. In summary, this work provides the first systematic characterization and comparison of N. gregoryi SP2 genomes, laying the foundation for further research on this strain. A limitation of this study is that our genome data are at the contig level; the comparison between our data and the existing data for N. gregoryi SP2 could be made more precise once an updated version of the genome is available.

Data availability. This whole-genome shotgun project has been deposited at GenBank under the accession no. PKKI00000000 and BioProject accession no. PRJNA423232. The reads have been submitted to SRA under the accession no. SRP127538.