Comparative genomics of Steinernema reveals deeply conserved gene regulatory networks
- 5.2k Downloads
Parasitism is a major ecological niche for a variety of nematodes. Multiple nematode lineages have specialized as pathogens, including deadly parasites of insects that are used in biological control. We have sequenced and analyzed the draft genomes and transcriptomes of the entomopathogenic nematode Steinernema carpocapsae and four congeners (S. scapterisci, S. monticolum, S. feltiae, and S. glaseri).
We used these genomes to establish phylogenetic relationships, explore gene conservation across species, and identify genes uniquely expanded in insect parasites. Protein domain analysis in Steinernema revealed a striking expansion of numerous putative parasitism genes, including certain protease and protease inhibitor families, as well as fatty acid- and retinol-binding proteins. Stage-specific gene expression of some of these expanded families further supports the notion that they are involved in insect parasitism by Steinernema. We show that sets of novel conserved non-coding regulatory motifs are associated with orthologous genes in Steinernema and Caenorhabditis.
We have identified a set of expanded gene families that are likely to be involved in parasitism. We have also identified a set of non-coding motifs associated with groups of orthologous genes in Steinernema and Caenorhabditis involved in neurogenesis and embryonic development that are likely part of conserved protein–DNA relationships shared between these two genera.
KeywordsGene Ontology Differentially Express Differentially Express Gene Entomopathogenic Nematode Infective Juvenile
Features of the Steinernema draft genomes
Estimated genome size (Mb)
Number of scaffolds
GC content (%)
N content (Mb)
N content (%)
Maximum scaffold size (bp)
Number of Augustus-predicted genes
Number of Augustus-predicted transcripts
Average gene length (bp)
Average intron length (bp)
Average exon length (bp)
Average intergenic distance (bp)
Average number of exons per gene
Average number of introns per gene
GC content in coding regions (%)
Number of genes with no introns
Repeat content (%)
Results and discussion
We sequenced and assembled the genomes of five Steinernema species (S. carpocapsae, S. feltiae, S. glaseri, S. monticolum, S. scapterisci) for comparative analysis (Fig. 1, Table 1, Additional file 1: Figure S1). We focused on sequencing S. carpocapsae in greater depth than the others to use it as a representative for comparative analyses with other nematode genera. The other species were chosen based on their commercial availability, their evolutionary relationships, and their varied host specificities and foraging strategies. In addition, we sequenced and assembled de novo the mRNA of the IJ-stage of each species to aid in genome annotation. Additional RNA was collected for S. carpocapsae, S. feltiae, and C. elegans at the embryonic, first larval (L1), and young adult stages for a comparative analysis of gene expression, which is discussed below. The final genome assembly sizes ranged from 80 to 90 Mb and 28,000 to 36,000 genes (Fig. 1, Table 1, Additional file 1: Figure S2, Additional file 2: Table S1) and S. carpocapsae was the best assembled genome (scaffold N50 = 299 kb) with an estimated genome completeness of 98 % (Fig. 1, Additional file 2: Table S2). Detailed methods for assembly and annotation can be found in the “Methods” section.
Although taxon selection clearly influences phylogenomic accuracy , sequencing the genomes of multiple species in the same genus should increase confidence in our ability to recover their evolutionary history [15, 16, 17, 18]. Current best estimates place Steinernema in Holterman clade 10 and thus closely related to the sequenced nematode Bursaphelenchus xylophilus, a plant-parasite, and the free-living Panagrellus redivivus (Fig. 1a) [1, 2, 19]. Previous attempts to recover relationships among different species of Steinernema resulted in several poorly resolved/supported nodes, likely due to the limited number of molecular markers used and their homoplastic and/or plesiomorphic nature . We evaluated the relationships among the five Steinernema using a supermatrix of 3,885 strictly orthologous genes (1:1:1:1:1:1), with P. redivivus as our out-group taxon (Fig. 1b). The relationships we recovered are strongly supported but differ from previous hypotheses in that S. monticolum, which was chosen for sequencing based on its hypothesized close relationship to S. carpocapsae and S. scapterisci , was more closely related to S. feltiae than any of the other nematodes in our analysis.
Protein family domain abundance
We then analyzed the predicted protein domains to understand which gene families have undergone expansions in Steinernema that may have contributed to adaptation to a parasitic lifestyle. The S. carpocapsae proteome was predicted to have a total of 17,518 Pfam domains from 3,256 different Pfam accession categories. The relative Pfam domain abundances of the S. carpocapsae genome were compared to those in the parasitic Bursaphelenchus xylophilus, Brugia malayi, and Ascaris suum, as well as the free-living C. elegans (Fig. 2b, Additional file 1: Figures S3–S6). Overall, most Pfam domains were detected at similar levels in both genomes, with some notable exceptions. For example, while most transcription factor domains showed similar prevalence in both genomes, we found the expected enrichment of C4 zinc fingers in C. elegans that are associated with nuclear hormone receptors, as well as a novel three-fold enrichment of the alcohol dehydrogenase transcription factor Myb/SANT-like domain in S. carpocapsae (Fig. 2b). The S. carpocapsae genome appears to be enriched in proteases, protease inhibitors, certain families of G protein-coupled receptors (GPCRs) , and fatty acid- and retinol-binding (FAR) proteins, among others (Fig. 2b, Additional file 1: Figures S3–S6). The abundance of predicted Pfam domains from S. carpocapsae was compared with the other four Steinernema species we sequenced (Fig. 2, Additional file 1: Figures S7–S10). The domain richness of certain types of proteases, protease inhibitors, and certain families of GPCRs varied widely between the different species of Steinernema, though some enrichments were shared, such as the greater abundance of certain protease and protease inhibitor families, and FAR proteins, which appeared in all Steinernema species and are discussed separately below (Fig. 2c).
Putative parasitism genes: proteases and protease inhibitors
Proteases (peptidases) are involved in a wide variety of biological functions including development, digestion, signal transduction, and immune responses . Of particular relevance in these genomic analyses is the role proteases play in parasitism, such as tissue penetration and immune suppression or evasion [32, 33, 34, 35]. A total of 654 peptidases were identified in the S. carpocapsae genome (Fig. 2, Additional file 2: Table S4). These can be broken down into five key classes based on the chemical groups that function in catalysis: aspartic (6.3 %), cysteine (19 %), metallo- (32.7 %), serine (37.6 %), and threonine (4.1 %). Because steinernematids can be lethal even without their pathogenic symbionts [36, 37, 38, 39] and proteolytic activity is higher in the excreted-secreted products in more virulent strains [39, 40], proteases were among the first products examined in relation to the lethality of Steinernema nematodes and have been suggested to be actively pumped into host tissues by parasitic nematodes [33, 35, 41, 42, 43]. Steinernematids have more predicted proteases (Fig. 2, Additional file 2: Table S4) than any other nematode sequenced to date. This correlates with the remarkably broad host ranges of many Steinernema species, which can infect multiple species across many insect orders in some cases, whereas other parasitic nematodes have more restricted or specialized host ranges. Breaking the proteases into subclasses highlights species-specific expansions of serine and metalloproteases among Steinernema species. However, the abundance of aspartic, cysteine, and threonine proteases is relatively similar across nematodes (Fig. 2c, Additional file 2: Tables S4–S7). The serine and metalloproteases are the most highly represented families in nematode excreted-secreted products, suggesting that they play a role in parasitism . We found Steinernema-specific expansions of chymotrypsin-like (S01A), Lon-A-like (S16), and signal peptidase I-like (S26A) serine proteases and expansions of the astacin (M12A), carboxypeptidase A1-like (M14A), and the pitrilysin (M16A) metalloproteases. Whereas chymotrypsin-like and carboxypeptidase A1-like proteases were expanded in all five of the Steinernema spp. when compared to other nematodes, other proteases such as the Lon-A-like, signal peptidase I-like, astacin, and pitrilysin proteases were only expanded in certain species (Fig. 2c, Additional file 2: Tables S8–S11). These expansions represent putative parasitism genes and may affect the host-range and specificity of these species, influencing their ability to infect and avoid the immune response of certain potential host species. Some proteases in these expanded families have characterized roles in parasitism in Steinernema. For example, an S01A chymotrypsin-like protease from S. carpocapsae has increased expression in IJs exposed to waxworm hemolymph and suppresses waxworm prophenoloxidase activity and immune encapsulation in vitro . Additional biochemical and molecular studies are needed to understand immune suppression and evasion by steinernematids and the role proteases play in these processes.
Previous work has shown a functional role for several proteases in the parasitism of insects by entomopathogenic nematodes. For example, Toubarro et al. identified an S. carpocapsae serine protease that hydrolyzes extracellular matrix proteins and induces apoptosis of insect cells . Two other S. carpocapsae serine proteases are involved in immune subversion by inhibiting insect prophenoloxidase activity and disrupting cellular encapsulation by the insect immune response [33, 44]. Also, an S. carpocapsae astacin is upregulated in IJs upon infection of an insect host, suggesting a role in the infection process . Our findings further support the notion that certain families of proteases play a role in parasitism and may have shaped niche partitioning among the many species of insect parasites.
The virulence of parasitic nematodes is heavily influenced by not only proteases but also protease inhibitors [43, 46]. In addition to the expansion of proteases, the Steinernema genomes show large expansions of several specific protease inhibitor families, such as the I4 serine protease inhibitor (serpin) family, the I8 chymotrypsin/elastase inhibitor family, and the I63 pappalysin-1 inhibitor family (Fig. 2c, Additional file 2: Table S12) . This genus-specific expansion in Steinernema species and the known role of many protease inhibitors in parasitism, particularly serpins (reviewed by Molehin et al. ), suggests that these protease inhibitors are putative parasitism genes likely used by steinernematids to successfully infect hosts. We examine stage-specific gene expression of some of these putative parasitism genes below (S26 proteases and I63 protease inhibitors). Future investigations of the expression context and biochemical function of the expanded proteases and protease inhibitors identified here in these parasitic nematodes might reveal that they facilitate the parasitism of insects and that the various expansions and retractions of these families among steinernematids influence host range and specificity.
Putative parasitism genes: fatty acid- and retinol-binding proteins
The fatty acid- and retinol-binding protein (FAR) gene family represents another dramatic case of genus-wide expansion in Steinernema (Fig. 2d, Additional file 1: Figure S11, Additional file 2: Table S13). Steinernema species have between 38 and 54 FAR proteins compared to 19 in P. pacificus and fewer in the other nematodes we examined (Additional file 1: Figure S11, Additional file 2: Table S13). FAR proteins are a family of lipid-binding proteins that have high binding affinities for fatty acids, retinol, and retinoic acid and are unique to nematodes . They are important in the growth, development, and reproduction of C. elegans, which, like most if not all nematodes, is auxotrophic for sterols. However, FAR proteins were originally discovered in vertebrate-parasitic nematodes, where, in addition to their role in growth and development, they are thought to play a key role in parasitism by functioning in the sequestration of host retinoids as well as by contributing to immune evasion or suppression, though their exact role is not understood [49, 50]. Although parasitism arose independently multiple times among nematodes, FAR proteins have been implicated in the parasitism of plants, invertebrates, and vertebrates across all of the parasitic lineages, suggesting that this protein family is particularly important to parasitism (Fig. 1b) [49, 51, 52]. We examine the stage-specific expression of FARs and explore their genome architecture below. While the function of these proteins in parasitism remains to be shown, one possibility is that they interact with eicosanoids—fatty acids involved in immunological signaling in plants, mammals, and insects [53, 54, 55]. Inhibiting eicosanoid biosynthesis has been shown to reduce the melanotic encapsulation response of insects, which is thought to be insects’ primary defense against nematode parasites [10, 56]. For example, Xenorhabdus nematophila, the insect-pathogenic symbiont of S. carpocapsae, has been shown to dampen the host insect immune response by inhibiting eicosanoid synthesis in infected insects, increasing the likelihood of a successful infection by S. carpocapsae [57, 58]. Thus inhibiting eicosanoid biosynthesis in hosts is one way that parasitic nematodes may suppress host immunity.
Differential gene expression analysis
Gene expression conservation across species
The expression of orthologous genes during development is known to diverge [61, 62]. In order to identify genes with conserved patterns of stage-specific expression across closely and more distantly related species, mRNA from the corresponding embryonic, L1, IJ, and adult stages of S. feltiae and C. elegans was collected and sequenced for comparison to S. carpocapsae (Fig. 3, Additional file 2: Table S14). We limited our analysis to 5,569 1:1:1 orthologs present in all three species to avoid the complications of divergent expression due to gene duplications. We used two methods for determining conserved stage-specific ortholog expression. The first method binarized stage gene expression values using a flexible threshold to sort genes into stage-specific sets. We used this method to quantify the number of orthologs that are “on” and “off” in the same developmental stages between species. We found that 79.3 % (4,416/5,569) of these orthologs had conserved expression between S. carpocapsae and S. feltiae, whereas pairwise comparisons of the expression of each Steinernema species to C. elegans showed lower overall conservation of stage-specific expression of 61–63 % (3,504/5,569 and 3,432/5,569) (Additional file 1: Figure S14). Nevertheless, given that the steinernematids are phylogenetically distant from C. elegans yet share expression of more than two-thirds of their 1:1:1 orthologous genes, these results suggest that gene expression of this core set of unduplicated genes is highly conserved among nematodes. In a separate analysis, we treated the expression of each ortholog during development in each species as a vector and calculated their cosine similarities to address whether the ortholog expression profiles parallel each other during development. We found that 1,441 out of 5,569 orthologs (25.8 %) had a conserved pattern of stage-specific expression (ortholog expression similarity > 0.95) between S. carpocapsae and S. feltiae (Fig. 3b, Additional file 6), whereas there was more divergence with C. elegans. Only 541 (9.7 %) orthologs were conserved in stage-specific expression between C. elegans and S. carpocapsae and 490 (8.7 %) between C. elegans and S. feltiae when all developmental stages were considered.
Using the stage-specific gene expression data, we determined the gene expression levels of 41 FARs, as well as the expression levels of a family of serine proteases and protease inhibitors (S26 and I63) that may play a role in the parasitic lifestyle of S. carpocapsae. We found that sets of the I63 protease inhibitors were expressed at particular post-embryonic developmental stages, with the highest expression levels occurring in the IJ (839.8 FPKM) and adult stage (239.4 FPKM) (Fig. 4a). These are the stages most important in the successful infection of an insect host and these expression data support the notion that I63 protease inhibitors are important for S. carpocapsae parasitism. However, most of the S26 proteases (14/17 proteases) were expressed primarily in the embryonic stage, suggesting that they are involved in development rather than the parasitism of insects by Steinernema (Additional file 1: Figure S13). Additionally, we found that 39 of 41 FAR genes were primarily expressed during the post-embryonic stages (Fig. 4b), and that about half of these genes appeared in clusters in the genome sequence. Of the eight C. elegans FAR genes, only far-1 was conserved in the steinernematids. This gene is reported as having highest expression in L3 C. elegans worms . We confirmed this, seeing high expression in the dauer and L1 stages (Additional file 1: Figure S15). Among the stages we tested, Steinernema far-1 orthologs had highest expression in L1 (Additional file 1: Figure S16), suggesting that they function in development and not parasitism, but this remains to be tested.
Genome conservation and synteny analysis
The second set of genes we investigated was the family of FAR genes in Steinernema. We found a total of 22 out of 41 FAR genes in synteny across three distinct syntenic clusters in S. carpocapsae. By examining the location of the 1:1 orthologs of these genes in the other Steinernema species, we found that the majority of these orthologs are also syntenic in S. scapterisci (13/17 1:1 orthologs) and S. feltiae (10/12 1:1 orthologs) and that their expression during development is also conserved across the Steinernema species (Additional file 1: Figure S18). Interestingly, we also saw that the most highly expressed FAR in S. carpocapsae has five paralogs in S. feltiae, with one dramatically changing its expression pattern from adult to embryonic stage, which suggests that this family is undergoing further rapid functional evolution within Steinernema (Fig. 4c).
Conserved non-coding networks
Multiple motifs from the same networks clustered together in or near some of the orthologous target genes of S. carpocapsae and C. elegans. Some of these motif clusters showed conserved order and position, whereas others showed variation in order only, position only, or variation in both between species (Fig. 6d, Additional file 1: Figure S21). Comparative analysis of the Steinernema congeners led to the identification of these conserved motifs. We found them conserved in C. elegans and enriched near genes influencing similar biological processes in a distantly related genus. This finding suggests that they are under evolutionary selection, although their functionality remains to be tested.
The sequencing of multiple species of Steinernema enabled us to identify gene family expansions that are consistent with and likely important to the particular biology of these species as parasites; to generate new hypotheses about genes likely to be important in parasitism; to explore their genealogy more deeply than ever before and refine our understanding of their relationships to each other as well as define other phylogenetic markers that could be used in subsequent analyses; to identify stage-specific enrichment of functional gene classes; to demonstrate that the differential expression of stage-specific genes is influenced by phylogeny; to explore the evolution of the developmental control genes in the Hox gene cluster and diagnose expansion and rapid evolution of this cluster; and to identify previously unknown conserved non-coding regulatory motifs that regulate similar biological processes in distantly related organisms . Our results point to a core set of conserved motifs, functioning in both C. elegans and S. carpocapsae, that regulate similar biological processes key to proper nematode development across vast phylogenetic distance. These motifs are not detectable from direct sequence alignment between Caenorhabditis and Steinernema but can be found when analyzing genus-level conservation and using conserved gene expression and gene-motif association between orthologs. Further analysis will be required to assess whether these motifs form a phylum-wide core kernel of regulatory relationships or are restricted to the last common nematode ancestor of these two genera.
Materials and methods
Strain culturing and maintenance
S. carpocapsae (strain All), S. scapterisci (strain FL), S. feltiae (strain SN), S. glaseri (strain NC), and S. monticolum (Mount Jiri strain) were reared and maintained using standard methods  (Fig. 1, Additional file 1: Figure S1). Briefly, five last-instar Galleria mellonella waxmoth larvae or a single adult cricket for S. scapterisci (American Cricket Ranch, Lakeside, CA, USA) were placed in a 5 cm Petri dish with a 55 mm Whatman 1 filter paper acting as a pseudo-soil substrate in the bottom of the dish. Up to 250 μl containing 500–1,000 IJs suspended in water was evenly distributed on the filter paper. After 7–10 days the insect cadavers were placed on White traps . Waxmoth cadavers infected with S. glaseri were placed in a Petri dish partially filled with plaster of Paris and harvested from this, because S. glaseri emerge as pre-IJs that will not properly develop if they emerge directly into water . Emerging IJs from all species were harvested, washed for 10 minutes in 0.4 % Hyamine 1622 solution (Fluka, Switzerland), and rinsed three times with water.
Isolation of DNA and RNA
To harvest bulk genomic DNA and IJ RNA, IJs from each species were washed in 0.4 % Hyamine, rinsed three times, and acclimated in Ringer’s solution for 15–30 minutes prior to nucleic acid collection. For DNA extraction, a Wizard Genomic DNA Purification Kit (Promega, Madison, WI) was used following the manufacturer’s protocol. The genomic DNA was then treated with RNase A to remove any RNAs present in the sample. For RNA extractions, the nematodes were snap-frozen in liquid nitrogen in ~100 μL aliquots and stored at −80 °C. Worms were then freeze-thawed three or four times to break the cuticle before extracting bulk RNA. Bulk RNA was then extracted using a phenol-chloroform extraction using Trizol (Invitrogen, Carlsbad, CA). This sample was treated with DNase to remove lingering DNA and then poly-A selected to isolate eukaryotic messenger RNA, reducing if not removing bacterial contamination. To isolate embryonic, L1, and adult stage-specific RNA from S. carpocapsae, 1,000–2,000 IJs were placed onto 10 cm lipid agar plates seeded with overnight cultures of Xenorhabdus nematophila (strain ATCC 19061). These cultures were allowed to grow for ~42 hours to collect young adults, or ~68 hours to collect embryos from mature gravid females. Gravid females were collected from the plates by adding enough distilled water to the cover the surface of the plates, swirling the plates by hand to lift the nematodes into suspension, and pouring them into conical tubes. These were then pelleted by gentle centrifugation and rinsed several times with distilled water until the supernatant was clear. The nematodes were placed in separate 15 mL conical tubes in 7 mL aliquots, and topped off to 15 mL with bleach solution (16.6 mL of 12 % bleach, 5 mL of 1 M KOH, and 80 mL of distilled water). Eggs were harvested by bleaching the nematodes until all nematode tissue was dissolved, leaving only the eggs. These embryos were then either harvested for total RNA as described above, or they were allowed to hatch to L1s in Ringer’s solution over a period of ~30 hours before harvesting the total RNA. To isolate embryonic, L1, and adult stage-specific RNA from S. feltiae, 1,000–2,000 IJs were placed onto 10 cm lipid agar plates seeded with overnight cultures of Xenorhabdus bovienii (Akhurst and Boemare ATCC 35271). These cultures were allowed to grow for ~36 hours to collect adults or ~55 hours to collect embryos from gravid females. To collect L1s, we waited until all embryos hatched, which was ~24 hours. The same bleaching procedure was followed to harvest embryos and L1s as for S. carpocapsae. To isolate embryonic, L1, and adult stage-specific RNA from C. elegans (N2 strain) worms were placed onto 10 cm nematode growth media (NGM) plates seeded with overnight cultures of Escherichia coli (OP50 strain). To these, 200 uL aliquots of OP50 were added every day. Plates with lots of gravid adults were bleached according to the guide for maintenance of C. elegans in Wormbook . The embryos were either collected to harvest embryos, placed in Ringer’s solution for ~20 hours to harvest L1s, or plated on fresh NGM plates seeded with E. coli OP50 and collected ~47 hours later to harvest young adults.
Genomic and RNA-seq library construction
The genomic library was constructed using an Illumina Paired-End DNA Sample Preparation Kit according to the manufacturer’s instructions. Briefly, 3 μg of genomic DNA were fragmented using nebulization. The fragments were end-repaired, 3′-adenylated, and ligated to Illumina’s paired-end adaptors. The ligation products were size-selected on an agarose gel to yield fragments of approximate length of 350 bp. These fragments were then PCR-amplified to produce the finished library. The mate-pair or “jumping” library was prepared using an Illumina Mate Pair Library Preparation Kit v2. Briefly, 7.5 μg of genomic DNA was fragmented using a HydroShear device (Genomic Instrumentation Services, Foster City, CA) to generate fragments of ~2.2 kb. Following end repair and biotinylation, the 2.2 kb fragment was gel-purified and circularized. Circular DNA was fragmented using a Bioruptor NGS (Diagenode, Denville, NJ) and biotin-containing fragments were isolated using Dynabeads (Invitrogen). The fragments were end-repaired, 3′-adenylated, and ligated to NEBNext Multiplex Adaptors (NEB, Ipswich, MA). The ligation products were PCR-amplified and size-selected using AMPure XP beads (Beckman Coulter, Brea, CA) to generate the finished library of approximately 450 bp in length. Genomic libraries were sequenced on an Illumina Genome Analyzer IIx sequencer in paired-end mode with the read length of 76 bp. The jumping library was sequenced on an Illumina HiSeq2000 in paired-end mode with the read length of 100 bp (Additional file 2: Table S20).
The first set of RNA samples, which was used for genome annotation of all genomes, was prepared from 10 μg of total RNA from IJs, poly(A)-selected, and libraries constructed using a standard unstranded protocol [68, 73]. Libraries were quantified using a Qubit fluorometer (Invitrogen) and size distributions were verified using an Agilent Bioanalyzer and the High Sensitivity DNA Kit. These RNA-seq libraries were sequenced on the Illumina Genome Analyzer IIx sequencer in paired-end mode to a read length of 76 bp (Additional file 2: Table S21). The second set of RNA-seq samples, which was used for stage-specific gene expression analyses in S. carpocapsae and S. feltiae, was prepared from 5–30 μg of total RNA, poly(A)-selected using a Dynabeads mRNA DIRECT Kit (Invitrogen), and fragmented with a hydrolysis buffer containing magnesium ions . Double-stranded cDNA was prepared from the mRNA fragments using Invitrogen’s SuperScript Double-Stranded cDNA Synthesis Kit. During the second strand of reverse transcription, dUNTP (Applied Biosystems, Foster City, CA) was added to label the second strand (stranded protocol), and the libraries were prepared following the Myer’s Lab ChIP-seq protocol version 2011 with Illumina sequencing adapters. The libraries were sequenced on either the Illumina HiSeq 2000 or the NextSeq 500 sequencer in single-end mode to a read length of 50 bp or 75 bp, respectively (Additional file 2: Table S14). Reads for RNA-seq samples used for the gene expression analysis and gene expression tables were submitted to Gene Expression Omnibus (GEO) under the accession number [GSE68588].
The genomic libraries were built, sequenced, assembled, filtered, and repeat-masked as previously described  using Velvet 1.2.07 and RepeatModeler 1.0.5, RepeatMasker 3.0.3, recon 1.70, and RepeatScout 1.0.5 (Table 1). The genomes and gene annotations are available at .
The Whole Genome Shotgun project for S. carpocapsae has been deposited at DDBJ/EMBL/GenBank under the accession [AZBU00000000]. The version described in this paper is version AZBU01000000.
The Whole Genome Shotgun project for S. feltiae has been deposited at DDBJ/EMBL/GenBank under the accession [AZBV00000000]. The version described in this paper is version AZBV01000000.
The Whole Genome Shotgun project for S. glaseri has been deposited at DDBJ/EMBL/GenBank under the accession [AZBX00000000]. The version described in this paper is version AZBX01000000.
The Whole Genome Shotgun project for S. monticolum has been deposited at DDBJ/EMBL/GenBank under the accession [AZHV00000000]. The version described in this paper is version AZHV01000000.
The Whole Genome Shotgun project for S. scapterisci has been deposited at DDBJ/EMBL/GenBank under the accession [AZBW00000000]. The version described in this paper is version AZBW01000000.
Transcriptome assembly and genome annotation
IJ-stage, paired-end 75 bp, unstranded RNA-seq data sequenced to an average depth of 76 million reads for S. feltiae, S. glaseri, S. monticolum, and S. scapterisci, and embryo, L1, IJ, and adult stage data for S. carpocapsae were de novo assembled into expressed sequence tags (ESTs) with Oases 0.2.6 as previously described , with the following options: -m 23, -M 59, -s 4, -ins_length. To annotate each genome, ESTs were mapped onto the genome with BLAT 3.4 and these used as hints for gene finding using Augustus 2.6 with C. elegans settings (options: --species = caenorhabditis, --gff3 = on, --alternatives-from-evidence = true, --uniqueGeneId = false, --protein = on, --codingseq = on, --noInFrameStop = true, --UTR = on, --hintsfile) . Separately, RNA-seq reads were mapped onto the genome using TopHat 1.4  to find novel transcripts using Cufflinks 2.0.2  (Table 1, Additional file 2: Table S22, Additional file 7), which is described in more detail in a later section below.
Filtering bacterial symbiont DNA and other bacterial DNA contaminants from genomes
Protein sequences coded by intronless Augustus-predicted genes (putative bacterial contamination) were compared to a database using blastp in Blast2GO  to determine the identities of the bacterial contaminants present in the respective nematode genomes (Additional file 1: Figure S2). Assembled bacterial genomes matching the species blast results were obtained from GenBank, and their sequences were compared to the respective nematode genomes with BLAT 3.4, and removed from the nematode assemblies when the sequence match was >94 % identical . After filtering out bacterial DNA contamination, the genome annotations were repeated for each assembly using Augustus.
To study the evolution of gene families across nematodes, we used the available predicted protein datasets from WormBase release WS225 — Brugia malayi, Caenorhabditis elegans, Meloidogyne hapla, Pristionchus pacificus, and Trichinella spiralis [21, 22, 23, 26, 27]. We also included the Ascaris suum and Bursaphelenchus xylophilus predicted proteome datasets from WormBase release WS229 [24, 25]. We also used the Panagrellus redivivus genome assembly prior to its WormBase release . For out-group and comparative analysis we used the predicted protein dataset of the Nasonia vitripennis (v1.2) genome project, obtained from the NCBI/NIH repository  (Fig. 2a–c, Additional file 1: Figures S3–S10). Version 1.4 of the OrthoMCL pipeline was used to cluster proteins into families of orthologous genes, with default settings and the BLAST parameters recommended in the OrthoMCL documentation  (Fig. 2, Additional file 2: Table S1).
Protein domain analyses
To evaluate the prevalence of protein domains in the proteome of Steinernema carpocapsae and other species, we used the hmmscan program from the latest version of HMMER (3.0) software package, which implements probabilistic profile hidden Markov models . We set our threshold E-value criterion at 10−6, so that no known false-positive matches would be detected in assigning Pfam domain identities. We ran this analysis on the proteomes mentioned above and filtered out splice isoforms from the C. elegans proteome.
Gene tree analyses
Some protein families were further explored by evaluating gene trees either with whole protein sequences or by protein domain sequences. To do these analyses we aligned protein sequences using MUSCLE . Aligned protein sequences were then evaluated by distance analysis using the JTT matrix and a subsequent Neighbor-joining tree was created using the PHYLIP software package version 3.68, using the protdist and neighbor programs, and seqboot where bootstrap values were reported  (Fig. 2d, Additional file 1: Figure S11).
Supermatrix construction and whole genome phylogenetic analysis
The orthology analysis above resulted in 3,885 strictly orthologous genes (1:1 conservation across all steinernematid species and the out-group, P. redivivus). These strict orthologs were then compiled and used for the supermatrix construction and subsequent phylogenetic analysis. Because alignment accuracy greatly influences phylogenetic analyses and an earlier study on Steinernema phylogeny shows that there can be greater topological variation due to different alignment construction parameters than owing to the methods used to generate the phylogenies [85, 86], we took a very conservative approach to generating the amino acid sequence alignments. Accordingly, each gene was first aligned separately in MAFFT v6.821b . The L-INS-i algorithm was chosen because it is the most accurate setting in MAFFT for datasets containing fewer than 200 species . Because this analysis incorporated more genes (3,885 per species) than can reasonably be checked by eye, we used GBlocks v0.91  to objectively eliminate highly divergent and ambiguously aligned regions of the transformation series [89, 90, 91]. Using the batch feature of GBlocks we applied strict settings: four of the six species’ amino acids were required to make a conserved position for a column, five of the six species’ amino acids were required to create a flank position, ten conserved amino acids were required to make a block, eight consecutive non-conserved amino acids was the maximum allowed, and all gaps were removed.
GBlocks identified sequence divergence and alignment ambiguity problems that led us to remove 14 genes from the analysis. Prior to the GBlocks analysis a supermatrix of all of the genes contained 2,924,577 amino acids; the optimized alignment was reduced to 1,320,306 amino acids, a 45 % reduction. GBlocks output was used to concatenate the individual gene files into a supermatrix.
We constructed phylogenetic trees in PAUP* v4.0b10  under the parsimony optimality criterion. The tree search parameters for the supermatrix were an exhaustive parsimony search enforcing a monophyletic root. The result was a separate tree file for each gene and another for the supermatrix. We inferred nodal support by bootstrap analysis  of the supermatrix in PAUP* with 500 repetitions using a heuristic search with randomized additions. The parsimony analysis of the supermatrix resulted in only one best tree (Fig. 1b). The bootstrap values were all 100 on each node, suggesting that the data provide strong support for the solution. The tree that was supported by the largest number of genes was the same tree that was the most parsimonious solution for the supermatrix (data not shown).
Analysis of genome completeness
Genome completeness was determined by clustering S. carpocapsae, S. feltiae, S. glaseri, S. monticolum, and S. scapterisci protein sets with a core set of highly conserved eukaryotic proteins (Core Eukaryotic Gene Mapping Approach, CEGMA) using OrthoMCL 1.4 as previously described [28, 73, 81, 92]. The percentages of genome completeness for each species was found by dividing the number of proteins that were orthologous to CEGMA proteins by the total number of CEGMA proteins (Additional file 2: Table S2).
Gene expression analyses
Stranded, single-ended 50 bp RNA-seq reads from the embryonic, L1, IJ, and adult stages of S. feltiae, S. carpocapsae, and C. elegans sequenced to an average depth of 22, 30, and 33 million reads respectively were trimmed to 35 bp to remove low quality bp (Additional file 2: Table S14). Prior to read mapping, transcriptome indexes were prepared for S. carpocapsae, S. feltiae, and C. elegans (WS220) using the RSEM command (version 1.2.12) rsem-prepare-reference . Reads were mapped to each respective species’ annotations using bowtie 0.12.8 with the following options: -S, --offrate 1, -v 1, -k 10, --best, --strata, -m 10 . Gene expression was quantified using the RSEM command, rsem-calculate-expression, with the following options: --bam, --fragment-length-mean . We used EdgeR to analyze genes that were DE during the developmental time course of each species, and we considered a gene to be DE if it had an FDR < 1 × 10−5 and a fold change > 4× . DE genes were K-means clustered into eight clusters (Fig. 3a, Additional file 4) using Cluster 3.0 , and visualized with JavaTree View . The optimal K for clustering was found using the Akaike information criterion. DE gene clusters were functionally annotated using Blast2GO’s Fisher’s exact test .
Finding novel genes and isoforms using Cufflinks and Cuffmerge
Unstranded paired-end RNA-seq data collected from four S. carpocapsae developmental stages (embryo, L1, IJ, adult) were aligned to the S. carpocapsae genome using TopHat 1.4.0 and Bowtie 0.12.8 with the following options: -r 50, –G < annotations > . Gene expression for the aligned reads was quantified with Cufflinks 2.0.2 using the following options: -u, -g < annotations>. Transcript annotations from each developmental stage were merged together with Cuffmerge (options: -g < annotations>, -s < genome>) (Additional file 7). The Cuffmerge output showed genes and transcripts that were discovered by Cufflinks but missed by Augustus. The Cufflinks annotations were used in combination with the Augustus annotations to delineate coding versus non-coding sequences in downstream analyses.
Unstranded, paired-end RNA-seq data for the IJ stage in the other species were aligned to their respective genomes using TopHat 1.4.0 and Bowtie 0.12.8 with the following options: -r 50, –G < annotations>. Cuffmerge was not used. Gene expression was quantified with Cufflinks 2.0.2 using the following options: -u, -g < annotations > .
Multiple genome alignment
Five whole repeat-masked Steinernema genomes were aligned using MULTIZ/TBA (multiz-tba.012109). Contigs from the best-assembled genome, S. carpocapsae, were concatenated together with 100 bp “N” spacers and used as a reference sequence for the alignment process. The aligned sequences were analyzed with Phast 1.2.1 (phyloFit options: --tree, phastCons options: --target-coverage 0.4, --expected-length 10, --estimate-trees, --nopostprob) to determine regions of sequence conservation across the genomes using setting for C. elegans as previously described [68, 98, 99, 100]. PhastCons parameters were also varied around those used for C. elegans , but the C. elegans parameters provided a good balance between small and large blocks of conservation. Conserved sequences that matched Augustus and Cufflinks coding sequences or 5′ or 3′ untranslated regions were considered conserved coding sequences, whereas sequences that mapped anywhere else were considered conserved non-coding sequences. DNA from the anterior portion of the Hox cluster (ceh-13 and lin-39) in S. carpocapsae, S. scapterisci, and S. feltiae were also aligned using MUSSA  to find conserved regions of their DNA. MUSSA was run with a conservation window size of 30 nucleotides and a nucleotide conservation threshold of 23 nucleotides.
Gene expression conservation
To determine the degree of gene expression conservation during development between nematode species, we compared gene expression data for four developmental stages in S. carpocapsae, S. feltiae, and C. elegans. Two methods were used for determining conserved gene expression. The first method binarized the expression data using a flexible threshold to sort the genes into stage-specific sets (Additional file 1: Figure S14). We examined the gene expression levels of the 1:1:1 orthologs at four developmental stages and asked if an ortholog that was expressed above an averagely expressed gene (10 FPKM) in a particular set of developmental stages in a nematode species was expressed at least above 5 FPKM in the other nematode species in the same set of developmental stages. If the ortholog was expressed in the same set of developmental stages, it was considered conserved in stage-specific gene expression. If not, stage-specific gene expression was considered to have changed. We used this method to determine the fraction of orthologs that are “on” and “off” in the same developmental stages between species. However, to address whether their expression profiles parallel each other during development, we treated the ortholog expression calculated in transcripts per million (TPM, which is interconvertible with FPKM) during development as vectors, and calculated the cosine similarity (Fig. 3b). The cosine similarity provides a measure of similarity between a pair of vectors: a similarity measure of 1 means that the two vectors are perfectly correlated, whereas a similarity measure of 0 means the vectors are orthogonal (i.e., uncorrelated). We calculated the cosine similarities for each ortholog used in the binary method with developmental stage replicates for each species. We found that orthologs with cosine similarities > 0.95 had extremely similar expression profiles during development, so we set this to be our conservation threshold. This gave us a total of 1,441 orthologs with conserved expression profiles between S. carpocapsae and S. feltiae (Additional file 6). We sorted these orthologs into stage-specific gene sets by requiring developmental stages to contribute to at least 10 % of the total gene expression during the time course to be considered “on.” Stage-specific gene sets containing more than 30 genes were used for motif finding (e, f, ef, fi, fa, efi, efa, fia, efia).
Nine sets of Steinernema stage-specific orthologs were chosen for motif mining (Fig. 3c). Conserved non-coding regions ±3 kb or within introns of the genes were obtained by intersecting bed coordinates for the regions upstream of these genes with the genome-wide set of conserved non-coding regions using bedtools/2.15.0 bedintersect . The conserved non-coding bed regions were converted to fasta sequence using bedtools getfasta, filtered for sequences >8 bp, and run through MEME 4.8.1 (settings: -minw 6 -maxw 12 -dna -nmotifs 20-50 -mod zoops -revcomp) to find recurring regulatory motif sequences . We discovered 440 motifs in total across the nine gene sets and searched for them across both the S. carpocapsae and C. elegans conserved non-coding regions using FIMO with the following settings: --thresh 0.3 --qv-thresh --max-stored-scores 20000000 –bgfile --parse-genomic-coord . We used the WS220 gene annotations and the corresponding conserved regions for C. elegans from the UCSC Genome Browser (ce10/WS220: phastConsElements7way.txt) for these analyses. The conserved non-coding regions were produced for C. elegans by retaining conserved regions that did not intersect annotated coding regions (bedtools bedintersect, settings = -wa).
Filtering out redundant and insignificant motifs
Motifs that could not map to any conserved non-coding regions within the q-value threshold (q-value < 0.3) in either species were removed from the analysis. The remaining motif set was compared to itself with TOMTOM to identify redundant motifs, using the following settings: -min-overlap 5 -dist pearson -thresh 0.05 . The redundant motif with the highest MEME e-value of the pair of matching motifs was removed from the analysis. In the end, we were left with 30 non-redundant motifs (Additional file 2: Table S17, Additional file 8).
The final set of non-redundant motifs was associated with the nearest gene models for each species, forming motif-associated gene sets using bedtools closest with the following setting: -d .
Novel motif comparison to WormBase motif database
The final set of 30 motif position weight matrices was compared to 5,512 motifs from WormBase [106, 107] with TOMTOM using the following settings: -min-overlap 5 -dist pearson -evalue -thresh 1.0. Out of 30 motifs, 24 matched WormBase motifs with a p-value < 1e−4 and an e-value < 0.5 (Additional file 2: Table S18).
GO term enrichments were determined for each S. carpocapsae and C. elegans motif-associated gene set using the Fisher’s exact test in Blast2GO . Motif-associated GO terms with FDRs < 0.05 and that were shared between S. carpocapsae and C. elegans were considered for the analysis (Fig. 6b, c; Additional files 9 and 10).
Conserved GO term network generation
Enriched motif-associated GOs (MAGs) shared between S. carpocapsae and C. elegans were analyzed for the number and percentage of motif-associated 1:1 orthologs shared between them. MAGs that shared 30 % 1:1 orthologs were involved in biological processes under or related to the parent terms such as neurogenesis, embryogenesis, and muscle development. Thus, we focused on GO terms related to these particular processes and generated networks by placing shared 1:1 ortholog targets from related GO terms and the putative conserved motifs that regulate them into three networks: a neurogenesis-related network, an embryonic-related network, and a muscle-related network. The supplemental figures show all the conserved motif–gene associations regardless of motif and gene node degree, while the main figures show all nodes that had degrees greater than 5 (Fig. 6b, c, Additional file 1: Figure S20, Additional files 11, 12, and 13). The motifs and ortholog associations within these networks are conserved between S. carpocapsae and C. elegans. Motif locations around the gene models were investigated around interesting orthologs, such as egl-44 and zag-1, to see if the motif sites are conserved in their location or have changed over time (Fig. 6d, Additional file 1: Figure S21).
Randomized GO term control network
To verify that the motif-associated GO term enrichments we obtained were not due to chance, we created 100 randomized GO term sets by shuffling all of the annotated S. carpocapsae gene GO terms that were derived from Blast2GO. We reassigned the GO term sets to new genes that were previously annotated. Unannotated genes were not assigned a randomized GO term set. We applied Fisher’s exact test to all 30 MAG sets using these randomized GO sets (30 motifs × 100 randomizations = 3,000 Fisher’s exact tests in total), and the GO enrichment results for the neuronal, embryo, and muscle GO terms were analyzed. We did not recover enrichments for any GO terms associated with these terms with FDRs < 0.05.
This research was supported by grants from the US National Institutes of Health (NIH) and the Howard Hughes Medical Institute, for which PWS is an investigator. ARD was supported by NIH training grants (5T32GM007616 and 5T32HG000044). AM and MM were supported by an NIH New Innovator Award to AM (DP2 GM111100). XL was supported by the UW-Madison Graduate School research funds.
- 7.Gaugler R, Kaya HK. Entomopathogenic nematodes in biological control. Boca Raton: CRC Press; 1990.Google Scholar
- 9.Stock SP, Goodrich-Blair HG. Entomopathogenic nematodes and their bacterial symbionts: The inside out of a mutualistic association. Symbiosis. 2008;46:65–75.Google Scholar
- 20.Adams BJ, Peat SM, Dillman AR. Phylogeny and evolution. In: Nguyen KB, Hunt DJ, editors. Entomopathogenic nematodes: Systematics, phylogeny, and bacterial symbionts, Volume 5. Leiden-Boston: Brill; 2007. p. 693–733. Nematology monographs and perspectives.Google Scholar
- 21.C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998;282:2012–8.Google Scholar
- 33.Balasubramanian N, Hao YJ, Toubarro D, Nascimento G, Simoes N. Purification, biochemical and molecular analysis of a chymotrypsin protease with prophenoloxidase suppression activity from the entomopathogenic nematode Steinernema carpocapsae. Int J Parasitol. 2009;39:975–84.PubMedCrossRefGoogle Scholar
- 55.Stanley D, Miller J. Eicosanoids in invertebrate immunity: an in vitro approach. In Vitro Cell Dev Biology Ani. 2006;42:5a–a.Google Scholar
- 70.Kaya HK, Stock SP. Techniques in insect nematology. In: Lacey L, editor. Manual of techniques in insect pathology. San Diego, CA: Academic Press Limited; 1997.Google Scholar
- 72.Stiernagle T. Maintenance of C. elegans. In: The C. elegans Research Community, eds. WormBook. 2006. doi/10.1895/wormbook.1.7.1Google Scholar
- 74.WormBase Parasite. http://parasite.wormbase.org
- 84.Felsenstein J. PHYLIP (Phylogeny Inference Package). 36th ed. 2005.Google Scholar
- 90.Swofford DL. Phylogenetic analysis using parsimony (*and other methods). Sunderland, MA: Sinauer Associates; 2002.Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.