scDual-Seq: mapping the gene regulatory program of Salmonella infection by host and pathogen single-cell RNA-sequencing
The interaction between a pathogen and a host is a highly dynamic process in which both agents activate complex programs. Here, we introduce a single-cell RNA-sequencing method, scDual-Seq, that simultaneously captures both host and pathogen transcriptomes. We use it to study the process of infection of individual mouse macrophages with the intracellular pathogen Salmonella typhimurium. Among the infected macrophages, we find three subpopulations and we show evidence for a linear progression through these subpopulations, supporting a model in which these three states correspond to consecutive stages of infection.
While simultaneous transcriptome analysis of host and bacteria by RNA-sequencing (RNA-seq) can provide a comprehensive view of cellular states, previous efforts using bulk measurements have been limited to averaging over thousands of host and pathogen cells , thereby losing the ability to capture the heterogeneity of the individual encounters. Another major limitation in single-cell RNA-seq is the dependence on oligo-dT priming, which has restricted the examination to only individual host cells since this approach does not provide information on the state of the individual infecting pathogen [5, 6, 7, 8]. Although poly-A independent methods for single cells are available [9, 10], their efficacy for detecting intracellular bacterial transcriptomes is untested.
Here, we present scDual-Seq, a method to simultaneously analyze both host and pathogen cells using single-cell RNA-seq. This method is highly multiplexed, uses UMI to allow single transcripts quantification, and takes advantage of in vitro transcription (IVT) for amplification. We show that by using scDual-Seq, we are able to identify and quantify the expression levels of both bacterial and host transcripts in individual infected mammalian cells.
Results and discussion
We developed a single-cell dual capture and sequencing method ─ scDual-Seq ─ for single-cell RNA-seq to analyze simultaneously both host and bacterial cells. To capture both bacterial and host transcripts, scDual-Seq primes the reverse transcription (RT) reaction using random hexamer DNA oligos (see “Methods,” Fig. 1b, and Additional file 1). After adding a polyA tail to the complementary DNA (cDNA), scDual-Seq proceeds as an extensive modification of the CEL-Seq2 method for single-cell RNA-seq [11, 12], which includes barcoding for multiplexing, IVT for amplification, and paired-end Illumina sequencing.
To assay the ability of scDual-Seq to detect transcripts of both eukaryotic and bacterial origin, we processed RNA extracted from bulk samples of mouse embryonic stem cells (mESC) and from the Gram-negative, intracellular pathogen S. typhimurium (Salmonella). On average, we detected 63,000 unique mouse transcripts and 120,000 unique Salmonella transcripts when starting with 10 pg RNA, the estimated amount of RNA present in a mammalian cell, respectively (Fig. 1c). This is considerable given that one mESC cell is thought to consist of 500,000 transcripts . To study the sensitivity of scDual-Seq with reduced RNA input amounts, we processed samples with 10 pg of mESC RNA and 1 pg, 0.1 pg, and 0.01 pg Salmonella RNA, respectively. We detected roughly the same number of mouse transcripts and a decrease of one order of magnitude in Salmonella transcripts across the dilutions, as expected from the linearity of detection in scDual-Seq (Fig. 1c). Due to the random priming during RT, we detected messenger RNAs (mRNAs) and non-coding RNA in our samples (Additional file 2: Figure S1a). While most of the Salmonella transcripts correspond to non-coding RNA, in mouse this is not the case; which may be attributed to a difference in the structure of the prokaryotic and eukaryotic ribosomal RNAs. We further detected high correlations between technical replicates; R = 0.99, on average, for 10 pg mouse samples and R = 0.89 for the 10 pg Salmonella samples (Fig. 1d shows one pair of technical replicates). The reproducibility, however, is reduced with lower input amounts: for 0.01 pg Salmonella RNA, the average correlation is 0.79 (Fig. 1e). Based on these studies, we concluded that scDual-Seq accurately measures RNA levels in samples containing as little as 0.01 pg RNA for both polyA+ and polyA- RNA. On average, we detected 470 Salmonella transcripts in 0.01 pg of RNA, which is the expected amount of RNA in a single bacterial cell . Since this amount of RNA has been estimated to correspond to 10,000 transcripts, scDual-Seq has an estimated sensitivity of approximately 4.7%.
To test for the sensitivity of scDual-Seq in measuring the transcriptomes of live Salmonella, we identified genes differentially expressed between an overnight culture of Salmonella grown in bulk and intracellular Salmonella within macrophages in exposed single cells, and 10 and 100 cell populations. We detected a similar set of differentially expressed genes in all three comparisons, indicating that sensitivity is not severely compromised at the single-cell level (P < 0.001; hypergeometric distribution; Additional file 2: Figure S1b). Second, we found good correspondence of Salmonella transcriptomes between the single-cell data and population-level data, as well as between the 10-cell and 100-cell population data (Additional file 2: Figure S1f), demonstrating the accuracy of the single-cell measurements of bacterial transcripts. Comparing the sensitivity of scDual-Seq directly with that of CEL-Seq2, we found that CEL-Seq2 has higher sensitivity with more detected mouse genes than scDual-Seq (Additional file 2: Figure S1c). However, examining at the number of detected Salmonella genes (non-polyA), scDual-Seq performed better than CEL-Seq2. scDual-Seq shows the same dependency of noise on expression level that was observed in CEL-Seq  (Additional file 2: Figure S1d, e).
By performing an unbiased comparison of the macrophage single-cell transcriptomes across the time-points, we identified two responses that, for the most part, distinguished the infected and unexposed cells (Fig. 2b, Additional file 2: Figure S2d). Among the infected cells, one type of response characterizes the majority of infected cells that we term an induced response (87 cells). A second type of response characterizes the 69 infected cells whose transcriptome resembles that of unexposed cells, which we term a “partially induced” response because their global transcriptional patterns are more similar to the unexposed than other infected cells. Genes known to be part of the immune response – Tnf and Sod2 – are differentially expressed between the unexposed and partially induced cells relative to the induced cells. As a control, we verified that Gapdh expression levels were not also higher in the induced cells (Fig. 2c). Moreover, genes previously detected to be differentially expressed between exposed and unexposed macrophages  show significant expression differences across these clusters (Fig. 2d). While the partially induced cells do not induce a full immune response in contrast to the induced cells, they may nonetheless be considered as having a “partial response” because as a population they have higher expression (though not significant) of some major immune response genes when compared to the unexposed cells, e.g. Tnf (Fig. 2c). Furthermore, we visualized the relations among the identified groups based upon the aforementioned immune response genes . Strikingly, we found an arc-shaped distribution of cells whereby the partially induced cells are flanked by the unexposed and induced cells (Fig. 2e, Additional file 2: Figure S2e).
To distinguish between these two models, we used pseudo-time analysis based upon the tSNE plot in Fig. 2e. Pseudo-time was inferred based upon the Euclidean distance among the cells (in tSNE space, Fig. 3c, see “Methods”). Interestingly, we observed that the ordered cells recapitulated the clustering observed in the previous analysis. Cells of the same group generally clustered: first the unexposed cells, next partially induced cells followed by Class I cells and finally Class II cells. To test whether ordering by pseudo-time reflected known biological processes, we queried for the SPI-1 and SPI-2 (type-III secretion systems) Salmonella regulons which are known to be inversely expressed during infection [20, 21, 22]. Interestingly, we found that while the mean expression of both regulons is similar across the subpopulations (Additional file 2: Figure S3B), examining their expression level at the pseudo-time single-cell level ordered cells, SPI-1 is more highly expressed in the partially induced cells, early in the infection, and SPI-2 is more highly expressed in Class I later in infection (Fig. 3d, see “Methods”). This order of SPI-1 and SPI-2 expression matches that from previous reports [20, 21, 22].
To further query whether the infection progresses linearly, we followed the distribution of these subpopulations over the time course of infection. If the linear model is correct with the partially induced cells representing the earliest state in infection, we would expect that the fraction of the cells that are partially induced to decrease with time. Indeed, calculating the fraction of each subpopulation at each time point, we found the largest fraction of partially induced cells at 2.5 h after infection, with a smaller fraction at 4 h, and no partially induced cells at all at 8 h (Fig. 3e). This result, as well as the pattern of correlations among the three subpopulations (Fig. 3b) and the order of cells according to the infection pseudo-time (Fig. 3c), is consistent with infected macrophages following a linear progression from the partially induced state to the fully induced state process (Fig. 4). Further, since the Class I Salmonella transcriptomes in the fully induced macrophages are transcriptionally more akin to those of the Salmonella in partially induced cells, this could suggest a progression of the partially induced cells to the fully induced cells containing Class I Salmonella. At this stage, the host response is induced in a large-scale upregulation of immune functions. Finally, infection could then proceed to macrophages maintaining an induced state with Salmonella in a Class II state. Since the linear and parallel models need not necessarily be mutually exclusive, a combination of these two—a temporal linear process in some cells such as the partially induced macrophages and parallel commitments to different strategies in others—may be possible.
scDual-Seq represents a novel method to query host–pathogen interactions. We expect that among its many possible uses, it will be invaluable to study the adaptive abilities of pathogens to antibiotics, the effect of genotypic changes on the immune response of the host, and the in vivo characterization of infection. Our current implementation of scDual-Seq does have the limitation of a high MOI which might have masked certain infection stages and phenotypes due to a potential variable number of bacteria in the infected macrophages. As previously described, constitutive GFP expression in the infected macrophages is not sufficiently precise to allow us to robustly conclude the number of bacteria infecting each macrophage; for example, starting with an MOI of 25, the GFP+ cells contain a range of 1–10 colony forming units (CFU)  Further work, perhaps using a microfluidics platforms [23, 24] to enable the analysis of a massive cohort of host–pathogen encounters, will address lower MOIs for a more accurate and comprehensive description of infection. Overall, the ability to capture both the pathogen and host transcription programs at the level of individual cells will be important for understanding the relationships among the different states of infection.
The scDual-Seq method
After second strand synthesis, the samples are pooled together and cleaned using 1.2X AMPure beads. IVT is performed in two-fifths reaction volume as described in the Ambion Message AMP II kit for 13 h.
Measuring the sensitivity of scDual-Seq
We tested scDual-Seq on clean RNA of both mouse and Salmonella prepared as follows. Mouse RNA was isolated from CGR8 embryonic stem cell (ESC) line and prepared using a TRIzol extraction and treated with RQ1 RNase-free DNase (Promega) according to the manufacturers’ protocols. RNA cleanup was done with AMPure RNAClean beads. To extract Salmonella RNA, Salmonella were grown in LB and lysed by bead beating and RNA was extracted by the Qiagen RNeasy kit. RNA concentration was measured using Qubit, and diluted for technical replicates RNA in four different dilutions: 10 pg of mouse RNA mixed with 10 pg, 1 pg, 0.1 pg or 0.01 pg of Salmonella RNA, respectively, each in five replicates. All 20 samples were processed using scDual-Seq. As a side-by-side comparison, five replicates of the 10 pg of mouse RNA mixed with 0.01 pg of Salmonella were processed using the CEL-Seq2 protocol .
Mice, cell lines, and bacterial strains
C57BL/6 WT mice were obtained from Jackson Laboratory. All animals were housed and maintained in a conventional pathogen-free facility at the Massachusetts General Hospital. All experiments were performed in accordance to the guidelines outlined by the MGH Committee on Animal Care. All Salmonella typhimurium strains used in this study were derived from the wild-type strain SL1344.
Bone marrow-derived macrophage (BMDM) infection with Salmonella
Cultures of S. typhimurium labeled with GFP (pFPV25.1; Addgene) were grown in Luria-Bertani (LB) medium at 37 °C shaken at 250 rpm for 16 h (overnight culture). One milliliter from the overnight culture was washed in PBS and incubated for 1 h with pHrodo dye (Life Technologies) at room temperature in 100 mM sodium bicarbonate. S. typhimurium was then washed three times with HBSS and OD600 was measured. BMDMs were infected at an MOI of 50:1 and spun down for 5 min at 250 g. After 30 min, cells were washed with media containing 15 μg/mL gentamicin to remove S. typhimurium that were not internalized. We sorted 96 individual cells unexposed to Salmonella, individual cells 2.5, 4, and 8 h after infection. At each time-point cells were lifted from plates and sorted using FACS into 96-well plates containing 4.5 μL lysis buffer (TE containg 5%NP-40 and RNAse inhibitor) . We repeated the experiment with unexposed cells and cells 4 h after infection; we sorted 40 individual exposed cells and 40 unexposed cells. Sorted cells were DNase treated with 1 uL enzyme mix of 0.2 U of DNase I at 65 °C for 5 min, cleaned with 1.8X RNAClean beads, and eluted with 1.2 μL of primer-dNTP mix before continuing to the RT step of scDual-Seq. Overnight culture Salmonella: 1 mL from the overnight culture was washed with PBS and RNA was extracted using phenol chloroform. A total of 1 ng of clean RNA was used as starting amount for the scDual-Seq protocol.
Paired-end sequencing was performed on the HiSeq 4000 in high-throughput mode, 50 bases for read 1, seven bases for the Illumina index, and 50 bases for read 2. Read 2 was trimmed for 35 bases before mapping. On average, each single cell had 2.4 million reads. For the dilution experiment (Fig. 1), paired-end sequencing was performed on the HiSeq 2500 in rapid mode, 15 bases for read 1, seven bases for the Illumina index, and 36 bases for read 2.
Data analysis and statistics
The initial analysis of the scDual-Seq sequenced reads was done using the CEL-Seq2 pipeline . Reads were mapped to the mouse and Salmonella transcriptomes. In the last step, only reads mapping to the reverse strand are counted, since scDual-Seq produces stranded reads (htseq_wrapper; extra_params = −s reverse). In our first experiment, we filtered out cells with < 10,000 unique transcripts and with correlation of < 0.8 with at least ten other cells; we were left with 63 unexposed cells, 57 infected at 2.5 h, 59 infected at 4 h, and 42 infected at 8 h. In the second experiment, after filtering out single cells with < 20,000 unique transcripts, we were left with 33 unexposed cells and 38 infected cells. We used binomial statistics to convert the number of UMIs into transcript counts  and normalized to give read count in transcripts per 10,000.
Gene Ontology analysis
Based on the tSNE plot shown in Fig. 2b, we defined two groups of cells: partially induced and induced cells. For each group, we identified 50 genes that are expressed highest in that group and with the lowest P value for differential expression between the two groups (Kolmogorov–Smirnov test). For each gene set we computed the enrichment for gene ontology terms (hypergeometric distribution) using annotations from Ensembl .
Regulon expression analysis
For each cell, we summed the number of unique transcripts belonging to the same regulon  into a collective regulon expression and normalized to transcripts per thousand. For each defined group (partially induced, Class I, and Class II) we performed the Wilcoxon rank-sum test for each regulon comparing the expression level (tpm) between the group and the two other groups (for example, partially induced compared with Class I and Class II). Only regulons with higher expression levels in this group and a significant P value (P < 0.05) were selected for further analysis. Expression levels (tpm) for each of the selected regulons were averaged for each group. Finally, the mean values were normalized across the three groups, and ordered using ZAVIT [26, 27]. Correlation coefficients between transcriptomes (Fig. 3b) were computed based upon their Salmonella regulon expression levels (log10 tpm) of all infected cells. A Wilcoxon rank-sum test was used to compare the correlations between the partially induced and Class I cells with respect to the partially induced and Class II cells.
We used the ZAVIT method [26, 27] to order the cells based on tSNE-1 and tSNE-2 values generated in Fig. 2e. The values were coerced to a smooth path by a moving mean over 50 radius values. The length of the path going through the cells in tSNE-1, tSNE-2 space was calculated and used as pseudo-time. For SPI-1 and SPI-2 expression profiles in single cells, we filtered out cells with no expression for both SPI-1 and/or SPI-2 regulons. The expression profiles were ordered by the pseudo-time, normalized, and smoothed.
We thank Aviv Regev and Orit Rozenblatt-Rosen at the Klarman Cell Observatory for support for this project. We acknowledge the assistance of Neil Jethani in data analysis. We also thank the anonymous referees for making substantial suggestions to the manuscript.
This work was supported by the Broad - Israel Science Foundation Program.
Availability of data and materials
The complete dataset has been deposited to the NCBI GEO database with accession GSE102163 .
GA, RA, and IY conceived the method. GA and RA led the development of the method with significant contribution from AF and TH. RA and AF collected the infected samples. GA performed the scDual-Seq experiments. GA and IY analyzed the data. GA and IY wrote the manuscript. GA, RA, DH, and IY revised the manuscript. All authors read and approved the final manuscript.
All experiments were performed in accordance to the guidelines outlined by the MGH Committee on Animal Care.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 14.Milo R, Phillips R. Cell Biology by the Numbers. 2015. New York: Garland Science.Google Scholar
- 21.Jones BD. Salmonella invasion gene regulation: a story of environmental awareness. J Microbiol. 2005;43(Spec No):110–7.Google Scholar
- 28.Avital G, Avraham R, Fan A, Hashimshony T, Hung DT, Yanai I. scDual-Seq: Mapping the gene regulatory program of Salmonella infection by host and pathogen single-cell RNA-seq. Gene Expresion Omnibus. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102163. Accessed 7 Oct 2017.