Background

The heat shock response is a highly conserved mechanism found across organisms that ensures effective maintenance of cellular function under stress. Transcriptional activation involving heat shock proteins (HSPs) was found to underpin the seminal observation of expanded chromosomal puffs in Drosophila salivary glands following exposure to heat [1], with subsequent studies in different species highlighting not only changes in expression of genes encoding these essential molecular chaperones but also their regulators, proteins involved in proteolysis, transcription factors and kinases, membrane transport, maintenance of cellular structures, metabolism and nucleic acid repair [29]. As well as significant upregulation of gene expression, involving rapid induction of HSP gene transcription by activated heat shock factors (HSF) binding to promoter heat shock elements (HSEs), the coordinated stress response is also recognised to involve downregulation of a greater number of genes. However, to date inter-individual variation in the heat shock response at the level of transcription in humans remains largely unknown, with studies defining the global transcriptome based on specific cell lines or cells/tissue from particular individuals [8, 9]. Further delineation of the nature and variability in this response is important given the role of HSPs in ensuring effective intracellular protein folding during stress, protecting cells from denaturation, aggregation and apoptosis [4]. This is underlined by evidence linking HSPs with ageing and cancer, as well as the response to infection and immunity [1013].

Genetic modulators of gene expression are important determinants of inter-individual variation in diverse phenotypes and may only operate in specific cell types or after particular environmental exposures [14, 15]. Mapping gene expression as a quantitative trait to identify regulatory genetic variants has informed recent genome-wide association studies (GWAS) of disease as well as pathophysiology including the immune response to endotoxin [16], sepsis [17], T-cell activation [18] or viral infection [19, 20]. Expression of heat shock proteins is highly heritable and has been mapped as a quantitative trait in diverse organisms including Drosophila melanogaster [2123], Caenorhabditis elegans [24] and the Artic charr [25]. In resting (non-heat shocked) human Epstein-Barr virus (EBV)-immortalised lymphoblastoid cell lines (LCLs), expression of heat shock protein and molecular chaperone genes shows high heritability on eQTL mapping, with response to unfolded proteins having the highest heritability of any biological process on gene ontology (GO) analysis (H2 0.38) [26]. A previous QTL analysis of heat shock phenotypes in human cells was restricted to the Hsp70 genes in the MHC class II region and demonstrated a local eQTL for HSPA1B [27].

Here we report the genome-wide changes to gene expression induced by heat shock in HapMap cell lines from Yoruba (YRI) individuals and perform analysis to identify genes and pathways involved in the human heat shock response. To further elucidate underlying mechanisms, we present an analysis of genetic variants modulating the global heat shock transcriptional response.

Methods

Cell culture and heat shock

The 60 founder YRI HapMap cell lines (Coriell) [28] were cultured. These anonymised cell lines were established by the International HapMap Project and made available for use by the scientific research community [29]. LCLs were maintained in RPMI 1640 medium supplemented with 10 % fetal calf serum and 2 mM L-glutamine at 37 °C in 5 % humidified CO2. Growth rates were determined after 72 h in culture for each cell line to ensure the cells were at comparable densities and total numbers when harvested. Trypan blue staining was used to define cell viability. Cells were subject to heat shock at 42 °C for 1 h and then allowed to recover for 6 h in a 37 °C, 5 % CO2 incubator. 2 × 107 cells were harvested for each of the two paired experimental conditions (i.e. heat shock stimulated and basal un-stimulated culture conditions) per individual cell line and stored in RLT buffer with β-mercaptoethanol at −80 °C. Total RNA was purified using QIAGEN RNeasy Mini purification kit following manufacturer’s instructions, including on-column DNase digestion.

Gene expression pre-processing and quality control

Genome-wide gene expression analysis was carried out using the Illumina Human-HT-12 v3 Expression BeadChip gene expression platform comprising 48,804 probes. Probe intensities for resting and stimulated cells were imported into R for further processing together with associated metadata. Annotations for all probes were obtained via the illuminaHumanv3.db Bioconductor package [30]. Only probes considered to be of perfect or good quality according to these annotations were taken forward for analysis. Additionally, all probes mapping to more than one genomic location or to a location that contains a known single nucleotide polymorphism (SNP) were excluded. Probes were required to exhibit significant signal (detection p value <0.01) in at least ten samples and samples with less than 30 % of the remaining probes providing significant signal were excluded (together with the paired sample from the same cell line). Samples showing exceptionally low variation in probe intensities (standard deviation of the log intensities of all retained probes below 0.8) were also removed. After filtering 12,416 of 48,803 probes (25.4 %) remained.

Normalising gene expression estimates

Probe intensities were normalised with VSN [31] and outlier samples removed. The remaining 43 samples were normalised separately for each BeadChip and differences between groups corrected with ComBat [32], preserving differences due to heat shock stimulation (Additional file 1: Figure S1).

Differential expression analysis

Following quality control (QC), samples were analysed for differences in gene expression levels between the basal and stimulated states, i.e. pairing samples from the same individual, using the limma Bioconductor package [33]. Individual probes were associated with corresponding genes by comparing probe positions as provided by the illuminaHumanv3.db Bioconductor package [30] with transcript coordinates obtained via the TxDb.Hsapiens.UCSC.hg19.knownGene Bioconductor package [34]. One of the genes (N4BP2L2) had two probes with opposite effects in terms of differential expression and these probes were excluded from further analysis. For all other genes with multiple differentially expressed probes, the direction of the effect was consistent between probes.

GO enrichment and pathway analysis

GO enrichment analysis was carried out using the Bioconductor package topGO [35]. Fisher’s exact test was used to determine enrichment separately for significantly upregulated and downregulated genes (false discovery rate (FDR) <0.01 and >1.2 fold change (FC)). Biological pathways, function enrichment and prediction of upstream regulators were generated for these genes using Qiagen’s Ingenuity Pathway Analysis (IPA) (www.qiagen.com/ingenuity, QIAGEN Redwood City). For the shortest path analysis, we used the path explorer tool. Here, if two molecules do not have specific direct connections in the Ingenuity Knowledge Base, this tool will define how many and which molecules can be added to the pathway to create the shortest path between them.

Gene functional annotations with heat shock

We investigated which differentially expressed genes we identified had been previously associated with the heat shock or, more generally, stress response. We used the set of genes previously linked directly to heat shock [4] and from this created an extended set based on GO terms and PubMed articles linking differentially expressed genes to heat shock response and closely related processes. As a first step in highlighting genes not previously known to play a role in this context, we identified all significantly upregulated genes that lack GO annotations of obvious relevance to heat shock response. In addition to terms related to stress response and protein folding, we also explored an extended set that included terms related to cell death and proliferation. To account for the presence of EBV in these cell lines, we excluded all genes annotated with terms related to viral infections. Finally, any remaining genes related to regulation of gene expression were considered to be likely to be explained by the large-scale changes in gene expression that are taking place in response to heat shock and also included in the extended set. All genes not annotated with obvious GO terms were subjected to a PubMed search to find publications that link the gene to heat shock or stress response.

Heat shock factor binding

Using binding sites derived from ChIP-seq data obtained from the K562 immortalised leukaemic cell line [36], we annotated our list of differentially expressed genes by cross-referencing it with the list of HSF-binding genes. Groups of genes corresponding to upregulated or downregulated genes as well as those with existing heat shock-related annotations and those without were tested for enrichment of HSF-binding genes using Fisher’s exact test. In addition to the direct evidence from the ChIP-seq data, we carried out a scan for the presence of HSF-binding motifs in the promoter region (1200 bp upstream–300 bp downstream of the transcriptional start site (TSS)) of differentially expressed genes. The scan was based on the position weight matrices (PWM) defined by SwissRegulon [37] and carried out with the Bioconductor package PWMEnrich [38].

Multivariate global heat shock response phenotype

The global heat shock response was summarised using partial least squares (PLS) regression (generated as detailed in ‘Results’). Using the first two PLS components with respect to the treatment, i.e. the two components of the gene expression space that maximise the variation between basal and stimulated samples, we defined the response for each individual as the combination of the vector between the basal and stimulated sample for this individual in the space spanned by the first two PLS components and the location of the basal sample in the same space. Hierarchical cluster analysis was used to investigate grouping of individuals following heat shock and differential gene expression between clusters analysed.

Genotype QC

Genotype data provided by the HapMap project [39] were processed with Plink [40] to restrict the data to autosomes and remove SNPs with low genotyping rate and those with a minor allele frequency of less than 10 % in our sample set. This resulted in the exclusion of 794,511 of 2,582,999 SNPs (30.76 %). Estimation of the proportion of identity by descent for all sample pairs demonstrated three pairs showing evidence of higher than expected relatedness (Additional file 2: Figure S2) which was supported by IBS nearest neighbour calculation. As a result, samples NA18913, NA19192, NA18862 and NA19092 were excluded.

Genotypic association with gene expression

The multivariate global heat shock response phenotype was tested for association with SNPs within a 10 kb window either side of the probe location using the MultiPhen R package [41], 10 kb selected as informative for including functional elements interacting with a gene [42, 43]. All differentially expressed probes and all probes involving predicted upstream regulator genes were analysed but only genotyped SNPs that passed QC were considered. The GRCh37 coordinates for SNPs were obtained via the SNPlocs.Hsapiens.dbSNP142.GRCh37 Bioconductor package [44] and gene coordinates via the TxDb.Hsapiens.UCSC.hg19.knownGene package [34]. The significance of the observed associations was assessed through a permutation test to account for the structure inherent to the data. To this end the observed global response phenotype for each individual and covariates used in the model were randomly assigned to one of the observed set of genotypes 1000 times and p values for the joint model were computed for each permutation. From these we computed FDRs by comparing observed p values to the empirical distribution of minimum p values from each permutation. We tested for associations between genotype and heat shock response (log2 FC) for individual genes using a linear model as implemented in Matrix-eQTL [45], correcting for sex as well as the first two principal components of the treatment response to capture confounding variation, an approach which enhances eQTL mapping [4648].

Results

Transcriptomic response to heat shock

We aimed to establish the nature and extent of inter-individual variation in the genome-wide transcriptomic response to heat shock for a panel of LCLs established from unrelated individuals of African ancestry for whom high-resolution genotyping data are available (International HapMap Project, YRI population) [28]. We cultured the LCLs and exposed cells to heat shock at 42 °C for 1 h and harvested after recovery at 37 °C for 6 h. We then quantified genome-wide gene expression using Human-HT-12 v3 Expression BeadChips (Illumina). Following QC and processing, paired expression data (baseline and following heat shock) were available for 12,416 probes on 43 individual cell lines.

We found that 500 probes (4 % of all analysed probes, corresponding to 465 genes) were differentially expressed (FDR <0.01 and >1.2 FC) with 249 probes (226 genes) upregulated and 251 probes (238 genes) downregulated (Fig. 1, Table 1, Additional file 3: Table S1). The majority of the most significantly differentially expressed probes were upregulated, including 18 of the top 20 genes, of which nine encoded known heat shock proteins. The most significant difference in expression was seen for HSPA1B (22.2 FC, FDR 1.4 × 10−48).

Fig. 1
figure 1

Heat shock response in LCLs. a Volcano plot showing differentially expressed genes following heat shock (42 °C for 1 h with 6 h recovery) in LCLs. Probes with an adjusted p value below 0.01 and a log FC of at least 0.5 are shown as yellow and red dots. Probes showing particularly strong evidence of changes in gene expression through a combination of p value and FC are labelled with the corresponding gene symbol. b Heatmap comparing gene expression for differentially expressed genes between basal and stimulated samples. Samples were clustered by gene with heat shocked (red) and basal (blue) samples forming two distinct groups. Expression estimates for each gene were scaled and centred across samples. Blue cells correspond to lower than average expression and red cells correspond to higher than average expression

Table 1 Top 20 differentially expressed genes following heat shock

To further investigate the patterns of transcriptional response, we carried out a GO enrichment analysis for differentially expressed genes (>1.2 FC, FDR <0.01). This demonstrated significant enrichment among upregulated genes (seven categories with an FDR <0.05 on Fisher’s exact test) but no significant enrichment for downregulated genes (Table 2, Additional file 3: Tables S2 and S3). Considering the top categories, we found that genes upregulated following heat shock were predominantly related to the response to heat (including GO:0009408) and to unfolded protein (GO:0006986), together with negative regulation of inclusion body assembly (GO:0090084), endoplasmic reticulum stress (GO:1903573) and cell death (GO:0060548).

Table 2 GO categories enriched for upregulated and downregulated genes

We then performed pathway analysis of differentially expressed genes. Using IPA we found that the most significantly enriched canonical pathway among upregulated and downregulated genes (>1.2 FC, FDR <0.01) was the unfolded protein response (p value 6.8 × 10−8). We also found that heat shock factor 1 (HSF1) was the most significant upstream regulator (p value 2.5 × 10−13). Further investigation established that 81 % of observed differentially expressed genes were linked to HSF1 directly or through one additional molecule based on shortest path analysis using the Ingenuity Knowledge Base (Additional file 4: Figure S3). In addition to networks involving heat shock protein genes, this analysis highlighted the role of ubiquitination (UBC) and sumoylation (SUMO2, SUMO3) as well as transcription factors (including NFkB, JUN, ATF2, CEBP) and cytokines (IL6 and TNF) in the observed heat shock response at the transcriptional level (Additional file 4: Figure S3). In terms of biological functions, we resolved using IPA that cell death (p value 2.2 × 10−8), cell proliferation (p value 3.6 × 10−8), apoptosis (p value 8.2 × 10−8), cell cycle (p value 2.6 × 10−7) and gene expression (p value 6.6 × 10−7) were most significantly enriched. Upregulated and downregulated genes were found to cluster in a number of highly enriched networks constructed from the Ingenuity Knowledge Base (Additional file 3: Table S4).

Heat shock factor recruitment

Of the 226 significantly upregulated genes following heat shock, 24 genes have been previously directly linked to the heat shock response. We found that there was significant enrichment for genes associated with GO terms that clearly relate to heat shock response with 98 genes annotated with such terms (p value 2.3 × 10−10, Fisher’s exact test) and 21 otherwise linked to the heat shock response as revealed by a text mining strategy (detailed in ‘Methods’). Additionally, 30 genes were annotated with other relevant processes. This leaves 53 genes with no obvious previous association to heat shock.

To further establish links between differentially expressed genes and heat shock response, we considered the evidence for binding of HSF1 and HSF2 in the promoter regions of upregulated genes using ChIP-seq data obtained for K562 cells following heat shock [36]. Overall there was significant enrichment of HSF1 (51 genes, p 4.7 × 10−10 on Fisher’s exact test, odds ratio (OR) 3.0), HSF2 (55 genes, p 9.4 × 10−9, OR 2.6) and binding of both HSF1 and HSF2 (46 genes, p 9.1 × 10−15, OR 4.5) among upregulated genes following heat shock. Of the nine upregulated genes following heat shock without an established role where we find evidence of HSF binding on ChIP-seq (Additional file 3: Table S5), four have HSF-binding motifs in the promoter region (Additional file 3: Table S6).

Variation in the global heat shock response

To assess the global difference in gene expression induced by heat shock, we carried out PLS, using the treatment state (basal or following heat shock) as a binary response variable and all gene expression probes that passed QC as explanatory variables (12,416 probes targeting 10,214 genes). PLS has been previously used to identify differentially expressed genes [49] and coordinated expression profiles [50] including global response phenotypes [51]. The supervised PLS approach identifies variance components that differentiate treatment groups. This contrasts with principal component analysis (PCA), which considers overall variance irrespective of any known groupings. The PLS analysis demonstrated that there is a considerable change in overall gene expression in response to heat shock with the first two PLS components together accounting for 96.1 % of the variation observed and providing clear separation of the two treatment groups (Fig. 2).

Fig. 2
figure 2

Variance in the global heat shock response. a Modelling of the genome-wide transcriptional response to heat-shock (component plot) based on PLS to identify latent structures in the data for cohort of 43 LCLs. The x-axis represents the first PLS component which segregates basal samples (left) and heat shocked samples (right). The y-axis represents the second PLS component which involves variation between cell lines in basal and heat shock response states. Each cell line’s basal and heat shock samples are similarly coloured and paired samples are connected with an arrow, which represents the vector used as quantitative trait in the genetic association test for genetic modulators of the global heat shock response. The average response is indicated by a black arrow. Overall, samples separate clearly by treatment, showing a consistent global effect on gene expression from heat shock. Heat shock stimulated samples show evidence of three distinct clusters (indicated by shaded ovals). b Unsupervised hierarchical cluster analysis with heat shock stimulated samples showing evidence of three distinct clusters (indicated on panel A by shaded ovals). Below the cluster dendrogram is a heatmap showing differential gene expression. Expression estimates for each gene were scaled and centred across samples. Blue cells correspond to lower than average expression and red cells correspond to higher than average expression. c Volcano plot of differential expression results between clusters 1 and 2. Probes with an adjusted p value below 0.01 and a log FC of at least 0.5 are shown as yellow and red dots

In addition to the pronounced shared response to heat shock that is largely accounted for by the first component, a further effect related to differences in the individual response is noticeable in the second component. This manifests in a visually striking grouping of samples into three clusters post treatment (Fig. 2). To further characterise the difference between these clusters we carried out a differential expression analysis between the two clusters that differ most with respect to the second PLS component. Using an FDR threshold of 0.01 and requiring a FC of at least 1.2, this identified 1094 differentially expressed probes (Additional file 3: Table S7). Of these 681 are upregulated and 415 are downregulated in cluster 2 compared to cluster 1 (Fig. 2).

To further investigate which biological processes underlie the observed differences, we carried out a GO analysis of genes exhibiting significantly increased expression in either cluster. GO categories enriched in the set of genes upregulated in cluster 2 are largely similar to those identified in the analysis of genes that show increased expression in response to heat shock, including response to unfolded protein (GO:0006986) and response to topologically incorrect protein (GO:0035966) (Additional file 3: Table S8). In contrast, genes with higher expression in cluster 1 are enriched for GO annotations relating to DNA replication and cell division including DNA recombination (GO:0006310) and DNA replication (GO:0006260) (Additional file 3: Table S9).

To explore to what extent this response is modulated by genetic variation, we used the length and direction of the response vector, i.e. the vector between the basal and stimulated sample for each individual in the space spanned by the first two PLS components, together with the location of the basal sample in the same space, as a multivariate phenotype. This was then tested for association with genotypes for SNPs within a 10-kb window of differentially expressed genes following heat shock or genes encoding predicted upstream regulators of differentially expressed genes identified by IPA analysis. This revealed two significant associations (Fig. 3). The first involved rs10509407 (FDR 0.021), a promoter variant of MINPP1 (encoding endoplasmic reticulum luminal enzyme multiple inositol polyphosphate phosphatase), which was in complete linkage disequilibrium with three further SNPs. The other association we identified involved rs12207548 (FDR 0.064), a regulatory variant located in a CTCF binding site 1.14 kb downstream of CDKN1A. CDKN1A is an important regulator of cell cycle progression. The SNP rs12207548 shows significant variation in allele frequency between human populations (Fig. 3) with an estimated FST of 0.142 (the FST providing a summary of the genetic differentiation between these populations).

Fig. 3
figure 3

Genotypic association with global heat shock response. a Standardized coefficients and adjusted p values for the top associated SNPs. b, c The distribution of p values after permutation of the global response phenotype is shown for rs10509407 (b) and rs12207548 (c). d, e Global response to heat shock showing individual LCLs by genotype for rs10509407 (d) and rs12207548 (e). Each individual is represented by two points corresponding to basal and stimulated state with arrows connecting paired samples. Genotypes are indicated by colour with blue corresponding to homozygous carriers of the major allele and red indicating the presence of at least one copy of the minor allele. Coloured arrows show the average response for each group. The overall average is indicated in black. f Ancestral Allele Frequencies for rs12207548 from Human Genome Diversity Project in 53 populations. g Circos plot showing trans associations for rs12207548. h Box plots for expression of UBQLN1, HSF1, TNFRSF8, EPHB1, SHC1, ZC3HAV1 and ABCD3 by allele for SNPs as indicated. i Pathway analysis using IPA showing links between trans associated genes for rs12207548 and CDKN1A

To explore the observed association between heat shock response and genotypes at these two loci, we proceeded to test for association with differential expression (FC) following heat shock for individual genes with the two identified variants. We found evidence that both SNPs show trans association with differential induction of UBQLN1 after heat shock (rs10509407 FDR 0.011, beta 0.232; rs12207548 FDR 0.010, beta –0.238) (Fig. 3). UBQLN1 encodes ubiquilin, which is involved in protein degradation by linking the ubiquitination machinery to the proteasome. We found that rs12207548 was also associated with a trans network involving differential expression of six further genes: HSF1 (FDR 0.00075, beta –0.643); TNFRSF8 (FDR 0.00075, beta –0.477); EPHB1 (FDR 0.00075, beta –0.532); SHC1 (FDR 0.0031, beta –0.456); ZC3HAV1 (FDR 0.0036, beta –0.399) and ABCD3 (FDR 0.010, beta –0.279) (Fig. 3). Network analysis using IPA highlights the relationship of these trans genes, either directly or involving additional molecules, with CDKN1A (Fig. 3).

Discussion

We have generated a comprehensive catalogue of differential gene transcription following heat shock for human LCLs, significantly expanding the number of genes recognised to be upregulated and downregulated by exposure of cells to heat shock [4, 8, 9]. We have shown how this relates to HSF1 and HSF2 recruitment and determined several key nodal molecules in the observed pattern of differential expression using a network approach. This includes a role for ubiquitin C and small ubiquitin-like modifiers SUMO2/3 as well as heat shock proteins, transcription factors (NFkB, CEBP, JUN) and cytokines (TNF, IL6). Given that transcriptomic differences may not be reflected at a protein level [52], complementary proteomic analysis such as used to define stress-independent HSF1 activation in a ligand-mediated cell line model system would be informative [53].

We have investigated variation in the global heat shock response across individual LCLs, defining a multivariate phenotype using PLS which revealed evidence of clustering with relative predominance of differential expression of genes involved in DNA replication and cell division in some individuals. We further investigated specific genotypic associations with the observed variation which revealed associations with putative regulatory variants, tagged by rs10509407 and rs12207548 located in/near the genes MINPP1 and CDKN1A, key genes involved in cell growth and survival. These SNPs show trans association with differential expression following heat shock of UBQLN1 (ubiquilin), an important mediator of degradation of proteins in the stress response [54] implicated in Alzheimer’s disease [55], and a network of six further genes including HSF1. However, we did not observe cis-associations with expression of MINPP1 and CDKN1A which leaves unresolved the cis-drivers of the observed trans associations. This may require additional time points of sampling to capture such cis-effects, as illustrated by our recent studies of trans-eQTL following endotoxin induction [16].

Our results are necessarily exploratory given the modest sample size of this study requiring further validation and functional characterisation to establish mechanism. If functionally validated, the geographic distribution of the major and minor alleles of rs12207548 suggests selection may be operating on such variants. We recognise that there may be cell type-specific differences in heat shock response not captured by our analysis in LCLs, including differences in HSF binding from the K562 cell line, and that there may also be population specific differences in terms of regulatory variants with the data presented here generated in cells from individuals of African ancestry. We elected to follow a focused high-level approach in this paper as we are not adequately powered for a systematic QTL analysis of all individual genes.

Our approach to analysing the global transcriptional response to stimuli or treatment as a multivariate phenotype provides a single global phenotype for analysis, rather than several thousands of gene-level phenotypes, which is more robust to probe-level technical artefacts and reduces the number of multiple comparisons as well as computational cost of eQTL analysis, especially for omics-scale data. We suggest it is broadly applicable and relevant to other phenotypes in which modulation by genetic variation may be sought. These are highlighted by recent work that has demonstrated the context-specificity of regulatory variants including different disease contexts through QTL approaches in patient samples [15]. For the inflammatory response, these can be complemented by analysis ex vivo of specific phenotypes such as heat shock.

Conclusions

We have defined the global transcriptional response to heat shock for a panel of human B lymphocyte cell lines, establishing a comprehensive catalogue of differentially expressed genes, pathways and networks of broad utility to understand this highly conserved and pathophysiologically significant response. We have also explored the genetic basis for inter-individual variation in the global response, highlighting putative regulatory variants modulating ubiquilin and a further trans gene network.