Alternative DNA secondary structure formation affects RNA polymerase II promoter-proximal pausing in human
Alternative DNA secondary structures can arise from single-stranded DNA when duplex DNA is unwound during DNA processes such as transcription, resulting in the regulation or perturbation of these processes. We identify sites of high propensity to form stable DNA secondary structure across the human genome using Mfold and ViennaRNA programs with parameters for analyzing DNA.
The promoter-proximal regions of genes with paused transcription are significantly and energetically more favorable to form DNA secondary structure than non-paused genes or genes without RNA polymerase II (Pol II) binding. Using Pol II ChIP-seq, GRO-seq, NET-seq, and mNET-seq data, we arrive at a robust set of criteria for Pol II pausing, independent of annotation, and find that a highly stable secondary structure is likely to form about 10–50 nucleotides upstream of a Pol II pausing site. Structure probing data confirm the existence of DNA secondary structures enriched at the promoter-proximal regions of paused genes in human cells. Using an in vitro transcription assay, we demonstrate that Pol II pausing at HSPA1B, a human heat shock gene, is affected by manipulating DNA secondary structure upstream of the pausing site.
Our results indicate alternative DNA secondary structure formation as a mechanism for how GC-rich sequences regulate RNA Pol II promoter-proximal pausing genome-wide.
Keywords: DNA secondary structure; RNA polymerase II; promoter-proximal pausing
While DNA is typically found in the B-DNA conformation, it has the ability to form a variety of non-B DNA secondary structures, including hairpins and quadruplexes. During DNA processes like replication and transcription, the duplex DNA is unwound, potentially allowing single-stranded DNA to form stable secondary structures, including stem-loop structures. Once formed, DNA secondary structures can play a role in many processes, including replication, transcription, and DNA repair. The extent of DNA secondary structure formation at a genome-wide level is still not fully understood, but studies have demonstrated the importance of DNA secondary structure formation at a number of genomic loci. DNA hairpins formed during V(D)J recombination protect the ends of coding sequences prior to processing by Artemis:DNA-PKcs [2, 3]. DNA secondary structures can also influence DNA-modifying enzyme activity, either by protecting DNA [4, 5] or by promoting enzyme activity [6, 7]. A variety of DNA structures formed in gene promoter regions have been suggested to block gene expression for a number of genes [8, 9, 10, 11, 12, 13, 14]. Conversely, DNA secondary structure can promote transcription by altering transcription factor binding sites.
During transcription, RNA polymerase II (Pol II) binds and initiates transcription, but only travels within 100 nucleotides (nt) downstream of the transcription start site before stalling in a process known as promoter-proximal pausing [16, 17, 18, 19, 20, 21]. While the full mechanisms leading to promoter-proximal pausing and the eventual release into productive elongation are still not fully understood, several protein factors, as well as the DNA and RNA sequences themselves, have been shown to contribute [8, 10, 22, 23, 24, 25, 26, 27, 28]. Several studies have postulated that CpG islands and local GC-rich sequences typically found in the promoter-proximal region of many genes might serve as energy barriers due to the stronger duplex DNA sequence formed by G–C base pairs [18, 29]. Alternatively, GC-rich sequences are likely to form stable DNA secondary structure, which could provide a mechanism for how high GC content in the promoter-proximal regions can influence Pol II pausing.
Formation of stable DNA secondary structures, such as quadruplexes, can occur throughout the genome, and sequence motifs for quadruplexes are enriched in the non-template strand of the region flanking the transcription start site (TSS) among pausing genes. However, genome-wide analysis of DNA secondary structure formation has been limited to algorithms that identify specific sequence motifs with the potential to form specific structures, such as the quadparser program for quadruplex structures, but that do not provide stability measurements to rank the structure-forming sequences. Software platforms such as Mfold and ViennaRNA are able to calculate the Gibbs free energy (ΔG) of DNA secondary structures for given sequences, and therefore not only identify sites of potential DNA secondary structure formation, but also provide a measure of structural stability. Further, new methods to measure pausing of Pol II have been developed based on functional genomics data: Pol II ChIP-seq, Global Run-On Sequencing (GRO-seq), Native Elongating Transcript sequencing (NET-seq), and mammalian Native Elongating Transcript sequencing (mNET-seq).
With these tools and datasets in hand, we analyzed the human genome for sites of highly stable DNA secondary structure using the Mfold and ViennaRNA programs with thermodynamic parameters for DNA analysis. We demonstrated that loci with high propensity to form stable secondary structures are highly correlated with loci displaying robust hallmarks of Pol II pausing. Using the combination of Pol II ChIP-seq, GRO-seq, NET-seq, and mNET-seq signals, we refined Pol II pausing sites at nearly single nucleotide resolution, and by overlaying an average free energy profile, we found highly stable secondary structures located ~ 10–50 nt upstream of the pausing sites. Further, analyzing data from two DNA secondary structure-probing experiments performed in human cells [38, 39], we revealed that DNA secondary structures are indeed enriched in the promoter-proximal regions of Pol II pausing sites, which provides strong validity to the application of the Mfold and ViennaRNA programs. Finally, we demonstrated that manipulation of stem-loop structures upstream of the Pol II pausing site in the human HSPA1B gene affects Pol II pausing in an in vitro transcription system with strong pausing associated with more stable DNA secondary structure. Our results demonstrate the potential for alternative DNA secondary structures to play a role in the regulation of gene expression by contributing to the pausing of Pol II during transcription elongation.
DNA secondary structure formation is prevalent throughout the human genome and associates with RNA polymerase II binding at transcription start sites
Because of the high co-localization with TSSs genome-wide, we next determined whether potential highly stable DNA secondary structure sites are enriched with the binding sites of Pol II, using Pol II ChIP-seq data from the encyclopedia of DNA elements (ENCODE) in five cell lines: A549, GM12878, H1-hESC, HeLa-S3, and K562. The level of enrichment for the association was measured by the fold enrichment of Pol II ChIP peaks intersecting highly stable secondary structure sites in comparison to that of random sites (see “Methods” for details). In all five cell lines and for both Mfold and ViennaRNA, we found a 15- to 30-fold enrichment of DNA secondary structure sites at peak sites of RNA Pol II binding (Fig. 1b). These results demonstrate that sites of predicted highly stable DNA secondary structure are significantly enriched at the Pol II binding sites.
To measure the level of enrichment for RNA Pol II ChIP-seq coverage at sites of predicted highly stable DNA secondary structure, we quantified the number of ChIP-seq reads at each Mfold or ViennaRNA site and compared this value to the mean number from randomly shuffled reads (see “Methods”). We find that the degree of significant enrichment is similar to that of the association seen using the Pol II ChIP-seq peaks (Additional file 1: Figure S1). The Mfold and ViennaRNA sites were enriched for RNA Pol II coverage in all five cell lines with an average enrichment of approximately 50- to 100-fold. Overall, these results suggest that alternative DNA secondary structure formation could influence the active transcription mechanism on a genome-wide level.
The promoter-proximal regions of genes with paused RNA polymerase II are able to form highly stable DNA secondary structures
RNA Pol II pausing is a very common step in the transition from initiating to elongating transcription, resulting in a paused polymerase downstream of the TSS and participating in transcription regulation [16, 17, 18, 20, 21]. Results described above strongly suggest that stable secondary structure formed by single-stranded DNA during transcription could play a significant role in Pol II pausing. To define genes in the paused state, we determined the RNA Pol II traveling ratio (TR) as previously described. In brief, we first stratified all RefSeq genes into the Pol II-bound and the no-Pol II groups, and for the Pol II-bound group, the TR was calculated as the ratio of coverage density in the region from − 30 nt to + 300 nt of the TSS over coverage density across the rest of the gene body. Using the same Pol II ChIP-seq datasets as in our association study, we found that 86 and 84% of Pol II-bound genes in HeLa-S3 and H1-hES cells, respectively, are paused, as defined by a TR > 2 (Additional file 1: Figure S2A and Table S3), similar to previous studies. Next, we determined whether the regions surrounding the TSS of paused genes (PAU) are more prone to DNA secondary structure formation than non-paused genes (NPA) or those with no Pol II (NP2). Using our genome-wide Mfold data, we found that the region proximal to the TSS (250 nt upstream to 250 nt downstream) of paused genes displays a significantly lower ΔG than non-paused genes or genes without Pol II bound in all investigated cell lines (Additional file 1: Figure S2B and Table S4), further suggesting the potential for alternative DNA secondary structures to contribute to Pol II pausing in human cells. Very similar results were obtained using ViennaRNA sites (data not shown).
We next sought to determine the location of highly stable secondary structures relative to the Pol II binding site. Regions spanning from 2000 nt upstream to 2000 nt downstream of each gene were analyzed (TSS proximal regions), and average plots of ΔG were shown for paused genes (PAU), non-paused genes (NPA), and genes not bound by Pol II (NP2) (Fig. 2b contains Mfold data; Additional file 1: Figure S2D contains Vienna RNA data). For paused genes, the most stable DNA secondary structures occur at regions directly surrounding and slightly downstream of the TSS, as indicated by the sharp drop in relative free energy. These effects are diminished in non-paused genes and genes without Pol II binding. Notably, the average-gene free energy minimum is slightly upstream of the average-gene Pol II peak at paused genes.
Interestingly, when we compared the free energy landscape between template and non-template strands of the same DNA region, we found a strand bias around the TSS of paused genes, but not of non-paused genes (Additional file 1: Figure S3A). The non-template strand has a significantly higher propensity to form DNA secondary structures than the template strand (p = 1.9 × 10^−28, t-test for average free energy at the − 30 to + 300 region from the TSS) (Additional file 1: Figure S3B).
Together, these data demonstrate that the TSS-proximal region of genes with Pol II pausing are more likely to form DNA secondary structures than those genes that are not paused or are not bound by Pol II, suggesting a role for DNA secondary structures in promoter-proximal RNA Pol II pausing.
Promoter-proximal RNA polymerase II pausing occurs at sites of stable DNA secondary structure formation genome-wide
Because Pol II ChIP-seq is based on immunoprecipitation of RNA Pol II, and its signals might not recapitulate all actual pausing sites, but rather loci bound by Pol II, we next performed our analysis with data from GRO-seq and NET-seq. Both techniques are designed specifically to sequence nascent RNA transcribed by RNA Pol II, therefore marking pausing sites. High resolution average coverage profiles of GRO-seq and NET-seq data from HeLa-S3 cells (Additional file 1: Figure S4A) show very similar patterns to that of the Pol II ChIP-seq profile, in which high levels of short nascent RNAs produced on the coding strands from the TSS co-localize with secondary structure free energy minima for paused genes only.
Interestingly, approximately 37% of non-paused genes (n = 655) also contain Pol II ChIP-seq peaks within TSS proximal regions, and these peaks are located more than 300 nt downstream of the TSS (therefore, by the TR definition, they were classified as non-paused genes). When analyzed based on Pol II ChIP-seq peak summits, these non-paused genes show a similar, close relationship between Pol II pausing and the formation of stable DNA secondary structures (Additional file 1: Figure S4B), even in the absence of Pol II signals directly at TSSs.
These data demonstrate that Pol II-pausing signals are very close to and track sites that have high potential to form relatively stable DNA secondary structures of single-stranded DNA.
Pausing sites defined by mNET-seq spikes are preceded by highly stable DNA secondary structures
Because both paused (Fig. 3) and non-paused genes (Additional file 1: Figure S4B) show the same strong correlation with stable secondary structure in our analysis of RefSeq annotated genes, we next wanted to refine pausing site loci in an unbiased manner (i.e., not based on annotated genes). We hypothesized that pausing sites produce a strong Pol II pausing-related signature evident in all pausing-relevant data sets. Specifically, we first used GRO-seq peaks (n = 37,276) that intersect significantly (> 50% overlap) with Pol II ChIP-seq peaks (n = 26,487). For the resulting list (n = 15,470) of genomic regions, we further required that each peak have mNET-seq coverage above background (5% FDR, n = 13,931). We then ranked all pausing loci from 1 (the strongest) to 13,931 (the weakest) based on the sum of each ranked signal intensity: Pol II ChIP-seq, GRO-seq, NET-seq, and mNET-seq (Additional file 1: Figure S7A).
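One way to read the ranking step above is as a rank sum: each locus is ranked within each assay, and the four ranks are added to give a combined score. The following Python sketch assumes that interpretation. The function names and the toy intensities are illustrative only and are not taken from the original analysis.

```python
from typing import Dict, List


def rank_descending(values: List[float]) -> List[int]:
    """Rank values so that 1 is the highest intensity (ties keep input order)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def combined_ranking(signals: Dict[str, List[float]]) -> List[int]:
    """Return locus indices ordered from strongest to weakest pausing signal.

    Assumption: the combined score is the sum of the per-assay ranks,
    so a smaller sum means a stronger pausing locus.
    """
    n_loci = len(next(iter(signals.values())))
    rank_sums = [0] * n_loci
    for values in signals.values():
        for i, rank in enumerate(rank_descending(values)):
            rank_sums[i] += rank
    return sorted(range(n_loci), key=lambda i: rank_sums[i])


# Toy intensities for three loci across the four assays (made-up numbers).
toy_signals = {
    "pol2_chip": [5.0, 9.0, 1.0],
    "gro_seq": [4.0, 8.0, 2.0],
    "net_seq": [7.0, 3.0, 1.0],
    "mnet_seq": [6.0, 9.0, 0.0],
}
ranking = combined_ranking(toy_signals)
```

In this toy input the second locus has the smallest rank sum and therefore comes first in the combined order.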
This analysis resulted in a ranked list of 13,931 genomic regions that we identify as Pol II pausing sites, with 57% of them (n = 7972, Additional file 2) located within RefSeq gene bodies. When the distances of these sites to TSSs were measured (Fig. 4c), these pausing sites within gene bodies were located about 100 nt downstream of TSSs (Fig. 4c, green line), and a sharp peak of average free energy of stable secondary structure precedes the pausing site about 24 to 48 nt upstream at a free energy of − 3.2 kcal/mol (Fig. 4d, top panel). Three genes, SNAI3-AS1, DHX8, and SMC5, ranked as 7, 239, and 486, respectively, are shown in Additional file 1: Figure S7B as having highly stable DNA secondary structures located immediately upstream of the highest mNET-seq spikes.
The remaining 43% of pausing loci (n = 5959, Additional file 3) are located in either intergenic regions (67%) or enhancer/promoter regions of genes (33%). These pausing sites are located much further away from the nearest TSS or transcription termination site (TTS) of genes (i.e., either downstream or upstream) (Fig. 4c). Most strikingly, the average ΔG of secondary structure formation for those loci (about 24 to 52 nt upstream at a free energy of − 2.8 kcal/mol; Fig. 4d, bottom panel) is very similar to that observed for intragenic pausing loci. Again, inclusion of G-quadruplexes left the location and shape of the sharp free energy minimum unaffected, and lowered the average free energy at the minimum by − 0.4 and − 0.2 kcal/mol for intragenic and intergenic loci, respectively (Additional file 1: Figure S5B). In addition to the average plots, for both groups, the secondary structure-forming free energies are also individually plotted for each pausing site (Additional file 1: Figure S8A, B), and both plots show the same pattern as the averaged plots in Fig. 4d, in which a deep free energy minimum is present 10 to 50 nt upstream of the pausing sites. Further, we plotted the frequency of the positions of secondary structures with free energies within the lowest 2% of free energy values (most stable structures) and found that the structures with the lowest free energies are also enriched just upstream of the pausing sites (Additional file 1: Figure S8C). Conversely, there is a slight depletion in the highest 2% free energy distribution (unfavorable to form the structures) at the region just upstream of the pausing sites, and a uniform density from structures with middle values. These findings further support the close relationship between Pol II pausing and the propensity to create highly stable DNA secondary structures upstream of pausing loci, and the same stable DNA secondary structure signature is observed beyond the RefSeq annotated genes.
Alternative DNA secondary structures identified by probing experiments are enriched at the promoter-proximal regions of paused genes in human cells
We noticed that the majority (80 to 90%) of Pol II-bound genes are paused in all eight investigated cell lines (Additional file 1: Table S3). Importantly, this was based on the stratification of genes according to traveling ratios (TRs) for the three groups (NP2, NPA, and PAU) without accounting for differences or similarity of gene expression across the cell lines. Because a significant amount of DNA sequences are shared among different cell types, this prompted us to investigate pausing states of genes across these eight cell lines derived from different cell types. If Pol II pausing can be driven by DNA sequences/structures, strong overlaps of genes with similar pausing states among these different cell lines will be observed. Indeed, gene ranks of TR-based pausing states for all Pol II-bound genes are highly correlated among cell lines (Additional file 1: Figure S11A). Spearman correlation coefficients calculated for all pairs of cell lines range from 0.71 between H1-hESC and NHEK to 0.88 between GM12878 and K562. Furthermore, a heat map of pausing states across these cell lines (Additional file 1: Figure S11B) clearly demonstrates consistent gene states across cell lines with genes in either paused or non-paused states. Moreover, we identified the numbers of genes which switch pausing states between cell lines: from paused to non-paused and from non-paused to paused in each pair of cell lines (Additional file 1: Figure S11C). The “switched” genes in each cell line pair are about 1–3% of total RefSeq genes, and TRs of those genes are rather low (80% have a TR within the first quartile of all TRs). These differences may result from the threshold used to group genes into paused and not paused categories, and also may be due to variable quality of Pol II ChIP-seq experiments across different laboratories. It is also possible that cell type-specific factors which contribute to Pol II pausing are present.
These analyses suggest that the degree of Pol II pausing and the genes at which Pol II pauses are similar and shared across eight different human cell lines, and that DNA secondary structure formation detected in NHEK and Raji cells plays a role in the ubiquitous process of Pol II pausing. Importantly, these analyses and results provide a concrete mechanism—propensity to form DNA secondary structures—for how high GC content at TSSs influences Pol II pausing.
DNA secondary structures affect RNA Pol II promoter-proximal pausing in vitro
In this study, using DNA secondary structure calculation programs, we have, for the first time, provided an energetic potential for secondary structure formation across the human genome in an unbiased manner. This analysis not only indicates genomic regions with the potential to form secondary structures, but also estimates the relative propensity for these regions to actually form such structures. We found that DNA secondary structures are prone to form from single-stranded DNA at the promoter-proximal region of genes displaying Pol II pausing shortly after transcription initiation, with the secondary structure on average located upstream of the Pol II pausing sites. This DNA secondary structure–pausing site relationship is also present in regions located outside of the RefSeq annotated genes, suggesting a common feature associated with Pol II pausing at loci of coding genes and non-coding, intergenic DNA. Further, the presence of DNA secondary structures at the promoter-proximal regions of paused genes can be confirmed in human cells. The mutation analysis of the HSPA1B gene region demonstrates that disruption of DNA secondary structures proximal to pausing sites reduces Pol II pausing in HeLa nuclear extracts in vitro.
The corresponding structures to those found on the non-template DNA strand could also be present on newly synthesized RNA due to sequence similarity. However, the significant portion of the RNA potentially capable of forming the corresponding structures would still be annealed to template DNA within the transcription complex. Therefore, it is unlikely that those structures could be formed on nascent RNA and impact Pol II pausing.
Many paused genes are known to have GC-rich promoter elements, and recent work has suggested a role for CpG islands located within human gene promoters in regulation of Pol II pausing at distinct sites. The possible contribution of these GC-rich sequences in Pol II pausing as an energy barrier for the unwinding of duplex DNA by Pol II has been proposed. However, we found that the enrichment of high GC content at the region of the free energy minimum was just upstream of the pausing sites (Additional file 1: Figure S13A, B). If these same sequences were forming highly stable DNA–DNA bonds that acted as a barrier to Pol II elongation, thereby facilitating pausing, they would be downstream of the Pol II pausing sites. Moreover, regions with similar free energy minima, but with either a low (< 48%, n = 200) or high (70 to 75%, n = 200) GC content in the 50-nt upstream region of pausing sites showed a similar pausing pattern (Additional file 1: Figure S15A, B). This indicates that although GC-rich sequence is associated with low DNA secondary structure free energy, nucleotide composition alone cannot explain Pol II pausing. Further, we found a strand bias around the TSS of paused genes (but not of non-paused genes), with the non-template strand having a significantly higher propensity to form DNA secondary structures than the template strand (Additional file 1: Figure S3). Additionally, the HSPA1B mutant experimental data suggest that the free energy of secondary structure formation from the non-template strand is more strongly anti-correlated with Pol II pausing than the free energy of secondary structure formation from the template strand (Additional file 1: Figure S12B). These analyses suggest that the ability to form DNA secondary structures on the non-template strand could be a mechanism for how high GC content at TSSs influences Pol II pausing.
A study using a search algorithm identified stem loop-containing quadruplex sequences on both template and non-template DNA in promoters and TSSs. In contrast, Eddy et al. found that the frequency of G4 quadruplex motifs peaks at around 200 nt downstream of the TSS, and that the motifs display a non-template strand bias. One drawback of both studies is that there are no stability measurements to rank the structure-forming sequences, while in our study, each structure is associated with a free energy estimate that quantifies the structure formation potential.
We also considered the possibility that the mNET-seq data used in our analysis could be influenced by Pol II backtracking. Backtracking occurs when RNA polymerase becomes disengaged from the 3′ end of nascent RNA and moves upstream. Backtracking can be relieved by moving forward to the previous positions or by TFIIS cleaving nascent RNA. Studies have shown that backtracking depends on several factors, including the stability of the RNA:DNA hybrid: the weaker the hybrid, the more likely backtracking is to occur [52, 53, 54]. What is not clear is the degree of backtracking and the degree of TFIIS cleavage under the conditions in which the mNET-seq data were generated. To search for possible traces of backtracking, for each pausing site located within a gene (we have defined the pausing sites as the strongest mNET-seq read spikes), we also identified the second most intense mNET-seq spike and calculated the distance between the strongest and the second strongest mNET-seq spikes (n = 5263) (Additional file 1: Figure S16). The distribution of the distances shows that secondary spikes are within a few nucleotides of the strongest primary spikes. Therefore, if backtracking and TFIIS cleavage are present, we estimate that backtracking could at most change our result by ± 5 nt, which does not affect our overall results and conclusion.
Using a combination of Pol II pausing-related data sets, we developed a novel approach to identify pausing sites genome-wide. Our approach is entirely based on measurements rather than gene annotations as in previously proposed methods [22, 29, 31], and it provides the location of pausing sites at single nucleotide resolution. As a result, we can discover de novo pausing sites, and we found nearly 6000 intergenic pausing sites in HeLa-S3 cells. Among them, some could be located at 3′ ends of annotated genes downstream of poly(A) signals, and some could be within genes that are not annotated in RefSeq. For example, 9% (n = 515) of those pausing sites are associated with long non-coding RNAs (lncRNAs) [55, 56], and recently, Pol II pausing has been shown to regulate transcription of a subset of lncRNAs in mammalian cells. Further annotation of these sites will provide possible insights into the role of DNA secondary structure in Pol II pausing as well as the functional importance of specific non-coding RNAs.
In the in vitro pausing assay, robust regression analysis of the HSPA1B mutant sequences showed that pausing signals diminish as DNA secondary structure becomes less stable, with the TGG and GGG mutants identified as outliers. These two mutants have more stable secondary structures than the WT sequence but do not display stronger pausing in our assay. We speculate that sequence mutations could also influence sites of transcription factor binding. Analysis of transcription factor binding sites in all 16 HSPA1B sequences used in the assay with FIMO, a motif search tool, and the HOCOMOCOv10 HUMAN mono transcription factor motif database shows that the TGG and GGG mutants generate possible binding sites for TWST1 and MeCP2, respectively, suggesting that binding of these factors at these positions may influence the Pol II pausing process.
We have presented an unbiased genome-wide analysis of Pol II pausing that led to the discovery of novel pausing sites located outside of annotated genes. Our study of DNA propensity to create secondary structures demonstrates a strong relationship of highly stable DNA secondary structure with Pol II pausing sites and their proximal location. Further, we present a mechanistic model that supports a role for DNA sequence features such as GC content and connects these features to potential recruitment of transcription factors in influencing Pol II pausing. Uncovering detailed mechanisms of how these DNA sequence elements and secondary structures establish and maintain paused polymerase awaits future studies.
Genome-wide DNA secondary structure predictions using Mfold and ViennaRNA
The human genomic DNA sequences for each individual chromosome (build GRCh37/hg19) were downloaded from the UCSC genome browser as FASTA files. DNA secondary structure prediction using Mfold and ViennaRNA was performed using a 300-nt sliding window with a 150-nt step size across all chromosome sequences. For all genome-wide Mfold analyses, the default [Na+], [Mg2+], and temperature inputs were 1.0 M, 0.0 M, and 37 °C, respectively. In vitro transcription conditions (60 mM KCl, 7 mM MgCl2, and 30 °C) were used for sequences examined using HeLa nuclear extracts. ViennaRNA analysis was performed using RNAfold v2.1.5 with thermodynamic parameters specifically for folding single-stranded DNA sequences (parameter file dna_mathews2004.par from the ViennaRNA package v2.1.5). G-quadruplex predictions were incorporated, and GU and lonely pairs were disallowed in the secondary structures. The free energy (ΔG) value, in kcal/mol, of the most stable predicted secondary structure for each 300-nt window was used for the analyses. Undefined values that corresponded to sites of incomplete sequence in the human genome were not included in the analysis. A site of predicted highly stable DNA secondary structure was defined as seven consecutive windows in which the predicted free energy value was in the lowest 5% of all values [40, 41].
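The windowing and thresholding steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the original pipeline: the free energies would come from Mfold or RNAfold and are passed in here as a plain list, the function names are hypothetical, and the simple percentile rule is an assumption about how the lowest 5% cutoff could be taken.

```python
from typing import Iterator, List, Tuple


def sliding_windows(seq_len: int, window: int = 300,
                    step: int = 150) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) windows that fit in the sequence."""
    for start in range(0, seq_len - window + 1, step):
        yield start, start + window


def percentile_cutoff(values: List[float], fraction: float = 0.05) -> float:
    """Return the largest value among the lowest `fraction` of all values."""
    ordered = sorted(values)
    k = max(1, int(len(ordered) * fraction))
    return ordered[k - 1]


def stable_runs(energies: List[float], cutoff: float,
                min_run: int = 7) -> List[Tuple[int, int]]:
    """Return (first, last) window indices of runs of at least `min_run`
    consecutive windows whose free energy is at or below the cutoff."""
    runs = []
    run_start = None
    # A trailing +inf sentinel closes a run that reaches the last window.
    for i, dg in enumerate(energies + [float("inf")]):
        if dg <= cutoff:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                runs.append((run_start, i - 1))
            run_start = None
    return runs


# Toy per-window free energies in kcal/mol (made-up numbers).
toy_energies = [-1.0] * 3 + [-10.0] * 7 + [-1.0] * 2 + [-10.0] * 6
toy_sites = stable_runs(toy_energies, cutoff=-5.0)
```

With this toy input only the run of seven low-energy windows qualifies; the later run of six windows is too short to be reported.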
Functional genomics data analyses
If not stated otherwise, functional genomic data analyses and comparisons to gene annotations were performed using BEDTools (v2.24.0). Heat maps of genomic data (ChIP-seq, GRO-seq, NET-seq, mNET-seq) were generated using NGS plot (v2.61). Free energy heat maps were visualized using Python (v3.5.2) with the matplotlib.pyplot library (v2.0.2). If necessary, alignments of sequencing data were performed using bowtie2 (v2.1.0) and STAR (v2.5.1b) following descriptions from the original papers. Peaks were called with MACS2 (v2.0.9. 20111102) with default parameters. GRO-seq signal peaks were called for each strand separately. First, reads were separated according to the strand to which they aligned. Next, MACS2 was used to call peaks with the --nomodel option. Finally, information about strand was added to the resulting lists of peaks, and top and bottom strand peak information was merged.
Annotation of sites of predicted highly stable DNA secondary structures
Promoter region ranging from − 1 knt to − 250 nt of the TSS
TSS region ranging ± 250 nt from the TSS
TTS region ranging ± 250 nt from the TTS
Gene body region ranging from + 250 nt of TSS to − 250 nt of the TTS
Intergenic region is the rest of the genome, not belonging to any of the four regions detailed above.
Free energy intervals (Mfold and ViennaRNA) that overlapped TSS regions were first assigned this annotation. Remaining regions were compared to TTS regions, promoter regions, and gene body regions. Free energy intervals that did not overlap these annotations were annotated as intergenic. The resulting number of intervals was normalized to the total size of each region.
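The priority-based annotation above can be illustrated with a short Python sketch for a single plus-strand gene. The region boundaries follow the definitions listed above; the order in which TTS, promoter, and gene body regions are checked after the TSS region is an assumption taken from the order in which they are named, and the function names and coordinates are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (name, start, end), half-open coordinates


def gene_regions(tss: int, tts: int) -> List[Region]:
    """Build annotation regions for a plus-strand gene, in priority order."""
    return [
        ("TSS", tss - 250, tss + 250),
        ("TTS", tts - 250, tts + 250),
        ("promoter", tss - 1000, tss - 250),
        ("gene body", tss + 250, tts - 250),
    ]


def annotate(interval: Tuple[int, int], regions: List[Region]) -> str:
    """Return the first region (in priority order) that the interval overlaps,
    or 'intergenic' when it overlaps none of them."""
    start, end = interval
    for name, region_start, region_end in regions:
        if start < region_end and region_start < end:
            return name
    return "intergenic"


# Toy gene with TSS at 10,000 and TTS at 20,000 (made-up coordinates).
toy_regions = gene_regions(tss=10_000, tts=20_000)
```

An interval that straddles the promoter and TSS regions, such as (9700, 9800), is labeled TSS because that region is checked first.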
Association analysis of Mfold and ViennaRNA sites with RNA polymerase II features
Association analysis was performed using BEDTools software to identify the number of Mfold or ViennaRNA sites that overlap with Pol II ChIP-seq peak intervals. Mfold or ViennaRNA sites were shuffled to generate random secondary structure sites consisting of the same number and size of the actual Mfold or ViennaRNA significant secondary structure sites. The ratio of the number of Mfold or ViennaRNA sites overlapping ChIP-seq peaks relative to the mean overlap for 10,000 shuffled sites is defined as a fold enrichment. The fold enrichment quantifies the association of the sites of predicted DNA secondary structure with Pol II binding compared to that expected by chance.
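The fold enrichment defined above can be sketched as a small permutation test. The Python example below is a simplified stand-in for the BEDTools-based analysis: it assumes a single linear chromosome, places shuffled sites uniformly at random while preserving their number and size, and uses hypothetical function names and toy coordinates.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open (start, end)


def count_overlapping(sites: List[Interval], peaks: List[Interval]) -> int:
    """Count sites that overlap at least one peak (simple quadratic scan)."""
    return sum(
        any(s < peak_end and peak_start < e for peak_start, peak_end in peaks)
        for s, e in sites
    )


def shuffle_sites(sites: List[Interval], genome_len: int,
                  rng: random.Random) -> List[Interval]:
    """Move each site to a uniformly random position, keeping its length."""
    shuffled = []
    for s, e in sites:
        length = e - s
        start = rng.randrange(0, genome_len - length + 1)
        shuffled.append((start, start + length))
    return shuffled


def fold_enrichment(sites: List[Interval], peaks: List[Interval],
                    genome_len: int, n_shuffles: int = 1000,
                    seed: int = 0) -> float:
    """Observed overlap count divided by the mean count over shuffled sites."""
    rng = random.Random(seed)
    observed = count_overlapping(sites, peaks)
    total = sum(
        count_overlapping(shuffle_sites(sites, genome_len, rng), peaks)
        for _ in range(n_shuffles)
    )
    mean_random = total / n_shuffles
    return observed / mean_random if mean_random > 0 else float("inf")


# Toy data: two structure sites that each sit on a Pol II peak.
toy_sites = [(100, 200), (1000, 1100)]
toy_peaks = [(150, 160), (1050, 1060)]
toy_fe = fold_enrichment(toy_sites, toy_peaks, genome_len=100_000)
```

A real analysis would also respect chromosome boundaries and assembly gaps when shuffling, which this sketch ignores.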
Quantitative read overlap analysis was performed by calculating Mfold or ViennaRNA sites coverage of Pol II ChIP-seq reads within Pol II ChIP-seq peaks. The Mfold or ViennaRNA sites were then shuffled 1000 times maintaining the same number and size of the actual sites of secondary structure. The ratio of the number of reads intersecting actual Mfold or ViennaRNA sites over the mean number of reads intersecting randomly shuffled sites was used as a fold enrichment measure of association between Mfold or ViennaRNA determined sites and Pol II ChIP-seq peak signal.
Genomic regions selected for downstream analysis
To avoid bias towards genes with alternative splicing, we used the longest transcript for each gene in RefSeq (build GRCh37/hg19). Then, we filtered out any genes shorter than 660 nt, which is twice the length of the TSS-containing region (see TR definition) to avoid inaccurately classifying gene pausing status. Moreover, we discarded all genes with undetermined DNA sequences (i.e., nucleotides designated as “N”) within TSS ± 2 knt, and finally obtained 23,122 unique genomic loci.
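The three filtering steps above can be sketched as follows. The record layout (dictionaries with gene, chrom, start, end, and strand keys), the use of genomic span as transcript length, and the function name select_loci are assumptions made for illustration; the original implementation is not described in this section.

```python
def select_loci(transcripts, genome, min_length=660, flank=2000):
    """Keep one locus per gene following the three filters described in the text.

    transcripts: list of dicts with keys gene, chrom, start, end, strand
                 (0-based, half-open coordinates; an assumed layout).
    genome: dict mapping chromosome name to its sequence string.
    """
    # 1. Keep the longest transcript of each gene (length = genomic span here).
    longest = {}
    for t in transcripts:
        best = longest.get(t["gene"])
        if best is None or t["end"] - t["start"] > best["end"] - best["start"]:
            longest[t["gene"]] = t

    kept = []
    for t in longest.values():
        # 2. Discard genes shorter than 660 nt.
        if t["end"] - t["start"] < min_length:
            continue
        # 3. Discard genes with undetermined bases within TSS +/- 2 knt.
        tss = t["start"] if t["strand"] == "+" else t["end"]
        lo, hi = tss - flank, tss + flank
        seq = genome[t["chrom"]]
        if lo < 0 or hi > len(seq):
            continue  # assumption: also drop windows running off the chromosome
        if "N" in seq[lo:hi].upper():
            continue
        kept.append(t)
    return kept
```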
High resolution promoter-proximal DNA secondary structure free energy calculations
To model DNA secondary structure formation during transcription more accurately, ΔG calculations of secondary structure formation were performed on the non-template strand sequence of each gene region using a 30-nt window with a 1-nt step size. For the TSS ± 2 knt region of each gene, the most stable DNA secondary structure and its associated ΔG were determined for each 30-nt window using Mfold (v3.6) and ViennaRNA (v2.1.9) software with their default parameters for DNA calculations. Each ΔG value was assigned to the middle (15th nt) of the current window, resulting in a profile ranging from 1985 nt upstream to 1985 nt downstream of the TSS.
The traveling ratio (TR) was calculated as previously described. In brief, RefSeq genes (build GRCh37/hg19) were first stratified into two groups: those intersecting and those not intersecting Pol II ChIP-seq peaks. For Pol II-bound genes, we calculated read coverage in two regions: − 30 to + 300 nt from the TSS, and the rest of the gene body. We then determined the Pol II ChIP-seq read density of each region by dividing its read coverage by its length. TR was calculated as the ratio of the read density in the − 30 to + 300 nt region to the read density within the rest of the gene.
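The sliding-window bookkeeping can be sketched as below. The folding step is passed in as a callable because the actual ΔG values would come from Mfold or ViennaRNA; the toy_energy function is a placeholder with no thermodynamic meaning, included only so that the sketch runs. The exact end positions of the profile depend on the coordinate convention, which is an assumption here.

```python
def delta_g_profile(sequence, tss_index, fold_energy, window=30, step=1):
    """Slide a window along the sequence and assign each energy to the 15th nt.

    sequence: non-template strand sequence around the TSS.
    tss_index: 0-based index of the TSS within the sequence.
    fold_energy: callable returning the free energy of a subsequence
                 (in practice a wrapper around a folding program).
    Returns a list of (offset_from_TSS, energy) pairs.
    """
    profile = []
    for start in range(0, len(sequence) - window + 1, step):
        subsequence = sequence[start:start + window]
        center = start + window // 2 - 1  # 15th nt of a 30-nt window, 0-based
        profile.append((center - tss_index, fold_energy(subsequence)))
    return profile

def toy_energy(subsequence):
    """Placeholder only: NOT a thermodynamic model, just minus the G/C count."""
    return -float(subsequence.count("G") + subsequence.count("C"))

# Usage with a short made-up sequence.
example = "AT" * 20 + "GC" * 20
profile = delta_g_profile(example, tss_index=40, fold_energy=toy_energy)
most_stable_offset, lowest_energy = min(profile, key=lambda pair: pair[1])
```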
Based on the definitions above, all genes were divided into three groups: genes without Pol II binding, non-paused (TR ≤ 2), and paused (TR > 2) genes.
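The TR calculation and the three-way grouping can be written as a short sketch. The function names, the argument layout, and the handling of a gene body with zero read density are assumptions for illustration; only the 330-nt promoter-proximal region and the TR threshold of 2 come from the text.

```python
def traveling_ratio(proximal_coverage, body_coverage, body_length, proximal_length=330):
    """Ratio of read density in the -30 to +300 nt region to that in the rest of the gene."""
    proximal_density = proximal_coverage / proximal_length
    body_density = body_coverage / body_length
    if body_density == 0:
        return float("inf")  # assumption: no gene-body signal gives an unbounded TR
    return proximal_density / body_density

def classify_gene(has_pol2_peak, tr=None, threshold=2.0):
    """Assign a gene to one of the three groups defined in the text."""
    if not has_pol2_peak:
        return "no Pol II binding"
    return "paused" if tr > threshold else "non-paused"

# Example: 990 reads over the 330-nt region versus 1000 reads over a 1000-nt body.
tr = traveling_ratio(990, 1000, 1000)
group = classify_gene(True, tr)
```

A gene with TR exactly equal to 2 falls in the non-paused group, since the paused group is defined by TR > 2.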
Statistical analysis of mNET coverage
Its parameters were estimated using mNET-seq coverage ranging from 3 to 100. This normalized density was used as a null model distribution from which p values were calculated and FDR corrected. We applied a 5% FDR cutoff which corresponded to mNET read coverage > 4 reads.
Generation of in vitro transcription templates and secondary structure mutants
Secondary structure mutants were generated with the Q5 Site-Directed Mutagenesis Kit (NEB). All primers were designed using the NEBaseChanger program and the pGEM-HSPA1B plasmid containing the HSPA1B sequence (− 547 to + 293 relative to TSS). Generation and amplification of the mutated plasmids were carried out as described in the manufacturer’s protocol with the following PCR conditions: one cycle of 98 °C for 30 s; 25 cycles of 98 °C for 10 s, 65 °C for 10 s, 72 °C for 2 min; one cycle of 72 °C for 2 min. Annealing temperatures were adjusted depending on the primer used to generate each mutant. Following the Kinase-Ligase-Dpn I reaction, mutated plasmids were transformed into SURE2 cells (Agilent) according to the manufacturer’s specifications, except that LB broth was used instead of SOC media. After transformation, colonies grown on LB + Ampicillin plates were selected and incubated for 16–18 h, and DNA was extracted using the GeneJET Plasmid MiniPrep Kit (Thermo Scientific). Mutations were validated by DNA sequencing and were verified to be the only change in the templates used in the pausing assay.
In vitro transcription and pausing assay
All HSPA1B plasmids were subjected to restriction enzyme digestion with AccI and BsaAI to generate a 1290 bp DNA fragment containing HSPA1B sequence (− 547 to + 293 relative to TSS). DNA fragments for each mutant were gel-purified and biotinylated with a biotin conjugated forward primer (5′-biotin-GAACCATCACCCTAATCAAG-3′) using the PCR conditions: one cycle of 94 °C for 3 min; 25 cycles of 94 °C for 30 s, 62 °C for 30 s, and 72 °C for 30 s; one cycle of 72 °C for 7 min. DNA templates were cleaned with the PCR purification kit (BioBasic Inc.) and isolated with Dynabeads MyOne Streptavidin C1 (Invitrogen) beads. For in vitro transcription reactions, 125 ng of biotinylated fragments were immobilized on the beads at 2 fmol DNA per microgram of beads. DNA-bead complexes were washed with double bead volume of 1X B&W buffer. DNA–bead complexes were then washed with double bead volume of transcription buffer (13 mM HEPES pH 7.6, 60 mM KCl, 0.1 mM EDTA, 7 mM DTT, 13% glycerol, 7 mM MgCl2, 10 μM ZnCl2, 10 mM creatine phosphate). DNA–bead complexes were incubated for 30 min at room temperature with nuclear extract prepared following the Cold Spring Harbor HeLa Nuclear Extract protocol  using HeLa cells (Texcell, Inc.). DNA–protein complexes were pulled down and loosely bound proteins were washed away with ten bead volumes of TW buffer (13 mM HEPES pH 7.6, 60 mM KCl, 100 μM EDTA, 7 mM DTT, 13% glycerol, 7 mM MgCl2, 10 μM ZnCl2, 0.0125% NP-40). DNA–protein complexes were resuspended in 23 μL of transcription buffer, and 1 μL of an rNTPs mixture (0.4 mM rCTP, 10 mM rATP, 10 mM rUTP, 10 mM rGTP) and 1 μL of [α-P32] rCTP (Perkin Elmer) were added to the solution. The transcription reaction was allowed to progress for 30 min at 30 °C. The reaction was stopped with 175 μL of stop solution (0.3 M Tris-HCl pH 7.4, 0.3 M sodium acetate, 0.5% SDS, 2 mM EDTA, 3 μg/ml tRNA). RNA transcripts were cleaned by phenol/chloroform/isoamyl alcohol (25:24:1) extraction and ethanol precipitation. 
RNA transcripts were run on a 6% polyacrylamide/7 M urea denaturing gel, and the images were developed with phosphorimaging.
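As a worked check of the nucleotide concentrations in the transcription reaction, the sketch below computes nominal final concentrations. It assumes that the listed rNTP concentrations refer to the stock mixture, that the final volume is 23 + 1 + 1 = 25 μL, and that the bead volume and the labeled rCTP are negligible; none of these points is stated explicitly in the text.

```python
def final_micromolar(stock_mM, added_uL, total_uL):
    """Final concentration in micromolar after diluting a stock into the reaction."""
    return stock_mM * 1000.0 * added_uL / total_uL

# 23 uL of resuspended complexes + 1 uL rNTP mixture + 1 uL labeled rCTP.
total_uL = 23.0 + 1.0 + 1.0

# Assumed stock concentrations of the rNTP mixture (mM), taken from the text.
stocks_mM = {"rCTP": 0.4, "rATP": 10.0, "rUTP": 10.0, "rGTP": 10.0}

final_uM = {
    name: final_micromolar(conc, 1.0, total_uL)
    for name, conc in stocks_mM.items()
}
```

Under these assumptions the unlabeled rCTP is about 16 μM and each of the other three nucleotides is about 400 μM in the reaction.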
Structural probing analysis with mung bean nuclease
Oligonucleotides were designed for WT and two mutants (CGG and CGA) from the + 46- to the + 75-nt position of the HSPA1B non-template strand sequence, with the respective point mutations at the + 64-nt position. Each oligo was labeled on the 5′ end with T4 DNA kinase and [γ-P32] ATP. Labeled DNAs (16 ng) were first incubated in modified transcription buffer (13 mM HEPES pH 7.6, 60 mM KCl, 0.1 mM EDTA, 1 mM DTT, 7 mM MgCl2, 10 μM ZnCl2) for 30 min at 30 °C to allow DNA secondary structure formation, and then 2 U of mung bean nuclease were added to a final reaction volume of 20 μL. The reactions were allowed to proceed for 40 min at 30 °C, stopped with an equal volume of formamide loading buffer, boiled for 5 min, and analyzed on an 8% polyacrylamide/7 M urea denaturing gel.
High-throughput sequencing data used in this study were downloaded from the Gene Expression Omnibus (GSE numbers), the Sequence Read Archive (SRA), or the ENCODE project through the UCSC Genome Browser (GSE and wgEncode numbers): for HeLa-S3 cells, Pol II ChIP-seq (GSM935395, wgEncodeEH000613), Pol II ChIP-seq peak files (GSM935395, wgEncodeEH000613), GRO-seq (GSE62047), NET-seq (GSE61332), and mNET-seq (GSE60358); for A549 cells, Pol II ChIP-seq (GSM822288, wgEncodeEH002079) and Pol II ChIP-seq peak files (GSM822288, wgEncodeEH002079); for GM12878 cells, Pol II ChIP-seq (GSM935412, wgEncodeEH000626) and Pol II ChIP-seq peak files (GSM803355, wgEncodeEH001463); for H1-hESC cells, Pol II ChIP-seq (GSM822300, wgEncodeEH000563) and Pol II ChIP-seq peak files (GSM803366, wgEncodeEH001499); for K562 cells, Pol II ChIP-seq (GSM935358, wgEncodeEH000616) and Pol II ChIP-seq peak files (GSM803410, wgEncodeEH001633); for HCT116 cells, Pol II ChIP-seq (GSE60106); for human Ntera 2 cells, DRIPc-seq (GSE70189); for Raji cells, Pol II ChIP-seq (GSM935461, wgEncodeEH001761) and ssDNA-seq (SRA072844); for NHEK cells, Pol II ChIP-seq (GSE30226) and G4 ChIP-seq (GSE76688).
We would like to thank Alex Koeppel and Stephen Turner of the Bioinformatics Core Facility at the University of Virginia for bioinformatics support.
This work was supported by the NIH RO1GM101192 and NCI RO1CA113863 (YHW).
Availability of data and materials
The datasets generated are available in the additional files. The HSPA1B plasmids are available upon request.
Conceptualization: RT and YHW. Development of methodology: RT, LCTP, and KS. Performed experiments: NDA. Data analysis: KS. Writing, review, and editing: KS, NDA, SB, and YHW. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
LCTP is a full-time employee of Relay Therapeutics. All other authors declare that they have no competing interests.
- 11. Qin Y, Fortin JS, Tye D, Gleason-Guzman M, Brooks TA, Hurley LH. Molecular cloning of the human platelet-derived growth factor receptor beta (PDGFR-beta) promoter and drug targeting of the G-quadruplex-forming region to repress PDGFR-beta expression. Biochemistry. 2010;49:4208–19.
- 14. Ray BK, Dhar S, Shakya A, Ray A. Z-DNA-forming silencer in the first exon regulates human ADAM-12 gene expression. Proc Natl Acad Sci U S A. 2010;108:103–8.
- 36. Mayer A, di Iulio J, Maleri S, Eser U, Vierstra J, Reynolds A, Sandstrom R, Stamatoyannopoulos JA, Churchman LS. Native elongating transcript sequencing reveals human transcriptional activity at nucleotide resolution. Cell. 2015;161:541–54. Data set: NCBI GEO, 2015. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE61332.
- 37. Nojima T, Gomes T, Grosso AR, Kimura H, Dye MJ, Dhir S, Carmo-Fonseca M, Proudfoot NJ. Mammalian NET-Seq reveals genome-wide nascent transcription coupled to RNA processing. Cell. 2015;161:526–40. Data set: NCBI GEO, 2015. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60358.
- 38. Hansel-Hertsch R, Beraldi D, Lensing SV, Marsico G, Zyner K, Parry A, Di Antonio M, Pike J, Kimura H, Narita M, et al. G-quadruplex structures mark human regulatory chromatin. Nat Genet. 2016;48:1267–72. Data set: NCBI GEO, 2016. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76688.
- 39. Kouzine F, Wojtowicz D, Baranello L, Yamane A, Nelson S, Resch W, Kieffer-Kwon KR, Benham CJ, Casellas R, Przytycka TM, Levens D. Permanganate/S1 nuclease Footprinting reveals non-B DNA structures with regulatory potential across a mammalian genome. Cell Syst. 2017;4:344–356.e347. Data set: NCBI SRA, 2016. https://www.ncbi.nlm.nih.gov/sra/?term=SRA072844.
- 40. Sarafidou T, Kahl C, Martinez-Garay I, Mangelsdorf M, Gesk S, Baker E, Kokkinaki M, Talley P, Maltby EL, French L, et al. Folate-sensitive fragile site FRA10A is due to an expansion of a CGG repeat in a novel gene, FRA10AC1, encoding a nuclear protein. Genomics. 2004;84:69–81.
- 47. Sanz LA, Hartono SR, Lim YW, Steyaert S, Rajpurkar A, Ginno PA, Xu X, Chedin F. Prevalent, dynamic, and conserved R-loop structures associate with specific Epigenomic signatures in mammals. Mol Cell. 2016;63:167–78. Data set: NCBI GEO, 2016. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70189.
- 55. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511–5.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.