RNA secondary structure profiling in zebrafish reveals unique regulatory features
RNA is known to play diverse roles in gene regulation. The clues for this regulatory function of RNA are embedded in its ability to fold into intricate secondary and tertiary structure.
We report the transcriptome-wide RNA secondary structure in zebrafish at single nucleotide resolution using Parallel Analysis of RNA Structure (PARS). This study provides the secondary structure map of zebrafish coding and non-coding RNAs. The single nucleotide pairing probabilities of 54,083 distinct transcripts in the zebrafish genome were documented. We identified RNA secondary structural features embedded in functional units of zebrafish mRNAs. Translation start and stop sites were demarcated by weak structural signals. The coding regions were characterized by the three-nucleotide periodicity of secondary structure and display a codon base specific structural constrain. The splice sites of transcripts were also delineated by distinct signature signals. Relatively higher structural signals were observed at 3’ Untranslated Regions (UTRs) compared to Coding DNA Sequence (CDS) and 5’ UTRs. The 3′ ends of transcripts were also marked by unique structure signals. Secondary structural signals in long non-coding RNAs were also explored to better understand their molecular function.
Our study presents the first PARS-enabled transcriptome-wide secondary structure map of zebrafish, which documents pairing probability of RNA at single nucleotide precision. Our findings open avenues for exploring structural features in zebrafish RNAs and their influence on gene expression.
KeywordsPARS Zebrafish Transcriptome Gene regulation RNA secondary structure
Coding DNA Sequence
Parallel Analysis of RNA Structure
Selective 2’-Hydroxyl Acylation Analyzed by Primer Extension
Single Nucleotide Variation
RNA is a multitasking biomolecule, which not only acts as a messenger molecule to transfer genetic information from DNA to proteins, but also plays a vital role in regulation and catalysis of major biological reactions like transcription , post-transcriptional processing [2, 3] including splicing events, editing, degradation  and translation. In order to perform these processes, RNA adapts specific conformations owing to its ability to fold into secondary and tertiary structures . The nucleotide sequence of RNA is primarily responsible for the secondary structure, formed by Watson Crick base pairing within the polynucleotide backbone. Subsequently, the tertiary structure is governed by the secondary structure and several other interactions with biomolecules . The secondary structure of RNA is relatively stable and is present all throughout the length of mRNAs including CDS and UTRs . A large number of diverse secondary structural motifs in mRNAs have been studied extensively including riboswitches, IRES , AU-rich, localisation elements  and structures that enhance transcription, alternative splicing and translation .
In addition to mRNAs, non-coding RNAs also display secondary structural features. The secondary structural elements in non-coding RNAs have been shown to regulate gene expression [9, 10, 11, 12] and orchestrate the process of protein production. Small non-coding RNAs such as microRNAs  are known to fold in a pre-defined stem with loop structure that aids in binding to protein complexes and small molecules. Furthermore, long non-coding RNAs (lncRNAs) represent a class of regulatory RNAs that are abundantly present in eukaryotic transcriptome [14, 15]. LncRNAs display less nucleotide sequence conservation across species ; however, the secondary structural core of lncRNAs are conserved by reciprocal base pair mutations [17, 18, 19, 20, 21, 22]. It is a well-known fact that structure and synteny conservation preserves the function of protein-coding mRNAs in species [23, 24, 25, 26, 27] separated by large evolutionary distances and this may also apply to non-coding RNAs. Therefore, understanding the secondary structure would be important for predicting the function of non-coding RNAs in general and lncRNAs in particular.
Zebrafish has been extensively used to study spatiotemporal expression profiles of genes including protein-coding genes [28, 29] and non-coding RNAs [30, 31, 32]. In recent years, several groups, including ours have documented spatiotemporal expression profiles of lncRNAs across early developmental stages [16, 33] and tissues in adult zebrafish [34, 35]. Amongst the lncRNAs discovered, only small fraction display nucleotide sequence conservation across different species. Majority of the lncRNAs do not display sequence conservation across evolutionary distances and this poses a significant hurdle for understanding their functional relevance. It is widely envisaged that the conserved structural features in non-coding RNAs especially lncRNAs may provide cues to conserved function across species [19, 22].
In this study, we probe the zebrafish transcriptome using Parallel Analysis of RNA Structure (PARS) [36, 37] to reveal the landscape of pairing probability at single nucleotide resolution. We undertook enzyme based probing of one day old zebrafish transcriptome using RNase V1 and S1 Nuclease to discover paired and unpaired nucleotide respectively. The enzyme cleaved fragments were subjected to next generation sequencing to yield pairing probability at single nucleotide resolution.
Sequence data generation and mapping
RNA-seq data production and alignment results for zebrafish poly (A) RNA reads
Total Mapped reads
Uniquely mapped reads to genome
Mapped reads to transcriptome
Transcripts with load > 1
The total sequencing reads generated from RNase V1 and S1 Nuclease cleaved fragments mapping to zebrafish transcriptome (n = 169 million) were aligned to 54,083 transcripts. Load score for all the transcripts were evaluated to check their abundance in the data. All the transcripts (n = 54,083) displayed a load ≥ 1 (Table 1).
PARS scores for all the respective positions (n = 18,375,999) cleaved by RNase V1 and S1 Nuclease were determined as per the formula described in Methods section. Normalisation constants Kv = 1.17 And Ks = 0.88 were used to normalise the read counts for every position.
The composite unique positions as obtained from RNase V1 and S1 Nuclease cleavage (n = 18,375,999) were aligned to 54,083 transcripts in zebrafish genome assembly (v79/Zv9). Amongst these transcripts, a total of 25,158 transcripts had positions represented by overlapping peaks in both the datasets with ratio score greater than one and 11,450 transcripts were termed as multi-conformation transcripts as they constituted at least five positions with overlapping peaks (Fig. 1b).
Position coverage i.e. positions with read starts across a transcript (generating from RNase V1/S1 Nuclease cleavages) was estimated. Out of 54,083 transcripts, 544 transcripts had more than 85% position represented by enzyme cleavage across the length of the transcript. While 7366 transcripts showed coverage from 40 to 85%, but a majority of transcripts 46,173 had coverage less than 40% (Fig. 1c). The biological distribution of 54,083 transcripts was determined. Majority of the transcripts (77%) were protein-coding, 16% were lncRNAs, 3% rRNAs, 1% NMD and rest 3% were labelled miscellaneous transcripts (Fig. 1d). Of these transcripts, those with more than 85% positions represented by enzyme cleavages were considered for further analysis. Further, the biotype of these transcripts is displayed in Fig. 1d with 90% of transcripts as protein coding, 7% lncRNAs, 3% NMD and less than 1% was rRNA.
PARS-enabled pairing probability at single nucleotide resolution reveals structural conservation of protein coding genes across species
Comparison of PARS derived structure with enzymatic footprinting
Prior to analysing PARS-enabled pairing probability in zebrafish we wanted to investigate the validity of PARS using an orthogonal technology. We chose in vitro enzymatic footprinting to validate pairing probability of ubiquitin c (NCBI Gene ID: 777,766), a candidate protein-coding gene. Ubiquitin c was chosen as it had 97% nucleotide positions represented by enzyme cleavage across the length of the transcript. Since UTRs are known to play important role in gene regulation owing to the secondary structural features , we chose the 3’ UTR of ubc for validation. In addition, short length of 105 bases was considered favourable for enzymatic footprinting.
Enzymatic probing followed by gel based footprinting of the ubc 3’UTR revealed differentially cleaved nucleotide positions (Fig. 3b). Only 40 positions were resolved in one gel. Higher molecular weight positions up to 68th nucleotide positions were resolved in separate gels. Relative structure signals for every nucleotide position as procured from enzymatic footprinting, PARS and the consensus between the two methods are represented as a heatmap (Fig. 3c). Out of the 68 positions resolved in footprinting gel, 28 positions (41%) match the pairing and unpairing possibilities as covered by PARS scores. In summary, conservative estimates of pairing probabilities as determined by PARS and enzymatic footprinting displayed modest concordance with each other.
Attributes of RNA secondary structure as determined by PARS based pairing probability across functional units of mRNAs
We also explored the pairing probability across splice sites of highly expressed mRNAs. The splice junctions (n = 2538) across the 451 transcripts were aligned. Average pairing probabilities of 25 nucleotide positions flanking the splice junctions were calculated (Fig. 4b). It was observed that the pairing probability of terminal dinucleotide at the 5′ exon are different relative to the rest of the positions of the transcript (p-value = 5.5 × 10− 28). Similarly, the pairing probability of the first dinucleotide at 3′ exon are different relative to the rest of the positions of the transcript (p-value = 1 × 10− 3). However, a comparison of the dinucleotides at the splice junctions displayed that terminal dinucleotides at 5′ exon are structurally flexible than first dinucleotides at 3′ exon (p-value = 2.5 × 10− 5). This pattern of secondary structure signal at the splice junctions is inversely correlated with GC content.
Additionally, we investigated the pairing probability across poly-A sites at 3’UTR across mRNAs. The transcripts (n = 451) were aligned at ends of 3’UTRs of transcripts and average pairing probability till 50 positions upstream were calculated (Fig. 4c). Positions from − 10 to − 30 showed low structure signals (p-value = 4.5 × 10− 18) relative to upstream region from − 30 to − 50.
The periodic structural pattern in CDS as observed in Fig. 4a was further investigated. Therefore, to decipher if the primary sequence codes for a structural pattern, periodicity of pairing probability in the CDS was tested. The periodicity was determined on first 100 CDS positions, first 100 3’-UTR and last 100 5’-UTR positions from average pairing probabilities of 451 transcripts. The highest amplitude was observed at a frequency of 0.33 i.e. periodicity of 3 bases in the CDS region. However, the UTRs have very low amplitude relative to CDS and no periodicity was seen at 3 bases as shown in Fig. 4d. When the CDS positions were binned in codons, every position in a codon had pairing probability significantly different from the other two positions (p-value = 1.702 × 10− 8) (Fig. 4e). The first position had the highest pairing probability compared to the second position, while third position had the least ability to be present in a paired conformation (p-value = 1.9 × 10− 7). This is repeated in a cycle of three, suggesting that similar to primary sequence, pairing probabilities in a codon also display a pattern.
Furthermore, these 451 transcripts were categorised in 40 classes of gene ontology based on their molecular function to survey any similar structural features within the same Gene Ontology (GO) class. Out of forty, only ten classes showed significantly similar pattern of pairing probability within a category. A heatmap of these ten classes in Fig. 4f displays four classes, which are over structured with respect to the mean PARS scores of the total 451 transcripts. Genes with ‘structural molecule activity’, ‘structural constituent of ribosomes’ and ‘NADH dehydrogenase activity’ are highly structured in CDS relative to UTRs. However, ‘rRNA binding’ genes have higher structure in 3’UTR than CDS. On the contrary, ‘translation factors’ and ‘organic cycle compound binding’ have negligible secondary structural features in UTRs compared to CDS. Genes with ‘nucleic acid binding’ have lower pairing probability in CDS and 5’UTR than 3’ UTR. Secondary structures were present in 5’UTR region of genes with ‘unfolded protein activity’ with single stranded features in 3’UTR. Thus, the ten classes of gene ontology displayed similar secondary structural pattern amongst transcripts of the same group, across CDS and UTRs.
PARS-enabled pairing probability at single nucleotide resolution reveals secondary structures of candidate non-coding RNAs in zebrafish
Tie1-as, is an antisense lncRNA to the protein coding gene tyrosine kinase containing immunoglobulin and epidermal growth factor homology domain-1 (tie-1) . Like several other antisense RNAs, tie-1as also binds to tie-1 RNA and regulates its levels. Keguo and co-workers have shown the hybrid structure of tie1 and tie1-as by computational predictions. PARS assisted structure probing at single nucleotide resolution (Fig. 6c) may aid in better understanding of this hybrid. Out of 819 positions of tie1-as, 803 are captured by PARS assay. Of these, 449 positions are unpaired and 354 positions are paired. The RNAfold predicted structure of tie1-as shows 53% concordance with PARS assisted structure in the binding region of this lncRNA (Fig. 6d).
Transcriptome-wide single nucleotide resolved secondary structure map of zebrafish
Multiple genome scale sequencing projects have highlighted that a large proportion of the genome actively contributes towards the transcriptome, most of which is engaged in regulatory activities. In order to gain insights into the functional aspect of the regulatory transcriptome, it is important to understand their ability to interact with other biomolecules by the virtue of their structure. The primary information for intramolecular pairing is embedded in the RNA sequence . The Watson-Crick base pair driven secondary structure thereby provides a template for the RNA to fold upon itself and sets the stage for long range interactions. Therefore, it is important to uncover the hidden layer of information in the RNA secondary structure, to further understand the tertiary interactions. Conventional gel based methods of assessing RNA secondary structure focused on single RNA species in isolation. However, it is well-known that the structures are influenced in presence of a heterogeneous pool of transcripts. Lately, several RNA probing methods [36, 37, 41, 42, 43, 44] are coupled with high throughput sequencing to decipher the transcriptome-wide secondary structure in diverse organisms. Amongst these, PARS has been applied to yeast and human transcriptomes to determine pairing probability at single nucleotide resolution. Zebrafish has emerged as a model organism to study various biological processes including human diseases. Understanding RNA secondary structure organisation and its features across functional units of transcripts would provide insights into the functioning of transcripts in zebrafish.
Pairing probabilities obtained from RNase V1 and S1 Nuclease cleavage of zebrafish transcriptome suggested that the transcriptome was equally paired and unpaired. However, the structure maps of other eukaryotic transcriptome such as, A. thaliana  revealed that a larger part of the transcriptome was unpaired. In the zebrafish transcriptome, the RNase V1 and S1 Nuclease enabled cleavages displayed distinct signatures with only 4% overlap of the structure signals, hinting to the ambiguous nature of the pairing probabilities at these nucleotides. Previously, structure profiling of the yeast transcriptome reported 7% nucleotide positions , while that of human transcriptome constituted 3.7% nucleotide positions to have ambiguous pairing probability respectively .
Out of the total transcripts (n = 54,083), a majority (77%) mapped to protein coding genes, while lncRNAs were relatively low (16%). This could be possible as lncRNAs possess a lower expression than protein-coding genes, and hence are not sequenced at a greater depth. This was similar to findings from PARS probed human transcriptome .
The well-known protein-coding gene (rpl35) in zebrafish was found to possess sequence and structure homology with its human orthologue Rpl35. We also studied another conserved gene - nucleoplasmin 1a (npm1a, 852 bases, ZFIN:ZDB-GENE-021028-1). This gene showed 62% sequence homology in the CDS region with its human ortholog NPM (NCBI ID: 4869). We observed an overall structure conservation of 329 bases (38%) out of which 197 positions (60%) were also conserved at the sequence level. In summary, we observed that although npm1a has similar sequence conservation as rpl35 with its human ortholog, the structure conservation does not follow any specific trend. Given that the extent of over-structuring or under-structuring is enriched for specific gene ontology categories in a species as shown in Fig. 4f, the same might hold true across species and may not always be simply a function of the sequence conservation.
Regulation of gene expression at post-transcriptional and translational levels is governed by the functional units of mRNA such as translation start and stop sites, CDS, splice sites, UTRs and poly-adenylation sites. Distinct RNA secondary structure features in zebrafish transcriptome were observed corresponding to the different functional units of highly expressed protein coding transcripts. A sharp decrease in the PARS scores was noticed at translational start and stop signals. This may suggest ribosome accessibility to coding regions and initiation of translation. Similar features were also observed in other transcriptomes studied such as humans , yeast  and Arabidopsis . This is in accordance to the presence of IRES (internal ribosome entry sites) present at translational start signals in eukaryotes  which are structured elements. As observed in our study, structure signals with low pairing probability at translational stop sites were also reported in humans  and yeast .
Zebrafish CDS region had higher pairing probability than 5’-UTRs but displays lesser pairing probability than 3’-UTRs. However, structure signals in the CDS of yeast transcriptome showed higher pairing probability compared to UTRs . The three base periodicity observed in the CDS region of zebrafish transcriptome was correlated to that observed in yeast  and Arabidopsis  suggesting a common universal regulatory feature in translating regions in eukaryotes. This was similar to the three base sequence periodicity in coding exons of DNA . There have been reports suggesting (RNY)n sequence periodicity in CDS of various genomes. The pattern of structure signals within a zebrafish codon was also consistent, such that every first base in a codon had the highest pairing probability suggesting structural constraints relative to the second base, while the last base has the lowest pairing probability suggesting structural flexibility. The structural constraint observed in the first position of the codon in zebrafish mRNAs might create a steric hindrance so that the subsequent codon positions have more steric flexibility. This is in contrast to what Kertesz et al. reported within the yeast, where the second base of the codon had highest pairing probability followed by third and first. The triplet periodicity in the protein coding regions of transcript suggests the translational efficiency of the transcripts . In our study, we observed a three base structure periodicity in the CDS, suggesting a righteous conformation for the occupancy of ribosomes. Additionally, as periodic structure signals are distinct in coding and non-coding UTRs, the structure signals can be employed to annotate the unknown regions in the zebrafish transcriptome.
Distinct structure signals were observed at the splice site junctions of the zebrafish transcriptome. The dinucleotides at the end of the 5’exon possess structure signals with low pairing probability and first dinucleotides of 3’exon have structure signals with higher pairing probability relative to the rest of the positions in a transcript. This was similar to the findings observed at splice junctions of mRNAs in human transcriptome .
The 5’UTR regions in zebrafish mRNAs on average possess lowest pairing probability compared to CDS and 3’-UTRs. This is also true for yeast , Arabidopsis (in vivo)  and mouse (in silico) . The 3’UTR regions display higher pairing probability compared to CDS on average, as 3’UTRs constitute several regulatory elements. Albeit, GC% of the bases rules the structure signals, the UTRs in zebrafish have lower GC content than CDS. Similar to the findings in humans, a low consensus between GC content and RNA secondary structure signals was observed . Moreover, the analysis of structure signals at poly-A sites revealed that A-rich elements (− 10 to − 30 nt) [49, 50, 51] endure structure signals with lower pairing probability relative to upstream stimulating element (USE), suggesting the accessibility of Cleavage and Polyadenylation Specificity Factor (CPSF) protein .
In extension to this, when the highly expressed protein coding transcripts were grouped on the basis of their molecular function, genes with functions of ‘structure molecule activity’ and ‘structural constituent’ of ribosomes such as rRNA had higher pairing probabilities across CDS and UTRs. While, genes with ‘translation factor activity’ had lower pairing probability in CDS than UTRs as they need to be actively translated and highly regulated by domains in UTRs. The presence of structured elements (high pairing probability) at 5’-UTRs signify regulation at translation level, whereas structures at 3’-UTRs represent post-transcriptional processing. Similarly, structure profiling of yeast transcriptome  and Arabidopsis transcriptome  revealed correlation between RNA secondary structure signals and biological function of mRNAs.
PARS scores generated for zebrafish non-coding RNAs were endorsed using known structure of human HOTAIR determined by chemically probing using SHAPE and DMS . PARS scores showed positive correlation with the other two derived structures, thereby confirming the broad utility of PARS for determining secondary structure of non-coding RNAs. The difference in consensus between PARS and the other two techniques could be due to the usage of different structure sensitive reagents in these studies. Nucleases possess steric hindrance in catalysing large structured elements. The efficiency of RNase V1 is limited by helix length whereas chemical probing reagents are much more specific due to smaller size. However, chemicals such as DMS can probe only unpaired adenine or cytosine, therefore not providing information about uracil or guanine. SHAPE, utilises reagents that interact with sterically flexible nucleotides, but can be carried out only for known sequences. In comparison, PARS an enzyme-based probing method, provides structure signals for every position in the transcript.
In the recent few years, non-coding RNAs especially lncRNAs have been extensively studied in zebrafish cataloguing lncRNA transcripts expressed in different developmental stages [16, 33] and adult tissues . Of these, 13–35% lncRNAs are overlapping to protein coding genes in sense or antisense direction . One of the earliest studied lncRNA in zebrafish was tie1-as, antisense to tyrosine kinase containing immunoglobulin and epidermal growth factor homology domain-1 (tie-1) . Similarly, y-rna is a small non-coding RNA, which interacts with DNA replication machinery at maternal to zygotic transition. Both of these non-coding transcripts play a pivotal role in the developmental stages of zebrafish. Therefore, in order to aid in investigating the function and binding partners of the transcripts and the mechanism of regulation, the secondary structural trends in these transcripts were visualised using PARS. This presents the first experimentally validated structures of non-coding RNAs in zebrafish. Furthermore, the single nucleotide resolved pairing probability map of zebrafish transcriptome could be evaluated to predict miRNA binding sites. Strong AGO binding sites display lower pairing probabilities at − 1 to 3 nt upstream of miRNA target sites . Pairing probabilities at the UTRs of transcripts can be availed to verify the miRNA target sites.
In addition, the single nucleotide pairing probabilities could also be utilised to identify riboSNitches , which are secondary structure elements that change in the presence of single nucleotide variations. Previous studies have documented approximately 15 million Single Nucleotide Variations (SNVs) in different strains of zebrafish [52, 53]. The regulation of gene expression by riboSNitches could be evaluated by studying the structure signals across these SNVs, which could provide insights on how SNVs contribute to modulating gene expression.
We present the first PARS-enabled secondary structure transcriptome map of zebrafish, which documents pairing probability of RNA at single nucleotide precision. This has facilitated the identification of unique structural patterns across functional units of mRNA. We also present the enzyme probed structures of selected regions of candidate non-coding RNAs such as tie1-as and y-rna in zebrafish. This study is not without limitation. Currently, PARS has been executed by folding the transcripts in-vitro with the consensus that most of the structure signals are embedded in the sequence. However, future studies may be carried out on native de-proteinised transcripts to see the extent to which in vivo structures deviate from the present ones. The technique can be further explored to determine RNA structure of full length candidate lncRNAs in zebrafish. This study provides a basal data to plan experiments for long range intra- and inter-molecular interactions of transcripts using Psoralen Analysis of RNA Interactions and Structures (PARIS) . This transcriptome-wide secondary structure map at single nucleotide resolution adds to the ever increasing genomics resources for zebrafish and would aid in improving our understanding of the zebrafish transcriptome.
Assam wild type (ASWT) strains of zebrafish maintained at CSIR-Institute of Genomics and Integrative Biology were used in this study . One-day old zebrafish embryos were collected in the eppendorf tubes and water was decanted. The vials were immediately snap frozen at − 80 °C and processed for RNA isolation. Approximately, 200 μg of total RNA was isolated from 24 hpf ASWT zebrafish (n = 250) using RNeasy kit (Qiagen, USA) as previously described . Poly-A RNA was enriched using oligo-dT Dynabeads (Life Technologies, USA) using manufacturer’s protocol to yield 4 μg of processed transcripts from the initial pool. RNA pool of Poly-A transcripts was divided into two parts (2 μg each) for individual catalysis by RNase V1 (Life Technologies, USA) and S1 nuclease (Thermo Fisher Scientific, USA). Likewise, five biological replicates were put using 1250 embryos in total.
Enzymatic probing of RNA
The resulting RNA pool was processed for PARS heated at 90 °C for 2 min followed by snap-chilling in ice (at 0–4 °C). Further, the poly-A pool was folded in RNA Structure Buffer (Life Technologies, USA) containing 10 mM Tris pH 7, 100 mM KCl, and 10 mM MgCl2 slowly from 4 °C to 28 °C for 25 min. Each 2 μg of Poly-A RNA was digested with 10 μl (0.000125 U) of RNase V1 (Life Technologies, USA) for 45 s and 10 μl (10,000 U) of S1 Nuclease (Thermo Fisher Scientific, USA) for 10 min to achieve single hit kinetics. Enzyme-cleaved fragments were further purified using equal volume (100 μl) of phenol: chloroform: isoamyl alcohol (Invitrogen, USA) at 13,000 rpm (4 °C) for 10 min. RNase V1 cleaved RNA pool in the top aqueous layer was extracted and 20 μl of Inactivation/Precipitation buffer (Life Technologies, USA) was added, followed by 1 h (h) incubation at -80 °C. Further, 20 μl of 3 M sodium acetate, 1 μl of glycogen and 300 μl of cold ethanol was added to precipitate RNA at -80 °C for 1 h. S1 Nuclease cleaved RNA pool in the top aqueous layer was extracted and 20 μl of 3 M sodium acetate, 1 μl of glycogen and 300 μl of cold ethanol was added to precipitate RNA at -80 °C for 1 h. Further, the purified RNase V1/S1 Nuclease digested samples were fragmented by alkaline hydrolysis buffer (Life Technologies, USA) containing 500 mM Sodium bicarbonate at 95 °C, for 1.5 min which generated 3’phosphate groups at the enzyme cleaved fragments. The reaction was stopped using 2 μl of 3 M sodium acetate, and further precipitated using 2 μl of glycogen and 300 μl of 100% ethanol at -80 °C for 3–4 h. A detailed schematic of Parallel Analysis of RNA Structure is shown in Fig. 8.
Preparation of library and sequencing
Purified enzyme-cleaved RNA from 24 hpf zebrafish was adapter-ligated using Ion total RNA-Seq kit v2 (Life Technologies, USA) using manufacturer’s protocol. The phosphate groups from the 3′-ends of the alkaline hydrolysed enzyme-cleaved fragments were removed by Antarctic phosphatase treatment to generate 3’-OH ends for adapter ligation. The 5′ adapter ligated products were treated with 5 μl of 10× Antarctic phosphatase buffer (NEB), 2.5 μl of Superasin RNase inhibitor (Life Technologies, USA) and 2.5 μl of Antarctic phosphatase enzyme (NEB). The volume was made to 50 μl using nuclease free water (Ambion, USA). This was followed by adapter ligation at 3’OH ends generated after phosphatase treatment. The adapter ligated products were reverse transcribed to obtain cDNA and amplified by PCR to generate the sequencing library. At each step, purification was carried out using nucleic acid beads enrichment protocol compatible with standard sequencing techniques. The libraries were sequenced on Ion Proton platform (Life Technologies, CA, US) employing semiconductor based chemistry after quality check to generate single end reads.
Sequencing was done for five technical replicates each for RNase V1 and S1 Nuclease probed samples (Additional file 1: Table S1).The raw single-end reads generated by Ion Proton sequencing were trimmed with BWA algorithm at a threshold of Q13 (p-value = 0.05) and length-sorted with a threshold of 25 nucleotides as implemented by SolexaQA version 2.2 . The pre-processed reads were mapped back to the zebrafish transcriptome assembly downloaded from Ensembl (v79, Zv9) comprising of 56,754 transcripts (33,737 genes) using a two-stepped approach involving STAR aligner  and Bowtie2  as prescribed by Life technologies (http://www.thermofisher.com/order/catalog/product/4476610?ICID=search-product&CID=fl-ion-proton-docs). First, the reads were mapped using STAR (with parameters --outReadsUnmapped Fastx --outSAMstrandField intronMotif). The unmapped reads obtained were then aligned locally using Bowtie2 (with parameters --local --no-unal -k 10). The bam files obtained from the above steps were merged using Samtools . In order to select only uniquely mapped reads, those reads mapping more than once in the zebrafish reference genome (Zv9/danRer7) were removed. Aligned reads with erroneous read starts (5′ ends) were further removed to retain only high-confidence reads with perfectly aligned read starts.
Calculation of load, position coverage and ratio scores
Load for each transcript was estimated by the total number of reads mapping to the transcript relative to the effective (mapped) length of the transcript. Load score determines the transcript abundance in the sample . The transcripts with load ≥ 1 (atleast one read per base) were considered.
The position coverage for every transcript was also computed by summing the total number of positions with read starts obtained from both RNase V1 and S1 Nuclease data relative to the length of the transcript. The read starts define the enzymatic cleavage for the respective position and the pairing probability of the prior nucleotide as the RNase V1 and S1 Nuclease enzymes cleave at 3′ phosphodiester bond of the paired and unpaired position respectively. The transcripts that had load score of more than one and at least 85% positions covered with read starts were considered for further analysis.
Ratio score for every position in each of the RNase V1 (henceforth represented as V1 dataset) and S1 Nuclease (henceforth represented as S1 dataset) datasets were calculated by read start coverage for each nucleotide relative to the load of the transcript. Only those positions, which have more than one ratio score, were called as peaks. A peak could be present only in one of the dataset (V1 or S1) and confirmed that a position can be either paired or unpaired. If a position displayed a peak in both the (V1 and S1) datasets, it was termed as overlapping peak and corresponded to dynamic regions with multi-conformations, which were not able to acquire a stable structure. Any transcript with more than five such positions was termed as a multi-conformation transcript.
Calculation of parallel analysis of RNA structure (PARS) scores
The number of reads initiating at every position in the transcriptome were calculated in both V1 and S1 datasets. Normalisation constants were calculated for both the datasets as Kv and Ks as per the following formula:
Where i is any nucleotide position, PARS score for a position defines the pairing probability of the previous position.
In vitro synthesised transcript of ubiquitin C UTR was generated using T7 Megascript kit according to manufacturer’s instructions (Life Technologies, USA). The RNA was checked for single RNA species using 12% denaturing polyacrylamide gel electrophoresis (PAGE). The composition of the gel was 40% 29:1 acrylamide:bisacrylamide, 8 M urea, 133.5 mM TBE. The gel mix is polymerized using 10% APS and 0.05% TEMED. The RNA products were visualized by UV shadowing and eluted from the gel using RNA elution buffer containing 300 mM Sodium acetate + 1 mM EDTA .
Gel purified RNA was radiolabelled using the Kinase max kit (Life Technologies, USA) as per the protocol provided by the manufacturer. Briefly, the RNA was dephosphorylated using Calf Intestinal phosphatase (CIP) and purified using Phosphatase Removal Reagent (PRR). The transcript was further incubated with T4 polynucleotide kinase and [γ-32P] ATP (BARC, India) for overnight at 37 °C . The labelled RNA was further purified using NucAway columns (Life Technologies, USA).
5′-end-radiolabelled RNA (50,000 counts per lane) was added to 1 μg of unlabelled zebrafish RNA and was subjected to heating at 90 °C for 5 min and then allowed to cool to 28 °C in RNA structure buffer (Life Technologies) and 5 mM MgCl2 for overnight to facilitate structure formation. Folded RNA was subjected to digestion with RNase V1 (1:1600 U for 45 s) and S1 Nuclease (1:100 U for 1 min) at 28 °C respectively. RNA ladder for G residues was obtained by digesting the RNA with 1 U of RNase T1 (Fermentas) in the presence of 1 M LiCl and 100 mM MgCl2 for 2.5 min at 37 °C. Alkaline hydrolysis of RNA was performed at 90 °C in 0.5 M sodium bicarbonate buffer for 8 min. All reactions were stopped using equal volumes of gel loading buffer II (Life Technologies, USA) containing 95% formamide and 18 mM EDTA and snap-chilled on ice. Equal counts of digested products were separated on a 12% denaturing gel in 0.5× Tris-borate EDTA buffer and exposed to a phosphorimager screen. The gel images were scanned on a Typhoon scanner (GE Healthcare). Cleavage profiles were visualised using ImageQuant 5.2 software (GE Healthcare).
Validation of PARS in zebrafish using rpl35
Several studies have highlighted that protein-coding genes are well conserved across species based on nucleotide sequence and function. The RNA secondary structures of such conserved genes are also known to be preserved. We tested the validity of PARS based pairing probability in zebrafish using a well-conserved protein-coding gene across human and zebrafish. Pairing probability of rpl35 (ribosomal protein large subunit 35), a candidate gene encoding the protein component of 60S ribosome subunit was compared with its human homolog.
Region-wise RNA structures across enriched gene ontology terms
Gene ontology enrichment analysis was performed for the transcripts showing position coverage of least 85% of the transcript length (209 genes) using WebGEStalt . The enrichment analysis was performed for Molecular Functions gene ontology using default statistical test options and a significance level threshold of 0.05. In order to assess the extent of over-structuring or under-structuring of the RNA within the UTRs and CDS of the transcripts belonging to the enriched GO terms, we employed single sample Wilcoxon rank sum test (with mu = average PARS score for the respective regions - 5’-UTR, CDS and 3’-UTR). The resulting significant p-values were plotted as a heatmap for further inference.
Periodicity pattern in coding regions of mRNAs
Periodicity across CDS regions was determined using Discrete Fourier Transform analysis . PARS scores across first 100 positions in the CDS of 451 transcripts were averaged and were checked for periodicity.
Codon-wise pairing probability
Average PARS scores of first 100 CDS positions were separated into 33 codons. Anova Test (not assuming equal variances) was used to affirm that the PARS score for every position in a codon differs from the rest of the two positions. If significant, pairwise comparisons of PARS scores using t-test with pooled Standard Deviation (SD) was performed.
Structure probing of non-coding RNAs
Human HOTAIR  transcript with well-defined RNA secondary structure  was employed (approx. 1 picomole) as a positive control. HOTAIR was folded and probed at 37 °C with 10 μl of (1:1600 U) of RNase V1 for 45 s and 10 μl of (10,000 U) S1 Nuclease for 5 min. Similarly, RNA structures of two zebrafish candidate non-coding RNAs viz. y-rna and tie-1as were elucidated using PARS. Approximately, one picomole of the above mentioned in-vitro synthesised transcripts were pooled with zebrafish poly-A RNA to constitute 2 μg of the starting material. They were enzymatically probed by both RNase V1 and S1 nuclease as mentioned above and RNA libraries were prepared for sequencing. The data analysis was performed for the candidate ncRNAs using the pipeline described previously. The PARS scores were computed for each ncRNA using the method described in the above section. The oligo sequences used in the study are provided in Additional file 1: Table S2.
We thank members of the zebrafish facility of CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB) for the excellent maintenance of zebrafish. Authors would like to acknowledge Dr. Hemant Gautam for radioactivity facility at CSIR-IGIB. We also thank the NGS and supercomputer facility at CSIR-IGIB.
This work was supported by the Council of Scientific and Industrial Research (CSIR), India [Project code BSC0123]. KK and SP acknowledge CSIR Senior Research Fellowship. TS is thankful to the support provided by the Wellcome Trust/DBT India Alliance grant IA/CPHE/14/1/501504.
Availability of data and materials
The sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) under accession SRR3521405 to SRR3521419.
KK, SM, VS and SSB conceived and supervised the entire study. KK standardized the assays and performed molecular biology experiments for the study. SKV, AV and RJ conducted the RNA sequencing and performed the quality control experiments. KK and SP conducted footprinting experiments for the specific transcripts and along with SM interpreted the findings. AS, KK and TS performed the entire bioinformatics and statistical analysis for the study. All authors were involved in drafting of the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Fish experiments were performed in strict accordance with the recommendations and guidelines laid down by the CSIR Institute of Genomics and Integrative Biology, India. The protocol was approved by the Institutional Animal Ethics Committee (IAEC) of the CSIR Institute of Genomics and Integrative Biology, India. All efforts were made to minimize animal suffering.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 15.Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, Knowles DG, et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res. 2012;22(9):1775–89.CrossRefPubMedPubMedCentralGoogle Scholar
- 41.Underwood JG, Uzilov AV, Katzman S, Onodera CS, Mainzer JE, Mathews DH, Lowe TM, Salama SR, Haussler D. FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nature. 2010;7:995–1001.Google Scholar
- 42.Lucks JB, Mortimer SA, Trapnell C, Luo S, Aviran S, Schroth GP, Pachter L, Doudna JA, Arkin AP. Multiplexed RNA structure characterization with selective 2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc Natl Acad Sci U S A. 2011;108(27):11063–8.CrossRefPubMedPubMedCentralGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.