Long noncoding RNA repertoire in chicken liver and adipose tissue
Improving functional annotation of the chicken genome is a key challenge in bridging the gap between genotype and phenotype. Among all transcribed regions, long noncoding RNAs (lncRNAs) are a major component of the transcriptome and its regulation, and whole-transcriptome sequencing (RNA-Seq) has greatly improved their identification and characterization. We performed an extensive profiling of the lncRNA transcriptome in the chicken liver and adipose tissue by RNA-Seq. We focused on these two tissues because of their importance in various economical traits for which energy storage and mobilization play key roles and also because of their high cell homogeneity. To predict lncRNAs, we used a recently developed tool called FEELnc, which also classifies them with respect to their distance and strand orientation to the closest protein-coding genes. Moreover, to confidently identify the genes/transcripts expressed in each tissue (a complex task for weakly expressed molecules such as lncRNAs), we probed a particularly large number of biological replicates (16 per tissue) compared to common multi-tissue studies with a larger set of tissues but less sampling.
We predicted 2193 lncRNA genes, among which 1670 were robustly expressed across replicates in the liver and/or adipose tissue and which were classified into 1493 intergenic and 177 intragenic lncRNAs located between and within protein-coding genes, respectively. We observed similar structural features between chickens and mammals, with strong synteny conservation but without sequence conservation. As previously reported, we confirm that lncRNAs have a lower and more tissue-specific expression than mRNAs. Finally, we showed that adjacent lncRNA-mRNA genes in divergent orientation have a higher co-expression level when separated by less than 1 kb compared to more distant divergent pairs. Among these, we highlighted for the first time a novel lncRNA candidate involved in lipid metabolism, lnc_DHCR24, which is highly correlated with the DHCR24 gene that encodes a key enzyme of cholesterol biosynthesis.
We provide a comprehensive lncRNA repertoire in the chicken liver and adipose tissue, which shows interesting patterns of co-expression between mRNAs and lncRNAs. It contributes to improving the structural and functional annotation of the chicken genome and provides a basis for further studies on energy storage and mobilization traits in the chicken.
Keywords: Noncoding RNAs; Chicken Genome; Abdominal Adipose Tissue; lncRNA Gene; lncRNA Transcript
Long noncoding RNAs (lncRNAs) are commonly defined as transcripts that are often spliced, capped and polyadenylated but have little or no protein-coding potential. Genome-wide transcriptional studies carried out by ENCODE (Encyclopedia of DNA Elements) and other large international consortia have revealed that more than 60% of mammalian genomes are transcribed and that a large fraction of the transcripts is represented by lncRNAs [1, 2, 3, 4, 5]. Among these studies, the GENCODE consortium has collated a comprehensive set of human lncRNAs and analyzed their genomic organization, modifications, cellular locations and tissue expression profiles in different human cell lines.
Since 2012, the number of lncRNAs identified by RNA-Seq in tumor biopsy samples, normal tissues, and cell lines has increased steeply and continuously: GENCODE (version 24) references 15,941 lncRNA genes (28,031 transcripts), compared with 19,815 protein-coding genes, and Iyer et al. reported more than 50,000 lncRNA genes. These lncRNAs are associated with multiple biological processes such as development, cell differentiation or pathologies [9, 10, 11]. However, reliable and comprehensive genomic annotations of lncRNAs are not available for many species, including livestock and crop species.
In this context, it is important to annotate this major fraction of the transcriptome in livestock species, for which several loci involved in complex and economically relevant traits [i.e., quantitative trait loci (QTL)] have been described but with limited success regarding the identification of the underlying causative mutation(s). Given that approximately 80% of the variants associated with human complex traits map outside of protein-coding exons, of which 40% are in intergenic regions [12, 13], identifying the lncRNA repertoire is crucial to better understand the “genotype to phenotype” relationships in livestock [14, 15]. To date, few lncRNA studies have been reported for livestock species, apart from studies in bovine and trout, and the construction of multi-species databases such as NONCODE [18, 19] and the domestic-animal lncRNA database (ALDB) [20, 21]. Research programs are in progress on several farm species, e.g., in projects conducted within the framework of the Functional Annotation of Animal Genomes (FAANG) initiative [14, 15].
Different methodologies have been described to discover and model lncRNAs. This generates some variability in the number of putative lncRNAs reported and stresses the importance of precisely defining the tools and thresholds for each analysis step. Regarding lncRNA modeling, the FEELnc program (FlExible Extraction of Long noncoding RNAs), developed by Wucher et al. [22, 23], distinguishes lncRNAs from mRNAs based on a machine-learning method that estimates a protein-coding score according to different criteria such as the RNA size, ORF coverage and multi k-mer usage. One main advantage of the FEELnc program is its ability to derive an automatically computed cut-off that maximizes the lncRNA prediction sensitivity and specificity. In addition, and contrary to other tools such as CPC or CPAT, FEELnc classifies lncRNAs by their genomic position with respect to a pre-defined set of reference genes (usually protein-coding genes), which makes it possible to distinguish intergenic from intragenic lncRNAs and to sub-classify them according to their orientation with respect to these reference genes. Such a classification can be useful to formulate hypotheses about co-expression patterns observed between lncRNAs and their closest protein-coding genes.
In this context, our aim was to describe the chicken lncRNA repertoire. We focused on the liver and abdominal adipose tissues because of their importance in various economical traits for which energy storage and mobilization play key roles. The liver is a key organ for energy and lipid metabolism and homeostasis, and the adipose tissue plays a key role in lipid storage and mobilization when the organism is stressed or in transition phases. These two organs, through the regulation of the lipid metabolism (synthesis, storage and catabolism), are important for the bird’s adaptation to environmental changes [26, 27, 28]. Furthermore, both tissues are relatively homogeneous in cell composition. Both tissues were deeply sequenced (with an average of 100 million stranded paired-end reads per sample, totaling 1.65 billion per tissue) to capture weakly expressed lncRNAs and across a large number of biological replicates (16 birds per tissue) to obtain sufficient statistical power to assess correlations of expression levels between lncRNAs and their closest protein-coding RNAs.
In coordination with the FAANG initiative (FAANG Bioinformatics and Data Analysis subcommittee), we used a pipeline based on STAR, Cufflinks and FEELnc to describe and characterize a catalogue of expressed putative lncRNAs. We used two protein-coding score cut-offs (including a stringent one for lncRNAs) to partition our transcript set into lncRNAs, protein-coding RNAs and ambiguous RNAs (i.e., with intermediate protein-coding scores). We found approximately 2193 lncRNA genes (2979 transcripts), from which we extracted a reliable subset of 1670 genes (2412 transcripts) that were characterized by reproducible expression across the 16 replicates. We then compared their structure and expression levels to those of mouse and human lncRNAs. Using the FEELnc classification, we found interesting cases of co-expression between lncRNAs and their closest coding mRNAs, especially for pairs in divergent or antisense orientations. Overall, we provide a powerful and deeply characterized resource for investigating lncRNA relevance in the chicken liver and adipose tissue.
Results and discussion
Chicken lncRNAs predicted by FEELnc and their structure and expression features
For the liver and adipose tissue samples (16 replicates per tissue), we obtained on average 100 million stranded, paired-end reads. We compared the efficiencies of the recently published Stringtie and the classical Cufflinks programs to predict transcripts from our sequencing data, providing the Ensembl annotation as a guide and starting from the same BAM files generated by STAR. The Cufflinks/Cuffmerge pipeline processed our dataset of 32 samples in approximately 79 h and generated 39,504 transcripts for 22,413 genes. Stringtie took less than 3 h but produced approximately 4 times more predictions (150,659 transcripts for 108,098 genes), which included a majority of mono-exonic models (68 vs. 11% for Cufflinks). The number and the structure of the transcript models found with Stringtie in our data were considerably larger than expected based on data from the literature . Thus, for this study, we used the more realistic models from Cufflinks/Cuffmerge. Finally, the STAR/Cufflinks/Cuffmerge pipeline applied to our 32 samples resulted in a more than two-fold increase in number of transcripts compared to that reported in the Ensembl V84.4 annotation on the reference GalGal4 genome, with 39,504 transcripts for 22,413 genes compared to the 17,954 transcripts for 15,508 genes in the Ensembl annotation.
To evaluate the relevance of our chicken lncRNA set, we analyzed the gene expression profiles of the three classes “putative lncRNA transcripts”, “new mRNAs” and “ambiguous RNAs” and also compared the structural features of our lncRNAs with those of the mouse and human lncRNAs. As expected, the 2193 putative lncRNA genes are on average tenfold less expressed than the known or new protein-coding genes, and the ambiguous RNAs have an intermediate expression (Fig. 1b). This is in accordance with previous findings in mammals that showed that lncRNAs are far less expressed than protein-coding genes [6, 29, 30, 31]. Then, we characterized the structural features of these chicken putative lncRNA transcripts in comparison to the human and mouse lncRNAs available in Ensembl and compared them with the protein-coding RNAs available in Ensembl for these three species. Overall, the features observed for the chicken lncRNAs are consistent with those observed in mammals in the human and mouse ENCODE projects (Fig. 1c). First, regardless of the species analyzed, lncRNAs are spliced but with fewer exons than the protein-coding RNAs, with medians of 3 and at least 5, respectively. Second, the median exon length is similar for lncRNAs and protein-coding RNAs in chickens (127 ± 1 nt). This is similar to what was found in humans and mice, even if lncRNA exons are slightly longer than protein-coding exons in these species (for example, medians of 155 nt vs. 126 nt in humans, Wilcoxon–Mann–Whitney test, p value < 2.2 × 10^−16). Third, the lncRNA transcripts are shorter than the protein-coding transcripts in the chicken, as in humans and mice, because of the observed smaller number of exons. In the chicken, the median transcript length is 529 nt for lncRNAs, compared to 2067 nt for protein-coding RNAs (Wilcoxon–Mann–Whitney test, p value < 2.2 × 10^−16).
Finally, we observed a smaller number of isoforms per lncRNA gene in the three species compared to that of the protein-coding RNA genes, which was expected given that lncRNAs have a smaller number of exons.
In terms of the expression measured at the locus level (see the “Methods” section), the 2193 chicken lncRNA genes are characterized by at least one read in at least one replicate of one tissue (with 1958 in the liver and 2056 in the adipose tissue). To obtain a more reliable set of expressed lncRNAs, we took advantage of the large number of replicates to remove genes with low signals. Rau et al.  developed an R package (HTSfilter) for RNA-Seq data analysis to correctly filter out lowly-expressed genes and thereby increase the power of detection in the context of the differential expression of protein-coding genes. Unfortunately, this data-driven method (based on the Jaccard similarity index to calculate a filtering threshold) is not appropriate for lncRNAs because of their low expression level (see Additional file 1: Fig. S1). Therefore, we analyzed the reproducibility of the expression level across the 16 replicates of each tissue using the standard 0.1 FPKM-UQ threshold (see the “Methods” section). Figure 1d provides the numbers of long noncoding and protein-coding genes expressed according to the number of biological replicates for each tissue. Long noncoding genes show quite good reproducibility of expression across samples, with 1249 of them having an FPKM-UQ higher than 0.1 in at least 10 of the 16 samples in the liver, i.e., 64% of all hepatic lncRNA genes with one read in one sample (Fig. 1d, left). Note that 459 of the long noncoding genes (23%) have a poorly reproducible expression, with no more than four samples with an expression level higher than the threshold in the liver. Similar results were obtained for the adipose tissue (Fig. 1d, right), with 1215 lncRNA genes having an FPKM-UQ higher than 0.1 in at least 10 of the 16 samples. Combining these two sets of expressed lncRNAs results in 1670 genes. 
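The reproducibility filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the input layout (a dictionary of per-replicate FPKM-UQ values per gene) are assumptions, while the 0.1 FPKM-UQ threshold and the 10-of-16 replicate rule come from the text. The comparison uses "greater than or equal to", following the wording of the Methods section.

```python
# Thresholds taken from the text; helper names and data layout are hypothetical.
FPKM_UQ_THRESHOLD = 0.1
MIN_REPLICATES = 10


def is_expressed(values, threshold=FPKM_UQ_THRESHOLD, min_replicates=MIN_REPLICATES):
    """Return True if enough replicates reach the expression threshold."""
    return sum(v >= threshold for v in values) >= min_replicates


def reliable_genes(liver, adipose):
    """Union of the genes expressed in the liver and/or the adipose tissue.

    `liver` and `adipose` map gene IDs to lists of per-replicate FPKM-UQ values.
    """
    expressed_liver = {gene for gene, values in liver.items() if is_expressed(values)}
    expressed_adipose = {gene for gene, values in adipose.items() if is_expressed(values)}
    return expressed_liver | expressed_adipose
```

Applied to the two tissues, a union of this kind is what yields the combined set of reliably expressed genes; a gene only needs to pass the replicate rule in one tissue to be kept.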
All further analyses were performed with these 1670 reliable long noncoding genes (2412 transcripts) that were robustly expressed in the liver and/or adipose tissue. These numbers of long noncoding genes are consistent with other studies that focus on a single tissue, even if the number of replicates, the sequencing depth and the criteria used to consider that a long noncoding gene is expressed differ between studies. For example, Wang et al. reported 2805 lncRNA transcripts in the pig endometrium (using 12 porcine samples and 85–105 million reads per sample), and Billerey et al. reported approximately 1300 lncRNA transcripts in bovine muscle (using nine samples with 15 million to 45 million reads per sample). In contrast, multi-tissue studies reported a larger number of lncRNA transcripts, generally above 10,000, with a wide variation depending on the sequenced tissues and the tools used for the lncRNA detection: Koufariotis et al. reported 9778 lncRNA transcripts in 18 bovine tissues (using 1.87 million 120-bp stranded paired-end reads and the CPC/CNCI tools for lncRNA prediction [24, 35]), and Li et al. reported 20,163 lncRNA transcripts in 13 maize tissues (using 1.17 million 35- to 110-bp unstranded paired- and single-end reads and the CPC tool for lncRNA prediction).
Using the FEELnc classifier module, we then analyzed the class distribution of the 1670 reliable FEELnc lncRNA genes with respect to the annotated protein-coding genes from Ensembl (Fig. 1e). We found 1493 intergenic lncRNA genes (89%), which was the largest class, as reported in humans by Derrien et al., compared to 177 intragenic lncRNA genes (11%). These 1670 lncRNA genes, which are characterized by a good reproducibility of expression level in at least one of the two tissues and correspond to 2412 transcripts, were analyzed in more depth and are reported in Additional file 2: Table S1.
Distribution of LncRNAs across chicken macro- and micro-chromosomes
Conservation of lncRNAs between chicken and human genomes
LncRNAs are less expressed and more tissue-specific than mRNAs in the liver and adipose tissues
To evaluate the relevance of these tissue-specificity gene sets, we performed a GO term enrichment analysis for the protein-coding gene subsets with DAVID [41, 42] (see Additional file 3: Table S2). As expected, for the liver-specific protein-coding gene subset, we found an enriched GO term cluster related to lipid metabolism that was supported by well-known liver-specific genes such as those coding for hepatocyte nuclear factors (HNF1A, HNF4, NR1H4), apolipoproteins (APOB, APOA4) or enzymes involved in cholesterol catabolism and bile acid metabolism (CYP7a1, HSD3B7, SLCO1A2). For the adipose-specific protein-coding gene subset, an enriched GO term cluster related to development and morphogenesis was identified, which was supported in particular by several HOX genes involved in body fat mass control and obesity [43, 44]. This cluster of genes is likely related to the capacity of white adipose tissue to expand and differentiate. The four subsets of adipose- and liver-specific genes for long noncoding and protein-coding genes are in Additional file 4: Table S3.
Co-expression of LncRNAs and their nearest protein-coding genes
Long noncoding RNAs are emerging as new players in multiple mechanisms of cell machinery, including regulation of gene expression. Even if they can act over long distances to activate transcription at distal promoters , it has been demonstrated that they can also locally affect the gene expression of their neighboring protein-coding genes [11, 30, 46]. Concerning these “local” regulations leading to co-expression, we can distinguish genic lncRNAs that overlap protein-coding genes in an anti-sense orientation from intergenic lncRNAs in a divergent orientation with respect to their closest protein-coding genes. These latter lncRNAs may share a common bidirectional promoter with their closest protein-coding genes if the distance between them is less than a certain threshold, often fixed at 1 kb [47, 48, 49]. Hence, we evaluated the co-expression of each “lncRNA—nearest protein-coding RNA” pair across all the samples of each tissue according to two criteria: (1) the FEELnc classification, and (2) for the three intergenic lncRNA classes, a distance of less than 1 kb between the two genes considered. For some classes, we expected a larger number of significantly co-expressed pairs when the genes of a pair are closer together than when they are further apart, based on the hypothesis that a lncRNA is more likely to contribute to the regulation of a protein-coding gene if it is close to it.
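The pair-wise evaluation described above can be illustrated with a short Python sketch that splits lncRNA-mRNA pairs at 1 kb and counts correlated pairs in each distance class. It is a sketch under stated assumptions: the excerpt does not specify the correlation statistic or the significance test, so the Pearson coefficient and the `min_abs_r` cutoff used here are illustrative stand-ins, and the function names and input layout are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def summarize_pairs(pairs, max_distance=1000, min_abs_r=0.5):
    """Count correlated pairs among close (<= max_distance) and distant pairs.

    Each pair is (distance_bp, lncRNA_expression, mRNA_expression).
    Each summary entry is [correlated, total, positive, negative].
    `min_abs_r` is an illustrative cutoff, not the paper's significance test.
    """
    summary = {"close": [0, 0, 0, 0], "distant": [0, 0, 0, 0]}
    for distance, lnc, mrna in pairs:
        key = "close" if distance <= max_distance else "distant"
        r = pearson(lnc, mrna)
        summary[key][1] += 1  # total pairs in this distance class
        if abs(r) >= min_abs_r:
            summary[key][0] += 1  # correlated pairs
            summary[key][2 if r > 0 else 3] += 1  # positive or negative
    return summary
```

In practice, such a summary would be computed separately for each FEELnc class and each tissue, with a proper significance test in place of the fixed cutoff.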
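Table 1 Significant correlations between expression for lncRNA-mRNA and mRNA–mRNA pairs considering FEELnc classes and distance between genes
Cells give significantly correlated pairs/total pairs (%), then the numbers of positive/negative correlations: 51/91 (56%) +49/−2; 5/28 (18%) +3/−2; 23/105 (22%) +19/−4; 139/583 (24%) +127/−12; 13/166 (8%) +10/−3; 34/265 (13%) +27/−7
p values: 2.37 × 10^−9; 3.7 × 10^−2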
Regarding genic lncRNA-mRNA pairs, lncRNAs oriented in the antisense direction with respect to an exon or intron of a protein-coding gene are significantly co-expressed (22 and 13%, respectively) with the overlapping protein-coding gene (Table 1). Several cases of co-expression of genic lncRNA-mRNA pairs in an antisense orientation have been reported, and the modes of action of such lncRNAs on the regulation of mRNA loci are multiple and complex [11, 51, 52, 53, 54]. Strikingly, we found that most of the significant correlations between lncRNA and mRNA levels are positive (Table 1). Derrien et al. also reported a majority of positive co-expressions for lncRNA-mRNA pairs in an antisense orientation. The mechanisms that underlie such positive co-expression seem to be complex and act at distinct regulatory levels including the translation, splicing and transcription levels [55, 56, 57, 58].
In the same strand pair category, lncRNAs are more significantly correlated with their proximal protein-coding neighbors (≤1 kb) than with distant RNAs (56 vs. 24%, respectively) (Table 1). Most of these lncRNA genes probably have to be considered as an extension of the protein-coding gene, which implies that the Cufflinks/Cuffmerge procedure could not model full-length lncRNAs. Indeed, such a difference is not observed for the protein-coding gene pairs, considered as better characterized and used here as a control (28 and 22% for the two distance subsets) (Table 1).
Next, we focused on two lncRNA-mRNA pairs that were significantly correlated in the liver, i.e. one divergent pair and one exon antisense pair.
Specific cases of divergent and exonic antisense lncRNA-mRNA pairs that are significantly correlated in liver
Our aim was to identify pairs with a protein-coding gene involved in lipid metabolism, to be able to hypothesize a regulatory role of the lncRNA on its neighboring coding gene. Three long noncoding genes were previously described in mammals as being involved in lipid homeostasis: the liver-enriched lncLSTR, reported as a putative regulator of the plasma triglyceride level in mice; the lncRNA HULC, which is abnormally expressed in hepatocellular carcinoma cells and has been shown to increase the triglyceride and cholesterol levels in these cells; and the antisense lncRNA APOA1-AS, which was shown in humans and monkeys to negatively regulate the expression of APOA1 (a major component of high-density lipoprotein). Surprisingly, these long noncoding genes, absent from the Ensembl chicken V84 annotation, were not modeled with our RNA-Seq data, and a manual inspection using the Integrative Genomics Viewer confirmed that no reads were mapped at the putative genomic locus, contrary to the neighboring protein-coding genes (see Additional file 5: Figure S2). These results suggest that these long noncoding genes are either absent in the chicken genome or not systematically expressed in the liver, regardless of the age, sex and physiological state of the individuals.
For the set of antisense lncRNA-mRNA pairs, no mRNA was found to be clearly involved in lipid metabolism according to the literature. Therefore, we analyzed the co-expression of one pair related to the protein-coding gene NPNT, which was recently shown to play a role in the liver. For the set of divergent lncRNA-mRNA pairs, we focused on a lncRNA, not reported so far, that is paired with the DHCR24 gene, which encodes a key enzyme of cholesterol biosynthesis.
Exonic antisense lncNPNT-AS and NPNT protein-coding gene
DHCR24 and its divergent lncRNA
The tissue expression pattern is consistent with the physiological role of DHCR24 since it encodes the last enzyme necessary for cholesterol synthesis, with cholesterol being the precursor of steroid hormone biosynthesis. To our knowledge, such co-expression observed in different physiological conditions between DHCR24 and a divergent lncRNA has never been reported before; it suggests that the two members of this gene pair, which are in a divergent orientation and separated by a small distance between their transcription start sites (202 bp), share an active bidirectional promoter. Further experiments are required to determine if this promoter can initiate transcription in both directions. The strong co-expression that was observed in several experimental designs suggests a regulatory role of lnc_DHCR24 on DHCR24 expression and thereby on the biosynthesis of cholesterol. Similar to lncLSTR or APOA1-AS, lnc_DHCR24 thus constitutes a novel candidate gene to be added to the list of lncRNAs involved in lipid metabolism regulation.
Our study aimed at establishing a first repertoire of the lncRNAs in the chicken liver and adipose tissue, two tissues that are known to be important for energy homeostasis and lipid metabolism. We characterized this repertoire in terms of structure, expression and co-expression with respect to protein-coding genes, based on 16 biological replicates per tissue. In terms of structure, we observed a large subset of lncRNAs that were conserved by position between the chicken and human genomes but that were highly divergent at the nucleotide level. Although this latter observation was also reported in other studies [6, 17, 38, 64, 65, 66], complementary strategies could be considered for analyzing splice site sequence conservation . Nevertheless, this reinforces the question regarding the functional meaning of syntenic conservation in the absence of sequence conservation, which does not rule out the conservation of the secondary structures of lncRNA sequences. More specific to the chicken genome, lncRNAs have the same chromosomal distribution as protein-coding genes in terms of gene density and length, with more and shorter genes on the micro-chromosomes. In terms of expression, the chicken lncRNAs are less expressed and more tissue-specific than the protein-coding genes, as previously reported for human and murine lncRNAs, supporting the important role that is attributed to lncRNAs as regulatory elements involved in tissue-specific functions. In terms of co-expression, 22% of the antisense overlapping lncRNA-mRNA pairs are significantly and positively co-expressed, thus providing new candidate genes to investigate the mechanisms that underlie such regulations. We show that divergent lncRNA genes are more significantly co-expressed with their close (≤1 kb) protein-coding genes than with more distant genes, suggesting the existence of active bidirectional promoters in the chicken. 
In particular, the DHCR24 gene and its divergent lncRNA are highly co-expressed in various conditions in the liver, revealing a new lncRNA that might have an important role in the regulation of cholesterol synthesis.
Sample collection, RNA isolation and RNA sequencing
The liver and abdominal adipose tissue were extracted from 16 male chickens slaughtered at 9 weeks of age. Chickens were feed-deprived for 12 h and then fed again for 3 h before being euthanized by decapitation and bleeding. Immediately after slaughter, the liver and abdominal adipose tissue were removed, frozen in liquid nitrogen and then stored at −80 °C until the analyses.
Approximately 30 mg of liver and 100 mg of adipose tissue were homogenized in TRIzol reagent (Invitrogen, California, USA), and the total RNA was then extracted according to the manufacturer’s instructions, re-suspended in 50 µL of RNase-free water and stored at −80 °C. The total RNA was quantified with a NanoDrop® ND-1000 spectrophotometer (Thermo Scientific, Illkirch, France). A260/280 and A260/230 ratios were greater than 1.7 in all samples, ensuring the purity of the preparation. The RNA quality was verified using an Agilent 2100 Bioanalyzer (Agilent Technologies France, Massy, France). The average RNA integrity numbers were 8.65 ± 0.47 (mean ± SD) for the two tissues: 9.4 ± 0.5 for the liver and 8 ± 0.6 for the abdominal adipose tissue.
Stranded, paired-end sequencing (2 × 100 bp) was conducted on a HiSeq 2000 (Illumina) for 24 samples (16 liver and eight abdominal adipose tissue samples) and on a HiSeq 3000 (Illumina) for the eight remaining abdominal adipose tissue samples. Libraries with an average insert size of 230 bp were prepared following Illumina’s instructions by purifying poly-A RNAs (TruSeq RNA Sample Prep kit). Illumina adapters containing indexing tags were added for subsequent identification of samples. Samples were PCR-amplified, and quantitative PCR was then performed for library quantification (QPCR NGS Library Quantification kit). Each sample was distributed over two to five lanes of a flow cell to minimize the inter-lane bias. After sequencing, the samples were de-multiplexed, and the indexed adapter sequences were trimmed using CASAVA v1.8.2 software (Illumina). We obtained 101 million reads per sample on average (111 million reads for the liver and 92 million reads for the adipose tissue), with a total of 3.3 billion reads for the 32 samples.
Pre-processing steps on RNA-Seq data
Three billion reads from the RNA sequencing were mapped onto the chicken Galgal4 reference genome using STAR (v2.4.0i), and the PCR duplicates were removed for each RNA-Seq sample using the SAMtools rmdup tool (v0.1.19). All the data were merged into one BAM file with the merge tool (v1.1) from the SAMtools suite to create the input file used to model transcripts and genes. Gene modeling was performed with both Stringtie (v1.0.1) and Cufflinks (v2.2.1), using the Ensembl gene annotation file (release 82) as a reference. To compare the results, tests were conducted under the same conditions with 12 cores. The CPU was an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50 GHz. The counting step was performed by featureCounts (v1.4.5-p1) with standard options but using both the multi- and the mono-mapped read options. Note that separate “.bam” files (one per sample) including the PCR duplicates were used for this counting step. We obtained 2.418 billion mapped reads with the ‘no multi-mapping’ option and 2.487 billion reads with the ‘multi-mapping’ option. Therefore, only 2.8% of the mapped reads were multi-mapped, and these were discarded from further analyses. After completing all the filtering steps, we obtained an average number of mapped reads per sample of approximately 75 million overall (88 million and 63 million for the liver and adipose tissue, respectively). Each command line and input/output file used to run the different analyses are in Additional file 6.
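The 2.8% figure can be checked with a short calculation. The sketch below assumes that the denominator is the number of reads counted with the ‘multi-mapping’ option (2.487 billion), which is the reading that reproduces the reported percentage; the function name is hypothetical.

```python
def multimapped_fraction(unique_mapped, all_mapped):
    """Fraction of counted reads that are multi-mapped.

    `unique_mapped`: reads counted with the 'no multi-mapping' option.
    `all_mapped`: reads counted with the 'multi-mapping' option.
    """
    return (all_mapped - unique_mapped) / all_mapped


# Read totals quoted in the text.
fraction = multimapped_fraction(2.418e9, 2.487e9)
print(f"{100 * fraction:.1f}% of the counted reads are multi-mapped")
```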
Long noncoding RNA prediction
lncRNA annotation was performed by the FEELnc program (FlExible Extraction of Long noncoding RNAs, v.23/11/2015) [22, 23]. Briefly, FEELnc is an alignment-free program that uses multi k-mer frequency data and relaxed open reading frame (ORF) annotation as the main computational features/predictors to discriminate protein-coding from non-coding RNAs. These features are then used in a machine-learning algorithm (random forest) to compute a coding potential score (CPS) that discriminates between mRNAs and lncRNAs. In particular, the program can be self-trained with species-specific annotations, and it automatically defines the coding potential threshold that maximizes the classification performance (i.e., where the sensitivity equals the specificity). Once the FEELnc model is trained with the above predictors, it is applied to a set of novel transcript models (e.g., from Cufflinks or Stringtie) reconstructed after transcriptome sequencing to predict their protein-coding capacity. The description of the FEELnc program is accessible at bioRxiv, in which extensive benchmarking of the program in comparison with six other programs is presented based on the GENCODE human and mouse gold-standard datasets. FEELnc has three modules: “FEELnc_filter”, “FEELnc_codpot” and “FEELnc_classifier”. Using the first module, “FEELnc_filter”, we filtered out all transcripts whose exons overlapped, in the sense orientation, protein-coding exons or pseudogenes that are referenced in the chicken V78 Ensembl annotation. Note that the V78 Ensembl annotation is equivalent to the last V84.4 annotation for the chicken, with 15,508 coding genes and 17,954 coding transcripts. We also filtered out transcripts that were shorter than 200 bp, according to the commonly accepted definition of long noncoding RNAs.
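The idea of a cut-off at which the sensitivity equals the specificity can be illustrated with a simple threshold scan. This is a minimal sketch, not FEELnc's implementation: it assumes that a higher score means "more coding", that a transcript is called coding when its score reaches the cut-off, and the function name and grid resolution are hypothetical.

```python
def balanced_cutoff(coding_scores, noncoding_scores, steps=1000):
    """Scan cut-offs in [0, 1] and return the one where sensitivity and
    specificity are closest to each other.

    Sensitivity: fraction of known coding transcripts scored >= cut-off.
    Specificity: fraction of known noncoding transcripts scored < cut-off.
    """
    best_cutoff, best_gap = 0.0, float("inf")
    for i in range(steps + 1):
        cutoff = i / steps
        sensitivity = sum(s >= cutoff for s in coding_scores) / len(coding_scores)
        specificity = sum(s < cutoff for s in noncoding_scores) / len(noncoding_scores)
        gap = abs(sensitivity - specificity)
        if gap < best_gap:  # strict comparison keeps the lowest such cut-off
            best_cutoff, best_gap = cutoff, gap
    return best_cutoff
```

With well-separated training scores, any cut-off between the two score distributions gives a sensitivity and a specificity of 1; with overlapping distributions, the scan settles where the two error rates balance.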
The second module, “FEELnc_codpot”, separates putative long noncoding RNAs (lncRNAs) from protein-coding RNAs by first computing a coding potential score (CPS, ranging from 0 to 1) for each transcript and then computing a CPS cut-off that maximizes both the lncRNA sensitivity and specificity using a tenfold cross-validation according to the input training files. For the training set of protein-coding transcripts, we used the 15,508 known coding transcripts annotated by Ensembl. For the training set of long noncoding transcripts, we used both the 13,085 chicken putative transcripts from the NONCODEV5 database (v.2016) [18, 19] and a set of 11,000 genomic intergenic regions automatically extracted by FEELnc. Note that the lncRNA predictions of NONCODE are mainly based on the analysis of the Cufflinks gene models by the coding-non-coding index (CNCI) method. Here, the CPS calculation is based on ORF coverage, mRNA size and multi k-mer frequencies; for this latter criterion, we chose frequencies of 1-, 2-, 3-, 6-, 9- and 12-mers, and the optimal performance in terms of specificity for our training data was 0.96. FEELnc allows the user to increase the performance metrics to obtain high-confidence predictions of lncRNAs/mRNAs, although this option leads to the creation of an intermediate category of ambiguous coding/noncoding transcripts (TUCp). The third module, “FEELnc_classifier”, classifies each lncRNA with respect to its location and orientation compared to its closest annotated protein-coding genes. The two main classes are (1) the genic lncRNA class, corresponding to lncRNA transcripts that overlap a protein-coding gene, and (2) the intergenic lncRNA class, with three subtypes that are the divergent, convergent and same-strand sub-classes, as detailed on the FEELnc website and schematized in Fig. 1e. Each command line and input/output file used to run the different analyses are available in Additional file 6.
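The genic/intergenic distinction and the three intergenic sub-classes can be sketched as follows. This is a simplified illustration rather than the FEELnc_classifier code: it works on one lncRNA and one gene on the same chromosome, ignores the exonic/intronic and upstream/downstream refinements, and the `Feature` type and `classify` function are hypothetical names.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate (inclusive)
    strand: str  # "+" or "-"


def classify(lncrna, gene):
    """Classify a lncRNA relative to a protein-coding gene on the same chromosome."""
    if lncrna.start <= gene.end and gene.start <= lncrna.end:
        return "genic"  # the two features overlap
    if lncrna.strand == gene.strand:
        return "same_strand"
    # Opposite strands: look at the strand of the leftmost feature.
    left = lncrna if lncrna.end < gene.start else gene
    # Left feature on "-" and right feature on "+": transcribed away from each other.
    return "divergent" if left.strand == "-" else "convergent"
```

For example, a lncRNA on the minus strand lying just upstream of a plus-strand gene is classified as divergent, which is the configuration compatible with a shared bidirectional promoter.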
Comparison of our lncRNA set with the chicken lncRNAs from the NONCODE and ALDB databases
The multi-species NONCODE [18, 19] and ALDB [20, 21] databases contain 9343 and 6132 chicken lncRNAs, respectively, that are either intergenic or overlap a gene in antisense orientation. The exon coordinates of our chicken lncRNA set were compared to those of both databases using the “bedtools intersect” tool v.2.25.0. Two thresholds were used, i.e., 100% (stringent criterion) and 50% (relaxed criterion), which refer to the percentage of the lncRNA exon length in our dataset that matches an exon of the analyzed database. Because lncRNA models are imperfect, we considered that a lncRNA was present in both sets if at least one exon was shared by these sets.
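The matching rule can be sketched in Python as follows. This is only an illustration of the criterion, not the bedtools command actually run (the exact options are not given in this section); the function names and coordinates are hypothetical, intervals are assumed to be half-open, and strand is ignored for simplicity.

```python
def overlap_fraction(exon, db_exon):
    """Fraction of `exon` length covered by `db_exon`.
    Each exon is a (chrom, start, end) tuple with half-open coordinates."""
    if exon[0] != db_exon[0]:
        return 0.0
    overlap = min(exon[2], db_exon[2]) - max(exon[1], db_exon[1])
    return max(0, overlap) / (exon[2] - exon[1])

def is_shared(lncrna_exons, db_exons, min_fraction):
    """A lncRNA is considered present in the database if at least one of its
    exons is covered over at least `min_fraction` of its length by a database exon."""
    return any(overlap_fraction(e, d) >= min_fraction
               for e in lncrna_exons for d in db_exons)

# Toy lncRNA with two exons and one database exon (invented coordinates).
our_lncrna = [("chr1", 100, 300), ("chr1", 500, 600)]
database = [("chr1", 150, 400)]

relaxed = is_shared(our_lncrna, database, 0.5)    # 50% threshold
stringent = is_shared(our_lncrna, database, 1.0)  # 100% threshold
```

Here the first exon is covered over 75% of its length, so the lncRNA is shared under the relaxed criterion but not under the stringent one.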
Sequences of human lncRNA transcripts were downloaded from the GRCh38 Ensembl database, version 83. Sequence comparisons between our chicken FEELnc sequences and the human sequences were conducted using the BLAST software suite (blastn v2.4.0+, with a word size of 7). The thresholds used for the FEELnc and human transcript comparison were 50% for the query coverage and 70% for the identity percentage.
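As an illustration, the two thresholds could be applied to tabular blastn output with a few lines of Python. The sketch assumes a tab-separated output with the columns `qseqid`, `sseqid`, `pident` and `qcovs` (standard BLAST+ format specifiers); the column choice, the inclusive thresholds and the transcript identifiers are assumptions, not details taken from the study.

```python
def passes_thresholds(line, min_identity=70.0, min_query_coverage=50.0):
    """`line` is one tab-separated hit with columns:
    qseqid, sseqid, pident (percent identity), qcovs (percent query coverage)."""
    qseqid, sseqid, pident, qcovs = line.rstrip("\n").split("\t")[:4]
    return float(pident) >= min_identity and float(qcovs) >= min_query_coverage

# Toy hits with hypothetical identifiers.
hits = [
    "lnc_0001\tENST_A\t82.5\t64",   # passes both thresholds
    "lnc_0002\tENST_B\t91.0\t12",   # query coverage too low
    "lnc_0003\tENST_C\t65.3\t80",   # identity too low
]

# Chicken lncRNAs with at least one human hit satisfying both thresholds.
conserved = sorted({h.split("\t")[0] for h in hits if passes_thresholds(h)})
```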
A syntenic conservation analysis was performed for the lncRNA genes that were surrounded by two neighboring protein-coding genes with a 1-to-1 orthologous relationship with the human genome (Ensembl v.83, Biomart web-based tool [75, 76]). We considered that synteny was conserved for a lncRNA if a lncRNA was also found in the human genome (GRCh37) between the same two coding genes, with the same orientation and the same order. Note that no upper limit was set on the distance between the lncRNA and the nearest protein-coding genes, but most of the distances are between 6 nt (min) and 35,000 nt (third quartile).
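The synteny criterion can be sketched in Python as a comparison of ordered, stranded gene triplets. This is a simplified illustration, not the procedure used in the study: the locus representation, the gene names and the choice to accept the same arrangement read from the opposite strand are all assumptions.

```python
def flip(locus):
    """Read the same locus from the opposite strand: reverse the order, swap strands."""
    swap = {"+": "-", "-": "+"}
    return [(name, swap[strand]) for name, strand in reversed(locus)]

def is_syntenic(chicken_locus, human_locus, orthologs):
    """Each locus is [(left_gene, strand), ("lncRNA", strand), (right_gene, strand)]
    in genomic order. `orthologs` maps chicken genes to their 1-to-1 human orthologs.
    Synteny is called when order and relative orientation are preserved."""
    translated = [(orthologs.get(name, name), strand) for name, strand in chicken_locus]
    return human_locus in (translated, flip(translated))

# Hypothetical gene names.
orthologs = {"geneA_gg": "GENEA", "geneB_gg": "GENEB"}
chicken = [("geneA_gg", "+"), ("lncRNA", "-"), ("geneB_gg", "+")]

human_same = [("GENEA", "+"), ("lncRNA", "-"), ("GENEB", "+")]
human_inverted = [("GENEB", "-"), ("lncRNA", "+"), ("GENEA", "-")]
human_other_strand = [("GENEA", "+"), ("lncRNA", "+"), ("GENEB", "+")]
```

In this sketch, `human_same` and `human_inverted` are both counted as conserved, whereas `human_other_strand` is not because the lncRNA orientation relative to its neighbors differs.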
The raw counts for each gene were calculated by featureCounts at the gene (locus) level and normalized by the gene size and the total number of reads that mapped in the most highly expressed genes, as proposed in the upper quartile (UQ) method described by Bullard et al. The normalized counts are referred to as FPKM-UQ (FPKM for Fragments Per Kilobase Million; UQ for Upper Quartile). This method is particularly relevant because highly expressed genes are known to account for most of the reads and therefore to strongly influence the total read number, whereas they represent only a small fraction of the expressed genes. In our study, the top 10 and 25% most highly expressed genes represent 34 and 96% of the reads, respectively, in the liver, and 16 and 90% in the adipose tissue. Finally, a gene was considered as expressed in a tissue when at least 10 of the 16 samples per tissue had a FPKM-UQ greater than or equal to 0.1, a threshold often used in studies focusing on lncRNAs [6, 8, 38, 78]. In this study, such a threshold corresponds on average to eight and two reads for coding (1987 nt long) and long noncoding (494 nt long) transcripts, respectively. To determine this minimum number of samples (10 of 16) for defining a gene as expressed in one tissue, we analyzed the reproducibility of expression across the 16 biological replicates in each tissue (see the “Results” section and Fig. 1d). Moreover, to estimate the background signal and thereby justify the expression threshold of 0.1, we repeatedly sampled a set of genomic intervals with the same size distribution as that of our lncRNA loci and with no overlap with any gene (protein-coding or noncoding) using the “bedtools shuffle” command. We refer to this set as the “no-gene” set. We then counted the numbers of reads in these sets for the 16 liver replicates and transformed these read counts into FPKM-UQ (see Additional file 7: Fig. S3).
First, we observe that the third quartile is approximately 0.1 (on the left of Additional file 7: Fig. S3). Second, the distribution of the “no-gene” set that satisfied the FPKM-UQ threshold of 0.1 across the 16 replicates is very different from that observed for lncRNAs: only 8% of the loci satisfied our double criterion “at least 10 of the 16 samples had a FPKM-UQ greater than or equal to 0.1”. Thus, we conclude that our criteria allow us to distinguish expressed entities with a low but reproducible expression from noise with a lower and less reproducible signal.
For the tissue-specificity analysis, a gene expressed in one tissue was considered as not expressed in the other tissue if its expression was below the FPKM threshold of 0.1 in at least 12 of the 16 samples.
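The expression and tissue-specificity rules can be written as a short Python sketch. The function names and toy values are hypothetical; the thresholds (FPKM-UQ of 0.1, at least 10 of 16 samples for "expressed", at least 12 of 16 samples below the threshold for "not expressed") are those stated in the text.

```python
THRESHOLD = 0.1  # FPKM-UQ cut-off stated in the text

def n_above(values, threshold=THRESHOLD):
    """Number of samples with an expression value >= threshold."""
    return sum(v >= threshold for v in values)

def is_expressed(values, min_samples=10):
    """Expressed: at least `min_samples` replicates reach the threshold."""
    return n_above(values) >= min_samples

def is_not_expressed(values, min_samples_below=12):
    """Not expressed: at least `min_samples_below` replicates are below the threshold."""
    return len(values) - n_above(values) >= min_samples_below

def tissue_specific(values_a, values_b):
    """True if the gene is expressed in tissue A and not expressed in tissue B."""
    return is_expressed(values_a) and is_not_expressed(values_b)

# Toy FPKM-UQ values for 16 replicates per tissue (invented).
liver = [0.5] * 11 + [0.0] * 5
adipose_silent = [0.0] * 13 + [0.3] * 3
adipose_ambiguous = [0.0] * 9 + [0.3] * 7
```

Note that the two rules leave an intermediate zone: a gene such as `adipose_ambiguous`, with 7 of 16 replicates at or above the threshold, is classified neither as expressed nor as not expressed.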
A lncRNA/protein-coding RNA pair was considered as significantly correlated in a tissue across the 16 replicates when the correlation p value was lower than or equal to 0.1 after correction for multiple testing by the Benjamini–Hochberg method. Pearson correlations were calculated using the log10(FPKM-UQ). For all expressed gene pairs, we considered the highest correlation among those calculated for either liver or adipose tissue. To replicate the analyses with “coding–coding” pairs, we reconstituted “coding–coding” pairs for the divergent, convergent and same-strand classes in accordance with the FEELnc nomenclature.
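A minimal Python sketch of the two computations mentioned above (Pearson correlation on log10-transformed values and Benjamini–Hochberg adjustment) is given below. It is an illustration only: the software actually used is not specified in this section, the function names and toy values are hypothetical, and the handling of zero expression values before the log transformation is not addressed.

```python
import math

def pearson_log10(x, y):
    """Pearson correlation between log10-transformed expression vectors.
    Assumes all values are strictly positive."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sx = math.sqrt(sum((a - mx) ** 2 for a in lx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ly))
    return cov / (sx * sy)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy inputs (invented values).
r = pearson_log10([1, 10, 100], [2, 20, 200])
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [i for i, p in enumerate(adj_p) if p <= 0.1]
```

With these toy p values, the first three pairs remain significant at the 0.1 level after adjustment.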
RT-qPCR primers used to amplify genes of interest
CD, MB, FJ carried out RNA extraction and RT-qPCR; DE prepared the RNA-Seq libraries and carried out sequencing; CK performed the preprocessing of RNA-Seq data; KM, TD, FLEG, and VW performed long noncoding gene identification; KM, SD, and FLEC carried out the sequence conservation analyses and comparison to lncRNA databases; KM performed the synteny conservation analyses; KM and SL performed the expression and co-expression analysis and the statistical analyses; SL, KM, TD, SF, SD, HA, and EG participated in the interpretation of the data; SL conceived the project; SL, KM, TD, SF, SD, HA, and EG contributed to the writing of the manuscript. All authors read and approved the final manuscript.
The authors thank the INRA staff at the breeding facilities (Pôle d’Expérimentation Avicole de Tours, F-37380 Nouzilly, France) for technical participation (UR83 Recherches Avicoles, F-37380 Nouzilly, France). English was improved by Nature Research Editing Service from Springer Nature (http://authorservices.springernature.com/language-editing/).
The authors Kévin Muret, Christophe Klopp, Diane Esquerré, Frédéric Lecerf, Hervé Acloque, Elisabetta Giuffra, Sarah Djebali, Sylvain Foissac, Thomas Derrien and Sandrine Lagarrigue are partners of the FAANG pilot project ‘Fr-AgENCODE’.
The authors declared that they have no competing interests.
Animal ethics statement
All experimental procedures were performed in strict accordance with guidelines edited by the French Ministries of High Education and Research, and of Agriculture and Fisheries (http://ethique.ipbs.fr/sdv/charteexpeanimale.pdf). The protocol was also approved by the local Ethics Committee “Val de Loire” (certificate of authorization to experiment on living animals no. 7740, 30/03/2012). All birds were reared and killed in compliance with national regulations and according to procedures approved by the French Veterinary Services at PEAT experimental facilities.
Availability of data
The raw data supporting the conclusions of this article are available on Sequence Read Archive under accession SRP079637.
KM is a Ph.D. fellow supported by the Brittany region (France) and the INRA’s Animal Genetics division. SD was supported by the Agreenskills fellowship program which has received funding from the EU’s Seventh Framework Program under Grant Agreement No FP7-609398. The animal experimental designs and RNA-Seq data were funded by the French National Agency of Research (FatInteger Project, ANR-11-SVS7 and ChickStress Project, ANR-13-ADAP) and by Europe (Feed-a-Gene H2020 project).
- 7. GENCODE v.24. 2015. http://www.gencodegenes.org/. Accessed 2 Jul 2016.
- 15. FAANG Project home. http://www.faang.org/. Accessed 2 Jul 2016.
- 19. NONCODE v.2016. http://www.noncode.org/. Accessed 9 Nov 2015.
- 21. ALDB v.1. http://res.xaut.edu.cn/aldb/index.jsp. Accessed 28 Jun 2016.
- 22. FEELnc: FlExible Extraction of LncRNA. https://github.com/tderrien/FEELnc. Accessed 20 Apr 2016.
- 23. Wucher V, Legeai F, Hedan B, Rizk G, Lagoutte L, Leeb T, et al. FEELnc: A tool for long non-coding RNAs annotation and its application to the dog transcriptome. 2016. http://biorxiv.org/content/early/2016/07/18/064436 (in press).
- 42. DAVID Functional Annotation Bioinformatics Microarray Analysis. https://david.ncifcrf.gov/. Accessed 8 Jun 2016.
- 76. BioMart. http://www.ensembl.org/biomart/. Accessed 18 Feb 2016.
- 79. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. 1995;57:289–300.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.