Systematically profiling and annotating long intergenic non-coding RNAs in human embryonic stem cell
Although a growing number of long intergenic non-coding RNAs (lincRNAs) have been found to play important roles in both maintaining pluripotency and regulating differentiation, how these lincRNAs define and drive cell fate decisions on a global scale remains largely elusive. Systematic profiling and comprehensive annotation of embryonic stem cell lincRNAs may not only give a clearer overall picture of these novel regulators but also shed light on their functions.
Based on multiple RNA-Seq datasets, we systematically identified 300 human embryonic stem cell lincRNAs (hES lincRNAs). Of these, about one quarter (78 out of 300) were further found to be expressed with a bias toward human ES cells (hES-biased). Functional analysis showed that they were preferentially involved in several biological processes related to early development. Comparative genomics analysis further suggested that around half of the identified hES lincRNAs were conserved in mouse. To facilitate further investigation of these hES lincRNAs, we constructed an online portal through which biologists can access all their sequences and annotations interactively. In addition to navigating through a genome browser interface, users can locate lincRNAs through an advanced query interface based on both keywords and expression profiles, and analyze results with multiple tools.
By integrating multiple RNA-Seq datasets, we systematically characterized and annotated 300 hES lincRNAs. A fully functional web portal is freely available at http://scbrowse.cbi.pku.edu.cn. As the first global profiling and annotation of human embryonic stem cell lincRNAs, this work aims to provide a valuable resource for both experimental biologists and bioinformaticians.
Keywords: Gene Ontology; Human Embryonic Stem Cell; Slim Term; Derived Allele Frequency; Gini Correlation Coefficient
List of abbreviations used
lincRNA: long intergenic non-coding RNA
lncRNA: long non-coding RNA
hES: human embryonic stem cell
FPKM: fragments per kilobase of transcript per million mapped reads
GCC: Gini correlation coefficient.
The great clinical potential of human embryonic stem (hES) cells has inspired scientists to investigate the mechanisms underlying their unique pluripotency and self-renewal [1, 2, 3, 4, 5, 6, 7, 8, 9]. Recently, several studies have demonstrated that long intergenic non-coding RNAs (lincRNAs) play key roles in maintaining pluripotency [10, 11] and in modulating reprogramming and differentiation. Knockdown of multiple lincRNAs has a strong effect on the global gene expression pattern and can cause exit from the pluripotent state. Several human lincRNAs have further been shown to be involved in the core regulatory feedback circuits of hES cells and to be directly regulated by well-known key pluripotency transcription factors such as Oct4 and Nanog [10, 14, 15]. As more and more human lincRNAs are identified [9, 10, 13, 14, 15, 16], systematically characterizing human embryonic stem cell lincRNAs will not only shed light on hES transcriptome dynamics but also help reveal the biological functions of these novel regulators.
Combining a comprehensive collection of human embryonic stem cell RNA-Seq datasets with Human BodyMap 2, we validated that 295 previously annotated lincRNAs were expressed in multiple human embryonic stem cell samples and further identified five novel hES lincRNAs through de novo assembly. Global statistical analysis revealed that the expression levels of these lincRNAs are lower than those of their protein-coding counterparts. Functional analysis further demonstrated that hES lincRNAs were preferentially involved in multiple developmental processes including embryo development, ribosome biogenesis, and aging. To help explore this abundant information effectively, we built an integrative web portal for scientists to browse, search and analyze all lincRNAs through an intuitive Web interface. It can be accessed freely at http://scbrowse.cbi.pku.edu.cn.
300 lincRNAs are transcribed in human embryonic stem cells
To systematically profile hES lincRNAs, we first compiled a catalog of known human lincRNAs by integrating multiple public sources. Annotated lincRNA gene models were extracted from Ensembl, UCSC and RefSeq. Redundant gene models were identified and merged based on their genomic coordinates, resulting in 5,571 standalone annotated lincRNA genes (see Methods and Materials, as well as Additional File 1, for more details). Moreover, we surveyed and manually screened published hES RNA-Seq datasets in several public repositories, resulting in a list of 31 wild-type human embryonic stem cell samples. Of these, 19 high-quality datasets with at least 50 nt read length were further selected for follow-up analysis to minimize technological biases caused by early Solexa platforms (see Additional File 2 for more details). In addition, transcriptome profiles for 16 normal adult tissues from the Illumina Human BodyMap 2 Project were incorporated as controls.
To find novel hES lincRNAs, we performed de novo assembly on all wild-type hES samples. After excluding annotated lincRNAs and non-lincRNA transcripts (e.g. known protein-coding genes, miRNAs and tRNAs), five novel lincRNAs were eventually identified. Combining these with the previously annotated catalog, we obtained a full list of 5,576 human lincRNA genes (5,571 known lincRNAs and 5 novel ones; see Methods and Materials for more details).
Of the expressed lincRNAs, 300 (~16.43%) were detected as expressed in hES (Figure 1b). Open chromatin marks were significantly enriched at their promoters (Fisher's exact test, H3K4me3, p-value < 2.2e-16; H3K4me2, p-value < 2.2e-16) and gene bodies (Wilcoxon test, H3K36me3, p-value < 2.2e-16; H3K79me2, p-value < 2.2e-16), confirming active transcription of these genomic regions in human ES cells [20, 21] (Additional File 3).
Nanog and Oct4 are both well-known essential transcription factors in hES [6, 22, 23]. The promoters of hES lincRNAs were found to be enriched for binding sites of Nanog (Fisher's exact test, adjusted p-value < 2.2e-16, odds ratio = 5.0) but not Oct4 (adjusted p-value = 1, odds ratio = 1.7), suggesting that these two pluripotency factors regulate lincRNAs differently in hES.
Compared with other lincRNAs, hES lincRNAs generally had a more complex transcript structure, with longer transcripts (1,215 nt versus 906 nt, p-value < 2.2e-16), more exons per transcript (3.44 versus 3.00, p-value = 2.2e-06) and more alternative isoforms per gene (2.76 versus 1.42, p-value < 2.2e-16). On the other hand, consistent with previous reports [18, 19], both their expression level and breadth were overall lower than those of their protein-coding counterparts.
Systematic annotation of hES lincRNAs
Partly due to their expression specificity, a slightly lower proportion (93.3%, 280 out of 300) of hES lincRNAs were annotated. Global statistical analysis suggested that these hES lincRNAs were preferentially involved in embryo development (108 lincRNAs, p-value = 4.56e-54) and ribosome biogenesis (41 lincRNAs, p-value = 2.10e-32). On the other hand, hES-biased lincRNAs were more likely to be involved in mitosis (67 lincRNAs, p-value = 1.90e-21), cell cycle (72 lincRNAs, p-value = 8.96e-18) and reproduction (52 lincRNAs, p-value = 3.53e-7), as well as embryo development (76 lincRNAs, p-value = 2.13e-21) and ribosome biogenesis (40 lincRNAs, p-value = 6.86e-19) (Figure 2b and Figure 2c). Interestingly, the five de novo assembled hES lincRNAs were involved in embryonic epithelial tube formation and regulation of cell cycle, suggesting putative important roles.
In addition to functional annotation, we further investigated the evolutionary pattern of hES lincRNAs. Previous studies have shown that the exon sequences of lincRNAs are less conserved than those of protein-coding genes but more conserved than neutrally evolving ancestral repeat sequences [18, 19]. On the other hand, despite their relatively rapid sequence turnover, several lincRNAs have been reported to have homologs in remote species, suggesting distinct evolutionary patterns among different classes of lincRNAs [18, 19, 26, 27].
For hES lincRNAs, 263 out of 300 were successfully dated. Around half (129 out of 300) were dated to the root of the mammalian tree, i.e. they are conserved between the mouse and human lineages. A closer look at the respective mouse homologous regions showed that more than 80% (109 out of 129) carried at least one active histone mark (H3K9ac, H3K27ac, H3K36me3 or H3K4me3) and only two carried repressive histone modifications (H3K9me3, H3K27me3) in mouse ES cells; one third (33) also showed PolII binding directly upstream, as measured by ChIP-Seq, suggesting bona fide expression in mouse ES cells.
To our surprise, hES lincRNAs were generally younger than other lincRNAs (Wilcoxon test, p-value = 4.46e-07) (Figure 3b). Consistently, human-specific lincRNAs were more likely to be hES lincRNAs (one-tailed Fisher's exact test, odds ratio = 6.3, p-value = 0.0054). Eleven hES lincRNAs were identified as human-specific. In line with previous reports, these human-specific hES lincRNAs were more likely to contain HERVH transposable elements (TEs) than mouse-conserved ones (one-tailed Fisher's exact test, odds ratio = 66.3, p-value = 0.0001). Of interest, two of them (ENSG00000228437, ENSG00000254339) were found to be under strong intra-population purifying selection, as indicated by low derived allele frequency (< 0.1), suggesting potential human-specific functions.
Integrative web portal for visualizing and analyzing data
To facilitate further investigation of these hES lincRNAs, we integrated gene models and annotations with multiple related biological data into an integrative web portal. Powered by ABrowse, the portal aims to provide users with a fully interactive environment for browsing, searching and analyzing these lincRNAs and their annotations through an intuitive interface.
The gene model tracks are designed to present both gene structures and annotations. In addition to the complete list of 5,576 human lincRNAs (track "lincRNA gene model"), protein-coding genes from Ensembl 68 (track "protein coding gene model"), two human lincRNA sets (from LNCipedia and Cabili et al.) and one hES long non-coding RNA (lncRNA) catalog from Sigova et al. are also available as references. Detailed annotations for each record are shown in the "Entry detail" tab (in the left-side Control panel, Figure 4c) after a click.
"Basic Information" sheet contains its genomic location, original accession number and the source link, as well as the functional and evolutionary annotations.
"Expression Profile" sheet shows global expression profile of the given lincRNA across multiple samples. To get a comprehensive view, we also incorporated 24 differentiated embryonic stem cell samples in addition to the 31 wild-type embryonic stem cell samples and 16 normal adult human tissues. After choosing samples in the "Sample Tree", the user could visually inspect the expression profile through a bar chart, and add mean expression for similar samples (i.e. samples in the same group) if needed.
The last sheet ("Find genes with similar expression pattern") allows users to identify co-expressed genes quickly. After specifying the correlation measure (Pearson's r, Spearman's rho or Gini correlation coefficient), the cut-off and the samples, the user can either view the matched genes interactively or download them in batch for further analysis.
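As an illustration of the co-expression search described above, the following minimal sketch computes Pearson's r and Spearman's rho between a query profile and candidate profiles, then keeps the genes that reach a cut-off. It is not the portal's implementation; the function names, gene names and expression values are hypothetical, and the Gini correlation coefficient is omitted.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r between two equal-length profiles (undefined for constant input)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho, computed as Pearson's r on the ranks."""
    return pearson(ranks(x), ranks(y))

def find_coexpressed(query, profiles, measure, cutoff):
    """Names of genes whose correlation with the query is at least the cut-off."""
    return sorted(g for g, p in profiles.items() if measure(query, p) >= cutoff)

# Hypothetical expression values across five samples.
query = [1, 2, 3, 4, 5]
profiles = {
    "gene_1": [1, 4, 9, 16, 25],   # monotonic but non-linear
    "gene_2": [5, 4, 3, 2, 1],     # anti-correlated
    "gene_3": [2, 4, 6, 8, 10],    # perfectly linear
}
by_pearson = find_coexpressed(query, profiles, pearson, 0.99)
by_spearman = find_coexpressed(query, profiles, spearman, 0.99)
print(by_pearson, by_spearman)
```

The rank-based measure also retrieves the monotonic but non-linear profile, which is one reason a user might prefer it over Pearson's r for this kind of search.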
To offer insights into the regulation and evolution of hES lincRNAs, we further integrated additional tracks derived from several public annotation sources. Multiple transcription factor binding and histone modification tracks, including more than 50 different transcription factors and 11 different types of histone modifications in the H1 hES cell line, are grouped as "Transcriptional Regulation". Similarly, evolutionary conservation tracks covering both inter-species (measured by phyloP score) and intra-species (measured by derived allele frequency) comparisons are grouped as "Evolution". All these tracks can easily be turned on or off through the "Tracks" box.
Beyond rich annotations, the portal also provides powerful search tools for users to find lincRNAs quickly. In addition to common text-based and sequence-based search (through the "Advanced Search" box), the portal allows users to search for lincRNAs with a specified expression pattern through the dedicated "Find lincRNAs" tab in the Control panel (Figure 4d). With an intuitive interface, the user can specify the expression pattern across multiple samples either by interactively "drawing" in the "Basic" sub-tab or by entering numbers precisely in the "Advance" sub-tab. As with the output of the "Find genes with similar expression pattern" sheet, the result can also be exported as plain text for further processing (Figure 4d).
Furthermore, multiple utilities were implemented to make it easier for users to analyze data through the portal. After registration, the user can upload customized tracks and manage them through the "My Tracks" tab in the Control panel. Even more conveniently, a registered user can add an Instant Note for any genomic region on the fly by clicking and dragging. All these user-generated annotations can be seamlessly integrated with existing tracks and freely set as "public" or "private" when necessary.
Compiling annotated human lincRNA catalog
To obtain a comprehensive catalog of annotated human lincRNAs, we first extracted all annotated lincRNA gene models from Ensembl (v68), then filtered 6,730 RefSeq NR_ records and 10,654 UCSC noncoding genes based on multiple criteria. Inspired by previous studies [18, 19], we took a conservative strategy. Briefly, only multi-exon transcripts satisfying 1) length > 200 bp, 2) no overlap with known genes, and 3) classification as "non-coding" by CPC were kept for further analysis. Redundant gene models were identified and merged based on genomic coordinates, resulting in 5,571 standalone annotated lincRNA genes (see also Additional File 1 for details).
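The conservative filter above can be sketched as a simple predicate. This is a minimal illustration rather than the pipeline actually used: the `Transcript` record, its field names and the example transcripts are hypothetical, and the overlap flag and CPC label are assumed to have been computed beforehand.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    name: str
    exon_count: int
    length_bp: int
    overlaps_known_gene: bool   # assumed to be precomputed from annotation
    cpc_label: str              # hypothetical field: "coding" or "noncoding"

def is_candidate_lincrna(t: Transcript, min_length_bp: int = 200) -> bool:
    """Multi-exon, longer than 200 bp, no overlap with known genes, non-coding by CPC."""
    return (
        t.exon_count >= 2
        and t.length_bp > min_length_bp
        and not t.overlaps_known_gene
        and t.cpc_label == "noncoding"
    )

# Made-up transcripts, each failing at most one criterion.
transcripts = [
    Transcript("tx_a", 3, 1500, False, "noncoding"),  # passes all criteria
    Transcript("tx_b", 1, 900, False, "noncoding"),   # single exon
    Transcript("tx_c", 2, 180, False, "noncoding"),   # too short
    Transcript("tx_d", 4, 2200, True, "noncoding"),   # overlaps a known gene
    Transcript("tx_e", 2, 800, False, "coding"),      # predicted coding
]
kept = [t.name for t in transcripts if is_candidate_lincrna(t)]
print(kept)
```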
RNA-Seq data collection
We manually screened all Illumina human RNA-Seq datasets in NCBI GEO [36, 37]. To guard against potential genomic contamination, only samples with polyA+ libraries were kept. Eventually, we obtained a list of 31 wild-type samples (19 H1, 4 H9 and 8 other hES cell lines) and 24 differentiated embryonic stem cell samples. To minimize technological biases caused by early Solexa platforms, we further chose 19 high-quality wild-type datasets with at least 50 nt read length for follow-up analysis (see Additional File 2 for more details). In addition, transcriptome profiles for 16 normal adult tissues generated by the Human BodyMap 2 Project were incorporated.
De novo assembling lincRNAs from RNA-Seq data
Following the standard protocol, we mapped raw reads of the 19 high-quality wild-type hES samples onto the human reference genome (hg19) with TopHat and assembled the mapped reads into transcripts with Cufflinks. All assembled transcripts were first filtered using criteria similar to those described above, and transcripts with low expression levels were further removed to control false positives due to transcriptional noise (see Additional File 4 for more details).
Finally, after combining these with the previously annotated catalog, we obtained a full list of 5,576 human lincRNA genes (5,571 known lincRNAs and 5 novel ones).
Identifying hES lincRNAs
Several measures were taken to ensure the quality of the hES lincRNA catalog. To effectively reduce false positives caused by random "bench effect", we systematically screened multiple heterogeneous datasets generated by different labs around the world (Figure 1).
To further improve robustness, we applied a stringent criterion (median FPKM across multiple human ES samples > 1) when calling hES lincRNAs. By contrast, a rather loose cut-off (0.07) was used by Sigova et al., resulting in large numbers of marginally expressed candidates. In fact, more than 80% (2,910) of the hES long noncoding RNAs reported by Sigova et al. (3,548) have low expression levels (< 1), and less than 18% (638 out of 3,548) of them pass our filter, including only 37 (less than 10%) of Sigova's distant hES long noncoding RNAs.
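The effect of the expression cut-off can be illustrated with a short sketch. The FPKM values and gene names below are made up, and applying the looser 0.07 cut-off to the same median statistic is an assumption made only for comparison; the exact statistic used by Sigova et al. is not described here.

```python
from statistics import median

def call_expressed(fpkm_by_gene, cutoff=1.0):
    """Names of genes whose median FPKM across samples is strictly above the cut-off."""
    return sorted(g for g, values in fpkm_by_gene.items() if median(values) > cutoff)

# Made-up FPKM values across five hES samples.
fpkm = {
    "linc_A": [2.1, 0.4, 3.0, 1.8, 5.2],   # median 2.1: robustly expressed
    "linc_B": [0.0, 0.2, 9.5, 0.1, 0.3],   # median 0.2: one outlier sample
    "linc_C": [1.0, 1.0, 1.0, 0.9, 1.2],   # median 1.0: not strictly above 1
}
stringent = call_expressed(fpkm, cutoff=1.0)
loose = call_expressed(fpkm, cutoff=0.07)
print(stringent, loose)
```

Using the median rather than the mean means that a single highly expressing sample, as in `linc_B`, cannot by itself qualify a gene under the stringent criterion.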
ChIP-Seq analysis for transcription factors and histone modifications
The respective ENCODE ChIP-seq datasets for both human and mouse ES cell lines were downloaded (from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC and http://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC). Replicates were first merged and then fed into MACS for peak calling with a p-value cutoff of 1e-10.
The region from 3 kb upstream to 1 kb downstream of annotated transcription start sites was defined as the promoter region. TFBSs or promoter-biased histone modifications such as H3K4me3, H3K4me2, H3K27me3 and H3K9me3 located within these regions were considered effective regulatory sites.
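A minimal sketch of the promoter definition follows. Strand-aware handling, 0-based half-open coordinates and the "any overlap" rule for assigning a peak to a promoter are assumptions made for illustration; the function names are hypothetical.

```python
def promoter_window(tss, strand, upstream=3000, downstream=1000):
    """Return (start, end) of the promoter: 3 kb upstream to 1 kb downstream of the TSS.

    "Upstream" follows the direction of transcription, so the window is
    mirrored for genes on the minus strand. The start is clamped at 0.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end

def peak_in_promoter(peak_start, peak_end, window):
    """True if a half-open peak interval overlaps the promoter window."""
    window_start, window_end = window
    return peak_start < window_end and peak_end > window_start

plus_window = promoter_window(10000, "+")    # (7000, 11000)
minus_window = promoter_window(10000, "-")   # (9000, 13000)
print(plus_window, minus_window)
print(peak_in_promoter(6900, 7100, plus_window))
```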
Gene ontology (GO) enrichment analysis
Following a previous evaluation, we calculated the Gini correlation coefficient (GCC) as the co-expression measure using the R package rsgcc. For each lincRNA, protein-coding genes with GCC >= 0.9 were taken as "neighbors" with strong expression correlation. GO enrichment analysis of the "neighbors" was performed with the R package GOstats. Raw p-values were adjusted for multiple testing using the BH procedure. Significantly enriched terms (adjusted p-value <= 0.01) were assigned to the lincRNA.
To get a broad overview, fine-grained terms were further projected onto generic GO slim terms. First, we downloaded the GO slim file from http://www.geneontology.org/GO_slims/goslim_generic.obo, then mapped each GO term to its ancestor GO slim terms using map2slim in the Perl package go-perl. After mapping lincRNAs to GO slim terms, we further tested the enrichment of each lincRNA set for each GO slim term. For example, to see whether lincRNAs expressed in hES are preferentially enriched in GO slim term X, we classified all lincRNAs in two ways: 1) expressed in hES or not expressed in hES, 2) enriched in term X or not enriched in term X. After Fisher's exact tests, GO slim terms with a BY-adjusted p-value <= 0.05 and odds ratio > 1 were determined to be significantly positively associated with hES lincRNAs.
We also tried the Pearson correlation coefficient instead of GCC, and changed the cutoff from 0.9 to 0.85 and 0.95 for both measures. Different parameters gave similar results, showing the robustness of our analysis pipeline.
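For readers unfamiliar with the BH (Benjamini-Hochberg) adjustment, a minimal pure-Python sketch is given here. The analysis itself was run in R; this code only mirrors the standard step-up procedure, and the term names and raw p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Made-up raw p-values for four hypothetical GO terms tested for one lincRNA.
terms = ["term_1", "term_2", "term_3", "term_4"]
raw = [0.0001, 0.004, 0.03, 0.2]
adjusted = bh_adjust(raw)

# Keep terms at the threshold stated in the text (adjusted p-value <= 0.01).
significant = [t for t, p in zip(terms, adjusted) if p <= 0.01]
print(adjusted, significant)
```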
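The two-way classification above yields a 2x2 contingency table for each GO slim term. The sketch below computes the sample odds ratio and a one-sided Fisher's exact p-value from such a table using the hypergeometric distribution. The use of a one-sided test is an assumption (the sidedness is not stated for this analysis), the counts are toy numbers, and the BY adjustment across terms is not shown.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """Odds ratio and one-sided (enrichment) Fisher's exact p-value.

    Table layout (counts of lincRNAs):
                     in term X   not in term X
        hES              a            b
        not hES          c            d
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denominator = comb(n, col1)
    k_max = min(row1, col1)
    # Probability of observing a or more hES lincRNAs in term X by chance.
    p_value = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, k_max + 1)
    ) / denominator
    odds_ratio = float("inf") if b * c == 0 else (a * d) / (b * c)
    return odds_ratio, p_value

# Toy counts, chosen only so the result is easy to verify by hand.
odds_ratio, p_value = fisher_exact_greater(3, 1, 1, 3)
print(odds_ratio, p_value)
```

For this toy table the p-value is 17/70, since only the observed table and the single more extreme one contribute to the sum.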
Transposable element (TE) content analysis
We downloaded repeat annotations (hg19 rmsk table) from the UCSC Table Browser. Satellite, low complexity, and simple repeats were then excluded. We then compared the genome coordinates of the remaining repeats with those of lincRNA exons.
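The coordinate comparison can be sketched as a plain interval-overlap check. The tuple layouts, the function name and the example coordinates are hypothetical, the class labels are assumed to follow the rmsk `repClass` spelling, and a real analysis over the whole genome would normally use an indexed interval tool rather than this quadratic loop.

```python
# Repeat classes excluded in the text; spelling assumed to match rmsk repClass.
EXCLUDED_CLASSES = {"Satellite", "Low_complexity", "Simple_repeat"}

def overlapping_te_names(exons, repeats):
    """Names of retained repeats that overlap at least one exon.

    exons:   list of (chrom, start, end), half-open coordinates
    repeats: list of (chrom, start, end, rep_class, rep_name)
    """
    hits = set()
    for chrom, rep_start, rep_end, rep_class, rep_name in repeats:
        if rep_class in EXCLUDED_CLASSES:
            continue
        for exon_chrom, exon_start, exon_end in exons:
            if chrom == exon_chrom and rep_start < exon_end and rep_end > exon_start:
                hits.add(rep_name)
                break
    return sorted(hits)

# Made-up exons of one lincRNA and made-up repeat records.
exons = [("chr1", 1000, 1500), ("chr1", 3000, 3400)]
repeats = [
    ("chr1", 1400, 1900, "LTR", "HERVH-int"),        # overlaps the first exon
    ("chr1", 1100, 1150, "Simple_repeat", "(CA)n"),  # excluded class
    ("chr1", 2000, 2500, "SINE", "AluY"),            # intronic, no exon overlap
    ("chr2", 1000, 1500, "LINE", "L1PA2"),           # different chromosome
]
te_hits = overlapping_te_names(exons, repeats)
print(te_hits)
```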
Evolutionary dating of lincRNA
First, we mapped human lincRNAs onto the genomes of other species according to genome alignments, using UCSC liftOver with default parameters. If a genome alignment between a given species and human covered > 80% of the base pairs of a human lincRNA, the aligned region in that species was taken as the homolog of the human lincRNA. Using this criterion, we identified homologs for 5,526 (~99%) of the lincRNAs in our catalog (5,576 lincRNAs) among the 6 genomes (chimp: 5,383; gorilla: 5,258; orangutan: 5,308; rhesus: 5,064; marmoset: 4,481; mouse: 2,601). We further checked the alignment identity of these homologs in each species and found that 95% of them had identity > 96% in chimp, > 95% in gorilla, > 92% in orangutan, > 87% in rhesus, > 82% in marmoset, and > 50% in mouse.
Second, we dated lincRNAs according to the presence of homologs using parsimony rules. Briefly, if a lincRNA has homologs in every species of a given evolutionary clade and no homologs in any outgroup, the lincRNA is dated to the period between the divergence of that clade from its closest outgroup and the origin of the clade's most recent common ancestor.
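The parsimony rule can be made concrete with a short sketch over the six non-human genomes, ordered from the closest to the most distant relative of human. The function name and the returned labels are hypothetical, and treating any presence/absence pattern that does not fit a single clade as undated is an assumption about how such cases were handled.

```python
# Non-human species ordered by increasing divergence from human.
SPECIES_NEAR_TO_FAR = ["chimp", "gorilla", "orangutan", "rhesus", "marmoset", "mouse"]

def date_lincrna(homolog_species):
    """Date a human lincRNA from the set of species in which it has a homolog.

    A lincRNA is dated only if homologs are present in every species of a
    clade containing human and absent from all outgroups; otherwise the
    function returns None (undated).
    """
    present = [s in homolog_species for s in SPECIES_NEAR_TO_FAR]
    k = sum(present)
    expected = [True] * k + [False] * (len(present) - k)
    if present != expected:
        return None
    if k == 0:
        return "human-specific"
    return "common ancestor of human and " + SPECIES_NEAR_TO_FAR[k - 1]

print(date_lincrna(set()))
print(date_lincrna({"chimp", "gorilla"}))
print(date_lincrna(set(SPECIES_NEAR_TO_FAR)))
print(date_lincrna({"mouse"}))
```

Under this sketch, a lincRNA with homologs in all six genomes is dated to the common ancestor of human and mouse, which corresponds to the root of the mammalian tree referred to in the Results.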
This work was supported by the National Key Basic Research Program of China (No. 2011CBA01102) and the China National High-tech (863) Program (2006AA02Z334). The authors would like to thank Prof. Jingchu Luo for invaluable suggestions and comments.
Funding to pay the open access publication charges for this article was provided by State Key Laboratory of Protein and Plant Gene Research Open Funding.
This article has been published as part of BMC Genomics Volume 14 Supplement 5, 2013: Twelfth International Conference on Bioinformatics (InCoB2013): Computational biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S5.
- 1. Amit M, Carpenter MK, Inokuma MS, Chiu CP, Harris CP, Waknitz MA, Itskovitz-Eldor J, Thomson JA: Clonally derived human embryonic stem cell lines maintain pluripotency and proliferative potential for prolonged periods of culture. Dev Biol. 2000, 227 (2): 271-278. 10.1006/dbio.2000.9912.
- 12. Loewer S, Cabili MN, Guttman M, Loh YH, Thomas K, Park IH, Garber M, Curran M, Onder T, Agarwal S, et al: Large intergenic non-coding RNA-RoR modulates reprogramming of human induced pluripotent stem cells. Nat Genet. 2010, 42 (12): 1113-1117. 10.1038/ng.710.
- 17. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511-515. 10.1038/nbt.1621.
- 19. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, Knowles DG, et al: The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res. 2012, 22 (9): 1775-1789. 10.1101/gr.132159.111.
- 32. Sigova AA, Mullen AC, Molinie B, Gupta S, Orlando DA, Guenther MG, Almada AE, Lin C, Sharp PA, Giallourakis CC, et al: Divergent transcription of long noncoding RNA/mRNA gene pairs in embryonic stem cells. Proc Natl Acad Sci USA. 2013, 110 (8): 2876-2881. 10.1073/pnas.1221904110.
- 38. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012, 7 (3): 562-578. 10.1038/nprot.2012.016.
- 44. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological). 1995, 57 (1): 289-300.
- 46. Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V, Shklar M, Ophir R, Bar-Even A, Horn-Saban S, Safran M, Domany E, et al: Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics. 2005, 21 (5): 650-659. 10.1093/bioinformatics/bti042.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.