Gateways to the FANTOM5 promoter level mammalian expression atlas
The FANTOM5 project investigates transcription initiation activities in more than 1,000 human and mouse primary cells, cell lines and tissues using CAGE. Based on manual curation of sample information and development of an ontology for sample classification, we assemble the resulting data into a centralized data resource (http://fantom.gsc.riken.jp/5/). This resource contains web-based tools and data-access points for the research community to search and extract data related to samples, genes, promoter activities, transcription factors and enhancers across the FANTOM5 atlas.
Keywords: Transcription Start Site; Resource Description Framework; Triple Store; SPARQL Endpoint; Transcription Start Site Region
CAGE: cap analysis of gene expression
CTSS: CAGE tag start site
FANTOM5: Functional Annotation of Mammalian Genomes 5
MCL: Markov Cluster Algorithm
PCR: polymerase chain reaction
RDF: Resource Description Framework
SSTAR: semantic catalog of samples, transcription initiation and regulators
TET: Table Extraction Tool
TFBS: transcription factor binding site
TSS: transcription start site
One of the most comprehensive ways to study the molecular basis of cellular function is to quantify the presence of RNA molecules expressed by a given cell type. Over the years, the genomics field has collectively built up several gene expression repositories across biological states to facilitate exploration of biological systems. As for genome-wide surveys of encoded RNAs, a number of partial and full-length cDNA clone collections have been constructed and sequenced previously [1-6]. The resulting data were used for genome annotation, in particular to build gene models (NCBI RefSeq, Ensembl transcripts, Representative Transcript and Protein Sets (RTPS)), and for exploration of active genes within specific biological contexts (NCBI UniGene, DigiNorthern, and cross-species analysis based on simplified ontologies). However, the ability of these surveys to quantify RNA abundance was limited, mainly by sequencing performance. Another approach to assess gene expression is hybridization to pre-designed probes (that is, microarrays) [11-13]. Thousands of studies have been published on gene expression profiles using microarrays (Gene Expression Omnibus, ArrayExpress, CIBEX), and collections of curated data sets (GNF SymAtlas2, EBI Gene expression atlas, BioGPS) have become popular tools to survey gene expression levels. However, the coverage of identifiable RNA molecules and the accuracy of quantification are limited by probe design, which relies on existing knowledge of RNA species.
The recent development of next-generation sequencers enables us to obtain genome-wide RNA profiles comprehensively, quantitatively and without any pre-determination of what should be expressed, using methods like cap analysis of gene expression (CAGE) and RNA-seq. In particular, a variation of the CAGE protocol using a single molecule sequencer allows us to quantify transcription start site (TSS) activities at single base pair resolution from as little as approximately 100 ng of total RNA. We used this technology to capture transcription regulation across diverse biological states of mammalian cells in the Functional Annotation of Mammalian Genomes 5 (FANTOM5) project. The collection consists of more than 1,000 human and mouse samples, most of which are derived from primary cells. This is a unique data set to understand regulated transcription in mammalian cell types. The broad coverage of biological states allows researchers to find samples of interest and inspect active genes or transcription factors in their biological contexts. The comprehensive profiling across the sample collection provides the opportunity to look up any gene, transcription factor or non-coding RNA of interest and to examine in which context they are activated across mammalian cellular states. CAGE-based TSS profiles at single base resolution allow the correlation of transcription activity with sequence motifs or epigenetic features. In previous studies, we generated TSS profiles based on CAGE in FANTOM3 [24,25] and FANTOM4 [26,27], but the diversity of biological states and the quantification capabilities were quite limited due to the state of the technologies at that point. To facilitate FANTOM5 data exploration from various perspectives, we prepared a set of computational resources, including a curated data archive and several database systems, so that researchers can easily explore, examine, and extract data.
Here, we introduce the online resources with underlying data structure and describe their potential use in multiple research fields. This work is part of the FANTOM5 project. Data downloads, genomic tools and co-published manuscripts are summarized at .
Results and discussions
Annotation of the sample collection
In FANTOM5, more than 1,000 human and mouse samples were profiled by CAGE. These include primary cells, cell lines, and tissues consisting of multiple cell types. To facilitate examination of the diverse and large number of samples by both wet-bench and computational biologists, we describe the samples from two complementary perspectives: (i) manual collection and curation of sample attributes and (ii) systematic classification using existing ontologies. Manual curation was accomplished via a standardized sample and file naming procedure based on a compiled set of sample attributes (such as age, sex, tissue, and cell type; details in Additional files 1, 2, and 3). Names are formed by concatenating the curated sample names (for example, 'Smooth Muscle Cells - Aortic, donor0'), RNA ID (for example, '11210-116A4') and CAGE library ID (for example, 'CNhs10838'), where the latter two enable us to track the samples in the form of RNA extracts and loaded sequencing materials (Additional file 4). Replicates are further identified by a suffix (such as tech_rep#, biol_rep#, donor#, pool#) appended to the sample names. The resulting sample and file names are structured so that related samples (like developmental stages) will be grouped together in order when sorted alphabetically. We faced the challenge that the file names needed to be both informative for researchers and valid for computational systems that impose restrictions on the set of allowed characters in file names and file access paths. A full description of samples often requires a variety of symbols (for example, single quote in 'Hodgkin's lymphoma', slash, caret, parentheses in 'cell line:143B/TK^(−)neo^(R)'), and some computer systems have problems handling file names including these symbols.
One option is to use short labels as in the case of genes, where unique short labels for human genes (called gene symbols) are determined through community discussions under coordination by the HUGO Gene Nomenclature Committee. But we chose not to do this, as it introduces an extra layer of complexity in data handling and coordination, and an additional cognitive burden on human users. Instead, we decided to encode the sample names using the 'URL encode' scheme (RFC3986) for file names, so that we can systematically generate them and decrease the risk of data tracing errors. This has the added advantage that URL path accessors to the files are consistent with those of the file system.
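The encoding step can be illustrated with a minimal Python sketch. It relies on the standard library function urllib.parse.quote; the helper names, the '.' separator and the exact field order are assumptions made for illustration and are not taken from the FANTOM5 pipeline. The ASCII hyphen stands in for the minus sign in the cell line name.

```python
from urllib.parse import quote, unquote


def encode_sample_name(name: str) -> str:
    """Percent-encode every character outside the RFC 3986 unreserved set."""
    # safe="" also encodes '/', which quote() leaves untouched by default.
    return quote(name, safe="")


def build_file_stem(sample_name: str, rna_id: str, library_id: str) -> str:
    """Join the encoded sample name with its identifiers.

    The '.' separator and the field order are assumptions for illustration.
    """
    return ".".join([encode_sample_name(sample_name), rna_id, library_id])


if __name__ == "__main__":
    for raw in ("Hodgkin's lymphoma", "cell line:143B/TK^(-)neo^(R)"):
        encoded = encode_sample_name(raw)
        # Decoding recovers the curated name exactly, so nothing is lost.
        assert unquote(encoded) == raw
        print(encoded)
    print(build_file_stem("Smooth Muscle Cells - Aortic, donor0",
                          "11210-116A4", "CNhs10838"))
```

Because the encoding is reversible, the human-readable sample name can always be recovered from the file name, which is what keeps data tracing systematic.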
To classify samples systematically, we assembled the FANTOM Five (FF) Sample Ontology, consisting of the existing basic ontologies: cell types (CL), anatomical systems (UBERON), and diseases (DOID) [30-32]. We used the RNA ID as a unique identifier term (see Additional file 4 and below) of the individual samples and to link the corresponding FF ontology terms in a parent-child relationship. This scheme provides a way for researchers to query a group of samples based on existing knowledge and to aggregate related information systematically. In addition, we mapped graphical images in the BodyParts3D resource to the UBERON terms composing the FF ontology, via the Foundational Model of Anatomy ontology. This enables us to provide graphical shapes of individual organs in our databases.
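As a rough illustration of how parent-child links allow a group of samples to be queried, the following Python sketch walks a toy ontology graph. Apart from the RNA ID 11210-116A4 quoted above, all term identifiers and edges are hypothetical placeholders, and the traversal is a generic graph search rather than the implementation used in the FANTOM5 databases.

```python
from collections import defaultdict

# Toy parent -> child edges. Identifiers are hypothetical placeholders;
# sample terms are assumed to carry an "FF:" prefix followed by the RNA ID.
CHILDREN = defaultdict(list)
for parent, child in [
    ("CL:muscle_cell", "CL:smooth_muscle_cell"),
    ("CL:smooth_muscle_cell", "FF:11210-116A4"),
    ("CL:muscle_cell", "FF:example-sample-2"),
    ("UBERON:aorta", "FF:11210-116A4"),
]:
    CHILDREN[parent].append(child)


def samples_under(term: str) -> set[str]:
    """Collect every sample term reachable from `term` via child links."""
    found, seen, stack = set(), set(), [term]
    while stack:
        node = stack.pop()
        if node in seen:          # a term may be reachable by several paths
            continue
        seen.add(node)
        if node.startswith("FF:"):
            found.add(node)
        stack.extend(CHILDREN.get(node, []))
    return found


if __name__ == "__main__":
    print(sorted(samples_under("CL:muscle_cell")))
```

A sample can sit under several parents (here a cell type and an anatomical term), so the same sample is returned whichever classification the query starts from.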
Overview of the data collected from the FANTOM5 samples
Data files available in the data archive
Data or analysis type
Sample, RNA, and CAGE library information (metadata)
Ribosomal RNA hitting reads: /basic/*CAGE/*nobarcode.rdna.fa.gz (1,385 files)
Mapping results (including unmapped reads): /basic/*CAGE/*nobarcode.bam (1,385 files)
TSS profiles (counts of obtained 5'-end reads at 1 bp resolution): /basic/*CAGE/*ctss.bed.gz (1,385 files)
Sample classification based on the FANTOM Five Sample Ontology
CAGE peaks (TSS clusters)
CAGE peak annotation (descriptions and gene association)
Expression of the CAGE peaks
De novo motif analysis
Sample enrichment analysis
Gene ontology enrichment analysis of co-expression clusters
Interfaces to the series of FANTOM5 results
For interactive and dynamic data exploration, optimized for individual data types, we configured the ZENBU genome browser and analysis system, which stores and displays all CAGE experiments, including the genome alignments of individual CAGE reads as well as the annotation of each sample. It enables users to explore TSS activities in any region of the genome, with a user-selectable alignment threshold between the CAGE reads and the genome. The Enhancer Selector tool (Li et al., under preparation) stores the summarized activity profiles of the enhancers identified by CAGE based on curated tissue categories and enables users to select a group of enhancers activated in specified conditions through its intuitive 'slider' interface. BioLayout Express 3D presents the results of co-expression clustering as a three-dimensional visualization of expression space with an interactive user interface.
Data exploration: use cases
All of the individual interfaces have their own scope and advantages and are linked to each other to allow easy access to relevant information. An example analysis flow using multiple tools is shown in Additional file 5, while a variety of explorations are possible for biological questions and hypotheses. Below, we provide examples to access FANTOM5 data via the specific interfaces.
Starting with sample details
Checking a group of samples based on manually curated classifications
SSTAR provides lists of the sample ontology terms (cell type, tissue, and disease ontologies) with hyperlinks to individual ontology term pages. Each of these pages shows detailed information on the term itself, such as cross-references and name spaces, and lists the samples associated with the term based on FF Sample Ontology classification (Figure 3; Additional file 7). The ontology term page also shows parent-child relationships via a graphical and interactive user interface using the NCBO widget. For example, a page describing the cell type 'monocyte' shows that it develops from promonocyte and into macrophage (Additional file 7). Furthermore, it shows the CAGE peaks highly active in the monocyte-related samples based on FF Sample Ontology Enrichment Analysis (Additional file 8).
Overviewing sample proximity and distance across transcriptome space
BioLayout Express 3D is a powerful network analysis tool that provides an interactive way to explore similarity relationships between samples and transcription initiation activities (that is, CAGE peak expressions). The user can inspect a network in which nodes represent either samples or CAGE peaks, node colors are based on the co-expression cluster they belong to, and edges represent correlations between them above the user-defined threshold. The network displayed in a three-dimensional environment can be rotated, zoomed and explored interactively. Graphical representation of the FANTOM5 data allows the user to examine promoter expression patterns across nearly 1,000 samples included in this study or subsets thereof. A number of pre-calculated graph views (layout files) are available at our web resource. For example, a network shown in Additional file 9 enables us to examine sample-sample (dis)similarities, and one in Additional file 10 to examine relationships between CAGE peaks where their expression patterns can be displayed in a pop-up window. A web search function for nodes (samples or CAGE peaks) is set up to query the SSTAR or ZENBU databases for matches. For further in-depth examination, users can activate the clustering option based on the Markov Cluster Algorithm (MCL) and adjust the parameters in order to obtain co-expression classes, or clusters, of samples sharing similar patterns in expression.
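The idea of a thresholded correlation network can be sketched in a few lines of Python. This is a generic illustration, not BioLayout's own code: the function names, the toy expression profiles and the choice of Pearson correlation with a single cut-off are assumptions made here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:      # a constant profile has no defined correlation
        return 0.0
    return cov / sqrt(vx * vy)


def correlation_edges(profiles, threshold):
    """Return (node_a, node_b, r) for every pair with r >= threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges


if __name__ == "__main__":
    # Hypothetical expression values of three CAGE peaks over four samples.
    toy = {
        "peak_A": [1.0, 2.0, 3.0, 4.0],
        "peak_B": [2.0, 4.0, 6.0, 8.0],
        "peak_C": [4.0, 3.0, 2.0, 1.0],
    }
    for edge in correlation_edges(toy, threshold=0.9):
        print(edge)
```

Raising the threshold removes weaker edges and breaks the graph into tighter groups, which is the effect a user sees when adjusting the cut-off interactively.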
Inspecting genes, transcription factors and DNA motifs
Putting data in the genomic axis
ZENBU provides an interactive interface to explore transcription initiation activities in their genomic context and it helps to examine transcription activity in-depth, independent of the CAGE peaks defined in FANTOM5. It also allows for selection of CAGE profiles to be displayed using the Data Explorer search tab (Additional file 13). A single ‘pooled’ track aggregating multiple CAGE samples allows a user to examine the expression profile in each of the CAGE profiles immediately by selection of any genomic regions. For example, selection of the SPI1 promoter region in a pre-configured pooled track of all the FANTOM5 CAGE profiles displays accumulated transcription activities. From there one can apply a filter on sample names and sort by expression levels (Additional file 14). Several configurations prepared for the FANTOM5 data set are accessible from the ZENBU resource page. Similarly, we prepared a set of configured data files for the data hub in the UCSC Genome Browser, which allow users to overlay the FANTOM5 CAGE peaks and TSS profiles with the views and annotations maintained by the database management team and the community. For example, one can examine the CAGE peaks associated with SPI1 and compare them with the ENCODE regulation tracks and segmentation tracks (Additional file 15).
Exporting selected data
Besides individual inspection of compiled results, further computational analyses with custom parameters and/or tools are sometimes required to build a working hypothesis and select candidates for experiments. Researchers can use several interfaces to obtain desired data rather than downloading and parsing large data files from the entire data archive. ZENBU and the UCSC Genome Browser both have export functions as a part of their user interface. In particular, ZENBU’s unique interface enables us to export expression profiles of arbitrary regions, which is useful for in-depth examination of non-annotated genomic regions. Similarly, portions of the data can be extracted using the BioMart instance and TET. The former provides a way to select and obtain CAGE peak annotations, such as associated genes and promoter features, via a widely used interface (Additional file 16). TET lets users obtain a subset of data by specifying the desired columns and rows. In the FANTOM5 context, TET enables users to specify CAGE peaks and samples to be included. The resulting data matrix is immediately usable for expression analysis across CAGE peaks and biological samples (Additional file 17).
Connecting to linked data
In addition to data export in tab-delimited files, we also modeled the FANTOM5 data as nanopublications (the smallest unit of publishable information) [46,55]. Nanopublications expose individual records, allowing automatic integration with any other linked data [56,57] and citation tracking of their impact. Each of the nanopublications is composed of three elements based on RDF (Additional file 18): an assertion (data or scientific statement), provenance for the assertion (how the assertion came to be), and publication information (how the nanopublication came to be). We have exposed three types of nanopublications from FANTOM5 data: CAGE peaks (type I nanopublications; see Materials and methods); their associated genes (type II); and their expression information (type III). By applying standard SPARQL queries to the FANTOM5 nanopublications (available at ), specific results can be retrieved semantically. For example, Additional file 19 shows a SPARQL query to retrieve the samples related to skeletal muscle and activities of the TSSs for MYOD, a master regulator of myogenesis, in those samples. Although this is a simple biological question, automatic retrieval of its result is challenging due to ambiguities in several layers. For example, there are ambiguities in concept identification (MYOD1, not MYOD, is the official symbol in HUGO nomenclature), multiple CAGE peaks can be associated with the gene (actually four CAGE peaks are associated with MYOD1), and many different FANTOM5 samples, including cell lines and primary cells, are related to skeletal muscle but not all samples contain the keyword 'muscle' in the sample description (for example, myoblast). Despite these semantic complications, the query in Additional file 19 retrieves expected data (Additional file 20) by resolving these ambiguities with semantic integration of Linked Life Data (retrieved 16 April 2014) and the FANTOM5 nanopublications.
We foresee that the nanopublications and associated SPARQL endpoints facilitate the automated integration with many other biomedical datasets.
Continual evolution of resources to treat diverse sets of data
Based on our experience preparing the series of interfaces, here we discuss the challenges we faced in their preparation and the approaches we took, as a lesson for other future projects. At the initial stage of FANTOM5, we had a clear vision of the data set to be generated and analyses to be tackled, but we did not have a complete picture of the results, research questions and directions. The types of raw and processed data were clear, but it was difficult to determine the number of data files and data types, and to predict their complexities through the entire project.
Given the challenge of working with large amounts of data under such uncertainty, we started to prepare interfaces from a minimum set of visible tools requiring fewer data modeling assumptions ('data agnostic' tools). MediaWiki is designed for Wikipedia, a web-based, collaborative and flexible form of encyclopedia to collect a comprehensive summary from any branch of knowledge. Individual pages can contain any sort of description, and immediate data visibility on a page provides a means for data providers and generators to visually check, confirm or correct details, while the Semantic MediaWiki extension helped us to retrieve relevant information even if stored in different pages. Genome browsers require data to have genomic coordinates, and the use of genome browsers for inspection of data (in the context of other data in the same genomic region) is obviously important for the genomics field. Loading all the CAGE profiles into ZENBU helped us to validate the processing of samples by checking the expression of marker genes. After starting with these two interfaces, we gradually added other interfaces to cover the remaining gaps. We included BioMart, BioGPS plugin, and UCSC DataHub to disseminate our results across these user communities, and introduced the enhancer selector, BioLayout and TET to facilitate further analysis and inspection of our resources. This might serve as a practical approach in treating data for exploratory research, and a guide for developers to design tools and their functions.
In FANTOM5, the FANTOM Consortium has profiled TSS level transcription activities in a diverse range of samples. We assembled the data and analysis results into an on-line resource containing a comprehensive expression atlas for exploration from multiple perspectives. The expression atlas covers the largest number of samples (nearly 1,000 human and 400 mouse samples) based on HeliScopeCAGE. An existing expression resource, BioGPS, one of the most popular databases for microarray-based gene expression atlases, provides around 200 samples in its most recent version. CellMontage, a system for searching gene expression databases based on profile similarity, exhaustively collected hundreds of thousands of human microarray gene expression profiles from different public repositories, providing a tool to retrieve data sets from different studies and laboratories. Our resource uniquely consists of the largest number of samples on a single platform. In terms of TSS profiles, the FANTOM5 collection is the largest (ENCODE profiled 36 cell lines by CAGE, while the DataBase of Transcriptional Start Sites (DBTSS) has TSS profiles from 20 tissues and 7 cell lines). The FANTOM5 atlas expands the existing resources in terms of coverage and diversity of samples that were profiled. Moreover, considering the nature of HeliScopeCAGE data, absolute measurement of capped RNA abundance by using a single molecule sequencer can achieve higher quantification ability compared with the previous CAGE technology employing two steps of PCR. Thus, the FANTOM5 atlas could contribute to the research community by providing high quality data.
The resource provides extensive annotation about transcription initiation as well as cellular transcription states, which is far beyond merely assembling profiles. We strategically defined TSS regions in a data-driven manner and annotated them by performing a series of computational analyses. Such analyses enriched the characterization of experimentally defined regions, although they also increased data types. We prepared a series of database systems to host heterogeneous data to make it possible for researchers to explore the data from multiple perspectives. The tools or database systems shown in Figure 2 provide multiple means to play with data interactively, export only a subset of the entire data, and integrate with other data beyond FANTOM5. In the on-going activities of the second phase of FANTOM5, we are now working on time-dependent dynamics and their regulation. We expect additional data types and are going to expand the collection to cover additional analysis.
Materials and methods
A standardized description of samples and experimental conditions
A wide range of RNA samples with different origin and with replicates was produced for FANTOM5. To describe, in a consistent manner, the entire set of samples, experiments, and protocols, we employed the MAGE/ISA-tab file format [38,39], a standard format to describe experimental details. The experimental steps described in the file can be visualized with SDRF2GRAPH, a tool developed during the FANTOM4 project (available as a web tool at ), providing an intuitive representation of the complex experimental steps. These meta-data files help to document the data structure of the FANTOM5 project and support its use and biological interpretation.
Standardized data collection, quality control and automated data processing
For each FANTOM5 sample, cDNAs resulting from CAGE library preparation were loaded onto HeliScope flow cells. Each sequencing result was then systematically processed by discarding sequences that are too short or that represent artifacts, aligning the obtained reads to the reference genome sequences, and counting CAGE read alignments based on their 5’ end (termed CAGE tag start site (CTSS)) with required mapping quality ≥20 and sequence identity ≥85%. Mapping files were first filtered to discard bad alignments and then indexed by using SAMtools utilities to allow both extraction of specific mapping locations and remote access to the BAM files. The mapping files were then converted into CTSS BED files using a combination of BedTools and shell commands to reduce the data. They were then systematically named using a combination of sample names and unique identifiers (Additional file 4). This yields a quantification of transcription initiation activity in each sample at single base pair resolution.
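The counting step can be sketched in Python as follows. The actual pipeline used SAMtools, BedTools and shell commands; this stand-in operates on already-parsed alignment records, and the record fields, function names and the BED name column layout are assumptions made for illustration. Only the two thresholds (mapping quality ≥20, identity ≥85%) come from the text.

```python
from collections import Counter
from typing import NamedTuple


class Alignment(NamedTuple):
    chrom: str
    start: int        # 0-based, inclusive
    end: int          # 0-based, exclusive
    strand: str       # "+" or "-"
    mapq: int         # mapping quality
    identity: float   # fraction of matching bases, 0 to 1


def ctss_counts(alignments, min_mapq=20, min_identity=0.85):
    """Count read 5' ends per (chrom, position, strand) after filtering."""
    counts = Counter()
    for aln in alignments:
        if aln.mapq < min_mapq or aln.identity < min_identity:
            continue
        # The 5' end of a minus-strand read is its rightmost aligned base.
        five_prime = aln.start if aln.strand == "+" else aln.end - 1
        counts[(aln.chrom, five_prime, aln.strand)] += 1
    return counts


def to_bed_lines(counts):
    """Format counts as 6-column BED lines, one per CTSS position."""
    return [
        f"{chrom}\t{pos}\t{pos + 1}\t{chrom}:{pos}..{pos + 1},{strand}\t{n}\t{strand}"
        for (chrom, pos, strand), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    reads = [
        Alignment("chr1", 100, 130, "+", 30, 0.95),
        Alignment("chr1", 100, 128, "+", 25, 0.90),
        Alignment("chr1", 70, 101, "-", 40, 1.00),
        Alignment("chr1", 100, 130, "+", 10, 0.99),  # fails mapping quality
        Alignment("chr1", 100, 130, "+", 30, 0.80),  # fails identity
    ]
    print("\n".join(to_bed_lines(ctss_counts(reads))))
```

Reducing each alignment to a single base is what shrinks the BAM files into compact CTSS tables while preserving single base pair resolution.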
Based on the TSS profiling data above, we determined TSS regions by calling peaks over the CAGE signals (Additional file 21). We refer to them as 'CAGE peaks' to avoid confusion with co-expression clustering below. We assigned peak names based on the closest gene (located within 500 bp upstream of the 5’ end of the gene model, or alternatively on its first exon up to 500 bp downstream), and ranked them based on the CTSS counts when multiple CAGE peaks were associated with the same gene. For example, p1@B4GALT1 (CAGE peak 1 at the B4GALT1 5’ end) indicates a peak near the B4GALT1 gene which is the most highly expressed among those associated with the same gene. Further, we examined the association of CAGE peaks with gene structure and repetitive elements based on a curation rule (see below). We also examined the similarity of their neighboring genomic sequences to conventional TSSs by a machine learning approach to distinguish TSS-like sequences from others. We quantified activities of the identified TSS regions based on the counts of CAGE read alignments as tags per million after adjusting the library size by the relative log expression method [36,37].
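A short Python sketch of the naming rule and the tags-per-million scaling follows. The function names, the tie-breaking rule and the way the normalization factor multiplies the library size are assumptions made here; estimating the relative log expression factor itself is outside this sketch.

```python
from collections import defaultdict


def tags_per_million(count, library_size, norm_factor=1.0):
    """Scale a raw count by the effective library size.

    `norm_factor` stands in for a library-size adjustment such as a relative
    log expression factor; how it is estimated is not shown here.
    """
    return count / (library_size * norm_factor) * 1_000_000


def name_peaks(peaks):
    """Assign p<rank>@<gene> names; rank 1 is the peak with the most tags.

    `peaks` is an iterable of (peak_id, gene, total_ctss_count).
    """
    by_gene = defaultdict(list)
    for peak_id, gene, count in peaks:
        by_gene[gene].append((count, peak_id))
    names = {}
    for gene, members in by_gene.items():
        # Highest count first; the peak ID breaks ties (an arbitrary choice).
        members.sort(key=lambda m: (-m[0], m[1]))
        for rank, (_, peak_id) in enumerate(members, start=1):
            names[peak_id] = f"p{rank}@{gene}"
    return names


if __name__ == "__main__":
    # Hypothetical peak IDs and counts.
    toy = [("peak_a", "B4GALT1", 50), ("peak_b", "B4GALT1", 900),
           ("peak_c", "SPI1", 10)]
    print(name_peaks(toy))
    print(tags_per_million(4, 2_000_000))
```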
Based on the TSS regions and their expression levels, we performed co-expression analysis by applying the MCL [23,71] followed by pathway enrichment analysis (Figure 1). Gene ontology enrichment analysis allowed us to annotate individual co-expression clusters in terms of gene function, while the sample ontology let us annotate the biological context in which a CAGE peak or a co-expression cluster is activated in an analogous way to gene set enrichment analysis. In parallel, we examined the presence of DNA motifs, which are regulatory elements encoded in the genome. We examined over-representation of known DNA motifs (obtained from JASPAR) in each of the co-expression clusters, and correlation between their presence and expression (see Materials and methods). Furthermore, we explored novel DNA motifs by evaluating their correlation with CAGE expression patterns.
Significance assessment of DNA motifs
We predicted putative transcription factor binding sites (TFBSs) using a position-weight matrix model as implemented in Biopython for each JASPAR motif and for each novel motif, with a background probability based on a 40.9% GC content. The position-weight matrix scores were converted to Bayesian posterior probabilities using a prior probability of 5 × 10^(−4). We retained all predicted TFBSs with a posterior probability larger than 0.1. We then associated predicted TFBSs with the 184,476 (human) or 116,064 (mouse) robust promoters as described previously, using a −300 to +100 bp region with respect to the representative genome position of the promoter, defined as its most highly expressed position in the FANTOM5 samples. For each motif in each sample, we calculated the Pearson correlation across the robust promoters between the number of TFBSs estimated for each promoter and its CAGE expression level. For each motif, we repeated this procedure for 1,000 randomized position-weight matrices, in which the order of rows (corresponding to positions along the motif) is randomly permuted. We then expressed the Pearson correlation for each motif as a Z-score by subtracting the mean and dividing by the standard deviation of the Pearson correlations found for the randomized motifs. The P-value displayed is the tail probability of the normal distribution corresponding to this Z-score.
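The arithmetic of this procedure can be sketched with the Python standard library. Several details are assumptions made for illustration: the matrix score is taken to be a base-2 log-odds ratio against the background, the sample standard deviation is used, and the reported P-value is taken to be the upper tail. None of these choices is stated explicitly in the text.

```python
from math import erfc, sqrt
from statistics import mean, stdev


def posterior_from_score(log2_odds, prior=5e-4):
    """Convert a log-odds score into a posterior probability of binding.

    Assumes the score is log2(P(site | motif) / P(site | background)).
    """
    likelihood_ratio = 2.0 ** log2_odds
    return prior * likelihood_ratio / (prior * likelihood_ratio + (1.0 - prior))


def z_score(observed_r, randomized_rs):
    """Standardize the observed correlation against the randomized motifs."""
    return (observed_r - mean(randomized_rs)) / stdev(randomized_rs)


def upper_tail_p(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical correlations from five randomized matrices (the text uses 1,000).
    randomized = [0.01, -0.02, 0.00, 0.02, -0.01]
    z = z_score(0.05, randomized)
    print(z, upper_tail_p(z))
    # With the prior of 5e-4, a score near 6.8 bits is needed to pass 0.1.
    print(posterior_from_score(6.8))
```

With a prior of 5 × 10^(−4), a site needs a likelihood ratio of roughly 222 (about 7.8 bits) to reach the 0.1 posterior cut-off, which shows how strongly the prior suppresses weak matches.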
For each novel motif, we calculated the number of predicted TFBSs for each promoter by summing their posterior probabilities. We averaged this number over the robust promoters and multiplied it by the number of robust promoters in each of the co-expression clusters to find the expected number of TFBSs for the motif under the null hypothesis that the motif is not overrepresented in the given co-expression cluster. The observed number of TFBSs of a motif was found by summing its predicted TFBSs over the co-expression cluster. We then calculated the statistical significance of motif overrepresentation in the co-expression cluster by finding the tail probability of the observed number of TFBSs under a Poisson distribution with a mean equal to the expected number of TFBSs in the co-expression cluster.
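The over-representation test reads as the following Python sketch. Because summed posterior probabilities need not be integers, the observed value is rounded to the nearest integer before the Poisson tail is taken; that rounding, like the function and variable names, is an assumption made here rather than a detail given in the text.

```python
from math import exp


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with k a non-negative integer."""
    term = exp(-lam)          # P(X = 0); underflows for very large lam
    below_k = 0.0
    for i in range(k):
        below_k += term       # add P(X = i)
        term *= lam / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - below_k)


def motif_overrepresentation(posteriors, cluster):
    """Return (expected, observed, p_value) for one motif in one cluster.

    `posteriors` maps each robust promoter to the list of posterior
    probabilities of its predicted TFBSs; `cluster` lists member promoters.
    """
    per_promoter = {p: sum(v) for p, v in posteriors.items()}
    mean_sites = sum(per_promoter.values()) / len(per_promoter)
    expected = mean_sites * len(cluster)
    observed = sum(per_promoter[p] for p in cluster)
    # Assumption: round to an integer count before taking the tail.
    return expected, observed, poisson_upper_tail(round(observed), expected)


if __name__ == "__main__":
    # Hypothetical promoters and posterior probabilities.
    toy = {"prom_a": [0.5, 0.5], "prom_b": [], "prom_c": [1.0], "prom_d": []}
    print(motif_overrepresentation(toy, ["prom_a", "prom_c"]))
```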
Annotation of CAGE peaks based on transcript structure
We devised a hierarchical approach to annotate TSS regions (or CAGE peaks) with respect to Gencode V10 transcript model structures such as TSSs, proximal promoter regions (500 bp upstream and 500 bp downstream of the TSS, or ending with the 3' end of its first exon), exonic regions split into coding and non-coding (differentiating non-coding transcript exons, coding transcripts' 5' UTR and 3' UTR exonic regions) as well as relative position within the transcript (first, inner or last exon of the transcript), and intronic regions (similarly differentiated with respect to the coding sequence and position relative to the transcript). We also defined genome segments corresponding to the opposite DNA strand of those TSSs, proximal promoters, exons and intronic regions. A CAGE peak can overlap more than one genome segment region (for example, the proximal promoter region of a transcript and the first intron of another co-localized transcript). The annotation follows this hierarchy: TSS, followed by proximal promoter regions; first exons, followed by inner and last exons; antisense of the TSS, then of proximal promoter regions, then of exonic regions; and finally introns (first sense and then antisense). The complete process is described in Additional file 22, and its implementation is based upon BedTools IntersectBed and groupBy utilities.
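The hierarchy amounts to a priority lookup, which the Python sketch below illustrates. The category labels, the collapsing of coding and non-coding sub-classes into single exon categories, and the 'intergenic' fallback are simplifications assumed here; the full rule set is the one described in Additional file 22.

```python
# Priority order, highest first, following the hierarchy in the text.
# Labels are illustrative; coding/non-coding sub-classes are collapsed.
HIERARCHY = [
    "tss",
    "proximal_promoter",
    "first_exon",
    "inner_exon",
    "last_exon",
    "antisense_tss",
    "antisense_proximal_promoter",
    "antisense_exon",
    "intron",
    "antisense_intron",
]
RANK = {label: i for i, label in enumerate(HIERARCHY)}


def annotate(overlapping_segments):
    """Pick the highest-priority segment type among those a peak overlaps."""
    known = [s for s in overlapping_segments if s in RANK]
    if not known:
        return "intergenic"   # assumed fallback, not named in the text
    return min(known, key=RANK.__getitem__)


if __name__ == "__main__":
    # A peak in one transcript's promoter and another transcript's intron.
    print(annotate(["intron", "proximal_promoter"]))
```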
Finally, we used the same genome segmentation annotation pipeline to annotate CAGE peaks with respect to CpG island proximal region (retrieved from the UCSC table browser), TATA box proximal region (based on a genome-wide scanning of the JASPAR TATA-binding protein position weight matrix), repeat elements (retrieved from the rmsk UCSC table) and ENCODE clustered TFBS proximal region (wgEncodeRegTfbsClustered track from UCSC; region defined as cluster boundaries ±300 bp).
ZENBU data load and view configuration
We implemented a semi-automated pipeline for bulk loading of the large numbers of CTSS and BAM alignment files into ZENBU, along with the corresponding sample annotation metadata, using ZENBU's command line tools. Several preconfigured views were created and updated to aid users in their research activities. Views included full sets of human and mouse samples, together with primary cell only, cell line only and tissues only. In addition, the flexibility of ZENBU allows researchers to modify and create their own visualization views on the FANTOM5 data and share them publicly or within a collaboration.
BioMart interface for the defined transcription start site regions
BioMart is a freely available, open source, and powerful query-oriented data management system. The BioMart system provides simple web browser interfaces and web services that allow a user to rapidly access an underlying database without knowledge of its data model. We customized the BioMart system to hold CAGE peak annotation data and sample annotation data for both human and mouse. The FANTOM5 BioMart provides researchers with a simple web interface for performing queries of the FANTOM5 CAGE peaks and samples. It holds 1,048,124 human and 652,860 mouse CAGE peaks for 889 human and 389 mouse samples. Each CAGE peak has multiple attributes representing various annotation properties, including gene association, repeat association, robust and permissive designations, TSS-like flags, and GENCODE association for human and Ensembl association for mouse.
Configuration of BioLayout
BioLayout Express 3D is an application that has been specifically designed for the integration, visualization, and analysis of large network graphs derived from biological data. It can be configured to a high degree in order to respond to the needs of various areas of research. The FANTOM5 BioLayout runs on a Java webstart program accessible from the FANTOM5 site. When the Java webstart application is launched BioLayout is opened with the input files that have been chosen as a default view describing our data collection. Nodes can be either samples or genes. BioLayout itself can be configured in order to provide access to other tools, such as SSTAR sample/gene searches or ZENBU experiment searches.
Table extraction tool
FANTOM5 expression data are primarily distributed in compressed tab-separated-value (TSV) file format, each file consisting of the full set of CAGE peaks (184,827 rows in human and 116,277 rows in mouse) and expression values over samples (975 columns in human and 399 columns in mouse). To assist in the data extraction process, we have created the FANTOM5 Table Extraction Tool (TET). TET is intended to be a simplified way of extracting relevant sections from a curated set of FANTOM5 data tables. Using TET, a user selects one of the FANTOM5 data sets, selects the columns to extract (that is, samples), specifies a set of rows (that is, CAGE peaks) using a regular expression search pattern, and finally views or downloads the resulting subset.
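As an illustration of the kind of extraction TET performs, the following is a minimal sketch in Python, not the TET implementation itself. It assumes a plain TSV table whose first column holds the CAGE peak identifier; the function name `extract_table`, the column names and the toy values are hypothetical.

```python
import csv
import io
import re


def extract_table(tsv_text, columns, row_pattern, id_column=0):
    """Return (header, rows) restricted to the requested sample columns and
    to rows whose identifier matches a regular expression."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # Always keep the identifier column, then the requested sample columns.
    keep = [id_column] + [header.index(name) for name in columns]
    pattern = re.compile(row_pattern)
    rows = [
        [row[i] for i in keep]
        for row in reader
        if pattern.search(row[id_column])
    ]
    return [header[i] for i in keep], rows


# Toy table: identifiers, sample names and values are made up for illustration.
example = (
    "peak_id\tsample_A\tsample_B\tsample_C\n"
    "p1@GENE1\t10\t0\t3\n"
    "p2@GENE1\t2\t5\t0\n"
    "p1@GENE2\t0\t7\t1\n"
)
header, rows = extract_table(example, ["sample_A", "sample_C"], r"@GENE1$")
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

Here the regular expression `@GENE1$` selects the two rows for the hypothetical gene GENE1, and only the two requested sample columns are returned alongside the identifier.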
When exposing nanopublications from FANTOM5, we followed a four-step process, as shown in Additional file 23. First, we examined the dataset to identify conceptual entities (for example, CAGE peaks, TSSs, genes) and assigned appropriate ontological descriptors. Second, we composed RDF triples and used the Vocabulary of Interlinked Datasets (VoID) to create a ‘naive’ data model describing the data structure of the FANTOM5 entities. Using VoID statements, we could convert the dataset to 'nanopublication compliant' RDF and give each entry in the dataset (for example, each row-column combination) a Uniform Resource Identifier (URI). For example, each row of the dataset is transformed to a CAGE peak web resource. Using the void:inDataset predicate, each CAGE peak is linked back to the resource for the entire dataset. Subsequent predicates connect the CAGE peak to entities that represent columns of the raw dataset.
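The ‘naive’ row-to-resource mapping described above can be sketched as follows. This is an illustrative example only: the base URI `http://example.org/fantom5/`, the per-column predicate scheme and the function names are assumptions, not the URIs or model used in FANTOM5. Only the `void:inDataset` predicate is taken from the VoID vocabulary.

```python
from urllib.parse import quote

VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"
# Hypothetical base URI, used here only for illustration.
BASE = "http://example.org/fantom5/"


def literal(value):
    """Format a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_triples(dataset_id, header, row):
    """Turn one table row into N-Triples lines describing a CAGE peak resource."""
    peak_uri = f"<{BASE}cage_peak/{quote(row[0], safe='')}>"
    dataset_uri = f"<{BASE}dataset/{quote(dataset_id, safe='')}>"
    # Link the peak back to the resource that represents the whole dataset.
    triples = [f"{peak_uri} <{VOID_IN_DATASET}> {dataset_uri} ."]
    # One predicate per remaining column of the raw table.
    for column, value in zip(header[1:], row[1:]):
        predicate = f"<{BASE}column/{quote(column, safe='')}>"
        triples.append(f"{peak_uri} {predicate} {literal(value)} .")
    return triples


triples = row_to_triples(
    "human_peaks",
    ["peak_id", "short_description"],
    ["peak_1", "p1@GENE1"],
)
for line in triples:
    print(line)
```

Each row yields one dataset-membership triple plus one triple per column, so every row-column combination is addressable through the peak URI and the column predicate.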
The third and most intellectually demanding step was to model the scientifically meaningful associations, the provenance metadata and publication information. This step uses the framework of the naive model to construct the actual nanopublication data model. When considering the FANTOM5 dataset, we developed several compelling proposals on how to model TSS-related assertions. As we worked through the models, we concluded that gene association should be a separate assertion (that is, a separate nanopublication) from the definition of a CAGE peak region as well as its expression. We generated three types of nanopublications: type I nanopublications make the link between CAGE peaks and the physical genome location; type II nanopublications make explicit the association that a particular CAGE peak is also a TSS region for a particular gene; type III nanopublications link the CAGE peaks to samples (that is, species, cell type) with the expression levels in those samples. This separation has several advantages. First, the process used to determine gene association is independent of the identification of CAGE peaks, so the provenance of gene association should be different from that of CAGE peak identification. Second, by separating the gene association from the CAGE peak assertion, we can easily release a new set of associations if the FANTOM consortium needs to repeat the gene association process with different sets of data and/or parameters, without redefining CAGE peaks. Third, it increases the granularity and reusability of the data, as others may use their own methods or data to assign gene associations to FANTOM5 CAGE peaks. In modeling the provenance and publication information elements of the nanopublications, we chose minimal models that simply reference the FANTOM5 Consortium. As they are used in this study, the nanopublications have a clear provenance, so the minimal model is sufficient and avoids unnecessary complications. However, as stand-alone publications the provenance could be elaborated upon, creating more ‘autonomous’ data with distinct advantages for maximizing citations or for tracking scientific impact.
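To make the structure concrete, the sketch below serialises a single nanopublication as TriG with the usual four named graphs (head, assertion, provenance and publication information) and a minimal attribution. It follows the general nanopublication layout rather than the FANTOM5 models themselves: all `example.org` URIs, the `isTssRegionOf` predicate and the function name `build_nanopub` are hypothetical.

```python
def build_nanopub(np_uri, assertion_triples, attributed_to):
    """Serialise one nanopublication as TriG with four named graphs."""
    head = f"<{np_uri}#Head>"
    assertion = f"<{np_uri}#assertion>"
    provenance = f"<{np_uri}#provenance>"
    pubinfo = f"<{np_uri}#pubinfo>"
    # The assertion graph carries the scientific statement itself.
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in assertion_triples)
    return f"""@prefix np: <http://www.nanopub.org/nschema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .

{head} {{
  <{np_uri}> a np:Nanopublication ;
    np:hasAssertion {assertion} ;
    np:hasProvenance {provenance} ;
    np:hasPublicationInfo {pubinfo} .
}}

{assertion} {{
{body}
}}

{provenance} {{
  {assertion} prov:wasAttributedTo <{attributed_to}> .
}}

{pubinfo} {{
  <{np_uri}> prov:wasAttributedTo <{attributed_to}> .
}}
"""


# A type II style assertion: a CAGE peak is a TSS region for a gene.
# Every URI below is a placeholder chosen for this example.
trig = build_nanopub(
    "http://example.org/np/1",
    [(
        "<http://example.org/cage_peak/peak_1>",
        "<http://example.org/vocab/isTssRegionOf>",
        "<http://example.org/gene/GENE1>",
    )],
    "http://example.org/org/FANTOM5_Consortium",
)
print(trig)
```

Because the gene association sits in its own assertion graph, a revised association can be issued as a new nanopublication that reuses the same CAGE peak URI, which is the design benefit argued for above.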
Lastly, we applied each of the three developed nanopublication models to instantiate the individual nanopublications as referenceable linked data resources. This involved writing a script to instantiate the triples that compose the nanopublications. These triples were initially exported as large RDF files, which were then uploaded to the triple store provided by the Database Center for Life Science (DBCLS). The triple store is an OpenLink Virtuoso OS 7.1 instance and provides the SPARQL endpoint that is required to run integration queries such as the one shown in the section above. The last step consisted of making the nanopublication URLs resolvable, which is encouraged by and in line with the principles of Linked Data. This was achieved by means of a virtual host redirect on the Apache web server and a small application that queries the triple store and returns the requested nanopublication as serialized RDF (in TriG format). An example of each type of nanopublication, as well as a direct link to the triple store, is available at .
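A resolver of this kind could, for instance, translate the requested nanopublication URL into a SPARQL query over its named graphs. The sketch below only builds the query and the HTTP GET request URL defined by the SPARQL protocol; it is not the application used for FANTOM5. The endpoint address, the nanopublication URI and the function names are placeholders, and it assumes the nanopublications are stored with a head graph that uses the `np:` schema predicates.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def nanopub_query(np_uri):
    """Build a SPARQL query returning every quad of one nanopublication."""
    return f"""PREFIX np: <http://www.nanopub.org/nschema#>
SELECT ?g ?s ?p ?o WHERE {{
  GRAPH ?head {{
    <{np_uri}> np:hasAssertion ?assertion ;
               np:hasProvenance ?provenance ;
               np:hasPublicationInfo ?pubinfo .
  }}
  GRAPH ?g {{ ?s ?p ?o }}
  FILTER (?g IN (?head, ?assertion, ?provenance, ?pubinfo))
}}"""


def request_url(endpoint, np_uri):
    """Encode the query as a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": nanopub_query(np_uri)})


# Placeholder endpoint and nanopublication URI, for illustration only.
url = request_url("http://example.org/sparql", "http://example.org/np/1")
print(url)

# Decoding the URL recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

The rows returned by such a query (graph, subject, predicate, object) contain everything needed to regroup the triples by graph and emit them as TriG.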
In writing these nanopublications, we surveyed existing ontologies. However, these were inadequate for our purposes and we decided to develop our own, such as the Reference Sequence Annotation (RSA) ontology, to fill the gap. We wanted the RSA to accommodate the basic CAGE region description as well as scenarios such as allowing a single annotation to be mapped onto different reference assemblies. This provided the mechanism to compare data between FANTOM4, FANTOM5, and others.
To provide the on-line resources for FANTOM5, we used nine physical servers and one virtual server for web applications, databases and file systems (not including the RDF store, Enhancer Selector tool and RIKENBASE). In total, we used approximately 120 Tbytes of hard disk space for storing data. We used existing software to host the data, and URLs of the source code are summarized in Additional file 24. All of the data are available at .
FANTOM5 was made possible by the following grants: Research Grant for RIKEN Omics Science Center from MEXT to Yoshihide Hayashizaki; Grant of the Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from the MEXT, Japan to Yoshihide Hayashizaki; Research Grant from MEXT to the RIKEN Center for Life Science Technologies; Research Grant to RIKEN Preventive Medicine and Diagnosis Innovation Program from MEXT to Yoshihide Hayashizaki. This publication was also supported by a grant from the John Templeton Foundation, EU’s Innovative Medicine Joint Undertaking under grant agreement number 115191 (Open PHACTS), the Novo Nordisk and Lundbeck Foundations, the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 305444 (RD-Connect), the Center for Medical Systems Biology within the framework of The Netherlands Genomics Initiative (NGI)/Netherlands Organisation for Scientific Research (NWO), an Institute Strategic Grant from the Biotechnology and Biological Sciences Research Council (BBSRC; grant number BB/JO1446X/1, BB/I001107/1), and the Director, Office of Science, Office of Basic Energy Sciences, of the US Department of Energy under contract number DE-AC02-05CH11231. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the John Templeton Foundation. The pictures in Figure 1 are provided by Gundula G Schulze-Tanzil (tenocyte), Anna Ehrlund (Adipocyte), RIKEN BRC (cell lines), and BodyParts 3D (tissues). We would like to thank all members of the FANTOM5 consortium for contributing to generation of samples and analysis of the data-set and thank GeNAS for data production. We would also like to thank Kang Li for working on the enhancer slider.
- 3. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, et al. The transcriptional landscape of the mammalian genome. Science 2005, 309:1559–1563.
- 28. FANTOM5 [http://fantom.gsc.riken.jp/5/]
- 43. Kasprzyk A: BioMart: driving a paradigm change in biological data management. Database (Oxford) 2011, 2011:bar049.
- 44. Semantic MediaWiki [http://semantic-mediawiki.org/]
- 45. Wikipedia [http://wikipedia.org/]
- 47. BioSemantics [http://rdf.biosemantics.org]
- 48. RIKENBASE [http://database.riken.jp]
- 52. FANTOM5: CD14+ Monocytes, donor1, [http://fantom.gsc.riken.jp/5/sstar/FF:11224-116B9]
- 53. Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, et al. BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res. 2011;39:W541–5.
- 55. Nanopub [http://nanopub.org/]
- 56. Bizer C, Heath T, Berners-Lee T. Linked Data - the story so far. Int J Semantic Web Inf Syst. 2009;5:1–22.
- 57. Linked Data [http://linkeddata.org/]
- 59. SPARQL [http://www.w3.org/TR/sparql11-query/]
- 60. Linked life data [http://linkedlifedata.com/]
- 66. SDRF2GRAPH [http://fantom.gsc.riken.jp/4/sdrf2graph]
- 76. Describing Linked Datasets with the VoID Vocabulary [http://www.w3.org/TR/void/]
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.