Nanopublications for exposing experimental data in the life-sciences: a Huntington’s Disease case study
High-throughput experiments often produce far more results than can ever appear in the main text or tables of a single research article. In these cases, the majority of new associations are archived either as supplemental information in an arbitrary format or in publisher-independent databases that can be difficult to find. These data are not only lost from scientific discourse, but are also elusive to automated search, retrieval and processing. Here, we use the nanopublication model to make the scientific assertions concluded from a workflow analysis of Huntington’s Disease data machine-readable, interoperable, and citable. We followed the nanopublication guidelines to semantically model our assertions as well as their provenance metadata and authorship. We demonstrate interoperability by linking nanopublication provenance to the Research Object model. These results indicate that nanopublications can provide an incentive for researchers to expose data in an interoperable and machine-readable form for future use and preservation, and to receive credit for that effort. Nanopublications can play a leading role in hypothesis generation, offering opportunities for large-scale data integration.
Keywords: Huntington’s disease, Nanopublication, Provenance, Research object, Workflows, Interoperability, Data integration
The large amount of scientific literature in the field of biomedical sciences makes it impossible to manually access and extract all relevant information for a particular study. This problem is mitigated somewhat by text mining techniques on scientific literature and the availability of public online databases containing (supplemental) data. However, many problems remain with respect to the availability, persistence and interpretation of the essential knowledge and data of a study.
Text mining techniques allow scientists to mine relations from vast numbers of abstracts and extract explicitly defined information, or even implicit information [2, 3]. Because most of these techniques are limited to mining abstracts, it is reasonable to assume that information in tables, figures and supplementary material is overlooked. Moreover, a recent attempt to mine the literature for mutations stored in databases showed very low coverage of the mutations described in full text and supplemental information.
This is partly remedied by making data public via online databases. However, this by itself does not guarantee that data can be readily found, understood and used in computational experiments. This is particularly problematic at a time when more, and larger, datasets are produced that will never be fully published in traditional journals. Moreover, there is no well-defined standard by which scientists can get credit for the curation effort that is typically required to make a discovery and its supporting experimental data available in an online database. We argue that attribution and provenance are important to ensure trust in the findings and interpretations that scientists make public. Additionally, a sufficiently detailed level of attribution provides an incentive for scientists, curators and technicians to make experimental data available in an interoperable and re-usable way. The Nanopublication data model was proposed to take all these issues into consideration. The nanopublication guidelines document provides details of the nanopublication schema and recommendations for constructing nanopublications from Life Science data. Based on Semantic Web technology, the nanopublication model is a minimal model for publishing an assertion together with attribution and provenance metadata.
The assertion graph contains the central statement that the author considers valuable (publishable) and for which she would like to be cited (attribution). It should be kept as small as possible, in accordance with the guidelines. The provenance graph is used to provide evidence for the assertion. It is up to the author to decide how much provenance information to give, but in general, more provenance will increase the trustworthiness of the assertion, and thus the value of the nanopublication. The publication info graph provides detailed information about the nanopublication itself: creation date, licenses, authors and other contributors can be listed there. Attribution to curators and data modelers is part of the nanopublication design to incentivize data publishing.
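The three-graph structure just described can be sketched in code. The following is a minimal illustration only; all URIs are hypothetical placeholders, not those of the nanopublications published in this study.

```python
# Minimal sketch of the three named graphs of a nanopublication.
# Every URI below is a hypothetical placeholder.

NP = "http://example.org/np1"

nanopub = {
    # Assertion graph: the single publishable claim, kept as small as possible.
    NP + "#assertion": [
        ("http://example.org/gene/X",
         "http://example.org/isAssociatedWith",
         "http://example.org/HuntingtonsDisease"),
    ],
    # Provenance graph: evidence for the assertion (here, a workflow run).
    NP + "#provenance": [
        (NP + "#assertion",
         "http://www.w3.org/ns/prov#wasDerivedFrom",
         "http://example.org/workflow-run/42"),
    ],
    # Publication info graph: metadata about the nanopublication itself.
    NP + "#pubinfo": [
        (NP, "http://purl.org/dc/terms/created", "2013-06-01"),
        (NP, "http://purl.org/dc/terms/creator",
         "http://orcid.org/0000-0000-0000-0000"),
    ],
}

def to_trig(np):
    """Serialize the named graphs to TriG-like text (URIs in <>, literals quoted)."""
    out = []
    for graph, triples in np.items():
        out.append("<%s> {" % graph)
        for s, p, o in triples:
            obj = "<%s>" % o if o.startswith("http") else '"%s"' % o
            out.append("  <%s> <%s> %s ." % (s, p, obj))
        out.append("}")
    return "\n".join(out)

print(to_trig(nanopub))
```

In practice such graphs are serialized with an RDF library in a quad-aware format such as TriG or N-Quads; the hand-rolled serializer above only shows the shape of the data.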
We used the nanopublication schema to model scientific results from an in-silico experiment. Previously, Beck et al. modelled GWAS data stored in the GWAS Central database as nanopublications and demonstrated how such valuable information can be incorporated into the Linked Data web to assist the formation of new hypotheses and interesting findings. In our experiment we investigated the relation between gene deregulation in Huntington’s disease and epigenetic features that might be associated with transcriptional abnormalities (E. Mina et al., manuscript in preparation).
We show how the results of this case study can be represented as nanopublications and how this promotes data integration and interoperability.
Huntington’s Disease as case study for modelling scientific results into nanopublications
Huntington’s Disease (HD) is a dominantly inherited neurodegenerative disease that affects 1-10 per 100,000 individuals, making it the most common inherited neurodegenerative disorder. Despite the fact that the genetic cause of HD was identified as early as 1993, no cure has yet been found and the exact mechanisms that lead to the HD phenotype are still not well understood. Gene expression studies revealed massive changes in the HD brain that take place even before the first symptoms arise. There is evidence for altered chromatin conformation in HD that might explain these changes. We chose to analyse two datasets associated with epigenetic regulation, concerning CpG islands in the human genome and chromatin marks mapped across nine cell types. Identifying genes that are deregulated in HD and associated with these regions can give insight into chromatin-associated mechanisms potentially at play in this disease.
Our analysis was implemented as workflows in the Taverna workflow management system [12, 13]. As input we used gene expression data from three different brain regions of HD-affected individuals and age- and sex-matched controls. We tested for differential gene expression (DE) between controls and HD samples in the most highly affected brain region, the caudate nucleus, and integrated these data with the two epigenetic datasets discussed previously, which are publicly available via the UCSC Genome Browser [15, 16].
HD is a devastating disease and no cure has yet been found to treat it or slow disease progression. Research in this domain therefore focuses mainly on producing new data and investing in expensive experiments. It is important to realize that sharing information is essential for developing new hypotheses that can tackle difficult cases such as HD. Because the results of previous experiments often cannot be found online using common biomedical search engines, expensive experiments become lost and are unnecessarily replicated. For example, in our case study we found that the association we inferred between the HTT gene, whose mutant form causes Huntington’s Disease, and BAIAP2, a brain-specific angiogenesis inhibitor (BAI1)-binding protein, was present in a table in a paper by Kaltenbach et al. However, it is not explicitly mentioned in any abstract, which makes it hard to retrieve from systems such as PubMed.
Results and Discussion
Nanopublication model design principles
We decided to model and expose as nanopublications two assertions from the results of our workflow: 1) differentially expressed genes in HD and 2) genes that overlap with a particular genomic region that is associated with epigenetic regulation. Note that such natural language statements would typically be used in a caption for a figure, table or supplemental information section to describe a dataset in a traditional publication. Considering the problems with automatic retrieval and interpretation of such data, we aim to expose these assertions in a way that is more useful to other scientists (for example to integrate our results with their own data). Moreover, we provide provenance containing the origin and experimental context of the data in order to increase trust and confidence. Our nanopublications are stored in the AllegroGraph triple store. The link to the browsable user interface and the SPARQL endpoint can be found on the myExperiment link: http://www.myexperiment.org/packs/622.html. Users can browse the nanopublications by logging in with username “test” and password “tester”. The queries used in this paper are stored under the menu “Queries → Saved”.
We defined two natural language statements that we wish to convert to RDF:
“gene X is associated with HD, because it was found to be deregulated in HD” and “gene Y is associated with a promoter, and this promoter overlaps with a CpG island and/or a particular chromatin state”, and we wish to refer to the experiment by which we found these associations. We decided to model our results as two nanopublications. By further subdividing those statements, we see the RDF triple relations appear naturally:
Nanopublication assertion 1:
There is a gene disease association that refers_to gene X and Huntington’s Disease
Nanopublication assertion 2:
There is a genomic overlap association that refers_to the promoter of gene Y and a CpG island or chromatin state
For some of the terms in these statements we found several ontologies that defined classes for them. For example, “promoter”, “gene”, and “CpG island” appear (among others) in the following ontologies: the NIF Standard ontology (NIFSTD), the NCI Thesaurus (NCI) and the Gene Regulation Ontology (GRO). We chose to use NIFSTD for our case study, because it covers an appropriate domain and it uses the Basic Formal Ontology (BFO), which can benefit data interoperability and OWL reasoning (e.g. for checking inconsistencies).
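To make the pattern concrete, assertion 1 above can be written as three subject-predicate-object triples. The sketch below uses readable placeholder IRIs; the actual NIFSTD class and SIO predicate IRIs used in the published nanopublications differ.

```python
# Sketch of nanopublication assertion 1 as RDF triples.
# The class and predicate IRIs are illustrative placeholders, not the
# NIFSTD/SIO terms actually used in the study; only DOID_12858
# (Huntington's disease in the Disease Ontology) is a real identifier.

SIO = "http://semanticscience.org/resource/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

assertion_1 = [
    # "There is a gene disease association ..."
    ("_:association1", RDF_TYPE, SIO + "gene-disease-association"),
    # "... that refers_to gene X ..."
    ("_:association1", SIO + "refers-to", "http://example.org/gene/X"),
    # "... and Huntington's Disease"
    ("_:association1", SIO + "refers-to",
     "http://purl.obolibrary.org/obo/DOID_12858"),
]
```

The blank node `_:association1` stands for the association itself, so that further metadata (such as the direction of deregulation) can later be attached to it.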
Definition of new classes:
- Chromatin region: a region of chromatin, likely to be involved in a biological process.
- Chromatin state: annotation of chromatin states, defined by combinations of chromatin modification patterns (described in the publication by Ernst et al., Nature, 2011).
- Active promoter: open chromatin region, associated with promoters, transcriptionally active; defined by the most highly observed chromatin marks: H3K4me2, H3K4me3, H3K27ac.
- Weak promoter: open chromatin region, associated with promoters, with weak transcription activity; defined by the most highly observed chromatin marks: H3K4me1, H3K4me2, H3K4me3.
- Poised promoter: open chromatin region, associated with promoters, described as a bivalent domain with strong signals of both active and inactive histone marks; most highly observed histone marks: H3K27me3, H3K4me2, H3K4me3.
- Heterochromatin: closed chromatin formation, transcriptionally inactive; not associated with histone marks.
For the predicates we considered using the Relation Ontology. Extending the Relation Ontology with the appropriate predicates would support interoperability and reasoning in the long term, because of its use of BFO. However, we found that the OWL domain and range specifications did not match our statements. Therefore, we decided to use predicates from the popular Sequence Ontology (SO) and Semanticscience Integrated Ontology (SIO) that also seemed appropriate for our assertions. This is a typical trade-off between quality and effort that we expect nanopublishers will have to make frequently. We can justify this choice for two reasons: 1) releasing experimental data as linked open data using any standard ontology is already an important step forward from current practice, and 2) interoperability issues at the ontology level are a shared responsibility with ontology developers and curators, who provide mappings between ontologies and to higher-level ontologies.
The effort of nanopublication modeling can be minimized when previous examples are used as templates for similar data. For instance, the nanopublication models presented here can serve as templates for exposing differentially expressed genes in a disease condition. We demonstrate the reuse of our own template of Figure 2 by exposing five types of nanopublications concerning genomic overlap. The reuse of templates improves the interoperability of scientific results beyond what RDF already provides: it facilitates crafting assertions while ensuring that the same URIs are used for the same type of data.
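Template reuse amounts to instantiating the same graph shape with different resources. The sketch below illustrates this with a hypothetical template function; all URIs, including the predicate names, are placeholders rather than the terms used in the actual templates.

```python
# Sketch of template reuse: one function stamps out nanopublications of the
# same shape for different genes, ensuring the same URIs are used for the
# same type of data. All URIs are hypothetical placeholders.

def de_gene_nanopub(np_id, gene_uri, disease_uri, workflow_uri):
    """Instantiate the 'gene is differentially expressed in disease' template."""
    base = "http://example.org/np/%d" % np_id
    return {
        base + "#assertion": [
            (gene_uri, "http://example.org/isAssociatedWith", disease_uri),
        ],
        base + "#provenance": [
            (base + "#assertion",
             "http://www.w3.org/ns/prov#wasGeneratedBy", workflow_uri),
        ],
        base + "#pubinfo": [
            (base, "http://purl.org/dc/terms/creator",
             "http://orcid.org/0000-0000-0000-0000"),
        ],
    }

hd = "http://example.org/disease/HD"
wf = "http://www.myexperiment.org/packs/622"
nps = [de_gene_nanopub(i, g, hd, wf)
       for i, g in enumerate(["http://example.org/gene/HTT",
                              "http://example.org/gene/BAIAP2"])]
```

Because every instantiation goes through the same function, all resulting nanopublications share the same predicates and graph layout, which is exactly what makes them queryable as a set.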
Published information is meaningful only if there is enough supporting information to reproduce it. For example, Ioannidis et al. pointed out that they could not reproduce the majority of the 18 articles they investigated describing results from microarray experiments, including selected tables and figures. The nanopublication model does not guarantee full reproducibility, but as a model for combining data with attribution and provenance in a digital format it at least makes it possible to trace the origin of scientific results. The provenance section of a nanopublication ties the results (the nanopublication assertion) to a description of an experiment and the associated materials, conditions and methods. The main purpose is to capture as accurately as possible where the assertion came from and what the conditions of the experiment were, by aggregating and annotating resources that were used throughout the experiment.
In our case the experiment is in silico: a workflow process that combines existing data sources to expose new associations. Details of and references to the original datasets, the workflow process itself and the final workflow output are valuable provenance, as they increase trust in the assertion and make it possible to trace back the results of the experiment. An extra benefit of using workflows is that provenance information can be generated automatically by the workflow system, and additional tools can be used to associate a workflow with additional metadata and resources. We used Taverna to build and execute our workflows. Taverna provides an option to export the provenance of a workflow execution in PROV-O.
Extending nanopublication provenance with the Research Object model
With the nanopublication provenance model as a starting point we further enhance provenance with a model that has been developed for bundling workflows with additional resources in the form of workflow-centric Research Objects.
Additional resources may include documents, input and output data, annotations, provenance traces of past executions of the workflow, and so on. Research Objects enable in silico experiments to be preserved, such that peers can evaluate the method that led to certain results, and a method can be more easily reproduced and reused. Similar to nanopublications, the Research Object model is grounded in Semantic Web technologies. It comprises a core ontology and extension ontologies. The core ontology reuses the Annotation Ontology (AO) and the Object Reuse and Exchange (ORE) model to provide annotation and aggregation of the resources. The extension ontologies keep track of the results and methods of a workflow experiment (wfprov), provide descriptions of scientific workflows (wfdesc) and capture the RO evolution process (roevo) (Belhajjame K, Zhao J, Carijo D, Hettne KM, Palma R, Mina E, Corcho O, Gómez-Pérez JM, Bechhofer S, Klyne G, Goble C: Using a suite of ontologies for preserving workflow-centric research objects, accepted for publication, Journal of Web Semantics). Research Objects extend the already existing functionality of myExperiment packs. We created Research Objects using the Research Object repository sandbox, which offers a user-friendly interface for creating Research Objects by importing an existing pack from myExperiment, uploading a .zip archive, or creating a Research Object manually.
In the Publication Info section of a nanopublication we capture details that are required for citation and usage of the nanopublication itself. The authors of the nanopublication and possible contributors are described here, each represented by a unique researcher identifier to account for author ambiguity. The timestamp of the nanopublication’s creation is also recorded in this part, as well as versioning details. Finally, information about usage rights and license holders is included.
Data integration using nanopublications: assisting drug target prioritization in HD
By choosing RDF as the exchange format for nanopublications, we also support the data integration features of RDF. In HD research, diverse working groups draw on a variety of disciplines that produce data encompassing brain images, gene expression profiles in brain and peripheral tissues, genetic variation, epigenome data, etcetera, with the common goal of identifying biomarkers to monitor disease progression or the effectiveness of therapies. Nanopublications provide an incentive to expose these data such that they can more easily be integrated with each other to assist HD research by creating novel hypotheses. These hypotheses can be further tested and ultimately help the development of effective treatments. Following standardized templates to model information ensures data interoperability that can facilitate complex queries for discovering new information. In addition, the attached provenance of the assertion gives the necessary information about the experiment, both to ensure trust and to allow the scientific protocol to be reused and the results replicated. Moreover, we note that nanopublications enable opportunities for data integration beyond the assertions and the experimental data itself. Since the nanopublication provenance relates to the methodology/protocol that was used as part of an experiment, it allows us to retrieve all other published nanopublications based on our workflow, or on workflows related to it (e.g. because they use the same kind of input data). Such (indirect) provenance links greatly improve the discoverability of research data. Another option is to use the information stored in the Publication Info graph and retrieve the attribution information for the nanopublication. This makes it relatively easy to determine the most frequently cited nanopublication creators and authors, for example in order to calculate some kind of impact factor.
To demonstrate how data integration with nanopublications can occur in practice, we applied simple SPARQL queries to our local nanopublication store, for example to identify differentially expressed genes whose promoters are associated both with a CpG island and a poised chromatin state. The resulting genes can be a starting point for further research, as they may be indicators of an epigenetically mediated gene alteration. The set of canned queries is stored in our nanopublication store for the user to browse and execute (details for accessing the nanopublication store were given earlier in this paper).
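The logic of this example query can be sketched in plain Python as an intersection over three hypothetical result sets; in the actual store the same join is expressed as a SPARQL query over the nanopublication graphs. The gene symbols below are illustrative only, not results from the study.

```python
# Pure-Python sketch of the integration query: find differentially expressed
# genes whose promoters overlap both a CpG island and a poised chromatin
# state. All gene sets are hypothetical placeholders.

differentially_expressed = {"HTT", "BAIAP2", "GENE3", "GENE4"}
promoter_in_cpg_island   = {"BAIAP2", "GENE3", "GENE5"}
promoter_in_poised_state = {"GENE3", "GENE4", "GENE5"}

# The SPARQL query joins the three nanopublication result sets; in set
# terms this join is simply an intersection.
candidates = (differentially_expressed
              & promoter_in_cpg_island
              & promoter_in_poised_state)
print(sorted(candidates))  # -> ['GENE3']
```

Each of the three sets corresponds to one type of nanopublication assertion; because all assertions of a type follow the same template, the join keys (the gene URIs) line up without any data conversion.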
Nanopublications can also facilitate more sophisticated queries to support data integration. Note that traditional methods of querying integrated data would typically require converting data sources to a common format or database and writing one or more queries and scripts to solve the actual question. Such an approach can be complex and time-consuming, while the resulting code and data are not necessarily reusable to answer other research questions. Here we show a complex data integration question that can be answered using a relatively short SPARQL query.
Drug target results from the data integration query (Table 2) include, among others: Adenosine triphosphate (ATP); proteasomal protein catabolic process.
The query ran within 15 seconds and retrieved the results given in Table 2. A detailed discussion of the results is beyond the scope of this paper. However, we note that the effort required to integrate these data is relatively small: aside from loading the data sources, constructing the query took only about an hour. Moreover, as indicated in Figure 6, the query itself is modular, consisting of specific sections related to specific datasets; extending the query to include other datasets is therefore not difficult.
To date, an enormous amount of valuable information produced by expensive experiments remains lost in databases and other repositories that are not easily accessed or processed automatically. This results not only in the replication of experiments that have already been performed, but also prevents those associations from being tested or reused to build new hypotheses. This paper presents a method that enables life scientists to (i) expose the results of an analysis as scientific assertions, (ii) claim these as their contribution and (iii) provide provenance of the analysis as reference for the claimed assertions. We demonstrated an example from research in Huntington’s Disease. In addition, we presented examples of nanopublication integration in the context of HD, and of how nanopublications can facilitate more sophisticated queries integrating datasets from different research domains. The models for these nanopublications can be used as templates to create similar nanopublications, while the extension to the Research Object model can also be used to aggregate resources from experiments that do not involve scientific workflows. Nanopublication provides an incentive for scientists to expose the results of individual experiments and make them available for future exploitation. This ultimately facilitates research across datasets that we anticipate will provide new insights into disease mechanisms. Research can become more efficient and go beyond the monolithic journal publication.
The work presented in this paper is supported by grants received from the Netherlands Bioinformatics Centre (NBIC) under the BioAssist program, the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreements No. 305444 (RD-Connect; HEALTH.2012.2.1.1-1-C) and No. 270129 (Wf4Ever; ICT-2009.4.1), and the IMI-JU project Open PHACTS (grant agreement No. 115191).
- 1. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, et al.: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 2011, 39(Database):D561-D568. http://www.nar.oxfordjournals.org/cgi/doi/10.1093/nar/gkq973
- 2. Jelier R, Schuemie MJ, Roes PJ, van Mulligen EM, Kors JA: Literature-based concept profiles for gene annotation: the issue of weighting. Int J Med Inform 2008, 77(5):354-362. http://www.sciencedirect.com/science/article/pii/S1386505607001372
- 4. Yepes AJ, Verspoor K: Towards automatic large-scale curation of genomic variation: improving coverage based on supplementary material. In BioLINK SIG 2013; Berlin, Germany; 2013.
- 5. Guidelines for nanopublication. http://nanopub.org/guidelines/working_draft/
- 6. Beck T, Free RC, Thorisson GA, Brookes AJ: Semantically enabling a genome-wide association study database. J Biomed Semantics 2012, 3:9. http://www.jbiomedsem.com/content/3/1/9/abstract
- 7. Landles C, Bates GP: Huntingtin and the molecular pathogenesis of Huntington’s disease. EMBO Rep 2004, 5(10):958-963. http://www.nature.com/doifinder/10.1038/sj.embor.7400250
- 8. Cha JHJ: Transcriptional dysregulation in Huntington’s disease. Trends Neurosci 2000, 23(9):387-392. http://www.sciencedirect.com/science/article/pii/S016622360001609X
- 9. Thomas EA, Coppola G, Desplats PA, Tang B, Soragni E, Burnett R, et al.: The HDAC inhibitor 4b ameliorates the disease phenotype and transcriptional abnormalities in Huntington’s disease transgenic mice. Proc Natl Acad Sci USA 2008, 105(40):15564. http://www.pnas.org/content/105/40/15564.short
- 11. Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, et al.: Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 2011, 473(7345):43-49. http://www.nature.com/doifinder/10.1038/nature09906
- 13. Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, et al.: The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res 2013. [PMID: 23640334]
- 14. Hodges A, et al.: Regional and cellular gene expression changes in human Huntington’s disease brain. Hum Mol Genet 2006, 15(6):965-977. http://www.hmg.oxfordjournals.org/cgi/doi/10.1093/hmg/ddl013
- 15. UCSC Genome Browser home. http://genome.ucsc.edu/
- 16. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al.: The human genome browser at UCSC. Genome Res 2002, 12(6):996-1006. http://genome.cshlp.org/content/12/6/996
- 18. AllegroGraph RDFStore. http://www.franz.com/agraph/allegrograph/
- 20. SIO: the Semanticscience Integrated Ontology. https://code.google.com/p/semanticscience/wiki/SIO
- 22. Taverna workflow provenance. http://www.w3.org/2011/prov/wiki/TavernaProvenance
- 23. Belhajjame K, Corcho O, Garijo D, Zhao J, Missier P, Newman D, et al.: Workflow-centric research objects: first class citizens in scholarly discourse. In Proceedings of the Workshop on Semantic Publishing (SePublica); Crete, Greece; 2012. http://users.ox.ac.uk/~oerc0033/preprints/sepublica2012.pdf
- 24. Zhao J, Klyne G, Holubowicz P, Palma R, Soiland-Reyes S, Hettne K, et al.: RO-Manager: a tool for creating and manipulating research objects to support reproducibility and reuse in sciences. In Proceedings of the 2nd International Workshop on Linked Science; 2012. https://www.escholar.manchester.ac.uk/uk-ac-man-scw:192016
- 25. OWL-S: Semantic markup for web services. http://www.w3.org/Submission/2004/SUBM-OWL-S-20041122/
- 26. Imarisio S, Carmichael J, Korolchuk V, Chen CW, Saiki S, Rose C, et al.: Huntington’s disease: from pathology and genetics to potential therapies. Biochem J 2008, 412(2):191-209.
- 27. Carmichael J, Rubinsztein DC: Huntington’s disease: molecular basis of neurodegeneration. Expert Rev Mol Med 2003, 5(20):1-21.
- 28. Pajouhesh H, Lenz GR: Medicinal chemical properties of successful central nervous system drugs. NeuroRx 2005, 2(4):541-553. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1201314/ [PMID: 16489364, PMCID: PMC1201314]
- 29. Kegel KB, Sapp E, Alexander J, Reeves P, Bleckmann D, Sobin L, et al.: Huntingtin cleavage product A forms in neurons and is reduced by gamma-secretase inhibitors. Mol Neurodegener 2010, 5:58. http://www.molecularneurodegeneration.com/content/5/1/58/abstract [PMID: 21156064]
- 30. van Harmelen F, Kampis G, Börner K, van den Besselaar P, Schultes E, Goble C, et al.: Theoretical and technological building blocks for an innovation accelerator. Eur Phys J Spec Topics 2012, 214:183-214.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.