Bio-medical entity prioritisation based on literature with Semantic Web annotations
Keywords: Atopic Dermatitis, Mouse Gene, Search Path, Logical Query, Semantic Link
An extension of the General and Rapid Association Study Engine (GRASE), a Semantic Web data (entity) prioritisation engine, is discussed. The GRASE prioritises entities using entity-document relations: it computes the statistical significance of the association between each entity and the user's keywords from the number of related documents. We describe how PubAnnotation can improve both prioritisation accuracy and connectivity to the Semantic Web.
In life-science data analysis, prioritisation of entities among a large number of candidates is an important task. The General and Rapid Association Study Engine (GRASE) introduced here is a Semantic Web data prioritisation engine. The GRASE was originally used for causative gene search in a RIKEN mouse ENU-mutagenesis program, in which 65 genes were successfully ranked highly through its web interface, called PosMed [3]. The supported data has since been extended and used successfully to discover bioresources in mice and Arabidopsis. The unique characteristic of GRASE prioritisation is its use of entity-document relations, where the documents include MEDLINE abstracts. This mechanism has the advantage that each prioritised entity can be shown with related documents as evidence, and that entities can be searched even if their descriptions are not given as Semantic Web data. However, the precision of such functions depends strongly on the accuracy of the entity-document relations. In the following, we examine the GRASE with a PosMed example and propose an extension using PubAnnotation [4] to improve accuracy and provide further support for Semantic Web data.
Statistical prioritisation of GRASE
The core functions of the GRASE are (1) keyword-entity search and (2) entity-entity search, which includes both (2.a) literature co-citation search and (2.b) semantic link search.
For (1) keyword-entity search, the GRASE builds a 2 × 2 contingency table for each entity, consisting of the number of documents where (a) both the entity and the keyword appear, (b) the keyword appears but the entity does not, (c) the entity appears but the keyword does not and (d) neither the entity nor the keyword appears. Fisher's exact test is then applied to the contingency table, and the resulting P-value is taken as the statistical significance.
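The contingency-table test described above can be illustrated with a minimal Python sketch. It is not the GRASE implementation: the function name `fisher_exact_greater` is hypothetical, and the choice of a one-sided test (over-representation of co-occurrence) is an assumption, since the text does not state which alternative hypothesis is used.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test P-value for a 2 x 2 table.

    a: documents mentioning both the entity and the keyword
    b: documents mentioning the keyword but not the entity
    c: documents mentioning the entity but not the keyword
    d: documents mentioning neither

    Assumption: the one-sided ("greater") alternative is used, i.e. the
    probability of observing at least `a` co-occurrences by chance.
    """
    keyword_docs = a + b      # row total: documents with the keyword
    entity_docs = a + c       # column total: documents with the entity
    non_entity_docs = b + d   # column total: documents without the entity
    total = a + b + c + d

    # Sum the hypergeometric weights of all tables with k >= a
    # co-occurrences and the same row and column totals.
    k_max = min(keyword_docs, entity_docs)
    numerator = sum(
        comb(entity_docs, k) * comb(non_entity_docs, keyword_docs - k)
        for k in range(a, k_max + 1)
    )
    return numerator / comb(total, keyword_docs)


# Illustrative counts only (not taken from PosMed).
p_value = fisher_exact_greater(3, 1, 1, 3)
```

A smaller P-value indicates that the entity and the keyword co-occur in more documents than would be expected by chance, so entities can be ranked in ascending order of this value.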
Next, (2.b) entity-entity search based on the semantic links, denoted entity (2.b)→ entity, is performed on the result of (1). For each result of (1), the GRASE discovers mouse bioresources linked from the resulting mouse gene, and the P-value of this semantic-link step is given as 0. Therefore, the total P-value of the search path is the P-value of (1). Finally, mouse bioresources are ranked by the P-value of their paths.
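The ranking of bioresources by path P-value can be sketched as follows. This is an illustration under stated assumptions rather than the GRASE code: the function name, the gene and strain identifiers, and the P-values are all hypothetical, and keeping the smallest P-value when a bioresource is reachable from several genes is an assumption that the text does not specify.

```python
def rank_bioresources(gene_p_values, gene_to_bioresources):
    """Rank bioresources by the P-value of the path that reaches them.

    gene_p_values: gene -> P-value from the keyword-entity search (1)
    gene_to_bioresources: gene -> list of bioresources linked from the gene

    The path P-value is taken to be the gene's P-value, as described in
    the text. Assumption: if several genes link to the same bioresource,
    the path with the smallest P-value is kept.
    """
    best = {}  # bioresource -> (path P-value, gene on that path)
    for gene, p_value in gene_p_values.items():
        for resource in gene_to_bioresources.get(gene, []):
            if resource not in best or p_value < best[resource][0]:
                best[resource] = (p_value, gene)

    # Ascending P-value; the resource name breaks ties deterministically.
    return sorted(
        ((resource, p, gene) for resource, (p, gene) in best.items()),
        key=lambda row: (row[1], row[0]),
    )


# Hypothetical input data for illustration only.
ranking = rank_bioresources(
    {"geneA": 1e-6, "geneB": 0.02},
    {"geneA": ["strain1"], "geneB": ["strain1", "strain2"]},
)
```

Each returned row keeps the gene through which the bioresource was reached, so the documents supporting that gene can still be shown as evidence.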
Association between literature and entities
In the current implementation, entity-document relations are obtained by full-text search with human-curated logical queries defined as a list of entity names concatenated with logical operators such as AND, OR and NOT. For example, the logical query for the Arabidopsis gene AT1G03880 (cruciferin B, CRB) is defined as follows:
(AT1G03880 OR CRU2 OR CRB OR 'CRUCIFERIN 2' OR 'CRUCIFERIN B')
AND (Arabidopsis) NOT ('chloroplast RNA binding').
Logical queries for mouse, human and Arabidopsis genes have already been curated manually.
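To make the semantics of the AT1G03880 query concrete, it can be read as a predicate over the text of a document. The following Python sketch is only an illustration: the function names are hypothetical, and case-insensitive, whole-term matching is an assumption about how the full-text search behaves.

```python
import re


def contains(text, term):
    """True if `term` occurs in `text` as a whole term, ignoring case.

    Assumption: matching is case-insensitive and does not match inside
    longer words; the actual full-text search engine may differ.
    """
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_at1g03880(text):
    """Evaluate the logical query for AT1G03880 against one document."""
    names = ["AT1G03880", "CRU2", "CRB", "CRUCIFERIN 2", "CRUCIFERIN B"]
    return (
        any(contains(text, name) for name in names)          # (... OR ...)
        and contains(text, "Arabidopsis")                     # AND (...)
        and not contains(text, "chloroplast RNA binding")     # NOT (...)
    )


# Hypothetical document texts for illustration only.
hit = matches_at1g03880("Expression of CRB was measured in Arabidopsis.")
miss = matches_at1g03880(
    "The chloroplast RNA binding protein CRB of Arabidopsis was studied."
)
```

The NOT clause excludes documents in which the abbreviation CRB refers to a different concept, which is how a curated query reduces false-positive entity-document relations.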
Here, we have discussed GRASE statistical prioritisation search, which ranks entities using entity-document relations. Although our curation method for obtaining entity-document relations is effective for frequently updated, arbitrary document sets, it may still produce false-positive errors and cannot be extended to additional entities. Open collections of literature annotations, such as PubAnnotation, can be used to address this problem, allowing GRASE search to be performed more precisely. Another advantage of introducing PubAnnotation is the ability to show documents with detailed annotations and semantic links.
Therefore, future work includes implementing an interface to input/output PubAnnotation data. More concretely, we would like to introduce PubAnnotation data as accurate entity-document relations in addition to our query approach and implement functions to display and download documents that are related to the resultant entity in PubAnnotation format.
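As a rough sketch of how PubAnnotation data could supply entity-document relations, the following Python function reads one annotation record and groups the annotated mentions by entity and document. The field names (`sourcedb`, `sourceid`, `text`, `denotations`, `span`, `obj`) follow the PubAnnotation JSON format as we understand it and should be checked against its documentation; the function name and the sample record, including its document identifier, are hypothetical.

```python
def entity_document_relations(annotation):
    """Collect entity-document relations from one annotated document.

    Returns a dict mapping (entity, document id) to the list of text
    mentions annotated with that entity.

    Assumption: `annotation` is a dict in PubAnnotation-style JSON, with
    each denotation carrying a character `span` and an `obj` that
    identifies the annotated entity.
    """
    document_id = f"{annotation['sourcedb']}:{annotation['sourceid']}"
    text = annotation["text"]

    relations = {}
    for denotation in annotation.get("denotations", []):
        span = denotation["span"]
        mention = text[span["begin"]:span["end"]]
        key = (denotation["obj"], document_id)
        relations.setdefault(key, []).append(mention)
    return relations


# Hypothetical record for illustration; the source id is a placeholder.
sample = {
    "sourcedb": "PubMed",
    "sourceid": "00000000",
    "text": "CRB was studied in Arabidopsis.",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 3}, "obj": "AT1G03880"},
    ],
}
relations = entity_document_relations(sample)
```

Because each relation retains the annotated mention, the supporting text can be displayed alongside the prioritised entity.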
Part of this research was supported by RIKEN Incentive Research Projects. We thank Dr. Hiroshi Masuya for his useful comments regarding bioresource ranking.
- 3. Makita Y, Kobayashi N, Yoshida Y, Doi K, Mochizuki Y, Nishikata K, Matsushima A, Takahashi S, Ishii M, Takatsuki T, Bhatia R, Khadbaatar Z, Watabe H, Masuya H, Toyoda T: PosMed: Ranking genes and bioresources based on Semantic Web Association Study. Nucleic Acids Res. 2013, 41 (Web Server): W109-W114.
- 4. PubAnnotation: Share your annotation in alignment with others. [http://pubannotation.org/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.