Towards Knowledge-Based Life Science Publication Repositories

Nováček, Vít; Groza, Tudor; Handschuh, Siegfried

doi:10.1007/978-1-4419-5908-9_9

Chapter
First Online: 01 January 2010

Part of the book series: Annals of Information Systems (AOIS, volume 11)

Abstract

Although online scientific publishing is a flourishing field, its search services exploit mostly raw publication data (rather meaningless bags of words) and shallow metadata (authors, keywords, citations, etc.). Economical, large-scale exploitation of the knowledge implicitly contained in publication texts is still largely uncharted territory. Filling this gap requires (1) extracting asserted publication metadata together with the knowledge implicitly present in the respective text; (2) integrating, refining and extending the emergent content; and (3) releasing the processed content via a meaning-sensitive search-and-browse interface that offers services complementary to current full-text search. This chapter addresses the scientific and engineering challenges of the suggested approach and introduces a particular solution that tackles them: CORAAL, a prototype for knowledge-based life science publication search.

Notes

1.
CORAAL stands for COntent extended by emeRgent and Asserted Annotations of Linked publication data.
2.
ACE stands for Addition, Closure, Extension. See Section 9.4.2 for details.
3.
Cf. http://www.simile-widgets.org/exhibit/. Details on how to use the CORAAL user interface are given in Section 9.5.
4.
Note that, without loss of generality, URIs may serve as concept indices in the statements. Consequently, \(\textrm{ind}^{-1}\) de facto plays the role of URI dereferencing. To aid readability, however, the examples throughout the chapter use plain lexical terms instead of indices or URIs.
5.
Defined in [28] as \(F(a_1, \dots, a_n) = \sum_{j=1}^n w_j b_j\), where \(b_j\) is the \(j\)th largest of the \(a_i\) and the \(w_j\) form a collection of weights (also called a weight vector) such that \(w_j \in [0,1]\) and \(\sum_{j=1}^n w_j = 1\). Note that we use the additional \(u, v\) weights in order to explicitly capture the relative relevance of the first and second arguments of \(\varDelta_{u,v}\) independently of their relative sizes.
6.
Essentially, only one value \(w\), fully dependent on \(u, v, x, y\), needs to be derived, since the remaining element of the OWA weight vector of size 2 is equal to \(1-w\).
7.
The duality w.r.t. the distance is ensured by the conformance to two intuitive conditions – inverse proportionality and equality to 1 when the distance is 0.
8.
By iterating through a respective knowledge base and/or by similar concept retrieval.
9.
Computed as \(|\mathbf{A}| = |\{(i,j) \mid a_{i,j} \ne 0\}|\) for a concept \(\mathbf{A}\).
10.
Very large means hundreds or thousands of concepts and millions of respective statements, or more.
11.
Each of the journals was associated with a specific context identifier to maintain the sub-domain provenance of the respective extracted information and reflect it later on in the CORAAL user interface.
12.
The NCI and EMTREE thesauri – see http://www.cancer.gov/cancertopics/terminologyresources and http://www.embase.com/emtree/, respectively.
13.
These results were achieved on a single server machine (which is not exclusively dedicated to CORAAL). The current implementation still has room to scale; however, for data two or more orders of magnitude larger, a distributed solution would be more appropriate.
14.
See http://www.elseviergrandchallenge.com/.
15.
See http://salt.semanticauthoring.org/onto/. An extracted RDF file example is given at http://resources.smile.deri.ie/coraal/2008/11/ee7c3ec2536e6754ad424c9f95a0d8dce7059a4e.rdf.
16.
The heuristic is quite similar to the technique described in [39]. We use the Python NLTK library for NLP (see http://nltk.sourceforge.net). We also experimented with state-of-the-art ontology learning solutions (such as Text2Onto, see http://ontoware.org/projects/text2onto/). However, the respective tools performed rather poorly at larger scale and did not improve quality significantly over our simple approach. Nevertheless, we plan to include more sophisticated as well as domain-specific methods of knowledge extraction (cf. [4, 7, 40]) in our light-weight implementation at some stage.
17.
\(^{*}\), \(^{+}\) and \(?\) mean zero or more, one or more and zero or one repetitions of the preceding expression, respectively.
18.
We exclude "is a", by far the most common predicate, from the set considered for the \(f_P(t)\) computation (the value of \(\nu_1f_P(t_{\textrm{p}})\) was set to 0.8 for the "is a" statements). We also do not include the statements with \(f_P(t_{\textrm{p}}) = 1\) at all. Note that \(f(t_{\textrm s})f_D(t_{\textrm s})\) and \(f(t_{\textrm o})f_D(t_{\textrm o})\) are the relevance scores of the particular \(s\) and \(o\) terms, respectively.
19.
See http://en.wikipedia.org/wiki/SHA_hash_functions.
20.
We use the Sesame repository (see http://www.openrdf.org/).
21.
A comprehensive Java search engine library (see http://lucene.apache.org).
22.
Modulo mapping the terms to indices and neglecting the infinite number of columns and rows with zero-only elements. We consider "not a" to be the negation of "is a".
23.
See http://www.embase.com/emtree/ and http://www.cancer.gov/cancertopics/terminologyresources, respectively. EMTREE terms and relations were used in case of conflicts, since they cover a more general domain. Synonyms defined in the thesauri were reflected in the lexicon data structure accordingly.
24.
Note that the pipeline can also be executed as (ACE)⁺, i.e. as a search for a global fixed point of the respective operations; however, for the CORAAL prototype we employed only a single iteration, since the results were already sufficient for the presented proof of concept.
25.
A detailed description of the Lucene syntax can be found at http://lucene.apache.org/java/2_3_2/queryparsersyntax.html. Note that even though the meaning of the AND and NOT keywords is intuitively similar for both types of search in CORAAL, the knowledge and full-text variants are based on completely different principles. For instance, in the full-text search NOT selects documents that do not contain the query expression, while in the knowledge search it leads to documents containing a negation of the respective query statement; the AND keyword differs analogously.
26.
Note that you can watch a video comprehensively illustrating the essential CORAAL capabilities at http://resources.smile.deri.ie/coraal/videos/coraal_web.mp4 before starting to play with the tool itself.
27.
Note that the HAS PART relation has rather general semantics in the knowledge extracted by CORAAL: its meaning is not strictly mereological in the physical sense, and it can also refer to, e.g., conceptual parts or possession of entities. The same holds for the PART OF relation.
28.
For instance, the users were asked to find all authors who support the fact that the acute granulocytic leukemia and T-cell leukemia concepts are disjoint, or to find which process is used as a complementary method, while being different from the polymerase chain reaction, and identify publications that support their findings.
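
As an illustration of the ordered weighted averaging (OWA) operator that note 5 cites from [28], the following is a minimal Python sketch written directly from the stated formula. The function names `owa` and `owa2` are hypothetical and are not taken from the CORAAL implementation. The sketch also does not show how the chapter derives the weight \(w\) from \(u, v, x, y\) (note 6), since that derivation is not given here; `owa2` simply takes \(w\) as an input.

```python
from typing import Sequence


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted averaging: sum of w_j * b_j, where b_j is the
    j-th largest of the input values (formula quoted in note 5)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)  # b_1 >= b_2 >= ... >= b_n
    return sum(w * b for w, b in zip(weights, ordered))


def owa2(x: float, y: float, w: float) -> float:
    """Size-2 case: the weight vector is (w, 1 - w), as in note 6.
    How w is obtained is an assumption left to the caller."""
    return owa([x, y], [w, 1.0 - w])


if __name__ == "__main__":
    # Weight 0.7 goes to the larger value 0.9, weight 0.3 to the smaller 0.2.
    print(owa([0.2, 0.9], [0.7, 0.3]))
```

With the weight vector (1, 0) the operator reduces to the maximum of its arguments, with (0, 1) to the minimum, and with equal weights to the arithmetic mean, which is why a single value \(w\) suffices to parametrise the two-argument case.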

References

Bechhofer, S., Gangemi, A., Guarino, N., van Harmelen, F., Horrocks, I., Klein, M., Masolo, C., Oberle, D., Staab, S., Stuckenschmidt, H., Volz, R.: Tackling the ontology acquisition bottleneck: An experiment in ontology re-engineering (2003) Retrieved at http://tinyurl.com/96w7ms, Apr’08. 13 Jul 2010
Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.: Ontological Engineering. Advanced Information and Knowledge Processing. Springer, New York (2004)
Aberer, K., Cudré-Mauroux, P., Ouksel, A.M.: Emergent semantics principles and issues. In: Proceedings of Database Systems for Advanced Applications, 9th International Conference, DASFAA 2004, Jeju Island, Korea (2004)
Maedche, A., Staab, S.: Ontology learning. In: Staab, S., Studer, R. (eds.) Handbook on Ontologies. Springer, New York (2004) 173–190
Maedche, A.: Emergent semantics for ontologies. In: Emergent Semantics. IEEE Intelligent Systems. IEEE Press, NYC, USA (2002) 85–86
Ottens, K., Aussenac-Gilles, N., Gleizes, M.P., Camps, V.: Dynamic ontology coevolution from texts: Principles and case study. In: Proceedings of ESOE 2007 Workshop, CEUR-WS, Busan, Korea (2007) 70–83
Buitelaar, P., Cimiano, P.: Ontology Learning and Population: Bridging the Gap Between Text and Knowledge. IOS Press, Amsterdam, Netherlands (2008)
Haase, P., Völker, J.: Ontology learning and reasoning – dealing with uncertainty and inconsistency. In: Proceedings of the URSW2005 Workshop, Galway, Ireland (Nov 2005) 45–55
Heflin, J., Hendler, J.: Dynamic ontologies on the web. In: Proceedings of AAAI 2000, AAAI Press, Menlo Park, California, USA (2000)
Haase, P., van Harmelen, F., Huang, Z., Stuckenschmidt, H., Sure, Y.: A framework for handling inconsistency in changing ontologies. In: Proceedings of ISWC’05. Volume 3792 of LNCS. Springer, New York (2005) 353–367
Straccia, U.: A fuzzy description logic for the semantic web. In: Sanchez, E. (ed.) Fuzzy Logic and the Semantic Web. Capturing Intelligence. Elsevier, Amsterdam (2006) 73–90
Flouris, G., Huang, Z., Pan, J.Z., Plexousakis, D., Wache, H.: Inconsistencies, negations and changes in ontologies. In: Proceedings of AAAI 2006, AAAI Press, Menlo Park, California, USA (2006)
Sheth, A., Ramakrishnan, C., Thomas, C.: Semantics for the semantic web: The implicit, the formal and the powerful. International Journal on Semantic Web & Information Systems 1(1) (2005) 1–18
Frith, C.: Making Up the Mind: How the Brain Creates Our Mental World. Blackwell, Oxford, UK (2007)
Gentner, D., Holyoak, K.J., Kokinov, B.N. (eds.): The Analogical Mind: Perspectives from Cognitive Science. MIT Press, Cambridge, MA (2001)
McGuinness, D.L.: Ontology-enhanced search for primary care medical literature. In: Proceedings of the Medical Concept Representation and Natural Language Processing Conference, Phoenix, Arizona, USA (1999) 16–19
Abasolo, J.M., Gómez, M.: Melisa: An ontology-based agent for information retrieval in medicine. In: Proceedings of the First International Workshop on the Semantic Web (SemWeb2000), Lisbon, Portugal (2000) 73–82
Dietze, H., et al.: Gopubmed: Exploring pubmed with ontological background knowledge. In: Ontologies and Text Mining for Life Sciences, IBFI (2008)
Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.: The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge, USA (2003)
Müller, H.M., Kenny, E.E., Sternberg, P.W.: Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biology 2(11) (2004) 1984–1998
Groza, T., Handschuh, S., Moeller, K., Decker, S.: KonneXSALT: First steps towards a semantic claim federation infrastructure. In: The Semantic Web: Research and Applications (Proceedings of ESWC 2008), Springer, New York (2008) 80–94
Hulpus, I.: Design and implementation of a semantic claim federation infrastructure. Master’s Thesis, Technical University of Cluj-Napoca (2008)
Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 5 (2001)
Zadeh, L.A.: Fuzzy sets. Journal of Information and Control 8 (1965) 338–353
Ogden, C.K., Richards, I.A.: The Meaning of Meaning. Mariner Books (1989)
Brickley, D., Guha, R.V.: RDF Vocabulary Description Language 1.0: RDF Schema. (2004) Available at (Feb 2006): http://www.w3.org/TR/rdf-schema/. 13 Jul 2010
Deschrijver, G., Cornelis, C., Kerre, E.E.: On the representation of intuitionistic fuzzy t-norms and t-conorms. In: Transactions on Fuzzy Systems. IEEE (2004)
Yager, R.R.: On ordered weighted averaging aggregation operators in multi-criteria decision making. IEEE Transactions on Systems, Man and Cybernetics 18 (1988) 183–190
Greenwald, A.G.: Cognitive learning, cognitive response to persuasion, and attitude change. In: Psychological Foundations of Attitudes, Academic Press Inc., New York (1968) 147–169
Grimm, S., Motik, B.: Closed world reasoning in the semantic web through epistemic operators. In: Proceedings of the Workshop OWL – Experiences and Directions, CEUR-WS (2005)
Patel-Schneider, P.F., Horrocks, I.: Position paper: A comparison of two modelling paradigms in the semantic web. In: Proceedings of WWW2006, ACM Press, NYC, USA (2006) 3–12
Stanfill, C., Waltz, D.: Toward memory-based reasoning. Communications of the ACM 29(12) (1986) 1213–1228
Kokinov, B.N., Petrov, A.: Integrating memory and reasoning in analogy-making: The AMBR model. In: The Analogical Mind: Perspectives from Cognitive Science, MIT Press, Cambridge, MA (2001) 59–124
Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46(5) (1999) 604–632
Zilberstein, S.: Using anytime algorithms in intelligent systems. AI Magazine 17(3) (1996) 73–83
Nováček, V.: Towards an efficient knowledge-based publication data exploitation: An oncological literature search scenario. Technical Report DERI-TR-2009-03-23, DERI, NUIG (2009) Available at http://tinyurl.com/csh3rf. 13 Jul 2010
Manola, F., Miller, E.: RDF Primer. (2004) Available at (November 2008): http://www.w3.org/TR/rdf-primer/. 13 Jul 2010
Groza, T., Möller, K., Handschuh, S., Trif, D., Decker, S.: SALT: Weaving the claim web. In: ISWC 2007, Busan, Korea (2007)
Maedche, A., Staab, S.: Discovering conceptual relations from text. In: Proceedings of ECAI 2000, IOS Press, Amsterdam, Netherlands (2000)
Blaschke, C., Andrade, M., Ouzounis, C., Valencia, A.: Automatic extraction of biological information from scientific text: Protein-protein interactions. In: Proc. Int Conf Intell Syst Mol Biol, Protein Design Group, CNB-CSIC, Madrid, Spain (1999) 60–67
Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA (1998)
Cimiano, P., Pivk, A., Schmidt-Thieme, L., Staab, S.: Learning taxonomic relations from heterogenous sources of evidence. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.) Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, Amsterdam, Netherlands (2005) 59–73
Voelker, J., Vrandecic, D., Sure, Y., Hotho, A.: Learning disjointness. In: Proceedings of ESWC’07, Springer, New York (2007)
Gärdenfors, P.: Conceptual Spaces: The Geometry of Thought. MIT Press, Cambridge, MA (2000)
Aisbett, J., Gibbon, G.: A general formulation of conceptual spaces as a meso level representation. Artificial Intelligence 133(1–2) (2001) 189–232
Smolensky, P., Legendre, G.: The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar. MIT Press, Cambridge, MA (2006)
Sowa, J.F., Majumdar, A.K.: Analogical reasoning. In: Proceedings of ICCS’03. Springer, Berlin, Heidelberg (2003)
Sowa, J.F.: A dynamic theory of ontology. In: Proceedings of FOIS’06, IOS Press, Amsterdam, Netherlands (2006)
Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P.F., Stein, L.A.: OWL Web Ontology Language Reference. (2004) Available at (February 2006): http://www.w3.org/TR/owl-ref/. 13 Jul 2010
ter Horst, H.J.: Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. Journal of Web Semantics 3(2–3) (2005) 79–115
Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web Ontology Language: Profiles. Working draft, available at http://www.w3.org/TR/owl2-profiles as of Dec 11 (2008). 13 Jul 2010
Noy, N., Rector, A.: Defining N-ary Relations on the Semantic Web (2006). Available at (June 2008): http://www.w3.org/TR/swbp-n-aryRelations/. 13 Jul 2010
Laskey, K.J., Laskey, K.B., Costa, P.C.G., Kokar, M.M., Martin, T., Lukasiewicz, T.: Uncertainty Reasoning for the World Wide Web. (2008) W3C Incubator Group final report, available at http://www.w3.org/2005/Incubator/urw3/XGR-urw3-20080331/ as of Dec 11, 2008. 13 Jul 2010

Acknowledgments

This work has been supported by the EU IST 6th framework’s project “Nepomuk” (FP6-027705) and the “Líon” and “Líon II” projects funded by Science Foundation Ireland under Grant Nos. SFI/02/CE1/I131 and SFI/08/CE/I1380, respectively. We would like to thank the employees of Masaryk Oncology Institute for their feedback and Ioana Hulpus for her work on the former CORAAL user interface. Very special thanks go to the people who have actively participated in the continuous prototype evaluation and testing, namely (in alphabetical order) Doug Foxvog, Peter Gréll, MD, Miloš Holánek, MD, Matthias Samwald, Holger Stenzhorn and Jiří Vyskočil, MD. We also acknowledge the valuable comments from the anonymous reviewers who helped to improve the final shape of the chapter.

Author information

Authors and Affiliations

Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, Ireland
Vít Nováček, Tudor Groza & Siegfried Handschuh

Corresponding author

Correspondence to Vít Nováček.

Editor information

Editors and Affiliations

College of Computer Science, Zhejiang University, Zheda Road 38, Hangzhou, 310058, China, People's Republic
Huajun Chen
Lilly Singapore Centre for Drug Discovery Pte Ltd., Biomedical Grove 8A, Singapore, 138648, Singapore
Yimin Wang
Center for Medical Informatics, Yale University School of Medicine, Cedar St. 333, New Haven, 06520-8009, Connecticut, USA
Kei-Hoi Cheung

About this chapter

Cite this chapter

Nováček, V., Groza, T., Handschuh, S. (2010). Towards Knowledge-Based Life Science Publication Repositories. In: Chen, H., Wang, Y., Cheung, KH. (eds) Semantic e-Science. Annals of Information Systems, vol 11. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-5908-9_9

DOI: https://doi.org/10.1007/978-1-4419-5908-9_9
Published: 17 June 2010
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5902-7
Online ISBN: 978-1-4419-5908-9
eBook Packages: Business and Economics; Business and Management (R0)
