Abstract
Although online scientific publishing is a flourishing field, its search services still exploit mostly raw publication data (rather meaningless bags of words) and shallow meta-data (authors, keywords, citations, etc.). The much-needed economical mass exploitation of the knowledge implicitly contained in publication texts remains largely uncharted territory. Filling this gap requires (1) extraction of asserted publication meta-data together with the knowledge implicitly present in the respective text; (2) integration, refinement and extension of the emergent content; and (3) release of the processed content via a meaning-sensitive search-and-browse interface that offers services complementary to current full-text search. This chapter addresses the scientific and engineering challenges related to the suggested approach and introduces a particular solution that tackles them: CORAAL, a prototype for knowledge-based life science publication search.
Notes
- 1.
CORAAL stands for COntent extended by emeRgent and Asserted Annotations of Linked publication data.
- 2.
ACE stands for Addition, Closure, Extension. See Section 9.4.2 for details.
- 3.
Cf. http://www.simile-widgets.org/exhibit/. Details on how to use the CORAAL user interface are given in Section 9.5.
- 4.
Note that, without loss of generality, URIs may serve as concept indices in the statements. Consequently, \(\textrm{ind}^{-1}\) de facto plays the role of URI dereferencing. For readability, however, the examples throughout the chapter use plain lexical terms instead of indices or URIs.
- 5.
Defined in [28] as \(F(a_1, \dots,a_n) = \sum_{j=1}^n w_jb_j\), where \(b_j\) is the \(j\)th largest of the \(a_i\) and the \(w_j\) are a collection of weights (also called a weight vector) such that \(w_j \in [0,1]\) and \(\sum_{j=1}^n w_j = 1\). Note that we use the additional \(u, v\) weights in order to explicitly capture the relative relevance of the first and second arguments of \(\varDelta_{u,v}\) independently of their relative sizes.
- 6.
Essentially only one value w fully dependent on \(u, v, x, y\) is to be derived, since the remaining element of the OWA weight vector of size 2 is equal to \(1-w\).
- 7.
The duality w.r.t. the distance is ensured by the conformance to two intuitive conditions – inverse proportionality and equality to 1 when the distance is 0.
- 8.
By iterating through a respective knowledge base and/or by similar concept retrieval.
- 9.
Computed as \(|\mathbf{A}| = |\{(i,j) \mid a_{i,j} \ne 0\}|\) for a concept \(\mathbf{A}\).
- 10.
Very large means hundreds or thousands of concepts and millions of respective statements, or more.
- 11.
Each of the journals was associated with a specific context identifier to maintain the sub-domain provenance of the respective extracted information and reflect it later on in the CORAAL user interface.
- 12.
The NCI and EMTREE thesauri – see http://www.cancer.gov/cancertopics/terminologyresources and http://www.embase.com/emtree/, respectively.
- 13.
These results were achieved on a single server machine (which is not exclusively dedicated to CORAAL). The current implementation still has some scalability headroom; however, for processing data two or more orders of magnitude larger, a distributed solution would be much better.
- 14.
- 15.
See http://salt.semanticauthoring.org/onto/. An extracted RDF file example is given at http://resources.smile.deri.ie/coraal/2008/11/ee7c3ec2536e6754ad424c9f95a0d8dce7059a4e.rdf.
- 16.
The heuristic is quite similar to the technique described in [39]. We use the Python NLTK library for NLP (see http://nltk.sourceforge.net). We also experimented with state-of-the-art ontology learning solutions (such as Text2Onto, see http://ontoware.org/projects/text2onto/). However, the respective tools performed rather poorly at larger scale, without providing a significant improvement in quality over our simple approach. We nevertheless plan to include more sophisticated as well as domain-specific methods of knowledge extraction (cf. [4, 7, 40]) in our light-weight implementation at some stage.
- 17.
\(^{*}, ^+\) and ? mean zero or more, one or more and zero or one repetitions of the preceding expression, respectively.
- 18.
We exclude is a, by far the most common predicate, from the set considered for the \(f_P(t)\) computation (the value of \(\nu_1f_P(t_{\textrm{p}})\) was set to 0.8 for the is a statements). We also do not include the statements with \(f_P(t_{\textrm{p}}) = 1\) at all. Note that \(f(t_{\textrm s})f_D(t_{\textrm s})\) and \(f(t_{\textrm o})f_D(t_{\textrm o})\) are the relevance scores of the particular s and o terms, respectively.
- 19.
- 20.
We use the Sesame repository (see http://www.openrdf.org/).
- 21.
A comprehensive Java search engine library (see http://lucene.apache.org).
- 22.
Modulo mapping the terms to indices and neglecting the infinite number of columns and rows with zero-only elements. We consider not a as a negation of is a.
- 23.
See http://www.embase.com/emtree/ and http://www.cancer.gov/cancertopics/terminologyresources, respectively. EMTREE terms and relations were used in case of conflicts, since they cover a more general domain. Synonyms defined in the thesauri were reflected in the lexicon data structure accordingly.
- 24.
Note that the pipeline can also be executed as (ACE)+, i.e. as a search for a global fixed point of the respective operations; however, for the CORAAL prototype we employed only a single iteration, since the results were already sufficient for the presented proof of concept.
- 25.
A detailed description of the Lucene syntax can be found at http://lucene.apache.org/java/2_3_2/queryparsersyntax.html. Note that even though the meaning of the AND and NOT keywords is intuitively similar for both types of search in CORAAL, the knowledge and full-text variants are based on completely different principles. For instance, in the full-text search NOT selects documents that do not contain the query expression, while in the knowledge search it leads to documents containing a negation of the respective query statement; similarly for the AND keyword.
- 26.
Note that you can watch a video comprehensively illustrating the essential CORAAL capabilities at http://resources.smile.deri.ie/coraal/videos/coraal_web.mp4 before starting to play with the tool itself.
- 27.
Note that the HAS PART relation has rather general semantics in the knowledge extracted by CORAAL, i.e. its meaning is not strictly mereological in the physical sense; it can also refer to, e.g., conceptual parts or possession of entities. Similarly for the PART OF relation.
- 28.
For instance, the users were asked to find all authors who support the fact that the acute granulocytic leukemia and T-cell leukemia concepts are disjoint, or to find which process is used as a complementary method, while being different from the polymerase chain reaction, and identify publications that support their findings.
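The ordered weighted averaging (OWA) operator defined in note 5 can be illustrated with a minimal Python sketch. It follows only the formula quoted there, \(F(a_1, \dots,a_n) = \sum_{j=1}^n w_jb_j\); the function names `owa` and `owa2` are hypothetical and are not taken from the CORAAL implementation. The way the chapter derives the single weight w from \(u, v, x, y\) (note 6) is not reproduced here, so w is simply passed in as an argument.

```python
from typing import Sequence


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted average: sum of w_j * b_j, where b_j is the
    j-th largest of the input values (hypothetical helper name)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Sort in descending order so that ordered[j] is the (j+1)-th largest value.
    ordered = sorted(values, reverse=True)
    return sum(w * b for w, b in zip(weights, ordered))


def owa2(x: float, y: float, w: float) -> float:
    """Two-argument case from note 6: the weight vector is (w, 1 - w).
    Here w is an input; its derivation from u, v, x, y is not modelled."""
    return owa([x, y], [w, 1.0 - w])
```

With the weight vector (1, 0, ..., 0) the operator returns the maximum of its arguments, with (0, ..., 0, 1) the minimum, and with uniform weights the arithmetic mean, which is why a single parameter w suffices to move a two-argument aggregation between these extremes.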
References
Bechhofer, S., Gangemi, A., Guarino, N., van Harmelen, F., Horrocks, I. Klein, M., Masolo, C., Oberle, D., Staab, S., Stuckenschmidt, H., Volz, R.: Tackling the ontology acquisition bottleneck: An experiment in ontology re-engineering (2003) Retrieved at http://tinyurl.com/96w7ms, Apr’08. 13 Jul 2010
Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.: Ontological Engineering. Advanced Information and Knowledge Processing. Springer, New York (2004)
Aberer, K., Cudré-Mauroux, P., Ouksel, A.M.: Emergent semantics principles and issues. In: Proceedings of Database Systems for Advanced Applications, 9th International Conference, DASFAA 2004, Jeju Island, Korea (2004)
Maedche, A., Staab, S.: Ontology learning. In: Staab, S., Studer, R. (eds.) Handbook on Ontologies. Springer, New York (2004) 173–190
Maedche, A.: Emergent semantics for ontologies. In: Emergent Semantics. IEEE Intelligent Systems. IEEE Press, NYC, USA (2002) 85–86
Ottens, K., Aussenac-Gilles, N., Gleizes, M.P., Camps, V.: Dynamic ontology coevolution from texts: Principles and case study. In: Proceedings of ESOE 2007 Workshop, CEUR-WS, Busan, Korea (2007) 70–83
Buitelaar, P., Cimiano, P.: Ontology Learning and Population: Bridging the Gap Between Text and Knowledge. IOS Press, Amsterdam, Netherlands (2008)
Haase, P., Völker, J.: Ontology learning and reasoning – dealing with uncertainty and inconsistency. In: Proceedings of the URSW2005 Workshop. (NOV 2005), Galway, Ireland 45–55
Heflin, J., Hendler, J.: Dynamic ontologies on the web. In: Proceedings of AAAI 2000, AAAI Press, Menlo Park, California, USA (2000)
Haase, P., van Harmelen, F., Huang, Z., Stuckenschmidt, H., Sure, Y.: A framework for handling inconsistency in changing ontologies. In: Proceedings of ISWC’05. Volume 3792 of LNCS. Springer, New York (2005) 353–367
Straccia, U.: A fuzzy description logic for the semantic web. In: Sanchez, E. (ed.) Fuzzy Logic and the Semantic Web. Capturing Intelligence. Elsevier, Amsterdam (2006) 73–90
Flouris, G., Huang, Z., Pan, J.Z., Plexousakis, D., Wache, H.: Inconsistencies, negations and changes in ontologies. In: Proceedings of AAAI 2006, AAAI Press, Menlo Park, California, USA (2006)
Sheth, A., Ramakrishnan, C., Thomas, C.: Semantics for the semantic web: The implicit, the formal and the powerful. International Journal on Semantic Web & Information Systems 1(1) (2005) 1–18
Frith, C.: Making Up the Mind: How the Brain Creates Our Mental World. Blackwell, Oxford, UK (2007)
Gentner, D., Holyoak, K.J., Kokinov, B.N. (eds.): The Analogical Mind: Perspectives from Cognitive Science. MIT Press, Cambridge, MA (2001)
McGuinness, D.L.: Ontology-enhanced search for primary care medical literature. In: Proceedings of the Medical Concept Representation and Natural Language Processing Conference, Phoenix, Arizona, USA (1999) 16–19
Abasolo, J.M., Gómez, M.: Melisa: An ontology-based agent for information retrieval in medicine. In: Proceedings of the First International Workshop on the Semantic Web (SemWeb2000), Lisbon, Portugal (2000) 73–82
Dietze, H., et al.: GoPubMed: Exploring PubMed with ontological background knowledge. In: Ontologies and Text Mining for Life Sciences, IBFI (2008)
Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.: The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge, USA (2003)
Müller, H.M., Kenny, E.E., Sternberg, P.W.: Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biology 2(11) (2004) 1984–1998
Groza, T., Handschuh, S., Moeller, K., Decker, S.: KonneXSALT: First steps towards a semantic claim federation infrastructure. In: The Semantic Web: Research and Applications (Proceedings of ESWC 2008), Springer, New York (2008) 80–94
Hulpus, I.: Design and implementation of a semantic claim federation infrastructure. Master’s Thesis, Technical University of Cluj-Napoca (2008)
Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 5 (2001)
Zadeh, L.A.: Fuzzy sets. Journal of Information and Control 8 (1965) 338–353
Ogden, C.K., Richards, I.A.: The Meaning of Meaning. Mariner Books (1989)
Brickley, D., Guha, R.V.: RDF Vocabulary Description Language 1.0: RDF Schema. (2004) Available at (Feb 2006): http://www.w3.org/TR/rdf-schema/. 13 Jul 2010
Deschrijver, G., Cornelis, C., Kerre, E.E.: On the representation of intuitionistic fuzzy t-norms and t-conorms. In: Transactions on Fuzzy Systems. IEEE (2004)
Yager, R.R.: On ordered weighted averaging aggregation operators in multi-criteria decision making. IEEE Transactions on Systems, Man and Cybernetics 18 (1988) 183–190
Greenwald, A.G.: Cognitive learning, cognitive response to persuasion, and attitude change. In: Psychological Foundations of Attitudes, Academic Press Inc., New York (1968) 147–169
Grimm, S., Motik, B.: Closed world reasoning in the semantic web through epistemic operators. In: Proceedings of the Workshop OWL – Experiences and Directions, CEUR-WS (2005)
Patel-Schneider, P.F., Horrocks, I.: Position paper: A comparison of two modelling paradigms in the semantic web. In: Proceedings of WWW2006, ACM Press, NYC, USA (2006) 3–12
Stanfill, C., Waltz, D.: Toward memory-based reasoning. Communications of the ACM 29(12) (1986) 1213–1228
Kokinov, B.N., Petrov, A.: Integrating memory and reasoning in analogy-making: The AMBR model. In: The Analogical Mind: Perspectives from Cognitive Science, MIT Press, Cambridge, MA (2001) 59–124
Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46(5) (1999) 604–632
Zilberstein, S.: Using anytime algorithms in intelligent systems. AI Magazine 17(3) (1996) 73–83
Nováček, V.: Towards an efficient knowledge-based publication data exploitation: An oncological literature search scenario. Technical Report DERI-TR-2009-03-23, DERI, NUIG (2009) Available at http://tinyurl.com/csh3rf. 13 Jul 2010
Manola, F., Miller, E.: RDF Primer. (2004) Available at (November 2008): http://www.w3.org/TR/rdf-primer/. 13 Jul 2010
Groza, T., Möller, K., Handschuh, S., Trif, D., Decker, S.: SALT: Weaving the claim web. In: ISWC 2007, Busan, Korea (2007)
Maedche, A., Staab, S.: Discovering conceptual relations from text. In: Proceedings of ECAI 2000, IOS Press, Amsterdam, Netherlands (2000)
Blaschke, C., Andrade, M., Ouzounis, C., Valencia, A.: Automatic extraction of biological information from scientific text: Protein-protein interactions. In: Proc. Int Conf Intell Syst Mol Biol, Protein Design Group, CNB-CSIC, Madrid, Spain (1999) 60–67
Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA (1998)
Cimiano, P., Pivk, A., Schmidt-Thieme, L., Staab, S.: Learning taxonomic relations from heterogeneous sources of evidence. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.) Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, Amsterdam, Netherlands (2005) 59–73
Voelker, J., Vrandecic, D., Sure, Y., Hotho, A.: Learning disjointness. In: Proceedings of ESWC’07, Springer, New York (2007)
Gärdenfors, P.: Conceptual Spaces: The Geometry of Thought. MIT Press, Cambridge, MA (2000)
Aisbett, J., Gibbon, G.: A general formulation of conceptual spaces as a meso level representation. Artificial Intelligence 133(1–2) (2001) 189–232
Smolensky, P., Legendre, G.: The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar. MIT Press, Cambridge, MA (2006)
Sowa, J.F., Majumdar, A.K.: Analogical reasoning. In: Proceedings of ICCS’03. Springer, Berlin, Heidelberg (2003)
Sowa, J.F.: A dynamic theory of ontology. In: Proceedings of FOIS’06, IOS Press, Amsterdam, Netherlands (2006)
Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P.F., Stein, L.A.: OWL Web Ontology Language Reference. (2004) Available at (February 2006): http://www.w3.org/TR/owl-ref/. 13 Jul 2010
ter Horst, H.J.: Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. Journal of Web Semantics 3(2–3) (2005) 79–115
Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web Ontology Language: Profiles. Working draft, available at http://www.w3.org/TR/owl2-profiles as of Dec 11 (2008). 13 Jul 2010
Noy, N., Rector, A.: Defining N-ary Relations on the Semantic Web (2006). Available at (June 2008): http://www.w3.org/TR/swbp-n-aryRelations/. 13 Jul 2010
Laskey, K.J., Laskey, K.B., Costa, P.C.G., Kokar, M.M., Martin, T., Lukasiewicz, T.: Uncertainty Reasoning for the World Wide Web. (2008) W3C Incubator Group final report, available at http://www.w3.org/2005/Incubator/urw3/XGR-urw3-20080331/ as of Dec 11, 2008. 13 Jul 2010
Acknowledgments
This work has been supported by the EU IST 6th Framework project “Nepomuk” (FP6-027705) and by the “Líon” and “Líon II” projects funded by Science Foundation Ireland under Grant Nos. SFI/02/CE1/I131 and SFI/08/CE/I1380, respectively. We would like to thank the employees of Masaryk Oncology Institute for their feedback and Ioana Hulpus for her work on the former CORAAL user interface. Very special thanks go to the people who have actively participated in the continuous prototype evaluation and testing, namely (in alphabetical order) Doug Foxvog, Peter Gréll, MD, Miloš Holánek, MD, Matthias Samwald, Holger Stenzhorn and Jiří Vyskočil, MD. We also acknowledge the valuable comments from the anonymous reviewers who helped to improve the final shape of the chapter.
© 2010 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Nováček, V., Groza, T., Handschuh, S. (2010). Towards Knowledge-Based Life Science Publication Repositories. In: Chen, H., Wang, Y., Cheung, KH. (eds) Semantic e-Science. Annals of Information Systems, vol 11. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-5908-9_9
DOI: https://doi.org/10.1007/978-1-4419-5908-9_9
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5902-7
Online ISBN: 978-1-4419-5908-9
eBook Packages: Business and Economics; Business and Management (R0)