An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases

  • Saqib Mir
  • Steffen Staab
  • Isabel Rojas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6089)


In the Linked Open Data cloud, one of the largest data sets, comprising 2.5 billion triples, is derived from the Life Science domain. Yet this represents a small fraction of the total number of publicly available data sources on the Web. We briefly describe past attempts to transform specific Life Science sources from a plethora of open as well as proprietary formats into RDF data. In particular, we identify and tackle two bottlenecks in current practice: acquiring ontologies to formally describe these data, and creating “RDFizer” programs to convert data from legacy formats into RDF. We propose an unsupervised method, based on transformation rules, for performing these two key tasks. The method builds on our previous work on unsupervised wrapper induction for extracting labelled data from complete Life Science Web sites. We apply our approach to 13 real-world online Life Science databases. The learned ontologies are evaluated by domain experts as well as against gold standard ontologies. Furthermore, we compare the learned ontologies against ontologies that are “lifted” directly from the underlying relational schema using an existing unsupervised approach. Finally, we apply our approach to three online databases to extract RDF data. Our results indicate that this approach can be used to bootstrap and speed up the migration of Life Science data into the Linked Open Data cloud.


Transformation Rule, Relational Schema, Formal Concept Analysis, Unsupervised Approach, XPath Expression
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Saqib Mir (1, 2)
  • Steffen Staab (2)
  • Isabel Rojas (1)
  1. EML-Research, Heidelberg, Germany
  2. University of Koblenz-Landau, Koblenz, Germany
