OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data

  • Conference paper
Biological and Medical Data Analysis (ISBMDA 2006)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4345)

Abstract

Within the knowledge discovery in databases (KDD) process, the phases that precede data mining consume most of the time spent analysing data. Comparatively little research has addressed these steps relative to data mining itself, suggesting that new approaches and tools are needed to support data preparation. In this regard, this paper presents a new methodology for ontology-based KDD that adopts a federated approach to database integration and retrieval. Within this model, an ontology-based system called OntoDataClean has been developed to handle instance-level integration and data preprocessing. As part of the OntoDataClean development, a preprocessing ontology was built to store information about the required transformations. Various biomedical experiments were carried out, showing that the data were correctly transformed using the preprocessing ontology. Although OntoDataClean does not cover every possible data transformation, the results suggest that ontologies are a suitable mechanism for improving quality in the various steps of KDD processes.
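
The idea of storing required transformations declaratively and then applying them to records can be illustrated with a minimal Python sketch. It is not the OntoDataClean implementation: the abstract does not describe the structure of the preprocessing ontology, so the `TransformationRule` class, the two rule kinds ("scale" and "synonym"), the field names and the sample records are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


# Hypothetical stand-in for an entry of a preprocessing ontology:
# each rule names a field and the transformation to apply to its values.
@dataclass(frozen=True)
class TransformationRule:
    field: str
    kind: str  # "scale" or "synonym" (illustrative kinds only)
    params: Dict[str, Any]


def apply_rule(rule: TransformationRule, record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with one rule applied to its field."""
    value = record.get(rule.field)
    if value is None:
        return record  # nothing to transform for this record
    out = dict(record)
    if rule.kind == "scale":
        # Unit conversion: multiply the numeric value by a constant factor.
        out[rule.field] = float(value) * rule.params["factor"]
    elif rule.kind == "synonym":
        # Instance-level harmonisation: map known variants to one term.
        out[rule.field] = rule.params["mapping"].get(value, value)
    else:
        raise ValueError(f"unknown transformation kind: {rule.kind}")
    return out


def preprocess(records: List[Dict[str, Any]],
               rules: List[TransformationRule]) -> List[Dict[str, Any]]:
    """Apply every rule, in order, to every record."""
    result = []
    for record in records:
        for rule in rules:
            record = apply_rule(rule, record)
        result.append(record)
    return result


# Hypothetical rules and data, assumed only for this example.
RULES = [
    TransformationRule("weight", "scale", {"factor": 1000.0}),  # kg to g
    TransformationRule("sex", "synonym",
                       {"mapping": {"M": "male", "F": "female"}}),
]
SAMPLE = [{"weight": 72, "sex": "M"}, {"weight": 65.5, "sex": "female"}]
CLEANED = preprocess(SAMPLE, RULES)
```

Keeping the rules as data rather than as code is the design point the sketch tries to convey: the same `preprocess` function can serve different sources once each source is described by its own set of rules.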

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Perez-Rey, D., Anguita, A., Crespo, J. (2006). OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data. In: Maglaveras, N., Chouvarda, I., Koutkias, V., Brause, R. (eds) Biological and Medical Data Analysis. ISBMDA 2006. Lecture Notes in Computer Science, vol 4345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946465_24

  • DOI: https://doi.org/10.1007/11946465_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68063-5

  • Online ISBN: 978-3-540-68065-9

  • eBook Packages: Computer Science, Computer Science (R0)
