Towards a Wrapper-Driven Ontology-Based Framework for Knowledge Extraction

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4798)

Abstract

Because Web resources are formatted in diverse ways for human viewing, information extraction from them is not satisfactorily accurate, and information extracted by traditional techniques is inconvenient for users to query. This paper proposes WebKER, a wrapper-driven system that extracts knowledge from Chinese Web pages based on domain ontologies. Wrappers are first learned through suffix arrays. A novel approach based on HowNet then automatically aligns the raw data extracted by the wrappers. Knowledge is next generated and described with Resource Description Framework (RDF) statements and, after merging, is added to the Knowledge Base (KB). A prototype of WebKER has been implemented. The experiments report the performance of the system and compare querying information stored in the KB with querying information extracted by traditional techniques; the results indicate the superiority of our system. In addition, evaluations of the outstanding wrapper and of the method for merging knowledge are presented.
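The abstract does not detail how suffix arrays are used to learn wrappers, so the following is only a minimal sketch of the general idea, not the authors' algorithm: a page is reduced to a sequence of tag tokens, a suffix array is built over that sequence, and the longest repeated run of tokens is taken as a candidate record template. The tokenization (`#text` standing in for text nodes), the naive construction, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return the start indices of all suffixes, sorted lexicographically.

    Naive construction by sorting suffix slices; adequate for a
    page-sized token sequence, not for large inputs.
    """
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))


def lcp_length(tokens: Sequence[str], i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes at i and j."""
    n = len(tokens)
    k = 0
    while i + k < n and j + k < n and tokens[i + k] == tokens[j + k]:
        k += 1
    return k


def longest_repeated_pattern(tokens: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Find the longest token run that occurs at least twice.

    Suffixes sharing a long prefix are adjacent in the suffix array,
    so only neighbouring entries need to be compared.
    """
    sa = build_suffix_array(tokens)
    best_len, best_positions = 0, []
    for a, b in zip(sa, sa[1:]):
        k = lcp_length(tokens, a, b)
        if k > best_len:
            best_len, best_positions = k, sorted([a, b])
    if best_len == 0:
        return [], []
    start = best_positions[0]
    return list(tokens[start:start + best_len]), best_positions


# Hypothetical tag sequence for a list page with two records.
page = ["<ul>",
        "<li>", "<b>", "#text", "</b>", "</li>",
        "<li>", "<b>", "#text", "</b>", "</li>",
        "</ul>"]
pattern, positions = longest_repeated_pattern(page)
print(pattern, positions)
```

On this toy sequence the repeated run corresponds to one list item, which is the kind of regularity a wrapper could use to locate records. A practical system would also need to handle overlapping repeats, optional fields, and noise in the markup.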

Editor information

Zili Zhang Jörg Siekmann

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sun, J., Bai, X., Li, Z., Che, H., Liu, H. (2007). Towards a Wrapper-Driven Ontology-Based Framework for Knowledge Extraction. In: Zhang, Z., Siekmann, J. (eds) Knowledge Science, Engineering and Management. KSEM 2007. Lecture Notes in Computer Science, vol 4798. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76719-0_25

  • DOI: https://doi.org/10.1007/978-3-540-76719-0_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-76718-3

  • Online ISBN: 978-3-540-76719-0

  • eBook Packages: Computer Science, Computer Science (R0)
