
Using Common Schemas for Information Extraction from Heterogeneous Web Catalogs

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2798)

Abstract

The Web has become the world’s largest information source. Unfortunately, the main success factor of the Web, the inherent principle of distribution and autonomy of its participants, is also its main problem. To make this information machine-processable, common structures and semantics have to be identified. This is exactly the goal of information extraction (IE): to transform text into a structured format. In this paper, we present a novel approach to information extraction developed as part of the XI3 project. Central to our approach is the assumption that we can understand a text fragment better if we consider how it fits into higher-level concepts, exploiting text fragments from different parts of a source. Compared with previous approaches, we offer higher expressiveness of the extraction schema and an advanced method for dealing with ambiguous texts. Our approach also makes it possible to use one extraction schema for multiple sources.




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Vlach, R., Kazakos, W. (2003). Using Common Schemas for Information Extraction from Heterogeneous Web Catalogs. In: Kalinichenko, L., Manthey, R., Thalheim, B., Wloka, U. (eds) Advances in Databases and Information Systems. ADBIS 2003. Lecture Notes in Computer Science, vol 2798. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39403-7_11


  • DOI: https://doi.org/10.1007/978-3-540-39403-7_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20047-5

  • Online ISBN: 978-3-540-39403-7

  • eBook Packages: Springer Book Archive
