Using Common Schemas for Information Extraction from Heterogeneous Web Catalogs

Vlach, Richard; Kazakos, Wassili

doi:10.1007/978-3-540-39403-7_11

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2798)

Included in the following conference series:

East European Conference on Advances in Databases and Information Systems

Abstract

The Web has become the world’s largest information source. Unfortunately, the main success factor of the Web, the inherent principle of distribution and autonomy of the participants, is also its main problem. To make this information machine-processable, common structures and semantics have to be identified. This is exactly the goal of information extraction (IE): to transform text into a structured format. In this paper, we present a novel approach to information extraction developed as part of the XI³ project. Central to our approach is the assumption that we can obtain a better understanding of a text fragment if we consider its integration into higher-level concepts by exploiting text fragments from different parts of a source. Compared with previous approaches, we offer higher expressiveness of the extraction schema and an advanced method for dealing with ambiguous texts. Our approach provides a way to use one extraction schema for multiple sources.
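The general idea of applying one extraction schema to several differently formatted sources can be illustrated with a minimal sketch. This is not the XI³ method, which the abstract does not specify in detail. It is a toy illustration that assumes a regex-based schema, and all names in it (`COMMON_SCHEMA`, `extract`, the two sample catalog fragments) are hypothetical.

```python
import re

# Hypothetical common schema: each field lists alternative patterns,
# so one schema can cover catalogs with different layouts.
COMMON_SCHEMA = {
    "name": [
        r"Product:\s*(?P<v>.+)",      # layout of the first sample source
        r"^(?P<v>[^|]+)\|",           # layout of the second sample source
    ],
    "price": [
        r"Price:\s*(?P<v>\d+(?:\.\d+)?)\s*EUR",
        r"\|\s*EUR\s*(?P<v>\d+(?:\.\d+)?)",
    ],
}


def extract(text, schema=COMMON_SCHEMA):
    """Return a record with the first matching value for each schema field."""
    record = {}
    for field, patterns in schema.items():
        for pattern in patterns:
            match = re.search(pattern, text, flags=re.MULTILINE)
            if match:
                record[field] = match.group("v").strip()
                break  # keep the first alternative that matches
    return record


# Two made-up catalog fragments describing the same item in different layouts.
SOURCE_A = "Product: Desk lamp\nPrice: 19.90 EUR"
SOURCE_B = "Desk lamp | EUR 19.90"

record_a = extract(SOURCE_A)
record_b = extract(SOURCE_B)
```

Both fragments yield the same structured record, which is the property the abstract attributes to a common schema. A real system would additionally need the disambiguation and higher-level concept integration the abstract mentions, which this sketch does not attempt.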

Author information

Authors and Affiliations

Charles University, Malostranske nam 25, 118 00, Praha 1, Czech Republic
Richard Vlach
Forschungszentrum Informatik, Haid-und-Neu-Straße 10-14, 76131, Karlsruhe, Germany
Wassili Kazakos

Editor information

Editors and Affiliations

Institute of Informatics Problems, Russian Academy of Science,
Leonid Kalinichenko
Institute of Computer Science III, University of Bonn, Roemerstr. 164, D-53117, Bonn, Germany
Rainer Manthey
Institute of Computer Science and Applied Mathematics, Christian-Albrechts-University of Kiel, Olshausenstr. 40, 24098, Kiel, Germany
Bernhard Thalheim
Department of Information Technology/Mathematics, University of Applied Sciences, Friedrich-List-Platz 1, 01069, Dresden, Germany
Uwe Wloka

About this paper

Cite this paper

Vlach, R., Kazakos, W. (2003). Using Common Schemas for Information Extraction from Heterogeneous Web Catalogs. In: Kalinichenko, L., Manthey, R., Thalheim, B., Wloka, U. (eds) Advances in Databases and Information Systems. ADBIS 2003. Lecture Notes in Computer Science, vol 2798. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39403-7_11

DOI: https://doi.org/10.1007/978-3-540-39403-7_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20047-5
Online ISBN: 978-3-540-39403-7
eBook Packages: Springer Book Archive
