Abstract
The Web has become the world’s largest information source. Unfortunately, the main success factor of the Web, the inherent principle of distribution and autonomy of the participants, is also its main problem. To make this information machine-processable, common structures and semantics have to be identified. The goal of information extraction (IE) is exactly this: to transform text into a structured format. In this paper, we present a novel approach to information extraction developed as part of the XI3 project. Central to our approach is the assumption that we can obtain a better understanding of a text fragment if we consider its integration into higher-level concepts by exploiting text fragments from different parts of a source. Compared with previous approaches, we offer higher expressiveness of the extraction schema and an advanced method for dealing with ambiguous texts. Our approach also provides a way to use one extraction schema for multiple sources.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vlach, R., Kazakos, W. (2003). Using Common Schemas for Information Extraction from Heterogeneous Web Catalogs. In: Kalinichenko, L., Manthey, R., Thalheim, B., Wloka, U. (eds) Advances in Databases and Information Systems. ADBIS 2003. Lecture Notes in Computer Science, vol 2798. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39403-7_11
DOI: https://doi.org/10.1007/978-3-540-39403-7_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20047-5
Online ISBN: 978-3-540-39403-7
eBook Packages: Springer Book Archive