Abstract
The Web is the largest repository of information that has ever existed. This information is presented in a human-friendly format using HTML, which complicates its consumption by automatic processes. The Semantic Web and Web Services are solutions to this problem, but the lack of such services on the majority of web sites has increased interest in information extraction, which allows information from web documents to be extracted and structured in ontological models. Despite the large number of proposals on information extraction, no universally applicable information extractor exists. As a consequence, automatically populating an ontological model from a web site often requires more than one information extractor. We propose a framework that supports the development, training, and application of information extractors on semi-structured web documents to produce semantic data. We have developed a version of the framework and verified it by means of experiments on 15 web sites. Experimental results are very promising.
Supported by the European Commission (FEDER) and the Spanish and Andalusian R&D&I programmes (grants TIN2010-21744-C02-01, TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, and TIN2010-09988-E).
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sleiman, H.A., Hernández, I. (2012). A Framework for Populating Ontological Models from Semi-structured Web Documents. In: Atzeni, P., Cheung, D., Ram, S. (eds) Conceptual Modeling. ER 2012. Lecture Notes in Computer Science, vol 7532. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34002-4_48
DOI: https://doi.org/10.1007/978-3-642-34002-4_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34001-7
Online ISBN: 978-3-642-34002-4
eBook Packages: Computer Science, Computer Science (R0)