A Framework for Populating Ontological Models from Semi-structured Web Documents

Sleiman, Hassan A.; Hernández, Inma

doi:10.1007/978-3-642-34002-4_48


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7532)

Included in the following conference series:

International Conference on Conceptual Modeling


Abstract

The Web is the largest repository of information that has ever existed. This information is presented in a human-friendly format using HTML, which complicates its consumption by automatic processes. The Semantic Web and Web Services are solutions to this problem, but the lack of such services on the majority of web sites has increased interest in information extraction, which allows information from web documents to be extracted and structured in ontological models. Despite the large number of proposals on information extraction, no universally applicable information extractor exists. As a consequence, populating an ontological model automatically from a web site often requires more than one information extractor. We propose a framework that supports the development, training, and application of information extractors on semi-structured web documents to produce semantic data. We have developed a version of the framework and verified it by means of experiments on 15 web sites. Experimental results are very promising.

Supported by the European Commission (FEDER) and the Spanish and Andalusian R&D&I programmes (grants TIN2010-21744-C02-01, TIN2007-64119, P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, and TIN2010-09988-E).



Author information

Authors and Affiliations

University of Sevilla, Spain
Hassan A. Sleiman & Inma Hernández

Authors

Hassan A. Sleiman
Inma Hernández

Editor information

Editors and Affiliations

Dipartimento di informatica e Automazione, Università Roma Tre, Via Vasca Navale, 79, 00145, Roma, Italy
Paolo Atzeni
Department of Computer Science, University of Hong Kong, Pok Fu Lam Road, Hong Kong, China
David Cheung
Eller College of Management, University of Arizona, McClelland Hall, Room 108, P.O. Box 210108, 85721-0108, Tucson, AZ, USA
Sudha Ram


About this paper

Cite this paper

Sleiman, H.A., Hernández, I. (2012). A Framework for Populating Ontological Models from Semi-structured Web Documents. In: Atzeni, P., Cheung, D., Ram, S. (eds) Conceptual Modeling. ER 2012. Lecture Notes in Computer Science, vol 7532. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34002-4_48


DOI: https://doi.org/10.1007/978-3-642-34002-4_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34001-7
Online ISBN: 978-3-642-34002-4
eBook Packages: Computer Science, Computer Science (R0)
