Definition
A web data extraction system is a software system that automatically and repeatedly extracts data from web pages whose content changes and delivers the extracted data to a database or another application. The web data extraction task performed by such a system is usually divided into five functions: (i) Web interaction, which mainly comprises navigating to the target web pages, usually predetermined, that contain the desired information; (ii) Support for wrapper generation and execution, where a wrapper is a program that identifies the desired data on target pages, extracts it, and transforms it into a structured format; (iii) Scheduling, which allows previously generated wrappers to be applied repeatedly to their respective target pages; (iv) Data transformation, which includes filtering, transforming, refining, and integrating data extracted from one or more sources and...
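The notion of a wrapper described above (identify the desired data on a target page, extract it, and transform it into a structured format) can be illustrated with a minimal sketch. It is not taken from any of the systems named in this entry. It uses only Python's standard `html.parser` module, and the page layout, the `item` row class, the `PriceTableWrapper` class, and the `run_wrapper` function are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class PriceTableWrapper(HTMLParser):
    """Hypothetical wrapper: reads the two cells of each <tr class="item"> row."""

    def __init__(self):
        super().__init__()
        self.records = []       # structured output, one dict per matching row
        self._cells = None      # cell texts of the row being read, else None
        self._in_cell = False   # True while inside a <td> of a matching row

    def handle_starttag(self, tag, attrs):
        # Identification step: only rows marked class="item" hold desired data.
        if tag == "tr" and dict(attrs).get("class") == "item":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")

    def handle_data(self, data):
        # Extraction step: accumulate the text of the current cell.
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Transformation step: turn the raw cell texts into a typed record.
            if len(self._cells) == 2:
                name, price = (c.strip() for c in self._cells)
                self.records.append({"name": name, "price": float(price)})
            self._cells = None


def run_wrapper(html):
    """Apply the wrapper to one page and return its structured records."""
    wrapper = PriceTableWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


# Assumed sample page; a real system would fetch it during web interaction.
SAMPLE_PAGE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr class="item"><td>Widget</td><td>9.50</td></tr>
  <tr class="item"><td>Gadget</td><td>12.00</td></tr>
</table>
"""

records = run_wrapper(SAMPLE_PAGE)
```

In a complete system, a scheduler would presumably call something like `run_wrapper` again whenever the target page is refetched, and the resulting records would be passed on for data transformation and delivery. Those parts are deliberately left out of this sketch.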
© 2016 Springer Science+Business Media New York
Cite this entry
Baumgartner, R., Gatterbauer, W., Gottlob, G. (2016). Web Data Extraction System. In: Liu, L., Özsu, M. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4899-7993-3_1154-2
DOI: https://doi.org/10.1007/978-1-4899-7993-3_1154-2
Publisher Name: Springer, New York, NY
Online ISBN: 978-1-4899-7993-3
eBook Packages: Springer Reference Computer Sciences; Reference Module Computer Science and Engineering