Definition
Web data extraction is the process of automatically converting Web resources into a specific structured format. For example, if a collection of HTML Web pages describes details about various companies (name, headquarters, etc.), then Web data extraction would involve converting this native HTML format into computer-processable data structures, such as entries in relational database tables. The purpose of Web data extraction is to make Web data available for subsequent manipulation or integration steps. In the previous example, the goal may be to summarize the results in some form of analytical report.
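The company example can be sketched roughly as follows, using only the Python standard library. This is an illustrative sketch, not a method taken from the entry: the HTML fragment, the `company` and `hq` class names, the table schema, and the helper names `extract_companies` and `load_into_table` are all assumptions made up for the illustration.

```python
import re
import sqlite3

# Hypothetical page fragment (an assumption for illustration);
# real pages would be fetched over HTTP and would be far messier.
HTML = """
<div class="company"><h2>Acme Corp</h2><span class="hq">Dublin</span></div>
<div class="company"><h2>Globex</h2><span class="hq">Springfield</span></div>
"""

# One pattern per record: capture the company name and its headquarters.
RECORD = re.compile(
    r'<div class="company"><h2>(?P<name>[^<]+)</h2>'
    r'<span class="hq">(?P<hq>[^<]+)</span></div>'
)


def extract_companies(html):
    """Return (name, headquarters) tuples found in the HTML text."""
    return [
        (m.group("name").strip(), m.group("hq").strip())
        for m in RECORD.finditer(html)
    ]


def load_into_table(rows):
    """Store the extracted rows in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (name TEXT, headquarters TEXT)")
    conn.executemany("INSERT INTO company VALUES (?, ?)", rows)
    conn.commit()
    return conn


rows = extract_companies(HTML)
conn = load_into_table(rows)
# A trivial "analytical report": how many companies were extracted.
summary = conn.execute("SELECT COUNT(*) FROM company").fetchone()[0]
```

Once the data is in a table, the subsequent manipulation or integration steps reduce to ordinary queries. A regular expression this rigid would break on small markup changes, which is one reason dedicated extraction languages exist.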
There are several approaches to Web data extraction. The most common approach is to specify the conversion process using a special-purpose programming language for Web data extraction. Web data extraction then becomes a matter of executing a well-defined computer program.
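To give a feel for what "specifying the conversion process" might look like, here is a minimal sketch in Python rather than in any of the special-purpose languages this entry is about. The extraction is described declaratively as a small table of XPath-style rules, and a generic interpreter executes it. The rule format (`WRAPPER`), the function `run_wrapper`, and the sample page are hypothetical. The sketch also assumes the page is well-formed XML, which real HTML frequently is not, and it relies on the limited XPath subset supported by `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration).
PAGE = """
<html><body>
  <div class="company"><h2>Acme Corp</h2><span class="hq">Dublin</span></div>
  <div class="company"><h2>Globex</h2><span class="hq">Springfield</span></div>
</body></html>
"""

# Declarative extraction rules: where each record lives, and where each
# field lives relative to its record node.
WRAPPER = {
    "record": ".//div[@class='company']",
    "fields": {
        "name": "h2",
        "headquarters": "span[@class='hq']",
    },
}


def run_wrapper(wrapper, page):
    """Execute the extraction rules against a page; return a list of dicts."""
    root = ET.fromstring(page.strip())
    records = []
    for node in root.findall(wrapper["record"]):
        record = {}
        for field, path in wrapper["fields"].items():
            # findtext returns the default when the path matches nothing.
            record[field] = node.findtext(path, default="").strip()
        records.append(record)
    return records


records = run_wrapper(WRAPPER, PAGE)
```

Keeping the rules separate from the interpreter means that adapting to a new site would, under these assumptions, mean editing only `WRAPPER` rather than the program itself.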
Web data...
Keywords
- Programming Language
- Regular Expression
- XPath Query
- Automate Content Extraction
- Modern Programming Language
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Recommended Reading
Arasu A, Garcia-Molina H. Extracting structured data from Web pages. In: Proceedings of 2003 ACM SIGMOD International Conference on Management of data; 2003. p. 337–48.
Baumgartner R, Flesca S, Gottlob G. Visual web information extraction with Lixto. In: Proceedings of 27th International Conference on Very Large Data Bases; 2001. p. 119–28.
Crescenzi V, Mecca G, Merialdo P. RoadRunner: towards automatic data extraction from large web sites. In: Proceedings of 27th International Conference on Very Large Data Bases; 2001.
Kistler T, Marais H. WebL – a programming language for the web. Comput Netw ISDN Syst. 1998;30(1–7):259–70.
Knoblock CA, Lerman K, Minton S, Muslea I. Accurately and reliably extracting data from the web: a machine learning approach. In: Szczepaniak PS, Segovia J, Kacprzyk J, Zadeh LA, editors. Intelligent exploration of the web. Heidelberg: Physica-Verlag; 2003. p. 275–87.
Kushmerick N. Wrapper induction: efficiency and expressiveness. Artif Intell. 2000;118(1–2):15–68. Special issue on Intelligent Internet Systems.
Laender AHF, Ribeiro-Neto BA, da Silva AS, Teixeira JS. A brief survey of web data extraction tools. ACM SIGMOD Rec. 2002;31(2):84–93.
Liu L, Pu C, Han W. XWRAP: an XML-enabled wrapper construction system for web information sources. In: Proceedings of 16th International Conference on Data Engineering; 2000.
Muslea I, Minton S, Knoblock CA. Hierarchical wrapper induction for semistructured information sources. J Auton Agents Multi-Agent Syst. 2001;4(1–2):93–114.
Sahuguet A, Azavant F. Building intelligent web applications using lightweight wrappers. Data Knowl Eng. 2000;36(3):283–316.
Spertus E, Stein LA. Squeal: structured queries on the Web. In: Proceedings of 9th International World-Wide Web Conference; 2000.
Copyright information
© 2017 Springer Science+Business Media LLC
About this entry
Cite this entry
Kushmerick, N. (2017). Languages for Web Data Extraction. In: Liu, L., Özsu, M. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4899-7993-3_1156-3
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4899-7993-3
Online ISBN: 978-1-4899-7993-3
eBook Packages: Springer Reference Computer Sciences, Reference Module Computer Science and Engineering
Chapter history
- Latest: Languages for Web Data Extraction. Published: 16 February 2017. DOI: https://doi.org/10.1007/978-1-4899-7993-3_1156-3
- Original: Languages for Web Data Extraction. Published: 29 November 2016. DOI: https://doi.org/10.1007/978-1-4899-7993-3_1156-2