Fully Automatic Web Data Extraction

Ziegler, Cai-Nicolas

doi:10.1007/978-1-4614-8265-9_1159

Synonyms

Automatic wrapper induction; Web content extraction; Web information extraction

Definition

Web documents contain abundant hypertext markup, both for indicating structure and for giving page rendering hints, alongside informative textual content. Fully automatic Web data extraction aims to extract all relevant textual information from HTML documents without requiring human intervention at any point in the process. Two types of automatic Web extraction paradigms are commonly distinguished. The first is the extraction of one single block of informative content, e.g., in the case of news pages, which is also referred to as page cleaning [4]. The second is the extraction of recurring patterns across multiple blocks, typically the case for the extraction of search engine results. In the latter case, the extraction system will commonly also assign labels to the single atoms of each identified recurring block, such as the search result record’s title, snippet,...
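
The first paradigm, page cleaning, can be illustrated with a small sketch. It is not the method of any system cited in this entry. It assumes a simple heuristic: the informative block is the block-level element with the most text outside of hyperlinks. The names `BlockCollector`, `extract_main_block`, `BLOCK_TAGS`, and `SAMPLE_PAGE` are hypothetical, and the parser relies on reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"div", "p", "td", "article", "section"}
# Text inside these tags is never treated as content.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects the text of each block element and counts its link characters."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._stack = []       # open blocks as [text_parts, link_chars]
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._link_depth = 0   # > 0 while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._stack.append([[], 0])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS and self._stack:
            parts, link_chars = self._stack.pop()
            text = " ".join(" ".join(parts).split())  # normalize whitespace
            if text:
                self.blocks.append((text, link_chars))

    def handle_data(self, data):
        if self._skip_depth or not self._stack:
            return
        # Text is credited to the innermost open block only.
        block = self._stack[-1]
        block[0].append(data)
        if self._link_depth:
            block[1] += len(data.strip())


def extract_main_block(html):
    """Return the text of the block with the most non-link characters."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    best = max(parser.blocks, key=lambda b: len(b[0]) - b[1])
    return best[0]


SAMPLE_PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/sport">Sport</a></div>
<div id="story"><p>The city council approved the new budget on Monday after a long debate.</p></div>
<div id="footer">Copyright <a href="/terms">Terms of use and privacy policy</a></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_block(SAMPLE_PAGE))
```

On the sample page, the navigation and footer blocks consist mostly of link text and score low, so the story paragraph is selected. Real pages often spread the informative content over several sibling blocks, which a single-winner heuristic like this one does not handle.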

Recommended Reading

Crescenzi V, Mecca G, Merialdo P. RoadRunner: towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 109–18.
Debnath S, Mitra P, Giles CL. Automatic extraction of informative blocks from webpages. In: Proceedings of the 2005 ACM Symposium on Applied Computing; 2005. p. 1722–6.
Glance N, Hurst M, Nigam K, Siegler M, Stockton R, Tomokiyo T. Deriving marketing intelligence from online discussion. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2005. p. 419–28.
Hofmann K, Weerkamp W. Web corpus cleaning using content and structure. In: Fairon C, Naets H, Kilgarriff A, de Schryver G, editors. Building and exploring web corpora. vol. 4, UCL; 2007. p. 145–54.
Kovacevic M, Diligenti M, Gori M, Milutinovic V. Recognition of common areas in a web page using a visualization approach. In: Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications; 2002. p. 203–12.
Kushmerick N, Weld D, Doorenbos R. Wrapper induction for information extraction. In: Proceedings of the 15th International Joint Conference on AI; 1997. p. 119–28.
Lin SH, Ho JM. Discovering informative content blocks from web documents. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002. p. 588–93.
Liu B, Grossman R, Zhai Y. Mining data records in web pages. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2003. p. 601–6.
Muslea I, Minton S, Knoblock C. Hierarchical wrapper induction for semistructured information sources. Auton Agent Multi-Agent Syst. 2001;4(1–2):93–114.
Simon K, Lausen G. ViPER: augmenting automatic information extraction with visual perceptions. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management; 2005. p. 381–8.
Ziegler CN, Skubacz M. Towards automated reputation and brand monitoring on the web. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence; 2006. p. 1066–70.
Ziegler CN, Skubacz M. Content extraction from news pages using particle swarm optimization on linguistic and structural features. In: Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence; 2007. p. 242–9.
Zhao H, Meng W, Wu Z, Raghavan V, Yu C. Fully automatic wrapper generation for search engines. In: Proceedings of the 14th International World Wide Web Conference; 2005. p. 66–75.

Author information

Authors and Affiliations

Siemens AG, Munich, Germany
Cai-Nicolas Ziegler

Corresponding author

Correspondence to Cai-Nicolas Ziegler.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Section Editor information

Computing Lab., Oxford Univ., Oxford, UK
Georg Gottlob

About this entry

Cite this entry

Ziegler, CN. (2018). Fully Automatic Web Data Extraction. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_1159

DOI: https://doi.org/10.1007/978-1-4614-8265-9_1159
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
