Encyclopedia of Database Systems

2018 Edition
| Editors: Ling Liu, M. Tamer Özsu

GUIs for Web Data Extraction

  • Cai-Nicolas Ziegler
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_1163


Synonyms

Visual web data extraction; Visual web information extraction; Wrapper generator GUIs


Definition

While content management systems (CMS) are geared towards adding presentational information to relational and structured data from database systems, thus dynamically generating HTML documents, the goal of GUIs for Web data extraction is diametrically opposed: Web data extraction tools, which are commonly semi-automatic, aim to remove all presentational information from Web pages so that only pure structured content remains. The extraction process does not address single documents but template types, such as the product page of an online retailer or the news page template of an online journal. That is, one set of extraction rules is generated for each template type. These extraction rules are defined graphically, by selecting the relevant pieces of information and assigning labels to them. To this end, GUIs are used that largely resemble Web browsers,...
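
The idea of one rule set per template type can be illustrated with a minimal Python sketch. It is not the rule format of any particular tool: the rule dictionary `PRODUCT_RULES`, the class names `product-title` and `price`, and the helper `extract` are all hypothetical, and the matching logic (tag name plus CSS class) is a deliberately simple stand-in for whatever a real wrapper generator would derive from the user's selections.

```python
from html.parser import HTMLParser

# Hypothetical rule set for one template type ("product page"):
# each label maps to the (tag, class) of the element a user selected.
PRODUCT_RULES = {
    "title": ("h1", "product-title"),
    "price": ("span", "price"),
}


class RuleExtractor(HTMLParser):
    """Collects the text of the first element matching each labelled rule.

    Simplification: nested elements with the same tag name as the
    matched element are not handled.
    """

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.record = {}
        self._active = None  # label currently being captured, if any

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            return
        classes = (dict(attrs).get("class") or "").split()
        for label, (rule_tag, rule_class) in self.rules.items():
            if label not in self.record and tag == rule_tag and rule_class in classes:
                self._active = label
                self.record[label] = ""
                return

    def handle_data(self, data):
        if self._active is not None:
            self.record[self._active] += data

    def handle_endtag(self, tag):
        if self._active is not None and tag == self.rules[self._active][0]:
            self.record[self._active] = self.record[self._active].strip()
            self._active = None


def extract(html, rules):
    """Apply one rule set to one page and return a label -> text record."""
    parser = RuleExtractor(rules)
    parser.feed(html)
    parser.close()
    return parser.record


# Two pages that share the same (made-up) product template.
PAGE_A = ('<body><div class="nav">Home</div>'
          '<h1 class="product-title">Blue Kettle</h1>'
          '<span class="price"> 24.90 EUR </span></body>')
PAGE_B = ('<body><div class="nav">Home</div>'
          '<h1 class="product-title">Red Toaster</h1>'
          '<span class="price">39.00 EUR</span></body>')

for page in (PAGE_A, PAGE_B):
    print(extract(page, PRODUCT_RULES))
```

The same rule set is reused for every page of the template, and the navigation markup is discarded, which mirrors the goal of keeping only the labelled, structured content.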

Recommended Reading

  1. Adelberg B. NoDoSE: a tool for semi-automatically extracting structured and semi-structured data from text documents. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1998. p. 283–94.
  2. Baumgartner R, Flesca S, Gottlob G. Visual web information extraction with Lixto. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 119–28.
  3. Baumgartner R, Flesca S, Gottlob G. The ELOG web extraction language. In: Proceedings of the Artificial Intelligence on Logic for Programming; 2001. p. 548–60.
  4. Crescenzi V, Mecca G, Merialdo P. RoadRunner: towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 109–18.
  5. Kushmerick N, Weld D, Doorenbos R. Wrapper induction for information extraction. In: Proceedings of the 15th International Joint Conference on Artificial Intelligence; 1997. p. 119–28.
  6. Muslea I, Minton S, Knoblock C. Stalker: learning extraction rules for semistructured, web-based information sources. In: Proceedings of the AAAI Workshop on AI and Information Integration; 1998.
  7. Muslea I, Minton S, Knoblock C. Hierarchical wrapper induction for semistructured information sources. Auton Agent Multi-Agent Syst. 2001;4(1–2):93–114.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Siemens AG, Munich, Germany

Section editors and affiliations

  • Georg Gottlob (1)
  1. Computing Lab., Oxford Univ., Oxford, UK