Web Data Extraction System

  • Reference work entry
Encyclopedia of Database Systems

Synonyms

Web information extraction system; Web macros; Web scraper; Wrapper generator

Definition

A web data extraction system is a software system that automatically and repeatedly extracts data from web pages with changing content and delivers the extracted data to a database or some other application. The task of web data extraction performed by such a system is usually divided into five different functions: (i) Web interaction, which comprises mainly the navigation to usually pre-determined target web pages containing the desired information; (ii) Support for wrapper generation and execution, where a wrapper is a program that identifies the desired data on target pages, extracts the data and transforms it into a structured format; (iii) Scheduling, which allows repeated application of previously generated wrappers to their respective target pages; (iv) Data transformation, which includes filtering, transforming, refining, and integrating data extracted from one or more sources and...
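As a rough illustration of function (ii), the following is a minimal sketch of a wrapper in the sense defined above: it identifies the desired data on a target page, extracts it, and transforms it into a structured format. It is not taken from any system covered by this entry. The class `TableWrapper`, the function `extract_records`, and the sample page are hypothetical names and data introduced only for this example, and the sketch assumes the target data sits in a simple HTML table whose first row holds the column names. Navigation (i) and scheduling (iii) are left out.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hypothetical wrapper: collects the cell text of every <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # extracted rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a table cell is part of the desired data.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Return the table rows in `html` as dictionaries keyed by the header row."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up target page standing in for a fetched web page.
SAMPLE_PAGE = """
<table>
  <tr><th>title</th><th>price</th></tr>
  <tr><td>Book A</td><td>12.50</td></tr>
  <tr><td>Book B</td><td>8.00</td></tr>
</table>
"""

records = extract_records(SAMPLE_PAGE)
```

In a complete system, the resulting list of dictionaries would then be handed to the data transformation step and delivered to a database or another application; rerunning `extract_records` on a freshly fetched page is what lets the same wrapper keep up with changing content.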

Recommended Reading

  1. Anupam V, Freire J, Kumar B, Lieuwen D. Automating web navigation with the WebVCR. Comput Netw. 2000;33(1–6):503–17.
  2. Baumgartner R, Flesca S, Gottlob G. Visual web information extraction with Lixto. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 119–28.
  3. Crescenzi V, Mecca G, Merialdo P. RoadRunner: towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 109–18.
  4. Etzioni O, Cafarella MJ, Downey D, Kok S, Popescu A, Shaked T, Soderland S, Weld DS, Yates A. Web-scale information extraction in KnowItAll: (preliminary results). In: Proceedings of the 12th International World Wide Web Conference; 2004. p. 100–10.
  5. Gatterbauer W, Bohunsky P, Herzog M, Krüpl B, Pollak B. Towards domain-independent information extraction from web tables. In: Proceedings of the 16th International World Wide Web Conference; 2007. p. 71–80.
  6. Gottlob G, Koch C. Monadic datalog and the expressive power of languages for web information extraction. J ACM. 2002;51(1):74–113.
  7. Gottlob G, Koch C. A formal comparison of visual web wrapper generators. In: Proceedings of the 32nd International Current Trends in Theory and Practice of Computer Science; 2006. p. 30–48.
  8. Kuhlins S, Tredwell R. Toolkits for generating wrappers: a survey of software toolkits for automated data extraction from websites. In: Objects, Components, Architectures, Services, and Applications for a Networked World. International Conference NetObjectDays; 2003.
  9. Kushmerick N, Weld DS, Doorenbos RB. Wrapper induction for information extraction. In: Proceedings of the 15th International Joint Conference on AI; 1997. p. 729–37.
  10. Laender AHF, Ribeiro-Neto BA, da Silva AS. DEByE – data extraction by example. Data Knowl Eng. 2000;40(2):121–54.
  11. Liu L, Pu C, Han W. XWRAP: an XML-enabled wrapper construction system for web information sources. In: Proceedings of the 16th International Conference on Data Engineering; 2000. p. 611–21.
  12. Liu B, Grossman RL, Zhai Y. Mining web pages for data records. IEEE Intell Syst. 2004;19(6):49–55.
  13. Muslea I, Minton S, Knoblock CA. Hierarchical wrapper induction for semistructured information sources. Auton Agents Multi-Agent Syst. 2001;4(1/2):93–114.
  14. Pan A, Raposo J, Álvarez M, Montoto P, Orjales V, Hidalgo J, Ardao L, Molano A, Viña Á. The Denodo data integration platform. In: Proceedings of the 28th International Conference on Very Large Data Bases; 2002.
  15. Sahuguet A, Azavant F. Building intelligent web applications using lightweight wrappers. Data Knowl Eng. 2001;36(3):283–316.

Author information

Correspondence to Robert Baumgartner.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

About this entry

Cite this entry

Baumgartner, R., Gatterbauer, W., Gottlob, G. (2018). Web Data Extraction System. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_1154
