Languages for Web Data Extraction

  • Reference work entry

Synonyms

Information extraction; Screen scraping; Web mining; Web scraping; Web site wrappers

Definition

Web data extraction is the process of automatically converting Web resources into a specific structured format. For example, if a collection of HTML Web pages describes details about various companies (name, headquarters, etc.), then Web data extraction would involve converting this native HTML format into computer-processable data structures, such as entries in relational database tables. The purpose of Web data extraction is to make Web data available for subsequent manipulation or integration steps. In the previous example, the goal might be to summarize the extracted company data in some form of analytical report.
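
The company example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page markup, the company names, and the helper names (`CellCollector`, `extract_companies`, `load_into_table`) are all hypothetical, and real pages are rarely this regular. The sketch reads the table cells of one HTML page, stores them as rows of a relational table, and then runs a small summarizing query of the kind an analytical report might need.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical input page; markup and company data are invented for illustration.
PAGE = """
<table>
  <tr><th>Name</th><th>Headquarters</th></tr>
  <tr><td>Acme Corp</td><td>Dublin</td></tr>
  <tr><td>Globex</td><td>Springfield</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collects the text of each <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a tuple of cell strings
        self._row = None    # cells of the row being read, None outside <tr>
        self._cell = None   # text fragments of the current <td>, None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # the header row has only <th> cells, so it stays empty
                self.rows.append(tuple(self._row))
            self._row = None


def extract_companies(html):
    """Returns a list of (name, headquarters) tuples found in the page."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


def load_into_table(rows):
    """Stores the extracted rows in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (name TEXT, headquarters TEXT)")
    conn.executemany("INSERT INTO company VALUES (?, ?)", rows)
    return conn


conn = load_into_table(extract_companies(PAGE))
# Subsequent manipulation step: count companies per headquarters location.
hq_counts = conn.execute(
    "SELECT headquarters, COUNT(*) FROM company "
    "GROUP BY headquarters ORDER BY headquarters"
).fetchall()
```

Once the data sits in a table, the later manipulation and integration steps no longer depend on the original HTML layout.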

There are several approaches to Web data extraction. The most common approach is to specify the conversion process using a special-purpose programming Language for Web Data Extraction. Web data extraction then becomes a matter of executing a well-defined computer program.
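
To make the idea of an extraction program concrete, the sketch below treats a list of delimiter rules as the "program" and a small Python function as its interpreter. It is not the syntax of any actual extraction language: the rule format, the names `RULES` and `execute`, and the sample page are assumptions made for illustration, loosely in the spirit of delimiter-based wrappers. Each rule names an attribute and the strings expected immediately to its left and right in the page.

```python
# Hypothetical extraction "program": one (attribute, left, right) rule per field.
RULES = [
    ("name", "<b>", "</b>"),
    ("headquarters", "<i>", "</i>"),
]

# Hypothetical page; markup and data are invented for illustration.
PAGE = (
    "<p><b>Acme Corp</b> <i>Dublin</i></p>"
    "<p><b>Globex</b> <i>Springfield</i></p>"
)


def execute(rules, page):
    """Applies the rules in order, repeatedly, producing one record per pass.

    Scanning stops as soon as any delimiter can no longer be found;
    a partially filled record is discarded.
    """
    records = []
    pos = 0
    while True:
        record = {}
        for attr, left, right in rules:
            start = page.find(left, pos)
            if start == -1:
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record[attr] = page[start:end]
            pos = end + len(right)
        records.append(record)


records = execute(RULES, PAGE)
```

Separating the rules from the interpreter means that adapting to a different site only requires a new rule list, while the execution logic stays the same.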

Web data...

Recommended Reading

  1. Arasu A, Garcia-Molina H. Extracting structured data from Web pages. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2003. p. 337–48.

  2. Baumgartner R, Flesca S, Gottlob G. Visual web information extraction with Lixto. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 119–28.

  3. Crescenzi V, Mecca G, Merialdo P. RoadRunner: towards automatic data extraction from large web sites. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001.

  4. Kistler T, Marais H. WebL – a programming language for the web. Comput Netw ISDN Syst. 1998;30(1–7):259–70.

  5. Knoblock CA, Lerman K, Minton S, Muslea I. Accurately and reliably extracting data from the web: a machine learning approach. In: Szczepaniak PS, Segovia J, Kacprzyk J, Zadeh LA, editors. Intelligent exploration of the web. Heidelberg: Physica-Verlag; 2003. p. 275–87.

  6. Kushmerick N. Wrapper induction: efficiency and expressiveness. Artif Intell. 2000;118(1–2):15–68. Special issue on Intelligent Internet Systems.

  7. Laender AHF, Ribeiro-Neto BA, da Silva AS, Teixeira JS. A brief survey of web data extraction tools. ACM SIGMOD Rec. 2002;31(2):84–93.

  8. Liu L, Pu C, Han W. XWRAP: an XML-enabled wrapper construction system for web information sources. In: Proceedings of the 16th International Conference on Data Engineering; 2000.

  9. Muslea I, Minton S, Knoblock CA. Hierarchical wrapper induction for semistructured information sources. J Auton Agents Multi-Agent Syst. 2001;4(1–2):93–114.

  10. Sahuguet A, Azavant F. Building intelligent web applications using lightweight wrappers. Data Knowl Eng. 2000;36(3):283–316.

  11. Spertus E, Stein LA. Squeal: structured queries on the Web. In: Proceedings of the 9th International World-Wide Web Conference; 2000.

Author information

Corresponding author

Correspondence to Nicholas Kushmerick.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

About this entry

Cite this entry

Kushmerick, N. (2018). Languages for Web Data Extraction. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_1156
