A Framework for Generating Attribute Extractors for Web Data Sources

  • Conference paper
String Processing and Information Retrieval (SPIRE 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2476)

Abstract

To cope with the irregularities of typical semistructured Web data, extraction tools usually break the extraction task into two phases: an extraction phase, in which atomic attribute values are extracted from Web pages, and an assembling phase, in which these atomic values are grouped to form complex objects. As a consequence, the whole process depends heavily on the attribute values collected in the first phase: all attribute values of interest must be properly recognized, and spurious values must be discarded. Attribute value extraction is therefore an important problem. In this paper, we propose a new framework for generating attribute value extractors. The main appeal of this framework is that it can be adapted to deal with specific types of data sources and to incorporate distinct types of heuristics for achieving good extraction performance. To demonstrate the feasibility of this proposal, we present an implementation of the framework for data-rich Web pages and show how a number of simple heuristics, some of them presented in the recent literature, can be incorporated into it. We also present experimental results; in most cases, they are at least as good as results previously reported in the literature.
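
The two-phase process described in the abstract can be illustrated with a minimal Python sketch. It is not the framework proposed in the paper: the regular-expression extractors, the names `EXTRACTORS`, `extract_values`, and `assemble_objects`, the sample page, and the rule that a new object starts at each "title" value are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class AttributeValue:
    attribute: str  # hypothetical attribute name, e.g. "price"
    text: str       # the extracted atomic value
    position: int   # character offset of the value in the page text


# Hypothetical extractors: one regular expression per attribute.
EXTRACTORS = {
    "title": re.compile(r"Title:\s*(.+)"),
    "price": re.compile(r"\$\s*(\d+\.\d{2})"),
}


def extract_values(page):
    """Extraction phase: collect atomic attribute values in page order."""
    values = []
    for attribute, pattern in EXTRACTORS.items():
        for match in pattern.finditer(page):
            values.append(
                AttributeValue(attribute, match.group(1).strip(), match.start(1))
            )
    return sorted(values, key=lambda v: v.position)


def assemble_objects(values, first_attribute="title"):
    """Assembling phase: group atomic values into objects.

    Assumption for this sketch: each object starts with `first_attribute`.
    """
    objects = []
    for value in values:
        if value.attribute == first_attribute or not objects:
            objects.append({})
        objects[-1][value.attribute] = value.text
    return objects


# Made-up sample page text, used only to exercise the sketch.
PAGE = "Title: Book A\nPrice: $ 10.00\nTitle: Book B\nPrice: $ 12.50\n"

if __name__ == "__main__":
    for obj in assemble_objects(extract_values(PAGE)):
        print(obj)
```

The sketch also shows why the first phase matters so much: a value that `extract_values` misses or wrongly accepts ends up directly in the objects built by `assemble_objects`.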

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

de Castro Reis, D., Araújo, R.B., da Silva, A.S., Ribeiro-Neto, B.A. (2002). A Framework for Generating Attribute Extractors for Web Data Sources. In: Laender, A.H.F., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2002. Lecture Notes in Computer Science, vol 2476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45735-6_19

  • DOI: https://doi.org/10.1007/3-540-45735-6_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44158-8

  • Online ISBN: 978-3-540-45735-0

  • eBook Packages: Springer Book Archive
