A Framework for Generating Attribute Extractors for Web Data Sources

de Castro Reis, Davi; Araújo, Robson Braga; da Silva, Altigran S.; Ribeiro-Neto, Berthier A.

doi:10.1007/3-540-45735-6_19

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2476)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

To cope with the irregularities of typical semistructured Web data, extraction tools usually break the extraction task into two phases: an extraction phase, in which atomic attribute values are extracted from Web pages, and an assembling phase, in which these atomic values are grouped to form complex objects. As a consequence, the whole process depends heavily on the attribute values collected in the first phase: all attribute values of interest should be properly recognized, and spurious values should be discarded. Attribute value extraction is thus an important problem. In this paper, we propose a new framework for generating attribute value extractors. The main appeal of this framework is that it can be adapted to deal with specific types of data sources and to incorporate distinct types of heuristics for achieving good extraction performance. To demonstrate the feasibility of this proposal, we present an implementation of the framework for data-rich Web pages and show how a number of simple heuristics, some of them presented in the recent literature, can be incorporated into it. We also show experimental results; in most cases, our results are at least as good as results previously presented in the literature.
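
As a rough illustration of the two-phase process the abstract describes, the Python sketch below first collects atomic attribute values and then groups them into objects. It is not the framework proposed in the paper. The names `EXTRACTORS`, `extract_values`, and `assemble_objects`, the regular-expression heuristics, the grouping rule, and the sample page text are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class AttributeValue:
    attribute: str  # hypothetical attribute name, e.g. "price"
    text: str       # the extracted string
    position: int   # character offset of the value in the page text


# Hypothetical per-attribute heuristics: one regular expression each.
EXTRACTORS = {
    "title": re.compile(r"Title:\s*(.+)"),
    "price": re.compile(r"\$\s*(\d+(?:\.\d{2})?)"),
}


def extract_values(page_text):
    """Phase 1 (extraction): collect atomic attribute values in document order."""
    values = []
    for attribute, pattern in EXTRACTORS.items():
        for match in pattern.finditer(page_text):
            values.append(
                AttributeValue(attribute, match.group(1).strip(), match.start(1))
            )
    return sorted(values, key=lambda v: v.position)


def assemble_objects(values):
    """Phase 2 (assembling): group values into objects.

    Assumed grouping rule: seeing an attribute that the current object
    already holds starts a new object.
    """
    objects, current = [], {}
    for value in values:
        if value.attribute in current:
            objects.append(current)
            current = {}
        current[value.attribute] = value.text
    if current:
        objects.append(current)
    return objects


# Made-up sample page text, for illustration only.
PAGE = """Title: Example Book A
Price: $ 10.50
Title: Example Book B
Price: $ 20.00
"""

records = assemble_objects(extract_values(PAGE))
```

Because the assembling step only sees what the extraction step produced, a missed or spurious value in phase 1 propagates directly into the grouped objects, which is the dependency the abstract points out.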

Author information

Authors and Affiliations

Department of Computer Science, Federal University of Minas Gerais, 31270-901, Belo Horizonte MG, Brazil
Davi de Castro Reis, Robson Braga Araújo, Altigran S. da Silva & Berthier A. Ribeiro-Neto

Editor information

Editors and Affiliations

Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, 31270-901, Belo Horizonte, MG, Brazil
Alberto H. F. Laender
Instituto Superior Técnico, INESC-ID, R. Alves Redol 9, 1000-029, Lisboa, Portugal
Arlindo L. Oliveira

About this paper

Cite this paper

de Castro Reis, D., Araújo, R.B., da Silva, A.S., Ribeiro-Neto, B.A. (2002). A Framework for Generating Attribute Extractors for Web Data Sources. In: Laender, A.H.F., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2002. Lecture Notes in Computer Science, vol 2476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45735-6_19

DOI: https://doi.org/10.1007/3-540-45735-6_19
Published: 18 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44158-8
Online ISBN: 978-3-540-45735-0
eBook Packages: Springer Book Archive
