Beyond Supervised Learning of Wrappers for Extracting Information from Unseen Web Sites

Wong, Tak-Lam; Lam, Wai; Wang, Wei

doi:10.1007/978-3-540-45080-1_97

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2690)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

We investigate the problem of wrapper adaptation, which aims at adapting a previously learned wrapper to an unseen target site. To achieve this goal, we use extraction rules previously discovered from a particular site to seek potential candidates for training examples on the target site. We pose the identification of training examples for the target site as a hybrid text classification problem. The idea is to use a classification model to capture the characteristics of the attribute items of interest. Based on the automatically annotated training examples, a new wrapper for the unseen target Web site can then be discovered. We present encouraging experimental results on wrapper adaptation for some real-world Web sites.
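The pipeline outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the classifier or the wrapper language, so the sketch substitutes a character-shape lookup for the classification model and a simple left/right delimiter pair for the wrapper. All function names, the sample pages, and the price attribute are hypothetical.

```python
import re
from os.path import commonprefix

def shape(text):
    """Coarse character-shape signature: letter runs -> 'a', digit runs -> '9'."""
    text = re.sub(r"[A-Za-z]+", "a", text)
    return re.sub(r"[0-9]+", "9", text)

def train_classifier(source_items):
    """Stand-in content model (assumption): the set of shapes seen on the source site."""
    return {shape(item) for item in source_items}

def find_candidates(html):
    """Return (start, end) spans of every non-blank text fragment between tags."""
    return [m.span(1) for m in re.finditer(r">([^<>]+)(?=<)", html)
            if m.group(1).strip()]

def annotate(html, model):
    """Keep the candidate spans whose text the content model accepts."""
    return [(s, e) for s, e in find_candidates(html) if shape(html[s:e]) in model]

def learn_wrapper(html, spans):
    """Induce the (left, right) delimiters shared by all annotated spans."""
    lefts = [html[:s][::-1] for s, _ in spans]   # reversed, so prefix == suffix
    rights = [html[e:] for _, e in spans]
    return commonprefix(lefts)[::-1], commonprefix(rights)

def extract(html, wrapper):
    """Apply a (left, right) delimiter wrapper to a page."""
    left, right = wrapper
    return re.findall(re.escape(left) + r"(.*?)" + re.escape(right), html)

# Items already extracted from the source site by the old wrapper (hypothetical data).
SOURCE_PRICES = ["$10.99", "$7.50", "$120.00"]

# Unseen target site with a different layout (hypothetical data).
TARGET_PAGE = (
    '<tr><td class="t">Data Mining</td><td class="p">$35.00</td></tr>'
    '<tr><td class="t">Web Wrappers</td><td class="p">$18.25</td></tr>'
)
NEW_TARGET_PAGE = '<tr><td class="t">Text Mining</td><td class="p">$42.00</td></tr>'

model = train_classifier(SOURCE_PRICES)
wrapper = learn_wrapper(TARGET_PAGE, annotate(TARGET_PAGE, model))
prices = extract(NEW_TARGET_PAGE, wrapper)
```

No target-site page is labelled by hand here: the content model learned from the source site selects the training examples, and the new delimiters are induced from those examples alone. A real system would need a far more robust classifier and wrapper language than this toy version.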

The work described in this paper was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No: CUHK 4187/01E). This work was also partially supported by a grant from the Defense Advanced Research Projects Agency (DARPA), USA, under the TIDES programme (Grant No: N66001-00-1-8912), through a subcontract from City University of New York (Subcontract No: 47427-00-01A).

Author information

Authors and Affiliations

Department of Systems Engineering and Engineering Management, Ho Sin Hang Engineering Building, The Chinese University of Hong Kong, Shatin, Hong Kong
Tak-Lam Wong, Wai Lam & Wei Wang

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
Jiming Liu
Department of Computer Science, Hong Kong Baptist University, Hong Kong
Yiu-ming Cheung
School of Electrical and Electronic Engineering, University of Manchester, UK
Hujun Yin

About this paper

Cite this paper

Wong, TL., Lam, W., Wang, W. (2003). Beyond Supervised Learning of Wrappers for Extracting Information from Unseen Web Sites. In: Liu, J., Cheung, Ym., Yin, H. (eds) Intelligent Data Engineering and Automated Learning. IDEAL 2003. Lecture Notes in Computer Science, vol 2690. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45080-1_97

DOI: https://doi.org/10.1007/978-3-540-45080-1_97
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40550-4
Online ISBN: 978-3-540-45080-1
eBook Packages: Springer Book Archive
