Beyond Supervised Learning of Wrappers for Extracting Information from Unseen Web Sites

  • Conference paper
Intelligent Data Engineering and Automated Learning (IDEAL 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2690)

Abstract

We investigate the problem of wrapper adaptation, which aims at adapting a previously learned wrapper to an unseen target site. To achieve this goal, we use extraction rules previously discovered from a particular site to seek potential candidates for training examples on the target site. We pose the identification of training examples for the target site as a hybrid text classification problem: a classification model captures the characteristics of the attribute items of interest. Based on the automatically annotated training examples, a new wrapper for the unseen target Web site can then be discovered. We present encouraging experimental results on wrapper adaptation for some real-world Web sites.
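
The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the features or the classifier, so the feature set, the nearest-centroid rule, the function names, and the sample strings below are all assumptions made for illustration. The sketch also assumes that candidate text fragments from the target site have already been collected. It trains on attribute items extracted from a source site, then keeps the target-site candidates that resemble them as automatically annotated training examples.

```python
import re

def features(text):
    """Map a text fragment to a small vector of content features (hypothetical choice)."""
    tokens = text.split()
    n = max(len(tokens), 1)
    return [
        min(len(tokens), 10) / 10.0,                # token count, capped at 10
        sum(t[0].isupper() for t in tokens) / n,    # ratio of capitalised tokens
        1.0 if re.search(r"\d", text) else 0.0,     # contains a digit
        1.0 if "$" in text else 0.0,                # contains a dollar sign
    ]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(positives, negatives):
    """Build a nearest-centroid model from source-site items (assumed classifier)."""
    return (centroid([features(t) for t in positives]),
            centroid([features(t) for t in negatives]))

def annotate(model, candidates):
    """Keep the candidates that lie closer to the positive centroid."""
    pos_c, neg_c = model
    return [c for c in candidates
            if sq_dist(features(c), pos_c) < sq_dist(features(c), neg_c)]

# Source site: items the existing wrapper extracted for the attribute of
# interest (book titles here), plus other fragments from the same pages.
source_titles = ["Learning Python Programming",
                 "The Art of Computer Science",
                 "Data Mining Concepts"]
source_other = ["$29.95", "in stock: 3 copies", "ships in 2 days"]
model = train(source_titles, source_other)

# Target site: candidate fragments, assumed to be already collected.
target_candidates = ["Introduction to Information Retrieval", "$41.50",
                     "only 5 left", "Machine Learning Basics"]
examples = annotate(model, target_candidates)
print(examples)
```

The fragments kept in `examples` would then serve as training examples for learning a new wrapper on the target site; a real system would need richer features and a stronger classifier than this toy rule.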

The work described in this paper was substantially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project No: CUHK 4187/01E). This work was also partially supported by a grant from the Defense Advanced Research Projects Agency (DARPA), USA, under the TIDES programme (Grant No: N66001-00-1-8912), through a subcontract from City University of New York (Subcontract No: 47427-00-01A).

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wong, TL., Lam, W., Wang, W. (2003). Beyond Supervised Learning of Wrappers for Extracting Information from Unseen Web Sites. In: Liu, J., Cheung, Ym., Yin, H. (eds) Intelligent Data Engineering and Automated Learning. IDEAL 2003. Lecture Notes in Computer Science, vol 2690. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45080-1_97

  • DOI: https://doi.org/10.1007/978-3-540-45080-1_97

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40550-4

  • Online ISBN: 978-3-540-45080-1

  • eBook Packages: Springer Book Archive
