Abstract
We investigate the problem of wrapper adaptation, which aims to adapt a previously learned wrapper to an unseen target site. To achieve this goal, we use extraction rules previously discovered from a particular site to find candidate training examples for the target site. We pose the identification of training examples for the target site as a hybrid text classification problem. The idea is to use a classification model to capture the characteristics of the attribute items of interest. Based on the automatically annotated training examples, a new wrapper for the unseen target Web site can then be discovered. We present encouraging experimental results on wrapper adaptation for some real-world Web sites.
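The candidate-selection step described above can be illustrated with a minimal sketch. It is not the classifier used in the paper, whose details the abstract does not give: here a plain cosine similarity against a bag-of-words centroid stands in for the classification model, and every name (`tokenize`, `cosine`, `select_training_examples`, the `threshold` value, the sample strings) is a hypothetical choice made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_training_examples(candidates, source_items, threshold=0.3):
    """Keep target-site candidates that resemble items from the source site.

    source_items: attribute values extracted by the existing wrapper.
    candidates:   text fragments found on the unseen target site.
    threshold:    assumed cut-off; a real system would tune or learn it.
    """
    # Summarise the known attribute items as one term-count centroid.
    centroid = Counter(tok for item in source_items for tok in tokenize(item))
    return [
        c for c in candidates
        if cosine(Counter(tokenize(c)), centroid) >= threshold
    ]


source_items = ["Learning Python", "Python Cookbook", "Programming Python"]
candidates = ["Python in a Nutshell", "Add to cart", "Privacy policy"]
selected = select_training_examples(candidates, source_items)
print(selected)
```

The selected fragments would then serve as automatically annotated training examples for learning a wrapper on the target site; a content-only score like this one ignores page layout, which a hybrid model would presumably also take into account.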
The work described in this paper was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No: CUHK 4187/01E). This work was also partially supported by a grant from the Defense Advanced Research Projects Agency (DARPA), USA, under the TIDES programme (Grant No: N66001-00-1-8912), through a subcontract from the City University of New York (Subcontract No: 47427-00-01A).
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wong, TL., Lam, W., Wang, W. (2003). Beyond Supervised Learning of Wrappers for Extracting Information from Unseen Web Sites. In: Liu, J., Cheung, Ym., Yin, H. (eds) Intelligent Data Engineering and Automated Learning. IDEAL 2003. Lecture Notes in Computer Science, vol 2690. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45080-1_97
DOI: https://doi.org/10.1007/978-3-540-45080-1_97
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40550-4
Online ISBN: 978-3-540-45080-1
eBook Packages: Springer Book Archive