Abstract
We investigate the problem of wrapper adaptation, which aims to adapt a previously learned wrapper to an unseen target site. To achieve this goal, we use extraction rules previously discovered from a particular site to seek candidate training examples for the target site. We pose the identification of training examples for the target site as a hybrid text classification problem. The idea is to use a classification model to capture the characteristics of the attribute items of interest. Based on the automatically annotated training examples, a new wrapper for the unseen target Web site can then be discovered. We present encouraging experimental results on wrapper adaptation for some real-world Web sites.
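The training example identification step can be illustrated with a small sketch. It is not the hybrid classification model of the paper, which the abstract does not specify; it swaps in a plain nearest-centroid classifier over a few hand-picked content features. The function names (`features`, `annotate`), the feature set, and the sample strings are all assumptions made for illustration. Items extracted from the source site serve as positive examples, other page fragments as negative examples, and candidate fragments from the target site are kept when they lie closer to the positive centroid.

```python
import math


def features(text):
    """Map a text fragment to a small vector of content features (illustrative choice)."""
    tokens = text.split()
    n_chars = max(len(text), 1)
    n_tokens = max(len(tokens), 1)
    return [
        float(len(tokens)),                             # number of tokens
        sum(c.isdigit() for c in text) / n_chars,       # fraction of digit characters
        sum(t[:1].isupper() for t in tokens) / n_tokens,  # fraction of capitalized tokens
        1.0 if "$" in text else 0.0,                    # dollar sign present
    ]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def annotate(candidates, positives, negatives):
    """Keep the candidates that are closer to the positive centroid than to the negative one."""
    pos_center = centroid([features(t) for t in positives])
    neg_center = centroid([features(t) for t in negatives])
    return [
        c for c in candidates
        if distance(features(c), pos_center) < distance(features(c), neg_center)
    ]


# Hypothetical data: book titles extracted by the wrapper learned on the source site.
source_titles = [
    "The Nature of Statistical Learning Theory",
    "Artificial Intelligence A Modern Approach",
    "Pattern Recognition and Machine Learning",
]
# Hypothetical fragments from the source site that are not titles.
source_other = ["$59.99", "$120.00", "In stock", "add to cart"]
# Hypothetical candidate fragments found on the unseen target site.
target_candidates = [
    "Introduction to Information Retrieval",
    "$42.50",
    "Data Mining Concepts and Techniques",
    "add to basket",
]

selected = annotate(target_candidates, source_titles, source_other)
print(selected)
```

The selected fragments would then play the role of automatically annotated training examples from which a wrapper for the target site is learned. That learning step is outside the scope of this sketch.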
The work described in this paper was substantially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project No: CUHK 4187/01E). This work was also partially supported by a grant from the Defense Advanced Research Projects Agency (DARPA), USA under TIDES programme (Grant No: N66001-00-1-8912), subcontract from City University of New York (Subcontract No: 47427-00-01A).
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wong, TL., Lam, W., Wang, W. (2003). Beyond Supervised Learning of Wrappers for Extracting Information from Unseen Web Sites. In: Liu, J., Cheung, Ym., Yin, H. (eds) Intelligent Data Engineering and Automated Learning. IDEAL 2003. Lecture Notes in Computer Science, vol 2690. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45080-1_97
DOI: https://doi.org/10.1007/978-3-540-45080-1_97
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40550-4
Online ISBN: 978-3-540-45080-1
eBook Packages: Springer Book Archive