Abstract
To facilitate effective search on the World Wide Web, meta search engines have been developed that do not search the Web themselves but instead use available search engines to find the required information. Meta search engines rely on wrappers to extract information from the pages returned by search engines. We present an approach to automatically create such wrappers by means of an incremental grammar induction algorithm. The algorithm uses an adaptation of the string edit distance. Our method performs well: it is quick, can be used for several types of result pages, and requires a minimal amount of user interaction.
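For readers unfamiliar with the string edit distance mentioned in the abstract, the following is a minimal sketch of the classic (Levenshtein) formulation with unit costs. It is not the adaptation used in the paper, which the abstract does not specify. The function name `edit_distance` and the token sequences in the usage note are illustrative assumptions.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Classic Levenshtein distance with unit costs (not the paper's adaptation)."""
    # At the start of row i, prev[j] is the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty sequence
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x by y, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Because the function accepts any sequence, it can compare plain strings, as in `edit_distance("kitten", "sitting")`, or lists of tokens such as hypothetical HTML tag sequences, as in `edit_distance(["<li>", "<a>", "text"], ["<li>", "<b>", "<a>", "text"])`.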
Supported by the Logic and Language Links project funded by Elsevier Science.
Supported by the Spinoza project ‘Logic in Action.’
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chidlovskii, B., Ragetli, J., de Rijke, M. (2000). Wrapper Generation via Grammar Induction. In: López de Mántaras, R., Plaza, E. (eds) Machine Learning: ECML 2000. Lecture Notes in Computer Science, vol 1810. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45164-1_11
DOI: https://doi.org/10.1007/3-540-45164-1_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67602-7
Online ISBN: 978-3-540-45164-8
eBook Packages: Springer Book Archive