Abstract
To facilitate effective search on the World Wide Web, several ‘meta search engines’ have been developed which do not search the Web themselves, but use available search engines to find the required information. By means of wrappers, meta search engines retrieve relevant information from the HTML pages returned by search engines. In this paper we present an algorithm to create such wrappers automatically, based on an adaptation of the string edit distance. Our algorithm performs well; it is quick, it can be used for several types of result pages and it requires a minimal amount of interaction with the user.
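For orientation, the unmodified string edit distance (Levenshtein distance) on which such an adaptation would build can be sketched as follows. This is a minimal illustration of the standard dynamic program only, not the adapted algorithm of the paper, which the abstract does not specify. The function name `edit_distance` and the tokenized result items are hypothetical, chosen to show the distance applied to sequences of HTML tokens rather than to characters.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences.

    Works on strings or on lists of tokens; insertions, deletions and
    substitutions each cost 1. This is the textbook algorithm, not the
    adaptation described in the paper.
    """
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty sequence
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical token sequences for two result items on a search page;
# "TEXT" stands in for any run of character data.
item_a = ["<li>", "<a>", "TEXT", "</a>", "<br>", "TEXT", "</li>"]
item_b = ["<li>", "<a>", "TEXT", "</a>", "</li>"]

print(edit_distance(item_a, item_b))
print(edit_distance("kitten", "sitting"))
```

A small distance between two token sequences suggests that they share the same markup skeleton, which is the kind of regularity a wrapper for result pages can exploit.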
Supported by the Logic and Language Links project funded by Elsevier Science.
Supported by the Spinoza project ‘Logic in Action.’
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chidlovskii, B., Ragetli, J., de Rijke, M. (2000). Automatic Wrapper Generation for Web Search Engines. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_38
DOI: https://doi.org/10.1007/3-540-45151-X_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67627-0
Online ISBN: 978-3-540-45151-8
eBook Packages: Springer Book Archive