Automatic Wrapper Generation for Web Search Engines

  • Conference paper
Web-Age Information Management (WAIM 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1846)

Abstract

To facilitate effective search on the World Wide Web, several ‘meta search engines’ have been developed which do not search the Web themselves, but use available search engines to find the required information. By means of wrappers, meta search engines retrieve relevant information from the HTML pages returned by search engines. In this paper we present an algorithm to create such wrappers automatically, based on an adaptation of the string edit distance. Our algorithm performs well; it is quick, it can be used for several types of result pages and it requires a minimal amount of interaction with the user.
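The abstract does not spell out how the string edit distance is adapted, so the following is only a minimal sketch of the classic, unadapted edit distance (Levenshtein distance), applied to sequences of HTML tokens rather than characters. The function name `edit_distance` and the token lists are hypothetical illustrations and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic Levenshtein distance between two token sequences.

    Counts the minimum number of insertions, deletions, and
    substitutions needed to turn `a` into `b`. This is the textbook
    algorithm, not the adaptation described in the paper.
    """
    # prev[j] = distance between the tokens of `a` processed so far
    # (initially none) and the first j tokens of `b`.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # deleting all i tokens of `a` yields the empty prefix of `b`
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(
                prev[j] + 1,         # delete tok_a
                curr[j - 1] + 1,     # insert tok_b
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical tokenised fragments of two result items.
item_one = ["<li>", "<a>", "TEXT", "</a>", "</li>"]
item_two = ["<li>", "<b>", "TEXT", "</b>", "</li>"]
distance = edit_distance(item_one, item_two)
```

Comparing tokens instead of characters is one plausible way to measure how similar two stretches of a result page are; a low distance between neighbouring fragments suggests they follow the same repeating item layout.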

Supported by the Logic and Language Links project funded by Elsevier Science.

Supported by the Spinoza project ‘Logic in Action.’

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chidlovskii, B., Ragetli, J., de Rijke, M. (2000). Automatic Wrapper Generation for Web Search Engines. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_38

  • DOI: https://doi.org/10.1007/3-540-45151-X_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67627-0

  • Online ISBN: 978-3-540-45151-8

  • eBook Packages: Springer Book Archive
