A Unifying Approach to HTML Wrapper Representation and Learning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1967)

Abstract

The number, the size, and the dynamics of Internet information sources bear abundant evidence of the need for automation in information extraction. This calls for representation formalisms that match the World Wide Web reality and for learning approaches and learnability results that apply to these formalisms.

The concept of elementary formal systems is appropriately generalized to allow for the representation of wrapper classes that are relevant to the description of Internet sources in HTML format. Related learning results prove that those wrappers are automatically learnable from examples. This sets the stage for information extraction from the Internet by exploiting inductive learning techniques.

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Grieser, G., Jantke, K.P., Lange, S., Thomas, B. (2000). A Unifying Approach to HTML Wrapper Representation and Learning. In: Arikawa, S., Morishita, S. (eds) Discovery Science. DS 2000. Lecture Notes in Computer Science, vol 1967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44418-1_5

  • DOI: https://doi.org/10.1007/3-540-44418-1_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41352-3

  • Online ISBN: 978-3-540-44418-3

  • eBook Packages: Springer Book Archive
