A Unifying Approach to HTML Wrapper Representation and Learning

Grieser, Gunter; Jantke, Klaus P.; Lange, Steffen; Thomas, Bernd

doi:10.1007/3-540-44418-1_5

Conference paper
First Online: 19 October 2001

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1967)

Abstract

The number, the size, and the dynamics of Internet information sources bear abundant evidence of the need for automation in information extraction. This calls for representation formalisms that match the World Wide Web reality and for learning approaches and learnability results that apply to these formalisms.

The concept of elementary formal systems is appropriately generalized to allow for the representation of wrapper classes which are relevant to the description of Internet sources in HTML format. Related learning results prove that those wrappers are automatically learnable from examples. This sets the stage for information extraction from the Internet by exploiting inductive learning techniques.
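
To make the idea of learning a wrapper from examples concrete, the following is a minimal sketch in Python. It is not the generalized elementary formal system formalism of the paper. It assumes a much simpler delimiter-based wrapper, in which a target string is identified by the fixed text immediately to its left and right. The function names (`common_suffix`, `learn_delimiters`, `extract`) and the sample HTML snippets are hypothetical and chosen only for illustration.

```python
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_delimiters(examples):
    """Learn a (left, right) delimiter pair from (document, target) examples.

    Assumption: each target occurs in its document, and only the first
    occurrence is considered.
    """
    lefts, rights = [], []
    for document, target in examples:
        start = document.index(target)
        lefts.append(document[:start])
        rights.append(document[start + len(target):])
    # Generalize: keep only the context shared by all examples.
    return common_suffix(lefts), os.path.commonprefix(rights)


def extract(document, left, right):
    """Return every shortest substring enclosed by the learned delimiters."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, document, flags=re.DOTALL)


# Two hypothetical training pages with differing surrounding markup.
examples = [
    ("<p>Staff</p><b>Alice</b><i>Berlin</i>", "Alice"),
    ("<h1>Team list</h1><b>Bob</b><br>", "Bob"),
]
left, right = learn_delimiters(examples)

# Apply the learned wrapper to an unseen page.
names = extract("<ul><li><b>Carol</b></li><li><b>Dave</b></li></ul>", left, right)
print(left, right, names)
```

The sketch generalizes only by intersecting contexts, so it needs examples whose surrounding markup differs. With identical training pages it would learn the whole page as delimiters and fail on new documents.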

Author information

Authors and Affiliations

Technische Universität Darmstadt, FB Informatik, Alexanderstraße 10, 64283, Darmstadt, Germany
Gunter Grieser
Deutsches Forschungszentrum für Künstliche Intelligenz, Stuhlsatzenhausweg 3, 66123, Saarbrücken, Germany
Klaus P. Jantke
Universität Leipzig, Institut für Informatik, Augustusplatz 10-11, 04109, Leipzig, Germany
Steffen Lange
Universität Koblenz, FB Informatik, Rammsweg 1, 56070, Koblenz, Germany
Bernd Thomas

Editor information

Editors and Affiliations

Faculty of Information Science and Electrical Engineering, Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, 812-8581, Fukuoka, Japan
Setsuo Arikawa
Faculty of Science Department of Information Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-0033, Tokyo, Japan
Shinichi Morishita

Cite this paper

Grieser, G., Jantke, K.P., Lange, S., Thomas, B. (2000). A Unifying Approach to HTML Wrapper Representation and Learning. In: Arikawa, S., Morishita, S. (eds) Discovery Science. DS 2000. Lecture Notes in Computer Science, vol 1967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44418-1_5

DOI: https://doi.org/10.1007/3-540-44418-1_5
Published: 19 October 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41352-3
Online ISBN: 978-3-540-44418-3
eBook Packages: Springer Book Archive
