Abstract
This paper introduces a novel method for learning a wrapper that extracts text nodes from web pages, based on (k,l)-contextual tree languages. It also introduces a method for learning good values of k and l from a few positive and negative examples. Finally, it describes how the algorithm can be integrated into a tool for information extraction.
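The idea of choosing k and l from a few positive and negative examples can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the tree encoding, the simplified notion of a local pattern (a node with a window of at most k consecutive children, cut off at l levels), and the search order that prefers the smallest parameters. The names `forks`, `learn`, `accepts` and `choose_parameters` are hypothetical.

```python
from itertools import product

# Assumed encoding: a tree is (label, [children]).

def truncate(tree, depth):
    """Return a hashable copy of `tree` cut off below `depth` levels (depth >= 1)."""
    label, children = tree
    if depth == 1:
        return (label, ())
    return (label, tuple(truncate(c, depth - 1) for c in children))

def forks(tree, k, l):
    """Simplified local patterns (an assumption, not the paper's definition):
    each node paired with every window of at most k consecutive children,
    truncated to l levels."""
    label, children = tree
    found = set()
    if not children or l == 1:
        found.add((label, ()))
    else:
        width = min(k, len(children))
        for start in range(len(children) - width + 1):
            window = children[start:start + width]
            found.add((label, tuple(truncate(c, l - 1) for c in window)))
    for child in children:
        found |= forks(child, k, l)
    return found

def learn(positives, k, l):
    """The learned language is represented by the patterns seen in the positives."""
    patterns = set()
    for tree in positives:
        patterns |= forks(tree, k, l)
    return patterns

def accepts(patterns, tree, k, l):
    """A tree is accepted when all of its patterns were seen during learning."""
    return forks(tree, k, l) <= patterns

def choose_parameters(positives, negatives, max_k=3, max_l=3):
    """Try small (k, l) first and return the first pair whose learned
    language rejects every negative example, or None if no pair does."""
    candidates = sorted(product(range(1, max_k + 1), range(1, max_l + 1)), key=sum)
    for k, l in candidates:
        patterns = learn(positives, k, l)
        if not any(accepts(patterns, t, k, l) for t in negatives):
            return (k, l)
    return None

# Toy example: the negative tree differs from the positive one only in child order.
positives = [("tr", [("td", []), ("b", [])])]
negatives = [("tr", [("b", []), ("td", [])])]
print(choose_parameters(positives, negatives))
```

In this toy setting, small parameters generalize too much to tell the two trees apart, so the search has to widen the window before the negative example is rejected. Preferring the smallest workable pair is a design choice of the sketch, not a claim about the paper's procedure.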
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Raeymaekers, S., Bruynooghe, M., Van den Bussche, J. (2005). Learning (k,l)-Contextual Tree Languages for Information Extraction. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds) Machine Learning: ECML 2005. ECML 2005. Lecture Notes in Computer Science, vol 3720. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564096_31
DOI: https://doi.org/10.1007/11564096_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29243-2
Online ISBN: 978-3-540-31692-3
eBook Packages: Computer Science (R0)