Abstract
Machine-generated documents containing semi-structured text are rapidly forming the bulk of the data stored by organisations. Given a feature-based representation of such data, methods like SVMs are able to construct good models for information extraction (IE). But how are the feature definitions to be obtained in the first place? (We are referring here to the representation problem; selecting good features from the ones defined comes later.) So far, features have been defined manually or by using special-purpose programs, and neither approach scales well to handle the heterogeneity of the data or new domain-specific information. We suggest that Inductive Logic Programming (ILP) could assist in this. Specifically, we demonstrate the use of ILP to define features for seven IE tasks using two disparate sources of information. Our findings are as follows: (1) the ILP system is able to efficiently identify large numbers of good features; typically, the time taken to identify the features is comparable to the time taken to construct the predictive model; and (2) SVM models constructed with these ILP features are better than the best reported to date that rely heavily on hand-crafted features. For the ILP practitioner, we also present evidence supporting the claim that, for IE tasks, using an ILP system to assist in constructing an extensional representation of text data (in the form of features and their values) is better than using it to construct intensional models for the tasks (in the form of rules for information extraction).
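To make the idea of an extensional representation concrete, the following minimal Python sketch treats each feature as a Boolean function of a token in its context and tabulates the feature values for every token. Everything in it is an illustrative assumption: the three feature functions are hand-written stand-ins for clauses an ILP system might find, and the token format, tag names and example sentence are hypothetical rather than taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical token record: a word plus a part-of-speech tag.
Token = Dict[str, str]
Feature = Callable[[List[Token], int], bool]

# Hand-written stand-ins for ILP-derived feature definitions (assumptions,
# not the paper's features). Each one is true or false of the token at
# position i in the sentence.
def is_capitalised(sent: List[Token], i: int) -> bool:
    return sent[i]["word"][:1].isupper()

def preceded_by_title(sent: List[Token], i: int) -> bool:
    return i > 0 and sent[i - 1]["word"] in {"Dr.", "Prof.", "Mr.", "Ms."}

def followed_by_verb(sent: List[Token], i: int) -> bool:
    return i + 1 < len(sent) and sent[i + 1]["pos"].startswith("VB")

FEATURES: List[Feature] = [is_capitalised, preceded_by_title, followed_by_verb]

def featurise(sent: List[Token]) -> List[List[int]]:
    """Return one 0/1 row per token, with one column per feature."""
    return [[int(f(sent, i)) for f in FEATURES] for i in range(len(sent))]

# Hypothetical example sentence.
sentence = [
    {"word": "Prof.", "pos": "NNP"},
    {"word": "Smith", "pos": "NNP"},
    {"word": "speaks", "pos": "VBZ"},
    {"word": "today", "pos": "NN"},
]
rows = featurise(sentence)
```

Each row of `rows` is a feature vector of the kind a propositional learner such as an SVM could be trained on. An intensional model would instead use rules of this sort directly as the extractor.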
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ramakrishnan, G., Joshi, S., Balakrishnan, S., Srinivasan, A. (2008). Using ILP to Construct Features for Information Extraction from Semi-structured Text. In: Blockeel, H., Ramon, J., Shavlik, J., Tadepalli, P. (eds) Inductive Logic Programming. ILP 2007. Lecture Notes in Computer Science, vol 4894. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78469-2_22
DOI: https://doi.org/10.1007/978-3-540-78469-2_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78468-5
Online ISBN: 978-3-540-78469-2
eBook Packages: Computer Science (R0)