Using ILP to Construct Features for Information Extraction from Semi-structured Text

Ramakrishnan, Ganesh; Joshi, Sachindra; Balakrishnan, Sreeram; Srinivasan, Ashwin

doi:10.1007/978-3-540-78469-2_22


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4894)

Included in the following conference series:

International Conference on Inductive Logic Programming


Abstract

Machine-generated documents containing semi-structured text are rapidly forming the bulk of the data stored in an organisation. Given a feature-based representation of such data, methods like SVMs are able to construct good models for information extraction (IE). But how are the feature definitions to be obtained in the first place? (We are referring here to the representation problem: selecting good features from the ones defined comes later.) So far, features have been defined manually or by using special-purpose programs; neither approach scales well to handle the heterogeneity of the data or new domain-specific information. We suggest that Inductive Logic Programming (ILP) could assist in this. Specifically, we demonstrate the use of ILP to define features for seven IE tasks using two disparate sources of information. Our findings are as follows: (1) the ILP system is able to efficiently identify large numbers of good features; typically, the time taken to identify the features is comparable to the time taken to construct the predictive model; and (2) SVM models constructed with these ILP features are better than the best reported to date that rely heavily on hand-crafted features. For the ILP practitioner, we also present evidence supporting the claim that, for IE tasks, using an ILP system to assist in constructing an extensional representation of text data (in the form of features and their values) is better than using it to construct intensional models for the tasks (in the form of rules for information extraction).
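To make the "extensional representation" concrete, the following is a minimal sketch of what a table of features and their values might look like for a tokenised sentence. It is an illustration only: in the paper the feature definitions are found by an ILP system, whereas the three predicates here (`capitalised`, `preceded_by_title`, `followed_by_verb`), the token record layout, and the `featurise` helper are hypothetical hand-written stand-ins, not the authors' features or code.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical token record: a word plus a part-of-speech tag.
Token = Dict[str, str]
# A feature is a Boolean test on the token at position i in its sentence.
Feature = Callable[[Sequence[Token], int], bool]


def capitalised(tokens: Sequence[Token], i: int) -> bool:
    """True if the token starts with an upper-case letter."""
    return tokens[i]["word"][:1].isupper()


def preceded_by_title(tokens: Sequence[Token], i: int) -> bool:
    """True if the previous token is one of a few (assumed) titles."""
    return i > 0 and tokens[i - 1]["word"].lower() in {"mr.", "dr.", "prof."}


def followed_by_verb(tokens: Sequence[Token], i: int) -> bool:
    """True if the next token carries a verb tag (assumed to start with 'VB')."""
    return i + 1 < len(tokens) and tokens[i + 1]["pos"].startswith("VB")


# Stand-ins for the feature definitions an ILP system would produce.
FEATURES: List[Feature] = [capitalised, preceded_by_title, followed_by_verb]


def featurise(tokens: Sequence[Token],
              features: Sequence[Feature] = FEATURES) -> List[List[int]]:
    """Evaluate every feature on every token, giving one 0/1 row per token."""
    return [[int(f(tokens, i)) for f in features] for i in range(len(tokens))]


sentence = [
    {"word": "Dr.", "pos": "NNP"},
    {"word": "Smith", "pos": "NNP"},
    {"word": "speaks", "pos": "VBZ"},
    {"word": "today", "pos": "NN"},
]
rows = featurise(sentence)
for token, row in zip(sentence, rows):
    print(token["word"], row)
```

Each row is the kind of fixed-length vector that a propositional learner such as an SVM can consume; the intensional alternative discussed in the abstract would instead use learned rules directly as the extractor.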



Author information

Authors and Affiliations

IBM India Research Laboratory, Block 1, Indian Institute of Technology, New Delhi, 110016, India
Ganesh Ramakrishnan, Sachindra Joshi, Sreeram Balakrishnan & Ashwin Srinivasan
Dept. of CSE & Centre for Health Informatics, University of New South Wales, Sydney, Australia
Ashwin Srinivasan

Editor information

Hendrik Blockeel, Jan Ramon, Jude Shavlik, Prasad Tadepalli


About this paper

Cite this paper

Ramakrishnan, G., Joshi, S., Balakrishnan, S., Srinivasan, A. (2008). Using ILP to Construct Features for Information Extraction from Semi-structured Text. In: Blockeel, H., Ramon, J., Shavlik, J., Tadepalli, P. (eds) Inductive Logic Programming. ILP 2007. Lecture Notes in Computer Science, vol 4894. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78469-2_22


DOI: https://doi.org/10.1007/978-3-540-78469-2_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78468-5
Online ISBN: 978-3-540-78469-2
eBook Packages: Computer Science, Computer Science (R0)
