Using ILP to Construct Features for Information Extraction from Semi-structured Text

  • Conference paper
Inductive Logic Programming (ILP 2007)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4894))

Abstract

Machine-generated documents containing semi-structured text are rapidly forming the bulk of the data stored in an organisation. Given a feature-based representation of such data, methods like SVMs are able to construct good models for information extraction (IE). But how are the feature definitions to be obtained in the first place? (We are referring here to the representation problem: selecting good features from the ones defined comes later.) So far, features have been defined manually or by using special-purpose programs; neither approach scales well to the heterogeneity of the data or to new domain-specific information. We suggest that Inductive Logic Programming (ILP) could assist in this. Specifically, we demonstrate the use of ILP to define features for seven IE tasks using two disparate sources of information. Our findings are as follows: (1) the ILP system is able to efficiently identify large numbers of good features. Typically, the time taken to identify the features is comparable to the time taken to construct the predictive model; and (2) SVM models constructed with these ILP features are better than the best reported to date that rely heavily on hand-crafted features. For the ILP practitioner, we also present evidence supporting the claim that, for IE tasks, using an ILP system to assist in constructing an extensional representation of text data (in the form of features and their values) is better than using it to construct intensional models for the tasks (in the form of rules for information extraction).
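To make the idea of an extensional representation concrete, here is a minimal sketch in Python. It is not the authors' system: the abstract does not list any learned clauses, so the three feature definitions, the example sentence, and all function names below are hypothetical. Each feature mimics the body of a clause (a conjunction of conditions on a token and its neighbours), and evaluating every feature at every token position yields the table of features and values that a learner such as an SVM could consume.

```python
from typing import Callable, List

# A feature is a boolean test on the token at position i and its context.
Feature = Callable[[List[str], int], bool]

# Hypothetical clause-like features (assumptions for illustration only).
def numeric_after_at(tokens: List[str], i: int) -> bool:
    # True if the token starts with a digit and the previous token is "at".
    return i > 0 and tokens[i - 1].lower() == "at" and tokens[i][:1].isdigit()

def capitalised_after_title(tokens: List[str], i: int) -> bool:
    # True if the token is capitalised and follows a title such as "Dr.".
    return i > 0 and tokens[i - 1] in {"Dr.", "Prof."} and tokens[i][:1].isupper()

def followed_by_am_pm(tokens: List[str], i: int) -> bool:
    # True if the next token is "am" or "pm".
    return i + 1 < len(tokens) and tokens[i + 1].lower() in {"am", "pm"}

FEATURES: List[Feature] = [
    numeric_after_at,
    capitalised_after_title,
    followed_by_am_pm,
]

def featurise(tokens: List[str], features: List[Feature]) -> List[List[int]]:
    # One row per token, one 0/1 column per feature.
    return [[int(f(tokens, i)) for f in features] for i in range(len(tokens))]

tokens = ["Seminar", "by", "Dr.", "Smith", "at", "3", "pm"]
rows = featurise(tokens, FEATURES)
for token, row in zip(tokens, rows):
    print(f"{token:8s} {row}")
```

In this sketch the rows for "Smith" and "3" are the only ones with non-zero entries; in the setting the abstract describes, an ILP system would propose the feature definitions rather than a person writing them by hand.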

Editor information

Hendrik Blockeel Jan Ramon Jude Shavlik Prasad Tadepalli

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ramakrishnan, G., Joshi, S., Balakrishnan, S., Srinivasan, A. (2008). Using ILP to Construct Features for Information Extraction from Semi-structured Text. In: Blockeel, H., Ramon, J., Shavlik, J., Tadepalli, P. (eds) Inductive Logic Programming. ILP 2007. Lecture Notes in Computer Science, vol 4894. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78469-2_22

  • DOI: https://doi.org/10.1007/978-3-540-78469-2_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78468-5

  • Online ISBN: 978-3-540-78469-2

  • eBook Packages: Computer Science, Computer Science (R0)
