Incorporating Linguistic Expertise Using ILP for Named Entity Recognition in Data Hungry Indian Languages

Patel, Anup; Ramakrishnan, Ganesh; Bhattacharya, Pushpak

doi:10.1007/978-3-642-13840-9_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5989)

Included in the following conference series:

International Conference on Inductive Logic Programming

Abstract

Developing linguistically sound and data-compliant rules for named entity annotation is usually an intensive and time-consuming process for any developer or linguist. In this work, we present the use of two Inductive Logic Programming (ILP) techniques to construct rules for extracting instances of various named entity classes, thereby reducing the effort required of a linguist or developer. Using ILP for rule development not only reduces the amount of effort required but also provides an interactive framework in which a linguist can incorporate intuitions about named entities, both in the form of mode declarations for refinements (suitably exposed for ease of use by the linguist) and as background knowledge (in the form of linguistic resources). We have a small amount of tagged data: approximately 3884 sentences for Marathi and 22748 sentences for Hindi. The paucity of tagged data for Indian languages makes manual development of rules more challenging. However, the ability to fold background knowledge and domain expertise into ILP techniques comes to our rescue, and we have been able to develop rules that are mostly linguistically sound and that yield results comparable to rules hand-crafted by linguists. The ILP approach has two advantages over hand-crafting all rules: (i) development time is reduced by a factor of 240 when ILP is used instead of involving a linguist for the entire rule development, and (ii) the ILP technique has the computational edge of a complete and consistent view of all significant patterns in the data at the level of abstraction specified through the mode declarations. Point (ii) enables the discovery of rules that a linguist could miss and also makes it possible to scale rule development to a larger training dataset.
The rules thus developed can optionally be edited by linguistic experts and consolidated either (a) through default ordering (as in TILDE [1]), (b) with an ordering induced using [2], or (c) by using the rules as features in a statistical graphical model such as a conditional random field (CRF) [3]. We report results using WARMR [4] and TILDE to learn rules for named entities in two Indian languages, Hindi and Marathi.
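
To make two of the consolidation options concrete, the following is a minimal Python sketch, not the authors' implementation. It contrasts option (a), where rules are applied in a fixed order and the first rule that fires decides the label, with option (c), where each rule becomes one binary feature that a sequence model such as a CRF could consume. The rules, the `TITLES` gazetteer, the label names, and the transliterated example tokens are all hypothetical and were invented for illustration; rules learned by WARMR or TILDE would take their place.

```python
from typing import Callable, List, Tuple

# A rule pairs a named-entity label with a predicate over (tokens, position).
Rule = Tuple[str, Callable[[List[str], int], bool]]

# Hypothetical background knowledge: a tiny gazetteer of honorifics.
TITLES = {"shri", "dr"}


def preceded_by_title(tokens: List[str], i: int) -> bool:
    """Fires when the previous token is a known honorific."""
    return i > 0 and tokens[i - 1].lower().rstrip(".") in TITLES


def followed_by_nagar(tokens: List[str], i: int) -> bool:
    """Fires when the next token is the place suffix 'nagar'."""
    return i + 1 < len(tokens) and tokens[i + 1].lower() == "nagar"


# Hypothetical rule list; its order is the "default ordering".
RULES: List[Rule] = [
    ("PERSON", preceded_by_title),
    ("LOCATION", followed_by_nagar),
]


def decision_list_label(tokens: List[str], i: int,
                        rules: List[Rule] = RULES,
                        default: str = "O") -> str:
    """Option (a): the first rule that fires decides the label."""
    for label, predicate in rules:
        if predicate(tokens, i):
            return label
    return default


def rule_features(tokens: List[str], i: int,
                  rules: List[Rule] = RULES) -> List[int]:
    """Option (c): one binary feature per rule, for a statistical model."""
    return [1 if predicate(tokens, i) else 0 for _, predicate in rules]


tokens = ["Dr.", "Rao", "visited", "Shivaji", "Nagar"]
labels = [decision_list_label(tokens, i) for i in range(len(tokens))]
features = [rule_features(tokens, i) for i in range(len(tokens))]
```

In the decision-list variant a conflict between rules is settled purely by position in the list, whereas the feature variant leaves the weighting of each rule to the downstream model.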

References

1. Blockeel, H., De Raedt, L.: Top-down induction of logical decision trees. In: Artificial Intelligence (1998)
2. Chakravarthy, V., Joshi, S., Ramakrishnan, G., Godbole, S., Balakrishnan, S.: Learning Decision Lists with Known Rules for Text Mining. In: The Third International Joint Conference on Natural Language Processing, IJCNLP 2008 (2008)
3. Lafferty, J., McCallum, A., Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the International Conference on Machine Learning, ICML 2001 (2001)
4. Dehaspe, L., De Raedt, L.: Mining association rules in multiple relations. In: Džeroski, S., Lavrač, N. (eds.) ILP 1997. LNCS, vol. 1297, pp. 125–132. Springer, Heidelberg (1997)
5. Chinchor, N.A.: Overview of MUC-7/MET-2 (1998)
6. Grishman, R., Sundheim, B.: Message Understanding Conference-6: A brief history, pp. 466–471 (1996)
7. Grishman, R., Sundheim, B.: Automatic content extraction program. In: NIST (1998)
8. Annotation guidelines for entity detection and tracking (2004), http://www.ldc.upenn.edu/Projects/ACE/docs/EnglishEDTV4-2-6.PDF
9. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language. In: Seventh Conference on Natural Language Learning (CoNLL 2003), pp. 142–147 (2003)
10. Appelt, D.E., Hobbs, J.R., Bear, J., Israel, D.J., Tyson, M.: FASTUS: A finite-state processor for information extraction from real-world text. In: IJCAI, pp. 1172–1178 (1993)
11. Riloff, E.: Automatically constructing a dictionary for information extraction tasks. In: AAAI, pp. 811–816 (1993)
12. Califf, M.E., Mooney, R.J.: Relational learning of pattern-match rules for information extraction. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI 1999), pp. 328–334 (1999)
13. Soderland, S.: Learning information extraction rules for semi-structured and free text. In: Machine Learning (1999)
14. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: An architecture for development of robust HLT applications. In: Recent Advances in Language Processing, pp. 168–175 (2002)
15. Blockeel, H., et al.: Machine Learning Group - ACE Datamining System (March 2008), http://www.cs.kuleuven.be/~dtai/ACE/doc/ACEuser-1.2.12-r1.pdf
16. Sarawagi, S.: CRF Project Page (2004), http://crf.sourceforge.net/

Author information

Authors and Affiliations

Department of Computer Science and Engineering, IIT Bombay, Mumbai, 400076, India
Anup Patel, Ganesh Ramakrishnan & Pushpak Bhattacharya

Editor information

Editors and Affiliations

Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001, Heverlee, Belgium
Luc De Raedt

Cite this paper

Patel, A., Ramakrishnan, G., Bhattacharya, P. (2010). Incorporating Linguistic Expertise Using ILP for Named Entity Recognition in Data Hungry Indian Languages. In: De Raedt, L. (ed.) Inductive Logic Programming. ILP 2009. Lecture Notes in Computer Science, vol 5989. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13840-9_16

DOI: https://doi.org/10.1007/978-3-642-13840-9_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13839-3
Online ISBN: 978-3-642-13840-9
eBook Packages: Computer Science, Computer Science (R0)
