A Semi-supervised Approach for Maximum Entropy Based Hindi Named Entity Recognition

  • Sujan Kumar Saha
  • Pabitra Mitra
  • Sudeshna Sarkar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5909)


Scarcity of annotated data is a challenge in building high-performance named entity recognition (NER) systems for resource-poor languages. We present a semi-supervised approach to Hindi NER that combines a small annotated corpus with a large raw corpus and uses a maximum entropy classifier. We propose a novel statistical annotation confidence measure for this purpose. The measure drives selective sampling in semi-supervised NER. We also apply prior modulation to the maximum entropy classifier, using the annotation confidence values as ‘prior weights’. Extensive experiments demonstrate that the proposed technique outperforms the baseline classifier.
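The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the confidence measure, so the maximum posterior probability of the classifier stands in for it, and "prior modulation" is interpreted here as weighting each training example's log-likelihood term by its confidence. The feature strings, the labels, the threshold and the function names (`predict_proba`, `train_maxent`, `self_train`) are all hypothetical.

```python
import math
from collections import defaultdict


def predict_proba(w, feats, labels):
    """Maximum entropy (softmax) distribution over labels for one token."""
    scores = {y: sum(w.get((f, y), 0.0) for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def train_maxent(examples, weights, labels, epochs=200, lr=0.5):
    """Fit a weighted maximum entropy model by batch gradient ascent.

    examples: list of (feature_list, label); weights: one weight per example.
    """
    w = defaultdict(float)  # (feature, label) -> parameter
    total = sum(weights)
    for _ in range(epochs):
        grad = defaultdict(float)
        for (feats, gold), wt in zip(examples, weights):
            probs = predict_proba(w, feats, labels)
            for y in labels:
                # Observed minus expected feature count, scaled by the weight.
                coef = wt * ((1.0 if y == gold else 0.0) - probs[y])
                for f in feats:
                    grad[(f, y)] += coef
        for key, g in grad.items():
            w[key] += lr * g / total
    return w


def self_train(labeled, raw, labels, threshold=0.8, rounds=3):
    """Selective sampling: add confidently tagged raw tokens to the training set."""
    examples = list(labeled)
    weights = [1.0] * len(labeled)  # hand-annotated data keeps full weight
    pool = list(raw)
    model = train_maxent(examples, weights, labels)
    for _ in range(rounds):
        remaining, added = [], 0
        for feats in pool:
            probs = predict_proba(model, feats, labels)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:  # stand-in confidence measure (assumption)
                examples.append((feats, best))
                weights.append(probs[best])  # confidence used as the example weight
                added += 1
            else:
                remaining.append(feats)
        pool = remaining
        if not added:
            break
        model = train_maxent(examples, weights, labels)
    return model, examples, weights


# Toy data with made-up contextual features (assumption, not from the paper).
LABELS = ["PER", "LOC", "O"]
labeled = [
    (["prev=shri"], "PER"),
    (["prev=shri", "suffix=kumar"], "PER"),
    (["suffix=pur"], "LOC"),
    (["suffix=pur", "next=mein"], "LOC"),
    (["next=hai"], "O"),
    (["pos=verb"], "O"),
]
raw = [["prev=shri", "suffix=kumar"], ["suffix=pur", "next=mein"], ["unseen=x"]]

model, examples, weights = self_train(labeled, raw, LABELS)
per_probs = predict_proba(model, ["prev=shri"], LABELS)
print(max(per_probs, key=per_probs.get))
```

In this sketch a raw token whose features the model has never seen receives a uniform distribution, falls below the threshold and is left out of the training set, which is the intended effect of selective sampling. A real system would tag whole sentences and use a much richer feature set.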


Keywords (machine-generated, not supplied by the authors): Training Corpus; Name Entity Recognition; Entity Recognition; Annotate Corpus; Name Entity



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Sujan Kumar Saha (1)
  • Pabitra Mitra (1)
  • Sudeshna Sarkar (1)
  1. Indian Institute of Technology, Kharagpur, India
