A Semi-supervised Approach for Maximum Entropy Based Hindi Named Entity Recognition
Scarcity of annotated data is a challenge in building high-performance named entity recognition (NER) systems for resource-poor languages. We present a semi-supervised approach to Hindi NER that combines a small annotated corpus with a large raw corpus, using a maximum entropy classifier. For this purpose, we propose a novel statistical annotation confidence measure. The measure drives selective sampling in semi-supervised NER. In addition, we use prior modulation of the maximum entropy classifier, in which the annotation confidence values serve as ‘prior weights’. Extensive experiments demonstrate the superiority of the proposed technique over the baseline classifier.
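The selective-sampling loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the proposed confidence measure, so the sketch substitutes the classifier's maximum posterior probability as a stand-in, and it approximates the ‘prior weight’ idea by weighting each automatically annotated example's contribution to the log-likelihood by its confidence. The function names (`train_maxent`, `predict_proba`, `self_train`), the threshold, and the toy features are all hypothetical.

```python
import math
from collections import defaultdict


def predict_proba(model, feats):
    """Return P(label | feats) under a maximum entropy (softmax) model."""
    w, labels = model
    scores = {lab: sum(w.get((f, lab), 0.0) for f in feats) for lab in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {lab: math.exp(s - top) for lab, s in scores.items()}
    z = sum(exps.values())
    return {lab: e / z for lab, e in exps.items()}


def train_maxent(data, sample_w=None, epochs=200, lr=0.5):
    """Fit binary-feature maxent weights by gradient ascent on the
    (optionally example-weighted) log-likelihood."""
    labels = sorted({y for _, y in data})
    sample_w = sample_w or [1.0] * len(data)
    total = sum(sample_w)
    w = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for (feats, y), sw in zip(data, sample_w):
            probs = predict_proba((w, labels), feats)
            for f in feats:
                for lab in labels:
                    # observed minus expected feature count, scaled by weight
                    grad[(f, lab)] += sw * ((lab == y) - probs[lab])
        for key, g in grad.items():
            w[key] += lr * g / total
    return w, labels


def self_train(labeled, unlabeled, threshold=0.8, rounds=3):
    """Selective sampling: move confidently labeled raw examples into the
    training set, keeping their confidence as an example weight."""
    data = list(labeled)
    sample_w = [1.0] * len(data)  # hand-annotated examples get full weight
    pool = list(unlabeled)
    model = train_maxent(data, sample_w)
    for _ in range(rounds):
        remaining, added = [], 0
        for feats in pool:
            probs = predict_proba(model, feats)
            label, conf = max(probs.items(), key=lambda kv: kv[1])
            if conf >= threshold:  # stand-in confidence: max posterior
                data.append((feats, label))
                sample_w.append(conf)
                added += 1
            else:
                remaining.append(feats)
        pool = remaining
        if not added:
            break
        model = train_maxent(data, sample_w)
    return model, data, sample_w


# Toy data (hypothetical features): each example is a list of active features.
labeled = [
    (["prev=in", "cap"], "LOC"),
    (["prev=mr", "cap"], "PER"),
    (["prev=in", "suffix=pur"], "LOC"),
    (["prev=mr", "suffix=sh"], "PER"),
]
unlabeled = [["prev=in", "suffix=bad"], ["prev=mr", "suffix=an"], ["suffix=xyz"]]

model, data, sample_w = self_train(labeled, unlabeled)
for (feats, label), weight in zip(data, sample_w):
    print(label, round(weight, 3), feats)
```

In this sketch, an example whose features were never seen in training receives a uniform posterior and therefore stays in the raw pool, which is the behaviour a selective-sampling threshold is meant to enforce.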
Keywords: Training Corpus; Name Entity Recognition; Entity Recognition; Annotate Corpus; Name Entity