Issues and Challenges in Developing Statistical POS Taggers for Sambalpuri

  • Pitambar Behera (email author)
  • Atul Kr. Ojha
  • Girish Nath Jha
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10930)

Abstract

Low-density languages are also known as lesser-known, poorly-described, less-resourced, minority, or less-computerized languages because they have fewer resources available. Collecting and annotating a voluminous corpus for NLP applications in these languages proves quite challenging. Developing any NLP application for a low-density language requires an annotated corpus and a standard annotation scheme. Because of their non-standard usage in text and other linguistic nuances, these languages pose significant challenges that are both linguistic and technical in nature. The present paper highlights some of the underlying issues and challenges in developing statistical POS taggers using SVM and CRF++ for Sambalpuri, a less-resourced Eastern Indo-Aryan language. A corpus of approximately 121 k is collected from the web and converted into Unicode encoding. The whole corpus is annotated under the BIS (Bureau of Indian Standards) annotation scheme devised for Odia under the ILCI (Indian Languages Corpora Initiative) Project. Both taggers are trained and tested on approximately 80 k and 13 k, respectively. The SVM tagger achieves 83% accuracy, while the CRF++ tagger achieves a lower 71.56%.
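
As an illustration of the train-and-evaluate setup described above, the following is a minimal Python sketch, not the authors' code. It assumes that the reported accuracy is token-level (the abstract does not say) and that the annotated corpus is serialized in the column layout CRF++ reads: one token per line, whitespace-separated columns with the tag last, and an empty line between sentences. The function names, the placeholder tokens (`w1`, `w2`, `w3`) and the sample tags are hypothetical and chosen only for the example.

```python
from typing import List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def to_column_format(sentences: List[Sentence]) -> str:
    """Serialize tagged sentences as 'token<TAB>tag' lines.

    Sentences are separated by an empty line, which is the layout
    CRF++ expects for its training and test files.
    """
    blocks = ["\n".join(f"{tok}\t{tag}" for tok, tag in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


def token_accuracy(gold: List[Sentence], pred: List[Sentence]) -> float:
    """Return the fraction of tokens whose predicted tag matches the gold tag.

    Assumption: accuracy is counted per token, not per sentence.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    correct = 0
    total = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("sentence lengths differ between gold and pred")
        for (_, g_tag), (_, p_tag) in zip(g_sent, p_sent):
            total += 1
            correct += int(g_tag == p_tag)
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total


# Hypothetical toy data: placeholder tokens with illustrative tags.
gold = [[("w1", "N_NN"), ("w2", "V_VM")], [("w3", "PSP")]]
pred = [[("w1", "N_NN"), ("w2", "N_NN")], [("w3", "PSP")]]

column_text = to_column_format(gold)
accuracy = token_accuracy(gold, pred)
```

Here `column_text` could be written to a file and passed to a sequence labeller, and `accuracy` compares a tagger's output with the gold annotation over the held-out portion of the corpus.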

Keywords

Low-density language; Parts of speech tagger; SVM; CRF++; Sambalpuri; Eastern IA Language

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Pitambar Behera (1, email author)
  • Atul Kr. Ojha (2)
  • Girish Nath Jha (1, 2)
  1. Centre for Linguistics, Jawaharlal Nehru University, New Delhi, India
  2. Special Centre for Sanskrit Studies, Jawaharlal Nehru University, New Delhi, India
