Abstract
Part-of-speech (POS) tagging is the task of assigning a part-of-speech tag to each word of a text, and several approaches to it exist. In this paper we compare the performance of a few POS tagging techniques for the Bangla language, specifically statistical approaches (n-gram, HMM) and a transformation-based approach (Brill’s tagger). A supervised POS tagger needs a large annotated training corpus to tag accurately. At this initial stage of POS tagging for Bangla, only a very limited annotated corpus is available, so we examined which technique gives the best performance with this limited resource. We also measured the performance of the same techniques on English, in order to estimate how they might perform on Bangla once a substantial annotated corpus becomes available.
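As a rough illustration of the n-gram family of taggers named above, the following is a minimal sketch of a bigram tagger that backs off to a unigram tagger and then to a default tag. It is not the authors' implementation: the function names, the toy English sentences, the three-tag set (DT, NN, VB) and the default tag are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data; not taken from the paper's corpus.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VB")],
    [("the", "DT"), ("run", "NN"), ("ends", "VB")],
    [("dogs", "NN"), ("run", "VB")],
]

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def train_bigram(tagged_sents):
    """Map (previous tag, word) to the most frequent tag in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<s>"  # sentence-start marker
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}

def tag(words, bigram, unigram, default="NN"):
    """Tag left to right: try the bigram model, then unigram, then default."""
    tagged = []
    prev = "<s>"
    for word in words:
        t = bigram.get((prev, word)) or unigram.get(word, default)
        tagged.append((word, t))
        prev = t
    return tagged

def accuracy(gold_sents, bigram, unigram):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        predicted = tag([w for w, _ in sent], bigram, unigram)
        for (_, gold), (_, pred) in zip(sent, predicted):
            correct += gold == pred
            total += 1
    return correct / total

UNIGRAM = train_unigram(TRAIN)
BIGRAM = train_bigram(TRAIN)
```

In this toy setting the word "run" is ambiguous: `tag(["dogs", "run"], BIGRAM, UNIGRAM)` uses the preceding noun tag to pick the verb reading, while `tag(["the", "run"], BIGRAM, UNIGRAM)` picks the noun reading after a determiner. Unknown words fall through to the default tag, which is one reason accuracy depends heavily on how much annotated training data is available.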
Copyright information
© 2007 Springer
About this paper
Cite this paper
Hasan, F.M., UzZaman, N., Khan, M. (2007). Comparison of different POS Tagging Techniques (n-gram, HMM and Brill’s tagger) for Bangla. In: Elleithy, K. (eds) Advances and Innovations in Systems, Computing Sciences and Software Engineering. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-6264-3_23
DOI: https://doi.org/10.1007/978-1-4020-6264-3_23
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-6263-6
Online ISBN: 978-1-4020-6264-3
eBook Packages: Engineering, Engineering (R0)