Abstract
This article focuses on Part-of-Speech (POS) tagging of unknown words. POS tagging is one of the fundamental requirements for intelligent, language-specific text processing. The first aim of this article is therefore to provide a high-accuracy POS tagger for the Persian language. To handle unknown words, the article proposes combining a type of associative classifier with a Hidden Markov Model (HMM) algorithm. Associative classification is a new classification approach that integrates association mining and classification. The associative classifier used in this study is a new type introduced by this research: it uses not only sequence probability but also the CBA classifier. CBA first generates all association rules that meet given support and confidence thresholds as candidate rules. It then selects a small subset of these rules to form a classifier. To predict the class label of an example, it chooses the best rule whose body is satisfied by that example. According to the experimental results, the proposed algorithm raises the accuracy of POS tagging for Persian unknown words to 81.8%. The overall accuracy of the proposed tagger is 98%, and its sentence accuracy is 63.1%.
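The three CBA steps described above (generate candidate rules, select a small subset, predict with the best matching rule) can be illustrated with a minimal sketch. This is not the authors' ACUT implementation and it omits the HMM and sequence-probability components entirely. The feature items (`suffix=...`, `cap=...`), the tag names, the toy training set, the threshold values, and all function names are assumptions made only for illustration. The selection step is a simplified coverage-based pruning, and the default class is simply the majority training label.

```python
from collections import Counter
from itertools import combinations

def mine_rules(examples, min_support=0.2, min_confidence=0.6, max_len=2):
    """Generate candidate rules body -> label that meet both thresholds."""
    n = len(examples)
    body_counts, rule_counts = Counter(), Counter()
    for items, label in examples:
        for k in range(1, max_len + 1):
            for body in combinations(sorted(items), k):
                body_counts[body] += 1
                rule_counts[(body, label)] += 1
    rules = []
    for (body, label), count in rule_counts.items():
        support = count / n                    # fraction of examples matching body and label
        confidence = count / body_counts[body] # fraction of body matches carrying this label
        if support >= min_support and confidence >= min_confidence:
            rules.append((frozenset(body), label, support, confidence))
    # Precedence: higher confidence, then higher support, then shorter body.
    # The last two keys only make the order deterministic.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0]), sorted(r[0]), r[1]))
    return rules

def select_rules(rules, examples):
    """Simplified coverage pruning: keep a rule only if it correctly
    classifies at least one example that no earlier rule has covered."""
    remaining = list(examples)
    selected = []
    for body, label, support, confidence in rules:
        covered = [ex for ex in remaining if body <= ex[0]]
        if any(lbl == label for _, lbl in covered):
            selected.append((body, label, support, confidence))
            remaining = [ex for ex in remaining if not body <= ex[0]]
        if not remaining:
            break
    # Assumption: default class is the majority label of the training data.
    default = Counter(lbl for _, lbl in examples).most_common(1)[0][0]
    return selected, default

def predict(selected, default, items):
    """Return the label of the first (best) rule whose body is satisfied."""
    for body, label, _, _ in selected:
        if body <= items:
            return label
    return default

# Hypothetical toy data: surface features of a word and its tag.
train = [
    (frozenset({"suffix=ing", "cap=no"}), "V"),
    (frozenset({"suffix=ing", "cap=no"}), "V"),
    (frozenset({"suffix=ing", "cap=yes"}), "N"),
    (frozenset({"suffix=ly", "cap=no"}), "ADV"),
    (frozenset({"suffix=ly", "cap=no"}), "ADV"),
    (frozenset({"suffix=s", "cap=yes"}), "N"),
    (frozenset({"suffix=s", "cap=no"}), "N"),
    (frozenset({"suffix=s", "cap=yes"}), "N"),
]

selected, default = select_rules(mine_rules(train), train)
print(predict(selected, default, frozenset({"suffix=ly", "cap=no"})))
```

In a tagger of the kind the abstract describes, a classifier like this would be consulted only for words missing from the lexicon, with the sequence model supplying the context; how the two are actually combined is not specified here.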
Acknowledgments
The authors would like to thank the Noor Text Mining Research Group of the Computer Research Center of Islamic Sciences (www.noorsoft.org) for supporting this work.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Elahimanesh, M.H., Minaei-Bidgoli, B., Kermani, F. (2014). ACUT: An Associative Classifier Approach to Unknown Word POS Tagging. In: Movaghar, A., Jamzad, M., Asadi, H. (eds) Artificial Intelligence and Signal Processing. AISP 2013. Communications in Computer and Information Science, vol 427. Springer, Cham. https://doi.org/10.1007/978-3-319-10849-0_26
DOI: https://doi.org/10.1007/978-3-319-10849-0_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10848-3
Online ISBN: 978-3-319-10849-0
eBook Packages: Computer Science, Computer Science (R0)