Abstract
Modern Standard Arabic (MSA) uses optional diacritical marks (diacritics; in Arabic, harakat), which have become less common in Arabic books, newspapers and other written media. Diacritics are very important for the readability and comprehension of texts, and their absence compounds lexical, morphological and semantic ambiguity. In this paper, we present an automatic diacritization system for Arabic that uses Hidden Markov Models with the Viterbi algorithm, with probabilities estimated from diacritized Arabic texts. The training corpus consisted mostly of religious texts. Our results were satisfactory, with a precision of up to 80% at the word level.
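To make the approach concrete, the following is a minimal sketch of Viterbi decoding over a word-level bigram HMM. It is not the system described in this paper, whose model details are not given in the abstract. It assumes that hidden states are diacritized word forms, that each state deterministically emits its undiacritized form, and that transition probabilities are add-one-smoothed bigram counts. The names `train`, `viterbi`, `strip_diacritics` and `TOY_CORPUS` are hypothetical, and the three-sentence corpus in Buckwalter-style transliteration is invented for illustration only.

```python
import math
from collections import defaultdict

# Buckwalter-style diacritic symbols (assumption: toy transliterated input).
DIACRITICS = set("aiuo~FNK")

# Invented toy corpus of diacritized sentences, for illustration only.
TOY_CORPUS = [
    ["kataba", "Alwaladu"],
    ["kataba", "Alwaladu"],
    ["fiy", "kutubi", "Alwaladi"],
]

def strip_diacritics(word):
    """Return the undiacritized form of a transliterated word."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

def train(sentences):
    """Count bigram transitions and collect candidate diacritized forms."""
    trans = defaultdict(lambda: defaultdict(int))
    candidates = defaultdict(set)
    for sentence in sentences:
        prev = "<s>"  # sentence-start state
        for word in sentence:
            trans[prev][word] += 1
            candidates[strip_diacritics(word)].add(word)
            prev = word
    return trans, candidates

def log_transition(trans, prev, cur, vocab_size):
    """Add-one-smoothed log P(cur | prev)."""
    total = sum(trans[prev].values())
    return math.log((trans[prev][cur] + 1) / (total + vocab_size))

def viterbi(words, trans, candidates):
    """Return the most probable diacritized sequence for undiacritized words."""
    vocab_size = sum(len(forms) for forms in candidates.values()) or 1
    # best maps each state to (log probability, best path ending in that state)
    best = {"<s>": (0.0, [])}
    for word in words:
        # Unknown words have no candidates, so they are passed through as-is.
        states = sorted(candidates.get(word, {word}))
        new_best = {}
        for state in states:
            score, path = max(
                ((lp + log_transition(trans, prev, state, vocab_size), p)
                 for prev, (lp, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[state] = (score, path + [state])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

trans, candidates = train(TOY_CORPUS)
print(viterbi(["ktb", "Alwld"], trans, candidates))
print(viterbi(["fy", "ktb", "Alwld"], trans, candidates))
```

In this toy setting the ambiguous form `ktb` is resolved by context: at the start of a sentence the decoder prefers `kataba`, while after `fiy` it prefers `kutubi`. A realistic system would need a far larger diacritized corpus and a strategy for words never seen in training.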
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hadjir, I., Abbache, M., Belkredim, F.Z. (2019). An Approach for Arabic Diacritization. In: Métais, E., Meziane, F., Vadera, S., Sugumaran, V., Saraee, M. (eds) Natural Language Processing and Information Systems. NLDB 2019. Lecture Notes in Computer Science, vol 11608. Springer, Cham. https://doi.org/10.1007/978-3-030-23281-8_29
DOI: https://doi.org/10.1007/978-3-030-23281-8_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23280-1
Online ISBN: 978-3-030-23281-8
eBook Packages: Computer Science (R0)