A Hybrid Approach for Arabic Diacritization

Said, Ahmed; El-Sharqwi, Mohamed; Chalabi, Achraf; Kamal, Eslam

doi:10.1007/978-3-642-38824-8_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7934)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

The orthography of Modern Standard Arabic (MSA) includes a set of special marks called diacritics that carry the intended pronunciation of words. Arabic text is usually written without diacritics, which leads to major linguistic ambiguities in most cases, since Arabic words can have different meanings depending on how they are diacritized. This paper introduces a hybrid diacritization system that combines rule-based and data-driven techniques and targets standard Arabic text. Our system relies on automatic correction, morphological analysis, part-of-speech tagging, and out-of-vocabulary diacritization components. The system shows improved results over the best reported systems in terms of full-form diacritization, and comparable results on the level of morphological diacritization. We report these results by evaluating our system using the same training and evaluation sets used by the systems we compare against. Our system shows a word error rate (WER) of 4.4% on morphological diacritization, which ignores the last-letter diacritics, and 11.4% on full-form diacritization, which includes the case-ending diacritics. This is an absolute 1.1% reduction in WER over the best reported system.
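The two metrics in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the four-word toy sample, and the simplification of treating every trailing diacritic as the last-letter (case-ending) diacritic are all assumptions made for illustration.

```python
import re

# Arabic diacritic code points U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = "\u064B-\u0652"
TRAILING_DIACRITICS = re.compile(f"[{DIACRITICS}]+$")


def strip_last_letter_diacritics(word: str) -> str:
    """Drop diacritics that follow the final letter (assumed to be the case ending)."""
    return TRAILING_DIACRITICS.sub("", word)


def word_error_rate(reference: list[str], hypothesis: list[str],
                    ignore_last: bool = False) -> float:
    """Fraction of words whose diacritized form differs from the reference."""
    if not reference or len(reference) != len(hypothesis):
        raise ValueError("inputs must be non-empty and aligned word by word")
    if ignore_last:
        # Morphological diacritization: compare words without last-letter diacritics.
        reference = [strip_last_letter_diacritics(w) for w in reference]
        hypothesis = [strip_last_letter_diacritics(w) for w in hypothesis]
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)


# Hypothetical toy data (not taken from the paper's evaluation set).
ref = [
    "\u0643\u064E\u062A\u064E\u0628\u064E",  # kataba
    "\u0648\u064E\u0644\u064E\u062F\u064F",  # waladu
    "\u0643\u064F\u062A\u064F\u0628\u064C",  # kutubun
    "\u0641\u0650\u064A",                    # fi
]
hyp = [
    "\u0643\u064E\u062A\u064E\u0628\u064E",  # correct
    "\u0648\u064E\u0644\u064E\u062F\u064E",  # wrong case ending only
    "\u0643\u064E\u062A\u064E\u0628\u064C",  # wrong internal vowels
    "\u0641\u0650\u064A",                    # correct
]

full_form_wer = word_error_rate(ref, hyp)                        # 2 of 4 words differ
morphological_wer = word_error_rate(ref, hyp, ignore_last=True)  # 1 of 4 words differs
print(full_form_wer, morphological_wer)  # prints: 0.5 0.25
```

In this toy sample the second word is wrong only in its case ending, so it counts as an error for full-form diacritization but not for morphological diacritization. This is why the full-form WER is never lower than the morphological WER on the same output.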

Author information

Authors and Affiliations

Microsoft Advanced Technology Lab, Cairo, Egypt
Ahmed Said, Mohamed El-Sharqwi, Achraf Chalabi & Eslam Kamal

Editor information

Editors and Affiliations

Conservatoire National des Arts et Métiers, 2 rue Conté, 75003, Paris, France
Elisabeth Métais
School of Computing, Science and Engineering, University of Salford, The Crescent, M5 4WT, Salford, Lancashire, UK
Farid Meziane, Sunil Vadera & Mohamad Saraee
Department of Decision and Information Sciences School of Business Administration, Oakland University, 306 Elliott Hall, 48309, Rochester, MI, USA
Vijayan Sugumaran

About this paper

Cite this paper

Said, A., El-Sharqwi, M., Chalabi, A., Kamal, E. (2013). A Hybrid Approach for Arabic Diacritization. In: Métais, E., Meziane, F., Saraee, M., Sugumaran, V., Vadera, S. (eds) Natural Language Processing and Information Systems. NLDB 2013. Lecture Notes in Computer Science, vol 7934. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38824-8_5

DOI: https://doi.org/10.1007/978-3-642-38824-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38823-1
Online ISBN: 978-3-642-38824-8
eBook Packages: Computer Science, Computer Science (R0)
