A Hybrid Approach for Arabic Diacritization

  • Ahmed Said
  • Mohamed El-Sharqwi
  • Achraf Chalabi
  • Eslam Kamal
Conference paper

DOI: 10.1007/978-3-642-38824-8_5

Volume 7934 of the book series Lecture Notes in Computer Science (LNCS)
Cite this paper as:
Said A., El-Sharqwi M., Chalabi A., Kamal E. (2013) A Hybrid Approach for Arabic Diacritization. In: Métais E., Meziane F., Saraee M., Sugumaran V., Vadera S. (eds) Natural Language Processing and Information Systems. NLDB 2013. Lecture Notes in Computer Science, vol 7934. Springer, Berlin, Heidelberg

Abstract

The orthography of Modern Standard Arabic (MSA) includes a set of special marks called diacritics that carry the intended pronunciation of words. Arabic text is usually written without diacritics, which leads to major linguistic ambiguity in most cases, since Arabic words have different meanings depending on how they are diacritized. This paper introduces a hybrid diacritization system that combines rule-based and data-driven techniques and targets standard Arabic text. Our system relies on automatic correction, morphological analysis, part-of-speech tagging, and out-of-vocabulary diacritization components. The system shows improved results over the best reported systems in terms of full-form diacritization, and comparable results at the level of morphological diacritization. We report these results by evaluating our system on the same training and evaluation sets used by the systems we compare against. Our system achieves a word error rate (WER) of 4.4% on morphological diacritization, which ignores last-letter diacritics, and 11.4% on full-form diacritization, which includes case-ending diacritics. This is an absolute 1.1% reduction in WER over the best reported system.
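
The two WER figures above differ only in whether diacritics on the last letter of each word are scored. The sketch below illustrates that distinction; it is not the authors' evaluation script. The function names are hypothetical, the reference and hypothesis are assumed to be aligned word for word, and stripping every trailing mark (including a final shadda) is a simplifying assumption.

```python
# Arabic diacritic code points: tanween (U+064B..U+064D), short vowels
# (U+064E..U+0650), shadda (U+0651) and sukun (U+0652).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def strip_last_letter_diacritics(word):
    """Remove the diacritics attached to the final letter of a word.

    Assumption: all trailing marks are dropped, including a final shadda.
    """
    end = len(word)
    while end > 0 and word[end - 1] in DIACRITICS:
        end -= 1
    return word[:end]


def word_error_rate(reference, hypothesis, ignore_last=False):
    """Fraction of words whose diacritized form differs from the reference.

    reference, hypothesis: equal-length lists of diacritized words.
    ignore_last=True scores the morphological form only (case endings
    are ignored); ignore_last=False scores the full form.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must be word-aligned")
    if not reference:
        raise ValueError("reference must not be empty")
    errors = 0
    for ref, hyp in zip(reference, hypothesis):
        if ignore_last:
            ref = strip_last_letter_diacritics(ref)
            hyp = strip_last_letter_diacritics(hyp)
        if ref != hyp:
            errors += 1
    return errors / len(reference)


# Toy data: the second word differs only in its case-ending vowel
# (damma in the reference, fatha in the hypothesis).
ref = ["\u0643\u064E\u062A\u064E\u0628\u064E", "\u0648\u064E\u0644\u064E\u062F\u064F"]
hyp = ["\u0643\u064E\u062A\u064E\u0628\u064E", "\u0648\u064E\u0644\u064E\u062F\u064E"]
full_form_wer = word_error_rate(ref, hyp)                        # 0.5
morphological_wer = word_error_rate(ref, hyp, ignore_last=True)  # 0.0
```

In this toy case the case-ending mistake counts toward the full-form WER but not the morphological WER, which is why the full-form figure is the higher of the two.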

Keywords

Arabic; Arabic orthography; diacritization; vowelization; morphology; morphology features; morphological analysis; part-of-speech tagging; automatic correction; Viterbi; case ending; natural language processing; language modeling; conditional random fields; CRF

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ahmed Said (1)
  • Mohamed El-Sharqwi (1)
  • Achraf Chalabi (1)
  • Eslam Kamal (1)

  1. Microsoft Advanced Technology Lab, Cairo, Egypt