Alserag: An Automatic Diacritization System for Arabic

Intelligent Natural Language Processing: Trends and Applications

Part of the book series: Studies in Computational Intelligence (SCI, volume 740)

Abstract

Diacritization of written text has a significant impact on Arabic NLP applications. We present an approach to automatic Arabic diacritization that integrates morphological analysis with shallow syntactic analysis. The resulting system, Alserag, is rule-based and relies on three modules to produce fully diacritized Arabic words: a morphological analysis module, a syntactic analysis module, and a morpho-phonological processing module. To evaluate its performance, we used the benchmark LDC Arabic Treebank datasets used by state-of-the-art systems (Zitouni et al. 2006; Shahrour et al. 2015; Metwally et al. 2016). The proposed system achieved a morphological WER of 5.6% and a syntactic WER of 10.1%.
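
To make the two reported figures concrete, the following is a minimal sketch of how a word error rate (WER) over diacritized words can be computed. It is not the chapter's evaluation code. It assumes a convention common in the diacritization literature: a word counts as an error if any of its diacritics differs from the reference, and a second score is obtained by ignoring the diacritics on the final letter (a rough proxy for the case ending). Mapping that second score to the "morphological WER" of the abstract is an assumption, and the names `word_error_rate` and `strip_final_diacritics` are hypothetical.

```python
# Arabic diacritic marks (tanween, short vowels, shadda, sukun): U+064B..U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_final_diacritics(word):
    """Drop the diacritics attached to the last letter of a word.

    Simplification (assumption): every trailing mark is removed, including
    a shadda on the final letter, so this only approximates the case ending.
    """
    end = len(word)
    while end > 0 and word[end - 1] in DIACRITICS:
        end -= 1
    return word[:end]


def word_error_rate(reference, hypothesis, ignore_case_ending=False):
    """Return the fraction of words whose diacritization differs from the reference.

    Both arguments are lists of diacritized words aligned one to one.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must be aligned word by word")
    if not reference:
        return 0.0
    errors = 0
    for ref, hyp in zip(reference, hypothesis):
        if ignore_case_ending:
            ref = strip_final_diacritics(ref)
            hyp = strip_final_diacritics(hyp)
        if ref != hyp:
            errors += 1
    return errors / len(reference)


# Toy data: the second word differs only in its final (case-ending) vowel.
reference = ["كَتَبَ", "الوَلَدُ"]
hypothesis = ["كَتَبَ", "الوَلَدَ"]
full_wer = word_error_rate(reference, hypothesis)
stem_wer = word_error_rate(reference, hypothesis, ignore_case_ending=True)
```

On this toy pair the full score counts the second word as wrong while the score that ignores the final letter does not, which is the kind of gap that separates a case-ending-sensitive figure from a stem-only one.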

Notes

  1. http://tahadz.com/mishkal.

  2. http://harakat.ae/.

  3. http://qatsdemo.cloudapp.net/farasa/.

  4. http://www.unlweb.net/unlarium/index.php?unlarium=dictionary.

  5. It is a web application developed in Java and available at http://dev.undlfoundation.org/index.jsp.

  6. It is a web application developed in Java and available at http://dev.undlfoundation.org/index.

References

  1. Smrž, O.: Yet another intro to Arabic. In: NLP. http://ufal.mff.cuni.cz/~smrz/ANLP/anlp-lecture-notes.pdf (2005)

  2. Rashwan, M., Abdou, S., Rafea, A.: Stochastic Arabic hybrid diacritizer. IEEE Trans. Nat. Lang. Process. Knowl. Eng. (2009)

  3. Attia, M., Rashwan, M.A., Al-Badrashiny, M.A.: Fassieh, a semi-automatic visual interactive tool for morphological, PoS-Tags, phonetic, and semantic annotation of Arabic text corpora. IEEE Trans. Audio Speech Lang. Process. 17(5), 916–925 (2009)

  4. Rashwan, M.A., Al Sallab, A.A., Raafat, H.M., Rafea, A.: Automatic Arabic diacritics restoration based on deep nets. In: Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pp. 65–72 (2014)

  5. Rashwan, M.A., Al Sallab, A.A., Raafat, H.M., Rafea, A.: Deep learning framework with confused sub-set resolution architecture for automatic Arabic diacritization. IEEE/ACM Trans. Audio Speech Lang. Process. 23(3), 505–516 (2015)

  6. Metwally, A.S., Rashwan, M.A., Atiya, F.A.: A multi-layered approach for Arabic text diacritization. In: Proceedings of 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), pp. 389–393 (2016)

  7. Abandah, G.A., Graves, A., Al-Shagoor, B., Arabiyat, A., Jamour, F., Al-Taee, M.: Automatic diacritization of Arabic text using recurrent neural networks. Int. J. Doc. Anal. Recogn. (IJDAR) 18(2), 183–197 (2015)

  8. Maamouri, M., Bies, A., Kulick, S.: A challenge to Arabic Treebank annotation and parsing. In: Linguistic Data Consortium. University of Pennsylvania, USA (2006)

  9. Bouamor, H., Zaghouani, W., Diab, M., Obeid, O., Oflazer, K., Ghoneim, M., Hawwari, A.: A pilot study on Arabic multi-genre corpus diacritization annotation. In: Proceedings of the Second Workshop on Arabic Natural Language Processing, 2014, pp. 80–88. Association for Computational Linguistics, Beijing, China (2015)

  10. EL-Desoky, A., Fayz, M., Samir, D.: A smart dictionary for the Arabic full-form words. (IJSCE) 2(5) (2012). ISSN: 2231-2307

  11. Al Badrashiny, M.: Automatic diacritizer for Arabic text. MA thesis, Faculty of Engineering, Cairo University, Egypt (2009)

  12. Vergyri, D., Kirchhoff, K.: Automatic diacritization of Arabic for acoustic modeling in speech recognition. In: COLING Workshop, Geneva, Switzerland (2004)

  13. Ananthakrishnan, S., Narayanan, S., Bangalore, S.: Automatic diacritization of Arabic transcripts for ASR. In: Proceedings of ICON-05, Kanpur, India (2005)

  14. Zitouni, I., Sorensen, J.S., Sarikaya, R.: Maximum entropy based restoration of Arabic diacritics. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of Association for Computational Linguistics, pp. 577–584. Association for Computational Linguistics (2006)

  15. Habash, N., Rambow, O.: Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 573–580. Association for Computational Linguistics, Ann Arbor (2005)

  16. Habash, N., Rambow, O.: Arabic diacritization through full morphological tagging. In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume, Short Papers, pp. 53–56. Association for Computational Linguistics, Rochester, NY (2007)

  17. Diab, M., Ghoneim, M., Habash, N.: Arabic diacritization in the context of statistical machine translation. In: Proceedings of MT-Summit, Copenhagen, Denmark (2007)

  18. Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic morphological tagging, diacritization, and lemmatization using Lexeme models and feature ranking. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 117–120. Association for Computational Linguistics, Columbus, Ohio, USA (2008)

  19. Shaalan, K., Abo Bakr, H.M., Ziedan, I.A.: A statistical method for adding case ending diacritics for Arabic text. In: Proceedings of Language Engineering Conference, pp. 225–234. Cairo, Egypt (2008)

  20. Shaalan, K., Abo Bakr, H.M., Ziedan, I.: A hybrid approach for building Arabic diacritizer. In: Proceedings of the 9th EACL Workshop on Computational Approaches to Semitic Languages, pp. 27–35. Association for Computational Linguistics (2009)

  21. Shahrour, A., Khalifa, S., Habash, N.: Improving Arabic diacritization through syntactic analysis. In: Proceedings of Empirical Methods in Natural Language Processing Conference (EMNLP), pp. 1309–1315. Association for Computational Linguistics (2015)

  22. Alansary, S.: MUHIT: a multilingual harmonized dictionary. In: The 9th Edition of the Language Resources and Evaluation Conference, Reykjavik, Iceland (2014)

  23. Alansary, S.: A suite of tools for Arabic natural language processing: a UNL approach. In: The Special Session on Arabic Natural Language Processing: Algorithms, Resources, Tools, Techniques and Applications, (ICCSPA’13), Sharjah, UAE (2013)

Author information

Correspondence to Sameh Alansary.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Alansary, S. (2018). Alserag: An Automatic Diacritization System for Arabic. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_25

  • DOI: https://doi.org/10.1007/978-3-319-67056-0_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67055-3

  • Online ISBN: 978-3-319-67056-0

  • eBook Packages: Engineering, Engineering (R0)
