The Generative Power of Arabic Morphology and Implications: A Case for Pattern Orientation in Arabic Corpus Annotation and a Proposed Pattern Ontology

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 753)

Abstract

Most current Arabic morphological analyzers use complex rules to handle the idiosyncrasies of certain Arabic word classes and special cases. This raises a question: is it feasible to design a pattern-oriented morphological analyzer that streamlines the process and avoids the use of complex rules? To answer this question, a detailed study was conducted using a small representative Arabic corpus. The study revealed that most of the words in the language can be generated using a limited number of patterns, morphemes and particles. Inflected and derived words can be generated by combining roots and patterns. The total number of roots is around 10,000, while the total number of morphological patterns is below 1,000. The total number of particles is around 325. Around 70% of the words in the experimental corpus are templatic (based on morphological patterns). Although the number of identified patterns reached 943, only a small subset of these is active. For example, the top 12 patterns in the identified list accounted for more than 50% of the generated templatic words. Similarly, although the total number of roots is around 10,000, the number of active roots is 3,461. Particles and similar morphemes account for around 30% of the text in the experimental corpus. These features greatly simplify the development of NLP applications such as spelling correctors, normalizers, lemmatizers and higher-level applications.
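To make the root-and-pattern idea concrete, the following is a minimal sketch of how a templatic word can be generated from a root and a pattern, and how a pattern can be used in reverse to recover a candidate root. It is an illustration only, not the analyzer described in the paper. The digit-slot notation (`1`, `2`, `3` for the root consonants), the simplified Latin transliteration with one character per consonant, and the names `apply_pattern`, `pattern_to_regex` and `analyze` are all assumptions made for this example.

```python
import re

# Toy assumption: every consonant is one character, and the only
# vowels in this simplified transliteration are a, i and u.
CONSONANT = "[^aiu]"

# A tiny illustrative pattern list (the paper reports 943 patterns).
PATTERNS = ["1a2a3a", "1aa2i3", "ma12uu3", "ma12a3"]


def apply_pattern(root, pattern):
    """Generate a word by filling the digit slots of a pattern
    with the consonants of a hyphen-separated root, e.g. 'k-t-b'."""
    radicals = root.split("-")
    out = []
    for ch in pattern:
        if ch.isdigit():
            out.append(radicals[int(ch) - 1])  # slot 1 -> first radical, etc.
        else:
            out.append(ch)  # vowels and affix letters are copied as-is
    return "".join(out)


def pattern_to_regex(pattern):
    """Turn a pattern into a regex with one named group per root slot.
    A slot that repeats becomes a backreference to its first occurrence."""
    seen = set()
    parts = []
    for ch in pattern:
        if ch.isdigit():
            if ch in seen:
                parts.append(f"(?P=r{ch})")
            else:
                seen.add(ch)
                parts.append(f"(?P<r{ch}>{CONSONANT})")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def analyze(word, patterns=PATTERNS):
    """Return every (pattern, root) pair whose pattern matches the whole word."""
    results = []
    for pattern in patterns:
        match = pattern_to_regex(pattern).fullmatch(word)
        if match:
            groups = match.groupdict()
            root = "-".join(groups[name] for name in sorted(groups))
            results.append((pattern, root))
    return results


if __name__ == "__main__":
    for p in PATTERNS:
        print(p, "->", apply_pattern("k-t-b", p))
    print(analyze("maktuub"))
```

With the root k-t-b, the four patterns yield kataba, kaatib, maktuub and maktab. A realistic system would also need to handle affixes, weak roots and orthographic variation, all of which this sketch ignores.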

Notes

Acknowledgement

I would like to thank King Abdulaziz City for Science and Technology (KACST) for supporting this research work under NSTIP project 11-INF2159-04 “Arabic Spelling Checking and Correction”.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia
