
Automatic Processing of Modern Standard Arabic Text

  • Chapter
Arabic Computational Morphology

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 38))

Abstract

To date, there are no fully automated systems addressing the community’s need for fundamental language processing tools for Arabic text. In this chapter, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segmenting off clitics), part-of-speech (POS) tag, and annotate Base Phrase Chunks (BPC) in Modern Standard Arabic (MSA) text. We adapt highly accurate tools that have been developed for English text and apply them to Arabic text. Using standard evaluation metrics, we report that the (SVM-TOK) tokenizer achieves an Fβ=1 score of 99.1, the (SVM-POS) tagger achieves an accuracy of 96.6%, and the (SVM-BPC) chunker yields an Fβ=1 score of 91.6.
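For readers unfamiliar with the F-measure reported above, the following is a minimal sketch of how an F score with β = 1 is conventionally computed from precision and recall over labeled spans. It assumes the standard definition of the metric; the chapter's own evaluation scripts are not shown here, and the function names, span encoding, and chunk labels below are hypothetical choices made for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (standard F_beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def span_scores(gold: set, predicted: set) -> tuple:
    """Precision, recall and F(beta=1) over exactly matching spans.

    Each span is a (start, end, label) tuple. This encoding is an
    assumption for illustration, not the chapter's data format.
    """
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_beta(precision, recall)


# Toy example with made-up chunk spans (not data from the chapter).
gold = {(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP")}
predicted = {(0, 2, "NP"), (2, 3, "VP"), (3, 4, "NP"), (4, 5, "PP")}
precision, recall, f1 = span_scores(gold, predicted)
```

In this toy case two of the four predicted spans match the three gold spans exactly, so precision is 2/4, recall is 2/3, and the resulting F score is 4/7. With β = 1 the measure weights precision and recall equally, which is why it is often written simply as F1.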




Copyright information

© 2007 Springer

About this chapter

Cite this chapter

Diab, M., Hacioglu, K., Jurafsky, D. (2007). Automatic Processing of Modern Standard Arabic Text. In: Soudi, A., Bosch, A.v., Neumann, G. (eds) Arabic Computational Morphology. Text, Speech and Language Technology, vol 38. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-6046-5_9
