Automatic Processing of Modern Standard Arabic Text

Diab, Mona; Hacioglu, Kadri; Jurafsky, Daniel

doi:10.1007/978-1-4020-6046-5_9

Part of the book series: Text, Speech and Language Technology (TLTB, volume 38)

Abstract

To date, there are no fully automated systems addressing the community’s need for fundamental language processing tools for Arabic text. In this chapter, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segment off clitics), part-of-speech (POS) tag, and annotate Base Phrase Chunks (BPC) in Modern Standard Arabic (MSA) text. We adapt highly accurate tools that have been developed for English text and apply them to Arabic text. Using standard evaluation metrics, we report that the tokenizer (SVM-TOK) achieves an F_{β=1} score of 99.1, the tagger (SVM-POS) achieves an accuracy of 96.6%, and the chunker (SVM-BPC) yields an F_{β=1} score of 91.6.
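
The F_{β=1} figures above are the harmonic mean of precision and recall. The chapter's own scoring script is not reproduced on this page, so the following is only a minimal sketch of how such a score could be computed for chunking. It assumes chunks are encoded as IOB2 tags (B-X, I-X, O) and that a predicted chunk counts as correct only when its boundaries and type match the gold chunk exactly. The function names `iob_to_spans` and `f_beta` and the toy tag sequences are hypothetical and chosen for illustration.

```python
from typing import List, Optional, Set, Tuple

# (start index, end index exclusive, chunk type); illustrative representation
Span = Tuple[int, int, str]


def iob_to_spans(tags: List[str]) -> Set[Span]:
    """Collect chunk spans from a sequence of IOB2 tags (assumed encoding)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    kind: Optional[str] = None
    # A trailing "O" sentinel closes any chunk still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if continues:
            continue
        if kind is not None:
            spans.add((start, i, kind))
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag is treated leniently as the start of a new chunk.
            start, kind = i, tag[2:]
        else:
            start, kind = None, None
    return spans


def f_beta(gold: Set[Span], pred: Set[Span], beta: float = 1.0) -> float:
    """F_beta over exactly matching spans; returns 0.0 when nothing matches."""
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example (made-up tags, not data from the chapter).
gold_tags = ["B-NP", "I-NP", "B-VP", "B-NP"]
pred_tags = ["B-NP", "I-NP", "B-VP", "O"]
score = f_beta(iob_to_spans(gold_tags), iob_to_spans(pred_tags))
print(round(score, 3))  # precision 1.0, recall 2/3
```

In this toy case the prediction recovers two of the three gold chunks and proposes no wrong ones, so precision is 1.0 and recall is 2/3. The same `f_beta` function would apply to tokenization if token boundaries are represented as spans, though that mapping is likewise an assumption.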


Author information

Authors and Affiliations

Center for Computational Learning Systems, Columbia University, 3022 Broadway, New York
Mona Diab
Center for Spoken Language Research, University of Colorado, Boulder
Kadri Hacioglu
Linguistics Department, Stanford University, Palo Alto, California
Daniel Jurafsky

Editor information

Editors and Affiliations

Ecole Nationale de l’Industrie Minérale, Rabat, Morocco
Abdelhadi Soudi
Tilburg University, The Netherlands
Antal van den Bosch
Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany
Günter Neumann

About this chapter

Cite this chapter

Diab, M., Hacioglu, K., Jurafsky, D. (2007). Automatic Processing of Modern Standard Arabic Text. In: Soudi, A., Bosch, A.v., Neumann, G. (eds) Arabic Computational Morphology. Text, Speech and Language Technology, vol 38. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-6046-5_9

DOI: https://doi.org/10.1007/978-1-4020-6046-5_9
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-6045-8
Online ISBN: 978-1-4020-6046-5
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
