A cross-study of Sentiment Classification on Arabic corpora

  • Conference paper
Research and Development in Intelligent Systems XXIX (SGAI 2012)

Abstract

Sentiment Analysis is a research area that focuses on processing and analyzing the opinions available on the web. Many interesting and advanced studies have been carried out on English; in contrast, very few have been conducted on Arabic. This paper presents a study of supervised sentiment classification in an Arabic context. We use two Arabic corpora that differ in many respects, and three common classifiers known for their effectiveness, namely Naïve Bayes, Support Vector Machines and k-Nearest Neighbor. We investigate several settings to identify those that yield the best results: stemming type, term frequency thresholding, term weighting and word n-grams. We show that Naïve Bayes and Support Vector Machines are competitively effective, whereas the effectiveness of k-Nearest Neighbor depends on the corpus. Based on this study, we recommend using light-stemming rather than stemming, removing terms that occur only once, combining word unigrams and bigrams, and using presence-based rather than frequency-based weighting. Our results also show that classification performance may be influenced by document length, document homogeneity and the nature of the document authors. However, the size of the data sets does not have an impact on classification results.
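As an illustration only, the recommended feature settings (combined unigrams and bigrams, removal of terms that occur once, presence-based weighting) could be sketched as follows. This is not the authors' implementation: the function names, the whitespace tokenisation, the reading of "occur once" as a total corpus count, and the English toy documents are all assumptions made for the example, and light-stemming is left out entirely.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_terms(text):
    """Unigrams plus bigrams of a whitespace-tokenised document (assumed tokeniser)."""
    tokens = text.split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


def build_vocabulary(docs, min_count=2):
    """Keep terms whose total count over the corpus is at least min_count.

    min_count=2 drops terms that occur only once (assumed interpretation
    of the thresholding setting).
    """
    counts = Counter(term for doc in docs for term in extract_terms(doc))
    return sorted(term for term, count in counts.items() if count >= min_count)


def presence_vector(text, vocabulary):
    """Presence-based weighting: 1 if the term appears in the document, else 0."""
    terms = set(extract_terms(text))
    return [1 if term in terms else 0 for term in vocabulary]


def frequency_vector(text, vocabulary):
    """Frequency-based weighting: raw count of each term in the document."""
    counts = Counter(extract_terms(text))
    return [counts[term] for term in vocabulary]


# Hypothetical toy corpus, in English for readability.
docs = ["good good film", "good film indeed", "bad film"]
vocab = build_vocabulary(docs)
presence = presence_vector(docs[0], vocab)
frequency = frequency_vector(docs[0], vocab)
```

The two vectors differ only where a term repeats within a document, which is the distinction the presence-versus-frequency comparison turns on. Either vector could then be passed to any of the three classifiers named in the abstract.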

Author information

Corresponding author

Correspondence to A. Mountassir.

Copyright information

© 2012 Springer-Verlag London

About this paper

Cite this paper

Mountassir, A., Benbrahim, H., Berrada, I. (2012). A cross-study of Sentiment Classification on Arabic corpora. In: Bramer, M., Petridis, M. (eds) Research and Development in Intelligent Systems XXIX. SGAI 2012. Springer, London. https://doi.org/10.1007/978-1-4471-4739-8_21

  • DOI: https://doi.org/10.1007/978-1-4471-4739-8_21

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-4738-1

  • Online ISBN: 978-1-4471-4739-8

  • eBook Packages: Computer Science, Computer Science (R0)
