1 Introduction

Over the last ten years, social media has drastically changed the way people communicate and socialize on social networking sites such as Facebook, Twitter, WhatsApp and blogs. People can now express their feelings, views and opinions on many topics without boundaries, enriching those platforms with enormous amounts of data that can be of great help to many parties, such as customer-service departments in companies or individuals seeking a review of some product. Hence, the need arises for automated systems or models that extract readable information from those sites. Sentiment analysis (SA), also known as opinion mining, is the use of natural language processing or text analysis to identify and extract opinions, sentiments and subjective information from text [8]. As stated by [11], SA models consist of the following steps: data pre-processing, feature extraction and applying machine learning algorithms. Sentiment analysis can be carried out using two main approaches: machine learning classification and lexicon-based classification. One of the challenging issues in SA is labeling the dataset for training (usually done manually by humans, which is a burdensome and time-consuming task), as well as choosing the best classifier for the dataset. In this paper, we take advantage of several approaches and methods to handle those challenges: a lexicon-based approach is used for data labeling, and ensemble learning is used to improve the classification performance.

The remainder of the paper is organized as follows. In Sect. 2, a review of ensemble learning approaches is presented; in Sect. 3, the proposed model is described in detail; in Sect. 4, we present the experiments and the evaluation of the model. Finally, in Sect. 5, we conclude the paper.

2 Ensemble Learning

Ensemble learning (EL) is a scheme in which multiple machine learning algorithms are combined into a single model in order to achieve better performance than any single classifier alone [17]. EL uses machine learning algorithms such as SVM, Decision Tree, etc. to train on a specific dataset, and then produces either a homogeneous learner, built from multiple instances of a single type of algorithm (the "base algorithm"), or a heterogeneous learner, built from multiple types of algorithms ("component algorithms") (see Fig. 1) [3]. There are several methods for combining the algorithms: bagging, boosting, stacking and voting [13]. Ensemble learning has been applied in many fields, for example malware detection [7, 16], voice recognition [2, 15], decision-making [9] and, last but not least, sentiment analysis.

Fig. 1. Ensemble learning types

2.1 Bagging

In Bagging, also known as Bootstrap Aggregating, the same base algorithm is trained multiple times in parallel on bootstrapped samples drawn from the original dataset. The resulting classifiers are then aggregated using either averaging or majority voting, and the most-voted class is predicted. Bagging is used to minimize prediction variance by producing additional training data from the original dataset (see Fig. 2). Variance is directly linked to over-fitting, and the performance improvement in ensembles is a direct result of variance reduction [5].
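The bagging scheme can be sketched as follows; this is a minimal illustration using scikit-learn's `BaggingClassifier` on synthetic data (the library choice and all parameter values are assumptions, not part of the original experiments).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-labeled feature vectors stand in for real review data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each of the 50 trees is fit on a bootstrap sample drawn with replacement
# from the training set; predictions are combined by majority voting.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X, y)
```

Averaging many high-variance trees trained on different bootstrap samples is what yields the variance reduction described above.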

2.2 Boosting

Boosting is a technique in which the base algorithm is trained multiple times in series on the same dataset, with more focus placed on difficult instances each time [12]. The main goal of boosting is to convert a weak classifier (learner) into a strong one [17]. Each model is used to adjust the weights of the observations for the next model according to its error rate; better models are given larger weights. The aim of this method is to decrease the bias error and produce a better model (see Fig. 2).

2.3 Stacking

Stacking, also known as stacked generalization, has a two-level structure: multiple component algorithms are trained on the same dataset at level 1, and the outputs of level 1 are then used as training data for the level-2 classifier, also called the meta-classifier, as shown in Fig. 2. This technique can reduce either bias or variance error, depending on the algorithms used [14].
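The two-level structure can be sketched with scikit-learn's `StackingClassifier`; the particular component classifiers below are arbitrary illustrations, not the paper's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Level 1: heterogeneous component classifiers trained on the data.
# Level 2: a meta-classifier trained on their cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
```

Cross-validated level-1 predictions are used to train the meta-classifier so that it does not simply learn to trust over-fit level-1 outputs.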

Fig. 2. Ensemble learning combining methods

3 Proposed Model

Our proposed model consists of four stages: data pre-processing, lexicon labeling, feature extraction and ensemble classification, as shown in Fig. 3. In data pre-processing, each review in the dataset goes through cleaning, where characters and symbols such as $, %, #, @ and : are removed, and stop-word removal, in which a predefined list of Arabic stop words is used to eliminate connecting words that have no sentiment meaning.
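The pre-processing stage can be sketched as below; the regular expression and the tiny stop-word list are illustrative assumptions (the model uses a fuller predefined Arabic stop-word list).

```python
import re

# A tiny illustrative fragment of an Arabic stop-word list.
ARABIC_STOP_WORDS = {"في", "من", "على", "و", "عن"}

def preprocess(review: str) -> list[str]:
    # Cleaning: replace punctuation, symbols and digits with spaces,
    # keeping only letters (\w is Unicode-aware, so Arabic letters match).
    cleaned = re.sub(r"[^\w\s]|\d|_", " ", review)
    # Stop-word removal: drop connecting words with no sentiment meaning.
    return [tok for tok in cleaned.split() if tok not in ARABIC_STOP_WORDS]
```

The output token list is what the later labeling and feature-extraction stages would consume.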

In the lexicon-labeling stage, each review is labeled as either positive (+1) or negative (−1). A lexicon is a special dictionary that contains words with their sentiment labels (+1 for positive, −1 for negative and 0 for neutral). An Arabic lexicon is needed for the labeling process, and by the end of this stage the whole dataset is labeled for further processing, as shown in Fig. 4.
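One common lexicon-based labeling rule is to sum the polarities of a review's words and take the sign; the exact aggregation rule is not spelled out in the text, so the sketch below (including the toy lexicon fragment and the tie-breaking choice) is an assumption.

```python
# A toy fragment of a sentiment lexicon (word -> polarity); the actual
# model uses the multi-domain Arabic lexicon described in Sect. 4.1.
LEXICON = {"رائع": 1, "ممتاز": 1, "سيء": -1, "بطيء": -1}

def label_review(tokens: list[str]) -> int:
    """Return +1 (positive) or -1 (negative) by summing word polarities."""
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    return 1 if score >= 0 else -1  # ties treated as positive (assumption)
```

Words absent from the lexicon contribute 0, so only sentiment-bearing words influence the label.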

The most common feature extraction technique is bag of words (BOW), in which the occurrence of words in a document (text) matters more than their order. BOW converts the dataset into a matrix with one row per document (text or review) and one column per word occurring in the dataset. If a word appears in the document, its entry is set to "1"; otherwise it is set to "0".

For our experiments, the dataset (labeled Arabic reviews) is represented using the bag-of-words (BOW) and n-gram models. Two n-gram models are considered: uni-grams (individual words) and bi-grams (two consecutive words); uni-grams provide full text coverage, while bi-grams capture some phrases and dependencies between words [1, 6, 10]. After the feature extraction stage, the labeled dataset is split into a training-validation set (80%) and a testing set (20%), the latter being used to evaluate the model.
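The feature extraction and split described above can be sketched with scikit-learn; the four Arabic strings below are illustrative toy reviews, not taken from the dataset, and the library choice is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

reviews = ["الفندق نظيف وممتاز", "الخدمة سيئة جدا", "المطعم رائع", "الغرفة سيئة"]
labels = [1, -1, 1, -1]

# Binary bag of words over uni-grams and bi-grams: a 1 marks that a
# term occurs in the review, regardless of word order or count.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

# 80% training-validation set, 20% held-out testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)
```

Setting `binary=True` matches the 0/1 encoding described for BOW, while `ngram_range=(1, 2)` produces both uni-gram and bi-gram columns in one matrix.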

For the ensemble classification, we examine the three ensemble learning techniques described in Sect. 2: Bagging, Boosting and Stacking. In our experiments, we used three well-known classifiers: K-nearest neighbor (KNN), Random Forest (RF) and Decision Tree (DT). In Bagging and Boosting, each of the classifiers is used in turn as the base classifier, while in Stacking the three classifiers are used together, with Logistic Regression (LR) as the meta-classifier.
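This experimental grid can be sketched as follows, using scikit-learn on synthetic features (both assumptions); for brevity the homogeneous ensembles are shown with Bagging only, with Boosting following analogously.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=30, random_state=0)

base_classifiers = {
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Homogeneous ensembles: each classifier in turn as the base learner.
bagging_scores = {}
for name, clf in base_classifiers.items():
    bagged = BaggingClassifier(clf, n_estimators=25, random_state=0)
    bagging_scores[name] = bagged.fit(X, y).score(X, y)

# Heterogeneous ensemble: the three classifiers stacked, with
# Logistic Regression as the meta-classifier.
stack = StackingClassifier(
    estimators=list(base_classifiers.items()),
    final_estimator=LogisticRegression(),
)
stack_score = stack.fit(X, y).score(X, y)
```

In the real experiments the feature matrix would be the BOW/n-gram representation of the labeled reviews rather than synthetic data.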

Fig. 3. The proposed model

Fig. 4. Lexicon-based labeling

4 Experiments and Evaluation

The following sections describe the dataset used in the experiments, the metrics used to evaluate the proposed model, and the experimental results.

4.1 Dataset

The lack of Arabic datasets may be considered a challenge in the field of text mining; Arabic resources are clearly not as rich as English ones. We use the dataset proposed by El-Sahar and El-Beltagy [4], which consists of 33k labeled reviews written in Modern Standard Arabic (MSA). This dataset is one of the largest Arabic datasets available and contains multi-domain reviews of movies, hotels, restaurants and products. It also includes a multi-domain lexicon of 1912 entries (words with their respective labels) extracted from the reviews. The Restaurant, Movie and Hotel reviews are used in this paper, along with their multi-domain lexicons [4]. We only processed positive and negative reviews (neutral reviews were omitted beforehand). The reviews were processed without their respective labels (as an unlabeled dataset), and the labels were then used for evaluation (validation) purposes to measure the accuracy of the model. The dataset details are provided in Table 1.

Table 1. Details of the used datasets.

4.2 Evaluation Metrics

The accuracy rate is used as the performance evaluation metric; it is given in Eq. (1). True Positive (TP) indicates that the review is positive and is correctly classified by the model as positive. True Negative (TN) indicates that the review is negative and is correctly classified by the model as negative. False Positive (FP) indicates that the review is negative but is classified by the model as positive. False Negative (FN) indicates that the review is positive but is classified by the model as negative.

$$\begin{aligned} Accuracy=\frac{TP + TN }{TP + FP + FN + TN} \end{aligned}$$
(1)
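Eq. (1) can be checked with a small worked example; the confusion-matrix counts below are hypothetical, chosen only for illustration.

```python
# Hypothetical counts for a 100-review test set.
TP, TN, FP, FN = 45, 40, 10, 5

# Accuracy per Eq. (1): correct predictions over all predictions.
accuracy = (TP + TN) / (TP + FP + FN + TN)
print(accuracy)  # 0.85
```

That is, 85 of the 100 reviews (45 true positives plus 40 true negatives) are classified correctly.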

4.3 Experimental Results

The three classifiers (KNN, DT and RF) were run both as stand-alone classifiers and within ensemble classifiers on three different datasets (Restaurant, Movie and Hotel reviews). The results shown in Table 3 present a comparative analysis of each classifier and its ensemble counterparts based on accuracy.

The unlabeled dataset, as explained in Sect. 3, goes through data pre-processing and then data labeling, in which the lexicon-based review-labeling process takes place. The lexicon-based labeling was able to label the reviews with an accuracy of 76% for the Movie dataset, 83% for the Restaurant dataset and 90% for the Hotel dataset (see Table 2).

Table 2. Accuracy values for lexicon-based labeling.

In general, all the ensemble learning algorithms achieved good results, with the highest accuracy being 92.10% and the lowest 85.56%. As shown in Table 3, all the ensemble classifiers achieve better accuracy than the stand-alone classifiers. Boosting outperforms Bagging in most cases; both methods produce homogeneous ensembles. The highest accuracy improvement for Bagging was obtained with the RF classifier on the Movie reviews, at 1.22% (compared to RF as a stand-alone classifier), while for Boosting the highest improvement was obtained with the KNN classifier on the Hotel reviews, at 2.69% (compared to KNN as a stand-alone classifier). Stacking, which produces a heterogeneous ensemble, achieved the highest overall accuracy of 92.10% on the Hotel reviews, an improvement of 3.77%.

Table 3. Accuracy of ensemble learning methods

5 Conclusion

Sentiment analysis is the process of extracting subjective information from texts. SA applications are very broad; it is useful in many areas, such as marketing and customer service. Arabic sentiment analysis has gained considerable interest in recent years, and it may be considered challenging due to the nature of the Arabic language. Another challenging issue in SA is labeling the dataset for training, a process that is usually done manually by humans. In addition, choosing the most appropriate ML classification algorithm is a challenging task. In this paper, we developed a model for classifying unlabeled Arabic text as either positive or negative. We used a lexicon-based technique to label the dataset, and the experimental results show that the labeling process performed well, with high accuracy values. In addition, we investigated the possibility of increasing the accuracy of some well-known ML algorithms using ensemble learning methods (Bagging, Boosting and Stacking). The results show that the use of ensemble learning methods improved the classification accuracy. As future work, the lexicon should be enriched by adding more words along with their polarity, which will improve the labeling phase.