1 Introduction

Over the last ten years, social media has drastically changed the way people communicate and socialize on social networking sites such as Facebook, Twitter, WhatsApp and blogs. People can now express their feelings, views and opinions on many topics without boundaries, enriching those platforms with enormous amounts of data that can be of great help to many parties, such as customer-service departments in companies or individuals seeking a review of some product. Hence, the need arises for automated systems or models that extract readable information from those sites. Sentiment analysis (SA), also known as opinion mining, is the use of natural language processing or text analysis to identify and extract opinions, sentiments and subjective information from text [8]. As stated by [11], SA models consist of the following steps: data pre-processing, feature extraction and applying machine learning algorithms. Sentiment analysis can be carried out using two main approaches: machine learning classification and lexicon-based classification. One of the challenging issues in SA is labeling the dataset for training (usually done manually by humans, which is a burdensome and time-consuming task), as well as choosing the best classifier for the dataset. In this paper, we take advantage of several approaches and methods to handle those challenges: a lexicon-based approach is used for data labeling, and ensemble learning is used to improve the classification performance.

The remainder of the paper is organized as follows. In Sect. 2, a review of ensemble learning approaches is presented; in Sect. 3, the proposed model is described in detail; in Sect. 4, we present the experiments and the evaluation of the model. Finally, in Sect. 5, we conclude the paper.

2 Ensemble Learning

Ensemble learning (EL) is a scheme in which multiple machine learning algorithms are combined into a single model in order to achieve better performance than any single classifier alone [17]. EL uses machine learning algorithms such as SVM, Decision Tree, etc. to train on a specific dataset, and then produces either a homogeneous learner, built from multiple instances of a single type of algorithm (the "base algorithm"), or a heterogeneous learner, built from multiple types of algorithms ("component algorithms") (see Fig. 1) [3]. There are several methods for combining the algorithms: bagging, boosting, stacking and voting [13]. Ensemble learning has been applied in many fields, for example malware detection [7, 16], voice recognition [2, 15], decision-making [9] and, last but not least, sentiment analysis.

Fig. 1. Ensemble learning types

2.1 Bagging

In Bagging, also known as Bootstrap Aggregating, the same base algorithm is trained multiple times in parallel on bootstrapped samples drawn from the original dataset. The resulting classifiers are then aggregated using either averaging or majority voting, and the most-voted class is predicted. Bagging is used to minimize prediction variance by producing additional training data from the original dataset (see Fig. 2). Variance is directly linked to over-fitting, and the performance improvement in ensembles is a direct result of variance reduction [5].
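The bagging scheme can be sketched as follows; this is a minimal illustration using scikit-learn's `BaggingClassifier` on synthetic data (the library choice and all parameter values are assumptions, not part of the original experiments).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-labeled feature vectors stand in for real review data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each of the 50 trees is fit on a bootstrap sample drawn with replacement
# from the training set; predictions are combined by majority voting.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X, y)
```

Averaging many high-variance trees trained on different bootstrap samples is what yields the variance reduction described above.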

2.2 Boosting

Boosting is a technique in which the base algorithm is trained multiple times in series on the same dataset, with more focus placed on difficult instances each time [12]. The main goal of boosting is to convert a weak classifier (learner) into a strong one [17]. Each model is used to adjust the weights of the observations for the next model according to its error rate; better models are given larger weights. The aim of this method is to decrease the bias error and produce a better model (see Fig. 2).

2.3 Stacking

Stacking, also known as stacked generalization, has a two-level structure: multiple component algorithms are trained on the same dataset at level 1, and the outputs of level 1 are then used as training data for the level-2 classifier, also called the meta-classifier, as shown in Fig. 2. This technique can reduce either bias or variance error, depending on the algorithms used [14].
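The two-level structure can be sketched with scikit-learn's `StackingClassifier`; the particular component classifiers below are arbitrary illustrations, not the paper's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Level 1: heterogeneous component classifiers trained on the data.
# Level 2: a meta-classifier trained on their cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
```

Cross-validated level-1 predictions are used to train the meta-classifier so that it does not simply learn to trust over-fit level-1 outputs.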

Fig. 2. Ensemble learning combining methods

3 Proposed Model

Our proposed model consists of four stages: data pre-processing, lexicon labeling, feature extraction and ensemble classification, as shown in Fig. 3. In data pre-processing, each review in the dataset goes through cleaning, where characters and symbols such as $, %, #, @ and : are removed, and stop-word removal, in which a predefined list of Arabic stop words is used to eliminate connecting words that have no sentiment meaning.
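The pre-processing stage can be sketched as below; the regular expression and the tiny stop-word list are illustrative assumptions (the model uses a fuller predefined Arabic stop-word list).

```python
import re

# A tiny illustrative fragment of an Arabic stop-word list.
ARABIC_STOP_WORDS = {"في", "من", "على", "و", "عن"}

def preprocess(review: str) -> list[str]:
    # Cleaning: replace punctuation, symbols and digits with spaces,
    # keeping only letters (\w is Unicode-aware, so Arabic letters match).
    cleaned = re.sub(r"[^\w\s]|\d|_", " ", review)
    # Stop-word removal: drop connecting words with no sentiment meaning.
    return [tok for tok in cleaned.split() if tok not in ARABIC_STOP_WORDS]
```

The output token list is what the later labeling and feature-extraction stages would consume.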

In the lexicon-labeling stage, each review is labeled as either positive (+1) or negative (−1). A lexicon is a special dictionary that contains words with their sentiment labels (+1 for positive, −1 for negative and 0 for neutral). An Arabic lexicon is needed for the labeling process, and by the end of this stage the whole dataset is labeled for further processing, as shown in Fig. 4.
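One common lexicon-based labeling rule is to sum the polarities of a review's words and take the sign; the exact aggregation rule is not spelled out in the text, so the sketch below (including the toy lexicon fragment and the tie-breaking choice) is an assumption.

```python
# A toy fragment of a sentiment lexicon (word -> polarity); the actual
# model uses the multi-domain Arabic lexicon described in Sect. 4.1.
LEXICON = {"رائع": 1, "ممتاز": 1, "سيء": -1, "بطيء": -1}

def label_review(tokens: list[str]) -> int:
    """Return +1 (positive) or -1 (negative) by summing word polarities."""
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    return 1 if score >= 0 else -1  # ties treated as positive (assumption)
```

Words absent from the lexicon contribute 0, so only sentiment-bearing words influence the label.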

The most common feature extraction technique is bag of words (BOW), in which the occurrence of words in a document (text) matters more than their order. BOW converts the dataset into a matrix with one row per document (text or review) and one column per word occurring in the dataset. If a word appears in the document, its entry is set to "1"; otherwise it is set to "0".

For our experiments, the dataset (labeled Arabic reviews) is represented using the bag-of-words (BOW) and n-gram models. Two n-gram models are considered: uni-grams (individual words) and bi-grams (two consecutive words); uni-grams provide full text coverage, while bi-grams capture some phrases and dependencies between words [1, 6, 10]. After the feature extraction stage, the labeled dataset is split into a training-validation set (80%) and a testing set (20%), the latter being used to evaluate the model.
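The feature extraction and split described above can be sketched with scikit-learn; the four Arabic strings below are illustrative toy reviews, not taken from the dataset, and the library choice is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

reviews = ["الفندق نظيف وممتاز", "الخدمة سيئة جدا", "المطعم رائع", "الغرفة سيئة"]
labels = [1, -1, 1, -1]

# Binary bag of words over uni-grams and bi-grams: a 1 marks that a
# term occurs in the review, regardless of word order or count.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

# 80% training-validation set, 20% held-out testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)
```

Setting `binary=True` matches the 0/1 encoding described for BOW, while `ngram_range=(1, 2)` produces both uni-gram and bi-gram columns in one matrix.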

For the ensemble classification, we examine the three ensemble learning techniques described in Sect. 2: Bagging, Boosting and Stacking. In our experiments, we used three well-known classifiers: K-nearest neighbor (KNN), Random Forest (RF) and Decision Tree (DT). In Bagging and Boosting, each of the classifiers is used in turn as the base classifier, while in Stacking the three classifiers are used together, with Logistic Regression (LR) as the meta-classifier.
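This experimental grid can be sketched as follows, using scikit-learn on synthetic features (both assumptions); for brevity the homogeneous ensembles are shown with Bagging only, with Boosting following analogously.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=30, random_state=0)

base_classifiers = {
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Homogeneous ensembles: each classifier in turn as the base learner.
bagging_scores = {}
for name, clf in base_classifiers.items():
    bagged = BaggingClassifier(clf, n_estimators=25, random_state=0)
    bagging_scores[name] = bagged.fit(X, y).score(X, y)

# Heterogeneous ensemble: the three classifiers stacked, with
# Logistic Regression as the meta-classifier.
stack = StackingClassifier(
    estimators=list(base_classifiers.items()),
    final_estimator=LogisticRegression(),
)
stack_score = stack.fit(X, y).score(X, y)
```

In the real experiments the feature matrix would be the BOW/n-gram representation of the labeled reviews rather than synthetic data.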

Fig. 3. The proposed model

Fig. 4. Lexicon-based labeling

4 Experiments and Evaluation

The following sections describe the dataset used in the experiments, the metrics used to evaluate the proposed model, and the experimental results.

4.1 Dataset

The lack of Arabic datasets may be considered a challenge in the field of text mining; Arabic resources are clearly not as rich as English ones. We use the dataset proposed by El-Sahar and El-Beltagy [4], which consists of 33k labeled reviews written in Modern Standard Arabic (MSA). This dataset is one of the largest Arabic datasets available and contains multi-domain reviews of movies, hotels, restaurants and products. It also includes a multi-domain lexicon of 1912 entries (words with their respective labels) extracted from the reviews. The Restaurant, Movie and Hotel reviews are used in this paper, along with their multi-domain lexicons [4]. We only processed positive and negative reviews (neutral reviews were omitted beforehand). The reviews were processed without their respective labels (as an unlabeled dataset), and the labels were then used for evaluation (validation) purposes to measure the accuracy of the model. The dataset details are provided in Table 1.

Table 1. Details of the used datasets.

4.2 Evaluation Metrics

The accuracy rate is used as the performance evaluation metric; it is given in Eq. (1). True Positive (TP) indicates that the review is positive and is correctly classified by the model as positive. True Negative (TN) indicates that the review is negative and is correctly classified by the model as negative. False Positive (FP) indicates that the review is negative but is classified by the model as positive. False Negative (FN) indicates that the review is positive but is classified by the model as negative.

$$\begin{aligned} Accuracy=\frac{TP + TN }{TP + FP + FN + TN} \end{aligned}$$
(1)
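Eq. (1) can be checked with a small worked example; the confusion-matrix counts below are hypothetical, chosen only for illustration.

```python
# Hypothetical counts for a 100-review test set.
TP, TN, FP, FN = 45, 40, 10, 5

# Accuracy per Eq. (1): correct predictions over all predictions.
accuracy = (TP + TN) / (TP + FP + FN + TN)
print(accuracy)  # 0.85
```

That is, 85 of the 100 reviews (45 true positives plus 40 true negatives) are classified correctly.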

4.3 Experimental Results

The three classifiers (KNN, DT and RF) were run both as stand-alone classifiers and within ensemble classifiers on three different datasets (Restaurant, Movie and Hotel reviews). The results shown in Table 3 present a comparative analysis of each classifier and its ensemble counterparts based on accuracy.

The unlabeled dataset, as explained in Sect. 3, goes through data pre-processing and then data labeling, in which the lexicon-based review-labeling process takes place. The lexicon-based labeling was able to label the reviews with an accuracy of 76% for the Movie dataset, 83% for the Restaurant dataset and 90% for the Hotel dataset (see Table 2).

Table 2. Accuracy values for lexicon-based labeling.

In general, all the ensemble learning algorithms achieved good results, with the highest accuracy being 92.10% and the lowest 85.56%. As shown in Table 3, all the ensemble classifiers achieve better accuracy than the stand-alone classifiers. Boosting outperforms Bagging in most cases; both methods produce homogeneous ensembles. The highest accuracy improvement for Bagging was obtained with the RF classifier on the Movie reviews, at 1.22% (compared to RF as a stand-alone classifier), while for Boosting the highest improvement was obtained with the KNN classifier on the Hotel reviews, at 2.69% (compared to KNN as a stand-alone classifier). Stacking, which produces a heterogeneous ensemble, achieved the highest overall accuracy of 92.10% on the Hotel reviews, an improvement of 3.77%.

Table 3. Accuracy of ensemble learning methods

5 Conclusion

Sentiment analysis is the process of extracting subjective information from texts. SA applications are very broad; it is useful in many areas, such as marketing and customer service. Arabic sentiment analysis has gained considerable interest in recent years, and it may be considered challenging due to the nature of the Arabic language. Another challenging issue in SA is labeling the dataset for training, a process that is usually done manually by humans. In addition, choosing the most appropriate ML classification algorithm is a challenging task. In this paper, we developed a model for classifying unlabeled Arabic text as either positive or negative. We used a lexicon-based technique to label the dataset, and the experimental results show that the labeling process performed well, with high accuracy values. In addition, we investigated the possibility of increasing the accuracy of some well-known ML algorithms using ensemble learning methods (Bagging, Boosting and Stacking). The results show that the use of ensemble learning methods improved the classification accuracy. As future work, the lexicon should be enriched by adding more words along with their polarity, which will improve the labeling phase.