1 Introduction

Solutions for semantic analysis are important and helpful for many researchers and many applications. Today there are many studies and applications of sentiment classification in many languages.

In this work we propose a new model that uses a decision tree, specifically the C4.5 algorithm (CA), for English document-level emotional classification.

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal, but they are also a popular tool in machine learning.

The C4.5 algorithm is a well-known decision tree algorithm from the data mining field, and it has long been used in many other fields. However, the C4.5 algorithm has rarely been used in natural language processing (NLP), and in particular not in sentiment classification. We believed that it could be used for opinion analysis, so we applied it to semantic analysis, although doing so was difficult. This is significant for work and applications in NLP. Our results confirm that the C4.5 algorithm can be used in NLP and, specifically, in opinion classification. The aim of this research is to implement the C4.5 algorithm for the emotional analysis of English documents based on the English sentences of the English training data set. We surveyed the literature related to decision trees and emotional classification and, as shown by the evidence below, found no research similar to this study. We examined many methodologies for applying the C4.5 algorithm to sentiment classification of English documents and then experimented with them on our data sets. Thus, the proposed model is original and novel, and it has significance for the data mining field, NLP, computer science, and related areas.

We use the CA to classify the semantics (positive, negative, or neutral) of each English document in the English testing data set, based on the 140,000 English sentences of the English training data set, which comprise 70,000 positive English sentences and 70,000 negative English sentences.

We propose several basic principles to implement our new model, as follows:

  • Assume that one English document in the English testing data set has n English sentences.

  • Assume that one English sentence in the English testing data set or in the English training data set has m English words (or English phrases).

  • Assume that the longest English sentence in either the English testing data set or the English training data set has length m_max; that is, m_max is greater than or equal to m.

  • We build a table of training data for the CA based on the 140,000 English sentences of the English training data set as follows:

    • The table of training data has 140,000 records (or 140,000 rows) and (m_max + 1) columns.

    • Each column of the table from column 0 to column (m_max − 1) holds one English word (or one English phrase). If an English sentence has length m (m < m_max), then each of its columns from m to (m_max − 1) is set to 0 (zero).

    • Column m_max of the table is the polarity column. It indicates whether the sentence belongs to the 70,000 English positive sentences or to the 70,000 English negative sentences.

    • For example, consider three English sentences:

    The film is very good → the sentence belongs to the 70,000 English positive sentences.

    The actor is very bad → the sentence belongs to the 70,000 English negative sentences.

    The film sounds good → the sentence belongs to the 70,000 English positive sentences.

    • The table of training data is shown as Table 1 below in the “Appendix”.

  • When we apply the CA to Table 1, we obtain a decision tree that generates many association rules. The association rules have the form “X → positive” or “Y → negative”. These rules are divided into two groups: the positive rule group and the negative rule group. The positive rule group contains all association rules of the form “X → positive”, and the negative rule group contains all association rules of the form “Y → negative”.

  • An English sentence of an English document in the English testing data set has the positive polarity if the sentence fully contains X. The sentence has the negative polarity if it fully contains Y. The sentence has the neutral polarity if it fully contains neither X nor Y.

  • Assume that we have rules such as: “very good” → positive; “very handsome” → positive; “excellent” → positive; “very bad” → negative; “terrible” → negative; and three sentences: “the film is very good”, “the actor is very bad”, and “he is drinking some beer”. The first sentence, “the film is very good”, contains only the rule “very good” → positive, so it has the positive polarity. The second sentence, “the actor is very bad”, contains only the rule “very bad” → negative, so it has the negative polarity. The third sentence, “he is drinking some beer”, does not contain any rule in our rule set, so it has the neutral polarity.

  • An English document in the English testing data set has the positive polarity if the number of its English sentences classified as positive is greater than the number of its English sentences classified as negative. The document has the negative polarity if the number of sentences classified as positive is less than the number of sentences classified as negative. The document has the neutral polarity if the number of sentences classified as positive equals the number of sentences classified as negative.

Table 1 Training data set for a decision tree

Among the many studies related to the C4.5 algorithm (CA) worldwide, including (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000, 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012), there is no CA-related work similar to our study.

Among the many studies related to decision trees for sentiment classification (opinion analysis, semantic classification) worldwide, including (Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015, 20, 21; Vinodhini and Chandrasekaran 2013, 23, 24; Opinion 2015; Prasad et al. 2016, 27; Mugdha; Sharma 2014; Park et al. 2003; Loh and Mauricio 2003), there is no CA-related research for semantic classification similar to our work.

Among the many works related to sentiment classification worldwide, including (Manek et al. 2016; Agarwal and Mittal 2016a, b; Canuto et al. 2016; Kaur et al. 2016; Phu 2014; Tran et al. 2014; Li and Liu 2014), there is no CA-related study for sentiment classification similar to our model.

Among the many studies related to unsupervised classification worldwide, including (Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004), there is no CA-related study of unsupervised classification similar to our work.

According to the CA-related work in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012), the CA has several advantages and disadvantages. Its advantages are as follows: it builds models that can be easily interpreted; it is easy to implement; it can use both categorical and continuous values; and it deals with noise. Its disadvantages are as follows: small variations in the data can lead to different decision trees (especially when the variables are close to each other in value), and it does not work very well on a small training set.

Based on the works related to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012), we build the CA-related algorithms that implement our new model.

The motivation of this work is as follows. Rule-based sentiment classification often has high accuracy, and rules are very popular in data mining. Researchers have sought many ways to use data mining rules in opinion analysis and to explore the different relationships between data mining and natural language processing. The C4.5 algorithm is a very popular and significant data mining algorithm, so the rules it generates are highly reliable. This can lead to many discoveries in scientific research, hence the motivation for this study.

The proposed approach is quite novel. The semantic analysis of an English document is based on many English sentences in the English training data set. The emotional classification of an English document is based on many association rules from the data mining field, and the sentiment analysis itself is based on the CA. These principles are proposed to classify the semantics of an English document, and data mining is thereby used in natural language processing.

Based on the studies surveyed worldwide, including (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012; Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Psomakelis et al. 2015; Shrivastava and Nair 2015; Vinodhini and Chandrasekaran 2013; Voll et al. 2007; Mandal et al. 2014; Kaur et al. 2015, 2016; Prasad et al. 2016, 27; Mugdha; Sharma 2014; Park et al. 2003; Loh and Mauricio 2003; Manek et al. 2016; Agarwal and Mittal 2016a, b; Canuto et al. 2016; Phu and Tuoi 2014; Tran et al. 2014; Li and Liu 2014; Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004), the significant contributions of this study can be summarized briefly as follows:

  a. The C4.5 algorithm is a decision tree algorithm, but here it is applied to NLP.

  b. It has not previously been used in sentiment classification; here it is applied to opinion analysis.

  c. It has not previously been used for English document-level semantic analysis; here it is applied to the emotional classification of English documents.

  d. The results of this survey show that it is widely applied in different fields and different applications.

  e. This model can be applied to other languages.

  f. The C4.5-related algorithms are built in this research.

  g. The rules are generated in this model.

Based on the above contributions, the model compares favorably with the other methodologies and is completely different from the other methods and models.

This study contains six sections. Section 1 is the introduction; Sect. 2 discusses the related works on the C4.5 algorithm and other topics; Sect. 3 describes the English data sets; Sect. 4 presents the methodology of our proposed model; Sect. 5 presents the experimental model and the experimental results; and the conclusion of the proposed model is given in Sect. 6. In addition, the References section lists the cited works, all the tables are shown in the Appendices section, and all the codes of the algorithms in the Methodology are shown in the “Appendices of All Codes” section.

2 Related work

In this section, we summarize studies related to our research, such as work on the C4.5 algorithm and sentiment analysis.

There are many works related to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012). The authors of (Ruggieri 2002) present an analytic evaluation of the runtime behavior of the C4.5 algorithm which highlights some efficiency improvements. Based on the analytic evaluation, they implemented a more efficient version of the algorithm, called EC4.5. It improves on C4.5 by adopting the best among three strategies for computing the information gain of continuous attributes. All the strategies adopt a binary search of the threshold in the whole training set, starting from the local threshold computed at a node. The first strategy computes the local threshold using the algorithm of C4.5, which, in particular, sorts cases by means of the quicksort method. The second strategy also uses the algorithm of C4.5, but adopts a counting sort method. The third strategy calculates the local threshold using a main-memory version of the RainForest algorithm, which does not need sorting. The authors’ implementation computes the same decision trees as C4.5 with a performance gain of up to five times. In (Kretschmann et al. 2001), the gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools, without the use of automated annotation systems, cannot keep up with the ever-increasing quantity of submitted data. Automated supplements to manually curated databases such as TrEMBL or GenPept cover raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information, and help detect inconsistencies in manually generated annotations. A standard data mining algorithm was successfully applied to gain knowledge about the keyword annotation in SWISS-PROT. 11,306 rules were generated, which are provided in a database, can be applied to not-yet-annotated protein sequences, and can be viewed using a web browser. They rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. The statistical evaluation of the generated rules by cross-validation suggests that, by applying them to arbitrary proteins, 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage rate of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%, etc.

Then, we compare our proposed model’s results with the surveys in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012; Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Psomakelis et al. 2015; Shrivastava and Nair 2015; Vinodhini and Chandrasekaran 2013; Voll et al. 2007; Mandal et al. 2014; Kaur et al. 2015, 2016; Prasad et al. 2016, 27; Mugdha; Sharma 2014; Park et al. 2003; Loh and Mauricio 2003, 31, 32, 33, 34; Phu and Tuoi 2014; Tran et al. 2014; Li and Liu 2014; Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004; Phu et al. 2016, 2017a, b; Friedl and Brodley 1997; Freund and Mason 1999; Payne et al. 1978; Chang 1977; Mehta et al. 1995; Phu et al. 2017).

There are many studies related to decision trees for sentiment classification in (Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Psomakelis et al. 2015; Vinodhini and Chandrasekaran 2013, 23; Mandal et al. 2014; Kaur et al. 2015; Prasad et al. 2016; Pong-Inwong et al. 2014; Mugdha; Sharma 2014; Park et al. 2003; Loh and Mauricio 2003). Automatic text classification (Mita 2011) is a semi-supervised machine learning task that automatically assigns a given document to a set of pre-defined categories based on its textual content and extracted features. Automatic text classification has important applications in content management, contextual search, opinion mining, analysis of product reviews, spam filtering, and text sentiment mining. The survey (Mita 2011) explains the generic strategy for automatic text classification and surveys existing solutions. The authors in (Taboada et al. 2008) present an approach to extracting sentiment from texts that makes use of contextual information. Using two different approaches, they extract the most relevant sentences of a text and calculate the semantic orientation, weighting those sentences more heavily, etc.

The latest studies of sentiment classification are (Manek et al. 2016; Agarwal and Mittal 2016a, b, 34; Kaur et al. 2016; Phu 2014; Tran et al. 2014; Li and Liu 2014; Phu et al. 2017a, b; Phu et al. 2017). As noted in (Manek et al. 2016), with the rapid development of the World Wide Web, electronic word-of-mouth interaction has made consumers active participants. Nowadays, a large number of reviews posted by consumers on the Web provide valuable information to other consumers. Such information is highly essential for decision making and hence popular among internet users. This information is very valuable not only for prospective consumers making decisions, but also for businesses in predicting success and sustainability. In that work (Manek et al. 2016), a Gini Index-based feature selection method with a Support Vector Machine (SVM) classifier is proposed for sentiment classification of a large movie review dataset. Opinion mining or sentiment analysis, as described in Agarwal and Mittal (2016a), is the study that analyzes people’s opinions or sentiments from text towards entities such as products and services. It has always been important to know what other people think. With the rapid growth of the availability and popularity of online review sites, blogs, forums, and social networking sites, the necessity of analyzing and understanding these reviews has arisen. The main approaches for sentiment analysis can be categorized into semantic orientation-based approaches, knowledge-based approaches, and machine-learning algorithms. The work in (Agarwal and Mittal 2016a) surveys the machine learning approaches applied to sentiment analysis-based applications, etc.

The latest works on unsupervised classification are (Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004). The study in (Turney 2002) presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The authors in (Lee et al. 2002) propose a new method for unsupervised classification of terrain types and man-made objects using polarimetric synthetic aperture radar (SAR) data, etc.

3 Data set

As shown in Fig. 1, the English training data set includes 140,000 English sentences in the movie domain, comprising 70,000 positive English sentences and 70,000 negative English sentences. All English sentences in our English training data set were automatically extracted from English-language Facebook posts and websites; we then labeled them as positive or negative.

Fig. 1 Our English training data set

As shown in Fig. 2, we use a publicly available large data set of classified movie reviews from the Internet Movie Database (IMDb) (Large 2016). This English data set consists of two parts in two different folders. The first part is in the “testing data set” folder; it was originally named the testing data set, and we call it the first testing data set. The second part is in the “training data set” folder; it was originally named the training data set, and we call it the second testing data set. Both our first testing data set and our second testing data set contain 25,000 English documents, and each includes 12,500 positive English movie reviews and 12,500 negative English movie reviews.

Fig. 2 Our English testing data set
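To make the data handling concrete, the following minimal sketch (in Python, for illustration only; the paper's implementation uses C# and SQL Server) loads one part of such a data set from disk. The on-disk layout with "pos" and "neg" subfolders of plain-text review files and the function name load_reviews are assumptions made for the example, not details stated in the paper.

```python
import os

def load_reviews(part_dir):
    """Load labeled movie reviews, assuming part_dir/pos/*.txt and part_dir/neg/*.txt."""
    documents = []
    for label in ("pos", "neg"):
        folder = os.path.join(part_dir, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                documents.append((f.read(), "positive" if label == "pos" else "negative"))
    return documents

# Hypothetical usage: the two parts of 25,000 documents each.
# first_testing_set = load_reviews("testing data set")
# second_testing_set = load_reviews("training data set")
```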

4 Methodology

In this section, we present how our new model is implemented. First, the table of training data is created from the 70,000 positive sentences and the 70,000 negative sentences. Second, the C4.5 algorithm (CA) is applied to this table to generate the positive association rule set and the negative association rule set. Next, each English document of the English testing data set is split into its English sentences. Then, the positive and negative association rule sets are applied to each English sentence of the document, and the emotional classification of that sentence is identified. Finally, the semantic classification of the English document is determined from its sentences.

Figure 3 shows the overall process of this research.

Fig. 3 Overview process of our new model

The criteria for selecting both positive and negative association rules certainly depend on the English training data set and on the algorithm used to generate them (in this paper, the C4.5 algorithm). The positive and negative association rules are essential for this model to identify the emotional polarity (positive, negative, or neutral) of an English sentence. The semantic classification of an English document is then determined from its sentences.

We propose several algorithms to implement the model.

We build Algorithm 1 to create the table of training data, which has 140,000 records (rows) and (m_max + 1) columns. Each English sentence of the training data set is split into its meaningful phrases (or meaningful words). Each row of the table tableOfTrainingData corresponds to one English sentence, and the columns of that row hold the meaningful phrases (or words) of the sentence.

Algorithm 1 is presented in more detail in Code 1 below, and an illustrative sketch is given after the steps. The main ideas of Algorithm 1 are as follows:

  • Input: 140,000 English sentences of the English training data set including the 70,000 English positive sentences and the 70,000 English negative sentences

  • Output: table of training data.

  • Step 1: Create table tableOfTrainingData which has (m_max + 1) columns and 140,000 rows.

  • Step 2: With each sentence (one sentence) in the 70,000 English positive sentences of the 140,000 sentences, do repeat:

  • Step 3: Split this sentence into many words (or phrases) based on ‘ ’ or “ ”: arrayWords. Assume that m is the number of words (or phrases) of this sentence after splitting.

  • Step 4: Create one new row in table tableOfTrainingData: NewRow

  • Step 5: Do repeat i from 0 (the head of this sentence) to m-1 (the tail of this sentence):

  • Step 6: NewRow.column[i] = arrayWords[i]

  • Step 7: End of Step 5

  • Step 8: If i is less than m_max Then: do repeat

  • Step 9: NewRow.column[i] = 0 (or “ ”)

  • Step 10: End of Step 8

  • Step 11: NewRow.Column[m_max] = “positive”

  • Step 12: End of Step 2

  • Step 13: With each sentence (one sentence) in the 70,000 English negative sentences of the 140,000 sentences, do repeat:

  • Step 14: Split this sentence into many words (or phrases) based on ‘ ’ or “ ”: arrayWords. Assume that m is the number of words (or phrases) of this sentence after splitting.

  • Step 15: Create one new row in table tableOfTrainingData: NewRow

  • Step 16: Do repeat i from 0 (the head of this sentence) to m-1 (the tail of this sentence):

  • Step 17: NewRow.column[i] = arrayWords[i]

  • Step 18: End of Step 16

  • Step 19: If i is less than m_max Then: do repeat

  • Step 20: NewRow.column[i] = 0 (or “ ”)

  • Step 21: End of Step 19

  • Step 22: NewRow.Column[m_max] = “negative”

  • Step 23: End of Step 13

  • Step 24: Return table tableOfTrainingData
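The following is a minimal Python sketch of the main ideas of Algorithm 1 (the paper's actual implementation is in C# and given in Code 1). The function name build_training_table and the simple whitespace split are illustrative assumptions.

```python
# Illustrative sketch of Algorithm 1: build tableOfTrainingData from the labeled sentences.
# Each row holds the sentence's words, padded with 0 up to m_max columns, plus a polarity column.

def build_training_table(positive_sentences, negative_sentences):
    labeled = [(s, "positive") for s in positive_sentences] + \
              [(s, "negative") for s in negative_sentences]
    tokenized = [(sentence.split(), polarity) for sentence, polarity in labeled]
    m_max = max(len(words) for words, _ in tokenized)   # length of the longest sentence

    table = []
    for words, polarity in tokenized:                   # Steps 2-12 and 13-23
        row = words + [0] * (m_max - len(words))        # Steps 5-10: fill the words, pad with 0
        row.append(polarity)                            # Steps 11 and 22: column m_max = polarity
        table.append(row)
    return table, m_max

# Usage with the three example sentences from the Introduction:
table, m_max = build_training_table(
    ["The film is very good", "The film sounds good"],
    ["The actor is very bad"],
)
for row in table:
    print(row)   # e.g. ['The', 'film', 'sounds', 'good', 0, 'positive']
```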

According to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012), we build Algorithm 2 to generate the association rules of the positive rule group and the negative rule group using the C4.5 algorithm. The basic construction of a C4.5 decision tree is as follows (a brief sketch of its splitting criterion is given after the list):

  1. The root node is the top node of the tree. It considers all samples and selects the attribute that is most significant.

  2. The sample information is passed to subsequent nodes, called ‘branch nodes’, which eventually terminate in leaf nodes that give decisions.

  3. Rules are generated by following the path from the root node to a leaf node.
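As a brief reminder of the splitting criterion behind this construction, C4.5 selects at each node the attribute with the highest gain ratio, i.e., the information gain normalized by the split information. The sketch below computes these quantities for one candidate attribute; it is a textbook-style illustration, not the paper's code, and the function names are ours.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attribute_index, label_index=-1):
    """C4.5 criterion for one attribute: information gain normalized by split information."""
    labels = [row[label_index] for row in rows]
    base = entropy(labels)

    # Partition the class labels by the attribute's values.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute_index], []).append(row[label_index])

    gain, split_info = base, 0.0
    for subset in partitions.values():
        weight = len(subset) / len(rows)
        gain -= weight * entropy(subset)          # subtract the weighted entropy of each branch
        split_info -= weight * math.log2(weight)  # split information of the partition
    return gain / split_info if split_info > 0 else 0.0

rows = [["very", "good", "positive"],
        ["very", "bad", "negative"],
        ["sounds", "good", "positive"]]
print(gain_ratio(rows, attribute_index=1))   # 1.0: the second word separates the labels perfectly
```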

Algorithm 2 is presented in more detail in Code 2 below, and an illustrative sketch of the rule extraction is given after the steps. The main ideas of Algorithm 2 are as follows:

Input:

  • The table of training data, tableOfTrainingData, containing the training examples.

  • The attribute list S: the attributes that may be tested by the learned decision tree (columns 0 to m_max − 1 of tableOfTrainingData).

Output: the positive rule group and the negative rule group. The C4.5 algorithm builds a decision tree (actually the root node of the tree) that correctly classifies the given examples; the rules derived from this tree are divided into the positive rule group and the negative rule group.

From Step 1 to Step 26: Apply the C4.5 algorithm to the table tableOfTrainingData

  • Step 27: Set positiveRuleGroup := null

  • Step 28: Set negativeRuleGroup := null

  • Step 29: Browse decision tree Tree, do:

  • Step 30: If the rule is positive Then

  • Step 31: positiveRuleGroup.Add (the rule);

  • Step 32: Else If the rule is negative Then

  • Step 33: negativeRuleGroup.Add (the rule);

  • Step 34: End of Step 30

  • Step 35: End of Step 29

  • Step 36: Return positiveRuleGroup and negativeRuleGroup;
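The rule-extraction part of Algorithm 2 (Steps 27-36) can be sketched as a traversal that collects every root-to-leaf path of the learned tree as one rule and places it in the positive or negative rule group according to the leaf label. The Node structure and function names below are illustrative assumptions; building the tree itself (Steps 1-26) follows the standard C4.5 procedure and is omitted here.

```python
# Illustrative sketch of Steps 27-36 of Algorithm 2: turn every root-to-leaf path of the
# learned tree into one rule and split the rules into a positive and a negative rule group.

class Node:
    def __init__(self, test=None, children=None, leaf_label=None):
        self.test = test                  # e.g. the column (word position) tested at this node
        self.children = children or {}    # attribute value -> child Node
        self.leaf_label = leaf_label      # "positive" / "negative" at a leaf, otherwise None

def extract_rule_groups(node, path=()):
    positive_rules, negative_rules = [], []
    if node.leaf_label is not None:                        # a leaf closes one rule: path -> label
        (positive_rules if node.leaf_label == "positive" else negative_rules).append(path)
        return positive_rules, negative_rules
    for value, child in node.children.items():             # browse every branch (Step 29)
        pos, neg = extract_rule_groups(child, path + ((node.test, value),))
        positive_rules += pos                               # Steps 30-31
        negative_rules += neg                               # Steps 32-33
    return positive_rules, negative_rules                   # Step 36

# Tiny hand-built tree: tests whether column 3 holds "good" or "bad".
tree = Node(test="word_3", children={
    "good": Node(leaf_label="positive"),
    "bad": Node(leaf_label="negative"),
})
print(extract_rule_groups(tree))
# -> ([(('word_3', 'good'),)], [(('word_3', 'bad'),)])
```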

We build Algorithm 3 to classify one English sentence into the positive, negative, or neutral polarity. The positive association rule set in positiveRuleGroup and the negative association rule set in negativeRuleGroup are applied to one English sentence A. If the number of positive rules that A contains is greater than the number of negative rules that A contains, A is classified as positive. If the number of positive rules that A contains is less than the number of negative rules that A contains, A is classified as negative. If the number of positive rules that A contains equals the number of negative rules that A contains, or if A does not contain any positive or negative rule, A is classified as neutral.

Algorithm 3 is presented in more detail in Code 3 below, and an illustrative sketch is given after the steps. The main ideas of Algorithm 3 are as follows:

  • Input: one English sentence A, the positive rule group positiveRuleGroup and the negative rule group negativeRuleGroup

  • Output: positive, negative, neutral of this sentence A.

  • Step 1: With each rule (one rule) R in the positive rule group positiveRuleGroup, do repeat:

  • Step 2: If the sentence A contains R Then

  • Step 3: Set variable variableOfPositive := variableOfPositive + 1

  • Step 4: End Of Step 2

  • Step 5: End of Step 1

  • Step 6: With each rule (one rule) R in the negative rule group negativeRuleGroup, do repeat:

  • Step 7: If the sentence A contains R Then

  • Step 8: Set variable variableOfNegative := variableOfNegative + 1

  • Step 9: End of Step 7

  • Step 10: End of Step 6

  • Step 11: If variableOfPositive is greater than variableOfNegative Then

  • Step 12: Return positive

  • Step 13: Else If variableOfPositive is less than variableOfNegative Then

  • Step 14: Return negative

  • Step 15: End If

  • Step 16: Return neutral
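A minimal Python sketch of Algorithm 3 follows. The representation of a rule as a phrase that the sentence must fully contain (as in the Introduction's examples) and the function name classify_sentence are illustrative assumptions; the paper's implementation is given in Code 3.

```python
# Illustrative sketch of Algorithm 3: classify one sentence by counting how many
# positive and how many negative rules it fully contains.

def classify_sentence(sentence, positive_rules, negative_rules):
    count_positive = sum(1 for rule in positive_rules if rule in sentence)   # Steps 1-5
    count_negative = sum(1 for rule in negative_rules if rule in sentence)   # Steps 6-10
    if count_positive > count_negative:        # Steps 11-12
        return "positive"
    if count_positive < count_negative:        # Steps 13-14
        return "negative"
    return "neutral"                           # Step 16 (also when no rule matches)

positive_rules = ["very good", "very handsome", "excellent"]
negative_rules = ["very bad", "terrible"]
print(classify_sentence("the film is very good", positive_rules, negative_rules))     # positive
print(classify_sentence("he is drinking some beer", positive_rules, negative_rules))  # neutral
```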

We build Algorithm 4 to classify one English document into the positive, negative, or neutral polarity. The English document is classified as positive if the number of its sentences classified as positive is greater than the number of its sentences classified as negative. The document is classified as negative if the number of sentences classified as positive is less than the number of sentences classified as negative. The document is classified as neutral if the number of sentences classified as positive equals the number of sentences classified as negative.

Algorithm 4 is presented in more detail in Code 4 below, and an illustrative sketch is given after the steps. The main ideas of Algorithm 4 are as follows:

  • Input: one English document, consisting of the n English sentences, with the polarity result of each English sentence obtained by using Algorithm 3.

  • Output: positive, negative, neutral of this English document

  • Step 1: If the number of English sentences classified into the positive polarity is greater than the number of English sentences classified into the negative polarity in the document Then

  • Step 2: Return positive;

  • Step 3: End If

  • Step 4: If the number of English sentences classified into the positive polarity is less than the number of English sentences classified into the negative polarity in the document Then

  • Step 5: Return negative;

  • Step 6: End If

  • Step 7: Return neutral;

Alternatively, the main ideas of Algorithm 4 can be expressed as follows:

  • Input: one English document A

  • Output: positive, negative, neutral of this English document

  • Step 1: Split this English document A into its English sentences: n sentences.

  • Step 2: With each sentence (one sentence) i in the n sentences, do repeat:

  • Step 3: Run algorithm 3 with the sentence i

  • Step 4: If the result is positive Then

  • Step 5: Set variableOfPositive := variableOfPositive + 1

  • Step 6: End of Step 4

  • Step 7: If the result is negative Then

  • Step 8: Set variableOfNegative := variableOfNegative + 1

  • Step 9: End of Step 7

  • Step 10: End of Step 2

  • Step 11: If variableOfPositive is greater than variableOfNegative Then

  • Step 12: Return positive

  • Step 13: Else If variableOfPositive is less than variableOfNegative Then

  • Step 14: Return negative

  • Step 15: End of Step 11

  • Step 16: Return neutral
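The second form of Algorithm 4 can be sketched as below. The naive sentence splitter (splitting on '.', '!' and '?') and the function names are illustrative assumptions, and sentence_polarity compactly repeats the Algorithm 3 sketch so that the example is self-contained.

```python
import re

# Illustrative sketch of Algorithm 4: classify one document by comparing the numbers of
# positive and negative sentences. sentence_polarity repeats the Algorithm 3 sketch compactly.

def sentence_polarity(sentence, positive_rules, negative_rules):
    pos = sum(rule in sentence for rule in positive_rules)
    neg = sum(rule in sentence for rule in negative_rules)
    return "positive" if pos > neg else "negative" if pos < neg else "neutral"

def classify_document(document, positive_rules, negative_rules):
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]   # Step 1 (naive split)
    votes = [sentence_polarity(s, positive_rules, negative_rules)
             for s in sentences]                                          # Steps 2-10
    count_positive, count_negative = votes.count("positive"), votes.count("negative")
    if count_positive > count_negative:        # Steps 11-12
        return "positive"
    if count_positive < count_negative:        # Steps 13-14
        return "negative"
    return "neutral"                           # Step 16

print(classify_document(
    "The film is very good. The actor is very bad. The music is terrible.",
    ["very good", "very handsome", "excellent"],
    ["very bad", "terrible"]))
# -> negative (1 positive sentence vs 2 negative sentences)
```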

5 Experiment

To implement the proposed model, we used Microsoft SQL Server 2008 R2 to store the English data sets and the results of the emotion classification.

Microsoft Visual Studio 2010 (C#) was used to program the data handling and to implement our proposed model for classifying the 25,000 English documents of each of the testing data sets t1 and t2.

The experiments were conducted on a laptop with a dual-core Intel Core i5 processor at 2.6 GHz and 8 GB of memory, running Microsoft Windows 8.

We use the Accuracy (A) measure to evaluate the results of the emotion classification; a brief sketch of this computation is given below.
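A minimal sketch of this measure, under the usual definition of accuracy as the fraction of test documents whose predicted polarity matches the gold label, is given here; it is an illustration, not the paper's evaluation code.

```python
# Minimal sketch of the Accuracy (A) measure: the fraction of test documents whose
# predicted polarity matches the gold label.

def accuracy(predicted_labels, gold_labels):
    assert len(predicted_labels) == len(gold_labels)
    correct = sum(1 for p, g in zip(predicted_labels, gold_labels) if p == g)
    return correct / len(gold_labels)

# Example: 3 of 5 predictions are correct -> accuracy 0.6
print(accuracy(["positive", "negative", "neutral", "positive", "negative"],
               ["positive", "negative", "positive", "negative", "negative"]))
```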

The classification results for the 25,000 English documents of testing data set t1 are presented in Table 2 below in the “Appendix”.

Table 2 The results of the 25,000 English documents in testing data set t1

The classification results for the 25,000 English documents of testing data set t2 are presented in Table 3 below in the “Appendix”.

Table 3 The results of the 25,000 English documents in testing data set t2

The accuracy for the 25,000 English documents in testing data set t1 is shown in Table 4 below in the “Appendix”.

Table 4 The accuracy of our new model for the 25,000 English documents in testing data set t1

The accuracy for the 25,000 English documents in testing data set t2 is shown in Table 5 below in the “Appendix”.

Table 5 The accuracy of our new model for the 25,000 English documents in testing data set t2

We also compare our results with those of the related surveys in the “Appendix”.

6 Conclusion

Using our model, the classification of the 25,000 English documents of the t1 data set achieved an accuracy of 60.3%, and of the t2 data set 60.7%.

With the same English training data set, the classification results on different English testing data sets differ considerably. The classification results depend on the association rules of the positive rule group and the negative rule group, and these rules in turn depend on the algorithms and the English training data sets.

With the same English training data set, different algorithms generate very different association rules for the positive and negative rule groups; thus, the classification results also differ considerably.

With the same algorithm, different training data sets generate very different association rules for the positive and negative rule groups; thus, the classification results also differ considerably.

To increase the accuracy of the classification results significantly, we can increase the number of association rules in the positive and negative rule groups.

To increase the number of association rules in the positive and negative rule groups significantly, we can improve the algorithms, the English training data sets, or both.

Although our model’s accuracy is not high, it is a new contribution to English sentiment classification and to sentiment classification of other languages.

On the basis of the C4.5 algorithm, we built the CA-related algorithms that implement our new model.

This model has both benefits and drawbacks. The benefits are as follows: the document-level emotional analysis is based on the English sentences; the rules generated by the C4.5 algorithm are highly reliable; and such rules are used in many research works and commercial applications. The drawbacks are as follows: the accuracy of the model is low, even though rule-based sentiment classification often achieves high accuracy, and it takes a long time to generate the rules.

To demonstrate the scientific value of this research, we compare our model’s results with those of many studies in the tables below in the “Appendix”.

In Table 6 below, we compare our model’s results with many studies related to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012).

Table 6 Comparison of our model’s results with many studies related to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012)

In Table 7 below, we compare our model’s results with many studies related to the decision tree for sentiment classification in (Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Vinodhini and Chandrasekaran 2013, 2007, 2014; Kaur et al. 2015; Prasad et al. 2016, 2014; Sharma 2014).

Table 7 Comparison of our model’s results with many studies related to the decision tree for sentiment classification in (Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Vinodhini and Chandrasekaran 2013, 2007, 2014; Kaur et al. 2015; Prasad et al. 2016, 2014; Sharma 2014)

In Table 8 below, we compare our model’s results with the latest sentiment classification studies in (2016, Kaur et al. 2016; Phu 2014; Tran et al. 2014).

Table 8 Comparison of our model with the latest sentiment classification models in (2016, Kaur et al. 2016; Phu and Tuoi 2014; Tran et al. 2014)

In Table 9 below, we compare our model’s results with the latest unsupervised classification works in (Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004).

Table 9 Comparison of our model with the latest unsupervised classification works in (Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004)

We compare our model with many decision tree algorithms in (Friedl and Brodley 1997; Freund and Mason 1999; Payne et al. 1978; Chang 1977; Mehta et al. 1995) in Table 10.

Table 10 Comparison of our model with many decision tree algorithms in (Friedl and Brodley 1997; Freund and Mason 1999; Payne et al. 1978; Chang 1977; Mehta et al. 1995)