1 Introduction

Solutions for semantic analysis are important and helpful for many researchers and many applications. Today there are many studies and applications of sentiment classification in many languages.

In this work we propose a new model that uses a decision tree, specifically the C4.5 algorithm (CA), for English document-level emotional classification.

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal, but they are also a popular tool in machine learning.

The C4.5 algorithm is a well-known decision tree algorithm from the data mining field, and it has long been used in many other fields. However, the C4.5 algorithm has rarely been used in natural language processing (NLP), and in particular not in sentiment classification. We believed that it could be used for opinion analysis, so we applied it to semantic analysis, although doing so was difficult. This is significant for work and applications in NLP. Our results confirm that the C4.5 algorithm can be used in NLP and, specifically, in opinion classification. The aim of this research is to implement the C4.5 algorithm for the emotional analysis of English documents based on the English sentences of the English training data set. We surveyed the literature related to decision trees and emotional classification and, as shown by the evidence below, found no research similar to this study. We examined many methodologies for applying the C4.5 algorithm to sentiment classification of English documents and then experimented with them on our data sets. Thus, the proposed model is original and novel, and it has significance for the data mining field, NLP, computer science, and related areas.

We use the CA to classify the semantics (positive, negative, or neutral) of each English document in the English testing data set, based on the 140,000 English sentences of the English training data set, which comprise 70,000 positive English sentences and 70,000 negative English sentences.

We propose several basic principles to implement our new model, as follows:

  • Assume that one English document in the English testing data set has n English sentences.

  • Assume that one English sentence in the English testing data set or in the English training data set has m English words (or English phrases).

  • Assume that the longest English sentence in either the English testing data set or the English training data set has length m_max; that is, m_max is greater than or equal to m.

  • We build a table of training data for the CA based on the 140,000 English sentences of the English training data set as follows:

    • The table of training data has 140,000 records (or 140,000 rows) and (m_max + 1) columns.

    • Each column of the table from column 0 to column (m_max − 1) holds one English word (or one English phrase). If an English sentence has length m (m < m_max), then each of its columns from m to (m_max − 1) is set to 0 (zero).

    • Column m_max of the table is the polarity column. It indicates whether the sentence belongs to the 70,000 English positive sentences or to the 70,000 English negative sentences.

    • For example, consider three English sentences:

    The film is very good → the sentence belongs to the 70,000 English positive sentences.

    The actor is very bad → the sentence belongs to the 70,000 English negative sentences.

    The film sounds good → the sentence belongs to the 70,000 English positive sentences.

    • The table of training data is shown as Table 1 below in the “Appendix”.

  • When we apply the CA to Table 1, we obtain a decision tree that generates many association rules. The association rules have the form “X → positive” or “Y → negative”. These rules are divided into two groups: the positive rule group and the negative rule group. The positive rule group contains all association rules of the form “X → positive”, and the negative rule group contains all association rules of the form “Y → negative”.

  • An English sentence of an English document in the English testing data set has the positive polarity if the sentence fully contains X. The sentence has the negative polarity if it fully contains Y. The sentence has the neutral polarity if it fully contains neither X nor Y.

  • Assume that we have rules such as: “very good” → positive; “very handsome” → positive; “excellent” → positive; “very bad” → negative; “terrible” → negative; and three sentences: “the film is very good”, “the actor is very bad”, and “he is drinking some beer”. The first sentence, “the film is very good”, contains only the rule “very good” → positive, so it has the positive polarity. The second sentence, “the actor is very bad”, contains only the rule “very bad” → negative, so it has the negative polarity. The third sentence, “he is drinking some beer”, does not contain any rule in our rule set, so it has the neutral polarity.

  • An English document in the English testing data set has the positive polarity if the number of its English sentences classified as positive is greater than the number of its English sentences classified as negative. The document has the negative polarity if the number of sentences classified as positive is less than the number of sentences classified as negative. The document has the neutral polarity if the number of sentences classified as positive equals the number of sentences classified as negative.

Table 1 Training data set for a decision tree

Among the many studies related to the C4.5 algorithm (CA) worldwide, including (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000, 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012), there is no CA-related work similar to our study.

Among the many studies related to decision trees for sentiment classification (opinion analysis, semantic classification) worldwide, including (Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015, 20, 21; Vinodhini and Chandrasekaran 2013, 23, 24; Opinion 2015; Prasad et al. 2016, 27; Mugdha; Sharma 2014; Park et al. 2003; Loh and Mauricio 2003), there is no CA-related research for semantic classification similar to our work.

Among the many works related to sentiment classification worldwide, including (Manek et al. 2016; Agarwal and Mittal 2016a, b; Canuto et al. 2016; Kaur et al. 2016; Phu 2014; Tran et al. 2014; Li and Liu 2014), there is no CA-related study for sentiment classification similar to our model.

Among the many studies related to unsupervised classification worldwide, including (Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004), there is no CA-related study of unsupervised classification similar to our work.

According to the CA-related work in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012), the CA has several advantages and disadvantages. Its advantages are as follows: it builds models that can be easily interpreted; it is easy to implement; it can use both categorical and continuous values; and it deals with noise. Its disadvantages are as follows: small variations in the data can lead to different decision trees (especially when the variables are close to each other in value), and it does not work very well on a small training set.

Based on the works related to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012), we build the CA-related algorithms that implement our new model.

The motivation of this work is as follows. Rule-based sentiment classification often has high accuracy, and rules are very popular in data mining. Researchers have sought many ways to use data mining rules in opinion analysis and to explore the different relationships between data mining and natural language processing. The C4.5 algorithm is a very popular and significant data mining algorithm, so the rules it generates are highly reliable. This can lead to many discoveries in scientific research, hence the motivation for this study.

The proposed approach is quite novel. The semantic analysis of an English document is based on many English sentences in the English training data set. The emotional classification of an English document is based on many association rules from the data mining field, and the sentiment analysis itself is based on the CA. These principles are proposed to classify the semantics of an English document, and data mining is thereby used in natural language processing.

Based on the studies surveyed worldwide, including (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012; Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Psomakelis et al. 2015; Shrivastava and Nair 2015; Vinodhini and Chandrasekaran 2013; Voll et al. 2007; Mandal et al. 2014; Kaur et al. 2015, 2016; Prasad et al. 2016, 27; Mugdha; Sharma 2014; Park et al. 2003; Loh and Mauricio 2003; Manek et al. 2016; Agarwal and Mittal 2016a, b; Canuto et al. 2016; Phu and Tuoi 2014; Tran et al. 2014; Li and Liu 2014; Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004), the significant contributions of this study can be summarized briefly as follows:

  a. The C4.5 algorithm is a decision tree algorithm, but here it is applied to NLP.

  b. It has not previously been used in sentiment classification; here it is applied to opinion analysis.

  c. It has not previously been used for English document-level semantic analysis; here it is applied to the emotional classification of English documents.

  d. The results of this survey show that it is widely applied in different fields and different applications.

  e. This model can be applied to other languages.

  f. The C4.5-related algorithms are built in this research.

  g. The rules are generated in this model.

Based on the above contributions, the model compares favorably with the other methodologies and is completely different from the other methods and models.

This study contains six sections. Section 1 is the introduction; Sect. 2 discusses the related works on the C4.5 algorithm and other topics; Sect. 3 describes the English data sets; Sect. 4 presents the methodology of our proposed model; Sect. 5 presents the experimental model and the experimental results; and the conclusion of the proposed model is given in Sect. 6. In addition, the References section lists the cited works, all the tables are shown in the Appendices section, and all the codes of the algorithms in the Methodology are shown in the “Appendices of All Codes” section.

2 Related work

In this section, we summarize studies related to our research, such as work on the C4.5 algorithm and sentiment analysis.

There are many works related to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012). The authors of (Ruggieri 2002) present an analytic evaluation of the runtime behavior of the C4.5 algorithm which highlights some efficiency improvements. Based on the analytic evaluation, they implemented a more efficient version of the algorithm, called EC4.5. It improves on C4.5 by adopting the best among three strategies for computing the information gain of continuous attributes. All the strategies adopt a binary search of the threshold in the whole training set, starting from the local threshold computed at a node. The first strategy computes the local threshold using the algorithm of C4.5, which, in particular, sorts cases by means of the quicksort method. The second strategy also uses the algorithm of C4.5, but adopts a counting sort method. The third strategy calculates the local threshold using a main-memory version of the RainForest algorithm, which does not need sorting. The authors’ implementation computes the same decision trees as C4.5 with a performance gain of up to five times. In (Kretschmann et al. 2001), the gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools, without the use of automated annotation systems, cannot keep up with the ever-increasing quantity of submitted data. Automated supplements to manually curated databases such as TrEMBL or GenPept cover raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information, and help detect inconsistencies in manually generated annotations. A standard data mining algorithm was successfully applied to gain knowledge about the keyword annotation in SWISS-PROT. 11,306 rules were generated, which are provided in a database, can be applied to not-yet-annotated protein sequences, and can be viewed using a web browser. They rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. The statistical evaluation of the generated rules by cross-validation suggests that, by applying them to arbitrary proteins, 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage rate of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%, etc.

Then, we compare our proposed model’s results with the surveys in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012; Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Psomakelis et al. 2015; Shrivastava and Nair 2015; Vinodhini and Chandrasekaran 2013; Voll et al. 2007; Mandal et al. 2014; Kaur et al. 2015, 2016; Prasad et al. 2016, 27; Mugdha; Sharma 2014; Park et al. 2003; Loh and Mauricio 2003, 31, 32, 33, 34; Phu and Tuoi 2014; Tran et al. 2014; Li and Liu 2014; Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004; Phu et al. 2016, 2017a, b; Friedl and Brodley 1997; Freund and Mason 1999; Payne et al. 1978; Chang 1977; Mehta et al. 1995; Phu et al. 2017).

There are many studies related to decision trees for sentiment classification in (Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Psomakelis et al. 2015; Vinodhini and Chandrasekaran 2013, 23; Mandal et al. 2014; Kaur et al. 2015; Prasad et al. 2016; Pong-Inwong et al. 2014; Mugdha; Sharma 2014; Park et al. 2003; Loh and Mauricio 2003). Automatic text classification (Mita 2011) is a semi-supervised machine learning task that automatically assigns a given document to a set of pre-defined categories based on its textual content and extracted features. Automatic text classification has important applications in content management, contextual search, opinion mining, analysis of product reviews, spam filtering, and text sentiment mining. The survey (Mita 2011) explains the generic strategy for automatic text classification and surveys existing solutions. The authors in (Taboada et al. 2008) present an approach to extracting sentiment from texts that makes use of contextual information. Using two different approaches, they extract the most relevant sentences of a text and calculate the semantic orientation, weighting those sentences more heavily, etc.

The latest studies of sentiment classification are (Manek et al. 2016; Agarwal and Mittal 2016a, b, 34; Kaur et al. 2016; Phu 2014; Tran et al. 2014; Li and Liu 2014; Phu et al. 2017a, b; Phu et al. 2017). As noted in (Manek et al. 2016), with the rapid development of the World Wide Web, electronic word-of-mouth interaction has made consumers active participants. Nowadays, a large number of reviews posted by consumers on the Web provide valuable information to other consumers. Such information is highly essential for decision making and hence popular among internet users. This information is very valuable not only for prospective consumers making decisions, but also for businesses in predicting success and sustainability. In that work (Manek et al. 2016), a Gini Index-based feature selection method with a Support Vector Machine (SVM) classifier is proposed for sentiment classification of a large movie review dataset. Opinion mining or sentiment analysis, as described in Agarwal and Mittal (2016a), is the study that analyzes people’s opinions or sentiments from text towards entities such as products and services. It has always been important to know what other people think. With the rapid growth of the availability and popularity of online review sites, blogs, forums, and social networking sites, the necessity of analyzing and understanding these reviews has arisen. The main approaches for sentiment analysis can be categorized into semantic orientation-based approaches, knowledge-based approaches, and machine-learning algorithms. The work in (Agarwal and Mittal 2016a) surveys the machine learning approaches applied to sentiment analysis-based applications, etc.

The latest works on unsupervised classification are (Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004). The study in (Turney 2002) presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The authors in (Lee et al. 2002) propose a new method for unsupervised classification of terrain types and man-made objects using polarimetric synthetic aperture radar (SAR) data, etc.

3 Data set

As shown in Fig. 1, the English training data set includes 140,000 English sentences in the movie domain, comprising 70,000 positive English sentences and 70,000 negative English sentences. All English sentences in our English training data set were automatically extracted from English-language Facebook posts and websites; we then labeled them as positive or negative.

Fig. 1 Our English training data set

As shown in Fig. 2, we use a publicly available large data set of classified movie reviews from the Internet Movie Database (IMDb) (Large 2016). This English data set consists of two parts in two different folders. The first part is in the “testing data set” folder; it was originally named the testing data set, and we call it the first testing data set. The second part is in the “training data set” folder; it was originally named the training data set, and we call it the second testing data set. Both our first testing data set and our second testing data set contain 25,000 English documents, and each includes 12,500 positive English movie reviews and 12,500 negative English movie reviews.

Fig. 2 Our English testing data set
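To make the data handling concrete, the following minimal sketch (in Python, for illustration only; the paper's implementation uses C# and SQL Server) loads one part of such a data set from disk. The on-disk layout with "pos" and "neg" subfolders of plain-text review files and the function name load_reviews are assumptions made for the example, not details stated in the paper.

```python
import os

def load_reviews(part_dir):
    """Load labeled movie reviews, assuming part_dir/pos/*.txt and part_dir/neg/*.txt."""
    documents = []
    for label in ("pos", "neg"):
        folder = os.path.join(part_dir, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                documents.append((f.read(), "positive" if label == "pos" else "negative"))
    return documents

# Hypothetical usage: the two parts of 25,000 documents each.
# first_testing_set = load_reviews("testing data set")
# second_testing_set = load_reviews("training data set")
```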

4 Methodology

In this section, we present how our new model is implemented. First, the table of training data is created from the 70,000 positive sentences and the 70,000 negative sentences. Second, the C4.5 algorithm (CA) is applied to this table to generate the positive association rule set and the negative association rule set. Next, each English document of the English testing data set is split into its English sentences. Then, the positive and negative association rule sets are applied to each English sentence of the document, and the emotional classification of that sentence is identified. Finally, the semantic classification of the English document is determined from its sentences.

Figure 3 shows the overall process of this research.

Fig. 3 Overview process of our new model

The criteria for selecting both positive and negative association rules certainly depend on the English training data set and on the algorithm used to generate them (in this paper, the C4.5 algorithm). The positive and negative association rules are essential for this model to identify the emotional polarity (positive, negative, or neutral) of an English sentence. The semantic classification of an English document is then determined from its sentences.

We propose several algorithms to implement the model.

We build Algorithm 1 to create the table of training data, which has 140,000 records (rows) and (m_max + 1) columns. Each English sentence of the training data set is split into its meaningful phrases (or meaningful words). Each row of the table tableOfTrainingData corresponds to one English sentence, and the columns of that row hold the meaningful phrases (or words) of the sentence.

Algorithm 1 is presented in more detail in Code 1 below, and an illustrative sketch is given after the steps. The main ideas of Algorithm 1 are as follows:

  • Input: 140,000 English sentences of the English training data set including the 70,000 English positive sentences and the 70,000 English negative sentences

  • Output: table of training data.

  • Step 1: Create table tableOfTrainingData which has (m_max + 1) columns and 140,000 rows.

  • Step 2: With each sentence (one sentence) in the 70,000 English positive sentences of the 140,000 sentences, do repeat:

  • Step 3: Split this sentence into many words (or phrases) based on ‘ ’ or “ ”: arrayWords. Assume that m is the number of words (or phrases) of this sentence after splitting.

  • Step 4: Create one new row in table tableOfTrainingData: NewRow

  • Step 5: Do repeat i from 0 (the head of this sentence) to m-1 (the tail of this sentence):

  • Step 6: NewRow.column[i] = arrayWords[i]

  • Step 7: End of Step 5

  • Step 8: If i is less than m_max Then: do repeat

  • Step 9: NewRow.column[i] = 0 (or “ ”)

  • Step 10: End of Step 8

  • Step 11: NewRow.Column[m_max] = “positive”

  • Step 12: End of Step 2

  • Step 13: With each sentence (one sentence) in the 70,000 English negative sentences of the 140,000 sentences, do repeat:

  • Step 14: Split this sentence into many words (or phrases) based on ‘ ’ or “ ”: arrayWords. Assume that m is the number of words (or phrases) of this sentence after splitting.

  • Step 15: Create one new row in table tableOfTrainingData: NewRow

  • Step 16: Do repeat i from 0 (the head of this sentence) to m-1 (the tail of this sentence):

  • Step 17: NewRow.column[i] = arrayWords[i]

  • Step 18: End of Step 16

  • Step 19: If i is less than m_max Then: do repeat

  • Step 20: NewRow.column[i] = 0 (or “ ”)

  • Step 21: End of Step 19

  • Step 22: NewRow.Column[m_max] = “negative”

  • Step 23: End of Step 13

  • Step 24: Return table tableOfTrainingData
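The following is a minimal Python sketch of the main ideas of Algorithm 1 (the paper's actual implementation is in C# and given in Code 1). The function name build_training_table and the simple whitespace split are illustrative assumptions.

```python
# Illustrative sketch of Algorithm 1: build tableOfTrainingData from the labeled sentences.
# Each row holds the sentence's words, padded with 0 up to m_max columns, plus a polarity column.

def build_training_table(positive_sentences, negative_sentences):
    labeled = [(s, "positive") for s in positive_sentences] + \
              [(s, "negative") for s in negative_sentences]
    tokenized = [(sentence.split(), polarity) for sentence, polarity in labeled]
    m_max = max(len(words) for words, _ in tokenized)   # length of the longest sentence

    table = []
    for words, polarity in tokenized:                   # Steps 2-12 and 13-23
        row = words + [0] * (m_max - len(words))        # Steps 5-10: fill the words, pad with 0
        row.append(polarity)                            # Steps 11 and 22: column m_max = polarity
        table.append(row)
    return table, m_max

# Usage with the three example sentences from the Introduction:
table, m_max = build_training_table(
    ["The film is very good", "The film sounds good"],
    ["The actor is very bad"],
)
for row in table:
    print(row)   # e.g. ['The', 'film', 'sounds', 'good', 0, 'positive']
```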

According to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012), we build Algorithm 2 to generate the association rules of the positive rule group and the negative rule group using the C4.5 algorithm. The basic construction of a C4.5 decision tree is as follows (a brief sketch of its splitting criterion is given after the list):

  1. The root node is the top node of the tree. It considers all samples and selects the attribute that is most significant.

  2. The sample information is passed to subsequent nodes, called ‘branch nodes’, which eventually terminate in leaf nodes that give decisions.

  3. Rules are generated by following the path from the root node to a leaf node.
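As a brief reminder of the splitting criterion behind this construction, C4.5 selects at each node the attribute with the highest gain ratio, i.e., the information gain normalized by the split information. The sketch below computes these quantities for one candidate attribute; it is a textbook-style illustration, not the paper's code, and the function names are ours.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attribute_index, label_index=-1):
    """C4.5 criterion for one attribute: information gain normalized by split information."""
    labels = [row[label_index] for row in rows]
    base = entropy(labels)

    # Partition the class labels by the attribute's values.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute_index], []).append(row[label_index])

    gain, split_info = base, 0.0
    for subset in partitions.values():
        weight = len(subset) / len(rows)
        gain -= weight * entropy(subset)          # subtract the weighted entropy of each branch
        split_info -= weight * math.log2(weight)  # split information of the partition
    return gain / split_info if split_info > 0 else 0.0

rows = [["very", "good", "positive"],
        ["very", "bad", "negative"],
        ["sounds", "good", "positive"]]
print(gain_ratio(rows, attribute_index=1))   # 1.0: the second word separates the labels perfectly
```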

Algorithm 2 is presented in more detail in Code 2 below, and an illustrative sketch of the rule extraction is given after the steps. The main ideas of Algorithm 2 are as follows:

Input:

  • The table of training data, tableOfTrainingData, containing the training examples.

  • The attribute list S: the attributes that may be tested by the learned decision tree (columns 0 to m_max − 1 of tableOfTrainingData).

Output: the positive rule group and the negative rule group. The C4.5 algorithm builds a decision tree (actually the root node of the tree) that correctly classifies the given examples; the rules derived from this tree are divided into the positive rule group and the negative rule group.

From Step 1 to Step 26: Apply the C4.5 algorithm to the table tableOfTrainingData

  • Step 27: Set positiveRuleGroup := null

  • Step 28: Set negativeRuleGroup := null

  • Step 29: Browse decision tree Tree, do:

  • Step 30: If the rule is positive Then

  • Step 31: positiveRuleGroup.Add (the rule);

  • Step 32: Else If the rule is negative Then

  • Step 33: negativeRuleGroup.Add (the rule);

  • Step 34: End of Step 30

  • Step 35: End of Step 29

  • Step 36: Return positiveRuleGroup and negativeRuleGroup;
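The rule-extraction part of Algorithm 2 (Steps 27-36) can be sketched as a traversal that collects every root-to-leaf path of the learned tree as one rule and places it in the positive or negative rule group according to the leaf label. The Node structure and function names below are illustrative assumptions; building the tree itself (Steps 1-26) follows the standard C4.5 procedure and is omitted here.

```python
# Illustrative sketch of Steps 27-36 of Algorithm 2: turn every root-to-leaf path of the
# learned tree into one rule and split the rules into a positive and a negative rule group.

class Node:
    def __init__(self, test=None, children=None, leaf_label=None):
        self.test = test                  # e.g. the column (word position) tested at this node
        self.children = children or {}    # attribute value -> child Node
        self.leaf_label = leaf_label      # "positive" / "negative" at a leaf, otherwise None

def extract_rule_groups(node, path=()):
    positive_rules, negative_rules = [], []
    if node.leaf_label is not None:                        # a leaf closes one rule: path -> label
        (positive_rules if node.leaf_label == "positive" else negative_rules).append(path)
        return positive_rules, negative_rules
    for value, child in node.children.items():             # browse every branch (Step 29)
        pos, neg = extract_rule_groups(child, path + ((node.test, value),))
        positive_rules += pos                               # Steps 30-31
        negative_rules += neg                               # Steps 32-33
    return positive_rules, negative_rules                   # Step 36

# Tiny hand-built tree: tests whether column 3 holds "good" or "bad".
tree = Node(test="word_3", children={
    "good": Node(leaf_label="positive"),
    "bad": Node(leaf_label="negative"),
})
print(extract_rule_groups(tree))
# -> ([(('word_3', 'good'),)], [(('word_3', 'bad'),)])
```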

We build Algorithm 3 to classify one English sentence into the positive, negative, or neutral polarity. The positive association rule set in positiveRuleGroup and the negative association rule set in negativeRuleGroup are applied to one English sentence A. If the number of positive rules that A contains is greater than the number of negative rules that A contains, A is classified as positive. If the number of positive rules that A contains is less than the number of negative rules that A contains, A is classified as negative. If the number of positive rules that A contains equals the number of negative rules that A contains, or if A does not contain any positive or negative rule, A is classified as neutral.

Algorithm 3 is presented in more detail in Code 3 below, and an illustrative sketch is given after the steps. The main ideas of Algorithm 3 are as follows:

  • Input: one English sentence A, the positive rule group positiveRuleGroup and the negative rule group negativeRuleGroup

  • Output: positive, negative, neutral of this sentence A.

  • Step 1: With each rule (one rule) R in the positive rule group positiveRuleGroup, do repeat:

  • Step 2: If the sentence A contains R Then

  • Step 3: Set variable variableOfPositive := variableOfPositive + 1

  • Step 4: End Of Step 2

  • Step 5: End of Step 1

  • Step 6: With each rule (one rule) R in the negative rule group negativeRuleGroup, do repeat:

  • Step 7: If the sentence A contains R Then

  • Step 8: Set variable variableOfNegative := variableOfNegative + 1

  • Step 9: End of Step 7

  • Step 10: End of Step 6

  • Step 11: If variableOfPositive is greater than variableOfNegative Then

  • Step 12: Return positive

  • Step 13: Else If variableOfPositive is less than variableOfNegative Then

  • Step 14: Return negative

  • Step 15: End If

  • Step 16: Return neutral
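A minimal Python sketch of Algorithm 3 follows. The representation of a rule as a phrase that the sentence must fully contain (as in the Introduction's examples) and the function name classify_sentence are illustrative assumptions; the paper's implementation is given in Code 3.

```python
# Illustrative sketch of Algorithm 3: classify one sentence by counting how many
# positive and how many negative rules it fully contains.

def classify_sentence(sentence, positive_rules, negative_rules):
    count_positive = sum(1 for rule in positive_rules if rule in sentence)   # Steps 1-5
    count_negative = sum(1 for rule in negative_rules if rule in sentence)   # Steps 6-10
    if count_positive > count_negative:        # Steps 11-12
        return "positive"
    if count_positive < count_negative:        # Steps 13-14
        return "negative"
    return "neutral"                           # Step 16 (also when no rule matches)

positive_rules = ["very good", "very handsome", "excellent"]
negative_rules = ["very bad", "terrible"]
print(classify_sentence("the film is very good", positive_rules, negative_rules))     # positive
print(classify_sentence("he is drinking some beer", positive_rules, negative_rules))  # neutral
```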

We build Algorithm 4 to classify one English document into the positive, negative, or neutral polarity. The English document is classified as positive if the number of its sentences classified as positive is greater than the number of its sentences classified as negative. The document is classified as negative if the number of sentences classified as positive is less than the number of sentences classified as negative. The document is classified as neutral if the number of sentences classified as positive equals the number of sentences classified as negative.

Algorithm 4 is presented in more detail in Code 4 below, and an illustrative sketch is given after the steps. The main ideas of Algorithm 4 are as follows:

  • Input: one English document, consisting of the n English sentences, with the polarity result of each English sentence obtained by using Algorithm 3.

  • Output: positive, negative, neutral of this English document

  • Step 1: If the number of English sentences classified into the positive polarity is greater than the number of English sentences classified into the negative polarity in the document Then

  • Step 2: Return positive;

  • Step 3: End If

  • Step 4: If the number of English sentences classified into the positive polarity is less than the number of English sentences classified into the negative polarity in the document Then

  • Step 5: Return negative;

  • Step 6: End If

  • Step 7: Return neutral;

Alternatively, the main ideas of Algorithm 4 can be expressed as follows:

  • Input: one English document A

  • Output: positive, negative, neutral of this English document

  • Step 1: Split this English document A into its English sentences: n sentences.

  • Step 2: With each sentence (one sentence) i in the n sentences, do repeat:

  • Step 3: Run algorithm 3 with the sentence i

  • Step 4: If the result is positive Then

  • Step 5: Set variableOfPositive := variableOfPositive + 1

  • Step 6: End of Step 4

  • Step 7: If the result is negative Then

  • Step 8: Set variableOfNegative := variableOfNegative + 1

  • Step 9: End of Step 7

  • Step 10: End of Step 2

  • Step 11: If variableOfPositive is greater than variableOfNegative Then

  • Step 12: Return positive

  • Step 13: Else If variableOfPositive is less than variableOfNegative Then

  • Step 14: Return negative

  • Step 15: End of Step 11

  • Step 16: Return neutral
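The second form of Algorithm 4 can be sketched as below. The naive sentence splitter (splitting on '.', '!' and '?') and the function names are illustrative assumptions, and sentence_polarity compactly repeats the Algorithm 3 sketch so that the example is self-contained.

```python
import re

# Illustrative sketch of Algorithm 4: classify one document by comparing the numbers of
# positive and negative sentences. sentence_polarity repeats the Algorithm 3 sketch compactly.

def sentence_polarity(sentence, positive_rules, negative_rules):
    pos = sum(rule in sentence for rule in positive_rules)
    neg = sum(rule in sentence for rule in negative_rules)
    return "positive" if pos > neg else "negative" if pos < neg else "neutral"

def classify_document(document, positive_rules, negative_rules):
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]   # Step 1 (naive split)
    votes = [sentence_polarity(s, positive_rules, negative_rules)
             for s in sentences]                                          # Steps 2-10
    count_positive, count_negative = votes.count("positive"), votes.count("negative")
    if count_positive > count_negative:        # Steps 11-12
        return "positive"
    if count_positive < count_negative:        # Steps 13-14
        return "negative"
    return "neutral"                           # Step 16

print(classify_document(
    "The film is very good. The actor is very bad. The music is terrible.",
    ["very good", "very handsome", "excellent"],
    ["very bad", "terrible"]))
# -> negative (1 positive sentence vs 2 negative sentences)
```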

5 Experiment

To implement the proposed model, we used Microsoft SQL Server 2008 R2 to store the English data sets and the results of the emotion classification.

Microsoft Visual Studio 2010 (C#) was used to program the data handling and to implement our proposed model for classifying the 25,000 English documents of each of the testing data sets t1 and t2.

The experiments were conducted on a laptop with a dual-core Intel Core i5 processor at 2.6 GHz and 8 GB of memory, running Microsoft Windows 8.

We use the Accuracy (A) measure to evaluate the results of the emotion classification; a brief sketch of this computation is given below.
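A minimal sketch of this measure, under the usual definition of accuracy as the fraction of test documents whose predicted polarity matches the gold label, is given here; it is an illustration, not the paper's evaluation code.

```python
# Minimal sketch of the Accuracy (A) measure: the fraction of test documents whose
# predicted polarity matches the gold label.

def accuracy(predicted_labels, gold_labels):
    assert len(predicted_labels) == len(gold_labels)
    correct = sum(1 for p, g in zip(predicted_labels, gold_labels) if p == g)
    return correct / len(gold_labels)

# Example: 3 of 5 predictions are correct -> accuracy 0.6
print(accuracy(["positive", "negative", "neutral", "positive", "negative"],
               ["positive", "negative", "positive", "negative", "negative"]))
```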

The classification results for the 25,000 English documents of testing data set t1 are presented in Table 2 below in the “Appendix”.

Table 2 The results of the 25,000 English documents in testing data set t1

The classification results for the 25,000 English documents of testing data set t2 are presented in Table 3 below in the “Appendix”.

Table 3 The results of the 25,000 English documents in testing data set t2

The accuracy for the 25,000 English documents in testing data set t1 is shown in Table 4 below in the “Appendix”.

Table 4 The accuracy of our new model for the 25,000 English documents in testing data set t1

The accuracy for the 25,000 English documents in testing data set t2 is shown in Table 5 below in the “Appendix”.

Table 5 The accuracy of our new model for the 25,000 English documents in testing data set t2

We also compare our results with those of the related surveys in the “Appendix”.

6 Conclusion

Using our model, the classification of the 25,000 English documents of the t1 data set achieved an accuracy of 60.3%, and of the t2 data set 60.7%.

With the same English training data set, the classification results on different English testing data sets differ considerably. The classification results depend on the association rules of the positive rule group and the negative rule group, and these rules in turn depend on the algorithms and the English training data sets.

With the same English training data set, different algorithms generate very different association rules for the positive and negative rule groups; thus, the classification results also differ considerably.

With the same algorithm, different training data sets generate very different association rules for the positive and negative rule groups; thus, the classification results also differ considerably.

To increase the accuracy of the classification results significantly, we can increase the number of association rules in the positive and negative rule groups.

To increase the number of association rules in the positive and negative rule groups significantly, we can improve the algorithms, the English training data sets, or both.

Although our model’s accuracy is not high, it is a new contribution to English sentiment classification and to sentiment classification of other languages.

On the basis of the C4.5 algorithm, we built the CA-related algorithms that implement our new model.

This model has both benefits and drawbacks. The benefits are as follows: the document-level emotional analysis is based on the English sentences; the rules generated by the C4.5 algorithm are highly reliable; and such rules are used in many research works and commercial applications. The drawbacks are as follows: the accuracy of the model is low, even though rule-based sentiment classification often achieves high accuracy, and it takes a long time to generate the rules.

To demonstrate the scientific value of this research, we compare our model’s results with those of many studies in the tables below in the “Appendix”.

In Table 6 below, we compare our model’s results with many studies related to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012).

Table 6 Comparison of our model’s results with many studies related to the C4.5 algorithm in (Ruggieri 2002; Kretschmann et al. 2001; Quinlan 1996a, b; Xiaoliang et al. 2009, 2004; Korting 2006; Pan et al. 2003; Sornlertlamvanich et al. 2000; Rajeswari and Kannan 2008; Steven 1994; Mazid et al. 2016; Muniyandi et al. 2012)

In Table 7 below, we compare our model’s results with many studies related to the decision tree for sentiment classification in (Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Vinodhini and Chandrasekaran 2013, 2007, 2014; Kaur et al. 2015; Prasad et al. 2016, 2014; Sharma 2014).

Table 7 Comparison of our model’s results with many studies related to the decision tree for sentiment classification in (Mita 2011; Taboada et al. 2008; Nizamani et al. 2012; Wan et al. 2015; Winkler et al. 2015; Vinodhini and Chandrasekaran 2013, 2007, 2014; Kaur et al. 2015; Prasad et al. 2016, 2014; Sharma 2014)

In Table 8 below, we compare our model’s results with the latest sentiment classification studies in (2016, Kaur et al. 2016; Phu 2014; Tran et al. 2014).

Table 8 Comparison of our model with the latest sentiment classification models in (2016, Kaur et al. 2016; Phu and Tuoi 2014; Tran et al. 2014)

In Table 9 below, we compare our model’s results with the latest unsupervised classification works in (Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004).

Table 9 Comparison of our model with the latest unsupervised classification works in (Turney 2002; Lee et al. 2002; Zyl 2002; Le Hegarat-Mascle et al. 2002; Ferro-Famil and Pottier 2002; Chaovalit and Zhou 2005; Lee and Lewicki 2002; Gllavata et al. 2004)

We compare our model with many decision tree algorithms in (Friedl and Brodley 1997; Freund and Mason 1999; Payne et al. 1978; Chang 1977; Mehta et al. 1995) in Table 10.

Table 10 Comparison of our model with many decision tree algorithms in (Friedl and Brodley 1997; Freund and Mason 1999; Payne et al. 1978; Chang 1977; Mehta et al. 1995)