Abstract
In a Multiple Scanning discretization technique the entire attribute set is scanned many times. During every scan, the best cutpoint is selected for all attributes. The main objective of this paper is to compare the quality of two setups: the Multiple Scanning discretization technique combined with the C4.5 classification system and the internal discretization technique of C4.5. Our results show that the Multiple Scanning discretization technique is significantly better than the internal discretization used in C4.5 in terms of an error rate computed by ten-fold cross validation (two-tailed test, 5 % level of significance). Additionally, the Multiple Scanning discretization technique is significantly better than a variant of discretization based on conditional entropy introduced by Fayyad and Irani, called Dominant Attribute. At the same time, decision trees generated from data discretized by Multiple Scanning are significantly simpler than decision trees generated directly by C4.5 from the same data sets.
1 Introduction
Mining numerical data sets requires an additional step called discretization. Discretization is a process of transforming numerical values into intervals.
For a numerical attribute \(a\) with an interval \([i, j]\) as its range, a partition of the range into \(k\) intervals
\[\{[i_0, i_1), [i_1, i_2), \ldots, [i_{k-2}, i_{k-1}), [i_{k-1}, i_k]\},\]
where \(i_0 = i\), \(i_k = j\), and \(i_l < i_{l + 1}\) for \(l = 0, 1, \ldots, k-1\), defines a discretization of \(a\). The numbers \(i_1\), \(i_2\), ..., \(i_{k-1}\) are called cut-points.
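For illustration only (this sketch is not part of the paper's algorithms), discretizing a single numerical attribute given its cut-points amounts to mapping each value to the interval containing it. A minimal Python sketch, using hypothetical attribute values:

```python
import bisect

def discretize(values, cutpoints):
    """Map each numerical value to a label for the interval, defined by
    the sorted cut-points i_1 < ... < i_{k-1}, that contains it."""
    lo, hi = min(values), max(values)
    bounds = [lo] + sorted(cutpoints) + [hi]
    labels = []
    for v in values:
        # index of the first cut-point strictly greater than v
        k = bisect.bisect_right(cutpoints, v)
        if k == len(cutpoints):  # the last interval is closed on the right
            labels.append(f"[{bounds[-2]}, {bounds[-1]}]")
        else:
            labels.append(f"[{bounds[k]}, {bounds[k + 1]})")
    return labels

# Hypothetical values with cut-points 220 and 280, giving the intervals
# [180, 220), [220, 280), and [280, 280]
print(discretize([180, 220, 280, 180, 250], [220, 280]))
```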
A new discretization technique, called Multiple Scanning, introduced in [11, 12], was very successful when combined with rule induction and a classification system of LERS (Learning from Examples based on Rough Sets) [9]. The novelty of this paper is a comparison of the C4.5 classification system applied to data discretized using Multiple Scanning with C4.5 applied directly to the original data sets with numeric attributes. Additionally, we compare the Multiple Scanning discretization technique with a variant of the well-known discretization based on conditional entropy introduced by Fayyad and Irani [7, 8] and called Dominant Attribute [11, 12].
In Multiple Scanning, during every scan the entire attribute set is analyzed and the best cutpoint is selected for every attribute. At the end of a scan, subtables that still need discretization are created. The entire attribute set of any such subtable is scanned again, and the best corresponding cutpoints are selected. The process continues until the stopping condition is satisfied or the required number of scans is reached. If the required number of scans is reached and the stopping condition is not satisfied, discretization is completed by Dominant Attribute, in which first the best attribute is selected and then, for this attribute, the best cutpoint is selected, again using conditional entropy. This process continues recursively until the same stopping criterion is satisfied. Multiple Scanning ends with an attempt, called merging, to reduce the number of intervals. Since Multiple Scanning uses Dominant Attribute as the last resort, if we skip scanning, or equivalently set the required number of scans to zero, discretization is reduced to Dominant Attribute. Thus we may include a comparison of Multiple Scanning with Dominant Attribute. Typically, in Multiple Scanning the required number of scans should be set to some small number. In our experiments, for all data sets, the error rate computed using ten-fold cross validation was constant after six scans, because new intervals created in consecutive scans were merged together during the last step of discretization. The stopping criterion used in this paper is based on rough set theory.
The main objective of this paper is to compare the quality of two setups: the Multiple Scanning discretization technique combined with the C4.5 classification system and the internal discretization technique of C4.5. For 12 numerical data sets two sets of experiments were conducted: first, the C4.5 system of tree induction was used to compute an error rate using ten-fold cross validation; then the same data sets were discretized using Multiple Scanning, and for such discretized data sets the same C4.5 system was used to establish an error rate. Thus we may compare two discretization techniques: Multiple Scanning and the internal discretization of C4.5.
Our results show that the Multiple Scanning discretization technique is significantly better than the internal discretization used in C4.5 or the Dominant Attribute discretization in terms of an error rate computed by ten-fold cross validation (two-tailed test, 5 % level of significance). Additionally, decision trees generated from data discretized by Multiple Scanning are significantly simpler than decision trees generated directly by C4.5 from the same data sets.
2 Entropy Based Discretization
Discretization based on conditional entropy of the concept given the attribute is considered to be one of the most successful discretization techniques [2–8, 10, 11, 13–15, 19, 20].
An example of a data set with numerical attributes is presented in Table 1. In this table all cases are described by variables called attributes and one variable called a decision. The set of all attributes is denoted by \(A\). The decision is denoted by \(d\). The set of all cases is denoted by \(U\). In Table 1 the attributes are Max_Speed and Number_of_Seats while the decision is Price. Additionally, \(U\) = {1, 2, 3, 4, 5, 6, 7}. For a subset \(S\) of the set \(U\) of all cases, an entropy of a variable \(v\) (attribute or decision) with values \(v_1\), \(v_2\),..., \(v_n\) is defined by the following formula
\[H_S(v) = -\sum_{i=1}^{n} p(v_i) \cdot \log p(v_i),\]
where \(p(v_i)\) is the probability (relative frequency) of the value \(v_i\) in the set \(S\), \(i = 1, 2, \ldots, n\). All logarithms in this paper are binary.
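This entropy computation may be sketched in Python as follows (binary logarithms, as in the rest of the paper):

```python
from collections import Counter
from math import log2

def entropy(values):
    """H_S(v) = -sum over i of p(v_i) * log2 p(v_i),
    where p(v_i) is the relative frequency of v_i in the list."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

# A variable with two equally frequent values has entropy 1 bit
print(entropy(["a", "a", "b", "b"]))  # 1.0
```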
A conditional entropy of the decision \(d\) given an attribute \(a\) is
\[H_S(d\,|\,a) = -\sum_{j=1}^{m} p(a_j) \cdot \sum_{i=1}^{n} p(d_i\,|\,a_j) \cdot \log p(d_i\,|\,a_j),\]
where \(a_1, a_2, \ldots, a_m\) are all values of \(a\) and \(d_1, d_2, \ldots, d_n\) are all values of \(d\), with all values restricted to \(S\). There are two fundamental criteria of quality based on entropy. The first is the information gain associated with an attribute \(a\), defined by
\[Gain_S(a) = H_S(d) - H_S(d\,|\,a);\]
the second is the information gain ratio, for simplicity called gain ratio, defined by
\[GainRatio_S(a) = \frac{Gain_S(a)}{H_S(a)}.\]
Both criteria were introduced by J.R. Quinlan, see, e.g., [18] and used for decision tree generation.
Let \(a\) be an attribute and \(q\) be a cutpoint that splits the set \(S\) into two subsets, \(S_1\) and \(S_2\). The conditional entropy \(H_S(d|q)\) is defined as follows
\[H_S(d\,|\,q) = \frac{|S_1|}{|S|}\, H_{S_1}(d) + \frac{|S_2|}{|S|}\, H_{S_2}(d),\]
where \(|X|\) denotes the cardinality of the set \(X\). The cut-point \(q\) for which the conditional entropy \(H_S(d\,|\,q)\) has the smallest value is selected as the best cut-point. The corresponding information gain is the largest.
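Continuing the sketch (again with hypothetical data, not the data of Table 1), the best cut-point is the candidate \(q\) minimizing \(H_S(d\,|\,q)\):

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def best_cutpoint(attribute, decision, candidates):
    """Return the candidate cut-point q minimizing
    H_S(d|q) = |S1|/|S| * H_S1(d) + |S2|/|S| * H_S2(d),
    where S1 holds cases with attribute value < q and S2 the rest."""
    n = len(attribute)
    def h(q):
        s1 = [d for a, d in zip(attribute, decision) if a < q]
        s2 = [d for a, d in zip(attribute, decision) if a >= q]
        return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
    return min(candidates, key=h)

# Hypothetical attribute and decision columns; candidate cut-points 2 and 3.
# Splitting at 3 separates the decision values perfectly, so 3 is selected.
print(best_cutpoint([1, 2, 3, 4], ["lo", "lo", "hi", "hi"], [2, 3]))  # 3
```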
2.1 Stopping Criterion for Discretization
A stopping criterion of the process of discretization, described in this paper, is the level of consistency [3], based on rough set theory [16, 17]. For any subset \(B\) of the set \(A\) of all attributes, an indiscernibility relation \(IND(B)\) is defined, for any \(x, y \in U\), in the following way
\[(x, y) \in IND(B) \text{ if and only if } a(x) = a(y) \text{ for all } a \in B,\]
where \(a(x)\) denotes the value of the attribute \(a \in A\) for the case \(x \in U\). The relation \(IND(B)\) is an equivalence relation. The equivalence classes of \(IND(B)\) are denoted by \([x]_B\) and are called \(B\)-elementary sets. Any finite union of \(B\)-elementary sets is \(B\)-definable.
A partition on \(U\) constructed from all \(B\)-elementary sets of \(IND(B)\) is denoted by \(B^*\). {\(d\)}-elementary sets are called concepts, where \(d\) is a decision. For example, for Table 1, if \(B = \{Max\_Speed\}\), \(B^*\) = {{1, 6}, {2, 4, 5}, {3, 7}} and {\(d\)}\(^*\) = {{1}, {2, 3, 7}, {4, 6}, {5}}. In general, arbitrary \(X \in \{d\}^*\) is not \(B\)-definable. For example, the concept {2, 3, 7} is not \(B\)-definable. However, any \(X \in \{d\}^*\) may be approximated by a B-lower approximation of \(X\), denoted by \(\underline{B}X\) and defined as follows
\[\underline{B}X = \bigcup\, \{[x]_B \mid x \in U,\ [x]_B \subseteq X\},\]
and by the B-upper approximation of \(X\), denoted by \(\overline{B}X\) and defined as follows
\[\overline{B}X = \bigcup\, \{[x]_B \mid x \in U,\ [x]_B \cap X \neq \emptyset\}.\]
In our example, \(\underline{B}\{2, 3, 7\}\) = {3, 7} and \(\overline{B}\{2, 3, 7\}\) = {2, 3, 4, 5, 7}.
The \(B\)-lower approximation of \(X\) is the greatest \(B\)-definable set contained in \(X\). The \(B\)-upper approximation of \(X\) is the least \(B\)-definable set containing \(X\). A level of consistency [3], denoted by \(L(A)\), is defined as follows
\[L(A) = \frac{\sum_{X \in \{d\}^*} |\underline{A}X|}{|U|}.\]
Practically, the requested level of consistency for discretization is 1.0, i.e., we want the discretized data set to be consistent. For example, for Table 1, the level of consistency \(L(A)\) is equal to 1.0, since \(A^*\) = {{1}, {2}, {3}, {4}, {5}, {6}, {7}} and, for any \(X\) from {Price}\(^*\) = {{1}, {2, 3, 7}, {4, 6}, {5}}, we have \(\underline{A}X = X\). Additionally, \(L(B) \approx 0.286\).
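Using the partitions from the running example, the level of consistency can be computed directly from the definition. A minimal Python sketch reproducing \(L(B) \approx 0.286\) for \(B\) = {Max_Speed}:

```python
def level_of_consistency(partition, concepts):
    """L = (sum over concepts X of |lower approximation of X|) / |U|,
    where the lower approximation of X is the union of all blocks of
    the partition that are subsets of X."""
    universe_size = sum(len(block) for block in partition)
    lower_total = sum(
        len(block)
        for concept in concepts
        for block in partition
        if block <= concept  # block is a subset of the concept
    )
    return lower_total / universe_size

# B* and {Price}* from the example, for B = {Max_Speed}
b_star = [{1, 6}, {2, 4, 5}, {3, 7}]
concepts = [{1}, {2, 3, 7}, {4, 6}, {5}]
print(round(level_of_consistency(b_star, concepts), 3))  # 0.286
```

Only the block {3, 7} is contained in a concept ({2, 3, 7}), so the sum of lower-approximation cardinalities is 2, and \(L(B) = 2/7\).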
2.2 Multiple Scanning Strategy
This discretization technique needs a parameter denoted by \(t\) and called the total number of scans. The Multiple Scanning algorithm works as follows:

- for the entire set \(A\) of attributes, the best cutpoint is computed for each attribute \(a \in A\), based on the minimum of conditional entropy \(H_U(d\,|\,a)\); the new, partially discretized attribute set is \(A^D\), and the set \(U\) is partitioned into \((A^D)^*\),
- if the number \(t\) of scans is not reached, the next scan is conducted: the entire set of partially discretized attributes is scanned again; for each attribute only one cutpoint is needed, so the best cutpoint for each block \(X \in (A^D)^*\) is computed and the best cutpoint among all such blocks is selected,
- if the requested number \(t\) of scans is reached and the data set needs more discretization, the Dominant Attribute technique is used for the remaining subtables,
- the algorithm stops when \(L(A^D) = 1\), where \(A^D\) is the discretized set of attributes.
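A minimal sketch of a single scan in Python (one cutpoint per attribute, chosen by conditional entropy over the whole set \(U\)). The helper names and the table are illustrative; choosing midpoints between consecutive distinct values as candidate cutpoints is a common convention assumed here, and both merging and the Dominant Attribute fallback are omitted:

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def cond_entropy(column, decision, q):
    """H_U(d | q) for a binary split of a numerical column at cutpoint q."""
    n = len(column)
    s1 = [d for a, d in zip(column, decision) if a < q]
    s2 = [d for a, d in zip(column, decision) if a >= q]
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

def one_scan(table, decision):
    """For every attribute (a column of `table`), pick the single best
    cutpoint among midpoints between consecutive distinct values."""
    best = {}
    for name, column in table.items():
        values = sorted(set(column))
        candidates = [(u + v) / 2 for u, v in zip(values, values[1:])]
        if candidates:
            best[name] = min(candidates,
                             key=lambda q: cond_entropy(column, decision, q))
    return best

# Tiny hypothetical table with two numerical attributes
table = {"a1": [1, 1, 3, 3], "a2": [2, 4, 2, 4]}
print(one_scan(table, ["x", "x", "y", "y"]))  # {'a1': 2.0, 'a2': 3.0}
```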
We illustrate this technique by scanning Table 1 once, i.e., with \(t = 1\). First we search for the best cutpoint for both attributes, Max_Speed and Number_of_Seats. For the attribute Max_Speed there exist two potential cutpoints, 220 and 280, with three potential intervals: [180, 220), [220, 280), and [280, 280]. Comparing the corresponding conditional entropies, the better cutpoint is 280. Similarly, there are two potential cutpoints for the attribute Number_of_Seats, 4 and 5, with three potential intervals: [2, 4), [4, 5), and [5, 5]. Comparing the corresponding conditional entropies, the better cutpoint is 4. Table 1, partially discretized this way, is presented as Table 2.
The level of consistency for Table 2 is 0.429, since \(A^*\) = {{1}, {2, 3, 4, 7}, {5}, {6}} and we still need to distinguish cases 2, 3, and 7 from case 4. Therefore we need to use the Dominant Attribute technique for a subtable with four cases: 2, 3, 4, and 7. This data set is presented in Table 3.
3 Experiments
Our experiments were conducted on 12 data sets; with the exception of bankruptcy, all are available in the University of California at Irvine Machine Learning Repository. The bankruptcy data set is a well-known data set used by E.I. Altman to predict bankruptcy of companies [1].
Both discretization methods, Multiple Scanning and C4.5, were applied to all data sets, with the level of consistency equal to 100 %. For the choice of the best attribute, gain ratio was used.
Table 4 presents results of ten-fold cross validation for an increasing number of scans. Obviously, for any data set, after some fixed number of scans the error rate is stable (constant). For example, for the Australian data set, the error rate is 14.93 % for scans 4, 5, etc. Thus, any data set from Table 4 is characterized by two error rates: minimal and stable [12]. For a given data set, the smallest error rate in Table 4 is called minimal, and the last entry in the row corresponding to the data set is called stable. For example, for the Australian data set, the minimal error rate is 13.48 % and the stable error rate is 14.93 %. For some data sets (e.g., for bankruptcy), the minimal and stable error rates are identical.
Table 5 presents the size of decision trees generated from all 12 data sets discretized by Multiple Scanning. In Table 6, error rates are shown for decision trees generated directly by C4.5 and for decision trees generated by C4.5 from data sets discretized by Multiple Scanning; only the minimal error rates are presented, with the corresponding scan numbers. Finally, Table 7 presents tree sizes for decision trees generated directly by C4.5 and for decision trees generated by C4.5 from data sets discretized by Multiple Scanning.
It is clear from Tables 4–7 that the minimal error rate is never associated with 0 scans, i.e., with Dominant Attribute, the special case of the Multiple Scanning discretization technique. Using the Wilcoxon matched-pairs signed-ranks test, we conclude that the following three statements are statistically significant (with the significance level equal to 5 % for a two-tailed test):
- the minimal error rate associated with Multiple Scanning is smaller than the error rate associated with Dominant Attribute,
- the minimal error rate associated with Multiple Scanning is smaller than the error rate associated with C4.5,
- the size of decision trees generated from data discretized by Multiple Scanning is smaller than the size of decision trees generated directly by C4.5.
4 Conclusions
This paper presents results of experiments in which three different techniques were used for discretization: Multiple Scanning, the internal discretization of C4.5, and Dominant Attribute. All techniques were validated by conducting experiments on 12 data sets with numerical attributes. Our discretization techniques were combined with decision tree generation using the C4.5 system. Results of our experiments show that the Multiple Scanning technique is significantly better than discretization included in C4.5 and that decision trees generated from data discretized by Multiple Scanning are significantly simpler than decision trees generated directly by C4.5 from the same data sets (two-tailed test and 0.05 level of significance). Additionally, the Multiple Scanning discretization technique is significantly better than the Dominant Attribute technique. Thus, we show that there exists a new successful technique for discretization.
References
Altman, E.I.: Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J. Financ. 23(4), 189–209 (1968)
Blajdo, P., Grzymala-Busse, J.W., Hippe, Z.S., Knap, M., Mroczek, T., Piatek, L.: A comparison of six approaches to discretization—a rough set perspective. In: Wang, G., Li, T., Grzymala-Busse, J.W., Miao, D., Skowron, A., Yao, Y. (eds.) RSKT 2008. LNCS (LNAI), vol. 5009, pp. 31–38. Springer, Heidelberg (2008)
Chmielewski, M.R., Grzymala-Busse, J.W.: Global discretization of continuous attributes as preprocessing for machine learning. Int. J. Approximate Reasoning 15(4), 319–331 (1996)
Clarke, E.J., Barton, B.A.: Entropy and MDL discretization of continuous variables for bayesian belief networks. Int. J. Intell. Syst. 15, 61–92 (2000)
Elomaa, T., Rousu, J.: General and efficient multisplitting of numerical attributes. Mach. Learn. 36, 201–244 (1999)
Elomaa, T., Rousu, J.: Efficient multisplitting revisited: optima-preserving elimination of partition candidates. Data Min. Knowl. Disc. 8, 97–126 (2004)
Fayyad, U.M., Irani, K.B.: On the handling of continuous-valued attributes in decision tree generation. Mach. Learn. 8, 87–102 (1992)
Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence, pp. 1022–1027 (1993)
Grzymala-Busse, J.W.: A new version of the rule induction system LERS. Fundamenta Informaticae 31, 27–39 (1997)
Grzymala-Busse, J.W.: Discretization of numerical attributes. In: Kloesgen, W., Zytkow, J. (eds.) Handbook of Data Mining and Knowledge Discovery, pp. 218–225. Oxford University Press, New York (2002)
Grzymala-Busse, J.W.: A multiple scanning strategy for entropy based discretization. In: Proceedings of the 18th International Symposium on Methodologies for Intelligent Systems, pp. 25–34 (2009)
Grzymala-Busse, J.W.: Discretization based on entropy and multiple scanning. Entropy 15, 1486–1502 (2013)
Kerber, R.: Chimerge: discretization of numeric attributes. In: Proceedings of the 10-th National Conference on AI, pp. 123–128 (1992)
Kohavi, R., Sahami, M.: Error-based and entropy-based discretization of continuous features. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 114–119 (1996)
Nguyen, H.S., Nguyen, S.H.: Discretization methods in data mining. In: Polkowski, L., Skowron, A. (eds.) Rough Sets in Knowledge Discovery 1: Methodology and Applications, pp. 451–482. Physica-Verlag, Heidelberg (1998)
Pawlak, Z.: Rough sets. Int. J. Comput. Inform. Sci. 11, 341–356 (1982)
Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht (1991)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993)
Stefanowski, J.: Handling continuous attributes in discovery of strong decision rules. In: Polkowski, L., Skowron, A. (eds.) RSCTC 1998. LNCS (LNAI), vol. 1424, pp. 394–401. Springer, Heidelberg (1998)
Stefanowski, J.: Algorithms of Decision Rule Induction in Data Mining. Poznan University of Technology Press, Poznan (2001)
© 2015 Springer International Publishing Switzerland
Grzymala-Busse, J.W., Mroczek, T. (2015). A Comparison of Two Approaches to Discretization: Multiple Scanning and C4.5. In: Kryszkiewicz, M., Bandyopadhyay, S., Rybinski, H., Pal, S. (eds.) Pattern Recognition and Machine Intelligence. PReMI 2015. LNCS, vol. 9124. Springer, Cham. https://doi.org/10.1007/978-3-319-19941-2_5