Abstract
In this chapter, tree-based methods are discussed as another of the three major machine learning paradigms considered in the book. This includes the basic information-theoretic approach used to construct classification and regression trees, with a few simple examples to illustrate the characteristics of decision tree models. This is followed by a short introduction to ensemble theory and ensembles of decision trees, leading to random forest models, which are discussed in detail. Unsupervised learning with random forests is reviewed in particular, as these characteristics are potentially important in unsupervised fault diagnostic systems. The interpretation of random forest models includes a discussion of the assessment of variable importance in the model, as well as partial dependence analysis to examine the relationship between predictor variables and the response variable. A brief review of boosted trees follows that of random forests, including a discussion of concepts such as gradient boosting and the AdaBoost algorithm. The use of tree-based ensemble models is illustrated by examples on rotogravure printing and on the identification of defects in hot-rolled steel plate.
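As an illustrative sketch of the workflow summarised above (not the authors' code or data), the following assumes scikit-learn; the synthetic data set and all parameter values are stand-ins for the chapter's rotogravure and steel plate examples:

```python
# A minimal sketch of the tree-ensemble workflow the chapter describes:
# random forest fitting, variable importance, partial dependence, and boosting.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import partial_dependence, permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for process data (e.g. plate-defect features).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: bagged trees, with max_features="sqrt" playing the role of
# the m randomly chosen split candidates per node in the chapter's notation.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)

# Variable importance: impurity-based and permutation-based measures.
print("Gini importance:", rf.feature_importances_)
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("Permutation importance:", perm.importances_mean)

# Partial dependence of the prediction on a single predictor variable.
pd_result = partial_dependence(rf, X_test, features=[0])
print("Partial dependence (variable 0):", pd_result["average"])

# Boosted trees: stagewise additive modelling via gradient boosting.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                random_state=0)
gb.fit(X_train, y_train)
print("Boosted-tree test accuracy:", gb.score(X_test, y_test))
```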
Notes
- 1. Binary splitting is considered here; the extension to multiple splits is trivial.
- 2. The C4.5 algorithm (Quinlan 1993) scales the decrease in impurity for categorical input variables, since the cross-entropy impurity function is biased in favour of multilevel variables. This corrected impurity decrease is known as the gain ratio (both quantities are written out in the equations after these notes).
- 3. See “The Elements of Statistical Learning” (Hastie et al. 2009) for details.
- 4. Shi and Horvath (2006) focused on the clustering utility of random forest proximities, a subtle difference from general feature extraction applications. Here, clustering refers to the ability of a feature extraction method to generate projections in which known clusters are separated, without using cluster information in training.
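For reference, the quantities behind Notes 1 and 2 can be written out as follows. This is a sketch using the symbols defined in the chapter Nomenclature; it is standard CART/C4.5 material (Breiman et al. 1984; Quinlan 1993) rather than a verbatim excerpt, and the split-information denominator is C4.5 terminology rather than the book's notation:

```latex
% Decrease in impurity for a candidate binary split \varsigma at node \eta
% (Note 1), with p_L and p_R the proportions of samples sent to the left and
% right descendant nodes:
\Delta i(\varsigma, \eta) = i(\eta) - p_L\, i(\eta_L) - p_R\, i(\eta_R),
\qquad
\varsigma^{*} = \arg\max_{\varsigma}\, \Delta i(\varsigma, \eta)

% Cross-entropy impurity over the C classes:
i(\eta) = -\sum_{k=1}^{C} p(k \mid \eta)\, \log p(k \mid \eta)

% C4.5 gain ratio (Note 2): the impurity decrease scaled by the entropy of
% the split itself, penalising splits on multilevel categorical variables:
\mathrm{GainRatio}(\varsigma, \eta)
  = \frac{\Delta i(\varsigma, \eta)}{-\,p_L \log p_L \;-\; p_R \log p_R}
```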
References
Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9(7), 1545–1588.
Archer, K. J., & Kimes, R. V. (2008). Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4), 2249–2260.
Auret, L., & Aldrich, C. (2012). Interpretation of nonlinear relationships between process variables by use of random forests. Minerals Engineering, 35, 27–42.
Belson, W. A. (1959). Matching and prediction on the principle of biological classification. Journal of the Royal Statistical Society Series C (Applied Statistics), 8(2), 65–75.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L., & Cutler, A. (2003). Manual on setting up, using, and understanding random forests v4.0. Available at: ftp://ftp.stat.berkeley.edu/pub/users/breiman/Using_random_forests_v4.0.pdf. Accessed 30 May 2008.
Breiman, L., Friedman, J. H., Olshen, R., & Stone, C. J. (1984). Classification and regression trees. Belmont: Wadsworth.
Cox, T. F., & Cox, M. A. A. (2001). Multidimensional scaling. Boca Raton: Chapman & Hall.
Cutler, A. (2009). Random forests. In useR! The R User Conference 2009. Available at: http://www.agrocampus-ouest.fr/math/useR-2009/
Cutler, A., & Stevens, J. R. (2006). Random forests for microarrays. In Methods in enzymology; DNA microarrays, Part B: Databases and statistics (pp. 422–432). San Diego: Academic Press.
Dietterich, T. G. (2000a). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2), 139–157.
Dietterich, T. G. (2000b). Ensemble methods in machine learning. In Multiple classifier systems (Lecture notes in computer science, pp. 1–15). Berlin/Heidelberg: Springer. Available at: http://dx.doi.org/10.1007/3-540-45014-9_1.
Evans, B., & Fisher, D. (1994). Overcoming process delays with decision tree induction. IEEE Expert, 9(1), 60–66.
Frank, A., & Asuncion, A. (2010). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Available at: http://archive.ics.uci.edu/ml
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine learning: Proceedings of the Thirteenth International Conference (ICML’96) (pp. 148–156).
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2), 337–374.
Gillo, M. W., & Shelly, M. W. (1974). Predictive modeling of multivariable and multivariate data. Journal of the American Statistical Association, 69(347), 646–653.
Hansen, L., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10), 993–1001.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning – Data mining, inference and prediction. New York: Springer.
Ho, T. K. (1995). Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR 1995) (pp. 278–282). Montreal: IEEE Computer Society.
Izenman, A. (2008). Modern multivariate statistical techniques: Regression, classification, and manifold learning. New York/London: Springer.
Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society Series C (Applied Statistics), 29(2), 119–127.
Messenger, R., & Mandell, L. (1972). A modal search technique for predictive nominal scale multivariate analysis. Journal of the American Statistical Association, 67(340), 768–772.
Morgan, J. N., & Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58(302), 415–434.
Nicodemus, K. K., & Malley, J. D. (2009). Predictor correlation impacts machine learning algorithms: Implications for genomic studies. Bioinformatics, 25(15), 1884–1890.
Polikar, R. (2006). Ensemble based systems in decision making. Circuits and Systems Magazine, IEEE, 6(3), 21–45.
Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann.
Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287–320.
RuleQuest Research. (2011). Data mining tools See5 and C5.0. Information on See5/C5.0. Available at: http://www.rulequest.com/see5-info.html. Accessed 10 Feb 2011.
Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5), 401–409.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Schapire, R., Freund, Y., Bartlett, P., & Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1651–1686.
Shi, T., & Horvath, S. (2006). Unsupervised learning with random forest predictors. Journal of Computational and Graphical Statistics, 15(1), 118–138.
Strobl, C., Boulesteix, A., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307–317.
Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323–348.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
Nomenclature
| Symbol | Description |
| --- | --- |
| \( i(\eta) \) | Impurity measure of node η in a classification tree |
| \( p(k\mid\eta) \) | Proportion of samples in class k at node η in a classification tree |
| \( C \) | Number of classes in a classification problem |
| \( \Delta i(\varsigma, \eta) \) | Decrease in impurity for a candidate split position ς at node η |
| \( \varsigma \) | Split point index in a decision tree |
| \( \varsigma^{*} \) | Optimal split point index in a decision tree |
| \( \eta \) | Node index variable |
|  | Input space |
| \( N \) | Sample size |
| \( p_R \) | Proportion of samples reporting to the right descendant node after splitting in a classification tree |
| \( p_L \) | Proportion of samples reporting to the left descendant node after splitting in a classification tree |
| \( c_R \) | Prediction in the right descendant node of a regression tree after splitting |
| \( c_L \) | Prediction in the left descendant node of a regression tree after splitting |
| \( \eta_R \) | Index of the right descendant node |
| \( \eta_L \) | Index of the left descendant node |
| \( \mathbf{X}^k \) | kth bootstrap sample of learning data set \( \mathbf{X} \) |
| \( T \) | Set of ensemble trees |
| \( t_k \) | kth tree in an ensemble of trees |
| \( K \) | Number of trees in an ensemble of classification or regression trees |
| \( t_k(\cdot) \) | Prediction of the kth tree in an ensemble of classification or regression trees |
| \( m \) | Number of variables considered at each split point in a random forest tree |
| \( M \) | Total number of input variables |
| \( \mathbf{X}_{\mathrm{OOB}(j)}^k \) | Out-of-bag (OOB) input learning data for the kth tree in an ensemble of trees, with variable j permuted |
| \( \mathbf{y}_{\mathrm{OOB}}^k \) | Out-of-bag (OOB) output learning data for the kth tree in an ensemble of trees |
| \( \omega_j(t_k) \) | Importance measure for the jth variable in the kth tree of an ensemble of trees (random forest) |
| \( \omega_j \) | Importance measure for the jth variable in an ensemble of trees (random forest) |
| \( \mathbf{X}_S \) | Subset of variables in \( \mathbf{X} \) |
| \( \mathbf{X}_C \) | Subset of variables in \( \mathbf{X} \) complementary to \( \mathbf{X}_S \) |
| \( X_{i,C} \) | Values of samples in \( \mathbf{X}_C \) |
| \( \bar{f}(\mathbf{X}_S) \) | Partial dependence of a predicted response on the subset of variables in \( \mathbf{X}_S \) |
| \( b_j \) | jth scalar calculation point |
| \( \mathbf{S} \) | Proximity matrix |
| \( \mathbf{D} \) | Dissimilarity matrix |
| \( g \) | Unknown data density |
| \( g_0 \) | Reference distribution |
| \( \mathbf{X}_0 \) | Synthetic data set obtained by random sampling from the product of marginal distributions in \( \mathbf{X} \) |
| \( \mathbf{Z} \) | Concatenated matrix |
| \( \mathbf{T} \) | Scaling coordinate features |
| \( \boldsymbol{\beta} \) | K × 1 weighting vector of trees in a boosted tree ensemble |
| \( \mathbf{w} \) | N × 1 weighting vector of samples in a boosted tree ensemble |
| \( \epsilon_k \) | Ensemble error |
| \( \beta_k \) | Weight of the kth tree in a boosted tree ensemble |
| \( W_k \) | Normalizing constant |
| \( F(\mathbf{x}) \) | Output of an ensemble of boosted classification or regression trees |
| \( L(y, f(\mathbf{x})) \) | Loss function of a classifier or regressor |
| \( g_k(\mathbf{x}) \) | Gradient at \( \mathbf{x} \) at the kth iteration |
| \( \rho_k \) | Optimization search step size at the kth iteration |
| \( \theta_k \) | Parameters of the kth model in an ensemble |
| \( i_w(\eta) \) | Weighted cross-entropy of node η |
| \( Q(k\mid\eta) \) | Sum of weights of samples in node η labelled as class k |
| \( W(\eta) \) | Sum of all sample weights present in node η |
| \( \mathbf{X}'_t \) | Matrix of time series column vectors with mean-centred columns |
| \( \widehat{\mathbf{X}}_i \) | ith of d |
| \( \tilde{\mathbf{X}}_i \) | ith trajectory matrix |
| \( \rho_{p,q}^{(w)} \) | Weighted or w-correlation between time series p and q |
| \( \rho_{\max}^{(L,K)} \) | Maximum of the absolute value of the correlations between the rows and between the columns of a pair of trajectory matrices \( \tilde{\mathbf{X}}_i \) and \( \tilde{\mathbf{X}}_j \) |
| \( \mathcal{N}(a,b) \) | Normal distribution with mean a and standard deviation b |
| \( \mathbf{u}(t) \) | Input vector at time t |
| \( \mathbf{y}(t) \) | Vector of measured variables at time t |
| \( \mathbf{v}(t) \) | Gaussian noise with variance 0.01 |
| \( \mathbf{w}(t) \) | Gaussian noise with variance 0.1 |
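As a worked illustration of how several of these symbols combine, the following equations sketch the standard definitions behind the random forest and boosting entries above. They use the chapter's notation but are reconstructions of textbook forms (Breiman 2001; Freund & Schapire 1997; Friedman 2001), not verbatim excerpts; \( \mathbf{X}_{\mathrm{OOB}}^{k} \), the unpermuted OOB inputs, is an assumed auxiliary symbol:

```latex
% Permutation importance of variable j in tree t_k: increase in OOB loss when
% variable j is permuted, averaged over the K trees of the forest.
\omega_j(t_k) =
  L\bigl(\mathbf{y}_{\mathrm{OOB}}^{k},\, t_k(\mathbf{X}_{\mathrm{OOB}(j)}^{k})\bigr)
  - L\bigl(\mathbf{y}_{\mathrm{OOB}}^{k},\, t_k(\mathbf{X}_{\mathrm{OOB}}^{k})\bigr),
\qquad
\omega_j = \frac{1}{K} \sum_{k=1}^{K} \omega_j(t_k)

% Partial dependence of the prediction f on the variable subset X_S,
% averaging over the N observed values X_{i,C} of the complement set:
\bar{f}(\mathbf{X}_S) = \frac{1}{N} \sum_{i=1}^{N} f(\mathbf{X}_S, X_{i,C})

% Discrete AdaBoost: tree weight from the weighted ensemble error \epsilon_k,
% sample-weight update with normalizing constant W_k, and the ensemble output:
\beta_k = \tfrac{1}{2} \ln\frac{1 - \epsilon_k}{\epsilon_k},
\qquad
w_i \leftarrow \frac{w_i \exp\bigl(-\beta_k\, y_i\, t_k(\mathbf{x}_i)\bigr)}{W_k},
\qquad
F(\mathbf{x}) = \operatorname{sign}\!\Bigl(\sum_{k=1}^{K} \beta_k\, t_k(\mathbf{x})\Bigr)
```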
Copyright information
© 2013 Springer-Verlag London
About this chapter
Cite this chapter
Aldrich, C., Auret, L. (2013). Tree-Based Methods. In: Unsupervised Process Monitoring and Fault Diagnosis with Machine Learning Methods. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-5185-2_5
DOI: https://doi.org/10.1007/978-1-4471-5185-2_5
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5184-5
Online ISBN: 978-1-4471-5185-2
eBook Packages: Computer Science, Computer Science (R0)