Part of the book series: Advances in Computer Vision and Pattern Recognition (ACVPR)

Abstract

In this chapter, tree-based methods are discussed as another of the three major machine learning paradigms considered in the book. This includes the basic information-theoretic approach used to construct classification and regression trees, together with a few simple examples that illustrate the characteristics of decision tree models. This is followed by a short introduction to ensemble theory and ensembles of decision trees, leading to random forest models, which are discussed in detail. Unsupervised learning with random forests is reviewed in particular, as these characteristics are potentially important in unsupervised fault diagnostic systems. The interpretation of random forest models includes a discussion of the assessment of variable importance, as well as partial dependence analysis to examine the relationship between predictor variables and the response variable. A brief review of boosted trees follows that of random forests, including concepts such as gradient boosting and the AdaBoost algorithm. The use of tree-based ensemble models is illustrated by an example on rotogravure printing and the identification of defects in hot rolled steel plate.
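
As a concrete illustration of the ideas summarised above, the following is a minimal sketch, not the authors' code: it fits a random forest to synthetic data with scikit-learn, reads off variable importances, and traces a one-variable partial dependence profile by fixing one predictor at a grid of values and averaging the predictions. The data set, sizes and parameter values are assumptions made only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a process data set with M = 6 input variables
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)

# Ensemble of K = 200 trees; the number of variables m tried at each split
# defaults to sqrt(M) for classification
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based variable importances, one value per input variable
print("variable importances:", np.round(forest.feature_importances_, 3))

# Partial dependence of the predicted class-1 probability on variable j:
# fix variable j at each grid point b_j and average predictions over the data
j = 0
grid = np.linspace(X[:, j].min(), X[:, j].max(), num=10)
profile = []
for b in grid:
    Xb = X.copy()
    Xb[:, j] = b
    profile.append(forest.predict_proba(Xb)[:, 1].mean())
print("partial dependence on variable 0:", np.round(profile, 3))
```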

Notes

  1. Binary splitting is considered here; the extension to multiple splits is trivial. (A sketch of the binary split criterion is given after these notes.)

  2. The C4.5 algorithm (Quinlan 1993) scales the decrease in impurity for categorical input variables, since the cross-entropy impurity function is biased in favour of multilevel variables. The corrected impurity decrease is known as the gain ratio (sketched after these notes).

  3. See “The Elements of Statistical Learning” (Hastie et al. 2009) for details.

  4. Shi and Horvath (2006) focused on the clustering utility of random forest proximities, a subtle difference from general feature extraction applications. Here, clustering refers to the ability of a feature extraction method to generate projections in which known clusters are separated, without using cluster information during training. (A code sketch of this unsupervised use of random forests follows these notes.)
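
For note 1, the binary split criterion can be stated with the symbols defined in the nomenclature below; this is the standard CART formulation and is given here only as a reference sketch:

\[ \Delta i(\varsigma, \eta) = i(\eta) - p_L\, i(\eta_L) - p_R\, i(\eta_R), \qquad \varsigma^{*} = \arg\max_{\varsigma}\, \Delta i(\varsigma, \eta), \]

where, for classification, a common choice of impurity is the cross-entropy \( i(\eta) = -\sum_{k=1}^{C} p(k|\eta) \log p(k|\eta) \).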
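
For note 2, a hedged statement of the C4.5 gain ratio (the symbols A, V and \( N_v \) are introduced here for illustration and are not part of the chapter nomenclature): for a categorical variable A whose split sends \( N_v \) of the N samples at node η to branch v, \( v = 1, \ldots, V \),

\[ \mathrm{GainRatio}(A, \eta) = \frac{\Delta i(A, \eta)}{-\sum_{v=1}^{V} \frac{N_v}{N} \log_2 \frac{N_v}{N}}, \]

so the impurity decrease is divided by the split information of the partition, which grows with the number of levels V and thereby offsets the bias towards multilevel variables.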
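
For note 4, the following is a minimal Python sketch, not the authors' code, of unsupervised random forest feature extraction in the spirit of Breiman and Cutler (2003) and Shi and Horvath (2006); the data, sample sizes and scikit-learn calls are illustrative assumptions. The real data X are contrasted with a synthetic set X_0 drawn from the product of the marginal distributions of X, a proximity matrix S is computed from shared terminal nodes, and the dissimilarity D is embedded by multidimensional scaling to give scaling coordinates T.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Placeholder "real" data with two known clusters (illustrative only)
X = rng.normal(size=(200, 5))
X[:100] += 3.0

# Synthetic reference data X0: permute each column independently, which keeps
# the marginal distributions but destroys the dependence between variables
X0 = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

Z = np.vstack([X, X0])                          # concatenated matrix Z
y = np.r_[np.ones(len(X)), np.zeros(len(X0))]   # real (1) vs. synthetic (0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(Z, y)

# Proximity S[i, j]: fraction of trees in which real samples i and j land in
# the same terminal node (computed on all samples here, not out-of-bag only)
leaves = forest.apply(X)                        # shape (n_samples, n_trees)
S = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
D = np.sqrt(1.0 - S)                            # dissimilarity matrix D

# Scaling coordinates T; metric MDS is used here as a stand-in for classical scaling
T = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
print(T.shape)                                  # (200, 2): one 2-D point per real sample
```

With clearly grouped data such as this, the two known clusters in X should separate in T even though no cluster labels were used in training.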

References

  • Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9(7), 1545–1588.

  • Archer, K. J., & Kimes, R. V. (2008). Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4), 2249–2260.

  • Auret, L., & Aldrich, C. (2012). Interpretation of nonlinear relationships between process variables by use of random forests. Minerals Engineering, 35, 27–42.

  • Belson, W. A. (1959). Matching and prediction on the principle of biological classification. Journal of the Royal Statistical Society Series C (Applied Statistics), 8(2), 65–75.

  • Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  • Breiman, L., & Cutler, A. (2003). Manual on setting up, using, and understanding random forests v4.0. Available at: ftp://ftp.stat.berkeley.edu/pub/users/breiman/Using_random_forests_v4.0.pdf. Accessed 30 May 2008.

  • Breiman, L., Friedman, J. H., Olshen, R., & Stone, C. J. (1984). Classification and regression trees. Belmont: Wadsworth.

  • Cox, T. F., & Cox, M. A. A. (2001). Multidimensional scaling. Boca Raton: Chapman & Hall.

  • Cutler, A. (2009). Random forests. In useR! The R User Conference 2009. Available at: http://www.agrocampus-ouest.fr/math/useR-2009/

  • Cutler, A., & Stevens, J. R. (2006). Random forests for microarrays. In Methods in enzymology; DNA microarrays, Part B: Databases and statistics (pp. 422–432). San Diego: Academic Press.

  • Dietterich, T. G. (2000a). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2), 139–157.

  • Dietterich, T. (2000b). Ensemble methods in machine learning. In Multiple classifier systems (Lecture notes in computer science, pp. 1–15). Berlin/Heidelberg: Springer. Available at: http://dx.doi.org/10.1007/3-540-45014-9_1.

  • Evans, B., & Fisher, D. (1994). Overcoming process delays with decision tree induction. IEEE Expert, 9(1), 60–66.

  • Frank, A., & Asuncion, A. (2010). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Available at: http://archive.ics.uci.edu/ml

  • Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine learning: Proceedings of the Thirteenth International Conference (ICML’96) (pp. 148–156).

  • Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.

  • Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.

  • Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2), 337–374.

  • Gillo, M. W., & Shelly, M. W. (1974). Predictive modeling of multivariable and multivariate data. Journal of the American Statistical Association, 69(347), 646–653.

  • Hansen, L., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10), 993–1001.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning – Data mining, inference and prediction. New York: Springer.

  • Ho, T. K. (1995). Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (pp. 278–282). ICDAR1995. Montreal: IEEE Computer Society.

  • Izenman, A. (2008). Modern multivariate statistical techniques: Regression, classification, and manifold learning. New York/London: Springer.

  • Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society Series C (Applied Statistics), 29(2), 119–127.

  • Messenger, R., & Mandell, L. (1972). A modal search technique for predictive nominal scale multivariate analysis. Journal of the American Statistical Association, 67(340), 768–772.

  • Morgan, J. N., & Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58(302), 415–434.

  • Nicodemus, K. K., & Malley, J. D. (2009). Predictor correlation impacts machine learning algorithms: Implications for genomic studies. Bioinformatics, 25(15), 1884–1890.

  • Polikar, R. (2006). Ensemble based systems in decision making. Circuits and Systems Magazine, IEEE, 6(3), 21–45.

  • Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

  • Quinlan, R. (1993). C4.5: Programs for machine learning. Palo Alto: Morgan Kaufmann.

  • Ratsch, G., Onoda, T., & Muller, K. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287–320.

  • RuleQuest Research. (2011). Data mining tools See5 and C5.0. Information on See5/C5.0. Available at: http://www.rulequest.com/see5-info.html. Accessed 10 Feb 2011.

  • Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5), 401–409.

  • Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.

  • Schapire, R., Freund, Y., Bartlett, P., & Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1651–1686.

  • Shi, T., & Horvath, S. (2006). Unsupervised learning with random forest predictors. Journal of Computational and Graphical Statistics, 15(1), 118–138.

  • Strobl, C., Boulesteix, A., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307–317.

  • Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323–348.

  • Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.

Nomenclature

Symbol: Description

i(η): Impurity measure of node η in a classification tree
p(k|η): Proportion of samples in class k at node η in a classification tree
C: Number of classes in a classification problem
Δi(ς, η): Decrease in impurity for a candidate split position ς at node η
ς: Split point index in a decision tree
ς*: Optimal split point index in a decision tree
η: Node index variable
Input space
N: Sample size
p_R: Proportion of samples reporting to the right descendent node after splitting in a classification tree
p_L: Proportion of samples reporting to the left descendent node after splitting in a classification tree
c_R: Prediction in right descendent node of a regression tree after splitting
c_L: Prediction in left descendent node of a regression tree after splitting
η_R: Index of right descendent node
η_L: Index of left descendent node
X^k: kth bootstrap sample of learning data set X
T: Set of ensemble trees
t_k: kth tree in an ensemble of trees
K: Number of trees in an ensemble of classification or regression trees
t_k(): Prediction of kth tree in an ensemble of classification or regression trees
m: Number of variables considered at each split point in a random forest tree
M: Total number of input variables
\( \mathbf{X}_{\mathrm{OOB}(j)}^k \): Out-of-bag (OOB) input learning data for the kth tree in an ensemble of trees, with variable j permuted
\( \mathbf{y}_{\mathrm{OOB}}^k \): Out-of-bag (OOB) output learning data for the kth tree in an ensemble of trees
\( \omega_j(t_k) \): Importance measure for the jth variable in the kth tree of an ensemble of trees (random forest)
\( \omega_j \): Importance measure for the jth variable in an ensemble of trees (random forest)
\( \mathbf{X}_S \): Subset of variables in X
\( \mathbf{X}_C \): Subset of variables in X complementary to \( \mathbf{X}_S \)
\( X_{i,C} \): Values of samples in \( \mathbf{X}_C \)
\( \bar{f}(\mathbf{X}_S) \): Partial dependence of a predicted response on the subset of variables in \( \mathbf{X}_S \)
b_j: jth scalar calculation point
S: Proximity matrix
D: Dissimilarity matrix
g: Unknown data density
g_0: Reference distribution
X_0: Synthetic data set obtained by random sampling from the product of marginal distributions in X
Z: Concatenated matrix
T: Scaling coordinate features
β: K × 1 weighting vector of trees in a boosted tree ensemble
w: N × 1 weighting vector of samples in a boosted tree ensemble
\( \epsilon_k \): Ensemble error
\( \beta_k \): Weight of the kth tree in a boosted tree ensemble
W_k: Normalizing constant
\( F(\mathbf{x}) \): Output of an ensemble of boosted classification or regression trees
L(y, f(x)): Loss function of a classifier or regressor
g_k(x): Gradient at x at the kth iteration
ρ_k: Optimization search step size at the kth iteration
θ_k: Parameters of the kth model in an ensemble
i_w(η): Weighted cross-entropy of node η
\( Q(k|\eta) \): Sum of weights of samples in node η labelled as class k
W(η): Sum of all sample weights present in node η
\( \mathrm{X}'_{\mathrm{t}} \): Matrix of time series column vectors with mean-centred columns
\( \widehat{\mathbf{X}}_i \): ith of d
\( \tilde{\mathbf{X}}_i \): ith trajectory matrix
\( \rho_{p,q}^{(w)} \): Weighted or w-correlation between time series p and q
\( \rho_{\max}^{(L,K)} \): Maximum of the absolute value of the correlations between the rows and between the columns of a pair of trajectory matrices \( \tilde{\mathbf{X}}_i \) and \( \tilde{\mathbf{X}}_j \)
\( \mathcal{N}(a,b) \): Normal distribution with mean a and standard deviation b
\( \mathbf{u}(t) \): Input vector at time t
\( \mathbf{y}(t) \): Vector of measured variables at time t
\( \mathbf{v}(t) \): Gaussian noise with variance 0.01
\( \mathbf{w}(t) \): Gaussian noise with variance 0.1

Copyright information

© 2013 Springer-Verlag London

About this chapter

Cite this chapter

Aldrich, C., Auret, L. (2013). Tree-Based Methods. In: Unsupervised Process Monitoring and Fault Diagnosis with Machine Learning Methods. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-5185-2_5

  • DOI: https://doi.org/10.1007/978-1-4471-5185-2_5

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5184-5

  • Online ISBN: 978-1-4471-5185-2

  • eBook Packages: Computer Science, Computer Science (R0)
