Tree-structured modelling of categorical predictors in generalized additive regression

  • Regular Article
  • Published in Advances in Data Analysis and Classification

Abstract

Generalized linear and additive models are very efficient regression tools, but many parameters have to be estimated if categorical predictors with many categories are included. The method proposed here focuses on the main effects of categorical predictors and uses tree-type methods to obtain clusters of categories. When a predictor has many categories, one wants to know in particular which categories have to be distinguished with respect to their effect on the response. The tree-structured approach makes it possible to detect clusters of categories that share the same effect while letting other predictors, in particular metric predictors, have a linear or additive effect on the response. A fitting algorithm is proposed and various stopping criteria are evaluated; the preferred stopping criterion is based on p values obtained from a conditional inference procedure. In addition, the stability of the clusters and the relevance of the predictors are investigated by bootstrap methods. Several applications show the usefulness of the tree-structured approach, and small simulation studies demonstrate that the fitting procedure works well.
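The abstract describes the procedure only at a high level. The sketch below is not the authors' implementation; it illustrates the core idea under simplifying assumptions: a Gaussian response fitted by ordinary least squares, a single categorical predictor whose categories are ordered by their estimated effects and then cut into two clusters, one metric predictor kept with a linear effect, and a plain F-test p value with a Bonferroni-style correction standing in for the conditional inference stopping criterion. All data and variable names are illustrative.

```python
# Minimal sketch: one splitting step of tree-type clustering of the categories
# of a nominal predictor, with a metric predictor kept as a linear effect.
# Gaussian response, ordinary least squares; NOT the authors' algorithm.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data: 6 categories, three of which share the same true effect,
# plus one metric predictor z with a linear effect (illustrative only).
n = 600
x_cat = rng.integers(0, 6, n)                       # categorical predictor
z = rng.normal(size=n)                              # metric predictor
true_effects = np.array([0.0, 0.0, 0.0, 1.5, 1.5, 3.0])
y = true_effects[x_cat] + 0.8 * z + rng.normal(scale=1.0, size=n)

def rss(X, y):
    """Residual sum of squares and number of parameters of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid, X.shape[1]

# Baseline model: intercept + linear effect of z, all categories fused.
X0 = np.column_stack([np.ones(n), z])
rss0, _ = rss(X0, y)

# Order the categories by their dummy-coded effect estimates (a common device
# for nominal splits), so that only ordered cuts have to be examined.
dummies = np.eye(6)[x_cat][:, 1:]                   # category 0 as reference
beta_full, *_ = np.linalg.lstsq(np.column_stack([X0, dummies]), y, rcond=None)
order = np.argsort(np.concatenate([[0.0], beta_full[2:]]))

# Evaluate every cut of the ordered categories as a candidate binary split.
best = None
for cut in range(1, 6):
    cluster = np.isin(x_cat, order[cut:]).astype(float)    # indicator of upper cluster
    rss1, p1 = rss(np.column_stack([X0, cluster]), y)
    F = (rss0 - rss1) / (rss1 / (n - p1))                   # 1-df F statistic
    pval = stats.f.sf(F, 1, n - p1)
    if best is None or pval < best[0]:
        best = (pval, cut)

# Crude stopping rule: accept the best split only if its Bonferroni-corrected
# p value (5 candidate cuts) is below 0.05.
pval, cut = best
if pval < 0.05 / 5:
    lower, upper = sorted(order[:cut].tolist()), sorted(order[cut:].tolist())
    print(f"split accepted: categories {lower} vs {upper}, p = {pval:.2e}")
else:
    print("no significant split: all categories remain in one cluster")
```

In the approach described in the abstract, such a search would be applied recursively within the resulting clusters, and across several categorical predictors, with a split accepted only if its multiplicity-corrected p value falls below the significance threshold; the one-step search above is meant only to make the splitting and stopping logic concrete.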


Author information

Corresponding author

Correspondence to Moritz Berger.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 502 KB)


About this article

Cite this article

Tutz, G., Berger, M. Tree-structured modelling of categorical predictors in generalized additive regression. Adv Data Anal Classif 12, 737–758 (2018). https://doi.org/10.1007/s11634-017-0298-6
