Abstract
Generalized linear and additive models are efficient regression tools, but many parameters have to be estimated if categorical predictors with many categories are included. The method proposed here focuses on the main effects of categorical predictors and uses tree-type methods to obtain clusters of categories. When a predictor has many categories, one wants to know in particular which categories have to be distinguished with respect to their effect on the response. The tree-structured approach makes it possible to detect clusters of categories that share the same effect while letting other predictors, in particular metric predictors, have a linear or additive effect on the response. An algorithm for fitting the model is proposed and various stopping criteria are evaluated. The preferred stopping criterion is based on p-values representing a conditional inference procedure. In addition, the stability of clusters is investigated, and the relevance of predictors is assessed by bootstrap methods. Several applications show the usefulness of the tree-structured approach, and small simulation studies demonstrate that the fitting procedure works well.
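The idea of clustering categories by recursive splitting while a metric predictor keeps a linear effect can be illustrated with a minimal sketch. This is not the authors' algorithm or the interface of any package. It assumes a Gaussian response, treats the categories as ordered by their integer codes, and replaces the p-value-based stopping criterion with a fixed number of splits. All function names and the simulated data are hypothetical.

```python
import numpy as np


def assign_labels(cat, clusters):
    """Map each observation's category code to the index of its cluster."""
    lookup = {c: k for k, members in enumerate(clusters) for c in members}
    return np.array([lookup[c] for c in cat.tolist()])


def rss_for_partition(y, x, labels):
    """RSS of a linear model with one intercept per cluster and a linear effect of x."""
    groups = np.unique(labels)
    dummies = (labels[:, None] == groups[None, :]).astype(float)
    design = np.column_stack([dummies, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def tree_cluster_categories(y, x, cat, n_splits):
    """Greedy binary splitting of ordered category codes (hypothetical helper).

    Assumption: a fixed number of splits stands in for a proper
    stopping criterion such as the p-value-based one.
    """
    clusters = [sorted(np.unique(cat).tolist())]  # start with a single cluster
    for _ in range(n_splits):
        best = None
        for i, members in enumerate(clusters):
            for cut in range(1, len(members)):
                candidate = (clusters[:i]
                             + [members[:cut], members[cut:]]
                             + clusters[i + 1:])
                rss = rss_for_partition(y, x, assign_labels(cat, candidate))
                if best is None or rss < best[0]:
                    best = (rss, candidate)
        if best is None:  # every cluster is a single category
            break
        clusters = best[1]
    return clusters


# Simulated data (assumption): six categories with only three distinct effects.
rng = np.random.default_rng(0)
cat = np.arange(600) % 6
x = rng.normal(size=600)
true_effect = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 5.0])
y = true_effect[cat] + 1.5 * x + rng.normal(scale=0.1, size=600)

clusters = tree_cluster_categories(y, x, cat, n_splits=2)
print(clusters)
```

In this sketch each split is chosen to give the largest reduction in the residual sum of squares, and the linear effect of `x` is re-estimated jointly with the cluster effects at every candidate split.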
Cite this article
Tutz, G., Berger, M. Tree-structured modelling of categorical predictors in generalized additive regression. Adv Data Anal Classif 12, 737–758 (2018). https://doi.org/10.1007/s11634-017-0298-6
DOI: https://doi.org/10.1007/s11634-017-0298-6
Keywords
- Categorical predictors
- Tree-structured clustering
- Recursive partitioning
- Partially linear tree-based regression