Tree-structured modelling of categorical predictors in generalized additive regression

Tutz, Gerhard; Berger, Moritz

doi:10.1007/s11634-017-0298-6

Regular Article
Published: 26 October 2017

Advances in Data Analysis and Classification, Volume 12, pages 737–758 (2018)

Abstract

Generalized linear and additive models are very efficient regression tools, but many parameters have to be estimated if categorical predictors with many categories are included. The method proposed here focusses on the main effects of categorical predictors by using tree-type methods to obtain clusters of categories. When a predictor has many categories, one wants to know in particular which of the categories have to be distinguished with respect to their effect on the response. The tree-structured approach makes it possible to detect clusters of categories that share the same effect while letting other predictors, in particular metric predictors, have a linear or additive effect on the response. An algorithm for the fitting is proposed and various stopping criteria are evaluated. The preferred stopping criterion is based on p-values representing a conditional inference procedure. In addition, the stability of clusters is investigated, and the relevance of predictors is assessed by bootstrap methods. Several applications show the usefulness of the tree-structured approach, and small simulation studies demonstrate that the fitting procedure works well.
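
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several simplifying assumptions: a Gaussian response fitted by least squares, a single ordinal predictor coded 0, ..., k-1, one metric covariate with a linear effect, and a simple within-cluster permutation test as the stopping rule. All function names (`rss`, `design`, `best_split`, `permute_within_clusters`, `tree_clusters`) and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def design(x, cat, splits):
    """Intercept, linear effect of x, and one step indicator I(cat > s) per split."""
    cols = [np.ones_like(x), x] + [(cat > s).astype(float) for s in splits]
    return np.column_stack(cols)

def best_split(x, cat, y, splits, n_cat):
    """Among the unused split points, return the one giving the smallest RSS."""
    candidates = [s for s in range(n_cat - 1) if s not in splits]
    scores = [rss(design(x, cat, splits + [s]), y) for s in candidates]
    i = int(np.argmin(scores))
    return candidates[i], scores[i]

def permute_within_clusters(cat, splits, rng):
    """Shuffle category labels inside each current cluster (assumed null scheme)."""
    cluster = np.zeros(len(cat), dtype=int)
    for s in splits:
        cluster += (cat > s).astype(int)
    permuted = cat.copy()
    for c in np.unique(cluster):
        idx = np.flatnonzero(cluster == c)
        permuted[idx] = rng.permutation(cat[idx])
    return permuted

def tree_clusters(x, cat, y, n_cat, alpha=0.05, n_perm=200, seed=0):
    """Greedily add splits while the permutation p-value stays at or below alpha."""
    rng = np.random.default_rng(seed)
    splits = []
    while len(splits) < n_cat - 1:
        base = rss(design(x, cat, splits), y)
        s, score = best_split(x, cat, y, splits, n_cat)
        observed = base - score  # RSS reduction achieved by the best new split
        exceed = 0
        for _ in range(n_perm):
            cat_p = permute_within_clusters(cat, splits, rng)
            _, score_p = best_split(x, cat_p, y, splits, n_cat)
            exceed += (base - score_p) >= observed
        p_value = (1 + exceed) / (1 + n_perm)
        if p_value > alpha:
            break
        splits.append(s)
    # Turn the selected split points into clusters of adjacent categories.
    bounds = sorted(splits)
    clusters, start = [], 0
    for b in bounds + [n_cat - 1]:
        clusters.append(list(range(start, b + 1)))
        start = b + 1
    return bounds, clusters

# Simulated data (assumption): categories 0-2 and 3-5 share the same effect.
rng = np.random.default_rng(1)
n, n_cat = 300, 6
cat = rng.integers(0, n_cat, size=n)
x = rng.normal(size=n)
effects = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 2.0])
y = 1.0 + 0.5 * x + effects[cat] + rng.normal(scale=0.3, size=n)

splits, clusters = tree_clusters(x, cat, y, n_cat)
print("split points:", splits)
print("clusters:", clusters)
```

In this sketch, each accepted split adds one step indicator to the linear predictor, so categories that are never separated by a split share the same effect while `x` keeps its linear effect throughout. The permutation scheme shuffles labels only within the current clusters, which leaves the already selected indicators unchanged; this is one plausible way to mimic a p-value-based stopping rule and should not be read as the conditional inference procedure used in the article.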

Author information

Authors and Affiliations

Ludwig-Maximilians-Universität München, Akademiestraße 1, 80799, Munich, Germany
Gerhard Tutz
Institut für Medizinische Biometrie, Informatik und Epidemiologie (IMBIE), Universitätsklinikum Bonn, Sigmund-Freud-Straße 25, 53105, Bonn, Germany
Moritz Berger

Corresponding author

Correspondence to Moritz Berger.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 502 KB)

About this article

Cite this article

Tutz, G., Berger, M. Tree-structured modelling of categorical predictors in generalized additive regression. Adv Data Anal Classif 12, 737–758 (2018). https://doi.org/10.1007/s11634-017-0298-6

Received: 01 August 2016
Revised: 14 September 2017
Accepted: 16 October 2017
Published: 26 October 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s11634-017-0298-6
