A Bayesian Approach to Multicollinearity and the Simultaneous Selection and Clustering of Predictors in Linear Regression

Journal of Statistical Theory and Practice

Abstract

High correlation among predictors has long been an annoyance in regression analysis. The crux of the problem is that the linear regression model assumes each predictor has an independent effect on the response that can be encapsulated in the predictor’s regression coefficient. When predictors are highly correlated, the data do not contain much information on the independent effects of each predictor. The high correlation among predictors can result in large standard errors for the regression coefficients and coefficients with signs opposite of what is expected based on a priori, subject-matter theory. We propose a Bayesian model that accounts for correlation among the predictors by simultaneously performing selection and clustering of the predictors. Our model combines a Dirichlet process prior and a variable selection prior for the regression coefficients. In our model highly correlated predictors can be grouped together by setting their corresponding coefficients exactly equal. Similarly, redundant predictors can be removed from the model through the variable selection component of our prior. We demonstrate the competitiveness of our method through simulation studies and analysis of real data.
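The abstract does not give the prior's exact form, so the following is only a minimal sketch of how a Dirichlet process prior and a variable selection prior can be combined. It assumes a Pólya urn draw whose base measure mixes a point mass at zero with a normal "slab". The function names and the parameters `alpha`, `pi_zero`, and `slab_sd` are hypothetical and are not taken from the article. Coefficients that copy an earlier value are tied exactly (clustering), and coefficients equal to zero drop out of the model (selection).

```python
import random


def draw_coefficients(p, alpha=1.0, pi_zero=0.5, slab_sd=1.0, seed=0):
    """Draw p regression coefficients from a Polya urn scheme.

    Assumed (illustrative) base measure: with probability pi_zero a point
    mass at zero, otherwise a Normal(0, slab_sd^2) draw.
    """
    rng = random.Random(seed)
    betas = []
    for j in range(p):
        # With probability alpha / (alpha + j) draw a fresh value from the
        # base measure; otherwise copy one of the j earlier coefficients.
        if rng.random() < alpha / (alpha + j):
            if rng.random() < pi_zero:
                value = 0.0  # spike: predictor removed from the model
            else:
                value = rng.gauss(0.0, slab_sd)  # slab: new nonzero effect
        else:
            value = rng.choice(betas)  # tie: joins an existing cluster
        betas.append(value)
    return betas


def cluster_indices(betas):
    """Group predictor indices by their shared coefficient value."""
    clusters = {}
    for j, b in enumerate(betas):
        clusters.setdefault(b, []).append(j)
    return clusters


if __name__ == "__main__":
    betas = draw_coefficients(p=8, alpha=1.0, pi_zero=0.3, seed=42)
    for value, members in cluster_indices(betas).items():
        label = "excluded" if value == 0.0 else "cluster"
        print(f"{label}: beta = {value:.3f}, predictors {members}")
```

Under this sketch, a smaller `alpha` produces fewer distinct coefficient values and a larger `pi_zero` removes more predictors. A full analysis would also need a likelihood and a posterior sampler, which are outside what the abstract describes.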



Author information

Corresponding author

Correspondence to S. McKay Curtis.

About this article

Cite this article

McKay Curtis, S., Ghosh, S.K. A Bayesian Approach to Multicollinearity and the Simultaneous Selection and Clustering of Predictors in Linear Regression. J Stat Theory Pract 5, 715–735 (2011). https://doi.org/10.1080/15598608.2011.10483741

  • DOI: https://doi.org/10.1080/15598608.2011.10483741
