Journal of Statistical Theory and Practice, Volume 5, Issue 4, pp 715–735

A Bayesian Approach to Multicollinearity and the Simultaneous Selection and Clustering of Predictors in Linear Regression

  • S. McKay Curtis
  • Sujit K. Ghosh


High correlation among predictors has long been an annoyance in regression analysis. The crux of the problem is that the linear regression model assumes each predictor has an independent effect on the response that can be encapsulated in the predictor’s regression coefficient. When predictors are highly correlated, the data do not contain much information on the independent effects of each predictor. The high correlation among predictors can result in large standard errors for the regression coefficients and coefficients with signs opposite of what is expected based on a priori, subject-matter theory. We propose a Bayesian model that accounts for correlation among the predictors by simultaneously performing selection and clustering of the predictors. Our model combines a Dirichlet process prior and a variable selection prior for the regression coefficients. In our model highly correlated predictors can be grouped together by setting their corresponding coefficients exactly equal. Similarly, redundant predictors can be removed from the model through the variable selection component of our prior. We demonstrate the competitiveness of our method through simulation studies and analysis of real data.
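
As an illustration of how a single prior can both cluster and select coefficients, the following is a minimal sketch and not the authors' specification, which the abstract does not give. It assumes a Pólya urn draw from a Dirichlet process whose base measure mixes a point mass at zero with a Gaussian component. The function names and the parameters `alpha`, `p_zero`, and `slab_sd` are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def draw_coefficients(p, alpha=1.0, p_zero=0.5, slab_sd=2.0, seed=0):
    """Draw p coefficients from a Polya urn whose base measure has a spike at zero.

    Assumed for illustration only: alpha > 0 is the concentration parameter,
    p_zero is the base-measure probability of an exact zero, and slab_sd is
    the standard deviation of the Gaussian part of the base measure.
    """
    rng = random.Random(seed)
    beta = []
    for i in range(p):
        # With probability alpha / (alpha + i), take a fresh base-measure draw.
        # For i == 0 this probability is 1, so the list is never empty below.
        if rng.random() < alpha / (alpha + i):
            # Base measure: exact zero with probability p_zero, else Gaussian.
            value = 0.0 if rng.random() < p_zero else rng.gauss(0.0, slab_sd)
        else:
            # Otherwise copy an earlier coefficient, which creates exact ties.
            value = rng.choice(beta)
        beta.append(value)
    return beta


def summarize(beta):
    """Split predictor indices into excluded ones and clusters of equal coefficients."""
    groups = defaultdict(list)
    for j, b in enumerate(beta):
        groups[b].append(j)
    excluded = groups.pop(0.0, [])  # coefficients that are exactly zero
    return excluded, list(groups.values())


beta = draw_coefficients(p=8)
excluded, clusters = summarize(beta)
print("coefficients:", [round(b, 3) for b in beta])
print("excluded predictors:", excluded)
print("clusters of equal coefficients:", clusters)
```

In a draw of this kind, predictors whose coefficients are exactly zero drop out of the regression, and predictors that share a nonzero value act as one group with a common effect. Posterior inference, which the paper carries out by stochastic search, is outside the scope of this sketch.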

AMS Subject Classification

62F15 62J07 62P25 


Keywords

Dirichlet process; Variable selection; Stochastic search





Copyright information

© Grace Scientific Publishing 2011

Authors and Affiliations

  1. Division of General Internal Medicine, University of Washington, Seattle, USA
  2. Department of Statistics, North Carolina State University, Raleigh, USA
