Comparing six shrinkage estimators with large sample theory and asymptotically optimal prediction intervals

Abstract

Consider the multiple linear regression model \(Y = \beta_1 + \beta_2 x_2 + \cdots + \beta_p x_p + e = \boldsymbol{x}^T \boldsymbol{\beta} + e\) with sample size n. This paper compares six shrinkage estimators: forward selection, lasso, partial least squares, principal components regression, lasso variable selection (ordinary least squares applied to the predictors selected by lasso), and ridge regression, using large sample theory and two new prediction intervals that are asymptotically optimal if the estimator \(\hat{\boldsymbol{\beta}}\) is a consistent estimator of \(\boldsymbol{\beta}\). Few prediction intervals have been developed for \(p > n\), and those are not asymptotically optimal. For fixed p, the large sample theory for variable selection estimators such as forward selection is new, and the theory shows that lasso variable selection is \(\sqrt{n}\) consistent under much milder conditions than lasso. This paper also simplifies the proofs of the large sample theory for lasso, ridge regression, and the elastic net.
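To make the comparison concrete, the following minimal sketch (in Python with scikit-learn, purely for illustration) fits several of the shrinkage estimators named in the abstract on simulated data and forms a crude residual-quantile prediction interval. The simulated model, tuning constants, and interval construction are assumptions made for this sketch; they are not the paper's estimators or its two new prediction intervals, and forward selection is omitted for brevity.

```python
# Illustrative sketch only (not the authors' code): fit several shrinkage
# estimators on simulated data, refit OLS on the lasso's active set
# ("lasso variable selection"), and form a crude residual-quantile PI.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.r_[np.ones(3), np.zeros(p - 3)]          # sparse true coefficients
y = 1.0 + X @ beta + rng.normal(scale=0.5, size=n)

fits = {
    "lasso": Lasso(alpha=0.1).fit(X, y),
    "ridge": Ridge(alpha=1.0).fit(X, y),
    "pls":   PLSRegression(n_components=3).fit(X, y),
    "pcr":   make_pipeline(PCA(n_components=3), LinearRegression()).fit(X, y),
}

# "Lasso variable selection": OLS refit on the lasso's nonzero coefficients.
active = np.flatnonzero(fits["lasso"].coef_)
ols_after_lasso = LinearRegression().fit(X[:, active], y)

# Crude ~95% prediction interval from empirical residual quantiles; the
# paper's asymptotically optimal PIs use corrected quantiles, not shown here.
model = fits["ridge"]
resid = y - model.predict(X)
lo, hi = np.quantile(resid, [0.025, 0.975])
x_new = rng.normal(size=(1, p))
yhat = model.predict(x_new)[0]
print(f"95% PI for the new case: [{yhat + lo:.2f}, {yhat + hi:.2f}]")
```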

Acknowledgements

The authors thank the Editor and two referees for their work.

Author information

Correspondence to David J. Olive.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Watagoda, L.C.R.P., Olive, D.J. Comparing six shrinkage estimators with large sample theory and asymptotically optimal prediction intervals. Stat Papers (2020). https://doi.org/10.1007/s00362-020-01193-1

Keywords

  • Forward selection
  • Lasso
  • Partial least squares
  • Principal components regression
  • Ridge regression