Abstract
The Akaike information criterion (AIC) is routinely used for model selection in best subset regression. The standard AIC, however, generally under-penalizes model complexity in the best subset regression setting, potentially leading to grossly overfit models. Recently, Zhang and Cavanaugh (Comput Stat 31(2):643–669, 2015) made significant progress towards addressing this problem by introducing an effective multistage model selection procedure. In this paper, we present a rigorous and coherent conceptual framework for extending AIC to best subset regression. A new model selection algorithm derived from our framework possesses well-understood and desirable asymptotic properties and consistently outperforms the procedure of Zhang and Cavanaugh in simulation studies. It provides an effective tool for combating the pervasive overfitting that detrimentally impacts best subset regression analysis, so that the selected models contain fewer irrelevant predictors and predict future observations more accurately.
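To make the setting concrete, the sketch below (not the authors' corrected procedure, and written purely as an illustration under the usual Gaussian linear-model assumptions; the data sizes and predictor names are arbitrary choices for the demo) enumerates every candidate subset of a small predictor pool and scores each with the standard AIC. It is this naive use of AIC across the combinatorially many candidate models that the abstract describes as prone to overfitting.

```python
# Naive AIC-scored best subset search (illustration only; not the paper's proposed method).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10                      # sample size and number of candidate predictors
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [1.5, -2.0]             # only the first two predictors are truly relevant
y = X @ beta + rng.normal(size=n)

def gaussian_aic(y, X_sub):
    """Standard AIC for a Gaussian linear model with an intercept.

    The parameter count includes the regression coefficients, the intercept,
    and the error variance."""
    n = len(y)
    design = np.column_stack([np.ones(n), X_sub]) if X_sub.shape[1] else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    k = design.shape[1] + 1        # mean parameters plus the variance
    return n * np.log(rss / n) + 2 * k

# Score every subset of every size and keep the overall AIC minimizer.
best = min(
    (subset for size in range(p + 1) for subset in itertools.combinations(range(p), size)),
    key=lambda s: gaussian_aic(y, X[:, list(s)]),
)
print("AIC-selected predictors:", best)
```

Because the minimum is taken over all \(2^{10}\) candidate models in this toy example, spuriously well-fitting subsets can win far more often than the nominal AIC penalty accounts for, which is the overfitting phenomenon the paper addresses.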
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Proceedings of the second international symposium on information theory, pp 267–281
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control AC–19:716–723
Bengtsson T, Cavanaugh JE (2006) An improved Akaike information criterion for state-space model selection. Comput Stat Data Anal 50(10):2635–2654
Bertsimas D, King A, Mazumder R (2016) Best subset selection via a modern optimization lens. Ann Stat 44(2):813–852
Efron B (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. J Am Stat Assoc 78(382):316–331
Efron B (1986) How biased is the apparent error rate of a prediction rule? J Am Stat Assoc 81(394):461–470
Fujikoshi Y (1983) A criterion for variable selection in multiple discriminant analysis. Hiroshima Math J 13:203–214
Hurvich CM, Tsai C-L (1989) Regression and time series model selection in small samples. Biometrika 76(2):297–307
Hurvich CM, Shumway R, Tsai C-L (1990) Improved estimators of Kullback–Leibler information for autoregressive model selection in small samples. Biometrika 77(4):709–719
Kitagawa G, Konishi S (2008) Information criteria and statistical modeling. Springer, New York
Liao J, McGee D (2003) Adjusted coefficients of determination for logistic regression. Am Stat 57:161–165
Schwarz GE (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shibata R (1976) Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika 63(1):117–126
Sugiura N (1978) Further analysts of the data by Akaike’s information criterion and the finite corrections. Commun Stat 7(1):13–26
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
White H (1982) Maximum likelihood estimation of misspecified models. Econometrica 50(1):1–25
Ye J (1998) On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc 93(441):120–131
Zhang T, Cavanaugh JE (2015) A multistage algorithm for best-subset model selection based on the Kullback–Leibler discrepancy. Comput Stat 31(2):643–669
Appendix: Proofs of the three Lemmas
Our proof is built on Sections 3.4.2 and 3.4.3 of the comprehensive book by Kitagawa and Konishi (2008). As such, we assume the standard regularity conditions on each regression model \(f_j \left( {y|\theta _k ,X_k } \right) \) stated in Section 3.3.5 of their book. These conditions guarantee the consistency of the maximum likelihood estimator \({\hat{\theta }} _k \) and an increasingly accurate quadratic approximation to the log-likelihood near \({\hat{\theta }} _k \). To save space, we refer the reader to their book for the detailed specifications of these conditions. We begin by stating some basic results on likelihood inference for adequately and inadequately specified models based on White’s (1982) work. For any given regression model \(f_j \left( {y|\theta _k ,X_k } \right) ,\) let \(\theta _k^0 \) be the value of \(\theta _k \) that minimizes the Kullback discrepancy between \(f_j \left( {y|\theta _k ,X_k } \right) \) and g.
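In the notation of this paper, and with the symbol \(d_k \) and the scaling by 2 being our own additions, this minimizer can be written as

\[ \theta _k^0 \equiv \mathop {\hbox {arg min}}\limits _{\theta _k } d_k \left( {\theta _k } \right) ,\quad d_k \left( {\theta _k } \right) \equiv E_g \left\{ {-2\log f_j \left( {Y|\theta _k ,X_k } \right) } \right\} , \]

where the expectation is taken over \(Y\sim g\). This form differs from twice the Kullback–Leibler divergence only by the term \(E_g \left\{ {2\log g\left( Y \right) } \right\} \), which does not depend on the candidate model, so both are minimized by the same \(\theta _k^0 \).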
Let \({\hat{\theta }} _k \) be the maximum likelihood estimator for this model based on the data y. It follows (White 1982) that \({\hat{\theta }} _k -\theta _k^0 \rightarrow 0\) and
as \(n\rightarrow \infty ,\) where the convergence is in probability. Furthermore, \(f_j \left( {y|\theta _k^0 ,X_k } \right) =g\) when \(f_j \left( {y|\theta _k ,X_k } \right) \) is adequately specified and the resulting Kullback discrepancy
is smaller than the Kullback discrepancy for any misspecified model.
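For later reference, these facts can be combined, under the regularity conditions above, into the following standard consequence; the constant \(c_{k,k'} \) is notation introduced here for the per-observation excess Kullback discrepancy of the misspecified model. For any adequately specified model k and any misspecified model \(k'\),

\[ \frac{2}{n}\left\{ {\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -\log f_{j'} \left( {y|{\hat{\theta }} _{k'} ,X_{k'} } \right) } \right\} \rightarrow c_{k,k'} >0 \]

in probability, so the gap between the maximized log-likelihoods of an adequately specified model and a misspecified model diverges to \(+\infty \) at rate n. A statement of this form is what is invoked below through Eqs. (4) and (5).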
Proof of Lemma 1
Equation (3) can be written as
Now consider the second term on the right of Eq. (6). Recall that \(\rho _k \left( y \right) \equiv 2\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -2\log g\left( y \right) \) and \(\rho _{\mathrm{max}} \left( y \right) \equiv \hbox {max}_{k\in M_j } \left\{ {\rho _k \left( y \right) } \right\} .\) By the definition of \({\hat{f}} _{j,Y}^{\mathrm{best}} \) in Sect. 3, we have
It follows from Equations (3.86) and (3.95) in Kitagawa and Konishi (2008) that
It further follows that
Combining with the first term on the right of Eq. (6), we have
Let \(\rho _{\mathrm{max}}^S \left( Y \right) \equiv \hbox {max}_{k\in S_j } \left\{ {\rho _k \left( Y \right) } \right\} \) and \(\rho _{\mathrm{max}}^T \left( Y \right) \equiv \hbox {max}_{k\in T_j } \left\{ {\rho _k \left( Y \right) } \right\} .\) We have that
Since there are only a finite number of models in \(M_j \), we have, from Eqs. (4) and (5), that \(\Pr \left( {\rho _{\mathrm{max}}^S \left( Y \right) <\rho _{\mathrm{max}}^T \left( Y \right) } \right) \rightarrow 0.\) Therefore,
Finally, \(2E\left\{ {\rho _k \left( y \right) } \right\} =2\times \left( {\hbox {dimension of }\theta } \right) +o\left( 1 \right) \) for \(k\in S_j \) because
as \(n\rightarrow \infty \) from standard asymptotic likelihood theory. The proof is complete. Note that the proof in Kitagawa and Konishi (2008) mostly deals with identically distributed \(y_1 ,\ldots ,y_n ,\) which can be mimicked in a regression setup by assuming that the predictor x of each individual subject is a random draw from a non-degenerate distribution.\(\square \)
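Two standard facts lie behind the displays in this proof; we record them here, with the notation \(d_k =\hbox {dim}\left( {\theta _k } \right) \) and the reading of Sect. 3 below being our own. First, because \({\hat{f}} _{j,Y}^{\mathrm{best}} \) is the fitted model with the largest likelihood among the candidate models in \(M_j \),

\[ 2\log {\hat{f}} _{j,Y}^{\mathrm{best}} \left( y \right) -2\log g\left( y \right) =\max _{k\in M_j } \rho _k \left( y \right) =\rho _{\mathrm{max}} \left( y \right) . \]

Second, for an adequately specified model \(k\in S_j \) we have \(f_j \left( {y|\theta _k^0 ,X_k } \right) =g,\) so the usual Wilks-type expansion gives

\[ \rho _k \left( y \right) =2\left\{ {\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -\log f_j \left( {y|\theta _k^0 ,X_k } \right) } \right\} \rightarrow \chi _{d_k }^2 \]

in distribution; the limiting mean \(d_k \) yields \(E\left\{ {\rho _k \left( y \right) } \right\} =d_k +o\left( 1 \right) \) and hence \(2E\left\{ {\rho _k \left( y \right) } \right\} =2d_k +o\left( 1 \right) \) as used in the final step.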
Proof of Lemma 2
(i) Based on Eqs. (4) and (5), the value of \(\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) \) for an adequately specified model exceeds, with asymptotic probability 1, the value of \(\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) \) for a misspecified model by any given amount as \(n\rightarrow \infty \). Therefore, any misspecified model has asymptotic probability 0 of being a self-consistent model. (ii) For any \(j\ge p_0 \), any fitted model \(f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) \) converges to the true model g as \({\hat{\theta }} _k \rightarrow \theta _k^0 \) if \(f_j \left( {y|\theta _k ,X_k } \right) \) is adequately specified. The result follows from the continuous dependence of \({ EO}\left( {{\hat{f}} _j^{\mathrm{best}} ,g} \right) \) on the distribution g. \(\square \)
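The divergence used in part (i) can be stated precisely as follows; the constant C and the indexing of the two models are notation introduced here. For any adequately specified model k and any misspecified model \(k'\) in \(M_j \),

\[ \Pr \left\{ {\log f_j \left( {y|{\hat{\theta }} _{k'} ,X_{k'} } \right) >\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -C} \right\} \rightarrow 0\quad \hbox {for every fixed }C>0, \]

which is the sense in which the adequately specified model’s maximized log-likelihood exceeds that of the misspecified model by any given amount with asymptotic probability 1.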
Proof of Lemma 3
Consider first the event that \({\hat{f}} _{p_0 ,y}^{\mathrm{best}} \) is self-consistent, which requires
for l in the range \(p_0 <l\le p,\) where \(O\left( {{\hat{f}} _l^{\mathrm{best}} ,g} \right) \) is the expected optimism under the true g. Let \(f_{p_0 } \left( {\cdot |\theta _1 ,X_1 } \right) \) be the only adequately specified model with \(p_0 \) predictors. It is easy to see from Eqs. (4) and (5) that, with asymptotic probability one, \({\hat{f}} _{p_0 ,y}^{\mathrm{best}} \) is simply \(f_{p_0 } \left( {\cdot |{\hat{\theta }} _1 ,X_1 } \right) .\) It then follows that
It is easy to see that
where \(\xi _{l,k} \left( y \right) \equiv 2\log f_l \left( {y|{\hat{\theta }} _k ,X_k } \right) -2\log f_{p_0 } \left( {y|{\hat{\theta }} _1 ,X_1 } \right) .\) Similar to the proof of Lemma 1, we have
It follows that
Based on Lemma 2, the probability that any \({\hat{f}} _{j,y}^{\mathrm{best}} \) is self-consistent goes to 0 for \(j<p_0 .\) We have
and Lemma 3 follows. \(\square \)
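The distributional facts driving these steps parallel those in Lemmas 1 and 2. Under the assumption, made here for concreteness, that an adequately specified model with \(l>p_0 \) predictors nests \(f_{p_0 } \left( {\cdot |\theta _1 ,X_1 } \right) \) and contributes \(l-p_0 \) additional free parameters, the nested likelihood-ratio limit

\[ \xi _{l,k} \left( y \right) =2\log f_l \left( {y|{\hat{\theta }} _k ,X_k } \right) -2\log f_{p_0 } \left( {y|{\hat{\theta }} _1 ,X_1 } \right) \rightarrow \chi _{l-p_0 }^2 \]

holds in distribution, while \(\xi _{l,k} \left( y \right) \rightarrow -\infty \) at rate n for any misspecified model with l predictors, by the same argument as in Lemma 2.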