
Extending AIC to best subset regression


Abstract

The Akaike information criterion (AIC) is routinely used for model selection in best subset regression. The standard AIC, however, generally under-penalizes model complexity in the best subset regression setting, potentially leading to grossly overfit models. Recently, Zhang and Cavanaugh (Comput Stat 31(2):643–669, 2015) made significant progress towards addressing this problem by introducing an effective multistage model selection procedure. In this paper, we present a rigorous and coherent conceptual framework for extending AIC to best subset regression. A new model selection algorithm derived from our framework possesses well understood and desirable asymptotic properties and consistently outperforms the procedure of Zhang and Cavanaugh in simulation studies. It provides an effective tool for combating the pervasive overfitting that detrimentally impacts best subset regression analysis so that the selected models contain fewer irrelevant predictors and predict future observations more accurately.
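
To make the setting concrete, the short Python sketch below (our illustration, not code from the paper; the simulated data and the helper gaussian_aic are our own assumptions) enumerates every subset of ten candidate predictors, scores each subset with the standard AIC, and reports the minimum-AIC subset. With only three truly relevant predictors, the winning subset often carries additional spurious ones, which is the overfitting behavior the paper targets.

```python
# A minimal sketch (not the authors' code): exhaustive best subset
# selection scored by the standard AIC on simulated Gaussian data.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, p0 = 100, 10, 3                      # sample size, candidate predictors, true predictors
X = rng.normal(size=(n, p))
beta = np.concatenate([np.ones(p0), np.zeros(p - p0)])
y = X @ beta + rng.normal(size=n)

def gaussian_aic(y, X_sub):
    """Standard AIC = -2 * maximized log-likelihood + 2 * (number of parameters)."""
    n_obs = len(y)
    design = np.column_stack([np.ones(n_obs), X_sub]) if X_sub.shape[1] else np.ones((n_obs, 1))
    resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    sigma2 = np.mean(resid ** 2)           # MLE of the error variance
    loglik = -0.5 * n_obs * (np.log(2 * np.pi * sigma2) + 1)
    k = design.shape[1] + 1                # regression coefficients plus the variance
    return -2 * loglik + 2 * k

best = min(
    (s for size in range(p + 1) for s in itertools.combinations(range(p), size)),
    key=lambda s: gaussian_aic(y, X[:, list(s)]),
)
print("AIC-selected predictors:", best)    # often includes indices >= p0 (spurious)
```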


References

  • Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Proceedings of the second international symposium on information theory, pp 267–281

  • Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control AC–19:716–723

  • Bengtsson T, Cavanaugh JE (2006) An improved Akaike information criterion for state-space model selection. Comput Stat Data Anal 50(10):2635–2654

  • Bertsimas D, King A, Mazumder R (2016) Best subset selection via a modern optimization lens. Ann Stat 44(2):813–852

  • Efron B (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. J Am Stat Assoc 78(382):316–331

  • Efron B (1986) How biased is the apparent error rate of a prediction rule? J Am Stat Assoc 81(394):461–470

  • Fujikoshi Y (1983) A criterion for variable selection in multiple discriminant analysis. Hiroshima Math J 13:203–214

  • Hurvich CM, Tsai C-L (1989) Regression and time series model selection in small samples. Biometrika 76(2):297–307

  • Hurvich CM, Shumway R, Tsai C-L (1990) Improved estimators of Kullback–Leibler information for autoregressive model selection in small samples. Biometrika 77(4):709–719

  • Kitagawa G, Konishi S (2008) Information criteria and statistical modeling. Springer, New York

  • Liao J, McGee D (2003) Adjusted coefficients of determination for logistic regression. Am Stat 57:161–165

  • Schwarz GE (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  • Shibata R (1976) Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika 63(1):117–126

  • Sugiura N (1978) Further analysts of the data by Akaike’s information criterion and the finite corrections. Commun Stat 7(1):13–26

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288

  • White H (1982) Maximum likelihood estimation of misspecified models. Econometrica 50(1):1–25

  • Ye J (1998) On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc 93(441):120–131

  • Zhang T, Cavanaugh JE (2015) A multistage algorithm for best-subset model selection based on the Kullback–Leibler discrepancy. Comput Stat 31(2):643–669

Author information

Corresponding author

Correspondence to J. G. Liao.

Appendix: Proofs of the three Lemmas

Our proof is built on Sections 3.4.2 and 3.4.3 of the comprehensive book by Kitagawa and Konishi (2008). As such, we assume the same standard regularity conditions on each regression model \(f_j \left( {y|\theta _k ,X_k } \right) \) given in Section 3.3.5 of their book. These conditions guarantee the consistency of the maximum likelihood estimator \({\hat{\theta }} _k \) and an increasingly accurate quadratic approximation to the log-likelihood near \({\hat{\theta }} _k \). To save space, we refer the reader to their book for the detailed specifications of these conditions. We start by stating some basic results on likelihood inference for adequately specified and misspecified models, based on White’s (1982) work. For any given regression model \(f_j \left( {y|\theta _k ,X_k } \right) ,\) let \(\theta _k^0 \) be the value of \(\theta _k \) that minimizes the Kullback discrepancy between \(f_j \left( {y|\theta _k ,X_k } \right) \) and g:

$$\begin{aligned} -\,2E_{Y\sim g} \left\{ {\log f_j \left( {Y|\theta _k ,X_k } \right) } \right\} . \end{aligned}$$

Let \({\hat{\theta }} _k \) be the maximum likelihood estimator of this model based on data y. It follows (White 1982) that \({\hat{\theta }} _k -\theta _k^0 \rightarrow 0\) and

$$\begin{aligned} n^{-1}\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -n^{-1}E_{Y\sim g} \left\{ {\log f_j \left( {Y|\theta _k^0 ,X_k } \right) } \right\} \rightarrow 0 \end{aligned}$$
(4)

as \(n\rightarrow \infty ,\) where the convergence is in probability. Furthermore, \(f_j \left( {y|\theta _k^0 ,X_k } \right) =g\) when \(f_j \left( {y|\theta _k ,X_k } \right) \) is adequately specified and the resulting Kullback discrepancy

$$\begin{aligned} -\,2E_{Y\sim g} \left\{ {\log f_j \left( {Y|\theta _k^0 ,X_k } \right) } \right\} =-2E_{Y\sim g} \left\{ {\log g\left( Y \right) } \right\} \end{aligned}$$
(5)

is smaller than the Kullback discrepancy for any misspecified model.
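
As an informal numerical illustration of Eqs. (4) and (5) (our own sketch, not part of the paper; the helpers pseudo_true_params and kullback_discrepancy are hypothetical names), the code below approximates the Kullback discrepancy \(-\,2E_{Y\sim g} \left\{ {\log f_j \left( {Y|\theta _k^0 ,X_k } \right) } \right\} \) by Monte Carlo for a Gaussian regression model that is adequately specified and for one that omits a relevant predictor. The misspecified model shows the larger discrepancy, which is the separation the proofs below rely on.

```python
# A minimal sketch (our illustration): Monte Carlo approximation of the
# Kullback discrepancy -2 E_{Y~g} log f_j(Y | theta_k^0, X_k) for an
# adequately specified and a misspecified Gaussian regression model.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, sigma = 200, 1.0
X_full = rng.normal(size=(n, 2))
mu = X_full @ np.array([1.0, 0.5])         # true mean: both predictors matter

def pseudo_true_params(X_k):
    """theta_k^0 for a Gaussian model: project the true mean onto X_k's span."""
    beta0 = np.linalg.lstsq(X_k, mu, rcond=None)[0]
    sigma0_sq = sigma ** 2 + np.mean((mu - X_k @ beta0) ** 2)   # KL-optimal variance
    return beta0, sigma0_sq

def kullback_discrepancy(X_k, n_mc=2000):
    """Monte Carlo estimate of -2 E_{Y~g} log f(Y | theta^0, X_k)."""
    beta0, sigma0_sq = pseudo_true_params(X_k)
    total = 0.0
    for _ in range(n_mc):
        y = mu + sigma * rng.normal(size=n)                     # Y ~ g
        total += norm.logpdf(y, X_k @ beta0, np.sqrt(sigma0_sq)).sum()
    return -2 * total / n_mc

print("adequately specified model:", kullback_discrepancy(X_full))        # both predictors
print("misspecified model        :", kullback_discrepancy(X_full[:, :1])) # drops one -> larger
```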

Proof of Lemma 1

Equation (3) can be written as

$$\begin{aligned} { EO}\left( {{\hat{f}} _j^{\mathrm{best}} ,g} \right) =&2E_Y \left[ {\hbox {log}{\hat{f}} _{j,Y}^{\mathrm{best}} \left( Y \right) -\hbox {log}g\left( Y \right) } \right] \nonumber \\&+2E_Y \left[ {E_Z \left\{ {\log g\left( Z \right) -\hbox {log}{\hat{f}} _{j,Y}^{\mathrm{best}} \left( Z \right) } \right\} } \right] . \end{aligned}$$
(6)

Consider first the second term on the right of Eq. (6). Recall that \(\rho _k \left( y \right) \equiv 2\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -2\log g\left( y \right) \) and \(\rho _{\mathrm{max}} \left( y \right) \equiv \hbox {max}_{k\in M_j } \left\{ {\rho _k \left( y \right) } \right\} .\) By the definition of \({\hat{f}} _{j,Y}^{\mathrm{best}} \) in Sect. 3, we have

$$\begin{aligned} 2\log g\left( Z \right) -2\log {\hat{f}} _{j,Y}^{\mathrm{best}} \left( Z \right) =2\sum \limits _{k\in M_j } \left\{ {\log g\left( Z \right) -\log f_{j,Y} \left( {Z|{\hat{\theta }} _k ,X_k } \right) } \right\} \hbox {Ind}\left( {\rho _k \left( Y \right) =\rho _{\mathrm{max}} \left( Y \right) } \right) . \end{aligned}$$

It follows from Equations (3.86) and (3.95) in Kitagawa and Konishi (2008) that

$$\begin{aligned} 2E_Z \left\{ {\log g\left( Z \right) -\log f_{j,Y} \left( {Z|{\hat{\theta }} _k ,X_k } \right) } \right\} =\rho _k \left( Y \right) +o_p \left( 1 \right) . \end{aligned}$$

It further follows that

$$\begin{aligned} 2E_Z \left\{ {\log g\left( Z \right) -\hbox {log}{\hat{f}} _{j,Y}^{\mathrm{best}} \left( Z \right) } \right\} =\rho _{\mathrm{max}} \left( Y \right) +o_p \left( 1 \right) . \end{aligned}$$

Combining with the first term on the right of Eq. (6), we have

$$\begin{aligned} { EO}\left( {{\hat{f}} _j^{\mathrm{best}} ,g} \right) =2E_Y \left\{ {\rho _{\mathrm{max}} \left( Y \right) } \right\} +o\left( 1 \right) . \end{aligned}$$

Let \(\rho _{\mathrm{max}}^S \left( Y \right) \equiv \hbox {max}_{k\in S_j } \left\{ {\rho _k \left( Y \right) } \right\} \) and \(\rho _{\mathrm{max}}^T \left( Y \right) \equiv \hbox {max}_{k\in T_j } \left\{ {\rho _k \left( Y \right) } \right\} .\) We have that

$$\begin{aligned} \rho _{\mathrm{max}} \left( Y \right) =&\rho _{\mathrm{max}}^S \left( Y \right) \hbox {Ind}\left( {\rho _{\mathrm{max}}^S \left( Y \right) \ge \rho _{\mathrm{max}}^T \left( Y \right) } \right) \\&+\,\rho _{\mathrm{max}}^T \left( Y \right) \hbox {Ind}\left( {\rho _{\mathrm{max}}^S \left( Y \right) <\rho _{\mathrm{max}}^T \left( Y \right) } \right) . \end{aligned}$$

Since there are only a finite number of models in \(M_j \), we have, from Eqs. (4) and (5), that \(\Pr \left( {\rho _{\mathrm{max}}^S \left( Y \right) <\rho _{\mathrm{max}}^T \left( Y \right) } \right) \rightarrow 0.\) Therefore,

$$\begin{aligned} { EO}\left( {{\hat{f}} _j^{\mathrm{best}} ,g} \right) =2E_Y \left\{ {\rho _{\mathrm{max}}^S \left( Y \right) } \right\} +o\left( 1 \right) . \end{aligned}$$

Finally, \(2E\left\{ {\rho _k \left( y \right) } \right\} =2\times \left( {\hbox {dimension of }\theta } \right) +o\left( 1 \right) \) for \(k\in S_j \) because

$$\begin{aligned} \rho _k \left( y \right) \sim \chi _{df=\hbox {dimension of }\theta }^2 +o_p \left( 1 \right) \end{aligned}$$

as \(n\rightarrow \infty \) from standard asymptotic likelihood theory. The proof is complete. Note that the proof in Kitagawa and Konishi (2008) mostly deals with identically distributed \(y_1 ,\ldots ,y_n ,\) which can be mimicked in a regression setup by assuming that the predictor x of each subject is a random draw from a non-degenerate distribution.\(\square \)
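
As an informal check of the chi-squared approximation used in this last step (our own simulation sketch, not from the paper), the code below repeatedly draws data from an adequately specified Gaussian regression model, computes \(\rho _k \left( y \right) =2\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -2\log g\left( y \right) \), and compares its Monte Carlo mean with the parameter dimension, here three regression coefficients plus the error variance.

```python
# A minimal sketch (our own check, not the authors' code): simulate
# rho_k(y) = 2 log f(y | theta_hat, X) - 2 log g(y) for an adequately
# specified Gaussian regression model and compare its mean with dim(theta).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, p_true, sigma = 200, 3, 1.0
X = rng.normal(size=(n, p_true))
beta = np.ones(p_true)

def rho(y):
    """2 * (maximized log-likelihood of the fitted model) - 2 * log g(y)."""
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = np.mean((y - X @ beta_hat) ** 2)             # MLE of the variance
    loglik_hat = norm.logpdf(y, X @ beta_hat, np.sqrt(sigma2_hat)).sum()
    loglik_g = norm.logpdf(y, X @ beta, sigma).sum()          # true density g
    return 2 * (loglik_hat - loglik_g)

draws = np.array([rho(X @ beta + sigma * rng.normal(size=n)) for _ in range(2000)])
print("mean of rho_k(Y)  :", draws.mean())   # close to dim(theta) = p_true + 1
print("dimension of theta:", p_true + 1)     # coefficients plus the error variance
```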

Proof of Lemma 2

(i) Based on Eqs. (4) and (5), the value of \(\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) \) for an adequately specified model exceeds, with asymptotic probability 1, the value of \(\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) \) for a misspecified model by any given amount as \(n\rightarrow \infty \). Therefore, any misspecified model has asymptotic probability 0 of being a self-consistent model. (ii) For any \(j\ge p_0 \), any fitted model \(f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) \) converges to the true model g as \({\hat{\theta }} _k \rightarrow \theta _k^0 \) if \(f_j \left( {y|\theta _k ,X_k } \right) \) is adequately specified. The result then follows from the continuous dependence of \({ EO}\left( {{\hat{f}} _j^{\mathrm{best}} ,g} \right) \) on the distribution g. \(\square \)

Proof of Lemma 3

Consider first the event that \({\hat{f}} _{p_0 ,y}^{\mathrm{best}} \) is self-consistent, which requires

$$\begin{aligned} 2\log {\hat{f}} _{l,y}^{\mathrm{best}} \left( y \right) -2\log {\hat{f}} _{p_0 ,y}^{\mathrm{best}} \left( y \right)\le & {} { EO}\left( {{\hat{f}} _l^{\mathrm{best}} ,g={\hat{f}} _{p_0 ,y}^{\mathrm{best}} } \right) -{ EO}\left( {{\hat{f}} _{p_0 }^{\mathrm{best}} ,g={\hat{f}} _{p_0 ,y}^{\mathrm{best}} } \right) \\= & {} { EO}\left( {{\hat{f}} _l^{\mathrm{best}} ,g} \right) -{ EO}\left( {{\hat{f}} _{p_0 }^{\mathrm{best}} ,g} \right) +o\left( 1 \right) , \end{aligned}$$

for all l in the range \(p_0 <l\le p,\) where \({ EO}\left( {{\hat{f}} _l^{\mathrm{best}} ,g} \right) \) is the expected optimism under the true g. Let \(f_{p_0 } \left( {\cdot |\theta _1 ,X_1 } \right) \) be the only adequately specified model with \(p_0 \) predictors. It is easy to see from Eqs. (4) and (5) that, with asymptotic probability one, \({\hat{f}} _{p_0 ,y}^{\mathrm{best}} \) is simply \(f_{p_0 } \left( {\cdot |{\hat{\theta }} _1 ,X_1 } \right) .\) It then follows that

$$\begin{aligned}&\Pr \left\{ {{\hat{f}} _{p_0 ,y}^{\mathrm{best}} \,\hbox {is self-consistent}} \right\} \\&\quad =\Pr \left\{ 2\log {\hat{f}} _{l,y}^{\mathrm{best}} \left( y \right) -2\log f_{p_0 } \left( {y|{\hat{\theta }} _1 ,X_1 } \right) \le { EO}\left( {{\hat{f}} _l^{\mathrm{best}} ,g} \right) \right. \\&\qquad \left. -\,{ EO}\left( {{\hat{f}} _{p_0 }^{\mathrm{best}} ,g} \right) \right\} +o\left( 1 \right) . \end{aligned}$$

It is easy to see that

$$\begin{aligned} 2\log {\hat{f}} _{l,y}^{\mathrm{best}} \left( y \right) -2\log f_{p_0 } \left( {y|{\hat{\theta }} _1 ,X_1 } \right) =\hbox {max}_{k\in M_l } \left\{ {\xi _{l,k} \left( y \right) } \right\} , \end{aligned}$$

where \(\xi _{l,k} \left( y \right) \equiv 2\log f_l \left( {y|{\hat{\theta }} _k ,X_k } \right) -2\log f_{p_0 } \left( {y|{\hat{\theta }} _1 ,X_1 } \right) .\) Similar to the proof of Lemma 1, we have

$$\begin{aligned} \hbox {max}_{k\in M_l } \left\{ {\xi _{l,k} \left( y \right) } \right\} =\hbox {max}_{k\in S_l } \left\{ {\xi _{l,k} \left( y \right) } \right\} +o_p \left( 1 \right) . \end{aligned}$$

It follows that

$$\begin{aligned} \Pr \left\{ {{\hat{f}} _{p_0 ,y}^{\mathrm{best}} \hbox { is self-consistent}} \right\}= & {} \Pr \left\{ \hbox {max}_{k\in S_l } \left\{ {\xi _{l,k} \left( y \right) } \right\} \le { EO}\left( {{\hat{f}} _l^{\mathrm{best}} ,g} \right) \right. \\&\quad \left. -\,{ EO}\left( {{\hat{f}} _{p_0 }^{\mathrm{best}} ,g} \right) \right\} +o\left( 1 \right) . \end{aligned}$$

Based on Lemma 2, the probability that \({\hat{f}} _{j,y}^{\mathrm{best}} \) is self-consistent goes to 0 for any \(j<p_0 .\) We therefore have

$$\begin{aligned} \Pr \left\{ {{\hat{f}} _{{\hat{j}} ,y}^{\mathrm{best}} \,\hbox {is a correct model}} \right\} =\Pr \left\{ {{\hat{f}} _{p_0 ,y}^{\mathrm{best}} \,\hbox {is self-consistent}} \right\} +o\left( 1 \right) \end{aligned}$$

and Lemma 3 follows. \(\square \)
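
The self-consistency condition at the heart of this proof can be mirrored numerically. The sketch below is our own schematic reading, not the authors' procedure from Sect. 3: the functions fit_best and eo are hypothetical names, and the parametric-bootstrap estimate of the expected optimism under the pseudo-truth \(g={\hat{f}} _{j,y}^{\mathrm{best}} \) is an assumption on our part. It fits the best subset of each size by maximized likelihood, estimates EO under the candidate model by simulation, and selects the smallest j for which the displayed inequality holds for every l > j.

```python
# A minimal sketch (our illustration, not the authors' exact algorithm):
# pick the smallest j whose best-fitting j-predictor model is
# "self-consistent", with EO estimated by parametric bootstrap.
import itertools
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, p, p0 = 100, 6, 2
X = rng.normal(size=(n, p))
y = X[:, :p0] @ np.ones(p0) + rng.normal(size=n)

def fit_best(y_vec, size):
    """Best-fitting subset of `size` predictors: (2*max loglik, fitted mean, sigma_hat)."""
    best = (-np.inf, None, None)
    for s in itertools.combinations(range(p), size):
        D = np.column_stack([np.ones(n), X[:, list(s)]])
        mu = D @ np.linalg.lstsq(D, y_vec, rcond=None)[0]
        sig = np.sqrt(np.mean((y_vec - mu) ** 2))
        ll2 = 2 * norm.logpdf(y_vec, mu, sig).sum()
        if ll2 > best[0]:
            best = (ll2, mu, sig)
    return best

def eo(size, mu_g, sig_g, n_mc=200):
    """Monte Carlo estimate of EO(f_hat_size^best, g) with g = N(mu_g, sig_g^2 I)."""
    total = 0.0
    for _ in range(n_mc):
        y_star = mu_g + sig_g * rng.normal(size=n)                  # Y ~ g
        ll2, mu_hat, sig_hat = fit_best(y_star, size)
        gain = ll2 - 2 * norm.logpdf(y_star, mu_g, sig_g).sum()     # 2[log f_hat(Y) - log g(Y)]
        # 2 E_Z{log g(Z) - log f_hat(Z)} has a closed form for Gaussians:
        loss = np.sum(np.log(sig_hat**2 / sig_g**2)
                      + (sig_g**2 + (mu_g - mu_hat)**2) / sig_hat**2 - 1)
        total += gain + loss
    return total / n_mc

# Smallest j whose best j-predictor fit is self-consistent against every l > j.
for j in range(1, p + 1):
    ll2_j, mu_j, sig_j = fit_best(y, j)
    eo_j = eo(j, mu_j, sig_j)
    if all(fit_best(y, l)[0] - ll2_j <= eo(l, mu_j, sig_j) - eo_j
           for l in range(j + 1, p + 1)):
        print("selected number of predictors:", j)
        break
```

In this Gaussian setting the inner expectation over Z is available in closed form, which keeps the bootstrap loop cheap; for other models it would itself require simulation.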


Cite this article

Liao, J.G., Cavanaugh, J.E. & McMurry, T.L. Extending AIC to best subset regression. Comput Stat 33, 787–806 (2018). https://doi.org/10.1007/s00180-018-0797-8
