Abstract
The Akaike information criterion (AIC) is routinely used for model selection in best subset regression. The standard AIC, however, generally under-penalizes model complexity in the best subset regression setting, potentially leading to grossly overfit models. Recently, Zhang and Cavanaugh (Comput Stat 31(2):643–669, 2015) made significant progress towards addressing this problem by introducing an effective multistage model selection procedure. In this paper, we present a rigorous and coherent conceptual framework for extending AIC to best subset regression. A new model selection algorithm derived from our framework possesses well-understood and desirable asymptotic properties and consistently outperforms the procedure of Zhang and Cavanaugh in simulation studies. It provides an effective tool for combating the pervasive overfitting that detrimentally impacts best subset regression analysis, so that the selected models contain fewer irrelevant predictors and predict future observations more accurately.
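To make the setting concrete, the sketch below (not the authors' corrected procedure, and written purely as an illustration under the usual Gaussian linear-model assumptions; the data sizes and predictor names are arbitrary choices for the demo) enumerates every candidate subset of a small predictor pool and scores each with the standard AIC. It is this naive use of AIC across the combinatorially many candidate models that the abstract describes as prone to overfitting.

```python
# Naive AIC-scored best subset search (illustration only; not the paper's proposed method).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10                      # sample size and number of candidate predictors
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [1.5, -2.0]             # only the first two predictors are truly relevant
y = X @ beta + rng.normal(size=n)

def gaussian_aic(y, X_sub):
    """Standard AIC for a Gaussian linear model with an intercept.

    The parameter count includes the regression coefficients, the intercept,
    and the error variance."""
    n = len(y)
    design = np.column_stack([np.ones(n), X_sub]) if X_sub.shape[1] else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    k = design.shape[1] + 1        # mean parameters plus the variance
    return n * np.log(rss / n) + 2 * k

# Score every subset of every size and keep the overall AIC minimizer.
best = min(
    (subset for size in range(p + 1) for subset in itertools.combinations(range(p), size)),
    key=lambda s: gaussian_aic(y, X[:, list(s)]),
)
print("AIC-selected predictors:", best)
```

Because the minimum is taken over all \(2^{10}\) candidate models in this toy example, spuriously well-fitting subsets can win far more often than the nominal AIC penalty accounts for, which is the overfitting phenomenon the paper addresses.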
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Proceedings of the second international symposium on information theory, pp 267–281
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control AC–19:716–723
Bengtsson T, Cavanaugh JE (2006) An improved Akaike information criterion for state-space model selection. Comput Stat Data Anal 50(10):2635–2654
Bertsimas D, King A, Mazumder R (2016) Best subset selection via a modern optimization lens. Ann Stat 44(2):813–852
Efron B (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. J Am Stat Assoc 78(382):316–331
Efron B (1986) How biased is the apparent error rate of a prediction rule? J Am Stat Assoc 81(394):461–470
Fujikoshi Y (1983) A criterion for variable selection in multiple discriminant analysis. Hiroshima Math J 13:203–214
Hurvich CM, Tsai C-L (1989) Regression and time series model selection in small samples. Biometrika 76(2):297–307
Hurvich CM, Shumway R, Tsai C-L (1990) Improved estimators of Kullback–Leibler information for autoregressive model selection in small samples. Biometrika 77(4):709–719
Kitagawa G, Konishi S (2008) Information criteria and statistical modeling. Springer, New York
Liao J, McGee D (2003) Adjusted coefficients of determination for logistic regression. Am Stat 57:161–165
Schwarz GE (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shibata R (1976) Selection of the order of an autoregressive model by Akaike’s information criterion. Biometrika 63(1):117–126
Sugiura N (1978) Further analysts of the data by Akaike’s information criterion and the finite corrections. Commun Stat 7(1):13–26
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
White H (1982) Maximum likelihood estimation of misspecified models. Econometrica 50(1):1–25
Ye J (1998) On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc 93(441):120–131
Zhang T, Cavanaugh JE (2015) A multistage algorithm for best-subset model selection based on the Kullback–Leibler discrepancy. Comput Stat 31(2):643–669
Appendix: Proofs of the three Lemmas
Our proof is built on Sections 3.4.2 and 3.4.3 of the comprehensive book by Kitagawa and Konishi (2008). As such, we assume the standard regularity conditions on each regression model \(f_j \left( {y|\theta _k ,X_k } \right) \) stated in Section 3.3.5 of their book. These conditions guarantee the consistency of the maximum likelihood estimator \({\hat{\theta }} _k \) and an increasingly accurate quadratic approximation to the log-likelihood near \({\hat{\theta }} _k \). To save space, we refer the reader to their book for the detailed specifications of these conditions. We begin by stating some basic results on likelihood inference for adequately and inadequately specified models based on White’s (1982) work. For any given regression model \(f_j \left( {y|\theta _k ,X_k } \right) ,\) let \(\theta _k^0 \) be the value of \(\theta _k \) that minimizes the Kullback discrepancy between \(f_j \left( {y|\theta _k ,X_k } \right) \) and g.
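In the notation of this paper, and with the symbol \(d_k \) and the scaling by 2 being our own additions, this minimizer can be written as

\[ \theta _k^0 \equiv \mathop {\hbox {arg min}}\limits _{\theta _k } d_k \left( {\theta _k } \right) ,\quad d_k \left( {\theta _k } \right) \equiv E_g \left\{ {-2\log f_j \left( {Y|\theta _k ,X_k } \right) } \right\} , \]

where the expectation is taken over \(Y\sim g\). This form differs from twice the Kullback–Leibler divergence only by the term \(E_g \left\{ {2\log g\left( Y \right) } \right\} \), which does not depend on the candidate model, so both are minimized by the same \(\theta _k^0 \).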
Let \({\hat{\theta }} _k \) be the maximum likelihood estimator for this model based on the data y. It follows (White 1982) that \({\hat{\theta }} _k -\theta _k^0 \rightarrow 0\) and
as \(n\rightarrow \infty ,\) where the convergence is in probability. Furthermore, \(f_j \left( {y|\theta _k^0 ,X_k } \right) =g\) when \(f_j \left( {y|\theta _k ,X_k } \right) \) is adequately specified and the resulting Kullback discrepancy
is smaller than the Kullback discrepancy for any misspecified model.
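For later reference, these facts can be combined, under the regularity conditions above, into the following standard consequence; the constant \(c_{k,k'} \) is notation introduced here for the per-observation excess Kullback discrepancy of the misspecified model. For any adequately specified model k and any misspecified model \(k'\),

\[ \frac{2}{n}\left\{ {\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -\log f_{j'} \left( {y|{\hat{\theta }} _{k'} ,X_{k'} } \right) } \right\} \rightarrow c_{k,k'} >0 \]

in probability, so the gap between the maximized log-likelihoods of an adequately specified model and a misspecified model diverges to \(+\infty \) at rate n. A statement of this form is what is invoked below through Eqs. (4) and (5).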
Proof of Lemma 1
Equation (3) can be written as
Now consider the second term on the right of Eq. (6). Recall that \(\rho _k \left( y \right) \equiv 2\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -2\log g\left( y \right) \) and \(\rho _{\mathrm{max}} \left( y \right) \equiv \hbox {max}_{k\in M_j } \left\{ {\rho _k \left( y \right) } \right\} .\) By the definition of \({\hat{f}} _{j,Y}^{\mathrm{best}} \) in Sect. 3, we have
It follows from Equations (3.86) and (3.95) in Kitagawa and Konishi (2008) that
It further follows that
Combining with the first term on the right of Eq. (6), we have
Let \(\rho _{\mathrm{max}}^S \left( Y \right) \equiv \hbox {max}_{k\in S_j } \left\{ {\rho _k \left( Y \right) } \right\} \) and \(\rho _{\mathrm{max}}^T \left( Y \right) \equiv \hbox {max}_{k\in T_j } \left\{ {\rho _k \left( Y \right) } \right\} .\) We have that
Since there are only a finite number of models in \(M_j \), we have, from Eqs. (4) and (5), that \(\Pr \left( {\rho _{\mathrm{max}}^S \left( Y \right) <\rho _{\mathrm{max}}^T \left( Y \right) } \right) \rightarrow 0.\) Therefore,
Finally, \(2E\left\{ {\rho _k \left( y \right) } \right\} =2\times \left( {\hbox {dimension of }\theta } \right) +o\left( 1 \right) \) for \(k\in S_j \) because
as \(n\rightarrow \infty \) from standard asymptotic likelihood theory. The proof is complete. Note that the proof in Kitagawa and Konishi (2008) mostly deals with identically distributed \(y_1 ,\ldots ,y_n ,\) which can be mimicked in a regression setup by assuming that the predictor x of each individual subject is a random draw from a non-degenerate distribution.\(\square \)
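Two standard facts lie behind the displays in this proof; we record them here, with the notation \(d_k =\hbox {dim}\left( {\theta _k } \right) \) and the reading of Sect. 3 below being our own. First, because \({\hat{f}} _{j,Y}^{\mathrm{best}} \) is the fitted model with the largest likelihood among the candidate models in \(M_j \),

\[ 2\log {\hat{f}} _{j,Y}^{\mathrm{best}} \left( y \right) -2\log g\left( y \right) =\max _{k\in M_j } \rho _k \left( y \right) =\rho _{\mathrm{max}} \left( y \right) . \]

Second, for an adequately specified model \(k\in S_j \) we have \(f_j \left( {y|\theta _k^0 ,X_k } \right) =g,\) so the usual Wilks-type expansion gives

\[ \rho _k \left( y \right) =2\left\{ {\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -\log f_j \left( {y|\theta _k^0 ,X_k } \right) } \right\} \rightarrow \chi _{d_k }^2 \]

in distribution; the limiting mean \(d_k \) yields \(E\left\{ {\rho _k \left( y \right) } \right\} =d_k +o\left( 1 \right) \) and hence \(2E\left\{ {\rho _k \left( y \right) } \right\} =2d_k +o\left( 1 \right) \) as used in the final step.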
Proof of Lemma 2
(i) Based on Eqs. (4) and (5), the value of \(\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) \) for an adequately specified model exceeds, with asymptotic probability 1, the value of \(\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) \) for a misspecified model by any given amount as \(n\rightarrow \infty \). Therefore, any misspecified model has asymptotic probability 0 of being a self-consistent model. (ii) For any \(j\ge p_0 \), any fitted model \(f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) \) converges to the true model g as \({\hat{\theta }} _k \rightarrow \theta _k^0 \) if \(f_j \left( {y|\theta _k ,X_k } \right) \) is adequately specified. The result follows from the continuous dependence of \({ EO}\left( {{\hat{f}} _j^{\mathrm{best}} ,g} \right) \) on the distribution g. \(\square \)
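The divergence used in part (i) can be stated precisely as follows; the constant C and the indexing of the two models are notation introduced here. For any adequately specified model k and any misspecified model \(k'\) in \(M_j \),

\[ \Pr \left\{ {\log f_j \left( {y|{\hat{\theta }} _{k'} ,X_{k'} } \right) >\log f_j \left( {y|{\hat{\theta }} _k ,X_k } \right) -C} \right\} \rightarrow 0\quad \hbox {for every fixed }C>0, \]

which is the sense in which the adequately specified model’s maximized log-likelihood exceeds that of the misspecified model by any given amount with asymptotic probability 1.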
Proof of Lemma 3
Consider first the event that \({\hat{f}} _{p_0 ,y}^{\mathrm{best}} \) is self-consistent, which requires
for l in the range \(p_0 <l\le p,\) where \(O\left( {{\hat{f}} _l^{\mathrm{best}} ,g} \right) \) is the expected optimism under the true g. Let \(f_{p_0 } \left( {\cdot |\theta _1 ,X_1 } \right) \) be the only adequately specified model with \(p_0 \) predictors. It is easy to see from Eqs. (4) and (5) that, with asymptotic probability one, \({\hat{f}} _{p_0 ,y}^{\mathrm{best}} \) is simply \(f_{p_0 } \left( {\cdot |{\hat{\theta }} _1 ,X_1 } \right) .\) It then follows that
It is easy to see that
where \(\xi _{l,k} \left( y \right) \equiv 2\log f_l \left( {y|{\hat{\theta }} _k ,X_k } \right) -2\log f_{p_0 } \left( {y|{\hat{\theta }} _1 ,X_1 } \right) .\) Similar to the proof of Lemma 1, we have
It follows that
Based on Lemma 2, the probability that any \({\hat{f}} _{j,y}^{\mathrm{best}} \) is self-consistent goes to 0 for \(j<p_0 .\) We have
and Lemma 3 follows. \(\square \)
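The distributional facts driving these steps parallel those in Lemmas 1 and 2. Under the assumption, made here for concreteness, that an adequately specified model with \(l>p_0 \) predictors nests \(f_{p_0 } \left( {\cdot |\theta _1 ,X_1 } \right) \) and contributes \(l-p_0 \) additional free parameters, the nested likelihood-ratio limit

\[ \xi _{l,k} \left( y \right) =2\log f_l \left( {y|{\hat{\theta }} _k ,X_k } \right) -2\log f_{p_0 } \left( {y|{\hat{\theta }} _1 ,X_1 } \right) \rightarrow \chi _{l-p_0 }^2 \]

holds in distribution, while \(\xi _{l,k} \left( y \right) \rightarrow -\infty \) at rate n for any misspecified model with l predictors, by the same argument as in Lemma 2.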