
Linear Regression via Elastic Net: Non-enumerative Leave-One-Out Verification of Feature Selection

Chapter in Clusters, Orders, and Trees: Methods and Applications

Part of the book series: Springer Optimization and Its Applications (SOIA, volume 92)


Abstract

The feature-selective non-quadratic Elastic Net criterion of regression estimation is completely determined by two numerical regularization parameters which penalize, respectively, the squared and absolute values of the regression coefficients under estimation. It is an inherent property of the minimum of the Elastic Net that the values of the regularization parameters completely determine a partition of the variable set into three subsets of negative, positive, and strictly zero values, so that the former two subsets and the latter subset are, respectively, associated with “informative” and “redundant” features. We propose in this paper to treat this partition as a secondary structural parameter to be verified by leave-one-out cross-validation. Once the partition is fixed, we show that there exists a non-enumerative method for computing the leave-one-out error, which makes it possible to evaluate the generalization ability of the model and to tune the structural parameters without repeated training.
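To make the role of the partition concrete, the following minimal sketch (not taken from the chapter; the coefficient vector is illustrative) shows how the three index subsets are read off a fitted Elastic Net coefficient vector. The chapter's point is that, once this partition is held fixed, the leave-one-out residuals can be computed in closed form without refitting.

import numpy as np

# Hypothetical fitted elastic-net coefficient vector (illustrative data only).
a_hat = np.array([0.8, 0.0, -0.3, 0.0, 1.2])

I_plus  = np.flatnonzero(a_hat > 0)    # "informative" features with positive coefficients
I_minus = np.flatnonzero(a_hat < 0)    # "informative" features with negative coefficients
I_zero  = np.flatnonzero(a_hat == 0)   # "redundant" features suppressed by the L1 penalty

print(I_minus, I_zero, I_plus)         # -> [2] [1 3] [0 4]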


Notes

  1.

    In [1], the denominators in (5) have the form \(1 +\lambda _{2}\) instead of \(1 +\lambda _{2}/N\). This is a consequence of the specific normalization of the training set \(\sum \nolimits _{j=1}^{N}\!x_{ij}^{2} = 1\) used there, in contrast to the commonly adopted normalization \((1/N)\sum \nolimits _{j=1}^{N}\!x_{ij}^{2} = 1\) accepted in this paper (2).

  2.

    http://en.wikipedia.org/wiki/Woodbury_matrix_identity.

References

  1. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. Ser. B 67(2), 301–320 (2005)

  2. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  3. Ye, G., Chen, Y., Xie, X.: Efficient variable selection in support vector machines via the alternating direction method of multipliers. J. Mach. Learn. Res. Proc. Track, 832–840 (2011)

  4. Wang, L., Zhu, J., Zou, H.: The doubly regularized support vector machine. Stat. Sinica 16, 589–615 (2006)

  5. Grosswindhager, S.: Using penalized logistic regression models for predicting the effects of advertising material (2009). http://publik.tuwien.ac.at/files/PubDat_179921.pdf

  6. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)

  7. Christensen, R.: Plane Answers to Complex Questions: The Theory of Linear Models, 3rd edn. Springer, New York (2010)

  8. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)


Author information


Corresponding author

Correspondence to Elena Chernousova.


Appendix

1.1 Proof of Theorem 1

Let us expand the brackets in (5):

$$\displaystyle\begin{array}{rcl} J_{\mathrm{EN}}(\mathbf{a}\vert \lambda _{1},\lambda _{2})& =& \lambda _{1}\|\mathbf{a}\|_{1} +\lambda _{2}\mathbf{a}^{T}\mathbf{a} - 2 \frac{\lambda _{2}} {N}\mathbf{a}^{T}\mathbf{X}^{T}\mathbf{y} {}\\ & & +\mathop{\underbrace{ \frac{\lambda _{2}} {N^{2}} \mathbf{y}^{T}\mathbf{X}\mathbf{X}^{T}\mathbf{y} + \mathbf{y}^{T}\mathbf{y}}}\limits _{\mathrm{con\!st}} - 2\mathbf{a}^{T}\mathbf{X}^{T}\mathbf{y} + \mathbf{a}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{a} \rightarrow \min (\mathbf{a}). {}\\ \end{array}$$

Summands not depending on a may be omitted from the optimization. Collecting the remaining summands gives:

$$\displaystyle{J_{\mathrm{EN}}(\mathbf{a}\vert \lambda _{1},\lambda _{2}) =\lambda _{1}\|\mathbf{a}\|_{1} + \mathbf{a}^{T}{\bigl (\mathbf{X}^{T}\mathbf{X} +\lambda _{2}\mathbf{I}\bigr )}\mathbf{a} - 2{\Bigl (1 + \frac{\lambda _{2}} {N}\Bigr )}\mathbf{a}^{T}\mathbf{X}^{T}\mathbf{y} \rightarrow \min (\mathbf{a}).}$$

Division of the last equality by the constant \((1 +\lambda _{2}/N)\) yields (8). The theorem is proven.
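The expansion above is easy to check numerically. The sketch below is my own code, with the form of criterion (5) reconstructed from the expansion as \(\lambda _{1}\|\mathbf{a}\|_{1} +\lambda _{2}\|\mathbf{a} -\mathbf{a}^{{\ast}}\|^{2} +\|\mathbf{y} -\mathbf{X}\mathbf{a}\|^{2}\) with \(\mathbf{a}^{{\ast}} = (1/N)\mathbf{X}^{T}\mathbf{y}\) (an assumption on my part); it verifies that the full and the collected criteria differ only by a summand that does not depend on a, so they share the same minimizer.

import numpy as np

rng = np.random.default_rng(0)
N, n = 20, 5
X = rng.standard_normal((N, n))       # rows are training entities
y = rng.standard_normal(N)
lam1, lam2 = 0.7, 1.3
a_star = X.T @ y / N                  # preliminary ridge-type estimate, EN case

def J_full(a):
    # reconstructed criterion (5): L1 term + proximity to a_star + residual sum of squares
    return lam1 * np.abs(a).sum() + lam2 * np.sum((a - a_star) ** 2) + np.sum((y - X @ a) ** 2)

def J_collected(a):
    # the collected form obtained after dropping the constant summands
    return (lam1 * np.abs(a).sum()
            + a @ (X.T @ X + lam2 * np.eye(n)) @ a
            - 2.0 * (1.0 + lam2 / N) * (a @ X.T @ y))

a1, a2 = rng.standard_normal(n), rng.standard_normal(n)
# The gap J_full - J_collected is the omitted constant, identical for any a.
print(np.isclose(J_full(a1) - J_collected(a1), J_full(a2) - J_collected(a2)))  # True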

1.2 Proof of Theorem 2

Differentiation of (11) with respect to the active regression coefficients \(a_{i}\), \(i\notin \hat{I}_{\lambda _{1},\lambda _{2}}^{0}\), leads to the equalities

$$\displaystyle\begin{array}{rcl} & & \frac{\partial } {\partial a_{i}}J_{\mathrm{EN}}{\bigl (a_{l},l\notin \hat{I}_{\lambda _{1},\lambda _{2}}^{0}\vert \lambda _{1},\lambda _{2}\bigr )} {}\\ & & \quad = 2\lambda _{2}(a_{i} - a_{i}^{{\ast}}) + \left (\begin{array}{l} \;\;\,\lambda _{1},\;i \in \hat{I}_{\lambda _{1},\lambda _{2}}^{+} \\ -\lambda _{1},\;i \in \hat{I}_{\lambda _{1},\lambda _{2}}^{-} \end{array} \right ) - 2\sum _{j=1}^{N}x_{ij}{\Bigl (y_{j} -\sum _{l\notin \hat{I}_{\lambda _{1},\lambda _{2}}^{0}}a_{l}x_{lj}\Bigr )} = 0,{}\\ \end{array}$$

which form a system of linear equations over \(i\notin \hat{I}_{\lambda _{1},\lambda _{2}}^{0}\):

$$\displaystyle{\lambda _{2}a_{i}+\sum _{l\notin \hat{I}_{\lambda _{1},\lambda _{2}}^{0}}{\Biggl (\sum _{j=1}^{N}x_{ij}x_{lj}\Biggr )}a_{l} =\sum _{j=1}^{N}x_{ij}y_{j}-\frac{\lambda _{1}} {2}\left (\begin{array}{l} \;\;\,1,\;i \in \hat{I}_{\lambda _{1},\lambda _{2}}^{+} \\ -1,\;i \in \hat{I}_{\lambda _{1},\lambda _{2}}^{-} \end{array} \right )+\lambda _{2}a_{i}^{{\ast}}.}$$

In accordance with notations (12)–(14), the matrix form of this system is exactly (16), and (13) is its solution. The theorem is proven.
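In computational terms, Theorem 2 says that once the sign partition is fixed, the active coefficients are obtained from a single ridge-like linear system, with no iterative L1 solver needed at this stage. A minimal sketch under this reading (function and argument names are mine; for EN the preliminary estimate is taken as \((1/N)\tilde{\mathbf{X}}^{T}\mathbf{y}\), for NEN it is zero):

import numpy as np

def active_coefficients(X_act, y, signs, lam1, lam2, naive=False):
    """Solve the linear system of Theorem 2 for the active (non-zero) coefficients.

    X_act : (N, n_act) matrix of the active features; signs : +1/-1 per active feature.
    """
    N, n_act = X_act.shape
    a_star = np.zeros(n_act) if naive else X_act.T @ y / N   # preliminary estimate: NEN vs EN
    A = X_act.T @ X_act + lam2 * np.eye(n_act)
    b = X_act.T @ y - 0.5 * lam1 * signs + lam2 * a_star
    return np.linalg.solve(A, b)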

1.3 Proof of Theorem 3

Let the feature set partitioning \({\bigl \{\hat{I}_{\lambda _{1},\lambda _{2}}^{-},\hat{I}_{\lambda _{1},\lambda _{2}}^{0},\hat{I}_{\lambda _{1},\lambda _{2}}^{+}\bigr \}}\) (9) at the minimum point of (5) be treated as fixed, and let the kth entity \((\mathbf{x}_{k},y_{k})\) be omitted from the training set (1). In terms of notations (4) and (2), this implies deletion of the kth element from the vector \(\mathbf{y} \in \mathbb{R}^{N}\) and of the kth row from the matrix \(\tilde{\mathbf{X}}_{\lambda _{1},\lambda _{2}}\) \((N \times \hat{ n}_{\lambda _{1},\lambda _{2}})\):

$$\displaystyle{\mathbf{y}^{(k)} \in \mathbb{R}^{N-1},\;\tilde{\mathbf{X}}_{\lambda _{ 1},\lambda _{2}}^{(k)}{\bigl ((N - 1) \times \hat{ n}_{\lambda _{ 1},\lambda _{2}}\bigr )}.}$$

The vector of preliminary estimates of the regression coefficients \(\mathbf{a}^{{\ast}} \in \mathbb{R}^{n}\) (12) occurs only in the Elastic Net (EN) training criterion (5) and equals zero, \(\mathbf{a}^{{\ast}} = \mathbf{0} \in \mathbb{R}^{n}\), in the naive Elastic Net (NEN) (3). Its subvector over the active features, recomputed after deletion of the kth entity, is:

$$\displaystyle{\tilde{\mathbf{a}}_{\lambda _{1},\lambda _{2}}^{{\ast}(k)} = \left \{\!\begin{array}{ll} \frac{1} {N - 1}\sum _{j=1,j\neq k}^{N}y_{ j}\tilde{\mathbf{x}}_{j,\lambda _{1},\lambda _{2}} = \frac{1} {N - 1}(\tilde{\mathbf{X}}_{\lambda _{1},\lambda _{2}}^{(k)})^{T}\mathbf{y}^{(k)} \in \mathbb{R}^{\hat{n}_{\lambda _{1},\lambda _{2}} },&\mathit{EN} \\ \mathbf{0} \in \mathbb{R}^{\hat{n}_{\lambda _{1},\lambda _{2}}}, &\mathit{NEN} \end{array} \right.}$$

Correspondingly, the solution (13) of the optimization problem (11) takes the form (the subscripts \((\lambda _{1},\lambda _{2})\) are omitted below):

$$\displaystyle{ \hat{\tilde{\mathbf{a}}}^{(k)} ={\bigl ( (\tilde{\mathbf{X}}^{(k)})^{T}\tilde{\mathbf{X}}^{(k)}+\lambda _{2}\tilde{\mathbf{I}}_{\hat{n}}\bigr )}^{-1}\!\left \{\!(\tilde{\mathbf{X}}^{(k)})^{T}\mathbf{y}^{(k)} -\!\frac{\lambda _{1}} {2}\tilde{\mathbf{e}} +\! \left [\!\begin{array}{ll} \lambda _{2}\tilde{\mathbf{a}}^{{\ast}(k)},&\mathit{EN} \\ \mathbf{0}, &\mathit{NEN} \end{array} \right ]\!\right \}. }$$
(23)

Notice here that

$$\displaystyle{ \left \{\begin{array}{l} (\tilde{\mathbf{X}}^{(k)})^{T}\tilde{\mathbf{X}}^{(k)} =\tilde{ \mathbf{X}}^{T}\tilde{\mathbf{X}} -\tilde{\mathbf{x}}_{k}\tilde{\mathbf{x}}_{k}^{T}, \\ (\tilde{\mathbf{X}}^{(k)})^{T}\mathbf{y}^{(k)} =\tilde{ \mathbf{X}}^{T}\mathbf{y} - y_{k}\tilde{\mathbf{x}}_{k}, \\ \tilde{\mathbf{a}}^{{\ast}(k)} = \frac{1} {N - 1}{\bigl [\tilde{\mathbf{X}}^{T}\mathbf{y} - y_{k}\tilde{\mathbf{x}}_{k}\bigr ]} = \frac{N} {N - 1}\tilde{\mathbf{a}}^{{\ast}}- \frac{1} {N - 1}y_{k}\tilde{\mathbf{x}}_{k} =\tilde{ \mathbf{a}}^{{\ast}}- \frac{1} {N - 1}{\bigl (y_{k}\tilde{\mathbf{x}}_{k} -\tilde{\mathbf{a}}^{{\ast}}\bigr )}. \end{array} \right. }$$
(24)
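These deletion identities are elementary rank-one updates; a quick numerical check on synthetic data (my own code, with Xt standing for the active-feature matrix \(\tilde{\mathbf{X}}\)):

import numpy as np

rng = np.random.default_rng(1)
N, n_act = 12, 4
Xt = rng.standard_normal((N, n_act))      # active-feature matrix (rows are entities)
y = rng.standard_normal(N)
k = 3
Xk, yk = np.delete(Xt, k, axis=0), np.delete(y, k)   # training set with the kth entity removed
xk = Xt[k]

print(np.allclose(Xk.T @ Xk, Xt.T @ Xt - np.outer(xk, xk)))            # first identity in (24)
print(np.allclose(Xk.T @ yk, Xt.T @ y - y[k] * xk))                    # second identity in (24)
a_star, a_star_k = Xt.T @ y / N, Xk.T @ yk / (N - 1)
print(np.allclose(a_star_k, a_star - (y[k] * xk - a_star) / (N - 1)))  # third identity in (24)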

Application of the Woodbury formula (see Footnote 2)

$$\displaystyle{(\mathbf{A} + \mathbf{B}\mathbf{C})^{-1} = \mathbf{A}^{-1} -\mathbf{A}^{-1}\mathbf{B}{\bigl (\mathbf{I} + \mathbf{C}\mathbf{A}^{-1}\mathbf{B}\bigr )}^{-1}\mathbf{C}\mathbf{A}^{-1}}$$

and (24) to (23) yields:

$$\displaystyle\begin{array}{rcl} & & \!\!\!\!\hat{\tilde{\mathbf{a}}}^{(k)} ={\Bigl (\mathop{\underbrace{ \tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{ 2}\tilde{\mathbf{I}}}}\limits _{\mathbf{A}} +\mathop{\underbrace{ (-\tilde{\mathbf{x}}_{k})}}\limits _{\mathbf{B}}\mathop{\underbrace{ \tilde{\mathbf{x}}_{k}^{T}}}\limits _{ \mathbf{C}}\Bigr )}^{-1} {}\\ & & \qquad \quad \!\!\!\! \times \left \{\tilde{\mathbf{X}}^{T}\mathbf{y} - y_{ k}\tilde{\mathbf{x}}_{k} -\frac{\lambda _{1}} {2}\tilde{\mathbf{e}} +\lambda _{2}\left [\begin{array}{ll} \tilde{\mathbf{a}}^{{\ast}}- \frac{1} {N - 1}{\bigl (y_{k}\tilde{\mathbf{x}}_{k} -\tilde{\mathbf{a}}^{{\ast}}\bigr )},&EN \\ \mathbf{0}, &NEN \end{array} \right ]\right \} {}\\ & & \qquad \!\!\!\!\! =\hat{\tilde{ \mathbf{a}}} + \frac{{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}\tilde{\mathbf{x}}_{k}^{T}\hat{\tilde{\mathbf{a}}}} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}} - \frac{y_{k}} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{ 2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{ k} {}\\ & & \!\!\!\!\!\quad \qquad - \frac{\lambda _{2}} {N - 1}\!\left [\!\begin{array}{ll} {\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\,+\,\lambda _{ 2}\tilde{\mathbf{I}}\bigr )}^{-1} + \frac{{\bigl (\tilde{\mathbf{X}}^{T}\!\tilde{\mathbf{X}} +\!\lambda _{ 2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{ k}\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\!\tilde{\mathbf{X}} +\!\lambda _{ 2}\tilde{\mathbf{I}}\bigr )}^{-1}} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}} {\bigl (y_{k}\tilde{\mathbf{x}}_{k} -\tilde{\mathbf{a}}^{{\ast}}\bigr )},&EN \\ \mathbf{0}, &NEN \end{array} \!\right ]. {}\\ \end{array}$$

Algebraic transformation of this expression with respect to the notation \(\hat{y}_{k} =\tilde{ \mathbf{x}}_{k}^{T}\hat{\tilde{\mathbf{a}}}\) (16) and \(\hat{y}_{k}^{(k)} =\tilde{ \mathbf{x}}_{k}^{T}\hat{\tilde{\mathbf{a}}}^{(k)}\) (18) leads to the equality

$$\displaystyle\begin{array}{rcl} \tilde{\mathbf{x}}_{k}^{T}\hat{\tilde{\mathbf{a}}}^{(k)}& =& \frac{\hat{y}_{k}} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}} - y_{k} \frac{\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}} {}\\ & & - \frac{\lambda _{2}} {N - 1}\left [\begin{array}{ll} \frac{\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}(y_{k}\tilde{\mathbf{x}}_{k} -\tilde{\mathbf{a}}^{{\ast}})} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}},&EN \\ \mathbf{0}, &NEN \end{array} \right ].{}\\ \end{array}$$

Thus, the leave-one-out residuals \(\hat{\delta }_{k}^{(k)}\) in (17) and (18) permit the representation

$$\displaystyle\begin{array}{rcl} \hat{\delta }_{k}^{(k)}& =& y_{k} -\tilde{\mathbf{x}}_{k}^{T}\hat{\tilde{\mathbf{a}}}^{(k)} {}\\ & =& y_{k} - \frac{\hat{y}_{k}} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}} + y_{k} \frac{\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}} {}\\ & & + \frac{\lambda _{2}} {N - 1}\left [\begin{array}{ll} \frac{\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}(y_{k}\tilde{\mathbf{x}}_{k} -\tilde{\mathbf{a}}^{{\ast}})} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}},&EN \\ \mathbf{0}, &NEN \end{array} \right ] {}\\ & =& \frac{y_{k} -\hat{ y}_{k}} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}} {}\\ & & + \frac{\lambda _{2}} {N - 1}\left [\begin{array}{ll} \frac{\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}(y_{k}\tilde{\mathbf{x}}_{k} -\tilde{\mathbf{a}}^{{\ast}})} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}},&EN \\ \mathbf{0}, &NEN \end{array} \right ] {}\\ & =& \frac{\delta _{k} + \frac{\lambda _{2}} {N - 1}\left [\begin{array}{ll} \tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}(y_{k}\tilde{\mathbf{x}}_{k} -\tilde{\mathbf{a}}^{{\ast}}),&EN \\ \mathbf{0}, &NEN \end{array} \right ]} {1 -\tilde{\mathbf{x}}_{k}^{T}{\bigl (\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} +\lambda _{2}\tilde{\mathbf{I}}\bigr )}^{-1}\tilde{\mathbf{x}}_{k}}.{}\\ \end{array}$$

Substitution of \(\hat{\delta }_{k}^{(k)}\) into (17), using notation (21), yields (19) and (20).

The theorem is proven.
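The practical content of Theorem 3 is that, with the partition held fixed, every leave-one-out residual can be obtained from the full-sample fit alone. The sketch below is my own code for the EN case on synthetic data (variable names are mine; the sign vector e plays the role of partition (9), and the fitting formulas are the reconstructed forms of (13) and (23) used above); it compares the closed-form residual from the last display with an explicit refit for every left-out entity.

import numpy as np

rng = np.random.default_rng(2)
N, n_act = 15, 3
Xt = rng.standard_normal((N, n_act))          # active-feature matrix
y = rng.standard_normal(N)
lam1, lam2 = 0.5, 2.0
e = np.array([1.0, -1.0, 1.0])                # fixed signs of the active coefficients
a_star = Xt.T @ y / N                         # preliminary estimate (12), EN case

A = Xt.T @ Xt + lam2 * np.eye(n_act)
a_hat = np.linalg.solve(A, Xt.T @ y - 0.5 * lam1 * e + lam2 * a_star)   # full-sample fit

for k in range(N):
    # explicit leave-one-out refit with the partition held fixed
    Xk, yk = np.delete(Xt, k, axis=0), np.delete(y, k)
    a_star_k = Xk.T @ yk / (N - 1)
    Ak = Xk.T @ Xk + lam2 * np.eye(n_act)
    a_hat_k = np.linalg.solve(Ak, Xk.T @ yk - 0.5 * lam1 * e + lam2 * a_star_k)
    delta_explicit = y[k] - Xt[k] @ a_hat_k

    # non-enumerative formula from the end of the proof, using only full-sample quantities
    xk = Xt[k]
    h = xk @ np.linalg.solve(A, xk)
    delta_k = y[k] - xk @ a_hat
    corr = lam2 / (N - 1) * (xk @ np.linalg.solve(A, y[k] * xk - a_star))
    delta_closed = (delta_k + corr) / (1.0 - h)

    assert np.isclose(delta_explicit, delta_closed)
print("closed-form LOO residuals match explicit refits")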


Copyright information

© 2014 Springer Science+Business Media New York

Cite this chapter

Chernousova, E., Razin, N., Krasotkina, O., Mottl, V., Windridge, D. (2014). Linear Regression via Elastic Net: Non-enumerative Leave-One-Out Verification of Feature Selection. In: Aleskerov, F., Goldengorin, B., Pardalos, P. (eds) Clusters, Orders, and Trees: Methods and Applications. Springer Optimization and Its Applications, vol 92. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-0742-7_22
