Abstract
The feature-selective non-quadratic Elastic Net criterion of regression estimation is completely determined by two numerical regularization parameters, which penalize, respectively, the squared and absolute values of the regression coefficients under estimation. It is an inherent property of the Elastic Net minimum that the values of the regularization parameters completely determine a partition of the variable set into three subsets, on which the estimated coefficients are negative, positive, and strictly zero, so that the former two subsets are associated with “informative” features and the latter with “redundant” ones. In this paper we propose to treat this partition as a secondary structural parameter to be verified by leave-one-out cross-validation. Once the partition is fixed, we show that there exists a non-enumerative method for computing the leave-one-out error rate, thus enabling an evaluation of model generality in order to tune the structural parameters without the necessity of multiple training repetitions.
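As a rough illustration of the sign partition described above (a sketch, not the authors' code), one can fit an Elastic Net with scikit-learn, whose `alpha`/`l1_ratio` parameterization corresponds to the \((\lambda_1, \lambda_2)\) penalties only up to scaling conventions, and read off the three subsets:

```python
# Illustrative sketch (assumed sklearn parameterization, not the chapter's
# exact scaling): fit an Elastic Net and extract the partition
# {I^-, I^0, I^+} of features by the sign of the estimated coefficients.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_a = np.array([2.0, -1.5, 0, 0, 0, 1.0, 0, 0, 0, 0])
y = X @ true_a + 0.1 * rng.standard_normal(100)

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
a = model.coef_

I_neg = np.flatnonzero(a < 0)    # "informative" features, negative coefficients
I_zero = np.flatnonzero(a == 0)  # "redundant" features, coefficients exactly zero
I_pos = np.flatnonzero(a > 0)    # "informative" features, positive coefficients
print(I_neg, I_zero, I_pos)
```

The three index sets always form a partition of the full feature set; which features fall into the zero set depends on the two regularization parameters, which is exactly why the partition can serve as a structural parameter.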
Notes
- 1. In [1], the denominators in (5) have the form \(1 +\lambda _{2}\) instead of \(1 +\lambda _{2}/N\). This is a consequence of the specific normalization of the training set in [1], \(\sum \nolimits _{j=1}^{N}\!x_{ij}^{2} = 1\), as distinct from the commonly adopted normalization \((1/N)\sum \nolimits _{j=1}^{N}\!x_{ij}^{2} = 1\) accepted in this paper (2).
- 2.
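The normalization difference discussed in Note 1 can be made concrete with a small sketch (assumed layout: columns are features, rows are the \(N\) training entities):

```python
# Sketch of the two column normalizations from Note 1 (assumed data layout:
# one column per feature, one row per training entity).
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
N = X.shape[0]

# Normalization of [1]: sum_j x_ij^2 = 1 for each feature
X_unit = X / np.linalg.norm(X, axis=0)

# Normalization adopted in this paper (2): (1/N) sum_j x_ij^2 = 1
X_mean_unit = X / np.sqrt((X**2).mean(axis=0))

assert np.allclose((X_unit**2).sum(axis=0), 1.0)
assert np.allclose((X_mean_unit**2).mean(axis=0), 1.0)
```

The two conventions differ by a factor of \(\sqrt{N}\) per column, which is what shifts the denominator between \(1+\lambda_2\) and \(1+\lambda_2/N\).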
References
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B 67(2), 301–320 (2005)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58(1), 267–288 (1996)
Ye, G., Chen, Y., Xie, X.: Efficient variable selection in support vector machines via the alternating direction method of multipliers. J. Mach. Learn. Res. Proc. Track 832–840 (2011)
Wang, L., Zhu, J., Zou, H.: The doubly regularized support vector machine. Stat. Sinica 16, 589–615 (2006)
Grosswindhager, S.: Using penalized logistic regression models for predicting the effects of advertising material (2009). http://publik.tuwien.ac.at/files/PubDat_179921.pdf
Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)
Christensen, R.: Plane Answers to Complex Questions. The Theory of Linear Models, 3rd edn. Springer, New York (2010)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
Appendix
1.1 Proof of Theorem 1
Let us expand the brackets in (5):
Summands not depending on \(\mathbf{a}\) may be omitted from the optimization; collecting the remaining summands gives:
Division of the last equality by the constant \((1 +\lambda _{2}/N)\) yields (8). The theorem is proven.
1.2 Proof of Theorem 2
Differentiating (11) with respect to the active regression coefficients \(a_{i}\), \(i\notin \hat{I}_{\lambda _{1},\lambda _{2}}^{0}\), leads to the equalities
which form a system of linear equations over \(i\!\notin \hat{I}_{\lambda _{1},\lambda _{2}}^{0}\)
The matrix form of this system, in accordance with (12), (13), and (14), is just (16), with (13) as its solution. The theorem is proven.
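The idea of Theorem 2 can be checked numerically (a sketch in scikit-learn's `alpha`/`l1_ratio` parameterization, which is assumed to differ from the chapter's \((\lambda_1,\lambda_2)\) scaling only by constants): once the zero set and the signs of the active coefficients are fixed, the active coefficients solve a plain linear system and need no iterative search.

```python
# Sketch: recover the active Elastic Net coefficients from the linear system
# implied by the stationarity condition on the active set, then compare with
# the iteratively trained solution. Scaling follows sklearn's objective
# (1/(2N))||y - Xa||^2 + alpha*l1_ratio*||a||_1 + 0.5*alpha*(1-l1_ratio)*||a||^2.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
N, n = 200, 8
X = rng.standard_normal((N, n))
y = X @ np.array([3.0, -2.0, 0, 0, 1.0, 0, 0, 0]) + 0.05 * rng.standard_normal(N)

alpha, l1_ratio = 0.05, 0.5
enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                  tol=1e-12, max_iter=100_000).fit(X, y)
a = enet.coef_

active = np.flatnonzero(a != 0)   # complement of the zero set I^0
s = np.sign(a[active])            # fixed signs of the active coefficients
XA = X[:, active]

# Stationarity on the active set:
# ((1/N) XA^T XA + alpha*(1-l1_ratio) I) a_A = (1/N) XA^T y - alpha*l1_ratio * s
lhs = XA.T @ XA / N + alpha * (1 - l1_ratio) * np.eye(len(active))
rhs = XA.T @ y / N - alpha * l1_ratio * s
a_closed = np.linalg.solve(lhs, rhs)
assert np.allclose(a_closed, a[active], atol=1e-5)
```

The closed-form solution of the ridge-type system coincides with the trained coefficients, which is what makes the fixed-partition treatment in the chapter tractable.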
1.3 Proof of Theorem 3
Let the feature set partition \({\bigl \{\hat{I}_{\lambda _{1},\lambda _{2}}^{-},\hat{I}_{\lambda _{1},\lambda _{2}}^{0},\hat{I}_{\lambda _{1},\lambda _{2}}^{+}\bigr \}}\) (9) at the minimum point of (5) be treated as fixed, and let the kth entity \((\mathbf{x}_{k},y_{k})\) be omitted from the training set (1). In terms of notation (4) and (2), this implies deletion of the kth element from the vector \(\mathbf{y} \in \mathbb{R}^{N}\) and of the kth row from the matrix \(\tilde{\mathbf{X}}_{\lambda _{1},\lambda _{2}}\) \((N \times \hat{ n}_{\lambda _{1},\lambda _{2}})\):
The vector of preliminary estimates of the regression coefficients \(\mathbf{a}^{{\ast}} \in \mathbb{R}^{n}\) (12) occurs only in the Elastic Net (EN) training criterion (5) and equals zero in the naive Elastic Net (NEN) (3), \(\mathbf{a}^{{\ast}} = \mathbf{0} \in \mathbb{R}^{n}\). Its subvector, cut out from \(\mathbf{a}^{{\ast}}\) by deletion of the kth entity, will be:
Correspondingly, the solution (13) of the optimization problem (11) takes the form (the subscripts \((\lambda _{1},\lambda _{2})\) are omitted below):
Notice here that
Application of the Woodbury formula (see Note 2) gives:
Algebraic transformation of this expression, using the notation \(\hat{y}_{k} =\tilde{ \mathbf{x}}_{k}^{T}\hat{\tilde{\mathbf{a}}}\) (16) and \(\hat{y}_{k}^{(k)} =\tilde{ \mathbf{x}}_{k}^{T}\hat{\tilde{\mathbf{a}}}^{(k)}\) (18), leads to the equality
Thus, the leave-one-out residuals \(\hat{\delta }_{k}^{(k)}\) in (17) and (18) permit the representation
Substituting \(\hat{\delta }_{k}^{(k)}\) into (17), using notation (21), yields (19) and (20).
The theorem is proven.
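The Woodbury-based shortcut of Theorem 3 can be sketched numerically for the ridge-type problem that remains once the partition is fixed. The scaling below is an assumption, with `lam` standing in for \(\lambda_2\): the leave-one-out residual satisfies \(\hat{\delta}_{k}^{(k)} = \hat{\delta}_{k}/(1 - h_{kk})\), where \(h_{kk}\) is the kth diagonal element of the hat matrix, so all \(N\) residuals come from a single training pass.

```python
# Sketch (assumed 1/N scaling of the quadratic criterion; lam plays the role
# of lambda_2): non-enumerative leave-one-out residuals via the Woodbury
# identity, checked against N explicit retrainings.
import numpy as np

rng = np.random.default_rng(3)
N, n = 60, 5
X = rng.standard_normal((N, n))
y = X @ rng.standard_normal(n) + 0.1 * rng.standard_normal(N)
lam = 0.5

G = X.T @ X / N + lam * np.eye(n)
a_hat = np.linalg.solve(G, X.T @ y / N)
H = X @ np.linalg.solve(G, X.T) / N       # hat matrix
delta = y - X @ a_hat                      # full-sample residuals
loo_fast = delta / (1 - np.diag(H))        # Woodbury-based LOO residuals

# Brute-force check: retrain N times with one entity removed.
loo_slow = np.empty(N)
for k in range(N):
    mask = np.arange(N) != k
    Gk = X[mask].T @ X[mask] / N + lam * np.eye(n)   # keep the same 1/N scaling
    ak = np.linalg.solve(Gk, X[mask].T @ y[mask] / N)
    loo_slow[k] = y[k] - X[k] @ ak
assert np.allclose(loo_fast, loo_slow)
```

The fast route costs one matrix factorization instead of \(N\), which is the "non-enumerative" verification the chapter's title refers to.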
Copyright information
© 2014 Springer Science+Business Media New York
Cite this chapter
Chernousova, E., Razin, N., Krasotkina, O., Mottl, V., Windridge, D. (2014). Linear Regression via Elastic Net: Non-enumerative Leave-One-Out Verification of Feature Selection. In: Aleskerov, F., Goldengorin, B., Pardalos, P. (eds) Clusters, Orders, and Trees: Methods and Applications. Springer Optimization and Its Applications, vol 92. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-0742-7_22
Print ISBN: 978-1-4939-0741-0
Online ISBN: 978-1-4939-0742-7