Consistency bounds and support recovery of d-stationary solutions of sparse sample average approximations

Abstract

This paper studies properties of the d(irectional)-stationary solutions of sparse sample average approximation problems involving difference-of-convex sparsity functions under a deterministic setting. Such properties are investigated with respect to a vector which satisfies a verifiable assumption relating the empirical sample average approximation problem to the expectation minimization problem defined by an underlying data distribution. We derive bounds for the distance between the two vectors and the difference of the model outcomes generated by them. Furthermore, the inclusion relationships between their supports, the sets of indices of nonzero components, are studied. We provide conditions under which the support of a d-stationary solution is contained within, and contains, the support of the vector of interest; the first kind of inclusion can be shown for any given arbitrary set of indices. Some of the results presented herein are generalizations of the existing theory for the specialized problem of \(\ell _1\)-norm regularized least squares minimization for linear regression.


References

  1. Ahn, M.: Difference-of-Convex Learning: Optimization with Non-convex Sparsity Functions. University of Southern California, Los Angeles (2018)
  2. Ahn, M., Pang, J.S., Xin, J.: Difference-of-convex learning: directional stationarity, optimality and sparsity. SIAM J. Optim. 27(3), 1637–1665 (2017)
  3. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
  4. Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data. Springer Series in Statistics. Springer, Berlin (2011)
  5. Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)
  6. Candès, E., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4216 (2005)
  7. Candès, E., Tao, T.: Near optimal signal recovery from random projections: universal encoding strategies. IEEE Trans. Inf. Theory 52(12), 5406–5425 (2006)
  8. Candès, E., Tao, T.: The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35(6), 2313–2351 (2007)
  9. Candès, E., Wakin, M., Boyd, S.: Enhancing sparsity by reweighted \(\ell _1\) minimization. J. Fourier Anal. Appl. 14(5), 877–905 (2008)
  10. Dong, H., Ahn, M., Pang, J.S.: Structural properties of affine sparsity constraints. Math. Program. Ser. B (2018). https://doi.org/10.1007/s10107-018-1283-3
  11. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
  12. Fan, J., Lv, J.: Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inf. Theory 57(8), 5467–5484 (2011)
  13. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press/Taylor & Francis Group, Boca Raton (2015)
  14. Knight, K., Fu, W.: Asymptotics for lasso-type estimators. Ann. Stat. 28(5), 1356–1378 (2000)
  15. Le Thi, H.A., Pham, D.T.: The DC programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann. Oper. Res. 133, 25–46 (2005)
  16. Le Thi, H.A., Pham, D.T., Vo, X.T.: DC approximation approaches for sparse optimization. Eur. J. Oper. Res. 244, 26–46 (2015)
  17. Loh, P., Wainwright, M.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16, 559–616 (2015)
  18. Loh, P., Wainwright, M.: Support recovery without incoherence: a case for nonconvex regularization. Ann. Stat. 45(6), 2455–2482 (2017)
  19. Lou, Y., Yin, P., Xin, J.: Point source super-resolution via nonconvex L1 based methods. J. Sci. Comput. 68(3), 1082–1100 (2016)
  20. Lu, S., Liu, Y., Yin, L., Zhang, K.: Confidence intervals and regions for the lasso by using stochastic variational inequality techniques in optimization. J. R. Stat. Soc. Ser. B 79(2), 589–611 (2017)
  21. Negahban, S., Ravikumar, P., Wainwright, M., Yu, B.: A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci. 27(4), 538–557 (2012)
  22. Nikolova, M.: Local strong homogeneity of a regularized estimator. SIAM J. Appl. Math. 61(2), 633–658 (2000)
  23. Pang, J.S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42(1), 95–118 (2017)
  24. Pang, J.S., Tao, M.: Decomposition methods for computing directional stationary solutions of a class of nonsmooth nonconvex optimization problems. SIAM J. Optim. 28(2), 1640–1669 (2018)
  25. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia (2009)
  26. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
  27. Yin, P., Lou, Y., He, Q., Xin, J.: Minimization of L1–L2 for compressed sensing. SIAM J. Sci. Comput. 37(1), 536–563 (2015)
  28. Yin, W., Osher, S., Goldfarb, D., Darbon, J.: Bregman iterative algorithms for \(\ell _1\)-minimization with applications to compressed sensing. SIAM J. Imaging Sci. 1(1), 143–168 (2008)
  29. Zhang, C.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
  30. Zhang, S., Xin, J.: Minimization of transformed \(\ell _1\) penalty: theory, difference of convex function algorithm, and robust application in compressed sensing. Math. Program. Ser. B 169(1), 307–336 (2018)
  31. Zhao, P., Yu, B.: On model selection consistency of LASSO. J. Mach. Learn. Res. 7, 2541–2563 (2006)
  32. Zou, H.: The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)


Acknowledgements

The author gratefully acknowledges Jong-Shi Pang for his involvement in fruitful discussions, and for providing valuable ideas that helped to build the foundation of this work.

Author information

Corresponding author

Correspondence to Miju Ahn.

Additional information

This work is derived and extended from the last chapter [1, Chapter 4] of the author's Ph.D. dissertation, which was written under the supervision of Jong-Shi Pang.

Appendices

Appendix A: Details about the numerical experiment

The following are parameters used in data generation and problem formulation in Sect. 5.

Number of samples: 30
Dimension of \(w^*\): 50
Sparsity rate in \(w^*\): 0.5
Value of \(\delta '\): \(1 - 10^{-2}\)
Value of \(\delta \): start at \(10^{-4}\), then multiply the current value by 1.2 until it reaches \(\delta '\)
\(\ell _1\)-norm parameter: \(c = 1.5\)
MCP parameters: \(a = 2\), \(\lambda = 0.75\)
SCAD parameters: \(a = 2\), \(\lambda = 1.5\)
Transformed \(\ell _1\) parameter: \(a = 2\)
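The schedule for \(\delta \) can be made concrete with a minimal sketch; the variable names below are illustrative, and the inclusion of the endpoint \(\delta '\) is an assumption rather than a detail stated above.

```python
# Illustrative sketch of the delta schedule described above (names are hypothetical).
delta_prime = 1 - 1e-2   # value of delta'
delta = 1e-4             # starting value of delta
deltas = []
while delta < delta_prime:
    deltas.append(delta)
    delta *= 1.2         # increment by multiplying the current value by 1.2
deltas.append(delta_prime)   # assumed: the grid is capped at delta'
```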

Appendix B: A summary for the LASSO analysis

This section provides a review of one particular reference, which presents a chapter on the statistical inference analysis of the LASSO problem [13, Chapter 11]. We formally define the LASSO problem:

$$\begin{aligned} \mathop {\mathrm{minimize}}\limits _{w \in {\mathbb {R}}^d} \ \ \theta _N^{\mathrm{Lasso}}(w) \, \triangleq \, \frac{1}{2N} \, \Vert \, Y - X \, w \, \Vert _2^2 \, + \, \lambda _N^{\mathrm{Lasso}} \, \Vert \, w \, \Vert _1, \end{aligned}$$
(28)

where each row of \(X \in {\mathbb {R}}^{N \times d}\) is a feature sample \(x^T\) and the corresponding component of \(Y \in {\mathbb {R}}^{N \times 1}\) is its outcome y. Since LASSO is a convex program, any local minimizer is a global minimizer by convex optimization theory. Exploiting this fact, the LASSO analysis provided in the reference compares the optimal solutions of the \(\ell _1\)-regularized sample average approximation problem (28), denoted by \({\widehat{w}}^{\text {Lasso}}\), with the underlying ground truth vector \(w^{\, 0}\). The assumption on \(w^{\, 0}\) is that all attained samples are generated from this vector and then perturbed by random Gaussian noise, i.e., \(Y = X \, w^{\, 0} + \epsilon \) where each component \(\epsilon _i \sim {\mathcal {N}}(0, \sigma ^2)\) for \(1 \le i \le N\). Moreover, the authors assume that the ground truth is a sparse vector with its nonzero components defining the support set \(S_{\, 0} \triangleq \{ \, i \ | \ w_i^{\, 0} \ne 0 \ \text {for} \ 1 \le i \le d \, \}\).
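As a concrete illustration of this data-generating assumption, the following sketch draws a sparse ground truth and Gaussian-noise observations; the dimensions, sparsity level, and noise level are illustrative choices, not the settings used in the reference or in Sect. 5.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, s = 30, 50, 10            # sample size, dimension, number of nonzeros (illustrative)
sigma = 0.5                     # noise standard deviation (illustrative)

X = rng.standard_normal((N, d))             # rows are the feature samples x^T
w0 = np.zeros(d)
S0 = rng.choice(d, size=s, replace=False)   # support set S_0 of the ground truth
w0[S0] = rng.standard_normal(s)             # sparse ground truth w^0
Y = X @ w0 + sigma * rng.standard_normal(N) # Y = X w^0 + eps, eps_i ~ N(0, sigma^2)
```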

The topics the authors address in the referenced work are the following: how close the empirical solution is to the ground truth; whether, if the ultimate goal of the problem is to make future predictions, the empirical model outputs can be compared with the noise-free outputs produced by the ground truth on the available samples; and whether the empirical solution can recover the indices of nonzero components that are contained in \(S_{\, 0}\). To answer these questions, the authors derive a region which contains every error vector, i.e., the difference between a solution of (28) and \(w^{\, 0}\). They define

$$\begin{aligned} {\mathcal {V}}_{\mathrm{Lasso}} \triangleq \{ \, v \in {\mathbb {R}}^d \ \big | \ \Vert \, v_{S_{\, 0}^c} \, \Vert _1 \le 3 \, \Vert \, v_{S_{\, 0}} \, \Vert _1 \}, \end{aligned}$$

where \(S_{\, 0}^c\) is the complement of \(S_{\, 0}\). Therein, the authors assume that \(\theta _N^{\mathrm{Lasso}}\) is strongly convex at the point \(w^{\, 0}\) with respect to \({\mathcal {V}}_{\mathrm{Lasso}}\), making a connection between the empirical solution and the ground truth by utilizing this region. Though (28) is a convex program, strong convexity of the entire objective function cannot be expected in general. Such an assumption guarantees that the submatrix of the Hessian \(X^T X\) corresponding to the indices in \(S_{\, 0}\) has full rank. The statement of the restricted eigenvalue assumption, analogous to restricted strong convexity for the special case of least squares error minimization for linear regression, is as follows: there exists a constant \(\gamma ^{\mathrm{Lasso}} > 0\) such that

$$\begin{aligned} \displaystyle {\frac{\frac{1}{N} v^T X^T X v}{\Vert \, v \, \Vert _2^2}} \ge \gamma ^{\mathrm{Lasso}} \quad \text {for all nonzero } v \in {\mathcal {V}}_{\mathrm{Lasso}}. \end{aligned}$$
(29)
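Condition (29) cannot be verified exactly in general, but it can be probed numerically. The sketch below estimates the restricted eigenvalue constant by sampling random directions from \({\mathcal {V}}_{\mathrm{Lasso}}\); the function name and sampling scheme are illustrative, and the resulting value is only an upper bound on the true infimum, so it can suggest but not certify that (29) holds.

```python
import numpy as np

def re_constant_estimate(X, S0, n_samples=20000, seed=0):
    """Monte Carlo estimate of min over nonzero v in V_Lasso of
    (v^T X^T X v) / (N ||v||_2^2), the quantity bounded below in (29)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    S0 = np.asarray(S0)
    Sc = np.setdiff1d(np.arange(d), S0)
    best = np.inf
    for _ in range(n_samples):
        v = np.zeros(d)
        v[S0] = rng.standard_normal(S0.size)
        u = rng.standard_normal(Sc.size)
        # rescale the off-support block so that ||v_{S0^c}||_1 <= 3 ||v_{S0}||_1
        u *= rng.uniform() * 3.0 * np.abs(v[S0]).sum() / np.abs(u).sum()
        v[Sc] = u
        best = min(best, v @ (X.T @ (X @ v)) / (N * (v @ v)))
    return best
```

For instance, `re_constant_estimate(X, S0)` applied to data generated as in the previous sketch gives a rough sense of whether a positive \(\gamma ^{\mathrm{Lasso}}\) is plausible for that design.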

We list the results provided in the reference: a basic consistency bound, a bound on the prediction error, and the support recovery of \({\widehat{w}}^{\text {Lasso}}\). The assumptions imposed for each theorem and the key ideas of the proofs are discussed.

\(\bullet \) Consistency result [13, Theorem 11.1]: Suppose the model matrix X satisfies the restricted eigenvalue bound (29) with respect to the set \({\mathcal {V}}_{\mathrm{Lasso}}\). Given a regularization parameter \(\lambda _N^{\mathrm{Lasso}} \ge \displaystyle {\frac{2}{N}} \Vert \, X^T \epsilon \, \Vert _\infty > 0\), any solution \({\widehat{w}}^{\text {Lasso}}\) of (28) satisfies the bound

$$\begin{aligned} \Vert \, {\widehat{w}}^{\text {Lasso}} - w^{\, 0} \Vert _2 \, \le \, \displaystyle { \frac{3}{\gamma ^{\mathrm{Lasso}}} \, \sqrt{\frac{| \, S_0 \, |}{N}}} \, \sqrt{N} \, \lambda _N^{\mathrm{Lasso}} . \end{aligned}$$
(30)

Exploiting the fact that \({\widehat{w}}^{\text {Lasso}}\) is a global minimizer of the LASSO problem, the proof of the theorem starts from \(\theta _N^{\mathrm{Lasso}}({\widehat{w}}^{\text {Lasso}}) \le \theta _N^{\mathrm{Lasso}} (w^{\, 0})\). We substitute the assumption on the ground truth, \(Y = X \, w^{\, 0} + \epsilon \), into both sides of the inequality, and then apply the assumption on the regularization parameter \(\lambda _N^{\mathrm{Lasso}}\). These steps yield a key inequality,

$$\begin{aligned} \frac{\Vert \, X ({\widehat{w}}^{\text {Lasso}} - w^{\, 0} ) \, \Vert _2^2}{2N} \le \frac{3}{2} \, \sqrt{| \, S_0 \, |} \, \lambda _N^{\mathrm{Lasso}} \, \Vert \, {\widehat{w}}^{\text {Lasso}} - w^{\, 0 } \, \Vert _2, \end{aligned}$$
(31)

which serves as a building block for deriving both the current theorem and the prediction error bound to be shown. It can be verified that, by letting \(v = {\widehat{w}}^{\text {Lasso}} - w^{\, 0}\), the proof is complete provided that the restricted eigenvalue condition holds; this last step requires a lemma showing that the error \({\widehat{w}}^{\text {Lasso}} - w^{\, 0}\) associated with any LASSO solution \({\widehat{w}}^{\text {Lasso}}\) belongs to the set \({\mathcal {V}}_{\mathrm{Lasso}}\) whenever the condition on \(\lambda _N^{\mathrm{Lasso}}\) holds.
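To make this last step explicit, combining the restricted eigenvalue bound (29) with the key inequality (31) for \(v = {\widehat{w}}^{\text {Lasso}} - w^{\, 0} \in {\mathcal {V}}_{\mathrm{Lasso}}\) gives

$$\begin{aligned} \frac{\gamma ^{\mathrm{Lasso}}}{2} \, \Vert \, v \, \Vert _2^2 \, \le \, \frac{\Vert \, X v \, \Vert _2^2}{2N} \, \le \, \frac{3}{2} \, \sqrt{| \, S_0 \, |} \, \lambda _N^{\mathrm{Lasso}} \, \Vert \, v \, \Vert _2, \end{aligned}$$

and dividing through by \(\frac{\gamma ^{\mathrm{Lasso}}}{2} \Vert \, v \, \Vert _2\) (when \(v \ne 0\)) recovers the bound (30).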

\(\bullet \) Bounds on the prediction error [13, Theorem 11.2]: Suppose the matrix X satisfies the restricted eigenvalue condition (29) over the set \({\mathcal {V}}_{\mathrm{Lasso}}\). Given a regularization parameter \(\lambda _N^{\mathrm{Lasso}} \ge \displaystyle {\frac{2}{N}} \Vert \, X^T \epsilon \, \Vert _\infty > 0\), any solution \({\widehat{w}}^{\text {Lasso}}\) of (28) satisfies the bound

$$\begin{aligned} \frac{ \Vert \, X({\widehat{w}}^{\text {Lasso}} - w^{\, 0}) \, \Vert _2^2}{N} \, \le \, \frac{9}{\gamma ^{\mathrm{Lasso}}} \, | \, S_0 \, | \, ( \lambda _N^{\mathrm{Lasso}} )^2. \end{aligned}$$
(32)

The proof of the prediction error bound is straightforward; it can be shown by combining the restricted eigenvalue assumption (29) and the inequality (31), which is derived in the process of proving the consistency result.
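Explicitly, (29) yields \(\Vert \, v \, \Vert _2 \le \Vert \, X v \, \Vert _2 / \sqrt{N \, \gamma ^{\mathrm{Lasso}}}\) for \(v = {\widehat{w}}^{\text {Lasso}} - w^{\, 0}\), so substituting this into the right-hand side of (31) gives

$$\begin{aligned} \frac{\Vert \, X v \, \Vert _2^2}{2N} \, \le \, \frac{3}{2} \, \sqrt{| \, S_0 \, |} \, \lambda _N^{\mathrm{Lasso}} \, \frac{\Vert \, X v \, \Vert _2}{\sqrt{N \, \gamma ^{\mathrm{Lasso}}}}, \end{aligned}$$

and cancelling \(\Vert \, X v \, \Vert _2\), squaring, and dividing by \(N\) produces (32).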

\(\bullet \) Assumptions for variable selection consistency result: To address variable selection consistency of the LASSO solution \({\widehat{w}}^{\text {Lasso}}\), the authors provide a distinct set of assumptions which are related to the structure of the matrix X. The mutual incoherence (sometimes also referred to as irrepresentability) condition states that there must exist some \(\gamma ^{\mathrm{Lasso}} > 0\) such that

$$\begin{aligned} \max \limits _{j \in S_0^{c}} \, \Vert \, (X_{S_0}^T X_{S_0})^{-1} X_{S_0}^T x_j \, \Vert _1 \le 1 - \gamma ^{\mathrm{Lasso}}. \end{aligned}$$
(33)

The authors point out that in the most desirable case, any jth column \(x_j\), where j belongs to the set of indices of the zero components of \(w^{\, 0}\), would be orthogonal to the columns of \(X_{S_{0}} \in {\mathbb {R}}^{N \times | \, S_0 \, |}\), the submatrix of X that consists of the columns corresponding to \(S_{0}\). As such orthogonality is not attainable for high-dimensional linear regression, the assumption ensures that ‘near orthogonality’ holds for the design matrix. In addition, they assume

$$\begin{aligned} \max \limits _{1 \le j \le d} \, \frac{1}{\sqrt{N}} \, \Vert \, x_j \, \Vert _2 \le K_{\mathrm{{clm}}} \end{aligned}$$
(34)

for some \( K_{\mathrm{{clm}}} > 0\), which can be interpreted as requiring the matrix X to have normalized columns. For example, the matrix can be normalized such that \(\Vert \, x_j \, \Vert _2\) is equal to \(\sqrt{N}\) for every j, in which case the constant \(K_{\mathrm{clm}}\) equals 1. The last assumption made on the matrix X is

$$\begin{aligned} \lambda _{\min } \left( \displaystyle { \frac{X_{S_0}^T X_{S_0}}{N}} \right) \ge C_{\min } \end{aligned}$$
(35)

for some positive constant \(C_{\min }\), where \(\lambda _{\min }\) denotes the minimum eigenvalue of the given matrix. The authors note that if this condition is violated then the columns of \(X_{S_0}\) are linearly dependent, and it is not possible to recover \(w^{\, 0}\) even if its supporting indices are known.
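The three design-matrix quantities appearing in (33)–(35) are all directly computable once X and a candidate support are given. The following sketch evaluates them; the function name and return format are illustrative, not taken from the reference.

```python
import numpy as np

def design_matrix_conditions(X, S0):
    """Return the mutual incoherence value in (33), the column
    normalization constant K_clm in (34), and the minimum eigenvalue
    C_min in (35) for a design matrix X and candidate support S0."""
    N, d = X.shape
    S0 = np.asarray(S0)
    Sc = np.setdiff1d(np.arange(d), S0)
    XS = X[:, S0]
    G = XS.T @ XS                                   # X_{S0}^T X_{S0}
    # (33): max_{j in S0^c} || (X_{S0}^T X_{S0})^{-1} X_{S0}^T x_j ||_1
    coefs = np.linalg.solve(G, XS.T @ X[:, Sc])
    incoherence = np.abs(coefs).sum(axis=0).max()   # should be <= 1 - gamma
    # (34): max_j ||x_j||_2 / sqrt(N)
    K_clm = np.linalg.norm(X, axis=0).max() / np.sqrt(N)
    # (35): lambda_min( X_{S0}^T X_{S0} / N )
    C_min = np.linalg.eigvalsh(G / N).min()
    return incoherence, K_clm, C_min
```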

\(\bullet \) Variable selection consistency [13, Theorem 11.3]: Suppose the matrix X satisfies the mutual incoherence condition (33) with parameter \(\gamma ^{\mathrm{Lasso}} > 0\), the column normalization condition (34) and the eigenvalue condition (35). For a noise vector \(\epsilon \in {\mathbb {R}}^N\) with i.i.d \({\mathcal {N}}(0, \sigma ^2)\) entries, consider the LASSO problem (28) with a regularization parameter

$$\begin{aligned} \lambda _N \ge \displaystyle {\frac{8 \, K_{\mathrm{{clm}}} \, \sigma }{\gamma ^{\mathrm{Lasso}}} \sqrt{\frac{\log d}{N}}}. \end{aligned}$$

Then with probability greater than \(1 - c_1 e^{-c_2 N \lambda _N^2}\), the LASSO has the following properties:

  1.

    Uniqueness: the optimal solution \({\widehat{w}}^{\text {Lasso}}\) is unique;

  2.

    No false inclusion: the unique optimal solution has its support contained within \(S_0\), i.e., support(\({\widehat{w}}^{\text {Lasso}}\)) \(\subseteq \) support(\(w^{\, 0}\));

  3.

    \(\ell _\infty \)-bound: the error \({\widehat{w}}^{\text {Lasso}} - w^{\, 0}\) satisfies the \(\ell _{\infty }\) bound

    $$\begin{aligned} \Vert \, {\widehat{w}}^{\text {Lasso}}_{S_0} - w_{S_0}^{\, 0} \, \Vert _\infty \le \underbrace{\lambda _N \left[ \, \displaystyle {\frac{4 \sigma }{\sqrt{C_{\min }}}} + \Vert \, (X_{S_0}^T X_{S_0} / N)^{-1} \, \Vert _{\infty } \, \right] }_{B(\lambda _N, \, \sigma ; \, X)} \end{aligned}$$
    (36)

    where \(\Vert A \Vert _\infty \) for a matrix A is defined as \(\max \limits _{\Vert u \Vert _\infty = 1} \Vert A \, u \Vert _\infty \);

  4.

    No false exclusion: the nonzero components of the LASSO solution \({\widehat{w}}^{\text {Lasso}}\) include all indices \(j \in S_0\) such that \(| w_j^{\, 0} | > B(\lambda _N, \, \sigma ; \, X)\); hence the LASSO is variable selection consistent as long as \(\min \limits _{j \in S_0} | w_j^{\, 0} | > B(\lambda _N, \, \sigma ; \, X)\).

Showing the uniqueness involves solving a hypothetical problem; the authors set \({\widehat{w}}^{\text {Lasso}}_{S_0^c} = 0\) and solve a reduced-size problem in which the objective function of LASSO is minimized with respect to \(w_{S_0} \in {\mathbb {R}}^{| \, S_0 \, |}\). By convexity and the first order optimality condition (referred to as the zero-subgradient condition in the reference), the authors show that all optimal solutions of the original LASSO are supported only on \(S_0\), and thus the solutions can be obtained by solving the reduced problem. The lower eigenvalue condition (35) is then used to show the uniqueness.

By the first order optimality condition for convex non-differentiable problems, there exists a subgradient of \(\Vert \bullet \Vert _1\), denoted by \({\widehat{z}}\), such that \(\frac{1}{N} X^T ( \, X \, {\widehat{w}}^{\text {Lasso}} - Y \, ) + \lambda _N \, {\widehat{z}} = 0\). This equation can be rewritten in a block-matrix form by substituting the definition of Y:

$$\begin{aligned} \displaystyle {\frac{1}{N}} \left[ \begin{array}{cc} X_{S_0}^T X_{S_0} &{} X_{S_0}^T X_{S_0^c} \\ X_{S_0^c}^T X_{S_0} &{} X_{S_0^c}^T X_{S_0^c} \end{array} \right] \ \left[ \begin{array}{c} {\widehat{w}}^{\text {Lasso}}_{S_0} - w_{S_0}^{\, 0} \\ 0 \end{array} \right] \ - \displaystyle {\frac{1}{N}} \left[ \begin{array}{c} X_{S_0}^T \epsilon \\ X_{S_0^c}^T \epsilon \end{array} \right] + \lambda _N \left[ \begin{array}{c} {\widehat{z}}_{S_0} \\ {\widehat{z}}_{S_0^c} \end{array} \right] = \left[ \begin{array}{c} 0 \\ 0 \end{array} \right] . \end{aligned}$$

This is the key equation used to show the remaining parts of the theorem. By applying the assumptions, the authors investigate the quantity \({\widehat{w}}^{\text {Lasso}}_{S_0} - w_{S_0}^{\, 0}\) through the above equation. Due to the presence of the noise vector \(\epsilon \), probability is introduced in the statement; since the noise is zero-mean Gaussian, the authors apply related probabilistic bounds to establish the third part of the theorem.
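The zero-subgradient condition also suggests a simple numerical sanity check in the spirit of this primal-dual argument: given a candidate solution, recover the implied subgradient and test whether it is a valid (and strictly dual feasible) subgradient of \(\Vert \bullet \Vert _1\). The sketch below is an illustrative check, not the construction used in the reference.

```python
import numpy as np

def zero_subgradient_check(X, Y, w_hat, lam, tol=1e-8):
    """From (1/N) X^T (X w_hat - Y) + lam * z_hat = 0, recover z_hat and
    verify z_j = sign(w_hat_j) on the support of w_hat and |z_j| <= 1
    (strictly < 1 for the uniqueness argument) off the support."""
    N = X.shape[0]
    z_hat = X.T @ (Y - X @ w_hat) / (N * lam)
    on = np.abs(w_hat) > tol
    sign_ok = np.allclose(z_hat[on], np.sign(w_hat[on]), atol=1e-4)
    dual_feasible = bool(np.all(np.abs(z_hat[~on]) <= 1 + 1e-4))
    strictly_feasible = bool(np.all(np.abs(z_hat[~on]) < 1))
    return sign_ok, dual_feasible, strictly_feasible
```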


Cite this article

Ahn, M. Consistency bounds and support recovery of d-stationary solutions of sparse sample average approximations. J Glob Optim 78, 397–422 (2020). https://doi.org/10.1007/s10898-019-00857-z


Keywords

  • Non-convex optimization
  • Sparse learning
  • Difference-of-convex program
  • Directional stationary solution