Abstract
This paper studies properties of the d(irectional)-stationary solutions of sparse sample average approximation problems involving difference-of-convex sparsity functions under a deterministic setting. Such properties are investigated with respect to a vector satisfying a verifiable assumption that relates the empirical sample average approximation problem to the expectation minimization problem defined by an underlying data distribution. We derive bounds on the distance between the two vectors and on the difference of the model outcomes they generate. Furthermore, the inclusion relationships between their supports, i.e., their sets of nonzero-valued indices, are studied. We provide conditions under which the support of a d-stationary solution is contained within, and contains, the support of the vector of interest; the first kind of inclusion can be shown for any given arbitrary set of indices. Some of the results presented herein are generalizations of the existing theory for the specialized problem of \(\ell _1\)-norm regularized least squares minimization for linear regression.
References
 1.
Ahn, M.: Difference-of-Convex Learning: Optimization with Nonconvex Sparsity Functions. University of Southern California, Los Angeles (2018)
 2.
Ahn, M., Pang, J.S., Xin, J.: Difference of convex learning: directional stationarity, optimality and sparsity. SIAM J. Optim. 27(3), 1637–1665 (2017)
 3.
Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
 4.
Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data. Springer Series in Statistics. Springer, Berlin (2011)
 5.
Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of LASSO and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)
 6.
Candès, E., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4216 (2005)
 7.
Candès, E., Tao, T.: Near-optimal signal recovery from random projections: universal encoding strategies. IEEE Trans. Inf. Theory 52(12), 5406–5425 (2006)
 8.
Candès, E., Tao, T.: The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35(6), 2313–2351 (2007)
 9.
Candès, E., Wakin, M., Boyd, S.: Enhancing sparsity by reweighted \(\ell _1\) minimization. J. Fourier Anal. Appl. 14(5), 877–905 (2008)
 10.
Dong, H., Ahn, M., Pang, J.S.: Structural properties of affine sparsity constraints. Math. Program. Ser. B (2018). https://doi.org/10.1007/s10107-018-1283-3
 11.
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
 12.
Fan, J., Lv, J.: Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans. Inf. Theory 57(8), 5467–5484 (2011)
 13.
Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press/Taylor & Francis Group, Boca Raton (2015)
 14.
Knight, K., Fu, W.: Asymptotics for lasso-type estimators. Ann. Stat. 28(5), 1356–1378 (2000)
 15.
Le Thi, H.A., Pham, D.T.: The DC programming and DCA revisited with DC models of real world nonconvex optimization problems. Ann. Oper. Res. 133, 25–46 (2005)
 16.
Le Thi, H.A., Pham, D.T., Vo, X.T.: DC approximation approaches for sparse optimization. Eur. J. Oper. Res. 244, 26–46 (2015)
 17.
Loh, P., Wainwright, M.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16, 559–616 (2015)
 18.
Loh, P., Wainwright, M.: Support recovery without incoherence: a case for nonconvex regularization. Ann. Stat. 45(6), 2455–2482 (2017)
 19.
Lou, Y., Yin, P., Xin, J.: Point source super-resolution via nonconvex L1-based methods. J. Sci. Comput. 68(3), 1082–1100 (2016)
 20.
Lu, S., Liu, Y., Yin, L., Zhang, K.: Confidence intervals and regions for the lasso by using stochastic variational inequality techniques in optimization. J. Roy. Stat. Soc. B 79(2), 589–611 (2017)
 21.
Negahban, S., Ravikumar, P., Wainwright, M., Yu, B.: A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci. 27(4), 538–557 (2012)
 22.
Nikolova, M.: Local strong homogeneity of a regularized estimator. SIAM J. Appl. Math. 61(2), 633–658 (2000)
 23.
Pang, J.S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42(1), 95–118 (2017)
 24.
Pang, J.S., Tao, M.: Decomposition methods for computing directional stationary solutions of a class of nonsmooth nonconvex optimization problems. SIAM J. Optim. 28(2), 1640–1669 (2018)
 25.
Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia (2009)
 26.
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58(1), 267–288 (1996)
 27.
Yin, P., Lou, Y., He, Q., Xin, J.: Minimization of L1–L2 for compressed sensing. SIAM J. Sci. Comput. 37(1), 536–563 (2015)
 28.
Yin, W., Osher, S., Goldfarb, D., Darbon, J.: Bregman iterative algorithms for \(\text{ L }_1\)-minimization with applications to compressed sensing. SIAM J. Imaging Sci. 1(1), 143–168 (2008)
 29.
Zhang, C.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
 30.
Zhang, S., Xin, J.: Minimization of transformed \(\text{ L }_1\) penalty: theory, difference of convex function algorithm, and robust application in compressed sensing. Math. Program. Ser. B 169(1), 307–336 (2018)
 31.
Zhao, P., Yu, B.: On model selection consistency of LASSO. J. Mach. Learn. Res. 7, 2541–2563 (2006)
 32.
Zou, H.: The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)
Acknowledgements
The author gratefully acknowledges Jong-Shi Pang for his involvement in fruitful discussions, and for providing valuable ideas that helped to build the foundation of this work.
Additional information
This work is derived and extended from the last chapter [1, Chapter 4] of the author's Ph.D. dissertation, which was written under the supervision of Jong-Shi Pang.
Appendices
Appendix A: Details about the numerical experiment
The following parameters were used for data generation and problem formulation in Sect. 5.

Number of samples: 30
Dimension of \(w^*\): 50
Sparsity rate in \(w^*\): 0.5
Value of \(\delta '\): \(1 - 10^{-2}\)
Value of \(\delta \): start with \(10^{-4}\), then multiply the current value by 1.2 until it reaches \(\delta '\)
\(\ell _1\)-norm parameter: \(c = 1.5\)
MCP parameters: \(a = 2\), \(\lambda = 0.75\)
SCAD parameters: \(a = 2\), \(\lambda = 1.5\)
Transformed \(\ell _1\) parameter: \(a = 2\)
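For concreteness, the data-generation scheme above can be sketched in Python. The Gaussian choices for the features, the nonzero entries of \(w^*\), and the noise level are our own assumptions for illustration; Appendix A specifies only the dimensions and parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 30, 50            # number of samples, dimension of w*
sparsity_rate = 0.5      # fraction of zero entries forced into w*

# Ground-truth vector with the stated sparsity rate (Gaussian values assumed).
w_star = rng.standard_normal(d)
zero_idx = rng.choice(d, size=int(d * sparsity_rate), replace=False)
w_star[zero_idx] = 0.0

# Linear model with additive Gaussian noise (noise level assumed).
X = rng.standard_normal((N, d))
Y = X @ w_star + 0.1 * rng.standard_normal(N)

# delta grid: start at 1e-4, multiply by 1.2 until the value reaches delta'.
delta_prime = 1.0 - 1e-2
deltas = [1e-4]
while deltas[-1] * 1.2 < delta_prime:
    deltas.append(deltas[-1] * 1.2)
deltas.append(delta_prime)
```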
Appendix B: A summary of the LASSO analysis
This section provides a literature review of one particular reference, which devotes a chapter to the statistical inference analysis of the LASSO problem [13, Chapter 11]. We formally define the LASSO problem:
$$\begin{aligned} \mathop {\mathrm{minimize}}\limits _{w \in {\mathbb {R}}^d} \ \theta _N^{\mathrm{Lasso}}(w) \, \triangleq \, \displaystyle {\frac{1}{2N}} \, \Vert \, Y - X \, w \, \Vert _2^2 \, + \, \lambda _N^{\mathrm{Lasso}} \, \Vert \, w \, \Vert _1, \end{aligned}$$(28)
where each row of \(X \in {\mathbb {R}}^{N \times d}\) and each entry of \(Y \in {\mathbb {R}}^{N \times 1}\) are a feature information sample \(x^T\) and its outcome y, respectively. Since LASSO is a convex program, any local minimizer is a global optimum by convex optimization theory. Exploiting this fact, the LASSO analysis provided in the reference compares the optimal solutions of the \(\ell _1\)-regularized sample average approximation problem (28), denoted by \({\widehat{w}}^{\text {Lasso}}\), with the underlying ground truth vector \(w^{\, 0}\). The assumption on \(w^{\, 0}\) is that all attained samples are generated from this vector and then perturbed by random Gaussian noise, i.e., \(Y = X \, w^{\, 0} + \epsilon \) where each component \(\epsilon _i \sim {\mathcal {N}}(0, \sigma ^2)\) for \(1 \le i \le N\). Moreover, the authors assume that the ground truth is a sparse vector whose nonzero components define the support set \(S_{\, 0} \triangleq \{ \, i \ \mid \ w_i^{\, 0} \ne 0, \ 1 \le i \le d \, \}\).
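A minimal numerical sketch of solving (28) by proximal gradient descent (ISTA) is given below; it assumes the standard LASSO objective \(\frac{1}{2N} \Vert Y - X w \Vert _2^2 + \lambda \Vert w \Vert _1\), which may differ from the reference's exact scaling.

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=5000):
    """Minimize (1/2N)||Y - Xw||_2^2 + lam*||w||_1 via proximal gradient (ISTA)."""
    N, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / N       # Lipschitz constant of the smooth part
    step = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - Y) / N        # gradient of the least-squares term
        z = w - step * grad
        # soft-thresholding: proximal operator of the l1 norm
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w
```

The soft-thresholding step is what produces exact zeros in \({\widehat{w}}^{\text {Lasso}}\), which is the mechanism behind the support-recovery questions studied below.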
The topics the authors address in the referenced work are the following: how close the empirical solution is to the ground truth; if the ultimate goal of the problem is to make future predictions, whether it is possible to compare the empirical model outputs with the noise-free outputs produced by the ground truth on the available samples; and whether the empirical solution can recover the indices of nonzero components contained in \(S_{\, 0}\). To answer these questions, the authors derive a region that contains every difference vector between a solution of (28) and \(w^{\, 0}\). They define
$$\begin{aligned} {\mathcal {V}}_{\mathrm{Lasso}} \, \triangleq \, \left\{ \, v \in {\mathbb {R}}^d \ \mid \ \Vert \, v_{S_{\, 0}^c} \, \Vert _1 \, \le \, 3 \, \Vert \, v_{S_{\, 0}} \, \Vert _1 \, \right\} , \end{aligned}$$
where \(S_{\, 0}^c\) is the complement of \(S_{\, 0}\). Therein, the authors assume that \(\theta _N^{\mathrm{Lasso}}\) is strongly convex at the point \(w^{\, 0}\) with respect to \({\mathcal {V}}_{\mathrm{Lasso}}\), making a connection between the empirical solution and the ground truth by utilizing this region. Though (28) is a convex program, strong convexity of the entire objective function cannot be expected in general. Such an assumption guarantees that the submatrix of the Hessian \(X^T X\) corresponding to the indices in \(S_{\, 0}\) has full rank. The statement of the restricted eigenvalues assumption, analogous to restricted strong convexity for the special case of least squares error minimization for linear regression, is as follows: there exists a constant \(\gamma ^{\mathrm{Lasso}} > 0\) such that
$$\begin{aligned} \displaystyle {\frac{1}{N}} \, \Vert \, X \, v \, \Vert _2^2 \ \ge \ \gamma ^{\mathrm{Lasso}} \, \Vert \, v \, \Vert _2^2 \qquad \hbox {for all} \ v \in {\mathcal {V}}_{\mathrm{Lasso}}. \end{aligned}$$(29)
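The restricted eigenvalue constant can be probed numerically. The sketch below is our own illustration, not code from the reference: it samples random vectors from the cone \(\{ \, v : \Vert v_{S_0^c} \Vert _1 \le 3 \, \Vert v_{S_0} \Vert _1 \, \}\) (the cone constant 3 follows the construction in [13]) and returns a Monte-Carlo estimate of the smallest restricted Rayleigh quotient.

```python
import numpy as np

def re_constant_estimate(X, S0, alpha=3.0, n_samples=2000, seed=0):
    """Monte-Carlo estimate of min ||Xv||_2^2 / (N ||v||_2^2) over the cone
    {v : ||v_{S0^c}||_1 <= alpha * ||v_{S0}||_1} (alpha = 3 is an assumption)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    S0c = np.setdiff1d(np.arange(d), S0)
    best = np.inf
    for _ in range(n_samples):
        v = np.zeros(d)
        v[S0] = rng.standard_normal(len(S0))
        u = rng.standard_normal(len(S0c))
        # scale the off-support part so the cone constraint is satisfied
        budget = alpha * np.sum(np.abs(v[S0]))
        u *= budget / (np.sum(np.abs(u)) + 1e-12) * rng.uniform()
        v[S0c] = u
        ratio = (np.linalg.norm(X @ v) ** 2 / N) / (np.linalg.norm(v) ** 2)
        best = min(best, ratio)
    return best
```

Since only finitely many cone directions are sampled, the returned value is an optimistic (upper) estimate of the true constant \(\gamma ^{\mathrm{Lasso}}\).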
We list the results provided in the reference: a basic consistency bound, a bound on the prediction error, and the support recovery of \({\widehat{w}}^{\text {Lasso}}\). The assumptions imposed for each theorem and the key ideas of the proofs are discussed.
\(\bullet \) Consistency result [13, Theorem 11.1]: Suppose the model matrix X satisfies the restricted eigenvalue bound (29) with respect to the set \({\mathcal {V}}_{\mathrm{Lasso}}\). Given a regularization parameter \(\lambda _N^{\mathrm{Lasso}} \ge \displaystyle {\frac{2}{N}} \Vert \, X^T \epsilon \, \Vert _\infty > 0\), any solution \({\widehat{w}}^{\text {Lasso}}\) of (28) satisfies the bound
$$\begin{aligned} \Vert \, {\widehat{w}}^{\text {Lasso}} - w^{\, 0} \, \Vert _2 \ \le \ \displaystyle {\frac{3}{\gamma ^{\mathrm{Lasso}}}} \, \sqrt{| \, S_{\, 0} \, |} \ \lambda _N^{\mathrm{Lasso}}. \end{aligned}$$(30)
Exploiting the fact that \({\widehat{w}}^{\text {Lasso}}\) is the global minimizer of the LASSO problem, the proof of the theorem starts from \(\theta _N^{\mathrm{Lasso}}({\widehat{w}}^{\text {Lasso}}) \le \theta _N^{\mathrm{Lasso}} (w^{\, 0})\). We substitute the assumption on the ground truth, \(Y = X \, w^{\, 0} + \epsilon \), to both sides of the inequality, then apply the assumption on the regularization parameter \(\lambda _N^{\mathrm{Lasso}} \). These steps yield a key inequality given by,
which serves as a building block for deriving the current theorem and the prediction error bound shown below. It can be verified that, by letting \(v = {\widehat{w}}^{\text {Lasso}} - w^{\, 0}\), the proof is complete provided that the restricted eigenvalue condition holds; the last step requires a lemma showing that any error \({\widehat{w}}^{\text {Lasso}} - w^{\, 0}\) associated with a LASSO solution \({\widehat{w}}^{\text {Lasso}}\) belongs to the set \({\mathcal {V}}_{\mathrm{Lasso}}\) if the condition on \(\lambda _N^{\mathrm{Lasso}}\) holds.
\(\bullet \) Bounds on the prediction error [13, Theorem 11.2]: Suppose the matrix X satisfies the restricted eigenvalue condition (29) over the set \({\mathcal {V}}_{\mathrm{Lasso}}\). Given a regularization parameter \(\lambda _N^{\mathrm{Lasso}} \ge \displaystyle {\frac{2}{N}} \Vert \, X^T \epsilon \, \Vert _\infty > 0\), any solution \({\widehat{w}}^{\text {Lasso}}\) of (28) satisfies the bound
$$\begin{aligned} \displaystyle {\frac{1}{N}} \, \Vert \, X \, ( \, {\widehat{w}}^{\text {Lasso}} - w^{\, 0} \, ) \, \Vert _2^2 \ \le \ \displaystyle {\frac{9}{\gamma ^{\mathrm{Lasso}}}} \, | \, S_{\, 0} \, | \, \left( \lambda _N^{\mathrm{Lasso}} \right) ^2. \end{aligned}$$(32)
The proof of the prediction error bound is straightforward; it follows by combining the restricted eigenvalue assumption (29) with the inequality (31), which is derived in the course of proving the consistency result.
\(\bullet \) Assumptions for the variable selection consistency result: To address variable selection consistency of the LASSO solution \({\widehat{w}}^{\text {Lasso}}\), the authors provide a distinct set of assumptions related to the structure of the matrix X. The mutual incoherence (sometimes also referred to as irrepresentability) condition states that there must exist some \(\gamma ^{\mathrm{Lasso}} > 0\) such that
$$\begin{aligned} \max \limits _{j \in S_{\, 0}^c} \ \Vert \, ( X_{S_0}^T X_{S_0} )^{-1} X_{S_0}^T \, x_j \, \Vert _1 \ \le \ 1 - \gamma ^{\mathrm{Lasso}}. \end{aligned}$$(33)
The authors point out that in the most desirable case, any jth column \(x_j\), where j belongs to the set of indices of the zero components of \(w^{\, 0}\), would be orthogonal to the columns of \(X_{S_{0}} \in {\mathbb {R}}^{N \times | \, S_0 \, |}\), the submatrix of X consisting of the columns corresponding to \(S_{0}\). As such orthogonality is not attainable in high-dimensional linear regression, the assumption ensures that 'near orthogonality' holds for the design matrix. In addition, they assume
$$\begin{aligned} \max \limits _{1 \le j \le d} \ \displaystyle {\frac{\Vert \, x_j \, \Vert _2}{\sqrt{N}}} \ \le \ K_{\mathrm{clm}} \end{aligned}$$(34)
for some \(K_{\mathrm{clm}} > 0\), which can be interpreted as the matrix X having normalized columns. For example, the matrix can be normalized so that \(\Vert \, x_j \, \Vert _2\) equals \(\sqrt{N}\) for every j, resulting in the constant \(K_{\mathrm{clm}}\) being 1. The last assumption made on the matrix X is
$$\begin{aligned} \lambda _{\min } \left( \displaystyle {\frac{X_{S_0}^T X_{S_0}}{N}} \right) \ \ge \ C_{\min } \end{aligned}$$(35)
for some positive constant \(C_{\min }\), where \(\lambda _{\min }\) denotes the minimum eigenvalue of the given matrix. The authors note that if this condition is violated, then the columns of \(X_{S_0}\) are linearly dependent, and it is not possible to recover \(w^{\, 0}\) even if its supporting indices are known.
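For a given design matrix and support set, the three conditions (33)–(35) can be checked directly. The helper below is a hypothetical illustration of such a check, not code from the reference.

```python
import numpy as np

def lasso_design_checks(X, S0):
    """Return the mutual incoherence value, the column-norm constant K_clm,
    and the minimum restricted eigenvalue C_min for support set S0."""
    N = X.shape[0]
    S0c = np.setdiff1d(np.arange(X.shape[1]), S0)
    XS = X[:, S0]
    G_inv = np.linalg.inv(XS.T @ XS)
    # condition (33): max_j ||(X_S^T X_S)^{-1} X_S^T x_j||_1 over j outside S0
    incoherence = max(np.sum(np.abs(G_inv @ XS.T @ X[:, j])) for j in S0c)
    # condition (34): largest column norm scaled by sqrt(N)
    K_clm = np.max(np.linalg.norm(X, axis=0)) / np.sqrt(N)
    # condition (35): smallest eigenvalue of X_S^T X_S / N
    C_min = np.linalg.eigvalsh(XS.T @ XS / N).min()
    return incoherence, K_clm, C_min
```

In the ideal orthogonal case discussed above, the incoherence value is 0; mutual incoherence holds whenever it stays below \(1 - \gamma ^{\mathrm{Lasso}}\).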
\(\bullet \) Variable selection consistency [13, Theorem 11.3]: Suppose the matrix X satisfies the mutual incoherence condition (33) with parameter \(\gamma ^{\mathrm{Lasso}} > 0\), the column normalization condition (34) and the eigenvalue condition (35). For a noise vector \(\epsilon \in {\mathbb {R}}^N\) with i.i.d. \({\mathcal {N}}(0, \sigma ^2)\) entries, consider the LASSO problem (28) with a regularization parameter
Then with a probability greater than \(1 - c_1 e^{- c_2 N \lambda _N^2}\), the LASSO has the following properties:

1.
Uniqueness: the optimal solution \({\widehat{w}}^{\text {Lasso}}\) is unique;

2.
No false inclusion: The unique optimal solution has its support contained within \(S_0\), i.e., support(\({\widehat{w}}^{\text {Lasso}}\)) \(\subseteq \) support(\(w^{\, 0}\));

3.
\(\ell _\infty \) bound: the error \({\widehat{w}}^{\text {Lasso}} - w^{\, 0}\) satisfies the \(\ell _{\infty }\) bound
$$\begin{aligned} \Vert \, {\widehat{w}}^{\text {Lasso}}_{S_0} - w_{S_0}^{\, 0} \, \Vert _\infty \le \underbrace{\lambda _N \left[ \, \displaystyle {\frac{4 \sigma }{\sqrt{C_{\min }}}} + \Vert \, (X_{S_0}^T X_{S_0} / N)^{-1} \, \Vert _{\infty } \, \right] }_{B(\lambda _N, \, \sigma ; \, X)} \end{aligned}$$(36)
where \(\Vert A \Vert _\infty \) for a matrix A is defined as \(\max \limits _{\Vert u \Vert _\infty = 1} \Vert A \, u \Vert _\infty \);

4.
No false exclusion: the nonzero components of the LASSO solution \({\widehat{w}}^{\text {Lasso}}\) include all indices \(j \in S_0\) such that \(| \, w_j^{\, 0} \, | > B(\lambda _N, \, \sigma ; \, X)\), and hence the LASSO is variable selection consistent as long as \(\min \limits _{j \in S_0} | \, w_j^{\, 0} \, | > B(\lambda _N, \, \sigma ; \, X)\).
Showing the uniqueness involves solving a hypothetical problem; the authors set \({\widehat{w}}^{\text {Lasso}}_{S_0^c} = 0\) and solve a reduced-size problem in which the objective function of LASSO is minimized with respect to \(w_{S_0} \in {\mathbb {R}}^{| \, S_0 \, |}\). By properties of convexity and the first order optimality condition (referred to as the zero-subgradient condition in the reference), the authors show that all optimal solutions of the original LASSO problem are supported only on \(S_0\); thus the solutions can be obtained by solving the reduced problem. The lower eigenvalue condition (35) is then used to show the uniqueness.
By the first order optimality condition for convex nondifferentiable problems, there exists a subgradient of \(\Vert \bullet \Vert _1\), denoted by \({\widehat{z}}\), such that \(-\frac{1}{N} X^T ( \, Y - X \, {\widehat{w}}^{\text {Lasso}} \, ) + \lambda _N \, {\widehat{z}} = 0\). This equation can be rewritten in a block-matrix form by substituting the definition of Y:
$$\begin{aligned} \frac{1}{N} \left[ \begin{array}{cc} X_{S_0}^T X_{S_0} &{} X_{S_0}^T X_{S_0^c} \\ X_{S_0^c}^T X_{S_0} &{} X_{S_0^c}^T X_{S_0^c} \end{array} \right] \left[ \begin{array}{c} {\widehat{w}}^{\text {Lasso}}_{S_0} - w_{S_0}^{\, 0} \\ {\widehat{w}}^{\text {Lasso}}_{S_0^c} \end{array} \right] \, - \, \frac{1}{N} \left[ \begin{array}{c} X_{S_0}^T \, \epsilon \\ X_{S_0^c}^T \, \epsilon \end{array} \right] \, + \, \lambda _N \left[ \begin{array}{c} {\widehat{z}}_{S_0} \\ {\widehat{z}}_{S_0^c} \end{array} \right] \, = \, 0. \end{aligned}$$
This is the key equation used to show the remaining parts of the theorem. By applying the assumptions, the authors investigate the quantity \({\widehat{w}}^{\text {Lasso}}_{S_0} - w_{S_0}^{\, 0}\) by examining the above equation. Due to the presence of the error vector \(\epsilon \), probability enters the statement; the error is a zero-mean Gaussian random noise, hence the authors apply related probabilistic bounds to obtain the third part of the theorem.
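As a toy check of the zero-subgradient condition, consider an orthonormal-type design for which the LASSO solution is available in closed form via soft-thresholding. All numbers below are hypothetical and serve only to verify that \({\widehat{z}}\) is a valid subgradient of \(\Vert \bullet \Vert _1\) at \({\widehat{w}}^{\text {Lasso}}\) and that the support of the solution is contained in that of \(w^{\, 0}\).

```python
import numpy as np

# Orthonormal-type design: X^T X / N = I, so the LASSO solution is the
# soft-thresholding of X^T Y / N (a special case, not the general setting).
N, d, lam = 4, 4, 0.5
X = np.sqrt(N) * np.eye(d)
w0 = np.array([2.0, -1.0, 0.0, 0.0])   # hypothetical ground truth
Y = X @ w0                              # noise-free for a clean illustration

b = X.T @ Y / N
w_hat = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)   # soft threshold

# zero-subgradient condition: (1/N) X^T (Y - X w_hat) = lam * z_hat,
# with z_hat_j = sign(w_hat_j) on the support and |z_hat_j| <= 1 off it.
z_hat = X.T @ (Y - X @ w_hat) / (N * lam)
```

In this example \({\widehat{w}}^{\text {Lasso}} = (1.5, -0.5, 0, 0)\): shrinkage by \(\lambda \) on the support and exact zeros off it, consistent with the "no false inclusion" property of the theorem.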
Cite this article
Ahn, M. Consistency bounds and support recovery of d-stationary solutions of sparse sample average approximations. J Glob Optim 78, 397–422 (2020). https://doi.org/10.1007/s10898-019-00857-z
Keywords
 Nonconvex optimization
 Sparse learning
 Difference-of-convex program
 Directional stationary solution