1 Introduction

Rapid improvements in sensing technologies have created high-dimensional data, with many features and huge numbers of samples, in both research fields and industries. The increasing data volume poses a great challenge for contemporary statistical learning algorithms (John Lu, 2010) in various research areas, including high-resolution imaging (Bruckstein et al., 2009), target tracking (Zhang et al., 2012; 2013; 2014; 2015a; 2015b), astronomical data processing (Borne, 2009), genomics (Kim and Xing, 2014), functional and longitudinal data processing (Jenatton et al., 2012), and warehouse data analysis in business (Fan et al., 2011). For example, astronomical projects produce more than 10⁹ pixels every 20 s and terabytes of data in a single evening (Borne, 2009). Financial data are measured across hundreds of financial instruments and tracked over time, with 10⁶ trades per second in high-frequency trading (Fan et al., 2011).

It is almost impossible to learn a consistent model with high accuracy, model explicability, and computational efficiency at the same time, unless one assumes that the sample size is much larger than the feature size (Candès and Tao, 2007). However, in high-dimensional settings, the feature dimension is often comparable to, or even larger than, the sample size. Traditional methods therefore face significant challenges, ranging from theoretical analysis and efficient algorithm design to model estimation and interpretation. Note that consistent estimators may still be obtained if additional assumptions are imposed on the traditional models (Negahban et al., 2012). A widely used constraint is that models should be sparse for high-dimensional problems (Bach et al., 2011).

1.1 Sparsity

In a sparse model, only a small number of variables are non-zero among all the variables in the model. The sparsity assumption is typically associated with desirable properties such as succinct interpretation, fast evaluation of the model, robust statistical performance, and computational advantages, which appeal to a great many researchers.

Sparsity is preferred in learning problems with high-dimensional data. In many applications, although the raw data is high dimensional, its intrinsic dimension is relatively low. For example, in bioinformatics, different high-dimensional genes may belong to the same functional group; in multi-task learning, several estimators are expected to share common types of covariates. In fact, it is now widely accepted that sparsity is a powerful assumption for contemporary machine learning algorithms (Bach et al., 2012a).

Sparsity is considered to be one of the most significant philosophical and aesthetic principles that have ever existed. It is also known as Occam's razor (Rasmussen and Ghahramani, 2001): "Entities should not be multiplied without necessity", attributed to William of Ockham in the 13th century. The parsimony principle has been revisited again and again, leading to several beautiful results, such as the minimum description length (MDL) principle (John Lu, 2010). Modern sparse learning can be traced back to Wrinch and Jeffreys (1921), who expressed the sparsity of models in physics as the number of non-zero learning variables, a notion very close to today's definition of sparsity. Since then, numerous tools (see Mairal et al. (2014) and multiple references therein) have been developed in the statistics community to build sparsity-related models, which have greatly improved model explicability and dramatically decreased the computational cost of prediction. With the efforts of researchers and engineers, sparse learning has become a popular tool, supported by well-developed theoretical frameworks and various efficient algorithms. The theoretical results range from the original ideas in underdetermined linear systems (see Bruckstein et al. (2009) and references therein), signal processing (Chen and Donoho, 1994), and statistical learning (Tibshirani, 1996), through the computational complexity of the ℓ0-norm regularized problem and the uniqueness of the solution of ℓ0-norm regularized optimization (see Donoho and Huo (2001), Tropp et al. (2003), Elad (2010), and references therein), and the model selection consistency of the convex relaxation from the ℓ0-norm to the ℓ1-norm (Candès et al., 2006; Zhao and Yu, 2006; Candès and Tao, 2007; Candès, 2008; Candès and Recht, 2009; Zhang, 2009), to the statistical analysis of extended algorithms (Elad, 2010; Jenatton, 2011). The proposed algorithms range from greedy algorithms for ℓ0-norm regularized methods (Tropp, 2004) to convex optimization methods applied after convex approximation (see Friedman et al. (2007), Beck and Teboulle (2009), Jenatton et al. (2011), Yang and Yuan (2013), and references therein).

Recently, a line of work has been devoted to the framework of empirical risk minimization with sparsity-inducing regularizations, which is commonly formulated as

$$\min\limits_{x \in {\mathcal X}} \;l(x) + \lambda \cdot r(x),$$
((1))

where \(x \in \mathbb{R}^d\) (d is the dimensionality of the features), \({\mathcal X}\) is the feasible domain, l(·) is the empirical loss function, r(·): \(\mathbb{R}^d \to \mathbb{R}\) is a sparsity-inducing regularization function, and λ is a parameter balancing the two terms. Problem (1) accommodates quite a few classic classification and regression models, including linear regression, obtained by setting \(l(x) = \Vert A^{\rm T}x - b\Vert_2^2/2\), logistic regression, obtained by setting \(l(x) = \sum\nolimits_{i=1}^N \log(1 + \exp(-b_i A_i^{\rm T}x))\), and the linear support vector machine (SVM), obtained by letting \(l(x) = \sum\nolimits_{i=1}^N \max(0,\;1 - b_i A_i^{\rm T}x)\), where \(A \in \mathbb{R}^{d \times N}\) holds the data samples (\(A_i\) being the ith sample) and \(b \in \mathbb{R}^N\) holds their labels. For variable selection problems in linear models, sparsity may be achieved directly by penalizing the number of non-zero elements (Tibshirani, 1996), specifically r(x) = ∥x∥0, which is known as the ℓ0-norm of the variable x.
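To make these formulations concrete, the following sketch evaluates the objective of problem (1) for the three losses named above; the function names are illustrative rather than taken from any particular package, and the shapes follow the convention \(A \in \mathbb{R}^{d \times N}\), \(b \in \mathbb{R}^N\).

```python
import numpy as np

def least_squares_loss(x, A, b):
    # l(x) = ||A^T x - b||_2^2 / 2
    return 0.5 * np.sum((A.T @ x - b) ** 2)

def logistic_loss(x, A, b):
    # l(x) = sum_i log(1 + exp(-b_i * A_i^T x)), with labels b_i in {-1, +1}
    return np.sum(np.log1p(np.exp(-b * (A.T @ x))))

def hinge_loss(x, A, b):
    # l(x) = sum_i max(0, 1 - b_i * A_i^T x), the linear SVM loss
    return np.sum(np.maximum(0.0, 1.0 - b * (A.T @ x)))

def l0_penalty(x):
    # r(x) = ||x||_0, the number of non-zero coefficients
    return np.count_nonzero(x)

def objective(x, A, b, lam, loss=least_squares_loss, penalty=l0_penalty):
    # problem (1): empirical loss plus lambda times the sparsity penalty
    return loss(x, A, b) + lam * penalty(x)
```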

In Fig. 1, the signal is recovered through a sparse linear regression algorithm. Under the constraint that the representation of the signal is sparse, i.e., the number of active bases is relatively small compared to the whole basis set, the original bases are identified while the noise is removed.

Fig. 1

Signal recovery from a noisy observation through a sparse linear regression algorithm. There are six Gaussian distributed bases with means of (3, 4, 10, 30, 35, 40) and variances of (1, 3, 3, 4, 2, 4). The observed signal is a mixture of the second, third, and fourth bases with weights of (1, 3, 2), respectively. The sparse learning method determines exactly the bases and weights of the original signal (References to color refer to the online version of this figure)

When r(·) is nonconvex, it has been reported in the literature that nonconvex regularization usually yields solutions with more desirable structural properties. Let us take the ℓ0-norm regularized least squares problem (i.e., l(·) is a least squares function) as an example. It is well known that such a problem is NP-hard because of its combinatorial nature. To this end, the ℓ1-norm regularized model was proposed to pursue computational tractability. In spite of its computational advantages and successful applications, the ℓ1 model has limits in certain scenarios (Candès et al., 2008), since the ℓ1-norm comes at the price of shifting the resulting estimator by a constant (Fan and Li, 2011), which leads to an over-penalized problem. To circumvent the issues pertaining to the ℓ1-norm, researchers have imposed nonconvex regularizations on problem (1), which have been proven to be better approximations of the ℓ0-norm both theoretically and computationally. Several nonconvex regularization functions have been widely used in sparse learning (Gong et al., 2013), including the ℓp-norm (0 < p < 1) (Chartrand and Yin, 2008; Foucart and Lai, 2009; Xu et al., 2012; Lai et al., 2013; Wang Y et al., 2013), the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2011), the log-sum penalty (LSP) (Candès et al., 2008), the minimax concave penalty (MCP) (Zhang CH, 2010), and the capped-ℓ1 penalty (Zhang T, 2010; Zhang, 2013). In Table 1, we present the details of these regularization functions for the readers' convenience.

Table 1 Examples of nonconvex penalty r(x)
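As an illustration of how such penalties can be evaluated, the sketch below implements three of the penalties listed above (ℓp, LSP, and capped-ℓ1) in one common parameterization; the table's exact conventions for the parameters θ and p may differ.

```python
import numpy as np

def lp_penalty(x, p=0.5):
    # l_p penalty with 0 < p < 1: sum_i |x_i|^p
    return np.sum(np.abs(x) ** p)

def log_sum_penalty(x, theta=1.0):
    # LSP (Candes et al., 2008): sum_i log(1 + |x_i| / theta)
    return np.sum(np.log1p(np.abs(x) / theta))

def capped_l1_penalty(x, theta=1.0):
    # capped-l1 (Zhang T, 2010): sum_i min(|x_i|, theta)
    return np.sum(np.minimum(np.abs(x), theta))
```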

1.2 Structured sparsity

Structured sparse learning is commonly used in two situations. First, it is used when prior knowledge indicates that the model should be sparse in a structured way. Second, even if the underlying problem does not admit a structured sparse solution, one may look for the best structured sparse approximation to make the model more interpretable or easier to use in subsequent procedures.

The ℓ1-norm can be used to induce model parsimony; however, it does not encode structural information. To encode structured sparsity, various structured regularizations have been proposed. These regularizations bring structural information into traditional sparse learning and are known as structured sparsity-inducing norms. Structured sparsity may thus be achieved by adding an explicitly structured regularization, and structured sparsity-inducing norms are natural extensions of the ℓ0-norm. The structured learning problem can be formulated as

$$\min\limits_{x \in {\mathcal X}} \;l(x) + \lambda \cdot R(x),$$
((2))

where R(·) is the structured sparsity-inducing regularization, which can be seen as an extension of pure sparsity-inducing penalization. Compared with the sparse learning problem (1), the term R(·) encodes the structural information, which gives formulation (2) a clear advantage over traditional sparse learning algorithms when pursuing structured models.

1.3 Aim and scope of this paper

In this article, we review several structured sparsity-inducing norms (Fig. 2), ranging from grouped sparsity (Fig. 2b), fused sparsity (Fig. 2c), and hierarchical sparsity (Fig. 2d) to graphical sparsity (Fig. 2e). This review is intended to shed light on new directions in research and engineering problems where structural information can be taken into account.

Fig. 2

Illustration of sparsity and its extensions: (a) standard sparsity; (b) grouped sparsity; (c) fused sparsity; (d) hierarchical sparsity; (e) graphical sparsity (References to color refer to the online version of this figure)

Structured sparsity has been widely used in practical problems, including model-based compressive sensing (Baraniuk et al., 2010; Asaei et al., 2011a; Duarte and Eldar, 2011; Chen et al., 2014), signal processing (Bach and Jordan, 2006; Asaei et al., 2011a; 2014a; 2014b; Najafian, 2016), computer vision (Jenatton et al., 2009; Kim et al., 2013; Chen and Huang, 2014; Karygianni and Frossard, 2014; Xiao et al., 2016), bioinformatics (Wille and Bühlmann, 2006; Zhang SZ et al., 2011; Kim and Xing, 2012), and recommendation systems (Koren et al., 2009; Takacs et al., 2009; Rendle and Schmidt-Thieme, 2010; Zhang ZK et al., 2011). This article focuses mainly on their formulations and algorithms. The models and the algorithms are relatively independent of each other (Xu et al., 2005; Zhang et al., 2006; Xu et al., 2007; Yuan et al., 2015; Hu and Yu, 2016; Xie et al., 2016; Xie and Tong, 2016; Zhu et al., 2016), as is also the case in the privacy research field (Sun et al., 2015a; 2015b; Wu et al., 2016).

Note that other books and articles offer diverse perspectives on sparse learning methods, including Elad (2010) and Mallat (2008) from a signal processing perspective, and Bach et al. (2012b) from an optimization viewpoint. This article, in contrast, focuses on intuitive formulations, their variants, and the algorithms used to solve them.

1.4 Notations

Vectors are denoted by bold lower-case letters and matrices by upper-case ones. ∥x∥0 is the number of non-zero elements in a vector x, ∥x∥1 is the sum of the absolute values of the elements of x, the ℓq-norm of a vector \(x \in \mathbb{R}^n\) is defined as \({\left\Vert x \right\Vert _q} := {(\sum\nolimits_{i = 1}^n {\vert x_i\vert^q})^{1/q}}\) for q > 0, and \({\left\Vert x \right\Vert _\infty} := \max_{i = 1, 2, \ldots, n} \vert x_i\vert\), where \(x_i\) denotes the ith coordinate of x. The Frobenius norm of a matrix \(X \in \mathbb{R}^{d \times n}\) is defined as \({\left\Vert X \right\Vert _{\rm{F}}} := {(\sum\nolimits_{i = 1}^d {\sum\nolimits_{j = 1}^n {x_{ij}^2}})^{1/2}}\), where \(x_{ij}\) denotes the entry of X at the ith row and jth column. \(X_i\) denotes the ith column of X.

2 Comparisons of different structured sparsities and computational complexity of optimization methods

In this section, we summarize the structured sparsities presented in this article and compare the computational complexity of typical optimization methods for the corresponding structured sparse learning problems.

In Table 2, we list the formulations of these structured sparse learning problems, their corresponding optimization algorithms, and publicly available software implementations.

We also compare the convergence rates of typical first-order optimization methods under convex and strongly convex conditions in Table 3. In this article, we focus mainly on first-order optimization methods. First-order methods typically require access to the objective function's gradient or subgradient, and take the form \(x_{t+1} = x_t - \alpha_t g_t\) for some step size \(\alpha_t\) and descent direction \(g_t\). As such, each iteration takes approximately O(n) time. A comparison of several first-order methods was provided by Qiao et al. (2016a). Higher-order methods are excluded from this article, because some open issues still need to be addressed before they can be applied to large-scale machine learning problems.
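A minimal sketch of the generic first-order update \(x_{t+1} = x_t - \alpha_t g_t\) described above, instantiated for ℓ1-regularized least squares with a subgradient step; the step-size schedule is an illustrative choice, not one prescribed by the methods in Table 3.

```python
import numpy as np

def subgradient_step(x, A, b, lam, alpha):
    # g is a subgradient of ||A^T x - b||_2^2 / 2 + lam * ||x||_1 at x
    g = A @ (A.T @ x - b) + lam * np.sign(x)
    return x - alpha * g

def subgradient_solve(A, b, lam, iters=1000):
    x = np.zeros(A.shape[0])             # x lives in R^d
    for t in range(1, iters + 1):
        # the update itself is O(d); forming g costs O(dN) for this loss
        x = subgradient_step(x, A, b, lam, alpha=0.01 / np.sqrt(t))
    return x
```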

Table 2 An overview of formulations, processing algorithms, and software packages for Lasso and its structured extensions
Table 3 An overview of optimization methods’ computational complexity

3 Grouped structured sparsity

In many regression problems, the variables are predefined in groups as prior knowledge, as in practical settings such as analysis of variance (ANOVA). In this setting, the typical goal is to choose the major effects and interactions among variables. For example, in supervised learning problems where the classification label is generated from collections of categorical predictors, the selection of significant variables corresponds to the selection of groups of variables. To handle these problems, grouped structured sparsity has been proposed, and the extended methods are often called the 'grouped Lasso' (Yuan and Lin, 2006).

3.1 Formulations of grouped structured sparsity

The grouped Lasso was studied and generalized by Yuan and Lin (2006). Assume that the number of groups is fixed and finite, and that the predictors are divided into g groups with \(p_i\) being the number of predictors in the ith group \(G_i\), \(p = \sum\nolimits_{i = 1}^g {{p_i}}\), and \(G_i \cap G_j = \emptyset\) for i ≠ j.

The grouped Lasso is formulated as

$$\min\limits_{\beta \in \mathbb{R}^{p}} \left\Vert y - {\beta_0} - \sum\limits_{i = 1}^g {X_i^{\rm{T}}{\beta_i}} \right\Vert_{\rm{F}}^2 + \lambda \sum\limits_{i = 1}^g {\sqrt {{p_i}}\, \Vert {\beta_i}\Vert_2}{.}$$
((3))

The sparsity of the solution depends on the magnitude of the tuning parameter λ, and it exploits the non-differentiability of ∥β_i∥2 at β_i = 0. Note that the grouped Lasso estimates and the grouped sparsity pattern converge to the correct patterns in probability (Bach, 2008b).
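The group-level sparsity mechanism can be made explicit with the block soft-thresholding operator, i.e., the proximal operator of the weighted ℓ2 group norm used in problem (3); the sketch below is a generic illustration rather than the exact update of any specific solver cited later.

```python
import numpy as np

def group_soft_threshold(beta_i, t):
    # proximal operator of t * ||.||_2: drop the whole group if its norm
    # falls below the threshold, otherwise shrink it towards zero
    norm = np.linalg.norm(beta_i)
    if norm <= t:
        return np.zeros_like(beta_i)
    return (1.0 - t / norm) * beta_i

# e.g., inside a proximal or blockwise update, the threshold for group i would
# be scaled by the step size and the group weight sqrt(p_i):
# beta[G_i] = group_soft_threshold(beta[G_i], step * lam * np.sqrt(p_i))
```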

To obtain sparsity at both the group level and the element-wise level of individual variables, Simon et al. (2013) proposed the sparse grouped Lasso, which is formulated as

$$\min\limits_{\beta \in \mathbb{R}^{p}} \left\Vert y - {\beta _0} - \sum\limits_{i = 1}^g {X_i^{\rm{T}}{\beta _i}} \right\Vert_{\rm{F}}^2 + {\lambda _1}\sum\limits_{i = 1}^g {\sqrt {{p_i}}\, \Vert {\beta _i}\Vert_2} + {\lambda _2}\Vert \beta \Vert_1,$$
((4))

where the second term controls sparsity at the group level, and the third term controls sparsity at the element-wise level of individual variables. When λ1 = 0, problem (4) degenerates to the standard Lasso (Tibshirani, 1996); when λ2 = 0, it reduces to the grouped Lasso (problem (3)). Because the grouped Lasso may suffer from estimation inefficiency and selection inconsistency, an adaptive grouped Lasso method (Wang and Leng, 2008) has been proposed as a remedy. Considering the potential non-uniqueness of solutions and high computational costs, a generalized linear model (GLM) formulation with an active-set algorithm has been proposed (Roth and Fischer, 2008). The number of groups is allowed to grow with the number of observed data points (Meinshausen and Yu, 2008), and an extension with dynamic group division is available in Mougeot et al. (2013).

3.2 Algorithms for grouped structured sparsity

The grouped Lasso optimization problem (3) can be solved through a coordinate gradient descent algorithm, which is applicable to a broad class of convex loss functions and whose convergence has been established (see Hong et al. (2015) and references therein). The blockwise coordinate descent (BCD) algorithm was also used to solve the problem in Meier et al. (2008) and Liu H et al. (2009). However, the BCD method may perform poorly on ill-conditioned problems, so a modified BCD algorithm was proposed in Vincent and Hansen (2014), which first computes a descent direction at the given point and then uses a line search to find the next iterate. There are also publicly accessible software implementations that can be used to solve the problem. For example, the SLEP package (Liu J et al., 2009) was used in Xie and Xu (2014). More available implementations are listed in Table 2.

The performance of these algorithms was comparatively studied by Rakotomamonjy (2011), who concluded that, depending on the performance measure, greedy approaches and iteratively re-weighted algorithms are more efficient in terms of either computational complexity or recovered sparsity.

3.3 Applications of grouped structured sparsity

There are numerous applications that pursue grouped sparsity. In computer vision, each group corresponds to a different data source or data type, and different data sources may be referred to as views. In speech and signal processing, similar groups represent different frequency bands (McAuley et al., 2005). In Vincent and Hansen (2014), the grouped Lasso was used for multinomial classification and solved with a coordinate gradient descent method. In document processing, Bengio et al. (2009) combined grouped sparse learning with a bag-of-words document representation.

4 Fused structured sparsity

In many real-world applications, the coefficients are organized in a specific order and exhibit local constancy. For instance, organizing variables in blocks as prior knowledge may produce a better result. In this setting, it is necessary to extend the Lasso to exploit the ordered structure, and the extension is called the 'fused Lasso' (Tibshirani et al., 2005).

4.1 Formulations of fused structured sparsity

In regression problems, the variables x may have a natural order. Specifically, the variables are ordered according to some index variable t, and the feature profile exhibits local constancy. These variables serve as predictors. To exploit the local constancy of the coefficients, Tibshirani et al. (2005) proposed the fused Lasso, which extends the Lasso penalty to take ordering into account. The extended method is also called the 'generalized Lasso'. The fused Lasso can be formulated as

$$\begin{array}{*{20}c}{\min\limits_{\beta \in \mathbb{R}^{p}}} & {\sum\limits_{i = 1}^N {\Big\Vert {y_i} - {\beta_0} - \sum\limits_{j = 1}^p {{x_{ij}}{\beta_j}} \Big\Vert_{\rm{F}}^2 + {\lambda_1}\sum\limits_{j = 1}^p {\Vert {\beta_j}\Vert_1}}} \\ {} & { + {\lambda_2}\sum\limits_{j = 1}^{p - 1} {\Vert {\beta_{j + 1}} - {\beta_j}\Vert_1},\quad \quad \quad \quad \quad \quad \quad \quad} \end{array}$$
((5))

where \(\beta \in \mathbb{R}^p\) are the variables of the model, and the N pairs (x_i, y_i) are the noisy training data. There are three terms in the formulation: the first term empirically minimizes the training error on the given dataset, the second term encourages sparsity in the coefficients, and the third term penalizes the differences of adjacent coefficients. These terms are balanced by the parameters λ1 and λ2. Note that the fused Lasso assumes that the index t is uniformly spaced (John Lu, 2010); otherwise, the third term should be generalized to divided differences \({\lambda_2}\sum\nolimits_{j = 1}^{p - 1} {\vert{\beta_{j + 1}} - {\beta_j}\vert/({t_{j + 1}} - {t_j})}\).
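For clarity, the sketch below evaluates the fused Lasso objective (5) directly; here X is the N × p design matrix with rows x_i, and the divided-difference variant mentioned above would be obtained by dividing each difference by t_{j+1} − t_j.

```python
import numpy as np

def fused_lasso_objective(beta0, beta, X, y, lam1, lam2):
    residual = y - beta0 - X @ beta
    data_fit = np.sum(residual ** 2)                 # empirical training error
    sparsity = lam1 * np.sum(np.abs(beta))           # element-wise sparsity term
    fusion = lam2 * np.sum(np.abs(np.diff(beta)))    # local-constancy (fusion) term
    return data_fit + sparsity + fusion
```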

A generalized two-dimensional fused Lasso was proposed in Friedman et al. (2007). In this model, the parameters are laid out on a grid of pixels and penalized with a 2D total variation (TV) norm (Rudin et al., 1992), which is commonly used in image denoising, image smoothing, and data recovery. The general strategy of the two-dimensional fused Lasso can be applied directly to higher-order problems, where the parameters form a tensor and a higher-order TV-norm is used as the regularization.

4.2 Algorithms to solve fused structured sparsity

The fused Lasso is a strictly convex problem in β. For the 1D fused Lasso, SQOPT (Gill et al., 2008) can be used directly; it implements a two-phase active-set algorithm designed for quadratic programming problems with sparse linear constraints. For a class of convex optimization problems, a coordinate descent algorithm was presented in Friedman et al. (2007); it is a one-at-a-time coordinate-wise descent algorithm and can be generalized to solve the 2D fused Lasso or even higher-dimensional cases.

To handle large-scale fused Lasso problems, Ye and Xie (2011) proposed an iterative algorithm based on the split Bregman method. Wang LC et al. (2013) presented an augmented Lagrangian method (ALM) for general convex losses. Li et al. (2014) proposed a fast linearized alternating direction method to solve the general Lasso model, and Qiao et al. (2016b; 2016c) improved the method to solve structured nonconvex problems.
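As an illustration of how such solvers work, the following is a textbook ADMM sketch for the generalized Lasso form \(\min_x \Vert Ax - b\Vert_2^2/2 + \lambda \Vert Fx\Vert_1\), written in the spirit of the augmented Lagrangian and linearized methods cited above; it is not a reimplementation of any of those papers, and here A is the N × n design matrix with samples as rows.

```python
import numpy as np

def soft_threshold(v, t):
    # element-wise soft-thresholding, the proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def generalized_lasso_admm(A, b, F, lam, rho=1.0, iters=200):
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(F.shape[0])
    u = np.zeros(F.shape[0])
    M = A.T @ A + rho * F.T @ F          # fixed matrix of the x-update system
    for _ in range(iters):
        x = np.linalg.solve(M, A.T @ b + rho * F.T @ (z - u))
        z = soft_threshold(F @ x + u, lam / rho)   # prox step on the l1 term
        u = u + F @ x - z                          # scaled dual update
    return x

# For the 1D fused Lasso, F stacks an identity block (element-wise sparsity)
# and the first-order difference matrix (fusion), e.g.:
# n = 50; D = np.diff(np.eye(n), axis=0); F = np.vstack([np.eye(n), D])
```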

Experimental results in Hoefling (2010) showed that Tibshirani's and Friedman's algorithms are simple and fast, and represent the state of the art. The linearized method of Li et al. (2014) can be used to handle larger-scale problems.

4.3 Applications of fused structured sparsity

The fused Lasso has been validated on protein mass spectroscopy (MS), gene expression, and image smoothing problems. Protein MS, which holds great promise for biomarker identification and proteomics profiling, was used as a motivating example to demonstrate the efficacy of the fused Lasso in Tibshirani et al. (2005) and Tibshirani and Wang (2008). As demonstrated in Huang et al. (2005), another important application of the fused Lasso is the reconstruction of copy numbers from comparative genomic hybridization (CGH) array data. For images with smoothness among neighboring pixels, the fused Lasso can achieve rather good performance (Friedman et al., 2007).

5 Hierarchically structured sparsity

In hierarchical sparsity, the variables are organized hierarchically (Xu et al., 2011) or integrated into a tree, forming a union of potentially overlapping, predefined groups. Hierarchical sparsity may be achieved through various extended sparsity-inducing norms, and the extension is often called the 'hierarchical Lasso' (Zhao et al., 2009).

5.1 Formulations of hierarchically structured sparsity

The hierarchical Lasso assumes that the p variables are assigned to the nodes of a tree T, or of a forest. In this setting, if a feature is selected, then all of its ancestors in T must also be selected; conversely, if a node is not selected, then none of its descendants are selected.

The hierarchical Lasso was first presented in Zhao et al. (2009). The authors proposed a structured penalty called the composite absolute penalty (CAP). By allowing the groups to overlap, CAP can represent a hierarchical structure among the predictors. For a given group or hierarchical structure, the CAP penalty is specialized for grouped and hierarchical selection. Assume the grouping is denoted as G = (G1, G2, …, G_K), and the norm parameters are denoted as \(\gamma = ({\gamma_0},\;{\gamma_1},\; \ldots \;,\;{\gamma_K}) \in \mathbb{R}_{+} ^{K +1}\). The CAP penalty with grouping G can be formulated as follows:

$${T_{G,\gamma}}(\beta ) = \sum\limits_k {\Vert {\beta_{{G_k}}}\Vert _{{\gamma_k}}^{{\gamma_0}},}$$

where \({\beta_{{G_k}}} = \{ {\beta_j}\vert j \in {G_k}\}\), and the corresponding CAP estimate for the tuning parameter λ is

$${\hat \beta_{G,\gamma}}(\lambda ) =\arg \min\limits_\beta (L(Z,\beta ) + \lambda {T_{G,\gamma}}(\beta )),$$
((6))

where L(Z, β) is the loss function. Specific overlapping patterns corresponding to the given structure can be used for hierarchical variable selection. For piecewise quadratic loss functions, CAP with ℓ1- or ℓ∞-norms has the advantage that its regularization paths are piecewise linear. If γ_i ≥ 1 (i = 0, 1, …, K), then \(T_{G,\gamma}(\beta)\) is convex. If the loss function L(·) is convex in β, then the objective function of the CAP estimation is convex.
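For concreteness, the CAP penalty \(T_{G,\gamma}(\beta)\) defined above can be evaluated directly for a list of (possibly overlapping) index groups; the example groups and γ values in the sketch below are hypothetical.

```python
import numpy as np

def cap_penalty(beta, groups, gamma):
    # T_{G,gamma}(beta) = sum_k ||beta_{G_k}||_{gamma_k}^{gamma_0}
    gamma0 = gamma[0]
    total = 0.0
    for k, G_k in enumerate(groups, start=1):
        group_norm = np.linalg.norm(beta[G_k], ord=gamma[k])
        total += group_norm ** gamma0
    return total

# hypothetical two-level hierarchy over five variables (the root group covers
# all indices, its child covers the last two), with gamma_0 = 1 and l2 norms:
# groups = [np.arange(5), np.array([3, 4])]; gamma = [1, 2, 2]
# value = cap_penalty(np.array([0.0, 0.5, -1.0, 2.0, 0.0]), groups, gamma)
```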

In compressive sensing, Baraniuk et al. (2010) presented tree sparsity in the context of sparse wavelet decompositions. The consistency of sparse estimators with potentially overlapping groups was established in Jacob et al. (2009). In sparse coding, Jenatton et al. (2010) proposed an extension in which the atoms are further assumed to be embedded in a tree, achieved using tree-structured sparsity-inducing norms.

5.2 Algorithms of hierarchically structured sparsity

The BLasso algorithm (Zhao and Yu, 2007), which was derived from a coordinate descent method with a fixed step size applied to the general Lasso loss function, can be used to solve the minimization problem. Zhao et al. (2009) extended BLasso and proposed the hiCAP algorithm for hierarchical variable selection, which is valid for the ℓ2-loss when γ0 = 1 and γ_k = ∞, or for a tree-structured hierarchy in graph representation. Considering the formulation's non-separability and non-smoothness, Chen et al. (2012) proposed the smoothing proximal gradient (SPG) method, which combines a smoothing technique with an effective proximal gradient method to solve structured sparse regression problems with a smooth convex loss.

Proximal methods (Parikh and Boyd, 2014), which generalize the projection operator onto a convex set, have recently been shown to be effective in solving variational problems. To accelerate convergence, Mosci et al. (2010) added a strictly convex function to the objective, and their experimental results showed that this substantially reduces the number of optimization iterations. After introducing auxiliary variables, Micchelli et al. (2013) used an alternating minimization algorithm with a projection procedure to solve the problem and established a convergence theorem. Villa et al. (2014) accelerated the proximal method by using a new active set strategy to compute the proximal operator. When the objective function is strongly convex, the proximal stochastic gradient method of Xiao and Zhang (2014), which iterates in an incremental way, provides an efficient means of solving large-scale problems.
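For tree-structured ℓ2 group norms, Jenatton et al. (2010) showed that the proximal operator can be computed exactly by composing group soft-thresholdings from the leaves towards the root; the sketch below is a minimal illustration under that assumption, with hypothetical group indices and weights.

```python
import numpy as np

def group_soft_threshold(v, t):
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def tree_structured_prox(x, groups, weights, step):
    # 'groups' must be ordered leaves-to-root: every group appears before any
    # group that contains it; each pass applies one group soft-thresholding
    x = x.copy()
    for G, w in zip(groups, weights):
        x[G] = group_soft_threshold(x[G], step * w)
    return x

# hypothetical depth-2 tree over four variables: leaves {2} and {3}, their
# parent {2, 3}, and the root {0, 1, 2, 3}
# groups = [np.array([2]), np.array([3]), np.array([2, 3]), np.arange(4)]
# x_new = tree_structured_prox(np.array([1.0, -2.0, 0.3, 0.1]),
#                              groups, weights=[1.0, 1.0, 1.0, 0.5], step=0.1)
```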

5.3 Applications of hierarchically structured sparsity

In wavelet decompositions, it is natural to organize the coefficients in a tree because of their multi-scale structure, which benefits image compression and denoising (Baraniuk et al., 2010; Huang et al., 2011). In dictionary learning, Jenatton (2011) used hierarchical dictionary learning for image restoration and performed multi-scale mining of fMRI data for the prediction of simple cognitive tasks (Jenatton et al., 2012). In genetics, Kim and Xing (2010) exploited the tree structure of gene networks for multi-task regression. In topic modeling, Blei et al. (2010) proposed a hierarchical model of latent variables based on Bayesian non-parametric methods to model hierarchies of topics.

6 Graphically structured sparsity

The graph is a powerful data structure for model construction in many machine learning algorithms, such as graphical models (Wainwright and Jordan, 2008) and high-dimensional model selection (Meinshausen and Bühlmann, 2006). Sparse graphs have relatively few edges, are economical to use, and offer good interpretability. The problem of estimating sparse graphs may be solved by the 'graphical Lasso' (Banerjee et al., 2008), which extends the Lasso to the inverse covariance matrix.

6.1 Formulations of graphically structured sparsity

A graph G consists of a set of vertices N and an edge set E. In undirected graphical models, each vertex represents a random variable, and each edge represents the dependence relationship between the two vertices it connects. The absence of an edge between two vertices has a special meaning: the corresponding random variables are conditionally independent given the rest of the variables.

Wermuth (1976) showed that if the variables of graph G are jointly Gaussian, then conditional independence between two variables corresponds to a zero entry in the precision matrix. Model selection for undirected Gaussian graphical models is therefore equivalent to selecting the non-zero elements of the precision matrix. Dempster (1972) named this problem 'covariance selection'. Since the entries of the precision matrix are continuous, it is natural to extend variable selection to edge selection on the graph (see Banerjee et al. (2008) and references therein).

Assume that there are N multivariate normal observations \(x_i\) (i = 1, 2, …, N) with population mean μ and covariance Σ. The empirical covariance matrix is \(S = \sum\nolimits_{i = 1}^N {({x_i} - \bar x){{({x_i} - \bar x)}^{\rm{T}}}/N}\), where \(\bar x\) is the sample mean. Ignoring constants, the log-likelihood of the data can be written as \(\ell(\Theta) = \log \det \Theta - {\rm trace}(S\Theta)\), which is the Wishart log-likelihood with \(\Theta = \Sigma^{-1}\). The quantity ℓ(Θ) is a concave function of Θ, and the (unpenalized) maximum likelihood estimate of Σ is S. Adding Lagrange constants for all missing edges leads to the penalized log-likelihood

$${\ell_{\rm{c}}}(\Theta ) = \log \det \Theta - {\rm{trace}}(S\Theta ) - \lambda \Vert \Theta \Vert_1,$$
((7))

where ∥Θ∥1 is the element-wise ℓ1-norm of Θ, and the term λ∥Θ∥1 acts as a sparsity-inducing penalty on the inverse covariance matrix.
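For reference, the penalized log-likelihood (7) can be evaluated directly from a candidate precision matrix Θ and the empirical covariance S; the sketch assumes Θ is positive definite.

```python
import numpy as np

def penalized_log_likelihood(Theta, S, lam):
    # l_c(Theta) = log det(Theta) - trace(S Theta) - lam * ||Theta||_1
    _, logdet = np.linalg.slogdet(Theta)     # assumes Theta is positive definite
    return logdet - np.trace(S @ Theta) - lam * np.sum(np.abs(Theta))
```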

6.2 Algorithms of graphically structured sparsity

The optimization problem (7) is more convenient to solve than the original model selection problem. Banerjee et al. (2008) showed that problem (7) is convex and can be solved by optimizing over each row and the corresponding column of W, the current estimate of the covariance matrix Σ, in a block coordinate descent fashion. Concretely, W and S can be partitioned as

$$W = \left( {\begin{array}{*{20}c}{{W_{11}}} & {{w_{12}}} \\ {w_{12}^{\rm{T}}} & {{w_{22}}} \end{array}} \right),\;S = \left( {\begin{array}{*{20}c}{{S_{11}}} & {{s_{12}}} \\ {s_{12}^{\rm{T}}} & {{s_{22}}} \end{array}} \right),$$

and the gradient equation for maximizing problem (7) is \(\Theta^{-1} - S - \lambda \cdot {\rm sign}(\Theta) = 0\). The solution for \(w_{12}\) is then given by solving

$${w_{12}} = \arg \min\limits_y \{ {y^{\rm{T}}}W_{11}^{ - 1}y:\Vert y - {s_{12}}\Vert_\infty \leq \rho \}{.}$$

Following Banerjee's work, Friedman et al. (2008) proposed an algorithm named the 'graphical Lasso' that uses a coordinate descent method. Motivated by the success of convex relaxation for the rank-minimization problem, Chandrasekaran et al. (2012) introduced a regularized maximum normal likelihood decomposition framework with a trace-norm penalty term, and Ma et al. (2013) developed a proximal-gradient-based alternating direction method of multipliers to solve these problems.
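A publicly available implementation of the coordinate-descent graphical Lasso is provided by recent versions of scikit-learn; a usage sketch is given below, assuming the GraphicalLasso estimator and its alpha parameter (playing the role of λ in problem (7)) available in those versions.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))         # 200 samples of 10 variables

model = GraphicalLasso(alpha=0.1).fit(X)   # alpha plays the role of lambda
Theta = model.precision_                   # sparse estimate of the precision matrix
edges = np.argwhere(np.abs(Theta) > 1e-8)  # non-zero entries <-> graph edges
```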

In practice, tuning the regularization parameter is essential for good results. Yuan and Lin (2007) proposed a Bayesian information criterion (BIC) type criterion for selecting the tuning parameter.

6.3 Applications of graphically structured sparsity

Graphical Lasso has been applied to various research fields, including gene network discovery and social-network data analysis.

Jones and West (2005) applied graphical Lasso to the analysis of gene expression data, which consists of 8408 variables and has roughly a multivariate Gaussian distribution. Friedman et al. (2008) used it to analyze a flow cytometry dataset of 11 proteins and 7466 cells, and produced a directed acyclic graph in cell signal data.

In network discovery, Leng and Tang (2012) used the graphical Lasso to analyze U.S. Department of Agriculture export data from 1970 to 2009 and presented the resulting network among regions. Considering that the structure may vary over time, Kolar and Xing (2011) showed that the structure of a time-varying undirected graphical model can be consistently estimated in the high-dimensional setting, where the dimensionality of the model is allowed to diverge with the sample size.

7 Experiments

In this section, four numerical experiments are performed to evaluate the structured sparse learning methods discussed in the previous sections. The first two experiments demonstrate that structured sparsity-inducing methods can recover the signal where standard sparse learning cannot. The third experiment demonstrates that a structured method yields greater model parsimony than standard sparse learning in wavelet coefficient selection. The fourth experiment demonstrates that accuracy can be increased by using the graphical structure in logistic regression.

Several publicly accessible software packages are used, including SparseLab (Donoho et al., 2007), SLEP (Liu J et al., 2009), SPAMS (Mairal et al., 2011), YALL1 (Zhang Y et al., 2011), and SPGL1 (van den Berg and Friedlander, 2007). In this article, we use mainly SPAMS and YALL1 to solve the optimization problems. All numerical experiments are conducted with Matlab 7.12.0 on a laptop with an Intel Core i7-4710MQ 2.5 GHz CPU and 4 GB of RAM.

7.1 Measurements

Accuracy, model explainability, and computational complexity are the three most important aspects to consider in machine learning algorithms. The latter two are dominated mainly by model parsimony. Thus, we use prediction accuracy on a test dataset and model parsimony to evaluate the structured sparse learning methods in this study.

The prediction accuracy on the test dataset is formulated as

$${\rm{Ac}}{{\rm{c}}_{{\rm{test}}}} = {{{\rm{TP}} + {\rm{TN}}} \over {P + N}},$$
((8))

where TP is the number of true positives (hits), TN is the number of true negatives (correct rejections), and P and N are the numbers of positive and negative test samples. We propose a parsimony ratio to measure model parsimony, which is formulated as

$${\rm{PR}} = {{{\rm{Number}}\;{\rm{of}}\;{\rm{variables}}\;{\rm{in}}\;{\rm{the}}\;{\rm{model}}} \over {{\rm{Number}}\;{\rm{of}}\;{\rm{variables}}\;{\rm{in}}\;{\rm{the}}\;{\rm{original}}\;{\rm{model}}}}{.}$$
((9))
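Written out directly, the two measures above amount to the following; 'pred' and 'truth' are the predicted and true test labels, and 'coef' is the estimated coefficient vector, with the original model assumed to use every coordinate.

```python
import numpy as np

def test_accuracy(pred, truth):
    # (TP + TN) / (P + N): the fraction of correctly classified test samples
    return np.mean(pred == truth)

def parsimony_ratio(coef):
    # variables kept by the sparse model over variables in the original
    # (fully dense) model, assumed here to use every coordinate of coef
    return np.count_nonzero(coef) / coef.size
```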

7.2 Signal recovery with fused structured sparsity

To show the capability of the fused structure for solving fused structured problems, we conduct the following tests. First, we generate the regression coefficient \(\hat x \in \mathbb{R}^{n}\) for n = 1000 as

$$\hat x = \left\{ {\begin{array}{*{20}c}{{r_1},} & {j = 1,\;2,\; \cdots \;,\;100,\quad \quad} \\ {{r_2},} & {j = 201,\;202,\; \cdots \;,\;300,} \\ {{r_3},} & {j = 401,\;402,\; \cdots \;,\;500,} \\ {{r_4},} & {j = 601,\;602,\; \cdots \;,\;700,} \\ {{r_5},} & {{\rm{else}},\quad \quad \quad \quad \quad \quad \quad \quad} \end{array}} \right.$$

where the scalars r1, r2, r3, and r4 are randomly generated and uniformly distributed on (0, 1). The plot of \(\hat x\) is shown in Fig. 3a. The entries of the matrix A ∈ ℝm×n with m = 500 and n = 1000 are drawn from the standard normal distribution \({\mathcal N}(0,\;1)\). The observations b ∈ ℝm are then created as the signs of \(A\hat x + e\), where e ∈ ℝm is a random vector with distribution \({\mathcal N}(0,\;0{.}05)\). The parameters of the fused model (problem (5)) are set as λ1 = 5 × 10⁻⁴ and λ2 = 5 × 10⁻², while in the sparse model (problem (1)), λ1 = 5 × 10⁻².
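A sketch that approximately reproduces this synthetic setup is given below; the value used outside the four supported blocks (r5 above) and the reading of 0.05 as the noise variance are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 500

r = rng.uniform(0.0, 1.0, size=5)              # r1, ..., r5 (r5 is an assumption)
x_hat = np.full(n, r[4])                       # background value outside the blocks
for k, start in enumerate((0, 200, 400, 600)):
    x_hat[start:start + 100] = r[k]            # blocks 1-100, 201-300, 401-500, 601-700

A = rng.standard_normal((m, n))                # entries drawn from N(0, 1)
e = rng.normal(0.0, np.sqrt(0.05), size=m)     # noise, reading N(0, 0.05) as variance
b = np.sign(A @ x_hat + e)                     # observations are the signs
```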

Fig. 3

Coefficients recovered with fused structure regularization: (a) original coefficients; (b) coefficients recovered with the fused sparse model (problem (5)); (c) coefficients recovered with sparse learning (without the fused structure regularized term)

Fig. 3b shows that the fused model can preserve the natural ordering rather well. Fig. 3c shows that the sparse model presents a sparse solution, but cannot preserve the natural ordering. This example demonstrates that fused structured sparse learning surpasses sparse learning when the original data has a fused structure.

7.3 Signal recovery improved by grouped structured sparsity

To show the capability of the grouped structure for solving grouped structured problems, we conduct the following tests. We consider solving problem (3) with weighted groups. First, we create a random m × n encoding matrix and scale it by normalizing its rows. Then, we generate groups with the desired number of unique groups and determine a weight for each group. Next, the observations are generated from the encoding matrix and the grouped sparse vector, with Gaussian noise added to the observations.

The original signal is presented in Fig. 4a. Fig. 4b shows that the grouped model can recover the signal rather well. Fig. 4c shows that the sparse model without the grouped regularization term yields a sparse solution but cannot recover the original signal, even though its data-fitting loss is relatively low. This example demonstrates that grouped structured sparse learning surpasses sparse learning when the original data has a grouped structure.

Fig. 4

Coefficients recovered with grouped structure regularization: (a) original coefficients; (b) coefficients recovered with the grouped sparse model (problem (3)); (c) coefficients recovered with sparse learning (without the grouped structure regularized term)

7.4 Model parsimony gained by hierarchically structured sparsity

To show the capability of the hierarchical structure, we conduct the following tests. First, an image with five rectangles is randomly generated. Then, we apply ordinary linear sampling to measure the 4096 wavelet coefficients directly. For comparison with the linear scheme, we also apply a hierarchically structured scheme that uses 1152 wavelet coefficients.

Fig. 5b shows that the linear model can reconstruct the image well, with \(\left\| {{x_{{\rm{lin}}}} - {x_0}} \right\|_{\rm{F}}^2/\left\| {{x_0}} \right\|_{\rm{F}}^2 = 0.3567\). Fig. 5c shows that the hierarchical model can greatly reduce the number of coefficients, from 4096 to 1152, giving a model parsimony rate of 1152/4096 = 28.1%, while the relative error remains comparable, \(\left\Vert {{x_{{\rm{hie}}}} - {x_0}} \right\Vert _{\rm{F}}^2/\left\Vert {{x_0}} \right\Vert _{\rm{F}}^2 = 0{.}3568\). This example demonstrates that hierarchically structured sparse learning can greatly improve model parsimony beyond sparse learning when the original data is structured hierarchically.

Fig. 5

Images reconstructed with hierarchically structured sparsity: (a) original image; (b) linear reconstruction from 4096 samples \(\left\| {{x_{{\rm{lin}}}} - {x_0}} \right\|_{\rm{F}}^2/\left\| {{x_0}} \right\|_{\rm{F}}^2 = 0.3567\); (c) hierarchically structured reconstruction from 1152 samples (\(\left\Vert {{x_{{\rm{hie}}}} - {x_0}} \right\Vert _{\rm{F}}^2/\left\Vert {{x_0}} \right\Vert _{\rm{F}}^2 = 0{.}3568\) and the model parsimony rate is 1152/4096=28.1%)

7.5 Prediction accuracy improved by graphically structured sparsity

To show the ability of the graphical structure to solve graphically structured problems, we conduct the following tests. The experiment is conducted on the binary classification dataset '20 Newsgroups' (www.cs.nyu.edu/~roweis/data.html). The 20 Newsgroups dataset is a collection of approximately 20 000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It has become a popular dataset for experiments in text applications of machine learning techniques, such as text classification and text clustering. Here, the first 100 words are selected, and we use 80% of the samples for training and 20% for testing. To reduce statistical variability, experimental results are averaged over 10 repetitions. First, we use the graphical Lasso to generate the graphical relationship. Then, we propose a novel graph-guided method for classification and compare its average prediction accuracy with that of a standard classifier without the graphical structure penalty.

In the second step, we propose a novel graph-guided logistic regression approach, which is formulated as

$$\min\limits_x \;l(x) + {\gamma \over 2}\Vert x\Vert _2^2 + \lambda \Vert Fx\Vert _1,$$
((10))

where \(l(x) = \sum\nolimits_{i = 1}^N {l(x,\;{\xi_i})}/N\), \(l(x,\;{\xi_i})\) is the logistic loss on data sample ξ_i, and λ > 0 is a regularization parameter. Furthermore, F is the penalty matrix promoting the desired graphical structure of x. Specifically, F in problem (10) is generated by sparse inverse covariance selection (Scheinberg et al., 2010). Problem (10) can be solved by the stochastic primal-dual hybrid gradient method proposed in our prior work (Qiao et al., 2016a).
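For clarity, the graph-guided objective (10) can be evaluated directly as follows; 'Xi' holds the data samples ξ_i as rows, 'y' their ±1 labels, and F is the graph-guided penalty matrix described above.

```python
import numpy as np

def graph_guided_objective(x, Xi, y, F, gamma, lam):
    logistic = np.mean(np.log1p(np.exp(-y * (Xi @ x))))  # l(x), averaged over samples
    ridge = 0.5 * gamma * np.dot(x, x)                   # (gamma / 2) * ||x||_2^2
    fusion = lam * np.sum(np.abs(F @ x))                 # lambda * ||F x||_1
    return logistic + ridge + fusion
```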

Fig. 6a shows that the graphical model captures the graphical relationships among words very well. Fig. 6b shows that graph-guided logistic regression greatly improves prediction accuracy over standard logistic regression. This example shows that a graphically structured model can greatly improve prediction accuracy.

Fig. 6

Classification with graphically structured sparsity: (a) graphical relationship among words; (b) average prediction accuracy of graph-guided logistic regression and standard logistic regression

8 Conclusions and discussion

Structured sparse learning methods incorporate specific structural information into sparse learning methods and have been used in various fields. In this article, we reviewed the development of the theory, formulations, algorithms, and applications of the latest structured sparse learning methods, including grouped structured sparsity, fused structured sparsity, hierarchically structured sparsity, and graphically structured sparsity. For each type of structured sparsity, we presented the original formulation, its variants, and the mathematical motivation behind these methods, addressed the algorithms for solving the resulting problems, and discussed how incorporating prior knowledge improves the explicability of the sparse estimates and/or increases prediction performance in related research fields.

Experiments have been conducted to demonstrate the advantages of structured sparse learning algorithms over standard sparse learning methods. These experiments showed that structured sparsity-inducing methods can achieve better performance than standard sparse learning methods. We also proposed a novel graph-guided logistic regression method to demonstrate the efficacy of the graphical structure. However, experimental results on supercomputers (Yang et al., 2010; 2011) are still expected, and power-efficient algorithms (Lai et al., 2015; 2016) and algorithms for new infrastructures (Chen et al., 2016) are still required.

Though structured sparse learning methods have shown great success from scientific research fields to industrial engineering, there are still many issues to be addressed:

  1. Online learning algorithms for structured sparsity problems. Most current structured sparse models are optimized in a batch manner. In real-world applications, the training data volume may be huge or the data may arrive sequentially. Online learning is a better choice for these problems.

  2. Efficient algorithms for nonconvex models. Today, most structured models are solved through a convex approximation of the original formulation. However, it has been reported in the statistics literature that nonconvex regularization usually yields solutions with more desirable structural properties, as in the ℓ0-norm regularized least squares problem (i.e., l(·) is a least squares function). Efficient nonconvex algorithms are needed for these stricter models.

  3. Specific structure-inducing regularization. Many structure-inducing regularizations have been proposed, and many of them have been applied in a wide range of fields. For specifically structured problems, new regularizations are still needed to induce the specific structure.