1 Introduction

We consider an optimal experimental design problem of the form

$$\begin{aligned} \underset{\varvec{w} \in \varDelta }{\varvec{{\text {minimize}}}}\ \ \varPhi _{A_K} \left( \varSigma ^{-1} + \frac{N}{\sigma ^2} \sum _{i=1}^m w_i \varvec{a}_i \varvec{a}_i^T \right) , \end{aligned}$$
(1)

where \(\varPhi _{A_K}(M)={\text {trace}}K^T M^{-1} K\) is the criterion of \(A_K\)-optimality for some matrix \(K\in {\mathbb {R}}^{n \times r}\) depending on the quantity to be estimated, \(\varSigma \) is a known positive definite matrix, the constants N, \(\sigma \) and the vectors \(\varvec{a}_i \in {\mathbb {R}}^n\), \(\forall i\in [m]:=\{1,\ldots ,m\},\) are known, and \(\varDelta :=\{\varvec{w}\in {\mathbb {R}}^m:\ \varvec{w}\ge \varvec{0},\ \sum _{i=1}^m w_i=1\}\) is the probability simplex.

This problem was first introduced in Duncan and DeGroot (1976) under the name \(\psi \)-optimality, and studied in detail by Pilz (1983) and Chaloner (1984), who observed that this problem could also be called Bayes A-optimality, a name still used in the literature. Nevertheless, Bayes-optimal designs can also be used in a non-Bayesian context, when the experimenter is committed to a first batch of trials and needs to select an additional batch of N trials, cf. Chaloner (1984). More background on Problem (1) will be given in Sect. 2.

We should observe that Problem (1) is in fact the continuous relaxation of the following discrete problem, which we call N-exact Bayes \(A_K\)-optimal design:

$$\begin{aligned} \underset{\varvec{n}\in \varDelta _N}{\varvec{{\text {minimize}}}}\ \ \varPhi _{A_K} \left( \varSigma ^{-1} + \frac{1}{\sigma ^2} \sum _{i=1}^m n_i \varvec{a}_i \varvec{a}_i^T \right) , \end{aligned}$$
(2)

where \(\varDelta _N:=\{\varvec{n}\in {\mathbb {Z}}_{\ge 0}^m:\ \sum _{i=1}^m n_i=N\}\) is the standard discrete N-simplex, and \(n_i\) represents the number of trials to perform at the ith design point. While Problem (2) is of immediate relevance for the experimenter, it has a hard combinatorial structure; in particular, it contains as a special case the problem of exact \(\varvec{c}\)-optimality, which was proved to be NP-hard in Černý and Hladík (2012). Therefore, it is almost impossible to certify global optimality of a design \(\varvec{n}\), except for small instances, when a mixed integer second order cone programming solver can be used (Sagnol and Harman 2015). To overcome this issue, the classical machinery of approximate design theory introduces a variable \(w_i=\frac{n_i}{N}\) and relaxes the integer constraints “\(N w_i\in {\mathbb {Z}}\)”, which leads to the convex optimization problem (1). In practice, the solution of Problem (1) gives a lower bound on the optimal value of (2). This can be used to ascertain the quality of an exact design \(\varvec{n}\), which can typically be computed by using heuristic methods, such as exchange algorithms (see, e.g. Atkinson and Donev 1992, Chap. 12) or, as recently proposed, with particle swarm optimization (Lukemire et al. 2018). Alternatively, rounding methods can be used to turn an approximate design \(\varvec{w}^*\) (i.e., a solution to Problem (1)) into a good exact design \(\varvec{n}\in \varDelta _N\), which works particularly well when the total number N of trials is large (Pukelsheim and Rieder 1992). For more details on the subject, we refer the reader to the monographs of Fedorov (1972) or Pukelsheim (1993).

Many different approaches have been proposed to solve Problem (1). The traditional methods are the Fedorov–Wynn type vertex-direction algorithms (Fedorov 1972; Wynn 1970) and the closely related vertex exchange methods (Böhning 1986), the multiplicative weight update algorithms (Silvey et al. 1978; Yu 2010), and interior point methods based on semidefinite programming (Fedorov and Lee 2000) or second-order cone programming (Sagnol 2011) formulations. Recent progress in this area has been obtained by employing hybrid methods that alternate between steps of the aforementioned algorithms (the cocktail algorithm Yu 2011), or by using randomization (Harman et al. 2018).

Contribution and organization The main contribution of this article is a new reformulation of Problem (1) as a convex, unconstrained optimization problem, which brings to light a strong connection with the well-studied problem of group lasso regression (Yuan and Lin 2006). The particular structure of the new formulation also suggests algorithmic ideas based on proximal decomposition methods, which already proved to be very useful in machine-learning and signal processing applications (Bach et al. 2012; Beck and Teboulle 2009; Combettes and Pesquet 2011). An appealing property of these methods is that they come with rigorous convergence guarantees, and yield sparse iterates very quickly, corresponding to designs with only a few support points.

The rest of this paper is organized as follows. In Sect. 2 we give more background on Problem (1), and show how this problem can be reformulated as an unconstrained convex optimization problem involving a squared group lasso penalty. Then, we characterize the proximity operator of this penalty in Sect. 3. This makes it possible to use a new class of algorithms, described in Sect. 4, to solve the reformulated problem. Finally, Sect. 5 presents some numerical experiments comparing performances of the proposed algorithm to existing approaches.

2 Problem reformulation

2.1 The Bayes \(A_K\)-optimal design problem

For the sake of completeness, we briefly explain the derivation of Problem (2) and its relaxation for approximate designs, Problem (1). For more details, the reader is referred to Pilz (1983) or Chaloner (1984).

The experimental design is specified by a vector \(\varvec{n}=(n_1,\ldots ,n_m)\in \varDelta _N\), whose ith entry indicates the number of replications at the ith design point. Specifically, we obtain random observations from a linear model

$$\begin{aligned} y_{ij} = \varvec{a}_i^T\varvec{\theta } + \epsilon _{ij},\quad \forall i\in [m], \forall j\in [n_i], \end{aligned}$$

where the measurements are unbiased (i.e., \({\mathbb {E}}[\epsilon _{ij}]=0\)), uncorrelated (i.e., \((i,j)\ne (k,\ell ) \implies {\mathbb {E}}[\epsilon _{ij} \epsilon _{k\ell }]=0\)), and the variance is known: \({\mathbb {E}}[\epsilon _{ij}^2]=\sigma ^2\). Further, we have a prior observation \(\varvec{\theta }_0=\varvec{\theta }+\varvec{\eta }\), for some random vector \(\varvec{\eta }\in {\mathbb {R}}^n\) satisfying \({\mathbb {E}}[\varvec{\eta }]=\varvec{0},\ {\mathbb {E}}[\varvec{\eta }\varvec{\eta }^T]=\varSigma \), and \({\mathbb {E}}[\varvec{\eta }\varvec{\epsilon }^T]=0\).

Denote by \(\varvec{y}\) the vector of \({\mathbb {R}}^m\) with the averaged observations at each location, that is, \(y_i = \frac{1}{n_i}\sum _{j=1}^{n_i} y_{ij}\) (and \(y_i\) can be set to some arbitrary constant whenever \(n_i=0\)). Then, it is well known that the best linear unbiased estimator (BLUE) for \(\varvec{\theta }\) solves the least squares problem

$$\begin{aligned} \underset{\varvec{\theta }\in {\mathbb {R}}^n}{\varvec{{\text {minimize}}}}\quad \frac{1}{\sigma ^2} (A\varvec{\theta }-\varvec{y})^T {\text {Diag}}(\varvec{n}) (A\varvec{\theta }-\varvec{y}) + (\varvec{\theta }-\varvec{\theta }_0)^T\varSigma ^{-1}(\varvec{\theta }-\varvec{\theta }_0), \end{aligned}$$

which admits the closed-form solution \( {\hat{\varvec{\theta }}}:=M(\varvec{n})^{-1} (A^T {\text {Diag}}(\varvec{n}) \frac{\varvec{y}}{\sigma ^2} + \varSigma ^{-1} \varvec{\theta }_0), \) where \( M(\varvec{n}):= \varSigma ^{-1}+\frac{1}{\sigma ^2} A^T {\text {Diag}}(\varvec{n}) A =\varSigma ^{-1} + \frac{1}{\sigma ^2} \sum _{i=1}^m n_i \varvec{a}_i \varvec{a}_i^T \) is the information matrix of the design, i.e., the inverse of the variance–covariance matrix of \(\varvec{{\hat{\theta }}}\). Then, the exact Bayes \(A_K\)-optimal design problem is to find a vector \(\varvec{n}\in \varDelta _N\) that minimizes \(\varPhi _{A_K}(M(\varvec{n}))\), which amounts to minimizing the trace of the variance–covariance matrix of the BLUE \(\varvec{{\hat{\zeta }}}=K^T\varvec{{\hat{\theta }}}\) for \(\varvec{\zeta }:=K^T\varvec{\theta }\). Finally, the (approximate) Bayes \(A_K\)-optimal design problem (1) is obtained by introducing the variable \(\varvec{w}=\varvec{n}/N \in \varDelta \), and by relaxing the integrality constraints \(N\varvec{w}\in {\mathbb {Z}}^m\). We find it convenient to introduce the symbol \(\sigma _N^2 = \frac{\sigma ^2}{N}\), so the information matrix of an approximate design \(\varvec{w}\in \varDelta \) can be written as

$$\begin{aligned} M_N(\varvec{w}):= M(N\, \varvec{w}) = \frac{1}{\sigma _N^2} \sum _{i=1}^m w_i \varvec{a}_i \varvec{a}_i^T + \varSigma ^{-1}. \end{aligned}$$

We conclude this part by mentioning another common situation that leads to a problem of the form (1). It is well known that \({\hat{\eta }}(\varvec{x}):=\phi (\varvec{x})^T \hat{\varvec{\theta }}\) is the best linear unbiased predictor (BLUP) for \(\eta (\varvec{x}):=\phi (\varvec{x})^T\varvec{\theta }\), and its variance is \(\phi (\varvec{x})^T M_N(\varvec{w})^{-1} \phi (\varvec{x})\). If \(\mu \) is a measure over \({\mathcal {X}}\) weighting the interest of the experimenter to predict \(\eta \) at \(\varvec{x}\in {\mathcal {X}}\), a natural criterion to consider is the integrated mean squared error, \({\text {IMSE}}(\varvec{w}):= \int _{\varvec{x}\in {\mathcal {X}}} \phi (\varvec{x})^T M_N(\varvec{w})^{-1} \phi (\varvec{x})\, d\mu (\varvec{x})\). The minimization of \({\text {IMSE}}(\varvec{w})\) can be cast as an \(A_K\)-optimal design problem, because:

$$\begin{aligned} {\text {IMSE}}(\varvec{w})&= \int _{\varvec{x}\in {\mathcal {X}}} \phi (\varvec{x})^T M_N(\varvec{w})^{-1} \phi (\varvec{x})\, d\mu (\varvec{x})\\&= {\text {trace}}M_N(\varvec{w})^{-1} KK^T = \varPhi _{A_K}(M_N(\varvec{w})), \end{aligned}$$

where \(KK^T\) is a Cholesky decomposition of the positive semidefinite matrix \(\int _{\varvec{x}\in {\mathcal {X}}} \phi (\varvec{x}) \phi (\varvec{x})^T\, d\mu (\varvec{x})\). We point out that large scale problems involving the minimization of \({\text {IMSE}}(\varvec{w})\) arise for the design of experiments over a random field, when a truncated Karhunen–Loève expansion is used to approximate the covariance kernel. This idea was first proposed in Fedorov (1996), and used for the sequential design of computer experiments in Gauthier and Pronzato (2016), Gauthier and Pronzato (2017), Sagnol et al. (2016), and Spöck and Pilz (2010).
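
The following minimal NumPy sketch illustrates this reduction: given a feature map and a quadrature rule with nodes `xs` and weights `mus` approximating the measure \(\mu \), it returns a factor K with \(KK^T\) approximating \(\int \phi (\varvec{x})\phi (\varvec{x})^T\, d\mu (\varvec{x})\). The function and argument names are ours, and the quadrature-based discretization is an assumption of the sketch, not a prescription of the paper.

```python
import numpy as np

def imse_factor(phi, xs, mus):
    """Sketch: return K with K @ K.T approximating int phi(x) phi(x)^T dmu(x),
    so that trace(M_N(w)^{-1} K K^T) approximates IMSE(w). The quadrature nodes xs
    and weights mus discretizing mu are assumptions of this sketch."""
    Phi = np.stack([phi(x) for x in xs])          # rows phi(x_k)^T
    G = Phi.T @ (np.asarray(mus)[:, None] * Phi)  # moment matrix of phi under the quadrature rule
    try:
        K = np.linalg.cholesky(G)                 # K K^T = G when G is positive definite
    except np.linalg.LinAlgError:
        lam, U = np.linalg.eigh(G)                # eigen-factorization fallback for semidefinite G
        K = U * np.sqrt(np.clip(lam, 0.0, None))
    return K
```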

2.2 Reformulation as an unconstrained convex problem

Now, take a linear estimator \(\hat{\varvec{\zeta }}=X\varvec{y}+H\varvec{\theta }_0\) of \(\varvec{\zeta }=K^T\varvec{\theta }\) for some matrices \(X \in {\mathbb {R}}^{r\times m}\) and \(H \in {\mathbb {R}}^{r \times n}\). This estimator is unbiased if and only if \(XA+H=K^T\), and we have \({\mathbb {V}}[{\hat{\varvec{\zeta }}}]= \sigma _N^2 X {\text {Diag}}(\varvec{w})^{-1} X^T + H\varSigma H^T\), where the symbol \({\mathbb {V}}\) is used to denote the variance–covariance matrix of a random vector. By the Gauss–Markov theorem, minimizing over X and H the quantity \(\sum _{i=1}^r {\mathbb {V}}[{\hat{\zeta }}_i] = {\text {trace}}{\mathbb {V}}[{\hat{\varvec{\zeta }}}]\) subject to \(\hat{\varvec{\zeta }}\) being unbiased yields the BLUE, in which case we have already seen that \(\varPhi _{A_K}(M_N(\varvec{w}))=\sum _{i=1}^r {\mathbb {V}}[{\hat{\zeta }}_i]\). Hence, using this expression for the variance, the Bayes \(A_K\)-optimal design is obtained by minimizing further with respect to \(\varvec{w}\), i.e., it can be obtained by solving the following optimization problem

$$\begin{aligned} \underset{\varvec{w}\in {\mathbb {R}}^m, X \in {\mathbb {R}}^{r\times m}, H \in {\mathbb {R}}^{r \times n}}{\varvec{{\text {minimize}}}}&\quad {\text {trace}}\sigma _N^2 X {\text {Diag}}(\varvec{w})^{-1} X^T + H\varSigma H^T \\ s.t.&\quad XA+H=K^T\nonumber , \quad \varvec{w}\ge \varvec{0},\quad \sum _{i=1}^m w_i=1. \nonumber \end{aligned}$$
(3)

The objective function is convex, as it can be written as \(\Vert H \varSigma ^{1/2}\Vert _F^2 + \sigma _N^2 \sum _i \frac{\Vert \varvec{x}_i\Vert ^2}{w_i}\), where \(\varvec{x}_i\) is the ith column of \(X\in {\mathbb {R}}^{r\times m}\); this is the sum of a convex quadratic and the perspective functions \((\varvec{x}_i,w_i)\mapsto \frac{\Vert \varvec{x}_i\Vert ^2}{w_i}\) of the convex functions \(\varvec{x}_i\mapsto \Vert \varvec{x}_i\Vert ^2\); see Boyd and Vandenberghe (2004). We also point out that this problem can be reformulated as a second order cone program (SOCP); see Sagnol (2011).

For a fixed X, consider the function \(J:\varvec{w}\mapsto \sum _{i=1}^m \frac{\Vert \varvec{x}_i\Vert ^2}{w_i}\) from \({\mathbb {R}}_+^m\) to \({\mathbb {R}}\cup \{+\infty \}\). We use the convention that \(\Vert \varvec{x}_i\Vert ^2 / 0 = 0\) whenever \(\Vert \varvec{x}_i\Vert = 0\), \(i = 1,\ldots , m\), which amounts to summing over the indices with nonzero numerators, \(J:\varvec{w}\mapsto \sum _{i: \Vert \varvec{x}_i\Vert > 0} \frac{\Vert \varvec{x}_i\Vert ^2}{w_i}\), and ensures that J is well defined. We also assume that \(\sum _i \Vert \varvec{x}_i\Vert > 0\), so that J is not constant. In this case, J is minimized over the probability simplex at \(w_i^* = \frac{\Vert \varvec{x}_i\Vert }{\sum _j \Vert \varvec{x}_j\Vert }\), \(i=1,\ldots ,m\). In other words,

$$\begin{aligned} \varvec{w}^* \in \arg \min \quad \left\{ J(\varvec{w}),\quad \mathrm {s.t.} \quad w_i \ge 0,\, i =1 ,\ldots ,m,\quad \sum _{i=1}^m w_i = 1 \right\} . \end{aligned}$$
(4)

Since J is convex, this can be verified by checking the first-order Karush–Kuhn–Tucker (KKT) conditions: note that \(\varvec{w}^*\) is feasible for (4), that J is differentiable at \(\varvec{w}^*\), and that for all \(i = 1, \ldots , m\),

$$\begin{aligned} \left. \frac{\partial J(\varvec{w})}{\partial w_i}\right| _{\varvec{w}=\varvec{w}^*} = {\left\{ \begin{array}{ll} -\frac{\Vert \varvec{x}_i\Vert ^2}{w_i^{*2}} =- \left( \sum _j \Vert \varvec{x}_j\Vert \right) ^2&{}\text { if } w_i^* > 0;\\ 0&{} \text { otherwise.} \end{array}\right. } \end{aligned}$$
(5)

Equation (5) is precisely the KKT optimality condition at \(\varvec{w}^*\) for Problem (4) (see e.g. Bertsekas 1999, Example 3.4.1). Plugging the expression of \(\varvec{w}^*\) into (3), we obtain the following problem:

$$\begin{aligned} \underset{X, H}{\varvec{{\text {minimize}}}}&\quad \Vert H \varSigma ^{1/2}\Vert _F^2 + \sigma _N^2 (\sum _i \Vert \varvec{x}_i\Vert )^2\quad s.t.&XA+H=K^T. \end{aligned}$$

We can eliminate the variable H from this problem, which leads to an unconstrained, convex optimization problem with a nice structure. We summarize our findings in the next proposition:

Proposition 1

Consider the optimization problem

$$\begin{aligned} \rho \quad :=\quad \underset{X}{\varvec{{\text {min}}}}&\quad \Vert (XA-K^T) \varSigma ^{1/2}\Vert _F^2 + \sigma _N^2 \left( \sum _i \Vert \varvec{x}_i\Vert \right) ^2. \end{aligned}$$
(6)

Then, \(\rho \) is equal to the optimal value of Problem (1), and if \(X^*=[\varvec{x}_1^*,\ldots ,\varvec{x}_m^*]\) solves Problem (6), then the design defined by \(w_i^* = \frac{\Vert \varvec{x}_i^*\Vert }{\sum _j \Vert \varvec{x}_j^*\Vert }\) is Bayes \(A_K\)-optimal.

If the square was removed from \((\sum _i \Vert \varvec{x}_i \Vert )^2\), this last term would be similar to a group lasso penalty (Yuan and Lin 2006). From a practical perspective, the main interest of this reformulation is that it paves the way toward the use of well established first order methods to tackle such problems (Beck and Teboulle 2009; Combettes and Pesquet 2011; Bach et al. 2012).

Interestingly, the idea of using a group lasso to design experiments has already been proposed in Tanaka and Miyakawa (2013). However, this paper justified the group lasso approach only heuristically, in order to select the support points of an exact design. Indeed, group lasso regression was designed to recover an approximate solution of an equation of the form \(\sum _i A_i' \varvec{x}_i \simeq \varvec{y}'\) with only a small number of nonzero blocks \(\varvec{x}_i\). It is widely known that optimal designs often have a small number of support points, and hence correspond to an estimator \(\hat{\varvec{\theta }}=X\varvec{y}\) with many columns of X equal to \(\varvec{0}\). Therefore, group lasso regression can be used to find sparse estimators that approximately satisfy the unbiasedness property \(XA\simeq K^T\). Proposition 1 shows that, in fact, one obtains an exact reformulation of the Bayes \(A_K\)-optimal design problem by squaring the penalty.
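
To make Proposition 1 concrete, here is a minimal NumPy sketch (the function names are ours) of the objective of Problem (6), of the map from a matrix X to the corresponding design, and of the Bayes \(A_K\) criterion of Problem (1); it can be used to check numerically on small instances that the optimal values of (1) and (6) agree.

```python
import numpy as np

def F(X, A, K, Sigma, sigma_N2):
    """Objective of Problem (6): ||(X A - K^T) Sigma^{1/2}||_F^2 + sigma_N^2 (sum_i ||x_i||)^2."""
    R = X @ A - K.T
    return np.trace(R @ Sigma @ R.T) + sigma_N2 * np.linalg.norm(X, axis=0).sum() ** 2

def design_from_X(X):
    """Proposition 1: w_i^* = ||x_i^*|| / sum_j ||x_j^*||."""
    norms = np.linalg.norm(X, axis=0)
    return norms / norms.sum()

def Phi_AK(w, A, K, Sigma, sigma_N2):
    """Bayes A_K criterion of Problem (1): trace K^T M_N(w)^{-1} K."""
    M = np.linalg.inv(Sigma) + (A.T * (w / sigma_N2)) @ A   # information matrix M_N(w)
    return np.trace(K.T @ np.linalg.solve(M, K))
```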

3 Convex analysis of the squared group lasso penalty

Throughout the rest of this article, we set for all \(X \in {\mathbb {R}}^{r \times m}\):

  • \(f :X \mapsto \Vert (XA-K^T)\varSigma ^{1/2}\Vert _F^2\).

  • The norm \(\varOmega :X \mapsto \sum _{i=1}^m \Vert \varvec{x}_i\Vert \) and its dual norm, \(\varOmega ^* :X \mapsto \max _{i=1\ldots m} \Vert \varvec{x}_i\Vert \).

  • \(g :{\mathbb {R}}^{r\times m} \mapsto {\mathbb {R}}\) with \(g(X) = \frac{1}{2} \varOmega (X)^2\).

  • \(L=\mathrm {trace}(A\varSigma A^T)\) is the Lipschitz constant of \(\nabla f\) with respect to the Frobenius norm (that is, \(\Vert \nabla f(X) - \nabla f(Y)\Vert _F \le L \Vert X - Y\Vert _F,\, \forall X,Y \in {\mathbb {R}}^{r \times m}\)).

where \(\varvec{x}_i \in {\mathbb {R}}^r\) denotes the ith column of X (for all \(i\in [m]\)). We use the usual Euclidean scalar product on matrices. With this notation, Problem (6) may be rewritten as

$$\begin{aligned} \mathrm {min}_X \quad&F(X):=\ f(X) + 2\, \sigma _N^2\, g(X) \end{aligned}$$
(7)

Note that the function g is convex and that the outer square destroys the separability of the inner sum in g, unlike the standard group lasso penalty. This leads to nontrivial optimization developments. The reader is referred to Rockafellar (1970) and Borwein and Lewis (2010) for a detailed exposition of the related convex analysis material. We recall two definitions for the reader's convenience.

Definition 1

(See e.g. Rockafellar 1970) Let \(\gamma :{\mathbb {R}}^q \rightarrow {\mathbb {R}}\) be a convex function. We denote by \(\gamma ^*\) and \(\partial \gamma \) the Legendre transform and the subgradient of \(\gamma \), respectively: for any \(\varvec{z} \in {\mathbb {R}}^q\),

$$\begin{aligned} \gamma ^*(\varvec{z})&= \sup _{\varvec{y} \in {\mathbb {R}}^q} \varvec{y}^T\varvec{z} - \gamma (\varvec{y}),\\ \partial \gamma (\varvec{z})&= \left\{ \varvec{v} \in {\mathbb {R}}^q;\; \gamma (\varvec{y}) \ge \gamma (\varvec{z}) + \varvec{v}^T(\varvec{y} - \varvec{z}),\, \forall \varvec{y} \in {\mathbb {R}}^q \right\} . \end{aligned}$$

For example, if \(\gamma (\cdot ) = \Vert \cdot \Vert \) is the Euclidean norm on \({\mathbb {R}}^q\), we have \(\gamma ^*(\varvec{z}) = 0\) if \(\Vert \varvec{z}\Vert \le 1\) and \(\gamma ^*(\varvec{z}) = + \infty \) otherwise. Similarly, \(\partial \gamma (\varvec{z}) = \left\{ \frac{\varvec{z}}{\Vert \varvec{z}\Vert } \right\} \) if \(\varvec{z} \ne \varvec{0}\), and \(\partial \gamma (\varvec{0}) = \left\{ \varvec{v} \in {\mathbb {R}}^q;\; \Vert \varvec{v}\Vert \le 1 \right\} \).

Lemma 1

(Subgradient and conjugate) We have the following formulas for the subgradient of g and for its Legendre transform, denoted by \(g^*\):

$$\begin{aligned}&\forall X \in {\mathbb {R}}^{r\times m}, \, \partial g (X) = \left\{ \left( \sum _{i=1}^m \Vert \varvec{x}_i\Vert \right) \left[ \varvec{v}_1,\, \varvec{v}_2,\, \ldots ,\, \varvec{v}_m \right] :\ \varvec{v}_i \in \partial \Vert \varvec{x}_i\Vert ,\quad i=1,\ldots ,m \right\} ,\\&\forall Z \in {\mathbb {R}}^{r\times m},\, g^*(Z) = \frac{1}{2} \max _{i=1,\ldots ,m}\left\{ \Vert \varvec{z}_i\Vert ^2 \right\} . \end{aligned}$$

Proof

We mostly follow Bach et al. (2012) and provide detailed arguments. Fix any \(Z \in {\mathbb {R}}^{r\times m}\); then, for any \(X \in {\mathbb {R}}^{r\times m}\),

$$\begin{aligned} \left\langle X,Z \right\rangle - g(X)&\le \varOmega ^*(Z)\varOmega (X) - \frac{1}{2} \varOmega (X)^2 \le \frac{1}{2} \varOmega ^*(Z)^2. \end{aligned}$$

Setting \(X = \varOmega ^*(Z)\, V\) for some \(V \in \partial \varOmega ^*(Z)\), we obtain \(\varOmega (X) = \varOmega ^*(Z)\) and \(\left\langle Z, X\right\rangle = \varOmega ^*(Z)^2\), so that the above holds with equality. This entails that \(g^* = \frac{1}{2} (\varOmega ^*)^2\), which is precisely the claimed formula for the conjugate function. Now, symmetrically, for any fixed \(X \in {\mathbb {R}}^{r\times m}\), setting \(Z = \varOmega (X)\, W\) for some \(W \in \partial \varOmega (X)\), we obtain \(\varOmega ^*(Z) = \varOmega (X)\) and \(\left\langle Z, X\right\rangle = \varOmega (X)^2 = \frac{1}{2}\left( \varOmega (X)^2 + \varOmega ^*(Z)^2 \right) \), which shows by Rockafellar (1970, Theorem 23.5) that \(Z \in \partial g(X)\). The claimed form of the subgradient follows because \(\varOmega \) is a separable sum of norms, see Rockafellar (1970, Theorem 23.8). \(\square \)

Given \(t> 0\), the following lemma describes how to compute the proximity operator of \(X\mapsto t\,g(X)\):

$$\begin{aligned} {\text {prox}}_{tg}(V):=\arg \!\min _{X}\, t\cdot g(X) + \frac{1}{2} \Vert X-V\Vert _F^2. \end{aligned}$$
[Algorithm 1: computation of \({\text {prox}}_{tg}(V)\)]

Lemma 2

(Proximity operator) Let \(V \in {\mathbb {R}}^{r \times m}\), let \(\varvec{v}_i \in {\mathbb {R}}^r\) be its columns for \(i=1,\ldots ,m\), and let \(t > 0\). Then Algorithm 1 computes \({\text {prox}}_{tg}(V)\).

Proof

First note that k is well defined at Step 2, since the condition obviously holds for \(k = 1\). Furthermore, for all \(i \le k\) we have \(\Vert \varvec{v}_i\Vert \ge \Vert \varvec{v}_k\Vert \ge \frac{t}{tk+1} \sum _{j=1}^k \Vert \varvec{v}_j\Vert \). Note also that the proposed definition of \(\varvec{x}_i\) ensures that \(i > k\) for every i such that \(\varvec{v}_i = \varvec{0}\), so that there is no division by 0 and \(\varvec{x}_i = \varvec{0}\) whenever \(\varvec{v}_i = \varvec{0}\). We just need to check that \((V-X)/t \in \partial g(X)\). We have

$$\begin{aligned} \sum _{i=1}^m \Vert \varvec{x}_i\Vert = \sum _{i =1}^k \Vert \varvec{v}_i\Vert - \frac{kt}{tk + 1} \sum _{j = 1}^k \Vert \varvec{v}_j\Vert = \frac{1}{tk + 1} \sum _{j=1}^k \Vert \varvec{v}_j\Vert . \end{aligned}$$

We now consider several cases.

  • If \(i > k\) and \(\varvec{v}_i = 0\), then \(\varvec{x}_i = 0\) and \((\varvec{v}_i - \varvec{x}_i ) / t = 0 \in \partial \Vert \varvec{x}_i\Vert \).

  • If \(i > k\) and \(\varvec{v}_i \ne \varvec{0}\), then \(\varvec{x}_i = \varvec{0}\), so that \(\frac{\varvec{v}_i}{\Vert \varvec{v}_i\Vert } \in \partial \Vert \varvec{x}_i\Vert \) and \(\frac{\varvec{v}_i - \varvec{x}_i}{t} = \frac{\varvec{v}_i}{t}\). By the choice of k, \(\frac{\Vert \varvec{v}_i\Vert }{t} \le \sum _{j=1}^m \Vert \varvec{x}_j\Vert \), hence \(\frac{\varvec{v}_i - \varvec{x}_i}{t} \in \left( \sum _{j=1}^m \Vert \varvec{x}_j\Vert \right) \partial \Vert \varvec{x}_i\Vert \).

  • If \(i \le k\), then \(\left( 1-\frac{t}{tk+1} \sum _{j = 1}^k \frac{\Vert \varvec{v}_j\Vert }{\Vert \varvec{v}_i\Vert } \right) \ge 0\) and \(\varvec{v}_i \ne \varvec{0}\), so that \(\frac{\varvec{v}_i}{\Vert \varvec{v}_i\Vert } \in \partial \Vert \varvec{x}_i\Vert \). We also have \(\frac{\varvec{v}_i - \varvec{x}_i}{t} = \frac{ \varvec{v}_i }{ \Vert \varvec{v}_i\Vert } \left( \sum _{j=1}^m \Vert \varvec{x}_j\Vert \right) \).

This shows that the proposed X satisfies the subdifferential characterization in Lemma 1 and the result follows. \(\square \)
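
The following NumPy sketch implements the thresholding rule established in the proof above: it selects the largest k for which the condition of Step 2 holds on the column norms sorted in decreasing order, and then shrinks every column by the resulting threshold. It is a reconstruction consistent with the proof of Lemma 2 rather than a verbatim transcription of Algorithm 1, and the function name is ours.

```python
import numpy as np

def prox_squared_group_lasso(V, t):
    """Sketch of prox_{t g}, g(X) = 0.5 * (sum_i ||x_i||)^2, acting on the columns of V.
    Reconstructed from the proof of Lemma 2; not necessarily Algorithm 1 verbatim."""
    norms = np.linalg.norm(V, axis=0)                     # ||v_i||, i = 1..m
    if not np.any(norms > 0):
        return np.zeros_like(V)
    sorted_norms = np.sort(norms)[::-1]                   # column norms in decreasing order
    S = np.cumsum(sorted_norms)                           # S_k = sum_{j<=k} ||v_(j)||
    ks = np.arange(1, V.shape[1] + 1)
    k = ks[sorted_norms * (t * ks + 1.0) >= t * S].max()  # largest k with ||v_(k)|| >= t S_k / (t k + 1)
    tau = t * S[k - 1] / (t * k + 1.0)                    # threshold level
    with np.errstate(divide="ignore"):
        shrink = np.where(norms > 0.0, np.maximum(0.0, 1.0 - tau / norms), 0.0)
    return V * shrink                                     # x_i = v_i * max(0, 1 - tau / ||v_i||)
```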

4 Algorithms

4.1 Proximal decomposition methods

In this section we describe convex optimization algorithms dedicated to structured “smooth plus nonsmooth” problems with an easily computable proximity operator. Further details and historical comments are found in Combettes and Pesquet (2011), Beck and Teboulle (2009), and Bach et al. (2012). On the one hand, we have \(\nabla f(X) = 2 (XA-K^T) \varSigma A^T.\) On the other hand, Lemma 2 ensures that \({\text {prox}}_{tg}(V)\) can be computed by Algorithm 1. These are the building blocks of proximal decomposition algorithms. We describe the backtracking line search variants of the forward–backward and FISTA algorithms. Backtracking line search ensures minimal parameter tuning beyond the initialization. One can use a fixed step size 1/L instead, where L is the Lipschitz constant of \(\nabla f\) (see Sect. 3). The backtracking scheme which we propose is very common in the composite optimization literature (see e.g. Beck and Teboulle 2009) and is a variant of the Armijo condition adapted to our “smooth plus nonsmooth” setting.

[Algorithm: forward–backward method with backtracking line search]
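
A minimal sketch of one forward–backward iteration with such a backtracking rule is given below; it reuses the prox sketch given after Lemma 2, and the sufficient-decrease test is the standard quadratic upper bound of Beck and Teboulle (2009). Function names are ours, and details may differ from the authors' exact pseudocode.

```python
import numpy as np

def f(X, A, K, Sigma):
    """Smooth part: f(X) = ||(X A - K^T) Sigma^{1/2}||_F^2."""
    R = X @ A - K.T
    return np.trace(R @ Sigma @ R.T)

def grad_f(X, A, K, Sigma):
    """Gradient of f: 2 (X A - K^T) Sigma A^T."""
    return 2.0 * (X @ A - K.T) @ Sigma @ A.T

def fb_step(X, A, K, Sigma, sigma_N2, L_prev, eta=2.0):
    """One backtracked forward-backward step (sketch): increase L = eta^i * L_prev until the
    standard quadratic upper bound on f holds at the candidate point."""
    G, fX, L = grad_f(X, A, K, Sigma), f(X, A, K, Sigma), L_prev
    while True:
        # proximal step on the nonsmooth part 2*sigma_N^2*g with step 1/L
        # (prox_squared_group_lasso is the sketch given after Lemma 2)
        X_new = prox_squared_group_lasso(X - G / L, 2.0 * sigma_N2 / L)
        D = X_new - X
        if f(X_new, A, K, Sigma) <= fX + np.sum(G * D) + 0.5 * L * np.sum(D * D):
            return X_new, L
        L *= eta
```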

Forward–backward algorithm This is the simplest proximal decomposition algorithm. More details can be found in Combettes and Pesquet (2011) and Beck and Teboulle (2009). Known properties of this algorithm include the following, for both the constant step size and backtracking variants:

  • The sequence \(\left( X_k \right) _{k \in {\mathbb {N}}}\) converges to a solution of Problem (7) and, for any solution \(X^*\) of the problem, the sequence \(\left( \Vert X_k - X^*\Vert \right) _{k \in {\mathbb {N}}}\) is nonincreasing.

  • The objective values \(F(X_k)\) decrease monotonically along the sequence, and for all \(k\in {\mathbb {N}}\) we have

    $$\begin{aligned} F(X_k) - \rho \le \frac{\eta L \Vert X_0 - X^*\Vert _F^2}{2k}, \end{aligned}$$

    for any solution \(X^*\) of Problem (7) (see Beck and Teboulle 2009); with constant step sizes, this bound holds with \(\eta = 1\).

  • For all \(k \in {\mathbb {N}}\), we have \(L_k \le \max \{ L_0,\eta L\}\). If \(L_0 \ge L\), then \(i = 0\) is the solution of all backtracking steps on line 2. If \(L>L_0\), denoting by \(i_k\) the value of i at which the kth backtracking step succeeds, it holds that \(\sum _{k=1}^{k_{\max }} i_k \le 1 + \frac{\log \left( L \right) - \log (L_0)}{ \log (\eta )}\), so the total number of different values of i which have to be tested on line 2 is at most logarithmic in L.

FISTA acceleration It has been known since the seminal work of Nesterov (1983) that the O(1/k) rate is not optimal for convex optimization with gradient methods. Accelerated methods exist with a faster \(O(1/k^2)\) convergence rate. We now describe the FISTA algorithm (Beck and Teboulle 2009), which belongs to this family of methods and is applicable to Problem (7).

[Algorithm: FISTA with backtracking line search]
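
A sketch of the corresponding FISTA loop is given below, using the standard extrapolation sequence \(t_{k+1} = (1+\sqrt{1+4t_k^2})/2\) of Beck and Teboulle (2009) and the backtracked step from the forward–backward sketch above; again, the names are ours and this is not necessarily the authors' exact pseudocode.

```python
import numpy as np

def fista(X0, A, K, Sigma, sigma_N2, n_iter=1000, L0=1.0, eta=2.0):
    """FISTA with backtracking for Problem (7) (sketch); reuses fb_step from the
    forward-backward sketch above and the standard extrapolation sequence t_k."""
    X_prev, Y, t, L = X0, X0, 1.0, L0
    for _ in range(n_iter):
        X, L = fb_step(Y, A, K, Sigma, sigma_N2, L, eta)  # backtracked proximal gradient step at Y
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        Y = X + ((t - 1.0) / t_next) * (X - X_prev)       # momentum / extrapolation
        X_prev, t = X, t_next
    return X_prev
```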

Contrary to the forward–backward algorithm, FISTA does not produce a monotonically decreasing sequence of objective values, and convergence of the sequence \((X_k)_k\) is not yet known for this precise version, although it is for very close variants (Chambolle and Dossal 2015). The main feature of FISTA is the following complexity estimate: for any \(k \in {\mathbb {N}}\),

$$\begin{aligned} F(X_k) - \rho \le \frac{2\eta L \Vert X_0 - X^*\Vert _F^2}{(k+1)^2}, \end{aligned}$$

for any solution \(X^*\) of Problem (7) (see Beck and Teboulle 2009).

Regarding the choice of constant step sizes, the value of \(L_k\), and the number of values of i to be tested for a successful backtracking line search step, the same comments as those made for the forward–backward algorithm hold for FISTA. The complexity of one iteration of either the forward–backward or the FISTA algorithm is dominated by the cost of computing \(\nabla f(X) = 2(XA - K^T) \varSigma A^T\), which can be done in O(nrm) operations (this is the cost of multiplying \(X \in {\mathbb {R}}^{r\times m}\) by \(A \in {\mathbb {R}}^{m\times n}\) and multiplying the result by \(\varSigma A^T \in {\mathbb {R}}^{n \times m}\)). For a typical situation with \(r = n\), this is \(O(n^2 m)\). The cost of computing the proximity operator is negligible.

4.2 Block coordinate descent

An alternative to solve the unconstrained optimization problem (6) is to iteratively solve the problem for one particular block \(\varvec{x}_i\), while keeping all other blocks fixed. This idea is attractive, because optimization over a single block admits a simple closed-form solution, as the following proposition shows.

Proposition 2

Let \(i\in [m]\), and let the \(\varvec{x}_j\)’s be fixed vectors in \({\mathbb {R}}^r\) (\(\forall j\in [m], j\ne i\)). As before, \(\varvec{a}_i^T\) denotes the ith row of A, \(i = 1,\ldots ,m\). We consider the variant of Problem (6) in which we minimize the criterion with respect to the block of variables \(\varvec{x}_i\) only, that is:

$$\begin{aligned} \mathrm {min}_{\varvec{x}_i}&\quad h_i(\varvec{x}_i):= \Vert ( \varvec{x}_i \varvec{a}_i^T + R) \varSigma ^{1/2}\Vert _F^2 + \sigma _N^2 (\Vert \varvec{x}_i\Vert + \beta )^2, \end{aligned}$$
(8)

where \(R:=\sum _{j\ne i} \varvec{x}_j \varvec{a}_j^T-K^T\) and \(\beta = \sum _{j\ne i} \Vert \varvec{x}_j\Vert \). The optimal solution of this problem is given by \(\varvec{x}_i^*=\varvec{0}\) whenever \(R \varSigma \varvec{a}_i=\varvec{0}\) and

$$\begin{aligned} \varvec{x}_i^* = -\frac{1}{\varvec{a}_i^T \varSigma \varvec{a}_i + \sigma _N^2} \cdot \max \left\{ 1 -\sigma _N^2 \frac{\beta }{\Vert R\varSigma \varvec{a}_i\Vert }, 0 \right\} \cdot R \varSigma \varvec{a}_i\quad \text { otherwise.} \end{aligned}$$

Proof

We can rewrite the function to minimize as

$$\begin{aligned} h_i(\varvec{x}_i) = \varvec{a}_i^T \varSigma \varvec{a}_i \Vert \varvec{x}_i\Vert ^2 + 2 \varvec{x}_i^T R \varSigma \varvec{a}_i + \Vert R \varSigma ^{1/2} \Vert _F^2 + \sigma _N^2 (\Vert \varvec{x}_i\Vert + \beta )^2. \end{aligned}$$

Expanding the square and using the subgradient calculus rules (Rockafellar 1970, Theorem 23.8), we obtain \(\partial \left( \Vert \varvec{x}\Vert + \beta \right) ^2 = 2(\Vert \varvec{x}\Vert + \beta )\, \partial \Vert \varvec{x}\Vert \); hence the subgradient of \(h_i\) has the following form:

$$\begin{aligned} \partial h_i(\varvec{x}_i) = \left\{ \begin{array}{ll} \quad \left\{ 2 \big [ (\varvec{a}_i^T \varSigma \varvec{a}_i) \varvec{x}_i + R\varSigma \varvec{a}_i + \sigma _N^2 (\Vert \varvec{x}_i\Vert + \beta ) \frac{\varvec{x}_i}{\Vert \varvec{x}_i\Vert } \big ] \right\} &{}\ \text {if }\varvec{x}_i\ne \varvec{0}; \\ \quad \left\{ 2 \big [ R\varSigma \varvec{a}_i + \sigma _N^2 \varvec{u}\big ]:\ \Vert \varvec{u}\Vert \le \beta \right\} &{} \ \text {otherwise.} \end{array}\right. \end{aligned}$$

It remains to show that \(\varvec{0}\in \partial {h_i}(\varvec{x}_i^*)\). If \(R\varSigma \varvec{a}_i = \varvec{0}\), then \(\varvec{x}_i^* = \varvec{0}\) and the statement holds (take \(\varvec{u}=\varvec{0}\)). Assume now that \(R\varSigma \varvec{a}_i\ne \varvec{0}\); we distinguish two cases.

  • If \((1 -\sigma _N^2 \frac{\beta }{\Vert R\varSigma \varvec{a}_i\Vert })>0\), then

    $$\begin{aligned} \varvec{x}_i^* = -\frac{1}{\varvec{a}_i^T \varSigma \varvec{a}_i + \sigma _N^2} \cdot \left( 1 -\sigma _N^2 \frac{\beta }{\Vert R\varSigma \varvec{a}_i\Vert }\right) \cdot R \varSigma \varvec{a}_i\ne \varvec{0} . \end{aligned}$$

    Substituting in the expression of \(\partial h_i\), easy (though lengthy) calculations show that \(\partial h_i(\varvec{x}_i^*)=\{\varvec{0}\}\).

  • Otherwise, we have \(\Vert R\varSigma \varvec{a}_i\Vert \le \sigma _N^2 \beta \) and \(\varvec{x}_i^*=\varvec{0}\). To see that \(\varvec{0}\in \partial {h_i}(\varvec{x}_i^*)\), we need a vector \(\varvec{u}\) such that \(R\varSigma \varvec{a}_i + \sigma _N^2 \varvec{u}=\varvec{0}\) and \(\Vert \varvec{u}\Vert \le \beta \). The vector \(\varvec{u}= -\frac{1}{\sigma _N^2} R\varSigma \varvec{a}_i\) satisfies both requirements.

\(\square \)

[Algorithm: alternating minimization over the blocks \(\varvec{x}_1,\ldots ,\varvec{x}_m\)]
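
A minimal sketch of the cyclic alternating minimization loop is given below, applying the closed-form block update of Proposition 2 column by column. For clarity, it recomputes the residual from scratch at each block instead of keeping track of \(XA\varSigma \) incrementally as discussed below; the function name is ours.

```python
import numpy as np

def abcd_cyclic(X0, A, K, Sigma, sigma_N2, n_cycles=100):
    """Alternating minimization for Problem (6) (sketch): cycle over the columns of X and
    apply the closed-form update of Proposition 2. The residual is recomputed from scratch
    at every block for clarity, instead of being tracked incrementally."""
    X = X0.copy()
    m = A.shape[0]
    SA = Sigma @ A.T                                    # columns SA[:, i] = Sigma a_i
    for _ in range(n_cycles):
        for i in range(m):
            norms = np.linalg.norm(X, axis=0)
            beta = norms.sum() - norms[i]               # beta = sum_{j != i} ||x_j||
            R = X @ A - K.T - np.outer(X[:, i], A[i])   # R = sum_{j != i} x_j a_j^T - K^T
            RSa = R @ SA[:, i]                          # R Sigma a_i
            nrm = np.linalg.norm(RSa)
            if nrm == 0.0:
                X[:, i] = 0.0
            else:
                shrink = max(1.0 - sigma_N2 * beta / nrm, 0.0)
                X[:, i] = -shrink / (A[i] @ SA[:, i] + sigma_N2) * RSa
    return X
```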

Alternating minimization Block coordinate methods are widespread for large scale problems; see for example Wright (2015) for a recent overview. The idea is to update only a subset of the variables at each iteration. The choice of this subset can be made in various ways: at random with replacement, in a cyclic order, or using random permutations. Complexity analysis is often easier to carry out for sampling with replacement, but practical performance is usually superior with random permutations, which correspond to sampling without replacement (Wright 2015). Implementing the alternating minimization algorithm requires keeping track of \(XA\varSigma \in {\mathbb {R}}^{r \times n}\) (as when computing \(\nabla f\)). Keeping track of this quantity when changing a single column can be done in O(nr) operations. A full pass through the m columns can be done in O(nrm) operations, which is the same as for gradient based methods.

To our knowledge, the application of this algorithm to a problem of the form (6) is new in the optimization literature. Indeed, alternating minimization and, more generally, block coordinate methods are not convergent in general; their use is usually limited to smooth problems or to problems with a separable sum structure. Problem (6) does not have such a separable structure, because of the square in its last term.

To understand why block coordinate methods do not converge to global minima in general, consider the function \(\phi :(x,y) \mapsto \max \left\{ x + 2 y, -2x - y \right\} \). Taking, for any \(t > 0\), \(x = t\) and \(y = -t\) and letting \(t \rightarrow \infty \) shows that \(\inf _{{\mathbb {R}}^2}\phi = -\infty \). Yet it can be checked that \(0 = \arg \min _x \phi (x,0) = \arg \min _y \phi (0,y)\), so that the origin is actually a fixed point of the alternating minimization algorithm applied to \(\phi \).

However, Problem (6) has additional structure: the subgradient of its objective is a simple Cartesian product. Furthermore, partial minimization is strongly convex. Combining these properties leads to the following result. This guarantee is weak; indeed, convergence of alternating minimization methods is a difficult matter for which only few results are known, and virtually none outside of the separable nonsmooth setting.

Proposition 3

The alternating minimization algorithm applied to Problem (6), with blocks taken in a cyclic order or using random permutations, produces a nonincreasing sequence of objective function values and satisfies \(F(X_k) \rightarrow \rho \ \text {as}\ k \rightarrow \infty \).

Proof

Monotonicity is obvious here; we denote by \({\tilde{\rho }}\) the limiting value of the objective function along the sequence. The Cartesian product structure of the subgradient of g in Lemma 1 entails that the subgradient of the objective of (6) has the same Cartesian product structure. This implies that if all the columns of \(X \in {\mathbb {R}}^{r \times m}\) are blockwise optimal for Problem (6), then X itself is a global optimum. Indeed, block optimality ensures that \(\varvec{0}\) belongs to each partial subgradient in (8), and the global subgradient of (6) is the Cartesian product of the partial subgradients (Rockafellar and Wets 2009, Corollary 10.11).

Now the partial minimization in (8) is \(2 \sigma _N^2\)-strongly convex. Hence, for all \(k \in {\mathbb {N}}\),

$$\begin{aligned} F(X_{k}) - F(X_{k+1}) \ge \sigma _N^2 \Vert X_{k+1} - X_{k}\Vert ^2. \end{aligned}$$

Hence \(\{ \Vert X_{k+1} - X_{k}\Vert ^2\}_{k\in {\mathbb {N}}}\) is summable and, as \(k \rightarrow \infty \), we have \(\Vert X_{k+1} - X_{k}\Vert \rightarrow 0\). Since the objective in (6) is coercive, monotonicity implies that \(\left\{ X_k \right\} _{k \in {\mathbb {N}}}\) is a bounded sequence. Let \({\bar{X}}\) be any accumulation point of the sequence (there exists at least one).

For cyclic or random permutation selection, all blocks are visited every m iterations; hence, using the notation of Proposition 2 and the continuity of the objective function, one must have, for all \(i = 1,\ldots ,m\), that the quantity

$$\begin{aligned} \varvec{x}_{ik} - \arg \min _{\varvec{x}_i} \Vert ( \varvec{x}_i \varvec{a}_i^T + R_k) \varSigma ^{1/2}\Vert _F^2 + \sigma _N^2 (\Vert \varvec{x}_i\Vert + \beta _k)^2 \quad \underset{k \rightarrow \infty }{\rightarrow }\quad \varvec{0}, \end{aligned}$$

where \(R_k:=\sum _{j\ne i} \varvec{x}_{jk} \varvec{a}_j^T-K^T\) and \(\beta _k = \sum _{j\ne i} \Vert \varvec{x}_{jk}\Vert \). By continuity, \({\bar{X}}\) must be blockwise optimal for (6), and hence globally optimal, so that \({\tilde{\rho }} = \rho \). \(\square \)

5 Numerical experiments

5.1 Instances

As was done in Harman et al. (2018), we report numerical experiments on two kinds of instances to test the performance of proximal decomposition methods to solve Problem (6). On the one hand, we generate random instances by sampling the elements of \(A\in {\mathbb {R}}^{m \times n}\) independently from a standard normal distribution. On the other hand, we compute Bayes A-optimal designs for quadratic regression over \([-1,1]^d\):

$$\begin{aligned} y(x) = \theta _0 + \sum _{i=1}^d \theta _i x_i + \sum _{1\le i\le j\le d} \theta _{ij} x_i x_j + \epsilon . \end{aligned}$$

So in practice, to construct the matrix A we first form a regular grid \(X=\{\varvec{x}_1,\ldots ,\varvec{x}_m\}\subseteq [-1,1]^d\), and for each \(k\in [m]\) the kth row of A is set to

$$\begin{aligned} \varvec{a}_k^T=[1,(x_{ki})_{i=1,\ldots ,d},(x_{ki}\, x_{kj})_{1\le i\le j \le d}]\in {\mathbb {R}}^{n}, \end{aligned}$$

where \(n=1+d+d(d+1)/2\). In addition, for all our experiments, we set \(K=\varSigma =I_n\), and \(\sigma _N^2=0.01\).
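
For reference, a minimal NumPy sketch that builds the matrix A for the quadratic regression instances as just described (the function name is ours):

```python
import numpy as np
from itertools import combinations_with_replacement, product

def quadratic_design_matrix(d, points_per_axis):
    """Rows a_k^T = [1, (x_i)_{i<=d}, (x_i x_j)_{i<=j}] over a regular grid of [-1, 1]^d."""
    grid_1d = np.linspace(-1.0, 1.0, points_per_axis)
    rows = []
    for x in product(grid_1d, repeat=d):
        x = np.asarray(x)
        quad = [x[i] * x[j] for i, j in combinations_with_replacement(range(d), 2)]
        rows.append(np.concatenate(([1.0], x, quad)))
    return np.asarray(rows)    # shape (points_per_axis**d, 1 + d + d(d+1)/2)
```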

5.2 Algorithms

We present results for the two proximal decomposition methods with backtracking line search presented in Sect. 4.1, which we denote by FB (for forward–backward) and FISTA. We also used two variants of the alternating block coordinate descent algorithm of Sect. 4.2, where blocks are selected in a fixed cyclic order (\(\texttt {ABCD-cy}\)), or according to a new random permutation drawn every m steps (\(\texttt {ABCD-rp}\)). The best block sampling scheme is usually problem dependent, and it is common to observe that no strategy is uniformly better across problem instances (see e.g. Beck et al. 2015 for a numerical illustration). It is folklore in the machine learning community that sampling without replacement performs better than sampling with replacement, but this varies from setting to setting.

We compare these methods to a Fedorov–Wynn type vertex-direction method (VDM), which is, in fact, an adaptation of the celebrated Frank–Wolfe algorithm for constrained convex optimization (Frank and Wolfe 1956). Several variants exist to compute the step sizes \(\alpha \) of this algorithm, in particular, optimal step length can be used, see Harman et al. (2018). However, no simple formula exists for the optimal step lengths in the case of Bayes A-optimality, so we next describe a method with backtracking line search, which also allows a more straightforward comparison with FB and FISTA.

[Algorithm: vertex-direction method (VDM) with backtracking line search]

We will also compare to the multiplicative algorithm (Silvey et al. 1978) (MUL), where at each iteration, we set

$$\begin{aligned} (\varvec{d}_k)_i = -\frac{\partial \varPhi _{A_K}(\varvec{w}_k)}{\partial w_i}=\frac{1}{\sigma _N^2}\Vert K^T M_N(\varvec{w}_{k})^{-1} \varvec{a}_i\Vert ^2, \end{aligned}$$
(9)

and we perform the update \(\varvec{w}_{k+1} = \frac{\varvec{w}_k \odot \varvec{d}_k}{\varvec{w}_k^T \varvec{d}_k}\); here, the symbol \(\odot \) is used for the Hadamard (elementwise) product of two vectors.
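
A one-step sketch of this multiplicative update, assuming NumPy arrays and using (9) for the gradient; the function name is ours.

```python
import numpy as np

def mul_iteration(w, A, K, Sigma_inv, sigma_N2):
    """One step of the multiplicative algorithm (MUL), sketch: w <- (w . d) / (w^T d),
    with d_i = ||K^T M_N(w)^{-1} a_i||^2 / sigma_N^2 as in (9)."""
    M = Sigma_inv + (A.T * (w / sigma_N2)) @ A      # information matrix M_N(w)
    B = np.linalg.solve(M, K)                       # M_N(w)^{-1} K
    d = np.sum((A @ B) ** 2, axis=1) / sigma_N2     # rows of A @ B are (K^T M^{-1} a_i)^T
    return w * d / (w @ d)
```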

For VDM, regarding the value of \(L_k\) and the number of values of i to be tested for a successful backtracking line search step, the same comments as those made for the forward–backward algorithm hold. For both VDM and MUL, the cost of one iteration is dominated by the cost of computing \(\varvec{d}=-\nabla \varPhi \), which requires the inversion of \(M_N(\varvec{w})\) with computational cost \(O(n^3)\), and multiplications by \(K^T\) and \(A^T\), which can be done in O(rnm) operations and dominate the overall cost of this gradient computation.

For all algorithms, we used the constants \(\eta =2\) and \(L_0=1\) for backtracking line searches. The initial designs were set to \(\varvec{w}_0=\frac{1}{m} \varvec{1}_m\) for \(\texttt {VDM}\) and \(\texttt {MUL}\), and we used the initial matrix \(X_0=0\in {\mathbb {R}}^{r \times m}\) for the other algorithms.

In our experiments, K is taken to be the identity so that all the algorithms have iteration complexity of order \(O(n^2 m)\) and thus comparing the evolution of the cost along iterations of each algorithm provides a good intuition about their comparative performances. Note that for \(\texttt {ABCD-cy}\) and \(\texttt {ABCD-rp}\), we consider that one iteration is complete after going through a full cycle so that all the entries of X are updated.

5.3 Results

To monitor the speed of convergence of the algorithms, we can compute the design efficiencies

$$\begin{aligned} {\text {eff}}_{A_K} (\varvec{w}_k) := \frac{\rho }{\varPhi _{A_K}(\varvec{w}_k)}, \end{aligned}$$

where \(\rho = \inf _{\varvec{w^*} \in \varDelta } \varPhi _{A_K}(\varvec{w^*})\) was computed by letting the multiplicative algorithm run for a very long time. In the plots, we show the quantity \(\log _{10}(1-{\text {eff}}_{A_K} (\varvec{w}_k))\), so a value of \(-k\) corresponds to an efficiency of \(1-10^{-k}\). We will also use the following optimality measure:

$$\begin{aligned} s_k = \max _{i=1,\ldots ,m}\ (\varvec{d}_k)_i - \varvec{w}_k^T \varvec{d}_k, \end{aligned}$$

where \(\varvec{d}_k\) is the gradient of \(-\varPhi _{A_K}\) at \(\varvec{w}_k\), see (9). It is well known (see Pukelsheim 1993) that this expression gives a duality bound on the efficiency of the design \(\varvec{w}_k\): \({\text {eff}}_{A_K} (\varvec{w}_k)\ge 1-\varepsilon _k\), where \(\varepsilon _k:= \frac{s_k}{\varPhi _{A_K}(\varvec{w}_k)+s_k}\). Furthermore, for any sequence \((\varvec{w}_k)_{k\in {\mathbb {N}}}\) of designs converging to an optimal design \(\varvec{w}^*\), it is known that \(s_k\) converges to 0, so the lower bound \(1-\varepsilon _k\) on the design efficiency converges to 1. The algorithms FB, FISTA and ABCD do not directly involve iterates \(\varvec{w}_k\), but we can compute the above efficiency bound by setting \((\varvec{w}_k)_i=\Vert \varvec{x}_i\Vert /\varOmega (X_k)\), where \(\varvec{x}_i\) denotes the ith column of the iterate \(X_k\).
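
The following sketch computes the lower bound \(1-\varepsilon _k\) from a design vector w, combining (9) with the formula above; the function name is ours.

```python
import numpy as np

def efficiency_lower_bound(w, A, K, Sigma_inv, sigma_N2):
    """Sketch of the duality-based bound: returns 1 - eps with eps = s / (Phi_AK(w) + s)
    and s = max_i d_i - w^T d, where d is the gradient of -Phi_AK as in (9)."""
    M = Sigma_inv + (A.T * (w / sigma_N2)) @ A      # information matrix M_N(w)
    B = np.linalg.solve(M, K)                       # M_N(w)^{-1} K
    phi = np.trace(K.T @ B)                         # Phi_AK(M_N(w))
    d = np.sum((A @ B) ** 2, axis=1) / sigma_N2
    s = d.max() - w @ d
    return 1.0 - s / (phi + s)
```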

Fig. 1 Efficiency, duality bound and support size for three sizes of random instances. Each plot shows empirical means, averaged over ten random instances \(A\in {\mathbb {R}}^{m\times n}\) of the optimal design problem. MUL and the alternating minimization methods are the best performers.

Fig. 2 Efficiency, duality bound and support size for three quadratic regression instances. MUL and alternating minimization methods are the best performers. Flat lines indicate that the algorithms have reached numerical precision.

Another important measure of a design’s quality is its sparsity. It is well known that optimal designs are supported by a few points only, which is a desired property for many applications. The problem formulation (6) gives a new explanation for this fact, as the penalty term \(\varOmega (X)\) is added in group lasso regression in order to induce block sparsity, so we expect the optimal matrix \(X^*\) to have a lot of columns equal to \(\varvec{0}\) (the squared penalty term \(g(X)=\frac{1}{2} \varOmega (X)^2\) is also known to be block-sparsity inducing, cf. Bach 2008).

A remarkable property of the proximal decomposition methods presented in this article is that \({\text {prox}}_{tg}(V)\) acts as a thresholding operator on V, literally zeroing a lot of columns. The same is true for alternating block coordinate descent methods, in which whole columns are set to \(\varvec{0}\) if a certain threshold property holds. As a result, the iterates produced by FB, FISTA, ABCD-cy and ABCD-rp are expected to have a small support. To observe this fact, we measure the sparsity of a design by \(\delta _{0.01}(\varvec{w}_k)\), the number of coordinates of \(\varvec{w}_k\) exceeding the value \(\frac{0.01}{m}\).

The evolution of the efficiency, the duality bound \(1-\varepsilon _k\), and the support size during the first 5000 iterations of each algorithm is depicted in Figs. 1 and 2 for random and quadratic regression instances of various sizes. As already mentioned, all the algorithms we compare have a complexity of \(O(n^2m)\) per iteration. It is therefore possible to get a rough idea of their comparative computational efficiency from an iteration-based performance analysis. A more precise time-based analysis would depend on the optimization of the linear algebra operations required by each algorithm and is beyond the scope of this work, so we stick to an iteration-based analysis.

We observe several properties of the new algorithms in these figures. First, the effect of acceleration is clearly visible, as FISTA always beats VDM, while the simple forward–backward algorithm FB is typically outperformed. As explained in Sect. 4.1, this comes at the price of FISTA not being a descent method, which can also be observed on the plots, especially for the quadratic regression instances (Fig. 2). Second, MUL generally performs better than the other algorithms, closely followed by the alternating minimization methods; the performance of the remaining algorithms is generally below. These preliminary results suggest that the group lasso formulation of the experimental design problem has the potential to help devise powerful algorithms for the latter problem. In particular, the alternating block coordinate descents exhibit a nice linear convergence on many instances. Pushing further would require looking more carefully at the implementation details of each algorithm and performing much larger scale experiments, which is beyond the scope of this paper. Many upgrades and improvements also remain to be evaluated, e.g. preconditioning, or clever subsampling of the blocks to be updated at each iteration. Third, the support plots show that FISTA and ABCD quickly converge to a sparse solution. MUL also quickly identifies a design with a small support. We recall that the iterates of FISTA and ABCD are truly sparse, while for MUL this is only a numerical sparsity, as the iterates \(\varvec{w}_k\) remain strictly positive. On the other hand, VDM always fails to identify a smaller support. Finally, we point out that all algorithms presented in this article could be combined with support reduction techniques, which identify points that cannot belong to the support of an optimal design in order to speed up the subsequent iterations (Pronzato 2013).

Finite time identification of sparsity patterns is an active topic of research in nonsmooth optimization and we believe that this property can be exploited to yield high-performance algorithms to solve very large scale optimal design problems. Finally, for ABCD, there is no clear distinction between random permutation and fixed cyclic choice of the blocks.

6 Conclusion

This paper presents a strong, previously unrevealed connection between two standard problems in statistics (Bayes A-optimal design and group lasso regression), hence clearing the path to a transfer of algorithms between the communities of optimal design of experiments and machine learning. While the new methods presented in this article are not yet competitive with other algorithms for computing optimal designs over a finite design space, they certainly present interesting features, such as sparse iterates and a guaranteed speed of convergence, and we believe that there is still considerable room for improvement, e.g. by using recent techniques based on subsampling oracles (Kerdreux et al. 2018) or lazy separators (Braun et al. 2017). Conversely, an interesting perspective is to use well established techniques of optimal experimental design, such as methods to restrict the set of potential support points of an optimal design (Pronzato 2013), to improve algorithms that were designed to solve group lasso regressions.

Another topic for further research is whether we can reformulate other design problems (such as the D-optimal design problem, or problems with constraints on the design weights) as unconstrained convex optimization problems.