Encyclopedia of Systems and Control

Living Edition
| Editors: John Baillieul, Tariq Samad

System Identification Techniques: Convexification, Regularization, Relaxation

  • Alessandro Chiuso
Living reference work entry


DOI: https://doi.org/10.1007/978-1-4471-5102-9_101-3


System identification has been developed, by and large, following the classical parametric approach. In this entry we discuss how regularization theory can be employed to tackle the system identification problem from a nonparametric (or semi-parametric) point of view. Both regularization for smoothness and regularization for sparseness are discussed, as flexible means to face the bias/variance dilemma and to perform model selection. These techniques also have computational advantages, sometimes leading to convex optimization problems.


Keywords: Kernel methods · Nonparametric methods · Optimization · Sparse Bayesian learning · Sparsity


System identification is concerned with automatic model building from measured data. Under this unifying umbrella, this field spans a rather broad spectrum of topics, considering different model classes (linear, hybrid, nonlinear, continuous, and discrete time) as well as a variety of methodologies and algorithms, bringing together in a nontrivial way concepts from classical statistics, machine learning, and dynamical systems.

Even though considerable effort has been devoted to specific areas, such as parametric methods for linear system identification which are by now well developed (see the introductory article “System Identification: An Overview”), it is fair to say that modeling still is, by far, the most time-consuming and costly step in advanced process control applications. As such, the demand for fast and reliable automated procedures for system identification makes this exciting field still a very active and lively one.

It suffices here to recall that, following the classical parametric maximum likelihood (ML)/prediction error (PE) framework, the candidate models are described using a finite number of parameters \(\theta \in \mathbb {R}^n\). After the model classes have been specified, the following two steps have to be undertaken:

  (i) Estimate the model complexity \(\hat {n}\).

  (ii) Find the estimator \(\hat {\theta }\in {\mathbb {R}}^{\hat {n}}\) minimizing a cost function J(θ), e.g., the prediction error or (minus) the log-likelihood.

Both of these steps are critical, yet for different reasons: step (ii) boils down to an optimization problem which, in general, is non-convex, and as such it is very hard to guarantee that a global minimum is achieved. The regularization techniques discussed in this entry sometimes allow one to reformulate the identification problem as a convex program, thus solving the issue of local minima.
In addition, fixing the system complexity equal to the “true” one is a rather unrealistic assumption, so that in practice the complexity n has to be estimated as per step (i). Indeed, there is never a “true” model, certainly not in the model class considered. The problem of statistical modeling is first of all an approximation problem: one seeks an approximate description of “reality” which is at the same time simple enough to be learned from the available data and accurate enough for the purpose at hand. On this issue see also the section “Trade-off Between Bias and Variance” in “System Identification: An Overview”. This has nontrivial implications, chiefly the facts that classical order selection criteria are based on asymptotic arguments and that the statistical properties of estimators \(\hat {\theta }\) after model selection, called post-model-selection estimators (PMSEs), are in general difficult to study (Leeb and Pötscher, 2005) and may lead to undesirable behavior. Experimental evidence shows that this is not only a theoretical problem but also a practical one (Pillonetto et al., 2011; Chen et al., 2012).

On top of this statistical aspect, there is also a computational one. In fact the model selection step, which includes as special cases also variable selection and structure selection, may lead to computationally intractable combinatorial problems. Two simple examples which reveal the combinatorial explosion of candidate models are the following:
  (a) Variable selection: consider a high-dimensional (MIMO) time series where not all inputs/outputs are relevant, and one would like to select k out of m available input signals, where k is not known and needs to be inferred from data (see, e.g., Banbura et al. 2010 and Chiuso and Pillonetto 2012).

  (b) Structure selection: consider all autoregressive models of maximal lag p with only p0 < p nonzero coefficients; one would like to estimate how many (p0) and which coefficients are nonzero.
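The combinatorial explosion in both examples is easy to quantify: enumerating all sparsity patterns means summing binomial coefficients, which equals 2 to the number of choices. A minimal counting sketch (the values of m and p below are purely illustrative):

```python
from math import comb

def num_variable_subsets(m: int) -> int:
    """Number of ways to choose k of m inputs when k is unknown:
    all subsets of the m inputs, i.e. 2**m candidates."""
    return sum(comb(m, k) for k in range(m + 1))

def num_support_patterns(p: int) -> int:
    """Number of ways to place p0 <= p nonzero AR coefficients
    among p lags, with p0 unknown: again 2**p patterns."""
    return sum(comb(p, p0) for p0 in range(p + 1))

# Even modest sizes are prohibitive for exhaustive search:
print(num_variable_subsets(30))   # over 10**9 candidate input sets
print(num_support_patterns(50))   # over 10**15 candidate AR supports
```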

The same combinatorial problem arises in hybrid system identification (e.g., switching ARX models). Given that enumeration of all possible models is essentially impossible due to the combinatorial explosion of candidates, selection could be performed using greedy approaches from multivariate statistics, such as stepwise methods (Hocking, 1976).

The system identification community, inspired by work in statistics (Tibshirani, 1996; Mackay, 1994), machine learning (Rasmussen and Williams, 2006; Tipping, 2001; Bach et al., 2004), and signal processing (Donoho, 2006; Wipf et al., 2011), has recently developed and adapted methods based on regularization to jointly perform model selection and estimation in a computationally efficient and statistically robust manner. Different regularization strategies have been employed, which can be classified in two main classes: regularization induced by so-called smoothness priors (aka Tikhonov regularization; see Kitagawa and Gersh (1984) and Doan et al. (1984) for early references in the field of dynamical systems) and regularization for selection. The latter is usually achieved by convex relaxation of the \(\ell _0\) quasinorm (such as the \(\ell _1\) norm and variations thereof, e.g., sums of norms, the nuclear norm, etc.) or by other non-convex sparsity-inducing penalties which can be conveniently derived in a Bayesian framework, aka sparse Bayesian learning (SBL) (Mackay, 1994; Tipping, 2001; Wipf et al., 2011).

The purpose of this entry is to guide the reader through the most interesting and promising results on this topic as well as areas of active research; of course this subjective view only reflects the author’s opinion, and of course different authors could have offered a different perspective.

While, as mentioned above, system identification studies various classes of models (ranging from linear to general “nonlinear” models), in this entry, we shall restrict our attention to specific ones, namely, linear and hybrid dynamical systems. The field of nonlinear system identification is so vast (a quote sometimes attributed to S. Ulam has it that the study of nonlinear systems is a sort of “non-elephant zoology”) that even though it has largely benefitted from the use of regularization, it cannot be addressed within the limited space of this contribution. The reader is referred to the Encyclopedia chapters “Nonlinear System Identification: An Overview of Common Approaches” and “Nonlinear System Identification Using Particle Filters” for more details on nonlinear model identification.

System Identification

Let \(u_{t} \in \mathbb {R}^m\) and \(y_{t} \in \mathbb {R}^p\) be, respectively, the measured input and output signals in a dynamical system; the purpose of system identification is to find, from a finite collection of input-output data {ut, yt}t ∈ [1,N], a “good” dynamical model which describes the phenomenon under observation. The candidate model will be searched for within a so-called model set denoted by \({\mathscr {M}}\). This set can be described in parametric form (see, e.g., Eq. (3) in “System Identification: An Overview”) or in a nonparametric form. In this entry we shall use the symbol \(\mathscr {M}_n(\theta )\) for parametric model classes where the subscript n denotes the model complexity, i.e., the number of free parameters.

Linear Models

The first part of the entry will address identification of linear models, i.e., models described by a convolution
$$\displaystyle \begin{aligned} y_{t} = \sum_{k=1}^{\infty} g_{t-k} u_k + \sum_{k=0}^{\infty} h_{t-k} e_k \quad t\in \mathbb{Z} \end{aligned} $$
where g and h are the so-called impulse responses of the system and \(\{e_t\}_{t\in \mathbb {Z}}\) is a zero-mean white noise process which under suitable assumptions is the one-step-ahead prediction error; a convenient description of the linear system (1) is given in terms of the transfer functions
$$\displaystyle \begin{aligned} G(q): = \sum_{k=1}^\infty g_k q^{-k} \quad H(q): = \sum_{k=0}^\infty h_k q^{-k} \end{aligned}$$
The linear model (1) naturally yields an “optimal” (in the mean square sense) output predictor which shall be denoted later on by \(\hat {y}_{t|t-1}\). As mentioned above, under suitable assumptions, the noise et in (1) is the so-called innovation process \(e_t = y_t - \hat {y}_{t|t-1}\). See also Eq. (8) in “System Identification: An Overview”.

When g and h are described in a parametric form, we shall use the notation gk(θ), hk(θ), and, likewise, G(q, θ), H(q, θ), and \(\hat {y}_{t|t-1}(\theta )\).

Example 1

Consider the so-called output-error model, i.e., assume H(q) = 1. An example of parametric model class is obtained by restricting G(q, θ) to be a rational function
$$\displaystyle \begin{aligned} G(q,\theta) = K\prod_{i=1}^{n} \frac{q-z_i}{q-p_i} \end{aligned}$$
where θ := [K, p1, z1, …, pn, zn] is the parameter vector. Note that the parameter vector θ may be subject to constraints θ ∈ Θ, e.g., enforcing that the system be bounded-input, bounded-output (BIBO) stable (|pi| < 1) or that the impulse response be real (\(K \in \mathbb {R}\) and poles pi and zeros zi appear in complex conjugate pairs).
An example of nonparametric model is obtained, e.g., by postulating that gk is a realization of a Gaussian process (Rasmussen and Williams, 2006) with zero mean and a certain covariance function R(t, s) = cov(gt, gs). For instance, the choice \(R(t, s) = \lambda ^t \delta _{t-s}\), where |λ| < 1 and δ is the Kronecker symbol, postulates that gt and gs are uncorrelated for t ≠ s and that the variance of gt decays exponentially in t; this latter condition ensures that each realization {gk}k>0 is BIBO stable with probability one. The exponential decay of gt guarantees that, to any practical purpose, it can be considered zero for t > T for a suitably large T. This allows one to approximate the OE model with a “long” finite impulse response (FIR) model
$$\displaystyle \begin{aligned} G(q) = \sum_{k=1}^T g_k q^{-k} \end{aligned} $$
where gk, k = 1, …, T, is modeled as a zero-mean Gaussian vector with covariance Σ, with elements [ Σ]ts = R(t, s).
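As a concrete illustration of this nonparametric prior, the sketch below draws a realization of a T-tap impulse response from the exponentially decaying diagonal covariance just described (the values of T and of the decay rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50        # truncation length of the "long" FIR approximation
lam = 0.8     # decay rate, |lam| < 1 (an illustrative value)

# Diagonal covariance with [Sigma]_tt = lam**t,
# i.e. R(t, s) = lam**t * delta_{t-s}
Sigma = np.diag(lam ** np.arange(1, T + 1))

# A realization g = (g_1, ..., g_T) of the zero-mean Gaussian prior
g = rng.multivariate_normal(np.zeros(T), Sigma)

# The prior standard deviation of g_t decays as lam**(t/2), so late
# coefficients are negligible with high probability, which justifies
# truncating the impulse response at T.
```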

Remark 1

Note that the model (2), which has been obtained from truncation of a nonparametric model, could in principle be thought of as a parametric model in which the parameter vector θ contains all the entries of gk, k = 1, …, T. Yet the truncation index T may have to be large even for relatively “simple” impulse responses; for instance, \(\{g_k(\theta )\}_{k\in \mathbb {Z}^+}\) may be a simple decaying exponential, gk(θ) = αρk, which is described by two parameters (amplitude and decay rate), yet if |ρ|≃ 1, the truncation index T needs to be large (ideally T →∞) to obtain sensible results (e.g., with low bias). Therefore, the number of parameters, T × m × p, may be larger (and in fact much larger) than the available number of data points N. Under these conditions, the parameter θ cannot be estimated from any finite data segment unless further constraints are imposed.

The Role of Regularization in Linear System Identification

In order to simplify the presentation, we shall refer to the linear model (1) and assume that H(q) = 1, i.e., we consider the so-called linear output-error (OE) models. The extension to more general model classes can be found in Pillonetto et al. (2011), Chen et al. (2012), Chiuso and Pillonetto (2012), and references therein.

The main purpose of regularization is to control the model complexity in a flexible manner, moving from families of rigid, finite dimensional parametric model classes \(\mathscr {M}_n(\theta )\) to flexible, possibly infinite dimensional, models. To this purpose one starts with a “suitably large” model class which is constrained through the use of so-called regularization functionals. To simplify the presentation, we consider the FIR model (2). The estimator \(\hat {\theta }\) is found as the solution of the following optimization problem
$$\displaystyle \begin{aligned} \begin{array}{rcl} \hat\theta&=& \text{arg min}_{\theta \in \mathbb{R}^n}\; J_F(\theta) + J_R(\theta;\lambda) \end{array} \end{aligned} $$
where JF(θ) is the “fit” term often measured in terms of average squared prediction errors:
$$\displaystyle \begin{aligned} \begin{array}{rcl} J_F(\theta)& := & \frac{1}{N} \sum_{t=1}^{N}\| y_{t} -\hat{y}_{t|t-1}(\theta)\|{}^2 \end{array} \end{aligned} $$
while JR(θ;λ) is a regularization term which penalizes parameter vectors θ associated with “unlikely” systems. Equation (3) can be seen as a way to deal with the bias-variance trade-off. The regularization term JR(θ;λ) may depend upon some regularization parameters λ which need to be tuned using measured data. In its simplest instance,
$$\displaystyle \begin{aligned} J_R(\theta;\lambda)= \lambda J_R(\theta) \end{aligned}$$
where λ is a scale factor that controls “how much” regularization is needed. We now discuss different forms of regularization JR(θ;λ) which have been studied in the literature.

Example 2

Let us consider the FIR model in Eq. (2), and let θ be a vector containing all the unknown coefficients of the impulse response {gk}k=1,…,T. The linear least squares estimator
$$\displaystyle \begin{aligned} \hat{\theta}_{LS}:= \text{arg min}_\theta \frac{1}{N} \sum_{t=1}^{N}\| y_{t} -\hat{y}_{t|t-1}(\theta)\|{}^2 \end{aligned} $$
is ill-posed unless the number of data points N is larger (and in fact much larger) than the number of parameters T. From the statistical point of view, the estimator (5) would, for large T, result in small bias but large variance. The purpose of regularization is to render the inverse problem of finding θ from the data {yt}t=1,…,N well posed, thus better trading bias versus variance. The simplest form of regularization is indeed the so-called ridge regression or its weighted version (aka generalized Tikhonov regularization), where the \(\ell _2\)-norm of the vector θ is weighted w.r.t. a positive semidefinite matrix Q,
$$\displaystyle \begin{aligned} \hat{\theta}_{\mathrm{Reg}} &:= \text{arg min}_\theta \frac{1}{N} \sum_{t=1}^{N}\| y_{t} -\hat{y}_{t|t-1}(\theta)\|{}^2 \\ &\quad + \lambda \theta^\top Q \theta{} \end{aligned} $$
which results in so-called regularization for smoothness; see section “Regularization for Smoothness.” The choice of the weighting Q is highly nontrivial in the system identification context, and the performance of the regularized estimator \(\hat \theta _{\mathrm {Reg}}\) heavily depends on this choice.
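The weighted ridge estimator (6) is available in closed form. A minimal sketch, assuming an FIR regression y ≈ Φθ with a hypothetical decaying impulse response and the simplest choice Q = I:

```python
import numpy as np

def tikhonov_fir(Phi, y, lam, Q):
    """Weighted ridge / generalized Tikhonov estimate for y ~ Phi @ theta:
        argmin_theta (1/N) ||y - Phi theta||^2 + lam * theta' Q theta,
    whose closed form is (Phi'Phi/N + lam Q)^{-1} Phi'y / N."""
    N = len(y)
    return np.linalg.solve(Phi.T @ Phi / N + lam * Q, Phi.T @ y / N)

# Illustrative data: T = 5 FIR coefficients, white-noise input
rng = np.random.default_rng(1)
T, N = 5, 200
theta_true = 0.7 ** np.arange(1, T + 1)
u = rng.standard_normal(N + T)
Phi = np.column_stack([u[T - k : N + T - k] for k in range(1, T + 1)])
y = Phi @ theta_true + 0.1 * rng.standard_normal(N)

theta_hat = tikhonov_fir(Phi, y, lam=0.1, Q=np.eye(T))
```

With Q = I this reduces to plain ridge regression; the point of the discussion above is precisely that more structured choices of Q (equivalently, of the kernel K = Q⁻¹) matter a great deal.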

Remark 2

In order to formalize these ideas for nonparametric models or, equivalently, when the parameter θ is infinite dimensional, one has to bring in functional analytic tools, such as reproducing kernel Hilbert spaces (RKHS). This is rather standard in the literature on ill-posed inverse problems and has been recently introduced also in the system identification setting (Pillonetto et al., 2011). We shall not discuss these issues here because, we believe, the formalism would render the content less accessible.

Note that this regularization approach admits a completely equivalent Bayesian formulation simply setting
$$\displaystyle \begin{aligned} p(y|\theta) \propto e^{-J_F(\theta)} \quad p(\theta|\lambda) \propto e^{-J_R(\theta;\lambda)} \end{aligned} $$
The densities p(y|θ) and p(θ|λ) are, respectively, the likelihood function and the prior, which in turn may depend on the unknown regularization parameters λ, aka hyperparameters in this Bayesian formulation. This is straightforward in the finite dimensional setting, while it requires some care when θ is infinite dimensional. With reference to Example 1 and assuming θ contains the impulse response coefficients gk in (2), p(θ|λ) is a Gaussian density with zero mean and covariance Σ which may depend upon some regularization parameters λ. From the definitions (7), it follows that
$$\displaystyle \begin{aligned} p(\theta|y,\lambda ) \propto p(y|\theta) p(\theta|\lambda) \end{aligned} $$
from which point estimators of θ can be obtained (e.g., as posterior mean, MAP, etc.). As such, with some abuse of terminology, we shall indifferently refer to JR(θ;λ) as the “regularization term” or the “prior.” The unknown parameter λ is used to introduce some flexibility in the regularization term JR(θ;λ) or equivalently in the prior p(θ|λ) and is tuned based on measured data as discussed later on.

The regularization term JR(θ;λ) can be roughly classified into regularization for smoothness, which attempts to control complexity in a smooth fashion, and regularization for sparseness, which, on top of estimation, also aims at selecting among a finite (yet possibly very large) number of candidate model classes.

Regularization for Smoothness

Let us consider a single-input, single-output FIR model of length T (arbitrarily large), and let \(\theta := [g_1 \; g_2\; \ldots g_{T}]^\top \in \mathbb {R}^{T}\) be the (finite) impulse response; let also \(y\in \mathbb {R}^N\) be the vector of output observations, Φ the regressor matrix built with past input samples, and e the vector of innovations (zero mean, variance σ2I). With this notation the convolution input-output equation (1) takes the form
$$\displaystyle \begin{aligned} y= \varPhi \theta +e \end{aligned}$$
Following the prescriptions of ridge regression, a regularized estimator \(\hat {\theta }\) can be found setting
$$\displaystyle \begin{aligned} J_R(\theta;\lambda)= \theta^\top K^{-1}(\lambda) \theta \end{aligned} $$
where the matrix K(λ), aka kernel, is tailored to capture specific properties of impulse responses (exponential decay, BIBO stability, smoothness, etc.). Early references include Doan et al. (1984) and Kitagawa and Gersh (1984), while more recent work can be found in Pillonetto and De Nicolao (2010), Pillonetto et al. (2011), and Chen et al. (2012) where several choices of kernels are discussed; the recent paper Zorzi and Chiuso (2018) provides a frequency domain interpretation of several kernel functions using tools from harmonic analysis.

Example 3

The simplest example of kernel is the so-called “exponentially decaying” kernel
$$\displaystyle \begin{aligned} K(\lambda):= \gamma D(\rho) \quad D(\rho) : = \mathrm{diag}\{\rho,\ldots,\rho^T\} \end{aligned} $$
where λ := (γ, ρ) with 0 < ρ < 1 and γ ≥ 0.
For fixed λ, the estimator \(\hat \theta (\lambda )\) is the solution of a quadratic problem and can be written in closed form (aka ridge regression):
$$\displaystyle \begin{aligned} \hat\theta(\lambda) = K(\lambda)\varPhi^\top \left(\varPhi K(\lambda) \varPhi^\top +\sigma^2 I \right)^{-1} y \end{aligned} $$
Two common strategies adopted to estimate the parameters λ are cross validation (Ljung, 1999) and marginal likelihood maximization. This latter approach is based on the Bayesian interpretation given in Eq. (7) from which one can compute the so-called “empirical Bayes” estimator \(\hat {\theta }_{\mathrm {EB}}:= \hat \theta (\hat \lambda _{\mathrm {ML}})\) of θ plugging in (11) the estimator of λ which maximizes the marginal likelihood:
$$\displaystyle \begin{aligned} \hat \lambda_{\mathrm{ML}} &:= \mathop{\mathrm{arg\;max}}\limits_{\lambda} p(\lambda |y)\\ &=\mathop{\mathrm{arg\;max}}\limits_{\lambda} \int p(\lambda, \theta |y) \, d\theta{} \end{aligned} $$
The main strength of the marginal likelihood is that, by integrating the joint posterior over the unknown parameters θ, it automatically accounts for the residual uncertainty in θ for fixed λ. When both JF and JR are quadratic costs, which corresponds to assuming that e and θ are independent and Gaussian, the marginal likelihood in (12) can be computed in closed form so that
$$\displaystyle \begin{aligned} \hat \lambda_{\mathrm{ML}} &:= \operatorname*{\mathrm{arg\;min}}_{\lambda} \;\;\mathrm{log}(\mathrm{det}(\Sigma(\lambda))) + y^\top \Sigma^{-1}(\lambda) y \\ \Sigma(\lambda) &:= \varPhi K(\lambda) \varPhi^\top + \sigma^2 I{}\end{aligned} $$
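The closed-form cost (13) and the estimator (11) can be combined into a small empirical Bayes routine. The sketch below uses the exponentially decaying kernel of Example 3 and a simple grid search over λ = (γ, ρ); the data, the grid, and the noise level are illustrative assumptions:

```python
import numpy as np

def ml_cost(y, Phi, gamma, rho, sigma2=1.0):
    """Negative marginal log-likelihood (13) for the exponentially
    decaying kernel K(lambda) = gamma * diag(rho, ..., rho**T)."""
    T = Phi.shape[1]
    K = gamma * np.diag(rho ** np.arange(1, T + 1))
    S = Phi @ K @ Phi.T + sigma2 * np.eye(len(y))
    _, logdet = np.linalg.slogdet(S)
    return logdet + y @ np.linalg.solve(S, y)

def eb_fir(y, Phi, gammas, rhos, sigma2=1.0):
    """Empirical Bayes: pick (gamma, rho) on a grid by minimizing (13),
    then return the closed-form regularized estimator (11)."""
    gamma, rho = min(((g, r) for g in gammas for r in rhos),
                     key=lambda p: ml_cost(y, Phi, p[0], p[1], sigma2))
    T = Phi.shape[1]
    K = gamma * np.diag(rho ** np.arange(1, T + 1))
    S = Phi @ K @ Phi.T + sigma2 * np.eye(len(y))
    return K @ Phi.T @ np.linalg.solve(S, y), (gamma, rho)

# Synthetic FIR data with an exponentially decaying impulse response
rng = np.random.default_rng(2)
T, N = 20, 300
g_true = 2.0 * 0.6 ** np.arange(1, T + 1)
u = rng.standard_normal(N + T)
Phi = np.column_stack([u[T - k : N + T - k] for k in range(1, T + 1)])
y = Phi @ g_true + rng.standard_normal(N)

g_hat, (gamma_hat, rho_hat) = eb_fir(y, Phi,
                                     gammas=[0.1, 1.0, 10.0],
                                     rhos=[0.4, 0.6, 0.8])
```

A grid search is used only to keep the sketch dependency-free; in practice (13) is minimized with a gradient-based solver, keeping in mind the non-convexity discussed below.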

It is here interesting to observe that \(\hat \lambda _{\mathrm {ML}}\) which solves (12) under certain conditions leads to \(K(\hat \lambda _{\mathrm {ML}}) = 0\) (see Example 4), so that the estimator of θ in (11) satisfies \(\hat \theta (\hat \lambda _{\mathrm {ML}})=0\). This simple observation is the basis of so-called sparse Bayesian learning (SBL); we shall return to this issue in the next section when discussing regularization for sparsity and selection.

Unfortunately the optimization problem (12) (or (13)) is not convex and thus subject to the issue of local minima. However, both experimental evidence and some theoretical results support the use of marginal likelihood maximization for estimating regularization parameters; see, e.g., Rasmussen and Williams (2006) and Aravkin et al. (2014).

Regularization for Sparsity: Variable Selection and Order Estimation

The main purpose of regularization for sparseness is to provide estimators \(\hat \theta \) in which subsets or functions of the estimated parameters are equal to zero.

Consider the multi-input, multi-output OE model
$$\displaystyle \begin{aligned} y_{t,j} = \sum_{i=1}^m \sum_{k=1}^T g_{k,ij} u_{t-k,i} + e_{t,j} \quad j=1,\ldots,p\end{aligned} $$
where yt,j denotes the jth component of \(y_{t}\in \mathbb {R}^p\); let also \(\theta \in \mathbb {R}^{T m p}\) be the vector containing all the impulse response coefficients gk,ij, j = 1, …, p, i = 1, …, m, and k = 1, …, T. With reference to Eq. (14), simple examples of sparsity one may be interested in are:
  (i) Single elements of the parameter vector θ, which corresponds to eliminating specific lags of some variables from the model (14).

  (ii) Groups of parameters, such as the impulse response from the ith input to the jth output, gk,ij, k = 1, …, T, thereby eliminating the ith input from the model for the jth output.

  (iii) The singular values of the Hankel matrix \(\mathscr {H}(\theta )\) formed with the impulse response coefficients gk; in fact the rank of the Hankel matrix equals the order (i.e., the McMillan degree) of the system. (Strictly speaking, any full-rank FIR model of length T has McMillan degree T × p. Yet we consider {gk}k=1,…,T to be the truncation of some “true” impulse response {gk}k=1,…,∞, and, as such, the finite Hankel matrix built with the coefficients gk will have rank equal to the McMillan degree of \(G(q)= \sum _{k=1}^\infty g_k q^{-k}\).)


To this purpose one would like to penalize the number of nonzero terms, be they entries of θ, groups, or singular values. This is measured by the \(\ell _0\) quasinorm or its variations: the group \(\ell _0\) quasinorm and the \(\ell _0\) quasinorm of the Hankel singular values, i.e., the rank of the Hankel matrix. Unfortunately, if JR is a function of the \(\ell _0\) quasinorm, the resulting optimization problem is computationally intractable; as such, one usually resorts to relaxations. Three common ones are described below.

One possibility is to resort to greedy algorithms such as orthogonal matching pursuit; generically, however, it is not possible to guarantee convergence to a global minimum point.

A very popular alternative is to replace the \(\ell _0\) quasinorm by its convex envelope, i.e., the \(\ell _1\) norm, leading to algorithms known in statistics as the LASSO (Tibshirani, 1996) or its group version, the group LASSO (Yuan and Lin, 2006):
$$\displaystyle \begin{aligned} J_R(\theta;\lambda) = \lambda \|\theta\|{}_{1} \end{aligned} $$
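A standard way to solve the resulting ℓ1-regularized least squares problem is iterative soft-thresholding (ISTA); the sketch below is one generic solver among many (coordinate descent and ADMM are common alternatives), not a method specific to this entry, and the data are illustrative:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_lasso(Phi, y, lam, n_iter=500):
    """ISTA for  min_theta (1/N) ||y - Phi theta||^2 + lam ||theta||_1."""
    N, n = Phi.shape
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2 / N  # Lipschitz const. of the gradient
    theta = np.zeros(n)
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ theta - y) / N
        theta = soft(theta - grad / L, lam / L)
    return theta

# Sparse ground truth: only two of five coefficients are nonzero
rng = np.random.default_rng(5)
N, n = 100, 5
Phi = rng.standard_normal((N, n))
theta_true = np.array([1.0, 0.0, 0.0, 0.5, 0.0])
y = Phi @ theta_true + 0.01 * rng.standard_normal(N)

theta_hat = ista_lasso(Phi, y, lam=0.05)
```

The estimate is exactly (not just approximately) sparse on the true zero coordinates, while the nonzero ones are slightly shrunk, illustrating the sparsity/shrinking trade-off discussed below.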
Similarly, the convex relaxation of the rank (i.e., the \(\ell _0\) quasinorm of the vector of singular values) is the so-called nuclear norm (aka Ky Fan n-norm or trace norm), i.e., the sum of the singular values, \(\| A \|{ }_*: =\mathrm {trace}\{\sqrt {A^{\top }A}\}\), where \(\sqrt {\cdot }\) denotes the matrix square root, which is well defined for positive semidefinite matrices. In order to control the order (McMillan degree) of a linear system, which is equal to the rank of the Hankel matrix \(\mathscr {H}(\theta )\) built with the impulse response described by the parameter θ, it is then possible to use the regularization term
$$\displaystyle \begin{aligned} J_R(\theta;\lambda) = \lambda \|\mathscr{H}(\theta)\|{}_{*} \end{aligned} $$
thus leading to convex optimization problems (Fazel et al., 2001). The implications of using nuclear and atomic norms in system identification have been studied in Pillonetto et al. (2016), while Prando et al. (2017a) introduces an iterative reweighted scheme to account for smoothness, stability, and complexity (McMillan degree) at the same time.
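The two ingredients of (16) are easy to illustrate numerically: the Hankel matrix of a first-order impulse response has rank one, and the proximal operator of the nuclear norm, used by generic first-order solvers of problems like (16), is singular value thresholding. A minimal sketch with an illustrative impulse response:

```python
import numpy as np

def hankel(g, rows):
    """Hankel matrix H[i, j] = g[i + j] built from an impulse response
    g = (g_1, g_2, ...); its rank equals the McMillan degree."""
    cols = len(g) - rows + 1
    return np.array([[g[i + j] for j in range(cols)] for i in range(rows)])

def svt(H, tau):
    """Singular value thresholding: the proximal operator of
    tau * ||.||_* (the basic step of first-order nuclear norm solvers)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# A McMillan-degree-1 impulse response g_k = 0.5**k gives
# a rank-1 Hankel matrix: H[i, j] = 0.5**(i + j + 1) is an outer product.
g = 0.5 ** np.arange(1, 13)
H = hankel(g, rows=6)
print(np.linalg.matrix_rank(H))   # 1
```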

Both (16) and (15) induce sparse or nearly sparse solutions (in terms of elements or groups of θ for (15), or in terms of Hankel singular values for (16)), making them attractive for selection. It is interesting to observe that both the \(\ell _1\) norm and the group \(\ell _1\) norm are special cases of the nuclear norm if one considers matrices with fixed eigenspaces. Yet, as well documented in the statistics literature, both (16) and (15) do not provide a satisfactory trade-off between sparsity and shrinking, which is controlled by the regularization parameter λ. As λ varies one obtains the so-called regularization path: as λ increases, the solution gets sparser but, unfortunately, suffers from shrinking of the nonzero parameters. To overcome these problems, several variations of the LASSO have been developed and studied, such as the adaptive LASSO (Zou, 2006), SCAD (Fan and Li, 2001), and so on. We shall now discuss a Bayesian alternative which, to some extent, provides a better trade-off between sparsity and shrinking than the \(\ell _1\) norm.

This Bayesian procedure goes under the name of sparse Bayesian learning and can be seen as an extension of the Bayesian procedure for regularization described in the previous section. In order to illustrate the method, we consider its simplest instance: a system as in (14) with p = 1 and m = 2 (two inputs, one output), i.e.,
$$\displaystyle \begin{aligned} \begin{array}{rcl} y_{t} &=& \sum_{k=1}^T g_{k,1} u_{t-k,1}+ \sum_{k=1}^T g_{k,2} u_{t-k,2} + e_{t} \\ & =& \phi^\top_{t,1} g_1+\phi^\top_{t,2} g_2 + e_{t} \end{array}\end{aligned} $$
where \(g_i := [g_{1,i}, \ldots , g_{T,i}]^\top \). Let \(\theta :=[g_1^\top \; g_2^\top ]^\top \), and assume that the gi's are independent Gaussian random vectors with zero mean and covariances λiK. Letting \(\varPhi _i := [\phi _{1,i}, \ldots , \phi _{N,i}]^\top \) and following the formulation in (7) and (8), it follows that the marginal likelihood estimator of λ takes the form
$$\displaystyle \begin{aligned} \begin{array}{rcl} \hat \lambda_{\mathrm{ML}} &:=& \mathop{\mathrm{arg\;min}}\limits_{\lambda_i\geq 0}\mathrm{log}(\mathrm{det}(\Sigma(\lambda))) + y^\top \Sigma^{-1}(\lambda) y \\ \\ \Sigma(\lambda) &:=& \lambda_1 \varPhi_1 K \varPhi_1^\top + \lambda_2 \varPhi_2 K\varPhi_2^\top + \sigma^2 I \end{array} \end{aligned} $$
After \(\hat \lambda _{\mathrm {ML}}\) has been found, the estimator of θ is found in closed form as per Eq. (11). It can be shown that under certain conditions on the observation vector y, the estimated hyperparameters \(\hat \lambda _{\mathrm {ML},i}\) lie at the boundary, i.e., are exactly equal to zero. If \(\hat \lambda _{\mathrm {ML},i} = 0\), then, from Eq. (11), also \(\hat g_i = 0\); this reveals that in (17), the ith input does not enter into the model; see also Example 4 for a simple illustration.

These Bayesian methods for sparsity have been studied in a general regression framework in Wipf et al. (2011) under the name of “type-II” maximum likelihood. Further results can be found in Aravkin et al. (2014) which suggest that these Bayesian methods provide a better trade-off between sparsity and shrinking (i.e., are able to provide sparse solution without inducing excessive shrinkage on the nonzero parameters).

Remark 3

A more detailed analysis (see, for instance, Aravkin et al. (2014)) shows that LASSO/GLASSO (i.e., \(\ell _1\) penalties) and SBL using the “empirical Bayes” approach can be derived under a common Bayesian framework starting from the joint posterior p(λ, θ|y). While SBL is derived by maximizing over λ the marginal posterior, LASSO/GLASSO corresponds to maximizing the joint posterior after a suitable change of variables. For reasons of space, we refer the interested reader to the literature for details.

Recent work on the use of sparseness for variable selection and model order estimation can be found in Wang et al. (2007), Chiuso and Pillonetto (2012), and references therein.

Example 4

In order to illustrate how sparse Bayesian learning leads to sparse solutions, we consider a very simplified scenario in which the measurements equation is
$$\displaystyle \begin{aligned} y_t = \theta u_{t-1} + e_t \end{aligned}$$
where et is zero-mean, unit-variance, Gaussian, and white, and ut is a deterministic signal. The purpose is to estimate the coefficient θ, which may possibly be equal to zero; thus, the estimator should reveal whether ut−1 influences yt or not.
Following the SBL framework, we model θ as a Gaussian random variable with zero mean and variance λ, independent of et. Therefore, yt is also Gaussian with zero mean and variance \(u_{t-1}^2 \lambda + 1\). Assuming N data points are available, the likelihood function for λ is given by
$$\displaystyle \begin{aligned} L(\lambda) = \prod_{t=1}^N \frac{1}{\sqrt{2\pi ( u_{t-1}^2 \lambda + 1)}}\, \exp\left(-\frac{1}{2} \frac{y_t^2}{u_{t-1}^2 \lambda + 1}\right) \end{aligned}$$
Defining now
$$\displaystyle \begin{aligned} \hat \lambda_{\mathrm{ML}}:=\mathop{\mathrm{arg\;min}}\limits_{\lambda \geq 0 }\;\; -2\log L(\lambda) \end{aligned}$$
one obtains that
$$\displaystyle \begin{aligned} \hat \lambda_{\mathrm{ML}} = \mathrm{ max}\left(0,\lambda_*\right) \end{aligned}$$
where \(\lambda _*\) is the solution of
$$\displaystyle \begin{aligned} \sum_{t=1}^{N} \frac{u_{t-1}^4\lambda + u_{t-1}^2\left(1-y_{t}^2\right) }{(u_{t-1}^2 \lambda + 1)^2}=0 \end{aligned}$$
which unfortunately is not available in closed form. If, however, we assume that the input ut is constant (without loss of generality, say ut = 1), we obtain
$$\displaystyle \begin{aligned} \lambda_* = \frac{1}{N}\sum_{t=1}^N y_t^2-1 \end{aligned}$$
$$\displaystyle \begin{aligned} \hat \lambda_{ML} = \mathrm{max}\left(0,\frac{1}{N}\sum_{t=1}^Ny_t^2 -1\right) \end{aligned}$$
Clearly this is a threshold estimator which sets \(\hat \lambda _{ML}\) to zero when the sample variance of yt is smaller than the variance of et, which was assumed equal to 1. Defining \(\varPhi := [u_0, \ldots , u_{N-1}]^\top \), the empirical Bayes estimator of θ, as per Eq. (11), is given by
$$\displaystyle \begin{aligned} \hat\theta = \hat \lambda_{ML} \varPhi^\top\left[\hat \lambda_{ML} \varPhi\varPhi^\top + I\right]^{-1} y \end{aligned}$$
which is clearly equal to zero when \(\hat \lambda _{ML}=0\).
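The closed-form expressions of this example are easy to check numerically. The sketch below implements the threshold estimator of λ and the empirical Bayes estimator of θ for the constant-input case ut = 1; the simulation settings are illustrative:

```python
import numpy as np

def sbl_scalar(y, Phi):
    """Example 4 in closed form (constant input u_t = 1):
    lam_hat = max(0, mean(y_t**2) - 1), followed by the empirical
    Bayes estimator theta_hat = lam Phi'(lam Phi Phi' + I)^{-1} y."""
    lam = max(0.0, float(np.mean(y ** 2)) - 1.0)
    if lam == 0.0:
        return 0.0, 0.0          # the estimate of theta is exactly zero
    th = lam * Phi @ np.linalg.solve(
        lam * np.outer(Phi, Phi) + np.eye(len(y)), y)
    return lam, float(th)

rng = np.random.default_rng(3)
N = 500
Phi = np.ones(N)                               # u_t = 1 for all t

y_noise = rng.standard_normal(N)               # theta = 0: pure noise
y_signal = 2.0 * Phi + rng.standard_normal(N)  # theta = 2

lam0, th0 = sbl_scalar(y_noise, Phi)    # lam_hat near (often exactly) 0
lam1, th1 = sbl_scalar(y_signal, Phi)   # lam_hat > 0, th1 close to 2
```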

Application of sparse learning is attracting enormous interest in the context of dynamic network models (Chiuso and Pillonetto, 2012; Dankers et al., 2016; Hayden et al., 2016), which arise, for instance, in gene regulatory networks (Daniel et al., 2010) and neuroscience (Prando et al., 2017b; Razi et al., 2017), where structured dynamic dependences should be inferred from measured data; for high-dimensional processes, the combinatorial explosion of alternative structures renders their direct comparison intractable. Regularization allows one to avoid this curse of dimensionality, returning sparse (and thus interpretable) models with reasonable computational effort.

Extensions: Regularization for Hybrid Systems Identification and Model Segmentation

An interesting extension of linear systems is a class of so-called hybrid models described by a relation of the form
$$\displaystyle \begin{aligned} \begin{array}{rcl} y_{t} &=& \hat y_{\theta_k}(t|t-1) + e_{t} \\ \hat y_{\theta_k}(t|t-1) &=& L_{\theta_k}(y_t^-,u_t^-) \\ \theta_k \in \mathbb{R}^{n_k} & & k=1,\ldots,K \end{array} \end{aligned} $$
where the predictor \(\hat y_{\theta _k}(t|t-1)\), which is a linear function \(L_{\theta _k}(y_t^-,u_t^-)\) of the “past” histories \(y_t^-:=\{y_{t-1},y_{t-2},\ldots \}\) and \(u_t^-:=\{u_{t-1},u_{t-2},\ldots \}\), is parametrized by a parameter vector \(\theta _k\in \mathbb {R}^{n_k}\); there are K different parameter vectors θk, k = 1, …, K, whose evolution over time is determined by a so-called switching mechanism. The name hybrid hints at the fact that the model is described by both continuous-valued (y, u, and e) and discrete-valued (k) variables.
A well-studied subclass of (19) consists of the so-called switching ARX models, where the predictor takes the special form
$$\displaystyle \begin{aligned} \hat y_{\theta_k}(t|t-1) =\phi_t^\top \theta_k \quad \theta_k \in \mathbb{R}^{n_k} \end{aligned} $$
The regressor ϕt is a finite vector containing inputs us and outputs ys in a finite past window s ∈ [t − 1, t − T], plus possibly a constant component to model changing “means.” The value of k ∈ [1, K] is determined by the switching mechanism \(p(\phi _t,t): \mathbb {R}^{n_k}\times \mathbb {R} \rightarrow \{1,\ldots ,K\}\).

Two extreme but interesting cases are (i) p(ϕt, t) = pt, where p(⋅) is an exogenous, unmeasurable signal, and (ii) p(ϕt, t) = p(ϕt), where p(⋅) is an endogenous, unknown but measurable function of the regression vector ϕt. In either case, from the identification point of view, the value of k at time t is not assumed to be known, and the identification algorithm has to operate without knowledge of the switching mechanism.
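To make the setup concrete, the sketch below (a hypothetical example of ours) simulates a two-mode switching ARX model with an exogenous switch as in case (i); purely as a baseline, the switching signal is here assumed known, so that each θk can be recovered by ordinary least squares on its own segment:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
thetas = {1: np.array([0.5, 1.0]),    # mode 1: y_t = 0.5*y_{t-1} + 1.0*u_{t-1} + e_t
          2: np.array([-0.5, 2.0])}   # mode 2: y_t = -0.5*y_{t-1} + 2.0*u_{t-1} + e_t
mode = np.where(np.arange(N) < 100, 1, 2)     # exogenous switch at t = 100
u = rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    phi = np.array([y[t - 1], u[t - 1]])      # regressor phi_t
    y[t] = phi @ thetas[mode[t]] + 0.1 * rng.standard_normal()

# With the switching signal known, each theta_k is an ordinary least squares fit
est = {}
for k in (1, 2):
    idx = np.flatnonzero(mode == k)[1:]       # skip each segment's first index
    Phi = np.column_stack([y[idx - 1], u[idx - 1]])
    est[k], *_ = np.linalg.lstsq(Phi, y[idx], rcond=None)
print(est[1], est[2])
```

The real difficulty, discussed next, is that the segmentation is not given: the identification procedure must decide, from the data alone, which rows belong to which mode.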

Identification of systems in the form (20) requires estimating (a) the number of models K and the positions of the switches between different models, (b) the “dimension” nk of each model, (c) the values of the parameters θk, and, possibly, (d) the function p(ϕt, t) which determines the switching mechanism.

Steps (b) and (c) are essentially as in section “System Identification” (see also the introductory paper “System Identification: An Overview”); however, matters are complicated by steps (a) and (d), which in particular require estimating, from data alone, which system is “active” at each time t.

Step (a), which is also related to the problem of model segmentation, has been tackled in the literature (see, e.g., Ozay et al. (2012) and Ohlsson and Ljung (2013), and references therein) by applying suitable penalties on the number of different models K and/or on the number of switches. Note that p(ϕt, t) ≠ p(ϕs, s) if and only if θt ≠ θs. Based on this simple observation, one can construct a regularization term which either counts the number of switches, i.e.,
$$\displaystyle \begin{aligned} J_R(\theta;\gamma) := \gamma \sum_{t=2}^N \|\theta_t-\theta_{t-1}\|_0, \end{aligned} $$
or approximates the total number of distinct models by computing
$$\displaystyle \begin{aligned} J_R(\theta;\gamma) := \gamma \sum_{t,s=1}^N w(s,t)\|\theta_t-\theta_{s}\|_0 \end{aligned} $$
for suitable weights w(s, t); see Ohlsson and Ljung (2013).
As discussed above, these quasinorms lead, in general, to intractable (NP-hard) optimization problems. An exception is the bounded-noise case, in which the problem can be reformulated as a convex one; see Ozay et al. (2012). In general, relaxations are used, typically based on the ℓ1/group-ℓ1 penalties, thus relaxing (21) and (22) to
$$\displaystyle \begin{aligned} \begin{array}{rcl} J_R(\theta;\lambda)&:=& \lambda \sum_{t=2}^N \|\theta_t-\theta_{t-1}\|_1\\ J_R(\theta;\lambda)&:= &\lambda \sum_{t,s=1}^N w(s,t)\|\theta_t-\theta_{s}\|_1 \end{array} \end{aligned} $$
This yields the convex optimization problems:
$$\displaystyle \begin{aligned} \mathop\mathrm{min}_{\theta_t} \sum_t \left(y_{t} -\phi_t^\top \theta_t\right)^2 + \lambda \sum_{t=2}^N \|\theta_t-\theta_{t-1}\|_1 \end{aligned} $$
$$\displaystyle \begin{aligned} \mathop\mathrm{min}_{\theta_t} \sum_t \left(y_{t} -\phi_t^\top \theta_t\right)^2 +\lambda \sum_{t,s=1}^N w(s,t)\|\theta_t-\theta_{s}\|_1 \end{aligned} $$
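In the scalar case with ϕt ≡ 1, the first of these problems reduces to one-dimensional total variation denoising, i.e., fused-lasso segmentation. The sketch below (our own minimal solver, written as a standard ADMM splitting on z = Dθ; one of several ways to handle the nonsmooth term) recovers a piecewise-constant θt, with the switch showing up as the single large jump in the estimate:

```python
import numpy as np

def fused_lasso_1d(y, lam, rho=1.0, n_iter=300):
    """Minimize sum_t (y_t - theta_t)^2 + lam * sum_t |theta_t - theta_{t-1}|
    via ADMM with the splitting z = D theta (D = first-difference matrix)."""
    N = len(y)
    D = np.eye(N - 1, N, k=1) - np.eye(N - 1, N)   # (D theta)_t = theta_{t+1} - theta_t
    A = 2 * np.eye(N) + rho * D.T @ D              # theta-update system matrix
    z = np.zeros(N - 1)
    u_dual = np.zeros(N - 1)
    theta = y.copy()
    for _ in range(n_iter):
        theta = np.linalg.solve(A, 2 * y + rho * D.T @ (z - u_dual))
        w = D @ theta + u_dual
        z = np.sign(w) * np.maximum(np.abs(w) - lam / rho, 0.0)  # soft threshold
        u_dual += D @ theta - z
    return theta

rng = np.random.default_rng(3)
truth = np.concatenate([np.zeros(50), 3.0 * np.ones(50)])  # one switch at t = 50
y = truth + 0.2 * rng.standard_normal(100)
theta = fused_lasso_1d(y, lam=2.0)
jump = int(np.argmax(np.abs(np.diff(theta))))              # detected switch location
print(jump)
```

The ℓ1 penalty on successive differences collapses the noisy within-segment variation to (nearly) zero while paying the cost of a single large jump, which is exactly the segmentation behavior the relaxation is designed to produce.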

Summary and Future Directions

We have presented a bird’s-eye overview of regularization methods in system identification. By necessity this overview is incomplete, and we encourage the reader to browse the recent literature for new developments on this exciting topic; we hope the references provided here are a good starting point. While regularization is quite an old topic, it is fair to say that the nontrivial interaction between regularization and system-theoretic concepts provides a wealth of interesting and challenging problems, while also offering powerful new tools for tackling demanding practical problems.

Just to mention a few open questions: (i) a thorough analysis of the interaction between different types of priors (e.g., stability vs. complexity vs. structure) is still lacking; (ii) some preliminary results (Formentin and Chiuso, 2018) suggest that tailoring regularization to control objectives improves control performance, and further research in this direction is certainly needed; (iii) the statistical properties of Bayesian procedures such as SBL and its extensions in the context of system identification are not completely understood. Last but not least, while some results are available, nonlinear system identification still offers significant challenges.

Recommended Reading

The use of regularization methods for system identification can be traced back to the 1980s; see Doan et al. (1984) and Kitagawa and Gersh (1984). Yet it is fair to say that the most significant developments are rather recent, and the literature is therefore not yet settled. The reader may consult Fazel et al. (2001), Pillonetto et al. (2011), Chen et al. (2012), Chiuso and Pillonetto (2012), Chiuso (2016), Prando et al. (2017a), Pillonetto and Chiuso (2015), Zorzi and Chiuso (2018), Zorzi and Chiuso (2017), Romeres et al. (2019), and references therein. Clearly all this work has largely benefitted from cross-fertilization with neighboring areas, and, as such, very relevant work can be found in the fields of machine learning (Bach et al., 2004; Mackay, 1994; Tipping, 2001; Rasmussen and Williams, 2006), statistics (Hocking, 1976; Tibshirani, 1996; Fan and Li, 2001; Wang et al., 2007; Yuan and Lin, 2006; Zou, 2006), signal processing (Donoho, 2006; Wipf et al., 2011), and econometrics (Banbura et al., 2010).



  1. Aravkin A, Burke J, Chiuso A, Pillonetto G (2014) Convex vs non-convex estimators for regression and sparse estimation: the mean squared error properties of ARD and GLASSO. J Mach Learn Res 15:217–252
  2. Bach F, Lanckriet G, Jordan M (2004) Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the 21st international conference on machine learning, Banff, pp 41–48
  3. Banbura M, Giannone D, Reichlin L (2010) Large Bayesian VARs. J Appl Econom 25:71–92
  4. Chen T, Ohlsson H, Ljung L (2012) On the estimation of transfer functions, regularizations and Gaussian processes – revisited. Automatica 48:1525–1535
  5. Chiuso A (2016) Regularization and Bayesian learning in dynamical systems: past, present and future. Annu Rev Control 41:24–38
  6. Chiuso A, Pillonetto G (2012) A Bayesian approach to sparse dynamic network identification. Automatica 48:1553–1565
  7. Daniel M, Robert JP, Thomas S, Claudio M, Dario F, Gustavo S (2010) Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci 107:6286–6291
  8. Dankers A, Van den Hof PMJ, Bombois X, Heuberger PSC (2016) Identification of dynamic models in complex networks with prediction error methods: predictor input selection. IEEE Trans Autom Control 61:937–952
  9. Doan T, Litterman R, Sims C (1984) Forecasting and conditional projection using realistic prior distributions. Econom Rev 3:1–100
  10. Donoho D (2006) Compressed sensing. IEEE Trans Inf Theory 52:1289–1306
  11. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
  12. Fazel M, Hindi H, Boyd S (2001) A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the 2001 American control conference, Arlington, vol 6, pp 4734–4739
  13. Formentin S, Chiuso A (2018) CoRe: control-oriented regularization for system identification. In: 2018 IEEE conference on decision and control (CDC), pp 2253–2258
  14. Hayden D, Chang YH, Goncalves J, Tomlin CJ (2016) Sparse network identifiability via compressed sensing. Automatica 68:9–17
  15. Hocking RR (1976) A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics 32:1–49
  16. Kitagawa G, Gersh H (1984) A smoothness priors-state space modeling of time series with trends and seasonalities. J Am Stat Assoc 79:378–389
  17. Leeb H, Pötscher B (2005) Model selection and inference: facts and fiction. Econom Theory 21:21–59
  18. Ljung L (1999) System identification – theory for the user. Prentice Hall, Upper Saddle River
  19. Mackay D (1994) Bayesian non-linear modelling for the prediction competition. ASHRAE Trans 100:3704–3716
  20. Ohlsson H, Ljung L (2013) Identification of switched linear regression models using sum-of-norms regularization. Automatica 49:1045–1050
  21. Ozay N, Sznaier M, Lagoa C, Camps O (2012) A sparsification approach to set membership identification of switched affine systems. IEEE Trans Autom Control 57:634–648
  22. Pillonetto G, Chiuso A (2015) Tuning complexity in regularized kernel-based regression and linear system identification: the robustness of the marginal likelihood estimator. Automatica 58:106–117
  23. Pillonetto G, De Nicolao G (2010) A new kernel-based approach for linear system identification. Automatica 46:81–93
  24. Pillonetto G, Chiuso A, De Nicolao G (2011) Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica 47:291–305
  25. Pillonetto G, Chen T, Chiuso A, De Nicolao G, Ljung L (2016) Regularized linear system identification using atomic, nuclear and kernel-based norms: the role of the stability constraint. Automatica 69:137–149
  26. Prando G, Chiuso A, Pillonetto G (2017a) Maximum entropy vector kernels for MIMO system identification. Automatica 79:326–339
  27. Prando G, Zorzi M, Bertoldo A, Chiuso A (2017b) Estimating effective connectivity in linear brain network models. In: 2017 IEEE 56th annual conference on decision and control (CDC), pp 5931–5936
  28. Rasmussen C, Williams C (2006) Gaussian processes for machine learning. MIT, Cambridge
  29. Razi A, Seghier ML, Zhou Y, McColgan P, Zeidman P, Park H-J, Sporns O, Rees G, Friston KJ (2017) Large-scale DCMs for resting-state fMRI. Netw Neurosci 1:222–241
  30. Romeres D, Zorzi M, Camoriano R, Traversaro S, Chiuso A (2019, in press) Derivative-free online learning of inverse dynamics models. IEEE Trans Control Syst Technol
  31. Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58:267–288
  32. Tipping M (2001) Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res 1:211–244
  33. Wang H, Li G, Tsai C (2007) Regression coefficient and autoregressive order shrinkage and selection via the LASSO. J R Stat Soc Ser B 69:63–78
  34. Wipf D, Rao B, Nagarajan S (2011) Latent variable Bayesian models for promoting sparsity. IEEE Trans Inf Theory 57:6236–6255
  35. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68:49–67
  36. Zorzi M, Chiuso A (2017) Sparse plus low rank network identification: a nonparametric approach. Automatica 76:355–366
  37. Zorzi M, Chiuso A (2018) The harmonic analysis of kernel functions. Automatica 94:125–137
  38. Zou H (2006) The adaptive Lasso and its oracle properties. J Am Stat Assoc 101:1418–1429

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Department of Information Engineering, University of Padova, Padova, Italy

Section editors and affiliations

  • Lennart Ljung
  1. Division of Automatic Control, Department of Electrical Engineering, Linköping University, Linköping, Sweden