# System Identification Techniques: Convexification, Regularization, Relaxation


**DOI:**https://doi.org/10.1007/978-1-4471-5102-9_101-3


## Abstract

System identification has been developed, by and large, following the classical parametric approach. In this entry we discuss how regularization theory can be employed to tackle the system identification problem from a nonparametric (or semi-parametric) point of view. Both regularization for smoothness and regularization for sparseness are discussed, as flexible means to face the bias/variance dilemma and to perform model selection. These techniques also have advantages from the computational point of view, sometimes leading to convex optimization problems.

## Keywords

Kernel methods · Nonparametric methods · Optimization · Sparse Bayesian learning · Sparsity

## Introduction

System identification is concerned with automatic model building from measured data. Under this unifying umbrella, this field spans a rather broad spectrum of topics, considering different model classes (linear, hybrid, nonlinear, continuous, and discrete time) as well as a variety of methodologies and algorithms, bringing together in a nontrivial way concepts from classical statistics, machine learning, and dynamical systems.

Even though considerable effort has been devoted to specific areas, such as parametric methods for linear system identification which are by now well developed (see the introductory article “System Identification: An Overview”), it is fair to say that modeling still is, by far, the most time-consuming and costly step in advanced process control applications. As such, the demand for fast and reliable automated procedures for system identification makes this exciting field still a very active and lively one.

It suffices here to recall that, following the classic parametric maximum likelihood (ML)/prediction error (PE) framework, the candidate models are described using a finite number of parameters \(\theta \in \mathbb {R}^n\). After the model classes have been specified, the following two steps have to be undertaken:

- (i)
Estimate the model complexity \(\hat {n}\).
- (ii)
Find the estimator \(\hat {\theta }\in {\mathbb {R}}^{\hat {n}}\) minimizing a cost function *J*(*θ*), e.g., the prediction error or (minus) the log-likelihood.

The model complexity *n* has to be estimated as per step (i). In practice there is never a "true" model, certainly not in the model class considered. The problem of statistical modeling is first of all an approximation problem: one seeks an approximate description of "reality" which is at the same time simple enough to be learned from the available data and accurate enough for the purpose at hand. On this issue see also the section "Trade-off Between Bias and Variance" in "System Identification: An Overview". This has nontrivial implications, chiefly the facts that classical order selection criteria are based on asymptotic arguments and that the statistical properties of estimators \(\hat {\theta }\) after model selection, called post-model-selection estimators (PMSEs), are in general difficult to study (Leeb and Pötscher, 2005) and may lead to undesirable behavior. Experimental evidence shows that this is not only a theoretical problem but also a practical one (Pillonetto et al., 2011; Chen et al., 2012). On top of this statistical aspect, there is also a computational one: the model selection step, which includes variable selection and structure selection as special cases, may lead to computationally intractable combinatorial problems. Two simple examples which reveal the combinatorial explosion of candidate models are the following:

- (a)
*Variable selection*: consider a high-dimensional time series (MIMO) where not all inputs/outputs are relevant, and one would like to select *k* out of *m* available input signals, where *k* is not known and needs to be inferred from data (see, e.g., Banbura et al. 2010 and Chiuso and Pillonetto 2012).
- (b)
*Structure selection*: consider all autoregressive models of maximal lag *p* with only *p*_{0} < *p* nonzero coefficients; one would like to estimate how many (*p*_{0}) and which coefficients are nonzero.

The system identification community, inspired by work in statistics (Tibshirani, 1996; Mackay, 1994), machine learning (Rasmussen and Williams, 2006; Tipping, 2001; Bach et al., 2004), and signal processing (Donoho, 2006; Wipf et al., 2011), has recently developed and adapted methods based on regularization to jointly perform model selection and estimation in a computationally efficient and statistically robust manner. Different regularization strategies have been employed which can be classified in two main classes: regularization induced by so-called smoothness priors (aka Tikhonov regularization; see Kitagawa and Gersh (1984) and Doan et al. (1984) for early references in the field of dynamical systems) and regularization for selection. This latter is usually achieved by convex relaxation of the *ℓ*_{0} quasinorm (such as *ℓ*_{1} norm and variations thereof such as sum of norms, nuclear norm, etc.) or other non-convex sparsity-inducing penalties which can be conveniently derived in a Bayesian framework, aka sparse Bayesian learning (SBL) (Mackay, 1994; Tipping, 2001; Wipf et al., 2011).

The purpose of this entry is to guide the reader through the most interesting and promising results on this topic, as well as areas of active research; this subjective view reflects only the author's opinion, and different authors might have offered a different perspective.

While, as mentioned above, system identification studies various classes of models (ranging from linear to general “nonlinear” models), in this entry, we shall restrict our attention to specific ones, namely, linear and hybrid dynamical systems. The field of nonlinear system identification is so vast (a quote sometimes attributed to S. Ulam has it that the study of nonlinear systems is a sort of “non-elephant zoology”) that even though it has largely benefitted from the use of regularization, it cannot be addressed within the limited space of this contribution. The reader is referred to the Encyclopedia chapters “Nonlinear System Identification: An Overview of Common Approaches” and “Nonlinear System Identification Using Particle Filters” for more details on nonlinear model identification.

## System Identification

Let \(u_{t} \in \mathbb {R}^m\) and \(y_{t} \in \mathbb {R}^p\) be, respectively, the measured *input* and *output* signals in a dynamical system; the purpose of system identification is to find, from a finite collection of input-output data {*u*_{t}, *y*_{t}}_{t ∈ [1,N]}, a “good” dynamical model which describes the phenomenon under observation. The candidate model will be searched for within a so-called model set denoted by \({\mathscr {M}}\). This set can be described in parametric form (see, e.g., Eq. (3) in “System Identification: An Overview”) or in a nonparametric form. In this entry we shall use the symbol \(\mathscr {M}_n(\theta )\) for parametric model classes where the subscript *n* denotes the model complexity, i.e., the number of free parameters.

### Linear Models

Consider the linear model

$$ y_t \;=\; \sum_{k=1}^{\infty} g_k\, u_{t-k} \;+\; \sum_{k=0}^{\infty} h_k\, e_{t-k} \qquad (1) $$

where *g* and *h* are the so-called impulse responses of the system and \(\{e_t\}_{t\in \mathbb {Z}}\) is a zero-mean white noise process which, under suitable assumptions, is the one-step-ahead prediction error; a convenient description of the linear system (1) is given in terms of the transfer functions \(G(q) = \sum_{k=1}^{\infty} g_k q^{-k}\) and \(H(q) = \sum_{k=0}^{\infty} h_k q^{-k}\). The noise *e*_{t} in (1) is the so-called *innovation* process \(e_t = y_t - \hat {y}_{t|t-1}\). See also Eq. (8) in "System Identification: An Overview".

When *g* and *h* are described in a parametric form, we shall use the notation *g*_{k}(*θ*), *h*_{k}(*θ*), and, likewise, *G*(*q*, *θ*), *H*(*q*, *θ*), and \(\hat {y}_{t|t-1}(\theta )\).

### Example 1

Consider an output-error (OE) model, i.e., model (1) with *H*(*q*) = 1. An example of *parametric* model class is obtained restricting *G*(*q*, *θ*) to be a rational function

$$ G(q,\theta) \;=\; K\, \frac{\prod_{i=1}^{n}(q - z_i)}{\prod_{i=1}^{n}(q - p_i)} $$

where *θ* := [*K*, *p*_{1}, *z*_{1}, …, *p*_{n}, *z*_{n}] is the parameter vector. Note that the parameter vector *θ* may be subjected to constraints *θ* ∈ *Θ*, e.g., enforcing that the system be bounded-input bounded-output (BIBO) stable (|*p*_{i}| < 1) or that the impulse response be real (\(K \in \mathbb {R}\) and poles *p*_{i} and zeros *z*_{i} appear in complex conjugate pairs).

A *nonparametric* model is obtained, e.g., postulating that *g*_{k} is a realization of a Gaussian process (Rasmussen and Williams, 2006) with zero mean and a certain covariance function *R*(*t*, *s*) = cov(*g*_{t}, *g*_{s}). For instance, the choice *R*(*t*, *s*) = *λ*^{t} *δ*_{t−s}, where |*λ*| < 1 and *δ*_{k} is the Kronecker symbol, postulates that *g*_{t} and *g*_{s} are uncorrelated for *t* ≠ *s* and that the variance of *g*_{t} decays exponentially in *t*; this latter condition ensures that each realization *g*_{k}, *k* > 0, is BIBO stable with probability one. The exponential decay of *g*_{t} guarantees that, for any practical purpose, it can be considered zero for *t* > *T* for a suitably large *T*. This allows one to approximate the OE model with a "long" *finite impulse response* (FIR) model

$$ y_t \;=\; \sum_{k=1}^{T} g_k\, u_{t-k} + e_t \qquad (2) $$

where *g*_{k}, *k* = 1, …, *T*, is modeled as a zero-mean Gaussian vector with covariance Σ, with elements [Σ]_{ts} = *R*(*t*, *s*).

### Remark 1

Note that the model (2), which has been obtained from truncation of a nonparametric model, could in principle be thought of as a parametric model in which the parameter vector *θ* contains all the entries of *g*_{k}, *k* = 1, …, *T*. Yet the truncation index *T* may have to be large even for relatively "simple" impulse responses; for instance, \(\{g_k(\theta )\}_{k\in \mathbb {Z}^+}\) may be a simple decaying exponential, *g*_{k}(*θ*) = *αρ*^{k}, which is described by two parameters (amplitude and decay rate), yet if |*ρ*|≃ 1, the truncation index *T* needs to be large (ideally *T* →*∞*) to obtain sensible results (e.g., with low bias). Therefore, the number of parameters *T*(*m* × *p*) may be larger (and in fact much larger) than the available number of data points *N*. Under these conditions, the parameter *θ* cannot be estimated from any finite data segment unless further constraints are imposed.
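The point of Remark 1 can be visualized by sampling from the nonparametric prior. The following minimal sketch (NumPy; the values of *T* and *λ* are illustrative choices, not from the text) draws realizations from the diagonal kernel *R*(*t*, *s*) = *λ*^{t}*δ*_{t−s} of Example 1 and checks that the tail lags carry almost no energy, which is what justifies the FIR truncation (2):

```python
import numpy as np

# Draw impulse responses from the prior of Example 1: R(t, s) = lam**t * delta_{t-s}.
# T and lam are illustrative choices, not values from the text.
rng = np.random.default_rng(0)
T, lam = 50, 0.8
t = np.arange(1, T + 1)
Sigma = np.diag(lam ** t)                # diagonal kernel: uncorrelated g_t
g = rng.multivariate_normal(np.zeros(T), Sigma, size=5)

# var(g_t) = lam**t decays exponentially, so the second half of the lags
# carries a negligible fraction of the total energy of each realization
tail_energy = (g[:, T // 2:] ** 2).sum() / (g ** 2).sum()
print(tail_energy)
```

Increasing *λ* toward 1 fattens the tail, which is exactly the situation in Remark 1 where the truncation index *T* must grow.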

## The Role of Regularization in Linear System Identification

In order to simplify the presentation, we shall refer to the linear model (1) and assume that *H*(*q*) = 1, i.e., we consider the so-called linear output-error (OE) models. The extension to more general model classes can be found in Pillonetto et al. (2011), Chen et al. (2012), Chiuso and Pillonetto (2012), and references therein.

In this setting, the regularized identification problem takes the form

$$ \hat{\theta} \;:=\; \arg\min_{\theta}\; J_{F}(\theta) + J_{R}(\theta;\lambda) \qquad (3) $$

where *J*_{F}(*θ*) is the "fit" term, often measured in terms of average squared prediction errors,

$$ J_{F}(\theta) \;=\; \frac{1}{N}\sum_{t=1}^{N} \| y_t - \hat{y}_{t|t-1}(\theta)\|^2 \qquad (4) $$

and *J*_{R}(*θ*;*λ*) is a regularization term which penalizes parameter vectors *θ* associated to "unlikely" systems. Equation (3) can be seen as a way to deal with the *bias-variance trade-off*. The regularization term *J*_{R}(*θ*;*λ*) may depend upon some regularization parameters *λ* which need to be tuned using measured data. In its simplest instance, *λ* is a scale factor that controls "how much" regularization is needed. We now discuss different forms of regularization *J*_{R}(*θ*;*λ*) which have been studied in the literature.

### Example 2

With reference to the FIR model (2), let *θ* be a vector containing all the unknown coefficients of the impulse response {*g*_{k}}_{k=1,…,T}. The linear least squares estimator

$$ \hat{\theta}_{\mathrm{LS}} \;:=\; \arg\min_{\theta}\; \sum_{t=1}^{N} \| y_t - \hat{y}_{t|t-1}(\theta)\|^2 \qquad (5) $$

is well behaved only when the number of data points *N* is larger (and in fact much larger) than the number of parameters *T*. From the statistical point of view, the estimator (5) would result for large *T* in small bias and large variance. The purpose of regularization is to render the inverse problem of finding *θ* from the data {*y*_{t}}_{t=1,…,N} well posed, thus better trading bias versus variance. The simplest form of regularization is indeed the so-called ridge regression or its weighted version (aka generalized Tikhonov regularization), where the 2-norm of the vector *θ* is weighted w.r.t. a positive semidefinite matrix *Q*:

$$ \hat{\theta}_{\mathrm{Reg}} \;:=\; \arg\min_{\theta}\; \sum_{t=1}^{N} \| y_t - \hat{y}_{t|t-1}(\theta)\|^2 + \lambda\, \theta^\top Q\, \theta \qquad (6) $$

The choice of *Q* is highly nontrivial in the system identification context, and the performance of the regularized estimator \(\hat \theta _{\mathrm {Reg}}\) heavily depends on it.
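A minimal numerical sketch of least squares versus the Tikhonov-regularized estimator on a simulated FIR model; the true system, noise level, sizes, and the naive choice Q = I below are all illustrative assumptions:

```python
import numpy as np

# Ridge / generalized Tikhonov regularization for a FIR model (all specific
# values below -- T, N, the true response, Q = I -- are illustrative).
rng = np.random.default_rng(1)
N, T = 200, 50
g_true = 0.9 ** np.arange(1, T + 1)              # exponentially decaying response
u = rng.standard_normal(N + T)
Phi = np.column_stack([u[T - k: T - k + N] for k in range(1, T + 1)])
y = Phi @ g_true + 0.1 * rng.standard_normal(N)

# least squares (5) and its regularized version with penalty lam * theta' Q theta
theta_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
Q, lam = np.eye(T), 1.0
theta_reg = np.linalg.solve(Phi.T @ Phi + lam * Q, Phi.T @ y)
print(np.linalg.norm(theta_ls - g_true), np.linalg.norm(theta_reg - g_true))
```

In the regime *T* comparable to (or exceeding) *N*, the unregularized solve becomes ill conditioned and the regularized normal equations remain well posed; a *Q* encoding decay and smoothness, rather than the identity, is where the system identification content enters.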

### Remark 2

In order to formalize these ideas for nonparametric models or, equivalently, when the parameter *θ* is infinite dimensional, one has to bring in functional analytic tools, such as reproducing kernel Hilbert spaces (RKHS). This is rather standard in the literature on ill-posed inverse problems and has been recently introduced also in the system identification setting (Pillonetto et al., 2011). We shall not discuss these issues here because, we believe, the formalism would render the content less accessible.

This regularization framework admits a Bayesian interpretation: define

$$ J_{F}(\theta) := -\log p(y|\theta), \qquad J_{R}(\theta;\lambda) := -\log p(\theta|\lambda) \qquad (7) $$

where *p*(*y*|*θ*) and *p*(*θ*|*λ*) are, respectively, the likelihood function and the prior, which in turn may depend on the unknown regularization parameters *λ*, aka hyperparameters in this Bayesian formulation. This is straightforward in the finite-dimensional setting, while it requires some care when *θ* is infinite dimensional. With reference to Example 1 and assuming *θ* contains the impulse response coefficients *g*_{k} in (2), *p*(*θ*|*λ*) is a Gaussian density with zero mean and covariance Σ which may depend upon some regularization parameters *λ*. From the definitions (7), it follows that

$$ \hat{\theta} \;=\; \arg\min_{\theta}\; J_F(\theta) + J_R(\theta;\lambda) \;=\; \arg\max_{\theta}\; p(\theta|y,\lambda) \qquad (8) $$

i.e., the regularized estimator (3) coincides with the maximum a posteriori (MAP) estimator; more generally, from the posterior *p*(*θ*|*y*, *λ*), other Bayesian point estimators of *θ* can be obtained (e.g., the posterior mean). As such, with some abuse of terminology, we shall indifferently refer to *J*_{R}(*θ*;*λ*) as the "regularization term" or the "prior." The unknown parameter *λ* is used to introduce some flexibility in the regularization term *J*_{R}(*θ*;*λ*), or equivalently in the prior *p*(*θ*|*λ*), and is tuned based on measured data as discussed later on.

The regularization term *J*_{R}(*θ*;*λ*) can be roughly classified into *regularization for smoothness*, which attempts to control complexity in a smooth fashion, and *regularization for sparseness*, which, on top of estimation, also aims at selecting among a finite (yet possibly very large) number of candidate model classes.

### Regularization for Smoothness

Consider the FIR model (2) with *T* arbitrarily large, and let \(\theta := [g_1 \; g_2\; \ldots g_{T}]^\top \in \mathbb {R}^{T}\) be the (finite) impulse response; define also \(y\in \mathbb {R}^N\) as the vector of output observations, *Φ* as the regressor matrix with past input samples, and *e* as the vector of innovations (zero mean, variance *σ*^{2}*I*). With this notation the convolution input-output equation (1) takes the form

$$ y \;=\; \Phi\,\theta + e \qquad (9) $$

Regularization for smoothness is obtained taking

$$ J_R(\theta;\lambda) \;=\; \theta^\top K^{-1}(\lambda)\, \theta \qquad (10) $$

where the positive definite matrix *K*(*λ*), aka kernel, is tailored to capture specific properties of impulse responses (exponential decay, BIBO stability, smoothness, etc.). Early references include Doan et al. (1984) and Kitagawa and Gersh (1984), while more recent work can be found in Pillonetto and De Nicolao (2010), Pillonetto et al. (2011), and Chen et al. (2012), where several choices of kernels are discussed; the recent paper Zorzi and Chiuso (2018) provides a frequency domain interpretation of several kernel functions using tools from harmonic analysis.

### Example 3

A typical choice is a kernel encoding exponential decay, e.g., \([K(\lambda)]_{ts} = \gamma\, \rho^{\max(t,s)}\) (the so-called TC kernel; see Chen et al. (2012)), with hyperparameters *λ* := (*γ*, *ρ*), where 0 < *ρ* < 1 and *γ* ≥ 0.

For fixed *λ*, the estimator \(\hat \theta (\lambda )\) is the solution of a quadratic problem and can be written in closed form (aka ridge regression):

$$ \hat{\theta}(\lambda) \;=\; K(\lambda)\,\Phi^\top \big(\Phi K(\lambda) \Phi^\top + \sigma^2 I\big)^{-1} y \qquad (11) $$

Typical approaches to estimate the regularization parameters *λ* are cross validation (Ljung, 1999) and marginal likelihood maximization. This latter approach is based on the Bayesian interpretation given in Eq. (7), from which one can compute the so-called "empirical Bayes" estimator \(\hat {\theta }_{\mathrm {EB}}:= \hat \theta (\hat \lambda _{\mathrm {ML}})\) of *θ*, plugging in (11) the estimator of *λ* which maximizes the marginal likelihood:

$$ \hat{\lambda}_{\mathrm{ML}} \;:=\; \arg\max_{\lambda}\; p(y|\lambda) \;=\; \arg\max_{\lambda} \int p(y|\theta)\, p(\theta|\lambda)\, d\theta \qquad (12) $$

Since the marginal likelihood *p*(*y*|*λ*) is obtained integrating out *θ*, it automatically accounts for the residual uncertainty in *θ* for fixed *λ*. When both *J*_{F} and *J*_{R} are quadratic costs, which corresponds to assuming that *e* and *θ* are independent and Gaussian, the marginal likelihood in (12) can be computed in closed form, so that

$$ \hat{\lambda}_{\mathrm{ML}} \;=\; \arg\min_{\lambda}\; \log\det\big(\Phi K(\lambda)\Phi^\top + \sigma^2 I\big) + y^\top \big(\Phi K(\lambda)\Phi^\top + \sigma^2 I\big)^{-1} y \qquad (13) $$

It is here interesting to observe that \(\hat \lambda _{\mathrm {ML}}\) which solves (12) under certain conditions leads to \(K(\hat \lambda _{\mathrm {ML}}) = 0\) (see Example 4), so that the estimator of *θ* in (11) satisfies \(\hat \theta (\hat \lambda _{\mathrm {ML}})=0\). This simple observation is the basis of so-called sparse Bayesian learning (SBL); we shall return to this issue in the next section when discussing regularization for sparsity and selection.

Unfortunately the optimization problem (12) (or (13)) is not convex and is thus subject to the issue of local minima. However, both experimental evidence and some theoretical results support the use of marginal likelihood maximization for estimating regularization parameters; see, e.g., Rasmussen and Williams (2006) and Aravkin et al. (2014).
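As a concrete illustration of the empirical Bayes pipeline (11)–(13), the sketch below tunes a diagonal kernel in the spirit of Example 1, K(λ) = γ diag(ρ, ρ², …), by a crude grid search on the negative log marginal likelihood and then plugs the selected hyperparameters into the closed-form estimator; the grid, the kernel shape, and the simulated system are illustrative assumptions:

```python
import numpy as np

# Empirical Bayes tuning of lam = (gamma, rho) by grid search on the negative
# log marginal likelihood, with a diagonal kernel K = gamma * diag(rho**t).
# Grid, sizes, and the simulated system are illustrative assumptions.
rng = np.random.default_rng(2)
N, T, sigma2 = 100, 30, 0.1
g_true = 2.0 * 0.7 ** np.arange(1, T + 1)
u = rng.standard_normal(N + T)
Phi = np.column_stack([u[T - k: T - k + N] for k in range(1, T + 1)])
y = Phi @ g_true + np.sqrt(sigma2) * rng.standard_normal(N)

def neg_marglik(gamma, rho):
    K = gamma * np.diag(rho ** np.arange(1, T + 1))
    Sy = Phi @ K @ Phi.T + sigma2 * np.eye(N)     # output covariance
    _, logdet = np.linalg.slogdet(Sy)
    return logdet + y @ np.linalg.solve(Sy, y)    # negative log marginal likelihood

grid = [(g, r) for g in (0.1, 1.0, 10.0) for r in (0.3, 0.5, 0.7, 0.9)]
gamma_hat, rho_hat = min(grid, key=lambda p: neg_marglik(*p))

# plug-in estimator at the selected hyperparameters (ridge-regression form)
K = gamma_hat * np.diag(rho_hat ** np.arange(1, T + 1))
theta_eb = K @ Phi.T @ np.linalg.solve(Phi @ K @ Phi.T + sigma2 * np.eye(N), y)
print((gamma_hat, rho_hat), np.linalg.norm(theta_eb - g_true))
```

In practice the grid search is replaced by gradient-based optimization of the marginal likelihood, which is where the non-convexity (and the possibility of local minima) mentioned above enters.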

### Regularization for Sparsity: Variable Selection and Order Estimation

The main purpose of regularization for sparseness is to provide estimators \(\hat \theta \) in which subsets or functions of the estimated parameters are equal to zero.

Consider now a MIMO version of the FIR model (2), written componentwise as in Eq. (14), where *y*_{t,j} denotes the *j*th component of \(y_{t}\in \mathbb {R}^p\); let also \(\theta \in \mathbb {R}^{T(m+p)}\) be the vector containing all the impulse response coefficients *g*_{k,ij}, *j* = 1, …, *p*, *i* = 1, …, *m*, and *k* = 1, …, *T*. With reference to Eq. (14), simple examples of sparsity one may be interested in are:

- (i)
Single elements of the parameter vector *θ*, which corresponds to eliminating specific lags of some variables from the model (14).
- (ii)
Groups of parameters, such as the impulse response from the *i*th input to the *j*th output, *g*_{k,ij}, *k* = 1, …, *T*, thereby eliminating the *i*th input from the model for the *j*th output.
- (iii)
The singular values of the Hankel matrix \(\mathscr {H}(\theta )\) formed with the impulse response coefficients *g*_{k}; in fact, the rank of the Hankel matrix equals the order (i.e., the McMillan degree) of the system. (Strictly speaking, any full rank FIR model of length *T* has McMillan degree *T* × *p*. Yet, we consider {*g*_{k}}_{k=1,…,T} to be the truncation of some "true" impulse response {*g*_{k}}_{k=1,…,∞}, and, as such, the finite Hankel matrix built with the coefficients *g*_{k} will have rank equal to the McMillan degree of \(G(q)= \sum _{k=1}^\infty g_k q^{-k}\).)

To this purpose one would like to penalize the number of nonzero terms, be they entries of *θ*, groups, or singular values. This is measured by the *ℓ*_{0} quasinorm or its variations: the group *ℓ*_{0} quasinorm and the *ℓ*_{0} quasinorm of the Hankel singular values, i.e., the rank of the Hankel matrix. Unfortunately, if *J*_{R} is a function of the *ℓ*_{0} quasinorm, the resulting optimization problem is computationally intractable; as such, one usually resorts to relaxations. Three common ones are described below.

One possibility is to resort to greedy algorithms such as orthogonal matching pursuit; however, generically it is not possible to guarantee convergence to a global minimum point.
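A minimal sketch of this greedy route (orthogonal matching pursuit): at each step, pick the regressor most correlated with the current residual, then refit by least squares on the selected support. The regression setup below is an illustrative assumption:

```python
import numpy as np

# Orthogonal matching pursuit (OMP): greedily select the column most
# correlated with the residual, then refit on the selected support.
def omp(Phi, y, n_nonzero):
    residual = y.copy()
    support = []
    for _ in range(n_nonzero):
        j = int(np.argmax(np.abs(Phi.T @ residual)))   # greedy selection step
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef          # orthogonalized residual
    theta = np.zeros(Phi.shape[1])
    theta[support] = coef
    return theta

rng = np.random.default_rng(3)
Phi = rng.standard_normal((100, 20))
theta_true = np.zeros(20)
theta_true[2], theta_true[7] = 1.5, -2.0               # two active regressors
y = Phi @ theta_true + 0.05 * rng.standard_normal(100)

theta_hat = omp(Phi, y, n_nonzero=2)
print(np.flatnonzero(theta_hat))
```

Greedy selection works well in benign regimes like this one, but the lack of global guarantees motivates the convex relaxations discussed next.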

A second possibility is to replace the *ℓ*_{0} quasinorm by its *convex envelope*, i.e., the *ℓ*_{1} norm, leading to algorithms known in statistics as LASSO (Tibshirani, 1996) or its group version, group LASSO (Yuan and Lin, 2006):

$$ J_R(\theta;\lambda) \;=\; \lambda\, \|\theta\|_1 \qquad \text{or} \qquad J_R(\theta;\lambda) \;=\; \lambda \sum_{i} \|\theta_i\|_2 \qquad (15) $$

Similarly, the convex envelope of the rank function (i.e., of the *ℓ*_{0} quasinorm of the singular values) is the so-called nuclear norm (aka Ky Fan *n*-norm or trace norm), which is the sum of the singular values, \(\| A \|_* := \mathrm{trace}\{\sqrt{A^\top A}\}\), where \(\sqrt {\cdot }\) denotes the matrix square root, well defined for positive semidefinite matrices. In order to control the order (McMillan degree) of a linear system, which is equal to the rank of the Hankel matrix \(\mathscr {H}(\theta )\) built with the impulse response described by the parameter *θ*, it is then possible to use the regularization term

$$ J_R(\theta;\lambda) \;=\; \lambda\, \|\mathscr{H}(\theta)\|_* \qquad (16) $$
Both (15) and (16) induce sparse or nearly sparse solutions (in terms of elements or groups of *θ* for (15), or in terms of Hankel singular values for (16)), making them attractive for selection. It is interesting to observe that both *ℓ*_{1} and group *ℓ*_{1} are special cases of the nuclear norm if one considers matrices with fixed eigenspaces. Yet, as well documented in the statistics literature, neither (15) nor (16) provides a fully satisfactory trade-off between sparsity and shrinking, which is controlled by the regularization parameter *λ*. As *λ* varies one obtains the so-called regularization path: increasing *λ*, the solution gets sparser but, unfortunately, suffers from shrinking of the nonzero parameters. To overcome these problems, several variations of LASSO have been developed and studied, such as adaptive LASSO (Zou, 2006), SCAD (Fan and Li, 2001), and so on. We shall now discuss a Bayesian alternative which, to some extent, provides a better trade-off between sparsity and shrinking than the *ℓ*_{1} norm.
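The *ℓ*_{1} relaxation (15) can be solved by simple first-order methods. The sketch below uses iterative soft thresholding (ISTA) on a two-input FIR model in which the second input is irrelevant; the sizes, noise level, and the value of λ are illustrative assumptions:

```python
import numpy as np

# ISTA for l1-regularized least squares (LASSO); the two-input FIR setup,
# sizes, noise level, and lam below are illustrative assumptions.
def ista_lasso(Phi, y, lam, n_iter=2000):
    L = np.linalg.norm(Phi, 2) ** 2                  # Lipschitz const. of the gradient
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = theta - Phi.T @ (Phi @ theta - y) / L    # gradient step
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return theta

rng = np.random.default_rng(4)
N, T = 120, 10

def lagged(u):
    # regressor matrix with T lagged copies of the input u
    return np.column_stack([u[T - k: T - k + N] for k in range(1, T + 1)])

u1 = rng.standard_normal(N + T)
u2 = rng.standard_normal(N + T)
Phi = np.hstack([lagged(u1), lagged(u2)])            # two inputs, T lags each
g1 = 0.6 ** np.arange(1, T + 1)                      # only input 1 is active
y = lagged(u1) @ g1 + 0.05 * rng.standard_normal(N)

theta_hat = ista_lasso(Phi, y, lam=5.0)
print(np.abs(theta_hat[:T]).max(), np.abs(theta_hat[T:]).max())
```

The block of coefficients belonging to the irrelevant input is driven to (essentially) zero, while the active coefficients survive with some shrinkage, illustrating the sparsity/shrinking trade-off discussed above; grouping the lags per input and thresholding the group norms would give the group LASSO variant.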

A third route is sparse Bayesian learning. Consider, for simplicity, the model (14) with *p* = 1 and *m* = 2, i.e.,

$$ y_t \;=\; \phi_{t,1}^\top g_1 + \phi_{t,2}^\top g_2 + e_t, \qquad \phi_{t,i} := [u_{t-1,i}\;\ldots\;u_{t-T,i}]^\top \qquad (17) $$

where *g*_{i} := [*g*_{1,i}, …, *g*_{T,i}]^{⊤}. Let \(\theta :=[g_1^\top \; g_2^\top ]^\top \), and assume that the *g*_{i}'s are independent Gaussian random vectors with zero mean and covariances *λ*_{i}*K*. Letting *Φ*_{i} := [*ϕ*_{1,i}, …, *ϕ*_{N,i}]^{⊤} and following the formulation in (7) and (8), it follows that the marginal likelihood estimator of *λ* takes the form

$$ \hat{\lambda}_{\mathrm{ML}} \;=\; \arg\min_{\lambda_1 \ge 0,\;\lambda_2 \ge 0}\; \log\det \Sigma_y(\lambda) + y^\top \Sigma_y(\lambda)^{-1} y, \qquad \Sigma_y(\lambda) := \sigma^2 I + \lambda_1 \Phi_1 K \Phi_1^\top + \lambda_2 \Phi_2 K \Phi_2^\top \qquad (18) $$

For fixed *λ*, the estimator of *θ* is found in closed form as per Eq. (11). It can be shown that, under certain conditions on the observation vector *y*, the estimated hyperparameters \(\hat \lambda _{\mathrm {ML},i}\) lie at the boundary, i.e., are exactly equal to zero. If \(\hat \lambda _{\mathrm {ML},i} = 0\), then, from Eq. (11), also \(\hat g_i = 0\); this reveals that in (17), the *i*th input does not enter the model; see also Example 4 for a simple illustration.

These Bayesian methods for sparsity have been studied in a general regression framework in Wipf et al. (2011) under the name of “type-II” maximum likelihood. Further results can be found in Aravkin et al. (2014) which suggest that these Bayesian methods provide a better trade-off between sparsity and shrinking (i.e., are able to provide sparse solution without inducing excessive shrinkage on the nonzero parameters).

### Remark 3

A more detailed analysis, see, for instance, Aravkin et al. (2014), shows that LASSO/GLASSO (i.e., *ℓ*_{1} penalties) and SBL using the "empirical Bayes" approach can be derived under a common Bayesian framework starting from the joint posterior *p*(*λ*, *θ*|*y*). While SBL is derived from the maximization over *λ* of the marginal posterior, LASSO/GLASSO corresponds to maximizing the joint posterior after a suitable change of variables. For reasons of space, we refer the interested reader to the literature for details.

Recent work on the use of sparseness for variable selection and model order estimation can be found in Wang et al. (2007), Chiuso and Pillonetto (2012), and references therein.

### Example 4

Consider the simple model

$$ y_t \;=\; \theta\, u_{t-1} + e_t $$

where *e*_{t} is zero-mean, unit-variance Gaussian white noise and *u*_{t} is a deterministic signal. The purpose is to estimate the coefficient *θ*, which could possibly be equal to zero. Thus, the estimator should reveal whether *u*_{t−1} influences *y*_{t} or not.

Following the Bayesian formulation (7), model *θ* as a Gaussian random variable, with zero mean and variance *λ*, independent of *e*_{t}. Therefore, *y*_{t} is also Gaussian, with zero mean and variance \(u_{t-1}^2 \lambda + 1\). Assuming *N* data points are available, the (marginal) likelihood function for *λ* is given by

$$ p(y|\lambda) \;=\; \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi\,(u_{t-1}^2\lambda + 1)}}\, \exp\!\left(-\,\frac{y_t^2}{2\,(u_{t-1}^2\lambda + 1)}\right) $$

whose maximizer *λ*_{∗} over *λ* ≥ 0 is found setting the derivative of the log-likelihood to zero. For instance, when *u*_{t} is constant (without loss of generality, say *u*_{t} = 1), we obtain

$$ \lambda_* \;=\; \max\!\left(0,\; \frac{1}{N}\sum_{t=1}^{N} y_t^2 \;-\; 1\right) $$

i.e., *λ*_{∗} = 0 whenever the sample variance of *y*_{t} is smaller than the variance of *e*_{t}, which was assumed to be equal to 1. Defining *Φ* = [*u*_{1}, …, *u*_{N}]^{⊤}, the empirical Bayes estimator of *θ*, as per Eq. (11), is given by

$$ \hat{\theta}_{\mathrm{EB}} \;=\; \lambda_*\, \Phi^\top \big(\lambda_*\, \Phi\Phi^\top + I\big)^{-1} y \;=\; \frac{\lambda_*\, \Phi^\top y}{\lambda_*\, \|\Phi\|^2 + 1} $$

which is exactly zero whenever *λ*_{∗} = 0.
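Example 4 is easy to check numerically. Assuming *u*_{t} = 1, the marginal likelihood maximizer is λ_∗ = max(0, (1/N)Σy_t² − 1), and the empirical Bayes estimate vanishes whenever λ_∗ = 0; the sample size and the value θ = 0.8 below are illustrative:

```python
import numpy as np

# Numerical check of Example 4 with u_t = 1 (N and theta = 0.8 are illustrative).
rng = np.random.default_rng(5)
N = 500

def lam_star(y):
    # marginal likelihood maximizer: max(0, sample second moment - noise variance)
    return max(0.0, float(np.mean(y ** 2)) - 1.0)

def theta_eb(y, lam):
    # scalar empirical Bayes estimator with Phi = ones(N)
    Phi = np.ones(N)
    return lam * (Phi @ y) / (lam * (Phi @ Phi) + 1.0)

e = rng.standard_normal(N)
y_weak = 0.5 * e            # sample variance below the noise floor: lam_star = 0
y_strong = 0.8 + e          # generated with theta = 0.8

print(lam_star(y_weak), theta_eb(y_weak, lam_star(y_weak)))      # both exactly 0
print(lam_star(y_strong), theta_eb(y_strong, lam_star(y_strong)))
```

The first case lands exactly at the boundary λ_∗ = 0, so the coefficient is pruned; in the second case λ_∗ > 0 and the estimate is close to 0.8. This is the sparsifying mechanism of SBL in its simplest form.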

Application of sparse learning is attracting enormous interest in the context of dynamic network models (Chiuso and Pillonetto, 2012; Dankers et al., 2016; Hayden et al., 2016), which arise, for instance, in gene regulatory networks (Daniel et al., 2010) and neuroscience (Prando et al., 2017b; Razi et al., 2017), where structured dynamic dependences should be inferred from measured data; for high-dimensional processes, the combinatorial explosion of alternative structures renders direct comparison intractable. Regularization allows one to avoid this curse of dimensionality, returning sparse (and thus interpretable) models with reasonable computational effort.

## Extensions: Regularization for Hybrid Systems Identification and Model Segmentation

Hybrid models are described by *K* different parameter vectors *θ*_{k}, *k* = 1, …, *K*, whose evolution over time is determined by a so-called switching mechanism. The name *hybrid* hints at the fact that the model is described by continuous-valued (*y*, *u*, and *e*) and discrete-valued (*k*) variables. A common example is the switched regression model

$$ y_t \;=\; \phi_t^\top\, \theta_{p(\phi_t, t)} + e_t \qquad (20) $$

where *ϕ*_{t} is a finite vector containing inputs *u*_{s} and outputs *y*_{s} in a finite past window *s* ∈ [*t* − *T*, *t* − 1], plus possibly a constant component to model changing "means." The value of *k* ∈ [1, *K*] is determined by the switching mechanism \(p(\phi _t,t): \mathbb {R}^{n_k}\times \mathbb {R} \rightarrow \{1,\ldots ,K\}\).

Two extreme but interesting cases are (i) *p*(*ϕ*_{t}, *t*) = *p*_{t}, where *p*(⋅) is an exogenous and not measurable signal, and (ii) *p*(*ϕ*_{t}, *t*) = *p*(*ϕ*_{t}), where *p*(⋅) is an endogenous unknown measurable function of the regression vector *ϕ*_{t}. In any case, from the identification point of view, *k* at time *t* *is not* assumed to be known, and, as such, the identification algorithm has to operate without knowledge of this switching mechanism.

Identification of systems in the form (20) requires to estimate (a) the number of models *K* and the position of the switches between different models, (b) the “dimension” of each model *n*_{k}, (c) the value of the parameters *θ*_{k}, and, possibly, (d) the function *p*(*ϕ*_{t}, *t*) which determines the switching mechanism.

Steps (b) and (c) are essentially as in section “System Identification” (see also the introductory paper “System Identification: An Overview”); however, this is complicated by steps (a) and (d), which in particular require that one is able to estimate, from data alone, which system is “active” at each time *t*.

This problem, known as *model segmentation*, has been tackled in the literature (see, e.g., Ozay et al. (2012) and Ohlsson and Ljung (2013), and references therein) by applying suitable penalties on the number of different models *K* and/or on the number of switches. Note that, writing *θ*_{t} for the parameter vector active at time *t*, *p*(*ϕ*_{t}, *t*) ≠ *p*(*ϕ*_{s}, *s*) if and only if *θ*_{t} ≠ *θ*_{s}. Based on this simple observation, one can construct a regularization which counts either the number of switches, i.e.,

$$ J_R(\theta_1,\ldots,\theta_N) \;=\; \sum_{t=2}^{N} \mathbb{1}\!\left(\theta_t \neq \theta_{t-1}\right) \qquad (21) $$

or the number of pairs of distinct models, i.e.,

$$ J_R(\theta_1,\ldots,\theta_N) \;=\; \sum_{t \neq s} \mathbb{1}\!\left(\theta_t \neq \theta_{s}\right) \qquad (22) $$

possibly weighted by some *w*(*t*, *s*); see Ohlsson and Ljung (2013).

As in the previous section, such counting penalties lead to combinatorial problems; a common remedy is convex relaxation via *ℓ*_{1}/group-*ℓ*_{1} penalties, thus relaxing (21) and (22) to the sum-of-norms penalties

$$ \sum_{t=2}^{N} \|\theta_t - \theta_{t-1}\| \qquad \text{and} \qquad \sum_{t \neq s} w(t,s)\, \|\theta_t - \theta_{s}\| \qquad (23) $$
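For intuition, the switch-counting penalty (21) can be minimized exactly in small cases. The toy sketch below (an illustrative assumption: a scalar piecewise-constant "means" model with at most one switch, not the algorithms of the cited works) scans all candidate switch times and minimizes squared error plus λ times the number of switches:

```python
import numpy as np

# Exact minimization of squared error + lam * (number of switches) for a
# scalar piecewise-constant model y_t = theta_t + e_t with at most one switch
# (a toy illustrative setup).
def segment_one_switch(y, lam):
    def sse(seg):
        return ((seg - seg.mean()) ** 2).sum()
    best_cost, best_s = sse(y), None          # zero switches: one global model
    for s in range(1, len(y)):                # one switch at time s
        cost = sse(y[:s]) + sse(y[s:]) + lam  # penalty lam per switch
        if cost < best_cost:
            best_cost, best_s = cost, s
    return best_s

rng = np.random.default_rng(6)
y = np.concatenate([0.0 + 0.3 * rng.standard_normal(60),
                    2.0 + 0.3 * rng.standard_normal(40)])
print(segment_one_switch(y, lam=5.0))     # detected switch index
print(segment_one_switch(y, lam=1000.0))  # heavy penalty: no switch (None)
```

With many potential switches this exhaustive search explodes combinatorially, which is precisely why the convex sum-of-norms relaxations of Ohlsson and Ljung (2013) are attractive: they trade exactness for a tractable convex program.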

## Summary and Future Directions

We have presented a bird’s-eye overview of regularization methods in system identification. By necessity this overview was certainly incomplete, and we encourage the reader to browse through the recent literature for new developments on this exciting topic; we hope the references we have provided are a good starting point. While regularization is quite an old topic, we believe it is fair to say that the nontrivial interaction between regularization and system theoretic concepts provides a wealth of interesting and challenging problems while also providing new powerful tools to tackle challenging practical problems.

Just to mention a few open questions, (i) a thorough analysis of the interaction between different types of priors (e.g., stability vs. complexity vs. structure) is still lacking; (ii) some preliminary results (Formentin and Chiuso, 2018) suggest that tailoring regularization to control objectives leads to improved results in terms of control performance, and further research in this direction is certainly needed; (iii) the statistical properties of Bayesian procedures such as SBL and its extensions in the context of system identification are not completely understood. Last but not least, while some results are available, nonlinear system identification still offers significant challenges.

## Recommended Reading

The use of regularization methods for system identification can be traced back to the 1980s; see Doan et al. (1984) and Kitagawa and Gersh (1984); yet it is fair to say that the most significant developments are rather recent, and therefore the literature is not established yet. The reader may consult Fazel et al. (2001), Pillonetto et al. (2011), Chen et al. (2012), Chiuso and Pillonetto (2012), Chiuso (2016), Prando et al. (2017a), Pillonetto and Chiuso (2015), Zorzi and Chiuso (2018), Zorzi and Chiuso (2017), Romeres et al. (2019), and references therein. Clearly all this work has largely benefitted from cross fertilization with neighboring areas, and, as such, very relevant work can be found in the fields of machine learning (Bach et al., 2004; Mackay, 1994; Tipping, 2001; Rasmussen and Williams, 2006), statistics (Hocking, 1976; Tibshirani, 1996; Fan and Li, 2001; Wang et al., 2007; Yuan and Lin, 2006; Zou, 2006), signal processing (Donoho, 2006; Wipf et al., 2011), and econometrics (Banbura et al., 2010).

## References

- Aravkin A, Burke J, Chiuso A, Pillonetto G (2014) Convex vs non-convex estimators for regression and sparse estimation: the mean squared error properties of ARD and GLASSO. J Mach Learn Res 15:217–252
- Bach F, Lanckriet G, Jordan M (2004) Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the 21st international conference on machine learning, Banff, pp 41–48
- Banbura M, Giannone D, Reichlin L (2010) Large Bayesian VARs. J Appl Econom 25:71–92
- Chen T, Ohlsson H, Ljung L (2012) On the estimation of transfer functions, regularizations and Gaussian processes – revisited. Automatica 48:1525–1535
- Chiuso A (2016) Regularization and Bayesian learning in dynamical systems: past, present and future. Annu Rev Control 41:24–38
- Chiuso A, Pillonetto G (2012) A Bayesian approach to sparse dynamic network identification. Automatica 48:1553–1565
- Daniel M, Robert JP, Thomas S, Claudio M, Dario F, Gustavo S (2010) Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci 107:6286–6291
- Dankers A, Van den Hof PMJ, Bombois X, Heuberger PSC (2016) Identification of dynamic models in complex networks with prediction error methods: predictor input selection. IEEE Trans Autom Control 61:937–952
- Doan T, Litterman R, Sims C (1984) Forecasting and conditional projection using realistic prior distributions. Econom Rev 3:1–100
- Donoho D (2006) Compressed sensing. IEEE Trans Inf Theory 52:1289–1306
- Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
- Fazel M, Hindi H, Boyd S (2001) A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the 2001 American control conference, Arlington, vol 6, pp 4734–4739
- Formentin S, Chiuso A (2018) CoRe: control-oriented regularization for system identification. In: 2018 IEEE conference on decision and control (CDC), pp 2253–2258
- Hayden D, Chang YH, Goncalves J, Tomlin CJ (2016) Sparse network identifiability via compressed sensing. Automatica 68:9–17
- Hocking RR (1976) A biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics 32:1–49
- Kitagawa G, Gersh H (1984) A smoothness priors-state space modeling of time series with trends and seasonalities. J Am Stat Assoc 79:378–389
- Leeb H, Pötscher B (2005) Model selection and inference: facts and fiction. Econom Theory 21:21–59
- Ljung L (1999) System identification – theory for the user. Prentice Hall, Upper Saddle River
- Mackay D (1994) Bayesian non-linear modelling for the prediction competition. ASHRAE Trans 100:3704–3716
- Ohlsson H, Ljung L (2013) Identification of switched linear regression models using sum-of-norms regularization. Automatica 49:1045–1050
- Ozay N, Sznaier M, Lagoa C, Camps O (2012) A sparsification approach to set membership identification of switched affine systems. IEEE Trans Autom Control 57:634–648
- Pillonetto G, Chiuso A (2015) Tuning complexity in regularized kernel-based regression and linear system identification: the robustness of the marginal likelihood estimator. Automatica 58:106–117
- Pillonetto G, De Nicolao G (2010) A new kernel-based approach for linear system identification. Automatica 46:81–93
- Pillonetto G, Chiuso A, De Nicolao G (2011) Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica 47:291–305
- Pillonetto G, Chen T, Chiuso A, Nicolao GD, Ljung L (2016) Regularized linear system identification using atomic, nuclear and kernel-based norms: the role of the stability constraint. Automatica 69:137–149
- Prando G, Chiuso A, Pillonetto G (2017a) Maximum entropy vector kernels for MIMO system identification. Automatica 79:326–339
- Prando G, Zorzi M, Bertoldo A, Chiuso A (2017b) Estimating effective connectivity in linear brain network models. In: 2017 IEEE 56th annual conference on decision and control (CDC), pp 5931–5936
- Rasmussen C, Williams C (2006) Gaussian processes for machine learning. MIT, Cambridge
- Razi A, Seghier ML, Zhou Y, McColgan P, Zeidman P, Park H-J, Sporns O, Rees G, Friston KJ (2017) Large-scale DCMs for resting-state fMRI. Netw Neurosci 1:222–241
- Romeres D, Zorzi M, Camoriano R, Traversaro S, Chiuso A (2019, in press) Derivative-free online learning of inverse dynamics models. IEEE Trans Control Syst Technol
- Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58:267–288
- Tipping M (2001) Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res 1:211–244
- Wang H, Li G, Tsai C (2007) Regression coefficient and autoregressive order shrinkage and selection via the LASSO. J R Stat Soc Ser B 69:63–78
- Wipf D, Rao B, Nagarajan S (2011) Latent variable Bayesian models for promoting sparsity. IEEE Trans Inf Theory 57:6236–6255
- Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68:49–67
- Zorzi M, Chiuso A (2017) Sparse plus low rank network identification: a nonparametric approach. Automatica 76:355–366
- Zorzi M, Chiuso A (2018) The harmonic analysis of kernel functions. Automatica 94:125–137
- Zou H (2006) The adaptive Lasso and its oracle properties. J Am Stat Assoc 101:1418–1429