1 Introduction: Can It Possibly Be That Simple?

In October of 2005, I scribbled in a notebook, “can it possibly be that simple?” I was referring to the sensitivity of transient dynamics (the eventual results appear in Chap. 7), and had just begun to use matrix calculus as a tool. The answer to my question was yes. It can be that simple.

This book relies on matrix calculus throughout. This chapter introduces the basics, which will be used in the rest of the text. For more information, I recommend four sources in particular. The most complete treatment, but not the easiest starting point, is the book by Magnus and Neudecker (1988). More accessible introductions can be found in the paper by Magnus and Neudecker (1985) and especially the text by Abadir and Magnus (2005). A review paper by Nel (1980) is helpful in placing the Magnus-Neudecker formulation in the context of other attempts at a calculus of matrices.

Sensitivity analysis asks how much change in an outcome variable y is caused by a change in some parameter x. At its most basic level, and with some reasonable assumptions about the continuity and differentiability of the functional relationships involved, the solution is given by differential calculus. If y is a function of x, then the derivative

$$\displaystyle \begin{aligned} {d y \over d x} {} \end{aligned}$$

tells how y responds to a change in x, i.e., the sensitivity of y to a change in x.

However, the outcomes of a demographic calculation may be scalar-valued (e.g., the population growth rate λ), vector-valued (e.g., the stable stage distribution), or matrix-valued (e.g., the fundamental matrix). Any of these outcomes may be functions of scalar-valued parameters (e.g., the Gompertz aging rate), vector-valued parameters (e.g., the mortality schedule), or matrix-valued parameters (e.g., the transition matrix). Thus, sensitivity analysis in demography requires more than the simple scalar derivative above. We want a consistent and flexible approach to differentiating

$$\displaystyle \begin{aligned} \left\{ \begin{array}{r} \mbox{scalar-valued} \\ \mbox{vector-valued}\\ \mbox{matrix-valued} \end{array} \right\} \; \mbox{functions of} \; \left\{ \begin{array}{l} \mbox{scalar} \\ \mbox{vector}\\ \mbox{matrix} \end{array} \right\} \; \mbox{arguments} \end{aligned}$$

2 Notation and Matrix Operations

2.1 Notation

Matrices are denoted by upper case bold symbols (e.g., A), vectors (usually) by lower case bold symbols (e.g., n). The (i, j) entry of the matrix A is a ij, and the ith entry of the vector n is n i. Sometimes we will use Matlab notation, and write

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{ X}}(i,:) &\displaystyle =&\displaystyle \mbox{row }i\mbox{ of }{\mathbf{ X}} \end{array} \end{aligned} $$
(2.1)
$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{ X}}(:,j) &\displaystyle =&\displaystyle \mbox{column }j \mbox{ of }{\mathbf{ X}} \end{array} \end{aligned} $$
(2.2)

The notation

$$\displaystyle \begin{aligned} \left(\begin{array}{c} x_{ij} \end{array}\right) \end{aligned}$$

denotes a matrix whose (i, j) entry is x ij. For example,

$$\displaystyle \begin{aligned} \left(\begin{array}{c} \displaystyle {d y_i \over d x_j} \end{array}\right) \end{aligned}$$

is the matrix whose (i, j) entry is the derivative of y i with respect to x j.

The transpose of X is X T. Logarithms are natural logarithms. The vector norm ∥x∥ is, unless noted otherwise, the 1-norm. The symbol \(\mathcal {D}\,({\mathbf { x}})\) denotes the square matrix with x on the diagonal and zeros elsewhere. The symbol 1 denotes a vector of ones. The vector e i is a unit vector with 1 in the ith entry and zeros elsewhere. The identity matrix is I. Where necessary for clarity, the dimension of matrices or vectors will be indicated by a subscript. Thus I s is an s × s identity matrix, 1 s is an s × 1 vector of ones, and X m×n is an m × n matrix.

In some places (Chaps. 6 and 10) block-structured matrices appear; these are denoted by either \(\mathbb {A}\) or \(\tilde {{\mathbf { A}}}\), depending on the context and the role of the matrix.

2.2 Operations

In addition to the familiar matrix product AB, we will also use the Hadamard, or elementwise, product

$$\displaystyle \begin{aligned} {\mathbf{ A}} \circ {\mathbf{ B}} = \left(\begin{array}{c} a_{ij} b_{ij} \end{array}\right) \end{aligned} $$
(2.3)

and the Kronecker product

$$\displaystyle \begin{aligned} {\mathbf{ A}} \otimes {\mathbf{ B}} = \left(\begin{array}{c} a_{ij} {\mathbf{ B}} \end{array}\right) \end{aligned} $$
(2.4)

The Hadamard product requires that A and B be the same size. The Kronecker product is defined for any sizes of A and B. Some useful properties of the Kronecker product (for (2.5), A and B must be square and nonsingular) include

$$\displaystyle \begin{aligned} \begin{array}{rcl} \left( {\mathbf{ A}} \otimes {\mathbf{ B}} \right)^{-1} &\displaystyle =&\displaystyle \left( {\mathbf{ A}}^{-1} \otimes {\mathbf{ B}}^{-1} \right) {} \end{array} \end{aligned} $$
(2.5)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \left( {\mathbf{ A}} \otimes {\mathbf{ B}} \right)^{\mathsf{T}} &\displaystyle =&\displaystyle \left( {\mathbf{ A}}^{\mathsf{T}} \otimes {\mathbf{ B}}^{\mathsf{T}} \right) \end{array} \end{aligned} $$
(2.6)
$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{ A}} \otimes \left( {\mathbf{ B}} + {\mathbf{ C}} \right) &\displaystyle =&\displaystyle \left( {\mathbf{ A}} \otimes {\mathbf{ B}} \right) + \left( {\mathbf{ A}} \otimes {\mathbf{ C}} \right) \end{array} \end{aligned} $$
(2.7)

and, provided that the matrices are of the right size for the products to be defined,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \left({\mathbf{ A}}_1 \otimes {\mathbf{ B}}_1 \right) \left({\mathbf{ A}}_2 \otimes {\mathbf{ B}}_2 \right) &\displaystyle =&\displaystyle \left({\mathbf{ A}}_1 {\mathbf{ A}}_2 \otimes {\mathbf{ B}}_1 {\mathbf{ B}}_2 \right) . {} \end{array} \end{aligned} $$
(2.8)
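These properties are easy to check numerically. Here is a minimal sketch in Python/numpy (my illustration, not part of the development; np.kron implements the Kronecker product):

    import numpy as np

    rng = np.random.default_rng(1)
    A1, A2 = rng.random((3, 3)), rng.random((3, 2))
    B1, B2 = rng.random((4, 4)), rng.random((4, 5))

    # (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, for square nonsingular A and B   (2.5)
    print(np.allclose(np.linalg.inv(np.kron(A1, B1)),
                      np.kron(np.linalg.inv(A1), np.linalg.inv(B1))))
    # (A ⊗ B)^T = A^T ⊗ B^T                                            (2.6)
    print(np.allclose(np.kron(A1, B1).T, np.kron(A1.T, B1.T)))
    # (A1 ⊗ B1)(A2 ⊗ B2) = (A1 A2) ⊗ (B1 B2)                           (2.8)
    print(np.allclose(np.kron(A1, B1) @ np.kron(A2, B2),
                      np.kron(A1 @ A2, B1 @ B2)))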

2.3 The Vec Operator and Vec-Permutation Matrix

The vec operator transforms an m × n matrix A into an mn × 1 vector by stacking the columns one above the next,

$$\displaystyle \begin{aligned} \mbox{vec} \, {\mathbf{ A}} = \left(\begin{array}{c} {\mathbf{ A}}(:,1) \\ \vdots\\ {\mathbf{ A}}(:,n) \end{array}\right) \end{aligned} $$
(2.9)

For example,

$$\displaystyle \begin{aligned} \mbox{vec} \, \left(\begin{array}{cc} a & b \\ c & d \end{array}\right) = \left(\begin{array}{c} a\\ c \\ b \\ d \end{array}\right) . \end{aligned} $$
(2.10)

The vec of A and the vec of A T are rearrangements of the same entries; they are related by

$$\displaystyle \begin{aligned} \mbox{vec} \, {\mathbf{ A}}^{\mathsf{T}} = {\mathbf{ K}}_{m,n} \mbox{vec} \, {\mathbf{ A}} \end{aligned} $$
(2.11)

where A is m × n and K m,n is the vec-permutation matrix (Henderson and Searle 1981) or commutation matrix (Magnus and Neudecker 1979). The vec-permutation matrix can be calculated as

$$\displaystyle \begin{aligned} {\mathbf{ K}}_{m,n} = \sum_{i=1}^m \sum_{j=1}^n \left( {\mathbf{ E}}_{ij} \otimes {\mathbf{ E}}_{ij}^{\mathsf{T}} \right) \end{aligned} $$
(2.12)

where E ij is an m × n matrix with a 1 in the (i, j) entry and zeros elsewhere. Like any permutation matrix, K satisfies \({\mathbf{K}}^{-1} = {\mathbf{K}}^{\mathsf{T}}\).
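A small numpy sketch (mine, for illustration) of the vec operator and of building K m,n from (2.12):

    import numpy as np

    def vec(A):
        """Stack the columns of A into a single mn x 1 column vector."""
        return A.reshape(-1, 1, order='F')

    def vec_perm(m, n):
        """Vec-permutation (commutation) matrix K_{m,n}, from Eq. (2.12)."""
        K = np.zeros((m * n, m * n))
        for i in range(m):
            for j in range(n):
                E = np.zeros((m, n))           # E_{ij}: 1 in the (i, j) entry
                E[i, j] = 1.0
                K += np.kron(E, E.T)
        return K

    A = np.arange(1.0, 7.0).reshape(2, 3)      # a 2 x 3 example matrix
    K = vec_perm(2, 3)
    print(np.allclose(vec(A.T), K @ vec(A)))   # vec(A^T) = K_{m,n} vec(A): True
    print(np.allclose(np.linalg.inv(K), K.T))  # K^{-1} = K^T: True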

The vec operator and the vec-permutation matrix are particularly important in multistate models (e.g., age×stage-classified models), where they are used in both the formulation and analysis of the models (e.g., Caswell 2012, 2014; Caswell and Salguero-Gómez 2013; Caswell et al. 2018); see also Chap. 6. Extensions to an arbitrary number of dimensions, so-called hyperstate models, have been presented by Roth and Caswell (2016).

2.4 Roth’s Theorem

The vec operator and the Kronecker product are connected by a theorem due to Roth (1934):

$$\displaystyle \begin{aligned} \mbox{vec} \, \left({\mathbf{ABC}}\right) = \left({\mathbf{ C}}^{\mathsf{T}} \otimes {\mathbf{ A}} \right) \mbox{vec} \, {\mathbf{ B}} . {} \end{aligned} $$
(2.13)

We will often want to obtain the vec of a matrix that appears in the middle of a product; we will use Roth’s theorem repeatedly.
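A one-line numerical check of (2.13) (my sketch, with vec taken column-wise as in (2.9)):

    import numpy as np

    rng = np.random.default_rng(2)
    A, B, C = rng.random((2, 3)), rng.random((3, 4)), rng.random((4, 5))

    vec = lambda X: X.reshape(-1, 1, order='F')                    # stack columns
    print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))   # True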

3 Defining Matrix Derivatives

The derivative of a scalar y with respect to a scalar x is familiar. What, however, does it mean to speak of the derivative of a scalar with respect to a vector, or of a vector with respect to another vector, or any other combination? These can be defined in more than one way and the choice is critical (Nel 1980; Magnus and Neudecker 1985). This book relies on the notation due to Magnus and Neudecker, because it makes certain operations possible and consistent.

  • If x and y are scalars, the derivative of y with respect to x is the familiar derivative dy∕dx.

  • If y is an n × 1 vector and x a scalar, the derivative of y with respect to x is the n × 1 column vector

    $$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d x} = \left(\begin{array}{c} \displaystyle {d y_1 \over d x} \\ \vdots \\ \displaystyle {d y_n \over d x} \end{array}\right) . \end{aligned} $$
    (2.14)
  • If y is a scalar and x an m × 1 vector, the derivative of y with respect to x is the 1 × m row vector (called the gradient vector)

    $$\displaystyle \begin{aligned} {d y \over d {\mathbf{ x}}^{\mathsf{T}}} = \left(\begin{array}{ccc} \displaystyle {\partial y \over \partial x_1} & \cdots & \displaystyle {\partial y \over \partial x_m} \end{array}\right). \end{aligned} $$
    (2.15)

    Note the orientation of d y∕dx as a column vector and dy∕d x T as a row vector.

  • If y is an n × 1 vector and x an m × 1 vector, the derivative of y with respect to x is defined to be the n × m matrix whose (i, j) entry is the derivative of y i with respect to x j, i.e.,

    $$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d {\mathbf{ x}}^{\mathsf{T}}} = \left(\begin{array}{c} \displaystyle {d y_i \over d x_j} \end{array}\right) \end{aligned} $$
    (2.16)

    (this matrix is called the Jacobian matrix).

  • Derivatives involving matrices are written by first transforming the matrices into vectors using the vec operator, and then applying the rules for vector differentiation to the resulting vectors. Thus, the derivative of the m × n matrix Y with respect to the p × q matrix X is the mn × pq matrix

    $$\displaystyle \begin{aligned} {d \mbox{vec} \, {\mathbf{ Y}} \over d \rule{0in}{2.5ex} \left( \mbox{vec} \, {\mathbf{ X}} \right)^{\mathsf{T}}}. \end{aligned} $$
    (2.17)

    From now on, I will write \(\mbox{vec}^{\mathsf{T}} {\mathbf{X}}\) for \((\mbox{vec}\,{\mathbf{X}})^{\mathsf{T}}\).
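These orientation conventions can be checked against finite differences. A minimal numpy sketch (my example functions, not from the text):

    import numpy as np

    def num_jacobian(f, x, h=1e-6):
        """Finite-difference estimate of dy/dx^T: entry (i, j) is dy_i/dx_j."""
        y0 = np.atleast_1d(f(x))
        J = np.zeros((y0.size, x.size))
        for j in range(x.size):
            xp = x.copy(); xp[j] += h
            J[:, j] = (np.atleast_1d(f(xp)) - y0) / h
        return J

    rng = np.random.default_rng(3)
    A = rng.random((4, 3))
    x = rng.random(3)

    # vector-valued y = A x: the Jacobian dy/dx^T is A itself (4 x 3)
    print(np.allclose(num_jacobian(lambda z: A @ z, x), A, atol=1e-4))
    # scalar y = x^T x: the gradient dy/dx^T is the 1 x 3 row vector 2 x^T
    print(np.allclose(num_jacobian(lambda z: z @ z, x), 2 * x[None, :], atol=1e-4))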

4 The Chain Rule

The chain rule for differentiation is your friend. The Magnus-Neudecker notation, unlike some alternatives, extends the familiar scalar chain rule to derivatives of vectors and matrices (Nel 1980; Magnus and Neudecker 1985). If u (size m × 1) is a function of v (size n × 1) and v is a function of x (size p × 1), then

$$\displaystyle \begin{aligned} \underbrace{{d {\mathbf{ u}} \over d {\mathbf{ x}}^{\mathsf{T}}}}_{m \times p} = \underbrace{ \left( {d {\mathbf{ u}} \over d {\mathbf{ v}}^{\mathsf{T}}} \right)}_{m \times n} \;\underbrace{ \left( {d {\mathbf{ v}} \over d {\mathbf{ x}}^{\mathsf{T}}} \right) }_{n \times p} {} \end{aligned} $$
(2.18)

Notice that the dimensions are consistent and that the order of the multiplication matters. Checking dimensions in this way is a useful means of catching errors.
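For example (a sketch of mine, not the book's), take u = exp(v) elementwise and v = Bx; then (2.18) gives du∕d x T = D(exp(v)) B, which agrees with a finite-difference calculation:

    import numpy as np

    rng = np.random.default_rng(4)
    B = rng.random((4, 3))                 # v = B x, with x of size 3
    x = rng.random(3)
    v = B @ x

    analytic = np.diag(np.exp(v)) @ B      # (du/dv^T)(dv/dx^T): 4 x 3

    h = 1e-6
    numeric = np.zeros_like(analytic)
    for j in range(3):
        xp = x.copy(); xp[j] += h
        numeric[:, j] = (np.exp(B @ xp) - np.exp(v)) / h
    print(np.allclose(analytic, numeric, atol=1e-4))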

5 Derivatives from Differentials

The key to the matrix calculus of Magnus and Neudecker (1988) is the relationship between the differential and the derivative of a function. Experience suggests that, for many readers of this book, this relationship is shrouded in the mists of long-ago calculus classes.

5.1 Differentials of Scalar Functions

Start with scalars. Suppose that y = f(x) is differentiable at x = x 0. Then the derivative of y with respect to x at the point x 0 is defined as

$$\displaystyle \begin{aligned} f'(x_0) = \lim_{h\rightarrow 0} \frac{f(x_0 + h) - f(x_0)}{h}. {} \end{aligned} $$
(2.19)

Now define the differential of y. This is customarily denoted dy, but for the moment, I will denote it by cy. The differential of y at x 0 is a function of h, defined by

$$\displaystyle \begin{aligned} cy(x_0,h) = f'(x_0)h. \end{aligned} $$
(2.20)

There is no requirement that h be “small.” Since x is a function of itself, x = g(x), with g′(x) = 1, we also have cx(x 0, h) = g′(x 0)h = h. Thus the ratio of the differential of y and the differential of x is

$$\displaystyle \begin{aligned} \frac{cy(x_0,h)}{cx(x_0,h)} = \frac{f'(x_0)h}{h} = f'(x_0). {} \end{aligned} $$
(2.21)

That is, the derivative is equal to the ratio of the differentials.

Now, return to the standard notation of dy for the differential of y. This gives two meanings to the familiar notation for derivatives,

$$\displaystyle \begin{aligned} \left. {d y \over d x} \right|{}_{x_0} = f'(x_0). \end{aligned} $$
(2.22)

The left hand side can be regarded either as the limit (2.19) or as the ratio of the differentials given by (2.21). Mathematicians are strangely unconcerned with this ambiguity (e.g., Hardy 1952).

All this leads to a set of familiar rules for calculating differentials that guarantee that they can be used to create derivatives. A few of these, for scalars, are

$$\displaystyle \begin{aligned} \begin{array}{rcl} d(u+v) &\displaystyle =&\displaystyle du + dv {} \end{array} \end{aligned} $$
(2.23)
$$\displaystyle \begin{aligned} \begin{array}{rcl} d(cu) &\displaystyle =&\displaystyle c \; du \end{array} \end{aligned} $$
(2.24)
$$\displaystyle \begin{aligned} \begin{array}{rcl} d(uv) &\displaystyle =&\displaystyle u(dv) + (du) v \end{array} \end{aligned} $$
(2.25)
$$\displaystyle \begin{aligned} \begin{array}{rcl} d(e^u) &\displaystyle =&\displaystyle e^u du \end{array} \end{aligned} $$
(2.26)
$$\displaystyle \begin{aligned} \begin{array}{rcl} d(\log u) &\displaystyle =&\displaystyle \frac{1}{u} du \end{array} \end{aligned} $$
(2.27)

If y = f(x 1, x 2), then the total differential is

$$\displaystyle \begin{aligned} dy = {\partial f \over \partial x_1} dx_1 + {\partial f \over \partial x_2} dx_2. {} \end{aligned} $$
(2.28)

Derivatives can be constructed from these expressions at will by dividing by differentials. For example, dividing (2.23) by dx gives d(u + v)∕dx = du∕dx + dv∕dx. From (2.28), we have

$$\displaystyle \begin{aligned} \begin{array}{rcl} {d y \over d x_1} &\displaystyle =&\displaystyle {\partial f \over \partial x_1} + {\partial f \over \partial x_2} {d x_2 \over d x_1} \end{array} \end{aligned} $$
(2.29)
$$\displaystyle \begin{aligned} \begin{array}{rcl} {d y \over d x_2} &\displaystyle =&\displaystyle {\partial f \over \partial x_1} {d x_1 \over d x_2} + {\partial f \over \partial x_2} . \end{array} \end{aligned} $$
(2.30)
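As a small worked example (mine, not from the text), suppose y = x 1 exp(x 2). Rules (2.25) and (2.26) give

$$\displaystyle \begin{aligned} dy = e^{x_2} \, dx_1 + x_1 e^{x_2} \, dx_2 , \end{aligned}$$

and dividing by dx 1 recovers the form of (2.29),

$$\displaystyle \begin{aligned} {d y \over d x_1} = e^{x_2} + x_1 e^{x_2} \, {d x_2 \over d x_1} . \end{aligned}$$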

5.2 Differentials of Vectors and Matrices

To extend these concepts to matrices, we define the differential of a matrix (or vector) as the matrix (or vector) of differentials of the elements; i.e.,

$$\displaystyle \begin{aligned} d {\mathbf{ X}} = \left(\begin{array}{c} \displaystyle dx_{ij} \end{array}\right). \end{aligned} $$
(2.31)

This definition leads to some basic rules for differentials of matrices:

$$\displaystyle \begin{aligned} \begin{array}{rcl} d (c {\mathbf{ U}}) &\displaystyle =&\displaystyle c ( d {\mathbf{ U}}) \end{array} \end{aligned} $$
(2.32)
$$\displaystyle \begin{aligned} \begin{array}{rcl} d ({\mathbf{ U}} + {\mathbf{ V}}) &\displaystyle =&\displaystyle d {\mathbf{ U}} + d {\mathbf{ V}} \end{array} \end{aligned} $$
(2.33)
$$\displaystyle \begin{aligned} \begin{array}{rcl} d ({\mathbf{ U}} {\mathbf{ V}}) &\displaystyle =&\displaystyle (d {\mathbf{ U}}) {\mathbf{ V}} + {\mathbf{ U}} (d {\mathbf{ V}}) \end{array} \end{aligned} $$
(2.34)
$$\displaystyle \begin{aligned} \begin{array}{rcl} d({\mathbf{ U}} \otimes {\mathbf{ V}}) &\displaystyle =&\displaystyle (d {\mathbf{ U}}) \otimes {\mathbf{ V}} + {\mathbf{ U}} \otimes (d {\mathbf{ V}}) \end{array} \end{aligned} $$
(2.35)
$$\displaystyle \begin{aligned} \begin{array}{rcl} d({\mathbf{ U}} \circ {\mathbf{ V}}) &\displaystyle =&\displaystyle (d {\mathbf{ U}}) \circ {\mathbf{ V}} + {\mathbf{ U}} \circ (d {\mathbf{ V}}) \end{array} \end{aligned} $$
(2.36)
$$\displaystyle \begin{aligned} \begin{array}{rcl} d \mbox{vec} \, {\mathbf{ U}} &\displaystyle =&\displaystyle \mbox{vec} \, d {\mathbf{ U}} \end{array} \end{aligned} $$
(2.37)

where c is a constant, and, of course, the dimensions of U and V must be conformable. The differential of an operator applied elementwise to a vector can be obtained from the differentials of the elements. For example, suppose u is an s × 1 vector, and the exponential is applied elementwise. Then

$$\displaystyle \begin{aligned} \begin{array}{rcl} d (\exp({\mathbf{ u}})) &=& \left(\begin{array}{c} e^{u_1} d u_1 \\ \vdots \\ e^{u_s} d u_s \end{array}\right) \end{array} \end{aligned} $$
(2.38)
$$\displaystyle \begin{aligned} \begin{array}{rcl} &=& \mathcal{D}\,\left[ \rule{0in}{2ex} \exp({\mathbf{ u}}) \right] d {\mathbf{ u}} . \end{array} \end{aligned} $$
(2.39)
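A quick numerical check of (2.39) (my sketch, using a small perturbation in place of the differential):

    import numpy as np

    rng = np.random.default_rng(5)
    u = rng.random(4)
    du = 1e-6 * rng.random(4)              # a small perturbation of u

    lhs = np.exp(u + du) - np.exp(u)       # actual change in exp(u)
    rhs = np.diag(np.exp(u)) @ du          # D(exp(u)) du
    print(np.allclose(lhs, rhs, atol=1e-9))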

If y is a function of x 1 and x 2, the total differential is given just as in (2.28), by

$$\displaystyle \begin{aligned} d {\mathbf{ y}} = {\partial {\mathbf{ y}} \over \partial {\mathbf{ x}}_1^{\mathsf{T}}} d {\mathbf{ x}}_1 + {\partial {\mathbf{ y}} \over \partial {\mathbf{ x}}_2^{\mathsf{T}}} d {\mathbf{ x}}_2 \end{aligned} $$
(2.40)

6 The First Identification Theorem

For scalar y and x,

$$\displaystyle \begin{aligned} dy = q dx \Longrightarrow {d y \over d x} = q. \end{aligned} $$
(2.41)

That much is easy. But, suppose that y is an n × 1 vector function of the m × 1 vector x. The differential d y is the n × 1 vector

$$\displaystyle \begin{aligned} d {\mathbf{ y}} = \left(\begin{array}{c} dy_1 \\ \vdots \\ dy_n \end{array}\right) \end{aligned} $$
(2.42)

which, by the total derivative rule, is

$$\displaystyle \begin{aligned} \begin{array}{rcl} d {\mathbf{ y}} &=& \left(\begin{array}{c} \displaystyle {\partial y_1 \over \partial x_1} dx_1 + \cdots + {\partial y_1 \over \partial x_m} dx_m \\ \vdots \\ \displaystyle {\partial y_n \over \partial x_1} dx_1 + \cdots + {\partial y_n \over \partial x_m} dx_m \end{array}\right) \end{array} \end{aligned} $$
(2.43)
$$\displaystyle \begin{aligned} \begin{array}{rcl} &=& \left(\begin{array}{ccc} {\partial y_1 \over \partial x_1} & \cdots & {\partial y_1 \over \partial x_m} \\ \vdots & & \vdots \\ {\partial y_n \over \partial x_1} & \cdots & {\partial y_n \over \partial x_m} \end{array}\right) \left(\begin{array}{c} dx_1 \\ \vdots \\ dx_m \end{array}\right) \end{array} \end{aligned} $$
(2.44)
$$\displaystyle \begin{aligned} \begin{array}{rcl} &=& {\mathbf{ Q}} \; d {\mathbf{ x}}. {} \end{array} \end{aligned} $$
(2.45)

If these were scalars, dividing both sides by d x would give Q as the derivative of y with respect to x. But, one cannot divide by a vector. Instead, Magnus and Neudecker proved that if it can be shown that

$$\displaystyle \begin{aligned} d {\mathbf{ y}} = {\mathbf{ Q}} \; d {\mathbf{ x}} {} \end{aligned} $$
(2.46)

then the derivative is

$$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d {\mathbf{ x}}^{\mathsf{T}}} = {\mathbf{ Q}}. {} \end{aligned} $$
(2.47)

This is the First Identification Theorem of Magnus and Neudecker (1988).
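For example (my illustration, not the book's), if y = x ∘ x, then dy = 2 D(x) dx, and the theorem identifies dy∕d x T = 2 D(x); a finite-difference Jacobian agrees:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    Q = 2 * np.diag(x)                     # identified from dy = Q dx

    h = 1e-6
    J = np.zeros((3, 3))                   # finite-difference Jacobian of y = x * x
    for j in range(3):
        xp = x.copy(); xp[j] += h
        J[:, j] = (xp * xp - x * x) / h
    print(np.allclose(J, Q, atol=1e-4))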

6.1 The Chain Rule and the First Identification Theorem

Suppose that d y is given by (2.46), and that x is in turn a function of some vector θ. Then

$$\displaystyle \begin{aligned} d {\mathbf{ x}} = {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}} d \boldsymbol{\theta} \end{aligned} $$
(2.48)

and

$$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} = {\mathbf{ Q}} {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}}. \end{aligned} $$
(2.49)

In other words, the differential expression (2.46) can be transformed into a derivative with respect to any vector by careful use of the chain rule. This applies equally to more complicated expressions for the differential. Suppose that

$$\displaystyle \begin{aligned} d {\mathbf{ y}} = {\mathbf{ Q}} d {\mathbf{ x}} + {\mathbf{ R}} d {\mathbf{ z}}. \end{aligned} $$
(2.50)

Applying the chain rule to the differentials on the right hand side gives

$$\displaystyle \begin{aligned} d {\mathbf{ y}} = {\mathbf{ Q}} {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}} d \boldsymbol{\theta} + {\mathbf{ R}} {d {\mathbf{ z}} \over d \boldsymbol{\theta}^{\mathsf{T}}} d \boldsymbol{\theta} \end{aligned} $$
(2.51)

for any vector θ. Thus

$$\displaystyle \begin{aligned} d {\mathbf{ y}} = \left({\mathbf{ Q}} {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}} + {\mathbf{ R}} {d {\mathbf{ z}} \over d \boldsymbol{\theta}^{\mathsf{T}}} \right) d \boldsymbol{\theta}, \end{aligned} $$
(2.52)

and the First Identification Theorem gives

$$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} = \left( {\mathbf{ Q}} {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}} + {\mathbf{ R}} {d {\mathbf{ z}} \over d \boldsymbol{\theta}^{\mathsf{T}}} \right) . \end{aligned} $$
(2.53)

7 Elasticity

When parameters are measured on different scales, it is sometimes helpful to calculate proportional effects of proportional perturbations, also called elasticities. The elasticity of y i to θ j is

$$\displaystyle \begin{aligned} {\epsilon y_i \over \epsilon \theta_j} = \frac{\theta_j}{y_i} {d y_i \over d \theta_j} . \end{aligned} $$
(2.54)

For vectors y and θ, this becomes

$$\displaystyle \begin{aligned} {\epsilon {\mathbf{ y}} \over \epsilon \boldsymbol{\theta}^{\mathsf{T}}} = \mathcal{D}\,({\mathbf{ y}})^{-1} \; {d {\mathbf{ y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} \; \mathcal{D}\,(\boldsymbol{\theta}) . {} \end{aligned} $$
(2.55)

There seems to be no accepted notation for elasticities; the notation used here is adapted from that in Samuelson (1947).
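In numpy, (2.55) is a one-liner. The sketch below (my example, with a linear outcome y = Aθ so that dy∕d θ T = A) also illustrates a familiar check: when the outcome is homogeneous of degree 1 in the parameters, each row of the elasticity matrix sums to one.

    import numpy as np

    def elasticity(dy_dtheta, y, theta):
        """Elasticity matrix D(y)^{-1} (dy/dtheta^T) D(theta), Eq. (2.55)."""
        return np.diag(1.0 / y) @ dy_dtheta @ np.diag(theta)

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    theta = np.array([0.5, 1.5])
    y = A @ theta                          # y is homogeneous of degree 1 in theta
    E = elasticity(A, y, theta)
    print(E)
    print(E.sum(axis=1))                   # each row sums to 1 here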

8 Some Useful Matrix Calculus Results

Several matrix calculus results will be used repeatedly. Many more can be found in Magnus and Neudecker (1988) and Abadir and Magnus (2005).

  1.

    The matrix product Y = AB. Differentiate,

    $$\displaystyle \begin{aligned} d {\mathbf{ Y}} = (d {\mathbf{ A}}) {\mathbf{ B}} + {\mathbf{ A}} (d {\mathbf{ B}}). \end{aligned} $$
    (2.56)

    Then write (or imagine writing; with practice one does not actually need this step explicitly)

    $$\displaystyle \begin{aligned} \begin{array}{rcl} \left( d {\mathbf{ A}} \right) {\mathbf{ B}} &\displaystyle =&\displaystyle {\mathbf{ I}} \left( d {\mathbf{ A}} \right) {\mathbf{ B}} \end{array} \end{aligned} $$
    (2.57)
    $$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{ A}} \left( d {\mathbf{ B}} \right) &\displaystyle =&\displaystyle {\mathbf{ A}} \left( d {\mathbf{ B}} \right) {\mathbf{ I}} \end{array} \end{aligned} $$
    (2.58)

    and apply the vec operator and Roth’s theorem, to obtain

    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{ Y}} = \left( {\mathbf{ B}}^{\mathsf{T}} \otimes {\mathbf{ I}} \right) d \mbox{vec} \, {\mathbf{ A}} + \left( {\mathbf{ I}} \otimes {\mathbf{ A}} \right) d \mbox{vec} \, {\mathbf{ B}}. \end{aligned} $$
    (2.59)

    The chain rule gives, for any vector variable θ

    $$\displaystyle \begin{aligned} {d \mbox{vec} \, {\mathbf{ Y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} = \left( {\mathbf{ B}}^{\mathsf{T}} \otimes {\mathbf{ I}} \right) {d \mbox{vec} \, {\mathbf{ A}} \over d \boldsymbol{\theta}^{\mathsf{T}}} + \left( {\mathbf{ I}} \otimes {\mathbf{ A}} \right) {d \mbox{vec} \, {\mathbf{ B}} \over d \boldsymbol{\theta}^{\mathsf{T}}}. \end{aligned} $$
    (2.60)
  2.

    The Hadamard product Y = A ∘B. Differentiate the product,

    $$\displaystyle \begin{aligned} d {\mathbf{ Y}} = d {\mathbf{ A}} \circ {\mathbf{ B}} + {\mathbf{ A}} \circ d {\mathbf{ B}}, \end{aligned} $$
    (2.61)

    then vec

    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{ Y}} = d \mbox{vec} \, {\mathbf{ A}} \circ \mbox{vec} \, {\mathbf{ B}} + \mbox{vec} \, {\mathbf{ A}} \circ d \mbox{vec} \, {\mathbf{ B}} . \end{aligned} $$
    (2.62)

    It will be useful to replace the Hadamard products, which we do using the fact that \({\mathbf { x}} \circ {\mathbf { y}} = \mathcal {D}\,({\mathbf { x}}) {\mathbf { y}}\), to get

    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{ Y}} = \mathcal{D}\,(\mbox{vec} \, {\mathbf{ B}}) d \mbox{vec} \, {\mathbf{ A}} + \mathcal{D}\,(\mbox{vec} \, {\mathbf{ A}}) d \mbox{vec} \, {\mathbf{ B}}. \end{aligned} $$
    (2.63)

    The chain rule gives the derivative from the differential,

    $$\displaystyle \begin{aligned} {d \mbox{vec} \, {\mathbf{ Y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} = \mathcal{D}\,(\mbox{vec} \, {\mathbf{ B}}) {d \mbox{vec} \, {\mathbf{ A}} \over d \boldsymbol{\theta}^{\mathsf{T}}} + \mathcal{D}\,(\mbox{vec} \, {\mathbf{ A}}) {d \mbox{vec} \, {\mathbf{ B}} \over d \boldsymbol{\theta}^{\mathsf{T}}}. \end{aligned} $$
    (2.64)
  3.

    Diagonal matrices. The diagonal matrix \(\mathcal {D}\, ({\mathbf { x}})\), with the vector x on the diagonal and zeros elsewhere, can be written

    $$\displaystyle \begin{aligned} \mathcal{D}\, ({\mathbf{ x}}) = {\mathbf{ I}} \circ \left( \mathbf{1} \,{\mathbf{ x}}^{\mathsf{T}} \right) {} \end{aligned} $$
    (2.65)

    Differentiate both sides,

    $$\displaystyle \begin{aligned} d \mathcal{D}\, ({\mathbf{ x}}) = {\mathbf{ I}} \circ \left( \mathbf{1} \,d {\mathbf{ x}}^{\mathsf{T}} \right) \end{aligned} $$
    (2.66)

    and vec the result

    $$\displaystyle \begin{aligned} \begin{array}{rcl} d \mbox{vec} \, \mathcal{D}\, ({\mathbf{ x}}) &\displaystyle =&\displaystyle \mathcal{D}\,(\mbox{vec} \, {\mathbf{ I}}) \mbox{vec} \, \left( \mathbf{1} \,d {\mathbf{ x}}^{\mathsf{T}} \right) \end{array} \end{aligned} $$
    (2.67)
    $$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle =&\displaystyle \mathcal{D}\,(\mbox{vec} \, {\mathbf{ I}}) \left( {\mathbf{ I}} \otimes \mathbf{1} \right) d {\mathbf{ x}} \end{array} \end{aligned} $$
    (2.68)

    The First Identification Theorem gives

    $$\displaystyle \begin{aligned} {d \mbox{vec} \, \mathcal{D}\, ({\mathbf{ x}}) \over d \boldsymbol{\theta}^{\mathsf{T}}} = \mathcal{D}\,(\mbox{vec} \, {\mathbf{ I}}) \left( {\mathbf{ I}} \otimes \mathbf{1} \right) {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}}. {} \end{aligned} $$
    (2.69)

    The identity matrix in (2.65) masks the matrix \( \left ( \mathbf {1} \,{\mathbf { x}}^{\mathsf {T}} \right )\), setting to zero all but the diagonal elements. Matrices other than I can be used in this way to mask entries of a matrix. For example, the transition matrix for a Leslie matrix, with a vector of survival probabilities p on the subdiagonal, is obtained by setting x = p and replacing I with a matrix Z that contains ones on the subdiagonal and zeros elsewhere (see, e.g., Chap. 4).

    Some Markov chain calculations (Chaps. 5 and 11) involve a matrix N dg, which contains the diagonal elements of N on the diagonal and zeros elsewhere. This can be written

    $$\displaystyle \begin{aligned} {\mathbf{N}}_{\mathrm{dg}} = {\mathbf{I}} \circ {\mathbf{N}}. \end{aligned} $$
    (2.70)

    Differentiating and applying the vec operator yields

    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{N}}_{\mathrm{dg}} = \mathcal{D}\,(\mbox{vec} \, {\mathbf{I}}) d \mbox{vec} \, {\mathbf{N}}. \end{aligned} $$
    (2.71)
  4.

    The Kronecker product. Differentiating the Kronecker product is a bit more complicated (Magnus and Neudecker 1985, Theorem 11). We want an expression for the differential of the product in terms of the differentials of the components, something of the form

    $$\displaystyle \begin{aligned} d \mbox{vec} \, \left( {\mathbf{A}} \otimes {\mathbf{B}} \right) = {\mathbf{Z}}_1 d \mbox{vec} \, {\mathbf{A}} + {\mathbf{Z}}_2 d \mbox{vec} \, {\mathbf{B}} {} \end{aligned} $$
    (2.72)

    for some matrices Z 1 and Z 2.

    This requires a result for the vec of the Kronecker product. Let A be of dimension m × p and B be r × s. Then

    $$\displaystyle \begin{aligned} \mbox{vec} \, \left( {\mathbf{A}} \otimes {\mathbf{B}} \right) = \left( {\mathbf{I}}_p \otimes {\mathbf{K}}_{s,m} \otimes {\mathbf{I}}_r \right) \left( \mbox{vec} \, {\mathbf{A}} \otimes \mbox{vec} \, {\mathbf{B}} \right). {} \end{aligned} $$
    (2.73)

    Let Y = A ⊗B. Differentiate,

    $$\displaystyle \begin{aligned} d {\mathbf{Y}} = \left( d {\mathbf{A}} \otimes {\mathbf{B}} \right) + \left( {\mathbf{A}} \otimes d {\mathbf{B}} \right) \end{aligned} $$
    (2.74)

    and vec

    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{Y}} = \left( {\mathbf{I}}_p \otimes {\mathbf{K}}_{s,m} \otimes {\mathbf{I}}_r \right) \left[ \rule{0in}{3ex} \left(d \mbox{vec} \, {\mathbf{A}} \otimes \mbox{vec} \, {\mathbf{B}} \right) + \left(\mbox{vec} \, {\mathbf{A}} \otimes d \mbox{vec} \, {\mathbf{B}} \right) \right]. \end{aligned} $$
    (2.75)

    With some ingenious simplifications (Magnus and Neudecker 1985), this reduces to (2.72) with

    $$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{Z}}_1 &\displaystyle =&\displaystyle \left( {\mathbf{I}}_p \otimes {\mathbf{K}}_{s,m} \otimes {\mathbf{I}}_r \right) \left( {\mathbf{I}}_m \otimes \mbox{vec} \, {\mathbf{B}} \right) \end{array} \end{aligned} $$
    (2.76)
    $$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{Z}}_2 &\displaystyle =&\displaystyle \left( {\mathbf{I}}_p \otimes {\mathbf{K}}_{s,m} \otimes {\mathbf{I}}_r \right) \left( \mbox{vec} \, {\mathbf{A}} \otimes {\mathbf{I}}_{rs} \right). \end{array} \end{aligned} $$
    (2.77)

    Substituting Z 1 and Z 2 into (2.72) gives the differential of the Kronecker product in terms of the differentials of its component matrices.

  5.

    The matrix inverse. The inverse of X satisfies

    $$\displaystyle \begin{aligned} {\mathbf{X}} {\mathbf{X}}^{-1} = {\mathbf{I}} . \end{aligned} $$
    (2.78)

    Differentiate both sides

    $$\displaystyle \begin{aligned} \left(d {\mathbf{X}} \right) {\mathbf{X}}^{-1} + {\mathbf{X}} \left( d {\mathbf{X}}^{-1} \right) = \mathbf{0} , \end{aligned} $$
    (2.79)

    then vec

    $$\displaystyle \begin{aligned} \left[ \left({\mathbf{X}}^{-1}\right)^{\mathsf{T}} \otimes {\mathbf{I}} \right] d \mbox{vec} \, {\mathbf{X}}+ \left[ {\mathbf{I}} \otimes {\mathbf{X}} \right] d \mbox{vec} \, {\mathbf{X}}^{-1} = \mathbf{0} \end{aligned} $$
    (2.80)

    and finally solve for dvec X −1

    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{X}}^{-1} = - \left[ {\mathbf{I}} \otimes {\mathbf{X}} \right]^{-1} \left[ \left({\mathbf{X}}^{-1}\right)^{\mathsf{T}} \otimes {\mathbf{I}} \right] d \mbox{vec} \, {\mathbf{X}} \end{aligned} $$
    (2.81)

    The properties (2.5) and (2.8) of the Kronecker product let this be simplified to

    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{X}}^{-1} = - \left[ \left({\mathbf{X}}^{-1}\right)^{\mathsf{T}} \otimes {\mathbf{X}}^{-1} \right] d \mbox{vec} \, {\mathbf{X}} {} \end{aligned} $$
    (2.82)
  6.

    The square root and ratios. In calculating standard deviations and coefficients of variation it is useful to calculate the elementwise square root and the elementwise ratio of two vectors. If x is a strictly positive vector, and the square root \(\sqrt {{\mathbf {x}}}\) is taken elementwise, then

    $$\displaystyle \begin{aligned} d \sqrt{{\mathbf{x}}} = \frac{1}{2} \mathcal{D}\, \left( \sqrt{{\mathbf{x}}} \right) ^{-1} d {\mathbf{x}} . {} \end{aligned} $$
    (2.83)

    For the elementwise ratio, let x and y be m × 1 vectors, with all entries of y nonzero. Let w be a vector whose ith element is x i∕y i; i.e., \({\mathbf {w}} = \mathcal {D}\,({\mathbf {y}})^{-1} {\mathbf {x}}\). Then

    $$\displaystyle \begin{aligned} d {\mathbf{w}} = \mathcal{D}\,({\mathbf{y}})^{-1} d {\mathbf{x}} - \left[ {\mathbf{x}}^{\mathsf{T}} \mathcal{D}\,({\mathbf{y}})^{-1} \otimes \mathcal{D}\,({\mathbf{y}})^{-1} \right] \mathcal{D}\,(\mbox{vec} \, {\mathbf{I}}_{m} ) \left({\mathbf{I}}_m \otimes {\mathbf{1}}_{m} \right) d {\mathbf{y}} . {} \end{aligned} $$
    (2.84)

This list could go on. The books by Magnus and Neudecker (1988) and Abadir and Magnus (2005) contain many other results, and demographically relevant derivations appear throughout this book, especially in Chap. 5.
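Any of these results can be verified numerically before being used. A rough numpy check of two of them, the vec'd product rule (2.59) and the differential of the inverse (2.82) (my sketch, using small perturbations in place of differentials):

    import numpy as np

    rng = np.random.default_rng(6)
    vec = lambda X: X.reshape(-1, 1, order='F')

    # (2.59): d vec(AB) = (B^T ⊗ I) d vec A + (I ⊗ A) d vec B
    A, B = rng.random((3, 4)), rng.random((4, 2))
    dA, dB = 1e-6 * rng.random((3, 4)), 1e-6 * rng.random((4, 2))
    lhs = vec((A + dA) @ (B + dB) - A @ B)
    rhs = np.kron(B.T, np.eye(3)) @ vec(dA) + np.kron(np.eye(2), A) @ vec(dB)
    print(np.allclose(lhs, rhs, atol=1e-9))

    # (2.82): d vec(X^{-1}) = -[(X^{-1})^T ⊗ X^{-1}] d vec X
    X = rng.random((3, 3)) + 3 * np.eye(3)       # comfortably nonsingular
    dX = 1e-6 * rng.random((3, 3))
    Xinv = np.linalg.inv(X)
    lhs = vec(np.linalg.inv(X + dX) - Xinv)
    rhs = -np.kron(Xinv.T, Xinv) @ vec(dX)
    print(np.allclose(lhs, rhs, atol=1e-9))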

9 LTRE Decomposition of Demographic Differences

The LTRE decomposition in Sect. 1.3.1 extends readily to matrix calculus. Suppose that a demographic outcome ξ, dimension (s × 1), is a function of a vector θ of parameters, dimension (p × 1). Suppose that results are obtained under two “conditions,” with parameters θ (1) and θ (2). Define the parameter difference as Δθ = θ (2) −θ (1) and the effect as Δξ = ξ (2) −ξ (1). Then, to first order,

$$\displaystyle \begin{aligned} \begin{array}{rcl} \Delta \boldsymbol{\xi} &\displaystyle \approx&\displaystyle \sum_{i=1}^p {d \boldsymbol{\xi} \over d \theta_i} \Delta \theta_i \end{array} \end{aligned} $$
(2.85)
$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle =&\displaystyle {d \boldsymbol{\xi} \over d \boldsymbol{\theta}^{\mathsf{T}}} \Delta \boldsymbol{\theta} . \end{array} \end{aligned} $$
(2.86)

Writing

$$\displaystyle \begin{aligned} \Delta \boldsymbol{\theta} = \mathcal{D}\,(\Delta \boldsymbol{\theta}) {\mathbf{1}}_p, \end{aligned} $$
(2.87)

we create a contribution matrix C, of dimension s × p,

$$\displaystyle \begin{aligned} {\mathbf{C}} = {d \boldsymbol{\xi} \over d \boldsymbol{\theta}^{\mathsf{T}}} \; \mathcal{D}\,(\Delta \boldsymbol{\theta}) . \end{aligned} $$
(2.88)

The (i, j) entry of C is the contribution of Δθ j to the difference Δξ i, for i = 1, …, s and j = 1, …, p. The rows and columns of C give

$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{C}}(i,:) &\displaystyle =&\displaystyle \mbox{contributions of }\Delta \boldsymbol{\theta}\mbox{ to }\Delta \xi_i \end{array} \end{aligned} $$
(2.89)
$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{C}}(:,j) &\displaystyle =&\displaystyle \mbox{contributions of }\theta_j\mbox{ to }\Delta \boldsymbol{\xi} \end{array} \end{aligned} $$
(2.90)

When calculating C, the derivative of ξ must be evaluated somewhere. Experience suggests that evaluating it at the midpoint between θ (1) and θ (2) gives good results (Logofet and Lesnaya 1997; Caswell 2001).
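A minimal numpy sketch of the decomposition (my example; the linear outcome ξ = Aθ is chosen so that the first-order approximation (2.85) is exact and easy to check):

    import numpy as np

    def contributions(dxi_dtheta, theta1, theta2):
        """Contribution matrix C = (d xi / d theta^T) D(Delta theta), Eq. (2.88)."""
        return dxi_dtheta @ np.diag(theta2 - theta1)

    A = np.array([[1.0, 0.5, 0.0],
                  [0.2, 2.0, 1.0]])         # d xi / d theta^T for xi = A theta
    theta1 = np.array([1.0, 1.0, 1.0])
    theta2 = np.array([1.2, 0.9, 1.5])

    C = contributions(A, theta1, theta2)
    print(C)                                # C(i, j): contribution of Delta theta_j to Delta xi_i
    print(C.sum(axis=1))                    # row sums: approximate Delta xi
    print(A @ (theta2 - theta1))            # exact Delta xi (equal here; xi is linear)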

10 A Protocol for Sensitivity Analysis

The calculations may grow to be complex, but the protocol is simple:

  1. write a matrix expression for the outcome,
  2. differentiate,
  3. vec,
  4. simplify,
  5. calculate derivatives from the differentials, and
  6. extend using the chain rule.

The rest of this book shows what can be done with this simple procedure.
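As an end-to-end sketch of the protocol (my example; demographically relevant derivations of this kind appear in Chap. 5): let N = (I − U)^{−1}. Writing the expression and differentiating gives dN = N (dU) N; applying the vec operator and Roth's theorem gives d vec N = (N^T ⊗ N) d vec U; and the First Identification Theorem identifies d vec N∕d vec^T U = N^T ⊗ N. A numerical check:

    import numpy as np

    vec = lambda X: X.reshape(-1, 1, order='F')

    U = np.array([[0.0, 0.0, 0.0],
                  [0.8, 0.0, 0.0],
                  [0.0, 0.7, 0.9]])          # transient transition probabilities
    I = np.eye(3)
    N = np.linalg.inv(I - U)                 # fundamental matrix

    deriv = np.kron(N.T, N)                  # d vec N / d vec^T U, a 9 x 9 matrix

    dU = np.zeros_like(U)
    dU[1, 0] = 1e-6                          # perturb one transition probability
    dN = np.linalg.inv(I - (U + dU)) - N     # actual change in N
    print(np.allclose(vec(dN), deriv @ vec(dU), atol=1e-9))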