Matrix Calculus and Notation

  • Hal Caswell
Open Access
Part of the Demographic Research Monographs book series (DEMOGRAPHIC)



2.1 Introduction: Can It Possibly Be That Simple?

In October of 2005, I scribbled in a notebook, “can it possibly be that simple?” I was referring to the sensitivity of transient dynamics (the eventual results appear in Chap.  7), and had just begun to use matrix calculus as a tool. The answer to my question was yes. It can be that simple.

This book relies on this set of mathematical techniques. This chapter introduces the basics, which will be used throughout the text. For more information, I recommend four sources in particular. The most complete treatment, but not the easiest starting point, is the book by Magnus and Neudecker (1988). More accessible introductions can be found in the paper by Magnus and Neudecker (1985) and especially the text by Abadir and Magnus (2005). A review paper by Nel (1980) is helpful in placing the Magnus-Neudecker formulation in the context of other attempts at a calculus of matrices.

Sensitivity analysis asks how much change in an outcome variable y is caused by a change in some parameter x. At its most basic level, and with some reasonable assumptions about the continuity and differentiability of the functional relationships involved, the solution is given by differential calculus. If y is a function of x, then the derivative
$$\displaystyle \begin{aligned} {d y \over d x} {} \end{aligned}$$
tells how y responds to a change in x, i.e., the sensitivity of y to a change in x.
However, the outcomes of a demographic calculation may be scalar-valued (e.g., the population growth rate λ), vector-valued (e.g., the stable stage distribution), or matrix-valued (e.g., the fundamental matrix). Any of these outcomes may be functions of scalar-valued parameters (e.g., the Gompertz aging rate), vector-valued parameters (e.g., the mortality schedule), or matrix-valued parameters (e.g., the transition matrix). Thus, sensitivity analysis in demography requires more than the simple derivative in (2.3). We want a consistent and flexible approach to differentiating
$$\displaystyle \begin{aligned} \left\{ \begin{array}{r} \mbox{scalar-valued} \\ \mbox{vector-valued}\\ \mbox{matrix-valued} \end{array} \right\} \; \mbox{functions of} \; \left\{ \begin{array}{l} \mbox{scalar} \\ \mbox{vector}\\ \mbox{matrix} \end{array} \right\} \; \mbox{arguments} \end{aligned}$$

2.2 Notation and Matrix Operations

2.2.1 Notation

Matrices are denoted by upper case bold symbols (e.g., A), vectors (usually) by lower case bold symbols (e.g., n). The (i, j) entry of the matrix A is aij, and the ith entry of the vector n is ni. Sometimes we will use Matlab notation, and write
$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{ X}}(i,:) &\displaystyle =&\displaystyle \mbox{row }i\mbox{ of }{\mathbf{ X}} \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{ X}}(:,j) &\displaystyle =&\displaystyle \mbox{column }j \mbox{ of }{\mathbf{ X}} \end{array} \end{aligned} $$
The notation
$$\displaystyle \begin{aligned} \left(\begin{array}{c} x_{ij} \end{array}\right) \end{aligned}$$
denotes a matrix whose (i, j) entry is xij. For example,
$$\displaystyle \begin{aligned} \left(\begin{array}{c} \displaystyle {d y_i \over d x_j} \end{array}\right) \end{aligned}$$
is the matrix whose (i, j) entry is the derivative of yi with respect to xj.

The transpose of X is XT. Logarithms are natural logarithms. The vector norm ∥x∥ is, unless noted otherwise, the 1-norm. The symbol \(\mathcal {D}\,({\mathbf { x}})\) denotes the square matrix with x on the diagonal and zeros elsewhere. The symbol 1 denotes a vector of ones. The vector ei is a unit vector with 1 in the ith entry and zeros elsewhere. The identity matrix is I. Where necessary for clarity, the dimension of matrices or vectors will be indicated by a subscript. Thus Is is an s × s identity matrix, 1s is an s × 1 vector of ones, and Xm×n is an m × n matrix.

In some places (Chaps.  6 and  10) block-structured matrices appear; these are denoted by either \(\mathbb {A}\) or \(\tilde {{\mathbf { A}}}\), depending on the context and the role of the matrix.

2.2.2 Operations

In addition to the familiar matrix product AB, we will also use the Hadamard, or elementwise product
$$\displaystyle \begin{aligned} {\mathbf{ A}} \circ {\mathbf{ B}} = \left(\begin{array}{c} a_{ij} b_{ij} \end{array}\right) \end{aligned} $$
and the Kronecker product
$$\displaystyle \begin{aligned} {\mathbf{ A}} \otimes {\mathbf{ B}} = \left(\begin{array}{c} a_{ij} {\mathbf{ B}} \end{array}\right) \end{aligned} $$
The Hadamard product requires that A and B be the same size. The Kronecker product is defined for any sizes of A and B. Some useful properties of the Kronecker product include
$$\displaystyle \begin{aligned} \begin{array}{rcl} \left( {\mathbf{ A}} \otimes {\mathbf{ B}} \right)^{-1} &\displaystyle =&\displaystyle \left( {\mathbf{ A}}^{-1} \otimes {\mathbf{ B}}^{-1} \right) {} \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} \left( {\mathbf{ A}} \otimes {\mathbf{ B}} \right)^{\mathsf{T}} &\displaystyle =&\displaystyle \left( {\mathbf{ A}}^{\mathsf{T}} \otimes {\mathbf{ B}}^{\mathsf{T}} \right) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{ A}} \otimes \left( {\mathbf{ B}} + {\mathbf{ C}} \right) &\displaystyle =&\displaystyle \left( {\mathbf{ A}} \otimes {\mathbf{ B}} \right) + \left( {\mathbf{ A}} \otimes {\mathbf{ C}} \right) \end{array} \end{aligned} $$
and, provided that the matrices are of the right size for the products to be defined,
$$\displaystyle \begin{aligned} \begin{array}{rcl} \left({\mathbf{ A}}_1 \otimes {\mathbf{ B}}_1 \right) \left({\mathbf{ A}}_2 \otimes {\mathbf{ B}}_2 \right) &\displaystyle =&\displaystyle \left({\mathbf{ A}}_1 {\mathbf{ A}}_2 \otimes {\mathbf{ B}}_1 {\mathbf{ B}}_2 \right) . {} \end{array} \end{aligned} $$
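These properties are easy to confirm numerically. A minimal check in Python/NumPy (a sketch with arbitrary random matrices, not part of the text's notation):

```python
import numpy as np

rng = np.random.default_rng(1)
A1, A2 = rng.random((2, 3)), rng.random((3, 2))
B1, B2 = rng.random((4, 2)), rng.random((2, 4))

# mixed-product property: (A1 ⊗ B1)(A2 ⊗ B2) = (A1 A2) ⊗ (B1 B2)
left = np.kron(A1, B1) @ np.kron(A2, B2)
right = np.kron(A1 @ A2, B1 @ B2)
assert np.allclose(left, right)

# transpose and inverse distribute over the Kronecker product
A, B = rng.random((3, 3)), rng.random((2, 2))
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
assert np.allclose(np.linalg.inv(np.kron(A, B)),
                   np.kron(np.linalg.inv(A), np.linalg.inv(B)))
```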

2.2.3 The Vec Operator and Vec-Permutation Matrix

The vec operator transforms an m × n matrix A into an mn × 1 vector by stacking the columns one above the next,
$$\displaystyle \begin{aligned} \mbox{vec} \, {\mathbf{ A}} = \left(\begin{array}{c} {\mathbf{ A}}(:,1) \\ \vdots\\ {\mathbf{ A}}(:,n) \end{array}\right) \end{aligned} $$
For example,
$$\displaystyle \begin{aligned} \mbox{vec} \, \left(\begin{array}{cc} a & b \\ c & d \end{array}\right) = \left(\begin{array}{c} a\\ c \\ b \\ d \end{array}\right) . \end{aligned} $$
The vec of A and the vec of AT are rearrangements of the same entries; they are related by
$$\displaystyle \begin{aligned} \mbox{vec} \, {\mathbf{ A}}^{\mathsf{T}} = {\mathbf{ K}}_{m,n} \mbox{vec} \, {\mathbf{ A}} \end{aligned} $$
where A is m × n and Km,n is the vec-permutation matrix (Henderson and Searle 1981) or commutation matrix (Magnus and Neudecker 1979). The vec-permutation matrix can be calculated as
$$\displaystyle \begin{aligned} {\mathbf{ K}}_{m,n} = \sum_{i=1}^m \sum_{j=1}^n \left( {\mathbf{ E}}_{ij} \otimes {\mathbf{ E}}_{ij}^{\mathsf{T}} \right) \end{aligned} $$
where Eij is a matrix, of dimension m × n, with a 1 in the (i, j) entry and zeros elsewhere. Like any permutation matrix, K−1 = KT.
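In NumPy, vec corresponds to a column-major reshape, and the double sum above builds K directly; a sketch (the helper names `vec` and `vec_perm` are mine):

```python
import numpy as np

def vec(A):
    # stack the columns of A into a single column (column-major reshape)
    return A.reshape(-1, order='F')

def vec_perm(m, n):
    # K_{m,n} = sum_{i,j} (E_ij ⊗ E_ij^T), with E_ij of dimension m x n
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            E = np.zeros((m, n))
            E[i, j] = 1.0
            K += np.kron(E, E.T)
    return K

A = np.arange(6.0).reshape(2, 3)           # a 2 x 3 matrix
K = vec_perm(2, 3)
assert np.allclose(vec(A.T), K @ vec(A))   # vec A^T = K_{m,n} vec A
assert np.allclose(np.linalg.inv(K), K.T)  # K^{-1} = K^T
```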

The vec operator and the vec-permutation matrix are particularly important in multistate models (e.g., age×stage-classified models), where they are used in both the formulation and analysis of the models (e.g., Caswell 2012, 2014; Caswell and Salguero-Gómez 2013; Caswell et al. 2018); see also Chap.  6. Extensions to an arbitrary number of dimensions, so-called hyperstate models, have been presented by Roth and Caswell (2016).

2.2.4 Roth’s Theorem

The vec operator and the Kronecker product are connected by a theorem due to Roth (1934):
$$\displaystyle \begin{aligned} \mbox{vec} \, \left({\mathbf{ABC}}\right) = \left({\mathbf{ C}}^{\mathsf{T}} \otimes {\mathbf{ A}} \right) \mbox{vec} \, {\mathbf{ B}} . {} \end{aligned} $$
We will often want to obtain the vec of a matrix that appears in the middle of a product; we will use Roth’s theorem repeatedly.
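Roth's theorem, too, can be checked numerically; a NumPy sketch with arbitrary conformable matrices (vec is a column-major reshape):

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order='F')

rng = np.random.default_rng(0)
A, B, C = rng.random((2, 3)), rng.random((3, 4)), rng.random((4, 5))

# Roth's theorem: vec(ABC) = (C^T ⊗ A) vec(B)
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))
```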

2.3 Defining Matrix Derivatives

The derivative of a scalar y with respect to a scalar x is familiar. What, however, does it mean to speak of the derivative of a scalar with respect to a vector, or of a vector with respect to another vector, or any other combination? These can be defined in more than one way and the choice is critical (Nel 1980; Magnus and Neudecker 1985). This book relies on the notation due to Magnus and Neudecker, because it makes certain operations possible and consistent.
  • If x and y are scalars, the derivative of y with respect to x is the familiar derivative dy∕dx.

  • If y is an n × 1 vector and x a scalar, the derivative of y with respect to x is the n × 1 column vector
    $$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d x} = \left(\begin{array}{c} \displaystyle {d y_1 \over d x} \\ \vdots \\ \displaystyle {d y_n \over d x} \end{array}\right) . \end{aligned} $$
  • If y is a scalar and x an m × 1 vector, the derivative of y with respect to x is the 1 × m row vector (called the gradient vector)
    $$\displaystyle \begin{aligned} {d y \over d {\mathbf{ x}}^{\mathsf{T}}} = \left(\begin{array}{ccc} \displaystyle {\partial y \over \partial x_1} & \cdots & \displaystyle {\partial y \over \partial x_m} \end{array}\right). \end{aligned} $$
    Note the orientation of dy∕dx as a column vector and dy∕dxT as a row vector.
  • If y is an n × 1 vector and x an m × 1 vector, the derivative of y with respect to x is defined to be the n × m matrix whose (i, j) entry is the derivative of yi with respect to xj, i.e.,
    $$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d {\mathbf{ x}}^{\mathsf{T}}} = \left(\begin{array}{c} \displaystyle {d y_i \over d x_j} \end{array}\right) \end{aligned} $$
    (this matrix is called the Jacobian matrix).
  • Derivatives involving matrices are written by first transforming the matrices into vectors using the vec operator, and then applying the rules for vector differentiation to the resulting vectors. Thus, the derivative of the m × n matrix Y with respect to the p × q matrix X is the mn × pq matrix
    $$\displaystyle \begin{aligned} {d \mbox{vec} \, {\mathbf{ Y}} \over d \rule{0in}{2.5ex} \left( \mbox{vec} \, {\mathbf{ X}} \right)^{\mathsf{T}}}. \end{aligned} $$
    From now on, I will write vecT X for (vec X)T.
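These layout conventions can be illustrated with a finite-difference Jacobian; in this sketch (the helper `jacobian` is hypothetical) y = Ax, so the Jacobian dy∕dxT should recover A itself:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((3, 4))
f = lambda x: A @ x        # y = Ax, so dy/dx^T = A (an n x m = 3 x 4 matrix)

def jacobian(f, x, h=1e-7):
    # (i, j) entry is dy_i/dx_j; built column by column from finite differences
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - y0) / h
    return J

x = rng.random(4)
assert np.allclose(jacobian(f, x), A, atol=1e-5)
```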

2.4 The Chain Rule

The chain rule for differentiation is your friend. The Magnus-Neudecker notation, unlike some alternatives, extends the familiar scalar chain rule to derivatives of vectors and matrices (Nel 1980; Magnus and Neudecker 1985). If u (size m × 1) is a function of v (size n × 1) and v is a function of x (size p × 1), then
$$\displaystyle \begin{aligned} \underbrace{{d {\mathbf{ u}} \over d {\mathbf{ x}}^{\mathsf{T}}}}_{m \times p} = \underbrace{ \left( {d {\mathbf{ u}} \over d {\mathbf{ v}}^{\mathsf{T}}} \right)}_{m \times n} \;\underbrace{ \left( {d {\mathbf{ v}} \over d {\mathbf{ x}}^{\mathsf{T}}} \right) }_{n \times p} {} \end{aligned} $$
Notice that the dimensions are correct, and that the order of the multiplication matters. Checking dimensional consistency in this way is a useful means of finding errors.
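As a numerical check of the vector chain rule, take u = exp(v) elementwise and v = Nx, so that du∕dvT = D(exp v) and dv∕dxT = N; a NumPy sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
N = rng.random((3, 2))
x = rng.random(2)

# u = exp(v) elementwise, v = N x; chain rule: du/dx^T = D(exp(Nx)) N  (3 x 2)
J_chain = np.diag(np.exp(N @ x)) @ N

# finite-difference check of the same 3 x 2 Jacobian
h = 1e-7
J_fd = np.column_stack([
    (np.exp(N @ (x + h * np.eye(2)[:, j])) - np.exp(N @ x)) / h
    for j in range(2)
])
assert np.allclose(J_chain, J_fd, atol=1e-5)
```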

2.5 Derivatives from Differentials

The key to the matrix calculus of Magnus and Neudecker (1988) is the relationship between the differential and the derivative of a function. Experience suggests that, for many readers of this book, this relationship is shrouded in the mists of long-ago calculus classes.

2.5.1 Differentials of Scalar Functions

Start with scalars. Suppose that y = f(x) is a differentiable function at x = x0. Then the derivative of y with respect to x at the point x0 is defined as
$$\displaystyle \begin{aligned} f'(x_0) = \lim_{h\rightarrow 0} \frac{f(x_0 + h) - f(x_0)}{h}. {} \end{aligned} $$
Now define the differential of y. This is customarily denoted dy, but for the moment, I will denote it by cy. The differential of y at x0 is a function of h, defined by
$$\displaystyle \begin{aligned} cy(x_0,h) = f'(x_0)h. \end{aligned} $$
There is no requirement that h be “small.” Since x is a function of itself, x = g(x), with g′(x) = 1, we also have cx(x0, h) = g′(x0)h = h. Thus the ratio of the differential of y and the differential of x is
$$\displaystyle \begin{aligned} \frac{cy(x_0,h)}{cx(x_0,h)} = \frac{f'(x_0)h}{h} = f'(x_0). {} \end{aligned} $$
That is, the derivative is equal to the ratio of the differentials.
Now, return to the standard notation of dy for the differential of y. This gives two meanings to the familiar notation for derivatives,
$$\displaystyle \begin{aligned} \left. {d y \over d x} \right|{}_{x_0} = f'(x_0). \end{aligned} $$
The left hand side can be regarded either as equivalent to the limit (2.19) or as the ratio of the differentials given by (2.21). Mathematicians are strangely unconcerned with this ambiguity (e.g., Hardy 1952).
All this leads to a set of familiar rules for calculating differentials that guarantee that they can be used to create derivatives. A few of these, for scalars, are
$$\displaystyle \begin{aligned} \begin{array}{rcl} d(u+v) &\displaystyle =&\displaystyle du + dv {} \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} d(cu) &\displaystyle =&\displaystyle c \; du \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} d(uv) &\displaystyle =&\displaystyle u(dv) + (du) v \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} d(e^u) &\displaystyle =&\displaystyle e^u du \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} d(\log u) &\displaystyle =&\displaystyle \frac{1}{u} du \end{array} \end{aligned} $$
If y = f(x1, x2), then the total differential is
$$\displaystyle \begin{aligned} dy = {\partial f \over \partial x_1} dx_1 + {\partial f \over \partial x_2} dx_2. {} \end{aligned} $$
Derivatives can be constructed from these expressions at will by dividing by differentials. For example, dividing (2.23) by dx gives d(u + v)∕dx = du∕dx + dv∕dx. From (2.28), we have
$$\displaystyle \begin{aligned} \begin{array}{rcl} {d y \over d x_1} &\displaystyle =&\displaystyle {\partial f \over \partial x_1} + {\partial f \over \partial x_2} {d x_2 \over d x_1} \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} {d y \over d x_2} &\displaystyle =&\displaystyle {\partial f \over \partial x_1} {d x_1 \over d x_2} + {\partial f \over \partial x_2} . \end{array} \end{aligned} $$

2.5.2 Differentials of Vectors and Matrices

To extend these concepts to matrices, we define the differential of a matrix (or vector) as the matrix (or vector) of differentials of the elements; i.e.,
$$\displaystyle \begin{aligned} d {\mathbf{ X}} = \left(\begin{array}{c} \displaystyle dx_{ij} \end{array}\right). \end{aligned} $$
This definition leads to some basic rules for differentials of matrices:
$$\displaystyle \begin{aligned} \begin{array}{rcl} d (c {\mathbf{ U}}) &\displaystyle =&\displaystyle c ( d {\mathbf{ U}}) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} d ({\mathbf{ U}} + {\mathbf{ V}}) &\displaystyle =&\displaystyle d {\mathbf{ U}} + d {\mathbf{ V}} \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} d ({\mathbf{ U}} {\mathbf{ V}}) &\displaystyle =&\displaystyle (d {\mathbf{ U}}) {\mathbf{ V}} + {\mathbf{ U}} (d {\mathbf{ V}}) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} d({\mathbf{ U}} \otimes {\mathbf{ V}}) &\displaystyle =&\displaystyle (d {\mathbf{ U}}) \otimes {\mathbf{ V}} + {\mathbf{ U}} \otimes (d {\mathbf{ V}}) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} d({\mathbf{ U}} \circ {\mathbf{ V}}) &\displaystyle =&\displaystyle (d {\mathbf{ U}}) \circ {\mathbf{ V}} + {\mathbf{ U}} \circ (d {\mathbf{ V}}) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} d \mbox{vec} \, {\mathbf{ U}} &\displaystyle =&\displaystyle \mbox{vec} \, d {\mathbf{ U}} \end{array} \end{aligned} $$
where c is a constant, and, of course, the dimensions of U and V must be conformable. The differential of an operator applied elementwise to a vector can be obtained from the differentials of the elements. For example, suppose u is an s × 1 vector, and the exponential is applied elementwise. Then
$$\displaystyle \begin{aligned} \begin{array}{rcl} d (\exp({\mathbf{ u}})) &=& \left(\begin{array}{c} e^{u_1} d u_1 \\ \vdots \\ e^{u_s} d u_s \end{array}\right) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} &=& \mathcal{D}\,\left[ \rule{0in}{2ex} \exp({\mathbf{ u}}) \right] d {\mathbf{ u}} . \end{array} \end{aligned} $$
If y is a function of x1 and x2, the total differential is given just as in (2.28), by
$$\displaystyle \begin{aligned} d {\mathbf{ y}} = {\partial {\mathbf{ y}} \over \partial {\mathbf{ x}}_1^{\mathsf{T}}} d {\mathbf{ x}}_1 + {\partial {\mathbf{ y}} \over \partial {\mathbf{ x}}_2^{\mathsf{T}}} d {\mathbf{ x}}_2 \end{aligned} $$
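The elementwise-exponential rule above is easy to confirm numerically: for a small perturbation du, the change in exp(u) is D(exp u) du to first order (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.random(5)
du = 1e-7 * rng.random(5)          # a small perturbation playing the role of du

# d(exp u) = D(exp u) du, with exp applied elementwise
change = np.exp(u + du) - np.exp(u)
first_order = np.diag(np.exp(u)) @ du
assert np.allclose(change, first_order, atol=1e-10)
```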

2.6 The First Identification Theorem

For scalar y and x,
$$\displaystyle \begin{aligned} dy = q dx \Longrightarrow {d y \over d x} = q. \end{aligned} $$
That much is easy. But, suppose that y is an n × 1 vector function of the m × 1 vector x. The differential dy is the n × 1 vector
$$\displaystyle \begin{aligned} d {\mathbf{ y}} = \left(\begin{array}{c} dy_1 \\ \vdots \\ dy_n \end{array}\right) \end{aligned} $$
which, by the total derivative rule, is
$$\displaystyle \begin{aligned} \begin{array}{rcl} d {\mathbf{ y}} &=& \left(\begin{array}{c} \displaystyle {\partial y_1 \over \partial x_1} dx_1 + \cdots + {\partial y_1 \over \partial x_m} dx_m \\ \vdots \\ \displaystyle {\partial y_n \over \partial x_1} dx_1 + \cdots + {\partial y_n \over \partial x_m} dx_m \end{array}\right) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} &=& \left(\begin{array}{ccc} {\partial y_1 \over \partial x_1} & \cdots & {\partial y_1 \over \partial x_m} \\ \vdots & & \vdots \\ {\partial y_n \over \partial x_1} & \cdots & {\partial y_n \over \partial x_m} \end{array}\right) \left(\begin{array}{c} dx_1 \\ \vdots \\ dx_m \end{array}\right) \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} &=& {\mathbf{ Q}} \; d {\mathbf{ x}}. {} \end{array} \end{aligned} $$
If these were scalars, dividing both sides by dx would give Q as the derivative of y with respect to x. But, one cannot divide by a vector. Instead, Magnus and Neudecker proved that if it can be shown that
$$\displaystyle \begin{aligned} d {\mathbf{ y}} = {\mathbf{ Q}} \; d {\mathbf{ x}} {} \end{aligned} $$
then the derivative is
$$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d {\mathbf{ x}}^{\mathsf{T}}} = {\mathbf{ Q}}. {} \end{aligned} $$
This is the First Identification Theorem of Magnus and Neudecker (1988).1

2.6.1 The Chain Rule and the First Identification Theorem

Suppose that dy is given by (2.46), and that x is in turn a function of some vector θ. Then
$$\displaystyle \begin{aligned} d {\mathbf{ x}} = {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}} d \boldsymbol{\theta} \end{aligned} $$
$$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} = {\mathbf{ Q}} {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}}. \end{aligned} $$
In other words, the differential expression (2.46) can be transformed into a derivative with respect to any vector by careful use of the chain rule. This applies equally to more complicated expressions for the differential. Suppose that
$$\displaystyle \begin{aligned} d {\mathbf{ y}} = {\mathbf{ Q}} d {\mathbf{ x}} + {\mathbf{ R}} d {\mathbf{ z}}. \end{aligned} $$
Applying the chain rule to the differentials on the right hand side gives
$$\displaystyle \begin{aligned} d {\mathbf{ y}} = {\mathbf{ Q}} {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}} d \boldsymbol{\theta} + {\mathbf{ R}} {d {\mathbf{ z}} \over d \boldsymbol{\theta}^{\mathsf{T}}} d \boldsymbol{\theta} \end{aligned} $$
for any vector θ. Thus
$$\displaystyle \begin{aligned} d {\mathbf{ y}} = \left({\mathbf{ Q}} {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}} + {\mathbf{ R}} {d {\mathbf{ z}} \over d \boldsymbol{\theta}^{\mathsf{T}}} \right) d \boldsymbol{\theta}, \end{aligned} $$
and the First Identification Theorem gives
$$\displaystyle \begin{aligned} {d {\mathbf{ y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} = \left( {\mathbf{ Q}} {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}} + {\mathbf{ R}} {d {\mathbf{ z}} \over d \boldsymbol{\theta}^{\mathsf{T}}} \right) . \end{aligned} $$

2.7 Elasticity

When parameters are measured on different scales, it is sometimes helpful to calculate proportional effects of proportional perturbations, also called elasticities. The elasticity of yi to θj is
$$\displaystyle \begin{aligned} {\epsilon y_i \over \epsilon \theta_j} = \frac{\theta_j}{y_i} {d y_i \over d \theta_j} . \end{aligned} $$
For vectors y and θ, this becomes
$$\displaystyle \begin{aligned} {\epsilon {\mathbf{ y}} \over \epsilon \boldsymbol{\theta}^{\mathsf{T}}} = \mathcal{D}\,({\mathbf{ y}})^{-1} \; {d {\mathbf{ y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} \; \mathcal{D}\,(\boldsymbol{\theta}) . {} \end{aligned} $$
There seems to be no accepted notation for elasticities; the notation used here is adapted from that in Samuelson (1947).
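A quick check of the matrix expression: for a power-law relationship the elasticity equals the exponent. A NumPy sketch with yi = θi², for which the elasticity matrix should be 2I:

```python
import numpy as np

rng = np.random.default_rng(12)
theta = 0.5 + rng.random(3)
y = theta ** 2                        # y_i = theta_i^2

J = np.diag(2 * theta)                # Jacobian d y / d theta^T
E = np.linalg.inv(np.diag(y)) @ J @ np.diag(theta)   # D(y)^{-1} (dy/dtheta^T) D(theta)
assert np.allclose(E, 2 * np.eye(3))  # elasticity of a power law = the exponent
```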

2.8 Some Useful Matrix Calculus Results

Several matrix calculus results will be used repeatedly. Many more can be found in Magnus and Neudecker (1988) and Abadir and Magnus (2005).
  1.
    The matrix product Y = AB. Differentiate,
    $$\displaystyle \begin{aligned} d {\mathbf{ Y}} = (d {\mathbf{ A}}) {\mathbf{ B}} + {\mathbf{ A}} (d {\mathbf{ B}}). \end{aligned} $$
    Then write (or imagine writing; with practice one does not actually need this step explicitly)
    $$\displaystyle \begin{aligned} \begin{array}{rcl} \left( d {\mathbf{ A}} \right) {\mathbf{ B}} &\displaystyle =&\displaystyle {\mathbf{ I}} \left( d {\mathbf{ A}} \right) {\mathbf{ B}} \end{array} \end{aligned} $$
    $$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{ A}} \left( d {\mathbf{ B}} \right) &\displaystyle =&\displaystyle {\mathbf{ A}} \left( d {\mathbf{ B}} \right) {\mathbf{ I}} \end{array} \end{aligned} $$
    and apply the vec operator and Roth’s theorem, to obtain
    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{ Y}} = \left( {\mathbf{ B}}^{\mathsf{T}} \otimes {\mathbf{ I}} \right) d \mbox{vec} \, {\mathbf{ A}} + \left( {\mathbf{ I}} \otimes {\mathbf{ A}} \right) d \mbox{vec} \, {\mathbf{ B}}. \end{aligned} $$
    The chain rule gives, for any vector variable θ
    $$\displaystyle \begin{aligned} {d \mbox{vec} \, {\mathbf{ Y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} = \left( {\mathbf{ B}}^{\mathsf{T}} \otimes {\mathbf{ I}} \right) {d \mbox{vec} \, {\mathbf{ A}} \over d \boldsymbol{\theta}^{\mathsf{T}}} + \left( {\mathbf{ I}} \otimes {\mathbf{ A}} \right) {d \mbox{vec} \, {\mathbf{ B}} \over d \boldsymbol{\theta}^{\mathsf{T}}}. \end{aligned} $$
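A numerical check of the product rule, comparing the exact change in vec(AB) with the differential expression (a NumPy sketch; vec is a column-major reshape):

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order='F')

rng = np.random.default_rng(5)
A, B = rng.random((2, 3)), rng.random((3, 4))
dA, dB = 1e-7 * rng.random((2, 3)), 1e-7 * rng.random((3, 4))

# d vec(AB) = (B^T ⊗ I_2) d vec A + (I_4 ⊗ A) d vec B
exact = vec((A + dA) @ (B + dB) - A @ B)
linear = np.kron(B.T, np.eye(2)) @ vec(dA) + np.kron(np.eye(4), A) @ vec(dB)
assert np.allclose(exact, linear, atol=1e-12)
```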
  2.
    The Hadamard product Y = A ∘B. Differentiate the product,
    $$\displaystyle \begin{aligned} d {\mathbf{ Y}} = d {\mathbf{ A}} \circ {\mathbf{ B}} + {\mathbf{ A}} \circ d {\mathbf{ B}}, \end{aligned} $$
    then vec
    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{ Y}} = d \mbox{vec} \, {\mathbf{ A}} \circ \mbox{vec} \, {\mathbf{ B}} + \mbox{vec} \, {\mathbf{ A}} \circ d \mbox{vec} \, {\mathbf{ B}} . \end{aligned} $$
    It will be useful to replace the Hadamard products, which we do using the fact that \({\mathbf { x}} \circ {\mathbf { y}} = \mathcal {D}\,({\mathbf { x}}) {\mathbf { y}}\), to get
    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{ Y}} = \mathcal{D}\,(\mbox{vec} \, {\mathbf{ B}}) d \mbox{vec} \, {\mathbf{ A}} + \mathcal{D}\,(\mbox{vec} \, {\mathbf{ A}}) d \mbox{vec} \, {\mathbf{ B}}. \end{aligned} $$
    The chain rule gives the derivative from the differential,
    $$\displaystyle \begin{aligned} {d \mbox{vec} \, {\mathbf{ Y}} \over d \boldsymbol{\theta}^{\mathsf{T}}} = \mathcal{D}\,(\mbox{vec} \, {\mathbf{ B}}) {d \mbox{vec} \, {\mathbf{ A}} \over d \boldsymbol{\theta}^{\mathsf{T}}} + \mathcal{D}\,(\mbox{vec} \, {\mathbf{ A}}) {d \mbox{vec} \, {\mathbf{ B}} \over d \boldsymbol{\theta}^{\mathsf{T}}}. \end{aligned} $$
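The same kind of check works for the Hadamard product, with D(vec B) and D(vec A) replacing the Hadamard products (a NumPy sketch):

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order='F')

rng = np.random.default_rng(6)
A, B = rng.random((3, 3)), rng.random((3, 3))
dA, dB = 1e-7 * rng.random((3, 3)), 1e-7 * rng.random((3, 3))

# d vec(A ∘ B) = D(vec B) d vec A + D(vec A) d vec B
exact = vec((A + dA) * (B + dB) - A * B)   # * is the elementwise product
linear = np.diag(vec(B)) @ vec(dA) + np.diag(vec(A)) @ vec(dB)
assert np.allclose(exact, linear, atol=1e-12)
```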
  3.
    Diagonal matrices. The diagonal matrix \(\mathcal {D}\, ({\mathbf { x}})\), with the vector x on the diagonal and zeros elsewhere, can be written
    $$\displaystyle \begin{aligned} \mathcal{D}\, ({\mathbf{ x}}) = {\mathbf{ I}} \circ \left( \mathbf{1} \,{\mathbf{ x}}^{\mathsf{T}} \right) {} \end{aligned} $$
    Differentiate both sides,
    $$\displaystyle \begin{aligned} d \mathcal{D}\, ({\mathbf{ x}}) = {\mathbf{ I}} \circ \left( \mathbf{1} \,d {\mathbf{ x}}^{\mathsf{T}} \right) \end{aligned} $$
    and vec the result
    $$\displaystyle \begin{aligned} \begin{array}{rcl} d \mbox{vec} \, \mathcal{D}\, ({\mathbf{ x}}) &\displaystyle =&\displaystyle \mathcal{D}\,(\mbox{vec} \, {\mathbf{ I}}) \mbox{vec} \, \left( \mathbf{1} \,d {\mathbf{ x}}^{\mathsf{T}} \right) \end{array} \end{aligned} $$
    $$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle =&\displaystyle \mathcal{D}\,(\mbox{vec} \, {\mathbf{ I}}) \left( {\mathbf{ I}} \otimes \mathbf{1} \right) d {\mathbf{ x}} \end{array} \end{aligned} $$
    The First Identification Theorem gives
    $$\displaystyle \begin{aligned} {d \mbox{vec} \, \mathcal{D}\, ({\mathbf{ x}}) \over d \boldsymbol{\theta}^{\mathsf{T}}} = \mathcal{D}\,(\mbox{vec} \, {\mathbf{ I}}) \left( {\mathbf{ I}} \otimes \mathbf{1} \right) {d {\mathbf{ x}} \over d \boldsymbol{\theta}^{\mathsf{T}}}. {} \end{aligned} $$

    The identity matrix in (2.65) masks the matrix \( \left ( \mathbf {1} \,{\mathbf { x}}^{\mathsf {T}} \right )\), setting to zero all but the diagonal elements. Matrices other than I can be used in this way to mask entries of a matrix. For example, the transition matrix for a Leslie matrix, with a vector of survival probabilities p on the subdiagonal, is obtained by setting x = p and replacing I with a matrix Z that contains ones on the subdiagonal and zeros elsewhere (see, e.g., Chap.  4).

    Some Markov chain calculations (Chaps.  5 and  11) involve a matrix Ndg, which contains the diagonal elements of N on the diagonal and zeros elsewhere. This can be written
    $$\displaystyle \begin{aligned} {\mathbf{N}}_{\mathrm{dg}} = {\mathbf{I}} \circ {\mathbf{N}}. \end{aligned} $$
    Differentiating and applying the vec operator yields
    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{N}}_{\mathrm{dg}} = \mathcal{D}\,(\mbox{vec} \, {\mathbf{I}}) d \mbox{vec} \, {\mathbf{N}}. \end{aligned} $$
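Because D(x) is linear in x, the differential expression for d vec D(x) is exact, which makes the check especially clean (a NumPy sketch):

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order='F')

s = 3
rng = np.random.default_rng(7)
x = rng.random(s)
dx = 1e-7 * rng.random(s)

# d vec D(x) = D(vec I)(I ⊗ 1) dx  -- exact, since D(x) is linear in x
change = vec(np.diag(x + dx) - np.diag(x))
formula = np.diag(vec(np.eye(s))) @ np.kron(np.eye(s), np.ones((s, 1))) @ dx
assert np.allclose(change, formula)
```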
  4.
    The Kronecker product. Differentiating the Kronecker product is a bit more complicated (Magnus and Neudecker 1985, Theorem 11). We want an expression for the differential of the product in terms of the differentials of the components, something of the form
    $$\displaystyle \begin{aligned} d \mbox{vec} \, \left( {\mathbf{A}} \otimes {\mathbf{B}} \right) = {\mathbf{Z}}_1 d \mbox{vec} \, {\mathbf{A}} + {\mathbf{Z}}_2 d \mbox{vec} \, {\mathbf{B}} {} \end{aligned} $$
    for some matrices Z1 and Z2.
    This requires a result for the vec of the Kronecker product. Let A be of dimension m × p and B be r × s. Then
    $$\displaystyle \begin{aligned} \mbox{vec} \, \left( {\mathbf{A}} \otimes {\mathbf{B}} \right) = \left( {\mathbf{I}}_p \otimes {\mathbf{K}}_{s,m} \otimes {\mathbf{I}}_r \right) \left( \mbox{vec} \, {\mathbf{A}} \otimes \mbox{vec} \, {\mathbf{B}} \right). {} \end{aligned} $$
    Let Y = A ⊗B. Differentiate,
    $$\displaystyle \begin{aligned} d {\mathbf{Y}} = \left( d {\mathbf{A}} \otimes {\mathbf{B}} \right) + \left( {\mathbf{A}} \otimes d {\mathbf{B}} \right) \end{aligned} $$
    and vec
    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{Y}} = \left( {\mathbf{I}}_p \otimes {\mathbf{K}}_{s,m} \otimes {\mathbf{I}}_r \right) \left[ \rule{0in}{3ex} \left(d \mbox{vec} \, {\mathbf{A}} \otimes \mbox{vec} \, {\mathbf{B}} \right) + \left(\mbox{vec} \, {\mathbf{A}} \otimes d \mbox{vec} \, {\mathbf{B}} \right) \right]. \end{aligned} $$
    With some ingenious simplifications (Magnus and Neudecker 1985), this reduces to (2.72) with
    $$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{Z}}_1 &\displaystyle =&\displaystyle \left( {\mathbf{I}}_p \otimes {\mathbf{K}}_{s,m} \otimes {\mathbf{I}}_r \right) \left( {\mathbf{I}}_{mp} \otimes \mbox{vec} \, {\mathbf{B}} \right) \end{array} \end{aligned} $$
    $$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{Z}}_2 &\displaystyle =&\displaystyle \left( {\mathbf{I}}_p \otimes {\mathbf{K}}_{s,m} \otimes {\mathbf{I}}_r \right) \left( \mbox{vec} \, {\mathbf{A}} \otimes {\mathbf{I}}_{rs} \right). \end{array} \end{aligned} $$
    Substituting Z1 and Z2 into (2.72) gives the differential of the Kronecker product in terms of the differentials of its component matrices.
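A numerical check of both the vec identity and the differential of the Kronecker product; note that the identity in Z1 must have dimension mp so that Z1 conforms with d vec A (a NumPy sketch; `vec` and `vec_perm` are my own helper names):

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order='F')

def vec_perm(m, n):
    # commutation matrix K_{m,n}, built from unit matrices E_ij of dimension m x n
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            E = np.zeros((m, n))
            E[i, j] = 1.0
            K += np.kron(E, E.T)
    return K

m, p, r, s = 2, 3, 2, 2
rng = np.random.default_rng(8)
A, B = rng.random((m, p)), rng.random((r, s))

# vec(A ⊗ B) = (I_p ⊗ K_{s,m} ⊗ I_r)(vec A ⊗ vec B)
P = np.kron(np.eye(p), np.kron(vec_perm(s, m), np.eye(r)))
assert np.allclose(vec(np.kron(A, B)), P @ np.kron(vec(A), vec(B)))

# d vec(A ⊗ B) = Z1 d vec A + Z2 d vec B
Z1 = P @ np.kron(np.eye(m * p), vec(B)[:, None])
Z2 = P @ np.kron(vec(A)[:, None], np.eye(r * s))
dA, dB = 1e-7 * rng.random((m, p)), 1e-7 * rng.random((r, s))
exact = vec(np.kron(A + dA, B + dB) - np.kron(A, B))
assert np.allclose(exact, Z1 @ vec(dA) + Z2 @ vec(dB), atol=1e-12)
```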
  5.
    The matrix inverse. The inverse of X satisfies
    $$\displaystyle \begin{aligned} {\mathbf{X}} {\mathbf{X}}^{-1} = {\mathbf{I}} . \end{aligned} $$
    Differentiate both sides
    $$\displaystyle \begin{aligned} \left(d {\mathbf{X}} \right) {\mathbf{X}}^{-1} + {\mathbf{X}} \left( d {\mathbf{X}}^{-1} \right) = \mathbf{0} , \end{aligned} $$
    then vec
    $$\displaystyle \begin{aligned} \left[ \left({\mathbf{X}}^{-1}\right)^{\mathsf{T}} \otimes {\mathbf{I}} \right] d \mbox{vec} \, {\mathbf{X}}+ \left[ {\mathbf{I}} \otimes {\mathbf{X}} \right] d \mbox{vec} \, {\mathbf{X}}^{-1} = \mathbf{0} \end{aligned} $$
    and finally solve for dvec X−1
    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{X}}^{-1} = - \left[ {\mathbf{I}} \otimes {\mathbf{X}} \right]^{-1} \left[ \left({\mathbf{X}}^{-1}\right)^{\mathsf{T}} \otimes {\mathbf{I}} \right] d \mbox{vec} \, {\mathbf{X}} \end{aligned} $$
    The properties (2.5) and (2.8) of the Kronecker product let this be simplified to
    $$\displaystyle \begin{aligned} d \mbox{vec} \, {\mathbf{X}}^{-1} = - \left[ \left({\mathbf{X}}^{-1}\right)^{\mathsf{T}} \otimes {\mathbf{X}}^{-1} \right] d \mbox{vec} \, {\mathbf{X}} {} \end{aligned} $$
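Checking the differential of the inverse against a direct computation (a NumPy sketch; X is shifted to be well conditioned):

```python
import numpy as np

def vec(M):
    return M.reshape(-1, order='F')

rng = np.random.default_rng(9)
X = rng.random((3, 3)) + 3 * np.eye(3)   # well-conditioned, hence safely invertible
dX = 1e-7 * rng.random((3, 3))
Xi = np.linalg.inv(X)

# d vec X^{-1} = -[(X^{-1})^T ⊗ X^{-1}] d vec X
exact = vec(np.linalg.inv(X + dX) - Xi)
linear = -np.kron(Xi.T, Xi) @ vec(dX)
assert np.allclose(exact, linear, atol=1e-10)
```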
  6.
    The square root and ratios. In calculating standard deviations and coefficients of variation it is useful to calculate the elementwise square root and the elementwise ratio of two vectors. If x is a non-negative vector, and the square root \(\sqrt {{\mathbf {x}}}\) is taken elementwise, then
    $$\displaystyle \begin{aligned} d \sqrt{{\mathbf{x}}} = \frac{1}{2} \mathcal{D}\, \left( \sqrt{{\mathbf{x}}} \right) ^{-1} d {\mathbf{x}} . {} \end{aligned} $$
    For the elementwise ratio, let x and y be m × 1 vectors, with y nonzero. Let w be a vector whose ith element is xiyi; i.e., \({\mathbf {w}} = \mathcal {D}\,({\mathbf {y}})^{-1} {\mathbf {x}}\). Then
    $$\displaystyle \begin{aligned} d {\mathbf{w}} = \mathcal{D}\,({\mathbf{y}})^{-1} d {\mathbf{x}} - \left[ {\mathbf{x}}^{\mathsf{T}} \mathcal{D}\,({\mathbf{y}})^{-1} \otimes \mathcal{D}\,({\mathbf{y}})^{-1} \right] \mathcal{D}\,(\mbox{vec} \, {\mathbf{I}}_{m} ) \left({\mathbf{I}}_m \otimes {\mathbf{1}}_{m} \right) d {\mathbf{y}} . {} \end{aligned} $$
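Both elementwise results can be verified the same way (a NumPy sketch; x and y are kept away from zero so the square root and the ratio are well behaved):

```python
import numpy as np

rng = np.random.default_rng(10)
m = 4
x, y = 0.5 + rng.random(m), 0.5 + rng.random(m)
dx, dy = 1e-7 * rng.random(m), 1e-7 * rng.random(m)

# elementwise square root: d sqrt(x) = (1/2) D(sqrt x)^{-1} dx
change = np.sqrt(x + dx) - np.sqrt(x)
linear = 0.5 * np.linalg.inv(np.diag(np.sqrt(x))) @ dx
assert np.allclose(change, linear, atol=1e-10)

# elementwise ratio w = x / y
Dyi = np.diag(1.0 / y)
change = (x + dx) / (y + dy) - x / y
linear = (Dyi @ dx
          - np.kron((x @ Dyi)[None, :], Dyi)
          @ np.diag(np.eye(m).reshape(-1, order='F'))
          @ np.kron(np.eye(m), np.ones((m, 1))) @ dy)
assert np.allclose(change, linear, atol=1e-10)
```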
This list could go on. The books by Magnus and Neudecker (1988) and Abadir and Magnus (2005) contain many other results, and demographically relevant derivations appear throughout this book, especially in Chap.  5.

2.9 LTRE Decomposition of Demographic Differences

The LTRE decomposition in Sect.  1.3.1 extends readily to matrix calculus. Suppose that a demographic outcome ξ, dimension (s × 1), is a function of a vector θ of parameters, dimension (p × 1). Suppose that results are obtained under two “conditions,” with parameters θ(1) and θ(2). Define the parameter difference as Δθ = θ(2) −θ(1) and the effect as Δξ = ξ(2) −ξ(1). Then, to first order,
$$\displaystyle \begin{aligned} \begin{array}{rcl} \Delta \boldsymbol{\xi} &\displaystyle \approx&\displaystyle \sum_{i=1}^p {d \boldsymbol{\xi} \over d \theta_i} \Delta \theta_i \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle =&\displaystyle {d \boldsymbol{\xi} \over d \boldsymbol{\theta}^{\mathsf{T}}} \Delta \boldsymbol{\theta} . \end{array} \end{aligned} $$
Writing
$$\displaystyle \begin{aligned} \Delta \boldsymbol{\theta} = \mathcal{D}\,(\Delta \boldsymbol{\theta}) {\mathbf{1}}_p, \end{aligned} $$
we create a contribution matrix C, of dimension s × p,
$$\displaystyle \begin{aligned} {\mathbf{C}} = {d \boldsymbol{\xi} \over d \boldsymbol{\theta}^{\mathsf{T}}} \; \mathcal{D}\,(\Delta \boldsymbol{\theta}) . \end{aligned} $$
The (i, j) entry of C is the contribution of Δθj to the difference Δξi, for i = 1, …, s and j = 1, …, p. The rows and columns of C give
$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{C}}(i,:) &\displaystyle =&\displaystyle \mbox{contributions of }\Delta \boldsymbol{\theta}\mbox{ to }\Delta \xi_i \end{array} \end{aligned} $$
$$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{C}}(:,j) &\displaystyle =&\displaystyle \mbox{contributions of }\Delta \theta_j\mbox{ to }\Delta \boldsymbol{\xi} \end{array} \end{aligned} $$

When calculating C, the derivative of ξ must be evaluated somewhere. Experience suggests that evaluating it at the midpoint between θ(1) and θ(2) gives good results (Logofet and Lesnaya 1997; Caswell 2001).
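A small numerical sketch shows the decomposition at work; the outcome function ξ and the two parameter vectors below are invented for illustration, and the derivative is approximated by central differences and evaluated at the midpoint, as recommended. The row sums of C recover the first-order approximation of Δξ.

```python
import numpy as np

# Hypothetical outcome: xi(theta), with s = 2 outcomes and p = 3 parameters
def xi(theta):
    a, b, c = theta
    return np.array([a * b + c, a + b * c])

def jacobian(theta, h=1e-6):
    """Central-difference approximation of d xi / d theta' (an s x p matrix)."""
    p = len(theta)
    cols = []
    for j in range(p):
        e = np.zeros(p); e[j] = h
        cols.append((xi(theta + e) - xi(theta - e)) / (2 * h))
    return np.column_stack(cols)

theta1 = np.array([1.0, 2.0, 3.0])     # condition 1
theta2 = np.array([1.1, 1.9, 3.2])     # condition 2
dtheta = theta2 - theta1
mid = (theta1 + theta2) / 2            # evaluate the derivative at the midpoint

# Contribution matrix C = (d xi / d theta') D(dtheta), dimension s x p
C = jacobian(mid) @ np.diag(dtheta)

# Row sums of C approximate Delta xi = xi(theta2) - xi(theta1)
print(C.sum(axis=1))
print(xi(theta2) - xi(theta1))
```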

2.10 A Protocol for Sensitivity Analysis

The calculations may grow to be complex, but the protocol is simple:
  1. write a matrix expression for the outcome,
  2. differentiate the expression,
  3. use the vec operator to convert any matrices to vectors,
  4. simplify, using the algebra of Kronecker products,
  5. calculate derivatives from the differentials, and
  6. extend using the chain rule.

The rest of this book shows what can be done with this simple procedure.
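The protocol can be illustrated on a familiar outcome; the 2 × 2 transient matrix U below is a made-up example, not from the text. The outcome is the fundamental matrix N = (I − U)^{-1}; differentiating gives dN = N(dU)N, the vec operator and Kronecker simplification yield d vec N = (N′ ⊗ N) d vec U, and a finite-difference check confirms the resulting derivative.

```python
import numpy as np

# Step 1: a matrix expression for the outcome, N = (I - U)^{-1}
U = np.array([[0.3, 0.0],
              [0.4, 0.5]])             # made-up transient matrix (spectral radius < 1)
I = np.eye(2)
N = np.linalg.inv(I - U)

# Steps 2-4: differentiate, vectorize, and simplify:
#   dN = N (dU) N   ==>   d vec N = (N' (x) N) d vec U
vec = lambda M: M.ravel(order="F")     # vec stacks columns: column-major flatten
dU = 1e-7 * np.array([[1.0, 2.0],
                      [3.0, 4.0]])     # a small perturbation of U

# Step 5: the coefficient N' (x) N is the derivative d vec N / d (vec U)'
pred = np.kron(N.T, N) @ vec(dU)

# Finite-difference check of the prediction
actual = vec(np.linalg.inv(I - (U + dU)) - N)
print(np.allclose(pred, actual, atol=1e-10))
```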


  1.

    There is also a second identification theorem that provides the second derivatives of matrix functions. See Shyu and Caswell (2014) for applications of this theory to the second derivatives of measures of population growth rate.


  1. Abadir, K. M., and J. R. Magnus. 2005. Matrix algebra. Econometric exercises 1. Cambridge University Press, Cambridge, United Kingdom.
  2. Caswell, H. 2001. Matrix population models: construction, analysis, and interpretation. 2nd edition. Sinauer Associates, Sunderland, MA.
  3. Caswell, H. 2012. Matrix models and sensitivity analysis of populations classified by age and stage: a vec-permutation matrix approach. Theoretical Ecology 5:403–417.
  4. Caswell, H. 2014. A matrix approach to the statistics of longevity in heterogeneous frailty models. Demographic Research 31:553–592.
  5. Caswell, H., C. de Vries, N. Hartemink, G. Roth, and S. F. van Daalen. 2018. Age×stage-classified demographic analysis: a comprehensive approach. Ecological Monographs 88:560–584.
  6. Caswell, H., and R. Salguero-Gómez. 2013. Age, stage and senescence in plants. Journal of Ecology 101:585–595.
  7. Hardy, G. H. 1952. A course of pure mathematics. 10th edition. Cambridge University Press, Cambridge, United Kingdom.
  8. Henderson, H. V., and S. R. Searle. 1981. The vec-permutation matrix, the vec operator and Kronecker products: a review. Linear and Multilinear Algebra 9:271–288.
  9. Logofet, D. O., and E. V. Lesnaya. 1997. Why are the middle points the most sensitive in the sensitivity experiments? Ecological Modelling 104:303–306.
  10. Magnus, J. R., and H. Neudecker. 1979. The commutation matrix: some properties and applications. Annals of Statistics 7:381–394.
  11. Magnus, J. R., and H. Neudecker. 1985. Matrix differential calculus with applications to simple, Hadamard, and Kronecker products. Journal of Mathematical Psychology 29:474–492.
  12. Magnus, J. R., and H. Neudecker. 1988. Matrix differential calculus with applications in statistics and econometrics. John Wiley and Sons, New York, New York.
  13. Nel, D. G. 1980. On matrix differentiation in statistics. South African Statistical Journal 14:137–193.
  14. Roth, G., and H. Caswell. 2016. Hyperstate matrix models: extending demographic state spaces to higher dimensions. Methods in Ecology and Evolution 7:1438–1450.
  15. Roth, W. E. 1934. On direct product matrices. Bulletin of the American Mathematical Society 40:461–468.
  16. Samuelson, P. A. 1947. Foundations of economic analysis. Harvard University Press, Cambridge, Massachusetts, USA.
  17. Shyu, E., and H. Caswell. 2014. Calculating second derivatives of population growth rates for ecology and evolution. Methods in Ecology and Evolution 5:473–482.

Copyright information

© The Author(s) 2019

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  • Hal Caswell, Biodiversity & Ecosystem Dynamics, University of Amsterdam, Amsterdam, The Netherlands
