Appendices
Univariate Gaussian Distribution as an Exponential Family
Canonical Decomposition and \({\varvec{F}}\)
$$\begin{aligned} f(x;\mu ,\sigma ^{2})&= \frac{1}{(2 \pi \sigma ^{2})^{1/2}}\exp \left\{ -\frac{(x - \mu )^{2}}{2\sigma ^{2}} \right\} \\&= \exp \left\{ -\frac{1}{2\sigma ^{2}} (x^{2} - 2 x \mu + \mu ^{2}) - \frac{1}{2} \log \left( 2 \pi \sigma ^{2}\right) \right\} \\&= \exp \left\{ \langle \frac{1}{2\sigma ^{2}}, -x^{2} \rangle + \langle \frac{\mu }{\sigma ^{2}}, x \rangle - \frac{\mu ^{2}}{2\sigma ^{2}} - \frac{1}{2} \log \left( 2 \pi \sigma ^{2}\right) \right\} \\ \end{aligned}$$
In the sequel, the vector of source parameters is denoted \(\lambda =(\mu , \sigma ^2)\). One may recognize the canonical form of an exponential family
$$f(x;\theta ) = \exp \left\{ \langle \theta ,s(x) \rangle + k(x) - F(\theta )\right\} $$
by setting \(\theta = (\theta _1,\theta _2)\) with
$$\begin{aligned} \theta _{1}&= \frac{\mu }{\sigma ^{2}} \Longleftrightarrow \mu = \frac{\theta _{1}}{2\theta _{2}}\end{aligned}$$
(52)
$$\begin{aligned} \theta _{2}&= \frac{1}{2\sigma ^{2}} \Longleftrightarrow \sigma ^{2} = \frac{1}{2\theta _{2}} \end{aligned}$$
(53)
$$\begin{aligned} s(x)&=(x,-x^{2}) \end{aligned}$$
(54)
$$\begin{aligned} k(x)&= 0 \end{aligned}$$
(55)
$$\begin{aligned} f(x; \theta _{1}, \theta _{2})&= \exp \left\{ \langle \theta _{2}, -x^{2} \rangle + \langle \theta _{1}, x \rangle - \frac{1}{2} \frac{(\theta _{1}/2\theta _{2})^{2}}{1/2\theta _{2}} - \frac{1}{2} \log (2\pi /2\theta _{2})\right\} \\&= \exp \left\{ \langle \theta _{2}, -x^{2} \rangle + \langle \theta _{1}, x \rangle - \frac{\theta _{1}^{2}}{4\theta _{2}} - \frac{1}{2} \log (\pi ) + \frac{1}{2} \log \theta _{2}\right\} \end{aligned}$$
with the log normalizer F as
$$\begin{aligned} F(\theta _{1}, \theta _{2}) = \frac{\theta _{1}^{2}}{4\theta _{2}} + \frac{1}{2} \log (\pi ) - \frac{1}{2} \log \theta _{2} \end{aligned}$$
(56)
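As a numeric sanity check of the decomposition above, the following Python sketch (helper names and parameter values are ours, purely illustrative) verifies that the canonical form \(\exp \{\langle \theta , s(x)\rangle - F(\theta )\}\) with F of Eq. (56) reproduces the Gaussian density:

```python
import math

def gauss_pdf(x, mu, var):
    # standard N(mu, sigma^2) density
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def F(t1, t2):
    # log-normalizer of Eq. (56)
    return t1 ** 2 / (4 * t2) + 0.5 * math.log(math.pi) - 0.5 * math.log(t2)

def canonical_pdf(x, mu, var):
    t1, t2 = mu / var, 1.0 / (2 * var)          # Eqs. (52)-(53)
    # sufficient statistic s(x) = (x, -x^2), carrier k(x) = 0
    return math.exp(t1 * x + t2 * (-x ** 2) - F(t1, t2))

mu, var = 1.3, 0.7
assert all(abs(gauss_pdf(x, mu, var) - canonical_pdf(x, mu, var)) < 1e-12
           for x in [-2.0, 0.0, 0.5, 3.1])
```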
1.1 Gradient of the Log-Normalizer
The gradient of the log-normalizer is given by:
$$\begin{aligned} \frac{\partial F}{\partial \theta _{1}}(\theta _{1}, \theta _{2})&= \frac{\theta _{1}}{2\theta _{2}} \end{aligned}$$
(57)
$$\begin{aligned} \frac{\partial F}{\partial \theta _{2}}(\theta _{1},\theta _{2})&= -\frac{\theta _{1}^{2}}{4\theta _{2}^{2}} - \frac{1}{2\theta _{2}} \end{aligned}$$
(58)
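Equations (57)-(58) can be checked by central finite differences; the sketch below (arbitrary parameter values) compares both partial derivatives against the closed forms:

```python
import math

def F(t1, t2):
    # log-normalizer of Eq. (56)
    return t1 ** 2 / (4 * t2) + 0.5 * math.log(math.pi) - 0.5 * math.log(t2)

t1, t2, h = 0.8, 1.7, 1e-6
dF1 = (F(t1 + h, t2) - F(t1 - h, t2)) / (2 * h)   # numeric dF/d theta1
dF2 = (F(t1, t2 + h) - F(t1, t2 - h)) / (2 * h)   # numeric dF/d theta2
assert abs(dF1 - t1 / (2 * t2)) < 1e-8                          # Eq. (57)
assert abs(dF2 - (-t1 ** 2 / (4 * t2 ** 2) - 1 / (2 * t2))) < 1e-8  # Eq. (58)
```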
In order to get the dual coordinate system \(\eta =(\eta _{1}, \eta _{2})\), the following set of equations has to be inverted:
$$\begin{aligned} \eta _{1}&= \frac{\theta _{1}}{2\theta _{2}} \end{aligned}$$
(59)
$$\begin{aligned} \eta _{2}&= -\frac{\theta _{1}^{2}}{4\theta _{2}^{2}} - \frac{1}{2\theta _{2}} \end{aligned}$$
(60)
By plugging the first equation into the second one, it follows:
$$\begin{aligned} \eta _{2} = - \eta _{1}^{2} - \frac{1}{2\theta _{2}} \Longleftrightarrow&\theta _{2} = -\frac{1}{2(\eta _{1}^{2} + \eta _{2})}&= \frac{\partial F^*}{\partial \eta _{2}}(\eta _{1},\eta _{2}) \end{aligned}$$
(61)
$$\begin{aligned}&\theta _{1} = 2 \theta _{2} \eta _{1} = - \frac{\eta _{1}}{(\eta _{1}^{2} + \eta _{2})}&= \frac{\partial F^*}{\partial \eta _{1}}(\eta _{1},\eta _{2}) \end{aligned}$$
(62)
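The two coordinate maps (59)-(60) and (61)-(62) should be mutually inverse; a minimal sketch (illustrative values) checks this round trip:

```python
def theta_to_eta(t1, t2):
    # Eqs. (59)-(60)
    return t1 / (2 * t2), -t1 ** 2 / (4 * t2 ** 2) - 1 / (2 * t2)

def eta_to_theta(e1, e2):
    # Eqs. (61)-(62); s = eta1^2 + eta2 equals -sigma^2, hence s < 0
    s = e1 ** 2 + e2
    return -e1 / s, -1 / (2 * s)

t1, t2 = 0.8, 1.7
e1, e2 = theta_to_eta(t1, t2)
r1, r2 = eta_to_theta(e1, e2)
assert abs(r1 - t1) < 1e-12 and abs(r2 - t2) < 1e-12
```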
The formulas are even simpler in terms of the source parameters, since we know that
$$\begin{aligned} \eta _{1} = \mathbb {E}[X] = \mu\Longleftrightarrow & {} \mu = \eta _{1} \end{aligned}$$
(63)
$$\begin{aligned} \eta _{2} = \mathbb {E}[-X^2] = -\left\{ \mu ^2 + \sigma ^2\right\}\Longleftrightarrow & {} \sigma ^2 = - \left\{ \eta _{1}^2 + \eta _{2}\right\} \end{aligned}$$
(64)
In order to compute \(F^{*}\), we simply have to reuse our previous results in
$$F^{*}(H) = \langle (\nabla F)^{-1} (H), H \rangle - F ( (\nabla F)^{-1} (H))$$
and obtain the following expression:
$$\begin{aligned} F^{*}(\eta _{1}, \eta _{2}) = -\frac{1}{2} \log (e\pi ) - \frac{1}{2} \log \left( -2(\eta _{1}^{2} + \eta _{2})\right) = -\frac{1}{2} \log \left( 2\pi e \sigma ^{2}\right) \end{aligned}$$
The Hessians \(H(F)\) and \(H(F^*)\) of F and \(F^*\), respectively, are
$$\begin{aligned} H(F)(\theta _1, \theta _2) = \begin{pmatrix} \frac{1}{2 \theta _2} &{} -\frac{\theta _1}{2 \theta _2^2}\\ -\frac{\theta _1}{2 \theta _2^2} &{} \frac{\theta _1^2 + \theta _2}{2 \theta _2^3} \end{pmatrix} \end{aligned}$$
(65)
$$\begin{aligned} H(F^*)(\eta _1, \eta _2) = \begin{pmatrix} \frac{\eta _1^2 - \eta _2}{(\eta _1^2 + \eta _2)^2} &{} \frac{\eta _1}{(\eta _1^2 + \eta _2)^2}\\ \frac{\eta _1}{(\eta _1^2 + \eta _2)^2} &{} \frac{1}{2(\eta _1^2 + \eta _2)^2} \end{pmatrix} \end{aligned}$$
(66)
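Since F and \(F^*\) are Legendre duals, the Hessians (65) and (66) must be inverse matrices at corresponding points. The following sketch (arbitrary point, 2x2 algebra done by hand) checks this:

```python
def matmul2(A, B):
    # plain 2x2 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

t1, t2 = 0.8, 1.7
HF = [[1 / (2 * t2), -t1 / (2 * t2 ** 2)],
      [-t1 / (2 * t2 ** 2), (t1 ** 2 + t2) / (2 * t2 ** 3)]]   # Eq. (65)

# corresponding expectation parameters (Eqs. 59-60)
e1, e2 = t1 / (2 * t2), -t1 ** 2 / (4 * t2 ** 2) - 1 / (2 * t2)
s = e1 ** 2 + e2
HG = [[(e1 ** 2 - e2) / s ** 2, e1 / s ** 2],
      [e1 / s ** 2, 1 / (2 * s ** 2)]]                          # Eq. (66)

P = matmul2(HF, HG)
assert all(abs(P[i][j] - (1.0 if i == j else 0.0)) < 1e-9
           for i in range(2) for j in range(2))
```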
Since the univariate normal distribution is an exponential family, the Kullback–Leibler divergence is a Bregman divergence for \(F^*\) on expectation parameters:
$$\begin{aligned} KL(\mathcal {N}(\mu _{p},\sigma ^2_{p}) || \mathcal {N}(\mu _{q},\sigma ^2_{q}))&= B_{F^*}(\eta _p : \eta _q) \\&= F^*(\eta _p) - F^*(\eta _q) - \langle \eta _p - \eta _q, \nabla F^* (\eta _q) \rangle \end{aligned}$$
After calculations, it follows:
$$\begin{aligned} B_{F^*}(\eta _p : \eta _q) = \frac{1}{2} \left( \log \left( \frac{\eta _{1_q}^{2} + \eta _{2_q}}{\eta _{1_p}^{2} + \eta _{2_p}}\right) + \frac{2(\eta _{1_p} - \eta _{1_q})\eta _{1_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} + \frac{\eta _{2_p} - \eta _{2_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} \right) \end{aligned}$$
(67)
A simple rewrite of it with the source parameters leads to the known closed form:
$$\begin{aligned} \frac{1}{2} \left( \log \left( \frac{\eta _{1_q}^{2} + \eta _{2_q}}{\eta _{1_p}^{2} + \eta _{2_p}}\right) + \frac{2(\eta _{1_p} - \eta _{1_q})\eta _{1_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} + \frac{\eta _{2_p} - \eta _{2_q}}{(\eta _{1_q}^{2} + \eta _{2_q})} \right)&= \nonumber \\ \frac{1}{2} \left( \log \left( \frac{\eta _{1_q}^{2} + \eta _{2_q}}{\eta _{1_p}^{2} + \eta _{2_p}}\right) + \frac{(\eta _{1_p}^2 + \eta _{2_p}) - (\eta _{1_p}-\eta _{1_q})^2 - (\eta _{1_q}^2 + \eta _{2_q})}{(\eta _{1_q}^{2} + \eta _{2_q})} \right)&= \nonumber \\ \frac{1}{2} \left( \log \left( \frac{\sigma _q^{2}}{\sigma _p^{2}}\right) + \frac{\sigma _p^{2}}{\sigma _q^{2}} + \frac{(\mu _p-\mu _q)^2}{\sigma _q^{2}} -1 \right) \end{aligned}$$
(68)
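A quick numeric check of this rewrite: the Bregman form (67) on expectation parameters should agree with the classical closed form (68). In the sketch below (parameter values arbitrary, function names ours):

```python
import math

def kl_closed(mp, vp, mq, vq):
    # Eq. (68): classical closed form with source parameters
    return 0.5 * (math.log(vq / vp) + vp / vq + (mp - mq) ** 2 / vq - 1)

def kl_bregman(mp, vp, mq, vq):
    # Eq. (67): Bregman divergence on expectation parameters (63)-(64)
    e1p, e2p = mp, -(mp ** 2 + vp)
    e1q, e2q = mq, -(mq ** 2 + vq)
    sp = e1p ** 2 + e2p                   # = -vp
    sq = e1q ** 2 + e2q                   # = -vq
    return 0.5 * (math.log(sq / sp) + 2 * (e1p - e1q) * e1q / sq
                  + (e2p - e2q) / sq)

assert abs(kl_closed(0.5, 2.0, -1.0, 0.8)
           - kl_bregman(0.5, 2.0, -1.0, 0.8)) < 1e-12
```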
The Fisher information matrix \(I(\lambda )\) is obtained by computing the expectation of the product of the Fisher score and its transpose:
$$\begin{aligned} I(\lambda )&\mathop {=}\limits ^{def} \mathbb {E}\left[ \nabla _\lambda \log f(x;\lambda ) . \nabla _\lambda \log f(x;\lambda )^T\right] \nonumber \\&= \mathbb {E}\left[ \begin{pmatrix} \frac{x-\mu }{\sigma ^2}\\ \frac{(x-\mu )^2 - \sigma ^2}{2\sigma ^4}\end{pmatrix}. \begin{pmatrix} \frac{x-\mu }{\sigma ^2} &{} \frac{(x-\mu )^2 - \sigma ^2}{2\sigma ^4}\end{pmatrix}\right] \nonumber \\&=\begin{pmatrix} \frac{1}{\sigma ^2} &{} 0 \\ 0 &{} \frac{1}{2\sigma ^4}\end{pmatrix}. \end{aligned}$$
(69)
By change in coordinates or direct computation, the Fisher information matrix is also:
$$\begin{aligned} I(\theta ) = H(F)(\theta ) = \begin{pmatrix}\frac{1}{2\theta _2} &{} -\frac{\theta _1}{2\theta _2^2}\\ -\frac{\theta _1}{2\theta _2^2} &{} \frac{\theta _1^2 + \theta _2}{2\theta _2^3}\end{pmatrix} \text{ and } I(\eta ) = \frac{1}{(\eta _1^2 + \eta _2)^2} \begin{pmatrix} (\eta _1^2 - \eta _2) &{} \eta _1\\ \eta _1 &{} \frac{1}{2}\end{pmatrix} \end{aligned}$$
(70)
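The "change in coordinates" claim can be sketched numerically: with J the Jacobian of \(\theta (\lambda )\) from Eqs. (52)-(53), one should recover \(I(\lambda ) = {}^tJ\, I(\theta )\, J\). The values below are illustrative:

```python
def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

mu, v = 1.3, 0.7
t1, t2 = mu / v, 1 / (2 * v)
I_theta = [[1 / (2 * t2), -t1 / (2 * t2 ** 2)],
           [-t1 / (2 * t2 ** 2), (t1 ** 2 + t2) / (2 * t2 ** 3)]]  # Eq. (70)
J = [[1 / v, -mu / v ** 2],        # d theta1 / d(mu, sigma^2)
     [0.0, -1 / (2 * v ** 2)]]     # d theta2 / d(mu, sigma^2)
Jt = [[J[0][0], J[1][0]], [J[0][1], J[1][1]]]
I_lambda = matmul2(Jt, matmul2(I_theta, J))
assert abs(I_lambda[0][0] - 1 / v) < 1e-9             # 1/sigma^2
assert abs(I_lambda[1][1] - 1 / (2 * v ** 2)) < 1e-9  # 1/(2 sigma^4)
assert abs(I_lambda[0][1]) < 1e-9 and abs(I_lambda[1][0]) < 1e-9
```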
1.2 Multivariate Gaussian Distribution as an Exponential Family
Canonical Decomposition and \({\varvec{F}}\)
$$\begin{aligned} f(x;\mu ,\varSigma )&= \frac{1}{(2 \pi )^{d / 2} |\varSigma |^{1/2}}\exp \left\{ -\frac{ {}^t (x - \mu ) \varSigma ^{-1} (x - \mu )}{2} \right\} \\&= \exp \left\{ -\frac{ {}^tx\varSigma ^{-1}x - {}^t\mu \varSigma ^{-1}x - {}^tx\varSigma ^{-1}\mu + {}^t\mu \varSigma ^{-1}\mu }{2} - \log \left( (2 \pi )^{d / 2} |\varSigma |^{1/2}\right) \right\} \\&= \exp \left\{ -\frac{tr({}^tx\varSigma ^{-1}x) - \langle {}^t\varSigma ^{-1} \mu , x \rangle -\langle x, \varSigma ^{-1}\mu \rangle + \langle {}^t\varSigma ^{-1}\mu , \varSigma \varSigma ^{-1} \mu \rangle }{2} - \log \left( \pi ^{d / 2} |2\varSigma |^{1/2}\right) \right\} \end{aligned}$$
Due to the cyclic property of the trace and to the symmetry of \(\varSigma ^{-1}\), it follows:
$$\begin{aligned} f(x;\mu ,\varSigma )&= \exp \left\{ tr\left( ^t\left( \frac{1}{2}\varSigma ^{-1}\right) (-x{}^tx)\right) + \langle \varSigma ^{-1} \mu , x \rangle - \frac{1}{2} \langle \varSigma ^{-1}\mu , \varSigma \varSigma ^{-1} \mu \rangle - \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2\varSigma |\right\} \\&= \exp \left\{ \langle \frac{1}{2}\varSigma ^{-1}, -x{}^tx \rangle _{F} + \langle \varSigma ^{-1} \mu , x \rangle - \frac{1}{4} {}^t(\varSigma ^{-1}\mu ) 2\varSigma (\varSigma ^{-1}\mu ) - \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2\varSigma | \right\} \\ \end{aligned}$$
where \(\langle \cdot , \cdot \rangle _{F}\) is the Frobenius scalar product. One may recognize the canonical form of an exponential family
$$f(x;\varTheta ) = \exp \left\{ \langle \varTheta ,t(x) \rangle + k(x) - F(\varTheta )\right\} $$
by setting:
$$\varTheta = (\theta _{1}, \theta _2)$$
$$\begin{aligned} \theta _1&= \varSigma ^{-1}\mu \Longleftrightarrow \mu = \frac{1}{2}\theta _2^{-1} \theta _1\end{aligned}$$
(71)
$$\begin{aligned} \theta _2&= \frac{1}{2}\varSigma ^{-1} \Longleftrightarrow \varSigma = \frac{1}{2}\theta _2^{-1} \end{aligned}$$
(72)
$$\begin{aligned} t(x)&=(x,-x{}^tx)\end{aligned}$$
(73)
$$\begin{aligned} k(x)&= 0 \end{aligned}$$
(74)
$$\begin{aligned} f(x; \theta _1, \theta _2) = \exp \left\{ \langle \theta _2, -x{}^tx \rangle _{F} + \langle \theta _1, x \rangle - \frac{1}{4} {}^t\theta _1 \theta _2^{-1} \theta _1 - \frac{d}{2} \log (\pi ) + \frac{1}{2} \log |\theta _2| \right\} \nonumber \\ \end{aligned}$$
(75)
with the log normalizer F:
$$\begin{aligned} F(\theta _1, \theta _2) = \frac{1}{4} {}^t\theta _1 \theta _2^{-1} \theta _1 + \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |\theta _2| \end{aligned}$$
(76)
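As in the univariate case, the canonical form with F of Eq. (76) can be checked against the multivariate normal density. The sketch below does this for d = 2 with the 2x2 linear algebra written out by hand (all names and values are illustrative):

```python
import math

def inv2(S):
    # inverse and determinant of a 2x2 matrix
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]], det

def quad(A, x, y):  # x^T A y
    return sum(x[i] * A[i][j] * y[j] for i in range(2) for j in range(2))

def mvn_pdf(x, mu, S):
    Si, det = inv2(S)
    d = [x[0] - mu[0], x[1] - mu[1]]
    return math.exp(-0.5 * quad(Si, d, d)) / (2 * math.pi * math.sqrt(det))

def canonical_pdf(x, mu, S):
    Si, _ = inv2(S)
    th1 = [sum(Si[i][j] * mu[j] for j in range(2)) for i in range(2)]  # Eq. (71)
    th2 = [[0.5 * Si[i][j] for j in range(2)] for i in range(2)]       # Eq. (72)
    th2i, det_th2 = inv2(th2)
    # F of Eq. (76) with d = 2
    F = 0.25 * quad(th2i, th1, th1) + math.log(math.pi) \
        - 0.5 * math.log(det_th2)
    # <theta2, -x x^T>_F = -x^T theta2 x, plus <theta1, x>
    inner = -quad(th2, x, x) + th1[0] * x[0] + th1[1] * x[1]
    return math.exp(inner - F)

x, mu = [0.2, 0.4], [0.5, -1.0]
S = [[1.2, 0.3], [0.3, 0.9]]
assert abs(mvn_pdf(x, mu, S) - canonical_pdf(x, mu, S)) < 1e-12
```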
1.3 Gradient of the Log-Normalizer
By applying the following formulas from the matrix cookbook (Petersen and Pedersen 2012)
- identity 57:
  $$ \frac{\partial \log |X|}{\partial X} = ({}^tX)^{-1} = {}^t(X^{-1}) $$
- identity 61:
  $$\frac{\partial {}^ta X^{-1} b}{\partial X} = - {}^tX^{-1} a \, {}^tb \, {}^tX^{-1} $$
- identity 81:
  $$\frac{\partial {}^tx B x}{\partial x} = (B + {}^tB)x $$
the gradient of the log-normalizer is given by:
$$\begin{aligned} \frac{\partial F}{\partial \theta _1}(\theta _1,\theta _2)&= \frac{1}{4} (\theta _2^{-1}+ {}^{t}\theta _2^{-1}) \theta _1 = \frac{1}{2} \theta _2^{-1} \theta _1 \end{aligned}$$
(77)
$$\begin{aligned} \frac{\partial F}{\partial \theta _2}(\theta _1,\theta _2)&= - \frac{1}{4} {}^t\theta _2^{-1} \theta _1 {}^t\theta _1 \theta _2^{-1} - \frac{1}{2} {}^t\theta _2^{-1} = - \left( \frac{1}{2} \theta _2^{-1} \theta _1\right) ^t\left( \frac{1}{2} \theta _2^{-1} \theta _1\right) - \frac{1}{2} \theta _2^{-1} \end{aligned}$$
(78)
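Equation (78) can be checked entrywise by central finite differences on F of Eq. (76); the d = 2 sketch below (arbitrary symmetric \(\theta _2\)) perturbs one matrix entry at a time:

```python
import math

def inv2(S):
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]], det

def F(t1, t2):
    # Eq. (76) with d = 2
    t2i, det = inv2(t2)
    q = sum(t1[i] * t2i[i][j] * t1[j] for i in range(2) for j in range(2))
    return 0.25 * q + math.log(math.pi) - 0.5 * math.log(det)

t1, t2 = [0.7, -0.4], [[1.5, 0.2], [0.2, 1.1]]
t2i, _ = inv2(t2)
half = [sum(0.5 * t2i[i][j] * t1[j] for j in range(2)) for i in range(2)]
# Eq. (78): -(theta2^{-1} theta1 / 2)(theta2^{-1} theta1 / 2)^T - theta2^{-1}/2
grad = [[-half[i] * half[j] - 0.5 * t2i[i][j] for j in range(2)]
        for i in range(2)]

h = 1e-6
for i in range(2):
    for j in range(2):
        tp = [row[:] for row in t2]; tp[i][j] += h
        tm = [row[:] for row in t2]; tm[i][j] -= h
        fd = (F(t1, tp) - F(t1, tm)) / (2 * h)
        assert abs(fd - grad[i][j]) < 1e-5
```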
In order to emphasize the coherence of these formulas, recall that the gradient of the log-normalizer corresponds to the expectation of the sufficient statistics:
$$\begin{aligned} \mathbb {E}[x]&= \mu\equiv & {} ~\frac{1}{2}\theta _2^{-1} \theta _1\end{aligned}$$
(79)
$$\begin{aligned} \mathbb {E}[-x{}^tx]&= -\mathbb {E}[x{}^tx] = -\mu {}^{t}\mu - \varSigma\equiv & {} - \left( \frac{1}{2}\theta _2^{-1} \theta _1\right) ^t\left( \frac{1}{2}\theta _2^{-1} \theta _1\right) - \frac{1}{2}\theta _2^{-1} \end{aligned}$$
(80)
The last equation follows from the expansion of \(\mathbb {E}[(x - \mu ) {}^t(x - \mu )] = \varSigma \).
1.4 Convex Conjugate G of F and Its Gradient
In order to get the dual coordinate system \(H=(\eta _1, \eta _2)\), the following set of equations has to be inverted:
$$\begin{aligned} \eta _1&=\frac{1}{2} \theta _2^{-1} \theta _1 \end{aligned}$$
(81)
$$\begin{aligned} \eta _2&= -\left( \frac{1}{2}\theta _2^{-1} \theta _1\right) ^t\left( \frac{1}{2}\theta _2^{-1} \theta _1\right) - \frac{1}{2}\theta _2^{-1} \end{aligned}$$
(82)
By plugging the first equation into the second one, it follows
$$\begin{aligned} \eta _2 = - \eta _1 {}^t\eta _1 - \frac{1}{2}\theta _2^{-1} \Longleftrightarrow \theta _2= \frac{1}{2}(-\eta _1 {}^t\eta _1 -\eta _2)^{-1} = \frac{\partial G}{\partial \eta _2}(\eta _1,\eta _2) \end{aligned}$$
(83)
and
$$\begin{aligned} \theta _1 = 2 \theta _2\eta _1= (- \eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 = \frac{\partial G}{\partial \eta _1}(\eta _1,\eta _2) \end{aligned}$$
(84)
The formulas are even simpler in terms of the source parameters, since we know from Eqs. 79 and 80 that
$$\begin{aligned} \eta _1 = \mu\Longleftrightarrow & {} \mu = \eta _1 \end{aligned}$$
(85)
$$\begin{aligned} \eta _2= -\mu {}^{t}\mu - \varSigma\Longleftrightarrow & {} \varSigma = - \eta _1 {}^t\eta _1 - \eta _2 \end{aligned}$$
(86)
In order to compute \(G := F^{*}\), we simply have to reuse our previous results in
$$G(H) = \langle (\nabla F)^{-1} (H), H \rangle - F ( (\nabla F)^{-1} (H))$$
and obtain the following expression
$$\begin{aligned} G(\eta _1, \eta _2)&= \langle (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1, \eta _1 \rangle + \langle \frac{1}{2} (- \eta _1 {}^t\eta _1 - \eta _2)^{-1}, \eta _2 \rangle _{F}\\&- \frac{1}{4} {}^t((-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1) 2(-\eta _1 {}^t\eta _1 - \eta _2) (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 \\&- \frac{d}{2} \log (\pi ) + \frac{1}{2} \log |\frac{1}{2} (-\eta _1 {}^t\eta _1 - \eta _2)^{-1}|\\&= {}^t \eta _1 (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 +\frac{1}{2} tr({}^{t}(-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _2)\\&- \frac{1}{2} {}^t\eta _1{}^t(-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1\\&- \frac{d}{2} \log (\pi ) + \frac{1}{2} \log |(2(-\eta _1 {}^t\eta _1 - \eta _2))^{-1}|\\&= \frac{1}{2} {}^t \eta _1 (- \eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 +\frac{1}{2} tr((-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _2)\\&- \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= \frac{1}{2} \left( tr((- \eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _1 {}^t \eta _1 ) +tr((-\eta _1 {}^t\eta _1 - \eta _2)^{-1} \eta _2)\right) \\&- \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= - \frac{1}{2} tr((- \eta _1 {}^t\eta _1 - \eta _2)^{-1} (- \eta _1 {}^t \eta _1 - \eta _2)) - \frac{d}{2} \log (\pi )\nonumber \\&- \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= - \frac{1}{2} tr(I_{d}) - \frac{d}{2} \log (\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\&= - \frac{d}{2} \log (e\pi ) - \frac{1}{2} \log |2(-\eta _1 {}^t\eta _1 - \eta _2)|\\ \end{aligned}$$
Let us rewrite this expression with source parameters:
$$\begin{aligned} G(\mu , \varSigma ) = - \frac{d}{2} \log (e\pi ) - \frac{1}{2} \log |2\varSigma |\ \end{aligned}$$
(87)
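A one-dimensional numeric check of Eq. (87): G computed as the Legendre transform \(\langle \theta , \eta \rangle - F(\theta )\) should match the closed form (parameter values are arbitrary):

```python
import math

mu, v = 0.4, 1.9
t1, t2 = mu / v, 1 / (2 * v)                 # natural parameters
e1, e2 = mu, -(mu ** 2 + v)                  # expectation parameters
F = t1 ** 2 / (4 * t2) + 0.5 * math.log(math.pi) - 0.5 * math.log(t2)
G_legendre = t1 * e1 + t2 * e2 - F           # <theta, eta> - F(theta)
# Eq. (87) with d = 1: |2 Sigma| = 2 sigma^2
G_closed = -0.5 * math.log(math.e * math.pi) - 0.5 * math.log(2 * v)
assert abs(G_legendre - G_closed) < 1e-12
```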
1.5 Kullback–Leibler Divergence
First recall that the Kullback–Leibler divergence between two p.d.f.s p and q is
$$ KL(p || q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$
For two multivariate normal distributions, it is known in closed form
$$\begin{aligned} KL(\mathcal {N}(\mu _{p},\varSigma _{p}) || \mathcal {N}(\mu _{q},\varSigma _{q})) = \frac{1}{2}\left( \log \left( \frac{|\varSigma _{q}|}{|\varSigma _{p}|}\right) + tr(\varSigma _{q}^{-1}\varSigma _{p}) + {}^{t}(\mu _{q}-\mu _{p})\varSigma _{q}^{-1}(\mu _{q}-\mu _{p}) - d\right) \end{aligned}$$
(88)
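One consistency property of Eq. (88) that is easy to verify numerically: for diagonal covariances the closed form decomposes into a sum of univariate KL divergences. A sketch (illustrative values):

```python
import math

def kl_uni(mp, vp, mq, vq):
    # univariate closed form, cf. Eq. (68)
    return 0.5 * (math.log(vq / vp) + vp / vq + (mp - mq) ** 2 / vq - 1)

def kl_diag(mp, vp, mq, vq):
    # Eq. (88) specialized to Sigma = diag(v)
    d = len(mp)
    log_det = sum(math.log(vq[i] / vp[i]) for i in range(d))
    trace = sum(vp[i] / vq[i] for i in range(d))
    maha = sum((mq[i] - mp[i]) ** 2 / vq[i] for i in range(d))
    return 0.5 * (log_det + trace + maha - d)

mp, vp = [0.5, -1.0], [2.0, 0.8]
mq, vq = [1.5, 0.0], [1.0, 1.3]
assert abs(kl_diag(mp, vp, mq, vq)
           - sum(kl_uni(mp[i], vp[i], mq[i], vq[i]) for i in range(2))) < 1e-12
```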
Since the multivariate normal distribution is an exponential family, the same result must be obtained using the Bregman divergence for G on the expectation parameters \(H_{p}\) and \(H_{q}\):
$$KL(\mathcal {N}(\mu _{p},\varSigma _{p}) || \mathcal {N}(\mu _{q},\varSigma _{q})) = B_G(H_p : H_q) = G(H_{p}) - G(H_{q}) - \langle H_{p} - H_{q}, \nabla G (H_{q}) \rangle $$
$$\begin{aligned} G(H_{p}) - G(H_{q})&= - \frac{d}{2} \log (e\pi ) - \frac{1}{2} \log |-2(\eta _{1_{p}} {}^t\eta _{1_{p}} + \eta _{2_{p}})| \\&+ \frac{d}{2} \log (e\pi ) + \frac{1}{2} \log |-2(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})| \\&= \frac{1}{2} \log \frac{|-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})|}{|-(\eta _{1_{p}} {}^t\eta _{1_{p}} + \eta _{2_{p}})|}\\ - \langle H_{p} - H_{q}, \nabla G (H_{q}) \rangle&= - \langle \eta _{1_{p}} - \eta _{1_{q}}, - (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}} \rangle \\&- tr\left( ^{t} (\eta _{2_{p}} - \eta _{2_{q}}) \left( -\frac{1}{2} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}\right) \right) \\&= {}^t \eta _{1_{p}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}} - {}^t \eta _{1_{q}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}} \\&- \frac{1}{2} tr({}^{t} \eta _{2_{p}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1})) + \frac{1}{2} tr({}^{t}\eta _{2_{q}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}))\\ \end{aligned}$$
In order to go further, we can express these two formulas using \(\mu \) and \(\varSigma ^{-1} = (-\eta _1 {}^t\eta _1 - \eta _2)^{-1} = -(\eta _1 {}^t\eta _1 + \eta _2)^{-1} \) (cf. Eq. 86):
$$\begin{aligned} \frac{1}{2} \log \frac{|-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})|}{|-(\eta _{1_{p}} {}^t\eta _{1_{p}} + \eta _{2_{p}})|}&= \frac{1}{2} \log \frac{|\varSigma _q|}{|\varSigma _p|} \end{aligned}$$
$$\begin{aligned} {}^t \eta _{1_{p}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}}&= -{}^t \mu _{p} \varSigma _{q}^{-1} \mu _{q}\\ - {}^t \eta _{1_{q}} (\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1} \eta _{1_{q}}&= {}^t \mu _{q} \varSigma _{q}^{-1} \mu _{q} \end{aligned}$$
$$\begin{aligned} - \frac{1}{2} tr({}^{t} \eta _{2_{p}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}))&= \frac{1}{2} tr((\mu _{p}{}^{t}\mu _{p} + \varSigma _{p}) \varSigma _{q}^{-1})\\&= \frac{1}{2} tr(\mu _{p}{}^{t}\mu _{p}\varSigma _{q}^{-1}) + \frac{1}{2} tr(\varSigma _{p}\varSigma _{q}^{-1})\\&= \frac{1}{2} {}^{t}\mu _{p}\varSigma _{q}^{-1}\mu _{p} + \frac{1}{2} tr(\varSigma _{q}^{-1}\varSigma _{p})\\ + \frac{1}{2} tr({}^{t}\eta _{2_{q}} (-(\eta _{1_{q}} {}^t\eta _{1_{q}} + \eta _{2_{q}})^{-1}))&= - \frac{1}{2} tr((\mu _{q}{}^{t}\mu _{q} + \varSigma _{q}) \varSigma _{q}^{-1})\\&= - \frac{1}{2} tr(\mu _{q}{}^{t}\mu _{q}\varSigma _{q}^{-1}) - \frac{1}{2} tr(\varSigma _{q}\varSigma _{q}^{-1})\\&= - \frac{1}{2} {}^{t}\mu _{q}\varSigma _{q}^{-1}\mu _{q} - \frac{1}{2} d \end{aligned}$$
By summing up these terms, the standard formula for the KL divergence is recovered:
$$\begin{aligned} KL(\mathcal {N}(\mu _{p},\varSigma _{p})&|| \mathcal {N}(\mu _{q},\varSigma _{q})) = \frac{1}{2} \log \frac{|\varSigma _q|}{|\varSigma _p|} -{}^t \mu _{p} \varSigma _{q}^{-1} \mu _{q} +{}^t \mu _{q} \varSigma _{q}^{-1} \mu _{q} + \\&\frac{1}{2} {}^{t}\mu _{p}\varSigma _{q}^{-1}\mu _{p} + \frac{1}{2} tr(\varSigma _{q}^{-1}\varSigma _{p}) - \frac{1}{2} {}^{t}\mu _{q}\varSigma _{q}^{-1}\mu _{q} - \frac{1}{2} d\\ =&\frac{1}{2} \left( \log \frac{|\varSigma _q|}{|\varSigma _p|} + tr(\varSigma _{q}^{-1}\varSigma _{p}) - d~-\right. \\&\left. \left\{ 2{}^t \mu _{p} \varSigma _{q}^{-1} \mu _{q} - 2 {}^t \mu _{q} \varSigma _{q}^{-1} \mu _{q} - {}^{t}\mu _{p}\varSigma _{q}^{-1}\mu _{p} + {}^{t}\mu _{q}\varSigma _{q}^{-1}\mu _{q}\right\} \right) \\ =&\frac{1}{2}\left( \log \frac{|\varSigma _q|}{|\varSigma _p|} + tr(\varSigma _{q}^{-1}\varSigma _{p}) + {}^t (\mu _{p} - \mu _{q}) \varSigma _{q}^{-1} (\mu _{p} - \mu _{q}) - d \right) \end{aligned}$$
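The whole derivation can be double-checked numerically: the Bregman divergence \(B_G\) on expectation parameters should match the closed form (88). The d = 2 sketch below does the 2x2 algebra by hand (all names and parameter values are illustrative):

```python
import math

def inv2(S):
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]], det

def quad(A, x, y):  # x^T A y
    return sum(x[i] * A[i][j] * y[j] for i in range(2) for j in range(2))

def G(mu, S):
    # Eq. (87) with d = 2; note G depends only on Sigma, and |2 Sigma| = 4|Sigma|
    _, det = inv2(S)
    return -math.log(math.e * math.pi) - 0.5 * math.log(4 * det)

def kl_closed(mp, Sp, mq, Sq):
    # Eq. (88)
    Sqi, dq = inv2(Sq)
    _, dp = inv2(Sp)
    d = [mq[0] - mp[0], mq[1] - mp[1]]
    tr = sum(Sqi[i][k] * Sp[k][i] for i in range(2) for k in range(2))
    return 0.5 * (math.log(dq / dp) + tr + quad(Sqi, d, d) - 2)

def kl_bregman(mp, Sp, mq, Sq):
    Sqi, _ = inv2(Sq)
    # expectation parameters: eta1 = mu, eta2 = -(mu mu^T + Sigma)
    e1p, e1q = mp, mq
    e2p = [[-(mp[i] * mp[j] + Sp[i][j]) for j in range(2)] for i in range(2)]
    e2q = [[-(mq[i] * mq[j] + Sq[i][j]) for j in range(2)] for i in range(2)]
    # gradient of G at eta_q: theta_q = (Sigma_q^{-1} mu_q, Sigma_q^{-1}/2)
    g1 = [sum(Sqi[i][j] * mq[j] for j in range(2)) for i in range(2)]
    g2 = [[0.5 * Sqi[i][j] for j in range(2)] for i in range(2)]
    inner = sum((e1p[i] - e1q[i]) * g1[i] for i in range(2)) \
          + sum((e2p[i][j] - e2q[i][j]) * g2[i][j]
                for i in range(2) for j in range(2))
    return G(mp, Sp) - G(mq, Sq) - inner

mp, Sp = [0.5, -1.0], [[2.0, 0.3], [0.3, 0.8]]
mq, Sq = [1.5, 0.0], [[1.0, -0.2], [-0.2, 1.3]]
assert abs(kl_closed(mp, Sp, mq, Sq) - kl_bregman(mp, Sp, mq, Sq)) < 1e-10
```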