
1 Introduction

Convex analysis is a classical area of mathematics with numerous applications in engineering, economics, physics, statistics and information theory. The central notion is that of a convex function \(f : \mathbb R^n \rightarrow \mathbb R\), satisfying the inequality \(f(tx + (1-t)y) \le tf(x) + (1-t)f(y)\) for all \(x, y \in \mathbb R^n\) and \(t \in [0,1]\). Convexity is a useful property for optimization problems: every local minimum of f is automatically a global minimum. Convex functions furthermore admit a beautiful duality theory; the ubiquitous Legendre-Fenchel transform (or convex conjugation) defined as

$$\begin{aligned} f^*(x^*) = \sup _{{x}}\,\left\{ { \langle x^*,x\rangle - f(x) }\right\} \end{aligned}$$

encodes f in terms of all affine functions \(\langle x^*,x\rangle - c\) majorized by f (here \(\langle -,-\rangle \) denotes the standard inner product on \(\mathbb R^n\)). The function \(f^*\) is convex regardless of f, and under a closedness assumption we recover \(f^{**}=f\).

While convex analysis is a rich field, its compositional structure is not readily apparent; the central notion, convex functions, is not closed under composition. The notion which does compose is less well known: a convex bifunction, due to [27], is a jointly convex function \(F : \mathbb R^{m}\times \mathbb R^n \rightarrow \mathbb R\) of two variables. Such bifunctions compose via infimization

$$\begin{aligned} (F \circ G)(x,z) = \inf _{{y}}\,\left\{ {F(y,z) + G(x,y)}\right\} \end{aligned}$$
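This composition can be illustrated numerically by approximating the infimization over a grid. The following sketch uses toy quadratic bifunctions of my own choosing; the grid bounds and resolution are arbitrary.

```python
import numpy as np

# Grid-based sketch of bifunction composition by infimization.
ys = np.linspace(-5, 5, 1001)   # grid for the middle variable y

def compose(F, G, x, z):
    """(F . G)(x, z) = inf_y { F(y, z) + G(x, y) }, approximated on the grid."""
    return np.min(F(ys, z) + G(x, ys))

# toy convex bifunctions (my choice, not from the text)
F = lambda y, z: (z - y) ** 2
G = lambda x, y: (y - x) ** 2

# Analytically inf_y {(z-y)^2 + (y-x)^2} = (z-x)^2 / 2, attained at y = (x+z)/2
x, z = 1.0, 3.0
assert abs(compose(F, G, x, z) - (z - x) ** 2 / 2) < 1e-6
```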

Categorical Methods In this work, we will study bifunctions and their associated dualities in the framework of category theory. Graphical methods are ubiquitous in engineering and statistics, and can be used to derive efficient algorithms by making use of the factorized structure of a problem. The language of props and string diagrams unifies these methods, as a large body of work on graphical linear algebra and applied category theory shows [1, 2, 7, 19]. We extend these methods to problems of convex analysis and optimization. Our category of bifunctions subsumes an array of mathematical structures, such as linear maps and relations, convex relations, and (surprisingly) multivariate Gaussian probability.

Fig. 1. Addition of independent normal variables X, Y. Left: pdf and convolution; right: logpdf and sup-convolution.

Applications to Probability Theory Convex analysis offers a rich perspective on Gaussian (multivariate normal) probability distributions: The log-density \(h(x) = \log f(x)\) of a Gaussian random variable is a concave function of the form

$$\begin{aligned} h(x) = -\frac{(x-\mu )^2}{2\sigma ^2} \end{aligned}$$

It turns out that anything we can do with Gaussian densities and integration can instead be done with logdensities and maximization. For example, to compute the density of a sum of independent variables, we may take a convolution of densities, or instead compute a sup-convolution of logdensities (see Fig. 1), as

$$\begin{aligned} \log \int f_X(x)f_Y(z-x) \mathrm d x = \sup _{{x}}\,\left\{ {h_X(x) + h_Y(z-x)}\right\} \end{aligned}$$
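This identity can be checked numerically by approximating the supremum on a grid. In the sketch below the normalization constants are dropped (as in the text), and the means and variances are my own arbitrary choices.

```python
import numpy as np

# Sup-convolution of two Gaussian log-densities, up to additive constants.
mu_x, s2_x = 1.0, 2.0      # mean and variance of X (my choice)
mu_y, s2_y = -0.5, 0.5     # mean and variance of Y (my choice)
h = lambda x, mu, s2: -(x - mu) ** 2 / (2 * s2)   # log-density up to a constant

xs = np.linspace(-20, 20, 4001)
sup_conv = lambda z: np.max(h(xs, mu_x, s2_x) + h(z - xs, mu_y, s2_y))

# The result is the log-density of X + Y: means and variances add.
for z in (-2.0, 0.0, 1.5):
    expected = h(z, mu_x + mu_y, s2_x + s2_y)
    assert abs(sup_conv(z) - expected) < 1e-3
```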

This is highly particular to Gaussians. We can elegantly formalize this statement in categorical terms, as our main theorem states: logprobability defines a functor from Gaussian probability to concave bifunctions (Theorem 5).

In this sense, the essence of Gaussians is captured by concave quadratic functions. By extending our viewpoint to partial concave quadratic functions, we obtain a generalized notion of Gaussian relation which includes improper priors. Such entities are subtle to describe measure-theoretically, but straightforward in the convex analytic view. The duality theory of bifunctions generalizes the duality of precision and covariance, and more generally connects to the notion of cumulant-generating function in probability theory.

We elegantly formalize the connections between convex analysis and probability theory using the language of Markov categories [17], which are a categorical formalism for probability theory, and have close connections to the semantics of probabilistic programs [30].

Contribution and Outline This paper is intended to serve as a high-level roadmap to a categorical treatment of convex analysis. Our aim is to spell out the underlying structures, and present a diverse range of connections, especially with diagrammatic methods and categorical probability. For the sake of presentation, we choose to stick to general statements and keep some technical notions (such as regularity conditions) informal. Spelling out the details in a concrete setting is a starting point for future developments. We elaborate one such particular setting in detail, namely Gaussian probability.

We begin §2 by recalling the relevant notions of convex analysis, and proceed to define and study the categorical structure of bifunctions in §3. This includes its two structures as a hypergraph category, and the duality theory of §3.1.

In §4, we elaborate different examples of categories which embed in bifunctions, such as linear and affine algebra, convex relations and convex optimization problems. In each case, the embedding preserves the relevant categorical structures and dualities. In particular, we show that the theory of bifunctions is a conservative extension of graphical linear algebra [25].

In §5 we begin making connections to probability theory. We recall Gaussian probability from a categorical point of view, and construct the embedding functor to bifunctions. We discuss how partial quadratic functions can be seen as an extension of Gaussian probability beyond measure theory.

We conclude with §6-7 discussing the wider context of this work, elaborating connections of probability and convex analysis such as the Laplace approximation and cumulant generating functions, and the idea of idempotent analysis as a ‘tropical limit’ of ordinary analysis.

2 Overview of Convex Analysis

The following section is a brief overview of standard material in convex analysis; all propositions and conventions are taken from [27].

Caveat: An important feature of convex analysis is that it deals with formal infinities \(+\infty , -\infty \) in a consistent fashion. This is crucial because optimization problems may be unbounded. Traditionally, one considers the extended real numbers \(\overline{\mathbb {R}}= [-\infty , +\infty ]\) and extends the usual laws of arithmetic to them. The case \((+\infty ) + (-\infty )\) is left undefined and carefully avoided like 0/0 in real analysis. A more systematic approach [18, 37] is based on enriched category theory, and endows \(\overline{\mathbb {R}}\) with the structure of a commutative quantale, which gives it totally defined operations with a particular arithmetic.

A more serious caveat is that many results in convex analysis require specific regularity assumptions to hold. As these assumptions are not the focus of the present paper, we will state some big-picture theorems in §3 under reservation of these conditions. We then elaborate an array of concrete examples in §4-5 where we make sure that all regularity conditions are indeed satisfied. We discuss this drawback in §7.

A subset \(A \subseteq \mathbb R^n\) is convex if for all \(x,y \in A\) and \(t \in [0,1]\), we have \(tx + (1-t)y \in A\). The epigraph of a function \(f : \mathbb R^n \rightarrow \overline{\mathbb {R}}\) is the set \(\textrm{epi}(f) = \{ (x,y) \in \mathbb R^{n+1} : y \ge f(x) \}\). We say that f is convex if \(\textrm{epi}(f)\) is a convex subset of \(\mathbb R^{n+1}\). This is equivalent to the well-known definition from the introduction, while accounting for infinities. We say that f is concave if \((-f)\) is convex.

Example 1

The following functions are convex: linear functions, |x|, \(x^2\), \(\exp (x)\), \(-\ln (x)\). For a convex subset \(A \subseteq \mathbb R^n\), the convex indicator function \(\delta _A : \mathbb R^n \rightarrow \overline{\mathbb {R}}\) is defined by

$$ \delta _A(x) = {\left\{ \begin{array}{ll} 0, &{}x \in A \\ +\infty , &{}x \notin A \end{array}\right. } $$

We also write indicator functions using modified Iverson brackets as \(\lbrace \!|{x \in A}|\!\rbrace = \delta _A(x)\). The concave indicator function of A is \(-\delta _A(x)\).

The infimal convolution of convex functions \(f,g : \mathbb R^n \rightarrow \overline{\mathbb {R}}\) is defined by \((f \,\square \, g)(x) = \inf _{{y}}\,\left\{ {f(y) + g(x-y)}\right\} \). The convex function f is called closed if \(\textrm{epi}(f)\) is a closed subset of \(\mathbb R^{n+1}\); this is equivalent to f being lower semicontinuous.
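The infimal convolution can be approximated on a grid. As an illustration (not taken from the text, but a standard fact), infimally convolving \(|\cdot |\) with \((\cdot )^2/2\) yields the Huber function, i.e. the Moreau envelope of the absolute value.

```python
import numpy as np

# Infimal convolution on a grid: inf_y { f(y) + g(x - y) }.
ys = np.linspace(-10, 10, 2001)

def inf_conv(f, g, x):
    return np.min(f(ys) + g(x - ys))

def huber(x):
    # x^2/2 in the quadratic zone, linear with slope 1 outside
    return x ** 2 / 2 if abs(x) <= 1 else abs(x) - 0.5

# |.| infimally convolved with (.)^2/2 gives the Huber function
for x in (-3.0, -0.4, 0.0, 0.7, 2.5):
    assert abs(inf_conv(np.abs, lambda t: t ** 2 / 2, x) - huber(x)) < 1e-6
```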

2.1 Conjugacy – the Legendre-Fenchel transform

Definition 1

For a convex function \(f : \mathbb R^n \rightarrow \overline{\mathbb {R}}\), its convex conjugate \(f^* : \mathbb R^n \rightarrow \overline{\mathbb {R}}\) is the convex function

$$\begin{aligned} f^*(x^*) = \sup _{{x}}\,\left\{ {\langle x^*,x\rangle - f(x)}\right\} \end{aligned}$$

For a concave function \(g : \mathbb R^n \rightarrow \overline{\mathbb {R}}\), its concave conjugate \(g^* : \mathbb R^n \rightarrow \overline{\mathbb {R}}\) is the concave function

$$\begin{aligned} g^*(x^*) = \inf _{{x}}\,\left\{ {\langle x^*,x \rangle - g(x) }\right\} \end{aligned}$$

Note that if \(g=-f\) then \(g^*(x^*)=-f^*(-x^*)\).

Geometrically, \(f^*\) encodes information about which affine functions \(\langle x^*,-\rangle - c\) are majorized by f. It is thus natural to view \(f^*\) as a function on covectors \(x^* \in (\mathbb R^n)^*\). This is for example done in [37], but in order to keep notation consistent with [27], we make the traditional identification \((\mathbb R^n)^* \cong \mathbb R^n\) via the inner product, and the notation \(x^*\) is purely decorative. The Legendre-Fenchel transform has applications in many areas of mathematics and physics [34], such as the Hamiltonian formalism in mechanics, statistical mechanics or large deviation theory (e.g. §6.2).

A closed convex function f is the pointwise supremum of all affine functions \(h \le f\) [27, 12.1]. This allows closed convex functions to be recovered from their Legendre transforms.

Proposition 1

([27, Theorem 12.2]). For any convex function \(f : \mathbb R^n \rightarrow \overline{\mathbb {R}}\), \(f^*\) is a closed convex function. We have \(f^{**} = f\) if and only if f is closed.

For arbitrary functions f, the operation \(f \mapsto f^{**}\) is a closure operator which we denote by \(\textrm{cl}(f)\). This is the largest closed convex function majorized by f.

Example 2

The absolute value function \(f(x) = |x|\) is convex and closed. The supremum \(\sup _{{x}}\,\left\{ {cx-|x|}\right\} \) equals 0 if \(|c| \le 1\), and \(\infty \) otherwise. Hence \(f^*(c) = \lbrace \!|{|c| \le 1}|\!\rbrace \), and \(f^{**}=f\).

Example 3

Let \(f(x) = ax^2\) for \(a > 0\). Then \(x \mapsto c \cdot x - a x^2\) is differentiable and has a maximum at \(x=c/2a\). We obtain \(f^*(c) = \frac{1}{4a} c^2\). In particular, we see that the function \(f(x) = \frac{1}{2} x^2\) is a fixed point of the Legendre transform.
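Examples 2 and 3 can be verified with a grid approximation of the Legendre-Fenchel transform. The grid bounds and tolerances below are my own choices.

```python
import numpy as np

# Grid approximation of the Legendre-Fenchel transform.
xs = np.linspace(-10, 10, 2001)

def conjugate(f_vals, slopes):
    """f*(c) = sup_x { c x - f(x) }, approximated over the grid."""
    return np.array([np.max(c * xs - f_vals) for c in slopes])

# Example 3: f(x) = a x^2 has f*(c) = c^2 / (4a)
a = 1.5
cs = np.linspace(-3, 3, 7)
assert np.allclose(conjugate(a * xs ** 2, cs), cs ** 2 / (4 * a), atol=1e-3)

# and f(x) = x^2 / 2 is a fixed point of the transform
assert np.allclose(conjugate(xs ** 2 / 2, cs), cs ** 2 / 2, atol=1e-3)
```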

Proposition 2

([27, Theorem 16.4]). If f, g are closed convex functions, then under certain regularity conditions \((f \,\square \, g)^* = f^* + g^*\) and \((f + g)^* = f^* \,\square \, g^*\).

3 Categories of Convex Bifunctions

We now come to the central definition of this article, namely that of convex (or concave) bifunctions. This concept is due to [27] and is scattered throughout that book.

A bifunction F from \(\mathbb R^m\) to \(\mathbb R^n\) is the convex analysis terminology for a curried function \(\mathbb R^m \rightarrow (\mathbb R^n \rightarrow \overline{\mathbb {R}})\). The uncurried function \(\underline{{F}} : \mathbb R^{m+n} \rightarrow \overline{\mathbb {R}}\) is referred to as the graph function of F. We will suppress the partial application and write F(x)(y) and F(x, y) interchangeably.

Definition 2

A bifunction F from \(\mathbb R^m\) to \(\mathbb R^n\) is called convex (or concave, closed) if its graph function \(\underline{{F}} : \mathbb R^{m + n} \rightarrow \overline{\mathbb {R}}\) has that property. The closure operation cl(F) is applied on the level of graph functions. We denote a convex bifunction by \(F : \mathbb R^m \rightharpoonup \mathbb R^n\) and a concave bifunction by \(F : \mathbb R^m \rightharpoondown \mathbb R^n\).

Bifunction composition is known as product in [27, § 38].

Definition 3

(Categories of bifunctions). We define a category \(\mathtt {{CxBiFn}}\) of convex bifunctions as follows

  • objects are the spaces \(\mathbb R^n\)

  • morphisms are convex bifunctions \(\mathbb R^m \rightharpoonup \mathbb R^n\)

  • the identity \(\mathbb R^n \rightharpoonup \mathbb R^n\) is given by the indicator function

    $$\begin{aligned} \textrm{id}_n(x,y) =\lbrace \!|{x=y}|\!\rbrace \end{aligned}$$
  • composition is infimization over the middle variable

    $$\begin{aligned} (F {\mathop {\circ }\limits ^{\rightharpoonup }}G)(x,z) = \inf _{{y}}\,\left\{ {G(x,y) + F(y,z)}\right\} \end{aligned}$$

Analogously, the category \(\mathtt {{CvBiFn}}\) of concave bifunctions is defined as

  • objects are the spaces \(\mathbb R^n\)

  • morphisms are concave bifunctions \(\mathbb R^m \rightharpoondown \mathbb R^n\)

  • the identity \(\mathbb R^n \rightharpoondown \mathbb R^n\) is given by the concave indicator function

    $$\begin{aligned} -\textrm{id}_n(x,y) = -\lbrace \!|{x=y}|\!\rbrace \end{aligned}$$
  • composition is supremization over the middle variable

    $$\begin{aligned} (F {\mathop {\circ }\limits ^{\rightharpoondown }}G)(x,z) = \sup _{{y}}\,\left\{ {G(x,y) + F(y,z)}\right\} \end{aligned}$$

Proof (of well-definedness)

This construction is a subcategory of the category of weighted relations \(\mathtt {{Rel}}(Q)\) taking values in a commutative quantale Q [3, 12, 23], where \(Q=\overline{\mathbb {R}}\) are the extended reals. It suffices to verify that convex bifunctions are closed under composition and tensor (addition), and contain the identities ([27, p. 408]).

We will write bifunction composition as \(F \circ G\) when it is clear from context whether we use the convex or concave variety. We will write I for the unit space \(\mathbb R^0\), and \(\textbf{0}\) for its unique element.
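The identity law of Definition 3 can be illustrated on a finite grid, encoding the indicator value by \(+\infty \). The grid and the toy bifunction G below are my own choices.

```python
import numpy as np

# A finite-grid sketch of composition in CxBiFn, with +inf as indicator value.
grid = np.linspace(-2, 2, 41)

# identity bifunction id(x, y) = {| x = y |}, as a matrix over the grid
Id = np.where(np.isclose(grid[:, None], grid[None, :]), 0.0, np.inf)

def compose(F, G):
    """(F . G)(x, z) = inf_y { G(x, y) + F(y, z) } over the grid."""
    return np.min(G[:, :, None] + F[None, :, :], axis=1)

# a toy convex bifunction G(x, y) = (y - x)^2
G = (grid[None, :] - grid[:, None]) ** 2

# the indicator of equality is the unit for composition by infimization
assert np.allclose(compose(Id, G), G)
assert np.allclose(compose(G, Id), G)
```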

Example 4

The states (morphisms \(I \rightharpoonup \mathbb R^n\) out of the unit) are in bijection with convex functions \(f : \mathbb R^n \rightarrow \overline{\mathbb {R}}\), as are the effects \(\mathbb R^n \rightharpoonup I\). States and effects in \(\mathtt {{CvBiFn}}\) are in bijection with concave functions \(f : \mathbb R^n \rightarrow \overline{\mathbb {R}}\).

3.1 Duality for Bifunctions

Unless otherwise stated, theorems phrased for convex bifunctions will hold for concave bifunctions by selecting the appropriate versions of the operations.

The duality theory of convex functions extends to bifunctions as follows.

Definition 4

([27, §30]). The adjoint of a convex bifunction \(F : \mathbb R^m \rightharpoonup \mathbb R^n\) is the concave bifunction \(F^* : \mathbb R^n \rightharpoondown \mathbb R^m\) defined by

$$\begin{aligned} F^*(y^*,x^*) = \inf _{{x,y}}\,\left\{ {F(x,y) + \langle x^*,x\rangle - \langle y^*,y\rangle }\right\} \end{aligned}$$

The adjoint of a concave bifunction is convex and uses \(\sup \) instead of \(\inf \). The adjoint of the convex bifunction F is related to the conjugate of its graph function \(\underline{{F}}\) using the formula \(F^*(y^*,x^*) = -\underline{{F}}^*(-x^*,y^*)\). (Note the slight asymmetry that one input is negated)

The analogue of Proposition 1 for bifunctions is as follows

Proposition 3

([27, Theorem 30.1]). For any convex bifunction F, the adjoint \(F^*\) is a closed concave bifunction, and we have \(F^{**} = \textrm{cl}(F)\). In particular, if F is a closed convex bifunction, then \(F^{**} = F\).

Theorem 1

([27, Theorem 38.5]). Under regularity assumptions, the adjoint operation respects composition. That is, for \(F : \mathbb R^m \rightharpoonup \mathbb R^n\) and \(G : \mathbb R^n \rightharpoonup \mathbb R^k\), we have

$$\begin{aligned} (G {\mathop {\circ }\limits ^{\rightharpoonup }}F)^* = F^* {\mathop {\circ }\limits ^{\rightharpoondown }}G^* \end{aligned}$$

That is, the adjoint operation defines a pair of mutually inverse functors

figure d

We indicate with dashed arrows that the functoriality depends on regularity assumptions.

For the interested reader, the regularity assumptions in Theorem 1 include closedness, as well as properness and certain (relative interiors of) domains of the involved bifunctions intersecting [27, § 38]. These assumptions are not necessary conditions.

As a corollary of functoriality, we can derive the following well-known fact:

Corollary 1 (Fenchel duality)

Let \(f : \mathbb R^n \rightarrow \overline{\mathbb {R}}\) be convex, \(g : \mathbb R^n \rightarrow \overline{\mathbb {R}}\) concave, and let \(f^*,g^*\) be their convex and concave conjugates respectively. Then under sufficient regularity assumptions, we have

$$\begin{aligned} \inf _{{x}}\,\left\{ {f(x) - g(x)}\right\} = \sup _{{x^*}}\,\left\{ {g^*(x^*) - f^*(x^*)}\right\} \end{aligned}$$

Proof

Consider the convex function \(h=-g\) and form the state \(s_f : I \rightharpoonup \mathbb R^n, s_f(\textbf{0},x) = f(x)\) and effect \(e_h : \mathbb R^n \rightharpoonup I, e_h(x,\textbf{0}) = h(x)\). The proof proceeds by using functoriality to compute the scalar \((e_h {\mathop {\circ }\limits ^{\rightharpoonup }}s_f)^* = (s_f^* {\mathop {\circ }\limits ^{\rightharpoondown }}e_h^*)\) in two ways: On the one hand, we have

$$\begin{aligned} (e_h {\mathop {\circ }\limits ^{\rightharpoonup }}s_f)(\textbf{0},\textbf{0}) = \inf _{{x}}\,\left\{ {s_f(\textbf{0},x) + e_h(x,\textbf{0})}\right\} = \inf _{{x}}\,\left\{ {f(x)-g(x)}\right\} \end{aligned}$$

On the other hand, we express the adjoints in terms of the conjugates \(f^*,g^*\)

$$\begin{aligned} s_f^*(x^*,\textbf{0}) &= \inf _{{x}}\,\left\{ {s_f(\textbf{0},x) - \langle x^*,x \rangle }\right\} = -f^*(x^*) \\ e_h^*(\textbf{0},x^*) &= \inf _{{x}}\,\left\{ {e_h(x,\textbf{0}) + \langle x^*,x \rangle }\right\} = g^*(x^*) \end{aligned}$$

The adjoint acts as the identity on scalars, so we obtain

$$\begin{aligned} \inf _{{x}}\,\left\{ {f(x)-g(x)}\right\} = (e_h {\mathop {\circ }\limits ^{\rightharpoonup }}s_f)^*(\textbf{0},\textbf{0}) = (s_f^* {\mathop {\circ }\limits ^{\rightharpoondown }}e_h^*)(\textbf{0},\textbf{0}) = \sup _{{x}}\,\left\{ {g^*(x^*) - f^*(x^*)}\right\} \end{aligned}$$
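Fenchel duality can be checked numerically for a concrete pair of functions. The choice \(f(x) = x^2/2\) (convex) and \(g(x) = -|x|\) (concave) below is my own; both optimal values equal 0.

```python
import numpy as np

# Numerical illustration of Fenchel duality for a toy pair f, g.
xs = np.linspace(-10, 10, 2001)
f = xs ** 2 / 2
g = -np.abs(xs)

primal = np.min(f - g)                    # inf_x { f(x) - g(x) }

# conjugates: f*(c) = c^2/2, and g*(c) = inf_x { c x - g(x) } = 0 on [-1, 1]
cs = np.linspace(-1, 1, 201)              # g* is -inf outside [-1, 1]
fstar = cs ** 2 / 2
gstar = np.array([np.min(c * xs - g) for c in cs])
dual = np.max(gstar - fstar)              # sup_{x*} { g*(x*) - f*(x*) }

assert abs(primal) < 1e-9 and abs(dual) < 1e-9   # both optimal values are 0
```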

3.2 Hypergraph Structure and Symmetries

Bifunctions can not only be composed in sequence, but also in parallel. The relevant structure is that of a symmetric monoidal category \((\mathbb C, \otimes , I)\). In this work, we are dealing with a particular simple form of such categories called a prop. A prop \(\mathbb C\) is a strict monoidal category which is generated by a single object R so that every object is of the form \(R^{\otimes n}\) for some \(n \in \mathbb N\). The monoid of objects \((\textrm{ob}(\mathbb C), \otimes , I)\) is thus isomorphic to \((\mathbb N,+,0)\).

Proposition 4

Convex bifunctions have the structure of a prop, generated by the object \(\mathbb R\)

  1.

    The tensor is \(\mathbb R^m \otimes \mathbb R^n = \mathbb R^{m + n}\)

  2.

    The unit is \(I = \mathbb R^0\).

  3.

    The tensor of bifunctions is given by addition: If \(F : \mathbb R^{m_1} \rightharpoonup \mathbb R^{n_1}\), \(G : \mathbb R^{m_2} \rightharpoonup \mathbb R^{n_2}\) then \(F \otimes G : \mathbb R^{m_1+m_2} \rightharpoonup \mathbb R^{n_1 + n_2}\) is defined as

    $$\begin{aligned} (F \otimes G)((x_1,x_2),(y_1,y_2)) = F(x_1,y_1) + G(x_2,y_2) \end{aligned}$$

Proof (of well-definedness)

General fact about categories of weighted relations \(\mathtt {{Rel}}(Q)\) ([23]).

Symmetric monoidal categories are widely studied and admit a convenient graphical language using string diagrams [28]. It is useful to consider further pieces of structure on such a category

  1.

    in a copy-delete category [11], every object carries the structure of a commutative comonoid \(\textrm{copy}_X : X \rightarrow X \otimes X\) and \(\textrm{discard}_X : X \rightarrow I\). This lets information be used in a non-linear way (in the sense of linear logic).

  2.

    in a hypergraph category [14], every object carries the structure of a special commutative Frobenius algebra

Every hypergraph category is in particular a copy-delete category. The pieces of structure of a hypergraph category are often rendered as cups and caps in string diagrams

figure e

subject to equations such as the Frobenius law

figure f

This gives rise to a rich graphical calculus, which has been explored for a number of engineering applications like signal-flow diagrams or electrical circuits [1, 2, 7, 8, 9, 25].

Proposition 5

\(\mathtt {{CxBiFn}}\) has the structure of a hypergraph category in two different ways, which we call the additive and co-additive structure. That is, every object carries two different structures as a special commutative Frobenius algebra

  1.

    The additive structure is given by

    $$\begin{aligned} \textrm{unit} : I \rightharpoonup \mathbb R^n, &\qquad \textrm{unit}(\textbf{0},x) = 0 \\ \textrm{discard} : \mathbb R^n \rightharpoonup I, &\qquad \textrm{discard}(x,\textbf{0}) = 0 \\ \textrm{copy} : \mathbb R^n \rightharpoonup \mathbb R^n \otimes \mathbb R^n &\qquad \textrm{copy}(x,y,z) = \lbrace \!|{x=y=z}|\!\rbrace \\ \textrm{comp} : \mathbb R^n \otimes \mathbb R^n \rightharpoonup \mathbb R^n, &\qquad \textrm{comp}(x,y,z) = \lbrace \!|{x=y=z}|\!\rbrace \end{aligned}$$
  2.

    The co-additive structure is given by

    $$\begin{aligned} \textrm{zero} : I \rightharpoonup \mathbb R^n, &\qquad \textrm{zero}(\textbf{0},x) = \lbrace \!|{x=0}|\!\rbrace \\ \textrm{cozero} : \mathbb R^n \rightharpoonup I, &\qquad \textrm{cozero}(x,\textbf{0}) = \lbrace \!|{x=0}|\!\rbrace \\ \textrm{add} : \mathbb R^n \otimes \mathbb R^n \rightharpoonup \mathbb R^n, &\qquad \textrm{add}(x,y,z) = \lbrace \!|{x+y=z}|\!\rbrace \\ \textrm{coadd} : \mathbb R^n \rightharpoonup \mathbb R^n \otimes \mathbb R^n, &\qquad \textrm{coadd}(z,x,y) = \lbrace \!|{x+y=z}|\!\rbrace \end{aligned}$$

The analogous structures on \(\mathtt {{CvBiFn}}\) use concave indicator functions instead.

We can motivate the names of the hypergraph structures by observing how the multiplications act on states. This duality is clarified in what follows.

Example 5

Let \(f,g : I \rightharpoonup \mathbb R^n\) be two states. Then

figure g
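Consistent with the composition formulas above (the figure itself is not reproduced here), multiplying two states with the additive comparison map gives their pointwise sum, while the co-additive addition map gives their infimal convolution. A grid sketch, with toy states of my own choosing:

```python
import numpy as np

# Two toy states on R (my choice, not from the text).
xs = np.linspace(-5, 5, 1001)
f = lambda x: np.abs(x)
g = lambda x: (x - 1) ** 2

# additive ("black"): inf_{x,y} { f(x) + g(y) + {| x = y = z |} } = f(z) + g(z)
black = lambda z: f(z) + g(z)
# co-additive ("white"): inf over x + y = z, i.e. the infimal convolution
white = lambda z: np.min(f(xs) + g(z - xs))

assert abs(black(0.0) - 1.0) < 1e-9      # |0| + (0 - 1)^2 = 1
assert abs(white(0.0) - 0.75) < 1e-6     # best split attained at x = -1/2
```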

Definition 5

The dagger of a bifunction \(F : \mathbb R^m \rightharpoonup \mathbb R^n\) is given by reversing its arguments

$$\begin{aligned} F^\dagger : \mathbb R^n \rightharpoonup \mathbb R^m, F^\dagger (y,x) = F(x,y) \end{aligned}$$

The inverse of a bifunction \(F : \mathbb R^m \rightharpoonup \mathbb R^n\) is the concave bifunction [27, p. 384]

$$\begin{aligned} F_*(x,y) = -F(y,x) \end{aligned}$$

Both these operations define involutive functors

$$\begin{aligned} (-)^\dagger : \mathtt {{CxBiFn}}^\textrm{op}\rightarrow \mathtt {{CxBiFn}}, \qquad (-)_* : \mathtt {{CxBiFn}}^\textrm{op}\rightarrow \mathtt {{CvBiFn}}\end{aligned}$$

The functor \((-)^\dagger \) is a dagger functor in the sense of [29].

Proposition 6

([27, p. 384]). The operations of inverse and adjoint commute, i.e. for \(F : \mathbb R^m \rightharpoonup \mathbb R^n\) we have \((F^*)_* = (F_*)^*\).

The composite operation \(F^*_*\) defines another covariant functor \(\mathtt {{CxBiFn}}\rightarrow \mathtt {{CxBiFn}}\), which we now interpret: As is customary in graphical linear algebra, we render the two hypergraph structures as follows

figure h

We refer to the additive structure as ‘black’ (\(\bullet \)) and the co-additive one as ‘white’ (\(\circ \)). This presentation reveals an array of symmetries (mirror-image and color-swap), which we now relate:

Theorem 2

The adjoint operation interchanges the additive and co-additive structure. That is we have functors of hypergraph categories

$$\begin{aligned} (-)^* : (\mathtt {{CxBiFn}}^\textrm{op}, \bullet ) \rightarrow (\mathtt {{CvBiFn}}, \circ ) \\ (-)^* : (\mathtt {{CxBiFn}}^\textrm{op}, \circ ) \rightarrow (\mathtt {{CvBiFn}}, \bullet ) \end{aligned}$$

Note that the opposite of a hypergraph category is again a hypergraph category where cups and caps are interchanged.

Proof

Follows from the results in §4.1, as the hypergraph structures are induced by linear maps.

In terms of the generators (1), the mirror image is given by the \((-)^\dagger \) functor. Both hypergraph structures consist of \(\dagger \)-Frobenius algebras, meaning that \((-)^\dagger \) is a functor of hypergraph categories \(\mathtt {{CxBiFn}}^\textrm{op}\rightarrow \mathtt {{CxBiFn}}\).

The color-swap operation is given by the inverse adjoint \(F^*_*\), which gives a hypergraph equivalence \((\mathtt {{CxBiFn}},\bullet ) \rightarrow (\mathtt {{CxBiFn}},\circ )\). This equivalence does not, however, commute with \(\dagger \), i.e. it is not an equivalence of dagger hypergraph categories.

4 Example Categories of Bifunctions

We now elaborate example subcategories of bifunctions on which functoriality and duality work as desired (that is, all regularity conditions apply).

4.1 Linear Algebra

The identities and dualities of convex bifunctions generalize those of linear algebra. Let \(A : \mathbb R^m \rightarrow \mathbb R^n\) be a linear map. The convex indicator bifunction of A is defined as

$$\begin{aligned} F_A(x,y) = \lbrace \!|{y = Ax}|\!\rbrace \end{aligned}$$

The following facts hold [27, p 310]:

  1.

    For composable linear maps, A, B we have \(F_{AB} = F_A \circ F_B\)

  2.

    The adjoint \(F_A^*\) is the concave indicator bifunction of the transpose \(A^T\)

    $$\begin{aligned} F_A^*(y^*,x^*) = -\lbrace \!|{x^* = A^Ty^*}|\!\rbrace \end{aligned}$$
  3.

    if A is invertible, then the inverse \((F_A)_*\) is the concave indicator bifunction associated to the inverse \(A^{-1}\). In that case, Proposition 6 generalizes the identity \((A^{-1})^T = (A^T)^{-1}\).
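Fact 2 can be illustrated numerically: the adjoint \(F_A^*(y^*,x^*) = \inf _x \{ \langle x^*,x\rangle - \langle y^*,Ax\rangle \}\) is 0 when \(x^* = A^Ty^*\) and unbounded below otherwise. In the sketch below, a finite grid bound M stands in for infinity; the matrix A and test vectors are mine.

```python
import numpy as np

# Adjoint of the indicator bifunction of a linear map A, on a grid.
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
M = 100.0
g = np.linspace(-M, M, 201)
X = np.array(np.meshgrid(g, g)).reshape(2, -1).T    # sample points x in R^2

def adjoint(ystar, xstar):
    # inf_x { <x*, x> - <y*, A x> }, restricted to the grid
    return np.min(X @ xstar - X @ (A.T @ ystar))

ystar = np.array([1.0, -1.0])
assert abs(adjoint(ystar, A.T @ ystar)) < 1e-9      # x* = A^T y*: value 0
assert adjoint(ystar, A.T @ ystar + 1.0) < -100.0   # otherwise -> -inf with M
```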

In more categorical terms, let \(\mathtt {{Vect}}\) denote the prop of the vector spaces \(\mathbb R^n\) and linear maps. This is a copy-delete category equipped with the linear maps \(\varDelta : \mathbb R^n \rightarrow \mathbb R^n \oplus \mathbb R^n\) and \(! : \mathbb R^n \rightarrow \mathbb R^0\). For a linear map \(A : \mathbb R^m \rightarrow \mathbb R^n\), define

$$\begin{aligned} F_A : \mathbb R^m \rightharpoonup \mathbb R^n, F_A(x,y) = \lbrace \!|{y=Ax}|\!\rbrace \\ G_A : \mathbb R^n \rightharpoondown \mathbb R^m, G_A(y,x) = -\lbrace \!|{x=A^Ty}|\!\rbrace \end{aligned}$$

Theorem 3

We have a commutative diagram of functors between copy-delete categories

(2)

Proof

Functoriality and commutativity follow from the above facts. For the copy-delete structures, notice that \(\textrm{copy}, \textrm{delete}, \textrm{add}, \textrm{zero}\) are the indicator bifunctions of the linear maps \(\varDelta \) and !. The transpose of \(\varDelta \) is summation \((x,y) \mapsto x+y\).

We call a diagram like (2) a duality situation. The dashed arrows indicate that, while \((-)^*\) is neither a functor nor idempotent on all bifunctions without further conditions, everything works out on the image of F, G respectively. We could thus obtain a genuine commutative diagram of functors by characterizing these images exactly (which we refrain from doing here for the sake of simplicity).

Linear Relations Graphical Linear Algebra [25] is the diagrammatic study of the prop \(\mathtt {{LinRel}}\) of linear relations, which are relations \(R \subseteq \mathbb R^m \times \mathbb R^n\) that are also vector subspaces. This category is a hypergraph category using the two structures shown in (1), and the operations mirror-image and color-swap are defined for linear relations via relational converse and a twisted orthogonal complement

$$\begin{aligned} R^\dagger &= \{ (y,x) : (x,y) \in R \} \\ R^c &= \{ (x^*,y^*) : \forall (x,y) \in R, \langle x^*,x \rangle - \langle y^*,y \rangle = 0 \} \end{aligned}$$

The operations \((-)^\dagger \) and \((-)^c\) commute and define a composite contravariant involution \((-)^* : \mathtt {{LinRel}}^\textrm{op}\rightarrow \mathtt {{LinRel}}\). The following theorem shows that bifunctions are a conservative extension of graphical linear algebra.

Theorem 4

If we embed a linear relation \(R \subseteq \mathbb R^m \times \mathbb R^n\) via its indicator function as a bifunction \(I_R : \mathbb R^m \rightharpoonup \mathbb R^n\), then we have a commutative diagram

(3)

In addition, the functor I preserves both hypergraph structures.

Affine Relations Graphical linear algebra has been extended to affine relations [6]; those are affine subspaces \(R \subseteq \mathbb R^m \times \mathbb R^n\). This still forms a hypergraph category with both structures \(\bullet ,\circ \); however, the color-swap symmetry of linear relations is broken. That is because the affine generator \(\underline{1} : 0 \rightarrow 1\) representing the affine relation \(\{(\textbf{0},1)\}\) does not have an obvious color-swapped dual; affine subspaces are not recovered by their orthogonal complements.

The embedding into bifunctions suggests an avenue to recover such a symmetry: Taking the embedding (3) as a starting point, the indicator bifunction of \(\underline{1}\) is \(f : I \rightharpoonup \mathbb R, f(\textbf{0},x) = \lbrace \!|{x=1}|\!\rbrace \). Its adjoint is \(f^*(x^*,\textbf{0}) = -x^*\), which is a perfectly well-defined bifunction but not the indicator bifunction of any affine relation. This suggests that an extension of affine relations with color-swap symmetry can be obtained using a category of partial affine functions (e.g. [27, p. 107]), but details are left for future work. We will discuss the case of partial quadratic functions in §5.2.

4.2 Convex Relations

Generalizing the previous example even further, a convex relation \(R \subseteq \mathbb R^m \times \mathbb R^n\) is a relation which is also a convex subset of \(\mathbb R^{m + n}\). Convex relations are closed under the usual relation composition and thus form a prop \(\mathtt {{CxRel}}\) [3, 12, 23].

Every linear relation is in particular convex, and like linear relations, convex relations embed into convex bifunctions via the indicator function.

We sketch a certain converse to this embedding: The space \((\mathbb R,+,0)\) is a monoid object in \(\mathtt {{CxRel}}\). We consider the ‘writer’ monad \(T : \mathtt {{CxRel}} \rightarrow \mathtt {{CxRel}}\) associated to that monoid, i.e. \(T(\mathbb R^m) = \mathbb R^{m+1}\). If \(S \subseteq \mathbb R^m \times \mathbb R^{n+1}\) and \(R \subseteq \mathbb R^n \times \mathbb R^{k+1}\) are Kleisli arrows, then Kleisli composition takes the following form

$$\begin{aligned} R \bullet S = \{ (x,z,t_1+t_2) : (x,y,t_1) \in S, (y,z,t_2) \in R \} \end{aligned}$$

Given a convex bifunction \(F : \mathbb R^m \rightharpoonup \mathbb R^n\), the epigraph of its graph function \(\textrm{epi}(\underline{{F}}) \subseteq \mathbb R^m \times \mathbb R^{n+1}\) is thus a Kleisli arrow for T. Under sufficient regularity assumptions, this is functorial, and we have an embedding \(\textrm{epi} : \mathtt {{CxBiFn}}\rightarrow \mathtt {{CxRel}}_T\).

4.3 Ordinary Convex Programs

We briefly discuss the historical origins of bifunctions in convex optimization [27, § 29-30]: For simplicity, we say that an ordinary convex program P is a minimization problem of the form

$$\begin{aligned} \inf \{ f(x) : x \in \mathbb R^n, g_1(x) \le 0, \ldots , g_k(x) \le 0 \} \end{aligned}$$

where the objective function f and the constraints \(g_1, \ldots , g_k : \mathbb R^n \rightarrow \overline{\mathbb {R}}\) are finite convex functions. The bifunction associated to P is defined as

$$\begin{aligned} F_P : \mathbb R^k \rightharpoonup \mathbb R^n, F_P(v,x) = f(x) + \sum _{i=1}^k \lbrace \!|{g_i(x) \le v_i}|\!\rbrace \end{aligned}$$

The inputs \(v \in \mathbb R^k\) can be thought of as perturbations of the constraints. The so-called perturbation function of P is the parameterized minimization problem \((\inf F_P)(v) = \inf _{{x}}\,\left\{ {F_P(v,x)}\right\} \). The convex function \(F_P(0,-)\) represents the unperturbed problem and \((\inf F_P)(0)\) is the desired solution. Note that in categorical language, the perturbation function is straightforwardly obtained as the bifunction composite \((\textrm{discard} \circ F_P) : \mathbb R^k \rightharpoonup I\), or graphically

figure i

The associated bifunction \(F_P\) contains all information about the problem P, and allows one to find the dual problem \(P^*\) by taking its adjoint. This way one can think of any bifunction \(\mathbb R^k \rightharpoonup \mathbb R^n\) as a generalized convex program ([27, p. 294]).
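A minimal sketch of the perturbation function, assuming a toy one-dimensional program (all problem data below are illustrative, and the infimum is only approximated on a grid):

```python
# Toy ordinary convex program: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.
# Its bifunction is F_P(v, x) = x^2 + {| 1 - x <= v |}, and the perturbation
# function is p(v) = inf_x F_P(v, x) = max(1 - v, 0)^2.

def perturbation(v, n=6000):
    best = float("inf")
    for i in range(n + 1):
        x = -3.0 + 6.0 * i / n           # grid over [-3, 3]
        if 1.0 - x <= v:                  # indicator: feasible iff g(x) <= v
            best = min(best, x * x)
    return best

for v in [0.0, 0.5, 2.0]:
    print(v, perturbation(v), max(1.0 - v, 0.0) ** 2)
```

Perturbing the constraint (v > 0) enlarges the feasible set and lowers the optimal value, exactly as the perturbation function records.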

Example 6

([27, p. 312]). Consider a linear minimization problem P of the form

$$\begin{aligned} \inf \{ \langle c,x \rangle : x \ge 0,\ b - Ax \le 0 \} \end{aligned}$$

The associated bifunction and its adjoint are

$$\begin{aligned} F(v,x) &= \langle c,x \rangle + \lbrace \!|{x \ge 0, b-Ax \le v }|\!\rbrace \\ F^*(x^*,v^*) &= \langle b, v^* \rangle - \lbrace \!|{v^* \ge 0, c-A^Tv^* \ge x^*}|\!\rbrace \end{aligned}$$

which is the concave bifunction associated to the dual maximization problem

$$\begin{aligned} \sup \{ \langle b,y \rangle : y \ge 0, c-A^Ty \ge 0 \} \end{aligned}$$
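Example 6's primal-dual pair can be spot-checked numerically. The data (A, b, c) below are our own toy choices, and the brute-force grid search only approximates the two optimal values; a real LP solver would be used in practice:

```python
# Duality on a tiny LP, checked by grid search.
# Primal: inf { <c,x> : x >= 0, b - A x <= 0 }
# Dual:   sup { <b,y> : y >= 0, c - A^T y >= 0 }

A = [[1.0, 1.0],
     [1.0, 2.0]]
b = [1.0, 1.0]
c = [2.0, 3.0]

def grid(n=300, hi=3.0):
    return [hi * i / n for i in range(n + 1)]

primal = min(
    (c[0] * x1 + c[1] * x2
     for x1 in grid() for x2 in grid()
     if b[0] - (A[0][0] * x1 + A[0][1] * x2) <= 1e-9
     and b[1] - (A[1][0] * x1 + A[1][1] * x2) <= 1e-9),
    default=float("inf"))

dual = max(
    (b[0] * y1 + b[1] * y2
     for y1 in grid() for y2 in grid()
     if c[0] - (A[0][0] * y1 + A[1][0] * y2) >= -1e-9
     and c[1] - (A[0][1] * y1 + A[1][1] * y2) >= -1e-9),
    default=float("-inf"))

print(primal, dual)  # both approximately 2.0: strong duality holds here
```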

5 Gaussian Probability and Convexity

We now study the probabilistic applications of our categorical framework: Recently, a sizeable body of work in categorical probability theory has been developed in terms of copy-delete and Markov categories. A Markov category [17] is a copy-delete category \((\mathbb C, \otimes , I)\) where every morphism \(f : X \rightarrow Y\) is discardable in the sense that \(\textrm{discard}_Y \circ f = \textrm{discard}_X\). Classic examples of Markov categories are the category \(\mathtt {{FinStoch}}\) of finite sets and stochastic matrices, and the category \(\mathtt {{Stoch}}\) of measurable spaces and Markov kernels. Discardability expresses that probability measures are normalized (integrate to 1). Markov categories provide a natural semantic domain for probabilistic programs [30].

In this section, we will focus on Gaussian probability, by which we mean the study of multivariate normal (Gaussian) distributions and affine-linear maps. This is a small but expressive fragment of probability, which suffices for a range of interesting applications, from linear regression and Gaussian processes to Kalman filters. The univariate normal distribution \(\mathcal N(\mu ,\sigma ^2)\) is defined on \(\mathbb R\) via the density function

$$\begin{aligned} f(x) = \frac{1}{\sqrt{2\pi \sigma ^2}}\exp \left( -\frac{(x-\mu )^2}{2\sigma ^2}\right) \end{aligned}$$

Multivariate Gaussian distributions are most easily described as the laws of random vectors \(A\cdot X + \mu \) where \(A \in \mathbb R^{n \times k}\) and \(X_1, \ldots , X_k \sim \mathcal N(0,1)\) are independent variables. The law is fully characterized by the mean \(\mu \) and the covariance matrix \(\varSigma = AA^T\). Conversely, for every vector \(\mu \in \mathbb R^n\) and positive semidefinite matrix \(\varSigma \in \mathbb R^{n \times n}\), there exists a unique Gaussian law \(\mathcal N(\mu ,\varSigma )\). If \(X \sim \mathcal N(\mu ,\varSigma )\) and \(Y \sim \mathcal N(\mu ',\varSigma ')\) are independent then \(X + Y \sim \mathcal N(\mu + \mu ',\varSigma + \varSigma ')\) and \(AX \sim \mathcal N(A\mu ,A\varSigma A^T)\). Gaussians are self-conjugate: If (X, Y) are jointly Gaussian, then so is the conditional distribution \(X|Y=y\) for any constant \(y \in \mathbb R^k\).

If the covariance matrix \(\varSigma \) is positive definite, then the Gaussian has a density with respect to the Lebesgue measure on \(\mathbb R^n\) given by

$$\begin{aligned} f(x) = \frac{1}{\sqrt{(2\pi )^n\det (\varSigma )}}\exp \left( -\frac{1}{2} (x-\mu )^T\varOmega (x-\mu )\right) \end{aligned}$$
(4)

where \(\varOmega = \varSigma ^{-1}\) is known as the precision matrix. This suggests two equivalent representations of Gaussians with different advantages (e.g. [20, 31]):

  • In covariance representation \(\varSigma \), pushforwards (addition, marginalization) are easy to compute; conditioning requires solving an optimization problem.

  • In precision representation \(\varOmega \), conditioning is straightforward; pushforwards require solving an optimization problem.

If \(\varSigma \) is singular, the Gaussian distribution is only supported on the affine subspace \(\mu + S\) where \(S = \textrm{im}(\varSigma )\). In that case, the distribution has a density only with respect to the Lebesgue measure on S. This variability of base measure makes it complicated to work with densities, and by extension the precision representation.
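The trade-off between the two representations can be illustrated on a bivariate example (the numbers below are our own toy choices): conditioning via the covariance \(\varSigma\) requires a Schur complement, while in the precision representation the conditional precision is simply a block of \(\varOmega = \varSigma^{-1}\):

```python
# Conditioning a bivariate Gaussian (X, Y) in both representations.
# Covariance side: Var(X | Y) is the Schur complement S_xx - S_xy S_yy^{-1} S_yx.
# Precision side:  the conditional precision is just the block O_xx of O = S^{-1}.

S_xx, S_xy, S_yy = 2.0, 1.0, 1.0           # covariance matrix [[2, 1], [1, 1]]

det = S_xx * S_yy - S_xy * S_xy            # 2x2 inverse by hand
O_xx = S_yy / det                           # precision block for X

schur = S_xx - S_xy * (1.0 / S_yy) * S_xy   # conditional variance via covariance
via_precision = 1.0 / O_xx                  # conditional variance via precision

print(schur, via_precision)  # 1.0 1.0
```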

The situation becomes clearer if we represent Gaussians by the quadratic functions induced by their covariance and precision matrices. These functions are convex (concave), and turn out to be adjoints of each other. This explains the duality of the two representations, and paves the way for generalizations of Gaussian probability like improper priors [31] which correspond to partial quadratic functions (§5.2).

5.1 Embedding Gaussians in Bifunctions

We now give a categorical account of Gaussian probability (in covariance representation). A Gaussian morphism \(\mathbb R^m \rightarrow \mathbb R^n\) is a stochastic map of the form \(x \mapsto Ax + \mathcal N(\mu ,\varSigma )\), that is a linear map with Gaussian noise.

Definition 6

([17, §6]). The Markov prop \(\textsf{Gauss}\) is given as follows

  1. objects are the spaces \(\mathbb R^n\), and \(\mathbb R^m \otimes \mathbb R^n = \mathbb R^{m+n}\)

  2. morphisms \(\mathbb R^m \rightarrow \mathbb R^n\) are tuples \((A,\mu ,\varSigma )\) with \(A \in \mathbb R^{n \times m}, \mu \in \mathbb R^n\) and \(\varSigma \in \mathbb R^{n \times n}\) positive semidefinite

  3. composition and tensor are given by the formulas

    $$\begin{aligned} (A,\mu ,\varSigma ) \circ (B,\mu ',\varSigma ') &= (AB,\mu + A\mu ',\varSigma + A\varSigma ' A^T) \\ (A,\mu ,\varSigma ) \otimes (B,\mu ',\varSigma ') &= \left( A \oplus B,\ \mu \oplus \mu ',\ \varSigma \oplus \varSigma '\right) \end{aligned}$$

    where \(\oplus \) is block-diagonal composition.

  4. the copy-delete structure is given by the linear maps \(\varDelta ,!\)

We have a Markov functor \(\textsf{Gauss}\rightarrow \mathtt {{Stoch}}\) which sends \(\mathbb R^n\) to the measurable space \((\mathbb R^n,\textrm{Borel}(\mathbb R^n))\) and assigns \((A,\mu ,\varSigma )\) to the probability kernel given by \(x \mapsto \mathcal N(Ax + \mu ,\varSigma )\). Functoriality expresses that the formulas of Definition 6 agree with composition of Markov kernels given by integration of measures. Our main theorem shows that, surprisingly, the representation of Gaussians by quadratic functions is also functorial, i.e. we have an embedding \(\textsf{Gauss}\rightarrow \mathtt {{CxBiFn}}\).
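The composition formula of Definition 6 can be sketched in code. Restricting to the scalar (1x1) case is a simplifying assumption of ours, and the function name `compose` is illustrative; associativity, which functoriality requires, is already visible here:

```python
# The composition formula of Definition 6 in the scalar case:
# (A, mu, S) o (B, mu', S') = (A*B, mu + A*mu', S + A*S'*A)

def compose(f, g):
    A, mu, S = f
    B, mu2, S2 = g
    return (A * B, mu + A * mu2, S + A * S2 * A)

f = (2.0, 1.0, 0.5)   # x -> 2x + N(1, 0.5)
g = (3.0, -1.0, 1.0)  # x -> 3x + N(-1, 1)
h = (0.5, 4.0, 2.0)   # x -> x/2 + N(4, 2)

left = compose(compose(f, g), h)
right = compose(f, compose(g, h))
print(left, right)  # the two composites agree: (3.0, 23.0, 76.5)
```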

Theorem 5

We have functors of copy-delete categories in a duality situation

figure j

The functors are defined as follows: Let \(f = (A,\mu ,\varSigma ) \in \textsf{Gauss}(\mathbb R^m, \mathbb R^n)\), and define bifunctions

$$\begin{aligned} \textrm{logpdf}_f : \mathbb R^n \rightharpoondown \mathbb R^m, \quad &\textrm{logpdf}_f(y,x) = -\frac{1}{2} \langle z,\varSigma ^{-}z \rangle - \lbrace \!|{z \in S}|\!\rbrace \\ \textrm{cgf}_f : \mathbb R^m \rightharpoonup \mathbb R^n, \quad &\textrm{cgf}_f(x,y) = \frac{1}{2} \langle y,\varSigma y \rangle + \langle \mu , y \rangle + \lbrace \!|{x = A^Ty}|\!\rbrace \end{aligned}$$

where \(z = y-(Ax + \mu ), S=\textrm{im}(\varSigma )\) and \(\varSigma ^-\) denotes any generalized inverse of \(\varSigma \).

Proof (Sketch)

Functoriality of \(\textrm{cgf}\) follows from a straightforward computation, and one can check that \(\textrm{cgf}_f^*=\textrm{logpdf}_f\) using the formula of [27, p. 109]. Functoriality for \(\textrm{logpdf}\) then follows from Theorem 1. The full proof is elaborated in the extended version of this paper [32].

The value \(\textrm{logpdf}_f(y,x)\) is indeed the conditional log-probability (4) minus a scalar. The name \(\textrm{cgf}\) is short for cumulant-generating function, which we elaborate in §6.2. For now, we can see \(\textrm{cgf}\) as a generalized covariance representation.
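The adjunction between \(\textrm{cgf}\) and \(\textrm{logpdf}\) can be spot-checked numerically in one dimension, where \(\textrm{cgf}(y) = \frac{1}{2}\sigma^2 y^2 + \mu y\) has convex conjugate \((x-\mu)^2/(2\sigma^2)\), the negative logdensity up to an additive constant. The grid supremum and the parameter values below are illustrative:

```python
# 1D check that the cumulant-generating function and the (negative) logdensity
# are convex conjugates: cgf(y) = s*y^2/2 + m*y has conjugate (x - m)^2 / (2*s).

m, s = 1.0, 2.0   # mean and variance

def cgf(y):
    return 0.5 * s * y * y + m * y

def conjugate(x, n=40000, hi=20.0):
    """Approximate sup_y { x*y - cgf(y) } on a grid over [-hi, hi]."""
    best = float("-inf")
    for i in range(n + 1):
        y = -hi + 2 * hi * i / n
        best = max(best, x * y - cgf(y))
    return best

for x in [-1.0, 0.0, 2.5]:
    print(x, conjugate(x), (x - m) ** 2 / (2 * s))
```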

5.2 Outlook: Gaussian Relations

Measure-theoretically, there is no uniform probability distribution over the real line. Such a distribution, if it existed, would be useful to model complete absence of information about a point X – in Bayesian inference, this is called an uninformative prior. Intuitively, such a distribution should have density 1, but this would not integrate to 1. On the other hand, a formal logdensity of 0 makes sense – this is simply the indicator function of the full subset \(\mathbb R\subseteq \mathbb R\).

An extended Gaussian distribution, as described in [31], is a formal sum \(\mathcal N(\mu ,\varSigma ) + D\) of a Gaussian distribution and a vector subspace \(D \subseteq \mathbb R^n\), called a fibre, thereby blending relational and probabilistic nondeterminism. Such entities were considered by Willems in the control theory literature, under the name of linear open stochastic systems [35, 36]; he identifies them with Gaussian distributions on the quotient space \(\mathbb R^n/D\). A categorical account based on decorated cospans is developed in [31].

It is straightforward to embed extended Gaussian distributions into convex bifunctions, by taking the sum of the interpretations from Theorems 4 and 5. The distribution \(\psi = \mathcal N(\mu ,\varSigma ) + D\) has a convex interpretation given by

$$\begin{aligned} \textrm{cgf}_\psi (x) = \frac{1}{2} \langle x,\varSigma x \rangle + \langle \mu ,x \rangle + \lbrace \!|{x \in D^\bot }|\!\rbrace \end{aligned}$$

Functions of this form are partial convex quadratic functions, which are known to form a well-behaved class of convex functions (see appendix of the extended version [32]). The theory of such functions can be understood as an extension of Gaussian probability with relational nondeterminism and conditioning, which we term Gaussian relations. In Gaussian relations, we achieve full symmetry between covariance and density representation (that is, there exists a color-swap symmetry).

Partiality is necessary to be able to interpret all generators of (1); on the upside, the presence of partiality makes conditioning a first-class operation: For example, if \(f : \mathbb R^2 \rightharpoondown I\) is the joint logdensity of Gaussian variables (X, Y), then conditioning on \(Y=0\) is the same as computing the bifunction composite with the \(\textrm{zero}\) map, which is a simple restriction of logdensity \(f_{X|Y=0}(x) = f(x,0)\). On the other hand, conditioning in the covariance representation \(f^*\) requires solving the infimization problem \(\inf _{{y^*}}\,\left\{ {f^*(x^*,y^*)}\right\} \). Graphically, we have

figure k

6 A Wider Perspective

The example of Gaussian probability was a particular situation in which we could map probabilistic concepts to concepts of convex analysis in a functorial way. In this section, we will take an even wider perspective and view convex bifunctions as a categorical model of probability on its own. We will then point out known connections between probability theory and convex analysis, such as the Laplace approximation and cumulant generating functions.

6.1 The Laplace Approximation

For every copy-delete category \(\mathbb C\), the subcategory of discardable morphisms is a Markov category, and can therefore be seen as a generalized model of probability theory. We investigate this notion for categories of bifunctions.

Proposition 7

Let \(F : \mathbb R^m \rightharpoonup \mathbb R^n, G : \mathbb R^n \rightharpoondown \mathbb R^m\) be bifunctions. Then

  1. F is discardable in \((\mathtt {{CxBiFn}},\bullet )\) if \(\forall x, \inf _y F(x,y) = 0\)

  2. G is discardable in \((\mathtt {{CvBiFn}}^\textrm{op},\circ )\) if \(\forall x, G(0,x) = \lbrace \!|{x=0}|\!\rbrace \)

and the adjoint \((-)^*\) defines a bijection between the two.

Proof

Direct calculation.

The embedding of Theorem 5 takes values in discardable bifunctions and hence preserves Markov structure. Functoriality means that the composition of Gaussians (integration) and the composition of bifunctions (optimization) coincide. For general probability distributions, this will no longer be the case. We can however understand bifunction composition as an approximation of ordinary probability theory under the so-called Laplace approximation. In its simplest instance, Laplace’s method (or method of steepest descent) is a method to approximate certain integrals by finding the maximum of the integrand (e.g. [34])

$$\begin{aligned} \int e^{n(\langle c,x \rangle - f(x))} \mathrm d x \approx \exp \left( n\sup _{{x}}\,\left\{ {\langle c,x\rangle - f(x)}\right\} \right) \text { for } n \rightarrow \infty \end{aligned}$$
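Under an illustrative choice \(f(x) = x^2/2\), this limit can be observed numerically: the scaled log-integral \(\frac{1}{n}\log\int e^{n(cx - f(x))}\,\mathrm dx\) approaches \(\sup_x\{cx - f(x)\} = c^2/2\). The integration bounds and grid below are our own choices:

```python
import math

# Laplace's method, numerically: for f(x) = x^2/2 and c = 1, the scaled
# log-integral (1/n) * log( int exp(n*(c*x - f(x))) dx ) tends to c^2/2 = 0.5.

c = 1.0

def scaled_log_integral(n, lo=-10.0, hi=10.0, steps=20000):
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):          # trapezoid rule
        x = lo + i * dx
        w = 0.5 if i in (0, steps) else 1.0
        total += w * math.exp(n * (c * x - 0.5 * x * x)) * dx
    return math.log(total) / n

for n in [1, 10, 100]:
    print(n, scaled_log_integral(n))  # approaches 0.5 as n grows
```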

A wide class of commonly used probability distributions is log-concave, including Gaussian, Laplace, Dirichlet, exponential and uniform distributions. Laplace’s approximation (e.g. [22, §27]) is a way of approximating such distributions around their mode \(x_0\) by a normal distribution, as the Taylor expansion of their logdensity resembles a Gaussian one

$$\begin{aligned} h(x) \approx h(x_0) + \frac{1}{2} h''(x_0)(x-x_0)^2 \end{aligned}$$

We can attempt to reduce questions about such distributions to mode-finding (maximization). The Laplace approximation is fundamental in many applications such as neuroscience [15, 16] and has been generalized to a large body of literature on so-called saddle-point methods [10, 24]. The existence of the functor from Gaussians to bifunctions expresses that, as desired, the Laplace approximation is exact on Gaussians. We give an example of the approximation not being exact (ironically) on Laplacian distributions.

Example 7

The standard Laplacian distribution has the density function \(f(x) = \frac{1}{2} \exp (-|x|)\) on the real line. The negative logpdf \(h(x) = |x|\) (up to an additive constant) is a convex function whose convex conjugate is given by \(h^*(x^*) = \lbrace \!|{|x^*| \le 1}|\!\rbrace \) (see Example 2). The latter function is idempotent under addition, so by conjugacy h is idempotent under infimal convolution. In contrast, the density f(x) is not idempotent under integral convolution: The sum of independent standard Laplacians is not itself Laplacian.
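The idempotence of h under infimal convolution can be checked on a grid (an approximation; the bounds and step size below are our choices):

```python
# h(x) = |x| is idempotent under infimal convolution:
# (h box h)(x) = inf_y { |y| + |x - y| } = |x|, by the triangle inequality.

def inf_conv_abs(x, n=4000, hi=10.0):
    return min(abs(y) + abs(x - y)
               for y in (-hi + 2 * hi * i / n for i in range(n + 1)))

for x in [-2.0, 0.0, 3.5]:
    print(x, inf_conv_abs(x), abs(x))  # the two columns agree
```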

6.2 Convex Analysis in Probability Theory

For a random variable X on \(\mathbb R^n\), the moment generating function \(M_X\) is defined by the expectation \(M_X(x^*) = \mathbb E[e^{\langle x^*,X\rangle }]\) (provided that it exists). The cumulant-generating function is defined as its logarithm \(c_X(x^*) = \log M_X(x^*)\). The function \(c_X\) is always convex. The cumulant-generating function of a multivariate Gaussian \(X \sim \mathcal N(\mu ,\varSigma )\) is precisely

$$\begin{aligned} c_X(x^*) = \frac{1}{2} \langle x^*,\varSigma x^* \rangle + \langle x^*,\mu \rangle \end{aligned}$$
(5)

which explains our choice of the convex bifunction \(\textrm{cgf}\) associated to a Gaussian morphism in Theorem 5. The notion of cumulant-generating function has a central place in the study of exponential families.

It is a particular fact about Gaussians that the cumulant-generating function is the convex conjugate of the logdensity. In the general case, the convex conjugate \(c_X^*(x)\) does have a probabilistic interpretation as a so-called rate function in large deviations theory (Cramér’s theorem, [13]). It has also been used to formulate a variational principle [38].

6.3 Idempotent Mathematics

We zoom out to an even wider perspective: This subsection briefly outlines some further background on the connections between the convex and probabilistic worlds. The logarithm to base \(t<1\) defines an isomorphism of semirings \(([0,\infty ), \times , +) \rightarrow (\mathbb R\cup \{+\infty \}, +, \oplus _t)\) where \(\oplus _t\) is given by \(x \oplus _t y = \log _t(t^x + t^y)\). In the ‘tropical limit’ \(t \searrow 0\), we have \(x \oplus _t y \approx \min (x,y)\), so we can consider working in the semiring \((\overline{\mathbb {R}}, +, \min )\) as a limit or deformation of the usual operations on the reals. The semiring \(\overline{\mathbb {R}}\) is idempotent, meaning \(x \oplus x = \min (x,x) = x\), hence this field of study is also known as idempotent mathematics [26], and the limiting procedure has been called Maslov dequantization [21]. Our definition of convex bifunctions in terms of the idempotent semiring \(\overline{\mathbb {R}}\) thus carries a strong flavor of idempotent mathematics.
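The dequantization limit is easy to observe numerically (the values of t and the inputs below are illustrative):

```python
import math

# Maslov dequantization: x (+)_t y = log_t(t^x + t^y) tends to min(x, y)
# as t decreases to 0.

def oplus(t, x, y):
    return math.log(t ** x + t ** y) / math.log(t)

for t in [0.5, 0.1, 0.001]:
    print(t, oplus(t, 1.0, 3.0))  # approaches min(1, 3) = 1
```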

Idempotent analogues of measure theory are discussed in [21, 26], and many theorems in classical probability theory are mirrored by theorems of idempotent probability theory. For example, the idempotent analogue of integration is infimization; under this view, the tropical analogue of the Laplace transform (cf. moment-generating function) is the Legendre transform [21, §7]

$$\begin{aligned} \int e^{\langle x^*, x \rangle } f(x) \mathrm dx \quad \leftrightarrow \quad \inf _{{x}}\,\left\{ {\langle x^*, x \rangle + f(x)}\right\} \end{aligned}$$

which explains the appearance of the cumulant-generating function in our work. Theorem 5 means that for Gaussians, it makes no difference whether we work in the real-analytic or idempotent world. Idempotent Gaussians have been defined in [26, 1.11.10] using the same formula (5).

7 Related and Future Work

We have described categories of bifunctions as a compositional setting for convex analysis which subsumes a variety of formalisms like linear functions and relations, as well as convex optimization problems, and has a rich duality theory and an elegant graphical language. We have then explored connections between convex analysis and probability theory, and showed that Gaussian probability can be equivalently described in a measure-theoretic and a convex-analytic language. The equivalence of these two perspectives is elegantly formalized as a structure-preserving functor between copy-delete categories. It will be interesting to see how this approach can be generalized to larger classes of distributions such as exponential families.

Concurrently to our work, the categorical structure of convex bifunctions has been exploited by [19] to compositionally build up objective functions for model predictive control (MPC) in control theory. That work does not explore Legendre duality and the connections with categorical models of probability theory. The language of props has a history of applications in engineering [1, 2, 7], and our work was directly inspired by the semantics of probabilistic programming [30, 33].

A starting point for future work is to flesh out the outlook given in §5.2, that is to define a hypergraph category of partial quadratic convex functions, which generalizes Gaussian and extended Gaussian probability. It is also interesting to give a presentation for this prop in the style of [25]: We believe that this is achieved by the addition of a single generator \(\nu : I \rightarrow \mathbb R\) to graphical affine algebra [6] which represents the quadratic function \(f(x) = \frac{1}{2} x^2\), and that its equational theory is essentially given by invariance under the orthogonal groups O(n). A similar equational theory has been attempted in [33], though completeness has not been proven. Diagrammatic presentations of concepts from geometry and optimization, such as polyhedral algebra and the Farkas lemma, have been given in [4, 5].

We realize that the dependence on regularity assumptions (the caveat of §2) makes general theorems about categories of bifunctions like Theorem 1 somewhat awkward to state. We still believe that using a general categorical language is a useful way of structuring the field and making connections, but we see the following avenues for improving the technical situation:

  1. Identifying specific, well-behaved subcategories of bifunctions (such as convex relations, (partial) linear and (partial) quadratic functions) on which everything behaves as desired. This was pursued in §4 and §5.

  2. The Legendre-Fenchel transform has been phrased in terms of enriched adjunctions in [37]. One may hope that developing this enriched-categorical approach will take care of some regularity conditions in a systematic way.