Asymptotic dependency structure of multiple signals
Abstract
We formalize the notion of the dependency structure of a collection of multiple signals, relevant from the perspective of information theory, artificial intelligence, neuroscience, complex systems and other related fields. We model multiple signals by commutative diagrams of probability spaces with measure-preserving maps between some of them. We introduce the asymptotic entropy (pseudo-)distance between diagrams, expressing how much two diagrams differ from an information-processing perspective. If the distance vanishes, we say that two diagrams are asymptotically equivalent. In this context, we prove an asymptotic equipartition property: any sequence of tensor powers of a diagram is asymptotically equivalent to a sequence of homogeneous diagrams. This sequence of homogeneous diagrams expresses the relevant dependency structure.
Keywords
Asymptotic equipartition property, Entropy distance, Diagrams of probability spaces, Multiple signals
1 Introduction
According to usual modeling assumptions in information theory, a discrete signal is cut into a collection of long words of length n, whose particular representation is irrelevant (each word is considered as an atomic object without inner structure), and small errors are allowed. If the signal is modeled as a sequence of independently, identically distributed random variables, there is only one relevant variable determining the signal, namely the entropy: the exponential growth rate of the number of typical words of length n. We elaborate on this point of view below in Sect. 1.1.
Similarly, if one probes a measure-preserving dynamical system at a discrete sequence of times with a finite-output measurement device and counts measurement trajectories of length n, while discarding rarely appearing, untypical ones, one arrives at the notion of entropy of a system-measurement pair. Entropy, in this case, is the exponential growth rate of the number of typical trajectories with respect to the length n. The supremum of such entropies over varying measurement devices is the Kolmogorov–Sinai entropy of a measure-preserving dynamical system. According to a theorem of Ornstein [17], the entropy is the only invariant of the isomorphism classes of certain types of dynamical systems (Bernoulli shifts).
In information theory, but also in artificial intelligence, neuroscience and the theory of complex systems, one usually studies multiple signals at once. Likewise, a dynamical system is often observed with multiple measurement devices simultaneously. In these cases, one assumes in addition that the relations between the signals are essential. In this article we characterize, under these modeling assumptions, the relevant invariants in multiple signals, that are obtained as i.i.d. samples from random variables. We explain this in more detail in Sects. 1.3 and 1.4.
We will now explain our point of view on entropy for a single signal, that is, for a single probability space.
1.1 Probability spaces and their entropy
The entropy of a finite probability space \(X=(S,p)\) is the exponential growth rate of the observable cardinality of tensor powers of X. The observable cardinality, loosely speaking, is the cardinality of the set \(X^{n}\) after the biggest possible subset of small measure has been removed. It turns out that the observable cardinality of \(X^{n}\) might be much smaller than \(|S|^{n}\), the cardinality of the whole of \(X^{n}\), in the following sense.
The Asymptotic Equipartition Property states that for every \(\varepsilon >0\) and \(n\gg 0\) one can find a so-called typical subset \(A^{(n)}_{\varepsilon }\subset S^{n}\), that is, a subset that takes up almost all of the mass of \(X^{n}\) and on which the probability distribution is almost uniform on the normalized logarithmic scale, as stated in the following theorem; see [8].
Theorem 1.1
 1.\(p^{\otimes n}(A^{(n)}_{\varepsilon })> 1-\varepsilon \)
 2.For any \(a,a'\in A^{(n)}_{\varepsilon }\) it holds that $$\begin{aligned} \left| \frac{\ln p^{\otimes n}(a)}{n} - \frac{\ln p^{\otimes n}(a')}{n} \right| < \varepsilon . \end{aligned}$$
The cardinality \(|A^{(n)}_{\varepsilon }|\) may be much smaller than \(|S|^{n}\), but it will still grow exponentially with n. Even though there are generally many choices for such a set \(A^{(n)}_\varepsilon \), in view of the property (1) in Theorem 1.1, the exponential growth rate with respect to n is well-defined up to \(2\varepsilon \).
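The following Python sketch illustrates the two properties of Theorem 1.1 on a small example. It is only a brute-force illustration under stated assumptions: the two-atom space with weights 0.75 and 0.25, the values n = 12 and ε = 0.2, and the function names are all hypothetical choices, and the typical set is taken to be the set of words whose normalized log-probability lies within ε/2 of the entropy, which is one of many admissible choices.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy (natural logarithm) of a distribution {atom: weight}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def typical_set(p, n, eps):
    """Words w in S^n with |-(1/n) ln p(w) - Ent(X)| < eps / 2.

    Any two such words differ by less than eps on the normalized
    logarithmic scale, so property 2 of Theorem 1.1 holds by construction.
    Property 1 (mass > 1 - eps) only holds once n is large enough.
    """
    ent = entropy(p)
    words, mass = [], 0.0
    for w in product(p, repeat=n):
        pw = math.prod(p[s] for s in w)
        if abs(-math.log(pw) / n - ent) < eps / 2:
            words.append(w)
            mass += pw
    return words, mass

# Illustrative (assumed) example: a biased two-atom space.
p = {"a": 0.75, "b": 0.25}
n, eps = 12, 0.2
words, mass = typical_set(p, n, eps)
growth_rate = math.log(len(words)) / n

print("typical words:", len(words), "of", len(p) ** n)
print("captured mass:", round(mass, 3))
print("growth rate:", round(growth_rate, 3), "entropy:", round(entropy(p), 3))
```

In this example the typical set is much smaller than the full set of words, and its growth rate is already close to the entropy, while at such a small n the captured mass is still below \(1-\varepsilon \); the theorem only guarantees property 1 for sufficiently large n.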
1.2 Asymptotic equivalence
If \(X_1\) and \(X_2\) are probability spaces with the same entropy, there is a bijection between their typical sets of sequences of length n, for the plain reason that they can be chosen to have the same cardinality. It means that up to a change of code (of representation) and an error that becomes small as n gets large, the spaces \(X_1^n\) and \(X_2^n\) are equivalent. In the same sense, both \(X_1^n\) and \(X_2^n\) are equivalent to a uniform measure space with cardinality \({\mathbf{e}}^{{n}{\textsf {Ent}}{(X_i)}}\).
Even though we were greatly influenced by ideas in [10], we found that Gromov’s definition does not extend easily to situations in which multiple signals are processed at the same time, or when a dynamical system is probed with several measurement devices at once.
1.3 Diagrams of probability spaces
We model multiple signals by diagrams of probability spaces. By a diagram of probability spaces we mean a commutative diagram of probability spaces and measure-preserving maps between some of them. We will give a precise definition in Sect. 2.4, but will now consider particular examples of diagrams called two-fans.
However, such an asymptotic equivalence relation would be too coarse. Consider the three examples of two-fans shown in Fig. 1, which is to be interpreted in the following way. Each of the spaces \(X_{i}\) and \(Y_{i}\), \(i=1,2,3\), has cardinality six and a uniform distribution, where the weight of each atom is \(\frac{1}{6}\). The spaces \(U_{i}\) have cardinality 12 and the distribution is also uniform with all weights being \(\frac{1}{12}\). The support of the measure on each \(U_{i}\) is colored grey in the pictures. The maps from \(U_{i}\) to \(X_{i}\) and \(Y_{i}\) are coordinate projections.
1.4 The entropy distances and asymptotic equivalence for diagrams
Instead of finding an almost measure-preserving bijection between large parts of the two spaces, we consider a stochastic coupling (transportation plan, joint distribution) between a pair of spaces and measure its deviation from being an isomorphism of probability spaces (a measure-preserving bijection). Such a measure of deviation from being an isomorphism then leads to the notion of intrinsic entropy distance, and its stable version, the asymptotic entropy distance, as explained in Sect. 3. We say two sequences of diagrams are asymptotically equivalent if the asymptotic entropy distance between them vanishes.
The intrinsic entropy distance is an intrinsic version of a distance between random variables going by many different names, such as entropy distance, shared information distance and variation of information. It was reinvented many times by different people, among them Shannon [20], Kolmogorov, Sinai and Rokhlin. It appears in the proof of the theorem about generating partitions for ergodic systems by Kolmogorov and Sinai, see for example [22].
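As a minimal sketch of the non-intrinsic quantity mentioned here, the following Python code computes the variation of information of a given joint distribution, using the standard formula \(2\,\textsf{Ent}(X,Y)-\textsf{Ent}(X)-\textsf{Ent}(Y)\). The formula is the usual textbook one and is assumed here rather than taken from this section; the function names and the two example couplings are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (natural logarithm) of a distribution {point: weight}."""
    return -sum(q * math.log(q) for q in dist.values() if q > 0)

def marginal(joint, axis):
    """Marginal of a joint distribution {(x, y): weight} on one coordinate."""
    out = {}
    for pair, q in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + q
    return out

def entropy_distance(joint):
    """Variation of information: Ent(X|Y) + Ent(Y|X) = 2 Ent(X,Y) - Ent(X) - Ent(Y)."""
    return 2 * entropy(joint) - entropy(marginal(joint, 0)) - entropy(marginal(joint, 1))

# A coupling given by a bijection: the distance vanishes.
bijective = {(0, "a"): 0.5, (1, "b"): 0.5}
# The independent coupling of two fair bits: the distance is 2 ln 2.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(entropy_distance(bijective))
print(entropy_distance(independent), 2 * math.log(2))
```

The intrinsic version discussed in the text would minimize this quantity over all couplings of two fixed marginals, which this sketch does not attempt.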
The intrinsic version of the entropy distance between probability spaces was introduced by Kovacevic et al. [14] and by Vidyasagar [24]. They showed that the involved minimization problem is NP-hard. Methods to find approximate solutions are discussed in [6, 11].
1.5 Asymptotic equipartition property
With the notion of asymptotic equivalence induced by the asymptotic entropy distance, we can prove an asymptotic equipartition property for diagrams. Whereas the asymptotic equipartition property for single probability spaces states that high tensor powers of probability spaces can be approximated by uniform measure spaces, the Asymptotic Equipartition Property Theorem for diagrams, Theorem 6.1, states that sequences of successive tensor powers of a diagram can be approximated in the asymptotic entropy distance by a sequence of homogeneous diagrams.
Homogeneous diagrams have the property that the symmetry group acts transitively on the support of the measures of the constituent spaces. The two-fans shown in Fig. 1 are particular examples of homogeneous diagrams.
Homogeneous probability spaces are just uniform probability spaces, while homogeneous diagrams are, unlike homogeneous probability spaces, rather complex objects. Nonetheless, they seem to be simpler than arbitrary diagrams of probability spaces for the types of problems that we would like to address.
In a subsequent article we show that the optimal values in Information-Optimization problems only depend on the asymptotic class of a diagram and that they are continuous with respect to the asymptotic entropy distance; in many cases, the optimizers are continuous as well. The Asymptotic Equipartition Property implies that for the purposes of calculating optimal values and approximate optimizers, one only needs to consider homogeneous diagrams and this can greatly simplify computations.
Summarizing, the Asymptotic Equipartition Property and the continuity of Information-Optimization problems are important justifications for the choice of asymptotic equivalence relation and the introduction of the intrinsic and asymptotic Kolmogorov–Sinai distances.
1.6 Definitions and results in random variable context
In this article, we use the language of probability spaces and their commutative diagrams rather than the language of random variables, because we often encounter situations in which their joint distributions are not defined, are variable, or even do not exist.
Some relations between the probability spaces can be easily represented by commutative diagrams of probability spaces, such as by a diamond diagram, Sect. 2.5.5, while the description with random variables is complex and not easily interpretable. The diagrams also provide a geometric overview of various entropy identities and inequalities.
Since the language of random variables will be more familiar to many readers, we now present our main result in these terms.
For random variables \(\textsf {X}, \textsf {Y}, \textsf {Z}\) etc., we denote by \(\underline{X}, \underline{Y}, \underline{Z}\) the target sets, and by X, Y, Z the probability spaces with the induced distributions.
In general, there is a relation between k-tuples of random variables and diagrams of a certain type, involving a space for every subset \(I \subset \{1, \dots , k\}\).
However, not every type of diagram corresponds to a tuple of random variables.
Theorem 1
Let \((\textsf {X}(i): i \in {\mathbb {N}})\) be a sequence of i.i.d. random tuples defined on a standard probability space.
2 Category of probability spaces and diagrams
In this section we present the basic setup used throughout the article. We will start by explaining how probability spaces and (equivalence classes of) measure-preserving maps between them form a category. This point of view on probability theory was already advocated in [2, 10].
Category theory yields simple definitions of diagrams of probability spaces and morphisms between them and allows for precise and relatively short proofs. The setup is also convenient when couplings (joint distributions) between probability spaces are absent or variable.
2.1 Categories
Below we briefly review elementary category theory. We refer the reader to the first chapter of [15] for a more extensive introduction.
A category \(\mathbf{C}\) is an abstract mathematical structure that captures the idea of a collection of spaces and structure-preserving maps between them, such as groups and homomorphisms, vector spaces and linear maps, and topological spaces and continuous maps. Categories consist of a collection of objects (which need not be sets), a collection of morphisms (which need not be maps), and a rule for composing morphisms.

A class of objects \(\text {Obj}_{\mathbf{C}}\);
A class of morphisms \(\text {Hom}_{\mathbf{C}}(A,B)\) for every pair of objects \(A,B\in \text {Obj}_{\mathbf{C}}\). For a morphism \(f\in \text {Hom}_{\mathbf{C}}(A,B)\) one usually writes \(f:A{\mathop {\rightarrow }\limits ^{}}B\). Object A will be called the domain and B the target of f, and we say that f is a morphism from A to B;
For each triple of objects A, B and C, a binary, associative operation, called composition,$$\begin{aligned} \circ : \text {Hom}_{\mathbf{C}}(B,C)\times \text {Hom}_{\mathbf{C}}(A,B) \rightarrow \text {Hom}_{\mathbf{C}}(A,C); \end{aligned}$$
For every object \(A \in \text {Obj}_{\mathbf{C}}\) an identity morphism \(\mathbf{1 }_{A}:A {\mathop {\rightarrow }\limits ^{}}A\), with the property that for every \(f: A {\mathop {\rightarrow }\limits ^{}}B\) and every \(g:B {\mathop {\rightarrow }\limits ^{}}A\),$$\begin{aligned} f \circ \mathbf{1 }_{A} = f, \quad \mathbf{1 }_{A} \circ g = g. \end{aligned}$$
Category theory becomes a very powerful tool when functors and their natural transformations are considered. Functors can be seen as homomorphisms between categories. In turn, natural transformations are homomorphisms between functors.
2.2 Probability spaces and reductions
We will now describe the category \({{\mathrm{\mathbf {Prob}}}}\). The objects in \({{\mathrm{\mathbf {Prob}}}}\) are finite probability spaces. A finite probability space X is a pair (S, p), where S is a (not necessarily finite) set and \(p:2^{S} {\mathop {\rightarrow }\limits ^{}}[0,1]\) is a probability measure, such that there is a finite subset of S with full measure. We denote by \(\underline{X}=\text {supp}\,p\) the support of the measure and by \(|X|:=|\text {supp}\,p_{X}|\) its cardinality. Slightly abusing the language, we call this quantity the cardinality of X. We will no longer explicitly mention that the probability spaces we consider are finite. We will also write \(p_X\) where we truly mean its density with respect to the counting measure.
The morphisms in the category \({{\mathrm{\mathbf {Prob}}}}\) are exactly the reductions between finite probability spaces. At this stage one might want to check that \({{\mathrm{\mathbf {Prob}}}}\) is indeed a category, and this is guaranteed as the composition of two reductions is again a reduction.
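The closure of reductions under composition can be checked on a toy example. The sketch below assumes that a reduction is represented by a map between supports whose pushforward of the source measure equals the target measure; the dictionary encoding, the function names and the three spaces are illustrative assumptions, not the article's notation.

```python
from fractions import Fraction

def pushforward(p, f):
    """Image of the measure p = {point: weight} under the map f = {point: image}."""
    out = {}
    for x, w in p.items():
        out[f[x]] = out.get(f[x], Fraction(0)) + w
    return out

def is_reduction(p, f, q):
    """True if f is measure-preserving from (supp p, p) onto (supp q, q)."""
    return pushforward(p, f) == q

def compose(g, f):
    """The composition g after f, again encoded as a dictionary."""
    return {x: g[f[x]] for x in f}

# Three hypothetical spaces X -> Y -> Z with exact rational weights.
X = {x: Fraction(1, 6) for x in range(6)}
Y = {y: Fraction(1, 3) for y in range(3)}
Z = {"lo": Fraction(1, 3), "hi": Fraction(2, 3)}
f = {x: x % 3 for x in X}          # X -> Y
g = {0: "lo", 1: "hi", 2: "hi"}    # Y -> Z

print(is_reduction(X, f, Y), is_reduction(Y, g, Z), is_reduction(X, compose(g, f), Z))
```

Exact fractions are used so that the equality of measures is not blurred by floating-point rounding.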
2.3 Isomorphisms, automorphisms and homogeneity
Now that we have organized probability spaces and reductions into a category, we get concepts such as isomorphism for free: Two probability spaces X and Y are isomorphic in the category \({{\mathrm{\mathbf {Prob}}}}\) if and only if there exists a measure-preserving bijection between the supports of the measures on X and Y. If X and Y are isomorphic, they have the same cardinality. The automorphism group \(\text {Aut}(X)\) is the group of all self-isomorphisms of X.
A probability space X is called homogeneous if the automorphism group \(\text {Aut}(X)\) acts transitively on the support \(\underline{X}\) of the measure. For the category \({{\mathrm{\mathbf {Prob}}}}\), this turns out to be a complicated way of saying that the measure on X is uniform on its support, but when we consider diagrams later, there will be no such simple implication. Homogeneity is an isomorphism invariant and we will denote the subcategory of homogeneous spaces by \({{\mathrm{\mathbf {Prob}}}}_\mathbf{h}\).
There is a product in \({{\mathrm{\mathbf {Prob}}}}\) (which is not a product in the sense of category theory!) given by the Cartesian product of probability spaces, that we will denote by \(X\otimes Y:=(\underline{X}\times \underline{Y},p_{X}\otimes p_{Y})\), where \(p_{X} \otimes p_{Y}\) is the (independent) product measure. There are canonical reductions \(X\otimes Y{\mathop {\rightarrow }\limits ^{}}X\) and \(X\otimes Y{\mathop {\rightarrow }\limits ^{}}Y\) given by projections to factors. For a pair of reductions \(f_{i}:X_{i}{\mathop {\rightarrow }\limits ^{}}Y_{i}\), \(i=1,2\) their tensor product is the reduction \(f_{1}\otimes f_{2}:X_{1}\otimes X_{2}{\mathop {\rightarrow }\limits ^{}}Y_{1}\otimes Y_{2}\), which is equal to the class of the Cartesian product of maps representing \(f_{i}\)’s. The product leaves the subcategory of homogeneous spaces invariant. If one of the factors in the product is replaced by an isomorphic space, then the product stays in the same isomorphism class.
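A small Python sketch of the tensor product may help fix ideas. It assumes the dictionary encoding of finite probability spaces used for illustration only, and it checks three facts on a hypothetical example: the coordinate projections are measure-preserving, the product of uniform spaces is uniform, and Shannon entropy (a standard fact, assumed here) is additive under the independent product.

```python
import math
from fractions import Fraction

def tensor(p, q):
    """Independent product measure on the Cartesian product of the supports."""
    return {(x, y): px * qy for x, px in p.items() for y, qy in q.items()}

def project(p, axis):
    """Pushforward of a measure on pairs under projection to one factor."""
    out = {}
    for point, w in p.items():
        out[point[axis]] = out.get(point[axis], 0) + w
    return out

def entropy(p):
    """Shannon entropy (natural logarithm)."""
    return -sum(float(w) * math.log(float(w)) for w in p.values() if w > 0)

# Hypothetical example spaces.
X = {0: Fraction(1, 3), 1: Fraction(2, 3)}
Y = {y: Fraction(1, 3) for y in "abc"}
XY = tensor(X, Y)

print(project(XY, 0) == X, project(XY, 1) == Y)
print(entropy(XY), entropy(X) + entropy(Y))
print(set(tensor(Y, Y).values()))  # a single weight: the product of uniform spaces is uniform
```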
We close this section with a technical remark. The category \({{\mathrm{\mathbf {Prob}}}}\) is not a small category. However it has a small full subcategory, that contains an object for every isomorphism class in \({{\mathrm{\mathbf {Prob}}}}\) and for every pair of objects in it, it contains all the available morphisms between them and is closed under the product. From now on we imagine that such a subcategory was chosen and fixed and replaces \({{\mathrm{\mathbf {Prob}}}}\) in all considerations below.
2.4 Diagrams of probability spaces

the reductions form a directed, acyclic graph which is transitively closed;

the spaces in the diagram form a poset;

the underlying combinatorial structure could be recorded as a finite category.
A two-fan is then a diagram indexed by \(\varvec{\varLambda }_{2}\): we assign to each object in \(\varvec{\varLambda }_{2}\) a probability space and to each morphism in \(\varvec{\varLambda }_2\) a reduction.
In general, then, a diagram of probability spaces indexed by a poset category \(\mathbf{G}\) is a functor \(\mathscr {X}:\mathbf{G} {\mathop {\rightarrow }\limits ^{}}{{\mathrm{\mathbf {Prob}}}}\). The requirement that \(\mathscr {X}\) is a functor and not just a map between objects and morphisms (combined with the assumption that there is only one morphism between objects), is exactly the requirement that the diagrams should be commutative.
Thus, a reduction of a two-fan is a family of reductions of probability spaces indexed by the objects in the poset category \(\varvec{\varLambda }_2\) such that the diagram commutes.
For a diagram \(\mathscr {X}\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \), the poset category \(\mathbf{G}\) will be called the combinatorial type of \(\mathscr {X}\). For a poset category \(\mathbf{G}\) or a diagram \(\mathscr {X}\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \) we will also need the number of objects in the category \(\mathbf{G}\).
An object O in a poset category \(\mathbf{G}\) will be called a source, if it is not a target of any morphism except for the identity. Likewise a sink object is not a domain of any morphism, except for the identity morphism. If a category contains a unique source object, the object is called the initial object and such a category will be called complete.
The above terminology transfers to diagrams indexed by \(\mathbf{G}\): A source space in \(\mathscr {X}\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \) is one that is not a target space of any reduction within the diagram, a sink space is not the domain of any nontrivial reduction and \(\mathscr {X}\) is called complete if \(\mathbf{G}\) is, i.e. if it has a unique source space.
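The notions of source, sink and completeness are purely combinatorial and can be read off from the list of non-identity morphisms. The sketch below assumes a poset category is encoded as a set of object names together with a set of (domain, target) pairs; the encoding, the function names and the V-shaped second example are hypothetical.

```python
def sources(objects, morphisms):
    """Objects that are not the target of any non-identity morphism."""
    targets = {b for (_, b) in morphisms}
    return {o for o in objects if o not in targets}

def sinks(objects, morphisms):
    """Objects that are not the domain of any non-identity morphism."""
    domains = {a for (a, _) in morphisms}
    return {o for o in objects if o not in domains}

def is_complete(objects, morphisms):
    """Complete means: there is a unique source (the initial object)."""
    return len(sources(objects, morphisms)) == 1

# The two-fan: one source O12 with morphisms to O1 and O2.
fan_objects = {"O1", "O12", "O2"}
fan_morphisms = {("O12", "O1"), ("O12", "O2")}

# A hypothetical V-shaped category with two sources and one sink.
v_objects = {"A", "B", "C"}
v_morphisms = {("A", "C"), ("B", "C")}

print(sources(fan_objects, fan_morphisms), sinks(fan_objects, fan_morphisms))
print(is_complete(fan_objects, fan_morphisms), is_complete(v_objects, v_morphisms))
```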
Denote by \({{\mathrm{\mathbf {Set}}}}\) the category of finite sets and surjective maps. Then all of the above constructions could be repeated for sets instead of probability spaces. Thus we could talk about the category of diagrams of sets \({{\mathrm{\mathbf {Set}}}}\langle \mathbf{G} \rangle \).
2.5 Examples of diagrams
We now consider some examples of poset categories and corresponding diagrams, that will be important in what follows.
2.5.1 Singleton
We denote by \(\bullet \) the poset category with a single object. Clearly diagrams indexed by \(\bullet \) are just probability spaces and we have \({{\mathrm{\mathbf {Prob}}}}\equiv {{\mathrm{\mathbf {Prob}}}}\langle \bullet \rangle \).
2.5.2 Chains
2.5.3 Two-fan
The two-fan \(\varvec{\varLambda }_{2}\) is a category with three objects \(\left\{ O_{1},O_{12},O_{2}\right\} \) and two non-identity morphisms \(O_{12}{\mathop {\rightarrow }\limits ^{}}O_{1}\) and \(O_{12}{\mathop {\rightarrow }\limits ^{}}O_{2}\). A diagram indexed by a two-fan will also be called a two-fan.
Essentially, a two-fan \((X{\mathop {\leftarrow }\limits ^{}}Z{\mathop {\rightarrow }\limits ^{}}Y)\) is a triple of probability spaces and a pair of reductions between them.
A two-fan \((X{\mathop {\leftarrow }\limits ^{}}Z{\mathop {\rightarrow }\limits ^{}}Y)\) is called minimal if for any super-diagram \(\left\{ X,Y,Z,Z'\right\} \) shown in Fig. 2b the reduction \(m:Z{\mathop {\rightarrow }\limits ^{}}Z'\) must be an isomorphism. Minimal two-fans are also called couplings in probability theory.
For any two-fan \((X{\mathop {\leftarrow }\limits ^{}}Z{\mathop {\rightarrow }\limits ^{}}Y)\) of probability spaces there always exists a unique (up to isomorphism) minimal two-fan \((X{\mathop {\leftarrow }\limits ^{}}Z'{\mathop {\rightarrow }\limits ^{}}Y)\) that can be included in the diagram shown in Fig. 2b. The minimization can be constructed by taking \(Z':=\underline{X}\times \underline{Y}\) as a set and considering a probability distribution on \(Z'\) induced by a map \(Z{\mathop {\rightarrow }\limits ^{}}Z'\), that is the Cartesian product of the reductions \(\underline{Z}{\mathop {\rightarrow }\limits ^{}}\underline{X}\) and \(\underline{Z}{\mathop {\rightarrow }\limits ^{}}\underline{Y}\) in the original two-fan. Thus, the inclusion of a pair of probability spaces X and Y as sink vertices in a minimal two-fan is equivalent to specifying a joint distribution on \(\underline{X}\times \underline{Y}\).
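The minimization described above can be sketched in a few lines of Python. The code assumes the top space Z is given as a dictionary of weights and the two reductions as dictionaries of images; it pushes the measure on Z forward along the product map into \(\underline{X}\times \underline{Y}\). The function name and the example fan are hypothetical.

```python
from fractions import Fraction

def minimize_two_fan(p_Z, to_X, to_Y):
    """Return (p_min, m): the induced measure on pairs (x, y) and the map m: Z -> Z'."""
    p_min, m = {}, {}
    for z, w in p_Z.items():
        m[z] = (to_X[z], to_Y[z])  # Cartesian product of the two reductions
        p_min[m[z]] = p_min.get(m[z], Fraction(0)) + w
    return p_min, m

# A hypothetical non-minimal two-fan: points 0 and 1 of Z have the same images.
p_Z = {z: Fraction(1, 4) for z in range(4)}
to_X = {0: "x0", 1: "x0", 2: "x1", 3: "x1"}
to_Y = {0: "y0", 1: "y0", 2: "y0", 3: "y1"}

p_min, m = minimize_two_fan(p_Z, to_X, to_Y)
for pair, weight in sorted(p_min.items()):
    print(pair, weight)
```

Here the support of the induced measure has three points while Z has four, so the map m is not an isomorphism and the original two-fan is not minimal; the induced measure is exactly a joint distribution on \(\underline{X}\times \underline{Y}\).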
Note that minimality of a two-fan is defined in purely categorical terms. Even though the definition applies to two-fans of morphisms in any category, the minimization need not exist. However, as the next proposition asserts, if minimization of any two-fan exists in a category \(\mathbf{C}\), then it also exists in a category of diagrams over \(\mathbf{C}\).
Proposition 2.1
 1.
A two-fan \(\mathscr {F}=(\mathscr {X}{\mathop {\leftarrow }\limits ^{}}\mathscr {Z}{\mathop {\rightarrow }\limits ^{}}\mathscr {Y})\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G},\varvec{\varLambda }_{2}\rangle \) of \(\mathbf{G}\)-diagrams is minimal if and only if the constituent two-fans of probability spaces \(\mathscr {F}_{i}=(X_{i}{\mathop {\leftarrow }\limits ^{}}Z_{i}{\mathop {\rightarrow }\limits ^{}}Y_{i})\) are all minimal.
 2.For any two-fan \(\mathscr {F}=(\mathscr {X}{\mathop {\leftarrow }\limits ^{}}\mathscr {Z}{\mathop {\rightarrow }\limits ^{}}\mathscr {Y})\) of \(\mathbf{G}\)-diagrams its minimal reduction exists, that is, there exists a minimal two-fan \(\mathscr {F}'=(\mathscr {X}{\mathop {\leftarrow }\limits ^{}}\mathscr {Z}'{\mathop {\rightarrow }\limits ^{}}\mathscr {Y})\) included in the following diagram
The proof of Proposition 2.1 can be found on page 38.
2.5.4 Cofan
2.5.5 A diamond diagram
Of course, there is also a morphism \(O_{12}{\mathop {\rightarrow }\limits ^{}}O_{\bullet }\), which lies in the transitive closure of the given four morphisms. We will often skip writing morphisms that are implied by the transitive closure.
A diamond diagram will be called minimal if the top two-fan in it is minimal.
2.5.6 “Two-tents” diagram
The “two-tents” category \(\mathbf{M}_{2}\) consists of five objects, of which two are sources and three are sinks, and morphisms are as in Fig. 3b.
2.5.7 Full diagram
The full category \(\varvec{\varLambda }_{n}\) on n objects is a category with objects \(\left\{ O_{I}\right\} _{I\in 2^{\left\{ 1,\ldots ,n\right\} }\setminus \left\{ \emptyset \right\} }\) indexed by all nonempty subsets \(I\in 2^{\left\{ 1,\ldots ,n\right\} }\) and a morphism from \(O_{I}\) to \(O_{J}\), whenever \(J\subseteq I\).
A diagram \(\mathscr {X}\) indexed by a full category will be called minimal, if for every two-fan in it, it also contains a minimal two-fan with the same sink vertices. If \(\mathscr {X}\in {{\mathrm{\mathbf {Prob}}}}\langle \varvec{\varLambda }_n\rangle \) is a minimal full diagram of probability spaces, then the set \(\underline{\mathscr {X}(O_{I})}\) can be considered as a subset of the product \(\prod _{i\in I}\underline{\mathscr {X}(O_{i})}\), while reductions are just coordinate projections.
For an n-tuple of random variables \(\textsf {X}_{1},\ldots ,\textsf {X}_{n}\) one may construct a minimal full diagram \(\mathscr {X}\in {{\mathrm{\mathbf {Prob}}}}\langle \varvec{\varLambda }_{n}\rangle \) by considering all joint distributions and “marginalization” reductions. We denote such a diagram by \(\langle \textsf {X}_{1},\ldots ,\textsf {X}_{n}\rangle \). On the other hand, the reductions from the initial space to the sink vertices of a full diagram can be viewed as random variables on the domain of definition given by the (unique) initial space.
Once the underlying sets of the sink spaces are fixed, there is a one-to-one correspondence between the full minimal diagrams and distributions as above.
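The passage from a joint distribution to a minimal full diagram can be sketched as follows. The code assumes the joint distribution of n random variables is given as a dictionary on n-tuples, indexes the spaces by sorted tuples of (zero-based) coordinates in place of the subsets I, and realizes the reductions as coordinate projections. The encoding, the function names and the example (two independent fair bits together with their XOR) are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

def marginal(joint, positions):
    """Pushforward of a measure on tuples under projection to the given positions."""
    out = {}
    for point, w in joint.items():
        key = tuple(point[i] for i in positions)
        out[key] = out.get(key, Fraction(0)) + w
    return out

def full_diagram(joint, n):
    """One space for every non-empty subset I of {0, ..., n-1}."""
    return {I: marginal(joint, I)
            for k in range(1, n + 1)
            for I in combinations(range(n), k)}

def project(space, I, J):
    """The reduction from the space indexed by I to the one indexed by J, for J inside I."""
    return marginal(space, [I.index(j) for j in J])

# Hypothetical example: two independent fair bits and their XOR.
joint = {(a, b, a ^ b): Fraction(1, 4) for a in (0, 1) for b in (0, 1)}
diagram = full_diagram(joint, 3)

print(len(diagram))            # number of objects of the full category on 3 objects
print(diagram[(0, 2)])         # joint distribution of the first and third variable
```

Commutativity of the diagram amounts to the fact that marginalizing in two steps gives the same result as marginalizing directly, which the projection helper makes easy to verify.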
As a corollary of Proposition 2.1 we also obtain the following characterization of minimal full diagrams of any \(\mathbf{G}\)-diagrams of probability spaces.
Corollary 2.2
 1.
A full diagram \(\mathscr {F}\) of \(\mathbf{G}\)-diagrams is minimal, if and only if the constituent full diagrams of probability spaces \(\mathscr {F}_{i}\) are all minimal.
 2.
For any full diagram \(\mathscr {F}\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G},\varvec{\varLambda }_{n}\rangle \) of \(\mathbf{G}\)-diagrams there exists another minimal full diagram \(\mathscr {F}'\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G},\varvec{\varLambda }_{n}\rangle \) with the same sink entries and a reduction \(\mu :\mathscr {F}{\mathop {\rightarrow }\limits ^{}}\mathscr {F}',\) such that \(\mu \) restricts to an isomorphism on sink entries of \(\mathscr {F}\). Moreover, \(\mathscr {F}'\) is unique up to isomorphism.
2.6 Constant diagrams
Suppose X is a probability space and \(\mathbf{G}\) is a poset category. One may form a constant \(\mathbf{G}\)-diagram by considering a functor that maps all objects in \(\mathbf{G}\) to X and all the morphisms to the identity morphism \(X{\mathop {\longrightarrow }\limits ^{\text {Id}}} X\). We denote such a constant diagram by \(X^{\mathbf{G}}\) or simply by X, when \(\mathbf{G}\) is clear from the context. Any constant diagram is automatically minimal.
2.7 Homogeneous diagrams
A diagram \(\mathscr {X}\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \) indexed by some poset category \(\mathbf{G}\) is called homogeneous if its automorphism group \(\text {Aut}(\mathscr {X})\) acts transitively on every probability space in \(\mathscr {X}\). Three examples of homogeneous diagrams were given in the introduction. The subcategory of all homogeneous diagrams indexed by \(\mathbf{G}\) will be denoted \({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle _\mathbf{h}\).
In fact, for \(\mathscr {X}\) to be homogeneous it is sufficient that \(\text {Aut}(\mathscr {X})\) acts transitively on every source space in \(\mathscr {X}\). Thus, if \(\mathscr {X}\) is complete with initial space \(X_{0}\), to check homogeneity it is sufficient to check the transitivity of the action of the symmetries of \(\mathscr {X}\) on \(X_{0}\).
A single probability space is homogeneous if and only if there is a representative in its isomorphism class with uniform measure and the same holds true for chain diagrams, for the co-fan or any other diagram that does not contain a two-fan. However, for more complex diagrams, for example for two-fans, no such simple description is available.
2.7.1 Universal construction of homogeneous diagrams
Examples of homogeneous diagrams could be constructed in the following manner. Suppose \(\varGamma \) is a finite group and \(\left\{ H_{i}\right\} \) is a collection of subgroups. Consider a collection of sets \(\underline{X}_{i}:=\varGamma /H_{i}\) and consider a natural surjection \(f_{ij}:\underline{X}_{i}{\mathop {\rightarrow }\limits ^{}}\underline{X}_{j}\) whenever \(H_{i}\) is a subgroup of \(H_{j}\). Equipping each \(\underline{X}_{i}\) with the uniform distribution one can turn the diagram of sets \(\left\{ \underline{X}_{i};f_{ij}\right\} \) into a homogeneous diagram of probability spaces. It will be complete if there is a smallest subgroup (under inclusion) among \(H_{i}\)’s.
Such a diagram will be complete and minimal, if together with any pair of groups \(H_{i}\) and \(H_{j}\) in the collection, their intersection \(H_{i}\cap H_{j}\) also belongs to the collection \(\left\{ H_{i}\right\} \).
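The construction from a group and a collection of subgroups can be tried out on a small case. The sketch below assumes \(\varGamma \) is the cyclic group of order 6 written additively and takes the trivial subgroup together with the subgroups of order 2 and 3; these choices, the encoding of cosets as frozensets and the function names are illustrative assumptions.

```python
from fractions import Fraction

N = 6                      # Gamma = Z/6 under addition mod 6 (assumed example)
GAMMA = range(N)

def coset(g, H):
    """The coset g + H as a frozenset of group elements."""
    return frozenset((g + h) % N for h in H)

def quotient(H):
    """The set Gamma/H equipped with the uniform distribution."""
    cosets = {coset(g, H) for g in GAMMA}
    return {c: Fraction(1, len(cosets)) for c in cosets}

def natural_surjection(H_i, H_j):
    """The map Gamma/H_i -> Gamma/H_j, defined when H_i is a subgroup of H_j."""
    assert set(H_i) <= set(H_j)
    return {coset(g, H_i): coset(g, H_j) for g in GAMMA}

def pushforward(p, f):
    out = {}
    for x, w in p.items():
        out[f[x]] = out.get(f[x], Fraction(0)) + w
    return out

H0, H1, H2 = {0}, {0, 3}, {0, 2, 4}       # H0 is the intersection of H1 and H2
Z, X, Y = quotient(H0), quotient(H1), quotient(H2)
f_ZX, f_ZY = natural_surjection(H0, H1), natural_surjection(H0, H2)

print(len(Z), len(X), len(Y))
print(pushforward(Z, f_ZX) == X, pushforward(Z, f_ZY) == Y)
print(len({(f_ZX[z], f_ZY[z]) for z in Z}) == len(Z))  # the product map is injective
```

In this example the collection contains the intersection of the two larger subgroups, and the product map from the top space into the product of the sinks is injective, which is consistent with the resulting two-fan being complete and minimal.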
In fact, any homogeneous diagram arises this way. Suppose diagram \(\mathscr {X}= \left\{ X_i ; f_{ij} \right\} \) is homogeneous, then we set \(\varGamma = \text {Aut}(\mathscr {X})\) and choose a collection of points \(x_i \in X_i\) such that \(f_{ij} (x_i) = x_j\) and denote by \(H_i := {\text {Stab}}(x_i) \subset \varGamma \). Then, if one applies the construction of the previous paragraph to \(\varGamma \), with the collection of subgroups \(\left\{ H_i\right\} \), one recovers the original diagram \(\mathscr {X}\) up to isomorphism.
2.8 Conditioning
Under some assumptions it is possible to condition a whole subdiagram of \(\mathscr {X}\). More specifically, if a diagram \(\mathscr {X}\) contains a subdiagram \(\mathscr {Y}\) and a probability space X satisfying the condition that there exists a space Z in \(\mathscr {X}\) that reduces to all the spaces in \(\mathscr {Y}\) and to X, then we may condition the whole of \(\mathscr {Y}\) on \(x \in X\) given that \(p_X(x)>0\).
For \(x\in X\) with positive weight we denote by \(\mathscr {Y}\lfloor x\) the diagram of spaces in \(\mathscr {Y}\) conditioned on \(x\in X\). The diagram \(\mathscr {Y}\lfloor x\) has the same combinatorial type as \(\mathscr {Y}\) and will be called the slice of \(\mathscr {Y}\) over \(x\in X\). Note that the space X itself may or may not belong to \(\mathscr {Y}\). The conditioning \(\mathscr {Y}\lfloor x\) may depend on the choice of a fan between \(\mathscr {Y}\) and X, however when \(\mathscr {X}\) is complete the conditioning \(\mathscr {Y}\lfloor x\) is well-defined and is independent of the choice of fans.
Suppose now that there are two subdiagrams \(\mathscr {Y}\) and \(\mathscr {Z}\) in \(\mathscr {X}\) and in addition \(\mathscr {Z}\) is a constant diagram, \(\mathscr {Z}=Z^{\mathbf{G}'}\) for some poset category \(\mathbf{G}'\). Let \(z\in \underline{Z}\), then \(\mathscr {Y}\lfloor z\) is well-defined and is independent of the choice of the space in \(\mathscr {Z}\), the element of which z is to be considered.
If \(\mathscr {X}\) is homogeneous, then \(\mathscr {Y}\lfloor x\) is also homogeneous and its isomorphism class does not depend on the choice of \(x\in \underline{X}\).
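In the simplest case, where the subdiagram is a single space Y and the fan between Y and X is a minimal two-fan, conditioning reduces to the usual conditional distribution. The sketch below covers only this special case; the encoding of the two-fan as a joint distribution, the function name and the example (a uniform measure on six cells of a 3 by 3 grid) are hypothetical.

```python
from fractions import Fraction

def slice_over(joint, x):
    """Conditional distribution on Y given x, for a joint distribution {(x, y): weight}."""
    px = sum(w for (a, _), w in joint.items() if a == x)
    if px == 0:
        raise ValueError("conditioning requires p_X(x) > 0")
    return {b: w / px for (a, b), w in joint.items() if a == x}

# Hypothetical two-fan: uniform measure on the cells (x, x) and (x, x + 1 mod 3).
joint = {(x, (x + d) % 3): Fraction(1, 6) for x in range(3) for d in (0, 1)}
slices = {x: slice_over(joint, x) for x in range(3)}

for x, s in slices.items():
    print(x, s)
```

In this example every slice is a uniform two-point space, so all slices are isomorphic to one another, in line with the statement about slices of homogeneous diagrams.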
2.9 Entropy
3 The entropy distance
We turn the space of diagrams into a pseudometric space by introducing the intrinsic entropy distance and asymptotic entropy distance. The intrinsic entropy distance is obtained by taking an infimum of the entropy distance over all possible joint distributions on two probability spaces.
3.1 Entropy distance and asymptotic entropy distance
3.1.1 Entropy distance in the case of single probability spaces
The bivariate function \(\mathbf k :{{\mathrm{\mathbf {Prob}}}}\times {{\mathrm{\mathbf {Prob}}}}{\mathop {\rightarrow }\limits ^{}}\mathbb {R}_{\ge 0}\) defines a notion of pseudodistance and it vanishes exactly on pairs of isomorphic probability spaces. This follows directly from the Shannon inequality (5), and a more general statement will be proven in Proposition 3.1 below.
3.1.2 Entropy distance for complete diagrams
The definition of entropy distance for complete diagrams repeats almost literally the definition for single spaces. We fix a complete poset category \(\mathbf{G}\) and will be considering diagrams from \({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \).
The following proposition records that the intrinsic entropy distance is in fact a pseudo-distance on \({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \), provided \(\mathbf{G}\) is a complete poset category (that is, when \(\mathbf{G}\) has a unique initial object).
Proposition 3.1
Moreover, two diagrams \(\mathscr {X}, \mathscr {Y}\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \) satisfy \(\mathbf k (\mathscr {X},\mathscr {Y})=0\) if and only if \(\mathscr {X}\) is isomorphic to \(\mathscr {Y}\) in \({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \).
The idea of the proof is very simple. In the case of single probability spaces X, Y, Z, a coupling between X and Z can be constructed from a coupling between X and Y and a coupling between Y and Z by adhesion on Y, see [16]. The triangle inequality then follows from the Shannon inequality. However, since we are dealing with diagrams, the combinatorial structure requires careful treatment. Therefore, we provide a detailed proof in Sect. 7.
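For single finite probability spaces the adhesion step can be sketched as follows. This is an illustration only, assuming couplings are stored as dictionaries of joint probabilities and that adhesion means making X and Z conditionally independent given Y. The names `adhesion` and `marginal` are hypothetical.

```python
from collections import defaultdict

def adhesion(pxy, pyz):
    """Glue couplings {(x, y): prob} and {(y, z): prob} along Y.

    X and Z are made conditionally independent given Y:
    p(x, y, z) = p(x, y) * p(y, z) / p(y).
    Both inputs must have the same Y-marginal.
    """
    py = defaultdict(float)
    for (_, y), w in pxy.items():
        py[y] += w
    pxyz = {}
    for (x, y), w1 in pxy.items():
        for (y2, z), w2 in pyz.items():
            if y == y2 and w1 > 0 and w2 > 0:
                pxyz[(x, y, z)] = w1 * w2 / py[y]
    return pxyz

def marginal(joint, keep):
    """Marginal of a joint distribution on the coordinate positions in `keep`."""
    out = defaultdict(float)
    for point, w in joint.items():
        out[tuple(point[i] for i in keep)] += w
    return dict(out)

pxy = {("a", 0): 0.3, ("b", 0): 0.2, ("b", 1): 0.5}
pyz = {(0, "u"): 0.1, (0, "v"): 0.4, (1, "v"): 0.5}
pxyz = adhesion(pxy, pyz)
pxz = marginal(pxyz, (0, 2))  # induced coupling between X and Z
print(pxz)
```

The glued distribution restricts to the two given couplings on the pairs (X, Y) and (Y, Z), so its (X, Z)-marginal is a coupling of X and Z to which the Shannon inequality can be applied.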
It is important to note that the proof uses the fact that \(\mathbf{G}\) is complete. In fact, even though the definition of \(\mathbf k \) could be easily extended to some bivariate function on the space of diagrams of any fixed combinatorial type, it fails to satisfy the triangle inequality in general, because the composition of couplings requires completeness of \(\mathbf{G}\).
3.1.3 The asymptotic entropy distance
As a corollary of Proposition 3.1 and definition (10) we immediately obtain that the asymptotic entropy distance is a homogeneous pseudodistance on \({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \).
Corollary 3.2
 1.
\({\varvec{\upkappa }}(\mathscr {X},\mathscr {Y})\le \mathbf k (\mathscr {X},\mathscr {Y})\)
 2.
for any \(n\in \mathbb {N}_{0}\), the equality \( {\varvec{\upkappa }}(\mathscr {X}^{n},\mathscr {Y}^{n})=n\cdot {\varvec{\upkappa }}(\mathscr {X},\mathscr {Y}) \) holds.
We will see later that there are instances when \({\varvec{\upkappa }}<\mathbf k \); moreover, there are pairs of non-isomorphic diagrams with vanishing asymptotic entropy distance between them.
In the next subsection we derive some elementary properties of the intrinsic entropy distance and the asymptotic entropy distance.
3.2 Properties of (asymptotic) entropy distance
3.2.1 Tensor product
We show that the tensor product on the space of diagrams is 1-Lipschitz. Later this will allow us to give a simple description of tropical diagrams, that is, of points in the asymptotic cone of \({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \), as limits of certain sequences of “classical” diagrams, as will be discussed in a subsequent article.
Proposition 3.3
This statement is a direct consequence of the additivity of entropy with respect to the tensor product. Details can be found in Sect. 7.
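The additivity of entropy under the tensor product can be checked numerically for single finite probability spaces. The sketch below illustrates only this additivity, not the proposition itself, whose statement is not reproduced in this part of the text. The helper names `entropy` and `tensor` are hypothetical.

```python
import math
from itertools import product

def entropy(weights):
    """Shannon entropy (in nats) of an iterable of probabilities."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def tensor(p, q):
    """Product distribution on pairs, given {point: prob} dictionaries."""
    return {(a, b): p[a] * q[b] for a, b in product(p, q)}

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {0: 0.9, 1: 0.1}

lhs = entropy(tensor(p, q).values())             # entropy of the tensor product
rhs = entropy(p.values()) + entropy(q.values())  # sum of the entropies
print(lhs, rhs)
```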
It follows directly from definition (10) and Proposition 3.3 that the asymptotic entropy distance enjoys a similar property.
Corollary 3.4
As another corollary we obtain the subadditivity properties of the intrinsic entropy distance and asymptotic entropy distance.
Corollary 3.5
It implies in particular that shifts are non-expanding maps in \(({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle ,\mathbf k )\) or \(({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle ,{\varvec{\upkappa }})\).
Corollary 3.6
Less obvious is the fact that \({\varvec{\upkappa }}\) is, in fact, translation invariant and in particular, \(({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle ,{\varvec{\upkappa }})\) satisfies the cancellation property. This is the subject of Proposition 3.7 below, which was communicated to us by Tobias Fritz.
Proposition 3.7
The proof of the proposition can be found in Sect. 7.
3.2.2 Entropy
Proposition 3.8
Again, the proof of the proposition above is an application of Shannon’s inequality, see Sect. 7 for details.
3.3 The Slicing Lemma
The Slicing Lemma, Proposition 3.9 below, allows us to estimate the intrinsic entropy distance between two diagrams in terms of the integrated intrinsic entropy distance between “slices”, which are diagrams obtained by conditioning on another probability space. It turns out to be a very powerful tool for estimating the intrinsic entropy distance and will be used below on several occasions.
Proposition 3.9
The idea of the proof of the Slicing Lemma (see Sect. 7) is as follows. For every pair \((u,v)\in \underline{W}\) we consider an optimal two-fan \(\mathscr {G}_{uv}\) coupling \(\mathscr {X}\lfloor u\) and \(\mathscr {Y}\lfloor v\). These fans have the same underlying diagram of sets. Then we construct a coupling between \(\mathscr {X}\) and \(\mathscr {Y}\) as a convex combination of the distributions of the \(\mathscr {G}_{uv}\)’s, weighted by \(p_{W}(u,v)\). The estimates on the resulting two-fan then imply the proposition.
Various implications of the Slicing Lemma are summarized in the next corollary.
Corollary 3.10
 1. Given a “two-tents” diagram $$\begin{aligned} \mathscr {X}{\mathop {\leftarrow }\limits ^{}}{\hat{\mathscr {X}}}{\mathop {\rightarrow }\limits ^{}}U{\mathop {\leftarrow }\limits ^{}}{\hat{\mathscr {Y}}}{\mathop {\rightarrow }\limits ^{}}\mathscr {Y}\end{aligned}$$ the following inequality holds
 2. Given a fan $$\begin{aligned} \mathscr {X}{\mathop {\leftarrow }\limits ^{}}{\hat{\mathscr {X}}}{\mathop {\rightarrow }\limits ^{}}U \end{aligned}$$ the following inequality holds
 3. Let \(\mathscr {X}{\mathop {\rightarrow }\limits ^{}}U\) be a reduction, then
 4. For a cofan \(\mathscr {X}{\mathop {\rightarrow }\limits ^{}}U{\mathop {\leftarrow }\limits ^{}}\mathscr {Y}\) the following inequality holds $$\begin{aligned} \mathbf k (\mathscr {X},\mathscr {Y}) \le \int _{U}\mathbf k (\mathscr {X}\lfloor u,\mathscr {Y}\lfloor u)\hbox {d}\,p_{U}(u). \end{aligned}$$
4 Distributions and types
In this section we recall some elementary inequalities for (relative) entropies and the total variation distance for distributions on finite sets. Furthermore, we generalize the notion of a probability distribution on a set to a distribution on a diagram of sets. Finally, we give a perspective on the theory of types, and also introduce types in the context of complete diagrams.
4.1 Distributions
4.1.1 Single probability spaces
For a finite set S we denote by \(\varDelta S\) the collection of all probability distributions on S. It is a unit simplex in the real vector space \(\mathbb {R}^S\). We often use the fact that it is a compact, convex set, whose interior points correspond to fully supported probability measures on S.
The total variation norm and relative entropy are related by the following inequality.
Lemma 4.1
The claim of the Lemma, Pinsker’s inequality, is well known, for instance in information theory; a proof can be found in [8].
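The statement of the lemma is not reproduced in this part of the text. As a numerical sanity check, the sketch below assumes the textbook form of Pinsker’s inequality, \(D(p\,\Vert \,q)\ge \tfrac{1}{2}|p-q|_{1}^{2}\) with entropies measured in nats; the normalization used in the lemma itself may differ. All function names are hypothetical.

```python
import math
import random

def relative_entropy(p, q):
    """D(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def l1_distance(p, q):
    """The l1 distance between two distributions on the same finite set."""
    return sum(abs(a - b) for a, b in zip(p, q))

def random_distribution(size, rng):
    """A random fully supported distribution on `size` points."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
pairs = [(random_distribution(4, rng), random_distribution(4, rng)) for _ in range(1000)]

# Assumed form of the inequality: D(p || q) >= 0.5 * |p - q|_1 ** 2.
pinsker_holds = all(
    relative_entropy(p, q) >= 0.5 * l1_distance(p, q) ** 2 - 1e-12 for p, q in pairs
)
print(pinsker_holds)
```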
4.1.2 Distributions on diagrams
A map \(f:S{\mathop {\rightarrow }\limits ^{}}S'\) between two finite sets induces an affine map \(f_{*}:\varDelta S{\mathop {\rightarrow }\limits ^{}}\varDelta S'\).
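For finite sets the induced map \(f_{*}\) simply adds up the weights of all points with the same image. A minimal sketch, with distributions stored as dictionaries and with the hypothetical helper names `pushforward` and `mix`, also checks affinity on one example:

```python
from collections import defaultdict

def pushforward(f, p):
    """Image f_* p of a distribution p = {s: prob} under a map f on points."""
    image = defaultdict(float)
    for s, w in p.items():
        image[f(s)] += w
    return dict(image)

def mix(t, p, q):
    """Convex combination t*p + (1-t)*q of two distributions on the same set."""
    return {s: t * p.get(s, 0.0) + (1 - t) * q.get(s, 0.0) for s in set(p) | set(q)}

def parity(s):
    return s % 2

p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

# Affinity: pushing forward a mixture equals mixing the pushforwards.
left = pushforward(parity, mix(0.3, p, q))
right = mix(0.3, pushforward(parity, p), pushforward(parity, q))
print(left, right)
```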
4.2 Types
We now briefly discuss the theory of types. Types are special subspaces of tensor powers that consist of sequences with the same “empirical distribution”, as explained in detail below. For a more detailed discussion the reader is referred to [7, 8]. We generalize the theory of types to complete diagrams of sets and complete diagrams of probability spaces.
The theory of types for diagrams that are not complete is more complex and will be addressed in a subsequent article.
4.2.1 Types for single probability spaces
For \(\pi \in {\varDelta ^{^{(n)}}} S\), the space \(T^{n}_{\pi }S:=\mathbf{q}^{-1}(\pi )\) equipped with the uniform measure is called a type over \(\pi \). The symmetric group \(\mathbb {S}_{n}\) acts on \(S^{n}\) by permuting the coordinates. This action leaves the empirical distribution invariant and can therefore be restricted to each type, where it acts transitively. Thus, for \(\pi \in {\varDelta ^{^{(n)}}} S\) the probability space \((T^{n}_{\pi }S,u)\), with u being the uniform (\(\mathbb {S}_{n}\)-invariant) distribution, is a homogeneous space.
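For a small alphabet a type can be enumerated by brute force. The sketch below is an illustration only: it assumes that the empirical map \(\mathbf{q}\) sends a word to its vector of relative symbol frequencies, and the names `empirical_distribution` and `type_class` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import factorial

def empirical_distribution(word):
    """Send a word of length n to its empirical distribution (exact fractions)."""
    n = len(word)
    counts = Counter(word)
    return frozenset((s, Fraction(c, n)) for s, c in counts.items())

def type_class(alphabet, n, pi):
    """All words of length n over `alphabet` whose empirical distribution is pi."""
    return [w for w in product(alphabet, repeat=n) if empirical_distribution(w) == pi]

alphabet = "ab"
pi = empirical_distribution("aabab")   # weight 3/5 on 'a' and 2/5 on 'b'
words = type_class(alphabet, 5, pi)

# The type is a single orbit of the symmetric group, so its size is a
# multinomial coefficient.
multinomial = factorial(5) // (factorial(3) * factorial(2))
print(len(words), multinomial)
```

Every word in the list is a rearrangement of `aabab`, which reflects the transitivity of the action of the symmetric group on the type.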
The following lemma records some standard facts about types, which can be checked by elementary combinatorics and found in [8].
Lemma 4.2
 1.
\(|\varDelta ^{^{(n)}}X| = \binom{n+|X|}{|X|} \le {\mathbf{e}}^{|X| \cdot \ln (n+1)} = {\mathbf{e}}^{\textsf {O}(|X| \cdot \ln n)}\)
 2.
\(p^{\otimes n}(\mathbf{x}) = \mathbf{e}^{-n\big [h(\mathbf{q}(\mathbf{x}))+D(\mathbf{q}(\mathbf{x})\,\Vert \,p)\big ]}\)
 3.
\(\mathbf{e}^{n \cdot h(\pi ) - |X| \cdot \ln (n+1)}\le |T^{n}_{\pi } \underline{X}| \le \mathbf{e}^{n\cdot h(\pi )}\) or \(|T^{n}_{\pi } \underline{X}|=\mathbf{e}^{n\cdot h(\pi )+\textsf {O}(|X|\cdot \ln n)}\)
 4.
\(\mathbf{e}^{-n\cdot D(\pi \,\Vert \,p)-|X| \cdot \ln (n+1)} \le \tau _n(\pi )=p^{\otimes n}(T^{n}_{\pi } \underline{X}) \le \mathbf{e}^{-n\cdot D(\pi \,\Vert \,p)}\) or \(\tau _n(\pi )= \mathbf{e}^{-n\cdot D(\pi \,\Vert \,p)+\textsf {O}(|X|\cdot \ln n)}\)
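The estimates in items 2 to 4 can be checked by brute force on a small example. The sketch below assumes the standard method-of-types form of these statements: a word with empirical distribution \(\pi \) has probability \(\mathbf{e}^{-n[h(\pi )+D(\pi \Vert p)]}\), the type has between \(\mathbf{e}^{n h(\pi )-|X|\ln (n+1)}\) and \(\mathbf{e}^{n h(\pi )}\) elements, and its total probability lies between \(\mathbf{e}^{-n D(\pi \Vert p)-|X|\ln (n+1)}\) and \(\mathbf{e}^{-n D(\pi \Vert p)}\). The distribution `p` and the word are arbitrary illustrative choices.

```python
import math
from collections import Counter
from itertools import product

def entropy(dist):
    """Shannon entropy (in nats) of a distribution {point: prob}."""
    return -sum(w * math.log(w) for w in dist.values() if w > 0)

def relative_entropy(pi, p):
    """D(pi || p) in nats; assumes p[s] > 0 wherever pi[s] > 0."""
    return sum(w * math.log(w / p[s]) for s, w in pi.items() if w > 0)

p = {"a": 0.7, "b": 0.2, "c": 0.1}
n = 6
word = tuple("aabacb")
pi = {s: c / n for s, c in Counter(word).items()}  # empirical distribution of the word

# Item 2: probability of a single word under the product measure.
prob_word = math.prod(p[s] for s in word)
prob_formula = math.exp(-n * (entropy(pi) + relative_entropy(pi, p)))

# Item 3: size of the type against its entropy bounds.
size = sum(1 for w in product(p, repeat=n) if Counter(w) == Counter(word))
lower = math.exp(n * entropy(pi) - len(p) * math.log(n + 1))
upper = math.exp(n * entropy(pi))

# Item 4: probability of the whole type.
tau = size * prob_word
tau_upper = math.exp(-n * relative_entropy(pi, p))
tau_lower = tau_upper * math.exp(-len(p) * math.log(n + 1))

print(prob_word, prob_formula)
print(lower, size, upper)
print(tau_lower, tau, tau_upper)
```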
Corollary 4.3
The following important theorem is known as Sanov’s theorem. It can be easily derived from Lemma 4.2; a proof can also be found in [8].
Theorem 4.4
Combining the estimate in Theorem 4.4 with Pinsker’s inequality in Lemma 4.1, we obtain the following corollary.
Corollary 4.5
4.3 Types for complete diagrams
In this subsection we generalize the theory of types to diagrams indexed by a complete poset category. The theory for non-complete diagrams is more complex and will be addressed in our future work. Before we describe our approach we need some preparatory material.
Now we are ready to give the definitions of types. Let \(\mathscr {X}\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \) be a complete diagram, \(\mathscr {X}=\left\{ X_{i};f_{ij}\right\} \) with initial space \(X_{0}\) and let \(\pi \in \varDelta ^{^{(n)}}\mathscr {X}\).
4.3.1 The empirical two-fan
Unlike in the case of single probability spaces, there is no empirical reduction from the power of \(\mathscr {X}\) to \(\varDelta \mathscr {X}\). It will be convenient for us to see the types as the power of the diagram conditioned on a distribution. This is achieved by including the power of the diagram into an empirical two-fan.
5 Distance between types
Our goal in this section is to estimate the intrinsic entropy distance between two types over two different distributions \(\pi _{1},\pi _{2}\in \varDelta ^{^{(n)}}\mathscr {S}\) in terms of the total variation distance \(|\pi _{1}-\pi _{2}|_1\).
For this purpose we use a “lagging” technique, which is explained below. In practice, we couple different types by randomly removing and inserting the appropriate number of symbols to pass from a trajectory of one type to a trajectory of the other.
5.1 The lagging trick
The next lemma uses the lagging two-fan to estimate the intrinsic entropy distance between its sink diagrams.
Lemma 5.1
Proof of Lemma 5.1
5.2 Distance between types
In this section we use the lagging trick as described above to estimate the distance between types over two different distributions in \(\varDelta \mathscr {S}\) where \(\mathscr {S}\) is a complete diagram of sets.
Proposition 5.2
The idea of the proof is to write p and q as a convex combination of a common distribution \({\hat{p}}\) and “small amounts” of \(p^{+}\) and \(q^{+}\), respectively. Then we use the lagging trick to estimate distances between types over p and \({\hat{p}}\), as well as between types over q and \({\hat{p}}\). We now present details of the proof.
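For distributions on a single finite set, one natural way to obtain such a decomposition is to take the normalized pointwise minimum of p and q as the common part. The sketch below shows this choice; it is an assumption made for illustration and not necessarily the construction used in the proof, whose details are not reproduced in this part of the text. The name `common_part_decomposition` is hypothetical.

```python
def common_part_decomposition(p, q):
    """Split p and q (lists of probabilities on the same finite set) as
        p = (1 - lam) * p_hat + lam * p_plus,
        q = (1 - lam) * p_hat + lam * q_plus,
    with lam = |p - q|_1 / 2.

    Assumes p != q and that the supports of p and q overlap, so that
    0 < lam < 1 and no division by zero occurs.
    """
    overlap = [min(a, b) for a, b in zip(p, q)]
    lam = 1.0 - sum(overlap)
    p_hat = [m / (1.0 - lam) for m in overlap]
    p_plus = [(a - m) / lam for a, m in zip(p, overlap)]
    q_plus = [(b - m) / lam for b, m in zip(q, overlap)]
    return lam, p_hat, p_plus, q_plus

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
lam, p_hat, p_plus, q_plus = common_part_decomposition(p, q)
print(lam, p_hat, p_plus, q_plus)
```

With this choice the weight of the “small amounts” is exactly half the \(\ell ^{1}\) distance between p and q, so it is small whenever p and q are close in total variation.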
Proof of Proposition 5.2
6 Asymptotic equipartition property for diagrams
Below we prove that any Bernoulli sequence of complete diagrams can be approximated by a sequence of homogeneous diagrams. This is essentially the Asymptotic Equipartition Theorem for diagrams.
Theorem 6.1
Proof
7 Technical proofs
This section contains some proofs that did not make it into the main text. The numbering of the claims in this section coincides with the numbering in the main text. Lemmas that first appear in this section are numbered within the section.
Proposition 2.1
 1.
A two-fan \(\mathscr {F}=(\mathscr {X}{\mathop {\leftarrow }\limits ^{}}\mathscr {Z}{\mathop {\rightarrow }\limits ^{}}\mathscr {Y})\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G},\varvec{\varLambda }_{2} \rangle \) of \(\mathbf{G}\)-diagrams is minimal if and only if the constituent two-fans of probability spaces \(\mathscr {F}_{i}=(X_{i}{\mathop {\leftarrow }\limits ^{}}Z_{i}{\mathop {\rightarrow }\limits ^{}}Y_{i})\) are all minimal.
 2. For any two-fan \(\mathscr {F}=(\mathscr {X}{\mathop {\leftarrow }\limits ^{}}\mathscr {Z}{\mathop {\rightarrow }\limits ^{}}\mathscr {Y})\) of \(\mathbf{G}\)-diagrams its minimal reduction exists, that is, there exists a minimal two-fan \(\mathscr {F}'=(\mathscr {X}{\mathop {\leftarrow }\limits ^{}}\mathscr {Z}'{\mathop {\rightarrow }\limits ^{}}\mathscr {Y})\) included in the following diagram
Before we go to the proof of Proposition 2.1, we will need the following lemma.
Lemma 7.1
Proof of Lemma 7.1
We define \(\rho '\) on the sink spaces of \(\mathscr {F}'\) to coincide with \(\rho \).
Proof of Proposition 2.1
 1.
\(\mathscr {F}_{i_{0}}\) is not minimal
 2.
for any \(j\in {\hat{J}}(i_{0}) \backslash \{ i_0 \}\) the two-fan \(\mathscr {F}_{j}\) is minimal.
To address the second assertion of Proposition 2.1, observe that the argument above gives an algorithm for the construction of a minimal reduction of any two-fan of \(\mathbf{G}\)-diagrams. \(\square \)
Proposition 3.1
Moreover, two diagrams \(\mathscr {X}, \mathscr {Y}\in {{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \) satisfy \(\mathbf k (\mathscr {X},\mathscr {Y})=0\) if and only if \(\mathscr {X}\) is isomorphic to \(\mathscr {Y}\) in \({{\mathrm{\mathbf {Prob}}}}\langle \mathbf{G} \rangle \).
Proof
The symmetry of \(\mathbf k \) is immediate. The non-negativity of \(\mathbf k \) follows from the fact that the entropy of the target space of a reduction is not greater than the entropy of the domain, which is a particular instance of the Shannon inequality (5).
We proceed to prove the triangle inequality. We will make use of the following lemma.
Lemma 7.2
Lemma 7.2 follows immediately from the Shannon inequality.
Suppose for now that \(\mathbf{G}=\bullet \) and we are given three probability spaces X, Y, Z together with the optimal couplings \(\mathscr {U}=(X{\mathop {\leftarrow }\limits ^{}}U{\mathop {\rightarrow }\limits ^{}}Y)\) and \(\mathscr {V}=(Y{\mathop {\leftarrow }\limits ^{}}V{\mathop {\rightarrow }\limits ^{}}Z)\) in the sense of optimization problem (9). Together they form a two-tents diagram \(\mathscr {T}=(X{\mathop {\leftarrow }\limits ^{}}U{\mathop {\rightarrow }\limits ^{}}Y{\mathop {\leftarrow }\limits ^{}}V{\mathop {\rightarrow }\limits ^{}}Z)\). If we can extend \(\mathscr {T}\) to a minimal full diagram \(\mathscr {Q}\) as in the assumption of Lemma 7.2, the triangle inequality would follow. The diagram \(\mathscr {Q}=\mathbf ad (\mathscr {T})\) can be constructed by the so-called adhesion, as explained below.
Finally, if \(\mathbf k (\mathscr {X}, \mathscr {Y}) = 0\), then there is a two-fan \(\mathscr {F}\) of \(\mathbf{G}\)-diagrams between \(\mathscr {X}\) and \(\mathscr {Y}\) with \(\text {kd}(\mathscr {F})= 0\), from which it follows that \(\mathscr {X}\) and \(\mathscr {Y}\) are isomorphic. \(\square \)
Proposition 3.3
Proof
Proposition 3.7
Proof
Proposition 3.8
Proof
Proposition 3.9
Proof
Notes
Acknowledgements
Open access funding provided by Max Planck Society. We would like to thank Tobias Fritz, František Matúš, Misha Movshev and Johannes Rauh for inspiring discussions. We are grateful to the participants of the Wednesday Morning Session at the CASA group at the Eindhoven University of Technology for valuable feedback on the introduction of the article. Finally, we thank the Max Planck Institute for Mathematics in the Sciences, Leipzig, for its hospitality.
References
 1. Ay, N., Bertschinger, N., Der, R., Güttler, F., Olbrich, E.: Predictive information and explorative behavior of autonomous robots. Eur. Phys. J. B 63(3), 329–339 (2008)
 2. Baez, J.C., Fritz, T., Leinster, T.: A characterization of entropy in terms of information loss. Entropy 13(11), 1945–1957 (2011)
 3. Boltzmann, L.: Über die mechanische Bedeutung des zweiten Hauptsatzes der Wärmetheorie. Wiener Berichte 53, 195–220 (1866)
 4. Boltzmann, L.: Vorlesungen über Gastheorie, vols. I, II. J.A. Barth, Leipzig (1896)
 5. Bertschinger, N., Rauh, J., Olbrich, E., Jost, J., Ay, N.: Quantifying unique information. Entropy 16(4), 2161–2183 (2014)
 6. Cicalese, F., Gargano, L., Vaccaro, U.: How to find a joint probability distribution of minimum entropy (almost) given the marginals. In: 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2173–2177. IEEE, New York (2017)
 7. Csiszár, I.: The method of types. IEEE Trans. Inf. Theory 44(6), 2505–2523 (1998) (Information theory: 1948–1998)
 8. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley Series in Telecommunications. A Wiley-Interscience Publication, Wiley, New York (1991)
 9. Friston, K.: The free-energy principle: a rough guide to the brain? Trends Cogn. Sci. 13(7), 293–301 (2009)
 10. Gromov, M.: In a search for a structure, part 1: on entropy (2012). Preprint. https://www.ihes.fr/~gromov/wpcontent/uploads/2018/08/structreserchentropyjuly52012.pdf and https://math.mit.edu/~dspivak/teaching/sp13/gromovEntropyViaCT.pdf. Accessed 08 Oct 2018
 11. Kocaoglu, M., Dimakis, A.G., Vishwanath, S., Hassibi, B.: Entropic causality and greedy minimum entropy coupling (2017). arXiv preprint. arXiv:1701.08254
 12. Kolmogorov, A.N.: New metric invariant of transitive dynamical systems and endomorphisms of Lebesgue spaces. Dokl. Russ. Acad. Sci. 119(5), 861–864 (1958)
 13. Kolmogorov, A.N.: New metric invariant of transitive dynamical systems and endomorphisms of Lebesgue spaces. Dokl. Russ. Acad. Sci. 124, 754–755 (1959)
 14. Kovacevic, M., Stanojevic, I., Senk, V.: On the hardness of entropy minimization and related problems. In: 2012 IEEE Information Theory Workshop (ITW), pp. 512–516. IEEE, New York (2012)
 15. MacLane, S.: Categories for the Working Mathematician. Graduate Texts in Mathematics, vol. 5. Springer, New York (1971)
 16. Matus, F.: Infinitely many information inequalities. In: IEEE International Symposium on Information Theory, 2007. ISIT 2007, pp. 41–44. IEEE, New York (2007)
 17. Ornstein, D.: Bernoulli shifts with the same entropy are isomorphic. Adv. Math. 4, 337–352 (1970)
 18. Steudel, B., Ay, N.: Information-theoretic inference of common ancestors. Entropy 17(4), 2304–2327 (2015)
 19. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)
 20. Shannon, C.: The lattice theory of information. Trans. IRE Prof. Group Inf. Theory 1(1), 105–107 (1953)
 21. Sinai, Y.G.: On the notion of entropy of a dynamical system. Dokl. Russ. Acad. Sci. 124, 768–771 (1959)
 22. Sinai, Y.G.: Introduction to Ergodic Theory. Princeton University Press, Princeton (1976) (Translated by V. Scheffer. Mathematical Notes, vol. 18)
 23. Van Dijk, S.G., Polani, D.: Informational constraints-driven organization in goal-directed behavior. Adv. Complex Syst. 16(02n03), 1350016 (2013)
 24. Vidyasagar, M.: A metric between probability distributions on finite sets of different cardinalities and applications to order reduction. IEEE Trans. Autom. Control 57(10), 2464–2477 (2012)
 25. Yeung, R.W.: A First Course in Information Theory. Springer Science & Business Media, Berlin (2012)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.