A CATEGORICAL CHARACTERIZATION OF RELATIVE ENTROPY ON STANDARD BOREL SPACES

Abstract. We give a categorical treatment, in the spirit of Baez and Fritz, of relative entropy for probability distributions defined on standard Borel spaces. We define a category called SbStat suitable for reasoning about statistical inference on standard Borel spaces. We define relative entropy as a functor into Lawvere's category $[0, \infty]$ and we show convexity, lower semicontinuity and uniqueness.


Introduction
The inspiration for the present work comes from two recent developments. The first is the beginning of a categorical understanding of Bayesian inversion and learning [DG15, DDG16, CDDG17, DSDG18]. The second is a categorical reconstruction of relative entropy [BFL11, BF14, Lei]. The present paper provides a categorical treatment of entropy in the spirit of Baez and Fritz in the setting of standard Borel spaces, thus setting the stage to explore the role of entropy in learning.
Recently there have been some exciting developments that bring categorical insights to probability theory and specifically to learning theory. These are reported in some recent papers by Clerc, Dahlqvist, Danos and Garnier [DG15, DDG16, CDDG17]. The first of these papers showed how to view the Dirichlet distribution as a natural transformation, thus opening the way to an understanding of higher-order probabilities, while the second gave a powerful framework for constructing several natural transformations. In [DG15] the hope was expressed that one could use these ideas to understand Bayesian inversion, a core concept in machine learning. In [CDDG17] this was realized in a remarkably novel way.
These papers carry out their investigations in the setting of standard Borel spaces and are based on the Giry monad [Gir81, Law64].
In [BFL11, BF14] a beautiful treatment of relative entropy is given in categorical terms. The basic idea is to understand entropy in terms of the results of experiments and observations. How much does one learn about a probabilistic situation by doing experiments and observing the results? A category is set up where the morphisms capture the interplay between the original space and the space of observations. In order to interpret the relative entropy as a functor they use Lawvere's category, which consists of a single object and a morphism for every extended positive real number [Law73].
Our contribution is to develop the theory of Baez et al. in the setting of standard Borel spaces; their work is carried out with finite sets. While the work of [BF14] gives a firm conceptual direction, it gives little guidance in the actual development of the mathematical theory. We had to redevelop the mathematical framework and find the right analogues for the concepts appropriate to the finite case.

Background
In this section we review some of the background. We assume that the reader is familiar with concepts from topology and measure theory as well as basic category theory. We have found books by Ash [Ash72], Billingsley [Bil95] and Dudley [Dud89] to be useful.
We will use letters like X, Y, Z for measurable spaces and capital Greek letters like Σ, Λ, Ω for σ-algebras. We will use p, q, ... for probability measures. Given measurable spaces (X, Σ) and (Y, Λ), a measurable function f : X → Y and a probability measure p on (X, Σ), we obtain a measure on (Y, Λ) by $p \circ f^{-1}$; this is called the pushforward measure or the image measure.
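For intuition, the pushforward can be sketched on a finite space, where a probability measure is just a table of point masses. The following is a minimal illustration only; the function name `pushforward` and the example data are assumptions made for this sketch and do not come from the paper.

```python
from collections import defaultdict

def pushforward(p, f):
    """Image measure p ∘ f^{-1} of a finite distribution p (dict: point -> mass)."""
    q = defaultdict(float)
    for x, mass in p.items():
        q[f(x)] += mass  # all the mass sitting at x is moved to f(x)
    return dict(q)

# Hypothetical example: a distribution on {0, 1, 2, 3} pushed along the parity map.
p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
q = pushforward(p, lambda x: x % 2)
```

Here `q` assigns to each parity class the total mass of its preimage, which is exactly the defining property $q(B) = p(f^{-1}(B))$ on singletons.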
2.1. The Giry monad. We denote the category of measurable spaces and measurable functions by Mes. We recall the Giry [Gir81] functor Γ : Mes → Mes, which maps each measurable space X to the space Γ(X) of probability measures over X. For each A ∈ Σ, we define $\mathrm{ev}_A : \Gamma(X) \to [0, 1]$ by $\mathrm{ev}_A(p) = p(A)$. We endow Γ(X) with the smallest σ-algebra making all the $\mathrm{ev}_A$ measurable. A morphism f : X → Y in Mes is mapped to $\Gamma(f) : \Gamma(X) \to \Gamma(Y)$, given by $\Gamma(f)(p) = p \circ f^{-1}$. With the following natural transformations, this endofunctor is a monad: the Giry monad. The natural transformation η : I → Γ is given by $\eta_X(x) = \delta_x$, the Dirac measure concentrated at x. The monad multiplication $\mu : \Gamma^2 \to \Gamma$ is given by
$$\mu_X(p)(A) := \int_{\Gamma(X)} \mathrm{ev}_A \, dp,$$
where p is a probability measure in Γ(Γ(X)) and $\mathrm{ev}_A : \Gamma(X) \to [0, 1]$ is the measurable function on Γ(X) defined by $\mathrm{ev}_A(p) = p(A)$.

Even if Mes is an interesting category in and of itself, the need for regular conditional probabilities forces us to restrict ourselves to a subcategory of standard Borel spaces.
2.2. Standard Borel spaces and disintegration. The Radon-Nikodym theorem is the main tool used to show the existence of conditional probability distributions, also called Markov kernels; see the discussion below. It is a very general theorem, but it does not give as strong regularity features as one might want. A stronger theorem is needed; this is the so-called disintegration theorem. It requires stronger hypotheses on the space on which the kernels are being defined. A category of spaces that satisfy these stronger hypotheses is the category of standard Borel spaces. In order to define standard Borel spaces, we must first define Polish spaces.
Definition 2.1. A Polish space is a separable, completely metrizable topological space.
Definition 2.2. A standard Borel space is a measurable space obtained by forgetting the topology of a Polish space but retaining its Borel algebra. The category of standard Borel spaces has measurable functions as morphisms; we denote it by StBor.
We can now state a version of the disintegration theorem. The following is also known as Rohlin's disintegration theorem.
Theorem 2.3 [Rok49]. Let (X, p) and (Y, q) be two standard Borel spaces equipped with probability measures, where q is the pushforward measure $q := p \circ f^{-1}$ for a Borel measurable function f : X → Y. Then, there exists a q-almost everywhere uniquely determined family of probability measures $\{p_y\}_{y \in Y}$ on X such that
(1) the function $y \mapsto p_y(A)$ is Borel-measurable for each Borel-measurable set A ⊂ X;
(2) $p_y$ is a probability measure on $f^{-1}(y)$ for q-almost all y ∈ Y;
(3) for every Borel-measurable function $h : X \to [0, \infty]$,
$$\int_X h \, dp = \int_Y \int_{f^{-1}(y)} h \, dp_y \, dq.$$
For an arrow s : Y → Γ(X) in StBor, we write $s_y$ for s(y) or, in kernel form, $s(y, \cdot)$. For arrows t : Z → Γ(Y) and s : Y → Γ(X) in StBor, we denote their Kleisli composition by $s \circ t : Z \to \Gamma(X)$, where
$$(s \circ t)_z(A) := \int_Y s_y(A) \, dt_z(y).$$
In particular, for a probability measure q on Y we have $(s \circ q)(A) = \int_Y s_y(A) \, dq(y)$. For standard Borel spaces equipped with a probability measure p, we sometimes omit the measure in the notation, i.e., we sometimes write X instead of (X, p). We say a probability measure p is absolutely continuous with respect to another measure q on the same measurable space X, denoted by $p \ll q$, if for all measurable sets B, q(B) = 0 implies that p(B) = 0.
We note that absolute continuity is preserved by Kleisli composition; the proof is straightforward.
Proposition 2.4. Let Y be a standard Borel space with probability measures q and q′ such that $q \ll q'$. Then, for an arbitrary standard Borel space X and morphism s from Y to Γ(X), we have $s \circ q \ll s \circ q'$.
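On finite sets, Kleisli composition of a kernel with a measure is a matrix-vector product, and Proposition 2.4 can be checked by inspection. The sketch below is only an illustration under that finite assumption; the helper names `kleisli_apply` and `abs_continuous` and the sample kernel are hypothetical.

```python
def kleisli_apply(s, q):
    """(s ∘ q)(x) = sum over y of s[y][x] * q[y], for a finite kernel s and measure q."""
    out = {}
    for y, mass in q.items():
        for x, w in s[y].items():
            out[x] = out.get(x, 0.0) + w * mass
    return out

def abs_continuous(p, q):
    """p << q on a finite set: every point charged by p is also charged by q."""
    return all(q.get(x, 0.0) > 0 for x, mass in p.items() if mass > 0)

# Hypothetical kernel from Y = {"a", "b"} to X = {0, 1, 2}.
s = {"a": {0: 0.5, 1: 0.5}, "b": {2: 1.0}}
q = {"a": 1.0, "b": 0.0}
q_prime = {"a": 0.5, "b": 0.5}

sq = kleisli_apply(s, q)
sq_prime = kleisli_apply(s, q_prime)
```

In this example `q` is absolutely continuous with respect to `q_prime`, and so is `sq` with respect to `sq_prime`; the converse direction fails for both pairs, because `q_prime` charges the point "b" that `q` does not.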

The categorical setting
In this section, following Baez and Fritz [BF14] (see also [BFL11]), we describe the category FinStat which they use for their characterization of entropy on finite spaces. We then introduce the category SbStat, which will be the arena for the generalization to standard Borel spaces.
Before doing so, we define the notion of coherence, which will play an important role in what follows.
Definition 3.1. Given standard Borel spaces X and Y with probability measures p and q, respectively, a pair (f, s), with f : (X, p) → (Y, q) and s : Y → Γ(X) measurable, is said to be coherent when f is measure preserving, i.e., $q = p \circ f^{-1}$, and $s_y$ is a probability measure on $f^{-1}(y)$ q-almost everywhere. If, in addition, p is absolutely continuous with respect to $s \circ q$, then we say that (f, s) is absolutely coherent.
Definition 3.2. The category FinStat has
• Objects: Pairs (X, p) where X is a finite set and p a probability measure on X.
• Morphisms: Coherent pairs (f, s) : (X, p) → (Y, q).
We compose arrows (f, s) : (X, p) → (Y, q) and (g, t) : (Y, q) → (Z, m) as follows: $(g, t) \circ (f, s) := (g \circ f, s \circ t)$.
We now leave the finite world for a more general one: the category SbStat.
Definition 3.3. The category SbStat has
• Objects: Pairs (X, p) where X is a standard Borel space and p a probability measure on the Borel subsets of X.
• Morphisms: Coherent pairs (f, s) : (X, p) → (Y, q).
We compose arrows (f, s) : (X, p) → (Y, q) and (g, t) : (Y, q) → (Z, m) as follows: $(g, t) \circ (f, s) := (g \circ f, s \circ t)$. Note that the identity arrow on object (X, p) is $(\mathrm{id}_X, \eta_X)$, where $\mathrm{id}_X$ is the identity function on X. Following the graphical representation from [BF14] we represent composition as follows:
[Diagram: composition of the arrows (f, s) and (g, t).]
One can think of f as a measurement process from X to Y and of s as a hypothesis about X given an observation in Y. We say that a hypothesis s is optimal (see footnote 3) if $p = s \circ q$. We denote by FP the subcategory of SbStat consisting of the same objects, but with only those morphisms where the hypothesis is optimal. See [BFL11, BF14] and [Lei] for a discussion of these ideas in the finite case.
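In the finite case the composition rule can be made concrete: the measurement maps compose as functions and the hypotheses compose as stochastic matrices, in the opposite order. The following sketch assumes finite sets encoded as Python dictionaries; the names `compose` and `is_coherent_support` and the example arrows are hypothetical and serve only as an illustration.

```python
def compose(fs, gt):
    """(g, t) ∘ (f, s) := (g ∘ f, s ∘ t) for finite maps and kernels given as dicts."""
    f, s = fs
    g, t = gt
    gf = {x: g[y] for x, y in f.items()}  # ordinary composition of the measurement maps
    st = {}
    for z, t_z in t.items():              # Kleisli composition of the hypotheses
        row = {}
        for y, w in t_z.items():
            for x, v in s[y].items():
                row[x] = row.get(x, 0.0) + v * w
        st[z] = row
    return gf, st

def is_coherent_support(f, s):
    """Check that each s[y] is concentrated on the fibre f^{-1}(y)."""
    return all(f[x] == y for y, s_y in s.items() for x, w in s_y.items() if w > 0)

# Hypothetical arrows X = {0,1,2,3} -> Y = {"even","odd"} -> Z = {"*"}.
f = {0: "even", 1: "odd", 2: "even", 3: "odd"}
s = {"even": {0: 0.25, 2: 0.75}, "odd": {1: 1.0}}
g = {"even": "*", "odd": "*"}
t = {"*": {"even": 0.5, "odd": 0.5}}

gf, st = compose((f, s), (g, t))
```

The composite hypothesis `st` is again concentrated on the fibres of `gf`, which is the support part of coherence for the composite pair.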
Proposition 3.4. Given coherent pairs (f, s) : (X, p) → (Y, q) and (g, t) : (Y, q) → (Z, m), the composition $(g \circ f, s \circ t)$ is coherent. If, in addition, they are absolutely coherent, the composition is absolutely coherent.
Proof. We first show that the composition is coherent, i.e., $\eta_Z = (\Gamma(g) \circ \Gamma(f)) \circ (\mu_X \circ \Gamma(s) \circ t)$. It is sufficient to show that the following diagram commutes:
[Commutative diagram consisting of a left-hand square, a triangle and a right-hand square.]
Using the hypothesis that $\eta_Z = \Gamma(g) \circ t$ and the fact that $\mathrm{Id} = \mu \circ \Gamma(\eta)$, we get that the right-hand square commutes. The triangle commutes since it is the application of Γ to our hypothesis $\eta_Y = \Gamma(f) \circ s$, and the left-hand square commutes because µ is a natural transformation. Therefore, the whole diagram commutes and we have thus shown that the composition of coherent morphisms is also coherent.
Next, in addition, assume the pairs (f, s) and (g, t) are absolutely coherent. We show $p \ll s \circ t \circ m$. By hypothesis, $p \ll s \circ q$ and $q \ll t \circ m$. Using Proposition 2.4 on $q \ll t \circ m$, we get $s \circ q \ll s \circ t \circ m$, and hence $p \ll s \circ t \circ m$ by transitivity of absolute continuity.
Footnote 3. For a coherent pair (f, s), asking s to be optimal is equivalent to asking that (f, s) satisfies condition (3) in Theorem 2.3, as will be shown in Lemma 4.3.

We end this section by defining one more category; this one is due to Lawvere [Law73]. It is just the set [0, ∞] but endowed with categorical structure: it has a single object •, one morphism for each element of [0, ∞], and composition given by addition. This allows numerical values associated with morphisms to be regarded as functors.
This is a remarkable category with monoidal closed structure and many other interesting properties.

Relative entropy functor
We recapitulate the definition of the relative entropy functor on FinStat from Baez and Fritz [BF14] and then extend it to SbStat.
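Definition 4.1. The relative entropy functor $\mathrm{RE}_{\mathrm{fin}}$ is defined from FinStat to [0, ∞] as follows:
• On Objects: It maps every object (X, p) to •.
• On Morphisms: It maps a morphism (f, s) : (X, p) → (Y, q) to
$$\mathrm{RE}_{\mathrm{fin}}((f, s)) := \sum_{x \in X} p(x) \log \frac{p(x)}{(s \circ q)(x)}.$$
The convention from now on will be that $\infty \cdot 0 = 0 \cdot \infty = 0$. We extend $\mathrm{RE}_{\mathrm{fin}}$ from FinStat to SbStat.
Definition 4.2. The relative entropy functor RE is defined from SbStat to [0, ∞] as follows:
• On Objects: It maps every object (X, p) to •.
• On Morphisms: Given a coherent morphism (f, s) : (X, p) → (Y, q),
$$\mathrm{RE}((f, s)) := \int_X \log\left(\frac{dp}{d(s \circ q)}\right) dp$$
if (f, s) is absolutely coherent, and $\mathrm{RE}((f, s)) := \infty$ otherwise, where $\frac{dp}{d(s \circ q)}$ is the Radon-Nikodym derivative of p with respect to $s \circ q$. This quantity is also known as the Kullback-Leibler divergence.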
We could have defined our category to have only absolutely coherent morphisms, but it would make the comparison with the finite case more awkward as the finite case does not assume the morphisms to be absolutely coherent. The present definition leads to slightly awkward proofs where we have to consider absolutely coherent pairs and ordinary coherent pairs separately.
Clearly, RE restricts to $\mathrm{RE}_{\mathrm{fin}}$ on FinStat. If (f, s) is absolutely coherent, then p is absolutely continuous with respect to $s \circ q$ and the Radon-Nikodym derivative is defined. The relative entropy is always non-negative [KL51]; this is an easy consequence of Jensen's inequality. This shows that RE is defined everywhere in SbStat.
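On a finite set the Radon-Nikodym derivative is a ratio of point masses, so the value of the functor on a morphism can be computed directly. The sketch below assumes natural logarithms and finite distributions stored as dictionaries; the function name `relative_entropy` and the sample distributions are hypothetical choices for illustration.

```python
import math

def relative_entropy(p, r):
    """Sum of p(x) * log(p(x) / r(x)); returns math.inf when p is not << r."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # points not charged by p contribute nothing
        rx = r.get(x, 0.0)
        if rx == 0:
            return math.inf  # absolute continuity fails at x
        total += px * math.log(px / rx)
    return total

# Hypothetical distributions on {0, 1}; r plays the role of s ∘ q.
p = {0: 0.5, 1: 0.5}
r = {0: 0.25, 1: 0.75}

value = relative_entropy(p, r)
same = relative_entropy(p, p)
not_abs_cont = relative_entropy(p, {0: 1.0})
```

The three values illustrate the three regimes discussed above: a strictly positive finite value, the value 0 when the two measures coincide, and ∞ when absolute continuity fails.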
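We will use the following notation occasionally: [display missing]. It is easy to see that RE sends the identity arrows of SbStat to 0, the identity arrow of the unique object • of [0, ∞]. Hence, in order to show that RE is indeed a functor, it suffices to show that
$$\mathrm{RE}((g, t) \circ (f, s)) = \mathrm{RE}((g, t)) + \mathrm{RE}((f, s)).$$
In order to do so, we will need the following two lemmas.
Lemma 4.3. Given a coherent pair $(f, s) : (X, s \circ q) \to (Y, q)$, the kernel s is a disintegration of $s \circ q$ along f, i.e., $s_y = (s \circ q)_y$ for q-almost every y.
Proof. We just have to show that $\{s_y\}_{y \in Y}$ satisfies the three properties implied by the disintegration theorem. We prove the third one, the first two being obvious.
(iii): For every Borel-measurable function $h : X \to [0, \infty]$,
$$\int_X h \, d(s \circ q) = \int_Y \int_{f^{-1}(y)} h \, ds_y \, dq.$$
Let us assume as a special case that h is the indicator function of a measurable set E ⊂ X. Then, by the definition of Kleisli composition and because $s_y$ is concentrated on $f^{-1}(y)$ for q-almost every y, we have
$$\int_X h \, d(s \circ q) = (s \circ q)(E) = \int_Y s_y(E) \, dq = \int_Y \int_{f^{-1}(y)} h \, ds_y \, dq.$$
We have shown that it is true for any indicator function. By linearity, it is true for any simple function and then, by the monotone convergence theorem, it is true for all Borel-measurable functions $h : X \to [0, \infty]$.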
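Lemma 4.4. The relative entropy is preserved under pre-composition by optimal hypotheses, i.e., for any (g, t) : (Y, q) → (Z, m) and $(f, s) : (X, s \circ q) \to (Y, q)$, we have
$$\mathrm{RE}((g, t) \circ (f, s)) = \mathrm{RE}((g, t)).$$
Proof. Case I: (g, t) is absolutely coherent. Since (g, t) is absolutely coherent, so is $(g \circ f, s \circ t)$. Because f is measure preserving, it is sufficient to show that the following functions on X are equal $(s \circ t \circ m)$-almost everywhere:
$$\frac{d(s \circ q)}{d(s \circ t \circ m)} = \frac{dq}{d(t \circ m)} \circ f.$$
By the Radon-Nikodym theorem, it is sufficient to show that for any measurable set E ⊂ X, we have
$$\int_E \frac{dq}{d(t \circ m)} \circ f \, d(s \circ t \circ m) = (s \circ q)(E).$$
The following calculation establishes the above.
\begin{align}
\int_E \frac{dq}{d(t \circ m)} \circ f \, d(s \circ t \circ m)
&= \int_Y \int_{f^{-1}(y)} 1_E \cdot \left(\frac{dq}{d(t \circ m)} \circ f\right) d(s \circ t \circ m)_y \, d(t \circ m) \tag{4.1} \\
&= \int_Y \frac{dq}{d(t \circ m)}(y) \int_{f^{-1}(y)} 1_E \, d(s \circ t \circ m)_y \, d(t \circ m) \tag{4.2} \\
&= \int_Y \frac{dq}{d(t \circ m)}(y) \int_{f^{-1}(y)} 1_E \, ds_y \, d(t \circ m) \tag{4.3} \\
&= \int_Y \frac{dq}{d(t \circ m)}(y) \, s_y(E) \, d(t \circ m) \tag{4.4} \\
&= \int_Y s_y(E) \, dq \tag{4.5} \\
&= (s \circ q)(E). \tag{4.6}
\end{align}
We get (4.1) by applying the disintegration theorem to $f : (X, s \circ t \circ m) \to (Y, t \circ m)$. The equation (4.2) follows by using the fact that $\frac{dq}{d(t \circ m)} \circ f$ is constant on $f^{-1}(y)$ for every y. To obtain (4.3) we apply Lemma 4.3. To show (4.4) we use the fact that $s_y$ is a probability measure on $f^{-1}(y)$. We get (4.5) by the definition of the Radon-Nikodym derivative and we finally establish (4.6) by the definition of Kleisli composition.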
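Case II: (g, t) is not absolutely coherent. We have $\mathrm{RE}((g, t)) = \infty$. We show that $(g \circ f, s \circ t)$ is not absolutely coherent, i.e., $s \circ q$ is not absolutely continuous with respect to $s \circ t \circ m$. Since, by hypothesis, $q \ll t \circ m$ doesn't hold, there exists a measurable set B ⊂ Y such that $(t \circ m)(B) = 0$ but $q(B) > 0$. We argue that $(s \circ t \circ m)(f^{-1}(B)) = 0$ and $(s \circ q)(f^{-1}(B)) > 0$. On one hand, we have
$$(s \circ t \circ m)(f^{-1}(B)) = (t \circ m)(B) = 0.$$
But on the other hand, since f is a measure preserving map from $(X, s \circ q)$ to (Y, q), we have $(s \circ q)(f^{-1}(B)) = q(B) > 0$. Hence both sides of the claimed equality are ∞.
Theorem 4.5 (Functoriality). Given arrows (f, s) : (X, p) → (Y, q) and (g, t) : (Y, q) → (Z, m), we have
$$\mathrm{RE}((g, t) \circ (f, s)) = \mathrm{RE}((g, t)) + \mathrm{RE}((f, s)).$$
Proof. Note that by definition, $\mathrm{RE}((g, t) \circ (f, s)) = \mathrm{RE}((g \circ f, s \circ t))$.
Case I: (f, s) and (g, t) are absolutely coherent. By Proposition 3.4, we have that $(g \circ f, s \circ t)$ is absolutely coherent. Hence
\begin{align}
\mathrm{RE}((g \circ f, s \circ t))
&= \int_X \log\left(\frac{dp}{d(s \circ t \circ m)}\right) dp \notag \\
&= \int_X \log\left(\frac{dp}{d(s \circ q)}\right) dp + \int_X \log\left(\frac{d(s \circ q)}{d(s \circ t \circ m)}\right) dp \tag{4.7} \\
&= \mathrm{RE}((f, s)) + \mathrm{RE}((g, t)). \tag{4.8}
\end{align}
We get (4.7) by the chain rule for Radon-Nikodym derivatives and (4.8) by applying Lemma 4.4.
Case II: (g, t) is not absolutely coherent. We argue that $(g \circ f, s \circ t)$ is not absolutely coherent. By hypothesis, $q \ll t \circ m$ doesn't hold, so there is a measurable set B ⊂ Y such that $(t \circ m)(B) = 0$ and $q(B) > 0$. We show that $(s \circ t \circ m)(f^{-1}(B)) = 0$ and $p(f^{-1}(B)) > 0$. On one hand, we have
$$(s \circ t \circ m)(f^{-1}(B)) = (t \circ m)(B) = 0,$$
but on the other hand, we have $p(f^{-1}(B)) = q(B) > 0$. Therefore $\mathrm{RE}((g, t) \circ (f, s)) = \infty = \mathrm{RE}((g, t)) + \mathrm{RE}((f, s))$.
Case III: (f, s) is not absolutely coherent.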
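This case is not analogous to the previous case since the existence of a measurable set A ⊂ X such that $(s \circ q)(A) = 0$ and $p(A) > 0$ is, surprisingly, not enough to conclude that $(s \circ t \circ m)(A) = 0$. By the hypothesis of (f, s) not being absolutely coherent, $p \ll s \circ q$ doesn't hold, so there is a measurable set A ⊂ X such that $(s \circ q)(A) = 0$ and $p(A) > 0$.
We partition A into
$$A_0 := A \cap f^{-1}(B_0), \qquad A_\epsilon := A \cap f^{-1}(B_\epsilon),$$
and we partition Y into
$$B_0 := \{y \in Y : s_y(A) = 0\}, \qquad B_\epsilon := \{y \in Y : s_y(A) > 0\}.$$
We argue that $(s \circ t \circ m)(A_0) = 0$ and $p(A_0) > 0$. Since $A_0 \subset f^{-1}(B_0)$, the set $A_0$ is disjoint from $f^{-1}(B_\epsilon)$, so for all $y \in B_\epsilon$ we have $s_y(A_0) = 0$ because their support is disjoint from $A_0$. For $y \in B_0$ we have $s_y(A_0) \le s_y(A) = 0$. On one hand, we thus have
$$(s \circ t \circ m)(A_0) = \int_Y s_y(A_0) \, d(t \circ m) = 0.$$
On the other hand, since we have $p(A_0) + p(A_\epsilon) = p(A) > 0$ and $A_\epsilon \subset f^{-1}(B_\epsilon)$, it suffices to show $p(f^{-1}(B_\epsilon)) = 0$ to conclude $p(A_0) > 0$.
By hypothesis, we have
$$0 = (s \circ q)(A) = \int_Y s_y(A) \, dq,$$
so $q(B_\epsilon) = 0$ and therefore $p(f^{-1}(B_\epsilon)) = q(B_\epsilon) = 0$. This completes the proof of this case.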
We have thus shown that RE is a well-defined functor from SbStat to [0, ∞].
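Functoriality can be sanity-checked numerically on a small finite example, where every integral is a finite sum. The sketch below is an illustration only: the helper names and the particular measures and kernels are hypothetical, and natural logarithms are assumed.

```python
import math

def kleisli_apply(s, q):
    """(s ∘ q)(x) = sum over y of s[y][x] * q[y]."""
    out = {}
    for y, mass in q.items():
        for x, w in s[y].items():
            out[x] = out.get(x, 0.0) + w * mass
    return out

def relative_entropy(p, r):
    """Sum of p(x) * log(p(x) / r(x)); math.inf when p is not << r."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        rx = r.get(x, 0.0)
        if rx == 0:
            return math.inf
        total += px * math.log(px / rx)
    return total

# Hypothetical data: f is the parity map {0,1,2,3} -> {0,1}, g collapses {0,1} to {0}.
p = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}              # measure on X
q = {0: 0.4, 1: 0.6}                              # pushforward of p along f
m = {0: 1.0}                                      # pushforward of q along g
s = {0: {0: 0.5, 2: 0.5}, 1: {1: 0.25, 3: 0.75}}  # hypothesis Y -> Γ(X), fibrewise
t = {0: {0: 0.5, 1: 0.5}}                         # hypothesis Z -> Γ(Y)

re_fs = relative_entropy(p, kleisli_apply(s, q))
re_gt = relative_entropy(q, kleisli_apply(t, m))
re_comp = relative_entropy(p, kleisli_apply(s, kleisli_apply(t, m)))
```

Up to floating-point error, `re_comp` agrees with `re_fs + re_gt`, as Theorem 4.5 predicts for absolutely coherent pairs.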
4.1. Convex linearity. We show below that the relative entropy functor satisfies a convex linearity property. In [BF14] convexity looks familiar; here, since we are performing "large" sums, we have to express it as an integral. First we define a localized version of the relative entropy.
Note that Lemma 4.3 says that $s_y = (s \circ q)_y$ q-almost everywhere. Thus, in the following there is no notational clash between the kernel $s_y$ and $(s \circ q)_y$, the latter being the disintegration of $s \circ q$ along f.
Given an arrow (f, s) : (X, p) → (Y, q) in SbStat and a point y ∈ Y, we denote by $(f, s)_y$ the morphism (f, s) restricted to the pair of standard Borel spaces $f^{-1}(y)$ and {y}. Explicitly,
$$(f, s)_y := \left(f|_{f^{-1}(y)}, \, s|_{\{y\}}\right) : (f^{-1}(y), p_y) \to (\{y\}, \delta_y),$$
where $\delta_y$ is the one and only probability measure on {y}.
Definition 4.6. A functor F from SbStat to [0, ∞] is convex linear if for every arrow (f, s) : (X, p) → (Y, q), we have
$$F((f, s)) = \int_Y F((f, s)_y) \, dq.$$
We will sometimes refer to the relative entropy of $(f, s)_y$ as the local relative entropy of (f, s) at y. Before proving that RE is convex linear, we first prove the following lemma.
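Lemma 4.7. Given $f : (X, p) \to (Y, q)$ and $f : (X, p') \to (Y, q)$, where f is a measurable map preserving the measure of both Borel probability measures p and p′. If $p \ll p'$, then $\frac{dp_y}{dp'_y}$ is defined for q-almost every y and
$$\frac{dp_y}{dp'_y} = \frac{dp}{dp'} \quad \text{on } f^{-1}(y).$$
Proof. For an arbitrary measurable function $h : X \to [0, \infty]$, by first applying the Radon-Nikodym theorem and then the disintegration theorem on the measurable function $h \frac{dp}{dp'}$, we get
$$\int_Y \int_{f^{-1}(y)} h \, dp_y \, dq = \int_X h \, dp = \int_X h \frac{dp}{dp'} \, dp' = \int_Y \int_{f^{-1}(y)} h \frac{dp}{dp'} \, dp'_y \, dq.$$
Hence, for q-almost every y, we must have $\frac{dp_y}{dp'_y} = \frac{dp}{dp'}$ on $f^{-1}(y)$.
Theorem 4.8 (Convex Linearity). The functor RE is convex linear, i.e., for every arrow (f, s) : (X, p) → (Y, q), we have
$$\mathrm{RE}((f, s)) = \int_Y \mathrm{RE}((f, s)_y) \, dq.$$
Proof. Case I: (f, s) is absolutely coherent.
We have
\begin{align}
\mathrm{RE}((f, s))
&= \int_X \log\left(\frac{dp}{d(s \circ q)}\right) dp \notag \\
&= \int_Y \int_{f^{-1}(y)} \log\left(\frac{dp}{d(s \circ q)}\right) dp_y \, dq \tag{4.9} \\
&= \int_Y \int_{f^{-1}(y)} \log\left(\frac{dp_y}{d(s \circ q)_y}\right) dp_y \, dq \tag{4.10} \\
&= \int_Y \mathrm{RE}((f, s)_y) \, dq. \notag
\end{align}
We get (4.9) by the disintegration theorem and (4.10) by applying Lemma 4.7.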
Case II: (f, s) is not absolutely coherent. By the hypothesis of (f, s) not being absolutely coherent, there is a measurable set A ⊂ X such that $(s \circ q)(A) = 0$ and $p(A) > 0$.
Applying Lemma 4.3, on one hand we have
$$\int_Y (s \circ q)_y(A) \, dq = \int_Y s_y(A) \, dq = (s \circ q)(A) = 0,$$
but on the other hand we have $\int_Y p_y(A) \, dq = p(A) > 0$.
Hence, the subset of Y on which $p_y \ll (s \circ q)_y$ doesn't hold contains a set of measure strictly greater than 0. Therefore,
$$\int_Y \mathrm{RE}((f, s)_y) \, dq = \infty = \mathrm{RE}((f, s)).$$
4.2. Lower-semi-continuity. Recall that a sequence of probability measures $p_n$ converges strongly to p, denoted by $p_n \to p$, if for every measurable set E, one has $\lim_{n \to \infty} p_n(E) = p(E)$.
The singleton set equipped with the trivial measure, which we will denote by (1, δ), is a weakly terminal object of SbStat: for every (X, p) there exists a (not necessarily unique) arrow (f, s) : (X, p) → (1, δ) in SbStat.
Definition 4.9. A functor F from SbStat to [0, ∞] is lower semi-continuous if for every arrow (f, s) : (X, p) → (1, δ), whenever $p_n \to p$ and $s_n \to s$, then
$$F((f, s)) \le \liminf_{n \to \infty} F((f, s_n)),$$
where $(f, s_n) : (X, p_n) \to (1, \delta)$.
Recall that in [BF14], lower semicontinuity was defined on FinStat as follows.
Definition 4.10 (Baez and Fritz). A functor F : FinStat → [0, ∞] is lower semicontinuous if for any sequence of morphisms $(f, s_i) : (X, p_i) \to (Y, q_i)$ that converges to a morphism (f, s) : (X, p) → (Y, q), we have
$$F((f, s)) \le \liminf_{i \to \infty} F((f, s_i)).$$
Recall that FP stands for the subcategory of SbStat consisting of the same objects, but with only those morphisms where the hypothesis is optimal. We claim that a lower semi-continuous (as defined in Definition 4.9) functor F that vanishes on FP restricts to a lower semi-continuous functor on FinStat (as defined in Definition 4.10). To see this, note that, given a sequence of morphisms $(f, s_i) : (X, p_i) \to (Y, q_i)$ that converges pointwise to a morphism (f, s) : (X, p) → (Y, q), we can recover $F((f, s_i))$ as the value of F on an arrow into (1, δ): writing ! : Y → 1 for the unique map, the arrow $(!, q_i) : (Y, q_i) \to (1, \delta)$ lies in FP, so functoriality gives $F((f, s_i)) = F((! \circ f, s_i \circ q_i))$. Note that, on finite sets, converging pointwise is equivalent to strong convergence.
Theorem (Lower semi-continuity). The functor RE is lower semi-continuous.
Proof. Let us denote $a := \liminf_{n \to \infty} \mathrm{RE}((f, s_n))$. If a = ∞, then the statement holds automatically, so we assume that a < ∞.
By virtue of a being a limit inferior, we can pick a subsequence $\{n_i\}_{i \in \mathbb{N}}$ along which $\mathrm{RE}((f, s_{n_i}))$ converges to a. Now, instantiating statements (2.4.7) and (2.4.9) from Pinsker [Pin60, Section 2.4] in our setting, we have
$$\mathrm{RE}((f, s)) \le \lim_{i \to \infty} \mathrm{RE}((f, s_{n_i})) = a,$$
as desired.

Uniqueness
We now show that the relative entropy is, up to a multiplicative constant, the unique functor satisfying the conditions established so far.We first prove a crucial lemma.
Lemma 5.1. Let X be a Borel space equipped with probability measures p and q. If $p \ll q$, then we can find a sequence of simple functions $p_n^*$ on X such that, for the sequence of probability measures $p_n(E) := \int_E p_n^* \, dq$, the measures $p_n$ and p agree on the elements of the partition of X induced by $p_n^*$ and, moreover, $p_n \to p$ strongly.
Proof. Denote by $K_n$ the index set $\{0, 1, \dots, n2^n - 1, \infty\}$ of k. We fix a version $\frac{dp}{dq}$ of the Radon-Nikodym derivative such that $\frac{dp}{dq} < \infty$ everywhere. We define a family of partitions and a family of simple functions as follows:
$$X_{n,k} := \left\{x \in X : k2^{-n} \le \tfrac{dp}{dq}(x) < (k+1)2^{-n}\right\} \ (k < \infty), \qquad X_{n,\infty} := \left\{x \in X : \tfrac{dp}{dq}(x) \ge n\right\},$$
$$p_n^* := \sum_{k \in K_n} \frac{p(X_{n,k})}{q(X_{n,k})} \, 1_{X_{n,k}},$$
where a summand is taken to be 0 when $q(X_{n,k}) = 0$. Every function induces a partition on the domain; if moreover the function is simple, the induced partition is finite.
We first note that $p_n$ and p agree on the elements of the partition induced by $p_n^*$:
$$p_n(X_{n,k}) = \int_{X_{n,k}} p_n^* \, dq = \frac{p(X_{n,k})}{q(X_{n,k})} \, q(X_{n,k}) = p(X_{n,k}).$$
Next, we prove the strong convergence $p_n \to p$. We first show $p_n^* \to \frac{dp}{dq}$ pointwise. Let x ∈ X. Pick N large enough such that $\frac{dp}{dq}(x) < N$. For a fixed integer n ≥ N, there is exactly one $k_n$ for which $x \in X_{n,k_n}$. On the one hand, we have $k_n 2^{-n} \le \frac{dp}{dq}(x) \le (k_n + 1)2^{-n}$ on $X_{n,k_n}$. But on the other hand, by integrating over $X_{n,k_n}$ and dividing everything by $q(X_{n,k_n})$, we also have $k_n 2^{-n} \le \frac{p(X_{n,k_n})}{q(X_{n,k_n})} \le (k_n + 1)2^{-n}$. We thus get pointwise convergence since we have
$$\left| p_n^*(x) - \frac{dp}{dq}(x) \right| = \left| \frac{p(X_{n,k_n})}{q(X_{n,k_n})} - \frac{dp}{dq}(x) \right| \le 2^{-n} \quad \text{for any } n \ge N.$$
From the above inequality and the choice of N, we note the following: $p_n^*(x) \le \frac{dp}{dq}(x) + 2^{-n} \le \frac{dp}{dq}(x) + 1$ for any n ≥ N.
Before proving uniqueness, we recall the main theorem of Baez and Fritz [BF14] on FinStat.
Theorem 5.2. Suppose that a functor F : FinStat → [0, ∞] is lower semicontinuous, convex linear and vanishes on FP. Then for some 0 ≤ c ≤ ∞ we have $F(f, s) = c \, \mathrm{RE}_{\mathrm{fin}}(f, s)$ for all morphisms (f, s) in FinStat.
We are now ready to extend this characterization to SbStat.
Theorem 5.3. Suppose that a functor F : SbStat → [0, ∞] is lower semicontinuous, convex linear and vanishes on FP. Then for some 0 ≤ c ≤ ∞ we have $F(f, s) = c \, \mathrm{RE}(f, s)$ for all morphisms (f, s) in SbStat.
Proof. Since F satisfies all the above properties on FinStat, we can apply Theorem 5.2 in order to establish that $F = c \, \mathrm{RE}_{\mathrm{fin}} = c \, \mathrm{RE}$ for all morphisms in the subcategory FinStat.
We show that F extends uniquely to $c \, \mathrm{RE}$ on all morphisms in SbStat.
By convex linearity of F, for an arbitrary morphism (f, s) from (X, p) to (Y, q), we have $F((f, s)) = \int_Y F((f, s)_y) \, dq$, so F is totally described by its local relative entropies. It is thus sufficient to show $F = c \, \mathrm{RE}$ on an arbitrary morphism (f, s) : (X, p) → (1, δ). The case where p is not absolutely continuous with respect to s is straightforward, so let us assume $p \ll s$.
We apply Lemma 5.1 with p and s to get the family of simple functions p * n and the corresponding family of partitions {X n,k }.We define π n as the function that maps x ∈ X n,k ′

Vol. 19:4 A CHARACTERIZATION OF RELATIVE ENTROPY ON STANDARD BOREL SPACES 10:15
So for all n, we can bound $p_n^*(x)$ everywhere by the integrable function $g(x) := \frac{dp}{dq}(x) + 1$. Given a measurable set E ⊂ X, we can thus apply Lebesgue's dominated convergence theorem. We get
$$\lim_{n \to \infty} p_n(E) = \lim_{n \to \infty} \int_E p_n^* \, dq = \int_E \frac{dp}{dq} \, dq = p(E).$$