On Higher-Order Probabilistic Subrecursion

We study the expressive power of subrecursive probabilistic higher-order calculi. More specifically, we show that endowing a very expressive deterministic calculus like G\"odel's $\mathbb{T}$ with various forms of probabilistic choice operators may result in calculi which are not equivalent with respect to the class of distributions they give rise to, although they all guarantee almost-sure termination. Along the way, we introduce a probabilistic variation of the classic reducibility technique, and we prove that the simplest form of probabilistic choice leaves the expressive power of $\mathbb{T}$ essentially unaltered. The paper ends with some observations about the functional expressive power: as expected, all the considered calculi capture the functions which $\mathbb{T}$ itself represents, at least when standard notions of observation are considered.


Introduction
Probabilistic models are increasingly pervasive in computer science and are among the most powerful modeling tools in many areas like computer vision [Pri12], machine learning [Pea88] and natural language processing [MS99]. Since the early times of computation theory [DLMSS56], the very concept of an algorithm has itself been generalized from a purely deterministic process to one in which certain elementary computation steps can have a probabilistic outcome. This has further stimulated research in computation and complexity theory [Gil74], but also in programming languages [SD78,Koz81].
Endowing programs with probabilistic primitives (e.g., an operator which models sampling from a distribution) poses a challenge to programming language semantics. Already for a minimal, imperative probabilistic programming language, giving a denotational semantics is nontrivial [Koz81]. When languages also have higher-order constructs, everything becomes even harder [JT98] to the point of disrupting much of the beautiful theory known in the deterministic case [Bar84]. This has stimulated research about the denotational semantics of higher-order probabilistic programming languages, with some surprising positive results coming out recently (e.g., [EPT14,CDL14,DLFVY17,HKSY17]).
The main problem, then, consists in understanding how the obtained classes relate to each other, and to the class of T-representable functions, which is well-known to comprise precisely the provably total functions of Peano's arithmetic [GTL89]. Along the way, however, we manage to understand how to compare the expressive power of probabilistic calculi per se. Summing up, this paper's main contributions are the following ones:
• We first take a look at the full calculus T ⊕,R,X , and prove that it enforces almost-sure termination, namely that the probability of termination of any typable term is 1. This is done by appropriately adapting the well-known reducibility technique [GTL89] to a probabilistic operational semantics. We then observe that while T ⊕,R,X cannot be positively almost-surely terminating, T ⊕ indeed is. This already shows that there must be a gap in expressive power. This is done in Section 2.
• In Section 3, we look more closely at the expressive power of T ⊕ , proving that the mere presence of probabilistic choice does not add much to the expressive power of T: in a sense, probabilistic choice can be "lifted up" to the ambient deterministic calculus.
• We look at other fragments of T ⊕,R,X and at their expressive power. More specifically, we will prove that (the equiexpressive) T R and T X represent precisely what T ⊕ can do at the limit, in a sense which will be made precise in Section 2. This part, which turns out to be the most challenging, is in Section 4.
• Section 5 is devoted to proving that both for Monte Carlo and for Las Vegas observations, the class of functions representable in T R coincides with the T-representable ones, the only exception being classes obtained by observing the most likely outcome, which are much larger and of questionable interest.

Probabilistic Choice Operators, Informally
Any term of Gödel's T can be seen as a purely deterministic computational object whose dynamics is finitary, due to the well-known strong normalization theorem (see, e.g., [GTL89]). In particular, the non-determinism due to multiple redex occurrences is completely harmless because of confluence. Confluence is well-known not to hold in a probabilistic scenario [DLZ12], but in this paper we neglect this problem, and work with a fixed reduction strategy, namely weak call-by-value reduction (keeping in mind that all we say here also holds when call-by-name is the underlying notion of reduction). Evaluation of a T-term M of type NAT can be seen as a finite sequence of terms ending in the normal form n of M (see Figure 1a). More generally, the unique normal form of any T-term M will be denoted NF(M ).
Notably, T is computationally very powerful. In particular, the T-representable functions on N coincide with the functions which are provably total in Peano's arithmetic [GTL89]. As we already mentioned, the most natural way to enrich deterministic calculi and turn them into probabilistic ones consists in endowing their syntax with one or more probabilistic choice operators. Operationally, each of them models the process of sampling from a distribution and proceeding depending on the outcome. Of course, one has many options here as to which of the various operators to adopt. The aim of this work is precisely that of studying to what extent this choice has consequences for the overall expressive power of the underlying calculus.
Suppose, for example, that T is endowed with the binary probabilistic choice operator ⊕ described in the Introduction, whose evaluation corresponds to tossing a fair coin and choosing one of the two arguments accordingly. The presence of ⊕ indeed has an impact on the dynamics of the underlying calculus: the evaluation of any term M is not deterministic anymore, but can be modeled as a finitely branching tree (see, e.g., Figure 1c for such a tree when M is (3 ⊕ 4) ⊕ 2). The fact that all branches of this tree have finite height (and the tree is thus finite) is intuitive, and a proof of it can be given by adapting the well-known reducibility proof of termination for T. In this paper, we in fact prove much more, and establish that T ⊕ can be embedded into T.
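The intended semantics of ⊕ can be conveyed by a minimal sketch, written here in Haskell (our own illustration, not part of the calculus): terms of type NAT are modeled as finitely supported distributions, ⊕ as a fair mixture, and the example term of Figure 1c is evaluated explicitly.

import qualified Data.Map.Strict as M

-- A finitely supported distribution over outcomes of type a.
newtype Dist a = Dist { unDist :: M.Map a Rational }

-- Dirac distribution: a deterministic outcome.
dirac :: Ord a => a -> Dist a
dirac x = Dist (M.singleton x 1)

-- Fair binary choice: each argument is chosen with probability 1/2.
(<+>) :: Ord a => Dist a -> Dist a -> Dist a
Dist l <+> Dist r = Dist (M.unionWith (+) (M.map (/ 2) l) (M.map (/ 2) r))

-- The term (3 ⊕ 4) ⊕ 2 of Figure 1c.
example :: Dist Integer
example = (dirac 3 <+> dirac 4) <+> dirac 2

main :: IO ()
main = print (M.toList (unDist example))
-- prints [(2,1 % 2),(3,1 % 4),(4,1 % 4)]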
If ⊕ is replaced by R, the underlying tree is not finitely branching anymore, but, again, there is no infinitely long branch, since each of them can somehow be seen as a T computation (see Figure 1b for an example). What happens to the expressive power of the obtained calculus? Intuition tells us that the calculus should not be too expressive compared to T ⊕ . If ⊕ is replaced by X, on the other hand, the underlying tree is finitely branching, but its height can well be infinite (see Figure 1d). Actually, X and R are easily shown to be equiexpressive in the presence of higher-order recursion, as we show in Section 2.4. On the other hand, for R and ⊕, no such result is possible. Nonetheless, T R can still be somehow encoded into T, as we will detail in Section 4. From this embedding, we can show that neither Monte Carlo nor Las Vegas algorithms on T ⊕,R,X add any expressive power to T. This is done in Section 5.
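The difference between the two operators can also be conveyed by a short sketch (again in Haskell, and again our own illustration rather than the paper's formal syntax): R produces in a single step an infinitely supported distribution, which we represent lazily, while X needs unboundedly many binary steps, which we truncate with a depth parameter that is an artefact of the sketch, not of the calculus.

-- R: in a single step, return n with probability 1/2^(n+1).
-- The support is infinite, so we represent it as a lazy list of (value, mass).
rDist :: [(Integer, Double)]
rDist = [ (n, 0.5 ^ (n + 1)) | n <- [0 ..] ]

-- X f v: with probability 1/2 stop and return v, with probability 1/2 continue
-- as X f (f v). Every step is binary, but the reduction tree has infinite height;
-- we cut it after `depth` steps and list the value mass produced so far.
xDist :: Int -> (a -> a) -> a -> [(a, Double)]
xDist depth f v = go depth 1.0 v
  where
    go 0 _ _ = []                        -- truncation: the remaining mass is dropped
    go d p x = (x, p / 2) : go (d - 1) (p / 2) (f x)

main :: IO ()
main = do
  print (take 4 rDist)                   -- [(0,0.5),(1,0.25),(2,0.125),(3,6.25e-2)]
  print (xDist 4 succ (0 :: Integer))    -- the first four branches of X⟨S, 0⟩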

2. The Full Calculus T ⊕,R,X
All along this paper, we work with a calculus T ⊕,R,X whose terms comprise the usual constructs from the untyped λ-calculus, together with primitive recursion, constants for natural numbers, pairs, and the three choice operators we have described in the previous sections.
As usual, terms are taken modulo α-equivalence. Terms in which no variable occurs free are, as usual, dubbed closed, and are collected in the set T ⊕,R,X C . A value is simply a closed term of a restricted shape (intuitively: a numeral, a λ-abstraction, or a pair of values). Termination of Gödel's T is guaranteed by the presence of types, which we also need here. Types are expressions generated by the following grammar:
A, B ::= NAT | A → B | A × B.
Environmental contexts are expressions of the form Γ = x 1 ∶ A 1 , . . . , x n ∶ A n , while typing judgments are of the form Γ ⊢ M ∶ A. Typing rules are given in Figure 2. From now on, only typable terms will be considered. We denote as T ⊕,R,X C (A) the set of closed terms of type A, and similarly as T ⊕,R,X V (A) the set of values of type A. Given n ∈ N, we use the shortcut n for the corresponding value of type NAT: 0 is already part of the language of terms, while n + 1 is simply S n. For simplicity of notation, we also write SS for λx.S(S x); similarly for SS . . . S (n times).
2.1. Operational Semantics. While evaluating terms in a deterministic calculus ends up in a value, the same process leads to a distribution of values when applied to terms in a probabilistic calculus [DLZ12]. The reader should note that countable distributions are perfectly sufficient to give semantics to T ⊕,R,X , and that we will focus our attention on them here. Formalizing all this requires some care, but can be done following one of the many definitions from the literature (e.g., [DLZ12]).
Given a countable set X, a distribution L on X is a function mapping elements of X to elements of the interval [0, 1], and such that ∑ x∈X L(x) ≤ 1; the set of distributions on X is indicated as D(X). Observe that we take distributions as functions summing to a real in [0, 1] rather than to 1. This is an implicit way to handle computations with nonzero probability of divergence as distributions whose sum is strictly below 1.
If L is a distribution on X and x ∈ X, then L(x) is the real number which L associates with x. We will use the pointwise order ≤ on distributions, which turns D(X) into an ωCPO. The support of a distribution M ∈ D(X) is the subset of X to which M assigns strictly positive probability. For any M ∈ D(X) and Y ⊆ X, the restriction of M to Y is the distribution assigning M(x) to any x ∈ Y , and 0 otherwise. Another useful operation on distributions is scalar multiplication: if L ∈ D(X) and p ∈ [0, 1] then p ⋅ L is the distribution assigning to x the probability pL(x).
We are especially interested in distributions over terms here. In particular, a distribution of type A is simply an element of D(T ⊕,R,X C (A)) and, as such, it is a function from T ⊕,R,X C (A) to [0, 1] whose sum is itself less than or equal to 1. The set D(T ⊕,R,X V ) is the set of distributions over values, and will thus be used to assign a meaning to closed terms; it is ranged over by metavariables like U, V, W. The Dirac distribution concentrated on a term M , i.e., the one assigning probability 1 to M and 0 to any other term, is written {M }. We define the reducible and value supports of a distribution M on terms as the intersections of its support with T ⊕,R,X R and with T ⊕,R,X V , respectively. This way, notations like M R and M V , denoting the restrictions of M to reducible terms and to values, have an obvious and natural meaning.
As syntactic sugar, we use the following convenient notation to manipulate distributions: given a distribution M and a family of distributions (N M ) M ∈T ⊕,R,X , the sum ∑ M M(M ) ⋅ N M denotes the distribution assigning to any term L the probability ∑ M M(M ) ⋅ N M (L). By a slight abuse of notation, we may define N M only for M in the support of M, since the other elements of the family (N M ) M ∈T ⊕,R,X do not contribute to the sum anyway. The sum notation as we use it here can be easily generalized, e.g., to families of real numbers (p M ) M ∈T ⊕,R,X and to other kinds of distributions.
Example 2.1. Suppose that M = {n ↦ 1/2 n+1 ∣ n ∈ N} ∈ D(T ⊕,R,X (NAT)), which is the so-called exponential distribution over the natural numbers. Suppose, moreover, that for any n, N n = {n + 1 ↦ 1/2 , 0 ↦ 1/2 } is another distribution in D(T ⊕,R,X (NAT)). Then ∑ n M(n ) ⋅ N n assigns probability 1/2 to 0 and probability 1/2 m+2 to every m + 1, which is exactly the same as M. ⊠
More generally, the sum notation as defined here satisfies some familiar algebraic identities.
Example 2.2. If M is the exponential distribution from Example 2.1 above, and C is the context (λx.x) ⋅ , then the push-forward distribution C M is the distribution assigning probability 1/2 n+1 to any term in the form (λx.x)n, and probability 0 to any other term. ⊠
In the following we will often, by abuse of notation, write push-forward distributions like the distribution C M from Example 2.2 as (λx.x)M. We even go beyond that, and write expressions like, e.g., MN , which stands for the distribution assigning probability M(M ) ⋅ N (N ) to every term of the form M N .
Reduction rules of T ⊕,R,X are given in Figure 3. For simplicity, we write M → ? M to mean that M → M whenever M is reducible, and that M = {M } whenever M is a value. The sum notation allows us to rewrite rule (r-∈) as a form of monadic lifting.
Example 2.3. Notice that we can faithfully simulate system T inside T ⊕,R,X . For example, consider the following term:
Expo ∶= λn.rec ⟨1 , λxy.rec ⟨0, λx.SS, y⟩ , Sn⟩ ∶ NAT → NAT
where SS is a shortcut for the term λz.S(Sz) computing the double-successor of its argument. This term computes the function n ↦ 2 n+1 in time O(2 n ). Indeed, when applied to a natural number n, it performs n + 1 iterated doublings, starting from 1. ⊠
Example 2.4. As a second example, we present the term X⟨S, 0⟩, whose reduction is essentially probabilistic. Notice that, after n + 1 reduction steps, the support of the underlying distribution has precisely n elements.

⊠
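The recursive structure of Expo from Example 2.3 can be mirrored directly outside the calculus; the following Haskell sketch (ours, with recT standing in for the rec operator) computes the same function and makes the exponential cost visible.

-- recT u v n mimics rec⟨u, v, n⟩: v receives the predecessor and the recursive result.
recT :: a -> (Integer -> a -> a) -> Integer -> a
recT u _ 0 = u
recT u v n = v (n - 1) (recT u v (n - 1))

-- The inner recursion rec⟨0, λx.SS, y⟩ applies the double-successor y times to 0,
-- i.e. it doubles y; the outer recursion therefore computes n+1 doublings of 1.
expo :: Integer -> Integer
expo n = recT 1 (\_ y -> recT 0 (\_ z -> z + 2) y) (n + 1)

main :: IO ()
main = print (map expo [0 .. 5])   -- [2,4,8,16,32,64], i.e. n ↦ 2^(n+1)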
We can easily define → n as the n th iteration of → and → * as the reflexive and transitive closure of →. If M → n N and U is the restriction of N to values, then we write M → ≤n U. In other words, U is the distribution on values to which M reduces in at most n steps. For every M and for every natural number n, there are unique N and U such that M → n N and M → ≤n U.
In probabilistic systems, we might want to consider infinite reduction sequences such as the ones induced by X⟨(λx.x), 0⟩, which reduces to {0}, but only after an infinite number of steps. Please note that for any value V , and whenever M → N , it holds that M(V ) ≤ N (V ). As a consequence, we can finally give the following definition, one of the most crucial ones in this paper.
Definition 2.5. Let M be a term and let (M n ) n∈N be the unique distribution family such that M → ≤n M n . The evaluation of M is the value distribution obtained as the least upper bound of the family (M n ) n∈N , which exists because the family is pointwise nondecreasing. The success of M is its probability of termination, which is formally defined as the norm (i.e., the sum) of its evaluation, and is denoted Succ(M ). The average reduction length from M , denoted [M ], is the expected number of reduction steps needed to reach a value.
The evaluation operator can be easily generalized to one on distributions of terms, since the sequence of approximating distributions is anyway unique. Whenever the evaluation of M is a Dirac distribution {N }, it makes sense to consider N as the normal form of M ; indeed, we write N = NF(M ) in all these cases, e.g., when M is a term from T.
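Example 2.6 below carries out this computation inside the calculus for the term X⟨S, 0⟩; the following Haskell sketch (ours, with a deliberately simplified step counting) computes the same approximants M n and shows the success converging to 1.

-- A configuration is either a value k or the still-reducible term X⟨S, k⟩,
-- the latter represented just by k.
data Conf = Val Integer | Red Integer deriving Show

step :: Conf -> [(Conf, Double)]
step (Val k) = [(Val k, 1)]                        -- values do not reduce further
step (Red k) = [(Val k, 0.5), (Red (k + 1), 0.5)]  -- X⟨S, k⟩ → {k, X⟨S, S k⟩}

steps :: Int -> [(Conf, Double)] -> [(Conf, Double)]
steps 0 d = d
steps n d = steps (n - 1) [ (c', p * q) | (c, p) <- d, (c', q) <- step c ]

-- The n-th approximant M_n: the value part of the distribution after n steps.
approximant :: Int -> [(Integer, Double)]
approximant n = [ (k, p) | (Val k, p) <- steps n [(Red 0, 1)] ]

main :: IO ()
main = do
  print (approximant 3)                      -- [(0,0.5),(1,0.25),(2,0.125)]
  print (sum (map snd (approximant 20)))     -- 0.9999990463256836, tending to Succ = 1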
Example 2.6. Take as an example the term X⟨S, 0⟩ from Example 2.4. We have that, for all n, the approximant M n satisfies M n (m) = 0 if m ≥ n and M n (m) = 1/2 m+1 otherwise. Thus the evaluation of X⟨S, 0⟩ assigns probability 1/2 m+1 to every m ∈ N. Following, e.g., [DLZG14], we say that a term M of type NAT → NAT represents a function g ∶ N → D(N) iff for every m, n it holds that g(n)(m) equals the probability assigned to m by the evaluation of M n. This will be a key notion not only to evaluate the expressive power of various fragments of T ⊕,R,X , but also to compare them.

2.2. On the Continuity of the Operational Semantics. In this section, we ask ourselves whether the semantics of an application M N can somehow be obtained from the semantics of M and of N . Lemma 2.11 below gives a positive answer to this question, but some auxiliary lemmas are necessary beforehand.
The first such lemma states that the (one-step) reduction of a sum is the sum of the one-step reducts of its addends. Of course, the different reductions of the addends can interleave, which makes the decomposition nontrivial. The decomposition can of course be iterated to reductions of any length. Thanks to Lemma 2.8, it is possible to trace back values obtained by reducing an application, as stated in the following, intermediate, lemma.
Proof. This is again an induction on n. The base case is trivial. If, instead, n ≥ 1, we proceed differently depending on whether M and N are values or not.
We now have all the necessary ingredients for at least proving a restricted form of our desired continuity result.
Proof. By induction on m + n. The base case m + n = 0 is trivial. If n ≥ 1, we proceed differently depending on whether M and N are values or not.
• If N → L → n−1 N ≥ V then N is not a value. By Lemma 2.8, we can decompose N = ∑ L↦L N L and V = ∑ L↦L V L in such a way that for any L ∈ L , L → ≤n−1 N L ≥ V L (with → ≤n−1 standing for either = or → n−1 depending on whether L is a value or not). Then we get
• If V is the empty (sub)distribution, this is trivial.
Proof. (≤) There is (L n , W n ) n≥1 such that (M N ) → n L n ≥ W n for all n and such that M N = lim n (W n ). Applying Lemma 2.9 gives M n , N n , Q n ∈ D(T ⊕,R,X ). Thus lim n U n ≤ M and lim n V n ≤ N with W n ≤ U n V n , leading to the required inequality.
(≥) There is (M n , N n , U n , V n ) n≥1 such that M → n M n ≥ U n and N → n N n ≥ V n for all n and such that M = lim n (U n ) and N = lim n (V n ). This leads to the equality M N = lim m,n U m V n . Finally, by Lemma 2.10, for any m, n, each approximant of U m V n is below M N , and so is their sup.
2.3. Almost-Sure Termination. We now have all the necessary ingredients to specify a quite powerful notion of probabilistic computation. When, precisely, should such a process be considered terminating? Do all probabilistic branches (see Figures 1a–1d) need to be finite? Or should we be more liberal? The literature on the subject unanimously points to a notion called almost-sure termination: a probabilistic computation should be considered terminating if the set of infinite computation branches, although not necessarily empty, has null probability [MM05,FH15,KK15]. This has the following incarnation in our setting.
Definition 2.12. A term M is said to be almost-surely terminating (AST) iff its probability of convergence is maximal, namely if Succ(M ) = 1.
This section is concerned with proving that T ⊕,R,X indeed guarantees almost-sure termination. This will be done by adapting Girard-Tait's reducibility technique to a probabilistic operational semantics.
Proof. The proof is based on the notion of a reducible term, which is given as follows by induction on the structure of types. Then we can observe that:
(1) The reducibility candidates over Red A are →-saturated. This can be proved by induction on A.
(2) The reducibility candidates over Red A are precisely the AST terms whose evaluation is supported by elements of Red A . This goes by induction on A.
• Trivial when A = NAT. (M V ) is AST and has an evaluation supported by elements of Red C , so that we can conclude by IH.
is AST by IH; using Lemma 2.11 we get M AST and it is easy to see that if AST with an evaluation supported by elements of Red A i ; by Lemma 2.11 π i M = π i M meaning that (π i M ) is AST and has an evaluation supported by elements of Red A i , so that we can conclude by IH.
This one goes by induction on the type derivation. The only difficult cases are the recursion, the application, the binary choice ⊕ and the denumerable choice X: • For the operator rec: We have to show that if U ∈ Red A and V ∈ Red NAT→A→A then for all n ∈ N, (rec ⟨U, V, n⟩) ∈ Red A . We proceed by induction on n: • If n = 0: rec ⟨U, V, 0⟩ → {U } ⊆ Red A and we conclude by saturation.
Red A by IH and since n ∈ Red N and V ∈ Red N→A→A , we conclude by saturation. • For the application: we have to show that if M ∈ Red A→B and N ∈ Red A then (M N ) ∈ Red B . But since N ∈ Red A , this means that it is AST and for every • For the operator ⊕: • For the operator X: we have to show that for any value U ∈ Red A→A and V ∈ Red A it holds that (X U V ) ∈ Red A . By an easy induction on n, Moreover, by an easy induction on n we have V by Lemma 2.11 and IH, which is sufficient to conclude. At the limit, we get X U V = ∑ i∈N
Points 2 and 3 above together lead to the thesis: if ⊢ M ∶ A, then Point 3 implies that M ∈ Red A which, by Point 2, implies that M is AST.
Almost-sure termination could however be seen as too weak a property: there is no guarantee about the average computation length, which can well be infinite even if the probability of termination is 1. For this reason, a stronger notion is often considered, namely positive almost-sure termination.
Definition 2.14. A term M is said to be positively almost-surely terminating (or PAST) iff the average reduction length [M ] is finite.
Gödel's T, when paired with R, is already combinatorially too powerful to guarantee positive almost-sure termination. This comes from the possibility of describing programs with exponential reduction time, such as the term Expo from Example 2.3, which computes the function n ↦ 2 n+1 in time Θ(2 n ).
Theorem 2.15. T ⊕,R,X is not positively almost-surely terminating.
Proof. The term (Expo R) ∶ NAT computes, with probability 1/2 n+1 , the number 2 n+1 in time Θ(2 n ); the average reduction length is thus infinite, as the computation sketched below shows.
Please observe that the counterexample to positive almost-sure termination for T ⊕,R,X has been obtained by applying Expo to R, and both these terms are positively almost-surely terminating when considered in isolation. In other words, positive almost-sure termination is not a compositional property.
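For completeness, here is the computation behind the proof above, written out with the notation [ ⋅ ] for the average reduction length (the constants hidden in the Θ are immaterial):
\[
[\mathit{Expo}\ \mathtt{R}] \;=\; \sum_{n \in \mathbb{N}} \frac{1}{2^{n+1}} \cdot \Theta(2^{n}) \;=\; \sum_{n \in \mathbb{N}} \Theta\!\left(\frac{1}{2}\right) \;=\; +\infty .
\]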
2.4. On Fragments of T ⊕,R,X : a Roadmap. The calculus T ⊕,R,X contains at least four fragments, namely Gödel's T and the three fragments T ⊕ , T R and T X corresponding to the three probabilistic choice operators we consider. It is then natural to ask ourselves how these fragments relate to each other with respect to their expressive power. At the end of this paper, we will have a very clear picture in front of us.
The first such result is the equivalence between the two fragments T R and T X . The embeddings are in fact quite simple: getting X from R only requires "guessing" the number of iterations via R and then using rec to execute them; capturing R from X is even easier: it corresponds to counting the total number of iterations performed by X.
Proposition 2.16. T R and T X are both equiexpressive with T ⊕,R,X .
Proof. The calculus T R embeds the full system T ⊕,R,X via a compositional, type-preserving encoding, in which the dummy abstractions on z and the final application to 0 ensure the correct reduction order by making λz.N a value. The fragment T X embeds the full system T ⊕,R,X via a similarly compositional and type-preserving encoding. We have to prove the correctness of the two embeddings:
• For any M and N , the encodings of M ⊕ N evaluate to the same distribution as M ⊕ N itself; indeed, we only have to perform a few reductions.
• For any U and V , the evaluations of X R ⟨U, V ⟩ and of X⟨U, V ⟩ coincide. Indeed, both of them are the unique fixpoint of the same contractive function f : that the evaluation of X⟨U, V ⟩ is a fixpoint of f is immediate after a reduction; for the other, we use Lemma 2.11.
• Finally, that the evaluation of R is {n ↦ 1/2 n+1 ∣ n ≥ 0} takes just one step of reduction, while the fact that its encoding R X evaluates to the same distribution was shown in Example 2.6.
Notice how simulating X by R requires the presence of recursion, while the converse is not true. The implications of this fact are intriguing, but lie outside the scope of this work.
In the following, we will no longer consider T X nor T ⊕,R,X but only T R , keeping in mind that all these are equiexpressive due to Proposition 2.16. The rest of this paper, thus, will be concerned with understanding the relative expressive power of the three fragments T, T ⊕ , and T R . Can any of the (obvious) strict syntactical inclusions between them be turned into a strict semantic inclusion? Are the three systems equiexpressive?
In order to compare probabilistic and deterministic calculi, several options are available. The most common one is to consider notions of observations over the probabilistic outputs; this will be pursued in Section 5, where we look at whether Monte Carlo or Las Vegas algorithms on T ⊕ or T R result in deterministically T-definable functions or not. Notice that neither Monte Carlo nor Las Vegas algorithms are natively definable inside T ⊕,R,X . Indeed, those algorithms are based on restrictions on the resulting distribution, which cannot be described in the calculus. For example, a Las Vegas algorithm is captured by, say, a term M ∶ NAT → NAT such that the evaluation of M n assigns probability at most 1/2 to 0, for any n. In the next two sections, instead, we look at how T ⊕,R,X can be seen as a way to compute functions returning distributions over the base type N rather than elements of it.
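The computational content of the two encodings in the proof of Proposition 2.16 can be sketched operationally as follows (Haskell, our own illustration, with sampling standing in for the probabilistic reduction rules): X is obtained from R by guessing the number of iterations and then executing them recursively, and R is obtained from X by counting the iterations X performs.

import System.Random (randomRIO)

-- A fair coin, standing in for the probabilistic choice of the calculus.
coin :: IO Bool
coin = randomRIO (False, True)

-- R: return n with probability 1/2^(n+1), i.e. count heads before the first tail.
sampleR :: IO Integer
sampleR = do b <- coin
             if b then succ <$> sampleR else pure 0

-- X from R: guess the number of iterations with R, then execute them with a recursor.
xFromR :: (a -> a) -> a -> IO a
xFromR f v = do n <- sampleR
                pure (iterate f v !! fromIntegral n)

-- X itself: with probability 1/2 stop, otherwise iterate once more.
sampleX :: (a -> a) -> a -> IO a
sampleX f v = do b <- coin
                 if b then sampleX f (f v) else pure v

-- R from X: count the iterations performed by X, i.e. iterate the successor.
rFromX :: IO Integer
rFromX = sampleX succ 0

main :: IO ()
main = do
  xs <- mapM (const rFromX) [1 .. 5 :: Int]
  print xs   -- five independent samples distributed as R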
We say that the distribution M ∈ D(N) is finitely represented by f ∶ N → B 2 if there exists a natural number q such that for every k ≥ q it holds that f (k) = 0 and M = {k ↦ f (k)}. Moreover, the definition can be extended to families of distributions (M n ) n by requiring the existence of functions f ∶ N × N → B and q ∶ N → N such that for every k ≥ q(n) it holds that f (n, k) = 0 and for every n it holds that M n = {k ↦ f (n, k)}. In this case, we say that the representation is parametric. We will see in Section 3 that the distributions computed by T ⊕ are exactly the ones (parametrically) finitely representable by T terms. Concretely, this means that for any M ∈ T ⊕ (NAT) and any N ∈ T ⊕ (NAT → NAT), the evaluation of M and the family of evaluations of the terms N n are (parametrically) finitely representable.
In T R , however, distributions are more complex, having infinite support and giving rise to non-rational probabilities. That is why only a characterization in terms of approximations is possible. More specifically, a distribution M ∈ D(N) is said to be approximated by two functions f ∶ N × N → B and g ∶ N → N iff for every n ∈ N and for every k ≥ g(n) it holds that f (n, k) = 0, and moreover ∑ k∈N ∣M(k) − f (n, k)∣ ≤ 1/n .
In other words, the distribution M can be approximated arbitrarily well, and uniformly, by finitely representable ones. Similarly, we can define a parametric version of this definition at first order. In Section 4, we show that distributions generated by T R terms are indeed uniform limits of those of T ⊕ ; our result on T ⊕ thus induces their (parametric) approximation in T.

3. Binary Probabilistic Choice
This section is concerned with two theoretical results on the expressive power of T ⊕ . Taken together, they tell us that this fragment is not far from T.
2 Here we denote by B the binomial numbers m/2 n (where m, n ∈ N) and by BIN their representation in system T, encoded by pairs ⟨m, n⟩ of natural numbers.
3.1. Positive Almost-Sure Termination. As we already observed, the average number of steps to normal form can be infinite for terms of T ⊕,R,X . We will prove that, on the contrary, T ⊕ is positively almost-surely terminating. This will be done by adapting (and strengthening!) the reducibility-based result from Section 2.3. To this end, we will first give a formalization of the notion of execution tree discussed in Section 1 in the form of a multistep reduction procedure. Then, we will formally show that this tree is finite. We will see later that the multistep reduction is nothing more than → * for T ⊕ , but that this is not the case in richer fragments of T ⊕,R,X .
Definition 3.1. The multistep reduction relation ⇒ is defined by induction in Figure 4. Due to the (potentially) countably many preconditions of the rule (R-∈), the derivation tree of a multistep reduction ⇒ can be infinitely wide and even of unbounded height, but each path has to be finite.
The infiniteness of the width and the unboundedness of the height are essential tools for analyzing T R . In fact, most theorems in this section will be given for both T ⊕ and T R . But, for now, we will focus on T ⊕ and finite derivations, while T R and transfinite derivations will be analyzed in detail in Section 4.1.
Example 3.2. Consider again the term rec⟨0, λxy.y ⊕ (Sy), 2⟩; its execution tree is the following, where U n stands for rec⟨0, λxy.y ⊕ (Sy), n⟩ and ξ is the corresponding sub-derivation. Notice that the derivation is correct and finite because the execution tree is finite.
Proof. By an easy induction, we show that if N ⇐ M ⇒ L (resp. N ⇐ M ⇒ L), then there is P such that N ⇒ P ⇐ L. Now:
• If both reductions use the same rule (either (R-0), (R-tran), or (R-∈)), then it is an immediate use of the induction hypothesis on the premises, as those rules are deterministic.
• If one of them uses the rule (R-0), then it is trivial.
• No other case is possible, as (R-tran) and (R-∈) cannot apply together (one requires a term as source and the other a distribution).
Proof. When it exists, M a is unique due to confluence. Thus we only have to prove its existence. The proof goes by reducibility over reducibility sets defined as follows:
(2) The reducibility candidates over Red A are ⇒-saturated. This is a trivial induction on ⇒ using the →-saturation for the (R-+) case.
(3) Red A is inhabited by a value. By induction on A: 0 ∈ Red NAT , λx.V ∈ Red A→B and ⟨U, V ⟩ ∈ Red A×B whenever U ∈ Red A and V ∈ Red B .
(4) The reducibility candidates M over Red A ⇒-reduce to M a . This goes by induction on A:
• Trivial for A = NAT.
• Let M ∈ Red B→C ; there is a value V ∈ Red B , thus (M V ) ∈ Red C and M V ⇒ M V a by IH; we can conclude using Lemma 3.4.
• Similar for products.
By induction on the type derivation. The only difficult cases are applications, recursion, and binary probabilistic choices:
• For the application, we have to show that if M ∈ Red A→B and N ∈ Red A then (M N ) ∈ Red B . But since N ∈ Red A , we get that N ⇒ N a with N a ⊆ Red A . This means that M N a ⊆ Red B and that M N a ⇒ M N a a supported in Red B . We conclude by Lemma 3.5 that U = M N a and thus that (M N ) ∈ Red B .
• For the operator rec, we have to show that if U ∈ Red A and V ∈ Red NAT→A→A then for all n ∈ N, (rec ⟨U, V, n⟩) ∈ Red A . We proceed by induction on n:
• If n = 0: rec ⟨U, V, 0⟩ → {U } ⊆ Red A and we conclude by saturation.
The thesis, as usual, can be proved as a corollary of points 4 and 5.
Notice that this theorem does not apply to T X (and a fortiori to T ⊕,R,X ) because step (5) of the proof would not hold. Notice, similarly, that Corollary 3.8 does not apply to T R (and a fortiori to T ⊕,R,X ) because the second bullet of its proof would not be verified.
3.2. Mapping to T. Positive almost-sure termination of terms in T ⊕ is not the only consequence of Theorem 3.7. In fact, the finiteness of the resulting distribution over values allows a finite representation of T ⊕ -distributions by T-definable functions. Indeed, we can consider an extension of classic system T with a single memory cell of type NAT which we use to store (the binary encoding of) the outcomes of the coin flips we will perform in the future. If we denote by c the memory cell, ⊕ can then be encoded by reading the least significant bit of c to select one of the two arguments, and shifting c one bit to the right. From Theorem 3.6, we know that for any M ∈ T ⊕ (NAT), there is n ∈ N such that M reduces to its evaluation in n steps. Since the execution is bounded by n, there cannot be more than n successive probabilistic choices. Using a well-known state-passing style transformation, we can turn (c∶=m ; M * ) into a term of T. Then, using a simple recursive operation on m, we can compute #{m < 2 n ∣ k = NF(c∶=m ; M * )} as the result of a term k ∶ NAT ⊢ N ∶ NAT, so that λk.N defines a function that represents the distribution computed by M .
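Example 3.9 below performs this construction inside T; the following Haskell sketch (ours) shows the same idea concretely, with the coin flips prepacked into a natural number c whose least significant bit drives each ⊕, and the distribution recovered by enumerating all c < 2^n.

import Data.Bits (testBit, shiftR)
import qualified Data.Map.Strict as M

-- State-passing simulation of ⊕: the next coin flip is the least significant
-- bit of the cell c; the remaining flips are obtained by shifting c.
choose :: Integer -> a -> a -> (a, Integer)
choose c l r = (if testBit c 0 then l else r, c `shiftR` 1)

-- The term rec⟨0, λxy. y ⊕ S y, 2⟩ of Examples 3.2 and 3.9, in state-passing style:
-- two recursion steps, each consuming one coin flip.
term :: Integer -> (Integer, Integer)
term c0 = let (y1, c1) = choose c0 0 1          -- first step: 0 ⊕ S 0
              (y2, c2) = choose c1 y1 (y1 + 1)  -- second step: y ⊕ S y
          in (y2, c2)

-- Counting outcomes over all 2^n coin words recovers the distribution.
distribution :: Int -> (Integer -> (Integer, Integer)) -> M.Map Integer Rational
distribution n f =
  M.fromListWith (+) [ (fst (f c), 1 / 2 ^ n) | c <- [0 .. 2 ^ n - 1] ]

main :: IO ()
main = print (M.toList (distribution 2 term))
-- [(0,1 % 4),(1,1 % 2),(2,1 % 4)]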
Example 3.9. Take the term M = rec ⟨0, λxy.y ⊕ Sy, 2⟩ from Example 3.2. Its encoding in the extension of T with the memory cell c is as described above; by a standard state-passing lifting (and a few simplifications) we obtain the term:
M ∼ = rec ⟨λc.(0, c), λxyc. if (mod c 2) then y (div c 2) else S (y (div c 2)), 2⟩
As we know that there are at most two choices, we can count the number of values of c below 4 which result in a certain u, getting:
M $ ∶= λu. rec ⟨0, λxy. if (π 1 (M ∼ x) == u) then Sy else y, 4⟩.
Then we have NF(M $ 0) = 1, NF(M $ 1) = 2, and NF(M $ 2) = 1, so that the evaluation of M , namely {0 ↦ 1/4 , 1 ↦ 1/2 , 2 ↦ 1/4 }, is finitely represented. ⊠
What remains to be shown is that this encoding can be made parametric, in the sense that for any M ∈ T ⊕ (NAT → NAT), we can generate M ↓ ∈ T(NAT → NAT → NAT) and M # ∈ T(NAT → NAT) such that for all n ∈ N, the evaluation of M n is finitely represented by the function computed by M ↓ n, with bound computed by M # n. The difficulty, here, comes from the bound M # , which has to be computed dynamically by a complex monadic encoding. To this purpose, a translation of T ⊕ into T has to be appropriately defined. First of all, let us define two maps ((⋅)) and ((⋅)) V on types; intuitively, a translated term takes a stream of coin-flip outcomes as input and returns its result together with what remains of the stream (computed by suitably shifting the stream).
The encoding ((⋅)) is given through the return and the bind operations of this monad. The binary choice is then defined as expected, using the following syntactic sugar: ite ∶= λx.rec ⟨π 3 x, λ .π 2 x, π 1 x⟩. The byproduct of this relatively complex encoding is the fact that whatever distribution one is able to compute in T ⊕ can also be computed back in T, and that this scales to first-order functions.
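A minimal model of this monadic translation, in Haskell and under our own naming, is the following: a translated computation consumes a stream of coin flips, ⊕ reads one flip and passes on the rest of the stream, and enumerating all short streams reproduces the counting of Example 3.9.

-- A translated term takes the (remainder of the) stream of coin flips and
-- returns its result together with what remains of the stream.
type Rand a = [Bool] -> (a, [Bool])

ret :: a -> Rand a
ret x = \s -> (x, s)

bind :: Rand a -> (a -> Rand b) -> Rand b
bind m k = \s -> let (x, s') = m s in k x s'

-- The translation of ⊕: consume one coin flip and keep the rest of the stream.
orr :: Rand a -> Rand a -> Rand a
orr l r = \(b : s) -> (if b then l else r) s

-- The translation of rec⟨0, λxy. y ⊕ S y, 2⟩ from Example 3.9.
example :: Rand Integer
example = step `bind` (\y1 -> orr (ret y1) (ret (y1 + 1)))
  where step = orr (ret 0) (ret 1)

-- Running it against all streams of length 2 recovers the counting of Example 3.9.
main :: IO ()
main = print [ fst (example [b1, b2]) | b1 <- [False, True], b2 <- [False, True] ]
-- [2,1,1,0]: outcome 0 once, 1 twice, 2 once.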

4. Countable Probabilistic Choice
4.1. Multistep Semantics. We have seen that none of Theorem 3.6, Theorem 3.7 and Corollary 3.8 holds in T X . Indeed Theorem 3.6, which is a prologue to the other two, does not hold on terms like, e.g., X⟨S, 0⟩, which will never ⇒-reduce to a value distribution. The fragment T R is more interesting, as both Theorem 3.6 and Theorem 3.7 hold. However, as we have seen in Theorem 2.15, positive almost-sure termination (and Corollary 3.8) do not hold. This is because we are manipulating infinitely supported distributions (due to the reduction rule of R).

Remember that T R and T X are equivalent, so why such a difference? This is due to the discrepancy in nature between their execution trees. Indeed, we have seen that the execution trees are finitely branching in T X , but with infinite paths, while those of T R are infinitely branching, but with finite paths. Since multistep reduction somehow reflects those execution trees, we can see that we only need derivations with infinite arity to get a correct multistep semantics for T R . The whole point is that we can perform transfinite structural induction over these trees. Indeed, considering the reduction trees themselves with the inclusion (or subtree) order gives a well-founded poset, recalling that there is no infinite path. If one wants to unfold this well-founded poset into an ordinal, then it should be the smallest ordinal o such that o = 1 + ωo, i.e., o = ω ω . This is unusual in operational semantics, where finitary induction suffices in most cases. Remark that, due to the encoding of ⊕ and X into T R , Theorem 3.7 subsumes Theorem 2.13. Remark, moreover, that we did not have to go through the definition of approximants. Nonetheless, those approximations exist and suggest that T ⊕ should approximate T R in some way or another. This is precisely what we are going to show in the next section.
4.2. The Approximants: State-Bounded Random Integers. In this section, we show that T ⊕ approximates T R : for any term M ∈ T R (NAT), there is a term N ∈ T ⊕ (NAT → NAT) that represents a sequence approximating the evaluation of M uniformly. We will here make strong use of the fact that M has type NAT. This restriction is natural once we understand that the encoding (⋅) † on which the result above is based is not direct, but goes through another state-passing transformation.
A naive idea would be to use T ⊕ and to stop the evaluation after a given reduction time, as schematized in Figure 5a. Although the encoding would be a nightmare, this should be expressible in T ⊕ . However, for the convergence time to be independent of the term and uniform, there is virtually no hope. That is why we switch to T R , which enjoys much nicer properties, as seen in the previous section. The basic idea behind the embedding (⋅) † is to mimic any instance of the R operator in the source term by some term 0⊕(1⊕(⋯(n⊕ ∗)⋯)), where n is sufficiently large and ∗ is an arbitrary value of type NAT. Of course, the semantics of this term is not the same as that of R, due to the presence of ∗; however, n will be chosen sufficiently large for the difference to be negligible. Notice, moreover, that this term can be generalized into the following parametric form R ‡ ∶= λn.rec ⟨∗, (λx.S ⊕ (λy.0))⟩ n. Once R ‡ is available, a natural candidate for the encoding (⋅) † would be to consider something like M ‡ ∶= λz.M [(R ‡ z)/R]. In the underlying execution tree, (M ‡ n) correctly simulates the first n branches of each R (which had infinite arity), but truncates the rest with garbage terms ∗. As schematized in Figure 5b, by increasing n, we can hope to obtain the semantics of M at the limit.
The question is whether the remaining non-truncated tree has a "sufficient weight", i.e., whether there is a lower bound on the probability of staying in this non-truncated tree. However, in general (⋅) ‡ fails on this point, and does not approximate M uniformly. In fact, this probability is basically (1 − 1/2 n ) d , where d is the depth of the tree. Since in general the depth of the non-truncated tree can grow very rapidly with respect to n in a powerful system like T, there is no hope for this transformation to perform a uniform approximation. It might well be possible to perform a complex monadic transformation in the style of Section 3.2, which computes a function relating the size n to the depth d of the execution tree. But there is a much easier solution.
The solution we adopt is to have the precision m of 0 ⊕ (1 ⊕ (⋯(m ⊕ ∗)⋯)) grow dynamically along the computation, as schematized in Figure 5c. More specifically, in the approximants M † n, the growing speed of m will increase with n: in the n-th approximation M † n, R will be simulated as 0 ⊕ (1 ⊕ (⋯(m ⊕ ∗)⋯)) and, somehow, m will be updated to m + n. Why does it work? Simply because even for a (hypothetical) infinite and complete execution tree of M , we would stay inside the n-th non-truncated tree with probability ∏ k≥0 (1 − 1/2 m+kn ), which (for m = n) is asymptotically above 1 − 1/n.
Implementing this scheme in T ⊕ requires a feature which is not available (but which can be encoded), namely ground-type references. We then prefer to show that the just described scheme can be realized in an intermediate language called TR, whose operational semantics (given in Figure 6) is formulated not on terms, but rather on triples in the form (M, m, n), where M is the term currently being evaluated, m is the current approximation threshold, and n is the value by which m is incremented whenever R is simulated. The operational semantics is standard, except for the following rule:
(r-R)  (R, m, n) → {(k, m + n, n) ↦ 1/2 k+1 ∣ k < m}
Notice how this operator behaves similarly to R, with the exception that it fails when drawing too big a number (i.e., one bigger than the first state component m). Notice that the failure is represented by the fact that the resulting distribution does not necessarily sum to 1. The intermediate language TR is able to approximate T R at every order (Theorem 4.6 below). Moreover, the two memory cells can be shown to be expressible in T ⊕ , again by way of a continuation-passing transformation. Crucially, the initial value of n can be passed as an argument to the encoded term.
Proof. We use the following notation: this gives us an analytic lower bound on the success rate of (M, m, n). However, it is not obvious that this infinite product is an interesting bound: it is not even clear that it can be different from 0. This is why we will further underapproximate this infinite product to get a simpler expression whenever m = n.
Proof. By Lemma 4.4 we have that Succ(M, n, n) ≥ ∏ k≥1 (1 − 1/2 k⋅n ), which is above the product ∏ k≥1 (1 − 1/(n 2 k 2 )) whenever n ≥ 4. This infinite product has been shown by Euler to be equal to sin(π/n)/(π/n). By an easy numerical analysis we then obtain the desired bound Succ(M, n, n) ≥ 1 − 1/n.
This lemma can be restated by saying that the probability of "failure" of (M * , n, n), i.e., the difference between the success of (M * , n, n) and that of M , is bounded by 1/n. With this we then get our first theorem, which is the uniform approximation of elements of T R by those of TR.
Theorem 4.6. For any M ∈ T R and any n ∈ N, the evaluation of (M * , n, n) differs from the evaluation of M by at most 1/n, summing the (pointwise) differences over all values.
Proof. By Lemma 4.3, for each V the difference is positive, thus we can remove the absolute value and distribute the sum. We conclude by using the fact that Succ(M ) = 1 and Succ(M * , n, n) ≥ 1 − 1/n.
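The role of the growing threshold can be checked numerically; the following Haskell fragment (ours) computes a finite prefix of the product ∏ k≥0 (1 − 1/2 m+kn ) bounding the success of (M, m, n) from below, and compares it with the 1 − 1/n bound of Lemma 4.5 for m = n.

-- Each simulated draw of R succeeds (returns some k below the threshold) with
-- probability 1 - 1/2^threshold, after which the threshold grows by n.
successBound :: Int -> Int -> Int -> Double
successBound draws m n = product [ 1 - 0.5 ^ (m + k * n) | k <- [0 .. draws - 1] ]

main :: IO ()
main = mapM_ check [4, 8, 16]
  where
    check n = putStrLn (show n ++ ": bound = " ++ show (successBound 1000 n n)
                               ++ ", 1 - 1/n = " ++ show (1 - 1 / fromIntegral n))
-- n = 4:  bound ≈ 0.9336  >= 0.75
-- n = 16: bound ≈ 0.99998 >= 0.9375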
The second theorem, i.e., the uniform approximation of ground elements of T R by those of T ⊕ , follows immediately.
Theorem 4.7. Distributions in T R (NAT) can be approximated by T ⊕ -distributions (which are finitely T-representable), i.e., for any M ∈ T R (NAT), there is M † ∈ T ⊕ (NAT → NAT) such that for every natural number n, it holds that ∑ k ∣M (k) − M † n (k)∣ ≤ 1/n .

Moreover:
• the encoding is parametric, i.e., for all M ∈ T R (NAT → NAT), there is M † ∈ T ⊕ (NAT → NAT) such that (M n) † = M † n for all n ∈ N;
• the encoding is such that M (k) ≤ M † n (k) only when k = 0.
Proof. It is clear that in an extension of T ⊕ with two global memory cells m, n and with exceptions, the R̄ operator can be encoded by a term in which a distinguished subterm raises an error/exception and in which m ∶= !m + !n returns the value of m before setting the memory cell to m + n. Remark that the only objective of the dummy abstraction over u and of the dummy application to 0 is to prevent the error-raising subterm from being evaluated too early. We can conclude by referring to the usual state-passing style encoding of exceptions and state monads into T (and thus into T ⊕ ). In fact, we do not have any requirement on the error-raising subterm, i.e., we can replace it by any value of the correct type A (which is possible since every type is inhabited). In other words, we do not need to implement the exception monad, but only the state monad, which is easy; a sketch is given below.
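The following Haskell fragment (ours) sketches the state-passing reading of this encoding: a computation threads the two cells (m, n) and yields a distribution of results paired with the updated cells, and the failure branch of R̄ is sent to an arbitrary value (here 0), as the proof remarks is permissible.

type State = (Integer, Integer)
type Comp a = State -> [((a, State), Rational)]

ret :: a -> Comp a
ret x s = [((x, s), 1)]

bind :: Comp a -> (a -> Comp b) -> Comp b
bind c k s = [ (r, p * q) | ((x, s'), p) <- c s, (r, q) <- k x s' ]

-- R̄: draw k < m with probability 1/2^(k+1) and set the first cell to m + n;
-- the leftover mass 1/2^m is sent to 0 instead of raising an exception.
rBar :: Comp Integer
rBar (m, n) = [ ((k, (m + n, n)), 1 / 2 ^ (k + 1)) | k <- [0 .. m - 1] ]
           ++ [ ((0, (m + n, n)), 1 / 2 ^ m) ]

main :: IO ()
main = do
  let twoDraws = rBar `bind` \a -> rBar `bind` \b -> ret (a + b)
  -- Starting from cells (2, 2): the first draw uses threshold 2, the second 4.
  mapM_ print (twoDraws (2, 2))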

On Probabilistic and Nondeterministic Observations. In the last two sections, we have not been able to precisely delineate the status of PT R and NT R . As we previously mentioned, the practical pertinence of these classes is questionable, in the sense that the result will be obtained after an unbounded number of tries and the proof that the algorithm is correct is given as an oracle.
In this section, we exploit this intuition, by proving that both of them contain functions which are recursive but not definable in T. More precisely, we show that NT R , the nondeterministic class over T R , exactly captures the (total) recursive functions, while PT R has a slightly more complex structure and corresponds to a recursive choice over two T-definable possible results. Before giving these two results, a remark is in order: contrary to the polynomial case, where NP ⊆ PP, we have PT R ⊆ NT R here. In fact, in the realm of decision problems, the two classes collapse to the one of recursive decision problems. The difference between them can only be observed when considering proper functions, which are neglected in probabilistic complexity theory. For any subset X of N, the class Rec X stands for the class of recursive total functions whose range is included in X.
n ∈ T X that gives the same result; using an encoding of the error monad, we can easily get a term N ∈ T X (N → N) such that f (m) is the only k ∈ N such that N m (Sk) > 0. We conclude since NT R = NT X .
Theorem 5.6. f ∈ PT R iff there are g 1 , g 2 ∈ DT and h ∈ Rec {1,2} such that f (n) = g h(n) (n).
Proof. Let DT ○ Rec {1,2} be the class of all those functions f such that f (n) = g h(n) (n), where g 1 , g 2 ∈ DT and h ∈ Rec {1,2} . We prove the equality between PT R and DT ○ Rec {1,2} as follows:
Then for n = 8, we get that NF(F m n k) > 3/8 for k = f (m) and for at most one other value (since the total has to be below 9/8), both below NF(Q m n). We can thus construct two terms N 1 , N 2 ∶ N → N in T such that
Trivially, we can write G ∶ NAT → NAT → NAT in T ⊆ T R such that NF(G1n) = g 1 (n) and NF(G2n) = g 2 (n). As we have seen in Theorem 5.5, h ∈ NT R and thus there is M ∈ T R (NAT → NAT) such that f (m) is the only k ∈ N such that M m (Sk) > 0. We thus set: N ∶= λn.ite⟨M n, G (M n) n, (G 1 n)⊕(G 2 n)⟩
A summary of the introduced subrecursive classes and of the obtained results is given in Figure 7.

6. Conclusions
This paper is concerned with the impact of adding various forms of probabilistic choice operators to a higher-order subrecursive calculus in the style of Gödel's T. The three probabilistic choice operators we analyze in this paper are equivalent if employed in the context of untyped or Turing-powerful λ-calculi [DLZ12]. As an example, X can be easily expressed by way of ⊕, thanks to fixpoints. Moreover, there is no hope of getting termination in any of those settings. We give evidence that this is not the case in a subrecursive setting. We claim that all we have said in this paper could have been spelled out in a probabilistic variation of Kleene's primitive recursive functions, e.g., [DLZG14]. Going higher-order makes our results, and in particular the termination results from Sections 2 and 3, significantly stronger. This is one of the reasons why we have proceeded this way. Classically, subrecursion refers to the study of relatively small classes of computable functions lying strictly below the partially recursive ones, and typically consisting of total functions. In this paper, we have initiated a study of the corresponding notion of subrecursive computability in the presence of probabilistic choice operators, where computation itself becomes a stochastic process.
However, we have barely scratched the tip of the iceberg, since the kinds of probabilistic choice operators we consider here are just examples of the possible ways one can turn a deterministic calculus like T into a probabilistic model of computation. The expressiveness of T ⊕,R,X is sufficient to encode most reasonable probabilistic operators, but what can we say about their own expressive power? For example, what about a ternary operator in which either of the first two arguments is chosen with a probability which depends on the value of the third argument? This ternary operator would have the type Ter ∶ A→A→(NAT→NAT)→A, where the third argument z ∶ NAT→NAT is seen as a probability p ∈ [0, 1] (whose n th binary component is given by (z n)). The expressivity of T R is sufficient to encode Ter ∶= λxyz.rec x (λuv.y) (z R). The expressivity of T Ter , however, strictly lies between that of T ⊕ and that of T R : T Ter can construct non-binomial distributions while enforcing PAST.
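A quick sketch of why this encoding behaves as intended (Haskell, ours, with sampling in place of the calculus's reduction rules): a geometric draw picks a binary digit of p, and that digit decides between the two arguments, so that the second argument is returned with probability exactly p.

import System.Random (randomRIO)

-- Geometric draw: n with probability 1/2^(n+1), as for the R operator.
sampleR :: IO Int
sampleR = do b <- randomRIO (False, True)
             if b then succ <$> sampleR else pure 0

-- ter x y z: return y with probability p = 0.b0 b1 b2 ... where bn = z n is the
-- n-th binary digit of p; this mirrors the term λxyz.rec x (λuv.y) (z R).
ter :: a -> a -> (Int -> Int) -> IO a
ter x y z = do n <- sampleR
               pure (if z n == 0 then x else y)

main :: IO ()
main = do
  -- p = 2/3 = 0.101010... in binary: digits 1, 0, 1, 0, ...
  let z n = if even n then 1 else 0
  outcomes <- mapM (const (ter (0 :: Int) 1 z)) [1 .. 10000 :: Int]
  print (fromIntegral (sum outcomes) / 10000 :: Double)   -- ≈ 0.67 on average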

A general theory of probabilistic choice operators and of their expressive power is still lacking, and is an intriguing topic for future work. Another research direction this paper hints at consists in studying the logical and proof-theoretical implications of endowing a calculus like T with probabilistic choice operators. The calculus T was born as a language of realizers for arithmetical formulas, and indeed the class of first-order functions T can express corresponds precisely to the ones which are provably total in Peano's arithmetic. But how about, e.g., T R ? Is there a way to characterize the functions (from natural numbers to distributions of natural numbers) which can be represented in it? Or even better: to what extent are the real numbers occurring in the evaluation of a T R term of type NAT computable? They are of course computable in the sense of Turing computability, but how about subrecursive notions of real-number computability?
What is even more exciting, however, is the application of the ideas presented here to polynomial time computation. This would allow us to move towards a characterization of expected polynomial time computation, thus greatly improving on the existing works on the implicit complexity of probabilistic systems [DLT15,DLZG14], which only deal with worst-case execution time. The authors are currently engaged in this direction.