Infinite Probabilistic Databases

Probabilistic databases (PDBs) model uncertainty in data in a quantitative way. In the established formal framework, probabilistic (relational) databases are finite probability spaces over relational database instances. This finiteness can clash with intuitive query behavior (Ceylan et al., KR 2016), and with application scenarios that are better modeled by continuous probability distributions (Dalvi et al., CACM 2009). We formally introduced infinite PDBs in (Grohe and Lindner, PODS 2019) with a primary focus on countably infinite spaces. However, an extension beyond countable probability spaces raises nontrivial foundational issues concerned with the measurability of events and queries and ultimately with the question whether queries have a well-defined semantics. We argue that finite point processes are an appropriate model from probability theory for dealing with general probabilistic databases. This allows us to construct suitable (uncountable) probability spaces of database instances in a systematic way. Our main technical results are measurability statements for relational algebra queries as well as aggregate queries and Datalog queries.


Introduction
Probabilistic databases (PDBs) are used to model uncertainty in data. Such uncertainty can have various reasons like, for example, noisy sensor data, the presence of incomplete or inconsistent information, or information gathered from unreliable sources [Agg09,SORK11]. In the standard formal framework, probabilistic databases are finite probability spaces whose sample spaces consist of database instances in the usual sense, referred to as "possible worlds". However, this framework has various shortcomings due to its inherent closed-world assumption [CDVdB16,CDVdB21], and the restriction to finite domains. In particular, any event outside of the finite scope of such probabilistic databases is treated as an impossible event. Yet, statistical models of uncertain data, say, for example, for temperature measurements as in Example 2.1, usually feature the use of continuous probability distributions in appropriate error models. This (continuous attribute-level uncertainty) is not expressible in the traditional PDB model. Finite PDBs also only have limited support for tuple-level uncertainty: in finite PDBs, all possible worlds have a fixed maximum number of tuples. Instead, in particular with respect to an open-world assumption, we would like to be able to model probabilistic databases without an a priori bound on the number of tuples per instance. It is worth noting that there have been a number of approaches to PDB systems that are supporting continuous probability distributions, and hence going beyond finite probability spaces (see the related works section). These models, however, lack a general (unifying) formal basis in terms of a possible worlds semantics [DRS09]. While both open-world PDBs and continuous probability distributions in PDBs have received some attention in the literature, there is no systematic joint treatment of these issues with a sound theoretical foundation. 
In [GL19], we introduced an extended model of PDBs as arbitrary (possibly infinite) probability spaces over finite database instances. However, the focus there was on countably infinite PDBs. An extension to continuous PDBs, which is necessary to model probability distributions appearing in many applications that involve real-valued measurement data, raises new fundamental questions concerning the measurability of events and queries.
In this paper, we lay the foundations of a systematic and sound treatment of infinite, even uncountable, probabilistic databases. In particular, we prove that queries expressed in standard query languages have a well-defined semantics that is compatible with the existing theoretical point of view. Our model is based on the mathematical theory of finite point processes [Moy62,Mac75,DVJ03]. Adapting this theory to the context of relational databases, we give a suitable construction of measurable spaces over which our probabilistic databases can then be defined. The only (and mild) assumption we need is that the domains of all attributes are Polish spaces. Intuitively, this requires them to have nice topological properties. All typical domains one might encounter in database theory, for example integers, strings, and reals, satisfy this assumption.
For queries and views to have a well-defined open-world semantics, we need them to be measurable mappings between probabilistic databases. Our main technical result states that indeed all queries and views that can be expressed in the relational algebra, even equipped with arbitrary aggregate operators (satisfying some mild measurability conditions) are measurable mappings. The result holds for both a bag-based and set-based relational algebra and entails the measurability of Datalog queries.
Measurability of queries may seem like an obvious minimum requirement, but one needs to be very careful. We give an example of a simple, innocent looking "query" that is not measurable (see Example 3.12). The proofs of the measurability results are not trivial, and to our knowledge not immediately covered by standard results from point process theory. At their core, the proofs are based on finding suitable "countable approximations" of the queries. That such approximations can be obtained is guaranteed by our topological requirements.
In the last section of this paper, we briefly discuss queries for probabilistic databases that go beyond those that are just set-based versions of traditional database queries. This also casts other natural PDB queries into our framework. Examples of such queries are probabilistic threshold queries and rank queries. Note that these examples refer not only to the facts in a database, but also to their probabilities, and hence are inherently probabilistic.
This article is an extended version of the paper with the same title, Infinite Probabilistic Databases, that was presented at the 23rd International Conference on Database Theory (ICDT 2020) [GL20]. The presentation of various proofs and arguments has been extensively reworked, and the paper contains much more background information and new examples. The overall accessibility of the paper has been additionally enhanced by many notational and structural improvements.

[...] domains [BPVDB15,MPS19]. However, a direct connection to our infinite PDBs seems hard to obtain, as queries in infinite PDBs in general do not have finite lineage expressions. In particular, our infinite PDBs may have an unbounded number of random tuples, in contrast to the setup of a fixed number of variables in a weighted model integration (WMI) problem. Also, in this paper we introduce a rather general notion of PDB queries that covers, for example, parity tests and various fixpoint queries.
Our classification of views towards the end of this paper is similar to previous classifications of queries such as [CKP03,WvK15] in the sense that it distinguishes the level on which information is aggregated. The work [WvK15] suggests a distinction between "traditional" and "out-of-world aggregation" similar to the one we present.

Preliminaries
Throughout this paper, N, Q, and R denote the sets of non-negative integers, rationals, and real numbers, respectively. With $N_+$, $Q_+$, and $R_+$ we denote their restrictions to positive numbers.
Sets and Bags. If S is a set and k ∈ N, then $P_k(S)$ denotes the set of all subsets of S that have cardinality exactly k. The set of all finite subsets of S is then given as $P_{\mathrm{fin}}(S) := \bigcup_{k=0}^{\infty} P_k(S)$. The set of all subsets of S (that is, the powerset of S) is denoted by P(S).
A (finite) bag (or multiset) B over a set S is a function from S to N, assigning a multiplicity to every element from S. We interpret bags as collections of elements that may contain duplicates. For s ∈ S we let $|B|_s$ denote the multiplicity of s in B. If $S' \subseteq S$, we let $|B|_{S'} := \sum_{s \in S'} |B|_s$. The cardinality |B| of the bag B is the sum of all multiplicities, i. e. $|B| := |B|_S$. Similar to the set notation, we use $B_k(S)$ to denote the set of all bags over S that have cardinality exactly k. The set of all finite bags over S is given as $B_{\mathrm{fin}}(S) := \bigcup_{k=0}^{\infty} B_k(S)$. Occasionally, we explicitly denote bags by the elements they contain. Then $\{\{a_1, \dots, a_k\}\}$ denotes a bag of cardinality k with elements $a_1, \dots, a_k$ (possibly including repetitions).
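The bag notation can be illustrated by a minimal Python sketch. It assumes that a finite bag is represented by a `collections.Counter`; the function names `multiplicity`, `count_in` and `cardinality` are our own illustrative choices and not part of the formal development.

```python
from collections import Counter

def multiplicity(bag: Counter, s) -> int:
    """|B|_s: the multiplicity of the element s in the bag B."""
    return bag[s]

def count_in(bag: Counter, subset) -> int:
    """|B|_{S'}: the sum of the multiplicities of all elements of S'."""
    return sum(m for s, m in bag.items() if s in subset)

def cardinality(bag: Counter) -> int:
    """|B|: the sum of all multiplicities."""
    return sum(bag.values())

# The bag {{a, a, b, c}} of cardinality 4 (illustrative data).
B = Counter(["a", "b", "a", "c"])
```

In this representation, two bags are equal exactly if they assign the same multiplicity to every element, regardless of the order in which the elements were listed.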
Relational Databases. For the remainder of this paper, we fix two countably infinite sets Rel and Att with Rel ∩ Att = ∅. The elements of Rel are called relation symbols, and the elements of Att are called attribute names.
A database schema τ is a tuple $(\mathcal{A}, \mathcal{R}, \mathrm{sort})$ such that $\mathcal{A}$ is a finite subset of Att, $\mathcal{R}$ is a finite subset of Rel, and $\mathrm{sort} : \mathcal{R} \to \bigcup_{k=0}^{\infty} \mathcal{A}^k$ is a function that maps every relation symbol $R \in \mathcal{R}$ to a tuple $(A_1, \dots, A_k)$ of pairwise distinct attribute names $A_1, \dots, A_k \in \mathcal{A}$ for some k ∈ N. For $R \in \mathcal{R}$ we call sort(R) the sort of R. If $\mathrm{sort}(R) = (A_1, \dots, A_k)$, then k is called the arity of R, and denoted by ar(R). We abuse notation and write A ∈ sort(R) if A is an attribute name appearing in sort(R). We also write R ∈ τ instead of $R \in \mathcal{R}$ for relation symbols.
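Let $\mathcal{A}$ be a subset of Att. A (sorted) universe with sorts $\mathcal{A}$ is a pair (U, dom) where U is a non-empty set and $\mathrm{dom} : \mathcal{A} \to P(U)$. We abuse notation and refer to a sorted universe (U, dom) just by U, and implicitly assume dom to be given. For $A \in \mathcal{A}$, we call $\mathrm{dom}(A) =: \mathbb{A}$ the (attribute) domain of A. If $\tau = (\mathcal{A}, \mathcal{R}, \mathrm{sort})$ is a database schema, U is a universe with sorts $\mathcal{A}$, and $R \in \mathcal{R}$ with $\mathrm{sort}(R) = (A_1, \dots, A_k)$, then
$\mathbb{T}_{[R,U]} := \prod_{i=1}^{k} \mathrm{dom}(A_i) = \mathbb{A}_1 \times \dots \times \mathbb{A}_k$ (2.1)
is called the domain of R. The elements of $\mathbb{T}_{[R,U]}$ are called R-tuples (over U). If R is a relation of sort $\mathrm{sort}(R) = (A_1, \dots, A_k)$, and $t = (a_1, \dots, a_k)$ is an R-tuple, then $t[A_{i_1}, \dots, A_{i_\ell}] := (a_{i_1}, \dots, a_{i_\ell})$ denotes the restriction of t to the attributes $A_{i_1}, \dots, A_{i_\ell}$ for all $1 \le i_1 < \dots < i_\ell \le k$ and all $\ell = 1, \dots, k$.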
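In particular, $R(\mathbb{T}_{[R,U]})$ is the set of all R-facts over U, which is formally given as
$\mathbb{F}_{[R,U]} := R(\mathbb{T}_{[R,U]}) = \{ R(t) : t \in \mathbb{T}_{[R,U]} \}$. (2.2)
The elements of $\mathbb{F}_{[\tau,U]} := \bigcup_{R \in \tau} \mathbb{F}_{[R,U]}$ are called τ-facts (or just facts) over U. We use (variants of) the letter f to denote facts. A database instance D over τ and U, or τ-instance over U, is a finite bag of τ-facts over U. Thus, $\mathbb{DB}_{[\tau,U]} := B_{\mathrm{fin}}(\mathbb{F}_{[\tau,U]})$ is the set of all τ-instances over U. For R ∈ τ and $D \in \mathbb{DB}_{[\tau,U]}$, we let R(D) denote the restriction of D to R-facts. That is, the instance R(D) is given by $|R(D)|_f := |D|_f$ if $f \in \mathbb{F}_{[R,U]}$, and $|R(D)|_f := 0$ otherwise, for all $f \in \mathbb{F}_{[\tau,U]}$. From all of the notation we defined above, we omit the explicit mention of τ and U if they are clear from the context, and just write $\mathbb{T}$, $\mathbb{T}_R$, $\mathbb{F}$, $\mathbb{F}_R$ and $\mathbb{DB}$ instead of $\mathbb{T}_{[\tau,U]}$, $\mathbb{T}_{[R,U]}$, $\mathbb{F}_{[\tau,U]}$, $\mathbb{F}_{[R,U]}$ and $\mathbb{DB}_{[\tau,U]}$.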

Concept | Notation
Database schema | $\tau = (\mathcal{A}, \mathcal{R}, \mathrm{sort})$ with $\mathcal{A} \subseteq$ Att, $\mathcal{R} \subseteq$ Rel
Attribute names | $A, A_1, A_2, \dots \in \mathcal{A}$
Relation names | $R, S, R_1, R_2, \dots \in \mathcal{R}$
Sort of a relation name $R \in \mathcal{R}$ | $\mathrm{sort}(R) \in \bigcup_{k \in N} \mathcal{A}^k$
Arity of a relation name $R \in \mathcal{R}$ | $\mathrm{ar}(R) \in N$
Underlying universe | U
Space of all tuples over (τ, U) | $\mathbb{T} = \mathbb{T}_{[\tau,U]}$

In [GL19], we used an example of temperature recordings to motivate infinite probabilistic databases, to illustrate the setup, and to highlight some observations. With some customization, we adopt this example to also serve as a running example for this paper.
Example 2.1. We consider a database that stores information about the rooms in an office building. An example database instance is shown in Figure 1. The database schema τ consists of
• relation symbols $\mathcal{R}$ = {Office, TempRec},
• attribute names $\mathcal{A}$ = {RoomNo, Person, Date, Temp}, and
• sorts sort(Office) = (RoomNo, Person) and sort(TempRec) = (RoomNo, Date, Temp).
The sorted universe is given by (U, dom) with U = Σ* ∪ R where Σ is some alphabet, say, the set of ASCII symbols, and (for simplicity) dom(RoomNo), dom(Person), dom(Date) ⊆ Σ* and dom(Temp) = R.
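The schema of Example 2.1 can be sketched in Python as follows. This is only an illustration under assumptions of our own: strings model Σ*, floats model R, instances are `Counter` bags, and the instance `D` below is hypothetical (it is not the instance of Figure 1). The names `SORT`, `Fact`, `is_fact` and `restrict` are illustrative.

```python
from collections import Counter
from typing import NamedTuple

# Sorts of the two relation symbols of Example 2.1.
SORT = {
    "Office": ("RoomNo", "Person"),
    "TempRec": ("RoomNo", "Date", "Temp"),
}

class Fact(NamedTuple):
    relation: str
    values: tuple

def is_fact(f: Fact) -> bool:
    """Check that f is a fact over the schema: known relation symbol,
    matching arity, and every value lies in its attribute domain."""
    sort = SORT.get(f.relation)
    if sort is None or len(sort) != len(f.values):
        return False
    return all(
        isinstance(v, float) if attr == "Temp" else isinstance(v, str)
        for attr, v in zip(sort, f.values)
    )

def restrict(instance: Counter, relation: str) -> Counter:
    """R(D): the restriction of the instance D to R-facts."""
    return Counter({f: m for f, m in instance.items() if f.relation == relation})

# Hypothetical instance, made up for illustration only.
D = Counter([
    Fact("Office", ("4108", "Alice")),
    Fact("TempRec", ("4108", "2021-07-12", 20.5)),
    Fact("TempRec", ("4109", "2021-07-12", 21.0)),
])
```

Since instances are bags, adding the same fact to `D` a second time would raise its multiplicity rather than leave the instance unchanged.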
For example, $f_1$ = Office(4108, Bob) and $f_2$ = TempRec(4108, 2021-07-12, 20.5) are both facts from $\mathbb{F}_{[\tau,U]}$, and it holds that $f_1 \notin D$ and $f_2 \in D$ for the database instance D from Figure 1.

Let $\mathbb{X} \neq \emptyset$. A σ-algebra on $\mathbb{X}$ is a family $\mathfrak{X}$ of subsets of $\mathbb{X}$ such that $\mathbb{X} \in \mathfrak{X}$ and $\mathfrak{X}$ is closed under complements and countable unions (an equivalent definition is obtained by replacing "countable unions" with "countable intersections").
Notation 2.2. Throughout the paper, we use double struck letters ($\mathbb{X}, \mathbb{Y}, \mathbb{Z}, \mathbb{A}, \mathbb{B}, \dots$) to denote underlying spaces of interest, and fraktur letters ($\mathfrak{X}, \mathfrak{Y}, \mathfrak{Z}, \mathfrak{A}, \mathfrak{B}, \dots$) to denote set families and σ-algebras over such spaces, in particular. We usually denote subsets of the underlying spaces with bold italics letters ($\boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{A}, \boldsymbol{B}, \dots$).
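Let $\mathfrak{G}$ (fraktur "G") be a set of subsets of $\mathbb{X}$ (that is, $\mathfrak{G} \subseteq P(\mathbb{X})$). The σ-algebra generated by $\mathfrak{G}$ is the coarsest (i. e. smallest with respect to set inclusion) σ-algebra on $\mathbb{X}$ containing $\mathfrak{G}$. For any $\mathfrak{G} \subseteq P(\mathbb{X})$, the σ-algebra generated by $\mathfrak{G}$ is unique. A measurable space is a pair $(\mathbb{X}, \mathfrak{X})$ where $\mathbb{X}$ is an arbitrary set and $\mathfrak{X}$ is a σ-algebra on $\mathbb{X}$. The sets in $\mathfrak{X}$ are called $\mathfrak{X}$-measurable (or measurable, if $\mathfrak{X}$ is clear from the context). If $(\mathbb{Y}, \mathfrak{Y})$ is another measurable space, then a function $\phi : \mathbb{X} \to \mathbb{Y}$ is called $(\mathfrak{X}, \mathfrak{Y})$-measurable (or simply measurable) if $\phi^{-1}(\boldsymbol{Y}) \in \mathfrak{X}$ for all $\boldsymbol{Y} \in \mathfrak{Y}$.

The following simple properties are needed throughout the paper.
(1) Suppose $\mathfrak{Y}$ is generated by some $\mathfrak{G} \subseteq P(\mathbb{Y})$. If $\phi : \mathbb{X} \to \mathbb{Y}$ satisfies $\phi^{-1}(\boldsymbol{G}) \in \mathfrak{X}$ for all $\boldsymbol{G} \in \mathfrak{G}$, then ϕ is measurable.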
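On a finite underlying set, both the generated σ-algebra and the generator criterion (1) can be computed directly. The following Python sketch is a toy illustration only: it assumes a finite space, where closure under complements and pairwise unions already yields a σ-algebra, and all function names are our own.

```python
def generated_sigma_algebra(space, generators):
    """Smallest family of subsets that contains the generators and is closed
    under complement and union (a sigma-algebra, since the space is finite)."""
    space = frozenset(space)
    family = {frozenset(), space} | {frozenset(g) for g in generators}
    while True:
        new = {space - a for a in family} | {a | b for a in family for b in family}
        if new <= family:          # nothing new: the family is closed
            return family
        family |= new

def preimage(phi, space, target):
    """The preimage of the set `target` under phi, restricted to `space`."""
    return frozenset(x for x in space if phi(x) in target)

def measurable_on_generators(phi, space, sigma_x, generators_y):
    """Criterion (1): it suffices to check the preimages of generating sets."""
    return all(preimage(phi, space, g) in sigma_x for g in generators_y)

X = {1, 2, 3, 4}
sigma = generated_sigma_algebra(X, [{1}])   # the sets {}, {1}, {2,3,4}, X
```

With respect to `sigma`, the indicator function of {1} passes the check, whereas the indicator function of {2} does not, because {2} is not among the four measurable sets.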
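Probability and Image Measures. A probability measure on a measurable space $(\mathbb{X}, \mathfrak{X})$ is a countably additive function $P : \mathfrak{X} \to [0, 1]$ with $P(\mathbb{X}) = 1$. Then $(\mathbb{X}, \mathfrak{X}, P)$ is called a probability space. Measurable sets in probability spaces are also called events. A measurable function ϕ from a probability space $(\mathbb{X}, \mathfrak{X}, P)$ into a measurable space $(\mathbb{Y}, \mathfrak{Y})$ introduces a new probability measure $P'$ on $(\mathbb{Y}, \mathfrak{Y})$ by $P'(\boldsymbol{Y}) := P(\phi^{-1}(\boldsymbol{Y})) = P(\{ X \in \mathbb{X} : \phi(X) \in \boldsymbol{Y} \})$ for all $\boldsymbol{Y} \in \mathfrak{Y}$. Then $P' = P \circ \phi^{-1}$ is called the image or push-forward probability measure of P under (or along) ϕ. The measurability of ϕ ensures that $P'$ is well-defined.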
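For a finite probability space with the powerset σ-algebra, where every function is measurable, the image measure can be computed by summing the probabilities of all points with the same image. The sketch below assumes that a measure is given as a dictionary from points to probabilities; the name `push_forward` and the die example are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

def push_forward(P, phi):
    """Image measure of P under phi: the mass of y is P(phi^{-1}({y}))."""
    image = defaultdict(Fraction)      # Fraction() is the exact zero
    for x, p in P.items():
        image[phi(x)] += p
    return dict(image)

# A fair die, pushed forward along the parity map.
die = {face: Fraction(1, 6) for face in range(1, 7)}
parity = push_forward(die, lambda face: face % 2)
```

Exact fractions are used so that the image measure again sums to exactly 1.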
Standard Constructions. Throughout this paper, we will encounter a variety of measurable spaces. The most basic of these are either well-known measurable spaces like countable spaces with the powerset σ-algebra or uncountable spaces like R with its Borel σ-algebra. From these spaces, we then construct more complicated measurable spaces with the use of the following standard constructions.
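(1) Product spaces. For i = 1, . . . , n, let $(\mathbb{X}_i, \mathfrak{X}_i)$ be a measurable space. The product σ-algebra $\bigotimes_{i=1}^{n} \mathfrak{X}_i$ is the σ-algebra on $\prod_{i=1}^{n} \mathbb{X}_i$ that is generated by the sets $\{ \mathrm{proj}_j^{-1}(\boldsymbol{X}) : \boldsymbol{X} \in \mathfrak{X}_j \}$ with j = 1, . . . , n where $\mathrm{proj}_j$ is the canonical projection $\mathrm{proj}_j : \prod_{i=1}^{n} \mathbb{X}_i \to \mathbb{X}_j$. If $(\mathbb{X}_i, \mathfrak{X}_i) = (\mathbb{X}, \mathfrak{X})$ for all i = 1, . . . , n, we also write $\mathfrak{X}^{\otimes n}$ instead of $\bigotimes_{i=1}^{n} \mathfrak{X}$.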
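(2) Disjoint unions. If the spaces $\mathbb{X}_i$ are pairwise disjoint, then the disjoint union σ-algebra on $\bigcup_{i=1}^{n} \mathbb{X}_i$ is given by $\bigoplus_{i=1}^{n} \mathfrak{X}_i := \{ \boldsymbol{X} \subseteq \bigcup_{i=1}^{n} \mathbb{X}_i : \boldsymbol{X} \cap \mathbb{X}_i \in \mathfrak{X}_i \text{ for all } i = 1, \dots, n \}$. It may be easily verified that this is a σ-algebra on $\bigcup_{i=1}^{n} \mathbb{X}_i$.
(3) Subspaces. Let $(\mathbb{X}, \mathfrak{X})$ be a measurable space and $\boldsymbol{X} \in \mathfrak{X}$. Then $\mathfrak{X}|_{\boldsymbol{X}} := \{ \boldsymbol{X}' \cap \boldsymbol{X} : \boldsymbol{X}' \in \mathfrak{X} \}$ is a σ-algebra on $\boldsymbol{X}$.

Polish Topological Spaces. There are deep connections between measure theory and general topology. In fact, virtually everything we discuss and show in this paper relies on the presence of certain topological properties. The central notion from topology we need is that of Polish spaces [Kec95, Chapter 3]. In the following, we assume familiarity with the basic topological concepts. A small introduction to the basic terms can be found in Appendix A. For a thorough introduction, we refer to [Wil04].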
A Polish space is a topological space that is separable (i. e. there exists a countable dense set), and completely metrizable (i. e. there is a complete metric on the space generating its topology). Polish and (the later introduced) standard Borel spaces are introduced and treated in detail in [Kec95, Chapter 1 & 2] and [Sri98]. In this paper, we heavily exploit the properties of Polish spaces. The existence of a countable dense set, together with the existence of a complete metric generating the topology allows us to approximate any point in the space by countable collections of open sets. More specifically, for every point in the space, there exists a sequence over a fixed countable set that converges to the point.
The following fact lists a few natural classes of spaces that are Polish. The above properties arguably capture all typical spaces of interest for database theory. When we later work with Polish spaces, we always assume that we have a fixed Polish metric at hand, that is, a complete metric generating the Polish topology. With respect to said compatible metric, we write $B_\varepsilon(X)$ to denote the open ball of radius ε > 0 around the point $X \in \mathbb{X}$.

Standard Borel Spaces. If $(\mathbb{X}, \mathfrak{O})$ is a topological space, the Borel σ-algebra $\mathrm{Bor}(\mathbb{X})$ on $\mathbb{X}$ is the σ-algebra generated by the open sets $\mathfrak{O}$ (the topology in use is usually clear from context; in our case, provided there is one, we always use a Polish topology on $\mathbb{X}$). Sets in the Borel σ-algebra are also called Borel sets. In the following, we state some nice basic properties of standard Borel spaces.
(1) If $\phi : \mathbb{X} \to \mathbb{Y}$ is continuous, then ϕ is $(\mathrm{Bor}(\mathbb{X}), \mathrm{Bor}(\mathbb{Y}))$-measurable.
(2) If $(\mathbb{Y}, \mathfrak{O}_{\mathbb{Y}})$ is a metric topological space, and $(\phi_n)_{n \ge 0}$ is a sequence of $(\mathrm{Bor}(\mathbb{X}), \mathrm{Bor}(\mathbb{Y}))$-measurable functions $\phi_n : \mathbb{X} \to \mathbb{Y}$ with $\lim_{n \to \infty} \phi_n = \phi$, then ϕ is measurable as well.
A measurable space (X, X) is called a standard Borel space if there exists a Polish topology O on X such that X is the (Borel) σ-algebra generated by O.
Fact 2.6 (Lusin-Souslin, see [Kec95, Theorem 15.1]). Let $(\mathbb{X}, \mathfrak{X})$ and $(\mathbb{Y}, \mathfrak{Y})$ be standard Borel spaces, let $\boldsymbol{X} \in \mathfrak{X}$, and let $\phi : \mathbb{X} \to \mathbb{Y}$ be a continuous function such that $\phi|_{\boldsymbol{X}}$ (the restriction of ϕ to the set $\boldsymbol{X}$) is injective. Then $\phi(\boldsymbol{X}) \in \mathfrak{Y}$.
In other words, the above fact states that between standard Borel spaces, the image of a measurable set $\boldsymbol{X}$ under a function is measurable itself, provided that the function is continuous, and injective on $\boldsymbol{X}$.
The following fact intuitively states that the property of measurable spaces being standard Borel is closed under using the constructions introduced before.
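(1) The product space $(\prod_i \mathbb{X}_i, \bigotimes_i \mathfrak{X}_i)$ is standard Borel.
(2) If the spaces $\mathbb{X}_i$ are pairwise disjoint, then the disjoint sum $(\bigcup_i \mathbb{X}_i, \bigoplus_i \mathfrak{X}_i)$ is standard Borel. Moreover, if $(\mathbb{X}, \mathfrak{X})$ is standard Borel, and $\boldsymbol{X} \in \mathfrak{X}$, then $(\boldsymbol{X}, \mathfrak{X}|_{\boldsymbol{X}})$ is standard Borel.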
(Finite) Point Processes. A point process [DVJ03, DVJ08] is a probability space over countable sets of points in some abstract "state space" such as the Euclidean space $R^n$. Point processes are a well-studied subject in probability theory and they appear in a variety of applications such as particle physics, ecology, geostatistics (cf. [DVJ03,Bad07]) and target tracking [Deg17]. In computer science, they have applications in queuing theory [Fra82] and machine learning [KT12]. The individual outcomes (that is, point configurations) drawn from a point process are called realizations of the process. A point process is called finite if all of its realizations are finite. In the following, we recall the construction of a finite point process over a Polish state space, following the classic constructions of [Moy62,Mac75].
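Let $(\mathbb{X}, \mathfrak{X})$ be a standard Borel space. For k ∈ N and $(X_1, \dots, X_k), (Y_1, \dots, Y_k) \in \mathbb{X}^k$, we let $(X_1, \dots, X_k) \sim_k (Y_1, \dots, Y_k)$ if there exists a permutation π of {1, . . . , k} such that $(Y_1, \dots, Y_k) = (X_{\pi(1)}, \dots, X_{\pi(k)})$. Then $\sim_k$ is an equivalence relation on $\mathbb{X}^k$. Tuples in $\mathbb{X}^k$ are equivalent under $\sim_k$ if and only if they contain every individual element the exact same number of times. Thus, the quotient space $\mathbb{X}^k/{\sim_k}$ can be identified with the set $B_k(\mathbb{X})$ of bags of cardinality exactly k over $\mathbb{X}$. Note that in the case k = 0, $\mathbb{X}^0/{\sim_0}$ (by convention) consists of a single distinguished point representing the empty realization. The space $\bigcup_{k=0}^{\infty} \mathbb{X}^k/{\sim_k} = B_{\mathrm{fin}}(\mathbb{X})$ is one of the canonical and equivalent choices for the sample space for a finite point process on $\mathbb{X}$ [Mac75]. The original construction of [Moy62] considers the symmetrization $\mathrm{sym} : \bigcup_{k=0}^{\infty} \mathbb{X}^k \to B_{\mathrm{fin}}(\mathbb{X})$ where for all $(X_1, \dots, X_k) \in \mathbb{X}^k$ it holds that $\mathrm{sym}(X_1, \dots, X_k) = \{\{X_1, \dots, X_k\}\}$. The σ-algebra on $B_{\mathrm{fin}}(\mathbb{X})$ is then defined as
$\{ \boldsymbol{B} \subseteq B_{\mathrm{fin}}(\mathbb{X}) : \mathrm{sym}^{-1}(\boldsymbol{B}) \in \bigoplus_{k=0}^{\infty} \mathfrak{X}^{\otimes k} \}$, (2.3)
that is, as the σ-algebra on $B_{\mathrm{fin}}(\mathbb{X})$ induced by $\mathrm{sym} : \bigcup_{k=0}^{\infty} \mathbb{X}^k \to B_{\mathrm{fin}}(\mathbb{X})$. Note that this is indeed a σ-algebra on $B_{\mathrm{fin}}(\mathbb{X})$, cf. [Kal02, Lemma 1.3].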
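The identification of tuples up to reordering with bags can be made concrete in a few lines of Python. The sketch assumes tuples of hashable points and uses `collections.Counter` as the bag representation; `sym` and `equivalent` are illustrative names for the symmetrization and for the equivalence of tuples of equal length.

```python
from collections import Counter
from itertools import permutations

def sym(points: tuple) -> Counter:
    """Symmetrization: forget the order of a tuple, keep the multiplicities."""
    return Counter(points)

def equivalent(s: tuple, t: tuple) -> bool:
    """Two tuples of the same length are equivalent if they contain every
    element the exact same number of times."""
    return len(s) == len(t) and sym(s) == sym(t)

# All reorderings of a tuple are mapped to one and the same bag.
same_bag = all(sym(p) == sym((3, 1, 3)) for p in permutations((3, 1, 3)))
```

The empty tuple is mapped to the empty bag, which plays the role of the distinguished point representing the empty realization.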
Remark 2.8 (cf. [Moy62, Section 2]). A set $\boldsymbol{X} \in \mathfrak{X}^{\otimes k}$ is called symmetric, if for all $(X_1, \dots, X_k) \in \mathbb{X}^k$ it holds that $(X_1, \dots, X_k) \in \boldsymbol{X}$ implies $(X_{\pi(1)}, \dots, X_{\pi(k)}) \in \boldsymbol{X}$ for all permutations π of {1, . . . , k}. The function $\boldsymbol{X} \mapsto \mathrm{sym}(\boldsymbol{X})$ is a bijection between the symmetric sets $\boldsymbol{X}$ in $\bigoplus_{k=0}^{\infty} \mathfrak{X}^{\otimes k}$ and the sets of (2.3).

An equivalent, but technically more convenient construction is motivated by interpreting point processes as random counting measures [DVJ08]. For $\boldsymbol{X} \in \mathfrak{X}$ and n ∈ N, let $\#(\boldsymbol{X}, n) := \{ B \in B_{\mathrm{fin}}(\mathbb{X}) : |B|_{\boldsymbol{X}} = \sum_{X \in \boldsymbol{X}} |B|_X = n \}$, that is, $\#(\boldsymbol{X}, n)$ is the set of finite bags over $\mathbb{X}$ containing exactly n elements from $\boldsymbol{X}$, counting multiplicities. The sets $\#(\boldsymbol{X}, n)$ with $\boldsymbol{X} \in \mathfrak{X}$ and n ∈ N are called counting events. The counting σ-algebra $\mathrm{Count}(\mathbb{X})$ on $B_{\mathrm{fin}}(\mathbb{X})$ is the σ-algebra generated by all counting events.

Definition 2.10. A finite point process with standard Borel state space $(\mathbb{X}, \mathfrak{X})$ is a probability space $(B_{\mathrm{fin}}(\mathbb{X}), \mathrm{Count}(\mathbb{X}), P)$. A finite point process is simple if $P(P_{\mathrm{fin}}(\mathbb{X})) = 1$, that is, if its realizations are almost surely sets.
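Membership of a concrete bag in a counting event is easy to test. In the sketch below, the measurable set is assumed to be given by a membership predicate (so that uncountable sets such as real intervals can be handled), and bags are `Counter` objects; the names `in_counting_event`, `is_set` and the sample bag are illustrative.

```python
from collections import Counter

def in_counting_event(bag: Counter, contains, n: int) -> bool:
    """Does the bag lie in #(X, n)? The set X is given by the predicate
    `contains`; elements are counted with their multiplicities."""
    return sum(m for x, m in bag.items() if contains(x)) == n

def is_set(bag: Counter) -> bool:
    """A bag is a set if no element occurs more than once."""
    return all(m <= 1 for m in bag.values())

def in_interval(x: float) -> bool:
    """Membership in the closed real interval [21, 23]."""
    return 21 <= x <= 23

# Illustrative bag of real numbers: 21.5 occurs twice.
B = Counter([21.5, 21.5, 22.0, 30.0])
```

Here `B` lies in the counting event for the interval [21, 23] and n = 3, but it is not a set, so it would not be a realization of a simple point process.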

Probabilistic Databases
In this section, we introduce our framework for infinite probabilistic databases and their query semantics. To begin with, we recall the conventional formal definition of probabilistic databases as it is found in textbooks on the subject [SORK11, VdBS17]:

Definition 3.1 (Finite Probabilistic Databases; adapted from [SORK11, Section 2.2]). Let τ be a database schema and let U be a universe. A probabilistic database (PDB) $\mathcal{D}$ over τ and U is a probability space $\mathcal{D} = (\mathbb{DB}, P(\mathbb{DB}), P)$ where $\mathbb{DB}$ is a finite set of database instances over τ and U.
From a probability theoretic point of view, Definition 3.1 is a severe limitation to the vast expressive power of stochastic models, only allowing probability distributions over finitely many alternative database instances. As an example, in a setting such as that of Example 2.1 (the database of temperature measurements), we would typically model noise or uncertainty in the sensor measurements by continuous distributions. This directly leads to (even uncountably) infinite probability spaces that are not covered by the typical textbook definition of probabilistic databases (Definition 3.1). In [GL19], we introduced the following general notion of probabilistic databases as probability spaces of database instances:

Definition 3.2. [...] and $\mathfrak{DB}$ is a σ-algebra on $\mathbb{DB}$ satisfying [...]

Therein, the sample space $\mathbb{DB}$ may be infinite, even uncountable. While we only discussed set PDBs in [GL19], we broaden the definition to support bag instances here. In any case, it was left open in [GL19] how to construct such probability spaces, let alone how to obtain suitable measurable spaces $(\mathbb{DB}, \mathfrak{DB})$. This is no longer a trivial task once $\mathbb{DB}$ is uncountable. In this section, we provide a general construction for such measurable spaces.
Remark 3.3. At this point, we want to stress again the meaning of our terminology. In general probabilistic databases there are lots of components, on different levels of abstraction that could, in principle, be infinite spaces. The term infinite probabilistic databases is derived from the notion of an infinite probability space, meaning that the sample space is of infinite size. Still, in the PDBs we consider in this paper, database instances themselves (the concrete outcomes, or realizations of a PDB) are always finite collections of facts. The framework we describe here is not suitable for discussing probability spaces over infinite database instances in the sense of [AHV95, Section 5.6], such as constraint databases [KLP00].

We first construct the measurable space of facts over τ and U. For all $A \in \mathcal{A}$, by assumption, the domain $\mathbb{A} := \mathrm{dom}(A)$ is Polish. Equipping it with its Borel σ-algebra $\mathfrak{A}$ thus yields a standard Borel space $(\mathbb{A}, \mathfrak{A})$. Now let R ∈ τ be a relation symbol with $\mathrm{sort}(R) = (A_1, \dots, A_k)$, so that the standard Borel spaces belonging to the attribute names $A_1, \dots, A_k$ are $(\mathbb{A}_1, \mathfrak{A}_1), \dots, (\mathbb{A}_k, \mathfrak{A}_k)$. Recall from (2.1) that the set $\mathbb{T}_R$ of R-tuples is the product of the attribute domains in R. Naturally, $\mathbb{T}_R$ is equipped with its product σ-algebra
$\mathfrak{T}_R := \bigotimes_{i=1}^{k} \mathfrak{A}_i$. (3.1)
Likewise, the set $\mathbb{F}_R$ of R-facts is equipped with the σ-algebra $\mathfrak{F}_R := P(\{R\}) \otimes \mathfrak{T}_R$ (cf. (2.2)). As the set $\mathbb{F}$ of all facts (over τ and U) is the disjoint union of all the $\mathbb{F}_R$, it is naturally equipped with the disjoint union σ-algebra $\mathfrak{F} := \bigoplus_{R \in \tau} \mathfrak{F}_R$.

Example 3.5. We reenact these definitions for our running example (Example 2.1). Recall that the attribute names A ∈ {RoomNo, Person, Date} have domain Σ* for some (finite) alphabet Σ. As Σ* is countably infinite, we equip this space with its powerset σ-algebra, P(Σ*). For the attribute Temp, we assume the measurable space to be (R, Bor(R)), where Bor(R) is the Borel σ-algebra on R.
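Given the sorts of the relation names Office and TempRec, we have the tuple spaces $\mathbb{T}_{\mathrm{Office}} = \Sigma^* \times \Sigma^*$ and $\mathbb{T}_{\mathrm{TempRec}} = \Sigma^* \times \Sigma^* \times R$. Following (3.1), the corresponding σ-algebras are $\mathfrak{T}_{\mathrm{Office}} = P(\Sigma^*) \otimes P(\Sigma^*)$ and $\mathfrak{T}_{\mathrm{TempRec}} = P(\Sigma^*) \otimes P(\Sigma^*) \otimes \mathrm{Bor}(R)$. That is, $\mathfrak{T}_{\mathrm{Office}}$ is the σ-algebra generated by all events of the shape $L_1 \times L_2$ where $L_1, L_2 \subseteq \Sigma^*$, and $\mathfrak{T}_{\mathrm{TempRec}}$ is the σ-algebra generated by all events of the shape $L_1 \times L_2 \times B$ where $L_1, L_2 \subseteq \Sigma^*$ and B is a Borel set in R.

The spaces of Office- and TempRec-facts are given by $\mathbb{F}_{\mathrm{Office}} = \mathrm{Office}(\mathbb{T}_{\mathrm{Office}}) = \{ \mathrm{Office}(r, s) : r, s \in \Sigma^* \}$ and $\mathbb{F}_{\mathrm{TempRec}} = \mathrm{TempRec}(\mathbb{T}_{\mathrm{TempRec}}) = \{ \mathrm{TempRec}(r, d, \theta) : r, d \in \Sigma^* \text{ and } \theta \in R \}$. For example (assuming that the alphabet Σ contains the respective symbols), it holds that Office(4108, Bob) ∈ $\mathbb{F}_{\mathrm{Office}}$, and TempRec(4108, 2021-07-12, 20.5) ∈ $\mathbb{F}_{\mathrm{TempRec}}$. The σ-algebras on these spaces of facts are directly obtained from the σ-algebras on the tuple spaces: $\mathfrak{F}_{\mathrm{Office}} = P(\{\mathrm{Office}\}) \otimes \mathfrak{T}_{\mathrm{Office}}$ and $\mathfrak{F}_{\mathrm{TempRec}} = P(\{\mathrm{TempRec}\}) \otimes \mathfrak{T}_{\mathrm{TempRec}}$. Finally, $\mathfrak{F} = \mathfrak{F}_{\mathrm{Office}} \oplus \mathfrak{F}_{\mathrm{TempRec}}$.

For example, consider the sets $\boldsymbol{F}_1 = [\dots]$ and $\boldsymbol{F}_2 = \{ \mathrm{TempRec}(4108, d, \theta) : d \in \Sigma^* \text{ and } \theta \in [21, 23] \}$. Then $\boldsymbol{F}_1 \in \mathfrak{F}_{\mathrm{Office}}$ and $\boldsymbol{F}_2 \in \mathfrak{F}_{\mathrm{TempRec}}$. Note that due to the real interval, $\boldsymbol{F}_2$ contains uncountably many facts. As $\boldsymbol{F}_1 \in \mathfrak{F}_{\mathrm{Office}}$ and $\boldsymbol{F}_2 \in \mathfrak{F}_{\mathrm{TempRec}}$, it holds that $\boldsymbol{F}_1 \cup \boldsymbol{F}_2 \in \mathfrak{F}$.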
The constructions of the various measurable spaces above all started from standard Borel measurable spaces for the attribute domains, and then used the basic constructions of product and disjoint union measurable spaces. Therefore, Fact 2.7 immediately yields the following statement.

Lemma 3.6. The spaces $(\mathbb{T}_R, \mathfrak{T}_R)_{R \in \tau}$, $(\mathbb{F}_R, \mathfrak{F}_R)_{R \in \tau}$, and $(\mathbb{F}, \mathfrak{F})$ are standard Borel spaces.
Definition 3.7. A standard probabilistic database (standard PDB) over τ and U is a probability space (DB, DB, P ) with DB = B fin (F) and DB = Count(F).
That is, a standard PDB is a finite point process over the state space (F, F). Note that every standard PDB is a PDB in the sense of Definition 3.2. Standard PDBs that are simple are suitable for modeling PDBs with set semantics.
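To give an intuition for a standard PDB with uncountably many possible worlds, the following Python sketch samples instances from a toy finite point process over TempRec-facts. The model is an assumption of ours, made up for illustration: the number of facts is geometrically distributed (so there is no a priori bound on the instance size, yet every instance is finite almost surely), and the temperatures are normally distributed. The names `sample_instance` and `estimate`, the room number, date, and all parameters are hypothetical.

```python
import random
from collections import Counter

def sample_instance(rng: random.Random, p_more: float = 0.5) -> Counter:
    """Draw one possible world: a finite bag of TempRec-facts."""
    facts = Counter()
    while rng.random() < p_more:            # geometric number of facts
        theta = rng.gauss(21.0, 1.0)        # continuous attribute-level noise
        facts[("TempRec", "4108", "2021-07-12", theta)] += 1
    return facts

def estimate(event, runs: int = 5000, seed: int = 0) -> float:
    """Monte Carlo estimate of the probability of an event, where the
    event is given as a predicate on database instances."""
    rng = random.Random(seed)
    return sum(bool(event(sample_instance(rng))) for _ in range(runs)) / runs

def exactly_one_in_range(instance: Counter) -> bool:
    """Counting event: exactly one fact with temperature in [21, 23]."""
    return sum(m for f, m in instance.items() if 21 <= f[3] <= 23) == 1

p_one = estimate(exactly_one_in_range)
p_empty = estimate(lambda instance: len(instance) == 0)
```

In this toy model the empty instance has probability 1/2 by construction, whereas every single non-empty instance has probability zero; only events such as counting events carry positive probability.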
Example 3.8. We continue from Example 3.5. The space $\mathbb{DB}$ of database instances is exactly the space of finite bags over $\mathbb{F} = \mathbb{F}_{\mathrm{Office}} \cup \mathbb{F}_{\mathrm{TempRec}}$ with the counting σ-algebra $\mathfrak{DB} = \mathrm{Count}(\mathbb{F})$. This means, for example, that the set $\#(\boldsymbol{F}_2, 3)$ is measurable (i. e., an event) in $(\mathbb{DB}, \mathfrak{DB})$. This is the set of database instances over $\mathbb{F}$ that contain exactly three facts of the shape TempRec(4108, d, θ) where d is an arbitrary string and 21 ≤ θ ≤ 23. Note that these three facts need not be distinct, as facts are allowed to be present with multiplicities. A standard probabilistic database of the example schema is just a probability space with underlying measurable space $(\mathbb{DB}, \mathfrak{DB})$. In particular, in these PDBs, events such as $\#(\boldsymbol{F}_2, 3)$ carry a probability.

Proof. In [DVJ08, Proposition 9.1.IV], it is shown that the space of N ∪ {∞}-valued counting measures on a standard Borel space (with the property of being finite on bounded sets) is a Polish space, whose σ-algebra is generated by the functions that map a counting measure µ to $\mu(\boldsymbol{F}) \in N \cup \{\infty\}$ for all $\boldsymbol{F} \in \mathfrak{F}$. Restricting this space to integer-valued counting measures, and equipping it with the corresponding subspace σ-algebra yields a standard Borel space. This space is isomorphic to $(\mathbb{DB}, \mathfrak{DB})$ via the function that maps a counting measure µ to the database instance whose multiplicity mapping is given by µ.¹

Convention 3.10. From now on, all PDBs we consider are standard PDBs. When we speak of PDBs, it is understood to refer to standard PDBs, unless explicitly stated otherwise.

¹ Isomorphic here means that this function is bijective and measurable both ways; bijectivity is clear, and being measurable both ways stems from the fact that the generating events of the σ-algebra of the space of integer-valued counting measures on $(\mathbb{F}, \mathfrak{F})$ are identified with the counting events in $(\mathbb{DB}, \mathfrak{DB})$, i. e. the generating events of $\mathfrak{DB}$.

The Possible Worlds Semantics of Queries and Views. Views are mappings between database instances. That is, a view V is a function $V : \mathbb{DB}_{[\tau,U]} \to \mathbb{DB}_{[\tau',U']}$ for some database schemas τ and τ' and universes U and U'. We call τ the input, and τ' the output schema of the view V. If τ' consists of a single relation symbol only, we call V a query. Queries are typically denoted by Q. Usually, queries and views are given as syntactic expressions in some query language. As usual, we blur the distinction between a query or view, and its syntactic representation. In the following, we let $(\mathbb{DB}, \mathfrak{DB})$ and $(\mathbb{DB}_V, \mathfrak{DB}_V)$ denote the measurable spaces of input and output instances of V. Let $\mathcal{D} = (\mathbb{DB}, \mathfrak{DB}, P)$ be a PDB and let $V : \mathbb{DB} \to \mathbb{DB}_V$ be a measurable view. Then the output of V on $\mathcal{D}$ is the probability space $(\mathbb{DB}_V, \mathfrak{DB}_V, P \circ V^{-1})$, where
$(P \circ V^{-1})(\boldsymbol{D}) = P(\{ D \in \mathbb{DB} : V(D) \in \boldsymbol{D} \})$ for all $\boldsymbol{D} \in \mathfrak{DB}_V$. (3.2)

Remark 3.11. The kind of semantics we introduce here is the natural generalization of a standard choice for semantics on probabilistic databases. Conceptually, we consider queries and views that have well-defined semantics on traditional database instances. That is, they get as input a database instance, and as output, produce a new database instance. Such a semantics is lifted to probabilistic databases by applying the query or view on every possible world, and weighting it according to the probability measure of the input PDB. This notion of semantics is commonly called the possible answer sets semantics of probabilistic databases [SORK11, Section 2.3.1]. It (or, to be more precise, its discrete version) has previously been also called possible worlds semantics (of queries) [DS07], which is the term we prefer to use in this work, as we deem it the natural choice on how to define query (or view) semantics on PDBs that are modelled as a collection of possible worlds with a probability distribution, matching the standard definition of output probabilities of a (measurable) function on a probability space (cf. [Gre09]).
Note that strictly speaking, this overloads the term "possible worlds semantics": in reference to PDBs, "possible worlds semantics" means the definition of PDBs as probability spaces over database instances, whereas in reference to queries or views, it means the definition of the output of a query with respect to an application per possible world, as in (3.2). For queries, another semantics has been discussed in the literature, which was later dubbed the possible answers semantics [SORK11, Section 2.3.2]. Under this semantics, the output of a query is the collection of tuples that may appear as an answer to the query (i. e. the tuples that appear in the output possible worlds under the possible worlds semantics), together with their marginal probability. For finite PDBs, this notion makes sense, because the result will be much smaller than a description of the whole output probability under possible worlds semantics. For uncountably infinite PDBs, however, this is not of much use. As soon as continuous probability distributions are involved, we naturally encounter PDBs where the marginal probability of every particular fact (or tuple in the output) may be zero.
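For a finite set PDB, the difference between the two semantics can be seen in a small Python sketch. The PDB, the query `warm_rooms` and the function names are hypothetical examples of ours: possible worlds are frozensets of (room, temperature) pairs with exact probabilities.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical finite PDB with three possible worlds.
pdb = {
    frozenset({("4108", 20.5), ("4109", 19.0)}): Fraction(1, 2),
    frozenset({("4108", 21.0)}): Fraction(1, 4),
    frozenset({("4108", 21.5), ("4109", 22.0)}): Fraction(1, 4),
}

def warm_rooms(world):
    """Example query: the rooms with a recorded temperature above 20."""
    return frozenset(room for room, temp in world if temp > 20)

def possible_worlds(pdb, query):
    """Possible worlds semantics: a distribution over output instances."""
    out = defaultdict(Fraction)
    for world, p in pdb.items():
        out[query(world)] += p
    return dict(out)

def possible_answers(pdb, query):
    """Possible answers semantics: the marginal probability of each tuple."""
    out = defaultdict(Fraction)
    for world, p in pdb.items():
        for answer in query(world):
            out[answer] += p
    return dict(out)

worlds = possible_worlds(pdb, warm_rooms)
answers = possible_answers(pdb, warm_rooms)
```

The first function returns a probability distribution over answer sets, the second only per-tuple marginals. In a continuous setting such marginals would typically all be zero, which is why only the first notion carries over.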
We note that to the best of our knowledge, there has been no formal description of these semantics when duplicates are allowed. Note that for Boolean queries with set semantics, i. e., queries whose output is either {()} (true) or ∅ (false), both of the above semantics are essentially equivalent: the only possible answer tuple is the empty tuple (), and the only possible worlds of the answer are ∅ and {()}.
Note that if V fails to be measurable, then (3.2) is not well-defined. In this case, V has no meaningful semantics on probabilistic databases! Thus, discussing the measurability of views and queries is an issue that requires attention. The following example shows that there are inconspicuous, seemingly simple queries that are not measurable.
Example 3.12. Consider U = U' = R, together with the database schemas τ = τ' consisting of the single, unary relation symbol R with domain $\mathbb{T}_R = R$ (equipped with the Borel σ-algebra). Let $\boldsymbol{B} \in \mathrm{Bor}(R^2)$. We define a function [...] It is well known that there are Borel sets $\boldsymbol{B} \subseteq R^2$ with the property that proj [...]

[...] From the above equality, we get that for all $D \in \mathbb{DB}$, all $\boldsymbol{F} \in \mathfrak{F}_V$, and all n ∈ N it holds that $|V(D)|_{\boldsymbol{F}} = n$ if and only if there exist non-negative integers $n_1, \dots, n_k$ with $n_1 + \dots + n_k = n$ and the property that $|Q_i(D)|_{\boldsymbol{F} \cap \mathbb{F}_{Q_i}} = n_i$ for all i = 1, . . . , k. Note that this condition corresponds to an event given by a countable union of counting events. Thus, we obtain the following.
Lemma 3.13. The view V is measurable if and only if Q i is measurable for all i = 1, . . . , k.
By virtue of Lemma 3.13, we only need to discuss the measurability of queries.

General Measurability Criteria
In the remainder of the paper we establish measurability results for various types of queries as they typically appear in database applications. In this section, we set out the general setup of said investigation and introduce some general measurability results that are not yet tailored to specific query languages.
4.1. Setup. Henceforth, we adhere to the following notational conventions when discussing the measurability of a query Q.
Convention 4.1 (Inputs). The input schema of Q is τ = (A, R, sort). We consider input instances over τ , and the sorted universe U (with all attribute domains Polish).
The associated fact space is denoted as (F, F), with subspaces (F R , F R ) for all R ∈ τ . The space of R-tuples is given as (T R , T R ). For all R ∈ τ , we fix a compatible Polish metric d R on T R and let T * R be a countable, dense set in T R . With abuse of notation, we denote the corresponding metric on F R by d R as well. Note that F * R := T * R is a countable dense set in F R . We denote the input (standard) PDB under consideration by D = (DB, DB, P ), where DB = B fin (F) and DB = Count(F).
Convention 4.2 (Outputs). We consider output instances over τ Q , and sorted universe U Q (with all attribute domains Polish). The associated fact space is denoted as (F Q , F Q ). The space of R Q -tuples is given as (T Q , T Q ). We fix a compatible Polish metric d Q on T Q , and a countable dense set T * Q in T Q . Again d Q will also denote the corresponding metric on F Q , and the set F * Q is countable and dense in F Q . The output measurable space is denoted by (DB Q , DB Q ), where DB Q = B fin (F Q ) and DB Q = Count(F Q ).
Thus, our goal is to show that a given function Q : DB → DB Q is (DB, DB Q )-measurable.
Remark 4.3. We have some straightforward measurability criteria from the general properties of measurable functions and the used σ-algebras.
(1) By Fact 2.3(1), to show the measurability of a query Q, it suffices to show that {D ∈ DB : |Q(D)| F = n} ∈ DB for all F ∈ F Q and all n ∈ N, as the counting events generate the σ-algebra DB Q . This remains true if we replace "= n" with "≥ n" or only require n ∈ N + .
(2) By Fact 2.3(2), compositions of measurable functions are measurable. That is, for query languages whose semantics are defined inductively over the structure of their syntactic expression, it suffices to show measurability for the basic building blocks.
(3) By Fact 2.5(2), limits of measurable queries are measurable.
4.2. The Mapping Theorem. The following is a partial restatement of what is known as the mapping theorem of point processes. The original theorem from point process theory also involves the transfer of certain properties to the image space [LP17, Theorem 5.1] which is, however, of less importance for the remainder of the paper. Moreover, we allow partial transformations as long as their domain is measurable (and this does not invalidate the measurability statement from the mapping theorem).
for all D ∈ DB is a measurable query.
In this (restricted) form, the theorem is straightforward to verify.
Proof. Let F q ∈ F and q : F q → F Q be measurable. Now let F ∈ F Q and n ∈ N. Then |Q(D)| F = n ⇐⇒ |D| q −1 (F ) = n.
Since q −1 (F ) ∈ F, the claim follows.
Intuitively, the theorem states that whenever we have a measurable transformation of the fact space of a PDB, then we obtain a measurable query when we just apply this transformation to all facts in the database instances. 2
Example 4.5. We continue our running example. Recall that F TempRec is the space of facts TempRec(r, d, θ) where r and d are strings and θ is a real number. Consider the function q : F TempRec → F TempRec that increases the temperature by 2°C, i. e. with q(TempRec(r, d, θ)) = TempRec(r, d, θ + 2).
Then q is (F TempRec , F TempRec )-measurable. This follows, since the addition of 2 is a continuous function on R and by the construction of the measurable spaces.
Thus, by Theorem 4.4, the query that, given an instance D, applies q to every fact in D (i. e. increasing temperatures by 2) is measurable.
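The counting identity from the proof of Theorem 4.4 can be checked on a small finite instance. The following is only an illustrative sketch: the names `apply_mapping`, `count_in` and `shift` are hypothetical, bags are modeled as Python `Counter` objects, facts as plain tuples, and q is assumed to be total (the theorem also permits a measurable partial domain F q ).

```python
from collections import Counter

def apply_mapping(instance, q):
    """Apply q to every fact of a bag instance, keeping multiplicities."""
    out = Counter()
    for fact, mult in instance.items():
        out[q(fact)] += mult
    return out

def count_in(instance, predicate):
    """Counting value |D|_F for the fact set F = {f : predicate(f)}."""
    return sum(mult for fact, mult in instance.items() if predicate(fact))

def shift(fact):
    """Hypothetical q from Example 4.5: raise the temperature by 2 degrees."""
    room, date, theta = fact
    return (room, date, theta + 2.0)

# A bag instance with a duplicate fact.
D = Counter({("r1", "d1", 20.5): 2, ("r2", "d1", 18.0): 1})
QD = apply_mapping(D, shift)

# F: facts with temperature at least 22; its preimage under q.
in_F = lambda fact: fact[2] >= 22.0
in_preimage = lambda fact: in_F(shift(fact))

# |Q(D)|_F equals |D|_{q^{-1}(F)}, as in the proof.
assert count_in(QD, in_F) == count_in(D, in_preimage)
```

The check only exercises one set F on one instance; the theorem itself is a statement about all measurable F and all instances.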
While Theorem 4.4 is a nice statement, it fails to cover most queries of interest, as database queries often consider and manipulate multiple tuples at once. Such transformations are not captured by Theorem 4.4. We therefore need measurability statements beyond Theorem 4.4.

4.3. Continuous One-to-One Decompositions. In this subsection we introduce a new criterion for query measurability that overcomes the aforementioned limitation of Theorem 4.4.
Lemma 4.6. Let Q : DB → DB Q . Then Q is measurable if there exists some k ∈ N + , pairwise distinct R 1 , . . . , R k ∈ τ and functions q i : F Q → F R i for i = 1, . . . , k with the following properties:
(1) For all n ∈ N + there is a set N Q (n) ⊆ N k \ {(0, . . . , 0)} with |Q(D)| {f} = n ⇐⇒ (|D| {q 1 (f)} , . . . , |D| {q k (f)} ) ∈ N Q (n) for all D ∈ DB and all f ∈ F Q .
(2) For all i = 1, . . . , k, the function q i is injective and continuous.
Think of the functions q i as providing a decomposition of R Q -facts into the R i -facts of the input instance they originated from under the query. The set N Q (n) provides the recipe, how the number of occurrences of a fact in the output is determined by the counts of its decompositions in the input. That is, requirement (1) stipulates the query semantics. The topological requirement (2) ensures measurability.
A simple example application that we ask the reader to have in mind is that of a difference operator on bag instances (this is in fact, one of our later applications, where it is made precise). Intuitively, if Q is the difference of relations R and S, then the number of times a tuple t occurs in the output is given by the number of times it appears in R, minus the number of times it appears in S. This fits the pattern of Lemma 4.6 with functions q 1 , q 2 being the identity t → t, and N Q (n) being the set of pairs (n 1 , n 2 ) with max(0, n 1 − n 2 ) = n. Lemma 4.6 provides a generalization of this setup, allowing for much more general functions q i , and sets N Q (n).
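For the difference operator, the role of the sets N Q (n) can be made concrete on finite bags. This is a minimal sketch under the assumptions stated in the paragraph above (identity decompositions, N Q (n) consisting of the pairs with max(0, n 1 − n 2 ) = n); the helper names `in_N_Q` and `difference` are hypothetical, and relations are modeled as `Counter` objects over real-valued tuples.

```python
from collections import Counter

def in_N_Q(n, n1, n2):
    """Membership test for N_Q(n) = {(n1, n2) : max(0, n1 - n2) = n}."""
    return max(0, n1 - n2) == n

def difference(R, S):
    """Bag difference: a tuple occurs max(0, |R|_t - |S|_t) times."""
    out = Counter()
    for t in R:
        mult = max(0, R[t] - S[t])
        if mult > 0:
            out[t] = mult
    return out

R = Counter({1.5: 3, 2.0: 1})
S = Counter({1.5: 1, 2.0: 4})
Q = difference(R, S)

# Requirement (1) of Lemma 4.6 on this instance: the output count of t is n
# exactly if the pair of input counts lies in N_Q(n), for all n >= 1.
for t in set(R) | set(S):
    for n in range(1, 5):
        assert (Q[t] == n) == in_N_Q(n, R[t], S[t])
```

Requirement (2) is a topological condition on the q i and has no finite counterpart in this sketch; for the identity maps it holds trivially.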
2 Functions of the shape of Q in Theorem 4.4 are a special case of "mapping constructs" (applying a function to every element of a bag) that can be found in previously considered bag query languages [GM96,LW97].
Remark 4.7. This is vaguely related to the notion of (why-and how-)provenance of a tuple in the output of a view [GKT07], with the functions q i providing the "why-information", and the set N Q (n) providing the "how". Lemma 4.6 now only applies to queries inducing a very particular provenance structure (as governed by the q i and N Q (n)).
Proof of Lemma 4.6. Let Q : DB → DB Q , let k ∈ N + . Let R 1 , . . . , R k be distinct relation symbols in τ . In the following, we write (F i , F i ) instead of (F R i , F R i ) for all i = 1, . . . , k.
As required, let q i : F Q → F i for all i = 1, . . . , k and assume that (1) and (2) hold. We have to show that Q is measurable, which is settled, in particular, by showing that {D ∈ DB : |Q(D)| F ≥ n} ∈ DB for all F ∈ F Q and all n ∈ N + . Let d be a fixed Polish metric on F Q generating F Q and let F * Q ⊆ F Q be countable and dense in F Q . Let F ∈ F Q and n ∈ N + . We show that for all D ∈ DB, it holds that |Q(D)| F ≥ n is equivalent to the following condition.
To conclude the proof, we argue that Condition 4.8 can be used to express |Q(D)| F ≥ n as a countable combination of counting events. Note that because Q + is dense in R + , the equivalence of |Q(D)| F ≥ n and Condition 4.8 still holds, if the numbers ε 0 and ε are additionally required to be rational. Also, the fact sets in (b) are measurable in F i : the open balls are certainly measurable, and as q i is injective and continuous, q i maps measurable sets to measurable sets by Fact 2.6. Then the set of database instances D ∈ DB with Condition 4.8 is of the shape with the indices ranging as in Condition 4.8 and ε, ε 0 additionally restricted to Q.

M. Grohe and P. Lindner, Vol. 18:1
That is, d R (D) is the smallest distance between any two R-tuples of D (or ∞, if D contains at most one R-fact).
Definition 4.9. Let ε > 0. An instance D ∈ DB is called ε-coarse if for all R ∈ τ it holds that d R (D) > ε. We denote the set of ε-coarse instances by DB| ε .
Unfolding the definition, all instances in DB| ε have the property that their tuples (per relation) are sufficiently far apart with respect to the metric on the respective space of tuples.
Example 4.10. Consider our example of temperature recordings, but for simplicity (in order not to have to discuss the metric on the product space), assume that there is only one relation, with a single attribute for (real-valued) temperature recordings. Then a database instance is ε-coarse precisely if the temperatures occurring in it differ by more than ε between distinct facts.
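In the single-attribute setting of Example 4.10, ε-coarseness reduces to a check on the sorted distinct temperatures. The sketch below is illustrative only: `min_distance` and `is_coarse` are hypothetical names, the metric is assumed to be the absolute difference on the reals, and a bag is given as a plain list that may contain duplicates.

```python
import math

def min_distance(temps):
    """Smallest distance between two distinct tuples (infinity if there is at most one)."""
    distinct = sorted(set(temps))
    if len(distinct) < 2:
        return math.inf
    # For real numbers, the closest pair is adjacent after sorting.
    return min(b - a for a, b in zip(distinct, distinct[1:]))

def is_coarse(temps, eps):
    """An instance is eps-coarse if distinct tuples are more than eps apart."""
    return min_distance(temps) > eps

recordings = [20.0, 20.5, 23.0, 20.5]  # 20.5 occurs twice; duplicates do not matter
print(min_distance(recordings))
print(is_coarse(recordings, 0.4), is_coarse(recordings, 0.5))
```

Because every instance is finite, `min_distance` is always positive, so each instance is ε-coarse for all sufficiently small ε.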
The following lemma states that the set of ε-coarse instances is measurable for any ε > 0.
Lemma 4.11. For all ε > 0, it holds that DB| ε ∈ DB.
Proof. An instance D ∈ DB is not ε-coarse if, for some R ∈ τ , there exist t 1 , t 2 ∈ T R (D) such that t 1 ≠ t 2 and d R (t 1 , t 2 ) ≤ ε. We claim that this is the case if and only if D satisfies the following condition.
Condition 4.12. There are k 1 , k 2 ∈ N + and ε 0 > 0 such that for all r ∈ (0, ε 0 ) there are t * 1,r , t * 2,r ∈ T * R with ε 0 < d R (t * 1,r , t * 2,r ) < ε + 2r such that (1) |D| R(Br(t * 1,r )) = k 1 and (2) |D| R(Br(t * 2,r )) = k 2 .
The condition basically states that we can find approximations t * 1,r and t * 2,r of two tuples t 1 and t 2 in T R (D) that witness that d R (t 1 , t 2 ) is too small. In particular, the role of ε 0 in Condition 4.12 is to guarantee that t * 1,r and t * 2,r do not end up approximating the same tuple. We proceed to show the following equivalence: there are t 1 , t 2 ∈ T R (D), t 1 ≠ t 2 with d R (t 1 , t 2 ) ≤ ε ⇔ D satisfies Condition 4.12.
Note that for every database instance D, there exists an ε > 0, small enough, such that D ∈ DB| ε , because D is finite. This means that ⋃ ε>0 DB| ε covers DB (even if the union is taken only over rational ε).
Corollary 4.13. If for all ε > 0, all F ∈ F Q , and all n ∈ N it holds that {D ∈ DB| ε : |Q(D)| F = n} ∈ DB, then Q is measurable.
Proof. This follows directly from the observation above: DB is the countable union of the sets DB| ε over rational ε > 0, so {D ∈ DB : |Q(D)| F = n} is a countable union of sets of the form {D ∈ DB| ε : |Q(D)| F = n}, each of which lies in DB by assumption.
Corollary 4.13 can be used to leverage the finiteness of our instances. In an ε-coarse instance, we can approximate sets of facts by simpler sets of facts as long as these approximations are sufficiently fine. For example, in the context of Example 4.10, in order to prove measurability of a query, it suffices to prove the measurability with respect to preimages where the temperature recordings are "far apart".

Relational Algebra
As motivated in Section 3.2, we investigate the measurability of relational algebra queries in our model. The concrete relational algebra for bags that we use here is basically the (unnested version of the) algebra that was introduced in [DGK82] and investigated, respectively extended, and surveyed in [Alb91, GM96,GLMW96]. It is called BALG 1 (with superscript 1) in [GM96].
We do not introduce nesting as it would yield yet another layer of abstraction and complexity to the spaces we investigate, although by the properties that such spaces exhibit, we have strong reason to believe that there is no technical obstruction in allowing spaces of finite bags as attribute domains. It is unclear however, whether this extends to PDBs with unbounded nesting depth.
The operations we consider are shown in Table 2 below. As seen in [Alb91, GM96,GLMW96], there is some redundancy within this set of operations that will be addressed later. A particular motivation for choosing this algebra is that possible worlds semantics are usually built on top of set semantics and these operations naturally extend the common behavior of relational algebra queries to bags. This is quite similar to the original motivation of [DGK82] and [Alb91] regarding their choice of operations.

Table 2. Operations of the bag algebra (group headings include Base Queries and Constructors).

Since compositions of measurable functions are measurable, it suffices to show the measurability of the operators from Table 2, and the measurability of compound queries follows by structural induction.
Therefore, by investigating the measurability of the operators from Table 2 we will show the following main result of this section.
Theorem 5.1. All queries expressible in the bag algebra BALG 1 are measurable.
5.1. Base Queries. The base queries (Table 3) are easily seen to be measurable.

Table 3. Base Queries. Columns: Query; Semantics (for all D ∈ DB, t ∈ T Q ).

Lemma 5.2. The following queries are measurable:
(2) This can be shown the same way.
(3) Let Q = R, let R Q (T ) ∈ F Q and let n ∈ N. Then R(T ) ∈ F R ⊆ F. For all D ∈ DB, it holds that |Q(D)| R Q (T ) = n ⇐⇒ |D| R(T ) = n, and the latter is measurable in DB. Thus, Q is measurable.
Note that there is nothing to show for the renaming query because it leaves all tuples themselves untouched. In the subsequent subsections, we deal with the remainder of Table 2.

5.2. Basic Bag Operations. We now treat the basic bag operations (Table 4). Assume that R, S ∈ τ are relation symbols of the same sort. From now on, we will only consider the case where R and S are distinct, as for the case R = S, measurability is trivial.

Query
Semantics (for all D ∈ DB, t ∈ T Q ) Lemma 5.3. The following queries are measurable:  (Table 5) maps bag instances to their underlying set instances.

Query
Semantics (for all D ∈ DB, t ∈ T Q ) Proof. We apply Lemma 4.6 for r = 1 and the function q 1 = q : F Q → F R defined by q(R Q (x)) = R(x). Then for all D ∈ DB and all f ∈ F Q , it holds that Thus, we let N Q (k) = {0} if k = 0 and N Q (k) = N \ {0} otherwise. Then r, q and N Q satisfy the requirements of Lemma 4.6, so Q is measurable.
Having the deduplication query measurable means that standard PDBs support set semantics.
Remark 5.5. The function associated with the deduplication query is countable-to-one (the preimage of a single instance in the result is a countable collection of instances) and measurable by the lemma above. This can be used to infer that the space of set instances is standard Borel using [Sri98,Theorem 4.12.4]. This means that we could also completely restrict our setting to set instances without introducing new measurability problems. In general, however, it can still be mathematically more convenient to use the full measurable space that was defined in Section 3.1, even in a set semantics setting. The point processes defining PDBs should then just be "simple" in the sense of Definition 2.10, i. e. have probability 0 of containing duplicate facts.

5.4. Selection and Projection. In this section, we investigate the selection and projection operators (Table 6). First, we note that reordering the attributes in the sort of a relation yields a measurable query.
Proof. This directly follows from the mapping theorem (Theorem 4.4), because β : F R → F Q is measurable.
Lemma 5.6 is helpful for restructuring relations into a more convenient shape to work with later. Semantically, it is a special case of a projection query.
Let R ∈ τ be a relation symbol with ar(R) = r, and let A 1 , . . . , A k be pairwise distinct attributes appearing in sort(R) where 0 < k ≤ r. By Lemma 5.6, wlog. we assume that sort(R) = (A 1 , . . . , A r ).

Table 6. Selection and Projection. Columns: Query; Semantics (for all D ∈ DB, t ∈ T Q ).

Thus, Q is measurable.
Example 5.8. Suppose A 1 , A 2 ∈ sort(R) with A 1 = A 2 = R. The sets are Borel in R 2 . Thus, the queries are measurable by Lemma 5.7. In particular, in the context of our running example, we can do selections based on comparing temperatures.
Then F −1 ∈ F R and it holds that and, hence, Q is measurable.
Remark 5.10. Above, we provided direct proofs of Lemma 5.7 and Lemma 5.9. Alternatively, they follow from Theorem 4.4 using the function for selection and the function for projection. A closer look reveals that the sets F −1 of (5.1) and (5.2) are really the preimages of the respective function q that we know from Theorem 4.4.

5.5. Products. Let R, S ∈ τ be relation symbols. In this section, we treat the cross product Q = R × S (Table 7). (For the discussions here, it does not matter whether R = S.) Table 7. Cross Products.

Query
Semantics (for all D ∈ DB, t ∈ T R , t ∈ T S ) Lemma 5.11. Let F ∈ F Q such that F = R Q (T R × T S ) with T R ∈ T R and T S ∈ T S . Then Proof. Let F ∈ F Q , say F = R Q (T ) with T ∈ T Q . If T = T R × T S for some T R ∈ T R and T S ∈ T S , then for all D ∈ D, it holds that where F R = R(T R ) and F S = S(T S ). Thus, Remark 5.12. Note that we proceeded similar to Lemma 4.6. Unfortunately, not every set T ∈ T Q can be decomposed into a product of measurable sets T R and T S as used above. The sets T R and T S are the projections of T Q to T R and T S . Recall that we already encountered a similar situation in Example 3.12. In general, such projections of measurable sets from products of standard Borel spaces to their factors yield the analytic sets (cf. [Kec95, Exercise 14.3]). A fundamental theorem in descriptive set theory states that every uncountable standard Borel space has analytic, non-(Borel-)measurable subsets [Kec95, Theorem 14.2]. Thus, with suitably chosen attribute domains, there are sets T ∈ T Q such that, for instance, the projection of T to T R is not in T R . Therefore, the approach from (5.3) doesn't help us resolve the measurability of Q, even though it appeared promisingly similar to the arguments from the previous section. In particular, for the given query we cannot establish criterion (2) from Lemma 4.6.
For t ∈ T Q , we let t R ∈ T R and t S ∈ T S denote the projections of t to its R-and S-part. Then for all r > 0, W r (t) := B r (t R ) × B r (t S ) ⊆ T Q is a measurable rectangle containing t with B r (t R ) ⊆ T R and B r (t S ) ⊆ T S . Note that B r (t R ) is a ball with respect to the (Polish) metric on T R , and B r (t S ) is a ball with respect to the (Polish) metric on T S .
Remark 5.13. Let us briefly comment on the intuition of the setup. The sets W r (t) should be thought of as small windows that we can use to make our considerations local around t = (t R , t S ). We use these windows to show query measurability by exploiting the finiteness of our database instances: since every D is finite, D is ε-coarse for ε > 0 small enough (see Section 4.4). Then for small enough radius r, the balls B r (t R ) and B r (t S ) both contain at most one R-or S-tuple from D, respectively. Thus, the image of D under the cross product query also contains at most one tuple in W r (t) = B r (t R ) × B r (t S ). Then, as W r (t) has the appropriate shape, Lemma 5.11 can be used again.
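The window argument of Remark 5.13 can be observed numerically on a small instance. The following sketch rests on assumptions that are not taken from the text: both relations are unary with real-valued attributes under the absolute-difference metric, the bag cross product multiplies multiplicities, and the function names `cross_product` and `tuples_in_window` are hypothetical.

```python
from collections import Counter

def cross_product(R, S):
    """Bag cross product; multiplicities multiply (assumed semantics)."""
    return Counter({(t_r, t_s): m_r * m_s
                    for t_r, m_r in R.items() for t_s, m_s in S.items()})

def tuples_in_window(QD, center, r):
    """Distinct output tuples inside W_r(center) = B_r(c_R) x B_r(c_S)."""
    c_r, c_s = center
    return [t for t in QD if abs(t[0] - c_r) < r and abs(t[1] - c_s) < r]

R = Counter({1.0: 2, 2.0: 1})    # distinct R-tuples are 1.0 apart
S = Counter({10.0: 3, 11.5: 1})  # distinct S-tuples are 1.5 apart
QD = cross_product(R, S)

eps = 0.9   # the instance is eps-coarse, since 1.0 > eps and 1.5 > eps
r = 0.29    # a radius below eps / 3

# Slide the window over a grid of centers; it never catches two distinct tuples.
centers = [(0.5 + 0.1 * i, 9.5 + 0.1 * j) for i in range(21) for j in range(26)]
max_hits = max(len(tuples_in_window(QD, c, r)) for c in centers)
```

A grid search is of course no proof; the actual argument is that two distinct tuples in the same ball of radius r would be less than 2r < ε apart, contradicting ε-coarseness.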
Recall that for all ε > 0, DB| ε is the set of ε-coarse instances in DB. Let F ∈ F Q and n ∈ N. For all ε > 0, we let D F ,n,ε := {D ∈ DB| ε : |Q(D)| F = n}. Then D F ,n,ε is the ε-coarse preimage of the event {D Q ∈ DB Q : |D Q | F = n} from DB Q .
Lemma 5.14. For all F ∈ F Q , n ∈ N, t ∈ T Q and all ε, r > 0 with r < ε/3, it holds that D F ∩R Q (Wr(t)),n,ε ∈ DB.
Proof. We prove the lemma as follows. Let
F ε := {F ∈ F Q : D F ∩R Q (Wr(t)),n,ε ∈ DB for all t ∈ T Q , n ∈ N, and r < ε/3}. (5.4)
Intuitively, this is the set of measurable sets F with the property that all sufficiently fine "window approximations" of the ε-coarse preimages of {D Q ∈ DB Q : |D Q | F = n} are measurable in DB. It then suffices to show for all ε > 0 that (1) for every set F = R Q (T ) with T = T R × T S ∈ T R × T S ⊆ T Q it holds that F ∈ F ε , and (2) the family F ε is a σ-algebra on F Q .
From these two it follows that F ε = F Q . That is, indeed every measurable set F ∈ F Q satisfies the property from (5.4). This line of argument is occasionally referred to as the good sets principle [Ash72, p. 5].
(1) Suppose F = R Q (T ) for some T ∈ T R × T S , say T = T R × T S . Let t ∈ T Q , n ∈ N, and r < ε/3 be arbitrary. With t R ∈ T R and t S ∈ T S , we denote the R- and the S-part of t, respectively. Then F ∩ R Q (W r (t)) = R Q ((T R ∩ B r (t R )) × (T S ∩ B r (t S ))) is again of product shape. By Lemma 5.11, it follows that {D ∈ DB : |Q(D)| F ∩R Q (Wr(t)) = n} ∈ DB. With Lemma 4.11 this entails that D F ∩R Q (Wr(t)),n,ε = {D ∈ DB : |Q(D)| F ∩R Q (Wr(t)) = n} ∩ DB| ε ∈ DB. As t, n and r were arbitrary, it follows that F ∈ F ε .
(2) We show that F ε is a σ-algebra on F Q by showing F Q ∈ F ε , and that it is closed under complements and countable intersections. First note that F Q ∈ F ε follows from Lemma 5.11 because F Q = R Q (T R × T S ). Now let t ∈ T Q , n ∈ N, and r < ε/3 be arbitrary but fixed, and let D ∈ DB| ε . Figure 3 provides visualizations for the remaining cases.
(a) Let F ∈ F ε . Then it holds that D ∈ D F c ∩R Q (Wr(t)),n,ε ⇐⇒ |Q(D)| F c ∩R Q (Wr(t)) = n ⇐⇒ |Q(D)| R Q (Wr(t)) − |Q(D)| F ∩R Q (Wr(t)) = n. Recall that D is ε-coarse. In particular, as r < ε/3, the ball B r (t R ) contains at most one R-tuple from D and the ball B r (t S ) contains at most one S-tuple from D. Hence, W r (t) = B r (t R ) × B r (t S ) contains at most one R Q -tuple from Q(D). Thus, D ∈ D F c ∩R Q (Wr(t)),n,ε if and only if |Q(D)| R Q (Wr(t)) = n and |Q(D)| F ∩R Q (Wr(t)) = 0.
From (1) and the assumption F ∈ F ε , it thus follows that (Wr(t))) = n. As in the previous case, because D is ε-coarse, the set W r (t) contains at most one R Q -tuple from Q(D). Thus, D ∈ D F ∩R Q (Wr(t)),n,ε if and only if |Q(D)| F i ∩R Q (Wr(t)) = n for all i = 1, 2, . . . .

Thus, ⋂ ∞ i=1 F i ∈ F ε using the assumption. Together, F ε is indeed a σ-algebra on F Q . From (1) and (2) it follows that F ε = F Q for all ε > 0.
Lemma 5.15. For all F ∈ F Q , all n ∈ N, and all ε > 0, it holds that D F ,n,ε ∈ DB.
Proof. Let F ∈ F Q , n ∈ N, and ε > 0. It suffices to show that D F ,≥n,ε := ⋃ m≥n D F ,m,ε ∈ DB where we may assume n > 0. We show that D ∈ D F ,≥n,ε is equivalent to D satisfying the following condition.
Condition 5.16. The instance D is ε-coarse and for some ℓ ∈ N + there are k 1 , . . . , k ℓ ∈ N + with k 1 + · · · + k ℓ ≥ n such that for all r ∈ (0, ε/3) there are t * 1,r , . . . , t * ℓ,r ∈ T * R × T * S such that (1) for all i ≠ j it holds that d R (t * i,r,R , t * j,r,R ) > ε/3 or d S (t * i,r,S , t * j,r,S ) > ε/3 and (2) for all i = 1, . . . , ℓ it holds that D ∈ D F ∩R Q (Wr(t * i,r )),k i ,ε .
We start with the easy direction (⇐).
⇐ : Suppose D satisfies Condition 5.16. Then D ∈ DB| ε and it remains to show |Q(D)| F ≥ n. Note that it suffices to show that if r is small enough in Condition 5.16, then the sets W r (t * i,r ) are pairwise disjoint. In this case, the claim follows from (2). By (1), for all i, j = 1, . . . , ℓ with i ≠ j it holds that at least one of d R (t * i,R , t * j,R ) or d S (t * i,S , t * j,S ) is larger than ε/3. Thus, for r < ε/6 it follows that at least one of d R (t * i,R , t * j,R ) or d S (t * i,S , t * j,S ) is larger than 2r. Therefore, W r (t * i,r ) ∩ W r (t * j,r ) = ∅.
⇒ : Suppose that D ∈ D F ,≥n,ε . Then D is ε-coarse and |Q(D)| F ≥ n. Thus, there exist pairwise distinct t 1 , . . . , t ℓ ∈ Q(D) with R Q (t i ) ∈ F for all i = 1, . . . , ℓ such that their multiplicities in Q(D) sum to at least n. Let r ∈ (0, ε/3). Since T * R is dense in T R and T * S is dense in T S , for all i = 1, . . . , ℓ we can choose t * i,r = (t * i,r,R , t * i,r,S ) ∈ T * R × T * S such that d R (t * i,r,R , t i,R ) < r and d S (t * i,r,S , t i,S ) < r.
For the rest of the proof we assume d R (t i,R , t R ) > ε (the other case is completely symmetric). Recall that r < ε/3. Then it follows that , and (2) follows. Also for all j ≠ i, we have that The equivalence still holds if r is additionally required to be rational. That is, using Lemma 5.14 (with the indices ranging as in our equivalence, and numbers being restricted to rationals).
Finally, the measurability of Q = R × S is a direct consequence of Lemma 5.15 and Corollary 4.13.
Lemma 5.17. The query Q = R × S is measurable.
As a consequence, we also obtain the measurability of all kinds of derived operators, including the typical join operators. Note that it also follows from Theorem 5.1, that we can use finite Boolean combinations of predicates for selection queries.

Aggregation
There are practically relevant queries that are not already covered by our treatment in the previous section. For example, in our running example of temperature recordings, we might be interested in returning the average (or minimum or maximum) temperature per room, taken over all temperature records for this particular room.
In this section, we formalize aggregate operators and aggregate queries, possibly with grouping in the standard PDB framework in a possible worlds semantics style (cf. Section 3.2). In particular, we show that these queries are measurable in the standard PDB framework.
Remark 6.1. Often, when a separate treatment of aggregate queries over purely algebraic ones is motivated, it is mentioned that the correspondence of relational algebra and relational calculus limits expressive power to that of first-order logic. However, bag query languages based on relational algebra do allow expressing various kinds of aggregation based on exploiting the presence of multiplicities. For example, counting in a unary fashion is possible in BALG 1 [GM96]. Here, we follow a more general approach in allowing basically any measurable function over finite bags to be used for aggregation. This goes beyond the integer aggregation of BALG [GLMW96]. Examples of common attribute operators Φ are shown in Table 8.
Without loss of generality, we assume that R is the only relation in τ . Then for all D ∈ DB, it holds that , T Q -measurable, the claim follows.  is Count(T R ), T Q )-measurable.
Proof. It suffices to show that the restriction Φ m of Φ to B m (T R ), the bags of cardinality m over T R , is measurable for all m ∈ N. Note that B 0 (T R ) only contains the empty bag, and T 0 R only contains the empty tuple. That is, the statement is trivial for m = 0. Thus, let m ∈ N + . Since ϕ m is (T ⊗m R , T Q )-measurable and symmetric, for all T ∈ T Q it holds that
4 Formally, R(D) has been defined as the bag of R-facts in D whereas Φ should take bags of R-tuples. This small type mismatch is of no significance whatsoever.
From Remark 2.8 and Fact 2.9 it follows that Φ −1 m (T ) ∈ Count(T R ). Example 6.4. All the aggregators in Table 8 yield measurable aggregation queries. The associated functions ϕ m are all continuous under suitable choices of attribute domains. That is, for example, if + in the definition of SUM is the addition of real numbers.
What we have introduced so far is only sufficient to express the aggregation over all tuples of a relation at once. Usually, we want to perform aggregation separately for parts of the data, as in our motivating example of returning the average temperature per room. For this, we need to group tuples before aggregating values. Suppose we want to group a relation R by attributes A 1 , . . . , A k and perform the aggregation over attribute A, separately for every occurring value of the attributes A 1 , . . . , A k . Without loss of generality, we assume that sort(R) = (A 1 , . . . , A k , A). Then what we described is a group-by aggregate query Essentially, A a 1 ,...,a k (D) is obtained by selecting those tuples where the first k attributes have values a 1 , . . . , a k , and then projecting to the last attribute. Both kinds of aggregate queries (without, and with grouping) are shown in Table 9.
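On a single finite instance, the select-then-project description of grouping can be sketched as follows. This is an illustration under assumptions, not the formal semantics of Table 9: the names `group_by_aggregate` and `average` are hypothetical, the relation is a `Counter` over tuples whose last component is the aggregated attribute A, and every group is assumed to contribute exactly one output tuple with multiplicity 1.

```python
from collections import Counter, defaultdict

def group_by_aggregate(R, phi):
    """Group a bag by all but the last attribute and aggregate the last one."""
    groups = defaultdict(list)
    for (*key, value), mult in R.items():
        # Selecting by the key and projecting to A keeps duplicates.
        groups[tuple(key)].extend([value] * mult)
    return Counter({key + (phi(values),): 1 for key, values in groups.items()})

def average(values):
    """A symmetric aggregation function on non-empty finite bags of reals."""
    return sum(values) / len(values)

# (room, temperature) records; the reading 20.0 in room r1 occurs twice.
R = Counter({("r1", 20.0): 2, ("r1", 23.0): 1, ("r2", 18.0): 1})
result = group_by_aggregate(R, average)
```

Note that the duplicate reading influences the average of room r1, which is why the aggregation operates on bags rather than on their underlying sets.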
In the following, we fix a compatible Polish metric d grp on T grp , and a countable dense set T * grp in T grp . For all t ∈ T grp and all r > 0 define Note that Q t,r is measurable by Theorem 5.1 for all particular choices of t and r. Let Q = π A 1 ,...,A k (R) and DB| ε := {D ∈ DB : Q(D) is ε-coarse} Then DB| ε ∈ DB, as DB| ε = Q −1 DB | ε using Lemmas 4.11 and 5.9 where DB is the output instance space of Q. Intuitively, DB| ε are the instances that are ε-coarse in the (A 1 , . . . , A k ) attributes of R. Similar to Corollary 4.13 it suffices to show that {D ∈ DB| ε : |Q(D)| F ≥ n} is measurable for all positive ε, F ∈ F Q and n ∈ N + .
We show that for all D ∈ DB| ε , all F ∈ F Q , and all n ∈ N + it holds that |Q(D)| F ≥ n is equivalent to D satisfying the following condition.
Condition 6.6. For all positive r < ε/3 there exist t * 1,r , . . . , t * n,r ∈ T * grp with d grp (t * i,r , t * j,r ) > ε/3 such that |Q t * i,r ,r (D)| F ≥ 1 for all i = 1, . . . , n.
⇒ : Let D ∈ DB| ε such that |Q(D)| F ≥ n. Then there exist R Q (t 1 , b 1 In particular, every B r (t * i,r ) contains no tuple among t 1 , . . . , t n other than t i . Thus, |Q t * i,r ,r (D)| F ≥ 1 for all i = 1, . . . , n.
⇐ : Suppose Condition 6.6 holds. Since D is finite, the tuples t * i,r in Condition 6.6 converge to tuples t i with , the tuples t 1 , . . . , t n are pairwise distinct. Thus, (6.2) implies |Q(D) The equivalence still holds when r is additionally required to be rational. Thus, with the indices ranging as in Condition 6.6 (and r rational).
By the observation of Example 6.4, Theorem 6.5 applies to the operators from Table 8.

Datalog
The measurability results of the previous sections also allow us to say something about fixpoint queries. The key observation is the following lemma, which follows from Fact 2.5(2).
Lemma 7.1. Let (Q i ) i∈N be a family of measurable queries such that Q(D) := ⋃ ∞ i=0 Q i (D) is finite for all D ∈ DB. Then Q is a measurable query.
Proof. For all n ∈ N, let Q (n) := ⋃ n i=0 Q i . As a finite (maximum-)union of measurable queries, Q (n) is measurable. As Q(D) is finite for all D, it holds that Q(D) = lim n→∞ Q (n) (D), and Q is the pointwise limit of the functions Q (n) . Thus, Q = lim n→∞ Q (n) is measurable as well.
We omit the definition of Datalog and related query languages. For simplicity, we consider set PDBs, and Datalog with set semantics. Recall that if Q is a Datalog query, then Q can be written as a countable union of conjunctive queries [AHV95]. 5 Thus, combining our measurability results with the above lemma, we obtain the following.
5 We note that a similar statement can be made for the bounded fixpoints semantics over bags featured in [CL97].
Corollary 7.2. Every Datalog query is measurable.
In fact, our argumentation can be applied to all types of queries with operators that are based on countable iterative (or inductive, inflationary, or fixed-point) processes. All we need is that the iterative mechanism forms a converging sequence of measurable queries. For partial Datalog / fixed-point logic, we cannot directly use Lemma 7.1, but a slightly more complicated argument still based on countable limits works there as well.
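The limit construction behind Lemma 7.1 can be illustrated with the standard transitive closure program, viewed as a union of path queries of increasing length. The sketch below assumes set semantics and a binary edge relation given as a Python set of pairs; the names `compose`, `stages` and `transitive_closure` are hypothetical. Each stage corresponds to a finite union Q (n) , and on a finite instance the stages become stationary.

```python
def compose(P, E):
    """Join P and E on the shared middle element (a conjunctive query)."""
    return {(a, d) for (a, b) in P for (c, d) in E if b == c}

def stages(E):
    """Yield Q^(0), Q^(1), ...: pairs connected by a path of length <= n + 1."""
    current, power = set(E), set(E)
    while True:
        yield current
        power = compose(power, E)   # paths that are one edge longer
        current = current | power   # finite union of the stages so far

def transitive_closure(E):
    """Return the pointwise limit of the stages on the finite instance E."""
    previous = None
    for stage in stages(E):
        if stage == previous:       # stationary, hence equal to the limit
            return stage
        previous = stage

edges = {(1, 2), (2, 3), (3, 4)}
closure = transitive_closure(edges)
```

Once two consecutive stages agree, all later stages agree as well, so stopping at the first repetition returns the full union.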

Beyond Possible Worlds Semantics
In Section 3.2, we introduced the notion of queries or views on probabilistic databases solely based on the existing notion of queries and views for traditional databases, which we referred to as the possible worlds semantics of queries or views. As the title of this section suggests, we explicitly leave this setup. 6 Before, we have introduced views as functions mapping database instances to database instances and adopted a semantics based on possible worlds. Now, we want to discuss PDB views as functions that map probabilistic databases to probabilistic databases, for which no such semantics (to be precise, a definition in the shape of Section 3.2) exists. Such "views" naturally arise in a variety of computational problems in probabilistic databases. For example, consider the following problems or "queries":
• probabilistic threshold queries that intuitively return a deterministic table containing only those facts which have a marginal probability over some specified threshold [QJSP10];
• probabilistic top-k queries that intuitively return a deterministic table containing the k most probable facts [RDS07];
• probabilistic skyline queries [PJLY07] that consider how different instances compare to each other with respect to some notion of dominance; and
• conditioning [KO08] the probabilistic database to some event.
Note that the way we informally explained the first two queries above is only sensible if the space of facts is discrete. In a continuous setting, we interpret these queries with respect to a suitable countable partition of the fact space into measurable sets.
More such "queries" can be found in [Agg09,WLLW13]. These queries (or views) still take a PDB as input and produce some output, but differ from the ones we have seen so far in that they have no reasonable semantics on single instances (i. e. per possible world). This can be, for example, because they explicitly take probabilities into account. The goal of this section is to interpret such problems on PDBs abstractly as functions on probability spaces in order to fit them into a unified framework. Developing such an understanding has already been motivated in [WvK15] as yielding potential insight into common properties of the corresponding problems.
We now present our formal classification of views that are directly defined on the probability space level of a PDB (as opposed to the instance level as in Section 3.2). Let $\mathbf{PDB}_\tau$ denote the class of probabilistic databases of schema $\tau$. Note that all PDBs in $\mathbf{PDB}_\tau$ have the same instance measurable space $(\mathbb{DB}, \mathfrak{DB})$. Queries and, more generally, views of input schema $\tau$ and output schema $\tau'$ are now mappings $V\colon \mathbf{PDB}_\tau \to \mathbf{PDB}_{\tau'}$.
⁶ Note that we do not abandon our definition of PDBs as probability spaces over possible worlds. It is only that we broaden the notion of views to also incorporate mappings of PDBs that are not defined on a "per possible world" basis.
Definition 8.1.
(1) Every view is type 1.
(2) A view $V$ is type 2, or pointwise local, if (and only if) for every fixed input PDB $\mathcal{D} = (\mathbb{DB}, \mathfrak{DB}, P)$ there exists a measurable function $q_{\mathcal{D}}\colon \mathbb{DB} \to \mathbb{DB}_V$ such that $P_V = P \circ q_{\mathcal{D}}^{-1}$ (here, $P_V$ denotes the probability measure of the output PDB $V(\mathcal{D})$).
(3) A view $V$ is type 3, or uniformly local, if (and only if) there exists a measurable function $q\colon \mathbb{DB} \to \mathbb{DB}_V$ such that for every input PDB $\mathcal{D} = (\mathbb{DB}, \mathfrak{DB}, P)$, it holds that $P_V = P \circ q^{-1}$.
(4) A view $V$ is type 4, or pointwise, if (and only if) there exists a measurable function $q\colon \mathbb{F}_q \to \mathbb{F}_Q$ such that $V$ is composed of functions of the shape of $Q$ as in Theorem 4.4 (where $Q$ depends on $q$).
We let $\mathbf{V}_{\mathrm{I}}$, $\mathbf{V}_{\mathrm{II}}$, $\mathbf{V}_{\mathrm{III}}$ and $\mathbf{V}_{\mathrm{IV}}$ denote the classes of views of types 1 through 4.
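To make the image measure $P \circ q^{-1}$ from items (2) and (3) concrete, the following minimal Python sketch computes it for a finite PDB. The encoding of a PDB as a dictionary from possible worlds to probabilities, the helper names `pushforward` and `select_room`, and the four equally likely worlds are assumptions made purely for illustration; they are not part of the formal development, which covers uncountable spaces.

```python
from collections import defaultdict
from fractions import Fraction

# Assumption: a finite PDB is encoded as a dict that maps possible worlds
# (frozensets of facts) to their probabilities.

def pushforward(pdb, q):
    """Return the image measure P o q^{-1} of a finite PDB under q."""
    out = defaultdict(Fraction)  # Fraction() is 0
    for world, prob in pdb.items():
        out[q(world)] += prob
    return dict(out)

def select_room(world, room="4108"):
    """Hypothetical instance-level query: keep only facts about one room."""
    return frozenset(fact for fact in world if fact[0] == room)

# Hypothetical input PDB with four equally likely possible worlds.
pdb = {
    frozenset(): Fraction(1, 4),
    frozenset({("4108", 20.5)}): Fraction(1, 4),
    frozenset({("4108a", 21.0)}): Fraction(1, 4),
    frozenset({("4108", 20.5), ("4108a", 21.0)}): Fraction(1, 4),
}

# The same function q is used for every input PDB, as required for type 3.
view = pushforward(pdb, select_room)
```

In this sketch, `select_room` never inspects the probabilities, which mirrors the "uniformly local" condition; a type 2 view would instead be allowed to pick a different function for each input PDB.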
Remark 8.2. Let us shed some more light on these classes and their names:
(1) Class $\mathbf{V}_{\mathrm{I}}$ does not require further explanation, as it contains every view.
(2) The class $\mathbf{V}_{\mathrm{II}}$ is described via functions that may depend on the measurable structure of the input PDB. Specifically, these views are functions that directly transform input to output PDBs. We dub this "pointwise local", because this function is applied per instance (hence, local) but the function itself depends on the concrete PDB (hence, is only pointwise local with respect to probability spaces). In general, type 2 views may take the probability space level into account. For example, probabilistic threshold or probabilistic top-k queries can be viewed as views of this class: they transform any input PDB to a single database instance (that is, a PDB with only one possible world of probability 1) containing the respective output tuples along with their probability in a separate attribute. Another example is conditioning a PDB, as the probability measure of a conditioned PDB involves a normalization term that depends on the probability mass of an event.
(3) The class $\mathbf{V}_{\mathrm{III}}$ captures the lifting of typical database queries to PDBs under the possible worlds semantics. Hence, there is a single (measurable) function that is applied "locally" on every database instance. The term "uniformly" expresses that it does not depend on the concrete PDB. In general, type 3 views only take the instance level into account. All the views that we investigated in Sections 5 to 7 fall into this category.
(4) The class $\mathbf{V}_{\mathrm{IV}}$ is the class of views corresponding to the measurability criterion of the Mapping Theorem. That is, there is a (measurable) function that is applied "pointwise"⁷ on every fact. This transformation naturally lifts to instances and, thus, to PDBs. In general, type 4 views only take the fact level into account. We have seen an example of such a view in Example 4.5.
⁷ Note that the "pointwise" in the term "pointwise local" from $\mathbf{V}_{\mathrm{II}}$ refers to probability spaces as "points", whereas the term "pointwise" alone, as here in the definition of $\mathbf{V}_{\mathrm{IV}}$, refers to facts as "points".
Example 8.3. Recall our running example of temperature measurements. We use this to introduce an example of a view that performs "out-of-world aggregation" [WvK15] (and that, in particular, is of type 2, but not of type 3). The relation TempRec stores triples of room numbers (RoomNo), recording dates (Date) and recorded temperatures (Temp). Assume that the pair (RoomNo, Date) acts as a key, so that with probability 1 there is at most one temperature recording per pair. Moreover, assume that the PDB is modelled with independent tuples $(r, d, \theta)$ where $\theta$ is Normally distributed per pair $(r, d)$, but such that the existence of a record belonging to $(r, d)$ is subject to uncertainty. That is, we have a mix of attribute- and tuple-level uncertainty. A possible representation of such a PDB is shown in Figure 4. Therein, we have two possible tuples (one recording for room 4108 and one recording for room 4108a), but the value of the temperature recording is specified as a Normally distributed random variable, parameterized with its mean and variance. We assume the existence of tuples with room 4108 or 4108a to be independent from one another, and that these events carry probability 0.4 and 0.8, respectively. A possible view could ask to return, per room and date, the expected temperature, while preserving the tuple-level uncertainty. The output of this view is shown on the right-hand side of Figure 1. (Here, it is trivial to obtain, because the expected value was already part of the parametrization. The idea, however, also applies to more complicated setups.) The possible worlds of the input PDB $\mathcal{D} = (\mathbb{DB}, \mathfrak{DB}, P)$ are partitioned into four cases, depending on the presence of a tuple with room 4108 and 4108a, respectively.
Our function $q_{\mathcal{D}}$ (which depends on $\mathcal{D}$) maps instances $D \in \mathbb{DB}$ as follows: the tuple (4108, 2021-07-12, 20.5) is present in the view result if and only if $D$ contains a tuple with room 4108, and the tuple (4108a, 2021-07-12, 21.0) is present in the view result if and only if $D$ contains a tuple with room 4108a. Then the output PDB has probability measure $P \circ q_{\mathcal{D}}^{-1}$ and, in particular, four possible worlds. This is a so-called tuple-independent PDB with given marginal probabilities (0.4 and 0.8), hence it can be represented as shown on the right-hand side of Figure 1.⁸
Let us come back to the relationships between the classes of views we defined in Definition 8.1. Clearly, $\mathbf{V}_{\mathrm{I}} \supseteq \mathbf{V}_{\mathrm{II}} \supseteq \mathbf{V}_{\mathrm{III}} \supseteq \mathbf{V}_{\mathrm{IV}}$. We already informally argued in Section 4.2 that queries that depend on multiple tuples per instance are not captured by $\mathbf{V}_{\mathrm{IV}}$, i. e. $\mathbf{V}_{\mathrm{III}} \nsubseteq \mathbf{V}_{\mathrm{IV}}$. We provide two examples to show that the remaining inclusions are strict as well.
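Returning to Example 8.3 for a moment, the construction of $q_{\mathcal{D}}$ can be replayed on a finite stand-in for the input PDB. The sketch below is only illustrative: the three-point temperature distributions replacing the Normal distributions, the helper names `block`, `expected_temps` and `make_q`, and the dictionary encoding of PDBs are assumptions; only the room numbers, the date, the means (20.5 and 21.0) and the presence probabilities (0.4 and 0.8) are taken from the example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

DATE = "2021-07-12"

def block(room, presence, mean):
    """Hypothetical discrete stand-in for one uncertain record.

    None encodes "no record"; otherwise the temperature is mean - 1, mean
    or mean + 1 with weights 1/4, 1/2, 1/4 (so the mean is preserved).
    """
    dist = {None: 1 - presence}
    for offset, weight in ((-1, Fraction(1, 4)), (0, Fraction(1, 2)), (1, Fraction(1, 4))):
        dist[(room, DATE, mean + offset)] = presence * weight
    return dist

blocks = [
    block("4108", Fraction(2, 5), Fraction(41, 2)),   # present w.p. 0.4, mean 20.5
    block("4108a", Fraction(4, 5), Fraction(21)),     # present w.p. 0.8, mean 21.0
]

# Input PDB: one alternative is chosen independently from each block.
pdb = defaultdict(Fraction)
for (t1, p1), (t2, p2) in product(blocks[0].items(), blocks[1].items()):
    world = frozenset(t for t in (t1, t2) if t is not None)
    pdb[world] += p1 * p2

def expected_temps(pdb):
    """Expected temperature per (room, date), given that a record exists."""
    mass, total = defaultdict(Fraction), defaultdict(Fraction)
    for world, prob in pdb.items():
        for room, date, temp in world:
            mass[(room, date)] += prob
            total[(room, date)] += prob * temp
    return {key: total[key] / mass[key] for key in mass}

def make_q(pdb):
    """Build q_D; it depends on the whole PDB through the expectations."""
    means = expected_temps(pdb)
    def q(world):
        return frozenset((r, d, means[(r, d)]) for r, d, _ in world)
    return q

# Output PDB: image measure of P under q_D.
q_D = make_q(pdb)
out = defaultdict(Fraction)
for world, prob in pdb.items():
    out[q_D(world)] += prob
```

Under these assumptions, `out` has four possible worlds, and the two output tuples are independent with marginal probabilities 2/5 and 4/5, in line with the tuple-independent representation described in the example.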
Proposition 8.4. There exists a view that is not type 2, i. e. $\mathbf{V}_{\mathrm{II}} \subsetneq \mathbf{V}_{\mathrm{I}}$.
We demonstrate this using a view that conditions its input PDB on an event.
Proof. Let $\mathcal{D}$ be a PDB with three possible worlds $D_1$, $D_2$, and $D_3$, such that $P(\{D_1\}) = \frac{1}{6}$, $P(\{D_2\}) = \frac{1}{2}$, and $P(\{D_3\}) = \frac{1}{3}$. Consider the view $V$ that conditions a PDB on the event $\{D_1, D_2\}$. Note that $P_{V(\mathcal{D})}(\{D_1\}) = \frac{1}{6} \big/ \big(\frac{1}{6} + \frac{1}{2}\big) = \frac{1}{4}$ and $P_{V(\mathcal{D})}(\{D_2\}) = \frac{1}{2} \big/ \big(\frac{1}{6} + \frac{1}{2}\big) = \frac{3}{4}$. Yet, there is no event $\mathbf{D}$ in $\mathcal{D}$ with the property that $P(\mathbf{D}) = \frac{1}{4}$ (the events of $\mathcal{D}$ have probabilities $0$, $\frac{1}{6}$, $\frac{1}{3}$, $\frac{1}{2}$, $\frac{2}{3}$, $\frac{5}{6}$ and $1$). Thus, a function $q_{\mathcal{D}}$ as required in Definition 8.1(2) does not exist and, hence, $V$ is not type 2.
⁸ The input PDB in this case is not tuple-independent, as any two facts with the same room number are mutually exclusive. Instead, this is a so-called (uncountable) block-independent disjoint PDB with two independent blocks, both representing a single random tuple, such that each block specifies the (probability distribution over) possible manifestations for this tuple.
Proposition 8.5. There exists a type 2 view that is not type 3, so $\mathbf{V}_{\mathrm{III}} \subsetneq \mathbf{V}_{\mathrm{II}}$.
We have already discussed such a view in Example 8.3, but have not actually shown that it is not of type 3. We demonstrate the proposition by considering another example, namely that of a probabilistic threshold query.
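Proof. Let $\alpha \in (0, 1]$. We consider a function $V_\alpha$ that maps an input PDB $\mathcal{D} = (\mathbb{DB}, \mathfrak{DB}, P)$ to an output PDB $V_\alpha(\mathcal{D})$ such that the instance $D_{\mathcal{D},\alpha} := \{ f \in \mathbb{F} : P(\{ D \in \mathbb{DB} : |D|_f > 0 \}) \geq \alpha \}$ has probability 1 in $V_\alpha(\mathcal{D})$. Note that $D_{\mathcal{D},\alpha}$ is finite for all $\alpha \in (0, 1]$, so $V_\alpha$ is well-defined. The view $V_\alpha$ is a probabilistic threshold query with threshold $\alpha$. Consider the function $q_{\mathcal{D}}$ with $D \mapsto D_{\mathcal{D},\alpha}$ for all $D \in \mathbb{DB}$. Then $q_{\mathcal{D}}$ is constant, hence measurable, and witnesses that $V_\alpha$ is type 2 for all $\alpha \in (0, 1]$. Now let $\alpha > \frac{1}{2}$ and consider the following two PDBs $\mathcal{D}_1$ and $\mathcal{D}_2$, with probability measures $P_1$ and $P_2$, respectively, over distinct facts $f$ and $g$: Then $D_{\mathcal{D}_1,\alpha} = \{f\}$ and $D_{\mathcal{D}_2,\alpha} = \{g\}$. Suppose $q$ exists such that $P_V = P \circ q^{-1}$ for every input PDB $\mathcal{D} = (\mathbb{DB}, \mathfrak{DB}, P)$. Let $\mathbf{D}_f := \{D \in \mathbb{DB}_V : f \in D\}$. Then $P_1(q^{-1}(\mathbf{D}_f)) = 1$ implies $\{f\} \in q^{-1}(\mathbf{D}_f)$. On the other hand, $P_2(q^{-1}(\mathbf{D}_f)) = 0$ implies $\{f\} \notin q^{-1}(\mathbf{D}_f)$, a contradiction. Thus, $V_\alpha$ is not of type 3.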
Together, we have that $\mathbf{V}_{\mathrm{I}} \supsetneq \mathbf{V}_{\mathrm{II}} \supsetneq \mathbf{V}_{\mathrm{III}} \supsetneq \mathbf{V}_{\mathrm{IV}}$. Before closing this section, let us highlight two key insights of the arguments used in the examples for Propositions 8.4 and 8.5. In essence, we separated type 2 from type 3 by arguing about the structure of possible worlds without really taking probabilities into account. For the separation of type 1 from type 2, we argued about the structure of the probability measure instead. This highlights again the different levels within the hierarchical structure of a PDB (fact level, instance level, probability space level) that the views of the different classes operate on.
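Both separating arguments can be checked mechanically on finite PDBs. The following Python sketch is a hedged illustration only: the helper names (`event_probabilities`, `condition`, `threshold_instance`) and the dictionary encoding are hypothetical, and the two measures used for the threshold query are assumed values chosen to satisfy the properties stated in the proof of Proposition 8.5; they are not the measures of the original proof.

```python
from fractions import Fraction
from itertools import combinations

def event_probabilities(pdb):
    """All values P(E), where E ranges over the events of a finite PDB."""
    worlds = list(pdb)
    return {
        sum((pdb[w] for w in event), Fraction(0))
        for k in range(len(worlds) + 1)
        for event in combinations(worlds, k)
    }

def condition(pdb, event):
    """Condition a finite PDB on an event of positive probability."""
    mass = sum(pdb[w] for w in event)
    return {w: pdb[w] / mass for w in event}

def threshold_instance(pdb, alpha):
    """Set of facts whose marginal probability is at least alpha."""
    facts = set().union(*pdb)
    return frozenset(
        f for f in facts
        if sum(p for w, p in pdb.items() if f in w) >= alpha
    )

# Proposition 8.4: possible worlds are named by strings for brevity.
pdb = {"D1": Fraction(1, 6), "D2": Fraction(1, 2), "D3": Fraction(1, 3)}
conditioned = condition(pdb, ["D1", "D2"])
attainable = event_probabilities(pdb)
# 1/4 is needed in the conditioned PDB, but no input event has this mass.
needs_quarter = conditioned["D1"] == Fraction(1, 4)
quarter_attainable = Fraction(1, 4) in attainable

# Proposition 8.5: ASSUMED measures over the worlds {f} and {g}.
alpha = Fraction(2, 3)
pdb1 = {frozenset({"f"}): Fraction(3, 4), frozenset({"g"}): Fraction(1, 4)}
pdb2 = {frozenset({"f"}): Fraction(1, 4), frozenset({"g"}): Fraction(3, 4)}
result1 = threshold_instance(pdb1, alpha)
result2 = threshold_instance(pdb2, alpha)
```

With these assumed measures, the world $\{f\}$ has positive probability in both inputs, yet the two outputs differ, so a single function $q$ would have to map $\{f\}$ both to an instance containing $f$ and to one not containing $f$.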

Conclusions
We introduce the notion of standard PDBs, for which we rigorously describe how to construct suitable measurable spaces for infinite probabilistic databases, completing the picture of [GL19]. The viability of this model as a general and unifying framework for finite and infinite databases is supported by the well-definedness and compositionality of (typical) query semantics. Other kinds of PDB queries embed into the framework as well.
It is currently open whether, and if so how, more in-depth results on point processes can be used for probabilistic databases, for example, to perform open-world query answering. Also, while we focused on relational algebra and aggregation, the queries of Section 8 deserve a systematic treatment in their own right in infinite PDBs.