Efficient Evaluation of Arbitrary Relational Calculus Queries

The relational calculus (RC) is a concise, declarative query language. However, existing RC query evaluation approaches are inefficient and often deviate from established algorithms based on finite tables used in database management systems. We devise a new translation of an arbitrary RC query into two safe-range queries, for which the finiteness of the query’s evaluation result is guaranteed. Assuming an infinite domain, the two queries have the following meaning: The first is closed and characterizes the original query’s relative safety, i.e., whether given a fixed database, the original query evaluates to a finite relation. The second safe-range query is equivalent to the original query, if the latter is relatively safe. We compose our translation with other, more standard ones to ultimately obtain two SQL queries. This allows us to use standard database management systems to evaluate arbitrary RC queries. We show that our translation improves the time complexity over existing approaches, which we also empirically confirm in both realistic and synthetic experiments.


Introduction
Codd's theorem states that all domain-independent queries of the relational calculus (RC) can be expressed in relational algebra (RA) [Cod72]. A popular interpretation of this result is that RA suffices to express all interesting queries. This interpretation justifies why SQL evolved as the practical database query language with the RA as its mathematical foundation. SQL is declarative and abstracts over the actual RA expression used to evaluate a query. Yet, SQL's syntax inherits RA's deliberate syntactic limitations, such as union-compatibility, which ensure domain independence. RC does not have such syntactic limitations, which arguably makes it a more attractive declarative query language than both RA and SQL. The main problem of RC is that it is not immediately clear how to evaluate even domain-independent queries, much less how to handle the domain-dependent (i.e., not domain-independent) ones.
As a running example, consider a shop in which brands (unary finite relation B of brands) sell products (binary finite relation P relating brands and products) and products are reviewed by users with a score (ternary finite relation S relating products, users, and scores). We consider a brand suspicious if there is a user and a score such that all the brand's products were reviewed by that user with that score. An RC query computing suspicious brands is
$Q^{\mathit{susp}} := B(b) \land \exists u, s.\ \forall p.\ P(b, p) \longrightarrow S(p, u, s).$
This query is domain independent and closely follows our informal description. It is not, however, clear how to evaluate it because its second conjunct is domain dependent as it is satisfied for every brand that does not occur in P. Finding suspicious brands using RA or SQL is a challenge, which only the best students from an undergraduate database course will accomplish. We give away an RA answer next (where − is the set difference operator and ▷ is the anti-join, also known as the generalized difference operator [AHV95]):
$\pi_{\mathit{brand}}((\pi_{\mathit{user},\mathit{score}}(S) \times B) - \pi_{\mathit{brand},\mathit{user},\mathit{score}}((\pi_{\mathit{user},\mathit{score}}(S) \times P) \triangleright S)) \cup (B - \pi_{\mathit{brand}}(P)).$
The highlighted expressions $\pi_{\mathit{user},\mathit{score}}(S)$ are called generators. They ensure that the left operands of the anti-join and set difference operators include or have the same columns (i.e., are union-compatible) as the corresponding right operands. (Following Codd [Cod72], one could also use the active domain to obtain canonical, but far less efficient, generators.) Van Gelder and Topor [GT87, GT91] present a translation from a decidable class of domain-independent RC queries, called evaluable, to RA expressions. Their translation of the evaluable $Q^{\mathit{susp}}$ query would yield different generators, replacing both highlighted parts by $\pi_{\mathit{user}}(S) \times \pi_{\mathit{score}}(S)$. That one can avoid this Cartesian product as shown above is subtle: Replacing only the first highlighted generator with the product results in an inequivalent RA expression.
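To make the role of the generators concrete, the following minimal Python sketch evaluates the RA expression for the suspicious-brands query with plain set operations and compares it against a brute-force evaluation of the RC query. The toy database and the function names `suspicious_ra` and `suspicious_direct` are assumptions made for this illustration only; the brute-force variant enumerates an explicit finite domain, which is sufficient here because the query is domain independent.

```python
from itertools import product

# Toy database (hypothetical data, chosen for illustration).
B = {"b1", "b2", "b3"}                                          # brands
P = {("b1", "p1"), ("b1", "p2"), ("b2", "p3"), ("b2", "p4")}    # (brand, product)
S = {("p1", "u1", 5), ("p2", "u1", 5),                          # (product, user, score)
     ("p3", "u1", 5), ("p3", "u2", 1), ("p4", "u2", 2)}

def suspicious_ra(B, P, S):
    """Evaluate the RA expression for the suspicious-brands query."""
    gen = {(u, s) for (_, u, s) in S}                       # pi_{user,score}(S), the generator
    left = {(b, u, s) for (u, s) in gen for b in B}         # generator x B
    # (generator x P) anti-join S: rows whose review (p, u, s) is missing from S.
    missing = {(b, p, u, s) for (u, s) in gen for (b, p) in P
               if (p, u, s) not in S}
    bad = {(b, u, s) for (b, _, u, s) in missing}           # pi_{brand,user,score}
    with_witness = {b for (b, _, _) in left - bad}          # pi_brand of the difference
    no_products = B - {b for (b, _) in P}                   # B - pi_brand(P)
    return with_witness | no_products

def suspicious_direct(B, P, S, domain):
    """Evaluate the RC query by brute force over an explicit finite domain."""
    return {b for b in B
            if any(all((p, u, s) in S for (b2, p) in P if b2 == b)
                   for u, s in product(domain, repeat=2))}

# Active domain: every value occurring in B, P, or S.
domain = B | {v for row in P | S for v in row}
ra_result = suspicious_ra(B, P, S)
direct_result = suspicious_direct(B, P, S, domain)
```

On this data, brand b1 is suspicious because user u1 gave score 5 to both of its products, b3 is suspicious because it has no products, and b2 is not suspicious. The brute-force variant ranges over all pairs of domain values, whereas the RA variant only ever pairs values that occur together in S.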
Once we have identified suspicious brands, we may want to obtain the users whose scoring made the brands suspicious. In RC, omitting u's quantifier from $Q^{\mathit{susp}}$ achieves just that:
$Q^{\mathit{susp}}_{\mathit{user}} := B(b) \land \exists s.\ \forall p.\ P(b, p) \longrightarrow S(p, u, s).$
In contrast, RA cannot express the same property as it is domain dependent (hence also not evaluable and thus out of scope for Van Gelder and Topor's translation): $Q^{\mathit{susp}}_{\mathit{user}}$ is satisfied for every user if a brand has no products, i.e., it does not occur in P. Yet, $Q^{\mathit{susp}}_{\mathit{user}}$ is satisfied for finitely many users on every database instance where P contains at least one row for every brand from the relation B, in other words $Q^{\mathit{susp}}_{\mathit{user}}$ is relatively safe on such database instances.
How does one evaluate queries that are not evaluable or even domain dependent? The main approaches from the literature (Section 2) are either to use variants of the active domain semantics [BL00, HS94, AGSS86] or to abandon finite relations entirely and evaluate queries using finite representations of infinite (but well-behaved) relations such as systems of constraints [Rev02] or automatic structures [BG04]. These approaches favor expressiveness over efficiency. Moreover, unlike query translations, they cannot benefit from decades of practical database research and engineering.
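The relative safety of the user variant of the query can be illustrated with a small hand-written Python sketch. It is not the translation developed in this article: the function name `susp_user` and the convention of returning `None` for an infinite result are assumptions of this example. The sketch first decides whether the result over an infinite domain would be infinite (some brand has no products) and otherwise enumerates the finite set of (brand, user) pairs.

```python
def susp_user(B, P, S):
    """Return None if the query has infinitely many satisfying (brand, user)
    pairs over an infinite domain; otherwise return the finite result."""
    brands_with_products = {b for (b, _) in P}
    if B - brands_with_products:
        # Some brand has no products, so the implication holds vacuously
        # for every user of the infinite domain.
        return None
    result = set()
    for b in B:
        products = {p for (b2, p) in P if b2 == b}
        # Every brand has at least one product here, so any witnessing
        # (user, score) pair must occur in S.
        for (_, u, s) in S:
            if all((p, u, s) in S for p in products):
                result.add((b, u))
    return result

safe = susp_user({"b1"}, {("b1", "p1"), ("b1", "p2")},
                 {("p1", "u1", 5), ("p2", "u1", 5), ("p1", "u2", 3)})
unsafe = susp_user({"b1", "b3"}, {("b1", "p1")}, {("p1", "u1", 5)})
```

The first call returns a finite relation because every brand occurs in P; the second call reports an infinite result because brand b3 has no products.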
In this work, we translate arbitrary RC queries to RA expressions under the assumption of an infinite domain. To deal with queries that are domain dependent, our translation produces two RA expressions, instead of a single equivalent one. The first RA expression characterizes the original RC query's relative safety, the decidable question of whether the query evaluates to a finite relation for a given database, which can be the case even for a domain-dependent query, e.g., $Q^{\mathit{susp}}_{\mathit{user}}$. If the original query is relatively safe on a given database, i.e., produces some finite result, then the second RA expression evaluates to the same finite result. Taken together, the two RA expressions solve the query capturability problem [AH91]: they allow us to enumerate the original RC query's finite evaluation result, or to learn that it would be infinite using RA operations on the unmodified database.

[Figure 1: Overview of our translation. The steps are presented in Sections 4, 6.1, 6.2, 6.3, 6.4, and 6.5.]
Figure 1 summarizes our translation's steps and the sections where they are presented. Starting from an RC query, it produces two SQL queries via transformations to safe-range queries, the safe-range normal form (SRNF), the relational algebra normal form (RANF), and RA, respectively (Section 3). This article's main contribution is the first step: translating an RC query into two safe-range RC queries (Section 4), which fundamentally differs from Van Gelder and Topor's approach and produces better generators, like $\pi_{\mathit{user},\mathit{score}}(S)$ above. Our generators strictly improve the time complexity of query evaluation (Section 5).
After the standard transformations from safe-range to RANF queries and from there to RA expressions, we translate the RA expressions into SQL using the radb tool [Yan19] (Section 6). We leverage various ideas from the literature to optimize the overall result. For example, we generalize the approach of Claußen et al. [CKMP97] to avoid evaluating Cartesian products like $\pi_{\mathit{user},\mathit{score}}(S) \times P$ in RANF queries by using count aggregations (Section 6.3).
The translation to SQL enables any standard database management system (DBMS) to evaluate RC queries. We implement our translation and then use either PostgreSQL or MySQL for query evaluation. Using a real Amazon review dataset [NLM19] and our synthetic benchmark that generates hard database instances for random RC queries (Section 7), we evaluate our translation's performance (Section 8). The evaluation shows that our approach outperforms Van Gelder and Topor's translation (which also uses a standard DBMS for evaluation) and other RC evaluation approaches based on constraint databases and structure reduction.
In summary, our three main contributions are as follows:
• We devise a translation of an arbitrary RC query into a pair of RA expressions as described above. The time complexity of evaluating our translation's results improves upon Van Gelder and Topor's approach [GT91].
• We implement our translation and extend it to produce SQL queries. The resulting tool RC2SQL makes RC a viable input language for any standard DBMS. We evaluate our tool on synthetic and real data and confirm that our translation's improved asymptotic time complexity carries over into practice.
• To challenge RC2SQL (and its competitors) in our evaluation, we devise the Data Golf benchmark that generates hard database instances for randomly generated RC queries.

This article extends our ICDT 2022 conference paper [RBKT22b] with a more complete description of the translation. In particular, it describes the steps that follow our main contribution, the translation of RC queries into two safe-range queries. In addition, we formally verify the functional correctness (but not the complexity analysis) of the main contribution using the Isabelle/HOL proof assistant [RT22]. The theorems and examples that have been verified in Isabelle are marked with a special symbol. The formalization helped us identify and correct a technical oversight in the algorithm from the conference paper (even though the problem was compensated for by the subsequent steps of the translation in our implementation).

Related Work
We recall Trakhtenbrot's theorem and the fundamental notions of capturability and data complexity. Given an RC query over a finite domain, Trakhtenbrot [Tra50] showed that it is undecidable whether there exists a (finite) structure and a variable assignment satisfying the query. In contrast, the question of whether a fixed structure and a fixed variable assignment satisfy the given RC query is decidable [AGSS86].
Kifer [Kif88] calls a query class capturable if there is an algorithm that, given a query in the class and a database instance, enumerates the query's evaluation result, i.e., all tuples satisfying the query. Avron and Hirshfeld [AH91] observe that Kifer's notion is restricted because it requires every query in a capturable class to be domain independent. Hence, they propose an alternative definition that we also use: A query class is capturable if there is an algorithm that, given a query in the class, a (finite or infinite) domain, and a database instance, determines whether the query's evaluation result on the database instance over the domain is finite and enumerates the result in this case. Our work solves Avron and Hirshfeld's capturability problem additionally assuming an infinite domain.
Data complexity [Var82] is the complexity of recognizing if a tuple satisfies a fixed query over a database, as a function of the database size. Our capturability algorithm provides an upper bound on the data complexity for RC queries over an infinite domain that have a finite evaluation result (but it cannot decide if a tuple belongs to a query's result if the result is infinite).
Next, we group related approaches to evaluating RC queries into three categories.

Structure reduction. The classical approach to handling arbitrary RC queries is to evaluate them under a finite structure [Lib04]. The core question here is whether the evaluation produces the same result as defined by the natural semantics, which typically considers infinite domains. Codd's theorem [Cod72] affirmatively answers this question for domain-independent queries, restricting the structure to the active domain. Ailamazyan et al. [AGSS86] show that RC is a capturable query class by extending the active domain with a few additional elements, whose number depends only on the query, and evaluating the query over this finite domain. Natural-active collapse results [BL00] generalize Ailamazyan et al.'s [AGSS86] result to extensions of RC (e.g., with order relations) by combining the structure reduction with a translation-based approach. Hull and Su [HS94] study several semantics of RC that guarantee the finiteness of the query's evaluation result. In particular, the "output-restricted unlimited interpretation" only restricts the query's evaluation result to tuples that only contain elements in the active domain, but the quantified variables still range over the (finite or infinite) underlying domain. Our work is inspired by all these theoretical landmarks, in particular Hull and Su's work (Section 4.1). Yet we avoid using (extended) active domains, which make query evaluation impractical.
Query translation. Another strategy is to translate a given query into one that can be evaluated efficiently, for example as a sequence of RA operations on finite tables. Van Gelder and Topor pioneered this approach [GT87, GT91] for RC. A core component of their translation is the choice of generators, which replace the active domain restrictions from structure reduction approaches and thereby improve the time complexity. Extensions to scalar and complex function symbols have also been studied [EHJ93, LYL08]. All these approaches focus on syntactic classes of RC, for which domain independence is given, e.g., the evaluable queries of Van Gelder and Topor (Appendix A). Our approach is inspired by Van Gelder and Topor's work but generalizes it to handle arbitrary RC queries at the cost of assuming an infinite domain. Also, we further improve the time complexity of Van Gelder and Topor's approach by choosing better generators.
Evaluation with infinite relations. Constraint databases [Rev02] obviate the need for using RA operations on finite tables. This yields significant expressiveness gains as domain independence need not be assumed. Yet the efficiency of the quantifier elimination procedures employed cannot compare with the simple evaluation of the RA's projection operation. Similarly, automatic structures [BG04] can represent the results of arbitrary RC queries finitely, but struggle with large quantities of data. We demonstrate this in our evaluation where we compare our translation to several modern incarnations of the above approaches, all based on binary decision diagrams [MLAH99, Møl02, CGS09, KM01, BKMZ15].

Preliminaries
We introduce the RC syntax and semantics and define relevant classes of RC queries.

Relational Calculus.
A signature σ is a triple (C, R, ι), where C and R are disjoint finite sets of constant and predicate symbols, and the function ι : R → ℕ maps each predicate symbol r ∈ R to its arity ι(r). Let σ = (C, R, ι) be a signature and V a countably infinite set of variables disjoint from C ∪ R. The following grammar defines the syntax of RC queries:
$Q ::= \bot \mid \top \mid x \approx t \mid r(t_1, \ldots, t_{\iota(r)}) \mid \lnot Q \mid Q \lor Q \mid Q \land Q \mid \exists x.\ Q.$
Here, r ∈ R is a predicate symbol, $t, t_1, \ldots, t_{\iota(r)} \in V \cup C$ are terms, and x ∈ V is a variable. We write $\exists \vec{v}.\ Q$ for $\exists v_1. \ldots \exists v_k.\ Q$ and $\forall \vec{v}.\ Q$ for $\lnot \exists \vec{v}.\ \lnot Q$, where $\vec{v}$ is a variable sequence $v_1, \ldots, v_k$. If k = 0, then both $\exists \vec{v}.\ Q$ and $\forall \vec{v}.\ Q$ denote just Q. Quantifiers have lower precedence than conjunctions and disjunctions, e.g., $\exists x.\ Q_1 \land Q_2$ means $\exists x.\ (Q_1 \land Q_2)$.
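We use ≈ to denote the equality of terms in RC to distinguish it from =, which denotes syntactic object identity. We also write $Q_1 \longrightarrow Q_2$ for $\lnot Q_1 \lor Q_2$. However, defining $Q_1 \lor Q_2$ as a shorthand for $\lnot(\lnot Q_1 \land \lnot Q_2)$ would complicate later definitions, e.g., the safe-range queries (Section 3.2).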
We define the subquery partial order ⊑ on queries as the (reflexive and transitive) subterm relation on the datatype of RC queries. For example, $Q_1$ is a subquery of the query $Q_1 \land \lnot \exists y.\ Q_2$. We denote by sub(Q) the set of subqueries of a query Q, by fv(Q) the set of free variables in Q, and by av(Q) the set of all (free and bound) variables in a query Q. Furthermore, we denote by $\vec{fv}(Q)$ the sequence of free variables in Q based on some fixed ordering of variables. We lift this notation to sets of queries in the standard way. A query Q with no free variables, i.e., fv(Q) = ∅, is called closed. Queries of the form $r(t_1, \ldots, t_{\iota(r)})$ and x ≈ c are called atomic predicates. We define the predicate ap(·) characterizing atomic predicates, i.e., ap(Q) is true iff Q is an atomic predicate. Queries of the form $\exists \vec{v}.\ r(t_1, \ldots, t_{\iota(r)})$ and $\exists \vec{v}.\ x \approx c$ are called quantified predicates. We denote by $\tilde{\exists} x.\ Q$ the query obtained by existentially quantifying a variable x from a query Q if x is free in Q, i.e., $\tilde{\exists} x.\ Q$ is $\exists x.\ Q$ if x ∈ fv(Q) and Q otherwise. We lift this notation to sets of queries in the standard way. We use $\tilde{\exists} x.\ Q$ (instead of $\exists x.\ Q$) when constructing a query to avoid introducing bound variables that never occur in Q.
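The notions of free variables, all variables, and existential quantification that skips non-free variables can be sketched in Python as follows. The tuple-based encoding of queries, the convention that Python strings are variables and all other values are constants, and the function name `exists_if_free` are assumptions of this sketch rather than part of the formal development.

```python
# Queries are nested tuples (a hypothetical encoding chosen for this sketch):
#   ("bot",), ("top",), ("pred", r, (t1, ..., tn)), ("eq", x, t),
#   ("not", Q), ("and", Q1, Q2), ("or", Q1, Q2), ("exists", x, Q).
# Terms that are Python strings are variables; any other value is a constant.

def term_vars(terms):
    """Variables among a sequence of terms."""
    return {t for t in terms if isinstance(t, str)}

def fv(q):
    """Free variables of the query q."""
    tag = q[0]
    if tag == "pred":
        return term_vars(q[2])
    if tag == "eq":
        return term_vars(q[1:])
    if tag == "not":
        return fv(q[1])
    if tag in ("and", "or"):
        return fv(q[1]) | fv(q[2])
    if tag == "exists":
        return fv(q[2]) - {q[1]}
    return set()  # bot and top have no variables

def av(q):
    """All (free and bound) variables of the query q."""
    tag = q[0]
    if tag == "not":
        return av(q[1])
    if tag in ("and", "or"):
        return av(q[1]) | av(q[2])
    if tag == "exists":
        return av(q[2]) | {q[1]}
    return fv(q)

def exists_if_free(x, q):
    """Quantify x existentially only if x is free in q."""
    return ("exists", x, q) if x in fv(q) else q

# B(b) and exists s. not exists p. (P(b, p) and not S(p, u, s))
q_user = ("and", ("pred", "B", ("b",)),
          ("exists", "s", ("not", ("exists", "p",
              ("and", ("pred", "P", ("b", "p")),
                      ("not", ("pred", "S", ("p", "u", "s"))))))))
```

For the example query, the free variables are b and u, while s and p are bound, and quantifying a variable that does not occur free leaves the query unchanged.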
A structure S over a signature (C, R, ι) consists of a non-empty domain D and interpretations $c^S \in D$ and $r^S \subseteq D^{\iota(r)}$, for each c ∈ C and r ∈ R. We assume that all the relations $r^S$ are finite. Note that this assumption does not yield a finite structure (as defined in finite model theory [Lib04]) since the domain D can still be infinite. A (variable) assignment is a mapping α : V → D. We extend α to constant symbols c ∈ C with $\alpha(c) = c^S$. We write α[x → d] for the assignment that maps x to d ∈ D and is otherwise identical to α. We lift this notation to sequences $\vec{x}$ and $\vec{d}$ of pairwise distinct variables and arbitrary domain elements of the same length. The semantics of RC queries for a structure S and an assignment α is defined in Figure 2. We write α |= Q for (S, α) |= Q if the structure S is fixed in the given context. For a fixed S, only the assignments to Q's free variables influence α |= Q, i.e., α |= Q is equivalent to α′ |= Q, for every variable assignment α′ that agrees with α on fv(Q). For closed queries Q, we write |= Q and say that Q holds, since closed queries either hold for all variable assignments or for none of them. We call a finite sequence $\vec{d}$ of domain elements $d_1, \ldots, d_k \in D$ a tuple. Given a query Q and a structure S, we denote the set of satisfying tuples for Q by
$[\![Q]\!]^S := \{\vec{d} \in D^{|\vec{fv}(Q)|} \mid (S, \alpha[\vec{fv}(Q) \to \vec{d}]) \models Q \text{ for some assignment } \alpha\}.$
We omit S from $[\![Q]\!]^S$ if S is fixed. We call the values from $[\![Q]\!]$ assigned to x ∈ fv(Q) column x.

[Figure 2: The semantics of RC.]
[Figure 3 (excerpt): x ≈ x ≡ ⊤, ¬⊥ ≡ ⊤, ¬⊤ ≡ ⊥, ∃x. ⊥ ≡ ⊥, ∃x. ⊤ ≡ ⊤.]
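The satisfaction relation and the set of satisfying tuples can be sketched in Python under one important simplification: the domain is passed as an explicit finite collection, whereas the article allows infinite domains. The tuple encoding of queries, the interpretation of constants as themselves, and the names `sat` and `satisfying_tuples` are assumptions of this sketch.

```python
from itertools import product

# Queries are nested tuples: ("bot",), ("top",), ("pred", r, terms), ("eq", x, t),
# ("not", Q), ("and", Q1, Q2), ("or", Q1, Q2), ("exists", x, Q).
# Strings are variables; any other term is a constant interpreted as itself.

def sat(q, alpha, domain, rels):
    """Decide whether assignment alpha satisfies q over the given finite domain."""
    tag = q[0]

    def value(t):
        return alpha[t] if isinstance(t, str) else t

    if tag == "bot":
        return False
    if tag == "top":
        return True
    if tag == "pred":
        return tuple(value(t) for t in q[2]) in rels[q[1]]
    if tag == "eq":
        return value(q[1]) == value(q[2])
    if tag == "not":
        return not sat(q[1], alpha, domain, rels)
    if tag == "and":
        return sat(q[1], alpha, domain, rels) and sat(q[2], alpha, domain, rels)
    if tag == "or":
        return sat(q[1], alpha, domain, rels) or sat(q[2], alpha, domain, rels)
    if tag == "exists":
        # Try every domain element as the value of the bound variable.
        return any(sat(q[2], {**alpha, q[1]: d}, domain, rels) for d in domain)
    raise ValueError(f"unknown query tag: {tag}")

def satisfying_tuples(q, free_vars, domain, rels):
    """All tuples over the domain (ordered as free_vars) that satisfy q."""
    return {d for d in product(domain, repeat=len(free_vars))
            if sat(q, dict(zip(free_vars, d)), domain, rels)}

rels = {"B": {(1,)}}
pos = ("pred", "B", ("x",))
neg = ("not", pos)
small = satisfying_tuples(neg, ["x"], {1, 2, 3}, rels)
large = satisfying_tuples(neg, ["x"], {1, 2, 3, 4}, rels)
```

The satisfying tuples of B(x) are the same for both domains, whereas those of ¬B(x) grow with the domain, which is why brute-force enumeration cannot be used directly once the domain is infinite.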
The active domain $adom^S(Q)$ of a query Q and a structure S is a subset of the domain D containing the interpretations $c^S$ of all constant symbols that occur in Q and the values in the relations $r^S$ interpreting all predicate symbols that occur in Q. Since C and R are finite and all $r^S$ are finite relations of a finite arity ι(r), the active domain $adom^S(Q)$ is also a finite set. We omit S from $adom^S(Q)$ if S is fixed in the given context.
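A minimal Python sketch of the active domain follows. The function signature, which takes the constant and predicate symbols occurring in the query together with dictionaries for their interpretations, is an assumption made for this illustration.

```python
def adom(constants_in_q, predicates_in_q, const_interp, rel_interp):
    """Active domain: interpretations of the query's constants plus all values
    stored in the relations of the query's predicate symbols."""
    values = {const_interp[c] for c in constants_in_q}
    for r in predicates_in_q:
        for row in rel_interp[r]:
            values.update(row)
    return values

# Hypothetical structure: one constant symbol and the relations B, P, and S.
const_interp = {"c": 42}
rel_interp = {"B": {("b1",)},
              "P": {("b1", "p1")},
              "S": {("p1", "u1", 5)}}
ad = adom({"c"}, {"B", "P"}, const_interp, rel_interp)
```

Only the symbols that occur in the query contribute: the values u1 and 5 from S are absent because S is not among the query's predicate symbols in this call.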
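Queries $Q_1$ and $Q_2$ over the same signature are equivalent, written $Q_1 \equiv Q_2$, if $(S, \alpha) \models Q_1 \iff (S, \alpha) \models Q_2$ for every S and α. They are inf-equivalent, written $Q_1 \overset{\infty}{\equiv} Q_2$, if $(S, \alpha) \models Q_1 \iff (S, \alpha) \models Q_2$ for every S with an infinite domain D and every α. Clearly, equivalent queries are also inf-equivalent.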
A query Q is domain-independent if $[\![Q]\!]^{S_1} = [\![Q]\!]^{S_2}$ holds for every two structures $S_1$ and $S_2$ that agree on the interpretations of constants ($c^{S_1} = c^{S_2}$) and predicates ($r^{S_1} = r^{S_2}$), while their domains $D_1$ and $D_2$ may differ. Agreement on the interpretations implies $adom^{S_1}(Q) = adom^{S_2}(Q) \subseteq D_1 \cap D_2$.

We denote by cp(Q) the query obtained from a query Q by exhaustively applying the rules in Figure 3. Note that cp(Q) is either of the form ⊥ or ⊤ or contains no ⊥ or ⊤ subqueries.
Definition 3.1. The substitution of the form Q[x → y] is the query cp(Q′), where Q′ is obtained from a query Q by replacing all occurrences of the free variable x by the variable y, potentially also renaming bound variables to avoid capture.

Definition 3.2. The substitution of the form Q[x/⊥] is the query cp(Q′), where Q′ is obtained from a query Q by replacing with ⊥ every atomic predicate or equality containing the free variable x, except for (x ≈ x) which is replaced by ⊤.
We lift the substitution notation to sets of queries in the standard way.
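The substitution Q[x/⊥] together with constant propagation can be sketched in Python as follows. The tuple encoding of queries is an assumption of this sketch, and so are the propagation rules for conjunction and disjunction (⊥ is absorbing and ⊤ neutral for ∧, and dually for ∨), which are the standard ones and are assumed here in addition to the rules for equality, negation, and quantification.

```python
BOT, TOP = ("bot",), ("top",)

# Queries are nested tuples: ("pred", r, terms), ("eq", x, t), ("not", Q),
# ("and", Q1, Q2), ("or", Q1, Q2), ("exists", x, Q); strings are variables.

def cp(q):
    """Constant propagation: simplify away bot and top subqueries bottom-up."""
    tag = q[0]
    if tag == "eq" and q[1] == q[2]:
        return TOP
    if tag == "not":
        a = cp(q[1])
        return TOP if a == BOT else BOT if a == TOP else ("not", a)
    if tag in ("and", "or"):
        a, b = cp(q[1]), cp(q[2])
        # Assumed standard rules for the binary connectives.
        absorbing, neutral = (BOT, TOP) if tag == "and" else (TOP, BOT)
        if absorbing in (a, b):
            return absorbing
        if a == neutral:
            return b
        if b == neutral:
            return a
        return (tag, a, b)
    if tag == "exists":
        a = cp(q[2])
        return a if a in (BOT, TOP) else ("exists", q[1], a)
    return q

def subst_bot(q, x):
    """Q[x/bot]: replace atoms with a free occurrence of x by bot, then propagate."""
    def go(q):
        tag = q[0]
        if tag == "pred":
            return BOT if x in q[2] else q
        if tag == "eq":
            if q[1] == x and q[2] == x:
                return TOP
            return BOT if x in (q[1], q[2]) else q
        if tag == "not":
            return ("not", go(q[1]))
        if tag in ("and", "or"):
            return (tag, go(q[1]), go(q[2]))
        if tag == "exists":
            # Below a binder for x, occurrences of x are bound, not free.
            return q if q[1] == x else ("exists", q[1], go(q[2]))
        return q
    return cp(go(q))

P_bp = ("pred", "P", ("b", "p"))
S_pus = ("pred", "S", ("p", "u", "s"))
q = ("not", ("exists", "p", ("and", P_bp, ("not", S_pus))))
result = subst_bot(q, "s")
```

Substituting ⊥ for s in ¬∃p. P(b, p) ∧ ¬S(p, u, s) yields ¬∃p. P(b, p), and substituting ⊥ for the only variable of an atomic predicate yields ⊥.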
[Figure 4: The generated relation.]
[Figure 5: The covered relation.]
The function $\mathrm{flat}^{\oplus}(Q)$, where ⊕ ∈ {∨, ∧}, computes a set of queries by "flattening" the operator ⊕: $\mathrm{flat}^{\oplus}(Q) := \mathrm{flat}^{\oplus}(Q_1) \cup \mathrm{flat}^{\oplus}(Q_2)$ if $Q = Q_1 \oplus Q_2$, and $\mathrm{flat}^{\oplus}(Q) := \{Q\}$ otherwise.

3.2. Safe-Range Queries.
The class of safe-range queries [AHV95] is a decidable subset of domain-independent RC queries. Its definition is based on the notion of the range-restricted variables of a query. A variable is called range restricted if "its possible values all lie within the active domain of the query" [AHV95]. Intuitively, atomic predicates restrict the possible values of a variable that occurs in them as a term. An equality x ≈ y can extend the set of range-restricted variables in a conjunction Q ∧ x ≈ y: If x or y is range restricted in Q, then both x and y are range restricted in Q ∧ x ≈ y.
We formalize range-restricted variables using the generated relation gen(x, Q, G), defined in Figure 4. Specifically, gen(x, Q, G) holds if x is a range-restricted variable in Q and every satisfying assignment for Q satisfies some quantified predicate, referred to as generator, from G. A similar definition by Van Gelder and Topor [GT91, Figure 5] uses a set of atomic (not quantified) predicates A as generators and defines the rule $gen_{vgt}(x, \exists y.\ Q_y, A)$ if x ≠ y and $gen_{vgt}(x, Q_y, A)$ (Appendix A, Figure 17). In contrast, we modify the rule's conclusion to existentially quantify the variable y in all queries in G where y is free: $gen(x, \exists y.\ Q_y, \tilde{\exists} y.\ G)$. Hence, gen(x, Q, G) implies fv(G) ⊆ fv(Q). We now formalize these relationships.
Lemma 3.3. Let Q be a query, x ∈ fv(Q), and G be a set of quantified predicates such that gen(x, Q, G). Then (i) for every $Q_{qp} \in G$, we have $x \in fv(Q_{qp})$ and $fv(Q_{qp}) \subseteq fv(Q)$, (ii) for every α such that α |= Q, there exists a $Q_{qp} \in G$ such that $\alpha \models Q_{qp}$, and (iii) Q[x/⊥] = ⊥.

Definition 3.4. Let gen(x, Q) hold iff gen(x, Q, G) holds for some G. Let nongens(Q) := {x ∈ fv(Q) | gen(x, Q) does not hold} be the set of free variables in a query Q that are not range restricted. A query Q has range-restricted free variables if every free variable of Q is range restricted, i.e., nongens(Q) = ∅. A query Q has range-restricted bound variables if the bound variable y in every subquery $\exists y.\ Q_y$ of Q is range restricted, i.e., $gen(y, Q_y)$ holds. A query is safe range if it has range-restricted free and range-restricted bound variables.

3.3. Safe-Range Normal Form.
A query Q is in safe-range normal form (SRNF) if the query Q′ in every subquery ¬Q′ of Q is an atomic predicate, equality, or an existentially quantified query [AHV95]. In Section 6.1 we define the function srnf(Q) that returns an SRNF query equivalent to a query Q. Intuitively, the function srnf(Q) proceeds by pushing negations downwards [AHV95, Section 5.4], distributing existential quantifiers over disjunction [GT91, Rule (T9)], and dropping bound variables that never occur [GT91, Definition 9.2]. We include the last two rules to optimize the time complexity of evaluating the resulting query.
If a query Q is safe range, then srnf(Q) is also safe range.
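The SRNF condition itself is a simple syntactic check, sketched below in Python. The sketch only tests the definition (every negated subquery is an atomic predicate, an equality, or an existentially quantified query); it does not implement the function srnf(Q), and the tuple encoding of queries and the name `is_srnf` are assumptions of this example.

```python
# Queries are nested tuples: ("bot",), ("top",), ("pred", r, terms), ("eq", x, t),
# ("not", Q), ("and", Q1, Q2), ("or", Q1, Q2), ("exists", x, Q).

def is_srnf(q):
    """Check that every negation is applied to an atom, equality, or exists."""
    tag = q[0]
    if tag == "not":
        return q[1][0] in ("pred", "eq", "exists") and is_srnf(q[1])
    if tag in ("and", "or"):
        return is_srnf(q[1]) and is_srnf(q[2])
    if tag == "exists":
        return is_srnf(q[2])
    return True  # bot, top, atomic predicates, and equalities

P_bp = ("pred", "P", ("b", "p"))
S_pus = ("pred", "S", ("p", "u", "s"))

# not exists p. (P(b, p) and not S(p, u, s)) is in SRNF.
in_srnf = ("not", ("exists", "p", ("and", P_bp, ("not", S_pus))))
# not (P(b, p) and not S(p, u, s)) negates a conjunction, so it is not.
not_srnf = ("not", ("and", P_bp, ("not", S_pus)))
# A double negation is not in SRNF either.
double_neg = ("not", ("not", P_bp))
```

Pushing the negation in the second query through the conjunction would yield ¬P(b, p) ∨ S(p, u, s), which passes the check.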

3.4. Relational Algebra Normal Form.
Relational algebra normal form (RANF) is a class of safe-range queries that can be easily mapped to RA [AHV95] and evaluated using the RA operations for projection, column duplication, selection, set union, binary join, and anti-join. Figure 6 defines the predicate ranf(·) characterizing RANF queries. The translation of safe-range queries (Section 3.2) to equivalent RANF queries proceeds via SRNF (Section 3.3). A safe-range query in SRNF can be translated to an equivalent RANF query by subquery rewriting using the rewrite rules (R1)-(R3) [AHV95, Algorithm 5.4.7]. Subquery rewriting is a nondeterministic process (as the rewrite rules can be applied in an arbitrary order) that impacts the performance of evaluating the resulting RANF query. We translate a safe-range query in SRNF to an equivalent RANF query by a recursive function sr2ranf(·) inspired by the rules (R1)-(R3) and fully specified in Figure 12 in Section 6.2.

3.5. Query Cost.
To assess the time complexity of evaluating a RANF query Q, we define the cost of Q over a structure S, denoted $cost^S(Q)$, to be the sum of intermediate result sizes over all RANF subqueries of Q. Formally,
$cost^S(Q) := \sum_{Q' \sqsubseteq Q,\ ranf(Q')} |[\![Q']\!]^S| \cdot |fv(Q')|.$
This corresponds to evaluating Q following its RANF structure (Section 3.4, Figure 6) using the RA operations. The complexity of these operations is linear in the combined input and output size (ignoring logarithmic factors due to set operations). The output size (the number of tuples times the number of variables) is counted in $|[\![Q']\!]^S| \cdot |fv(Q')|$ and the input size is counted as the output size for the input subqueries. Repeated subqueries are only considered once, which does not affect the asymptotics of query cost. In practice, the evaluation results for common subqueries can be reused.
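The cost measure can be illustrated with a few lines of Python. The sketch assumes that the evaluation results of all distinct RANF subqueries are already available in a dictionary; the dictionary layout and the example data are hypothetical.

```python
def cost(subquery_results):
    """Sum of (number of tuples) x (number of free variables) over all distinct
    RANF subqueries. Using a dict keyed by subquery counts repeats only once."""
    return sum(len(rows) * num_fv for rows, num_fv in subquery_results.values())

# Hypothetical intermediate results for the RANF query B(b) and P(b, p).
results = {
    "B(b)": ({("b1",), ("b2",)}, 1),
    "P(b, p)": ({("b1", "p1"), ("b1", "p2"), ("b2", "p3")}, 2),
    "B(b) and P(b, p)": ({("b1", "p1"), ("b1", "p2"), ("b2", "p3")}, 2),
}
total = cost(results)
```

Here the cost is 2·1 + 3·2 + 3·2 = 14. A closed subquery contributes nothing because it has no free variables.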

Query Translation
Our approach to evaluating an arbitrary RC query Q over a fixed structure S with an infinite domain D proceeds by translating Q into a pair of safe-range queries $(Q_{fin}, Q_{inf})$. Since the queries $Q_{fin}$ and $Q_{inf}$ are safe range, they are domain-independent and thus $[\![Q_{fin}]\!]$ is a finite set. In particular, $[\![Q]\!]$ is a finite set if $Q_{inf}$ does not hold. Our translation generalizes Hull and Su's case distinction that restricts bound variables [HS94] to restrict all variables. Moreover, we use Van Gelder and Topor's idea to replace the active domain by a smaller set (generator) specific to each variable [GT91] while further improving the generators. Unless explicitly noted, in the rest of the article we assume a fixed structure S.

4.1. Restricting One Variable.
Let x be a free variable in a query Q with range-restricted bound variables. This assumption on Q will be established by translating an arbitrary query Q bottom-up (Section 4.2). In this section, we develop a translation of Q into an equivalent query Q′ that satisfies the following:
• Q′ has range-restricted bound variables;
• Q′ is a disjunction; x is range restricted in the first disjunct; the remaining disjuncts are all binary conjunctions of a query not containing x with a query of a special form containing x.
The special form, central to our translation, is either an equality x ≈ y or a query satisfied by infinitely many values of x for all values of the remaining free variables.

We now restate Hull and Su's [HS94] and Van Gelder and Topor's [GT91] approaches using our notation in order to clarify how we generalize both approaches. In particular, Hull and Su's approach is already stated in a generalized way that restricts a free variable.
Hull and Su.Let Q be a query with range-restricted bound variables and x ∈ fv( Q).Then Here AD(x, Q) stands for an RC query with a single free variable x that is satisfied by an assignment α if and only if α(x) ∈ adom( Q).Hull and Su's translation distinguishes the following three cases for a fixed assignment α (each corresponding to a top-level disjunct above): Specifically, all atomic predicates having x free can be replaced by ⊥ (as α(x) / ∈ adom( Q)), all equalities x ≈ y and y ≈ x for y ∈ fv( Q) \ {x} can be replaced by ⊥ (as α(x) ̸ = α(y)), and all equalities x ≈ z for a bound variable z can be replaced by ⊥ (as α(x) / ∈ adom( Q) and z is range restricted in its subquery ∃z.Q z , by assumption).In the last case, gen(z, Q z ) holds and thus, for all α ′ extending α, we have α and following Hull and Su we obtain that Q is equivalent to the disjunction of the following three queries: propagation that is part of the substitution operator; Note that in this example, each disjunct covers a different subset of Q's satisfying assignments and all three disjuncts are necessary to cover all of Q's satisfying assignments.
Van Gelder and Topor.Let Q be an evaluable query with range-restricted bound variables, x ∈ fv( Q).Then there exists a set A of atomic predicates such that Note that ∃ ⃗ fv(Q) \ {x}.Q is the query in which all free variables of Q except for x are existentially quantified.Van Gelder and Topor restrict their attention to evaluable queries, which do not contain equalities between variables.(They only discuss an incomplete approach to supporting such equalities [GT91, Appendix A].) Thus, their translation lacks the corresponding disjuncts that Hull and Su have.
To avoid enumerating the entire active domain adom( Q), Van Gelder and Topor replace the query AD(x, Q) used by Hull and Su by the query Qap ∈A ∃ ⃗ fv(Q ap )\{x}.Q ap constructed from the atomic predicates from A. Because their translation must yield an equivalent query (for every finite or infinite domain), A and Q must satisfy, for all α, Note that (vgt 2 ) does not hold for the query Q ¬B(x) and thus Van Gelder and Topor only consider a proper subset of all RC queries, called evaluable.For evaluable queries, Van Gelder and Topor use the constrained relation con vgt (x, Q, A), defined in Appendix A, Figure 17, to construct a set of atomic predicates A that satisfies (vgt 1 ).
Our Translation. Let Q be a query with range-restricted bound variables, x ∈ fv(Q). Then there exists a set A of atomic predicates and a set of equalities E such that

In contrast to Van Gelder and Topor, we only require that A satisfies (vgt 1) in our translation, which also allows us to translate non-evaluable queries, such as Q := ¬B(x) above. Note that we also existentially quantify only those variables that are not free in Q, whereas Van Gelder and Topor quantify all variables except x. For our introductory example $Q^{\mathit{susp}}$, this modification allows our translation to use the quantified predicate ∃p. S(p, u, s) to restrict both u and s simultaneously. In contrast, Van Gelder and Topor's approach restricts them separately using ∃p, u. S(p, u, s) and ∃p, s. S(p, u, s), so that the Cartesian product of these quantified predicates may need to be computed in their translated queries.
In contrast to Hull and Su, we do not consider the equalities of x with all other free variables in Q, but only such equalities E that occur in Q.We jointly compute the sets A and E using the covered relation cov(x, Q, G) (in contrast to con vgt (x, Q, A) relation).Figure 5 shows the definition of this relation.The set G computed by the covered relation contains atomic predicates that satisfy (vgt 1 ) and are already quantified as described above.The set also contains the relevant equalities that can be used in our translation.For every variable x and query Q with range-restricted bound variables, there exists at least one set of quantified predicates and equalities G such that cov(x, Q, G) and (vgt 1 ) holds for the set of atomic predicate subqueries in G (i.e., for As the cover set G in cov(x, Q, G) may contain both quantified predicates and equalities between two variables, we define a function qps(G) that collects all generators, i.e., quantified predicates and a function eqs(x, G) that collects all variables y distinct from x occurring in equalities of the form x ≈ y.We use qps ∨ (G) to denote the query Qqp ∈qps(G) Q qp .We state the soundness and completeness of the relation cov(x, Q, G) in the next lemma, which follows by induction on the derivation of cov(x, Q, G).
Lemma 4.2 .Let Q be a query with range-restricted bound variables, x ∈ fv( Q).Completeness: Then there exists a set G of quantified predicates and equalities such that cov(x, Q, G) holds and, Soundness: for any G satisfying cov(x, Q, G) and all α, Finally, to preserve the dependencies between the variable x and the remaining free variables of Q occurring in the quantified predicates from qps(G), we do not project qps(G) on the single variable x, i.e., we restrict x by qps ∨ (G) instead of ∃ ⃗ fv(Q) \ {x}.qps(G) as by Van Gelder and Topor.From Lemma 4.2, we derive our optimized translation characterized by the following lemma.
Lemma 4.3 .Let Q be a query with range-restricted bound variables, x ∈ fv( Q), and G be such that cov(x, Q, G) holds.Then x ∈ fv(Q qp ) and fv(Q qp ) ⊆ fv( Q), for every Q qp ∈ qps(G), and Note that x is only guaranteed to be range restricted in (⋆)'s first disjunct.However, it only occurs in the remaining disjuncts in subqueries of a special form that are conjoined at the top-level to the disjuncts.These subqueries of a special form are equalities of the form x ≈ y or negations of a disjunction of quantified predicates with a free occurrence of x and equalities of the form x ≈ y.We will show how to handle such occurrences in Section 4.2 and Section 4.3.Moreover, the negation of the disjunction can be omitted if (vgt 2 ) holds.4.2.Restricting Bound Variables.Let x be a free variable in a query Q with rangerestricted bound variables.Suppose that the variable x is not range restricted, i.e., gen(x, Q) does not hold.To translate ∃x.Q into an inf-equivalent query with range-restricted bound variables (∃x.Q does not have range-restricted bound variables precisely because x is not range restricted in Q), we first apply (⋆) to Q and distribute the existential quantifier binding x over disjunction.Next we observe that where the first equivalence follows because x does not occur free in Q[x → y] and the second equivalence follows from the straightforward validity of ∃x.(x ≈ y).Moreover, we observe input: An RC query Q. output: A query Q with range-restricted bound variables such that Q  for which (fv) and (eval) hold.
the following inf-equivalence (recall: an equivalence that holds for infinite domains only): and there exists a value d for x in the infinite domain D such that x ≠ y holds for all finitely many y ∈ eqs(x, G) and d is not among the finitely many values interpreting the quantified predicates in qps(G). Altogether, we obtain the following lemma.
Lemma 4.4 .Let Q be a query with range-restricted bound variables, x ∈ fv( Q), and G be a set of quantified predicates and equalities such that cov(x, Q, G) holds.Then Our approach for restricting all bound variables recursively applies Lemma 4.4.Because the set G such that cov(x, Q, G) holds is not necessarily unique, we introduce the following (general) notation.We denote the non-deterministic choice of an object X from a non-empty set X as X ← X .We define the recursive function rb(Q) in Figure 7, where rb stands for range-restrict bound variables.The function converts an arbitrary RC query Q into an inf-equivalent query with range-restricted bound variables.We proceed by describing the case ∃x.Q x .First, rb(Q x ) is recursively applied on Line 9 to establish the precondition of Lemma 4.4 that the translated query has range-restricted bound variables.Because existential quantification distributes over disjunction, we flatten disjunction in rb(Q x ) and process the individual disjuncts independently.We apply (⋆∃) to every disjunct Q fix in which the variable x is not already range restricted.For every Q ′ fix added to Q after applying (⋆∃) to Q fix the variable x is either range restricted or does not occur in Q ′ fix , i.e., x / ∈ nongens(Q ′ fix ).This entails the termination of the loop on Lines 10-13.The bound variable p is already range restricted in Q susp user and thus only s must be restricted.Applying (⋆) to restrict s in ¬∃p.P(b, p) ∧ ¬S(p, u, s), then existentially quantifying s, and distributing the existential quantifier over disjunction would yield the first disjunct in rb(Q susp user ) above and ∃s.(¬∃p.P(b, p))∧¬(∃p.S(p, u, s)) as the second disjunct.Because there exists some value in the infinite domain D that does not belong to the finite interpretation of the atomic predicate S(p, u, s), the query ∃s.¬(∃p.S(p, u, s)) is a tautology over D. 
Hence, ∃s. (¬∃p. P(b, p)) ∧ ¬(∃p. S(p, u, s)) is inf-equivalent to ¬∃p. P(b, p), i.e., the second disjunct in rb($Q^{susp}_{user}$). This reasoning justifies why our algorithm applies (⋆∃) instead of (⋆) to restrict s in ∃s. ¬∃p. P(b, p) ∧ ¬S(p, u, s).

4.3. Restricting Free Variables. Given an arbitrary query Q, we translate the inf-equivalent query rb(Q) with range-restricted bound variables into a pair of safe-range queries ($Q_{fin}$, $Q_{inf}$) such that our translation's main properties (fv) and (eval) hold. Our translation is based on the following lemma.
Lemma 4.6 .Let x be a free variable in a query Q with range-restricted bound variables and let cov(x, Q, G) for a set of quantified predicates and equalities G.If Q[x/⊥] is not satisfied by any tuple, then ] is satisfied by some tuple, then Q is an infinite set.
Proof.If Q[x/⊥] is not satisfied by any tuple, then (⋆) follows from (⋆).If Q[x/⊥] is satisfied by some tuple, then the last disjunct in (⋆) applied to Q is satisfied by infinitely many tuples obtained by assigning x some value from the infinite domain D such that x ̸ = y holds for all finitely many y ∈ eqs(x, G) and x does not appear among the finitely many values interpreting the quantified predicates from qps(G).
We remark that Q might be an infinite set of tuples even if Q[x/⊥] is never satisfied, for some x.This is because Q[y/⊥] might be satisfied by some tuple, for some y, in which case Lemma 4.6 (for y) implies that Q is an infinite set of tuples.Still, (⋆) can be applied to Q for x resulting in a query satisfied by the same infinite set of tuples.
Our approach is implemented by the function split(Q) defined in Figure 8.In the following, we describe this function and justify its correctness, formalized by the input/output specification.In split(Q), we represent the queries Q fin and Q inf using a set Q fin of pairs consisting of a query and a relation representing a set of equalities and a set Q inf of queries such that and, for every (Q f , E) ∈ Q fin , the relation E represents a set of equalities between variables.Hereby, ≈ (Q f , E) is a query that is equivalent to Q∈{Q f }∪≈(E) .Q where ≈(E) abbreviates {x ≈ y | (x, y) ∈ E}.However, the ≈ (Q f , E) operator carefully assembles the conjunction to ensure that the resulting query is safe range (whenever possible).In particular, the operator must iteratively conjoin the equalities from ≈(E) to Q f in a left-associative fashion and always pick next an equation for which one of the variables is free in Q f or in the equalities conjoined so far, if such an equation exists.(If no such equation exists, the operator is free to conjoin the remaining equations in an arbitrary order.)Our algorithm proceeds as follows.As long as there exists some (Q fix , E) ∈ Q fin such that nongens(Q fix ) ̸ = ∅, we apply (⋆) to Q fix and add the query Q fix [x/⊥] to Q inf .We remark that if we applied (⋆) to the entire disjunct ≈ (Q fix , E), the loop on Lines 7-12 might not terminate.Note that, for every (Q ′ fix , E ′ ) added to Q fin after applying (⋆) to This entails the termination of the loop on Lines 7-12.Finally, if Q fix is an infinite set of tuples, then ≈ (Q fix , E) is an infinite set of tuples too.This is because the equalities in E merely duplicate columns of the query Q fix .Hence, it indeed suffices to apply (⋆) to After the loop on Lines 7-12 in Figure 8 terminates, for every However, the query ≈ (Q f , E) does not have to be safe range, e.g., if Q f B(x) and E {(x, y), (u, v)}.Given a relation E, let classes(E) be the set of equivalence classes of free variables 
fv(Q ≈ ) with respect to the (partial) equivalence closure of E, i.e., the smallest symmetric and transitive relation that contains E. For instance, classes({(x, y), (y, z), (u, v)}) = {{x, y, z}, {u, v}}.Let disjointvars(Q f , E) V ∈classes(E),V ∩fv(Q f )=∅ V be the set of all variables in equivalence classes from classes(E) that are disjoint from Q f 's free variables.Then, ) is an infinite set of tuples because all equivalence classes of variables in disjointvars(Q f , E) ̸ = ∅ can be assigned arbitrary values from the infinite domain D. In our example with E) is satisfied by some tuple, then this tuple can be extended to infinitely many tuples over fv(Q) by choosing arbitrary values from the infinite domain D for the variables in the non-empty set fv Note that we only remove pairs from Q fin , hence the loop on Lines 13-16 terminates.Afterwards, the query Q fin is safe range.However, the query Q inf does not have to be safe range.Indeed, every query Q i ∈ Q inf has range-restricted bound variables, but not all the free variables of Q i need be range restricted and thus the query ∃ ⃗ fv(Q i ).Q i does not have to be safe range.But the query Q inf is closed and thus the inf-equivalent query rb(Q inf ) with range-restricted bound variables is safe range.
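The definitions of classes(E) and disjointvars can be made concrete with a small sketch. The following Python code is only an illustration under our own assumptions: variables are strings, E is a set of pairs, and the free variables of the query are passed as a plain set (the names `classes` and `disjointvars` mirror the text; the union-find implementation is our choice and not taken from the paper's OCaml code).

```python
def classes(E):
    """Equivalence classes of the variables occurring in E with respect to the
    smallest symmetric and transitive relation containing E (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for x, y in E:
        parent[find(x)] = find(y)

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return {frozenset(g) for g in groups.values()}


def disjointvars(fv_qf, E):
    """Union of all equivalence classes that share no variable with fv_qf,
    the (assumed given) set of free variables of the query."""
    result = set()
    for cls in classes(E):
        if not (cls & set(fv_qf)):
            result |= cls
    return result
```

For the example from the text, a query with the single free variable x and E = {(x, y), (u, v)}, the call `disjointvars({"x"}, {("x", "y"), ("u", "v")})` returns the variables u and v, which is exactly the situation in which the pair is removed from the finite part.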
1 This statement contained the error we discovered while formalizing the result presented in our conference paper [RBKT22b]. There we had wrongly used the naive conjunction $Q_f \wedge \bigwedge_{Q \in {\approx}(E)} Q$, which will not be safe range whenever E has more than one element, instead of the more carefully constructed ≈(Q_f, E).
Lemma 4.7. Let Q be an RC query and split(Q) = ($Q_{fin}$, $Q_{inf}$). Then the queries $Q_{fin}$ and $Q_{inf}$ are safe range; fv($Q_{fin}$) = fv(Q) unless $Q_{fin}$ is syntactically equal to ⊥; and fv($Q_{inf}$) = ∅.
Lemma 4.8 .Let Q be an RC query and split By Lemma 4.7, Q fin is a safe-range (and thus also domain-independent) query.Hence, for the fixed structure, the tuples in Q fin only contain elements in the active domain adom(Q fin ), i.e., Q fin = Q fin ∩ adom(Q fin ) |fv(Qfin)| .Our translation does not introduce new constants in Q fin and thus adom(Q fin ) ⊆ adom(Q).Hence, by Lemma 4.8, if does not necessarily hold.For instance, for Q ¬B(x), our translation yields split(Q) = (⊥, ⊤).In this case, we have Q inf = ⊤ and thus |= Q inf because ¬B(x) is satisfied by infinitely many tuples over an infinite domain.However, if B(x) is never satisfied, then Next, we demonstrate different aspects of our translation on a few examples.Thereby, we use a mildly modified algorithm that performs constant propagation after all steps that could introduce constants ⊤ or ⊥ in a subquery.This optimization keeps the queries small, but is not necessary for termination and correctness.(In contrast, the constant propagation that is part of the substitution operators Q[x → y] and Q[x/⊥] is necessary.)We have verified in Isabelle that our results hold for the modified algorithm.That is, for all above theorems, we proved two variants: one with and one without additional constant propagation steps.
Example 4.9. Consider the query Q ≔ B(x) ∨ P(x, y). The variable y is not range restricted in Q and thus split(Q) restricts y by a conjunction of Q with P(x, y). However, if Q[y/⊥] = B(x) is satisfied by some tuple, then Q contains infinitely many tuples. Hence, split(Q) = ((B(x) ∨ P(x, y)) ∧ P(x, y), ∃x. B(x)). Because $Q_{fin}$ = (B(x) ∨ P(x, y)) ∧ P(x, y) is only used if ⊭ $Q_{inf}$, i.e., if B(x) is never satisfied, we could simplify $Q_{fin}$ to P(x, y). However, our translation does not implement such heuristic simplifications.
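To make the roles of the two queries in Example 4.9 tangible, here is a minimal Python sketch that evaluates them over a finite database. It is an illustration under our own assumptions: B is a set of values, P is a set of pairs, the function name `capture_example_4_9` is hypothetical, and the result `None` is our encoding of "the original query is satisfied by infinitely many tuples".

```python
def capture_example_4_9(B, P):
    """Evaluate split(Q) for Q = B(x) ∨ P(x, y) over finite relations B and P.

    Returns None if Q_inf = ∃x. B(x) holds (Q is then infinite over an
    infinite domain); otherwise returns the finite result of Q_fin.
    """
    q_inf = len(B) > 0  # ∃x. B(x)
    if q_inf:
        return None
    # Q_fin = (B(x) ∨ P(x, y)) ∧ P(x, y), evaluated by scanning P only
    return {(x, y) for (x, y) in P if x in B or (x, y) in P}
```

When B is empty, the filter is trivially true for every pair in P, which matches the observation that the finite query could be simplified to P(x, y) in that case.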
Example 4.10 .Consider the query Q B(x) ∧ u ≈ v.The variables u and v are not range restricted in Q and thus split(Q) chooses one of these variables (e.g., u) and restricts it by splitting Q into Q f = B(x) and E = {(u, v)}.Now, all variables are range restricted in Q f , but the variables in Q f and E are disjoint.Hence, Q contains infinitely many tuples whenever Q f is satisfied by some tuple.In contrast, To understand split(Q susp user ), we apply (⋆) to rb(Q susp user ) for the free variable u: rb(Q susp user ) ≡ rb(Q susp user ) ∧ ∃s, p. S(p, u, s) ∨ B(b) ∧ ¬∃p.P(b, p) ∧ ¬∃s, p. S(p, u, s) .If the subquery B(b) ∧ (¬∃p.P(b, p)) from the second disjunct is satisfied for some b, then Q susp user is satisfied by infinitely many values for u from the infinite domain D that do not belong to the finite interpretation of S(p, u, s) and thus satisfy the subquery ¬∃s, p. S(p, u, s).Hence, S is an infinite set of tuples whenever B(b) ∧ ¬∃p.P(b, p) is satisfied for some b.In contrast, if B(b) ∧ ¬∃p.P(b, p) is not satisfied for any b, then Q susp user is equivalent to rb(Q susp user ) ∧ (∃s, p. S(p, u, s)) obtained also by applying (⋆) to Q susp user for the free variable u.

Complexity Analysis
We analyze the time complexity of capturing Q, i.e., checking if Q is finite and enumerating Q in this case.To bound the asymptotic time complexity of capturing a fixed Q, we need to apply an additional standard translation step to both queries produced by our translation to obtain two RANF queries.Query cost (Section 3.5) can then be applied to the resulting two queries to bound computation time based on the cardinalities of subquery evaluation results.
Since function sr2ranf(•) is a standard translation step, we present it in Section 6 (see Figure 12).Note that the proof of Lemma 5.6 relies on its algorithmic details.
We ignore the (constant) time complexity of computing rw(Q) = ( Qfin , Qinf ) and focus on the time complexity of evaluating the RANF queries Qfin and Qinf , i.e., the query cost of Qfin and Qinf .Without loss of generality, we assume that the input query Q has pairwise distinct (free and bound) variables to derive a set of quantified predicates from Q's atomic predicates and formulate our time complexity bound.Still, the RANF queries Qfin and Qinf computed by our translation need not have pairwise distinct (free and bound) variables.
We define the relation ≲ Q on av(Q) such that x ≲ Q y iff the scope of an occurrence of x ∈ av(Q) is contained in the scope of an occurrence of y ∈ av(Q).Formally, we define x ≲ Q y iff y ∈ fv(Q) or ∃x.Q x ⊑ ∃y.Q y ⊑ Q for some Q x and Q y .Note that ≲ Q is a preorder on all variables and a partial order on the bound variables for every query with pairwise distinct (free and bound) variables.
Let aps(Q) be the set of all atomic predicates in a query Q.We denote by qps(Q) the set of quantified predicates obtained from aps(Q) by performing the variable substitution x → y, where x and y are related by equalities in Q and x ≲ Q y, and existentially quantifying from a quantified predicate Q qp the innermost bound variable x in Q that is free in Q qp .Let eqs * (Q) be the transitive closure of equalities occurring in Q. Formally, we define qps(Q) by: We bound the time complexity of capturing Q by considering subsets Q qps of quantified predicates qps(Q) that are minimal in the sense that every quantified predicate in Q qps contains a unique free variable that is not free in any other quantified predicate in Q qps .Formally, we define minimal(Q qps ) ∀Q qp ∈ Q qps .fv(Q qps \ {Q qp }) ̸ = fv(Q qps ).Every minimal subset Q qps of quantified predicates qps(Q) contributes the product of the numbers of tuples satisfying each quantified predicate Q qp ∈ Q qps to the overall bound (that product is an upper bound on the number of tuples satisfying the join over all Q qp ∈ Q qps ).Similarly to Ngo et al. [NRR13], we use the notation Õ (•) to hide logarithmic factors incurred by set operations.
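The minimality condition and the product bound can be illustrated with a short Python sketch. The representation is our assumption, not the paper's: each quantified predicate is identified by a label and mapped to its set of free variables, and `sizes` maps the same labels to the number of tuples satisfying the predicate. The helper names `fv`, `minimal`, and `product_bound` are hypothetical.

```python
import math


def fv(qps):
    """Union of the free variables of all quantified predicates in qps."""
    return set().union(*qps.values())


def minimal(qps):
    """minimal(Q_qps): removing any single predicate loses some free variable."""
    return all(
        fv({k: v for k, v in qps.items() if k != name}) != fv(qps)
        for name in qps
    )


def product_bound(qps, sizes):
    """Product of the cardinalities of the predicates in a minimal subset,
    an upper bound on the size of their join."""
    return math.prod(sizes[name] for name in qps)
```

For instance, the set {P(x, y), S(y, z)} is minimal because each predicate contributes a variable the other lacks, whereas {P(x, y), ∃z. R(y, z)} is not, since dropping the second predicate leaves the set of free variables unchanged.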
Theorem 5.2.Let Q be a fixed RC query with pairwise distinct (free and bound) variables.The time complexity of capturing Q, i.e., checking if Q is finite and enumerating Q in this case, is in Õ Note that this query is equivalent to Now, to prove Theorem 5.2, we need to introduce guard queries and the set of quantified predicates of a query.Given a RANF query Q, we define a guard query guard( Q) that is implied by Q, i.e., guard( Q) can be used to over-approximate the set of satisfying tuples for Q.We use this over-approximation in our proof of Theorem 5.2.The guard query guard( Q) has a simple structure: it is the disjunction of conjunctions of quantified predicates and equalities.
We now define the set of quantified predicates qps(Q) occurring in the guard query guard(Q).For an atomic predicate Q ap ∈ aps(Q), let B Q (Q ap ) be the set of sequences of bound variables for all occurrences of Q ap in Q.For example, consider a query Q ex ((∃z.(∃y, z.P 3 (x, y, z)) ∧ P 2 (y, z)) ∧ P 1 (z)) ∨ P 3 (x, y, z).Then aps(Q ex ) = {P 1 (z), P 2 (y, z), P 3 (x, y, z)} and B Qex (P 3 (x, y, z)) = {[y, z], []}, where [] denotes the empty sequence corresponding to the occurrence of P 3 (x, y, z) in Q ex for which the variables x, y, z are all free in Q ex .Note that the variable z in the other occurrence of P 3 (x, y, z) in Q ex is bound to the innermost quantifier.Hence, neither [z, y] nor [z, y, z] are in B Qex (P 3 (x, y, z)).Furthermore, let qps(Q) be the set of the quantified predicates obtained by existentially quantifying sequences of bound variables in For instance, qps(Q ex ) = {P 3 (x, y, z), ∃z.P 3 (x, y, z), ∃y, z.P 3 (x, y, z), P 2 (y, z), ∃z.P 2 (y, z), P 1 (z)}.
A crucial property of our translation, which is central for the proof of Theorem 5.2, is the relationship between the quantified predicates qps( Q) for a RANF query Q produced by our translation and the original query Q.The relationship is formalized in the following lemma.
Lemma 5.6.Let Q be an RC query with pairwise distinct (free and bound) variables and let rw Next we observe that qps(Q ′ ) ⊆ qps(Q ′ ) for every query Q ′ .Finally, we show that qps( Qfin ) ⊆ qps(Q fin ) and qps( Qinf ) ⊆ qps(Q inf ).We observe that Assume that Q ′ ∧ Q∈Q Q is a safe-range query in which no variable occurs both free and bound, no bound variables shadow each other, i.e., there are no subqueries ∃x.
x is a quantified predicate. Then the free variables in $\bigwedge_{Q \in \mathcal{Q}} Q$ never clash with the bound variables in Q′, i.e., Line 26 in Figure 12 is never executed. Next we observe that (this subset relation only holds when considering queries modulo α-equivalence, i.e., queries that have the same binding structure but differ in the used bound variable names are considered to be equal) and then qps(sr2ranf Recall Example 5.3. The query ∃u, p. S(p, u, s) is in qps($Q_{vgt}$), but not in qps(Q). Hence, qps($Q_{vgt}$) ⊈ qps(Q), i.e., an analogue of Lemma 5.6 for Van Gelder and Topor's translation does not hold.
Every tuple satisfying a RANF query Q belongs to the set of tuples satisfying the join over some minimal subset Q qps ⊆ qps( Q) of quantified predicates and satisfying equalities duplicating some of Q qps 's columns.Hence, we define the guard query guard( Q) as follows: Note that {x ≈ y | x ∈ V ∧ y ∈ V ′ } denotes the set of all equalities x ≈ y between variables x ∈ V and y ∈ V ′ .We express the correctness of the guard query in the following lemma.
Lemma 5.7.Let Q be a RANF query.Then, for all variable assignments α, Proof.The statement follows by well-founded induction over the definition of ranf( Q).
Proof.Applying Lemma 5.7 to the RANF query Q′ yields Q′ ⊆ Qqps ⊆qps( Q′ ),minimal(Qqps ), We observe that where the first inequality follows from the fact that equalities Q ≈ ∈ Q ≈ can only restrict a set of tuples and duplicate columns.Because Q′ is a subquery of Q, it follows that qps( Q′ ) ⊆ qps( Q).Lemma 5.6 yields qps( Q) ⊆ qps(Q).Hence, we derive qps( Q′ ) ⊆ qps(Q).
The second inequality holds because the variables in a subquery Q′ of Q are in av( Q).Hence, the number of subsets We now bound the query cost of a RANF query Q ∈ { Qfin , Qinf } over the fixed structure S.
Lemma 5.9.Let Q be an RC query with pairwise distinct (free and bound) variables and let Finally, we prove Theorem 5.2.
Proof of Theorem 5.2.We derive Theorem 5.2 from Lemma 5.9 and the fact that the quantities sub( Q) , av( Q) , and 2 |av( Q)| 2 only depend on the query Q and thus they do not contribute to the asymptotic time complexity of capturing a fixed query Q.

Implementation
We have implemented our translation RC2SQL consisting of roughly 1000 lines of OCaml code [RBKT22a].It consists of multiple translation steps that take an arbitrary relational calculus (RC) query and produce two SQL queries.
Figure 9 summarizes the order of the translation steps and the functions that implement them. The function split(•) (Section 4.3), applied in the first step, is the main part of our translation. Recall that it takes an arbitrary RC query and returns two safe-range RC queries. Next, the function srnf(•) (Section 6.1) converts both queries to safe-range normal form (SRNF), followed by the function sr2ranf(•, •) (Section 6.2) that converts SRNF queries into relational algebra normal form (RANF). Both normal forms were defined in Section 3. For simplicity, we define a function sr2ranf(•) that combines the previous two functions and can be applied to any safe-range RC query. In addition to the worst-case complexity, we further improve our translation's average-case complexity by implementing the optimizations inspired by Claußen et al. [CKMP97]. The function optcnt(•) (Section 6.3) implements these optimizations on the RANF queries. Finally, to derive SQL queries from the RANF queries, we first obtain equivalent relational algebra (RA) expressions following a (slightly modified) standard approach [AHV95] implemented by the function ranf2ra(•) (Section 6.4). To translate the RA expressions into SQL, we reuse a publicly available RA interpreter radb [Yan19] (Section 6.5). We name the composition of the last two steps ranf2sql(•).
To resolve the nondeterministic choices present in our algorithms (Section 6.6) we always choose the alternative with the lowest query cost.The query cost is estimated by using a sample structure of constant size, called a training database.A good training database should preserve the relative ordering of queries by their cost over the actual database as much as possible.Nevertheless, our translation satisfies the correctness and worst-case complexity claims independently of the choice of the training database.
Overall, the translation is formally defined as Figure 11: Measure on RC queries.12, where sr2ranf stands for safe range to relational algebra normal form, takes a safe-range query Here m(Q) is defined in Figure 11, eqneg(Q) 1 if Q is an equality between two variables or the negation of a query, and eqneg(Q) 0 otherwise.
Next we describe the definition of sr2ranf(Q, 𝒬) that follows [AHV95, Algorithm 5.4.7]. Note that no constant propagation (Figure 3) is needed in [AHV95, Algorithm 5.4.7], because the constants ⊥ and ⊤ are not in the query syntax [AHV95, Section 5.3]. Because gen(x, ⊥) holds and x ∉ fv(⊥), we need to perform constant propagation to guarantee that every disjunct has the same set of free variables (e.g., the query ⊥ ∨ B(x) must be translated to B(x) to be in RANF). We flatten the disjunction and conjunction using flat∨(•) and flat∧(•), respectively. In the case of a conjunction Q∧, we first split the queries from flat∧(Q∧) and 𝒬 into queries 𝒬⁺ that do not have the form of a negation and queries 𝒬⁻ that do. Then we take out equalities between two variables and negations of equalities between two variables from the sets 𝒬⁺ and 𝒬⁻, respectively. To partition flat∧(Q∧) ∪ 𝒬 this way, we define the predicates neg(Q) and eq(Q) characterizing negations and equalities between two variables, respectively, i.e., neg(Q) is true iff Q has the form ¬Q′ and eq(Q) is true iff Q has the form x ≈ y. Finally, the function sort∧(𝒬) converts a set of queries into a RANF conjunction, defined in Figure 6, i.e., a left-associative conjunction in RANF. Note that the function sort∧(𝒬) must order the queries x ≈ y so that either x or y is free in some preceding conjunct, e.g., B(x) ∧ x ≈ y ∧ y ≈ z is in RANF, but B(x) ∧ y ≈ z ∧ x ≈ y is not. In the case of an existentially quantified query ∃v⃗. Q_v⃗, we rename the variables v⃗ to avoid a clash of the free variables in the set of queries 𝒬 with the bound variables v⃗.
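The ordering constraint on equalities can be sketched as a simple greedy procedure. The Python code below is an illustration under our own assumptions and not the paper's sort∧ function: equalities are pairs of variable names, `fv_start` is the set of variables free in the conjuncts placed before the equalities, and the hypothetical function `order_equalities` returns `None` when no admissible order exists.

```python
def order_equalities(fv_start, equalities):
    """Order equalities x ≈ y so that each one has a variable that is free in
    some preceding conjunct; return None if no such order exists."""
    covered = set(fv_start)      # variables free in the conjuncts so far
    pending = list(equalities)
    ordered = []
    while pending:
        pick = next(
            (e for e in pending if e[0] in covered or e[1] in covered), None
        )
        if pick is None:
            return None          # remaining equalities are disconnected
        pending.remove(pick)
        ordered.append(pick)
        covered.update(pick)     # both variables are now restricted
    return ordered
```

Starting from B(x), the input y ≈ z, x ≈ y is reordered to x ≈ y, y ≈ z, which corresponds to the RANF conjunction B(x) ∧ x ≈ y ∧ y ≈ z from the text. An equality such as u ≈ v that is disconnected from x yields `None`.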
Finally, we resolve the nondeterministic choices in sr2ranf(Q, Q) by minimizing the cost of the resulting RANF query with respect to a training database (Section 6.6).6.3.Optimization using Count Aggregations.In this section, we introduce count aggregations and describe a generalization of Claußen et al. [CKMP97]'s approach to evaluate RANF queries using count aggregations.Consider the query for all subqueries of the form ¬Q ′ , gen(x, ¬Q ′ ) does not hold for any variable x. output: A RANF query Q and a subset of queries where fv(Q x ) = {x}, fv(Q y ) = {y}, and fv(Q xy ) = {x, y}.This query is obtained by applying our translation to the query Q x ∧ ∀y.(Q y −→ Q xy ).The cost of the translated query is dominated by the cost of the Cartesian product Q x ∧ Q y .Consider the subquery Next we introduce the syntax and semantics of count aggregations.We extend RC's syntax by [CNT ⃗ v. Q ⃗ v ](c), where Q is a query, c is a variable representing the result of the count aggregation, and ⃗ v is a sequence of variables that are bound by the aggregation operator.The semantics of the count aggregation is defined as follows: We formulate translations introducing count aggregations in the following two lemmas.
Lemma 6.2.Given Q ̸ = ∅, let ∃⃗ v. Q ⃗ v ∧ Q∈Q ¬Q be a RANF query.Let c, c ′ be fresh variables that do not occur in fv(Q ⃗ v ).Then Moreover, the right-hand side of (#) is in RANF.
Moreover, the right-hand side of (##) is in RANF.
Note that the query cost does not decrease after applying the translation (#) or (##) because of the subquery [CNT ⃗ v. Q ⃗ v ](c) in which Q ⃗ v is evaluated before the count aggregation is computed.For the query ∃y.((Q x ∧ Q y ) ∧ ¬Q xy ) from before, we would compute [CNT y.Q x ∧ Q y ](c), i.e., we would not (yet) avoid computing the Cartesian product Q x ∧ Q y .However, we could reduce the scope of the bound variable y by further translating This technique, called mini-scoping, can be applied to a count aggregation [CNT ⃗ v. Q ⃗ v ](c) if the aggregated query Q ⃗ v is a conjunction that can be split into two RANF conjuncts and the variables ⃗ v do not occur free in one of the conjuncts (that conjunct can be pulled out of the count aggregation).Mini-scoping can be analogously applied to queries of the form ∃⃗ v. Q ⃗ v .
Moreover, we can split a count aggregation over a conjunction Q ⃗ v ∧ Q ′ ⃗ v into a product of count aggregations if the conjunction can be split into two RANF conjuncts with disjoint sets of bound variables, i.e., Here c 1 and c 2 are fresh variables that do not occur in fv(Q ⃗ v ) ∪ fv(Q ′ ⃗ v ) ∪ {c}.Note that miniscoping is only a heuristic and it can both improve and harm the time complexity of query evaluation.We leave the application of other more general optimization algorithms [KNR16,OS16]) as future work.
We implement the translations from Lemmas 6.2 and 6.3 and mini-scoping in the function optcnt(•).Given a RANF query Q, optcnt( Q) is an equivalent RANF query after introducing count aggregations and performing mini-scoping.The function optcnt( Q) uses a training database to decide how to apply the translations from Lemmas 6.2 and 6.3 and mini-scoping.More specifically, the function optcnt( Q) tries several possibilities and chooses one that minimizes the query cost of the resulting RANF query.
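The benefit of count aggregations can be illustrated on the query Q_x ∧ ∀y. (Q_y → Q_xy) discussed above. The following Python sketch is not the translation (#) or (##) itself; it only contrasts, under our own assumptions, a naive evaluation that materializes the Cartesian product with a count-based evaluation in the spirit of Claußen et al. Relations are Python sets, and the function names are hypothetical.

```python
from collections import Counter


def forall_naive(Qx, Qy, Qxy):
    """Q_x ∧ ¬∃y.((Q_x ∧ Q_y) ∧ ¬Q_xy), materializing the product Q_x × Q_y."""
    violating = {x for x in Qx for y in Qy if (x, y) not in Qxy}
    return Qx - violating


def forall_counting(Qx, Qy, Qxy):
    """Same query via counts: x qualifies iff the number of y in Q_y with
    Q_xy(x, y) equals the total number of y in Q_y."""
    counts = Counter(x for (x, y) in Qxy if x in Qx and y in Qy)
    return {x for x in Qx if counts.get(x, 0) == len(Qy)}
```

The counting variant scans Q_xy once instead of enumerating all |Q_x| · |Q_y| pairs, which is the kind of saving that introducing count aggregations and mini-scoping aims for.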
Example 6.4.We show how to introduce count aggregations into the RANF query After applying the translation (##) and mini-scoping to this query, we obtain the following equivalent RANF query: 6.4.Translating RANF to RA.Our translation of a RANF query into SQL has two steps: we first translate the query to an equivalent RA expression, which we then translate to SQL using a publicly available RA interpreter radb [Yan19].We define the function ranf2ra( Q) translating RANF queries Q into equivalent RA expressions ranf2ra( Q).The translation is based on Algorithm 5.4.8 by Abiteboul et al. [AHV95], which we modify as follows.We adjust the way closed RC queries are handled.Chomicki and Toman [CT95] observed that closed RC queries cannot be handled by SQL, since SQL allows neither empty projections nor 0-ary relations.They propose to use a unary auxiliary predicate A ∈ R whose interpretation A S = {t} always contains exactly one tuple t.Every closed query ∃x.Q x is then translated into ∃x.A(t) ∧ Q x with an auxiliary free variable t.Every other closed query Q is translated into A(t) ∧ Q, e.g., B(42) is translated into A(t) ∧ B(42).We also use the auxiliary predicate A to translate queries of the form x ≈ c and c ≈ x because the single tuple (t) in A S can be mapped to any constant c.Finally, we extend [AHV95, Algorithm 5.4.8] with queries of the form [CNT ⃗ v. Q ⃗ v ](c).6.5.Translating RA to SQL.The radb interpreter, abbreviated here by the function ra2sql(•), translates an RA expression into SQL by simply mapping the RA connectives into their SQL counterparts.The function ra2sql(•) is primitive recursive on RA expressions.We modify radb to further improve performance of the query evaluation as follows.
The radb interpreter introduces a separate SQL subquery in a WITH clause for every subexpression in the RA expression.We extend radb to additionally perform common subquery elimination, i.e., to merge syntactically equal subqueries.Common subquery elimination is also assumed in our query cost (Section 3.5).
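Common subquery elimination can be sketched independently of radb. The Python code below is a hypothetical illustration and does not reproduce radb's code or its API: RA expressions are assumed to be nested tuples of the form (operator, argument, ...), and each syntactically distinct subexpression receives exactly one name, as one would do when emitting one WITH-clause entry per subexpression.

```python
def eliminate_common_subqueries(expr):
    """Assign one name to every syntactically distinct subexpression.

    Returns the name of the root and a list of definitions
    (name, operator, argument names) in dependency order.
    """
    names = {}  # subexpression -> name
    defs = []   # (name, operator, argument names or base relation names)

    def visit(e):
        if e in names:
            return names[e]  # reuse the existing definition
        op, *args = e
        arg_names = [visit(a) if isinstance(a, tuple) else a for a in args]
        name = f"t{len(defs)}"
        names[e] = name
        defs.append((name, op, arg_names))
        return name

    root = visit(expr)
    return root, defs
```

For a union of two syntactically equal joins of P and Q, the sketch produces four definitions instead of seven: the two base relations, a single join, and the union that refers to the join twice.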
Finally, the function ranf2sql(Q) is defined as ranf2sql(Q) ≔ ra2sql(ranf2ra(Q)), i.e., as a composition of the two translations from RANF to RA and from RA to SQL.
6.6. Resolving Nondeterministic Choices. To resolve the nondeterministic choices in our algorithms, we suppose that the algorithms have access to a training database T of constant size. The training database is used to compare the cost of queries over the actual database and thus it should preserve the relative ordering of queries by their cost over the actual database as much as possible. Still, our translation satisfies the correctness and worst-case complexity claims (Sections 4.3 and 5) for every choice of the training database. The training databases used in our empirical evaluation are obtained using the function dg (Section 7) with |T⁺| = |T⁻| = 2. Because of its constant size, the complexity of evaluating a query over the training database is constant and does not impact the asymptotic time complexity of evaluating the query over the actual database using our translation. There are two types of nondeterministic choices in our algorithms:
• Choosing some X ← 𝒳 in a while-loop. As the while-loops always update 𝒳 with 𝒳 ≔ (𝒳 \ {X}) ∪ f(X) for some f, the order in which the elements of 𝒳 are chosen does not matter.
yields a RANF query, we enumerate all minimal subsets (a subset Q ⊆ Q is minimal if there exists no proper subset Q ′ ⊊ Q that could be used instead of Q) and choose one that minimizes the query cost of the RANF query.

Empirical Evaluation
We empirically validate the evaluation performance of the queries output by RC2SQL. We also assess RC2SQL's translation time, the average-case time complexity of query evaluation, scalability to large databases, and DBMS interoperability. To this end, we answer the following research questions:
RQ1 How does RC2SQL's query evaluation perform compared to the state-of-the-art tools on both domain-independent and domain-dependent queries?
RQ2 How does RC2SQL's query evaluation scale on large synthetic databases?
RQ3 How does RC2SQL's query evaluation perform on real-world databases?
RQ4 How does the count aggregation optimization impact RC2SQL's performance?
RQ5 Can RC2SQL use different DBMSs for query evaluation?
RQ6 How long does RC2SQL take to translate different queries (without query evaluation)?
We organize our evaluation into five experiments. Four experiments (Small, Medium, Large, and Real) focus on the type and size of the structures we use for query evaluation. The fifth experiment (Infinite) focuses on the evaluation of non-evaluable (i.e., domain-dependent) queries that may potentially produce infinite evaluation results.
To answer RQ1, we compare our tool with the translation-based approach by Van Gelder and Topor [GT91] (VGT), the structure reduction approach by Ailamazyan et al. [AGSS86], and the DDD [MLAH99, Møl02], LDD [CGS09], and MonPoly REG [BKMZ15] tools that evaluate RC queries directly using infinite relations encoded as binary decision diagrams. We could not find a publicly available implementation of Van Gelder and Topor's translation. Therefore, the tool VGT for evaluable RC queries is derived from our implementation by modifying the function rb(•) in Figure 7 to use the relation $con_{vgt}(x, Q, A)$ (Appendix A, Figure 17) instead of cov(x, Q, G) (Figure 5) and to use the generator $\bigvee_{Q_{ap} \in A} \exists \vec{fv}(Q) \setminus \{x\}.\ Q_{ap}$ instead of qps∨(G). Evaluable queries Q are always translated into ($Q_{fin}$, ⊥) by rw(•) because all of Q's free variables are range restricted. We exclude VGT from the comparison on non-evaluable queries (experiment Infinite). Similarly, the implementation of Ailamazyan et al.'s approach was not available; hence we used our formally verified implementation [Ras22]. The implementations of the remaining tools were publicly available.
We use Data Golf structures of growing size (experiments Small, Medium, and Large) to answer RQ2.In contrast, to answer RQ3, we use real-world structures obtained from the Amazon review dataset [NLM19] (experiment Real).
To answer RQ4, we also consider variants of the translation-based approaches without the step that uses count aggregation optimization optcnt(•), superscripted with a minus ( − ).

[Table 1: body not recoverable from the source. Columns: experiments Small, Medium, Large, Infinite, Real. Table note: * Only states that the result is infinite.]
Table 1: Applicability and performance of all the tools on all the experiments. TO = Timeout of 300s on all experiment runs, N/A = Not applicable.
SQL queries computed by the translations are evaluated using the PostgreSQL and MySQL DBMS (RQ5). We superscript the tool names with P and M accordingly. In the Large experiment, we only use PostgreSQL because it consistently performed better than MySQL in the Medium experiment. In all our experiments, the translation-based tools used a Data Golf structure with |T⁺| = |T⁻| = 2 as the training database. We run our experiments on an AMD Ryzen 7 PRO 4750U computer with 32 GB RAM. The relations in PostgreSQL and MySQL are recreated before each invocation to prevent optimizations based on caching recent query evaluation results. We measure the query evaluation times of all the tools and the translation time of our RC2SQL tool (RQ6). We provide all our experiments in an easily reproducible and publicly available artifact [RBKT22a].
In the Small, Medium, and Large experiments, we generate ten pseudorandom queries (denoted as $Q_i$, $1 \le i \le 10$, see Appendix C) with a fixed size 14 and Data Golf structures S (strategy $\gamma = 1$). The queries satisfy the Data Golf assumptions along with a few additional ones: the queries are not safe range, every bound variable actually occurs in its scope, disjunction only appears at the top level, and only pairwise distinct variables appear as terms in predicates. The queries have 2 free variables and every subquery has at most 4 free variables. We control the size of the Data Golf structure S in our experiments using a parameter $n = |T^+| = |T^-|$. Because the sets $T^+$ and $T^-$ grow in the recursion on subqueries, relations in a Data Golf structure typically have more than n tuples. The values of the parameter n for Data Golf structures are summarized in Figure 15.
The Infinite experiment consists of five pseudorandom queries $Q^I_i$, $1 \le i \le 5$ (Appendix C), that are not evaluable and for which $\mathsf{rw}(Q^I_i) = (Q_{i,fin}, Q_{i,inf})$, where $Q_{i,inf} \neq \bot$. Specifically, the queries are of the form $Q_1 \wedge \forall x, y.\ Q_2 \longrightarrow Q_3$, where $Q_1$, $Q_2$, and $Q_3$ are either atomic predicates or equalities. We choose the queries so that the number of their satisfying tuples is not too high, e.g., quadratic in the parameter n, because no tool can possibly enumerate so many tuples within the timeout. For each $1 \le i \le 5$, we compare the performance of our tool to tools that directly evaluate $Q^I_i$ on structures generated by the two Data Golf strategies (parameter $\gamma$), which trigger infinite or finite evaluation results on the considered queries. For infinite results, our tool outputs this fact (by evaluating $Q_{i,inf}$), whereas the other tools also output a finite representation of the infinite result. For finite results, all tools produce the same output.
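To make the finite versus infinite distinction concrete, the following minimal sketch evaluates one hypothetical query of the above form, $P(u) \wedge \forall x, y.\ R(x, y) \longrightarrow S(x, y, v)$, over an infinite domain. The query, the predicate names P, R, S, and the function name are assumptions chosen for illustration; they are not taken from Appendix C, and the sketch hard-codes the reasoning for this single query rather than implementing the translation.

```python
def evaluate_example(P, R, S):
    """Evaluate P(u) AND FORALL x, y. R(x, y) --> S(x, y, v) with free variables u, v.

    P is a set of values, R a set of pairs, S a set of triples (all finite).
    The domain is assumed to be infinite. Returns ("infinite", None) or
    ("finite", set_of_pairs). Hypothetical illustration only.
    """
    if not R:
        # The universal subquery holds vacuously for every v, so v is
        # unconstrained: the result is infinite as soon as P is nonempty.
        return ("infinite", None) if P else ("finite", set())
    # Otherwise v must occur as third component of S for every pair in R.
    candidates = None
    for (x, y) in R:
        ws = {w for (a, b, w) in S if (a, b) == (x, y)}
        candidates = ws if candidates is None else candidates & ws
    return ("finite", {(u, v) for u in P for v in candidates})
```

For instance, with an empty R and a nonempty P the function reports an infinite result without enumerating it, whereas a nonempty R restricts v to values occurring in S and the result is a finite set of pairs.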
Figure 15 shows the empirical evaluation results for the experiments Small, Medium, Large, and Infinite. All entries are execution times in seconds, TO is a timeout, and RE is a runtime error. In the experiments Small, Medium, and Large, the columns

Figure 16 shows the empirical evaluation results: the time it takes for our translation RC2SQL to translate each query is shown in the first line and the execution times on Data Golf structures (left) and on structures derived from the real-world dataset for two specific product categories (right) are shown in the remaining lines. We remark that VGT cannot handle the query $Q^{susp}_{user}$ as it is not evaluable [GT91], hence we mark the corresponding cells in Figure 16 with −. Our translation RC2SQL significantly outperforms all other tools (except VGT on $Q^{susp}$, where RC2SQL and VGT have similar performance) on both Data Golf and real-world structures (RQ3). $\mathrm{VGT}^-$ translates $Q^{susp}$ into a RANF query with a higher query cost than $\mathrm{RC2SQL}^-$. However, the optimization optcnt(•) manages to rectify this inefficiency (RQ4) and thus VGT exhibits performance comparable to RC2SQL. Specifically, the factor of 80× in query cost between $\mathrm{VGT}^-$ and $\mathrm{RC2SQL}^-$ improves to 1.1× in query cost between VGT and RC2SQL on a Data Golf structure with n = 20 [RBKT22a]. Nevertheless, VGT does not finish evaluating the query $Q^{susp}_{text}$ on the GC and MI datasets within 5 minutes, unlike RC2SQL. Finally, RC2SQL's translation took less than 1 second on all the queries (RQ6).

Conclusion
We presented a translation-based approach to evaluating arbitrary relational calculus queries over an infinite domain with improved time complexity over existing approaches. This contribution is an important milestone towards making the relational calculus a viable query language for practical databases. In future work, we plan to integrate into our base language features that database practitioners love, such as inequalities, bag semantics, and aggregations.
We remark that the rules (R1)-(R3) are not sufficient to yield an equivalent RANF query for the original definition of ENF [GT91]. This issue has been identified and fixed by Escobar-Molano et al. [EHJ93]. Unlike SRNF, a query in ENF can have a subquery of the form $\neg(Q_1 \wedge Q_2)$, but no subquery of the form $\neg Q_1 \vee Q_2$ or $Q_1 \vee \neg Q_2$. A function enf(Q) that yields an ENF query equivalent to Q can be defined in terms of subquery rewriting using the rules in [EHJ93, Figure 2].
Analogously to [EHJ93, Lemma 7.4], if a query Q is safe range, then enf(Q) is also safe range. Next we prove the following lemma that we could use as a precondition for translating safe-range queries in ENF to queries in RANF.
Lemma B.1. Let $Q_{enf}$ be a query in ENF. Then $\mathsf{gen}(x, \neg Q')$ does not hold for any variable x and subquery $\neg Q'$ of $Q_{enf}$.
Proof. Assume that $\mathsf{gen}(x, \neg Q')$ holds for a variable x in a subquery $\neg Q'$ of $Q_{enf}$. We derive a contradiction by induction on $m(Q_{enf})$. According to Figure 4 and by definition of ENF, $\mathsf{gen}(x, \neg Q')$ can only hold if $Q'$ is a conjunction. Then $\mathsf{gen}(x, \neg Q')$ implies $\mathsf{gen}(x, \neg Q_1)$ for some query $Q_1 \in \mathsf{flat}^{\wedge}(Q')$ that is not a negation (by definition of ENF) or conjunction (by definition of $\mathsf{flat}^{\wedge}(\cdot)$), i.e., $Q_1$ is a disjunction (according to Figure 4). Then $\mathsf{gen}(x, \neg Q_1)$ implies $\mathsf{gen}(x, \neg Q_2)$ for some query $Q_2 \in \mathsf{flat}^{\vee}(Q_1)$ that is not a negation (by definition of ENF) or disjunction (by definition of $\mathsf{flat}^{\vee}(\cdot)$), i.e., $Q_2$ is a conjunction (according to Figure 4). Next we observe that $\neg Q_2$ is in ENF because $Q_2$ is a subquery of the ENF query $Q_{enf}$, $Q_2$ is a conjunction, and $Q_2$ is a subquery of a disjunction ($Q_1$) in $Q_{enf}$. Moreover, $m(\neg Q_2) < m(Q_1) < m(Q') < m(Q_{enf})$. This allows us to apply the induction hypothesis to the ENF query $\neg Q_2$ and its subquery $\neg Q_2$ (note that a query is a subquery of itself) and derive that $\mathsf{gen}(x, \neg Q_2)$ does not hold, which is a contradiction.
Although applying the rules (R1)-(R3) to enf(Q) instead of srnf(Q) may result in a RANF query with fewer subqueries, the query cost, i.e., the time complexity of query evaluation, can be arbitrarily larger. We illustrate this in the following example that is also included in our artifact [RBKT22a]. We thus opt for using SRNF instead of ENF for translating safe-range queries into RANF.

Proof.
Recall that |sub( Q)| denotes the number of subqueries of the query Q and thus bounds the number of RANF subqueries Q′ of the query Q. For every subquery Q′ of Q, we first use the fact that |fv( Q′ )| ≤ |av( Q)| to bound | Q′ | • |fv( Q′ )| ≤ | Q′ | • |av( Q)|. Then we use the estimation of | Q′ | by Lemma 5.8.

Figure 12: Translation of a safe-range query in SRNF to RANF.
and $\alpha(c) = |M|$, where $M = \{\vec{d} \in D^{|\vec{v}|} \mid (S, \alpha[\vec{v} \mapsto \vec{d}]) \models Q\}$. We use the condition $M = \emptyset \longrightarrow \mathsf{fv}(Q) \subseteq \vec{v}$ instead of $M \neq \emptyset$ to set c to a zero count if the group M is empty and there are no group-by variables (like in SQL). The set of free variables in a count aggregation is $\mathsf{fv}([\mathrm{CNT}\ \vec{v}.\ Q_{\vec{v}}](c)) = (\mathsf{fv}(Q) \setminus \vec{v}) \cup \{c\}$. Finally, we extend the definition of ranf(Q) with the case of a count aggregation:

Figure 19: Randomly generated queries.
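The count aggregation semantics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the artifact: the function name, the representation of assignments as Python dictionaries, and the restriction that the aggregated variables are a subset of the free variables of Q are all assumptions made here.

```python
from collections import defaultdict


def count_aggregation(sat, fv_q, bound, c="c"):
    """Evaluate a count aggregation over a finite set of satisfying assignments.

    sat   : list of dicts, the satisfying assignments of Q (keys are fv_q)
    fv_q  : list of the free variables of Q
    bound : list of aggregated variables (assumed to be a subset of fv_q)
    c     : name of the variable that receives the count (hypothetical default)
    """
    group_by = [v for v in fv_q if v not in bound]
    groups = defaultdict(set)
    for a in sat:
        key = tuple(a[v] for v in group_by)
        # M collects the distinct values of the aggregated variables per group.
        groups[key].add(tuple(a[v] for v in bound))
    result = [dict(zip(group_by, key), **{c: len(m)}) for key, m in groups.items()]
    # Empty group and no group-by variables: report a zero count, as in SQL.
    if not groups and not group_by:
        result = [{c: 0}]
    return result
```

With group-by variables, an empty input yields no output rows; without them, it yields a single row with count zero, which mirrors the condition on M stated above.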
there exists a value d such that $\alpha[y \mapsto d]$ satisfies $Q_y$, but not $Q_{xy}$, i.e., the number of values d such that $\alpha[y \mapsto d]$ satisfies $Q_y$ is not equal to the number of values d such that $\alpha[y \mapsto d]$ satisfies both $Q_y$ and $Q_{xy}$. An alternative evaluation of $Q'$ evaluates the queries $Q_x$, $Q_y$, $Q_y \wedge Q_{xy}$ and computes the numbers of values d such that $\alpha[y \mapsto d]$ satisfies $Q_y$ and $Q_y \wedge Q_{xy}$, respectively, i.e., computes count aggregations. These count aggregations are then used to filter assignments $\alpha$ satisfying $Q_x$ to get assignments $\alpha$ satisfying $Q'$. The asymptotic time complexity of the alternative evaluation never exceeds that of the evaluation computing the Cartesian product $Q_x \wedge Q_y$ and asymptotically improves it if the number of values d such that $\alpha[y \mapsto d]$ satisfies $Q_y$ is equal to the number of values d such that $\alpha[y \mapsto d]$ satisfies