The Complexity of Aggregates over Extractions by Regular Expressions

Regular expressions with capture variables, also known as regex-formulas, extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).


Introduction
Information extraction commonly refers to the task of extracting structured information from text. A document spanner (or just spanner for short) is an abstraction of an information extraction program: it states how to transform a document into a relation over its spans. More formally, a document is a string d over a finite alphabet, a span of d represents a substring of d by its start and end positions, and a spanner is a function that maps every document d into a relation over the spans of d [FKRV15a]. The spanner framework was originally introduced as the theoretical basis underlying IBM's SQL-like rule system for information extraction, namely SystemT [KLR+08, LRC11]. The most studied spanner instantiation is the class of regular spanners, which is the closure of regex formulas (regular expressions with capture variables) under the standard operations of the relational algebra (projection, natural join, union, and difference). Equivalently, the regular spanners are the ones expressible as variable-set automata (VSet-automata for short), which are nondeterministic finite-state automata that can open and close capture variables. These spanners extract from the text relations wherein the capture variables are the attributes.
While regular spanners and natural generalizations thereof are the basis of rule-based systems for text analytics, they are also used implicitly in other types of systems, and particularly ones based on statistical models and machine learning. Rules similar to regular spanners are used for feature generators of graphical models (e.g., Conditional Random Fields) [LBC04, SM12], weak constraints of Markov Logic Networks [PD07] and extensions such as DeepDive [SWW+15], and the generators of noisy training data ("labeling functions") in the state-of-the-art Snorkel system [RBE+17]. Further connections to regular spanners can potentially arise from efforts to express artificial neural networks for natural language processing as finite-state automata [MY18, MSV+19, WGY18]. The computational complexity of evaluating regular spanners has been well studied from various angles, including the data and combined complexity of answer enumeration [ABMN19, FRU+18, FKP18, MRV18], the cost of combining spanners via relational algebra operators [PFKK19] and recursive programs [PtCFK19], their dynamic complexity [FT20], evaluation in the presence of weighted transitions [DKMP22], and the ability to distribute their evaluation over fragments of the document [DKM+19].
In this article, we study the computational complexity of evaluating aggregate functions over regular spanners. These are queries that map a document d and a spanner S into a number α(S(d)), where S(d) is the relation obtained by applying S to d and α is a standard aggregate function: Count, Sum, Average, Min, Max, or Quantile. There are various scenarios where queries that involve aggregate functions over spanners can be useful. For example, such queries arise in the extraction of statistics from textual resources like medical publications [NKS+19] and news reports [SC09]. As another example, when applying advanced text search or protein/DNA motif matching using regular expressions [CG89, NG94], the search engine typically provides the (exact or approximate) number of answers, and we would like to be able to compute this number without actually computing the answers, especially when the number of answers is prohibitively large. Finally, when programming feature generators or labeling functions in extractor development, the programmer is likely to be interested in aggregate statistics and summaries for the extractions (e.g., to get a holistic view of what is being extracted from the dataset, such as quantiles over extracted ages and so on), and again, we would like to be able to estimate these statistics faster than it takes to materialize the entire set of answers.
Our main objective in this work is to understand when it is tractable to compute α(S(d)). This question raises closely related questions that we will discuss, such as when the materialization of the set of intermediate results S(d) (which can be exponentially large) can be avoided. Furthermore, when the exact computation of α(S(d)) is intractable, we study whether it can be approximated.
At the technical level, each aggregate function (with the exception of Count) requires a specification of how an extracted tuple of spans represents a number. For example, the number 21 can be represented by the span of the string "21", "21.0", "twenty one", "twenty first", "three packs of seven" and so on. To abstract away from specific textual representations of numbers, we consider several means of assigning weights to tuples. To this end, we assume that a (representation of a) weight function w, which maps every tuple of S(d) into a number, is part of the input of the aggregate functions. Hence, the general form of the aggregate query we study is α(S, d, w). The direct approach to evaluating α(S, d, w) is to compute S(d), apply w to each tuple, and apply α to the resulting sequence of numbers. This approach works well if the number of tuples in S(d) is manageable (e.g., bounded by some polynomial). However, the number of tuples in S(d) can be exponential in the number of variables of S, and so, the direct approach takes exponential time in the worst case. We will identify several cases in which S(d) is exponential, yet α(S(d)) can be computed in polynomial time.
It is not very surprising that, at the level of generality we adopt, each of the aggregate functions is intractable (#P-hard) in general. Hence, we focus on several assumptions that can potentially reduce the inherent hardness of evaluation:
• Restricting the range of weight functions to positive numbers;
• Restricting to weight functions that are determined by a single span or defined by (unambiguous) weighted VSet-automata;
• Restricting to spanners that are represented by an unambiguous variant of VSet-automata;
• Allowing for a randomized approximation (FPRAS, i.e., fully polynomial randomized approximation schemes).
Our analysis shows which of these assumptions bring the complexity down to polynomial time, and which are insufficient for tractability. Importantly, we derive an interesting and general tractable case for each of the aggregate functions we study.
To the best of our knowledge, counting the number of tuples extracted by a VSet-automaton (i.e., the Count aggregate function) is the only aggregation function for document spanners that has been studied in the literature, except for Doleschal et al. [DKMP22], who consider a variation of maximum aggregation. (Given a weighted VSet-automaton and a document, they study the computational complexity of returning a tuple with maximal weight.) Concerning counting, Florenzano et al. [FRU+18] study the problem of counting the number of extractions of a VSet-automaton, and its approximation is studied by Arenas et al. [ACJR19]. To be specific, Arenas et al. [ACJR19] give a polynomial-time uniform sampling algorithm from the space of words which are accepted by an NFA and have a given length. Using that sampling, they establish an FPRAS for the Count aggregate function. Our FPRAS results are based on their results. We explain the connection between the known results and our work in more detail throughout the article. Yet, to the best of our knowledge, this work is the first to consider aggregate functions over numerical values extracted by document spanners.
Comparison to the Conference Version. Compared to the conference version of this article [DBKM21], the following aspects are new. We now consider constant-width weight functions, which generalize the single-variable weight functions from [DBKM21]; Section 4.2 is new; and we provide a more detailed complexity overview for regular weight functions over different semirings. Furthermore, proofs that were missing in [DBKM21] are now included. On a technical level, we now use parsimonious reductions and metric reductions instead of Turing reductions for some of the results, which strengthens them.
Organization. This article is organized as follows. In Section 2, we give preliminary definitions and notation. We summarize the main results in Section 3 and expand on these results in the later sections. In Section 4 we give some preliminary results. We describe our investigation for constant-width weight functions, polynomial-time weight functions and regular weight functions in Sections 5, 6 and 7, respectively. Finally, we study approximation in Section 8 and conclude in Section 9.

Figure 1: A document d ("There are 7 events in Belgium, 10-15 in France, 4 in Luxembourg, three in Berlin."), a span relation (bottom left), and the corresponding string relation (bottom right).

A span of a document $d = \sigma_1 \cdots \sigma_n$ is an expression of the form $[i, j\rangle$ with $1 \le i \le j \le n + 1$. For a span $[i, j\rangle$ of d, we denote by $d_{[i,j\rangle}$ the string $\sigma_i \cdots \sigma_{j-1}$. A span $[i, j\rangle$ is empty if $i = j$, which implies that $d_{[i,j\rangle} = \varepsilon$. Two spans $[i_1, j_1\rangle$ and $[i_2, j_2\rangle$ are equal if $i_1 = i_2$ and $j_1 = j_2$. In particular, we observe that two spans do not have to be equal if they select the same string. That is, $d_{[i_1,j_1\rangle} = d_{[i_2,j_2\rangle}$ does not imply that $[i_1, j_1\rangle = [i_2, j_2\rangle$.

K-relations and K-annotators. Let $V \subseteq \mathrm{Vars}$ be a finite set of variables. A V-tuple is a function $t : V \to D$ that assigns values to variables in V. We sometimes leave V implicit when the precise set is not important. For such a tuple t, we denote the set V by Vars(t). We denote the set of all V-tuples by V-Tup. For a subset $X \subseteq \mathrm{Vars}$, we denote the restriction of t to the variables in X by $\pi_X(t)$ or simply $\pi_X t$. We say that a tuple t is empty, denoted by $t = ()$, if $\mathrm{Vars}(t) = \emptyset$.
A K-relation R over V is a function $R : V\text{-Tup} \to K$ such that its support, defined by $\mathrm{Supp}(R) := \{t \mid R(t) \neq 0\}$, is finite. We will also write $t \in R$ to abbreviate $t \in \mathrm{Supp}(R)$. Furthermore, we say that two K-relations $R_1$ and $R_2$ are disjoint if $\mathrm{Supp}(R_1) \cap \mathrm{Supp}(R_2) = \emptyset$. The size of a K-relation R is the size of its support, that is, $|R| := |\mathrm{Supp}(R)|$. The arity of a V-tuple t is the cardinality $|V|$ of V and, similarly, the arity of a K-relation over V is $|V|$.
The framework focuses on functions that extract spans from documents and assign them to variables. Since we will be working with relations over spans, also called span relations, we assume that D is such that $\mathrm{Spans} \subseteq D$. A d-tuple t is a V-tuple which only assigns values from Spans(d), that is, $t(x) \in \mathrm{Spans}(d)$ for every $x \in \mathrm{Vars}(t)$. If the document d is clear from the context, we sometimes say simply tuple instead of d-tuple. We denote by $d_t$ the tuple $(d_{t(x_1)}, \ldots, d_{t(x_n)})$, where $\mathrm{Vars}(t) = \{x_1, \ldots, x_n\}$.
A K-weighted span relation over document d and variables V is a K-relation R wherein every tuple is a d-tuple t with Vars(t) = V . We also denote V by Vars(R). A K-weighted string relation is a K-relation R wherein every tuple t ∈ R assigns strings, that is, t(x) ∈ Σ * for every variable x ∈ Vars(t). Note that we can associate a string relation to every span relation over a document d by replacing every span [i, j⟩ with the string d [i,j⟩ .
Example 2.2. Consider the document in Figure 1. The table on the bottom left depicts a (B-weighted) span relation R, encoding a possible extraction of locations with the corresponding number of events. The table at the bottom right depicts the corresponding string relation.
Definition 2.3. A K-annotator (or annotator for short) is a function S that is associated to a finite set V ⊆ Vars of variables and maps each document d into a K-weighted span relation over V . We denote V by Vars(S). We sometimes also refer to a K-annotator as an annotator over K when we want to emphasize the semiring.
Example 2.4. As an example of a K-weighted annotator, consider again the setting in Example 2.2. A Q-weighted annotator in this setting is the function S that maps each document d to the span relation R in which the tuples are pairs, consisting of a name of a country and a number (or numeric range), and in which the weight associated to each tuple is the smallest value in the numeric range. An example of such a tuple for the document in Figure 1 would be $t_1$ with $t_1(x_{loc}) = [23, 30\rangle$ (the span of "Belgium") and $t_1(x_{events}) = [11, 12\rangle$ (the span of "7"). Another example would be $t_2$ with $t_2(x_{loc}) = [41, 47\rangle$ (the span of "France") and $t_2(x_{events}) = [32, 37\rangle$ (the span of "10-15"). The relation R would assign $R(t_1) = 7$ and $R(t_2) = 10$.
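The span notation of Example 2.4 can be checked mechanically. The following Python sketch is only an illustration: the document string is transcribed from Figure 1, and the names `Span`, `substring`, and `smallest_in_range` are hypothetical helpers rather than part of the formal framework. It uses 1-based, end-exclusive positions, exactly as in the definition of $[i, j\rangle$.

```python
from typing import Dict, NamedTuple


class Span(NamedTuple):
    """A span [i, j> with 1-based start i and exclusive end j."""
    i: int
    j: int


def substring(d: str, s: Span) -> str:
    """Return d_[i,j>, i.e. the symbols sigma_i ... sigma_{j-1} of d."""
    if not (1 <= s.i <= s.j <= len(d) + 1):
        raise ValueError(f"{s} is not a span of a document of length {len(d)}")
    return d[s.i - 1 : s.j - 1]


# Document text transcribed from Figure 1 (an assumption of this sketch).
d = "There are 7 events in Belgium, 10-15 in France, 4 in Luxembourg, three in Berlin."

# The two d-tuples of Example 2.4.
t1: Dict[str, Span] = {"x_loc": Span(23, 30), "x_events": Span(11, 12)}
t2: Dict[str, Span] = {"x_loc": Span(41, 47), "x_events": Span(32, 37)}


def smallest_in_range(text: str) -> int:
    """Hypothetical weight: the smallest value of a range such as '10-15'."""
    return min(int(part) for part in text.split("-"))


def weight(d: str, t: Dict[str, Span]) -> int:
    """Weight of a tuple, determined by the string selected by x_events."""
    return smallest_in_range(substring(d, t["x_events"]))
```

The spans `Span(20, 22)` and `Span(38, 40)` both select the string "in", which illustrates that distinct spans may select equal strings.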
We say that two K-annotators $S_1$ and $S_2$ are disjoint if, for every document $d \in \Sigma^*$, the K-relations $S_1(d)$ and $S_2(d)$ are disjoint. Furthermore, we denote by $S = S'$ the fact that S and $S'$ define the same function.
Notice that B-annotators, i.e., annotators over the Boolean semiring are simply the functional document spanners as defined by Fagin et al. [FKRV15a,FKRV15b]. Throughout this article, we refer to B-annotators as document spanners (also spanner for short).
2.3. Algebraic Operators on K-Relations and K-Annotators. Green et al. [GKT07] defined a set of operators on K-relations that naturally correspond to relational algebra operators and map K-relations to K-relations. As in much of the work on semirings in provenance, they do not consider the difference operator (which would require additive inverses). More precisely, they define the algebraic operators union, projection, and natural join for all finite sets V 1 , V 2 ⊆ Vars and for all K-relations R 1 over V 1 and R 2 over V 2 , as follows.
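• Union: If $V_1 = V_2$, then the union $R := R_1 \cup R_2$ is a function $R : V_1\text{-Tup} \to K$ defined by $R(t) := R_1(t) \oplus R_2(t)$.
• Projection: For $X \subseteq V_1$, the projection $R := \pi_X R_1$ is a function $R : X\text{-Tup} \to K$ defined by $R(t) := \bigoplus_{t' \in \mathrm{Supp}(R_1),\ \pi_X(t') = t} R_1(t')$.
• Natural Join: The natural join $R := R_1 \bowtie R_2$ is a function $R : (V_1 \cup V_2)\text{-Tup} \to K$ defined by $R(t) := R_1(\pi_{V_1}(t)) \otimes R_2(\pi_{V_2}(t))$.

Proposition 2.5 (Green et al. [GKT07]). The above operators preserve the finiteness of the supports. Therefore, they map K-relations into K-relations.

Hence, we obtain an algebra on K-relations. We now lift the relational algebra operators on K-relations to the level of K-annotators. For all documents d and for all annotators $S_1$ and $S_2$ associated with $V_1$ and $V_2$, respectively, we define the following:
• Union: If $V_1 = V_2$, then the union $S := S_1 \cup S_2$ is defined by $S(d) := S_1(d) \cup S_2(d)$.
• Projection: For $X \subseteq V_1$, the projection $S := \pi_X S_1$ is defined by $S(d) := \pi_X(S_1(d))$.
• Natural Join: The natural join $S := S_1 \bowtie S_2$ is defined by $S(d) := S_1(d) \bowtie S_2(d)$.
Due to Proposition 2.5, it follows that the above operators form an algebra on K-annotators.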
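To make the semiring operators concrete, the following Python sketch implements union, projection, and natural join on finite K-relations in the style of Green et al. [GKT07]. It is a minimal illustration under stated assumptions: a K-relation is stored as a dictionary from tuples to nonzero weights, a tuple is a frozenset of (variable, value) pairs, and the names `Semiring`, `make_tuple`, `union`, `project`, and `join` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]  # the operation written as "oplus"
    mul: Callable[[Any, Any], Any]  # the operation written as "otimes"


# Numerical semiring (Q, +, x, 0, 1), here over Python numbers.
NUMERIC = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

Tup = FrozenSet[Tuple[str, Any]]  # a V-tuple as a set of (variable, value) pairs
Rel = Dict[Tup, Any]              # a K-relation, storing only its support


def make_tuple(**assignment: Any) -> Tup:
    return frozenset(assignment.items())


def restrict(t: Tup, X: Iterable[str]) -> Tup:
    """pi_X(t): restriction of t to the variables in X."""
    keep = set(X)
    return frozenset((x, v) for x, v in t if x in keep)


def _support(K: Semiring, R: Rel) -> Rel:
    return {t: w for t, w in R.items() if w != K.zero}


def union(K: Semiring, R1: Rel, R2: Rel) -> Rel:
    out = dict(R1)
    for t, w in R2.items():
        out[t] = K.add(out.get(t, K.zero), w)
    return _support(K, out)


def project(K: Semiring, R: Rel, X: Iterable[str]) -> Rel:
    out: Rel = {}
    for t, w in R.items():
        s = restrict(t, X)
        out[s] = K.add(out.get(s, K.zero), w)  # sum over all t with pi_X(t) = s
    return _support(K, out)


def join(K: Semiring, R1: Rel, R2: Rel) -> Rel:
    out: Rel = {}
    for t1, w1 in R1.items():
        for t2, w2 in R2.items():
            a, b = dict(t1), dict(t2)
            if all(a[x] == b[x] for x in a.keys() & b.keys()):
                out[frozenset({**a, **b}.items())] = K.mul(w1, w2)
    return _support(K, out)


R1 = {make_tuple(x="a", y="b"): 2, make_tuple(x="a", y="c"): 3}
R2 = {make_tuple(y="b", z="d"): 5}
```

In `join`, every result tuple arises from exactly one pair of input tuples (its restrictions to $V_1$ and $V_2$), so no accumulation is needed there, in contrast to `project`.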

2.4. Ref-Words.
We use weighted VSet-automata (or simply VSet-automata for the Boolean semiring) in order to represent K-annotators. Following Freydenberger [Fre19], we introduce so-called ref-words, which connect spanner representations with regular languages. We also introduce unambiguous and functional VSet-automata, which have properties essential to the tractability of some problems we study.
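For a finite set $V \subseteq \mathrm{Vars}$ of variables, ref-words are defined over the extended alphabet $\Sigma \cup \Gamma_V$, where $\Gamma_V := \{\triangleright_x, \triangleleft_x \mid x \in V\}$. We assume that $\Gamma_V$ is disjoint from $\Sigma$ and Vars. Ref-words extend strings over $\Sigma$ by encoding opening ($\triangleright_x$) and closing ($\triangleleft_x$) of variables.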
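A ref-word $r \in (\Sigma \cup \Gamma_V)^*$ is valid if every occurring variable is opened and closed exactly once. More formally, for each $x \in V$, the string r has precisely one occurrence of $\triangleright_x$ and precisely one occurrence of $\triangleleft_x$, which is after the occurrence of $\triangleright_x$. For every valid ref-word r over $(\Sigma \cup \Gamma_V)$, we define Vars(r) as the set of variables $x \in V$ which occur in the ref-word. More formally,
$$\mathrm{Vars}(r) := \{x \in V \mid \exists\, r_{pre}, r_x, r_{post} \in (\Sigma \cup \Gamma_V)^* \text{ such that } r = r_{pre} \cdot \triangleright_x \cdot r_x \cdot \triangleleft_x \cdot r_{post}\}.$$
Intuitively, each valid ref-word r encodes a d-tuple for some document d, where the document is given by the symbols from $\Sigma$ in r and the variable markers encode where the spans begin and end. Formally, we define functions doc and tup that, given a valid ref-word, output the corresponding document and tuple.² The morphism $\mathrm{doc} : (\Sigma \cup \Gamma_V)^* \to \Sigma^*$ is defined on single symbols as
$$\mathrm{doc}(\sigma) := \begin{cases} \sigma & \text{if } \sigma \in \Sigma, \\ \varepsilon & \text{if } \sigma \in \Gamma_V, \end{cases}$$
and we define $\mathrm{doc}(\sigma_1 \cdots \sigma_n) := \mathrm{doc}(\sigma_1) \cdots \mathrm{doc}(\sigma_n)$. We now define the function tup. By definition, every valid ref-word r over $(\Sigma \cup \Gamma_V)$ has a unique factorization
$$r = r_x^{pre} \cdot \triangleright_x \cdot r_x \cdot \triangleleft_x \cdot r_x^{post}$$
for each $x \in \mathrm{Vars}(r)$. We then define the function tup as
$$\mathrm{tup}(r) := \{x \mapsto [i_x, j_x\rangle \mid x \in \mathrm{Vars}(r)\}, \quad \text{where } i_x := |\mathrm{doc}(r_x^{pre})| + 1 \text{ and } j_x := i_x + |\mathrm{doc}(r_x)|.$$
The usage of the doc morphism in the above definition ensures that the indices $i_x$ and $j_x$ refer to positions in the document and do not consider other variable operations.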
A ref-word language R is a language of ref-words. We say that R is functional if every ref-word r ∈ R is valid and there is a set V of variables such that Vars(r) = V for each r ∈ R.
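The functions doc and tup admit a direct one-pass implementation. The sketch below is a hedged illustration, not the article's formalization: a ref-word is encoded as a Python list whose entries are either single-character strings (symbols of Σ) or pairs `("open", x)` and `("close", x)` standing for the opening and closing markers of variable x. This encoding and the function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple, Union

Marker = Tuple[str, str]        # ("open", x) or ("close", x)
Symbol = Union[str, Marker]     # a letter of Sigma or a variable marker
RefWord = List[Symbol]


def is_valid(r: RefWord) -> bool:
    """Every occurring variable is opened once and then closed once."""
    opened, closed = set(), set()
    for s in r:
        if isinstance(s, str):
            continue
        kind, x = s
        if kind == "open":
            if x in opened:
                return False
            opened.add(x)
        else:
            if x not in opened or x in closed:
                return False
            closed.add(x)
    return opened == closed


def doc(r: RefWord) -> str:
    """Erase all variable markers and keep the letters of Sigma."""
    return "".join(s for s in r if isinstance(s, str))


def tup(r: RefWord) -> Dict[str, Tuple[int, int]]:
    """Map each variable x to its span (i_x, j_x), 1-based and end-exclusive."""
    if not is_valid(r):
        raise ValueError("tup is only defined on valid ref-words")
    position = 1                      # number of letters read so far, plus one
    start: Dict[str, int] = {}
    spans: Dict[str, Tuple[int, int]] = {}
    for s in r:
        if isinstance(s, str):
            position += 1
        elif s[0] == "open":
            start[s[1]] = position    # i_x = |doc(r_pre)| + 1
        else:
            spans[s[1]] = (start[s[1]], position)  # j_x = i_x + |doc(r_x)|
    return spans


r: RefWord = ["a", "b", ("open", "x"), "c", "d", ("close", "x"), "e"]
```

Because `position` only advances on letters, markers of other variables that occur between the two markers of x do not affect the computed span, which mirrors the role of doc in the definition of tup.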
1 Here, ∪ stands for the union of two K-relations as was defined previously. The same is valid also for the other operators.
2 The function doc is sometimes also called clr in literature (cf. Freydenberger et al. [FKP18]).
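Given a functional ref-word language R, the spanner $\llbracket R \rrbracket$ represented by R is given by
$$\llbracket R \rrbracket(d) := \{\mathrm{tup}(r) \mid r \in R \text{ and } \mathrm{doc}(r) = d\}.$$

The Variable Order Condition and the ref Function.

Observation 2.6. Let r be a valid ref-word and let $r' := \mathrm{ref}(\mathrm{doc}(r), \mathrm{tup}(r))$. Then $\mathrm{tup}(r) = \mathrm{tup}(r')$. Furthermore, $r = r'$ if and only if r satisfies the variable order condition.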
Analogously to functionality, we say that a ref-word language R satisfies the variable order condition if every ref-word r ∈ R satisfies the variable order condition.
2.5. (Weighted) Variable Set-Automata. In this section, we revisit the definition of weighted VSet-automaton as a formalism to represent K-annotators [DKMP22]. This formalism is a natural generalization of both VSet-automata and weighted automata [DKV09]. Throughout the article, we will use weighted VSet-automata for two purposes: we use the VSet-automata over the Boolean semiring B for extracting spans from documents (as in the usual document spanner framework [FKRV13]) and the more general K-weighted VSet-automata as one formalism for weight functions. (We discuss all considered variants for weight functions in Section 3.3.)

Let $V \subseteq \mathrm{Vars}$ be a finite set of variables. A weighted variable-set automaton over semiring K (alternatively, a weighted VSet-automaton or a K-weighted VSet-automaton) is a tuple $A := (\Sigma, V, Q, I, F, \delta)$ where $\Sigma$ is a finite alphabet; $V \subseteq \mathrm{Vars}$ is a finite set of variables; Q is a finite set of states; $I : Q \to K$ is the initial weight function; $F : Q \to K$ is the final weight function; and $\delta : Q \times (\Sigma \cup \{\varepsilon\} \cup \Gamma_V) \times Q \to K$ is a (K-weighted) transition function. We define the transitions of A as the set of triples $(p, o, q)$ with $\delta(p, o, q) \neq 0$. Likewise, the initial (resp., accepting) states are those states q with $I(q) \neq 0$ (resp., $F(q) \neq 0$). For every semiring element $a \in K$, we denote the length of the encoding of a by $\|a\|$. The size of a weighted VSet-automaton A is defined by
$$|A| := |Q| + \sum_{q \in Q} \|I(q)\| + \sum_{q \in Q} \|F(q)\| + \sum_{p, q \in Q,\ o \in \Sigma \cup \{\varepsilon\} \cup \Gamma_V} \|\delta(p, o, q)\|.$$

A run $\rho$ of A on a ref-word $r = \sigma_1 \cdots \sigma_m$, with $\sigma_i \in \Sigma \cup \{\varepsilon\} \cup \Gamma_V$ for every i, is a sequence $q_0 \xrightarrow{\sigma_1} q_1 \xrightarrow{\sigma_2} \cdots \xrightarrow{\sigma_m} q_m$ of states such that
• $I(q_0) \neq 0$ and $F(q_m) \neq 0$; and
• $\delta(q_i, \sigma_{i+1}, q_{i+1}) \neq 0$ for all $0 \le i < m$.
We say that a run $\rho$ is on a document d if $\rho$ is a run on r and $\mathrm{doc}(r) = d$. Furthermore, overloading notation, given a run $\rho$ of A on r, we denote r by $\mathrm{ref}(\rho)$. We define the ref-word language R(A) as the set of all ref-words r such that A has a run on r.

The weight of a run is obtained by ⊗-multiplying the weights of its constituent transitions. Formally, the weight $w_\rho$ of $\rho$ is an element in K given by the expression
$$w_\rho := I(q_0) \otimes \delta(q_0, \sigma_1, q_1) \otimes \cdots \otimes \delta(q_{m-1}, \sigma_m, q_m) \otimes F(q_m).$$
We say that $\rho$ is nonzero if $w_\rho \neq 0$ and valid if $\mathrm{ref}(\rho)$ is valid. If $\rho$ is valid, we denote the tuple $\mathrm{tup}(\mathrm{ref}(\rho))$ by $\mathrm{tup}(\rho)$.
We say that a weighted VSet-automaton A is functional if every run of A is valid. We denote the set of all valid and nonzero runs of A on d by $P(A, d)$. Notice that there may be infinitely many valid and nonzero runs of a weighted VSet-automaton on a given document, due to ε-cycles, which are states $q_1, \ldots, q_k$ such that $(q_i, \varepsilon, q_{i+1})$ is a transition for every $i \in \{1, \ldots, k-1\}$ and $q_1 = q_k$. Following Doleschal et al. [DKMP22], we assume that weighted VSet-automata do not have ε-cycles, unless mentioned otherwise.
As such, if A does not have ε-cycles, then the result of applying A on a document d, denoted $\llbracket A \rrbracket_K(d)$, is the K-relation R for which
$$R(t) := \bigoplus_{\rho \in P(A, d),\ \mathrm{tup}(\rho) = t} w_\rho.$$
Note that $P(A, d)$ only contains runs $\rho$ that are valid and nonzero. If t is a $V'$-tuple with $V' \neq V$, then $R(t) = 0$, because we only consider valid runs. In addition, $\llbracket A \rrbracket_K$ is a well-defined K-annotator since every V-tuple in the support of $\llbracket A \rrbracket_K(d)$ is a V-tuple over Spans(d).
To simplify notation, we sometimes denote $\llbracket A \rrbracket_K(d)(t)$, the weight assigned to the d-tuple t by A, by $\llbracket A \rrbracket_K(d, t)$. We say that two K-weighted VSet-automata $A_1$ and $A_2$ are disjoint if $R(A_1) \cap R(A_2) = \emptyset$. This implies that the corresponding K-annotators $\llbracket A_1 \rrbracket_K$ and $\llbracket A_2 \rrbracket_K$ are also disjoint.
We say that a K-annotator (resp., document spanner) S is regular if there exists a weighted VSet-automaton (resp., B-weighted VSet-automaton) A such that $S = \llbracket A \rrbracket_K$. Note that this is an equality between functions. If K is clear from the context, we may just write $\llbracket A \rrbracket$ instead of $\llbracket A \rrbracket_K$.
We say that two weighted VSet-automata A and $A'$ are equivalent if they define the same K-annotator, that is, $\llbracket A \rrbracket_K = \llbracket A' \rrbracket_K$. Similar to our terminology on B-annotators, we refer to (functional) B-weighted VSet-automata as (functional) VSet-automata. Since VSet-automata can always be translated into equivalent functional VSet-automata [Fre19, Proposition 3.9], we assume in this article that VSet-automata are functional. This is a common assumption for document spanners involving regular languages [FKRV15a, Fre19, PFKK19]. Furthermore, we assume that all weighted VSet-automata are functional as well. In the following, we denote by $\mathrm{Reg}_K$ the class of all functional K-weighted VSet-automata and by VSA the class of all functional VSet-automata.
Due to the close relationship between regular expressions and B-weighted automata, and since regular expressions are easy to read, we sometimes define B-weighted VSet-automata using regular expressions over Σ ∪ Γ V . Here, we use · to denote concatenation, ∨ to denote disjunction, and * to denote Kleene star. As usual, we often omit · and use priority rules ( * before · before ∨) for improving the readability of expressions.
Unambiguous (weighted) VSet-Automata. We now discuss unambiguity for (weighted) VSet-automata. A (weighted) VSet-automaton A is unambiguous if it satisfies the following two conditions.
(C1) R(A) satisfies the variable order condition;
(C2) for every $r \in R(A)$, there is exactly one run of A on r.
We note that for Boolean spanners, i.e., spanners with no variables, the definition coincides with the classical definition of unambiguity for finite-state automata. That is, a VSet-automaton A with $\mathrm{Vars}(A) = \emptyset$ is unambiguous if it is an unambiguous finite-state automaton. Furthermore, we note that every VSet-automaton A can be transformed into an equivalent unambiguous VSet-automaton $A'$ (see, e.g., Doleschal et al. [DKM+21, Lemma 4.5]). However, VSet-automata can be exponentially more succinct than equivalent unambiguous VSet-automata.⁴

Example 2.7. The span relation on the bottom right of Figure 1 can be extracted from d by a spanner that matches textual representations of numbers (or ranges) in the variable $x_{events}$, followed by a city or country name, matched in $x_{loc}$. Figure 2 shows what two such VSet-automata may look like. Note that some strings, like Luxembourg, are the name of a city as well as a country. Thus, the upper automaton is ambiguous, because the tuple with Luxembourg is captured twice (thus violating (C2)). The lower automaton is unambiguous, because the sub-automaton for Loc only matches such names once.
In the following, we denote by UReg K the class of K-weighted unambiguous functional VSet-automata and by uVSA the class of unambiguous functional VSet-automata.
2.6. Aggregate Queries. Aggregation functions, such as min, max, and sum, operate on numerical values from database tuples, whereas all the values of d-tuples are spans. Yet, these spans may represent numerical values from the document d, encoded by the captured words (e.g., "3," "three," "March" and so on). To connect spans to numerical values, we will use weight functions w that map every document d and d-tuple t to a weight $w(d, t) \in \mathbb{Q} \cup \{\infty\}$. In the definition of weight functions, we allow the range to include ∞, since we will use subsets of Q and the tropical semiring T, the latter of which contains ∞. We discuss weight functions in more detail in Section 3.3.

Figure 2: Two example VSet-automata that extract the span relation R on input d as defined in Figure 1. For the sake of presentation, the automata are simplified as follows: Num is a sub-automaton matching anything representing a number (of events) or range, Gap is a sub-automaton matching sequences of at most three words, City and Country are sub-automata matching city and country names, respectively. Loc is a sub-automaton for the union of City and Country. All these sub-automata are assumed to be unambiguous.
Figure 3: A document d (beginning "There are 7 events in Belgium, 10-15 in …"), a span relation R (bottom left), a weighted string relation W (bottom middle), and the string relation annotated with the resulting weights (bottom right).

Example 2.9. Consider the document in Figure 3 and assume that we want to calculate the total number of mentioned events. The relation R at the bottom left depicts a possible extraction of locations with their number of events. The table in the bottom middle depicts a weighted string relation W (where the weight of each string is in the rightmost column). The relation on the bottom right depicts the string relation where each tuple is annotated with a weight corresponding to W, R, and d. To get an understanding of the total number of events, we may want to take the sum over the weights of the extracted tuples, namely 7 + 10 + 4 + 3 = 24.
For a spanner S, a document d, and weight function w, we denote by Img(S, d, w) the set of weights of output tuples of S on d, that is, Img(S, d, w) = {w(d, t) | t ∈ S(d)}. Furthermore, let Img(w) ⊆ Q be the set of weights assigned by w, that is, k ∈ Img(w) if and only if there is a document d and a d-tuple t with w(d, t) = k.
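Definition 2.10. Let d be a document and A be a VSet-automaton such that $\llbracket A \rrbracket(d) \neq \emptyset$. Let $S = \llbracket A \rrbracket$, let w be a weight function, and let $q \in \mathbb{Q}$ with $0 \le q \le 1$. We define the following spanner aggregation functions:
• $\mathrm{Count}(S, d) := |S(d)|$;
• $\mathrm{Min}(S, d, w) := \min_{t \in S(d)} w(d, t)$;
• $\mathrm{Max}(S, d, w) := \max_{t \in S(d)} w(d, t)$;
• $\mathrm{Sum}(S, d, w) := \sum_{t \in S(d)} w(d, t)$;
• $\mathrm{Average}(S, d, w) := \mathrm{Sum}(S, d, w) / \mathrm{Count}(S, d)$;
• $q\text{-Quantile}(S, d, w) := \min \bigl\{ r \in \mathrm{Img}(S, d, w) \bigm| |\{t \in S(d) \mid w(d, t) \le r\}| / |S(d)| \ge q \bigr\}$.
We observe that $\mathrm{Min}(S, d, w) = 0\text{-Quantile}(S, d, w)$ and $\mathrm{Max}(S, d, w) = 1\text{-Quantile}(S, d, w)$.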
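The direct approach of materializing S(d) and then aggregating can be sketched in a few lines of Python. The code below is an illustration under explicit assumptions: the input `ws` is the list of weights w(d, t), one entry per tuple t of a nonempty S(d), so equal weights of distinct tuples appear repeatedly; the function names are hypothetical; and exact rational arithmetic via `fractions.Fraction` is a choice of this sketch. The sample weights are those of Example 2.9.

```python
from fractions import Fraction
from typing import List, Union

Number = Union[int, Fraction]


def count(ws: List[Number]) -> int:
    return len(ws)


def total(ws: List[Number]) -> Number:
    return sum(ws)


def average(ws: List[Number]) -> Fraction:
    # Exact rational arithmetic; ws must be nonempty, as in Definition 2.10.
    return Fraction(sum(ws)) / len(ws)


def quantile(ws: List[Number], q: Number) -> Number:
    """Smallest weight r such that at least a q-fraction of tuples has weight <= r."""
    n = len(ws)
    return min(
        r for r in set(ws)
        if Fraction(sum(1 for w in ws if w <= r), n) >= q
    )


# Weights of the four tuples in Example 2.9.
ws = [7, 10, 4, 3]
```

For these weights, `quantile(ws, 0)` and `quantile(ws, 1)` coincide with the minimum and maximum, matching the observation that Min and Max are the 0- and 1-quantiles. The running time is linear or quadratic in |S(d)|, which is exactly what becomes prohibitive when S(d) is exponentially large.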

2.7. Main Problems. Let 𝒮 be a class of regular document spanners and 𝒲 be a class of weight functions. We define the following problems.

Sum[𝒮, 𝒲]
Input: Spanner S ∈ 𝒮, document d ∈ Σ*, and weight function w ∈ 𝒲.
Task: Compute Sum(S, d, w).

The problems Count[𝒮], Average[𝒮, 𝒲], q-Quantile[𝒮, 𝒲], Min[𝒮, 𝒲], and Max[𝒮, 𝒲] are defined analogously. Notice that all these problems study combined complexity. Since the number of tuples in S(d) is always in $O(|d|^{2k})$, where k is the number of variables of the spanner S (cf. Corollary 4.6), the data complexity of all the problems is in FP: One can just materialize S(d) and apply the necessary aggregate. Under combined complexity, we will therefore need to find ways to avoid materializing S(d) to achieve tractability.
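The polynomial bound on the number of tuples for a fixed number of variables can be traced back to a simple count of spans. The following derivation is a sketch of that counting argument, written with $n := |d|$ and k the number of variables; it is not a substitute for the corollary cited above.

```latex
% A span [i, j> of d is a pair with 1 <= i <= j <= n + 1, so
\[
  |\mathrm{Spans}(d)| \;=\; \binom{n+1}{2} + (n+1) \;=\; \frac{(n+1)(n+2)}{2}.
\]
% Every d-tuple over k variables chooses one span per variable, hence
\[
  |S(d)| \;\le\; \left( \frac{(n+1)(n+2)}{2} \right)^{k}
         \;\le\; (3n^2)^{k} \;=\; 3^{k}\, n^{2k}
  \qquad \text{for } n \ge 1 .
\]
```

For fixed k this is polynomial in $|d|$, whereas under combined complexity k is part of the input and the bound is exponential.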
2.8. Algorithms and Complexity Classes. Before we discuss our main results in Section 3, we provide a few definitions on computational complexity.
We first define fully polynomial-time randomized approximation schemes (FPRAS).
Definition 2.11. Let f be a function that maps inputs x to rational numbers and let A be a probabilistic algorithm, which takes an input instance x and a parameter δ > 0. Then A is called a fully polynomial-time randomized approximation scheme (FPRAS) for f if
• $\Pr\bigl[\, |A(x, \delta) - f(x)| \le \delta \cdot |f(x)| \,\bigr] \ge \tfrac{3}{4}$ for every input x and every δ > 0; and
• the running time of A is polynomial in the size of x and in 1/δ.

The following definitions closely follow the Handbook of Theoretical Computer Science [vL91]. The class FP (respectively, FEXPTIME) is the set of all functions that are computable in polynomial time (resp., in exponential time). A counting Turing Machine is a nondeterministic Turing Machine whose output for a given input is the number of accepting computations for that input. Given functions $f, g : \Sigma^* \to \mathbb{N}$, f is said to be parsimoniously reducible to g in polynomial time if there is a function $h : \Sigma^* \to \Sigma^*$, which is computable in polynomial time, such that for every $x \in \Sigma^*$ it holds that $f(x) = g(h(x))$. Furthermore, we say that f is Turing reducible to g in polynomial time if f can be computed by a polynomial-time Turing Machine M which has access to an oracle for g.
The class #P is the set of all functions that are computable by polynomial-time counting Turing Machines. A problem X is #P-hard under parsimonious reductions (resp., Turing reductions) if there are polynomial time parsimonious reductions (resp., Turing reductions) to it from all problems in #P. If in addition X ∈ #P, we say that X is #P-complete under parsimonious reductions (resp., Turing reductions).
The class $\mathrm{FP}^{\#\mathrm{P}}$ is the set of all functions that are computable in polynomial time by an oracle Turing Machine with a #P oracle. It is easy to see that, under Turing reductions, a problem is hard for the class #P if and only if it is hard for $\mathrm{FP}^{\#\mathrm{P}}$. We note that every problem which is #P-hard under parsimonious reductions is also #P-hard under Turing reductions. Therefore, unless mentioned otherwise, we always use parsimonious reductions.
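The class spanL is the class of all functions $f : \Sigma^* \to \mathbb{N}$ for which there is a nondeterministic logarithmic-space Turing Machine M with an output tape such that, for every $x \in \Sigma^*$, f(x) is the number of distinct outputs produced by the accepting computations of M on x. The class OptP is the set of all functions computable by taking the maximum output value over all accepting computations of a polynomial-time nondeterministic Turing Machine that outputs natural numbers.

Assume that Γ is the Turing Machine alphabet. Let $f, g : \Gamma^* \to \mathbb{N}$ be functions. A metric reduction, as introduced by Krentel [Kre88], from f to g is a pair of polynomial-time computable functions $T_1, T_2$, where $T_1 : \Gamma^* \to \Gamma^*$ and $T_2 : \Gamma^* \times \mathbb{N} \to \mathbb{N}$, such that $f(x) = T_2(x, g(T_1(x)))$ for all $x \in \Gamma^*$.

The class BPP is the set of all decision problems solvable in polynomial time by a probabilistic Turing Machine in which the answer always has probability at least $\tfrac{1}{2} + \delta$ of being correct for some fixed δ > 0.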

Main Results
In this section we present the main results of this article.
3.1. Known Results. We begin by giving an overview of the results on Count, which are known from the literature.
Table 1: Detailed overview of complexities of aggregate problems for document spanners. All problems are in FEXPTIME. The "no FPRAS" claims either assume that RP ≠ NP or assume that the polynomial hierarchy does not collapse. The #P-hardness results marked with † rely on Turing reductions. The numbers refer to the numbers of new results.
The spanL lower bound by Florenzano et al. [FRU+18, Theorem 5.2] is due to a parsimonious reduction from the #NFA(n)-problem,⁵ which is known to be #P-complete under Turing reductions (cf. Kannan et al. [KSM95]). As every parsimonious reduction is also a Turing reduction, the following corollary follows immediately.
Two observations can be made from these results. First, Count requires the input spanner to be unambiguous for tractability. This tractability implies that Count can be computed without materializing the possibly exponentially large set S(d) if the spanner is unambiguous. Furthermore, if the spanner is not unambiguous then, due to spanL-completeness of Count, we do not know an efficient algorithm for its exact computation (and therefore may have to materialize S(d)), but Count can be approximated by an FPRAS. We will explore to which extent this picture generalizes to other aggregates.

5 Given an NFA A and a natural number n, encoded in binary, the #NFA(n) problem asks for the number of words w ∈ L(A) of length n. The #NFA(n) problem is sometimes also called Census Problem.

3.2. Overview of New Results. The complexity results are summarized in Table 1. By now the reader is familiar with the aggregate problems and the types of spanners we study. We obtain different results for different representations of weight functions, which we denote here as CWidth, Poly, and Reg (resp., UReg) and define formally in Section 3.3. Intuitively, CWidth are constant-width weight functions that assign values based on strings selected by a constant number of variables; Poly are polynomial-time computable weight functions, and Reg (resp., UReg) are weight functions represented by weighted (resp., unambiguous weighted) VSet-automata. Furthermore, we sometimes restrict these classes based on their range. For instance, $\mathrm{CWidth}_{\mathbb{N}}$ and $\mathrm{CWidth}_{\mathbb{Q}^+}$ are the constant-width weight functions that map to natural numbers and positive rational numbers, respectively. Entries in the table should be read from left to right. For instance, the third row states that the Min problem, for both spanner classes uVSA and VSA, and for all three classes CWidth, $\mathrm{UReg}_{\mathbb{T}}$, and $\mathrm{Reg}_{\mathbb{T}}$ of weight functions is in FP.
Likewise, the fourth row states that the same problems with Reg Q or Poly weight functions become OptP-hard and that the existence of an FPRAS would contradict commonly believed conjectures.
In general, the table gives a detailed overview of the impact of (1) unambiguity of spanners and (2) different weight function representations on the complexity of computing aggregates.

Results for Different Weight Functions.
We formalize how we represent the weight functions for our new results. Recall that weight functions w map pairs consisting of a document d and d-tuple t to values in Q ∪ {∞}.
3.3.1. Constant-Width Weight Functions. The simplest type of weight functions we consider are the constant-width weight functions. 6 Let 1 ≤ c ∈ N be a constant. A constant-width weight function (CWidth) w assigns values based on the strings selected by at most c variables. A constant-width weight function is given in the input as a Q-weighted string relation, i.e., a string relation R over the numerical semiring Q = (Q, +, ×, 0, 1) and the variables X, where X ⊆ Vars is a set of at most c variables. Recall that d t denotes the tuple of strings selected by t, that is, the tuple that maps every variable x of t to the string d t(x). To facilitate presentation, we assume that the variables in X are always present in t, that is, X ⊆ Vars(t). The weight function w(d, t) is defined as the weight that R assigns to the restriction of d t to the variables in X.

Not surprisingly, there are multiple drawbacks of having arbitrary polynomial-time weight functions. The first is that all considered aggregates become intractable, even if we only consider unambiguous VSet-automata (Theorems 6.1 and 6.2). However, all aggregates can at least be computed in exponential time (Theorem 6.3).
3.3.3. Regular Weight Functions. As the class of polynomial-time weight functions quickly leads to intractability, we next introduce a restricted class Reg, which is more general than CWidth but less general than Poly, and whose representation has enough structure to admit efficient algorithms. 7 Our final classes of weight functions are based on K-annotators. More precisely, we consider weighted VSet-automata and unambiguous weighted VSet-automata over the tropical semiring T = (Q ∪ {∞}, min, +, ∞, 0) and the numerical semiring Q = (Q, +, ×, 0, 1). 8 Formally, let Reg := Reg T ∪ Reg Q be the class of all annotators over the tropical or numerical semiring. A regular (Reg) weight function w is represented by a weighted VSet-automaton W and defines w(d, t) = W (d, π Vars(W ) (t)). Furthermore, as for constant-width weight functions, we assume that the variables used by W are always present in t, that is, Vars(W ) ⊆ Vars(t). The set of unambiguous regular (UReg) weight functions is the subset of Reg that is represented by unambiguous weighted VSet-automata.

Example 3.3. Figure 4 gives an unambiguous weighted VSet-automaton over the tropical semiring that extracts the values of three-digit natural numbers from text. It can easily be extended to extract natural numbers of up to a constant number of digits by adding nondeterminism. Likewise, it is possible to extend it to extract weights as in Example 2.9. If a single variable captures a list of numbers, similar to d [32,37⟩ = 10−15, one may use ambiguity to extract the minimal number represented in this range.

7 We prove in Section 4.2 that CWidth ⊆ Reg ⊆ Poly; also see Figure 5.

8 One can also consider the tropical semiring with max/plus, in which case the complexity results are analogous to the ones we have for the tropical semiring with min/plus, with Min and Max interchanged.
Our results for regular and unambiguous regular weight functions are similar to CWidth when it comes to Min, Max, Sum, and Average. The main difference is that, depending on the semiring, we require more unambiguity. For instance, for the tropical semiring, one needs unambiguity of the regular weight function for Max. For Sum and Average, one needs unambiguity of both the spanner and the regular weight function to achieve tractability. In contrast, over the numerical semiring, one needs unambiguity of the regular weight function for Min and Max, whereas for Sum and Average unambiguity of the spanner is sufficient for tractability. For q-Quantile, the situation differs from CWidth in the sense that regular weight functions render the problem intractable. We refer to Table 1 for an overview.
3.4. Approximation. In the cases where exact computation of the aggregate problem is intractable, we consider the question of approximation. It turns out that there exist FPRASs in two settings that we believe to be interesting. First, in the case of Sum and Average and constant-width weight functions, the restriction of unambiguity in the spanner can be dropped if the weight function uses only nonnegative weights. Second, although q-Quantile is #P-hard under Turing reductions for general VSA, it is possible to positionally approximate the quantile element in an FPRAS-like fashion, even with the very general polynomial-time weight functions. We discuss this problem in more detail in Section 8.

Preliminary Results
In this section, we give basic results for document spanners and weight functions that we use throughout this article.

4.1. Known Results on K-Annotators. We begin by recalling some known results on K-annotators.

Theorem 4.2. Let A 1 , A 2 ∈ Reg K be K-weighted functional VSet-automata and X ⊆ Vars(A 1 ). Then, A π , A ∪ , A ▷◁ ∈ Reg K can be constructed in polynomial time, such that

Relative Expressiveness of Weight Functions.
We first show that every constant-width weight function is also an unambiguous regular weight function.
Proof. Let w ∈ CWidth be a constant-width weight function, represented by a Q-weighted string relation R over X, that is, tuples in R map variables to strings. We begin by showing that w ∈ UReg Q . Let X = {x 1 , . . . , x n }. We construct a Q-annotator W representing w. We define an unambiguous VSet-automaton A t , for every tuple t ∈ R, For every x ∈ X, let w x be the word t(x) and let in variable x and outputs the corresponding {x}-tuple with the span. Since our definition of unambiguity requires one run per ref-word in the language, it is easy to see that such an unambiguous A x t exists. Furthermore, We define W t as the unambiguous Q-weighted VSet-automaton such that This can be achieved by interpreting A t as a Q-weighted VSet-automaton, where all edges have weight 1, the final weight function assigns weight 1 to all accepting states, and the initial weight function assigns weight R(t) to the initial state of A t . We finally define W as the union of all W t . That is, We observe that, by Theorem 4.2, W must be unambiguous, as all W t are unambiguous and the ref-word languages of the automata W t are pairwise disjoint.
The proof for CWidth ⊆ UReg T follows the same lines. However, the zero element of the tropical semiring is ∞, which implies that the automaton W must have exactly one run ρ for every tuple t, even if w(d, t) = 0. To this end, let W t be as defined before, but interpreted over the tropical semiring. We construct an unambiguous T-weighted VSet-automaton ∈ R and W R has no run for t otherwise. We observe that R is a recognizable string relation. 9 Therefore, due to Doleschal et al. [DKMP22, Theorem 6 Let W R be A R , interpreted as a T-weighted VSet-automaton, that is, each transition, initial and final state gets weight 1 = 0 (the one element of the tropical semiring is the number 0). Note that, due to Again, we observe that, by Theorem 4.2, W must be unambiguous as all involved automata are unambiguous and their ref-word languages are pairwise disjoint. Furthermore,

9 A k-ary string relation is recognizable if it is a finite union of Cartesian products L 1 × · · · × L k , where each L i is a regular language. Note that R is recognizable as it is the union over all tuples t ∈ R, where each tuple is represented by the Cartesian product {t(x 1 )} × · · · × {t(x n )} with Vars(t) = {x 1 , . . . , x n }.

Figure 5: Inclusion structure of our considered weight functions.

We now observe that every regular weight function is a polynomial-time weight function. Indeed, given a document d and a d-tuple t, the weight w(d, t) for a regular weight function w can be computed in polynomial time (cf. Doleschal [Dol21, Theorem 5.6.1]). To summarize, we provide the inclusion structure of the classes of weight functions we consider in Figure 5. All inclusions that do not have a number hold by definition.

4.3. Preliminary Results on Document Spanners. We will also need some preliminary results concerning the number of possible spans over a document d.

Proof. For a span [i, j⟩, let ℓ = j − i be the length of the span. It is easy to see that for every document d, there is exactly one span of length |d|, two spans of length |d| − 1, three spans of length |d| − 2, etc. Thus, there are 1 + 2 + · · · + (|d| + 1) = (|d| + 1) · (|d| + 2) / 2 spans over a document d, concluding the proof.
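The counting argument for spans can be checked by brute force. The following is a minimal sketch, assuming the usual convention that a span [i, j⟩ satisfies 1 ≤ i ≤ j ≤ |d| + 1; the function names are illustrative and not part of the article.

```python
def spans(document: str) -> list[tuple[int, int]]:
    """Enumerate all spans [i, j> with 1 <= i <= j <= len(document) + 1."""
    n = len(document)
    return [(i, j) for i in range(1, n + 2) for j in range(i, n + 2)]


def span_count_formula(document: str) -> int:
    """Closed form 1 + 2 + ... + (|d| + 1) = (|d| + 1)(|d| + 2) / 2."""
    n = len(document)
    return (n + 1) * (n + 2) // 2


def spans_by_length(document: str) -> dict[int, int]:
    """Number of spans of each length; length l occurs |d| + 1 - l times."""
    counts: dict[int, int] = {}
    for i, j in spans(document):
        counts[j - i] = counts.get(j - i, 0) + 1
    return counts
```

For a document of length 3, the sketch yields one span of length 3, two of length 2, three of length 1, and four empty spans, which sum to 10.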
It follows directly that the maximal number of tuples extracted by a document spanner is exponential in the size of the spanner. As we see next, given a set of variables, a document d, and a number k of tuples, we can construct an unambiguous VSet-automaton A and a document d ′ such that A extracts exactly k tuples on d ′ .

Lemma 4.7. Let X := {x 1 , . . . , x v } ⊆ Vars be a set of variables, d ∈ Σ * be a document, and 0 ≤ k ≤ |Spans(d)|^|X| . Then there is a VSet-automaton A ∈ uVSA with Vars(A) = X and a document d ′ ∈ Σ * such that | A (d ′ )| = k. Furthermore, A and d ′ can be constructed in time polynomial in |X| and |d|.
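For a single variable, the statement of Lemma 4.7 can be illustrated without automata: since a document d has exactly |d| + 1 − ℓ spans of length ℓ, selecting all spans of a few well-chosen lengths yields exactly k spans. The sketch below is only an illustration under that assumption; the greedy decomposition and all function names are hypothetical choices, not taken from the article.

```python
def distinct_summands(k: int, m: int) -> list[int]:
    """Write k as a sum of distinct integers from 1..m (greedy, largest first)."""
    if not 0 <= k <= m * (m + 1) // 2:
        raise ValueError("k must lie between 0 and 1 + 2 + ... + m")
    parts = []
    for candidate in range(m, 0, -1):
        if k >= candidate:
            parts.append(candidate)
            k -= candidate
    return parts


def spans_of_length(n: int, length: int) -> list[tuple[int, int]]:
    """All spans [i, i + length> over a document of length n."""
    return [(i, i + length) for i in range(1, n - length + 2)]


def select_k_spans(document: str, k: int) -> list[tuple[int, int]]:
    """Pick exactly k spans by taking all spans of selected lengths."""
    n = len(document)
    chosen = []
    for k_i in distinct_summands(k, n + 1):
        # There are exactly k_i spans of length n + 1 - k_i.
        chosen.extend(spans_of_length(n, n + 1 - k_i))
    return chosen
```

Because the chosen lengths are pairwise different, the selected groups of spans are disjoint, which mirrors the role that disjoint branches play for unambiguity.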
Proof. We observe that the statement holds for k = 0. Therefore we assume, w.l.o.g., that k ≥ 1. We begin by proving the statement for |X| = 1. Let 1 ≤ k ≤ |Spans(d)|. Recalling the proof of Lemma 4.5, we observe that k can be written as a sum k = k 1 + · · · + k n of n ≤ |d| + 1 different natural numbers with 0 ≤ k 1 < · · · < k n ≤ |d| + 1. We construct an automaton A k ∈ uVSA, which consists of n branches, corresponding to k 1 , . . . , k n . On document d, the branch corresponding to k i selects all spans of length ℓ i := |d| + 1 − k i . Each of these branches can be constructed as an unambiguous VSet-automaton A k i := Σ * · ▷ x Σ^(ℓ i) ◁ x · Σ * . We observe that there are exactly k i spans over d with length ℓ i , and therefore | A k i (d)| = k i . The automaton A k is defined as the union of the automata A k 1 , . . . , A k n . It is straightforward to verify that all automata A k i are unambiguous. Thus, since the ref-word languages of all A k i are pairwise disjoint, it holds that A k ∈ uVSA (cf. Theorem 4.2). Furthermore, we observe that | A k (d)| = k 1 + · · · + k n = k.

It remains to show the statement for v := |X| > 1. Let # ∉ Σ be a new alphabet symbol. We build upon the encoding for |X| = 1. That is, for every 1 ≤ k ≤ |Spans(d)|, let A x k be the automaton A k , using variable x, as defined previously. We observe that every 1 ≤ k ≤ |Spans(d)|^v has an encoding k = k 1 · · · k v in base |Spans(d)| of length v. The document d ′ consists of v copies of d · #, more formally, d ′ := (d · #)^v . For every 1 ≤ i ≤ v, we construct an automaton A ′ k i , which selects exactly k i · |Spans(d)|^(v−i) tuples over document d ′ . The automaton A ′ k is then defined as the union of all A ′ k i . We observe that A ′ k i ∈ uVSA and, due to the ref-word languages of all A ′ k i being pairwise disjoint, A ′ k ∈ uVSA (cf. Theorem 4.2). Furthermore, we observe that | A ′ k (d ′ )| is the sum of k i · |Spans(d)|^(v−i) over all 1 ≤ i ≤ v, which is k. This concludes the proof.

Proof. Let A ∈ VSA, d ∈ Σ * , X ⊆ Vars(A) with |X| ≤ c, and w ∈ CWidth be given as a Q-weighted string relation R over X.
We first show that the set {π X t | t ∈ A (d)} can be computed in time polynomial in the sizes of A and d.
We observe that, per definition of projection for document spanners (Section 2.3), {π X t | t ∈ A (d)} = π X ( A ) (d). Since A is functional (which we assume for VSet-automata throughout this article), a VSet-automaton for π X ( A ) can be computed in polynomial time (cf. Freydenberger et al. [FKP18, Lemma 3.8]). Due to |X| ≤ c, it follows from Corollary 4.6 that there are at most polynomially many tuples in π X ( A ) (d). Thus, the set {π X t | t ∈ A (d)} can be materialized in polynomial time.
In order to compute Min and Max, a polynomial time algorithm can iterate over all tuples t in {π X t | t ∈ A (d)}, evaluate R(d, t) and maintain the minimum and the maximum of these numbers.
In order to calculate aggregates like Sum, Avg, or q-Quantile, it is not sufficient to know which weights are assigned, but also the multiplicity of each weight is necessary. Recall that counting the number of output tuples is tractable if the VSet-automaton is unambiguous (Theorem 3.1) and spanL-complete in general. We now show that we can achieve tractability of the mentioned aggregate problems if the VSet-automaton is unambiguous. The reason is that we can compute in polynomial time the multiset S A,d := ⦃π X t | t ∈ A (d)⦄, where we represent the multiplicity of each tuple t ′ (i.e., the number of tuples t ∈ A (d) such that π X t = t ′ ) in binary.
Lemma 5.2. Given a VSet-automaton A and a document d, the multiset S A,d can be computed in FP if A ∈ uVSA.
Proof. The procedure is given as Algorithm 1. It is straightforward to verify that the algorithm is correct. Due to Corollary 4.6, the set π X ( A )(d) is at most of polynomial size. Furthermore, the automaton A ref(d,t) := ref(d, t) ∈ uVSA can be constructed in polynomial time and due to Theorem 4.2 an unambiguous VSet-automaton for A t can be computed in polynomial time as well. By Theorem 3.1, each iteration of the for-loop also only requires polynomial time. Thus, the whole algorithm terminates after polynomially many steps.
It follows that all remaining aggregate functions can be efficiently computed if the spanner is given as an unambiguous VSet-automaton.

Proof. Let A ∈ uVSA be a VSet-automaton, d ∈ Σ * be a document, and w ∈ CWidth be a weight function, represented by a Q-weighted string relation R over X. Due to Lemma 5.2, the multiset S A,d can be computed in polynomial time. Thus one can compute the multiset W of the weights that R assigns to the tuples in S A,d , again with multiplicities represented in binary. It is straightforward to compute the aggregates in polynomial time from W .
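The last step of this argument can be sketched as follows: once the weights and their multiplicities are known, the aggregates follow from one pass over the (polynomially many) distinct weights, even when the multiplicities themselves are exponentially large numbers. This is a minimal sketch; the dictionary representation is an assumption, and the quantile convention used here (the smallest weight r such that at least a q fraction of the tuples have weight at most r) is likewise an assumption that should be checked against the article's definition.

```python
from fractions import Fraction


def aggregates_from_multiset(weight_counts: dict[Fraction, int], q: Fraction) -> dict:
    """Compute Sum, Average and q-Quantile from a map weight -> multiplicity."""
    total = sum(weight_counts.values())
    if total == 0:
        raise ValueError("aggregates are undefined on an empty multiset")
    weighted_sum = sum(w * m for w, m in weight_counts.items())
    quantile = None
    running = 0
    for w in sorted(weight_counts):
        running += weight_counts[w]
        # Stop at the first weight whose cumulative share reaches q.
        if running >= q * total:
            quantile = w
            break
    return {
        "sum": weighted_sum,
        "average": Fraction(weighted_sum) / total,
        "quantile": quantile,
    }
```

Python integers have arbitrary precision, so multiplicities such as 2**100 are handled exactly, which loosely mirrors the binary representation of multiplicities.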
We conclude this section by showing that Sum, Avg, and q-Quantile are not tractable, if the spanner is given as a VSet-automaton.
Proof. We will give a reduction from the #CNF problem, which is #P-complete under parsimonious reductions. To this end, let ϕ be a Boolean formula in CNF over variables x 1 , . . . , x n and let w ∈ CWidth be the weight function which is represented by the Q-Relation R, which is as defined in the theorem statement. We construct a VSet-automaton A ∈ VSA and a document d := a n · − · 1, such that Sum( A , d, w) = c, where c is the number of variable assignments which satisfy ϕ.
The automaton A 1 selects exactly 2 n tuples on document d, all of which get assigned weight 1 by w. More formally (using ∨ to denote regular expression disjunction), We use a similar encoding as Doleschal et al. [DKMP22,Theorem 5.4] to encode variable assignments into tuples. That is, each variable x i of ϕ is associated with a corresponding capture variable x i of A −1 . With each assignment τ we associate the tuple t τ , such that We construct the automaton A −1 as a regex formula α, such that there is a one-to-one correspondence between the non-satisfying assignments for ϕ and tuples in α (d). More formally, for each clause C j of ϕ and each variable x i , we construct a regex-formula Consequently, we define α j := α 1,j · · · α n,j · ▷ x − 1◁ x .
For example, if we use variables x 1 , x 2 , x 3 , x 4 and C j = x 1 ∨ x 3 ∨ ¬x 4 is a clause, then We observe that t ∈ α j (d) if and only if the variable assignment τ of ϕ with t = t τ does not satisfy clause C j .
We finally define α := α 1 ∨ · · · ∨ α m , that is, the disjunction of all α j , and A −1 as the VSet-automaton corresponding to α. 10 Therefore, Count( A −1 , d) = s, where s = 2^n − c is the number of variable assignments which do not satisfy ϕ. Furthermore, per definition of A −1 and w, it follows that Sum( A −1 , d, w) = −s. We finally define the VSet-automaton A as the union of A 1 and A −1 . We observe that every tuple t ∈ A (d) is either selected by A 1 (if d t(x) = 1) or by A −1 (if d t(x) = −1), but never by both automata. Recall that c is the number of assignments which satisfy ϕ and s = 2^n − c is the number of non-satisfying assignments of ϕ. Therefore, we have that Sum( A , d, w) = 2^n − s = c. This concludes the proof.
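The arithmetic of the reduction can be replayed by brute force on small formulas. The sketch below does not build any automaton; it only simulates the two groups of tuples (2^n tuples of weight 1, and one tuple of weight −1 per non-satisfying assignment) and compares the resulting sum with the number of models. The clause encoding (a list of signed integers, where i stands for x i and −i for ¬x i) and the function names are assumptions made for this illustration.

```python
from itertools import product


def falsifying_assignments(clause: list[int], n: int) -> set[tuple[bool, ...]]:
    """Assignments over x_1..x_n that make every literal of the clause false."""
    return {
        a
        for a in product([False, True], repeat=n)
        if all(a[abs(lit) - 1] != (lit > 0) for lit in clause)
    }


def simulated_sum(cnf: list[list[int]], n: int) -> int:
    """Sum of weights: 2^n tuples of weight 1 plus weight -1 per non-model."""
    non_models: set[tuple[bool, ...]] = set()
    for clause in cnf:
        # The disjunction over clauses is a union of sets, so an assignment
        # that falsifies several clauses still contributes a single tuple.
        non_models |= falsifying_assignments(clause, n)
    return 2**n - len(non_models)


def count_models(cnf: list[list[int]], n: int) -> int:
    """Reference count of satisfying assignments by exhaustive search."""
    return sum(
        all(any(a[abs(lit) - 1] == (lit > 0) for lit in clause) for clause in cnf)
        for a in product([False, True], repeat=n)
    )
```

For the clause x 1 ∨ x 3 ∨ ¬x 4 over four variables, two assignments falsify the clause, so both functions return 14.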
If the weights are restricted to natural numbers, Sum becomes spanL-complete. Note that we restrict weight functions to natural numbers because spanL is a class of functions that return natural numbers. Allowing positive rational numbers does not fundamentally change the complexity of the problems though. We will see in Section 8 that this enables us to approximate Sum aggregates.

For the lower bound, we give a reduction from Count[VSA], which is spanL-complete (cf. Theorem 3.1). Let A ∈ VSA, d ∈ Σ * . We assume, w.l.o.g., that 1 ∉ Σ and x ∉ Vars(A). We construct a document d ′ := d · 1 and a VSet-automaton A ′ := A · ▷ x 1 ◁ x . We observe that Sum( A ′ , d ′ , w) = Count( A , d), concluding the proof.
Solving the equation for Count( A , d), we have that This concludes the proof that Average[VSA, CWidth Q+] is #P-hard under Turing reductions.

It remains to show that q-Quantile[VSA, CWidth] is also #P-hard under Turing reductions. Let A ∈ VSA be a VSet-automaton and d ∈ Σ * be a document. We will show the lower bound for q = 1/2 first and study the general case of 0 < q < 1 afterwards. Let x ∉ Vars(A) be a new variable. Let 0 ≤ r ≤ |Spans(d)|^|Vars(A)| . By Lemma 4.7 there is a VSet-automaton A ′ and a document d ′ with Count( A ′ , d ′ ) = r. Recalling the definition of w, it holds for every tuple t ∈ A r that w(d r , t) = 1 if t was selected by A ′ , and w(d r , t) = 0 otherwise, i.e., if t was selected by A. Therefore, 1/2-Quantile( A r , d r , w) = 0 if and only if Count( A , d) ≥ Count( A ′ , d ′ ) = r. Let r max be the biggest r such that 1/2-Quantile( A r , d r , w) = 0. Using binary search, we can calculate r max with a polynomial number of calls to a 1/2-Quantile oracle. Furthermore, due to Count( A , d) ∈ N and r max being maximal, it must hold that Count( A , d) = r max , concluding this part of the proof.
The general case of 0 < q < 1 follows by slightly adapting the above reduction. Let q = a/b with a, b ∈ N be given by its numerator and denominator. Observe that b > a as 0 < a/b < 1. Let A ′ , d ′ be as above and let c := Count( A , d). The document d r consists of a copies of d, separated by 0's, and (b − a) copies of d ′ , separated by 1's. Thus, a/b-Quantile( A r , d r , w) = 1. Recall that c = Count( A , d). As for q = 1/2, let r max be the biggest r such that a/b-Quantile( A r , d r , w) = 0. Using binary search, we can calculate r max with a polynomial number of calls to an a/b-Quantile oracle. Again it holds that Count( A , d) = r max , concluding the proof.
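The binary search in the case q = 1/2 can be sketched in isolation. In the sketch below the quantile oracle is simulated directly on a multiset that contains one 0 per tuple counted by the hidden value and one 1 per padding tuple; the real reduction would instead call a 1/2-Quantile solver on the constructed automaton and document. The function names and the median convention (the smallest value whose cumulative share is at least 1/2) are assumptions of this illustration.

```python
from typing import Callable


def median_of_zeros_and_ones(zeros: int, ones: int) -> int:
    """1/2-quantile of a multiset with the given numbers of 0s and 1s."""
    total = zeros + ones
    if total == 0:
        raise ValueError("the median of an empty multiset is undefined")
    # The value 0 is the median exactly when at least half the entries are 0.
    return 0 if 2 * zeros >= total else 1


def recover_count(oracle: Callable[[int], int], upper: int) -> int:
    """Largest r in [0, upper] with oracle(r) == 0, found by binary search.

    oracle(r) is expected to return 0 if and only if the hidden count is >= r.
    """
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2  # mid >= 1, so the oracle never sees r = 0
        if oracle(mid) == 0:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

The number of oracle calls is logarithmic in `upper`; since the upper bound in the reduction is exponential in the input size, this gives polynomially many calls.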

Polynomial-Time Weight Functions
Before we study regular weight functions, we make a few observations on the very general polynomial-time computable weight functions. For weight functions w ∈ Poly, we assume that w is represented as a Turing Machine A that returns a value A(d, t) in polynomially many steps for some fixed polynomial of choice (e.g., n 2 ). 11 Furthermore, to avoid complexity due to the need to verify whether A is indeed a valid input (i.e., timely termination), we will assume that w(d, t) = 0, if A does not produce a value within the allocated time.
We first observe that polynomial-time weight functions make all our aggregation problems intractable, which is not surprising. In fact, all the lower bounds already hold for regular weight functions.

Proof. We will see later that these problems are already hard for weight functions in Reg, which are a subclass of Poly (Theorems 7.3 and 7.7).
Proof. We will see later that the problem is already hard for UReg weight functions (Theorem 7.9).
We note that all studied problems can be solved in exponential time, by first constructing the relation A (d), which might be of exponential size, computing the weights associated to all tuples, and finally computing the desired aggregate.

11 Our complexity results are independent of the choice of this polynomial.

Proof. Let A ∈ VSA, d ∈ Σ * , and w ∈ Poly. The algorithm first computes the multiset W A,d,w of the weights w(d, t) of all tuples t ∈ A (d), which might be exponentially large. It is easy to see that W A,d,w can be computed in exponential time. Furthermore, it follows directly that Agg[VSA, Poly] is in FEXPTIME for every Agg ∈ {Min, Max, Sum, Average, q-Quantile}.
Throughout this section, we do not study in depth whether we can give a more precise upper bound than the general FEXPTIME upper bound. However, we sometimes give such bounds. For instance, we are able to provide OptP and FP #P upper bounds if the weight functions return natural numbers (or integers in the case of the FP #P upper bounds).

Proof. We only give the upper bound for Max. The proof for Min is analogous. To this end, let A ∈ VSA, d ∈ Σ * , and w ∈ Poly be a weight function which only assigns natural numbers. The Turing Machine N guesses a d-tuple t and accepts with output 0 if t ∉ A (d). Otherwise, N computes the weight w(d, t) and accepts with output w(d, t). It is easy to see that the maximum output value of N is exactly Max( A , d, w).
In the following theorem we show that Sum, Average, and q-Quantile can be computed in FP #P if all weights are integers. The key idea is that, due to the restriction to integer weights, we can compute the aggregates by multiple calls to a #P oracle. For instance for Sum, we define two weight functions, w + and w − , such that w + computes the sum of all positive and w − the sum of all negative weights. Each of these sums can be computed by a single call to a #P oracle.

Proof. We first prove that Sum[VSA, Poly] is in #P if the weight function only assigns natural numbers. We will use this as an oracle for the general upper bound. Let A be a VSet-automaton, d ∈ Σ * be a document, and w ∈ Poly be a weight function that only assigns natural numbers. A counting Turing Machine M for solving the problem in #P would have w(d, t) accepting runs for every tuple in A (d). More precisely, M guesses a d-tuple t over Vars(A) and checks whether t ∈ A (d). If t ∈ A (d) and w(d, t) > 0, then M branches into w(d, t) accepting branches, which it can do because w is given in the input as a polynomial-time deterministic Turing Machine. Otherwise, M rejects. Per construction, M has exactly w(d, t) accepting branches for every tuple t ∈ A (d) with w(d, t) > 0. Thus, the number of accepting runs is exactly the sum of w(d, t) over all t ∈ A (d), which is Sum( A , d, w).
We now continue by showing that Sum[VSA, Poly] is in FP #P if the weight function only assigns integers. Let A be a VSet-automaton, d ∈ Σ * be a document, and w ∈ Poly be a weight function, which only assigns integers.
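We define two weight functions w + , w − ∈ Poly, such that Sum( A , d, w) = Sum( A , d, w + ) − Sum( A , d, w − ). Formally, we define the following two weight functions: w + (d, t) := w(d, t) if w(d, t) > 0 and w + (d, t) := 0 otherwise; w − (d, t) := −w(d, t) if w(d, t) < 0 and w − (d, t) := 0 otherwise. Both functions only assign natural numbers, so each of the two sums can be computed by a single call to the #P oracle from the first part of the proof.

Recall that Average( A , d, w) = Sum( A , d, w) / Count( A , d), so the upper bound of Average[VSA, Poly] follows from the upper bound of Sum[VSA, Poly] and Theorem 3.1. The upper bound of q-Quantile[VSA, Poly] can be obtained by performing binary search, using the upper bound of Sum[VSA, Poly] and Theorem 3.1.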
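The splitting of integer weights into two natural-number weight functions can be sketched on explicit lists of weights. Here `natural_sum` is a hypothetical stand-in for the #P oracle and simply adds up nonnegative integers; the exact definitions of the positive and negative parts are an assumption based on the description of w + and w − given in this section.

```python
from typing import Callable, Sequence


def split_weights(weights: Sequence[int]) -> tuple[list[int], list[int]]:
    """Split integer weights into positive parts and negated negative parts."""
    positive = [w if w > 0 else 0 for w in weights]
    negative = [-w if w < 0 else 0 for w in weights]
    return positive, negative


def natural_sum(weights: Sequence[int]) -> int:
    """Stand-in for the oracle: only accepts natural-number weights."""
    if any(w < 0 for w in weights):
        raise ValueError("the oracle is only defined for natural numbers")
    return sum(weights)


def sum_via_two_oracle_calls(
    weights: Sequence[int],
    oracle: Callable[[Sequence[int]], int] = natural_sum,
) -> int:
    """Recover the integer sum from two calls on natural-number weights."""
    positive, negative = split_weights(weights)
    return oracle(positive) - oracle(negative)
```

For the weights 3, −2, 0, 5, −7 the two oracle calls return 8 and 9, and their difference is the sum −1.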

Regular Weight Functions
We now turn to Reg and UReg weight functions. As we have shown in Proposition 4.3, every CWidth weight function can be translated into an equivalent UReg weight function. Furthermore, the weight functions which were used for the lower bounds can be represented by unambiguous weighted VSet-automata of constant size. Therefore, all lower bounds for CWidth also hold for UReg.
7.1. Compact DAG Representation. As we show next, aggregation problems for regular weight functions can often be reduced to problems about paths on weighted directed acyclic graphs (DAGs), where the weights come from the semiring of the weight function. To this end, let (K, ⊕, ⊗, 0, 1) be a semiring. A K-weighted DAG is a DAG D = (N, E), where N is a set of nodes, E ⊆ N × K × N is a finite set of weighted edges, and src (resp., snk) is a unique node in N without incoming (resp., outgoing) edges. We define len(e) = ℓ, where e = (v, ℓ, v ′ ) ∈ E. Furthermore, we define paths p in the obvious manner as sequences of edges and the length len(p) of p as the product (⊗) of the lengths of its edges. More formally, a path p := n 1 ℓ 1 n 2 · · · ℓ j−1 n j is a sequence of nodes n i ∈ N with 1 ≤ i ≤ j and (n i , ℓ i , n i+1 ) ∈ E, for all 1 ≤ i < j, and its length is len(p) := ℓ 1 ⊗ · · · ⊗ ℓ j−1 . We denote the set of all paths in D from src to snk by Paths(src, snk).

Given a document d, a VSet-automaton A, and a regular weight function w ∈ Reg K , we will construct a DAG D which plays the role of a compact representation of the materialized intermediate result. The DAG D is obtained by a product construction between A, W , and d, such that every path from src to snk corresponds to an accepting run of W that represents a tuple in A (d). If A and W are unambiguous, this correspondence is actually a bijection.
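To make the notion of path length over a semiring concrete, the following sketch combines (with ⊕) the lengths of all paths from src to snk, where each length is the ⊗-product of the edge lengths, using one pass in topological order. It is an illustration of the definitions only, not the construction of D; the `Semiring` class, the edge-list representation, and the function name are assumptions.

```python
import math
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]   # the operation ⊕
    mul: Callable[[Any, Any], Any]   # the operation ⊗
    zero: Any                        # neutral element of ⊕
    one: Any                         # neutral element of ⊗


TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)
NUMERICAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)


def total_path_length(edges, src, snk, K: Semiring):
    """Combine with ⊕ the ⊗-lengths of all paths from src to snk in a DAG.

    edges is a list of triples (node, length, node).
    """
    preds: dict = {}
    nodes = {src, snk}
    for u, length, v in edges:
        preds.setdefault(v, []).append((u, length))
        nodes.update((u, v))
    graph = {v: {u for u, _ in preds.get(v, [])} for v in nodes}
    value = {}
    # static_order lists every node after all of its predecessors.
    for v in TopologicalSorter(graph).static_order():
        acc = K.one if v == src else K.zero
        for u, length in preds.get(v, []):
            acc = K.add(acc, K.mul(value[u], length))
        value[v] = acc
    return value[snk]
```

Over the tropical semiring the result is the length of a shortest path; over the numerical semiring it is the sum over all paths of the products of their edge lengths.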
Lemma 7.1. Let K ∈ {Q, T} be either the numerical or the tropical semiring. Let d be a document, A ∈ VSA, and W be the weighted VSet-automaton representing w ∈ Reg K . We can compute, in polynomial time, a K-weighted DAG D, such that there is a surjective mapping m from paths p ∈ Paths(src, snk) in D to tuples t ∈ A (d). Furthermore, (1) the mapping m is a bijection, if A and W are unambiguous, and Proof. Let d ∈ Σ * , A ∈ VSA, and W be the weighted VSet-automaton representing w ∈ Reg K . By Proposition 4.1, we can assume, w.l.o.g., that all VSet-automata used in this proof do not contain ε-transitions.
We begin by giving the construction of D. Let W A be the weighted VSet-automaton obtained by interpreting A as a K-weighted VSet-automaton. More formally, every transition in A is interpreted as a weighted transition with weight 1 and every transition which is not in A is interpreted as a transition with weight 0. Furthermore, let W d := d be the weighted VSet-automaton with Vars(W d ) = ∅ that assigns the weight 1 to the empty tuple on input d and 0 to every tuple on input d ′ ̸ = d. By Theorem 4.2 the join of weighted VSet-automata can be computed in polynomial time. Let Per definition of join for K-relations, it holds that Let A ∈ uVSA be unambiguous or K = T. In both cases, it holds that Furthermore, Therefore, if A ∈ uVSA or K = T, it holds, for every tuple t ∈ A (d). that We will use this equality in the proof of condition (2). and q the state. The set of edges is defined as follows: In the following we assume that D is trimmed, that is, for every node n ∈ N D there is at least one path from src to snk, which visits n. 12 We observe that the construction of D only requires polynomial time. Note that there is a one-to-one correspondence between paths p ∈ Paths(src, snk) and accepting runs of W D on d. That is, p = src · ℓ 0 · (q 0 , ∅) · ℓ 1 · (q 1 , σ 1 ) · · · (q n , σ n ) · ℓ n+1 · snk is a path from src to snk in D if and only if with I(q 0 ) = ℓ 0 and F (q n ) = ℓ n+1 is an accepting run of W D on d. Furthermore, we observe that the weight of p is exactly the weight assigned to the run ρ by W D , that is, len(p) = w ρ .
For the sake of contradiction, assume that D is cyclic. Per assumption, all nodes n ∈ N are on a path from src to snk, thus, D must have a path p from src to snk, which contains a cycle. Let ρ be the run of W D corresponding to p. The automaton W d is acyclic. Observe that W D is functional as W , W A , and W d are functional. Thus, ref(ρ) is valid and therefore the cycle can not contain an edge labeled by a variable operation. Per assumption, all involved VSet-automata do not contain ε-transitions. Therefore, the cycle must only consist of edges, labeled by alphabet symbols. Let ρ ′ be the run, obtained from ρ by removing all cycles. Due to commutativity of ⊗, it follows that w ρ ′ = w ρ ⊗ x for some x ̸ = 0. We observe that doc(ref(ρ ′ )) ̸ = d. Therefore, there is a run ρ ′ of W D on doc(ref(ρ ′ )) ̸ = d with weight w ρ ′ ̸ = 0, which is the desired contradiction to the observation that for all runs ρ of W D it holds that w ρ ̸ = 0 if and only if doc(ref(ρ)) = d.
We now define the mapping m. Let p ∈ Paths(src, snk) and let ρ be the corresponding run of W D . We define the mapping m(p) := tup(ρ). It follows directly that m is surjective. If A ∈ uVSA or K = T and for t ∈ A (d), we have that It remains to show that condition (1) holds. Assume that A ∈ uVSA and W are unambiguous. Then, by Theorem 4.2, W D is unambiguous. 13 Assume that there are two paths p 1 ≠ p 2 such that p 1 , p 2 ∈ Paths(src, snk) with m(p 1 ) = m(p 2 ). Let ρ 1 ≠ ρ 2 be the corresponding runs of W D . Due to m(p) = tup(ρ), it must hold that ρ 1 and ρ 2 are two runs

12 Note that this condition can be enforced in linear time by two graph traversals (e.g., using breadth-first search), one starting from src to identify all states which can be reached from src and one starting from snk to identify all states which can reach snk. We remove all states which are not marked by both graph traversals.

13 Recall that W d is unambiguous.

Proof. Let d be a document, A ∈ VSA, and W be the weighted VSet-automaton representing w ∈ Reg T or w ∈ UReg Q . Let D and m be the DAG and the surjective mapping as guaranteed by Lemma 7.1. In the following, we will reduce all four cases to finding the path with minimal (resp., maximal) length in D. Note that given a weighted DAG D, one can compute the path with minimal (resp., maximal) length in polynomial time, via dynamic programming, e.g., using the Bellman-Ford algorithm. 14

We begin by giving the proofs for the numerical semiring. If A ∈ uVSA and W ∈ UReg Q , it follows directly from property (1) of Lemma 7.1 that m is a bijection. Therefore, for every tuple t ∈ A (d), there is exactly one path p ∈ Paths(src, snk) with m(p) = t. Thus, w(d, t) = len(p), where p ∈ Paths(src, snk) with m(p) = t. It follows directly that Min( A , d, w) and Max( A , d, w) can be computed from D by searching for the path p with minimal (respectively maximal) length.
It remains to give the proofs for the tropical semiring. We begin by giving the proof for However, if W is unambiguous, it must hold that len(p) = len(p ′ ) for all runs p, p ′ ∈ Paths(src, snk) with m(p) = m(p ′ ). Otherwise W would be required to have at least two runs which accept the same tuple but assign different weights. Thus, W would not be unambiguous. We can therefore conclude that, Again, we can reduce Max[VSA, UReg T ] to the max length problem on D.
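For the numerical semiring, lengths are multiplied along a path, so an edge with negative length can turn the smallest product seen so far into the largest one. A dynamic program therefore has to track both extremes at every node. The sketch below illustrates this idea on an explicit edge list; it is one possible realization under that assumption, and the function name and graph representation are hypothetical.

```python
from graphlib import TopologicalSorter


def extreme_products(edges, src, snk):
    """Return (min, max) of the edge-length products over all src-to-snk paths.

    edges is a list of triples (node, length, node) forming a DAG.
    Returns None if snk is not reachable from src.
    """
    preds: dict = {}
    nodes = {src, snk}
    for u, length, v in edges:
        preds.setdefault(v, []).append((u, length))
        nodes.update((u, v))
    graph = {v: {u for u, _ in preds.get(v, [])} for v in nodes}
    best: dict = {}
    for v in TopologicalSorter(graph).static_order():
        if v == src:
            best[v] = (1, 1)  # the empty path has product 1
            continue
        candidates = []
        for u, length in preds.get(v, []):
            if best[u] is not None:
                lo, hi = best[u]
                # A negative length swaps the roles of lo and hi,
                # so both products are candidates for both extremes.
                candidates += [lo * length, hi * length]
        best[v] = (min(candidates), max(candidates)) if candidates else None
    return best[snk]
```

With two parallel edges of lengths −2 and 3 followed by an edge of length −5, the two paths have products 10 and −15, so the sketch returns (−15, 10).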
14 One has to be careful in the case of the numerical semiring, as the lengths along the path are multiplied. Therefore one has to maintain the minimal as well as the maximal length between two nodes, as edges with negative length change the sign, resulting in minimal paths becoming maximal and vice versa.

As we show now, the results of Theorem 7.2 are close to the tractability frontier: for instance, if we relax the unambiguity condition in the weight function, the problem Max does not correspond to finding the longest paths in DAGs and becomes intractable.

Proof. We begin by giving the proofs for Max[uVSA, Reg T ]. We give a metric reduction 15 from the OptP-complete problem Maximum Satisfying Assignment (MSA) [Kre88], which is defined as follows. Let ϕ(x 1 , . . . , x n ) be a propositional formula in CNF and let v = v 1 · · · v n ∈ B n be a variable assignment of ϕ. Furthermore, let n v ∈ N be the natural number encoded by v in binary. MSA asks, given the CNF formula ϕ(x 1 , . . . , x n ), for the maximum n v ∈ N such that v satisfies ϕ, or 0 if ϕ is not satisfiable. In the following, we denote by MSA(ϕ) the output of MSA on input ϕ.
Let ϕ(x 1 , . . . , x n ) be a Boolean formula in CNF. We use a similar construction as in the proofs of Theorem 5.4 and Doleschal et al. [DKMP22, Theorem 7.6] to encode the CNF formula ϕ. Let d = a^n be the document. We define Notice that A can be defined with a polynomial-time constructible uVSA. Observe that there is a one-to-one correspondence between tuples t in A (d) and variable assignments α t for ϕ: we can set α t (x i ) = 1 if and only if t(x i ) = [i, i + 1⟩. We construct a weight function w ∈ Reg T such that w(d, t) = n αt if α t |= ϕ, and w(d, t) = 0 otherwise. Recall that n αt is the natural number which is encoded by the variable assignment α t . It follows directly that MSA(ϕ) = Max( A , d, w). Defining T 2 (x, y) → y gives the desired reduction.
It remains to construct a weighted VSet-automaton $W$ which encodes $w$. We define the weighted VSet-automaton $W$ as the union of two automata. Let $V$ be the set of variables of $\varphi$. The first automaton, $W_A$, is a copy of $A$ that assigns weight $0$ to all edges of $A$, with one exception: $\delta$ assigns weight $2^{i-1}$ to the $a$-labeled edge between opening and closing variable $x_i$ (that is, between $\triangleright x_i$ and $\triangleleft x_i$). Let $I(q) = 0$ if $q$ is the start state of $A$ and $I(q) = \infty$ otherwise. Analogously, let $F(q) = 0$ if $q$ is an accepting state of $A$ and $F(q) = \infty$ otherwise. It follows directly that $\llbracket W_A \rrbracket_K(a^n, t) = n_{\alpha_t}$.
The second automaton, $W'$, consists of $m$ disjoint branches, where each branch corresponds to a clause $C_i$ of $\varphi$; we call these clause branches. Each branch has exactly one run $\rho$ with weight $\bar{1}$ for each tuple $t$ associated to an assignment $\alpha_t$ which does not satisfy the clause $C_i$.
We now give a formal construction of $W'$. The set of states $Q := \{q^a_{i,j} \mid 1 \le i \le m, 1 \le j \le n, 1 \le a \le 5\}$ contains $5n$ states for each clause branch. Intuitively, $W'$ has a gadget, consisting of 5 states, for each variable and each clause branch. Figure 6 depicts the three types of gadgets we use here. Note that the weights of the drawn edges are all 0. We use the left gadget if $x$ does not occur in the relevant clause and the middle (resp., right) gadget if the literal $\neg x$ (resp., $x$) occurs. Furthermore, within the same branch of $W'$, the last state of each gadget is the same state as the start state of the gadget of the next variable, i.e., $q^5_{i,j} = q^1_{i,j+1}$ for all $1 \le i \le m$, $1 \le j < n$.

$^{15}$ Recall that a metric reduction from $f$ to $g$ is a pair of polynomial-time computable functions $T_1, T_2$, where $T_1 : \Sigma^* \to \Sigma^*$ and $T_2 : \Sigma^* \times \mathbb{N} \to \mathbb{N}$, such that $f(x) = T_2(x, g(T_1(x)))$ for all $x \in \Sigma^*$.

We illustrate the crucial part of the construction on an example. Let $\varphi = (\neg x_1 \lor \neg x_2 \lor x_4) \land (x_2 \lor x_3 \lor x_4)$. The corresponding weighted VSet-automaton $W'$ therefore has two disjoint branches, one for each clause of $\varphi$. Figure 7 depicts the clause branch of $C_1$, which corresponds to all assignments that do not satisfy $C_1$, that is, all assignments with $x_1 = x_2 = 1$ and $x_4 = 0$.
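Formally, the initial weight function is $I(q^a_{i,j}) = \bar{1}$ if $j = 1 = a$ and $I(q^a_{i,j}) = \bar{0}$ otherwise. The final weight function is $F(q^a_{i,j}) = \bar{1}$ if $j = n$ and $a = 5$, and $F(q^a_{i,j}) = \bar{0}$ otherwise. The transition function $\delta$ is defined as follows:
$$\delta(q^a_{i,j}, o, q^{a'}_{i,j}) = \begin{cases} \;\vdots & \\ \bar{1} & \ldots, \text{ and there is a variable assignment } \tau \text{ with } \tau(x_j) = 1 \text{ and } \tau \not\models C_i, \\ \bar{1} & a = 3,\ a' = 5,\ o = a, \text{ and there is a variable assignment } \tau \text{ with } \tau(x_j) = 0 \text{ and } \tau \not\models C_i, \\ \bar{1} & a = 4,\ a' = 5,\ o = \triangleleft x_j. \end{cases}$$
All other transitions have weight $\bar{0}$.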
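We claim that $W'$ represents $w'$, where $w'(d, t) = \bar{1}$ if $\alpha_t \not\models \varphi$ and $w'(d, t) = \bar{0}$ otherwise. To this end, let $t \in \llbracket A \rrbracket(d)$ be a tuple and let $\tau = \alpha_t$ be the variable assignment encoded by $t$. It is easy to see that there is an accepting run $\rho$ of $W'$ for $t$ with weight $w_\rho = \bar{1}$, starting in $q^a_{i,0}$, if and only if $\tau$ does not satisfy clause $C_i$. As mentioned before, the weighted VSet-automaton $W$ is the union of $W'$ and $W_A$. Recall that, over the tropical semiring, $\bar{0} = \infty$, $\bar{1} = 0$, and the weight of a tuple $t$ is the minimal weight over all accepting runs which encode $t$. Hence, a tuple whose assignment satisfies $\varphi$ only has the run of $W_A$ and receives weight $n_{\alpha_t}$, whereas every other tuple additionally has a run of weight $0$ in $W'$ and receives weight $\min(n_{\alpha_t}, 0) = 0$. Thus, the weight function represented by $W$ is exactly $w$, as claimed. This concludes the proof that Max[uVSA, $\mathrm{Reg}_{\mathbb{T}}$] is OptP-hard.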
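The effect of taking the union of $W_A$ and $W'$ over the tropical semiring can be checked by brute force on small formulas. The following is only an illustrative sketch, not the construction itself: it does not build any automaton, the function names are hypothetical, and it assumes (following the edge weights $2^{i-1}$ used above) that $x_1$ is the least significant bit of the encoded number.

```python
from itertools import product

def clause_satisfied(assignment, clause):
    # A literal k > 0 stands for x_k, k < 0 for "not x_k" (1-based indices).
    return any((lit > 0) == assignment[abs(lit) - 1] for lit in clause)

def encoded_number(assignment):
    # Weight of the run of W_A: 2^(i-1) is added for every x_i set to 1.
    return sum(2 ** i for i, bit in enumerate(assignment) if bit)

def tropical_weight(assignment, cnf):
    # One run from W_A, plus one run of weight 0 (the tropical "one")
    # for every clause branch whose clause is NOT satisfied.
    run_weights = [encoded_number(assignment)]
    run_weights += [0 for clause in cnf if not clause_satisfied(assignment, clause)]
    return min(run_weights)  # tropical addition of run weights is min

def max_aggregate(n, cnf):
    # Max over all tuples, i.e. over all assignments of the n variables.
    return max(tropical_weight(a, cnf) for a in product([False, True], repeat=n))

def msa_bruteforce(n, cnf):
    # Reference value: largest encoded number among satisfying assignments, else 0.
    values = [encoded_number(a) for a in product([False, True], repeat=n)
              if all(clause_satisfied(a, c) for c in cnf)]
    return max(values, default=0)

# The example formula from the text: (not x1 or not x2 or x4) and (x2 or x3 or x4).
phi = [[-1, -2, 4], [2, 3, 4]]
print(max_aggregate(4, phi), msa_bruteforce(4, phi))
```

On every input, both functions agree, which mirrors the identity $\mathrm{MSA}(\varphi) = \mathrm{Max}(\llbracket A \rrbracket, d, w)$ used in the reduction.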
It remains to show that Max[uVSA, $\mathrm{Reg}_{\mathbb{Q}}$] is OptP-hard. We give a metric reduction from the OptP-complete problem of weighted satisfiability (WSAT) [Kre88], which is defined as follows. Let $\varphi(x_1, \ldots, x_n)$ be a propositional formula in CNF with binary weights. WSAT asks, given the CNF formula $\varphi(x_1, \ldots, x_n)$ with $m$ clauses and weights $w_1, \ldots, w_m$, for the maximal weight of an assignment, where the weight of an assignment is the sum of the weights of the satisfied clauses.
Denote by $\mathrm{WSAT}(\varphi)$ the output of WSAT on input $\varphi$. Let $\varphi(x_1, \ldots, x_n)$ be a Boolean formula in CNF. Let $d$, $A$, $W$ be as defined before, except that the weights in $W$ are defined differently. That is, $W$ is the union of $W_A$ and $W'$, where $W_A$ is a copy of $A$ in which all transitions have weight $1$. Furthermore, let $x$ be the sum of all clause weights and $F(q) = x$ if $q$ is an accepting state of $A$. The automaton $W'$ is defined exactly as before, except that it accepts with final weight $F(q) = -w_i$ if $q$ is the final state of the branch of clause $C_i$ and $w_i$ is the weight of $C_i$. Observe that $w(d, t) = \llbracket W \rrbracket_{\mathbb{Q}}(d, t)$ is exactly the weighted sum of all clauses which are satisfied by the valuation $\alpha_t$ encoded by $t$: the run of $W_A$ contributes $x$, and every unsatisfied clause $C_i$ contributes $-w_i$. It follows that $\mathrm{Max}(S, d, w) = \mathrm{WSAT}(\varphi)$. Defining $T_2(x, y) \mapsto y$ concludes the proof for Max[uVSA, $\mathrm{Reg}_{\mathbb{Q}}$].
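The weight bookkeeping in this reduction (total clause weight from $W_A$, minus the weight of every unsatisfied clause from $W'$) can be sanity-checked on small instances. The sketch below is a hedged illustration with hypothetical function names; it sums the run weights directly instead of constructing the weighted VSet-automata.

```python
from itertools import product

def clause_satisfied(assignment, clause):
    # A literal k > 0 stands for x_k, k < 0 for "not x_k" (1-based indices).
    return any((lit > 0) == assignment[abs(lit) - 1] for lit in clause)

def numeric_weight(assignment, clauses, weights):
    # Run of W_A: all transitions weigh 1 and the final weight is the total x.
    total = sum(weights)
    # Runs of W': one run with final weight -w_i per unsatisfied clause C_i.
    penalties = sum(-w for c, w in zip(clauses, weights)
                    if not clause_satisfied(assignment, c))
    # Over the numerical semiring, the weights of all runs are added up.
    return total + penalties

def max_aggregate(n, clauses, weights):
    return max(numeric_weight(a, clauses, weights)
               for a in product([False, True], repeat=n))

def wsat_bruteforce(n, clauses, weights):
    # Reference value: maximal total weight of satisfied clauses.
    return max(sum(w for c, w in zip(clauses, weights) if clause_satisfied(a, c))
               for a in product([False, True], repeat=n))

clauses = [[1], [-1], [1, 2]]   # x1, not x1, (x1 or x2)
weights = [3, 5, 2]
print(max_aggregate(2, clauses, weights), wsat_bruteforce(2, clauses, weights))
```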
Proof. Let D, m be the DAG and the bijection guaranteed by Lemma 7.1. We have that The first equation follows from the definition of Sum. The second equation follows from property (2) of Lemma 7.1. The third equation must hold due to m being a bijection between tuples t ∈ A and paths p ∈ Paths(src, snk).
It remains to show that the sum of the lengths of source-to-target paths in a DAG $D = (N, E)$ can be computed in polynomial time. We begin by observing that, given two nodes $x, y \in D$, the number of paths from $x$ to $y$ in $D$ can be computed in polynomial time via dynamic programming. Furthermore, given an edge $e = (x, y) \in E$, one can compute the number of paths from $\mathit{src}$ to $\mathit{snk}$ which use $e$ by multiplying the number of paths from $\mathit{src}$ to $x$ with the number of paths from $y$ to $\mathit{snk}$. Therefore, the function $c : E \to \mathbb{N}$ which assigns to every edge $e \in E$ the number of paths using $e$ can be computed in polynomial time. Recall that over the tropical semiring, $\otimes = +$ and therefore $\mathrm{len}(p) = \sum_{e \in p} \mathrm{len}(e)$. It therefore follows that
$$\sum_{p \in \mathrm{Paths}(\mathit{src}, \mathit{snk})} \mathrm{len}(p) = \sum_{e \in E} c(e) \cdot \mathrm{len}(e).$$

If we relax the restriction that weight functions are given as unambiguous automata, Sum and Average become #P-hard again.

Proof. We begin by giving a parsimonious reduction from the #P-complete problem #CNF. To this end, let $c = 1$ in the case of Sum and $c = 2^n$ in the case of Average.
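As an aside on the polynomial-time argument for summing path lengths given above, the per-edge counting can be sketched as follows. This is a minimal illustration under assumptions: the DAG is passed as an explicit node list and a list of `(u, v, length)` triples, and all function names are hypothetical.

```python
from collections import defaultdict, deque

def topological_order(nodes, edges):
    # Kahn's algorithm; edges are (u, v, length) triples.
    indegree = {v: 0 for v in nodes}
    out = defaultdict(list)
    for u, v, _ in edges:
        out[u].append(v)
        indegree[v] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in out[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return order, out

def sum_of_path_lengths(nodes, edges, src, snk):
    order, out = topological_order(nodes, edges)
    # from_src[v]: number of paths from src to v.
    from_src = {v: 0 for v in nodes}
    from_src[src] = 1
    for u in order:
        for v in out[u]:
            from_src[v] += from_src[u]
    # to_snk[v]: number of paths from v to snk.
    to_snk = {v: 0 for v in nodes}
    to_snk[snk] = 1
    for u in reversed(order):
        for v in out[u]:
            to_snk[u] += to_snk[v]
    # c(e) = from_src[u] * to_snk[v] paths use edge e = (u, v);
    # each of them contributes len(e) to the total.
    return sum(from_src[u] * to_snk[v] * length for u, v, length in edges)

nodes = ["s", "a", "b", "t"]
edges = [("s", "a", 1), ("s", "b", 2), ("a", "t", 3), ("b", "t", 4), ("a", "b", 5)]
print(sum_of_path_lengths(nodes, edges, "s", "t"))
```

In the example, the three paths s-a-t, s-b-t, and s-a-b-t have lengths 4, 6, and 10, so the per-edge sum agrees with the direct enumeration.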
Let $\varphi(x_1, \ldots, x_n)$ be a propositional formula in conjunctive normal form. Let $A$, $d$ be as constructed in the proof of Theorem 7.3 and let $w$ be the weight function such that $w(d, t) = c$ if the corresponding assignment $\alpha_t$ satisfies $\varphi$ and $w(d, t) = 0$ otherwise. Therefore, with $c := 1$ it follows directly that $\#\mathrm{CNF}(\varphi) = \mathrm{Sum}(\llbracket A \rrbracket, d, w)$, which shows that the problem is #P-hard. For Average, let $c := 2^n$. It follows that
$$\#\mathrm{CNF}(\varphi) = x = \frac{x \cdot 2^n}{2^n} = \frac{x \cdot c}{2^n} = \mathrm{Avg}(\llbracket A \rrbracket, d, w),$$
implying that Average[uVSA, $\mathrm{Reg}_{\mathbb{T}}$] is also #P-hard.
It remains to show that there is a weighted automaton $W$ representing $w \in \mathrm{Reg}_{\mathbb{T}}$. As in the proof of Theorem 7.3, $W$ is the union of two weighted VSet-automata $W_A$ and $W'$, where $W_A$ is a copy of $A$, assigning weight $0$ to all initial states and transitions of $A$ and weight $c$ to all final states. Furthermore, $W'$ is as defined in that proof. It follows directly that $W$ encodes the weight function $w$, concluding the proof.
Finally, we show that Sum and Average for $\mathrm{Reg}_{\mathbb{T}}$ weight functions are in $\mathrm{FP}^{\#\mathrm{P}}$.

Now, let $w \in \mathrm{Reg}$ be a weight function, represented by the weighted VSet-automaton $W$. We can assume, w.l.o.g., that all rationals in $W$ have the denominator $d_{\mathrm{lcm}}$.$^{16}$ We recall that $w(d, t) = W(d, \pi_{\mathrm{Vars}(W)}(t))$. Thus, $w(d, t)$ is the product of $|d| + 1 + 2 \cdot |\mathrm{Vars}(A)|$ rationals, where each factor has the denominator $d_{\mathrm{lcm}}$. Therefore, $W(d, \pi_{\mathrm{Vars}(W)}(t))$ must have the denominator $d_{\mathrm{lcm}}^{|d| + 1 + 2 \cdot |\mathrm{Vars}(A)|}$.

7.4. Quantile Aggregation. The situation for q-Quantile is different from the other aggregation problems, since it remains hard even when both the spanner and the weight function are unambiguous. The reason is that the problem reduces to counting the number of paths in a weighted DAG that are shorter than a given target weight, which is #P-complete due to Mihalák et al. [MSW16].

Theorem 7.9. q-Quantile[uVSA, UReg] is #P-hard under Turing reductions, for every $0 < q < 1$.
At the core of the quantile problem is the problem of counting up to a threshold $k \neq \infty$: To this end, we reduce from #Partition and #Product-Partition. Let $N = \{n_1, \ldots, n_n\}$ be a set of natural numbers. Two sets $N_1, N_2$ are a partition of $N$ if $N_1 \cup N_2 = N$ and $N_1 \cap N_2 = \emptyset$. Furthermore, a partition is perfect if the sums of the natural numbers in both sets are equal. Given such a set $N = \{n_1, \ldots, n_n\}$, the #Partition problem asks for the number of perfect partitions.
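For small inputs, the definition of #Partition just given can be explored by brute force. The sketch below is purely illustrative and exponential in the size of the input; it assumes that partitions are counted as ordered pairs $(N_1, N_2)$, and its function name is hypothetical. It also exposes the symmetry between partitions whose difference of sums is negative and those whose difference is positive.

```python
from itertools import product

def count_by_difference(numbers):
    """Classify all ordered partitions (N1, N2) of `numbers` by the sign of
    sum(N1) - sum(N2). Returns the triple (less, equal, greater)."""
    less = equal = greater = 0
    for choice in product([0, 1], repeat=len(numbers)):
        # choice[i] == 0 puts numbers[i] into N1, choice[i] == 1 into N2.
        diff = sum(n if c == 0 else -n for n, c in zip(numbers, choice))
        if diff < 0:
            less += 1
        elif diff == 0:
            equal += 1
        else:
            greater += 1
    return less, equal, greater

less, equal, greater = count_by_difference([1, 2, 3])
total = less + equal + greater
print(less, equal, greater, total)
```

Swapping $N_1$ and $N_2$ negates the difference, so `less == greater` always holds and the number of perfect (ordered) partitions equals `total - 2 * less`; a single count below the threshold therefore suffices.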
Analogously, a partition N 1 , N 2 is called a perfect product partition, if the products of the natural numbers in both sets are equal. Furthermore, the Product-Partition Problem asks whether there is a perfect product partition and the problem #Product-Partition asks for the number of perfect product partitions. Theorem 1] can be used to give a Turing reduction from #X3C to #Product-Partition, which implies that #Product-Partition is also #P-hard under Turing reductions. It is easy to see that #Product-Partition is in #P.
Proof. We use the same idea as Mihalák et al. [MSW16, Theorem 1] to encode #Partition. Let $N = \{n_1, \ldots, n_n\}$ be an instance of #Partition. Let $d = a^n$. We construct $A$ and $W$ such that every tuple $t \in \llbracket A \rrbracket(d)$ corresponds to a partition of $N$. Furthermore, $w(d, t) = k$ if and only if the partition encoded by $t$ is perfect.
More formally, $A := (\Sigma, V, Q, q_0, Q_F, \delta)$, where $\Sigma := \{a\}$, $V := \{x_1, \ldots, x_n\}$, $Q := \{q^j_i \mid 1 \le i \le n, 1 \le j \le 5\}$, where $q^5_i = q^1_{i+1}$ for all $1 \le i < n$, $q_0 := q^1_1$, $Q_F := \{q^5_n\}$, and for $1 \le i \le n$, $\delta$ is defined as follows: Recall that $q^5_i = q^1_{i+1}$ for all $1 \le i < n$. Furthermore, we define the weighted VSet-automaton $W$ encoding $w$ the same way as $A$. That is, all transitions labeled by a variable operation $x \in \Gamma_V$ are assigned weight $1$, $\delta(q^3_i, a, q^5_i) = n_i$ and $\delta(q^2_i, a, q^4_i) = -n_i$, the initial and final weight functions: We observe that every tuple $t \in \llbracket A \rrbracket(d)$ encodes a partition of $N$, that is, Furthermore, for every tuple $t \in \llbracket A \rrbracket(d)$, the weight $w(d, t)$ is exactly $k$ plus the difference of the sum of all elements in $N_1$ and the sum of all elements in $N_2$. We make some observations about $A$, $d$, and $w$.
(1) The number of perfect partitions is exactly $\mathrm{Count}_{=k}(\llbracket A \rrbracket, d, w)$;
(2) $\mathrm{Count}_{<k}(\llbracket A \rrbracket, d, w) = \mathrm{Count}_{>k}(\llbracket A \rrbracket, d, w)$;

Due to Observations (1) and (5), it follows that the number of perfect partitions can be computed by a single call to a $\mathrm{Count}_{<k}(\llbracket A \rrbracket, d, w)$-oracle. It remains to argue that Observations (1)-(5) hold. Observation (1) follows directly from the previous observation that the weight of each tuple is $k$ plus the difference of the sum of all elements in $N_1$ and the sum of all elements in $N_2$. Observation (2) follows from the fact that the partition problem is symmetric, that is, for every partition $N_1, N_2$ of $N$ there is also a partition $N_2, N_1$. Observation (3) follows from (2), and (4) from the fact that there are $2^n$ subsets of $N$ and therefore $2 \cdot 2^n$ possible partitions. The last observation (5) follows from (3) and (4). This concludes the proof.

Proof. Let $N$ be an instance of #Product-Partition. We construct $A$, $d$, $w$ and $W$ as in the proof of Lemma 7.11. However, in $W$, $\delta(q^3_i, a, q^5_i) = n_i$ and $\delta(q^2_i, a, q^4_i) = \frac{1}{n_i}$. Observe that $w(d, t)$ is exactly the product of all elements in $N_1$ divided by the product of all elements in $N_2$, where $n_i \in N_1$ if and only if $t(x_i) = [i, i\rangle$ and $n_i \in N_2$ if and only if $t(x_i) = [i, i+1\rangle$. Therefore, the number of perfect product partitions is exactly the number of tuples $t \in \llbracket A \rrbracket(d)$ with $w(d, t) = 1$. Using the same argument as in the proof of Lemma 7.11, it follows that $\#\text{Product-Partition} = 2^{n+1} - 2 \cdot \mathrm{Count}_{<1}(\llbracket A \rrbracket, d, w)$, and thus #Product-Partition can be computed by a single $\mathrm{Count}_{<1}$[uVSA, $\mathrm{UReg}_{\mathbb{Q}}$]-oracle call.

$^{18}$ Recall that, in the proof for the tropical semiring, we add $k$ to all accepting runs by having $F(q) = k$ if $q \in Q_F$. This is not possible over the numerical semiring, as the multiplicative operation is the numerical multiplication $\cdot$ and not the numerical addition $+$.
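Using binary search, we compute $r_{\min}$ as the smallest $r$ with q-Quantile$(\llbracket A' \rrbracket, d', w') < 1$.

For the sake of contradiction, assume that $\frac{\mathrm{Count}_{<1}(\llbracket A \rrbracket, d, w) + r_{\min}}{c \cdot b} > q = \frac{c \cdot a}{c \cdot b}$. It follows that $\mathrm{Count}_{<1}(\llbracket A \rrbracket, d, w) + r_{\min} > c \cdot a$ and therefore, as all involved numbers are natural numbers, $\mathrm{Count}_{<1}(\llbracket A \rrbracket, d, w) + r_{\min} - 1 \ge c \cdot a$. Thus, $\frac{\mathrm{Count}_{<1}(\llbracket A \rrbracket, d, w) + (r_{\min} - 1)}{c \cdot b} \ge q$, leading to the desired contradiction, as $r_{\min}$ was assumed to be minimal.

We have that $\frac{\mathrm{Count}_{<1}(\llbracket A \rrbracket, d, w) + r_{\min}}{c \cdot b} = q = \frac{c \cdot a}{c \cdot b}$. It follows that $\mathrm{Count}_{<1}(\llbracket A \rrbracket, d, w) = c \cdot a - r_{\min}$, which concludes the proof.

$^{19}$ For instance with $v = \mathrm{Vars}(A) \cdot b$.

$^{20}$ Note that we use $0$ and $1$ instead of $\bar{0}$ and $\bar{1}$ on purpose. The reason is that we want to assign the same weights for both semirings.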

Aggregate Approximation
Now that we have a detailed understanding of the complexity of computing exact aggregates, we want to see in which cases the result can be approximated. We only consider the situation where the exact problems are intractable and want to understand when the considered aggregation problems can be approximated by fully polynomial-time randomized approximation schemes (FPRAS), and when the existence of such an FPRAS would contradict commonly believed conjectures, like RP ≠ NP and the conjecture that the polynomial hierarchy does not collapse. Based on the results for the computation of exact aggregates, we can already give some insights into the possibility of approximation. That is, Zuckerman [Zuc96] shows that #SAT cannot be approximated by an FPRAS unless NP = RP. Furthermore, as shown by Dyer et al. [DGGJ04], this characterization extends to all problems which are #P-complete under parsimonious reductions. Therefore, due to Theorems 5.4 and 7.7, we have the following corollary.

In the remainder of this section, we will revisit the other intractable cases of spanner aggregation and study whether or not approximation is possible.

8.1. Approximation is Hard at First Sight. We begin with some inapproximability results. For instance, as we show now, the existence of an FPRAS for the problems Min and Max with $\mathrm{Reg}_{\mathbb{Q}}$ weight functions would imply a collapse of the polynomial hierarchy, even when spanners are unambiguous. Furthermore, for Max and $\mathrm{Reg}_{\mathbb{T}}$ weight functions the same result holds.

Proof. Assume there is an FPRAS for Min[uVSA, $\mathrm{Reg}_{\mathbb{Q}}$]. We will show that such an FPRAS implies that the NP-complete problem SAT is in BPP, which implies that the polynomial hierarchy collapses to the second level.$^{21}$ Let $\varphi(x_1, \ldots, x_n)$ be a Boolean formula, given in CNF, and let $A$, $d$, and $W'$ be as defined in the proof for Max[uVSA, $\mathrm{Reg}_{\mathbb{T}}$] of Theorem 7.3, where $W'$ is interpreted as a weighted VSet-automaton over the numerical semiring.
Observe that, due to $\bar{1} = 1$ and $\bar{0} = 0$, it follows that $\llbracket W' \rrbracket_{\mathbb{Q}}(d, t) \ge 1$ if the valuation $\alpha_t$ encoded by $t$ does not satisfy at least one clause of $\varphi$, and $0$ otherwise. Let $w$ be the weight function encoded by $W'$.

$^{21}$ NP ⊆ BPP implies that PH ⊆ BPP (cf. Zachos [Zac88]) and, as BPP ⊆ $(\Pi^P_2 \cap \Sigma^P_2)$ (cf. Lautemann [Lau83]), the polynomial hierarchy collapses on the second level. Furthermore, as BPP is closed under complement, coNP ⊆ BPP implies that NP ⊆ BPP, resulting in the same collapse of the polynomial hierarchy.

For the sake of contradiction, assume that there is an FPRAS for Min[uVSA, $\mathrm{Reg}_{\mathbb{Q}}$] and let $\delta = 0.4$. Assume that $\varphi$ is satisfiable, thus $\mathrm{Min}(\llbracket A \rrbracket, d, w) = 0$. Then the FPRAS must return $0$ with probability at least $\frac{3}{4}$. On the other hand, if $\varphi$ is not satisfiable, the FPRAS must return a value $x \ge (1 - \delta) \cdot 1 = 0.6$ with probability at least $\frac{3}{4}$. Consider the algorithm which calls the FPRAS, accepts if the approximation is $0$, and rejects otherwise. This algorithm is a BPP algorithm for SAT, resulting in the desired contradiction.
The proof for Max[uVSA, $\mathrm{Reg}_{\mathbb{Q}}$] is analogous. The only difference is that the final weight function of $W'$ is multiplied by $-1$, that is, $W'$ assigns weight $-x$ to each tuple encoding a valuation $\alpha$ which does not satisfy $x$ clauses of $\varphi$.

For the sake of contradiction, assume that there is an FPRAS for Max[uVSA, $\mathrm{Reg}_{\mathbb{T}}$] and let $\delta = 0.4$. Assume that $\varphi$ is satisfiable, thus $\mathrm{Max}(\llbracket A \rrbracket, d, w) \ge 1$. Then the FPRAS must return a value $x \ge (1 - \delta) \cdot 1 = 0.6$ with probability at least $\frac{3}{4}$. On the other hand, if $\varphi$ is not satisfiable, the FPRAS must return $0$ with probability at least $\frac{3}{4}$. Therefore, we can obtain a BPP algorithm for SAT as follows. The algorithm first calls the FPRAS, accepts if the approximation is bigger than $0$, and rejects otherwise.
Concerning Sum and Average, the only case which is not resolved by Corollary 8.1 is Average[VSA, CWidth]. We show now that, under reasonable complexity assumptions, this problem cannot be approximated by an FPRAS either.
Theorem 8.5. Average[VSA, CWidth] cannot be approximated by an FPRAS, unless the polynomial hierarchy collapses to the second level.
Proof. We will show that such an FPRAS implies that the NP-complete problem SAT is in BPP, which implies that the polynomial hierarchy collapses to the second level.
To this end, let $A$, $d$ and $w$ be as constructed in the proof of Theorem 5.4. Recall that, given a propositional formula $\varphi$ in CNF, we have that $\mathrm{Sum}(\llbracket A \rrbracket, d, w) = c$, where $c$ is the number of satisfying assignments of $\varphi$.

Assume there is an FPRAS for Average[VSA, CWidth] and let $\delta = 0.5$. Assume that $\varphi$ is not satisfiable. Then the FPRAS on input $A$, $d$, $w$ must return $0$ with probability at least $\frac{3}{4}$. On the other hand, if $\varphi$ is satisfiable, and thus $c > 0$, the FPRAS must return a value $x \ge (1 - \delta) \cdot \mathrm{Avg}(\llbracket A \rrbracket, d, w) = \frac{1}{2} \cdot \frac{c}{\mathrm{Count}(\llbracket A \rrbracket, d)} > 0$ with probability at least $\frac{3}{4}$. Therefore, the algorithm which first approximates $\mathrm{Avg}(\llbracket A \rrbracket, d, w)$ with $\delta = 0.5$, rejects if the approximation is $0$, and accepts otherwise is a BPP algorithm for SAT, implying that NP ⊆ BPP, which implies that the polynomial hierarchy collapses to the second level.
We now turn to the quantile problem. It turns out that this problem is difficult to approximate even if the weight functions only return 0 or 1.
Theorem 8.6. Let $0 < q < 1$. Then, q-Quantile[VSA, CWidth] cannot be approximated by an FPRAS, unless the polynomial hierarchy collapses to the second level.

Proof. We will show that an FPRAS for q-Quantile[VSA, CWidth] implies a BPP algorithm for SAT. To this end, let $\varphi$ be a propositional formula in CNF. Assume that $q = \frac{1}{2}$ and let $A$ and $d$ be as constructed in the proof of Theorem 5.4. However, let $w$ be the weight function which is represented by the $\mathbb{Q}$-Relation $R$ over $\{x\}$ with Recall from the construction of $A$ and $d$ that $A$ is the union of two automata $A_1$, $A_{-1}$, such that $\mathrm{Count}(\llbracket A_1 \rrbracket, d) = 2^n$ and $\mathrm{Count}(\llbracket A_{-1} \rrbracket, d) = s$, where $s$ is the number of non-satisfying assignments for $\varphi$. Furthermore, $t \in \llbracket A_1 \rrbracket(d)$ if and only if $d_{t(x)} = 1$, and $t \in \llbracket A_{-1} \rrbracket(d)$ if and only if $d_{t(x)} = -1$. We observe that $R(-1) = 0$ and therefore, for every $t \in \llbracket A \rrbracket(d)$ we have that Thus, $\frac{1}{2}$-Quantile$(\llbracket A \rrbracket, d, w) = 0$ if and only if $\varphi$ is not satisfiable. Assuming there is an FPRAS for q-Quantile[VSA, CWidth], one can decide SAT with a success probability of $\frac{3}{4}$ by approximating q-Quantile$(\llbracket A \rrbracket, d, w)$ with $\delta = 0.5$, rejecting if the approximation is $0$ and accepting otherwise. This, however, implies that NP ⊆ BPP, which implies a collapse of the polynomial hierarchy on the second level.
The general case for $0 < q < 1$ follows by slightly adapting the previous construction. That is, assume that $q = \frac{a}{b}$. Due to $0 < q < 1$, it must hold that $1 \le a < b$. We construct a VSet-automaton $A'$ and a document $d'$ as follows. Let $\sigma \notin \Sigma$ be a new alphabet symbol. The document $d'$ consists of $b$ copies of $d$, separated by $\sigma$, and $A'$ consists of $a$ copies of $A_{-1}$ and $b - a$ copies of $A_1$. More formally, $d' := (d \cdot \sigma)^b$. Furthermore, slightly abusing notation, we define We observe that on input document $d'$, the automaton $A'$ accepts exactly $2^n \cdot (b - a)$ tuples $t$ with $w(d', t) = 1$ and $s \cdot a$ tuples with weight $0$. Therefore, $\frac{a}{b}$-Quantile$(S, d, w) = 0$ if and only if
$$\frac{s \cdot a}{2^n \cdot (b - a) + s \cdot a} \ge \frac{a}{b}.$$
Solving this inequality for $s$ yields $s \ge 2^n$. As $s \le 2^n$ always holds, $\frac{a}{b}$-Quantile$(S, d, w) = 0$ if and only if $s = 2^n$, and therefore $\frac{a}{b}$-Quantile$(S, d, w) = 0$ if and only if $\varphi$ is not satisfiable. The rest of the proof is analogous to the case that $q = \frac{1}{2}$.

When the spanners are unambiguous, the simplest intractable case for q-Quantile is the one with UReg weight functions (see Table 1). Again, we can show that approximation is hard.
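The step from the quantile inequality $\frac{s \cdot a}{2^n (b - a) + s \cdot a} \ge \frac{a}{b}$ in the proof of Theorem 8.6 to the condition $s = 2^n$ can be double-checked with exact rational arithmetic. The following sketch is only a numeric sanity check over small parameters, with hypothetical function names; it is not part of the proof.

```python
from fractions import Fraction

def zero_weight_fraction(s, n, a, b):
    # Fraction of tuples of weight 0: s*a tuples of weight 0 among
    # 2^n * (b - a) tuples of weight 1 plus s*a tuples of weight 0.
    return Fraction(s * a, 2 ** n * (b - a) + s * a)

def inequality_holds(s, n, a, b):
    return zero_weight_fraction(s, n, a, b) >= Fraction(a, b)

def check_small_cases(max_n=4, max_b=6):
    # For all 1 <= a < b and 0 <= s <= 2^n, the inequality should hold
    # exactly when s = 2^n (i.e., when no assignment satisfies the formula).
    for n in range(1, max_n + 1):
        for b in range(2, max_b + 1):
            for a in range(1, b):
                for s in range(0, 2 ** n + 1):
                    if inequality_holds(s, n, a, b) != (s == 2 ** n):
                        return False
    return True

print(check_small_cases())
```

Algebraically, multiplying out gives $s \cdot b \ge 2^n (b - a) + s \cdot a$, that is, $s (b - a) \ge 2^n (b - a)$, which is what the check confirms on small cases.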
Theorem 8.7. Let 0 < q < 1. Then, q-Quantile[uVSA, UReg T ] cannot be approximated by an FPRAS, unless the polynomial hierarchy collapses on the second level.
Proof. We show that an FPRAS for q-Quantile[uVSA, UReg T ] implies a BPP algorithm for the NP-complete Partition problem. To this end, let S = {s 1 , . . . , s n } be a set of natural numbers. Furthermore, let A, d, w be constructed from S as in the proof of Lemma 7.11 with k = 0.
By construction of $A$, $d$ and $w$, every tuple $t \in \llbracket A \rrbracket(d)$ corresponds to a partition of $S$, such that the partition is perfect if and only if $w(d, t) = 0$. Furthermore, due to the partition problem being symmetric, for every tuple $t \in \llbracket A \rrbracket(d)$ with $w(d, t) = k$ there is a tuple $t' \in \llbracket A \rrbracket(d)$ with $w(d, t') = -k$. Thus, $\frac{1}{2}$-Quantile$(\llbracket A \rrbracket, d, w) = 0$ if and only if there is a tuple $t \in \llbracket A \rrbracket(d)$ with $w(d, t) = 0$.

Let $q = \frac{1}{2}$. Assuming there is an FPRAS for q-Quantile[uVSA, $\mathrm{UReg}_{\mathbb{T}}$], one can decide Partition with a success probability of $\frac{3}{4}$ by approximating q-Quantile$(\llbracket A \rrbracket, d, w)$ with $\delta = 0.5$, accepting if the approximation is $0$ and rejecting otherwise. This implies that the algorithm accepts if and only if there is a perfect partition and therefore NP ⊆ BPP, which implies a collapse of the polynomial hierarchy on the second level.
For the general case, assume that $q = \frac{a}{b}$. We observe that, due to $0 < q < 1$, it must hold that $a < b$. By Observation (4) in the proof of Lemma 7.11, $\mathrm{Count}(\llbracket A \rrbracket, d) = 2^{n+1}$. As in the proof of Theorem 7.9, we construct a VSet-automaton $A'$, a document $d'$ and a weight function $w'$, represented by the weighted automaton $W' \in \mathrm{UReg}_{\mathbb{T}}$, such that q-Quantile$(A', d', w') = 0$ if and only if $S$ has a perfect partition. By Lemma 4.7, there are VSet-automata $A_{-1}, A_1 \in \mathrm{uVSA}$ and documents $d_{-1}, d_1 \in \Sigma^*$ such that $\mathrm{Count}(\llbracket A_{-1} \rrbracket, d_{-1}) = (a - 1) \cdot 2^n$ and $\mathrm{Count}(\llbracket A_1 \rrbracket, d_1) = (b - a - 1) \cdot 2^n$. Let $W_{-1}$ (resp., $W_1$) be the same as $A_{-1}$ (resp., $A_1$), interpreted as a weighted automaton over the tropical semiring, such that all transitions are assigned weight $0$ and the final weight function assigns weight $-1$ (resp., $1$) to all accepting states. Let $w_{-1}$ (resp., $w_1$) be the weight function represented by $W_{-1}$ (resp., $W_1$). Thus, $w_{-1}(d_{-1}, t) = -1$ if and only if $t \in \llbracket A_{-1} \rrbracket(d_{-1})$ and $w_1(d_1, t) = 1$ if and only if $t \in \llbracket A_1 \rrbracket(d_1)$. Let $\sigma$ be a new alphabet symbol. We construct $A'$, $d'$, and $W'$ as follows.
We note that the case of approximating q-Quantile[uVSA, $\mathrm{UReg}_{\mathbb{Q}}$] does not follow analogously to the proof for q-Quantile[uVSA, $\mathrm{UReg}_{\mathbb{T}}$]. The main reason is the fact that #Partition can be encoded into a weight function $w_{\mathbb{T}} \in \mathrm{UReg}_{\mathbb{T}}$, such that perfect partitions correspond to tuples with weight $0$, whereas #Product-Partition is encoded into a weight function $w_{\mathbb{Q}} \in \mathrm{UReg}_{\mathbb{Q}}$, such that perfect product partitions correspond to tuples with weight $1$. Furthermore, all weights assigned by $w_{\mathbb{T}}$ are integers, whereas $w_{\mathbb{Q}}$ assigns

In the following, we will denote an FPRAS approximation with error rate $\delta$ of the problem $\mathrm{Count}(\llbracket A \rrbracket, d)$ (resp., $\mathrm{Sum}(\llbracket A \rrbracket, d, w)$ and $\mathrm{Avg}(\llbracket A \rrbracket, d, w)$) by $\mathrm{Count}(\llbracket A \rrbracket, d, \delta)$ (resp., $\mathrm{Sum}(\llbracket A \rrbracket, d, w, \delta)$ and $\mathrm{Avg}(\llbracket A \rrbracket, d, w, \delta)$).
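We begin by showing that Sum[VSA, $\mathrm{CWidth}_{\mathbb{Q}_+}$] admits an FPRAS. Let $A \in \mathrm{VSA}$ be a VSet-automaton, $d \in \Sigma^*$ be a document, and $w \in \mathrm{CWidth}_{\mathbb{Q}_+}$ be a weight function. Recall that every weight $x \in \mathbb{Q}_+$ is encoded by its numerator and its denominator. Let $D$ be the set of denominators used by $w$ and let $\mathrm{lcm}$ be the least common multiple of all elements in $D$. We note that, as argued in the proof of Theorem 7.8, $\mathrm{lcm}$ can be computed in polynomial time. Let $w_{\mathbb{N}}(d, t) = w(d, t) \cdot \mathrm{lcm}$. By definition of $\mathrm{lcm}$, $w_{\mathbb{N}} \in \mathrm{CWidth}_{\mathbb{N}}$ only assigns natural numbers. Furthermore, $w(d, t) = \frac{w_{\mathbb{N}}(d, t)}{\mathrm{lcm}}$. Since dividing by the constant $\mathrm{lcm}$ preserves the relative error, it follows that $\mathrm{Sum}(\llbracket A \rrbracket, d, w, \delta) := \frac{\mathrm{Sum}(\llbracket A \rrbracket, d, w_{\mathbb{N}}, \delta)}{\mathrm{lcm}}$ is a $\delta$-approximation of $\mathrm{Sum}(S, d, w)$ with success probability $\frac{3}{4}$, concluding this part of the proof.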
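The scaling step in this argument can be illustrated on an explicit list of rational weights. This is a hedged sketch only: the weights are given as a plain list rather than through a spanner and a weight function, the names are hypothetical, and the parameter `sum_over_naturals` merely stands in for an approximation routine over natural-number weights (here it defaults to the exact built-in `sum`). It assumes Python 3.9 or later for `math.lcm`.

```python
from fractions import Fraction
from math import lcm  # available since Python 3.9

def scale_to_naturals(weights):
    # Least common multiple of all denominators used by the weights.
    common = lcm(*(w.denominator for w in weights))
    # w_N = w * lcm is a natural number for every non-negative rational w.
    scaled = [w.numerator * (common // w.denominator) for w in weights]
    return scaled, common

def sum_via_scaling(weights, sum_over_naturals=sum):
    # Sum the scaled natural-number weights, then divide by lcm.
    # Dividing by a positive constant leaves the relative error unchanged,
    # so an approximate routine could be plugged in for sum_over_naturals.
    scaled, common = scale_to_naturals(weights)
    return Fraction(sum_over_naturals(scaled), common)

weights = [Fraction(1, 2), Fraction(1, 3), Fraction(3, 4)]
print(scale_to_naturals(weights), sum_via_scaling(weights))
```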
It remains to show that Average[VSA, $\mathrm{CWidth}_{\mathbb{Q}_+}$] admits an FPRAS. We show that the algorithm which, with success rate $(\frac{3}{4})^{0.5}$, calculates $\frac{\delta}{3}$-approximations for Count and Sum, and then returns the quotient of the results, is an FPRAS for the problem Average[VSA, $\mathrm{CWidth}_{\mathbb{Q}_+}$]. We note that the probability that both approximations are successful is (