The Shapley Value of Tuples in Query Answering

We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory, and in many applications of game theory, for assessing the contribution of a player to a coalition game. It was established in the 1950s and is theoretically justified as the unique wealth-distribution measure that satisfies some natural axioms. While this value has been investigated in several areas, it has received little attention in data management. We study this measure in the context of conjunctive and aggregate queries by defining corresponding coalition games. We provide algorithmic and complexity-theoretic results on the computation of Shapley-based contributions to query answers, and for the hard cases we present approximation algorithms.


Introduction
The Shapley value is named after Lloyd Shapley, who introduced the value in a seminal 1952 article [Sha53]. He considered a cooperative game that is played by a set A of players and is defined by a wealth function v that assigns, to each coalition S ⊆ A, the wealth v(S). For instance, in our running example the players are researchers, and v(S) is the total number of citations of papers with an author in S. As another example, A might be a set of politicians, and v(S) the number of votes that a poll assigns to the party that consists of the candidates in S. The question is how to distribute the wealth v(A) among the players or, from a different perspective, how to quantify the contribution of each player to the overall wealth. For example, the removal of a researcher r may have zero impact on the overall number of citations, since each paper has co-authors from A. Does it mean that r has no contribution at all? What if removing each individual author in turn has no impact?

Shapley considered distribution functions that satisfy a few axioms of good behavior. Intuitively, the axioms state that the function should be invariant under isomorphism, the sum over all players should be equal to the total wealth, and the contribution to a sum of wealths is equal to the sum of separate contributions. Quite remarkably, Shapley established that there is a single such function, and this function has become known as the Shapley value.

Following previous work on the explanations and responsibility of facts to query answers [MGMS10b, MGH+10], we view the database as consisting of two types of facts: exogenous facts and endogenous facts. Exogenous facts represent a context of information that is taken for granted and assumed not to claim any contribution or responsibility to the result of a query. Our concern is the role of the endogenous facts in establishing the result of the query.
In notation, we denote by $D_x$ and $D_n$ the subsets of D that consist of the exogenous and endogenous facts, respectively. Hence, we have that $D = D_x \cup D_n$.
Example 2.1. Figure 1 depicts the database D of our running example from the domain of academic publications. The relation Author stores authors along with their affiliations, which are stored with their states in Inst. The relation Pub associates authors with their publications, and Citations stores the number of citations for each paper (see footnote 1). For example, publication C has 8 citations and it is written jointly by Bob from NYU of NY state, Cathy from UCSD of CA state, and David from MIT of MA state. All Author facts are endogenous, and all remaining facts are exogenous. Hence, $D_n = \{f^a_1, f^a_2, f^a_3, f^a_4, f^a_5\}$ and $D_x$ consists of all $f^x_j$ for x ∈ {i, p, c} and relevant j.
Relational and conjunctive queries. Let S be a schema. A relational query is a function that maps databases to relations. More formally, a relational query q of arity k is a function $q : \mathrm{DB}(S) \to \mathcal{P}(\mathrm{Const}^k)$ (where $\mathcal{P}(\mathrm{Const}^k)$ is the power set of $\mathrm{Const}^k$, consisting of all subsets of $\mathrm{Const}^k$) that maps every database D over S to a finite relation q(D) of arity k. We denote the arity of q by ar(q). Each tuple $\vec{c}$ in q(D) is an answer to q on D. If the arity of q is zero, then we say that q is a Boolean query; in this case, $D \models q$ denotes that q(D) consists of the empty tuple (), while $D \not\models q$ denotes that q(D) is empty.

Our analysis will focus on the special case of Conjunctive Queries (CQs). A CQ over the schema S is a relational query definable by a first-order formula of the form $\exists y_1 \cdots \exists y_m\, \theta(\vec{x}, y_1, \dots, y_m)$, where θ is a conjunction of atomic formulas of the form $R(\vec{t})$ with variables among those in $\vec{x}, y_1, \dots, y_m$. In the remainder of the article, a CQ q will be written as a logic rule, that is, an expression of the form

$$q(\vec{x}) \mathrel{\text{:-}} R_1(\vec{t}_1), \dots, R_n(\vec{t}_n)$$

where each $R_i$ is a relation symbol of S, each $\vec{t}_i$ is a tuple of variables and constants with the same arity as $R_i$, and $\vec{x}$ is a tuple of k variables from $\vec{t}_1, \dots, \vec{t}_n$. We call $q(\vec{x})$ the head of q, and $R_1(\vec{t}_1), \dots, R_n(\vec{t}_n)$ the body of q. Each $R_i(\vec{t}_i)$ is an atom of q. The variables occurring in the head are called the head variables, and we make the standard safety assumption that every head variable occurs at least once in the body. The variables occurring in the body but not in the head are existentially quantified, and are called the existential variables.

Footnote 1: This example is used for illustrative purposes only and does not express any suggestion of a way to rank researchers.
The answers to q on a database D are the tuples c that are obtained by projecting all homomorphisms from q to D onto the variables of x, and replacing each variable with the constant it is mapped to. A homomorphism from q to D is a mapping of the variables in q to the constants of D, such that every atom in q is mapped to a fact in D.
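The definition of answers via homomorphisms can be sketched in a few lines of Python. This is only an illustration under assumed conventions that are not part of the article: a fact is a pair (relation name, tuple of constants), an atom is a pair (relation name, tuple of terms), and a term is a variable exactly when it is a string starting with "?". The function names are hypothetical.

```python
def is_var(term):
    # Assumed convention for this sketch: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")

def homomorphisms(body, db):
    """Yield every mapping of the body variables to constants that sends each atom to a fact of db."""
    def extend(i, h):
        if i == len(body):
            yield dict(h)
            return
        rel, terms = body[i]
        for fact_rel, consts in db:
            if fact_rel != rel or len(consts) != len(terms):
                continue
            h2 = dict(h)
            ok = True
            for t, c in zip(terms, consts):
                if is_var(t):
                    # Bind the variable, or check agreement with an earlier binding.
                    if h2.setdefault(t, c) != c:
                        ok = False
                        break
                elif t != c:
                    # A constant in the atom must match the fact exactly.
                    ok = False
                    break
            if ok:
                yield from extend(i + 1, h2)
    yield from extend(0, {})

def answers(head, body, db):
    """Project every homomorphism onto the head variables."""
    return {tuple(h[x] for x in head) for h in homomorphisms(body, db)}
```

For a Boolean query the head is the empty tuple, so `answers` returns `{()}` when the query is satisfied and the empty set otherwise. The search is exponential in the number of atoms in the worst case, which is acceptable for a fixed query under data complexity.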
A self-join in a CQ q is a pair of distinct atoms over the same relation symbol. For example, in the query q() :-R(x, y), S(x), R(y, z), the first and third atoms constitute a self-join. We say that q is self-join-free if it has no self-joins, or in other words, every relation symbol occurs at most once in the body.
Let q be a CQ. For a variable y of q, let $A_y$ be the set of atoms $R_i(\vec{t}_i)$ of q that contain y (that is, y occurs in $\vec{t}_i$). We say that q is hierarchical if for all existential variables y and y′ it holds that $A_y \subseteq A_{y'}$, or $A_{y'} \subseteq A_y$, or $A_y \cap A_{y'} = \emptyset$ [DRS09]. For example, every CQ with at most two atoms is hierarchical. The smallest non-hierarchical CQ is the following.

$$q_{RST}() \mathrel{\text{:-}} R(x), S(x, y), T(y) \qquad (2.1)$$

On the other hand, the query q(x) :- R(x), S(x, y), T(y), which has a single existential variable y, is hierarchical.

Let q be a Boolean query and D a database, both over the same schema, and let $f \in D_n$ be an endogenous fact. We say that f is a counterfactual cause (for q w.r.t. D) [MGH+10, MGMS10a] if the removal of f causes q to become false; that is, $D \models q$ and $D \setminus \{f\} \not\models q$.
Example 2.2. We will use the following queries in our examples.

$q_1()$ :- Author(x, y), Pub(x, z)
$q_2()$ :- Author(x, y), Pub(x, z), Citations(z, w)
$q_3(z, w)$ :- Author(x, y), Pub(x, z), Citations(z, w)
$q_4(z, w)$ :- Author(x, y), Pub(x, z), Citations(z, w), Inst(y, CA)

Note that $q_1$ and $q_2$ are Boolean, whereas $q_3$ and $q_4$ are not. Also note that $q_1$ and $q_3$ are hierarchical, and $q_2$ and $q_4$ are not. Considering the database D of Figure 1, none of the Author facts is a counterfactual cause for $q_1$, since the query remains true even if the fact is removed. The same applies to $q_2$. However, the fact $f^a_1$ is a counterfactual cause for the Boolean CQ $q'_1()$ :- Author(x, UCLA), Pub(x, z), asking whether there is a publication with an author from UCLA, since D satisfies $q'_1$, but if we remove Alice from the database, the query $q'_1$ is no longer satisfied, as no other author from UCLA exists.
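The hierarchy condition is easy to test mechanically. The sketch below checks it for a CQ given as a list of atoms, under an assumed encoding that is not from the article: an atom is a pair (relation name, tuple of terms), and variables are strings starting with "?". The function name `is_hierarchical` is hypothetical.

```python
from itertools import combinations

def is_hierarchical(head_vars, body):
    """Check the condition on the atom sets A_y of all pairs of existential variables."""
    atoms_of = {}
    for idx, (_, terms) in enumerate(body):
        for t in terms:
            is_variable = isinstance(t, str) and t.startswith("?")
            if is_variable and t not in head_vars:
                # Record that atom number idx contains the existential variable t.
                atoms_of.setdefault(t, set()).add(idx)
    for y, z in combinations(atoms_of, 2):
        a, b = atoms_of[y], atoms_of[z]
        if not (a <= b or b <= a or a.isdisjoint(b)):
            return False
    return True

# The queries of Example 2.2 and the two three-atom queries discussed above.
AUTHOR, PUB = ("Author", ("?x", "?y")), ("Pub", ("?x", "?z"))
CIT, INST = ("Citations", ("?z", "?w")), ("Inst", ("?y", "CA"))
Q1, Q2 = [AUTHOR, PUB], [AUTHOR, PUB, CIT]
Q4 = [AUTHOR, PUB, CIT, INST]
RST = [("R", ("?x",)), ("S", ("?x", "?y")), ("T", ("?y",))]
```

With this encoding, `is_hierarchical((), Q1)` holds while `is_hierarchical((), Q2)` does not, matching the classification stated in Example 2.2. Note that the head variables are excluded from the check, which is why the same body can be hierarchical for one head and non-hierarchical for another.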
Numerical and aggregate-relational queries. A numerical query α is a function that maps databases to numbers. More formally, a numerical query α is a function α : DB(S) → R that maps every database D over S to a real number α(D).
A special form of a numerical query α is what we refer to as an aggregate-relational query: a k-ary relational query q followed by an aggregate function $\gamma : \mathcal{P}(\mathrm{Const}^k) \to \mathbb{R}$ that maps the resulting relation q(D) into a single number γ(q(D)). We denote this aggregate-relational query as γ[q]; hence, $\gamma[q](D) \stackrel{\text{def}}{=} \gamma(q(D))$.

Special cases of aggregate-relational queries include the functions of the form $\gamma = F\langle\varphi\rangle$ that transform every tuple $\vec{c}$ into a number $\varphi(\vec{c})$ via a feature function $\varphi : \mathrm{Const}^k \to \mathbb{R}$, and then contract the resulting bag of numbers into a single number (hence, F is a numerical function on bags of numbers). Formally, we define

$$F\langle\varphi\rangle[q](D) \stackrel{\text{def}}{=} F\big(\{\!\{\varphi(\vec{c}) \mid \vec{c} \in q(D)\}\!\}\big)$$

where $\{\!\{\cdot\}\!\}$ is used for bag notation. For example, if we assume that the ith attribute of q(D) takes a numerical value, then ϕ can simply copy this number (i.e., $\varphi(\vec{c}) = \vec{c}[i]$); we denote this ϕ by [i]. As another example, ϕ can be the product of two attributes: ϕ = [i] · [j]. We later refer to the following aggregate-relational queries.

In terms of presentation, when we mention general functions γ and ϕ, we make the implicit assumption that they are computable in polynomial time with respect to the representation of their input. Also, observe that our modeling of an aggregate-relational query does not allow for grouping, since a database is mapped to a single number. This is done for simplicity of presentation, and all concepts and results of this article generalize to grouping as in traditional modeling (e.g., [CNS07]). This is explained in the next section.
Shapley value. Let A be a finite set of players. A cooperative game is a function $v : \mathcal{P}(A) \to \mathbb{R}$ such that v(∅) = 0. The value v(S) represents a value, such as wealth, jointly obtained by S when the players of S cooperate. The Shapley value [Sha53] measures the share of each individual player a ∈ A in the gain of A for the cooperative game v. Intuitively, the gain of a is as follows. Suppose that we form a team by taking the players one by one, randomly and uniformly without replacement; while doing so, we record the change of v due to the addition of a as the random contribution of a. Then the Shapley value of a is the expectation of the random contribution:

$$\mathrm{Shapley}(A, v, a) \stackrel{\text{def}}{=} \frac{1}{|A|!} \sum_{\sigma \in \Pi_A} \big( v(\sigma_a \cup \{a\}) - v(\sigma_a) \big)$$

where $\Pi_A$ is the set of all possible permutations over the players in A, and for each permutation σ we denote by $\sigma_a$ the set of players that appear before a in the permutation. An alternative formula for the Shapley value is the following.

$$\mathrm{Shapley}(A, v, a) = \sum_{B \subseteq A \setminus \{a\}} \frac{|B|! \cdot (|A| - |B| - 1)!}{|A|!} \big( v(B \cup \{a\}) - v(B) \big)$$
Note that |B|! · (|A| − |B| − 1)! is the number of permutations over A such that all players in B come first, then a, and then all remaining players. For further reading, we refer the reader to the book by Roth [Rot88].
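Both views of the Shapley value, the expectation over permutations and the weighted sum over subsets, can be computed by brute force for small games. The following Python sketch is meant only to make the two definitions concrete; the function names and the toy game are illustrative assumptions, and the running time is exponential in the number of players.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial

def shapley_permutations(players, v, a):
    """Average marginal contribution of a over all orderings of the players."""
    players = list(players)
    total = Fraction(0)
    for sigma in permutations(players):
        before = frozenset(sigma[:sigma.index(a)])  # players preceding a in sigma
        total += v(before | {a}) - v(before)
    return total / factorial(len(players))

def shapley_subsets(players, v, a):
    """Weighted sum over the subsets B of the players other than a."""
    others = [p for p in players if p != a]
    n = len(others) + 1
    total = Fraction(0)
    for size in range(n):
        # Fraction of orderings in which exactly a given B of this size precedes a.
        weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
        for B in combinations(others, size):
            B = frozenset(B)
            total += weight * (v(B | {a}) - v(B))
    return total

def toy_game(team):
    # Hypothetical game: the team wins if it contains "a" and at least one of "b", "c".
    return int("a" in team and ("b" in team or "c" in team))
```

On the toy game with players a, b, and c, both functions agree, and the three values sum to the wealth of the full team, as the axioms require. Exact rational arithmetic (`Fraction`) is used so that such identities can be checked without rounding error.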

Shapley Value of Database Facts
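Let α be a numerical query over a schema S, and let D be a database over S. We wish to quantify the contribution of every endogenous fact to the result α(D). For that, we view α as a cooperative game over $D_n$, where the value of every subset E of $D_n$ is determined by $\alpha(E \cup D_x)$.

Definition 3.1. Let S be a schema, D a database over S, α a numerical query, and f an endogenous fact of D. The Shapley value of f for α, denoted Shapley(D, α, f), is the Shapley value of the player a in the cooperative game v over the set A of players, where:

$A = D_n$;
$v(E) = \alpha(D_x \cup E) - \alpha(D_x)$ for all $E \subseteq A$;
$a = f$.

That is, Shapley(D, α, f) is the Shapley value of f in the cooperative game that has the endogenous facts as the set of players and values each team by the quantity it adds to α.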
The choice of v is natural in that the first term collects answers where endogenous facts may interact with exogenous facts, but we remove those answers that come only from exogenous facts. As a special case, if q is a Boolean query, then Shapley(D, q, f) is the same as the value Shapley(D, count[q], f). In this case, the corresponding cooperative game takes the values 0 and 1, and the Shapley value then coincides with the Shapley-Shubik index [SS54]. Some fundamental properties of the Shapley value [Sha53] are reflected here as follows:

(1) $\mathrm{Shapley}(D, a \cdot \alpha + b \cdot \beta, f) = a \cdot \mathrm{Shapley}(D, \alpha, f) + b \cdot \mathrm{Shapley}(D, \beta, f)$.
(2) $\alpha(D) = \alpha(D_x) + \sum_{f \in D_n} \mathrm{Shapley}(D, \alpha, f)$.

Remark 3.2. Note that Shapley(D, α, f) is defined for a general numerical query α. The definition extends immediately to queries with grouping (producing tuples of database constants and numbers [CNS07]), where we would measure the responsibility of f for an answer tuple $\vec{a}$ and write something like Shapley(D, α, $\vec{a}$, f). In that case, we treat every group as a separate numerical query. We believe that focusing on numerical queries (without grouping) allows us to keep the presentation considerably simpler while, at the same time, retaining the fundamental challenges.
In the remainder of this section, we illustrate the Shapley value on our running example.
Example 3.3. We begin with a Boolean CQ, and specifically $q_1$ from Example 2.2. Recall that the endogenous facts correspond to the authors. As Ellen has no publications, her addition to any $D_x \cup E$ where $E \subseteq D_n$ does not change the satisfaction of $q_1$. Hence, the Shapley value of her fact is zero: $\mathrm{Shapley}(D, q_1, f^a_5) = 0$. The fact $f^a_1$ changes the query result if it is either the first fact in the permutation, or it is the second fact after $f^a_5$. There are 4! permutations that satisfy the first condition, and 3! permutations that satisfy the second.
The contribution of $f^a_1$ to the query result is one in each of these permutations, and zero otherwise. Therefore, we have $\mathrm{Shapley}(D, q_1, f^a_1) = \frac{4! + 3!}{120} = \frac{1}{4}$. The same argument applies to $f^a_2$, $f^a_3$ and $f^a_4$, and so, $\mathrm{Shapley}(D, q_1, f^a_2) = \mathrm{Shapley}(D, q_1, f^a_3) = \mathrm{Shapley}(D, q_1, f^a_4) = \frac{1}{4}$. We get the same numbers for $q_2$, since every paper is mentioned in the Citations relation. Note that the value of the query $q_1$ on the database is 1, and it holds that $\sum_{i=1}^{5} \mathrm{Shapley}(D, q_1, f^a_i) = 4 \cdot \frac{1}{4} + 0 = 1$; hence, the second fundamental property of the Shapley value mentioned above is satisfied.
While Alice, Bob, Cathy and David have the same Shapley value for $q_1$, things change if we consider the relation Pub endogenous as well: the Shapley value of Alice and Cathy will be higher than Bob's and David's values, since they have more publications. Specifically, the fact $f^a_1$, for example, will change the query result if and only if at least one of $f^p_1$ or $f^p_2$ appears earlier in the permutation, and no pair among $\{f^a_2, f^p_3\}$, $\{f^a_3, f^p_4\}$, $\{f^a_3, f^p_5\}$, and $\{f^a_4, f^p_6\}$ appears earlier than $f^a_1$. By rigorous counting, we can show that there are: 2 such sets of size one, 17 such sets of size two, 56 such sets of size three, 90 such sets of size four, 73 such sets of size five, 28 such sets of size six, and 4 such sets of size seven. Since there are now 11 endogenous facts, the Shapley value of $f^a_1$ is:

$$\mathrm{Shapley}(D, q_1, f^a_1) = \frac{2 \cdot 1!\,9! + 17 \cdot 2!\,8! + 56 \cdot 3!\,7! + 90 \cdot 4!\,6! + 73 \cdot 5!\,5! + 28 \cdot 6!\,4! + 4 \cdot 7!\,3!}{11!} = \frac{442}{2520}$$

We can similarly compute the Shapley value for the rest of the authors, concluding that $\mathrm{Shapley}(D, q_1, f^a_2) = \mathrm{Shapley}(D, q_1, f^a_4) = \frac{241}{2520}$ and $\mathrm{Shapley}(D, q_1, f^a_3) = \frac{442}{2520}$. Hence, the Shapley value is the same for Alice and Cathy, who have two publications each, and lower for Bob and David, who have only one publication.
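The counting argument for Alice can be replayed by enumeration. The Python sketch below encodes only what the text states: the ten facts other than Alice's Author fact, and the four Author/Pub pairs that already satisfy the query without her. The short fact names ("a2" for Bob's Author fact, "p1" for Alice's first Pub fact, and so on) are shorthand introduced here for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

# Endogenous facts other than Alice's Author fact (5 Author facts and 6 Pub facts in total).
OTHERS = ["a2", "a3", "a4", "a5", "p1", "p2", "p3", "p4", "p5", "p6"]
# Author/Pub pairs that satisfy the query without Alice, as listed in the text.
BLOCKING_PAIRS = [{"a2", "p3"}, {"a3", "p4"}, {"a3", "p5"}, {"a4", "p6"}]

def alice_changes_result(prefix):
    """True if adding Alice's Author fact after the facts in prefix makes the query true."""
    has_alice_pub = "p1" in prefix or "p2" in prefix
    already_true = any(pair <= prefix for pair in BLOCKING_PAIRS)
    return has_alice_pub and not already_true

def counts_by_size():
    """Map each size k to the number of qualifying sets of size k (zero counts omitted)."""
    counts = {}
    for k in range(len(OTHERS) + 1):
        c = sum(alice_changes_result(set(E)) for E in combinations(OTHERS, k))
        if c:
            counts[k] = c
    return counts

def shapley_alice():
    m = len(OTHERS) + 1  # 11 endogenous facts, including Alice's Author fact
    return sum(Fraction(c * factorial(k) * factorial(m - k - 1), factorial(m))
               for k, c in counts_by_size().items())
```

The enumeration visits all 1024 subsets of the ten other facts, which is instantaneous here but clearly does not scale; it serves only as a check of the counts and of the resulting value.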
The following example, taken from Salimi et al. [SBSdB16], illustrates the Shapley value on (Boolean) graph reachability. Here, we assume that all edges $e_i$ are endogenous facts. Let $p_{ab}$ be the Boolean query (definable in, e.g., Datalog) that determines whether there is a path from a to b. Let us calculate $\mathrm{Shapley}(G, p_{ab}, e_i)$ for different edges $e_i$. Intuitively, we expect $e_1$ to have the highest value since it provides a direct path from a to b, while $e_2$ contributes to a path only in the presence of $e_3$, and $e_4$ enables a path only in the presence of both $e_5$ and $e_6$. We show that, indeed, it holds that $\mathrm{Shapley}(G, p_{ab}, e_1) > \mathrm{Shapley}(G, p_{ab}, e_2) > \mathrm{Shapley}(G, p_{ab}, e_4)$.
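To illustrate the calculation, observe that there are $2^5$ subsets of G that do not contain $e_1$, and among them, the subsets that satisfy $p_{ab}$ are the supersets of $\{e_2, e_3\}$ and $\{e_4, e_5, e_6\}$. Adding $e_1$ changes the result exactly for the remaining subsets, of which there are 1, 5, 9, and 6 of sizes 0, 1, 2, and 3, respectively. Hence, we have that:

$$\mathrm{Shapley}(G, p_{ab}, e_1) = \frac{1 \cdot 0!\,5! + 5 \cdot 1!\,4! + 9 \cdot 2!\,3! + 6 \cdot 3!\,2!}{6!} = \frac{35}{60}$$

Similarly, there are $2^5$ subsets of G that do not contain $e_2$, and among them, the subsets that satisfy $p_{ab}$ are the supersets of $\{e_1\}$ and $\{e_4, e_5, e_6\}$. Adding $e_2$ changes the result exactly for the subsets that contain $e_3$ and do not satisfy $p_{ab}$; there are 1, 3, and 3 of these of sizes 1, 2, and 3, respectively. Then,

$$\mathrm{Shapley}(G, p_{ab}, e_2) = \frac{1 \cdot 1!\,4! + 3 \cdot 2!\,3! + 3 \cdot 3!\,2!}{6!} = \frac{8}{60}$$

A similar reasoning shows that $\mathrm{Shapley}(G, p_{ab}, e_3) = \frac{8}{60}$. Finally, among the $2^5$ subsets of G that do not contain $e_4$, those that satisfy $p_{ab}$ are the supersets of $\{e_1\}$ and $\{e_2, e_3\}$. Adding $e_4$ changes the result exactly for the subsets that contain both $e_5$ and $e_6$ and do not satisfy $p_{ab}$ (one of size 2 and two of size 3), and it holds that:

$$\mathrm{Shapley}(G, p_{ab}, e_4) = \frac{1 \cdot 2!\,3! + 2 \cdot 3!\,2!}{6!} = \frac{3}{60}$$

Similarly, $\mathrm{Shapley}(G, p_{ab}, e_5) = \mathrm{Shapley}(G, p_{ab}, e_6) = \frac{3}{60}$.

Lastly, we consider aggregate functions over conjunctive queries.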
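The reachability values can be checked by brute force over the 32 subsets that exclude each edge. The sketch below assumes the shape of the graph that the text describes, namely three edge-disjoint paths from a to b consisting of the edges {e1}, {e2, e3}, and {e4, e5, e6}; the encoding of the query as a test on edge sets is an illustrative simplification, not the Datalog program itself.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

EDGES = ["e1", "e2", "e3", "e4", "e5", "e6"]
# Assumed from the text: the a-b paths of the example graph, as sets of edges.
PATHS = [{"e1"}, {"e2", "e3"}, {"e4", "e5", "e6"}]

def p_ab(edge_set):
    """1 if some a-b path is fully contained in edge_set, and 0 otherwise."""
    return int(any(path <= edge_set for path in PATHS))

def shapley(edge):
    """Shapley value of an edge for p_ab, via the weighted sum over subsets."""
    others = [e for e in EDGES if e != edge]
    n = len(EDGES)
    total = Fraction(0)
    for k in range(n):
        weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
        for B in combinations(others, k):
            B = set(B)
            total += weight * (p_ab(B | {edge}) - p_ab(B))
    return total
```

The six values add up to 1, the value of the query on the full graph, which is a quick consistency check against the efficiency property.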
Example 3.5. We consider the queries $\alpha_1$, $\alpha_2$, and $\alpha_4$ from Example 2.3. Ellen has no publications; hence, $\mathrm{Shapley}(D, \alpha_j, f^a_5) = 0$ for j ∈ {1, 2, 4}. The contribution of $f^a_1$ is the same in every permutation (20 for $\alpha_1$ and 2 for $\alpha_2$) since Alice is the single author of two published papers that have a total of 20 citations. Hence, $\mathrm{Shapley}(D, \alpha_1, f^a_1) = 20$ and $\mathrm{Shapley}(D, \alpha_2, f^a_1) = 2$. The total number of citations of Cathy's papers is also 20; however, Bob and David are her coauthors on paper C. Hence, if the fact $f^a_3$ appears before $f^a_2$ and $f^a_4$ in a permutation, its contribution to the query result is 20 for $\alpha_1$ and 2 for $\alpha_2$, while if $f^a_3$ appears after at least one of $f^a_2$ or $f^a_4$ in a permutation, its contribution is 12 for $\alpha_1$ and 1 for $\alpha_2$. Clearly, $f^a_3$ appears before both $f^a_2$ and $f^a_4$ in one-third of the permutations. Thus, we have that $\mathrm{Shapley}(D, \alpha_1, f^a_3) = \frac{1}{3} \cdot 20 + \frac{2}{3} \cdot 12 = \frac{44}{3}$ and $\mathrm{Shapley}(D, \alpha_2, f^a_3) = \frac{1}{3} \cdot 2 + \frac{2}{3} \cdot 1 = \frac{4}{3}$. Using similar computations we obtain that $\mathrm{Shapley}(D, \alpha_1, f^a_2) = \mathrm{Shapley}(D, \alpha_1, f^a_4) = \frac{8}{3}$ and $\mathrm{Shapley}(D, \alpha_2, f^a_2) = \mathrm{Shapley}(D, \alpha_2, f^a_4) = \frac{1}{3}$. We conclude that the Shapley value of Alice, who is the single author of two papers with a total of 20 citations, is higher than the Shapley value of Cathy, who also has two papers with a total of 20 citations but shares one paper with other authors. Bob and David have the same Shapley value, since they share a single paper, and this value is the lowest among the four, as they have the lowest number of papers and citations.
Finally, consider $\alpha_4$. The contribution of $f^a_1$ in this case depends on the maximum value before adding $f^a_1$ in the permutation (which can be 0, 8 or 12). For example, if $f^a_1$ is the first fact in the permutation, its contribution is 18 since $\alpha_4(\emptyset) = 0$. If $f^a_1$ appears after $f^a_3$, then its contribution is 6, since $\alpha_4(S) = 12$ whenever $f^a_3 \in S$ and $f^a_1 \notin S$. We have that $\mathrm{Shapley}(D, \alpha_4, f^a_1) = 10$, $\mathrm{Shapley}(D, \alpha_4, f^a_2) = \mathrm{Shapley}(D, \alpha_4, f^a_4) = 2$ and $\mathrm{Shapley}(D, \alpha_4, f^a_3) = 4$ (we omit the computations here). We see that the Shapley value of $f^a_1$ is much higher than the rest, since Alice significantly increases the maximum value when added to any prefix. If the number of citations of paper C increases to 16, then $\mathrm{Shapley}(D, \alpha_4, f^a_1) = 6$, hence lower. This is because the next highest value is closer; hence, the contribution of $f^a_1$ diminishes.
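The numbers in Example 3.5 can be reproduced by brute force over the 120 orderings of the five Author facts. The sketch below rests on several assumptions, since Figure 1 and Example 2.3 are not reproduced here: the queries are read as the total number of citations, the number of papers, and the maximum number of citations over papers with an author in the team; only paper C's data (8 citations, authors Bob, Cathy, David) is stated explicitly, and the remaining per-paper numbers (18 and 2 for Alice, 12 for Cathy's other paper) and the paper names A, B, and D are inferred from the totals and maxima mentioned in the text.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

AUTHORS = ["Alice", "Bob", "Cathy", "David", "Ellen"]
# paper -> (citations, authors); everything except paper C is an inferred assumption.
PAPERS = {
    "A": (18, {"Alice"}),
    "B": (2, {"Alice"}),
    "C": (8, {"Bob", "Cathy", "David"}),
    "D": (12, {"Cathy"}),
}

def covered(team, papers):
    """Citation counts of the papers that have at least one author in the team."""
    return [cites for cites, authors in papers.values() if authors & team]

def alpha_sum(team, papers=PAPERS):
    return sum(covered(team, papers))

def alpha_count(team, papers=PAPERS):
    return len(covered(team, papers))

def alpha_max(team, papers=PAPERS):
    return max(covered(team, papers), default=0)

def shapley(alpha, author, papers=PAPERS):
    """Average marginal contribution of the author's fact over all orderings."""
    total = Fraction(0)
    for sigma in permutations(AUTHORS):
        before = set(sigma[:sigma.index(author)])
        total += alpha(before | {author}, papers) - alpha(before, papers)
    return total / factorial(len(AUTHORS))
```

Passing a modified paper table, for instance `dict(PAPERS, C=(16, {"Bob", "Cathy", "David"}))`, replays the variation in which paper C has 16 citations.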

Complexity Results
In this section, we give complexity results on the computation of the Shapley value of facts. We begin with exact evaluation for Boolean CQs (Section 4.1), then move on to exact evaluation on aggregate-relational queries (Section 4.2), and finally discuss approximate evaluation (Section 4.3). In the first two parts we restrict the discussion to CQs without self-joins, and leave the problems open in the presence of self-joins. However, the approximate treatment in the third part covers the general class of CQs (and beyond).

Boolean Conjunctive Queries.
We investigate the problem of computing the (exact) Shapley value w.r.t. a Boolean CQ without self-joins. Our main result in this section is a full classification of (i.e., a dichotomy in) the data complexity of the problem. As we show, the classification criterion is the same as that of query evaluation over tuple-independent probabilistic databases [DS04]: hierarchical CQs without self-joins are tractable, and nonhierarchical ones are intractable.
Theorem 4.1. Let q be a Boolean CQ without self-joins. If q is hierarchical, then computing Shapley(D, q, f) can be done in polynomial time, given D and f as input. Otherwise, the problem is $\mathrm{FP}^{\#\mathrm{P}}$-complete.
Recall that $\mathrm{FP}^{\#\mathrm{P}}$ is the class of functions computable in polynomial time with an oracle to a problem in #P (e.g., counting the number of satisfying assignments of a propositional formula). This complexity class is considered intractable, and is known to be above the polynomial hierarchy (Toda's theorem [Tod91]).
Example 4.2. Consider the query $q_1$ from Example 2.2. This query is hierarchical; hence, by Theorem 4.1, $\mathrm{Shapley}(D, q_1, f)$ can be calculated in polynomial time, given D and f. On the other hand, the query $q_2$ is not hierarchical. Thus, Theorem 4.1 asserts that computing $\mathrm{Shapley}(D, q_2, f)$ is $\mathrm{FP}^{\#\mathrm{P}}$-complete.

In the rest of this subsection, we discuss the proof of Theorem 4.1. While the tractability condition is the same as that of Dalvi and Suciu [DS04], it is not clear whether and how we can use their dichotomy to prove ours, in each of the two directions (tractability and hardness). The difference is mainly in that they deal with a random subset of probabilistically independent (endogenous) facts, whereas we reason about random permutations over the facts. We start by discussing the algorithm for computing the Shapley value in the hierarchical case, and then we discuss the proof of hardness for the non-hierarchical case.

4.1.1. Tractability side. Let D be a database, let f be an endogenous fact, and let q be a Boolean query. The computation of Shapley(D, q, f) easily reduces to the problem of counting the k-sets (i.e., sets of size k) of endogenous facts that, along with the exogenous facts, satisfy q. More formally, the reduction is to the problem of computing |Sat(D, q, k)|, where Sat(D, q, k) is the set of all subsets E of $D_n$ such that |E| = k and $(D_x \cup E) \models q$. The reduction is based on the following formula, where we denote $m = |D_n|$ and slightly abuse the notation by viewing q as a 0/1-numerical query that evaluates to 1 on a database if and only if the database satisfies q.

$$\begin{aligned}
\mathrm{Shapley}(D, q, f) &= \sum_{E \subseteq D_n \setminus \{f\}} \frac{|E|!\,(m - |E| - 1)!}{m!} \big( q(D_x \cup E \cup \{f\}) - q(D_x \cup E) \big) \\
&= \sum_{k=0}^{m-1} \frac{k!\,(m - k - 1)!}{m!} \big( |\mathrm{Sat}(D', q, k)| - |\mathrm{Sat}(D \setminus \{f\}, q, k)| \big)
\end{aligned} \qquad (4.1)$$

In the last expression, D′ is the same as D, except that f is viewed as exogenous instead of endogenous. Hence, to prove the positive side of Theorem 4.1, it suffices to show the following.
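To make the reduction concrete, the following Python sketch computes the Shapley value from the two families of counts, the one with f treated as exogenous and the one with f removed. It is an illustration only: the counts are obtained here by brute-force enumeration (a hypothetical helper `sat_counts`), whereas the point of the reduction is that any polynomial-time counting procedure can be plugged in instead. Facts are arbitrary hashable objects, and the query is passed as a Python predicate on sets of facts; both are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def sat_counts(endo, exo, holds):
    """counts[k] = number of k-subsets E of endo such that holds(exo ∪ E) is true."""
    endo = list(endo)
    return [sum(bool(holds(exo | set(E))) for E in combinations(endo, k))
            for k in range(len(endo) + 1)]

def shapley_from_counts(endo, exo, holds, f):
    m = len(endo)
    rest = set(endo) - {f}
    with_f = sat_counts(rest, exo | {f}, holds)   # counts for D', where f is exogenous
    without_f = sat_counts(rest, exo, holds)      # counts for D with f removed
    return sum(Fraction(factorial(k) * factorial(m - k - 1), factorial(m))
               * (with_f[k] - without_f[k]) for k in range(m))

# The first query of the running example, with authors endogenous and publications
# exogenous; the pairs follow the authorship described in the text.
AUTHOR_FACTS = {"a1", "a2", "a3", "a4", "a5"}
PUB_FACTS = {"p1", "p2", "p3", "p4", "p5", "p6"}
PAIRS = [("a1", "p1"), ("a1", "p2"), ("a2", "p3"),
         ("a3", "p4"), ("a3", "p5"), ("a4", "p6")]

def q1_holds(facts):
    return any(a in facts and p in facts for a, p in PAIRS)
```

With these definitions, `shapley_from_counts(AUTHOR_FACTS, PUB_FACTS, q1_holds, "a1")` gives 1/4, in agreement with Example 3.3.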
Theorem 4.3. Let q be a hierarchical Boolean CQ without self-joins. There is a polynomial-time algorithm for computing the number |Sat(D, q, k)| of subsets E of $D_n$ such that |E| = k and $(D_x \cup E) \models q$, given D and k as input.
To prove Theorem 4.3, we show a polynomial-time algorithm for computing |Sat(D, q, k)| for q as in the theorem. The pseudocode is depicted in Figure 2.
We assume in the algorithm that D n contains only facts that are homomorphic images of atoms of q (i.e., facts f such that there is a mapping from an atom of q to f ). In the terminology of Conitzer and Sandholm [CS04], regarding the computation of the Shapley value, the function defined by q concerns only the subset C of D n consisting of these facts (i.e., the satisfaction of q by any subset of D does not change if we intersect with C), and so, the Shapley value of every fact in D n \ C is zero and the Shapley value of any other fact is unchanged when ignoring D n \ C [CS04, Lemma 4]. Moreover, these facts can be found in polynomial time.
As expected for a hierarchical query, our algorithm is a recursive procedure that acts differently in three different cases: (a) q has no variables (only constants), (b) there is a root variable x, that is, x occurs in all atoms of q, or (c) q consists of two (or more) subqueries that do not share any variables. Since q is hierarchical, at least one of these cases always applies [DS12].
In the first case (lines 1-7), every atom a of q can be viewed as a fact. Clearly, if one of the facts in q is not present in D, then there is no subset E of D n of any size such that (D x ∪ E) |= q, and the algorithm will return 0. Otherwise, suppose that A is the set of endogenous facts of q (and the remaining atoms of q, if any, are exogenous). Due to our assumption that every fact of D n is a homomorphic image of an atom of q, the single choice of a subset of facts that makes the query true is A; therefore, the algorithm returns 1 if k = |A| and 0 otherwise.
Next, we consider the case where q has a root variable x (lines 9-21). We denote by $V_x$ the set $\{v_1, \dots, v_n\}$ of values that D has in attributes that correspond to an occurrence of x. For example, if q contains the atom R(x, y, x) and D contains a fact R(a, b, a), then a is one of the values in $V_x$. We also denote by $q_{[x \to v_i]}$ the query that is obtained from q by substituting $v_i$ for x, and by $D^{v_i}$ the subset of D that consists of facts with the value $v_i$ in every attribute where x occurs in q.
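We solve the problem for this case using a simple dynamic program. We denote by $P_i^{\ell}$ the number of subsets of size ℓ of $\bigcup_{r=1}^{i} D_n^{v_r}$ that satisfy the query (together with the exogenous facts in $\bigcup_{r=1}^{i} D_x^{v_r}$). Our goal is to find $P_n^{k}$, which is the number of subsets E of size k of $\bigcup_{r=1}^{n} D_n^{v_r}$ that satisfy the query. Note that this union is precisely $D_n$, due to our assumption that $D_n$ contains only facts that can be obtained from atoms of q via an assignment to the variables. First, we compute, for each value $v_i$, and for each j ∈ {0, . . . , k}, the number $f_{i,j} = |\mathrm{Sat}(D^{v_i}, q_{[x \to v_i]}, j)|$ using a recursive call. In the recursive call, we replace q with $q_{[x \to v_i]}$, as $D^{v_i}$ contains only facts that use the value $v_i$ for the variable x; hence, we can reduce the number of variables in q by substituting x with $v_i$. Then, for each ℓ ∈ {0, . . . , k} it clearly holds that $P_1^{\ell} = f_{1,\ell}$. For each i ∈ {2, . . . , $|V_x|$} and ℓ ∈ {0, . . . , k}, we compute $P_i^{\ell}$ in the following way. Each subset E of size ℓ of $\bigcup_{r=1}^{i} D_n^{v_r}$ contains a set $E_1$ of size j of facts from $D_n^{v_i}$ (for some j ∈ {0, . . . , ℓ}) and a set $E_2$ of size ℓ − j of facts from $\bigcup_{r=1}^{i-1} D_n^{v_r}$. If the subset E satisfies the query, then precisely one of the following holds:

(1) $E_1$ satisfies the query (together with the exogenous facts of $D^{v_i}$), but $E_2$ does not;
(2) $E_2$ satisfies the query (together with the exogenous facts of $\bigcup_{r=1}^{i-1} D^{v_r}$), but $E_1$ does not;
(3) both $E_1$ and $E_2$ satisfy the query.

Hence, for each j we add to $P_i^{\ell}$ the value
$$f_{i,j} \cdot \left( \binom{\left|\bigcup_{r=1}^{i-1} D_n^{v_r}\right|}{\ell - j} - P_{i-1}^{\ell - j} \right)$$
that corresponds to Case (1), the value
$$P_{i-1}^{\ell - j} \cdot \left( \binom{|D_n^{v_i}|}{j} - f_{i,j} \right)$$
that corresponds to Case (2), and the value $f_{i,j} \cdot P_{i-1}^{\ell - j}$ that corresponds to Case (3). Note that we have all the values $P_{i-1}^{\ell - j}$ from the previous iteration of the for loop of line 16.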
Finally, we consider the case where q has two nonempty subqueries q 1 and q 2 with disjoint sets of variables (lines 23-26). For j ∈ {1, 2}, we denote by D j the set of facts from D that appear in the relations of q j . (Recall that q has no self-joins; hence, every relation can appear in either q 1 or q 2 , but not in both.) Every subset E of D that satisfies q must contain a subset E 1 of D 1 that satisfies q 1 and a subset E 2 of D 2 satisfying q 2 . Therefore, to compute |Sat(D, q, k)|, we consider every pair (k 1 , k 2 ) of natural numbers such that k 1 + k 2 = k, compute |Sat(D 1 , q 1 , k 1 )| and |Sat(D 2 , q 2 , k 2 )| via a recursive call, and add the product of the two to the result.
The correctness and efficiency of CntSat is stated in the following lemma.
Lemma 4.4. Let q be a hierarchical Boolean CQ without self-joins. Then, CntSat(D, q, k) returns the number |Sat(D, q, k)| of subsets E of D n such that |E| = k and D x ∪ E |= q, given D and k as input. Moreover, CntSat(D, q, k) terminates in polynomial time in k and |D|.
We have already established the correctness of the algorithm. Thus, we now consider the complexity claim of Lemma 4.4. The number of recursive calls in each step is polynomial in k and |D|. In particular, in the dynamic programming part of the algorithm (lines 12-20), we make $(k + 1) \cdot |V_x|$ recursive calls. Clearly, it holds that $|V_x| \le |D|$. Furthermore, we make 2(k + 1) recursive calls in lines 23-26. Finally, in each recursive call, we reduce the number of variables in q by at least one. Thus, the depth of the recursion is bounded by the number of variables in the query q, which is a constant when considering data complexity.
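The three cases of the recursion can be put together in a compact Python sketch. It is not the pseudocode of Figure 2 but an illustrative rendering under assumed conventions: facts and atoms are pairs (relation name, tuple), variables are strings starting with "?", and, as in the text, every endogenous fact is assumed to be a homomorphic image of an atom of the query. The names `cnt_sat`, `matches`, and `substitute` are hypothetical.

```python
from math import comb

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def matches(atom, fact):
    """True if fact is a homomorphic image of atom."""
    (rel, terms), (frel, consts) = atom, fact
    if rel != frel or len(terms) != len(consts):
        return False
    h = {}
    for t, c in zip(terms, consts):
        if is_var(t):
            if h.setdefault(t, c) != c:
                return False
        elif t != c:
            return False
    return True

def substitute(q, x, v):
    return [(rel, tuple(v if t == x else t for t in terms)) for rel, terms in q]

def cnt_sat(q, endo, exo, k):
    """Number of k-subsets E of endo with exo ∪ E |= q, for a hierarchical
    self-join-free Boolean CQ q (assumes endo holds only images of atoms of q)."""
    variables = {t for _, terms in q for t in terms if is_var(t)}
    if not variables:                       # case (a): every atom is a fact
        facts = set(q)
        if not facts <= (endo | exo):
            return 0
        return int(k == len(facts & endo))
    roots = [x for x in variables if all(x in terms for _, terms in q)]
    if roots:                               # case (b): root variable x
        x = roots[0]
        values = set()
        for rel, terms in q:
            pos = terms.index(x)
            values |= {c[pos] for r, c in endo | exo if r == rel}
        P, seen = [0] * (k + 1), 0          # counts over the blocks handled so far
        for v in values:
            qv = substitute(q, x, v)
            endo_v = {f for f in endo if any(matches(a, f) for a in qv)}
            exo_v = {f for f in exo if any(matches(a, f) for a in qv)}
            f = [cnt_sat(qv, endo_v, exo_v, j) for j in range(k + 1)]
            new = [0] * (k + 1)
            for l in range(k + 1):
                for j in range(l + 1):
                    prev = P[l - j]
                    new[l] += (f[j] * (comb(seen, l - j) - prev)        # only the new block
                               + prev * (comb(len(endo_v), j) - f[j])   # only earlier blocks
                               + f[j] * prev)                           # both
            P, seen = new, seen + len(endo_v)
        return P[k]
    q1, grew = [q[0]], True                 # case (c): variable-disjoint subqueries
    while grew:
        vars1 = {t for _, ts in q1 for t in ts if is_var(t)}
        more = [a for a in q if a not in q1 and vars1 & set(a[1])]
        q1, grew = q1 + more, bool(more)
    q2 = [a for a in q if a not in q1]
    if not q2:
        raise ValueError("query is not hierarchical")
    rels1, rels2 = {r for r, _ in q1}, {r for r, _ in q2}
    e1, x1 = {f for f in endo if f[0] in rels1}, {f for f in exo if f[0] in rels1}
    e2, x2 = {f for f in endo if f[0] in rels2}, {f for f in exo if f[0] in rels2}
    return sum(cnt_sat(q1, e1, x1, k1) * cnt_sat(q2, e2, x2, k - k1)
               for k1 in range(k + 1))
```

The sketch recomputes the counts for each size separately, as the description above does; returning the whole vector of counts from one call would avoid repeated work but is not needed for polynomial running time on a fixed query.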
Example 4.5. We now illustrate the execution of CntSat(D, q, k) on the database D of Figure 3, the query q() :-R(x, y), S(x, z), T (w, w), U (w) and k = 4. We assume that all facts in D are endogenous. Since q does not have a root variable, the condition of line 9 does not hold. Hence, we start by considering the two disjoint sub-queries q 1 () :-R(x, y), S(x, z) and q 2 () :-T (w, w), U (w) in line 23, and the corresponding databases D 1 that contains the relations R and S and D 2 that contains the relations T and U . Note that q 1 and q 2 indeed do not share any variables. Each set of facts that satisfies q contains four facts of the form R(a, b), S(a, c), T (d, d) and U (d) for some values a, b, c, d. Clearly, it holds that {R(a, b), S(a, c)} |= q 1 and {T (d, d), U (d)} |= q 2 ; thus, we compute CntSat(D, q, 4) using 10 (that is, 2(k + 1)) recursive calls to CntSat.
Now, $q_1$ contains a root variable x; thus, in each recursive call with the query $q_1$, the condition of line 9 holds. We will illustrate the execution of this part of the algorithm using CntSat($D_1$, $q_1$, 3). Note that in a homomorphism from R(x, y) to $D_1$, the variable x is mapped to one of three values, namely 1, 2, or 3. Similarly, in a homomorphism from S(x, z) to $D_1$, the variable x is mapped to either 1 or 2. Hence, it holds that $V_x = \{1, 2, 3\}$.
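For each value $a_i$ in $V_x$ (where $a_1 = 1$, $a_2 = 2$, $a_3 = 3$), we consider the query $q_{[x \to a_i]}$, which is R($a_i$, y), S($a_i$, z), and the database $D^{a_i}$ containing the facts that use the value $a_i$ for the variable x. That is, the database $D^1$ contains the facts {R(1, 2), R(1, 3), S(1, 1), S(1, 5)}, the database $D^2$ contains the facts {R(2, 1), S(2, 3), S(2, 4)}, and the database $D^3$ contains the fact {R(3, 1)}. Then, for each one of the three values, and for each j ∈ {0, . . . , 3}, we compute the number $f_{i,j}$ of subsets of size j of $D^{a_i}$ that satisfy $q_{[x \to a_i]}$, using the recursive call CntSat($D^{a_i}$, $q_{[x \to a_i]}$, j). Since a satisfying subset must contain at least one R fact and at least one S fact, the reader can easily verify that the following holds.

$$\begin{array}{llll}
f_{1,0} = 0 & f_{1,1} = 0 & f_{1,2} = 4 & f_{1,3} = 4 \\
f_{2,0} = 0 & f_{2,1} = 0 & f_{2,2} = 2 & f_{2,3} = 1 \\
f_{3,0} = 0 & f_{3,1} = 0 & f_{3,2} = 0 & f_{3,3} = 0
\end{array}$$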
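Recall that $D^1$ contains four facts and $D^2$ contains three facts, and that $P_1^{\ell} = f_{1,\ell}$; in particular, $P_1^0 = P_1^1 = 0$ and $P_1^2 = 4$. Hence, we have the following computations for ℓ = 0.

$$\begin{aligned}
P_2^0 &= 0 \\
P_2^0 &\mathrel{+}= f_{2,0} \cdot \left( \tbinom{4}{0} - P_1^0 \right) + P_1^0 \cdot \left( \tbinom{3}{0} - f_{2,0} \right) + f_{2,0} \cdot P_1^0 = 0 \cdot (1 - 0) + 0 \cdot (1 - 0) + 0 \cdot 0 = 0
\end{aligned}$$

In the first line we initialize $P_2^0$. Then, in the second line, we consider j = 0, which is the only possible j in this case. Next, for ℓ = 1, we compute the following.

$$\begin{aligned}
P_2^1 &= 0 \\
P_2^1 &\mathrel{+}= f_{2,0} \cdot \left( \tbinom{4}{1} - P_1^1 \right) + P_1^1 \cdot \left( \tbinom{3}{0} - f_{2,0} \right) + f_{2,0} \cdot P_1^1 = 0 \cdot (4 - 0) + 0 \cdot (1 - 0) + 0 \cdot 0 = 0 \\
P_2^1 &\mathrel{+}= f_{2,1} \cdot \left( \tbinom{4}{0} - P_1^0 \right) + P_1^0 \cdot \left( \tbinom{3}{1} - f_{2,1} \right) + f_{2,1} \cdot P_1^0 = 0 \cdot (1 - 0) + 0 \cdot (3 - 0) + 0 \cdot 0 = 0
\end{aligned}$$

Here, in the second line, we consider j = 0 (i.e., choosing zero facts from $D^2$ and one fact from $D^1$), and in the third line we consider j = 1 (i.e., choosing one fact from $D^2$ and zero facts from $D^1$). Next, we have ℓ = 2.

$$\begin{aligned}
P_2^2 &= 0 \\
P_2^2 &\mathrel{+}= f_{2,0} \cdot \left( \tbinom{4}{2} - P_1^2 \right) + P_1^2 \cdot \left( \tbinom{3}{0} - f_{2,0} \right) + f_{2,0} \cdot P_1^2 = 0 \cdot (6 - 4) + 4 \cdot (1 - 0) + 0 \cdot 4 = 4 \\
P_2^2 &\mathrel{+}= f_{2,1} \cdot \left( \tbinom{4}{1} - P_1^1 \right) + P_1^1 \cdot \left( \tbinom{3}{1} - f_{2,1} \right) + f_{2,1} \cdot P_1^1 = 0 \cdot (4 - 0) + 0 \cdot (3 - 0) + 0 \cdot 0 = 0 \\
P_2^2 &\mathrel{+}= f_{2,2} \cdot \left( \tbinom{4}{0} - P_1^0 \right) + P_1^0 \cdot \left( \tbinom{3}{2} - f_{2,2} \right) + f_{2,2} \cdot P_1^0 = 2 \cdot (1 - 0) + 0 \cdot (3 - 2) + 2 \cdot 0 = 2
\end{aligned}$$

Hence, $P_2^2 = 6$: four satisfying pairs come from $D^1$ and two from $D^2$.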
Finally, we illustrate the base case of the algorithm (that is, lines 1-7). To do that, we use the recursive call CntSat(D 2 , q 2 , 3) from the first step of the execution. Recall that q 2 () :-T (w, w), U (w) and D 2 contains all the facts in T and U . The query q 2 contains a single variable w. In a homomorphism from T (w, w) to D 2 , this variable is mapped to one of three values, namely 1, 2, or 3. Note that there is no homomorphism from T (w, w) to the fact T (5, 6); hence, the values 5 and 6 are not in V w . In addition, in a homomorphism from U (w) to D 2 , the variable w is mapped to one of 1, 2, 3, or 4; thus, V w = {1, 2, 3, 4}.
In every recursive call, we will substitute one of the values in $V_w$ for w. One of the recursive calls will be CntSat($D_2^1$, $q_{2[w \to 1]}$, 2), where $q_{2[w \to 1]}()$ :- T(1, 1), U(1). Here, $D_2^1$ contains every atom of the query, and k = |A|; hence, the recursive call will return 1. On the other hand, the result of CntSat($D_2^1$, $q_{2[w \to 1]}$, 3) will be zero, as there are only two facts in $D_2^1$, while k = 3. The result of CntSat($D_2^1$, $q_{2[w \to 1]}$, 1) will also be zero, since in this case k = 1 and |A| = 2; thus, k < |A|. Finally, for the recursive call CntSat($D_2^4$, $q_{2[w \to 4]}$, 2), where $q_{2[w \to 4]}()$ :- T(4, 4), U(4), the result will be zero, as the fact T(4, 4) is not in the database.

4.1.2. Hardness side. We now give the proof of the hardness side of Theorem 4.1. Membership in $\mathrm{FP}^{\#\mathrm{P}}$ is straightforward since, as shown in Equation (4.1), the Shapley value can be computed in polynomial time given an oracle to the problem of counting the number of subsets $E \subseteq D_n$ of size k such that $(D_x \cup E) \models q$, and this problem is in #P. Similarly to Dalvi and Suciu [DS04], our proof of hardness consists of two steps. First, we prove the $\mathrm{FP}^{\#\mathrm{P}}$-hardness of computing $\mathrm{Shapley}(D, q_{RST}, f)$, where $q_{RST}$ is given in (2.1). Second, we reduce the computation of $\mathrm{Shapley}(D, q_{RST}, f)$ to the problem of computing Shapley(D, q, f) for any non-hierarchical CQ q without self-joins. The second step is the same as that of Dalvi and Suciu [DS04], and we will give the proof here for completeness. The proof of the first step, namely the hardness of computing $\mathrm{Shapley}(D, q_{RST}, f)$ (stated by Lemma 4.6), is considerably more involved than the corresponding proof of Dalvi and Suciu [DS04] that computing the probability of $q_{RST}$ in a tuple-independent probabilistic database (TID) is $\mathrm{FP}^{\#\mathrm{P}}$-hard. This is due to the coefficients of the Shapley value, which do not seem to factor out easily.

Proof.
The proof is by a (Turing) reduction from the problem of computing the number |IS(g)| of independent sets of a given bipartite graph g, which is the same (via immediate reductions) as the problem of computing the number of satisfying assignments of a bipartite monotone 2-DNF formula, which we denote by #biSAT. Dalvi and Suciu [DS04] also proved the hardness of q RST (for the problem of query evaluation over TIDs) by reduction from #biSAT. Their reduction is a simple construction of a single input database, followed by a multiplication of the query probability by a number. It is not at all clear to us how such an approach can work in our case and, indeed, our proof is more involved. Our reduction takes the general approach that Dalvi and Suciu [DS12] used (in a different work) for proving that the CQ q() :-R(x, y), R(y, z) is hard over TIDs: solve several instances of the problem for the construction of a full-rank set of linear equations. The problem itself, however, is quite different from ours. This general technique has also been used by Aziz et al. [AdK14] for proving the hardness of computing the Shapley value for a matching game on unweighted graphs, which is again quite different from our problem.
In more detail, the idea is as follows. Given an input bipartite graph g = (V, E) for which we wish to compute |IS(g)|, we construct n + 2 different input instances (D j , f ), for j = 0, . . . , n + 1, of the problem of computing Shapley(D j , q RST , f ), where n = |V |. Each instance provides us with an equation over the numbers |IS(g, k)| of independent sets of size k in g for k = 0, . . . , n. We then show that the set of equations constitutes a non-singular matrix that, in turn, allows us to extract the |IS(g, k)| in polynomial time (e.g., via Gaussian elimination). This is enough, since $|IS(g)| = \sum_{k=0}^{n} |IS(g, k)|$.
Our reduction is illustrated in Figure 4. Given the graph g (depicted in the leftmost part), we construct n + 2 graphs by adding new vertices and edges to g. For each such graph, we build a database that contains an endogenous fact R(v) for every left vertex v, an endogenous fact T (u) for every right vertex u, and an exogenous fact S(v, u) for every edge (v, u). In each constructed database D j , the fact f = R(0) represents a new left node, and we compute Shapley(D j , q RST , f ). In D 0 , the node of f is connected to every right vertex (i.e., we add an exogenous fact S(0, u) for every right vertex u). We use this database to compute a specific value (from the Shapley value of f ), as we explain next.
Instead of directly computing the Shapley value of f , we compute the complement of the Shapley value. To do that, we consider the permutations σ where f does not affect the query result. This holds in one of two cases:
(1) No fact of T appears before f in σ.
(2) At least one pair {R(v), T (u)} of facts, such that there is a fact S(v, u) in D 0 , appears in σ before f .
The number of permutations where the first case holds is:
$$P_0^1 = \frac{(n+1)!}{n_T + 1}$$
where $n_T$ is the number of vertices on the right-hand side of the graph g (namely, the number of facts in T ). This holds since each one of the facts of T and the fact f have an equal chance to be selected first (among these facts) in a random permutation of the n + 1 endogenous facts of $D_0$. We are looking for the permutations where f is chosen before any fact of T ; hence, we are looking at $1/(n_T + 1)$ of all permutations.
Now, we compute the number of permutations σ where the second case holds. To do that, we have to count the permutations σ where $\sigma_f$ corresponds to a set of vertices from g (the original graph) that is not an independent set. Let us denote by |NIS(g, k)| the number of subsets of vertices of size k from g that are not independent sets. Then, the number of permutations satisfying the above is:
$$P_0^2 = \sum_{k=2}^{n} |NIS(g, k)| \cdot k! \cdot (n-k)!$$
Recall that the fact f corresponds to a new vertex that does not occur in the original bipartite graph g; hence, for each k, we have k facts that appear before f in the permutation and n + 1 − k − 1 = n − k facts that appear after f .
We can now express the Shapley value of f in terms of $P_0^1$ and $P_0^2$:
$$Shapley(D_0, q_{RST}, f) = 1 - \frac{P_0^1 + P_0^2}{(n+1)!}$$
Then, the value $P_0^2$ can be computed from $Shapley(D_0, q_{RST}, f)$ using the following formula.
$$P_0^2 = \bigl(1 - Shapley(D_0, q_{RST}, f)\bigr) \cdot (n+1)! - P_0^1$$
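The counting argument for D_0 can be checked by brute force on a small graph. The sketch below is only an illustration under our reading of the text: it assumes P_0^1 = (n+1)!/(n_T+1) and P_0^2 = Σ_k |NIS(g, k)| · k! · (n−k)!, and compares 1 − (P_0^1 + P_0^2)/(n+1)! with a direct enumeration of all permutations. The function names and the encoding of facts as tuples are hypothetical, not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial


def satisfies(facts, edges):
    """q_RST() :- R(x), S(x, y), T(y); the S-facts (edges) are exogenous."""
    return any(("R", v) in facts and ("T", u) in facts for v, u in edges)


def shapley_bruteforce(left, right, edges):
    """Shapley value of f = R(0) in D_0 by enumerating all permutations."""
    f = ("R", 0)  # assumption: 0 is not used as a vertex name in `left`
    all_edges = set(edges) | {(0, u) for u in right}  # f is linked to every right vertex
    players = [("R", v) for v in left] + [("T", u) for u in right] + [f]
    hits = 0
    for sigma in permutations(players):
        before = set(sigma[:sigma.index(f)])
        if not satisfies(before, all_edges) and satisfies(before | {f}, all_edges):
            hits += 1
    return Fraction(hits, factorial(len(players)))


def shapley_closed_form(left, right, edges):
    """1 - (P_0^1 + P_0^2) / (n+1)!, following the two cases in the text."""
    n, n_t = len(left) + len(right), len(right)
    vertices = [("R", v) for v in left] + [("T", u) for u in right]
    p1 = Fraction(factorial(n + 1), n_t + 1)
    p2 = 0
    for k in range(2, n + 1):
        # |NIS(g, k)|: vertex subsets of size k that contain both ends of an edge
        nis = sum(1 for s in combinations(vertices, k) if satisfies(set(s), edges))
        p2 += nis * factorial(k) * factorial(n - k)
    return 1 - (p1 + p2) / factorial(n + 1)
```

On a graph with two left vertices, three right vertices and three edges, both functions enumerate at most 720 permutations, so the comparison is immediate; the brute-force version is exponential and is meant only as a sanity check.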
We will use this value later in our proof. Next, for j = 1, . . . , n + 1, we construct a database $D_j$ that is obtained from g by adding f = R(0) and facts $T(0_1), \ldots, T(0_j)$. Note that if the first condition holds, then it does not matter if we choose a fact from the set $\{T(0_1), \ldots, T(0_j)\}$ before choosing f or not (that is, the fact f will not affect the query result regardless of the positions of these facts). Hence, we first ignore these facts, and compute the number of permutations of the rest of the facts that satisfy the first condition: $\sum_{k=2}^{n} |NIS(g, k)| \cdot k! \cdot (n-k)!$. From each such permutation σ, we can then generate $m_j$ permutations of all the n + j + 1 facts in $D_j$ by considering all the $m_j$ possibilities to add the facts of $\{T(0_1), \ldots, T(0_j)\}$ to the permutation. Note that this is the same $m_j$ for each permutation, and it holds that $m_j = \binom{n+j+1}{j} \cdot j!$ (i.e., we select j positions for the facts of $\{T(0_1), \ldots, T(0_j)\}$ and place them in these positions in one of j! possible permutations, while placing the rest of the facts in the remaining positions in the order defined by σ). Moreover, using this procedure we cover all the permutations of the facts in $D_j$ that satisfy the first condition, since for each one of them there is a single corresponding permutation of the facts in $D_j \setminus \{T(0_1), \ldots, T(0_j)\}$. Hence, the number of permutations of the facts in $D_j$ that satisfy the first property is
$$m_j \cdot \sum_{k=2}^{n} |NIS(g, k)| \cdot k! \cdot (n-k)! = m_j \cdot P_0^2$$
Recall that we have seen earlier that the value $P_0^2$ can be computed from $Shapley(D_0, q_{RST}, f)$. Next, we compute the number of permutations that satisfy the second property: This holds since each permutation σ where $\sigma_f$ does not contain any fact $T(0_j)$ and any pair {R(v), T (u)} of facts such that there is a fact S(v, u) in $D_j$, corresponds to an independent set of g. Hence, for each j = 1, . . . , n + 1 we get an equation of the form: And we can compute $P_j$ from $Shapley(D_j, q_{RST}, f)$ in the following way.
By an elementary algebraic manipulation of A (i.e., dividing each column j by the constant j! and reversing the order of the columns), we obtain the matrix with the coefficients $a_{i,j} = (i + j + 1)!$ that Bacher [Bac02] proved to be non-singular (and, in fact, that $\prod_{i=0}^{n-1} i!\,(i+1)!$ is its determinant). We then solve the system as discussed earlier to obtain |IS(g, k)| for each k, and, consequently, compute the value $|IS(g)| = \sum_{k=0}^{n} |IS(g, k)|$.
Finally, we show that computing Shapley(D, q, f ) is hard for any non-hierarchical Boolean CQ q without self-joins, by constructing a reduction from the problem of computing Shapley(D, q RST , f ). As mentioned above, our reduction is very similar to the corresponding reduction of Dalvi and Suciu [DS04], and we give it here for completeness. We will also use this result in Section 5.
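The determinant formula attributed to Bacher can be checked numerically for small sizes. The following sketch assumes an n × n matrix with zero-based indices and entries (i + j + 1)!; it computes the determinant exactly with rational Gaussian elimination and compares it with the product of i!(i + 1)! for i = 0, ..., n − 1. The helper names are ours and serve only as an illustration.

```python
from fractions import Fraction
from math import factorial, prod


def det_exact(matrix):
    """Exact determinant via Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        # Find a row with a non-zero entry in this column to use as pivot.
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def factorial_matrix(n):
    """The n x n matrix with entries a[i][j] = (i + j + 1)!, i, j = 0..n-1."""
    return [[factorial(i + j + 1) for j in range(n)] for i in range(n)]


def bacher_determinant(n):
    """Product of i! * (i+1)! for i = 0..n-1."""
    return prod(factorial(i) * factorial(i + 1) for i in range(n))
```

Since the product is never zero, agreement on small n is consistent with the non-singularity that the proof relies on; it is of course not a substitute for the cited result.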
Lemma 4.7. Let q be a non-hierarchical Boolean CQ without self-joins. Then, computing Shapley(D, q, f ) is FP^{#P}-hard.
Proof. We build a reduction from the problem of computing Shapley(D, q RST , f ) to the problem of computing Shapley(D, q, f ). Since q is not hierarchical, there exist two variables x, y ∈ Vars(q), such that A_x ∩ A_y ≠ ∅, while A_x ⊈ A_y and A_y ⊈ A_x; hence, we can choose three atoms α_x, α_y and α_{x,y} in q such that:
• x ∈ Vars(α_x) and y ∉ Vars(α_x)
• y ∈ Vars(α_y) and x ∉ Vars(α_y)
• x, y ∈ Vars(α_{x,y})
Recall that Vars(α) is the set of variables that appear in the atom α.
Given an input database D to the first problem, we build an input database D′ to our problem in the following way. Let c be an arbitrary constant that does not occur in D. For each fact R(a) and for each atom α ∈ A_x \ A_y, we generate a fact f′ over the relation corresponding to α by assigning the value a to the variable x and the value c to the rest of the variables in α. We then add the corresponding facts to D′. We define each new fact in Similarly, for each fact T (b) and for each atom α ∈ A_y \ A_x, we generate a fact f′ over the relation corresponding to α by assigning the value b to the variable y and the value c to the rest of the variables in α. Moreover, for each fact S(a, b) and for each atom α ∈ A_x ∩ A_y, we generate a fact f′ over the relation corresponding to α by assigning the value a to x, the value b to y and the value c to the rest of the variables in α. In both cases, we define the new facts in α_x and α_{x,y} to be endogenous if and only if the original fact is endogenous, and we define the rest of the facts to be exogenous. Finally, for each atom α in q that does not use the variables x and y (that is, α ∉ A_x ∪ A_y), we add a single exogenous fact R_α(c, . . . , c) to the relation R_α corresponding to α.
We will now show that the Shapley value of each fact R(a) in D w.r.t. q_RST is equal to the Shapley value of the corresponding fact f′ over the relation of α_x in D′ (i.e., the fact in the relation of α_x that has been generated using the value a that occurs in R(a)). The same holds for a fact T (b) and its corresponding fact in the relation of α_y in D′, and for a fact S(a, b) and its corresponding fact in the relation of α_{x,y} in D′.
By definition, the Shapley value of a fact f is the probability of selecting a random permutation σ in which the addition of the fact f changes the query result from 0 to 1 (i.e., f is a counterfactual cause for the query w.r.t. σ_f ∪ D_x). From the construction of D′, it holds that the number of endogenous facts in D is the same as the number of endogenous facts in D′; hence, the total number of permutations of the endogenous facts in D is the same as the total number of permutations of the endogenous facts in D′. It is left to show that the number of permutations of the facts in D that satisfy the above condition is the same as the number of permutations of the facts in D′ that satisfy the above condition w.r.t. the corresponding fact f′.
From the construction of D′ it is straightforward that a subset E of D_n is such that E ∪ D_x |= q_RST if and only if the subset E′ of D′_n that contains for each fact f ∈ E the corresponding fact f′ ∈ D′ is such that E′ ∪ D′_x |= q. Therefore, it also holds that if a fact f is a counterfactual cause for q_RST w.r.t. E ∪ D_x, the corresponding fact f′ is a counterfactual cause for q w.r.t. E′ ∪ D′_x. Thus, the number of permutations of the endogenous facts in D in which f affects the result of q_RST is equal to the number of permutations of the endogenous facts in D′ in which f′ changes the result of q. As mentioned above, the total number of permutations is the same for both D and D′, and we conclude, from the definition of the Shapley value, that Shapley(D, q_RST, f ) = Shapley(D′, q, f′).
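The database construction of this reduction can be sketched in a few lines. The code below is a simplified illustration, not the authors' implementation: it represents a self-join-free Boolean CQ as a list of (relation name, variable tuple) pairs, pads all other variables with the fresh constant c, and ignores the endogenous/exogenous labels. The function names `reduce_rst_database` and `holds`, and the brute-force evaluator, are hypothetical helpers introduced here.

```python
from itertools import product


def reduce_rst_database(r, s, t, atoms, x, y, c="c"):
    """Build the database for q from an input (R, S, T) for q_RST.

    atoms: list of (relation_name, tuple_of_variables), without self-joins.
    c: a constant assumed not to occur in R, S or T.
    """
    db = {name: set() for name, _ in atoms}
    for name, variables in atoms:
        has_x, has_y = x in variables, y in variables
        if has_x and has_y:      # atom in A_x ∩ A_y: one fact per S(a, b)
            sources = [{x: a, y: b} for a, b in s]
        elif has_x:              # atom in A_x \ A_y: one fact per R(a)
            sources = [{x: a} for a in r]
        elif has_y:              # atom in A_y \ A_x: one fact per T(b)
            sources = [{y: b} for b in t]
        else:                    # atom using neither x nor y: the single fact (c, ..., c)
            sources = [{}]
        for assignment in sources:
            db[name].add(tuple(assignment.get(v, c) for v in variables))
    return db


def holds(atoms, db):
    """Brute-force evaluation of a Boolean CQ over a small database."""
    variables = sorted({v for _, vs in atoms for v in vs})
    domain = list({val for facts in db.values() for fact in facts for val in fact})
    for values in product(domain, repeat=len(variables)):
        h = dict(zip(variables, values))
        if all(tuple(h[v] for v in vs) in db[name] for name, vs in atoms):
            return True
    return False
```

For example, with the non-hierarchical query A(x, z), B(x, y, z), C(y), E(z) one can call `reduce_rst_database(R, S, T, atoms, "x", "y")` and observe that `holds` agrees with q_RST on every sub-instance, which is the property the proof uses to match permutations.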

4.2. Aggregate Functions over Conjunctive Queries. Next, we study the complexity of aggregate-relational queries, where the internal relational query is a CQ. We begin with hardness. The following theorem generalizes the hardness side of Theorem 4.1 and states that it is FP^{#P}-complete to compute Shapley(D, α, f ) whenever α is of the form γ[q], as defined in Section 2, and q is a non-hierarchical CQ without self-joins. The only exception is when α is a constant numerical query (i.e., α(D) = α(D′) for all databases D and D′); in that case, Shapley(D, α, f ) = 0 always holds.
Theorem 4.8. Let α = γ[q] be a fixed aggregate-relational query where q is a non-hierarchical CQ without self-joins. Computing Shapley(D, α, f ), given D and f as input, is FP^{#P}-complete, unless α is constant.
Proof. Since α is not a constant function, there exists a database $\tilde{D}$ such that α($\tilde{D}$) ≠ α(∅). Let $\tilde{D}$ be a minimal such database; that is, for every database D such that q(D) ⊂ q($\tilde{D}$) it holds that α(D) = α(∅). Let q($\tilde{D}$) = {$\vec{a}_1$, . . . , $\vec{a}_n$}. We replace the head variables in q with the corresponding constants from the answer $\vec{a}_1$. We denote the result by q′.
We start with the following observation. The query q′ is a non-hierarchical Boolean CQ (recall that the definition of hierarchical queries considers only the existential variables, which are left intact in q′). We can break the query q′ into connected components q′_1, ..., q′_m, such that Vars(q′_i) ∩ Vars(q′_j) = ∅ for all i ≠ j (it may be the case that there is only one connected component). Since q′ is not hierarchical, we have that q′_i is not hierarchical for at least one i ∈ {1, . . . , m}. We assume, without loss of generality, that q′_1 is not hierarchical. Then, Theorem 4.1 implies that computing Shapley(D, q′_1, f ) is FP^{#P}-complete. Therefore, we construct a reduction from the problem of computing Shapley(D, q′_1, f ) to the problem of computing Shapley(D′, α, f ).
Let x 1 , . . . , x k be the head variables of q. If a tuple a = (v 1 , . . . , v k ) is in q(D) for some database D, then for every connected component q i of q, there is a homomorphism from q i to D, such that each head variable x j is mapped to the corresponding value v j from a.
On the other hand, if this does not hold for at least one of the connected components, then (v 1 , . . . , v k ) is not in q(D).
Given an input database D to the first problem, we build an input database D′ to our problem, as we explain next. As in the proof of Theorem 4.3, we assume, without loss of generality, that D contains only facts f such that there is a homomorphism from an atom of q′_1 to f . To construct D′, we first add a subset of the facts of $\tilde{D}$ to D′_x (recall that $\tilde{D}$ is a minimal database satisfying α($\tilde{D}$) ≠ α(∅)). For each relation R that occurs in q_i for i ∈ {2, . . . , m}, we copy all the facts from $R^{\tilde{D}}$ to $R^{D'_x}$. As explained above, for each answer $\vec{a}_i$ = (v_1, . . . , v_k) in q($\tilde{D}$) and for each connected component q_i, there is a homomorphism from q_i to $\tilde{D}$ such that each head variable x_j that appears in q_i is mapped to the value v_j; hence, the same holds for the database D′ and every connected component in {q_2, . . . , q_m}. Therefore, in order to have all the tuples {$\vec{a}_1$, . . . , $\vec{a}_n$} in q(D′), we only need to add additional facts to the relations that appear in q_1 to satisfy this connected component.
Now, let x_{j_1}, . . . , x_{j_r} be the head variables that appear in q_1. For each tuple $\vec{a}_i$ that does not agree with $\vec{a}_1$ on the values of these variables (i.e., the value of at least one x_{j_k} is different in $\vec{a}_1$ and $\vec{a}_i$), we generate a set of exogenous facts as follows. Assume that $\vec{a}_i$ uses the value v_{j_k} for the head variable x_{j_k}. We replace each variable x_{j_k} in q_1 with the value v_{j_k}. Then, we assign a new distinct value to each one of the existential variables of q_1. We then add the corresponding facts to D′_x (e.g., if q_1 now contains the atom R(a, b, c), then we add the fact R(a, b, c) to D′_x). At this point, it is rather straightforward that each $\vec{a}_i$ that does not agree with $\vec{a}_1$ on the values of the head variables of q_1 appears in q(D′); however, $\vec{a}_1$ and each tuple $\vec{a}_i$ that uses the same values as $\vec{a}_1$ for the head variables of q_1 are not yet in q(D′). Since we assumed that $\tilde{D}$ is minimal, we know that α(D′_x) = α(∅).
Next, we add all the facts of D to D′. Each fact of D_x is added to D′_x, and each fact of D_n is added to D′_n. We prove that the following holds: Let A = {$\vec{a}_{k_1}$, . . . , $\vec{a}_{k_t}$} be the set of answers that do not agree with $\vec{a}_1$ on the values of the head variables x_{j_1}, . . . , x_{j_r} of q_1. As explained above, we have that A ⊆ q(D′_x). Moreover, since q′_1 was obtained from q_1 by replacing the head variables with the corresponding values from $\vec{a}_1$, and since we removed from D every fact that does not agree with q′_1 on those values, the only possible answer of q_1 on D is (v_{j_1}, . . . , v_{j_r}) where v_{j_i} is the value in $\vec{a}_1$ corresponding to the head variable x_{j_i} of q_1. Since all the answers in q($\tilde{D}$) \ A agree with $\vec{a}_1$ on the values of all these variables, we have that in each permutation σ of the endogenous facts in D′, one of the following holds: Clearly, the contribution of each permutation that satisfies the first condition to the Shapley value of f is zero (as we assumed that α(A) = α(∅) for every A ⊂ {$\vec{a}_1$, . . . , $\vec{a}_n$}), while the contribution of each permutation that satisfies the second condition to the Shapley value of f is α($\tilde{D}$) − α(∅) (as we assumed that α($\tilde{D}$) ≠ α(∅)).
Let X_f be a random variable that gets the value 1 if f adds the answer $\vec{a}_1$ to the result of q in the permutation and 0 otherwise. Due to the aforementioned observations and by the definition of the Shapley value, the following holds: Note that D and D′ contain the same endogenous facts. Moreover, the fact f adds the answer $\vec{a}_1$ to the result of q in a permutation σ of the endogenous facts in D′ if and only if f changes the result of q′_1 from 0 to 1 in the same permutation σ of the endogenous facts of D. This holds since f changes the result of q′_1 in σ if and only if there exists a set of facts in σ_f ∪ D_x ∪ {f} (that contains f ) that satisfies q′_1, which, as explained above, happens if and only if the same set of facts adds the answer $\vec{a}_1$ to the result of q in σ. Therefore, the Shapley value of a fact f in D is: where X_f is the same random variable that we introduced above, and that concludes our proof.
For instance, it follows from Theorem 4.8 that, whenever q is a non-hierarchical CQ without self-joins, it is FP^{#P}-complete to compute the Shapley value for the aggregate-relational queries count Interestingly, it turns out that Theorem 4.8 captures precisely the hard cases for computing the Shapley value w.r.t. any summation over CQs without self-joins. In particular, the following argument shows that Shapley(D, sum ϕ [q], f ) can be computed in polynomial time if q is a hierarchical CQ without self-joins. Let q = q($\vec{x}$) be an arbitrary CQ. For $\vec{a}$ ∈ q(D), let $q_{[\vec{x} \to \vec{a}]}$ be the Boolean CQ obtained from q by substituting every head variable Together with Theorem 4.8, we get a full dichotomy for sum ϕ [q] over CQs without self-joins.
The complexity of computing Shapley(D, α, f ) for other aggregate-relational queries remains an open problem for the general case where q is a hierarchical CQ without self-joins. We can, however, state a positive result for max ϕ [q] and min ϕ [q] for the special case where q consists of a single atom (i.e., aggregation over a single relation). , f ) = 0. Hence, from now on we assume that this is not the case. If a f already appears in the query result before adding the fact f (that is, a f ∈ q(σ f )), then clearly f does not affect the maximum value. If a f is added to the query result only after adding f in the permutation (that is, . . , v m } be the set of values associated with the answers in q(D) (that is, V contains every value v j such that ϕ({ a}) = v j for some a ∈ q(D)). Note that it may be the case that ϕ({ a 1 }) = ϕ({ a 2 }) for a 1 = a 2 ; hence, it holds that |V | ≤ |q(D)|. For each value v j we denote by n < v j the number of endogenous facts f in the database that correspond to an answer a (i.e., q({f }) = { a}) such that ϕ({ a}) < v j , and by n = v j the number of endogenous facts in the database that correspond to an answer a such that ϕ({ a}) = v j . We also denote by n ≤ v j the number n < v j + n = (That is, we choose at least one fact f such that ϕ(q({f })) = v ir and then we choose the rest of the facts among the facts f such that ϕ(q({f })) < v ir ). We count the number of such permutations separately for v i 1 , because in this case, we do not have to choose at least one endogenous fact f such that ϕ(q({f })) = v i 1 (as this is already the maximum value on the exogenous facts). 
Hence, the number of permutations in this case is: The contribution of each such permutation to the Shapley value of f is: Thus, the total contribution of the permutations σ such that max a∈q( Finally, the Shapley value of f is: As an example, if α is the query max [2] [q], where q is given by q(x, y) :-Citations(x, y), then we can compute in polynomial time Shapley(D, α, f ), determining the responsibility of each publication (in our running example) to the maximum number of citations.
The arguments in the proof of Proposition 4.11 heavily rely on the assumption that each fact adds at most one answer to the query result; hence, we can refer to the answer associated with a certain fact. Moreover, this answer is independent of the permutation. However, this assumption does not hold for general queries, where the addition of a fact can add multiple answers, and the added set of answers depends on the other facts in the permutation. Hence, the proof does not easily generalize to maximum and minimum over hierarchical queries consisting of more than one atom. We also cannot use here the linearity of expectation that was used to obtain a dichotomy for summation. Therefore, a complete classification of the complexity for general aggregate queries remains an open problem.
4.3. Approximation. In computational complexity theory, a conventional feasibility notion of arbitrarily tight approximations is via the Fully Polynomial-Time Randomized Approximation Scheme, FPRAS for short. Formally, an FPRAS for a numeric function f is a randomized algorithm A(x, ε, δ), where x is an input for f and ε, δ ∈ (0, 1), that returns an ε-approximation of f (x) with probability 1 − δ (where the probability is over the randomness of A) in time polynomial in the size of x, 1/ε and log(1/δ). To be more precise, we distinguish between an additive (or absolute) FPRAS:
$$\Pr\bigl[f(x) - \epsilon \le A(x, \epsilon, \delta) \le f(x) + \epsilon\bigr] \ge 1 - \delta$$
and a multiplicative (or relative) FPRAS:
$$\Pr\Bigl[\frac{f(x)}{1 + \epsilon} \le A(x, \epsilon, \delta) \le (1 + \epsilon) f(x)\Bigr] \ge 1 - \delta$$
Using the Chernoff-Hoeffding bound, we easily get an additive FPRAS of Shapley(D, q, f ) when q is any Boolean query computable in polynomial time, by simply taking the average value over O(log(1/δ)/ε²) trials of the following experiment:
(1) Select a random permutation (f_1, . . . , f_n) over the set of all endogenous facts.
(2) Suppose that f = f_i, and let D_{i−1} = D_x ∪ {f_1, . . . , f_{i−1}}. Return q(D_{i−1} ∪ {f}) − q(D_{i−1}).
In general, an additive FPRAS of a function f is not necessarily a multiplicative one, since f (x) can be very small. For example, we can get an additive FPRAS of the satisfaction of a propositional formula over Boolean i.i.d. variables by, again, sampling and averaging, but there is no multiplicative FPRAS for such formulas unless BPP = NP. Nevertheless, the situation is different for Shapley(D, q, f ) when q is a CQ, since the Shapley value is never too small (assuming data complexity).
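The two-step experiment above translates directly into a Monte Carlo estimator. The sketch below is a minimal illustration under stated assumptions: the query is monotone and Boolean, so each trial returns 0 or 1, and the number of trials is taken from the two-sided Hoeffding bound as ceil(ln(2/δ) / (2ε²)). The names `estimate_shapley` and `q_rst`, the fact encoding, and the fixed seed are ours.

```python
import math
import random


def q_rst(facts):
    """Boolean query q_RST() :- R(x), S(x, y), T(y), returning 0 or 1."""
    r_values = {fact[1] for fact in facts if fact[0] == "R"}
    t_values = {fact[1] for fact in facts if fact[0] == "T"}
    return int(any(fact[0] == "S" and fact[1] in r_values and fact[2] in t_values
                   for fact in facts))


def estimate_shapley(exogenous, endogenous, query, f, eps, delta, seed=0):
    """Average the marginal contribution of f over random permutations.

    Assumes query(...) is in {0, 1} and monotone, so every sample lies in [0, 1]
    and Hoeffding's inequality gives additive error eps with probability 1 - delta.
    """
    rng = random.Random(seed)
    trials = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    order = list(endogenous)
    total = 0
    for _ in range(trials):
        rng.shuffle(order)                       # step (1): random permutation
        prefix = set(exogenous) | set(order[:order.index(f)])
        total += query(prefix | {f}) - query(prefix)   # step (2): marginal gain of f
    return total / trials
```

For instance, with exogenous facts S(1, u) and S(0, u), endogenous facts R(1), T(u) and f = R(0), the exact Shapley value of f is 1/6, and the estimator returns a value within the requested additive error with the stated probability.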
Proposition 4.12. Let q be a fixed Boolean CQ. There is a polynomial p such that for all databases D and endogenous facts f of D it is the case that Shapley(D, q, f ) is either zero or at least 1/(p(|D|)).
Proof. We denote m = |D_n|. If there is no subset S of D_n such that f is a counterfactual cause for q w.r.t. S, then Shapley(D, q, f ) = 0. Otherwise, let S be a minimal such set (i.e., for every S′ ⊂ S, the fact f is not a counterfactual cause for q w.r.t. S′). Clearly, it holds that |S| ≤ k, where k is the number of atoms of q. The probability of choosing a permutation σ such that σ_f is exactly S is |S|! · (m − |S| − 1)!/m! (recall that σ_f is the set of facts that appear before f in σ). Hence, we have that Shapley(D, q, f ) ≥ 1/((m − k + 1) · . . . · m), and that concludes our proof.
It follows that whenever Shapley(D, q, f ) = 0, the above additive approximation is also zero, and when Shapley(D, q, f ) > 0, the additive FPRAS also provides a multiplicative FPRAS. Hence, we have the following.
Approximation for Aggregate Queries. The Chernoff-Hoeffding bound applies to the additive approximation of any function with a "bounded domain," that is, where the gap between the maximal and minimal value is polynomial in the size of the input. Hence, we immediately conclude that there is an additive FPRAS for count. We also get an additive FPRAS for sum, average, median (or any quantile), min and max in the case where the values are from a bounded domain.
What about multiplicative approximation for aggregate queries? Interestingly, Corollary 4.13 generalizes to a multiplicative FPRAS for summation ( , f ) has a multiplicative FPRAS if either ϕ( a) ≥ 0 for all a ∈ q(D) or ϕ( a) ≤ 0 for all a ∈ q(D).
Observe that the above FPRAS results allow the CQ q to have self-joins. This is in contrast to the complexity results we established in the earlier parts of this section, regarding exact evaluation. In fact, an easy observation is that Proposition 4.12 continues to hold when considering unions of conjunctive queries (UCQs). Therefore, Corollaries 4.13 and 4.14 remain correct in the case where q is a UCQ.
The existence of additive and multiplicative approximations for other aggregate queries remains an open problem.

Related Measures
In this section, we discuss our work in comparison to some alternative measures for the responsibility of tuples to database queries.
Causal responsibility. Causality and causal responsibility [Pea09,Hal16] have been applied in data management, defining a fact as a cause for a query result as follows: For an instance D = D_x ∪ D_n, a fact f ∈ D_n is an actual cause for a Boolean CQ q, if there exists Γ ⊆ D_n, called a contingency set for f , such that f is a counterfactual cause for q in D \ Γ [MGMS10a]. The responsibility of an actual cause f for q is defined by ρ(f ) := 1/(|Γ| + 1), where |Γ| is the size of a smallest contingency set for f . If f is not an actual cause, then ρ(f ) is zero [MGMS10a]. Intuitively, facts with higher responsibility provide stronger explanations.²
Example 5.1. Consider the database of our running example, and the query q 1 from Example 2.2. The fact f a 1 is an actual cause with minimal contingency set Γ = {f a 2 , f a 3 , f a 4 }. So, its responsibility is 1/4. Similarly, f a 2 , f a 3 and f a 4 are actual causes with responsibility 1/4.
Example 5.2. Consider the database G and the query p ab from Example 3.4. All facts in G are actual causes since every fact appears in a path from a to b. It is easy to verify that all the facts in D have the same causal responsibility, 1/3, which may be considered as counter-intuitive given that e 1 provides a direct path from a to b.
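The definition of causal responsibility can be made concrete with a brute-force search for a smallest contingency set. The sketch below is only an illustration on a toy instance of q_RST, not on the databases of Examples 5.1 and 5.2, whose contents are given elsewhere in the paper; the function names and the tuple encoding of facts are hypothetical, and the search is exponential in the number of endogenous facts.

```python
from fractions import Fraction
from itertools import combinations


def q_rst(facts):
    """Boolean query q_RST() :- R(x), S(x, y), T(y)."""
    r_values = {fact[1] for fact in facts if fact[0] == "R"}
    t_values = {fact[1] for fact in facts if fact[0] == "T"}
    return any(fact[0] == "S" and fact[1] in r_values and fact[2] in t_values
               for fact in facts)


def responsibility(exogenous, endogenous, query, f):
    """Return 1 / (|Gamma| + 1) for a smallest contingency set Gamma, or 0."""
    others = [g for g in endogenous if g != f]
    for size in range(len(others) + 1):          # try smaller sets first
        for gamma in combinations(others, size):
            remaining = (set(endogenous) - set(gamma)) | set(exogenous)
            # f is a counterfactual cause in D \ Gamma: q holds with f, fails without it
            if query(remaining) and not query(remaining - {f}):
                return Fraction(1, size + 1)
    return Fraction(0)                           # f is not an actual cause
```

With exogenous facts S(1, u), S(2, u) and endogenous facts R(1), R(2), T(u), the fact T(u) is a counterfactual cause on its own, while R(1) needs R(2) removed first, so it has a contingency set of size one.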
As shown in Example 3.4, the Shapley value gives a more intuitive degree of contribution of facts to the query result than causal responsibility. Actually, Example 3.4 was used in [SBSdB16] as a motivation to introduce an alternative to the notion of causal responsibility, that of causal effect.
² These notions can be applied to any monotonic query (i.e., whose answer set can only grow when the database grows, e.g., UCQs and Datalog queries) [BS17b,BS17a].
Causal effect. To quantify the contribution of a fact to the query result, Salimi et al. [SBSdB16] view the database as a tuple-independent probabilistic database where the probability of each endogenous fact is 0.5 and the probability of each exogenous fact is 1 (i.e., it is certain). The causal effect of a fact f ∈ D_n on a numerical query α (in particular, a Boolean query) is a difference of expected values [SBSdB16]:
$$CE(D, \alpha, f) := E(\alpha(D) \mid f) - E(\alpha(D) \mid \neg f)$$
where f is the event that the fact f is present in the database, and ¬f is the event that the fact f is absent from the database.
Although the values in the two examples above are different from the Shapley values computed in Example 3.3 and Example 3.4, respectively, if we order the facts according to their contribution to the query result, we will obtain the same order in both cases. Note that unlike the Shapley value, for causal effect the sum of the values over all facts is not equal to the query result on the whole database. In the next example we consider aggregate queries.
Example 5.4. Consider the query α 1 of Example 2.3. If f a 1 is in the database, then the result can be either 20, 28, or 40. If f a 1 is absent, then the query result can be either 0, 8, or 20. By computing the expected value in both cases, we obtain that CE(D, α 1 , f a 1 ) = 20. Similarly, it holds that CE(D, α 1 , f a 2 ) = CE(D, α 1 , f a 4 ) = 1, and CE(D, α 1 , f a 3 ) = 14.
Interestingly, the causal effect coincides with a well known wealth-distribution function in cooperative games, namely the Banzhaf Power Index (BPI) [Lee90,DS79,KL10]. This measure is defined similarly to the definition of the Shapley value in Equation (2.3), except that we replace the ratio |B|! · (|A| − |B| − 1)!/|A|! with 1/2^{|A|−1}.
Proposition 5.5. Let α be a numerical query, D be a database, and f ∈ D_n. Then,
$$CE(D, \alpha, f) = \frac{1}{2^{|D_n| - 1}} \sum_{E \subseteq D_n \setminus \{f\}} \bigl[\alpha(D_x \cup E \cup \{f\}) - \alpha(D_x \cup E)\bigr]$$
Hence, the causal effect coincides with the BPI.
Proof. The following holds. The transition ( * ) is correct since every endogenous fact in the probabilistic database has probability 0.5 and they are all independent; hence, all the possible worlds have the same probability 1/2^{|D_n|−1}. (Recall that we condition on f being either present or absent from the database, and all exogenous facts are certain; thus, the probability of each possible world depends only on the facts in D_n \ {f}.) Then, for each E ⊆ D_n \ {f}, it holds that α(D_x ∪ E ∪ {f}) is the value of the query on the possible world that contains all the exogenous facts, the fact f , and all the endogenous facts in E, but does not contain the endogenous facts in D_n \ (E ∪ {f}). Hence, E⊆(Dn\{f })
In the next section we will discuss in more detail the complexity of the causal effect.
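The coincidence of the causal effect with the Banzhaf Power Index can be observed directly on a small instance. The sketch below assumes the probabilistic reading described above (each endogenous fact other than f is present independently with probability 1/2, exogenous facts are certain) and computes both quantities by enumerating possible worlds. The names `causal_effect`, `banzhaf` and `alpha` are hypothetical, and the enumeration is exponential, so this is a check on toy data only.

```python
from fractions import Fraction
from itertools import combinations


def subsets(items):
    """All subsets of items, as frozensets."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def alpha(facts):
    """Example numerical query: 1 if q_RST() :- R(x), S(x, y), T(y) holds, else 0."""
    r_values = {fact[1] for fact in facts if fact[0] == "R"}
    t_values = {fact[1] for fact in facts if fact[0] == "T"}
    return int(any(fact[0] == "S" and fact[1] in r_values and fact[2] in t_values
                   for fact in facts))


def causal_effect(exogenous, endogenous, query, f):
    """E[query | f present] - E[query | f absent] with endogenous probability 1/2."""
    worlds = list(subsets(g for g in endogenous if g != f))
    p = Fraction(1, len(worlds))  # every world over D_n \ {f} is equally likely
    with_f = sum(p * query(set(exogenous) | world | {f}) for world in worlds)
    without_f = sum(p * query(set(exogenous) | world) for world in worlds)
    return with_f - without_f


def banzhaf(exogenous, endogenous, query, f):
    """Average marginal contribution of f over all subsets of the other players."""
    others = [g for g in endogenous if g != f]
    total = sum(query(set(exogenous) | e | {f}) - query(set(exogenous) | e)
                for e in subsets(others))
    return Fraction(total, 2 ** len(others))
```

The two functions differ only in whether the averaging is applied before or after taking the difference, which is exactly the linearity step in the proof.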
SHAP score. One of the instantiations of the Shapley value is the SHAP score that has been used in the context of machine learning for explaining the prediction of a model [LL17]. This score could be applied to the attribution of responsibility to tuples in query answering, and it would give a measure that is similar in spirit, yet technically different, from our application of the Shapley value. Both apply the Shapley value in a cooperative game where the players are the endogenous facts. The difference is in the definition of the cooperative game. In our case, the players of a coalition, as a subinstance of the database, occur in the database, while the others are excluded from the computation of the wealth function, which is the answer of the query on the resulting subinstance. In the case of the SHAP score as applied to our setting, we could view the database as a tuple-independent probabilistic database where the facts of the coalition are deterministic (probability one) and the others are probabilistic (say with the probability 1 2 ), and the wealth function is the expectation of the query answer on the resulting probabilistic database.
The complexity of the SHAP score (in a more abstract setting than query answering) has been studied by Van den Broeck et al. [VdBLSS21] and by Arenas et al. [ABBM21b,ABBM21a].³ Immediate algorithmic approaches that one can derive to compute the SHAP score for Boolean CQs and UCQs are (a) to reduce the problem to probabilistic query answering [VdBLSS21] and (b) to compile the provenance of the query into a deterministic and decomposable circuit and apply to the circuit the Shapley computation of Arenas et al. [ABBM21b,ABBM21a]. Yet, despite the similarity between our Shapley value and the SHAP score, they are different enough that we do not see an immediate way of directly translating results (e.g., via simple reductions) from one to the other. It is plausible, though, that we could apply similar techniques, namely reduction to/from probabilistic query answering and knowledge compilation, to derive our results and perhaps more general results. We leave this investigation to future research.
5.1. The Complexity of the Causal-Effect Measure (and Banzhaf Power Index). We now show that the complexity results obtained in this work for the exact computation of the Shapley value also apply to the causal effect (and BPI). These results are, in fact, easier to obtain, via a connection to probabilistic databases [SORK11]. The extension of the approximation results will be discussed later.