The Shapley Value of Inconsistency Measures for Functional Dependencies

Quantifying the inconsistency of a database is motivated by various goals, including reliability estimation for new datasets and progress indication in data cleaning. Another goal is to attribute to individual tuples a level of responsibility for the overall inconsistency, and thereby to prioritize tuples in the explanation or inspection of dirt. Therefore, inconsistency quantification and attribution have been a subject of much research in Knowledge Representation and, more recently, in Databases. As in many other fields, a conventional responsibility sharing mechanism is the Shapley value from cooperative game theory. In this paper, we carry out a systematic investigation of the complexity of the Shapley value in common inconsistency measures for functional-dependency (FD) violations. For several measures we establish a full classification of the FD sets into tractable and intractable classes with respect to Shapley-value computation. We also study the complexity of approximation in intractable cases.


Introduction
Inconsistency measures for knowledge bases have received considerable attention from the Knowledge Representation (KR) and Logic communities [KLM03, Kni03, HK06, GH06, HK08, HK10, GH17, Thi17]. More recently, inconsistency measures have also been studied from the database viewpoint [Ber18, LKT+21]. Such measures quantify the extent to which the database violates a set of integrity constraints. There are multiple reasons why one might use such measures. For one, the measure can be used for estimating the usefulness or reliability of new datasets for data-centric applications such as business intelligence [CPRT15]. Inconsistency measures have also been proposed as the basis of progress indicators for data-cleaning systems [LKT+21]. Finally, the measure can be used for attributing to individual tuples a level of responsibility for the overall inconsistency [MLJ11, Thi09], thereby prioritizing tuples in the explanation/inspection/correction of errors.
Example 1.1. Figure 1 depicts an inconsistent database that stores a train schedule. For example, the tuple f 1 states that train number 16 will depart from the New York Penn Station at time 1030 and arrive at the Boston Back Bay Station after 315 minutes. Assume that we have the functional dependency stating that the train number and departure time determine the departure station. All tuples in the database are involved in violations of this constraint, as they all agree on the train number and departure time, but there is some disagreement on the departure station. Hence, one can argue that every fact in the database affects the overall level of inconsistency in the database. But how should we measure the responsibility of the tuples for this inconsistency? For example, which of the tuples f 1 and f 3 has a greater contribution to inconsistency? To this end, we can adopt some conventional concepts for responsibility sharing, and in this article we study the computational aspects involved in the measurement of those. ♦

A conventional approach to dividing the responsibility for a quantitative property (here an inconsistency measure) among entities (here the database tuples) is the Shapley value [Sha53], which is a game-theoretic formula for wealth distribution in a cooperative game. The Shapley value has been applied in a plethora of domains, including economics [Gul89], law [Nen03], environmental science [PZ03, LZS15], social network analysis [NN11], physical network analysis [MCL+10], and advertisement [BDG+19]. In data management, the Shapley value has been used for determining the relative contribution of features in machine-learning predictions [LF18, LL17], the responsibility of tuples to database queries [RKL20, LBKS20, BG20], and the reliability of data sources [CPRT15].
The Shapley value has also been studied in a context similar to the one we adopt in this article: assigning a level of inconsistency to statements in an inconsistent knowledge base [HK10, YVCB18, MLJ11, Thi09]. Hunter and Konieczny [HK06, HK10, HK08] use the maximal Shapley value of one inconsistency measure in order to define a new inconsistency measure. Grant and Hunter [GH15] considered information systems distributed along data sources of different reliabilities, and applied the Shapley value to determine the expected blame of each statement for the overall inconsistency. Yet, with all the investigation that has been conducted on the Shapley value of inconsistency, we are not aware of any results or efforts regarding the computational complexity of calculating this value.
Example 1.2. Let us define the following cooperative game over the database of Figure 1. We have nine players-the tuples of the database. One of the measures that we consider for quantifying the level of inconsistency of a coalition of players is the number of tuple pairs in this group that violate the constraints. For example, consider the constraint defined in Example 1.1. The inconsistency level of the group {f 1 , f 3 , f 5 } is 2, as there are two conflicting tuple pairs: {f 1 , f 3 } and {f 1 , f 5 }. The inconsistency level of the entire database is 29, as this is the total number of conflicting pairs in the database. The Shapley value allows us to measure the contribution of each individual tuple to the overall inconsistency level. For example, the Shapley value of the tuple f 1 , in this case, will be lower than the Shapley
value of the tuple f 3 (we will later show how this value is computed), which indicates that f 3 has a higher impact on the inconsistency than f 1 . ♦
In this work, we embark on a systematic analysis of the complexity of the Shapley value of database tuples relative to inconsistency measures, where the goal is to calculate the contribution of a tuple to inconsistency. Our main results are summarized in Table 1. We consider inconsistent databases with respect to functional dependencies (FDs), and basic measures of inconsistency following Bertossi [Ber19] and Livshits, Ilyas, Kimelfeld and Roy [LKT+21]. We note that these measures are all adopted from the measures studied in the aforementioned KR research. In our setting, an individual tuple affects the inconsistency of only its containing relation, since the constraints are FDs. Hence, our analysis focuses on databases with a single relation; at the end of each relevant section, we discuss the generalization to multiple relations. While most of our results easily extend to multiple relations, some extensions require a more subtle proof.
More formally, we investigate the following computational problem for any fixed combination of a relational signature, a set of FDs, and an inconsistency measure: given a database and a tuple, compute the Shapley value of the tuple with respect to the inconsistency measure. As Table 1 shows, two of these measures are computable in polynomial time: I MI (number of FD violations) and I P (number of problematic facts that participate in violations). For two other measures, we establish a full dichotomy in the complexity of the Shapley value: I d (the drastic measure: 0 for consistency and 1 for inconsistency) and I MC (number of maximal consistent subsets, a.k.a. repairs). The dichotomy in both cases is the same: when the FD set has, up to equivalence, an lhs chain (i.e., the left-hand sides form a chain w.r.t. inclusion [LK17]), the Shapley value can be computed in polynomial time; in any other case, it is FP #P -hard (hence, requires at least exponential time under conventional complexity assumptions). In the case of I R (the minimal number of tuples to delete for consistency), the problem is solvable in polynomial time in the case of an lhs chain, and NP-hard whenever it is intractable to find a cardinality repair [LKR20]; however, the problem is open for every FD set in between, for example, the bipartite matching constraint {A → B, B → A}.
We also study the complexity of approximating the Shapley value and show the following (as described in Table 1). First, in the case of I d , there is a (multiplicative) fully polynomial-time randomized approximation scheme (FPRAS) for every set of FDs. In the case of I MC , approximating the Shapley value of any intractable (non-lhs-chain) FD set is at least as hard as approximating the number of maximal matchings of a bipartite graph-a long-standing open problem [JR18]. In the case of I R , we establish a full dichotomy, namely FPRAS vs. hardness of approximation, that has the same separation as the problem of finding a cardinality repair.
This article is the full version of a conference publication [LK21]. We have added all of the proofs, intermediate results and algorithms that were excluded from the conference version. In particular, we have included in this version the proofs of Observation 3.2, Lemma 5.4, Lemma 6.3, Lemma 7.2, and Lemma 7.3, and the algorithms of Figures 5, 8, and 10. Furthermore, the results of the conference publication have been restricted to schemas with a single relation symbol. While some of the results (e.g., all of the lower bounds) immediately generalize to schemas with multiple relation symbols, some generalizations (in particular, the upper bounds for I d and I MC ) require a more subtle analysis that we provide in this article. We generalize the upper bounds for all the measures to schemas with multiple relation symbols, in the corresponding sections.
The rest of the article is organized as follows. After presenting the basic notation and terminology in Section 2, we formally define the studied problem and give initial observations in Section 3. In Section 4, we describe polynomial-time algorithms for I MI and I P . Then, we explore the measures I d , I R and I MC in Sections 5, 6 and 7, respectively. We conclude and discuss future directions in Section 8.

Preliminaries
We begin with preliminary concepts and notation that we use throughout the article.
2.1. Database Concepts. By a relational schema we refer to a sequence (A 1 , . . . , A n ) of attributes. A database D over (A 1 , . . . , A n ) is a finite set of tuples, or facts, of the form (c 1 , . . . , c n ), where each c i is a constant from a countably infinite domain. For a fact f and an attribute A i , we denote by f [A i ] the value associated by f with the attribute A i (that is, f [A i ] = c i ). Similarly, for a sequence X = (A j 1 , . . . , A jm ) of attributes, we denote by f [X] the tuple (f [A j 1 ], . . . , f [A jm ]). Generally, we use letters from the beginning of the English alphabet (i.e., A, B, C, ...) to denote single attributes and letters from the end of the alphabet (i.e., X, Y, Z, ...) to denote sets of attributes. We may omit stating the relational schema of a database D when it is clear from the context or irrelevant.
A Functional Dependency (FD for short) over (A 1 , . . . , A n ) is an expression of the form X → Y , where X, Y ⊆ {A 1 , . . . , A n }. We may also write the attribute sets X and Y by concatenating the attributes (e.g., AB → C instead of {A, B} → {C}). A database D satisfies X → Y if every two facts f, g ∈ D that agree on the values of the attributes of X also agree on the values of the attributes of Y . A database D satisfies a set ∆ of FDs, denoted by D |= ∆, if D satisfies every FD of ∆. Otherwise, D violates ∆ (denoted by D ⊭ ∆). Two FD sets over the same relational schema are equivalent if every database that satisfies one of them also satisfies the other.
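To make the definitions concrete, here is a small self-contained sketch of an FD-satisfaction check in Python. The schema, attribute indices, and sample facts are invented for illustration (they loosely mimic the train schedule of Figure 1); FDs are represented as pairs of attribute-index tuples.

```python
from itertools import combinations

# A fact is a tuple of constants; attributes are referred to by index.
# An FD X -> Y is a pair (X, Y) of tuples of attribute indices.

def satisfies(db, fds):
    """Check whether every pair of facts agrees with every FD X -> Y:
    if two facts agree on all of X, they must agree on all of Y."""
    for f, g in combinations(db, 2):
        for lhs, rhs in fds:
            if all(f[a] == g[a] for a in lhs) and any(f[a] != g[a] for a in rhs):
                return False  # f and g agree on X but disagree on Y
    return True

# Hypothetical schema (train, time, departs) with the FD train time -> departs.
fds = [((0, 1), (2,))]
consistent = [(16, 1030, "NY Penn"), (17, 1030, "Boston")]
inconsistent = consistent + [(16, 1030, "Newark")]
assert satisfies(consistent, fds)
assert not satisfies(inconsistent, fds)
```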
Let ∆ be a set of FDs and D a database (which may violate ∆). A repair (of D w.r.t. ∆) is a maximal consistent subset of D; that is, E ⊆ D is a repair if E |= ∆ but E′ ⊭ ∆ for every E′ with E ⊊ E′ ⊆ D. A cardinality repair (or c-repair for short) is a repair of maximum cardinality; that is, it is a repair E such that |E| ≥ |E′| for every repair E′.

Example 2.1. Consider again the database of Figure 1. The FD set ∆ consists of the two FDs:
• train time → departs
• train time duration → arrives
The first FD states that the departure station is determined by the train number and departure time, and the second FD states that the arrival station is determined by the train number, the departure time, and the duration of the ride.
Observe that the database of Figure 1 violates the FDs, as all the facts refer to the same train number and departure time, but there is no agreement on the departure station. Moreover, the facts f 6 and f 7 also agree on the duration, but disagree on the arrival station. The database has five repairs. ♦

2.2. The Shapley Value. A cooperative game is a pair (A, v), where A is a finite set of players and v : 2^A → R is a wealth function with v(∅) = 0. The Shapley value of a player a ∈ A is defined by

Shapley(A, v, a) = (1/|A|!) · Σ_{σ∈Π_A} ( v(σ_a ∪ {a}) − v(σ_a) )

where Π_A is the set of all permutations over the players of A and σ_a is the set of players that appear before a in the permutation σ. Intuitively, the Shapley value of a player a is the expected contribution of a to a subset constructed by drawing players randomly one by one (without replacement), where the contribution of a is the change to the value of v caused by the addition of a. An alternative formula for the Shapley value, that we will use in this article, is the following:

Shapley(A, v, a) = Σ_{B ⊆ A\{a}} ( |B|! · (|A| − |B| − 1)! / |A|! ) · ( v(B ∪ {a}) − v(B) )
Observe that |B|! · (|A| − |B| − 1)! is the number of permutations where the players of B appear first, then a, and then the rest of the players.
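The equivalence of the two formulas can be checked directly on a toy game. The sketch below (with invented players and an invented conflict-counting wealth function) computes the Shapley value both ways with exact rational arithmetic.

```python
from itertools import permutations, combinations
from fractions import Fraction
from math import factorial

def shapley_perm(players, v, a):
    """Shapley value via the permutation definition."""
    total = Fraction(0)
    for sigma in permutations(players):
        before = set(sigma[:sigma.index(a)])
        total += Fraction(v(before | {a}) - v(before))
    return total / factorial(len(players))

def shapley_subsets(players, v, a):
    """Shapley value via the subset formula with weights |B|!(|A|-|B|-1)!/|A|!."""
    n = len(players)
    others = [p for p in players if p != a]
    total = Fraction(0)
    for k in range(n):
        for B in combinations(others, k):
            w = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
            total += w * (v(set(B) | {a}) - v(set(B)))
    return total

# Toy game: v counts conflicting pairs, where player 1 conflicts with 2 and 3.
conflicts = {frozenset({1, 2}), frozenset({1, 3})}
v = lambda S: sum(1 for p in combinations(sorted(S), 2) if frozenset(p) in conflicts)
assert shapley_perm([1, 2, 3], v, 1) == shapley_subsets([1, 2, 3], v, 1) == 1
```

Note also the efficiency property mentioned later in the article: the three Shapley values here (1, 1/2, 1/2) sum to v({1, 2, 3}) = 2.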
2.3. Complexity. In this article, we focus on the standard notion of data complexity, where the relational schema and set of FDs are considered fixed and the input consists of a database and a fact. In particular, a polynomial-time algorithm may be exponential in the number of attributes or FDs. Hence, each combination of a relational schema and an FD set defines a distinct problem, and different combinations may have different computational complexities. We discuss both exact and approximate algorithms for computing Shapley values.
Recall that a Fully-Polynomial Randomized Approximation Scheme (FPRAS, for short) for a function f is a randomized algorithm A(x, ε, δ) that returns an ε-approximation of f(x) with probability at least 1 − δ, given an input x for f and ε, δ ∈ (0, 1), in time polynomial in x, 1/ε, and log(1/δ). Formally, an FPRAS satisfies

Pr [ (1 − ε) · f(x) ≤ A(x, ε, δ) ≤ (1 + ε) · f(x) ] ≥ 1 − δ.

Note that this notion of FPRAS refers to a multiplicative approximation, and we adopt this notion implicitly unless stated otherwise. We may also write "multiplicative" explicitly for emphasis. In cases where the function f has a bounded range, it also makes sense to discuss an additive FPRAS, where Pr [ f(x) − ε ≤ A(x, ε, δ) ≤ f(x) + ε ] ≥ 1 − δ. We refer to an additive FPRAS, and explicitly state so, in cases where the Shapley value is in the range [0, 1].
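For intuition on additive approximation, the standard permutation-sampling estimator for the Shapley value can be sketched as follows. The sample bound via Hoeffding's inequality and the toy drastic-style game are our own illustration, not the specific algorithms of the later sections.

```python
import random
from math import ceil, log

def shapley_additive_estimate(players, v, a, eps, delta, rng=random):
    """Estimate the Shapley value of player a by averaging marginal
    contributions over random permutations.  For marginal contributions in
    [0, 1] (as with a drastic-style measure), Hoeffding's inequality gives an
    additive eps-approximation with probability >= 1 - delta."""
    samples = ceil(log(2 / delta) / (2 * eps ** 2))
    total = 0.0
    for _ in range(samples):
        sigma = list(players)
        rng.shuffle(sigma)
        before = set(sigma[:sigma.index(a)])
        total += v(before | {a}) - v(before)
    return total / samples

# Toy drastic-style game: a set is "inconsistent" iff it contains
# the pair {1, 2} or the pair {1, 3}.
v = lambda S: 1 if ({1, 2} <= S or {1, 3} <= S) else 0
est = shapley_additive_estimate([1, 2, 3], v, 1, eps=0.05, delta=0.01,
                                rng=random.Random(0))
# The exact Shapley value of player 1 in this game is 2/3.
```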

The Shapley Value of Inconsistency Measures
In this article, we study the Shapley value of facts with respect to measures of database inconsistency. More precisely, the cooperative game that we consider here is determined by an inconsistency measure I, and the facts of the database take the role of the players. In turn, an inconsistency measure I is a function that maps pairs (D, ∆) of a database D and a set ∆ of FDs to a number I(D, ∆) ∈ [0, ∞). Intuitively, the higher the value I(D, ∆) is, the more inconsistent (or, the less consistent) the database D is w.r.t. ∆. The Shapley value of a fact f of a database D w.r.t. an FD set ∆ and inconsistency measure I is then defined as follows:

Shapley(D, ∆, f, I) := Shapley(D, v, f) for the cooperative game in which the set of players is D and the wealth function is v(E) = I(E, ∆).
We note that the definition of the Shapley value requires the cooperative game to be zero on the empty set [Sha53] and this is indeed the case for all of the inconsistency measures I that we consider in this work. Next, we introduce each of these measures.
• I d is the drastic measure that takes the value 1 if the database is inconsistent and the value 0 otherwise [Thi17].
• I MI counts the minimal inconsistent subsets [HK08, HK10]; in the case of FDs, these subsets are simply the pairs of tuples that jointly violate an FD.
• I P is the number of problematic facts, where a fact is problematic if it belongs to a minimal inconsistent subset [GH11]; in the case of FDs, a fact is problematic if and only if it participates in a pair of facts that jointly violate ∆.
• I R is the minimal number of facts that we need to delete from the database for ∆ to be satisfied (similarly to the concept of a cardinality repair and proximity in Property Testing) [GH13, GGR98, Ber19].
• I MC is the number of maximal consistent subsets (i.e., repairs) [GH11, GH17].

Table 1 summarizes the complexity results for the different measures. The first column (lhs chain) refers to FD sets that have a left-hand-side chain-a notion that was introduced by Livshits et al. [LK17], and we recall in the next section. The second column (no lhs chain, PTime c-repair) refers to FD sets that do not have a left-hand-side chain, but entail a polynomial-time cardinality repair computation according to the dichotomy of Livshits et al. [LKR20] that we discuss in more detail in Section 6.
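The five measures can be computed by brute force on a tiny example (exponential time, for intuition only). The toy schema (A, B), the FD A → B, and the three facts below are invented for illustration.

```python
from itertools import combinations

def conflicts(db, fds):
    """All pairs of facts that jointly violate some FD X -> Y
    (the minimal inconsistent subsets in the FD setting)."""
    return [(f, g) for f, g in combinations(db, 2)
            if any(all(f[a] == g[a] for a in X) and any(f[a] != g[a] for a in Y)
                   for X, Y in fds)]

def measures(db, fds):
    pairs = conflicts(db, fds)
    I_MI = len(pairs)                               # minimal inconsistent subsets
    I_P = len({f for p in pairs for f in p})        # problematic facts
    I_d = 1 if pairs else 0                         # drastic measure
    subsets = [set(S) for k in range(len(db) + 1) for S in combinations(db, k)]
    cons = [S for S in subsets if not conflicts(list(S), fds)]
    repairs = [S for S in cons if not any(S < T for T in cons)]  # maximal consistent
    I_MC = len(repairs)
    I_R = len(db) - max(len(S) for S in cons)       # cost of a cardinality repair
    return I_d, I_MI, I_P, I_R, I_MC

fds = [((0,), (1,))]                 # A -> B over the toy schema (A, B)
db = [("a", 1), ("a", 2), ("b", 1)]  # the two a-facts conflict
assert measures(db, fds) == (1, 1, 2, 1, 2)
```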
Example 3.1. Consider again the database of our running example. Since the database is inconsistent w.r.t. the FD set defined in Example 2.1, we have that I d (D, ∆) = 1. As for the measure I MI , the reader can easily verify that there are twenty-nine pairs of tuples that jointly violate the FDs; hence, we have that I MI (D, ∆) = 29. Since each tuple participates in at least one violation of the FDs, it holds that I P (D, ∆) = 9. Finally, as we have already seen in Example 2.1, the database has five repairs and a single cardinality repair obtained by deleting six facts. Thus, I R (D, ∆) = 6 and I MC (D, ∆) = 5. In the next sections, we discuss the computation of the Shapley value for each one of these measures. ♦

Observation 3.2. Let I be an inconsistency measure. Then

Shapley(D, ∆, f, I) = (1/|D|) · Σ_{m=0}^{|D|−1} ( E_{D′∼U_m(D\{f})}[I(D′ ∪ {f}, ∆)] − E_{D′∼U_m(D\{f})}[I(D′, ∆)] )

where U_m(D \ {f}) denotes the uniform distribution over the subsets of D \ {f} of size m.
Proof. By the alternative formula for the Shapley value, we have:

Shapley(D, ∆, f, I) = Σ_{m=0}^{|D|−1} Σ_{E⊆D\{f}, |E|=m} ( m! · (|D| − m − 1)! / |D|! ) · ( I(E ∪ {f}, ∆) − I(E, ∆) )
= Σ_{m=0}^{|D|−1} ( 1 / ( |D| · C(|D|−1, m) ) ) · Σ_{E⊆D\{f}, |E|=m} ( I(E ∪ {f}, ∆) − I(E, ∆) )
= (1/|D|) · Σ_{m=0}^{|D|−1} ( E_{D′∼U_m(D\{f})}[I(D′ ∪ {f}, ∆)] − E_{D′∼U_m(D\{f})}[I(D′, ∆)] )

Note that in the second equality we used the fact that m! · (|D| − m − 1)! / |D|! = 1 / ( |D| · C(|D|−1, m) ), where C(|D|−1, m) is the number of subsets of D \ {f} of size m.

Observation 3.2 implies that to compute the Shapley value of f , it suffices to compute the expectations of the amount of inconsistency over subsets D′ and D′ ∪ {f }, where D′ is drawn uniformly from the space of subsets of size m, for every m. More precisely, the computation of the Shapley value is Cook reducible 1 to the computation of these expectations. Our algorithms will, indeed, compute these expectations instead of the Shapley value.
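The observation can be sanity-checked numerically: the sketch below (with a toy conflict relation of our own choosing) computes the Shapley value both by the subset formula and by averaging per-size expectations, using exact rational arithmetic.

```python
from itertools import combinations
from fractions import Fraction
from math import factorial

def shapley_subsets(db, f, I):
    """Shapley value of fact f via the subset formula."""
    n = len(db)
    others = [g for g in db if g != f]
    total = Fraction(0)
    for k in range(n):
        for B in combinations(others, k):
            w = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
            total += w * (I(set(B) | {f}) - I(set(B)))
    return total

def shapley_by_sizes(db, f, I):
    """Observation-3.2 style: (1/|D|) * sum over m of
    E[I(D' u {f})] - E[I(D')], with D' uniform over size-m subsets of D\\{f}."""
    n = len(db)
    others = [g for g in db if g != f]
    total = Fraction(0)
    for m in range(n):
        subs = list(combinations(others, m))
        e_with = Fraction(sum(I(set(S) | {f}) for S in subs), len(subs))
        e_without = Fraction(sum(I(set(S)) for S in subs), len(subs))
        total += e_with - e_without
    return total / n

# Toy conflict-counting measure: fact 1 conflicts with facts 2 and 3.
bad = {frozenset({1, 2}), frozenset({1, 3})}
I = lambda S: sum(1 for p in combinations(sorted(S), 2) if frozenset(p) in bad)
db = [1, 2, 3, 4]
assert shapley_by_sizes(db, 1, I) == shapley_subsets(db, 1, I) == 1
```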
The second observation is the following. One of the basic properties of the Shapley value is one termed "efficiency": the sum of the Shapley values over all the players equals the total wealth [Sha53]. This property implies that Σ_{f∈D} Shapley(D, ∆, f, I) = I(D, ∆). Thus, whenever the measure itself is computationally hard, so is the Shapley value of facts.

Measures I MI and I P

We start by discussing two tractable measures, namely I MI and I P . We first give algorithms for computing the Shapley value for these measures, and then discuss the generalization to multiple relations.

4.1. Computation. Recall that I MI counts the pairs of facts that jointly violate at least one FD. An easy observation is that a fact f increases the value of the measure I MI by i in a permutation σ if and only if σ f contains exactly i facts that are in conflict with f . Hence, assuming that D contains N f facts that conflict with f , the Shapley value for this measure can be computed in the following way:

Shapley(D, ∆, f, I MI ) = Σ_{g : {f,g} ⊭ ∆} Pr(g appears before f) = N f / 2

since in a uniformly random permutation, each fact g that conflicts with f appears before f with probability 1/2. Therefore, we immediately obtain the following result.

We now move on to I P that counts the "problematic" facts; that is, facts that participate in a violation of ∆. Here, a fact f increases the measure by i in a permutation σ if and only if σ f contains precisely i − 1 facts that are in conflict with f , but not in conflict with any other fact of σ f (hence, all these facts and f itself are added to the group of problematic facts). We prove the following.
1 Recall that a Cook reduction from a function F to a function G is a polynomial-time Turing reduction from F to G, that is, an algorithm that computes F with an oracle to a solver of G.

We denote by X the random variable holding the number of problematic facts in the random subset. We denote by Y g the random variable that holds 1 if the fact g is in the random subset and, moreover, participates there in a violation of the FDs. In addition, we denote the expectations of these variables by E(X) and E(Y g ), respectively (without explicitly stating the distribution D′ ∼ U m (D \ {f }) in the subscript). Due to the linearity of expectation we have:

E(X) = Σ_{g ∈ D\{f}} E(Y g )

Hence, the computation of E_{D′∼U m (D\{f })} I P (D′, ∆) reduces to the computation of E(Y g ), and this value can be computed as follows.
E(Y g ) = m / (|D| − 1) − C(|D| − 2 − N g , m − 1) / C(|D| − 1, m)

where N g is the number of facts in D \ {f } that are in conflict with g. Indeed, g is chosen into the random subset with probability m/(|D| − 1), and the subtracted term is the probability that g is chosen while none of its N g conflicting facts is. We can similarly consider the distribution U m (D \ {f }) for the expectation E_{D′∼U m (D\{f })} I P (D′ ∪ {f }, ∆), where Y′ g is a random variable that holds 1 if g is selected in the random subset and, moreover, participates in a violation of the FDs in D′ ∪ {f }, and 0 otherwise. For a fact g that is not in conflict with f it holds that E(Y′ g ) = E(Y g ), while for a fact g that is in conflict with f it holds that E(Y′ g ) = m / (|D| − 1), since such a fact participates in a violation (with f ) whenever it is selected.
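The hypergeometric expression for this expectation can be verified by enumeration. In the sketch below, the universe size, N g, and m are arbitrary small parameters, and we assume E(Y g) has the form discussed above: the probability that g is chosen minus the probability that g is chosen with none of its conflict partners.

```python
from itertools import combinations
from fractions import Fraction
from math import comb

def e_problematic(n, N_g, m):
    """Pr that a fixed fact g is in a uniform size-m subset of the
    (n-1)-fact set D\\{f} AND some of its N_g conflict partners is too:
    m/(n-1) - C(n-2-N_g, m-1) / C(n-1, m)."""
    return Fraction(m, n - 1) - Fraction(comb(n - 2 - N_g, m - 1), comb(n - 1, m))

# Brute-force check: facts 0..n-2 (f removed), g = 0, partners are 1..N_g.
n, N_g, m = 7, 2, 3
universe = list(range(n - 1))
hits = sum(1 for S in combinations(universe, m)
           if 0 in S and any(p in S for p in range(1, N_g + 1)))
assert Fraction(hits, comb(n - 1, m)) == e_problematic(n, N_g, m)
```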

4.2. Generalization to Multiple Relations. The results of this section immediately generalize to schemas with multiple relation symbols. This is true since one of the basic properties of the Shapley value is linearity [Sha53]:

Shapley(A, v 1 + v 2 , a) = Shapley(A, v 1 , a) + Shapley(A, v 2 , a)

and both measures, I MI and I P , are additive over multiple relations, that is, the value of the measure on the entire database is the sum of the values over the individual relations.

Measure I d : The Drastic Measure

In this section, we consider the drastic measure I d . While the measure itself is extremely simple and, in particular, computable in polynomial time (testing whether ∆ is satisfied), it might be intractable to compute the Shapley value of a fact. In particular, we prove a dichotomy for this measure, classifying FD sets into ones where the Shapley value can be computed in polynomial time and the rest where the problem is FP #P -complete. 2

5.1. Dichotomy. Before giving our dichotomy, we recall the definition of a left-hand-side chain (lhs chain, for short), introduced by Livshits et al. [LK17].
Definition 5.1 [LK17]. An FD set ∆ has a left-hand-side chain if for every two FDs X → Y and X′ → Y′ in ∆, either X ⊆ X′ or X′ ⊆ X.
Example 5.2. The FD set of our running example (Example 2.1) has an lhs chain. We could also define ∆ with redundancy by adding the following FD: train time arrives → departs. The resulting FD set does not have an lhs chain, but it is equivalent to an FD set with an lhs chain. An example of an FD set that does not have an lhs chain, not even up to equivalence, is {train time → departs, train departs → time}. ♦

We prove the following.

Theorem 5.3. Let ∆ be an FD set. If ∆ is equivalent to an FD set with an lhs chain, then Shapley(D, ∆, f, I d ) can be computed in polynomial time. Otherwise, the problem is FP #P -complete.

Interestingly, this is the exact same dichotomy that we obtained in prior work [LK17] for the problem of counting subset repairs. We also showed that this tractability criterion is decidable in polynomial time by computing a minimal cover: if ∆ is equivalent to an FD set with an lhs chain, then every minimal cover of ∆ has an lhs chain. In the remainder of this section, we prove Theorem 5.3.

5.1.1. Hardness Side. The proof of the hardness side of Theorem 5.3 has two steps. We first show hardness for the matching constraint {A → B, B → A} over the schema (A, B), and this proof is similar to the proof of Livshits et al. [LBKS20] for the problem of computing the Shapley contribution of facts to the result of the query q() :- R(x), S(x, y), T(y). Then, from this case to the remaining cases we apply the fact-wise reductions that have been devised in prior work [LK17]. We start by proving hardness for {A → B, B → A}.
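The syntactic lhs-chain condition is a simple pairwise-inclusion test, sketched below. Note that this checks only the syntactic property; deciding equivalence to an lhs-chain set would additionally require computing a minimal cover first, as discussed above.

```python
def has_lhs_chain(fds):
    """True iff the left-hand sides of the FDs form a chain under inclusion.
    Each FD is a pair (X, Y) of sets of attribute names."""
    lhss = [frozenset(X) for X, _ in fds]
    return all(x <= y or y <= x for x in lhss for y in lhss)

# The running example's FDs have an lhs chain:
assert has_lhs_chain([({"train", "time"}, {"departs"}),
                      ({"train", "time", "duration"}, {"arrives"})])
# The matching constraint {A -> B, B -> A} does not:
assert not has_lhs_chain([({"A"}, {"B"}), ({"B"}, {"A"})])
```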
Proof. We construct a reduction from the problem of computing the number |M(g)| of matchings in a bipartite graph g [Val79a]. Note that we consider partial matchings; that is, subsets of edges that consist of mutually-exclusive edges. Given an input bipartite graph g, we construct m + 1 input instances (D 1 , f 1 ), . . . , (D m+1 , f m+1 ) to our problem, where m is the number of edges in g, in the following way. For every r ∈ {1, . . . , m + 1}, we add one vertex v 1 to the left-hand side of g and r + 1 vertices u 1 , . . . , u r , v 2 to the right-hand side  of g. Then, we connect the vertex v 1 to every new vertex on the right-hand side of g. We construct the instance D r from the resulting graph by adding a fact (u, v) for every edge (u, v) in the graph. We will compute the Shapley value of the fact f corresponding to the edge (v 1 , v 2 ). The reduction is illustrated in Figure 2.
In every instance D r , the fact f will increase the value of the measure by one in a permutation σ if and only if σ f satisfies two properties: (1) the facts of σ f jointly satisfy the FDs in ∆, and (2) σ f contains at least one fact that is in conflict with f . Hence, for f to affect the value of the measure in a permutation, we have to select a set of facts corresponding to a matching from the original graph g, as well as exactly one of the facts corresponding to an edge (v 1 , u i ) (since the facts (v 1 , u i ) and (v 1 , u j ) for i ≠ j jointly violate the FD A → B). We have the following:

Shapley(D r , ∆, f, I d ) = Σ_{k=0}^{m} r · |M(g, k)| · (k + 1)! · (|D r | − k − 2)! / |D r |!

where M(g, k) is the set of matchings of g containing precisely k edges.
Hence, we obtain m + 1 equations from the m + 1 constructed instances: for every r ∈ {1, . . . , m + 1}, we get one linear equation in the unknowns |M(g, 0)|, . . . , |M(g, m)|, and together they form a system of equations.
Let us divide each column in the above matrix by the constant (j + 1)! (where j is the column number, starting from 0) and each row by i + 1 (where i is the row number, starting from 0), and reverse the order of the columns. We then get a matrix A with coefficients a i,j = (i + j)!, whose determinant is det(A) = Π_{i=0}^{m} (i!)^2 ≠ 0; hence, the matrix is non-singular [Bac02]. Since dividing a column by a constant divides the determinant by a constant, and reversing the order of the columns can only change the sign of the determinant, the determinant of the original matrix is not zero as well, and that matrix is non-singular. Therefore, we can solve the system of equations, and compute the value Σ_{k=0}^{m} |M(g, k)|, which is precisely the number of matchings in g.
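The non-singularity claim can be checked numerically: the matrix with entries (i + j)! has determinant equal to the product of the squared factorials, which is nonzero. A small exact-arithmetic check of our own, using fraction-based Gaussian elimination:

```python
from fractions import Fraction
from math import factorial, prod

def det_fraction(M):
    """Determinant via exact Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    n, d = len(M), Fraction(1)
    for i in range(n):
        pivot = next((r for r in range(i, n) if M[r][i]), None)
        if pivot is None:
            return Fraction(0)
        if pivot != i:
            M[i], M[pivot] = M[pivot], M[i]
            d = -d  # a row swap flips the sign
        d *= M[i][i]
        for r in range(i + 1, n):
            c = M[r][i] / M[i][i]
            for col in range(i, n):
                M[r][col] -= c * M[i][col]
    return d

m = 4
A = [[factorial(i + j) for j in range(m + 1)] for i in range(m + 1)]
# det(A) = product of (i!)^2 over i = 0..m, hence nonzero: the system is solvable.
assert det_fraction(A) == prod(factorial(i) ** 2 for i in range(m + 1))
```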
Generalization via Fact-Wise Reductions. Using the concept of a fact-wise reduction [Kim12], we can prove hardness for any FD set that is not equivalent to an FD set with an lhs chain. We first give the formal definition of a fact-wise reduction. Let (R, ∆) and (R′, ∆′) be two pairs of a relational schema and an FD set. A mapping from R to R′ is a function µ that maps facts over R to facts over R′.
A fact-wise reduction from (R, ∆) to (R′, ∆′) is a mapping Π from R to R′ with the following properties.
(1) Π is injective; that is, for all facts f and g over R, if Π(f ) = Π(g) then f = g.
(2) Π preserves consistency and inconsistency; that is, for all facts f and g over R, it holds that {f, g} |= ∆ if and only if {Π(f), Π(g)} |= ∆′.

We now turn to the tractable side of Theorem 5.3. Recall from Observation 3.2 that it suffices to compute, for every size m, the probability that a uniformly chosen subset of D \ {f} of size m violates ∆, as well as the probability that such a subset violates ∆ once f is added. Due to the structure of FD sets with an lhs chain, we can compute these probabilities efficiently, as we explain next.
Our main observation is that for an FD X → Y , if we group the facts of D by X (i.e., split D into maximal subsets of facts that agree on the values of all attributes in X), then this FD and the FDs that appear later in the chain may be violated only among facts from the same group. Moreover, when we group by XY (i.e., further split each group of X into maximal subsets of facts that agree on the values of all attributes in Y ), facts from different such groups within the same X-group always violate this FD, and hence, violate ∆. We refer to the former groups as blocks and the latter groups as subblocks. This special structure allows us to split the problem into smaller problems, solve each one of them separately, and then combine the solutions via dynamic programming.
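The grouping into blocks and subblocks can be sketched as follows; the toy schema (A, B, C), the single FD A → B, and the facts are invented for illustration.

```python
from itertools import groupby

def blocks(facts, attrs):
    """Group facts into maximal subsets that agree on all attributes in attrs
    (attributes are referred to by index)."""
    key = lambda f: tuple(f[a] for a in attrs)
    return [list(g) for _, g in groupby(sorted(facts, key=key), key=key)]

# FD X -> Y with X = {A} (index 0) and Y = {B} (index 1) over schema (A, B, C).
db = [("a", 1, "x"), ("a", 1, "y"), ("a", 2, "x"), ("b", 1, "x")]
subblock_counts = []
for block in blocks(db, [0]):            # blocks: group by X
    subblocks = blocks(block, [0, 1])    # subblocks: further group by XY
    # Facts in different subblocks of the same block jointly violate A -> B.
    subblock_counts.append(len(subblocks))
assert subblock_counts == [2, 1]  # block "a" splits in two; block "b" does not
```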
We define a data structure T where each vertex v is associated with a subset of D that we denote by D[v]. The root r is associated with D itself, that is, D[r] = D. At the first level, each child c of r is associated with a block of D[r] w.r.t. X 1 → Y 1 , and each child c′ of c is associated with a subblock of D[c] w.r.t. X 1 → Y 1 . At the second level, each child c′′ of c′ is associated with a block of D[c′] w.r.t. X 2 → Y 2 , and each child c′′′ of c′′ is associated with a subblock of D[c′′] w.r.t. X 2 → Y 2 . This continues all the way to the nth FD; in general, the vertices at odd and even depths are associated with blocks and subblocks, respectively, of their parents' subsets w.r.t. the corresponding FD X i → Y i . We assume that the data structure T is constructed in a preprocessing phase. Clearly, the number of vertices in T is polynomial in |D| and n (recall that n is the number of FDs in ∆), as the height of the tree is 2n and each level contains at most |D| vertices; hence, this preprocessing phase requires polynomial time (even under combined complexity). Then, we compute both E_{D′∼U m (D\{f })} I d (D′, ∆) and E_{D′∼U m (D\{f })} I d (D′ ∪ {f }, ∆) by going over the vertices of T from bottom to top, as we will explain later. Note that for the computation of these values, we construct T from the database D \ {f }. Observe that since we go over the values j in reverse order in the for loop of line 2 (i.e., from m to 1), at each iteration of this loop, we have that v.val[j 2 ] (for all considered j 2 ≤ j) still holds the expected value of I d over subsets of size j 2 of the previous children of v, which is indeed the value that we need for our computation.
This computation of v.val also applies to the block vertices. However, the addition of line 5 only applies to blocks. Since the children of a block belong to different subblocks, and two facts from the same ith level block but different ith level subblocks always jointly violate X i → Y i , a subset of size j of a block also violates the constraints if we select a non-empty subset of the current child c and a non-empty subset of the previous children, even if each of these subsets by itself is consistent w.r.t. ∆. Hence, we add this probability in line 5. Note that all three cases that we consider are disjoint, so we sum the probabilities. Observe also that the leaves of T have no children and we do not update their probabilities; indeed, the probability of selecting a subset of a leaf v that violates the constraints is zero, as all the facts of D[v] agree on the values of all the attributes that occur in ∆.

5.3. Generalization to Multiple Relations.
We now generalize our results to schemas with multiple relation symbols. More formally, we consider (relational) schemas S that consist of a finite set {R 1 , . . . , R n } of relation symbols, each associated with a sequence of attributes. For a set ∆ of FDs over S and a relation symbol R j of S, we denote by ∆ R j the restriction of ∆ to the FDs over R j . Similarly, for a database D over S, we denote by D R j the restriction of D to the facts over R j . Finally, we denote ∆ R 1 ∪ · · · ∪ ∆ R j by ∆ j and D R 1 ∪ · · · ∪ D R j by D j .
It is straightforward that the lower bound provided in this section also holds for schemas with multiple relation symbols. That is, given an FD set ∆ over a schema S, if for at least one relation symbol R of S, the FD set ∆ R is not equivalent to an FD set with an lhs chain, then the problem of computing Shapley(D, ∆, f, I d ) is FP #P -complete. We now generalize our upper bound to schemas with multiple relations; that is, we focus on the case where the FD set ∆ R of every relation symbol R of the schema has an lhs chain (up to equivalence), and show that the Shapley value can be computed in polynomial time.
The formula given in Observation 3.2 for computing the Shapley value is general and also applies to databases over schemas with multiple relation symbols. As aforementioned, for the drastic measure, this computation boils down to computing two probabilities: the probability that a uniformly chosen subset of D \ {f } of size m violates the constraints, and the probability that for such a subset D′ we have D′ ∪ {f } ⊭ ∆. Since we consider FDs, there are no violations among facts over different relation symbols; hence, we can compute these probabilities separately for each one of the relation symbols (i.e., for every pair (D R j , ∆ R j ) of a database and its corresponding FD set), and then we combine these results using dynamic programming, as we explain next.
Let R 1 , . . . , R n be an arbitrary order of the relation symbols. For each j ∈ {1, . . . , n} we denote by T m j the probability that a uniformly chosen subset of size m of D R j \ {f } violates ∆ R j . This value can be computed in polynomial time for every relation symbol, 20:18

E. Livshits and B. Kimelfeld
Vol. 18:2

using the algorithm of Figure 3, as we assume that ∆_{R_j} has an lhs chain. Next, we denote by P_j^m the probability that a uniformly chosen subset of size m of D_j \ {f} violates the constraints of ∆_{R_1} ∪ · · · ∪ ∆_{R_j}. Hence, the value P_n^m is needed for the computation of the Shapley value. We compute this value using dynamic programming. Clearly, we have that P_1^m = T_1^m, and for every j > 1 we prove the following.
Figure 6. A simplification algorithm (repeatedly setting ∆ := ∆ − XY) used for deciding whether a cardinality repair w.r.t. ∆ can be computed in polynomial time [LKR20].

Lemma 5.8. For j ∈ {2, . . . , n} we have that:

P_j^m = Σ_{m_1 + m_2 = m} [ C(|D_{R_j} \ {f}|, m_1) · C(|D_{j−1} \ {f}|, m_2) / C(|D_j \ {f}|, m) ] · ( 1 − (1 − T_j^{m_1}) · (1 − P_{j−1}^{m_2}) )

(where C(n, k) denotes the binomial coefficient).

Proof. Every subset D′ of size m of D_j \ {f} consists of a subset E_1 of size m_1 of D_{R_j} \ {f} and a subset E_2 of size m_2 of D_{j−1} \ {f}, for some m_1, m_2 such that m_1 + m_2 = m. Clearly, D′ violates the constraints if and only if at least one of E_1 or E_2 violates the constraints. That is, given the sizes m_1 and m_2, the probability that D′ violates ∆_j is 1 − (1 − T_j^{m_1}) · (1 − P_{j−1}^{m_2}). Summing over all splits m_1 + m_2 = m, weighted by the probability of each split, yields the claimed equality.
This concludes our proof.
We can similarly compute the second probability required for the Shapley value computation. The only difference is that if the fact f that we consider is over the relation symbol R_j, then T_j^m will be the probability that a uniformly chosen subset D′ ⊆ D_{R_j} \ {f} of size m is such that D′ ∪ {f} violates ∆_{R_j}. This value can be computed in polynomial time using the algorithm of Figure 5. Note that the results of Section 5.2 on the approximate computation of the Shapley value trivially generalize to schemas with multiple relation symbols; hence, there is an additive FPRAS and a multiplicative FPRAS for any set of FDs.
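To make the dynamic program concrete, here is a minimal Python sketch of the combination step of Lemma 5.8, assuming the per-relation probabilities T[j][m] have already been computed (e.g., by the algorithm of Figure 3); the function name and the list-based representation are ours, not from the article.

```python
from math import comb

def combine_violation_probs(sizes, T):
    """Combine per-relation violation probabilities (Lemma 5.8 recurrence).

    sizes[j] : number of facts over relation R_j (excluding f).
    T[j][m]  : probability that a uniformly chosen size-m subset of the
               facts over R_j violates the FDs over R_j.
    Returns P, where P[m] is the probability that a uniformly chosen
    size-m subset of all facts violates at least one FD.
    """
    P = list(T[0])          # base case: a single relation
    total = sizes[0]
    for j in range(1, len(sizes)):
        n_j = sizes[j]
        new_total = total + n_j
        new_P = []
        for m in range(new_total + 1):
            denom = comb(new_total, m)
            prob = 0.0
            # Split the m chosen facts between R_j and R_1 ... R_{j-1}.
            for m1 in range(max(0, m - total), min(n_j, m) + 1):
                m2 = m - m1
                weight = comb(n_j, m1) * comb(total, m2) / denom
                # The subset violates iff at least one part violates.
                prob += weight * (1 - (1 - T[j][m1]) * (1 - P[m2]))
            new_P.append(prob)
        P, total = new_P, new_total
    return P
```

For the Shapley-value computation, one would run this once for subsets of D \ {f} and once for the variant where f is added to the chosen subset.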

Measure I_R: The Cost of a Cardinality Repair

In this section, we study the measure I_R that is based on the cost of a cardinality repair, that is, the minimum number of facts that should be deleted from the database in order to obtain a consistent subset. Unlike the other inconsistency measures considered in this article, we do not have a full dichotomy for the measure I_R.
6.1. Complexity Results. Livshits et al. [LKR20] established a dichotomy for the problem of computing a cardinality repair, classifying FD sets into those for which the problem is solvable in polynomial time, and those for which it is NP-hard. They presented a polynomial-time algorithm, which we refer to as Simplify, that takes as input an FD set ∆, finds a removable pair (X, Y) of attribute sets (if such a pair exists), and removes every attribute of X ∪ Y from every FD in ∆ (we denote the result by ∆ − XY). A pair (X, Y) of attribute sets is considered removable if it satisfies the following three conditions:
• XY is nonempty,
• X and Y are equivalent under ∆ (that is, ∆ implies both X → Y and Y → X), and
• every FD in ∆ contains either X or Y on the left-hand side.
Note that it may be the case that X = Y, and then the conditions imply that every FD of ∆ contains X on the left-hand side. The algorithm is depicted in Figure 6.
Livshits et al. [LKR20] have shown that if it is possible to transform ∆ to an empty set by repeatedly applying Simplify(∆), then a cardinality repair can be computed in polynomial time. Otherwise, the problem is NP-hard (and, in fact, APX-complete).
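As an illustration, the following Python sketch implements the attribute-closure test and a single Simplify step under our reading of the removability conditions above; the data representation (FDs as pairs of frozensets) and the equivalence check are our assumptions, and the polynomial-time search for a removable pair from [LKR20] is omitted.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under the FDs; each FD is a pair
    (lhs, rhs) of frozensets of attribute names."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return frozenset(closed)

def removable(X, Y, fds):
    """Our reading of the three removability conditions for (X, Y)."""
    if not (X | Y):
        return False                      # XY must be nonempty
    if not (Y <= closure(X, fds) and X <= closure(Y, fds)):
        return False                      # assumed: X and Y equivalent under the FDs
    return all(X <= lhs or Y <= lhs for lhs, _ in fds)

def remove_pair(X, Y, fds):
    """One Simplify step: remove the attributes of X ∪ Y everywhere;
    FDs whose right-hand side becomes empty are dropped."""
    XY = X | Y
    return [(lhs - XY, rhs - XY) for lhs, rhs in fds if rhs - XY]
```

For example, for ∆ = {A → B}, the pair ({A}, {A}) is removable, one step leaves {∅ → B}, and a second step (with the pair (∅, {B})) empties the set.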
Fact 3.3 implies that computing Shapley(D, f, ∆, I R ) is hard whenever computing I R (D, ∆) is hard. Hence, we immediately obtain the following.
Theorem 6.1. Let ∆ be a set of FDs. If ∆ cannot be emptied by repeatedly applying Simplify(∆), then computing Shapley(D, f, ∆, I R ) is NP-hard.
In the remainder of this section, we focus on the tractable cases of the dichotomy of Livshits et al. [LKR20]. In particular, we start by proving that the Shapley value can again be computed in polynomial time for an FD set that has an lhs chain. Note that FD sets with an lhs chain are a special case of FD sets that can be emptied via Simplify steps. This holds since every FD set with an lhs chain has either an FD of the form ∅ → X or a set X of attributes that occurs on the left-hand side of every FD. In the first case, (∅, X) is a removable pair, while in the second case, (X, X) is a removable pair.
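For intuition, the syntactic lhs-chain condition itself is easy to check; a minimal sketch (it tests the chain property of the given FDs directly, not up to equivalence):

```python
def has_lhs_chain(fds):
    """Check whether the left-hand sides of the given FDs form a chain
    under set inclusion (syntactic check only, not up to equivalence)."""
    lhss = sorted({frozenset(lhs) for lhs, _ in fds}, key=len)
    return all(a <= b for a, b in zip(lhss, lhss[1:]))
```

For example, {A → B, AB → C} has an lhs chain, while the bipartite-matching set {A → B, B → A} does not.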
Theorem 6.2. Let ∆ be a set of FDs. If ∆ is equivalent to an FD set with an lhs chain, then computing Shapley(D, f, ∆, I_R) can be done in polynomial time, given D and f.

Our polynomial-time algorithm RShapley, depicted in Figure 7, is very similar in structure to DrasticShapley. However, to compute the expected value of I_R, we take the reduction of Observation 3.2 a step further, and show that the problem of computing the expected value of the measure over subsets of size m can be reduced to the problem of computing the number of subsets of size m of D that have a cardinality repair of cost k, given m and k. Recall that we refer to the number of facts that are removed from D to obtain a cardinality repair E as the cost of E. In the subroutine UpdateCount, we compute this number. In what follows, we denote by MR(D, ∆) the cost of a cardinality repair of D w.r.t. ∆. For a block vertex v:
• If w_1 + j_2 ≤ w_2 + j_1, then a cardinality repair of E ∩ D[c] is preferred over a cardinality repair of E ∩ D[prev(c)], as it requires removing fewer facts from the database.
• If w_1 + j_2 > w_2 + j_1, then a cardinality repair of E ∩ D[prev(c)] is preferred over a cardinality repair of E ∩ D[c].
In fact, since we fix t in the computation of v.val[j, t], we do not need to go over all w_1 and w_2. In the first case, we have that w_1 = t − j_2 (hence, the total number of removed facts is t − j_2 + j_2 = t), and in the second case we have that w_2 = t − j_1 for the same reason. Hence, in line 7 we consider the first case where t ≤ w_2 + j_1, and in line 8 we consider the second case where w_1 + j_2 > t. To avoid negative costs, we add a lower bound of t − j_1 on j_2 and w_2 in line 7, and, similarly, a lower bound of t − j_2 on j_1 and w_1 in line 8. For a subblock vertex v, a cardinality repair of D[v] is the union of cardinality repairs of the children of v, as facts corresponding to different children of v do not jointly violate any FD. Therefore, for such vertices, in line 10, we compute v.val by going over all j_1, j_2 such that j_1 + j_2 = j and all t_1, t_2 such that t_1 + t_2 = t, and multiply the number of subsets of size j_1 of the current child for which the cost of a cardinality repair is t_1 by the number of subsets of size j_2 of the previously considered children for which the cost of a cardinality repair is t_2.
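The subblock combination of line 10 is a convolution of the children's count tables; a minimal sketch under our own representation (a dict mapping (size, cost) to the number of subsets):

```python
from collections import defaultdict

def combine_children(tables):
    """Convolve the children's count tables of a subblock vertex.

    Each table maps (size, cost) to the number of subsets of that size
    whose cardinality repair has that cost; the result is the table for
    the union of the children (children never conflict with each other).
    """
    acc = {(0, 0): 1}  # the empty subset has repair cost 0
    for table in tables:
        new = defaultdict(int)
        for (j1, t1), c1 in table.items():
            for (j2, t2), c2 in acc.items():
                # Sizes and repair costs add up across children.
                new[(j1 + j2, t1 + t2)] += c1 * c2
        acc = dict(new)
    return acc
```

For instance, convolving a block of two conflicting facts with a single independent fact yields, among others, one size-3 subset of repair cost 1 (the whole set).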
Next, we give the algorithm RShapleyF for computing E_{D′ ∼ U_m(D \ {f})} [ I_R(D′ ∪ {f}, ∆) ], which again involves a special treatment for vertices that conflict with f. For a subblock vertex v (that does not conflict with f and, hence, matches f), the computation of v.val′ is again very similar to that of v.val, with the only difference being the use of c.val′. Observe that in this case, the children of v correspond to different blocks. Each such block that does not match f also does not violate any FD with f; hence, when we add f to this block, a cardinality repair of the resulting group of facts does not require the removal of f. The only child of v where a cardinality repair might require the removal of f is a child that matches f, and, clearly, there is at most one such child. Therefore, we do not count the fact f twice in the computation of the value v.val′.

6.2. Approximation. In cases where a cardinality repair can be computed in polynomial time, we can obtain an additive FPRAS in the same way as for the drastic measure. (Note that this Shapley value is also in [0, 1].) Moreover, we can again obtain a multiplicative FPRAS using the same technique, due to the following gap property (proved similarly to Proposition 5.6).
Proposition 6.4. There is a polynomial p such that for all databases D and facts f of D the value Shapley(D, f, ∆, I R ) is either zero or at least 1/(p(|D|)).
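The additive guarantee can be obtained by straightforward permutation sampling; a self-contained sketch (the function and the callback-style inconsistency measure are our own illustration, not the article's algorithm):

```python
import random

def shapley_sample(facts, f, inconsistency, samples=1000, rng=None):
    """Estimate the Shapley value of fact f by permutation sampling:
    average the marginal contribution of f over the coalitions that
    precede it in uniformly random permutations of the facts."""
    rng = rng or random.Random()
    others = [g for g in facts if g != f]
    total = 0.0
    for _ in range(samples):
        rng.shuffle(others)
        k = rng.randrange(len(others) + 1)  # uniform position of f
        prefix = others[:k]
        total += inconsistency(prefix + [f]) - inconsistency(prefix)
    return total / samples
```

Since the measures considered here are bounded on inputs of a given size, Hoeffding's inequality yields the additive guarantee with polynomially many samples.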
As aforementioned, Livshits et al. [LKR20] showed that the hard cases of their dichotomy for the problem of computing a cardinality repair are, in fact, APX-complete; hence, there is a polynomial-time constant-ratio approximation, but for some α > 1 there is no (randomized) α-approximation unless P = NP (respectively, NP ⊆ BPP). Since the Shapley value of every fact w.r.t. I_R is positive, the existence of a multiplicative FPRAS for Shapley(D, f, ∆, I_R) would imply the existence of a multiplicative FPRAS for I_R(D, ∆) (due to Fact 3.3), which is a contradiction to the APX-hardness. We conclude that, for these FD sets, there is no multiplicative FPRAS for Shapley(D, f, ∆, I_R) unless NP ⊆ BPP.

The situation is different for the FD set ∆ = {A → B, B → A}: here, computing the number of subsets of size m of D that have a cardinality repair of cost k, given m and k, is #P-hard. The proof is by a reduction from the problem of computing the number of perfect matchings in a bipartite graph, known to be #P-complete [Val79b]. Given a bipartite graph g = (A ∪ B, E) (where |A| = |B|), we construct a database D over (A, B) by adding a fact (a, b) for every edge (a, b) ∈ E. We then define m = |A| and k = 0. It is rather straightforward that the perfect matchings of g correspond exactly to the subsets D′ of size |A| of D such that D′ itself is a cardinality repair.

Observe that the cooperative game for ∆ = {A → B, B → A} can be seen as a game on bipartite graphs where the vertices on the left-hand side represent the values of attribute A, the vertices on the right-hand side correspond to the values that occur in attribute B, and the edges represent the tuples of the database (hence, the players of the game). This game is different from the well-known matching game [AdK14], where the players are the vertices of the graph (and the value of the game is determined by the maximum-weight matching of the subgraph induced by the coalition); in contrast, in our case the players correspond to the edges of the graph. It is not clear what the connection between the two games is, or whether and how we can use known results on matching games to derive results for the game we consider here.
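The correspondence in the reduction can be verified by brute force on small instances; a sketch under our own encoding of facts as (a, b) pairs:

```python
from itertools import combinations

def consistent(facts):
    """A set of (a, b) facts satisfies {A -> B, B -> A} iff no value of A
    or B occurs with two different partners."""
    a_map, b_map = {}, {}
    for a, b in facts:
        if a_map.setdefault(a, b) != b or b_map.setdefault(b, a) != a:
            return False
    return True

def count_repair_subsets(facts, m):
    """Number of size-m subsets that are themselves consistent, i.e.,
    have a cardinality repair of cost 0 (exponential time, for checking)."""
    return sum(1 for s in combinations(facts, m) if consistent(s))
```

On the complete bipartite graph K_{2,2}, encoded as four facts, this count equals the number of perfect matchings, namely 2.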
6.3. Generalization to Multiple Relations. As in the case of I MI and I P , the results of this section easily generalize to schemas with multiple relations, due to the linearity property of the Shapley value. As in the case of the drastic measure, the (positive and negative) results on the approximate computation of the Shapley value trivially generalize to schemas with multiple relation symbols.

Measure I_MC: The Number of Repairs
The final measure that we consider is I_MC, which counts the repairs of the database.

7.1. Dichotomy. A dichotomy result from our previous work [LK17] states that the problem of counting repairs can be solved in polynomial time for FD sets with an lhs chain (up to equivalence), and is #P-complete for any other FD set. The hardness side, along with Fact 3.3, implies that computing Shapley(D, f, ∆, I_MC) is FP^{#P}-hard whenever the FD set is not equivalent to an FD set with an lhs chain. Hence, an lhs chain is a necessary condition for tractability. We show here that it is also sufficient: if the FD set has an lhs chain, then the problem can be solved in polynomial time. Consequently, we obtain the following dichotomy.

Theorem 7.1. Let ∆ be a set of FDs. If ∆ is equivalent to an FD set with an lhs chain, then computing Shapley(D, f, ∆, I_MC) can be done in polynomial time; otherwise, it is FP^{#P}-complete.

Recall that in our reduction from the problem of computing the Shapley value to that of computing the expected value of the measure over subsets of a given size of the database, we considered the uniform distribution where Pr(E) = 1 / C(|D|, m) for a subset E of size m of D.
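For concreteness, the quantity I_MC can be evaluated by brute force on small inputs (exponential time, for illustration only; the encoding of facts as tuples and FDs as position pairs is ours):

```python
from itertools import combinations

def violates(facts, fds):
    """facts: tuples; fds: list of (lhs_positions, rhs_positions)."""
    for lhs, rhs in fds:
        for f1 in facts:
            for f2 in facts:
                if all(f1[i] == f2[i] for i in lhs) and \
                   any(f1[i] != f2[i] for i in rhs):
                    return True
    return False

def count_repairs(facts, fds):
    """I_MC by exhaustive search: the repairs are exactly the maximal
    consistent subsets of the database."""
    n = len(facts)
    consistent = [frozenset(s)
                  for k in range(n + 1)
                  for s in combinations(range(n), k)
                  if not violates([facts[i] for i in s], fds)]
    return sum(1 for s in consistent if not any(s < t for t in consistent))
```

For example, under the FD A → B, the database {(a, 1), (a, 2), (b, 1)} has two repairs, one for each way of resolving the conflicting block of a.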
The result of Lemma 7.2 is reflected in line 6 of the UpdateExpected subroutine. Next, we show the following result for subblock vertices, which we use for the calculation of line 8.

To the best of our knowledge, the existence of the latter is a long-standing open problem [JR18]. This is also the case for any ∆ that is not equivalent to an FD set with an lhs chain, since there is a fact-wise reduction from that FD set to such a ∆ [LK17].

7.3. Generalization to Multiple Relations. As in the case of the drastic measure, we can generalize the upper bound of this section to schemas with multiple relation symbols using dynamic programming. We again consider an arbitrary order R_1, . . . , R_n of the relation symbols of the schema, and denote:

T_j^m = E_{D′ ∼ U_m(D_{R_j} \ {f})} [ I_MC(D′, ∆_{R_j}) ]   and   P_j^m = E_{D′ ∼ U_m(D_j \ {f})} [ I_MC(D′, ∆_j) ].

The value T_j^m can be computed in polynomial time, using the algorithm of Figure 9, as we assume that each ∆_{R_j} has an lhs chain. As for the value P_j^m, we have that P_1^m = T_1^m, and we prove the following for j > 1. (Recall that we denote by ∆_j the FD set ∆_{R_1} ∪ · · · ∪ ∆_{R_j} and by D_j the database D_{R_1} ∪ · · · ∪ D_{R_j}.)

Lemma 7.4. For every j ∈ {2, . . . , n} we have that:

P_j^m = Σ_{m_1 + m_2 = m} [ C(|D_{R_j} \ {f}|, m_1) · C(|D_{j−1} \ {f}|, m_2) / C(|D_j \ {f}|, m) ] · T_j^{m_1} · P_{j−1}^{m_2}

Proof. A basic observation here is that the number of repairs of D_{R_1} ∪ · · · ∪ D_{R_j} is the product of the number of repairs of D_{R_j} and the number of repairs of D_{R_1} ∪ · · · ∪ D_{R_{j−1}}, since there are no conflicts among facts over different relation symbols. Thus, the expected number of repairs of a uniformly chosen subset of size m of D_j \ {f} factors, for each split m_1 + m_2 = m, into the product T_j^{m_1} · P_{j−1}^{m_2}, weighted by the probability of that split, which yields the claimed equality.

The computation of E_{D′ ∼ U_m(D_j \ {f})} [ I_MC(D′ ∪ {f}, ∆_j) ] is very similar, with the only difference being the fact that T_j^m = E_{D′ ∼ U_m(D_{R_j} \ {f})} [ I_MC(D′ ∪ {f}, ∆_{R_j}) ] for the relation symbol R_j of f. This value can be computed in polynomial time using the algorithm of Figure 10. Finally, as in the case of the drastic measure, it is rather straightforward that the lower bound of Theorem 7.1 generalizes to the case where the FD set ∆_R has no lhs chain (up to equivalence) for at least one relation symbol R of the schema.
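A minimal Python sketch of this multiplicative combination, parallel to the one for the drastic measure, assuming the per-relation expectations T[j][m] are given (the representation is ours):

```python
from math import comb

def combine_repair_expectations(sizes, T):
    """Combine per-relation expected repair counts (Lemma 7.4 recurrence).

    sizes[j] : number of facts over relation R_j (excluding f).
    T[j][m]  : expected number of repairs of a uniformly chosen size-m
               subset of the facts over R_j.
    Returns P, where P[m] is the expected number of repairs of a
    uniformly chosen size-m subset of all facts.
    """
    P = list(T[0])
    total = sizes[0]
    for j in range(1, len(sizes)):
        n_j = sizes[j]
        new_total = total + n_j
        new_P = []
        for m in range(new_total + 1):
            denom = comb(new_total, m)
            val = 0.0
            for m1 in range(max(0, m - total), min(n_j, m) + 1):
                m2 = m - m1
                weight = comb(n_j, m1) * comb(total, m2) / denom
                # Repair counts multiply: no conflicts across relations.
                val += weight * T[j][m1] * P[m2]
            new_P.append(val)
        P, total = new_P, new_total
    return P
```

For a database with one isolated fact in R_1 and one conflicting pair in R_2, for instance, a uniformly chosen subset of size 2 has 4/3 repairs in expectation.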

Conclusions
We studied the complexity of calculating the Shapley value of database facts for basic inconsistency measures, focusing on FD constraints. We showed that two of them are computable in polynomial time: the number of violations (I_MI) and the number of problematic facts (I_P). In contrast, each of the drastic measure (I_d) and the number of repairs (I_MC) features a dichotomy in complexity, where the tractability condition is the possession of an lhs chain (up to equivalence). For the cost of a cardinality repair (I_R), we showed a tractable fragment and an intractable fragment, but a gap remains for certain FD sets: those that do not have an lhs chain and yet admit a polynomial-time computable cardinality repair.
We also studied the approximability of the Shapley value and showed, among other things, an FPRAS for I d and a dichotomy in the existence of an FPRAS for I R .
Many other directions are left open for future research. First, the picture is incomplete for the measure I_R. In particular, the complexity of the exact computation is open for the bipartite matching constraint {A → B, B → A} that, unlike the known FD sets in the intractable fragment, has an FPRAS. In general, we would like to complete the picture of I_R towards a full dichotomy. Moreover, for the schemas where there is no FPRAS for I_R, our results neither imply nor refute the existence of a constant-ratio approximation (for some constant). Second, the problems immediately extend to types of constraints other than functional dependencies, such as denial constraints, tuple-generating dependencies, and so on. Third, it would be interesting to see how the results extend to wealth-distribution functions other than the Shapley value, for instance the Banzhaf Power Index [DS79]. The tractable cases remain tractable for the Banzhaf Power Index, but it is not clear how (and whether) our proofs for the lower bounds generalize to this function. Another direction is to investigate whether properties of the database (e.g., bounded treewidth) have an impact on the complexity of computing the Shapley value. Finally, there is the practical question of implementation: while our algorithms terminate in polynomial time, we believe that they are hardly scalable without further optimization and heuristics tailored to the use case; developing those is an important challenge for future research.