The Complexity of Rooted Phylogeny Problems

Several computational problems in phylogenetic reconstruction can be formulated as restrictions of the following general problem: given a formula in conjunctive normal form where the literals are rooted triples, is there a rooted binary tree that satisfies the formula? If the formulas do not contain disjunctions, the problem becomes the famous rooted triple consistency problem, which can be solved in polynomial time by an algorithm of Aho, Sagiv, Szymanski, and Ullman. If the clauses in the formulas are restricted to disjunctions of negated triples, Ng, Steel, and Wormald showed that the problem remains NP-complete. We systematically study the computational complexity of the problem for all such restrictions of the clauses in the input formula. For certain restricted disjunctions of triples we present an algorithm that has sub-quadratic running time and is asymptotically as fast as the fastest known algorithm for the rooted triple consistency problem. We also show that any restriction of the general rooted phylogeny problem that does not fall into our tractable class is NP-complete, using known results about the complexity of Boolean constraint satisfaction problems. Finally, we present a pebble game argument that shows that the rooted triple consistency problem (and also all generalizations studied in this paper) cannot be solved by Datalog.


Introduction
Rooted phylogeny problems are fundamental computational problems for phylogenetic reconstruction in computational biology, and more generally in areas dealing with large amounts of data about rooted trees. Given a collection of partial information about a rooted tree, we would like to know whether there exists a single rooted tree that explains the data. A concrete example of a computational problem in this context is the rooted triple consistency problem. We are given a set V of variables, and a set of triples ab|c with a, b, c ∈ V , and we would like to know whether there exists a rooted tree T with leaf set V such that for each of the given triples ab|c the youngest common ancestor of a and b in this tree is below the youngest common ancestor of a and c (if such a tree exists, we say that the instance is satisfiable).
The rooted triple consistency problem has an interesting history. The first polynomial time algorithm for the problem was discovered by Aho, Sagiv, Szymanski, and Ullman [ASSU81], motivated by problems in database theory. This algorithm was later rediscovered for phylogenetic analysis [Ste92]. Henzinger, King, and Warnow [HKW96] showed how to use decremental graph connectivity algorithms to improve the quadratic runtime O(mn) of the algorithm by Aho et al. to a deterministic algorithm with runtime O(m√n). Dekker [Dek86] asked whether there is a finite set of 'rules' that allows one to infer a triple ab|c from a given set of triples Φ whenever all trees satisfying Φ also satisfy ab|c. This question was answered negatively by Bryant and Steel [BS95]. Dekker's 'rules' have a very natural interpretation in terms of Datalog programs. Datalog as an algorithmic tool for rooted phylogeny problems is more powerful than Dekker's rules. We say that a Datalog program solves the rooted triple consistency problem if it derives a distinguished 0-ary predicate false on a given set of triples if and only if the instance of the rooted triple consistency problem is not satisfiable. One of the results of this paper is the proof that there is no Datalog program that solves the rooted triple consistency problem.
Datalog inexpressibility results are known to be very difficult to obtain, and the few existing results often exhibit interesting combinatorics [KV95,ASY91,FV99,Gro94,BK10]. The tool we apply to show our result, the existential pebble game, originates in finite model theory, and was successfully applied to finite domain constraint satisfaction [KV98]. A recent generalization of the intimate connection between Datalog and the existential pebble game to a broad class of infinite domain constraint satisfaction problems [BD08] allows us to apply the game to study the expressive power of Datalog for the rooted triple consistency problem.
There are several other important rooted phylogeny problems. Examples are the subtree avoidance problem, introduced in [NSW00], and the forbidden triple problem [Bry97]; both are NP-hard. It turns out that all of those problems and many other rooted phylogeny problems can be conveniently put into a common framework, which we introduce in this paper.
A rooted triple formula is a formula Φ in conjunctive normal form where all literals are of the form ab|c. It turns out that the problems mentioned above and many other rooted phylogeny problems (we provide more examples in Section 2) can be formalized as the satisfiability problem for a given rooted triple formula Φ where the set of clauses that might be used in Φ is (syntactically) restricted. If C is a class of clauses, and the input is confined to rooted triple formulas with clauses from C, we call the corresponding computational problem the rooted phylogeny problem for clauses from C.
In this paper, we determine for all classes of clauses C the computational complexity of the rooted phylogeny problem for clauses from C. In all cases, the corresponding computational problem is either in P or NP-complete. In our proof of the complexity classification we apply known results from Boolean constraint satisfaction. The rooted phylogeny problem is closely related to a corresponding split problem (defined in Section 4), which is a Boolean constraint satisfaction problem where we are looking for a surjective solution, i.e., a solution where at least one variable is set to true and at least one variable is set to false. The complexity of Boolean split problems has been classified in [CKS01]. If C is such that the corresponding split problem can be solved efficiently, our algorithmic results in Section 4 show that the rooted phylogeny problem for clauses from C can be solved in polynomial time. Conversely, we present a general reduction that shows that if the split problem is NP-hard, then the rooted phylogeny problem for C is NP-hard as well.

Phylogeny Problems
We fix some standard terminology concerning rooted trees. Let T be a tree (i.e., an undirected, acyclic, and connected graph) with a distinguished vertex r, the root of T. The vertices with exactly one neighbor in T are called leaves. The vertices of T are denoted by V(T), and the leaves of T by L(T) ⊆ V(T). For u, v ∈ V(T), we say that u lies below v if the path from u to r passes through v. We say that u lies strictly below v if u lies below v and u ≠ v. The youngest common ancestor (yca) of two vertices u, v ∈ V(T) is the node w such that both u and v lie below w and w has maximal distance from r. Note that the yca, viewed as a binary operation, is commutative and associative, and hence there is a canonical definition of the yca of a set of elements u 1 , . . . , u k . The tree T is called binary if the root has two neighbors, and every other vertex has either three neighbors or one neighbor. A neighbor u of a vertex v is called a child of v (and v is called the parent of u) in T if the distance of u from the root is strictly larger than the distance of v from the root. We write uv|w (or say that uv|w holds in T) if u, v, w are distinct leaves of T and yca(u, v) lies strictly below yca(u, w) in T. Note that for distinct leaves u, v, w of any binary tree T, exactly one of the triples uv|w, uw|v, and vw|u holds in T.
Definition 2.1. A rooted triple formula is a (quantifier-free) conjunction of clauses (also called triple clauses) where each clause is a disjunction of literals of the form xy|z.
Example 2.2. An example of a triple clause is xz|y ∨ yz|x; it will also be denoted by xy∤z. Another example of a triple clause is xy|z 1 ∨ xy|z 2 .
The following notion is used frequently in later sections. If Φ is a formula, and S is a subset of the variables of Φ, then Φ[S] denotes the conjunction of all those clauses in Φ that only contain variables from S.
Definition 2.3. A rooted triple formula Φ is satisfiable if there exists a rooted binary tree T and a mapping α from the variables of Φ to the leaves of T such that in every clause at least one literal is satisfied. A literal xy|z is satisfied by (T, α) if α(x), α(y), α(z) are distinct and if yca(α(x), α(y)) lies strictly below yca(α(x), α(z)) in T. The pair (T, α) is then called a solution to Φ.
We would like to remark that a rooted triple formula Φ is satisfiable if and only if there exists a rooted binary tree T and an injective mapping α from the variables of Φ to the leaves of T such that the formula evaluates under α to true.
Example 2.4. Let Φ = (xz|y ∨ yz|x) ∧ xy|w be a rooted triple formula with variables V = {w, x, z, y}. Then the tree T (figure: a rooted binary tree whose leaves are drawn in the order x, z, y, w) together with the identity mapping on V is a solution to Φ.
A fundamental problem in phylogenetic reconstruction is the rooted triple consistency problem [HKW96,BS95,Ste92,ASSU81]. This problem can be stated conveniently in terms of rooted triple formulas.
Problem 2.8 (Rooted-Phylogeny for clauses from C). INSTANCE : A rooted triple formula Φ where each clause can be obtained from clauses in C by substitution of variables. QUESTION: Is Φ satisfiable?
All of these problems belong to NP. A given solution (T, α) can be verified in polynomial time using the following deterministic algorithm. For each literal of each clause of Φ, check whether the literal is satisfied. If every clause contains at least one literal that is satisfied by (T, α), then the given solution is valid; otherwise it is invalid. A literal ab|c is satisfied if α(a), α(b), and α(c) are distinct and if v 1 = yca(α(a), α(b)) lies strictly below v 2 = yca(α(a), α(c)) (recall Definition 2.3). Determining the youngest common ancestor of two vertices is straightforward using a bottom-up search from each vertex. Another search is then used to check whether v 1 lies strictly below v 2 .
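The verification procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is encoded as a hypothetical parent map (vertex to parent, root to None), a literal xy|z is encoded as the tuple (x, y, z), and all function names are invented for this sketch. The sketch does not check that the given tree is binary.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]      # (x, y, z) stands for the literal xy|z
Clause = List[Triple]              # a clause is a disjunction of literals
Parent = Dict[str, Optional[str]]  # assumed encoding: vertex -> parent, root -> None


def path_to_root(parent: Parent, v: str) -> List[str]:
    """Bottom-up search: v, its parent, ..., the root."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def yca(parent: Parent, u: str, v: str) -> str:
    """Youngest common ancestor: first vertex on v's upward path that is above u."""
    above_u = set(path_to_root(parent, u))
    for w in path_to_root(parent, v):
        if w in above_u:
            return w
    raise ValueError("vertices are not in the same tree")


def strictly_below(parent: Parent, u: str, v: str) -> bool:
    """u lies strictly below v: v is on the path from u to the root and u != v."""
    return u != v and v in path_to_root(parent, u)


def literal_satisfied(parent: Parent, alpha: Dict[str, str], lit: Triple) -> bool:
    a, b, c = (alpha[x] for x in lit)
    if len({a, b, c}) < 3:  # the images must be pairwise distinct
        return False
    return strictly_below(parent, yca(parent, a, b), yca(parent, a, c))


def is_solution(parent: Parent, alpha: Dict[str, str], formula: List[Clause]) -> bool:
    """Every clause must contain at least one satisfied literal."""
    return all(any(literal_satisfied(parent, alpha, lit) for lit in clause)
               for clause in formula)
```

For instance, the tree with root r, inner vertices n1 and n2, and leaves x, z (children of n2), y (child of n1), and w (child of r) is one tree, chosen here for illustration, that passes the check for the formula of Example 2.4 under the identity mapping.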
Note that the rooted triple consistency problem, the subtree avoidance problem, and the forbidden triple consistency problem are examples of rooted phylogeny problems, by appropriately choosing the class C. For example, for the rooted triple consistency problem we choose C = {xy|z}. The subtree avoidance problem is the rooted phylogeny problem for the class C that contains for each k the clause x 1 y 1 ∤z 1 ∨ · · · ∨ x k y k ∤z k , a disjunction of negated triples.
Finally, note that when C contains clauses with literals of the form xx|y, xy|x, or xy|y, then these literals can be removed from the clause since they are unsatisfiable. If all literals in a triple clause are of this form, then the clause is unsatisfiable. It is clear that in instances of the rooted phylogeny problem for clauses from a fixed class C one can efficiently decide whether the input contains such clauses (in which case the input is unsatisfiable). Thus, removing such clauses from C does not affect the complexity of the rooted phylogeny problem for clauses from C. To avoid dealing with degenerate cases, we therefore adopt the convention that no clause in C contains literals of the form xx|y, xy|x, or xy|y.

Constraint Satisfaction Problems. Many phylogeny problems can be viewed as infinite domain constraint satisfaction problems (CSPs), which are defined as follows. Let Γ be a structure with a finite relational signature τ. A first-order formula over τ is called primitive positive if it is of the form ∃x 1 , . . . , x n . ψ 1 ∧ · · · ∧ ψ m where ψ 1 , . . . , ψ m are atomic formulas over τ, i.e., of the form x = y or R(x 1 , . . . , x k ) for a k-ary R ∈ τ. Then the constraint satisfaction problem for Γ, denoted by CSP(Γ), is the computational problem to decide whether a given primitive positive sentence (i.e., a primitive positive formula without free variables) is true in Γ. The sentence Φ is also called an instance of CSP(Γ), and the clauses of Φ are also called the constraints of Φ. We cannot give a full introduction to constraint satisfaction and to constraint satisfaction on infinite domains, but point the reader to [BJK05,Bod08]. Here, we only specify an infinite structure ∆ that can be used to describe the rooted triple consistency problem as a constraint satisfaction problem. It will then be straightforward to see that all rooted phylogeny problems for clauses from a finite class C can be formulated as infinite domain CSPs as well.
The signature of ∆ is {|} where | is a ternary relation symbol. The domain of ∆ is N → {0, 1}, i.e., the set of all infinite binary strings (hence, the domain of ∆ is uncountable).
For two elements f, g of ∆, let lcp(f, g) be the set {1, . . . , n} where n is the largest natural number i such that f (j) = g(j) for all j ∈ {1, . . . , i}; if no such i exists, we set lcp(f, g) := ∅, and if f = g, we set lcp(f, g) := N. The ternary relation f g|h in ∆ holds on elements f, g, h of ∆ if they are pairwise distinct and | lcp(f, g)| > | lcp(f, h)|.
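As an illustration of this definition, the following Python sketch evaluates the relation fg|h on finite truncations of the infinite binary strings. The encoding is an assumption made only for this sketch: elements are given as strings over '0' and '1' of equal length, long enough to separate the elements involved, and the function names are hypothetical.

```python
def lcp_length(f: str, g: str) -> int:
    """Size of lcp(f, g): number of leading positions on which f and g agree."""
    n = 0
    for x, y in zip(f, g):
        if x != y:
            break
        n += 1
    return n


def triple_holds(f: str, g: str, h: str) -> bool:
    """fg|h holds if f, g, h are pairwise distinct and |lcp(f, g)| > |lcp(f, h)|.

    Assumes equal-length truncations; equal truncations are treated as equal elements.
    """
    if len({f, g, h}) < 3:
        return False
    return lcp_length(f, g) > lcp_length(f, h)
```

With f = "000", g = "001", and h = "010", the strings f and g share a prefix of length 2 while f and h share only a prefix of length 1, so fg|h holds and fh|g does not.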
The following lemma shows that instances of the rooted triple consistency problem can be viewed as primitive positive formulas over the signature {|}.
Proof. Suppose that ∃x 1 , . . . , x n . Φ(x 1 , . . . , x n ) is true in ∆, and let f 1 , . . . , f n : N → {0, 1} be witnesses for x 1 , . . . , x n that satisfy Φ in ∆. We define a finite rooted tree T as follows. The vertex set of T consists of the restrictions of f i to lcp(f i , f j ) for all 1 ≤ i, j ≤ n (we do not require i and j to be distinct). Vertex g is above vertex g′ in T if g′ extends g; it is clear that this describes T uniquely. Note that f 1 , . . . , f n are exactly the leaves of T, and that T is binary. Let α be the map that sends x i to f i . Then (T, α) satisfies Φ.
Conversely, let (T, α) be a solution to Φ. For each vertex v of T that is not a leaf, let l(v) and r(v) be the two neighbors of v in T that have larger distance from the root than v. Let h be the length of the path r = p 1 , . . . ,
This shows that the rooted triple consistency problem is indeed a constraint satisfaction problem. A refined version of this observation will be useful in Section 3 to apply known techniques for proving Datalog inexpressibility of the rooted triple consistency problem.
A triple clause is called trivial if the clause is satisfied by any injective mapping from the variables into the leaves of any rooted tree. The following lemma (Lemma 2.10) shows that the rooted triple consistency problem is among the simplest rooted phylogeny problems, that is, for every class C that contains a non-trivial triple clause the rooted phylogeny problem for C can simulate the rooted triple consistency problem in a simple way.
Proof. First observe that if k = 3 and if φ(x 1 , x 2 , x 3 ) contains only one literal then renaming its variables is trivial.
is logically equivalent to ab|c. If φ(x 1 , x 2 , x 3 ) contains three or more literals, then due to its non-triviality there can only be at most two distinct literals. Thus, we fall back to one of the already shown cases and the claim follows for all clauses with exactly three variables.
If k > 3, then non-triviality of φ implies that φ(x 1 , . . . , x k ) can be written as In both cases we can falsify all literals in φ that contain a variable x i 4 distinct from x i 1 , x i 2 , x i 3 by making x i 4 equal to some other variable in this literal. The claim then follows from the case k = 3.
This implies that the Datalog inexpressibility result for the rooted triple consistency problem we present in the next section applies to all the rooted phylogeny problems for clauses from C that contain a non-trivial clause.

Datalog
Datalog is an important algorithmic concept originating both in logic programming and in database theory [AHV95, EF99, Imm98]. Feder and Vardi [FV99] observed that Datalog programs can be used to formalize efficient constraint propagation algorithms used in Artificial Intelligence [All83, Mon74,Dec92,Mac77]. Such algorithms have also been studied for the phylogenetic reconstruction problem. Dekker [Dek86] studied rules that infer rooted triples from given sets of rooted triples, and asked whether there exists a set of rules such that a rooted triple can be derived by these rules from a set of rooted triples Φ if and only if it is logically implied by Φ. This question was answered negatively by Bryant and Steel [BS95].
In this section, we show the stronger result that the rooted triple consistency problem cannot be solved by Datalog. This is a considerable strengthening of this previous result by Bryant and Steel, since we can use Datalog programs not only to infer rooted triples that are implied by other rooted triples, but rather might use Datalog rules to infer an arbitrary number of relations (aka IDBs) of arbitrary arity to solve the problem. Moreover, we only require that the Datalog program derives false if and only if the instance is unsatisfiable. In particular, we do not require that the Datalog program derives every rooted triple that is logically implied by the instance (which is required for the question posed by Dekker). Finally, as already announced in the conference version of this paper, we show that the proof technique extends to other constraint formalisms for reasoning about trees.
In our proof, we use a pebble game that was introduced to describe the expressive power of Datalog [KV95] and which was later used to study Datalog as a tool for finite domain constraint satisfaction problems [FV99]. The correspondence between Datalog and pebble games extends to infinite domain constraint satisfaction problems for countably infinite ω-categorical structures. A countably infinite structure is called ω-categorical if its first-order theory has exactly one countable model up to isomorphism. It can be seen (e.g. using the theorem of Ryll-Nardzewski, see [Hod93]) that the structure ∆ introduced in Section 2 is, unfortunately, not ω-categorical. However, there are several ways of defining an ω-categorical structure Λ (described also in [Cam90]) which has the same constraint satisfaction problem.
We exactly follow the axiomatic approach to define such a structure Λ given in [AN98]. A ternary relation C is said to be a C-relation on a set L if for all a, b, c, d ∈ L the following conditions hold:
A structure Γ is called k-transitive if for any two k-tuples (a 1 , . . . , a k ) and (b 1 , . . . , b k ) of distinct elements of Γ there is an automorphism 3 of Γ that maps a i to b i for all i ≤ k. A structure Γ is said to be relatively k-transitive if for every partial isomorphism f between induced substructures of Γ of size k there exists an automorphism of Γ that extends f. Note that a relatively 3-transitive C-set is necessarily 2-transitive.
Theorem 3.1 (Theorem 14.7 in [AN98]). Let (L; C) be a relatively 3-transitive C-set. Then Theorem 11.2 and 11.3 in [AN98] show how to construct such a C-relation from a semilinear order 4 that is dense, normal, and branches everywhere (all these concepts are defined in [AN98]). Such a semi-linear order is explicitly constructed in Section 5 of [AN98].
In fact, there is, up to isomorphism, a unique relatively 3-transitive countable C-set which
• is uniform with branching number 2, that is, for all a, b, c ∈ L we have C(a; b, c) ∨ C(b; c, a) ∨ C(c; a, b),
• is dense, and
• satisfies ¬C(a; a, a) for all a ∈ L.
(See the comments in [AN98] after the statement of Theorem 14.7; the condition that ¬C(a; a, a) for all (equivalently, for some) a ∈ L has been forgotten there, but is necessary to obtain uniqueness.) In the following, let Λ be the structure whose domain is the domain of the dense C-set that is uniform with branching number 2; the signature of Λ is not the C-relation, but the relation xy|z defined from the C-relation by
Structures that are first-order definable in ω-categorical structures are ω-categorical (Theorem 7.3.8 in [Hod93]), so in particular Λ is ω-categorical. Note that the relation | of Λ satisfies (C1), (C2), (C3), but not (C4).
3 An automorphism of a structure Γ is an isomorphism between Γ and itself.
4 A poset is connected if for any two a, b there exists a c such that a ≤ c and b ≤ c, or a ≥ c and b ≥ c. A connected poset is called semi-linear if for every point, the set of all points above it is linearly ordered.

The following observation has already been made in [Bod08], but without proof, so we provide a proof here.
Proof. Suppose that there are a 1 , . . . , a n such that Φ(a 1 , . . . , a n ) is true in Λ. We first define a binary relation ≼ on the set of all pairs (a, b) with a, b ∈ {a 1 , . . . , a n }. We set (a, b) ≼ (c, d) if ¬cd|a ∧ ¬cd|b, and define R :
Lemma 3.3 (Lemma 12.1 in [AN98]). The relation ≼ is a preorder, and hence R is an equivalence relation.
Also the following is taken from [AN98]; but to avoid extensive references into the proofs there, we give a self-contained presentation here. We claim that the poset ≼/R that is induced by ≼ in the natural way on the equivalence classes of R is semi-linear. To see this, let (a 1 , a 2 ), (b 1 , b 2 ), (c 1 , c 2 ) be such that (a 1 , a 2 ) ≼ (b 1 , b 2 ) and (a 1 , a 2 ) ≼ (c 1 , c 2 ). We have to show that (b 1 , b 2 ) and (c 1 , c 2 ) are comparable in ≼. If (b 1 , b 2 ) ⋠ (c 1 , c 2 ), then c 1 c 2 |b 1 or c 1 c 2 |b 2 . Suppose in the following that c 1 c 2 |b 1 ; the case c 1 c 2 |b 2 is analogous. Since (a 1 , a 2 ) ≼ (c 1 , c 2 ) we have in particular ¬c 1 c 2 |a 1 in Λ. Recall that the relation | satisfies (C3), which can be equivalently written as ∀a, b, c, d. (C(a; b, c) ∧ ¬C(d; b, c)) → C(a; d, c), so we find that a 1 c 1 |b 1 . By (C2) we have ¬a 1 b 1 |c 1 . Since (a 1 , a 2 ) ≼ (b 1 , b 2 ) we have ¬b 1 b 2 |a 1 . Axiom (C3) can also be written as ∀a, b, c, d. (¬C(a; d, c) ∧ ¬C(d; b, c)) → ¬C(a; b, c), and thus ¬b 1 b 2 |c 1 . Similarly, ¬b 1 b 2 |c 2 . Therefore, (c 1 , c 2 ) ≼ (b 1 , b 2 ), which is what we had to show.
Next, note that when (d 1 , d 2 ) and (e 1 , e 2 ) are incomparable with respect to ≼, then (d 1 , e 1 ) is an upper bound for (d 1 , d 2 ) and (e 1 , e 2 ), that is, (d 1 , d 2 ) ≼ (d 1 , e 1 ) and (e 1 , e 2 ) ≼ (d 1 , e 1 ). It follows that ≼/R is indeed a semi-linear order with a smallest element r, and there exists a tree T on the equivalence classes of R such that p lies below q in T if for all (equivalently, for some) (a, b) ∈ p and (c, d) ∈ q we have (c, d) ≼ (a, b). Let α be the map that sends x i to the equivalence class of (a i , a i ); it is straightforward to verify that (T, α) satisfies Φ.
Conversely, let (T, α) be a solution to Φ. We now determine elements a 1 , . . . , a n from Λ, and prove by induction on i that α(x r )α(x s )|α(x t ) in T if and only if a r a s |a t in Λ, for all r, s, t ≤ i. This is trivial for n = i = 1, and for n = i = 2 we can choose arbitrary distinct elements a 1 and a 2 from Λ. Now suppose we have already found elements a 1 , . . . , a i of Λ, for 2 ≤ i < n, that satisfy the inductive hypothesis. Let v be the vertex in T that has the maximal distance from the root of T such that there is a j ≤ i where both α(x j ) and α(x i+1 ) lie strictly below v.
First consider the case that v is the root of T. Then we can choose k, l ∈ {1, . . . , i} such that v = yca(α(x k ), α(x l )). Let a be an element of Λ that is distinct from a k and a l ; by the properties of Λ (xy|z is uniform with branching number 2), one of a k a l |a, a k a|a l , or aa k |a l holds. In the first case, we set a i+1 to a. In the second case, by relative 3-transitivity of Λ there exists an automorphism β of Λ that maps a k to a l and that fixes a. In this case we set a i+1 to β(a l ). In the third case we proceed similarly to the second. In all three cases we have a p a q |a i+1 for all p, q ≤ i, which proves the inductive step.
Next, consider the case that v is not the root of T. In this case, there must be an m ≤ i such that α(x j )α(x i+1 )|α(x m ); choose m such that the distance between the root and yca(α(x j ), α(x m )) is maximal. If j is the only index of size at most i such that α(x j ) lies below v in T, then density of Λ (axiom (C7) in the special case that b = c) implies that there is an a such that a j a|a m . We can then set a i+1 to a. Otherwise, there are j′, j″ ≤ i such that α(x j′ )α(x j″ )|α(x i+1 ); choose j′, j″ such that the distance between v and yca(α(x j′ ), α(x j″ )) is minimal. Again we apply density (axiom (C7)) and conclude that there is an a such that a j′ a j″ |a and a j a|a m . We can then set a i+1 to a.
The Existential Pebble Game. The fact that Λ is ω-categorical allows us to use the existential k-pebble game to establish the Datalog lower bound for the rooted triple consistency problem [BD08].
The existential k-pebble game (for a structure Γ) is played by the players Spoiler and Duplicator on an instance Φ of CSP(Γ) and Γ. Each player has k pebbles, p 1 , . . . , p k for Spoiler and q 1 , . . . , q k for Duplicator; we say that q i corresponds to p i . Spoiler places his pebbles on the variables of Φ, Duplicator her pebbles on elements of Γ. Initially, none of the pebbles is placed. In each round of the game Spoiler picks some of his pebbles. If some of these pebbles are already placed on Φ, then Spoiler removes them from Φ, and Duplicator responds by removing the corresponding pebbles from Γ. Spoiler then places the picked pebbles on variables of Φ, and Duplicator responds by placing the corresponding pebbles on elements of Γ. Duplicator loses if at some point of the game
• there is a clause R(x 1 , . . . , x k ) in Φ such that x 1 , . . . , x k are pebbled by p j 1 , . . . , p j k , and
• the corresponding pebbles q j 1 , . . . , q j k of Duplicator are placed on elements b 1 , . . . , b k in Γ such that R(b 1 , . . . , b k ) does not hold in Γ.
Duplicator wins if the game continues forever. We will make use of the following theorem from [BD08].
Theorem 3.4 (Theorem 5 in [BD08]). Let Γ be an ω-categorical (or finite) structure. Then there is no Datalog program that solves CSP(Γ) if and only if for every k there exists an unsatisfiable instance Φ k of CSP(Γ) such that Duplicator wins the existential k-pebble game on Φ k and Γ.
Our Method. The incidence graph G(Φ) of an instance Φ of CSP(Γ) is the (undirected, simple) bipartite graph whose vertex set is the disjoint union of the variables of Φ and the clauses of Φ. An edge joins a variable a and a clause φ of Φ when a appears in φ. A leaf of Φ is a variable that has degree one in G(Φ). An instance has girth k if the shortest cycle of its incidence graph has 2k edges.
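To make the notion of girth concrete, the following Python sketch builds the incidence graph of an instance and computes its girth by breadth-first search from every vertex. The encoding is an assumption of this sketch: an instance is a list of clauses, each given only by the tuple of variables occurring in it, and the function names are hypothetical. The function returns None when the incidence graph is acyclic.

```python
from collections import deque
from typing import Dict, Hashable, List, Optional, Sequence, Set


def incidence_graph(clauses: List[Sequence[str]]) -> Dict[Hashable, Set[Hashable]]:
    """Bipartite graph: variable nodes ('var', x) and clause nodes ('clause', i)."""
    adj: Dict[Hashable, Set[Hashable]] = {}
    for i, clause in enumerate(clauses):
        c = ("clause", i)
        adj.setdefault(c, set())
        for x in clause:
            v = ("var", x)
            adj.setdefault(v, set()).add(c)
            adj[c].add(v)
    return adj


def shortest_cycle(adj: Dict[Hashable, Set[Hashable]]) -> Optional[int]:
    """Length of a shortest cycle of a simple undirected graph, or None."""
    best: Optional[int] = None
    for source in adj:
        dist = {source: 0}
        parent: Dict[Hashable, Optional[Hashable]] = {source: None}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    parent[w] = v
                    queue.append(w)
                elif parent[v] != w:
                    # A non-tree edge closes a cycle of at most this length.
                    length = dist[v] + dist[w] + 1
                    if best is None or length < best:
                        best = length
    return best


def instance_girth(clauses: List[Sequence[str]]) -> Optional[int]:
    """Girth k of the instance: the shortest incidence cycle has 2k edges."""
    length = shortest_cycle(incidence_graph(clauses))
    return None if length is None else length // 2
```

Two clauses that share two variables already force girth 2, which is why high-girth instances must be sparse in this sense.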
Lemma 3.5. Let Γ be an l-transitive (for l ≥ 1) ω-categorical (or finite) structure with relations of arity at most l + 1. Suppose that for every k there exists an unsatisfiable instance Φ k of girth at least k where every constraint has an injective satisfying assignment. Then CSP(Γ) cannot be solved by Datalog.
We will see examples for l = 1 and for l = 2 in this paper. Note that by 1-transitivity, every unary relation in Γ either denotes the empty set or the full domain of Γ. Since Φ k only contains satisfiable constraints, all unary constraints in Φ k are satisfied by every mapping to Γ. So we make in the following the assumption that Φ k does not contain unary constraints.
In the proof we use the following concept, inspired by a Datalog inexpressibility result that was established for temporal reasoning [BK10]. The notion of dominated sets allows us to specify a winning strategy for Duplicator for the existential k-pebble game.
Proof of Lemma 3.5. To apply Theorem 3.4, we have to prove that Duplicator wins the existential k-pebble game on Φ k and Γ.
Suppose that in the course of the game, u is an unpebbled leaf of a dominated set S with pebbled leaves a 1 , . . . , a l , and let b 1 , . . . , b l be the corresponding responses of Duplicator. Duplicator will play in such a way that b 1 , . . . , b l are pairwise distinct. Moreover, Duplicator always maintains the following invariant. Whenever Spoiler places a pebble on a l+1 , Duplicator can play a value b l+1 from Γ such that the mapping that assigns a i to b i for 1 ≤ i ≤ l + 1 can be extended to all of S such that this extension is a satisfying assignment for Φ k [S].
The invariant is satisfied at the beginning of the game: when Spoiler places a pebble on a 1 , Duplicator can play any value b 1 , which is a legal move by our assumption that Φ k does not contain unary constraints.
Suppose that during the game Spoiler pebbles a variable a. Let S 1 , . . . , S p be the dominated sets where a is the unpebbled leaf before Spoiler puts his pebble on a. (If there is no such dominated set, then p = 0.) Let T 1 , . . . , T q be the newly created dominated sets after Spoiler put his pebble on a. Note that since each T i has not been a dominated set before Spoiler put his pebble on a, it must contain one unpebbled leaf distinct from a, which we denote by r i . For an illustration, see Figure 1.
We have to show that under the assumption that Duplicator in her previous moves has always maintained the invariant, she will be able to make a move that again fulfills the invariant. If p > 0, then the union S of the sets S 1 , . . . , S p was itself a dominated set already before Spoiler played on a, since G S is clearly connected (all the S i share the vertex a) and no unpebbled leaves can be created by taking a union of dominated sets. The next move of Duplicator is the value b from the invariant applied to S. This preserves the invariant, since for every i ≤ q, the set T i ∪ S has been a dominated set already before Spoiler played on a: because T i and S share the vertex a, the graph G S∪T i is connected, and since a is not a leaf in G S∪T i , the only unpebbled leaf of G S∪T i is r i . Therefore, α can be extended to all of T i . If p = 0, Duplicator plays an arbitrary element b in Γ. We prove by induction on the size of T i that α can be extended to T i such that α(a) = b. We can assume that only leaves in G T i are pebbled (otherwise, since G T i is a tree, the task reduces to proving the statement for proper subsets of T i ). Consider a clause φ of Φ k [T i ] that contains a, and let V be the variables of φ. This clause must be unique: otherwise, the graph obtained from G(Φ k ) by removing the vertex a has at least two components. Only one of those components can contain r i ; the other component must then be a dominated set where all leaves are pebbled, a contradiction to the assumption that p = 0. Now consider the graph H obtained from G T i by removing the vertex that corresponds to φ. See Figure 2.
If one of the connected components of H, say C, forms a dominated set, then the unique variable v in C ∩ V (uniqueness again follows from the fact that G T i is a tree) is the unique unpebbled leaf of C, and by the invariant of Duplicator's strategy α can be extended to α′ that is defined on all of C such that it satisfies Φ k [C]. Hence, by removing the pebbles from C and adding a pebble on v, with α′(v) the corresponding response of Duplicator, we can apply the inductive assumption to T i \ C ∪ {v} to find an extension of α that is a satisfying assignment for Φ k [T i ] and maps a to b.
Otherwise, all variables in V except for the variable that lies in the connected component of r i in H are pebbled. By our assumption on the signature, the clause φ contains at most l pebbled variables (including a). Also by assumption there exists an injective mapping β : V → Γ that satisfies φ. Since Γ is l-transitive, there is an automorphism γ of Γ that maps β(a) to b and that sends β(w) to α(w), for w ∈ T i \ {v}. Then we extend α to v by α(v) := γ(β(v)); the extension clearly satisfies φ. Now we repeat the argument with v in place of a, and α(v) in place of b, and are done by inductive assumption.
Application to the Rooted Phylogeny Problem. We now turn back to the rooted triple consistency problem, CSP(Λ). The structure Λ is 2-transitive and the only relation has arity three, and hence we can apply Lemma 3.5 to prove that CSP(Λ) cannot be solved by Datalog.
To construct an unsatisfiable girth k instance Φ k for CSP(Λ), let G be a cubic graph of girth at least k that has a Hamiltonian cycle. Such a graph exists; see e.g. the comments after the proof of Theorem 3.2 in [Big98]. Note that G must have an even number of vertices. Let H = (v 1 , v 2 , . . . , v n ) be the Hamilton cycle of G. For any vertex a of G, let r(a) be the vertex that precedes a on H, s(a) the vertex that follows a on H, and t(a) the third remaining neighbor of a in G.
We now define Φ k . The vertices of G will be the variables of Φ k , and Φ k is the conjunction, over all vertices a of G, of the triple clauses r(a)s(a)|t(a).
Consider the graph on the variables of Φ k that has an edge ab when Φ k contains a triple clause ab|c for some variable c of Φ k . This graph is connected, since it actually equals the Hamilton cycle H of G. Hence, a condition due to Aho et al. [ASSU81] implies that Φ k is unsatisfiable for all k ≥ 1. This can also be seen by Lemma 4.3 in Section 4. It is clear that every triple clause of Φ k has an injective satisfying assignment. So the only remaining condition to apply Lemma 3.5 is the verification that G(Φ k ) has girth k. But this is obvious since any cycle of length 2l < 2k in the incidence graph G(Φ k ) would give rise to a cycle of length l < k in G, in contradiction to G having girth k.
Corollary 3.7. There is no Datalog program that solves the rooted triple consistency problem.
Other Applications of the Technique. Our technique to show Datalog inexpressibility can be adapted to show that the following (closely related) problems cannot be solved by Datalog as well.
• Satisfiability of branching time constraints [BJ03];
• The network consistency problem of the left-linear-point algebra [Due05,Hir97];
• Cornell's tree description logic [Cor94,BK07].
All three of these problems contain the following computational problem as a special case. To again apply Lemma 3.5, we first have to show that Tree-Description-Consistency can be formulated as a CSP for a transitive ω-categorical structure Ω = (D; <, ||); this has already been observed in [BN06]. This time, it is more convenient to directly construct Ω. The domain D consists of the set of all non-empty finite sequences of rational numbers. For a = (q 1 , q 2 , . . . , q n ), b = (q′ 1 , q′ 2 , . . . , q′ m ), n ≤ m, we write a < b if one of the following conditions holds:
• a is a proper initial subsequence of b, i.e., n < m and q i = q′ i for 1 ≤ i ≤ n;
• q i = q′ i for 1 ≤ i < n, and q n < q′ n .
The relation || is the set of all unordered pairs of distinct elements that are incomparable with respect to <. A proof that Ω is indeed 1-transitive and ω-categorical can be found in [AN98] (Section 5). Since the signature is binary, we can again apply Lemma 3.5, and have to find unsatisfiable instances of arbitrarily high girth.
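To make the two conditions defining < concrete, the following is a minimal Python sketch of the relations < and || on finite sequences of rationals. The function names `less_than` and `incomparable` are ours, and the sketch assumes that a < b never holds when a is longer than b, since the text only states the conditions for n ≤ m.

```python
from fractions import Fraction

def less_than(a, b):
    """Return True if a < b for non-empty tuples of rationals a and b.

    Assumption of this sketch: a < b fails whenever len(a) > len(b),
    because the conditions are only stated for n <= m.
    """
    n, m = len(a), len(b)
    if n > m:
        return False
    # First condition: a is a proper initial subsequence of b.
    if n < m and a == b[:n]:
        return True
    # Second condition: agreement on the first n-1 entries, and the
    # n-th entry of a is smaller than the n-th entry of b.
    return a[:n - 1] == b[:n - 1] and a[n - 1] < b[n - 1]

def incomparable(a, b):
    """Return True if a and b are distinct and incomparable with respect to <."""
    return a != b and not less_than(a, b) and not less_than(b, a)
```

For instance, under this reading the sequence (1) lies below (1, 2), while (2) and (1, 5) are incomparable.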
Here we use the fact that Tree-Description-Consistency can simulate the rooted triple consistency problem by a simple reduction [BK07]. We construct Ψ k from Φ k by replacing each triple clause of Φ k of the form xy|z by the three conjuncts u xyz ||z, u xyz < x, and u xyz < y, where u xyz is a newly introduced variable. It can be shown (see [BK07]) that this transformation preserves (un-)satisfiability, and thus Ψ k is unsatisfiable as well. Moreover, the transformation is such that the girth of Ψ k is not smaller than the girth of Φ k . Finally, it is clear that every conjunct in Ψ k has an injective satisfying assignment. Hence, Lemma 3.5 applies, and CSP(Ω) cannot be solved by Datalog.
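The replacement step that produces Ψ k from Φ k can be sketched as follows. This is only an illustration of the syntactic transformation described above; the encoding of triples as Python tuples, the constraint format, and the name `triples_to_tree_description` are assumptions made for the sketch.

```python
def triples_to_tree_description(triples):
    """Replace each triple (x, y, z), read as xy|z, by three binary constraints.

    Each constraint is a tuple (left, relation, right), where relation is
    "<" or "||". The fresh variable u_xyz is encoded as ("u", x, y, z).
    """
    constraints = []
    for (x, y, z) in triples:
        u = ("u", x, y, z)  # newly introduced variable for this triple
        constraints.append((u, "||", z))
        constraints.append((u, "<", x))
        constraints.append((u, "<", y))
    return constraints
```

Every triple contributes exactly one new variable and three conjuncts, so the size of the output is linear in the size of the input.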

The Algorithm
In this section we show that the rooted phylogeny problem can be solved in polynomial time if all clauses come from the class T defined as follows.
The set of all tame clauses is denoted by T .
The algorithm we present in this section builds on previous algorithmic results about the rooted triple consistency problem, most notably [ASSU81,HKW96]. One of the central ideas for the polynomial-time algorithm for the rooted triple consistency problem in [ASSU81] is to associate a certain undirected graph to an instance of the rooted triple consistency problem. We generalize this idea to tame clauses as follows.
Definition 4.2. Let Φ be an instance of the rooted phylogeny problem with tame clauses. Then F Φ := (V, E) is the graph where the vertex set V is the set of variables of Φ, and where E contains an edge {x, y} iff Φ contains a clause xy|z 1 ∨ · · · ∨ xy|z p for p ≥ 1.
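As an illustration of Definition 4.2, the following Python sketch builds F Φ and computes its connected components. It assumes that a clause xy|z 1 ∨ · · · ∨ xy|z p is encoded as a tuple (x, y, [z 1 , . . . , z p ]); this encoding and the function names are ours.

```python
def build_graph(variables, clauses):
    """Adjacency sets of the graph with one edge {x, y} per clause (x, y, zs)."""
    adj = {v: set() for v in variables}
    for x, y, _zs in clauses:
        adj[x].add(y)
        adj[y].add(x)
    return adj

def connected_components(adj):
    """Return the connected components of the graph as a list of sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adj[v] - component)
        seen |= component
        components.append(component)
    return components
```

For the formula ab|c ∧ bc|a ∧ ab|d, the graph has the edges {a, b} and {b, c}, and hence the two components {a, b, c} and {d}.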
The following provides a sufficient (but not a necessary) condition for unsatisfiability of rooted triple formulas with tame clauses.
Lemma 4.3. Let Φ be an instance of the rooted phylogeny problem with tame clauses. If F Φ is connected then Φ is unsatisfiable.
Proof. Let V be the set of variables in Φ. Suppose that there is a solution (T, α) for Φ. Let r be the yca of α(V ) in T (where α(V ) is the set of all leaves in the image of V under α). It cannot be that all vertices in α(V ) lie below the same child of r in T , since otherwise that child would itself be a common ancestor of α(V ) lying strictly below r, contradicting r = yca(α(V )). Since the graph F Φ is connected, there is an edge {x, y} in F Φ such that α(x) and α(y) lie below different children of r in T . Hence, there are z 1 , . . . , z p ∈ V and a clause xy|z 1 ∨ · · · ∨ xy|z p in Φ. By assumption, the yca of α(x) and α(y), which is r, lies strictly below the yca of α(x) and α(z i ) for some 1 ≤ i ≤ p, a contradiction to the choice of r.
To see that the condition is not necessary consider the following example.
Example 4.4. The rooted triple formula Φ = (ab|c ∧ bc|a ∧ ab|d) is unsatisfiable since the first two literals cannot simultaneously be satisfied. But the graph F Φ is disconnected; it has the two components {a, b, c} and {d}.
Proof. If Φ is the empty conjunction, then Φ is clearly satisfiable, and so the answer of the algorithm is correct in this case. The algorithm first computes a connected component S of F Φ (we discuss details of this step in the paragraph about the running time of the algorithm); if S = V , i.e., if F Φ is connected, then Lemma 4.3 implies that Φ is unsatisfiable.
Otherwise, we execute the algorithm recursively on Φ[S] and on Φ[V \ S]. If any of these recursive calls reports an inconsistency, then Φ is clearly unsatisfiable as well: if there were a solution (T, α) to Φ, then for every subset V′ of V the restriction (T, α| V′ ) would be a solution to Φ[V′ ]. Otherwise, we inductively assume that the algorithm correctly asserts the existence of a solution (T 1 , α 1 ) of Φ[S] and of a solution (T 2 , α 2 ) of Φ[V \ S].
Let T be the tree obtained by creating a new vertex r, linking the roots of T 1 and T 2 below r, and making r the root of T . Let α be the mapping that maps x to α i (x) if x ∈ L(T i ), for i ∈ {1, 2}. We claim that (T, α) is a solution to Φ, i.e., we have to show that in every clause ψ of Φ at least one literal is satisfied. If ψ = (xy|z 1 ∨ · · · ∨ xy|z p ), then x and y are in the same subtree T i of T , since they are connected by an edge in F Φ . If all variables of ψ lie completely inside S or completely inside V \ S, we are done by inductive assumption, because (T 1 , α 1 ) is a solution for Φ[S] and (T 2 , α 2 ) is a solution for Φ[V \ S]. Otherwise, there must be a j, 1 ≤ j ≤ p, such that z j lies in a different component than x and y. But in this case the yca of α(x) and α(y) lies strictly below r, which is the yca of α(x) and α(z j ). Hence, the literal xy|z j in ψ is satisfied. This concludes the correctness proof of the algorithm shown in Figure 3.
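The recursion analysed in this proof can be sketched in Python as follows. This is a naive illustration and not the implementation with the running time discussed in the text: it recomputes a connected component from scratch in every call. We assume that a tame clause is encoded as (x, y, [z 1 , . . . , z p ]), that variables are strings, that Φ[S] consists of the clauses all of whose variables lie in S, and that trees are nested pairs with variables as leaves; all names are ours.

```python
def component_of(start, variables, clauses):
    """Connected component of `start` in the graph with one edge {x, y} per clause."""
    adj = {v: set() for v in variables}
    for x, y, _zs in clauses:
        adj[x].add(y)
        adj[y].add(x)
    component, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v not in component:
            component.add(v)
            stack.extend(adj[v] - component)
    return component

def solve(variables, clauses):
    """Return a binary tree (nested pairs) satisfying all clauses, or None."""
    variables = sorted(variables)
    if not clauses:
        # Empty conjunction: any binary tree on the variables is a solution.
        tree = variables[0]
        for v in variables[1:]:
            tree = (tree, v)
        return tree
    S = component_of(variables[0], variables, clauses)
    if len(S) == len(variables):
        return None  # the graph is connected, so the formula is unsatisfiable
    rest = set(variables) - S

    def inside(clause, part):
        x, y, zs = clause
        return {x, y, *zs} <= part

    t1 = solve(S, [c for c in clauses if inside(c, S)])
    t2 = solve(rest, [c for c in clauses if inside(c, rest)])
    if t1 is None or t2 is None:
        return None
    return (t1, t2)  # new root with the two solutions linked below it

def leaves(tree):
    """Set of leaves of a nested-pair tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def triple_holds(tree, x, y, z):
    """xy|z holds iff some subtree contains x and y but not z."""
    if not isinstance(tree, tuple):
        return False
    ls = leaves(tree)
    if x in ls and y in ls and z not in ls:
        return True
    return any(triple_holds(child, x, y, z) for child in tree)
```

On the clauses ab|c ∨ ab|d and cd|a the sketch returns the tree ((a, b), (c, d)); on the formula ab|c ∧ bc|a ∧ ab|d it returns None, because the recursive call on {a, b, c} finds a connected graph.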
We still have to show how this procedure can be implemented such that the running time is in O(m log² n). There are algorithms with sublinear amortized time for testing connectivity in undirected graphs while edges of the graph are removed. This was used to speed up the algorithm for the rooted triple consistency problem [HKW96]. At present, the fastest known algorithm for this purpose appears to be the deterministic decremental graph connectivity algorithm of Holm, de Lichtenberg, and Thorup [THdL98], which has a query time in O(log n/ log log n), and an update time in O(log² n). We can use the same approach as in [HKW96] and obtain an O(m log² n) bound for the worst-case running time of our algorithm.

Complexity Classification
This section is devoted to the proof of the following result.
Theorem 5.1. Let C be a set of rooted triple clauses that contains clauses that are not tame (Definition 4.1). Then the rooted phylogeny problem for clauses from C is NP-complete.
Our proof of Theorem 5.1 consists of two parts. In the first part, we show that if C is not a subset of T , then a certain Boolean split problem associated to C (defined below) is NP-hard. In the second part we show that this Boolean split problem reduces to the rooted phylogeny problem for C.
Definition 5.2 (split formula for Φ). Let Φ be a rooted triple formula. Then the split formula for Φ is the Boolean formula obtained from Φ by replacing each literal xy|z by (x ↔ y) ∧ (z ∨ ¬z).
The purpose of the tautological second conjunct z ∨ ¬z is to introduce the variable z, which would otherwise not appear in the formula; this becomes relevant in the following. If C is a class of triple clauses, we define B(C) to be the set of split formulas for the clauses from C.
A solution to a propositional formula is called surjective if at least one variable is set to true and at least one variable is set to false. The split problem for a set of Boolean formulas B is the problem to decide whether a given conjunction of formulas obtained from formulas in B by variable substitution has a surjective solution.
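The notions of split formula and surjective solution can be explored on small inputs with the following brute-force Python sketch. It assumes that a triple clause is encoded as a list of triples (x, y, z), each read as the literal xy|z, and that a formula is a list of such clauses; the function names are ours, and the enumeration is exponential in the number of variables.

```python
from itertools import product

def split_satisfied(clause, beta):
    """Evaluate the split formula of a clause under the Boolean assignment beta.

    A literal xy|z becomes (x <-> y) and (z or not z); the second conjunct is
    always true, so only the equivalence has to be checked.
    """
    return any(beta[x] == beta[y] for x, y, _z in clause)

def surjective_solutions(formula):
    """Yield all surjective solutions of the split formulas of the given clauses."""
    variables = sorted({v for clause in formula for triple in clause for v in triple})
    for values in product([False, True], repeat=len(variables)):
        beta = dict(zip(variables, values))
        takes_both_values = len(set(values)) == 2
        if takes_both_values and all(split_satisfied(c, beta) for c in formula):
            yield beta
```

For the single clause ab|c there are exactly two surjective solutions (a and b receive one value and c the other), whereas the split formulas of ab|c ∧ bc|a force all three variables to be equal and thus admit no surjective solution.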
We will show that if C is a class of triple clauses that is not a subclass of T , then there exists a finite subset C′ of C such that the split problem for B(C′ ) is NP-complete. In the proof of this statement we use the following result, which follows from Theorem 6.12 in [CKS01], and is due to [CH97]. The notions of Horn, dual Horn, affine, and bijunctive Boolean formulas are standard and are introduced in detail in [CKS01]. Bijunctive formulas are also known as 2-CNF formulas.
Theorem 5.3 (of [CH97]). Let B be a set of Boolean formulas. Then the split problem for B is in P if all formulas in B are from one of the following types: Horn, dual Horn, affine, bijunctive. In all other cases, B contains a finite subset B′ such that the split problem for B′ is NP-complete.
Proposition 5.4. If C is not a subclass of T , then B(C) is neither Horn, dual Horn, affine, nor bijunctive.
Proof. Let φ be a clause from C \ T . By construction the split formula ψ for φ is preserved by x → ¬x and is also preserved by constant operations. Moreover, it is known (and follows from [Pos41]) that every Boolean formula that is preserved by ¬, contains the constants, and is either Horn, dual Horn, affine, or bijunctive must also be preserved by the operation xor defined as (x, y) → (x + y mod 2). So it suffices to show that ψ cannot be preserved by xor.
Because φ is not from T and in particular non-trivial, there is a tree T and an injective mapping α from the variables V of φ to the leaves of T such that (T, α) is not a solution to φ. Moreover, since the clause φ is not tame, it must contain triples ab|c and uv|w where {a, b} ≠ {u, v}. Consider the assignment β that maps x ∈ V to 0 if α(x) is below the first child of the yca of α(V ) in T , and that maps x to 1 otherwise (which child is selected as the first child is not important in the proof). By construction, the assignment β does not satisfy the split formula ψ, since φ is not satisfied by (T, α). Observe that the assignment β 1 that is obtained from β by negating the value assigned to a is a satisfying assignment for ψ, since it satisfies the disjunct ((a ↔ b) ∧ (c ∨ ¬c)) of ψ. The assignment β 2 that is constant 0 except for the variable a which is assigned 1 is also a satisfying assignment for ψ, because it satisfies the disjunct ((u ↔ v) ∧ (w ∨ ¬w)) of ψ. But since xor(β 1 (x), β 2 (x)) equals β(x) for all x ∈ V , this shows that ψ is not preserved by xor, which is what we wanted to show.
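The failure of preservation under xor can be checked by brute force on a small example. The Python sketch below assumes that a clause is encoded as a list of triples (x, y, z), each read as xy|z. It uses the clause ab|c ∨ bc|a as an example of a clause containing two triples with different pairs {a, b} and {b, c}, and the clause ab|c ∨ ab|d for comparison. The function names are ours.

```python
from itertools import product

def split_solutions(clause, variables):
    """All 0/1 assignments (as tuples, in the order of `variables`) that satisfy
    the split formula of the clause, i.e. make x and y equal for some literal xy|z."""
    solutions = set()
    for values in product([0, 1], repeat=len(variables)):
        beta = dict(zip(variables, values))
        if any(beta[x] == beta[y] for x, y, _z in clause):
            solutions.add(values)
    return solutions

def preserved_by_xor(solutions):
    """True iff the componentwise xor of any two solutions is again a solution."""
    return all(
        tuple(p ^ q for p, q in zip(s, t)) in solutions
        for s in solutions
        for t in solutions
    )
```

For ab|c ∨ bc|a, the assignments (1, 1, 0) and (1, 0, 0) to (a, b, c) both satisfy the split formula, but their xor (0, 1, 0) does not. For ab|c ∨ ab|d the solutions are exactly the assignments with a = b, and these are closed under xor.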
We now turn to the second part of the proof of Theorem 5.1. The idea to reduce the split problem for B(C) to the rooted phylogeny problem for clauses from C is to construct instances Φ of the phylogeny problem for C in such a way that Φ is satisfiable if and only if B(Φ) has a surjective solution. To implement this idea, we construct an instance of the phylogeny problem Φ that fragments into simple and satisfiable pieces if B(Φ) has a surjective solution.
Proposition 5.5. Let C be a finite class of triple clauses. Then the split problem for B(C) can be reduced in polynomial time to the rooted phylogeny problem for clauses from C.
Proof. Note that the split formula for a trivial clause is a tautological Boolean formula. Hence, if all clauses in C are trivial, then the split problem for B(C) is clearly in P and there is nothing to show. Otherwise, we can assume that C contains the clause that just consists of ab|c since this clause can be simulated by non-trivial clauses from C by appropriately equating variables (Lemma 2.10).
• To define the first group Φ 1 of clauses, suppose that ψ i has variables y 1 , . . . , y q . Let φ i (y 1 , . . . , y q ) be the triple clause that defines the Boolean relation from B(C) used in ψ i (y 1 , . . . , y q ). By the assumption that C and B(C) are finite it is clear that φ i can be computed efficiently (in constant time). We then add the clause φ i ((y 1 , i, 1), . . . , (y q , i, 1)) to Φ 1 .
• The second group Φ 2 of clauses has for all x s ∈ V , i ∈ {0, . . . , m − 2} (if m = 1 the second group of clauses is empty), and j ∈ {1, . . . , n − 1} the clause (x s , i, j)(x s , i, j + 1)|(x s+j , i, 1) .
Note that Φ 2 only consists of rooted triples, and therefore F Φ 2 is defined, and consists of exactly n paths of length (n − 1)(m − 1). We claim that Φ is satisfiable if and only if ψ 1 ∧ · · · ∧ ψ m has a surjective solution. First suppose that Φ has a solution (T, α). Then the variables U of Φ can be partitioned into the

Concluding Remarks
We have shown that consistency of rooted phylogeny data can be decided in polynomial time when the data consists of tame disjunctions of rooted triples. Our algorithm extends previous algorithmic results about the rooted triple consistency problem, without sacrificing worst-case efficiency. The class T of tame triple clauses that can be handled efficiently is also motivated by another result of this paper, which states that any set of triple clauses that is not contained in T has an NP-complete rooted phylogeny problem. Here we use known results about the complexity of surjective Boolean constraint satisfaction problems.
We also show that no Datalog program can solve the rooted triple consistency problem, using a pebble game that captures the expressive power of Datalog for constraint satisfaction problems with infinite ω-categorical structures. In fact, our result follows from a more general result that also applies to many constraint satisfaction problems outside of phylogenetic reconstruction. We show that a constraint satisfaction problem for a structure with a large automorphism group cannot be solved by Datalog if, roughly, for all k there exists an unsatisfiable instance of girth at least k.
The class of phylogeny problems studied in this paper has a natural generalization to a larger class of computational problems, namely problems of the form CSP(Γ) where Γ has a first-order definition in Λ, the ω-categorical relatively 3-transitive C-set introduced in Section 3. This class contains several additional problems that have been studied in phylogenetic reconstruction, for instance the quartet consistency problem [Ste92]. The larger class also contains new problems that can be solved in polynomial time, and where the split problem consists in finding surjective solutions to Boolean linear equation systems. A complexity classification for this larger class of computational problems remains open and is left for future research.