FO 2 (<, +1, ∼) on data trees, data tree automata and branching vector addition systems

A data tree is an unranked ordered tree where each node carries a label from a finite alphabet and a datum from some infinite domain. We consider the two-variable first order logic FO 2 (<, +1, ∼) over data trees. Here +1 refers to the child and the next sibling relations while < refers to the descendant and following sibling relations. Moreover, ∼ is a binary predicate testing data equality. We exhibit an automata model, denoted DTA # , that is more expressive than FO 2 (<, +1, ∼) but such that emptiness of DTA # and satisfiability of FO 2 (<, +1, ∼) are inter-reducible. This is proved via a model of counter tree automata, denoted EBVASS, that extends Branching Vector Addition Systems with States (BVASS) with extra features for merging counters. We show that, as decision problems, reachability for EBVASS, satisfiability of FO 2 (<, +1, ∼) and emptiness of DTA # are equivalent.


Introduction
A data tree is an unranked ordered tree where each node carries a label from a finite alphabet and a datum from some infinite domain. Together with the special case of data words, they have been considered in the realm of program verification, as they are suitable to model the behavior of concurrent, communicating or timed systems, where data can represent e.g., process identifiers or time stamps [1,6,7]. Data trees are also a convenient model for XML documents [4], where data represent attribute values or text contents. Therefore finding decidable logics for this model is a central problem as it has applications in most reasoning tasks in databases and in verification.
Several logical formalisms and models of automata over data trees have been proposed. Many of them were introduced in relation to XPath, the standard formalism to express properties of XML documents. Although satisfiability of XPath in the presence of data values is undecidable, automata models were introduced for showing decidability of several data-aware fragments [12,4,11,10,13].
As advocated in [4], the logic FO 2 (<, +1, ∼) can be seen as a relevant fragment of XPath. Here FO 2 (<, +1, ∼) refers to the two-variable fragment of first order logic over unranked ordered data trees, with predicates for the child and the next sibling relations (+1), predicates for the descendant and following sibling relations (<) and a predicate for testing data equality between two nodes (∼). Over data words, FO 2 (<, +1, ∼) was shown to be decidable by a reduction to Petri Nets or, equivalently, Vector Addition Systems with States (VASS) [5]. It is also shown in [4] that reachability for Branching Vector Addition Systems with States, BVASS, reduces to satisfiability of FO 2 (<, +1, ∼) over data trees. The BVASS model extends VASS with a natural branching feature for running on trees; see [15] for a survey of the various formalisms equivalent to BVASS. As the reachability of BVASS is a long standing open problem, showing decidability of finite satisfiability for FO 2 (<, +1, ∼) seems unlikely in the near future.
This paper is a continuation of the work of [5,4]. We introduce a model of counter automata, denoted EBVASS, and show that satisfiability of FO 2 (<, +1, ∼) is inter-reducible to reachability in EBVASS. This model extends BVASS by allowing new features for merging counters. In a BVASS the value of a counter at a node x in a binary tree is the sum of the values of that counter at the children of x, plus or minus some constant specified by the transition relation. In EBVASS constraints can be added modifying this behavior. In particular (see Section 3 for a more precise definition) it can enforce the following at node x: one of the counters of its left child and one of the counters of its right child are decreased by the same arbitrary number n, then the sum is performed as for BVASS, and finally, one of the resulting counters is increased by n.
The reduction from FO 2 (<, +1, ∼) to EBVASS goes via a new model of data tree automata, denoted DTA # . Our first result (Section 2) shows that languages of data trees definable in FO 2 (<, +1, ∼) are also recognizable by DTA # . Moreover the construction of the automaton from the formula is effective. Our automata model is a non-trivial extension from data words to data trees of the Data Automata (DA) model of [5], chosen with care in order to be powerful enough to capture the logic but also not too powerful in order to match its computational power. The obvious extensions of DA to data trees are either too weak to capture FO 2 (<, +1, ∼) or too expressive and undecidable (see Proposition 1). Here we consider the strongest of these extensions, called DTA, which is undecidable, and restrict it into a model called DTA # whose associated emptiness problem is equivalent to satisfiability of FO 2 (<, +1, ∼).
Our second result (Section 3) shows that the emptiness problem for DTA # reduces to the reachability problem for EBVASS. Finally we show in Section 4 that the latter problem can be reduced to the satisfiability of FO 2 (<, +1, ∼), closing the loop. Altogether, this implies that showing (un)decidability of any of these problems would show (un)decidability of the three of them. Although this question of (un)decidability remains open, the equivalence shown in this paper between the decidability of these three problems, the definition of the intermediate model DTA # and the techniques used for proving the inter-reductions provide a better understanding of the three problems, and in particular of the emptiness of the branching vector addition systems with states.
Related work. There are many other works introducing automata or logical formalisms for data words or data trees. Some of them are shown to be decidable using counter automata, see for instance [9,13]. The link between counter automata and data automata is not surprising as the latter only compare data values via equality. Hence they are invariant under permutation of the data domain and therefore, often, it is enough to count the number of data values satisfying some properties instead of knowing their precise values.

Preliminaries
In this paper A or B denote finite alphabets while D denotes an infinite data domain. We use E or F when we do not care whether the alphabet is finite or not. We denote by E # the extension of an alphabet E with a new symbol # that does not occur in E.
Unranked ordered data forests. We work with finite unranked ordered trees and forests over an alphabet E, defined inductively as follows: for any a ∈ E, a is a tree. If t 1 , • • • , t k is a finite non-empty sequence of trees then t 1 + • • • + t k is a forest. If s is a forest and a ∈ E, then a(s) is a tree. The sets of trees and forests over E are respectively denoted Trees(E) and Forests(E). A tree is called unary (resp. binary) when every node has at most one child (resp. two children). We use standard terminology for trees and forests defining nodes, roots, leaves, parents, children, ancestors, descendants, following and preceding siblings.
Given a forest t ∈ Forests(E), and a node x of t, we denote by t(x) the label of x in t.
We say that two forests t 1 ∈ Forests(E 1 ) and t 2 ∈ Forests(E 2 ) have the same domain if there is a bijection from the nodes of t 1 to the nodes of t 2 that respects the parent and the next-sibling relations. In this case we identify the nodes of t 1 with the nodes of t 2 and the difference between t 1 and t 2 lies only in the label associated to each node. Given two forests t 1 ∈ Forests(E 1 ), t 2 ∈ Forests(E 2 ) having the same domain, we define t 1 ⊗ t 2 ∈ Forests(E 1 × E 2 ) as the forest over the same domain and such that for all nodes x, (t 1 ⊗ t 2 )(x) = ⟨t 1 (x), t 2 (x)⟩.
The set of data forests over a finite alphabet A and an infinite data domain D is defined as Forests(A×D). Note that every t ∈ Forests(A×D) can be decomposed into a ∈ Forests(A) and d ∈ Forests(D) such that t = a ⊗ d.
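The definitions above can be illustrated by a minimal Python sketch. The names `Node`, `product` and `decompose` are hypothetical and introduced only for this illustration; forests are represented as lists of trees.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Node:
    """A node of an unranked ordered forest; `children` is itself a forest."""
    label: Any
    children: List["Node"] = field(default_factory=list)


Forest = List[Node]  # a forest t_1 + ... + t_k is a list of trees


def product(t1: Forest, t2: Forest) -> Forest:
    """Return t1 ⊗ t2; raises if the two forests do not have the same domain."""
    if len(t1) != len(t2):
        raise ValueError("forests do not have the same domain")
    return [Node((u.label, v.label), product(u.children, v.children))
            for u, v in zip(t1, t2)]


def decompose(t: Forest) -> Tuple[Forest, Forest]:
    """Split a data forest over A × D into its label part a and data part d."""
    a: Forest = []
    d: Forest = []
    for n in t:
        child_a, child_d = decompose(n.children)
        a.append(Node(n.label[0], child_a))
        d.append(Node(n.label[1], child_d))
    return a, d


# Example: labels a(b + b) + c combined with data values 1(2 + 1) + 2.
a = [Node("a", [Node("b"), Node("b")]), Node("c")]
d = [Node(1, [Node(2), Node(1)]), Node(2)]
t = product(a, d)
```

Here `decompose(t)` gives back the pair `(a, d)`, mirroring the statement that every data forest t can be written as a ⊗ d.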

Logics on data forests.
A data forest of Forests(A×D) can be seen as a relational model for first order logic. The domain of the model is the set of nodes in the forest. There is a unary relation a(x) for all a ∈ A containing the nodes of label a. There is a binary relation x ∼ y containing all pairs of nodes carrying the same data value of D, and binary relations E → (x, y) (y is the sibling immediately next to x), E ↓ (x, y) (x is the parent of y), and E ⇒ , E ⇓ which are the non-reflexive transitive closures respectively of E → and E ↓ , minus respectively E → and E ↓ (i.e., they define two or more navigation steps). The reason for this non-standard definition of E ⇒ and E ⇓ is that it will be convenient that equality, E → , E ↓ , E ⇒ and E ⇓ are disjoint binary relations. We will often make use of the macro x < > y, and say that x and y are incomparable, when none of x = y, E → (x, y), E → (y, x), E ↓ (x, y), E ↓ (y, x), E ⇒ (x, y), E ⇒ (y, x), E ⇓ (x, y) and E ⇓ (y, x) holds.
Let FO 2 (<, +1, ∼) be the set of first order sentences with two variables built on top of the above predicates. Typical examples of properties definable in FO 2 (<, +1, ∼) are key constraints (all nodes of label a have different data values), ∀x∀y a(x) ∧ a(y) ∧ x ∼ y → x = y, and downward inclusion constraints (every node x of label a has a node y of label b in its subtree with the same data value), ∀x∃y a(x) → (b(y) ∧ x ∼ y ∧ (E ↓ (x, y) ∨ E ⇓ (x, y))). We also consider the extension EMSO 2 (<, +1, ∼) of FO 2 (<, +1, ∼) with existentially quantified monadic second order variables. Every formula of EMSO 2 (<, +1, ∼) has the form ∃R 1 . . . ∃R n φ where φ is a FO 2 (<, +1, ∼) formula called the core, involving the variables R 1 , . . ., R n as unary predicates. The extension to full monadic second order logic is denoted MSO(<, +1, ∼).
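To make the disjointness of the navigation relations concrete, here is a small sketch that classifies a pair of nodes. It assumes, purely for illustration, that a node is addressed by the tuple of sibling positions leading to it, so that `(0, 2)` is the third child of the first root; the function name and the returned strings are hypothetical.

```python
from typing import Tuple

Path = Tuple[int, ...]  # assumed node address: sibling positions from the top


def relation(x: Path, y: Path) -> str:
    """Return the unique relation holding between nodes x and y.

    "next" stands for E→, "child" for E↓, "far_sibling" for E⇒ (two or
    more sibling steps) and "far_descendant" for E⇓ (two or more child steps).
    """
    if x == y:
        return "="
    if x[:-1] == y[:-1]:  # same parent (or both roots): siblings
        gap = y[-1] - x[-1]
        if gap == 1:
            return "next(x,y)"
        if gap == -1:
            return "next(y,x)"
        return "far_sibling(x,y)" if gap > 1 else "far_sibling(y,x)"
    if y[:len(x)] == x:  # x is a proper ancestor of y
        return "child(x,y)" if len(y) == len(x) + 1 else "far_descendant(x,y)"
    if x[:len(y)] == y:  # y is a proper ancestor of x
        return "child(y,x)" if len(x) == len(y) + 1 else "far_descendant(y,x)"
    return "incomparable"
```

Since the function returns exactly one value for every pair, each pair of nodes falls into exactly one of the ten classes, which is the property the non-standard definition of E ⇒ and E ⇓ is designed to provide.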
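The two example properties can be checked directly on a small data forest. The sketch below assumes a data forest given as a dictionary from node addresses (tuples of sibling positions) to (label, datum) pairs, and reads "in its subtree" as "strictly below"; both the encoding and the function names are assumptions of this illustration.

```python
from typing import Dict, Hashable, Tuple

Path = Tuple[int, ...]
DataForest = Dict[Path, Tuple[str, Hashable]]  # address -> (label, datum)


def key_constraint(t: DataForest, a: str) -> bool:
    """All nodes of label a carry pairwise distinct data values."""
    data = [d for (label, d) in t.values() if label == a]
    return len(data) == len(set(data))


def downward_inclusion(t: DataForest, a: str, b: str) -> bool:
    """Every a-node has a b-node strictly below it with the same datum."""
    def below(x: Path, y: Path) -> bool:
        return len(y) > len(x) and y[:len(x)] == x

    return all(
        any(label_y == b and d_y == d_x and below(x, y)
            for y, (label_y, d_y) in t.items())
        for x, (label_x, d_x) in t.items() if label_x == a
    )


# Example forest: a(b + c) + a(c(b)) with data values 1(1 + 2) + 2(3(2)).
t = {
    (0,): ("a", 1), (0, 0): ("b", 1), (0, 1): ("c", 2),
    (1,): ("a", 2), (1, 0): ("c", 3), (1, 0, 0): ("b", 2),
}
```

On this forest both `key_constraint(t, "a")` and `downward_inclusion(t, "a", "b")` hold, while `downward_inclusion(t, "c", "b")` fails because the c-node at address `(0, 1)` is a leaf.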
We write MSO(<, +1) for the set of formulas not using the ∼ predicate. These formulas ignore the data values, i.e., they are classical monadic second-order formulas over forests.
Automata models for forests. We will informally refer to automata and transducers for forests and unranked trees over a finite alphabet. The particular choice of a model of automata is not relevant here and we refer to [8, Chapters 1, 6, 8] for a detailed description. A set of forests accepted by an automaton is called a regular language and regular languages are exactly those definable in MSO(<, +1).
Fig. 1: A forest t followed by its class forests t[1] and t[2].
We now define two models of automata over data trees. The first and most general one is a straightforward generalization to forests of the automata model over data words of [5]. The second one adds a restriction in order to avoid undecidability.

Automata
General Data Forest Automata model: DTA. A DTA is a pair (A, B) where A is a non-deterministic letter-to-letter transducer taking as input a forest in Forests(A) and returning a forest in Forests(B) with the same domain, while B is a forest automaton taking as input a forest in Forests(B # ). Over data words the emptiness problem of this model was shown to be decidable [5]. Unfortunately it is undecidable over data trees.
Proof. We show that DTA can simulate the Class Automata of [3]. This latter model has an undecidable emptiness problem, already when restricted to data words, i.e., forests consisting only of single-node trees. Given a class automaton C = (A, B) over A × D, we construct a DTA C′ such that C accepts a data word iff C′ accepts a data tree. The idea of the reduction is that we replace each letter b i by a tree of depth i. Hence, even if b i is replaced by # during the run of C′ (conversion to class word), this label can still be recovered.
Let O be a new alphabet containing the two symbols b and #.For any symbol s and 1 ≤ i ≤ n, let s i be the unary data tree of depth i defined recursively by: s 1 = s and s i+1 = s(s i ).We associate to a data word . From the word automaton B we can construct a forest automaton B accepting exactly the set of class forests ŵ[d] such that w d is accepted by B, for all d ∈ D.
The best way to see this is to use MSO(<, +1) logic.The language recognized by B can be defined by a formula ϕ of MSO(<, +1).The formula corresponding to B is constructed by replacing in ϕ each atom of the form b i , 1 by b i (x) and each atom of the form b i , 0 by a formula testing that x has label # and that the subtree rooted at x has depth i.
From there it is now easy to construct a transducer A′ such that the DTA (A′, B′) accepts a data forest iff the class automaton C = (A, B) accepts a data word.
Restricted Data Forest Automata model: DTA # . The second data tree automata model we consider is defined as DTA with a restriction on B. The restriction makes sure that B ignores repeated and contiguous occurrences of # symbols. This ensures that for each class forest t[d], not only can the automaton not see the label of a node whose data value is not d, but it also cannot see the shape of subtrees of nodes whose data value differs from d. In particular it can no longer count the number of # symbols in a subtree and the undecidability proof of Proposition 1 no longer works.
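As an illustration of what B gets to see, the following sketch computes a class forest. It assumes that t[d] is obtained from t by replacing with # the label of every node whose data value differs from d, which is how the surrounding text uses the notion; the names `Node` and `class_forest` and the nested-tuple output format are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Hashable, List, Tuple


@dataclass
class Node:
    """A data forest node: a label, a datum, and an ordered list of children."""
    label: str
    datum: Hashable
    children: List["Node"] = field(default_factory=list)


def class_forest(t: List[Node], d: Hashable) -> List[Tuple[str, list]]:
    """Return t[d] as nested (label, children) pairs.

    Assumption of this sketch: nodes whose datum differs from d keep their
    position in the forest but have their label replaced by '#'.
    """
    return [(n.label if n.datum == d else "#", class_forest(n.children, d))
            for n in t]


# Example: the data forest a(b + a) + b with data values 1(2 + 1) + 2.
t = [Node("a", 1, [Node("b", 2), Node("a", 1)]), Node("b", 2)]
t1 = class_forest(t, 1)  # [("a", [("#", []), ("a", [])]), ("#", [])]
t2 = class_forest(t, 2)  # [("#", [("b", []), ("#", [])]), ("b", [])]
```

In a plain DTA the automaton B still sees the full shape of these forests, including every # node; the DTA # restriction is what prevents B from exploiting that shape.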
A set L ⊆ Forests(B) is called #-stuttering iff it is closed under the rules depicted in Figure 2. Intuitively these rules should be understood as follows: if a subforest is matched by a left-hand side of a rule (when the variables x and y are replaced by (possibly empty) forests), then replacing this subforest by the corresponding right-hand side (with the same variable replacement) yields a forest also in L, and the other way round.
Fig. 2: Closure rules for #-stuttering sets. x represents an arbitrary forest.
For instance if L is #-stuttering and contains the trees t[1] and t[2] of Figure 1, then it should also contain the trees in Figure 3. Typical examples of languages that are not #-stuttering are those counting the number of nodes of label #. Note that #-stuttering languages are closed under union and intersection.
We define DTA # as those DTA (A, B) such that the language recognized by B is #-stuttering.
We conclude this section with the following simple lemma whose proof is a straightforward Cartesian product construction. We use the term letter projection for a relabeling function defined as h : A → A′, where A and A′ are alphabets.
Lemma 2. The class of DTA # languages is closed under union, intersection and letter projection.
The proof works in two steps. In the first step we provide a normal form for sentences of FO 2 (<, +1, ∼) that is essentially an EMSO 2 (<, +1, ∼) formula whose core is a conjunction of simple formulas of FO 2 (<, +1, ∼). In a second step, we show that each of the conjuncts can be translated into a DTA # , and we conclude using composition of these automata by intersection, see Lemma 2.

Intermediate Normal Form
We show first that every FO 2 (<, +1, ∼) formula φ can be transformed into an equivalent EMSO 2 (<, +1, ∼) formula in intermediate normal form:
∃R 1 . . . ∃R k ⋀ i χ i
where each χ i has one of the following forms:
(1) ∀x∀y α(x) ∧ β(y) ∧ δ(x, y) → γ(x, y)
(2) ∀x∃y α(x) → (β(y) ∧ δ(x, y) ∧ ε(x, y))
where each of α and β is a type, that is, a conjunction of unary predicates or their negation (these unary predicates are either from A or from R 1 , . . ., R k , i.e., introduced by the existentially quantified variables), δ(x, y) is either x ∼ y or x ≁ y, γ(x, y) is one of ¬E ⇒ (x, y), ¬E ⇓ (x, y) or ¬(x < > y), and ε(x, y) is one of x = y, E → (x, y), E → (y, x), E ↓ (x, y), E ↓ (y, x), E ⇒ (x, y), E ⇒ (y, x), E ⇓ (x, y), E ⇓ (y, x), x < > y or false. This normal form is obtained by simple syntactical manipulations similar to the ones given in [5] for the data words case, and detailed below.
Scott normal form. We first transform the formula φ into Scott Normal Form obtaining an EMSO 2 (<, +1, ∼) formula of the form:
∃R 1 . . . ∃R m (∀x∀y χ ∧ ⋀ i ∀x∃y χ i )
where χ and every χ i are quantifier free, and R 1 , . . ., R m are new unary predicates (called monadic). This transformation is standard: a new unary predicate R θ is introduced for each subformula θ(x) with one free variable for marking the nodes where the subformula holds. The subformula θ(x) is then replaced by R θ (x) and a conjunct ∀x R θ (x) ↔ θ(x) is added. This yields a formula in the desired normal form.
From Scott to intermediate normal form. We show next that every conjunct of the core of the formula ψ in Scott Normal Form can be replaced by an equivalent conjunction of formulas of the form (1) or (2), possibly by adding new quantifications with unary predicates upfront.
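As a worked illustration of the Scott transformation, consider the sentence φ below; this example is not taken from the paper and is chosen only because it has a single subformula θ(x) with one free variable. The last step uses that the model is non-empty, so that R(x) → ∃y ψ is equivalent to ∃y (R(x) → ψ).

```latex
\begin{align*}
\varphi \;&=\; \forall x\, \bigl(a(x) \rightarrow \underbrace{\exists y\, (b(y) \wedge x \sim y)}_{\theta(x)}\bigr) \\
\;&\equiv\; \exists R_\theta\, \Bigl[\, \forall x\, \bigl(a(x) \rightarrow R_\theta(x)\bigr)
   \;\wedge\; \forall x\, \bigl(R_\theta(x) \leftrightarrow \exists y\, (b(y) \wedge x \sim y)\bigr) \Bigr] \\
\;&\equiv\; \exists R_\theta\, \Bigl[\, \forall x \forall y\, \Bigl( \bigl(a(x) \rightarrow R_\theta(x)\bigr)
   \wedge \bigl(b(y) \wedge x \sim y \rightarrow R_\theta(x)\bigr) \Bigr) \\
 &\qquad\qquad \wedge\; \forall x \exists y\, \bigl(R_\theta(x) \rightarrow (b(y) \wedge x \sim y)\bigr) \Bigr]
\end{align*}
```

The final line has one ∀x∀y conjunct and one ∀x∃y conjunct, both with quantifier-free bodies, as the normal form requires.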
Case ∀x∀y χ. Recall that with our definition, the binary relations E → , E ⇒ , E ↓ , E ⇓ , < > and = are pairwise disjoint. Hence we can rewrite ∀x∀y χ into an equivalent FO 2 (<, +1, ∼) formula in the following form, where every subformula ψ * is quantifier free and only involves the predicate ∼ together with unary predicates. They can be obtained from χ via conjunctive normal form and De Morgan's law. The resulting formula is equivalent to the conjunction where leaf(x) is a new predicate denoting the leaves of the forest and last(x) is also a new predicate denoting nodes having no right sibling. The predicate leaf is specified by the following formulas, that have the desired form.
The first three conjuncts, with quantifier prefix ∀x∃y, will be treated later when dealing with the second case.
For the next three conjuncts, putting ¬ψ ⇒ , ¬ψ ⇓ , ¬ψ < > in disjunctive normal form (with an exponential blowup), we rewrite ψ ⇒ , ψ ⇓ , ψ < > as a conjunction of formulas of the form ¬(α(x) ∧ β(y) ∧ δ(x, y)), where α, β, and δ are as in (1). By distribution of conjunction over implication, and by contraposition, we obtain for the three cases an equivalent conjunction of formulas of the following form (matching the desired form (1)): ∀x∀y α(x) ∧ β(y) ∧ δ(x, y) → γ(x, y).
Case ∀x∃y χ. We first transform χ (with an exponential blowup) into an equivalent disjunction of the form χ = ⋁ j α j (x) ∧ β j (y) ∧ δ j (x, y) ∧ ε j (x, y), where α j , β j , δ j and ε j are as in (2). Next, in order to eliminate the disjunctions, we add new monadic second-order variables R χ,j , that we existentially quantify upfront of the global formula, and transform ∀x∃y χ into a conjunction. Its first conjuncts express that if R χ,j (x) holds, then there exists a node y such that the corresponding disjunct of χ holds, and its last conjunct expresses that for every node x, at least one of the R χ,j (x) must hold, and can be rewritten as ∀x∃y (⋀ j ¬R χ,j (x)) → false. Now all the conjuncts are as in (2) and we are done.

Case analysis for constructing DTA # from intermediate normal forms
We now show how to transform a formula in intermediate normal form into a DTA # . Let A be the initial alphabet and let A′ be the new alphabet formed by combining letters of A with the newly quantified unary predicates R 1 , . . ., R k . By closure of DTA # under intersection and letter projection (Lemma 2), it is enough to construct a DTA # automaton for each simple formula of the form (1) or (2), accepting the data forests in Forests(A′ × D) satisfying the formula.
We do a case analysis depending on the atoms involved in the formula of the form (1) or (2). For each case we construct a DTA # (A, B) recognizing the set of data forests satisfying the formula. The construction borrows several ideas from the data word case [5], but some extra work is needed as the tree structure is more complicated. In the discussion below, a node whose label satisfies the type α will be called an α-node. Many of the cases build on generic constructions that we describe in the following remark.
Remark 1. A DTA # (A, B) can be used to distinguish one specific data value, by recoloring, with A, all the nodes carrying the data value, and checking, with B, the correctness of the recoloring. We will then say that (A, B) marks a data value using the new color c. This can be done as follows. The transducer A marks (i.e. relabels the node by adding to its current label an extra color) a node x with this data value with a specific new color c′. At the same time it guesses all the other nodes sharing the same data value as x and marks each of them with a new color c. Then, the forest automaton B checks, for every data value, that either none of the nodes are marked with c or c′, or that all nodes not labeled with # are marked with c or c′ and that c′ occurs exactly once in the same class forest. Note that this defines a #-stuttering language. It is now clear that for the run to be accepting, A must color exactly one data value and that all the nodes carrying this data value must be marked with c or c′. The transducer A can then build on this fact for checking other properties.
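The class-by-class condition checked by B in Remark 1 can be phrased operationally. The sketch below works on a flat list of (datum, colors) pairs rather than on forests, since the condition does not depend on the tree shape; the function names and the strings `"c"` and `"c'"` are hypothetical stand-ins for the two colors.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Set, Tuple

Marked = Tuple[Hashable, Set[str]]  # (datum, set of colors among "c", "c'")


def classes_are_consistent(nodes: Iterable[Marked]) -> bool:
    """Condition checked by B on every class: either no node of the class
    is colored, or all are colored with c or c' and c' occurs exactly once."""
    by_datum = defaultdict(list)
    for datum, colors in nodes:
        by_datum[datum].append(colors)
    for marks in by_datum.values():
        uncolored = all(not m for m in marks)
        colored = (all(m in ({"c"}, {"c'"}) for m in marks)
                   and sum(m == {"c'"} for m in marks) == 1)
        if not (uncolored or colored):
            return False
    return True


def marks_one_data_value(nodes: List[Marked]) -> bool:
    """Class consistency plus the regular check, left to A in this sketch,
    that c' is used exactly once in the whole forest."""
    exactly_one_pivot = sum("c'" in colors for _, colors in nodes) == 1
    return exactly_one_pivot and classes_are_consistent(nodes)
```

For example, `[(1, {"c'"}), (1, {"c"}), (2, set())]` is accepted, whereas coloring a node of class 2 with c, or leaving a node of class 1 uncolored, is rejected.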
A generic example of the usefulness of this remark is given below.Once an arbitrary data value is marked with a color c, then a property of the form ∀x∀y α x, y) is now a regular property and can therefore be tested by A. Hence it is enough to consider the case where x does not carry the marked data value.The same reasoning holds if two data values are marked or if the formula starts with a ∀x∃y quantification.We will use this fact implicitly in the case analysis below.
Given a data forest, a vertical path is a set of nodes containing exactly one leaf and all its ancestors and nothing else. A horizontal path is a set of nodes containing one node together with all its siblings and nothing else.
We start with formulas of the form (1).
Case 1: ∀x∀y α(x) ∧ β(y) ∧ x ∼ y → γ(x, y), where γ(x, y) is as in (1). These formulas express a property of pairs of nodes with the same data value. We have seen that those are #-stuttering languages that can be tested by the forest automaton B alone (i.e., by a DTA # with A doing nothing).
Case 2: ∀x∀y α(x) ∧ β(y) ∧ x ≁ y → ¬E ⇒ (x, y). This formula expresses that a data forest cannot contain an α-node having a β-node with a different data value as a sibling to its right, except if it is the next-sibling. Let X be a horizontal path in a data forest t containing at least one α-node, and let x be the leftmost α-node in X. Let d be the data value of x. Consider an α-node x′ and a β-node y′ that make the formula false within X; in particular we have E ⇒ (x′, y′). Then, if y′ has a data value different from d we already have E ⇒ (x, y′) and the formula is also false for the pair (x, y′). Hence the validity of the formula within X can be tested over pairs (x′, y′) such that either x′ or y′ has data value d. With this discussion in mind we construct (A, B) as follows. In every horizontal path X containing an α-node, the transducer A identifies the leftmost occurrence x of an α-node in X, marks it with a new color c′, and marks all the other nodes of X with the same data value as x with a color c. As in Remark 1, the forest automaton B checks that the guesses are correct, i.e. it accepts only forests in which every horizontal path X satisfies one of the following conditions: X contains one occurrence of the color c′ and all other nodes of X not labeled with # are marked with c, or X contains none of the colors c and c′ at all. All these properties define regular and #-stuttering languages, and hence can be checked by a forest automaton B.
Assuming this, the transducer A rejects if there is some unmarked β-node occurring as a right sibling (except for the next-sibling) of a marked α-node or there is an unmarked α-node as left sibling, except for the previous sibling, of a marked β-node. As explained in Remark 1, this is a regular property.
Case 3: ∀x∀y α(x) ∧ β(y) ∧ x ≁ y → ¬E ⇓ (x, y). The property expressed by this formula is similar to the previous case, replacing the right sibling relationship with the descendant relationship.
Let X be a vertical path in a data forest t containing at least one α-node, and let x be the α-node in X the closest to the root. Let d be the data value of x. Consider an α-node x′ and a β-node y′ that make the formula false within X; in particular we have E ⇓ (x′, y′). Then, if y′ has a data value different from d we already have E ⇓ (x, y′) and the formula is also false for the pair (x, y′). Hence the validity of the formula within X can be tested over pairs (x′, y′) such that either x′ or y′ has data value d. The construction of (A, B) is similar to the previous case, except that different vertical paths may share some nodes. The transducer A marks all the α-nodes that have no α-node as ancestor with a new color c′. Then, for every node x marked c′, A guesses all the nodes inside the subtree rooted at x having the same data value as x and marks them with a new color c. As in Remark 1, the forest automaton B checks that the guesses of colors are correct for each vertical path (see also the previous case).
Assuming this, the transducer A rejects if there is an unmarked β-node that is a descendant, but not a child, of a marked α-node or there is an unmarked α-node as an ancestor, except for the parent, of a marked β-node. This is a regular property that can be checked by A in conjunction with the marking, following the principles of Remark 1.
Case 4: ∀x∀y α(x) ∧ β(y) ∧ x ≁ y → ¬(x < > y). The formula expresses that two nodes of type respectively α and β and with different data values cannot be incomparable. Recall that two nodes are incomparable if they are distinct, neither is an ancestor of the other, and they are not siblings.
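The reduction argument of Case 2 can be sanity-checked on a single horizontal path. In this sketch a path is a list of hypothetical triples (is α-node, is β-node, data value), and E ⇒ (i, j) holds when j is at least two positions to the right of i. The second function only inspects pairs in which one node carries the data value d of the leftmost α-node, and the loop at the end compares it with the brute-force check on random paths.

```python
import random
from typing import List, Tuple

Sibling = Tuple[bool, bool, int]  # (is α-node, is β-node, data value)


def violates(path: List[Sibling], i: int, j: int) -> bool:
    """Pair (i, j) falsifies the formula: α(i), β(j), different data, E⇒(i, j)."""
    return (path[i][0] and path[j][1]
            and path[i][2] != path[j][2] and j - i >= 2)


def holds_brute_force(path: List[Sibling]) -> bool:
    n = len(path)
    return not any(violates(path, i, j) for i in range(n) for j in range(n))


def holds_reduced(path: List[Sibling]) -> bool:
    """Only test pairs where one of the two nodes has the data value d
    of the leftmost α-node, as argued in the text."""
    alphas = [i for i, s in enumerate(path) if s[0]]
    if not alphas:
        return True
    d = path[alphas[0]][2]
    n = len(path)
    return not any(violates(path, i, j)
                   for i in range(n) for j in range(n)
                   if path[i][2] == d or path[j][2] == d)


# Randomized comparison of the two checks on short sibling sequences.
random.seed(0)
for _ in range(2000):
    p = [(random.random() < 0.5, random.random() < 0.5, random.randint(0, 3))
         for _ in range(random.randint(0, 7))]
    assert holds_brute_force(p) == holds_reduced(p)
```

The agreement of the two functions reflects the observation that any violating pair (x′, y′) can be replaced by one involving the data value d.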

Subcase 4.1:
There exist two α-nodes that are incomparable.
Let x 1 and x 2 be two incomparable α-nodes and let z be their least common ancestor (see Figure 4). We can choose x 1 and x 2 such that no α-node is incomparable with z or a sibling of z: if this were not the case, there would be an α-node x 3 incomparable with z or a sibling of z, hence incomparable with x 1 , and we could replace x 2 with x 3 and continue with their least common ancestor, a node which is strictly higher than z. Let z 1 and z 2 be the children of z that are respectively ancestors of x 1 and x 2 . Note that by construction, z 1 ≠ z 2 . If x 1 = z 1 and there is an α-node x 3 in the subtree of z, different from x 1 and incomparable with z 2 , then we replace x 1 by x 3 and proceed. In other words we ensure that if x 1 = z 1 then there is no α-node incomparable with z 2 in the subtree of z. We proceed similarly to enforce that if x 2 = z 2 then there is no α-node incomparable with z 1 in the subtree of z. Notice that we cannot have at the same time x 1 = z 1 and x 2 = z 2 because we assumed x 1 and x 2 to be incomparable. All these properties can be specified in MSO(<, +1) and therefore can be tested by a forest automaton.
Let d 1 and d 2 be the respective data values of x 1 and x 2 (possibly d 1 = d 2 ). Consider now a β-node y whose data value is neither d 1 nor d 2 . If y is incomparable with z or a sibling of z, then the formula cannot be true as it is contradicted by (x 1 , y). If y is an ancestor of z then, as no α-node is incomparable with z, none is incomparable with y. Hence the formula holds for such a y. Assume now that y is inside the subtree of z. If y = z 1 and x 2 ≠ z 2 , then the formula is contradicted by (x 2 , y). If y = z 1 and x 2 = z 2 , then, by hypothesis, there is no α-node incomparable with y in the subtree of z, and there is no α-node incomparable with y outside the subtree of z, hence altogether, the formula holds for y. If y ≠ z 1 and y is a descendant of z 1 , then the formula is contradicted by (x 2 , y). The cases where y is a descendant of z 2 are symmetric: in this case, the formula can only be true if y = z 2 and x 1 = z 1 . In the remaining cases y is in the subtree of z and not in the subtrees of z 1 and z 2 , making the formula false. Indeed, in each of these cases, either (x 1 , y) or (x 2 , y) contradicts the formula. To summarize, the only cases making the formula true are when y is an ancestor of z, or y = z 1 and x 2 = z 2 , or y = z 2 and x 1 = z 1 .
With this discussion in mind, this case can be solved as follows: The transducer A guesses the nodes x 1 , x 2 , z 1 , z 2 and z and checks that they satisfy the appropriate properties. Moreover, A guesses whether d 1 = d 2 and marks the data values of x 1 and x 2 accordingly with one or two new colors. The forest automaton B will then check that the data values are marked appropriately as in Remark 1.
Moreover A checks that for all marked β-nodes there is no α-node incomparable with it and with a different data value, a regular property as explained in Remark 1. It now remains for A to check that every unmarked β-node y behaves according to the discussion above: y is an ancestor of z or y = z 1 and x 2 = z 2 or y = z 2 and x 1 = z 1 . This is a regular property testable by A.

Subcase 4.2:
There are no two α-nodes that are incomparable.
Let x be an α-node such that no α-node is a descendant of x. By hypothesis, all α-nodes are either ancestors or siblings of x. Let d be the data value of x. We distinguish between several subcases depending on whether there are other α-nodes that are siblings of x or not.
If there is an α-node x′ that is a sibling of x, then let d′ be its data value (possibly d = d′). Consider now a β-node y whose data value is neither d nor d′. Then, in order to make the formula true, y must be an ancestor or a sibling of x.
In this case, the transducer A guesses the nodes x and x′ and marks the corresponding data values with one or two new colors (according to whether d = d′ or not). The forest automaton B will then check that the data values are marked correctly as explained in Remark 1. For the marked β-nodes, the property expressed by the formula is regular and can also be checked by A. It remains for A to check that every unmarked β-node is either an ancestor of x or a sibling of x.
Now, if there are no α-nodes that are siblings of x, and y is a β-node whose data value is not d, then in order to make the formula true, y cannot be incomparable with x, and therefore, y can be an ancestor, a descendant or a sibling of x.
In this second case, the transducer A guesses the node x and marks its data value using a new color. The forest automaton B will then check that the data values were marked correctly as explained in Remark 1. The transducer A checks that all marked β-nodes make the formula true (a regular property), and that all unmarked β-nodes are not incomparable with x.
We now turn to formulas of the form (2).
Case 5: ∀x∃y α(x) → (β(y) ∧ x ∼ y ∧ ε(x, y)), where ε(x, y) is as in (2). These formulas express properties of nodes with the same data value. Moreover they express a regular property over all t[d]. Therefore they can be treated by the forest automaton B as in Case 1.
Case 6: ∀x∃y α(x) → (β(y) ∧ x ≁ y ∧ E → (x, y)). This formula expresses that every α-node has a next sibling of type β with a different data value. The transducer A marks every α-node x with a new color c and checks that the next-sibling of x is a β-node. The forest automaton B accepts only the forests such that for every node marked with c, its right sibling is labeled with #.
Case 10: ∀x∃y α(x) → (β(y) ∧ x ≁ y ∧ E ⇒ (x, y)). This formula expresses that every α-node must have a β-node as a right sibling, but not as its next-sibling, and with a different data value.
Let X be a horizontal path. Let y be the rightmost β-node of X and d be its data value. Consider now an α-node x of X with a data value different from d. Then either x is at the left of the previous-sibling of y, and y can serve as the desired witness, or x has no witness and the formula is false.
The transducer A, for each horizontal path X containing an α-node, marks its rightmost β-node y with a new color c′, guesses all the other nodes of X with the same data value as y and marks them with a new color c. Then it checks that every unmarked α-node of X occurs at the left of the previous-sibling of y. The forest automaton B checks that the guesses are correct as in Remark 1: for each horizontal path, either all elements are marked with c or c′, or none.
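The witness argument of Case 10 can also be checked on one horizontal path. As before, a path is a list of hypothetical triples (is α-node, is β-node, data value). The reduced check treats α-nodes whose data value differs from that of the rightmost β-node by a purely positional test, and falls back to an explicit witness search only for the α-nodes carrying the marked data value.

```python
import random
from typing import List, Tuple

Sibling = Tuple[bool, bool, int]  # (is α-node, is β-node, data value)


def has_witness(path: List[Sibling], i: int) -> bool:
    """Some β-node j with a different data value and E⇒(i, j), i.e. j >= i + 2."""
    return any(path[j][1] and path[j][2] != path[i][2]
               for j in range(i + 2, len(path)))


def holds_brute_force(path: List[Sibling]) -> bool:
    return all(has_witness(path, i) for i, s in enumerate(path) if s[0])


def holds_reduced(path: List[Sibling]) -> bool:
    alphas = [i for i, s in enumerate(path) if s[0]]
    betas = [j for j, s in enumerate(path) if s[1]]
    if not alphas:
        return True
    if not betas:
        return False
    y = betas[-1]      # rightmost β-node
    d = path[y][2]     # its data value is the marked one
    for i in alphas:
        if path[i][2] != d:
            # Unmarked α-node: it must lie left of the previous sibling of y.
            if y - i < 2:
                return False
        elif not has_witness(path, i):
            # Marked α-node: a regular property once d is marked.
            return False
    return True


# Randomized comparison of the two checks on short sibling sequences.
random.seed(1)
for _ in range(2000):
    p = [(random.random() < 0.5, random.random() < 0.5, random.randint(0, 3))
         for _ in range(random.randint(0, 7))]
    assert holds_brute_force(p) == holds_reduced(p)
```

The positional test suffices for unmarked α-nodes because every β-node lies at or to the left of y, so y is a witness whenever any witness exists.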
Let y 1 and y 2 be two incomparable β-nodes and let z be their least common ancestor.Using the same reasoning as in subcase 4.1, we can choose y 1 and y 2 such that none of the β-nodes is incomparable with z or a sibling of z.Let z 1 and z 2 be the children of z that are the ancestors of y 1 and y 2 respectively.By construction, z 1 = z 2 .Using the same trick as in subcase 4.1, we can ensure that if y 1 = z 1 then there is no β-node incomparable with z 2 , and if y 2 = z 2 then there is no β-node incomparable with z 1 .Moreover, we cannot have at the same time y 1 = z 1 and y 2 = z 2 .Recall that all these properties can be tested by a forest automaton.Let d 1 and d 2 be the respective data values of y 1 and y 2 (possibly Consider now an α-node x whose data value is neither d 1 nor d 2 .If x is incomparable with z or a sibling of z, then y 1 is a witness for x.If x is an ancestor of z then by hypothesis there is no β-node incomparable with x and hence the formula is false.Assume now that x is in the subtree rooted at z.If x = z 1 and y 2 = z 2 , then y 2 is a β-node incomparable with x with a different data value, hence a witness for x in the formula.If x = z 1 and y 2 = z 2 , then by hypothesis, there is no β-node incomparable with x in the subtree of z, and since there are neither β-nodes incomparable with x outside the subtree of z, the formula must be false.If x = z 1 and x is a descendant of z 1 , then y 2 is a witness for x.The cases where x is a descendant of z 2 are symmetric.In the remaining cases, x is in the subtree of z and not a descendant of z 1 or z 2 .In each of these cases, either y 1 or y 2 is a witness for x.
With this discussion in mind, this case can be solved as follows. The transducer $\mathcal{A}$ guesses the nodes $y_1$, $y_2$, $z_1$, $z_2$ and $z$ and checks that they satisfy the appropriate properties. Moreover, $\mathcal{A}$ guesses whether $d_1 = d_2$ and marks accordingly the data values of $z_1$ and $z_2$ with one or two new colors. The forest automaton $\mathcal{B}$ will then check that the data values are marked appropriately, as in Remark 1. Moreover $\mathcal{A}$ checks that for every marked α-node, there exists a β-node making the formula true. It remains for $\mathcal{A}$ to check the following three properties: no unmarked α-node occurs above $z$; if $y_1 = z_1$ then $z_2$ is not an unmarked α-node; and if $y_2 = z_2$ then $z_1$ is not an unmarked α-node.

Subcase 14.2:
There are no two β-nodes that are incomparable.
Let $y$ be a β-node such that no β-node is a descendant of $y$. By hypothesis, all β-nodes are either ancestors or siblings of $y$. Let $d$ be the data value of $y$. We distinguish between several subcases depending on whether there are β-nodes that are siblings of $y$ or not.
If there exists a β-node $y'$ that is a sibling of $y$, let $d'$ be its data value (possibly $d = d'$). Consider an α-node $x$ whose data value is neither $d$ nor $d'$. If $x$ is incomparable with $y$, then $y$ is a witness for $x$. If $x$ is an ancestor or a sibling of $y$, then the formula cannot be true, because by hypothesis no β-node can be incomparable with $x$. If $x$ is a descendant of $y$, then $y'$ makes the formula true for that $x$.
Consider now the case where there is no β-node that is a sibling of $y$. Note that $y$ can have β-nodes among its ancestors. Let $x$ be an α-node whose data value is different from $d$. If $x$ is not incomparable with $y$ then the formula must be false. Otherwise, $y$ is a witness for $x$.
The transducer $\mathcal{A}$ guesses the β-node $y$ and marks its data value using a new color. Then it checks whether there is a β-node $y'$ that is a sibling of $y$. If yes, it guesses whether the value at $y'$ is the same as the value at $y$ or not, and marks the data value of $y'$ using a new color. The forest automaton $\mathcal{B}$ will then check that the data values are marked appropriately. For marked α-nodes, $\mathcal{A}$ checks the regular property making the formula true. It now remains for $\mathcal{A}$ to check, in both cases, that every unmarked α-node $x$ satisfies the appropriate condition described above, i.e., that $x$ is incomparable with $y$ or a descendant of $y$ if there exists a sibling $y'$, and that $x$ is incomparable with $y$ otherwise.
Case 15: ∀x∃y α(x) → false. It is sufficient to test with $\mathcal{A}$ that no α-node is present in the forest.

From DTA# to EBVASS
In this section we show that the emptiness problem of DTA# can be reduced to the reachability problem of a counter tree automata model that extends BVASS, denoted EBVASS. An EBVASS is a tree automaton equipped with counters. It runs on binary trees over a finite alphabet. It can increase or decrease its counters but cannot perform a zero test. For BVASS, when going up in the tree, the new value of each counter is the sum of its values at the left and right child. An EBVASS can change this behavior using simple arithmetical constraints.
The general idea of the reduction is as follows. Let $(\mathcal{A}, \mathcal{B})$ be a DTA#. We want to construct an automaton that recognizes exactly the projections of the data forests accepted by $(\mathcal{A}, \mathcal{B})$. Because this automaton does not have access to the data values, the main difficulty is to simulate the runs of $\mathcal{B}$ on all class forests. We will use counters for this purpose. The automaton will maintain the following invariant: at any node $x$ of the forest, for each state $q$ of $\mathcal{B}$, we have a counter that stores the number of data values $d$ such that $\mathcal{B}$ is in state $q$ at $x$ when running on the class forest associated to $d$.

In order to maintain this invariant we make sure that the automata model has the appropriate features for manipulating the counters. In particular, moving up in the tree, in order to simulate $\mathcal{B}$, the automaton has to decide which data values occurring in the left subtree also appear in the right subtree. At the current node, each data value is associated to a state of $\mathcal{B}$ and, by the invariant property, to a counter. In order to maintain the invariant for data values occurring in both subtrees, for each pair $q, q'$ of states of $\mathcal{B}$, the automaton guesses a number $n$ (the number of data values being at the same time in state $q$ in the left subtree and in state $q'$ in the right subtree), removes $n$ from both associated counters and adds $n$ to the counter corresponding to the state resulting from the application of the transition function of $\mathcal{B}$ on $(q, q')$. This preserves the invariant property, and a BVASS cannot do it, so we explicitly add this feature to our model. Once we have done this, the counters from both sides are added as a BVASS would do. The #-stuttering property of the language of $\mathcal{B}$ will ensure that this last operation is consistent with the behavior of $\mathcal{B}$. This is essentially what we do, up to some technical details. In particular, DTA# run over unranked forests while EBVASS run over binary trees.
We start by defining the counter tree automata model and then we present the reduction.

Definition of EBVASS
An EBVASS is a tuple $(Q, \mathbb{A}, q_0, k, \delta, \chi)$ where $\mathbb{A}$ is a finite alphabet, $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $k \in \mathbb{N}$ is the number of counters, $\chi$ is a finite set of constraints of the form $C_{i_1} C_{i_2} \to C_i$ with $1 \le i_1, i_2, i \le k$, and $\delta$ is the set of transitions, which are of two kinds: ε-transitions (subset denoted $\delta_\varepsilon$) and up-transitions (subset denoted $\delta_u$). An ε-transition is an element of $(Q \times \mathbb{A}) \times (Q \times U)$ where $U = \{I_i, D_i : 1 \le i \le k\}$ is the set of possible counter updates: $D_i$ stands for decrementing counter $i$ and $I_i$ for incrementing counter $i$. We view each element of $U$ as a vector over $\{-1, 0, 1\}^k$ with only one non-zero position. An up-transition is an element of $(Q \times \mathbb{A} \times Q) \times Q$.

Informally, an ε-transition may change the current state and increment or decrement one of the counters. An up-transition depends on the label of the current node and, when the current node is an inner node, on the states reached at its left and right child. It defines a new state, and the new value of each counter is the sum of the values of the corresponding counters of the children. Moreover, the behavior of up-transitions can be modified by the constraints $\chi$. Informally, a constraint of the form $C_{i_1} C_{i_2} \to C_i$ modifies this process as follows: before performing the addition of the counters, two non-negative integers $n_1$ and $n_2$ are guessed (possibly of value 0), the counter $i_1$ of the left child and the counter $i_2$ of the right child are decreased by $n_1$, the counter $i_2$ of the left child and the counter $i_1$ of the right child are decreased by $n_2$ and, once the addition of the counters has been executed, the counter $i$ is increased by $n_1 + n_2$. Note that $n_1$ and $n_2$ must be such that all intermediate values remain non-negative. This is essentially what is explained in the sketch above, except that we cannot distinguish the left child from the right child. This will be a property resulting from #-stuttering languages when coding them into binary trees. We now make this more precise.
A configuration of an EBVASS is a pair $(q, v)$ where $q \in Q$ and $v$ is a valuation of the counters, seen as a vector of $\mathbb{N}^k$. The initial configuration is $(q_0, v_0)$ where $v_0$ is the function setting all counters to 0. There is an ε-transition of label $a$ from $(q, v)$ to $(q', v')$ if $(q, a, q', u) \in \delta_\varepsilon$ and $v' = v + u$ (in particular this implies that $v + u \ge 0$). We write $(q, v) \xrightarrow{a} (q', v')$ if $(q', v')$ can be reached from $(q, v)$ via a finite sequence of ε-transitions of label $a$.
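The ε-transition step can be illustrated with a small sketch. The names `unit_update` and `apply_update` are hypothetical helpers introduced here for illustration; counters are indexed from 1 to $k$ as in the text, and a step is refused (returns `None`) when a counter would become negative.

```python
from typing import Optional, Tuple

Vector = Tuple[int, ...]

def unit_update(k: int, i: int, delta: int) -> Vector:
    """Vector for I_i (delta = +1) or D_i (delta = -1) over k counters."""
    if delta not in (-1, 1) or not 1 <= i <= k:
        raise ValueError("expected delta in {-1, +1} and 1 <= i <= k")
    return tuple(delta if j == i else 0 for j in range(1, k + 1))

def apply_update(v: Vector, u: Vector) -> Optional[Vector]:
    """Return v + u, or None when some counter would drop below zero."""
    w = tuple(x + y for x, y in zip(v, u))
    return w if all(x >= 0 for x in w) else None
```

For instance, decrementing a counter whose value is 0 is not a valid step, which reflects the absence of negative values rather than a zero test.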
Given a binary tree $\mathbf{a} \in \mathrm{Trees}(\mathbb{A})$, a run $\rho$ of an EBVASS is a function from the nodes of $\mathbf{a}$ to configurations such that $\rho(x) = (q_0, v_0)$ for every leaf $x$ and, for all nodes $x, x_1, x_2$ of $\mathbf{a}$ with $x_1$ and $x_2$ the left and right child of $x$, $\rho(x) = (q, v)$, $\rho(x_1) = (q_1, v_1)$ and $\rho(x_2) = (q_2, v_2)$, there exist $(q'_1, v'_1)$, $(q'_2, v'_2)$ such that:
2. $(q'_1, \mathbf{a}(x), q'_2, q) \in \delta_u$,
3. for each constraint $\theta \in \chi$ of the form $C_{i_1} C_{i_2} \to C_i$ there are two numbers $n^1_\theta$ and $n^2_\theta$ (they may be 0) and vectors $u_{\theta,1}, u_{\theta,2}, u_\theta \in \mathbb{N}^k$, where $u_{\theta,1}$ has $n^1_\theta$ and $n^2_\theta$ at positions $i_1, i_2$, $u_{\theta,2}$ has $n^2_\theta$ and $n^1_\theta$ at positions $i_1, i_2$, $u_\theta$ has $n^1_\theta + n^2_\theta$ at position $i$, and all other positions are set to zero.

We stress that it will be important for coding the automata into the logic (Section 4) that $\chi$ is independent of the current state of the automaton. Without the constraints of $\chi$ we have the usual notion of BVASS [15]. It does not seem possible in general to simulate directly a constraint $C_{i_1} C_{i_2} \to C_i$ with BVASS transitions. One could imagine using an arbitrary number of ε-transitions decreasing the counters $i_1$ and $i_2$ while increasing counter $i$, after the merging operation summing up the counters. However, it is not clear how to do this while preserving the positiveness of the corresponding decrements before the merge (Step 4 above).
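The effect of an up-transition on the counters, as described informally above, can be sketched as follows. This is only an illustration under stated assumptions: the function name `merge` and the encoding of a constraint $C_{i_1} C_{i_2} \to C_i$ as a triple `(i1, i2, i)` are choices made here, the guessed numbers are passed explicitly instead of being chosen nondeterministically, and when $i_1 = i_2$ the two decrements are simply accumulated on the same counter.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Constraint = Tuple[int, int, int]  # (i1, i2, i), counters indexed from 1

def merge(v1: Sequence[int], v2: Sequence[int],
          guesses: Dict[Constraint, Tuple[int, int]]) -> Optional[List[int]]:
    """Counter vector after an up-transition, or None if the guesses are invalid.

    v1, v2: counter vectors of the left and right child.
    guesses: for each constraint used, the guessed pair (n1, n2).
    """
    left, right = list(v1), list(v2)
    bonus = [0] * len(left)
    for (i1, i2, i), (n1, n2) in guesses.items():
        if n1 < 0 or n2 < 0:
            return None
        # n1 pairs counter i1 on the left with counter i2 on the right.
        left[i1 - 1] -= n1
        right[i2 - 1] -= n1
        # n2 pairs counter i2 on the left with counter i1 on the right.
        left[i2 - 1] -= n2
        right[i1 - 1] -= n2
        bonus[i - 1] += n1 + n2
    # Only decrements were applied, so checking once at the end is enough.
    if any(x < 0 for x in left + right):
        return None
    return [l + r + b for l, r, b in zip(left, right, bonus)]
```

With an empty `guesses` dictionary the function reduces to the plain BVASS sum of the two vectors.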
The reachability problem for an EBVASS, on input q ∈ Q, asks whether there is a tree and a run on that tree reaching the configuration (q, v 0 ) at its root.

Reduction from DTA# to EBVASS
Theorem 4. The emptiness problem for DTA# reduces to the reachability problem for EBVASS.
Proof. We first take care of the binary trees versus unranked forests issue. It is well known that forests of $\mathrm{Forests}(E)$ can be transformed into binary trees in $\mathrm{Trees}(E_\#)$ using the first-child/right-sibling encoding, denoted by fcns, and formally defined as follows (for $a \in E$ and $s, s' \in \mathrm{Forests}(E)$):

This transformation effectively preserves regularity: for each automaton $\mathcal{B}$ computing on $\mathrm{Forests}(E)$ there exists an automaton $\mathcal{B}'$ on binary trees of $\mathrm{Trees}(E_\#)$, effectively computable from $\mathcal{B}$, recognizing exactly the fcns encodings of the forests recognized by $\mathcal{B}$. This automaton $\mathcal{B}'$ is called the fcns view of $\mathcal{B}$. Note that we use the same # symbol in the fcns construction and for class forests. This simplifies the technical details of the proof. In particular we can assume that our tree automata start with a single initial state at the leaves of the tree.
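As an illustration, a common formulation of the first-child/right-sibling encoding can be sketched as follows. The representation is an assumption of this sketch, not the paper's definition: a forest is a list of `(label, children)` pairs, a binary tree is either the leaf `"#"` or a triple `(label, left, right)`, and an empty forest is encoded by a #-leaf.

```python
HASH = "#"

def fcns(forest):
    """First-child/right-sibling encoding of an unranked forest.

    forest: list of (label, children) pairs, children being a forest.
    Returns HASH for the empty forest, otherwise (label, left, right) where
    left encodes the children of the first tree and right its right siblings.
    """
    if not forest:
        return HASH
    (label, children), *siblings = forest
    return (label, fcns(children), fcns(siblings))
```

For example, the tree $a(b, c)$ becomes `("a", ("b", "#", ("c", "#", "#")), "#")`: the first child `b` sits to the left of `a`, and its sibling `c` sits to the right of `b`.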
We show that given a DTA# $\mathcal{D}$, one can construct an EBVASS $\mathcal{E}$ with a distinguished state $q$ such that for all $\mathbf{a} \in \mathrm{Forests}(\mathbb{A})$, there is a run of $\mathcal{E}$ on $\mathrm{fcns}(\mathbf{a})$ reaching $(q, v_0)$ at its root iff $\mathbf{a} \otimes \mathbf{d}$ is accepted by $\mathcal{D}$ for some $\mathbf{d}$.
Before explaining the construction of $\mathcal{E}$ we first show the consequences, on its fcns view $\mathcal{B}$, of the fact that the second component of $\mathcal{D}$ recognizes a #-stuttering language. The fcns views of the rules of Figure 2 are depicted in Figure 5: one obtains the same result by applying fcns and then a rule of Figure 5 as by applying the corresponding rule of Figure 2 and then fcns. This can be enforced using the following syntactic restrictions on the fcns view $\mathcal{B}$ that will be useful in our proofs. In the definition of these restrictions, we use the notation $(p_1, b, p_2) \to p$ for a transition of $\mathcal{B}$ from the states $p_1, p_2$ in the left and right child of a node of label $b$, moving up with state $p$.
We assume without loss of generality that the states of $\mathcal{B}$ make it possible to distinguish the last symbol read by $\mathcal{B}$. More precisely, we assume that the set of states of $\mathcal{B}$ is split into two kinds: the #-states and the non-#-states. The states of the first kind are reached by $\mathcal{B}$ on nodes labeled with the symbol #, while the states of the second kind are reached by $\mathcal{B}$ on nodes with label in $\mathbb{B}$. We say that $\mathcal{B}$ is #-stuttering if $\mathcal{B}$ is deterministic, has a specific #-state $p_\#$ that it must reach on all leaves of label #, and verifies the following properties:
1. if a transition rule of the form $(p_1, \#, p_\#) \to p_2$ is applied at a #-node that is the left child of another #-node, then $p_1 = p_2$;
2. if a transition rule of the form $(p_\#, \#, p_1) \to p_2$ is applied at a #-node that is the right child of another #-node, then $p_1 = p_2$;
3. all transition rules of the form $(p_\#, \#, p_1) \to p_2$ with $p_1$ a #-state verify $p_1 = p_2$.
From these definitions, it is straightforward to see that for a set $L \subseteq \mathrm{Forests}(\mathbb{B})$, the following properties are equivalent:
• $\mathrm{fcns}(L)$ is closed under the rules in Figure 5,
• there exists a #-stuttering automaton recognizing $\mathrm{fcns}(L)$.
We now turn to the construction of $\mathcal{E}$.
Let $\mathcal{A}$ and $\mathcal{B}$ be the fcns views of the two components of $\mathcal{D}$. The automaton $\mathcal{B}$ is #-stuttering (i.e., its states are split into #-states and non-#-states, and there is a #-state $p_\# \in Q_{\mathcal{B}}$ into which $\mathcal{B}$ evaluates the tree with a single node labeled #) and we also assume without loss of generality that it is deterministic and complete, i.e., for every $\mathbf{b} \in \mathrm{Trees}(\mathbb{B}_\#)$, $\mathcal{B}$ evaluates $\mathbf{b}$ into exactly one state of $Q_{\mathcal{B}}$.
Here $Q_{\mathcal{A}}$ and $Q_{\mathcal{B}}$ (resp. $F_{\mathcal{A}}$, $F_{\mathcal{B}}$) are the respective state sets (resp. final state sets) of $\mathcal{A}$ and $\mathcal{B}$, $\mathbb{A}$ is the input alphabet of $\mathcal{A}$, $\mathbb{B}$ is the output alphabet of $\mathcal{A}$ and the input alphabet of $\mathcal{B}$ (with the symbol #), and $\Delta_{\mathcal{A}}$, $\Delta_{\mathcal{B}}$ are the sets of transitions. We will use the notation $(r_1, a, r_2, r, b)$ for a transition of $\mathcal{A}$ from the states $r_1, r_2$ in the left and right child of a node of label $a$, renaming this node with $b$ and moving up with state $r$. In the following, we write explicitly the set of states of $\mathcal{B}$ as $Q_{\mathcal{B}} = \{p_\#, p_1, \ldots, p_k\}$.

For any data tree $\mathbf{t} \in \mathrm{Trees}(\mathbb{B}_\# \times \mathbb{D})$ and any data value $d$ occurring in $\mathbf{t}$, the state of $Q_{\mathcal{B}}$ corresponding to the evaluation of $\mathcal{B}$ on the class forest $\mathbf{t}[d]$ is called the $\mathcal{B}$-state associated to $d$ in $\mathbf{t}$. When $d$ is the data value at the root of $\mathbf{t}$, this state is called the $\mathcal{B}$-state of $\mathbf{t}$. Note that for all $\mathbf{t}$ the $\mathcal{B}$-state of $\mathbf{t}$ exists and is unique, since $\mathcal{B}$ is assumed to be deterministic and complete, and that it is always a non-#-state.
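We now construct the expected EBVASS $\mathcal{E} = (Q, \mathbb{A}, q_0, k, \delta, \chi)$ with $Q = Q_{\mathcal{A}} \times Q_{\mathcal{B}} \times Q_0$, where $Q_0$ is a finite set of auxiliary control states. The initial state $q_0$ is the tuple formed with $q^0_{\mathcal{A}}$, the initial state of $\mathcal{A}$, $q^0_{\mathcal{B}}$, the initial state of $\mathcal{B}$, and a specific state of $Q_0$. The first and second components of a state $q \in Q$ are respectively called the $\mathcal{A}$-state and the $\mathcal{B}$-state of $q$. The transitions of the EBVASS $\mathcal{E}$ are constructed in order to ensure the following invariant:

$(\star)$ $\mathcal{E}$ reaches the configuration $(q, v)$ at the root of a tree $\mathbf{a} \in \mathrm{Trees}(\mathbb{A})$ iff there exist a data tree $\mathbf{t} = \mathbf{a} \otimes \mathbf{d}$ and a possible output $\mathbf{b} \in \mathrm{Trees}(\mathbb{B})$ of $\mathcal{A}$ on $\mathbf{a}$, witnessed by a run of $\mathcal{A}$ whose state at the root of $\mathbf{a}$ is the $\mathcal{A}$-state of $q$, and moreover for all $i$, $1 \le i \le k$, $v_i$ is the number of data values having $p_i$ as associated $\mathcal{B}$-state in $\mathbf{b} \otimes \mathbf{d}$.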
Note that the counters ignore the number of data values having $p_\#$ as associated $\mathcal{B}$-state (which will always be infinite). A consequence of $(\star)$ is that:

$(\star\star)$ there is only one non-#-state $p_i \in Q_{\mathcal{B}}$ such that $v_i \neq 0$, and actually $v_i = 1$.

We will refer to this state $p_i$ as the $\mathcal{B}$-state of $v$, and the construction of $\mathcal{E}$ will ensure that $p_i$ is also the $\mathcal{B}$-state of $q$.
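The consequence of the invariant stated above can be read as a simple check on a counter vector. The following sketch is only illustrative: the function name `b_state_index` and the parameter `non_hash` (the set of indices $i$ such that $p_i$ is a non-#-state) are assumptions introduced here.

```python
from typing import Sequence, Set

def b_state_index(v: Sequence[int], non_hash: Set[int]) -> int:
    """Index i of the B-state p_i of v (counters indexed from 1).

    Exactly one counter attached to a non-#-state must be non-zero,
    and its value must be 1; otherwise v violates the property.
    """
    nonzero = [i for i in sorted(non_hash) if v[i - 1] != 0]
    if len(nonzero) != 1 or v[nonzero[0] - 1] != 1:
        raise ValueError("v has no well-defined B-state")
    return nonzero[0]
```

Counters attached to #-states are unconstrained by this check: they count the data values that occur strictly below the root only.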
If we can achieve the invariant $(\star)$ then we are done. Indeed, we can add to $\mathcal{E}$ some ε-transitions which, when reaching a state $q$ containing a final $\mathcal{A}$-state, decrement the counters corresponding to final states of $\mathcal{B}$ (and only those). Then, $\mathcal{E}$ reaches a configuration $(q, v_0)$ with the $\mathcal{A}$-state of $q$ being a final state of $\mathcal{A}$ iff there exists a data tree accepted by $\mathcal{D}$.
Notice that the property $(\star)$ is invariant under permutations of $\mathbb{D}$. Hence if a tree $\mathbf{d}$ witnesses the property $(\star)$, then any tree $\mathbf{d}'$ constructed from $\mathbf{d}$ by permuting the data values is also a witness for $(\star)$. This observation will be useful for showing the correctness of the construction of $\mathcal{E}$.
Before defining the transition relation of $\mathcal{E}$ we sketch its construction in more detail.
The automaton $\mathcal{E}$ needs to maintain the invariant $(\star)$. One direction will be immediate: if $\mathcal{D}$ has an accepting run on $\mathbf{a} \otimes \mathbf{d}$ then $\mathcal{E}$ is constructed so that it has an accepting run on $\mathbf{a}$ satisfying $(\star)$ as witnessed by $\mathbf{d}$. For the converse direction, we need to construct from a run of $\mathcal{E}$ on $\mathbf{a}$ a tree $\mathbf{d}$ such that $\mathcal{D}$ has a run on $\mathbf{a} \otimes \mathbf{d}$ as in $(\star)$.
The simulation of $\mathcal{A}$ is straightforward as $\mathcal{E}$ can simulate any regular tree automaton. It is done using the $\mathcal{A}$-state of the states of $\mathcal{E}$: for every ε-transition $(q, a, q', u)$ of $\mathcal{E}$, the $\mathcal{A}$-states of $q$ and $q'$ must coincide, and for every up-transition $(q_1, a, q_2, q)$ of $\mathcal{E}$, there exists a transition of $\mathcal{A}$ of the form $(r_1, a, r_2, r, b)$, for some $b \in \mathbb{B}$, such that $r_1$, $r_2$ and $r$ are the respective $\mathcal{A}$-states of $q_1$, $q_2$ and $q$. In other words, the $\mathcal{A}$-state of $\mathcal{E}$ is always the state of $\mathcal{A}$ at the current node.
Let us now turn to the simulation of $\mathcal{B}$ and the invariant $(\star)$. This invariant will be shown by induction on the depth of the tree. Let us assume that $\mathcal{E}$ reached the configuration $(q, v)$ at the root $x$ of a tree $\mathbf{a}$.
If $x$ is a leaf node, then by definition of EBVASS, $q$ is the initial state $q_0$ of $\mathcal{E}$ and $v = v_0$, hence $(\star)$ holds.
If $x$ is an inner node, then $\mathbf{a} = a(\mathbf{a}_1, \mathbf{a}_2)$ for some letter $a \in \mathbb{A}$. By induction on the depth, we have trees $\mathbf{d}_1$ and $\mathbf{d}_2$ such that there is a run of $\mathcal{D}$ on $\mathbf{a}_1 \otimes \mathbf{d}_1$ and $\mathbf{a}_2 \otimes \mathbf{d}_2$ satisfying $(\star)$. From the remark above on the invariance of $(\star)$ under permutations of $\mathbb{D}$, we can assume that $\mathbf{d}_1$ and $\mathbf{d}_2$ do not share any data value. We need to set the transitions of $\mathcal{E}$ such that from $\mathbf{d}_1, \mathbf{d}_2$ we can construct a tree $\mathbf{d}$ such that $\mathcal{D}$ also has a run on $\mathbf{a} \otimes \mathbf{d}$ as in $(\star)$. The tree $\mathbf{d}$ will be of the form $d(\mathbf{d}'_1, \mathbf{d}'_2)$ for some $d \in \mathbb{D}$, where $\mathbf{d}'_1$ and $\mathbf{d}'_2$ are constructed from $\mathbf{d}_1$ and $\mathbf{d}_2$ by permuting the data values. The permutation will identify some data values of $\mathbf{d}_1$ with some data values of $\mathbf{d}_2$. The number of data values we identify is given by the $n$ in the constraints of $\mathcal{E}$, as explained in the sketch at the beginning of this section. This $n$ is therefore given by the run of $\mathcal{E}$, and we will see that it does not matter which data values we actually choose: it is only important that we pick $n$ of them. The constraints make sure that this is consistent with the runs of $\mathcal{B}$.
For this purpose we define $\chi$ as the set of constraints of the form $C_{j_1} C_{j_2} \to C_j$ such that there exists a transition $(p_{j_1}, \#, p_{j_2}) \to p_j$ of $\mathcal{B}$ where $p_{j_1}$, $p_{j_2}$ and $p_j$ are #-states in $Q_{\mathcal{B}} \setminus \{p_\#\}$. Note that the commutativity rule in the definition of #-stuttering languages implies that whenever we have a constraint $C_{j_1} C_{j_2} \to C_j$, both $(p_{j_1}, \#, p_{j_2}) \to p_j$ and $(p_{j_2}, \#, p_{j_1}) \to p_j$ are transitions of $\mathcal{B}$.
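The definition of $\chi$ is a direct filter over the transition table of $\mathcal{B}$, which the following sketch illustrates. The data layout is hypothetical: transitions are stored in a dictionary mapping `(p1, label, p2)` to the target state, `hash_states` is the set of #-states, `p_hash` stands for $p_\#$, and `index` maps each state $p_i$ to its counter index $i$.

```python
from typing import Dict, Hashable, Set, Tuple

State = Hashable

def build_constraints(transitions: Dict[Tuple[State, str, State], State],
                      hash_states: Set[State],
                      p_hash: State,
                      index: Dict[State, int]) -> Set[Tuple[int, int, int]]:
    """Set of triples (j1, j2, j) standing for constraints C_j1 C_j2 -> C_j."""
    chi = set()
    for (p1, label, p2), p in transitions.items():
        all_hash = all(s in hash_states and s != p_hash for s in (p1, p2, p))
        if label == "#" and all_hash:
            chi.add((index[p1], index[p2], index[p]))
    return chi
```

Transitions involving $p_\#$ or a non-#-state produce no constraint, matching the side condition in the definition.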
This does maintain $(\star)$ assuming that $d$, the data value expected at $x$, is not among the data values we identify (in the transitions used to construct $\chi$ the root symbol is #). This data value $d$ has to be treated separately and we have several cases depending on whether $d$ is completely new (does not occur in $\mathbf{d}_1$ and $\mathbf{d}_2$), or occurs in $\mathbf{d}_1$ but not in $\mathbf{d}_2$, or the other way round, or occurs in both subtrees. Actually it will also be necessary to consider separately the cases where $d$ occurs at the root of $\mathbf{d}_1$ or of $\mathbf{d}_2$. This last choice is guessed by $\mathcal{E}$ and can therefore be read from the run of $\mathcal{E}$. We can then choose $d$ consistently with the guess of $\mathcal{E}$. Again the precise value of $d$ is not important. It is only important that its equality type with the other data values is consistent with the choice made by $\mathcal{E}$. This makes finitely many cases, and we define the transition function of $\mathcal{E}$ as the union of the corresponding families of transitions. Since each family involves disjoint intermediate states, the families do not interfere with each other. We therefore define them separately and, immediately after each definition, prove that they do maintain $(\star)$ for their case.
1. $\mathcal{E}$ guessed that the data value of the current node is equal to the data value of both its children. To handle this case, for each transition $\tau = (p_{i_1}, b, p_{i_2}) \to p_i$ of $\mathcal{B}$, where none of $p_{i_1}$, $p_{i_2}$, $p_i$ are #-states, $\mathcal{E}$ has the following transitions:
ε-transitions:
• from a state $q_1$ of $\mathcal{B}$-state $p_{i_1}$ it decreases counter $i_1$ and moves to a state $q^1_\tau$,
• from a state $q_2$ of $\mathcal{B}$-state $p_{i_2}$ it decreases counter $i_2$ and moves to a state $q^2_\tau$,
• from a state $q_\tau$ it increases counter $i$ and moves to a state $q$ of $\mathcal{B}$-state $p_i$.
up-transition: $(q^1_\tau, a, q^2_\tau, q_\tau)$.
The state $q^1_\tau$ (resp. $q^2_\tau$, $q_\tau$) differs from $q_1$ (resp. $q_2$, $q$) only by its third component (in $Q_0$), which contains $\tau$. We shall use the same convention for the states introduced in the following construction cases.
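The net effect of the case 1 gadget on the counters can be traced with a small sketch. It is a simplification under stated assumptions: the hypothetical function `case1_step` ignores the control states and the constraints of $\chi$ (all guessed numbers are taken to be 0) and only follows the three ε-transitions around the up-transition for a fixed $\tau = (p_{i_1}, b, p_{i_2}) \to p_i$.

```python
from typing import List, Optional, Sequence

def case1_step(v1: Sequence[int], v2: Sequence[int],
               i1: int, i2: int, i: int) -> Optional[List[int]]:
    """Counters after the case 1 gadget, or None if a decrement is impossible.

    v1, v2: counter vectors reached in the left and right subtree.
    i1, i2, i: counter indices (from 1) of the states p_i1, p_i2, p_i of tau.
    """
    if v1[i1 - 1] < 1 or v2[i2 - 1] < 1:
        return None
    left, right = list(v1), list(v2)
    left[i1 - 1] -= 1            # epsilon-transition q1 -> q1_tau
    right[i2 - 1] -= 1           # epsilon-transition q2 -> q2_tau
    merged = [l + r for l, r in zip(left, right)]  # up-transition, no constraint
    merged[i - 1] += 1           # epsilon-transition q_tau -> q
    return merged
```

The two decrements remove the root data value of each subtree from the count, and the final increment registers that same value once, now with $\mathcal{B}$-state $p_i$.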
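Correctness. Let us show that if $\mathcal{E}$ makes an up-transition $(q^1_\tau, a, q^2_\tau, q_\tau)$ at the root of $\mathbf{a} \in \mathrm{Trees}(\mathbb{A})$ we can construct $\mathbf{d}$ such that $\mathcal{D}$ has a run on $\mathbf{a} \otimes \mathbf{d}$ satisfying $(\star)$. This up-transition can only occur if we had ε-transitions from $q_1$ to $q^1_\tau$ in the left subtree and from $q_2$ to $q^2_\tau$ in the right subtree, where $p_{i_1}$ and $p_{i_2}$ are the $\mathcal{B}$-states of $q_1$ and $q_2$. Let $x$ be the root of the tree $\mathbf{a} \otimes \mathbf{d}$ where this transition occurred. We have $\mathbf{a} = a(\mathbf{a}_1, \mathbf{a}_2)$. By induction hypothesis we have trees $\mathbf{d}_1$ and $\mathbf{d}_2$ and possible outputs $\mathbf{b}_1, \mathbf{b}_2 \in \mathrm{Trees}(\mathbb{B})$ of $\mathcal{A}$ on respectively $\mathbf{a}_1$ and $\mathbf{a}_2$ such that there is a run of $\mathcal{D}$ on $\mathbf{t}_1 = \mathbf{a}_1 \otimes \mathbf{d}_1$ and $\mathbf{t}_2 = \mathbf{a}_2 \otimes \mathbf{d}_2$ satisfying $(\star)$.
We first apply a bijection on the labels of $\mathbf{d}_1$ in order for the data value of its root to match the one of the root of $\mathbf{d}_2$. Let $d$ be this data value.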
For each constraint $\theta = C_{k_1} C_{k_2} \to C_k \in \chi$ we let $n^1_\theta$ and $n^2_\theta$ be the numbers used by the run of $\mathcal{E}$ when using the above up-transition. By induction hypothesis $(\star)$ and the semantics of the constraints (making sure the counters are big enough), there are at least $n^1_\theta$ (resp. $n^2_\theta$) distinct data values different from $d$ (because the up-transition is applied after we decreased the counter $k_1$ by $n^1_\theta$) in $\mathbf{d}_1$ having $p_{k_1}$ (resp. $p_{k_2}$) as associated $\mathcal{B}$-state in $\mathbf{t}_1 = \mathbf{b}_1 \otimes \mathbf{d}_1$, and similarly for $\mathbf{t}_2 = \mathbf{b}_2 \otimes \mathbf{d}_2$. We pick such data values in each subtree and call them the data values associated to $\theta$. We do this for all constraints $\theta$ and we choose the associated data values such that they are all distinct. We now apply to $\mathbf{d}_2$ a permutation on the data values such that for all $\theta$ the data values associated to $\theta$ in $\mathbf{d}_2$ are identified with the ones for $\mathbf{d}_1$, and such that all other data values are distinct. In order to simplify the notation we call the resulting tree also $\mathbf{d}_2$. We set $\mathbf{d} = d(\mathbf{d}_1, \mathbf{d}_2)$. Let $e$ be an arbitrary data value occurring in $\mathbf{d}$.
If $e = d$, the root symbol of the class forest $\mathbf{t}[e]$ is $a$ and the counter $i$ is increased by 1 by the last ε-transition. By induction hypothesis and its consequence $(\star\star)$, $v_i = 1$ and for all other non-#-states the corresponding counter value in $v$ is 0. Hence $p_i$ is the new $\mathcal{B}$-state of $v$. It is also the $\mathcal{B}$-state of $q$ by construction.
If $e \neq d$ we consider 3 subcases. If $e$ occurs in both $\mathbf{d}_1$ and $\mathbf{d}_2$ then the class forest $\mathbf{t}[e]$ has the form $\#(s_1, s_2)$ for some forests $s_1$ and $s_2$ each containing at least one symbol other than # (not at the root node). Let $p_{j_1}$ and $p_{j_2}$ be the states reached by $\mathcal{B}$ when evaluating $s_1$ and $s_2$. They are the $\mathcal{B}$-states associated to $e$ in $\mathbf{t}_1$ and $\mathbf{t}_2$ (respectively the left and right subtrees of $\mathbf{t}$), and both are #-states in $Q_{\mathcal{B}} \setminus \{p_\#\}$. By construction of $\mathbf{d}$, there are at least $n_\theta = n^1_\theta + n^2_\theta$ such data values $e$, where $\theta = C_{j_1} C_{j_2} \to C_j$ and $p_j$ is the unique state of $\mathcal{B}$ such that $(p_{j_1}, \#, p_{j_2}) \to p_j$ is a transition of $\mathcal{B}$ (and therefore $(p_{j_2}, \#, p_{j_1}) \to p_j$ is also a transition). These $n_\theta$ data values will contribute to an increase of $v_j$ by $n_\theta$, as expected.
Assume now that $e$ occurs in $\mathbf{d}_1$ but not in $\mathbf{d}_2$ (the remaining case being symmetrical). Then $\mathbf{t}[e]$ has the form $\#(s_1, s_2)$ where $s_1$ contains at least one symbol other than # (not at the root node), and all nodes of $s_2$ are labeled #. By the hypothesis that $\mathcal{B}$ is #-stuttering, the $\mathcal{B}$-state associated to $e$ in $\mathbf{t}$ is the same as the one associated to $e$ in $\mathbf{t}_1$, and the $\mathcal{B}$-state associated to $e$ in $\mathbf{t}_2$ is $p_\#$. This is consistent with the behavior of $\mathcal{E}$, which propagates upward the value of the counter corresponding to this state, after applying the constraints. Altogether this shows that $\mathbf{t} = \mathbf{a} \otimes \mathbf{d}$ verifies $(\star)$.
2. $\mathcal{E}$ guessed that the data value $d_1$ of the current node is equal to the data value of its left child but different from the data value $d_2$ of its right child. Moreover $\mathcal{E}$ guessed that the $\mathcal{B}$-state associated to $d_1$ in the right subtree is $p_{k_2}$, and that the data value $d_2$ of the right child also appears in the left subtree, with $p_{k_1}$ as $\mathcal{B}$-state associated to $d_2$ in this left subtree. Note that both $p_{k_1}$ and $p_{k_2}$ must be #-states in $Q_{\mathcal{B}} \setminus \{p_\#\}$. To handle this case, for all transitions $\tau = (p_{i_1}, b, p_{k_2}) \to p_i$ and $\tau' = (p_{k_1}, \#, p_{i_2}) \to p_j$ of $\mathcal{B}$, where none of $p_{i_1}$, $p_{i_2}$, $p_i$ are #-states but $p_j$ (like $p_{k_1}$ and $p_{k_2}$) is a #-state, $\mathcal{E}$ has the following transitions:
ε-transitions:
• from a state $q_1$ of $\mathcal{B}$-state $p_{i_1}$ it decreases the counters $i_1$ and $k_1$ and moves to state $q^1_{\tau,\tau'}$,
• from a state $q_2$ of $\mathcal{B}$-state $p_{i_2}$ it decreases the counters $i_2$ and $k_2$ and moves to state $q^2_{\tau,\tau'}$,
• from a state $q_{\tau,\tau'}$ it increases the counters $i$ and $j$ and moves to a state $q$ of $\mathcal{B}$-state $p_i$.
up-transition: $(q^1_{\tau,\tau'}, a, q^2_{\tau,\tau'}, q_{\tau,\tau'})$.
Correctness. We argue as in the previous case with the following modifications. For each $\theta \in \chi$ we select the associated data values making sure they are neither $d_1$ nor $d_2$. The decrements in the ε-transitions make sure that this is always possible. We then perform the same identification as in the previous case. The same argument as above shows that the resulting tree $\mathbf{d} = d_1(\mathbf{d}'_1, \mathbf{d}'_2)$ has the desired properties.
3. $\mathcal{E}$ guessed that the data value $d_1$ of the current node is equal to the data value of its left child but different from the data value of its right child. Moreover $\mathcal{E}$ guessed that $d_1$ also appears in the right subtree of the current node, with $p_{k_2}$ as associated $\mathcal{B}$-state in this right subtree, and that the data value of the right child of the current node does not appear in the left subtree. Note that $p_{k_2}$ must be a #-state in $Q_{\mathcal{B}} \setminus \{p_\#\}$. To handle this case, for all transitions $\tau = (p_{i_1}, b, p_{k_2}) \to p_i$ and $\tau' = (p_\#, \#, p_{i_2}) \to p_j$ of $\mathcal{B}$, where none of $p_{i_1}$, $p_{i_2}$, $p_i$ are #-states but $p_{k_2}$ and $p_j$ are #-states, $\mathcal{E}$ has the following transitions:
ε-transitions:
• from a state $q_1$ of $\mathcal{B}$-state $p_{i_1}$ it decreases the counter $i_1$ and moves to state $q^1_{\tau,\tau'}$,
• from a state $q_2$ of $\mathcal{B}$-state $p_{i_2}$ it decreases the counters $i_2$ and $k_2$ and moves to state $q^2_{\tau,\tau'}$,
• from a state $q_{\tau,\tau'}$ it increases the counters $i$ and $j$ and moves to a state $q$ of $\mathcal{B}$-state $p_i$.
up-transition: $(q^1_{\tau,\tau'}, a, q^2_{\tau,\tau'}, q_{\tau,\tau'})$.
Correctness. We argue as in the previous cases with the following modifications. For each $\theta \in \chi$ we select the associated data values making sure they are neither $d_1$ nor $d_2$. The decrements in the ε-transitions make sure that this is always possible. We then perform the same identification as in the previous case. As before we show that the resulting tree $\mathbf{d} = d_1(\mathbf{d}'_1, \mathbf{d}'_2)$ has the desired properties.
4. $\mathcal{E}$ guessed that the data value $d$ of the current node is different from the ones of its children but appears in both subtrees, with $p_{k_1}$ and $p_{k_2}$ as associated $\mathcal{B}$-states respectively in the left and right subtrees. Moreover $\mathcal{E}$ guessed that the data values of both children of the current node are equal. Note that $p_{k_1}$ and $p_{k_2}$ must be #-states in $Q_{\mathcal{B}} \setminus \{p_\#\}$. To handle this case, for all transitions $\tau = (p_{k_1}, b, p_{k_2}) \to p_i$ and $\tau' = (p_{i_1}, \#, p_{i_2}) \to p_j$ of $\mathcal{B}$, where none of $p_{i_1}$, $p_{i_2}$, $p_i$ are #-states but $p_{k_1}$, $p_{k_2}$ and $p_j$ are #-states, $\mathcal{E}$ has the following transitions:
ε-transitions:
• from a state $q_1$ of $\mathcal{B}$-state $p_{i_1}$ it decreases the counters $i_1$ and $k_1$ and moves to state $q^1_{\tau,\tau'}$,
• from a state $q_2$ of $\mathcal{B}$-state $p_{i_2}$ it decreases the counters $i_2$ and $k_2$ and moves to state $q^2_{\tau,\tau'}$,
• from a state $q_{\tau,\tau'}$ it increases the counters $i$ and $j$ and moves to a state $q$ of $\mathcal{B}$-state $p_i$.
up-transition: $(q^1_{\tau,\tau'}, a, q^2_{\tau,\tau'}, q_{\tau,\tau'})$.
Fig. 9: Proof of Theorem 4, Case 4.
Correctness. We argue as in the previous cases with the following modifications. For each $\theta \in \chi$ we select the associated data values making sure they are neither $d_1$ nor $d$. The decrements in the ε-transitions make sure that this is always possible. We then perform the same identification as in the previous case. The rest of the argument is similar after setting $\mathbf{d} = d(\mathbf{d}'_1, \mathbf{d}'_2)$.
5. $\mathcal{E}$ guessed that the data value $d$ of the current node is different from the ones of its children but appears in both subtrees, with $p_{k_1}$ and $p_{k_2}$ as associated $\mathcal{B}$-states in respectively the left and right subtree. Moreover $\mathcal{E}$ guessed that the data values of both children of the current node are distinct but each appears in the other subtree, with respective associated $\mathcal{B}$-states $p_{\ell_1}$ and $p_{\ell_2}$. Note that $p_{k_1}$, $p_{k_2}$, $p_{\ell_1}$, $p_{\ell_2}$ must be #-states.
To handle this case, for all transitions $\tau = (p_{k_1}, b, p_{k_2}) \to p_i$, $\tau_1 = (p_{i_1}, \#, p_{\ell_1}) \to p_{j_1}$ and $\tau_2 = (p_{\ell_2}, \#, p_{i_2}) \to p_{j_2}$ of $\mathcal{B}$, where none of $p_{i_1}$, $p_{i_2}$, $p_i$ are #-states but $p_{k_1}$, $p_{k_2}$, $p_{\ell_1}$, $p_{\ell_2}$, $p_{j_1}$, $p_{j_2}$ are #-states, $\mathcal{E}$ has the following transitions:
ε-transitions:
• from a state $q_1$ of $\mathcal{B}$-state $p_{i_1}$ it decreases the counters $i_1$, $k_1$ and $\ell_2$ and moves to state $q^1_{\tau,\tau_1,\tau_2}$,
• from a state $q_2$ of $\mathcal{B}$-state $p_{i_2}$ it decreases the counters $i_2$, $k_2$ and $\ell_1$ and moves to state $q^2_{\tau,\tau_1,\tau_2}$,
• from a state $q_{\tau,\tau_1,\tau_2}$ it increases the counters $i$, $j_1$ and $j_2$ and moves to a state $q$ of $\mathcal{B}$-state $p_i$.
up-transition: $(q^1_{\tau,\tau_1,\tau_2}, a, q^2_{\tau,\tau_1,\tau_2}, q_{\tau,\tau_1,\tau_2})$.
Correctness. We argue as in the previous cases with the following modifications. For each $\theta \in \chi$ we select the associated data values making sure that they are neither $d$, $d_1$ nor $d_2$. The decrements in the ε-transitions make sure that this is always possible. We then perform the same identification as in the previous case. The rest of the argument is similar after setting $\mathbf{d} = d(\mathbf{d}'_1, \mathbf{d}'_2)$.
6. $\mathcal{E}$ guessed that the data value $d$ of the current node is different from the ones of its children but appears in both subtrees, with $p_{k_1}$ and $p_{k_2}$ as associated $\mathcal{B}$-states in respectively the left and right subtree. Moreover $\mathcal{E}$ guessed that the data value of the right child of the current node appears in its left subtree, with $p_{\ell_1}$ as associated $\mathcal{B}$-state in this left subtree, and that the data value of the left child does not appear in the right subtree. Note that $p_{k_1}$, $p_{k_2}$ and $p_{\ell_1}$ must be #-states in $Q_{\mathcal{B}} \setminus \{p_\#\}$. To handle this case, for all transitions $\tau = (p_{k_1}, b, p_{k_2}) \to p_i$, $\tau_1 = (p_{i_1}, \#, p_\#) \to p_{j_1}$ and $\tau_2 = (p_{\ell_1}, \#, p_{i_2}) \to p_{j_2}$ of $\mathcal{B}$, where none of $p_{i_1}$, $p_{i_2}$, $p_i$ are #-states but $p_{k_1}$, $p_{k_2}$, $p_{\ell_1}$, $p_{j_1}$, $p_{j_2}$ are #-states, $\mathcal{E}$ has the following transitions:
ε-transitions:
• from a state $q_1$ of $\mathcal{B}$-state $p_{i_1}$ it decreases the counters $i_1$, $k_1$ and $\ell_1$ and moves to state $q^1_{\tau,\tau_1,\tau_2}$,
• from a state $q_2$ of $\mathcal{B}$-state $p_{i_2}$ it decreases the counters $i_2$ and $k_2$ and moves to state $q^2_{\tau,\tau_1,\tau_2}$,
• from a state $q_{\tau,\tau_1,\tau_2}$ it increases the counters $i$, $j_1$ and $j_2$ and moves to a state $q$ of $\mathcal{B}$-state $p_i$.
up-transition: $(q^1_{\tau,\tau_1,\tau_2}, a, q^2_{\tau,\tau_1,\tau_2}, q_{\tau,\tau_1,\tau_2})$.
Fig. 11: Proof of Theorem 4, Case 6.
Correctness. We argue as in the previous cases with the following modifications. For each $\theta \in \chi$ we select the associated data values making sure that they are neither $d$, $d_1$ nor $d_2$. The decrements in the ε-transitions make sure that this is always possible. We then perform the same identification as in the previous case. The rest of the argument is similar after setting $\mathbf{d} = d(\mathbf{d}'_1, \mathbf{d}'_2)$.
7. $\mathcal{E}$ guessed that the data value $d$ of the current node is different from the ones of its children but appears in both subtrees, with $p_{k_1}$ and $p_{k_2}$ as associated $\mathcal{B}$-states. Moreover it guessed that the data values of both children of the current node do not appear elsewhere. Note that $p_{k_1}$, $p_{k_2}$ must be #-states in $Q_{\mathcal{B}} \setminus \{p_\#\}$. To handle this case, for all transitions $\tau = (p_{k_1}, b, p_{k_2}) \to p_i$, $\tau_1 = (p_\#, \#, p_{i_2}) \to p_{j_1}$ and $\tau_2 = (p_{i_1}, \#, p_\#) \to p_{j_2}$ of $\mathcal{B}$, where none of $p_{i_1}$, $p_{i_2}$, $p_i$ are #-states but $p_{k_1}$, $p_{k_2}$, $p_{j_1}$, $p_{j_2}$ are #-states, $\mathcal{E}$ has the following transitions:
ε-transitions:
• from a state $q_1$ of $\mathcal{B}$-state $p_{i_1}$ it decreases the counters $i_1$ and $k_1$ and moves to state $q^1_{\tau,\tau_1,\tau_2}$,
• from a state $q_2$ of $\mathcal{B}$-state $p_{i_2}$ it decreases the counters $i_2$ and $k_2$ and moves to state $q^2_{\tau,\tau_1,\tau_2}$,
• from a state $q_{\tau,\tau_1,\tau_2}$ it increases the counters $i$, $j_1$ and $j_2$ and moves to a state $q$ of $\mathcal{B}$-state $p_i$.
up-transition: $(q^1_{\tau,\tau_1,\tau_2}, a, q^2_{\tau,\tau_1,\tau_2}, q_{\tau,\tau_1,\tau_2})$.
Correctness. We argue as in the previous cases with the following modifications. For each $\theta \in \chi$ we select the associated data values making sure that they are neither $d$, $d_1$ nor $d_2$. The decrements in the ε-transitions make sure that this is always possible. We then perform the same identification as in the previous case. The rest of the argument is similar after setting $\mathbf{d} = d(\mathbf{d}'_1, \mathbf{d}'_2)$.
Fig. 13: Proof of Theorem 4, Case 8.
8. E guessed that the data value d of the current node is different from those of its children and does not appear in the subtrees. Moreover, E guessed that the data values of both children of the current node are equal. To handle this case, for all transitions τ = (p # , b, p # ) → p i and τ′ = (p i1 , #, p i2 ) → p j of B, where none of p i1 , p i2 , p i are #-states, E has the following transitions:
ε-transitions:
• from a state q 1 of B-state p i1 it decreases the counter i 1 and moves to state q 1 τ,τ′ ;
• from a state q 2 of B-state p i2 it decreases the counter i 2 and moves to state q 2 τ,τ′ ;
• from a state q τ,τ′ it increases the counters i and j and moves to a state q of B-state p i .
up-transition: (q 1 τ,τ′ , a, q 2 τ,τ′ , q τ,τ′ ).
Correctness. We argue as in the previous cases with the following modifications. From d 1 and d 2 we first apply a bijection making sure that the data values of their roots are equal (let us call this value d 1 ). For each θ ∈ χ we select the associated data values making sure that none of them is d 1 . The decrements in the ε-transitions ensure that this is always possible. We then perform the same identification as in the previous case. The rest of the argument is similar after setting d = d(d 1 , d 2 ), where d is a fresh new value.
9. E guessed that the data value d of the current node is different from those of its children and does not appear in either subtree. Moreover, E guessed that the data values of both children of the current node (say d 1 and d 2 ) are distinct but appear in the other subtree, with respective associated B-states p 1 and p 2 . Note that p 1 , p 2 must be #-states in Q B \ {p # }. This case is treated as before with the expected transitions.
10. E guessed that the data value d of the current node is different from those of its children and does not appear in either subtree. Moreover, E guessed that the data value d 2 of the right child appears in the left subtree, with p 1 as associated B-state in this left subtree, and that the data value d 1 of the left child does not appear in the right subtree. Note that p 1 must be a #-state in Q B \ {p # }. This case is treated as before with the expected transitions.
11. We omit the symmetric cases.
• the initial state q 0 can be found at the leaves and the state q is reached at the root.
Note that the above three conditions can be checked by a standard tree automaton over A E , and therefore can be expressed in EMSO 2 (<, +1). Therefore, by setting A = A c × A′ for a suitable A′ matching the existential part of the EMSO formula, the property above can be expressed in FO 2 (<, +1).
The formula φ now needs to make sure that no counter ever gets negative and that pseudo-encodings of up-transitions are actually real encodings. This is where data values are needed. The formula φ enforces that:
1. no two nodes with label D i can have the same data value, for 1 ≤ i ≤ k,
2. no two nodes with label I i can have the same data value, for 1 ≤ i ≤ k,
3. for all i ∈ [k], every node with label D i has a descendant with label I i and with the same data value,
4. for all i ∈ [k], every node with label I i has an ancestor with label D i and with the same data value.
These four conditions enforce that the counters never get negative and that they are all set to 0 at the root. It remains to enforce that all pseudo-encodings can be turned into real encodings. This is done with the following conditions:
5. no two nodes with label T θ , for θ ∈ χ, can have the same data value,
6. no two nodes with label L θ , for θ ∈ χ, can have the same data value,
7. no two nodes with label R θ , for θ ∈ χ, can have the same data value,
8. every node with label T θ has a descendant with label L θ and a descendant with label R θ , both with the same data value,
9. every node with label L θ or R θ has an ancestor with label T θ and with the same data value,
10. two nodes with labels L θ and R θ and with the same data value are not comparable with respect to the ancestor relation.
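Conditions 1-10 can be tested mechanically on a finite labelled tree. The following is a minimal Python sketch, not the construction of the paper: the `Node` class, the `kind`/`index` encoding of the labels D i , I i , T θ , L θ , R θ and the function names are assumptions made for illustration only, and condition 8 is read as requiring the two descendants to carry the data value of the T θ node.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str      # "D", "I", "T", "L" or "R"; any other string is unconstrained
    index: object  # counter number i, or a (hypothetical) name for the constraint θ
    datum: int
    children: list = field(default_factory=list)

def with_ancestors(root):
    """List (node, ancestors) pairs; ancestors is a tuple of nodes, root first."""
    out, stack = [], [(root, ())]
    while stack:
        node, anc = stack.pop()
        out.append((node, anc))
        stack.extend((child, anc + (node,)) for child in node.children)
    return out

def violated_conditions(root):
    """Return the set of condition numbers (1-10) violated by the tree."""
    nodes = with_ancestors(root)
    bad = set()

    def same_class(m, kind, n):
        # m has the given kind, the same index and the same data value as n
        return m.kind == kind and m.index == n.index and m.datum == n.datum

    def has_ancestor(n, anc, kind):
        return any(same_class(a, kind, n) for a in anc)

    def has_descendant(n, kind):
        return any(same_class(m, kind, n) and any(a is n for a in anc)
                   for m, anc in nodes)

    uniqueness = {"D": 1, "I": 2, "T": 5, "L": 6, "R": 7}
    seen = set()
    for n, anc in nodes:
        if n.kind in uniqueness:
            key = (n.kind, n.index, n.datum)
            if key in seen:          # two nodes, same label, same data value
                bad.add(uniqueness[n.kind])
            seen.add(key)
        if n.kind == "D" and not has_descendant(n, "I"):
            bad.add(3)
        if n.kind == "I" and not has_ancestor(n, anc, "D"):
            bad.add(4)
        if n.kind == "T" and not (has_descendant(n, "L") and has_descendant(n, "R")):
            bad.add(8)
        if n.kind in ("L", "R") and not has_ancestor(n, anc, "T"):
            bad.add(9)
        if n.kind == "L":
            for m, m_anc in nodes:
                if same_class(m, "R", n) and (any(a is n for a in m_anc)
                                              or any(a is m for a in anc)):
                    bad.add(10)      # L and R comparable by the ancestor relation
    return bad

# A tree satisfying all conditions, and one where R sits below L (condition 10).
good = Node("D", 1, 3, [Node("T", "theta", 7, [
    Node("L", "theta", 7, [Node("I", 1, 3)]),
    Node("R", "theta", 7)])])
nested = Node("T", "theta", 7, [Node("L", "theta", 7, [Node("R", "theta", 7)])])
```

On these two hypothetical trees, `violated_conditions(good)` is empty while `violated_conditions(nested)` reports condition 10 only. The checker is quadratic in the number of nodes, which is enough for small examples.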
It now remains to show that φ has the desired property.
Lemma 6. φ has a model iff (q, v 0 ) is reachable by E.
Proof. From reachability to models of φ. Assume that (q, v 0 ) is reachable and let ρ be a run of E witnessing this fact. Let a be the tree constructed from ρ by concatenating the sequences of encodings of transitions of ρ as explained above. The binary tree a certainly satisfies the "regular" part of φ. We now assign the data values so that the remaining part of φ is satisfied. This is done in the obvious way: each time a counter i is decremented, as the resulting value is non-negative, a matching increment was performed before, and the two corresponding nodes receive the same data value. Similarly, each time a constraint θ is used in a transition µ, we assign one distinct data value per triple L θ , R θ , T θ occurring in the encoding of µ. The formula was constructed to make the resulting tree a model of φ.
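The counter part of this data assignment can be illustrated by a small Python sketch. It is a simplification under stated assumptions, not the paper's encoding: nodes are only tagged as increments ("I"), decrements ("D") or other, the constraints θ are ignored, and `Node` and `assign_counter_data` are hypothetical names. Each increment receives a fresh value, and each decrement takes the value of a still unmatched increment of the same counter located below it.

```python
from itertools import count

class Node:
    def __init__(self, kind, counter=None, children=()):
        self.kind = kind            # "I" (increment), "D" (decrement) or anything else
        self.counter = counter      # counter number for "I" and "D" nodes
        self.children = list(children)
        self.datum = None

def assign_counter_data(root):
    """Assign data values bottom-up; return True iff all counters are 0 at the root.

    Raises ValueError if some decrement has no unmatched increment below it,
    i.e. if a counter would become negative.
    """
    fresh = count(1)

    def visit(node):
        # pool[i] = data values of increments of counter i below node, not yet matched
        pool = {}
        for child in node.children:
            for i, values in visit(child).items():
                pool.setdefault(i, []).extend(values)
        if node.kind == "I":
            node.datum = next(fresh)
            pool.setdefault(node.counter, []).append(node.datum)
        elif node.kind == "D":
            if not pool.get(node.counter):
                raise ValueError(f"counter {node.counter} would become negative")
            node.datum = pool[node.counter].pop()
        else:
            node.datum = next(fresh)
        return pool

    leftover = visit(root)
    return all(not values for values in leftover.values())

# Two increments merged by a binary node, then decremented one after the other.
inc_a, inc_b = Node("I", 1), Node("I", 2)
tree = Node("D", 1, [Node("D", 2, [Node("up", children=[inc_a, inc_b])])])
balanced = assign_counter_data(tree)
```

By construction, the resulting assignment gives every decrement a descendant increment with the same value and never reuses a value among increments or among decrements of the same counter; a return value of True corresponds to all counters being 0 at the root.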
From models of φ to reachability. Assume now that t = a ⊗ d |= φ. Unfortunately, it may happen that a does not encode a run of E because some section corresponds to a pseudo-encoding of an up-transition, instead of an expected real encoding. However, we show that from t we can construct another tree t′ = a′ ⊗ d′ such that t′ |= φ and a′ encodes a real run of E.
To see this, let us consider a node x of t with label T θ , where θ = C i1 C i2 → C i , and let d = d(x). Let x 1 and x 2 be two descendants of x with respective labels L θ and R θ and such that d = d(x 1 ) = d(x 2 ). Let z be the least common ancestor of x 1 and x 2 . The existence of x 1 and x 2 is guaranteed by φ (conditions 5-8). The sentence φ also ensures that x is an ancestor of z (conditions 9-10). By construction the subtree at z must correspond to a pseudo-encoding of an up-transition µ′.
We now move (down) x and its parent (that must have label I i ) right above z within the coding of µ′. Similarly we move (up) x 1 and its parent (that must have label D i1 ) right below z, and similarly for x 2 . The reader can verify that the resulting tree is still a model of φ: the regular conditions remain obviously satisfied. Conditions 1-4 are still valid because the node of label I i1 matching the parent of y was already below the initial position of y and its new position is upward in the tree. Finally, conditions 5-10 remain valid by construction.
Repeating this argument eventually yields a model t = b ⊗ d of φ such that b is a correct sequencing of encodings of transitions of E. This encoding is actually a real run because conditions 1-4 of φ immediately enforce that no counter is ever negative.

Theorem 5 is now immediate from Lemma 6.

Conclusion
We have seen that satisfiability of FO 2 (<, +1, ∼), emptiness of DTA # and reachability of EBVASS are equivalent problems in terms of decidability.The main open problem is of course whether they are all decidable or not.
The use of the EBVASS constraints of the form C i1 C i2 → C i is crucial for the construction of Section 3. Their semantics cannot be directly simulated with the usual BVASS, but it is not clear whether EBVASS are strictly more expressive than BVASS, and whether this extension is needed in order to capture the expressive power of FO 2 (<, +1, ∼) on data trees.
In our definition of EBVASS the constraints of the form C i1 C i2 → C i have a "commutative" semantics. Without commutativity, i.e., when the rule modifies only counter i 1 on the left child and counter i 2 on the right child, the automata model is more powerful. In order to describe its runs as in the proof of Theorem 5, the logic needs to be able to enforce that an L θ must be to the left of the R θ with the same data value. This can be done by adding the document order predicate to the logic. A close inspection of the proofs of Theorem 3 and Theorem 4 then shows that the extension of FO 2 (<, +1, ∼) with the document order predicate can be captured by a DTA # without the commutativity rule and that such automata can be captured by the non-commutative version of EBVASS.
In [2] it was shown that, over data words, the Data Automata model of [5] is more expressive than the Register Automata of [14]. It is not obvious that our automata model DTA # extends the expressive power of the straightforward extension of register automata to data trees. This remains to be investigated.
Fig. 1: A forest t followed by its class forests t[1] and t[2].

Intuitively, a DTA works as follows on a forest t = a ⊗ d: first the transducer A relabels the nodes of a into b, and then the forest automaton B has to accept all class forests of b ⊗ d. More formally, a data forest t = a ⊗ d ∈ Forests(A × D) is accepted by (A, B) iff
1. there exists b ∈ Forests(B) such that b is a possible output of A on a, and
2. for all d ∈ D, the class forest (b ⊗ d)[d] ∈ Forests(B # ) is accepted by B.
1 , d 1 + . . . + a m , d m . It indeed captures the class of languages of words (without data) recognized by counter automata. Like Data Automata, Class Automata are defined as pairs made of one transducer A and one word automaton B. However, the B part in the Class Automata model has access to the labels of the nodes that are not in the class, while it sees only # in the Data Automata case. This extra power implies undecidability.

We assume two finite alphabets A and B, writing the latter in extenso as B = {b 1 , . . ., b n }. A class automaton over A × D is a pair C = (A, B) where A is a non-deterministic letter-to-letter word transducer from A into B and B is a word automaton taking as input words over the alphabet B × {0, 1}. In order to define the acceptance of data words by class automata, we use the notion of class word associated to a data word w = b ⊗ d and a value d ∈ D, denoted w d , defined as the word having the same domain as w and such that, for every position x of w, w d (x) = (b(x), 1) if d(x) = d and w d (x) = (b(x), 0) otherwise. A data word w = a ⊗ d is accepted by C iff
1. there exists a word b over B such that b is a possible output of A on a, and
2. for all d ∈ D, the class word (b ⊗ d) d is accepted by B.
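The class word construction and the second acceptance condition can be sketched in Python as follows. This is an illustration under assumptions: the transducer A is left out (the labels are taken to be an output b of A), the word automaton B is modelled by an arbitrary predicate `accepts_B`, and all names are hypothetical. Since D is infinite, some value does not occur in the word, so the class word marked 0 everywhere is always among those to be accepted.

```python
def class_word(labels, data, value):
    """Class word of b ⊗ d for `value`: mark with 1 the positions carrying `value`."""
    return [(b, 1 if d == value else 0) for b, d in zip(labels, data)]

def all_class_words_accepted(labels, data, accepts_B):
    """Check that every class word of the data word is accepted by accepts_B.

    Only finitely many class words exist: one per value occurring in the word,
    plus the all-zero word obtained for any value that does not occur.
    """
    words = [class_word(labels, data, v) for v in set(data)]
    words.append([(b, 0) for b in labels])
    return all(accepts_B(w) for w in words)

def at_most_one_marked_a(word):
    """Example property for B: at most one marked position has label 'a'."""
    return sum(1 for b, mark in word if b == "a" and mark == 1) <= 1

labels = ["a", "b", "a"]
ok = all_class_words_accepted(labels, [1, 1, 2], at_most_one_marked_a)
ko = all_class_words_accepted(labels, [1, 2, 1], at_most_one_marked_a)
```

With this example predicate, the data word with values 1, 1, 2 is accepted (`ok` is True), whereas the one with values 1, 2, 1 is rejected (`ko` is False) because the class of value 1 contains two positions labelled a.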

Fig. 5: fcns view of the #-stuttering closure rules.x and y are arbitrary binary trees.

Fig. 6: Proof of Theorem 4, Case 1. The B-states are displayed in parentheses in the class tree t[d].
d 2 . We then set d as d(d 1 , d 2 ) and t = b ⊗ d where b = b(b 1 , b 2 ) is an output of A on a compatible with the transition.

From d 1 and d 2 we first apply a bijection making sure that the data values d 1 and d 2 of their roots are different and that d 1 has B-state p k2 in b 2 ⊗ d 2 and d 2 does not appear in d 1 (b 2 is as in previous cases).

From d 1 and d 2 we first apply a bijection making sure that the data values of their roots are equal (to d 1 ) and that d 1 and d 2 share a common data value d ≠ d 1 of B-state p k2 in b 2 ⊗ d 2 and B-state p k1 in b 1 ⊗ d 1 .

From d 1 and d 2 we first apply a bijection making sure that the data values d 1 and d 2 of their roots are distinct and that d 1 has B-state p 1 in b 2 ⊗ d 2 and d 2 has B-state p 2 in b 1 ⊗ d 1 . Moreover d 1 and d 2 share a common data value d distinct from d 1 and d 2 of B-state p k2 in d 2 and B-state p k1 in d 1 .

From d 1 and d 2 we first apply a bijection making sure that the data values d 1 and d 2 of their roots are distinct and that d 1 does not appear in d 2 and d 2 has B-state p 1 in b 1 ⊗ d 1 . Moreover d 1 and d 2 share a common data value d distinct from d 1 and d 2 of B-state p k2 in b 2 ⊗ d 2 and B-state p k1 in b 1 ⊗ d 1 .

From d 1 and d 2 we first apply a bijection making sure that the data values d 1 and d 2 of their roots are distinct and that d 1 does not appear in d 2 and d 2 does not appear in d 1 . Moreover d 1 and d 2 share a common data value d distinct from d 1 and d 2 of B-state p k2 in d 2 and B-state p k1 in d 1 .