Piecewise Testable Tree Languages

This paper presents a decidable characterization of tree languages that can be defined by a boolean combination of Σ_1 formulas. This is a tree extension of Simon's theorem, which says that a string language can be defined by a boolean combination of Σ_1 formulas if and only if its syntactic monoid is J-trivial.


Introduction
Logics for expressing properties of labeled trees and forests figure importantly in several different areas of Computer Science. This paper is about logics on finite trees. All the logics we consider are less expressive than monadic second-order logic, and thus can be captured by finite automata on finite trees. Even with these restrictions, this encompasses a large body of important logics, such as variants of first-order logic, temporal logics including CTL* or CTL, as well as query languages used in XML.
One way of trying to understand a logic is to give an effective characterization. An effective characterization for a logic L is an algorithm which inputs a tree automaton, and says if the language recognized by the automaton can be defined by a sentence of the logic L. Although giving an effective characterization may seem an artificial criterion for understanding a logic, it has proved to work very well, as witnessed by decades of research, especially into logics for words. In the case of words, effective characterizations have been studied by applying ideas from algebra: a property of words over a finite alphabet A defines a set of words, that is, a language L ⊆ A*. As long as the logic in question is no more expressive than monadic second-order logic, L is a regular language, and definability in the logic often boils down to verifying a property of the syntactic monoid of L (the transition monoid of the minimal automaton of L). This approach dates back to the work of McNaughton and Papert [11] on first-order logic over < (where < denotes the usual linear ordering of positions within a word). A comprehensive survey, treating many extensions and restrictions of first-order logic, is given by Straubing [16]. Thérien and Wilke [20,18,19] similarly study temporal logics over words.
An important early discovery in this vein, due to Simon [14], treats word languages definable in first-order logic over < with low quantifier complexity. Recall that a Σ_1 sentence is one that uses only existential quantifiers in prenex normal form, e.g. ∃x∃y x < y. Simon proved that a word language is definable by a boolean combination of Σ_1 sentences over < if and only if its syntactic monoid M is J-trivial. This means that for all m, m′ ∈ M, if MmM = Mm′M, then m = m′. (In other words, distinct elements generate distinct two-sided semigroup ideals.) Thus one can effectively decide, given an automaton for L, whether L is definable by such a sentence. (Simon did not discuss logic per se, but phrased his argument in terms of piecewise testable languages, which are exactly those definable by boolean combinations of Σ_1 sentences.)
There has been some recent success in extending these methods to trees and forests. (We work here with unranked trees and forests, and not binary or ranked ones, since we believe that the definitions and proofs are cleaner in this setting.) The algebra is more complicated, because there are two multiplicative structures associated with trees and forests: a horizontal and a vertical concatenation. Benedikt and Segoufin [1] use these ideas to effectively characterize sets of trees definable by first-order logic with the parent-child relation. Bojańczyk [2] gives a decidable characterization of properties definable in a temporal logic with unary ancestor and descendant operators. Similarly, Bojańczyk and Segoufin [3] and Place and Segoufin [13] provided decidable characterizations of tree languages definable in Δ_2(<) and FO^2(<, <_h), where < denotes the descendant-ancestor relationship while <_h denotes the sibling relationship. The general theory of the 'forest algebras' that underlie these studies is presented by Bojańczyk and Walukiewicz [6].
In the present paper we provide a further illustration of the utility of these algebraic methods by generalizing Simon's theorem from words to trees. In fact, we give several such generalizations, differing in the kinds of atomic formulas we allow in our Σ_1 sentences.
In Section 2 we present our basic terminology concerning trees, forests, and logic. Initially our logic contains two orderings: the ancestor relation between nodes in a forest, and the depth-first, left-first, total ordering of the nodes of a forest. In Section 3 we describe the algebraic apparatus. This is the theory of forest algebras developed in [6].
In Section 4 we give our main result, an effective test of whether a given language is piecewise testable (Theorem 4.1). The test consists of verifying that the syntactic forest algebra satisfies a particular identity. While we have to some extent drawn on Simon's original argument, the added complexity of the tree setting makes both formulating the correct condition and generalizing the proof quite nontrivial. We give a quite different, equivalent identity in Proposition 18, which makes clear the precise relation between piecewise testability for forest languages and J-triviality.
In Section 5, we study in detail a variant of our logic in which the binary ancestor relation is replaced by a ternary closest common ancestor relation, and prove a version of our main theorem for this case. Section 6 is devoted to other variants: the far simpler case of languages defined by Σ_1 sentences (instead of boolean combinations thereof); the logics in which only the ancestor relation is present, and in which the horizontal ordering on siblings is present; and, since our algebraic formalism concerns forests rather than trees, the modifications necessary to obtain an effective characterization of the piecewise testable tree languages. We discuss some directions for further research in the concluding Section 7.
An earlier, much abbreviated version of this paper, without complete proofs, was presented at the 2008 IEEE Symposium on Logic in Computer Science.

Notation
Trees, forests and contexts. In this paper we work with finite unranked ordered trees and forests over a finite alphabet A. Formally, these are expressions defined inductively as follows: for any a ∈ A, a is a tree. If t_1, ..., t_n is a finite sequence of trees, then t_1 + ⋯ + t_n is a forest. If s is a forest and a ∈ A, then as is a tree. It will also be convenient to have an empty forest, which we denote by 0; it satisfies a0 = a and 0 + t = t + 0 = t. Forests and trees alike will be denoted by the letters s, t, u, ... For example, the forest that we conventionally draw as
[Figure: example forest t, with its expression]
When there is no ambiguity we use as instead of a(s). In particular bc stands for the tree whose root has label b and has a unique child of label c.
The notions of node, child, parent, descendant and ancestor relations between nodes are defined in the usual way. We write x < y to say that x is a strict ancestor of y or, equivalently, that y is a strict descendant of x. We say that a sequence y_1, ..., y_n of nodes forms a chain if we have y_i < y_{i+1} for all 1 ≤ i < n. As our forests are ordered, each forest induces a natural linear order on its set of nodes that we call the forest-order and denote by <_dfs, which corresponds to the depth-first left-first traversal of the forest or, equivalently, to the order provided by the expression denoting the forest seen as a word. We write <_h for the horizontal-order, i.e. x <_h y expresses the fact that x is a sibling of y occurring strictly before y in the forest-order. Finally, the closest common ancestor of two nodes x, y is the unique node z that is a descendant of all nodes that are ancestors of both x and y.
If we take a forest and replace one of the leaves by a special symbol □, we obtain a context. This special node is called the hole of the context. Contexts will be denoted using letters p, q, r. For example, from the forest t given above, we can obtain, among others, the context
[Figure: example context p obtained from t]
A forest s can be substituted in place of the hole of a context p; the resulting forest is denoted by ps. If we take the context p above and if s = (b + ca), then ps is obtained by putting b + ca in place of the hole of p. This is depicted in the figure below.
[Figure: substitution of s = b + ca into the context p]
There is a natural composition operation on contexts: the context qp is formed by replacing the hole of q with p. This operation is associative, and satisfies (pq)s = p(qs) for all forests s and contexts p and q.
We distinguish a special context, the empty context, denoted □. It satisfies □s = s and p□ = □p = p for any forest s and context p.
Regular forest languages. A set L of forests over A is called a forest language. There are several notions of automata for unranked ordered trees, see for instance [8, chapter 8]. They all recognize the same class of forest languages, called regular, which also corresponds to definability in MSO as defined below.
Piecewise testable languages. We say that a forest s is a piece of a forest t if there is an injective mapping from nodes of s to nodes of t that preserves the label of the node together with the forest-order and the ancestor relationship. An equivalent definition is that the piece relation is the reflexive transitive closure of the relation {(pt, pat) : p is a context, a is a node, t is a forest or empty}. In other words, a piece of t is obtained by removing nodes from t while preserving the forest-order and the ancestor relationship. We write s ⪯ t to say that s is a piece of t. In the example above, a(a + b) + c is a piece of t.
We extend the notion of piece to contexts. In this case, the hole must be preserved while removing the nodes. The size of a piece is the size of the corresponding forest, i.e. the number of its nodes. The notions of piece for forests and contexts are related, of course. For instance, if p, q are contexts with p ⪯ q, then p0 ⪯ q0. Conversely, if s ⪯ t, then there are contexts p ⪯ q with s = p0 and t = q0.
A forest language L over A is called piecewise testable if there exists n ≥ 0 such that membership of t in L is determined by the set of pieces of t of size n or less. Equivalently, L is a finite boolean combination of languages {t : s ⪯ t}, where s is a forest. Every piecewise testable forest language is regular, since given n ≥ 0, a finite automaton can calculate on input t the set of pieces of t of size no more than n.
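The piece relation can be made concrete with a small sketch. It assumes a hypothetical encoding that is not part of the paper: a forest is a tuple of trees, and a tree is a pair (label, forest of children). The function `pieces` enumerates every piece by choosing, for each node, whether to delete it (promoting its children) or keep it; this is exponential and meant only for small examples.

```python
def pieces(forest):
    """Return the set of all pieces of a forest (small inputs only)."""
    result = {()}  # the empty forest 0 is a piece of every forest
    for label, children in forest:
        sub = pieces(children)
        # Either delete the root (children are promoted) or keep it.
        tree_pieces = sub | {((label, s),) for s in sub}
        result = {left + right for left in result for right in tree_pieces}
    return result


def size(forest):
    """Number of nodes of a forest."""
    return sum(1 + size(children) for _, children in forest)


def is_piece(s, t):
    """True if s is a piece of t."""
    return s in pieces(t)


def same_small_pieces(s, t, n):
    """True if s and t have the same pieces of size at most n."""
    small = lambda f: {p for p in pieces(f) if size(p) <= n}
    return small(s) == small(t)


def leaf(a):
    return (a, ())


# The forest a(a + b) + c in the assumed encoding.
example = (("a", (leaf("a"), leaf("b"))), leaf("c"))
```

With this encoding, `is_piece((leaf("a"), leaf("c")), example)` holds, while the forest c + a is not a piece of `example` because the forest-order must be preserved.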
Logic. Regularity and piecewise testability correspond to definability in a logic, which we now describe. A forest can be seen as a logical relational structure. The domain of the structure is the set of nodes. The signature contains a unary predicate P_a for each symbol a of the label alphabet A, plus possibly some extra predicates on nodes, such as the descendant relationship, the forest-order or the closest common ancestor. Let Ω be a set of predicates. The predicates Ω that we use always include (P_a)_{a∈A} and equality, hence we do not explicitly mention them in the sequel. We use the classical syntax and semantics for first-order logic, FO(Ω), and monadic second-order logic, MSO(Ω), building on the predicates in Ω. Given a sentence φ of any of these formalisms, the set of forests that are a model for φ is called the language defined by φ. In particular a language is definable in MSO(<, <_h) iff it is regular [8, chapter 8]. A Σ_1(Ω) formula is a formula of the form ∃x_1 ⋯ x_n γ, where the formula γ is quantifier-free and uses predicates from Ω. Initially we will consider two predicates on nodes: the ancestor order x < y and the forest-order x <_dfs y. Later on, we will see other combinations of predicates, for instance when the closest common ancestor is added, and the forest-order is removed.
It is not too hard to show that a forest language L can be defined by a Σ_1(<, <_dfs) sentence if and only if it is closed under adding nodes, i.e.

pt ∈ L ⇒ pqt ∈ L

holds for all contexts p, q and forests t. Moreover this condition can be effectively decided given any reasonable representation of the language L. We will carry out the details in Section 6.1.
We are more interested here in the boolean combinations of properties definable in Σ_1(<, <_dfs). It is easy to see that:
Proposition 2.1. A forest language is piecewise testable iff it is definable by a boolean combination of Σ_1(<, <_dfs) sentences.
One direction is immediate as for any forest s, the set of forests having s as a piece is easily definable in Σ_1(<, <_dfs). For instance the sentence ∃x, y, z, u P_a(x) ∧ P_a(y) ∧ [...] defines the language of forests having a(a + b) + c as a piece.
For the other direction, notice that for any language definable in Σ_1(<, <_dfs), by disambiguating the relative positions between each pair of variables, one can compute a finite set of pieces such that a forest belongs to the language iff it has one of them as a piece. For instance the sentence ∃x, y, z, u P_a(x) ∧ P_a(y) ∧ [...] defines the language of forests having a(a + b) + c, c + a(a + b) or ca(a + b) as a piece. This result does not address the question of effectively determining whether a given regular forest language admits either of these equivalent descriptions. Such an effective characterization is the goal of this paper:
The problem. Find an algorithm that decides whether or not a given regular forest language is piecewise testable.
As noted in the introduction, the corresponding problem for words was solved by Simon, who showed that a word language L is piecewise testable if and only if its syntactic monoid M(L) is J-trivial [14]; that is, if distinct elements m, m′ always generate distinct two-sided ideals. Note that one can test, given the multiplication table of a finite monoid M, whether M is J-trivial in time polynomial in |M|: for each pair m ≠ m′ in M, one calculates the ideals MmM and Mm′M and then verifies that they are different. Therefore, it is decidable if a given regular word language is piecewise testable. We assume that the language L is given by its syntactic monoid and syntactic morphism, or by some other representation, such as a finite automaton, from which these can be effectively computed.
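The polynomial-time test just described can be sketched as follows. The encoding is an assumption made for illustration: a finite monoid is given as a multiplication table over the indices 0, ..., |M| − 1, with `table[x][y]` standing for the product xy. The names `U1` and `Z2` are example tables chosen here, not objects from the paper.

```python
def two_sided_ideal(table, m):
    """Return the ideal MmM as a frozenset of element indices."""
    n = len(table)
    return frozenset(table[table[x][m]][y] for x in range(n) for y in range(n))


def is_j_trivial(table):
    """True if distinct elements generate distinct two-sided ideals."""
    ideals = [two_sided_ideal(table, m) for m in range(len(table))]
    return len(set(ideals)) == len(ideals)


# Two-element monoid {1, 0} with a zero; index 0 is the identity.
U1 = [[0, 1],
      [1, 1]]

# Two-element group Z/2Z; index 0 is the identity.
Z2 = [[0, 1],
      [1, 0]]
```

In `Z2` both elements generate the whole monoid, so the test fails, whereas in `U1` the zero generates a strictly smaller ideal than the identity.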
We will show that a similar characterization can be found for forests, although the characterization will be more involved. For decidability, it is not important how the input language is represented. In this paper, we will represent a forest language by a morphism into a finite forest algebra that recognizes it. Forest algebras are described in the next section.

Forest algebras
Forest algebras. Forest algebras were introduced by Bojańczyk and Walukiewicz as an algebraic formalism for studying regular tree languages [6]. Here we give a brief summary of the definition of these algebras and their important properties. A forest algebra consists of a pair (H, V) of monoids, subject to some additional requirements, which we describe below. We write the operation in V multiplicatively and the operation in H additively, although H is not assumed to be commutative. We denote the identity of V by □ and that of H by 0.
We require that V act on the left of H. That is, there is a map (v, h) ∈ V × H ↦ vh ∈ H such that w(vh) = (wv)h for all h ∈ H and v, w ∈ V. We further require that this action be monoidal, that is, □h = h for all h ∈ H, and that it be faithful, that is, if vh = wh for all h ∈ H, then v = w.
We further require that for every g ∈ H, V contains elements (□ + g) and (g + □) such that (□ + g)h = h + g, (g + □)h = g + h for all h ∈ H. Observe, in particular, that for all g, h ∈ H, (g + □)(h + □) = (g + h) + □, so that the map h ↦ h + □ is a morphism embedding H as a submonoid of V.
Let A be a finite alphabet, and let us denote by H_A the set of forests over A, and by V_A the set of contexts over A. Clearly H_A forms a monoid under +, V_A forms a monoid under composition of contexts (the identity element is the empty context □), and substitution of a forest into a context defines a left action of V_A on H_A. It is straightforward to verify that this action makes (H_A, V_A) into a forest algebra, which we denote A^Δ. If (H, V) is a forest algebra, then every map f from A to V has a unique extension to a forest algebra morphism α : A^Δ → (H, V) such that α(a□) = f(a) for all a ∈ A. In view of this universal property, we call A^Δ the free forest algebra on A.
We say that a forest algebra (H, V) recognizes a forest language L ⊆ H_A if there is a morphism α : A^Δ → (H, V) and a subset X of H such that L = α^{−1}(X). We also say that the morphism α recognizes L. It is easy to show that a forest language is regular if and only if it is recognized by a finite forest algebra.
Given L ⊆ H_A we define an equivalence relation ∼_L on H_A by setting s ∼_L s′ if and only if for every context p ∈ V_A, ps and ps′ are either both in L or both outside of L. We further define an equivalence relation on V_A, also denoted ∼_L, by p ∼_L p′ if for all s ∈ H_A, ps ∼_L p′s. This pair of equivalence relations defines a congruence of forest algebras on A^Δ. The quotient (H_L, V_L) is called the syntactic forest algebra of L. The projection morphism of A^Δ onto (H_L, V_L) is denoted α_L and called the syntactic morphism of L. The morphism α_L always recognizes L, and it is easy to show that L is regular iff (H_L, V_L) is finite.
Idempotents and aperiodicity. We recall the well known notions of idempotent and aperiodicity. If M is a finite monoid and m ∈ M, then there is a unique element e = m^n, where n > 0, such that e is idempotent, i.e., e^2 = e. If we take a common multiple of these exponents n over all m ∈ M, we obtain an integer ω > 0 such that m^ω is idempotent for every m ∈ M. Observe that while infinitely many different values of ω have this property with respect to M, the value of m^ω is uniquely determined for each m ∈ M.
Let (H, V) be a forest algebra. Since we write the operation in H additively, we denote powers of h ∈ H by n · h, where n ≥ 0. As noted above, H embeds in V, so any ω > 0 that yields idempotents for V serves as well for H. That is, there is an integer ω > 0 such that v^ω is idempotent for all v ∈ V, and ω · h is idempotent for all h ∈ H.
We say that a finite monoid M is aperiodic if it contains no nontrivial groups. Since the set of elements of the form m^ω m^k for k ≥ 0 is a group, aperiodicity is equivalent to having m^ω = m^{ω+1} for all m ∈ M. In this case we can take ω = |M|. All the finite monoids that we encounter in this paper are aperiodic. In particular, every J-trivial monoid is aperiodic, because all elements of a group in a finite monoid generate the same two-sided ideal.
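As an illustration of these notions, the following sketch computes the idempotent power of an element and tests aperiodicity through the equation that says the idempotent power absorbs one more factor. The table encoding (`table[x][y]` is the product xy, elements are indices) and the example tables `NIL3` and `Z2` are assumptions made for this example.

```python
def idempotent_power(table, m):
    """Return the unique idempotent among the powers m, m^2, m^3, ..."""
    p = m
    while table[p][p] != p:
        p = table[p][m]  # move on to the next power of m
    return p


def is_aperiodic(table):
    """True if e * m == e for every m, where e is the idempotent power of m."""
    for m in range(len(table)):
        e = idempotent_power(table, m)
        if table[e][m] != e:
            return False
    return True


# Monoid {1, a, 0} with a * a = 0; indices 0, 1, 2 in that order.
NIL3 = [[0, 1, 2],
        [1, 2, 2],
        [2, 2, 2]]

# Two-element group; index 0 is the identity.
Z2 = [[0, 1],
      [1, 0]]
```

The loop in `idempotent_power` terminates on any finite monoid table because the powers of m eventually cycle and the cycle contains an idempotent.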
Pieces. Recall that in Section 2, we defined the piece relation for contexts in the free forest algebra. We now extend this definition to an arbitrary forest algebra (H, V). The general idea is that a context v ∈ V is a piece of a context w ∈ V, denoted by v ⪯ w, if one can construct a term (using elements of H and V) which evaluates to w, and then take out some parts of this term to get v.
Let (H, V) be a forest algebra. We say v ∈ V is a piece of w ∈ V, denoted by v ⪯ w, if α(p) = v and α(q) = w hold for some morphism α : A^Δ → (H, V) and some contexts p ⪯ q over A. The relation ⪯ is extended to H by setting g ⪯ h if g = v0 and h = w0 for some contexts v ⪯ w.
As we will see in the proof of Lemma 3.1, in the above definition, we can replace the term "some morphism" by "any surjective morphism". The following example shows that although the piece relation is transitive in the free algebra A^Δ, it may no longer be so in a finite forest algebra.
Example: Consider the syntactic algebra of the language {abcd}, which contains only one forest, which in turn has just one path, labeled by abcd. The context part of the syntactic algebra has twelve elements: an error element ∞, and one element for each infix of abcd. We have a ⪯ aa = ∞ = bd ⪯ bcd, but we do not have a ⪯ bcd.
We will now show that in a finite forest algebra, one can compute the relation ⪯ in time polynomial in |V|. The idea is to use a different but equivalent definition. Let R be the smallest relation on V that satisfies the following rules, for all v, v′, w, w′ ∈ V:
1. □ R v;
2. v R v;
3. if v R w and v′ R w′, then vv′ R ww′;
4. if v R w, then (□ + v0) R (□ + w0);
5. if v R w, then (v0 + □) R (w0 + □).
Lemma 3.1. Over any finite forest algebra the relations R and ⪯ are the same.
In any finite algebra, the relation R can be computed by applying the rules until no new relations can be added. This gives the following corollary:
Corollary 3.2. In any given finite forest algebra, the relation ⪯ on contexts (also on forests) can be calculated in polynomial time.
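The saturation procedure behind this corollary can be sketched as follows. The sketch assumes that the five closure rules take the form suggested by their use in the proof of Lemma 3.1: the identity is related to everything, the relation is reflexive, it is closed under composition, and it is preserved by the two operations that send a context v to the contexts (v0 + hole) and (hole + v0). The encoding of the algebra by dictionaries and the two-element example (which recognizes the forests containing at least one fixed letter) are hypothetical.

```python
from itertools import product


def piece_relation(V, identity, compose, act, zero, plus_left, plus_right):
    """Saturate the assumed closure rules; return all pairs (v, w) with v R w.

    compose[v][w] is the product vw in V, act[v][h] is the action of v on h,
    plus_left[g] is the context (g + hole), plus_right[g] is (hole + g).
    """
    R = {(identity, v) for v in V} | {(v, v) for v in V}
    while True:
        new = set()
        for (v, w), (v2, w2) in product(R, repeat=2):
            new.add((compose[v][v2], compose[w][w2]))  # composition rule
        for v, w in R:
            g, h = act[v][zero], act[w][zero]
            new.add((plus_left[g], plus_left[h]))
            new.add((plus_right[g], plus_right[h]))
        if new <= R:
            return R
        R |= new


# Hypothetical algebra: H = {0, 1} under "or", V = {"id", "top"}.
V = ["id", "top"]
compose = {"id": {"id": "id", "top": "top"},
           "top": {"id": "top", "top": "top"}}
act = {"id": {0: 0, 1: 1}, "top": {0: 1, 1: 1}}
plus = {0: "id", 1: "top"}

R = piece_relation(V, "id", compose, act, 0, plus, plus)
```

Each round either adds at least one of the at most |V|² pairs or stops, so the loop runs a polynomial number of times.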
Proof of Lemma 3.1. We first show the inclusion of R in ⪯. Let α : A^Δ → (H, V) be any surjective morphism. A simple induction on the number of steps used to derive v R w produces contexts p ⪯ q with α(p) = v and α(q) = w. The surjectivity of α is necessary for starting the induction in the case □ R v.
For the opposite inclusion, suppose v ⪯ w. Then there is a morphism α : A^Δ → (H, V) and contexts p ⪯ q such that v = α(p), w = α(q). We will show that α(p) R α(q) by induction on the size of p:
• If p is the empty context, then the result follows thanks to the first rule in the definition of R.
• If p = a□, then from p ⪯ q it follows that q = q_1 a q_2 for some contexts q_1, q_2, and using the first three rules in the definition of R we get that α(p) = □ · α(a□) · □ R α(q_1) α(a□) α(q_2) = α(q), and hence α(p) R α(q).
• If there is a decomposition p = p_1 p_2 where p_1 and p_2 are not empty contexts, then from p ⪯ q there must be a decomposition q = q_1 q_2 with p_1 ⪯ q_1 and p_2 ⪯ q_2. By induction we get that α(p_1) R α(q_1) and α(p_2) R α(q_2). Then α(p) R α(q) follows by using the third rule in the definition of R.
• Suppose now p = s + □ or p = □ + s. We can assume that s is a tree, since otherwise the context p can be decomposed as (s_1 + □)(s_2 + □). Since s is a tree, it can be decomposed as a(p′0), with a being a context with a single letter and the hole below, and p′ a context smaller than p. By inspecting the definition of ⪯, there must be some decomposition q = q_0(a(q′0) + q_1) or q = q_0(q_1 + a(q′0)), with p′ ⪯ q′. By the induction assumption, α(p′) R α(q′). From this the result follows by applying rules three, four and five in the definition of R.
This argument shows that if v ⪯ w with respect to a particular morphism α, then v R w and consequently v ⪯ w with respect to every surjective morphism. Thus we have also established the claim made above that the relation ⪯ on H is independent of the underlying morphism.

Piecewise Testable Languages
The main result in this paper is a characterization of piecewise testable languages:
Theorem 4.1. A forest language is piecewise testable if and only if its syntactic algebra satisfies the identity

u^ω v = u^ω = v u^ω for v ⪯ u. (4.1)

The identity (4.1) is illustrated in Figure 1. In view of Corollary 3.2, an immediate consequence of Theorem 4.1 is that piecewise testability is a decidable property.
Corollary 4.2. It is decidable if a regular forest language is piecewise testable.
Proof. We assume the language is given by its syntactic forest algebra, which can be computed in polynomial time from any recognizing forest algebra. The new identities can easily be verified in time polynomial in |V_L| by enumerating all the elements of V_L.
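The enumeration mentioned in this proof can be sketched as follows, assuming that the identity of Theorem 4.1 says that the idempotent power of u absorbs every piece v of u on both sides. The piece relation is taken as an input set of pairs (v, u), so the sketch does not depend on how it was computed. The dictionary encoding and both example monoids are hypothetical; the second one is a bare group used only to show a failing case, not the vertical monoid of an actual forest algebra.

```python
def omega_power(compose, v):
    """Return the idempotent power of v in the monoid given by compose."""
    p = v
    while compose[p][p] != p:
        p = compose[p][v]
    return p


def satisfies_identity(compose, piece_pairs):
    """Check e*v == e == v*e for every pair (v, u), with e the idempotent power of u."""
    for v, u in piece_pairs:
        e = omega_power(compose, u)
        if not (compose[e][v] == e == compose[v][e]):
            return False
    return True


# Hypothetical vertical monoid {"id", "top"} with "top" absorbing.
compose_ok = {"id": {"id": "id", "top": "top"},
              "top": {"id": "top", "top": "top"}}
pairs_ok = {("id", "id"), ("id", "top"), ("top", "top")}

# Two-element group, with the relation containing at least reflexive pairs.
compose_bad = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
pairs_bad = {(0, 0), (0, 1), (1, 1)}
```

The check runs over at most |V|² pairs and each idempotent power is found after at most |V| multiplications, which matches the polynomial bound claimed in the proof.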
The above procedure gives an exponential upper bound for the complexity in case the language is represented by a deterministic or even nondeterministic automaton, since there is an exponential translation from automata into forest algebras. We do not know if this upper bound is optimal. In contrast, for languages of words, when the input language is represented by a deterministic automaton, there is a polynomial-time algorithm for determining piecewise testability [15].
In Sections 4.1 and 4.2, we prove both implications of Theorem 4.1. Finally, in Section 4.3, we give an equivalent statement of Theorem 4.1, where the relation ⪯ is not used. But before we prove the theorem, we would like to show how it relates to the characterization of piecewise testable word languages given by Simon.
Let M be a monoid. For m, n ∈ M, we write m ⊑ n if m is a (not necessarily connected) subword of n, i.e. there are elements n_1, ..., n_{2k+1} ∈ M such that

n = n_1 n_2 ⋯ n_{2k+1} and m = n_2 n_4 ⋯ n_{2k}.

We claim that, using this relation, the word characterization can be written in a manner identical to Theorem 4.1:
Theorem 4.3. A word language is piecewise testable if and only if its syntactic monoid satisfies the identity

n^ω m = n^ω = m n^ω for m ⊑ n. (4.2)

Proof. Recall that Simon's theorem says a word language is piecewise testable if and only if its syntactic monoid is J-trivial. Therefore, we need to show J-triviality is equivalent to (4.2). We use an identity known to be equivalent to J-triviality (see, for instance, [9], Sec. V.3):

(nm)^ω n = (nm)^ω = m (nm)^ω. (4.3)

Since the above identity is an immediate consequence of (4.2), it suffices to derive (4.2) from the above. We only show n^ω m = n^ω. As we assume m ⊑ n, there are decompositions

n = n_1 n_2 ⋯ n_{2k+1}, m = n_2 n_4 ⋯ n_{2k}.

By induction on i, we show

n^ω n_2 n_4 ⋯ n_{2i} = n^ω.

The result then follows immediately. The base i = 0 is immediate. In the induction step, we use the induction assumption to get:

n^ω n_2 n_4 ⋯ n_{2i} n_{2i+2} = n^ω n_{2i+2}.

By applying (4.3), we have

n^ω = n^ω n_1 n_2 ⋯ n_{2i+1} = n^ω n_1 n_2 ⋯ n_{2i+1} n_{2i+2}

and therefore

n^ω n_{2i+2} = n^ω n_1 n_2 ⋯ n_{2i+1} n_{2i+2} = n^ω.

Note that since the vertical monoid V in a forest algebra is a monoid, it would make syntactic sense to have the relation ⊑ instead of ⪯ in Theorem 4.1. Unfortunately, the "if" part of such a statement would be false, as we will show in Section 4.3. That is why we need to have a different relation on the vertical monoid, whose definition involves all parts of a forest algebra, and not just composition in the vertical monoid.

4.1. Correctness of the identities. In this section we show the easy implication in Theorem 4.1.
Proof. Fix a language L that is piecewise testable and let n be such that membership of t in L only depends on the pieces of t with at most n nodes.
We will use the following simple fact:
Fact 4.5. If r is any context, p ⪯ q are contexts and t is a forest, then rpt ⪯ rqt.
We only show the first part of the identity, i.e.

u^ω v = u^ω for v ⪯ u.

Fix v ⪯ u as above. By definition of ω, we can write the identity as an implication:

u^k = u^k u^k ⇒ u^k v = u^k.

Let k be as above. Let p ⪯ q be contexts that are mapped to v and u respectively by the syntactic morphism of L. By unraveling the definition of the syntactic algebra, we need to show that

r q^k p t ∈ L ⟺ r q^k t ∈ L

holds for any context r and forest t. Consider now the forests r q^{ik} t and r q^{ik} p t for i ∈ N.
As p ⪯ q, thanks to Fact 4.5, we get

r q^{ik} t ⪯ r q^{ik} p t ⪯ r q^{(i+1)k} t.

As i increases, the set of pieces of size at most n of r q^{ik} t can only grow. As there are only finitely many pieces of size at most n, for i sufficiently large, the two forests r q^{ik} t and r q^{(i+1)k} t have the same set of pieces of size at most n. Therefore, for sufficiently large i, the two forests r q^{ik} t and r q^{ik} p t have the same set of pieces of size at most n, and either both belong to L, or both are outside L. However, since α_L(q^k) = α_L(q^k q^k), we have

α_L(r q^{ik} t) = α_L(r q^k t) and α_L(r q^{ik} p t) = α_L(r q^k p t),

which gives the desired result.
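The stabilization step of this argument can be observed on a toy instance. The sketch below assumes a hypothetical encoding (a forest is a tuple of trees, a tree is a pair of a label and a forest of children), takes the repeated context to be a single node labeled a with the hole below it, and takes the forest plugged at the bottom to be a single node labeled b. It then compares the pieces of size at most 2 as the context is iterated.

```python
def pieces(forest):
    """All pieces of a forest, obtained by deleting nodes (small inputs only)."""
    result = {()}
    for label, children in forest:
        sub = pieces(children)
        tree_pieces = sub | {((label, s),) for s in sub}
        result = {left + right for left in result for right in tree_pieces}
    return result


def size(forest):
    return sum(1 + size(children) for _, children in forest)


def pieces_up_to(forest, n):
    """Pieces of the forest that have at most n nodes."""
    return {p for p in pieces(forest) if size(p) <= n}


def iterate_context(i, bottom):
    """Plug `bottom` under a chain of i nodes labeled a."""
    forest = bottom
    for _ in range(i):
        forest = (("a", forest),)
    return forest


bottom = (("b", ()),)
small = [pieces_up_to(iterate_context(i, bottom), 2) for i in range(1, 5)]
```

The sets in `small` grow with the number of iterations and stop changing from the second iteration on, which is the behaviour the proof relies on for large i.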

4.2. Completeness of the identities. This section is devoted to showing completeness of the identities: an algebra that satisfies identity (4.1) in Theorem 4.1 can only recognize piecewise testable languages. We fix an alphabet A, and a forest language L over this alphabet, whose syntactic forest algebra (H_L, V_L) satisfies the identity. We will write α rather than α_L to denote the syntactic morphism of L, and sometimes use the term "type of s" for the image α(s) (likewise for contexts).
We write s ∼_n t if the two forests s, t have the same pieces of size no more than n. Likewise for contexts. The completeness part of Theorem 4.1 follows from the following two results.
Lemma 4.6. Let n ∈ N. For k sufficiently large, if two forests satisfy s ∼_k s′, then they have a common piece t in the same ∼_n-class, i.e. t ⪯ s, t ⪯ s′, t ∼_n s, and t ∼_n s′.
Proposition 4.7. For n sufficiently large, pat ∼_n pt entails α(pat) = α(pt).
Proof of the completeness part of Theorem 4.1. Take n as in Proposition 4.7, and then apply Lemma 4.6 to this n, yielding k. We show that s ∼_k s′ implies s ∈ L ⟺ s′ ∈ L, which immediately shows that L is piecewise testable, by inspecting pieces of size k. Indeed, assume s ∼_k s′, and let t be their common piece as in Lemma 4.6. Since t is a piece of s with the same pieces of size n, it can be obtained from s by a sequence of steps where a single letter is removed in each step without affecting the ∼_n-class. Each such step preserves the type thanks to Proposition 4.7. Applying the same argument to s′, we get α(s) = α(t) = α(s′), which gives the desired conclusion.
We begin by showing Lemma 4.6, and then the rest of this section is devoted to proving Proposition 4.7, the more involved of the two results.
Proof of Lemma 4.6. We begin with the following observation.
Fact 4.8. Let n ∈ N and let K be a regular language. There is some constant k, such that every t ∈ K contains a piece s ∈ K of size at most k such that s ∼_n t.
Proof of Fact 4.8. Let β : A^Δ → (H, V) be a morphism into a finite forest algebra. Let m = |H|. There is a k such that every forest s of size greater than k can be written as s = q_0 q_1 ⋯ q_m s′ where s′ is a forest and the q_i are nonempty contexts: this is because every large enough forest contains either a collection of m siblings or a chain of length m. It follows that the sequence of values β(s′), β(q_m s′), β(q_{m−1} q_m s′), ..., β(q_1 ⋯ q_m s′) contains a repeat, and so we can remove a subsequence of the q_i and obtain a proper piece t of s such that β(s) = β(t). Thus every forest s has a piece t of size at most k such that β(s) = β(t).
Now let (H, V) be the direct product of the syntactic algebra (H_K, V_K) and the quotient algebra A^Δ/∼_n, and let β be the product of the syntactic morphism of K and the natural projection onto the quotient by ∼_n. If s ∈ K then there is a piece t of s of size at most k such that β(s) = β(t). Thus t ∈ K and s ∼_n t, proving the Fact.
We are now ready to prove Lemma 4.6. Fix n ∈ N. Notice that each ∼_n-class is a regular language and ∼_n has finitely many classes. For each ∼_n-class K, Fact 4.8 gives a constant k_K. Let k be the maximum of n and all these k_K; we claim the lemma holds for k. Indeed, take any two forests s ∼_k s′. Let t be a piece of s of size at most k with s ∼_n t, as given by Fact 4.8. Since s ∼_k s′, the forest t is also a piece of s′. Furthermore, since ∼_k implies ∼_n (by k ≥ n), we get s′ ∼_n s ∼_n t, which implies s′ ∼_n t by transitivity of ∼_n.
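The pigeonhole step in the proof of Fact 4.8 can be sketched as follows. The encoding is an assumption for illustration: the images of the contexts are names of elements of V, `act[v][h]` gives the action of v on h, and the images are listed from the outermost context to the innermost one. Whenever two partial values coincide, the contexts applied between them are dropped, which does not change the overall value.

```python
def shrink(contexts, h, act):
    """Drop contexts until the partial values are pairwise distinct.

    contexts: images of the contexts, outermost first; h: image of the
    innermost forest. The value of the whole product is preserved.
    """
    kept = list(contexts)
    while True:
        # values[i] is the value obtained after applying the last i contexts.
        values = [h]
        for q in reversed(kept):
            values.append(act[q][values[-1]])
        seen = {}
        for i, val in enumerate(values):
            if val in seen:
                j = seen[val]
                # The contexts applied in steps j+1..i leave the value unchanged.
                kept = kept[:len(kept) - i] + kept[len(kept) - j:]
                break
            seen[val] = i
        else:
            return kept


def evaluate(contexts, h, act):
    """Apply the contexts to h, innermost first."""
    for q in reversed(contexts):
        h = act[q][h]
    return h


# Hypothetical algebra with H = {0, 1} and V = {"id", "top"}.
act = {"id": {0: 0, 1: 1}, "top": {0: 1, 1: 1}}
original = ["id", "top", "id", "id"]
reduced = shrink(original, 0, act)
```

After shrinking, the partial values are pairwise distinct, so at most |H| − 1 contexts remain, which is the source of the size bound in the proof.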
We now show Proposition 4.7. Let us fix a context p, a label a and a forest t as in the statement of the proposition. The context p may be empty, and so may be the forest t. We search for the appropriate n; the size of n will be independent of p, a, t. We also fix the types v = α(p), h = α(t) for the rest of this section. In terms of these types, our goal is to show that vh = vα(a)h. To avoid clutter, we will sometimes identify a with its image α(a), and write vh = vah instead of vh = vα(a)h.
Let s be a forest and X be a set of nodes in s. The restriction of s to X, denoted s[X], is the piece of s obtained by only keeping the nodes in X.
Let s be a forest, X a set of nodes in s, and x ∈ X. We say that x ∈ X is a vah-decomposition of s if: a) if we restrict s to X, remove descendants of x, and place the hole in x, the resulting context has type v; b) the node x has label a; c) if we restrict s to X and only keep nodes in X that are proper descendants of x, the resulting forest has type h.
Definition 4.9. A fractal of length k inside a forest s is a sequence [...] A subfractal is extracted by only using a subsequence of the vah-decompositions. Such a subsequence is also a fractal.
Proof. The proof is by induction on k. The case k = 1 is obvious. Assume the lemma is proved for k and n and consider the case k + 1.
The set of forests which have a fractal of length k is a regular language, call it K .By Fact 4.8 applied to K , there is some constant m such that every forest in K has a piece that is also in K , and whose size is bounded by m.(In this reasoning, we do not use the parameter n of Fact 4.8, so we can call Fact 4.8 with n = 0).We can assume without loss of generality that m > n.In other words, if a forest has a fractal of length k, then it has a piece of size at most m which has a fractal of length k.This means that if a forest has a fractal of length k, then it has a fractal of length k which has at most m nodes (the number of nodes in a fractal is the number of nodes in the largest of its vah-decompositions).
Assume now that pat ∼_m pt. By the induction assumption, as m > n, we have a fractal of length k inside pat. From the previous observation, this fractal can be assumed to be of size smaller than m. Since pat and pt have the same pieces of size at most m, we obtain a piece of pt which is a fractal of length k inside pt. Clearly, this resulting fractal can be extended to a fractal of length k + 1 by taking for X_{k+1} all the nodes of pat and for x_{k+1} the node a.
... The rest of this section is devoted to a proof of this proposition.The general idea is as follows.Using some simple combinatorial arguments, and also Ramsey's Theorem, we will show that there is also a large subfractal whose structure is very regular, or tame, as we call it.We will then apply identity (4.1) to this regular fractal, and show that a node with label a can be eliminated without affecting the type.
A fractal x_1 ∈ X_1, …, x_k ∈ X_k inside a forest s is called tame if s can be decomposed as s = q q_1 ⋯ q_k s′ such that for each i = 1, …, k, the node x_i is part of the context q_i, see Fig. 2. This does not necessarily mean that the nodes x_1, …, x_k form a chain, since some of the contexts q_i may be of the form □ + t.
Lemma 4.12. Let k ∈ N. For n sufficiently large, if there is a fractal of length n inside pat, then there is a tame fractal of length k inside pat.
Proof. The main step is the following claim.
Claim 4.13. Let m ∈ N. For n sufficiently large, for every forest s, and every set X of at least n nodes, there is a decomposition s = q q_1 ⋯ q_m s′ where every context q_i contains at least one node from X.
Proof. Let Y be the smallest set of nodes that contains X and is closed under closest common ancestors. If n is chosen large enough, then either s[Y] consists of more than m trees, or it contains a node having more than m children, or it contains a chain of length bigger than m. We are thus left with three cases:
• In the set Y, there is a path y_1 < ⋯ < y_{m+1}. For i ∈ {1, …, m + 1}, consider the set of nodes Y_i. Each set Y_i contains at least one node of X, by definition of the set Y. The decomposition in the statement of the claim is chosen so that the context q_i corresponds to the set Y_i. The context q corresponds to all nodes that are not descendants of y_1, and the forest s′ corresponds to all descendants of y_{m+1}.
• There is a node y ∈ Y such that at least m + 1 children of y have some node from Y (and therefore also from X) in their subtree. Let t be the forest containing all proper descendants of y. By assumption on y, the forest t can be decomposed as t = t_1 + ⋯ + t_{m+1} so that each of the forests t_i contains at least one node from X. For the decomposition in the statement of the claim, we define q to be the set of nodes outside t, which includes y, we define q_i to be t_i + □, and we define s′ to be t_{m+1}.
• The forest s can be decomposed as s = t_1 + ⋯ + t_{m+1} so that each of the forests t_i contains at least one node from X. We conclude as in the previous case but with an empty q.
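The set Y used in this proof can be computed by a simple fixpoint iteration. The sketch below is only an illustration under assumed conventions: the forest is given as a parent map from integer node names to their parent (or None for roots), and the function names are hypothetical.

```python
from itertools import combinations


def ancestors_or_self(parent, x):
    """List x, its parent, its grandparent, and so on up to the root."""
    chain = []
    while x is not None:
        chain.append(x)
        x = parent[x]
    return chain


def cca(parent, x, y):
    """Closest common ancestor of x and y, or None if they lie in different trees."""
    seen = set(ancestors_or_self(parent, x))
    for z in ancestors_or_self(parent, y):
        if z in seen:
            return z
    return None


def cca_closure(parent, xs):
    """Smallest superset of xs that is closed under closest common ancestors."""
    closed = set(xs)
    changed = True
    while changed:
        changed = False
        for x, y in combinations(sorted(closed), 2):
            z = cca(parent, x, y)
            if z is not None and z not in closed:
                closed.add(z)
                changed = True
                break  # restart the scan with the enlarged set
    return closed


# A tree with root 0, whose children are 1 and 4; node 1 has children 2 and 3.
PARENT = {0: None, 1: 0, 2: 1, 3: 1, 4: 0}
```

With this parent map, closing {2, 3, 4} adds the nodes 1 and 0, while closing {2, 3} adds only the node 1.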
We now come back to the proof of the lemma. For k ∈ N let n be the number defined by Claim 4.13 for m = k². Let x_1 ∈ X_1, …, x_n ∈ X_n be a fractal of length n inside s = pat. We apply Claim 4.13 with X = {x_1, …, x_n} and obtain a decomposition s = q q_1 ⋯ q_m s′. For each i = 1, …, m the context q_i contains at least one node of X. We choose one of them arbitrarily and denote it by x_{n_i}. Unfortunately, the function i → n_i need not be monotone, as required in a tame fractal. However, we can always extract a monotone subsequence, since any number sequence of length k² is known to have a monotone subsequence of length k [10].
We now assume there is a tame fractal with the node x_i belonging to the context q_i. The dual case when the decomposition is s = q q_k ⋯ q_1 s′, corresponding to a decreasing sequence in the proof of Lemma 4.12, is treated analogously.
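The extraction of a monotone subsequence can be sketched as follows. This is a minimal illustration, not the argument of [10]: it assumes the entries are pairwise distinct and uses a quadratic dynamic program for the longest increasing subsequence, applied once to the sequence and once to its negation. The function names are hypothetical.

```python
def longest_increasing(seq):
    """Return one longest strictly increasing subsequence of seq."""
    n = len(seq)
    if n == 0:
        return []
    best = [1] * n    # best[i]: length of the longest such subsequence ending at i
    prev = [-1] * n   # back-pointers used to rebuild the subsequence
    for i in range(n):
        for j in range(i):
            if seq[j] < seq[i] and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    out = []
    while i != -1:
        out.append(seq[i])
        i = prev[i]
    return out[::-1]


def monotone_subsequence(seq):
    """Return a longest increasing or decreasing subsequence, whichever is longer."""
    increasing = longest_increasing(seq)
    decreasing = [-x for x in longest_increasing([-x for x in seq])]
    return increasing if len(increasing) >= len(decreasing) else decreasing
```

On a sequence of k² distinct numbers, the returned subsequence has length at least k, which is the property used to turn the indices n_1, …, n_m into a monotone sequence.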
The general idea is as follows. We will define a notion of monochromatic tame fractal, and show that vah = vh follows from the existence of a large enough monochromatic tame fractal. Furthermore, a large monochromatic tame fractal can be extracted from any sufficiently large tame fractal thanks to Ramsey's Theorem.
Let i, j, l be such that 0 ≤ i < j ≤ l ≤ k. We define u_{ijl} to be the image under α of the context obtained from q_{i+1} ⋯ q_j by only keeping the nodes from X_l (with the hole staying where it is). We define w_{ijl} to be the image under α of the context obtained from q_{i+1} ⋯ q_j by only keeping the nodes from X_l \ {x_l}. Straight from this definition, as X_l ⊆ X_{l+1}, we have
w_{ijl} ⪯ u_{ijl} and u_{ijl} ⪯ u_{ij(l+1)}. (4.4)
A tame fractal is called monochromatic if for all i < j < l and all i′ < j′ < l′ taken from {1, …, k}, we have
u_{ijl} = u_{i′j′l′}.
Note that in the above definition, we require j < l, even though u_{ijl} is defined even when j ≤ l.
We apply the following form of Ramsey's Theorem (see, for example, Bollobas [7]): Let c, r, k be positive integers. Then there exists an integer N with the following property. Let S be a set with |S| ≥ N, and suppose that the subsets of S of cardinality r are colored with c colors. Then there exists a subset T of S with |T| ≥ k such that all subsets of T of cardinality r have the same color.
Let ω be the exponent associated to the syntactic forest algebra (H L , V L ) as defined in Section 3. If there is a tame fractal of size N inside s, then the map {i, j, l} → u ijl gives us a coloring of the cardinality 3 subsets of {1, ... , N} with |V L | colors.By Ramsey's Theorem, if N is sufficiently large, there is a monochromatic fractal of length k = ω + 1 inside s.
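For a concrete finite monoid, an exponent ω can be computed directly. The sketch below assumes the standard notion, namely a number ω ≥ 1 such that v^ω is idempotent for every element v; the monoid is given as a multiplication table indexed by integers, and the function names are hypothetical.

```python
from math import gcd


def idempotent_power(mul, x):
    """Smallest n >= 1 such that x^n is idempotent in the finite table `mul`."""
    power, n = x, 1
    while mul[power][power] != power:
        power = mul[power][x]   # move from x^n to x^(n+1)
        n += 1
    return n


def exponent(mul):
    """Least common multiple of the idempotent powers of all elements.

    If x^n is idempotent then so is every x^(n*m), hence the least common
    multiple works simultaneously for all elements.
    """
    omega = 1
    for x in range(len(mul)):
        n = idempotent_power(mul, x)
        omega = omega * n // gcd(omega, n)
    return omega


# The monoid {1, a, 0} with a*a = 0, encoded as 0 -> identity, 1 -> a, 2 -> zero.
NILPOTENT = [[0, 1, 2],
             [1, 2, 2],
             [2, 2, 2]]
```

For the three-element monoid above the computed exponent is 2, so a monochromatic fractal of length 3 would be requested in that case.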
We conclude by showing the following result: Lemma 4.14.If there is a monochromatic tame fractal of length Proof.Fix a monochromatic tame fractal The type of s[X k \ {x k }] is decomposed the same way, only u (k −1)kk is replaced by w (k −1)kk .Therefore, the lemma will follow if Since the fractal is monochromatic, and since k = ω + 1 the above becomes kk .By (4.4) and monochromaticity we have Therefore identity (4.1) can be applied to show that both sides are equal to u ω 01k .Note that we use only one side of identity (4.1), u ω v = u ω .We would have used the other side when considering the case when s = qq k • • • q 1 s ′ .The first reason is that identity (4.1) refers to the relation v w .One consequence is that we need to prove Corollary 3.2 before concluding that identity (4.1) can be checked effectively.
The second reason is that we want to pinpoint how identity (4.1) diverges from J-triviality of the context monoid V. Consider the forest language "all trees in the forest are of the form aa". It is easy to verify that the syntactic forest algebra of this language is such that V is J-trivial. But this language is not piecewise testable, since for any k > 0, the forests k · aa and k · aa + a contain the same pieces of size at most k, but the first of these forests is in the language, while the second is not.
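The claim about k · aa and k · aa + a can be checked by brute force for small k. The sketch below is an illustration only: it assumes forests are encoded as nested tuples (a tree is a pair of a label and a tuple of subtrees) and enumerates pieces by restricting to every node set of bounded size. All function names are hypothetical.

```python
from itertools import combinations

AA = ("a", (("a", ()),))  # the tree aa
A = ("a", ())             # the single-node tree a


def size(forest):
    return sum(1 + size(children) for _, children in forest)


def restrict(forest, keep, counter=None):
    """Keep only the nodes whose depth-first index is in `keep`."""
    counter = [0] if counter is None else counter
    result = []
    for label, children in forest:
        index = counter[0]
        counter[0] += 1
        sub = restrict(children, keep, counter)
        if index in keep:
            result.append((label, sub))
        else:
            result.extend(sub)
    return tuple(result)


def pieces(forest, max_size):
    """All pieces of `forest` with at most `max_size` nodes."""
    n = size(forest)
    found = set()
    for r in range(min(max_size, n) + 1):
        for keep in combinations(range(n), r):
            found.add(restrict(forest, set(keep)))
    return found


def in_language(forest):
    """Membership in 'all trees in the forest are of the form aa'."""
    return all(tree == AA for tree in forest)


def copies(k):
    """The forest k · aa."""
    return (AA,) * k
```

For k = 1, 2, 3 the two forests agree on pieces of size at most k and first differ at size k + 1, where k · aa + a contains the piece consisting of k + 1 single-node trees.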
The proposition below identifies an additional condition (depicted in Figure 3) that must be added to J -triviality.Proposition 4.15.Identity (4.1) is equivalent to J -triviality of V , and the identity Proof.One implication is obvious: both J -triviality and (4.5) follow from (4.1).For the other implication, we assume V is J -trivial and that (4.5) holds.We must show that if v u, then We will only show the first equality, the other is done the same way.By unraveling the definition of v u, there is a morphism and two contexts p q over A such that α(p) = v and α(q) = u.
The proof goes by induction on the size of p.
If p can be decomposed as p 1 p 2 with p 1 , p 2 nonempty, then we have p 1 q and p 2 q and, by induction, α(q) ω • α(p 1 ) = α(q) ω , α(q) ω • α(p 2 ) = α(q) ω .Hence we get: If p consists of single node with a hole below, then we have q = q 0 pq 1 for some two contexts q 0 , q 1 , and therefore also u = u 0 vu 1 for some u 0 , u 1 .The result then follows by J -triviality of V (recall that J -triviality implies identity (4.3)): In the above, we used twice identity (4.3):Once when adding u 0 to u ω , and then when removing u 0 v from after u ω .
The interesting case is when p = + s for some tree s.In this case, the context q can be decomposed as q 1 ( + t)q 2 , with s t.We have Thanks to identity (4.3), the above can be rewritten as It is therefore sufficient to show that s t implies The proof of the above equality is by induction on the number of nodes that need to be removed from t to get s.The base case s = t follows by aperiodicity of H, which follows by aperiodicity of V , itself a consequence of J -triviality.Consider now the case when t is bigger than s.In particular, we can remove a node from t and still have s as a piece.In other words, there is a decomposition t = q 0 q 1 t ′ such that s q 0 t ′ .Applying the induction assumption, we get Furthermore, applying identity (4.5), we get Combining the two equalities, we get the desired result.

Closest common ancestor
According to the definition of piece in Section 2, t = d(a + b) is a piece of the forest s = dc(a + b). In this section we consider a notion of piece which does not allow removing the closest common ancestor of two nodes, in particular removing the node c in the example above. The logical counterpart of this notion is a signature where the closest common ancestor (a three argument predicate) is added.
Recall that in a forest s we say that a node z is the closest common ancestor of the nodes x and y, denoted z = x ⊓ y, if z is an ancestor of both x and y and all other nodes of s with this property are ancestors of z. Note that the ancestor relation can be defined in terms of the closest common ancestor, since a node x is an ancestor of y if and only if x is the closest common ancestor of x and y. We now say that a forest s is a cca-piece of a forest t, and write this as s ⊴ t, if there is an injective mapping from nodes of s to nodes of t that preserves the label of the node together with the forest-order and the closest common ancestor relationship (the ancestor relationship is then necessarily preserved). An equivalent definition is that the cca-piece relation is the reflexive transitive closure of the relation
{(pt, pat) : p is a context, a is a node, t is a tree or empty}.
Notice the difference with the notion of piece as defined in Section 2, where t could be an arbitrary forest. Similarly we say that a context p is a cca-piece of the context q, written p ⊴ q, if there is an injective mapping from p to q as above that also preserves the hole.
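The closure definition lends itself to a direct enumeration: a pair (pt, pat) with t a tree or empty amounts to deleting a node that has at most one child. The sketch below is an illustration under an assumed encoding (a tree is a pair of a label and a tuple of subtrees, a forest is a tuple of trees); the function names are hypothetical.

```python
def one_step_removals(forest):
    """Yield every forest obtained by deleting one node with at most one child."""
    for i, (label, children) in enumerate(forest):
        if len(children) <= 1:
            # Deleting this node promotes its only child, if any.
            yield forest[:i] + children + forest[i + 1:]
        for smaller in one_step_removals(children):
            yield forest[:i] + ((label, smaller),) + forest[i + 1:]


def cca_pieces(forest):
    """All cca-pieces of `forest`, as the closure under one-step removals."""
    seen = {forest}
    frontier = [forest]
    while frontier:
        current = frontier.pop()
        for smaller in one_step_removals(current):
            if smaller not in seen:
                seen.add(smaller)
                frontier.append(smaller)
    return seen


S = (("d", (("c", (("a", ()), ("b", ()))),)),)   # the forest dc(a + b)
D_AB = (("d", (("a", ()), ("b", ()))),)          # the forest d(a + b)
C_AB = (("c", (("a", ()), ("b", ()))),)          # the forest c(a + b)
```

In this encoding c(a + b) is a cca-piece of dc(a + b), whereas d(a + b) is not, because reaching it would require deleting the node c while it still has two children.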
A forest language L is called cca-piecewise testable if there exists n > 0 such that membership of t in L depends only on the set of cca-pieces of t of size n.
As before, every cca-piecewise testable language is regular and an analogue of Proposition 2.1 holds as well.
Recall that the ancestor relation can be expressed using the closest common ancestor relation, hence Σ_1(⊓, <_dfs) could be replaced by Σ_1(⊓, <_dfs, <) in the statement of Proposition 5.1. A first remark is that there are more cca-piecewise testable languages than there are piecewise testable ones. Hence the identities that characterize piecewise testable languages are no longer valid. In particular, in the syntactic algebra of a cca-piecewise testable language, the context monoid V may no longer be J-trivial. To see this consider the language L of forests over {a, b, c} that contain the cca-piece a(b + c). This is the language "some a is the closest common ancestor of some b and c". Then, for all n, the context p = (ab)^n is not equivalent to the context q = (ab)^n a, as p(b + c) ∉ L while q(b + c) ∈ L. Hence the identity (uv)^ω = (uv)^ω u does not hold in the syntactic context monoid of L. However, as we noted earlier, any J-trivial monoid satisfies this identity. Note however that p and q satisfy the equivalence pt ∈ L iff qt ∈ L for all trees t. The characterization below is a generalization of this idea of distinguishing trees from forests.
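The behaviour of the contexts p = (ab)^n and q = (ab)^n a can be checked on small instances. The sketch below assumes the nested-tuple encoding of forests (a tree is a pair of a label and a tuple of subtrees) and tests membership in L directly: some node labelled a must have two distinct children, one whose subtree contains a b and another whose subtree contains a c. The function names are hypothetical.

```python
def labels(tree):
    """Set of labels occurring in a tree."""
    label, children = tree
    out = {label}
    for child in children:
        out |= labels(child)
    return out


def in_L(forest):
    """True if some a is the closest common ancestor of some b and some c."""
    for label, children in forest:
        if label == "a":
            sets = [labels(child) for child in children]
            for i, first in enumerate(sets):
                for j, second in enumerate(sets):
                    if i != j and "b" in first and "c" in second:
                        return True
        if in_L(children):
            return True
    return False


def plug(word, bottom):
    """Plug the forest `bottom` into the chain context spelled by `word`."""
    forest = bottom
    for label in reversed(word):
        forest = ((label, forest),)
    return forest


B_PLUS_C = (("b", ()), ("c", ()))   # the forest b + c
P = "ab" * 3                        # the context (ab)^3
Q = "ab" * 3 + "a"                  # the context (ab)^3 a
```

Plugging the forest b + c separates the two contexts, while plugging a single tree such as b or a(b + c) gives the same answer for both.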
We call a context a tree-context if it is nonempty and has one node that is the ancestor of all other nodes, including the hole.
In the presence of the closest common ancestor, the situation is more complicated as well: cca-piecewise testability of a forest language L is not determined by the syntactic forest algebra alone. To obtain an algebraic characterization of this class of languages, it is necessary to look at the syntactic morphism α_L : A^∆ → (H_L, V_L) that maps each (h, v) to its ∼_L-class, and not just the image of this morphism. (We can be considerably more precise about this: The distinction is that the cca-piecewise testable languages do not form a variety of languages in the sense described by Eilenberg [9]. In particular, this family of languages lacks the crucial property of being closed under inverse images of morphisms between free forest algebras; this fails if the morphism maps some generator a to the empty context, or to a context of the form p + s, where p is a context and s is a nonempty forest. However cca-piecewise testable languages satisfy all the other properties of varieties of languages and in particular they are closed under inverse images of homomorphisms that are "tree-preserving", i.e., the image of a is a tree-context p for all a. Varieties of forest languages are discussed in [4].)
We extend the cca-piece relation to elements of a forest algebra (H, V) in the presence of a morphism α : A^∆ → (H, V) as follows: we write v ⊴ w if there are contexts p ⊴ q that are mapped to v and w respectively by the morphism α. There is a subtle difference here with the definition of ⪯ in Section 2: the relation ⊴ on V depends on the morphism α! Similarly we define the notion of g ⊴ h for g, h ∈ H.
The elements of V that are images under the morphism α of a tree-context are called tree-context-types.Similarly, the elements of H that are images of a tree are called treetypes (it is possible for an element to be an image of both a tree and a non-tree, but it is still called a tree-type here).Note that the notions of tree-type and of tree-context-type are relative to α. Theorem 5.2.A forest language L is cca-piecewise testable if and only if its syntactic algebra and syntactic morphism satisfy the following identities: whenever h is a tree-type or empty, and v u are tree-context-types, and Because of the finiteness of the syntactic forest algebra (H L , V L ) one can effectively decide whether an element of one of these monoids is the image of a tree-context or of a tree.Whether or not v u or g h holds can be decided in polynomial time using an algorithm as in Corollary 3.2 based on the following equivalent definition of : Let (H, V ) be a forest algebra and α a surjective morphism from A ∆ → (H, V ).Let then R be the smallest relation on V that satisfies the following rules, for all v, v ′ , w , w ′ ∈ V : For any finite (H, V ) and surjective morphism α, the relations R and are the same.
Proof. We first show the inclusion of R in ⊴. A simple induction on the number of steps used to derive v R w produces contexts p ⊴ q with α(p) = v and α(q) = w. Moreover p (respectively q) is a tree-context whenever v (respectively w) is a tree-context-type. The surjectivity of α is necessary for starting the induction in the case □ R v.
For the inclusion of ⊴ in R, we show that α(p) R α(q) holds for all contexts p ⊴ q. The proof is by induction on the size of p:
• If p is the empty context, then the result follows thanks to the first rule in the definition of R. If p = a then from p ⊴ q it follows that q = q_1 a q_2 for some contexts q_1, q_2, and using the first and second rule in the definition of R we get that □ R α(q_1), □ R α(q_2), and α(a) R α(a)α(q_2). Hence using the third rule in the definition of R we get the desired result by composition.
• If there is a decomposition p = p_1 a p_2 where p_1, p_2 are contexts, then from p ⊴ q there must be a decomposition q = q_1 a q_2 with p_1 ⊴ q_1 and p_2 ⊴ q_2. By induction we get that α(p_1) R α(q_1) and α(p_2) R α(q_2). Applying the second rule to the latter we get that α(ap_2) R α(aq_2). We can now apply the third rule to derive α(p) R α(q).
• If there is a decomposition p = p_1 p_2 where p_1, p_2 are nonempty contexts and p_1 is of the form (s + □ + t), then from p ⊴ q there must be a decomposition q = q_1 q_2 with p_1 ⊴ q_1 and p_2 ⊴ q_2 and where q_1 is of the form (s′ + □ + t′). We conclude by induction and using the fourth rule in the definition of R.
• The remaining case is when p = (t + □) (or p = □ + t) where t is a tree of the form ap′0 for some context p′. Then from p ⊴ q we have q = aq′0 + q_1 for some contexts q_1, q′, with p′ ⊴ q′. By induction we have α(p′) R α(q′). Using the second rule we get α(ap′) R α(aq′). Using the last rule we get α(p) R α(aq′0 + □). By the first rule we have □ R α(q_1). We conclude using the fourth rule.
This implies that Theorem 5.2 yields a decidable characterization of the cca-piecewise testable languages.
Corollary 5.4.It is decidable if a regular forest language is cca-piecewise testable.
The proof of Theorem 5.2 follows the same outline as the proof of Theorem 4.1, but the details are somewhat more complicated.
5.1. Proof of Theorem 5.2. The proof that (5.1) and (5.2) are necessary is the same as in Section 4.1. The only difference is that instead of Fact 4.5, we use the following.
Fact 5.5. If r is any context, p ⊴ q are tree-contexts, and t is a tree or empty, then rpt ⊴ rqt.
We now turn to the completeness proof in Theorem 5.2.The proof is very similar to the one of the previous section, with some subtle differences.
As before, we fix a language L whose syntactic forest algebra (H, V) satisfies all the identities of Theorem 5.2. We write α for the syntactic morphism.
We now write s ∼ n t if the two forests s, t have the same cca-pieces of size n.Likewise for contexts.
The main step is to show the following proposition.
Proposition 5.6.For n sufficiently large, if t is a tree or empty, then pat ∼ n pt entails α(pat) = α(pt).
Theorem 5.2 follows from the above proposition in the same way as Theorem 4.1 follows from Proposition 4.7 in the previous section. The reason why we assume that t is either a tree or empty is that when s is a cca-piece of s′, then s can be obtained from s′ by iterating one of the following two operations: removing a leaf, or removing a node which has only one child. Hence during the pumping argument yielding Theorem 5.2 from Proposition 5.6 it is enough to preserve the type only for these operations. We thus concentrate on showing Proposition 5.6.
We will now redefine the concept of fractal for our new, closest common ancestor setting.The key change is in the concept of a vah-decomposition.We change the notion of x ∈ X being a vah-decomposition of s as follows: all conditions of the old definition hold, but new conditions are added.First we require that s[X ] be a closest common ancestor piece of s, in particular this implies that if two elements of X have a closest common ancestor in s then this closest common ancestor is also in X .Moreover either x has no descendants in X ; or there is a minimal element of X that has x as a proper ancestor.In other words, the part of s[X ] that corresponds to h is either empty, or is a tree.In particular, s[X \ {x}] is a closest common ancestor piece of s[X ]; which is the key property required below.From now on, when referring to a vah-decomposition, we use the new definition.In particular in the concept of a fractal x 1 ∈ X 1 , ... , x k ∈ X k inside s we now have that for each i, x i ∈ X i is a vah-decomposition of s in the new sense.
The proof of the following lemma is exactly the same as its counterpart in Section 4.2 (Lemma 4.10) and is therefore omitted.Lemma 5.7.Let k ∈ N.For n sufficiently large, if t is a tree or empty, then pat ∼ n pt entails the existence of a fractal of length k inside pat.
such that either:
• Each q_i is a tree-context whose root node belongs to X_i \ {x_i}.
• Each q_i is a context of the form □ + t_i, with t_i a forest.
Lemma 5.8. Let k ∈ N. For n sufficiently large, if there is a fractal of length n inside pat, then there is a cca-tame fractal of length k inside pat.
Proof. The proof is essentially the same as for the counterpart in Section 4.2 (Lemma 4.12); only this time we need to be more careful to satisfy the more stringent requirements in a cca-tame fractal.
Let m = 2k + 2. Using the same reasoning as in the proof of Lemma 4.12, if n is large enough then we may extract a subfractal of length m where either:
• All the nodes x_1, …, x_m have the same closest common ancestor. In this case, we can extract a cca-tame subfractal, where each context is of the form □ + t_i.
• The set Y = {y : y is a closest common ancestor of some x_i, x_j} contains a chain y_1 < ⋯ < y_m, such that for each i ≤ m, the set Y_i = {z : z ≥ y_i and z ≱ y_{i+1}} contains the node x_i. (There is a second case, where the nodes y_1, …, y_m are ordered the other way, with y_{i+1} an ancestor of y_i. This case is treated analogously.) In particular, y_i is the closest common ancestor of x_i and any of the nodes x_{i+1}, …, x_m. Since X_{i+1} contains both x_i and x_{i+1}, each node y_i belongs to the set X_{i+1}. As we may have x_i = y_i, the desired cca-tame fractal is obtained as follows: we use x_2 ∈ X_2, x_4 ∈ X_4, …, x_{2k} ∈ X_{2k} as the fractal (recall that m = 2k + 2), while the decomposition q q_1 ⋯ q_k s′ is chosen so that q_i has its root in y_{2i−1} and its hole in y_{2i+1}.
Recall the definition of u_{ijl} and w_{ijl} as the images under α of the contexts obtained from q_{i+1} ⋯ q_j by restricting s to X_l and X_l \ {x_l}, respectively. Note that because of the new definition of fractals we have:
if the q_i are tree-contexts then u_{ijl}, w_{ijl} are tree-context-types. (5.4)
The definition of monochromaticity is the same as in the previous section, and Ramsey's Theorem gives:
Lemma 5.9. If there is a cca-tame fractal of sufficiently large size inside pat, then there is a monochromatic cca-tame fractal of size m = ω + 2 inside pat.
We will now take a monochromatic cca-tame fractal, and conclude by showing that α(pat) = α(pt).
Proof.Fix a monochromatic cca-tame fractal of size m = ω + 2 and let k = m − 1.Since x k ∈ X k is a vah-decomposition, the statement of the lemma follows once we show that α assigns the same type to the forest s Recall that the type of the forest s[X k ] can be decomposed as follows (the case where ) and notice that if q m is a tree-context then h is a tree-type.Therefore, the lemma will follow if Since the fractal is monochromatic, and since k = ω + 1, the above becomes 3) and monochromaticity, we have We now have two cases.If all the q i are tree-contexts, we conclude using identity (5.1) which can be applied because of (5.5), and the fact that h is then a tree-type and (5.4).If all the q i are contexts of the form + f i , we conclude from (5.5) using identity (5.2).

5.2. An equivalent set of identities. In this section, we give a set of identities that is equivalent to the one used in Theorem 5.2. The rationale is the same as in Proposition 4.15: we want to avoid the use of v ⊴ w in the identities.
Proposition 5.11.The conditions on the syntactic morphism stated in Theorem 5.2 are equivalent to the following equalities: (uv) ω h = (uv) ω uh (5.6) whenever h is a tree-type or empty, and whenever u and v are tree-context-types, and whenever u is a tree-context-type or empty and g, h are tree-types or empty.
The rest of Section 5.2 is devoted to showing the above proposition.
It is immediate to see that identity (5.1) implies identity (5.7) and that identity (5.1) implies identity (5.8).We now show that identities (5.1) and (5.2) imply identity (5.6).Let u and v be two context-types and h be a tree-type.We want to show that (uv) ω h = (uv) ω uh.
We consider several cases.• In the first case we assume that u = u 1 u 2 for some tree-context-type u 2 .In that case we have: Notice now that u 2 v u 2 vu 1 u 2 vu 1 and that u 2 vu 1 u 2 u 2 vu 1 u 2 vu 1 .As u 2 is a tree-contexttype, all the context-types involved are tree-context-types and we can use identity (5.1) twice and replace u 2 v by u 2 vu 1 u 2 .This yields: And we have (uv By idempotency, this yields the desired result: As v 2 is a tree-context-type, all the context-types involved are tree-context-types and we can use identity (5.1) twice and replace v 2 by v 2 u.This yields: And we have (uv) ω h = (uv) ω (uv) ω uh = (uv) ω uh • When none of the above cases works, we must have u = f 1 + + f 2 and v = g 1 + + g 2 .In that case we have (uv) ω h = ω •(f 1 +g 1 )+h +ω •(g 2 +f 2 ), and we conclude using identity (5.2) as f 1 (f 1 + g 1 ) and f 2 (f 2 + g 2 ).
We now consider the converse implication in Proposition 5.11.Assume that identities (5.6)-(5.8)hold.We show that identities (5.1) and (5.2) are satisfied.We first show the following lemma: Lemma 5.12.If u is a tree-context-type, v, w , w ′ are (not necessarily tree) context-types with w ′ w , and g, h are either tree-types or empty, then the following identity holds Note that the identity (5.2) is a direct consequence of the above, by taking u, v to be the empty context, and g, h to be the empty tree.We will also use the above lemma to show (5.1), but this will require some more work.
Proof.The proof is by induction on the number of steps used to derive w ′ w .
• Consider first the case when w , w ′ can be decomposed as Two applications of the induction assumption give us for all tree-type or empty g: As u is a tree-context-type we can iterate on (5.10) and then apply (5.11) in order to derive: As u is a tree-context-type, we can apply again (5.10) in the reverse direction in order to derive the desired result.• Consider now the case when w , w ′ can be decomposed as with w ′ 3 a tree-context-type or empty.We first use the induction assumption to get 13) By applying the identity (5.8), we get for all tree-type or empty g: 14) Note that it is important here that w ′ 3 h is either a tree-context-type or empty.Finally, we apply once again the induction assumption to get As u is a tree-context type, we can first iterate on (5.13), then iterate on (5.14) and finally applying (5.15) in order to get: Because u is a tree-context-type we can now apply (5.13) and (5.14) in reverse to eliminate the inner products and obtain the desired result.
• Finally, consider the case when w , w ′ can be decomposed as In this case, the identity becomes: 0)g where v ′ = v(h + ).The result now follows by induction assumption with w 1 , w ′ 1 in place of w , w ′ .
We now claim that all cases have been considered.Assume first that either w ′ or w consists of several trees.Then, by the definition of , w ′ and w can be decomposed into smaller forests and we conclude using the first bullet.We can thus assume that both w and w ′ are trees.If w ′ contains a node between its root and its hole then, by definition of , we can decompose w and w ′ and apply the second bullet.Similarly we can transform w using the first bullet until the third bullet can be applied.
We now derive the first part of identity (5.1).Let u, v be tree-context-types such that v u, and let h be a tree-type.We show by induction on v that u ω h = u ω vh.
where both v 1 and v 2 are tree-context-types then we consider v 2 first and v 1 next: It is important here that v 2 h is a tree-type.
Therefore it is enough to consider the case where v is of the form α(a)( + f ) for some letter a and some forest-type f .In the sequel we write a instead of α(a) in order to improve readability.From v u we get u = u 1 a( + g)u 2 where u 1 and u 2 are tree-context-types and f g.Then we have from identity (5.6) for any tree-type h: and therefore, as a( + g)h is a tree-type we get for any tree-type h: Iterating on (5.16) we get: It will therefore be enough to show for f g.This, however, is a consequence of (5.9).The second part of identity (5.1), u ω = vu ω , is shown the same way using identity (5.7) instead of identity (5.6) and building on (5.17) below instead of (5.9).Lemma 5.13.If u is a tree-context-type, v, w , w ′ are (not necessarily tree) context-types with w ′ w , and g, h are either tree-types or empty, then the following identity holds (5.17) Proof.Identical to the proof of Lemma 5.12, applying the other side of identity (5.8).

Variations
In this section we show that the techniques we developed in the previous sections are fairly robust and can be adapted to many situations.We describe some of them.
6.1. Languages definable in Σ_1. Here we treat the relatively simple case of languages defined by Σ_1 sentences (rather than boolean combinations of such formulas). We will prove:
Theorem 6.1. It is decidable whether a given regular forest language L is definable by a Σ_1(<, <_dfs) sentence.
We will show how to do this using the syntactic forest algebra and syntactic morphism, although this could be carried out just as well using an automaton model.The argument we give is based on an idea of Pin [12] concerning ordered monoids.
Let L ⊆ H A be a regular forest language, and let α L : A ∆ → (H L , V L ) be its syntactic morphism.We set The relations ≤ H L and ≤ V L are partial orders on H L and V L , respectively.These orders are compatible with the algebra operations in the sense that whenever Proof.This is straightforward from the definitions: Transitivity and reflexivity of ≤ H L are obvious.To prove antisymmetry, suppose Transitivity and reflexivity of ≤ V L are likewise trivial, and antisymmetry follows from the antisymmetry of ≤ H L and the faithfulness of the action of V L on H L .For the multiplicative properties, let h i , u i , v i be as in the statement of the Proposition.
Theorem 6.3. Let L ⊆ H_A be a regular forest language. The following are equivalent:
• L is definable by a Σ_1(<, <_dfs) formula.
• For all contexts p, q and forests t, if pt ∈ L then pqt ∈ L.
• The identity □ of V_L satisfies □ ≤_{V_L} v for all v ∈ V_L.
Proof. The first condition implies the second, because inserting new nodes in a forest does not change the < or <_dfs relation among the already existing nodes.
To show that the second condition implies the first, we use a pumping argument. Let n = |H_L|. There exists K > 0 such that any forest s with at least K nodes has a factorization s = q_1 q_2 ⋯ q_n t for some forest t and nonempty contexts q_i. By the pigeonhole principle, two of the n + 1 forests t, q_n t, …, q_1 ⋯ q_n t have the same type. In particular, there is a factorization s = pqt with q nonempty and α_L(t) = α_L(qt). Thus a forest belongs to L if and only if it is obtained by successive insertion of nodes starting with a forest in L of size less than K. We can write a Σ_1 sentence φ that describes all the relations among nodes of the forests of size less than K that belong to L, and thus this sentence defines L.
To show the equivalence of the second and third conditions, suppose the second condition holds. We need to show □ ≤_{V_L} v for all v ∈ V_L. This says that for every forest s and every context p, s ∈ L implies ps ∈ L, which follows from the second condition. Conversely, suppose the third condition holds, and that p, q are contexts and t a forest with pt ∈ L. Then α_L(pt) = α_L(p)α_L(t) ∈ X. By the multiplicative properties of the partial order, α_L(p)α_L(q)α_L(t) ∈ X, and thus pqt ∈ L.
Theorem 6.1 is an immediate corollary, since one can effectively compute the order ≤_{V_L} given the syntactic algebra and syntactic morphism of L.
6.2. Commutative languages. In this section we consider forest languages that are commutative, i.e., closed under rearranging siblings.
A forest t′ is called a reordering of a forest t if it is obtained from t by rearranging the order of siblings. In other words, reordering is the least equivalence relation on forests that identifies all pairs of forests of the form p(s + t) and p(t + s). A forest language is called commutative if it is closed under reordering. Equivalently, a forest language is commutative if and only if its syntactic forest algebra satisfies the identity
g + h = h + g.
We say a forest s is a commutative piece of t if s is a piece of some reordering of t. A forest language L is called commutative-piecewise testable if for some n ∈ N, membership of t in L depends only on the set of commutative pieces of t that have no more than n nodes. This definition also has a counterpart in logic, obtained by removing the forest-order from the signature. The following proposition is immediate:
Proposition 6.4. A forest language is commutative-piecewise testable iff it is definable by a Boolean combination of Σ_1(<) formulas.
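Whether two forests are reorderings of each other can be tested by comparing canonical forms. The sketch below is an illustration under an assumed encoding (a tree is a pair of a string label and a tuple of subtrees, a forest is a tuple of trees); the function names are hypothetical. It does not compute commutative pieces, which would additionally require ranging over the reorderings of the larger forest.

```python
def canonical(forest):
    """Canonical representative of a forest up to rearranging siblings.

    Siblings are sorted recursively, so two forests have the same canonical
    form exactly when one is a reordering of the other.
    """
    return tuple(sorted((label, canonical(children)) for label, children in forest))


def is_reordering(s, t):
    return canonical(s) == canonical(t)


A_BC = (("a", (("b", ()), ("c", ()))),)   # the forest a(b + c)
A_CB = (("a", (("c", ()), ("b", ()))),)   # the forest a(c + b)
AB_C = (("a", (("b", ()),)), ("c", ()))   # the forest ab + c
```

Here a(b + c) and a(c + b) are reorderings of each other, while ab + c is a reordering of neither.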
If a language is commutative-piecewise testable, then it is clearly commutative and piecewise testable (in the more powerful, noncommutative, sense). Below we show that the converse implication is also true:
Theorem 6.5. A forest language is commutative-piecewise testable if and only if it is commutative and piecewise testable.
As piecewise testability is decidable, by Corollary 3.2, and commutativity is obviously decidable, the theorem above implies decidability:
Corollary 6.6. It is decidable if a regular forest language is commutative-piecewise testable.
Theorem 6.5 follows quite easily from:
Lemma 6.7. Let n ∈ N. For k sufficiently large, if two forests have the same commutative pieces of size at most k, then they can both be reordered so that the resulting forests have the same pieces of size at most n.
To see that Theorem 6.5 follows from the lemma, assume L is a commutative and piecewise testable forest language. We need to show that there is a k such that if t and s have the same commutative pieces of size at most k then t ∈ L iff s ∈ L. As L is piecewise testable, there exists an n such that whenever s and t have the same pieces of size at most n, then t ∈ L iff s ∈ L. Let k be the number given by Lemma 6.7 for that n. Assume now that s and t have the same commutative pieces of size at most k. By Lemma 6.7 they can be reordered into s′ and t′, respectively, such that s′ and t′ have the same pieces of size at most n. Hence s′ ∈ L iff t′ ∈ L. But as L is commutative, this yields s ∈ L iff t ∈ L, as desired.
Proof of Lemma 6.7. Let P(s) be the set of pieces of s that have size at most n. As in Lemma 4.6, there is some k such that any forest s has a piece t_s of size at most k with P(s) = P(t_s). Let now s_1, s_2 be two forests with the same commutative pieces of size at most k. For i = 1, 2, consider the families
P_i = {P(s′) : s′ is a reordering of s_i}.
To prove the lemma, we need to show that the families P_1 and P_2 share a common element. To this end, we show that for any X ∈ P_1, there is some Y ∈ P_2 with X ⊆ Y, and vice versa; in particular, the families share the same maximal elements. (Indeed, if X is maximal in P_1 and X ⊆ Y ⊆ X′ with Y ∈ P_2 and X′ ∈ P_1, then X = X′ by maximality, and hence X = Y.) Let then X = P(s′_1) ∈ P_1. By the choice of k, the forest s′_1 has a piece t of size at most k with P(t) = X. Therefore t is a commutative piece of s_1 of size at most k. By assumption, the forest t is also a commutative piece of s_2 and therefore a piece of some reordering s′_2 of s_2. Hence X ⊆ P(s′_2) ∈ P_2.
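The objects used in this proof can be computed by brute force on small examples. The following is only a sketch under assumptions that are not in the paper: forests are encoded as tuples of (label, children) pairs, a piece is a forest obtained by deleting nodes, and the helper names pieces, family and maximal are hypothetical. It computes P(s) for a given bound n, the family of sets P(s′) over all reorderings s′ of s, and the maximal elements of that family.

```python
from functools import lru_cache
from itertools import permutations, product

# Hypothetical encoding: forest = tuple of trees, tree = (label, forest).

def size(f):
    """Number of nodes of a forest."""
    return sum(1 + size(children) for _, children in f)

@lru_cache(maxsize=None)
def pieces(t, n):
    """P(t): the set of all pieces of forest t with at most n nodes."""
    if not t or n == 0:
        return frozenset({()})
    (label, children), rest = t[0], t[1:]
    # Case 1: the first root is deleted; its children take its place.
    result = set(pieces(children + rest, n))
    # Case 2: the first root is kept, using up one node of the budget.
    for x in pieces(children, n - 1):
        for y in pieces(rest, n - 1 - size(x)):
            result.add(((label, x),) + y)
    return frozenset(result)

def reorderings(t):
    """Yield every forest obtained from t by permuting siblings at all levels."""
    for perm in permutations(t):
        options = [[(label, r) for r in reorderings(children)]
                   for label, children in perm]
        for choice in product(*options):
            yield tuple(choice)

def family(s, n):
    """The family {P(s') : s' is a reordering of s}."""
    return {pieces(r, n) for r in reorderings(s)}

def maximal(fam):
    """Elements of fam that are not strictly contained in another element."""
    return {X for X in fam if not any(X < Y for Y in fam)}

leaf = lambda x: (x, ())
s1 = (("a", (leaf("b"), leaf("c"))),)     # a(b + c)
s2 = (("a", (leaf("c"), leaf("b"))),)     # a(c + b), a reordering of s1
print(family(s1, 2) == family(s2, 2))
print(maximal(family(s1, 2)) == maximal(family(s2, 2)))
```

Since s2 is a reordering of s1, the two forests have the same reorderings, so the families and their maximal elements coincide here; the lemma asserts a common element in the more general situation where the forests merely share their small commutative pieces.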
Similarly we can define the notion of commutative-cca-piece and commutative-cca-piecewise testable forest language. Using the same arguments as above we can prove:
Proposition 6.8. A forest language is commutative-cca-piecewise testable iff it is definable by a Boolean combination of Σ_1(⊓) formulas.
Theorem 6.9. A forest language is commutative-cca-piecewise testable if and only if it is commutative and cca-piecewise testable.
Corollary 6.10. It is decidable if a regular forest language is commutative-cca-piecewise testable.
6.3. Tree languages. Our previous results provided decidable characterizations for forest languages, and in fact the algebraic theory used here works best when forests, rather than trees, are treated as the fundamental object. Traditionally, though, interest has focused on trees rather than forests. Thus we want to give a decidable characterization of the piecewise testable tree languages or, equivalently, the sets of trees that are definable by Boolean combinations of Σ_1 sentences.
For certain logics, like first-order logic over the descendant relation, or first-order logic over successor, one can write a sentence that says "this forest is a tree", and thus there is no need to treat tree and forest languages separately.For piecewise testability, we need to do something more, since the set of all trees over a finite alphabet A is not definable by a Boolean combination of Σ 1 sentences over any of the predicates mentioned in this paper.
We define a tree piecewise testable language over a finite alphabet A to be the intersection of a piecewise testable forest language with the set of all trees over A. In other words, these are the languages definable by a Boolean combination of Σ_1(<, <_dfs) formulas when we interpret these formulas in trees. This is preferable to defining a piecewise testable tree language to be a tree language that is piecewise testable (as a forest language), since the latter definition would only define tree languages that are either finite or contain only chains (no branching). Moreover, it would not correspond to the tree languages definable by a Boolean combination of Σ_1(<, <_dfs) formulas. The cases when the pieces are assumed to be commutative and/or take into account the closest common ancestor are defined analogously.
We will obtain our decidability result by a general method for translating algebraic characterizations of classes of forest languages to characterizations of the corresponding classes of tree languages.This method will apply to all the cases we considered earlier: piecewise testable languages, cca-piecewise testable languages, and their commutative counterparts.
First, suppose α : A^Δ → (H, V) is a surjective forest algebra morphism. Recall that we denote by H_A the set of all forests over A. Based on α, we define an equivalence relation on H_A: we write s ∼ t if for all contexts p such that ps and pt are both trees (this happens if p is a tree-context, or if p is the empty context and both t and s are trees) we have α(ps) = α(pt). Notice that if s and t are such that α(s) = α(t) then s ∼ t, and that if s and t are both trees then s ∼ t implies α(s) = α(t) (take p = □ in the definition of ∼). It is clear that if s ∼ t then for any context q, qs ∼ qt. Thus ∼ defines a forest algebra congruence on A^Δ. Let α′ be the projection morphism onto the quotient by this congruence. We call α′ the tree reduction of α. From the remark above it follows that if t and s are both trees then α′(s) = α′(t) implies α(s) = α(t).
Let ℱ be a family of forest languages over A. We say that a set F of surjective forest algebra morphisms with domain A^Δ characterizes ℱ if a forest language L belongs to ℱ if and only if L is recognized by some morphism in F. We will further assume that F is closed in the following sense: suppose α : A^Δ → (H_1, V_1) belongs to F, and β : (H_1, V_1) → (H_2, V_2) is a morphism onto a finite forest algebra. Then βα belongs to F.
Theorem 6.11. Let ℱ and F be as above, and let L ⊆ H_A be a set of trees. Then there is a forest language K ∈ ℱ such that L consists of all the trees in K if and only if the tree reduction of the syntactic morphism α_L of L belongs to F.
Proof. Let L be a tree language, let α_L be its syntactic morphism, and let α′_L : A^Δ → (H′_L, V′_L) be its tree reduction.
Assume first that there is a forest language K ∈ ℱ such that L consists of all the trees in K. Let α_K : A^Δ → (H_K, V_K) be the syntactic morphism of K. By definition, α_K ∈ F. Fix h ∈ H_K and let t, s be forests such that α_K(t) = h = α_K(s). We show that α′_L(s) = α′_L(t). Suppose this is not the case. Then there exists a context p such that ps and pt are both trees but α_L(ps) ≠ α_L(pt). By definition of α_L this means that there exists a context q such that qps ∈ L but qpt ∉ L (or vice versa, in which case the roles of s and t are exchanged). From qps ∈ L we know that qps is a tree; hence, as pt is a tree, qpt must also be a tree. By hypothesis this implies qps ∈ K but qpt ∉ K, contradicting α_K(t) = α_K(s).
Since V′_L acts faithfully on H′_L, it follows that for any contexts p and q, α_K(p) = α_K(q) implies α′_L(p) = α′_L(q). Thus α′_L = βα_K for some morphism β : (H_K, V_K) → (H′_L, V′_L). By the hypothesis on F this implies that α′_L ∈ F.
Conversely, suppose that α′_L belongs to F. Let X = α′_L(L) and set K = (α′_L)^{-1}(X). From the hypothesis it follows that K ∈ ℱ. Assume that t is a tree such that α′_L(t) ∈ X. By definition of X, there is a tree s ∈ L such that α′_L(s) = α′_L(t). But as α′_L is the tree reduction of α_L, α′_L(s) = α′_L(t) implies α_L(s) = α_L(t), and therefore t ∈ L. Hence L is the set of trees of K.
As a result we have:
Corollary 6.12. It is decidable if a regular tree language is tree (commutative) (cca-)piecewise testable.
Proof. We only give the proof for the piecewise testable case. The other cases are handled similarly.
Let ℱ be the family of piecewise testable forest languages over A, and let F be the family of morphisms from A^Δ onto finite forest algebras that satisfy the identities of Theorem 4.1. Notice that from Proposition 4.15 it follows that if α ∈ F then βα ∈ F for every onto morphism β. Hence ℱ and F satisfy the hypothesis of Theorem 6.11.
Consequently, a regular tree language L is tree piecewise testable if and only if the tree reduction of α_L belongs to F. It remains to show that we can effectively compute the image of the tree reduction given α_L. Consider h ∈ H_L and notice that all the forests in α_L^{-1}(h) agree on α′_L. Hence the procedure amounts to deciding which pairs of elements of the syntactic forest algebra are identified under the reduction, which we can do as long as we know which elements are images under α_L of trees. It is easy to see that if an element of H_L is the image of a tree, then it is the image of a tree of depth at most |V_L| in which each node has at most |H_L| children, so we can effectively decide this as well.
6.4. Horizontal order. We could also consider other natural predicates over forests. Recall, for instance, the horizontal-order <_h, where x <_h y expresses the fact that x is a sibling of y occurring strictly before y in the forest-order.
Correspondingly, we say that s is a horizontal-piece of t if there is an injective mapping from nodes of s to nodes of t that preserves the horizontal-order and the ancestor relationship. An equivalent definition is that the horizontal-piece relation is the reflexive transitive closure of the relation
{(pt, pat) : p is a context, a is a node, t is a forest or empty, and either t is empty or a does not have a sibling in pat}.
From this notion of horizontal-piece we derive the notion of horizontal-piecewise testability as expected, and the very same proofs as in Section 4 yield:
Proposition 6.13. A forest language is horizontal-piecewise testable iff it is definable by a Boolean combination of Σ_1(<_h, <_dfs) formulas.
Theorem 6.14. A forest language is horizontal-piecewise testable if and only if its syntactic algebra satisfies the identity
u^ω v = u^ω = v u^ω    (6.1)
for all u, v ∈ V_L such that v is a horizontal-piece of u.
This implies decidability of horizontal-piecewise testability. It would be interesting to find an equivalent set of identities that does not make use of the horizontal-piece relation, in the spirit of Proposition 4.15.
A straightforward adaptation of Section 5 would also give a decidable characterization of definability by a Boolean combination of Σ_1(<, <_h, ⊓) formulas.

Conclusion/discussion
Simon's theorem on J-trivial monoids has emerged as one of the fundamental results in the algebraic theory of automata on words. The principal contribution of the present paper has been to show that the use of forest algebras leads to a natural generalization of this theorem to trees and forests. In proving this generalization we have introduced a number of new techniques that we believe will prove useful in the continuing development of the algebraic theory of tree automata.
Let us briefly indicate a few directions for further research. There is a purely algebraic formulation of Simon's theorem, stating that every finite J-trivial monoid M is the quotient of a finite monoid N that admits a partial order compatible with the multiplication in N and in which the identity is the maximum element. Our new results have a similar formulation: every finite forest algebra satisfying the identities of Section 4 is the quotient of an algebra that admits compatible partial orders on both its horizontal and vertical components. In fact, Straubing and Thérien [17] have proved this order property of finite J-trivial monoids directly, yielding a quite different proof of Simon's theorem. It would be interesting to know whether such an argument is also possible for forest algebras.
In the word case, the Boolean combinations of Σ_1-definable languages form the first level of a hierarchy whose union is the first-order definable languages. Little is known about the higher levels of this hierarchy, apart from the fact that it is strict. Indeed, the problem of effectively characterizing the languages definable by Boolean combinations of Σ_2 sentences has been open for many years. In contrast, the first-order definable languages themselves constitute one of the first classes for which an effective algebraic characterization was given: these are exactly the languages whose syntactic monoids are aperiodic (McNaughton and Papert [11]). The corresponding problem for trees and forests, however, remains open: we possess non-effective algebraic characterizations for the forest languages definable by first-order sentences over the ancestor relation, and for the related subclasses CTL and CTL* (see Bojańczyk et al. [5]), but the problem of finding effective tests for membership of a language in any of these classes remains one of the greatest challenges in this work.

Figure 1: The identity u^ω = u^ω v, where v is a piece of u. The gray nodes are from v.

Lemma 4.10. Let k ∈ N. For n sufficiently large, pat ∼_n pt entails the existence of a fractal of length k inside pat.

Figure 3: The identity ω(vuh) = ω(vuh) + vh, with the white nodes belonging to u.
4.3. An equivalent set of identities. In this section, we rephrase the identities used in Theorem 4.1. There are two reasons to rephrase the identities.
the statement of the lemma follows if α assigns the same type to the two restrictions s[X_k] and s[X_k \ {x_k}]. Recall the definition of u_ijl and w_ijl above. The type of the forest s[X_k] can be decomposed as