A Trichotomy for Regular Trail Queries

Regular path queries (RPQs) are an essential component of graph query languages. Such queries consider a regular expression r and a directed edge-labeled graph G and search for paths in G for which the sequence of labels is in the language of r. In order to avoid having to consider infinitely many paths, some database engines restrict such paths to be trails, that is, they only consider paths without repeated edges. In this paper we consider the evaluation problem for RPQs under trail semantics, in the case where the expression is fixed. We show that, in this setting, there exists a trichotomy. More precisely, the complexity of RPQ evaluation divides the regular languages into the finite languages, the class Ttract (for which the problem is tractable), and the rest. Interestingly, the tractable class in this trichotomy is larger than the corresponding class in the trichotomy for simple paths, discovered by Bagan, Bonifati, and Groz [JCSS 2020]. In addition to this trichotomy result, we also study characterizations of the tractable class, its expressivity, the recognition problem, and closure properties, and we show how the decision problem can be extended to the enumeration problem, which is relevant to practice.


Introduction
Graph databases are a popular tool to model, store, and analyze data [Neo, Tig, Ora, Wik, DBp]. They are engineered to make the connectedness of data easier to analyze. This is indeed a desirable feature, since some of today's largest companies have become so successful because they understood how to use the connectedness of the data in their specific domain (e.g., Web search and social media). One aspect of graph databases is to bring tools for analyzing connectedness to the masses.
Regular path queries (RPQs) are a crucial component of graph databases, because they allow reasoning about arbitrarily long paths in the graph and, in particular, paths that are longer than the size of the query. A regular path query essentially consists of a regular expression r and is evaluated on a graph database which, for the purpose of this article, we view as an edge-labeled directed graph G. When evaluated, the RPQ r searches for paths in G for which the sequence of labels is in the language of r. The return type of the query varies: whereas most academic research on RPQs [MW95, Bar13, BLR11, LM13, ACP12] and SPARQL [W3C13] focus on the first and last node of matching paths, Cypher [Ope] returns the entire paths. G-Core, a recent proposal by partners from industry and academia, sees paths as "first-class citizens" in graph databases [AAB+18].
In addition, there is a large variation on which types of paths are considered. Popular options are all paths, simple paths, trails, and shortest paths. Here, simple paths are paths without repeated nodes and trails are paths without repeated edges. Academic research has focused mostly on all paths, but Cypher 9 [Ope, FGG+18], which is perhaps the most widespread graph database query language at the moment, uses trails. Since the trail semantics in graph databases has received virtually no attention from the research community yet, it is crucial that we improve our understanding.
In this article, we study the data complexity of RPQ evaluation under trail semantics. That is, we study variants of RPQ evaluation in which the RPQ r is considered to be fixed. As such, the input of the problem only consists of an edge-labeled (multi-)graph G and a pair (s, t) of nodes, and we are asked if there exists a trail from s to t on which the sequence of labels matches r. One of our main results is a trichotomy on the RPQs for which this problem is in AC^0, NL-complete, or NP-complete, respectively. By Ttract, we refer to the class of tractable languages (assuming NP ≠ NL).
In order to increase our understanding of Ttract, we study several important aspects of this class of languages. A first set of results is on characterizations of Ttract in terms of closure properties and syntactic and semantic conditions on their finite automata. In a second set of results, we therefore compare the expressiveness of Ttract with yardstick languages such as FO^2[<], FO^2[<, +1], FO[<] (or aperiodic languages), and SPtract. The latter class, SPtract, is the closely related class of languages for which the data complexity of RPQ evaluation under simple path semantics is tractable. Interestingly, Ttract is strictly larger than SPtract and includes languages outside SPtract such as a*bc* and (ab)* that are relevant in application scenarios in network problems, genomic datasets, and tracking provenance information of food products [PS] and were recently discovered to appear in public query logs [BMT17, BMT19]. Furthermore, every single-occurrence regular expression [BNSV10] is in Ttract, which can be a convenient guideline for users of graph databases, since single-occurrence (every alphabet symbol occurs at most once) is a very simple syntactical property. It is also popular in practice: we analyzed the 50 million RPQs found in the logs of [BMT18] and discovered that over 99.8% of the RPQs are single-occurrence regular expressions.
We then study the recognition problem for Ttract, that is: given an automaton, does its language belong to Ttract? This problem is NL-complete (resp., PSPACE-complete) if the input automaton is a DFA (resp., NFA). We also treat closure under common operations such as union, intersection, reversal, quotients, and morphisms.
We conclude by showing that also the enumeration problem is tractable for Ttract. By tractable, we mean that the paths that match the RPQ can be enumerated with only polynomial delay between answers. Technically, this means that we not only have to solve a decision variant of the RPQ evaluation problem, but also need to find witnessing paths. We prove that the algorithms for the decision problems can be extended to return shortest paths. This insight can be combined with Yen's Algorithm [Yen71] to give a polynomial delay enumeration algorithm.
Related Work. RPQs on graph databases have been studied since the end of the 80's and are now finding their way into commercial products. The literature usually considers the variant of RPQ evaluation where one is given a graph database G, nodes s, t, and an RPQ r, and then needs to decide if G has a path from s to t (possibly with loops) that matches r.
For arbitrary and shortest paths, this problem is well-known to be tractable, since it boils down to testing intersection emptiness of two NFAs. Mendelzon and Wood [MW95] studied the problem for simple paths, which are paths without node repetitions. They observed that the problem is already NP-complete for the regular expressions a*ba* and (aa)*. These two results rely heavily on the work of Fortune et al. [FHW80] and LaPaugh and Papadimitriou [LP84].
Our work is most closely related to the work of Bagan et al. [BBG20] who, like us, studied the complexity of RPQ evaluation where the RPQ is fixed. They proved a trichotomy for the case where the RPQ should only match simple paths. In this article we will refer to this class as SPtract, since it contains the languages for which the simple path problem is tractable, whereas we are interested in a class for trails. Martens and Trautner [MT19] refined this trichotomy of Bagan et al. [BBG20] for simple transitive expressions, by analyzing the complexity where the input consists of both the expression and the graph.
Paperman has integrated the classes SPtract and Ttract in his tool called Semigroup Online [Pap22]. The tool can process a regular expression as input and can tell the user whether the language is in SPtract, Ttract, and/or in many other important classes of languages.
Trails versus Simple Paths. We conclude with a note on the relationship between simple paths and trails. For many computational problems, the complexities of dealing with simple paths or trails are the same due to two simple reductions, namely: (1) constructing the line graph or (2) splitting each node into two; see for example Perl and Shiloach [PS78, Theorems 2.1 and 2.2]. As soon as we consider labeled graphs, the line-graph technique still works, but the node-splitting technique does not, because the labels on paths change. As a consequence, we know that finding trails is at most as hard as finding simple paths, but we do not know whether it has the same complexity when we require that they match a certain RPQ r.
In this article we show that the relationship is strict, assuming NL ≠ NP. An easy example is the language (ab)*, which is NP-hard for simple paths [LP84, MW95] but, assuming that a-labeled edges are different from b-labeled edges, in NL for trails. This is because every path from s to t that matches (ab)* can be reduced to a trail from s to t that matches (ab)* by removing loops (in the path, not in the graph) that match (ab)* or (ba)*. In Figure 1 we depict four small graphs, all of which have trails from s to t. (In the three rightmost graphs, there is exactly one path labeled (ab)*, which is also a trail.)

Outline. We note that this is a full version of the work presented in [MNT20]. In addition to adding the full proofs, we generalize our results to multigraphs throughout the article. In Section 2 we define our notation. Section 3 introduces the class Ttract, which contains exactly the regular languages for which finding a trail from s to t is in polynomial time (assuming P ≠ NP). We prove this dichotomy in Section 4. (The article is named trichotomy because we can also differentiate between finite and infinite languages: for the former, finding such a path is in AC^0, while it is NL-hard for the latter.) After giving some interesting closure properties of Ttract in Section 5 and extending the algorithm for languages in Ttract to an enumeration algorithm, we conclude our work in Section 7. The most complex part of this article is Section 3, where we give several equivalent definitions of Ttract. Some of these are needed for the proof of its tractability; others, like the syntactic definition given in Theorem 3.31, might be useful for database engineers; still others are used to compare Ttract to well-known classes such as FO or FO^2[<, +1].
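The loop-removal argument for (ab)* can be sketched in code. The following Python function is a sketch for illustration (the edge identifiers are assumptions, not notation from the article): it repeatedly cuts out the loop between two occurrences of the same edge. For a path whose labels alternate as in (ab)*, both occurrences of an edge carry the same label, so every removed loop has even length and matches (ab)* or (ba)*; the resulting trail therefore still matches (ab)*.

```python
def shorten_to_trail(path):
    # path: list of hashable edge identifiers; a trail repeats no edge.
    # Whenever an edge repeats, cut out the loop between its two
    # occurrences (everything after the first occurrence up to and
    # including the repeated occurrence).
    result, seen = [], {}  # seen: edge -> its index in result
    for e in path:
        if e in seen:
            i = seen[e]
            for dropped in result[i + 1:]:
                del seen[dropped]
            result = result[:i + 1]  # the loop after position i is removed
        else:
            seen[e] = len(result)
            result.append(e)
    return result
```

For instance, a path using edges e1 e2 e3 e2 e4 is shortened to the trail e1 e2 e4.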

Preliminaries
We use [n] to denote the set of integers {1, . . . , n}. By Σ we always denote a finite alphabet, i.e., a finite set of symbols. We always denote symbols by a, b, c, d and their variants, like a′, a1, b1, etc. The regular expressions we use in this article are defined as follows: ∅, ε, and every symbol in Σ is a regular expression. When r and s are regular expressions, then (rs), (r + s), (r?), (r*), and (r+) are also regular expressions. We use the usual precedence rules to omit parentheses. For n ∈ N, we use r^n to abbreviate the n-fold concatenation r···r of r. The language L(r) of a regular expression r is defined as usual. For readability, we often omit the L(·) and only write r for the language of r. A word is a finite sequence of symbols from Σ.

We consider edge-labeled directed multigraphs G = (V, E, E), where V is a finite set of nodes, E is a finite set of edges, and E : E → V × Σ × V is a function that maps each edge identifier to a tuple (v1, a, v2) describing the origin, the label, and the destination node of the edge. We denote v1 by origin(e), a by lab(e), and v2 by destination(e). We emphasize that E does not need to be injective, i.e., there might be several edges with identical origin, label, and destination. The size of G is defined as |V| + |E|. A (simple) graph is a multigraph where E is injective. A path p from node s to t is a sequence e1···em of edges such that origin(e1) = s, destination(em) = t, and for 1 ≤ i < m it holds that destination(ei) = origin(ei+1). By |p| we denote the number of edges of a path. A path is a trail if every edge e appears at most once, and a simple path if the nodes origin(e1), destination(e1), . . . , destination(em) are pairwise different. We note that each simple path is a trail but not vice versa. We denote lab(e1)···lab(em) by lab(p). Given a language L ⊆ Σ*, path p matches L if lab(p) ∈ L.
For a subset E′ ⊆ E, path p is E′-restricted if every edge of p is in E′. Given a trail p and two edges e1 and e2 in p, we denote the subpath of p from e1 to e2 by p[e1, e2].
We define an NFA A to be a tuple (Q, Σ, I, F, δ), where Q is the finite set of states; I ⊆ Q is the set of initial states; δ ⊆ Q × Σ × Q is the transition relation; and F ⊆ Q is the set of accepting states. Strongly connected components of (the graph of) A are simply called components. Unless noted otherwise, components will be non-trivial, i.e., contain at least one edge. We write C(q) to denote the strongly connected component of state q.
By δ(q, w) we denote the set of states reachable from state q by reading w. Given a path p, we also slightly abuse notation and write δ(q, p) instead of δ(q, lab(p)). We denote by q1 ⇝ q2 that state q2 is reachable from q1. Finally, L_q denotes the set of all words accepted from q, and L(A) = ∪_{q∈I} L_q is the set of words accepted by A. For every state q, we denote by Loop(q) the set {w ∈ Σ+ | δ_L(q, w) = q} of all non-empty words that allow to loop on q. For a word w and a language L, we define wL = {ww′ | w′ ∈ L} and w⁻¹L = {w′ | ww′ ∈ L}.
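The extension of δ from symbols to words can be made concrete as follows (a Python sketch for illustration; the dictionary encoding of δ is an assumption, not notation from the article):

```python
def delta_hat(delta, states, word):
    # delta: dict mapping (state, symbol) -> set of successor states.
    # Returns the set of states reachable from `states` by reading `word`.
    for a in word:
        states = set().union(*(delta.get((q, a), set()) for q in states))
    return states
```

With this encoding, Loop(q) consists exactly of the non-empty words w with q ∈ delta_hat(delta, {q}, w).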

A DFA is an NFA such that I is a singleton and for all q ∈ Q and σ ∈ Σ: |δ(q, σ)| ≤ 1. Let L be a regular language. We denote by A_L = (Q_L, Σ, i_L, F_L, δ_L) the (complete) minimal DFA for L and by N the number |Q_L| of states. For q0 ∈ Q, we say that a run from q0 of A on w = a1···an is a sequence q0 → ··· → qn of states such that qi ∈ δ(qi−1, ai) for every i ∈ {1, . . . , n}. When A is a DFA and q0 its initial state, we also simply call it the run of A on w. The product of multigraph G = (V, E, E) and NFA A = (Q, Σ, I, F, δ) is the multigraph G × A = (V × Q, E′, E′) with E′ = {(e, (q1, q2)) | e ∈ E, (q1, lab(e), q2) ∈ δ} and E′((e, (q1, q2))) = ((origin(e), q1), lab(e), (destination(e), q2)).
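The product construction can be sketched in code as follows (a Python sketch for illustration; the dictionary encodings of E and δ are assumptions, not notation from the article):

```python
def product(graph_edges, nfa_delta):
    # graph_edges: dict e -> (origin, label, destination), i.e., the map E.
    # nfa_delta: set of NFA transitions (q1, label, q2).
    # Returns the edge map E' of the product multigraph: the edge
    # (e, (q1, q2)) goes from (origin(e), q1) to (destination(e), q2).
    prod = {}
    for e, (o, a, d) in graph_edges.items():
        for (q1, b, q2) in nfa_delta:
            if a == b:
                prod[(e, (q1, q2))] = ((o, q1), a, (d, q2))
    return prod
```

Each graph edge is paired with every NFA transition carrying the same label, so paths in the product correspond to pairs of a path in G and a run of A over its label sequence.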
A language L is aperiodic if and only if δ_L(q, w^{N+1}) = δ_L(q, w^N) for every state q and word w. Equivalently, L is aperiodic if and only if its minimal DFA does not have simple cycles labeled w^k for k > 1 and w ≠ ε. Thus, for "large enough" n, that is, for all n ≥ N, we have: u w^n v ∈ L if and only if u w^{n+1} v ∈ L, for all words u, v, w. So, a language like (aa)* is not aperiodic (take w = a and k = 2), but (ab)* is. (There are many characterizations of aperiodic languages [Sch65].)

We study the regular trail query (RTQ) problem for a regular language L.

RTQ(L)
Given: A (multi-)graph G = (V, E, E) and (s, t) ∈ V × V.
Question: Is there a trail from s to t that matches L?
A similar problem, which was studied by Bagan et al. [BBG20], is the RSPQ problem. The RSPQ(L) problem asks if there exists a simple path from s to t that matches L.

The Tractable Class
In this section, we define and characterize a class of languages which we will prove to be exactly the class of regular languages L for which RTQ(L) is tractable (if NL ≠ NP).
3.1. Warm-Up: Downward Closed Languages. It is instructive to first discuss the case of downward closed languages. A language L is downward closed (DC) if it is closed under taking subsequences. That is, for every word w = a1···an ∈ L and every sequence of positions 1 ≤ i1 < ··· < ik ≤ n, the word a_{i1}···a_{ik} is also in L. Perhaps surprisingly, downward closed languages are always regular [Hai69]. Furthermore, they can be defined by a clean class of regular expressions (which was shown by Jullien [Jul69] and later rediscovered by Abdulla et al. [ACBJ04]), which is defined as follows.
Definition 3.1. An atomic expression over Σ is an expression of the form (a + ε) or of the form (a1 + ··· + an)*, where a, a1, . . . , an ∈ Σ. A product is a (possibly empty) concatenation e1···en of atomic expressions e1, . . . , en. A simple regular expression is of the form p1 + ··· + pn, where p1, . . . , pn are products.
Another characterization is by Mendelzon and Wood [MW95], who show that a regular language L is downward closed if and only if its minimal DFA A_L = (Q_L, Σ, i_L, F_L, δ_L) exhibits the suffix language containment property, which says that if δ_L(q1, a) = q2 for some symbol a ∈ Σ, then we have L_{q2} ⊆ L_{q1}. (Mendelzon and Wood restrict q1, q2 to be on paths from i_L to some state in F_L, but the property trivially holds when q2 is a sink state.) Since this property is transitive, it is equivalent to require that L_{q2} ⊆ L_{q1} for every state q2 that is reachable from q1.

Theorem 3.2 [ACBJ04, Hai69, Jul69, MW95]. The following are equivalent:
(1) L is a downward closed language.
(2) L is definable by a simple regular expression.
(3) The minimal DFA of L exhibits the suffix language containment property.
Obviously, RTQ(L) is tractable for every downward closed language L, since it is equivalent to deciding if there exists a path from s to t that matches L. For the same reason, deciding if there is a simple path from s to t that matches L is also tractable for downward closed languages. However, there are languages that are not downward closed for which we show RTQ(L) to be tractable, such as a*bc* and (ab)*. For these two languages, the simple path variant of the problem is intractable.
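For a downward closed language L, RTQ(L) thus reduces to plain reachability in the product of G with an NFA for L, which a breadth-first search decides in linear time. A minimal sketch (the input encodings are assumptions for illustration):

```python
from collections import deque

def rtq_downward_closed(graph_edges, nfa_delta, initial, final, s, t):
    # graph_edges: dict e -> (origin, label, destination).
    # nfa_delta: set of NFA transitions (q1, label, q2).
    # initial, final: sets of NFA states.
    # For downward closed L, some trail from s to t matches L iff some
    # path does, so reachability in the product graph suffices.
    adj = {}
    for _, (o, a, d) in graph_edges.items():
        for (q1, b, q2) in nfa_delta:
            if a == b:
                adj.setdefault((o, q1), []).append((d, q2))
    queue = deque((s, q) for q in initial)
    seen = set(queue)
    while queue:
        v, q = queue.popleft()
        if v == t and q in final:
            return True
        for nxt in adj.get((v, q), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

For instance, with an NFA for the downward closed language a*b* and a graph containing a path labeled ab from s to t, the search succeeds.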
3.2. Main Definitions and Equivalence. The following definitions are the basis of the class of languages for which RTQ(L) is tractable.

Definition 3.3. An NFA A satisfies the left-synchronized containment property if there exists an n ∈ N such that the following implication holds for all q1, q2 ∈ Q and a ∈ Σ: if q1 ⇝ q2 and there are w1 ∈ Loop(q1) and w2 ∈ Loop(q2) with w1 = aw′1 and w2 = aw′2, then w2^n L_{q2} ⊆ L_{q1}. Similarly, A satisfies the right-synchronized containment property if the same condition holds with w1 = w′1a and w2 = w′2a.
We illustrate this definition in Figure 2. We note that the minimal DFA of any downward closed language satisfies the left-synchronized containment property.

Figure 2: Example illustrating Definition 3.3. The left NFA does not satisfy the left-synchronized containment property as (ac)*L_{q6} ∩ L_{q1} = ∅. The right NFA satisfies the left-synchronized containment property with n = 2, as (ac)^2 L_{q13} ⊆ L_{q7} and (ca)^2 L_{q12} ⊆ L_{q9}.
The left-synchronizing length of an NFA A is the smallest value n such that the implication in Definition 3.3 for the left-synchronized containment property holds.We define the right-synchronizing length analogously.
Observation 3.4. Let n0 be the left-synchronizing length of an NFA A. Then the implication of Definition 3.3 is satisfied for every n ≥ n0. The reason is that w2 ∈ Loop(q2).

Definition 3.5. A regular language L is closed under left-synchronized power abbreviations (resp., closed under right-synchronized power abbreviations) if there exists an n ∈ N such that for all words wℓ, wm, wr ∈ Σ* and all words w1 = aw′1 and w2 = aw′2 (resp., w1 = w′1a and w2 = w′2a) we have that wℓ w1^n wm w2^n wr ∈ L implies wℓ w1^n w2^n wr ∈ L.
We note that Definition 3.5 is equivalent to requiring that there exists an n ∈ N such that the implication holds for all i ≥ n. The reason is that, given i > n and a word of the form wℓ w1^i wm w2^i wr, we can write it as w′ℓ w1^n wm w2^n w′r with w′ℓ = wℓ w1^{i−n} and w′r = w2^{i−n} wr, for which the implication holds by Definition 3.5.

Lemma 3.6. Consider a minimal DFA A_L = (Q_L, Σ, i_L, F_L, δ_L) with N states. Then the following is true: (1) If A_L satisfies the left-synchronized containment property, then the left-synchronizing length is at most N. (2) If A_L satisfies the right-synchronized containment property, then the right-synchronizing length is at most N.
Proof. We only prove (1); (2) is symmetric. By Definition 3.3, there exists an n ∈ N such that: if q1, q2 ∈ Q_A and a ∈ Σ are such that q1 ⇝ q2 and if w1 ∈ Loop(q1), w2 ∈ Loop(q2) with w1 = aw′1 and w2 = aw′2, then w2^n L_{q2} ⊆ L_{q1}. If n > N, then there must be a loop in the w2^n part that generates multiples of w2. Applying the pigeonhole principle, there is an i < n for which w2^i L_{q2} ⊆ L_{q1} holds. By repetition, we obtain an i with i < N.
From Definition 3.3, Observation 3.4, and Lemma 3.6, we get the following corollary.

Corollary 3.7. If A_L satisfies the left- (resp., right-) synchronized containment property, then the implication of Definition 3.3 holds with n = N.

We say that A_L satisfies Property (P) if for all states q1, q2 ∈ Q_L and every word w such that q1 ⇝ q2 and w ∈ Loop(q1) ∩ Loop(q2), we have L_{q2} ⊆ L_{q1}.

Lemma 3.8. If A_L satisfies Property (P), then L is aperiodic.

Lemma 3.9. If A_L satisfies the left- or right-synchronized containment property, then L is aperiodic.

Proof. Let A_L satisfy the left- or right-synchronized containment property. We show that L satisfies Property (P). This proves the lemma, since all languages satisfying Property (P) are aperiodic, see Lemma 3.8. Let q1, q2 ∈ Q_L and w satisfy q1 ⇝ q2 and w ∈ Loop(q1) ∩ Loop(q2). By Corollary 3.7 we then have that w^N L_{q2} ⊆ L_{q1}. Since w ∈ Loop(q1), we have that δ_L(q1, w^N) = q1, which in turn implies that L_{q2} ⊆ L_{q1}.

Lemma 3.10. If L is closed under left- or right-synchronized power abbreviations, then L is aperiodic.
Proof. Let L be closed under left- or right-synchronized power abbreviations and let i ∈ N be as in Definition 3.5. We show that A_L satisfies Property (P). The aperiodicity then follows from Lemma 3.8. Let q1, q2 ∈ Q_L and w satisfy q1 ⇝ q2 and w ∈ Loop(q1) ∩ Loop(q2). Let wℓ, wm ∈ Σ* be such that q1 = δ_L(i_L, wℓ) and q2 = δ_L(q1, wm). Let wr ∈ L_{q2}. Then wℓ w* wm w* wr ⊆ L by construction. In particular, wℓ w^i wm w^i wr ∈ L and, by Definition 3.5, also wℓ w^i w^i wr ∈ L. Since δ_L(i_L, wℓ w^i w^i) = q1, this means that wr ∈ L_{q1}. Therefore, L_{q2} ⊆ L_{q1}.

Next, we show that all conditions defined in Definitions 3.3 and 3.5 are equivalent for DFAs.
Theorem 3.11. For a regular language L with minimal DFA A_L, the following are equivalent:
(1) A_L satisfies the left-synchronized containment property.
(2) A L satisfies the right-synchronized containment property.
(3) L is closed under left-synchronized power abbreviations.
(4) L is closed under right-synchronized power abbreviations.

Proof. (1) ⇒ (3): Let A_L satisfy the left-synchronized containment property. We will show that if there exists a word wℓ w1^i wm w2^i wr ∈ L with i = N + N² and w1 and w2 starting with the same letter, then wℓ w1^i w2^i wr ∈ L. To this end, let wℓ w1^i wm w2^i wr ∈ L. Due to the pumping lemma, there are states q1, q2 and integers h, j, k, ℓ, m, n ≤ N with j, m ≥ 1 satisfying q1 = δ_L(i_L, wℓ w1^h), w1^j ∈ Loop(q1), q2 = δ_L(q1, w1^k wm w2^ℓ), w2^m ∈ Loop(q2), and w2^n wr ∈ L_{q2}. This implies that wℓ w1^h (w1^j)* w1^k wm w2^ℓ (w2^m)* w2^n wr ⊆ L. Since A_L satisfies the left-synchronized containment property and by Corollary 3.7, we have (w2^m)^N L_{q2} ⊆ L_{q1} and therefore wℓ w1^h (w1^j)* (w2^m)^N w2^n wr ⊆ L. Now we use that L is aperiodic, see Lemma 3.9: the exponents of w1 and w2 in this expression can be adjusted to any sufficiently large values. And finally, we use that i = N + N² and h, j, m, n ≤ N to obtain wℓ w1^i w2^i wr ∈ L.
(3) ⇒ (4): Let L be closed under left-synchronized power abbreviations and let j ∈ N be the maximum of |A_L| and n + 1, where n is from Definition 3.5. We will show that if wℓ (w1a)^j wm (w2a)^j wr ∈ L, then wℓ (w1a)^j (w2a)^j wr ∈ L. If wℓ (w1a)^j wm (w2a)^j wr ∈ L, then we also have wℓ (w1a)^j wm (w2a)^{j+1} wr ∈ L, since L is aperiodic, see Lemma 3.10, and j ≥ |A_L|. This can be rewritten as (wℓ w1) (aw1)^{j−1} (a wm w2) (aw2)^{j−1} ((aw2) a wr) ∈ L. As L is closed under left-synchronized power abbreviations, and n < j, this implies (wℓ w1) (aw1)^{j−1} (aw2)^{j−1} ((aw2) a wr) ∈ L. This can be rewritten into wℓ (w1a)^j (w2a)^j wr ∈ L.
(4) ⇒ (2): Let L be closed under right-synchronized power abbreviations. We will prove that A_L satisfies the right-synchronized containment property, that is: if there are two states q1, q2 in A_L with q1 ⇝ q2 and w1 ∈ Loop(q1), w2 ∈ Loop(q2) such that w1 and w2 end with the same letter, then w2^N L_{q2} ⊆ L_{q1}. Let q1, q2 be such states. Then there exist wℓ, wm with q1 = δ_L(i_L, wℓ) and q2 = δ_L(q1, wm). If L_{q2} = ∅, we are done. So let us assume there is a word wr ∈ L_{q2}. We define w′r = w2^N wr. Due to the construction, we have wℓ w1* wm w2* w′r ⊆ L. Since L is closed under right-synchronized power abbreviations, there is an i ∈ N such that wℓ w1^i w2^i w′r ∈ L. Since we have a deterministic automaton and q1 = δ_L(i_L, wℓ w1^i), this implies that w2^i w′r = w2^i w2^N wr ∈ L_{q1}. We now use that L is aperiodic, due to Lemma 3.10, to infer that w2^N wr ∈ L_{q1}.
(2) ⇒ (1): Let A_L satisfy the right-synchronized containment property. We will show that if there exist states q1, q2 ∈ Q_L and words w1, w2 ∈ Σ* with aw1 ∈ Loop(q1), aw2 ∈ Loop(q2), and q1 ⇝ q2, then (aw2)^N L_{q2} ⊆ L_{q1}. Let q1, q2 be such states and w1, w2 as above. We define q′1 = δ_L(q1, a) and q′2 = δ_L(q2, a). Since A_L is deterministic, the construction implies that w1a ∈ Loop(q′1) and w2a ∈ Loop(q′2). Furthermore, it holds that (i) q′1 ⇝ q′2 and (ii) w2 L_{q2} ⊆ L_{q′2}. With this, the right-synchronized containment property and Corollary 3.7 yield (w2a)^N L_{q′2} ⊆ L_{q′1}, and hence (aw2)^{N+1} L_{q2} = a (w2a)^N w2 L_{q2} ⊆ a L_{q′1} ⊆ L_{q1}. By aperiodicity (Lemma 3.9), this gives (aw2)^N L_{q2} ⊆ L_{q1}.

Corollary 3.12. If a regular language L satisfies Definition 3.5 and N = |A_L|, then for all i > N² + N, for all words wℓ, wm, wr ∈ Σ*, and all words w1 = aw′1 and w2 = aw′2, we have that wℓ w1^i wm w2^i wr ∈ L implies wℓ w1^i w2^i wr ∈ L.

Proof. This immediately follows from the proof of (1) ⇒ (3).
In Theorem 4.1 we will show that, if NL ≠ NP, the languages L that satisfy the above properties are precisely those for which RTQ(L) is tractable. To simplify terminology, we will henceforth refer to this class as Ttract.

Definition 3.13. A regular language L belongs to Ttract if L satisfies one of the equivalent conditions in Theorem 3.11.
For example, (ab)* and (abc)* are in Ttract, whereas a*ba*, (aa)*, and (aba)* are not. The following property immediately follows from the definition of Ttract.

Observation 3.14. Every regular expression for which each alphabet symbol under a Kleene star occurs at most once in the expression defines a language in Ttract.
A special case of these expressions are those in which every alphabet symbol occurs at most once. These are known as single-occurrence regular expressions (SOREs) [BNSV10]. SOREs were studied in the context of learning schema languages for XML [BNSV10], since they occur very often in practical schema languages.
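Checking the single-occurrence property is straightforward; a Python sketch for illustration (treating every alphanumeric character of the expression string as an alphabet symbol is an assumption of this sketch):

```python
def is_single_occurrence(expr):
    # A single-occurrence regular expression (SORE) uses every alphabet
    # symbol at most once; operators and parentheses may repeat freely.
    symbols = [c for c in expr if c.isalnum()]
    return len(symbols) == len(set(symbols))
```

This matches the examples above: a*bc* and (ab)* are SOREs (and in Ttract), while a*ba* and (aa)* are not single-occurrence.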

3.3. The Inner Structure of Minimal DFAs in Ttract. The components of minimal DFAs of languages in Ttract have a very special form. The insights provided in this section are used in Section 4 to show trichotomy results for Ttract, and in Section 3.4 to give a syntactic characterization of languages in Ttract.
Lemma 3.15. Let L ∈ Ttract, let a ∈ Σ, and let C be a component of A_L containing states q1, q2 such that there exist w1a ∈ Loop(q1) and w2a ∈ Loop(q2). Then, for every σ ∈ Σ with δ_L(q1, σ) ∈ C, we also have δ_L(q2, σ) ∈ C.

Proof. Let q1 ≠ q2 be two states in C. Let σ satisfy δ_L(q1, σ) ∈ C and let w ∈ Loop(q1) ∩ σΣ*a. Such a w exists since δ_L(q1, σ) ∈ C and δ_L(q1, w1a) = q1. Let q3 = δ_L(q2, w^N). We will prove that q1 = q3, which implies that δ_L(q2, σ) ∈ C. As L is aperiodic, w ∈ Loop(q3). Consequently, there is an n ∈ N such that w^n L_{q3} ⊆ L_{q1} by Definition 3.3. Since w ∈ Loop(q1), this also implies L_{q3} ⊆ L_{q1}. Furthermore, q2 has a loop ending with a and A_L satisfies the right-synchronized containment property, so w^N L_{q1} ⊆ L_{q2} by Corollary 3.7. Hence, L_{q1} ⊆ (w^N)⁻¹L_{q2} and, by definition of q3, we have (w^N)⁻¹L_{q2} = L_{q3}. So we showed L_{q3} ⊆ L_{q1} and L_{q1} ⊆ L_{q3} which, by minimality of A_L, implies q1 = q3. The following is a direct consequence thereof.
Corollary 3.16. Let L ∈ Ttract, a ∈ Σ, C be a component of A_L, and q1, q2 ∈ C. If there exist w1a ∈ Loop(q1) and w2a ∈ Loop(q2), then δ_L(q1, w) ∈ C if and only if δ_L(q2, w) ∈ C, for all words w ∈ Σ*.

Lemma 3.17. Let A_L satisfy the left-synchronized containment property. If states q1 and q2 belong to the same component of A_L and Loop(q1) ∩ Loop(q2) ≠ ∅, then q1 = q2.

Proof. Let q1, q2 be as stated and let w be a word in Loop(q1) ∩ Loop(q2). According to Definition 3.3, there exists an n ∈ N such that w^n L_{q2} ⊆ L_{q1}. Since w ∈ Loop(q1), this implies that L_{q2} ⊆ L_{q1}. By symmetry, we have L_{q2} = L_{q1}, which implies q1 = q2, since A_L is the minimal DFA.
Building on this, we obtain the following synchronization property for A_L.
Lemma 3.18. Let L ∈ Ttract, let C be a component of A_L, let q1, q2 ∈ C, and let w = a1···a_{N²} be a word of length N² such that δ_L(q1, w) ∈ C and δ_L(q2, w) ∈ C. Then δ_L(q1, w) = δ_L(q2, w).

Proof. For α ∈ {1, 2}, let q_{α,0} = q_α and, for each i from 1 to N², let q_{α,i} = δ_L(q_α, a1···ai). Since there are at most N² distinct pairs (q_{1,i}, q_{2,i}), there exist i, j with 0 ≤ i < j ≤ N² such that q_{1,i} = q_{1,j} and q_{2,i} = q_{2,j}. Since the word a_{i+1}···aj then lies in Loop(q_{1,i}) ∩ Loop(q_{2,i}) and both states belong to C, Lemma 3.17 implies q_{1,i} = q_{2,i}, and hence δ_L(q1, w) = δ_L(q2, w).

Furthermore, we show that every language in Ttract satisfies an inclusion property which is stronger than indicated by Definition 3.3. That is, we show that it is not necessary to repeat some word w2 multiple times. Instead, we show that any word w that stays in a component, given that w is long enough and starts with a suitable symbol, already implies an inclusion property.
Lemma 3.19. Let L ∈ Ttract, a ∈ Σ, and let q1, q2 be two states such that q1 ⇝ q2 and Loop(q1) ∩ aΣ* ≠ ∅. Let C be the component of A_L that contains q2. Then L_{q2} ∩ L^a_{q2}Σ* ⊆ L_{q1}, where L^a_{q2} is the set of words w of length N² that start with a and such that δ_L(q2, w) ∈ C.

Proof. If Loop(q2) = ∅, then L_{q2} ∩ L^a_{q2}Σ* = ∅ and the inclusion trivially holds. Therefore we assume from now on that Loop(q2) ≠ ∅. Since the proof of this lemma requires a number of different states and words, we provide a sketch in Figure 3. Let w ∈ L_{q2} ∩ L^a_{q2}Σ*, let u = a1···a_{N²} be the prefix of w of length N², and let w′ be the suffix of w such that w = uw′. Since q2 and δ_L(q2, u) are both in the same component C, there exists a word v with uv ∈ Loop(q2). Corollary 3.7 implies that

(uv)^N L_{q2} ⊆ L_{q1}.    (3.1)

Let q3 = δ_L(q1, (uv)^N). Due to aperiodicity we have uv ∈ Loop(q3). Since A_L is deterministic, this implies L_{q3} = ((uv)^N)⁻¹L_{q1} and, together with Equation (3.1), that

L_{q2} ⊆ L_{q3}.    (3.2)

We now show that there is a prefix u1 of u such that δ_L(q1, u1) = q and δ_L(q3, u1) = q′ with Loop(q) ∩ Loop(q′) ≠ ∅. Let q_{α,0} = q_α and, for each i from 1 to N² and α ∈ {1, 3}, let q_{α,i} = δ_L(q_α, a1···ai). Since there are at most N² distinct pairs (q_{1,i}, q_{3,i}), there exist i, j with 0 ≤ i < j ≤ N² such that q_{1,i} = q_{1,j} and q_{3,i} = q_{3,j}. Let u1 = a1···ai and u2 = a_{i+1}···aj. We have u2 ∈ Loop(q_{1,i}) ∩ Loop(q_{3,i}). We define q = δ_L(q1, u1) and q′ = δ_L(q3, u1). Since q ⇝ q′ and u2 ∈ Loop(q) ∩ Loop(q′), Corollary 3.7 implies u2^N L_{q′} ⊆ L_q. Since u2 ∈ Loop(q), we also have that L_{q′} ⊆ L_q. By definition of q and q′ and the determinism of A_L, this yields L_{q3} ∩ u1Σ* ⊆ L_{q1}. Since u1 is a prefix of u, and by Equation (3.2), we also have L_{q2} ∩ uΣ* ⊆ L_{q1}. This implies that w ∈ L_{q1}, which concludes the proof.

3.4. A Syntactic Characterization. The goal of this section is to give a better understanding of languages in Ttract. We provide a syntactic definition, which allows to construct languages in Ttract. More precisely, we will show that every language of a "memoryless component" is in Ttract, and if memoryless components are connected with "consistent jumps", then the language is again in Ttract. We show that all languages in Ttract can be constructed in this way. Using this modular principle, systems with graphical search queries could enable users to "click together" a language in Ttract. Note that this section is quite technical and detached from the rest of the article, so it can be skipped.
As we have seen before, regular expressions in which every symbol occurs at most once define languages in Ttract. We will define a similar notion on automata.

Definition 3.20. A component C of some NFA A is called memoryless if, for each symbol a ∈ Σ, there is at most one state q in C such that there is a transition (p, a, q) with p in C.
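Memorylessness of a component is easy to test (a Python sketch for illustration; the tuple encoding of the transitions is an assumption, not notation from the article):

```python
def is_memoryless(component, transitions):
    # component: set of states of one strongly connected component.
    # transitions: set of NFA transitions (p, a, q).
    # Memoryless: for each symbol a there is at most one state q in the
    # component with an incoming a-transition from inside the component.
    targets = {}
    for (p, a, q) in transitions:
        if p in component and q in component:
            targets.setdefault(a, set()).add(q)
    return all(len(qs) <= 1 for qs in targets.values())
```

For instance, the two-state cycle reading (ab)* is memoryless, while the two-state cycle reading (aa)* is not, since both of its states have incoming a-transitions from within the component.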
In this section, we will prove the following theorem, which provides (in a non-trivial proof that requires several steps) a syntactic condition for languages in Ttract. The syntactic condition is item (4) of the theorem, which we define after its statement. Condition (5) imposes an additional restriction on condition (4).
Theorem 3.21. For a regular language L, the following properties are equivalent:
(1) L ∈ Ttract.
(2) There exists an NFA A for L that satisfies the left-synchronized containment property.
(3) There exists an NFA A for L that satisfies the left-synchronized containment property and only has memoryless components.
(4) There exists a detainment automaton for L with consistent jumps.
(5) There exists a detainment automaton for L with consistent jumps and only memoryless components.
To define detainment automata, we use finite automata with counters, or CNFAs, from Gelade et al. [GGM12], which we slightly adapt to make the construction easier. We introduce a minor difference, namely that counters count down instead of up, since this makes our construction easier to describe. Furthermore, since our construction only requires a single counter, zero tests, and setting the counter to a certain value, we immediately simplify the definition to take this into account.
Let c be a counter variable, taking values in N. A guard on c is a statement γ of the form true or c = 0. We denote by c |= γ that c satisfies the guard γ. In the case where γ is true, this is trivially fulfilled and, in the case where γ is c = 0, it is fulfilled if c equals 0. By G we denote the set of guards on c. An update on c is a statement of the form c := c − 1, c := c, or c := k for some constant k ∈ N. By U we denote the set of updates on c.

Definition 3.22. A nondeterministic counter automaton (CNFA) with a single counter is a 6-tuple A = (Q, I, c, δ, F, τ ), where Q is the finite set of states; I ⊆ Q is the set of initial states; c is the counter; δ ⊆ Q × Σ × G × Q × U is the transition relation; and F ⊆ Q is the set of accepting states. Furthermore, τ ∈ N is a constant such that every update of the form c := k has k ≤ τ .
Intuitively, A can make a transition (q, a, γ; q ′ , π) whenever it is in state q, reads a, and c |= γ, i.e., guard γ is true under the current value of c. It then updates c according to the update π, in a way we explain next, and moves into state q ′ . To explain the update mechanism formally, we introduce the notion of configuration. A configuration is a pair (q, ℓ), where q ∈ Q is the current state and ℓ ∈ N is the value of c. An initial configuration is (q 0 , 0) with q 0 ∈ I. A configuration (q, ℓ) is accepting if q ∈ F and ℓ = 0. A configuration α ′ = (q ′ , ℓ ′ ) immediately follows a configuration α = (q, ℓ) by reading a ∈ Σ, denoted α → a α ′ , if there exists (q, a, γ; q ′ , π) ∈ δ with c |= γ and ℓ ′ = π(ℓ).
For a string w = a 1 • • • a n and two configurations α and α ′ , we denote by α ⇒ w α ′ that α → a 1 • • • → a n α ′ . A configuration α is reachable if there exists a string w such that α 0 ⇒ w α for some initial configuration α 0 . A string w is accepted by A if α 0 ⇒ w α f , where α 0 is an initial configuration and α f is an accepting configuration. We denote by L(A) the set of strings accepted by A.
It is easy to see that CNFAs accept precisely the regular languages. (Due to the value τ , counters are always bounded by a constant.) Let A be a CNFA with one counter c. Initially, the counter has value 0. The automaton has transitions of the form (q 1 , a, P ; q 2 , U ), where P is a precondition on c and U an update operation on c. For instance, the transition (q 1 , a, c = 5; q 2 , c := c − 1) means: if A is in state q 1 , reads a, and the value of c is five, then it can move to q 2 and decrease c by one. If we decrease a counter with value zero, its value remains zero. We denote the precondition that is always fulfilled by true.
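The configuration semantics above can be prototyped directly. The following sketch (our own encoding; the guard and update names are assumptions, not notation from the paper) tracks the set of reachable configurations (q, ℓ) of a single-counter CNFA, with the convention that decrementing a zero counter keeps it at zero.

```python
def cnfa_accepts(transitions, initial, final, word):
    """transitions: list of (q, a, guard, q2, update); guard is 'true' or 'zero';
    update is ('dec',), ('keep',) or ('set', k).  A configuration is a pair
    (state, counter value); acceptance needs a final state with counter 0."""
    configs = {(q, 0) for q in initial}           # initial configurations (q0, 0)
    for a in word:
        nxt = set()
        for (q, l) in configs:
            for (p, b, guard, p2, upd) in transitions:
                if p != q or b != a:
                    continue
                if guard == 'zero' and l != 0:    # guard c = 0 not satisfied
                    continue
                if upd[0] == 'dec':
                    nxt.add((p2, max(l - 1, 0)))  # decrementing 0 keeps 0
                elif upd[0] == 'keep':
                    nxt.add((p2, l))
                else:                             # ('set', k)
                    nxt.add((p2, upd[1]))
        configs = nxt
    return any(q in final and l == 0 for (q, l) in configs)
```

As a usage example, the transitions ('s', a, true; 't', c := 2), ('t', a, true; 't', c := c − 1), and ('t', b, c = 0; 'f', c := 0) model a detainment-style component that must consume at least three a's before it may leave on b.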
We say that A is a detainment automaton if, for every component C of A:
• every transition inside C is of the form (q 1 , a, true; q 2 , c := c − 1);
• every transition that leaves C is of the form (q 1 , a, c = 0; q 2 , c := k) for some k ∈ N.
Intuitively, if a detainment automaton enters a non-trivial component C, then it must stay there for at least some number of steps, depending on the value of the counter c. The counter c is decreased for every transition inside C and the automaton can only leave C once c = 0. We say that A has consistent jumps if, for every pair of components C 1 and C 2 , if C 1 ⇝ C 2 and there are transitions (p i , a, true; q i , c := c − 1) inside C i for all i ∈ {1, 2}, then there is also a transition (p 1 , a, P ; q 2 , U ) for some P ∈ {true, c = 0} and some update U . We illustrate this in Figure 4. We note that C 1 and C 2 may be the same component. The consistent jump property is the syntactic counterpart of the left-synchronized containment property. The memoryless condition carries over naturally to CNFAs, ignoring the counter.
(1) ⇒ (5) uses a very technical construction that essentially exploits that, if the automaton stays in the same component for a long time, the reached state only depends on the last N 2 symbols read in the component. This is formalized in Lemma 3.18 and allows us to merge any pair of states p, q that contradicts that some component is memoryless.
To preserve the language, words that stay in some component C for fewer than N 2 symbols have to be dealt with separately, essentially avoiding the component altogether. Finally, the left-synchronized containment property allows us to simply add the transitions required to satisfy the consistent jumps property without changing the language.
(5) ⇒ (3) and (4) ⇒ (2): We convert a given CNFA to an NFA by simulating the counter (which is bounded) in the set of states. The consistent jump property implies the left-synchronized containment property on the resulting NFA. The property that all components are memoryless is preserved by the construction.
(2) ⇒ (1): One can show that the left-synchronized containment property is invariant under the powerset construction.
The following lemma is the implication (1) ⇒ (5) from Theorem 3.21.

Lemma 3.23. If L ∈ T tract , then there exists a detainment automaton for L with consistent jumps and only memoryless components.
Proof. Let A L = (Q L , Σ, i L , F L , δ L ) be the minimal DFA for L. The proof goes as follows: First, we define a CNFA A with two counters. Second, we show that we can convert A to an equivalent CNFA A ′ with only one counter that is a detainment automaton with consistent jumps and only memoryless components. This conversion is done by simulating one of the counters using a bigger set of states. Last, we show that L(A) = L(A L ), which shows the lemma statement as L(A) = L(A ′ ).
Before we start, we need some additional notation. We write p 1 ↷ a q 2 to denote that C(p 1 ) ⇝ C(q 2 ) and there are states q 1 ∈ C(p 1 ) and p 2 ∈ C(q 2 ) such that (p i , a, q i ) ∈ δ L for i ∈ {1, 2}. For a state q, we write Σ ⟳ (q) to denote the set of symbols a such that there is a word w = aw ′ ∈ Loop(q). Let ∼ ⊆ Q L × Q L be the smallest equivalence relation over Q L that satisfies p ∼ q if C(p) = C(q) and Σ ⟳ (p) ∩ Σ ⟳ (q) ̸= ∅. For q ∈ Q L , we denote by [q] the equivalence class of q, and by [Q L ] the set of all equivalence classes. We also write [C] to denote the equivalence classes that only use states from some component C. We extend the notation C(q) to [Q L ], i.e., C([q]) = C(q) for all q ∈ Q L .
We will use the following observation that easily follows from Lemma 3.15 using the definition of ∼.
We define a CNFA A = (Q, I, c, d, δ, F, N 2 ) that has two counters c and d. The counter c is allowed to have any initial value from [0, N 2 ], while the counter d has initial value 0. We note that we will eliminate counter c when converting to a one-counter automaton, so this does not contradict the definition of CNFAs with one counter that we use.
We use the state set Q = Q L ∪ [Q L ], i.e., we can use the states from A L and the equivalence classes of the equivalence relation ∼. The latter will be used to ensure that components are memoryless, while the former will only be used in trivial components. We call a component C a long run component if the run stays in C for more than N 2 symbols. All other components are short run components.
For short run components, we use states from Q L . We use the counter c to enforce that these parts are indeed short. For long run components, we first use states in [Q L ]. Only the last N 2 symbols in the component are read using states from Q L . The left-synchronized containment property guarantees that for long run components the precise state is not important, which allows us to make these components memoryless.
The transition relation is divided into transitions between states from the same component of A L (indicated by δ ⟳ = δ 1 ⟳ ∪ δ 2 ⟳ ∪ δ 3 ⟳ ) and transitions between different components (indicated by δ → = δ 1 → ∪ δ 2 → ). Transitions in δ ↷ are added to satisfy the consistent jumps property. They are the only transitions that increase the counter d. This is necessary, as the left-synchronized containment property only talks about the language of the state reached after staying in the component for some number of symbols. If we added the transitions in δ ↷ without using the counter, we would possibly add additional words to the language. This concludes the definition of A.
We now argue that the automaton A ′ derived from A by pushing the counter c into the states is a detainment automaton with consistent jumps that only has memoryless components. The states of A ′ have two components: first, the state of A and, second, the value of the counter c, which is bounded by N 2 . We do not formally define δ ′ . It is derived from δ in the obvious way, i.e., by doing the precondition checks that depend on c on the second component of the state. Similarly, updates of c are done on the second component of the states.
It is straightforward to see that A ′ is a detainment automaton with consistent jumps that only has memoryless components, using the following observations:
• Every transition in A that does not have c = N 2 before and after the transition requires d = 0.
• Let Cuts be the set of components of A; the components of A ′ are derived from the components in Cuts by pushing c into the states.
The consistent jumps are guaranteed by the transitions in δ ↷ . As A ′ only has memoryless components, the consistent jump property is trivially satisfied for states inside the same component.
We now show that L(A L ) ⊆ L(A). Let w = a 1 • • • a n be some string in L(A L ) and q 0 → • • • → q n be the run of A L on w. We define the function countdown : N → N, which gives us how long we stay inside some component, as countdown : i → j − i, where j is the largest number such that C(q j ) = C(q i ).
It is easy to see, by the definitions of the transitions in δ → and δ ⟳ , that (p 0 , c 0 , d 0 ) → • • • → (p n , c n , d n ) is an accepting run of A, where p i is q i if c i < N 2 and [q i ] otherwise. We note that the counter d is always zero, as we do not use any transitions from δ ↷ . The transitions in δ ↷ are only there to satisfy the consistent jumps property. This shows L(A L ) ⊆ L(A).

Towards the lemma statement, it remains to show that L(A) ⊆ L(A L ). Let therefore w = a 1 • • • a n be some string in L(A), (p 0 , c 0 , d 0 ) → • • • → (p n , c n , d n ) be an accepting run of A, and q 0 → • • • → q n be the unique run of A L on w.
We now show by induction on i that there are states q̄ 1 , . . . , q̄ n in Q L such that the following claim is satisfied. The claim easily yields that q n ∈ F L , as both counters have to be zero for the word to be accepted.
The base case i = 0 is trivial by the definition of I. We now assume that the induction hypothesis holds for i and are going to show that it holds for i + 1. Let ρ = (p i , a i+1 , P ; p i+1 , U ) be the transition used to read a i+1 . We distinguish several cases depending on ρ.
Case ρ ∈ δ → : In this case, c i = 0 by the definition of δ → . Therefore, the claim for i + 1 follows with q̄ i+1 = p i+1 , as q̄ i = p i by the induction hypothesis and (p i , a i+1 , p i+1 ) ∈ δ L by the definition of δ → .
Case ρ ∈ δ 3 ⟳ : We want to show that L p i+N 2 ⊆ L q̄ i+N 2 , establishing the claim directly for position i + N 2 using q̄ i+N 2 = p i+N 2 . Therefore, we first want to apply Lemma 3.18 to show that δ(q̄ i , a i+1 • • • a i+N 2 ) = p i+N 2 . Precondition (i) of the lemma is given by the induction hypothesis, precondition (ii) holds by the definition of δ ⟳ , i.e., all transitions in δ ⟳ are inside the same component of A L , and precondition (iii) holds by the fact that each transition in δ ⟳ has a corresponding transition in δ L that stays in the same component. Therefore, we can indeed apply Lemma 3.18 to conclude that δ(q̄ i , a i+1 • • • a i+N 2 ) = p i+N 2 . As we furthermore have that L q̄ i ∩ a i+1 • • • a i+d i Σ * ⊆ L q i by the induction hypothesis, we can conclude that L p i+N 2 ⊆ L q̄ i+N 2 . This establishes the claim for position i + N 2 using q̄ i+N 2 = p i+N 2 . As we only need the claim for position n (and not for all smaller positions), we can continue the induction at position i + N 2 . In particular, there is no need to look at the case ρ ∈ δ 1 ⟳ .

Case ρ ∈ δ ↷ : By the definition of δ ↷ , there are states p ′ and p ′′ with C(p ′ ) = C(p i ), C(p ′′ ) = C(p i+1 ), and p ′ ⇝ p ′′ . This (and the fact that q̄ i ∈ p i by the induction hypothesis) allows us to apply Observation 3.24, which yields δ(q̄ i , a i+1 ) ∈ C(p i ). From p ′ ⇝ p ′′ and q̄ i ∈ C(p ′ ) we can conclude that q̄ i ⇝ p ′′ . We can now apply Lemma 3.19, which gives us p ′′ . By the definition of δ ↷ , we have d i+1 = N 2 , enforcing that the next N 2 transitions are all from δ 2 ⟳ , as these are the only transitions that allow d > 0 in the precondition. Applying Observation 3.24 N 2 times yields the claim for i + 1. This concludes the proof of the lemma.
We now continue with the rest of the proof of Theorem 3.21.
(2) ⇒ (1): Let A = (Q, Σ, δ, I, F ) be an NFA satisfying the left-synchronized containment property and let A L be the minimal DFA equivalent to A. We show that A L satisfies the left-synchronized containment property, establishing (1).
Let M be the left synchronizing-length of A and q 1 , q 2 ∈ Q L be states of A L such that • q 1 ⇝ q 2 ; and • there are words w 1 ∈ Loop(q 1 ) and w 2 ∈ Loop(q 2 ) that start with the same symbol a.
We need to show that there exists an n ∈ N with w n 2 L q 2 ⊆ L q 1 . Let w be a word such that δ(q 1 , w) = q 2 . Let P 1 ⊆ Q be a state of the powerset automaton of A with L P 1 = L q 1 and let P 2 = δ(P 1 , ww * 2 ) be the state in the powerset automaton of A that consists of all states reachable from P 1 reading some word from ww * 2 . It holds that L P 2 = L q 2 , as δ(q 1 , ww * 2 ) = q 2 and L q 1 = L P 1 . We define P ′ 2 as the set of states p 2 ∈ P 2 for which there is a state p 1 ∈ P 1 with p 1 ⇝ p 2 , and P ′′ 2 = δ(P ′ 2 , w |A| 2 ). We obviously have P ′′ 2 ⊆ P ′ 2 ⊆ P 2 . Furthermore, one can show that L q 2 = L P ′ 2 , using δ(q 2 , w |A| 2 ) = q 2 . Let ρ : Q → Q be a function that selects for every state p 2 ∈ P ′ 2 a state p 1 ∈ P 1 such that p 1 ⇝ p 2 . By definition of P ′ 2 , such states exist. Using the fact that A satisfies the left-synchronized containment property, we get that w M 2 L p 2 ⊆ L ρ(p 2 ) for each p 2 ∈ P ′ 2 . We can conclude that w |A|+M 2 L q 2 ⊆ L q 1 . So A L satisfies the left-synchronized containment property with n = |A| + M , where M is the left synchronizing-length of A. This concludes the proof of (2) ⇒ (1) and thus the proof of the theorem.

One characterization of SP tract is the following (Theorem 6 in [BBG20]):

Theorem 3.26. SP tract is the set of regular languages L such that there exists an i ∈ N for which the following holds: for all w ℓ , w, w r ∈ Σ * and w 1 , w 2 ∈ Σ + we have that, if w ℓ w i 1 ww i 2 w r ∈ L, then w ℓ w i 1 w i 2 w r ∈ L.

Comparing this characterization with Definition 3.5, we see that Definition 3.5 imposes an extra "synchronizing" condition on w 1 and w 2 , namely that they share the same first (or last) symbol. We therefore have the following observation:

Observation 3.27. The class SP tract is contained in T tract .
3.6. An Algebraic Characterization of T tract and SP tract . We now provide an algebraic characterization of T tract and SP tract . We use this characterization for two things: First, we use it to fully classify the expressiveness of both classes with respect to some well-known fragments of first-order logic. The results are depicted in Figure 5. Later, in Section 5, we will derive several closure properties for both classes. These properties follow from Observation 3.29, which is the only result from this subsection that is used outside of it.
We refer the reader to the book [Pin97] for a general overview of syntactic semigroups and the different hierarchies. We use the following notation. The syntactic preorder of a language L of Σ * is the relation ≤ L defined on Σ * by x ≤ L y if and only if for all u, v ∈ Σ * we have uxv ∈ L ⇒ uyv ∈ L. The syntactic congruence of L is the associated equivalence relation ∼ L defined by x ∼ L y if and only if x ≤ L y and y ≤ L x. The quotient Σ + / ∼ L (resp. Σ * / ∼ L ) is called the syntactic semigroup (resp. monoid) of L. An element e of a semigroup is idempotent if e 2 = e. Given a finite semigroup S, it is folklore that there is an integer ω(S) (denoted by ω when S is understood) such that for all s ∈ S, s ω is idempotent. More precisely, s ω is the limit of the Cauchy sequence (s n! ) n≥0 .
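For intuition, the syntactic semigroup of a regular language can be computed as the transition semigroup of its minimal DFA, and the power ω can then be found by brute force. The following sketch is our own illustration (the function names and the example DFA, a minimal DFA for (ab) * with sink state 2, are not from the paper).

```python
from itertools import product

def transition_monoid(states, alphabet, delta):
    """Elements of the transition semigroup as tuples (delta(q, w) for q in
    states), closed under composition; for a minimal DFA this is the
    syntactic semigroup of the recognized language."""
    states = list(states)
    idx = {q: i for i, q in enumerate(states)}
    gens = {a: tuple(idx[delta[(q, a)]] for q in states) for a in alphabet}
    elems = set(gens.values())
    frontier = set(gens.values())
    while frontier:
        new = set()
        for f, g in product(frontier, gens.values()):
            h = tuple(g[i] for i in f)      # apply f first, then g
            if h not in elems:
                elems.add(h)
                new.add(h)
        frontier = new
    return elems

def idempotent_power(elems):
    """Smallest n such that s^n is idempotent for every element s."""
    def power(s, n):
        r = tuple(range(len(s)))            # identity map on states
        for _ in range(n):
            r = tuple(s[i] for i in r)
        return r
    n = 1
    while True:
        if all(power(s, 2 * n) == power(s, n) for s in elems):
            return n
        n += 1
```

For the DFA of (ab) * (states 0, 1 and sink 2), the semigroup has five elements and ω = 2: for example, the element of ab is idempotent, while the element of a alone is not.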
Theorem 3.28.
(1) A regular language L ⊆ Σ + is in SP tract if and only if its syntactic semigroup belongs to SP.
(2) A regular language L ⊆ Σ + is in T tract if and only if its syntactic semigroup belongs to T.
Proof. Item (1) follows from Theorem 3.26 and the observation that if there exists an i for which Theorem 3.26 holds, then it also holds for each i ′ ≥ i. This can easily be seen by choosing w ′ ℓ = w ℓ w i ′ −i 1 and w ′ r = w i ′ −i 2 w r . Item (2) follows from Theorem 3.11, Definition 3.5, and the paragraph after the definition.
Observation 3.29. The theorem immediately implies that SP tract and T tract are varieties of semigroups and ne-varieties [Pin97, PS05].
We now fully classify the expressiveness of T tract and SP tract compared to yardstick classes such as DC, FO 2 [<], and FO 2 [<, +1] (see also Figure 5). Here, FO 2 [<] and FO 2 [<, +1] are the two-variable restrictions of FO[<] and FO[<, +1] over words, respectively. By FO[<, +1] we mean first-order logic with unary predicates P a for all a ∈ Σ (denoting positions carrying the letter a) and the binary predicates +1 and < (denoting the successor relation and the order relation among positions). The logic FO[<] is FO[<, +1] without the successor predicate.
We use the characterizations from Theorem 3.28 to classify SP tract and T tract with respect to the Straubing-Thérien hierarchy [Str81, Thé81] and the dot-depth hierarchy (also known as the Brzozowski hierarchy [CB71]). Both hierarchies are particular instances of concatenation hierarchies, which means that they can be built through a uniform construction scheme. Pin [Pin17] summarized numerous results and conjectures around these hierarchies.
Thomas [Tho82] showed that the dot-depth hierarchy corresponds, level by level, to the quantifier alternation hierarchy of first-order formulas, defined as follows. A formula is a Σ n -formula if it is equivalent to a formula Q(x 1 , . . . , x k )φ, where φ is quantifier free and Q(x 1 , . . . , x k ) is a sequence of n blocks of quantifiers such that the first block contains only existential quantifiers. The class Σ n is the class of languages which can be defined by Σ n -formulas. The class Π n is defined analogously, starting with a block of universal instead of existential quantifiers.
Proof. We first show DC ⊊ SP tract . As DC is definable by simple regular expressions, we have for each downward closed language L that w ℓ w i 1 ww i 2 w r ∈ L implies w ℓ w i 1 w i 2 w r ∈ L for every integer i ∈ N and all words w ℓ , w 1 , w, w 2 , w r ∈ Σ * . Therefore, L ∈ SP tract by Theorem 3.26. The language {a} is not downward closed, but is in SP tract by Theorem 3.26 with i = 1.
The subset relation SP tract ⊆ T tract was already observed earlier (Observation 3.27), and a * bc * is a language in T tract which is not in SP tract , showing that the containment is strict.
So SP tract and T tract are between Π 1 [<] and Π 1 [<, +1]. While SP tract and T tract behave similarly when the number of alternations of a first-order formula is restricted, restricting the number of variables (FO 2 ) leads to a different behavior:

Proof. We first show (a). Thérien and Wilke [TW98] proved that DA = FO 2 [<], where DA is defined by the identity (xyz) ω y(xyz) ω = (xyz) ω . Thus we only have to prove that each syntactic semigroup of a language in SP tract satisfies this identity. Let L ∈ SP tract . By Theorem 3.28, it immediately follows that the syntactic semigroup of L satisfies (xyz) ω y(xyz) ω ≤ (xyz) ω . Thus it remains to show that there exists an n ′ such that for each n ≥ n ′ and all u, v, x, y, z ∈ Σ * the converse inequality holds as well.

Statement (b) follows from the facts that the language a * ba * is in FO 2 [<] but not in T tract , whereas the language (ab) * is in T tract but not in FO 2 [<]. It remains to prove (c), which follows from Theorem 3.30, as Π 1 [<, +1] is a subset of the first level of the dot-depth hierarchy, which in turn is a subset of FO 2 [<, +1]. The language a * ba * is an example of a language in FO 2 [<, +1] that is not in T tract .
Proof. Let L ∈ SP tract . The 3/2th level of the Straubing-Thérien hierarchy is defined by the profinite inequality x ω ≤ x ω yx ω , where Alph(x) = Alph(y) [Pin97, Theorem 8.9]. This means that we have to show that there exists an n ′ such that for all n ≥ n ′ and words w ℓ , w r it holds: if w ℓ x n w r ∈ L, then also w ℓ x n yx n w r ∈ L. We can easily see that every language in SP tract satisfies this: the components have the form (A ≥k + ε) for some set of symbols A by the definition of SP tract in terms of regular expressions, see [BBG20, Theorem 6]. Therefore, the implication immediately holds for all y with Alph(y) = Alph(x).

The Trichotomy
This section is devoted to the proof of the following theorem.
Theorem 4.1. Let L be a regular language.
(1) If L is finite, then RTQ(L) is in AC 0 .
(2) If L ∈ T tract and L is infinite, then RTQ(L) is NL-complete.
(3) If L ∉ T tract , then RTQ(L) is NP-complete.
4.1. Finite Languages. We now turn to proving Theorem 4.1. We start with Theorem 4.1(1). Clearly, we can express every finite language L as an FO-formula. Since we can also test in FO that no edge e is used more than once, the multigraphs for which RTQ(L) holds are FO-definable. By Immerman [Imm88], this implies that RTQ(L) is in AC 0 .
4.2. Languages in T tract . We now sketch the proof of Theorem 4.1(2). We note that we define several concepts (trail summary, local edge domains, admissible trails) that have a natural counterpart for simple paths in Bagan et al.'s proof of the trichotomy for simple paths [BBG20]. However, the underlying proofs of the technical lemmas are quite different.
For instance, components of languages in SP tract behave similarly to A * for some A ⊆ Σ, while components of languages in T tract are significantly more complex.Furthermore, the trichotomy for trails leads to a strictly larger class of tractable languages.
For the remainder of this section, we fix the constant K = N 2 . We will show that, in the case where L belongs to T tract , we can identify a number of edges that suffice to check if the path is (or can be transformed into) a trail that matches L. This number of edges only depends on L and is therefore constant for the RTQ(L) problem. These edges will be stored in a summary. We will define summaries formally and explain how to use them to check whether a trail between the input nodes that matches L exists. To this end, we need a few definitions.

Definition 4.2. Let p = e 1 • • • e m be a path and r = q 0 → • • • → q m the run of A L over lab(p). For a set C of states of A L , we denote by left C the first edge e i with q i−1 ∈ C and by right C the last edge e j with q j ∈ C.

Next, we want to reduce the amount of information that we require for trails. The synchronization property, see Lemma 3.18, motivates the use of summaries, which we define next.
If p is a trail, then the summary S p of p is the sequence obtained from p by replacing, for each long run component C, the subsequence p[left C , right C ] by the abbreviation (C, (v, q), p suff ), where v is the source node of the edge left C , q is the state in which A L is immediately before reading left C , and p suff is the suffix of p[left C , right C ] of length K. We note that the length of a summary is always bounded by O(N 3 ), i.e., a constant that depends on L. Indeed, A L has at most N components and, for each of them, we store at most K + 3 many things (namely, C, v, q, and K edges). Our goal is to find a summary S and replace all abbreviations with matching pairwise edge-disjoint trails which do not use any other edge in S, because this results in a trail that matches L. However, not every sequence of edges and abbreviations is a summary, because a summary needs to be obtained from a trail. So, we will work with candidate summaries instead.
Definition 4.4. A candidate summary S is a sequence of the form S = α 1 • • • α m with m ≤ N, where each α i is either (1) an edge e ∈ E or (2) an abbreviation (C, (v, q), e K • • • e 1 ) ∈ Abbrv. Furthermore, all components in S are distinct and each edge e occurs at most once. A path p that is derived from S by replacing each α i ∈ Abbrv by a trail p i such that p i |= α i is called a completion of the candidate summary S.
The following corollary is immediate from the definitions and Lemma 3.18, as the lemma ensures that the state after reading p inside a component does not depend on the whole path but only on the labels of the last K edges, which are fixed.
Corollary 4.5. Let L be a language in T tract . Let S be the summary of a trail p that matches L and let p ′ be a completion of S. Then, p ′ is a path that matches L.
Together with the following lemmas, Corollary 4.5 can be used to obtain a nondeterministic logarithmic space algorithm that gives us a completion of a summary S. The lemma heavily relies on other results on the structure of components in A L .

Lemma 4.6. There exists a nondeterministic logarithmic space algorithm that, given a directed graph G and nodes s and t, outputs a shortest path from s to t in G.
Proof. We show that Algorithm 1 can output a shortest path in nondeterministic logarithmic space. Recall that nondeterministic algorithms with output either give up, or produce a correct output, and that at least one computation does not give up. We note that Algorithm 1 is a mixture of the Immerman-Szelepcsényi Theorem [Imm88, Sze88] and reachability. To this end, S(k) denotes the set of nodes reachable from s with k edges. Using the algorithm given by Immerman [Imm88] and Szelepcsényi [Sze88] to show that non-reachability is in NL, we can find in lines 1-27 the smallest n such that a path from s to t of length n but none of length n − 1 exists. Indeed, we only added a test in line 19 to find the smallest k for which t ∈ S(k); this k is the length of a shortest path from s to t. After line 28 we then use the smallest k (which we name n) together with a standard reachability algorithm to nondeterministically output a path of this length. (If we are only interested in the length of a shortest path, we can return n instead.) We note that one can easily change the algorithm to avoid outputting edges of paths that will give up. This would require an extra test whether there exists a path of length n − p from w p to t before outputting the edge from w p−1 to w p . We omitted this extra test for readability (and because at this point we know that there is a solution and nondeterministic algorithms will always return the correct output).
That Algorithm 1 runs in nondeterministic logarithmic space follows from the Immerman-Szelepcsényi Theorem and reachability being in NL.
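A log-space machine cannot store a BFS queue, which is why Algorithm 1 needs the Immerman-Szelepcsényi counting technique. The input/output behaviour of Lemma 4.6, however, is easy to state with a plain breadth-first search. The following deterministic sketch (linear space, our own code, not Algorithm 1 itself) produces the same kind of output: a shortest path from s to t, or a rejection if none exists.

```python
from collections import deque

def shortest_path(n, edges, s, t):
    """BFS returning a shortest path from s to t as a list of nodes in a
    directed graph on nodes 0..n-1, or None if t is unreachable.  Illustrates
    the output of Lemma 4.6; the NL algorithm replaces BFS by nondeterministic
    guessing combined with Immerman-Szelepcsenyi distance counting."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    parent = {s: None}                 # BFS tree: first discovery is shortest
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:                     # reconstruct path from the BFS tree
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None
```

For instance, on the graph with edges (0,1), (1,2), (2,3), (0,3), the shortest path from 0 to 3 is the direct edge.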
We explain how to use the algorithm described in Lemma 4.6 to output a shortest path that satisfies some additional constraints.

Lemma 4.7. Let L ∈ T tract , let (C, (v, q), e K • • • e 1 ) be an abbreviation, and let E ′ ⊆ E. There exists a nondeterministic logarithmic space algorithm that outputs a shortest trail p such that p |= E ′ (C, (v, q), e K • • • e 1 ) if it exists and rejects otherwise.
We then output a shortest path from (v, q, K) to (t, q ′ , 1), for t being the target node of e 1 and some q ′ ∈ C. More precisely, since we want a path in G and not in the product, we project away the unnecessary state and number and only output the corresponding edge in G in line 33.
It remains to show that p is a trail (in G). Assume towards a contradiction that p = d 1 • • • d m e K • • • e 1 is not a trail. Then there exists an edge d i = d j that appears at least twice in p. Note that d j is not in the suffix e K • • • e 1 by the definition of p. We define p ′ = d 1 • • • d i d j+1 • • • d m e K • • • e 1 and show that p ′ is shorter than p but meets all requirements. Let q 1 = δ(q, d 1 • • • d i ) and q 2 = δ(q, d 1 • • • d j ). By definition, q 1 , q 2 ∈ C and both have an incoming edge with label lab(d i ) = lab(d j ). This allows us to use Corollary 3.16, and we can then apply Lemma 3.18 to prove that p ′ is indeed a trail satisfying p ′ |= E ′ (C, (v, q), e K • • • e 1 ). Furthermore, p ′ is shorter than p, contradicting our assumption.
Using the algorithm of Lemma 4.7 we can, in principle, output a completion of S that matches L using nondeterministic logarithmic space. However, such a completion does not necessarily correspond to a trail. The reason is that, even though each p C we guess for some abbreviation involving a component C is a trail, the trails for different components may not be disjoint. Therefore, we will define pairwise disjoint subsets of edges that can be used for the completion of the components.
The following definition fulfills the same purpose as the local domains on nodes in Bagan et al. [BBG20, Definition 7]. Since our components can be more complex, we require extra conditions on the states (the δ L (q, π) ∈ C condition).

Definition 4.8 (Local Edge Domains). Let S = α 1 • • • α k be a candidate summary and E(S) be the set of edges appearing in S. We define the local edge domains Edge i ⊆ E i inductively for each i from 1 to k, where E i are the remaining edges defined by E 1 = E \ E(S) and E i+1 = E i \ Edge i . If there is no trail p such that p |= α i or if α i is a single edge, we define Edge i = ∅.
Otherwise, let α i = (C, (v, q), e K • • • e 1 ). We denote by m i the minimal length of a trail p with p |= E i α i and define Edge i as the set of edges used by trails π that start at v, only use edges in E i , are of length at most m i − K, and satisfy δ L (q, π) ∈ C.
By the definition of Edge i , we can conclude that E(e i ) ̸= E(e j ) for all e i ∈ Edge i , e j ∈ Edge j with i ̸= j, as e i ∈ Edge i and E(e i ) = E(e j ) would imply that e j ∈ Edge i . We note that a shortest trail using e i but not e j can use e j instead of e i . We note that the sets E(S) and (Edge i ) i∈[k] are always disjoint.

Definition 4.9 (Admissible Trail). We say that a trail p is admissible if there exist a candidate summary S = α 1 • • • α k and trails p 1 , . . . , p k such that p = p 1 • • • p k is a completion of S and p i |= Edge i α i for every i ∈ [k].
We show that shortest trails that match L are always admissible. Thus, the existence of a trail is equivalent to the existence of an admissible trail.
Lemma 4.10. Let G and (s, t) be an instance for RTQ(L), with L ∈ T tract . Then every shortest trail from s to t in G that matches L is admissible.
Proof sketch. We assume towards a contradiction that there is a shortest trail p from s to t in G that matches L and is not admissible. That means there is some ℓ ∈ N and an edge e used in p ℓ with e ∉ Edge ℓ . There are two possible cases: (1) e ∈ Edge i for some i < ℓ and (2) e ∉ Edge i for any i. In both cases, we construct a shorter trail that matches L, which then leads to a contradiction. We depict the two cases in Figure 6. We construct the new trail by substituting the respective subtrail with π.
Proof. In this proof, we use the following notation for trails. By p[e 1 , e 2 ) we denote the prefix of p[e 1 , e 2 ] that excludes the last edge (so it can be empty). Notice that p[e 1 , e 2 ] and p[e 1 , e 2 ) are always well-defined for trails. Let p = d 1 • • • d m be a shortest trail from s to t that matches L. Let S = α 1 • • • α k be the summary of p and let p 1 , . . . , p k be trails such that p = p 1 • • • p k is a completion of S.

We wondered if, similarly to Theorem 3.2, it could be the case that languages closed under left-synchronized power abbreviations are always regular, but this is not the case. For example, the (infinite) Thue-Morse word [Thu06, Mor21] has no subword that is a cube (i.e., no subword of the form w 3 ) [Thu06, Satz 6]. The language containing all prefixes of the Thue-Morse word is thus trivially closed under left-synchronized power abbreviations (with i = 3), yet it is not regular.
We now give some closure properties of SP tract and T tract . We note that Bagan et al. [BBG20] already observed that SP tract is closed under finite unions, intersections, and reversal.
Lemma 5.3. Both classes SP tract and T tract are closed under (i) finite unions, (ii) finite intersections, (iii) reversal, (iv) left and right quotients, (v) inverses of non-erasing morphisms, and (vi) removal and addition of individual strings.
Proof. The closure properties (i) to (vi) follow immediately from Observation 3.29, i.e., that SP tract and T tract are ne-varieties, see [Pin97, PS05].
Lemma 5.4. The classes SP tract and T tract are not closed under complement.
Proof. Let Σ = {a, b}. The language of the expression b* is clearly in SP tract and T tract . Its complement is the language L containing all words with at least one a, which can be described by the regular expression Σ* aΣ*. Since b^i ab^i ∈ L for all i, but b^i b^i ∉ L for any i, the language L is neither in SP tract nor in T tract .
It is an easy consequence of Lemma 5.3 (vi) that regular languages outside of SP tract or T tract have no best lower or upper approximations within these classes.
Corollary 5.5. Let C ∈ {SP tract , T tract } and let L be a regular language with L ∉ C. Then:
• for every upper approximation L′′ of L (i.e., L ⊊ L′′) with L′′ ∈ C, there exists a language L′ ∈ C with L ⊊ L′ ⊊ L′′;
• for every lower approximation L′′ of L (i.e., L′′ ⊊ L) with L′′ ∈ C, there exists a language L′ ∈ C with L′′ ⊊ L′ ⊊ L.
The corollary implies that Angluin-style learning of languages in SP tract or T tract is not possible.However, learning algorithms for single-occurrence regular expressions (SOREs) exist [BNSV10] and can therefore be useful for an important subclass of T tract .

Enumeration
In this section we show that, using the algorithm from Theorem 4.1, the enumeration result from [Yen71] transfers to the setting of enumerating trails that match L.
Theorem 6.1. Let L be a regular language, G be a multigraph, and (s, t) a pair of nodes in G. If NL ≠ NP, then one can enumerate the trails from s to t that match L with polynomial delay in data complexity if and only if L ∈ T tract .
Proof sketch. The algorithm is an adaptation of Yen's algorithm [Yen71], which enumerates the k shortest simple paths for a given number k, similar to what was done by Martens and Trautner [MT19]. It uses the algorithm from Corollary 4.15 as a subprocedure.
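Ignoring the label constraint (which the subprocedure from Corollary 4.15 handles in the actual algorithm), the deviation scheme can be sketched as follows. This is a minimal Python sketch, not the paper's algorithm: unlike in Yen's original algorithm, only edges (not nodes) are banned at deviation points, so the enumerated objects are trails rather than simple paths. Since a shortest walk never repeats an edge, plain BFS suffices as the shortest-trail subprocedure.

```python
import heapq
from collections import deque

def bfs_shortest(adj, s, t, banned):
    """Shortest edge-id sequence from s to t avoiding banned edge ids,
    or None. A shortest walk repeats no edge, so the result is a trail."""
    prev = {s: None}                     # node -> (predecessor node, edge id)
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for eid, v in adj.get(u, []):
            if eid in banned or v in prev:
                continue
            prev[v] = (u, eid)
            queue.append(v)
    if t not in prev:
        return None
    path, node = [], t
    while prev[node] is not None:
        u, eid = prev[node]
        path.append(eid)
        node = u
    return path[::-1]

def yen_trails(edges, s, t, k):
    """Enumerate up to k shortest trails from s to t (Yen-style deviations,
    banning edges instead of nodes so that prefixes stay trails)."""
    adj = {}
    for eid, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((eid, v))
    first = bfs_shortest(adj, s, t, set())
    if first is None:
        return []
    found = [first]
    candidates = []                      # min-heap of (length, trail)
    while len(found) < k:
        last = found[-1]
        for i in range(len(last)):
            root = last[:i]              # fixed prefix of the last trail
            spur = s if i == 0 else edges[last[i - 1]][1]
            banned = set(root)           # trail condition: no edge reuse
            for p in found:
                if p[:i] == root:
                    banned.add(p[i])     # force a deviation from p
            tail = bfs_shortest(adj, spur, t, banned)
            if tail is not None:
                cand = root + tail
                if cand not in found and all(cand != c for _, c in candidates):
                    heapq.heappush(candidates, (len(cand), cand))
        if not candidates:
            break
        found.append(heapq.heappop(candidates)[1])
    return found

# Example: multigraph with edges indexed 0..4.
edges = [(0, 1), (1, 3), (0, 2), (2, 3), (1, 2)]
print(yen_trails(edges, 0, 3, 10))   # → [[0, 1], [2, 3], [0, 4, 3]]
```

Banning edges rather than nodes is exactly what makes the deviation step sound for trails: the spur path may revisit nodes of the root, which a trail permits, but it can never reuse one of its edges.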

Figure 1: Directed, edge-labeled graphs that have a trail from s to t.

Figure 4: Consistent jump condition (simplified, i.e., without preconditions, counter, and update) used in Theorem 3.31. C 1 and C 2 are components (not necessarily different) such that C 2 is reachable from C 1 .

Figure 5: Expressiveness of subclasses of the aperiodic languages.

For this direction, we use that Bagan et al. [BBG20, Theorem 6] give a definition of SP tract in terms of regular expressions, showing that each component can be represented as (A^{≥k} + ε) for some set A ⊆ Σ and k ∈ N. So if there is xyz ∈ Σ* with u(xyz)^M v ∈ L for some u, v ∈ Σ*, then we also have u(xyz)^M (Alph(xyz))* (xyz)^M v ⊆ L, where Alph(x) denotes the set of symbols occurring in x. Thus, in particular, u(xyz)^M y(xyz)^M v ∈ L, which proves the other direction. The same holds for each M′ ≥ M. This concludes the proof of (a).
A component C of A L is a long run component of p if left C and right C are defined and |p[left C , right C ]| > K.

Figure 7: Example of the reduction in Lemma 4.19 for the language da*c(abc)*ef. We use w ℓ = d, w m = c, w r = ef, w 1 = aa, and w 2 = abc for the construction. For ease of readability, we omit the intermediate nodes on the bc and ef paths.
Extension of the Immerman–Szelepcsényi Theorem
Input: A directed graph G = (V, E, E), nodes s, t in G, s ≠ t
Output: A shortest path from s to t in G, or "no" if no path from s to t exists
1 n ← −1 ▷ n will be the length of a shortest path from s to t
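Disregarding the nondeterministic-logspace aspect that the Immerman–Szelepcsényi technique addresses, the input/output behavior of this procedure can be sketched with a plain breadth-first search. This is a minimal sketch under our own assumptions (adjacency-list input, paths returned as node lists), not the paper's space-bounded algorithm.

```python
from collections import deque

def shortest_path(adj, s, t):
    """Return a shortest path from s to t as a node list, or "no" if
    no path exists. adj maps each node to its list of out-neighbors."""
    prev = {s: None}                 # node -> predecessor on a shortest path
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:                   # reconstruct the path backwards
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return "no"

print(shortest_path({0: [1, 2], 1: [3], 2: [3]}, 0, 3))  # → [0, 1, 3]
print(shortest_path({0: [1]}, 1, 0))                     # → no
```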