EXPRESSIVE PATH QUERIES ON GRAPHS WITH DATA

Graph data models have recently become popular owing to their applications, e.g., in social networks and the semantic web. Typical navigational query languages over graph databases, such as conjunctive regular path queries (CRPQs), cannot express relevant properties of the interaction between the underlying data and the topology. Two languages have recently been proposed to overcome this problem: walk logic (WL) and regular expressions with memory (REM). In this paper, we begin by investigating fundamental properties of WL and REM, i.e., the complexity of their evaluation problems and their expressive power. We first show that the data complexity of WL is nonelementary, which rules out its practicality. On the other hand, while REM has low data complexity, we point out that many natural data/topology properties of graphs expressible in WL cannot be expressed in REM. To this end, we propose register logic, an extension of REM, which we show to be able to express many natural graph properties expressible in WL, while at the same time preserving the elementariness of data complexity of REMs. It is also incomparable to WL in terms of expressive power.


Introduction
Graph databases have gained renewed interest due to applications such as the semantic web, social network analysis, crime detection networks, software bug detection, biological networks, and others (e.g., see [1] for a survey). Despite the importance of querying graph databases, no general agreement has been reached to date about the kind of features a practical query language for graph databases should support, or about what can be considered a reasonable computational cost of query evaluation for the aforementioned applications.
Typical navigational query languages for graph databases, including the conjunctive regular path queries [7] and their many extensions [4], suffer from a common drawback: they are well-suited for expressing relevant properties about the underlying topology of a graph database, i.e., about the way in which (labeled) nodes are connected via (labeled) edges, but not about how such topology interacts with the node ids or the data. This drawback is shared by common specification languages for verification [6] (e.g. CTL*), which are evaluated over a similar graph data model (a.k.a. transition systems). Examples of important queries that combine graph data and topology, but cannot be expressed in usual navigational languages for graph databases, include the following [8,13]:
(Q1) Find pairs of people in a social network connected by professional links restricted to people of the same age.
(Q2) Find pairs of cities x and y in a transportation system, such that y can be reached from x using only services operated by the same company.
In each one of these queries, the connectivity between two nodes (i.e., the topology) is constrained by the data (from an infinite domain, e.g., N), in the sense that we only consider paths in which all intermediate nodes satisfy a certain condition (e.g. they are people of the same age).
Two languages, walk logic and regular expressions with memory, have recently been proposed to overcome this problem. These languages have different goals:
(a) Walk logic (WL) was proposed by Hellings et al. [8] as a unifying framework for understanding the expressive power of path queries over graph databases. Its strength is on the expressiveness side. The underlying data model of WL is that of (node- or edge-) labeled directed graphs. In this context, WL can be seen as a natural extension of FO with path quantification, plus the ability to check whether positions p and p′ in paths π and π′, respectively, have the same data values. In their paper, Hellings et al. assume the restriction that each node carries a distinct data value (and, therefore, that this data value serves as an identifier for the node). However, as we shall see, this makes no difference in terms of the results that we can obtain.
(b) Regular expressions with memory (REMs) were proposed by Libkin and Vrgoč [10] as a formalism for comparing data values along a single path, while retaining a reasonable complexity for query evaluation. The strength of this language is on the side of efficiency. The data model of the class of REMs is that of edge-labeled directed graphs, in which each node is assigned a data value from an infinite domain. REMs define pairs of nodes in the graph database that are linked by a path satisfying a given condition c. Each such condition c is defined in a formalism inspired by the class of register automata [9], allowing some data values to be stored in the registers and then compared against other data values. The evaluation problem for REMs is Pspace-complete (same as for FO over relational databases), and can be solved in polynomial time in data complexity [10], i.e., assuming queries to be fixed. This shows that the language is, in fact, well-behaved in terms of the complexity of query evaluation.
The aim of this paper is to investigate the expressiveness and complexity of query evaluation for WL and the class of REMs, with the hope of finding a navigational query language for data graphs that strikes a good balance between these two important aspects of query languages.

Contributions. We start by considering WL, which is known to be a powerful formalism in terms of expressiveness. Little is known about the cost of query evaluation for this language, save for the decidability of the evaluation problem and NP-hardness of its data complexity. Our first main contribution is to pinpoint the exact complexity of the evaluation problem for WL (thus answering an open problem from [8]): we prove that it is non-elementary, and that this holds even in data complexity, which rules out the practicality of the language.
We thus move to the class of REMs, which suffers from the opposite drawback: although the complexity of evaluation for queries in this class is reasonable, the expressiveness of the language is too rudimentary for expressing some important path properties, due to its inability to (i) compare data values in different paths and (ii) express branching properties of the graph database. An example of an interesting query that is not expressible as an REM is the following:
(Q) Find pairs of nodes x and y, such that there is a node z and a path π from x to y in which each node is connected to z.
Notice that this is the query that lies at the basis of the queries (Q1) and (Q2) we presented before.
Our second contribution then is to identify a natural extension of this language, called register logic (RL), that closes REMs under Boolean combinations and existential quantification over nodes, paths and register assignments. The latter allows the logic to express comparisons of data values appearing in different paths, as well as branching properties of the data. This logic is incomparable in expressive power to WL. Besides, many natural queries relating data and topology in data graphs can be expressed in RL, including: the query (Q), hamiltonicity, the existence of an Eulerian trail, bipartiteness, and connected graphs with an even number of nodes. We then study the complexity of the problem of query evaluation for RL, and show that it can be solved in elementary time (in particular, that it is Expspace-complete). This is in contrast to WL, for which even the data complexity is non-elementary. With respect to data complexity, we prove that RL is Pspace-complete. We then identify a slight extension of its existential-positive fragment, which is tractable (NLogspace) in data complexity and can express many queries of interest (including the query (Q)). The idea behind this extension is that atomic REMs can be enriched with an existential branching operator (in the style of the class of nested regular expressions [5]) that increases expressiveness without affecting the cost of evaluation.

Organization of the paper. Section 2 defines our data model. In Section 3, we briefly recall the definition of walk logic and some basic results from [8]. In Section 4, we prove that the data complexity of WL is nonelementary. Section 5 contains our results concerning register logic. We conclude in Section 6 with future work.

The Data Model
We start with a definition of our data model: data graphs.

Definition 2.1 (Data graph). Let Σ be a finite alphabet. A data graph G over Σ is a tuple (V, E, κ), where V is the finite set of nodes, E ⊆ V × Σ × V is the set of directed edges labeled in Σ (that is, each triple (v, a, v′) ∈ E is to be seen as an edge from v to v′ in G labeled a), and κ : V → D is a function that assigns a data value in D to each node in V.

This is the data model adopted by Libkin and Vrgoč [10] in their definition of REMs. In the case of WL [8], the authors adopted graph databases as their data model, i.e., data graphs G = (V, E, κ) such that κ is injective (i.e. each node carries a different data value). In such a case we can think of κ(v) as the identifier (id) of v, for each v ∈ V. We shall adopt the general model of [10] since none of our complexity results are affected by the data model: upper bounds hold for data graphs, while all lower bounds are proved in the more restrictive setting of graph databases. However, for the sake of the comparison with the expressiveness of WL, many of our examples are constructed in the scenario of graph databases, that is, when κ(v) serves as an id for node v.
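As an illustration of Definition 2.1, the following is a minimal Python sketch of a data graph, assuming nodes and data values are hashable Python objects. The class name `DataGraph` and the method `is_graph_database` are hypothetical names chosen here for illustration; they do not come from [8] or [10].

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable, Tuple

Node = Hashable
Edge = Tuple[Node, str, Node]  # (source, label, target), i.e. an element of V x Sigma x V


@dataclass(frozen=True)
class DataGraph:
    """Hypothetical container for a data graph G = (V, E, kappa)."""
    nodes: FrozenSet[Node]
    edges: FrozenSet[Edge]
    kappa: Dict[Node, Hashable]  # assigns a data value to each node

    def __post_init__(self) -> None:
        # Every edge must connect nodes of V.
        for source, _label, target in self.edges:
            if source not in self.nodes or target not in self.nodes:
                raise ValueError("edge endpoint outside the node set")
        # kappa must be defined on exactly the nodes of V.
        if set(self.kappa) != set(self.nodes):
            raise ValueError("kappa must be defined on exactly the node set")

    def is_graph_database(self) -> bool:
        # A graph database is a data graph whose kappa is injective.
        values = list(self.kappa.values())
        return len(values) == len(set(values))
```

With this representation, the distinction between data graphs and graph databases reduces to the injectivity test on κ.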
There is also the issue of edge-labeled vs node-labeled data graphs. Our data model is edge-labeled, but the original one for WL is node-labeled [8]. We have chosen to use the former because it is the standard in the literature [2]. Again, this choice is inessential, since all the complexity results we present in the paper remain true if the logics are interpreted over node-labeled graph databases or data graphs (applying the expected modifications to the syntax).
Finally, in several of our examples we use logical formulas to express properties of undirected graphs. In each such case we assume that an undirected graph H is represented as a graph database G = (V, E, κ) over the unary alphabet Σ = {a}, where V is the set of nodes of H and E is a symmetric relation (i.e. (v, a, v′) ∈ E iff (v′, a, v) ∈ E). In particular, since G = (V, E, κ) is a graph database we have that κ is injective, i.e., each node is uniquely determined by its data value.
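The representation of undirected graphs described above can be sketched as follows. This is only an illustrative sketch: the function name `undirected_to_graph_database` is hypothetical, and the choice of consecutive integers as data values is an assumption (any injective assignment would do).

```python
def undirected_to_graph_database(vertices, undirected_edges):
    """Encode an undirected graph as a graph database over the alphabet {a}.

    Returns (V, E, kappa), where E is symmetric and kappa is injective.
    """
    nodes = set(vertices)
    edges = set()
    for u, v in undirected_edges:
        # Each undirected edge {u, v} yields both directed a-labeled edges.
        edges.add((u, "a", v))
        edges.add((v, "a", u))
    # Assumption: use distinct integers as data values, so kappa(v) acts as an id.
    kappa = {v: i for i, v in enumerate(sorted(nodes, key=repr))}
    return nodes, edges, kappa
```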

Walk Logic
WL is an elegant and powerful formalism for defining properties of paths in graph databases, which was originally proposed in [8] as a yardstick for measuring the expressiveness of different path logics.
The syntax of WL is defined with respect to countably infinite sets Π of path variables (that we denote as π, π_1, π_2, …) and T(π), for each π ∈ Π, of position variables of sort π. We assume that different sorts are associated with distinct position variables. We denote position variables by t, t_1, t_2, …, and write t^π when we need to emphasize that position variable t is of sort π.

Definition 3.1 (Walk logic (WL)). The set of formulas of WL over finite alphabet Σ is defined by the following grammar, where (i) a ∈ Σ, (ii) t, t_1, t_2 are position variables of any sort, (iii) π is a path variable, and (iv) $t_1^\pi$, $t_2^\pi$ are position variables of the same sort π:

$$\varphi, \varphi' \;:=\; E_a(t_1^\pi, t_2^\pi) \;\mid\; t_1^\pi < t_2^\pi \;\mid\; t_1 \sim t_2 \;\mid\; \neg\varphi \;\mid\; \varphi \vee \varphi' \;\mid\; \exists t^\pi \varphi \;\mid\; \exists \pi\, \varphi$$

As usual, WL formulas without free variables are called Boolean.
To define the semantics of WL we need to introduce some terminology. A path (a.k.a. walk in [8]) in the data graph G = (V, E, κ) is a finite, nonempty sequence $\rho = v_1 a_1 v_2 \cdots v_{n-1} a_{n-1} v_n$ such that $(v_i, a_i, v_{i+1}) \in E$ for each 1 ≤ i < n. The set of positions of ρ is {1, …, n}, and $v_i$ is the node in position i of ρ, for 1 ≤ i ≤ n.

The intuition behind the semantics of WL formulas is as follows. Each path variable π is interpreted as a path ρ in the data graph G, while each position variable t of sort π is interpreted as a position 1 ≤ i ≤ n in ρ (that is, position variables of sort π are interpreted as positions in the path that interprets π). The atomic formula $E_a(t_1^\pi, t_2^\pi)$ holds iff, in the path ρ that interprets π, the position $p_2$ that interprets $t_2$ is the successor of the position $p_1$ that interprets $t_1$ (i.e. $p_2 = p_1 + 1$), and the node in position $p_1$ is linked in ρ by an a-labeled edge to the node in position $p_2$ (that is, $a_{p_1} = a$). In the same way, $t_1^\pi < t_2^\pi$ holds iff in the path ρ that interprets π the position that interprets $t_1$ is smaller than the one that interprets $t_2$. Furthermore, $t_1 \sim t_2$ is the case iff the data value carried by the node in the position assigned to $t_1$ is the same as the data value carried by the node in the position assigned to $t_2$ (possibly in different paths). We formalize the semantics of WL below.
Let G = (V, E, κ) be a data graph and φ a WL formula. Assume that $S_\varphi$ is the set that consists of (i) all position variables $t^\pi$ and path variables π such that $t^\pi$ is a free variable of φ, and (ii) all path variables π such that π is a free variable of φ. Intuitively, $S_\varphi$ defines the set of (both path and position) variables that are relevant to define the semantics of φ over G. An assignment α for φ over G is a mapping that associates a path in G with each path variable π ∈ $S_\varphi$, and a position 1 ≤ i ≤ n of that path with each position variable of the form $t^\pi$ in $S_\varphi$ (notice that this is well-defined since π ∈ $S_\varphi$ every time a position variable of the form $t^\pi$ is in $S_\varphi$). As usual, we denote by α[t → i] and α[π → ρ] the assignments that are equal to α except that t is now assigned position i and π the path ρ, respectively.
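We say that G satisfies φ under α, denoted (G, α) |= φ, if one of the following holds (we omit Boolean combinations, which are standard):
• $\varphi = E_a(t_1^\pi, t_2^\pi)$, $\alpha(\pi) = v_1 a_1 v_2 \cdots v_{n-1} a_{n-1} v_n$, and it is the case that $\alpha(t_2^\pi) = \alpha(t_1^\pi) + 1$ and $a = a_{\alpha(t_1^\pi)}$.
• $\varphi = (t_1^\pi < t_2^\pi)$ and $\alpha(t_1^\pi) < \alpha(t_2^\pi)$.
• $\varphi = (t_1^{\pi_1} \sim t_2^{\pi_2})$ and $\kappa(v_1) = \kappa(v_2)$, where $v_i$ is the node in position $\alpha(t_i)$ of $\alpha(\pi_i)$, for i = 1, 2.
• $\varphi = \exists t^\pi \psi$ and one of the following holds: (1) $t^\pi$ does not appear free in ψ, or (2) both $t^\pi$ and π appear free in ψ, and there is a position i in α(π) such that $(G, \alpha[t^\pi \to i]) \models \psi$, or (3) $t^\pi$ appears free in ψ, π does not appear free in ψ, and there is a path ρ in G and a position i in ρ such that $(G, \alpha[\pi \to \rho, t^\pi \to i]) \models \psi$.
• $\varphi = \exists \pi \psi$ and one of the following holds: (1) π does not appear free in ψ, or (2) there is a path ρ in G such that $(G, \alpha[\pi \to \rho]) \models \psi$.

Example 3.1. A simple example from [8] that shows that WL expresses NP-complete properties is the following query, which checks whether a graph G has a Hamiltonian path:

$$\exists \pi \, \big( \forall t_1^\pi \forall t_2^\pi \, ( t_1 \neq t_2 \to t_1 \not\sim t_2 ) \;\wedge\; \forall t_1^{\pi'} \exists t_2^\pi \, ( t_1 \sim t_2 ) \big).$$

In fact, this query expresses that there is a path π in G that does not repeat nodes (because of the first conjunct), and that every node belongs to such a path (because the second conjunct ranges over all positions $t_1^{\pi'}$ of all paths π′, and, therefore, every node that occurs in some path π′ in the graph database also occurs in π). Note that this formula uses in an essential way the fact that G is a graph database, i.e., that each node is uniquely identified by its data value. ✷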
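To make the property of Example 3.1 concrete, here is a brute-force Python sketch that searches for a path visiting every node without repeating a data value. It is not a WL evaluator and not the method of [8]; the function name `has_hamiltonian_path` is hypothetical, and the sketch assumes the input is a graph database (κ injective), so that distinct data values mean distinct nodes.

```python
def has_hamiltonian_path(nodes, edges, kappa):
    """Check by exhaustive search whether some path visits every node exactly once.

    nodes: set of nodes; edges: set of (source, label, target) triples;
    kappa: dict mapping each node to its (assumed unique) data value.
    """
    successors = {v: set() for v in nodes}
    for source, _label, target in edges:
        successors[source].add(target)

    def extend(current, seen_values, visited_count):
        # All nodes visited: the path built so far is Hamiltonian.
        if visited_count == len(nodes):
            return True
        for nxt in successors[current]:
            # Mirror the first conjunct: no two positions carry the same data value.
            if kappa[nxt] not in seen_values:
                if extend(nxt, seen_values | {kappa[nxt]}, visited_count + 1):
                    return True
        return False

    return any(extend(v, {kappa[v]}, 1) for v in nodes)
```

The search takes exponential time in the worst case, which is consistent with the NP-completeness of the property.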

WL Evaluation is Non-elementary in Data Complexity
In this section we pinpoint the precise complexity of query evaluation for WL. It was proven in [8] that this problem is decidable. Although the precise complexity of this problem was left open in [8], one can prove that this is, in fact, a non-elementary problem by an easy translation from the satisfiability problem for FO formulas, which is known to be nonelementary [15,16]. In databases, however, one is often interested in a different measure of complexity, called data complexity [17], that assumes the formula φ to be fixed. This is a reasonable assumption since databases are usually much bigger than formulas. Often in the setting of data complexity the cost of evaluating queries is much smaller than in the general setting in which formulas are part of the input. The main result of this section is that the data complexity of evaluating WL formulas is nonelementary even over graph databases, which rules out its practicality.
Let φ be a WL formula without free variables. The evaluation problem for φ, denoted Eval(WL,φ), is defined as follows: Given a data graph G, is it the case that G |= φ? We prove the following:

Theorem 4.1. The evaluation problem for WL is non-elementary in data complexity. In particular, for each $k \in \mathbb{Z}_{>0}$, there is a finite alphabet Σ and a Boolean formula φ over Σ, such that the problem Eval(WL,φ) of evaluating the WL formula φ is k-Expspace-hard. In addition, the latter holds even if the input is restricted to the class of graph databases.
We prove the above result by showing that for all natural numbers k, the data complexity of the model checking problem for WL is k-ExpSpace-hard. For all natural numbers k and $f_0$, we provide a reduction from the class of problems solvable by a Turing machine using a tape of size $\mathrm{tower}(k, f_0 n)$ given an input word of size n, where $\mathrm{tower}(1, n) := 2^n$ and $\mathrm{tower}(k+1, n) := 2^{\mathrm{tower}(k, n)}$. More precisely, for all natural numbers k > 0, there is a Turing machine M and a constant $f_0$ such that the following problem is k-ExpSpace-hard: given a word w of size n, is there an accepting run of M over w using at most $\mathrm{tower}(k, f_0 n)$ cells? We prove that there is a formula φ ∈ WL such that for all words w of size n, there is a graph $G_w$ such that

$$G_w \models \varphi \quad \text{iff} \quad \text{there is an accepting run of } M \text{ over } w \text{ using at most } \mathrm{tower}(k, f_0 n) \text{ cells.} \tag{4.1}$$

Before giving a proof, we sketch the case k = 1 here, which illustrates the proof idea. Let M be a Turing machine such that the following problem is ExpSpace-hard: given a word w of size n, is there an accepting run of M over w using at most $2^{f_0 n}$ cells? The formula φ that we will define, and which satisfies equivalence (4.1), is of the form $\exists \pi\, \psi(\pi)$, where ψ is a formula that does not contain any quantification over path variables. Given a word w of size n, the label of the path π in the graph $G_w$ will encode an accepting run of M over the word w in the following way.
Given a word w of size n, consider a configuration C of the run of M over w where the head is scanning cell number $i_0$, the machine is in state q and the content of the tape is the word $w' = w'_0 \ldots w'_j$ (with $j = 2^{f_0 n} - 1$). We may encode the configuration C by the word $e_C = d^C_0 \ldots d^C_j$, where each $d^C_i$ encodes the information in cell number i. More precisely, we define $d^C_i$ as a word of the form

$$d^C_i = c(i)\,(q'_i, w'_i), \tag{4.2}$$

where c(i) and $q'_i$ are defined as follows. The word c(i) is the binary encoding of the number i. The letter $w'_i$ is the content of cell i. The letter $q'_i$ is equal to the dummy symbol $ if the head is not scanning cell number i; otherwise, $q'_i$ is equal to the state q. That is, $q'_{i_0} = q$ and, for all $i \neq i_0$, $q'_i$ = $. We encode a run $C_0 C_1 \ldots$ as the sequence $e_{C_0} e_{C_1} \ldots$.

We think of a path π encoding a run as consisting of two parts: the first part contains the encoding $e_{C_0}$ of the initial configuration and is a path through a subgraph $I_w$ of $G_w$, while the second part contains the encoding $e_{C_1} e_{C_2} \ldots$ and is a path through the subgraph H of $G_w$. If Q is the set of states of M and Σ is the alphabet, we define H as the following graph:

[Figure: the graph H]

Here the number of nodes with outgoing edges with labels 0 and 1 is equal to $f_0 n$. The label of a path π′ from the "left-most" node x to the "right-most" node z with only one occurrence of x is exactly the description of a cell in a configuration: it is the binary encoding of a natural number $< 2^{f_0 n}$ followed by a pair of the form (q′, a). We can define a formula $\varphi_C$ ∈ WL such that for all paths π starting in x and ending in z, $H \models \varphi_C(\pi)$ iff the label of π is the encoding of a configuration.
We do not give details; $\varphi_C$ has to express that the encoding of a configuration has only one tape head, that the first number encoded in binary is 0, that the last number is $2^{f_0 n} - 1$ and that the encoding of the description of cell number j is followed by the description of cell number j + 1. Using the formula $\varphi_C$, we can define a formula $\varphi_1$ such that for all paths π, $H \models \varphi_1(\pi)$ iff the label of π is the encoding of an accepting run. The formula $\varphi_1$ has to ensure that if $e_C e_{C'}$ occurs in the label of π, then C and C′ are consecutive configurations according to M. Moreover, $\varphi_1$ has to express that eventually we reach the final state.

In order to express $\varphi_C$ and $\varphi_1$, we use the ability of WL to check whether two positions correspond to the same node. For example, in order to define $\varphi_1$, since we need to compare consecutive configurations $e_C$ and $e_{C'}$, we need to be able to compare the content of a cell in configuration C and the content of that same cell in C′. In particular, we want to be able to express whether two subpaths $\pi'_0$ and $\pi'_1$ of π starting in x and ending in y correspond to the binary encoding of the same number. Since the length of such subpaths depends on n, we cannot check node by node whether the two subpaths are equal. However, it is sufficient to check that if positions $t_0$ in $\pi'_0$ and $t_1$ in $\pi'_1$ correspond to the same node ($t_0 \sim t_1$), then their successors also correspond to the same node ($(t_0 + 1) \sim (t_1 + 1)$). Note that, using the fact that $\pi'_0$ and $\pi'_1$ are subpaths of π, we will be able to define $\varphi_1$ such that it only contains quantifications over node variables (and no quantifications over path variables). Similarly, in the formula $\varphi_C$, we use the operator ∼ in order to express that two subpaths correspond to the binary encodings of numbers that are successors of each other.
Similarly to the way we define the graph H, we can introduce a graph $I_w$ and a formula $\varphi_0(\pi)$ such that $I_w \models \varphi_0(\pi)$ iff the label of π is the encoding $e_{C_0}$, where $C_0$ is the initial configuration of the run of M over w. By adding an edge from $I_w$ to H, we construct a graph $G_w$ such that for all paths π, $G_w \models \varphi_0(\pi) \wedge \varphi_1(\pi)$ iff the label of π is the encoding of an accepting run over w. Hence, the formula $\varphi := \exists \pi (\varphi_0(\pi) \wedge \varphi_1(\pi))$ satisfies (4.1).
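The encoding of a configuration used in the sketch for k = 1 (each cell i contributes the binary encoding c(i) followed by one letter combining the state marker and the cell content) can be illustrated in Python. This is a sketch under stated assumptions: the function name `encode_configuration` is hypothetical, the parameter `width` plays the role of $f_0 n$, and writing c(i) with the most significant bit first is an arbitrary choice made here.

```python
def encode_configuration(tape, head, state, width):
    """Encode a configuration as the word e_C = d_0 ... d_j.

    tape:  list of tape symbols, of length 2**width
    head:  index of the cell scanned by the head
    state: current state of the machine
    width: number of bits of each cell address (stands for f_0 * n)
    """
    if len(tape) != 2 ** width:
        raise ValueError("tape must have exactly 2**width cells")
    word = []
    for i, symbol in enumerate(tape):
        # c(i): binary encoding of i, padded to `width` bits (assumed MSB first).
        word.extend(format(i, "0{}b".format(width)))
        # The dummy symbol '$' marks cells not scanned by the head.
        marker = state if i == head else "$"
        word.append((marker, symbol))
    return word
```

For instance, a two-cell tape `["a", "b"]` with the head on cell 1 in state `"q"` is encoded as `["0", ("$", "a"), "1", ("q", "b")]`.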
For the case where k > 1, the difficulty in adapting the above proof is that we have to consider Turing machine configurations whose size is bounded by a tower of exponentials of height k. If k > 1, the binary representation of such a bound is not polynomial. The trick is to represent such exponential towers by k-counters. A 1-counter is the binary representation of a number. If k > 1, a k-counter is a word $\sigma_0 l_0 \ldots \sigma_{j_0} l_{j_0}$, where each $\sigma_j$ is a (k − 1)-counter and $l_j \in \{0, 1\}$.
Definition. For all natural numbers k, we consider the alphabet $\Sigma_k = \{a_k, b_k\}$, where $a_k$ and $b_k$ represent 0 and 1 respectively. We define $\Gamma_k$ as the alphabet $\Sigma_1 \cup \cdots \cup \Sigma_k$. A 1-counter of length n is a sequence of the form $l_0 l_1 \ldots l_{f_0 n - 1}$, where for all $0 \le i < f_0 n$, $l_i \in \Sigma_1$. This 1-counter represents the number $\sum_{i=0}^{f_0 n - 1} l_i 2^i$. Recall that if $l_i$ is equal to $a_1$ (resp. $b_1$), then $l_i$ represents 0 (resp. 1).
If k ≥ 2, a k-counter of length n is a sequence of the form $\sigma_0 l_0 \sigma_1 l_1 \ldots \sigma_j l_j$, where for all 0 ≤ i ≤ j, $l_i \in \Sigma_k$, $\sigma_i$ is a (k − 1)-counter representing the number i, and $j = \mathrm{tower}(k - 1, f_0 n) - 1$. This k-counter represents the number $\sum_{i=0}^{j} l_i 2^i$. Again recall that if $l_i$ is equal to $a_k$ (resp. $b_k$), then $l_i$ represents 0 (resp. 1).
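The recursive structure of k-counters can be illustrated with a short Python sketch. The function names `tower`, `counter` and `decode` are hypothetical, the parameter `m` stands for $f_0 n$, and letters of $\Sigma_k$ are written as the strings `"a<k>"` and `"b<k>"`; the decoder assumes k < 10 so that the level is a single character.

```python
def tower(k, n):
    """tower(1, n) = 2**n and tower(k + 1, n) = 2**tower(k, n)."""
    value = n
    for _ in range(k):
        value = 2 ** value
    return value


def counter(k, m, value):
    """Build the k-counter representing `value`, as a list of letters."""
    # A 1-counter has m digits; a k-counter has tower(k - 1, m) digits.
    digits = m if k == 1 else tower(k - 1, m)
    if not 0 <= value < 2 ** digits:
        raise ValueError("value out of range for this counter")
    word = []
    for i in range(digits):
        if k > 1:
            # sigma_i: the (k - 1)-counter giving the index i of the digit.
            word.extend(counter(k - 1, m, i))
        bit = (value >> i) & 1  # l_i, least significant digit first
        word.append(("b" if bit else "a") + str(k))
    return word


def decode(k, word):
    """Recover the number from the level-k letters of a k-counter."""
    bits = [letter[0] for letter in word if letter[1:] == str(k)]
    return sum(2 ** i for i, b in enumerate(bits) if b == "b")
```

The sketch makes the size blow-up visible: a k-counter has tower(k − 1, m) digits, so it can represent numbers up to tower(k, m) − 1.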
A $(k, f_0 n, p)$-description (over an alphabet ∆) is a sequence $\sigma_p d_p \ldots \sigma_j d_j$, where for all p ≤ i ≤ j, $d_i \in \Delta$, $\sigma_i$ is a (k − 1)-counter representing the number i, and $j = \mathrm{tower}(k, f_0 (n - 1)) - 1$. A $(k, f_0 n)$-description (over an alphabet ∆) is a $(k, f_0 n, 0)$-description.
Note that a $(k, f_0 n)$-description over the alphabet $\Sigma_k$ is a k-counter of length n. If ∆ is the alphabet (Q ∪ {$}) × Σ (where Q is the set of states and Σ is the alphabet of the machine), a $(k, f_0 n)$-description over ∆ is of the form $l_0 (x_0, y_0) \ldots l_j (x_j, y_j)$, where $j = \mathrm{tower}(k, f_0 n) - 1$. Hence, if we define c(i) in (4.2) as the k-counter encoding the number i, the encoding of a configuration (as defined above) is nothing but a $(k, f_0 n)$-description.
In particular, if we want to encode a run as the label of a path satisfying some well-chosen formula in a well-chosen graph, we should also be able to encode $(k, f_0 n, p)$-descriptions as labels of paths. We show how to do so in the following lemma.
Notation. Given a path π in a graph over an alphabet ∆, we denote by l(π) the label of π. Given an alphabet ∆′ ⊆ ∆, we denote by $l_{\Delta'}(\pi)$ the trace of l(π) over the alphabet ∆′, that is, the subsequence of l(π) obtained by deleting the letters that do not belong to ∆′.
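The trace of a label over a sub-alphabet is a plain filtering operation, as the following Python sketch shows. The function name `trace` is a hypothetical choice, and labels are assumed to be given as lists of letters.

```python
def trace(label, sub_alphabet):
    """Return the subsequence of `label` made of the letters in `sub_alphabet`."""
    return [letter for letter in label if letter in sub_alphabet]
```

For example, the trace of the label `a b c a b` over the alphabet {a, c} is `a c a`.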
Let G′ = (V′, E′, κ′) be a subgraph of G = (V, E, κ) and let π be a path in G of the form $v_1 a_1 v_2 \ldots v_{n-1} a_{n-1} v_n$, where $(v_i, a_i, v_{i+1}) \in E$ for all 1 ≤ i < n. Assume that there are $i_0$ and $i_1$ such that $i_0 \le i_1$ and the nodes of π that belong to V′ are exactly $v_{i_0}, \ldots, v_{i_1}$; that is, once the path leaves G′, it never goes back to G′. Then we define the trace π′ of π on G′ as the subpath $v_{i_0} a_{i_0} v_{i_0 + 1} \ldots v_{i_1}$. In order to make notation easier, we also abbreviate the formula […]

Lemma 4.2. […] $\varphi^{\Delta}_{k,n,p}(\pi)$ iff the label l(π) of π satisfies the following conditions: […] We let $\varphi^{\Delta}_{k,n}(\pi)$ be an abbreviation for $\varphi^{\Delta}_{k,n,0}(\pi)$. Moreover, if $\Delta = \Sigma_k$, then there are formulas $succ_{k,n}(\pi, \pi')$, $num^i_{k,n}$ (1 ≤ i ≤ n), $last_{k,n}$ and $eq_{k,n}(\pi, \pi')$ such that for all paths π and π′ […]

Proof. The formulas and the graph are defined by induction on k. Suppose first that k = 1 and $\Delta = \Sigma_1$. We define $G_0$ as the following graph:

[Figure: the graph $G_0$]

Here the number of nodes with outgoing edges with labels $a_1$ and $b_1$ is equal to $f_0 n$. The label N is an additional label that we introduce in order to simplify the notation in the formulas.
We let $G^{\Delta}_{1,n}$ be the graph $G_0$. We now define the formula $\varphi^{\Delta}_{1,n}$. In fact, any path π over $G_0$ starting with the node with no incoming edge and ending with the node with no outgoing edge will be such that $l_{\Sigma_1}(\pi)$ is the encoding of a 1-counter. Hence, we can define $\varphi^{\Delta}_{1,n}$ as the conjunction of the formula $\exists s^\pi\, \neg \exists t^\pi\, (t < s)$ and the formula $\exists s^\pi\, \neg \exists t^\pi\, (s < t)$. We now show how to define the formulas $num^i_{1,n}(\pi)$ (by induction on i), $eq_{1,n}(\pi, \pi')$ and $last_{1,n}(\pi)$. For the formula $last_{1,n}(\pi)$, a path π corresponds to the encoding of the number $2^{f_0 n} - 1$ iff we always choose the node with label $b_1$ or, equivalently, iff we never choose the node with label $a_1$. Hence, we may define $last_{1,n}(\pi)$ as the formula $\neg \exists s^\pi\, a_1(s)$.
For the formula $eq_{1,n}(\pi, \pi')$, two paths π and π′ correspond to the same number iff π and π′ are equal. Since π and π′ are simple paths with the same starting node, this is equivalent over graph databases (where each node carries a different data value) to the fact that the following formula holds: […]

The formula $num^i_{1,n}(\pi)$ is defined by induction on i. If i = 0, the path π encodes the number 0 iff we always choose the node with label $a_1$ or, equivalently, iff we never choose the node with label $b_1$, which is expressed by $\neg \exists s^\pi\, b_1(s)$. For the induction case, the path π encodes the number i + 1 iff there is a path π′′ encoding the number i and the number encoded by π is the successor of the number encoded by π′′. Hence, we can define $num^{i+1}_{1,n}(\pi)$ as the formula $\exists \pi'' (num^i_{1,n}(\pi'') \wedge succ_{1,n}(\pi'', \pi))$.

In order to finish the base case, it remains to define the formula $succ_{1,n}(\pi, \pi')$. Basically, we have to simulate addition in binary. If $x_1 \ldots x_{f_0 n}$ is the binary encoding of a number $i < 2^{f_0 n} - 1$, then the binary encoding of the number i + 1 is the sequence $y_1 \ldots y_{f_0 n}$, where $y_m$ is equal to
(a) 1 if $x_m = 0$ and all the elements $x_{m+1}, \ldots, x_{f_0 n}$ are equal to 1,
(b) 0 if $x_m = 1$ and all the elements $x_{m+1}, \ldots, x_{f_0 n}$ are equal to 1,
(c) 0 if $x_m = 0$ and there is an element in the sequence $x_{m+1} \ldots x_{f_0 n}$ that is equal to 0,
(d) 1 if $x_m = 1$ and there is an element in the sequence $x_{m+1} \ldots x_{f_0 n}$ that is equal to 0.
Case (a) can be expressed by the following formula: […] The other cases can be treated similarly. This finishes the base case.
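The four cases (a)-(d) above amount to flipping a bit exactly when all later bits are equal to 1. The following Python sketch applies this rule position by position and can be checked against ordinary integer addition; the function name `successor_bits` is hypothetical, and bits are assumed to be listed with the least significant bit last, as in the case analysis.

```python
def successor_bits(x):
    """Apply cases (a)-(d) to the bit list x = [x_1, ..., x_m].

    Assumes x encodes a number strictly below 2**m - 1, with the least
    significant bit in the last position.
    """
    y = []
    for m, bit in enumerate(x):
        # For the last position this holds vacuously, so that bit always flips.
        later_all_ones = all(b == 1 for b in x[m + 1:])
        if later_all_ones:
            y.append(1 - bit)  # cases (a) and (b): the bit is flipped
        else:
            y.append(bit)      # cases (c) and (d): the bit is unchanged
    return y
```

Each output bit depends only on the corresponding input bit and on the bits to its right, which is the kind of local condition that the case-by-case formulas express.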
We turn now to the induction step. If $\Delta = \{d_1, \ldots, d_l\}$, we define $G^{\Delta}_{k+1,n}$ as the following graph:

[Figure: the graph $G^{\Delta}_{k+1,n}$]

The edge with label $i^{\Delta}_{k+1,n}$ and the edge with label $\Delta^{f}_{k+1}$ are pointing to the initial node in $G^{\Sigma_k}_{k,n}$. The edge with label $\Delta_{k+1}$ is an edge starting from the final node in $G^{\Sigma_k}_{k,n}$. We now define the formula $\varphi^{\Delta}_{k+1,n,p}(\pi)$. The intuition is as follows. We encode a $(k + 1, f_0 n, p)$-description $\sigma_p d_p \ldots \sigma_j d_j$ as a path π starting with the edge with label $i^{\Delta}_{k+1,n}$ and ending with the edge with label $f^{\Delta}_{k+1,n}$. Each k-counter $\sigma_i$ will correspond to a path through the subgraph $G^{\Sigma_k}_{k,n}$, while $d_i$ will correspond to the label of an edge occurring after the edge with label $\Delta_{k+1}$. The formula $\varphi^{\Delta}_{k+1,n,p}(\pi)$ needs to ensure that the following hold:
(a) The first edge of π is the edge with label $i^{\Delta}_{k+1,n}$.
(b) Each "passage" of the path π through the graph $G^{\Sigma_k}_{k,n}$ corresponds to the encoding of a k-counter. To express this, we will use the formula $\varphi^{\Sigma_k}_{k,n}(\pi)$ given by the induction hypothesis.
(c) The first time the path π "goes through" the graph $G^{\Sigma_k}_{k,n}$ corresponds to the encoding of the number p.
(d) Two successive "passages" of π through the graph $G^{\Sigma_k}_{k,n}$ correspond to two successive k-counters.
(e) The edge with label $f^{\Delta}_{k+1,n}$ occurs after the edge with label $\Delta^{f}_{k+1}$ iff the last passage of the path π through the graph $G^{\Sigma_k}_{k,n}$ corresponded to the encoding of the number $\mathrm{tower}(k, f_0 n)$. This ensures that we fully encode a $(k + 1, f_0 n)$-description, and not a subsequence of it.
We only show how to express (b), as this is one of the most difficult cases and the other ones can be treated similarly.
For (b) we have to express that each passage of π through the graph $G^{\Sigma_k}_{k,n}$ corresponds to the encoding of a k-counter. Recall that by the induction hypothesis, since a (k, […] Hence, in order to express (b), it is enough to ensure that if s is the first node of a passage of π through $G^{\Sigma_k}_{k,n}$ and if t is the last node of that same passage, then the formula $\varphi^{\Sigma_k}_{k,n}(\pi_{s,t})$ holds. […] and t is the last node of that same passage.

We define $IF_{k,n}(s, t, \pi)$ as the formula […] This is equivalent to saying that s is the initial node of […] and the path "never goes out" of the graph $G^{\Sigma_k}_{k,n}$ (this can be enforced by imposing that we do not go through the edge with label $d_i$ for some i).

We now define the formula $\chi_1(s, t, \pi)$ expressing condition (b), that is, if s is the first node of a passage of π through $G^{\Sigma_k}_{k,n}$ and if t is the last node of that same passage, then the formula $\varphi^{\Sigma_k}_{k,n}(\pi_{s,t})$ holds. By (4.3), we may define $\chi_1(s, t, \pi)$ as the formula […]

We turn now to the definitions of the formulas $succ_{k+1,n}(\pi)$, $num^i_{k+1,n}(\pi)$, $eq_{k+1,n}(\pi, \pi')$ and $last_{k+1,n}(\pi)$. The formulas $succ_{k+1,n}(\pi)$, $num^i_{k+1,n}(\pi)$ and $last_{k+1,n}(\pi)$ are defined in a similar fashion as in the base case (k = 1).
In order to define the formula $eq_{k+1,n}(\pi, \pi')$, let π and π′ be two paths satisfying the formula $\varphi^{\Sigma_{k+1}}_{k+1,n}$. Recall that π corresponds to the encoding of a (k + 1)-counter where each $\sigma_i$ corresponds to a passage $\pi_{s,t}$ of π through $G^{\Sigma_k}_{k,n}$ and $d_i$ corresponds to the label of an edge occurring right after that passage. Given the structure of the graph $G^{\Delta}_{k+1,n}$, that edge is the incoming edge of the node t + 2.

The paths π and π′ correspond to the encoding of the same (k + 1)-counter if for all passages $\pi_{s,t}$ of π through $G^{\Sigma_k}_{k,n}$ and for all passages $\pi'_{s',t'}$ of π′ through $G^{\Sigma_k}_{k,n}$ such that $\pi_{s,t}$ and $\pi'_{s',t'}$ encode the same k-counter, we have that t + 2 and t′ + 2 are the same nodes. By (4.3) and by the induction hypothesis, this can be expressed by the following formula $eq_{k+1,n}(\pi, \pi')$, given by […] This finishes the proof of Lemma 4.2.
We are now ready to prove Theorem 4.1.
Proof of Theorem 4.1. As explained earlier, we prove that for all Turing machines M and for all k, there is a formula φ ∈ WL such that for all words w of size n, there is a graph $G_w$ such that $G_w \models \varphi$ iff there is an accepting run of M over w using at most $\mathrm{tower}(k, n)$ cells.
Let $(\Sigma, Q, \delta, q_0, q_f)$ be the Turing machine M, where Σ is the input alphabet together with a blank symbol B, $q_0$ is the initial state, $q_f$ is the final state and δ : […] is the transition map, where L stands for "left" and R stands for "right".
The formula φ is a formula of the form $\exists \pi\, \psi(\pi)$, where ψ is a formula that does not contain any quantification over path variables. Given a word w, the label of π in the graph $G_w$ is the encoding of an accepting run of M over the word w. Recall that we encode a configuration of the machine in the following way. Suppose that C is a configuration where the content of the tape is the word $w' = w'_0 \ldots w'_j$, the head is scanning cell number $i_0$ and the machine is in state q. We may encode C by the word $e_C = c(0)(q'_0, w'_0) \ldots c(j)(q'_j, w'_j)$, where c(i), $w'_i$ and $q'_i$ are defined as follows. The word c(i) is the k-counter encoding the number i. The letter $w'_i$ is the content of cell i. The letter $q'_i$ is equal to $ if the head is not scanning cell number i; otherwise, $q'_i$ is equal to the state q. This implies that given a configuration C, the word $e_C$ […]

The run of M over the word w is a sequence of configurations of the form $C_0 C_1 \ldots$. We encode the run as the word $e_{C_0} e_{C_1} \ldots$ (which is a sequence of (k + 1, n)-counters). We will define the formula ψ(π) and the graph $G_w$ in such a way that a path π satisfies ψ iff the projection of the label of π on the alphabet $\Gamma_k \cup \Delta$ is the encoding of an accepting run of M over w.
We think of a path π encoding a run of M over w as consisting of two parts.The label of the first part contains the encoding e C 0 of the initial configuration C 0 .The label of the second part contains the encoding e C 1 e C 2 . . . of the remaining part of the run.The first part of the path π is a path in a subgraph I w of G w , while the second part is a path in the subgraph H (independent of w) of G w .The graph G w will be obtained by adding an edge from a node of I w to a node of H.
We start by defining the graph H. Recall that ∆ is the alphabet (Q ∪ {$}) × Σ.The graph H is defined as the graph G ∆ k,n with an additional edge from the final node to the initial node.Hence, it follows from the proof of Lemma 4.2 that H is the following graph, where ∆ = {d 1 , . . ., d l } and where the edges with label i ∆ k,n and ∆ k+1 f are edges pointing to the initial node of G Σ k k,n and the edge with label ∆ k+1 is an edge starting from the final node of G Σ k k,n .In the above paragraphs, any edge pointing to the graph G Σ k k−1,n is an edge pointing to the initial node of that graph.Similarly, any edge starting from the graph G Σ k k−1,n will always be referring to an edge starting in the final node of the graph.
Recall that if a path π encodes a run C 0 C 1 C 2 . . ., the trace of π on H will encode the part C 1 C 2 . . . of the run. Each configuration C i is encoded as a (k + 1, f 0 n)-description over ∆, which will correspond, as in Lemma 4.2, to a passage of the path π from the initial node to the final node of G Σ k k−1,n . We now define the graph I w encoding the initial configuration of the tape. Recall that in the initial configuration, the tape contains the word w = w 0 . . . w n−1 , all the cells with number ≥ n contain the blank symbol B, the head is scanning the first cell and the state is q 0 . The graph I w is obtained by "assembling" the subgraphs K 0 , . . ., K n−1 and K, which we define next. For each i < n, the graph K i is such that the label of its unique maximal path is the encoding of the cell number i in the initial configuration. The trace of the path π on the graph K is the encoding of the contents of the cells with number ≥ n in the initial configuration.
More precisely, we define the graphs K 0 , . . ., K n−1 and K in the following way. The graph K 0 is the following graph.
[Figure: the graph K 0 ]
The node with label in will be the starting node of the path π. Since the trace of π on K 0 encodes the content of the cell with number 0 in the initial configuration, its label must contain the k-counter encoding the number 0 followed by the letter (q 0 , w 0 ) of the alphabet ∆ (indicating that the first cell contains the letter w 0 , the head is scanning the first cell and the current state is q 0 ). Using Lemma 4.2 and the formula num 0 k,n , we will impose that the passage of π through the subgraph G Σ k k,n of K 0 corresponds to the encoding of the number 0.
Next, for all 1 ≤ i < n, we define K i as the following graph.
[Figure: the graph K i ]
Recall that we want to define K i in such a way that the trace of π on K i is the encoding of the contents of the cell with number i in the initial configuration (that is, it contains the letter w i and the head is not scanning the cell since i ≠ 0). Recall that the encoding of such a cell (and its content) is given by c(i)($, w i ), where c(i) is the k-counter encoding the number i. We will use the formula num i k,n given by Lemma 4.2 to express that the passage of π through the subgraph G Σ k k,n of K i corresponds to the k-counter encoding i.
Finally we define the graph K as the graph where ∆ B is the one-letter alphabet containing the blank symbol B. Recall that the trace of π on the graph K will encode the contents of the cells with number ≥ n in the initial configuration (that is, the fact that those cells contain the blank symbol and are not scanned by the head).Since the encoding of such a cell with number i is given by (where c(i) is the k-counter encoding i), the label of the trace of π on K must contain the word c(n + 1)(B, $) . . .c(j)(B, $), where j = tower (k, n) − 1.That is, the label of the trace of π on K is the unique (k, f 0 n, n + 1)-description over the alphabet ∆ B .We will express that the passage of π through the graph G ∆ B k,n corresponds to the (k, f 0 n, n + 1)-description over the alphabet ∆ B using the formula φ ∆ B k,n,n+1 provided by Lemma 4.2.We are now ready to define the graph I w which is obtained by assembling the graphs previously introduced in the following way.
Each edge between two graphs in the picture above is an edge from the "left-most" node of the first graph to the "right-most" node of the second graph. Finally, the graph G w is obtained by taking the union of the graphs I w and H and adding an edge from the final node of K to the initial node of H.
Now that we have defined the graph G w , we are ready to define the formula ψ. The formula ψ(π) is obtained as the conjunction of the following formulas.
(A) First we need to express that the path π starts with the edge with label in.
(B) We need to express that eventually in a configuration, the machine reaches the final state q f .
(C) We also have to express that each passage of the path π from the initial node of G Σ k k,n to the final node of G Σ k k,n in the graph H corresponds to the encoding of a (k + 1, f 0 n)-description.
(D) We have to express that for all i < n, the trace of π on the subgraph G Σ k k,n of the graph K i corresponds to the k-counter encoding i.
(E) We need to express that the trace of the path π on the subgraph G ∆ B k+1,n of K is the unique (k + 1, f 0 n, n + 1)-description over the alphabet ∆ B .
(F) Finally we need to express how we move from one configuration of the tape to the next one.
Cases (A) and (B) are straightforward. Cases (C), (D) and (E) are similar, so we only give details for case (C) and case (F). By Lemma 4.2, case (C) means that if π s,t is the subpath of π corresponding to such a passage, then
φ ∆ k+1,n,0 (π s,t ) holds. (4.4)
The node s is a node satisfying i Σ k k,n , while t is the "closest" node to s with an incoming edge with label f ∆ k+1,n . This is expressed by the following formula.
Finally we treat the most difficult case, which is case (F). We need to express how we move from one configuration of the tape to the next one. Recall that the trace of π on the graph H will contain the encoding of the sequence C 1 C 2 . . . of the run, where C 0 C 1 . . . is the full run of the machine on the input w.
Let π s,t be the subpath of π corresponding to the encoding of a configuration C i and let π s ′ ,t ′ be the subpath of π corresponding to the configuration C i+1 .We need to express how to move from the configuration C i to the configuration C i+1 .Suppose that in the configuration C i , the current state is q, and the head is scanning the cell c containing the letter u.Suppose also that δ(q, u) = (q ′ , v, R) (we can treat similarly the case where the head moves to the left).In order to keep our formulas simpler, we use a slightly different definition of a run of a Turing machine, but it would be clear that the notion of run that we use here, can be simulated by a usual Turing machine.Here, we assume that if the machine scans a cell c with content u and δ(q, u) = (q ′ , v, R), then in the next state, the machine scans the successor c ′ of c, the content of c ′ is v, while the content of c is u (in the usual definition, the content of c ′ is unchanged, while the content of c is v).
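The modified notion of run can be made concrete with a small sketch. The Python function below is only an illustration: the names `step`, `tape`, `head` and the dictionary representation of δ are assumptions of this sketch, and the treatment of left moves (writing v into the predecessor cell) is one reading of "similarly" rather than something stated in the text.

```python
def step(tape, head, state, delta):
    """One step of the modified run: the letter v is written into the cell
    the head moves to, and the cell that was just scanned keeps its content."""
    new_state, written, direction = delta[(state, tape[head])]
    # Assumption: "R" moves to the successor cell, anything else to the predecessor.
    new_head = head + 1 if direction == "R" else head - 1
    if not 0 <= new_head < len(tape):
        raise IndexError("head left the bounded tape")
    new_tape = list(tape)          # copy, so the old configuration is preserved
    new_tape[new_head] = written   # content of c' becomes v; content of c stays u
    return new_tape, new_head, new_state


# Example with the single transition delta(q, u) = (p, v, R).
example_delta = {("q", "u"): ("p", "v", "R")}
next_tape, next_head, next_state = step(["u", "x", "B"], 0, "q", example_delta)
```

In the example, the scanned cell still holds u afterwards, which is exactly the difference from the usual definition that the text points out.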
Let π r,s be the subpath of π corresponding to the encoding of the cell c in the configuration C i . Let π r ′ ,s ′ be the encoding of an arbitrary cell c ′ in the configuration C i+1 . If c ′ is the successor of the cell c, then the head should scan the cell c ′ and the content of c ′ should be the letter v. We express this by a formula change R (q,u,q ′ ,v) (r, s, r ′ , s ′ , π). Recall that by Lemma 4.2, succ k,n (π r,s , π r ′ ,s ′ ) is the formula expressing that the k-counter associated with π r ′ ,s ′ is the successor of the k-counter associated with π r,s .
If c ′ is not the successor of the cell c, then the head is not scanning the cell c ′ and its content remains unchanged. If π x,y is the subpath of π corresponding to the content of the cell c ′ in the configuration C i , this is expressed by the following formula stay R (q,u,q ′ ,v) (r, s, x, y, r ′ , s ′ , π) defined by where q ′′ ∈ Q ∪ {$}. Recall that by Lemma 4.2, eq k,n (π x,y , π r ′ ,s ′ ) expresses that the k-counters associated with π x,y and π r ′ ,s ′ are the same. Now we need to express that the paths π r,s , π r ′ ,s ′ and π x,y correspond to the encodings of k-counters. By Lemma 4.2, this means that those paths correspond to passages of π through the graph G Σ k k,n . Similarly to (4.5), we introduce a formula expressing that the path π r,s corresponds to the encoding of a k-counter.
Next we also need a formula to assert that the paths π r,s and π r ′ ,s ′ appear in the encodings of successive configurations (and similarly, that the paths π x,y and π r ′ ,s ′ appear in the encodings of successive configurations). Since the encoding of a configuration starts with the unique edge with label i ∆ k+1,n (and that edge only occurs at the beginning of the encoding of a configuration), this is equivalent to saying that there is a unique edge between s and r ′ with label i ∆ k+1,n . This is expressed by a formula config (s, r ′ , π). We are now ready to define the formula θ R q,u,q ′ ,v (π). It expresses the following. Suppose that π r,s , π x,y and π r ′ ,s ′ are k-counters encoding the numbers of three cells (this corresponds to (4.7)). Suppose that π r,s and π x,y correspond to cells occurring in the same configuration C and that the cell corresponding to π r ′ ,s ′ occurs in the next configuration C ′ (this is expressed by (4.8)). Then, if we "apply" the transition δ(q, u) = (q ′ , v, R) to move from C to C ′ , we move the head to the right and update the content of the cell being scanned (as expressed by the formula change R (q,u,q ′ ,v) (r, s, r ′ , s ′ , π)) and we leave the other cells unchanged (as expressed by the formula stay R (q,u,q ′ ,v) (r, s, x, y, r ′ , s ′ , π)). We now define the formula θ R (π) as the conjunction of the formulas in {θ R q,u,q ′ ,v (π) : δ(q, u) = (q ′ , v, R)}. This formula expresses how we move from one configuration to another when the head moves to the right. Similarly, we can define a formula θ L (π) expressing how we move from one configuration to another when the head moves to the left.
This finishes the proof of Theorem 4.1.
As a corollary to the proof of Theorem 4.1, we obtain that data complexity is nonelementary even for simple WL formulas that talk about a single path in a graph database.
Corollary 4.3.The evaluation problem for WL over graph databases is non-elementary in data complexity, even if restricted to Boolean WL formulas of the form ∃πψ, where ψ uses no path quantification and contains no position variable of sort different than π.

Register Logic
We saw in the previous section that WL is impractical due to its very high data complexity. In this section, we start by recalling the notion of regular expressions with memory (REM) and their basic results from [10]. In our view, this logic is rather limited in terms of expressive power. For instance, the query (Q) from the introduction cannot be expressed in REM. We then introduce an extension of REM, called register logic (RL), that remedies this limitation in expressive power (in fact, it can express many natural examples of queries expressible in WL, e.g., those given in [8]) while retaining elementary complexity of query evaluation. Finally, we study which fragments of RL are well-behaved for database applications.
5.1. Regular expressions with memory. REMs define pairs of nodes in data graphs that are linked by a path that satisfies a constraint on the way in which the topology interacts with the underlying data. REMs allow us to remember data values and use them later. Data values are stored in k registers r 1 , . . ., r k . At any point we can compare a data value with one previously stored in the registers. As an example, consider the REM ↓ r.a + [r = ]. This can be read as follows: store the current data value in register r (represented by the expression ↓ r), and then check that after reading a word in a + we see the same data value again (condition [r = ]). We formally define REM next.
Let r 1 , . . ., r k be registers.The set of conditions c over {r 1 , . . ., r k } is recursively defined as: That is, REM extends the class of regular expressions e -which is a popular mechanism for specifying topological properties of paths in graph databases (see, e.g., [18,2]) -with expressions of the form e[c], for c a condition, and ↓ r.e, for r a tuple of registers -that define how such topology interacts with the data.
Semantics: To define the evaluation e(G) of an REM e over a data graph G = (V, E, κ), we use a relation e G that consists of tuples of the form (u, λ, ρ, v, λ ′ ), for u, v nodes in V , ρ a path in G from u to v, and λ, λ ′ two k-tuples over D ⊥ .The intuition is the following: the tuple (u, λ, ρ, v, λ ′ ) belongs to e G if and only if the data and topology of ρ can be parsed according to e, with λ being the initial assignment of the registers, in such a way that the final assignment is λ ′ .We then define e(G) as the pairs (u, v) of nodes in G such that (u, ⊥ k , ρ, v, λ) ∈ e G , for some path ρ in G from u to v and k-tuple λ over D ⊥ .
We inductively define relation e G below.We assume that λ r=d , for d ∈ D, is the tuple obtained from λ by setting all registers in r to be d.Also, if Then we define: , where e 1 G • e 2 G is the set of tuples (u, λ, ρ, v, λ ′ ) such that (u, λ, ρ 1 , w, λ ′′ ) ∈ e 1 G and (w, λ ′′ , ρ 2 , v, λ ′ ) ∈ e 2 G , for some w ∈ V , k-tuple λ ′′ over D ⊥ , and paths For each REM e, we will use the shorthand notation e * to denote ε ∪ e + .
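To make the inductive definition more tangible, here is a minimal Python sketch of an evaluator for REMs over a small data graph. It is an illustration under stated assumptions, not the construction of [10]: the tuple-based syntax trees (`"eps"`, `"sym"`, `"union"`, `"concat"`, `"plus"`, `"test"`, `"store"`), the function names, and the clause-by-clause semantics (a condition is checked against the data value of the last node, and ↓ stores the data value of the first node) are our reading of the definitions above. The sketch drops the path component ρ and only tracks tuples (u, λ, v, λ ′ ), which keeps the relation finite.

```python
from itertools import product

BOT = None  # plays the role of the empty register value ⊥


def holds(cond, d, lam):
    """Does data value d, together with register tuple lam, satisfy cond?"""
    kind = cond[0]
    if kind == "eq":                      # register cond[1] holds the value d
        return lam[cond[1]] == d
    if kind == "not":
        return not holds(cond[1], d, lam)
    if kind == "and":
        return holds(cond[1], d, lam) and holds(cond[2], d, lam)
    raise ValueError(f"unknown condition {kind!r}")


def compose(r1, r2):
    """Relational composition: end node and registers of r1 feed into r2."""
    return {(u, l, v, l2)
            for (u, l, w, l1) in r1
            for (w2, l1b, v, l2) in r2
            if w == w2 and l1 == l1b}


def relation(e, nodes, edges, kappa, k):
    """All tuples (u, lam, v, lam2) such that some path from u to v parses as e."""
    lams = list(product(set(kappa.values()) | {BOT}, repeat=k))

    def ev(e):
        kind = e[0]
        if kind == "eps":
            return {(u, l, u, l) for u in nodes for l in lams}
        if kind == "sym":
            return {(u, l, v, l) for (u, a, v) in edges if a == e[1] for l in lams}
        if kind == "union":
            return ev(e[1]) | ev(e[2])
        if kind == "concat":
            return compose(ev(e[1]), ev(e[2]))
        if kind == "plus":                # least fixpoint of repeated composition
            base = ev(e[1])
            result = set(base)
            while True:
                bigger = result | compose(result, base)
                if bigger == result:
                    return result
                result = bigger
        if kind == "test":                # e[c]: check c at the last node
            return {t for t in ev(e[1]) if holds(e[2], kappa[t[2]], t[3])}
        if kind == "store":               # store kappa(u) in registers e[1], then parse e[2]
            inner = ev(e[2])
            return {(u, l, v, l2)
                    for (u, l1, v, l2) in inner
                    for l in lams
                    if tuple(kappa[u] if i in e[1] else l[i] for i in range(k)) == l1}
        raise ValueError(f"unknown expression {kind!r}")

    return ev(e)


def pairs(e, nodes, edges, kappa, k):
    """e(G): pairs (u, v) reachable when starting from the empty assignment."""
    bottom = (BOT,) * k
    return {(u, v) for (u, l, v, _) in relation(e, nodes, edges, kappa, k) if l == bottom}


# The REM "store in r, read a+, then check r=" on a three-node path graph.
example_rem = ("store", (0,), ("test", ("plus", ("sym", "a")), ("eq", 0)))
example_nodes = {1, 2, 3}
example_edges = {(1, "a", 2), (2, "a", 3)}
example_kappa = {1: "x", 2: "y", 3: "x"}
example_pairs = pairs(example_rem, example_nodes, example_edges, example_kappa, 1)
```

On the example graph only the pair (1, 3) is returned, since nodes 1 and 3 are the only ones joined by an a-labeled path whose endpoints carry the same data value. The sketch enumerates all register assignments and is therefore exponential in k; it is meant to clarify the semantics, not to reflect the complexity bounds discussed in this section.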
Example 5.1. The REM Σ * • (↓r.Σ + [r = ]) • Σ * defines the pairs of nodes in data graphs that are linked by a path in which two nodes have the same data value. The REM ↓ r.(a[¬r = ]) + defines the pairs of nodes that are linked by a path ρ with label in a + , such that the data value of the first node in the path is different from the data value of all other nodes in ρ. ✷
The problem Eval(REM) is the following: given a data graph G = (V, E, κ), a pair (v 1 , v 2 ) of nodes in V , and an REM e, is (v 1 , v 2 ) ∈ e(G)? The data complexity of the problem refers again to the case when e is considered to be fixed. REMs are tractable in data complexity and have no worse combined complexity than FO over relational databases:
Proposition 5.1 ([10]). Eval(REM) is Pspace-complete, and Nlogspace-complete in data complexity.
5.2. Register logic. REM is well-behaved in terms of the complexity of evaluation, but its expressive power is rather rudimentary for expressing several data/topology properties of interest in data graphs. As an example, the query (Q) from the introduction - which can be easily expressed in WL - cannot be expressed as an REM (we actually prove a stronger result later). The main shortcomings of REM in terms of its expressive power are its inability to (i) compare data values in different paths and (ii) express branching properties of the data.
In this section, we propose register logic (RL) as a natural extension of REM that makes up for this lack of expressiveness.We borrow ideas from the logic CRPQ ¬ , presented in [4], that closes the class of regular path queries [7] under Boolean combinations and existential node and path quantification.In the case of RL we start with REMs and close them not only under Boolean combinations and node and path quantification -which allow to express arbitrary patterns over the data -but also under register assignment quantification -which permits comparing data values in different paths.We also prove that the combined complexity of the evaluation problem for RL is elementary (Expspace), and, thus, that in this regard RL is in stark contrast to WL.
Intuitively, ν = ⊥ holds iff ν is the empty register assignment, (x, π, y) checks that π is a path from x to y, and e(π, ν, ν ′ ) checks that π can be parsed according to e starting from register assignment ν and finishing in register assignment ν ′ .The quantifier ∃ν is to be read "there exists an assignment of data values in the data graph to the registers".
Let G = (V, E, κ) be a data graph over Σ and φ a RL formula over Σ and {r 1 , . . ., r k }.Assume that D is the set of data values that are mentioned in G, i.e., D = {κ(v) : v ∈ V }.An assignment α for φ over G is a mapping that assigns (i) a node in V to each free node variable x in φ, (ii) a path ρ in G to each free path variable π in φ, and (iii) a tuple λ in D k ⊥ to each register variable ν that appears free in φ.That is, for safety reasons we assume that α(ν) can only contain data values that appear in the underlying data graph.This represents no restriction for the expressiveness of the logic.
We inductively define (G, α) |= φ, for G a data graph, φ an RL formula, and α an assignment for φ over G, as follows (we omit equality atoms and Boolean combinations since they are standard): Thus, each REM e is expressible in RL using the formula: Example 5.2.Recall query (Q) from the introduction: Find pairs of nodes x and y in a graph database, such that there is a node z and a path π from x to y in which each node is connected to z.This query can be expressed in RL over Σ = {a} and a single register r as follows: ∃π (x, π, y) ∧ ∃z∀ν(e 1 (π, ν, ν) → ∃z ′ ∃π ′ ((z ′ , π ′ , z) ∧ e 2 (π ′ , ν, ν))) , where e 1 := a * [r = ]•a * is the REM that checks whether the node (i.e.data) stored in register r appears in a path, and e 2 := ε[r = ] • a * is the REM that checks if the first node of a path is the one that is stored in register r.
In fact, this formula defines the pairs of nodes x and y such that there exists a path π that goes from x to y and a node z for which the following holds: for every register value ν (i.e., for every node ν) such that e 1 (π, ν, ν) (i.e., node ν is in π), it is the case that there is a path π ′ from some node z ′ to z such that e 2 (π ′ , ν, ν) (i.e., z ′ = ν and π ′ connects ν to z). Notice that this uses the fact that the underlying data model is that of graph databases, in which each node is uniquely identified by its data value. ✷
The limitations in expressive power of REM have also been independently recognized by Libkin, Martens and Vrgoč [12]. In order to allow for interesting data value comparisons while retaining reasonable complexity of evaluation, they propose to use query languages based on the XML language XPath. These languages are not comparable in terms of expressive power to the ones we study here.

Complexity of evaluation for RL:
The evaluation problem for RL, denoted Eval(RL), is as follows: Given a data graph G, an RL formula φ, and an assignment α for φ over G, is it the case that (G, α) |= φ?As before, we denote by Eval(RL,φ) the evaluation problem for the fixed RL formula φ.
We show next that, unlike WL, register logic RL can be evaluated in elementary time, and, actually, with only one exponential jump over the complexity of evaluation of REMs: Theorem 5.2.Eval(RL) is Expspace-complete.The lower bound holds even if the input is restricted to graph databases.
Proof. We start by proving the upper bound, that is, Eval(RL) is in Expspace. The structure of the proof is quite similar to the one showing that CRPQ ¬ queries can be evaluated in Pspace in combined complexity [4]. The difference is that now we have to accommodate the extra expressive power of RL, which allows expressing properties of register values and checking acceptance of data walks by REMs.
Let τ be a first-order (FO) vocabulary Nodes, Paths, Registers, Endpoints, e 1 , . . ., e m , ⊥ , where (a) Nodes, Paths and Registers are unary relation symbols, (b) Endpoints and e i (1 ≤ i ≤ m) are ternary relation symbols, and (c) ⊥ is a constant.We define, from G, an FO structure M G over τ as follows: The domain of M G is the disjoint union of V , all the paths that belong to G, and all k-tuples over D ⊥ .(Notice that each node in V is also a path in G, but here we consider them to be different objects.That is, each v ∈ V appears separately as a node and as a path in the domain of M G ).The constant ⊥ is interpreted in M G as the tuple ⊥ k .The interpretation of Nodes in M G contains all those elements of the domain that are nodes.The interpretation of Paths in M G contains all those elements of the domain that are paths.The interpretation of Registers in M G contains all those elements of the domain that are k-tuples over D ⊥ .The interpretation of the ternary relation Endpoints contains all tuples (v, ρ, v ′ ) such that ρ is a path in G from node v to node v ′ .Finally, the interpretation of the symbol e i (1 ≤ i ≤ m) contains all tuples (λ, ρ, λ ′ ) such that ρ is a path in G, λ, λ ′ are k-tuples over D ⊥ , and e i (ρ, λ, λ ′ ).
Clearly, G, α φ iff M G , α φ τ .Of course, M G cannot be effectively constructed from G since the set of paths in G is potentially infinite, and, thus, M G is also potentially infinite.However, it is possible to prove that there exists a finite structure M ′ G,ρ such that G, α φ iff M ′ G,ρ , α φ τ .We show how to define M ′ G,ρ next.Assume that the quantifier rank of φ τ is k ≥ 0, where the quantifier rank of an FO formula θ is the depth of nested quantification in θ.Let E ⊆ {e 1 , . . ., e m }×D k ⊥ ×D k ⊥ .A path ρ in G satisfies E if the following holds: For each triple (e i , λ, λ ′ ) ∈ {e 1 , . . ., e m } × D k ⊥ × D k ⊥ , it is the case that G, α e i (ρ, λ, λ ′ ) iff (e i , λ, λ ′ ) ∈ E. (Notice that for each path in G there is one, and only one, subset E of {e 1 , . . ., e m } × D k ⊥ × D k ⊥ that it satisfies.)For each pair (v, v ′ ) of nodes in V , and for every E ⊆ {e 1 , . . ., e m } × D k ⊥ × D k ⊥ , let c E,v,v ′ ≥ 0 be the minimum between k + |ρ| and the number of paths in G that go from v to v ′ and satisfy E. We arbitrarily pick, for each pair (v, v ′ ) of nodes in V and for each that satisfy E. We define the structure M ′ G,ρ as follows: Its domain contains all the nodes of V , each path ρ that belongs to the tuple ρ, every path of the form G,ρ contains all those elements of the domain that are paths.The interpretation of Registers in M G,ρ contains all those elements of the domain that are k-tuples over D ⊥ .The interpretation of the ternary relation Endpoints contain all tuples of the form (v, ρ, v ′ ), where v, v ′ ∈ V and ρ is a path in the domain that goes from v to v ′ in G. Finally, the interpretation of e i (1 contains all tuples (λ, ρ, λ ′ ) such that ρ is a path in the domain, λ, λ ′ are k-tuples over D ⊥ , and e i (ρ, λ, λ ′ ).
By using a standard Ehrenfeucht-Fraïssé argument it is possible to prove the following:
Proof. We show that the duplicator has a winning strategy in the k-round Ehrenfeucht-Fraïssé game played on (M G , v, ρ, λ) and (M ′ G,ρ , v, ρ, λ). The duplicator's response to a spoiler move in round i ≤ k is (inductively) defined as follows (we assume without loss of generality that the spoiler never repeats moves, i.e., in no round does the spoiler choose an element that has already been chosen by either player in previous rounds):
• if the spoiler's move in round i is a node in either of the two structures, then the duplicator responds by mimicking the spoiler's move on the other structure;
• if the spoiler's move in round i is a k-tuple over D ∪ {⊥} in either of the two structures, then the duplicator responds by mimicking the spoiler's move on the other structure;
• if the spoiler's move in round i is a path ρ in ρ in either of the two structures, then again the duplicator responds by mimicking the spoiler's move on the other structure;
• if the spoiler plays a path ρ from node v to v ′ , in either of the two structures, such that ρ satisfies E ⊆ {e 1 , . . ., e m } × D k ⊥ × D k ⊥ and ρ is not a path in ρ, then the duplicator responds with any path from v to v ′ in the other structure that (1) satisfies E, (2) does not belong to ρ, and (3) has not been previously chosen in the game.
Notice that it is always possible for the duplicator to choose such a path, since for each pair of nodes v, v ′ ∈ V and for each E ⊆ {e 1 , . . ., e m } × D k ⊥ × D k ⊥ , the number of paths from v to v ′ that satisfy E and do not belong to ρ is the same up to k. It is easy to see that the duplicator's response defined in this way always preserves a partial isomorphism between the two structures. This implies that the duplicator has a winning strategy in the k-round Ehrenfeucht-Fraïssé game played on (M G , v, ρ, λ) and (M ′ G,ρ , v, ρ, λ), and, thus, by well-known results, that the structures are indistinguishable by FO sentences of quantifier rank ≤ k.
The previous claim shows that (G, α) |= φ iff (M ′ G,ρ , α) |= φ τ . Thus, a straightforward approach to check whether (G, α) |= φ would be to construct M ′ G,ρ and then evaluate φ τ over it. The problem with this approach is that M ′ G,ρ could be of double exponential size (because there is a double exponential number of different subsets E of {e 1 , . . ., e m } × D k ⊥ × D k ⊥ ), and, thus, impossible to construct in exponential space. It will be necessary to follow a different approach.
Assume that φ τ is given in prenex normal form, i.e. φ τ is of the form where each Q i is either ∃ or ∀, each y i is a node, path or register variable, and ψ is quantifier-free (if φ τ is not in prenex normal form, we can convert it in polynomial time into an equivalent formula in prenex normal form).We follow a usual argument to evaluate FO formulas on structures.The main problem with this is that some of the elements in M ′ G,ρ are paths and register values, and have to be treated as such.Therefore, we define a way of encoding paths (in exponential space) and register values (in polynomial space).
Clearly, each register value can be encoded by a tuple of length k • log 2 (|V |). In order to denote that this tuple is the address of a register value (and not, say, of a path), we add an extra bit at the beginning of the tuple which is labeled with a new symbol r. Encoding paths requires a bit of extra work. Each path ρ is encoded by an address, that is, a string that satisfies the following:
• it starts with a new symbol p, which states that this is the address of a path;
• the address continues with the encodings of the two endpoints v and v ′ of the path (separated by some delimiter); this part of the address uses O(log 2 |V |) space;
• then the address encodes the subset E of {e 1 , . . ., e m } × D k ⊥ × D k ⊥ that ρ satisfies; this encoding can be easily expressed with a string of length m × |V | k × |V | k over alphabet {0, 1} that flags with a 1 those elements of {e 1 , . . ., e m } × D k ⊥ × D k ⊥ that belong to E;
• finally, the address contains an encoding of the integer i.
Clearly, the address of a path defined in this way can be specified using at most exponential space.
We show next how the problem of checking whether (v, ρ, λ) belongs to the evaluation of φ τ over M ′ G,ρ can be solved in exponential time by an alternating Turing machine.This will finish the proof of the theorem, since the class of problems that can be solved in exponential time by alternating Turing machines coincides with the class of problems that can be solved in Expspace.
The alternating machine proceeds as follows.It first replaces in φ τ each node variable x in x with the encoding of the corresponding node v of v, each path variable π in π with the encoding (address) of the corresponding path ρ in ρ, and each register variable ν in ν with the encoding of the corresponding tuple λ in λ.Then the machine reads the formula φ τ from left-to-right.Each time it encounters an existential quantifier ∃y i it enters an existential state, and each time it encounters a universal quantifier ∀y i it enters a universal state.In each case, the machine "guesses" the interpretation of y i as the encoding of a node, a path or a register value c(y i ) in the domain.(Since encodings of paths are of exponential size, this alternating machine requires at least exponential time to work).Finally, the machine verifies that ψ(v, ρ, λ, c(y 1 ), . . ., c(y m )) holds, and if that is the case it accepts.We show next that the latter can be done in exponential time.Notice that this implies that the whole process can be performed in exponential time.
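The quantifier-by-quantifier behaviour of the machine can be mirrored by a short recursive procedure. The following Python sketch is only an analogy: it is deterministic, it assumes that the domain is given as an explicit finite list (whereas the machine guesses encodings), and the names `holds_prenex`, `prefix` and `matrix` are hypothetical. Existential states correspond to `any`, universal states to `all`.

```python
def holds_prenex(prefix, matrix, domain, assignment=None):
    """Evaluate Q1 y1 ... Qm ym . matrix over a finite domain.

    prefix  -- list of (quantifier, variable) pairs, quantifier in {"exists", "forall"}
    matrix  -- function from an assignment (dict) to bool; the quantifier-free part
    domain  -- finite iterable of candidate values for every variable
    """
    assignment = dict(assignment or {})
    if not prefix:
        return matrix(assignment)           # all variables chosen: check the matrix
    (quantifier, var), rest = prefix[0], prefix[1:]
    branches = (holds_prenex(rest, matrix, domain, {**assignment, var: d})
                for d in domain)
    # "exists" accepts if some branch accepts, "forall" if every branch does.
    return any(branches) if quantifier == "exists" else all(branches)


# Every x has a successor modulo 3, but no single y is the successor of every x.
succ_mod3 = lambda a: a["y"] == (a["x"] + 1) % 3
forall_exists = holds_prenex([("forall", "x"), ("exists", "y")], succ_mod3, [0, 1, 2])
exists_forall = holds_prenex([("exists", "y"), ("forall", "x")], succ_mod3, [0, 1, 2])
```

The recursion depth equals the number of quantifiers, and each level stores one chosen element, which parallels the way the alternating machine stores one encoding per quantified variable.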
We start with the case of the atomic formulas in ψ. In order to check whether the element assigned to a variable belongs to the interpretation of Nodes in M ′ G,ρ , we only have to check that the encoding of this element does not start with a p or an r. In order to check whether the element belongs to the interpretation of Paths (resp., Registers), it is sufficient to check that its encoding starts with a p (resp., r). In order to check whether the elements a, b, c assigned to variables x, π, y, respectively, are such that (a, b, c) belongs to the interpretation of Endpoints, we only have to check that b is the encoding of a path, a and c are encodings of nodes, and that b is a path from a to c. Finally, in order to check whether a tuple (a, b, c) of elements assigned to variables belongs to the interpretation of e i (1 ≤ i ≤ m), we only have to check that a, c are register values (i.e., their encodings start with symbol r), that b encodes a path ρ (i.e., its encoding starts with p), and that the bit that corresponds to the tuple (e i , a, c) in the part of the address b that encodes the set E ⊆ {e 1 , . . ., e m } × D k ⊥ × D k ⊥ that ρ satisfies is set to 1. Thus, the value of the atomic formulas involved in ψ(v, ρ, λ, c(y 1 ), . . ., c(y m )) can be computed in polynomial time (in the size of ψ(v, ρ, λ, c(y 1 ), . . ., c(y m ))). But since ψ is a polysize Boolean combination of atomic formulas, the value of ψ(v, ρ, λ, c(y 1 ), . . ., c(y m )) can be computed in polynomial time from the values of the atomic formulas. We conclude that computing the value of ψ(v, ρ, λ, c(y 1 ), . . ., c(y m )) can be done in polynomial time.
There is, however, one small issue that requires explanation in order for the previous procedure to work properly.Assume that the procedure "guesses" the interpretation of a variable y i in φ τ to be the encoding of a path in Then it is necessary to check that, if the encoding implies that this path is In order to do so, the procedure needs to check, in a subroutine, whether there exist i different paths from v to v ′ that satisfy E. The next claim shows that this can be done in exponential space, which finishes the proof of the theorem.
, and i ≤ k + |ρ|, one can check in Expspace whether there are i distinct paths in G from v to v ′ that satisfy E.
Proof.Let # be a symbol not in Σ and denote by Σ # the alphabet Σ∪{#}.Let A v,v ′ be the automaton over alphabet {v} ∪ (Σ # × V ) defined as follows.The set of states is the disjoint union of V with a new state s.The initial state of A is s and the final state is v ′ .Further, the transition relation of A is defined as follows: (1) For every edge (v 1 , a, v 2 ) ∈ E there is a transition in A from v 1 to v 2 labeled (a, v 2 ), (2) for every node v 1 ∈ V there is a transition from v 1 to v 1 in A labeled (#, v 1 ), and (3) there is a transition in A from s to v labeled v. Intuitively, A v,v ′ accepts exactly those strings of the form v(a when we allow paths to loop arbitrarily many times on #-labeled nodes. Let A i v,v ′ be the automaton over alphabet {v i } ∪ (Σ # × V ) i defined as follows: The set of states is V i ∪ {s i }, the initial state is s i and the final state is (v ′ ) i .There is a transition in A i from ū = (u 1 , . . ., u i ) to w = (w 1 , . . ., w i ) labeled t = (t 1 , . . ., t i ) iff there is a transition labeled t ℓ from u ℓ to w ℓ in A v,v ′ , for each 1 ≤ ℓ ≤ i. 
(Notice that A i v,v ′ is not exactly the i-th product of A v,v ′ with itself, as A i v,v ′ does not contain all states in such a product).Clearly, A i v,v ′ is of exponential size but the size of each one of its states is polynomial.Furthermore, it is decidable in polynomial time whether there exists a transition labeled t from state ū to w in satisfy the following: The first condition says that each projection of a string accepted by A ′ v,v ′ represents a path in G from v to v ′ that loops only on v ′ and only at the end of the path.The second condition ensures that any two distinct projections of a path accepted by It is not hard to prove that the language accepted by A ′ v,v ′ is nonempty iff there exist i distinct paths in G from v to v ′ .Further, it is also not hard to see that A ′ v,v ′ is of exponential size but the size of each one of its states is polynomial; and it is decidable in polynomial time whether there exists a transition labeled t from state q to state q′ in A ′ v,v ′ .Using techniques in [10], it is also possible to construct in exponential time an NFA A v,v ′ ,e i ,λ,λ ′ (e i ∈ {1, . . ., m}, λ, λ ′ ∈ D k ⊥ ) over alphabet v ∪ (Σ # × V ), that accepts precisely the strings w accepted by A v,v ′ such that the path ρ from v to v ′ in G that is represented by w satisfies e i (ρ, λ, λ ′ ).(The main idea is to construct A v,v ′ ,e i ,λ,λ ′ in such a way that, at each position while reading w, it keeps in its state the k-tuple of data values that is stored in the registers of e i ).The set of states of A v,v ′ ,e i ,λ,λ ′ is of exponential size, but each particular state can be represented using polynomial space.Further, deciding if there is a transition between two states of A v,v ′ ,e i ,λ,λ ′ can be done in polynomial time.This means that for each A v,v ′ ,e i ,λ,λ ′ , its complement can be constructed in double exponential time, with each state using only exponential space.
It is not hard to see, then, that one can construct in double exponential time an automaton A E,v,v ′ over alphabet {v i } ∪ (Σ # × V ) i that does the following: It starts from A ′ v,v ′ , and restricts acceptance to strings Further, each state in A E,v,v ′ can be represented using exponential space and checking whether there is a transition between two given states of A E,v,v ′ can be done in polynomial time.
It is clear that there exist $i$ distinct paths in $G$ from $v$ to $v'$ that satisfy $E$ if and only if $A_{E,v,v'}$ accepts at least one string. But we can check $A_{E,v,v'}$ for nonemptiness in Expspace using a standard "on-the-fly" argument. This finishes the proof of the claim. This also finishes the proof that Eval(RL) is in Expspace.

Now we show that Eval(RL) is Expspace-hard.
For every constant $f_0$, we provide a reduction from the class of problems solvable by a Turing machine using a tape of size $2^{f_0 n}$ on an input word of size $n$. There are a Turing machine $M$ and a constant $f_0$ such that the following problem is Expspace-hard: given a word $w$ of size $n$, is there an accepting run of $M$ over $w$ using at most $2^{f_0 n}$ cells? We prove that for every word $w$ of size $n$ there are a formula $\varphi_w \in$ RL and a graph $G_w$ such that $G_w \models \varphi_w$ iff there is an accepting run of $M$ over $w$ using at most $2^{f_0 n}$ cells.
Let $(\Sigma, Q, \delta, q_0, q_f)$ be the Turing machine $M$, where $\Sigma$ is the alphabet consisting of the input alphabet and the blank symbol $B$, $q_0$ is the initial state, $q_f$ is the final state, and $\delta : Q \times \Sigma \to Q \times \Sigma \times \{L, R\}$ is the transition map, where $L$ stands for "left" and $R$ stands for "right".
The formula φ w that we associate with the machine M and a word w is a formula of the form ∃πψ w (π), where ψ w is a formula that does not contain any quantification over path variables.The formula ψ w (π) expresses that the path π in the graph G w encodes an accepting run of M over the word w.
As in the proof of Theorem 4.1, we encode a configuration $C$ of a run of $M$ in the following way. Suppose that the content of the tape is the word $w' = w'_1 \ldots w'_{2^{f_0 n}}$, the head is scanning cell number $i_0$, and the machine is in state $q$. We encode the configuration $C$ by the word $e_C = d^C_1 \,\&\, d^C_2 \,\&\, \cdots \,\&\, d^C_{2^{f_0 n}}$, where & plays the role of a delimiter and each $d^C_i$ encodes the information in cell number $i$. More precisely, if $\Delta$ is the alphabet $(Q \cup \{\$\}) \times \Sigma$, we define $d^C_i$ as the word $c(i)\,(q'_i, w'_i)$, where $c(i)$ and $q'_i$ are defined as follows. The word $c(i)$ is the binary representation of the number $i$. The letter $w'_i$ is the content of cell $i$. The letter $q'_i$ is equal to \$ if the head is not scanning cell number $i$; otherwise, $q'_i$ is equal to the state $q$. That is, $q'_{i_0} = q$ and, for all $i \neq i_0$, $q'_i = \$$. We encode a run $C_0 C_1 \ldots$ as the word $e_{C_0} \# e_{C_1} \# \ldots$. We define the formula $\psi_w(\pi)$ and the graph $G_w$ in such a way that a path $\pi$ satisfies $\psi_w$ iff the label of $\pi$ is the encoding of an accepting run of $M$ over $w$.
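The encoding of configurations and runs described above can be made concrete with a short sketch. The function names, the representation of a configuration as a triple (tape, head, state), and the explicit `width` parameter for the length of $c(i)$ are assumptions introduced here for illustration only; the paper fixes the length of $c(i)$ through $f_0 n$.

```python
def encode_cell(i, marker, symbol, width):
    """Letters of d_i: the bits of c(i), then the pair (q'_i, w'_i)."""
    bits = format(i, "b").zfill(width)  # hypothetical fixed-width binary c(i)
    return list(bits) + [(marker, symbol)]


def encode_config(tape, head, state, width):
    """Word e_C: cell encodings separated by the delimiter '&'."""
    word = []
    for i, symbol in enumerate(tape, start=1):  # cells are numbered from 1
        if i > 1:
            word.append("&")
        marker = state if i == head else "$"    # '$' marks an unscanned cell
        word.extend(encode_cell(i, marker, symbol, width))
    return word


def encode_run(configs, width):
    """Word e_C0 # e_C1 # ...: configurations separated by '#'."""
    word = []
    for k, (tape, head, state) in enumerate(configs):
        if k > 0:
            word.append("#")
        word.extend(encode_config(tape, head, state, width))
    return word
```

For instance, a two-cell tape with the head on the first cell yields the bits of 1, the pair carrying the state, the delimiter &, the bits of 2, and an unscanned pair.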
The graph G w is (almost) the same graph as the graph in the proof of Theorem 4.1 in the case where k = 1.That is, G w is obtained by "linking" two graphs I w and H.If C 0 C 1 . . . is an accepting run with associated path π, the trace of π on I w will correspond to the encoding of the initial configuration C 0 , while the trace of π on H is the encoding of the run C 1 C 2 . . . .The graph H is given by

[Figure: the graph $H$. Recoverable edge labels: &, #.]
Recall that the set {d i : 1 ≤ i ≤ l} is defined as (Q ∪ {$}) × Σ.Consider a simple path π ′ starting from the node with data value 1 and ending in the node with outgoing edges with labels & and #.Its label is of the form c(i)(q ′ , a); that is, it is the encoding of the information in a cell with number i. Hence, the label of a path π starting and ending in the node with data value 1 satisfies c(i where each d j is the encoding of the information of a cell and each c(i j ) is the encoding of the number i j .We define the formula ψ in such a way that if ψ(π) holds, the succession of encodings of cells describe a run of M .
Next we define the graph $I_w$, in which we encode the initial configuration. Suppose that $w = w_1 \ldots w_n$. For all $1 \le i \le n$, we introduce a graph $K_i$ describing cell number $i$ in the initial configuration. If $b_1 \ldots b_n$ is the binary encoding of the number $i$, the graph $K_i$ is given by [figure], where $q'_1 = q_0$ and $q'_i = \$$ if $i \neq 1$. The label of the longest path of $K_i$ is exactly the encoding of cell number $i$ in the initial configuration. Next we define the graph $K$, which allows us to encode the cells with number $> n$ in the initial configuration. The graph $K$ is given by

[Figure: the graph $K$. Recoverable edge label: &.]
The label of a simple path from the node with data value 1 to the node with outgoing edge with label & is of the form c(i)($, B); that is, it is the encoding of an unscanned cell with a blank symbol.In particular, it is the encoding of all the cells with number > n in the initial configuration.The graph I w is obtained by linking together the graph K 1 , . . ., K n , K in the following way The arrow with a label s is an arrow pointing to the node of K 1 with data value 1.Each arrow with label # between two graphs is from the "right-most" node of the first graph to the "left-most" node of the second graph.Finally, the graph G w is defined by where the edge with label # is an edge from the "right-most" node of K to the node with data value 1 in H.We define now a formula ψ π such that G w ψ(π) iff l(π) is the encoding of an accepting run of M over w using at most 2 f 0 n cells, where l(π) is the label of π.In fact it is easier to define a formula χ(π) such that G w χ(π) iff l(π) is not the encoding of an accepting run of M over w using at most 2 f 0 n cells.(5.1) Suppose that π is a path through G w .The label of π is not the encoding of an accepting run of M over w using at most 2 f 0 n cells iff at least one of the following conditions holds.
(i) The first letter of $l(\pi)$ is not the initial letter $s$, or the run never reaches a final state, i.e., no pair of the form $(q_f, a)$, for some $a$, occurs in $l(\pi)$.
(ii) The symbol # is not at the "right place":
− either, after we reach the symbol # (i.e., we are about to enter the encoding of a new configuration), the label contains the binary encoding of a number $\neq 1$,
− or, after the binary encoding of the number $2^{f_0 n}$ (that is, after encoding the information of the last cell), the symbol # does not occur in the label of $\pi$. Since # is used as a delimiter between encodings of configurations, this means that although we finished encoding the last cell of a configuration, we do not move to a new configuration.
(iii) There is a substring $c(i)\,d\,\&\,c(j)\,d'$ (where $i < 2^{n}$) such that $j$ is not the successor of $i$. That is, after encoding the information of cell number $i$, we do not encode the information of cell number $i + 1$.
(iv) Finally, there is a string $e_C \# e_{C'}$ of $D_\pi$ such that $C$ and $C'$ are not successive configurations.

Expressing cases (i) and (ii) is fairly easy. We concentrate on (iii) and (iv). We start by showing how to express case (iii). Suppose that $c(i)$ and $c(j)$ are two successive binary encodings occurring in the label of $\pi$, say $c(i) = b_1 \ldots b_{f_0 n}$ and $c(j) = b'_1 \ldots b'_{f_0 n}$. Then $j$ fails to be the successor of $i$ in one of four cases (a)-(d), distinguished by the bits of $c(i)$; in case (a), for some $k$, $b_k \ldots b_{f_0 n} = 1 \ldots 1$ and $b'_k \neq 0$. We show how we can express (a); the other cases are similar. Case (a) is expressed by the following REM:
$$(\Lambda \setminus \Delta_f)^* \,{\downarrow}r_1.\,1^*\,\Delta\,\&\,\{0,1\}^*\,[r_1^{=}]\,1\,\Lambda^*,$$
where $\Delta_f$ is the set $\{q_f\} \times \Sigma$ and $\Lambda$ is the alphabet of the graph $G_w$. In the register $r_1$, we store the number $k$ such that $b_k \ldots b_{f_0 n} = 1 \ldots 1$ (that is, we only go through edges with label 1 until we reach an edge with label in $\Delta$). When we reach again the node with data value $k$, the label of the outgoing edge is 1, expressing that $b'_k = 1$.

Finally we look at case (iv), i.e.
how to express that there is a string e C #e C ′ of D π such that C and C ′ are not successive configurations.This might happen for several reasons: (A) either we did not modify properly the content of a cell or move properly the head or (B) we modified the content that was supposed to remain constant.We only treat case (A), as the other case can be handled in a similar way (note that the proof of Theorem 5.3 is very similar to this proof and there, we will treat case (B)).In case (A), we also only consider the case of a transition moving the head to the right, the other case being symmetric.
So suppose that δ(q, a) = (q ′ , b, R).As in the proof of Theorem 4.1, we use a slightly different definition of a run of a Turing machine (which is equivalent to the usual definition, but helps us to keep our formulas simpler).We assume that if δ(q, a) = (q ′ , b, R) and the machine scans a cell c with content a, then in the next state, the machine scans the successor c ′ of c, the content of c ′ is b, while the content of c is a.
Let $(\Lambda_{!\#})^*$ be the set of words over $\Lambda$ that contain at most one occurrence of #. We define $e^R_{(q,a)}$ as the following REM:
$$\Lambda^*\,\{0,1\}{\downarrow}r_1.\ \ldots\ \{0,1\}{\downarrow}r_{f_0 n}.\,(q,a)\,(\Lambda_{!\#})^*\,\{0,1\}[r_1^{=}] \ldots \{0,1\}[r_{f_0 n}^{=}]\,\Delta\,\&\,\{0,1\}^*\,(\Delta \setminus \{(q',b)\})\,\Lambda^*.$$
We store in the registers $r_1, \ldots, r_{f_0 n}$ the binary encoding of a number $i$. That number is the number of a scanned cell with content $a$, and the current state is $q$. After the next occurrence of # (after reading a word in $(\Lambda_{!\#})^*$), we enter a new configuration. In that new configuration, we reach cell number $i$ when we read a sequence matching the contents of the registers. The encoding of the next cell (after reading a sequence in $\Delta\,\&$) must consist of the binary encoding of a number followed by a symbol in $\Delta$ that is not $(q', b)$.
The increase in expressiveness of RL over REM has an important cost in data complexity, which becomes intractable:

Theorem 5.3. Eval(RL) is in Pspace in data complexity. Furthermore, there are a finite alphabet $\Sigma$ and an RL formula $\varphi$ over $\Sigma$ and a single register $r$ such that Eval(RL, $\varphi$) is Pspace-hard. In addition, the latter holds even if the input is restricted to graph databases.
Proof.The upper bound follows as a corollary to the proof of the upper bound in Theorem 5.2.In fact, it is clear that the whole process can be carried in Pspace if we assume a fixed RL query (in fact, to obtain a Pspace upper bound we do not need more than to fix the number of registers used in the query).
For the lower bound, we define a formula $\varphi$ in RL such that, for all constants $f_0$, there is a reduction from the class of problems solvable by a Turing machine using a tape of size $f_0 n$ on an input word of size $n$ to the evaluation problem of $\varphi$. More precisely, there are a Turing machine $M$ and a constant $f_0$ such that the following problem is Pspace-hard: given a word $w$ of size $n$, is there an accepting run of $M$ over $w$ using at most $f_0 n$ cells?
We prove that the formula φ is such that for all words w of size n, there is a graph G w such that G w φ iff there is an accepting run of M over w using at most f 0 n cells.
Let $(\Sigma, Q, \delta, q_i, q_f)$ be the Turing machine $M$, where $\Sigma$ is the input alphabet together with a blank symbol $B$, $q_i$ is the initial state, $q_f$ is the final state, and $\delta : Q \times \Sigma \to Q \times \Sigma \times \{L, R\}$ is the transition map, where $L$ stands for "left" and $R$ stands for "right".
The formula φ that we associate with the machine M is a formula of the form where ψ is a formula that does not contain any quantification over path variables.Given a word w, the path π in the graph G w will encode an accepting run of M over the word w.
Given a word $w$ of size $n$, consider a configuration $C$ of the run of $M$ over $w$ where the contents of the tape is the word $w' = w'_1 \ldots w'_{f_0 n}$, the head is scanning cell number $i_0$, and the machine is in state $q$. Similarly to the proof of Theorem 4.1, we encode the configuration $C$ by a word $e_C$ in which each $d^C_i$ encodes the information in cell number $i$ in the configuration $C$. We define $d^C_i$ as the pair $(q'_i, w'_i)$, where $q'_i$ is defined as follows. The letter $w'_i$ is the contents of cell $i$. The letter $q'_i$ is equal to \$ if the head is not scanning cell number $i$; otherwise, $q'_i$ is equal to the state $q$. That is, $q'_{i_0} = q$ and, for all $i \neq i_0$, $q'_i = \$$. The run of $M$ over the word $w$ is a (possibly infinite) sequence of configurations of the form $C_0 C_1 \ldots$. We encode the run as the word $e_{C_0} \# e_{C_1} \# \ldots$, where # plays the role of a delimiter. We will define the formula $\psi(\pi)$ and the graph $G_w$ in such a way that a path $\pi$ satisfies $\psi$ iff the label of $\pi$ is the encoding (as defined above) of an accepting run of $M$ over $w$.
We think of a path π encoding a run of M over w as consisting of two parts.The label of the first part contains the encoding e C 0 of the initial configuration C 0 .The label of the second part contains the encoding e C 1 #e C 2 # . . . of the remaining part of the run.The first part of the path π is a path in a subgraph I w of G w , while the second part is a path in the subgraph H (independent of w) of G w .The graph G w will be obtained by adding an edge from a node in I w to a node in H.
The graph I w is given by 1 2 3 . . .
In the graph above the data value i carried by a node v indicates that the label of the outgoing edge of v is d C 0 i (where C 0 is the initial configuration and d C 0 i is defined as above).Recall that d C 0 i indicates the contents of the cell number i and whether or not the head is scanning that cell.There is a unique path π 0 from the node with no incoming edge to the unique node with no outgoing edge.The label of π 0 is sN (q i , w 0 ) ($, w 1 )N ($, w 2 ) . . .N ($, w n )N ($, B) . . .N ($, B), that is, the word s.e C 0 , where e C 0 is the encoding of the initial configuration C 0 .
We define now the graph H encoding the remaining part of the run of the machine.

[Figure: the graph $H$. Recoverable edge label: N.]
For all 1 ≤ i ≤ f 0 n, all q ∈ Q and all a ∈ Σ, the node with data value i admits outgoing edges with label (q, a) and ($, a).A path from the "left-most" node to the "right-most" node that does not go through the edge with label # has a label of the form where each d i belongs to (Q ∪ {$}) × Σ, that is, an encoding of a configuration of the machine.Hence, a path π ′ from the "left-most" node to the "right-most" node of H (possibly going through the edge with label #) has a label of the form where each e C i is the encoding of a configuration of the machine.
We are now ready to define the graph G w .

[Figure: the graph $G_w$, obtained from $I_w$ and $H$ by adding an edge labeled #.]
The edge with label # is an edge from the unique node in I w with no outgoing edge to the "left-most" node in H.We define now the formula ψ.Let Λ be the alphabet The formula ψ must be such that where l(π) is the label of π and C 0 . . .C p is an accepting run of the machine over w.In fact, it will be intuitively easier to first define a formula χ(π) such that and define ψ as ¬χ.The formula χ is obtained as a disjunction of the following subformulas.
• First l(π) might not satisfy se C 0 #e C 1 . . .e Cp Λ * because: (i) it does not start with the letter s or (ii) it does not contain the encoding of a final configuration, i.e. it does not contain any occurrence of a pair of the form (q f , a) for some a ∈ Σ. Case (i) is expressed by the formula where e γ is the REM γΛ * .Case (ii) is expressed by the formula where ∆ f is the set {q f } × Σ.This formula express that there is no pair of the form (q f , a), for some a ∈ Σ. • Next l(π) might not be of the "right form" because it contains a substring of the form e C #e C ′ occurring before a pair of the form (q f , a) and such that C and C ′ are not consecutive configurations.This might happen for several reasons: (A) either we did not modify properly the contents of a cell or move properly the head or (B) we modified the contents that was supposed to remain constant.
We will only treat one case.As we treated case (A) in the proof of Theorem 5.2, here we treat case (B).As in the proof of Theorem 5.2, we also consider a slightly different definition of a run of a Turing machine (which is equivalent to the usual definition, but helps us to keep our formulas simpler).We assume that if δ(q, a) = (q ′ , b, R) and the machine scans a cell c with content a, then in the next state, the machine scans the successor c ′ of c, the content of c ′ is b, while the content of c is a.
In case (B), we make the following case distinction.
− Suppose first that case (B) happened because we modified the contents of the cell that was scanned by the head in configuration $C$. Note that, when moving from $C$ to $C'$, by definition of a run, we cannot modify the contents of the cell scanned in configuration $C$. Let $(q, a)$ be a pair in $Q \times \Sigma$ and let $\Delta$ be the set $(Q \cup \{\$\}) \times \Sigma$. We also define $\Lambda^*_{!\#}$ as the set of words over $\Lambda$ that contain at most one occurrence of the symbol #. We let $e_{(q,a)}(\pi)$ be the formula expressing the following. Before we reach a final state, we store in register $r$ the number of a cell that is scanned in the current configuration and has contents $a$. In the next configuration (after reading a word in $\Lambda^*_{!\#}$), when we read the same cell, it does not contain $a$. That is, the label of the edge is not a pair of the form $(q, a)$. The formula $\bigvee \{\exists\nu\, e_{(q,a)}(\pi, \bot, \nu) : (q, a) \in Q \times \Sigma\}$ takes care of the cases where, in moving from one configuration to the next, we modified the contents of the cell scanned in the first configuration.
− Next suppose that, when moving from $C$ to $C'$, we modified the contents of a cell that was not scanned in $C$. Suppose also that, according to the machine, we were not supposed to modify those contents.
Let $a$ be a letter in $\Sigma$. Let $\Delta_L$ be the set of pairs $(q, a)$ such that $\delta(q, a) = (q', b, L)$ and let $\Delta_R$ be the set of pairs $(q, a)$ such that $\delta(q, a) = (q', b, R)$, for some $q'$ and $b$. We define $e_a(\pi)$ as the REM expressing the following. Before we reach a final state, in a configuration $C$, we store in register $r$ the number $i$ of an unscanned cell with contents $a$. We assume that if the cell with number $(i - 1)$ (if it exists) is scanned in $C$, then the head is not moving to the right (i.e., to cell number $i$) in the next configuration. That is, the pair describing cell number $(i - 1)$ in $C$ is not a pair in $\Delta_R$. We also assume that if cell number $(i + 1)$ (if it exists) is scanned, then the head is not moving to cell number $i$ in the next configuration, i.e., cell number $(i + 1)$ is not described by a pair in $\Delta_L$. Next we express that, in the next configuration, it is not the case that the cell with number $i$ is an unscanned cell with contents $a$. That is, after reading a word in $\Lambda^*_{!\#}$, when we see again the node stored in the register, the label of the edge is not the pair $(\$, a)$. The formula $\bigvee \{\exists\nu\, e_a(\pi, \bot, \nu) : a \in \Sigma\}$ takes care of case (B) in the case where the modified cell is a cell unscanned by the head. This finishes the proof of the theorem.
In the next section we introduce an interesting language, based on a restriction of RL, that is tractable in data complexity, and thus better suited for database applications.This language is a proper extension of REM.But before that, we make some important remarks about the expressive power of RL.
Expressive power of RL. We now look at the expressive power of the logic RL. It was proven in [8] that CRPQ is not subsumed by WL. Since RL subsumes CRPQ$^\neg$, it follows that RL is not subsumed by WL. On the other hand, WL is also not subsumed by RL due to Theorem 4.1, Theorem 5.2, and the standard time/space hierarchy theorem from complexity theory. Therefore, we have the following proposition:

Proposition 5.4. The expressive powers of WL and RL are incomparable.
On the other hand, we shall argue now that many natural queries about the interaction between data and topology are also expressible in RL.The aforementioned query (Q) is one such example.We shall now mention other examples: hamiltonicity (H), the existence of an eulerian trail (E), bipartiteness (B), and connected graphs with an even number of nodes (C2).The first two are expressible in WL, while (B) and (C2) are not known to be expressible in WL.We conjecture that they are not.
We now show how to express in RL the existence of a hamiltonian path in a graph; the query (E) can be expressed in the same way but with two registers (to remember edges, each consisting of two nodes). This is done with the following formula over $\Sigma = \{a\}$ and a single register $r$:
$$\exists\pi\,\bigl(\forall\lambda\forall\lambda'\,\neg e_1(\pi, \lambda, \lambda') \;\wedge\; \forall\lambda\,(\lambda \neq \bot \rightarrow e_2(\pi, \lambda, \lambda))\bigr),$$
where $e_1 := a^* \cdot ({\downarrow}r.\,a^+[r^{=}]) \cdot a^*$ is the REM that checks whether in a path some node is repeated (i.e., that it is not a simple path), and $e_2 := a^*[r^{=}]a^*$ is the REM that checks that the node stored in register $r$ appears in a path. In fact, this query expresses that there is a path $\pi$ that is simple (as expressed by the formula $\forall\lambda\forall\lambda'\,\neg e_1(\pi, \lambda, \lambda')$), and every node of the graph database is mentioned in $\pi$ (as expressed by the formula $\forall\lambda\,(\lambda \neq \bot \rightarrow e_2(\pi, \lambda, \lambda))$).
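The two conjuncts of the hamiltonicity formula can be read as a brute-force procedure: enumerate paths in which no node repeats (the paths on which $e_1$ fails) and test whether one of them mentions every node (the condition imposed through $e_2$). The following is a minimal sketch under the assumption that the graph database is given as a list of nodes and a set of directed edges; the function names are hypothetical, and the search is exponential, so it only illustrates the meaning of the formula and is not an evaluation algorithm from the paper.

```python
def simple_paths(nodes, edges):
    """Yield every path (as a list of nodes) in which no node is repeated."""
    def extend(path):
        yield path
        for (u, v) in edges:
            if u == path[-1] and v not in path:  # keep the path simple
                yield from extend(path + [v])

    for start in nodes:
        yield from extend([start])


def has_hamiltonian_path(nodes, edges):
    """True iff some simple path mentions every node of the graph."""
    return any(set(p) == set(nodes) for p in simple_paths(nodes, edges))
```

On the directed chain 1 → 2 → 3 the check succeeds, whereas on the graph with edges 1 → 2 and 1 → 3 no simple path covers all three nodes.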
We now show how to express in RL the bipartiteness property from graph theory. An undirected graph $G = (V, E)$ is bipartite if its set of nodes can be partitioned into two sets $S_1$ and $S_2$ such that, for each edge $(v, w) \in E$, either (i) $v \in S_1$ and $w \in S_2$, or (ii) $v \in S_2$ and $w \in S_1$. It is well known that a graph database is bipartite if and only if it does not have cycles of odd length. The latter is expressible in RL since the existence of an odd-length cycle can be expressed as $\exists\pi\exists\lambda\exists\lambda'\, e(\pi, \lambda, \lambda')$, where $e = {\downarrow}r.\,a(aa)^*[r^{=}]$.
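To illustrate what the REM for odd cycles checks, the sketch below searches for a path of odd length that returns to its starting node, by exploring pairs (node, parity of the number of edges read so far). It assumes a graph given as a list of nodes and a collection of edges, with an undirected edge represented by both orientations; the function names are hypothetical. Since paths may repeat nodes, the search finds closed walks, which in an undirected graph exist with odd length exactly when an odd cycle exists.

```python
from collections import deque


def has_odd_closed_walk(nodes, edges):
    """True iff some node has a path of odd length back to itself."""
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)
    for start in nodes:                  # the node stored in the register
        seen = {(start, 0)}
        queue = deque([(start, 0)])
        while queue:
            v, parity = queue.popleft()
            for w in succ[v]:
                state = (w, 1 - parity)  # reading one more edge flips parity
                if state == (start, 1):  # back at the stored node, odd length
                    return True
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False


def is_bipartite(nodes, undirected_edges):
    """Bipartiteness of an undirected graph via the odd-cycle criterion."""
    edges = set()
    for u, v in undirected_edges:
        edges.add((u, v))
        edges.add((v, u))
    return not has_odd_closed_walk(nodes, edges)
```

A triangle is rejected and a 4-cycle is accepted, in line with the odd-cycle characterization of bipartiteness.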
We now show how to express in RL that a graph database is a connected graph with an odd number of nodes. To this end, it is sufficient and necessary to express the existence of a hamiltonian path $\pi$ with an even number of edges in the graph. But this is a simple modification of our formula for expressing hamiltonicity: we add the check that $\pi$ has an even number of edges by adding the conjunct $e(\pi, \nu, \nu')$, where $e = (aa)^*$, and close the entire formula under existential quantification of $\nu$ and $\nu'$.

5.3. Tractability in data complexity. Let RL$^+$ be the positive fragment of RL, i.e., the logic obtained from RL by forbidding negation but adding conjunctions (as they were not explicitly present in RL). It is easy to prove that the data complexity of the evaluation problem for RL$^+$ is tractable (NLogspace). This fragment contains the class of conjunctive REMs, which has been previously identified as tractable in data complexity [10]. However, the expressive power of RL$^+$ is limited, as the following proposition shows.
Proposition 5.5.The query (Q) from the introduction is not expressible in RL + .
Proof. Recall that $Q$ is the following query: Find pairs of nodes $x$ and $y$ such that there is a node $z$ and a path $\pi$ from $x$ to $y$ in which each node is connected to $z$. Suppose for contradiction that there is a formula $\varphi$ in RL$^+$ over an alphabet $\Sigma$ and registers $r_1, \ldots, r_k$ expressing $\exists x \exists y\, Q$. We may assume that $\varphi$ is of the form $\exists x_1 \ldots \exists x_{n_1} \exists\pi_1 \ldots \exists\pi_{n_2} \exists\nu_1 \ldots \exists\nu_{n_3}\, \psi$, where $\psi$ is a disjunction of conjunctions of atoms. Let $G = (V, E, \kappa)$ be the following graph where each edge is labeled with $a$. The query $\exists x \exists y\, Q$ is true in $G$; hence, the formula $\varphi$ must be true in $G$. That is, there is an assignment $\alpha$ mapping each variable $x_i$ to a node in $G$, each path variable $\pi_i$ to a path $\rho_i$ in $G$, and each variable $\nu_i$ to a tuple in $\{\bot, \$, 1, \ldots, n_2 + 1\}^k$ such that $(G, \alpha) \models \psi$. Let $G'$ be the graph $(V, E', \kappa)$ where $E'$ is the set
$$\{(i, a, i + 1) : 1 \le i < n_2\} \cup \{(i, a, \$) : \text{for some } 1 \le j \le n_2,\ (i, \$) \text{ occurs in } \rho_j\}.$$
That is, we delete the edges $(i, \$)$ that $\alpha$ "does not use". By definition of $E'$, the formula $\psi$ remains true in $G'$ under the assignment $\alpha$. In particular, $\varphi$ is true in $G'$. This implies that $\exists x, y\, Q$ holds in $G'$. Now, for all $1 \le j \le n_2$, there is at most one natural number $i$ such that $(i, a, \$)$ occurs in $\rho_j$. This is simply because there is no path going through edges $(i, a, \$)$ and $(i', a, \$)$ if $i \neq i'$. This implies that the set $\{(i, a, \$) : \text{for some } 1 \le j \le n_2,\ (i, \$) \text{ occurs in } \rho_j\}$ contains at most $n_2$ edges. Since $G$ admits $n_2 + 1$ edges of the form $(i, a, \$)$, there must be an edge $(i_0, a, \$)$ occurring in $G$, but not in $G'$. This means that $G'$ is a graph of the form In particular, $\exists x, y\, Q$ is not true in $G'$, which contradicts the fact that $\varphi$ is true in $G'$.
Proof.By the proof of Theorem 5.3 (and using the same notation), we know that for every Turing machine M , there is a formula χ(π) such that for all words w of size n, there is a graph G w (of size polynomial in n) such that G w ∃π¬χ(π) iff there is an accepting run of M over w using at most cn cells.
Now we prove that Eval(RL, $\varphi'$) is Pspace-complete, where $\varphi'$ is a formula of the form $\exists\pi\exists\lambda\,\neg(e'(\pi, \bot, \bot) \vee f'(\pi, \bot, \lambda))$, for some REMs $e'$ and $f'$. The intuition is as follows. The difference between $\varphi$ and $\varphi'$ is that in $\varphi$, we may choose the data value that is in the register after checking that $f$ is true. However, in $\varphi'$, we must be able to store any value in the register after checking that $f'$ is true. We will make two changes to make this possible.
First, we modify the graph G w in such a way that two arbitrary nodes are always reachable.This can be easily achieved by adding an edge from the "right-most node" of the graph H to the "left-most node" of the graph I w (allowing to encode the initial configuration of a run).Second, we modify the REMs of φ in such a way that the label of a path satisfying those REMs, encodes an accepting run and after reaching the final state, it goes through all the nodes of G w .Hence, once we checked that the run reaches the final state, we can simply store any value in the register.We leave out the details, as the intuition is pretty simple and the details a bit tedious.
In the case of basic navigational languages for graph databases, it is possible to increase the expressive power -without affecting the cost of evaluation -by extending formulas with a branching operator (in the style of the class of nested regular expressions [5]).The same idea can be applied in our scenario, by extending atomic REM formulas in RL + with such a branching operator.The resulting language is more expressive than RL + (in particular, this extension can express query (Q)), yet remains tractable in data complexity.We formalize this idea below.
The class of nested REMs (NREM) extends REM with a nesting operator $\langle\cdot\rangle$ defined as follows: if $e$ is an NREM, then $\langle e \rangle$ is also an NREM. Intuitively, the formula $\langle e \rangle$ filters those nodes in a data graph that are the origin of a path that can be parsed according to $e$. Formally, if $e$ is an NREM over $k$ registers and $G$ is a data graph, then $[\![\langle e \rangle]\!]_G$ consists of all tuples of the form $(u, \lambda, \rho = u, u, \lambda)$ such that $(u, \lambda, \rho', v, \lambda') \in [\![e]\!]_G$, for some node $v$ in $G$, path $\rho'$ in $G$, and $k$-tuple $\lambda'$ over $D_\bot$.
Let NRL + be the logic that is obtained from RL + by allowing atomic formulas of the form e(π, λ, λ ′ ), for e an NREM.Given a data graph G and an assignment α for π, λ and λ ′ over G, we write as before (G, α) |= e(π, λ, λ ′ ) if and only if α(π) goes from u to v and (u, α(λ), α(π), v, α(λ ′ )) ∈ e G .The semantics of NRL + is thus obtained from the semantics of these atomic formulas in the expected way.The following example shows that query (Q) is expressible in NRL + , and, therefore, that NRL + increases the expressiveness of RL + .
Example 5.3. Over graph databases, the query (Q) from the introduction is expressible in NRL$^+$ using the following formula over $\Sigma = \{a\}$ and register $r$: $\varphi = \exists\pi\exists\lambda\,\bigl((x, \pi, y) \wedge e(\pi, \lambda, \lambda)\bigr)$, where $e := (\langle e_1 \rangle \cdot a)^* \langle e_1 \rangle$, for $e_1 = a^*[r^{=}]$. Intuitively, $e_1$ checks in a path whether its last node is precisely the node stored in register $r$, and thus $e$ checks whether every node in a path can reach the node stored in register $r$. Therefore, the formula $\varphi$ defines the set of pairs $(x, y)$ of nodes such that there is a path $\pi$ that goes from $x$ to $y$ and a register value $\lambda$ (i.e., a node $\lambda$) that satisfies that every node in $\pi$ is connected to $\lambda$. ✷

The extra expressive power of NRL$^+$ over RL$^+$ does not affect the data complexity of query evaluation:

Theorem 5.7. Evaluation of NRL$^+$ formulas is in NLogspace in terms of data complexity.
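The intuition behind Example 5.3, that a nested expression acts as a node filter along the path, can be illustrated by a direct evaluation sketch. For each candidate node $z$ (the register value), the code first computes the nodes that can reach $z$ and then searches for paths that stay inside that set. The representation of the graph as a list of nodes and a set of directed edges, and the names `reach` and `query_q`, are assumptions made for illustration; this is not the NLogspace procedure of the theorem.

```python
def reach(start, succ, allowed):
    """Nodes reachable from start using only nodes in the set 'allowed'."""
    if start not in allowed:
        return set()
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in succ.get(v, ()):
            if w in allowed and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def query_q(nodes, edges):
    """Pairs (x, y) joined by a path all of whose nodes can reach some z."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
    answers = set()
    for z in nodes:                          # candidate register value
        good = reach(z, pred, set(nodes))    # nodes passing the nested filter
        for x in good:
            for y in reach(x, succ, good):   # navigate inside the filtered set
                answers.add((x, y))
    return answers
```

Since $e_1 = a^*[r^{=}]$ also matches the empty path, a node trivially reaches itself, so pairs of the form (x, x) appear in the answer.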
Proof.Let G = (V, E, κ) be a data graph and φ an NRL + formula.Also, let D = {κ(v) | v ∈ V }.We assume without loss of generality that φ is Boolean, that is, we study the complexity of deciding whether G |= φ.In the case when φ is not Boolean, that is, when the input consists of G and an assignment α for φ over G, we simply replace each free variable η in φ by α(η), and then use the evaluation algorithm we describe below for the resulting formula.
Assume without loss of generality that φ is of the form ∃x∃ν∃πψ, where x is a tuple of node variables, ν is a tuple of register assignment variables, π is a tuple of path variables, and ψ is quantifier-free.Assume also that {e 1 , . . ., e m } is the set of NREMs mentioned in φ, and From the proof of Theorem 5.7 it also follows that NRL + formulas can be evaluated in Pspace in combined complexity.

Conclusions and Future Work
We have proven that the data complexity of walk logic (WL) is nonelementary, which rules out the practicality of the logic.We have proposed register logic (RL), which is an extension of regular expressions with memory.Our results in this paper suggest that register logic is capable of expressing natural queries about interactions between data and topology in data graphs, while still preserving the elementary data complexity of query evaluation (Pspace).Finally, we showed how to make register logic more tractable in data complexity (NLogspace) through the logic NRL + , while at the same time preserving some level of expressiveness of RL.
We leave open several problems for future work. One interesting question is to study the expressive power of extensions of walk logic, in comparison to RL and ECRPQ$^\neg$ from [4]. For example, we can consider extensions with regularity tests (i.e., an atomic formula testing whether a path belongs to a regular language). Even in this simple case, the expressive power of the resulting logic, compared to RL and ECRPQ$^\neg$, is already not obvious. Secondly, we do not know whether NRL$^+$ is strictly more expressive than RL. Finally, we mention that the expressibility of bipartiteness in WL is still open (an open question from [8]). We also leave open whether the query that a graph database is a connected graph with an even number of nodes is expressible in WL.
the extension of the set $D$ of data values with a new symbol $\bot$. Satisfaction of conditions is defined with respect to a value $d \in D$ (the data value that is currently being scanned) and a tuple $\tau = (d_1, \ldots, d_k) \in D_\bot^k$ (the data values stored in the registers, assuming that $d_i = \bot$ represents the fact that register $r_i$ has no value assigned) as follows (Boolean combinations omitted): $(d, \tau) \models r_i^{=}$ iff $d = d_i$.

Definition 5.1 (REMs). The class of REMs over $\Sigma$ and $\{r_1, \ldots, r_k\}$ is defined by the grammar:
$$e := \varepsilon \mid a \mid e \cup e \mid e \cdot e \mid e^+ \mid e[c] \mid {\downarrow}\bar r.e$$
where $a$ ranges over symbols in $\Sigma$, $c$ over conditions over $\{r_1, \ldots, r_k\}$, and $\bar r$ over tuples of elements in $\{r_1, \ldots, r_k\}$.
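The grammar of Definition 5.1 can be turned into a small evaluator over a single data path. The sketch below assumes the usual reading of REMs from [10]: $\varepsilon$ matches a path consisting of one node, ${\downarrow}\bar r.e$ stores the first data value of the path in the registers $\bar r$ before matching $e$, and $e[c]$ tests the condition $c$ on the last data value after matching $e$. The tuple-based syntax trees, the names `run` and `holds`, and the use of `None` for $\bot$ are hypothetical choices made here.

```python
BOTTOM = None  # stands for the symbol ⊥ (register with no value assigned)


def holds(cond, d, regs):
    """(d, regs) |= cond, for cond = ("eq", i) or ("neq", i)."""
    kind, i = cond
    return (regs[i] == d) if kind == "eq" else (regs[i] != d)


def run(e, data, labels, i, regs):
    """Set of (j, regs') such that positions i..j of the path match e."""
    op = e[0]
    if op == "eps":
        return {(i, regs)}
    if op == "sym":
        ok = i < len(labels) and labels[i] == e[1]
        return {(i + 1, regs)} if ok else set()
    if op == "union":
        return run(e[1], data, labels, i, regs) | run(e[2], data, labels, i, regs)
    if op == "concat":
        return {res for (j, r) in run(e[1], data, labels, i, regs)
                for res in run(e[2], data, labels, j, r)}
    if op == "plus":                       # one or more iterations of e[1]
        result, frontier = set(), {(i, regs)}
        while frontier:
            step = set()
            for (j, r) in frontier:
                step |= run(e[1], data, labels, j, r)
            frontier = step - result       # only new states are expanded
            result |= frontier
        return result
    if op == "cond":                       # e[1][c]: test the last data value
        return {(j, r) for (j, r) in run(e[1], data, labels, i, regs)
                if holds(e[2], data[j], r)}
    if op == "store":                      # ↓r.e: store the first data value
        new = tuple(data[i] if k in e[1] else v for k, v in enumerate(regs))
        return run(e[2], data, labels, i, new)
    raise ValueError(op)


A = ("sym", "a")
A_STAR = ("union", ("eps",), ("plus", A))
# a* · (↓r. a+[r=]) · a*: some node of the path is repeated
E1 = ("concat", A_STAR,
      ("concat", ("store", (0,), ("cond", ("plus", A), ("eq", 0))), A_STAR))


def repeats_node(data, labels):
    """True iff the whole path matches E1 starting from an empty register."""
    return any(j == len(labels) for (j, _) in run(E1, data, labels, 0, (BOTTOM,)))
```

Here a data path is given by its list of data values and the list of edge labels between them, so the path 1 a 2 a 1 is `data=[1, 2, 1]`, `labels=["a", "a"]`.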
and every tuple in D k ⊥ .The constant ⊥ is interpreted in M ′ G,ρ as the tuple ⊥ k .The interpretation of Nodes in M ′ G,ρ contains all nodes in the domain.The interpretation of Paths in M ′
$c(i) = b_1 \ldots b_{f_0 n}$ and $c(j) = b'_1 \ldots b'_{f_0 n}$. Then $j$ is not the successor of $i$ iff one of the following holds.
(a) For some $k$, $b_k \ldots b_{f_0 n}$ is equal to $1 \ldots 1$ and $b'_k$ is not equal to 0.
(b) For some $k$, $b_k \ldots b_{f_0 n}$ is equal to $01 \ldots 1$ and $b'_k$ is not equal to 1.
(c) For some $k$, $b_k = 1$, 0 occurs in $b_{k+1} \ldots b_{f_0 n}$, and $b'_k$ is not equal to 1.
(d) For some $k$, $b_k = 0$, 0 occurs in $b_{k+1} \ldots b_{f_0 n}$, and $b'_k$ is not equal to 0.
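The four conditions (a)-(d) above can be checked mechanically against ordinary binary increment. The sketch below determines, for each position $k$, the bit that the successor must carry according to the case that applies to $c(i)$ at $k$, and reports a violation when $c(j)$ disagrees. The function name and the representation of encodings as Python strings of equal length are assumptions. Note that with a fixed width the conditions treat the step from $1 \ldots 1$ to $0 \ldots 0$ as a successor step, so the comparison in the usage example is made modulo $2^m$.

```python
def violates_successor(b, b2):
    """True iff one of (a)-(d) holds for c(i) = b and c(j) = b2."""
    for k in range(len(b)):
        rest = b[k + 1:]
        if "0" not in b[k:]:                    # (a): b_k ... b_m = 1...1
            expected = "0"
        elif b[k] == "0" and "0" not in rest:   # (b): b_k ... b_m = 01...1
            expected = "1"
        elif b[k] == "1":                       # (c): b_k = 1, a 0 occurs later
            expected = "1"
        else:                                   # (d): b_k = 0, a 0 occurs later
            expected = "0"
        if b2[k] != expected:
            return True
    return False
```

For width $m = 4$, one can confirm exhaustively that `violates_successor(c(i), c(j))` holds exactly when $j \neq i + 1 \pmod{16}$.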
.5) It follows from the definitions of IF ∆ (s, t, π) and the graph G ∆