Modular Path Queries with Arithmetic

We propose a new approach to querying graph databases. Our approach balances the competing goals of expressive power, language clarity and computational complexity. A distinctive feature of our approach is the ability to express properties of minimal (e.g. shortest) and maximal (e.g. most valuable) paths satisfying given criteria. To express complex properties in a modular way, we introduce labelling-generating ontologies. The resulting formalism is computationally attractive: queries can be answered in non-deterministic logarithmic space in the size of the database.


Introduction
Graphs are one of the most natural representations of data in a number of important applications, such as modelling transport networks, social networks and technological networks (see the surveys [AAB+17, Woo12, Bar13]). The main strength of graph representations is that they naturally represent the connections among data. Effective search and analysis of graphs is an important factor in the reasoning performed in various tasks. This motivates the study of query formalisms that are capable of expressing properties of paths.
Nevertheless, most real-world data still resides in relational databases and relational engines are still the most popular database management systems [dbe18]. Hence, it would be desirable to consider a query formalism that directly generalizes the relational approach, offers a natural representation of data values and, at the same time, enables convenient querying of the graph structure.
Modern databases are often too large to be stored in a computer's memory. To make a query formalism computationally feasible, its query evaluation problem should preferably be solvable in logarithmic space w.r.t. the size of the database (data complexity) [CDGL+06, ACKZ07, BLLW12]. Checking the existence of a path between two given nodes is already NL-complete, and hence NL is the best complexity one can hope for in an expressive language for graph querying. It is worth mentioning that every problem in NL can be solved deterministically in O(log² n) space, which is a reasonable bound even for huge databases.
Aggregation. The ability to use aggregate functions such as sum, average or count is a fundamental mechanism in database systems. Klug [Klu82] extended the relational algebra and calculus with aggregate functions and proved their equivalence. Early graph query languages such as G+ [CMW88] and GraphLog [CM90, CM93] can aggregate data values. Consens and Mendelzon [CM93] studied path summarization, i.e., summarizing information along paths in graphs. They assumed natural numbers in their data model and allowed summarization results to be aggregated. In order to achieve good complexity (in the class NC), they allowed aggregate and summing operators that form a closed semiring. Other examples of aggregation can be found in [Woo12].
Summing vectors of numbers along graph paths has already been studied in the context of various formalisms based on automata or regular expressions, and has led to a number of proposals that have combined complexity in PSPACE and data complexity in NL. Kopczyński and To [KT10] have shown that Parikh images (i.e., vectors of letter counts) for the usual finite automata can be expressed using unions of linear sets that are polynomial in the size of the automaton and exponential in the alphabet size (the alphabet size, in our context, corresponds to the dimension of vectors). Barcelo et al. [BLLW12] extended ECRPQs with linear constraints on the counts of edge labels along paths. They expressed the constraints using reversal-bounded counter machines, translated them further to Presburger arithmetic formulas of polynomial size, and evaluated them using techniques from [KT10, Sca84]. Figueira and Libkin [FL15a] studied Parikh automata, introduced in [KR03]. These are finite automata that additionally store a vector of counters in ℕ^k. Each transition also specifies a vector of natural numbers. While moving along graph paths according to a transition, the automaton adds this transition's vector to the vector of counters. The automaton accepts if the computed vector of counters is in a given semilinear set in ℕ^k. Also, a variant of regular expressions capturing the power of these automata is shown. This model has been used to define a family of variants of CRPQs that can compare tuples of paths using synchronization languages [FL15b]. This is a relaxation of the regularity condition for relations on paths of ECRPQs and leads to more expressive formalisms with data complexity still in NL. These formalisms are incomparable to ours since they can express non-regular relations on paths, such as suffix, but cannot express properties of data values, nodes' degrees or extrema.
Cypher [The18] is a practical query language first implemented in the graph database engine Neo4j. It uses property graphs as its data model. These are graphs with labelled nodes and edges, but edges and nodes can also store attribute values for a set of properties. The MATCH clause of Cypher queries allows for specifying graph patterns that depend on nodes' and edges' labels as well as on their property values. OpenCypher, an initiative to standardize the language, has produced the Cypher Query Language Reference (Version 9) [ope17]. More on Cypher can be found in the surveys [AAB+17] and [FGG+18].
G-Core [AAB+18] is a joint effort of industrial and academic partners to define a language that is composable (i.e. graphs are inputs and outputs of a query), treats paths as first-class citizens and integrates the most important features of existing graph query languages. The G-Core data model extends property graphs with paths. Namely, in a graph, there is also a (possibly empty) collection of paths. The paths have their identity and can have their own labels and ⟨property, value⟩ pairs. G-Core also includes features like aggregation and (basic) arithmetic along paths, which are closely related to our proposal. G-Core allows for defining costs of paths either by the hop-count (length) or by positive weights (which may be computed by functional expressions). The full cost of a path is the sum of the weights, and G-Core is able to look for paths that minimize it. In contrast, our data model allows negative weights.
In Section 5 we compare the language constructions of G-Core, Cypher and OPRA in more detail.
Another proposal of a graph query language of commercial strength is PGQL [vRHK+16]. PGQL closely follows the syntactic structures of SQL and defines powerful regular expressions that allow for filtering nodes and edges along paths as well as computing shortest paths.
Our data model allows operating on property graphs [AAB+17], where multiple edges between a pair of nodes are allowed. To do so, for each of the edges we introduce a single, unique, additional node. Then, an edge can be represented by a binary labelling returning 1 for the pair of the source node and the additional node, and for the pair of the additional node and the target node of the edge. Naturally, the property values for the edge are assigned by unary labellings (named after the property keys) of the additional node. We present an example of how to encode a property graph in Section 2. Alternatively, an edge can be represented by a ternary labelling function returning 1 for all triples of the source node, the target node and the corresponding additional node.
RDF [CLW14] is a W3C standard that allows encoding of content on the Web in the form of a set of triples representing an edge-labelled graph. Each triple consists of the subject s, the predicate p, and the object o, which are resource identifiers (URIs), and represents an edge from s to o labelled by p. Interestingly, the middle element, p, may play the role of the first or the third element of another triple. Our formalism OPRA allows operating directly on RDF without any complex graph encoding, by using a ternary labelling representing RDF triples. This allows for convenient navigation by regular expressions, in which the middle element of a triple can also serve as the source or the target of a single navigation step (cf. [LRSV18]). The standard query formalism for RDF is SPARQL [PS08, HS13]. It implements property paths, which are RPQs extended with inverses and a limited form of negation (see the survey [AAB+17]).
Another idea in processing graphs, fundamentally different from our approach, is based on the Pregel model [MAB+10], where the computation is organized in rounds. Basically, each node has a state and, in each round, is able to send messages to its neighbouring nodes and change its state according to the messages sent by its neighbours.

Preliminaries
Various kinds of data representations for graphs are possible and presented in the literature. The differences typically include the way the elements of graphs are labelled: both nodes and edges may be labelled by finite or infinite alphabets, which may have some inner structure. Here, we choose a general approach in which a labelled graph, or simply a graph, is a tuple consisting of a finite number of nodes V and a number of labelling functions λ : V^l → ℤ ∪ {−∞, ∞} assigning integers to vectors of nodes of some fixed size l (we allow labellings without parameters, i.e., l = 0, for reasons that will be apparent later).
While edges are not explicitly mentioned, if needed, one can consider an edge labelling E such that E(v, v ′ ) is 1 if there is an edge from v to v ′ and it is 0 otherwise.
A special case of a labelled graph is a relational database. In this case, the range of all the labellings is {0, 1} (and hence the labellings can be called "relations"), and every node has to be in at least one relation. Clearly, each of the labellings defines a relation of the same arity as the labelling. For convenience, we assume that the set of nodes contains a distinguished node ◻; this is an artificial node we use to avoid problems with paths of different lengths. Note that all labellings must define their values also for tuples with ◻ and we do not restrict these values in principle.
Figure 3. An equivalent graph G′: its nodes represent the nodes and the edges of G.
A path is a sequence of nodes. For a path p = v_1 … v_k, by |p| we denote the length of p and by p[i] we denote its i-th element, v_i, if i ≤ k, and ◻ otherwise.
To compare our language with other formalisms, we will also consider a special subclass of labelled graphs, called standard graphs. These are graphs with a single unary labelling function λ : V → {0, …, k} for some k and a single binary labelling function E : V^2 → {0, 1}.
We show how to represent property graphs [AAB+17] in our model. In this model, nodes and edges of a graph can be annotated with properties in the form of key-value pairs. We present an example of a property graph G in Figure 2. The graph G represents a fragment of a map. The nodes represent places and the edges represent links between the places. Both the nodes and the edges of G contain labels such as S or T and a number of properties, e.g., a type which identifies what kind of a place, or a link, it is. As depicted in Figure 3, and already discussed in Section 1, we represent both nodes and edges of property graphs as nodes in our model. This is hardly a surprise since we adopt a lightweight concept of edges. The binary labelling E we use yields a value 0 or 1 for a pair of nodes, specifying the existence of an edge, and does not store any additional information. Note that the key-value pairs are also represented with labellings; in particular, each labelling is named after the key.
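The encoding of property graphs described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `encode_property_graph` and `value`, the integer type codes and the time values are all hypothetical, and tuples not listed in a labelling are assumed to have value 0.

```python
BOX = "◻"  # the distinguished padding node of the data model


def encode_property_graph(nodes, edges):
    """Encode a property graph as a labelled graph.

    nodes: {node_id: {key: int_value}}
    edges: [(edge_id, source, target, {key: int_value})]
    Returns (V, labellings); each labelling maps tuples of nodes to integers.
    """
    V = {BOX} | set(nodes)
    labellings = {"E": {}}
    for node_id, props in nodes.items():
        for key, val in props.items():
            labellings.setdefault(key, {})[(node_id,)] = val
    for edge_id, source, target, props in edges:
        V.add(edge_id)  # one fresh node per edge of the property graph
        labellings["E"][(source, edge_id)] = 1
        labellings["E"][(edge_id, target)] = 1
        for key, val in props.items():
            labellings.setdefault(key, {})[(edge_id,)] = val
    return V, labellings


def value(labellings, name, *args):
    """Value of a labelling on a tuple of nodes (assumption: default is 0)."""
    return labellings[name].get(tuple(args), 0)


# Hypothetical type codes and times, chosen only for illustration.
SQUARE, PARK, WALK, TRAM = 1, 2, 3, 4
V, L = encode_property_graph(
    nodes={"S": {"type": SQUARE}, "P": {"type": PARK}},
    edges=[("W", "S", "P", {"type": WALK, "time": 20}),
           ("T", "S", "P", {"type": TRAM, "time": 5})],
)
```

Because every edge becomes a node of its own, two parallel edges between S and P (here W and T) remain distinguishable, and their properties are ordinary unary labellings.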

Language OPRA
To define OPRA, we first introduce its two fragments, PR and PRA, which can be seen as syntactic restrictions of OPRA. The first fragment, PR, includes Path constraints and Regular constraints, and the second one, PRA, includes also Arithmetical constraints.
The language OPRA will extend PRA with Ontologies. For simplicity, we assume that the queries return only node identifiers. For practical purposes this can be straightforwardly extended to returning labels.
We now introduce basic notational conventions we use. By x we denote a node variable, by π a path variable, by ⃗ x and ⃗ π tuples of node and path variables respectively, by c an integer value. By λ we denote a labelling; for binary labellings (i.e. with two arguments), we often use symbol E instead of λ. Our conventions are summarised in the following listing.
x, x_1, … ∈ NodeVariables        node variables
π, π_1, … ∈ PathVariables        path variables
c, c_1, … ∈ Integers             integer values
E, E_1, … ∈ BinaryLabellings     binary labellings
λ, λ_1, … ∈ Labellings           labellings (of any arity)
v, t, … ∈ V                      nodes
p, p_1, … ∈ V*                   paths
f, g, f_1, … ∈ F                 functions

OPRA queries are of the form (capital letters distinguish the terminal symbols):

LET Ontologies IN
SELECT NODES x⃗, PATHS π⃗
SUCH THAT PathConstraints
WHERE RegularConstraints
HAVING ArithmeticalConstraints

where x⃗ are free node variables, π⃗ are free path variables, PathConstraints is a conjunction of path constraints, RegularConstraints is a conjunction of regular constraints, ArithmeticalConstraints is a conjunction of arithmetical constraints, and Ontologies is a sequence of definitions of auxiliary labellings, described in Section 3.3. Either of the tuples x⃗, π⃗ can be empty. The conjuncts in these constraints are connected with the keyword AND. The constraints may contain variables not listed in the SELECT clause (which are then existentially quantified). Unnecessary components may be omitted (e.g., the keyword WHERE if no regular constraints are needed). Each node variable may also be treated as a path variable, representing a single-node path.
3.1. Path and regular constraints. The language PR is a syntactic fragment of OPRA obtained by disallowing the keywords LET ... IN and HAVING . In other words, PR queries consist of path constraints, which involve node and path variables, and regular constraints, which involve path variables only. We complete its definition by defining the path constraints and regular constraints.
Each node constraint has a natural number parameter k. Such a k-node constraint is either ⟨⊺⟩, denoting a dummy constraint that always holds, or an expression of the form ⟨X △ X′⟩, where △ ∈ {≤, <, =, >, ≥, ≠} and each of X, X′ is an integer constant or a labelling function λ applied to some of the variables in Π.
A regular constraint R(π_1, …, π_k) is syntactically a regular expression over an infinite alphabet consisting of conjunctions of all k-node constraints. We write regular expressions with ε denoting the empty word, ⋅ denoting concatenation, + denoting alternation, and * denoting Kleene star.
Regular constraints are evaluated over paths from the SELECT part and paths quantified existentially. In order to allow accessing nodes on a path π, we introduce fresh, free node variables Prev(π), π, and Next(π). Naturally, Prev(π), π and Next(π) represent the nodes at the previous, the current and the next position of π, respectively. By Π we denote the set of the variables Prev(π), π, Next(π) for all paths π of a given query.
Intuitions for Regular Constraints. Before we define the semantics of regular constraints, we present the intuition behind them. An important part of a query language for graphs is the way to specify graph patterns. In PR, we define them using conjunctions of regular constraints. Regular constraints extend the formalism of Regular Path Queries (RPQs) to deal with the values of labelling functions. The idea is different from the one involving Regular Expressions with Memory (REMs) [LMV16]. REMs can store the values at the current position in registers and then test their equality with other values already in registers. Here, we limit PR expressions to access only the nodes at the current, the next and the previous position on each of the paths. Later on, in the last example of Section 4, we show how OPRA makes it possible to compare values of nodes at arbitrarily distant positions.
Path Constraints Semantics. Given a set of node and path variables V and a graph G with the nodes V we define a variable instantiation η^G as a function from V to the nodes and paths of G. We sometimes omit the superscript when the graph is clear from the context, and we do not distinguish an instantiation and its canonical extension to tuples of node and path variables.
Given a labelling λ in a graph G and a tuple v⃗ of nodes of G (of appropriate arity) we say that …

Regular Constraints Semantics. Given paths p_1, …, p_k we define W(p_1, …, p_k) ∈ (V^{3k})* to be the word of the length of the longest path among p_1, …, p_k such that for each j we have

W(p_1, …, p_k)[j] = (p_1[j−1], p_1[j], p_1[j+1], …, p_k[j−1], p_k[j], p_k[j+1]),

where positions outside a path yield ◻.

Intuitively, the semantics is given by applying the specified labelling functions to the nodes in the given vector and comparing the values according to the △ symbol. A node constraint may be seen as a function that takes a vector of 3k nodes (i.e., W(p_1, …, p_k)[j], if j is the current position on the paths), represented by the variables in Π, and returns a Boolean value.

Given a graph G over vertices V, the language of R, denoted as L_G(R), is a subset of (V^{3k})* defined using the usual rules: L_G(ε) contains only the empty word; for a conjunction of k-node constraints r, L_G(r) is the set of all vectors of length 3k for which r returns true; and L_G(R ⋅ R′), L_G(R + R′) and L_G(R*) are defined inductively as the concatenation, the union and the Kleene star closure of the appropriate languages. Given a graph G and an instantiation η^G we say that R(π_1, …, π_k) holds in G under η^G, written G, η^G ⊧ R(π_1, …, π_k), if and only if W(η^G(π_1), …, η^G(π_k)) ∈ L_G(R).
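The word W(p_1, …, p_k) can be illustrated with a short Python sketch. It assumes, as a reading of the definitions above, that each letter lists the previous, current and next node of every path and that positions outside a path (including the position before the first node) yield ◻. The helper names `at`, `word` and `satisfies_star`, and the labelling values, are hypothetical.

```python
BOX = "◻"


def at(path, i):
    """1-based access p[i]; positions outside the path yield the padding node."""
    return path[i - 1] if 1 <= i <= len(path) else BOX


def word(*paths):
    """W(p_1, ..., p_k): one letter of 3k nodes per position of the longest path."""
    longest = max(len(p) for p in paths)
    return [
        tuple(node for p in paths for node in (at(p, j - 1), at(p, j), at(p, j + 1)))
        for j in range(1, longest + 1)
    ]


def satisfies_star(constraint, *paths):
    """Check a regular constraint of the shape <c>*: c must hold at every position."""
    return all(constraint(letter) for letter in word(*paths))


lam = {"a": 3, "b": 1, "c": 0, BOX: 0}  # hypothetical labelling values


def positive(letter):
    # letter[1] is the current node of the single path, i.e. the variable π
    return lam[letter[1]] > 0
```

For a single path, each letter is a triple (Prev(π), π, Next(π)); with several paths the triples are concatenated, so shorter paths are padded with ◻ up to the length of the longest one.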

Example. Consider the query:
SELECT NODES x, y SUCH THAT x →^π_E y WHERE ⟨E(Next(π), π)⟩* ⟨⊺⟩ AND ⟨λ(π) > 0⟩*

Let G be a graph and let η^G be an instantiation. Let v = η^G(x) and w = η^G(y). The query holds in G iff there is a bidirectional path between v and w, each of whose nodes is labelled by λ with a positive number. Notice that ⟨⊺⟩ is required as Next(π) is ◻ for the last node. If G has an additional binary labelling E⁻¹ such that for all nodes v, w we have E⁻¹(v, w) = E(w, v), then the same query can be stated as follows:

Note that the existentially quantified path π is mentioned in both of the path constraints above. We discuss in Section 3.3 how to define E⁻¹ as an auxiliary labelling based on E.
Query semantics. Let Q(x⃗, π⃗) be a PR query. Given a graph G and tuples of nodes v⃗ and paths p⃗ of G we say that Q(v⃗, p⃗) holds in G, denoted by G ⊧ Q(v⃗, p⃗), if and only if there exists an instantiation η^G of all free and existential node and path variables of Q such that η^G(x⃗) = v⃗, η^G(π⃗) = p⃗ and all path and regular constraints of Q hold for G under η^G.
3.2. Arithmetical constraints. The syntax for arithmetical constraints is defined as follows.
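The meaning of the first query in this example can be mimicked by a small search procedure. The sketch below is only an illustration in Python, not the evaluation algorithm of the paper. It assumes that the path constraint asks for a path from x to y along E (possibly consisting of a single node), represents E as a set of pairs with value 1, and uses the hypothetical function name `holds`.

```python
from collections import deque


def holds(nodes, E, lam, v, w):
    """Is there a path from v to w such that every step has edges in both
    directions and every node on the path has a positive value of lam?"""
    if lam.get(v, 0) <= 0:
        return False
    seen = {v}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if u == w:
            return True
        for t in nodes:
            # (u, t) in E follows the path constraint; (t, u) in E mirrors
            # the node constraint <E(Next(π), π)> at the position of u.
            if t not in seen and (u, t) in E and (t, u) in E and lam.get(t, 0) > 0:
                seen.add(t)
                queue.append(t)
    return False


# Hypothetical three-node graph: a <-> b, and b -> c without a reverse edge.
NODES = ["a", "b", "c"]
EDGES = {("a", "b"), ("b", "a"), ("b", "c")}
LAM = {"a": 1, "b": 2, "c": 3}
```

Here `holds(NODES, EDGES, LAM, "a", "b")` succeeds, while the pair (a, c) fails because the edge from b to c has no reverse edge.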
The language PRA is a syntactic fragment of OPRA obtained by disallowing the keyword LET ... IN , i.e., it extends PR with arithmetical constraints, which we define below.
An arithmetical atom is of the form λ[π_1, …, π_k], where λ is a labelling. An arithmetical constraint is an inequality c_1Λ_1 + ⋯ + c_jΛ_j ≤ d, where c_1, …, c_j are integer constants, d is either an integer constant or a parameterless labelling, and each Λ_i is an arithmetical atom.
Semantics. Let Λ be an arithmetical atom λ[π_1, …, π_k] and let G be a graph. Consider an instantiation η^G of path variables π_1, …, π_k with paths p_1, …, p_k. The value of Λ under η^G, denoted by val_{η^G}(Λ), is defined as

val_{η^G}(Λ) = Σ_{i=1}^{s} λ(p_1[i], …, p_k[i]), where s = max(|p_1|, …, |p_k|).

The arithmetical constraint c_1Λ_1 + ⋯ + c_jΛ_j ≤ d holds in G under η^G if and only if c_1 ⋅ val_{η^G}(Λ_1) + ⋯ + c_j ⋅ val_{η^G}(Λ_j) ≤ d.

Example. Consider a class of graphs with an edge labelling E, a unary labelling One, which returns 1 for all nodes, and a unary labelling λ. The following query holds in a graph G under an instantiation η^G when the nodes η^G(x), η^G(y) are connected by a path p = η^G(π) such that each node of p is labelled by λ with a positive number and the average value of λ over all nodes is at most 5.
Query semantics. The query semantics is defined in virtually the same way as for PR queries; we additionally require the instantiation to satisfy all arithmetical constraints of the query.
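A bound on an average can be turned into a linear inequality of the required shape: the average of λ along π is at most 5 exactly when λ[π] − 5 ⋅ One[π] ≤ 0. The Python sketch below illustrates this under the assumption that the value of an atom is the position-wise sum of the labelling along the paths, padded with ◻; the names `val` and `average_at_most` and the labelling values are hypothetical.

```python
BOX = "◻"


def at(path, i):
    """1-based access p[i]; positions outside the path yield the padding node."""
    return path[i - 1] if 1 <= i <= len(path) else BOX


def val(lam, *paths):
    """Value of the atom lam[p_1, ..., p_k]: sum of lam over aligned positions."""
    longest = max(len(p) for p in paths)
    return sum(lam(*(at(p, i) for p in paths)) for i in range(1, longest + 1))


def average_at_most(lam, path, bound):
    """Check lam[π] - bound * One[π] <= 0, i.e. the average of lam is <= bound."""
    one = lambda node: 1  # the labelling One from the example
    return val(lam, path) - bound * val(one, path) <= 0


VALUES = {"a": 4, "b": 6, "c": 5}        # hypothetical values of λ
lam = lambda node: VALUES.get(node, 0)
same = lambda x, y: 1 if x == y else 0   # a hypothetical binary labelling
```

Multiplying out the denominator is what keeps the constraint linear with integer coefficients, which is the only form the syntax above admits.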
3.3. Auxiliary labellings. The language OPRA extends PRA with the constructions that define auxiliary labellings of graphs, which are defined based on existing graph labellings and its structure.

Ontologies
In the syntax of ontologies, Q is an OPRA query, c is an integer, −∞ or +∞, π ∈ PathVariables, x is a fresh node variable, x⃗ is a tuple of node variables, x_1, x_2 are variables from x⃗, the tuples y⃗, y⃗_1, …, y⃗_k consist of variables from x⃗, and λ ∈ Labellings. The ability to define auxiliary labellings significantly extends the expressive power of the language. The essential property of auxiliary labellings is that their values do not need to be stored in the database, which would require polynomial memory, but can be computed on demand. An auxiliary labelling may be seen as an ontology or a view.
Ontologies is of the form λ_1(x⃗_1) := t_1, …, λ_n(x⃗_n) := t_n. Such a sequence defines auxiliary labellings λ_1, …, λ_n that can be used in the PRA query and also in the following labellings, i.e., λ_i can be used in λ_j if i < j. The labellings are defined by means of terms t_1, …, t_n, which are expressions with free variables x⃗_1, …, x⃗_n. To define terms, we assume a set F of functions comprising F_A and F_b, where F_A is a set of aggregate functions including maximum Max, minimum Min, counting Count and summation Sum, and F_b is a set of binary functions including +, −, ⋅ and ≤. The set F can be extended, if needed, by any functions computable by a non-deterministic Turing machine (see Remark 8.3) whose size of all tapes while computing f(x⃗) is logarithmic in the length of x⃗ and the values in x⃗, assuming binary representation, provided that additional aggregate functions in F are invariant under permutation of arguments.
We distinguish four types of terms.
Basic terms. There are three basic types of terms: a constant c, a labelling value λ(⃗ y), and a node identity test y = y ′ , where y, y ′ are node variables, ⃗ y is a vector of node variables, c is a constant, λ is a labelling.
Function application. A term can be a function application to other terms: f(t_1, …, t_n), where f ∈ F and t_1, …, t_n are terms.
Subqueries. There are two essential constructs involving subqueries in terms. First, we can evaluate the truth value of a subquery, i.e., the expression [Q(y⃗)], where Q is an OPRA query with y⃗ being free node variables. Note that y⃗ occur in the main query and hence they become instantiated in the subquery. Therefore, [Q(y⃗)] returns the value 1 if, under the current instantiation of y⃗, the query Q(y⃗) holds, and 0 otherwise. Second, we minimise (resp., maximise) the value of a parameter satisfying a subquery, i.e., the value of the expression min_{λ,π} Q(y⃗, π) (resp. max_{λ,π} Q(y⃗, π)), where Q is an OPRA query with free node variables y⃗ and a single free path variable π, to obtain the minimum (resp. maximum) of the values λ[π] over paths satisfying Q (as usual, the minimum of the empty set is +∞ and the maximum of the empty set is −∞). The expression λ[π] denotes, as in the arithmetical constraints, the value of the labelling λ applied to the path π (i.e., the sum of the values of λ along π).
Aggregative properties. The last type of term allows us to apply a function to the set of labels of all nodes satisfying some condition. Such a term has the form f({t(x) : t′(x, y⃗)}), where f ∈ F_A is an aggregate function. The value can be defined as follows: first, we compute the set X = {x_1, …, x_s} of nodes such that for every x ∈ X we have t′(x, y⃗) = 1. Then, the value of the term is the value of f(t(x_1), t(x_2), …, t(x_s)). Notice that since f is an aggregate function, the value does not depend on the order among x_1, …, x_s.

Formal semantics. The semantics of terms is as follows. Let G be a graph and let η^G be an instantiation. In Table 1 we inductively extend instantiations to terms. If G is clear from the context, we omit the superscript.

Table 1.
Term | Value under η^G
[Q(y⃗)] | 1 if Q(η^G(y⃗)) holds in G and 0 otherwise
min_{λ,π} Q(y⃗, π) | the minimum of the values of λ[p], defined as in the arithmetical constraints, over all paths p such that Q(η^G(y⃗), p) holds in G
max_{λ,π} Q(y⃗, π) | the maximum of the values of λ[p], defined as in the arithmetical constraints, over all paths p such that Q(η^G(y⃗), p) holds in G

Auxiliary labellings. Consider a term t(x⃗) of arity k and a graph G that does not have a labelling λ. We define the graph G[λ := t] as the graph G extended with the labelling λ of arity k such that λ(v⃗) = t(v⃗) for all v⃗ ∈ V^k. We call λ an auxiliary labelling of G. We write G[λ_1 := t_1, …, λ_n := t_n] to denote the result of successively adding the labellings λ_1, …, λ_n to the graph G, i.e.,

G[λ_1 := t_1, …, λ_n := t_n] = (…((G[λ_1 := t_1])[λ_2 := t_2])…)[λ_n := t_n].

Size of an OPRA query. The size of a query Q of the form LET O IN Q′ is the sum of the sizes of the binary representations of the terms t_1, …, t_n in O and the size of the query Q′.
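The construction G[λ := t], with labelling values computed on demand rather than stored, can be sketched in Python as follows. The class `Graph` and its methods are hypothetical and serve only as an illustration; terms are modelled as callables that receive the graph before the extension, so that a later labelling may use earlier ones but not itself.

```python
class Graph:
    """A labelled graph whose labellings are callables from nodes to integers."""

    def __init__(self, nodes, labellings):
        self.nodes = frozenset(nodes)
        self.labellings = dict(labellings)

    def value(self, name, *nodes):
        return self.labellings[name](*nodes)

    def extend(self, name, term):
        """Return G[name := term]; the term is evaluated lazily, on demand."""
        if name in self.labellings:
            raise ValueError(f"labelling {name!r} already exists")
        extended = dict(self.labellings)
        # The term sees `self`, i.e. the graph without the new labelling.
        extended[name] = lambda *nodes: term(self, *nodes)
        return Graph(self.nodes, extended)


EDGES = {("a", "b")}  # hypothetical edge set
G = Graph({"a", "b"}, {"E": lambda x, y: 1 if (x, y) in EDGES else 0})

# G[E_inv := E(y, x), One := 1], built by successive extension.
G2 = (G.extend("E_inv", lambda g, x, y: g.value("E", y, x))
       .extend("One", lambda g, x: 1))
```

Nothing is materialised when `extend` is called; a value of `E_inv` is computed from `E` only when it is requested, which mirrors the remark that auxiliary labellings need not be stored in the database.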
Example. We can define the labellings E⁻¹ and One used in the above examples with the terms E⁻¹(x, y) := E(y, x) and One(x) := 1. The example from Section 3.2 can be stated for graphs without the One labelling as follows: …

Given a graph G and tuples of nodes v⃗ and paths p⃗ of G we say that … is a PRA query, which can refer to auxiliary labellings λ_1, …, λ_n.

Examples
We focus on the following scenario: a graph database that corresponds to a map of some area. Each node of the graph represents either a place or a link from one place to another. The graph has four unary labellings and one binary labelling. The labelling type represents the type of a place for places (e.g., square, park, pharmacy) or the mode of transport for links (e.g., walk, tram, train); we assume each type is represented by a constant, e.g., c_square, c_park. The labelling attr represents attractiveness (which may be negative, e.g., in unsafe areas), and time represents time. The binary labelling E represents edges: for nodes v_1, v_2, the value E(v_1, v_2) is 1 if there is an edge from v_1 to v_2 and 0 otherwise. For example, the graph in Fig. 3 represents a map with two places: S is a square and P is a park. There are three nodes representing links: node W represents moving from S to P by walking, T moving from S to P by tram and B moving from P to S by bus.
The language PRA can express properties of paths' sums. Consider the query Q 1 (x, y) below. For nodes s, t, the query Q 1 (s, t) holds iff there is a route from s to t that takes at most 6 hours and its attractiveness is over 100.
Furthermore, we can compute averages, to some extent. For example, the following arithmetical constraint says that for some path π the average attractiveness of π is at least 4 attractiveness points per minute:

Multiple paths. We define a query Q_2(x, y) that asks whether there is a route from x to y such that from every place we can take a tram (e.g., if it starts to rain). We express that by stipulating a route π from x to y and a sequence ρ of tram links, such that every node of π representing a place is connected with the corresponding tram link in ρ. In a way, ρ works as an existential quantifier for nodes of π.
where the parameterized macro link(π, ρ), defined as (⟨type(π) = c_bus⟩ + ⟨type(π) = c_walk⟩ + ⟨type(π) = c_tram⟩ + ⟨E(π, ρ) = 1⟩)*, states that every node of the first path either is not a place, i.e., it represents one of the possible links (by bus, walk or tram), or is connected with the corresponding node of the second path.
4.2. Language OPRA. We show how to employ auxiliary labellings in our queries. For readability, we introduce some syntactic sugar: constructions which do not change the expressive power of OPRA, but allow queries to be expressed more clearly. We use the function symbols =, ≠ and Boolean connectives, which can be derived from ≤ and arithmetical operations. Also, we use terms t(x, y) in arithmetical constraints, which can be expressed by first defining the labelling λ(x, y) := t(x, y), defining additional paths ρ_1 = x, ρ_2 = y of length 1, and using λ[ρ_1, ρ_2].

Processed labellings. Online route planners often allow looking for routes which do not require much walking. The query Q_3(x, y) asks whether there exists a route from x to y such that the total walking time is at most 10 minutes. To express it, we define a labelling t_walk(x), which is the time of x if it is a walking link, and 0 otherwise.
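One way such a labelling could be written with the term constructs of Section 3.3 is sketched below. This is an assumption for illustration rather than the authors' definition: it treats the derived function symbol = as returning 1 or 0, uses the product from F_b to switch the time on or off, and bounds the sum of the resulting labelling along the route π by 10.

```latex
t_{\mathit{walk}}(x) \;:=\; \bigl(\mathit{type}(x) = c_{\mathit{walk}}\bigr) \cdot \mathit{time}(x),
\qquad
t_{\mathit{walk}}[\pi] \;\le\; 10 .
```

For a walking link the first factor is 1 and the term equals time(x); for any other node the factor is 0, so the node contributes nothing to the sum.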
Nested queries. It is often advisable to avoid crowded places, which are usually the most attractive places. We write a query that holds for routes that are always at least 10 minutes away from any node with attractiveness greater than 100. We define a labelling crowded(x) as …

Notice that the variables π and y are existentially quantified. We check whether the value of crowded is 0 for each node of the path π.
Nodes' neighbourhood. "Just follow the tourists" is a piece of advice given quite often. With OPRA, we can verify whether it is good advice in a given scenario. A route is called greedy if at every position, the following node on the path is the most attractive successor. We define a labelling MAS(x, y) that is 1 if y is the most attractive successor of x, and 0 otherwise, and use it to express that there is a greedy route from a node x to a node y.
The above query considers routes that in each place select the most attractive link, and in each link select the most attractive place. What if we are interested in the attractiveness of the places only? If we assume that x and y are places and all edges are between places and links only, then this can be achieved by replacing the WHERE clause of the above query by:

Such a query checks at every link that the following place is the most attractive place that can be directly reached from the previous place.

Properties of paths' lengths. In route planning, we can optimize different aspects such as time, necessary budget or attractiveness. We can express the conjunction of such objectives, i.e., specify routes that are optimal for several objectives. The following query asks whether it is possible to get from s to t with a route that takes the shortest time among all routes and, at the same time, maximises the attractiveness among all routes.
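The labelling MAS and the notion of a greedy route can be illustrated in Python. The sketch is hypothetical in its details: successors are given as adjacency lists, ties for the maximal attractiveness are all accepted, and the names `mas` and `is_greedy` as well as the attractiveness values are made up for the example.

```python
def mas(succ, attr, x, y):
    """1 if y is a successor of x with maximal attractiveness among the
    successors of x, and 0 otherwise (ties are accepted by assumption)."""
    candidates = succ.get(x, [])
    if y not in candidates:
        return 0
    best = max(attr[z] for z in candidates)  # aggregate Max over successors
    return 1 if attr[y] == best else 0


def is_greedy(succ, attr, route):
    """A route is greedy if every step goes to a most attractive successor."""
    return all(mas(succ, attr, a, b) == 1 for a, b in zip(route, route[1:]))


# Hypothetical graph: from s one can continue to a (attr 5) or b (attr 3).
SUCC = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
ATTR = {"s": 0, "a": 5, "b": 3, "t": 1}
```

The maximum over the successors of x plays the role of an aggregative term with f = Max, and the per-step test corresponds to a node constraint that relates the current and the next position of the path.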
The following query asks whether there exists a route from a club s to a club t on which the attractiveness of visited clubs never decreases. In the register-based approach, we achieve this by storing the most recently visited club in a separate register. Here, we express this register using an additional path ρ, storing the values of the register, a labelling r(x ′ , y, y ′ ) which states that y ′ = x ′ if x ′ is a club, and y ′ = y otherwise, and an auxiliary labelling ⊺ which is true for all the pairs of nodes.

G-Core, Cypher and OPRA
In this section we compare constructs of G-Core [AAB+18], Cypher [The18] and OPRA.
All three formalisms have SQL-like clauses and rely crucially on matching of graph patterns. Such patterns are specified in G-Core and Cypher using ASCII-art syntax, e.g., the pattern (n)-[:connection]->(m)->n binds the variables n, m to the pairs of nodes (a, b) such that there is a directed edge labelled connection from a to b and there is also some edge going back. Patterns can be fixed structures (rigid patterns), consisting of the exact graph that should match the input graph, but may also be specified by navigational paths using regular expressions. In OPRA we provide two options: path constraints x_s →^π_E x_t, which essentially define reachability over a binary labelling that encodes edges, as well as powerful regular constraints.
The process of matching patterns generates tuples of matched node, edge and path values that can be filtered by WHERE conditions. Technically, a graph-pattern match corresponds to a homomorphism from a query Q to the input graph G. All three formalisms allow for an unconstrained semantics where multiple variables may match the same value (a node or an edge). The default matching semantics in Cypher is, however, the no-repeated-edge semantics [AAB+17], where variables corresponding to edges have to be mapped one-to-one and the other variables need not be mapped injectively. Each of the languages may produce an enumeration of all matched tuples as well as of some of the projections on subsets of variables. G-Core and Cypher support shortest-path (hop count) matching; G-Core also allows for weighted shortest paths where the costs may be specified using positive weights. In Cypher and G-Core the stream of tuples generated by pattern matching can be aggregated and then returned to be processed in the following parts of a query. For example, consider the following Cypher query [AAB+17] to find the longest movies in a collection. The first pattern MATCH (m:Movie) matches all movies, aggregates their lengths and returns the maximal length using the WITH clause. Then, the second MATCH (m:Movie) again matches all movies, but this time it filters out the ones with runtime not equal to maxTime.
Although OPRA does not allow aggregation at this stage, we can reformulate such queries as follows.
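The two-pass structure of the query just described (aggregate first, then match again and filter) can be mirrored in a few lines of Python. This is only an analogy for the control flow, not Cypher and not an OPRA query; the function name `longest_movies` and the sample runtimes are hypothetical.

```python
def longest_movies(runtime):
    """runtime: {movie: minutes}. Return the movies of maximal runtime."""
    if not runtime:
        return []
    max_time = max(runtime.values())  # first pass: aggregate, as with WITH
    # second pass: match all movies again and keep those with runtime == max_time
    return sorted(m for m, r in runtime.items() if r == max_time)


RUNTIME = {"A": 120, "B": 150, "C": 150}  # hypothetical collection
```

The first pass produces a single value that the second pass treats as a constant, which is the pattern that a parameterless auxiliary labelling can play in an OPRA query.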
Recall that here we use the fact that each node variable may be also treated as a path variable that represents a single-node path.
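The two-stage computation performed by the CYPHER query discussed above (first compute the maximal runtime, then keep the movies attaining it) can be illustrated outside of any query language. The following is a minimal Python sketch; the function name longest_movies and the sample collection are hypothetical and serve only as an illustration.

```python
# Hypothetical movie collection: (title, runtime in minutes). The data is made up.
movies = [("Alpha", 95), ("Beta", 142), ("Gamma", 120), ("Delta", 142)]


def longest_movies(collection):
    """Return the titles of all movies whose runtime is maximal."""
    if not collection:
        return []
    # First pass: aggregate the runtimes into a single maximal value.
    max_time = max(runtime for _, runtime in collection)
    # Second pass: filter out the movies whose runtime differs from max_time.
    return [title for title, runtime in collection if runtime == max_time]


print(longest_movies(movies))
```

Both passes range over the same collection, which mirrors the two MATCH (m:Movie) patterns described above.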
A distinctive feature of OPRA is the ability to compare paths in terms of regular relations [BLLW12]. Regular relations that can be specified in regular constraints include path equality, prefix (i.e., is a path a prefix of another?), length comparisons, fixed-edit distance, and synchronous transformation.
G-Core and OPRA have tractable data complexity; on the other hand, there are at least two reasons for which the data complexity of CYPHER is NP-hard. We have already mentioned that CYPHER has the no-repeated-edge semantics for graph pattern matching. This makes the evaluation intractable [MW95]. The second reason is its ability to unwind paths, that is, to return path elements (e.g. nodes) and then to process them. This feature may be used [AAB + 17] to write a fixed query that returns two different disjoint routes between two given nodes, which is also an NP-hard problem. Note that this is precisely the reason for which we do not allow in OPRA nested queries with free path variables (only node variables are allowed). We discuss this topic in Section 6.
G-Core and CYPHER have a number of features that are not present in OPRA. In particular, in OPRA there are no features allowing for any modification of graphs or their data values. We can only define new labellings and then use them in the following parts of queries. It is also not possible to construct and return graphs (and thus queries are not composable).

Closure properties
In this section we discuss closure properties of OPRA under standard set-theoretic operations. We define these operations formally as follows. First, for a query Q(⃗ x, ⃗ π) and a graph G we define the result of Q on G, denoted by Q[G], as the set of tuples (⃗ v, ⃗ p) of nodes and paths of G such that Q(⃗ v, ⃗ p) holds in G. To avoid problems with the artificial node ◻, which is used to align paths of different lengths, we ignore it in Q[G].

Jakub Michaliszyn, Jan Otop, and Piotr Wieczorek
Vol. 17:3
• a query Q ∃ (x i 1 , . . . , x i k , π j 1 , . . . , π j l ) is a projection of Q(⃗ x, ⃗ π) if and only if i 1 , . . . , i k are distinct indices from {1, . . . , |⃗ x|}, j 1 , . . . , j l are distinct indices from {1, . . . , |⃗ π|}, and for every graph
if and only if ⃗ x and ⃗ x ′ are disjoint, ⃗ π and ⃗ π ′ are disjoint, and for every graph G we have
Theorem 6.1. Given OPRA queries Q 1 , Q 2 , we can compute in polynomial time every projection of Q 1 , the intersection, the union, and the Cartesian product of Q 1 and Q 2 . If Q 1 has no free path variables, then we can compute in polynomial time the complement of Q 1 .
Proof of Theorem 6.1. The projection case is straightforward: a projection of a query can be obtained by simply not listing the unwanted variables in the SELECT statement.
To define the complement we simply use Q 1 as a subquery (we can do that only for queries without free path variables).
Having queries Q 1 , Q 2 with only node variables, we can give similar constructions for the cases of a Cartesian product, an intersection and a union. For queries with free path variables, the constructions are more complex.
Assume that, for i = 1, 2, the query Q i is of the form
Without loss of generality, we assume that quantified variables in Q 1 and Q 2 are disjoint (if not, we simply rename the conflicting entities).
For the Cartesian-product case, we assume that the node variables ⃗ x 1 and ⃗ x 2 are disjoint, and that the path variables ⃗ π 1 and ⃗ π 2 are disjoint as well. Then, the following query expresses Q × :
For the intersection case, we assume that ⃗ x 1 = ⃗ x 2 and ⃗ π 1 = ⃗ π 2 . We construct the query Q ∩ as follows:
Finally, the union case is the most difficult one. Roughly speaking, we want to take the Cartesian product of the two queries and define the result as the union of the projections of this product. Assuming, without loss of generality, that the variables in the queries are disjoint, this can be achieved in the following way.
where ⃗ x, ⃗ π are fresh variables. The regular constraint EQ guarantees that either (⃗ x = ⃗ x 1 and ⃗ π = ⃗ π 1 ), or (⃗ x = ⃗ x 2 and ⃗ π = ⃗ π 2 ). It can be defined in an OPRA query in a straightforward way using a regular constraint with alternation and a new auxiliary labelling defined with a node identity check; notice that the definition will depend on the arity of the vectors.
Notice, however, that if one of the queries returns the empty result, then the Cartesian product is empty as well, and this naive approach fails.
To avoid this problem, we first define, for each i ∈ {1, 2}, an additional labelling λ Q i = [Q i ]. By definition, λ Q i is 0 if Q i returns the empty result and 1 otherwise. Then, for each i, we define R ′ i based on R i in the following way: for each conjunct r of R i , R ′ i contains r + ⟨λ Q i = 0⟩ * . Finally, for each i, we define A ′ i based on A i in the following way: for each conjunct ∑ i s i < d of A i , A ′ i contains the conjunct ∑ i s i < λ d , where λ d is an auxiliary labelling equal to d if λ Q i = 1 and ∞ otherwise. This can be defined as min(d, (2λ Q i − 1) ⋅ ∞).
The above definition means that R ′ i and A ′ i are trivially satisfied if λ Q i = 0. Let O be the definitions of the auxiliary labellings described in the paragraphs above. Putting it all together, we obtain:
In Theorem 6.1 the construction for the complement is given only for OPRA queries without free path variables. We show that, assuming NL ≠ NP, OPRA queries are not closed under the complement. Indeed, we show in the following Lemma 6.2 that if all OPRA queries are closed under the complement, then there exists a Boolean query Q ham which holds if there is a Hamiltonian cycle in a graph. However, we show in Theorem 8.1 that for a fixed query, the query evaluation problem is NL-complete. Therefore, having the query Q ham , we can decide the existence of a Hamiltonian cycle in a given G in NL and hence NL = NP.
Proof. The query Q ham is the conjunction of the following queries with a free path variable π that becomes existentially quantified in Q ham :
• Q len (π), which holds for the cycles connected by E with the length equal to the number of all nodes in a graph, and
• Q unique (π), which holds for the paths in which all nodes are different.
Note that such paths are Hamiltonian cycles.
The query Q len (π) is as follows.
Note the use of an existential variable y and how we count the number of all nodes in a graph with the Nodes labelling. Now, we construct the query Q unique (π). First, we express Q repeats (π) that holds for paths with some node occurring at least twice.
It states that π ′ consists of the same node u repeated multiple times and we require that there are at least two positions i in π such that u = π ′ [i] = π[i].
Finally, as we assume that all OPRA queries are closed under the complement, there exists a query Q unique (π) in OPRA that is the complement of Q repeats (π).
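The two conditions combined in Q ham can be checked directly on a concrete candidate path. The following Python sketch is only an illustration of the properties expressed by Q len and Q unique, not of the OPRA queries themselves; it assumes that a cycle is given as a list of nodes with the starting node repeated at the end, and the function name is hypothetical.

```python
def is_hamiltonian_cycle(nodes, edges, path):
    """Check whether `path` (start node repeated at the end) is a Hamiltonian cycle.

    nodes: iterable of nodes; edges: set of (source, target) pairs.
    The list representation of a cycle is an assumption of this sketch.
    """
    node_set = set(nodes)
    n = len(node_set)
    # Length condition (in the spirit of Q_len): the cycle visits exactly n nodes
    # and returns to its starting node.
    if n == 0 or len(path) != n + 1 or path[0] != path[-1]:
        return False
    # Consecutive nodes must be connected by E.
    if any((u, v) not in edges for u, v in zip(path, path[1:])):
        return False
    # Uniqueness condition (in the spirit of Q_unique): n positions covering
    # n distinct nodes means that no node is repeated.
    return set(path[:-1]) == node_set


triangle_nodes = {1, 2, 3}
triangle_edges = {(1, 2), (2, 3), (3, 1)}
print(is_hamiltonian_cycle(triangle_nodes, triangle_edges, [1, 2, 3, 1]))
```

Verifying a given path is easy; the hardness discussed above comes from the existential quantification over π.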
Here are some examples of employing the closure properties. To check whether a given graph is a directed acyclic graph, we have to check that the graph has no cycle. Instead, we can check whether the graph has a cycle using the following query and then complement this query:
The above query is Boolean, i.e., it has no free variables, so it is considered over an empty tuple, (). This query checks whether there exists a path with the same initial and final nodes of length at least 2, i.e., a cycle.
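The idea of deciding acyclicity as the complement of cycle existence can be sketched procedurally. The Python code below is a minimal illustration using a standard depth-first search; it is not the OPRA evaluation procedure, and the function names are hypothetical. It assumes that every edge endpoint is listed among the nodes.

```python
def has_cycle(nodes, edges):
    """Return True if the directed graph (nodes, edges) contains a cycle."""
    successors = {v: [] for v in nodes}
    for u, v in edges:
        successors[u].append(v)
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    colour = {v: WHITE for v in nodes}

    def visit(u):
        colour[u] = GREY
        for v in successors[u]:
            if colour[v] == GREY:         # back edge: a path returns to its start
                return True
            if colour[v] == WHITE and visit(v):
                return True
        colour[u] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in nodes)


def is_dag(nodes, edges):
    """The complement of the Boolean property 'the graph has a cycle'."""
    return not has_cycle(nodes, edges)


print(is_dag([1, 2, 3], [(1, 2), (2, 3)]))
print(is_dag([1, 2, 3], [(1, 2), (2, 3), (3, 1)]))
```

As in the text, the positive property (existence of a cycle) is the one that is checked directly, and acyclicity is obtained by negating the result.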
Finally, we can write a query that asks whether there is a unique path between x and y. First, the following query asks for nodes x, y connected with at least two different paths.
Next, we take the complement of the above query and intersect that complement with the following query stating that there is at least one path from x to y:

Expressive power
To understand the expressive power of OPRA, we compare it with other languages. Let us first mention that OPRA expresses all SQL queries over relational databases (subject to technical details arising from types and the fact that SQL can return an ordered list with repetitions). Most of the main ingredients of the proof of this claim are presented in Theorem 6.1, where we have shown the closure properties of OPRA. We skip the proof because it provides little insight into what we are really interested in: graph-oriented properties. Instead, we compare OPRA with a well-known graph query language ECRPQ and its extension with linear constraints (ECRPQ+LC) [BLLW12]. We prove the results depicted in Figure 1: that PR subsumes ECRPQ and PRA subsumes ECRPQ+LC. The strength of ECRPQ comes from the possibility of comparing properties of paths that are expressible by synchronized regular automata. Nevertheless, ECRPQ cannot deal with data values. Therefore, in the final part we show that OPRA subsumes Regular Queries with Memory (RQM) [LMV16] over graphs with integer data values. We conclude with a short discussion on the additional expressive power of OPRA over PRA.
An ECRPQ graph [BLLW12] is a tuple ⟨V, E⟩, where V is a finite set of nodes, and E ⊆ V × Σ × V is a set of edges labelled by a finite alphabet Σ. A path in an ECRPQ graph G is a sequence of interleaved nodes and edge labels v 0 e 0 v 1 . . . v k such that for every i < k we have E(v i , e i , v i+1 ). The difference between ECRPQ graphs and our graphs is mostly syntactical, yet it obscures the close relationship between ECRPQ and PR. To overcome this problem, we define the standard embedding, which is a natural transformation of ECRPQ graphs to graphs. The main idea is to replace paths of the form v 0 e 0 v 1 e 1 . . . v n with paths of the form (v 0 , e 0 )(v 1 , e 1 ) . . . (v n−1 , e n−1 )(v n , ◻).
The standard embedding of an ECRPQ graph G = ⟨V, E⟩ over Σ = {b 1 , . . . , b k } is a graph G se whose set of nodes is V se = V × Σ ◻ , where Σ ◻ = Σ ∪ {◻}. The graph is equipped with a binary Boolean labelling E se encoding the edge relation: E se ((v, a), (v ′ , a ′ )) = 1 if and only if E(v, a, v ′ ), and |Σ ◻ | unary Boolean labellings λ b encoding the edge labels: λ b (v, a) = 1 if and only if a = b. To deal with variables that occur multiple times in path constraints (e.g. x in x → π x), we need an additional Boolean binary labelling ∼ that ties nodes representing the same node in G: ∼ ((v, a), (v ′ , a ′ )) = 1 if and only if v = v ′ , for every (v, a), (v ′ , a ′ ) ∈ V se . We say that a node v corresponds to the node v se = (v, ◻), and that a path p = v 1 e 1 v 2 . . . v n corresponds to the path p se = (v 1 , e 1 ) . . . (v n−1 , e n−1 )(v n , ◻).
7.1. Extended conjunctive regular path queries (ECRPQs).
An ECRPQ Q(⃗ x, ⃗ π) over Σ is a conjunction of path constraints of the form x i → π k x j and regular-relation constraints of the form R(π i 1 , . . . , π i n ), where x i , x j are node variables, π k , π i 1 , . . . , π i n are path variables, and R is a regular expression defining an n-ary regular relation over Σ. An ECRPQ Q(⃗ x, ⃗ π) can contain other node and path variables besides those listed among ⃗ x or ⃗ π; the remaining node and path variables are existentially quantified.
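The standard embedding G se defined above, before the introduction of ECRPQs, is easy to compute explicitly. The following Python sketch builds the node set V se and the labellings E se , λ b and ∼ as sets of the tuples on which they take value 1, and converts a path p into p se . The representation by sets and the function names are assumptions made for illustration.

```python
BOX = "\u25fb"  # the artificial symbol (white square) used for padding


def standard_embedding(V, E, sigma):
    """Return (V_se, E_se, lam, sim) for an ECRPQ graph (V, E) over alphabet sigma.

    E is a set of triples (v, a, w). Each labelling is represented by the set of
    arguments on which it equals 1 (an encoding chosen for this sketch).
    """
    sigma_box = set(sigma) | {BOX}
    V_se = {(v, a) for v in V for a in sigma_box}
    # E_se((v, a), (w, a2)) = 1 iff E(v, a, w); the second label a2 is arbitrary.
    E_se = {((v, a), (w, a2)) for (v, a, w) in E for a2 in sigma_box}
    # lam[b] contains exactly the nodes (v, a) with a = b.
    lam = {b: {(v, a) for (v, a) in V_se if a == b} for b in sigma_box}
    # sim ties together nodes of G_se that represent the same node of G.
    sim = {(n1, n2) for n1 in V_se for n2 in V_se if n1[0] == n2[0]}
    return V_se, E_se, lam, sim


def path_se(path):
    """Map [v0, e0, v1, ..., vk] to [(v0, e0), ..., (v_{k-1}, e_{k-1}), (vk, BOX)]."""
    nodes, labels = path[0::2], path[1::2]
    return list(zip(nodes, labels)) + [(nodes[-1], BOX)]


V_se, E_se, lam, sim = standard_embedding({"v"}, {("v", "a", "v")}, {"a"})
print(path_se(["v", "a", "v"]))
```

For the self-loop (v, a, v), the embedded path starts at (v, a) and ends at (v, ◻), so its endpoints differ even though both represent v; this is the situation that the labelling ∼ is meant to handle.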
The language of ECRPQs is based on the notion of regular relations. An n-ary relation R on Σ * is regular if there is a regular expression R over the alphabet (Σ ∪ {◻}) n such that for all words w 1 , . . . , w n ∈ Σ * we have (w 1 , . . . , w n ) ∈ R if and only if W (w 1 , . . . , w n ) ∈ L(R) (the notation W (p 1 , . . . , p k ) has been introduced in Section 3 to define the semantics of regular constraints). Note that we use the symbol ◻ to deal with the differences of paths' lengths, and hence, we need regular expressions over the alphabet (Σ ∪ {◻}) n to define n-ary relations over Σ.
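To make the role of the padding symbol concrete, the following Python sketch aligns a tuple of words into a single word over (Σ ∪ {◻}) n and then checks the prefix relation with a two-state scan, which corresponds to the regular expression ⋃ a (a, a) * ⋃ b (◻, b) * . The function names are hypothetical, and the alignment is only assumed to agree with W on words.

```python
from itertools import zip_longest

BOX = "\u25fb"  # padding symbol aligning words of different lengths


def convolution(*words):
    """Align the words position by position, padding shorter ones with BOX."""
    return list(zip_longest(*words, fillvalue=BOX))


def is_prefix(w1, w2):
    """Decide whether w1 is a prefix of w2 by scanning the aligned word once."""
    state = "equal"                      # so far both components have agreed
    for a, b in convolution(w1, w2):
        if state == "equal" and a == b and a != BOX:
            continue                     # letter (a, a): still in the common part
        if a == BOX and b != BOX:
            state = "tail"               # letter (BOX, b): w1 has ended
            continue
        return False                     # mismatch, or w1 is longer than w2
    return True


print(convolution("ab", "abc"))
print(is_prefix("ab", "abc"), is_prefix("abc", "ab"))
```

The scan uses a constant amount of memory and reads all components synchronously, which is the defining feature of regular relations.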

The semantics of ECRPQs is defined with respect to an ECRPQ graph G and an instantiation ν of all node and path variables, i.e., for nodes ⃗ v of G and paths ⃗ p in G, we have that Q(⃗ v, ⃗ p) holds in G if and only if there is an instantiation ν of node and path variables that is consistent with ⃗ v and ⃗ p on the free node and path variables, respectively, and such that all constraints of Q(⃗ x, ⃗ π) are satisfied. A constraint x i → π k x j is satisfied by ν if ν(π k ) is a path from ν(x i ) to ν(x j ); the semantics is the same as that of x i → π k E x j in PR. The ECRPQ graph G and ν satisfy R(π i 1 , . . . , π i n ) if and only if the sequences of labels of paths ν(π i 1 ), . . . , ν(π i n ) belong to the relation defined by R.
ECRPQs are defined in a similar way to PR queries. However, regular-relation constraints in ECRPQs and regular constraints are different. In the case of a single path, regular-relation constraints specify regular languages of labels, while regular constraints specify regular languages of node constraints, which are supposed to match the path. Node constraints can express that a given node has a specific label and hence regular constraints (over a single path) can specify that a path has the sequence of labels from a given regular language. The same reasoning works for multiple paths and it shows that regular constraints from PR polynomially subsume regular-relation constraints from ECRPQs.
A query Q 1 on ECRPQ graphs is se-equivalent to a query Q 2 on graphs if for all ECRPQ graphs G, nodes ⃗ v and paths ⃗ p, we have Q 1 (⃗ v, ⃗ p) holds in G if and only if Q 2 (⃗ v se , ⃗ p se ) holds in G se . A query language L on graphs subsumes a query language L ′ on ECRPQ graphs if for every query in L ′ there exists an se-equivalent L query. Moreover, L polynomially subsumes L ′ if every query in L ′ can be transformed to an se-equivalent query in L and the underlying transformation of queries is effective and takes polynomial time.
Proof. (1): ECRPQs consist of two types of constraints. Path constraints x i → π k x j of ECRPQ have similar semantics to path constraints x i → π k E x j in PR, but there is a subtle difference arising from the different path representation. For example, if we take an ECRPQ graph with an edge (v, a, v), then x → π x should be satisfied by π = vav. However, π se then becomes (v, a)(v, ◻), which has different endpoints. Therefore we proceed as follows. For each x i → π k x j we use a fresh variable x ′ i . The translation now consists of a path constraint x ′ i → π k E se x j and three regular constraints: ⟨λ ◻ (x i ) = 1⟩, ⟨λ ◻ (x j ) = 1⟩ and ⟨∼ (x i , x ′ i ) = 1⟩. Note that in the translation of x → π x the last of the regular constraints has the form ⟨∼ (x, x ′ ) = 1⟩.
The regular-relation constraints of ECRPQs are basically regular expressions over the alphabet (Σ ∪ {◻}) n . In PR, any letter (a 1 , . . . , a n ) ∈ (Σ ∪ {◻}) n can be expressed as the node constraint ⟨λ a 1 (π 1 ) = 1 ∧ . . . ∧ λ an (π n ) = 1⟩ referring to the current positions on the respective path variables of ECRPQ. This can be extended to a translation of all regular-relation constraints in a straightforward way. Hence PR polynomially subsumes ECRPQ.
(2): Consider the following PR query Q b : SELECT NODES x, y SUCH THAT x → π E y WHERE ⟨E(Next(π), π)⟩ * ⟨⊺⟩ which holds on nodes x, y such that there is a bidirectional path between x and y. We claim that this query is not expressible in ECRPQ. Suppose that there is an ECRPQ Q ′ (x, y) that holds if and only if there is a bidirectional path between x and y. Let m be the number of all node variables in Q ′ . Consider the graphs G, G ′ depicted in Figure 4. We show that if Q ′ (v 0 , v m ) holds in G, then Q ′ (u 0 , u 2 ) holds in G ′ . There is some node v j of G that is not referred to by any node variable. We define a node variable instantiation
Observe that if there is a path in G of length l between two nodes ν(x) and ν(y), then there is a path of the same length between ν ′ (x) and ν ′ (y) in G ′ . Similarly, if there is a path in G of length l from (resp., to) ν(x), then there is a path of the same length in G ′ from (resp., to) ν ′ (x). It follows that we can extend the instantiation ν ′ to path variables such that paths in ν and ν ′ have the same endpoints among instances of node variables and the same lengths. Therefore, all constraints from Q ′ (u 0 , u 2 ) of the form x i → π k x j are satisfied in G ′ under ν ′ . Finally, since for every path variable π the paths ν(π) and ν ′ (π) have the same length and a is the only label, all regular constraints of Q ′ (u 0 , u 2 ) are satisfied as well. Hence, we have that Q ′ (u 0 , u 2 ) holds in G ′ , but there is no bidirectional path between u 0 and u 2 .
7.2. ECRPQs with linear constraints.
ECRPQ+LC [BLLW12] is an extension of ECRPQ with linear constraints, expressing that a given vector of paths ⃗ π satisfying a given ECRPQ query satisfies linear inequalities, which specify the multiplicity of edge labels in various components of ⃗ π. Formally, a linear constraint is given by h > 0, an h × (|Σ| ⋅ n) matrix A with integer coefficients and a vector ⃗ b ∈ Z h . An instantiation ν of ⃗ π (of length n) satisfies this constraint if A ⃗ l ≤ ⃗ b holds for the vector ⃗ l = (l 1,1 , . . . , l |Σ|,1 , l 1,2 , . . . , l |Σ|,n ), where l j,i is the number of occurrences of the j-th edge label in ν(⃗ π[i]). In ECRPQ+LC, we require a tuple of paths to satisfy both the ECRPQ part and the linear constraints.
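The satisfaction of a linear constraint can be checked by computing the vector ⃗ l of label counts and testing A ⃗ l ≤ ⃗ b row by row. The following Python sketch does this for paths given by their sequences of edge labels; the function names and the example constraint (equal numbers of a-labelled and b-labelled edges on a single path) are chosen for illustration only.

```python
def label_counts(label_sequences, sigma):
    """Return l = (l_{1,1}, ..., l_{|Sigma|,1}, l_{1,2}, ..., l_{|Sigma|,n}).

    label_sequences[i] is the list of edge labels of the i-th path, and sigma is
    the alphabet as an ordered list, so that the j-th label is sigma[j].
    """
    return [labels.count(b) for labels in label_sequences for b in sigma]


def satisfies_linear_constraint(A, b, l):
    """Check A l <= b componentwise for an integer matrix A and vector b."""
    return all(
        sum(coeff * value for coeff, value in zip(row, l)) <= bound
        for row, bound in zip(A, b)
    )


sigma = ["a", "b"]
# Two inequalities, #a - #b <= 0 and #b - #a <= 0, together express #a = #b.
A = [[1, -1], [-1, 1]]
b = [0, 0]

print(satisfies_linear_constraint(A, b, label_counts([list("abba")], sigma)))
print(satisfies_linear_constraint(A, b, label_counts([list("aab")], sigma)))
```

An equality is encoded by a pair of opposite inequalities, which is why the constraint A ⃗ l ≤ ⃗ b suffices to compare multiplicities of labels exactly.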
Linear constraints can be expressed by arithmetical constraints of PRA. This and Theorem 7.1 imply that PRA polynomially subsumes ECRPQ+LC. Still, linear constraints do not help with expressing structural graph properties. In particular, ECRPQ+LC does not express the query "x and y are connected with a bidirectional path", which is expressible in PR. Nevertheless, there are ECRPQ+LC queries not expressible in PR. In consequence, ECRPQ+LC and PR are incomparable and we have the following.
Theorem 7.2. (1) PRA polynomially subsumes ECRPQ+LC. (2) There is a PR query Q with no ECRPQ+LC query Q ′ se-equivalent to Q. (3) There is an ECRPQ+LC query Q with no PR query Q ′ se-equivalent to Q.
Proof. (1): The language PRA can express all ECRPQs with linear constraints. Consider a query Q of ECRPQ+LC. First, due to Theorem 7.1, PR polynomially subsumes ECRPQs, and hence there is a PR query ϕ E corresponding to the ECRPQ part of Q. Second, we can express each l j,i by the arithmetical atom Λ j,i = λ b j [π i ]. Then, for k = 1, . . . , h, the arithmetical constraint ϕ k A corresponding to the k-th row of A can be constructed as the product of the row of A and all the atoms Λ j,i compared to the k-th element of the vector ⃗ b. Thus, ϕ A , defined as the conjunction of all ϕ k A , corresponds to the linear constraints A ⃗ l ≤ ⃗ b of Q. Finally, we define the PRA query se-equivalent to Q as the conjunction of ϕ E and ϕ A .
(2): Consider the PR query Q b presented in the proof of (2) in Theorem 7.1. The argument from (2) in Theorem 7.1 straightforwardly extends to ECRPQ+LC. We assume towards a contradiction that there is an ECRPQ+LC Q ′ (x, y) that holds if there is a bidirectional path between x and y. Let k be the number of all node variables in Q ′ . Then, we proceed as in the proof of (2) in Theorem 7.1 to show that if Q ′ (v 0 , v k ) is satisfied under some instantiation ν in G (depicted in Figure 4), then we can define the corresponding instantiation ν ′ in G ′ (depicted in Figure 4) such that paths in ν and ν ′ have the same endpoints among instances of node variables and the same lengths. Therefore, the ECRPQ part of Q ′ (u 0 , u 2 ) holds in G ′ under ν ′ . Observe that the value of the vector ⃗ l in the linear constraints of Q ′ (v 0 , v k ) under ν in G is the same as the value of ⃗ l under ν ′ in G ′ . Thus, the linear constraints of Q ′ are satisfied in G ′ under ν ′ . Hence, we have that Q ′ (u 0 , u 2 ) holds in G ′ , but there is no bidirectional path between u 0 and u 2 .
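For intuition, the property expressed by Q b (the existence of a path from x to y on which every edge is accompanied by the reverse edge) can be decided by a breadth-first search restricted to such edges. The Python sketch below is an illustration under that reading of Q b ; the function name and the example graph are hypothetical and unrelated to the graphs of Figure 4.

```python
from collections import deque


def bidirectionally_connected(edges, x, y):
    """Return True if some path from x to y uses only edges whose reverse exists."""
    edge_set = set(edges)
    successors = {}
    for u, v in edge_set:
        if (v, u) in edge_set:           # keep the edge only if it has a reverse
            successors.setdefault(u, []).append(v)
    seen, queue = {x}, deque([x])
    while queue:
        u = queue.popleft()
        if u == y:
            return True
        for v in successors.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


example_edges = {(1, 2), (2, 1), (2, 3)}
print(bidirectionally_connected(example_edges, 1, 2))
print(bidirectionally_connected(example_edges, 1, 3))
```

The check inspects the edge relation in both directions at each step, which is the kind of structural condition that ECRPQ constraints, referring only to label sequences, cannot impose.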
(3): Consider the query Q a=b (π): "a given path π has the same number of edges labelled a as edges labelled b". We claim that Q a=b (π) is expressible in ECRPQ+LC, whereas it is not expressible in PR.
To see this, let us fix a graph G consisting of a single state and two self-loops labelled by a and b respectively. Consider the set of labellings of all paths satisfying Q a=b . This set can be regarded as the language of words over {a, b} with the same number of a's and b's. This language is not regular whereas for any PR query Q, the language of labellings of paths from G satisfying Q is regular.
Indeed, suppose that Q has k free path variables and no other path variables. Observe that the language L Q of labellings of paths satisfying Q is a regular language over {a, b, ◻} k , i.e., k-element tuples over {a, b, ◻}. Now, if Q ′ is obtained from Q by making some free path variables existentially quantified, then L Q ′ is obtained from L Q by projecting out the components of {a, b, ◻} k that correspond to existentially quantified variables. Such an operation preserves regularity of the language; the language L Q ′ is still regular.
Now we discuss why OPRA is provably stronger than PRA.
Remark 7.3 (OPRA is stronger than PRA). The language OPRA syntactically contains PRA and it strictly subsumes PRA. Consider the property: "an input graph is a directed acyclic graph (dag)". PRA queries are monotonic and, in consequence, no PRA query expresses the property. On the other hand, it is expressible in OPRA. We can write a Boolean PR query Q() that holds on input graphs with a cycle. From Theorem 6.1 the complement of Q() can be expressed in OPRA.
7.3. Regular queries with memory (RQMs).
A Regular Query with Memory (RQM) [LMV16,LTV15] is of the form x → π y ∧ π ∈ L(e), where e is a regular expression with memory (REM) [LMV16,LTV15]. We refrain from presenting a formal definition of REMs as we do not use it below. Intuitively, REMs can store in a variable the data value at the current position and test its (dis)equality with other values already stored. RQMs are evaluated over data graphs [LMV16]. A data graph G over a finite alphabet Σ and a countably infinite set D is a triple (V, E, ρ), where V is a finite set of nodes, E ⊆ V × Σ × V is a set of labelled edges, and ρ ∶ V → D is a function that assigns a data value to each node in V . In this paper we assume that D is the set of integers. A path in a data graph G is a sequence of interleaved nodes and edge labels v 0 e 0 v 1 . . . v k such that for every i < k we have E(v i , e i , v i+1 ). REMs are evaluated over data paths. Given a path p = v 0 e 0 v 1 . . . v k , a data path corresponding to p is ρ(v 0 )e 0 ρ(v 1 ) . . . ρ(v k ), i.e., a sequence of alternating data values and labels that starts and ends with data values. Given a data graph G, the result of the RQM x → π y ∧ π ∈ L(e) on G consists of pairs of nodes (v, v ′ ) such that there is a data path w from v to v ′ that belongs to L(e).
In order to relate PRA to RQM we apply a natural transformation of data graphs to graphs. We discussed it in Section 2; see Examples 2 and 3. The standard embedding of a data graph G data = (V, E, ρ) is a graph G sed whose set of nodes is V sed = V ∪ E. G sed is equipped with the following labellings:
• a binary Boolean labelling E sed encoding the edge relation: for each v ∈ V and e ∈ E we set
We say that a query Q 1 on data graphs is se-equivalent to a query Q 2 on graphs if for all data graphs G and nodes v, v ′ the query
Similarly as before, a query language L on graphs subsumes a query language L ′ on data graphs if for every query in L ′ there exists an se-equivalent L query. Also, L polynomially subsumes L ′ if every query in L ′ can be transformed to a query in L and the transformation is effective and can be computed in polynomial time.
Now, we would like to prove that PRA subsumes RQM. To make the proof easier we introduce an intermediate step through regular data path queries (RDPQ) [LMV16]. RDPQs are an automata-based formalism and, like RQMs, define pairs of nodes of data graphs. In the proof, given an RQM Q we express it by an RDPQ Q 1 and then we construct an se-equivalent PRA query.
Now we formally define RDPQs [LMV16]. An RDPQ is of the form x → π y ∧ π ∈ L(A), where A is a Register Data Path Automaton (RDPA). RDPAs, similarly to REMs, are evaluated over data paths. In order to compare the data values, RDPAs use Boolean combinations of conditions of the form x_i^=, x_i^≠, z^= and z^≠, where each variable x i refers to the i-th register and z is a data value from D (a constant). Let C k be the set of all such conditions over k registers and their Boolean combinations. Each of the registers stores either a data value or a special value ⊥, which means that the register has not been assigned yet. The semantics of the conditions is defined with respect to a (current) data value d and a valuation of registers τ = (d 1 , . . . , d k ) ∈ (D ∪ {⊥}) k in a natural way: for each ⊗ ∈ {=, ≠} and each i ∈ {1, . . . , k} we define d, τ ⊧ x_i^⊗ if and only if d ⊗ d i , and d, τ ⊧ z^⊗ if and only if d ⊗ z.
In the sequel, given a register valuation τ = (d 1 , . . . , d k ) we will write τ (i) for d i , the i-th element of the tuple τ .
A k-register RDPA [LMV16] consists of a finite set of word states Q w , a finite set of data states Q d , an initial state q 0 ∈ Q d , a set of final states F ⊆ Q w and two transition relations δ w , δ d such that δ w ⊆ Q w × Σ × Q d and each transition in δ d has the form (q, c, I, q ′ ) with q ∈ Q d , a condition c ∈ C k , an update set I ⊆ {1, . . . , k} and q ′ ∈ Q w .
Given a data path w = d 0 e 1 d 2 . . . d l , where each d i ∈ D and each e i ∈ Σ, a computation of A on w is a sequence of tuples (0, q 0 , τ 0 ), . . . , (l + 1, q l+1 , τ l+1 ), where q 0 is the initial state, τ 0 = ⊥ k and:
• for each even j there is a transition (q j , c, I, q j+1 ) ∈ δ d such that d j , τ j ⊧ c and for each i ∈ {1, . . . , k} the value τ j+1 (i) is equal to d j if i ∈ I and to τ j (i) otherwise;
• for each odd j, there is a transition (q j , e j , q j+1 ) ∈ δ w and τ j+1 = τ j .
A data path w is accepted by A if A has a computation on w that ends in a configuration containing a final state. Given a data graph G, the result of the RDPQ x → π y ∧ π ∈ L(A) on G consists of pairs of nodes (v, v ′ ) such that there is a data path w from v to v ′ that is accepted by A.
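The computation of an RDPA on a data path can be simulated by tracking the set of reachable configurations (state, register valuation). The following Python sketch follows the definition above under several assumptions made for illustration: conditions are given as Python predicates on the current data value and the valuation, registers are indexed from 0, an unassigned register holds None, and the example automaton (accepting data paths whose last data value equals the first one) is hypothetical.

```python
def rdpa_accepts(data_path, q0, finals, delta_w, delta_d, k):
    """Simulate a k-register RDPA on data_path = [d0, e1, d2, ..., dl].

    delta_d: list of (state, condition, update_set, next_state), where condition is
             a predicate condition(d, tau) and update_set contains register indices.
    delta_w: list of (state, edge_label, next_state).
    """
    configs = {(q0, (None,) * k)}        # all registers are initially unassigned
    for j, symbol in enumerate(data_path):
        next_configs = set()
        for q, tau in configs:
            if j % 2 == 0:               # even position: a data value
                for p, cond, update, p2 in delta_d:
                    if p == q and cond(symbol, tau):
                        new_tau = tuple(
                            symbol if i in update else tau[i] for i in range(k)
                        )
                        next_configs.add((p2, new_tau))
            else:                        # odd position: an edge label
                for p, label, p2 in delta_w:
                    if p == q and label == symbol:
                        next_configs.add((p2, tau))
        configs = next_configs
    return any(q in finals for q, _ in configs)


def always(d, tau):
    return True


def equals_register_0(d, tau):
    return tau[0] == d


# Hypothetical automaton: store the first data value, then guess the last position
# and check that its data value equals the stored one.
delta_d = [
    ("s", always, frozenset({0}), "w"),
    ("d", always, frozenset(), "w"),
    ("d", equals_register_0, frozenset(), "f"),
]
delta_w = [("w", "a", "d")]

print(rdpa_accepts([5, "a", 7, "a", 5], "s", {"f"}, delta_w, delta_d, 1))
```

The simulation keeps only the current set of configurations, so its memory use depends on the number of states and distinct stored values rather than on the length of the data path.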
Each RQM can be expressed with an RDPQ. This is because for each REM with k variables one can construct in PTIME an RDPA with k registers that accepts the same language of data paths [LMV16, Proposition 3.13] and [LTV15, Theorem 4.4].
We show that PRA subsumes RQM. The transformation from an RQM to a PRA query involves a single exponential blow-up.
Proof. Let Q be an RDPQ of the form x → π y ∧ π ∈ L(A), where A is an RDPA. We will construct an se-equivalent PRA query as follows. First, assume that A contains exactly one final state. We explain later what to do when this is not true. We construct an intermediate Path Automaton (iPA) A ′ that is equivalent to A in the following sense: given a data graph G and nodes v, v ′ of G, the RDPA A accepts a data path from v to v ′ if and only if A ′ accepts a path from v to v ′ in G sed . The iPA is an automaton whose transitions are labelled by regular constraints from PRA.
iPA process tuples of paths of a graph (like PRA) rather than data paths of a data graph (as in the case of RDPA). A tuple of paths includes additional paths that store values during computation just like registers do in the case of RDPA. Therefore, iPA do not have registers explicitly and whenever we mention registers in the context of iPA we mean the mechanism to store values in additional paths. Moreover, the stored values are nodes of a graph as in PRA queries and not data values as in RDPQ.
Formally, a k-register iPA consists of a set of states Q, a single initial state q I , a single final state q F , and a transition relation δ ⊆ Q × Regular_constraints × Q. The regular constraints in a k-register iPA use a path variable π 0 corresponding to an input path and k existentially quantified, additional path variables π 1 , . . . , π k to store the values of the k registers.
In what follows, for a path p, by p[i, j] we denote the fragment of p that starts at position i and ends at position j. For example, for p = v 1 e 2 v 3 e 4 v 5 , the expression p[3, 5] denotes v 3 e 4 v 5 .
We say that for a given graph G, an iPA A ′ accepts a path p 0 if there are paths p 1 , . . . , p k , each of the same length as p 0 , a number n ∈ N, positions 0 = i 0 < i 1 < i 2 < ⋯ < i n−1 ≤ i n = |p 0 | and a sequence of states q 0 , . . . , q n such that
• q 0 is the initial state, q n is the final state, and
• for each j ∈ {0, . . . , n − 1} there is a transition (q j , R(π 0 , . . . , π k ), q j+1 ) ∈ δ such that for j < n −
Note that for j ∈ {1, . . . , n − 1} the positions i j on the paths p 0 , . . . , p k are processed twice, first by the always satisfied node constraint ⟨⊺⟩ and only then by the actual transition. We apply this trick to be able to refer to the next position on the paths even if the current position is the last one in the current section. In other words, for 0 ≤ j ≤ n − 2 we include the position i j+1 in the fragments p i [i j , i j+1 ] only to be able to refer to the next position while being at the position i j+1 − 1.
Now, given a k-register RDPA A = (Q w , Q d , q 0 , F, δ w , δ d ) we define an equivalent k-register iPA A ′ = (Q, q I , q F , δ). Let Q = Q w ∪ Q d . We define q I to be q 0 and q F to be the only state in F .
We now define a transition relation δ.
Note that we can safely refer to the next position on paths π i because q ′ is not the final state as q F ∈ Q w and δ w ⊆ Q w × Σ × Q d . For each transition (q, c, I, q ′ ) ∈ δ d , with a condition c and an update set I, we put (q, R, q ′ ) in δ, where the regular constraint R ensures that (1) there is an edge between the current and the next node on π 0 in G; (2) the condition c is satisfied assuming for each i ∈ {1, . . . , k} the value of the register i is the current value of the path π i ; (3) unless q ′ is the final state, for each i ∈ I the next value of the path π i is set to the current value of the input path π 0 ; (4) unless q ′ is the final state, for each i ∈ {1, . . . , k} ∖ I the next value of the path π i is the same as the current value of π i .
We express the condition c ∈ C k as the Boolean combination of node constraints, denoted by N (c), by replacing each x_i^⊗ by ⟨λ(π i ) ⊗ λ(π 0 )⟩ and each z^⊗ by ⟨z ⊗ λ(π 0 )⟩, for ⊗ ∈ {=, ≠}. Recall that λ is the labelling encoding the data values of the nodes.
We cannot include N (c) directly in a regular constraint because regular constraints are conjunctions of node constraints. Hence, we transform N (c) to DNF. Then we remove all negations by swapping = and ≠ in x_i^⊗ and z^⊗ as required. Denote the resulting expression by N = ⋁ l N l .
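The transformation of a condition into a negation-free DNF can be sketched as follows. In this Python illustration, conditions are represented as nested tuples, an encoding that is an assumption of the sketch and not part of the construction; negations are pushed down to the atoms, where they are eliminated by swapping = and ≠.

```python
def to_dnf(formula, negate=False):
    """Return a negation-free DNF as a list of conjunctions (lists of atoms).

    formula is one of ("atom", name, op) with op in {"=", "!="},
    ("not", f), ("and", f, g) or ("or", f, g).
    """
    kind = formula[0]
    if kind == "atom":
        _, name, op = formula
        if negate:                       # remove the negation by swapping = and !=
            op = "!=" if op == "=" else "="
        return [[(name, op)]]
    if kind == "not":
        return to_dnf(formula[1], not negate)
    left = to_dnf(formula[1], negate)
    right = to_dnf(formula[2], negate)
    # De Morgan: under a negation, "and" behaves as "or" and vice versa.
    if (kind == "or") != negate:
        return left + right
    # Conjunction: distribute over the disjuncts of both sides.
    return [c1 + c2 for c1 in left for c2 in right]


negated = ("not", ("or", ("atom", "x1", "="), ("atom", "z", "=")))
product = ("and",
           ("or", ("atom", "x1", "="), ("atom", "x2", "=")),
           ("or", ("atom", "x3", "="), ("atom", "x4", "=")))
print(to_dnf(negated))
print(len(to_dnf(product)))
```

Distributing a conjunction over disjunctions multiplies the numbers of disjuncts, so a conjunction of m two-element disjunctions yields 2^m conjunctions in the resulting DNF.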
If q ′ is not the final state, then for each of the conjunctions N l we define a regular constraint R l as ⟨E sed (π 0 , Next(π 0 )) = 1 ∧ N l ∧ ⋀ i∈I λ(Next(π i )) = λ(π 0 ) ∧ ⋀ i∈{1,...,k}∖I λ(Next(π i )) = λ(π i )⟩. Otherwise R l is defined as ⟨N l ⟩. Finally, we define R as R 1 + ⋯ + R s , where N = N 1 ∨ . . . ∨ N s . This finishes the construction of the automaton A ′ .
The automaton A ′ has the same number of states as A, but its size may be exponential in the size of A due to the transformation from an arbitrary Boolean combination of node constraints to a DNF. This blow-up can be avoided if we allow OPRA queries, in which we can define a new labelling corresponding to any Boolean function, and then use this labelling to express a Boolean combination of node constraints.
To end the proof of (1) we use the standard state removal method of converting an NFA to a regular expression. This removes all the states of A ′ except q I and q F . Then we use the same techniques as when removing states to define from the remaining transitions (i.e., the loops on q I , on q F , and the transitions from q I to q F and back) the single transition t = (q I , R, q F ) with the regular constraint R such that R describes equivalently all possible paths from q I to q F . We use R to define the PRA query SELECT NODES x, y WHERE R that is se-equivalent to the original RDPQ Q. The transformation from A ′ to R is exponential in the number of states of A ′ , which is the same as for A, and hence the whole construction results in a single exponential blow-up.
If the RDPA A has more than one final state in F, we repeat the above construction |F| times. Namely, for each q ∈ F we construct A q , which is identical to A but has q as its only final state. This way, for each q ∈ F we obtain a regular constraint R q , and we set R to be the alternation (+) of the regular constraints R q over all q ∈ F .
In order to prove (2) it is enough to note that OPRA can express the query "an input graph is a dag" as we discuss in Remark 7.3. Clearly, RQMs are monotonic (i.e., if an RQM query holds in a graph G then it holds in any G ′ containing G) and cannot express this query.

The query evaluation problem
The query evaluation problem asks, given an OPRA query Q(⃗ x, ⃗ π), a graph G, nodes ⃗ v and paths ⃗ p of G, whether Q(⃗ v, ⃗ p) holds in G. We are interested in the combined complexity of the query evaluation problem, where the size of the input is the size of G, paths ⃗ p and an input OPRA query, and in the data complexity, where a query is fixed and only G and ⃗ p are the input.
Unary encoding of graph labels. To obtain the desired complexity results, we assume that the absolute values of the graph labels are polynomially bounded in the size of a graph, or equivalently that graph labels are encoded in unary. This allows us to compute arithmetical relations on these labels in logarithmic space. Without such a restriction, the data complexity of the query evaluation problem we study is NP-hard by a straightforward reduction from the knapsack problem.
For the combined complexity, we assume that auxiliary labellings have a bounded depth, defined as follows. For auxiliary labellings O ∶= λ 1 ∶=t 1 , . . . , λ n ∶=t n , we say that λ i depends on λ j if t i refers to λ j . The relation depends on defines a directed acyclic graph on λ 1 , . . . , λ n , and we define the depth of O as the maximal length of a path in this acyclic graph. An OPRA query Q := LET O IN Q ′ is an OPRA query of depth (at most) s, denoted by OPRA[s], if either s = 0 and Q is a PRA query (i.e., O is empty), or s > 0, O has depth at most s, and all subqueries of Q ′ are OPRA[s-1] queries.
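As an illustration of the depth of O, the following sketch computes the longest path in the depends-on graph. It assumes that the length of a path is its number of edges and that references to labellings not defined in O (such as the graph labelling λ) are ignored; the dictionary encoding and the name ontology_depth are hypothetical.

```python
from functools import lru_cache

def ontology_depth(depends_on):
    """depends_on maps each auxiliary labelling to the labellings its term refers to.

    Assumes the depends-on relation is acyclic, as in the definition above.
    """
    @lru_cache(maxsize=None)
    def longest_from(name):
        # Only dependencies on other auxiliary labellings count as edges.
        deps = [d for d in depends_on.get(name, ()) if d in depends_on]
        return 0 if not deps else 1 + max(longest_from(d) for d in deps)

    return max((longest_from(name) for name in depends_on), default=0)

# Example: l1 uses only the graph labelling, l2 uses l1, l3 uses l1 and l2.
example_depth = ontology_depth({"l1": ["lam"], "l2": ["l1"], "l3": ["l1", "l2"]})
```

In the example the longest chain is l3, l2, l1, which has two edges.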
Theorem 8.1. The following holds: (1) The data complexity of the query evaluation problem for OPRA queries is NL-complete.
(2) The combined complexity of the query evaluation problem for OPRA queries with a bounded depth of auxiliary labellings is PSpace-complete.
The lower bounds in Theorem 8.1 hold even for PR. Indeed, the NL-hardness result in (1) of Theorem 8.1 follows from NL-hardness of the reachability problem, which can be expressed
(1) Given a graph G, nodes ⃗ v and paths ⃗ p of G, and an OPRA[s] query Q(⃗ x, ⃗ π), we can decide whether Q(⃗ v, ⃗ p) holds in G in non-deterministic polynomial space in Q and non-deterministic logarithmic space in G.
Remark 8.3 (Functions computed non-deterministically). In Lemma 8.2, the values min λ,π Q(⃗ v, π) and max λ,π Q(⃗ v, π) are computed on a non-deterministic machine, which is a non-standard notion. However, some subcomputations of our non-deterministic decision procedure return values. Therefore, we adapt the non-deterministic framework to functional problems as follows. We say that a non-deterministic machine M computes a function f if, for every input w, (1) M can have multiple computations on w, each accepting or rejecting, (2) every accepting computation of M on w returns the correct value f (w), and (3) M has at least one accepting computation on w.
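The three conditions can be illustrated on a toy example that is not taken from the construction itself: computing the maximum of a list by guessing a candidate and verifying it. In this sketch the non-deterministic choices are simulated by enumerating all branches, and the names branches_max and computes are hypothetical.

```python
def branches_max(values):
    """One branch per guess: accept with the guess only if it is verified as maximal."""
    for guess in values:
        if all(guess >= v for v in values):
            yield ("accept", guess)
        else:
            yield ("reject", None)

def computes(branches, expected):
    """Check conditions (2) and (3): every accepting branch returns the expected
    value, and at least one accepting branch exists."""
    accepted = [value for status, value in branches if status == "accept"]
    return len(accepted) >= 1 and all(value == expected for value in accepted)

ok = computes(branches_max([3, 7, 2, 7]), 7)
```

Branches with a wrong guess reject instead of returning a wrong value, which is what condition (2) requires.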
Note that when considering data complexity the query is fixed and hence its depth is bounded.
We first prove the upper bounds for PRA (i.e., for s = 0), and then extend the results to OPRA. The general idea of the proof is similar to that of the proof of the upper bound for ECRPQ. However, since our language is much more complex, we use more sophisticated, well-tailored tools.
8.1. Language PRA. Assume a PRA query Q = SELECT NODES ⃗ x, PATHS ⃗ π SUCH THAT P WHERE R HAVING A. We prove the results in two steps. First, we construct a Turing machine of a special kind (later on called QAM) that represents graphs, called answer graphs. Given a query Q and a graph G, the answer graph is a graph with distinguished initial and final nodes such that every path from an initial node to a final node in this graph is an encoding of a vector of paths that satisfy constraints P and R of Q in graph G (for some instantiation of variables ⃗ x). The answer graph is augmented with the computed values of the expressions that appear in the arithmetical constraints A. Thus, the query evaluation problem reduces to the existence of a path in the answer graph satisfying A. The instantiation of ⃗ x can be inferred from the path.
Second, we prove that checking whether there is a path from an initial node to a final node in the answer graph that encodes a path in G satisfying A can be done within desired complexity bounds. However, the answer graph for Q and G has a polynomial size in G and hence it cannot be explicitly constructed in logarithmic space. We represent these graphs on-the-fly using Query Applying Machines (QAM). Such a representation allows us to construct the answer graph and check the existence of a path satisfying A in non-deterministic logarithmic space in G.
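The idea of representing a graph on-the-fly can be illustrated by a toy sketch in which the graph is given only by a successor function and its edges are never materialised. The example is not the QAM construction: it merely checks whether two paths of equal length exist between given endpoints by exploring pairs of nodes of G in lockstep. Moreover, the sketch uses a deterministic breadth-first search that stores visited nodes and therefore does not run in logarithmic space; a non-deterministic procedure would instead guess one successor at a time and keep only the current node. All names are hypothetical.

```python
from collections import deque

def reachable_on_the_fly(initial, is_final, successors):
    """Search a graph that is described only by its successor function."""
    seen, queue = set(initial), deque(initial)
    while queue:
        node = queue.popleft()
        if is_final(node):
            return True
        for nxt in successors(node):  # successors are generated on demand
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def equal_length_paths(adj, s1, t1, s2, t2):
    """Is there a path s1 -> t1 and a path s2 -> t2 of the same length?"""
    def successors(pair):
        a, b = pair
        # Both paths advance by one edge simultaneously.
        return ((x, y) for x in adj.get(a, ()) for y in adj.get(b, ()))

    return reachable_on_the_fly([(s1, s2)], lambda p: p == (t1, t2), successors)

adj = {1: [2], 2: [3], 3: [], 4: [5], 5: [6], 6: []}
same = equal_length_paths(adj, 1, 3, 4, 6)
```

The product graph on pairs has quadratically many nodes in the size of G, yet each of its nodes is described by two nodes of G, that is, by logarithmically many bits.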