Algorithms for Game Metrics

Simulation and bisimulation metrics for stochastic systems provide a quantitative generalization of the classical simulation and bisimulation relations. These metrics capture the similarity of states with respect to quantitative specifications written in the quantitative {\mu}-calculus and related probabilistic logics. We first show that the metrics provide a bound for the difference in long-run average and discounted average behavior across states, indicating that the metrics can be used both in system verification and in performance evaluation. For turn-based games and MDPs, we provide a polynomial-time algorithm for the computation of the one-step metric distance between states. The algorithm is based on linear programming; it improves on the previously known exponential-time algorithm based on a reduction to the theory of reals. We then present PSPACE algorithms for both the decision problem and the problem of approximating the metric distance between two states, matching the best known algorithms for Markov chains. For the bisimulation kernel of the metric, our algorithm works in time O(n^4) for both turn-based games and MDPs, improving the previously best known O(n^9 \cdot log(n)) time algorithm for MDPs. For a concurrent game G, we show that computing the exact distance between states is at least as hard as computing the value of concurrent reachability games and the square-root-sum problem in computational geometry. We show that checking whether the metric distance is bounded by a rational r can be done via a reduction to the theory of real closed fields, involving a formula with three quantifier alternations, yielding O(|G|^O(|G|^5)) time complexity, improving the previously known reduction, which yielded O(|G|^O(|G|^7)) time complexity. These algorithms can be iterated to approximate the metrics using binary search.


Introduction
System metrics constitute a quantitative generalization of system relations. The bisimulation relation captures state equivalence: two states s and t are bisimilar if and only if they cannot be distinguished by any formula of the µ-calculus [5]. The bisimulation metric captures the degree of difference between two states: the bisimulation distance between s and t is a real number that provides a tight bound for the difference in value of formulas of the quantitative µ-calculus at s and t [12]. A similar connection holds between the simulation relation and the simulation metric.
The classical system relations are a basic tool in the study of boolean properties of systems, that is, the properties that yield a truth value. As an example, if a state s of a transition system can reach a set of target states R, written s |= ◇R in temporal logic, and t can simulate s, then we can conclude t |= ◇R. System metrics play a similarly fundamental role in the study of the quantitative behavior of systems. As an example, if a state s of a Markov chain can reach a set of target states R with probability 0.8, written s |= P_{≥0.8} ◇R, and if the metric simulation distance from t to s is 0.3, then we can conclude t |= P_{≥0.5} ◇R. The simulation relation is at the basis of the notions of system refinement and implementation, where qualitative properties are concerned. In analogous fashion, simulation metrics provide a notion of approximate refinement and implementation for quantitative properties.
We consider three classes of systems:
• Markov decision processes. In these systems there is one player. At each state, the player can choose a move; the current state and the move determine a probability distribution over the successor states.
• Turn-based games. In these systems there are two players. At each state, only one of the two players can choose a move; the current state and the move determine a probability distribution over the successor states.
• Concurrent games. In these systems there are two players. At each state, both players choose moves simultaneously and independently; the current state and the chosen moves determine a probability distribution over the successor states.
System metrics were first studied for Markov chains and Markov decision processes (MDPs) [12,32,33,13,14], and they have recently been extended to two-player turn-based and concurrent games [11]. The fundamental property of the metrics is that they provide a tight bound for the difference in value that formulas belonging to quantitative specification languages assume at the states of a system. More precisely, let qµ indicate the quantitative µ-calculus, a specification language in which many of the classical specification properties, including reachability and safety properties, can be written [10]. The metric bisimulation distance between two states s and t, denoted [s ≃_g t], has the property that [s ≃_g t] = sup_{ϕ ∈ qµ} |ϕ(s) − ϕ(t)|, where ϕ(s) and ϕ(t) are the values ϕ assumes at s and t. To each metric is associated a kernel: the kernel of a metric d is the relation that relates the pairs of states that have distance 0. The kernel of the simulation metric is probabilistic simulation; the kernel of the bisimulation metric is probabilistic bisimulation [27].
Metric as bound for discounted and long-run average payoff. Our first result is that the metrics developed in [11] provide a bound for the difference in long-run average and discounted
average properties across states of a system. These average rewards play a central role in the theory of stochastic games, and in its applications to optimal control and economics [4,17]. Thus, the metrics of [11] are useful both for system verification and for performance evaluation, supporting our belief that they constitute the canonical metrics for the study of the similarity of states in a game. We point out that it is possible to define a discounted version [≃_g]^α of the game bisimulation metric; however, we show that this discounted metric does not provide a bound for the difference in discounted values.
Algorithmic results. Next, we investigate algorithms for the computation of the metrics. The metrics can be computed in iterative fashion, following the inductive way in which they are defined. A metric d can be computed as the limit of a monotonically increasing sequence of approximations d_0, d_1, d_2, ..., where d_0(s, t) is the difference in value that variables can have at states s and t. For k ≥ 0, d_{k+1} is obtained from d_k via d_{k+1} = H(d_k), where the operator H depends on the metric (bisimulation or simulation) and on the type of system. Our main results are as follows:
(1) Metrics for turn-based games and MDPs. We show that for turn-based games and MDPs, the one-step metric operator H for both bisimulation and simulation can be computed in polynomial time, via a reduction to linear programming (LP). The only previously known algorithm, which can be inferred from [11], had EXPTIME complexity and relied on a reduction to the theory of real closed fields; the algorithm thus had more complexity-theoretic than practical value. The key step in obtaining our polynomial-time algorithm consists in transforming the original sup-inf non-linear optimization problem (which required the theory of reals) into a quadratic-size inf linear optimization problem that can be solved via LP. We then present PSPACE algorithms for both the decision problem of the
metric distance between two states and for the problem of computing the approximate metric distance between two states for turn-based games and MDPs. Our algorithms match the complexity of the best known algorithms for the sub-class of Markov chains [31].
(2) Metrics for concurrent games. For concurrent games, our algorithms for the H operator still rely on decision procedures for the theory of real closed fields, leading to an EXPTIME procedure. However, the algorithms that could be inferred from [11] had time-complexity O(|G|^O(|G|^7)), where |G| is the size of a game; we improve this result by presenting algorithms with O(|G|^O(|G|^5)) time-complexity.
(3) Hardness of metric computation in concurrent games. We show that computing the exact distance of states of concurrent games is at least as hard as computing the value of concurrent reachability games [15,8], which is known to be at least as hard as solving the square-root-sum problem in computational geometry [18]. These two problems are known to lie in PSPACE, and have resisted many attempts to show that they are in NP.
(4) Kernel of the metrics. We present polynomial-time algorithms to compute the simulation and bisimulation kernel of the metrics for turn-based games and MDPs. Our algorithm for the bisimulation kernel of the metric runs in time O(n^4) (assuming a constant number of moves), as compared to the previously known O(n^9 · log(n)) algorithm of [35] for MDPs, where n is the size of the state space. For concurrent games the simulation and the bisimulation kernel can be computed in time where |G| is the size of a game. Our formulation of probabilistic simulation and bisimulation differs from the one previously considered for MDPs in [2]: there, the names of moves (called "labels") must be preserved by simulation and bisimulation, so that a move from a state has at most one candidate simulator move at another state. Our problem for MDPs is closer to the one considered in [35], where labels must be preserved, but where a label can be associated with multiple probability distributions (moves).
For turn-based games and MDPs, the algorithms for probabilistic simulation and bisimulation can be obtained from the LP algorithms that yield the metrics. For probabilistic simulation, the algorithm we obtain coincides with the algorithm previously published in [35]. The algorithm requires the solution of feasibility-LP problems with a number of variables and inequalities that is quadratic in the size of the system. For probabilistic bisimulation, we are able to improve on this result by providing an algorithm that requires the solution of feasibility-LP problems that have linearly many variables and constraints. Precisely, as for ordinary bisimulation, the kernel is computed via iterative refinement of a partition of the state space [24]. Given two states that belong to the same partition, to decide whether the states need to be split in the next partition-refinement step, we present an algorithm that requires the solution of a feasibility-LP problem with a number of variables equal to the number of moves available at the states, and a number of constraints linear in the number of equivalence classes. Overall, our algorithm for bisimulation runs in time O(n^4) (assuming a constant number of moves), considerably improving the O(n^9 · log(n)) algorithm of [35] for MDPs, and providing for the first time a polynomial algorithm for turn-based games.
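The partition-refinement scheme mentioned above can be sketched generically as follows. This is a minimal illustration, not the algorithm of the paper: the names `refine`, `block_index` and `must_split` are hypothetical, and `must_split` stands in for the feasibility-LP test that decides whether two states of the same block have to be separated. In the toy usage the test is the one for ordinary bisimulation on a deterministic system (two states are split when their successors lie in different blocks).

```python
from typing import Callable, Dict, Hashable, List

State = Hashable
Partition = List[List[State]]


def block_index(partition: Partition) -> Dict[State, int]:
    """Map every state to the index of the block that contains it."""
    return {s: i for i, block in enumerate(partition) for s in block}


def refine(
    initial: Partition,
    must_split: Callable[[State, State, Dict[State, int]], bool],
) -> Partition:
    """Split blocks until no block contains a pair that must be separated.

    `must_split(s, t, index)` is assumed to behave like the complement of an
    equivalence relation inside each block, relative to the current partition.
    """
    partition = [list(block) for block in initial]
    while True:
        index = block_index(partition)
        refined: Partition = []
        for block in partition:
            sub_blocks: List[List[State]] = []
            for s in block:
                # Join the first sub-block whose representative need not be
                # separated from s; otherwise open a new sub-block.
                for sub in sub_blocks:
                    if not must_split(sub[0], s, index):
                        sub.append(s)
                        break
                else:
                    sub_blocks.append([s])
            refined.extend(sub_blocks)
        # Refinement only ever splits blocks, so an unchanged block count
        # means that the partition is stable.
        if len(refined) == len(partition):
            return refined
        partition = refined


# Toy usage: a deterministic system given by a successor map (hypothetical data).
succ = {"a": "c", "b": "d", "f": "c", "c": "c", "d": "e", "e": "e"}
initial = [["a", "b", "f"], ["c", "d"], ["e"]]  # blocks of equal observations
result = refine(initial, lambda s, t, index: index[succ[s]] != index[succ[t]])
```

In this sketch each round performs a pairwise test against one representative per sub-block; the cost of the overall procedure is then dominated by the cost of the test plugged in as `must_split`.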

Definitions
Valuations. Let [θ_1, θ_2] ⊆ ℝ be a fixed, non-singleton real interval. Given a set of states S, a valuation over S is a function f : S → [θ_1, θ_2] associating with every state s ∈ S a value θ_1 ≤ f(s) ≤ θ_2; we let F be the set of all valuations. For c ∈ [θ_1, θ_2], we also write c for the constant valuation such that c(s) = c at all s ∈ S. We order valuations pointwise: for f, g ∈ F, we write f ≤ g iff f(s) ≤ g(s) at all s ∈ S; we remark that F, under ≤, forms a lattice. Given a, b ∈ ℝ, we write a ⊔ b = max{a, b} and a ⊓ b = min{a, b}; we also let a ⊕ b = min{1, max{0, a + b}} and a ⊖ b = max{0, min{1, a − b}}. We extend ⊓, ⊔, +, −, ⊕, ⊖ to valuations by interpreting them in pointwise fashion.
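As a small illustration of these definitions, the following sketch implements the clamped operators ⊕ and ⊖ and their pointwise extension to valuations. It assumes the interval [θ_1, θ_2] = [0, 1], which is the range the clamping in ⊕ and ⊖ refers to; the function names and the dictionary representation of valuations are illustrative choices, not notation from the paper.

```python
from typing import Callable, Dict

Valuation = Dict[str, float]  # state name -> value in [0, 1] (assumed interval)


def oplus(a: float, b: float) -> float:
    """a ⊕ b = min{1, max{0, a + b}}."""
    return min(1.0, max(0.0, a + b))


def ominus(a: float, b: float) -> float:
    """a ⊖ b = max{0, min{1, a − b}}."""
    return max(0.0, min(1.0, a - b))


def pointwise(op: Callable[[float, float], float], f: Valuation, g: Valuation) -> Valuation:
    """Lift a binary operator on reals to valuations, state by state."""
    return {s: op(f[s], g[s]) for s in f}


def leq(f: Valuation, g: Valuation) -> bool:
    """Pointwise order on valuations: f ≤ g iff f(s) ≤ g(s) at all states."""
    return all(f[s] <= g[s] for s in f)


f = {"s": 0.75, "t": 0.25}
g = {"s": 0.5, "t": 0.5}
f_plus_g = pointwise(oplus, f, g)    # clamps 1.25 down to 1.0 at state s
f_minus_g = pointwise(ominus, f, g)  # clamps -0.25 up to 0.0 at state t
```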
Game structures.For a finite set A, let Dist(A) denote the set of probability distributions over A. We say that p ∈ Dist(A) is deterministic if there is a ∈ A such that p(a) = 1.We assume a fixed finite set V of observation variables.
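The moves in Moves are called pure moves. We extend the transition function to mixed moves by defining, for s ∈ S and x_1 ∈ D_1(s), x_2 ∈ D_2(s), the distribution δ(s, x_1, x_2) by δ(s, x_1, x_2)(t) = Σ_{a ∈ Γ_1(s)} Σ_{b ∈ Γ_2(s)} δ(s, a, b)(t) · x_1(a) · x_2(b) for all t ∈ S. A path σ of G is an infinite sequence s_0, s_1, s_2, ... of states in S such that for all k ≥ 0 there exist moves a_k ∈ Γ_1(s_k) and b_k ∈ Γ_2(s_k) with δ(s_k, a_k, b_k)(s_{k+1}) > 0. We write Σ for the set of all paths, and Σ_s for the set of all paths starting from state s.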
A strategy for player i ∈ {1, 2} is a function π_i : S^+ → Dist(Moves) that associates with every non-empty finite sequence σ ∈ S^+ of states, representing the history of the game, a probability distribution π_i(σ), which is used to select the next move of player i; we require that for all σ ∈ S^* and states s ∈ S, if π_i(σs)(a) > 0, then a ∈ Γ_i(s). We write Π_i for the set of strategies for player i. Once the starting state s and the strategies π_1 and π_2 for the two players have been chosen, the game is reduced to an ordinary stochastic process, denoted G_s^{π_1,π_2}, which defines a probability distribution on the set Σ of paths. We denote by Pr_s^{π_1,π_2}(·) the probability of a measurable event (a set of paths) with respect to this process, and by E_s^{π_1,π_2}(·) the associated expectation operator. For k ≥ 0, we let X_k : Σ → S be the random variable denoting the k-th state along a path.
One-step expectations and predecessor operators. Given a valuation f ∈ F, a state s ∈ S, and two mixed moves x_1 ∈ D_1(s) and x_2 ∈ D_2(s), we define the expectation of f from s under x_1, x_2 by E_s^{x_1,x_2}(f) = Σ_{t ∈ S} δ(s, x_1, x_2)(t) · f(t). For a game structure G and i ∈ {1, 2}, we define the valuation transformer Pre_i : F → F by, for all f ∈ F and s ∈ S, Pre_i(f)(s) = sup_{x_i ∈ D_i(s)} inf_{x_∼i ∈ D_∼i(s)} E_s^{x_1,x_2}(f). Intuitively, Pre_i(f)(s) is the maximal expectation player i can achieve of f after one step from s: this is the standard "one-day" or "next-stage" operator of the theory of repeated games [17].
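The one-step expectation and the predecessor operator can be illustrated with a short sketch. It assumes the usual reading of the definitions: the expectation under mixed moves weights each successor distribution by the probabilities of the chosen pure moves, and at a state where a single player chooses (as in MDPs and turn-based games) the supremum in Pre is attained at a pure move, because the expectation is linear in the mixed move. The function names and the dictionary encoding are hypothetical.

```python
from fractions import Fraction
from typing import Dict, Tuple

Dist = Dict[str, Fraction]  # successor state (or move) -> probability


def expectation(
    delta: Dict[Tuple[str, str], Dist],
    x1: Dist,
    x2: Dist,
    f: Dict[str, Fraction],
) -> Fraction:
    """Expectation of f after one step from a fixed state under mixed moves.

    `delta[(a, b)]` is the successor distribution when player 1 plays the
    pure move a and player 2 plays the pure move b at that state.
    """
    return sum(
        p1 * p2 * prob * f[t]
        for a, p1 in x1.items()
        for b, p2 in x2.items()
        for t, prob in delta[(a, b)].items()
    )


def pre_single_chooser(moves: Dict[str, Dist], f: Dict[str, Fraction]) -> Fraction:
    """Pre at a state where only one player chooses: best pure move for f."""
    return max(sum(prob * f[t] for t, prob in dist.items()) for dist in moves.values())


one, half = Fraction(1), Fraction(1, 2)
f = {"u": Fraction(1), "v": Fraction(0)}

# Concurrent step: player 1 has one move, player 2 mixes between two moves.
delta = {("a", "b"): {"u": one}, ("a", "b2"): {"v": one}}
mixed_value = expectation(delta, {"a": one}, {"b": Fraction(1, 4), "b2": Fraction(3, 4)}, f)

# Single-player step: move c goes to u, move e splits evenly between u and v.
best_value = pre_single_chooser({"c": {"u": one}, "e": {"u": half, "v": half}}, f)
```

Exact rationals are used here only to keep the example free of rounding; floating-point values would work equally well for the expectation itself.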
2.1. Quantitative µ-calculus. We consider the set of properties expressed by the quantitative µ-calculus (qµ). As discussed in [20,10,22], a large set of properties can be encoded in qµ, spanning from basic properties such as maximal reachability and safety probability, to the maximal probability of satisfying a general ω-regular specification.
Syntax. The syntax of quantitative µ-calculus is defined with respect to the set of observation variables V as well as a set MVars of calculus variables, which are distinct from the observation variables in V. The syntax is given as follows: ϕ ::= c | v | V | ¬ϕ | ϕ ∨ ϕ | ϕ ∧ ϕ | ϕ ⊕ c | ϕ ⊖ c | pre_1(ϕ) | pre_2(ϕ) | µV.ϕ | νV.ϕ for constants c ∈ [θ_1, θ_2], observation variables v ∈ V, and calculus variables V ∈ MVars.
In the formulas µV.ϕ and νV.ϕ, we furthermore require that all occurrences of the bound variable V in ϕ occur in the scope of an even number of occurrences of the complement operator ¬. A formula ϕ is closed if every calculus variable V in ϕ occurs in the scope of a quantifier µV or νV. From now on, with abuse of notation, we denote by qµ the set of closed formulas of qµ. A formula is a player i formula, for i ∈ {1, 2}, if ϕ does not contain the pre_∼i operator; we denote with qµ_i the syntactic subset of qµ consisting only of closed player i formulas. A formula is in positive form if the negation appears only in front of constants and observation variables, i.e., in the context ¬c and ¬v; we denote with qµ^+ and qµ_i^+ the subsets of qµ and qµ_i consisting only of positive formulas. We remark that the fixpoint operators µ and ν will not be needed to achieve our results on the logical characterization of game relations. They have been included in the calculus because they allow the expression of many interesting properties, such as safety, reachability, and in general, ω-regular properties. The operators ⊕ and ⊖, on the other hand, are necessary for our results.
Semantics. A variable valuation ξ : MVars → F is a function that maps every variable V ∈ MVars to a valuation in F. We write ξ[V → f] for the valuation that agrees with ξ on all variables, except that V is mapped to f. Given a game structure G and a variable valuation ξ, every formula ϕ of the quantitative µ-calculus defines a valuation [[ϕ]]_ξ ∈ F. The existence of the fixpoints is guaranteed by the monotonicity and continuity of all operators, and the fixpoints can be computed by Picard iteration [10]. If ϕ is closed, [[ϕ]]_ξ is independent of ξ, and we write simply [[ϕ]].
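To make the Picard iteration concrete, the sketch below computes a least fixpoint by iterating a monotone valuation transformer from the bottom valuation. As an assumed example it uses the reachability formula µV.(target ⊔ pre(V)) on a Markov chain, where no player has a choice and pre reduces to a plain expectation; `picard_lfp`, `reach_step` and the chain itself are hypothetical names and data, and the interval is taken to be [0, 1].

```python
from fractions import Fraction
from typing import Callable, Dict

Valuation = Dict[str, Fraction]


def picard_lfp(
    step: Callable[[Valuation], Valuation],
    bottom: Valuation,
    tol: Fraction = Fraction(0),
    max_iters: int = 10_000,
) -> Valuation:
    """Iterate `step` from `bottom` until successive iterates differ by at most `tol`."""
    current = dict(bottom)
    for _ in range(max_iters):
        nxt = step(current)
        if all(abs(nxt[s] - current[s]) <= tol for s in current):
            return nxt
        current = nxt
    raise RuntimeError("no convergence within max_iters")


half, one, zero = Fraction(1, 2), Fraction(1), Fraction(0)

# Markov chain: from s, go to t or to bad with probability 1/2 each; t reaches goal.
chain = {
    "s": {"t": half, "bad": half},
    "t": {"goal": one},
    "goal": {"goal": one},
    "bad": {"bad": one},
}
target = {"s": zero, "t": zero, "goal": one, "bad": zero}


def reach_step(v: Valuation) -> Valuation:
    """One unfolding of target ⊔ pre(V): max of the target value and the expectation."""
    return {s: max(target[s], sum(p * v[t] for t, p in chain[s].items())) for s in chain}


reach = picard_lfp(reach_step, {s: zero for s in chain})
```

On this acyclic example the iteration stabilizes exactly after a few rounds; in general the iterates only converge in the limit, so a positive tolerance would be needed.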

Discounted quantitative µ-calculus.
A discounted version of the µ-calculus was introduced in [9]; we call this dµ.Let Λ be a finite set of discount parameters that take values in the interval [0, 1).The discounted µ-calculus extends qµ by introducing discounted versions of the player pre modalities.The syntax replaces pre i (ϕ) for player i ∈ {1, 2} with its discounted variant, λ • pre i (ϕ), where λ ∈ Λ is a discount factor that discounts one-step valuations.Negation in the calculus is defined as This leads to two additional pre-modalities for the players, (1 − λ) + λ • pre i (ϕ).

Game bisimulation and simulation metrics.
A directed metric is a function d : S^2 → ℝ_{≥0} which satisfies d(s, s) = 0 and the triangle inequality d(s, t) ≤ d(s, u) + d(u, t) for all s, t, u ∈ S. We denote by M ⊆ S^2 → ℝ the space of all directed metrics; this space, ordered pointwise, forms a lattice which we indicate with (M, ≤). Since d(s, t) may be zero for s ≠ t, these functions are pseudo-metrics as per prevailing terminology [32]. In the following, we omit "directed" and simply say metric when the context is clear. For a metric d, we indicate with C(d) the set of valuations k ∈ F where k(s) − k(t) ≤ d(s, t) for every s, t ∈ S. A metric transformer H_1 : M → M is defined as follows, for all d ∈ M and s, t ∈ S:
H_1(d)(s, t) = p(s, t) ⊔ sup_{k ∈ C(d)} (Pre_1(k)(s) − Pre_1(k)(t)),   (2.1)
where p(s, t) = max_{v ∈ V} |[v](s) − [v](t)| is the propositional distance between s and t. The player 1 game simulation metric [⪯_1] is the least fixpoint of H_1; the game bisimulation metric [≃_1] is the least symmetrical fixpoint of H_1, that is, the least fixpoint of the transformer H_{≃1} defined as follows, for all d ∈ M and s, t ∈ S:
H_{≃1}(d)(s, t) = H_1(d)(s, t) ⊔ H_1(d)(t, s).   (2.2)
The operator H_1 is monotonic, non-decreasing and continuous in the lattice (M, ≤). We can therefore compute the least fixpoint of H_1 using Picard iteration; we denote by [⪯_1^n] = H_1^n(0) the n-th iterate. From the determinacy of concurrent games with respect to ω-regular goals [21], we have that the game bisimulation metric is reciprocal, in that [≃_1] = [≃_2]; we will thus simply write [≃_g]. Similarly, for all s, t ∈ S we have [s ⪯_1 t] = [t ⪯_2 s]. The main result in [11] about these metrics is that they are logically characterized by the quantitative µ-calculus of [10]. We omit the formal definition of the syntax and semantics of the quantitative µ-calculus; we refer the reader to [10] for details. Given a game structure G, every closed formula ϕ of the quantitative µ-calculus defines a valuation [[ϕ]] ∈ F.
Let qµ (respectively, qµ_1^+) consist of all quantitative µ-calculus formulas (respectively, all quantitative µ-calculus formulas with only the Pre_1 operator and all negations before atomic propositions). The result of [11] shows that for all states s, t ∈ S,
[s ⪯_1 t] = sup_{ϕ ∈ qµ_1^+} (ϕ(s) − ϕ(t)),   [s ≃_g t] = sup_{ϕ ∈ qµ} |ϕ(s) − ϕ(t)|.   (2.3)
Metrics for the discounted quantitative µ-calculus. We call dµ^α the discounted µ-calculus with all discount parameters ≤ α. We define the discounted metrics via an α-discounted metric transformer H_1^α : M → M, defined for all d ∈ M and all s, t ∈ S by:
H_1^α(d)(s, t) = p(s, t) ⊔ α · sup_{k ∈ C(d)} (Pre_1(k)(s) − Pre_1(k)(t)).
Again, H_1^α is continuous and monotonic in the lattice (M, ≤). The α-discounted simulation metric [⪯_1]^α is the least fixpoint of H_1^α, and the α-discounted bisimulation metric [≃_1]^α is the least symmetrical fixpoint of H_1^α. The following result follows easily by induction on the Picard iterations used to compute the distances [9]; for all states s, t ∈ S and a discount factor α ∈ [0, 1), [s ⪯_1 t]^α ≤ [s ⪯_1 t] and [s ≃_1 t]^α ≤ [s ≃_1 t]. Using techniques similar to the undiscounted case, we can prove that for every game structure G and discount factor α ∈ [0, 1), the fixpoint [⪯_i]^α is a directed metric and [≃_i]^α is a metric, and that they are reciprocal, i.e., [s ⪯_1 t]^α = [t ⪯_2 s]^α and [≃_1]^α = [≃_2]^α. Given that the discounted bisimulation metric coincides for the two players, we write [≃_g]^α. We now state without proof that the discounted µ-calculus provides a logical characterization of the discounted metric. The proof is based on induction on the structure of formulas, and closely follows the result for the undiscounted case [11]. Let dµ^α (respectively, dµ_1^{α,+}) consist of all discounted µ-calculus formulas (respectively, all discounted µ-calculus formulas with only the Pre_1 operator and all negations before atomic propositions). It follows that for all game structures G and states s, t ∈ S,
[s ⪯_1 t]^α = sup_{ϕ ∈ dµ_1^{α,+}} (ϕ(s) − ϕ(t)),   [s ≃_g t]^α = sup_{ϕ ∈ dµ^α} |ϕ(s) − ϕ(t)|.
Metric kernels. The kernel of the metric [≃_g] (respectively, [≃_g]^α) is the relation ≃_g (respectively, ≃_g^α) that relates s and t iff [s ≃_g t] = 0 (respectively, [s ≃_g t]^α = 0); the relation ≃_g is called the game bisimulation relation [11] and the relation ≃_g^α is called the discounted game bisimulation relation. Similarly, we define the game simulation preorder s ⪯_1 t as the kernel of the directed metric [⪯_1], that is, s ⪯_1 t iff [s ⪯_1 t] = 0. The discounted game simulation preorder is defined analogously.
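The two basic notions used in this section, directed metrics and the constraint set C(d), are easy to check mechanically on a finite state space. The following sketch is only an illustration: the function names and the encoding of d as a dictionary over ordered pairs are assumptions, and the example distances are made up.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

Metric = Dict[Tuple[str, str], float]


def is_directed_metric(d: Metric, states: Sequence[str]) -> bool:
    """Check d(s, s) = 0, non-negativity, and the triangle inequality."""
    zero_diag = all(d[(s, s)] == 0 for s in states)
    nonneg = all(d[(s, t)] >= 0 for s, t in product(states, repeat=2))
    triangle = all(
        d[(s, t)] <= d[(s, u)] + d[(u, t)] for s, t, u in product(states, repeat=3)
    )
    return zero_diag and nonneg and triangle


def in_C(k: Dict[str, float], d: Metric, states: Sequence[str]) -> bool:
    """Membership in C(d): k(s) − k(t) ≤ d(s, t) for every pair of states."""
    return all(k[s] - k[t] <= d[(s, t)] for s, t in product(states, repeat=2))


states = ["a", "b", "c"]
# A directed (asymmetric) example: b is at distance 0 from a, but not conversely.
d = {(s, s): 0.0 for s in states}
d.update({("a", "b"): 0.0, ("b", "a"): 1.0, ("a", "c"): 1.0,
          ("c", "a"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0})

ok_valuation = {"a": 0.2, "b": 0.5, "c": 0.0}
bad_valuation = {"a": 0.5, "b": 0.2, "c": 0.0}  # violates k(a) − k(b) ≤ d(a, b) = 0

broken = dict(d)
broken[("a", "c")] = 3.0  # breaks d(a, c) ≤ d(a, b) + d(b, c)
```

Note that d(a, b) = 0 with a ≠ b is allowed, which is why these functions are pseudo-metrics.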

Bounds for Average and Discounted Payoff Games
From (2.3) it follows that the game bisimulation metric provides a tight bound for the difference in valuations of quantitative µ-calculus formulas. In this section, we show that the game bisimulation metric also provides a bound for the difference in average and discounted value of games. This lends further support for the game bisimulation metric, and its kernel, the game bisimulation relation, being the canonical game metrics and relations.
3.1. Discounted payoff games. Let π_1 and π_2 be strategies of player 1 and player 2 respectively. Let α ∈ [0, 1) be a discount factor. The α-discounted payoff v_1^α(s, π_1, π_2) for player 1 at a state s for a variable r ∈ V and the strategies π_1 and π_2 is defined as: where X_n is a random variable representing the state of the game in step n. The discounted payoff for player 2 is defined as . Thus, player 1 wins (and player 2 loses) the "discounted sum" of the valuations of r along the path, where the discount factor weighs future rewards with the discount α. Given a state s ∈ S, we are interested in finding the maximal payoff v_i^α(s) that player i can ensure against all opponent strategies, when the game starts from state s ∈ S. This maximal payoff is given by: These values can be computed as the limit of the sequence of α-discounted, n-step rewards, for n → ∞. For i ∈ {1, 2}, we define a sequence of valuations w_i^α(0)(s), w_i^α(1)(s), w_i^α(2)(s), ... as follows: for all s ∈ S and n ≥ 0: where the initial valuation w_i^α(0) is arbitrary. Shapley proved that w_i^α = lim_{n→∞} w_i^α(n) [28].
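The iterative computation of discounted values can be sketched for the single-player case as follows. The recurrence used here, w(n+1)(s) = (1 − α) · [r](s) + α · Pre(w(n))(s), is the standard normalized ("discounted average") form and is an assumption of this sketch rather than a quotation of the definition above; the function name, the tolerance, and the example MDP are likewise hypothetical.

```python
from typing import Dict

Dist = Dict[str, float]


def discounted_values(
    moves: Dict[str, Dict[str, Dist]],
    reward: Dict[str, float],
    alpha: float,
    tol: float = 1e-12,
    max_iters: int = 100_000,
) -> Dict[str, float]:
    """Value iteration for an MDP with normalized alpha-discounted reward.

    `moves[s][a]` is the successor distribution of move a at state s.
    Assumed recurrence: w'(s) = (1 - alpha) * r(s) + alpha * max_a E_s^a(w).
    """
    w = {s: 0.0 for s in moves}  # the initial valuation is arbitrary
    for _ in range(max_iters):
        nxt = {
            s: (1 - alpha) * reward[s]
            + alpha * max(sum(p * w[t] for t, p in dist.items()) for dist in moves[s].values())
            for s in moves
        }
        if max(abs(nxt[s] - w[s]) for s in moves) <= tol:
            return nxt
        w = nxt
    raise RuntimeError("no convergence within max_iters")


# Hypothetical MDP: s and t are absorbing with rewards 1 and 0; u chooses between them.
moves = {
    "s": {"stay": {"s": 1.0}},
    "t": {"stay": {"t": 1.0}},
    "u": {"go_s": {"s": 1.0}, "go_t": {"t": 1.0}},
}
reward = {"s": 1.0, "t": 0.0, "u": 0.0}
w = discounted_values(moves, reward, alpha=0.5)
```

With this normalization the values stay within the range of the rewards, which is what allows them to be compared with metric distances bounded by θ_2 − θ_1. In the example, the value at u is α times the value at s, since u earns nothing in the first step and then moves to s.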
3.2. Average payoff games. Let π_1 and π_2 be strategies of player 1 and player 2 respectively. The average payoff v_1(s, π_1, π_2) for player 1 at a state s for a variable r ∈ V and the strategies π_1 and π_2 is defined as where X_k is a random variable representing the k-th state of the game. The reward for player 2 is . A game structure G with average payoff is called an average reward game. The average value of the game G at s for player i ∈ {1, 2} is defined as Mertens and Neyman established the determinacy of average reward games, and showed that the limit of the discounted value of a game as all the discount factors tend to 1 is the same as the average value of the game: for all s ∈ S and i ∈ {1, 2}, we have lim_{α→1} w_i^α(s) = w_i(s) [23]. It is easy to show that the average value of a game is a valuation.

3.3. Metrics for discounted and average payoffs. We show that the game simulation metric [⪯_1] provides a bound for discounted and long-run rewards. The discounted metric [⪯_1]^α, on the other hand, does not provide such a bound, as the following example shows.
Figure 1: Example that shows that the discounted metric may not be an upper bound for the difference in the discounted value across states.
In the following we consider player 1 rewards (the case for player 2 is identical).
Theorem 1. The following assertions hold.
(1) For all game structures G, α-discounted rewards w α 1 , for all states s, t ∈ S, we have, (a) (2) There exists a game structure G, states s, t ∈ S, such that for all α-discounted rewards We first prove assertion (1)(a).As the metric can be computed via Picard iteration, we have for all n ≥ 0: We prove by induction on n ≥ 0 that For all s ∈ S, taking w α 1 (0)(s) = [r](s), the base case follows.Assume the result holds for n − 1 ≥ 0. We have: where the last step follows by (3.4), since by the induction hypothesis we have w α ).This proves assertion (1)(a).Given (1)(a), from the definition of [s The example shown in Figure 1 proves the second assertion.
Using the fact that the limit of the discounted reward, for a discount factor that approaches 1, is equal to the average reward, we obtain that the metrics provide a bound for the difference in average values as well.
Corollary 1. For all game structures G and states s and t, we have (a) w(s
3.4. Metrics for total rewards. The total reward v_1^T(s, π_1, π_2) for player 1 at a state s for a variable r ∈ V and the strategies π_1 ∈ Π_1 and π_2 ∈ Π_2 is defined as [17]: where X_j is a random variable representing the j-th state of the game. The payoff v_2^T(s, π_1, π_2) for player 2 is defined by replacing [r] with −[r] in (3.5). The total-reward value of the game G at s for player i ∈ {1, 2} is defined analogously to the average value, via, While the game bisimulation metric [≃_g] provides an upper bound for the difference in discounted reward across states, as well as for the difference in average reward across states, it does not provide a bound for the difference in total reward. We now introduce a new metric, the total reward metric, [⊲⊳_g], which provides such a bound. For a discount factor α ∈ [0, 1], we define a metric transformer H_{¢1}^α : M → M as follows. For all d ∈ M and s, t ∈ S, we let: The metric [¢_1]^α (resp. [⊲⊳_1]^α) is obtained as the least (resp. least symmetrical) fixpoint of (3.6). We write If α < 1 we get the discounted total reward metric and if α = 1 we get the undiscounted total reward metric. While the discounted total reward metric is bounded, the undiscounted total reward metric may not be bounded. The total metrics provide bounds for the difference in discounted, average, and total reward between states.
Theorem 2. The following assertions hold.
(1) For all game structures G, for all discount factors α ∈ [0, 1), for all states s, t ∈ S, ) There exists a game structure G and states s, t ∈ S such that, [s Proof.For assertion (1)(a), notice that p(s, t) ≤ (θ 2 −θ 1 ).Consider the n-step Picard iterate towards the metric distance.We have, In the limit this yields [s ¢ 1 t] α ≤ (θ 2 − θ 1 )/(1 − α).Assertion (1)(b) follows by induction on the Picard iterations that realize the metric distance.For all n ≥ 0, [s Assertion (1)(c) follows by the definition of the discounted total reward metric where we have replaced the ⊔ with a +.By induction, for all n ≥ 0, from the proof of Theorem 1 we have, For assertion (1)(d), towards an inductive argument on the Picard iterates that realize the metric, for all n ≥ 0, we have [ using Corollary 1.This proves assertion (1)(d).We now prove assertion (1)(e) by induction and show that for all n ≥ 0, As the metric can be computed via Picard iteration, we have for all n ≥ 0: We define a valuation transformer u : F → F as u(0) = [r] and for all n > 0 and state s ∈ S as, We take w T 1 (0) = u(0) = [r] and for n > 0, from the definition of total rewards (3.5), we get the n-step total reward value at a state s ∈ S in terms of u as, Notice that w T 1 (n)(s) ≤ u(n) for all n ≥ 0. When n = 0, the result is immediate by the definition of w T 1 (0), noticing that [s ¢ 0 1 t] = p(s, t).Assume the result holds for n − 1 ≥ 0. We have: where (3.9) follows from (3.8) by (3.7), since by our induction hypothesis we have w T ) for all 0 ≤ i < n and (3.10) follows from (3.9) from the monotonicity of the undiscounted total reward metric.To prove assertion (2), consider the game structure on the left hand side in Figure 1.The total reward at state s is unbounded; w T 1 (s) = 2+5+. . .= ∞ Now consider a modified version of the game, with identical structure and with states s ′ and t ′ corresponding to s and t of the original game.Let [r](t ′ ) = 0.In the modified game, w T 1 (s ′ ) = 2. 
From result (1)(e), since w_1^T(s) = ∞ and w_1^T(s′) = 2, we have [s ¢_1 s′] = ∞. It is a very simple observation that the quantitative µ-calculus does not provide a logical characterization for [¢_1]^α or [¢_1]. In fact, all formulas of the quantitative µ-calculus have valuations in the interval [θ_1, θ_2], while, as stated in Theorem 2, the total reward can be unbounded. The difference is essentially due to the fact that our version of the quantitative µ-calculus lacks a "+" operator. It is not clear how to introduce such a + operator in a context sufficiently restricted to provide a logical characterization for [¢_1]^α; above all, it is not clear whether a canonical calculus, with interesting formal properties, would be obtained.
3.5. Metric kernels. We now show that the kernels of all the metrics defined in the paper coincide: an algorithm developed for the game kernels ⪯_1 and ≃_g computes the kernels of the corresponding discounted and total reward metrics as well.

Algorithms for Turn-Based Games and MDPs
In this section, we present algorithms for computing the metric and its kernel for turn-based games and MDPs. We first present a polynomial-time algorithm to compute the operator H_i(d) that gives the exact one-step distance between two states, for i ∈ {1, 2}. We then present a PSPACE algorithm to decide whether the limit distance between two states s and t (i.e., [s ⪯_1 t]) is at most a rational value r. Our algorithm matches the best bound known for the special class of Markov chains [31]. Finally, we present improved algorithms for the important case of the kernel of the metrics. Since by Theorem 3 the kernels of the metrics introduced in this paper coincide, we present our algorithms for the kernel of the undiscounted metric. For the bisimulation kernel our algorithm is significantly more efficient than previous algorithms.

4.1. Algorithms for the metrics. For turn-based games and MDPs, only one player has a choice of moves at a given state. We consider two player 1 states. A similar analysis applies to player 2 states. We remark that the distance between states in S_i and S_∼i is always θ_2 − θ_1 due to the existence of the variable turn. For a metric d ∈ M, and states s, t ∈ S_1, computing H_1(d)(s, t), given that p(s, t) is trivially computed by its definition, entails evaluating the expression sup_{k ∈ C(d)} (Pre_1(k)(s) − Pre_1(k)(t)), which is the same as
sup_{k ∈ C(d)} sup_{x ∈ D_1(s)} inf_{y ∈ D_1(t)} (E_s^x(k) − E_t^y(k)),
as player 1 is the only player with a choice of moves at state s. By expanding the expectations, we get the following form:
sup_{k ∈ C(d)} sup_{x ∈ D_1(s)} inf_{y ∈ D_1(t)} ( Σ_{u ∈ S} Σ_{a ∈ Γ_1(s)} δ(s, a)(u) · x(a) · k(u) − Σ_{v ∈ S} Σ_{b ∈ Γ_1(t)} δ(t, b)(v) · y(b) · k(v) ).   (4.1)
We observe that the one-step distance as defined in (4.1) is a sup-inf non-linear (quadratic) optimization problem. We now present two lemmas by which we transform (4.1) to an inf linear optimization problem, which we solve by linear programming (LP). The first lemma reduces (4.1) to an equivalent formulation that considers only pure moves at state s. The second lemma further reduces (4.1), using duality, to a formulation that can be solved using LP.
Lemma 1. For all turn-based game structures G, for all player i states s and t, given a metric d ∈ M, the following equality holds:
sup_{k ∈ C(d)} sup_{x ∈ D_i(s)} inf_{y ∈ D_i(t)} (E_s^x(k) − E_t^y(k)) = sup_{a ∈ Γ_i(s)} inf_{y ∈ D_i(t)} sup_{k ∈ C(d)} (E_s^a(k) − E_t^y(k)).
Proof. We prove the result for player 1 states s and t, with the proof being identical for player 2. Given a metric d ∈ M, we have
sup_{k ∈ C(d)} sup_{x ∈ D_1(s)} inf_{y ∈ D_1(t)} (E_s^x(k) − E_t^y(k))
= sup_{k ∈ C(d)} sup_{a ∈ Γ_1(s)} inf_{y ∈ D_1(t)} (E_s^a(k) − E_t^y(k))   (4.2)
= sup_{a ∈ Γ_1(s)} sup_{k ∈ C(d)} inf_{y ∈ D_1(t)} (E_s^a(k) − E_t^y(k))   (4.3)
= sup_{a ∈ Γ_1(s)} inf_{y ∈ D_1(t)} sup_{k ∈ C(d)} (E_s^a(k) − E_t^y(k)).   (4.4)
For a fixed k ∈ C(d), since pure optimal strategies exist at each state for turn-based games and MDPs, we replace the sup_{x ∈ D_1(s)} with sup_{a ∈ Γ_1(s)}, yielding (4.2). Since the difference in expectations is multi-linear, y ∈ D_1(t) is a probability distribution and C(d) is a compact convex set, we can use the generalized minimax theorem [29], and interchange the innermost sup inf to get (4.4) from (4.3). The proof of Lemma 1 is illustrated using the following example.
Example 4.1. Consider the example in Figure 2. In the MDPs shown in the figure, every move leads to a unique successor state, with the exception of move e ∈ Γ_1(s), which leads to states u and v with equal probability. Assume the variable valuations are such that all states are at a propositional distance of 1. Without loss of generality, assume that the valuation k ∈ C(d) is such that k(u) > k(v). By the linearity of expectations, for move c ∈ Γ_1(s), E_s^c(k) ≥ E_s^x(k) for all x ∈ D_1(s). Similar arguments can be made for k(u) < k(v). This gives an informal justification for step (4.2) in the proof; given a k ∈ C(d), there exist pure optimal strategies for the single player with a choice of moves at each state. While we can use pure moves at states s and t if k ∈ C(d) is known, the principal difficulty in directly computing the left hand side of the equality arises from the uncountably many values for k; the distance is the supremum over all possible values of k. In the final equality, step (4.4), and hence by this lemma, we have avoided this difficulty, by showing an equivalent expression that picks a k ∈ C(d) to show the difference in distributions induced over states. As we shall see, this enables computing the one-step metric
distance using a trans-shipping formulation.We remark that while we can use pure moves at state s, we cannot do so at state t in the right hand side of step (4.4) of the proof.Firstly, the proof of the theorem depends on y ∈ D 1 (t) being convex.Secondly, if we could restrict our attention to pure moves at state t, then we can replace inf y∈D 1 (t) with inf f ∈Γ 1 (t) on the right hand side.But this yields too fine a one-step distance.Consider move e at state s.We see that neither c nor b at state t yield distributions over states that match the distribution induced by e.We can then always pick 2 , we match the distribution induced by move e from state s, which implies that for any choice of Intuitively, the right hand side of the equality can be interpreted as a game between a protagonist and an antagonist, with the protagonist picking y ∈ D 1 (t), for every pure move a ∈ Γ 1 (s), to match the induced distributions over states.The antagonist then picks a k ∈ C(d) to maximize the difference in induced distributions.If the distributions match, then no choice of k ∈ C(d) yields a difference in expectations bounded away from 0.
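From Lemma 1, given d ∈ M, we can write the player 1 one-step distance between states s and t as follows:

\[ \mathrm{OneStep}(s,t,d) \;=\; \sup_{a \in \Gamma_1(s)} \inf_{y \in D_1(t)} \sup_{k \in C(d)} \bigl( E^{a}_{s}(k) - E^{y}_{t}(k) \bigr). \tag{4.5} \]

Hence we compute, for all a ∈ Γ_1(s), the expression

\[ \mathrm{OneStep}(s,t,d,a) \;=\; \inf_{y \in D_1(t)} \sup_{k \in C(d)} \bigl( E^{a}_{s}(k) - E^{y}_{t}(k) \bigr), \]

and then choose the maximum, i.e., max_{a∈Γ_1(s)} OneStep(s, t, d, a). We now present a lemma that helps reduce the above inf-sup optimization problem to a linear program. We first introduce some notation. We denote by λ the set of variables λ_{u,v}, for u, v ∈ S. Given a ∈ Γ_1(s) and a distribution y ∈ D_1(t), we write λ ∈ Φ(s, t, a, y) if the following linear constraints are satisfied:

(1) for all v ∈ S: Σ_{u∈S} λ_{u,v} = Σ_{b∈Γ_1(t)} y(b) · δ(t, b)(v);
(2) for all u ∈ S: Σ_{v∈S} λ_{u,v} = δ(s, a)(u);
(3) for all u, v ∈ S: λ_{u,v} ≥ 0.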
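Lemma 2. For all turn-based game structures and MDPs G, for all d ∈ M, and for all s, t ∈ S, the following assertion holds:

\[ \sup_{a \in \Gamma_1(s)} \inf_{y \in D_1(t)} \sup_{k \in C(d)} \bigl( E^{a}_{s}(k) - E^{y}_{t}(k) \bigr) \;=\; \sup_{a \in \Gamma_1(s)} \inf_{y \in D_1(t)} \inf_{\lambda \in \Phi(s,t,a,y)} \sum_{u,v \in S} d(u,v) \cdot \lambda_{u,v}. \]

Proof. By LP duality, using the duality-based results of [32], for all a ∈ Γ_1(s) and y ∈ D_1(t), the maximization over all k ∈ C(d) can be re-written as a minimization problem as follows:

\[ \sup_{k \in C(d)} \bigl( E^{a}_{s}(k) - E^{y}_{t}(k) \bigr) \;=\; \inf_{\lambda \in \Phi(s,t,a,y)} \sum_{u,v \in S} d(u,v) \cdot \lambda_{u,v}. \]

The formula on the right hand side of the above equality is the trans-shipping formulation, which solves for the minimum cost of shipping the distribution δ(s, a) into δ(t, y), with edge costs d. The result of the lemma follows.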
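To make the trans-shipping formulation concrete for a fixed pair of distributions, the following Python sketch computes the minimum cost of shipping one distribution into another by successive shortest augmenting paths. It is only an illustration under stated assumptions: the function name `transport_cost`, the list-based encoding of distributions over states 0..n-1, and the use of a min-cost-flow routine instead of a general LP solver are choices made here and do not come from the paper. The sketch handles a fixed target distribution only (it does not minimize over y), assumes non-negative costs, and assumes both inputs sum to the same total.

```python
from fractions import Fraction

def transport_cost(p, q, d):
    """Minimum cost of shipping distribution p into distribution q.

    p, q : lists of probabilities over states 0..n-1 (hypothetical encoding).
    d    : n x n matrix of non-negative costs, d[u][v] = cost of shipping u -> v.
    Exact rational arithmetic avoids rounding issues in the comparisons.
    """
    n = len(p)
    supply = [Fraction(x) for x in p]
    demand = [Fraction(x) for x in q]
    cost = [[Fraction(c) for c in row] for row in d]
    assert sum(supply) == sum(demand), "both inputs must carry the same mass"
    flow = [[Fraction(0)] * n for _ in range(n)]
    total = Fraction(0)
    while any(x > 0 for x in supply):
        # Bellman-Ford on the residual graph: nodes 0..n-1 are sources,
        # nodes n..2n-1 are sinks; sources with remaining supply start at 0.
        dist = [Fraction(0) if supply[u] > 0 else None for u in range(n)] + [None] * n
        prev = [None] * (2 * n)
        for _ in range(2 * n):
            changed = False
            for u in range(n):
                for v in range(n):
                    if dist[u] is not None:  # forward arc u -> v
                        cand = dist[u] + cost[u][v]
                        if dist[n + v] is None or cand < dist[n + v]:
                            dist[n + v], prev[n + v], changed = cand, u, True
                    if flow[u][v] > 0 and dist[n + v] is not None:  # backward arc
                        cand = dist[n + v] - cost[u][v]
                        if dist[u] is None or cand < dist[u]:
                            dist[u], prev[u], changed = cand, n + v, True
            if not changed:
                break
        # Cheapest reachable sink that still has unmet demand.
        targets = [v for v in range(n) if demand[v] > 0 and dist[n + v] is not None]
        end = n + min(targets, key=lambda v: dist[n + v])
        # Trace the path back to its starting source and find the bottleneck.
        path, node = [], end
        while prev[node] is not None:
            path.append((prev[node], node))
            node = prev[node]
        amount = min(supply[node], demand[end - n])
        for a, b in path:
            if a >= n:  # backward arc: limited by the flow it cancels
                amount = min(amount, flow[b][a - n])
        for a, b in path:
            if a < n:
                flow[a][b - n] += amount
                total += amount * cost[a][b - n]
            else:
                flow[b][a - n] -= amount
                total -= amount * cost[b][a - n]
        supply[node] -= amount
        demand[end - n] -= amount
    return total
```

For instance, with two states at distance 1 from each other, shipping (1/2, 1/2) into (1/4, 3/4) moves a mass of 1/4 across the unit-cost edge, so `transport_cost([Fraction(1, 2)] * 2, [Fraction(1, 4), Fraction(3, 4)], [[0, 1], [1, 0]])` evaluates to `Fraction(1, 4)`.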
Using the above result we obtain the following LP for OneStep(s, t, d, a) over the variables (a) {λ_{u,v}}_{u,v∈S} and (b) {y_b}_{b∈Γ_1(t)}:

\[
\begin{aligned}
\text{Minimize}\quad & \sum_{u,v \in S} d(u,v) \cdot \lambda_{u,v} \qquad \text{subject to} \\
(1)\ & \text{for all } v \in S:\ \sum_{u \in S} \lambda_{u,v} = \sum_{b \in \Gamma_1(t)} y_b \cdot \delta(t,b)(v); \\
(2)\ & \text{for all } u \in S:\ \sum_{v \in S} \lambda_{u,v} = \delta(s,a)(u); \\
(3)\ & \text{for all } u, v \in S:\ \lambda_{u,v} \ge 0; \\
(4)\ & \text{for all } b \in \Gamma_1(t):\ y_b \ge 0; \\
(5)\ & \sum_{b \in \Gamma_1(t)} y_b = 1.
\end{aligned}
\tag{4.6}
\]

Example 4.2. We now use the MDPs in Figure 3(a) and 3(b) to compute the simulation distance between states using the results in Lemma 1 and Lemma 2. In the figure, states of the same color have a propositional distance of 0 and states of different colors have a propositional distance of 1. In Table 2, we show the simulation metric distance between states of the MDPs in Figure 3(a) and Figure 3(b). Consider states t and t′. The move c is the only move available to player 1 from state t′, and it induces a transition probability of 1/2 + ε to state v′ and 1/2 − ε to state u′. For the pure move c at state t, the induced transition probabilities and edge costs in the trans-shipping formulation are shown in Figure 4(a). It is easy to see that the trans-shipping cost in this case is 1/2 + ε; it is shown in Table 1 along the row corresponding to move c from state t and the column corresponding to state t′. Similarly, the trans-shipping costs for the moves b and f from state t are 1/2 − ε and ε respectively. The metric distance [t ⪯_1 t′], which is the maximum over these trans-shipping costs, is then 1/2 + ε.

Table 1: The moves from states w′ and t′ that minimize the trans-shipping cost for each a ∈ Γ_1(t) and the corresponding costs.

Table 2: The simulation metric distance between states in MDP 1 and states in MDP 2.

Now consider the states t and w′. In Table 1, we show for each pure move a ∈ Γ_1(t) the move x ∈ D_1(w′) that minimizes the trans-shipping cost, together with the minimum cost. In this case it is easy to see that [t ⪯_1 w′] = ε. Given [t ⪯_1 t′] = 1/2 + ε and [t ⪯_1 w′] = ε, we can calculate the distance [s ⪯_1 s′] from the trans-shipping formulation shown in Figure 4(b); the minimum cost is ε, which entails choosing move a from state s′, giving us [s ⪯_1 s′] = ε.

For all states s, t ∈ S, iteration of OneStep(s, t, d) converges to the exact distance. However, in general, there are no known bounds for the rate of convergence. We now present a decision procedure to check whether the exact distance between two states is at most a rational value r. We first show how to express the predicate d(s, t) = OneStep(s, t, d). We observe that since H_{⪯_1} is non-decreasing, we have OneStep(s, t, d) ≥ d(s, t). It follows that the equality d(s, t) = OneStep(s, t, d) holds iff for every a ∈ Γ_1(s), of which there are finitely many, all the linear inequalities of LP (4.6) are satisfied, and d(s, t) = Σ_{u,v∈S} d(u, v) · λ_{u,v} holds. It then follows that d(s, t) = OneStep(s, t, d) can be written as a predicate in the theory of real closed fields. Given a rational r and two states s and t, we present an existential theory of reals formula to decide whether [s ⪯_1 t] ≤ r. Since [s ⪯_1 t] is the least fixed point of H_{⪯_1}, we define a formula Φ(r) that is true iff, in the fixpoint, [s ⪯_1 t] ≤ r, as follows:

\[ \Phi(r) \;=\; \exists d \in M.\ \Bigl[ \bigwedge_{u,v \in S} \bigl( \mathrm{OneStep}(u,v,d) = d(u,v) \bigr) \;\wedge\; d(s,t) \le r \Bigr]. \]

If the formula Φ(r) is true, then there exists a fixpoint d such that d(s, t) is bounded by r, which implies that in the least fixpoint d(s, t) is bounded by r. Conversely, if in the least fixpoint d(s, t) is bounded by r, then the least fixpoint is a witness d for Φ(r) being true. Since the existential theory of reals is decidable in PSPACE [6], we have the following result.
Theorem 4.4. (Decision complexity for exact distance). For all turn-based game structures and MDPs G, given a rational r and two states s and t, whether [s ⪯_1 t] ≤ r can be decided in PSPACE.
Approximation. Given a rational ε > 0, using binary search and O(log((θ_2 − θ_1)/ε)) calls to check the formula Φ(r), we can obtain an interval [l, u] with u − l ≤ ε such that [s ⪯_1 t] lies in the interval [l, u].
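The binary search described above can be sketched as follows. This is a minimal illustration, not the paper's procedure: `decide_at_most` is a hypothetical oracle standing in for the check of Φ(r), and the function name and argument order are assumptions made here.

```python
def approximate_distance(decide_at_most, theta1, theta2, eps):
    """Return (l, u) with u - l <= eps and l <= distance <= u.

    decide_at_most(r) is a hypothetical oracle: it must return True iff the
    exact distance is at most r (the role played by the formula Phi(r)).
    The distance is assumed to lie in [0, theta2 - theta1].
    """
    lo, hi = 0.0, float(theta2 - theta1)
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if decide_at_most(mid):
            hi = mid   # distance <= mid: keep the lower half
        else:
            lo = mid   # distance > mid: keep the upper half
    return lo, hi
```

Each iteration halves the interval, so the number of oracle calls is about log2((θ_2 − θ_1)/ε), in line with the bound stated above.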

Corollary 2. (Approximation for exact distance). For all turn-based game structures and MDPs G, given a rational ε > 0 and two states s and t, an interval [l, u] with u − l ≤ ε such that [s ⪯_1 t] ∈ [l, u] can be computed in PSPACE.

4.2. Algorithms for the kernel. The kernel of the simulation metric ⪯_1 can be computed as the limit of the sequence ⪯^0_1, ⪯^1_1, ⪯^2_1, ..., of relations. For all s, t ∈ S, we have (s, t) ∈ ⪯^0_1 iff s ≡ t. For all n ≥ 0, we have (s, t) ∈ ⪯^{n+1}_1 iff OneStep(s, t, 1_{⪯^n_1}) = 0. Checking the condition OneStep(s, t, 1_{⪯^n_1}) = 0 corresponds to solving an LP feasibility problem for every a ∈ Γ_1(s), as it suffices to replace the minimization goal γ = Σ_{u,v∈S} 1_{⪯^n_1}(u, v) · λ_{u,v} with the constraint γ = 0 in the LP (4.6). We note that this is the same LP feasibility problem that was introduced in [35] as part of an algorithm to decide simulation of probabilistic systems in which each label may lead to one or more distributions over states.
For the bisimulation kernel, we present a more efficient algorithm, which also improves on the algorithms presented in [35]. The idea is to proceed by partition refinement, as usual for bisimulation computations. The refinement step is as follows: given a partition, two states s and t belong to the same class of the refined partition iff every pure move from s induces a probability distribution on equivalence classes that can be matched by mixed moves from t, and vice versa. Precisely, we compute a sequence Q^0, Q^1, Q^2, ..., of partitions. Two states s, t belong to the same class of Q^0 iff they have the same variable valuation (i.e., iff s ≡ t).
For n ≥ 0, since by the definition of the bisimulation metric given in (2.2), [s ≃_g t] = 0 iff [s ⪯_1 t] = 0 and [t ⪯_1 s] = 0, two states s, t in a given class of Q^n remain in the same class in Q^{n+1} iff both (s, t) and (t, s) satisfy the set of feasibility LP problems OneStepBis(s, t, Q^n) as given below. OneStepBis(s, t, Q) consists of one feasibility LP problem for each a ∈ Γ(s). The problem for a ∈ Γ(s) has the set of variables {x_b | b ∈ Γ(t)} and the set of constraints:

(1) for all b ∈ Γ(t): x_b ≥ 0;
(2) Σ_{b∈Γ(t)} x_b = 1;
(3) for all V ∈ Q: Σ_{b∈Γ(t)} Σ_{u∈V} x_b · δ(t, b)(u) ≥ Σ_{u∈V} δ(s, a)(u).

In the following theorem we show that two states s, t ∈ S are n + 1 step bisimilar iff OneStepBis(s, t, Q^n) and OneStepBis(t, s, Q^n) are feasible.
Theorem 4.5. For all turn-based game structures and MDPs G, for all n ≥ 0, given two states s, t ∈ S and an n-step bisimulation partition of states Q^n, we have [s ≃_g t]^{n+1} = 0 iff both OneStepBis(s, t, Q^n) and OneStepBis(t, s, Q^n) are feasible.

Proof. We proceed by induction on n. Assume the result holds for all iteration steps up to n and consider the case for n + 1. In one direction, if [s ≃_g t]^{n+1} = 0, then [s ⪯_1 t]^{n+1} = [t ⪯_1 s]^{n+1} = 0 by the definition of the bisimulation metric. We need to show that, given [s ⪯_1 t]^{n+1} = 0, OneStepBis(s, t, Q^n) is feasible; the argument for OneStepBis(t, s, Q^n) is symmetric. From the definition of the n + 1 step simulation distance, given p(s, t) = 0 by our induction hypothesis, we have

\[ \sup_{a \in \Gamma_1(s)} \inf_{x \in D_1(t)} \sup_{k \in C(d^n)} \bigl( E^{a}_{s}(k) - E^{x}_{t}(k) \bigr) = 0. \tag{4.7} \]

Consider a player 1 move a ∈ Γ_1(s). Since we can interchange the order of the inf and sup by the generalized minimax theorem in inf_{x∈D_1(t)} sup_{k∈C(d^n)} (E^a_s(k) − E^x_t(k)), the optimal values of x ∈ D_1(t) and k ∈ C(d^n) exist and only depend on a. Let x_a and k_a be the optimal values of x and k that realize the inf and sup in inf_{x∈D_1(t)} sup_{k∈C(d^n)} (E^a_s(k) − E^x_t(k)). Using x_a and k_a in (4.7) we have:

\begin{align}
 & \sum_{u \in S} \delta(s,a)(u) \cdot k_a(u) \;-\; \sum_{u \in S} \delta(t,x_a)(u) \cdot k_a(u) = 0 \tag{4.8} \\
 & \sum_{V \in Q^n} \Bigl( \sum_{u \in V} \delta(s,a)(u) - \sum_{u \in V} \delta(t,x_a)(u) \Bigr) \cdot k_a(V) = 0 \tag{4.9} \\
 & \forall V \in Q^n.\ \sum_{u \in V} \delta(t,x_a)(u) \;\ge\; \sum_{u \in V} \delta(s,a)(u), \tag{4.10}
\end{align}

where (4.9) follows from (4.8) by noting that for all V ∈ Q^n, for all states u, v ∈ V, we have k_a(u) = k_a(v); we write k_a(V) for this common value. To show that (4.10) follows from (4.9), assume towards a contradiction that there exists a V′ ∈ Q^n such that Σ_{u∈V′} δ(t, x_a)(u) < Σ_{u∈V′} δ(s, a)(u). Then there must be a V′′ ∈ Q^n such that Σ_{u∈V′′} δ(t, x_a)(u) > Σ_{u∈V′′} δ(s, a)(u), since δ(t, x_a) is a probability distribution and the sum of the probability mass allocated to each equivalence class should be 1. Further, for all V ∈ Q^n, for all u, v ∈ V, we have d^n(u, v) = d^n(v, u) = 0, and for all u ∈ V and for all w ∈ S \ V, we have d^n(u, w) = d^n(w, u) = 1. Therefore, we can pick a feasible k′ ∈ C(d^n) that takes the same positive value on all states v ∈ V′ and satisfies k′(v) = 0 for all other states. Using k′ we get E^a_s(k′) − E^{x_a}_t(k′) > 0, which means k_a is not optimal, contradicting (4.7).

In the other direction, assume that OneStepBis(s, t, Q^n) is feasible. We need to show that [s ⪯_1 t]^{n+1} = 0. Since OneStepBis(s, t, Q^n) is feasible, there exists a distribution x_a ∈ D_1(t) for all a ∈ Γ_1(s) such that ∀V ∈ Q^n. (Σ_{u∈V} δ(t, x_a)(u) ≥ Σ_{v∈V} δ(s, a)(v)). By our induction hypothesis, this implies that for all k ∈ C(d^n), we have (E^a_s(k) − E^{x_a}_t(k)) ≤ 0, and hence [s ⪯_1 t]^{n+1} = 0.

For the case when m is constant, the previous best known algorithm worked in O(n^9 · log(n)) time, whereas our algorithm works in time O(n^4).
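The partition-refinement scheme for the bisimulation kernel can be illustrated on the special case in which every state has a single move (a Markov chain). In that case the feasibility problems OneStepBis degenerate: since both distributions sum to 1, the class-wise inequalities hold in both directions iff the two states assign exactly the same mass to every class. The sketch below covers only this special case; the function name `bisimulation_partition` and the dictionary encoding are assumptions made for illustration, and the general MDP case would replace the signature comparison by the LP feasibility test.

```python
from fractions import Fraction

def bisimulation_partition(states, label, delta):
    """Partition refinement for the single-move (Markov chain) special case.

    states : list of state names (strings)
    label  : dict state -> variable valuation (any hashable value)
    delta  : dict state -> dict successor -> probability (Fractions advised)
    Returns the list of bisimulation classes, each as a set of states.
    """
    block = {s: label[s] for s in states}      # Q^0: same valuation
    while True:
        ids, new_block = {}, {}
        for s in states:
            mass = {}                          # probability mass per class
            for u, pr in delta[s].items():
                if pr == 0:
                    continue
                mass[block[u]] = mass.get(block[u], 0) + pr
            # Keeping block[s] in the signature makes each step a refinement.
            signature = (block[s], frozenset(mass.items()))
            new_block[s] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            break                              # no class was split: stable
        block = new_block
    classes = {}
    for s in states:
        classes.setdefault(new_block[s], set()).add(s)
    return sorted(classes.values(), key=lambda c: sorted(c))
```

Because every step only splits classes, the loop stops after at most |S| rounds in this special case.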

Algorithms for Concurrent Games
In this section we first show that the computation of the metric distance is at least as hard as the computation of optimal values in concurrent reachability games. The exact complexity of the latter is open, but it is known to be at least as hard as the square-root sum problem, which is in PSPACE but whose inclusion in NP is a long-standing open problem [16,18]. Next, we present algorithms based on a decision procedure for the theory of real closed fields, for both checking the bounds of the exact distance and the kernel of the metrics. Our reduction to the theory of real closed fields removes one quantifier alternation when compared to the previous known formula (inferred from [11]). This improves the complexity of the algorithm.

5.1. Reduction of reachability games to metrics. We will use the following terms in the result. A proposition is a boolean observation variable, and we say a state s is labeled by a proposition q iff q is true at s. A state t is absorbing in a concurrent game if both players have only one action available at t, and the next state of t is always t (it is a state with a self-loop). For a proposition q, let ◇q denote the set of paths that visit a state labeled by q at least once. In concurrent reachability games, the objective is ◇q, for a proposition q, and without loss of generality all states labeled by q are absorbing states.
Theorem 4. Consider a concurrent game structure G, with a single proposition q, such that all states labeled by q are absorbing states. We can construct in linear time a concurrent game structure G′, with one additional state t′, such that for all s ∈ S, the distance [s ⪯_1 t′] equals the value at s of the concurrent reachability game with objective ◇q.

Proof. The concurrent game structure G′ is obtained from G by adding an absorbing state t′. The states that are not labeled by q, and the additional state t′, are labeled by its complement ¬q. Observe that there is only one proposition sequence from t′, and it is (¬q)^ω. To prove the desired claim we show that for all s ∈ S the distance [s ⪯_1 t′] coincides with the value of the reachability objective ◇q at s. From a state s in G the possible proposition sequences can be expressed as the following ω-regular expression: (¬q)^ω ∪ (¬q)^* · q^ω. Since the proposition sequence from t′ is (¬q)^ω, the supremum of the difference in values over qµ formulas at s and t′ is obtained by satisfying the set of paths formalized as (¬q)^* · q^ω at s. The set of paths defined as (¬q)^* · q^ω is the same as reaching q in any number of steps, since all states labeled by q are absorbing. Hence, this supremum equals the value at s of the qµ formula that expresses reaching q. It follows from the results of [10] that for all s ∈ S this value is the value of the concurrent reachability game at s. From the above equalities and the logical characterization result (2.3) we obtain the desired result.

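5.2. Algorithms for the metrics. We first prove a lemma that helps to obtain reduced-complexity algorithms for concurrent games. The lemma states that the distance [s ⪯_1 t] is attained by restricting player 2 to pure moves at state t, for all states s, t ∈ S.

Lemma 3. For all concurrent game structures G and all metrics d ∈ M, we have

\[
\sup_{k \in C(d)} \sup_{x_1 \in D_1(s)} \inf_{y_1 \in D_1(t)} \sup_{y_2 \in D_2(t)} \inf_{x_2 \in D_2(s)} \bigl( E^{x_1,x_2}_{s}(k) - E^{y_1,y_2}_{t}(k) \bigr)
\;=\;
\sup_{k \in C(d)} \sup_{x_1 \in D_1(s)} \inf_{y_1 \in D_1(t)} \sup_{b \in \Gamma_2(t)} \inf_{x_2 \in D_2(s)} \bigl( E^{x_1,x_2}_{s}(k) - E^{y_1,b}_{t}(k) \bigr). \tag{5.1}
\]

Proof. To prove our claim we fix k ∈ C(d), and player 1 mixed moves x ∈ D_1(s) and y ∈ D_1(t). We then have

\begin{align}
\sup_{y_2 \in D_2(t)} \inf_{x_2 \in D_2(s)} \bigl( E^{x,x_2}_{s}(k) - E^{y,y_2}_{t}(k) \bigr)
 &= \inf_{x_2 \in D_2(s)} E^{x,x_2}_{s}(k) \;-\; \inf_{y_2 \in D_2(t)} E^{y,y_2}_{t}(k) \tag{5.2} \\
 &= \inf_{a \in \Gamma_2(s)} E^{x,a}_{s}(k) \;-\; \inf_{b \in \Gamma_2(t)} E^{y,b}_{t}(k), \tag{5.3}
\end{align}

where (5.3) follows from (5.2) since the decomposition on the right hand side of (5.2) yields two independent linear optimization problems; the optimal values are attained at a vertex of the convex hulls of the distributions induced by pure player 2 moves at the two states. This easily leads to the result.

We now present algorithms for metrics in concurrent games. Due to the reduction from concurrent reachability games, shown in Theorem 4, it is unlikely that we have an algorithm in NP for the metric distance between states. We therefore construct statements in the theory of real closed fields, firstly to decide whether [s ⪯_1 t] ≤ r, for a rational r, so that we can approximate the metric distance between states s and t, and secondly to decide if [s ⪯_1 t] = 0 in order to compute the kernel of the game simulation and bisimulation metrics.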
The statements improve on the complexity that can be achieved by a direct translation of the statements of [11] to the theory of real closed fields. The complexity reduction is based on the observation that, using Lemma 3, we can replace a sup operator with a finite conjunction, and therefore reduce the quantifier complexity of the resulting formula. Fix a game structure G and states s and t of G. We proceed to construct a statement in the theory of reals that can be used to decide if [s ⪯_1 t] ≤ r, for a given rational r.
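In the following, we use variables x_1, y_1 and x_2 to denote the sets of variables {x_1(a) | a ∈ Γ_1(s)}, {y_1(a) | a ∈ Γ_1(t)} and {x_2(b) | b ∈ Γ_2(s)} respectively, k to denote the set of variables {k(u) | u ∈ S}, and d to denote the set of variables {d(u, v) | u, v ∈ S}. First, notice that we can write formulas that state that a variable x is a mixed move for a player at state s, and that k is a constructible predicate (i.e., k ∈ C(d)):

\[
\begin{aligned}
\mathrm{IsDist}(x, \Gamma_1(s)) \;&\equiv\; \bigwedge_{a \in \Gamma_1(s)} \bigl( x(a) \ge 0 \bigr) \;\wedge\; \sum_{a \in \Gamma_1(s)} x(a) = 1, \\
\mathrm{kBounded}(k, d) \;&\equiv\; \bigwedge_{u \in S} \bigl( \theta_1 \le k(u) \le \theta_2 \bigr) \;\wedge\; \bigwedge_{u,v \in S} \bigl( k(u) - k(v) \le d(u,v) \bigr).
\end{aligned}
\]

In the following, we write bounded quantifiers of the form "∃x_1 ∈ D_1(s)" or "∀k ∈ C(d)", which mean respectively ∃x_1. IsDist(x_1, Γ_1(s)) ∧ ··· and ∀k. kBounded(k, d) → ···.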
Let η(k, x_1, x_2, y_1, b) be the polynomial E^{x_1,x_2}_s(k) − E^{y_1,b}_t(k). Notice that η is a polynomial of degree 3. We write a = max{a_1, ..., a_l}, for variables a, a_1, ..., a_l, for the formula

\[ \Bigl( a = a_1 \wedge \bigwedge_{i=1}^{l} a_1 \ge a_i \Bigr) \vee \cdots \vee \Bigl( a = a_l \wedge \bigwedge_{i=1}^{l} a_l \ge a_i \Bigr). \]
We construct the formula for game simulation in stages. First, we construct a formula Φ_1(d, s, t, k, x_1, α) with free variables d, k, x_1, α such that Φ_1(d, s, t, k, x_1, α) holds for a valuation to the variables iff

\[ \alpha \;=\; \inf_{y_1 \in D_1(t)} \sup_{b \in \Gamma_2(t)} \inf_{x_2 \in D_2(s)} \eta(k, x_1, x_2, y_1, b). \]

The formula Φ_1(d, s, t, k, x_1, α) is given by: [...] (5.5)

The above sentence is true iff in the least fixpoint, d(s, t) is bounded by r. As in the case of turn-based games and MDPs, given a rational ε > 0, using binary search and O(log((θ_2 − θ_1)/ε)) calls to a decision procedure to check the sentence (5.5), we can compute an interval [l, u] with u − l ≤ ε, such that [s ⪯_1 t] ∈ [l, u].

Complexity. Note that Φ is of the form ∀∃∀, because Φ_1 is of the form ∀∃, and appears in negative position in Φ. The formula Φ has (|S| + |Γ_1(s)| + 3) universally quantified variables, followed by (|S| + |Γ_1(s)| + 3 + 2(|Γ_1(t)| + |Γ_2(s)| · |Γ_2(t)| + |Γ_2(t)| + 2)) existentially quantified [...]

In the applications we have discussed, non-determinism is modeled probabilistically. In applications where non-determinism needs to be interpreted demonically, rather than probabilistically, MDPs or turn-based games would be the appropriate framework for analysis. If the interaction between various sources of non-determinism needs to be modeled simultaneously, then concurrent games would be the appropriate framework for analysis. For the analysis of these general models, our results and algorithms will be useful.

Open Problems. While we have shown polynomial time algorithms for the kernel of the simulation and bisimulation metrics for MDPs and turn-based games, the existence of a polynomial time algorithm for the kernel of both the simulation and bisimulation metrics for concurrent games is an open problem. The existence of a polynomial time algorithm to approximate the exact metric distance in the case of turn-based games and MDPs is an open problem. The existence of a PSPACE algorithm for the decision problem of the exact metric distance in concurrent games is an open problem.

Figure 2: An example illustrating the proof of Lemma 1.

Figure 3: An example used to compute the simulation metric between states. States of the same color have a propositional distance of 0.

Figure 4: The trans-shipping formulation that gives the metric distances between states.

Theorem 4.3. For all turn-based game structures and MDPs G, given d ∈ M, for all states s, t ∈ S, we can compute H_{⪯_1}(d)(s, t) in polynomial time by the Linear Program (4.6).
which associates with each variable v ∈ V a valuation [v].
• A finite set Moves of moves.
• Two move assignments Γ_1, Γ_2 : S → 2^Moves \ {∅}. For i ∈ {1, 2}, the assignment Γ_i associates with each state s ∈ S the nonempty set Γ_i(s) ⊆ Moves of moves available to player i at state s.
• A probabilistic transition function δ : S × Moves × Moves → Dist(S), that gives the probability δ(s, a_1, a_2)(t) of a transition from s to t when player 1 plays move a_1 and player 2 plays move a_2.

At every state s ∈ S, player 1 chooses a move a_1 ∈ Γ_1(s), and simultaneously and independently player 2 chooses a move a_2 ∈ Γ_2(s). The game then proceeds to the successor state t ∈ S with probability δ(s, a_1, a_2)(t). We let Dest(s, a_1, a_2) = {t ∈ S | δ(s, a_1, a_2)(t) > 0}.

[...] Moves, Γ_1, Γ_2, δ. We indicate the opponent of a player i ∈ {1, 2} by ∼i = 3 − i. We consider the following subclasses of game structures.

Turn-based game structures. A game structure G is turn-based if we can write S = S_1 ∪ S_2 with S_1 ∩ S_2 = ∅, where s ∈ S_1 implies |Γ_2(s)| = 1, and s ∈ S_2 implies |Γ_1(s)| = 1, and further, there exists a special variable turn ∈ V, such that [turn](s) = θ_1 iff s ∈ S_1, and [turn](s) = θ_2 iff s ∈ S_2.

Markov decision processes. For i ∈ {1, 2}, we say that a structure is an i-MDP if ∀s ∈ S, |Γ_∼i(s)| = 1. For MDPs, we omit the (single) move of the player without a choice of moves, and write δ(s, a) for the transition function.

The propositional distance p(s, t) between two states s, t ∈ S is the maximum difference in the valuation of any variable: p(s, t) = max_{v∈V} |[v](s) − [v](t)|.
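The propositional distance p(s, t) defined above can be computed directly from the variable valuations. The following is a minimal sketch; the nested-dictionary encoding of the valuation and the function name are assumptions made for illustration.

```python
def propositional_distance(valuation, s, t):
    """p(s, t) = max over variables v of |[v](s) - [v](t)|.

    valuation : dict variable -> dict state -> numeric value
                (a hypothetical encoding of the interpretation [.])
    """
    return max(abs(vals[s] - vals[t]) for vals in valuation.values())
```

For example, if the variable turn takes value 0 at s and 1 at t, then p(s, t) = 1 regardless of how close the remaining variables are, which is why states owned by different players are always at the maximal distance θ_2 − θ_1.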
In a similar fashion, if OneStepBis(t, s, Q^n) is feasible then [t ⪯_1 s]^{n+1} = 0, which leads to [s ≃_g t]^{n+1} = 0 by the definition of the bisimulation metric, as required.

Complexity. The number of partition refinement steps required for the computation of both the simulation and the bisimulation kernel is bounded by O(|S|^2) for turn-based games and MDPs, where S is the set of states. At every refinement step, at most O(|S|^2) state pairs are considered, and for each state pair (s, t) at most |Γ(s)| LP feasibility problems need to be solved. Let us denote by LPF(n, m) the complexity of solving the feasibility of m linear inequalities over n variables. We obtain the following result. [...]

The previous algorithm for the bisimulation kernel checked two-way simulation and hence has the complexity O(n^4 · m · (n^2 + m)^{2.5} · log(n^2 + m)), whereas our algorithm works in time O(n^4 · m · m^{2.5} · log(m)), where n = |S| and m is the maximum number of moves available at a state. For most practical purposes, the number of moves at a state is constant (i.e., m is constant).