Permissive Controller Synthesis for Probabilistic Systems

We propose novel controller synthesis techniques for probabilistic systems modelled using stochastic two-player games: one player acts as a controller, the second represents its environment, and probability is used to capture uncertainty arising due to, for example, unreliable sensors or faulty system components. Our aim is to generate robust controllers that are resilient to unexpected system changes at runtime, and flexible enough to be adapted if additional constraints need to be imposed. We develop a permissive controller synthesis framework, which generates multi-strategies for the controller, offering a choice of control actions to take at each time step. We formalise the notion of permissivity using penalties, which are incurred each time a possible control action is disallowed by a multi-strategy. Permissive controller synthesis aims to generate a multi-strategy that minimises these penalties, whilst guaranteeing the satisfaction of a specified system property. We establish several key results about the optimality of multi-strategies and the complexity of synthesising them. Then, we develop methods to perform permissive controller synthesis using mixed integer linear programming and illustrate their effectiveness on a selection of case studies.


Introduction
Probabilistic model checking is used to automatically verify systems with stochastic behaviour. Systems are modelled as, for example, Markov chains, Markov decision processes, or stochastic games, and analysed algorithmically to verify quantitative properties specified in temporal logic. Applications include checking the safe operation of fault-prone systems ("the brakes fail to deploy with probability at most $10^{-6}$") and establishing guarantees on the performance of, for example, randomised communication protocols ("the expected time to establish connectivity between two devices never exceeds 1.5 seconds").
A closely related problem is that of controller synthesis. This entails constructing a model of some entity that can be controlled (e.g., a robot, a vehicle or a machine) and its environment, formally specifying the desired behaviour of the system, and then generating, through an analysis of the model, a controller that will guarantee the required behaviour. In many applications of controller synthesis, a model of the system is inherently probabilistic. For example, a robot's sensors and actuators may be unreliable, resulting in uncertainty when detecting and responding to its current state; or messages sent wirelessly to a vehicle may fail to be delivered with some probability.
In such cases, the same techniques that underlie probabilistic model checking can be used for controller synthesis. For example, we can model the system as a Markov decision process (MDP), specify a property $\phi$ in a probabilistic temporal logic such as PCTL or LTL, and then apply probabilistic model checking. This yields an optimal strategy (policy) for the MDP, which instructs the controller as to which action should be taken in each state of the model in order to guarantee that $\phi$ will be satisfied. This approach has been successfully applied in a variety of application domains, to synthesise, for example: control strategies for robots [22], power management strategies for hardware [16], and efficient PIN guessing attacks against hardware security modules [29].
Another important dimension of the controller synthesis problem is the presence of uncontrollable or adversarial aspects of the environment. We can take account of this by phrasing the system model as a game between two players, one representing the controller and the other the environment. Examples of this approach include controller synthesis for surveillance cameras [24], autonomous vehicles [11] or real-time systems [1]. In our setting, we use (turn-based) stochastic two-player games, which can be seen as a generalisation of MDPs where decisions are made by two distinct players. Probabilistic model checking of such a game yields a strategy for the controller player which guarantees satisfaction of a property $\phi$, regardless of the actions of the environment player.
In this paper, we tackle the problem of synthesising robust and flexible controllers, which are resilient to unexpected changes in the system at runtime. For example, one or more of the actions that the controller can choose at runtime might unexpectedly become unavailable, or additional constraints may be imposed on the system that make some actions preferable to others. One motivation for our work is its applicability to model-driven runtime control of adaptive systems [5], which uses probabilistic model checking in an online fashion to adapt or reconfigure a system at runtime in order to guarantee the satisfaction of certain formally specified performance or reliability requirements.
We develop novel, permissive controller synthesis techniques for systems modelled as stochastic two-player games. Rather than generating strategies, which specify a single action to take at each time-step, we synthesise multi-strategies, which specify multiple possible actions. As in classical controller synthesis, generation of a multi-strategy is driven by a formally specified quantitative property: we focus on probabilistic reachability and expected total reward properties. The property must be guaranteed to hold, whichever of the specified actions are taken and regardless of the behaviour of the environment. Simultaneously, we aim to synthesise multi-strategies that are as permissive as possible, which we quantify by assigning penalties to actions. These are incurred when a multi-strategy disallows (does not make available) a given action. Actions can be assigned different penalty values to indicate the relative importance of allowing them. Permissive controller synthesis amounts to finding a multi-strategy whose total incurred penalty is minimal, or below some given threshold.
We formalise the permissive controller synthesis problem and then establish several key theoretical results. In particular, we show that randomised multi-strategies are strictly more powerful than deterministic ones, and we prove that the permissive controller synthesis problem is NP-hard for either class. We also establish upper bounds, showing that the problem is in NP and PSPACE for the deterministic and randomised cases, respectively.
Next, we propose practical methods for synthesising multi-strategies using mixed integer linear programming (MILP) [27]. We give an exact encoding for deterministic multi-strategies and an approximation scheme (with adaptable precision) for the randomised case. For the latter, we prove several additional results that allow us to reduce the search space of multi-strategies. The MILP solution process works incrementally, yielding increasingly permissive multi-strategies, and can thus be terminated early if required. This is well suited to scenarios where time is limited, such as online analysis for runtime control, as discussed above, or "anytime verification" [28]. Finally, we implement our techniques and evaluate their effectiveness on a range of case studies.
This paper is an extended version of [13], containing complete proofs, optimisations for MILP encodings and experiments comparing performance under two different MILP solvers.
1.1. Related Work. Permissive strategies in non-stochastic games were first studied in [2] for parity objectives, but permissivity was defined solely by comparing enabled actions. Bouyer et al. [3] showed that optimally permissive memoryless strategies exist for reachability objectives and expected penalties, contrasting with our (stochastic) setting, where they may not. The work in [3] also studies penalties given as mean-payoff and discounted reward functions, and [4] extends the results to the setting of parity games. None of [2,3,4] consider stochastic games or even randomised strategies, and they provide purely theoretical results. As in our work, Kumar and Garg [20] consider control of stochastic systems by dynamically disabling events; however, rather than stochastic games, their models are essentially Markov chains, which the possibility of selectively disabling branches turns into MDPs. The work in [26] studies games where the aim of one player is to ensure properties of systems against an opponent who can modify the system on-the-fly by removing some transitions.
Finally, although tackling a rather different problem (counterexample generation), [31] is related in that it also uses MILP to solve probabilistic verification problems.

Preliminaries
We denote by $\mathit{Dist}(X)$ the set of discrete probability distributions over a set $X$. A Dirac distribution is one that assigns probability 1 to some $s \in X$. The support of a distribution $d \in \mathit{Dist}(X)$ is $\mathit{supp}(d) = \{x \in X \mid d(x) > 0\}$.
2.1. Stochastic Games. In this paper, we use turn-based stochastic two-player games, which we often refer to simply as stochastic games. A stochastic game takes the form $G = \langle S_\Diamond, S_\Box, \bar{s}, A, \delta \rangle$, where $S = S_\Diamond \cup S_\Box$ is a finite set of states, each associated with player $\Diamond$ or $\Box$, $\bar{s} \in S$ is an initial state, $A$ is a finite set of actions and $\delta : S \times A \to \mathit{Dist}(S)$ is a (partial) probabilistic transition function such that the distributions assigned by $\delta$ only select elements of $S$ with rational probabilities.
An MDP is a stochastic game in which either $S_\Diamond$ or $S_\Box$ is empty. Each state $s$ of a stochastic game $G$ has a set of enabled actions, given by $A(s) = \{a \in A \mid \delta(s,a) \text{ is defined}\}$. The unique player $\bullet \in \{\Diamond, \Box\}$ such that $s \in S_\bullet$ picks an action $a \in A(s)$ to be taken in state $s$. Then, the next state is determined randomly according to the distribution $\delta(s,a)$, i.e., a transition to state $s'$ occurs with probability $\delta(s,a)(s')$.
A path through $G$ is a (finite or infinite) sequence $\omega = s_0 a_0 s_1 a_1 \ldots$, where $s_i \in S$, $a_i \in A(s_i)$ and $\delta(s_i,a_i)(s_{i+1}) > 0$ for all $i$. We denote by $\mathit{IPath}_s$ the set of all infinite paths starting in state $s$. For a player $\bullet \in \{\Diamond, \Box\}$, we denote by $\mathit{FPath}_\bullet$ the set of all finite paths starting in any state and ending in a state from $S_\bullet$.
A strategy $\sigma : \mathit{FPath}_\bullet \to \mathit{Dist}(A)$ for player $\bullet$ of $G$ is a resolution of the choices of actions in each state from $S_\bullet$ based on the execution so far, such that only enabled actions in a state are chosen with non-zero probability. In standard fashion [19], a pair of strategies $\sigma$ and $\pi$ for $\Diamond$ and $\Box$ induces, for any state $s$, a probability measure $\Pr^{\sigma,\pi}_{G,s}$ over $\mathit{IPath}_s$. A strategy $\sigma$ is deterministic if $\sigma(\omega)$ is a Dirac distribution for all $\omega$, and randomised if not. In this work, we focus purely on memoryless strategies, where $\sigma(\omega)$ depends only on the last state of $\omega$, in which case we define the strategy as a function $\sigma : S_\bullet \to \mathit{Dist}(A)$. We write $\Sigma^\bullet_G$ for the set of all (memoryless) player $\bullet$ strategies in $G$.

Properties and Rewards.
In order to synthesise controllers, we need a formal description of their required properties. In this paper, we use two common classes of properties: probabilistic reachability and expected total reward, which we will express in an extended version of the temporal logic PCTL [18].
For probabilistic reachability, we write properties of the form $\phi = \mathsf{P}_{\bowtie p}[\,\mathsf{F}\,g\,]$, where $\bowtie\, \in \{\le, \ge\}$, $p \in [0,1]$ and $g \subseteq S$ is a set of target states, meaning that the probability of reaching a state in $g$ satisfies the bound $\bowtie p$. Formally, for a specific pair of strategies $\sigma \in \Sigma^\Diamond_G$, $\pi \in \Sigma^\Box_G$ for $G$, the probability of reaching $g$ under $\sigma$ and $\pi$ is
$$\Pr^{\sigma,\pi}_{G,\bar{s}}(\mathsf{F}\,g) = \Pr^{\sigma,\pi}_{G,\bar{s}}\big(\{ s_0 a_0 s_1 a_1 \ldots \in \mathit{IPath}_{\bar{s}} \mid s_i \in g \text{ for some } i \}\big).$$
We say that $\phi$ is satisfied under $\sigma$ and $\pi$, denoted $G,\sigma,\pi \models \phi$, if $\Pr^{\sigma,\pi}_{G,\bar{s}}(\mathsf{F}\,g) \bowtie p$.
We also augment stochastic games with reward structures, which are functions of the form $r : S \times A \to \mathbb{Q}_{\ge 0}$ mapping state-action pairs to non-negative rationals. In practice, we often use these to represent "costs" (e.g. elapsed time or energy consumption), despite the terminology "rewards". The restriction to non-negative rewards allows us to avoid problems with non-uniqueness of total rewards, which would require special treatment [30].
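Rewards are accumulated along a path and, for strategies $\sigma \in \Sigma^\Diamond_G$ and $\pi \in \Sigma^\Box_G$, the expected total reward is defined as:
$$E^{\sigma,\pi}_{G,\bar{s}}(r) = \int_{\omega = s_0 a_0 s_1 a_1 \ldots \in \mathit{IPath}_{\bar{s}}} \sum_{j=0}^{\infty} r(s_j,a_j) \; d\Pr^{\sigma,\pi}_{G,\bar{s}}.$$
For technical reasons, we will always assume the maximum possible reward $\sup_{\sigma,\pi} E^{\sigma,\pi}_{G,s}(r)$ is finite (which can be checked with an analysis of the game's underlying graph); similar assumptions are commonly introduced [25, Section 7]. In our proofs, we will also use $E^{\sigma,\pi}_{G,\bar{s}}(r{\downarrow}s)$ for the expected total reward accumulated before the first visit to $s$, defined by:
$$E^{\sigma,\pi}_{G,\bar{s}}(r{\downarrow}s) = \int_{\omega = s_0 a_0 s_1 a_1 \ldots \in \mathit{IPath}_{\bar{s}}} \; \sum_{j < \min\{i \mid s_i = s\}} r(s_j,a_j) \; d\Pr^{\sigma,\pi}_{G,\bar{s}},$$
where the minimum is taken to be $\infty$ if $\omega$ never visits $s$.
An expected reward property is written $\phi = \mathsf{R}^{r}_{\bowtie b}[\,\mathsf{C}\,]$ (where $\mathsf{C}$ stands for cumulative), meaning that the expected total reward for $r$ satisfies $\bowtie b$. We say that $\phi$ is satisfied under strategies $\sigma$ and $\pi$, denoted $G,\sigma,\pi \models \phi$, if $E^{\sigma,\pi}_{G,\bar{s}}(r) \bowtie b$. In fact, probabilistic reachability can easily be reduced to expected total reward (by replacing any outgoing transitions from states in the target set with a single transition to a sink state labelled with a reward of 1). Thus, in the techniques presented in this paper, we focus purely on expected total reward.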
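The reduction from probabilistic reachability to expected total reward mentioned above can be illustrated with a small sketch. The dictionary-based model representation, the function names and the three-state model below are assumptions made purely for illustration; the evaluation simply iterates $x = r + Px$ for a fixed pair of memoryless deterministic choices and is not taken from the paper.

```python
def reach_to_reward(delta, targets, sink="sink"):
    """Replace the outgoing transitions of every target state by a single
    reward-1 move to an absorbing, zero-reward sink state."""
    new_delta, reward = {}, {}
    for s, actions in delta.items():
        if s in targets:
            new_delta[s] = {"to_sink": {sink: 1.0}}
            reward[(s, "to_sink")] = 1.0
        else:
            new_delta[s] = {a: dict(dist) for a, dist in actions.items()}
    new_delta[sink] = {"loop": {sink: 1.0}}
    return new_delta, reward


def expected_total_reward(delta, reward, choice, iterations=100):
    """Approximate the expected total reward of the Markov chain induced by
    fixed memoryless deterministic choices, iterating x = r + P x from 0."""
    x = {s: 0.0 for s in delta}
    for _ in range(iterations):
        x = {s: reward.get((s, choice[s]), 0.0)
                + sum(p * x[t] for t, p in delta[s][choice[s]].items())
             for s in delta}
    return x


# Hypothetical model: from s0, the only action reaches goal or fail
# with probability 0.5 each; both are absorbing.
delta = {
    "s0": {"go": {"goal": 0.5, "fail": 0.5}},
    "goal": {"stay": {"goal": 1.0}},
    "fail": {"stay": {"fail": 1.0}},
}
reduced, reward = reach_to_reward(delta, targets={"goal"})
# Every state has exactly one action here, so the choice is forced.
choice = {s: next(iter(actions)) for s, actions in reduced.items()}
values = expected_total_reward(reduced, reward, choice)
print(values["s0"])  # expected total reward from s0, equal to Pr(F goal)
```

In this sketch the reward of 1 is collected exactly once, on the step into the sink, so the expected total reward coincides with the probability of reaching the target set.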

Controller Synthesis.
To perform controller synthesis, we model the system as a stochastic game $G = \langle S_\Diamond, S_\Box, \bar{s}, A, \delta \rangle$, where player $\Diamond$ represents the controller and player $\Box$ represents the environment. A specification of the required behaviour of the system is a property $\phi$, either a probabilistic reachability property $\mathsf{P}_{\bowtie p}[\,\mathsf{F}\,g\,]$ or an expected total reward property $\mathsf{R}^{r}_{\bowtie b}[\,\mathsf{C}\,]$.
Definition 2.1 (Sound strategy). A strategy $\sigma \in \Sigma^\Diamond_G$ for player $\Diamond$ in stochastic game $G$ is sound for a property $\phi$ if $G,\sigma,\pi \models \phi$ for any strategy $\pi \in \Sigma^\Box_G$.
To simplify notation, we will consistently use $\sigma$ and $\pi$ to refer to strategies of player $\Diamond$ and $\Box$, respectively, and will not always state explicitly that $\sigma \in \Sigma^\Diamond_G$ and $\pi \in \Sigma^\Box_G$. Notice that, in Defn. 2.1, strategies $\sigma$ and $\pi$ are both memoryless. We could equivalently allow $\pi$ to range over history-dependent strategies since, for the properties $\phi$ considered in this paper (probabilistic reachability and expected total reward), the existence of a history-dependent counter-strategy $\pi$ for which $G,\sigma,\pi \not\models \phi$ implies the existence of a memoryless one.
The classical controller synthesis problem asks whether there exists a sound strategy for game $G$ and property $\phi$. We can determine whether this is the case by computing the optimal strategy for player $\Diamond$ in $G$ [12,15]. This problem is known to be in NP ∩ co-NP, but, in practice, methods such as value or policy iteration can be used efficiently.
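As a rough illustration of the value iteration mentioned above, the following sketch computes approximate values for a turn-based stochastic game in which the controller minimises an expected total cost and the environment maximises it. The game, its numbers and all function names are hypothetical, and the loop runs for a fixed number of iterations rather than using a proper convergence criterion.

```python
def game_values(delta, reward, controller_states, iterations=1000):
    """Value iteration from 0: controller states take the minimum over
    their actions, environment states the maximum."""
    x = {s: 0.0 for s in delta}
    for _ in range(iterations):
        new_x = {}
        for s, actions in delta.items():
            outcomes = [reward.get((s, a), 0.0)
                        + sum(p * x[t] for t, p in dist.items())
                        for a, dist in actions.items()]
            new_x[s] = min(outcomes) if s in controller_states else max(outcomes)
        x = new_x
    return x


def controller_strategy(delta, reward, controller_states, x):
    """Pick, in each controller state, an action minimising the one-step
    look-ahead value with respect to the computed values x."""
    def outcome(s, a):
        return reward.get((s, a), 0.0) + sum(p * x[t] for t, p in delta[s][a].items())
    return {s: min(delta[s], key=lambda a: outcome(s, a)) for s in controller_states}


# Hypothetical game: controller state c, environment state e, absorbing state done.
delta = {
    "c": {"safe": {"done": 1.0}, "risky": {"e": 1.0}},
    "e": {"block": {"c": 0.5, "done": 0.5}, "pass": {"done": 1.0}},
    "done": {"stay": {"done": 1.0}},
}
reward = {("c", "safe"): 3.0, ("c", "risky"): 1.0}
x = game_values(delta, reward, controller_states={"c"})
sigma = controller_strategy(delta, reward, {"c"}, x)
is_sound = x["c"] <= 2.5  # soundness check for an upper bound of 2.5 on the cost
```

Here the controller's value from c solves $x_c = \min(3,\, 1 + 0.5\,x_c)$, so the strategy that picks risky is sound for any upper bound of at least 2.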
Example 1. Fig. 1 shows a stochastic game $G$, with controller and environment player states drawn as diamonds and squares, respectively. It models the control of a robot moving between 4 locations ($s_0$, $s_2$, $s_3$, $s_5$). When moving east ($s_0 \to s_2$ or $s_3 \to s_5$), it may be impeded by a second robot, depending on the position of the latter. If it is impeded, there is a chance that it does not successfully move to the next location.
We use a reward structure moves, which assigns 1 to the controller actions north, east, south, and define property $\phi = \mathsf{R}^{\mathit{moves}}_{\le 5}[\,\mathsf{C}\,]$, meaning that the expected number of moves to reach $s_5$ is at most 5 (notice that $s_5$ is the only state from which all subsequent transitions have reward zero). A sound strategy for $\phi$ in $G$ (found by minimising moves) chooses south in $s_0$ and east in $s_3$, yielding an expected number of moves of 3.5.

Permissive Controller Synthesis
We now define a framework for permissive controller synthesis, which generalises classical controller synthesis by producing multi-strategies that offer the controller flexibility about which actions to take in each state.
3.1. Multi-Strategies. Multi-strategies generalise the notion of strategies, as defined in Section 2. They will always be defined for player $\Diamond$ of a game.
Definition 3.1 (Multi-strategy). Let $G = \langle S_\Diamond, S_\Box, \bar{s}, A, \delta \rangle$ be a stochastic game. A (memoryless) multi-strategy for $G$ is a function $\theta : S_\Diamond \to \mathit{Dist}(2^A)$ with $\theta(s)(\emptyset) = 0$ for all $s \in S_\Diamond$.
As for strategies, a multi-strategy $\theta$ is deterministic if $\theta$ always returns a Dirac distribution, and randomised otherwise. We write $\Theta^{det}_G$ and $\Theta^{rand}_G$ for the sets of all deterministic and randomised multi-strategies in $G$, respectively.
A deterministic multi-strategy $\theta$ chooses a set of allowed actions in each state $s \in S_\Diamond$, i.e., those in the unique set $B \subseteq A$ for which $\theta(s)(B) = 1$. When $\theta$ is deterministic, we will often abuse notation and write $a \in \theta(s)$ for the actions $a \in B$. The remaining actions $A(s) \setminus B$ are said to be disallowed in $s$. In contrast to classical controller synthesis, where a strategy $\sigma$ can be seen as providing instructions about precisely which action to take in each state, in permissive controller synthesis a multi-strategy provides (allows) multiple actions, any of which can be taken. A randomised multi-strategy generalises this by selecting a set of allowed actions in state $s$ randomly, according to distribution $\theta(s)$.
We say that a controller strategy $\sigma$ complies with multi-strategy $\theta$, denoted $\sigma \triangleright \theta$, if it picks actions that are allowed by $\theta$. Formally (taking into account the possibility of randomisation), we define this as follows.
Definition 3.2 (Compliant strategy). Let $\theta$ be a multi-strategy and $\sigma$ a strategy for a game $G$. We say that $\sigma$ is compliant (or that it complies) with $\theta$, written $\sigma \triangleright \theta$, if, for any state $s \in S_\Diamond$ and non-empty subset $B \subseteq A(s)$, there is a distribution $d_{s,B} \in \mathit{Dist}(B)$ such that, for all $a \in A(s)$, $\sigma(s)(a) = \sum_{B \ni a} \theta(s)(B) \cdot d_{s,B}(a)$.
Example 2. Let us explain the technical definition of a compliant strategy on the game from Ex. 1 (see Fig. 1). Consider a randomised multi-strategy $\theta$ that, in $s_0$, picks {east, south} with probability 0.5, {south} with probability 0.3, and {east} with probability 0.2. A compliant strategy then needs to, for some number $0 \le x \le 1$, pick south with probability $0.3 + 0.5 \cdot x$ and east with probability $0.2 + 0.5 \cdot (1 - x)$. The number $x$ corresponds to the probability $d_{s_0,\{east,south\}}(south)$ in the formal definition above.
Hence, a strategy $\sigma$ that picks east and south with equal probability 0.5 satisfies the requirements of compliance in state $s_0$, as witnessed by selecting $x = 0.4$, or, in other words, the distribution $d_{s_0,\{east,south\}}$ assigning 0.4 and 0.6 to south and east, respectively. On the other hand, a strategy that picks east with probability 0.8 cannot be compliant with $\theta$.
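The compliance condition of Example 2 can be checked mechanically in a single state. The sketch below does not search for the distributions $d_{s,B}$ directly. Instead it uses a Hall-type (supply and demand) feasibility condition, which is a technique substituted here for illustration and not one taken from the paper: $\sigma$ is compliant in $s$ exactly when, for every set $C$ of actions, $\sigma$ puts at least as much probability on $C$ as $\theta$ puts on sets $B \subseteq C$. The function name is hypothetical, and enumerating all subsets is exponential in the number of actions.

```python
from fractions import Fraction as F
from itertools import combinations


def complies_at_state(sigma_s, theta_s):
    """sigma_s maps actions to probabilities; theta_s maps frozensets of
    allowed actions to probabilities. Returns True iff suitable
    distributions d_{s,B} exist (supply/demand feasibility check)."""
    actions = sorted(set(sigma_s) | {a for B in theta_s for a in B})
    if sum(sigma_s.values()) != 1 or sum(theta_s.values()) != 1:
        return False
    for k in range(1, len(actions) + 1):
        for C in map(frozenset, combinations(actions, k)):
            forced = sum(p for B, p in theta_s.items() if B <= C)  # mass that must land in C
            chosen = sum(sigma_s.get(a, 0) for a in C)
            if chosen < forced:
                return False
    return True


# The multi-strategy of Example 2 in state s0.
theta_s0 = {
    frozenset({"east", "south"}): F(1, 2),
    frozenset({"south"}): F(3, 10),
    frozenset({"east"}): F(1, 5),
}
even = {"east": F(1, 2), "south": F(1, 2)}
mostly_east = {"east": F(4, 5), "south": F(1, 5)}
print(complies_at_state(even, theta_s0), complies_at_state(mostly_east, theta_s0))
```

The second strategy fails because south must receive at least the probability 0.3 that $\theta$ assigns to {south}, which matches the observation in the text that east can be chosen with probability at most 0.7.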
Each multi-strategy determines a set of compliant strategies, and our aim is to design multi-strategies which allow as many actions as possible, but at the same time ensure that any compliant strategy satisfies some specified property. We define the notion of a sound multi-strategy, i.e., one that is guaranteed to satisfy a property $\phi$ when complied with.
Definition 3.3 (Sound multi-strategy). A multi-strategy $\theta$ for game $G$ is sound for a property $\phi$ if any strategy $\sigma$ that complies with $\theta$ is sound for $\phi$.
Example 3. We return again to the stochastic game from Ex. 1 (see Fig. 1) and re-use the property $\phi = \mathsf{R}^{\mathit{moves}}_{\le 5}[\,\mathsf{C}\,]$. A strategy that picks south in $s_0$ and east in $s_3$ results in an expected reward of 3.5 (i.e., 3.5 moves on average to reach $s_5$). A strategy that picks east in $s_0$ and south in $s_2$ yields expected reward 5. Thus, a (deterministic) multi-strategy $\theta$ that picks {south, east} in $s_0$, {south} in $s_2$ and {east} in $s_3$ is sound for $\phi$ since the expected reward is always at most 5.

3.2. Penalties and Permissivity. The motivation behind synthesising multi-strategies is to offer flexibility in the actions to be taken, while still satisfying a particular property $\phi$. Generally, we want a multi-strategy $\theta$ to be as permissive as possible, i.e., to impose as few restrictions as possible on actions to be taken. We formalise the notion of permissivity by assigning penalties to actions in the model, which we then use to quantify the extent to which actions are disallowed by $\theta$. Penalties provide expressivity in the way that we quantify permissivity: if it is preferable that certain actions are allowed rather than others, then these can be assigned higher penalty values.
A penalty scheme is a pair $(\psi, \tau)$, comprising a penalty function $\psi : S_\Diamond \times A \to \mathbb{Q}_{\ge 0}$ and a penalty type $\tau \in \{\mathit{sta}, \mathit{dyn}\}$. The function $\psi$ represents the impact of disallowing each action in each controller state of the game. The type $\tau$ dictates how penalties for individual actions are combined to quantify the permissivity of a specific multi-strategy. For static penalties ($\tau = \mathit{sta}$), we simply sum penalties across all states of the model. For dynamic penalties ($\tau = \mathit{dyn}$), we take into account the likelihood that disallowed actions would actually have been available, by using the expected sum of penalty values.
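More precisely, for a penalty scheme $(\psi, \tau)$ and a multi-strategy $\theta$, we define the resulting penalty for $\theta$, denoted $\mathit{pen}_\tau(\psi, \theta)$, as follows. First, we define the local penalty for $\theta$ at state $s \in S_\Diamond$ as
$$\mathit{pen}_{loc}(\psi, \theta, s) = \sum_{B \subseteq A(s)} \sum_{a \notin B} \theta(s)(B)\,\psi(s,a).$$
If $\theta$ is deterministic, $\mathit{pen}_{loc}(\psi, \theta, s)$ is simply the sum of the penalties of actions that are disallowed by $\theta$ in $s$. If $\theta$ is randomised, $\mathit{pen}_{loc}(\psi, \theta, s)$ gives the expected penalty value in $s$, i.e., the sum of penalties weighted by the probability with which $\theta$ disallows them in $s$.
Now, for the static case, we sum the local penalties over all states, i.e., we put:
$$\mathit{pen}_{sta}(\psi, \theta) = \sum_{s \in S_\Diamond} \mathit{pen}_{loc}(\psi, \theta, s).$$
For the dynamic case, we use the (worst-case) expected sum of local penalties. We define an auxiliary reward structure $\psi^\theta_{rew}$ given by the local penalties: $\psi^\theta_{rew}(s,a) = \mathit{pen}_{loc}(\psi, \theta, s)$ for all $s \in S_\Diamond$ and $a \in A(s)$, and $\psi^\theta_{rew}(s,a) = 0$ for all $s \in S_\Box$ and $a \in A(s)$. Then:
$$\mathit{pen}_{dyn}(\psi, \theta, s) = \sup\{ E^{\sigma,\pi}_{G,s}(\psi^\theta_{rew}) \mid \sigma \in \Sigma^\Diamond_G,\ \pi \in \Sigma^\Box_G \text{ and } \sigma \text{ complies with } \theta \}.$$
We use $\mathit{pen}_{dyn}(\psi, \theta) = \mathit{pen}_{dyn}(\psi, \theta, \bar{s})$ to reference the dynamic penalty in the initial state.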
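The local and static penalties defined above amount to a few weighted sums. The following sketch computes them for a randomised multi-strategy given as a dictionary from sets of allowed actions to probabilities. The two states u and v, their enabled actions and the unit penalties are hypothetical, loosely modelled on the numbers used in the examples, and the sum over disallowed actions is restricted to enabled actions as an assumption.

```python
from fractions import Fraction as F


def local_penalty(psi, theta_s, s, enabled):
    """Expected penalty in state s: each enabled action outside the chosen
    set B contributes psi(s, a), weighted by the probability of B."""
    return sum(p * psi.get((s, a), 0)
               for B, p in theta_s.items()
               for a in enabled if a not in B)


def static_penalty(psi, theta, enabled):
    """Static penalty: the sum of local penalties over all controller states."""
    return sum(local_penalty(psi, theta[s], s, enabled[s]) for s in theta)


# Hypothetical controller states u and v with unit penalties on all actions.
enabled = {"u": ["north", "east"], "v": ["east", "south"]}
psi = {(s, a): 1 for s, acts in enabled.items() for a in acts}
theta = {
    "u": {frozenset({"north"}): F(7, 10), frozenset({"north", "east"}): F(3, 10)},
    "v": {frozenset({"east"}): F(1)},
}
pen_u = local_penalty(psi, theta["u"], "u", enabled["u"])
pen_total = static_penalty(psi, theta, enabled)
print(pen_u, pen_total)
```

The dynamic penalty is not covered by this sketch, since it additionally requires a worst-case expected total reward computation over compliant strategies.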

Permissive Controller Synthesis.
We can now formally define the central problem studied in this paper.
Definition 3.4 (Permissive controller synthesis). Consider a game $G$, a class of multi-strategies $\star \in \{\mathit{det}, \mathit{rand}\}$, a property $\phi$, a penalty scheme $(\psi, \tau)$ and a threshold $c \in \mathbb{Q}_{\ge 0}$. The permissive controller synthesis problem asks: does there exist a multi-strategy $\theta \in \Theta^\star_G$ that is sound for $\phi$ and satisfies $\mathit{pen}_\tau(\psi, \theta) \le c$?
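Alternatively, in a more quantitative fashion, we can aim to synthesise (if it exists) an optimally permissive sound multi-strategy.
Definition 3.5 (Optimally permissive). Let $G$, $\star$, $\phi$ and $(\psi, \tau)$ be as in Defn. 3.4. A sound multi-strategy $\hat\theta \in \Theta^\star_G$ is optimally permissive if $\mathit{pen}_\tau(\psi, \hat\theta) = \inf\{\mathit{pen}_\tau(\psi, \theta) \mid \theta \in \Theta^\star_G \text{ and } \theta \text{ is sound for } \phi\}$.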
Example 4. We return to Ex. 3 and consider a static penalty scheme $(\psi, \mathit{sta})$ assigning 1 to the actions north, east, south (in any state). The deterministic multi-strategy $\theta$ from Ex. 3 is optimally permissive for $\phi = \mathsf{R}^{\mathit{moves}}_{\le 5}[\,\mathsf{C}\,]$, with penalty 1 (just north in $s_3$ is disallowed). If we instead use $\phi' = \mathsf{R}^{\mathit{moves}}_{\le 16}[\,\mathsf{C}\,]$, the multi-strategy $\theta'$ that extends $\theta$ by also allowing north is now sound and optimally permissive, with penalty 0. Alternatively, the randomised multi-strategy $\theta''$ that picks {north} with probability 0.7 and {north, east} with probability 0.3 in $s_3$ is sound for $\phi$ with penalty just 0.7.
It is important to point out that penalties will typically be used for relative comparisons of multi-strategies. If two multi-strategies $\theta$ and $\theta'$ incur penalties $x$ and $x'$ with $x < x'$, then the interpretation is that $\theta$ is better than $\theta'$; there is not necessarily any intuitive meaning assigned to the values $x$ and $x'$ themselves. Accordingly, when modelling a system, the penalties of actions should be chosen to reflect the actions' relative importance. This is different from rewards, which usually correspond to a specific measure of the system.
Next, we establish several fundamental results about the permissive controller synthesis problem. Proofs that are particularly technical are postponed to the appendix and we only highlight the key ideas in the main body of the paper.
Optimality. Recall that two key parameters of the problem are the type of multi-strategy sought (deterministic or randomised) and the type of penalty scheme used (static or dynamic). We first note that randomised multi-strategies are strictly more powerful than deterministic ones, i.e., they can be more permissive (yield a lower penalty) whilst satisfying the same property $\phi$.
Theorem 3.6. The answer to a permissive controller synthesis problem (for either a static or dynamic penalty scheme) can be "no" for deterministic multi-strategies, but "yes" for randomised ones.
Proof. Consider an MDP with states $s$, $t_1$ and $t_2$, and actions $a_1$ and $a_2$, where $\delta(s, a_i)(t_i) = 1$ for $i \in \{1, 2\}$, and $t_1$, $t_2$ have self-loops only. Let $r$ be a reward structure assigning 1 to $(s, a_1)$ and 0 to all other state-action pairs, and $\psi$ be a penalty function assigning 1 to $(s, a_2)$ and 0 elsewhere. We then ask whether there is a multi-strategy satisfying $\phi = \mathsf{R}^{r}_{\ge 0.5}[\,\mathsf{C}\,]$ with penalty at most 0.5.
Considering either static or dynamic penalties, the randomised multi-strategy $\theta$ that chooses distribution $0.5{:}\{a_1\} + 0.5{:}\{a_2\}$ in $s$ is sound and yields penalty 0.5. However, there is no such deterministic multi-strategy. □
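The counterexample in the proof of Theorem 3.6 is small enough to check exhaustively. The sketch below assumes the MDP from the proof (reward 1 only for $a_1$ in $s$, penalty 1 only for disallowing $a_2$ in $s$) and a requirement of expected reward at least 0.5 with penalty at most 0.5; the function names are hypothetical. Since $s$ is visited exactly once, static and dynamic penalties coincide here.

```python
from fractions import Fraction as F


def worst_case_reward(theta_s):
    """Minimum expected reward over compliant strategies: a1 is forced only
    when the chosen set does not contain a2."""
    return sum(p for B, p in theta_s.items() if "a2" not in B)


def penalty(theta_s):
    """Expected penalty in s: only disallowing a2 is penalised (psi = 1)."""
    return sum(p for B, p in theta_s.items() if "a2" not in B)


def acceptable(theta_s):
    """Sound for reward >= 0.5 and within the penalty threshold 0.5."""
    return worst_case_reward(theta_s) >= F(1, 2) and penalty(theta_s) <= F(1, 2)


# All deterministic multi-strategies in s, and the randomised one from the proof.
deterministic = [{frozenset(B): F(1)} for B in (("a1",), ("a2",), ("a1", "a2"))]
randomised = {frozenset({"a1"}): F(1, 2), frozenset({"a2"}): F(1, 2)}

print([acceptable(t) for t in deterministic], acceptable(randomised))
```

In this particular example the worst-case reward and the penalty are the same quantity, namely the probability of choosing a set without $a_2$, so both bounds can hold only when that probability is exactly 0.5, which no deterministic multi-strategy achieves.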
This is why we explicitly distinguish between classes of multi-strategies when defining permissive controller synthesis. This situation contrasts with classical controller synthesis, where deterministic strategies are optimal for the same classes of properties $\phi$. Intuitively, randomisation is more powerful in this case because of the trade-off between rewards and penalties: similar results exist in, for example, multi-objective controller synthesis on MDPs [14].
Next, we observe that, for the case of static penalties, the optimal penalty value for a given property (the infimum of achievable values) may not actually be achievable by any randomised multi-strategy.
Theorem 3.7. For permissive controller synthesis using a static penalty scheme, an optimally permissive randomised multi-strategy does not always exist.
Proof. Consider a game with states $s$ and $t$, and actions $a$ and $b$, where we define $\delta(s, a)(s) = 1$ and $\delta(s, b)(t) = 1$, and $t$ has just a self-loop. The reward structure $r$ assigns 1 to $(s, b)$ and 0 to all other state-action pairs. The penalty function $\psi$ assigns 1 to $(s, a)$ and 0 elsewhere. Now observe that any multi-strategy which disallows the action $a$ with probability $\varepsilon > 0$ and allows all other actions incurs penalty $\varepsilon$ and is sound for $\mathsf{R}^{r}_{\ge 1}[\,\mathsf{C}\,]$, since any strategy which complies with the multi-strategy leads to action $b$ being taken eventually. Thus, the infimum of achievable penalties is 0. However, the multi-strategy that incurs penalty 0, i.e. allows all actions, is not sound for $\mathsf{R}^{r}_{\ge 1}[\,\mathsf{C}\,]$. □
If, on the other hand, we restrict our attention to deterministic multi-strategies, then an optimally permissive multi-strategy does always exist (since the set of deterministic, memoryless multi-strategies is finite). For randomised multi-strategies with dynamic penalties, the question remains open.
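The argument in the proof of Theorem 3.7 can be spelled out as a short calculation. The names $\theta_\varepsilon$ (the multi-strategy that disallows $a$ with probability $\varepsilon$) and $\theta_0$ (the one that allows everything) are introduced here only for illustration.

```latex
% \theta_\varepsilon(s)(\{b\}) = \varepsilon and \theta_\varepsilon(s)(\{a,b\}) = 1 - \varepsilon
\begin{align*}
\sigma \triangleright \theta_\varepsilon
  &\;\Longrightarrow\; \sigma(s)(b) \ge \varepsilon, \\
\Pr(\text{$b$ is never taken})
  &= \lim_{n\to\infty} \bigl(1-\sigma(s)(b)\bigr)^n
   \;\le\; \lim_{n\to\infty} (1-\varepsilon)^n = 0, \\
E^{\sigma,\pi}_{G,s}(r) &= 1, \qquad
\mathit{pen}_{sta}(\psi,\theta_\varepsilon) = \varepsilon\cdot\psi(s,a) = \varepsilon .
\end{align*}
```

Letting $\varepsilon$ tend to 0 shows that the infimum of penalties is 0, whereas $\theta_0$ admits the compliant strategy with $\sigma(s)(a) = 1$, which never leaves $s$ and accumulates reward 0.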
Complexity. Next, we present complexity results for the different variants of the permissive controller synthesis problem. We begin with lower bounds.
Theorem 3.8. The permissive controller synthesis problem is NP-hard, for either static or dynamic penalties, and deterministic or randomised multi-strategies.
We prove NP-hardness by reduction from the Knapsack problem, where weights of items are represented by penalties, and their values are expressed in terms of rewards to be achieved. The most delicate part is the proof for randomised multi-strategies, where we need to ensure that the multi-strategy cannot benefit from picking certain actions (corresponding to items being put into the Knapsack) with probability other than 0 or 1. See Appx. A.1 for details. For upper bounds, we have the following.
Theorem 3.9. The permissive controller synthesis problem for deterministic (resp. randomised) multi-strategies is in NP (resp. PSPACE) for dynamic/static penalties.
For deterministic multi-strategies, it is straightforward to show NP membership in both the dynamic and static penalty case, since we can guess a multi-strategy satisfying the required conditions and check its correctness in polynomial time. For randomised multi-strategies, with some technical effort, we can encode existence of the required multi-strategy as a formula of the existential fragment of the theory of real arithmetic, solvable with polynomial space [7]. See Appx. A.2.
A natural question is whether the PSPACE upper bound for randomised multi-strategies can be improved. We show that this is likely to be difficult, by giving a reduction from the square-root-sum problem.
Theorem 3.10. There is a reduction from the square-root-sum problem to the permissive controller synthesis problem with randomised multi-strategies, for both static and dynamic penalties.
We use a variant of the problem that asks, given positive rationals $x_1, \ldots, x_n$ and $y$, whether $\sum_{i=1}^{n} \sqrt{x_i} \le y$. This problem is known to be in PSPACE, but establishing a better complexity bound is a long-standing open problem in computational geometry [17]. See Appx. A.3 for details.

MILP-Based Synthesis of Multi-Strategies
We now consider practical methods for synthesising multi-strategies that are sound for a property $\phi$ and optimally permissive for some penalty scheme. Our methods use mixed integer linear programming (MILP), which optimises an objective function subject to linear constraints that mix both real and integer variables. A variety of efficient, off-the-shelf MILP solvers exist.
An important feature of the MILP solvers we use is that they work incrementally, producing a sequence of increasingly good solutions. Here, that means generating a series of sound multi-strategies that are increasingly permissive. In practice, when computational resources are constrained, it may be acceptable to stop early and accept a multi-strategy that is sound but not necessarily optimally permissive.
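Here, and in the rest of this section, we assume that the property $\phi$ is of the form $\mathsf{R}^{r}_{\ge b}[\,\mathsf{C}\,]$. Upper bounds on expected rewards ($\phi = \mathsf{R}^{r}_{\le b}[\,\mathsf{C}\,]$) can be handled by negating rewards and converting to a lower bound. For the purposes of encoding into MILP, we rescale $r$ and $b$ such that $\sup_{\sigma,\pi} E^{\sigma,\pi}_{G,s}(r) < 1$ for all $s$, and rescale every (non-zero) penalty such that $\psi(s,a) \ge 1$ for all $s$ and $a \in A(s)$.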
We begin by discussing the synthesis of deterministic multi-strategies, first for static penalties and then for dynamic penalties. Subsequently, we present an approach to synthesising approximations to optimal randomised multi-strategies. In each case, we describe encodings into MILP problems and prove their correctness. We conclude this section with a brief discussion of ways to optimise the MILP encodings. Then, in Section 5, we investigate the practical applicability of our techniques.
Minimise: $-x_{\bar{s}} + \sum_{s \in S_\Diamond} \sum_{a \in A(s)} (1 - y_{s,a}) \cdot \psi(s,a)$ subject to:
$x_{\bar{s}} \ge b$ (4.1)
$1 \le \sum_{a \in A(s)} y_{s,a}$ for all $s \in S_\Diamond$ (4.2)
$x_s \le \sum_{t \in S} \delta(s,a)(t) \cdot x_t + r(s,a) + (1 - y_{s,a})$ for all $s \in S_\Diamond$, $a \in A(s)$ (4.3)
$x_s \le \sum_{t \in S} \delta(s,a)(t) \cdot x_t + r(s,a)$ for all $s \in S_\Box$, $a \in A(s)$ (4.4)
$x_s \le \alpha_s$ for all $s \in S$ (4.5)
$y_{s,a} = (1 - \alpha_s) + \sum_{t \in \mathit{supp}(\delta(s,a))} \beta_{s,a,t}$ for all $s \in S$, $a \in A(s)$ (4.6)
$y_{s,a} = 1$ for all $s \in S_\Box$, $a \in A(s)$ (4.7)
$\gamma_t < \gamma_s + (1 - \beta_{s,a,t}) + r(s,a)$ for all $s, a, t$ with $t \in \mathit{supp}(\delta(s,a))$ (4.8)
Figure 2: MILP encoding for deterministic multi-strategies with static penalties.
Variables $y_{s,a}$ encode a multi-strategy $\theta$ as follows: $y_{s,a}$ has value 1 iff $\theta$ allows action $a$ in $s \in S_\Diamond$ (constraint (4.2) enforces at least one allowed action per state). Variables $x_s$ represent the worst-case expected total reward (for $r$) from state $s$, under any controller strategy complying with $\theta$ and under any environment strategy. This is captured by constraints (4.3)-(4.4) (which are analogous to the linear constraints used when minimising the reward in an MDP). Constraint (4.1) puts the required bound of $b$ on the reward from $\bar{s}$.
The objective function minimises the static penalty (the sum of all local penalties) minus the expected reward in the initial state. The latter acts as a tie-breaker between solutions with equal penalties (but, thanks to rescaling, is always dominated by the penalties and therefore does not affect optimality).
Minimise: $z_{\bar{s}}$ subject to (4.1), ..., (4.8) and: ... for all $s \in S_\Diamond$ (4.9)
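For very small games, the optimisation problem addressed by the static-penalty encoding can be cross-checked by brute force. The sketch below is not the MILP encoding: it enumerates every deterministic multi-strategy, computes the worst-case (minimum) expected total reward in the restricted game by value iteration from 0, and keeps a sound multi-strategy of least static penalty. All names are hypothetical, the enumeration is exponential in the size of the game, and the test game is the two-state game from the proof of Theorem 3.7.

```python
from itertools import combinations, product


def nonempty_subsets(items):
    items = sorted(items)
    return [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)]


def min_reward(delta, reward, allowed, iterations=500):
    """Least fixed point (by iteration from 0) of the minimum expected total
    reward when both players may pick any allowed action."""
    x = {s: 0.0 for s in delta}
    for _ in range(iterations):
        x = {s: min(reward.get((s, a), 0.0)
                    + sum(p * x[t] for t, p in delta[s][a].items())
                    for a in allowed[s])
             for s in delta}
    return x


def synthesise(delta, reward, psi, controller_states, init, b):
    """Return (penalty, theta) for a sound deterministic multi-strategy of
    minimal static penalty, or None if no sound one exists."""
    ctrl = sorted(controller_states)
    best = None
    for choice in product(*(nonempty_subsets(delta[s]) for s in ctrl)):
        theta = dict(zip(ctrl, choice))
        # Environment states keep all of their actions.
        allowed = {s: theta.get(s, frozenset(delta[s])) for s in delta}
        if min_reward(delta, reward, allowed)[init] < b - 1e-9:
            continue  # some compliant strategy violates the reward bound
        pen = sum(psi.get((s, a), 0) for s in ctrl
                  for a in delta[s] if a not in theta[s])
        if best is None or pen < best[0]:
            best = (pen, theta)
    return best


# Game from the proof of Theorem 3.7: a loops in s, b moves to t with reward 1.
delta = {"s": {"a": {"s": 1.0}, "b": {"t": 1.0}}, "t": {"loop": {"t": 1.0}}}
reward = {("s", "b"): 1.0}
psi = {("s", "a"): 1}
result = synthesise(delta, reward, psi, controller_states={"s"}, init="s", b=1.0)
print(result)
```

Starting the iteration from 0 yields the least solution of the reward equations, which is what makes the zero-reward loop on $a$ count as reward 0 rather than 1 when both actions are allowed.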
As an additional technicality, we need to ensure the values of $x_s$ are the least solution of the defining inequalities, to deal with the possibility of zero reward loops. To achieve this, we use an approach similar to the one taken in [31]. It is sufficient to ensure that $x_s = 0$ whenever the minimum expected reward from $s$ achievable under $\theta$ is 0, which is true if and only if, starting from $s$, it is possible to avoid ever taking an action with positive reward.
In our encoding, $\alpha_s = 1$ if $x_s$ is positive (constraint (4.5)). The binary variables $\beta_{s,a,t} = 1$ represent, for each such $s$ and each action $a$ allowed in $s$, a choice of successor $t = t(s,a) \in \mathrm{supp}(\delta(s,a))$ (constraint (4.6)). The variables $\gamma_s$ then represent a ranking function: if $r(s,a) = 0$, then $\gamma_s > \gamma_{t(s,a)}$ (constraint (4.8)). If a positive reward could be avoided starting from $s$, there would in particular be an infinite sequence $s_0, a_1, s_1, \ldots$ with $s_0 = s$ and, for all $i$, either (i) $x_{s_i} > x_{s_{i+1}}$, or (ii) $x_{s_i} = x_{s_{i+1}}$, $s_{i+1} = t(s_i, a_i)$ and $r(s_i, a_i) = 0$, and therefore $\gamma_{s_i} > \gamma_{s_{i+1}}$. This means that the sequence $(x_{s_0}, \gamma_{s_0}), (x_{s_1}, \gamma_{s_1}), \ldots$ is (strictly) decreasing w.r.t. the lexicographical order, but at the same time $S$ is finite, and so this sequence would have to enter a loop, which is a contradiction.
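To make the meaning of the encoded quantities concrete, the following is a minimal sketch that finds an optimally permissive deterministic multi-strategy under static penalties by brute-force enumeration rather than by MILP. It is not the encoding of Fig. 2: the toy game, the names `DELTA`, `PENALTY`, `worst_case_reward` and `best_multi_strategy`, and the use of a fixed number of value-iteration steps are all assumptions made for illustration, and the approach only scales to very small games.

```python
from itertools import combinations, product

# Hypothetical toy game (not from the paper). States in CONTROLLER belong to
# the controller; all other states belong to the environment.
CONTROLLER = {"s0"}
# DELTA[state][action] = (reward, {successor: probability})
DELTA = {
    "s0": {"a": (0.0, {"s1": 1.0}), "b": (0.0, {"s2": 1.0})},
    "s1": {"c": (1.0, {"end": 1.0}), "d": (0.2, {"end": 1.0})},
    "s2": {"e": (0.6, {"end": 1.0})},
    "end": {"stay": (0.0, {"end": 1.0})},
}
# Local penalties psi(s, a) for disallowing controller actions.
PENALTY = {("s0", "a"): 1.0, ("s0", "b"): 1.0}


def worst_case_reward(theta, iterations=100):
    """Approximate the minimal expected total reward from every state when
    the controller may only use actions allowed by theta (value iteration
    from below; the iteration count is an illustrative assumption)."""
    x = {s: 0.0 for s in DELTA}
    for _ in range(iterations):
        x = {
            s: min(
                r + sum(p * x[t] for t, p in succ.items())
                for a, (r, succ) in DELTA[s].items()
                if s not in CONTROLLER or a in theta[s]
            )
            for s in DELTA
        }
    return x


def static_penalty(theta):
    """Sum of the local penalties of all disallowed controller actions."""
    return sum(PENALTY[s, a] for s in CONTROLLER for a in DELTA[s] if a not in theta[s])


def nonempty_subsets(actions):
    actions = sorted(actions)
    return [frozenset(c) for k in range(1, len(actions) + 1) for c in combinations(actions, k)]


def best_multi_strategy(initial, bound):
    """Enumerate all deterministic multi-strategies and return the sound one
    (worst-case reward from `initial` at least `bound`) of least penalty."""
    states = sorted(CONTROLLER)
    best = None
    for choice in product(*(nonempty_subsets(DELTA[s]) for s in states)):
        theta = dict(zip(states, choice))
        if worst_case_reward(theta)[initial] >= bound:
            pen = static_penalty(theta)
            if best is None or pen < best[0]:
                best = (pen, theta)
    return best
```

In this toy game, a bound of 0.5 forces the controller to disallow `a` (the environment could otherwise answer with `d`), whereas a bound of 0.1 lets both actions stay enabled at zero penalty.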
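Correctness. Before proving the correctness of the encoding (stated in Theorem 4.2, below), we prove the following auxiliary lemma that characterises the reward achieved under a multi-strategy in terms of a solution of a set of inequalities.

Lemma 4.1. Let $G = \langle S_\Diamond, S_\Box, \bar{s}, A, \delta \rangle$ be a stochastic game, $\varphi = R^{r}_{\geq b}[\,C\,]$ a property, $(\psi, sta)$ a static penalty scheme and $\theta$ a deterministic multi-strategy. Consider the inequalities:
\begin{align*}
x_s &\leq \min_{a \in \theta(s)} \sum_{t \in S} \delta(s,a)(t) \cdot x_t + r(s,a) && \text{for all } s \in S_\Diamond \\
x_s &\leq \min_{a \in A(s)} \sum_{t \in S} \delta(s,a)(t) \cdot x_t + r(s,a) && \text{for all } s \in S_\Box
\end{align*}
Then the following hold:
• $\bar{x}_s = \inf_{\sigma \triangleright \theta, \pi} E^{\sigma,\pi}_{G,s}(r)$ is a solution to the above inequalities.
• A solution $\bar{x}_s$ to the above inequalities satisfies $\bar{x}_s \leq \inf_{\sigma \triangleright \theta, \pi} E^{\sigma,\pi}_{G,s}(r)$ for all $s$ whenever the following condition holds: for every $s$ with $\bar{x}_s > 0$, every $\sigma \triangleright \theta$ and every $\pi$ there is a path $\omega = s_0 a_0 \ldots s_n a_n$ starting in $s$ that satisfies $Pr^{\sigma,\pi}_{G,s}(\omega) > 0$ and $r(s_n, a_n) > 0$.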
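Proof. The game $G$, together with $\theta$, determines a Markov decision process $G^\theta = \langle \emptyset, S_\Diamond \cup S_\Box, \bar{s}, A, \delta' \rangle$ in which the choices disallowed by $\theta$ are removed, i.e. $\delta'(s,a)$ is equal to $\delta(s,a)$ for every $s \in S_\Box$ and every $s \in S_\Diamond$ with $a \in \theta(s)$, and is undefined for any other combination of $s$ and $a$. We have:
\[
\inf_{\sigma \triangleright \theta, \pi} E^{\sigma,\pi}_{G,s}(r) = \inf_{\bar{\sigma}} E^{\bar{\sigma}}_{G^\theta,s}(r)
\]
since, for any strategy pair $\sigma \triangleright \theta$ and $\pi$ in $G$, there is a strategy $\bar{\sigma}$ in $G^\theta$ which is defined, for every finite path $\omega$ of $G^\theta$ ending in $t$, by $\bar{\sigma}(\omega) = \sigma(\omega)$ or $\bar{\sigma}(\omega) = \pi(\omega)$, depending on whether $t \in S_\Diamond$ or $t \in S_\Box$, and which satisfies $E^{\sigma,\pi}_{G,s}(r) = E^{\bar{\sigma}}_{G^\theta,s}(r)$. Similarly, a strategy $\bar{\sigma}$ for $G^\theta$ induces a compliant strategy $\sigma$ and a strategy $\pi$ defined for every finite path $\omega$ of $G$ ending in $S_\Diamond$ (resp. $S_\Box$) by $\sigma(\omega) = \bar{\sigma}(\omega)$ (resp. $\pi(\omega) = \bar{\sigma}(\omega)$).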
The rest is then the following simple application of results from the theory of Markov decision processes. The first item of the lemma follows from [25, Theorem 7.1.3], which gives a characterisation of values in MDPs in terms of Bellman equations; the inequalities in the lemma are in fact a relaxation of these equations. For the second part of the lemma, observe that if $\inf_{\sigma \triangleright \theta, \pi} E^{\sigma,\pi}_{G,s}(r)$ is infinite, then the claim holds trivially. Otherwise, from the assumption on the existence of $\omega$ we have that, under any compliant strategy, there is a path $\omega' = s_0 a_0 s_1 \ldots s_n$ of length at most $|S|$ in $G^\theta$ such that $\inf_{\sigma \triangleright \theta, \pi} E^{\sigma,\pi}_{G,s_n}(r) = 0$ (otherwise the reward would be infinite) and so $\bar{x}_{s_n} = 0$. We can thus apply [25, Proposition 7.3.4], which states that a solution to our inequalities gives optimal values whenever under any strategy the probability of reaching a state $s$ with $x_s = 0$ is 1. Note that the result of [25] applies for maximisation of reward in "negative models"; our problem can be easily reduced to this setting by multiplying the rewards by $-1$ and looking for maximising (instead of minimising) strategies.
□

Theorem 4.2. Let $G$ be a game, $\varphi = R^{r}_{\geq b}[\,C\,]$ a property and $(\psi, sta)$ a static penalty scheme. There is a sound multi-strategy in $G$ for $\varphi$ with penalty $p$ if and only if there is an optimal assignment to the MILP instance from Fig. 2 which satisfies $p = \sum_{s \in S_\Diamond} \sum_{a \in A(s)} (1 - y_{s,a}) \cdot \psi(s,a)$.

Proof. We prove that every multi-strategy $\theta$ induces a satisfying assignment to the variables such that the static penalty under $\theta$ is $\sum_{s \in S_\Diamond} \sum_{a \in A(s)} (1 - y_{s,a}) \cdot \psi(s,a)$, and vice versa. The theorem then follows from the rescaling of rewards and penalties that we performed.
We start by proving that, given a sound multi-strategy $\theta$, we can construct a satisfying assignment $\{\bar{y}_{s,a}, \bar{x}_s, \bar{\alpha}_s, \bar{\beta}_{s,a,t}, \bar{\gamma}_t\}_{s,t \in S, a \in A}$ to the constraints from Fig. 2. For $s \in S_\Diamond$ and $a \in A(s)$ we set $\bar{y}_{s,a} = 1$ if $a \in \theta(s)$, and otherwise we set $\bar{y}_{s,a} = 0$. This gives satisfaction of constraint (4.2). For $s \in S_\Box$ and $a \in A(s)$ we set $\bar{y}_{s,a} = 1$, ensuring satisfaction of (4.7). We then put $\bar{x}_s = \inf_{\sigma \triangleright \theta, \pi} E^{\sigma,\pi}_{G,s}(r)$. By the first part of Lemma 4.1 we get that constraints (4.1), (4.3) (for $a \in \theta(s)$) and (4.4) are satisfied. Constraint (4.3) for $a \notin \theta(s)$ is satisfied because in this case $\bar{y}_{s,a} = 0$, and so the right-hand side is always at least 1.
We further set $\bar{\alpha}_s = 1$ if $\bar{x}_s > 0$ and $\bar{\alpha}_s = 0$ if $\bar{x}_s = 0$, thus satisfying constraint (4.5). For a state $s$, let $d_s$ be the maximum distance to a positive reward. Formally, the values $d_s$ are defined inductively by putting $d_s = 0$ for any state $s$ such that we have $r(s,a) > 0$ for all $a \in A(s)$, and then for any other state $s$: Put $d_s = \bot$ if $d_s$ was not defined by the above. For $s$ such that $d_s \neq \bot$, we put $\bar{\gamma}_s = d_s / |S|$, and for every $a$ we choose $t$ such that $d_t < d_s$, and set $\bar{\beta}_{s,a,t} = 1$, leaving $\bar{\beta}_{s,a,t} = 0$ for all other $t$. For $s$ such that $d_s = \bot$ we define $\bar{\gamma}_s = 0$ and for all $a$ and $t$ put $\bar{\beta}_{s,a,t} = 0$. This ensures the satisfaction of the remaining constraints.
In the opposite direction, assume that we are given a satisfying assignment. Firstly, we create a game $G'$ from $G$ by making any states $s$ with $\bar{x}_s = 0$ sink states (i.e. imposing a self-loop with no penalty on $s$ and removing all other transitions). Any sound multi-strategy $\theta$ for $\varphi$ in $G'$ directly gives a sound multi-strategy $\theta'$ for $\varphi$ in $G$ defined by $\theta'(s) = \theta(s)$ for states $s \in S_\Diamond$ with $\bar{x}_s > 0$, and otherwise letting $\theta'$ allow all available actions.
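We construct $\theta$ for $G'$ by putting $\theta(s) = \{a \in A(s) \mid \bar{y}_{s,a} = 1\}$ for all $s \in S_\Diamond$ with $\bar{x}_s > 0$, and by allowing the self-loop in the states $s \in S_\Diamond$ with $\bar{x}_s = 0$; note that $\theta(s)$ is non-empty by constraint (4.2). First, by definition, the multi-strategy yields the penalty $\sum_{s \in S_\Diamond} \sum_{a \in A(s)} (1 - \bar{y}_{s,a}) \cdot \psi(s,a)$. Next, we will show that $\theta$ satisfies the assumption of the second part of Lemma 4.1, from which we get that:
\[
\bar{x}_s \leq \inf_{\sigma \triangleright \theta, \pi} E^{\sigma,\pi}_{G',s}(r) \quad \text{for all } s,
\]
which, together with constraint (4.1) being satisfied, gives us the desired result.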
Consider any $s$ such that $\inf_{\sigma \triangleright \theta, \pi} E^{\sigma,\pi}_{G',s}(r) > 0$. Then we have $\bar{x}_s > 0$ (by the definition of $G'$). Let us fix any $\sigma \triangleright \theta$ and any $\pi$, and let $s_0 = s$. We show that there is a path $\omega$ satisfying the assumption of the lemma. We build $\omega = s_0 \ldots s_n a_n$ inductively, to satisfy: (i) $r(s_n, a_n) > 0$, (ii) $\bar{x}_{s_i} \geq \bar{x}_{s_{i-1}}$ for all $i$, and (iii) for any sub-path $s_i a_i \ldots s_j$ with $\bar{x}_{s_i} = \bar{x}_{s_j}$ we have that $\bar{\gamma}_{s_k} < \bar{\gamma}_{s_{k-1}}$ for all $i + 1 \leq k \leq j$.
Assume we have defined a prefix $s_0 a_0 \ldots s_i$ to satisfy conditions (ii) and (iii). We put $a_i$ to be the action picked by $\sigma$ (or $\pi$) in $s_i$. If $r(s_i, a_i) > 0$, we are done. Otherwise, we pick $s_{i+1}$ as follows:
• If there is $s' \in \mathrm{supp}(\delta(s_i, a_i))$ with $\bar{x}_{s'} > \bar{x}_{s_i}$, then we put $s_{i+1} = s'$. Such a choice again satisfies (ii) and (iii) by definition.
• If we have $\bar{x}_{s'} = \bar{x}_{s_i}$ for all $s' \in \mathrm{supp}(\delta(s_i, a_i))$, then any choice will satisfy (ii). To satisfy the other conditions, we pick $s_{i+1}$ so that $\bar{\beta}_{s_i, a_i, s_{i+1}} = 1$. We argue that such an $s_{i+1}$ can be chosen. We have $\bar{x}_{s_i} > 0$ and so $\bar{\alpha}_{s_i} = 1$ by constraint (4.5). We also have $\bar{y}_{s_i, a_i} = 1$: for $s_i \in S_\Diamond$ this follows from the definition of $\theta$, for $s_i \in S_\Box$ from constraint (4.7). Hence, since constraint (4.6) is satisfied, there must be $s_{i+1}$ such that $\bar{\beta}_{s_i, a_i, s_{i+1}} = 1$. Then, we apply constraint (4.8) (for $s = s_i$, $t = s_{i+1}$ and $a = a_i$) and, since the last two summands on the right-hand side are 0, we get $\bar{\gamma}_{s_{i+1}} < \bar{\gamma}_{s_i}$, thus satisfying (iii).
Note that the above construction must terminate after at most $|S|$ steps since, due to conditions (ii) and (iii), no state repeats on $\omega$. Because the only way of terminating is satisfaction of (i), we are done.
□

4.2. Deterministic Multi-Strategies with Dynamic Penalties. Next, we show how to compute a sound and optimally permissive deterministic multi-strategy for a dynamic penalty scheme $(\psi, dyn)$. This case is more subtle since the optimal penalty can be infinite. Hence, our solution proceeds in two steps as follows.
Initially, we determine if there is some sound multi-strategy. For this, we just need to check for the existence of a sound strategy, using standard algorithms for solution of stochastic games [12,15]. If there is no sound multi-strategy, we are done. Otherwise, we use the MILP problem in Fig. 3 to determine the penalty for an optimally permissive sound multi-strategy. This MILP encoding extends the one in Fig. 2 for static penalties, adding variables $\ell_s$ and $z_s$, representing the local and the expected penalty in state $s$, and three extra sets of constraints. First, (4.9) and (4.10) define the expected penalty in controller states, which is the sum of penalties for all disabled actions and those in the successor states, multiplied by their transition probabilities. The behaviour of environment states is then captured by constraint (4.11), where we only maximise the penalty, without incurring any penalty locally.
The constant $c$ in (4.10) is chosen to be no lower than any finite penalty achievable by a deterministic multi-strategy, a possible value being the bound (4.12), where $p$ is the smallest non-zero probability assigned by $\delta$, and $pen_{max}$ is the maximal local penalty over all states. To see that (4.12) indeed gives a safe bound on $c$ (i.e. it is no lower than any finite penalty achievable), observe that for the penalty to be finite under a deterministic multi-strategy, for every state $s$ there must be a path of length at most $|S|$ to a state from which no penalty will be incurred. This path has probability at least $p^{|S|}$, and since the penalty accumulated along a path of length $i \cdot |S|$ is at most $i \cdot |S| \cdot pen_{max}$, the properties of (4.12) follow easily.
If the MILP problem has a solution, this is the optimal dynamic penalty over all sound multi-strategies. If not, no deterministic sound multi-strategy has a finite penalty and the optimal penalty is $\infty$ (recall that we already established there is some sound multi-strategy). In practice, we might choose a lower value of $c$ than the one above, resulting in a multi-strategy that is sound, but possibly not optimally permissive.
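The way the dynamic penalty accumulates can be illustrated with a short sketch that evaluates it for one fixed deterministic multi-strategy: controller states add their local penalty and then take the worst allowed action, while environment states only maximise. This is value iteration on a hypothetical toy game, not the MILP of Fig. 3; the names `DELTA`, `PSI` and `dynamic_penalty` and the fixed iteration count are assumptions for illustration, and the result is only meaningful when the penalty is finite.

```python
# Hypothetical toy game (not from the paper): s0 and s1 are controller
# states, "end" is an absorbing environment state.
CONTROLLER = {"s0", "s1"}
# DELTA[state][action] = {successor: probability}
DELTA = {
    "s0": {"a": {"s1": 0.5, "end": 0.5}, "b": {"end": 1.0}},
    "s1": {"c": {"end": 1.0}, "d": {"end": 1.0}},
    "end": {"stay": {"end": 1.0}},
}
# Local penalties psi(s, a) for disallowing controller actions.
PSI = {("s0", "a"): 1.0, ("s0", "b"): 1.0, ("s1", "c"): 2.0, ("s1", "d"): 1.0}


def local_penalty(theta, s):
    """Sum of penalties of the actions that theta disallows in state s."""
    return sum(PSI[s, a] for a in DELTA[s] if a not in theta[s])


def dynamic_penalty(theta, iterations=200):
    """Approximate the expected penalty from every state by iterating
    z_s = local penalty + max over allowed actions of the expected z of
    the successors (environment states have no local penalty)."""
    z = {s: 0.0 for s in DELTA}
    for _ in range(iterations):
        new = {}
        for s in DELTA:
            if s in CONTROLLER:
                allowed, loc = theta[s], local_penalty(theta, s)
            else:
                allowed, loc = DELTA[s], 0.0
            new[s] = loc + max(
                sum(p * z[t] for t, p in DELTA[s][a].items()) for a in allowed
            )
        z = new
    return z
```

For instance, with `theta = {"s0": {"a"}, "s1": {"d"}}` the penalty from `s0` is the local penalty 1 for disallowing `b`, plus half of the penalty 2 incurred in `s1` for disallowing `c`.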
Correctness. Formally, correctness of the MILP encoding for the case of dynamic penalties is captured by the following theorem.

Theorem 4.3. Let $G$ be a game, $\varphi = R^{r}_{\geq b}[\,C\,]$ a property and $(\psi, dyn)$ a dynamic penalty scheme. Assume there is a sound multi-strategy for $\varphi$. The MILP formulation from Fig. 3 satisfies: (a) there is no solution if and only if the optimally permissive deterministic multi-strategy yields infinite penalty; and (b) there is a solution $\bar{z}_{\bar{s}}$ if and only if an optimally permissive deterministic multi-strategy yields penalty $\bar{z}_{\bar{s}}$.
Proof. We show that any sound multi-strategy with finite penalty $\bar{z}_{\bar{s}}$ gives rise to a satisfying assignment with the objective value $\bar{z}_{\bar{s}}$, and vice versa. Then, (b) follows directly, and (a) follows by the assumption that there is some sound multi-strategy.
Let us prove that for any sound multi-strategy $\theta$ we can construct a satisfying assignment to the constraints. For constraints (4.1) to (4.8), the construction works exactly the same as in the proof of Theorem 4.2. For the newly added variables, i.e. $z_s$ and $\ell_s$, we put $\bar{\ell}_s = pen_{loc}(\psi, \theta, s)$, ensuring satisfaction of constraint (4.9), and:
\[
\bar{z}_s = \sup_{\sigma \triangleright \theta, \pi} E^{\sigma,\pi}_{G,s}(\psi^{\theta}_{rew})
\]
which, together with [25, Section 7.2.7, Equation 7.2.17] (giving characterisation of optimal reward in terms of a linear program), ensures that constraints (4.10) and (4.11) are satisfied.
In the opposite direction, given a satisfying assignment we construct θ for G ′ exactly as in the proof of Theorem 4.2.As before, we can argue that constraints (4.1) to (4.8) are satisfied under any sound multi-strategy.We now need to argue that the multi-strategy satisfies pen dyn (ψ, θ, s) zs .It is easy to see that pen loc (ψ, θ, s) = ls .Moreover, by [25, Section 7.2.7,Equation 7.2.17] the penalty is the least solution to the inequalities: We can replace (4.13) with: , though, cannot be adapted to the randomised case, since this would need non-linear constraints (intuitively, we would need to multiply expected total rewards by probabilities of actions being allowed under a multi-strategy, and both these quantities are unknowns in our formalisation).Instead, in this section, we propose an approximation which finds the optimal randomised multi-strategy θ in which each probability θ(s)(B) is a multiple of 1 M for a given granularity M .Any such multi-strategy can then be simulated by a deterministic one on a transformed game, allowing synthesis to be carried out using the MILP-based methods described in the previous section.Before giving the definition of the transformed game, we show that we can simplify our problem by restricting to multi-strategies which in any state select at most two actions with non-zero probability.
Proof. If the (dynamic) penalty under $\theta$ is infinite, then the solution is straightforward: we can simply take $\theta'$ which, in every state, allows a single action so that the reward is maximised. This restrictive multi-strategy enforces a strategy that maximises the reward (so it performs at least as well as any other multi-strategy), and at the same time it cannot yield a dynamic penalty worse than $\theta$, as the dynamic penalty under $\theta$ is already infinite. From now on, we will assume that the penalty is finite.
Let $\theta$ be a multi-strategy allowing $n > 2$ different sets $A_1, \ldots, A_n$ with non-zero probabilities $\lambda_1, \ldots, \lambda_n$ in $s_1 \in S_\Diamond$. We construct a multi-strategy $\theta'$ that in $s_1$ allows only two of the sets $A_1, \ldots, A_n$ with non-zero probability, and in other states behaves like $\theta$.
We first prove the case of dynamic penalties and then describe the differences for static penalties.Supposing that inf σ⊳θ,π E σ,π G,s 1 (r) inf σ⊳θ ′ ,π E σ,π G,s 1 (r), we have that the total reward is: where the equation ( * ) above follows by the fact that, up to the first time s 1 is reached, θ and θ ′ allow the same actions.Hence, it suffices to define θ ′ so that inf σ⊳θ,π E σ,π G,s 1 (r) inf σ⊳θ ′ ,π E σ,π G,s 1 (r).Similarly, for the penalties, it is enough to ensure sup σ⊳θ,π E σ,π G,s 1 (ψ θ rew ) sup σ⊳θ ′ ,π E σ,π G,s 1 (ψ θ ′ rew ).Let P i and R i , where i ∈ {1, ..., n}, be the penalties and rewards from θ after allowing A i against an optimal opponent strategy, i.e.: We also define R = inf σ⊳θ,π E σ,π G,s 1 (r) and P = sup σ⊳θ,π E σ,π G,s 1 (ψ θ rew ) and have R = n i=1 λ i R i and P = n i=1 λ i P i .Let S 0 ⊆ S be those states for which there are σ ⊳ θ and π ensuring a return to s 1 without accumulating any reward.Formally, S 0 contains all states s 0 which satisfy Pr σ,π G,s 0 (F {s 1 }) = 1 and E σ,π G,s 0 (r↓s 1 ) for some σ ⊳ θ and π.We say that A i is progressing if for all a ∈ A i we have r(s 1 , a) > 0 or supp(δ(s 1 , a)) ⊆ S 0 .We note that A i is progressing whenever R i > R (since any a violating the condition above could have been used by the opponent to force R i R).
We will find α ∈ (0, 1) and 1 u, v n such that one of A u or A v is progressing, and define the multi-strategy θ ′ to pick A u and A v with probabilities α and 1 − α, respectively.We distinguish several cases, depending on the shape of T : (1) T has non-empty interior.Let (R 1 , P 1 ), . . ., (R m , P m ) be its vertices in the anticlockwise order.Since all λ i are positive, (R, P ) is in the interior of T .Now consider the point (R, P ′ ) directly below (R, P ) on the boundary of T , i.e.
), and we pick such α and u ∈ I j and v ∈ I j+1 .If (R, P ′ ) happens to be a vertex (R j , P j ) we can (since P j < P ) instead choose sufficiently small α > 0 so that R αR j + (1 − α)R j+1 and P αP j + (1 − α)P j+1 and again pick u ∈ I j and v ∈ I j+1 .In either case, we necessarily have R j+1 > R (by ordering of the vertices in the anticlockwise order and since (R, P ) is in the interior of T ), and so A v is progressing.
(2) T is a vertical line segment, i.e. it is the convex hull of two extreme points (R, P 0 ) and (R, P 1 ) with P 0 < P 1 .In case R = 0, we can simply always allow some A i with i ∈ I 0 , minimising the penalty and still achieving reward 0. If R > 0, there must be at least one progressing A u .Since all λ i are positive, (R, P ) lies inside the line segment, and in particular P > P 0 .We can therefore choose some v and α ∈ (0, 1) such that P α (3) T is a non-vertical line segment, i.e. it is the convex hull of two extreme points (R 0 , P 0 ) and (R 1 , P 1 ) with R 0 < R 1 .Since all λ i are positive, (R, P ) is not one of the extreme points, i.e. (R, P ) = α(R 0 , P 0 ) + (1 − α)(R 1 , P 1 ) with 0 < α < 1.We can therefore choose u ∈ I 0 , v ∈ I 1 .Again, since R 1 > R, A v is progressing.(4) T consists of a single point (R, P ).This can be treated like the second case: either R = 0, and we can allow any combination, or R > 0, and there is some progressing A u , and we then pick arbitrary v and α.We now want to show that the reward of the updated multi-strategy is indeed no worse than before.For i ∈ {u, v} we define: and we define Pick an action a (resp.a ′ ) that realises the minimum and strategies σ and π (resp.σ ′ and π ′ ) that realise the infimum in the definition of R i (resp.R ′ i ).(Such strategies indeed exist).Define: We have By finiteness of rewards and the choice of θ(s 1 ), at least one of the return probabilities c ′ u , c ′ v is less than 1, and thus so is We can show that the penalty under θ ′ is at most as big as the penalty under θ in exactly the same way (note that in addition using ψ θ ′ rew instead of ψ θ rew for c ′ , d ′ , R ′ and R ′ i ).For static penalties, the proof that the new multi-strategy is no worse than the old one is straightforward from the choice of θ ′ (s 1 ).

□
The result just proved allows us to simplify the game construction that we use to map between (discretised) randomised multi-strategies and deterministic ones. Let the original game be $G$ and the transformed game be $G'$. The transformation is illustrated in Fig. 4. The left-hand side shows a controller state $s \in S_\Diamond$ in the original game $G$ (i.e., the one for which we are seeking randomised multi-strategies). For each such state, we add the two layers of states illustrated on the right-hand side of the figure: gadgets $s'_1$, $s'_2$ representing the two subsets $B \subseteq A(s)$ with $\theta(s)(B) > 0$, and selectors $s_i$ (for $1 \leq i \leq m$), which distribute probability among the two gadgets. Two new actions, $b_1$ and $b_2$, are also added to label the transitions between selectors $s_i$ and gadgets $s'_1$, $s'_2$. The selectors $s_i$ are reached from $s$ via a transition using fixed probabilities $p_1, \ldots, p_m$ which need to be chosen appropriately. For efficiency, we want to minimise the number of selectors $m$ for each state $s$. We let $m = \lfloor 1 + \log_2 M \rfloor$ and $p_i = \frac{l_i}{M}$, where $l_1, \ldots, l_m \in \mathbb{N}$ are defined recursively as follows: $l_1 = \lceil \frac{M}{2} \rceil$ and $l_i = \lceil \frac{M - (l_1 + \cdots + l_{i-1})}{2} \rceil$ for $2 \leq i \leq m$. For example, for $M = 10$, we have $m = 4$ and $l_1, \ldots, l_4 = 5, 3, 1, 1$, so $p_1, \ldots, p_4 = \frac{5}{10}, \frac{3}{10}, \frac{1}{10}, \frac{1}{10}$.
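The choice of selector probabilities can be checked with a short sketch. It assumes the recursion $l_1 = \lceil M/2 \rceil$, $l_i = \lceil (M - l_1 - \cdots - l_{i-1})/2 \rceil$, which reproduces the $M = 10$ example above, and it verifies the property that makes the construction useful: every integer $k$ with $0 \leq k \leq M$ is the sum of some subset of the $l_i$, so every probability $\frac{k}{M}$ can be realised by routing a subset of selectors to the first gadget. The function names are hypothetical.

```python
def selector_weights(M):
    """Return [l_1, ..., l_m] for granularity M >= 1, so that p_i = l_i / M."""
    m = M.bit_length()  # equals floor(1 + log2(M)) for every integer M >= 1
    weights, remaining = [], M
    for _ in range(m):
        l = (remaining + 1) // 2  # ceil(remaining / 2)
        weights.append(l)
        remaining -= l
    return weights


def reachable_sums(weights):
    """All values obtainable as the sum of a subset of the weights."""
    sums = {0}
    for l in weights:
        sums |= {s + l for s in sums}
    return sums


print(selector_weights(10))                           # [5, 3, 1, 1]
print(sorted(reachable_sums(selector_weights(10))))   # 0, 1, ..., 10
```

Using `bit_length` instead of a floating-point logarithm avoids rounding problems when $M$ is an exact power of two.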
We are now able to find optimal discretised randomised multi-strategies in G by finding optimal deterministic multi-strategies in G ′ .This connection will be formalised in Lemma 4.5 below.But we first point out that, for the case of static penalties, a small transformation to the MILP encoding (see Fig. 2) used to solve game G ′ is required.The problem is that the definition of static penalties on G ′ does not precisely capture the static penalties of the original game G.In this, case we adapt Fig. 2 as follows.For each state s, action a ∈ A(s) and i ∈ {1, . . ., n}, we add a binary variable y ′ s i ,a and constraints We then change the objective function that we minimise to: a property, (ψ, τ ) a (static or dynamic) penalty scheme, and let G ′ be the game transformed as described above.The following two claims are equivalent: (1) There is a sound multi-strategy θ in G with pen dyn (ψ, θ) = x (or, for static penalties, pen sta (ψ, θ) = x), and θ only uses probabilities that are multiples of 1 M .
(2) There is a sound deterministic multi-strategy θ ′ in G ′ and pen dyn (ψ, θ) = x (or, for static penalties, Proof.Firstly, observe that for any integer 0 k M there is a set I k ⊆ {1, . . ., m} with j∈I k l j = k.The opposite direction also holds. Let θ be a multi-strategy in G.By Theorem 4.4 we can assume that |supp(θ(s))| 2 for any s.We create θ ′ as follows.For every state s ∈ S ♦ with {A 1 , A 2 } = supp(θ(s)), we set θ ′ (s ′ 1 )(A 1 ) = 1 and θ ′ (s ′ 2 )(A 2 ) = 1.Then, supposing θ(s)(A) = k M , we let θ ′ (s i )({b 1 }) = 1 whenever i ∈ I k , and θ ′ (s i )({b 2 }) = 1 whenever i ∈ I k .If θ(s) is a singleton set, the construction is analogous.Clearly, the property for static penalties is preserved.For any memoryless σ ′ ⊳ θ ′ there is a memoryless strategy σ ⊳ θ that is given by σ for any a, and conversely for any σ ⊳ θ we can define σ ′ ⊳ θ ′ by putting σ ′ (s ′ 1 ) = d s,A 1 and σ ′ (s ′ 2 ) = d s,A 2 for all s, where d s,A 1 and d s,A 2 are distributions witnessing that σ is compliant with θ.It is easy to see that both σ and σ ′ in either of the above yield the same reward and dynamic penalty.
In the other direction, we define θ from θ ′ for all s ∈ S ♦ as follows.Let A 1 and A 2 be the sets allowed by θ ′ in s ′ 1 and s ′ 2 respectively.If A 1 = A 2 , then θ(s) allows this set with probability 1. Otherwise θ(s) allows the set A 1 ∪ A 2 with probability i:θ(s i )={b 1 ,b 2 } p i , the set A 1 with probability i:θ(s i )={b 1 } p i and the set A 2 with probability i:θ(s i )={b 2 } p i .The correctness can be proved similarly to above.

□
Our next goal is to show that, by varying the granularity $M$, we can get arbitrarily close to the optimal penalty for a randomised multi-strategy and, for the case of static penalties, define a suitable choice of $M$. This will be formalised in Theorem 4.7 shortly. First, we need to establish the following intermediate result, stating that, in the static case, in addition to Theorem 4.4 we can require the action subsets allowed by a multi-strategy to be ordered with respect to the subset relation.

Theorem 4.6. Let $G$ be a game, $\varphi = R^{r}_{\geq b}[\,C\,]$ a property and $(\psi, sta)$ a static penalty scheme. For any sound multi-strategy $\theta$ we can construct another sound multi-strategy $\theta'$ such that $pen_{sta}(\psi, \theta) \geq pen_{sta}(\psi, \theta')$ and for each $s \in S_\Diamond$, if $\mathrm{supp}(\theta'(s)) = \{B, C\}$, then either $B \subseteq C$ or $C \subseteq B$.
Proof.Let θ be a multi-strategy and fix s 1 such that θ takes two different actions B and C with probability p ∈ (0, 1) and 1 − p where B C and C B. If inf σ⊳θ,π E σ,π G,s 1 (r) = 0, then we can in fact allow deterministically the single set A(s 1 ) and we are done.Hence, suppose that the reward accumulated from s 1 is non-zero.
Suppose, w.l.o.g., that: 16) It must be the case that, for some D ∈ {B, C}, we have: (otherwise the minimal reward accumulated from s 1 is 0 since there is a compliant strategy that keeps returning to s 1 without ever accumulating any reward), and if the inequality in (4.16) is strict, then (4.17) holds for D = C. W.l.o.g., suppose that the above property holds for C. We define θ ′ by modifying θ and picking B ∪ C with probability p, C with (1 − p), and B with probability 0. Under θ, the minimal reward achievable by some compliant strategy is given as the least solution to the following equations [25,Theorem 7.3.3](as before, the notation of [25] requires "negative" models): The minimal rewards x ′ s achievable under θ ′ are defined analogously.In particular, for the equation with s 1 on the left-hand side we have: We show that the least solution x to x is also the least solution to x ′ .First, note that x is clearly a solution to any equation with s = s 1 on the left-hand side since these equations remain unchanged in both sets of equations.As for the equation with s 1 , we have min a∈B s ′ r(s 1 , a)+δ(s 1 , a)(s ′ )• xs ′ min a∈C s ′ r(s 1 , a)+δ(s 1 , a)(s ′ )• xs ′ , and so necessarily min a∈B s To see that x is the least solution to x ′ , we show that (i) for all s, if inf σ⊳θ ′ ,π E σ,π G,s (r) = 0 then xs = 0; and (ii) there is a unique fixpoint satisfying xs = 0 for all s such that inf σ⊳θ ′ ,π E σ,π G,s (r) = 0.For (i), suppose xs > 0. Let σ ′ be a strategy compliant with θ ′ , and π an arbitrary strategy.Suppose Pr σ ′ ,π G,s (F s 1 ) = 0, then there is a strategy σ compliant with θ which behaves exactly as σ ′ when starting from s, and by our assumption on the properties of xs we get that E σ,π G,s (r) > 0 and so E σ ′ ,π G,s (r) > 0. Now suppose that Pr σ ′ ,π G,s (F s 1 ) > 0. 
For this case, by condition (4.17), the fact that it holds for D = C and by defining θ ′ so that it picks C with nonzero probability we get that the reward under any strategy compliant with θ ′ is non-zero when starting in s 1 , and so E σ ′ ,π G,s (r) > 0. Point (ii) can be obtained by an application of [25,Proposition 7.3.4].

□
We can now return to the issue of how to vary the granularity M to get sufficiently close to the optimal penalty.We formalise this as follows.
We deal with the cases of static and dynamic penalties separately.For static penalties, let t ∈ S ♦ and θ(t)(A 1 ) = q, θ(t)(A 2 ) = 1 − q for A 1 ⊆ A 2 ⊆ A(t).Modify θ by rounding q up to the number q ′ which is the nearest multiple of 1  M .The resulting multi-strategy θ ′ is again sound, since any strategy compliant with θ ′ is also compliant with θ: the witnessing distributions (see Definition 3.2) d t,A 1 and d t,A 2 for θ are obtained from the distributions for all a ∈ A 2 ; note that both d t,A 1 and d t,A 2 are indeed probability distributions.Further, the penalty in θ ′ changes by at most 1 M a∈A(s) ψ(t, a).To obtain the result we repeat the above for all t.Now let us consider dynamic penalties.Intuitively, the claim follows since by making small changes to the multi-strategy, while not (dis)allowing any new actions, we only cause small changes to the reward and penalty.
Thus, by increasing the probability of allowing A 1 in t the soundness of the multi-strategy is preserved.
The above gives us that, for any error bound $\varepsilon$ and a fixed state $s$, there is an $x$ such that we can modify the decision of $\theta$ in $s$ by $x$, not violate the soundness property and increase the penalty by at most $\varepsilon / |S|$. We thus need to pick $M$ such that $1/M \leq x$. To finish the proof, we repeat this procedure for every state $s$.

□
For the sake of completeness, we also show that Theorem 4.6 does not extend to dynamic penalties. This is because, in this case, increasing the probability of allowing an action can lead to an increased penalty if one of the successor states has a high expected penalty. An example is shown in Fig. 5, for which we want to reach the goal state $s_3$ with probability at least 0.5. This implies $\theta(s_0)(\{b\}) \cdot \theta(s_1)(\{d\}) \geq 0.5$, and so $\theta(s_0)(\{b\}) > 0$, $\theta(s_1)(\{d\}) > 0$. If $\theta$ satisfies the condition of Theorem 4.6, then $\theta(s_0)(\{c\}) = \theta(s_1)(\{e\}) = 0$, so an opponent can always use $b$, forcing an expected penalty of $\theta(s_0)(\{b\}) + \theta(s_1)(\{d\})$, for a minimal value of $\sqrt{2}$. However, the sound multi-strategy $\theta$ with $\theta(s_0)(\{b\}) = \theta(s_0)(\{c\}) = 0.5$ and $\theta(s_1)(\{d\}) = 1$ achieves a dynamic penalty of just 1.

4.4. Optimisations. We conclude this section on MILP-based multi-strategy synthesis by presenting some optimisations that can be applied to our methods. The general idea is to add additional constraints to the MILP problems that will reduce the search space to be explored by a solver. We present two different optimisations, targeting different aspects of our encodings: (i) possible variable values; and (ii) penalty structures.
Bounds on variable values. In our encodings, for the variables $x_s$, we only specified very general lower and upper bounds that would constrain their value. Narrowing down the set of values that a variable may take can significantly reduce the search space and thus the solution time required by an MILP solver. One possibility that works well in practice is to bound the values of the variables by the minimal and maximal expected reward achievable from the given state, i.e., add the constraints:
\[
\inf_{\sigma, \pi} E^{\sigma,\pi}_{G,s}(r) \leq x_s \leq \sup_{\sigma, \pi} E^{\sigma,\pi}_{G,s}(r) \quad \text{for all } s \in S,
\]
where both the infima and suprema above are constants obtained by applying standard probabilistic model checking algorithms.
Actions with zero penalties. Our second optimisation exploits the case where an action has zero penalty assigned to it. Intuitively, this action could always be disabled without harming the overall penalty of the multi-strategy. On the other hand, enabling an action with zero penalty might be the only way to satisfy the property and therefore we cannot disable all such actions. However, it is enough to allow at most one action that has zero penalty. For simplicity of the presentation, we assume $Z_s = \{a \in A(s) \mid \psi(s,a) = 0\}$; then formally we add the constraints $\sum_{a \in Z_s} y_{s,a} \leq 1$ for all $s \in S_\Diamond$.

Experimental Results
We have implemented our techniques within PRISM-games [9], an extension of the PRISM model checker [21] for performing model checking and strategy synthesis on stochastic games. PRISM-games can thus already be used for (classical) controller synthesis problems on stochastic games. To this, we add the ability to synthesise multi-strategies using the MILP-based method described in Section 4. Our implementation currently uses CPLEX [32] or Gurobi [33] to solve MILP problems. We investigated the applicability and performance of our approach on a variety of case studies, some of which are existing benchmark examples and some of which were developed for this work. These are described in detail below and the files used can be found online [34]. Our experiments were run on a PC with a 2.8GHz Xeon processor and 32GB of RAM, running Fedora 14.
5.1. Deterministic Multi-strategy Synthesis. We first discuss the generation of optimal deterministic multi-strategies, the results of which are presented in Tables 1 and 2.

In Tab. 2, we show, for each different model, the penalty value of the optimal multi-strategy and the time to generate it. We report several different times, each for different combinations of the optimisations described in Section 4.4 (either no optimisations, one or both). For the last result, we give times for both MILP solvers: CPLEX and Gurobi.
Case studies. Now, we move on to give further details for each case study, illustrating the variety of ways that permissive controller synthesis can be used. Subsequently, we will discuss the performance and scalability of our approach.
cloud: We adapt a PRISM model from [6] to synthesise deployment of services across virtual machines (VMs) in a cloud infrastructure. Our property φ specifies that, with high probability, services are deployed to a preferred subset of VMs, and we then assign unit (dynamic) penalties to all actions corresponding to deployment on this subset. The resulting multi-strategy has very low expected penalty (see Tab. 2), indicating that the goal φ can be achieved whilst the controller experiences reduced flexibility only on executions with low probability.
android: We apply permissive controller synthesis to a model created for runtime control of an Android application that provides real-time stock monitoring. We extend the application to use multiple data sources and synthesise a multi-strategy which specifies an efficient runtime selection of data sources (φ bounds the total expected response time). We use static penalties, assigning higher values to actions that select the two most efficient data sources at each time point, and synthesise a multi-strategy that always provides a choice of at least two sources (in case one becomes unavailable), while preserving φ.
mdsm: Microgrid demand-side management (MDSM) is a randomised scheme for managing local energy usage. A stochastic game analysis [8] previously showed it is beneficial for users to selfishly deviate from the protocol, by ignoring a random back-off mechanism designed to reduce load at busy times. We synthesise a multi-strategy for a (potentially selfish) user, with the goal (φ) of bounding the probability of deviation (at either 0.1 or 0.01). The resulting multi-strategy could be used to modify the protocol, restricting the behaviour of this user to reduce selfish behaviour. To make the multi-strategy as permissive as possible, restrictions are only introduced where necessary to ensure φ. We also guide where restrictions are made by assigning (static) penalties at certain times of the day.
investor: This example [23] synthesises strategies for a futures market investor, who chooses when to reserve shares, operating in a (malicious) market which can periodically ban him/her from investing. We generate a multi-strategy that achieves 90% of the maximum expected profit (obtainable by a single strategy) and assign (static) unit penalties to all actions, showing that, after an immediate share purchase, the investor can choose his/her actions freely and still meet the 90% target.
team-form: This example [10] synthesises strategies for forming teams of agents in order to complete a set of collaborative tasks. Our goal (φ) is to guarantee that a particular task is completed with high probability (0.9999). We use (dynamic) unit penalties on all actions of the first agent and synthesise a multi-strategy representing several possibilities for this agent while still achieving the goal.
cdmsn: Lastly, we apply permissive controller synthesis to a model of a protocol for collective decision making in sensor networks (CDMSN) [8]. We synthesise strategies for nodes in the network such that consensus is achieved with high probability (0.9999). We use (static) penalties inversely proportional to the energy associated with each action a node can perform to ensure that the multi-strategy favours more efficient solutions.
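To make the role of static penalties in these case studies more concrete, the following minimal sketch computes the static penalty of a deterministic multi-strategy. It assumes that the static penalty is the sum, over all controller states, of the penalties ψ(s, a) of the actions a that the multi-strategy disallows in s. The function name, states, actions and penalty values are hypothetical and are not taken from any of the case studies.

```python
def static_penalty(available, penalties, multi_strategy):
    """Sum psi(s, a) over controller states s and actions a disallowed in s.

    available:      dict mapping each controller state to its set of actions
    penalties:      dict mapping (state, action) to a non-negative penalty
    multi_strategy: dict mapping each controller state to the allowed actions
    """
    total = 0.0
    for state, actions in available.items():
        allowed = multi_strategy[state]
        # A multi-strategy must allow a non-empty subset of the available actions.
        if not allowed or not allowed <= actions:
            raise ValueError(f"invalid set of allowed actions in state {state!r}")
        for action in actions - allowed:
            total += penalties[(state, action)]
    return total


# Hypothetical two-state example (not one of the models in the paper).
available = {"s0": {"east", "south"}, "s1": {"east", "north", "wait"}}
penalties = {
    ("s0", "east"): 1.0, ("s0", "south"): 1.0,
    ("s1", "east"): 2.0, ("s1", "north"): 1.0, ("s1", "wait"): 0.5,
}
theta = {"s0": {"east"}, "s1": {"east", "north"}}

# "south" is blocked in s0 and "wait" is blocked in s1.
result = static_penalty(available, penalties, theta)
```

Allowing every available action in every state gives penalty zero under this definition, which corresponds to the most permissive multi-strategy.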
Performance and scalability. Unsurprisingly, permissive controller synthesis is more costly to execute than classical controller synthesis: this is clearly seen by comparing the times in the rightmost column of Tab. 1 with the times in Tab. 2. However, we successfully synthesised deterministic multi-strategies for a wide range of models and properties, with model sizes ranging up to approximately 100,000 states. The performance and scalability of our method is affected (as usual) by the state space size. In particular, it is also affected
Table 5: Experimental results for approximating optimal randomised multi-strategies.

Conclusions
We have presented a framework for permissive controller synthesis on stochastic two-player games, based on generation of multi-strategies that guarantee a specified objective and are optimally permissive with respect to a penalty function. We proved several key properties, developed MILP-based synthesis methods and evaluated them on a set of case studies. In this paper, we have imposed several restrictions on permissive controller synthesis. Firstly, we focused on properties expressed in terms of expected total reward (which also subsumes the case of probabilistic reachability). A possible topic for future work would be to consider more expressive temporal logics or parity objectives. The results might also be generalised so that both positive and negative rewards can be used, for example by using the techniques of [30]. We also restricted our attention to memoryless multi-strategies, rather than the more general class of history-dependent multi-strategies. Extending our theory to the latter case and exploring the additional power brought by history-dependent multi-strategies is another interesting direction of future work.
which is a contradiction with $i \in I$. Hence, $\alpha_i \geq 2^{-m}(1 - \beta_i)$ and so:
$$\mathit{pen}_{\mathit{sta}}(\psi, \theta, t_i) + \mathit{pen}_{\mathit{sta}}(\psi, \theta, t'_i) = \beta_i w_i + \alpha_i 2^{3m} w_i \geq \beta_i w_i + 2^{-m}(1 - \beta_i) 2^{3m} w_i \geq w_i$$
We have:
$$\sum_{i \in I} w_i \leq \sum_{i \in I} \left( \mathit{pen}_{\mathit{sta}}(\psi, \theta, t_i) + \mathit{pen}_{\mathit{sta}}(\psi, \theta, t'_i) \right) \leq W + 2^{-m} \cdot W$$
and, because $\sum_{i \in I} w_i$ and $W$ are fractions with denominator $q$, by the choice of $m$, we can infer that $\sum_{i \in I} w_i \leq W$. Similarly: and again, because $\sum_{i \in I} v_i$ and $V$ are fractions with denominator $q$, by the choice of $m$ we can infer that $\sum_{i \in I} v_i \geq V$. Hence, in the instance of the knapsack problem, it suffices to pick exactly the items from $I$ to satisfy the restrictions.
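The reduction above starts from the decision version of the knapsack problem. As a reminder of what the item set I has to satisfy, the following brute-force sketch checks whether some set of items has total weight at most W and total value at least V. It assumes this standard formulation of the problem; the function name and the instance are illustrative only, items are indexed from 0 rather than 1, and exact fractions are used to mirror the "fractions with denominator q" in the proof.

```python
from fractions import Fraction
from itertools import combinations


def knapsack_decision(weights, values, max_weight, min_value):
    """Return a set I of item indices with total weight <= max_weight and
    total value >= min_value, or None if no such set exists.

    Exhaustive search over all subsets, so only suitable for tiny instances.
    """
    n = len(weights)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            total_weight = sum((weights[i] for i in subset), Fraction(0))
            total_value = sum((values[i] for i in subset), Fraction(0))
            if total_weight <= max_weight and total_value >= min_value:
                return set(subset)
    return None


# Illustrative instance: all numbers are fractions with denominator q = 4.
q = 4
weights = [Fraction(1, q), Fraction(2, q), Fraction(3, q)]
values = [Fraction(1, q), Fraction(3, q), Fraction(3, q)]

# Items 0 and 1 together weigh 3/4 and are worth 4/4.
witness = knapsack_decision(weights, values, Fraction(3, q), Fraction(4, q))

# With a weight budget of only 1/4 the value target 4/4 cannot be reached.
no_witness = knapsack_decision(weights, values, Fraction(1, q), Fraction(4, q))
```

The exhaustive search takes time exponential in the number of items, which is in keeping with the use of knapsack as the source of hardness in the reduction.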
Randomised multi-strategies with dynamic penalties. The proof is analogous to the proof above; we only need to modify the MDP and the computations. For an instance of the Knapsack problem given as before, we construct the following MDP:
We claim that there is a multi-strategy $\theta$ sound for the property $\mathsf{R}^{r}_{\geq V/n}[\,\mathsf{C}\,]$ such that $\mathit{pen}_{\mathit{dyn}}(\psi, \theta) \leq \frac{1}{n} W$ if and only if the answer to the Knapsack problem is "yes". In the direction ⇐, for $I \subseteq \{1, \ldots, n\}$ the set of items in the knapsack, we define $\theta$ by $\theta(t_i)(\{a_i\}) = 1$ for $i \in I$ and by allowing all actions in every other state.
In the direction ⇒, let us have a multi-strategy $\theta$ satisfying the assumptions. Let $P(s \to s')$ denote the lower bound on the probability of reaching $s'$ from $s$ under a strategy which complies with the multi-strategy $\theta$. Denote by $I \subseteq \{1, \ldots, n\}$ the indices $i$ such that $\theta(t_i)(\{a_i\}) > 0$. Observe that $P(t_i \to \top) = v_i$ if $i \in I$ and $P(t_i \to \top) = 0$ otherwise. Hence: and for the penalty, denoting $x_i := \theta(t_i)(\{a_i\})$, we get: because the strategy that maximises the penalty will pick $b_i$ whenever it is available. Hence, in the instance of the knapsack problem, it suffices to pick exactly the items from $I$ to satisfy the restrictions.
assigned to actions c ′ i and c′ i is equal to 1. We claim that there is a multi-strategy $\theta$ sound for the property $\mathsf{R}^{r}_{\geq 1}[\,\mathsf{C}\,]$ such that $\mathit{pen}_{\mathit{dyn}}(\psi, \theta) \leq 2 \cdot y/n$ if and only if $\sum_{i=1}^{n} \sqrt{x_i} \leq y$.

Figure 1: A stochastic game G used as a running example (see Ex. 1).

Figure 3: MILP encoding for deterministic multi-strategies with dynamic penalties.

Figure 4: A node in the original game G (left), and the corresponding nodes in the transformed game G ′ for approximating randomised multi-strategies (right, see Section 4.3).

Figure 5: Counterexample for Theorem 4.6 in case of dynamic penalties.

Table 1: Details of the models and properties used for experiments with deterministic multi-strategies, and execution times for single strategy synthesis.
Table 1 summarises the models and properties considered. For each model, we give: the case study name, any parameters used, the number of states (|S|) and of controller states (|S♦|), and the property used. The final column gives, for comparison purposes, the time required for performing classical (single) strategy synthesis on each model and property φ.
* No optimal solution to MILP problem within 5 minute time-out.

Table 2: Experimental results for synthesising optimal deterministic multi-strategies.

Table 4: State space growth for approximating optimal randomised multi-strategies.
* Sound but possibly non-optimal multi-strategy obtained after 5 minute MILP time-out.