DATA OPTIMIZATIONS FOR CONSTRAINT AUTOMATA

Abstract. Constraint automata (CA) constitute a coordination model based on finite automata on infinite words. Although CAs were originally introduced for modeling coordinators, an interesting new application is implementing coordinators (i.e., compiling CAs into executable code). Such an approach guarantees correctness-by-construction and can even yield code that outperforms hand-crafted code. The extent to which these two potential advantages materialize depends on the smartness of CA-compilers and the existence of proofs of their correctness. Every transition in a CA is labeled by a “data constraint” that specifies an atomic data-flow between coordinated processes as a first-order formula. At run-time, compiler-generated code must handle data constraints as efficiently as possible. In this paper, we present, and prove the correctness of, two optimization techniques for CA-compilers related to handling of data constraints: a reduction to eliminate redundant variables and a translation from (declarative) data constraints to (imperative) data commands expressed in a small sequential language. Through experiments, we show that these optimization techniques can have a positive impact on the performance of generated executable code.


Introduction
Context. In the early 2000s, hardware manufacturers shifted their attention from manufacturing faster (yet purely sequential) unicore processors to manufacturing slower (yet increasingly parallel) multicore processors. In the wake of this shift, concurrent programming became essential for writing scalable programs on general hardware. Conceptually, concurrent programs consist of processes, which implement modules of sequential computation, and protocols, which implement the rules of concurrent interaction that processes must abide by. As programmers have been writing sequential code for decades, programming processes poses no new fundamental challenges. What is new, and notoriously difficult, is programming protocols.
In ongoing work, we study an approach to concurrent programming based on syntactic separation of processes from protocols. In this approach, programmers write their processes
Problem. Briefly, our current ca-to-Java compiler translates passive data structures for cas into (re)active "coordinator threads". A coordinator thread is, effectively, a state machine whose transitions correspond one-to-one to transitions in a ca. Essentially, then, compiler-generated coordinator threads simulate cas by firing their transitions, continuously monitoring run-time data structures for their ports. To actually fire a transition, a coordinator thread must first check both that transition's synchronization constraint and its data constraint. The check for the synchronization constraint ensures that all ports involved in the transition have a pending i/o-operation (and are thus ready to participate in the transition); the check for the data constraint subsequently ensures that those pending i/o-operations can result in admissible data-flows.
Although data always flow through ports in a certain direction, we do not yet distinguish input ports from output ports; this comes later.
Out of ports and memory cells, we construct data variables, which serve as the variables in our calculus. Every data variable designates a datum. For instance, ports can hold data (to exchange), so every port serves as a data variable in the calculus. Similarly, memory cells can hold data, but the meaning of "to hold" differs in this case. Ports hold data only for exchange during a coordination step (i.e., transiently, in passing). In contrast, memory cells hold data also before and after a coordination step. Consequently, in the context of data variables, a memory cell before a coordination step and the same memory cell after that step have different identities. After all, the content of the memory cell may have changed in between. Therefore, inspired by notation from Petri nets [Rei85], for every memory cell m, both • m and m • serve as data variables: • m refers to the datum in m before a coordination step, while m • refers to the datum in m after that coordination step. We abbreviate sets { • m | m ∈ M } and {m • | m ∈ M } as • M and M • .
Definition 2.5 (data variables). A data variable is an object $x$ generated by the following grammar:
$$x ::= p \mid {}^\bullet m \mid m^\bullet$$
$\mathbb{X}$ denotes the set of all data variables. $2^{\mathbb{X}}$ denotes the set of all sets of data variables, ranged over by $X$.
We subsequently assign meaning to data variables with data assignments.
Definition 2.6 (data assignments). A data assignment is a partial function from data variables to data. $\mathrm{Assignm} = \mathbb{X} \rightharpoonup \mathbb{D}$ denotes the set of all data assignments, ranged over by $\sigma$. $2^{\mathrm{Assignm}}$ denotes the set of all sets of data assignments, ranged over by $\Sigma$.
Essentially, a data assignment σ comprehensively models a coordination step involving the ports and memory cells in Dom(σ) and the data in Img(σ). As coordinators have only finitely many ports and memory cells in practice, we stipulate that the domain of every data assignment is finite, too. The same holds for their support. We proceed by defining data functions and data relations, which serve as the functions and predicates in our calculus. Together, data, data functions, and data relations constitute our set of extralogicals. To avoid excessive machinery (but at the cost of formal imprecision), we do not distinguish extralogical symbols from their interpretation as data, data functions, and data relations.
Definition 2.7 (data functions). A data function is a function from tuples of data to data. $\mathbb{F} = \{\mathbb{D}^k \to \mathbb{D} \mid k > 0\}$ denotes the set of all data functions, ranged over by $f$.
Definition 2.8 (data relations). A data relation is a relation on tuples of data. $\mathbb{R} = \{2^{\mathbb{D}^k} \mid k > 0\}$ denotes the set of all data relations, ranged over by $R$.
Henceforth, we write elements of F in camel case monospace (e.g., divByThree, inc), while we write elements of R in capitalized camel case monospace (e.g., Odd, SmallerThan).
Out of data variables, data, and data functions, we construct data terms, which serve as the terms in our calculus. Every data term represents a datum.
Definition 2.9 (data terms). A data term is an object $t$ generated by the following grammar:
$$t ::= x \mid d \mid f(t_1, \ldots, t_{k \geq 1})$$
$\mathrm{Term}$ denotes the set of all data terms. $2^{\mathrm{Term}}$ denotes the set of all sets of data terms, ranged over by $T$.
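Henceforth, let $<_{\mathrm{Term}}$ denote some strict total order on $\mathrm{Term}$.
Given a data assignment whose domain includes at least the data variables in a data term t, we can evaluate t to a datum. (To evaluate t, additionally, every data function application in t must have the right number of inputs: the arity of a data function and its number of inputs must match. Henceforth, we tacitly assume that this always holds true.)
Definition 2.10 (evaluation). $\mathrm{eval} : \mathrm{Assignm} \times \mathrm{Term} \to \mathbb{D} \cup \{\mathrm{nil}\}$ denotes the function defined by the following equations:
$$\mathrm{eval}_\sigma(x) = \begin{cases} \sigma(x) & \text{if } x \in \mathrm{Dom}(\sigma) \\ \mathrm{nil} & \text{otherwise} \end{cases}$$
$$\mathrm{eval}_\sigma(d) = d$$
$$\mathrm{eval}_\sigma(f(t_1, \ldots, t_k)) = \begin{cases} f(\mathrm{eval}_\sigma(t_1), \ldots, \mathrm{eval}_\sigma(t_k)) & \text{if } \mathrm{eval}_\sigma(t_1) \neq \mathrm{nil} \text{ and } \cdots \text{ and } \mathrm{eval}_\sigma(t_k) \neq \mathrm{nil} \\ \mathrm{nil} & \text{otherwise} \end{cases}$$
Out of data terms, data relations, and data variables, we construct data constraints.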
Every data constraint characterizes a set of data assignments through an entailment relation. This entailment relation, thus, formalizes the semantics of data constraints. Let ϕ[t/x] denote data constraint ϕ with data term t substituted for every occurrence of data variable x (in a capture-free way).
Contradiction, tautology, and (multiary) conjunction have standard semantics [Rau10]. Negation ¬a means that, despite all free variables in a having a value, a does not hold true; the extra condition on the free variables in a ensures the monotonicity of entailment (i.e., σ| X |= ϕ implies σ |= ϕ, for all X, ϕ). Data atom t 1 = t 2 means that t 1 and t 2 evaluate to the same datum. Typical examples include p 1 = p 2 (i.e., the same datum passes through ports p 1 and p 2 ), p = m • (i.e., the datum that passes through port p enters the buffer modeled by memory cell m), and p = • m (i.e., the datum in the buffer modeled by memory cell m exits that buffer and passes through port p). Tautology ⊤ means that it does not matter which data flow through which ports. Henceforth, let ⇒ and ≡ denote the implication relation and the equivalence relation on data constraints, derived from |= in the usual way [Rau10]. Furthermore, let Variabl(ϕ) denote the set of data variables in ϕ, and let Free(ϕ) denote its set of free data variables.
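The evaluation of data terms and the informal reading of data atoms, negation, and conjunction given above can be made concrete with a small Python sketch. It is an illustration under stated assumptions only: the tuple encoding of terms and constraints, the names NIL, evaluate, atom_terms, and holds, and the use of a sentinel object for nil are all hypothetical and are not part of the formal development. The sketch assumes that an equality holds only if both sides evaluate to a (non-nil) datum.

```python
NIL = object()  # hypothetical sentinel standing in for nil


def evaluate(sigma, term):
    """Evaluate a data term under a data assignment sigma (a dict)."""
    kind = term[0]
    if kind == "var":                       # ("var", name)
        return sigma.get(term[1], NIL)
    if kind == "datum":                     # ("datum", d)
        return term[1]
    _, f, args = term                       # ("app", f, [t1, ..., tk])
    values = [evaluate(sigma, t) for t in args]
    return NIL if any(v is NIL for v in values) else f(*values)


def atom_terms(atom):
    """Terms of an atom: ("eq", t1, t2) or ("rel", R, [t1, ..., tk])."""
    return [atom[1], atom[2]] if atom[0] == "eq" else list(atom[2])


def holds(sigma, phi):
    """Check whether sigma satisfies phi (atoms, negated atoms, conjunctions)."""
    kind = phi[0]
    if kind == "eq":
        left, right = evaluate(sigma, phi[1]), evaluate(sigma, phi[2])
        return left is not NIL and right is not NIL and left == right
    if kind == "rel":
        values = [evaluate(sigma, t) for t in phi[2]]
        return all(v is not NIL for v in values) and phi[1](*values)
    if kind == "not":
        atom = phi[1]
        # Negation holds only if every term of the atom has a value.
        defined = all(evaluate(sigma, t) is not NIL for t in atom_terms(atom))
        return defined and not holds(sigma, atom)
    if kind == "and":
        return all(holds(sigma, lit) for lit in phi[1])
    raise ValueError(f"unknown constraint: {phi!r}")


def odd(n):
    return n % 2 == 1


STORE = ("eq", ("var", "p"), ("var", "m•"))          # p = m•
NOT_ODD = ("not", ("rel", odd, [("var", "G")]))      # ¬Odd(G)
```

For instance, holds({"p": 5, "m•": 5}, STORE) is true, whereas holds({}, NOT_ODD) is false because G has no value. Extending the empty assignment to {"G": 4} makes NOT_ODD true, which matches the monotonicity remark above.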
Constraint Automata. We proceed by formally defining a ca a, which models a coordinator, as a tuple consisting of a set of states Q, a triple of sets of ports (P all , P in , P out ), a set of memory cells M , a transition relation −→, and an initial state q 0 . The set P all contains all ports monitored and controlled by a, while P in and P out contain only its input ports and its output ports. Although P all contains the union of P in and P out , the converse does not necessarily hold: besides input and output ports, P all may also contain internal ports. If a ca has internal ports, we call it a composite; otherwise, we call it a primitive.
Definition 2.13 (states). A state represents a configuration of a coordinator. $\mathbb{Q}$ denotes the set of all states, ranged over by $q$. $2^{\mathbb{Q}}$ denotes the set of all sets of states, ranged over by $Q$.
Definition 2.14 (constraint automata). A constraint automaton is a tuple $(Q, (P^{\mathrm{all}}, P^{\mathrm{in}}, P^{\mathrm{out}}), M, \longrightarrow, q_0)$ where:
• $Q \subseteq \mathbb{Q}$ (states)
• $P^{\mathrm{in}}, P^{\mathrm{out}} \subseteq P^{\mathrm{all}}$ (ports)
• $M$ is a set of memory cells (memory cells)
• $\longrightarrow \subseteq Q \times 2^{P^{\mathrm{all}}} \times \mathrm{DC} \times Q$ such that $(q, P, \varphi, q') \in \longrightarrow$ implies $\mathrm{Free}(\varphi) \subseteq P \cup {}^\bullet M \cup M^\bullet$ (transitions)
• $q_0 \in Q$ (initial state)
$\mathrm{Autom}$ denotes the set of all constraint automata, ranged over by $a$.
The requirement Free(ϕ) ⊆ P ∪ • M ∪ M • means that the effect of a transition remains local to its own scope: a transition cannot affect, or be affected by, ports outside its synchronization constraint and memory cells outside its ca. Henceforth, let Dc(a) denote the set of data constraints that occur on the transitions of a ca a (not to be confused with DC, which denotes the set of all data constraints; see Definition 2.11).
Figure 2 shows an example of a ca. In graphical representations of cas, we annotate ports in synchronization constraints with superscripts "in" and "out" to indicate their direction; internal ports have no such annotation. The ca in Figure 2 models a producers/consumer coordinator with two input ports A and B (each shared with a different producer, presumably) and an output port C (shared with the consumer). Initially, a put by the producer on A can complete, causing that producer to offer a datum into internal buffer x (modeled by data constraint A = x • ). Alternatively, a put by the other producer on B can similarly complete. Subsequently, only a get by the consumer on C can complete, causing the consumer to accept the datum previously stored in x. This coordinator, thus, enforces asynchronous, unordered, reliable communication from two producers to a consumer.
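To make the producers/consumer example more tangible, the following Python sketch encodes a two-state automaton with the behavior described for Figure 2 and fires its transitions. This is a rough sketch under assumptions: the state names "empty" and "full", the Transition class, the step function, and the data constraint C = • x on the consumer's transition are hypothetical choices made for illustration (only A = x • is stated explicitly in the text), and each data constraint is restricted to a single equality between a port and a memory variable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    source: str
    ports: frozenset    # synchronization constraint P
    constraint: tuple   # one equality (port, memory variable)
    target: str


# Hypothetical encoding of the ca described for Figure 2.
TRANSITIONS = [
    Transition("empty", frozenset({"A"}), ("A", "x•"), "full"),
    Transition("empty", frozenset({"B"}), ("B", "x•"), "full"),
    Transition("full", frozenset({"C"}), ("C", "•x"), "empty"),
]


def step(state, memory, pending):
    """Fire the first enabled transition, if any.

    pending maps every port with a pending i/o-operation to the datum of
    its put (or to None for a get). Returns (state, memory, delivered),
    or None if no transition can fire.
    """
    for t in TRANSITIONS:
        if t.source != state or not t.ports <= set(pending):
            continue
        port, variable = t.constraint
        if variable.endswith("•"):          # port = m•: store the datum in m
            cell = variable[:-1]
            return t.target, {**memory, cell: pending[port]}, {}
        cell = variable[1:]                 # port = •m: load the datum from m
        return t.target, memory, {port: memory[cell]}
    return None
```

Starting from state "empty", step("empty", {}, {"A": 7}) stores 7 in x and moves to "full"; a subsequent get on C then delivers 7, while a second put on A in state "full" cannot complete.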
The precise definitions of language acceptance and bisimulation for cas do not matter in this paper. Likewise, the precise definitions of behavioral equivalence (based on language acceptance) and behavioral congruence (based on bisimulation), such that behavioral congruence implies behavioral equivalence, do not matter. These definitions appear elsewhere [Jon16a]. The only result about the behavior of cas that matters in this paper is the following intuitive proposition. Let ≃ denote behavioral congruence, and let a[ϕ ′ /ϕ] denote ca a with data constraint ϕ ′ substituted for every occurrence of data constraint ϕ.
Proposition 2.15. $\varphi \equiv \varphi'$ implies $a \simeq a[\varphi'/\varphi]$.
This proposition means that we can freely replace every data constraint in a ca with an equivalent data constraint in a behavior-neutral way. This proposition plays a key role in the correctness proofs of the two optimization techniques presented in the rest of this paper.
Instead of defining cas directly, in practice, we construct them compositionally using two binary operations [BSAR06,Jon16a]: join, denoted by ⊗, and hide, denoted by ⊖. Join performs parallel composition: it "glues" together two cas on their shared ports, after which those shared ports become internal. Essentially, whenever two cas have joined, if a transition in one of those cas involves shared ports, that transition can fire only synchronously with a transition in the other ca that involves exactly the same shared ports (i.e., at any time, the cas must agree on firing transitions involving their shared ports).
Figure 4. Behavior of the primitives in Figure 3, in terms of data-flows between their ports:
Sync(p1; p2): Infinitely often atomically accepts a datum d on its input port p1, then offers d on its output port p2.
SyncDrain(p1, p2; ): Infinitely often atomically accepts data d1 and d2 on its input ports p1 and p2, then loses d1 and d2.
LossySync(p1; p2): Infinitely often either atomically accepts a datum d on its input port p1, then offers d on its output port p2, or atomically accepts a datum d on p1, then loses d.
Filter R (p1; p2): Infinitely often either atomically accepts a datum d on its input port p1, then establishes that d satisfies data relation R, then offers d on its output port p2, or atomically accepts a datum d on p1, then establishes that d violates R, then loses d.
Fifo{; m}(p1; p2): Infinitely often first atomically accepts a datum d on its input port p1, then stores d in its memory cell m, and subsequently atomically loads d from m, then offers d on its output port p2.
Merg2(p1, p2; p3): Infinitely often atomically accepts a datum d either on its input port p1 or on its input port p2, then offers d on its output port p3.
Repl2(p1; p2, p3): Infinitely often atomically accepts a datum d on its input port p1, then offers d on its output ports p2 and p3.
BinOp f (p1, p2; p3): Infinitely often atomically accepts data d1 and d2 on its input ports p1 and p2, then applies data function f to d1 and d2, then offers f(d1, d2) on its output port p3.
Hide performs port abstraction: it "cuts" a port out from a ca. Typically, we use hide to remove internal ports from the definition of a ca, as such ports do not directly contribute to its observable behavior (i.e., processes cannot perform i/o-operations on internal ports).
To compositionally construct a ca, then, we first join a number of "small" primitive cas into a "large" composite ca. Second, we hide all internal ports from this large ca to make its definition more concise (without losing essential information). Figure 3 shows a number of common primitive cas; Figure 4 explains their behavior in terms of data-flows between their ports. In these figures, every ca has a signature formatted as follows:
name extralogicals {internal ports; memory cells}(input ports; output ports)
Instead of writing explicit ⊗/⊖-expressions to construct cas, in practice, we often draw them in a graphical, more intuitive syntax, based on the coordination language Reo [Arb04,Arb11]. Essentially, in this syntax, we draw a (hyper)digraph, where every vertex denotes a port, and where every (hyper)arc denotes a ca consisting of the ports denoted by its connected vertices. By convention, every vertex has degree 1 (for input and output ports) or 2 (for internal ports). The ⊗/⊖-expression denoted by a digraph, then, is the join of (the denotations of) its arcs, and the hide of (the denotations of) its vertices of degree 2. Intuitively, every transition in the (evaluated) ⊗/⊖-expression for a digraph corresponds to an atomic flow of data along the arcs in that digraph. Figure 5 shows digraphs for the primitives in Figure 3; Figure 6 shows digraphs for example composites.
In Figure 6, Sync 2 models the same coordinator as a single Sync: it enforces a standard synchronous channel protocol between a producer and a consumer. Fifo 2 models a coordinator between a producer and a consumer that enforces a standard (order-preserving) asynchronous channel protocol with a buffer of capacity 2. LateAsyncMerg 2 is (a behaviorally congruent ca to) the ca in Figure 2. EarlyAsyncMerg 2 models a coordinator between two producers and one consumer, like LateAsyncMerg 2 . The difference between the two is that with EarlyAsyncMerg 2 , every producer has its own buffer, which results in significantly different behavior (as producers no longer need to wait for each other before their puts can complete). Rout 2 models a coordinator between one producer and two consumers that enforces a symmetric protocol to Merg2: infinitely often, it atomically accepts a datum on its input port, then offers it on one of its output ports. Finally, OddFib 2 models a coordinator between one producer and two consumers. Whenever the i-th put by the producer completes, one of two things happens. If the i-th Fibonacci number is even, the datum put by the producer is lost, and no interaction occurs between the producer and the two consumers. If the i-th Fibonacci number is odd, in contrast, a get by each of the two consumers must complete at the same time (i.e., atomically, synchronously). In this case, specifically, the datum put by the producer is lost, while the consumers get the i-th Fibonacci number. This coordinator, thus, enforces synchronous, unreliable (in the sense just described) communication from a producer to two consumers.
The primitives in Figure 5 were introduced by Arbab [Arb04], except BinOp, which was introduced by Jongmans [Jon16a] (BinOp is, however, a generalization of primitive Join, which was introduced by Kokash and Arbab [KA09]). LateAsyncMerg and EarlyAsyncMerg in Figure 6 are probably folklore; these two names were first used by Jongmans [Jon16a]. OddFib is based on Arbab's Fibonacci [Arb05]. Rout was introduced by Arbab [Arb05].

Optimization I: Eliminate (Instead of Hide)
Motivating Example. To illustrate the need for our first technique to optimize the performance of checking data constraints, presented in this section, we start with a motivating example. Recall the Sync primitive in Figure 3. Sync has a special property: it acts as a kind of algebraic identity of join and hide, in the following sense. Let a[p ′ /p] denote ca a with port p ′ substituted for every occurrence of port p. Let a range over the set of all cas that (i) have an input port p 2 and (ii) in which port p 1 does not occur. Then:
$$(\mathrm{Sync}(p_1; p_2) \otimes a) \ominus p_2 \simeq a[p_1/p_2]$$
In words, (Sync(p 1 ; p 2 ) ⊗ a) ⊖ p 2 and a are behaviorally congruent modulo substitution of p 1 for p 2 . Generally, we can "prefix" (i.e., join on its input ports) or "suffix" (i.e., join on its output ports) any number of Syncs to a ca without affecting that ca's behavior, in the sense just described. Given this property, it seems not unreasonable to assume that compiler-generated code for a single Sync has the same performance as a chain of 64 Syncs. Slightly more formally, if ∼ means "has the same performance", one may expect:
$$\mathrm{Sync}(p_1; p_{65}) \sim (\mathrm{Sync}(p_1; p_2) \otimes \cdots \otimes \mathrm{Sync}(p_{64}; p_{65})) \ominus p_2 \ominus \cdots \ominus p_{64}$$
Our compiler-generated code, however, violates this equation: a single Sync fires 27 million transitions in four minutes, whereas the chain of 64 Syncs fires only nine million transitions in the same time.
To understand this phenomenon, we first present the definition of hide [BSAR06,Jon16a].
Definition 3.1 (hide). $\ominus : \mathrm{Autom} \times \mathbb{P} \to \mathrm{Autom}$ denotes the function defined by the following equation:
$$(Q, (P^{\mathrm{all}}, P^{\mathrm{in}}, P^{\mathrm{out}}), M, \longrightarrow, q_0) \ominus p = (Q, (P^{\mathrm{all}} \setminus \{p\}, P^{\mathrm{in}} \setminus \{p\}, P^{\mathrm{out}} \setminus \{p\}), M, \longrightarrow_\ominus, q_0)$$
where $\longrightarrow_\ominus$ denotes the smallest relation induced by the following rule:
$$\frac{q \xrightarrow{P,\ \varphi} q'}{q \xrightarrow{P \setminus \{p\},\ \exists p.\varphi}_\ominus q'}$$
In words, hide removes a port both from sets P all , P in , P out and from every transition. (Because P in , P out ⊆ P all by Definition 2.14, we need to remove p not only from P in and P out but also from P all .) But whereas hide removes ports from synchronization constraints syntactically (effectively making those constraints smaller), it removes ports from data constraints only semantically. Indeed, ⊖ does not reduce the size of data constraints (in terms of the number of data variables, data literals, and existential quantifications) but, in fact and in contrast, makes data constraints larger by enveloping them in existential quantifications: the transition in the single Sync has just p 1 = p 65 as its data constraint, whereas the corresponding transition in the chain of 64 Syncs has ∃p 64 .· · ·∃p 2 .(p 1 = p 2 ∧ · · · ∧ p 64 = p 65 ). Clearly, although the two data constraint expressions are semantically (logically) equivalent, checking the latter data constraint expression requires more resources than the former.
Below, we develop a variant of hide, called eliminate, that, when applied 63 times to the chain of 64 Syncs, yields the same data constraint as the one in the single Sync. The key idea is to mechanically simplify data constraint expressions using the equivalence ∃p.
Eliminate. First, we need to introduce the concept of determinants of free data variables in data constraints. For a data constraint ϕ and one of its free data variables x ∈ Free(ϕ), the set of determinants of x consists of those terms that precisely determine the datum σ(x) assigned to x in any data assignment σ that satisfies ϕ (i.e., σ |= ϕ). "Precisely" here means that a determinant neither overspecifies nor underspecifies σ(x). Thus, if a set of determinants contains multiple data terms, each of those data terms evaluates to the same datum under σ. Determinants furthermore determine σ(x) independent of x itself: no determinant of x has x among its free data variables (i.e., determinants have no recursion).
Definition 3.2 (determinants). $\mathrm{Determ} : \mathbb{X} \times \mathrm{DC} \to 2^{\mathrm{Term}}$ denotes the function defined by the following equations:
For instance, consider the following data constraint:
(This data constraint appears in the ⊗/⊖-expression denoted by the digraph for OddFib in Figure 6.) The free data variables in ϕ eg have the following determinants:
Next, let a denote a ca, and let ϕ denote one of its data constraints. Suppose that we hide x from a with ⊖. By Definition 3.1 of ⊖, the transition(s) of a previously labeled by ϕ are now labeled with ∃x.ϕ. However, if x has determinants, instead of enveloping ϕ in an existential quantification as ⊖ does, we can alternatively perform a syntactic substitution of one of those determinants for x. We formalize such a substitution as follows.
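Definition 3.3 (syntactic existential quantification). $\mathrm{exists} : \mathbb{X} \times \mathrm{DC} \to \mathrm{DC}$ denotes the function defined by the following equation:
$$\mathrm{exists}_x(\varphi) = \begin{cases} \varphi[\min(\mathrm{Determ}_x(\varphi))/x] & \text{if } \mathrm{Determ}_x(\varphi) \neq \emptyset \\ \exists x.\varphi & \text{otherwise} \end{cases}$$
In this definition, function min(·) takes the least element in Determ x (ϕ), under the global order on data terms < Term , to ensure that exists always produces the same output under the same input. The following equations exemplify the (nested) application of exists on ϕ eg .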
We define eliminate in terms of exists.
Definition 3.4 (eliminate). Eliminate denotes the function of type Autom × P → Autom defined by the following equation:
where −→ denotes the smallest relation induced by the following rule:
In the previous definition, we use exists to remove ports from data constraints. Although Definition 3.3 of exists also allows for removing data variables for memory cells, we do not pursue such elimination in this paper.
Correctness and Effectiveness. We conclude this section by establishing the correctness and effectiveness of eliminate. We consider eliminate correct if it yields a ca behaviorally congruent to the ca that hide yields. Before formulating this as a theorem, the following lemma first states the equivalence of existential quantification and exists.
From Proposition 2.15 and Lemma 3.5, we conclude the following correctness theorem.
We consider eliminate effective if, after eliminating a port p from a ca a, that port no longer occurs in any of that ca's data constraint expressions. Generally, however, such unconditional effectiveness does not hold true: if a has a data constraint ϕ in which p occurs, but p has no determinants in ϕ, eliminate has nothing to replace p with. In that case, exists p (ϕ) = ∃p.(ϕ), and consequently, eliminate does not have its intended (simplifying) effect. Eliminate does satisfy a weaker, but useful, form of effectiveness, though. To formulate this as a theorem, we first define a function that computes ever-determined ports. We call a port p ever-determined in a ca a iff both p occurs in a and every data constraint in a in which p occurs has a determinant for p.
Definition 3.7 (ever-determined ports). $\mathrm{Edp} : \mathrm{Autom} \to 2^{\mathbb{P}}$ denotes the function defined by the following equation:
For instance, p 1 , p 2 , and p 3 all qualify as ever-determined in Merg2 in Figure 3. To understand the ever-determinedness of p 1 , observe that p 1 occurs in the data constraint on the top transition in Merg2 and that p 1 has a determinant in that data constraint (namely p 3 ); because p 1 does not occur in the data constraint on the bottom transition in Merg2, p 1 indeed qualifies as ever-determined. A similar explanation applies to p 2 . To understand the ever-determinedness of p 3 , observe that p 3 occurs in the data constraint on both transitions in Merg2 and that p 3 has a determinant in both these data constraints (namely p 1 and p 2 ). Consequently, also p 3 qualifies as ever-determined. In contrast, p 1 in members of Filter in Figure 3 does not qualify as ever-determined, because p 1 occurs in the data constraint on the top transition in Filter but does not have a single determinant in that data constraint.
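The Merg2 and Filter examples can be replayed with a short Python sketch. Everything in it is an assumption made for illustration: data constraints are encoded as lists of literals, determinants are approximated by "the other side of an equality that mentions the port", "p occurs in a" is approximated by "p occurs in at least one data constraint", and the concrete constraints MERG2 and FILTER are guesses at the transitions of Figure 3 based on the behavior described in Figure 4.

```python
def free_vars(literal):
    """Variables of ("eq", x, y), ("rel", name, x), or ("not_rel", name, x)."""
    return set(literal[1:]) if literal[0] == "eq" else {literal[2]}


def determinants(x, constraint):
    """Simplified determinants of x: the other side of equalities on x."""
    dets = set()
    for lit in constraint:
        if lit[0] == "eq" and x in lit[1:]:
            other = lit[2] if lit[1] == x else lit[1]
            if other != x:
                dets.add(other)
    return dets


def ever_determined_ports(ports, constraints):
    """Ports that occur somewhere and have a determinant wherever they occur."""
    result = set()
    for p in ports:
        occurrences = [c for c in constraints
                       if any(p in free_vars(lit) for lit in c)]
        if occurrences and all(determinants(p, c) for c in occurrences):
            result.add(p)
    return result


# Hypothetical data constraints, one list of literals per transition.
MERG2 = [[("eq", "p1", "p3")], [("eq", "p2", "p3")]]
FILTER = [[("rel", "R", "p1"), ("eq", "p1", "p2")], [("not_rel", "R", "p1")]]
```

Under these assumptions, all three ports of MERG2 come out as ever-determined, whereas for FILTER only p2 does: p1 occurs in the constraint of the lossy transition without any equality that could determine it.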
The following theorem states the effectiveness of eliminate, conditional on ever-determinedness: after eliminating an ever-determined port from a ca, that port no longer occurs in any of that ca's data constraints.
"Effectiveness" refers to a rather theoretical property; it says nothing yet about the impact of applying eliminate in practice. In Section 5, we study this impact through a number of experiments; in this section, we only revisit our motivating example. By using eliminate instead of hide, and after removing t = t literals (each of which trivially equates to ⊤), we get exactly the same data constraint in the chain of 64 Syncs as in the single Sync. Consequently, the compiler-generated code for the chain of 64 Syncs has the same performance as compiler-generated code for the single Sync (which corresponds to a 3× speedup relative to unoptimized code generated with hide instead of eliminate).
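The effect on the chain of 64 Syncs can be reproduced with a few lines of Python. The sketch below is a simplification under assumptions, not the formal definitions: ports are represented by their integer indices, a data constraint is a list of equalities between ports, determinants are approximated by the other side of any equality that mentions the port, and the total order on terms is the order on integers. The function names are hypothetical.

```python
def determinants(x, literals):
    """Simplified determinants: the other side of any equality on x."""
    dets = set()
    for lhs, rhs in literals:
        if lhs == x and rhs != x:
            dets.add(rhs)
        elif rhs == x and lhs != x:
            dets.add(lhs)
    return dets


def exists(x, literals):
    """Substitute the least determinant for x; None means 'keep ∃x.ϕ'."""
    dets = determinants(x, literals)
    if not dets:
        return None
    t = min(dets)
    return [(t if lhs == x else lhs, t if rhs == x else rhs)
            for lhs, rhs in literals]


def drop_trivial(literals):
    """Remove literals of the form t = t."""
    return [(lhs, rhs) for lhs, rhs in literals if lhs != rhs]


def eliminate_chain(n):
    """Data constraint of a chain of n Syncs after eliminating p2, ..., pn."""
    literals = [(i, i + 1) for i in range(1, n + 1)]   # p_i = p_(i+1)
    for p in range(2, n + 1):
        literals = drop_trivial(exists(p, literals))
    return literals


print(eliminate_chain(64))  # [(1, 65)]
```

After 63 eliminations, the 64 equalities collapse into the single literal p1 = p65, without any existential quantifier left, which is the data constraint of a single Sync.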

Optimization II: Commandify (Instead of Seek)
Data Commands. In the previous section, we presented a first technique to optimize the performance of checking data constraints. In this section, we present a second technique to further optimize the performance of such checks and, in particular, the expensive constraint solver calls involved. Essentially, this new technique comprises the generation of a little, dedicated constraint solver for every data constraint at compile-time. At run-time, then, instead of calling a general-purpose constraint solver to check a data constraint, the compiler-generated coordinator thread for a ca calls a more efficient constraint solver generated specifically for that data constraint. First, in this subsection, we describe a basic sequential language (syntax, semantics, proof system) in which to express such dedicated constraint solvers; in the next subsections, we present the process of their generation.
General-purpose techniques for constraint solving (an NP-complete problem for finite domains) inflict not only a solving overhead proportional to the size of a data constraint but also a constant overhead for preparing, making, and processing the result of every call to a full-fledged solver. Although we generally cannot escape using such techniques for checking arbitrary data constraints, a better alternative exists for many data constraints in practice. The crucial observation is that the data constraints in all cas that we know of in the literature really constitute declarative specifications of a relatively straightforward imperative program. What we need to do, then, is develop a technique for statically translating such a data constraint ϕ, off-line at compile-time, into a small imperative program that computes a data assignment σ such that σ |= ϕ, without resorting to general-purpose constraint solving. We call such a small program a data command and the translation from data constraints to data commands commandification. Essentially, we formalize and automate what programmers do when they write an imperative implementation of a declarative specification expressed as a data constraint. After presenting our technique, we make the class of data constraints currently supported by commandification precise.
In the previous definition, ε denotes the empty data command, x := t denotes an assignment, and ϕ -> π denotes a failure statement. Henceforth, we often write "value of x" instead of "the datum assigned to x".
We define an operational semantics for data commands based on an operational semantics for a sequential language by Apt et al. [AdBO09]. As data commands are supposed to solve data constraints, we model the data state that a data command executes in with either a function from data variables to data (i.e., a data assignment) or the distinguished object fail, which models abnormal termination. A data configuration, then, consists of a data command and a data state to execute that data command in.
Definition 4.3 (data configurations). A data configuration is a pair (π, ς) where:
• π ∈ Comm (data command)
• ς ∈ Assignm ∪ {fail} (data state)
Conf denotes the set of all data configurations.
A transition system on configurations formalizes their evolution in time.
Note that ϕ -> π indeed denotes a failure statement rather than a conditional statement: if the current data state violates the guard ϕ, execution abnormally terminates. Through the transition system in Definition 4.4, we associate two different semantics with data commands. The partial correctness semantics of a data command π under a set of initial data states Σ consists of all the final data states Σ ′ to which any of those initial states may evolve through execution of π. Notably, this partial correctness semantics ignores abnormal termination. In contrast, the total correctness semantics of π under Σ consists not only of Σ ′ but, if at least one execution abnormally terminates, also of fail.
Definition 4.5 (correctness semantics of data commands). $\mathrm{Final}$ and $\mathrm{Final}_{\mathrm{fail}}$ denote the functions $\mathrm{Comm} \times 2^{\mathrm{Assignm}} \to 2^{\mathrm{Assignm} \cup \{\mathrm{fail}\}}$ defined by the following equations:
Apt et al. showed that all programs from a superset of the set of all data commands execute deterministically [AdBO09]. Consequently, also data commands execute deterministically.
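The difference between the two semantics can be illustrated with a tiny interpreter. The following Python sketch is not the transition system of Definition 4.4; it is a rough approximation under assumptions: a data command is a list of statements executed in sequence, terms and guards are Python callables over a dictionary that plays the role of a data state, and the names FAIL, run, final, and final_fail are hypothetical.

```python
FAIL = object()  # hypothetical stand-in for the distinguished object fail


def run(command, state):
    """Execute a data command (a list of statements) on a data state."""
    for stmt in command:
        if stmt[0] == "assign":             # x := t
            _, x, term = stmt
            state = {**state, x: term(state)}
        else:                               # ("guard", phi, body): phi -> body
            _, phi, body = stmt
            if not phi(state):
                return FAIL                 # abnormal termination
            state = run(body, state)
            if state is FAIL:
                return FAIL
    return state


def final(command, initial_states):
    """Partial correctness semantics: final states, ignoring failures."""
    results = [run(command, s) for s in initial_states]
    return [r for r in results if r is not FAIL]


def final_fail(command, initial_states):
    """Total correctness semantics: final states, plus FAIL if any run fails."""
    results = [run(command, s) for s in initial_states]
    ok = [r for r in results if r is not FAIL]
    return ok + [FAIL] if len(ok) < len(results) else ok


# G := E ; ¬Odd(G) -> ε
EXAMPLE = [
    ("assign", "G", lambda s: s["E"]),
    ("guard", lambda s: s["G"] % 2 == 0, []),
]
```

On the initial states {"E": 4} and {"E": 3}, final(EXAMPLE, ...) yields only the state in which G = 4, whereas final_fail(EXAMPLE, ...) additionally contains FAIL, because the second execution violates the guard.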
To prove the correctness of commandification, we use Hoare logic [Hoa69], where triples of the form {ϕ} π {ϕ ′ } play a central role. In such a triple, precondition ϕ characterizes the set of initial data states, π denotes the data command to execute on those states, and postcondition ϕ ′ characterizes the set of final data states after executing π.
Let ⟦ϕ⟧ denote the set of data states that satisfy ϕ (i.e., the data assignments characterized by ϕ). We interpret triples in two senses: that of partial correctness and that of total correctness. In the former case, a triple {ϕ} π {ϕ ′ } holds true iff every final data state to which an initial data state characterized by ϕ can evolve under π satisfies ϕ ′ ; in the latter case, additionally, execution of π does not abnormally terminate.
Definition 4.8 (interpretation of triples). |= part , |= tot ⊆ Tripl denote the smallest relations induced by the following rules: To prove properties of data commands, we use the following sound proof systems for partial and total correctness, adopted from Apt et al. with some minor cosmetic changes [AdBO09].
Definition 4.9 (proof systems of triples). ⊢ part , ⊢ tot ⊆ Tripl denote the smallest relations induced by the rules in Figure 8.
Note that the first four rules for ⊢ part and the first four rules for ⊢ tot have the same premise/consequence. We use ⊢ part to prove the soundness of commandification; we use ⊢ tot to prove commandification's completeness.
Commandification (without Cycles). At run-time, to check if a transition (q, P, ϕ, q ′ ) can fire, a compiler-generated coordinator thread first checks every port in P for readiness. For instance, every (data structure for an) input port should have a pending put. Subsequently, the coordinator thread checks whether a data state σ exists that (i) satisfies ϕ and (ii) subsumes an initial data state σ init (i.e., σ init ⊆ σ). If so, we call σ a solution of ϕ under σ init . The domain of σ init contains all uncontrollable data variables in ϕ: the input ports in P (intersected with Free(ϕ)) and • m for every memory cell m in the ca (also intersected with Free(ϕ)). More precisely, σ init maps every input port p in Free(ϕ) to the particular datum forced to pass through p by the process thread on the other side of p (i.e., the datum involved in p's pending put), while σ init maps every • m in Free(ϕ) to the datum that currently resides in m. Thus, before the coordinator thread invokes a constraint solver for ϕ, it already fixes values for all uncontrollable data variables in ϕ; when subsequently invoked, a constraint solver may, in search of a solution for ϕ under σ init , select values only for data variables outside σ init 's domain. Slightly more formally:
$$\sigma_{\mathrm{init}} = \{p \mapsto d \mid \text{the put pending on input port } p \text{ involves datum } d \text{ and } p \in \mathrm{Free}(\varphi)\} \cup \{{}^\bullet m \mapsto d \mid \text{memory cell } m \text{ currently contains datum } d \text{ and } {}^\bullet m \in \mathrm{Free}(\varphi)\}$$
With commandification, instead of invoking a constraint solver, the coordinator thread executes a compiler-generated data command for ϕ on σ init , thereby gradually extending σ init to a full solution. This compiler-generated data command essentially works as an efficient, small, dedicated constraint solver for ϕ.
To translate a data constraint of the form ℓ 1 ∧ · · · ∧ ℓ k , we construct a data command that (i) enforces as many data literals of the form t 1 = t 2 as possible with assignment statements and (ii) checks all remaining data literals with failure statements. We call data literals of the form t 1 = t 2 data equalities. To exemplify such commandification, recall data constraint ϕ eg on page 12. In this data constraint, let C denote an input port and let x denote a memory cell. In that case, the set of uncontrollable data variables in ϕ eg consists of C and • x. Now, ϕ eg has six correct commandifications, π 1 through π 6 . We stipulate the same precondition for each of these data commands, namely that • x and C have a non-nil value (later formalized as data literals • x = • x and C = C). This precondition models that the execution of these data commands should always start on an initial data state over the uncontrollable data variables • x and C. Under this precondition, if a coordinator thread executes π 1 , it first assigns the values of • x and C to B and D. Subsequently, it assigns the evaluation of add(B, D) to E. Next, it assigns the value of E to F and G. Finally, it checks ¬Odd(G) with a failure statement. Data commands π 2 and π 3 differ from data command π 1 only in the order of the last three steps; data commands π 4 , π 5 and π 6 differ from π 1 , π 2 and π 3 only in the order of the first two steps. If execution of π i on σ init successfully terminates, the resulting final data state σ satisfies ϕ eg . We call this soundness. Moreover, if a σ ′ exists such that σ ′ |= ϕ eg and σ init ⊆ σ ′ , execution of π i successfully terminates. We call this completeness.
Generally, soundness and completeness crucially depend on the order in which assignments and failure statements follow each other in π. For instance, changing the order of G := E and ¬Odd(G) -> skip in the previous example yields a data command whose execution always fails (because G does not have a value yet on evaluating the guard of the failure statement). Such a trivially sound but incomplete data command serves no purpose. As another complication, not every data equality can become an assignment. In a first class of cases, neither the left-hand side nor the right-hand side of a data equality matches data variable x. For instance, we must translate add(B, D) = mult(B, D) into a failure statement, because we clearly cannot assign either of its two operands to the other. In a second class of cases, multiple data equalities in a data constraint have a left-hand side or a right-hand side that matches the same data variable x. For instance, we can translate only one data equality in E = add(B, D) ∧ E = mult(B, D) into an assignment, after which we must translate the other one into a failure statement, to avoid conflicting assignments to E.

Figure 9. Addendum to Definition 4.12:

$$\frac{\ell_1 \sqsubseteq \ell_2 \ \text{and}\ \ell_2 \sqsubseteq \ell_3 \ \text{and}\ \ell_2 \notin \{\ell_1, \ell_3\}}{\ell_1 \sqsubseteq \ell_3} \quad (4.22)$$
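To make the role of statement order concrete, the following Python sketch executes π 1 as described above and a reordered variant in which the failure statement precedes G := E. The encoding is purely illustrative and not the compiler's actual representation: data states are dictionaries, None plays the role of nil, data are assumed to be integers, add is assumed to be integer addition, Odd is assumed to be the usual parity test, and the names var, app, run, pi_1 and pi_bad are hypothetical.

```python
def var(name):
    """Term that reads a data variable; None plays the role of nil (assumption)."""
    return lambda sigma: sigma.get(name)

def app(f, *args):
    """Term that applies f to argument terms; evaluates to nil if any argument is nil."""
    def term(sigma):
        values = [a(sigma) for a in args]
        return None if any(v is None for v in values) else f(*values)
    return term

def run(command, sigma_init):
    """Execute a data command on a copy of sigma_init.

    Returns the final data state, or None if execution fails.
    """
    sigma = dict(sigma_init)
    for step in command:
        if step[0] == "assign":          # ("assign", x, term) models x := term
            _, x, term = step
            value = term(sigma)
            if value is None:            # assumed: assigning nil counts as failure
                return None
            sigma[x] = value
        else:                            # ("check", guard) models guard -> skip
            _, guard = step
            if guard(sigma) is not True:
                return None
    return sigma

add = lambda b, d: b + d                 # assumed interpretation of add
not_odd = lambda g: g % 2 == 0           # assumed interpretation of ¬Odd

pi_1 = [
    ("assign", "B", var("•x")),
    ("assign", "D", var("C")),
    ("assign", "E", app(add, var("B"), var("D"))),
    ("assign", "F", var("E")),
    ("assign", "G", var("E")),
    ("check", app(not_odd, var("G"))),
]
# Same statements, but the failure statement now precedes G := E.
pi_bad = pi_1[:4] + [pi_1[5], pi_1[4]]
```

Under these assumptions, run(pi_1, {"•x": 1, "C": 3}) extends the initial data state to a full solution, whereas run(pi_bad, ...) fails on every initial data state, because G is still nil when the guard is evaluated.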
To deal with these complications, we define a precedence relation on the data literals in a data constraint that formalizes their dependencies. Recall from Definition 2.11 that every data constraint consists of a conjunctive kernel of data literals, enveloped with existential quantifications. First, for technical convenience, we introduce a function that extends Liter(ϕ) (i.e., the data literals in the kernel of ϕ) with "symmetric data equalities".
Obviously, because t 1 = t 2 ≡ t 2 = t 1 , we have Liter(ϕ) ≡ Liter = (ϕ) for all ϕ. We usually write ⊑ ϕ instead of ⊑(ϕ) and use ⊑ ϕ as an infix relation. In words, x = t ⊑ ϕ ℓ means that the assignment x := t precedes the commandification of ℓ (i.e., ℓ depends on x). Rule 4.20 deals with the previously discussed first class of data-equalities-that-cannot-become-assignments, by imposing precedence only on data literals of the form x = t; shortly, we comment on the second class of data-equalities-that-cannot-become-assignments. Rule 4.21 conveniently ensures that every x = t precedes all differently shaped data literals. Strictly speaking, we do not need this rule, but it simplifies some notation and proofs later on.
For the sake of argument (generally, this does not hold true), suppose that a precedence relation ⊑ ϕ denotes a strict partial order on Liter = (ϕ). In that case, we can linearize ⊑ ϕ to a strict total order < (i.e., embedding ⊑ ϕ into < such that ⊑ ϕ ⊆ <) with a topological sort on the digraph (Liter = (ϕ), ⊑ ϕ ) [Kah62,Knu97]. Intuitively, such a linearization gives us an order in which we can translate data literals in Liter = (ϕ) to data commands in a sound and complete way. Shortly, we give an algorithm for doing so and indeed prove its correctness. Problematically, however, ⊑ ϕ generally does not denote a strict partial order: generally, it violates asymmetry and irreflexivity (i.e., graph-theoretically, it contains many cycles).

Figure 10. Digraph for precedence relation ⊑ ϕeg (without loop arcs and without arcs induced by Rule 4.21, to avoid further clutter). An arc (ℓ, ℓ ′ ) corresponds to ℓ ⊑ ϕeg ℓ ′ . Arcs between the same data vertices, but in different directions, lie on top of each other. Bold arcs represent a fragment of the strict partial order extracted from ⊑ ϕeg .
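As an illustration of linearization by topological sort, the following Python sketch uses the standard-library graphlib module. The dictionary encoding (each data literal mapped to the set of literals that must precede it) and the two small example relations are assumptions made for this sketch; they are not the full relation of Figure 10.

```python
from graphlib import TopologicalSorter, CycleError

def linearize(predecessors):
    """Return the literals in an order that respects the given precedence.

    predecessors maps every literal to the set of literals that must come first.
    Raises CycleError if the precedence is not a strict partial order.
    """
    return list(TopologicalSorter(predecessors).static_order())

# Hypothetical cycle-free fragment of a precedence relation.
acyclic = {
    "B = •x": set(),
    "D = C": set(),
    "E = add(B, D)": {"B = •x", "D = C"},
    "F = E": {"E = add(B, D)"},
    "G = E": {"E = add(B, D)"},
    "¬Odd(G)": {"G = E"},
}
order = linearize(acyclic)

# Symmetric data equalities that precede each other form a cycle.
cyclic = {"D = C": {"C = D"}, "C = D": {"D = C"}}
try:
    linearize(cyclic)
    has_order = True
except CycleError:
    has_order = False
```

The second call shows why cycles are problematic: a relation in which D = C and C = D precede each other admits no linearization, so a strict partial order has to be extracted first.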
For instance, Figure 10 shows the digraph (Liter = (ϕ eg ), ⊑ ϕeg ), which indeed contains cycles. For now, we defer this issue to the next subsection, because it forms a concern orthogonal to the commandification algorithm and its correctness. Until then, we simply assume the existence of a procedure for extracting a strict partial order from ⊑ ϕ , represented by bold arcs in Figure 10.

Algorithm 1 translates a data constraint ϕ, a set of data variables X, and a binary relation on data literals < to a data command π. It requires the following on its input. First, < should denote a strict total order on the =-symmetric closure of ϕ's data literals. Let n denote a (not necessarily the) number of data equalities in Liter = (ϕ), and let m denote the number of remaining data literals in Liter = (ϕ). Then, ℓ 1 , . . . , ℓ n+m denote the data literals in Liter = (ϕ) such that (i) their indices respect < and (ii) every ℓ i denotes x i = t i for 1 ≤ i ≤ n. Next, for every data variable in a data literal in Liter = (ϕ), but outside the set of uncontrollable data variables X, a data equality x i = t i should exist. Otherwise, such a data variable can get a value only through search (exactly what commandification tries to avoid) and not through assignment; underspecified data constraints fundamentally lie outside the scope of commandification in general and Algorithm 1 in particular. Finally, if a term t in a data equality x = t depends on a variable x ′ , a data equality x ′ = t ′ should precede x = t under <.

Algorithm 1. Algorithm for translating a data constraint ϕ, a set of data variables X, and a binary relation on data literals < to a data command π.
Require: < denotes a strict total order on Liter = (ϕ) and Liter = (ϕ) = {ℓ 1 , . . . , ℓ n+m } and ℓ 1 < · · · < ℓ n < ℓ n+1 < · · · < ℓ n+m and ℓ 1 = x 1 = t 1 and · · · and ℓ n = x n = t n and Variabl(ϕ) \ X ⊆ {x 1 , . . . , x n } and for all σ . . .
The rules in Definition 4.12 induce precedence relations for which all these requirements hold true, except that those precedence relations do not necessarily denote strict partial orders and, hence, may not admit linearization. Consequently, the precedence relations in Definition 4.12 may not yield strict total orders as required by Algorithm 1. We address this issue in the next subsection.
Assuming satisfaction of its requirements, Algorithm 1 works as follows. It first loops over the first n (according to <) x i = t i data literals. If an assignment for x i already exists in the data command under construction π, Algorithm 1 translates x i = t i to a failure statement; otherwise, it translates x i = t i to an assignment. This approach resolves issues with the previously discussed second class of equalities-that-cannot-become-assignments. After the first loop, the algorithm uses a second loop to translate the remaining m data literals to failure statements. The algorithm runs in time linear in n + m, and it terminates. Upon termination, Algorithm 1 ensures the soundness (first conjunct) and completeness of π (second conjunct). Note that we use a different proof system for soundness (partial correctness, ⊢ part ) than for completeness (total correctness, ⊢ tot ).

Figure 11. Addendum to Definition 4.14:

$$\frac{\ell_1 \sqsubseteq_\varphi \ell_2}{\ell_1 \sqsubseteq \ell_2}\ (4.23) \qquad \frac{\ell \in \mathrm{Liter}_{=}(\varphi) \ \text{and}\ \mathrm{Variabl}(\ell) \subseteq X}{\star \sqsubseteq \ell}\ (4.24) \qquad \frac{x = t \in \mathrm{Liter}_{=}(\varphi) \ \text{and}\ \mathrm{Variabl}(t) \subseteq X}{\star \sqsubseteq x = t}\ (4.25)$$
Algorithm 1 has the minor issue that it may produce more failure statements than strictly necessary. For instance, if we run Algorithm 1 on the total order extracted from ⊑ ϕeg in Figure 10, we get both the assignment D := C and the unnecessary failure statement C = D -> skip. After all, the digraph contains both D = C and C = D, one of which we added while computing Liter = (ϕ eg ) to account for the symmetry of =. Generally, such symmetric data literals result either in one assignment and one failure statement or in two failure statements; one can easily prove that symmetric data literals never result in two assignments. In both cases, one can safely remove one of the failure statements, because successful termination of the remaining statement already accounts for the removed failure statement.
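The two loops of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the algorithm as stated in the paper: literals and statements are plain strings, the input is assumed to be already ordered by <, and the function name translate and the example constraint are hypothetical. Following the soundness proof, the sketch treats the uncontrollable data variables in X as already having a value, so that a data equality with such a variable on its left-hand side becomes a failure statement.

```python
def translate(equalities, others, uncontrollable):
    """Sketch of the two loops of Algorithm 1.

    equalities:     list of (x, t) pairs, already ordered by <
    others:         remaining data literals, already ordered by <
    uncontrollable: the set X of data variables fixed by the initial data state
    Returns the data command as a list of statement strings.
    """
    assigned = set(uncontrollable)       # assumed: variables in X count as assigned
    command = []
    for x, t in equalities:              # first loop: data equalities x = t
        if x in assigned:
            command.append(f"{x} = {t} -> skip")   # failure statement
        else:
            command.append(f"{x} := {t}")          # assignment
            assigned.add(x)
    for literal in others:               # second loop: all remaining data literals
        command.append(f"{literal} -> skip")
    return command

# Hypothetical input that combines the examples from the text.
command = translate(
    equalities=[
        ("B", "•x"),
        ("D", "C"),
        ("C", "D"),                      # symmetric copy of D = C
        ("E", "add(B, D)"),
        ("E", "mult(B, D)"),             # second equality for E
    ],
    others=["¬Odd(E)"],
    uncontrollable={"•x", "C"},
)
```

On this input, the sketch yields the assignment D := C together with the unnecessary failure statement C = D -> skip, and it turns the second data equality for E into a failure statement.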
Commandification (with Cycles). Algorithm 1 requires that < denotes a strict total order. Precedence relations in Definition 4.12 of ⊑, however, do not yield such orders: graph-theoretically, they may contain cycles. In this subsection, we present a solution for this problem. We start by extending the previous precedence relations with a unique least element, ⋆, and by making dependencies of data literals on uncontrollable data variables explicit. In the following definition, let X denote a set of such variables.
We usually write ⊑ X ϕ instead of ⊑(ϕ, X) and use ⊑ X ϕ as an infix relation. The two new rules state that data literals in which only uncontrollable data variables occur "depend" on ⋆.
Relation ⊑ X ϕ denotes a strict partial order if its digraph (Liter = (ϕ) ∪ {⋆}, ⊑ X ϕ ) defines a ⋆-arborescence: a digraph consisting of n − 1 arcs such that a path exists from ⋆ to each of its n vertices [KV08]. Equivalently, in a ⋆-arborescence, ⋆ has no incoming arcs, every other vertex has exactly one incoming arc, and the arcs form no cycles [KV08]. The first formulation seems more intuitive here: every path from ⋆ to some data literal ℓ represents an order in which Algorithm 1 should translate the data literals on that path to ensure the correctness of the translation of ℓ. The second formulation simplifies observing that arborescences correspond to strict partial orders.

A naive approach to extract a strict partial order from ⊑ X ϕ consists of computing a ⋆-arborescence of the digraph (Liter = (ϕ) ∪ {⋆}, ⊑ X ϕ ). Even if such a ⋆-arborescence exists, however, this approach does not work as expected if Liter = (ϕ) contains a data literal x = t where t has more than one data variable. For instance, by definition, every arborescence of the digraph in Figure 10 has only one incoming arc for E = add(B, D), even though assignments to both B and D must precede an assignment to E. Because these dependencies exist as two separate arcs, no arborescence can capture them. To solve this, we must somehow represent the dependencies of E = add(B, D) with a single incoming arc. We can do so by allowing arcs to have multiple tails, one for every data variable. In that case, we can replace the two separate incoming arcs of E = add(B, D) with a single two-tailed incoming arc as in Figure 12. The two tails make explicit that to evaluate add, we need values for both its arguments: multiple tails represent a conjunction of dependencies of a data literal.

Figure 12. B-graph corresponding to the digraph in Figure 10 (without loop b-arcs and without three-tailed b-arcs, to avoid further clutter). An arc (ℓ, ℓ ′ ) corresponds to ℓ ⊑ ϕeg ℓ ′ . Bold arcs represent an arborescence.
By combining single-tailed arcs into multiple-tailed arcs, we effectively transform the digraphs considered so far into b-graphs, a special kind of hypergraph with only b-arcs (i.e., backward hyperarcs, i.e., hyperarcs with exactly one head) [GLPN93]. Generally, we cannot derive such b-graphs from precedence relations as in Definition 4.14: their richer structure makes b-graphs more expressive than digraphs (they convey strictly more information). In contrast, we can easily transform a b-graph into a precedence relation by splitting b-arcs into single-tailed arcs in the obvious way. Deriving precedence relations from more expressive b-graphs therefore constitutes a correct way of obtaining strict total orders that satisfy the requirements of Algorithm 1; doing so just eliminates irrelevant information. Thus, we propose the following. Instead of formalizing dependencies among data literals in a set Liter = (ϕ) ∪ {⋆} directly as a precedence relation, we first formalize those dependencies as a b-graph. If the resulting b-graph defines a ⋆-arborescence, we can directly extract a cycle-free precedence relation ⊏. Otherwise, we compute a ⋆-arborescence of the resulting b-graph and extract a cycle-free precedence relation ⊏ afterward. Either way, ⊏ denotes a strict partial order whose linearization satisfies the requirements in Algorithm 1.

Figure 13. Addendum to Definition 4.15: Rule 4.28, whose premise includes x ∈ X and whose conclusion is ⋆ ◭ x = x.

We usually write ◭ X ϕ instead of ◭(ϕ, X) and use ◭ X ϕ as an infix relation. Rule 4.26 generalizes Rule 4.20 in Definition 4.12, by joining sets of dependencies of a data literal in a single b-arc. Rule 4.27 states that x = t does not necessarily depend on x, as implied by Rule 4.26, but only on the free variables in t (i.e., we can derive a value for x from values of the data variables in t).
Note that through Rules 4.26 and 4.27, we extend the previous domain Liter = (ϕ) ∪ {⋆} with semantically insignificant data equalities of the form x = x, each of which we relate to ⋆ with Rule 4.28. We do this only for the technical convenience of treating both uncontrollable data variables in X (which may have no data equalities in Liter = (ϕ)) and the other variables (which must have data equalities) in a uniform way. For instance, Figure 12 shows the b-graph for data constraint ϕ eg .
Generally, in a b-graph, data literals can have multiple incoming b-arcs, which represents a disjunction of conjunctions of dependencies. Importantly, as long as Algorithm 1 respects the dependencies represented by one incoming b-arc, the other incoming b-arcs do not matter. An arborescence, which contains one incoming b-arc for every data literal, therefore preserves enough dependencies. Shortly, Theorem 4.17 makes this more precise.
We can straightforwardly compute an arborescence of a b-graph with an exploration algorithm reminiscent of breadth-first search. First, let ⊳ ⊆ ◭ X ϕ denote the arborescence under computation, and let L done ⊆ Liter = (ϕ) ∪ {⋆} ∪ {x = x | x ∈ X} denote the set of vertices (i.e., data literals) already explored; initially, ⊳ = ∅ and L done = {⋆}. Now, given some L done , compute a set of vertices L next connected only to vertices in L done by a b-arc in ◭ X ϕ . Then, for every vertex in L next , add an incoming b-arc to ⊳. Afterward, add L next to L done . Repeat this process until L next becomes empty. Once that happens, either ⊳ contains an arborescence (if L done = L, i.e., if L done contains all vertices) or no arborescence exists. This computation runs in linear time, in the size of the b-graph. See also Footnote 5. Henceforth, let ⊳ X ϕ denote the final arborescence so computed; if no arborescence exists, we stipulate ⊳ X ϕ = ∅.

Definition 4.16 (precedence iii). ⊏ : DC × 2^X → DC × DC denotes the function defined by the following equation: ⊏(ϕ, X) = ⊏ where ⊏ denotes the smallest relation induced by the rules in Figure 14.

Figure 14. Addendum to Definition 4.16:

$$\frac{\ell_1 \sqsubset \ell_2 \ \text{and}\ \ell_2 \sqsubset \ell_3 \ \text{and}\ \ell_2 \notin \{\ell_1, \ell_3\}}{\ell_1 \sqsubset \ell_3} \quad (4.31)$$
We usually write ⊏ X ϕ instead of ⊏(ϕ, X). Rules 4.30 and 4.31 have the same premise/consequence as Rules 4.21 and 4.22; Rule 4.29 straightforwardly splits b-arcs into single-tailed arcs. For instance, the bold arcs in Figure 10 represent a fragment of the precedence relation so derived from the arborescence in Figure 12.
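The exploration procedure for computing an arborescence of a b-graph can be sketched in Python as follows. The sketch makes several assumptions: a b-arc is encoded as a pair (tails, head) with tails a frozenset of vertices, vertices are plain strings, the example b-graph is a hypothetical fragment in the spirit of Figure 12, and the function name arborescence is invented for this illustration. For brevity, the sketch rescans all b-arcs in every round, so it does not achieve the linear running time mentioned above.

```python
STAR = "⋆"

def arborescence(vertices, b_arcs):
    """Select one incoming b-arc per vertex, exploring outward from ⋆.

    vertices: all data literals (excluding ⋆)
    b_arcs:   list of (tails, head) pairs, with tails a frozenset of vertices
    Returns a dict mapping each head to the tails of its selected b-arc,
    or None if no ⋆-arborescence exists.
    """
    done = {STAR}
    chosen = {}
    while True:
        nxt = {}
        for tails, head in b_arcs:
            # A vertex becomes reachable once all tails of one b-arc are explored.
            if head not in done and head not in nxt and tails <= done:
                nxt[head] = tails
        if not nxt:
            break
        chosen.update(nxt)
        done.update(nxt)
    return chosen if done == set(vertices) | {STAR} else None

# Hypothetical b-graph fragment; x = x vertices stand for uncontrollable variables.
vertices = ["•x = •x", "C = C", "B = •x", "D = C", "E = add(B, D)", "G = E"]
b_arcs = [
    (frozenset({STAR}), "•x = •x"),
    (frozenset({STAR}), "C = C"),
    (frozenset({"•x = •x"}), "B = •x"),
    (frozenset({"C = C"}), "D = C"),
    (frozenset({"B = •x", "D = C"}), "E = add(B, D)"),   # two-tailed b-arc
    (frozenset({"E = add(B, D)"}), "G = E"),
]
tree = arborescence(vertices, b_arcs)

# A recursive data equality whose only incoming b-arc loops onto itself.
no_tree = arborescence(["x = inc(x)"], [(frozenset({"x = inc(x)"}), "x = inc(x)")])
```

Splitting every selected b-arc into single-tailed arcs then yields a cycle-free precedence; for the recursive data equality, no arborescence exists and the sketch returns None.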
In that case, the other vertices fail to resolve at least one of ℓ's dependencies. This occurs, for instance, when ℓ depends on x, but the b-graph contains no x = t vertex. As another example, consider a recursive data equality x = t with x ∈ Variabl(t): unless another data equality x = t ′ with t ≠ t ′ exists, every incoming b-arc in its b-graph loops onto itself. Consequently, no arborescence exists. In practice, such cases inherently require constraint solving techniques with backtracking to find a value for x. Nonexistence of a ⋆-arborescence thus signals a hard limit to the applicability of Algorithm 1 (although mixed techniques of translating some parts of a data constraint to a data command at compile-time and leaving other parts to a constraint solver at run-time seem worthwhile to explore; we leave this possibility for future work). Thus, the set of data constraints to which we can apply Algorithm 1 contains those (i) whose b-graph has a ⋆-arborescence, which guarantees linearizability of the induced precedence, and (ii) that satisfy also the rest of the requirements in Algorithm 1.
Commandify. To introduce data commands in cas, we introduce commandify as a unary operation on cas. First, because we want to avoid ad-hoc modifications to Definitions 2.11 and 2.14 (of data constraints and cas), we present an encoding of data commands as data relations. In the following definition, let ϕ denote a data constraint in a ca, let X denote the set of uncontrollable data variables in ϕ, and let x 1 , . . . , x k denote the free data variables in ϕ, ordered by < Term . Then, data relation R, which encodes the commandification π of ϕ, holds true of a data tuple (d 1 , . . . , d k ) iff execution of π on an initial data state (over the variables in X) successfully terminates on a data state σ that maps every x i to d i .
Definition 4.18 (data commands as data relations). comm : DC × 2^X → DC denotes the function defined by the following equation:

$$\mathrm{comm}(\varphi, X) = \begin{cases} R(x_1, \ldots, x_k) & \text{if } \mathrm{Free}(\varphi) = \{x_1, \ldots, x_k\} \text{ and } x_1 <_{\mathrm{Term}} \cdots <_{\mathrm{Term}} x_k \text{ and } \rhd^X_\varphi \neq \emptyset \text{ and } X \subseteq \mathrm{Free}(\varphi) \\ \varphi & \text{otherwise} \end{cases}$$

where R denotes the smallest relation induced by Rule 4.32.

Note that σ in Rule 4.32 may map also data variables outside Free(ϕ). This happens, for instance, with data constraints with existential quantifiers. The data commands for such data constraints explicitly assign values to quantified data variables, even though those variables do not qualify as free. Because {x 1 → d 1 , . . . , x k → d k } contains the free data variables in ϕ, however, the additional data variables mapped by σ cannot affect the truth of ϕ (by monotonicity of entailment). We define commandification in cas in terms of comm; in that definition, −→ denotes the smallest transition relation induced by the accompanying rules.

Correctness and Effectiveness. We conclude this section by establishing the correctness and effectiveness of commandify. We consider commandify correct if it yields a ca that is behaviorally congruent to the original one. Before formulating this as a theorem, the following lemma first states the equivalence of a data constraint and its commandification.
We consider commandify effective if, after commandifying a ca a, every data constraint in the resulting ca either encodes a data command as in Definition 4.18 or has no data variables in it (in which case a compiler can statically check that data constraint). Generally, however, such unconditional effectiveness does not hold true. After all, if the b-graph for a data constraint ϕ in a has no ⋆-arborescence, we have no strict precedence relation to run Algorithm 1 with. In that case, comm(ϕ, X) = ϕ, and consequently, commandify does not have its intended effect. Fortunately, commandify does satisfy a weaker (but useful) form of effectiveness. To formulate this as a theorem, we first define a relation that holds true of arborescent cas. We consider a ca arborescent if the b-graph for each of its data constraints has a ⋆-arborescence.
Definition 4.22 (arborescentness). ♣ ⊆ Autom denotes the smallest relation induced by the following rule:

$$\frac{\varphi \in \mathrm{Dc}(a) \ \text{implies}\ \rhd^X_\varphi \neq \emptyset \ \text{for all}\ \varphi}{\clubsuit\, a} \quad (4.34)$$

The following theorem states the effectiveness of commandify, conditional on arborescentness: after commandifying an arborescent ca a, every data constraint in the resulting ca encodes a data command as a data relation (as in Definition 4.18). Let R range over the set of data relations defined in Definition 4.18 of comm.

Discussion. The constraint programming community has already observed that, for constraint solving, "if domain specific methods are available they should be applied instead [sic] of the general methods" [Apt09a]. Commandification pushes this piece of conventional wisdom to an extreme: essentially, every data command generated for a data constraint ϕ by Algorithm 1 constitutes a small, dedicated constraint solver capable of solving only ϕ. Nevertheless, execution of data commands bears similarities with constraint propagation techniques, in particular with forward checking [BMFL02]. Generally, constraint propagation aims to reduce the search space of a constraint satisfaction problem by transforming it into an equivalent "simpler" one, where variables have smaller domains, or where constraints refer to fewer variables. With forward checking, whenever a variable x gets a value d, a constraint solver removes values from the domains of all subsequent variables that, given d, violate a constraint. In the case of an equality x = x ′ , for instance, forward checking reduces the domain of x ′ to the singleton {d} after an assignment of d to x. Commandification implicitly uses that same property of equality, but instead of explicitly representing the domain of a variable and the reduction of this domain to a singleton at run-time, commandification already turns the equality into an assignment at compile-time.
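The contrast between forward checking and commandification for an equality x = x ′ can be illustrated with the following Python sketch. Both functions and their names are hypothetical: the first represents domains explicitly as sets and prunes them at run-time, whereas the second stands for the assignment x ′ := x that a compiler would have emitted in place of the equality.

```python
def forward_check_equality(domains, x, d, x_prime):
    """Run-time view: after x gets value d, prune the domain of x_prime under x = x_prime."""
    pruned = dict(domains)
    pruned[x] = {d}
    pruned[x_prime] = domains[x_prime] & {d}   # at most the singleton {d} remains
    return pruned

def commandified_equality(sigma, x, x_prime):
    """Compile-time view: the equality has already become the assignment x_prime := x."""
    return {**sigma, x_prime: sigma[x]}

domains = {"x": {1, 2, 3}, "y": {2, 3, 4}}
after_check = forward_check_equality(domains, "x", 2, "y")
after_command = commandified_equality({"x": 2}, "x", "y")
```

In the first case, a solver still has to represent and intersect domains; in the second case, no domain exists at run-time at all.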
Commandification may also remind one of classical Gaussian elimination for solving systems of linear equations over the reals [Apt09b]: there too, one orders variables and substitutes values/expressions for variables in other expressions. Data constraints, however, have a significantly different structure from real numbers, which makes solving data constraints directly via Gaussian elimination at least not obvious.
Before we did the work presented in this paper, Clarke et al. already worked on purely constraint-based implementations of protocols [CPLA11]. Essentially, Clarke et al. specify not only the transition labels of an automaton as boolean constraints but also its state space and transition relation. In recent work, Proença and Clarke developed a variant of compile-time predicate abstraction to improve performance [PC13a]. They also used this technique to allow a form of interaction between a constraint solver and its environment during constraint solving [PC13b]. The work of Proença and Clarke resembles our work in the sense that we all try to "simplify" constraints at compile-time. We see also differences, though: (i) commandification fully avoids constraint solving and (ii) we adopted a richer language of data constraints in this paper. For instance, Proença and Clarke have only unary functions in their language, which would have avoided our need for b-graphs.

Experiments
Setup. We implemented our two optimization techniques as extensions to our existing ca-to-Java compiler, a plug-in for the Eclipse Ide. This plug-in is an integrated part of a larger toolset, which also includes an editor that supports the graphical syntax for cas presented in Section 2, through a drag-and-drop interface. To evaluate the impact of our optimization techniques in practice, we performed a number of experiments with their implementation, the results of which we present in this section.
We divided our experiments into two categories. The first category consists of experiments involving compiler-generated coordinator threads in isolation. These experiments are "pure" in the sense that we measure only the performance of the compiler-generated code, without "polluting" these measurements with delays caused by process threads. The second category consists of experiments involving compiler-generated coordinator threads in the context of full programs.

We ran each of our experiments five times on a machine with 24 cores (two Intel E5-2690V3 processors in two sockets), without Hyper-Threading and without Turbo Boost (i.e., with a static clock frequency), and averaged our measurements afterward.
Category I. To study the performance of compiler-generated coordinator threads in isolation, we selected seven sets of cas for experimentation, whose elements differ in the value of k ∈ {1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64}: Sync k , Fifo k , OddFib k , Merg k , LateAsyncMerg k , EarlyAsyncMerg k , and Rout k . In total, thus, we generated code for 96 cas, yielding 96 experiments. Application of our optimization techniques did not add any measurable compilation overhead. Each of these cas, except the Merg k cas, is the k-parametric generalization of a ca denoted by a digraph in Figure 6; every Merg k ca is the k-parametric generalization of Merg2 in Figure 3. For Sync k /Fifo k , parameter k controls the number of Syncs/Fifos in the chain. For Merg k , LateAsyncMerg k , and EarlyAsyncMerg k , parameter k controls the number of producers. For OddFib k and Rout k , parameter k controls the number of consumers. See Section 2 for a brief description of the behavior of these cas for k = 2.
In each run of an experiment, we measured the number of completed transitions in four minutes after warming up the Java virtual machine for thirty seconds. To measure the performance of only the compiler-generated code, we used "empty" producers and consumers, which essentially execute while (true) put(...) and while (true) get(...). Figure 15 shows our experimental results. The figure shows that, individually, our two optimization techniques are already very effective. When we apply both optimization techniques simultaneously, in many cases (Sync k , LateAsyncMerg k , EarlyAsyncMerg k , and Rout k ), performance is further improved, but the improvement is not the sum of the individual improvements. The reason is that after applying one of the techniques, there is "less room" for the other technique to make further improvement: there is only so much that can be optimized in checking data constraints, and each of our two techniques individually seems to already make a significant step toward an optimum. Still, as Figure 15 shows, it is useful to apply both techniques, especially since they do not appear to negatively influence each other.
Category II. To study the performance of compiler-generated coordinator threads in the context of full programs, we adapted the Nas Parallel Benchmarks (Npb) [BBB + 91], a popular suite for evaluating parallel performance. The Npb suite specifies eight benchmarks (five computational kernels and three realistic applications) derived from computational fluid dynamics programs; for each of these benchmarks, to standardize comparisons, the Npb suite specifies four classes of problem sizes (class w, class a, class b, class c).
We compared the Java reference implementation of Npb with a ca-based implementation. The Java reference implementation, developed by Frumkin et al. [FSJY03], contains a Java program for seven of Npb's eight benchmarks; one kernel benchmark is missing. Each of these programs consists of a master process and a number of worker processes. The master and its workers interact with each other under a classical master/workers protocol (i.e., the master distributes work among its workers; the workers inform their master once their work is done). Frumkin et al. programmed this protocol using monitors.
We took the Java reference implementation of Npb as the basis for our ca-based implementation. First, we removed all instances of the master/workers protocol from the seven programs. Then, we added ports and put/get. Separately, we drew the master/workers protocol in our graphical syntax for cas. Subsequently, we compiled our specification for k ∈ {2, 4, 8, 16, 32, 64} workers (unless a combination of benchmark+class supported only fewer workers), and let our compiler automatically integrate the hand-written code (for masters/workers) with its own compiler-generated code. Application of our optimization techniques did not add any measurable compilation overhead. Figures 16 and 17 show our experimental results. These results, in contrast to the results in Figure 15, are messy, and it is hard to derive a meaningful conclusion from them: in some cases, using both optimizations results in the best performance, but in other cases, using only one of the optimizations results in the best performance, and in yet a few other cases, using no optimization actually results in the best performance.
The reason for these results, so we found out, has to do with hardware cache performance: it turns out that the memory footprint of our compiler-generated code seriously impacts numbers of cache misses, a phenomenon that did not yet manifest when we ran our compiler-generated code in isolation. As we have not yet optimized compiler-generated code for memory usage, a reasonable assumption is that code with a large memory footprint results in more cache misses. However, things are even more subtle than that: due to the way the Java virtual machine allocates memory, so we found out, a larger memory footprint may in fact result in fewer cache misses. We admit that we do not yet understand the impact of the memory footprint of our compiler-generated code on the execution-time performance of the code well enough to appropriately account for this impact. We consider the revelation of this underdeveloped aspect of our compilation technology as a significant contribution of this paper.

Conclusion
We presented, and established the correctness of, two techniques to optimize the performance of checking data constraints. The first technique, called "eliminate" and formalized as an operation, reduces the size of data constraints at compile-time, to reduce the complexity of constraint solving at run-time. The second technique, called "commandify" and formalized as a unary operation on cas, translates data constraints into small pieces of imperative code at compile-time, to replace expensive calls to a general-purpose constraint solver at run-time. Finding satisfying assignments for data constraints resembles a game of hide-and-seek, played by our compiler-generated code at run-time with the aid of a constraint solver. This game was reasonable when our ca compilation technology was still in its infancy, but no longer as this technology matures.
Although the experiments in which we evaluated compiler-generated code in isolation show that eliminate and commandify indeed have a positive impact on performance, the experiments in which we evaluated compiler-generated code in the context of full programs remain inconclusive because of seemingly erratic hardware cache behavior. Here lies an important next research step: we need to better understand the impact of memory footprints of compiler-generated code. So far, including in this paper, we have focused our attention exclusively on compilation techniques for optimizing "algorithmic" aspects of compiler-generated code (i.e., minimizing the number of computation steps necessary to, for instance, check data constraints). Our experimental results in this paper show that we need to start considering memory too.

Figure 17. Experimental results for four Npb kernels: speedups, on the y-axis, of compiler-generated code optimized with eliminate, commandify, or both, and of reference code by Frumkin et al., relative to unoptimized compiler-generated code, as a function of the number of processes, on the x-axis. See Figure 16 for a legend.
Another interesting piece of future work involves comparing our compilation technology for constraint automata, including the optimization techniques presented in this paper, with compilation technology for other coordination models and languages. One interesting candidate is BIP. In recent work [DJAB15], we already performed a theoretical study on the relation between (the formal semantics of) Reo and BIP. A natural next step in this line of work is a practical comparison of these models, covering not only the performance of their generated code but also software engineering qualities such as programmability, maintainability, and reusability.
Proof of Theorem 4.13. To show the correctness of Algorithm 1 (henceforth "the algorithm"), we need to show that if its requirements are satisfied, upon termination, it ensures both: We call the former soundness and the latter completeness and prove their truth separately.
Soundness: We start by arguing that ⊢_part {⋀{x = x | x ∈ X}} π {ℓ_1 ∧ ⋯ ∧ ℓ_i} holds after every iteration of the first loop. For 1 ≤ i ≤ n, after doing an assignment x_i := t_i in a data state σ, literal ℓ_i = (x_i = t_i) holds in σ if all variables in t_i have a non-nil value.
(Otherwise, t_i evaluates to nil, which the definition of |= forbids.) Reasoning toward a contradiction, suppose that some variable y in t_i has a nil value. Then, because no assignment assigns nil, no y := t assignment has occurred previously. But because y ∈ Variabl(t_i), either a literal y = t ∈ L exists that precedes x_i = t_i or y ∈ X (by the requirements of the algorithm). In the former case, a y := t assignment must have occurred previously, such that y in fact has a non-nil value (namely, the evaluation of t). In the latter case, by the precondition of the triple we are proving, we know that σ |= y = y holds. By the definition of |=, this means that y has a non-nil value.
Thus, ℓ_i = (x_i = t_i) holds in σ after its update with x_i := t_i. By the precondition of the triple, we know that x = x held for all x ∈ X before updating σ. Additionally, suppose that the preceding literals x_j = t_j (for 1 ≤ j < i) held before updating σ. Each of those literals can have become false only if the update overwrote an x or an x_j. In that case, x_i ∈ X ∪ {x_1, …, x_{i−1}}. But then, the algorithm did not translate x_i = t_i to an assignment in the first place but to a failure statement x_i = t_i -> skip. If execution of this statement successfully terminates, obviously x_i = t_i holds, and because it leaves σ unchanged, all preceding literals remain true. Note that the ⊢_part proof rule for failure statements allows us to assume that the guard holds; we do not need to establish this yet (cf. completeness below, where we use ⊢_tot).
We can inductively repeat the reasoning in the previous paragraphs for all 1 ≤ i ≤ n to conclude that ⊢_part {⋀{x = x | x ∈ X}} π {ℓ_1 ∧ ⋯ ∧ ℓ_n} holds after the first loop. The failure statements added in the second loop leave state σ unchanged, meaning that literals that held in σ before executing those statements remain true. Thus, if those statements successfully terminate,
If x_i ∉ X ∪ {x_1, …, x_{i−1}}, we know that ℓ_i = (x_i = t_i) holds in σ after its update with x_i := t_i (see soundness above). By our initial assumption, we also know that ℓ_i = (x_i = t_i) holds in σ′. Thus, by the definition of |=, we conclude σ(x_i) = eval_σ(t_i) and σ′(x_i) = eval_σ′(t_i). Now, because a y = t literal precedes x_i = t_i for all y ∈ Variabl(t_i) (see soundness above), σ maps every such y to the same value as σ′ (i.e., y = x_j for some 1 ≤ j < i). Consequently, eval_σ(t_i) = eval_σ′(t_i). Combining this with the previous intermediate result, the following equation holds: σ(x_i) = eval_σ(t_i) = eval_σ′(t_i) = σ′(x_i). Thus, x_i = σ′(x_i) holds in σ. As before (see soundness above), we can also establish that, for x_j ∈ X ∪ {x_1, …, x_{i−1}}, updating σ with x_i := t_i does not falsify any x_j = σ′(x_j) literal that held before this update. Thus, ⋀{x = σ′(x) | x ∈ X ∪ {x_1, …, x_i}} holds in σ.
If x_i ∈ X ∪ {x_1, …, x_{i−1}}, we can immediately conclude that x_j = σ′(x_j) held in σ for all x_j ∈ X ∪ {x_1, …, x_{i−1}} already before executing the failure statement x_i = t_i -> skip added by the algorithm. To prove that this failure statement also successfully terminates, the ⊢_tot proof rule for failure statements dictates that we must establish, rather than assume (cf. soundness above), that the guard x_i = t_i holds in σ. This follows from the fact that x_i = t_i holds in σ′ by our initial assumption, and because σ and σ′ map all variables in ℓ_i = (x_i = t_i) to the same values. To prove the latter, we can use a similar argument involving the precedence relation and its linearization as before (see soundness above).
We can inductively repeat the previous reasoning for all 1 ≤ i ≤ n to conclude that ⊢_tot {⋀{x = σ′(x) | x ∈ X}} π {⋀{x = σ′(x) | x ∈ X ∪ {x_1, …, x_n}}} holds after the first loop. The failure statements added in the second loop leave σ unchanged, meaning that the x_j = σ′(x_j) literals that held in σ before executing those statements, for x_j ∈ X ∪ {x_1, …, x_n}, remain true. To prove the successful termination of those failure statements, we can use a similar argument as for the failure statements added in the first loop: by our initial assumption, σ′ |= ℓ_j for all n + 1 ≤ j ≤ n + m, and σ and σ′ still map the same variables to the same values. Thus, ⊢_tot {⋀{x = σ′(x) | x ∈ X}} π {⋀{x = σ′(x) | x ∈ X ∪ {x_1, …, x_{n+m}}}} holds also after the second loop. A full, detailed proof appears as the proof of Theorem 18 in [Jon16b, Appendix D.4].
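The two loops that this proof reasons about can be illustrated with a small executable sketch. It is not Algorithm 1 itself, only a reading of it reconstructed from the proof text: the first loop turns each ordered literal x_i = t_i into an assignment when x_i is not yet defined and into a failure statement (guard) otherwise, and the second loop turns all remaining literals into guards. The names `eval_term`, `commandify`, and `run`, the encoding of terms as pairs of variable names and a Python function, the use of `None` for nil, and the restriction of the remaining literals to equalities are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Value = Optional[int]                  # None plays the role of nil (assumption)
State = Dict[str, Value]
# Hypothetical term encoding: (variables the term mentions, function computing it).
Term = Tuple[Tuple[str, ...], Callable[..., int]]
Literal = Tuple[str, Term]             # the literal x = t
Command = Tuple[str, str, Term]        # ("assign" | "guard", x, t)


def eval_term(term: Term, state: State) -> Value:
    """Evaluate a term; the result is nil if any variable in it is nil."""
    variables, fn = term
    args = [state.get(v) for v in variables]
    if any(a is None for a in args):
        return None
    return fn(*args)


def commandify(ordered: List[Literal], rest: List[Literal], X: Set[str]) -> List[Command]:
    """Translate literals into commands, following the two loops in the proof."""
    defined = set(X)
    program: List[Command] = []
    for x, t in ordered:               # first loop: literals in linearized order
        if x not in defined:
            program.append(("assign", x, t))    # x := t
            defined.add(x)
        else:
            program.append(("guard", x, t))     # x = t -> skip
    for x, t in rest:                  # second loop: remaining literals
        program.append(("guard", x, t))
    return program


def run(program: List[Command], state: State) -> Optional[State]:
    """Execute the commands; return the final state, or None on failure."""
    state = dict(state)
    for kind, x, t in program:
        value = eval_term(t, state)
        if value is None:
            return None                # a nil term can never satisfy x = t
        if kind == "assign":
            state[x] = value
        elif state.get(x) != value:
            return None                # failed guard
    return state


# Example with X = {"p"}: m = p, q = m + 1, and p = m (p is already defined).
ordered = [
    ("m", (("p",), lambda p: p)),
    ("q", (("m",), lambda m: m + 1)),
    ("p", (("m",), lambda m: m)),
]
program = commandify(ordered, [], {"p"})
final = run(program, {"p": 3})
```

In this example the third literal becomes a guard rather than an assignment, because overwriting p could falsify literals that already hold, which is the case distinction the soundness argument relies on.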
Proof of Theorem 4.17. Recall that the rules in Definition 4.12 of ⊑ (and, therefore, also the rules in Definition 4.14) induce precedence relations for which all requirements of Algorithm 1 (henceforth "the algorithm") hold, except that those precedence relations do not necessarily denote strict partial orders. What we need to show here, then, is that ⊏ X ϕ is both a strict partial order and a "large enough" subset of ⊑ X ϕ to satisfy the algorithm's requirements. The theorem subsequently follows, as < X ϕ is just the linearization of ⊏ X ϕ. The fact that ⊏ X ϕ is a strict partial order follows from ⊳ X ϕ forming an arborescence.
To show ⊏ X ϕ ⊆ ⊑ X ϕ, we need to consider the three rules in Definition 4.16 of ⊏. First, take any pair (ℓ, ℓ′) such that ℓ ⊏ X ϕ ℓ′ by Rule 4.29. Then, by the premise of that rule, {ℓ_1, …, ℓ_k} ⊳ X ϕ ℓ′ such that ℓ = ℓ_i for some 1 ≤ i ≤ k. Because ⊳ X ϕ ⊆ ◭ X ϕ (the former is an arborescence of the latter), the premises of the rules in Definition 4.15 of ◭ guarantee, after some manipulation, that ℓ = ℓ_i = (x = t) for some x and t. Moreover, x ∈ Variabl(ℓ′). By Rule 4.20, we subsequently conclude that ℓ ⊑ X ϕ ℓ′ holds. Second, Rule 4.30 is identical to Rule 4.21, so any pair (ℓ, ℓ′) in ⊏ X ϕ induced by the former is also induced in ⊑ X ϕ by the latter. Third, by induction, we can show the same result for pairs (ℓ, ℓ′) such that ℓ ⊏ X ϕ ℓ′ by Rule 4.31. Thus, ⊏ X ϕ ⊆ ⊑ X ϕ.
Finally, we must show that ⊏ X ϕ is "large enough" for it to satisfy the precondition of the algorithm. Informally, this means that arborescences do not exclude b-arcs in the b-graph that actually represent essential dependencies: for every free variable y that a literal ℓ ∈ L depends on, ⊏ X ϕ must contain at least one pair (y = t, ℓ) (for some t). To see that this holds, note that every b-arc entering a literal ℓ represents a complete set of dependencies of ℓ.
If ℓ has multiple incoming b-arcs, this simply means that several ways exist to resolve ℓ's dependencies. In principle, however, keeping one of those options suffices for our purpose. Therefore, the single incoming b-arc that ℓ has in an arborescence represents enough dependencies of ℓ.
Proof of Theorem 4.23. To prove this theorem, by Definition 4.19 of · , we need to show that for every data constraint ϕ in a, the pair (ϕ, X) for X = Free(ϕ) ∩ (P in ∪ • M ) satisfies the four conditions in Definition 4.18 of comm. The first two conditions always hold. The third condition follows from ♣ a: by Definition 4.22 of ♣ , every data constraint in a is arborescent. Finally, the fourth condition follows from set theory. A full, detailed proof appears as the proof of Theorem 21 in [Jon16b, Appendix D.4].
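The proof of Theorem 4.17 relies on linearizing a strict partial order on literals. As a minimal sketch of that step alone, the following uses a topological sort from Python's standard library. The function name `linearize`, the encoding of literals as strings, and the encoding of the precedence relation as a set of pairs are assumptions made for illustration; the construction of the relation from an arborescence is not modelled here.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Iterable, List, Set, Tuple


def linearize(literals: Iterable[str], precedes: Set[Tuple[str, str]]) -> List[str]:
    """Return the literals in a total order compatible with `precedes`.

    A pair (a, b) in `precedes` means that a must come before b.
    Raises CycleError if the relation contains a cycle, that is, if it
    is not a strict partial order.
    """
    sorter = TopologicalSorter({literal: set() for literal in literals})
    for before, after in precedes:
        sorter.add(after, before)      # `after` depends on `before`
    return list(sorter.static_order())


# Example: q = m + 1 needs m, and r = q needs q.
order = linearize(
    ["r = q", "q = m + 1", "m = p"],
    {("m = p", "q = m + 1"), ("q = m + 1", "r = q")},
)
```

A cyclic relation makes `linearize` raise `CycleError`, which mirrors why the proof first has to establish that the relation is a strict partial order before its linearization can be used.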