FTMPST: Fault-Tolerant Multiparty Session Types

Multiparty session types are designed to abstractly capture the structure of communication protocols and verify behavioural properties. One important such property is progress, i.e., the absence of deadlock. Distributed algorithms often resemble multiparty communication protocols. But proving their properties, in particular termination that is closely related to progress, can be elaborate. Since distributed algorithms are often designed to cope with faults, a first step towards using session types to verify distributed algorithms is to integrate fault-tolerance. We extend multiparty session types to cope with system failures such as unreliable communication and process crashes. Moreover, we augment the semantics of processes by failure patterns that can be used to represent system requirements (as, e.g., failure detectors). To illustrate our approach we analyse a variant of the well-known rotating coordinator algorithm by Chandra and Toueg.


Introduction
Multiparty Session Types (MPST) are used to statically ensure correctly coordinated behaviour in systems without global control [HYC16, CDCPY15]. One important such property is progress, i.e., the absence of deadlock. As with every other static typing approach, their main advantage is efficiency, i.e., they avoid the problem of state space explosion. MPST are designed to abstractly capture the structure of communication protocols. They describe global behaviours as sessions, i.e., units of conversations [HYC16, BCD+08, BHTY10]. The participants of such sessions are called roles. Global types specify protocols from a global point of view. These types are used to reason about processes formulated in a session calculus.
Distributed algorithms (DA) very much resemble multiparty communication protocols. An essential behavioural property of DA is termination [Tel94, Lyn96], despite failures, but it is often elaborate to prove. It turns out that progress (as provided by MPST) and termination (as required by DA) are closely related.
Many DA were designed in a fault-tolerant way, in order to work in environments where they have to cope with system failures, be it links dropping messages or processes crashing. Gärtner [Gär99] suggested four different forms of fault-tolerance, depending on whether the safety and liveness requirements are met or not. An algorithm is called masking in the (best) case that both properties hold while tolerating faults transparently, i.e., without further intervention by the programmer. It is called non-masking, however, if faults are dealt with explicitly in order to cope with unsafe states, while still guaranteeing liveness. The fail-safe case then captures algorithms that remain safe, but not live. (The fourth form is just there for completeness; here neither safety nor liveness is guaranteed.) We focus on masking fault-tolerant algorithms.
While the detection of conceptual design errors is a standard property of type systems, proving correctness of algorithms despite the occurrence of system failures is not. Likewise, traditional MPST do not cover fault-tolerance or failure handling. There are several approaches to integrate explicit failure handling in MPST (e.g. [CHY08, CGY16, CVB+16, VCE+18, DHH+15, APN17]). These approaches are sometimes enhanced with recovery mechanisms such as [CDCG17] or even provide algorithms to help find safe states to recover from as in [NY17]. Many of these approaches introduce nested try-and-catch-blocks, and a challenge is to ensure that all participants are consistently informed about concurrent throws of exceptions. Therefore, exceptions are propagated within the system. Though explicit failure handling makes sense for high-level applications, the required message overhead is too inefficient for many low-level algorithms. Instead, these low-level algorithms are often designed to tolerate a certain amount of failures. Since we focus on the communication structure of systems, additional messages as reaction to faults (e.g. to propagate faults) are considered non-masking failure handling. In contrast, we expect masking fault-tolerant algorithms to cope without messages triggered by faults. We study how many unhandled failures a well-typed system can tolerate, while maintaining the typical properties of MPST.
We propose a variant of MPST with unreliable interactions and augment the semantics to also represent failures such as message loss and crashing processes, as well as more abstract concepts of fault-tolerant algorithms such as the possibility to suspect a process to be faulty. To guide the behaviour of unreliable communication, the semantics of processes uses failure patterns that are not defined but could be instantiated by an application. This allows us to cover requirements on the system, such as a bound on the number of faulty processes, as well as more abstract concepts like failure detectors. It is beyond the scope of this paper to discuss how failure patterns could be implemented.
1.1. Related Work. Type systems are usually designed for failure-free scenarios. An exception is [KGG14], which introduces unreliable broadcast, where a transmission can be received by multiple receivers but not necessarily all available receivers. In the latter case, the receiver is deadlocked. In contrast, we consider fault-tolerant interactions, where in the case of a failure the receiver is not deadlocked.
The already mentioned systems in [CHY08, CGY16, CVB+16, VCE+18, DHH+15] extend session types with exceptions thrown by processes within try-and-catch-blocks, interrupts, or similar syntax. They structurally and semantically encapsulate an unreliable part of a protocol and provide some means to 'detect' a failure and 'react' to it. For example, [VCE+18] proposes a variant of MPST with the explicit handling of crash failures. Therefore they coordinate asynchronous messages for run-time crash notifications using a coordinator. Processes in [VCE+18] have access to local failure detectors which eventually detect all crashes.

1.2. Summary. The present paper is an extended version of [PNW22a] that additionally contains the proofs of the presented results as well as some additional explanations (also see the technical report [PNW22b]). In Section 2 we give an impression of the forms of fault-tolerant interactions that we consider. Section 3 introduces the syntax of our version of multiparty session types. The semantics of the session calculus is given in Section 4. In Section 5 we provide the typing rules and show that the standard properties that are usually required for versions of multiparty session types are valid in our case. Section 6 provides an example of using fault-tolerant multiparty session types by analysing an implementation of a well-known Consensus algorithm.

Fault-Tolerance in Distributed Algorithms
We consider three sources of failure in an unreliable communication (Figure 1(a)): (1) the sender may crash before it releases the message, (2) the receiver may crash before it can consume the message, or (3) the communication medium may lose the message. The design of a DA may allow it to handle some kinds of failures better than others. Failures are unpredictable events that occur at runtime. Since types consider only static and predictable information, we do not distinguish between different kinds of failure or model their source in types. Instead we allow types, i.e., the specifications of systems, to distinguish between potentially faulty and reliable interactions.
A fault-tolerant algorithm has to solve its task despite such failures. Remember that MPST analyse the communication structure. Accordingly, we need a mechanism to tolerate faults in the communication structure. We want our type system to ensure that a faulty interaction neither blocks the overall protocol nor influences the communication structure of the system after this fault. We consider an unreliable communication as fault-tolerant if a failure does not influence the guarantees for the overall communication structure except for this particular communication. Moreover, if a potentially unreliable communication is executed successfully, then our type system ensures the same guarantees as for reliable communication, such as, e.g., the absence of communication mismatches.
To ensure that a failure does not block the algorithm, both the receiver and the sender need to be allowed to proceed without their unreliable communication partner. Therefore, the receiver of an unreliable communication is required to specify a default value that, in the case of failure, is used instead of the value the process was supposed to receive. The type system ensures the existence of such default values and checks their sort.
Moreover, we augment unreliable communication with labels that help us to avoid communication mismatches. Assume for instance two subsequent unreliable communications in which values of different sorts, a natural number and a Boolean, are transmitted. If the first message with its natural number is lost but the second message containing a Boolean value is transmitted, the receiver could wrongly receive a Boolean value although it still waits for a natural number. To avoid this mismatch, we add a label to unreliable communication, ensure (by the typing rules) that the same label is never associated with different types, and let the semantics inspect the label of a message before reception. Note that this problem, i.e., how to ensure the absence of communication mismatches in the case of unreliable communication, is one of the main challenges in structuring fault-tolerant communication.
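The label check and the default-value fallback can be sketched operationally as follows. This is a minimal illustration, not the paper's formal semantics; the names Msg and recv_unreliable are hypothetical.

```python
# Hypothetical sketch: an unreliable reception that inspects the message
# label before consuming it, falling back to the declared default value.
from collections import deque
from dataclasses import dataclass

@dataclass
class Msg:
    label: str    # e.g. "l1", "l2"; the type system fixes one sort per label
    value: object

def recv_unreliable(queue: deque, expected_label: str, default):
    """Consume the head message only if its label matches the expected
    label; otherwise skip this reception and return the default value."""
    if queue and queue[0].label == expected_label:
        return queue.popleft().value
    return default  # failure or label mismatch: use the default value

# Two subsequent unreliable communications: l1 carries a natural number,
# l2 carries a Boolean. If the l1-message is lost, the l2-message sits at
# the head of the queue; the label check prevents a sort mismatch.
q = deque([Msg("l2", True)])               # the l1-message was lost
n = recv_unreliable(q, "l1", default=0)    # skipped: default 0 of sort N
b = recv_unreliable(q, "l2", default=False)
```

Here the receiver waiting for a natural number never consumes the Boolean-carrying message; it skips with its default and only then receives the l2-message, mirroring the mismatch-avoidance described above.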
Branching in the context of failures is more difficult, because a branch marks a decision point in a specification, i.e., the participants of the session are supposed to behave differently w.r.t. this decision. In an unreliable setting it is difficult to ensure that all participants are informed consistently about such a decision.
For clarity, we often distinguish names into values (i.e., the payload of messages), shared channels, or session channels according to their usage; there is, however, no need to formally distinguish between different kinds of names.
We assume that the sets N of names a, s, x, . . .; R of roles n, r, . . .; L of labels l, l d , . . .; V T of type variables t; and V P of process variables X are pairwise distinct. To simplify the reduction semantics of our session calculus, we use natural numbers as roles (compare to [HYC16]). Sorts S range over B, N, . . .. The set E of expressions e, v, b, . . . is constructed from the standard Boolean operations, natural numbers, names, and (in)equalities.
Global types specify the desired communication structure from a global point of view. In local types this global view is projected to the specification of a single role/participant. We use standard MPST ([HYC08, HYC16]) extended by unreliable communication and weakly reliable branching (highlighted in blue) in Figure 2.
A new session s with n roles is initialised with a[n](s).P and a[r](s).P via the shared channel a. We identify sessions with their unique session channel.
The type r 1 → r r 2 :⟨S⟩.G specifies a strongly reliable communication from role r 1 to role r 2 to transmit a value of the sort S and then continues with G. A system with this type will be guaranteed to perform a corresponding action.

(Figure 2: Syntax of Fault-Tolerant MPST, giving global types G, local types T, and processes P, including the session-initialisation prefixes a[n](s).P and a[r](s).P.)

In a session s this communication is implemented by the sender s[r 1 , r 2 ]! r ⟨e⟩.P 1 (specified as [r 2 ]! r ⟨S⟩.T 1 ) and the receiver s[r 2 , r 1 ]? r (x ).P 2 (specified as [r 1 ]? r ⟨S⟩.T 2 ). As a result, the receiver instantiates x in its continuation P 2 with the received value. The type r 1 → u r 2 :l ⟨S⟩.G specifies an unreliable communication from r 1 to r 2 transmitting (if successful) a label l and a value of type S and then continues (regardless of the success of this communication) with G. The unreliable counterparts of senders and receivers are s[r 1 , r 2 ]! u l ⟨e⟩.P 1 (specified as [r 2 ]! u l ⟨S⟩.T 1 ) and s[r 2 , r 1 ]? u l ⟨v ⟩(x ).P 2 (specified as [r 1 ]? u l ⟨S⟩.T 2 ). The receiver s[r 2 , r 1 ]? u l ⟨v ⟩(x ).P 2 declares a default value v that is used instead of a received value to instantiate x after a failure. Moreover, a label is communicated that helps us to ensure that a faulty unreliable communication has no influence on later actions.
The strongly reliable branching r 1 → r r 2 :{l i .G i } i∈I allows r 1 to pick one of the branches offered by r 2 . We identify the branches with their respective label. Selection of a branch is by s[r 1 , r 2 ]! r l .P (specified as [r 2 ]! r {l i .T i } i∈I ). Upon receiving l j , s[r 2 , r 1 ]? r {l i .P i } i∈I (specified as [r 1 ]? r {l i .T i } i∈I ) continues with P j .
As discussed in the end of Section 1, the counterpart of branching is weakly reliable and not unreliable. It is implemented by r → w R:{l i .G i } i∈I,l d , where R ⊆ R and l d with d ∈ I is the default branch. We use a broadcast from r to all roles in R to ensure that the sender can influence several participants consistently. Splitting this action to inform the roles in R separately does not work, because we cannot ensure consistency if the sender crashes while performing these subsequent actions. The type system will ensure that no message is lost. Because of that, all processes that are not crashed will move to the same branch. We often abbreviate branching w.r.t. a small set of branches by omitting the set brackets and instead separating the branches by ⊕, where the last branch is always the default branch. In contrast to the strongly reliable cases, s[r, R]! w l .P (specified as [R]! w {l i .T i } i∈I ) allows to broadcast its decision to R, and s[r j , r]? w {l i .P i } i∈I,l d (specified as [r]? w {l i .T i } i∈I,l d ) receives this decision or, in the case of failure, moves to the default branch l d .

The ⊥ denotes a process that crashed. Similar to [HYC16], we use message queues to implement asynchrony in sessions. Therefore, session initialisation introduces a directed and initially empty message queue s r 1 →r 2 :[ ] for each pair of roles r 1 ̸ = r 2 of the session s. The separate message queues ensure that messages with different sources or destinations are not ordered, but each message queue is FIFO. Since the different forms of interaction might be implemented differently (e.g. by TCP or UDP), it makes sense to further split the message queues into three message queues for each pair r 1 ̸ = r 2 such that different kinds of messages do not need to be ordered. To simplify the presentation of examples in this paper and not to blow up the number of message queues, we stick to a single message queue for each pair r 1 ̸ = r 2 , but the correctness of our type system does not depend on this decision. We have five kinds of messages m and corresponding message types mt in Figure 2, one for each kind of interaction. In strongly reliable communication a value v (of sort S) is transmitted in a message ⟨v⟩ r of type ⟨S⟩ r . In unreliable communication the message l ⟨v⟩ u (of type l ⟨S⟩ u ) additionally carries a label l . For branching only the picked label l is transmitted and we add the kind of branching as superscript, i.e., message/type l r is for strongly reliable branching and message/type l w for weakly reliable branching. Finally, message/type s[r] is for session delegation. A message queue M is a list of messages m and MT is a list of message types mt.
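The per-pair FIFO queue discipline can be sketched as follows. The class and method names are illustrative, not part of the formal calculus.

```python
# Hypothetical sketch of the session-local message queues: one directed
# FIFO queue per ordered pair of distinct roles, so messages between
# different pairs are unordered relative to each other.
from collections import deque

class Session:
    def __init__(self, n_roles: int):
        # a directed, initially empty queue s_{r1 -> r2} for each r1 != r2
        self.queues = {(r1, r2): deque()
                       for r1 in range(1, n_roles + 1)
                       for r2 in range(1, n_roles + 1) if r1 != r2}

    def send(self, r1, r2, msg):
        self.queues[(r1, r2)].append(msg)       # enqueue at the tail

    def receive(self, r2, r1):
        return self.queues[(r1, r2)].popleft()  # dequeue from the head

s = Session(3)
s.send(1, 2, ("<v>", 42))   # a strongly reliable message from role 1
s.send(3, 2, ("l<v>", 7))   # an unreliable message from role 3
# Because the pairs (1,2) and (3,2) use separate queues, role 2 may
# consume the two messages in either order.
assert s.receive(2, 3) == ("l<v>", 7)
assert s.receive(2, 1) == ("<v>", 42)
```

A single queue per pair suffices for the examples here; splitting each pair's queue by message kind, as discussed above, would only refine this structure.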
In types (µt)G and (µt)T the type variable t is bound. In processes (µX )P the process variable X is bound. Similarly, all names in round brackets are bound in the remainder of the respective process, e.g. s is bound in P by a[n](s).P and x is bound in P by s[r 1 , r 2 ]? r (x ).P . A variable or name is free if it is not bound. Let FN(P ) return the free names of P .
Let subterm denote a (type or process) expression that syntactically occurs within another (type or process) term. We use '.' (as e.g. in a[r](s).P ) to denote sequential composition. In all operators the prefix before '.' guards the continuation after the '.'. Let R(G) return all roles that occur in G. We write nsr(G), nsr(T ), and nsr(P ) if none of the prefixes in G, T , and P is strongly reliable or a delegation and if P does not contain message queues.

Definition 3.1 (Well-Formedness, Global Type). A global type is well-formed if (1) it neither contains free nor unguarded type variables, (2) R(G) = {1, . . ., |R(G)|}, (3) for all its subterms of the form r 1 → r r 2 :⟨S⟩.G or r 1 → u r 2 :l ⟨S⟩.G, we have r 1 ̸ = r 2 , (4) for all its subterms of the form r 1 → r r 2 :{l i .G i } i∈I or r → w R:{l i .G i } i∈I,l d , we have r 1 ̸ = r 2 , r / ∈ R, d ∈ I, and the labels l i are pairwise distinct, and (5) for all its subterms of the form …

We restrict our attention to well-formed global types.

Definition 3.2 (Well-Formedness, Local Type). A local type is well-formed if (1) it neither contains free nor unguarded type variables and (2) for all its subterms of the form [r]! r {l i .T i } i∈I , [r]? r {l i .T i } i∈I , [R]! w {l i .T i } i∈I,l d , or of the form [R]? w {l i .T i } i∈I,l d , we have d ∈ I and the labels l i are pairwise distinct.
We restrict our attention to well-formed local types.
A session channel and a role together uniquely identify a participant of a session, called an actor.A process has an actor s[r] if it has an action prefix on s that mentions r as its first role.Let A(P ) be the set of actors of P .
3.1. Examples. Consider the specification G dice,r of a simple dice game in a bar, where the dealer (role 3) continues to roll a dice and tell its value to player 1 and then to roll another time for player 2, until the dealer decides to exit the game. We can combine strongly reliable communication/branching and unreliable communication, e.g. by ordering a drink before each round in G dice,r ,
where role 4 represents the bartender and the noise of the bar may swallow these orders. Moreover, we can remove the branching and specify a variant of the dice game in which 3 keeps on rolling the dice forever, but, e.g. due to a bar fight, one of our three players might get knocked out at some point or the noise of this fight might swallow the announcements of role 3. To restore the branching despite the bar fight that causes failures, we need the weakly reliable branching mechanism.
If 3 is knocked out by the fight, i.e., crashes, the game cannot continue. Then 1 and 2 move to the default branch end , have to skip the respective unreliable communications, and terminate. But the game can continue as long as 3 and at least one of the players 1, 2 participate. An implementation of G dice is P dice = P 3 | P 1 | P 2 , where for i ∈ {1, 2}: Role 3 stores the sums of former dice rolls for the two players in its local variables x 1 and x 2 , and roll(x i ) rolls a dice and adds its value to the respective x i . Role 3 keeps rolling dice until the sum x i for one of the players exceeds 21. If both sums x 1 and x 2 exceed 21 in the same round, then 3 wins, i.e., both players receive f; else, the player that stayed below 21 wins and receives t. The players 1 and 2 use their respective last known sum that is stored in x as default value for the unreliable communication in the branch play and f as default value in the branch end . The last branch, i.e., end , is the default branch.
3.2. Projection. Our type system verifies processes, i.e., implementations, against a specification that is a global type. Since processes implement local views, local types are used as a mediator between the global specification and the respective local end points. To ensure that the local types correspond to the global type, they are derived by projection. Instead of the projection function described in [HYC16], we use a more relaxed variant of projection as introduced in [YDBH10, CDGH20, vGHH21].
Projection maps global types onto the respective local type for a given role p. The projections of the new global types are obtained straightforwardly from the projection of their respective strongly reliable counterparts, where either ⋄ = r, S = ⟨S⟩ or ⋄ = u, S = l ⟨S⟩. In the last case of strongly reliable or weakly reliable branching, when projecting onto a role that does not participate in this branching, we map to ⊔ i∈{1,...,n} G i ↾ p. The ⊔ allows us to unify the projections G i ↾ p if all of them return the same kind of branching input [p]? ⋄ . . ., where the respective sets of branches may differ as long as the same label is always followed by the same local type. The operation ⊔ is (similar to [YDBH10]) inductively defined, where T, T ′ ∈ T are local types, I, I 1 , I 2 , J are sets of branches of local types of the form l .T , l / ∈ I is shorthand for ∄T ′ .l .T ′ ∈ I, and ⊔ is undefined in all other cases. By the first line, identical types can be merged. By the second and third line, local types for the reception of a branching request can be merged if they have the same prefix and the respective sets of branches can be merged. The third line, for the weakly reliable case, additionally requires that the two sets of branches have the same default branch. The sets of branches that need to be merged according to the second and third line contain elements of the form l .T , where l is a label and T a local type. The last two lines above inductively define how to merge such sets, i.e., here we overload the operator ⊔ on local types to an operator on sets of branches of local types. The case distinction in the last line ensures that elements l .T with a label that occurs in only one of the two sets can be kept, but if both sets contain an element with the same label then the respective local types have to be merged for the resulting set.
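The merge of branch sets described above can be sketched as follows. Local types are modelled here as plain tuples and strings; this is an illustrative encoding, not the paper's formal definition, and it covers only the branching-input case.

```python
# Hypothetical sketch of the merge operator on branching-input types:
# branches with distinct labels from either side are kept; branches that
# share a label must merge to the same continuation.
def merge(t1, t2):
    if t1 == t2:
        return t1                       # identical types merge to themselves
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] == "branch-in"):
        _, sender1, branches1 = t1
        _, sender2, branches2 = t2
        if sender1 != sender2:
            raise ValueError("merge undefined: different prefixes")
        merged = dict(branches1)
        for label, cont in branches2.items():
            if label in merged:
                # same label on both sides: continuations must merge
                merged[label] = merge(merged[label], cont)
            else:
                merged[label] = cont    # label in only one set: keep it
        return ("branch-in", sender1, merged)
    raise ValueError("merge undefined")

# Projections of the two branches of the outer branching onto role 2
# (cf. the dice example): both are branching inputs from role 3.
t_roll = ("branch-in", 3, {"roll": "t"})
t_exit = ("branch-in", 3, {"exit": "end"})
merged = merge(t_roll, t_exit)
```

Merging `t_roll` and `t_exit` yields a single branching input from 3 with both branches, which is exactly how the two [p]? types of a non-participating role are unified. The weakly reliable case would additionally compare default branches, which this sketch omits.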
The mergeability relation ⊔ states that two types are identical up to their branching types, where only branches with distinct labels are allowed to be different. This ensures that if the sender r 1 in r 1 → r r 2 :{l i .G i } i∈I decides to branch, then only processes that are informed about this decision can adapt their behaviour accordingly; else projection is not defined.
The remaining global types are projected as follows. The projection of delegation is similar to communication; the projection is end if p does not occur at all. Recursive types without their recursion variable are mapped to the projection of their recursion body (similar to [CDGH20]); else, if p occurs in the recursion body, we map to a recursive local type, or else to successful termination. Type variables and successful termination are mapped onto themselves. We denote a global type G as projectable if for all r ∈ R(G) the projection G↾ r is defined. We restrict our attention to projectable global types. Projection maps well-formed global types onto the respective local type for a given role p, where the results of projection, if defined, are again well-formed.
Projecting the global type G dice,r in (1) results in the local types where the types of the two players T 1:dice,r = T 2:dice,r = T i:dice,r are identical. The projection of the outer branching in G dice,r on 2 results in [3]? r roll .t for the first branch and [3]? r exit.end for the second branch. These two [3]? r types are unified by ⊔ into a single [3]? r type with two branches. Projection maps G dice in (3) to types where i ∈ {1, 2} and both T i:dice are obtained by the second case of projection. The type system will ensure that either 3 transmits the request to branch to both players 1, 2 simultaneously and, since these messages cannot be lost, all players that are not crashed move to the same branch, or 3 crashes and all remaining players move to the default branch.
Assume instead that 3 can only inform one of the players 1, 2 at once. The resulting type is not projectable, because ⊔ does not allow us to unify the projections [3]? u roll ⟨N⟩.t and [3]? u win⟨B⟩.end of the two branches for 2. Replacing them by strongly reliable communications implies that neither 3 nor 2 fail. The type where 3 informs its two players subsequently about the chosen branch is projectable. But it introduces the two additional branches end .end and play.end , i.e., 3 is allowed to choose the branches for the players 1, 2 separately and differently, whereas in (1) as well as in (3) the players 1, 2 are always in the same branch. Because of that, we allow for broadcast in weakly reliable branching such that 3 can inform both players consistently without introducing additional and unintended branches.
3.3. Labels. We use labels for two purposes: they allow us to distinguish between different branches, as usual in MPST frameworks, and we assume that they may carry additional runtime information such as timestamps. We do that because we think of labels not only as identifiers for branching, but also as some kind of metadata of messages, as it can often be found in communication media or as it is assumed by many distributed algorithms.
A prominent example is the use of timestamps in message headers that allow a receiver to identify outdated messages and to discard them. Thereby, this additional runtime information placed in the label by the sender helps the receiver to implement one of the already mentioned failure patterns; namely the one that allows a receiver to skip a message and continue with a default value instead. We will introduce failure patterns with the semantics in the next section. Although it is beyond the scope of this paper to discuss the implementation of failure patterns, we have to provide the technical means to do so. Allowing for runtime information in labels requires a subtle difference in the way labels are used. A timestamp may be added by the sender to capture the transmission time, but for the receiver it is hard to have this information already present in its label before or during reception. Similarly, types in our static type system should not depend on any runtime information. Hence, in contrast to standard MPST, we do not expect the labels of senders and receivers, or the labels of processes and types, to match exactly. Instead we assume a predicate ≐ that compares two labels and is satisfied if the parts of the labels that do not refer to runtime information correspond. If labels do not contain runtime information, ≐ can be instantiated with equality. We require that ≐ is unambiguous on labels used in types, i.e., given two labels of processes l P , l ′ P and two labels of types l T , l ′ T , … Of course, the presented type system remains valid if we use labels without additional runtime information. Indeed all presented examples carry in their labels statically available information only. Interestingly, also the static information in labels, which has to coincide for senders and receivers and their types, can be exploited to guide communication.

In contrast to standard MPST and to support unreliable communication, our MPST variant will ensure that all occurrences of the same label are associated with the same sort. This helps us in the case of failures to ensure the absence of communication mismatches, i.e., the type of a transmitted value has to be the type that the receiver expects. The global type G NB = 1 → u 2:l 1 ⟨N⟩.1 → u 2:l 2 ⟨B⟩.end specifies two subsequent unreliable communications in which values of different sorts are transmitted, as discussed in Section 2. If the first message with its natural number is lost but the second message containing a Boolean value is transmitted, the receiver 2 should not wrongly receive a Boolean value although it still waits for a natural number. To avoid this mismatch, we add a label to unreliable communication and ensure (by the typing rules) that the same label is never associated with different types. In the case of G NB , the type system associates l 1 with sort N and l 2 with sort B and ensures that l 1 ̸ = l 2 . Since l 1 ̸ = l 2 , the reduction rules do not allow the receiver to use the value received in a message with label l 2 for its first communication action, i.e., they force the receiver to first skip its first communication and use a default value before it is allowed to receive the message with label l 2 . Here we interpret labels again as some kind of metadata of messages that allows a receiver to use static information in labels to guide reception. In particular, our unreliable communication mechanism exploits such metadata to guarantee strong properties about the communication structure, including the described absence of communication mismatches. Since the labels of the sender and the receiver are associated with a unique sort, the type system can then ensure that received values have the expected sort. Similarly, labels are used in [CV10] to avoid communication errors.
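The comparison of labels up to runtime metadata can be sketched as follows. The field names (static, timestamp) and the function name are hypothetical; the point is only that the predicate inspects the static part and ignores runtime information.

```python
# Hypothetical sketch of the label-comparison predicate: only the
# statically known part of a label must match; runtime metadata such as
# a timestamp, added by the sender, is ignored by the comparison.
from dataclasses import dataclass

@dataclass
class Label:
    static: str            # statically known identifier, e.g. "roll"
    timestamp: float = 0.0 # runtime metadata set by the sender

def labels_match(l_process: Label, l_type: Label) -> bool:
    # compare only the parts that do not refer to runtime information;
    # with no metadata present this degenerates to plain equality
    return l_process.static == l_type.static

sent = Label("roll", timestamp=17.3)  # timestamp added at transmission
expected = Label("roll")              # the receiver's static expectation
ok = labels_match(sent, expected)
```

A receiver could additionally use the timestamp, e.g. to discard outdated messages, which is exactly the kind of failure pattern the paper leaves open for instantiation.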

A Semantics with Failure Patterns
Before we describe the semantics, we introduce substitution and structural congruence as auxiliary concepts. The application of a substitution { y /x} on a term A, denoted as A{ y /x}, is defined as the result of replacing all free occurrences of x in A by y, possibly applying alpha-conversion to avoid capture or name clashes. For all names n ∈ N \ {x } the substitution behaves as the identity mapping. We use substitution on types as well as processes and naturally extend substitution to the substitution of variables by terms (to unfold recursions) and names by expressions (to instantiate a bound name with a received value). We assume an evaluation function eval(•) that evaluates expressions to values.
We use structural congruence to abstract from syntactically different processes with the same meaning, where ≡ is the least congruence that satisfies alpha conversion and the usual structural rules. The reduction semantics of the session calculus is defined in Figures 3 and 4, where we follow [HYC16]: session initialisation is synchronous and communication within a session is asynchronous using message queues. The rules are standard except for the five failure patterns and two rules for system failures: (Crash) for crash failures and (ML) for message loss. Failure patterns are predicates that we deliberately choose not to define here (see below). They allow us to provide information about the underlying communication medium and the reliability of processes.
Rule (Init) initialises a session with n roles. Session initialisation introduces a fresh session channel and unguards the participants of the session. Finally, the message queues of this session are initialised with the empty list under the restriction of the session channel.
Rule (RSend) implements an asynchronous strongly reliable message transmission. As a result, the value eval(y) is wrapped in a message and added to the end of the corresponding message queue, and the continuation of the sender is unguarded. Rule (USend) is the counterpart of (RSend) for unreliable senders. (RGet) consumes a message that is marked as strongly reliable with the index r from the head of the respective message queue and replaces in the unguarded continuation of the receiver the bound variable x by the received value y.
There are two rules for the reception of a message in an unreliable communication that are guided by failure patterns. Rule (UGet) is similar to Rule (RGet), but specifies a failure pattern FP uget to decide whether this step is allowed. This failure pattern could, e.g., be used to reject messages that are too old. Moreover, l ≐ l ′ is required to enforce that the static information in the transmitted label matches the expectation specified in the label of the receiver. As explained in Section 3.3, this allows us to avoid communication mismatches. Rule (USkip) allows to skip the reception of a message in an unreliable communication using a failure pattern FP uskip and instead substitutes the bound variable x in the continuation with the default value dv . The failure pattern FP uskip tells us whether a reception can be skipped (e.g. via a failure detector).
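The interplay of (UGet) and (USkip) can be sketched operationally. This is an illustrative sketch, not the formal reduction rules; the failure patterns are passed in as parameters, mirroring their role as unspecified predicates.

```python
# Hypothetical sketch of the two reception rules for unreliable
# communication: (UGet) consumes a matching message if FP_uget permits,
# (USkip) substitutes the default value if FP_uskip permits.
from collections import deque

def try_receive(queue: deque, expected_label: str, default,
                fp_uget=lambda msg: True,   # e.g. reject outdated messages
                fp_uskip=lambda: False):    # e.g. a local failure detector
    if queue:
        label, value = queue[0]
        if label == expected_label and fp_uget(queue[0]):
            queue.popleft()
            return value                    # (UGet): successful reception
    if fp_uskip():
        return default                      # (USkip): skip with default
    return None                             # neither rule applies: wait

q = deque([("roll", 5)])
got = try_receive(q, "roll", default=0)             # (UGet) fires
skipped = try_receive(q, "roll", default=0,
                      fp_uskip=lambda: True)        # queue empty: (USkip)
```

Note that when neither pattern permits a step, the receiver simply cannot reduce, which corresponds to waiting for a message to arrive or for the failure pattern to become true.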
Rule (RSel) puts the label l selected by r 1 at the end of the message queue towards r 2 . Its weakly reliable counterpart (WSel) is similar, but puts the label at the end of all relevant message queues. With (RBran) a label is consumed from the top of a message queue and the receiver moves to the indicated branch. There are again two weakly reliable counterparts of (RBran). Rule (WBran) is similar to (RBran), whereas (WSkip) allows r 1 to skip the message and to move to its default branch if the failure pattern FP wskip holds. The requirement l ≐ l j in (RBran) and (WBran) ensures as usual that indeed the branch specified by the message at the queue is picked by the receiver. Note that this branch has to be identified by the statically available information in the respective labels.
The Rules (Crash) for crash failures and (ML) for message loss describe failures of a system. With Rule (Crash), P can crash if FP crash holds, where FP crash can, e.g., model immortal processes or global bounds on the number of crashes. (ML) allows an unreliable message to be dropped if the failure pattern FP ml is valid. FP ml allows, e.g., to implement safe channels that never lose messages or a global bound on the number of lost messages.
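A hypothetical instantiation of FP crash and FP ml as global bounds could look as follows; the class, counters, and bounds are our own illustration, not the paper's formalisation.

```python
class SystemState:
    """Track crashes and message losses against assumed global bounds."""
    def __init__(self, max_crashes, max_losses):
        self.crashed, self.lost = 0, 0
        self.max_crashes, self.max_losses = max_crashes, max_losses

    def FP_crash(self):
        # a (Crash) step is allowed only below the global bound
        return self.crashed < self.max_crashes

    def FP_ml(self):
        # safe channels that never lose messages: max_losses == 0
        return self.lost < self.max_losses

    def crash(self):
        # perform a (Crash) step if the pattern permits it
        if self.FP_crash():
            self.crashed += 1
            return True
        return False
```

With max_losses set to 0, FP ml is constantly false, i.e., the sketch models safe channels.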
Figure 4 provides the remaining reduction rules for conditionals, delegation, parallel composition, restriction, recursion, and structural congruence. They are standard.
Consider the implementation of G dice,u in (2), i.e., an infinite variant of the dice game, where the players 1 and 2 use their respective last known sum x i of former dice rolls as default value. An unreliable communication in a global type specifies a communication that, due to system failures, may or may not happen. Moreover, regardless of the successful completion of this unreliable communication, the future behaviour of a well-typed system will follow its specification in the global type. Since the players 1 and 2 repeat the same kind of unreliable action, they may lose track of the current round. If they successfully receive a new sum of dice rolls from 3, they cannot be sure how often 3 actually rolled the dice. Because of lost messages, they may have missed some former announcements of 3 and, because of their ability to skip the reception of messages, they may have proceeded to the next round before 3 rolled the dice. Because the information about the current round is irrelevant for the communication structure in this case, there is no need to enforce round information.
We deliberately do not specify failure patterns, although we usually assume that the failure patterns FP uget , FP uskip , and FP wskip use only local information, whereas FP ml and FP crash may use global information about the system in the current run. We provide these predicates to allow for the implementation of system requirements or abstractions like failure detectors that are typical for distributed algorithms. Directly including them in the semantics has the advantage that all traces satisfy the corresponding requirements, i.e., all traces are valid w.r.t. the assumed system requirements. An example for the instantiation of these patterns is given implicitly via the Conditions 5.2.1-5.2.6 in Section 5 and explicitly in Section 6. If we instantiate the pattern FP uget with true and the patterns FP uskip , FP wskip , FP crash , and FP ml with false, then we obtain a system without failures. In contrast, the instantiation of all five patterns with true results in a system where failures can happen completely non-deterministically at any time.
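The two extreme instantiations just described can be sketched, e.g., as tables of constant predicates; the dictionary representation below is purely illustrative.

```python
# Failure-free system: every unreliable message is received, nothing is
# skipped, no process crashes, no message is lost.
failure_free = {
    "FP_uget":  lambda *args: True,
    "FP_uskip": lambda *args: False,
    "FP_wskip": lambda *args: False,
    "FP_crash": lambda *args: False,
    "FP_ml":    lambda *args: False,
}

# Fully non-deterministic system: every failure step is always enabled.
fully_nondeterministic = {name: (lambda *args: True)
                          for name in failure_free}
```

Any instantiation between these two extremes carves out a particular failure model.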
Note that we keep the failure patterns abstract and do not model how to check them when producing runs. Indeed, system requirements such as bounds on the number of processes that can crash usually cannot be checked, but result from observations, i.e., system designers ensure that a violation of this bound is very unlikely and algorithm designers are willing to ignore these unlikely events. In particular, FP ml and FP crash are thus often implemented as oracles for verification, whereas, e.g., FP uskip and FP wskip are often implemented by system-specific time-outs. Note that we are talking about implementing these failure patterns and not formalising them. Failure patterns are abstractions of real-world system requirements or software. We implement them by conditions providing the necessary guarantees that we need in general (i.e., for subject reduction and progress) or for the verification of concrete algorithms. In practice, we expect that the systems on which the verified algorithms are running satisfy the respective conditions. Accordingly, the session channels, roles, labels, and processes mentioned in Figure 3 are not parameters of the failure patterns, but just a vehicle to specify the conditions on failure patterns in Section 5 more formally. An implementation may or may not use this information to implement these patterns, but may also use other information such as runtime information about time or the number of processes, as indicated by the . . . in failure patterns in Figure 3 such as FP crash (P , . . .).
Similarly, strongly reliable and weakly reliable interactions in potentially faulty systems are abstractions. They are usually implemented by handshakes and redundancy: replicated servers against crash failures and retransmission of late messages against message loss. Algorithm designers have to be aware of the additional costs of these interactions.

Typing Fault-Tolerant Processes
The type of processes is checked using typing rules that define the derivation of type judgements. Within type judgements, the type information is stored in type environments.
Definition 5.1 (Type Environments). The global and session environments are given by assignments of the following forms. Assignments x :S of values to sorts are used to check whether transmitted values are well-sorted, i.e., sender and receiver expect the same sort. Assignments a:G capture the global type of a session for session initialisation via the shared channel a. Assignments l :S link labels to sorts. Assignments X :t of process variables to type variables are used to check the type of recursive processes.
Assignments s[r]:T of actors to local types are used to compare the behaviour of a process that implements this actor with its local specification T. Assignments s r 1 →r 2 :MT * allow us to check the current content of a message queue s r 1 →r 2 against a list of message types MT.
We write x ♯Γ and x ♯∆ if the name x does not occur in Γ and ∆, respectively. We use • to add an assignment provided that the new assignment is not in conflict with the type environment. More precisely, Γ • x :S implies x ♯Γ, Γ • l :S implies l ♯Γ, and Γ • X :s[r]:t implies X , t♯Γ. Moreover, ∆ • s[r]:T implies (∄T ′ .s[r]:T ′ ∈ ∆) and ∆ • s r 1 →r 2 :M implies (∄M ′ .s r 1 →r 2 :M ′ ∈ ∆). We naturally extend this operator towards sets, i.e., Γ • Γ ′ implies (∀A ∈ Γ ′ .Γ • A) and ∆ • ∆ ′ implies (∀A ∈ ∆ ′ .∆ • A). The conditions described for the operator • for global and session environments are referred to as linearity. Accordingly, we denote type environments that satisfy these properties as linear and in the following restrict our attention to linear environments. In session environments we abstract from assignments to terminated local types, i.e., ∆ • s[r]:end = ∆.
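As a rough illustration, the operator • and its treatment of terminated local types can be mimicked with dictionaries; the helper names are our own, and real type environments carry more structure than plain key-value maps.

```python
def extend(env, key, value):
    """env • key:value, defined only if key carries no assignment yet."""
    if key in env:
        raise ValueError(f"linearity violation: {key} already assigned")
    new_env = dict(env)
    new_env[key] = value
    return new_env

def extend_all(env, other):
    """env • env': add every assignment of env', checking linearity."""
    for key, value in other.items():
        env = extend(env, key, value)
    return env

def extend_session(delta, actor, local_type):
    """Delta • s[r]:T, abstracting from terminated local types,
    i.e. Delta • s[r]:end = Delta."""
    if local_type == "end":
        return dict(delta)
    return extend(delta, actor, local_type)
```

Attempting to extend an environment with a conflicting assignment fails, which is exactly the linearity check performed by •.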
A type judgement is of the form Γ ⊢ P ▷ ∆, where Γ is a global environment, P ∈ P is a process, and ∆ is a session environment. We use typing rules to derive type judgements, where we assume that all mentioned global types are well-formed and projectable, all local types are well-formed, and all environments are linear. A process P is well-typed w.r.t. Γ and ∆ if Γ ⊢ P ▷ ∆ can be derived from the rules in the Figures 5 and 6. We write nsr(∆) if, for all local types T in ∆, none of the prefixes in T is strongly reliable or a delegation prefix, and if ∆ does not contain message queues. With Γ ⊩ y:S we check that y is an expression of the sort S if all names x in y are replaced by arbitrary values of sort S x for x :S x ∈ Γ.
Let us consider the interaction cases in Figure 5. We observe that all new cases are quite similar to their strongly reliable counterparts.
Rule (RSend) checks strongly reliable senders, i.e., it requires a matching strongly reliable send prefix in the local type of the actor and compares the actor with this type. With Γ ⊩ y:S we check that y is a well-sorted expression of the sort S. Then the continuation of the process is checked against the continuation of the type. The unreliable case is very similar, but additionally checks that the label is assigned to the sort of the expression in Γ. Rule (RGet) types strongly reliable receivers, where again the prefix is checked against a corresponding type prefix and the assumption x :S is added for the continuation. Again the unreliable case is very similar but, apart from the label, also checks the sort of the default value.
Rule (RSel) checks the strongly reliable selection prefix, that the selected label matches one of the specified labels, and that the process continuation is well-typed w.r.t. the type continuation following the selected label. The only difference in the weakly reliable case is the set of roles for the receivers. For strongly reliable branching in (RBran) we check the prefix and that for each branch in the type there is a matching branch in the process that is well-typed w.r.t. the respective branch in the type. For the weakly reliable case we additionally have to check that the default labels of the process and the type coincide.
Rule (Crash) for crashed processes checks that nsr(∆) holds, i.e., that for every type G or T in ∆ the predicate nsr(G) or nsr(T ) holds.
Figure 6 presents the runtime typing rules, i.e., the typing rules for processes that may result from steps of a system that implements a global type. Since they cover only operators that are not part of initial systems, a type-checking tool might ignore them. However, we need these rules for the proofs of progress and subject reduction. Under the assumption that initial systems cannot contain crashed processes, Rule (Crash) may be moved to the set of runtime typing rules.
Rule (Res2) types sessions that are already initialised and that may already have performed some of the steps described by their global type. The relation s → is given in Figure 7 and describes how a session environment evolves alongside reductions of the system, i.e., it emulates the reduction steps of processes.
We have to prove that our extended type system satisfies the standard properties of MPST, i.e., subject reduction and progress. Because of the failure patterns in the reduction semantics in Figure 3, subject reduction and progress do not hold in general. Instead we have to fix conditions on failure patterns that ensure these properties. Subject reduction needs one condition on crashed processes, and progress requires that no part of the system is blocked. In fact, different instantiations of these failure patterns may allow for progress.
(1) If FP crash (P , . . .), then nsr(P ).
The crash of a process should not block strongly reliable actions, i.e., only processes with nsr(P ) can crash (Condition 5.2.1). Condition 5.2.2 requires that no process can refuse to consume a message on its queue, to prevent deadlocks that may arise from refusing a message that is never dropped. Condition 5.2.3 requires that if a message can be dropped from a message queue then the corresponding receiver has to be able to skip this message, and vice versa. Similarly, processes that wait for messages from a crashed process have to be able to skip (Condition 5.2.4) and all messages of a queue towards a crashed receiver can be dropped (Condition 5.2.5). Finally, weakly reliable branching requests should not be lost. To ensure that the receiver of such a branching request can proceed if the sender is crashed but is not allowed to skip the reception of the branching request before the sender crashed, we require that FP wskip (s, r 1 , r 2 , . . .) is false as long as s[r 2 ] is alive or messages on the respective queue are still in transit (Condition 5.2.6).
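For instance, Condition 5.2.6 could be approximated by the following toy predicate; the two parameters are our own simplification of the failure pattern's arguments.

```python
def FP_wskip(sender_alive, queue):
    """Skipping a weakly reliable branching request is permitted only
    once the sender has crashed and no message from it is in transit."""
    return (not sender_alive) and len(queue) == 0
```

The receiver is thus forced to wait exactly until a skip cannot lose a branching request that could still arrive.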
The combination of the six conditions in Conditions 5.2 might appear quite restrictive, as e.g. the combination of the Conditions 5.2.4 and 5.2.6 ensures the correct behaviour of weakly reliable branching such that branching messages can be skipped if and only if the respective sender has crashed. An implementation of such a weakly reliable interaction in an asynchronous system that is subject to message losses and process crashes might require something like a perfect failure detector or actually solving consensus. It is important to remember that these conditions are minimal assumptions on the system requirements and that system requirements are abstractions. Parts of them may be realised by actual software code (which then allows us to check them), whereas other parts of the system requirements may not be realised at all but rather observed (which then does not allow us to verify them).
Weakly reliable branching is a good example of this case. The easiest way to obtain a weakly reliable interaction is by using a handshake communication and time-outs. If the sender times out while waiting for an acknowledgement, it resends the message. If the sender does not hear from its receiver for a long enough period of time, it assumes that the receiver has crashed and proceeds. With carefully chosen time frames for the time-outs, this approach is a compromise between correctness and efficiency. In a theoretical sense, it is clearly not correct: there is no time frame such that the sender can be really sure that the receiver has crashed. From a practical point of view, this is not so problematic, since in many systems failures are very unlikely. If failures that are so severe that they are not captured by the time-outs are extremely unlikely, then it is often much more efficient to just accept that the algorithm is not correct in these cases. Trying to obtain an algorithm that is always correct might be impossible or at least usually results in very inefficient algorithms. Moreover, verifying this requires to also verify the underlying communication infrastructure and the way in which failures may occur, which is impossible or at least impracticable. Because of that, it is an established method to verify the correctness of algorithms w.r.t. given system requirements (e.g.
in [CT96, Lam01, vST17]), even if these system requirements are not verified and often do not hold in all (but only nearly all) cases.

Let us have a closer look at the typing rules in the Figures 5 and 6. We observe that all typing rules are clearly distinguished by the outermost operator of the process in the conclusion, except that there are two typing rules for restriction. With that, given a type judgement Γ ⊢ P ▷ ∆, we can use the structure of P (with a case split for restriction) to reason about the structure of the proof tree that was necessary to obtain Γ ⊢ P ▷ ∆ and from that derive conditions about the nature of the involved type environments. If P is, e.g., a parallel composition P 1 | P 2 then, since there is only one rule to type parallel compositions (Rule (Par)), Γ ⊢ P 1 | P 2 ▷ ∆ implies that there are ∆ 1 , ∆ 2 such that ∆ = ∆ 1 • ∆ 2 , Γ ⊢ P 1 ▷ ∆ 1 , and Γ ⊢ P 2 ▷ ∆ 2 . In the following, we write 'by Rule (Par)' as shorthand for 'by the clear distinction of the typing rules by the process in the conclusion and Rule (Par) in particular', and similarly for the other rules.
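The time-out-based handshake for weakly reliable interactions described earlier can be sketched as follows; `send`, `recv_ack` (returning None on a time-out), and the retry bound are hypothetical, and the sketch deliberately exhibits the correctness/efficiency trade-off discussed above.

```python
def weakly_reliable_send(send, recv_ack, msg, retries=3):
    """Resend msg until acknowledged; after `retries` failed attempts,
    presume the receiver crashed and proceed. This trades theoretical
    correctness for efficiency: no finite number of retries can prove
    that the receiver really crashed."""
    for _ in range(retries):
        send(msg)           # (re)transmit the branching request
        ack = recv_ack()    # returns None on a time-out
        if ack is not None:
            return "delivered"
    return "receiver presumed crashed"
```

A receiver that in fact is merely slow will be wrongly presumed crashed, which is exactly the theoretical incorrectness the text accepts in exchange for efficiency.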
In the following we prove some properties of our MPST variant. We start with an auxiliary result, proving that structural congruence preserves the validity of type judgements. The proof is by induction on P ≡ P ′ . In each case we can use the information about the structure of the process that is provided by the considered rule of structural congruence to conclude on the last few typing rules that had to be applied to derive the type judgement in the assumption. From these partial proof trees we obtain enough information to construct the proof tree for the conclusion.
Proof. The proof is by induction on P ≡ P ′ .
Ultimately, however, we are interested in coherence. Note that the coherent case obviously implies the respective weakly coherent case. Our strengthened goal for subject reduction thus becomes Theorem 5.6.

Proof of Theorem 5.6. The proof is by induction on the reduction P −→ P ′ that is derived from the rules of the Figures 3 and 4.
The induction covers one case per reduction rule, in particular the Rules (Init), (RGet), (RSel), (RBran), (WSel), (Deleg), (SRecv), (Res2), and (Rec). In each case we use alpha conversion where necessary to avoid name clashes, invert the typing rules determined by the structure of P (such as (Par), (Req), (Acc), or the typing rules for message queues) to decompose Γ ⊢ P ▷ ∆, and recombine the resulting partial proof trees into a derivation for the reduct P ′ . The detailed case analysis is omitted here.

Since we restrict our attention to linear environments, type judgements ensure linearity of session channels. With subject reduction, this holds for all derivatives of well-typed processes.
Lemma 5.8 (Linearity). Let Γ ⊢ P ▷ ∆, let Γ, ∆ be coherent, and assume there are no name clashes on session channels. Then all session channels of P are linear, i.e., for all P −→ * P ′ and all s, r, r 1 , r 2 there is at most one unguarded actor s[r] and at most one queue s r 1 →r 2 in P ′ .

Proof. By Theorem 5.6, there is some ∆ ′ such that Γ ⊢ P ′ ▷ ∆ ′ and Γ, ∆ ′ are coherent. By the Definition 5.5 of coherence and projection, ∆ ′ contains at most one actor s[r] and at most one queue s r 1 →r 2 for each a:G ∈ Γ and r ∈ R(G). By the Figures 5 and 6, only the Rules (Req), (Acc), and (Res2) can introduce new actors or queues. The linearity of global environments ensures that all new actors and queues introduced by these rules are on fresh channel names and are pairwise distinct. The Rules (Req) and (Acc) introduce exactly one actor each on a fresh session channel s that is bound by a prefix for session initialisation. Rule (Res2) introduces assignments for actors and queues for pairwise different roles on a fresh session channel s that is bound by restriction. Since there are no name clashes, the session channels in binders are pairwise different and distinct from free session channels. By the typing rules and because Γ ⊢ P ′ ▷ ∆ ′ , all actors and queues in P ′ have to satisfy their specification as described by an assignment of this actor or queue to a local type. By the linearity of session environments and since new assignments for actors result from bound session channels, all unguarded actors and queues in P ′ are pairwise different.
For strongly reliable systems, coherence ensures that for each actor there is a matching communication partner. In the case of asynchronous communication, this means that for each sender (or message on a queue) there is a receiver and for each receiver there is a sender or a message on a queue, where the receiver as well as the sender or the message queue appear under the same binder of the session channel or both are free. In the case of unreliable communication, messages get lost, senders can crash, and receivers can crash themselves or suspect the sender. In the case of weakly reliable branching, for each sender (or message on a queue) there are all specified receivers that are not crashed, and vice versa.
We summarise these properties of strongly reliable and weakly reliable interactions in error-freedom: for each strongly reliable sender or message there is a matching receiver and vice versa; for each weakly reliable sender or message there is a possibly crashed receiver and vice versa. We obtain similar requirements for session delegation.
Lemma 5.9 (Error-Freedom). If Γ ⊢ P ▷ ∆ and Γ, ∆ is coherent then:
• for each unguarded s[r 1 , r 2 ]! r ⟨y⟩.Q 1 and each message ⟨y⟩ r on a message queue s r 1 →r 2 in P there is some s[r 2 , r 1 ]? r (x ).Q 2 in P ,
• for each unguarded s[r 2 , r 1 ]? r (x ).Q 2 in P there is some s[r 1 , r 2 ]! r ⟨y⟩.Q 1 or a message ⟨y⟩ r on a message queue s r 1 →r 2 in P ,
• for each unguarded s[r 1 , r 2 ]! r l .Q and each message l r on a message queue s r 1 →r 2 in P there is some j ∈ I and s[r 2 , r 1 ]? r {l i .Q i } i∈I in P with l j = l ,
• for each unguarded s[r 2 , r 1 ]? r {l i .Q i } i∈I in P there is j ∈ I and s[r 1 , r 2 ]! r l .Q or a message l r on a message queue s r 1 →r 2 in P with l j = l ,
• for each unguarded s[r, R]! w l .Q and each message l w on a message queue s r→r ′ in P and each r ′ ∈ R there is some s[r ′ , r]? w {l i .P i } i∈I,l d and j ∈ I in P with l j = l or P does not contain an actor s[r ′ ],
• for each unguarded s[r ′ , r]? w {l i .P i } i∈I,l d in P there is j ∈ I and s[r, R]! w l .Q or a message l w on a message queue s r→r ′ in P with l j = l and r ′ ∈ R, or P does not contain an actor s[r],
• for each unguarded s[r 1 , r 2 ]!⟨⟨s ′ [r]⟩⟩.Q 1 and each message s ′ [r] on a message queue s r 1 →r 2 in P there is some s[r 2 , r 1 ]?((s ′′ [r ′ ])).Q 2 in P , and
• for each unguarded s[r 2 , r 1 ]?((s ′′ [r ′ ])).Q 2 in P there is some s[r 1 , r 2 ]!⟨⟨s ′ [r]⟩⟩.Q 1 or a message s ′ [r] on a message queue s r 1 →r 2 in P .
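The first two clauses of error-freedom can be checked on a snapshot of a system with a toy predicate; the triple-based representation of unguarded prefixes and queued messages is our own simplification.

```python
def error_free_sr(senders, queued, receivers):
    """Each element is a triple (session, sender_role, receiver_role).
    Check: every strongly reliable sender or queued message has a
    matching receiver, and every receiver a sender or queued message."""
    pending = senders | queued
    has_receiver = all(e in receivers for e in pending)
    has_sender = all(e in pending for e in receivers)
    return has_receiver and has_sender
```

The remaining clauses would additionally need the label sets and crash information, which this sketch abstracts away.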
Proof. By coherence and projection, for each strongly reliable and each weakly reliable sender there is initially a matching receiver for each free session channel in the session environment. By the typing rules and Rule (Res2) in particular, this holds also for restricted session channels. Session environments may evolve using s → but all such steps preserve the requirements defined above, i.e., strongly reliable or weakly reliable send prefixes can be mapped onto the type of the respective message in a queue, but no such message can be dropped. The typing rules ensure that the processes follow their specification in the local types of session environments. Then, the first four and the last two cases follow from the typing rules and coherence, and the fact that only unreliable processes can crash. The remaining two cases follow from the typing rules and coherence.
Session fidelity claims that the interactions of a well-typed process follow exactly the specification described by its global types, i.e., if a system is well-typed w.r.t. coherent type environments then the system follows its specification in the global type. One direction of this property already follows from the above variant of subject reduction: the steps of well-typed systems are reflected by corresponding steps of the session environment and, thus, respect their specification in global types. What remains to show is that the specified interactions can indeed be performed. The above formulation of error-freedom alone is not strong enough to show this, because it ensures only the existence of matching communication partners and not that they can be unguarded.
To obtain session fidelity we prove progress. Progress states that no part of a well-typed and coherent system can block other parts, that eventually all matching communication partners as described by error-freedom are unguarded, that interactions specified by the global type can happen, and that there are no communication mismatches. Subject reduction and progress together then imply session fidelity, i.e., that processes behave as specified in their global types.
To ensure that the interleaving of sessions and session delegation cannot introduce deadlocks, we assume an interaction type system as introduced in [BCD + 08, HYC16]. For this type system it does not matter whether the considered actions are strongly reliable, weakly reliable, or unreliable. More precisely, we can adapt the interaction type system of [BCD + 08] in a straightforward way to the above session calculus, where unreliable communication and weakly reliable branching are treated in exactly the same way as strongly reliable communication/branching. We say that P is free of cyclic dependencies between

Figure 5: Typing Rules for Fault-Tolerant Systems.

Figure 7: Reduction Rules for Session Environments.