A model of actors and grey failures

Existing models for the analysis of concurrent processes tend to focus on fail-stop failures, where processes are either working or permanently stopped, and their state (working/stopped) is known. In fact, systems are often affected by grey failures: failures that are latent, possibly transient, and may affect the system in subtle ways that later lead to major issues (such as crashes, limited availability, overload). We introduce a model of actor-based systems with grey failures, based on two interlinked layers: an actor model, given as an asynchronous process calculus with discrete time, and a failure model that represents failure patterns to inject in the system. Our failure model captures not only fail-stop node and link failures, but also grey failures (e.g., partial, transient). We give a behavioural equivalence relation based on weak barbed bisimulation to compare systems on the basis of their ability to recover from failures, and on this basis we define some desirable properties of reliable systems. By doing so, we reduce the problem of checking reliability properties of systems to the problem of checking bisimulation.


Introduction
Many real-world computing systems are affected by non-negligible degrees of unpredictability, such as unexpected delays and failures, which are not straightforward to capture accurately. Several works contribute towards a formal account of unpredictability, for example in the context of process calculi (potentially including session types), by extending calculi to model node failures [FGL+96, RH97], link failures [APN17], and a combination of link and node failures [BH03]; these calculi also add a variety of program constructs to deal with failures, including escapes [CGY16], interrupts [HNY+13], exceptions [FLMD19], and timeouts [LZ05, BY07, LP11]. Most existing models assume a fail-stop model of failure, where processes are either working or permanently stopped, and their state of being either working or stopped is known. In fact, systems are often affected by grey failures: failures that are latent, possibly transient, and may affect the system in subtle ways that later lead to major issues, such as crashes, limited availability and overload. The symptoms of grey failure tend to be ambiguous. Several kinds of grey failure have been studied in the last decade, such as transient failure (e.g., a component is down at periodic intervals), partial failure (only some sub-components are affected), or slowdown [GSS+18]. In a distributed system, processes may have different perceptions as to the state of health of the system (aka differential observation) [HGZ+17]. Grey failures tend to be behind many service incidents in cloud systems, and in these situations traditional fault tolerance mechanisms tend to be ineffective or counterproductive [HGZ+17]. Diagnosis can be challenging and lengthy: for example, the work in [LHS20] estimates a median time for the diagnosis of partial failures to be 6 days and 5 hours. One of the main causes of late diagnosis is ambiguity of the symptoms and hence difficulty in correlating failures with their effects.
In this paper we make a first step towards a better understanding of the correlation between failures and symptoms via static formal analysis. We focus on the distributed actor model of Erlang [Arm13], which is known for its effectiveness in handling failures and has been emulated in many other languages, e.g., the popular Akka framework for Scala [Wya13].
We define a formal model of actor-based systems with grey failures, which we call 'cursed systems'. More precisely, we introduce two interlinked models: (1) a model of systems, which are networks of distributed actors; (2) a model of (grey) failures that allows us to characterise 'curses' as patterns of grey failures to inject in the system. This model of failures can represent node failures (with loss of messages in the node's mailbox), node slowdowns, link failures (with loss of the message in transit), and link slowdowns. The aforementioned instances of failure can be specified at the granularity of single nodes and links, to capture total and partial failures, and at the granularity of (discrete) time instants, to capture permanent, transient, and periodic failures. For example, a failed node can be in a failed state for a while before being restarted. The model of systems allows one to specify whether a node is restarted from the initial state (reset) or from a checkpoint. To capture the ambiguity of symptoms of grey failure we assume that actors have no knowledge of the state of health of other actors. However, actors can observe the presence (or absence) of messages in their own mailboxes and hence can infer the effects of failure from the communications that they have (not) received. In Erlang, a key mechanism for detecting and dealing with failure is the use of timeouts, which are one of the main ingredients of our system model.
Modelling failures as a separate layer allows us to compare systems' recovery strategies with respect to specific failure patterns. This is a first step towards analysing the resilience of systems to failures, and assessing the effect of failure on different parts of the system. We introduce a behavioural equivalence, based on weak barbed bisimulation, to compare systems affected by failures. We show that reliability properties of interest, namely resilience and recoverability, can be reduced to the problem of checking weak barbed bisimulation between systems with failures. Furthermore, we introduce a notion of augmentation, based on weak barbed bisimulation, to model and analyse the improvement of a system with respect to its recoverability against certain kinds of failure.
Synopsis. The paper is structured as follows. In Section 2, we give an informal overview of the system model, and compare it with related work. Next we introduce the models of failure (Section 3) and systems (Section 4). In Section 5 we give a behavioural equivalence between systems with failures, and show how it is used to model properties of interest. Section 7 describes prospective applications and promising directions of this work. Section 8 discusses conclusions and related work.
Extensions with respect to the conference paper. This work is an extended version of the conference paper that appeared in COORDINATION 2022 [BLTV22], with the following additional contributions:
• The syntax and semantics of systems have been extended to model checkpoints. We have added Section 4.2 to show that the extension can still express the systems in [BLTV22] (Proposition 4.11). In a new section, Section 5.3, we show that the notion of n-recoverability given in [BLTV22] is not suitable for systems with checkpoints. We have therefore added, in Section 5.3, a more relaxed notion of n-recoverability for checkpointing systems.
• We found a flaw in the definition of n-recoverability given in [BLTV22], which we have fixed in this extended version. In Section 5.1 we give an amended definition of n-recoverability alongside the one given in [BLTV22], and discuss the differences by examples.
• In [BLTV22] we informally stated a relationship between two reliability properties we defined in that work: 'resilience is equivalent to 0-recoverability'. In a new section of this extended version, Section 5.2, we give the formal proof of this equivalence based on our amended definition of n-recoverability.
• The examples in Section 5 have been improved to reflect the feedback from the presentation at the conference. For instance, Example 5.6 has been framed to show the role of redundancy in fault-tolerance and how we can express it with our framework.
• Following feedback at the conference presentation, we have extended the section on related work. In particular, we have added a comparison with the works in [Gär99] and [DCMA17]. Due to the particular relevance of the work in [Gär99], we have also added a new subsection, Section 5.4, with a more technical discussion on how our work relates to the more general definitions of fault-tolerance given in [Gär99].
• We have added Section 7 with a discussion of prospective applications of this work.

Informal overview
Actor-based systems are modelled using a process calculus with three key elements, following the actor model of Erlang: (1) time and timeouts, (2) asynchronous communication based on mailboxes with pattern-matching, and (3) actor nodes and injected failures.
Time and timeouts. Timeouts are essential for an actor to decide when to trigger a recovery action. Time is also crucial to observe the effects of failure patterns including quantified delays or down-times of nodes and links. We based our model of time on the Temporal Process Language (TPL) [HR95], a well understood extension of CCS with discrete time and timeouts. Delays are processes of the form sleep.P that behave as P after one time unit. Timeouts are modelled after the idiomatic receive...after pattern in Erlang. Concretely, that Erlang pattern is modelled as the process $?\{p_1.P_1, \ldots, p_N.P_N\}\ \mathtt{after}\ m\ Q$, where $p_1, \ldots, p_N$ is a set of patterns, each associated with a continuation $P_i$, with $i \in \{1, \ldots, N\}$, and Q is the timeout handler, executed if none of the patterns can be matched with a message in the mailbox within m time units. Following TPL, an action can be either a time action or an instantaneous communication action, and time actions can happen only when communication actions are not possible (maximal progress [HR95]). Concretely, we define the systems' behaviour as a reduction relation with two kinds of actions: communication actions ⇀ and time actions ⇝. While TPL is synchronous and only prioritises synchronisations over delays, we model asynchronous communications and prioritise any send or receive action over time actions. Thus, in our model, by maximal progress, communications have priority over delays.
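The interplay between timeouts and maximal progress can be illustrated with a small simulation. The Python sketch below is not the calculus itself: it assumes that messages are plain atoms, that patterns match by equality, and that a hypothetical `arrivals` map lists the messages reaching the mailbox at each time unit, counted from the start of the receive. It also assumes that a message arriving exactly at time m is too late, which is how we read the nesting of m unit timeouts.

```python
def receive_after(arrivals, patterns, m):
    """Simulate ?{p_1.P_1, ..., p_N.P_N} after m Q for atom-only patterns.

    arrivals: dict mapping a relative time to the list of messages that
              reach the mailbox at that time (hypothetical input format).
    Returns ("received", message, time) or ("timeout", None, m).
    """
    mailbox = []
    for t in range(m):
        mailbox.extend(arrivals.get(t, []))
        # Maximal progress: a possible receive takes priority over time passing.
        for i, msg in enumerate(mailbox):
            if msg in patterns:
                return ("received", mailbox.pop(i), t)
        # No communication is possible at time t, so one time unit elapses.
    # m time units have passed without a match: the handler Q would run now.
    return ("timeout", None, m)


on_time = receive_after({1: ["a"]}, ["a"], 2)
too_late = receive_after({2: ["a"]}, ["a"], 2)
```

Here `on_time` records a receive at relative time 1, while `too_late` ends in a timeout because the message only shows up once the two time units have elapsed.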
The state of an actor at a time t is modelled as $n[P](M)(t)$, where n is the actor identifier (unique in the system), M the mailbox, and P the process run by that actor. System $R_t$ below is the parallel composition of actors $n_1$ and $n_2$:

Although each actor in $R_t$ has its own local time t explicitly represented, which makes it easy to inject failures compositionally, our semantics keeps the time of parallel components synchronized (as in TPL). In $R_t$, node $n_1$ is deliberately idling and $n_2$ is temporarily blocked on a receive/timeout action, so no communication can happen, and thus only a time action is possible, updating both actors' times and triggering the timeout in $n_2$:

Mailboxes. Each pair of actors can communicate via two unidirectional links. For example, $(n_1, n_2)$ denotes the link for communications from $n_1$ to $n_2$. An interaction involves three steps: (I) the sending actor sends the message by placing it in the appropriate link, (II) the message reaches the receiver's mailbox, and (III) the receiving actor processes the message. These three steps allow us to capture, e.g., effects of failures in senders versus receivers, on nodes versus links, and to model latency. Consider the system

Step (I), the sending of a message, is illustrated below on $R_c$:

$1.(n_1, n_2, a)$ models a latent message in link $(n_1, n_2)$ with content a. Prefix 1 is the average network latency (assumed to be a constant). Due to latency, the message can only be added to the receiver's mailbox after one time step:

These floating messages $(n_1, n_2, a)$ with no latency are similar to messages in the ether [SFE10], in the global mailbox [LNPV18], or to the floating messages in [LSZ19].
Step (II) is the reception of the message, and happens as illustrated below (omitting the idle actor $n_1$), where message a is added to the mailbox of $n_2$:

Step (III) is the processing of the message, as illustrated below:

where message a in the mailbox matches the receive pattern (made up of a single atom a) and is therefore processed. Mailboxes give us an expressive model of communication for modern real-world systems. An alternative model of communication is peer-to-peer communication, used e.g., in Communicating Finite State Machines (CFSM) [BZ83] and Multiparty Session Types [HYC16, CDYP16], where a receiver must specify from whom the message is expected. This makes it difficult to accurately capture interactions with public servers, or patterns like multiple producers-one consumer.
In the interaction above, note that $n_2$ processes message a because it matches pattern a; this would be the case even if there were an older message b in the mailbox, provided that b did not match pattern a. Alternative models, like Mailbox CFSMs [BBO12, BGF+21], typically do not model the selective receive pattern (e.g., pattern-matching in Erlang) shown above. Without selective receive, participants can easily get stuck if messages are received out of order. One can encode peer-to-peer communication over FIFO unidirectional channels by using pattern matching with selective receive: using the sender's identifier in the message and in the receive pattern. A similar communication model to ours was proposed in [MV11].
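Selective receive, and the encoding of peer-to-peer communication on top of it, can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the mailbox is a list ordered from oldest to newest, a pattern is represented by a hypothetical predicate `accepts`, and the sender's identifier is carried as the first component of each message.

```python
def selective_receive(mailbox, accepts):
    """Remove and return the oldest message accepted by the pattern.

    Older messages that do not match are skipped and stay in the mailbox.
    Returns None when no message matches (the receive stays blocked).
    """
    for i, msg in enumerate(mailbox):
        if accepts(msg):
            return mailbox.pop(i)
    return None


# Messages are (sender, payload) pairs; the message from n3 is the older one.
mailbox = [("n3", "b"), ("n1", "a")]

# Peer-to-peer encoding: the receive pattern names the expected sender.
from_n1 = selective_receive(mailbox, lambda msg: msg[0] == "n1")
```

After the call, `from_n1` holds the message sent by n1 and the unmatched message from n3 remains queued, so an out-of-order arrival does not block the receiver.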
Localities and failures. The actor construct is similar to that used to model locality for processes [Cas01], and also studied in relation to failures [BH03, RH01, FH07, FH08] but using a fail-stop untimed model. We use actor nodes to model the effects of injected failures on specific nodes and links.
Referring to system $R'_c$ in (2.1), by placing floating messages into a link with latency before they reach the receiver's mailbox we can observe the effects of link failure as message loss. Assume link $(n_1, n_2)$ is down at time t:

∥ $n_2[\,?a.P\ \mathtt{after}\ 2\ Q\,](\emptyset)(t)$

The floating message gets lost, which in turn would end up causing a timeout in $n_2$. Similarly, in the case of node failure, node $n_1$ in system $R_c$, seen earlier in (2.1), would go into a crashed node state before sending the message, hence triggering a timeout in $n_2$:

Assumptions. When a node crashes and comes back up again later on, it will come up with the same node identifier. This is consistent with Distributed Erlang, where by default all nodes are named; on the other hand, if we were resuscitating processes, we would need to name them for this to be possible. For simplicity, we assume nodes are not created at run-time, focusing on fixed topologies. Extending the language with the capability of creating new nodes is relatively straightforward, and can be done in a similar way to π-calculus restriction. We assume that behaviour within a node is sequential: actors can be composed in parallel but processes cannot, hence limiting communication to distributed communications between nodes.
We choose to focus on inter-node communication on its own, because there already exist good strategies (e.g., in Erlang and Elixir) for dealing with in-node failure through the use of a supervision hierarchy, supervision strategies, and the let-it-crash philosophy. Messages in transit when a node goes down remain in transit and may enter the mailbox after this node is resumed.
We allow a restricted (external) version of choice, based on the communication patterns found in Erlang.Free, or completely unrestricted choice, while central to many process algebras, for example CCS, tends to be less used in practice.

A model of failures
Let N be the set of node identifiers in a system. The model of failures is defined to be the function $\Delta : \mathbb{N} \to (N \cup (N \times N)) \to \{\uparrow, \downarrow, \circlearrowright\}$, mapping each discrete time t ∈ ℕ, node n ∈ N, and link $(n_1, n_2)$ ∈ N × N to a value representing the state of health of that node or link at that time. The symbol ↑ denotes the "healthy" state, ↓ identifies the failure of a node or link, and ⟳ indicates a node or link slowdown.
The failure scenarios covered by ∆ include node crash, message loss, slow processes or slow networks. If node n is down at time t, written ∆(t)(n) = ↓, then it will perform no action until it is resumed, if ever. If n is resumed at time t′, then its state at time t′ will be set to the initial state (see Definition 4.6 for the formal definition). If link $(n_1, n_2)$ is down at time t, written $\Delta(t)(n_1, n_2) = \downarrow$, then any message in transit on that link at time t will be lost. If node n is slow at time t, written ∆(t)(n) = ⟳, then any actions of the process running in n are delayed for one time step, and may resume at time t + 1 if ∆(t + 1)(n) = ↑. If link $(n_1, n_2)$ is slow at time t, written $\Delta(t)(n_1, n_2) = \circlearrowright$, then the delivery of any message in transit on that link at time t will not happen at that time, and so will be delayed by at least one time unit. This delay is in addition to the network latency, which is modelled as a constant. Failures can be permanent or transient, as shown below by examples.
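The effect of link states on a single message can be approximated as follows. This Python sketch rests on several assumptions that the prose leaves open: a down link drops every in-transit message, whether latent or already floating; a slow link only blocks delivery and does not pause the latency countdown; and the hypothetical `horizon` bound merely keeps the loop finite. The names `delivery_time` and `link_state` are illustrative.

```python
UP, DOWN, SLOW = "up", "down", "slow"


def delivery_time(sent_at, latency, link_state, horizon=1000):
    """Return the time at which a message reaches the mailbox, or None.

    link_state: function from a time t to UP, DOWN or SLOW for the link.
    None means the message was lost (or, with a link that stays slow,
    still undelivered when the horizon is reached).
    """
    remaining = latency
    for t in range(sent_at, sent_at + horizon):
        state = link_state(t)
        if state == DOWN:
            return None          # message in transit on a down link is lost
        if remaining == 0:
            if state == UP:
                return t         # floating message is delivered
            # SLOW: delivery does not happen at this time, wait one unit
        else:
            remaining -= 1       # latency elapses as time passes
    return None


healthy = delivery_time(0, 1, lambda t: UP)
slowed = delivery_time(0, 1, lambda t: SLOW if t == 1 else UP)
lost = delivery_time(0, 1, lambda t: DOWN if t == 1 else UP)
```

With a latency of one time unit, a healthy link delivers at time 1, a slowdown at time 1 postpones delivery to time 2, and a failure at time 1 loses the message.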
Example 3.1 (Permanent and transient failures). Permanent node failure after a certain point in time, say t = 10, can be modelled by the definition $\Delta_1$ below. Function $\Delta_2$ shows a transient periodic structural failure of node n, with each period having 100 time units of healthy state and 100 of down state.
One could similarly model transient degrading failure by setting uptimes when $t = n^2$ for n ∈ ℕ.
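The curses of Example 3.1 can be written as ordinary functions. The Python sketch below is an illustration, not the paper's definitions: it fixes a single affected node named "n", represents links as pairs of node names, assumes that "after t = 10" includes time 10, assumes that each period starts with the healthy phase, and takes the degrading failure to be up exactly at perfect squares.

```python
import math

UP, DOWN, SLOW = "up", "down", "slow"


def delta_1(t, target):
    """Permanent failure: node "n" is down from time 10 onwards."""
    if target == "n" and t >= 10:
        return DOWN
    return UP


def delta_2(t, target):
    """Periodic failure: node "n" is healthy for 100 units, down for 100."""
    if target == "n" and t % 200 >= 100:
        return DOWN
    return UP


def delta_3(t, target):
    """Degrading failure: node "n" is up only when t is a perfect square."""
    if target == "n" and math.isqrt(t) ** 2 != t:
        return DOWN
    return UP


# Every other node and every link, e.g. ("n", "m"), stays healthy.
link_ok = delta_1(50, ("n", "m"))
```

Because the uptimes of `delta_3` grow further apart as t increases, the node spends an ever larger share of time in the down state.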

Calculus for cursed systems
This section presents the model for actor-based systems. The syntax of the calculus is given in Figure 1.
Systems are nodes $n[P]_Q(M)(t)$, messages (floating or latent), crashed nodes $n[\downarrow]_Q(\emptyset)(t)$, empty systems ∅, and parallel compositions of systems R || R. The term $n[P]_Q(M)(t)$ denotes the state of node n ∈ N at time t, where P is the process running in n, Q is the saved checkpoint process, and M is the mailbox of n. A mailbox is a (possibly empty) list of messages. A message m is a tuple of values, which can be atoms a, node ids n or variables X. Messages are read from a mailbox via pattern matching.
We define the pattern matching function in the style of [MV11] through the derivations in Figure 2. Given a pattern E and a message (tuple) V, written $(E, V) \vdash_{match} \sigma$, the match function returns a substitution σ. Note that the match is only defined if E and V have the same size, and if the pattern and message match. We write $(E, m) \nvdash_{match}$ when message m does not match pattern E. Juxtaposition denotes concatenation of pattern and value tuples, and, since we assume that variables appear uniquely in pattern tuples, σσ is the union of the two substitutions.
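A rough executable reading of the match function is given below. It is a sketch based only on the prose above, since the derivations of Figure 2 are not reproduced here: patterns and messages are Python tuples, the hypothetical class `Var` marks pattern variables, atoms and node ids are plain strings compared by equality, and a failed match is signalled by `None`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A pattern variable such as X (hypothetical representation)."""
    name: str


def match(pattern, message):
    """Return the substitution sigma as a dict, or None if there is no match."""
    if len(pattern) != len(message):
        return None              # match is undefined for different sizes
    sigma = {}
    for p, v in zip(pattern, message):
        if isinstance(p, Var):
            sigma[p.name] = v    # variables are unique, so no clash arises
        elif p != v:
            return None          # an atom or node id must agree exactly
    return sigma


sigma = match(("req", Var("X")), ("req", "n1"))
```

Since each variable occurs at most once in a pattern, building the dictionary component by component corresponds to taking the union of the substitutions obtained for the parts of the tuple.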
A floating message $(n_1, n_2, m)(t)$ represents a message m in link $(n_1, n_2)$. Latent messages $u.(n_1, n_2, m)(t)$ are floating messages which can only reach the receiver's mailbox after a latency u. We assume all sent messages have a latency defined as a constant L, which abstracts the average network latency.
Looking at processes, a term of the form $!\{n_i\ m_i.P_i\}_{i \in I}$ chooses to send to node $n_i$ a message $m_i$ and continues as $P_i$. Term $?\{p_i.P_i\}_{i \in I}\ \mathtt{after}\ P$ tries to pattern match a message from the mailbox against one of the patterns $p_i$, and continues as $P_i$ given that the matching succeeds for $p_i$, timing out after one time unit if no message matches and executing P. Process sleep.P consumes a time unit and then continues as P. Process save.P saves the current state as a checkpoint process. Process µt.P is for recursion, and t is the recursive call. Finally, 0 is the idle process.
Remark 4.1. We use notation $?\{p_i.P_i\}_{i \in I}\ \mathtt{after}\ u\ P$ as syntactic sugar for nesting u timeouts, and $\mathtt{sleep}^u.P$ for the sequential composition of u delays with continuation P.
Recall (Section 3) that we fix the set of system's nodes N, and the domain of ∆ is N ∪ (N × N), that is the set of nodes and links between pairs of nodes. Our unit of analysis is a cursed system, defined below.
Definition 4.2 (Cursed system). A cursed system is a pair (R, ∆) where R is a system and ∆ is a curse.
The semantics of cursed systems is given in Def. 4.3 as a reduction relation over systems that is parametric on ∆. We write $R_1 \equiv R_2$ to mean that the systems $R_1$ and $R_2$ are the same up-to associativity and commutativity of ||, plus 0.
The first set of rules in Figure 3a is for actors' actions, happening at a time t, when the nodes and links are in a healthy state, i.e. ∆(t)(n) = ↑. In rule [Snd], n chooses to send a message $m_j$ to node $n_j$, and continues as $P_j$. Modelling asynchronous communication, a latent message $L.(n, n_j, m_j)(t)$ is introduced in the system, where L is the network latency constant. Rule [Sched] delivers a floating message to the receiver's mailbox. Rule [Rcv] retrieves the first message m in the mailbox that matches one of the receive patterns $p_j$. The match function returns a substitution σ that is applied to the continuation process $P_j$ associated with pattern $p_j$, and m is removed from the mailbox. Rule [Checkpoint] saves the current state P as a checkpoint process for that node n. Finally, rule [Rec] allows a node with a recursive process to proceed with a communication or a time action.

A further rule models an instantaneous node crash injected by ∆(t)(n) = ↓, and erases the process and mailbox of the node. Rule [DownLate] allows time to pass for a crashed node. In rule [NUp] a crashed node is restarted with its saved checkpoint process Q and empty mailbox. Σ is a mapping from N to processes that gives the initial process of each actor node. We assume that the node identifier is unchanged when restarting the node.
Runtime System actions. The last set of rules given in Figure 3d models system actions.
In rule [ParCom] a communication action of system part $R_1$ is reflected in the composite system $R_1 || R_2$. In rule [ParTime] time actions need to be reflected in all the parts of a system. A whole system can have a time action only if all parts of the system have no communication or failure actions to perform at the current time ($R_i \not\rightharpoonup$). [Str] is for communication and time actions of structurally equivalent systems.

4.1. Basic properties of systems reductions.
In the remainder of this section we discuss two properties of cursed systems: time-coherence (the semantics keeps clocks synchronized) and non-Zenoness. We start by defining the time of a system. All definitions below apply straightforwardly to cursed systems by fixing a ∆.

Definition 4.4 (Time of a system). Let t range over ℕ ∪ {∗}. We define the synchronization (partial) function δ: it returns a time or a wildcard ∗, and is undefined if $t_1 \neq t_2$ and neither $t_1$ nor $t_2$ is a wildcard. We define time(R) as a partial function over systems:

We can now define time-coherence of a system, holding when all its components have the same time.

Definition 4.5 (Time coherence). R is time coherent if time(R) is defined. For example, system $n_1$

The time function is also useful to characterise systems where all actors are coherently at time 0 and in their initial state.

Definition 4.6 (Initial system). Let Σ and Γ be mappings from N to processes such that Σ(n) is the initial process of n and Γ(n) is the initial checkpoint of n. Note that by the definition of processes ↓ is not a process, and so nodes are never crashed in the initial state.
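The synchronization function and the time of a system (Definitions 4.4 and 4.5) can be approximated in Python as follows. The sketch makes assumptions beyond the prose: a system is flattened into the list of local times of its components, the empty system contributes the wildcard, and "undefined" is encoded as `None`. The names `sync` and `system_time` are illustrative.

```python
from functools import reduce

WILDCARD = "*"


def sync(t1, t2):
    """Combine two times; None stands for 'undefined'."""
    if t1 is None or t2 is None:
        return None              # undefinedness propagates
    if t1 == WILDCARD:
        return t2
    if t2 == WILDCARD or t1 == t2:
        return t1
    return None                  # two different proper times disagree


def system_time(component_times):
    """Time of a parallel composition, given its components' local times."""
    return reduce(sync, component_times, WILDCARD)


def is_time_coherent(component_times):
    return system_time(component_times) is not None


coherent = system_time([3, 3, 3])
incoherent = system_time([3, 4])
```

A composition whose components all sit at time 3 has time 3 and is time coherent, whereas mixing times 3 and 4 leaves the time undefined.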
We assume any system R to start off as initial and hence, by Prop. 4.8, to be time-coherent.
Next we show that the reduction over systems preserves time-coherence, hence all reachable systems are coherent.
The proof of the lemma is straightforward, by induction on the derivation. In fact, the only rule that updates the time of a parallel composition is [ParTime], which requires time passing for all parallel processes. The fact that if R is initial then time(R) is defined (as 0) yields the following property. We let →* be the transitive closure of the reduction relation.
Next, we give a desirable property for timed models: non-Zenoness. This prevents an infinite number of communication actions at any given time (Zeno behaviours). Besides yielding a more natural abstraction of a real world system, non-Zenoness simplifies analysis; for example, we can assume that the set of states reachable without time passing is finite. We start by defining a non-instantaneous process.

Definition 4.9 (Non-instantaneous process). We define function ninst(P) inductively as follows:

We say that P is non-instantaneous if ninst(P) = true. We say that R is non-instantaneous if all nodes in R run non-instantaneous processes.
The proof is straightforward by induction on the structure of R′. Intuitively, any non-instantaneous actor can only make a finite number of instantaneous actions at any given time, and hence at time time(R′). Hereafter we assume systems to be non-instantaneous, and hence non-Zeno.

4.2. Reset vs Checkpointing Systems.
We call reset systems those systems obtained using the grammar for systems but without the save processes save.P, and where Γ = Σ. Reset systems model systems where each node reacts to (presumed) failure by restarting the execution from the initial state. More formally:

Proposition 4.11 (Reset systems). If R is reachable from an initial reset system then for all n

The property above is proved straightforwardly by coinduction, showing that having checkpoint Σ(n) in all nodes is a property of initial reset systems and an invariant of reset systems preserved by reduction (by case analysis on the reduction rules).
Reset systems are common in Erlang: robustness is provided by a supervision hierarchy which explicitly describes the ways in which parts of the system are restarted when they or other parts fail.Restarts can escalate: if a component repeatedly restarts, then its parent process may itself have to be restarted.
While Erlang provides no explicit mechanism for checkpointing, it is possible to save state periodically using bulk storage known as ETS-tables. These provide global storage from which state can be retrieved, always assuming that the tables themselves are preserved. Disk-based ETS-tables (DETS-tables) provide more permanent storage, but with an associated time cost.
In fact, in the short version of this article [BLTV22] we focussed on a formalism that corresponds to reset systems. Here, we explore a more general setting, to show a more interesting relationship of our work to the ones in [Gär99, DCMA17], and particularly to the notion of non-masking fault-tolerance therein.

Properties of cursed systems
In this section we define a behavioural relation between cursed systems, as a weak barbed bisimulation, which is the standard choice since we have a reduction semantics [SW01]. The aim is to compare the systems' abilities to preserve 'normal' functionality when they are affected by failures. We abstract from the fact that some parts of the system may be deadlocked, as long as healthy actors can keep receiving the messages they expect. Mailbox-based (rather than point-to-point) communication and pattern matching allow us to capture e.g., multiple-producer scenarios where a consumer can receive the expected feeds as long as some producers are healthy.
Our behavioural relation also abstracts from time, to disregard the delays introduced by recovering actions, and only observes the effects of such delays (we do not focus on efficiency). Essentially, two systems are equivalent when actors receive the same messages, abstracting from senders, in a time-abstract way.² On the basis of this equivalence we define recoverability and augmentation.
We start by defining weak barbed simulation for cursed systems.
If R ↓ x we say that R has a barb on x.
Barbs abstract from (i.e., do not include in the model of observation) the sender of a message. This allows us to disregard the identity of the senders, following mailbox-based communications in actor-based systems. Scenarios where the identity of the sender is important can be encoded by using node identifiers as message content.³ We observe m and p to retain expressiveness with respect to channel-based scenarios, as discussed in Section 6.1.
Example 5.2 (Examples on barbs). Consider a system $R_R$ with a consumer node c receiving data d from two replicas r1 and r2. If the messages from both replicas are delayed then the consumer notifies a monitor node m (omitted here for simplicity):

Regarding our choice of barbs in this example, the consumer needs to receive regular feeds d, no matter whether they are from r1 or r2. Abstracting away from the identity of the sending replica is directly captured by our definition of barbs. In fact, the set of barbs of $R_R$ is {!c d, ?c d}.
A system defined in the same way as $R_R$ but with only one replica, e.g. obtained by removing node r2, or with one of the replicas down, e.g. obtained by substituting node r2 with r2[ ↓ ](∅)(0), would have the same set of barbs as $R_R$, namely {!c d, ?c d}.
It is worth noting that if the identity of the sender does matter, it can be observed by encoding the identity into the messages being sent by the sender:

The set of barbs of

If node r2 was removed or crashed, the set of barbs would be affected, becoming {!c n1 d, ?c n1 d, ?c n2 d}, and making it possible to distinguish among senders.
2 Abstracting from timing and message senders is an assumption of our model that we adopted for the sake of generality: it allows us to capture scenarios where the timing and order of the messages does not matter (e.g., multiple producers).On the other hand the model can encode scenarios where such orders matter.It can, for example, support Erlang-style actor behaviour.Erlang does not guarantee temporal order of messages between different processes in general, however between any two processes it does guarantee that messages sent directly between them will be received in the same order.Erlang behaviour can be encoded in our model if messages are extended to include the identity of the sender and a counter (e.g., as atoms) to guarantee message origin and ordering.
3 This is precisely how sender information is communicated in Erlang.

Example 5.6 shows a non-resilient cursed system (R, ∆) and a resilient variant (R′, ∆) obtained by tuning the timeout in R. In the following Example 5.7 we provide two additional resilient variants of (R, ∆) obtained using time-redundancy (e.g., retry strategies) and space-redundancy (e.g., replication). Redundancy has been shown [Gär99] to be a necessary condition for fault-tolerance. Resilience gives a tool to assess whether a 'redundant' system is indeed attaining the intended fault-tolerance.
Example 5.7 (Resilience and redundancy). Consider (R, ∆) from Example 5.6 and, again, fix the latency constant as 1 time unit. We define a variant of R, called $R_T$, where time-redundancy is attained by retrying the communication once more in case of timeout:

Similarly, we define a variant of R, called $R_S$, where space-redundancy is applied by adding an extra producer:

One can verify that both $R_T$ and $R_S$ are resilient with respect to ∆ from Example 5.6.
Our definition of resilience sets the behaviour of a system without curses as a model of expected behaviour. By Definition 5.5, any deviation from the expected behaviour, even a temporary one, makes a system non-resilient. This is a very strict characterization of fault-tolerance. For example, resilience is too strong to capture the effects of more complex retry strategies than those applied in $R_T$ from Example 5.7, as shown in Example 5.8 below.
Example 5.8 (Resilience and more complex retry strategies). Consider ∆ from Example 5.6, latency of 1 time unit, and a variant $R_{TT}$ of $R_T$, where time-redundancy affects both processes:

$$R_{TT} = p[\,\mu X.\mathtt{sleep}.\,!c\ \mathit{item}.\,?\{\mathit{ok}.0,\ \mathit{retry}.X\}\,](\emptyset)(0) \;\parallel\; c[\,\mu X.\,?\mathit{item}.\,!p\ \mathit{ok}.0\ \mathtt{after}\ 3\ (!p\ \mathit{retry}.X)\,](\emptyset)(0)$$

System $R_{TT}$ is not resilient with respect to ∆ from Example 5.6 because the nodes add some communications to acknowledge correct interaction or coordinate on a retry iteration.
In the remainder of this section, we study a less restrictive characterization than resilience, which we call recoverability, to allow for some deviation from the expected behaviour as long as the system eventually resumes the expected behaviour. In Section 5.1 we discuss recoverability. In Section 5.2 we show a relation between resilience and recoverability. In Section 5.3 we provide a more general account of reliability that can easily capture reset and checkpointing systems. Section 5.1 is based on the notion of n-recoverability first introduced in [BLTV22], which is fixed and improved. Section 5.3 is new.

5.1. Recoverability for reset-systems.
We define n-recoverability as the ability of a system to display the expected behaviour after time n. The definition from [BLTV22] had several issues that we have amended in this work. The original definition is as follows:

Example 5.10 (Counterexample). Fix a latency of 1 time unit and a generic ∆ that does not affect any node or link at time 0. System $R_{ce}$ below reduces to $R'_{ce}$ after a communication action

which does not hold for any ∆ because of a difference in barbs, hence not even for ∆ = ↑.
Example 5.10 shows that Definition 5.9 is too strict to capture the intended meaning of n-recoverability. In [BLTV22], for example, 0-recoverability is (wrongly) set to correspond to resilience. Definition 5.9 requires that all states at time n are bisimilar to the initial state, and this is too strict since several actions may naturally happen in a time unit.
We provide a weaker definition of n-recoverability, using universal quantification over paths of actions at time n and existential quantification on the states on each of these paths, which better represents the intuition. First, we define the concept of n-entry, which is the set of states that are, for some execution, the first state to be reached at time n. Then an n-path is the maximal path from an n-entry where states are at time n.
Definition 5.11 (n-Entry). Let n ∈ N and (R_0, ∆) be an initial state. If n = 0 then (R_0, ∆) is the only 0-entry for itself. If n > 0, (R, ∆) is an n-entry for (R_0, ∆) if there exists an execution
An n-entry (R, ∆) is the first state to be reached at time n. Observe that in Definition 5.11 if n > 0 then it is always the case that time(R′) = n − 1. We define an execution (R_1, ∆) ⇀* (R_m, ∆) to be a sequence of configurations (R_i, ∆), with 1 ≤ i ≤ m − 1, such that (R_i, ∆) ⇀ (R_{i+1}, ∆).
Definition 5.12 (n-Path). Let n ∈ N and (R_0, ∆) be an initial state. Execution (R_1, ∆) ⇀* (R_m, ∆) is an n-path for (R_0, ∆) if: (1) (R_1, ∆) is an n-entry for (R_0, ∆), and (2) (R_m, ∆) cannot make other actions than time actions.
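To make the notions of n-entry and n-path concrete, the following minimal sketch extracts both from a single finite execution. It is an illustration under simplifying assumptions, not part of the calculus: a state is reduced to a hypothetical pair (label, time), the execution is assumed to be listed in reduction order, and it is assumed to be maximal at each time (so that the last state at time n can only perform time actions). The names `State`, `n_entry_index` and `n_path` are introduced here for illustration only.

```python
from typing import List, Optional, Tuple

# Illustrative stand-in for a cursed system (R, Delta): a label plus its time.
State = Tuple[str, int]


def n_entry_index(execution: List[State], n: int) -> Optional[int]:
    """Index of the first state in `execution` whose time equals n, if any."""
    for i, (_, t) in enumerate(execution):
        if t == n:
            return i
    return None


def n_path(execution: List[State], n: int) -> List[State]:
    """Maximal run of consecutive states at time n, starting from the n-entry."""
    start = n_entry_index(execution, n)
    if start is None:
        return []
    end = start
    while end < len(execution) and execution[end][1] == n:
        end += 1
    return execution[start:end]


# Hypothetical execution: two states at time 0, three at time 1, one at time 2.
run = [("R0", 0), ("R1", 0), ("R2", 1), ("R3", 1), ("R4", 1), ("R5", 2)]
print(n_entry_index(run, 1))
print(n_path(run, 1))
```

In this toy run the 1-entry is the state labelled R2, and the 1-path is the run of three states at time 1 that starts there.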
Definition 5.13 says that in any arbitrary n-path there exists a state (R_i, ∆) reachable at time n that is weak barbed bisimilar to (R_0, ↑).
Example 5.14 (n-Recoverability). Consider the system R below (and any ∆ that does not affect the system at times 0 and 1): (R, ∆) reduces to the successfully terminated system (R′, ∆) with at time zero.
System R_TT is not resilient with respect to ∆ from Example 5.6 because the nodes add some communications to acknowledge correct interaction or coordinate on a retry iteration.
It is however n-recoverable with n = 6.
By Definition 5.13, checking resilience and n-recoverability is reduced to the problem of checking weak barbed bisimulation. Note that, in Definition 5.13, the number of R′ that can be reached from R is finite, because the execution up to R′ lasts for n time units and, by Proposition 4.10, a system can perform only a finite number of actions in a finite amount of time.
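The quantifier structure described above (for every n-path there exists a state on it that is weak barbed bisimilar to the healthy initial state) can be sketched as follows. This is a hedged illustration: the bisimulation check itself is not implemented but passed in as an oracle, and the names `is_n_recoverable`, `bisimilar_to_healthy_start` and the state labels are hypothetical. The finiteness argument above is what makes enumerating the n-paths feasible in principle.

```python
from typing import Callable, Iterable, Sequence


def is_n_recoverable(
    n_paths: Iterable[Sequence[str]],
    bisimilar_to_healthy_start: Callable[[str], bool],
) -> bool:
    """Forall n-paths, exists a state on the path accepted by the oracle."""
    return all(
        any(bisimilar_to_healthy_start(state) for state in path)
        for path in n_paths
    )


# Hypothetical oracle: only the state labelled "restarted" is assumed to be
# weak barbed bisimilar to the uncursed initial state.
def oracle(state: str) -> bool:
    return state == "restarted"


print(is_n_recoverable([["retrying", "restarted"], ["restarted"]], oracle))
print(is_n_recoverable([["retrying"], ["restarted"]], oracle))
```

The second call fails because one of the paths never visits a state accepted by the oracle, which mirrors the universal quantification over paths.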
In the following, we show that resilience is equivalent to 0-recoverability. This fact was conjectured but not formally proven in [BLTV22]. This result is given in Section 5.2.
Definition 5.16 (↑-consistency). Two systems R_∆ and R_↑ are ↑-consistent if there exist R, R_u, R_d, and R_f such that and:
• R_u and R_d are parallel compositions of the same (possibly empty) set of nodes.
• the nodes in R_d are all down, i.e., of the form n
• R_f is the parallel composition of a (possibly empty) set of latent or floating messages.
Intuitively, ↑-consistency defines a structural relation between the evolution of a system with and without curses: R models the parts of the system (if any) that R_↑ and R_∆ have in common; R_u and R_d are the nodes that are up in R_↑ and down in R_∆, respectively (they model the difference between R_↑ and R_∆ w.r.t. crashed nodes); moreover, R_↑ may have some additional floating messages, represented by R_f, that have been lost in R_∆.
↑-consistency enjoys two properties. The first one, given in Lemma 5.17, is that instantaneous actions preserve ↑-consistency and do not decrease the number of down nodes in the cursed system. The second one, given in Lemma 5.18, ensures that the barbs of the cursed system are always a subset of those of the uncursed counterpart.
2) the set of down nodes in R_∆ is a subset of the set of down nodes in R_↑.

Proof. By induction on the derivation. In case of actions by [Snd], [Sched], [Rcv], and [Checkpoint] all yield that there is where R′ is as R but without the lost message, and R′_f is as R_f but with the addition of the lost message. The set of down nodes does not change, hence R′_∆ and R′_↑ are ↑-consistent, yielding the thesis for this case. In case of
Proof. By coinduction, observing that initial systems are ↑-consistent, ↑-consistency is preserved by communication actions by Lemma 5.17 and ensures that R_∆ ↓ x implies R_↑ ↓ x by Lemma 5.18.
We next show an intuitive property that will be useful to show equivalence of resilience and 0-recoverability: (R, ∆) and (R, ↑) are weak barbed bisimilar at time 0 if no time actions or failures happen (Lemma 5.21). This is proved by coinduction via Lemma 5.20, which ensures that actor/node transitions preserve equivalence of barbs in the evolution of cursed and uncursed systems.
We are now able to state the main result: equivalence of resilience and 0-recoverability.
Theorem 5.22 (0-recoverability and resilience). An initial cursed system (R, ∆) is resilient if and only if it is 0-recoverable.
Proof. The only if case is immediate since resilience implies the existence of a state, the initial one, such that (R, ↑) ≈ (R, ∆). For the if case, we assume (R, ∆) to be 0-recoverable: for every path of executions of (R, ∆) at time 0 (i.e., every 0-path of (R, ∆)) there exists a state (R′, ∆) in that path such that time(R′) = 0 and (R, ). The reductions to (R′′, ∆) can be by either (i) one of the actor/node actions, or (ii) one of the instantaneous failure actions ([MsgLoss] or [NodeDown]).
Observe that, for all R and ∆, the relation ((R, ∆), (R, ↑)) is a bisimulation if we consider a restriction of the reduction relation that only uses actor/node actions, with no failure and no time-consuming actions (Lemma 5.21, using the fact that R is initial and hence fail-free). So, if only actions (i) are possible in the reduction of (R, ∆) then (R′′, ∆) ≈ (R′′, ↑) by Lemma 5.21. If there are only (i) actions at time 0, since any state reachable from (R, ∆) at time 0 is bisimilar to the corresponding state reached by (R, ↑), then (R, ∆) is resilient. Hence done.

5.3. Recoverability for checkpointing systems. The definition of recoverability in the previous section formalises a system restarting from the initial state, and does not capture checkpointing systems that recover to intermediate states. In this section we add a definition of bisimulation up to a particular time, and also a notion of n-recoverability for checkpointing systems. This is illustrated with an example of a system that is not n-recoverable but that is n-checkpoint-recoverable.
We introduce a notion of weak barbed simulation up to n where n is a relative time, up to which we want to compare behaviour (ignoring what happens afterwards).
Definition 5.23 (Weak barbed simulation up to n). Recall −→ ∈ {−⇀, ∼∼▷}. A weak barbed simulation up to n is a set of binary relations S_r for r ≤ n between cursed systems such that: if there exists some weak barbed simulation up to n, S_n, such that (R_1, ∆_1) S_n (R_2, ∆_2). By point (1), weak barbed simulation up to 0 holds for all pairs of systems, whereas weak barbed simulation is morally equivalent to weak barbed simulation up to ∞.
Definition 5.24 (Weak barbed bisimulation up to n). We say that S_n is a weak barbed bisimulation up to n if S_n and its inverse (S_n)^−1 are weak barbed simulations up to n. We say that (R_1, ∆_1) and (R_2, ∆_2) are weak barbed bisimilar up to n, written (R_1, ∆_1) ≈_n (R_2, ∆_2), if there exists some weak barbed bisimulation up to n, S_n, such that (R_1, ∆_1) S_n (R_2, ∆_2).
It is a straightforward consequence of these definitions that if two systems are (bi-)similar up to n then they are (bi-)similar up to r for any r < n, and two systems are (bi-)similar if and only if they are (bi-)similar up to n for all n.
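The idea of a family of relations S_r indexed by the remaining time r can be illustrated on finite transition systems. The sketch below is a simplification and rests on several assumptions that are not taken from the definitions above: it computes a strong (not weak) simulation, transitions are tagged only as "comm" or "time", barbs are compared by set inclusion, a time step moves from S_r to S_{r−1}, and S_0 relates every pair of states. The names `simulations_up_to`, `Trans` and `Barbs` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Trans = Dict[str, List[Tuple[str, str]]]  # state -> [(kind, successor)]
Barbs = Dict[str, Set[str]]               # state -> observable barbs


def simulations_up_to(n: int, trans1: Trans, barbs1: Barbs,
                      trans2: Trans, barbs2: Barbs) -> List[Set[Tuple[str, str]]]:
    """Return [S_0, ..., S_n]; S_r relates a to b if b simulates a for r ticks."""
    all_pairs = {(a, b) for a in barbs1 for b in barbs2}
    levels = [all_pairs]  # S_0 relates every pair of states

    def ok(a, b, current, previous):
        if not barbs1[a] <= barbs2[b]:
            return False
        for kind, a_next in trans1.get(a, []):
            # A time step consumes one unit of the remaining horizon.
            target = previous if kind == "time" else current
            if not any(k == kind and (a_next, b_next) in target
                       for k, b_next in trans2.get(b, [])):
                return False
        return True

    for _ in range(n):
        previous = levels[-1]
        current = set(all_pairs)
        changed = True
        while changed:  # greatest fixpoint: discard pairs until stable
            changed = False
            for pair in list(current):
                if not ok(pair[0], pair[1], current, previous):
                    current.discard(pair)
                    changed = True
        levels.append(current)
    return levels


# Two toy systems that agree for two time units and differ afterwards.
trans_a = {"a0": [("time", "a1")], "a1": [("time", "a2")]}
barbs_a = {"a0": set(), "a1": {"ok"}, "a2": {"ok"}}
trans_b = {"b0": [("time", "b1")], "b1": [("time", "b2")]}
barbs_b = {"b0": set(), "b1": {"ok"}, "b2": set()}

levels = simulations_up_to(3, trans_a, barbs_a, trans_b, barbs_b)
print(("a0", "b0") in levels[2], ("a0", "b0") in levels[3])
```

In this toy example the initial states are related up to 2 but not up to 3, because the barb `ok` is missing in the second system only after the second time step. This matches the observation that similarity up to n implies similarity up to any r < n.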
We now define a variant of n-recoverability that, based on weak barbed bisimulation up to n, aims to characterise recoverability for systems that use checkpoints to recover from failures.
Informally, a cursed system (R, ∆) that behaves correctly up to a state at time t can always reach a later state (R′, ∆) which is bisimilar to the state (R′′, ↑). Suppose that (R, ↑) ≈_t (R, ∆), where t ≤ n. Then for any R′′ reachable from (R, ↑) such that time(R′′) = t there exists R′ that by time n displays the remaining behaviour of the correct (uncursed) system, that is (R′′, ↑). In contrast to Definition 5.13, Definition 5.25 does not require the recovered system (R′, ∆) to exhibit the complete behaviour of the uncursed system from its initial state (i.e., to have the same behaviour as (R, ↑)) but only the behaviour of (R, ↑) after a certain point (i.e., (R′′, ↑)), for example from a checkpoint onwards.
It is this state that makes system R_TT not n-recoverable with respect to ∆, both because node c sends message failed in its timeout process and because the communication of the order message does not get repeated when the rest of the interaction is repeated. While the failed message is not read by p, it adds an additional barb to the cursed system (R_TT, ∆) that is not matched by the uncursed system (R_TT, ↑).
The system is however n-checkpoint-recoverable with n = 4; R′_TT further reduces to
p[ !c item.?ok.0 after 2 0 ](failed)(4) ∥ c[ ?item.!p ok.0 after 4 0 ](∅)(4)
from which state the cursed system exhibits the behaviour of the uncursed system from time 2 (or from the checkpoint) onwards.
5.4. Fault-tolerance: a more general perspective. In [Gär99], the author gives a theoretical definition of the problem of fault tolerance along two dimensions: safety (the system does not reach bad states, although it can possibly stop due to faults) and liveness (the system eventually reaches good states, hence in case of bad behaviour it eventually recovers). In this context, the guarantee of both safety and liveness is called masking fault-tolerance, of only safety is called fail-safe fault-tolerance, and of only liveness is called non-masking fault-tolerance.
In our framework, we can characterise these three kinds of fault-tolerance by using our simulation relation:
• (R, ∆) ≈ (R, ↑) - masking fault-tolerance: R cursed by ∆ has all and only the behaviour of healthy system R.
• (R, ∆) ≲ (R, ↑) - fail-safe fault-tolerance: R cursed by ∆ has only the behaviour of healthy system R.
• (R, ∆) ≳ (R, ↑) - non-masking fault-tolerance: R cursed by ∆ has all the behaviour of healthy system R (and possibly more).
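As a rough intuition for the relations ≈, ≲ and ≳ used in this subsection, the sketch below replaces simulation by a much coarser comparison of finite sets of observable traces. This is only an analogy under a loud assumption: trace-set equality and inclusion are weaker than (bi)simulation, and the function `classify` and the trace names are hypothetical.

```python
from typing import List, Set


def classify(cursed: Set[str], healthy: Set[str]) -> List[str]:
    """List the fault-tolerance kinds suggested by a trace-set comparison."""
    kinds = []
    if cursed == healthy:   # all and only the healthy behaviour
        kinds.append("masking")
    if cursed <= healthy:   # only healthy behaviour, possibly less
        kinds.append("fail-safe")
    if cursed >= healthy:   # all healthy behaviour, possibly more
        kinds.append("non-masking")
    return kinds


healthy = {"send.ack", "send.retry.ack"}
print(classify({"send.ack"}, healthy))
print(classify(healthy | {"send.failed.ack"}, healthy))
```

A cursed system that only loses behaviour is reported as fail-safe, while one that adds recovery behaviour on top of the healthy behaviour is reported as non-masking.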
Resilience, given in Definition 5.5, corresponds to the safety and liveness combination of masking fault-tolerance. We have shown in Example 5.7 that masking fault-tolerance can be attained by using space redundancy (e.g., replication of nodes as in the multiple producers scenario) and time redundancy (e.g., retry strategies). In fact, the author of [Gär99] argues that redundancy is a necessary condition for fault tolerance. Fail-safe fault tolerance, while easier to attain, is not the most desirable property in many real-world scenarios: a system that just stops to prevent 'bad' actions may not be a suitable model when eventual consistency is wanted despite perturbations to the ideal course of actions. Intuitively, both fail-safe fault tolerance and non-masking fault tolerance for cursed systems can be expressed by using the notion of weak barbed simulation given in Definition 5.3, as shown above. The formulation of non-masking fault tolerance as (R, ∆) ≳ (R, ↑) is very general. In principle, this definition considers fault tolerant any system that performs an infinite sequence of actions among which, sometimes, a correct action happens to make the system progress. In practice, we would want the added behaviour not to be random, but to follow a sensible pattern of restart, reset or another benign behaviour. In this paper we have put most emphasis on non-masking fault tolerance, focussing on more stringent definitions of it: some unforeseen sequence of actions may be visible at some point, but after some recovery actions, at a time that is not later than n, the system will revert to the required behaviour by restarting from the beginning (n-recoverability for reset systems) or from the point of failure (n-recoverability for checkpointing systems). These definitions are intentionally non-general, with the aim of capturing known recovery patterns. We leave as future work the extension of n-recoverability to cater for periodic failures.
A similar approach, of characterising masking/fail-safe/non-masking fault tolerance using simulation was followed by [DCMA17] but with a clear distinction of good versus faulty states (using coloured Kripke structures).More on the relationship with [DCMA17] is discussed in Section 8.

Augmentation of cursed systems
Augmentation of a cursed system is the result of adding or modifying some behaviour in the initial system to improve the system's ability to handle failures. The following definition applies to reset systems as it is based on n-recoverability. A corresponding notion of augmentation could be given for checkpointing systems by using, in Definition 6.1, n-checkpoint-recoverability instead of n-recoverability. In the remainder of this section we focus on reset systems.
Definition 6.1 (Augmentation). R_I is an augmentation of R if time(R_I) = time(R) and: i) transparency: (R, ↑) ≈ (R_I, ↑); ii) improvement: there exist ∆ and n such that (R_I, ∆) is n-recoverable and (R, ∆) is not n-recoverable. Moreover, we say that an augmentation is preserving if, for all n and ∆, (R, ∆) is n-recoverable implies (R_I, ∆) is n-recoverable.
Example 6.2 (Augmentation). Consider the small producer-consumer system R below, composed of a producer node n_p, a queue node n_q, and a consumer node n_c. The producer recursively sends items to the queue and sleeps for a time unit. The queue expects to receive an item within three time units, which then gets sent to the consumer. In case of a timeout the queue loops back to the beginning and awaits an item from the producer. The consumer recursively receives items from the queue. We fix the latency of the system to L = 1.
The augmented producer-consumer R_I adds behaviour to the system by having a second producer node n_p′. R_I improves the resilience to a producer node or its link failing or being slow. For example, the curse function ∆(n_p), injecting node delay for the producer node between time 1 and 3 and ↑ otherwise, impacts the first system R but not its augmented counterpart R_I. R is 4-recoverable while R_I is 0-recoverable. Moreover, R_I is a preserving augmentation of system R.
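The conditions of Definition 6.1 can be read as a simple check over a finite set of candidate curses and time bounds. The sketch below is illustrative only: n-recoverability and transparency are not computed but supplied as oracles, the quantification is restricted to the finite ranges passed in, and the toy tables are hypothetical stand-ins that merely echo the claims of Example 6.2 (R is 4-recoverable, R_I is 0-recoverable and preserving) rather than being derived from the semantics.

```python
from typing import Callable, Sequence

Recoverable = Callable[[str, int], bool]  # (curse name, n) -> n-recoverable?


def is_augmentation(same_time: bool, transparent: bool,
                    base: Recoverable, augmented: Recoverable,
                    curses: Sequence[str], horizons: Sequence[int]) -> bool:
    """Transparency plus improvement for some curse and some n in the ranges."""
    improvement = any(
        augmented(d, n) and not base(d, n) for d in curses for n in horizons
    )
    return same_time and transparent and improvement


def is_preserving(base: Recoverable, augmented: Recoverable,
                  curses: Sequence[str], horizons: Sequence[int]) -> bool:
    """Whenever the base system is n-recoverable, so is the augmented one."""
    return all(
        augmented(d, n) for d in curses for n in horizons if base(d, n)
    )


# Hypothetical oracle tables for the producer-delay curse of Example 6.2.
base_table = {("delay_np", 4)}
aug_table = {("delay_np", 0), ("delay_np", 4)}


def base(d: str, n: int) -> bool:
    return (d, n) in base_table


def augmented(d: str, n: int) -> bool:
    return (d, n) in aug_table


curses, horizons = ["delay_np"], list(range(5))
print(is_augmentation(True, True, base, augmented, curses, horizons))
print(is_preserving(base, augmented, curses, horizons))
```

Note that the improvement condition is existential while the preserving condition is universal, so a finite check like this can confirm improvement but can only approximate preservation.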
6.1. Augmentation with scoped barbs. Augmentations often need to introduce additional behaviour into actors. One may want to disregard part of the 'behind the scenes' augmentation when comparing the behaviour of cursed systems using the relation in Definition 5.4. For simplicity, instead of adding scope restriction to the calculus, we extend barbs with scopes to hide behaviour of some nodes or links. With mailboxes, all interactions to a node are directed to the one mailbox. Defining scope restriction only on node identifiers would be less expressive than scope restriction based on channels, e.g., it would not be possible to hide specific communications to a node, while in channel-based calculi one can use ad-hoc hidden channels. To retain expressiveness, we define scope restriction that takes into account patterns in the communication between nodes.
Definition 6.3 (Scoped barb). Let N be a finite set of elements of the form !n p or ?n p where n ∈ N and p is a pattern. R ↓_N x if: (1) R ↓ x, (2) x ∉ N, and (3) if x = !n m then for all !n p ∈ N, (p, m) ⊬ match. If R ↓_N x we say that R has an N-scoped barb on x.
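The three conditions of Definition 6.3 can be sketched as a filter over a set of barbs. The encoding below is an assumption made for illustration: barbs and scope elements are triples (direction, node, message or pattern), and pattern matching is reduced to a hypothetical matcher in which "_" matches any message and any other pattern matches only itself. The calculus's own matching judgement may be richer.

```python
from typing import Set, Tuple

Barb = Tuple[str, str, str]       # (direction, node, message), e.g. ("!", "ns", "ruok")
ScopeItem = Tuple[str, str, str]  # (direction, node, pattern)


def matches(pattern: str, message: str) -> bool:
    """Hypothetical matcher: '_' is a wildcard, otherwise exact equality."""
    return pattern == "_" or pattern == message


def has_scoped_barb(barbs: Set[Barb], scope: Set[ScopeItem], x: Barb) -> bool:
    if x not in barbs:   # (1) R has a barb on x
        return False
    if x in scope:       # (2) x itself is not in the scope
        return False
    direction, node, message = x
    if direction == "!":  # (3) no send pattern in scope for this node matches
        for d, n, p in scope:
            if d == "!" and n == node and matches(p, message):
                return False
    return True


barbs = {("!", "nc", "reply"), ("!", "ns", "ruok")}
scope = {("!", "ns", "_")}  # hide every message sent to ns
print(has_scoped_barb(barbs, scope, ("!", "nc", "reply")))
print(has_scoped_barb(barbs, scope, ("!", "ns", "ruok")))
```

Here the health-check message to ns is hidden by the wildcard pattern, while the reply to nc remains observable.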
We extend Def. 5.4 using ↓_N instead of ↓, obtaining scoped weak barbed bisimulation ≈_N, and Def. 6.1 to use ≈_N. This setting allows us to analyse producer-consumer scenarios, or more complex ones, like the Circuit Breaker pattern [Nyg18] widely used in distributed systems.
Example 6.4 (Circuit breaker). Consider system (R, ∆) with a client n_c and a service n_s, and its augmentation R_I with a circuit breaker running on node n_s: with a ∆(n_c, n_s) injecting link slow ⟳ at times 1, 2, and 3 and healthy otherwise, and latency set to L = 1. The impact of failure on R makes it unrecoverable, as the link delay cascades to node n_c. We augment R with a circuit breaker process, running on the previous server node n_s, that monitors for failure, prevents faults in one part of the system and controls the retries to the service node, now n_1. The node n_s forwards messages between nodes n_c and n_1, and in case of a timeout checks the health of n_s and tells node n_c when it can safely retry the request. When comparing R and R_I for resilience, recoverability or transparency we wish to abstract from the additional behaviour introduced by the circuit breaker pattern, for which we use Def. 6.3 with:
N = {!n_s ruok, ?n_s imok, ?n_s reply, ?n_1 request, !n_s reply, ?n_1 ruok, !n_s imok, !n_c ko, !n_c retry, ?n_c ko, ?n_c retry}.
This effectively hides the entire behaviour of n_1 and node n_s's health checking behaviour. Using the extended definition we find that for the same curse function system R_I is 0-recoverable. Similarly, for the curse function that delays link (n_s, n_1) at times 1, 2, and 3, R_I is 0-recoverable.

Prospective applications
This work is a first step towards an analysis of mailbox systems with failures and has the purpose of clarifying the problem space. An informal validation of the relevance of the work was attained through interaction with our industry partners, in particular Erlang Solutions Ltd. and Actyx AG, as well as by applying it to a collection of real-world case studies and patterns, such as the circuit breaker in Example 6.4.
To support analysis and development of real-world systems, we aim to build on the current work.In this section we discuss two potential applications; their development goes beyond the scope of the formal setting given in the current work.
7.1. Analysis of cursed systems. Encoding the models of failures and systems into verification tools like UPPAAL is fairly straightforward. We provide a prototype encoding here. A straightforward encoding only supports analysis of a system against a specific ∆ 'trace' and so, by repeating this analysis, against a limited set of curse cases.
As a more powerful development, we are working on generalizing the notion of ∆ to a symbolic entity that can finitely characterise infinite patterns, together with a tractable algorithm to determine simulation that is parametric with respect to this symbolic ∆. This would make it possible to determine whether a system model is resilient with respect to a given set of curses, or to synthesise the curses that a system can or cannot deal with.
Code generation or synthesis would, in turn, support top-down or bottom-up development, respectively. Existing approaches to code generation provide seamless links between process-calculi-based models and Erlang code. For example, the tool described in [BOV23], which presents a proof of concept of a theoretical advance, can generate Erlang gen_statem code from a process-calculus specification and extract specifications from Erlang gen_statem code. The circular transformation described above is possible thanks to the code structure induced by Erlang gen_statem itself, which yields modular code that is structured as a finite state machine and hence has a straightforward correspondence with its model. A similar approach could be taken to our modelling of failure scenarios by supporting a richer process calculus that includes time and timeouts.

Test support.
A second direction is to use property-based testing (PBT), as implemented by QuickCheck [CH00], initially for Haskell and Erlang, and subsequently for a variety of other languages.Property-based testing replaces unit tests by tests of logical properties of the system under test (SUT), expressed in a universal fragment of first-order logic.A universal property is evaluated at a randomly generated set of values, and any counter-example is systematically shrunk to a simplest such example, according to some size metric.Successful application of property-based testing therefore depends on three things: being able to express relevant properties of a system in a logical form; being able to generate values from relevant domains in a way that optimises coverage of the domain; and being able to "shrink" values in an effective and efficient way.PBT can be seen as a complement to more heavyweight verification approaches: for example, it is worthwhile subjecting a candidate theorem to PBT before embarking on developing a formal proof.
Stateful systems in Erlang [CPS+09] and other languages can be subjected to PBT using state machine models. The state machine provides an abstract model of the system, and is used to guide testing of the SUT: random sequences of transitions of the state machine exercise the SUT, and shrinking simplifies and shortens counter-example traces.
In the context of the work presented here, QuickCheck can be used to test systems in which failure is modelled explicitly in a state machine model, but could also be extended to include modelling of the symbolic ∆ function discussed above (e.g., using logical constraints) and to generate and shrink instances of ∆ with particular properties.
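To illustrate what generating and shrinking instances of ∆ could look like, the following sketch hand-rolls a tiny property-based test using only the Python standard library; it does not use QuickCheck or any existing PBT library. Everything in it is an assumption made for illustration: a curse is a list giving the status of a single node at each time unit, the property `tolerates` stands in for a hypothetical system under test that survives anything except two consecutive down time units, and shrinking greedily "heals" one time unit at a time.

```python
import random
from typing import Callable, List, Optional

UP, DOWN, LATE = "up", "down", "late"
Curse = List[str]  # status of one node at times 0, 1, 2, ...


def random_curse(rng: random.Random, horizon: int) -> Curse:
    return [rng.choice([UP, DOWN, LATE]) for _ in range(horizon)]


def tolerates(curse: Curse) -> bool:
    """Hypothetical SUT property: fails only on two consecutive DOWN units."""
    return all(not (a == DOWN and b == DOWN) for a, b in zip(curse, curse[1:]))


def shrink(curse: Curse, prop: Callable[[Curse], bool]) -> Curse:
    """Greedily heal single time units while the property still fails."""
    current = list(curse)
    progress = True
    while progress:
        progress = False
        for t in range(len(current)):
            if current[t] != UP:
                candidate = current[:t] + [UP] + current[t + 1:]
                if not prop(candidate):
                    current = candidate
                    progress = True
    return current


def find_counterexample(prop: Callable[[Curse], bool], horizon: int = 8,
                        tries: int = 200, seed: int = 0) -> Optional[Curse]:
    rng = random.Random(seed)
    for _ in range(tries):
        curse = random_curse(rng, horizon)
        if not prop(curse):
            return shrink(curse, prop)
    return None


print(shrink([DOWN, LATE, DOWN, DOWN, DOWN], tolerates))
print(find_counterexample(tolerates))
```

The shrunk counterexample keeps only the failures that are needed to break the property, which is the kind of minimal curse that would be useful when diagnosing why a system is not recoverable.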

Conclusion and related work
We introduced a model for actor-based systems with grey failures and investigated the definition of behavioural equivalence for it. We used weak barbed bisimulation to compare systems on the basis of their ability to recover from faults, and defined properties of resilience, recoverability and augmentation. We reduced the problem of checking reliability properties of systems to a problem of checking bisimulation. We introduced scope restriction for mailboxes based on patterns, which allows us to model relatively complex real-world scenarios like the Circuit Breaker.
As further work we plan to extend the recovery function Σ to model check-pointing of intermediate node states. Note that Σ can already be set as an arbitrary process, but a more meaningful extension would account for the way in which checkpoints are saved. Moreover, we plan to add a notion of intermittent correctness, to model recovery with partial checkpoints rather than re-starting from the initial state, or intermittent expected/unexpected behaviour. Another area of future work is to use the characteristic formulae approach [GS86, Ste89], a method to compute simulation-like relations in process algebras, to generate formulae for the properties introduced and reduce them to a model checking problem that can be offloaded to a model checker.
A related formalism to our model is Timed Rebeca [ACI+11], which is actor-based and features similar constructs for deadlines and delays. Timed Rebeca actors can also use a 'now' function to get their local times. Extending our calculus with 'now', and allowing messages to carry time as a data sort, would allow us to model scenarios where, e.g., a node calculates the round-trip time to another node and changes its behaviour accordingly. While Timed Rebeca can encode network delays (adding delays to receive actions, using a construct called 'after'), it does not model links explicitly. Explicit links and separation between curses and systems make it easier in our calculus to compare systems with respect to recoverability. Rebeca was encoded in McErlang [ACI+11] and Real-Time Maude [SKÖ+15] for verification. We have ongoing work on encoding our model in UPPAAL. Our main challenge in this respect is to formalise a meaningful and manageable set of curses to verify the model against.
In [FH07], Francalanza and Hennessy introduced a behavioural theory for DπF, a distributed π-calculus with node and link failures. For a subset of DπF, they also developed a notion of fault-tolerance up to n-faults [FH08], which is preserved by contexts, and which is related to our notion of resilience. The behavioural theory in [FH07] is based on reduction barbed congruence. The idea is to use a contextual relation to abstract from the behaviour of hidden nodes/links, while still observing their effects on the network, e.g., as to accessibility and reachability of other nodes. The scoped barbs in Section 6.1 have the similar purpose of hiding augmentations while observing their effects on recoverability. However, because of asynchronous communication over mailboxes (while DπF is based on synchronous message passing), our notion of hiding is less structural (i.e., based on nodes and links) and more application-dependent (i.e., based on patterns). At present, we have left pattern hiding out of the semantics, but further investigation towards a contextual relation that works for hidden patterns is promising future work. DπF studies partial failures but does not consider transient failures and time. On the other hand, DπF features mobility, which we do not support. In fact, we rely on the assumption of fixed networks: since our observation is based on patterns (and ignores senders) we opted for relying on a stable structure to simplify our reasoning on what augmentation vs recoverability means, leaving mobility issues for future investigation.
Most ingredients of the given model (e.g., timeouts [LZ05, BY07, LP11], mailboxes [MV11], localities [RH97, BH03, Cas01]) have been studied in the literature, often in isolation. We investigated the interplay of these ingredients, focussing on reliability properties. One of the first papers dealing with asynchronous communication in process algebra is by de Boer et al. [dBKP92], where different observation criteria are studied (bisimulation, traces and abstract traces) following the axiomatic approach typical of the process algebra ACP [BK84].
Definition 4.3 (Operational semantics for cursed systems). Reduction is the smallest relation on cursed systems over communication actions, denoted by −⇀, and time actions, denoted by ∼∼▷, that satisfies the rules in Figure 3. We use −→ when −→ ∈ {−⇀, ∼∼▷}. For readability, in the rules we assume ∆ fixed and write R −
Example 5.15 (n-recoverability and more complex retry strategies). Consider ∆ from Example 5.7, latency of 1 time unit, and a variant R_TT of R_T, where time-redundancy affects both processes:
R_TT = p[ µX.sleep.!c item.?{ok.0, retry.X} after 5 0 ](∅)(0) ∥ c[ µX.?item.!p ok.0 after 4 (!p retry.X) ](∅)(0)
no down nodes are introduced in R′_∆ and hence R′_∆ and R′_↑ are ↑-consistent. Rule [NUp] cannot be applied at time 0. The only possible failure actions are [MsgDown] and In case for [MsgDown], there exist R′ and R
By Definition 5.1 the barbs of R_↑ are the union of the barbs of R, R_u, and R_f, and the barbs of R_∆ are the union of the barbs of R and R_d. We only need to show that R_d does not have barbs that R_↑ does not have. This follows trivially from Definition 5.16 since R_d is the parallel composition of down nodes and hence has no barbs.
We can now prove a more general property of cursed systems at time 0:
Lemma 5.19. Let ≲_0 be the restriction of ≲ obtained considering only communication actions −⇀ in Figure 3 (i.e., no time-consuming actions ∼∼▷) on systems R such that time(R) = 0. It holds that (R, ↑) ≲_0 (R, ∆)

Proof. By induction on the derivation proceeding by case analysis on the last rule used. The base cases, for rules [Snd], [Sched], [Rcv], and [Checkpoint], are mechanical. The inductive cases for rules [Rec], [Str], and [ParCom] are straightforward by inductive hypothesis.
Lemma 5.21. Let ≈^↑_0 be the restriction of ≈ obtained considering only actor/node actions in Figure 3 (i.e., no failure and no time-consuming actions) and time(R) = 0 with R fail-free. It holds that (R, ∆) ≈^↑_0 (R, ↑)
Example 5.26 (n-checkpoint-recoverability). Consider a variant R_TT of R_T from Example 5.7, latency of 1 time unit, and ∆ that curses node p to go down at time 2:
R_TT = p[ ?order.save.sleep.!c item.?ok.0 after 3 0 ](∅)(0) ∥ c[ !p order.?item.!p ok.0 after 3 (!p failed.?item.!p ok.0 after 5 0) ](∅)(0)
System R_TT uses checkpointing to restart from an intermediate state in the event of failure. In the case of node p going down at time 2, the system reduces in a number of steps to:
R′_TT = p[ sleep.!c item.?ok.0 after 3 0 ](∅)(3) ∥ c[ ?item.!p ok.0 after 5 0 ](∅)(3) ∥ 1.(c, p, failed)(3)
[Figure 3. Reduction rules (system actions, time actions, failure actions).]
Time actions. The second set of rules, in Figure 3b, is for time-passing reduction in absence of failures. Rules [Sleep] and [Timeout] model reduction of time consuming and receiving with timeout processes, respectively. Rule [Timeout] can only be applied if none of the messages in the mailbox matches any of the patterns {p_i}_{i∈I}, yielding an urgent receive semantics [Mur19] reflecting the receive primitive in Erlang. Rule [Latency] allows time passing for latent messages. Note that, by setting u′ = max(u − 1, 0), if a receiver node crashes, all latent/floating messages remain in the link until the node is able to receive them, i.e. in a healthy state. We omit the rules for state-preserving time passing for idle nodes and n[ 0 ](M)(t).
Failure actions. The third set of rules, in Figure 3c, models the effects of failures injected at time t. Rule [NLate] models a delay, injected by ∆(t)(n) = ⟳, in the execution of the process P in a node n: a time unit elapses without any action in P. Rule [MsgLoss] models a lossy link at time t, injected by ∆(t)(n_1, n_2) = ↓, and permanently deletes a message u.(n_1, n_2, m)(t) in transit. Rule [MsgLate] models a slow link, injected by ∆(t)(n_1, n_2) = ⟳, by allowing time to pass but without decreasing the latency u of the message. Rule [NDown]
as R but without the node that went down, R′_d is as R_d but with the addition in parallel of the node that went down, similarly for R′_u but the node in this case is still up. The set of down nodes has increased in R′_∆. It follows that R′_∆ and R′_↑ are ↑-consistent. The cases for [Rec], [Str], and [ParCom] are immediate by induction.
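The effect of rules [Latency], [MsgLate] and [MsgLoss] on messages in transit, as described in this section, can be sketched as a single function over a list of messages. This is a simplification made for illustration: message loss is folded into the time step even though it is treated as an instantaneous failure action in the semantics, link status is given by a hypothetical dictionary standing in for ∆(t)(n_1, n_2), and the names `tick_messages`, `UP`, `LATE` and `DOWN` are introduced here.

```python
from typing import Dict, List, Tuple

Message = Tuple[int, str, str, str]  # (latency u, sender, receiver, payload)
Link = Tuple[str, str]

UP, LATE, DOWN = "up", "late", "down"


def tick_messages(messages: List[Message],
                  link_status: Dict[Link, str]) -> List[Message]:
    """Advance in-transit messages by one time unit under the given link curses."""
    result = []
    for u, src, dst, payload in messages:
        status = link_status.get((src, dst), UP)
        if status == DOWN:
            continue  # lossy link: the message is permanently deleted
        if status == LATE:
            result.append((u, src, dst, payload))  # slow link: latency unchanged
        else:
            # Healthy link: latency decreases but never drops below zero, so a
            # message with u = 0 stays floating until it is received.
            result.append((max(u - 1, 0), src, dst, payload))
    return result


in_transit = [(2, "p", "c", "item"), (1, "c", "p", "ok"), (0, "p", "q", "x")]
curses_now = {("c", "p"): LATE, ("p", "q"): DOWN}
print(tick_messages(in_transit, curses_now))
```

In this toy step the message on the healthy link gets closer to delivery, the one on the slow link keeps its latency, and the one on the lossy link disappears.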