# A Type Theory for Robust Failure Handling in Distributed Systems

- 6 Citations
- 7 Mentions
- 674 Downloads

## Abstract

This paper presents a formal framework for programming distributed applications capable of handling partial failures, motivated by the non-trivial interplay between failure handling and messaging in asynchronous distributed environments. Multiple failures can affect protocols at the level of individual interactions (**alignment**). At the same time, only participants affected by a failure or involved in its handling should be informed of it, and its handling should not be mixed with that of other failures **(precision)**. This is particularly challenging, as through the structure of protocols, failures may be linked to others in subsequent or concomitant interactions **(causality)**. Last but not least, no central authority should be required for handling failures **(decentralisation)**. Our goal is to give developers a description language, called protocol types, to specify robust failure handling that accounts for alignment, precision, causality, and decentralisation. A type discipline is built to statically ensure that asynchronous failure handling among multiple endpoints is free from orphan messages, deadlocks, starvation, and interactions are never stuck.

## Keywords

Session types Partial failure handling Distributed systems## 1 Introduction

For distributed systems where application components interact asynchronously and concurrently, the design and verification of communication protocols is critical. These systems are prone to *partial failures*, where some components or interactions may fail, while others must continue while respecting certain invariants. Since not all failures can be simply *masked* [10], programmers must *explicitly* deal with failures.

*robust failure handling*. Our framework ensures safety during normal execution and in case of failures. In particular our framework provides the following properties:

- 1.
**P1 (alignment)**: The occurrences of failures are specified at the level of individual interactions, which they be raised. - 2.
**P2 (precision)**: If a failure occurs, an endpoint is informed iff it is affected by the failure or involved in handling it, and its handling is not mixed with that of other failures. - 3.
**P3 (causality)**: Dependencies between failures are considered, i.e., a failure can affect (enable, disable) others which may occur in subsequent or concomitant interactions. - 4.
**P4 (decentralisation)**: No central authority or component controls the decisions or actions of the participants to handle a failure.

*protocol types*to achieve these properties. The basic design is shown in Eq. (1):where the first term expresses that a participant \( p _1\) either sends a message of type \(\widetilde{ S }\) to another participant \( p _2\), or raises one of several failures (i.e., \( f _1, ..., \ f _n\)). \(\mathbf {G}\) specifies the subsequent interactions and \(\mathbf {G}_i\) specifies the handling protocols for \( f _i\), \(i = 1..n\) in Open image in new window . In short, failures are thus associated with elementary interactions (

**P1**), only participants affected by such a failure in \(\mathbf {G}\) (e.g., they are expecting communication from \( p _{2}\)) or those involved in the corresponding failure handling activity are informed (

**P2**). There is at most one \( f \) appearing in Open image in new window that will be handled/raised (

**P3**). Our semantics ensure that there is no central authority (

**P4**), meaning, notifications of failures (and absence thereof) are delivered asynchronously from failure sources and processes are typed by local (i.e., endpoint) types achieved by the projection of participants over protocol types.

**P1**to

**P4**(see Sect. 7 for details).

Just as multiparty session types aim to specify the interactions among participants and verify implementations of these participants, protocol types specify the global interplay of failures in interactions among participants and verify the failure-handling activities of these participants. To the best of our knowledge, this is the first work that presents a type system for statically checking fine-grained failure handling activities across asynchronous/concurrent processes for partial failures in practical distributed systems. We define a calculus of processes with the ability to not only raise (“throw”) and handle (“catch”) failures, but to also automatically notify processes of failures or absence thereof at runtime.

Our framework also gives protocol designers and endpoint application developers simple and intuitive description/programming abstractions, and ensures safe interactions among endpoint applications in concurrent environments.

**Paper Structure.** Section 2 gives motivating examples to introduce the design of protocol types, which capture the properties **P1** to **P4**; then we introduce an operation called *transformation* to transform a protocol type to local (i.e., endpoint) types while preserving the desired properties. Section 3 gives a process calculus with de-centralised multiple-failure-handling capability, including the syntax for programs and networks, and the operational semantics for runtime. Section 5 gives a type system for local processes and Sect. 4 gives a type system for networks to maintain communication coherent. Section 6 states the property of safety, including subject reduction and communication safety, and the property of progress. Section 7 discusses related works. Finally Sect. 8 concludes our work.

Detailed formal and auxiliary definitions, lemmas, and proofs are presented in the extended version of this paper [9].

## 2 Protocol Types, Local Types, and Transformation

This section uses examples to show the design behind protocol types and uses Fig. 1 to illustrate an operation called *transformation*, which generates local types from protocol types. All formal definitions can be found in the extended version of this paper [9].

**P1**and

**P2**. Assume that in a network all outgoing traffic passes through a proxy (\(\mathsf {Proxy}\)), monitoring the traffic and logging general information, e.g., consumed bandwidth. The proxy sends this information to a log server (\(\mathsf {Log}\)). If the proxy detects suspicious behaviour in the traffic, it raises a \(\mathsf {SuspiciousB}\) failure and notifies \(\mathsf {Log}\) and a supervision server (\(\mathsf {SupServer}\)) to handle the failure by having \(\mathsf {Log}\) send the traffic logs to \(\mathsf {SupServer}\); if the proxy detects that the quota of \(\mathsf {Client}\) is low, it raises a \(\mathsf {QuotaWarn}\) failure and notifies \(\mathsf {Log}\) and \(\mathsf {Client}\) to handle the failure by having \(\mathsf {Log}\) send quota information to \(\mathsf {Client}\). Then \(\mathsf {Proxy}\) forwards the traffic from \(\mathsf {Client}\) to an external server (\(\mathsf {EServer}\)). We propose the following type to formalise the above scenario:It specifies that either \(\mathsf {SuspiciousB}\) or \(\mathsf {QuotaWarn}\) may occur at interaction \(\mathsf {\mathsf {Proxy}} \rightarrow \mathsf {\mathsf {Log}}\) (

**P1**). \(\mathsf {Proxy}\) can raise the failure and correspondingly sends

*failure/non-failure notifications*to all relevant parties. \(\mathsf {SuspiciousB}\) affects \(\mathsf {Log}\) and \(\mathsf {SupServer}\) since they both handle that failure; it also affects \(\mathsf {Client}\) because the occurrence of \(\mathsf {SuspiciousB}\) implies that \(\mathsf {QuotaWarn}\) will never occur, and thus \(\mathsf {Client}\) will not yield to the handling activity for \(\mathsf {QuotaWarn}\). When \(\mathsf {QuotaWarn}\) occurs, the situation is similar to \(\mathsf {SuspiciousB}\)’s. Once one of them occurs, \(\mathsf {Proxy}\) sends failure notifications carrying the occurred failure to \(\mathsf {Log}\), \(\mathsf {SupServer}\), and \(\mathsf {Client}\); when no failures occur, \(\mathsf {Proxy}\) sends non-failure notifications carrying both failures to the same participants to inform them not to yield to handling activity. No failures will affect \(\mathsf {Proxy}\) and \(\mathsf {EServer}\) (

**P2**) because \(\mathsf {Proxy}\) and \(\mathsf {EServer}\) are not involved in any failure handling activities: \(\mathsf {Proxy}\) continues sending to \(\mathsf {EServer}\) after it raises a failure and \(\mathsf {EServer}\) still receives a message from \(\mathsf {Proxy}\) as expected.

**P2**and

**P3**. Assume a resource provider (\(\mathsf {RP}\)) informs a coordinator (\(\mathsf {Coord}\)) of which resources it can provide. The library (\(\mathsf {Lib}\)) can get the resource by requesting \(\mathsf {Coord}\) only if \(\mathsf {RP}\) sends a list of resources to \(\mathsf {Coord}\). If failure \(\mathsf {NoRes}\) occurs, meaning absence of resource, \(\mathsf {Coord}\) informs \(\mathsf {Lib}\) of this; otherwise, \(\mathsf {Lib}\) places a request to \(\mathsf {Coord}\) or raises a failure \(\mathsf {Abort}\) (due to one of many possible local problems) and \(\mathsf {Coord}\) invokes \(\mathsf {Record}\) to record this failure:Although it may seem that interactions \(\mathsf {RP} \rightarrow \mathsf {Coord}\) and \(\mathsf {Lib} \rightarrow \mathsf {Coord}\) can run concurrently, this is not the case because, \(\mathsf {Lib}\) can only get the resource if \(\mathsf {RP}\) gives \(\mathsf {Coord}\) a resource list, which implies failures \(\mathsf {NoRes}\) and \(\mathsf {Abort}\) are dependent (

**P3**). Additionally, we constrain that failure handling should be not be mixed (

**P2**).

*Synchronisation*of \(\mathsf {Lib}\) to yield to the completion of \(\mathsf {RP} \rightarrow \mathsf {Coord}\) is thus needed. This helps programmers tremendously in reasoning about the states that participants are in after failures. Note that, if \(\mathsf {RP} \rightarrow \mathsf {Coord}\) and \(\mathsf {Lib} \rightarrow \mathsf {Coord}\) have no failures specified, then \(\mathsf {RP} \rightarrow \mathsf {Coord}\) and \(\mathsf {Lib} \rightarrow \mathsf {Coord}\) can run concurrently because there is no failure dependency.

### 2.1 Protocol Types

*partial failures*in interactions which exhibit the properties

**P1**to

**P4**, Fig. 2 defines protocol types based on the definition of session types given in the work by Bettini

*et al.*[1]. Protocol types, denoted by \(\mathbf {G}\), are composed of interaction types \( g \) and terminated by \(\mathsf {end}\). We use (\( p , ...\)) to range over identifiers, (\( S , ...\)) to range over basic types like \(\mathsf {bool}\), \(\mathsf {unit}\), \(\mathsf {int}\), and \(\mathsf {str}\), and (

*F*, ...) to range over sets of failures. We highlight the key concepts:

- (A)
\(\mathsf { p _1} \rightarrow \mathsf { p _2}: \widetilde{ S } \vee F\) is a

*failure-raising interaction tagged with**F*, or*F*-*raising interaction*for short. When \(\widetilde{ S } \not = \emptyset \) and \(F \not = \emptyset \), either \( p _1\) sends a content of type \(\widetilde{ S }\) or raises*one*failure in*F*to \( p _2\). When one of \(\widetilde{ S }\) and*F*is empty, \( p _1\) only makes an output based on the non-empty one. We do not allow both*F*and \(\widetilde{ S }\) to be empty. - (B)
Open image in new window defines default interaction in \( g \), which is an interaction type, and a

*handling environment*, \( h = \{ f _i : g _i\}_{i \in I}\), which maps failures to handling activities defined in global types. Our design allows \( h \) to deal with different failures, with exactly one handler taking over once failures occur. - (C)
In (A), if

*F*is empty, we do not require the interaction to be enclosed in a try-handle term; otherwise, the interaction*must*appear within a try-handle term.

In the remaining syntax, we use \(\varepsilon \) for idle, and \( g _1; g _2\) for sequential composition. A type variable is denoted by \( t \), and a recursive type under an equi-recursive approach [18] is denoted by \(\mu t . g \), assuming every \( t \) appearing in \( g \) is guarded by prefixes.

For brevity, our protocol types presentation omits parallel composition, thus we do not allow session interleaving or multi-threading at a local participant. Note that we can still implement two individual interactions running in parallel by implementing two disjoint groups of interacting participants who execute two respective protocols. We omit branching with multiple options of ongoing interactions, since the term Open image in new window is able to encode the branching in multiparty session types [1, 12] by using failures \(l_1, ..., l_n\) as labels for branches. We leave unaffected participants to continue default actions regardless of the occurrence of failures; we do not inform them of the failure. Moreover, we do not have well-formedness constraints on the shape of interactions in branches (i.e. failure handlers in our syntax) as most multiparty session types and choreographic programming related works require [1, 3, 4, 5, 8, 12, 13, 16, 17, 21].

### 2.2 Local Types and Transformation

*transformation*, to synthesize a guidance to locally guide which participants need to coordinate with others once a failure occurs, or inversely to assert that none has occurred, before proceeding with the next action. The operation

*transformation*includes the following steps:

- 1.
**Generating a global structure**from a given protocol type and alpha renaming it. - 2.
**Projecting**the above structure to every participants to obtain*simple*local types, which are not yet sufficient for robust failure handling. The projection algorithm is similar to the mechanism in multiparty session types [1, 12]. - 3.
**Adding**the information of*need-(to)-be-informed*participants, who are those affected by or involved in handling failures, and synchronisation points to local types.

After these 3 steps, we obtain local types which are sufficient for our type system to ensure *robust failure handling*.

Figure 1 uses the example of \(\mathbf {G}_{proxy}\) to demonstrate the operation of *transforming* a protocol type to each participants’ local types, which are defined below:

### **Definition 1**

**(Local Types**\(\mathbf {T}\)

**).**

A local type is either an action (\( \mathbf {n} \)), a try-handle type (\(\textit{try}\{\mathbf {T}, \mathbf {H}\}\)), a sequencing type (\( \mathbf {n} \dashrightarrow \mathbf {T}\)), a local type variable (\( t \)), or a recursive type (\(\mu t . \mathbf {T}\)). We use \(\varepsilon \) to type an idle action, while \(\mathsf {end}\) types termination. A sending action, typed by \(\textit{sn} \langle p ! \ \widetilde{ S } \vee F , \widetilde{ p }', \widetilde{ p }'' \rangle \), specifies a sending of normal content of type \(\widetilde{ S }\) to \( p \) or raising a failure in *F*. When a failure is raised, the sender also sends failure notifications to participants \(\widetilde{ p }'\). When normal content is sent, the sender also sends non-failure notifications to participants \(\widetilde{ p }''\). A receiving action, typed by \(\textit{rn} \langle p ? \ \widetilde{ S } \vee F \rangle \), specifies the reception of content of type \(\widetilde{ S }\) from \( p \), who may raise a failure in *F* instead. An action yielding to the arrival of a non-failure notification informing that no failures in *F* occurred, is typed by \(\textit{yield} \langle F \rangle \). A handling environment in local types, denoted by \(\mathbf {H}= \{ f _i: \mathbf {T}_i \}_{i \in I}\), maps failures to corresponding local handling actions defined in local types.

In Fig. 1, we firstly create a **global structure** for \(\mathbf {G}_{proxy}\) by \( Struct (\mathbf {G}_{proxy})\). Global structures, denoted by \(\mathtt {T}\), consist of either a single interaction (\(\mathtt {N}\)), a try-handle structure (\(\textit{try}\{\mathtt {T}, \mathtt {H}\}\)) where \(\mathtt {H}\) has a similar shape to handling environments in local types, a sequence (\(\mathtt {N}\dashrightarrow \mathtt {T}\)), or a recursive structure (\(\mu t . \mathtt {T}\)). We define \(\mathtt {N}\) as either \(\mathsf { p } \rightarrow \mathsf { p }: \widetilde{ S } \vee F\) or \(\varepsilon \) or \(\mathsf {end}\). By defining \( Struct (\text {single interaction})\) = \(\mathtt {N}\), a try-handle structure is obtained by Open image in new window , while a recursive structure is obtained by \( Struct (\mu t . \mathbf {G}) = \mu t . Struct (\mathbf {G})\).

The *simple* local types are gained by *projecting* \(\mathtt {T}\), created by \( Struct (\mathbf {G})\), on each participants. The projection rules are defined below:

### **Definition 2**

**(Projecting**\(\mathtt {T}\)

**onto Endpoint**\( p \)

**).**Assume \(\mathtt {T}= Struct (\mathbf {G})\) and \(\mathtt {T}\) is alpha-renamed so that all failures in \(\mathtt {T}\) are unique. Define \(\upharpoonright \! \!(\mathtt {T}, p )\) as generating a local type on \( p \):

Rule (2) is for dually interacting participants. It introduces \(\varepsilon _F\), which has equivalent meaning to \(\varepsilon \) (i.e. idle action) but is only used in *transformation* for adding synchronisation points. As we project an interaction \(\mathsf { p _1} \rightarrow \mathsf { p _2}: \widetilde{ S } \vee F\) with \( p _1 \not = p _2\) onto \( p _1\) (resp. \( p _2\)), we get an action \(\textit{sn} \langle p _2 ! \ \widetilde{ S } \vee F , , \ \_ \ , \_ \rangle \) (resp. \(\textit{rn} \langle p _1 ? \ \widetilde{ S } \vee F \rangle \)). Note that the two slots \(\_\) in the sending action are preserved for adding the need-be-informed participants as a failure occurs (the first slot), and those as no failures occur (the second slot). As we project the interaction to some participant who is *not* in the interaction, we get \(\varepsilon _{F}\) (idle action). The subscript *F* indicates that if \( p \) is affected by some failures in *F* a synchronisation point will be added at this position. Rule (3) simply projects every sub-structure in the try block and handlers onto the participant. Rule (4) sequences two local types projected from a global structure. Other rules are straightforward.

*need-be-informed participants*into the failure-raiser’s sending actions (e.g., the one marked in green ring in Fig. 1). We use

*C*(\(\mathtt {T}\),

*F*) to get the set of need-be-informed participants regarding a unique

*F*in a global structure \(\mathtt {T}\). It is the least fixed point of \( c (\mathtt {T}, \mathtt {T}, F, \mathbf {\textit{r}})\), which recursively collects the need-be-informed participants regarding

*F*based on \(\mathtt {T}\). Since for every protocol the number of participants is finite, function \( c \) will converge to a fixed set of participants. The key calculation is done by the rule below

*initial*\(\mathbf {\textit{r}}\) to be the set containing the receiver of

*F*-raising interaction and the participants involved in handling

*F*in \(\mathtt {T}\); later \(\mathbf {\textit{r}}\) acts as an accumulator collecting the participants causally related to the initial \(\mathbf {\textit{r}}\). \(\mathtt {N}\) is an interaction, \({\textit{pid}}(\mathtt {N})\) is the set of participants in \(\mathtt {N}\), and \(\textit{FSet}(\mathtt {N})\) returns the failures tagged on \(\mathtt {N}\). This rule says that if the interaction we are checking appears after the

*F*-raising interaction, and some of its interacting participants are related to \(\mathbf {\textit{r}}\) or the interaction itself can raise another failure set (e.g. the interaction \(\mathsf {Lib} \rightarrow \mathsf {Coord}: \mathsf {str} \vee \mathsf {Abort}\) in \(\mathsf {RP} \rightarrow \mathsf {Coord}: \mathsf {str} \vee \mathsf {NoRes}; \mathsf {Lib} \rightarrow \mathsf {Coord}: \mathsf {str} \vee \mathsf {Abort}\) is related to \(F = \{\mathsf {NoRes}\}\)), then we collect its participants (i.e. \({\textit{pid}}(\mathtt {N})\)) and the need-be-informed participants with respect to the failures that can be raised by \(\mathtt {N}\).

*F*is \(\varepsilon _F\) (e.g. in \(\mathbf {G}_{coord}\) the participant \(\mathsf {Lib}\) has \(\varepsilon _{\mathsf {NoRes}}\) by Definition 2), \(\textit{yield} \langle F \rangle \) is positioned ahead of \( p \)’s action (e.g. a sending action of \(\mathsf {Lib}\) to \(\mathsf {Coord}\) specified in \(\mathbf {G}_{coord}\)), because \( p \) needs to wait for the notification regarding

*F*before taking any action. If \( p \) is the receiver, we have \(\textit{yield} \langle F \rangle \) positioned before the receiving action because \(\textit{yield} \langle F \rangle \) is the point deciding whether the process will handle a failure regarding

*F*or proceed. If \( p \) is the sender, we should have \(\textit{yield} \langle F \rangle \) positioned after the sending action, because as \( p \) is involved for some failure handling activity regarding

*F*, it needs to first send out failure notifications then go back to execute the handling activity; otherwise the process will get stuck.

In Fig. 1, green ring appears at \(\mathsf {Proxy}\)’s second action because, if a failure occurs, \(\mathsf {Proxy}\) has to inform \(\mathsf {Log}\), \(\mathsf {SupServer}\), and \(\mathsf {Client}\) about that failure. Blue rings appear at \(\mathsf {Client}\), \(\mathsf {Log}\), and \(\mathsf {SupServer}\)’s try blocks because they are involved in handling activity, and they can terminate only after getting the notifications that no failures occurred.

Overall, we define the operation of transformation as \(\textit{Transform}(\mathbf {G}, p )\), which transforms \(\mathbf {G}\) to a local type for \( p \).

## 3 Processes for Decentralised Multiple-Failure-Handling

**P4**). The shared queue is only

*conceptually*global for convenience, and could be split into individual participant queues.

*Syntax.* In Fig. 3, we define *x* as value variables, *y* as channel variables, *a* as shared names (e.g., names for services or protocol managers), and \( s \) as session names (i.e., session IDs), \( p \) as participant identifiers, and *X* as process variables. We use *u* for names and \( c \) for channels, which are either variables or a combination of \( s \) and \( p \). The definition for expressions *e* is standard. We define the syntax of processes (*P*, ...) and that of networks (*N*, ...), which represent interactions of processes at runtime. Process \(\mathbf {0}\) is inactive. Process \( c ! ( p , \langle \tilde{e} \rangle ^{F})\) denotes an output, which may alternatively raise a failure \( f \) in *F*, sends a message with content \(\tilde{e}\) to \( p \) via channel \( c \); while \( c ?( p , (\tilde{x})^{F}).P\) denotes an input using \( c \) to receive a content from \( p \), which may alternatively raise a failure from *F*. Every \(\tilde{x}\) appearing in *P* is bound by the input prefix. When \(F = \emptyset \), we omit *F* since the process will not raise/receive a failure. Process \( c \ \mathsf {raise}( f )\ \mathsf {to}\ p \) raises \( f \) to \( p \) via channel \( c \). Process \({ c \otimes F}\) is guarded by \(\otimes F\), a synchronisation point, yielding to non-failure notification for *F*. A try-handle process \(\mathsf {try}\{P\}\mathsf {h}\{ H \}\) executes *P* until a handler \( f \in \textit{dom}( H )\) is triggered, then the triggered handler takes over. A handling environment, denoted by \( H \), maps failure names to handling processes. We write \(P_1 ; P_2\) to represent a sequential composition where \(P_2\) follows \(P_1\). Process \( \textsf {def}~ D ~\textsf {in}~P \) defines a recursion, where declaration \( D = (X(\tilde{x}\ c ) = P)\) is a recursive call. The term \(\textsf {if}\ e \ \textsf {then}\ P \ \textsf {else}\ P\) is standard. We define evaluation context \(\mathscr {E}\) over processes. It is either a hole, a context in a try-handle term, or a context sequencing next processes.

A network *N* is composed by linking points, denoted by \(a[ p ](y).P\), and runtime processes, denoted by \([P]_{\mathbf {T}}\) with global transports (i.e., \(s: q\)) for proceeding communications in a private session (i.e., \((\textsf {new} \ s)N\)). Our framework asks a process to join one session at a time. A linking point \(a[ p ](y).P\) is guarded by \(a[ p ](y)\) for session initiation, where shared name *a* associates a service to a protocol type. \([P]_{\mathbf {T}}\) represents a runtime process which is guided by \(\mathbf {T}\) for notifying need-be-informed participants.

*failure*\( f \)

*occurred*(resp. a set of failures

*F*to indicate that

*no failures in*

*F*

*occurred*) to \( p \). Conventionally we say \(\langle \! \langle p , \emptyset \rangle \! \rangle = \varepsilon \). When session \( s \) is initiated for a network, a private (i.e., hidden) session is created, in which activities cannot be witnessed from the outside. We use structural congruence rules, defined by \(\equiv \), which are standard according to the works of multiparty session types [1, 12].

*Operational Semantics.* Figure 4 gives the operational semantics for networks (i.e., runtime processes) through the reduction relation \(N \rightarrow N\). We have added boxes to those rules which differ from standard session type definitions. In rule Open image in new window , a session is generated with a fresh name \( s \) through shared name *a* obeying protocol type \(\mathbf {G}\). This indicates that all processes in the new session \( s \) will obey to the behaviours defined in \(\mathbf {G}\). At the same time, a global queue \( s : \varepsilon \) is generated, and the local process associated with \( p \) replaces \(y_{ p }\) with \( s [ p ]\); a local type \(\mathbf {T}_{ p }\) is generated by \(\textit{Transform}(\mathbf {G}, p )\) to guide the local process associated with \( p \) for propagating notifications. Note that, as we enclose a local process with \(\mathbf {T}\), together they become an element of a network. \(\mathbf {T}\) is merely a local type and the reduction of the network does not change \(\mathbf {T}\).

Rule Open image in new window states that, in \( s \), a process associated with \( p _1\) is able to receive a value \(\tilde{v}\) from participant associated with \( p _2\) and message \(\langle p _2, p _1, \langle \tilde{v} \rangle \rangle ^{F}\) is on the top of \(q\). Then \(\tilde{v}\) will replace the free occurrences of \(\tilde{x}\) in *P*. The shape of \( s [ p _1] ?( p _2, (\tilde{x})^{F})\) indicates that its dual action may send it a failure from *F*; in other words, if \(F \not = \emptyset \), a process should be structured by a try-handle term for possible failure handling. Rule Open image in new window is dual to Open image in new window . We define \( node (\mathbf {T}, F)\) as a function returning an action tagged with *F* in a local type \(\mathbf {T}\). Rule Open image in new window states that, if there is an action in \(\mathbf {T}\) matching \( s [ p _1] ! ( p _2, \langle \tilde{e} \rangle ^{F})\) and \(\tilde{e} \Downarrow \tilde{v}\), then the process associated with \( p _1\) in \( s \) is allowed to send a message with content \(\tilde{v}\) to \( p _2\) and non-failure notifications \(\langle \! \langle p '_1, F \rangle \! \rangle ... \langle \! \langle p '_n, F \rangle \! \rangle \). Note that, non failure notifications are automatically generated at runtime. If a process follows the guidance of the attached \(\mathbf {T}\), since \(\mathbf {T}\) is alpha-renamed, every failure raised by the process is unique. Similarly, Open image in new window states for a process associated with \( p _1\) in \( s \), to raise \( f \) to \( p _2\) and other affected ones, \( p '_1, .., p '_n\). Very importantly, in Open image in new window and Open image in new window , \(q\) has no failure notification to trigger \( H \) because, as a failure-raising interaction is ready to fire (i.e. its sender is about to send), it implies that, globally, either this interaction is the first failure-raising interaction in *s* (thus no failure yet occurs in *s*), or its previous interactions did not raise a failure in \(\textit{dom}( H \)) (thus by **P2**, this interaction is able to raise a failure in \(\textit{dom}( H )\), and no failures in \(\textit{dom}( H )\) yet occurs in *s*). For convenience, we use \(\textit{act}\) to extract the channel that a process or the set of handlers is acting on, i.e. \(\textit{act}(P) = s [ p ]\) says *P* is acting on channel \( s [ p ]\), and \(\textit{act}( H ) = \textit{act}( H ( f ))\) for every \( f \in \textit{dom}( H )\).

In Open image in new window , if the \( H \) in a try-handle process associated with \( p \) in \( s \) will not be triggered by the top message in \( s :q\), then the process in the try block will take action according to the process’s interaction with the queue. In Open image in new window , as \( f \) arrives to a try-handle process associated with \( p \) in \( s \) whose try block is yielding to non-failure notification for *F* and \( H \) is able to handle \( f \), the handling process \( H ( f )\) takes over. Due to asynchrony, other processes’ handlers for \( f \) may become active before this process. Thus some messages in \(q\) may be sent from other processes’ handlers of \( f \) for *P*. Note that none of the messages in \(q\) are for \(\mathscr {E}\) because, all default sending actions in other processes are also guarded by synchronisation points.

Synchronisation either proceeds with Open image in new window , where *F* is sufficient to remove \(F'\), or with Open image in new window , where *F* is included in \(F'\) carried in the notification. For the former, some processes in the failure-handling activity only take care of partial failures in *F*, i.e. \(F'\), when they receive *F*, to ensure that no failures in *F* occurred. For the latter, further synchronisation is required by \(F'' = (F' \setminus F) \not = \emptyset \). In Open image in new window , since we have added sufficient synchronisation points to guard processes who must yield to non-failure notifications, when a network reaches \([\mathsf {try}\{v\}\mathsf {h}\{ H \}]_{\mathbf {T}}\), it is safe to be inactive because no more failure notifications will occur. In other rules, the operations enclosed in \(\mathbf {T}\) are standard according to the works of multiparty session types [1, 12].

## 4 Typing Local Processes

*x*does not occur in \(\varGamma \), so does \(\varGamma , X: (\tilde{x}\ \mathbf {T})\) and \(\varDelta , c : \mathbf {T}\). We assume that expressions are typed by sorts. \(\varGamma \vdash e: S \) is the typing judgment for expressions, whose typing rules are standard. The typing judgment \(\varGamma \vdash P \rhd \varDelta \) for local processes reads as “\(\varGamma \) proves that

*P*complies with abstract specification \(\varDelta \)”.

Rule Open image in new window states that an idle process is typed by end-only \(\varDelta \), which means \(\forall c \in \textit{dom}(\varDelta )\), \(\varDelta ( c ) = \mathsf {end}\). Rule Open image in new window types sequential composition by sequencing \(P_2\)’s action in \(\varDelta _2\) after \(P_1\)’s action in \(\varDelta _1\) as long as \(P_1\) and \(P_2\) are acting on the same channel. We define \(\varDelta _1 \circ \varDelta _2\) as the one defined in the multiparty session types extended with failure-handling ability [3]. Rule Open image in new window specifies that \( s [ p _1] ?( p _2, (\widetilde{x})^{F}). P\) is valid as it corresponds to local type \(\textit{rn} \langle p _2 ? \ \widetilde{ S } \vee F \rangle \dashrightarrow \mathbf {T}\) as long as *P*, associated with \( p _1\) in session \( s \), is well-typed by \(\mathbf {T}\) under an environment which knows \(\tilde{x}: \widetilde{ S }\). In Open image in new window and Open image in new window , since the slots are not related to typing, their contents are omitted. Rules Open image in new window and Open image in new window share the same action for typing because \(\textit{sn} \langle p _2 ! \ \widetilde{ S } \vee F , \ \_ \ , \_ \rangle \) specifies two possible actions: a sending action \( s [ p _1] ! ({ p _2}, \langle \tilde{e} \rangle ^{F})\) in which \(\tilde{e}\) must have type \(\widetilde{ S }\), and action \( s [ p _1] \ \mathsf {raise}( f )\ \mathsf {to}\ p _2\) in which \( f \) must be in *F*. Then the continuing process *P* is typed by the following \(\mathbf {T}\).

For typing handling activities, rule Open image in new window types a try-handle term if its default action (i.e., *P*) with its following process is well-typed, and those in handlers with their follow-up processes are all well-typed. We require the following process \(P'\) should not contain any failure appearing in \( H \). Since *P* and any processes in \( H \) are acting on the same channel and \(\textit{act}( H )\) represents the channel that every processes in \( H \) is acting on, we use \(\textit{act}( H )\) to get the channel in order to type \(\mathsf {try}\{\mathbf {0}\}\mathsf {h}\{ H \}\). Recall Fig. 1 and projection rules defined in Definition 2, for local types the sequencing action is linked at every leaf in a try-handle term; in other words, the type of \(P'\) is attached to the type of *P* and also to every handler in \(\mathbf {H}\). Therefore, as we type a try-handle term, we also consider its following process.

Rule Open image in new window specifies that process \( c \otimes F ; P\) is well-typed if the local type for \( c \) has synchronisation point \(\textit{yield} \langle F \rangle \) and *P* is well-typed w.r.t. \(\mathbf {T}\). The algorithm for adding synchronisation points (introduced in Sect. 2) automatically places the synchronisation points in local types and ensures that once a failure is raised, other possible failure-raising actions must not fire. Since the operational semantics defined in Fig. 4 only deliver notifications regarding *F* to need-be-informed participants and only one failure in *F* can be raised, our type system ensures only one failure in a try-handle term is handled and all participants affected by *F* have consistent failure handling activities.

Rule Open image in new window types a local process variable, and rule Open image in new window types a recursion with \(\varDelta \), where the recursive call \(X(\tilde{x}, c ) = P\) is typed by \( c : \mathbf {T}\), indicating that *P* follows behaviour \(\mathbf {T}\) at \( c \). Others are standard according to the works of multiparty session types [1, 12].

## 5 Typing the Network

*Coherence*, according to the works of multiparty session types [1, 12], describes an environment where all

*interactions*are complying with the guidance of some \(\mathbf {G}\), such that the behaviour of every participant in \(\varDelta \), say \(s[ p ]\), obeys to \(\textit{Transform}(\mathbf {G}, p )\), which denotes a local type. To reason about coherence of default and handling interactions in a session, we statically type check the interactions by modeling the outputs and inputs among local processes and the shared global queue.

*a*provides a behaviour pattern defined in \(\mathbf {G}\). For guiding

*P*associated with \( p \), Open image in new window uses local type \(\mathbf {T}\) generated by \(\textit{Transform}(\mathbf {G}, p )\) to type

*P*acting on channel

*y*. Rule Open image in new window states that \([P]_{\mathbf {T}}\) is well-typed by \(\varDelta \) if

*P*is well-typed by \(\varDelta \), and \(\mathbf {T}\), gained by some \(\mathbf {G}\), contains the type which types

*P*acting on channel \(s[ p ]\). Note that, by rule Open image in new window (see Fig. 4), \([P]_{\mathbf {T}}\) is created after linking and \(\mathbf {T}\) is not changed after any reduction; thus \(\mathbf {G}\) in rule Open image in new window comes from rule Open image in new window . Rule Open image in new window ensures the parallel composition of two networks if each of them is well-typed and they do not share a common channel (i.e., \(\textit{dom}(\varDelta _1) \cap \textit{dom}(\varDelta _2) = \emptyset \)). The composed network exhibits the union of the session environments. Rule Open image in new window types hiding (i.e., \((\textsf {new} \ s )N\)) when the session environment of networks under \( s \), denoted by

### **Definition 3**

**(Coherence).** We say \(\varDelta \langle s \rangle \) is coherent if there exists \(\mathbf {G}\) such that \({\textit{pid}}(\mathbf {G}) = \{ p \mid s'[ p ] \in \textit{dom}(\varDelta ), s' = s \}\) and either (1) \(\forall s[ p ] \in \textit{dom}(\varDelta \langle s \rangle )\) we have \(\textit{Transform}(\mathbf {G}, p )\) is equal to the type of \(\varDelta (s[ p ])\) after \(s[ p ]\) absorbs all messages heading to it; or (2) there exists \(\varDelta ' \subset \varDelta \) such that \(\forall s[ p ] \in \textit{dom}(\varDelta \langle s \rangle \setminus \varDelta ' \langle s \rangle )\) we have that \(\textit{Transform}(\mathbf {G}, p )\) is equal to the type of \(\varDelta (s[ p ])\) after \(s[ p ]\) absorbs all messages heading to it, and \(\varDelta ' \langle s \rangle \) is coherent.

Note that due to asynchrony, after a sender takes action, the type of the sender and its receiver may be temporarily incoherent if the sender has moved forward and the output is still in the global queue. Therefore, coherence holds only after a receiver has absorbed all messages heading to it.

As we aim to handle partial failure(s), either (1) no failures occurred such that there exists \(\mathbf {G}\) defining interactions for every \(s[ p ]\) in \(\varDelta \langle s \rangle \), or (2) a failure occurs such that the need-be-informed participants, who are in \(\varDelta '\langle s \rangle \), are handling that failure in a coherent way, and other unaffected ones, who are in \(\varDelta \langle s \rangle \setminus \varDelta '\langle s \rangle \), still follow the behaviour defined in \(\mathbf {G}\).

## 6 Properties

We prove that our typing discipline ensures the properties of *safety* and *progress*. The property of safety is defined by *subject reduction* and *communication safety*. Firstly we define \(\varDelta \rightharpoonup \varDelta '\) as reductions of session environments. Intuitively, the reductions correspond closely to the operational semantics defined in Fig. 4. Subject reduction states that a well-typed network (resp. coherent session environment) is always well-typed (resp. coherent) after reduction:

### **Theorem 1**

**(Subject Congruence and Reduction).**

- 1.
(subject congruence) \(\varGamma \vdash N \rhd \varDelta \) and \(N \equiv N'\) imply that \(\varGamma \vdash N' \rhd \varDelta \).

- 2.
(subject reduction) \(\varGamma \vdash N \rhd \varDelta \) with \(\varDelta \) coherent and \(N \rightarrow N'\) imply that \(\varGamma \vdash N' \rhd \varDelta '\) such that \(\varDelta \rightharpoonup \varDelta '\) or \(\varDelta \equiv \varDelta '\) and \(\varDelta '\) is coherent.

*et al.*[1], global linearity-check is not needed. For convenience, we define here contexts on networks:

### **Corollary 1**

**(Communication Safety).** Suppose \(\varGamma \vdash N \rhd \varDelta \) and \(\varDelta \) is coherent. Let \(N_1= \mathscr {C}_1[s: q\cdot \langle p _2, p _1, \langle \tilde{v} \rangle \rangle ^{F} \cdot q']\) and \(N_2 = \mathscr {C}_2[s: q\cdot \langle p _1, \ f \rangle \cdot q']\) and \(N_3 = \mathscr {C}_3[s: q\cdot \langle \! \langle p _1, F \rangle \! \rangle \cdot q']\) and no messages in \(q\) is sending to \( p _1\).

- 1.
If \(N= \mathscr {C}[\mathscr {E}[s[ p _1] ?( p _2, (\tilde{x})^{F}) . P]_{\mathbf {T}}]\), then \(N \equiv N_1\) or \(N \rightarrow ^{*} N_1\).

- 2.
If \(N = \mathscr {C}[\mathscr {E}[\mathsf {try}\{s[ p _1] \otimes F'; P\}\mathsf {h}\{ H \}]_{\mathbf {T}}]\) and \(F' \subseteq F \not = \emptyset \), then either (a) \(N \equiv N_2\) or \(N \rightarrow ^{*} N_2\) or (b) \(N \equiv N_3\) or \(N \rightarrow ^{*} N_3\).

- 3.
If \(N = \mathscr {C}[\mathscr {E}[\mathsf {try}\{v\}\mathsf {h}\{ H \}]_{\mathbf {T}}]\) and \( f \in \textit{dom}( H )\) and process \( H ( f )\) is acting on \(s[ p _1]\), then \(N \not \equiv N_2\) and \(N \not \rightarrow ^{*} N_2\).

This corollary states that our system is *free from deadlock* and *starvation*: if there is a receiving action in *N*, then *N* is either structurally congruent to the network which contains the message for input, or *N* will reduce to such a network. We state that \([\mathsf {try}\{v\}\mathsf {h}\{ H \}]_{\mathbf {T}}\) is safe to become idle by proving that no \( f \in \textit{dom}( H )\) is heading to it (Case 3).

Corollary 1 provides the means to prove that our system *never gets stuck and is free from orphan messages* (*property of progress*):

### **Theorem 2**

**(Progress).** \(\varGamma \vdash N \rhd \varDelta \) with \(\varDelta \) coherent and \(N \rightarrow N'\) imply that \(N'\) is communication safe or \(N' = \mathbf {0}\parallel s :\varepsilon \).

This theorem states that every interaction in a well-typed network is a safe interaction and reducible until the whole network terminates without any message left.

## 7 Related Works

Failure handling has been addressed in several process calculi and communication-centered programming languages. For instance, the conversation calculus [20] models exception handling in abstract service-based systems with message-passing based communication. It studies expressiveness and behaviour theory of bisimilarity rather than theory of types. Colombo and Pace [7] investigate several different process calculi for failure-recovery within long-running transactions. They give insight regarding the application of these failure-recovery formalisms in practice via comparing the design choices and formal notions of correctness properties. Both works do not provide a type system to statically type check local implementations.

Previous works for failure handling with type systems [3, 4, 5, 13] extend the theory of session types to specify error handling under asynchronous interactions. These works do not capture handling of partial failures and the scenarios which exhibit the properties **P1** to **P4**. They may be able to encode multiple possible failures at the interaction level (**P1**), for example, by (i) explicitly using a labeled branching inside the failure handler, or (ii) piggybacking a label with the failure notification (“multiplexing”). However, (i) implies double communication and synchronisation (once for the failure notification, then for the branch) and (ii) implies that either the well-formedness constraints on the shape of interactions in handlers are needed or any participants related to any failure handling activity should be informed as a failure occurs in order to know how to proceed. Our approach is different since we do not have such constrains and we do not inform the unaffected participants. Moreover, while the termination of try-handle terms in those works demands an agreement of all participants, ours allows local try-handle terms to terminate since we have locally added synchronisation points by *transformation* (see Sect. 2.2). Our approach can encode the global types for exception-handling proposed in the work by Capecchi *et al.* [3], which is the closest related work (and other related ones have similar try-handle syntax). The formal encoding can be found in the long version of this paper [9].

Collet and Van Roy [6] informally present a distributed programming model of Oz for asynchronous failure handling and focus on programming applications in a distributed manner. Jakšić and Padovani [14] study a type theory for error handling for copy-less messaging and memory sharing to prevent memory leaks/faults through typing of exchange heaps. Lanese *et al.* [11, 15] formalise a feature which can dynamically install fault and compensation handlers at execution time in an orchestration programming style. They investigate the interplay between fault handling and the request-response pattern. In contrast, our framework statically defines the handlers for non-trivial failure handling, which can only be done with a global perspective.

## 8 Concluding Remarks

Protocol types enable the design of protocols in an intuitive manner, and statically type check multiple failure-handling processes in a transparent way. Our type discipline exhibits the desirable properties of **P1** **(alignment)**, **P2** **(precision)**, **P3** **(causality)**, and **P4** **(decentralisation)** for robust failure handling, and ensures fundamental properties of safety and progress. We are currently implementing the proposed framework and are extending it to support system-induced failures as opposed to application-specific ones focused on in this paper, in addition to parameterisation and dynamic multiroles.

## References

- 1.Bettini, L., Coppo, M., D’Antoni, L., De Luca, M., Dezani-Ciancaglini, M., Yoshida, N.: Global progress in dynamically interleaved multiparty sessions. In: van Breugel, F., Chechik, M. (eds.) CONCUR 2008. LNCS, vol. 5201, pp. 418–433. Springer, Heidelberg (2008)CrossRefGoogle Scholar
- 2.Caires, L., Vieira, H.T.: Conversation types. In: Castagna, G. (ed.) ESOP 2009. LNCS, vol. 5502, pp. 285–300. Springer, Heidelberg (2009)CrossRefGoogle Scholar
- 3.Capecchi, S., Giachino, E., Yoshida, N.: Global escape in multiparty sessions. MSCS
**29**, 1–50 (2015)zbMATHGoogle Scholar - 4.Carbone, M., Honda, K., Yoshida, N.: Structured interactional exceptions in session types. In: van Breugel, F., Chechik, M. (eds.) CONCUR 2008. LNCS, vol. 5201, pp. 402–417. Springer, Heidelberg (2008)CrossRefGoogle Scholar
- 5.Carbone, M., Yoshida, N., Honda, K.: Asynchronous session types: exceptions and multiparty interactions. In: Bernardo, M., Padovani, L., Zavattaro, G. (eds.) SFM 2009. LNCS, vol. 5569, pp. 187–212. Springer, Heidelberg (2009)CrossRefGoogle Scholar
- 6.Collet, R., Van Roy, P.: Failure handling in a network-transparent distributed programming language. In: Cheraghchi, H.S., Lindskov Knudsen, J., Romanovsky, A., Babu, C.S. (eds.) Exception Handling. LNCS, vol. 4119, pp. 121–140. Springer, Heidelberg (2006)CrossRefGoogle Scholar
- 7.Colombo, C., Pace, G.J.: Recovery within long-running transactions. ACM Comput. Surv.
**45**(3), 28: 1–28: 35 (2013)CrossRefzbMATHGoogle Scholar - 8.Deniélou, P.-M., Yoshida, N.: Dynamic multirole session types. In: POPL 2011, pp. 435–446 (2011)Google Scholar
- 9.Technical report. Long version of this paper. https://github.com/Distributed-Systems-Programming-Group/paper/blob/master/2016/forte16_long_dsp.pdf
- 10.Gärtner, F.C.: Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Comput. Surv.
**31**(1), 1–26 (1999)CrossRefGoogle Scholar - 11.Guidi, C., Lanese, I., Montesi, F., Zavattaro, G.: On the interplay between fault handling and request-response service invocations. In: 8th International Conference on Application of Concurrency to System Design, 2008, ACSD 2008, pp. 190–198, June 2008Google Scholar
- 12.Honda, K., Yoshida, N., Carbone, M.: Multiparty asynchronous session types. In: POPL 2008, pp. 273–284. ACM (2008)Google Scholar
- 13.Hu, R., Neykova, R., Yoshida, N., Demangeon, R., Honda, K.: Practical interruptible conversations. In: Legay, A., Bensalem, S. (eds.) RV 2013. LNCS, vol. 8174, pp. 130–148. Springer, Heidelberg (2013)CrossRefGoogle Scholar
- 14.Jakšić, S., Padovani, L.: Exception handling for copyless messaging. Sci. Comput. Program.
**84**, 22–51 (2014)CrossRefzbMATHGoogle Scholar - 15.Lanese, I., Montesi, F.: Error handling: from theory to practice. In: Margaria, T., Steffen, B. (eds.) ISoLA 2010, Part II. LNCS, vol. 6416, pp. 66–81. Springer, Heidelberg (2010)CrossRefGoogle Scholar
- 16.Lanese, I., Montesi, F., Zavattaro, G.: Amending choreographies. In: WWV 2013, vol. 123 of EPTCS, pp. 34–48 (2013)Google Scholar
- 17.Mostrous, D.: Session Types, in Concurrent Calculi: Higher-Order Processes and Objects. Ph.D. thesis, Imperial College London (2009)Google Scholar
- 18.Pierce, B.C.: Types and Programming Languages. MIT Press, Cambridge (2002)zbMATHGoogle Scholar
- 19.Takeuchi, K., Honda, H., Kubo, M.: An interaction-based language and its typing system. In: Halatsis, Constantinos, Philokyprou, G., Maritsas, D., Theodoridis, Sergios (eds.) PARLE 1994. LNCS, vol. 817. Springer, Heidelberg (1994)Google Scholar
- 20.Vieira, H.T., Caires, L., Seco, J.C.: The conversation calculus: a model of service-oriented computation. In: Drossopoulou, S. (ed.) ESOP 2008. LNCS, vol. 4960, pp. 269–283. Springer, Heidelberg (2008)CrossRefGoogle Scholar
- 21.Yoshida, N., Vasconcelos, V.T.: Language primitives and type discipline for structured communication-based programming revisited: two systems for higher-order session communication. Electr. Notes Theor. Comput. Sci.
**171**(4), 73–93 (2007)CrossRefGoogle Scholar