Keywords

1 Introduction

For distributed systems where application components interact asynchronously and concurrently, the design and verification of communication protocols is critical. These systems are prone to partial failures, where some components or interactions may fail, while others must continue while respecting certain invariants. Since not all failures can be simply masked [10], programmers must explicitly deal with failures.

To ease the burden on programmers for constructing resilient communication protocols in the presence of partial failures, we propose a framework for robust failure handling. Our framework ensures safety during normal execution and in case of failures. In particular our framework provides the following properties:

  1. 1.

    P1 (alignment): The occurrences of failures are specified at the level of individual interactions, which they be raised.

  2. 2.

    P2 (precision): If a failure occurs, an endpoint is informed iff it is affected by the failure or involved in handling it, and its handling is not mixed with that of other failures.

  3. 3.

    P3 (causality): Dependencies between failures are considered, i.e., a failure can affect (enable, disable) others which may occur in subsequent or concomitant interactions.

  4. 4.

    P4 (decentralisation): No central authority or component controls the decisions or actions of the participants to handle a failure.

Inspired by session types [12, 19], we introduce protocol types to achieve these properties. The basic design is shown in Eq. (1):

(1)

where the first term expresses that a participant \( p _1\) either sends a message of type \(\widetilde{ S }\) to another participant \( p _2\), or raises one of several failures (i.e., \( f _1, ..., \ f _n\)). \(\mathbf {G}\) specifies the subsequent interactions and \(\mathbf {G}_i\) specifies the handling protocols for \( f _i\), \(i = 1..n\) in . In short, failures are thus associated with elementary interactions (P1), only participants affected by such a failure in \(\mathbf {G}\) (e.g., they are expecting communication from \( p _{2}\)) or those involved in the corresponding failure handling activity are informed (P2). There is at most one \( f \) appearing in that will be handled/raised (P3). Our semantics ensure that there is no central authority (P4), meaning, notifications of failures (and absence thereof) are delivered asynchronously from failure sources and processes are typed by local (i.e., endpoint) types achieved by the projection of participants over protocol types.

This design results in a distinguishable type system from the design of Eq. (2):

(2)

In Eq. (2), only one failure may occur in a try-block and where/when it will occur is not specified; once a failure — no matter which one — occurs somewhere in \( p _1 \rightarrow p _2: \widetilde{ S }\) or \(\mathbf {G}\), the handling activity \(\mathbf {G}'\) simply takes over. Previous works on dealing with failures [24, 6] are based on Eq. (2) and/or centralised authorities, and hence do not satisfy all of P1 to P4 (see Sect. 7 for details).

Just as multiparty session types aim to specify the interactions among participants and verify implementations of these participants, protocol types specify the global interplay of failures in interactions among participants and verify the failure-handling activities of these participants. To the best of our knowledge, this is the first work that presents a type system for statically checking fine-grained failure handling activities across asynchronous/concurrent processes for partial failures in practical distributed systems. We define a calculus of processes with the ability to not only raise (“throw”) and handle (“catch”) failures, but to also automatically notify processes of failures or absence thereof at runtime.

Our framework also gives protocol designers and endpoint application developers simple and intuitive description/programming abstractions, and ensures safe interactions among endpoint applications in concurrent environments.

Paper Structure. Section 2 gives motivating examples to introduce the design of protocol types, which capture the properties P1 to P4; then we introduce an operation called transformation to transform a protocol type to local (i.e., endpoint) types while preserving the desired properties. Section 3 gives a process calculus with de-centralised multiple-failure-handling capability, including the syntax for programs and networks, and the operational semantics for runtime. Section 5 gives a type system for local processes and Sect. 4 gives a type system for networks to maintain communication coherent. Section 6 states the property of safety, including subject reduction and communication safety, and the property of progress. Section 7 discusses related works. Finally Sect. 8 concludes our work.

Detailed formal and auxiliary definitions, lemmas, and proofs are presented in the extended version of this paper [9].

2 Protocol Types, Local Types, and Transformation

This section uses examples to show the design behind protocol types and uses Fig. 1 to illustrate an operation called transformation, which generates local types from protocol types. All formal definitions can be found in the extended version of this paper [9].

The first example, visualised by Fig. 1, shows the properties P1 and P2. Assume that in a network all outgoing traffic passes through a proxy (\(\mathsf {Proxy}\)), monitoring the traffic and logging general information, e.g., consumed bandwidth. The proxy sends this information to a log server (\(\mathsf {Log}\)). If the proxy detects suspicious behaviour in the traffic, it raises a \(\mathsf {SuspiciousB}\) failure and notifies \(\mathsf {Log}\) and a supervision server (\(\mathsf {SupServer}\)) to handle the failure by having \(\mathsf {Log}\) send the traffic logs to \(\mathsf {SupServer}\); if the proxy detects that the quota of \(\mathsf {Client}\) is low, it raises a \(\mathsf {QuotaWarn}\) failure and notifies \(\mathsf {Log}\) and \(\mathsf {Client}\) to handle the failure by having \(\mathsf {Log}\) send quota information to \(\mathsf {Client}\). Then \(\mathsf {Proxy}\) forwards the traffic from \(\mathsf {Client}\) to an external server (\(\mathsf {EServer}\)). We propose the following type to formalise the above scenario:

It specifies that either \(\mathsf {SuspiciousB}\) or \(\mathsf {QuotaWarn}\) may occur at interaction \(\mathsf {\mathsf {Proxy}} \rightarrow \mathsf {\mathsf {Log}}\) (P1). \(\mathsf {Proxy}\) can raise the failure and correspondingly sends failure/non-failure notifications to all relevant parties. \(\mathsf {SuspiciousB}\) affects \(\mathsf {Log}\) and \(\mathsf {SupServer}\) since they both handle that failure; it also affects \(\mathsf {Client}\) because the occurrence of \(\mathsf {SuspiciousB}\) implies that \(\mathsf {QuotaWarn}\) will never occur, and thus \(\mathsf {Client}\) will not yield to the handling activity for \(\mathsf {QuotaWarn}\). When \(\mathsf {QuotaWarn}\) occurs, the situation is similar to \(\mathsf {SuspiciousB}\)’s. Once one of them occurs, \(\mathsf {Proxy}\) sends failure notifications carrying the occurred failure to \(\mathsf {Log}\), \(\mathsf {SupServer}\), and \(\mathsf {Client}\); when no failures occur, \(\mathsf {Proxy}\) sends non-failure notifications carrying both failures to the same participants to inform them not to yield to handling activity. No failures will affect \(\mathsf {Proxy}\) and \(\mathsf {EServer}\) (P2) because \(\mathsf {Proxy}\) and \(\mathsf {EServer}\) are not involved in any failure handling activities: \(\mathsf {Proxy}\) continues sending to \(\mathsf {EServer}\) after it raises a failure and \(\mathsf {EServer}\) still receives a message from \(\mathsf {Proxy}\) as expected.

Fig. 1.
figure 1

Overview of transformation to obtain local types from a protocol type. The left-hand-side figure visualises a protocol type, \(\mathbf {G}_{proxy}\), as a global structure by function \( Struct \). The right-hand-side figures are local types for participants in \(\mathbf {G}_{proxy}\). They are gained by an operation called transformation, which firstly projects the global structure onto participants to get simple local types, then adds the information of need-be-informed participants to the positions with green rings, and finally adds synchronisation points to the positions with blue rings. After transformation, local types for robust failure handling are reached. (Color figure online)

The next example shows P2 and P3. Assume a resource provider (\(\mathsf {RP}\)) informs a coordinator (\(\mathsf {Coord}\)) of which resources it can provide. The library (\(\mathsf {Lib}\)) can get the resource by requesting \(\mathsf {Coord}\) only if \(\mathsf {RP}\) sends a list of resources to \(\mathsf {Coord}\). If failure \(\mathsf {NoRes}\) occurs, meaning absence of resource, \(\mathsf {Coord}\) informs \(\mathsf {Lib}\) of this; otherwise, \(\mathsf {Lib}\) places a request to \(\mathsf {Coord}\) or raises a failure \(\mathsf {Abort}\) (due to one of many possible local problems) and \(\mathsf {Coord}\) invokes \(\mathsf {Record}\) to record this failure:

Although it may seem that interactions \(\mathsf {RP} \rightarrow \mathsf {Coord}\) and \(\mathsf {Lib} \rightarrow \mathsf {Coord}\) can run concurrently, this is not the case because, \(\mathsf {Lib}\) can only get the resource if \(\mathsf {RP}\) gives \(\mathsf {Coord}\) a resource list, which implies failures \(\mathsf {NoRes}\) and \(\mathsf {Abort}\) are dependent (P3). Additionally, we constrain that failure handling should be not be mixed (P2). Synchronisation of \(\mathsf {Lib}\) to yield to the completion of \(\mathsf {RP} \rightarrow \mathsf {Coord}\) is thus needed. This helps programmers tremendously in reasoning about the states that participants are in after failures. Note that, if \(\mathsf {RP} \rightarrow \mathsf {Coord}\) and \(\mathsf {Lib} \rightarrow \mathsf {Coord}\) have no failures specified, then \(\mathsf {RP} \rightarrow \mathsf {Coord}\) and \(\mathsf {Lib} \rightarrow \mathsf {Coord}\) can run concurrently because there is no failure dependency.

2.1 Protocol Types

Fig. 2.
figure 2

Syntax of protocol types.

To handle partial failures in interactions which exhibit the properties P1 to P4, Fig. 2 defines protocol types based on the definition of session types given in the work by Bettini et al. [1]. Protocol types, denoted by \(\mathbf {G}\), are composed of interaction types \( g \) and terminated by \(\mathsf {end}\). We use (\( p , ...\)) to range over identifiers, (\( S , ...\)) to range over basic types like \(\mathsf {bool}\), \(\mathsf {unit}\), \(\mathsf {int}\), and \(\mathsf {str}\), and (F, ...) to range over sets of failures. We highlight the key concepts:

  1. (A)

    \(\mathsf { p _1} \rightarrow \mathsf { p _2}: \widetilde{ S } \vee F\) is a failure-raising interaction tagged with F, or F-raising interaction for short. When \(\widetilde{ S } \not = \emptyset \) and \(F \not = \emptyset \), either \( p _1\) sends a content of type \(\widetilde{ S }\) or raises one failure in F to \( p _2\). When one of \(\widetilde{ S }\) and F is empty, \( p _1\) only makes an output based on the non-empty one. We do not allow both F and \(\widetilde{ S }\) to be empty.

  2. (B)

    defines default interaction in \( g \), which is an interaction type, and a handling environment, \( h = \{ f _i : g _i\}_{i \in I}\), which maps failures to handling activities defined in global types. Our design allows \( h \) to deal with different failures, with exactly one handler taking over once failures occur.

  3. (C)

    In (A), if F is empty, we do not require the interaction to be enclosed in a try-handle term; otherwise, the interaction must appear within a try-handle term.

In the remaining syntax, we use \(\varepsilon \) for idle, and \( g _1; g _2\) for sequential composition. A type variable is denoted by \( t \), and a recursive type under an equi-recursive approach [18] is denoted by \(\mu t . g \), assuming every \( t \) appearing in \( g \) is guarded by prefixes.

For brevity, our protocol types presentation omits parallel composition, thus we do not allow session interleaving or multi-threading at a local participant. Note that we can still implement two individual interactions running in parallel by implementing two disjoint groups of interacting participants who execute two respective protocols. We omit branching with multiple options of ongoing interactions, since the term is able to encode the branching in multiparty session types [1, 12] by using failures \(l_1, ..., l_n\) as labels for branches. We leave unaffected participants to continue default actions regardless of the occurrence of failures; we do not inform them of the failure. Moreover, we do not have well-formedness constraints on the shape of interactions in branches (i.e. failure handlers in our syntax) as most multiparty session types and choreographic programming related works require [1, 35, 8, 12, 13, 16, 17, 21].

2.2 Local Types and Transformation

In order to achieve our desired properties we use an operation, called transformation, to synthesize a guidance to locally guide which participants need to coordinate with others once a failure occurs, or inversely to assert that none has occurred, before proceeding with the next action. The operation transformation includes the following steps:

  1. 1.

    Generating a global structure from a given protocol type and alpha renaming it.

  2. 2.

    Projecting the above structure to every participants to obtain simple local types, which are not yet sufficient for robust failure handling. The projection algorithm is similar to the mechanism in multiparty session types [1, 12].

  3. 3.

    Adding the information of need-(to)-be-informed participants, who are those affected by or involved in handling failures, and synchronisation points to local types.

After these 3 steps, we obtain local types which are sufficient for our type system to ensure robust failure handling.

Figure 1 uses the example of \(\mathbf {G}_{proxy}\) to demonstrate the operation of transforming a protocol type to each participants’ local types, which are defined below:

Definition 1

(Local Types \(\mathbf {T}\) ).

$$\begin{array}{lrcl} (\text {Local Types}) &{} \mathbf {T}&{}\ {:}{:} = &{} \mathbf {n} \ | \ \textit{try} \{ \mathbf {T}, \ \mathbf {H}\} \ | \ \mathbf {n} \dashrightarrow \mathbf {T}\ | \ t \ | \ \mu t . \mathbf {T}\ \ \ (\text {Handlers}) \ \mathbf {H}\ {:}{:} = \emptyset \ | \ \mathbf {H}, \ f : \mathbf {T}\\ (\text {Action}) &{} \mathbf {n} &{} \ {:}{:} = &{} \varepsilon \ | \ \mathsf {end}\ | \ \textit{sn} \langle p ! \ \widetilde{ S } \vee F , \widetilde{ p }, \widetilde{ p } \rangle \ | \ \textit{rn} \langle p ? \ \widetilde{ S } \vee F \rangle \ | \ \textit{yield} \langle F \rangle \end{array} $$

A local type is either an action (\( \mathbf {n} \)), a try-handle type (\(\textit{try}\{\mathbf {T}, \mathbf {H}\}\)), a sequencing type (\( \mathbf {n} \dashrightarrow \mathbf {T}\)), a local type variable (\( t \)), or a recursive type (\(\mu t . \mathbf {T}\)). We use \(\varepsilon \) to type an idle action, while \(\mathsf {end}\) types termination. A sending action, typed by \(\textit{sn} \langle p ! \ \widetilde{ S } \vee F , \widetilde{ p }', \widetilde{ p }'' \rangle \), specifies a sending of normal content of type \(\widetilde{ S }\) to \( p \) or raising a failure in F. When a failure is raised, the sender also sends failure notifications to participants \(\widetilde{ p }'\). When normal content is sent, the sender also sends non-failure notifications to participants \(\widetilde{ p }''\). A receiving action, typed by \(\textit{rn} \langle p ? \ \widetilde{ S } \vee F \rangle \), specifies the reception of content of type \(\widetilde{ S }\) from \( p \), who may raise a failure in F instead. An action yielding to the arrival of a non-failure notification informing that no failures in F occurred, is typed by \(\textit{yield} \langle F \rangle \). A handling environment in local types, denoted by \(\mathbf {H}= \{ f _i: \mathbf {T}_i \}_{i \in I}\), maps failures to corresponding local handling actions defined in local types.

In Fig. 1, we firstly create a global structure for \(\mathbf {G}_{proxy}\) by \( Struct (\mathbf {G}_{proxy})\). Global structures, denoted by \(\mathtt {T}\), consist of either a single interaction (\(\mathtt {N}\)), a try-handle structure (\(\textit{try}\{\mathtt {T}, \mathtt {H}\}\)) where \(\mathtt {H}\) has a similar shape to handling environments in local types, a sequence (\(\mathtt {N}\dashrightarrow \mathtt {T}\)), or a recursive structure (\(\mu t . \mathtt {T}\)). We define \(\mathtt {N}\) as either \(\mathsf { p } \rightarrow \mathsf { p }: \widetilde{ S } \vee F\) or \(\varepsilon \) or \(\mathsf {end}\). By defining \( Struct (\text {single interaction})\) = \(\mathtt {N}\), a try-handle structure is obtained by , while a recursive structure is obtained by \( Struct (\mu t . \mathbf {G}) = \mu t . Struct (\mathbf {G})\).

The simple local types are gained by projecting \(\mathtt {T}\), created by \( Struct (\mathbf {G})\), on each participants. The projection rules are defined below:

Definition 2

(Projecting \(\mathtt {T}\) onto Endpoint \( p \) ). Assume \(\mathtt {T}= Struct (\mathbf {G})\) and \(\mathtt {T}\) is alpha-renamed so that all failures in \(\mathtt {T}\) are unique. Define \(\upharpoonright \! \!(\mathtt {T}, p )\) as generating a local type on \( p \):

$$ \begin{array}{l} (1)\ \upharpoonright \! \!(\mathsf {end}, p ) = \mathsf {end}\quad (2)\ \upharpoonright \! \!(\mathsf { p _1} \rightarrow \mathsf { p _2}: \widetilde{ S } \vee F, p ) = {\left\{ \begin{array}{ll} \textit{sn} \langle p _2 ! \ \widetilde{ S } \vee F , \_ \ , \_ \rangle &{} \text {if}\ p = p _1 \not = p _2 \\ \textit{rn} \langle p _1 ? \ \widetilde{ S } \vee F \rangle &{} \text {if}\ p = p _2 \not = p _1 \\ \varepsilon _F &{} \text {otherwise} \end{array}\right. } \\ (3)\ \upharpoonright \! \!(\textit{try}\{ \mathtt {T}, \ f _1: \mathtt {T}_1, ..., \ f _n: \mathtt {T}_n \}, p ) = \textit{try} \{ \upharpoonright \! \!(\mathtt {T}, p ), \ f _1: \upharpoonright \! \!(\mathtt {T}_1, p ), ..., \ f _n: \upharpoonright \! \!(\mathtt {T}_n, p ) \} \\ (4)\ \upharpoonright \! \!(\mathtt {N}\dashrightarrow \mathtt {T}, p ) = \upharpoonright \! \!(\mathtt {N}, p ) \dashrightarrow \upharpoonright \! \!(\mathtt {T}, p ) \quad (5)\ \upharpoonright \! \!( t , p ) = t \qquad (6)\ \upharpoonright \! \!(\mu t . \mathtt {T}, p ) = \mu t . \upharpoonright \! \!(\mathtt {T}, p ) \end{array} $$

Others are undefined.

Rule (2) is for dually interacting participants. It introduces \(\varepsilon _F\), which has equivalent meaning to \(\varepsilon \) (i.e. idle action) but is only used in transformation for adding synchronisation points. As we project an interaction \(\mathsf { p _1} \rightarrow \mathsf { p _2}: \widetilde{ S } \vee F\) with \( p _1 \not = p _2\) onto \( p _1\) (resp. \( p _2\)), we get an action \(\textit{sn} \langle p _2 ! \ \widetilde{ S } \vee F , , \ \_ \ , \_ \rangle \) (resp. \(\textit{rn} \langle p _1 ? \ \widetilde{ S } \vee F \rangle \)). Note that the two slots \(\_\) in the sending action are preserved for adding the need-be-informed participants as a failure occurs (the first slot), and those as no failures occur (the second slot). As we project the interaction to some participant who is not in the interaction, we get \(\varepsilon _{F}\) (idle action). The subscript F indicates that if \( p \) is affected by some failures in F a synchronisation point will be added at this position. Rule (3) simply projects every sub-structure in the try block and handlers onto the participant. Rule (4) sequences two local types projected from a global structure. Other rules are straightforward.

After projection, we add need-be-informed participants into the failure-raiser’s sending actions (e.g., the one marked in green ring in Fig. 1). We use C(\(\mathtt {T}\), F) to get the set of need-be-informed participants regarding a unique F in a global structure \(\mathtt {T}\). It is the least fixed point of \( c (\mathtt {T}, \mathtt {T}, F, \mathbf {\textit{r}})\), which recursively collects the need-be-informed participants regarding F based on \(\mathtt {T}\). Since for every protocol the number of participants is finite, function \( c \) will converge to a fixed set of participants. The key calculation is done by the rule below

$$\begin{aligned} \begin{array}{l} c (\mathtt {T}, \mathtt {N}, F, \mathbf {\textit{r}})= {\left\{ \begin{array}{ll} \mathbf {\textit{r}}\cup {\textit{pid}}{(\mathtt {N})} \cup C(\mathtt {T},\textit{FSet}(\mathtt {N})) &{} \begin{array}{l} \text {if}\ (F\ \text {appears before}\ \mathtt {N}\ \text {in}\ \mathtt {T}) \wedge \\ (({\textit{pid}}{(\mathtt {N})} \cap \mathbf {\textit{r}}\ne \emptyset ) \vee (\textit{FSet}(\mathtt {N}) \ne \emptyset )) \end{array} \\ \mathbf {\textit{r}}&{} \begin{array}{l} \text {otherwise} \end{array} \end{array}\right. } \end{array} \end{aligned}$$

where we require the initial \(\mathbf {\textit{r}}\) to be the set containing the receiver of F-raising interaction and the participants involved in handling F in \(\mathtt {T}\); later \(\mathbf {\textit{r}}\) acts as an accumulator collecting the participants causally related to the initial \(\mathbf {\textit{r}}\). \(\mathtt {N}\) is an interaction, \({\textit{pid}}(\mathtt {N})\) is the set of participants in \(\mathtt {N}\), and \(\textit{FSet}(\mathtt {N})\) returns the failures tagged on \(\mathtt {N}\). This rule says that if the interaction we are checking appears after the F-raising interaction, and some of its interacting participants are related to \(\mathbf {\textit{r}}\) or the interaction itself can raise another failure set (e.g. the interaction \(\mathsf {Lib} \rightarrow \mathsf {Coord}: \mathsf {str} \vee \mathsf {Abort}\) in \(\mathsf {RP} \rightarrow \mathsf {Coord}: \mathsf {str} \vee \mathsf {NoRes}; \mathsf {Lib} \rightarrow \mathsf {Coord}: \mathsf {str} \vee \mathsf {Abort}\) is related to \(F = \{\mathsf {NoRes}\}\)), then we collect its participants (i.e. \({\textit{pid}}(\mathtt {N})\)) and the need-be-informed participants with respect to the failures that can be raised by \(\mathtt {N}\).

After adding those participants, we add synchronisation points \(\textit{yield} \langle F \rangle \) to the positions where a participant yields to the arrival of non-failure notifications (e.g. those marked in blue rings in Fig. 1). The key rule is:

$$ \begin{array}{l} \textit{Sync}(\mathtt {T}, \mathbf {n} , p ) = {\left\{ \begin{array}{ll} \textit{yield} \langle \textit{FSet}(\mathtt {N}) \rangle \dashrightarrow \mathbf {n} &{} \begin{array}{l} \text {if}\ ( \mathbf {n} = \upharpoonright \! \!(\mathtt {N}, p )) \wedge ( p \in C(\mathtt {T}, \textit{FSet}(\mathtt {N})) \wedge p \ne \small {snd} (\mathtt {N}) \end{array} \\ \mathbf {n} \dashrightarrow \textit{yield} \langle \textit{FSet}(\mathtt {N}) \rangle &{} \begin{array}{l} \text {if}\ ( \mathbf {n} = \upharpoonright \! \!(\mathtt {N}, p )) \wedge ( p \in C(\mathtt {T}, \textit{FSet}(\mathtt {N})) \wedge p = \small {snd} (\mathtt {N}) \end{array} \\ \mathbf {n} &{} \text {otherwise} \end{array}\right. } \end{array} $$

where \( \small {snd} (\mathtt {N})\) is the sender for interaction \(\mathtt {N}\). This rule says the followings: If \( p \) needs to be informed of \(F= \textit{FSet}(\mathtt {N})\) (i.e. \( p \in C (\mathtt {T}, \textit{FSet}(\mathtt {N})\)) then it must add a synchronisation point. If \( p \)’s action (i.e., \( \mathbf {n} \)) regarding F is \(\varepsilon _F\) (e.g. in \(\mathbf {G}_{coord}\) the participant \(\mathsf {Lib}\) has \(\varepsilon _{\mathsf {NoRes}}\) by Definition 2), \(\textit{yield} \langle F \rangle \) is positioned ahead of \( p \)’s action (e.g. a sending action of \(\mathsf {Lib}\) to \(\mathsf {Coord}\) specified in \(\mathbf {G}_{coord}\)), because \( p \) needs to wait for the notification regarding F before taking any action. If \( p \) is the receiver, we have \(\textit{yield} \langle F \rangle \) positioned before the receiving action because \(\textit{yield} \langle F \rangle \) is the point deciding whether the process will handle a failure regarding F or proceed. If \( p \) is the sender, we should have \(\textit{yield} \langle F \rangle \) positioned after the sending action, because as \( p \) is involved for some failure handling activity regarding F, it needs to first send out failure notifications then go back to execute the handling activity; otherwise the process will get stuck.

In Fig. 1, green ring appears at \(\mathsf {Proxy}\)’s second action because, if a failure occurs, \(\mathsf {Proxy}\) has to inform \(\mathsf {Log}\), \(\mathsf {SupServer}\), and \(\mathsf {Client}\) about that failure. Blue rings appear at \(\mathsf {Client}\), \(\mathsf {Log}\), and \(\mathsf {SupServer}\)’s try blocks because they are involved in handling activity, and they can terminate only after getting the notifications that no failures occurred.

Overall, we define the operation of transformation as \(\textit{Transform}(\mathbf {G}, p )\), which transforms \(\mathbf {G}\) to a local type for \( p \).

3 Processes for Decentralised Multiple-Failure-Handling

We abstract distributed systems as a finite set of processes communicating by outputting (resp. inputting) messages into (resp. from) the shared global queue asynchronously and concurrently. The semantics of the calculus is in the same style as that of the \(\pi \)-calculus and does not involve any centralised authority for specifying how messages are exchanged (P4). The shared queue is only conceptually global for convenience, and could be split into individual participant queues.

Fig. 3.
figure 3

Syntax for processes and networks.

Syntax. In Fig. 3, we define x as value variables, y as channel variables, a as shared names (e.g., names for services or protocol managers), and \( s \) as session names (i.e., session IDs), \( p \) as participant identifiers, and X as process variables. We use u for names and \( c \) for channels, which are either variables or a combination of \( s \) and \( p \). The definition for expressions e is standard. We define the syntax of processes (P, ...) and that of networks (N, ...), which represent interactions of processes at runtime. Process \(\mathbf {0}\) is inactive. Process \( c ! ( p , \langle \tilde{e} \rangle ^{F})\) denotes an output, which may alternatively raise a failure \( f \) in F, sends a message with content \(\tilde{e}\) to \( p \) via channel \( c \); while \( c ?( p , (\tilde{x})^{F}).P\) denotes an input using \( c \) to receive a content from \( p \), which may alternatively raise a failure from F. Every \(\tilde{x}\) appearing in P is bound by the input prefix. When \(F = \emptyset \), we omit F since the process will not raise/receive a failure. Process \( c \ \mathsf {raise}( f )\ \mathsf {to}\ p \) raises \( f \) to \( p \) via channel \( c \). Process \({ c \otimes F}\) is guarded by \(\otimes F\), a synchronisation point, yielding to non-failure notification for F. A try-handle process \(\mathsf {try}\{P\}\mathsf {h}\{ H \}\) executes P until a handler \( f \in \textit{dom}( H )\) is triggered, then the triggered handler takes over. A handling environment, denoted by \( H \), maps failure names to handling processes. We write \(P_1 ; P_2\) to represent a sequential composition where \(P_2\) follows \(P_1\). Process \( \textsf {def}~ D ~\textsf {in}~P \) defines a recursion, where declaration \( D = (X(\tilde{x}\ c ) = P)\) is a recursive call. The term \(\textsf {if}\ e \ \textsf {then}\ P \ \textsf {else}\ P\) is standard. We define evaluation context \(\mathscr {E}\) over processes. It is either a hole, a context in a try-handle term, or a context sequencing next processes.

A network N is composed by linking points, denoted by \(a[ p ](y).P\), and runtime processes, denoted by \([P]_{\mathbf {T}}\) with global transports (i.e., \(s: q\)) for proceeding communications in a private session (i.e., \((\textsf {new} \ s)N\)). Our framework asks a process to join one session at a time. A linking point \(a[ p ](y).P\) is guarded by \(a[ p ](y)\) for session initiation, where shared name a associates a service to a protocol type. \([P]_{\mathbf {T}}\) represents a runtime process which is guided by \(\mathbf {T}\) for notifying need-be-informed participants.

A session queue, denoted by \( s : q\), is a queue for messages floating in session \( s \). Message \(\langle p _1, p _2, \langle \tilde{v} \rangle \rangle ^F\) carries content \(\tilde{v}\), sent from \( p _1\) to \( p _2\) prone to failure \( f \in F\). Message \(\langle p , f \rangle \) (resp. \(\langle \! \langle p , F \rangle \! \rangle \)) carries a failure name \( f \) to indicate that failure \( f \) occurred (resp. a set of failures F to indicate that no failures in F occurred) to \( p \). Conventionally we say \(\langle \! \langle p , \emptyset \rangle \! \rangle = \varepsilon \). When session \( s \) is initiated for a network, a private (i.e., hidden) session is created, in which activities cannot be witnessed from the outside. We use structural congruence rules, defined by \(\equiv \), which are standard according to the works of multiparty session types [1, 12].

Fig. 4.
figure 4

Reduction rules for networks (i.e., runtime processes).

Operational Semantics. Figure 4 gives the operational semantics for networks (i.e., runtime processes) through the reduction relation \(N \rightarrow N\). We have added boxes to those rules which differ from standard session type definitions. In rule , a session is generated with a fresh name \( s \) through shared name a obeying protocol type \(\mathbf {G}\). This indicates that all processes in the new session \( s \) will obey to the behaviours defined in \(\mathbf {G}\). At the same time, a global queue \( s : \varepsilon \) is generated, and the local process associated with \( p \) replaces \(y_{ p }\) with \( s [ p ]\); a local type \(\mathbf {T}_{ p }\) is generated by \(\textit{Transform}(\mathbf {G}, p )\) to guide the local process associated with \( p \) for propagating notifications. Note that, as we enclose a local process with \(\mathbf {T}\), together they become an element of a network. \(\mathbf {T}\) is merely a local type and the reduction of the network does not change \(\mathbf {T}\).

Rule states that, in \( s \), a process associated with \( p _1\) is able to receive a value \(\tilde{v}\) from participant associated with \( p _2\) and message \(\langle p _2, p _1, \langle \tilde{v} \rangle \rangle ^{F}\) is on the top of \(q\). Then \(\tilde{v}\) will replace the free occurrences of \(\tilde{x}\) in P. The shape of \( s [ p _1] ?( p _2, (\tilde{x})^{F})\) indicates that its dual action may send it a failure from F; in other words, if \(F \not = \emptyset \), a process should be structured by a try-handle term for possible failure handling. Rule is dual to . We define \( node (\mathbf {T}, F)\) as a function returning an action tagged with F in a local type \(\mathbf {T}\). Rule states that, if there is an action in \(\mathbf {T}\) matching \( s [ p _1] ! ( p _2, \langle \tilde{e} \rangle ^{F})\) and \(\tilde{e} \Downarrow \tilde{v}\), then the process associated with \( p _1\) in \( s \) is allowed to send a message with content \(\tilde{v}\) to \( p _2\) and non-failure notifications \(\langle \! \langle p '_1, F \rangle \! \rangle ... \langle \! \langle p '_n, F \rangle \! \rangle \). Note that, non failure notifications are automatically generated at runtime. If a process follows the guidance of the attached \(\mathbf {T}\), since \(\mathbf {T}\) is alpha-renamed, every failure raised by the process is unique. Similarly, states for a process associated with \( p _1\) in \( s \), to raise \( f \) to \( p _2\) and other affected ones, \( p '_1, .., p '_n\). Very importantly, in and , \(q\) has no failure notification to trigger \( H \) because, as a failure-raising interaction is ready to fire (i.e. its sender is about to send), it implies that, globally, either this interaction is the first failure-raising interaction in s (thus no failure yet occurs in s), or its previous interactions did not raise a failure in \(\textit{dom}( H \)) (thus by P2, this interaction is able to raise a failure in \(\textit{dom}( H )\), and no failures in \(\textit{dom}( H )\) yet occurs in s). For convenience, we use \(\textit{act}\) to extract the channel that a process or the set of handlers is acting on, i.e. \(\textit{act}(P) = s [ p ]\) says P is acting on channel \( s [ p ]\), and \(\textit{act}( H ) = \textit{act}( H ( f ))\) for every \( f \in \textit{dom}( H )\).

In , if the \( H \) in a try-handle process associated with \( p \) in \( s \) will not be triggered by the top message in \( s :q\), then the process in the try block will take action according to the process’s interaction with the queue. In , as \( f \) arrives to a try-handle process associated with \( p \) in \( s \) whose try block is yielding to non-failure notification for F and \( H \) is able to handle \( f \), the handling process \( H ( f )\) takes over. Due to asynchrony, other processes’ handlers for \( f \) may become active before this process. Thus some messages in \(q\) may be sent from other processes’ handlers of \( f \) for P. Note that none of the messages in \(q\) are for \(\mathscr {E}\) because, all default sending actions in other processes are also guarded by synchronisation points.

Synchronisation either proceeds with , where F is sufficient to remove \(F'\), or with , where F is included in \(F'\) carried in the notification. For the former, some processes in the failure-handling activity only take care of partial failures in F, i.e. \(F'\), when they receive F, to ensure that no failures in F occurred. For the latter, further synchronisation is required by \(F'' = (F' \setminus F) \not = \emptyset \). In , since we have added sufficient synchronisation points to guard processes who must yield to non-failure notifications, when a network reaches \([\mathsf {try}\{v\}\mathsf {h}\{ H \}]_{\mathbf {T}}\), it is safe to be inactive because no more failure notifications will occur. In other rules, the operations enclosed in \(\mathbf {T}\) are standard according to the works of multiparty session types [1, 12].

4 Typing Local Processes

This section introduces rules to type user-defined processes. Based on the multiparty session types [1, 12], type environments and typing rules for processes are given in Fig. 5. A shared environment \(\varGamma \) is a finite mapping from variables to sorts and from process variables to local types; a session environment \(\varDelta \) is a finite mapping from session channels to local types. \(\varGamma , x: S \) means that x does not occur in \(\varGamma \), so does \(\varGamma , X: (\tilde{x}\ \mathbf {T})\) and \(\varDelta , c : \mathbf {T}\). We assume that expressions are typed by sorts. \(\varGamma \vdash e: S \) is the typing judgment for expressions, whose typing rules are standard. The typing judgment \(\varGamma \vdash P \rhd \varDelta \) for local processes reads as “\(\varGamma \) proves that P complies with abstract specification \(\varDelta \)”.

Fig. 5.
figure 5

Typing rules for processes.

Rule states that an idle process is typed by end-only \(\varDelta \), which means \(\forall c \in \textit{dom}(\varDelta )\), \(\varDelta ( c ) = \mathsf {end}\). Rule  types sequential composition by sequencing \(P_2\)’s action in \(\varDelta _2\) after \(P_1\)’s action in \(\varDelta _1\) as long as \(P_1\) and \(P_2\) are acting on the same channel. We define \(\varDelta _1 \circ \varDelta _2\) as the one defined in the multiparty session types extended with failure-handling ability [3]. Rule  specifies that \( s [ p _1] ?( p _2, (\widetilde{x})^{F}). P\) is valid as it corresponds to local type \(\textit{rn} \langle p _2 ? \ \widetilde{ S } \vee F \rangle \dashrightarrow \mathbf {T}\) as long as P, associated with \( p _1\) in session \( s \), is well-typed by \(\mathbf {T}\) under an environment which knows \(\tilde{x}: \widetilde{ S }\). In and , since the slots are not related to typing, their contents are omitted. Rules and share the same action for typing because \(\textit{sn} \langle p _2 ! \ \widetilde{ S } \vee F , \ \_ \ , \_ \rangle \) specifies two possible actions: a sending action \( s [ p _1] ! ({ p _2}, \langle \tilde{e} \rangle ^{F})\) in which \(\tilde{e}\) must have type \(\widetilde{ S }\), and action \( s [ p _1] \ \mathsf {raise}( f )\ \mathsf {to}\ p _2\) in which \( f \) must be in F. Then the continuing process P is typed by the following \(\mathbf {T}\).

For typing handling activities, rule types a try-handle term if its default action (i.e., P) with its following process is well-typed, and those in handlers with their follow-up processes are all well-typed. We require the following process \(P'\) should not contain any failure appearing in \( H \). Since P and any processes in \( H \) are acting on the same channel and \(\textit{act}( H )\) represents the channel that every processes in \( H \) is acting on, we use \(\textit{act}( H )\) to get the channel in order to type \(\mathsf {try}\{\mathbf {0}\}\mathsf {h}\{ H \}\). Recall Fig. 1 and projection rules defined in Definition 2, for local types the sequencing action is linked at every leaf in a try-handle term; in other words, the type of \(P'\) is attached to the type of P and also to every handler in \(\mathbf {H}\). Therefore, as we type a try-handle term, we also consider its following process.

Rule specifies that process \( c \otimes F ; P\) is well-typed if the local type for \( c \) has synchronisation point \(\textit{yield} \langle F \rangle \) and P is well-typed w.r.t. \(\mathbf {T}\). The algorithm for adding synchronisation points (introduced in Sect. 2) automatically places the synchronisation points in local types and ensures that once a failure is raised, other possible failure-raising actions must not fire. Since the operational semantics defined in Fig. 4 only deliver notifications regarding F to need-be-informed participants and only one failure in F can be raised, our type system ensures only one failure in a try-handle term is handled and all participants affected by F have consistent failure handling activities.

Rule types a local process variable, and rule types a recursion with \(\varDelta \), where the recursive call \(X(\tilde{x}, c ) = P\) is typed by \( c : \mathbf {T}\), indicating that P follows behaviour \(\mathbf {T}\) at \( c \). Others are standard according to the works of multiparty session types [1, 12].

5 Typing the Network

Ultimately our framework needs to ensure that the network is coherent. Coherence, according to the works of multiparty session types [1, 12], describes an environment where all interactions are complying with the guidance of some \(\mathbf {G}\), such that the behaviour of every participant in \(\varDelta \), say \(s[ p ]\), obeys to \(\textit{Transform}(\mathbf {G}, p )\), which denotes a local type. To reason about coherence of default and handling interactions in a session, we statically type check the interactions by modeling the outputs and inputs among local processes and the shared global queue.

Fig. 6.
figure 6

Typing rules for networks.

The typing rules for networks are defined in Fig. 6 by extending the session environments such as to map queues to queue types. A queue type, denoted by \(\mathbf {q}\), is composed by message types, which are typed by their contents or shapes: Rule types an empty queue, while rule types a message carrying a value under the assumption that \(\varGamma \vdash \widetilde{v}: \widetilde{ S }\) and the following queue is well-typed; rule types \(\langle p , f \rangle \) by message type \(\langle p , f \rangle \), while rule types \(\langle \! \langle p , F \rangle \! \rangle \) by message type \(\langle \! \langle p , F \rangle \! \rangle \). Rule  types a linking point \(a[ p ](y). P\) by assuming that a provides a behaviour pattern defined in \(\mathbf {G}\). For guiding P associated with \( p \), uses local type \(\mathbf {T}\) generated by \(\textit{Transform}(\mathbf {G}, p )\) to type P acting on channel y. Rule states that \([P]_{\mathbf {T}}\) is well-typed by \(\varDelta \) if P is well-typed by \(\varDelta \), and \(\mathbf {T}\), gained by some \(\mathbf {G}\), contains the type which types P acting on channel \(s[ p ]\). Note that, by rule (see Fig. 4), \([P]_{\mathbf {T}}\) is created after linking and \(\mathbf {T}\) is not changed after any reduction; thus \(\mathbf {G}\) in rule comes from rule . Rule  ensures the parallel composition of two networks if each of them is well-typed and they do not share a common channel (i.e., \(\textit{dom}(\varDelta _1) \cap \textit{dom}(\varDelta _2) = \emptyset \)). The composed network exhibits the union of the session environments. Rule types hiding (i.e., \((\textsf {new} \ s )N\)) when the session environment of networks under \( s \), denoted by

$$\varDelta \langle s \rangle \mathop {=}\limits ^{ \text{ def }}\{ s '[ p ] : \mathbf {T}\ \mid \ s '[ p ] \in \textit{dom}(\varDelta ), s ' = s \} \cup \{ s : \mathbf {q}\},$$

is coherent:

Definition 3

(Coherence). We say \(\varDelta \langle s \rangle \) is coherent if there exists \(\mathbf {G}\) such that \({\textit{pid}}(\mathbf {G}) = \{ p \mid s'[ p ] \in \textit{dom}(\varDelta ), s' = s \}\) and either (1) \(\forall s[ p ] \in \textit{dom}(\varDelta \langle s \rangle )\) we have \(\textit{Transform}(\mathbf {G}, p )\) is equal to the type of \(\varDelta (s[ p ])\) after \(s[ p ]\) absorbs all messages heading to it; or (2) there exists \(\varDelta ' \subset \varDelta \) such that \(\forall s[ p ] \in \textit{dom}(\varDelta \langle s \rangle \setminus \varDelta ' \langle s \rangle )\) we have that \(\textit{Transform}(\mathbf {G}, p )\) is equal to the type of \(\varDelta (s[ p ])\) after \(s[ p ]\) absorbs all messages heading to it, and \(\varDelta ' \langle s \rangle \) is coherent.

Note that due to asynchrony, after a sender takes action, the type of the sender and its receiver may be temporarily incoherent if the sender has moved forward and the output is still in the global queue. Therefore, coherence holds only after a receiver has absorbed all messages heading to it.

As we aim to handle partial failure(s), either (1) no failures occurred such that there exists \(\mathbf {G}\) defining interactions for every \(s[ p ]\) in \(\varDelta \langle s \rangle \), or (2) a failure occurs such that the need-be-informed participants, who are in \(\varDelta '\langle s \rangle \), are handling that failure in a coherent way, and other unaffected ones, who are in \(\varDelta \langle s \rangle \setminus \varDelta '\langle s \rangle \), still follow the behaviour defined in \(\mathbf {G}\).

6 Properties

We prove that our typing discipline ensures the properties of safety and progress. The property of safety is defined by subject reduction and communication safety. Firstly we define \(\varDelta \rightharpoonup \varDelta '\) as reductions of session environments. Intuitively, the reductions correspond closely to the operational semantics defined in Fig. 4. Subject reduction states that a well-typed network (resp. coherent session environment) is always well-typed (resp. coherent) after reduction:

Theorem 1

(Subject Congruence and Reduction).

  1. 1.

    (subject congruence) \(\varGamma \vdash N \rhd \varDelta \) and \(N \equiv N'\) imply that \(\varGamma \vdash N' \rhd \varDelta \).

  2. 2.

    (subject reduction) \(\varGamma \vdash N \rhd \varDelta \) with \(\varDelta \) coherent and \(N \rightarrow N'\) imply that \(\varGamma \vdash N' \rhd \varDelta '\) such that \(\varDelta \rightharpoonup \varDelta '\) or \(\varDelta \equiv \varDelta '\) and \(\varDelta '\) is coherent.

According to the definition of communication safety in the works of multiparty session types [1, 12], it is a corollary of Theorem 1. Note that, since our calculus is based on the work of Bettini et al. [1], global linearity-check is not needed. For convenience, we define here contexts on networks:

$$ \begin{array}{c} \mathscr {C}\ {:}{:} = [ \ ] \ | \ \mathscr {C}\parallel N \ | \ N \parallel \mathscr {C}\ | \ (\textsf {new} \ s)\mathscr {C}\end{array} $$

Corollary 1

(Communication Safety). Suppose \(\varGamma \vdash N \rhd \varDelta \) and \(\varDelta \) is coherent. Let \(N_1= \mathscr {C}_1[s: q\cdot \langle p _2, p _1, \langle \tilde{v} \rangle \rangle ^{F} \cdot q']\) and \(N_2 = \mathscr {C}_2[s: q\cdot \langle p _1, \ f \rangle \cdot q']\) and \(N_3 = \mathscr {C}_3[s: q\cdot \langle \! \langle p _1, F \rangle \! \rangle \cdot q']\) and no messages in \(q\) is sending to \( p _1\).

  1. 1.

    If \(N= \mathscr {C}[\mathscr {E}[s[ p _1] ?( p _2, (\tilde{x})^{F}) . P]_{\mathbf {T}}]\), then \(N \equiv N_1\) or \(N \rightarrow ^{*} N_1\).

  2. 2.

    If \(N = \mathscr {C}[\mathscr {E}[\mathsf {try}\{s[ p _1] \otimes F'; P\}\mathsf {h}\{ H \}]_{\mathbf {T}}]\) and \(F' \subseteq F \not = \emptyset \), then either (a) \(N \equiv N_2\) or \(N \rightarrow ^{*} N_2\) or (b) \(N \equiv N_3\) or \(N \rightarrow ^{*} N_3\).

  3. 3.

    If \(N = \mathscr {C}[\mathscr {E}[\mathsf {try}\{v\}\mathsf {h}\{ H \}]_{\mathbf {T}}]\) and \( f \in \textit{dom}( H )\) and process \( H ( f )\) is acting on \(s[ p _1]\), then \(N \not \equiv N_2\) and \(N \not \rightarrow ^{*} N_2\).

This corollary states that our system is free from deadlock and starvation: if there is a receiving action in N, then N is either structurally congruent to the network which contains the message for input, or N will reduce to such a network. We state that \([\mathsf {try}\{v\}\mathsf {h}\{ H \}]_{\mathbf {T}}\) is safe to become idle by proving that no \( f \in \textit{dom}( H )\) is heading to it (Case 3).

Corollary 1 provides the means to prove that our system never gets stuck and is free from orphan messages (property of progress):

Theorem 2

(Progress). \(\varGamma \vdash N \rhd \varDelta \) with \(\varDelta \) coherent and \(N \rightarrow N'\) imply that \(N'\) is communication safe or \(N' = \mathbf {0}\parallel s :\varepsilon \).

This theorem states that every interaction in a well-typed network is a safe interaction and reducible until the whole network terminates without any message left.

7 Related Works

Failure handling has been addressed in several process calculi and communication-centered programming languages. For instance, the conversation calculus [20] models exception handling in abstract service-based systems with message-passing based communication. It studies expressiveness and behaviour theory of bisimilarity rather than theory of types. Colombo and Pace [7] investigate several different process calculi for failure-recovery within long-running transactions. They give insight regarding the application of these failure-recovery formalisms in practice via comparing the design choices and formal notions of correctness properties. Both works do not provide a type system to statically type check local implementations.

Previous works for failure handling with type systems [35, 13] extend the theory of session types to specify error handling under asynchronous interactions. These works do not capture handling of partial failures and the scenarios which exhibit the properties P1 to P4. They may be able to encode multiple possible failures at the interaction level (P1), for example, by (i) explicitly using a labeled branching inside the failure handler, or (ii) piggybacking a label with the failure notification (“multiplexing”). However, (i) implies double communication and synchronisation (once for the failure notification, then for the branch) and (ii) implies that either the well-formedness constraints on the shape of interactions in handlers are needed or any participants related to any failure handling activity should be informed as a failure occurs in order to know how to proceed. Our approach is different since we do not have such constrains and we do not inform the unaffected participants. Moreover, while the termination of try-handle terms in those works demands an agreement of all participants, ours allows local try-handle terms to terminate since we have locally added synchronisation points by transformation (see Sect. 2.2). Our approach can encode the global types for exception-handling proposed in the work by Capecchi et al. [3], which is the closest related work (and other related ones have similar try-handle syntax). The formal encoding can be found in the long version of this paper [9].

Collet and Van Roy [6] informally present a distributed programming model of Oz for asynchronous failure handling and focus on programming applications in a distributed manner. Jakšić and Padovani [14] study a type theory for error handling for copy-less messaging and memory sharing to prevent memory leaks/faults through typing of exchange heaps. Lanese et al. [11, 15] formalise a feature which can dynamically install fault and compensation handlers at execution time in an orchestration programming style. They investigate the interplay between fault handling and the request-response pattern. In contrast, our framework statically defines the handlers for non-trivial failure handling, which can only be done with a global perspective.

8 Concluding Remarks

Protocol types enable the design of protocols in an intuitive manner, and statically type check multiple failure-handling processes in a transparent way. Our type discipline exhibits the desirable properties of P1 (alignment), P2 (precision), P3 (causality), and P4 (decentralisation) for robust failure handling, and ensures fundamental properties of safety and progress. We are currently implementing the proposed framework and are extending it to support system-induced failures as opposed to application-specific ones focused on in this paper, in addition to parameterisation and dynamic multiroles.