1 Introduction

In the setting of mechanised proof about programming languages, it is often unclear what kind of operational semantics to use for formalising the language: common big-step and small-step approaches each have their own strengths and weaknesses. The choice depends on the size, complexity, and nature of the programming language, as well as what is being proved about it. As a rule-of-thumb, the more complex the language’s features, or the more semantically intricate the desired theorem, the more likely it is that small-step semantics will be needed. This is because small-step semantics enable powerful proof techniques, including syntactic preservation/progress and step-indexed logical relations, by allowing close observation not only of the result of a program, but also how it got there. In contrast, big-step’s advantages arise from following the syntactic structure of the programming language. This means that they can mesh nicely with similarly structured compilers, type systems, etc. that one is trying to verify, and reduce the overhead of mechanised proof.

For large projects, a hybrid approach can be adopted. The CompCert [16, 17] verified C compiler uses big-step for some parts of its semantics and small-step for others. In the initial version of our own CakeML project [15], we had two different semantics for the source language: big-step for the compiler verification and small-step for the type soundness proof, with an additional proof connecting the two semantics.

In contrast, this paper advocates functional big-step semantics, which can support many of the proofs and languages that typically rely on a small-step approach, but with a structure that follows the language’s syntax. A functional big-step semantics is essentially an interpreter written in a purely functional style and equipped with a clock to ensure that the function is total, even when run on diverging programs. Hence the interpreter can be used in a higher-order logic of total functions – the kind supported by Coq, HOL4, and Isabelle/HOL – as a formal definition of the semantics. In this way, it harkens back to Reynolds’ idea of definitional interpreters [23] to give a readable account of a semantics. Additionally, by initialising the clock to a very large number, the same functional big-step semantics used for proof can also be executed on test programs for exploration and validation.

The idea of using a clock in a semantics is not new;Footnote 1 our contribution here is to analyse its advantages, especially in the context of interactive proofs, and to show how it can be used to support the kinds of proofs that push researchers to small-step semantics. We argue that:

  • Functional semantics are easier to read, have a familiar feel for functional programmers, and avoid much of the duplication that occurs in big-step semantics defined with inductive relations, especially for languages with exceptions and other non-local control-flow (Sect. 2).

  • Functional semantics can be used more easily in mechanised proofs based on rewriting, since functional semantics are stated in terms of equations (Sect. 3.1).

  • Functional semantics also produce better induction theorems. Induction theorems for relational big-step semantics frequently force unnecessary case splits in proofs (Sect. 3.2).

  • The clock used to define a functional semantics is convenient both for proofs that a compiler preserves the diverging behaviour of programs (Sects. 3.3 and 3.4), and for defining (and using) step-indexed logical relations (Sect. 6).

  • Functional semantics can use an oracle in the state to support languages with I/O and non-determinism (Sect. 4).

There are a variety of advanced techniques for defining big-step semantics that solve some of these problems. For example, one can use co-induction to precisely define diverging computations [18, 20], or the pretty-big-step approach to reduce duplication in the definition [10]. Notably, these techniques still define the semantics using inductive (and co-inductive) relations rather than recursive functions, and we are not aware of any relational approach with all of the advantages listed above. However, functional semantics, as advocated in this paper, are not without their limitations. One is that the definition of a functional semantics requires introduction of a clock which must decrease on certain recursive calls (Sect. 2.3). Another is that languages with non-determinism require an oracle state component to factor out the non-determinism (Sect. 4). Lastly, we have not investigated languages with unstructured non-determinism, e.g. concurrency.

Our ideas about functional big-step semantics were developed in the context of the CakeML project (https://cakeml.org, [15]) where the latest version has functional big-step semantics for all of its intermediate languages (see Sect. 8); however, the bulk of this paper concentrates on a series of smaller examples, starting with a C-like language with for and break statements (Sect. 2). We use it to explain in detail how the functional approach supports the verification of a simple compiler (Sect. 3). Then, we present a series of different languages and theorems to illustrate the breadth of our approach (Sects. 4, 5, and 6). Lastly, we show how to prove the equivalence of a functional big-step and small-step semantics (Sect. 7).

All of the semantics and theorems in this paper have been formalised and proved in the HOL4 proof assistant (http://hol-theorem-prover.org). The formalisation is available in the HOL4 examples directory (https://github.com/HOL-Theorem-Prover/HOL/tree/master/examples/fun-op-sem); we encourage interested readers to consult these sources for the definitions and lemmas that we lack the space to present here.

2 Example Semantics

In this section, we motivate functional big-step semantics by defining an operational semantics for a toy language in both relational and functional styles. We call our toy language FOR, as it includes for loops and break statements that are familiar from C. We first define the big-step semantics of FOR, informally, as an interpreter in Standard ML (SML); next we explain why the semantics of FOR is difficult to capture in a conventional big-step relation, but, using a functional big-step semantics, can be defined neatly as a function in logic.

2.1 An Interpreter in SML

The FOR language has expressions e and statements t. Like C, we allow expression evaluation to have side effects (namely, assignment).

figure afigure a

We sketch the semantics for this language by defining functions that evaluate expressions and statements, run_e and run_t respectively. Each evaluation returns an integer wrapped in Rval, signals a break Rbreak, or fails Rfail. Expression evaluation fails on an attempt to read the value of an uninitialised variable.

figure bfigure b

Below, evaluation of a Break statement returns Rbreak, which is propagated to the enclosing For loop. A For loop returns a normal Rval result if the body of the loop returns Rbreak.

figure cfigure c

These SML functions make use of catch-all patterns in case-expressions in order to conveniently propagate non-Rval results. We use the same approach in our functional semantics (Sect. 2.3) to keep them concise. The case expressions above are idiomatic for SML, but in a language with syntactic support for monadic computations, such as Haskell with do-notation, one would package the propagation of exceptional results inside a monadic bind operator.

2.2 Relational Big-Step Semantics

The definition above is a good way to describe the semantics of FOR to a programmer familiar with SML. It is, however, not directly usable as an operational semantics for interactive proofs. Next, we outline how a big-step semantics can be defined for the FOR language using conventional inductively defined relations.

Relational big-step semantics are built up from evaluation rules for an evaluation relation, typically written \(\Downarrow \). Each rule states how execution of a program expression evaluates to a result. The evaluation relation for the FOR language takes as input a state and a statement; it then relates these inputs to the result pair (r and new state) just as the interpreter above does.

We give a flavour of the evaluation rules next. The simplest rule in the FOR language is evaluation of Break: evaluation always produces Rbreak and the state \( s \) is returned unchanged. We call this rule (B).

figure dfigure d

The semantics of Seq is defined by two evaluation rules. We need two rules because evaluation of \(t_2\) only happens if evaluation of \(t_1\) leads to Rval. The first rule for Seq (S1) states: if \(t_1\) evaluates according to (\({t}_{1}\) , s) \({{\Downarrow _{\texttt {t}}}}\) (Rval \( n _{\text {1}}\texttt {,} s _{\text {1}}\)) and \(t_2\) evaluates as (\( t _{\text {2}}\texttt {,} s _{\text {1}}\)) \(\Downarrow _{\texttt {t}}\) \( r \), then \({(\texttt {Seq} \, t _{\text {1}} \, t _{\text {2}} \texttt {,} s }\)) \(\Downarrow _{\texttt {t}}\) \( r \), i.e. Seq   \( t _{\text {1}} \, t _{\text {2}}\) evaluates state \( s \) to result \( r \). The second rule (S2) states that a non-Rval result in \(t_1\) is the result for evaluation of \({\texttt {Seq} \, t _{\text {1}}\, t _{\text {2}}}\).

figure efigure e

Defining these evaluation rules is straightforward, if the language is simple enough. We include the For statement in our example language in order to show how this conventional approach to big-step evaluation rules becomes awkward and repetitive. The For statement’s semantics is defined by six rules. The first rule captures the case when the loop is not executed, i.e. when the guard expression evaluates to zero. The second rule states that errors in the evaluation of the guard are propagated.

figure ffigure f

Execution of the body of the For statement is described by the following four rules. The first of the following rules (F3) specifies the behaviour of an evaluation where the guard \( e_{\mathrm {1}} \), the body \( t \), and the increment expression \( e_{\mathrm {2}} \) each return some Rval. The second rule (F4) defines the semantics for the case where evaluation of the body \( t \) signals Rbreak. The third rule (F5) states that errors in the increment expression \( e_{\mathrm {2}} \) propagate. Similarly, the fourth rule (F6) states that errors that occur in evaluation of the body propagate.

figure gfigure g

Once one has become accustomed to this style of definition, these rules are quite easy to read. However, even an experienced semanticist may find it difficult to immediately see whether these rules cover all the cases. Maybe the last two rules above were surprising? Worse, these rules only provide semantics for terminating executions, i.e. if we want to reason about the behaviour of diverging evaluations, then these (inductive) rules are not enough as stated.

Another drawback is the duplication that rules for complex languages (even for our toy FOR language) contain. In each of the four rules above, the first three lines are almost the same. This duplication might seem innocent but it has knock-on effects on interactive proofs: the generated induction theorem also contains duplication, and from there it leaks into proof scripts. In particular, users are forced to establish the same inductive hypothesis many times (Sect. 3.4).

The rules (F2), (F5) and (F6) ensure that the Rfail value is always propagated to the top, preventing the big-step relation from doing the moral equivalent of getting ‘stuck’ in the small-step sense. Thus, we know that a program diverges iff it is not related to anything. We could omit these rules if we do not need or want to distinguish divergence from getting stuck, and this is often done with big-step semantics.Footnote 2 However, for the purposes of this paper, we are primarily interested in the (many) situations where the distinction is important – that is where the functional big-step approach has the largest benefit.

The above ‘not related’ characterisation of divergence does not yield a useful principle for reasoning about diverging programs: the relation’s induction principle only applies when a program is related to something, not when we know it is not related to anything. To define divergence with a relation [18], one adds to the existing inductive evaluation relation \(\Downarrow _\texttt {t}\) a co-inductively defined divergence relation \(\Uparrow _\texttt {t}\), which provides a useful co-induction principle.

The rules for Seq and For are given below. (S1\('\)) states that a sequence diverges if its first sub-statement does. (S2\('\)) says that the sequence diverges if the first sub-statement returns a value, using the \(\Downarrow _\texttt {t}\) relation, and the second sub-statement diverges. Notice the duplication between the definitions of \(\Downarrow _\texttt {t}\) and \(\Uparrow _\texttt {t}\): both must allow the evaluation to progress normally up to a particular sub-statement, and then \(\Downarrow _\texttt {t}\) requires it to terminate, while \(\Uparrow _\texttt {t}\) requires it to diverge. This corresponds to the duplication internal to \(\Downarrow _\texttt {t}\) for propagating Rbreak and other exceptional results.

figure hfigure h

2.3 Functional Big-Step Semantics

The interpreter written in SML, given in Sect. 2.1, avoids the irritating duplication of the conventional big-step semantics. It is also arguably easier to read and clearly gives some semantics to all cases. So why can we not just take the SML code and define it as a function in logic? The answer is that the SML code does not terminate for all inputs, e.g., run_t [] (For (Num 1, Num 1, Exp (Num 1))).

In order to define run_t as a function in logic, we need to make it total somehow. A technique for doing this is to add a clock to the function: on each recursive call for which termination is non-obvious, one adds a clock decrement. The clock is a natural number, so when it hits zero, execution is aborted with a special time-out signal.

A very simple implementation of the clocked-function solution is to add a check-and-decrement on every recursive call. The termination proof becomes trivial, but the function is cluttered with the clock mechanism.

Instead of inserting the clock on every recursive call, we suggest that the clock should only be decremented on recursive function calls for which the currently evaluated expressions does not decrease in size. For the FOR language, this means adding a clock-check-and-decrement only on the looping call in the For case. In the SML code, this recursive call is performed here:

figure ifigure i

In our functional big-step semantics for the FOR language, called sem_t, we write the line above as follows. Here dec_clock decrements the clock that is stored in the state.

figure jfigure j

All other parts of the SML code are directly translated from SML into HOL4’s logic. The complete definition of sem_t is given below. Because run_e is a pure total function, it can be translated directly into the HOL4 logic as sem_e without adding a clock. Here store_var \( x \)  0 \( s \) is state \( s \) updated to have value 0 in variable \( x \).

figure kfigure k

Note that, in our logic version of the semantics, we have introduced a new kind of return value called Rtimeout. This return value is used only to signal that the clock has aborted evaluation. It always propagates to the top, and can be used for reasoning about divergence preservation (Sect. 3.3).

Termination Proof. We prove termination of sem_t by providing a well-founded measure: the lexicographic ordering on the clock value and the size of the statement that is being evaluated. This measure works because the value of the clock is never increased, and, on every recursive call where the clock is not decremented, the size of the statement that is being evaluated decreases.Footnote 3

No termination proof is required for relational big-step semantics. This requirement is, therefore, a drawback for the functional version. However, the functional representation brings some immediate benefits that are not immediate for relational definitions. The functional representation means that the semantics is total (by definition) and that the semantics is deterministic (see Sect. 4 for an account of non-deterministic languages). These are properties that can require tedious proof for relational definitions.

Semantics of Terminating and Non-terminating Evaluations. The sem_t function terminates for all inputs. However, at the same time, it gives semantics to both terminating and non-terminating (diverging) evaluations. We say that evaluation terminates, if there exists some initial value of the clock for which the sem_t returns Rval. An evaluation is non-terminating if sem_t returns Rtimeout for all initial values of the clock. In all other cases, the semantics fails. The top-level semantics is defined formally as follows. There are three observable outcomes: Terminate, Diverge, and Crash.

figure lfigure l

Section 3.3 verifies a compiler that preserves this semantics, and Sect. 4 extends the FOR language with input, output, and internal non-determinism.

3 Using Functional Semantics

The previous section showed how big-step semantics can be defined as functions in logic, and how they avoid the duplication that occurs in conventional big-step semantics. In this section, we highlight how the change in style of definition affects proofs that use the semantics. We compare proofs based on the functional semantics with corresponding proofs based on the relational semantics.

3.1 Rewriting with the Semantics

Since the functional semantics is defined as a function, it can be used for evaluation in the logic and used directly for proofs by rewriting. As a simple example, we can easily show that the Dec statement is an abbreviation for a longer program. This proof is just a simple call to the automatic rewriter in HOL4.

figure mfigure m

This ability to perform symbolic evaluation within the logic is a handy tool, as any ACL2 expert will attest [19].

Sometimes rewriting with a functional semantics can get stuck in an infinite loop. This happens when the left-hand side of the definition, e.g. in our example sem_t \( s \) ( \(\texttt {For} \, e _{\text {1}} \, e _{\text {2}} \, t \) ), matches a subexpression on the right-hand side of the equation, e.g. \({\texttt {sem\_t \, ({dec\_clock}} } \, s _{\text {3}} \texttt {)} \, \texttt {(For} \ e _{\text {1}} \, e _{\text {2}} \, t \texttt {)}\). We use a simple work-around for this: we define STOP \( x \) = \( x \) and prove an equation where the right-hand side is sem_t (dec_clock \( s _{\text {3}}\) ) (STOP (For \( e _{\text {1}} e _{\text {2}} \, t \) )). We ensure that the automatic simplifier cannot remove STOP and thus cannot apply the rewrite beyond the potentially diverging recursive call.

Rewriting is possible but often more cumbersome with relational big-step semantics. In HOL4, every definition of an inductive relation produces a rewrite theorem of the following form. We only show the cases relating to Seq, eliding others with ellipses.

figure nfigure n

Such theorems have unrestricted left-hand sides, which easily cause HOL4’s rewriter to diverge, and right-hand sides that introduce a large number of disjunctions. One can often avoid divergence by providing the rewriter with manually proved theorems with specialised left-hand sides, e.g. (Seq \( t_{\mathrm {1}} \) \( t_{\mathrm {2}} \),\( s \)) \(\Downarrow _{\texttt {t}}\) \( res \). Functional semantics require less work for use in proofs by rewriting.

3.2 Induction Theorem

The ability to rewrite with the functional semantics helps improve the details of interactive proofs. Surprisingly, the use of functional semantics also improves the overall structure of many proofs. The reason for this is that the induction theorems produced by functional semantics avoid the duplication that comes from the relational semantics.

The induction theorems for the FOR language are shown in Figs. 1 and 2. The induction theorem for sem_t only has one case for the For loop. In contrast, the induction theorem for the relational semantics has six cases for the For loop. The duplication in the relation semantics carries over to duplication in the induction theorem and, hence, to the structure of interactive proofs, making them longer and more repetitive. This difference is significant for languages with complex program constructs.

Fig. 1.
figure 1figure 1

Induction theorem for functional big-step semantics.

Fig. 2.
figure 2figure 2

Induction theorem for relational big-step semantics. Parts omitted with ‘...’.

Avoiding Duplication in Relations. The duplication problem can be avoided in relational big-step semantics. A trick is to define the evaluation rules such that program constructs are described by only one rule each. Below is an example of how one can package up all of the rules about For into one giant rule.

figure ofigure o

By avoiding the duplication in the rules, the induction theorem also avoids the duplication. Writing packaged rules, as shown above, is unusual and certainly not aesthetically pleasing. However, if relational definitions are to be used, packaging evaluation rules as above is potentially less intrusive to proofs than use of the pretty-big-step approach, since it does not introduce new data constructors.Footnote 4

3.3 Example Compiler Verification

Next, we outline how functional big-step semantics support compiler verification, proving that a compiler preserves the observable behaviour. Our compiler targets a simple assembly-like language, where the code is a list of instructions (instr).

figure pfigure p

The compiler, compile, is a composition of three phases. The first phase, phase1, simplifies For and Dec; phase2 splits assignments into simple instruction-like assignments, but stays within the source language; and phase3 reduces the remaining subset of the source language into a list of target instructions. The first two parameters to phase3 accumulate code location information.

figure qfigure q

The first phase is a source-to-source transformation that simplifies For and Dec as follows. Here Loop is an abbreviation: Loop \( t \) = For (Num 1) (Num 1) \( t \).

figure rfigure r

The compilation function phase1 has a simple correctness theorem that can be proved in less than 20 lines of HOL4 script using the induction from Fig. 1.

figure sfigure s

We also prove that phase1 preserves the observable semantics:

figure tfigure t

Subsequent phases assume that For statements have been simplified to Loop. The verification of the second phase, phase2, is almost as simple but a little longer because phase2 invents variable names to hold temporary results.

The third phase compiles the resulting subset of the FOR language into a list of instructions in the assembly-like target language. The crucial lemma, stated below, was proved by induction using the theorem shown in Fig. 1. This lemma’s statement can informally be read as: if the source semantics sem_t dictates that program \( t \) successfully evaluates state \( s_{\mathrm {1}} \) to state \( s_{\mathrm {2}} \), the source program \( t \) is within the allowed syntactic subset, and the compiled code for \( t \) is installed in a store-related target state \( x \); then the target semantics sem_a evaluates \( x \) to a new target state \( x^{\prime } \) that is store-related to \( s_{\mathrm {2}} \). Below, sem_a is the functional big-step semantics for the target assembly language. The sem_a function executes one instruction at a time and is tail-recursive; its lengthy definition is omitted. phase3_subset defines the syntactic restrictions that programs must follow after phases 1 and 2. The ellipses elide several detailed parts of the conclusion that are only necessary to make the induction go through: in particular, where the program counter will point at exit based on the result \( res \).

figure ufigure u

From the lemma above, it is easy to prove that phase3 \(0\) \(0\) \( t \) preserves the observable semantics, if \( t \) is in the subset expected by the third phase and \( t \) does not Crash in the source semantics.

figure vfigure v

Here asm_semantics is the observable semantics of the target assembly language.

figure wfigure w

The following top-level compiler correctness theorem is produced by combining the semantics preservation theorems from all three phases. The assumption that the source semantics does not Crash is implied by a simple syntactic check syntax_ok, which checks that all variables been declared (Dec) and that all Break statements are contained within For loops.

figure xfigure x

3.4 Comparison with Proof in Relational Semantics

We provide a corresponding proof of correctness for phase1 in the relational semantics. As a rough point of comparison, our relational proof required 43 lines while the functional big-step proof required just 18 lines. The proof is split into two parts, corresponding to the relations defining our big-step semantics:

figure yfigure y

The advantage of (non-looping) functional rewriting is apparent in our proofs: we often had to manually control where rewrites were applied in the relational proof. Additionally, we had to deal with significantly more cases in the relational proofs; these extra cases came from two sources, namely, the ones arising from an additional co-inductive proof for diverging programs, and extra (similar) cases in the induction theorems.

The additional co-inductive proof is a good point of comparison, since our technique of decrementing the clock only on recursive calls in the functional big-step semantics gives us divergence preservation for free in compilation steps that do not cause additional clock ticks. The cases arising in our co-inductive proof also required a different form of reasoning from the inductive proof; this naturally arises from the difference between induction and co-induction but it meant that we could not directly adapt similar cases across both proofs.

The top-level observable semantics can be similarly defined for relational semantics:

figure zfigure z

So we can prove the correctness of phase1 with respect to rel_semantics:

figure aafigure aa

This proof requires proving that the relations (\(\Downarrow _{\texttt {t}}\), \(\Uparrow _{\texttt {t}}\)) are disjoint:

figure abfigure ab

We also attempted a proof of phase1 with a relational pretty-big-step semantics; we found this semantics surprisingly difficult to use in HOL4. Pretty-big-step semantics requires the introduction of additional intermediate terms to factorise evaluation of sub-terms. Hence, the generated induction theorem requires reasoning over these intermediate terms. However, in our compiler proofs, we are typically concerned with the original syntactic terms – those are the only ones mentioned by the compiler – so this induction theorem cannot be applied directly, unlike in the other two semantics. There are ways around this: one can, for example, use an induction theorem that only concerns the original syntactic terms or induct on the size of derivations. Neither of these approaches are automatically supported in HOL4, and our proof of phase1 semantics preservation using the latter approach took 81 lines. Some of Charguéraud’s big-step and pretty-big-step equivalence proofs in Coq also needed to manually prove and use induction on derivation sizes. Additionally, a separate proof is still required for divergence preservation in the co-inductive interpretation of these rules; this requires the use of its co-induction theorem, which also has similar issues with intermediate terms.

To further validate the functional big-step approach, we prove the equivalence of the functional big-step semantics (sem_t) and the relational semantics (\(\Downarrow _{\texttt {t}}\), \(\Uparrow _{\texttt {t}}\)). (We also prove the equivalence with a small-step semantics in Sect. 7). The equivalence is separated into two theorems: the first shows equivalence for terminating programs while the latter shows equivalence on diverging programs.

figure acfigure ac

The proofs rely on the disjointness lemma above and a determinism lemma for the relational semantics:

figure adfigure ad

They also rely on an analogue of determinism for the functional big-step semantics: if a program does not time out for a given clock, then every increment to the clock gives the same resultFootnote 5.

figure aefigure ae

These lemmas are easy to prove compared to the main body of the equivalence proof, and our examples above demonstrate that the number of such lemmas required is comparable between the two semantics.

4 Non-determinism

We now add non-deterministic evaluation order and input/output expressions to the FOR language. The only syntactic change is the addition of two expressions: Getchar and Putchar \( e \). However, the observable behaviours of programs have changed significantly. Instead of doing exactly one of terminating, diverging, or crashing, a program can now exhibit a set of those behaviours. Furthermore, both termination and divergence results now include the I/O stream that the program consumed/produced. For technical reasons, it also contains the choices made by the non-deterministic evaluation order (see Sect. 7). In the type of observation, the llist type is the lazy list type that contains both finite and infinite lists, and + is the type constructor for disjoint unions.

figure affigure af

As a function, sem_t seems to be inherently deterministic: we cannot simply have it internally know what the next input character is, or choose which sub-expression to evaluate first. We are left with two options: we can factor out the input stream and all choices into the state argument of sem_t and then existentially quantify them in the top-level semantic function to build a set of results; or alternatively, we can change the type of sem_t to return sets of results (alongside partial I/O traces). Here we take the first approach which leads to only minor changes in the definition of sem_t.

First, the state argument of sem_t gets three new fields: io_trace to record the characters read and written; input to represent the (possibly infinite) input stream; and non_det_o which represents an infinite stream of decisions that determine the subexpression evaluation ordering. We include the inputs in the io_trace to accurately model the order in which the I/O operations happened.

figure agfigure ag

Because all of our changes are limited to the expression language, and encapsulated in the extended state argument, which sem_t does not access, the definition of sem_t looks identical to the previous one. The changes to sem_e are limited to the Add case (where a non-deterministic choice is made), and two new cases for the new expressions.

figure ahfigure ah

The Add case is similar to before, but uses the permute_pair function to swap the sub-expressions or not, depending on the oracle. It also returns a new oracle ready to get the next choice, and whether or not it switched the sub-expressions. The latter is used to un-permute the values to apply the primitive + in the right order (which would matter for a non-commutative operator). Getchar similarly consumes one input and updates the state. Putchar adds to the I/O trace.

Critically, the above modifications are orthogonal to the clock, and do not affect the termination proof, or the usefulness of the induction theorems and rewriting equations. The changes to the semantics function are explained next.Footnote 6

figure aifigure ai

Firstly, semantics is now a predicateFootnote 7 over programs, inputs, and observation. Termination and crashing are still straightforward: the non-determinism oracle and input are quantified along with the clock, and the resulting I/O trace is read out of the result state. We filter the trace so it only contains the I/O actions and not the record of the non-determinism oracle. Some choices of oracles might lead to a crash whereas others might lead to different terminating results.

Divergence is more subtle. First, note that a program can both terminate and diverge depending on evaluation order. For example, in the following x can be assigned either 1 or 0, depending on which sub-expression is evaluated first.

figure ajfigure aj

Thus, in the definition of semantics, we first existentially quantify the non-determinism, then check that it results in a timeout for all clock values given that particular oracle. To ensure that the resulting I/O trace is correct, we consider the set of all I/O traces for every possible clock in the complete partial order of lazy lists ordered by the prefix relation. This set forms a chain, because we prove that increasing the input clock does not alter the I/O already performed. Hence, the resulting I/O behaviour is the least upper bound, which can be either a finite or infinite lazy list. Operationally, as we increase the clock, we potentially see more I/O behaviour, and the least upper bound defines the lazy list that incorporates all of these. (Notation: the \(\bigvee \) binder takes lubs in this PO.)

Adapting the Compiler Verification. Adapting the compiler verification to the I/O and non-determinism extension is an almost trivial exercise. The I/O streams were modelled in the same way in the assembly language, which we kept deterministic. The new proof engineering work stems mostly from the substantial change to the definition of the top-level semantics function semantics. Due to non-determinism, which the compiler removes, the correctness theorem is now stated as a subset relation: every behaviour of the generated (deterministic) assembly code is also a behaviour of the (non-deterministic) source program.

figure akfigure ak

Unclocked Relational Big-step. Non-determinism can be handled naturally with two big-step rules for Add, although that does introduce duplication. A big-step relation can also be used to collect I/O traces [10, 17, 20]. However, this requires a mixed co-inductive/inductive approach for non-terminating programs, and we can no longer choose to equate divergence with a failure to relate the program to anything.

Concurrency. The techniques described in this section can support functional big-step semantics for a large variety of practical languages, but they do share a significant limitation with other big-step approaches: concurrency. Concurrent execution would require interleaving the evaluation of multiple expressions, whereas the main principle of a big-step semantics (ours included) is to evaluate an expression to a value in one step. Our non-determinism merely selects which to do first. Work-arounds, such as having sem_t return sets of traces of inter-thread communications, might sometimes be possible, but would significantly affect the shape of the definition of the semantics.

5 Type Soundness

Whereas big-step semantics are common in compiler verification, small-step semantics enable the standard approach to type soundness by preservation and progress lemmas [29]. A type soundness theorem says that well-typed programs do not crash; they either terminate normally or diverge. As Siek notes [25], a critical thing a semantics must provide is a good separation between divergence and crashing, and a clocked big-step semantics does this naturally. We have experimented with two type systems and found that functional big-step semantics works very well for proving type soundness.

Our first example is for the FOR language. We prove that syntax_ok programs do not evaluate to Rfail. The key is to use the induction theorem associated with the functional semantics, rather than rule induction derived from the type system.

We carry the same approach to a language with more interesting type systems: the Core ML language from Wright and Felleisen [29] equipped with a functional big-step semantics closely resembling an ML interpreter. The type system is more complex than the FOR language’s, supporting references, exceptions, higher-order functions and Hindley-Milner polymorphism. However, this extra complexity in the type system factors out neatly, and does not disrupt the proof outline.

Our approach is similar to the one described by Siek [26] (followed by Rompf and Amin [24]) who uses a clocked functional big-step semantics and demonstrates the utility of the induction theorem arising from the clocked semantics. As a result, our main type soundness proof, which interacts with the big-step semantics, is easy. Siek’s example type system is simpler than Core ML’s: it has no references or polymorphism; but these difficult aspects can be isolated. The most difficult lemmas in our proof are about the type system, and rely on \(\alpha \)-equivalence reasoning over type schemes. Similar lemmas, concerning the type system only, were proved by Tofte [27].

Our statement of type soundness for Core ML is: if a program is well-typed, then for all clocks, the semantics of the program is either Rtimeout, an exception, or a value of the correct type – never Rfail. The universal quantification of clocks makes this a strong statement, since it implies diverging well-typed programs also cannot fail. For contrast, we have also written un-clocked big-step semantics for Core ML and proved a similar theorem: if a program is well-typed and converges to \( r \), then \( r \) is an exception or value of the correct type, but never Rfail. The proof by induction is essentially the same as for the clocked semantics, and all the type-system lemmas can be re-used exactly, but the conclusion is much weaker because diverging programs do not satisfy the assumption. The proof is also longer (330 lines vs. 200) because of the duplication in the relational semantics.

6 Logical Relations

The technique of step-indexed logical relations [2] supports reasoning about programs that have recursive types, higher-order state, or other features that introduce aspects of circularity into a language’s semantics [1, 12]. The soundness of these relations is usually proved with respect to a small-step semantics, because the length of a small-step trace can be used to make the relation well-founded when following the structure of the language’s cyclic constructs (e.g., when following a pointer cycle in the heap or unfolding a recursive type). Here we show that the clock in a functional big-step semantics can serve the same purpose.

Because our main purpose here is to illustrate functional big-step semantics, we first present the relation and defer its motivation to the end of this section. For now, it suffices to say that it has some significant differences from the existing literature, because it is designed to validate compiler optimisations in an untyped setting.

We start with an untyped lambda calculus with literals, variables (using de Bruijn indices), functions, and a tick expression that decrements the clock. The semantics will also use closure values, and a state with a clock.

figure alfigure al

We can then define the function sem, which implements call-by-value evaluation and decrements the clock on every function call. EL gets the nth element of a list.

figure amfigure am

The top-level semantic function’s definition is similar to the FOR language’s (Sect. 2).

We then define the relations val_rel, which relates two values; exec_rel, which relates two environment/store/expression triples (i.e., the inputs to sem); and state_rel, which relates two stores; all at a given index.

figure anfigure an

The definitions of val_rel and state_rel are typical of a logical relation; exec_rel is where the relation interacts with the functional big-step semantics. In the small-step setting, exec_rel would say that the two triples are related if they remain related for \( i \) steps of the small-step semantics. With the functional big-step semantics, we instead check that the results of the sem function are related when we set the clock to a value less than \( i \).

From here we prove that the relation is compatible with the language’s syntax, that it is reflexive and transitive, that it is sound with respect to contextual approximation, and finally that \(\beta \)-value conversion is in the relation, and hence a sound optimisation for the language at any subexpression. Most of the proof is related to the semantic work at hand, rather than the details of the semantics, but we do need to rely on several easy-to-prove lemmas about the clock that capture intuitive aspects of what it means to be a clocked evaluation function. They correspond to the last lemma from Sect. 3.4.

Motivation. The language and relation are designed as a prototype of an intermediate language for CakeML that is similar to the clambda intermediate language in the OCaml compiler [9]. Because this is an untyped intermediate language for a typed source language, the compiler should be able to change a failing expression into anything at all. We know that we will never try to compile an expression that fails, and this design allows us to omit run-time checks that would otherwise be needed to signal failure. This is why exec_rel relates Rfail to anything, and why our relation is not an equivalence, but an approximation: the compiler must never convert a good expression into one that fails.

Furthermore, the compiler must not convert a diverging program into one that converges (or vice-versa). This is why Rtimeout is only related to itself, and why the clocks are both set to the same \( i' \) when running the expressions. In a typed setting, the clock for the right-hand argument is existentially quantified, thereby allowing a diverging expression to be related to a converging one, and if one wants to show equivalence, one proves the approximation both ways. Because of our treatment of failure, that is not an option here. The drawback is that we cannot support transformations that increase the number of clock ticks needed. For transformations that might reduce the number of ticks, including our \(\beta \)-value conversion, the transformation just needs to introduce extra Tick instructions.

All of the above applies in a small-step setting too. However, the functional big-step approach automatically has some flexibility for changing the amount of computation done. For example, both \(1+2\) and 3 evaluate with the same clock, and so this type of logical relation could be used to show that constant folding is a sound optimisation without added Tick instructions.

7 Equivalence with Small-Step Semantics

We build a straightforward small-step semantics for the FOR language by adding a Handle statement to the language, to stop the propagation of Break statements upward, and implement For as follows (we write Seq as an infix ;):

figure aofigure ao

To prove the equivalence of the functional big-step and small-step, we need two lemmas. First, that the functional semantics only gives Rtimeout with a clock of 0 (which is trivial to prove). Second, that any result of the functional semantics has a corresponding trace through the small-step semantics that is long enough. In the theorem below, we represent the small-step trace with a list so that we can check its length. The check_trace predicate checks that it is indeed a trace of \(\rightarrow _{\texttt {t}}\) steps. The length check ensures that if the functional big-step diverges, then we will be able to build a small-step trace of arbitrary length, and so it diverges too. The subtraction calculates how many clock ticks the evaluation actually used.

figure apfigure ap

One would expect such a theorem building small-step traces from big-step executions to show up in any big-step/small-step equivalence proof. The extra length check adds very little difficulty to the proof, but ensures that we do not need to explicitly prove anything about divergence, or additionally reason going from small-step traces to big-step executions. Similar to type soundness (Sect. 5), we prove this using the induction principle of sem_t.

In the non-deterministic case, we extend the state of the small-step semantics with the same oracle that the functional big-step semantics uses, and we use the oracle to choose which sub-expression of an Add to start evaluating. AddL and AddR expressions are included to mark which argument is being evaluated, so that we do not consult the oracle in subsequent steps for the same decision or switch back-and-forth between subexpressions. For example, if the oracle returns false, we start evaluating the left sub-expression on the updated oracle state. The oracle_upd function puts the new oracle into s and adds F to its \( io\texttt {\_}{}trace \).

figure aqfigure aq

Thus, the small-step semantics remains non-deterministic, and we can use the same approach as above. There are three significant differences. One, we look at the list of all I/O actions and non-determinism oracle results stored in \( io\texttt {\_}{}trace \) instead of the return value. This is why we need to record the oracle results there. Two, our trace-building must account for the AddL and AddR expressions. Three, we must know that the \( io\texttt {\_}{}trace \) is monotone with respect to stepping in the small-step semantics, and with respect to the clock in the functional big-step semantics. The only difficulty in this proof, over the deterministic one, was in handling the AddL and AddR forms, not in dealing with the oracle or trace.

To get an equivalent non-deterministic labelled transition system (LTS) with I/O actions as labels, one would prove the equivalence entirely in the small-step world with a simulation between the oracle small-step and the LTS semantics.

In the above, there was nothing special about the FOR language itself, and the same connection to small-step semantics could be proved for any situation where the big-step to small-step lemma above holds, along with other basic properties of the semantics. In fact, our proof for the FOR language is based on a general theorem that distills the essence of the approach. (We omit the details, which are obscured by the need to treat the two kinds of semantics abstractly).

8 Discussion and Related Work

Logical Foundations. All of our examples are carried out in classical higher-order logic of the kind supported by HOL4, HOL Light, Isabelle/HOL, etc. However, there is nothing inherently non-constructive about our techniques, and we expect that they would carry over to Coq. We rely on the ability to make definitions by well-founded recursion (usually on the combined structure of the terms, and a natural number index), derive the corresponding induction principles, and take lubs in the CPO of lazy lists. Occasionally, we make a non-constructive definition for convenience (e.g., of the top-level semantics in Sect. 2, whereas Sect. 4 has a constructive definition), our proofs do not rely on classical reasoning (other than in HOL4’s implementation of the features mentioned above).

Testing Semantics. To test a semantics, one must actually use it to evaluate programs. Functional big-step semantics can do this out-of-the-box, as can many small-step approaches [13, 14]. Where semantics are defined in a relational big-step style, one needs to build an interpreter that corresponds to the relation and verify that they are equivalent – essentially, building a functional big-step semantics anyway. This construction and proof has been done by hand in several projects [6, 7, 22], and both Coq and Isabelle have mechanisms for automatically deriving functions from inductive relations, although under certain restrictions, and not for co-inductive relations [5, 28].

Interpreters and Relational Big-step Semantics. The essence of the functional big-step approach is that the semantics are just an interpreter for the language, modified with a clock to make it admissible in higher-order logic. In this sense, we are just following Reynolds’ idea of definitional interpreters [23], but using higher-order logic, rather than a programming language, as the meta-language. Using a clock to handle potential non-termination keeps the mathematics unsophisticated, and fits in well with the automation available in HOL4.

Other approaches are possible, such as Danielsson’s use of a co-inductive partiality monad [11] to define functional big-step semantics. He defines a compiler from a lambda calculus with non-determinism to a stack-based virtual machine, and verifies it, including divergence preservation, in Agda. The compiler that we verify here targets a language with lower abstraction. A thorough comparison is difficult to make because the necessary mixed recursion/corecursion is not available in HOL.

Nakata and Uustalu [20] give a functional big-step semantics whose co-domain is (possibly infinite) traces of all states the program has passed through, rather than final results. Although their function is recursive, it relies on co-recursive helpers for sequencing and looping: in this way it looks less like a definitional interpreter. They prove equivalence between a variety of trace-based semantics, but do not use the semantics for compiler verification or type soundness. Our FOR language with I/O also keeps traces – although not of all of the program states passed through – but they are kept in the state, rather than in the function’s result. Instead of using co-recursion, we take a least upper bound to build possibly infinite traces of I/O actions.

Several improvements have been made to traditional inductive relational big-step semantics. Leroy and Grall show how to use co-inductive definitions to give a semantics to a lambda-calculus and verify type soundness, and compiler correctness (for a compiler to a VM) while properly handling divergence [18].

Charguéraud’s pretty-big-step semantics keeps the co-induction and removes some of the duplication by representing partial computations with new syntax and providing rules for completing the evaluation of the partially evaluated syntax [10]. For the FOR language, he introduces new syntax, For1, For2, and For3, that contain semantic contexts for partial evaluations. The evaluation rule for For has a hypothesis about evaluation of For1, which represents the state of evaluation after the first expression in For has been evaluated. Similarly, the semantics of For1 is given semantics in terms of For2, and so forth. The pretty-big-step approach leads to many rules, but there are fewer than in a conventional big-step definitions, and the duplication is removed by factoring it out into rules that introduce For1, For2, and For3.

Bach Poulsen and Mosses show how to derive a (co-inductive) pretty-big-step semantics from a certain kind of small-step semantics (MSOS). This allows one to get the conciseness of a small-step definition and some of the reasoning benefits of a big-step style [3]. They further show that the duplication between the inductive and co-inductive rules can be reduced by encoding in the state whether the computation is trying to diverge or converge, under certain restrictions [4]. Their approach to encoding control-flow effects in the state could be applied in the functional big-step setting. From the point of view of writing an interpreter, this would correspond to using a state monad to encode an exception monad.

Nipkow and Klein use an inductive big-step semantics for a simple imperative language, along with a small-step semantics proved equivalent, and show how to verify a compiler for it [21]. The language cannot have run-time errors, so they do not have to use co-induction. (When they add a type system and possible runtime errors, they switch to small-step). However, their compiler correctness proof and big-step/small-step equivalence proofs each rely on two lemmas. The first assumes a converging big-step execution and builds a small-step trace (their target language has a small-step semantics), just like our corresponding proofs in Sects. 3.3 and 7. Their second assumes a small-step trace and shows that the big-step semantics converges to the right thing. With functional big-step semantics, we do not need this direction because we are in a deterministic setting and we correlate the trace length with clock in the first lemma. This is significant because the second lemma has the more difficult proof: any machine state encountered when running the compiled program must be related back to some source program.

Functional Big-step in CakeML. At the time of writing, the CakeML compiler has 12 intermediate languages (ILs), totaling \(\approx 5,800\) lines. There are about \(\approx 40,000\) lines of proof about them. The semantics of each IL is defined in the functional big-step style, with added support for I/O using the techniques from Sect. 4. The lowest-level ILs are assembly and machine-code-like languages. Their functional big-step semantics are formulated as tail-recursive functions.

9 Conclusion

We have shown how to take an easy to understand interpreter and use it as a formal semantics suitable for use in an interactive theorem prover. To make this possible we added clocks and oracles to the interpreter. Although our example FOR language is simple, it exhibits a wide range of programming language features including divergence, I/O, exceptions (Break), and stores. We have also shown how the functional big-step style can support functional language semantics with Core ML and call-by-value lambda calculus examples.