Functional BigStep Semantics
 22 Citations
 1 Mentions
 874 Downloads
Abstract
When doing an interactive proof about a piece of software, it is important that the underlying programming language’s semantics does not make the proof unnecessarily difficult or unwieldy. Both smallstep and bigstep semantics are commonly used, and the latter is typically given by an inductively defined relation. In this paper, we consider an alternative: using a recursive function akin to an interpreter for the language. The advantages include a better induction theorem, less duplication, accessibility to ordinary functional programmers, and the ease of doing symbolic simulation in proofs via rewriting. We believe that this style of semantics is well suited for compiler verification, including proofs of divergence preservation. We do not claim the invention of this style of semantics: our contribution here is to clarify its value, and to explain how it supports several language features that might appear to require a relational or smallstep approach. We illustrate the technique on a simple imperative language with Clike forloops and a break statement, and compare it to a variety of other approaches. We also provide ML and lambdacalculus based examples to illustrate its generality.
Keywords
Relational Semantic Recursive Call Label Transition System Interactive Proof Type Soundness1 Introduction
In the setting of mechanised proof about programming languages, it is often unclear what kind of operational semantics to use for formalising the language: common bigstep and smallstep approaches each have their own strengths and weaknesses. The choice depends on the size, complexity, and nature of the programming language, as well as what is being proved about it. As a ruleofthumb, the more complex the language’s features, or the more semantically intricate the desired theorem, the more likely it is that smallstep semantics will be needed. This is because smallstep semantics enable powerful proof techniques, including syntactic preservation/progress and stepindexed logical relations, by allowing close observation not only of the result of a program, but also how it got there. In contrast, bigstep’s advantages arise from following the syntactic structure of the programming language. This means that they can mesh nicely with similarly structured compilers, type systems, etc. that one is trying to verify, and reduce the overhead of mechanised proof.
For large projects, a hybrid approach can be adopted. The CompCert [16, 17] verified C compiler uses bigstep for some parts of its semantics and smallstep for others. In the initial version of our own CakeML project [15], we had two different semantics for the source language: bigstep for the compiler verification and smallstep for the type soundness proof, with an additional proof connecting the two semantics.
In contrast, this paper advocates functional bigstep semantics, which can support many of the proofs and languages that typically rely on a smallstep approach, but with a structure that follows the language’s syntax. A functional bigstep semantics is essentially an interpreter written in a purely functional style and equipped with a clock to ensure that the function is total, even when run on diverging programs. Hence the interpreter can be used in a higherorder logic of total functions – the kind supported by Coq, HOL4, and Isabelle/HOL – as a formal definition of the semantics. In this way, it harkens back to Reynolds’ idea of definitional interpreters [23] to give a readable account of a semantics. Additionally, by initialising the clock to a very large number, the same functional bigstep semantics used for proof can also be executed on test programs for exploration and validation.

Functional semantics are easier to read, have a familiar feel for functional programmers, and avoid much of the duplication that occurs in bigstep semantics defined with inductive relations, especially for languages with exceptions and other nonlocal controlflow (Sect. 2).

Functional semantics can be used more easily in mechanised proofs based on rewriting, since functional semantics are stated in terms of equations (Sect. 3.1).

Functional semantics also produce better induction theorems. Induction theorems for relational bigstep semantics frequently force unnecessary case splits in proofs (Sect. 3.2).

The clock used to define a functional semantics is convenient both for proofs that a compiler preserves the diverging behaviour of programs (Sects. 3.3 and 3.4), and for defining (and using) stepindexed logical relations (Sect. 6).

Functional semantics can use an oracle in the state to support languages with I/O and nondeterminism (Sect. 4).
There are a variety of advanced techniques for defining bigstep semantics that solve some of these problems. For example, one can use coinduction to precisely define diverging computations [18, 20], or the prettybigstep approach to reduce duplication in the definition [10]. Notably, these techniques still define the semantics using inductive (and coinductive) relations rather than recursive functions, and we are not aware of any relational approach with all of the advantages listed above. However, functional semantics, as advocated in this paper, are not without their limitations. One is that the definition of a functional semantics requires introduction of a clock which must decrease on certain recursive calls (Sect. 2.3). Another is that languages with nondeterminism require an oracle state component to factor out the nondeterminism (Sect. 4). Lastly, we have not investigated languages with unstructured nondeterminism, e.g. concurrency.
Our ideas about functional bigstep semantics were developed in the context of the CakeML project (https://cakeml.org, [15]) where the latest version has functional bigstep semantics for all of its intermediate languages (see Sect. 8); however, the bulk of this paper concentrates on a series of smaller examples, starting with a Clike language with for and break statements (Sect. 2). We use it to explain in detail how the functional approach supports the verification of a simple compiler (Sect. 3). Then, we present a series of different languages and theorems to illustrate the breadth of our approach (Sects. 4, 5, and 6). Lastly, we show how to prove the equivalence of a functional bigstep and smallstep semantics (Sect. 7).
All of the semantics and theorems in this paper have been formalised and proved in the HOL4 proof assistant (http://holtheoremprover.org). The formalisation is available in the HOL4 examples directory (https://github.com/HOLTheoremProver/HOL/tree/master/examples/funopsem); we encourage interested readers to consult these sources for the definitions and lemmas that we lack the space to present here.
2 Example Semantics
In this section, we motivate functional bigstep semantics by defining an operational semantics for a toy language in both relational and functional styles. We call our toy language FOR, as it includes for loops and break statements that are familiar from C. We first define the bigstep semantics of FOR, informally, as an interpreter in Standard ML (SML); next we explain why the semantics of FOR is difficult to capture in a conventional bigstep relation, but, using a functional bigstep semantics, can be defined neatly as a function in logic.
2.1 An Interpreter in SML
2.2 Relational BigStep Semantics
The definition above is a good way to describe the semantics of FOR to a programmer familiar with SML. It is, however, not directly usable as an operational semantics for interactive proofs. Next, we outline how a bigstep semantics can be defined for the FOR language using conventional inductively defined relations.
Relational bigstep semantics are built up from evaluation rules for an evaluation relation, typically written \(\Downarrow \). Each rule states how execution of a program expression evaluates to a result. The evaluation relation for the FOR language takes as input a state and a statement; it then relates these inputs to the result pair (r and new state) just as the interpreter above does.
Once one has become accustomed to this style of definition, these rules are quite easy to read. However, even an experienced semanticist may find it difficult to immediately see whether these rules cover all the cases. Maybe the last two rules above were surprising? Worse, these rules only provide semantics for terminating executions, i.e. if we want to reason about the behaviour of diverging evaluations, then these (inductive) rules are not enough as stated.
Another drawback is the duplication that rules for complex languages (even for our toy FOR language) contain. In each of the four rules above, the first three lines are almost the same. This duplication might seem innocent but it has knockon effects on interactive proofs: the generated induction theorem also contains duplication, and from there it leaks into proof scripts. In particular, users are forced to establish the same inductive hypothesis many times (Sect. 3.4).
The rules (F2), (F5) and (F6) ensure that the Rfail value is always propagated to the top, preventing the bigstep relation from doing the moral equivalent of getting ‘stuck’ in the smallstep sense. Thus, we know that a program diverges iff it is not related to anything. We could omit these rules if we do not need or want to distinguish divergence from getting stuck, and this is often done with bigstep semantics.^{2} However, for the purposes of this paper, we are primarily interested in the (many) situations where the distinction is important – that is where the functional bigstep approach has the largest benefit.
The above ‘not related’ characterisation of divergence does not yield a useful principle for reasoning about diverging programs: the relation’s induction principle only applies when a program is related to something, not when we know it is not related to anything. To define divergence with a relation [18], one adds to the existing inductive evaluation relation \(\Downarrow _\texttt {t}\) a coinductively defined divergence relation \(\Uparrow _\texttt {t}\), which provides a useful coinduction principle.
2.3 Functional BigStep Semantics
The interpreter written in SML, given in Sect. 2.1, avoids the irritating duplication of the conventional bigstep semantics. It is also arguably easier to read and clearly gives some semantics to all cases. So why can we not just take the SML code and define it as a function in logic? The answer is that the SML code does not terminate for all inputs, e.g., run_t [] (For (Num 1, Num 1, Exp (Num 1))).
In order to define run_t as a function in logic, we need to make it total somehow. A technique for doing this is to add a clock to the function: on each recursive call for which termination is nonobvious, one adds a clock decrement. The clock is a natural number, so when it hits zero, execution is aborted with a special timeout signal.
A very simple implementation of the clockedfunction solution is to add a checkanddecrement on every recursive call. The termination proof becomes trivial, but the function is cluttered with the clock mechanism.
Note that, in our logic version of the semantics, we have introduced a new kind of return value called Rtimeout. This return value is used only to signal that the clock has aborted evaluation. It always propagates to the top, and can be used for reasoning about divergence preservation (Sect. 3.3).
Termination Proof. We prove termination of sem_t by providing a wellfounded measure: the lexicographic ordering on the clock value and the size of the statement that is being evaluated. This measure works because the value of the clock is never increased, and, on every recursive call where the clock is not decremented, the size of the statement that is being evaluated decreases.^{3}
No termination proof is required for relational bigstep semantics. This requirement is, therefore, a drawback for the functional version. However, the functional representation brings some immediate benefits that are not immediate for relational definitions. The functional representation means that the semantics is total (by definition) and that the semantics is deterministic (see Sect. 4 for an account of nondeterministic languages). These are properties that can require tedious proof for relational definitions.
Section 3.3 verifies a compiler that preserves this semantics, and Sect. 4 extends the FOR language with input, output, and internal nondeterminism.
3 Using Functional Semantics
The previous section showed how bigstep semantics can be defined as functions in logic, and how they avoid the duplication that occurs in conventional bigstep semantics. In this section, we highlight how the change in style of definition affects proofs that use the semantics. We compare proofs based on the functional semantics with corresponding proofs based on the relational semantics.
3.1 Rewriting with the Semantics
Sometimes rewriting with a functional semantics can get stuck in an infinite loop. This happens when the lefthand side of the definition, e.g. in our example sem_t\( s \)(\(\texttt {For} \, e _{\text {1}} \, e _{\text {2}} \, t \)), matches a subexpression on the righthand side of the equation, e.g. \({\texttt {sem\_t \, ({dec\_clock}} } \, s _{\text {3}} \texttt {)} \, \texttt {(For} \ e _{\text {1}} \, e _{\text {2}} \, t \texttt {)}\). We use a simple workaround for this: we define STOP\( x \)=\( x \) and prove an equation where the righthand side is sem_t(dec_clock\( s _{\text {3}}\))(STOP(For\( e _{\text {1}} e _{\text {2}} \, t \))). We ensure that the automatic simplifier cannot remove STOP and thus cannot apply the rewrite beyond the potentially diverging recursive call.
3.2 Induction Theorem
The ability to rewrite with the functional semantics helps improve the details of interactive proofs. Surprisingly, the use of functional semantics also improves the overall structure of many proofs. The reason for this is that the induction theorems produced by functional semantics avoid the duplication that comes from the relational semantics.
3.3 Example Compiler Verification
Subsequent phases assume that For statements have been simplified to Loop. The verification of the second phase, phase2, is almost as simple but a little longer because phase2 invents variable names to hold temporary results.
3.4 Comparison with Proof in Relational Semantics
The additional coinductive proof is a good point of comparison, since our technique of decrementing the clock only on recursive calls in the functional bigstep semantics gives us divergence preservation for free in compilation steps that do not cause additional clock ticks. The cases arising in our coinductive proof also required a different form of reasoning from the inductive proof; this naturally arises from the difference between induction and coinduction but it meant that we could not directly adapt similar cases across both proofs.
We also attempted a proof of phase1 with a relational prettybigstep semantics; we found this semantics surprisingly difficult to use in HOL4. Prettybigstep semantics requires the introduction of additional intermediate terms to factorise evaluation of subterms. Hence, the generated induction theorem requires reasoning over these intermediate terms. However, in our compiler proofs, we are typically concerned with the original syntactic terms – those are the only ones mentioned by the compiler – so this induction theorem cannot be applied directly, unlike in the other two semantics. There are ways around this: one can, for example, use an induction theorem that only concerns the original syntactic terms or induct on the size of derivations. Neither of these approaches are automatically supported in HOL4, and our proof of phase1 semantics preservation using the latter approach took 81 lines. Some of Charguéraud’s bigstep and prettybigstep equivalence proofs in Coq also needed to manually prove and use induction on derivation sizes. Additionally, a separate proof is still required for divergence preservation in the coinductive interpretation of these rules; this requires the use of its coinduction theorem, which also has similar issues with intermediate terms.
4 Nondeterminism
As a function, sem_t seems to be inherently deterministic: we cannot simply have it internally know what the next input character is, or choose which subexpression to evaluate first. We are left with two options: we can factor out the input stream and all choices into the state argument of sem_t and then existentially quantify them in the toplevel semantic function to build a set of results; or alternatively, we can change the type of sem_t to return sets of results (alongside partial I/O traces). Here we take the first approach which leads to only minor changes in the definition of sem_t.
The Add case is similar to before, but uses the permute_pair function to swap the subexpressions or not, depending on the oracle. It also returns a new oracle ready to get the next choice, and whether or not it switched the subexpressions. The latter is used to unpermute the values to apply the primitive + in the right order (which would matter for a noncommutative operator). Getchar similarly consumes one input and updates the state. Putchar adds to the I/O trace.
Firstly, semantics is now a predicate^{7} over programs, inputs, and observation. Termination and crashing are still straightforward: the nondeterminism oracle and input are quantified along with the clock, and the resulting I/O trace is read out of the result state. We filter the trace so it only contains the I/O actions and not the record of the nondeterminism oracle. Some choices of oracles might lead to a crash whereas others might lead to different terminating results.
Unclocked Relational Bigstep. Nondeterminism can be handled naturally with two bigstep rules for Add, although that does introduce duplication. A bigstep relation can also be used to collect I/O traces [10, 17, 20]. However, this requires a mixed coinductive/inductive approach for nonterminating programs, and we can no longer choose to equate divergence with a failure to relate the program to anything.
Concurrency. The techniques described in this section can support functional bigstep semantics for a large variety of practical languages, but they do share a significant limitation with other bigstep approaches: concurrency. Concurrent execution would require interleaving the evaluation of multiple expressions, whereas the main principle of a bigstep semantics (ours included) is to evaluate an expression to a value in one step. Our nondeterminism merely selects which to do first. Workarounds, such as having sem_t return sets of traces of interthread communications, might sometimes be possible, but would significantly affect the shape of the definition of the semantics.
5 Type Soundness
Whereas bigstep semantics are common in compiler verification, smallstep semantics enable the standard approach to type soundness by preservation and progress lemmas [29]. A type soundness theorem says that welltyped programs do not crash; they either terminate normally or diverge. As Siek notes [25], a critical thing a semantics must provide is a good separation between divergence and crashing, and a clocked bigstep semantics does this naturally. We have experimented with two type systems and found that functional bigstep semantics works very well for proving type soundness.
Our first example is for the FOR language. We prove that syntax_ok programs do not evaluate to Rfail. The key is to use the induction theorem associated with the functional semantics, rather than rule induction derived from the type system.
We carry the same approach to a language with more interesting type systems: the Core ML language from Wright and Felleisen [29] equipped with a functional bigstep semantics closely resembling an ML interpreter. The type system is more complex than the FOR language’s, supporting references, exceptions, higherorder functions and HindleyMilner polymorphism. However, this extra complexity in the type system factors out neatly, and does not disrupt the proof outline.
Our approach is similar to the one described by Siek [26] (followed by Rompf and Amin [24]) who uses a clocked functional bigstep semantics and demonstrates the utility of the induction theorem arising from the clocked semantics. As a result, our main type soundness proof, which interacts with the bigstep semantics, is easy. Siek’s example type system is simpler than Core ML’s: it has no references or polymorphism; but these difficult aspects can be isolated. The most difficult lemmas in our proof are about the type system, and rely on \(\alpha \)equivalence reasoning over type schemes. Similar lemmas, concerning the type system only, were proved by Tofte [27].
Our statement of type soundness for Core ML is: if a program is welltyped, then for all clocks, the semantics of the program is either Rtimeout, an exception, or a value of the correct type – never Rfail. The universal quantification of clocks makes this a strong statement, since it implies diverging welltyped programs also cannot fail. For contrast, we have also written unclocked bigstep semantics for Core ML and proved a similar theorem: if a program is welltyped and converges to \( r \), then \( r \) is an exception or value of the correct type, but never Rfail. The proof by induction is essentially the same as for the clocked semantics, and all the typesystem lemmas can be reused exactly, but the conclusion is much weaker because diverging programs do not satisfy the assumption. The proof is also longer (330 lines vs. 200) because of the duplication in the relational semantics.
6 Logical Relations
The technique of stepindexed logical relations [2] supports reasoning about programs that have recursive types, higherorder state, or other features that introduce aspects of circularity into a language’s semantics [1, 12]. The soundness of these relations is usually proved with respect to a smallstep semantics, because the length of a smallstep trace can be used to make the relation wellfounded when following the structure of the language’s cyclic constructs (e.g., when following a pointer cycle in the heap or unfolding a recursive type). Here we show that the clock in a functional bigstep semantics can serve the same purpose.
Because our main purpose here is to illustrate functional bigstep semantics, we first present the relation and defer its motivation to the end of this section. For now, it suffices to say that it has some significant differences from the existing literature, because it is designed to validate compiler optimisations in an untyped setting.
The definitions of val_rel and state_rel are typical of a logical relation; exec_rel is where the relation interacts with the functional bigstep semantics. In the smallstep setting, exec_rel would say that the two triples are related if they remain related for \( i \) steps of the smallstep semantics. With the functional bigstep semantics, we instead check that the results of the sem function are related when we set the clock to a value less than \( i \).
From here we prove that the relation is compatible with the language’s syntax, that it is reflexive and transitive, that it is sound with respect to contextual approximation, and finally that \(\beta \)value conversion is in the relation, and hence a sound optimisation for the language at any subexpression. Most of the proof is related to the semantic work at hand, rather than the details of the semantics, but we do need to rely on several easytoprove lemmas about the clock that capture intuitive aspects of what it means to be a clocked evaluation function. They correspond to the last lemma from Sect. 3.4.
Motivation. The language and relation are designed as a prototype of an intermediate language for CakeML that is similar to the clambda intermediate language in the OCaml compiler [9]. Because this is an untyped intermediate language for a typed source language, the compiler should be able to change a failing expression into anything at all. We know that we will never try to compile an expression that fails, and this design allows us to omit runtime checks that would otherwise be needed to signal failure. This is why exec_rel relates Rfail to anything, and why our relation is not an equivalence, but an approximation: the compiler must never convert a good expression into one that fails.
Furthermore, the compiler must not convert a diverging program into one that converges (or viceversa). This is why Rtimeout is only related to itself, and why the clocks are both set to the same \( i' \) when running the expressions. In a typed setting, the clock for the righthand argument is existentially quantified, thereby allowing a diverging expression to be related to a converging one, and if one wants to show equivalence, one proves the approximation both ways. Because of our treatment of failure, that is not an option here. The drawback is that we cannot support transformations that increase the number of clock ticks needed. For transformations that might reduce the number of ticks, including our \(\beta \)value conversion, the transformation just needs to introduce extra Tick instructions.
All of the above applies in a smallstep setting too. However, the functional bigstep approach automatically has some flexibility for changing the amount of computation done. For example, both \(1+2\) and 3 evaluate with the same clock, and so this type of logical relation could be used to show that constant folding is a sound optimisation without added Tick instructions.
7 Equivalence with SmallStep Semantics
One would expect such a theorem building smallstep traces from bigstep executions to show up in any bigstep/smallstep equivalence proof. The extra length check adds very little difficulty to the proof, but ensures that we do not need to explicitly prove anything about divergence, or additionally reason going from smallstep traces to bigstep executions. Similar to type soundness (Sect. 5), we prove this using the induction principle of sem_t.
Thus, the smallstep semantics remains nondeterministic, and we can use the same approach as above. There are three significant differences. One, we look at the list of all I/O actions and nondeterminism oracle results stored in \( io\texttt {\_}{}trace \) instead of the return value. This is why we need to record the oracle results there. Two, our tracebuilding must account for the AddL and AddR expressions. Three, we must know that the \( io\texttt {\_}{}trace \) is monotone with respect to stepping in the smallstep semantics, and with respect to the clock in the functional bigstep semantics. The only difficulty in this proof, over the deterministic one, was in handling the AddL and AddR forms, not in dealing with the oracle or trace.
To get an equivalent nondeterministic labelled transition system (LTS) with I/O actions as labels, one would prove the equivalence entirely in the smallstep world with a simulation between the oracle smallstep and the LTS semantics.
In the above, there was nothing special about the FOR language itself, and the same connection to smallstep semantics could be proved for any situation where the bigstep to smallstep lemma above holds, along with other basic properties of the semantics. In fact, our proof for the FOR language is based on a general theorem that distills the essence of the approach. (We omit the details, which are obscured by the need to treat the two kinds of semantics abstractly).
8 Discussion and Related Work
Logical Foundations. All of our examples are carried out in classical higherorder logic of the kind supported by HOL4, HOL Light, Isabelle/HOL, etc. However, there is nothing inherently nonconstructive about our techniques, and we expect that they would carry over to Coq. We rely on the ability to make definitions by wellfounded recursion (usually on the combined structure of the terms, and a natural number index), derive the corresponding induction principles, and take lubs in the CPO of lazy lists. Occasionally, we make a nonconstructive definition for convenience (e.g., of the toplevel semantics in Sect. 2, whereas Sect. 4 has a constructive definition), our proofs do not rely on classical reasoning (other than in HOL4’s implementation of the features mentioned above).
Testing Semantics. To test a semantics, one must actually use it to evaluate programs. Functional bigstep semantics can do this outofthebox, as can many smallstep approaches [13, 14]. Where semantics are defined in a relational bigstep style, one needs to build an interpreter that corresponds to the relation and verify that they are equivalent – essentially, building a functional bigstep semantics anyway. This construction and proof has been done by hand in several projects [6, 7, 22], and both Coq and Isabelle have mechanisms for automatically deriving functions from inductive relations, although under certain restrictions, and not for coinductive relations [5, 28].
Interpreters and Relational Bigstep Semantics. The essence of the functional bigstep approach is that the semantics are just an interpreter for the language, modified with a clock to make it admissible in higherorder logic. In this sense, we are just following Reynolds’ idea of definitional interpreters [23], but using higherorder logic, rather than a programming language, as the metalanguage. Using a clock to handle potential nontermination keeps the mathematics unsophisticated, and fits in well with the automation available in HOL4.
Other approaches are possible, such as Danielsson’s use of a coinductive partiality monad [11] to define functional bigstep semantics. He defines a compiler from a lambda calculus with nondeterminism to a stackbased virtual machine, and verifies it, including divergence preservation, in Agda. The compiler that we verify here targets a language with lower abstraction. A thorough comparison is difficult to make because the necessary mixed recursion/corecursion is not available in HOL.
Nakata and Uustalu [20] give a functional bigstep semantics whose codomain is (possibly infinite) traces of all states the program has passed through, rather than final results. Although their function is recursive, it relies on corecursive helpers for sequencing and looping: in this way it looks less like a definitional interpreter. They prove equivalence between a variety of tracebased semantics, but do not use the semantics for compiler verification or type soundness. Our FOR language with I/O also keeps traces – although not of all of the program states passed through – but they are kept in the state, rather than in the function’s result. Instead of using corecursion, we take a least upper bound to build possibly infinite traces of I/O actions.
Several improvements have been made to traditional inductive relational bigstep semantics. Leroy and Grall show how to use coinductive definitions to give a semantics to a lambdacalculus and verify type soundness, and compiler correctness (for a compiler to a VM) while properly handling divergence [18].
Charguéraud’s prettybigstep semantics keeps the coinduction and removes some of the duplication by representing partial computations with new syntax and providing rules for completing the evaluation of the partially evaluated syntax [10]. For the FOR language, he introduces new syntax, For1, For2, and For3, that contain semantic contexts for partial evaluations. The evaluation rule for For has a hypothesis about evaluation of For1, which represents the state of evaluation after the first expression in For has been evaluated. Similarly, the semantics of For1 is given semantics in terms of For2, and so forth. The prettybigstep approach leads to many rules, but there are fewer than in a conventional bigstep definitions, and the duplication is removed by factoring it out into rules that introduce For1, For2, and For3.
Bach Poulsen and Mosses show how to derive a (coinductive) prettybigstep semantics from a certain kind of smallstep semantics (MSOS). This allows one to get the conciseness of a smallstep definition and some of the reasoning benefits of a bigstep style [3]. They further show that the duplication between the inductive and coinductive rules can be reduced by encoding in the state whether the computation is trying to diverge or converge, under certain restrictions [4]. Their approach to encoding controlflow effects in the state could be applied in the functional bigstep setting. From the point of view of writing an interpreter, this would correspond to using a state monad to encode an exception monad.
Nipkow and Klein use an inductive bigstep semantics for a simple imperative language, along with a smallstep semantics proved equivalent, and show how to verify a compiler for it [21]. The language cannot have runtime errors, so they do not have to use coinduction. (When they add a type system and possible runtime errors, they switch to smallstep). However, their compiler correctness proof and bigstep/smallstep equivalence proofs each rely on two lemmas. The first assumes a converging bigstep execution and builds a smallstep trace (their target language has a smallstep semantics), just like our corresponding proofs in Sects. 3.3 and 7. Their second assumes a smallstep trace and shows that the bigstep semantics converges to the right thing. With functional bigstep semantics, we do not need this direction because we are in a deterministic setting and we correlate the trace length with clock in the first lemma. This is significant because the second lemma has the more difficult proof: any machine state encountered when running the compiled program must be related back to some source program.
Functional Bigstep in CakeML. At the time of writing, the CakeML compiler has 12 intermediate languages (ILs), totaling \(\approx 5,800\) lines. There are about \(\approx 40,000\) lines of proof about them. The semantics of each IL is defined in the functional bigstep style, with added support for I/O using the techniques from Sect. 4. The lowestlevel ILs are assembly and machinecodelike languages. Their functional bigstep semantics are formulated as tailrecursive functions.
9 Conclusion
We have shown how to take an easy to understand interpreter and use it as a formal semantics suitable for use in an interactive theorem prover. To make this possible we added clocks and oracles to the interpreter. Although our example FOR language is simple, it exhibits a wide range of programming language features including divergence, I/O, exceptions (Break), and stores. We have also shown how the functional bigstep style can support functional language semantics with Core ML and callbyvalue lambda calculus examples.
Footnotes
 1.
For example, CakeML initially used a clocked, but relational, semantics for its intermediate languages, and clocked recursive evaluation functions are common in BoyerMoorestyle provers such as ACL2, where inductive relations are unavailable [8, 30]. Leroy and Grall [18] use a clock to define a denotational semantics in Coq. Siek has also advocated for clocks for proving type soundness [25, 26].
 2.
If we had another mode of failure, e.g., from a raise expression, then these rules would still be needed to propagate that.
 3.
HOL4’s current definition package requires some help to prove and use the fact that the clock never increases.
 4.
Note that such packaged bigstep rules are easy to define in HOL4. However, they do not fit well with Coq’s default mechanism for defining inductive relations. Charguéraud’s prettybigstep approach was developed in the context of Coq.
 5.
This lemma also implies that if a program times out for a given clock, then it times out for all smaller clocks.
 6.
Here FILTER is ordinary filtering over a list, and ISL is the predicate for the left injection of a sum (disjoint union), so the FILTERISL applications get the I/O actions and discard the evaluation ordering choices.
 7.
Note that HOL4 identifies the types \(\alpha \)> bool and \(\alpha \)set.
Notes
Acknowledgements
We thank Arthur Charguéraud for advice on Coq and prettybigstep. The first author was supported by the EPSRC [EP/K040561/1]. The second author was partially supported by the Swedish Research Council. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.
References
 1.Ahmed, A.: Stepindexed syntactic logical relations for recursive and quantified types. In: Sestoft, P. (ed.) ESOP 2006. LNCS, vol. 3924, pp. 69–83. Springer, Heidelberg (2006). doi: 10.1007/11693024_6 CrossRefGoogle Scholar
 2.Appel, A.W., McAllester, D.A.: An indexed model of recursive types for foundational proofcarrying code. ACM Trans. Program. Lang. Syst. 23(5), 657–683 (2001). doi: 10.1145/504709.504712 CrossRefGoogle Scholar
 3.Bach Poulsen, C., Mosses, P.D.: Deriving prettybigstep semantics from smallstep semantics. In: Shao, Z. (ed.) ESOP 2014 (ETAPS). LNCS, vol. 8410, pp. 270–289. Springer, Heidelberg (2014). doi: 10.1007/9783642548338_15 CrossRefGoogle Scholar
 4.Poulsen, C.B., Mosses, P.D.: Divergence as state in coinductive bigstep semantics (extended abstract). In: 26th Nordic Workshop on Programming Theory, NWPT 2014 (2014). http://www.plancomps.org/nwpt2014/
 5.Berghofer, S., Bulwahn, L., Haftmann, F.: Turning inductive into equational specifications. In: Berghofer, S., Nipkow, T., Urban, C., Wenzel, M. (eds.) TPHOLs 2009. LNCS, vol. 5674, pp. 131–146. Springer, Heidelberg (2009). doi: 10.1007/9783642033599_11 CrossRefGoogle Scholar
 6.Blazy, S., Leroy, X.: Mechanized semantics for the Clight subset of the C language. J. Autom. Reasoning 43(3), 263–288 (2009). doi: 10.1007/s1081700991483 MathSciNetCrossRefzbMATHGoogle Scholar
 7.Bodin, M., Charguéraud, A., Filaretti, D., Gardner, P., Maffeis, S., Naudziuniene, D., Schmitt, A., Smith, G.: A trusted mechanised JavaScript specification. In: The 41st Annual ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POPL 2014, pp. 87–100 (2014). doi: 10.1145/2535838.2535876
 8.Boyer, R., Moore, J.S.: Mechanized formal reasoning about programs and computing machines. In: Automated Reasoning and Its Applications: Essays in Honor of Larry Wos. MIT Press (1996)Google Scholar
 9.Chambart, P.: High level OCaml optimisations (2013). https://ocaml.org/meetings/ocaml/2013/slides/chambart.pdf
 10.Charguéraud, A.: Prettybigstep semantics. In: Felleisen, M., Gardner, P. (eds.) ESOP 2013. LNCS, vol. 7792, pp. 41–60. Springer, Heidelberg (2013). doi: 10.1007/9783642370366_3 CrossRefGoogle Scholar
 11.Danielsson, N.A.: Operational semantics using the partiality monad. In: ACM SIGPLAN International Conference on Functional Programming, ICFP 2012, pp. 127–138 (2012). doi: 10.1145/2364527.2364546
 12.Dreyer, D., Neis, G., Birkedal, L.: The impact of higherorder state and control effects on local relational reasoning. J. Funct. Program. 22(4–5), 477–528 (2012). doi: 10.1017/S095679681200024X MathSciNetCrossRefzbMATHGoogle Scholar
 13.Ellison, C., Rosu, G.: An executable formal semantics of C with applications. In: Proceedings of the 39th ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POpPL 2012, pp. 533–544 (2012). doi: 10.1145/2103656.2103719
 14.Klein, C., Clements, J., Dimoulas, C., Eastlund, C., Felleisen, M., Flatt, M., McCarthy, J.A., Rafkind, J., TobinHochstadt, S., Findler, R.B.: Run your research: on the effectiveness of lightweight mechanization. In: Proceedings of the 39th ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POpPL 2012, pp. 285–296 (2012). doi: 10.1145/2103656.2103691
 15.Kumar, R., Myreen, M.O., Norrish, M., Owens, S.: CakeML: a verified implementation of ML. In: Proceedings of the 41st ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POPL 2014, pp. 179–191. ACM Press (2014). doi: 10.1145/2535838.2535841
 16.Leroy, X.: Formal certification of a compiler backend or: programming a compiler with a proof assistant. In: Proceedings of the 33rd ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POpPL 2006, pp. 42–54 (2006). doi: 10.1145/1111037.1111042
 17.Leroy, X.: A formally verified compiler backend. J. Autom. Reasoning 43(4), 363–446 (2009). doi: 10.1007/s1081700991554 MathSciNetCrossRefzbMATHGoogle Scholar
 18.Leroy, X., Grall, H.: Coinductive bigstep operational semantics. Inf. Comput. 207(2), 284–304 (2009). doi: 10.1016/j.ic.2007.12.004 MathSciNetCrossRefzbMATHGoogle Scholar
 19.Moore, J.S.: Symbolic simulation: an ACL2 approach. In: Gopalakrishnan, G.C., Windley, P. (eds.) FMCAD 1998. LNCS, vol. 1522, pp. 334–350. Springer, Heidelberg (1998). doi: 10.1007/3540495193_22 CrossRefGoogle Scholar
 20.Nakata, K., Uustalu, T.: Tracebased coinductive operational semantics for while. In: Berghofer, S., Nipkow, T., Urban, C., Wenzel, M. (eds.) TPHOLs 2009. LNCS, vol. 5674, pp. 375–390. Springer, Heidelberg (2009). doi: 10.1007/9783642033599_26 CrossRefGoogle Scholar
 21.Nipkow, T., Klein, G.: With Isabelle/HOL. Springer, Heidelberg (2014). doi: 10.1007/9783319105420 CrossRefzbMATHGoogle Scholar
 22.Owens, S.: A sound semantics for OCaml light. In: Drossopoulou, S. (ed.) Programming Languages and Systems. Lecture Notes in Computer Science, vol. 4960, pp. 1–15. Springer, Heidelberg (2008). doi: 10.1007/9783540787396_1 CrossRefGoogle Scholar
 23.Reynolds, J.C.: Definitional interpreters for higherorder programming languages. Higherorder and Symbolic Comput. 11(4), 363–397 (1998). doi: 10.1023/A:1010027404223 CrossRefzbMATHGoogle Scholar
 24.Rompf, T., Amin, N.: From F to DOT: type soundness proofs with definitional interpreters (2015). CoRR abs/1510.05216
 25.Siek, J.: Bigstep, diverging or stuck? (2012). http://siek.blogspot.com/2012/07/bigstepdivergingorstuck.html
 26.Siek, J.: Type safety in three easy lemmas (2013). http://siek.blogspot.com/2013/05/typesafetyinthreeeasylemmas.html
 27.Tofte, M.: Type inference for polymorphic references. Inf. Comput. 89(1), 1–34 (1990). doi: 10.1016/08905401(90)90018D MathSciNetCrossRefzbMATHGoogle Scholar
 28.Tollitte, P.N., Delahaye, D., Dubois, C.: Producing certified functional code from inductive specifications. In: Hawblitzel, C., Miller, D. (eds.) CPP 2012. LNCS, vol. 7679, pp. 76–91. Springer, Heidelberg (2012). doi: 10.1007/9783642353086_9 CrossRefGoogle Scholar
 29.Wright, A.K., Felleisen, M.: A syntactic approach to type soundness. Inf. Comput. 115(1), 38–94 (1994). doi: 10.1006/inco.1994.1093 MathSciNetCrossRefzbMATHGoogle Scholar
 30.Young, W.D.: A mechanically verified code generator. J. Autom. Reasoning 5(4), 493–518 (1989). doi: 10.1007/BF00243134 CrossRefGoogle Scholar