1 Introduction

In air combat training simulations, the role of the opponent is often played by virtual entities known as computer generated forces (cgfs). Various research efforts have demonstrated the ability of machine learning (cf. Karli et al. 2017; Teng et al. 2013; Toubman et al. 2016) and other adaptive techniques (cf. Floyd et al. 2017; Karneeb et al. 2018) to generate air combat behaviour models for cgfs. The strength of such techniques is that the computer can automatically adapt the behaviour of the cgfs, and thus the training, to the trainee fighter pilots. However, the creative capabilities of these techniques may result in undesirable (e.g., non-humanlike) behaviour that is not useful for training (Petty 2003). The main idea behind this paper is that newly generated behaviour models should be validated to prove their usefulness in training simulations. In the remainder of this paper, we investigate what this validation entails (see Sect. 2). The two contributions of this paper are the following:

  1. We present a validation procedure for machine-learned air combat behaviour models (see Sect. 3). A key component of the procedure is a newly developed questionnaire for the assessment of the behaviour produced by air combat behaviour models. We call this questionnaire the Assessment Tool for Air Combat cgfs (atacc);

  2. As a case study, we generate novel air combat behaviour models by means of machine learning (see Sect. 4) and apply the validation procedure to the models (see Sect. 5). The results show that the generated behaviour models are valid to some extent, but also that both the behaviour models and the validation procedure require additional effort (see Sect. 6). To the best of our knowledge, this is the first time that the validation of machine-generated air combat behaviour models has been treated as a research subject in its own right (see Sect. 7).

2 The Difficulty of Validating Behaviour Models

Since the advent of simulation in military training, there has been rising interest in the validation of simulation models (cf. Kim et al. 2015; Sargent 2011). Many definitions of validation have been stated throughout the literature (cf. Birta and Arbez 2013; Bruzzone and Massei 2017; Petty 2010). When military simulations in particular are discussed, we find references to the definition of validation that is used by the US Department of Defense (2009). We use this definition from here onwards. For convenience, we restate it below.

Definition 1

(Validation). Validation is “[t]he process of determining the degree to which a model or simulation and its associated data are an accurate representation of the real world from the perspective of the intended uses of the model” (ibid.).

The definition names four important concepts: (1) a process, (2) a degree of accuracy, (3) a model (or simulation), and (4) the intended use of the model. We can readily fill in concepts (3) and (4). Regarding concept (3), the models that we wish to validate are newly generated behaviour models. Furthermore, regarding concept (4), the intended use of these models is to produce behaviour for opponent cgfs in air combat training simulations. However, this leaves open two questions for us to investigate: (1) what the process entails, and (2) how we should determine the accuracy of the models. The difficulty of validating behaviour models lies in answering these two questions for each specific case.

First, we investigate the question of what the process entails. There is no one-size-fits-all solution for validation processes, since different models have (1) different intended uses, and (2) different associated works available for use in the validation. Here, we use the notion “associated work” to refer to a range of results of performed work, e.g., (1) baseline models, (2) expected output data, (3) conceptual diagrams of the modelled phenomenon, or (4) expert knowledge. Even so, the validation methods that can be applied are well described in the literature (cf. Balci 1994; Petty 2010; Sargent 2011). In general, the four categories of validation methods are: (1) informal methods such as face validation, (2) static methods such as evaluating the model structure, (3) dynamic methods that involve executing the model and analysing the output data, and (4) formal methods based on mathematical proofs. An important factor in the choice of validation method(s) to use is the availability of associated works (cf. Petty 2010; Sargent 2011). For example, dynamic methods can only be applied if (1) it is possible to execute the model with input that is relevant with regard to the intended use of the model, (2) data can be collected on the execution of the model, and (3) it is known how the collected data should be interpreted (e.g., compared to another available set of data). In other words, the choice of validation methods is always limited by practical considerations.

The second question we would like to investigate reads: how should we determine the accuracy of the models? For instance, for a physics-based model, the accuracy can be defined in terms of the number of faults that is allowed when the data that the model produces is compared to data measured in the real world. However, for behaviour models the question is particularly difficult to answer, since the notion of a fault is difficult to grasp (Hahn 2017). Goerger et al. (2005) identify five causes of the difficulty of validating behaviour models in general. Four of these causes relate to the problem of defining the accuracy of a behaviour model. These four causes are: (1) the cognitive processes that are modelled may be nonlinear, which makes the processes as well as their models hard to reason about, (2) it is impossible to investigate all possible interactions that may arise in simulations because of the large number of interdependent variables in the models, (3) the metrics for measuring accuracy are inadequate, and (4) there is no “robust” set of input data for the models.

An important consequence of the difficulty of validating behaviour models is that the outcome of a validation should not be interpreted as either “the model is valid” or “the model is not valid”, as it is practically impossible to “completely validate” a model (Birta and Arbez 2013). Therefore, Birta and Arbez (ibid.) note that “degrees of success must be recognized and accepted.” In their view, it is important that the chosen validation methods adequately reflect the extent of the validity of the models.

3 Our Proposed Validation Procedure

In this section, we present our validation procedure for air combat behaviour models. Specifically, the validation procedure is aimed at automatically generated (e.g., machine-learned) behaviour models. The main idea behind the validation procedure is a comparison of (a) the behaviour displayed by cgfs that use the generated behaviour models, to (b) the behaviour displayed by cgfs that use behaviour models that have been written by professional model builders and/or subject matter experts (henceforth the professionals). Essentially, we use the latter, established type of behaviour models to provide a standard of behaviour to which the former, newly generated type of behaviour models should adhere. In other words, we do not aim for the generated models to surpass the established models in any way. Rather, we aim to show their equivalence, so that the new models can be used to supplement the established models, and thereby widen the variety of the training simulations that are offered.

In order to produce observable (and thus comparable) behaviour, all of the models have to be fed with the behaviour of their opponents, i.e., cgfs controlled by human fighter pilots, in a realistic air combat setting (see Sect. 3.1). Next, the displayed behaviour has to be assessed to create data on the basis of which a comparison can be made (see Sect. 3.2). For the actual comparison, we rely on a statistical method known as equivalence testing (see Sect. 3.3). Based on the outcome of the equivalence testing, we can state the extent of the validity of the generated behaviour models. Figure 1 provides an overview of the entire validation procedure.

Fig. 1. The validation procedure. In human-in-the-loop simulations, human fighter pilots engage cgfs that are either controlled by the 4m-models (subject of the validation) or the 4p-models (baseline for comparison). Expert assessors assess the behaviour displayed by the cgfs by means of a newly developed assessment tool. Equivalence testing on the assessments results in a measurable extent of validity of the 4m-models.

3.1 Human-in-the-Loop Simulations

The validation procedure begins with human-in-the-loop simulations in a high-fidelity beyond-visual-range air combat simulator. We consider a simulator that accommodates four human participants acting as fighter pilots. In the simulations, the participants engage a so-called four-ship (viz. a team of four) of hostile cgfs.

The behaviour of the four-ship of cgfs is driven by four behaviour models, one for each cgf. In our experience, the behaviour models for the cgfs in a four-ship are treated as a single model. Especially when the models are designed by professionals, they are carefully tuned to each other to provide the illusion of a cohesive team at work. We henceforth consider the four models that together control the behaviour of a four-ship to be an indivisible unit. For convenience, we introduce the term 4-model to refer to a group of four behaviour models.

Using the term 4-model, we are now able to make the distinction between (1) 4-models that have been written by the professionals, and (2) 4-models that have been generated by means of machine learning. We introduce the terms 4p-model (where the p stands for professional) and 4m-model (where the m stands for machine learning) to refer to these two kinds of 4-model, respectively.

The 4m-models are the subjects of the validation procedure. However, by themselves they are not sufficient input for the validation process. As Petty (2010) stated succinctly, validation “[is a] process[] that compare[s] things.” Therefore, we require either (a) a baseline model, (b) a set of expected output data, or (c) implicit expert knowledge as a reference to compare against the 4m-models.

For complex air combat behaviour models, it is almost infeasible to compile a set of expected output data, since the output depends on a wide range of possible interactions with other entities. However, what we do have available are behaviour models that have been written previously by professionals (i.e., 4p-models). These 4p-models constitute a sample of all behaviour models that have been written by the professionals, comparable to how the 4m-models that are validated are a sample of the behaviour models that can possibly be generated by machine learning. Furthermore, we argue that since the 4p-models have been developed by means of the behaviour modelling process, the 4p-models have themselves been validated to some extent. We therefore add 4p-models as the second input to the validation process.

We record the human-in-the-loop simulations, resulting in a set of behaviour traces. The behaviour traces contain three-dimensional recordings of the simulated airspace, including the movements of all entities (i.e., cgfs and missiles) flying in the airspace. The behaviour traces serve as input for the assessment (see next section).

3.2 Assessment

The goal of the assessment is to summarise the cgfs’ behaviour that is encoded in the behaviour traces into values that are (1) meaningful and (2) comparable between the 4m-models and the 4p-models. The assessment is performed by means of a structured form of face validation, which is one of the informal validation methods (see Sect. 2).

However, there is little to no information available on measures for cgf behaviour that are relevant to training simulations. Therefore, we make use of the implicit knowledge of expert evaluators. We leverage this knowledge in two ways. First, we elicit knowledge on measures for behaviour of air combat cgfs, and then structure this knowledge into a novel assessment tool which we call the Assessment Tool for Air Combat cgfs (atacc) (see below). This tool enables a structured assessment of cgf behaviour. Second, expert evaluators review the behaviour traces that we have collected, and then assess the behaviour that the cgfs display. The result of the assessment is a series of ratings on Likert scales. The ratings serve as input for the equivalence tests (see next section). Below, we describe the development and contents of the atacc.

The Assessment Tool for Air Combat CGFs. Together with instructor fighter pilots, we identified three performance dimensions that should be taken into consideration in the assessment of the behaviour of air combat cgfs. These performance dimensions are (1) the challenge provided by the cgfs, (2) the situational awareness that the cgfs display, and (3) the realism of the behaviour of the cgfs. We briefly describe the three performance dimensions below.

  • Performance dimension 1: Challenge. The tool should measure (1) whether the cgfs behave in such a way that the human participants in the simulations need to think about and adjust their actions, and (2) whether the cgfs provide some form of training value to the simulations.

  • Performance dimension 2: Situational awareness. The tool should measure (1) whether the cgfs appear to sense and react to changes in their environment, and (2) whether multiple cgfs belonging to the same team appear to acknowledge each other’s presence.

  • Performance dimension 3: Realism. The assessment tool should measure (1) whether the cgfs behave as can be expected from their real-world counterparts, and (2) whether the cgfs use the capabilities of their platform (including e.g., sensors and weapons) in a realistic manner.

Next, we attempted to formulate examples of behaviour that relate to each of the performance dimensions. This was done in an iterative manner, such that examples that were proposed could be critically analysed by each of the instructor fighter pilots. We formulated eight examples of behaviour in total (listed below). Examples 1 through 4 relate to Challenge; 5 and 6 to Situational awareness; and 7 and 8 to Realism. In each of the examples, red air refers to the cgfs, whereas blue air refers to the human participants in the human-in-the-loop simulations.

  • Example 1. Red air forced blue air to change their tactical plan.

  • Example 2. Red air forced blue air to change their shot doctrine.

  • Example 3. Red air was within factor range.

  • Example 4. Blue air was able to fire without threat from red air.

  • Example 5. Red air acted on blue air’s geometry.

  • Example 6. Red air acted on blue air’s weapon engagement zone.

  • Example 7. Red air flew with kinematic realism.

  • Example 8. Red air’s behaviour was intelligent.

In the atacc, each of the eight examples of behaviour is presented as a separate rating item, so that the presence of each behaviour is rated on a five-point Likert scale. For all of the eight rating items, the scale is labelled as ranging from Never to Always. To conclude the atacc, we added a general ninth rating item stating “Red air’s behaviour tested blue air’s tactical air combat skills.” This item served to provide us with a general indication of the usefulness of the behaviour of the cgfs in relation to the human-in-the-loop simulations that were performed. The ninth item is also rated on a five-point Likert scale, ranging from Strongly disagree to Strongly agree.
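To make the structure of the atacc concrete, the sketch below encodes the nine rating items and their scales as a simple data structure. This is our own illustration in Python; the identifiers (ATACC_ITEMS, LIKERT_POINTS) are hypothetical and do not appear on the actual paper form, while the item texts and scale labels follow the description above.

    # Encoding of the atacc as data (illustration only; identifiers are our own).
    ATACC_ITEMS = [
        # (item number, performance dimension, statement, scale end labels)
        (1, "Challenge", "Red air forced blue air to change their tactical plan", ("Never", "Always")),
        (2, "Challenge", "Red air forced blue air to change their shot doctrine", ("Never", "Always")),
        (3, "Challenge", "Red air was within factor range", ("Never", "Always")),
        (4, "Challenge", "Blue air was able to fire without threat from red air", ("Never", "Always")),
        (5, "Situational awareness", "Red air acted on blue air's geometry", ("Never", "Always")),
        (6, "Situational awareness", "Red air acted on blue air's weapon engagement zone", ("Never", "Always")),
        (7, "Realism", "Red air flew with kinematic realism", ("Never", "Always")),
        (8, "Realism", "Red air's behaviour was intelligent", ("Never", "Always")),
        (9, "General", "Red air's behaviour tested blue air's tactical air combat skills",
         ("Strongly disagree", "Strongly agree")),
    ]
    LIKERT_POINTS = 5  # every item is rated on a five-point scale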

3.3 Equivalence Testing

At this point in the validation process, we have two sets of data: (1) the assessment of the 4p-models, and (2) the assessment of the 4m-models. We wish to compare these two sets of data in a meaningful way. Since we used the 4p-models as the baseline, we assume that the assessment of the 4p-models contains information about the desirable properties of air combat cgf behaviour. Based on this assumption, we define the measure of validity of the 4m-models as the extent to which the assessment of the 4m-models and the assessment of the 4p-models can be shown to be equivalent.

Obviously, a simple comparison of the assessments (viz., determining whether the difference between them equals zero) is too strict. The results of our assessments include noise from multiple sources (e.g., the behaviour of the pilots in the human-in-the-loop simulations, and possible bias of the assessors). Furthermore, standard statistical significance tests do not suffice, since these tests check for differences rather than for equivalence. We found a solution in a form of comparison testing that is called equivalence testing.

The two one-sided tests (tost) method tests for equivalence of the means of two populations (cf. Anderson-Cook and Borror 2016; Lakens 2017; Meyners 2012). To this end, the method starts with the assumption that the two populations are different, and then collects evidence to show that the populations are the same. Note that this is the opposite of traditional tests that compare two populations (e.g., Student’s t-test), which (1) start with the assumption that two populations are similar or even the same, and then (2) collect evidence to show that the populations are different.

In tost, the assumption that two populations are different (viz., the null hypothesis or \(H_{0}\)) is stated as follows.

$$\begin{aligned} H_{0}: \quad \mu _{A} - \mu _{B} \le \delta _{L} \quad \text {or} \quad \mu _{A} - \mu _{B} \ge \delta _{U} \end{aligned}$$
(1)

Here, the difference of the means of two populations A and B is considered. Two populations are considered different if the difference of their means lies outside of the indifference zone \([\delta _{L},\delta _{U}]\). We assume that the indifference zone is symmetrical, i.e., \(\delta = \delta _{U} = -\delta _{L}\). However, we are interested in examining the hypothesis that the means are not different, i.e., that the difference between the means lies inside of the indifference zone. The reformulation of the hypothesis (viz. the alternative hypothesis or \(H_{1}\)) is stated as follows.

$$\begin{aligned} H_{1}: \quad \delta _{L}< \mu _{A} - \mu _{B} < \delta _{U} \end{aligned}$$
(2)

If the tost finds evidence that the difference of the means lies within the indifference zone under the assumption that it does not, we reject \(H_{0}\) and accept \(H_{1}\), meaning that we conclude that the populations are the same (up to a practically negligible difference). This evidence is found by splitting \(H_{0}\) into two hypotheses, each of which can be tested using a standard one-sided t-test. The p-value of the tost then becomes the maximum of the two p-values that are obtained from the two one-sided t-tests.
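For clarity, the two sub-hypotheses obtained by splitting \(H_{0}\) can be written as follows (this is the standard tost decomposition; the notation follows Eqs. 1 and 2).

$$\begin{aligned} H_{01}: \quad \mu _{A} - \mu _{B} \le \delta _{L} \qquad \qquad H_{02}: \quad \mu _{A} - \mu _{B} \ge \delta _{U} \end{aligned}$$

Only when both \(H_{01}\) and \(H_{02}\) are rejected does the difference of the means lie strictly inside the indifference zone, which is why the larger of the two one-sided p-values is reported as the p-value of the tost.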

The outcome of the tost greatly depends on the value chosen for \(\delta \). Until recently, \(\delta \) could not be calculated directly. It was either (1) prescribed by regulatory agencies (e.g., in the field of pharmacology) or (2) determined by subject matter experts based on reference studies or expectations about the data (e.g., in psychology) (cf. Anderson-Cook and Borror 2016; Lakens 2017). For our validation, it is difficult to determine a suitable \(\delta \), since we have neither a regulatory agency nor a reference study available. However, an objective calculation of \(\delta \) was introduced by Juzek (2016). The calculation of this \(\delta \) (henceforth: Juzek’s \(\delta \)) is as follows.

$$\begin{aligned} \delta = 4.58 \frac{s_{p}}{N_{p}} \end{aligned}$$
(3)

Here, \(s_{p}\) is the pooled standard deviation of the two samples under comparison, and \(N_{p}\) is the pooled number of data points in the samples. Juzek found the coefficient (4.58) by simulating a large number of tost applications. The coefficient was approximated in such a way that Juzek’s \(\delta \) gives the tost the desired confidence level (\(1-\alpha =95\%\)) and statistical power (\(1-\beta =80\%\)).
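As an illustration, the sketch below computes Juzek’s \(\delta \) from two samples of ratings and then performs the tost with Welch’s one-sided t-tests as the underlying tests (as is done later in Sect. 6). This is a minimal sketch under our own assumptions: the function names are hypothetical, Eq. 3 is followed literally, and we take “pooled standard deviation” to mean the usual variance-weighted pooled estimate.

    import numpy as np
    from scipy import stats

    def juzek_delta(a, b):
        """Indifference-zone half-width following Eq. 3: delta = 4.58 * s_p / N_p."""
        a, b = np.asarray(a, float), np.asarray(b, float)
        n_a, n_b = len(a), len(b)
        # Variance-weighted pooled standard deviation of the two samples (our assumption).
        s_p = np.sqrt(((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2))
        return 4.58 * s_p / (n_a + n_b)  # N_p: pooled number of data points

    def tost_welch(a, b, delta):
        """Two one-sided Welch t-tests against the bounds [-delta, +delta].
        Returns the tost p-value (the larger of the two one-sided p-values)."""
        a, b = np.asarray(a, float), np.asarray(b, float)
        diff = a.mean() - b.mean()
        va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
        se = np.sqrt(va + vb)
        # Welch-Satterthwaite degrees of freedom.
        df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
        p_lower = stats.t.sf((diff + delta) / se, df)   # H01: diff <= -delta
        p_upper = stats.t.cdf((diff - delta) / se, df)  # H02: diff >= +delta
        return max(p_lower, p_upper)

For example, tost_welch(ratings_4p, ratings_4m, juzek_delta(ratings_4p, ratings_4m)) yields the equivalence p-value for one rating item, where ratings_4p and ratings_4m are placeholder names for the coded responses to that item.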

Armed with the tost method, we are now able to test the statistical equivalence of the assessments for the 4p-models and the 4m-models per rating item. The extent to which the rating items are equivalent can then be seen as the extent to which the 4m-models are valid.

4 Generating Air Combat Behaviour Models

We generated four novel 4m-models in preparation for the application of the validation procedure. These 4m-models served as the subject of the validation. The 4m-models were generated by means of the dynamic scripting machine learning algorithm (Spronck et al. 2006). The specific method for applying dynamic scripting to generate the 4m-models for air combat simulations is described by Toubman et al. (2016). We do not restate the full method here, as it is not the focus of this paper. In brief, the method consists of the following three steps:

  1. We obtain four 4p-models that have been written by a professional and that have seen use in actual training simulations;

  2. We decompose the 4p-models into their constituent “states” and the transitions between these states;

  3. The dynamic scripting algorithm repeatedly recombines the states and transitions into new behaviour models (4m-models) and tests these models in automated, agent-based simulations. The algorithm halts after a certain number of repetitions and returns the four best performing (viz. most-winning) 4m-models that it has found (a schematic sketch of this step is given below).

The use of this method thus results in (a) the four professionally written 4p-models obtained in the first step, and (b) the four machine-generated 4m-models obtained in the third step. Together, the eight models serve as input to the validation procedure (see Sect. 5).
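To make the third step more concrete, the sketch below shows a heavily simplified recombination loop in Python. It is our own schematic rendering, not the actual implementation: the weight values, the simulate() stand-in, and the use of win counts for selecting the best models are assumptions for illustration only; the real method is described by Spronck et al. (2006) and Toubman et al. (2016).

    import random

    def simulate(model):
        """Hypothetical stand-in for one automated, agent-based simulation.
        Returns True if the four-ship controlled by `model` wins the engagement."""
        return random.random() < 0.5

    def dynamic_scripting(rules, script_size=8, episodes=1000, reward=5, penalty=3, keep=4):
        """Schematic recombination loop: weighted selection of building blocks
        ("rules", i.e. the states and transitions taken from the 4p-models),
        weight updates based on simulation outcomes, and selection of the
        most-winning candidate models."""
        weights = {rule: 100 for rule in rules}  # uniform initial weights
        wins = {}                                # win counts per candidate model
        for _ in range(episodes):
            # Compose a candidate behaviour model by weighted sampling of rules.
            model = tuple(random.choices(list(weights), weights=list(weights.values()), k=script_size))
            won = simulate(model)
            wins[model] = wins.get(model, 0) + int(won)
            # Reinforce the rules of winning models; weaken them otherwise.
            for rule in model:
                weights[rule] = max(1, weights[rule] + (reward if won else -penalty))
        # Return the `keep` most-winning candidate models found during learning.
        return sorted(wins, key=wins.get, reverse=True)[:keep]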

5 Applying the Validation Procedure

In this section, we report on the application of the validation procedure (see Sect. 3) to a set of newly generated 4m-models (see Sect. 4). We present the application in the form of an experiment: the current section contains the “experimental method”, i.e., gathering behaviour traces in human-in-the-loop simulations (see Sect. 5.1) and performing the assessment (see Sect. 5.2). Later, we present the “experimental results”, i.e., the ratings obtained from the assessment and the results of the equivalence tests (see Sect. 6).

5.1 Human-in-the-Loop Simulations

Human-in-the-loop simulations were used to determine how a four-ship of red cgfs behaves when the cgfs interact with human participants controlling four blue cgfs. The simulations were performed in nlr’s Fighter 4-Ship simulator. This simulator consists of four networked mock-up cockpits.

The behaviour of the reds was controlled by means of eight 4-models: the four 4p-models plus the four 4m-models (see Sect. 4). Using these eight 4-models, we defined eight scenarios. Each scenario was a simulation configuration in which a four-ship of red cgfs approached the human participants from the simulated north. In each scenario, the red four-ship used either one of the four 4p-models or one of the four 4m-models, so that each of the 4-models was used in one of the scenarios.

The human participants in the simulations were active-duty Royal Netherlands Air Force (rnlaf) pilots from Volkel Airbase (all male, n = 16, age \(\mu = 32.0\), \(\sigma = 5.35\)), and one former rnlaf pilot (age \(= 60\)). No selection criteria were applied. The active-duty pilots were assigned to the human-in-the-loop simulations based on availability. Experience levels ranged from wingman to weapons instructor pilot.

Over the course of three days, five teams of four participants controlled the blue cgfs in the Fighter 4-Ship. Before the simulations took place, the participants received a “mission briefing” document that described (1) the capabilities of the blue cgfs that they would control, and (2) the capabilities of the red cgfs that the participants were to expect in the simulator. The eight scenarios were presented sequentially in a random order. The participants were unaware of the origin of the 4-models controlling the red cgfs (i.e., the simulations were performed in a single-blinded fashion). Each scenario ended when either all four red cgfs, or all four human participants were defeated.

The human-in-the-loop simulations were recorded using the pcds mission debrief software. In addition to behaviour traces, the recordings included (1) the voice communication that took place among the human participants, and (2) video recordings of the multi-functional displays of the cockpits occupied by the human participants. In total, 33 recordings were stored.

5.2 Assessment

The behaviour that the reds displayed in the human-in-the-loop simulations was assessed by human experts. Active-duty rnlaf pilots from Leeuwarden Airbase acted as assessors (all male, \(n = 5\), age \(\mu = 35.2\), \(\sigma = 5.17\)). Assessors were selected on the basis of holding a tactical instructor pilot or weapons instructor pilot qualification. All five assessors had the weapons instructor pilot qualification. The assessment was performed by means of the atacc, implemented on paper.

Originally, we had planned to let each assessor assess all of the 33 recordings within a three-hour time span. However, a pilot study with two weapons instructor pilots (not counted above) revealed that this was infeasible because of time constraints. We subsequently reduced the pool of recordings available for rating to 16 recordings. These 16 recordings came from two teams that completed all eight scenarios (i.e., simulations with the four 4p-models and the four 4m-models) in human-in-the-loop simulations. From this reduced pool of recordings, we assigned ten recordings to each assessor, consisting of (1) eight recordings from one of the two teams in random order, and (2) two recordings from the other team. Furthermore, the weapons instructor pilots in the pilot study expressed that they were unable to adequately assess the intelligence of the red cgfs (rating item 8) and the extent to which the red cgfs tested the skills of the pilots in the simulator (rating item 9) without knowing the experience levels of these pilots. Based on this feedback, we decided to disclose the experience levels to the assessors during the assessment.

The assessors were provided with a laptop computer with mouse and headphones, a stack of ten atacc forms, and an instruction sheet. The pcds recordings were opened on the computer. Each atacc was marked with a unique code that referred to a specific recording in pcds. The assessors were instructed to view the recordings in the (pre-randomised) order as indicated by their ataccs.

6 Validation Results

In this section, we present the results of (a) the assessments and (b) the equivalence tests that were performed. Additionally, we provide the results of (c) follow-up tests in the cases where no equivalence was found.

Assessment Results. A summary of the responses to the atacc is given in Table 1. The responses to the Likert scale rating items were coded as integer values ranging from 1 (Never/Strongly disagree) to 5 (Always/Strongly agree). The coding for rating item 4 (Blue air was able to fire without threat from red air) was inverted so that the values reflected the occurrence of red behaviour (i.e., red influencing blue’s ability to fire).
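The coding, including the inversion of rating item 4, amounts to the following small transformation (a minimal sketch in Python with pandas; the column names and example values are our own, hypothetical choices).

    import pandas as pd

    # Responses are coded 1 (Never/Strongly disagree) to 5 (Always/Strongly agree).
    responses = pd.DataFrame({
        "item":   [1, 3, 4, 4, 9],
        "rating": [3, 4, 5, 2, 4],
    })

    # Reverse-code item 4 so that higher values reflect red air restricting
    # blue air's ability to fire (1 <-> 5, 2 <-> 4, 3 stays 3).
    mask = responses["item"] == 4
    responses.loc[mask, "rating"] = 6 - responses.loc[mask, "rating"]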

Table 1. Summary of the atacc responses: the number of responses (n), mean response (\(\mu \)), and standard deviation (\(\sigma \)) of the responses to the atacc rating items for the 4p-models and the 4m-models.

Equivalence Testing. We applied Schuirmann’s (1987) tost method to determine the equivalence of (1) the responses given on the atacc for the 4p-models, and (2) the responses given on the atacc for the 4m-models. We calculated \(\delta \) (as Juzek’s \(\delta \)) for the responses to each rating item of the atacc, and then performed the tost on the responses to each rating item. The tost was performed using the TOSTtwo.raw function from R’s TOSTER package, with Welch’s t-test as the underlying one-sided test. We chose Welch’s t-test here because of the unequal sample sizes. The \(\delta \) and the results of the tost (t-value, degrees of freedom [df], p-value, and the 90% confidence interval [ci] of the difference of the means) are shown in Table 2. In Table 2, the bold p-values indicate a significant result of the tost. Based on the results of the tost, we conclude that the responses to rating items 1, 2, 5, 7, 8, and 9 are equivalent between the 4p-models and the 4m-models (see Sect. 3.2 for the definitions of the examples of behaviour represented by these rating items).

Table 2. Results of the tost method per rating item (i.). The tost was based on Welch’s t-test. For rating items where the tost method did not find equivalence, an additional standard (Welch’s) t-test was performed. Significant p-values at the \(\alpha =\) 0.05 level are indicated in bold. The relevance (rel.) of the outcome of the tests is indicated in the rightmost column.

Follow-Up Testing. The tost did not find equivalence for rating items 3, 4, and 6. For these rating items, we conducted a follow-up test to determine whether the responses significantly differed between the 4p-models and the 4m-models. This test was a standard two-sided Welch’s t-test. A significant difference was found for rating items 3 (Red air was within factor range) and 6 (Red air acted on blue air’s weapon engagement zone). For both rating items, the responses indicated a higher frequency of the rated behaviour for the 4m-models (see Table 1). The responses to rating item 4 (Blue air was able to fire without threat from red air) were neither significantly equivalent nor significantly different. Therefore, we conclude that the relationship between the responses for this rating item is undecided.
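The resulting three-way interpretation per rating item can be summarised as follows (a sketch of the decision logic only; the function name and the parameter alpha are our own, and the p-values come from the tost and the follow-up Welch t-test described above).

    def classify_rating_item(p_tost, p_ttest, alpha=0.05):
        """Classify one rating item from the tost p-value and, where needed,
        the p-value of the follow-up two-sided Welch t-test."""
        if p_tost < alpha:
            return "equivalent"   # responses for 4m- and 4p-models are statistically equivalent
        if p_ttest < alpha:
            return "different"    # follow-up test finds a significant difference
        return "undecided"        # neither equivalence nor difference is established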

7 Discussion and Related Work

We started this paper by decomposing the difficulty of validating behaviour models into two questions (see Sect. 2): what does the process entail? and how should we determine the accuracy of the models? For the case of air combat behaviour models, our answer to the first question is the procedure laid out in Sect. 3. Our answer to the second question is embedded in the procedure: we determine the accuracy of newly generated 4m-models by a combination of simulation technology, behavioural science, statistical methods, and human input.

For our case study, we generated a new set of air combat behaviour models by means of machine learning, and applied the validation procedure to these models. Our key finding is that out of the nine rating items of the atacc, six are assessed as equivalent between the 4m-models and the 4p-models. Following the advice of Birta and Arbez (2013) to recognise the partial success, the results appear to moderately indicate validity. Still, the responses to the remaining three rating items do not support the notion of validity as we have defined it.

Is there any way that we could have achieved a more convincing indication towards the (non-)validity of the new models? We must acknowledge the large number of variables in our study, e.g., (1) the 4p-models, (2) the 4m-models, (3) the pilots, (4) the assessors, and (5) the atacc. While efforts could be made to control the “noise” from these variables, it is important to consider that (1) and (2) exist in too many variations to ever be sampled effectively, and that (3) and (4) assist with all the implicit and explicit knowledge they have to offer. The contribution of this knowledge should be stimulated before it is controlled. We therefore propose that improvements should be sought in the area of the assessment tool (5), such as refinement of the examples of behaviour posed by the atacc. One interesting approach might be to incorporate recent work on the mission essential competencies (mecs) into the tool (see, e.g., MacMillan et al. 2013; Tsifetakis and Kontogiannis 2017).

The validation study performed by Sadagic (2010) most closely resembles our work. The subjects of this study were behaviour models for troops in urban warfare. Expert assessors observed the behaviour of these troops, and rated its realism. The work of Sadagic differs from ours in that their simulations had no human participants. Furthermore, no statistical tests were performed, as the behaviour was rated directly against the assessors’ ideal of realistic behaviour.

In the air combat domain, we find small-scale validation studies attached to machine learning experiments. For instance, Teng et al. (2013) show that their adaptive cgfs are rated more favourably than non-adaptive cgfs on certain qualities (e.g., predictability and aggression) by expert assessors. However, in contrast to our work, Teng, Tan, and Teow aimed to develop cgfs that showed improvement on these qualities, rather than to find equivalence. By focusing on improvement, the adaptive capabilities of the cgfs were validated, but the question remains whether the improved qualities are useful for training.

In conclusion, properly validating air combat behaviour models is difficult to accomplish, yet essential for the training simulations that aim to use them. The validation procedure that we propose is likely one of many possible solutions. We invite more machine learning researchers and training experts to jointly address the issue of validation in future research, thereby paving the way to reliable adaptive training of teams.