Introduction

The use of multiple-choice questions (MCQs) to assess higher order learning is standard practice in the assessment of clinical knowledge in the health professions [1, 2]. However, the creation of such questions is not straightforward, often requiring ‘Item Writing Workshops’ to train staff in the creation of clinical scenarios for such questions [3, 4]. There is a need to expand the use of higher order MCQs beyond the evaluation of clinical scenarios and into medical science education more generally, where assessments of ‘higher order learning’ may currently take the form of written coursework such as essays and dissertations. These assessments tend to show limited coverage of the curriculum and are vulnerable to a number of forms of misconduct, such as plagiarism [5], outsourcing to commercial writers [6], and the use of chatbot AI tools such as ChatGPT [7].

Many authors have written guidance for the use of MCQs to assess higher order learning, and these were recently reviewed into a set of guidance principles for educators wishing to write their own questions [8]. A key feature of MCQs that assess higher order learning is the use of problem-type scenarios, which the learner then solves, rather than the recall of a standalone fact [9]. The scenario should contain a substantial amount of information, as is the case in real-world problems, and the test-taker has to critically appraise that information to identify the key relevant features [10]. Another feature of MCQs which assess higher order learning is the use of ‘assumed knowledge’; i.e., the student is required to have specific subject knowledge or skills which are missing from the problem scenario and thus act as a cognitive bridge between the problem/question and the answer options [11]. Students who do not have this knowledge will not be able to answer the question.

Higher order learning is usually defined with reference to Bloom’s taxonomy or some other hierarchy (e.g. [1, 2, 9, 11,12,13,14]). Bloom’s taxonomy [15, 16] is usually presented as a hierarchy of verbs for the creation of learning outcomes, with the principle that outcomes created using verbs from the base of the hierarchy (e.g. ‘list’, ‘recall’) represent ‘lower order’ learning, whereas verbs at the higher end of the taxonomy (e.g. ‘evaluate’, ‘justify’) represent ‘higher order’ learning [17]. However, Bloom’s taxonomy has faced many decades of criticism from multiple angles, including the concern that it cannot meaningfully identify ‘higher order’ learning [18,19,20,21], and recent studies in the UK and the USA have shown that the presentation of Bloom’s taxonomy is remarkably inconsistent between universities and other sources; one presentation may place a verb at the very bottom of the taxonomy, whereas another will place that same verb at the very top [17, 22].

Despite these criticisms, it is still common for Bloom’s taxonomy to be used as the reference point when defining higher order assessment items. One approach is to ask participants (e.g. academics or students) to assign assessment items to the relevant level of Bloom’s taxonomy, or to identify relevant action verbs when writing the items [1, 11, 23]. These sorts of validations were undertaken in the studies which were reviewed to develop the higher-order item writing guidance being tested here [8]. However, this sort of validation normally requires collapsing the six tiers of the taxonomy into just two or three, and then asking subject experts to make subjective ratings about the level of learning which is being assessed by the question. Even then, it has been repeatedly shown that academics find it difficult to do this reliably (e.g. ratings of an assessment item as ‘higher’ or ‘lower’ order will vary between academics) [24,25,26,27]. Other taxonomies of learning have also been used to classify learning by cognitive level; for example, the ‘Structure of Observed Learning Outcomes’ (SOLO) Taxonomy, popular in higher education in the UK, has been used to design MCQs which aim at assessing ‘higher order’ learning [28], while in North America, clinical question banks may be classified into 1st, 2nd, and 3rd orders, with 1st-order questions assessing factual recall, while 3rd-order questions require critical thinking and problem-solving [29, 30]. Again though, while there are guidelines for the creation of questions which map to these levels, there is a paucity of objective evidence showing that the assignment of questions to these levels is reliable.

An alternative approach is to generate objective data on the ability of students to correctly answer the questions, dependent on their level of expertise, on the basis that students with existing lower order knowledge will find it easier to answer the higher order questions than subject novices who do not have that knowledge. This hypothesis is supported by some prior research (e.g. [23]), but the majority of research in this area still appears to rely on subjective ratings of question difficulty. Here, then, we tested guidance designed to help educators write higher order MCQs [8] by generating objective data on the ability of subject novices vs experts to answer questions which had been written using the guidance.

Methodology

Guidance for the Creation of Higher Order MCQs

This guidance has been published previously, in a detailed form [8], based on a number of papers describing the use of MCQs to assess higher order learning [10, 12, 14, 31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46]. A summary of the guidelines is shown in Table 1. Examples of higher order questions written using the guidance, and their lower order equivalents, are given in ESM Appendix 1.

Table 1 Summary of principles for the creation of higher-order bridge MCQs. Derived from [8]

We conducted two different sets of experiments to compare questions written using the guidance with their lower order counterparts. A summary of the key differences between these experiments is given in Table 2.

Table 2 Summary of the two sets of experiments conducted here

All studies were carried out using Qualtrics surveys, with student participants recruited via the online labour market Prolific (www.Prolific.com). The fee was set at an estimated hourly rate of £8, which Prolific advised was a ‘good’ rate at the time of the study. Participants were screened into expert or novice groups on the basis of the subjects they were studying at university. The full list of subjects used to screen participants for each study is shown in ESM Appendix 1. Before beginning the study, all participants were given information about their data protection rights, their right to withdraw at any time, and a contact email for any questions. We were not aware of any similar studies in the literature which might give a reasonable expectation of the likely sizes of any differences between the experimental groups, and so we were unable to undertake a meaningful power analysis and sample size calculation. The sample sizes are therefore based on the experience of the authors.

Ethical Approval

Both experiments were approved by the ethics committee of the Swansea University Medical School (ref. SUMS RESC 2022-0042A).

Experiment 1. Novices Only, Proof of Principle

An initial study was conducted to determine whether subject matter novices could meaningfully answer lower order questions under time-limited conditions. In the same experiment, we then evaluated whether this ability would be reduced when the questions were rewritten using the guidance from Table 1. Questions were tested on novice students, meaning that they were studying subjects unrelated to the subject matter of the questions (neuroscience). Each question was asked in both higher and lower order formats, and under both closed book and open book conditions.

Procedure

We evaluated the performance of the questions under two conditions: a ‘closed book’ condition, wherein participants were asked to try to answer the questions without referring to any other sources, and an ‘open book’ condition, wherein participants were encouraged to ‘cheat’ by any means, for example by using Google or Alexa. (These experiments were conducted in Summer 2022, prior to the launch of ChatGPT.) Participants were asked to ‘cheat’ rather than to use ‘an open book condition’ on the basis that this would be more familiar terminology and so make it easier for participants to follow the instructions correctly. To minimise confusion in the instructions, each participant answered in only one condition (closed or open book) but was asked questions in both formats (lower order or higher order), although the format difference was not explained. Thus, a participant only saw one version of an individual question. The same questions were asked, in the same order, in each group, but in different formats (higher order or lower order). The experimental groups were therefore as follows:

  1. Higher-lower (closed book)
  2. Lower-higher (closed book)
  3. Higher-lower (open book)
  4. Lower-higher (open book)

Thus, groups 1 and 3 would start with Q1 in a higher order format, followed by Q2 in a lower order format, then Q3 in a higher order format, and so on. Groups 2 and 4 would start with Q1 in a lower order format, Q2 in a higher order format, and so on. Dropout and motivation were a concern, since it seemed likely that participants would be unable to answer the questions, and so the study was conducted in two parts: an initial proof-of-principle pilot evaluated the performance of 12 questions, and the second part evaluated the performance of 5 further questions. Twelve participants were recruited to each group, except for group 3, which had 13 participants in the initial pilot due to a technical error. The tasks on Prolific were set so that any one participant could only be recruited to one study.
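As a purely illustrative sketch of this counterbalancing (not part of the study materials; the function and labels below are hypothetical), the sequence of question formats seen by each group could be generated as follows:

```python
# Illustrative sketch only (not part of the study materials): the alternating
# higher/lower order formats seen by each group in Experiment 1.

def format_sequence(first_format: str, n_questions: int) -> list[str]:
    """Alternate between 'higher' and 'lower' order formats, starting with first_format."""
    other = "lower" if first_format == "higher" else "higher"
    return [first_format if i % 2 == 0 else other for i in range(n_questions)]

# Group number -> (format of Q1, test condition); labels are hypothetical.
groups = {
    1: ("higher", "closed book"),
    2: ("lower", "closed book"),
    3: ("higher", "open book"),
    4: ("lower", "open book"),
}

for group, (first_format, condition) in groups.items():
    seq = format_sequence(first_format, n_questions=12)  # 12 questions in the initial pilot
    print(f"Group {group} ({condition}): {seq}")
```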

The task was advertised on Prolific as ‘A Study about University Assessment Methods’. Participants were given the instructions shown in Fig. 1.

Fig. 1

Instructions given to participants in A the closed book condition and B the open book condition of Experiment 1

The participants were then given a maximum of 90 s to answer each question, after which the survey automatically advanced to the next question. At the end of the survey, the purpose of the study was explained in more detail, and those who had participated in the open book group were asked what methods they had used.

Experiment 2. Novice vs Expert, Testing of Full Guidance

The first experiment used both lower and higher order questions written by the authors, which is a potential source of bias given that we also designed the research. Experiment 2 was designed to test the guidance in a more general context that reflects its practical application, by starting with lower order questions that had not been written by the authors, but for which subject novices could obtain the correct answer through simple Googling. These questions were identified as described below and then rewritten by the authors following the guidance in Table 1. We hypothesised that, once rewritten, the questions could still be answered by subject experts, but not by novices, even in the open book condition. In Experiment 1, participants were also not provided with any specific motivation to find the correct answer in the open book condition. We did not want participants to simply give up, especially since this seemed more likely for novices and so could artificially bias the data. This was addressed in Experiment 2 by giving participants a small financial incentive for answering correctly under the open book condition. A final difference between the two studies was that, in Experiment 1, the answer options differed between some of the lower and higher order formats of the same question, in accordance with the guidance (e.g. to make the answer options active). Here we retained the same answer options for the lower and higher order versions, including keeping the same correct answer. This was to ensure that the revised questions were assessing the same learning outcome as the original lower order question, although it did present a challenge when applying some aspects of the guidance (e.g. to make the answer options active, and to present them in plain language).

A pilot study was undertaken to identify ‘lower order’ questions which novices could only answer under open book conditions. Twenty questions were selected from the introductory chapter of a genetics textbook [47] and from past papers of the United Kingdom Biology A-level exam (the exams taken by students aged ~18, in part as the basis for entry to higher education). They were selected by the authors on the basis that their answers would be reasonably familiar to current undergraduates studying the life sciences.

Participants were then given the instructions shown in Fig. 2.

Fig. 2

Instructions given to all participants at A the beginning (the closed book condition) and B the halfway point (the open book condition) of Experiment 2. For the second part of Experiment 2, only 10 questions were used, and so the bonus payments and instructions identified in section B were modified accordingly

Twenty participants were recruited to each of four groups. This experiment also included two attention-check questions, one in each condition. These appeared to be regular multiple-choice questions, formatted in the same way as the others in the study, but with a question stem that included an instruction to select a specific answer. In keeping with the Prolific.co guidance on attention checks, participants who failed both attention checks by selecting the incorrect answer were not paid and their data were not included; additional participants were then recruited as required. The four groups were as follows:

  1. Novices, closed book
  2. Novices, open book
  3. Experts, closed book
  4. Experts, open book

Using these pilot data, we identified questions for further development, based upon the following criteria:

  • Large difference in performance on open book vs closed book for novices (i.e. questions in which the open book condition allowed the participants to score highly)

  • Experts scored highly under both conditions

  • Experts scored higher than novices under the closed book condition, to ensure that the expertise being tested was not common knowledge.

The precise metrics for identifying questions were not formalised; each question was considered individually by both authors, and the selected questions were then drafted into the higher order format using the guidance in Table 1. Both authors drafted higher order questions, and a final version of each question was agreed through discussion. The experiment was then repeated with ten higher order questions.

Analysis

For both experiments, analysis was undertaken by question, with the dependent variable being the percentage of participants who answered a question correctly. Specific statistical tests are identified in the relevant section of the results.
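To make the unit of analysis concrete: the statistical analysis itself was run in GraphPad Prism (see below), but the dependent variable can be illustrated with a short pandas sketch; the data frame and column names here are hypothetical and not the study data.

```python
import pandas as pd

# Hypothetical long-format response data: one row per participant x question.
responses = pd.DataFrame({
    "question":        ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "question_format": ["higher", "higher", "higher", "lower", "lower", "lower"],
    "test_format":     ["open", "open", "closed", "open", "open", "closed"],
    "correct":         [0, 1, 0, 1, 1, 0],  # 1 = answered correctly
})

# Dependent variable: the percentage of participants answering each question
# correctly within each question format x test format cell.
percent_correct = (
    responses
    .groupby(["question", "question_format", "test_format"])["correct"]
    .mean()
    .mul(100)
    .rename("percent_correct")
    .reset_index()
)
print(percent_correct)
```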

To characterise the methods used by participants in the open book condition, a simple quantitative content analysis [48] was performed on the very brief free-text comments left by participants.
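A minimal sketch of this kind of keyword count is shown below; the comments and categories are invented for illustration and are not the study data.

```python
from collections import Counter

# Invented example comments for illustration; the real corpus comprised 22 brief responses.
comments = [
    "I just used Google for every question",
    "Searched the internet but the time limit made it hard",
    "Asked Siri, some questions were more difficult than others",
]

# Simple quantitative content analysis: count how many comments mention each category.
categories = {
    "internet search": ("google", "internet", "search", "siri"),
    "time limit": ("time limit", "timer", "not enough time"),
    "difficulty": ("difficult", "hard"),
}

counts = Counter()
for comment in comments:
    text = comment.lower()
    for category, keywords in categories.items():
        if any(keyword in text for keyword in keywords):
            counts[category] += 1

print(dict(counts))  # {'internet search': 3, 'time limit': 1, 'difficulty': 2}
```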

Example questions from Experiment 1 and Experiment 2 are shown in ESM Appendix 1 (the full set of questions is available upon reasonable request, from the corresponding author). Statistical analysis and figure creation were undertaken using GraphPad Prism V10 (San Diego, CA).

Results

Experiment 1

There was a clear effect of condition: an average of 53.4% of participants were able to answer the questions in the open-book condition when the questions were written in the lower order format, compared to 18.6% when the questions were in the higher order format. In the closed book condition, an average of 24% of participants answered correctly in the lower order format, with 13.2% answering correctly in the higher order format. The percentage of participants able to answer each question in the lower order, open-book format was significantly higher than for every other format/condition when analysed using a two-way repeated measures ANOVA, with the percentage of participants answering correctly as the dependent variable, and test format (closed book vs open book) and question format (higher order vs lower order) as the factors. There was a significant effect of question format (F (1, 16) = 25.03, P = 0.0002) and test format (F (1, 16) = 14.59, P < 0.0001), and a significant interaction between the two (F (1, 16) = 6.95, P = 0.0018). Post hoc Bonferroni tests revealed a significant difference between the lower order, open-book condition and all other conditions; no other significant differences were observed. The results are shown in Fig. 3.
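These ANOVAs (and the analogous ones in Experiment 2, with expertise in place of question format) were run in GraphPad Prism. Purely as a non-authoritative illustration, an equivalent scripted two-way repeated measures ANOVA on the per-question percentages might look as follows; the file and column names are hypothetical.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one row per question x cell, giving the
# percentage of participants who answered that question correctly.
# Columns: question, question_format ('higher'/'lower'),
# test_format ('open'/'closed'), percent_correct.
df = pd.read_csv("experiment1_percent_correct.csv")  # hypothetical file

# Two-way repeated measures ANOVA, treating each question as the repeated
# 'subject' and the two formats as within factors.
anova = AnovaRM(
    data=df,
    depvar="percent_correct",
    subject="question",
    within=["question_format", "test_format"],
).fit()
print(anova)
```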

Fig. 3

Rewriting lower order questions into higher order makes it harder for subject matter novices to answer. Different groups of participants were given the same question in two different formats, lower order and higher order, and under two different conditions, open book and closed book. Participants were able to answer the lower order questions in the open-book condition but were not able to answer the questions in the closed-book condition or when they were rewritten into the higher order format, even under the open-book condition. *P < 0.05 when compared to all other conditions by post hoc Bonferroni tests following two-way repeated measures ANOVA (see text for details)

Methods Used in the Open-Book Condition

Of the 25 participants, 22 left comments, most of them very brief (the total corpus was 542 words). All 22 identified ‘searching the internet’ as their strategy, with 20/22 naming Google directly and one asking ‘Siri’. Of the 22, 6 identified the time limit as a factor which made it difficult to answer the questions, and 4 identified that some questions were more ‘difficult’ (the participants were not told that the questions were in two different formats).

Experiment 2

The results from Experiment 1 were clear: novices used Google to successfully answer questions written in a lower order format, but this approach was not successful when the questions were in a higher order format. However, the design of the study contained some potential confounds: (1) the participants had no obvious motivation to try to answer correctly, which was potentially more significant in the higher order condition due to the extra work required to successfully answer the question and the increased length and complexity of the questions; (2) both lower and higher order questions were written by the authors; (3) there was no within-study positive control, since Experiment 1 only used subject matter novices, and thus it was not clear that subject experts could still answer the higher order questions (this is important to demonstrate that the rewritten questions remain a valid form of assessment for the subject matter content); and (4) the answer options were often completely different between the lower order and higher order question formats.

Thus, in Experiment 2, we included a within-experiment positive control. We also started with existing lower order questions, in the public domain, that had not been written by either of the authors, using the original answer options and same correct answer, but with additional answer options added.

The results are shown in Fig. 4. Novices were again able to successfully answer lower order questions by Googling the answers, with an average of 73.3% answering correctly, but this dropped to 23% when the questions were rewritten into a higher order format. A two-way repeated measures ANOVA was conducted, with the percentage of participants answering correctly as the dependent variable, and test format (closed book vs open book) and expertise (novice vs expert) as the factors. For the lower order questions (Fig. 4A), there was a significant effect of test format (F (1, 9) = 163.0, P < 0.0001) and expertise (F (1, 9) = 17.09, P = 0.0025), with a significant interaction between these two factors (F (1, 9) = 61.91, P < 0.0001). A post hoc Bonferroni multiple comparisons test showed a significant effect of expertise in the closed-book condition (P < 0.0001) but not the open-book condition (P = 0.2931). For the higher order questions (Fig. 4B), a two-way repeated measures ANOVA again showed a significant effect of test format (F (1, 9) = 6.553, P = 0.0307) and expertise (F (1, 9) = 18.40, P = 0.0020), but no interaction between the two factors (F (1, 9) = 0.2687, P = 0.6167).

Fig. 4

Rewriting publicly available lower order questions into higher order questions makes it harder for novices to answer them, but experts can still answer under closed-book conditions. Different groups of participants were given the same question in two different formats, lower order and higher order, and under two different conditions, open book and closed book. Novices were able to answer the lower order questions in the open-book condition but not when the questions were rewritten into a higher order format using the guidance tested here. Experts showed good ability to answer questions under all conditions. See text for statistical analysis

If the higher order questions are truly assessing higher order learning, then they should be harder, and so the percentage of experts answering them correctly should be lower. An average of 53.3% of experts were able to answer questions in the lower order format under closed book conditions; this dropped to 40% when the questions were rewritten into the higher order format. The same comparison was 67.5% vs 51.5% in the open book condition. When analysed using a paired t-test, the difference between lower and higher order was significant for both the closed book (t = 3.234, df = 16, P = 0.0052) and the open book condition (t = 4.740, df = 16, P = 0.0002).
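As an illustration of these paired, by-question comparisons, a minimal Python sketch is given below, using invented per-question percentages rather than the study data.

```python
from scipy import stats

# Invented per-question percentages of experts answering correctly under
# closed book conditions (not the study data): lower vs higher order version
# of each question, paired by question.
lower_order = [60, 55, 45, 70, 50, 40, 65, 55, 45, 48]
higher_order = [45, 40, 35, 55, 42, 30, 50, 38, 36, 29]

# Paired t-test: each question contributes one pair of observations.
t_stat, p_value = stats.ttest_rel(lower_order, higher_order)
print(f"t = {t_stat:.3f}, df = {len(lower_order) - 1}, P = {p_value:.4f}")
```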

Discussion

Here we conducted a validation test of guidance which is designed to help educators write higher order MCQs [8]. When questions were written using a traditional lower order format, subject matter novices were able to Google their way to the correct answer, to the same extent as experts, even though the novices were studying subjects unrelated to the topic. When those same questions were rewritten into a higher order format using the guidance, subject novices found it significantly harder to answer using Google, while experts were still able to answer the questions. These findings suggest that questions written using the summary guidance in Table 1 do indeed assess higher order learning. The summary guidance in Table 1 is largely complementary to, and intended to be used alongside, a large body of existing literature on what makes an effective multiple-choice question, regardless of whether it assesses lower or higher order learning [39, 49, 50].

There are a number of factors and potential limitations that need to be considered when interpreting our findings.

The guidance shown in Table 1 contains a number of different elements which combine to make the question an active, problem-solving exercise. On the basis of this study, it is not currently possible to determine which, if any, of the individual elements is the most important for ensuring that an MCQ assesses ‘higher order’ learning. This is the subject of ongoing work in which each element undergoes a systematic experimental appraisal. These analyses may result in a revised and condensed set of guidelines which prioritises individual elements for the creation of higher order MCQs.

When using online labour markets for opinion surveys, there is a potential issue that participants could simply give random or minimal answers [51], or that the participants may in fact be ‘bots’ [52]. Here we had objective outcome measures with right and wrong answers. Although there was no incentive to answer correctly, we saw clear and expected differences between experimental conditions (for example, between subject experts and novices), which indicates that participants were real, valid, and following the instructions.

Participants in the open-book conditions were instructed to ‘cheat’ using whatever sources were necessary. Even so, subject novices struggled to answer the higher order questions. However, it is important to be clear that this is not specifically a study of ‘cheating’ and that these higher order questions are not ‘cheat-proof’, especially in the new era of ChatGPT, which can answer very complex problem-solving MCQs [53], and when cheating in online open-book exams is already widespread [54, 55]. Indeed, the ability of ChatGPT to answer questions written using these guidelines has recently been tested, and it answered almost all of the questions correctly, apart from those which included novel, labelled images [56]. However, these higher order questions should be more resilient to cheating in invigilated examinations and offer a way to assess higher order learning in such exams. In this way, a supervised exam based on these questions would likely be more resistant to misconduct than other assessment formats which are currently used to assess higher order learning, such as essays and other asynchronous written coursework; these are open to multiple sources of misconduct, such as plagiarism [5] and contract cheating [6], and can be completed to a high standard using tools such as ChatGPT [7].

The guidelines tested here have been evaluated in only two subject areas, both of which are drawn from human physiology and medical science. It seems reasonable to propose that the guidelines could be used to test higher order learning in other subjects, including those outside of medical science education, but this will require careful validation and is also the subject of future work.

In summary, we have validated literature-based guidance for writing, or rewriting, MCQs to assess higher order learning. In an experimental setting, questions rewritten using this guidance were much more challenging to answer by simple Googling but were still answerable by students who were studying relevant subjects, although those experts found the questions harder than their lower order equivalents. These findings suggest that the guidance could be used by educators and institutions to develop MCQ-based exams that assess higher order learning, partially replacing asynchronous written coursework such as essays.