A Systematic Review of Automatic Question Generation for Educational Purposes
While exam-style questions are a fundamental educational tool serving a variety of purposes, manual construction of questions is a complex process that requires training, experience, and resources. This, in turn, hinders and slows down the use of educational activities (e.g. providing practice questions) and new advances (e.g. adaptive testing) that require a large pool of questions. To reduce the expenses associated with manual construction of questions and to satisfy the need for a continuous supply of new questions, automatic question generation (AQG) techniques were introduced. This review extends a previous review of the AQG literature published up to late 2014. It includes 93 papers published between 2015 and early 2019 that tackle the automatic generation of questions for educational purposes. The aims of this review are to: provide an overview of the AQG community and its activities, summarise the current trends and advances in AQG, highlight the changes that the area has undergone in recent years, and suggest areas for improvement and future opportunities for AQG. Similar to what was found previously, there is little focus in the current literature on generating questions of controlled difficulty, enriching question forms and structures, automating template construction, improving presentation, and generating feedback. Our findings also suggest the need to further improve experimental reporting, harmonise evaluation metrics, and investigate other evaluation methods that are more feasible.
Keywords: Automatic question generation · Semantic Web · Education · Natural language processing · Natural language generation · Assessment · Difficulty prediction
Exam-style questions are a fundamental educational tool serving a variety of purposes. In addition to their role as an assessment instrument, questions have the potential to influence student learning. According to Thalheimer (2003), some of the benefits of using questions are: 1) offering the opportunity to practice retrieving information from memory; 2) providing learners with feedback about their misconceptions; 3) focusing learners’ attention on the important learning material; 4) reinforcing learning by repeating core concepts; and 5) motivating learners to engage in learning activities (e.g. reading and discussing). Despite these benefits, manual question construction is a challenging task that requires training, experience, and resources. Several published analyses of real exam questions (mostly multiple choice questions (MCQs)) (Hansen and Dexter 1997; Tarrant et al. 2006; Hingorjo and Jaleel 2012; Rush et al. 2016) demonstrate their poor quality, which Tarrant et al. (2006) attributed to a lack of training in assessment development. This challenge is augmented further by the need to replace assessment questions consistently to ensure their validity, since their value will decrease or be lost after a few rounds of usage (due to being shared between test takers), as well as the rise of e-learning technologies, such as massive open online courses (MOOCs) and adaptive learning, which require a larger pool of questions.
Automatic question generation (AQG) techniques emerged as a solution to the challenges facing test developers in constructing a large number of good quality questions. AQG is concerned with the construction of algorithms for producing questions from knowledge sources, which can be either structured (e.g. knowledge bases (KBs)) or unstructured (e.g. text). As Alsubait (2015) discussed, research on AQG goes back to the 1970s. Nowadays, AQG is gaining further importance with the rise of MOOCs and other e-learning technologies (Qayyum and Zawacki-Richter 2018; Gaebel et al. 2014; Goldbach and Hamza-Lup 2017).
In what follows, we outline some potential benefits that one might expect from successful automatic generation of questions. AQG can reduce the cost (in terms of both money and effort) of question construction which, in turn, enables educators to spend more time on other important instructional activities. In addition to resource saving, having a large number of good-quality questions enables the enrichment of the teaching process with additional activities such as adaptive testing (Vie et al. 2017), which aims to adapt learning to student knowledge and needs, as well as drill and practice exercises (Lim et al. 2012). Finally, being able to automatically control question characteristics, such as question difficulty and cognitive level, can inform the construction of good quality tests with particular requirements.
Although the focus of this review is education, the applications of question generation (QG) are not limited to education and assessment. Questions are also generated for other purposes, such as validation of knowledge bases, development of conversational agents, and development of question answering or machine reading comprehension systems, where questions are used for training and testing.
This review offers:
- a comprehensive summary of the recent AQG approaches;
- an analysis of the state of the field focusing on differences between the pre- and post-2014 periods;
- a summary of challenges and future directions; and
- an extensive reference to the relevant literature.
Summary of Previous Reviews
There have been six published reviews of the AQG literature. The reviews reported by Le et al. (2014), Kaur and Bathla (2015), Alsubait (2015), and Rakangor and Ghodasara (2015) cover the literature published up to late 2014, while those reported by Ch and Saha (2018) and Papasalouros and Chatzigiannakou (2018) cover the literature published up to late 2018. Of these, the most comprehensive review is Alsubait’s, which includes 81 papers (65 distinct studies) identified using a systematic procedure. The other reviews were selective and only cover a small subset of the AQG literature. Of particular interest, because it is a systematic review that overlaps in timing with ours, is the review by Ch and Saha (2018). However, their review is not as rigorous as ours, as it only focuses on automatic generation of MCQs using text as input. In addition, essential details about the review procedure, such as the search queries used for each electronic database and the resultant number of papers, are not reported. Moreover, several related studies found in other reviews on AQG are not included.
Findings of Alsubait’s Review
In this section, we concentrate on summarising the main results of Alsubait’s systematic review, due to its being the only comprehensive review. We do so by elaborating on interesting trends and speculating about the reasons for those trends, as well as highlighting limitations observed in the AQG literature.
Alsubait characterised AQG studies along the following dimensions: 1) purpose of generating questions, 2) domain, 3) knowledge sources, 4) generation method, 5) question type, 6) response format, and 7) evaluation.
Table 1. Results of Alsubait’s review. Categories with a frequency of three or less are classified under “other”
With regard to knowledge sources, the most commonly used source for question generation is text (Table 1). A similar trend was also found by Rakangor and Ghodasara (2015). Note that 19 text-based approaches, out of the 38 identified by Alsubait (2015), tackle the generation of questions for the language learning domain, both free response (FR) and multiple choice (MC). Of the remaining 19 studies, only five focus on generating MCQs. To do so, they incorporate additional inputs such as WordNet (Miller et al. 1990), thesauri, or textual corpora. By and large, the challenge in the case of MCQs is distractor generation. Except in language learning, where distractors can be generated using simple strategies such as selecting words with a particular POS or other syntactic properties, text alone rarely supplies distractors, so external, structured knowledge sources are needed to determine what is true and what is plausibly similar. On the other hand, eight ontology-based approaches centre on generating MCQs and only three focus on FR questions.
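To make the distractor challenge concrete, the following sketch (ours, not drawn from any reviewed system) selects co-hyponyms from a small hand-coded taxonomy standing in for a structured source such as WordNet; sibling terms under the same hypernym are true statements about the world that are semantically close to the key, which is what makes them plausible distractors:

```python
# Toy taxonomy standing in for a structured knowledge source
# (hypernym -> hyponyms). Co-hyponyms of the answer key, i.e. its
# siblings under the same parent, are semantically close distractors.
TAXONOMY = {
    "mammal": ["dog", "cat", "whale", "horse"],
    "bird": ["sparrow", "eagle", "penguin"],
}

def co_hyponym_distractors(key, n=3):
    """Return up to n siblings of `key` from the toy taxonomy."""
    for parent, children in TAXONOMY.items():
        if key in children:
            return [c for c in children if c != key][:n]
    return []  # key not found: no distractors available

print(co_hyponym_distractors("cat"))  # ['dog', 'whale', 'horse']
```

A real system would replace the toy dictionary with WordNet synsets or an ontology, and typically filter candidates further (e.g. by corpus frequency or similarity to the stem).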
Simple factual wh-questions (i.e. where the answers are short facts that are explicitly mentioned in the input) and gap-fill questions (also known as fill-in-the-blank or cloze questions) are the most generated types of questions with the majority of them, 17 and 15 respectively, being generated from text. The prevalence of these questions is expected because they are common in language learning assessment. In addition, these two types require relatively little effort to construct, especially when they are not accompanied by distractors. In gap-fill questions, there are no concerns about the linguistic aspects (e.g. grammaticality) because the stem is constructed by only removing a word or a phrase from a segment of text. The stem of a wh-question is constructed by removing the answer from the sentence, selecting an appropriate wh-word, and rearranging words to form a question. Other types of questions such as mathematical word problems, Jeopardy-style questions,1 and medical case-based questions (CBQs) require more effort in choosing the stem content and verbalisation. Another related observation we made is that the types of questions generated from ontologies are more varied than the types of questions generated from text.
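As the paragraph above suggests, a gap-fill stem requires very little machinery. The sketch below (function name and behaviour are illustrative, not taken from any reviewed system) blanks the answer phrase in a source sentence:

```python
import re

def gap_fill(sentence, answer, blank="____"):
    """Build a gap-fill (cloze) stem by removing the answer phrase.

    Whole-word matching (\\b) avoids blanking substrings of longer
    words. Returns the stem and the answer key.
    """
    pattern = re.compile(r"\b" + re.escape(answer) + r"\b")
    if not pattern.search(sentence):
        raise ValueError("answer phrase not found in sentence")
    return pattern.sub(blank, sentence, count=1), answer

stem, key = gap_fill("Mitochondria produce ATP for the cell.", "ATP")
# stem == "Mitochondria produce ____ for the cell."
```

Because the stem is the original sentence minus one phrase, grammaticality is preserved for free, which helps explain the prevalence of this question type.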
Limitations observed by Alsubait (2015) include the limited research on controlling the difficulty of generated questions and on generating informative feedback. Existing difficulty models are either not validated or only applicable to a specific type of question (Alsubait 2015). Regarding feedback (i.e. an explanation for the correctness/incorrectness of the answer), only three studies generate feedback along with the questions. Even then, the feedback is used to motivate students to try again or to provide extra reading material without explaining why the selected answer is correct/incorrect. Ungrammaticality is another notable problem with auto-generated questions, especially in approaches that apply syntactic transformations of sentences (Alsubait 2015). For example, 36.7% and 39.5% of questions generated in the work of Heilman and Smith (2009) were rated by reviewers as ungrammatical and nonsensical, respectively. Another limitation related to approaches to generating questions from ontologies is the use of experimental ontologies for evaluation, neglecting the value of using existing, probably large, ontologies. Various issues can arise if existing ontologies are used, which in turn provide further opportunities to enhance the quality of generated questions and the ontologies used for generation.
The goal of this review is to provide a comprehensive view of the AQG field since 2015. Following and extending the schema presented by Alsubait (2015) (Table 1), we have structured our review around the following four objectives and their related questions. Questions marked with an asterisk “*” are those proposed by Alsubait (2015). Questions under the first three objectives (except question 5 under OBJ3) are used to guide data extraction. The others are analytical questions to be answered based on extracted results.
- Providing an overview of the AQG community and its activities
What is the rate of publication?*
What types of papers are published in the area?
Where is research published?
Who are the active research groups in the field?*
- Summarising current QG approaches
What is the purpose of QG?*
What method is applied?*
What tasks related to question generation are considered?
What type of input is used?*
Is it designed for a specific domain? For which domain?*
What type of questions are generated?* (i.e., question format and answer format)
What is the language of the questions?
Does it generate feedback?*
Is difficulty of questions controlled?*
Does it consider verbalisation (i.e. presentation improvements)?
- Identifying the gold-standard performance in AQG
Are there any available sources or standard datasets for performance comparison?
What types of evaluation are applied to QG approaches?*
What properties of questions are evaluated, and what metrics are used for their measurement?
How does the generation approach perform?
What is the gold-standard performance?
- Tracking the evolution of AQG since Alsubait’s review
Has there been any progress on feedback generation?
Has there been progress on generating questions with controlled difficulty?
Has there been progress on enhancing the naturalness of questions (i.e. verbalisation)?
One of our motivations for pursuing these objectives is to provide members of the AQG community with a reference to facilitate decisions such as what resources to use, whom to compare to, and where to publish. As we mentioned in the Summary of Previous Reviews, Alsubait (2015) highlighted a number of concerns related to the quality of generated questions, difficulty models, and the evaluation of questions. We were motivated to know whether these concerns have been addressed. Furthermore, while reviewing some of the AQG literature, we made some observations about the simplicity of generated questions and about the reporting being insufficient and heterogeneous. We want to know whether these issues are universal across the AQG literature.
Inclusion and Exclusion Criteria
A paper was excluded if:
1. it is not in English;
2. it presents work in progress only and does not provide a sufficient description of how the questions are generated;
3. it presents a QG approach that is based mainly on a template and questions are generated by substituting template slots with numerals or with a set of randomly predefined values;
4. it focuses on question answering rather than question generation;
5. it presents an automatic mechanism to deliver assessments, rather than generating assessment questions;
6. it presents an automatic mechanism to assemble exams or to adaptively select questions from a question bank;
7. it presents an approach for predicting the difficulty of human-authored questions;
8. it presents a QG approach for purposes other than those related to education (e.g. training of question answering systems, dialogue systems);
9. it does not include an evaluation of the generated questions;
10. it is an extension of a paper published before 2015 and no changes were made to the question generation approach;
11. it is a secondary study (i.e. literature review);
12. it is not peer-reviewed (e.g. theses, presentations and technical reports); or
13. its full text is not available (through the University of Manchester Library website, Google or Google Scholar).
Six data sources were used, five of which were electronic databases (ERIC, ACM, IEEE, INSPEC and Science Direct), which were determined by Alsubait (2015) to have good coverage of the AQG literature. We also searched the International Journal of Artificial Intelligence in Education (AIED) and the proceedings of the International Conference on Artificial Intelligence in Education for 2015, 2017, and 2018 due to their AQG publication record.
We obtained additional papers by examining the reference lists of, and the citations to, AQG papers we reviewed (known as “snowballing”). The citations to a paper were identified by searching for the paper using Google Scholar, then clicking on the “cited by” option that appears under the name of the paper. We performed this for every paper on AQG, regardless of whether we had decided to include it, to ensure that we captured all the relevant papers. That is to say, even if a paper was excluded because it met some of the exclusion criteria (1-3 and 8-13), it is still possible that it refers to, or is referred to by, relevant papers.
We used the reviews reported by Ch and Saha (2018) and Papasalouros and Chatzigiannakou (2018) as a “sanity check” to evaluate the comprehensiveness of our search strategy. We exported all the literature published between 2015 and 2018 included in the work of Ch and Saha (2018) and Papasalouros and Chatzigiannakou (2018) and checked whether they were included in our results (both search results and snowballing results).
We used the keywords “question” and “generation” to search for relevant papers. Actual search queries used for each of the databases are provided in the Appendix under “Search Queries”. We decided on these queries after experimenting with different combinations of keywords and operators provided by each database and looking at the ratio between relevant and irrelevant results in the first few pages (sorted by relevance). To ensure that recall was not compromised, we checked whether relevant results returned using different versions of each search query were still captured by the selected version.
The search results were exported to comma-separated values (CSV) files. Two reviewers then looked independently at the titles and abstracts to decide on inclusion or exclusion. The reviewers skimmed the paper if they were not able to make a decision based on the title and abstract. Note that, at this phase, it was not possible to assess whether all papers had satisfied the exclusion criteria 2, 3, 8, 9, and 10. Because of this, the final decision was made after reading the full text as described next.
To judge whether a paper’s purpose was related to education, we considered the title, abstract, introduction, and conclusion sections. Papers that mentioned many potential purposes for generating questions, but did not state which one was the focus, were excluded. If the paper mentioned only educational applications of QG, we assumed that its purpose was related to education, even without a clear purpose statement. Similarly, if the paper mentioned only one application, we assumed that was its focus.
Concerning evaluation, papers that evaluated the usability of a system that had a QG functionality, without evaluating the quality of generated questions, were excluded. In addition, in cases where we found multiple papers by the same author(s) reporting the same generation approach, even if some did not cover evaluation, all of the papers were included but counted as one study in our analyses.
Lastly, because the final decision on inclusion/exclusion sometimes changed after reading the full paper, agreement between the two reviewers was checked after the full paper had been read and the final decision had been made. However, a check was also made to ensure that the inclusion/exclusion criteria were interpreted in the same way. Cases of disagreement were resolved through discussion.
Guided by the questions presented in the “Review Objective” section, we designed a specific data extraction form. Two reviewers independently extracted data related to the included studies. As mentioned above, different papers that related to the same study were represented as one entry. Agreement for data extraction was checked and cases of disagreement were discussed to reach a consensus.
Papers were considered to report the same study if:
- they reported on different evaluations of the same generation approach;
- they reported on applying the same generation approach to different sources or domains; or
- one of the papers introduced an additional feature of the generation approach, such as difficulty prediction or generating distractors, without changing the initial generation procedure.
The extracted data were analysed using scripts written in R Markdown.
Table 2. Criteria used for quality assessment
Q1: Is the number of the participants included in the study reported?
Q2: Are the characteristics of the participants included in the study described?
Q3: Is the procedure for participant selection reported?
Q4: Are the participants selected for this study suitable for the question(s) posed by the researchers?
Q5: Is the number of questions evaluated in the study reported?
Q6: Is the sample selection method described?
Q6a: Is the sampling strategy described?
Q6b: Is the sample size calculation described?
Q7: Is the sample representative of the target group?
Q8: Are the main outcomes to be measured described?
Q9: Is the reliability of the measures assessed?
In what follows, we describe the individual criteria (Q1-Q9, presented in Table 2) that we considered when deciding whether a study satisfied them. Three responses are used when scoring the criteria: “yes”, “no” and “not specified”. The “not specified” response is used either when there is no information to support a judgement or when there is not enough information to distinguish between a “yes” and a “no” response.
Q1-Q4 are concerned with the quality of reporting on participant information, Q5-Q7 are concerned with the quality of reporting on the question samples, and Q8 and Q9 describe the evaluative measures used to assess the outcomes of the studies.
When a study reports the exact number of participants (e.g. experts, students, employees, etc.) used in the study, Q1 scores a “yes”. Otherwise, it scores a “no”. For example, the passage “20 students were recruited to participate in an exam…” would result in a “yes”, whereas “a group of students were recruited to participate in an exam…” would result in a “no”.
Q2 requires the reporting of demographic characteristics supporting the suitability of the participants for the task. Depending on the category of participant, relevant demographic information is required to score a “yes”; studies that do not specify such information score a “no”. For example, in studies relying on expert reviews, those that include information on the teaching experience or proficiency level of reviewers would receive a “yes”, while in studies relying on mock exams, those that include information about the grade level or proficiency level of test takers would also receive a “yes”. Studies reporting that the evaluation was conducted by reviewers, instructors, students, or co-workers without providing any additional information about the suitability of the participants for the task would be considered neglectful of Q2 and score a “no”.
For a study to score “yes” for Q3, it must provide specific information on how participants were selected/recruited, otherwise it receives a score of “no”. This includes information on whether the participants were paid for their work or were volunteers. For example, the passage “7th grade biology students were recruited from a local school.” would receive a score of “no” because it is not clear whether or not they were paid for their work. However, a study that reports “Student volunteers were recruited from a local school…” or “Employees from company X were employed for n hours to take part in our study… they were rewarded for their services with Amazon vouchers worth $n” would receive a “yes”.
To score “yes” for Q4, two conditions must be met: the study must 1) score “yes” for both Q2 and Q3 and 2) only use participants that are suitable for the task at hand. Studies that fail to meet the first condition score “not specified” while those that fail to meet the second condition score “no”. Regarding the suitability of participants, we consider, as an example, native Chinese speakers suitable for evaluating the correctness and plausibility of options generated for Chinese gap-fill questions. As another example, we consider Amazon Mechanical Turk (AMT) co-workers unsuitable for evaluating the difficulty of domain-specific questions (e.g. mathematical questions).
When a study reports the exact number of questions used in the experimentation or evaluation stage, Q5 receives a score of “yes”, otherwise it receives a score of “no”. To demonstrate, consider the following examples. A study reporting “25 of the 100 generated questions were used in our evaluation…” would receive a score of “yes”. However, if a study made a claim such as “Around half of the generated questions were used…”, it would receive a score of “no”.
Q6a requires that the sampling strategy be not only reported (e.g. random, proportionate stratification, disproportionate stratification) but also justified in order to receive a “yes”; otherwise, it receives a score of “no”. To demonstrate, a study that only reports “We sampled 20 questions from each template …” would receive a score of “no”, since no justification for using the stratified sampling procedure is provided. However, if it also added “We sampled 20 questions from each template to ensure template balance in discussions about the quality of generated questions…”, this would be considered a suitable justification and would warrant a score of “yes”. Similarly, Q6b requires that the sample size be both reported and justified.
Our decision regarding Q7 takes into account the following: 1) responses to Q6a (i.e. a study can only score “yes” if the score to Q6a is “yes”, otherwise, the score would be “not specified”) and 2) representativeness of the population. Using random sampling is, in most cases, sufficient to score “yes” for Q7. However, if multiple types of questions are generated (e.g. different templates or different difficulty levels), stratified sampling is more appropriate in cases in which the distribution of questions is skewed.
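For illustration, the stratified sampling discussed above (e.g. drawing a fixed number of generated questions per template) can be sketched as follows; the function is our own example, not taken from any reviewed study:

```python
import random

def stratified_sample(questions, per_stratum, seed=0):
    """Sample up to `per_stratum` questions from each stratum.

    `questions` is a list of (stratum, question) pairs, where a
    stratum might be a template id or a difficulty level. A fixed
    seed makes the sample reproducible for reporting purposes.
    """
    rng = random.Random(seed)
    by_stratum = {}
    for stratum, q in questions:
        by_stratum.setdefault(stratum, []).append(q)
    sample = []
    for stratum in sorted(by_stratum):
        qs = by_stratum[stratum]
        # sample without replacement within each stratum
        sample.extend(rng.sample(qs, min(per_stratum, len(qs))))
    return sample
```

With skewed question distributions, sampling a fixed number per stratum guarantees every template is represented, at the cost of over-weighting rare templates relative to simple random sampling.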
Q8 considers whether the authors provide a description, a definition, or a mathematical formula for the evaluation measures they used as well as a description of the coding system (if applicable). If so, then the study receives a score of “yes” for Q8, otherwise it receives a score of “no”.
Q9 is concerned with whether questions were evaluated by multiple reviewers and whether measures of agreement (e.g. Cohen’s kappa or percentage of agreement) were reported. For example, studies reporting information similar to “all questions were double-rated and inter-rater agreement was computed…” receive a score of “yes”, whereas studies reporting information similar to “Each question was rated by one reviewer…” receive a score of “no”.
To assess inter-rater reliability, this activity was performed independently by two reviewers (the first and second authors), both proficient in the field of AQG, on an exploratory random sample of 27 studies. The percentage of agreement and Cohen’s kappa were used to measure inter-rater reliability for Q1-Q9. The percentage of agreement ranged from 73% to 100%, while Cohen’s kappa was above .72 for Q1-Q5, demonstrating “substantial to almost perfect agreement”, and equal to 0.42 for Q9, indicating moderate agreement.
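Cohen’s kappa corrects raw agreement for the agreement expected by chance from each rater’s marginal distributions. A minimal implementation (ours, for illustration) is:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from the
    two raters' marginal label frequencies.
    """
    n = len(ratings_a)
    assert n == len(ratings_b) and n > 0
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    marg_a, marg_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(marg_a[c] * marg_b[c] for c in marg_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

For example, ratings ["yes", "yes", "no", "no"] versus ["yes", "no", "no", "no"] give p_o = 0.75 and p_e = 0.5, hence kappa = 0.5, illustrating how chance correction lowers the raw 75% agreement.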
Results and Discussion
Search and Screening Results
Sources used to obtain relevant papers and their contribution to the final results (* = after removing duplicates). From computerised databases, journals, and conference proceedings, 122 papers (89 without duplicates) were included based on title and abstract, of which 51 (36 without duplicates) were included based on full text. Further papers came from other sources, namely snowballing and the reviews of Ch and Saha (2018) and Papasalouros and Chatzigiannakou (2018); these other sources contributed 67 papers (57 without duplicates) in total.
The most common reasons for excluding papers on AQG were that the purpose of the generation was not related to education or there was no evaluation. Details of papers that were excluded after reading their full text are in the Appendix under “Excluded Studies”.
Data Extraction Results
In this section, we provide our results and outline commonalities and differences with Alsubait’s results (highlighted in the “Findings of Alsubait’s Review” section). The results are presented in the same order as our research questions. The main characteristics of the reviewed literature can be found in the Appendix under “Summary of Included Studies”.
Rate of Publication
Types of Papers and Publication Venues
Of the papers published in the period covered by this review, conference papers constitute the majority (44 papers), followed by journal articles (32 papers) and workshop papers (17 papers). This is similar to the results of Alsubait (2015) with 34 conference papers, 22 journal papers, 13 workshop papers, and 12 other types of papers, including books or book chapters as well as technical reports and theses. In the Appendix, under “Publication Venues”, we list journals, conferences, and workshops that published at least two of the papers included in either of the reviews.
Overall, 358 researchers are working in the area (168 identified in Alsubait’s review and 205 identified in this review, with 15 researchers in common). The majority of researchers have only one publication. In Appendix “Active Research Groups”, we present the 13 active groups, defined as having more than two publications in the period of both reviews. Of the 174 papers identified in both reviews, 64 were published by these groups. This shows that, besides the increased activity in the study of AQG, the community is also growing.
Purpose of Question Generation
Purposes for automatically generating questions in the included studies (a study can belong to more than one category) include: education with no focus on a specific purpose; self-directed learning, self-study or self-assessment; tutoring systems or computer-assisted learning systems; providing practice questions; and providing questions for MOOCs or other courses.
Methods of generating questions have been classified in the literature (Yao et al. 2012) as follows: 1) syntax-based, 2) semantic-based, and 3) template-based. Syntax-based approaches operate on the syntax of the input (e.g. syntactic tree of text) to generate questions. Semantic-based approaches operate on a deeper level (e.g. is-a or other semantic relations). Template-based approaches use templates consisting of fixed text and some placeholders that are populated from the input. Alsubait (2015) extended this classification to include two more categories: 4) rule-based and 5) schema-based. The main characteristic of rule-based approaches, as defined by Alsubait (2015), is the use of rule-based knowledge sources to generate questions that assess understanding of the important rules of the domain. As this definition implies that these methods require a deep understanding (beyond syntactic understanding), we believe that this category falls under the semantic-based category. However, we define the rule-based approach differently, as will be seen below. Regarding the fifth category, according to Alsubait (2015), schemas are similar to templates but are more abstract. They provide a grouping of templates that represent variants of the same problem. We regard this distinction between template and schema as unclear. Therefore, we restrict our classification to the template-based category regardless of how abstract the templates are.
- Level of understanding
Syntactic: Syntax-based approaches leverage syntactic features of the input, such as POS or parse-tree dependency relations, to guide question generation. These approaches do not require understanding of the semantics of the input in use (i.e. entities and their meaning). For example, approaches that select distractors based on their POS are classified as syntax-based.
Semantic: Semantic-based approaches require a deeper understanding of the input, beyond lexical and syntactic understanding. The information that these approaches use is not necessarily explicit in the input (i.e. it may require reasoning to extract). In most cases, this requires the use of additional knowledge sources (e.g. taxonomies, ontologies, or other such sources). As an example, approaches that use either contextual similarity or feature-based similarity to select distractors are classified as semantic-based.
- Procedure of transformation
Template: Questions are generated with the use of templates. Templates define the surface structure of the questions using fixed text and placeholders that are substituted with values to generate questions. Templates also specify the features of the entities (either syntactic, semantic, or both), that can replace the placeholders.
Rule: Questions are generated with the use of rules. Rules often accompany approaches using text as input. Typically, approaches utilising rules annotate sentences with syntactic and/or semantic information. They then use these annotations to match the input to a pattern specified in the rules. These rules specify how to select a suitable question type (e.g. selecting suitable wh-words) and how to manipulate the input to construct questions (e.g. converting sentences into questions).
Statistical methods: This is where question transformation is learned from training data. For example, in Gao et al. (2018), question generation has been dealt with as a sequence-to-sequence prediction problem in which, given a segment of text (usually a sentence), the question generator forms a sequence of text representing a question (using the probabilities of co-occurrence that are learned from the training data). Training data has also been used in Kumar et al. (2015b) for predicting which word(s) in the input sentence is/are to be replaced by a gap (in gap-fill questions).
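The rule- and template-based procedures above can be illustrated with a minimal sketch. The regular-expression rule and the template slot below are invented stand-ins for the syntactic/semantic annotations and template languages that real systems use:

```python
import re

# Toy rule: match "SUBJ is COMPLEMENT." and emit a wh-question whose
# answer is the complement. Real systems match syntactic/semantic
# annotations rather than raw regular expressions.
def rule_based(sentence):
    m = re.match(r"(?P<subj>[A-Z][\w ]*?) is (?P<comp>[\w ]+)\.$", sentence)
    if not m:
        return None
    return f"What is {m.group('subj')}?", m.group("comp")

# Toy template: fixed text plus a placeholder substituted from the input.
def template_based(template, values):
    return template.format(**values)

question, key = rule_based("Photosynthesis is the process plants use to make food.")
print(question)  # → What is Photosynthesis?
print(template_based("Indicate properties or characteristics of {S}.", {"S": "water"}))
```

Note how the rule stays close to the surface form of the input sentence, whereas the template can produce wording that never appears in the input, mirroring the trade-off discussed above.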
Regarding the level of understanding, 60 papers rely on semantic information and only ten rely solely on syntactic information. All but three of the ten syntactic approaches (Das and Majumder 2017; Kaur and Singh 2017; Kusuma and Alhamri 2018) tackle the generation of language questions. In addition, templates are more popular than rules and statistical methods, with 27 papers reporting the use of templates, compared to 16 and nine for rules and statistical methods, respectively. Each of these three approaches has its advantages and disadvantages. In terms of cost, all three are expensive: templates and rules require manual construction, while learning from data often requires a large amount of annotated data, which is unavailable in many specific domains. Additionally, questions generated by rules and statistical methods stay very close to the input (e.g. the sentences used for generation), while templates allow the generation of questions that differ from the surface structure of the input, in word choice for example. However, questions generated from templates are limited in their linguistic diversity. Note that some papers were classified as not having a method of transforming the input into questions because they focus only on distractor generation or on gap-fill questions, for which the stem is the input statement with a word or phrase removed. Readers interested in studies that belong to a specific approach are referred to the “Summary of Included Studies” in the Appendix.
Tasks involved in question generation are explained below. We grouped the tasks into the stages of preprocessing, question construction, and post-processing. For each task, we provide a brief description, mention its role in the generation process, and summarise different approaches that have been applied in the literature. The “Summary of Included Studies” in the Appendix shows which tasks have been tackled in each study.
Sentence simplification: This is employed in some text-based approaches (Liu et al. 2017; Majumder and Saha 2015; Patra and Saha 2018b). Complex sentences, usually sentences with appositions or sentences joined with conjunctions, are converted into simple sentences to ease upcoming tasks. For example, Patra and Saha (2018b) reported that Wikipedia sentences are long and contain multiple objects; simplifying these sentences facilitates triplet extraction (where triples are used later for generating questions). This task was carried out by using sentence simplification rules (Liu et al. 2017) and relying on parse-tree dependencies (Majumder and Saha 2015; Patra and Saha 2018b).
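A minimal sketch of this kind of simplification, assuming the only pattern handled is two clauses joined by ", and" (real systems rely on simplification rules or parse-tree dependencies rather than string splitting):

```python
import re

def simplify(sentence):
    """Naively split a compound sentence at ', and' into simple sentences."""
    clauses = re.split(r",\s*and\s+", sentence.rstrip("."))
    # Re-capitalise each clause and restore the sentence-final period.
    return [c.strip()[:1].upper() + c.strip()[1:] + "." for c in clauses]

print(simplify("Wikipedia was launched in 2001, and it is edited by volunteers."))
# → ['Wikipedia was launched in 2001.', 'It is edited by volunteers.']
```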
Sentence classification: In this task, sentences are classified into categories, which, according to Mazidi and Tarau (2016a, 2016b), is key to determining the type of question to be asked about the sentence. This classification was carried out by analysing POS and dependency labels, as in Mazidi and Tarau (2016a, 2016b), or by using a machine learning (ML) model and a set of rules, as in Basuki and Kusuma (2018). For example, in Mazidi and Tarau (2016a, 2016b), the pattern “S-V-acomp” is an adjectival complement that describes the subject and is therefore matched to the question template “Indicate properties or characteristics of S?”
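The pattern-to-template routing can be sketched as a simple lookup. The "S-V-acomp" pattern and its template come from the example above; the second pattern and the input format are invented for illustration:

```python
# Map dependency patterns to question templates. "S-V-acomp" and its
# template follow the example in the text; "S-V-dobj" is hypothetical.
PATTERNS = {
    "S-V-acomp": "Indicate properties or characteristics of {S}?",
    "S-V-dobj":  "What does {S} {V}?",
}

def question_for(annotated):
    """annotated: a sentence pre-annotated with its subject (S), verb (V),
    and the dependency label of the complement."""
    template = PATTERNS.get(f"S-V-{annotated['dep']}")
    return template.format(**annotated) if template else None

print(question_for({"S": "the solution", "V": "is", "dep": "acomp"}))
# → Indicate properties or characteristics of the solution?
```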
Content selection: As the number of questions in examinations is limited, the goal of this task is to determine important content, such as sentences, parts of sentences, or concepts, about which to generate questions. In the reviewed literature, the majority approach is to generate all possible questions and leave the task of selecting important questions to exam designers. However, in some settings such as self-assessment and self-learning environments, in which questions are generated “on the fly”, leaving the selection to exam designers is not feasible.
Content selection was of more interest for approaches that utilise text than for those that utilise structured knowledge sources. Several characterisations of important sentences, and approaches for their selection, have been proposed in the reviewed literature; we summarise these in the following paragraphs.
Huang and He (2016) defined three characteristics for selecting sentences that are important for reading assessment and proposed metrics for their measurement: keyness (containing the key meaning of the text), completeness (spreading over different paragraphs to ensure that test-takers grasp the text fully), and independence (covering different aspects of text content). Olney et al. (2017) selected sentences that: 1) are well connected to the discourse (same as completeness) and 2) contain specific discourse relations. Other researchers have focused on selecting topically important sentences. To that end, Kumar et al. (2015b) selected sentences that contain concepts and topics from an educational textbook, while Kumar et al. (2015a) and Majumder and Saha (2015) used topic modelling to identify topics and then rank sentences based on topic distribution. Park et al. (2018) took another approach by projecting the input document and the sentences within it into the same n-dimensional vector space and then selecting sentences that are similar to the document, on the assumption that such sentences best express the topic or essence of the document. Other approaches selected sentences by checking the occurrence of, or measuring the similarity to, a reference set of patterns, under the assumption that these sentences convey similar information to the sentences used to extract the patterns (Majumder and Saha 2015; Das and Majumder 2017). Others (Shah et al. 2017; Zhang and Takuma 2015) filtered out sentences that are insufficient on their own to make valid questions, such as sentences starting with discourse connectives (e.g. thus, also, so), as in Majumder and Saha (2015).
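Park et al.'s idea of projecting sentences and the document into the same vector space can be sketched with plain bag-of-words vectors and cosine similarity, a crude stand-in for the learned vector representations such approaches actually use:

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words vector over lower-cased word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_sentences(sentences, k=1):
    """Rank sentences by similarity to the whole document; keep the top k."""
    doc = bow(" ".join(sentences))
    return sorted(sentences, key=lambda s: cosine(bow(s), doc), reverse=True)[:k]

sents = ["Plants make food by photosynthesis.",
         "Photosynthesis uses light energy.",
         "The weather was nice."]
print(select_sentences(sents))  # → ['Plants make food by photosynthesis.']
```

The off-topic third sentence scores lowest against the document vector, so it would not be selected for question generation.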
Still other approaches to content selection are more specific and are informed by the type of question to be generated. For example, the purpose of the study reported in Susanti et al. (2015) is to generate “closest-in-meaning vocabulary questions”9 which involve selecting a text snippet from the Internet that contains the target word, while making sure that the word has the same sense in both the input and retrieved sentences. To this end, the retrieved text was scored on the basis of metrics such as the number of query words that appear in the text.
With regard to content selection from structured knowledge bases, only one study focuses on this task. Rocha and Zucker (2018) used DBpedia, along with external ontologies, to generate questions; the ontologies describe educational standards according to which DBpedia content was selected for use in question generation.
- Stem and correct answer generation: These two processes are often carried out together, using templates, rules, or statistical methods, as mentioned in the “Generation Methods” Section. Subprocesses involved are:
transforming assertive sentences into interrogative ones (when the input is text);
determination of question type (i.e. selecting suitable wh-word or template); and
selection of gap position (relevant to gap-fill questions).
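For gap-fill questions, stem generation reduces to blanking the selected answer in the source sentence. A minimal sketch (the gap marker and example are illustrative):

```python
def make_gap_fill(sentence, key, gap="_____"):
    """Blank the first occurrence of the key to form a gap-fill stem."""
    if key not in sentence:
        raise ValueError("key must occur in the sentence")
    return sentence.replace(key, gap, 1), key

stem, answer = make_gap_fill("Water boils at 100 degrees Celsius.", "100")
print(stem)    # → Water boils at _____ degrees Celsius.
print(answer)  # → 100
```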
Incorrect options (i.e. distractor) generation: Distractor generation is a very important task in MCQ generation since distractors influence question quality. Several strategies have been used to generate distractors. Among these are selection of distractors based on word frequency (i.e. distractors that appear in a corpus with a frequency similar to that of the key) (Jiang and Lee 2017), POS (Soonklang and Muangon 2017; Susanti et al. 2015; Satria and Tokunaga 2017a, 2017b; Jiang and Lee 2017), or co-occurrence with the key (Jiang and Lee 2017). A dominant approach is the selection of distractors based on their similarity to the key, using different notions of similarity, such as syntax-based similarity (i.e. similar POS, similar letters) (Kumar et al. 2015b; Satria and Tokunaga 2017a, 2017b; Jiang and Lee 2017), feature-based similarity (Wita et al. 2018; Majumder and Saha 2015; Patra and Saha 2018a, 2018b; Alsubait et al. 2016; Leo et al. 2019), or contextual similarity (Afzal 2015; Kumar et al. 2015a, 2015b; Yaneva et al. 2018; Shah et al. 2017; Jiang and Lee 2017). Some studies (Lopetegui et al. 2015; Faizan and Lohmann 2018; Faizan et al. 2017; Kwankajornkiet et al. 2016; Susanti et al. 2015) selected distractors that are declared in a knowledge base to be siblings of the key, which also implies some notion of similarity (siblings are assumed to be similar). Another approach that relies on structured knowledge sources is described in Seyler et al. (2017): the authors used query relaxation, whereby the queries used to generate question keys are relaxed to produce distractors that share some of the key's features. Faizan and Lohmann (2018), Faizan et al. (2017), and Stasaski and Hearst (2017) adopted a similar approach for selecting distractors. Others, including Liang et al. (2017, 2018) and Liu et al. (2018), used ML models to rank distractors based on a combination of the previous features.
Again, some distractor selection approaches are tailored to specific types of questions. For example, for the pronoun reference questions generated in Satria and Tokunaga (2017a, 2017b), words selected as distractors must not belong to the same coreference chain as the key, as this would make them correct answers. Another example of a question-type-specific approach relates to gap-fill questions: Kumar et al. (2015b) ensured that distractors fit into the question sentence by calculating the probability of their occurring in it.
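The similarity-based strategies above can be sketched by scoring candidates on the number of features they share with the key. The feature sets below are invented for illustration; real systems derive them from corpora or knowledge bases:

```python
# Rank candidate distractors by feature overlap with the key: candidates
# sharing more features (e.g. semantic type, POS) rank higher.
def rank_distractors(key, candidates, features, n=3):
    key_feats = features[key]
    scored = sorted(
        (c for c in candidates if c != key),        # never offer the key itself
        key=lambda c: len(features[c] & key_feats),  # shared-feature count
        reverse=True,
    )
    return scored[:n]

# Hypothetical feature sets for a medical vocabulary question.
features = {
    "aspirin":     {"drug", "analgesic", "noun"},
    "ibuprofen":   {"drug", "analgesic", "noun"},
    "penicillin":  {"drug", "antibiotic", "noun"},
    "stethoscope": {"instrument", "noun"},
}
print(rank_distractors("aspirin", list(features), features, n=2))
# → ['ibuprofen', 'penicillin']
```

The near-synonym "ibuprofen" outranks the merely related "penicillin", which in turn outranks the unrelated "stethoscope", reflecting the intuition that plausible distractors resemble the key.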
Feedback generation: Feedback provides an explanation of the correctness or incorrectness of responses to questions, usually in reaction to user selection. As feedback generation is one of the main interests of this review, we elaborate more fully on this in the “Feedback Generation” section.
Controlling difficulty: This task focuses on determining how easy or difficult a question will be. We elaborate more on this in the section titled “Difficulty” .
Verbalisation: This task is concerned with producing the final surface structure of the question. There is more on this in the section titled “Verbalisation”.
Question ranking (also referred to as question selection or question filtering): Several generators employed an “over-generate and rank” approach whereby a large number of questions are generated, and then ranked or filtered in a subsequent phase. The ranking goal is to prioritise good quality questions. The ranking is achieved by the use of statistical models as in Blšták (2018), Kwankajornkiet et al. (2016), Liu et al. (2017), and Niraula and Rus (2015).
In this section, we summarise our observations on which input formats are most popular in the literature published after 2014. One question we had in mind is whether structured sources (i.e. whereby knowledge is organised in a way that facilitates automatic retrieval and processing) are gaining more popularity. We were also interested in the association between the input being used and the domain or question types. Specifically, are some inputs more common in specific domains? And are some inputs more suitable for specific types of questions?
As in the findings of Alsubait (Table 1), text is still the most popular type of input, with 42 studies using it. Ontologies and resource description framework (RDF) knowledge bases come second, with eight and six studies, respectively. Note that these three input formats are shared between our review and Alsubait's review. Another input used by more than one study is question stems and keys, which feature in five studies that focus on generating distractors. See the Appendix “Summary of Included Studies” for the types of inputs used in each study.
The majority of studies reporting the use of text as the main input are centred around generating questions for language learning (18 studies) or generating simple factual questions (16 studies). Other domains investigated are medicine, history, and sport (one study each). On the other hand, among studies utilising Semantic Web technologies, only one tackles the generation of language questions and nine tackle the generation of domain-unspecific questions. Questions for biology, medicine, biomedicine, and programming have also been generated using Semantic Web technologies. Additional domains investigated in Alsubait’s review are mathematics, science, and databases (for studies using the Semantic Web). Combining both results, we see a greater variety of domains in semantic-based approaches.
Free-response questions are more prevalent among studies using text, with 21 studies focusing on this question type, 18 on multiple-choice, three on both free-response and multiple-choice questions, and one on verbal response questions. Some studies employ additional resources such as WordNet (Kwankajornkiet et al. 2016; Kumar et al. 2015a) or DBpedia (Faizan and Lohmann 2018; Faizan et al. 2017; Tamura et al. 2015) to generate distractors. By contrast, MCQs are more prevalent in studies using Semantic Web technologies, with ten studies focusing on the generation of multiple-choice questions and four studies focusing on free-response questions. This result is similar to those obtained by Alsubait (Table 1) with free-response being more popular for generation from text and multiple-choice more popular from structured sources. We have discussed why this is the case in the “Findings of Alsubait’s Review” Section.
Domain, Question Types and Language
[Table: Domains for which questions are generated and types of questions in the reviewed literature, with the number of studies per entry. Question types listed include: When, Why, How, and How many; Whom, Whose, and How much; recognition, generalisation, and specification questions; list and describe questions; summarise and name-some questions; What and Who; Where and How many; Which, Why, How, and How long; TOEFL reference questions; TOEFL vocabulary questions; word reading questions; vocabulary matching questions; reading comprehension (inference) questions; input, output, and function questions; the inverse of “feature specification” questions; What and Where; concept completion questions; and causal consequence questions. Domains listed include bio-medicine and medicine, mathematical word problems, and deterministic finite automata (DFA) problems.]
With regard to the response format of questions, both free- and selected-response questions (i.e. MC and T/F questions) are of interest. In all, 35 studies focus on generating selected-response questions, 32 on generating free-response questions, and four studies on both. These numbers are similar to the results reported in Alsubait (2015), which were 33 and 32 papers on generation of free- and selected-response questions respectively (Table 1). However, which format is more suitable for assessment is debatable. Although some studies that advocate the use of free-response argue that these questions can test a higher cognitive level,10 most automatically generated free-response questions are simple factual questions for which the answers are short facts explicitly mentioned in the input. Thus, we believe that it is useful to generate distractors, leaving to exam designers the choice of whether to use the free-response or the multiple-choice version of the question.
Concerning language, the majority of studies focus on generating questions in English (59 studies). Questions in Chinese (5 studies), Japanese (3 studies), Indonesian (2 studies), as well as Punjabi and Thai (1 study each), have also been generated. To ascertain which languages had been investigated before, we skimmed the papers identified in Alsubait (2015) and found three studies on generating questions in languages other than English: French in Fairon (1999), Tagalog in Montenegro et al. (2012), and Chinese, in addition to English, in Wang et al. (2012). This reflects an increasing interest in generating questions in other languages, which possibly accompanies growing interest in NLP research on these languages. Note that there may be studies on other languages, or more studies on the languages we identified, that we were not able to capture because we excluded studies written in languages other than English.
Feedback generation concerns the provision of information regarding the response to a question. Feedback is important in reinforcing the benefits of questions especially in electronic environments in which interaction between instructors and students is limited. In addition to informing test takers of the correctness of their responses, feedback plays a role in correcting test takers’ errors and misconceptions and in guiding them to the knowledge they must acquire, possibly with reference to additional materials.
This aspect of questions has been neglected in early and recent AQG literature. Among the literature that we reviewed, only one study, Leo et al. (2019), has generated feedback, alongside the generated questions. They generate feedback as a verbalisation of the axioms used to select options. In cases of distractors, axioms used to generate both key and distractors are included in the feedback.
We found another study (Das and Majumder 2017) that incorporated a procedure for generating hints using syntactic features, such as the number of words in the key, the first two letters of a one-word key, or the second word of a two-word key.
Difficulty is a fundamental property of questions that is approximated using different statistical measures, one of which is percentage correct (i.e. the percentage of examinees who answered a question correctly).11 Lack of control over difficulty poses issues such as generating questions of inappropriate difficulty (inappropriately easy or difficult questions). Also, searching for a question of a specific difficulty among a huge number of generated questions is likely to be tedious for exam designers.
We structure this section around three aspects of difficulty models: 1) their generality, 2) features underlying them, and 3) evaluation of their performance.
Despite the growth in AQG, only 14 studies have dealt with difficulty. Eight of these studies focus on the difficulty of questions belonging to a particular domain, such as mathematical word problems (Wang and Su 2016; Khodeir et al. 2018), geometry questions (Singhal et al. 2016), vocabulary questions (Susanti et al. 2017a), reading comprehension questions (Gao et al. 2018), DFA problems (Shenoy et al. 2016), code-tracing questions (Thomas et al. 2019), and medical case-based questions (Leo et al. 2019; Kurdi et al. 2019). The remaining six focus on controlling the difficulty of non-domain-specific questions (Lin et al. 2015; Alsubait et al. 2016; Kurdi et al. 2017; Faizan and Lohmann 2018; Faizan et al. 2017; Seyler et al. 2017; Vinu and Kumar 2015a, 2017a; Vinu et al. 2016; Vinu and Kumar 2017b, 2015b).
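The similarity-based view of difficulty adopted by several of these studies (e.g. Alsubait et al. 2016) can be sketched as follows; Jaccard similarity over invented feature sets stands in for the ontology-derived features actually used:

```python
# The more similar the distractors are to the key, the harder the
# question is assumed to be.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def estimated_difficulty(key_feats, distractor_feats):
    """Mean key-distractor similarity, in [0, 1]; higher = harder."""
    sims = [jaccard(key_feats, d) for d in distractor_feats]
    return sum(sims) / len(sims)

easy = estimated_difficulty({"drug", "analgesic"},
                            [{"instrument"}, {"organ"}])          # unrelated distractors
hard = estimated_difficulty({"drug", "analgesic"},
                            [{"drug", "analgesic"}, {"drug"}])    # near-identical distractors
print(easy < hard)  # → True
```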
[Table: Features proposed for controlling the difficulty of generated questions. Proposed features include: feature-based similarity between key and distractors (Lin et al. 2015; Vinu and Kumar 2015b; Alsubait et al. 2016; Kurdi et al. 2017); number and type of domain-objects involved, number and type of domain-rules involved, user-given scenarios, length of the solution, and direct/indirect use of the rules involved; reading passage difficulty, contextual similarity between key and distractors, and distractor word difficulty level; quality of hints (i.e. how much they reduce the answer space); popularity of predicates present in stems and depth of the concepts and roles present in a stem in the class hierarchy (Vinu et al. 2016; Vinu and Kumar 2017b); eight features specific to DFA problems (Shenoy et al. 2016); complexity of equations and presence of distraction (i.e. redundant information) in the stem (Wang and Su 2016); popularity of entities (of both question and answer), popularity of semantic types, and coherence of entity pairs (i.e. tendency to appear together) (Seyler et al. 2017); depth of the correct answer in the class hierarchy and popularity of RDF triples (of subject and object) (Faizan and Lohmann 2018; Faizan et al. 2017); question word proximity hint (i.e. distance of all nonstop sentence words to the answer) (Gao et al. 2018); number and types of included operators and number of objects in the story (Khodeir et al. 2018); option entity difference (Leo et al. 2019; Kurdi et al. 2019); and number of executable blocks in a piece of code (Thomas et al. 2019).]
[Table: Types of evaluation employed for verifying difficulty models; an asterisk “*” indicates that no sufficient information about the reviewers is reported. Sample sizes ranged from 4 questions (Shenoy et al. 2016) to 435 questions (Leo et al. 2019), with evaluations involving students, experts, or automatic solvers; for example, Lin et al. (2015) evaluated 45 questions with 88 participants, and Gao et al. (2018) evaluated 200 questions with 2 automatic solvers.]
In addition to controlling difficulty, in one study (Kusuma and Alhamri 2018), the authors claim to generate questions targeting a specific Bloom level. However, no evaluation of whether the generated questions are indeed at that Bloom level was conducted.
We define verbalisation as any process carried out to improve the surface structure of questions (grammaticality and fluency) or to provide variations of questions (i.e. paraphrasing). The former is important since linguistic issues may affect the quality of generated questions. For example, grammatical inconsistency between the stem and incorrect options enables test takers to select the correct option with no mastery of the required knowledge. On the other hand, grammatical inconsistency between the stem and the correct option can confuse test takers who have the required knowledge and would have been likely to select the key otherwise. Providing different phrasing for the question text is also of importance, playing a role in keeping test takers engaged. It also plays a role in challenging test takers and ensuring that they have mastered the required knowledge, especially in the language learning domain. To illustrate, consider questions for reading comprehension assessment; if the questions match the text with a very slight variation, test takers are likely to be able to answer these questions by matching the surface structure without really grasping the meaning of the text.
From the literature identified in this review, only ten studies apply additional processes for verbalisation. Given that the majority of the literature focuses on gap-fill question generation, this result is expected. Aspects of verbalisation that have been considered are pronoun substitutions (i.e. replacing pronouns by their antecedents) (Huang and He 2016), selection of a suitable auxiliary verb (Mazidi and Nielsen 2015), determiner selection (Zhang and VanLehn 2016), and representation of semantic entities (Vinu and Kumar 2015b; Seyler et al. 2017) (see below for more on this). Other verbalisation processes that are mostly specific to some question types are the following: selection of singular personal pronouns (Faizan and Lohmann 2018; Faizan et al. 2017), which is relevant for Jeopardy questions; selection of adjectives for predicates (Vinu and Kumar 2017a), which is relevant for aggregation questions; and ordering sentences and reference resolution (Huang and He 2016), which is relevant for word problems.
For approaches utilising structured knowledge sources, semantic entities, which are usually represented following some convention such as camel case (e.g. anExampleOfCamelCase) or underscores as word separators, need to be rendered in a natural form. Basic processing, which includes word segmentation; adaptation of camel case, underscores, spaces, and punctuation; and conversion of the segmented phrase into a suitable morphological form (e.g. “has pet” to “having pet”), has been reported in Vinu and Kumar (2015b). Seyler et al. (2017) used Wikipedia to verbalise entities, an entity-annotated corpus to verbalise predicates, and WordNet to verbalise semantic types. The surface form of Wikipedia links was used as the verbalisation of entities. The annotated corpus was used to collect all sentences that contain mentions of the entities in a triple, combined with some heuristics for filtering and scoring sentences; the phrases between the two entities were used as verbalisations of predicates. Finally, as types correspond to WordNet synsets, the authors used the lexicon that comes with WordNet for verbalising semantic types.
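The basic entity processing described for Vinu and Kumar (2015b) can be sketched with simple string handling; the gerund rule below is a naive, hypothetical stand-in for proper morphological conversion:

```python
import re

def verbalise(entity):
    """Split camelCase and underscores into a lower-case phrase."""
    phrase = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", entity.replace("_", " "))
    return re.sub(r"\s+", " ", phrase).strip().lower()

def to_gerund(phrase):
    """Naive morphological adaptation, e.g. 'has pet' -> 'having pet'."""
    head, _, rest = phrase.partition(" ")
    return ("having " + rest) if head == "has" else phrase

print(verbalise("anExampleOfCamelCase"))  # → an example of camel case
print(to_gerund(verbalise("has_pet")))    # → having pet
```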
Only two studies (Huang and He 2016; Ai et al. 2015) have considered paraphrasing. Ai et al. (2015) employed a manually created library that includes different ways to express particular semantic relations for this purpose. For instance, “wife had a kid from husband” is expressed as “from husband, wife had a kid”. The latter is randomly chosen from among the ways to express the marriage relation as defined in the library. The other study that tackles paraphrasing is Huang and He (2016) in which words were replaced with synonyms.
In this section, we report on standard datasets and evaluation practices that are currently used in the field (considering how QG approaches are evaluated and what aspects of questions such evaluation focuses on). We also report on issues hindering comparison of the performance of different approaches and identification of the best-performing methods. Note that our focus is on the results of evaluating the whole generation approach, as indicated by the quality of generated questions, and not on the results of evaluating a specific component of the approach (e.g. sentence selection or classification of question types). We also do not report on evaluations related to the usability of question generators (e.g. evaluating ease of use) or efficiency (i.e. time taken to generate questions). For approaches using ontologies as the main input, we consider whether they use existing ontologies or experimental ones (i.e. created for the purpose of QG), since Alsubait (2015) raised concerns related to the use of experimental ontologies in evaluations (see the “Findings of Alsubait’s Review” section). We also reflect on further issues in the design and implementation of evaluation procedures and how they can be improved.
In what follows, we outline publicly available question corpora, providing details about their content, as well as how they were developed and used in the context of QG. These corpora are grouped on the basis of the initial purpose for which they were developed. Following this, we discuss the advantages and limitations of using such datasets and call attention to some aspects to consider when developing similar datasets.
- Machine reading comprehension
The Stanford Question Answering Dataset (SQuAD)12 (Rajpurkar et al. 2016) consists of 150K questions about Wikipedia articles developed by AMT co-workers. Of those, 100K questions are accompanied by paragraph-answer pairs from the same articles and 50K questions have no answer in the article. This dataset was used by Kumar et al. (2018) and Wang et al. (2018) to perform a comparison among variants of the generation approach they developed and between their approach and an approach from the literature. The comparison was based on the metrics BLEU-4, METEOR, and ROUGE-L which capture the similarity between generated questions and the SQuAD questions that serve as ground truth questions (there is more information on these metrics in the next section). That is, questions were generated using the 100K paragraph-answer pairs as input. Then, the generated questions were compared with the human-authored questions that are based on the same paragraph-answer pairs.
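Metrics such as BLEU are built from clipped n-gram precision between a generated question and the ground-truth question; the core computation can be sketched as follows (BLEU-4 additionally combines precisions up to 4-grams with a brevity penalty):

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision, the building block of BLEU: candidate
    n-gram counts are clipped by their counts in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(1, sum(cand.values()))

p = clipped_precision("what is the capital of france",
                      "what is the capital city of france")
print(p)  # → 1.0 (every candidate unigram appears in the reference)
```

In practice, evaluations use library implementations of BLEU-4, METEOR, and ROUGE-L rather than hand-rolled precision, but the underlying idea of overlap with ground-truth questions is the same.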
NewsQA13 is another crowd-sourced dataset of about 120K question-answer pairs about CNN articles. The dataset consists of wh-questions and is used in the same way as SQuAD.
- Training question-answering (QA) systems
The 30M factoid question-answer corpus (Serban et al. 2016) is a corpus of questions automatically generated from Freebase.14 Freebase triples (of the form: subject, relationship, object) were used to generate questions where the correct answer is the object of the triple. For example, the question: “What continent is bayuvi dupki in?” is generated from the triple (bayuvi dupki, contained by, europe). The triples and the questions generated from them are provided in the dataset. A sample of the questions was evaluated by 63 AMT co-workers, each of whom evaluated 44-75 examples; each question was evaluated by 3-5 co-workers. The questions were also evaluated by automatic evaluation metrics. Song and Zhao (2016a) performed a qualitative analysis comparing the grammaticality and naturalness of questions generated by their approach and questions from this corpus (although the comparison is not clear).
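Triple-to-question generation of this kind can be sketched with relation-specific templates; the example triple and question come from the corpus description above, while the template table itself is invented for illustration:

```python
# Hypothetical relation-to-template lookup: the object of the triple
# becomes the answer, and the subject fills the stem template.
TEMPLATES = {
    "contained by": "What continent is {subject} in?",
    "capital of":   "What is {subject} the capital of?",
}

def triple_to_question(subject, relation, obj):
    template = TEMPLATES.get(relation)
    if template is None:
        return None
    return template.format(subject=subject), obj

question, answer = triple_to_question("bayuvi dupki", "contained by", "europe")
print(question)  # → What continent is bayuvi dupki in?
print(answer)    # → europe
```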
SciQ15 (Welbl et al. 2017) is a corpus of 13.7K science MCQs on biology, chemistry, earth science, and physics. The questions target a broad cohort, ranging from elementary to college introductory level. The corpus was created by AMT co-workers at a cost of $10,415 and its development relied on a two-stage procedure. First, 175 co-workers were shown paragraphs and asked to generate questions for a payment of $0.30 per question. Second, another crowd-sourcing task in which co-workers validate the questions developed and provide them with distractors was conducted. A list of six distractors was provided by a ML-model. The co-workers were asked to select two distractors from the list and to provide at least one additional distractor for a payment of $0.20. For evaluation, a third crowd-sourcing task was created. The co-workers were provided with 100 question pairs, each pair consisting of an original science exam question and a crowd-sourced question in a random order. They were instructed to select the question likelier to be the real exam question. The science exam questions were identified in 55% of the cases. This corpus was used by Liang et al. (2018) to develop and test a model for ranking distractors. All keys and distractors in the dataset were fed to the model to rank. The authors assessed whether ranked distractors were among the original distractors provided with the questions.
- Question generation
The question generation shared task challenge (QGSTEC) dataset16 (Rus et al. 2012) was created for the QG shared task. The shared task comprises two challenges: question generation from individual sentences and question generation from a paragraph. The dataset contains 90 sentences and 65 paragraphs collected from Wikipedia, OpenLearn,17 and Yahoo! Answers, with 180 and 390 questions generated from the sentences and paragraphs, respectively. A detailed description of the dataset, along with the results achieved by the participants, is given in Rus et al. (2012). Blšták and Rozinajová (2017, 2018) used this dataset to generate questions and compared the correctness of their questions to that of the systems participating in the shared task.
Medical CBQ corpus (Leo et al. 2019) is a corpus of 435 case-based, auto-generated questions that follow four templates (“What is the most likely diagnosis?”, “What is the drug of choice?”, “What is the most likely clinical finding?”, and “What is the differential diagnosis?”). The questions are accompanied by experts’ ratings of appropriateness, difficulty, and actual student performance. The data was used to evaluate an ontology-based approach for generating case-based questions and predicting their difficulty.
MCQL is a corpus of about 7.1K MCQs crawled from the web, with an average of 2.91 distractors per question. The questions cover biology, physics, and chemistry and target Cambridge O-level and college-level examinees. The dataset was used by Liang et al. (2018) to develop and evaluate an ML model for ranking distractors.
Information about question corpora that are used in the reviewed literature
Some of these datasets were used to develop and evaluate ML-models for ranking distractors. However, being written by humans does not necessarily mean that these distractors are good. This is, in fact, supported by many studies on the quality of distractors in real exam questions (Sarin et al. 1998; Tarrant et al. 2009; Ware and Vik 2009). If these datasets were to be used for similar purposes, distractors would need to be filtered based on their functionality (i.e. being picked by test takers as answers to questions).
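Such functionality-based filtering is straightforward to sketch. The following is a minimal illustration, not taken from any reviewed system: the response data, option labels, and the 5% threshold are all illustrative assumptions.

```python
from collections import Counter

def functional_distractors(responses, distractors, threshold=0.05):
    """Keep only distractors chosen by at least `threshold` of examinees.

    responses: option labels selected by examinees, e.g. ["A", "C", ...]
    distractors: option labels that are incorrect answers.
    """
    counts = Counter(responses)
    n = len(responses)
    return [d for d in distractors if counts[d] / n >= threshold]

# 20 examinees answering one MCQ with key "A" and distractors "B", "C", "D"
responses = ["A"] * 12 + ["B"] * 5 + ["C"] * 3  # nobody picked "D"
print(functional_distractors(responses, ["B", "C", "D"]))  # ['B', 'C']
```

Distractor "D" is dropped because no examinee selected it, i.e. it is non-functional.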
We also observe that these datasets have been used in a small number of studies (1-2). This is partially due to the fact that many of them are relatively new. In addition, the design space for question generation is large (i.e. different inputs, question types, and domains). Therefore, each of these datasets is only relevant for a small set of question generators.
Types of Evaluation
The most common evaluation approach is expert-based evaluation (n = 21), in which experts are presented with a sample of generated questions to review. Given that expert review is also a standard procedure for selecting questions for real exams, expert rating is believed to be a good proxy for quality. However, it is important to note that expert review only provides initial evidence for the quality of questions. The questions also need to be administered to a sample of students to obtain further evidence of their quality (empirical difficulty, discrimination, and reliability), as we will see later. However, invalid questions must be filtered first, and expert review is also utilised for this purpose, whereby questions indicated by experts to be invalid (e.g. ambiguous, guessable, or not requiring domain knowledge) are filtered out. Having an appropriate question set is important to keep participants involved in question evaluation motivated and interested in solving these questions.
One of our observations on expert-based evaluation is that only in a few studies were experts required to answer the questions as part of the review. We believe this is an important step to incorporate, since answering a question encourages engagement and triggers deeper thinking about what is required to answer it. In addition, expert performance on questions is another indicator of question quality and difficulty: questions answered incorrectly by experts can be ambiguous or very difficult.
Another observation on expert-based evaluation is the ambiguity of the instructions provided to experts. For example, in an evaluation of reading comprehension questions (Mostow et al. 2017), the authors reported different interpretations of the instructions for rating overall question quality, whereby one expert pointed out that it was not clear whether reading the preceding text was required in order to rate the question as being of good quality. Researchers have also measured question acceptability, as well as other aspects of questions, using scales with a large number of categories (up to a 9-point scale) without a clear description of each category. Zhang (2015) found that reviewers perceive scales differently and that not all categories of a scale are used by all reviewers. We believe that these two issues are reasons for the low inter-rater agreement between experts. To improve the accuracy of the data obtained through expert review, researchers must precisely specify the criteria by which questions are to be evaluated. In addition, a pilot test needs to be conducted with experts to provide an opportunity for validating the instructions and ensuring that instructions and questions are easily understood and interpreted as intended by different respondents.
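Agreement between expert raters is commonly quantified with chance-corrected statistics such as Cohen's kappa, which this review itself uses in its quality assessment. A minimal sketch for two raters and nominal categories; the example ratings are hypothetical:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters rating the same items on a nominal scale."""
    assert len(r1) == len(r2)
    n = len(r1)
    # Observed agreement: proportion of items both raters labelled identically.
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

# Two experts rating ten questions as acceptable ("yes") or not ("no")
e1 = ["yes"] * 6 + ["no"] * 4
e2 = ["yes"] * 5 + ["no"] * 5
print(round(cohens_kappa(e1, e2), 2))  # 0.8
```

Interpretation bands (e.g. "moderate" vs "substantial" agreement) follow conventions such as those of Viera et al. (2005), cited later in this review.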
The second most commonly employed method for evaluation is comparing machine-generated questions (or parts of questions) to human-authored ones (n = 15), carried out either automatically or as part of the expert review. This comparison is utilised to confirm different aspects of question quality. Zhang and VanLehn (2016) evaluated their approach by counting the number of questions in common between those that are human- and machine-generated. The authors used this method under the assumption that humans are likely to ask deep questions about topics (i.e. questions of a higher cognitive level); on this ground, they claimed that an overlap means the machine was able to mimic this in-depth questioning. Other researchers have compared machine-generated questions with human-authored reference questions using metrics borrowed from the fields of text summarisation (ROUGE (Lin 2004)) and machine translation (BLEU (Papineni et al. 2002) and METEOR (Banerjee and Lavie 2005)). These metrics measure the similarity between two questions generated from the same text segment or sentence. Put simply, this is achieved by matching n-grams in the generated question against n-grams in the gold-standard question, with some metrics focusing on recall (i.e. how much of the reference question is captured in the generated question) and others on precision (i.e. how much of the generated question is relevant). METEOR also considers stemming and synonymy matching. Wang et al. (2018) claimed that these metrics can be used as initial, inexpensive, large-scale indicators of the fluency and relevancy of questions. Other researchers investigated whether machine-generated questions are indistinguishable from human-authored questions by mixing both types and asking experts about the source of each question (Chinkina and Meurers 2017; Susanti et al. 2015; Khodeir et al. 2018).
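As a rough illustration of how these n-gram metrics work, the sketch below computes clipped n-gram precision, the core quantity behind BLEU, omitting the brevity penalty and the averaging over n-gram orders; the example questions are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(reference, candidate, n=2):
    """Clipped n-gram precision of a generated question against a reference.

    Each candidate n-gram counts at most as often as it appears in the
    reference ("clipping"), then the total is divided by the number of
    candidate n-grams.
    """
    ref = Counter(ngrams(reference.lower().split(), n))
    cand = Counter(ngrams(candidate.lower().split(), n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

ref = "what is the capital of france"        # gold-standard question
gen = "what is the capital city of france"   # machine-generated question
print(round(ngram_precision(ref, gen, n=2), 2))  # 0.67 (4 of 6 bigrams match)
```

Swapping the roles of reference and candidate turns the same count into a recall-oriented score, which is essentially the distinction between BLEU- and ROUGE-style measures.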
Some researchers evaluated their approaches by investigating the ability of the approach to assemble human-authored distractors. For example, Yaneva et al. (2018) only focused on generating distractors given a question stem and key. However, given the published evidence of the poor quality of human-generated distractors, additional checks need to be performed, such as the functionality of these distractors.
Crowd-sourcing has also been used in ten of the studies. In seven of these, crowd workers were employed to review questions, while in the remaining three, they were employed to take mock tests. To assess the quality of their responses, Chinkina and Meurers (2017) included test questions to make sure that the workers understood the task and were able to distinguish low-quality from high-quality questions. However, a process for validating the reliability of crowd workers has been neglected in most studies (or perhaps not reported). Another validation step that can be added to the experimental protocol is conducting a pilot to test the capability of the workers for review. This can also be achieved by adding validated questions to the list of questions to be reviewed (given the availability of a validated question set).
Similarly, students have been employed to review questions in nine studies and to take tests in a further ten. We attribute the low rate of question validation through testing with student cohorts to it being time-consuming and to the ethical issues involved in these experiments. Experimenters must ensure that these tests do not have an influence on students’ grades or motivations. For example, if multiple auto-generated questions focus on one topic, students could perceive this as an important topic and pay more attention to it while studying for upcoming exams, possibly giving less attention to other topics not covered by the experimental exam. Difficulty of such experimental exams could also affect students. If an experimental test is very easy, students could expect upcoming exams to be the same, again paying less attention when studying for them. Another possible threat is a drop in student motivation triggered by an experimental exam being too difficult.
Finally, for ontology-based approaches, similar to the findings reported in the section “Findings of Alsubait’s Review”, most ontologies used in evaluations were hand-crafted for experimental purposes and the use of real ontologies was neglected, except in Vinu and Kumar (2015b), Leo et al. (2019), and Lopetegui et al. (2015).
Quality Criteria and Metrics
Evaluation metrics used in the reviewed literature, grouped by the aspect of the question they target:
- Question as a whole:
  - Statistical difficulty (i.e. based on examinee performance) and reviewer rating of difficulty
  - Question acceptability (often by domain experts)
  - Educational usefulness (i.e. usability in a learning context)
  - Relevance to the input
  - Being indistinguishable from human-authored questions
  - Overlap with human-authored questions
  - Freeness from errors
  - Cognitive level or depth
  - Diversity of question types
  - How much the question reveals about the answer
- Answers and distractors:
  - Distractor quality or plausibility
  - Answer correctness or distractor correctness
  - Distractor functionality (i.e. based on examinee performance)
  - Overlap with human-generated distractors
  - Distractor matching the intended type
- Templates:
  - Generality of the designed templates
Performance of Generation Approaches and Gold Standard Performance
We started this systematic review hoping to identify standard performance and the best generation approaches. However, comparing the performance of the various approaches was not possible due to heterogeneity in the measurement of quality and the reporting of results. For example, scales consisting of different numbers of categories were used by different studies to measure the same variables. We were not able to normalise these scales because most studies reported only aggregated data, without providing the number of observations in each rating scale category. Another example of heterogeneity concerns difficulty based on examinee performance: some studies report the percentage of correct responses, while others report Rasch difficulty, without providing the raw data needed to calculate the other metric. Also, essential information needed to judge the trustworthiness and generality of the results, such as sample size and selection method, was not reported in multiple studies. All of these issues preclude a statistical analysis of, and a conclusion about, the performance of generation approaches.
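To illustrate why percentage correct and Rasch difficulty cannot be interconverted without the raw data: under the Rasch model, a proportion correct maps to a difficulty logit only if one fixes an assumed ability level. The sketch below makes that crude assumption (mean ability of zero) explicit; it is an approximation for illustration, not a substitute for proper calibration on the full response matrix.

```python
import math

def logit_difficulty(p_correct):
    """Crude Rasch-style difficulty from a proportion correct.

    Under the Rasch model with person ability fixed at 0, the probability of
    a correct answer is 1 / (1 + exp(b)), so b = log((1 - p) / p). Proper
    Rasch calibration requires the full person-by-item response matrix.
    """
    return math.log((1 - p_correct) / p_correct)

# Easy item (90% correct), average item (50%), hard item (20%)
for p in (0.9, 0.5, 0.2):
    print(p, round(logit_difficulty(p), 2))  # -2.2, 0.0, 1.39
```

The mapping is monotone, so the direction of the conversion is always clear; it is the scale that depends on the unreported ability distribution, which is exactly why the raw data matter.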
Quality Assessment Results
In this section, we describe and reflect on the state of experimental reporting in the reviewed literature.
Overall, the experimental reporting is unsatisfactory. Essential information needed to assess the strength of a study is often not reported, raising concerns about the trustworthiness and generalisability of the results. For example, the number of evaluated questions, the number of participants involved in evaluations, or both of these numbers are not mentioned in five, ten, and five studies, respectively. Information about the sampling strategy and how the sample size was determined is almost never reported (see the Appendix, “Quality assessment”).
A description of the participants’ characteristics, whether experts, students, or co-workers, is frequently missing (neglected by 23 studies). Minimal information that needs to be reported about experts involved in reviewing questions, in addition to their numbers, is their teaching and exam construction experience. Reporting whether experts were paid or not is important for the reader to understand possible biases involved. However, this is not reported in 51 studies involving experiments with human subjects. Other additional helpful information to report is the time taken to review, because this would assist researchers to estimate the number of experts to recruit given a particular sample size, or to estimate the number of questions to sample given the available number of experts.
Characteristics of students involved in evaluations, such as their educational level and experience with the subject under assessment, are important for the replication of studies. In addition, this information can provide a basis for combining evidence from multiple studies. For example, we could gain stronger evidence about the effect of specific features on question difficulty by combining studies investigating the same features with different cohorts. The characteristics of the participants are also a possible explanation for differences in difficulty between studies. Similarly, it is important to report the criteria used for the selection of crowd workers, such as restrictions on the countries they are from, or the number and accuracy of the previous tasks in which they participated.
Some studies neglect to report on the total number of generated questions and the distribution of questions per categories (question types, difficulty levels, and question sources, when applicable), which are necessary to assess the suitability of sampling strategies. For example, without reporting the distribution of question types, making a claim based on random sampling that “70% of questions are appropriate to be used in exams” would be misleading if the distribution of question types is skewed. This is due to the sample not being representative of question types with a low number of questions. Similarly, if the majority of generated questions are easy, using a random sample will result in the underrepresentation of difficult questions, consequently precluding any conclusion about difficult questions or any comparison between easy and difficult questions.
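One way to avoid the underrepresentation problem described above is stratified rather than simple random sampling. A minimal sketch; the question pool and category sizes are invented for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(questions, key, per_stratum, seed=0):
    """Sample up to `per_stratum` questions from each category, so that rare
    question types are not drowned out by frequent ones."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in questions:
        strata[key(q)].append(q)
    sample = []
    for qs in strata.values():
        sample.extend(rng.sample(qs, min(per_stratum, len(qs))))
    return sample

# 95 gap-fill questions vs 5 wh-questions: a simple random sample of 10 would
# likely under-represent the wh-questions; a stratified sample will not.
pool = [{"id": i, "type": "gap-fill"} for i in range(95)] + \
       [{"id": i, "type": "wh"} for i in range(95, 100)]
s = stratified_sample(pool, key=lambda q: q["type"], per_stratum=5)
print(sorted({q["type"] for q in s}))  # ['gap-fill', 'wh']
```

The same idea applies to stratifying by difficulty level or by question source, whenever the distribution of the generated questions is skewed.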
With regard to measurement descriptions, ten studies fail to report information sufficient for replication, such as the instructions given to participants and a description of the rating scales. Another limitation concerning measurement is the lack of assessment of inter-rater reliability (not reported by 43 studies). In addition, we observed a lack of justification for experimental decisions. An example is the choice of sources from which questions were generated: particular texts or knowledge sources were selected without any discussion of whether these sources were representative and, if so, of what. We believe that the generation challenges and question quality issues that might be encountered when using different sources need to be raised and discussed.
Conclusion and Future Work
In this paper, we have conducted a comprehensive review of 93 papers addressing the automatic generation of questions for educational purposes. In what follows, we summarise our findings in relation to the review objectives.
Providing an Overview of the AQG Community and its Activities
We found that AQG is an increasing activity of a growing community. Through this review, we identified the top publication venues and the active research groups in the field, providing a connection point for researchers interested in the field.
Summarising Current QG Approaches
We found that the majority of QG systems focus on generating questions for the purpose of assessment. The template-based approach was the most common method employed in the reviewed literature. In addition to the generation of complete questions or of question components, a variety of pre- and post-processing tasks that are believed to improve question quality have been investigated. The focus was on the generation of questions from text and for the language domain. The generation of multiple-choice and free-response questions was investigated almost equally, with a large number of studies focusing on wh-word and gap-fill questions. We also found increased interest in generating questions in languages other than English. Although extensive research has been carried out on QG, only a small proportion of it tackles the generation of feedback, the verbalisation of questions, and the control of question difficulty.
Identifying Gold Standard Performance in AQG
Incomparability of the performance of generation approaches is an issue we identified in the reviewed literature. This issue is due to the heterogeneity in both measurement of quality and reporting of results. We suggest below how the evaluation of questions and reporting of results can be improved to overcome this issue.
Tracking the Evolution of AQG Since Alsubait’s Review
Our results are consistent with the findings of Alsubait (2015). Based on these findings, we suggest that research in the area can be extended in the following directions (starting at the question level before moving on to the evaluation and research in closely related areas):
Improvement at the Question Level
Generating Questions with Controlled Difficulty
As mentioned earlier, there is little research on question difficulty, and what exists mostly focuses on either stem or distractor difficulty. The difficulty of both the stem and the options plays a role in overall difficulty, and they therefore need to be considered together and not in isolation. Furthermore, controlling MCQ difficulty by varying the similarity between the key and the distractors is a common feature found in multiple studies. However, similarity is only one facet of difficulty, and there are others that need to be identified and integrated into the generation process. Thus, the formulation of a theory behind an intelligent automatic question generator capable of both generating questions and accurately controlling their difficulty is at the heart of AQG research. Such a theory could be used to improve the quality of generated questions by filtering out inappropriately easy or difficult questions, which is especially important given the large number of questions generated.
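As a toy illustration of similarity-based difficulty control, the sketch below selects the distractors whose similarity to the key is closest to a target value, so that a higher target yields harder items. The Jaccard measure and the medical terms are illustrative assumptions, not the method of any particular reviewed study.

```python
def jaccard(a, b):
    """Token-overlap similarity between two concept labels."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def pick_distractors(key, candidates, target, k=3):
    """Pick the k candidates whose similarity to the key is closest to
    `target`: a high target yields harder items, a low target easier ones."""
    return sorted(candidates, key=lambda c: abs(jaccard(key, c) - target))[:k]

key = "acute viral hepatitis"
candidates = ["chronic viral hepatitis", "acute pancreatitis",
              "common cold", "alcoholic hepatitis", "bone fracture"]
# A mid-to-high target favours near-miss concepts over obviously wrong ones.
print(pick_distractors(key, candidates, target=0.5, k=2))
```

In ontology-based generators the same scheme is typically applied with a structural or semantic similarity measure over concepts rather than token overlap.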
Enriching Question Forms and Structures
One of the main limitations of existing works is the simplicity of generated questions, which has also been highlighted in Song and Zhao (2016b). Most generated questions consist of a few terms and target lower cognitive levels. While these questions are still useful, there is a potential for improvement by exploring the generation of other, higher order and more complex, types of questions.
Automating Template Construction
The template library is a major component of question generation systems. At present, the process of template construction is largely manual. The templates are either developed through analysing a set of hand-written questions manually or through consultation with domain experts. While one of the main motivations for generating questions automatically is cost reduction, both of these template acquisition techniques are costly. In addition, there is no evidence that the set of templates defined by a few experts is typical of the set of questions used in assessments. We attribute part of the simplicity of the current questions to the cost, both in terms of time and resources, of both template acquisition techniques.
The cost of generating questions automatically could be reduced further by automatically constructing templates. In addition, this would contribute to the development of more diverse questions.
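A simple starting point for automatic template acquisition is to abstract human-authored questions by replacing recognised domain terms with slots. The sketch below uses a fixed term list for brevity; a real system would rely on named-entity recognition or an ontology, and all names here are invented.

```python
import re

def extract_template(question, domain_terms):
    """Turn a concrete question into a reusable template by replacing known
    domain terms with numbered slots."""
    template, slots = question, []
    # Match longer terms first so multi-word terms are not split by subterms.
    for term in sorted(domain_terms, key=len, reverse=True):
        if re.search(re.escape(term), template, re.IGNORECASE):
            slots.append(term)
            template = re.sub(re.escape(term), f"[{len(slots)}]",
                              template, flags=re.IGNORECASE)
    return template, slots

terms = {"photosynthesis", "chlorophyll", "mitochondria"}
print(extract_template("What role does chlorophyll play in photosynthesis?", terms))
```

Templates mined this way from a large question bank could then be filled with new terms from a knowledge source, addressing both the cost and the diversity concerns raised above.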
Improving Question Presentation
Employing natural language generation and processing techniques to present questions in natural and correct forms, and to eliminate errors that invalidate questions (such as syntactic clues), is an important step to take before questions can be used beyond experimental settings for assessment purposes.
Generating Feedback
As has been seen in both reviews, work on feedback generation is almost non-existent. Developing mechanisms for producing rich, effective feedback is one of the features that needs to be integrated into the generation process. This includes different types of feedback, such as formative, summative, interactive, and personalised feedback.
Improvement of Evaluation Methods
Using Human-Authored Questions for Evaluation
Evaluating question quality, whether by means of expert review or mock exams, is an expensive and time consuming process. Analysing existing exam performance data is a potential source for evaluating question quality and difficulty prediction models. Translating human-authored questions to a machine-processable representation is a possible method for evaluating the ability of generation approaches to generate human-like questions. Regarding the evaluation of difficulty models, this can be done by translating questions to a machine-processable representation, computing the features of these questions, and examining their effect on difficulty. This analysis also provides an understanding of pedagogical content knowledge (i.e. concepts that students often find difficult and usually have misconceptions about). This knowledge can be integrated into difficulty prediction models, or used for question selection and feedback generation.
Standardisation and Development of Automatic Scoring Procedures
To ease comparison between different generation approaches, which has been difficult due to heterogeneity in measurement and reporting, ungrounded heterogeneity needs to be eliminated. The development of standard, well-defined scoring procedures is important to reduce heterogeneity and improve inter-rater reliability. In addition, developing automatic scoring procedures that correlate with human ratings is also important, since this will reduce evaluation cost and heterogeneity.
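Whatever automatic score is developed, its usefulness hinges on how well it tracks human ratings, which is typically checked with a rank correlation. A minimal sketch of Spearman's correlation, assuming no tied ranks; the ratings are invented.

```python
def spearman(x, y):
    """Spearman rank correlation (no-ties formula), e.g. for checking whether
    an automatic score tracks human quality ratings."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

human = [1, 2, 3, 4, 5]            # human ratings of five questions
auto = [0.1, 0.3, 0.2, 0.8, 0.9]   # automatic scores for the same questions
print(spearman(human, auto))  # 0.9
```

A rank correlation is preferable to Pearson's here because automatic scores and human rating scales are rarely on comparable, linear scales.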
Improvement of Reporting
We also emphasise the need for good experimental reporting. In general, authors should improve reporting on both their generation approaches and their evaluations, which are essential for other researchers who wish to compare their approaches with existing ones. At a minimum, the data extracted in this review (refer to the questions under OBJ2 and OBJ3) should be reported in all publications on AQG. To ensure quality, journals can require authors to complete a checklist prior to peer review, which has been shown to improve reporting quality (Han et al. 2017). Alternatively, text-mining techniques can be used to assess reporting quality by targeting key information in the AQG literature, as has been proposed in Flórez-Vargas et al. (2016).
Other Areas of Improvement and Further Research
Assembling Exams from the Generated Questions
Although there is a large amount of work that needs to be done at the question level before moving to the exam level, further work in extending the difficulty models, enriching question form and structure, and improving presentation are steps towards this goal. Research in these directions will open new opportunities for AQG research to move towards assembling exams automatically from generated questions. One of the challenges in exam generation is the selection of a question set that is of appropriate difficulty with good coverage of the material. Ensuring that questions do not overlap or provide clues for other questions also needs to be taken into account. The AQG field could adopt ideas from the question answering field in which question entailment has been investigated (for example, see the work of Abacha and Demner-Fushman (2016)). Finally, ordering questions in a way that increases motivation and maximises the accuracy of scores is another interesting area.
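A greedy sketch of the selection problem described above, balancing topic coverage against a target difficulty; the scoring rule and the question pool are illustrative assumptions, not an established exam-assembly algorithm.

```python
def assemble_exam(questions, n_items, target_difficulty):
    """Greedy exam assembly: at each step prefer questions covering a new
    topic, breaking ties by closeness of difficulty to the target."""
    covered, exam = set(), []
    pool = list(questions)
    for _ in range(n_items):
        if not pool:
            break
        pool.sort(key=lambda q: (q["topic"] in covered,
                                 abs(q["difficulty"] - target_difficulty)))
        q = pool.pop(0)
        exam.append(q)
        covered.add(q["topic"])
    return exam

pool = [
    {"id": 1, "topic": "cells", "difficulty": 0.4},
    {"id": 2, "topic": "cells", "difficulty": 0.5},
    {"id": 3, "topic": "genetics", "difficulty": 0.9},
    {"id": 4, "topic": "evolution", "difficulty": 0.6},
]
exam = assemble_exam(pool, n_items=3, target_difficulty=0.5)
print([q["id"] for q in exam])  # [2, 4, 3] — one question per topic
```

A production system would also need the overlap and clue constraints mentioned above, which turn the problem into a constrained optimisation rather than a simple greedy selection.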
Mining Human-Authored Questions
While existing researchers claim that the questions they generate can be used for educational purposes, these claims are not generally supported. More attention needs to be given to the educational value of generated questions.
In addition to potential use in evaluation, analysing real, good quality exams can help to gain insights into what questions need to be generated so that the generation addresses real life educational needs. This will also help to quantify the characteristics of real questions (e.g. number of terms in real questions) and direct attention to what needs to be done and where the focus should be in order to move to exam generation. Additionally, exam questions reflect what should be included in similar assessments that, in turn, can be further used for content selection and the ranking of questions. For example, concepts extracted from these questions can inform the selection of existing textual or structured sources and the quantifying of whether or not the contents are of educational relevance.
Other potential advantages that the automatic mining of questions offers are the extraction of question templates, a major component of automatic question generators, and improving natural language generation. Besides, mapping the information contained in existing questions to an ontology permits modification of these questions, prediction of their difficulty, and the formation of theories about different aspects of the questions such as their quality.
Similarity Computation and Optimisation
A variety of similarity measures have been used in the context of QG to select content for questions, to select plausible distractors, and to control question difficulty (see the “Generation Tasks” section for examples). Similarity can also be employed to suggest a diverse set of generated questions (i.e. questions that do not entail the same meaning regardless of their surface structure). Improving the computation of similarity measures (i.e. speed and accuracy) and investigating other types of similarity that might be needed for other question forms are research directions with direct implications for improving the current automatic question generation process. Evaluating the performance of existing similarity measures against each other, and whether cheap similarity measures can approximate expensive ones, are further interesting objects of study.
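As an example of a cheap, surface-level similarity measure of the kind that might approximate more expensive semantic ones, the sketch below computes cosine similarity over bag-of-words vectors; the example questions are invented.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between bag-of-words vectors of two texts: a cheap
    surface measure, often a baseline for costlier semantic measures."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

q1 = "what causes acid rain"
q2 = "what is the cause of acid rain"
q3 = "how do plants make food"
print(round(cosine_sim(q1, q2), 2), round(cosine_sim(q1, q3), 2))  # 0.57 0.0
```

Note the measure misses the near-paraphrase pair "causes"/"cause", which is exactly the kind of gap that motivates comparing cheap measures against more expensive, semantically informed ones.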
Source Acquisition and Enrichment
As we have seen in this review, structured knowledge sources have been a popular source for question generation, either by themselves or to complement texts. However, knowledge sources are not available for many domains, while those that are developed for purposes other than QG might not be rich enough to generate good quality questions. Therefore, they need to be adapted or extended before they can be used for QG. As such, investigating different approaches for building or enriching structured knowledge sources and gaining further evidence for the feasibility of obtaining good quality knowledge sources that can be used for question generation, are crucial ingredients for their successful use in question generation.
A limitation of this review is the underrepresentation of studies published in languages other than English. In addition, ten papers were excluded because of the unavailability of their full texts.
Questions like those presented in the TV show “Jeopardy!”. These questions consist of statements that give hints about the answer. See Faizan and Lohmann (2018) for an example.
Note that evaluated properties are not necessarily controlled by the generation method. For example, an evaluation could focus on difficulty and discrimination as an indication of quality.
The code and the input files are available at: https://github.com/grkurdi/AQG_systematic_review
The required sample size was calculated using the N.cohen.kappa function (Gamer et al. 2019).
This is due to the initial description of Q9 being insufficient, which resulted in only “moderate agreement”; the agreement improved after refining the description of Q9. Note that Cohen’s kappa was unsuitable for assessing the agreement on criteria Q6–Q8 due to the unbalanced distribution of responses (e.g. the majority of responses to Q6a were “no”). Since the level of agreement between both reviewers was high, the quality of the remaining studies was assessed by the first author.
Cohen’s kappa was interpreted according to the interpretation provided by Viera et al. (2005).
The last update of the search was on 3-4-2019.
Questions consisting of a text segment followed by a stem of the form: “The word X in paragraph Y is closest in meaning to:” and a set of options. See Susanti et al. (2015) for more details.
A percentage of 0 means that no one answered the question correctly (highly difficult question), while 100% means that everyone answered the question correctly (extremely easy question).
This can be found at https://rajpurkar.github.io/SQuAD-explorer/
This can be found at https://datasets.maluuba.com/NewsQA
This is a collaboratively created knowledge base.
Available at http://allenai.org/data.html
The dataset can be obtained from https://github.com/bjwyse/QGSTEC2010/blob/master/QGSTEC-Sentences-2010.zip
OpenLearn is an online repository that provides access to learning materials from The Open University.
- Abacha, AB, & Demner-Fushman, D. (2016). Recognizing question entailment for medical question answering. In: the AMIA annual symposium, American medical informatics association, p. 310.Google Scholar
- Adithya, SSR, & Singh, PK. (2017). Web authoriser tool to build assessments using Wikipedia articles. In: TENCON 2017 - 2017 IEEE region 10 conference, pp. 467–470. https://doi.org/10.1109/TENCON.2017.8227909.
- Ai, R, Krause, S, Kasper, W, Xu, F, Uszkoreit, H. (2015). Semi-automatic generation of multiple-choice tests from mentions of semantic relations. In: the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, pp. 26–33.Google Scholar
- Alsubait, T. (2015). Ontology-based question generation. PhD thesis: University of Manchester.Google Scholar
- Alsubait, T, Parsia, B, Sattler, U. (2012b). Mining ontologies for analogy questions: A similarity-based approach. In: OWLED.Google Scholar
- Alsubait, T, Parsia, B, Sattler, U. (2013). A similarity-based theory of controlling MCQ difficulty. In 2013 2nd international conference on e-learning and e-technologies in education (ICEEE) (pp. 283–288). IEEE. https://doi.org/10.1109/ICeLeTE.2013.664438
- Alsubait, T, Parsia, B, Sattler, U. (2014a). Generating multiple choice questions from ontologies: Lessons learnt. In: OWLED, Citeseer, pp. 73–84.
- Alsubait, T, Parsia, B, Sattler, U. (2014b). Generating multiple choice questions from ontologies: How far can we go? In: the 1st International Workshop on Educational Knowledge Management (EKM 2014), Linköping University Electronic Press, pp. 19–30.
- Araki, J, Rajagopal, D, Sankaranarayanan, S, Holm, S, Yamakawa, Y, Mitamura, T. (2016). Generating questions and multiple-choice answers using semantic analysis of texts. In The 26th international conference on computational linguistics (COLING 2016) (pp. 1125–1136).
- Banerjee, S, & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72.
- Basuki, S, & Kusuma, SF. (2018). Automatic question generation for 5w-1h open domain of Indonesian questions by using syntactical template-based features from academic textbooks. Journal of Theoretical and Applied Information Technology, 96(12), 3908–3923.
- Beck, JE, Mostow, J, Bey, J. (2004). Can automated questions scaffold children’s reading comprehension? In: International Conference on Intelligent Tutoring Systems, Springer, pp. 478–490.
- Bednarik, L, & Kovacs, L. (2012a). Automated EA-type question generation from annotated texts. In: SACI 2012, IEEE. https://doi.org/10.1109/SACI.2012.6250000.
- Bednarik, L, & Kovacs, L. (2012b). Implementation and assessment of the automatic question generation module. In: CogInfoCom 2012, IEEE. https://doi.org/10.1109/CogInfoCom.2012.6421938.
- Biggs, JB, & Collis, KF. (2014). Evaluating the quality of learning: The SOLO taxonomy (Structure of the Observed Learning Outcome). Cambridge: Academic Press.
- Bloom, BS, Engelhart, MD, Furst, EJ, Hill, WH, Krathwohl, DR. (1956). Taxonomy of educational objectives, Handbook I: The cognitive domain (Vol. 19). New York: David McKay Co Inc.
- Blšták, M. (2018). Automatic question generation based on sentence structure analysis. Information Sciences & Technologies: Bulletin of the ACM Slovakia, 10(2), 1–5.
- Blšták, M., & Rozinajová, V. (2018). Building an agent for factual question generation task. In 2018 World symposium on digital intelligence for systems and machines (DISA) (pp. 143–150). IEEE. https://doi.org/10.1109/DISA.2018.8490637
- Boland, A, Cherry, MG, Dickson, R. (2013). Doing a systematic review: A student’s guide. Sage.
- Ch, DR, & Saha, SK. (2018). Automatic multiple choice question generation from text: A survey. IEEE Transactions on Learning Technologies, in press. https://doi.org/10.1109/TLT.2018.2889100.
- Chen, CY, Liou, HC, Chang, JS. (2006). FAST: an automatic generation system for grammar tests. In: the COLING/ACL interactive presentation sessions, Association for Computational Linguistics, pp. 1–4.
- Chinkina, M, & Meurers, D. (2017). Question generation for language learning: From ensuring texts are read to supporting learning. In: the 12th workshop on innovative use of NLP for building educational applications, pp. 334–344.
- Critical Appraisal Skills Programme. (2018). CASP qualitative checklist. https://casp-uk.net/wp-content/uploads/2018/03/CASP-Qualitative-Checklist-Download.pdf, accessed: 2018-09-07.
- Donnelly, K. (2006). SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in Health Technology and Informatics, 121, 279–290.
- Fairon, C. (1999). A web-based system for automatic language skill assessment: Evaling. In: Symposium on computer mediated language assessment and evaluation in natural language processing, Association for Computational Linguistics, pp. 62–67.
- Faizan, A, & Lohmann, S. (2018). Automatic generation of multiple choice questions from slide content using linked data. In: the 8th International Conference on Web Intelligence, Mining and Semantics.
- Faizan, A, Lohmann, S, Modi, V. (2017). Multiple choice question generation for slides. In: Computer Science Conference for University of Bonn Students, pp. 1–6.
- Fattoh, IE, Aboutabl, AE, Haggag, MH. (2015). Semantic question generation using artificial immunity. International Journal of Modern Education and Computer Science, 7(1), 1–8.
- Flor, M, & Riordan, B. (2018). A semantic role-based approach to open-domain automatic question generation. In: the 13th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 254–263.
- Flórez-Vargas, O., Brass, A, Karystianis, G, Bramhall, M, Stevens, R, Cruickshank, S, Nenadic, G. (2016). Bias in the reporting of sex and age in biomedical research on mouse models. eLife, 5, e13615.
- Gaebel, M, Kupriyanova, V, Morais, R, Colucci, E. (2014). E-learning in European higher education institutions: Results of a mapping survey conducted in October–December 2013. Tech. rep., European University Association.
- Gamer, M, Lemon, J, et al. (2019). Package 'irr'. https://cran.r-project.org/web/packages/irr/irr.pdf.
- Gao, Y, Wang, J, Bing, L, King, I, Lyu, MR. (2018). Difficulty controllable question generation for reading comprehension. Tech. rep.
- Goldbach, IR, & Hamza-Lup, FG. (2017). Survey on e-learning implementation in Eastern Europe: Spotlight on Romania. In: the Ninth International Conference on Mobile, Hybrid, and On-Line Learning.
- Gupta, M, Gantayat, N, Sindhgatta, R. (2017). Intelligent math tutor: Problem-based approach to create cognizance. In: the 4th ACM Conference on Learning@Scale, ACM, pp. 241–244.
- Heilman, M. (2011). Automatic factual question generation from text. PhD thesis, Carnegie Mellon University.
- Heilman, M, & Smith, NA. (2009). Ranking automatically generated questions as a shared task. In: the 2nd Workshop on Question Generation, pp. 30–37.
- Heilman, M, & Smith, NA. (2010a). Good question! Statistical ranking for question generation. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 609–617.
- Heilman, M, & Smith, NA. (2010b). Rating computer-generated questions with Mechanical Turk. In: the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, Association for Computational Linguistics, pp. 35–40.
- Hill, J, & Simha, R. (2016). Automatic generation of context-based fill-in-the-blank exercises using co-occurrence likelihoods and Google n-grams. In: the 11th workshop on innovative use of NLP for building educational applications, pp. 23–30.
- Hingorjo, MR, & Jaleel, F. (2012). Analysis of one-best MCQs: the difficulty index, discrimination index and distractor efficiency. The Journal of the Pakistan Medical Association (JPMA), 62(2), 142–147.
- Huang, YT, & Mostow, J. (2015). Evaluating human and automated generation of distractors for diagnostic multiple-choice cloze questions to assess children’s reading comprehension. In Conati, C., Heffernan, N., Mitrovic, A., Verdejo, M.F. (Eds.) Artificial intelligence in education (pp. 155–164). Cham: Springer International Publishing.
- Huang, YT, Tseng, YM, Sun, YS, Chen, MC. (2014). TEDQuiz: automatic quiz generation for TED talks video clips to assess listening comprehension. In 2014 IEEE 14th international conference on advanced learning technologies (ICALT) (pp. 350–354). IEEE.
- Jiang, S, & Lee, J. (2017). Distractor generation for Chinese fill-in-the-blank items. In: the 12th workshop on innovative use of NLP for building educational applications, pp. 143–148.
- Jouault, C, & Seta, K. (2014). Content-dependent question generation for history learning in semantic open learning space. In: the International Conference on Intelligent Tutoring Systems, Springer, pp. 300–305.
- Jouault, C, Seta, K, Hayashi, Y. (2015a). A method for generating history questions using LOD and its evaluation. SIG-ALST of The Japanese Society for Artificial Intelligence, B5(1), 28–33.
- Jouault, C, Seta, K, Hayashi, Y. (2015b). Quality of LOD based semantically generated questions. In Conati, C., Heffernan, N., Mitrovic, A., Verdejo, M.F. (Eds.) Artificial intelligence in education (pp. 662–665). Cham: Springer International Publishing.
- Jouault, C, Seta, K, Hayashi, Y, et al. (2016b). Can LOD based question generation support work in a learning environment for history learning? SIG-ALST, 5(03), 37–41.
- Kaur, A, & Singh, S. (2017). Automatic question generation system for Punjabi. In: the International Conference on Recent Innovations in Science, Agriculture, Engineering and Management.
- Kaur, J, & Bathla, AK. (2015). A review on automatic question generation system from a given Hindi text. International Journal of Research in Computer Applications and Robotics (IJRCAR), 3(6), 87–92.
- Killawala, A, Khokhlov, I, Reznik, L. (2018). Computational intelligence framework for automatic quiz question generation. In: 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8. https://doi.org/10.1109/FUZZ-IEEE.2018.8491624.
- Kitchenham, B, & Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering. Tech. rep., Keele University and University of Durham.
- Kovacs, L, & Szeman, G. (2013). Complexity-based generation of multi-choice tests in AQG systems. In: CogInfoCom 2013, IEEE. https://doi.org/10.1109/CogInfoCom.2013.6719278.
- Kumar, G, Banchs, R, D’Haro, LF. (2015a). Revup: Automatic gap-fill question generation from educational texts. In: the 10th workshop on innovative use of NLP for building educational applications, pp. 154–161.
- Kumar, G, Banchs, R, D’Haro, LF. (2015b). Automatic fill-the-blank question generator for student self-assessment. In: IEEE Frontiers in Education Conference (FIE), pp. 1–3. https://doi.org/10.1109/FIE.2015.7344291.
- Kumar, V, Boorla, K, Meena, Y, Ramakrishnan, G, Li, YF. (2018). Automating reading comprehension by generating question and answer pairs. In Phung, D., Tseng, V.S., Webb, G.I., Ho, B., Ganji, M., Rashidi, L. (Eds.) Advances in knowledge discovery and data mining (pp. 335–348). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-93040-4_27
- Kurdi, G, Parsia, B, Sattler, U. (2017). An experimental evaluation of automatically generated multiple choice questions from ontologies. In Dragoni, M., Poveda-Villalón, M., Jimenez-Ruiz, E. (Eds.) OWL: Experiences and directions – reasoner evaluation (pp. 24–39). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-54627-8_3
- Kurdi, G, Leo, J, Matentzoglu, N, Parsia, B, Forege, S, Donato, G, Dowling, W. (2019). A comparative study of methods for a priori prediction of MCQ difficulty. The Semantic Web Journal, in press.
- Kwankajornkiet, C, Suchato, A, Punyabukkana, P. (2016). Automatic multiple-choice question generation from Thai text. In: the 13th International Joint Conference on Computer Science and Software Engineering (JCSSE), pp. 1–6. https://doi.org/10.1109/JCSSE.2016.7748891.
- Le, N T, Kojiri, T, Pinkwart, N. (2014). Automatic question generation for educational applications – the state of art. In van Do, T., Thi, H.A.L, Nguyen, N.T. (Eds.) Advanced computational methods for knowledge engineering (pp. 325–338). Cham: Springer International Publishing.CrossRefGoogle Scholar
- Lee, CH, Chen, TY, Chen, LP, Yang, PC, Tsai, RTH. (2018). Automatic question generation from children’s stories for companion chatbot. In: 2018 IEEE International Conference on Information Reuse and Integration (IRI), pp. 491–494. https://doi.org/10.1109/IRI.2018.00078.
- Liang, C, Yang, X, Wham, D, Pursel, B, Passonneau, R, Giles, CL. (2017). Distractor generation with generative adversarial nets for automatically creating fill-in-the-blank questions. In: the Knowledge Capture Conference, p. 33. https://doi.org/10.1145/3148011.3154463.
- Liang, C, Yang, X, Dave, N, Wham, D, Pursel, B, Giles, CL. (2018). Distractor generation for multiple choice questions using learning to rank. In: the 13th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 284–290. https://doi.org/10.18653/v1/W18-0533.
- Lin, C, Liu, D, Pang, W, Apeh, E. (2015). Automatically predicting quiz difficulty level using similarity measures. In: the 8th International Conference on Knowledge Capture, ACM.
- Lin, CY. (2004). ROUGE: A package for automatic evaluation of summaries. In: the Workshop on Text Summarization Branches Out.
- Liu, M, Calvo, RA, Rus, V. (2014). Automatic generation and ranking of questions for critical review. Journal of Educational Technology & Society, 17(2), 333–346.
- Lopetegui, MA, Lara, BA, Yen, PY, Çatalyürek, Ü.V., Payne, PR. (2015). A novel multiple choice question generation strategy: alternative uses for controlled vocabulary thesauri in biomedical-sciences education. In: the AMIA Annual Symposium, American Medical Informatics Association, pp. 861–869.
- Majumder, M, & Saha, SK. (2015). A system for generating multiple choice questions: With a novel approach for sentence selection. In: the 2nd workshop on natural language processing techniques for educational applications, pp. 64–72.
- Marrese-Taylor, E, Nakajima, A, Matsuo, Y, Yuichi, O. (2018). Learning to automatically generate fill-in-the-blank quizzes. In: the 5th workshop on natural language processing techniques for educational applications. https://doi.org/10.18653/v1/W18-3722.
- Mazidi, K, & Nielsen, RD. (2014). Linguistic considerations in automatic question generation. In: the 52nd annual meeting of the Association for Computational Linguistics, pp. 321–326.
- Mazidi, K, & Nielsen, RD. (2015). Leveraging multiple views of text for automatic question generation. In Conati, C., Heffernan, N., Mitrovic, A., Verdejo, M.F. (Eds.) Artificial intelligence in education (pp. 257–266). Cham: Springer International Publishing.
- Mazidi, K, & Tarau, P. (2016b). Infusing NLU into automatic question generation. In: the 9th International Natural Language Generation conference, pp. 51–60.
- Mitkov, R, & Ha, LA. (2003). Computer-aided generation of multiple-choice tests. In The HLT-NAACL 03 workshop on building educational applications using natural language processing, Association for Computational Linguistics, pp. 17–22.
- Montenegro, C S, Engle, V G, Acuba, M G J, Ferrenal, A M A. (2012). Automated question generator for Tagalog informational texts using case markers. In TENCON 2012-2012 IEEE region 10 conference, IEEE, pp. 1–5. https://doi.org/10.1109/TENCON.2012.6412273.
- Mostow, J, & Chen, W. (2009). Generating instruction automatically for the reading strategy of self-questioning. In: the 14th international conference on artificial intelligence in education, pp. 465–472.
- Mostow, J, Beck, J, Bey, J, Cuneo, A, Sison, J, Tobin, B, Valeri, J. (2004). Using automated questions to assess reading comprehension, vocabulary, and effects of tutorial interventions. Technology Instruction Cognition and Learning, 2, 97–134.
- Mostow, J, Huang, YT, Jang, H, Weinstein, A, Valeri, J, Gates, D. (2017). Developing, evaluating, and refining an automatic generator of diagnostic multiple choice cloze questions to assess children’s comprehension while reading. Natural Language Engineering, 23(2), 245–294. https://doi.org/10.1017/S1351324916000024.
- Niraula, NB, & Rus, V. (2015). Judging the quality of automatically generated gap-fill questions using active learning. In: the 10th workshop on innovative use of NLP for building educational applications, pp. 196–206.
- Odilinye, L, Popowich, F, Zhang, E, Nesbit, J, Winne, PH. (2015). Aligning automatically generated questions to instructor goals and learner behaviour. In: the IEEE 9th international conference on semantic computing (ICSC), pp. 216–223. https://doi.org/10.1109/ICOSC.2015.7050809.
- Olney, AM, Pavlik, PI, Maass, JK. (2017). Improving reading comprehension with automatically generated cloze item practice. In André, E., Baker, R., Hu, X., Rodrigo, M.M.T., du Boulay, B. (Eds.) Artificial intelligence in education (pp. 262–273). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-61425-0_22
- Papasalouros, A, & Chatzigiannakou, M. (2018). Semantic web and question generation: An overview of the state of the art. In: the International Conference e-Learning, pp. 189–192.
- Papineni, K, Roukos, S, Ward, T, Zhu, WJ. (2002). BLEU: a method for automatic evaluation of machine translation. In: the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318.
- Park, J, Cho, H, Lee, SG. (2018). Automatic generation of multiple-choice fill-in-the-blank question using document embedding. In Penstein Rosé, C., Martínez-Maldonado, R., Hoppe, H.U., Luckin, R., Mavrikis, M., Porayska-Pomsta, K., McLaren, B., du Boulay, B. (Eds.) Artificial intelligence in education (pp. 261–265). Cham: Springer International Publishing.
- Patra, R, & Saha, SK. (2018a). Automatic generation of named entity distractors of multiple choice questions using web information. In Pattnaik, P.K., Rautaray, SS, Das, H, Nayak, J (Eds.), Springer, Berlin.
- Patra, R, & Saha, SK. (2018b). A hybrid approach for automatic generation of named entity distractors for multiple choice questions. Education and Information Technologies, pp. 1–21.
- Polozov, O, O’Rourke, E, Smith, AM, Zettlemoyer, L, Gulwani, S, Popovic, Z. (2015). Personalized mathematical word problem generation. In The 24th international joint conference on artificial intelligence (IJCAI 2015), pp. 381–388.
- Qayyum, A, & Zawacki-Richter, O. (2018). Distance education in Australia, Europe and the Americas. Springer, Berlin.
- Rajpurkar, P, Zhang, J, Lopyrev, K, Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In: the 2016 conference on empirical methods in natural language processing, pp. 2383–2392.
- Rakangor, S, & Ghodasara, YR. (2015). Literature review of automatic question generation systems. International Journal of Scientific and Research Publications, 5(1), 2250–3153.
- Reisch, JS, Tyson, JE, Mize, SG. (1989). Aid to the evaluation of therapeutic studies. Pediatrics, 84(5), 815–827.
- Rocha, OR, & Zucker, CF. (2018). Automatic generation of quizzes from DBpedia according to educational standards. In: the 3rd Educational Knowledge Management workshop (EKM).
- Sarin, Y, Khurana, M, Natu, M, Thomas, AG, Singh, T. (1998). Item analysis of published MCQs. Indian Pediatrics, 35, 1103–1104.
- Satria, AY, & Tokunaga, T. (2017a). Automatic generation of English reference question by utilising nonrestrictive relative clause. In: the 9th international conference on computer supported education, pp. 379–386. https://doi.org/10.5220/0006320203790386.
- Satria, AY, & Tokunaga, T. (2017b). Evaluation of automatically generated pronoun reference questions. In: the 12th workshop on innovative use of NLP for building educational applications, pp. 76–85.
- Serban, IV, García-Durán, A., Gulcehre, C, Ahn, S, Chandar, S, Courville, A, Bengio, Y. (2016). Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus. In: the 54th annual meeting of the Association for Computational Linguistics (ACL).
- Seyler, D, Yahya, M, Berberich, K. (2017). Knowledge questions from knowledge graphs. In: the ACM SIGIR International Conference on Theory of Information Retrieval, pp. 11–18.
- Shah, R, Shah, D, Kurup, L. (2017). Automatic question generation for intelligent tutoring systems. In: the 2nd international conference on communication systems, computing and it applications (CSCITA), pp. 127–132. https://doi.org/10.1109/CSCITA.2017.8066538.
- Shenoy, V, Aparanji, U, Sripradha, K, Kumar, V. (2016). Generating DFA construction problems automatically. In: The international conference on learning and teaching in computing and engineering (LATICE), pp. 32–37. https://doi.org/10.1109/LaTiCE.2016.8.
- Shirude, A, Totala, S, Nikhar, S, Attar, V, Ramanand, J. (2015). Automated question generation tool for structured data. In: International conference on advances in computing, communications and informatics (ICACCI), pp. 1546–1551. https://doi.org/10.1109/ICACCI.2015.7275833.
- Singhal, R, & Henz, M. (2014). Automated generation of region based geometric questions.
- Singhal, R, Henz, M, Goyal, S. (2015a). A framework for automated generation of questions across formal domains. In: the 17th international conference on artificial intelligence in education, pp. 776–780.
- Singhal, R, Henz, M, Goyal, S. (2015b). A framework for automated generation of questions based on first-order logic. In Conati, C., Heffernan, N., Mitrovic, A., Verdejo, M.F. (Eds.), Springer International Publishing, Cham.
- Singhal, R, Goyal, R, Henz, M. (2016). User-defined difficulty levels for automated question generation. In: the IEEE 28th international conference on tools with artificial intelligence (ICTAI), pp. 828–835. https://doi.org/10.1109/ICTAI.2016.0129.
- Song, L, & Zhao, L. (2016a). Domain-specific question generation from a knowledge base. Tech. rep.
- Song, L, & Zhao, L. (2016b). Question generation from a knowledge base with web exploration. Tech. rep.
- Soonklang, T, & Muangon, W. (2017). Automatic question generation system for English exercise for secondary students. In: the 25th international conference on computers in education.
- Stasaski, K, & Hearst, MA. (2017). Multiple choice question generation utilizing an ontology. In: the 12th workshop on innovative use of NLP for building educational applications, pp. 303–312.
- Susanti, Y, Iida, R, Tokunaga, T. (2015). Automatic generation of English vocabulary tests. In: the 7th international conference on computer supported education, pp. 77–87.
- Susanti, Y, Nishikawa, H, Tokunaga, T, Obari, H. (2016). Item difficulty analysis of English vocabulary questions. In The 8th international conference on computer supported education (CSEDU 2016), pp. 267–274.
- Susanti, Y, Tokunaga, T, Nishikawa, H, Obari, H. (2017a). Controlling item difficulty for automatic vocabulary question generation. Research and Practice in Technology Enhanced Learning, 12(1), 25. https://doi.org/10.1186/s41039-017-0065-5.
- Susanti, Y, Tokunaga, T, Nishikawa, H, Obari, H. (2017b). Evaluation of automatically generated English vocabulary questions. Research and Practice in Technology Enhanced Learning 12(1). https://doi.org/10.1186/s41039-017-0051-y.
- Tamura, Y, Takase, Y, Hayashi, Y, Nakano, YI. (2015). Generating quizzes for history learning based on Wikipedia articles. In Zaphiris, P., & Ioannou, A. (Eds.) Learning and collaboration technologies (pp. 337–346). Cham: Springer International Publishing.
- Thalheimer, W. (2003). The learning benefits of questions. Tech. rep., Work Learning Research. http://www.learningadvantage.co.za/pdfs/questionmark/LearningBenefitsOfQuestions.pdf.
- Thomas, A, Stopera, T, Frank-Bolton, P, Simha, R. (2019). Stochastic tree-based generation of program-tracing practice questions. In: the 50th ACM technical symposium on computer science education, ACM, pp. 91–97.
- Viera, AJ, Garrett, JM, et al. (2005). Understanding interobserver agreement: the kappa statistic. Family Medicine, 37(5), 360–363.
- Vinu, EV, & Kumar, PS. (2015a). Improving large-scale assessment tests by ontology based approach. In: the 28th international Florida artificial intelligence research society conference, pp. 457–462.
- Vinu, EV, & Kumar, PS. (2017b). Difficulty-level modeling of ontology-based factual questions. Semantic Web Journal, in press.
- Vinu, EV, Alsubait, T, Kumar, PS. (2016). Modeling of item-difficulty for ontology-based MCQs. Tech. rep.
- Wang, K, & Su, Z. (2016). Dimensionally guided synthesis of mathematical word problems. In: the 25th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2661–2668.
- Wang, Z, Lan, AS, Nie, W, Waters, AE, Grimaldi, PJ, Baraniuk, RG. (2018). QG-net: a data-driven question generation model for educational content. In: the 5th Annual ACM Conference on Learning at Scale, pp. 15–25.
- Webb, NL. (1997). Criteria for alignment of expectations and assessments in mathematics and science education. Tech. rep., National Institute for Science Education.
- Welbl, J, Liu, NF, Gardner, M. (2017). Crowdsourcing multiple choice science questions. In: the 3rd workshop on noisy user-generated text, pp. 94–106.
- Wita, R, Oly, S, Choomok, S, Treeratsakulchai, T, Wita, S. (2018). A semantic graph-based Japanese vocabulary learning game. In Hancke, G., Spaniol, M., Osathanunkul, K., Unankard, S., Klamma, R. (Eds.) Advances in web-based learning – ICWL 2018 (pp. 140–145). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-96565-9_14
- Yaneva, V, et al. (2018). Automatic distractor suggestion for multiple-choice tests using concept embeddings and information retrieval. In: the 13th workshop on innovative use of NLP for building educational applications, pp. 389–398.
- Zavala, L, & Mendoza, B. (2018). On the use of semantic-based AIG to automatically generate programming exercises. In: the 49th ACM technical symposium on computer science education, ACM, pp. 14–19.
- Zhang, J, & Takuma, J. (2015). A Kanji learning system based on automatic question sentence generation. In: 2015 international conference on asian language processing (IALP), pp. 144–147. https://doi.org/10.1109/IALP.2015.7451552.
- Zhang, L. (2015). Biology question generation from a semantic network. PhD thesis, Arizona State University.
- Zhang, L, & VanLehn, K. (2016). How do machine-generated questions compare to human-generated questions?. Research and Practice in Technology Enhanced Learning, 11(7). https://doi.org/10.1186/s41039-016-0031-7.
- Zhang, T, Quan, P, et al. (2018). Domain specific automatic Chinese multiple-type question generation. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, pp. 1967–1971. https://doi.org/10.1109/BIBM.2018.8621162.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.