<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nicolò</forename><surname>Donati</surname></persName>
							<email>n.donati@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bologna</orgName>
								<address>
									<addrLine>Viale del Risorgimento, 2</addrLine>
									<postCode>40136, BO</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Zanichelli editore S.p.A</orgName>
								<address>
									<addrLine>Via Irnerio 34</addrLine>
									<postCode>40126</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matteo</forename><surname>Periani</surname></persName>
							<email>matteo.periani2@studio.unibo.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bologna</orgName>
								<address>
									<addrLine>Viale del Risorgimento, 2</addrLine>
									<postCode>40136, BO</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><forename type="middle">Di</forename><surname>Natale</surname></persName>
							<email>paolo.dinatale3@studio.unibo.it</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Bologna</orgName>
								<address>
									<addrLine>Corso della Repubblica</addrLine>
									<postCode>136, 47121</postCode>
									<settlement>Forlì FC</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Savino</surname></persName>
							<email>gsavino@zanichelli.it</email>
							<affiliation key="aff1">
								<orgName type="institution">Zanichelli editore S.p.A</orgName>
								<address>
									<addrLine>Via Irnerio 34</addrLine>
									<postCode>40126</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Torroni</surname></persName>
							<email>p.torroni@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bologna</orgName>
								<address>
									<addrLine>Viale del Risorgimento, 2</addrLine>
									<postCode>40136, BO</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04-06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">26DCF52A68C81D686A6ACC6CF3B831E4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Distractor Generation</term>
					<term>Multiple-Choice Cloze</term>
					<term>Evaluation Metric</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>English grammar Multiple-Choice Cloze (MCC) exercises are crucial for improving learners' grammatical proficiency and comprehension skills. However, creating these exercises is labour-intensive and requires expert knowledge. Effective MCC exercises must be contextually relevant and engaging, incorporating distractors-plausible but incorrect alternatives-to balance difficulty and maintain learner motivation. Despite the increasing interest in utilizing large language models (LLMs) in education, their application in generating English grammar MCC exercises is still limited. Previous methods typically impose constraints on LLMs, producing grammatically correct yet uncreative results. This paper explores the potential of LLMs to independently generate diverse and contextually relevant MCC exercises without predefined limitations. We hypothesize that LLMs can craft self-contained sentences that foster learner's communicative competence. Our analysis of existing MCC exercise datasets revealed issues of diversity, completeness, and correctness. Furthermore, we address the lack of a standardized automatic metric for evaluating the quality of generated exercises. Our contributions include developing an LLM-based solution for generating MCC exercises, curating a comprehensive dataset spanning 19 grammar topics, and proposing an automatic metric validated against human expert evaluations. This work aims to advance the automatic generation of English grammar MCC exercises, enhancing both their quality and creativity.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>English grammar Multiple-Choice Cloze (MCC) exercises are widely used tools for enhancing a learner's grammatical proficiency and comprehension skills. They consist of fill-the-gap questions where the gap must be filled by choosing one correct solution (key) among several options. The incorrect alternatives are called distractors. Devising these exercises is a labour-intensive process requiring expert knowledge in language teaching and content creation. The exercises must be contextually relevant to help learners understand how rules apply in real-life situations. This requires crafting sentences and scenarios that are both engaging and educational. Learners have different levels of proficiency, from beginner to advanced, so exercises must strike the right balance of difficulty: learners should be neither bored nor frustrated, which is crucial for maintaining their motivation and progress. In MCC exercises this is done by choosing distractors that are incorrect but plausible, thus keeping the exercise challenging for the learner. Studies in Communicative Language Teaching demonstrate that learners must possess knowledge of grammatical structures and the ability to compose syntactically well-formed propositions, and must also acquire the ability to employ grammatical forms in discourse <ref type="bibr" target="#b0">[1]</ref> <ref type="bibr" target="#b1">[2]</ref>.</p><p>Recently, there has been a growing interest in applying LLMs in education <ref type="bibr" target="#b2">[3]</ref>. However, the adoption of LLMs for English grammar MCC exercise generation is still limited. Some proposals focus on testing vocabulary <ref type="bibr" target="#b3">[4]</ref> or use LLMs by constraining their generation capability, for example using fixed part-of-speech sequences <ref type="bibr" target="#b4">[5]</ref>. 
Although the outputs of these models are grammatically correct, they typically lack creativity <ref type="bibr" target="#b6">[6]</ref>.</p><p>In this work, we investigate the potential of LLMs in automatic exercise generation without hampering their creativity. Our working hypothesis is that LLMs can generate self-contained sentences, recreating situational contexts that elicit the communicative competence of the learner <ref type="bibr" target="#b7">[7]</ref>. Our main objective is to understand to what extent LLMs can generate accurate grammar exercises without providing predefined constraints or POS sequences. To pursue this objective, we analyzed the available English grammar MCC exercise dataset <ref type="bibr" target="#b8">[8]</ref>. We observed that it has limited diversity, some topics are underrepresented, and it often contains mistakes. The existing literature does not offer a single agreed-upon automatic metric for evaluating the quality of generated grammar exercises. Therefore, we set out to identify such a metric and validate its alignment with human judgment. In this paper, we present a novel solution utilizing an LLM to generate English grammar MCC exercises. Our contribution also focuses on curating an MCC dataset that spans 19 topics. Lastly, we propose an automatic metric to evaluate exercise correctness, and we verify the validity of our contribution through human expert evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Task description</head><p>Grammar exercises should define the range of abilities to be assessed and avoid the influence of irrelevant factors like past knowledge or cultural background <ref type="bibr" target="#b9">[9]</ref>. We followed the best-practice guidelines for creating grammar MCC items defined in <ref type="bibr" target="#b10">[10]</ref> <ref type="bibr" target="#b11">[11]</ref>. According to them, each item consists of three components.</p><p>• Body: the sentence with a gap in place of the key.</p><p>• Key: the correct answer.</p><p>• Distractors: the incorrect answers.</p><p>The body plays a central role in designing effective exercises. Learners should be able to infer the key based on the helpful elements present in the body. However, the effectiveness of an exercise depends mainly on the quality of its distractors. Ideally, challenging distractors should be homogeneous, plausible, and unambiguous. Homogeneous distractors share the same syntactic category as the key <ref type="bibr" target="#b12">[12]</ref>. Plausible distractors provide a credible alternative to the key. Lastly, unambiguous distractors ensure that none of them could be considered correct if used in place of the key <ref type="bibr" target="#b10">[10]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related Works</head><p>The generation of MCC exercises has been explored from various perspectives. In this section, we will briefly discuss the main related approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">MCC Dataset</head><p>Prior work on creating MCC datasets is very limited. To the best of our knowledge, the only one in English was presented by Liu et al. in their work SC-Ques <ref type="bibr" target="#b8">[8]</ref>. It comprises real English test items for students developed by teaching professionals. The dataset contains roughly 300k MCC sentence completion exercises, composed of the question body, a varying number of alternative answers, and the key (i.e. the correct alternative). It includes exercises with both single and multiple blanks. It has various limitations, discussed in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Grammar MCC Exercise Generation</head><p>A large share of prior work uses rules to create grammar MCC exercises (Sumita et al. <ref type="bibr" target="#b13">[13]</ref>, Brown et al. <ref type="bibr" target="#b14">[14]</ref>, Smith et al. <ref type="bibr" target="#b15">[15]</ref>, Majumder and Saha <ref type="bibr" target="#b16">[16]</ref>, Lin et al. <ref type="bibr" target="#b17">[17]</ref>). They all follow a three-fold process: (1) select sentences from arbitrary sources, (2) insert the blank into the sentence, and (3) generate distractors for the blank. Sentences usually come from corpora or user-submitted passages. Many solutions restrict gap detection to fixed schemes: Sumita et al. <ref type="bibr" target="#b13">[13]</ref> picked out the leftmost single verb, while Lin et al. <ref type="bibr" target="#b17">[17]</ref> only selected adjectives as blanks. One of the few exceptions is Goto et al. <ref type="bibr" target="#b18">[18]</ref>, who proposed a method based on Conditional Random Fields (CRFs) <ref type="bibr" target="#b19">[19]</ref>. Methods that extract sentences from arbitrary text suffer from several limitations. First of all, they lack customization options, such as adjusting for the subject or difficulty level of the exercise. Additionally, they are limited by the length and quality of the extracted texts, which can negatively impact the system's results.</p><p>Recently, parts of MCC generation have been executed by neural networks instead of rule-based algorithms. Bitew et al. <ref type="bibr" target="#b20">[20]</ref> use a variation of the RoBERTa <ref type="bibr" target="#b21">[21]</ref> model to predict the gap positions within the sentence. To decrease ambiguity, Matsumori et al. <ref type="bibr" target="#b22">[22]</ref> trained a Masked Language Model for gap score prediction of each candidate sentence. Chomphooyod et al. 
<ref type="bibr" target="#b23">[23]</ref> proposed a system that uses Transformers <ref type="bibr" target="#b24">[24]</ref> to generate candidate sentences given a POS sequence, a keyword and a desired grammar topic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Metrics</head><p>In the literature, the evaluation of MCC exercises is mainly based on judgments expressed by human annotators. Slavuj et al. <ref type="bibr" target="#b25">[25]</ref> asked annotators to perform the language tasks, assuming that the presence of incorrect answers would be a sign of ill-formed exercises. Teachers were then asked to provide feedback on any pitfalls they encountered. Malafeev <ref type="bibr" target="#b26">[26]</ref> simply attended to suitability for classroom use. Chomphooyod et al. <ref type="bibr" target="#b23">[23]</ref> evaluate different aspects of each exercise, such as grammatical and semantic correctness, relevance with respect to the topic, and acceptability.</p><p>Very few automatic metrics have been proposed to evaluate exercise generation. Bitew et al. <ref type="bibr" target="#b20">[20]</ref> rely on span overlap with respect to ground truth to assess the consistency of gap detection. March et al. <ref type="bibr" target="#b27">[27]</ref> test the effectiveness of distractors by their selection rate.</p><p>Since an important criterion for exercise collection is diversity, similarity measures have often been applied to MCC exercises. Metrics like BLEU <ref type="bibr" target="#b28">[28]</ref>, ROUGE <ref type="bibr" target="#b29">[29]</ref>, and METEOR <ref type="bibr" target="#b30">[30]</ref> have been used, even though they were originally designed for different applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Approach</head><p>To overcome the limitations of existing solutions, we utilized an LLM to generate exercises in a single, constraint-free step. We chose Llama3 <ref type="bibr" target="#b31">[31]</ref> due to its acceptable balance between computational cost and performance. To evaluate its effectiveness, we engineered a well-structured prompt (Appendix B.2). However, the results were unsatisfactory. The model exhibited significant difficulties with certain grammar topics and consistently failed to generate effective distractors. Therefore, we decided to fine-tune the model using a well-formatted dataset containing exercises with distractors that meet our criteria. Each dataset example includes four features: the grammar topic, the exercise text, the key, and the distractors. The model is trained to produce the exercise text, key, and distractors when given a specific grammar topic as input. The prompt used during fine-tuning and an example of input-output text can be found in Appendix B.1.</p><p>To assess the correctness of the generated items, we devised metrics that evaluate the minimal structural requirements of an exercise through rule-based analysis. These are defined in Section 7. To monitor the results we used Self-BLEU <ref type="bibr" target="#b6">[6]</ref>, a metric that detects repetitions by checking continuous lexical overlap.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Dataset Curation</head><p>We developed the fine-tuning dataset based on the data released by <ref type="bibr" target="#b8">[8]</ref>. The data underwent three pre-processing steps: cleaning, grammar topic identification, and removal of similar examples.</p><p>Data cleaning First, we removed improperly formatted examples and cleaned the text to comply with the tokenizer specifications and limit potential noise. Items with multiple blank spaces or fewer than two distractors were discarded. Next, we filtered out exercise texts containing instructions, non-Latin symbols or letters, emails, phone numbers, and links.</p></div>
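The cleaning step above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the `___` gap marker, the `keep_item` helper, and the regular expressions are our assumptions, and the filtering of instruction-like texts is omitted.

```python
import re

GAP = "___"  # assumed blank marker; the real dataset may use a different one
NOISE_PATTERNS = [
    re.compile(r"[^\x00-\x7F]"),                 # non-Latin / non-ASCII symbols
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # e-mail addresses
    re.compile(r"https?://\S+|www\.\S+"),        # links
    re.compile(r"\+?\d[\d\s\-()]{7,}\d"),        # phone-number-like digit runs
]

def keep_item(text: str, distractors: list[str]) -> bool:
    """Return True if the exercise passes the cleaning filters."""
    if text.count(GAP) != 1:       # exactly one blank allowed
        return False
    if len(distractors) < 2:       # at least two distractors required
        return False
    return not any(p.search(text) for p in NOISE_PATTERNS)
```

A clean single-gap item with two distractors is kept, while items containing links, multiple gaps, or a single distractor are dropped.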
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Extraction of the grammar topic</head><p>The second step involves the assignment of a grammar topic to each exercise by means of the Pattern Matcher. First, grammar topics are defined in a tailor-made grammar taxonomy with the aid of the spaCy Dependency Matcher. Given a set of sentences, this tool allows one to identify whether each sentence features the described grammar topics and, if so, at what position. The relevant topic is chosen by comparing the overlap between the position of the topic detected by the Pattern Matcher and the key span, i.e. the range of positions the key belongs to. To ensure the exclusively grammatical nature of the exercises, distractors are checked using the metrics proposed in Section 7. All exercises lacking valid distractors are then discarded.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Deduplication</head><p>To increase the quality of our dataset, we deduplicated it by removing all similar exercises <ref type="bibr" target="#b32">[32]</ref>. Exercises are clustered by topic and compared in terms of embeddings through cosine similarity. Using a threshold 𝑇𝑝, where 𝑝 denotes the topic, all elements exceeding the limit are discarded. Lastly, we noticed that SC-Ques <ref type="bibr" target="#b8">[8]</ref> had an unbalanced representation of grammar topics. For example, half of the WH-question exercises have "How" as the key. For each topic, a maximum ratio of key presence is established, and superfluous data are discarded.</p><p>After pre-processing, the least represented class contained a quarter of the examples present in the most represented one. The only exception was the "WH-questions" class, which was underrepresented. Therefore, we upsampled the class with synthetic exercises generated using GPT-4 <ref type="bibr" target="#b33">[33]</ref>. The dataset is composed of several fields: the filled_text (complete exercise sentence), the gapped_text (sentence with a blank gap), the key (the text removed to create the gap), and the list of distractors.</p></div>
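A minimal sketch of the similarity-based deduplication for one topic cluster, assuming sentence embeddings have already been computed; the greedy keep-first strategy is our assumption, not necessarily the authors' exact procedure:

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float) -> list[int]:
    """Return indices of kept exercises within one topic cluster.

    An exercise is kept only if its cosine similarity to every previously
    kept exercise stays below `threshold` (the paper's per-topic T_p).
    """
    # normalize rows so the dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(unit)):
        if all(unit[i] @ unit[j] < threshold for j in kept):
            kept.append(i)
    return kept
```

With a threshold of 0.9, two near-identical embeddings collapse to a single kept exercise while a dissimilar one survives.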
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Fine-Tuning</head><p>We designed the fine-tuning process to generate exercises on specific grammar topics with a fixed number of distractors. The model's expected response is a JSON-encoded exercise consistent with the dataset structure described in Section 5. We observed that including the filled_text in the output improves overall accuracy and reduces similarity among exercises. An example from the fine-tuning dataset can be found in Appendix B.1. To reduce the computational resources required for fine-tuning, we employed the Quantized Low-Rank Adapters (QLoRA) <ref type="bibr" target="#b34">[34]</ref> approach. Our tests on small models revealed that this strategy prevents significant shrinkage of the model's dictionary during fine-tuning. Consequently, the generated exercises exhibit greater variability, enhancing the model's creativity.</p></div>
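The construction of one training example (the instruction prompt of Appendix B.1 concatenated with the JSON-encoded target, using the dataset fields of Section 5) can be sketched as follows; the exercise content and the `build_example` helper are illustrative:

```python
import json

def build_example(topic: str, n_distractors: int, exercise: dict) -> str:
    """Concatenate the instruction prompt with the JSON-encoded target."""
    prompt = (f"Write a multiple-choice gap exercise on {topic} "
              f"with {n_distractors} distractors.")
    return prompt + "\n" + json.dumps(exercise)

# Hypothetical exercise, following the filled_text / gapped_text /
# key / distractors schema described in Section 5.
example = build_example("comparisons", 3, {
    "filled_text": "My house is bigger than yours.",
    "gapped_text": "My house is ___ than yours.",
    "key": "bigger",
    "distractors": ["biggest", "more big", "big"],
})
```

At inference time the model receives only the prompt line and must emit the JSON part, which is then parsed back into the exercise fields.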
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Evaluation Metrics</head><p>Two metrics are used to track the model's performance on different aspects. First, we introduce a metric that evaluates the minimal structural requirements of an exercise. Secondly, we control for language diversity to make the results more interpretable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Structural Compliance</head><p>This metric evaluates the structure and well-formedness of the exercise. Decomposing the validation stage into two steps, we design two rule-based components, namely pertinence and homogeneity.</p><p>The former checks that the gap placeholder is located in the intended position and that the key includes the correct grammar form. The second component checks that the distractors fulfil the criterion of homogeneity as described in Section 2. To achieve this, grammar topics have been grouped into two classes.</p><p>Inflectional Distractors must have the same lemma as the key, so as to rule out the influence of lexis and semantics. We also make adjustments to account for circumstances where the key and the distractor are identical, as well as for handling variations of the auxiliary verb.</p><p>Free morphemes Exercises of this group limit acceptable keys and distractors to a narrow range of options. We therefore manually compile a list of admitted words for each grammar topic. If the distractor belongs to that list and is not identical to the key, it is deemed homogeneous. Some grammar topics may be built with distractors of either class. If either of the checks succeeds, the distractor passes the fitness test.</p></div>
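The homogeneity check can be sketched as below. The allowed-word lists and the `lemma` callback (which would be backed by spaCy in practice) are illustrative assumptions; the paper's auxiliary-verb adjustments are omitted.

```python
# Illustrative free-morpheme lists, not the authors' own.
ALLOWED = {
    "articles": {"a", "an", "the"},
    "personal pronouns": {"I", "you", "he", "she", "it", "we", "they"},
}

def homogeneous(topic, key, distractor, lemma):
    """Check the homogeneity criterion of Section 2 for one distractor.

    `lemma` maps a word to its lemma (e.g. "went" -> "go"); in practice
    this would come from a spaCy pipeline.
    """
    if distractor == key:
        return False  # identical options would make the item ambiguous
    # Inflectional class: same lemma as the key.
    inflectional_ok = lemma(distractor) == lemma(key)
    # Free-morpheme class: member of the topic's admitted-word list.
    free_ok = distractor in ALLOWED.get(topic, set())
    return inflectional_ok or free_ok
```

So "gone" is a homogeneous distractor for the key "went" (same lemma), "the" is homogeneous for "a" (both admitted articles), while an unrelated adverb fails both checks.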
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Language Diversity</head><p>LLMs often experience the so-called repetition problem, where their output includes excessively repeated segments of text, creating an undesirable effect <ref type="bibr" target="#b35">[35]</ref>. In the context of generating thousands of exercises, duplicates or overly similar sentences are highly likely to occur. To assess this phenomenon, we rely on continuous lexical overlap, applying Self-BLEU <ref type="bibr" target="#b6">[6]</ref> to 2-to-5-grams to capture multi-word repetitions.</p></div>
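A simplified Self-BLEU over 2-to-5-grams can be implemented with clipped n-gram precision, scoring each sentence against all the others as references. This sketch omits BLEU's brevity penalty and geometric averaging, so absolute values differ from standard implementations; it only illustrates the idea of measuring continuous lexical overlap within a generated set.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def self_bleu(sentences: list[list[str]], n_range=(2, 5)) -> float:
    """Average clipped n-gram precision of each sentence vs. the rest."""
    scores = []
    for i, hyp in enumerate(sentences):
        refs = [s for j, s in enumerate(sentences) if j != i]
        precs = []
        for n in range(n_range[0], n_range[1] + 1):
            hyp_ngrams = ngrams(hyp, n)
            if not hyp_ngrams:
                continue
            # clip each hypothesis n-gram count by its max count in any reference
            max_ref = Counter()
            for r in refs:
                for g, c in ngrams(r, n).items():
                    max_ref[g] = max(max_ref[g], c)
            clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
            precs.append(clipped / sum(hyp_ngrams.values()))
        if precs:
            scores.append(sum(precs) / len(precs))
    return sum(scores) / len(scores) if scores else 0.0
```

Identical sentences score 1.0 and fully disjoint sentences score 0.0, so a low average (like the 7% reported in Section 9) indicates limited repetition.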
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Experiments</head><p>We fine-tuned the Huggingface implementation of Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). The model was first quantized to 4-bit precision and then fine-tuned using LoRA adapters, with the following configuration: rank equal to 64, alpha 16, and a dropout percentage of 0.1. The adapters were added on top of all the attention linear layers so as not to significantly degrade performance. The training hyperparameters are: a constant learning rate of 2e−4, a max gradient norm of 0.3, and a weight decay equal to 1e−2. The number of epochs was set to 3, using a batch size of 1 and gradient accumulation equal to 16. Training lasted two hours on an NVIDIA RTX A6000.</p></div>
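The reported hyperparameters can be collected as plain configuration; the commented wiring into `peft`/`transformers` below is a sketch that requires GPU and model access, and the `target_modules` list is our assumption about "all the attention linear layers":

```python
# Hyperparameters reported in Section 8.
LORA = dict(r=64, lora_alpha=16, lora_dropout=0.1)
TRAIN = dict(
    learning_rate=2e-4,
    max_grad_norm=0.3,
    weight_decay=1e-2,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size of 16
    lr_scheduler_type="constant",
)

# Sketch of wiring these into peft/transformers:
#
#   from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
#                             TrainingArguments)
#   from peft import LoraConfig, get_peft_model
#
#   model = AutoModelForCausalLM.from_pretrained(
#       "meta-llama/Meta-Llama-3-8B-Instruct",
#       quantization_config=BitsAndBytesConfig(load_in_4bit=True))
#   model = get_peft_model(model, LoraConfig(
#       task_type="CAUSAL_LM",
#       target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], **LORA))
#   args = TrainingArguments(output_dir="out", **TRAIN)
```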
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Results</head><p>To evaluate performance, we generated 50 exercises for each grammar topic, setting the number of distractors to 1. We use the sampling decoding strategy with a temperature of 0.7 to balance the creativity and the coherence of the output.</p><p>The exercises are categorized according to their grammar topic. For each exercise, we assessed its structural compliance and its similarity to the exercises within the same grammar topic that have been labelled as structurally correct, using the metrics described in Section 7. The results are then averaged to obtain the accuracy for each grammar topic. Finally, the overall model performance is computed by averaging the topic scores. The results are reported in Table <ref type="table" target="#tab_0">1</ref>.</p><p>Overall, the outcomes are satisfactory. The model scores, on average, a Structural Compliance (SC𝐴) of 85%, indicating its ability to generate well-formed exercises. It achieves a self-BLEU similarity of 7%, demonstrating that text repetitions are limited. Looking at the individual SC scores, we observe that the model tends to perform better on free-morpheme grammar topics. We suppose this is due to the limited number of possible key/distractor options. Furthermore, we observed that, due to spaCy limitations in properly labelling certain verbs, grammar topics related to verbal tenses are more prone to be misidentified. This limitation causes occasional misjudgment of the exercise's structural compliance, negatively affecting the topic performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.1.">Human Evaluation</head><p>To assess classroom suitability, a human evaluation was performed on all 950 exercises by a computational linguist with a background in language-teaching pedagogy. Each generated exercise was evaluated on four criteria: Plausibility, Ambiguity (defined in Section 2), Common Sense, and Acceptability. Common Sense means that the exercise sentence should be coherent with common sense. Acceptability indicates that a sentence does not perpetuate stereotypes or display inappropriate content, such as violence. If any of these criteria is not met, the item is flagged as incorrect; the fraction of items meeting all of them is the exercise correctness (EC).</p><p>The results presented in Table 1 show that 79% of the items satisfy all the requirements to be administered to learners. In the table, SC𝐴 is the Structural Compliance evaluated by our automatic metric, SC𝐻 the one evaluated by the human annotator, and EC the exercise correctness; the double lines divide the results of the automatic metric (left) from those of the human evaluation (right). We conducted an error analysis, summarized in Table <ref type="table">2</ref>.</p><p>As expected, ambiguous distractors remain an open matter in the field, especially for tense-based topics. Conversely, we notice that biased sentences and trivial exercises are almost absent. Furthermore, we asked the annotator to evaluate the structural compliance of the exercises (SC𝐻 ). We then computed Precision, Recall and F1 scores using the annotator's judgements as gold labels. The results show that our automatic structural compliance metric (SC𝐴) has an F1 score of 95% with respect to the human evaluation, with a Precision of 98% and a Recall of 91%. This highlights its effectiveness in predicting the overall structural quality of the exercises.</p></div>
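The comparison of the automatic metric against the human gold labels reduces to standard precision, recall and F1 over binary judgements; a minimal sketch (the `prf` helper is a hypothetical name):

```python
def prf(auto: list[bool], human: list[bool]) -> tuple[float, float, float]:
    """Precision/recall/F1 of the automatic metric, with human judgements
    as gold labels (positive class = structurally compliant)."""
    tp = sum(a and h for a, h in zip(auto, human))   # both say compliant
    fp = sum(a and not h for a, h in zip(auto, human))
    fn = sum(h and not a for a, h in zip(auto, human))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```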
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Conclusion</head><p>We investigated the use of an LLM to generate English MCC grammar exercises. To that end, we curated a new English grammar MCC exercise dataset. We devised metrics for the automatic evaluation of such exercises. We evaluated our work using said metrics and a human study involving domain experts. Our findings demonstrate the model's ability to generate exercises suitable for educational use. The generated exercises exhibit a low similarity score, indicating that our method can effectively produce original exercises: a significant advantage over prior art, which mostly relies on rule-based methods. We observe that human evaluation correlates positively with the proposed structural compliance metric, corroborating our metric as an indicator of exercise structure correctness and alignment with human expert preferences. We found that a key factor of our method was the availability of high-quality fine-tuning data.</p><p>One limitation was the presence of many similar exercises in the SC-Ques dataset <ref type="bibr" target="#b8">[8]</ref> from which we built our resource. After removing similar exercises, only 30% of the original data was left. Another limitation is the sensitivity of the evaluation metric to the Pattern Matcher, concerning the evaluation of the key and the distractors, which caused some false negatives.</p><p>The curated dataset and model will be made available to the community<ref type="foot" target="#foot_0">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1. Fine-Tuning prompt</head><p>The prompt used to fine-tune the model has the same structure for all the grammar topics. The only varying parts are the name of the grammar topic and the number of distractors required. These parts are enclosed in brackets and change depending on the dataset item. The prompt used is the following.</p><p>Write a multiple-choice gap exercise on {grammar_topic} with {n_distractors} distractors.</p><p>Listing 1: Fine-tuning prompt.</p><p>A training example is created by concatenating to the prompt the desired JSON representation of the exercise. We decided to use this format because it is easier to use at inference time. An example of training data is the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head># USER</head><p>Write a multiple-choice gap exercise on comparisons with 3 distractors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Baseline prompt</head><p>To test the performance of the baseline Llama3 we utilize its instruction-tuned version, Llama3-Instruct, which can follow directions given by the user. This model is not able to answer correctly using the prompt described above. Therefore, we construct an alternative one in which all the useful information is given to the model. We include the structure of the exercise, the roles of each component with their constraints, and the desired format of the output. The prompt is the following.</p><p># SYSTEM You are an English teacher creating multiple-choice-gap exercises. # USER Write one exercise on {grammar_topic}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>It must contain the following: − sentence: the body exercise text, which must contain the tag &lt;GAP&gt; instead of the solution; − solution: the word that correctly fills the gap; − distractor: a word related to the solution, but different. The distractor must be such that, if substituted for the solution, the sentence is wrong. Do not generate any explanation. The output must be a JSON object with the following structure: { "sentence": str, "solution": str, "distractor": list[str] }</p><p>Listing 3: Prompt used for the generation of exercises with the base Llama3 model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Ethical Considerations</head><p>This section outlines the ethical considerations of the system we developed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Bias and Fairness</head><p>The dataset used in this study is obtained from a publicly available source, ensuring that all data was collected with appropriate consent. To protect personal information, we removed all sensitive data such as phone numbers, email addresses, and URLs. Since this data was created by humans, we assume that proper names and any references to existing entities are invented. Moreover, we assume that items mentioning preferences, such as films or books, do not reflect the real preferences of the users, and that events or situations described in the exercises are not related to existing facts. Finally, since the data have been created by professional authors, we assume that any possible bias or stereotype in the dataset is unintended and coincidental.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Accuracy and Reliability</head><p>The accuracy of the generated exercises is paramount. We employ both automated validation tools and human expert reviews to ensure the correctness and reliability of the content, and any inaccuracies identified are promptly rectified. We acknowledge the potential for bias in LLM-generated content; however, the human evaluation found only a negligible presence of bias in the generated outputs.</p><p>Transparency We strive for transparency by documenting the sources of our training data and explaining the model architecture. All data-manipulation techniques and processing steps are described in order, highlighting the important aspects of each.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Educational Impact</head><p>We assess the impact of LLM-generated exercises on learning outcomes. We aim to enhance personalized learning while preventing over-reliance on automated systems. The content is designed to be inclusive and accessible to all students.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Common sense was the most frequently observed inaccuracy, although the magnitude of the issue is modest. As expected, ambiguous</figDesc><table><row><cell>grammar topic</cell><cell>SC 𝐴</cell><cell>self-BLEU</cell><cell>SC 𝐻</cell><cell>EC</cell></row><row><cell>articles</cell><cell>0.94</cell><cell>0.03</cell><cell>0.94</cell><cell>0.74</cell></row><row><cell>comparison adjectives</cell><cell>0.90</cell><cell>0.09</cell><cell>0.92</cell><cell>0.72</cell></row><row><cell>conditional statements</cell><cell>0.76</cell><cell>0.07</cell><cell>0.90</cell><cell>0.66</cell></row><row><cell>future simple</cell><cell>0.82</cell><cell>0.06</cell><cell>0.90</cell><cell>0.90</cell></row><row><cell>modal verbs</cell><cell>0.62</cell><cell>0</cell><cell>0.78</cell><cell>0.70</cell></row><row><cell>infinitive and gerund verbs</cell><cell>0.76</cell><cell>0</cell><cell>0.96</cell><cell>0.86</cell></row><row><cell>passive tenses</cell><cell>0.84</cell><cell>0</cell><cell>0.86</cell><cell>0.74</cell></row><row><cell>past continuous</cell><cell>0.98</cell><cell>0.16</cell><cell>0.98</cell><cell>0.88</cell></row><row><cell>past perfect</cell><cell>0.94</cell><cell>0.12</cell><cell>0.96</cell><cell>0.82</cell></row><row><cell>past simple</cell><cell>0.88</cell><cell>0</cell><cell>0.86</cell><cell>0.82</cell></row><row><cell>personal pronouns</cell><cell>0.85</cell><cell>0.07</cell><cell>0.92</cell><cell>0.74</cell></row><row><cell>possessive 
adjectives</cell><cell>0.82</cell><cell>0.12</cell><cell>0.90</cell><cell>0.72</cell></row><row><cell>prepositions</cell><cell>0.84</cell><cell>0</cell><cell>0.92</cell><cell>0.72</cell></row><row><cell>present continuous</cell><cell>0.96</cell><cell>0.11</cell><cell>0.98</cell><cell>0.88</cell></row><row><cell>present perfect</cell><cell>0.66</cell><cell>0.08</cell><cell>0.98</cell><cell>0.84</cell></row><row><cell>present simple</cell><cell>0.88</cell><cell>0.05</cell><cell>0.88</cell><cell>0.86</cell></row><row><cell>quantifiers</cell><cell>0.88</cell><cell>0.07</cell><cell>0.88</cell><cell>0.84</cell></row><row><cell>relative clauses</cell><cell>0.94</cell><cell>0.03</cell><cell>0.94</cell><cell>0.74</cell></row><row><cell>WH-question</cell><cell>0.98</cell><cell>0.18</cell><cell>1.00</cell><cell>0.90</cell></row><row><cell>average</cell><cell>0.85</cell><cell>0.07</cell><cell>0.92</cell><cell>0.79</cell></row></table></figure>
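The self-BLEU column in Table 1 measures the diversity of the generated exercises (lower is more diverse). A simplified self-BLEU-style score — geometric mean of clipped n-gram precisions, without the smoothing or brevity penalty of the full metric — can be sketched as follows:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions,
    no smoothing and no brevity penalty (illustration only)."""
    scores = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = Counter()
        for r in references:
            ref |= ngrams(r, n)  # clip against the max count per reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        scores.append(overlap / total)
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

def self_bleu(sentences):
    """Average BLEU of each sentence against all the others:
    1.0 for identical outputs, near 0 for fully diverse ones."""
    toks = [s.lower().split() for s in sentences]
    return sum(bleu(t, toks[:i] + toks[i + 1:])
               for i, t in enumerate(toks)) / len(toks)
```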
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://github.com/ZanichelliEditore/ english-grammar-multiple-choice-generation</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We wish to thank Zanichelli editore for their support which enabled data up-sampling, human evaluation, and experimentation with their infrastructure. We also thank Eleonora Cupin for her valuable contribution to the human evaluation of the dataset.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Error analysis</head><p>Based on the human evaluation, we conducted a small analysis of the errors made by the model. Analyzing the exercises that the annotators marked as incorrect, we found that the main issue is the coherence of the exercise sentence: 75% of the incorrect exercises have a meaningless or absurd sentence. This behaviour is directly related to the hallucinations suffered by LLMs <ref type="bibr" target="#b36">[36]</ref>. The second most common error is ambiguity between the key and the distractors: the model does not possess a deep understanding of what a distractor is, and some generated distractors are interchangeable with the key.</p><p>Despite these limitations, the model is very effective at producing exercises that are not trivial (plausibility error rate of 1%) and are negligibly affected by bias and stereotypes. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Prompts</head><p>In this section, we present the prompts used in our work. They use the Llama3 chat template format, but to make the text more readable we replace the template tokens with three placeholders: #SYSTEM, #USER and #ASSISTANT.</p></div>			</div>
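For reference, the #SYSTEM/#USER placeholders correspond to role headers in the Llama3 chat template. A hypothetical helper (not part of the paper's code) that expands such role/text pairs into the documented Llama 3 special-token format might look like:

```python
def to_llama3_template(sections):
    """Expand (role, text) pairs into the Llama 3 chat template,
    using the special tokens documented for Llama 3 instruct models."""
    out = ["<|begin_of_text|>"]
    for role, text in sections:
        out.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>")
    # Open an assistant header so the model generates the answer next.
    out.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(out)

prompt = to_llama3_template([
    ("system", "You are an English teacher creating multiple-choice gap exercises."),
    ("user", "Write one exercise on past simple."),
])
```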
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">G</forename><surname>Widdowson</surname></persName>
		</author>
		<title level="m">Teaching Language as Communication</title>
				<meeting><address><addrLine>Oxford</addrLine></address></meeting>
		<imprint>
			<publisher>Oxford University Press</publisher>
			<date type="published" when="1978">1978</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">G</forename><surname>Widdowson</surname></persName>
		</author>
		<title level="m">Explorations in Applied Linguistics</title>
				<meeting><address><addrLine>Oxford</addrLine></address></meeting>
		<imprint>
			<publisher>Oxford University Press</publisher>
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Large language models in education: Vision and opportunities</title>
		<author>
			<persName><forename type="first">W</forename><surname>Gan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C.-W</forename><surname>Lin</surname></persName>
		</author>
		<idno type="DOI">10.1109/BigData59044.2023.10386291</idno>
	</analytic>
	<monogr>
		<title level="m">2023 IEEE International Conference on Big Data (BigData)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="4776" to="4785" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Automated generation of multiple-choice cloze questions for assessing english vocabulary using gpt-turbo 3.5</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Orita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sugawara</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2403.02078" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">English grammar multiple-choice question generation using text-to-text transfer transformer</title>
		<author>
			<persName><forename type="first">P</forename><surname>Chomphooyod</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Suchato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tuaycharoen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Punyabukkana</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.caeai.2023.100158</idno>
	</analytic>
	<monogr>
		<title level="j">Computers and Education: Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page">100158</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Texygen: A benchmarking platform for text generation models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1802.01886" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">On communicative competence</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Hymes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Sociolinguistics. Selected Readings</title>
				<editor>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Pride</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Holmes</surname></persName>
		</editor>
		<imprint>
			<publisher>Harmondsworth</publisher>
			<date type="published" when="1972">1972</date>
			<biblScope unit="page" from="269" to="293" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Sc-ques: A sentence completion question dataset for english as a second language learners</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Augmented Intelligence and Intelligent Tutoring Systems</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Frasson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Mylonas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Troussas</surname></persName>
		</editor>
		<meeting><address><addrLine>Nature Switzerland, Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="678" to="690" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Bachman</surname></persName>
		</author>
		<title level="m">Fundamental Considerations in Language Testing</title>
				<meeting><address><addrLine>Oxford</addrLine></address></meeting>
		<imprint>
			<publisher>Oxford University Press</publisher>
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Purpura</surname></persName>
		</author>
		<title level="m">Assessing Grammar, Cambridge Language Assessment</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Practical Language Testing</title>
		<author>
			<persName><forename type="first">G</forename><surname>Fulcher</surname></persName>
		</author>
		<idno type="DOI">10.4324/9780203767399</idno>
		<imprint>
			<date type="published" when="2010">2010</date>
			<publisher>Routledge</publisher>
		</imprint>
	</monogr>
	<note>1st ed</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Multiple choice question corpus analysis for distractor characterization</title>
		<author>
			<persName><forename type="first">V.-M</forename><surname>Pho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>André</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-L</forename><surname>Ligozat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Grau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Illouz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>François</surname></persName>
		</author>
		<ptr target="http://www.lrec-conf.org/proceedings/lrec2014/pdf/692_Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14), European Language Resources Association (ELRA)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Choukri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Declerck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Loftsson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Maegaard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Mariani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Moreno</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Odijk</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Piperidis</surname></persName>
		</editor>
		<meeting>the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14), European Language Resources Association (ELRA)<address><addrLine>Reykjavik, Iceland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="4284" to="4291" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Measuring non-native speakers&apos; proficiency of english by using a test with automatically-generated fill-in-the-blank questions</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sumita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sugaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yamamoto</surname></persName>
		</author>
		<idno type="DOI">10.3115/1609829.1609839</idno>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Automatic question generation for vocabulary assessment</title>
		<author>
			<persName><forename type="first">J</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Frishkoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Eskénazi</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/H05-1103/" />
	</analytic>
	<monogr>
		<title level="m">Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference</title>
				<meeting><address><addrLine>Vancouver; British Columbia, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005-10">October 2005. 2005</date>
			<biblScope unit="page" from="819" to="826" />
		</imprint>
	</monogr>
	<note>The Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Gap-fill tests for language learners: Corpus-driven item generation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kilgarriff</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:61531901" />
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A system for generating multiple choice questions: With a novel approach for sentence selection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Saha</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/W15-4410</idno>
		<ptr target="https://doi.org/10.18653/v1/W15-4410" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Tseng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Matsumoto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Wong</surname></persName>
		</editor>
		<meeting>the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP<address><addrLine>Beijing, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015-07-31">July 31, 2015. 2015</date>
			<biblScope unit="page" from="64" to="72" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A system for generating multiple choice questions: With a novel approach for sentence selection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Saha</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/W15-4410</idno>
		<ptr target="https://doi.org/10.18653/v1/W15-4410" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Tseng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Matsumoto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Wong</surname></persName>
		</editor>
		<meeting>the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP<address><addrLine>Beijing, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015-07-31">July 31, 2015. 2015</date>
			<biblScope unit="page" from="64" to="72" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Automatic generation system of multiple-choice cloze questions and its evaluation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Goto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kojiri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Watanabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Iwata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yamada</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:15482954" />
	</analytic>
	<monogr>
		<title level="j">Knowledge Management &amp; E-Learning: An International Journal</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="210" to="224" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C N</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001)</title>
				<editor>
			<persName><forename type="first">C</forename><forename type="middle">E</forename><surname>Brodley</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Danyluk</surname></persName>
		</editor>
		<meeting>the Eighteenth International Conference on Machine Learning (ICML 2001)<address><addrLine>Williamstown, MA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="2001-07-01">June 28 -July 1, 2001. 2001</date>
			<biblScope unit="page" from="282" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Learning from partially annotated data: Example-aware creation of gap-filling exercises for language learning</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Bitew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deleu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Doğruöz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Develder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Demeester</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.BEA-1.51</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.bea-1.51" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2023</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kochmar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Horbach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Laarmann-Quante</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Madnani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Tack</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Yaneva</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Yuan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Zesch</surname></persName>
		</editor>
		<meeting>the 18th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2023<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023-07-13">13 July 2023. 2023</date>
			<biblScope unit="page" from="598" to="609" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Roberta: A robustly optimized bert pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1907.11692" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Mask and cloze: Automatic open cloze question generation using a masked language model</title>
		<author>
			<persName><forename type="first">S</forename><surname>Matsumori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Okuoka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Shibata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Inoue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fukuchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Imai</surname></persName>
		</author>
		<idno type="DOI">10.1109/access.2023.3239005</idno>
		<ptr target="https://doi.org/10.1109/ACCESS.2023.3239005" />
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="9835" to="9850" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">English grammar multiple-choice question generation using text-to-text transfer transformer</title>
		<author>
			<persName><forename type="first">P</forename><surname>Chomphooyod</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Suchato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tuaycharoen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Punyabukkana</surname></persName>
		</author>
		<idno type="DOI">10.1016/J.CAEAI.2023.100158</idno>
		<ptr target="https://doi.org/10.1016/j.caeai.2023.100158" />
	</analytic>
	<monogr>
		<title level="j">Comput. Educ. Artif. Intell</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page">100158</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1706.03762" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Automatic generation of language exercises based on a universal methodology: An analysis of possibilities</title>
		<author>
			<persName><forename type="first">V</forename><surname>Slavuj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Nacinovic Prskalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brkic Bakaric</surname></persName>
		</author>
		<idno type="DOI">10.31926/but.pcs.2021.63.14.2.3</idno>
	</analytic>
	<monogr>
		<title level="j">Bulletin of the Transilvania University of Brasov. Series IV: Philology and Cultural Studies</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">63</biblScope>
			<biblScope unit="page" from="29" to="48" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Language exercise generation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Malafeev</surname></persName>
		</author>
		<idno type="DOI">10.4018/IJCSSA.2014070102</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Conceptual Structures and Smart Applications</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="20" to="35" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">An evidence-based approach to distractor generation in multiple-choice language tests</title>
		<author>
			<persName><forename type="first">D</forename><surname>Perrett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>March</surname></persName>
		</author>
		<idno type="DOI">10.13140/RG.2.2.22779.16165</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://aclanthology.org/P02-1040" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Isabelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Charniak</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</editor>
		<meeting>the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Philadelphia, Pennsylvania, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W05-0909" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Goldstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Voss</surname></persName>
		</editor>
		<meeting>the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics<address><addrLine>Ann Arbor, Michigan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<ptr target="https://ai.meta.com/blog/meta-llama-3/" />
		<title level="m">Introducing Meta Llama 3: The most capable openly available LLM to date</title>
				<imprint>
			<date type="published" when="2024-04">April 2024</date>
		</imprint>
	</monogr>
	<note>Meta</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">D4: Improving llm pretraining via document deduplication and diversification</title>
		<author>
			<persName><forename type="first">K</forename><surname>Tirumala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Simig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aghajanyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Morcos</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2023/file/a8f8cbd7f7a5fb2c837e578c75e5b615-Paper-Datasets_and_Benchmarks.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Oh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="53983" to="53995" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<orgName type="collaboration">OpenAI</orgName>
		</author>
		<ptr target="https://arxiv.org/abs/2303.08774" />
		<title level="m">Gpt-4 technical report</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<title level="m" type="main">Qlora: Efficient finetuning of quantized llms</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dettmers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pagnoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2305.14314" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">A theoretical analysis of the repetition problem in text generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M.-C</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shi</surname></persName>
		</author>
		<idno type="DOI">10.1609/aaai.v35i14.17520</idno>
		<ptr target="https://ojs.aaai.org/index.php/AAAI/article/view/17520" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="12848" to="12856" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">Hallucination is inevitable: An innate limitation of large language models</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kankanhalli</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2401.11817" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
