<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicolò Donati</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Periani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Di Natale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Savino</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Torroni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bologna</institution>
          ,
          <addr-line>Corso della Repubblica, 136, 47121 Forlì FC</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bologna</institution>
          ,
          <addr-line>Viale del Risorgimento, 2, 40136 Bologna BO</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Zanichelli editore S.p.A.</institution>
          ,
          <addr-line>Via Irnerio 34, 40126 Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>English grammar Multiple-Choice Cloze (MCC) exercises are crucial for improving learners' grammatical proficiency and comprehension skills. However, creating these exercises is labour-intensive and requires expert knowledge. Effective MCC exercises must be contextually relevant and engaging, incorporating distractors (plausible but incorrect alternatives) to balance difficulty and maintain learner motivation. Despite the increasing interest in utilizing large language models (LLMs) in education, their application in generating English grammar MCC exercises is still limited. Previous methods typically impose constraints on LLMs, producing grammatically correct yet uncreative results. This paper explores the potential of LLMs to independently generate diverse and contextually relevant MCC exercises without predefined limitations. We hypothesize that LLMs can craft self-contained sentences that foster learners' communicative competence. Our analysis of existing MCC exercise datasets revealed issues of diversity, completeness, and correctness. Furthermore, we address the lack of a standardized automatic metric for evaluating the quality of generated exercises. Our contributions include developing an LLM-based solution for generating MCC exercises, curating a comprehensive dataset spanning 19 grammar topics, and proposing an automatic metric validated against human expert evaluations. This work aims to advance the automatic generation of English grammar MCC exercises, enhancing both their quality and creativity.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>Distractor Generation</kwd>
        <kwd>Multiple-Choice Cloze</kwd>
        <kwd>Evaluation Metric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>grammar exercises. Therefore, we set out to identify such a metric and validate its alignment with human judgment. In this paper, we present a novel solution utilizing an LLM to generate English grammar MCC exercises. Our contribution also focuses on curating an MCC dataset that spans 19 topics. Lastly, we propose an automatic metric to evaluate the exercises' correctness, and we verify the validity of our contribution through human expert evaluation.</p>
      <sec id="sec-1-1">
<title>Rule-based generation</title>
        <p>A large share of prior works uses rules to create grammar MCC exercises (Sumita et al. [13], Brown et al. [14], Smith et al. [15], Majumder and Saha [16], Lin et al. [17]). They all follow a three-fold process: (1) select sentences from arbitrary sources, (2) insert the blank into the sentence, and (3) generate distractors for the blank. Sentences usually come from corpora or user-submitted passages.</p>
<p>Many solutions restrict gap detection to fixed schemes: Sumita et al. [13] picked out the leftmost single verb, while Lin et al. [17] only selected adjectives as blanks. One of the few exceptions is Goto et al. [18], who proposed a method based on Conditional Random Fields (CRFs) [19]. Methods that extract sentences from arbitrary text suffer from several limitations. First of all, they lack customization options, such as adjusting the subject or difficulty level of the exercise. Additionally, they are limited by the length and quality of the extracted texts, which can negatively impact the system's results. Recently, parts of MCC generation have been executed by neural networks instead of rule-based algorithms. Bitew et al. [20] use a variation of the RoBERTa [21] model to predict the gap positions within the sentence.</p>
        <p>
          Grammar exercises should define the range of abilities to be assessed and avoid the influence of irrelevant factors like past knowledge or cultural background [9]. We followed the best-practice guidelines for creating grammar MCC items defined in [
          <xref ref-type="bibr" rid="ref18">10</xref>
          ] [
          <xref ref-type="bibr" rid="ref17">11</xref>
          ]. According to them, each item consists of three components. • Body: the sentence with a gap in place of the key. • Key: the correct answer. • Distractor: the incorrect answer.
        </p>
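<p>The three-component structure above can be captured in a minimal data model. The sketch below is purely illustrative (the field names and the well-formedness rule are our own, not the paper's implementation):</p>

```python
from dataclasses import dataclass

GAP = "___"  # placeholder for the blank in the body

@dataclass
class MCCItem:
    body: str           # sentence with a gap in place of the key
    key: str            # the correct answer
    distractors: list   # plausible but incorrect answers

    def is_well_formed(self) -> bool:
        # the body must contain exactly one gap, there must be at least
        # one distractor, and no distractor may coincide with the key
        return (
            self.body.count(GAP) == 1
            and len(self.distractors) > 0
            and self.key not in self.distractors
        )

item = MCCItem(
    body=f"She {GAP} to school every day.",
    key="goes",
    distractors=["go", "gone", "going"],
)
```

<p>Such a structural check is a precondition, not a quality guarantee: plausibility and unambiguity of the distractors still require the deeper checks discussed below.</p>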
<p>To decrease ambiguity, Matsumori et al. [22] trained a Masked Language Model for gap score prediction of each candidate sentence. Chomphooyod et al. [23] proposed a system that uses Transformers [24] to generate candidate sentences given a POS sequence, a keyword, and a desired grammar topic.</p>
<p>
          The body plays a central role in designing effective exercises. Learners should be able to infer the key based on the helpful elements present in the body. However, the effectiveness of an exercise depends mainly on the quality of its distractors. Ideally, challenging distractors should be homogeneous, plausible, and unambiguous. Homogeneous distractors share the same syntactic category as the key [12]. Plausible distractors provide a credible alternative to the key. Lastly, unambiguous distractors ensure that none of them could be considered correct if used in place of the key [
          <xref ref-type="bibr" rid="ref18">10</xref>
          ].
        </p>
        <sec id="sec-1-1-1">
          <title>3.3. Metrics</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
<title>3. Related Works</title>
        <p>The generation of MCC exercises has been explored from various perspectives. In this section, we briefly discuss the main related approaches.</p>
        <p>In the literature, the evaluation of MCC exercises is mainly based on judgments expressed by human annotators. Slavuj et al. [25] asked annotators to perform the language tasks, assuming that the presence of incorrect answers would be a sign of ill-formed exercises. Teachers were then asked to provide feedback on any pitfalls they encountered. Malafeev [26] simply attended to suitability for classroom use. Chomphooyod et al. [23] evaluate, for each exercise, different aspects such as grammatical and semantic correctness, relevance with respect to the topic, and acceptability.</p>
<p>Prior works in creating MCC datasets are very limited. To the best of our knowledge, the only one in English was presented by Liu et al. in their work SC-Ques [8]. It comprises real English test items for students developed by teaching professionals. The dataset contains roughly 300k MCC sentence completion exercises, composed of the question body, a varying number of alternative answers, and the key (i.e. the correct alternative). It comprises exercises with both single and multiple blanks. It has various limitations, discussed in Section 5.</p>
        <p>Very few automatic metrics have been proposed to evaluate exercise generation. Bitew et al. [20] rely on span overlap with respect to ground truth to assess the consistency of gap detection. March et al. [27] test the effectiveness of distractors by their selection rate. Since an important criterion for exercise collection is diversity, similarity measures have often been applied to MCC exercises. Metrics like BLEU [28], ROUGE [29], and METEOR [30] have been used even though they were originally designed for different applications.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Approach</title>
<p>To overcome the limitations of existing solutions, we utilized an LLM to generate exercises in a single, constraint-free step. We chose Llama3 [31] due to its acceptable balance between computational cost and performance. To evaluate its effectiveness, we engineered a well-structured prompt (Appendix B.2). However, the results were unsatisfactory: the model exhibited significant difficulties with certain grammar topics and consistently failed to generate effective distractors. Therefore, we decided to fine-tune the model using a well-formatted dataset containing exercises with distractors that meet our criteria. Each dataset example includes four features: the grammar topic, the exercise text, the key, and the distractors. The model is trained to produce the exercise text, key, and distractors when given a specific grammar topic as input. The prompt used during the fine-tuning and an example of input-output text can be found in appendix section B.1.</p>
      <p>To ensure the exclusively grammatical nature of the exercises, distractors are checked using the metrics proposed in Section 3.3. All exercises lacking valid distractors are then discarded.</p>
      <p>Deduplication We deduplicated and removed all similar exercises to increase the quality of our dataset [32]. Exercises are clustered by topic and compared in terms of embeddings through cosine similarity. Using a topic-specific threshold, all elements exceeding the limit are discarded. Lastly, we noticed that SC-Ques [8] had an unbalanced representation of grammar topics. For example, half of the WH-question exercises have "How" as the key. For each topic, a maximum ratio of key presence is established, and superfluous data are discarded.</p>
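<p>The deduplication step can be illustrated as follows. This is a sketch under stated assumptions: the three-dimensional vectors stand in for real sentence embeddings, the threshold value is hypothetical, and the per-topic clustering is omitted for brevity:</p>

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def deduplicate(embedded_exercises, threshold):
    # greedily keep an exercise only if its embedding is not too similar
    # to any already-kept exercise; items above the threshold are discarded
    kept = []
    for text, vec in embedded_exercises:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

exercises = [
    ("She ___ to school.", [1.0, 0.0, 0.1]),
    ("She ___ to work.",   [0.99, 0.01, 0.12]),  # near-duplicate of the first
    ("___ you like tea?",  [0.0, 1.0, 0.0]),
]
print(deduplicate(exercises, threshold=0.95))
```

<p>With a real embedding model, the same greedy pass would run once per topic cluster, each with its own threshold.</p>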
<p>To assess the correctness of the generated items, we devised metrics that evaluate the minimal structural requirements of an exercise through rule-based analysis. These are defined in Section 7. To monitor the results we used Self-BLEU [6], a metric that detects repetitions by checking continuous lexical overlap.</p>
      <sec id="sec-2-1">
<title>Dataset composition</title>
        <p>After pre-processing, the least represented class contained a quarter of the examples present in the most represented one. The only exception was the "WH-questions" class, which was underrepresented. Therefore, we upsampled this class with synthetic exercises using GPT-4 [33]. The dataset is composed of several fields: the filled_text (complete exercise sentence), the gapped_text (sentence with a blank gap), the key (the text removed to create the gap), and the list of distractors.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Dataset Curation</title>
    </sec>
    <sec id="sec-4">
      <title>6. Fine-Tuning</title>
      <sec id="sec-4-1">
<title>Pre-processing</title>
        <p>We developed the fine-tuning dataset based on the data released by [8]. The data underwent three pre-processing steps: cleaning, grammar topic identification, and removal of similar examples.</p>
      </sec>
      <sec id="sec-4-2">
<title>Fine-tuning setup</title>
        <p>We designed the fine-tuning process to generate exercises on specific grammar topics with a fixed number of distractors. The model’s expected response is a JSON-encoded exercise coherent with the dataset structure described in Section 5. We observed that including the filled_text in the output improves overall accuracy and reduces similarity among exercises. An example from the fine-tuning dataset can be found in appendix section B.1. To reduce the computational resources required for fine-tuning, we employed the Quantized Low-Rank Adapters (QLoRA) [34] approach. Our tests on small models revealed that this strategy prevents significant shrinkage of the model’s dictionary during fine-tuning. Consequently, the generated exercises exhibit greater variability, enhancing the model’s creativity.</p>
        <p>Data cleaning First, we got rid of improperly formatted examples and cleaned the text to comply with the tokenizer specifications and limit potential noise. Items with multiple blank spaces or fewer than two distractors were discarded. Next, we filtered out exercise texts containing instructions, non-Latin symbols or letters, emails, phone numbers, and links.</p>
<p>Extraction of the grammar topic The second step involves the assignment of the grammar topic to each exercise thanks to the Pattern Matcher. First, grammar topics are defined in a tailor-made grammar taxonomy with the aid of the spaCy Dependency Matcher. Given a set of sentences, this tool allows one to identify whether each sentence features the described grammar topics, and if so, at what position. The relevant topic is chosen by comparing the overlap between the position of the topic detected by the Pattern Matcher and the key span1.</p>
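<p>The overlap comparison between matcher spans and the key span can be sketched as follows. Spans are token-index intervals, and the matcher output below is a hand-written stand-in for real spaCy Dependency Matcher results, not the authors' code:</p>

```python
def overlap(span_a, span_b):
    # number of token positions shared by two [start, end) spans
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))

def pick_topic(matched_topics, key_span):
    # choose the grammar topic whose detected span overlaps most with
    # the key span; return None if no topic overlaps at all
    best, best_ov = None, 0
    for topic, span in matched_topics:
        ov = overlap(span, key_span)
        if ov > best_ov:
            best, best_ov = topic, ov
    return best

# hypothetical matcher output for "She has been waiting for hours."
matches = [("present perfect", (1, 4)), ("prepositions", (4, 5))]
print(pick_topic(matches, key_span=(2, 4)))  # → "present perfect"
```

<p>In practice the spans would come from the Dependency Matcher patterns, one pattern set per topic in the taxonomy.</p>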
      </sec>
      <sec id="sec-4-3">
        <title>1The key span is the range of positions the key belongs to.</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Evaluation Metrics</title>
      <sec id="sec-5-1">
<title>Overview</title>
        <p>Two metrics are used to track the model’s performance on diverse aspects. First, we introduce a metric that evaluates the minimal structural requirements of an exercise. Secondly, we control for language diversity to gain more interpretability of the results.</p>
        <sec id="sec-5-1-1">
<title>7.2. Language Diversity</title>
          <p>LLMs often experience the so-called repetition problem, where their output includes excessively repeated segments of text, creating an undesirable effect [35]. In the context of the generation of thousands of exercises, duplicates or overly similar sentences are highly likely to occur. In order to assess this phenomenon we decided to rely on continuous lexical overlap by using Self-BLEU [6] on 2-to-5-grams to capture multi-word repetitions.</p>
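<p>As an illustration, the idea behind a Self-BLEU-style diversity check over 2-to-5-grams can be sketched as below. This is a simplified overlap score, not the reference Self-BLEU implementation of [6]: it averages, per sentence, the fraction of its n-grams that also appear elsewhere in the collection:</p>

```python
def ngrams(tokens, n):
    # set of n-grams of a token sequence
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def self_overlap(sentences, n_range=(2, 5)):
    # mean fraction of each sentence's 2-to-5-grams that also occur in
    # at least one other sentence; higher means more repetition
    scores = []
    for i, sent in enumerate(sentences):
        toks = sent.lower().split()
        others = [s.lower().split() for j, s in enumerate(sentences) if j != i]
        hits = total = 0
        for n in range(n_range[0], n_range[1] + 1):
            grams = ngrams(toks, n)
            pool = set().union(*(ngrams(o, n) for o in others)) if others else set()
            hits += len(grams & pool)
            total += len(grams)
        scores.append(hits / total if total else 0.0)
    return sum(scores) / len(scores)

sents = ["she goes to school", "she goes to work", "do you like tea"]
score = self_overlap(sents)
```

<p>A low score indicates that generated exercises share few multi-word sequences, which is the behaviour we want to monitor during generation.</p>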
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>9. Results</title>
      <sec id="sec-6-1">
<title>Structural compliance</title>
        <p>This metric evaluates the structure and well-formedness of the exercise. Decomposing the validation stage into two steps, we design two rule-based components, namely pertinence and homogeneity.</p>
<p>The former oversees that the gap placeholder is located in the intended position and that the key includes the correct grammar form. The second component checks that the distractor fulfils the criterion of homogeneity as described in Section 2. To achieve this, grammar topics have been grouped into two classes.</p>
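<p>A minimal sketch of how such rule-based homogeneity checks might look is given below. The hand-rolled lemma table stands in for a real morphological analyser, and the word lists are illustrative, not the paper's actual resources:</p>

```python
# Toy lemma lookup standing in for a real morphological analyser.
LEMMAS = {"goes": "go", "went": "go", "gone": "go", "go": "go",
          "better": "good", "best": "good", "good": "good"}

# Illustrative per-topic list of admitted words.
ALLOWED = {"quantifiers": {"some", "any", "much", "many", "few"}}

def homogeneous_inflectional(key, distractor):
    # inflectional class: same lemma as the key, but not identical to it
    key_lemma = LEMMAS.get(key)
    return (distractor != key
            and key_lemma is not None
            and LEMMAS.get(distractor) == key_lemma)

def homogeneous_free_morpheme(topic, key, distractor):
    # free-morpheme class: distractor drawn from the topic's admitted list
    return distractor != key and distractor in ALLOWED.get(topic, set())

print(homogeneous_inflectional("goes", "went"))                 # True
print(homogeneous_free_morpheme("quantifiers", "some", "any"))  # True
```

<p>For topics that admit distractors of either class, passing either check would suffice.</p>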
      </sec>
      <sec id="sec-6-2">
<title>Evaluation results</title>
        <p>To evaluate performance, for each grammar topic we generated 50 exercises, setting the number of distractors to 1. We use the sampling decoding strategy with a temperature of 0.7 to balance the creativity and coherence of the output.</p>
<p>Inflectional Distractors of this class must have the same lemma as the key, so as to rule out the influence of lexis and semantics. We also make adjustments to account for circumstances when the key and the distractor are identical, as well as for handling variation of the auxiliary verb.</p>
        <p>Free morphemes Exercises of this group limit acceptable keys and distractors to a narrow range of options. So, we manually compile a list of admitted words for each grammar topic. If the distractor belongs to that list and is not identical to the key, it is deemed homogeneous. Some grammar topics may be built with distractors of either class. If either of the checks is successful, the distractor passes the test of fitness.</p>
        <p>The exercises are categorized according to their grammar topic. For each exercise, we assessed its structural compliance and its similarity to the exercises within the same grammar topic that have been labelled structurally correct, using the metrics described in Section 3.3. The results are then averaged to obtain the accuracy for each grammar topic. In the end, the model performances are computed by averaging the topic scores. The results are reported in Table 1. Overall, the outcomes are satisfactory. The model on average scores a Structural Compliance (SC) equal to 85%, indicating its ability to generate well-formed exercises. It achieves a Self-BLEU similarity of 7%, demonstrating that text repetitions are limited. Looking at the individual SC scores, we observe that the model tends to perform better on free morphemes grammar topics. We suppose this is due to the limited number of possible key/distractor options. Furthermore, we observed that, due to spaCy limitations in properly labelling certain verbs, grammar topics related to verbal tenses are more prone to be misidentified. This limitation causes occasional misjudgment of the exercise’s structural compliance, leading to a negative effect on the topic performance.</p>
        <sec id="sec-6-2-1">
          <title>9.1. Human Evaluation</title>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>2https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct</title>
      </sec>
      <sec id="sec-6-4">
<title>Experiments and human evaluation</title>
        <p>We fine-tuned the Huggingface implementation of Meta-Llama-3-8B-Instruct2. The model was first quantized to 4-bit precision and then fine-tuned using LoRA adapters, with the following configuration: rank equal to 64, alpha 16, and a dropout percentage of 0.1. The adapters have been added on top of all the attention linear layers so as not to significantly degrade performance. The training hyperparameters are: a constant learning rate of 2e-4, max gradient norm of 0.3, and a weight decay equal to 1e-2. The number of epochs was set to 3, using a batch size of 1 and gradient accumulation equal to 16. The training lasted two hours on an NVIDIA RTX A6000.</p>
        <p>To assess classroom suitability, a human evaluation was performed on all 950 exercises by a computational linguist with a background in pedagogy in language teaching. Each generated exercise was evaluated on four criteria: Plausibility, Ambiguity (defined in Section 2), Common Sense, and Acceptability. Common Sense means that the exercise sentence should be coherent with common sense. Acceptability indicates that a sentence does not perpetuate stereotypes or display inappropriate content, such as violence. If any of these criteria is not met, the item is flagged as incorrect. The results presented in Table 1 establish that 79% of the items satisfy all the requirements to be administered to learners. We conducted an error analysis, summarized in Table 2. Common sense was the most frequently observed inaccuracy, although the magnitude of the issue is modest. As expected, ambiguous distractors remain an open matter in the field, especially for tense-based topics. Instead, we can notice that the generation of sentences with bias or trivial exercises is almost absent.</p>
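<p>The LoRA/QLoRA configuration reported in this section can be sketched with the Hugging Face transformers/peft APIs. This is a sketch under stated assumptions, not the authors' code: the target module names for the attention linear layers and the output directory are assumptions for Llama-style models, and model loading and dataset preparation are omitted:</p>

```python
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit quantization of the base model (QLoRA)
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

# LoRA adapters on the attention linear layers
# (target_modules is an assumption for Llama-style architectures)
lora_config = LoraConfig(
    r=64,                 # rank
    lora_alpha=16,        # alpha
    lora_dropout=0.1,     # dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# hyperparameters as reported in the text
training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    max_grad_norm=0.3,
    weight_decay=1e-2,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    output_dir="out",  # hypothetical path
)
```

<p>With these objects, the quantized base model would be wrapped with the LoRA config and passed to a trainer together with the fine-tuning dataset.</p>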
<p>Furthermore, we asked the annotator to evaluate the structural compliance of the exercises (SC). Then we computed the Precision, Recall and F1 scores using the annotator judgements as gold labels. The results show that our automatic structural compliance metric (SC) has an F1 score of 95% w.r.t. the human evaluation, with a Precision of 98% and a Recall of 91%. This highlights its effectiveness in predicting the overall structural quality of the exercises.</p>
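<p>The agreement computation above reduces to standard precision/recall/F1 over binary labels. The toy labels below are illustrative, not the study's data:</p>

```python
def prf1(gold, pred):
    # gold/pred: lists of 0/1 labels (1 = structurally compliant)
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [1, 1, 1, 0, 0, 1]  # annotator judgements
pred = [1, 1, 0, 0, 1, 1]  # automatic SC metric
print(prf1(gold, pred))    # → (0.75, 0.75, 0.75)
```
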
        <p>One limitation was the presence of many similar
exercises in the SC-Dataset [8] we used to build our resource
from. After removing similar exercises, only 30% of the
original data was left. Another limitation is the
sensitivity of the evaluation metric to the Pattern Matcher,
concerning the evaluation of the key and the distractors,
which caused some false negatives.</p>
<p>The curated dataset and model will be made available to the community3.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
<sec id="sec-7-1">
        <title>10. Conclusion</title>
        <p>We investigated the use of an LLM to generate English grammar MCC exercises. To that end, we curated a new English grammar MCC exercise dataset. We devised metrics for the automatic evaluation of such exercises. We evaluated our work using said metrics and a human study involving domain experts. Our findings demonstrate the model’s ability to generate exercises suitable for educational use. The generated exercises exhibit a low similarity score, indicating that our method can effectively produce original exercises: a significant advantage over prior art, which mostly relies on rule-based methods. We observe that human evaluation correlates positively with the proposed structural compliance metric, corroborating our metric as an indicator of exercise structure correctness and alignment with human expert preferences. We found that a key factor of our method was the availability of high-quality fine-tuning data.</p>
        <p>We wish to thank Zanichelli editore for their support, which enabled data up-sampling, human evaluation, and experimentation with their infrastructure. We also thank Eleonora Cupin for her valuable contribution to the human evaluation of the dataset.</p>
      </sec>
      <sec id="sec-7-2">
        <title>3https://github.com/ZanichelliEditore/</title>
        <p>english-grammar-multiple-choice-generation</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>A. Error analysis</title>
<p>Thanks to the human evaluation we conducted a small error analysis on the errors made by the model. By analyzing the exercises that the annotator marked as incorrect, we found that the major issue is the coherence of the exercise sentence. More precisely, 75% of the wrong exercises have a meaningless or absurd exercise sentence. This behaviour is directly related to the hallucinations suffered by LLMs [36]. The second prevailing error is the ambiguity between the key and the distractors. The model does not possess a deep understanding of what a distractor is; in fact, some generated distractors are interchangeable with the key.</p>
      <p>Despite these limitations, the model is very effective in producing exercises that are not trivial (plausibility error rate at 1%) and negligibly affected by bias and stereotypes.</p>
<p>Per-topic results table: articles, comparison adjectives, conditional statements, future simple, modal verbs, infinitive and gerund verbs, passive tenses, past continuous, past perfect, past simple, personal pronouns, possessive adjectives, prepositions, present continuous, present perfect, present simple, quantifiers, relative clauses, WH-questions; average.</p>
    </sec>
    <sec id="sec-9">
      <title>B. Prompts</title>
<p>In this section, the prompts used in our work are presented. We utilize the Llama3 chat template format, but to make the text more readable we use three placeholders: #SYSTEM, #USER and #ASSISTANT.</p>
      <sec id="sec-9-1">
        <title>B.1. Fine-Tuning prompt</title>
<p>The prompt used to fine-tune the model has the same structure for all the grammar topics. The only varying parts are the name of the grammar topic and the number of distractors required. These parts are enclosed in curly brackets and change depending on the dataset item. The prompt used is the following.</p>
<p>Write a multiple-choice gap exercise on {grammar_topic} with {n_distractors} distractors.</p>
<sec id="sec-9-1-1">
          <title>Listing 1: Fine-tuning prompt.</title>
          <p>A training example is created by concatenating to the prompt the desired JSON representation of the exercise. We decided to use this format because it is easier to use at inference time. An example of training data is the following.</p>
          <p># USER
Write a multiple-choice gap exercise on comparisons with 3 distractors.
# ASSISTANT
{ "filled_text": "Thanks to high technology, doctors can better assess patients’ conditions.", "gapped_text": "Thanks to high technology, doctors can ___ assess patients’ conditions.", "solution": "better", "distractors": ["best", "good", "well"] }</p>
        </sec>
        <sec id="sec-9-1-2">
          <title>Listing 2: Example from the Fine-Tuning dataset.</title>
        </sec>
      </sec>
      <sec id="sec-9-2">
        <title>B.2. Baseline prompt</title>
<p>To test the performance of the baseline Llama3 we utilize its instruction-tuned version, Llama3-Instruct, which can follow directions given by the user. This model is not able to answer correctly using the prompt described above. Therefore, we construct an alternative one in which all the useful information is given to the model. We include the structure of the exercise, the roles of each component with their constraints, and the desired format of the output. The results are the following.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>C. Ethical Considerations</title>
      <p>This section outlines the ethical considerations of the system we developed.</p>
<p>Bias and Fairness The dataset used in this study is obtained from a publicly available source, ensuring that all data was collected with appropriate consent. To protect personal information, we removed all sensitive data such as phone numbers, email addresses and URLs. Since humans created this data, we assume that proper names or any references to existing entities are invented. Moreover, we assume that items mentioning preferences, such as films or books, do not reflect the real preferences of users. We suppose that events or situations described in the exercises are not related to existing facts. Finally, since the data have been created by professional creators, we assume that any possible bias or stereotype in the dataset is unintended and coincidental.</p>
      <p>Accuracy and Reliability The accuracy of the generated exercises is paramount. We employ both automated
validation tools and human expert reviews to ensure the correctness and reliability of the content. Any inaccuracies
identified are promptly rectified. We acknowledge the potential for bias in LLM-generated content. However, the
human evaluation highlights a negligible presence in the generated outputs.</p>
      <p>Transparency We strive for transparency by documenting the sources of our training data and explaining the
model architecture. All the techniques used to manipulate the data and the steps done are described step by step
highlighting all the important aspects.</p>
      <p>Educational Impact We assess the impact of LLM-generated exercises on learning outcomes. We aim to enhance
personalized learning while preventing over-reliance on automated systems. The content is designed to be inclusive
and accessible to all students.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <year>2001</year>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          . the Association for Computational Linguistics, As[20]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Bitew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deleu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Doğruöz</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Develder, sociation for Computational Linguistics</article-title>
          , Philadel-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Demeester</surname>
          </string-name>
          ,
          <article-title>Learning from partially annotated phia</article-title>
          , Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>data: Example-aware creation of gap-filling exercises for language learning</article-title>
          , in: E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, et al. (Eds.),
          <source>Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA@ACL 2023)</source>
          , Association for Computational Linguistics, Toronto, Canada, 13 July
          <year>2023</year>
          , pp.
          <fpage>598</fpage>
          -
          <lpage>609</lpage>
          . URL: https://doi.org/10.18653/v1/2023.bea-1.51. doi:10.18653/v1/2023.bea-1.51.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <label>[21]</label>
        <mixed-citation>
          <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ott</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Goyal</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Du</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Joshi</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Levy</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lewis</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Stoyanov</surname></string-name>,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>,
          <year>2019</year>. URL: https://arxiv.org/abs/1907.11692. arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <label>[22]</label>
        <mixed-citation>
          <string-name><given-names>S.</given-names> <surname>Matsumori</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Okuoka</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Shibata</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Inoue</surname></string-name>, et al.,
          <article-title>Automatic open cloze question generation using a masked language model</article-title>,
          <source>IEEE Access</source>
          <volume>11</volume>
          (<year>2023</year>)
          <fpage>9835</fpage>
          -
          <lpage>9850</lpage>
          . URL: http://dx.doi.org/10.1109/ACCESS.2023.3239005. doi:10.1109/ACCESS.2023.3239005.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <label>[23]</label>
        <mixed-citation>
          <string-name><given-names>P.</given-names> <surname>Chomphooyod</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Suchato</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Tuaycharoen</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Punyabukkana</surname></string-name>,
          <article-title>English grammar multiple-choice question generation using text-to-text transfer transformer</article-title>,
          <source>Comput. Educ. Artif. Intell.</source>
          <volume>5</volume>
          (<year>2023</year>) 100158. URL: https://doi.org/10.1016/j.caeai.2023.100158. doi:10.1016/J.CAEAI.2023.100158.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <label>[24]</label>
        <mixed-citation>
          <string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name>,
          <string-name><given-names>A. N.</given-names> <surname>Gomez</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Kaiser</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>,
          <article-title>Attention is all you need</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/1706.03762. arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <label>[25]</label>
        <mixed-citation>
          <string-name><given-names>V.</given-names> <surname>Slavuj</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Nacinovic Prskalo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Brkic Bakaric</surname></string-name>,
          <article-title>Automatic generation of language exercises based on a universal methodology: An analysis of possibilities</article-title>,
          <source>Bulletin of the Transilvania University of Brasov, Series IV: Philology and Cultural Studies</source>
          <volume>14(63)</volume>
          (<year>2022</year>)
          <fpage>29</fpage>
          -
          <lpage>48</lpage>
          . doi:10.31926/but.pcs.2021.63.14.2.3.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <label>[26]</label>
        <mixed-citation>
          <string-name><given-names>A.</given-names> <surname>Malafeev</surname></string-name>,
          <article-title>Language exercise generation</article-title>,
          <source>International Journal of Conceptual Structures and Smart Applications</source>
          <volume>2</volume>
          (<year>2014</year>)
          <fpage>20</fpage>
          -
          <lpage>35</lpage>
          . doi:10.4018/IJCSSA.2014070102.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <label>[27]</label>
        <mixed-citation>
          <string-name><given-names>D.</given-names> <surname>Perrett</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>March</surname></string-name>,
          <article-title>An evidence-based approach […] language tests</article-title>,
          <year>2019</year>. doi:10.13140/RG.2.2.22779.16165.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <label>[28]</label>
        <mixed-citation>
          <string-name><given-names>K.</given-names> <surname>Papineni</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Roukos</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Ward</surname></string-name>,
          <string-name><given-names>W.-J.</given-names> <surname>Zhu</surname></string-name>,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>,
          in: <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>
          , Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <label>[29]</label>
        <mixed-citation>
          <string-name><given-names>C.-Y.</given-names> <surname>Lin</surname></string-name>,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>,
          in: <source>Text Summarization Branches Out</source>
          , Association for Computational Linguistics, Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <label>[30]</label>
        <mixed-citation>
          <string-name><given-names>S.</given-names> <surname>Banerjee</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lavie</surname></string-name>,
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>,
          in: J. Goldstein, A. Lavie, C.-Y. Lin, C. Voss (Eds.),
          <source>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</source>
          , Association for Computational Linguistics, Ann Arbor, Michigan,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: https://aclanthology.org/W05-0909.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <label>[31]</label>
        <mixed-citation>
          <collab>Meta</collab>,
          <article-title>Introducing Meta Llama 3: The most capable openly available LLM to date</article-title>,
          https://ai.meta.com/blog/meta-llama-3/,
          <year>April 2024</year>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <label>[32]</label>
        <mixed-citation>
          <string-name><given-names>K.</given-names> <surname>Tirumala</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Simig</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Aghajanyan</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Morcos</surname></string-name>,
          <article-title>D4: Improving LLM pretraining via document de-duplication and diversification</article-title>,
          in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          , Curran Associates, Inc.,
          <year>2023</year>
          , pp.
          <fpage>53983</fpage>
          -
          <lpage>53995</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/a8f8cbd7f7a5fb2c837e578c75e5b615-Paper-Datasets_and_Benchmarks.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <label>[33]</label>
        <mixed-citation>
          <collab>OpenAI</collab> et al.,
          <article-title>GPT-4 technical report</article-title>,
          <year>2024</year>. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <label>[34]</label>
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Dettmers</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Pagnoni</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Holtzman</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
          <article-title>QLoRA: Efficient finetuning of quantized LLMs</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/2305.14314. arXiv:2305.14314.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <label>[35]</label>
        <mixed-citation>
          <string-name><given-names>Z.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Lam</surname></string-name>,
          <string-name><given-names>A. M.-C.</given-names> <surname>So</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Shi</surname></string-name>,
          <article-title>A theoretical analysis of the repetition problem in text generation</article-title>,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>35</volume>
          (<year>2021</year>)
          <fpage>12848</fpage>
          -
          <lpage>12856</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/17520. doi:10.1609/aaai.v35i14.17520.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <label>[36]</label>
        <mixed-citation>
          <string-name><given-names>Z.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Jain</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kankanhalli</surname></string-name>,
          <article-title>Hallucination is inevitable: An innate limitation of large language models</article-title>,
          <year>2024</year>. URL: https://arxiv.org/abs/2401.11817. arXiv:2401.11817.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>