=Paper=
{{Paper
|id=Vol-3878/38_main_long
|storemode=property
|title=Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises
|pdfUrl=https://ceur-ws.org/Vol-3878/38_main_long.pdf
|volume=Vol-3878
|authors=Nicolò Donati,Matteo Periani,Paolo Di Natale,Giuseppe Savino,Paolo Torroni
|dblpUrl=https://dblp.org/rec/conf/clic-it/DonatiPNST24
}}
==Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises==
Nicolò Donati1,2,*,†, Matteo Periani1,†, Paolo Di Natale3,†, Giuseppe Savino2 and Paolo Torroni1

1 University of Bologna, Viale del Risorgimento 2, 40136 Bologna BO, Italy
2 Zanichelli editore S.p.A., Via Irnerio 34, 40126 Bologna, Italy
3 University of Bologna, Corso della Repubblica 136, 47121 Forlì FC, Italy
Abstract
English grammar Multiple-Choice Cloze (MCC) exercises are crucial for improving learners’ grammatical proficiency and
comprehension skills. However, creating these exercises is labour-intensive and requires expert knowledge. Effective MCC
exercises must be contextually relevant and engaging, incorporating distractors—plausible but incorrect alternatives—to
balance difficulty and maintain learner motivation. Despite the increasing interest in utilizing large language models (LLMs)
in education, their application in generating English grammar MCC exercises is still limited. Previous methods typically
impose constraints on LLMs, producing grammatically correct yet uncreative results. This paper explores the potential
of LLMs to independently generate diverse and contextually relevant MCC exercises without predefined limitations. We
hypothesize that LLMs can craft self-contained sentences that foster learners' communicative competence. Our analysis
of existing MCC exercise datasets revealed issues of diversity, completeness, and correctness. Furthermore, we address
the lack of a standardized automatic metric for evaluating the quality of generated exercises. Our contributions include
developing an LLM-based solution for generating MCC exercises, curating a comprehensive dataset spanning 19 grammar
topics, and proposing an automatic metric validated against human expert evaluations. This work aims to advance the
automatic generation of English grammar MCC exercises, enhancing both their quality and creativity.
Keywords
Large Language Models, Distractor Generation, Multiple-Choice Cloze, Evaluation Metric
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
n.donati@unibo.it (N. Donati); matteo.periani2@studio.unibo.it (M. Periani); paolo.dinatale3@studio.unibo.it (P. Di Natale); gsavino@zanichelli.it (G. Savino); p.torroni@unibo.it (P. Torroni)
ORCID: 0009-0000-5673-5274 (N. Donati); 0000-0002-9253-8638 (P. Torroni)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

English grammar Multiple-Choice Cloze (MCC) exercises are widely used tools for enhancing a learner's grammatical proficiency and comprehension skills. They consist of fill-the-gap questions where the gap must be filled by choosing one correct solution (key) among several options. The incorrect alternatives are called distractors. Devising these exercises is a labour-intensive process requiring expert knowledge in language teaching and content creation. The exercises must be contextually relevant to help learners understand how rules apply in real-life situations. This requires crafting sentences and scenarios that are both engaging and educational. Learners have different levels of proficiency, from beginners to advanced. Striking the right balance ensures that learners are neither bored nor frustrated, which is crucial for maintaining their motivation and progress. In MCC exercises this is done by choosing distractors that are incorrect but plausible, thus keeping the exercise challenging for the learner. Studies in Communicative Language Teaching demonstrate that the learner must possess the knowledge of grammatical structures and the ability to compose syntactically well-formed propositions, and must also acquire the ability to employ grammatical forms in discourse [1][2].

Recently, there has been a growing interest in applying LLMs in education [3]. However, the adoption of LLMs for English grammar MCC exercise generation is still limited. Some proposals focus on testing vocabulary [4] or use LLMs while constraining their generation capability, for example using fixed part-of-speech sequences [5]. Although the outputs of these models are typically grammatically correct, they lack creativity [6].

In this work, we investigate the potential of LLMs in automatic exercise generation without hampering their creativity. Our working hypothesis is that LLMs can generate self-contained sentences, recreating situational contexts that elicit the communicative competence of the learner [7]. Our main objective is to understand to what extent LLMs can generate accurate grammar exercises without predefined constraints or POS sequences. To pursue this objective, we analyzed the available English grammar MCC exercises dataset [8]. We observed that it has limited diversity, some topics are underrepresented, and there are often mistakes. Existing literature does not offer a single agreed-upon automatic metric for evaluating the quality of the generated
grammar exercises. Therefore, we set out to identify such a metric and validate its alignment with human judgment.

In this paper, we present a novel solution utilizing an LLM to generate English grammar MCC exercises. Our contribution also focuses on curating an MCC dataset that spans 19 topics. Lastly, we propose an automatic metric to evaluate the exercises' correctness, and we verify the validity of our contribution through human expert evaluation.

2. Task description

Grammar exercises should define the range of abilities to be assessed and avoid the influence of irrelevant factors like past knowledge or cultural background [9]. We followed the best-practice guidelines for creating grammar MCC items defined in [10][11]. According to them, each item consists of three components.

• Body: the sentence with a gap in place of the key.
• Key: the correct answer.
• Distractor: the incorrect answer.

The body plays a central role in designing effective exercises. Learners should be able to infer the key based on the helpful elements present in the body. However, the effectiveness of an exercise depends mainly on the quality of its distractors. Ideally, challenging distractors should be homogeneous, plausible, and unambiguous. Homogeneous distractors share the same syntactic category as the key [12]. Plausible distractors provide a credible alternative to the key. Lastly, unambiguous distractors ensure that none of them could be considered correct if used in place of the key [10].

3. Related Works

The generation of MCC exercises has been explored from various perspectives. In this section, we briefly discuss the main related approaches.

3.1. MCC Dataset

Prior works in creating MCC datasets are very limited. To the best of our knowledge, the only one in English was presented by Liu et al. in their work SC-Ques [8]. It comprises real English test items for students developed by teaching professionals. The dataset contains roughly 300k MCC sentence-completion exercises, composed of the question body, a varying number of alternative answers, and the key (i.e. the correct alternative). It comprises exercises with both single and multiple blanks. It has various limitations, discussed in Section 5.

3.2. Grammar MCC Exercise Generation

A large share of prior works uses rules to create grammar MCC exercises (Sumita et al. [13], Brown et al. [14], Smith et al. [15], Majumder and Saha [16], Lin et al. [17]). They all follow a three-fold process: (1) select sentences from arbitrary sources, (2) insert the blank into the sentence, and (3) generate distractors for the blank. Sentences usually come from corpora or user-submitted passages. Many solutions restrict gap detection to fixed schemes: Sumita et al. [13] picked out the leftmost single verb; Lin et al. [17] only selected adjectives as a blank. One of the few exceptions is Goto et al. [18], who proposed a method based on Conditional Random Fields (CRFs) [19]. Methods that extract sentences from arbitrary text suffer from several limitations. First of all, they lack customization options, such as adjusting for the subject or difficulty level of the exercise. Additionally, they are limited by the length and quality of the extracted texts, which can negatively impact the system's results.

Recently, parts of MCC generation have been executed by neural networks instead of rule-based algorithms. Bitew et al. [20] use a variation of the RoBERTa [21] model to predict the gap positions within the sentence. To decrease ambiguity, Matsumori et al. [22] trained a Masked Language Model for gap score prediction of each candidate sentence. Chomphooyod et al. [23] proposed a system that uses Transformers [24] to generate candidate sentences given a POS sequence, a keyword, and a desired grammar topic.

3.3. Metrics

In the literature, the evaluation of MCC exercises is mainly based on judgments expressed by human annotators. Slavuj et al. [25] asked annotators to perform the language tasks, assuming that the presence of incorrect answers would be a sign of ill-formed exercises. Teachers were then asked to provide feedback on any pitfalls they encountered. Malafeev [26] simply attended to suitability for classroom use. Chomphooyod et al. [23] evaluate different aspects of each exercise, such as grammatical and semantic correctness, relevance with respect to the topic, and acceptability.

Very few automatic metrics have been proposed to evaluate exercise generation. Bitew et al. [20] rely on span overlap with respect to ground truth to assess the consistency of gap detection. March et al. [27] test the effectiveness of distractors by their selection rate.

Since an important criterion for exercise collection is diversity, similarity measures have often been applied to MCC exercises. Metrics like BLEU [28], ROUGE [29], and METEOR [30] have been used even though they were originally designed for different applications.
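As a concrete illustration of how such similarity measures can flag repetitive exercise collections, the sketch below computes a Self-BLEU-style score from clipped n-gram overlap. It is a simplification, not the exact metric used in any of the cited works: it averages modified n-gram precisions arithmetically instead of using full BLEU (geometric mean plus brevity penalty), and the example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_precision(candidate, references, n):
    # Modified n-gram precision: clip each candidate n-gram count by the
    # maximum count observed in any single reference sentence.
    cand = Counter(ngrams(candidate, n))
    if not cand:
        return 0.0
    ref = Counter()
    for r in references:
        for gram, count in Counter(ngrams(r, n)).items():
            ref[gram] = max(ref[gram], count)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def self_overlap(corpus, n_values=(2, 3, 4, 5)):
    # For each sentence, measure its n-gram overlap against the rest of the
    # corpus, then average over sentences. Higher = more repetitive output.
    scores = []
    for i, sentence in enumerate(corpus):
        others = corpus[:i] + corpus[i + 1:]
        precisions = [overlap_precision(sentence, others, n) for n in n_values]
        scores.append(sum(precisions) / len(precisions))
    return sum(scores) / len(scores)

generated = [s.split() for s in [
    "She has lived here since 2010 .",
    "They have worked there for many years .",
    "She has lived here since last June .",
]]
print(f"repetition score: {self_overlap(generated):.3f}")
```

A score near 0 indicates a diverse collection; a score near 1 indicates near-duplicate sentences.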
4. Approach

To overcome the limitations of existing solutions, we utilized an LLM to generate exercises in a single, constraint-free step. We chose Llama3 [31] due to its acceptable balance between computational cost and performance. To evaluate its effectiveness, we engineered a well-structured prompt (Appendix B.2). However, the results were unsatisfactory. The model exhibited significant difficulties with certain grammar topics and consistently failed to generate effective distractors. Therefore, we decided to fine-tune the model using a well-formatted dataset containing exercises with distractors that meet our criteria. Each dataset example includes four features: the grammar topic, the exercise text, the key, and the distractors. The model is trained to produce the exercise text, key, and distractors when given a specific grammar topic as input. The prompt used during the fine-tuning and an example of input-output text can be found in appendix section B.1.

To assess the correctness of the generated items, we devised metrics that evaluate the minimal structural requirements of an exercise through rule-based analysis. These are defined in Section 7. To monitor the results we used Self-BLEU [6], a metric that inspects repetitions by checking continuous lexical overlap.

5. Dataset Curation

We developed the fine-tuning dataset based on the data released by [8]. The data underwent three pre-processing steps: cleaning, grammar topic identification, and removal of similar examples.

Data cleaning First, we removed improperly formatted examples and cleaned the text to comply with the tokenizer specifications and limit potential noise. Items with multiple blank spaces or fewer than two distractors were discarded. Next, we filtered out exercise texts containing instructions, non-Latin symbols or letters, emails, phone numbers, and links.

Extraction of the grammar topic The second step involves the assignment of the grammar topic to each exercise by means of the Pattern Matcher. First, grammar topics are defined in a tailor-made grammar taxonomy with the aid of the spaCy Dependency Matcher. Given a set of sentences, this tool allows one to identify whether each sentence features the described grammar topics, and if so, at what position. The relevant topic is chosen by comparing the overlap between the position of the topic detected by the Pattern Matcher and the key span¹. To ensure the exclusively grammatical nature of the exercises, distractors are checked using the metrics proposed in Section 3.3. All exercises lacking valid distractors are then discarded.

¹ The key span is the range of positions the key belongs to.

Deduplication We deduplicated the data, removing all similar exercises to increase the quality of our dataset [32]. Exercises are clustered by topic and compared in terms of embeddings through cosine similarity. Using a per-topic threshold T_p, where p denotes the topic, all elements exceeding the limit are discarded. Lastly, we noticed that SC-Ques [8] had an unbalanced representation of grammar topics. For example, half of the WH-questions have "How" as the key. For each topic, a maximum ratio of key presence is established, and superfluous data are discarded.

After pre-processing, the least represented class contained a quarter of the examples present in the most represented one. The only exception was the "WH-questions" class, which was underrepresented. Therefore, we upsampled the class with synthetic exercises using GPT-4 [33]. The dataset is composed of several fields: the filled_text (complete exercise sentence), the gapped_text (sentence with a blank gap), the key (the text removed to create the gap), and the list of distractors.

6. Fine-Tuning

We designed the fine-tuning process to generate exercises on specific grammar topics with a fixed number of distractors. The model's expected response is a JSON-encoded exercise coherent with the dataset structure described in Section 5. We observed that including the filled_text in the output improves overall accuracy and reduces similarity among exercises. An example from the fine-tuning dataset can be found in appendix section B.1. To reduce the computational resources required for fine-tuning, we employed the Quantized Low-Rank Adapters (QLoRA) [34] approach. Our tests on small models revealed that this strategy prevents significant shrinkage of the model's dictionary during fine-tuning. Consequently, the generated exercises exhibit greater variability, enhancing the model's creativity.

7. Evaluation Metrics

Two metrics are used to track the model's performance on diverse aspects. First, we introduce a metric that evaluates the minimal structural requirements of an exercise. Secondly, we control for language diversity to have more interpretability of the results.
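The per-topic deduplication step described in Section 5 can be sketched as follows. This is a minimal illustration, not the paper's implementation: a toy bag-of-words vector stands in for the (unspecified) sentence embeddings, and the threshold value and example sentences are invented.

```python
import math
from collections import Counter, defaultdict

def embed(text):
    # Toy bag-of-words vector; the paper compares neural sentence embeddings.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def deduplicate(exercises, thresholds):
    # exercises: list of (topic, sentence); thresholds: per-topic T_p.
    # Within each topic cluster, keep an exercise only if its similarity to
    # every exercise already kept stays at or below the topic's threshold.
    by_topic = defaultdict(list)
    for topic, text in exercises:
        by_topic[topic].append(text)
    kept = []
    for topic, texts in by_topic.items():
        retained = []  # embeddings of exercises kept so far for this topic
        for text in texts:
            emb = embed(text)
            if all(cosine(emb, other) <= thresholds[topic] for other in retained):
                retained.append(emb)
                kept.append((topic, text))
    return kept

data = [
    ("past simple", "Yesterday she ___ to the market ."),
    ("past simple", "Yesterday she ___ to the market ."),  # exact duplicate, dropped
    ("past simple", "Last week they ___ a new car ."),
]
print(deduplicate(data, {"past simple": 0.9}))
```

Real pipelines would use sentence embeddings rather than word counts, but the per-topic thresholding logic is the same.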
7.1. Structural Compliance

This metric evaluates the structure and well-formedness of the exercise. Decomposing the validation stage into two steps, we design two rule-based components, namely pertinence and homogeneity. The former oversees that the gap placeholder is located in the intended position and that the key includes the correct grammar form. The second component checks that the distractor fulfils the criterion of homogeneity as described in Section 2. To achieve this, grammar topics have been grouped into two classes.

Inflectional Distractors must have the same lemma as the key so as to rule out the influence of lexis and semantics. We also make adjustments to account for circumstances when the key and the distractor are identical, as well as for handling variation of the auxiliary verb.

Free morphemes Exercises of this group limit acceptable keys and distractors to a narrow range of options. So, we manually compile a list of admitted words for each grammar topic. If the distractor belongs to that list and is not identical to the key, it is deemed homogeneous.

Some grammar topics may be built with distractors of either of the two classes. If either of the checks is successful, the distractor passes the test of fitness.

7.2. Language Diversity

LLMs often experience the so-called repetition problem, where their output includes excessively repeated segments of text, creating an undesirable effect [35]. In the context of the generation of thousands of exercises, duplicates or overly similar sentences are highly likely to occur. To assess this phenomenon, we decided to rely on continuous lexical overlap by applying Self-BLEU [6] to 2-to-5-grams to capture multi-word repetitions.

8. Experiments

We fine-tuned the Huggingface implementation of Meta-Llama-3-8B-Instruct². The model was first quantized to 4-bit precision and then fine-tuned using LoRA adapters with the following configuration: rank equal to 64, alpha 16, and a dropout percentage of 0.1. The adapters have been added on top of all the attention linear layers so as not to significantly degrade performance. The training hyperparameters are: a constant learning rate of 2e−4, max gradient norm of 0.3, and a weight decay equal to 1e−2. The number of epochs was set to 3, using a batch size of 1 and gradient accumulation equal to 16. The training lasted two hours on an NVIDIA RTX A6000.

² https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

9. Results

To evaluate performance, for each grammar topic we generated 50 exercises, setting the number of distractors to 1. We use the sampling decoding strategy with a temperature equal to 0.7 to balance the creativity and coherence of the output.

The exercises are categorized according to their grammar topic. For each exercise, we assessed its structural compliance and its similarity to the exercises within the same grammar topic that have been labelled structurally correct, using the metrics described in Section 3.3. The results are then averaged to obtain the accuracy for each grammar topic. In the end, the model performances are computed by averaging the topic scores. The results are reported in Table 1.

Overall, the outcomes are satisfactory. The model on average scores a Structural Compliance (SC𝐴) equal to 85%, indicating its ability to generate well-formed exercises. It achieves a self-BLEU similarity of 7%, demonstrating that text repetitions are limited. Looking at the individual SC scores, we observe that the model tends to perform better on free-morpheme grammar topics. We suppose this is due to the limited number of possible key/distractor options. Furthermore, we observed that due to spaCy limitations in properly labelling certain verbs, grammar topics related to verbal tenses are more prone to be misidentified. This limitation causes occasional misjudgment of the exercise's structural compliance, leading to a negative effect on the topic performance.

9.1. Human Evaluation

To assess classroom suitability, a human evaluation was performed on all 950 exercises by a computational linguist with a background in pedagogy in language teaching. Each generated exercise (EC) was evaluated on four criteria: Plausibility, Ambiguity (defined in Section 2), Common Sense, and Acceptability. Common Sense means that the exercise sentence should be coherent with common sense. Acceptability indicates that a sentence does not perpetuate stereotypes or display inappropriate content, such as violence. If any of these criteria is not met, the item is flagged as incorrect.

The results presented in Table 1 show that 79% of the items satisfy all the requirements to be administered to learners. We conducted an error analysis, summarized in Table 2. Common sense was the most frequently observed inaccuracy, although the magnitude of the issue is modest. As expected, ambiguous
grammar topic SC𝐴 self-BLEU SC𝐻 EC
articles 0.94 0.03 0.94 0.74
comparison adjectives 0.90 0.09 0.92 0.72
conditional statements 0.76 0.07 0.90 0.66
future simple 0.82 0.06 0.90 0.90
modal verbs 0.62 0 0.78 0.70
infinitive and gerund verbs 0.76 0 0.96 0.86
passive tenses 0.84 0 0.86 0.74
past continuous 0.98 0.16 0.98 0.88
past perfect 0.94 0.12 0.96 0.82
past simple 0.88 0 0.86 0.82
personal pronouns 0.85 0.07 0.92 0.74
possessive adjectives 0.82 0.12 0.90 0.72
prepositions 0.84 0 0.92 0.72
present continuous 0.96 0.11 0.98 0.88
present perfect 0.66 0.08 0.98 0.84
present simple 0.88 0.05 0.88 0.86
quantifiers 0.88 0.07 0.88 0.84
relative clauses 0.94 0.03 0.94 0.74
WH- question 0.98 0.18 1.00 0.90
average 0.85 0.07 0.92 0.79
Table 1
Results of the evaluation on the generated exercises. SC𝐴 is the Structural Compliance evaluated by our metric, SC𝐻 is the Structural Compliance evaluated by the human annotator, and EC is the exercise correctness. The double lines separate the results from the automatic metric (left) from those obtained by the human evaluation (right). More results on error analysis can be found in Table 2.
distractors remain an open matter in the field, especially for tense-based topics. Instead, we can notice that the generation of sentences with bias or trivial exercises is almost absent.

Furthermore, we asked the annotator to evaluate the structural compliance of the exercises (SC𝐻). Then we computed the Precision, Recall and F1 scores using the annotator judgements as gold labels. The results show that our automatic structural compliance metric (SC𝐴) has an F1 score of 95% w.r.t. the human evaluation, with a Precision of 98% and a Recall of 91%. This highlights its effectiveness in predicting the overall structural quality of the exercises.

10. Conclusion

We investigated the use of an LLM to generate English MCC grammar exercises. To that end, we curated a new English grammar MCC exercises dataset. We devised metrics for the automatic evaluation of such exercises. We evaluated our work using said metrics and a human study involving domain experts. Our findings demonstrate the model's ability to generate exercises suitable for educational use. The generated exercises exhibit a low similarity score, indicating that our method can effectively produce original exercises: a significant advantage over prior art, which mostly relies on rule-based methods. We observe that human evaluation correlates positively with the proposed structural compliance metric, corroborating our metric as an indicator of exercise structure correctness and alignment with human expert preferences. We found that a key factor of our method was the availability of high-quality fine-tuning data.

One limitation was the presence of many similar exercises in the SC-Ques dataset [8] from which we built our resource. After removing similar exercises, only 30% of the original data was left. Another limitation is the sensitivity of the evaluation metric to the Pattern Matcher, concerning the evaluation of the key and the distractors, which caused some false negatives.

The curated dataset and model will be available to the community³.

³ https://github.com/ZanichelliEditore/english-grammar-multiple-choice-generation

Acknowledgments

We wish to thank Zanichelli editore for their support, which enabled data up-sampling, human evaluation, and experimentation with their infrastructure. We also thank Eleonora Cupin for her valuable contribution to the human evaluation of the dataset.
References

[1] H. G. Widdowson, Teaching Language as Communication, Oxford University Press, Oxford, 1978.
[2] H. G. Widdowson, Explorations in Applied Linguistics, Oxford University Press, Oxford, 1979.
[3] W. Gan, Z. Qi, J. Wu, J. C.-W. Lin, Large language models in education: Vision and opportunities, in: 2023 IEEE International Conference on Big Data (BigData), 2023, pp. 4776–4785. doi:10.1109/BigData59044.2023.10386291.
[4] Q. Wang, R. Rose, N. Orita, A. Sugawara, Automated generation of multiple-choice cloze questions for assessing English vocabulary using GPT-turbo 3.5, 2024. URL: https://arxiv.org/abs/2403.02078. arXiv:2403.02078.
[5] P. Chomphooyod, A. Suchato, N. Tuaycharoen, P. Punyabukkana, English grammar multiple-choice question generation using text-to-text transfer transformer, Computers and Education: Artificial Intelligence 5 (2023) 100158. URL: https://www.sciencedirect.com/science/article/pii/S2666920X23000371. doi:10.1016/j.caeai.2023.100158.
[6] Y. Zhu, S. Lu, L. Zheng, J. Guo, W. Zhang, J. Wang, Y. Yu, Texygen: A benchmarking platform for text generation models, 2018. URL: https://arxiv.org/abs/1802.01886. arXiv:1802.01886.
[7] D. H. Hymes, On communicative competence, in: J. B. Pride, J. Holmes (Eds.), Sociolinguistics. Selected Readings, Penguin, Harmondsworth, 1972, pp. 269–293.
[8] Q. Liu, Y. Huang, Z. Liu, S. Huang, J. Chen, X. Zhao, G. Lin, Y. Zhou, W. Luo, SC-Ques: A sentence completion question dataset for English as a second language learners, in: C. Frasson, P. Mylonas, C. Troussas (Eds.), Augmented Intelligence and Intelligent Tutoring Systems, Springer Nature Switzerland, Cham, 2023, pp. 678–690.
[9] L. F. Bachman, Fundamental Considerations in Language Testing, Oxford University Press, Oxford, 1990.
[10] J. E. Purpura, Assessing Grammar, Cambridge Language Assessment, Cambridge University Press, 2004.
[11] G. Fulcher, Practical Language Testing, 1st ed., Routledge, 2010. doi:10.4324/9780203767399.
[12] V.-M. Pho, T. André, A.-L. Ligozat, B. Grau, G. Illouz, T. François, Multiple choice question corpus analysis for distractor characterization, in: N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), European Language Resources Association (ELRA), Reykjavik, Iceland, 2014, pp. 4284–4291. URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/692_Paper.pdf.
[13] E. Sumita, F. Sugaya, S. Yamamoto, Measuring non-native speakers' proficiency of English by using a test with automatically-generated fill-in-the-blank questions, 2005. doi:10.3115/1609829.1609839.
[14] J. Brown, G. A. Frishkoff, M. Eskénazi, Automatic question generation for vocabulary assessment, in: HLT/EMNLP 2005, Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 6–8 October 2005, Vancouver, British Columbia, Canada, The Association for Computational Linguistics, 2005, pp. 819–826. URL: https://aclanthology.org/H05-1103/.
[15] S. Smith, A. P.V.S, A. Kilgarriff, Gap-fill tests for language learners: Corpus-driven item generation, 2010. URL: https://api.semanticscholar.org/CorpusID:61531901.
[16] M. Majumder, S. K. Saha, A system for generating multiple choice questions: With a novel approach for sentence selection, in: H. Chen, Y. Tseng, Y. Matsumoto, L. Wong (Eds.), Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP, Beijing, China, July 31, 2015, Association for Computational Linguistics, 2015, pp. 64–72. URL: https://doi.org/10.18653/v1/W15-4410. doi:10.18653/V1/W15-4410.
[17] M. Majumder, S. K. Saha, A system for generating multiple choice questions: With a novel approach for sentence selection, in: H. Chen, Y. Tseng, Y. Matsumoto, L. Wong (Eds.), Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP, Beijing, China, July 31, 2015, Association for Computational Linguistics, 2015, pp. 64–72. URL: https://doi.org/10.18653/v1/W15-4410. doi:10.18653/V1/W15-4410.
[18] T. Goto, T. Kojiri, T. Watanabe, T. Iwata, T. Yamada, Automatic generation system of multiple-choice cloze questions and its evaluation, Knowledge Management & E-Learning: An International Journal 2 (2010) 210–224. URL: https://api.semanticscholar.org/CorpusID:15482954.
[19] J. D. Lafferty, A. McCallum, F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: C. E. Brodley, A. P. Danyluk (Eds.), Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 – July 1, 2001, Morgan Kaufmann, 2001, pp. 282–289.
[20] S. K. Bitew, J. Deleu, A. S. Doğruöz, C. Develder, T. Demeester, Learning from partially annotated data: Example-aware creation of gap-filling exercises for language learning, in: E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, T. Zesch (Eds.), Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2023, Toronto, Canada, 13 July 2023, Association for Computational Linguistics, 2023, pp. 598–609. URL: https://doi.org/10.18653/v1/2023.bea-1.51. doi:10.18653/V1/2023.BEA-1.51.
[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL: https://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[22] S. Matsumori, K. Okuoka, R. Shibata, M. Inoue, Y. Fukuchi, M. Imai, Mask and cloze: Automatic open cloze question generation using a masked language model, IEEE Access 11 (2023) 9835–9850. URL: http://dx.doi.org/10.1109/ACCESS.2023.3239005. doi:10.1109/access.2023.3239005.
[23] P. Chomphooyod, A. Suchato, N. Tuaycharoen, P. Punyabukkana, English grammar multiple-choice question generation using text-to-text transfer transformer, Comput. Educ. Artif. Intell. 5 (2023) 100158. URL: https://doi.org/10.1016/j.caeai.2023.100158. doi:10.1016/J.CAEAI.2023.100158.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. URL: https://arxiv.org/abs/1706.03762. arXiv:1706.03762.
[25] V. Slavuj, L. Nacinovic Prskalo, M. Brkic Bakaric, Automatic generation of language exercises based on a universal methodology: An analysis of possibilities, Bulletin of the Transilvania University of Brasov. Series IV: Philology and Cultural Studies 14 (63) (2022) 29–48. doi:10.31926/but.pcs.2021.63.14.2.3.
[26] A. Malafeev, Language exercise generation, International Journal of Conceptual Structures and Smart Applications 2 (2014) 20–35. doi:10.4018/IJCSSA.2014070102.
[27] D. Perrett, D. March, An evidence-based approach to distractor generation in multiple-choice language tests, 2019. doi:10.13140/RG.2.2.22779.16165.
[28] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
[29] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
[30] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: J. Goldstein, A. Lavie, C.-Y. Lin, C. Voss (Eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 65–72. URL: https://aclanthology.org/W05-0909.
[31] Meta, Introducing Meta Llama 3: The most capable openly available LLM to date, https://ai.meta.com/blog/meta-llama-3/, April 2024.
[32] K. Tirumala, D. Simig, A. Aghajanyan, A. Morcos, D4: Improving LLM pretraining via document de-duplication and diversification, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 53983–53995. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/a8f8cbd7f7a5fb2c837e578c75e5b615-Paper-Datasets_and_Benchmarks.pdf.
[33] OpenAI et al., GPT-4 technical report, 2024. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
[34] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, 2023. URL: https://arxiv.org/abs/2305.14314. arXiv:2305.14314.
[35] Z. Fu, W. Lam, A. M.-C. So, B. Shi, A theoretical analysis of the repetition problem in text generation, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 12848–12856. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17520. doi:10.1609/aaai.v35i14.17520.
[36] Z. Xu, S. Jain, M. Kankanhalli, Hallucination is inevitable: An innate limitation of large language models, 2024. URL: https://arxiv.org/abs/2401.11817. arXiv:2401.11817.
A. Error analysis
Thanks to the human evaluation, we conducted a small error analysis of the mistakes made by the model. By analyzing the exercises that the annotator marked as incorrect, we found that the main issue is the coherence of the exercise sentence: 75% of the wrong exercises have a meaningless or absurd exercise sentence. This behaviour is directly related to the hallucinations suffered by LLMs [36]. The second most prevalent error is ambiguity between the key and the distractors: the model does not possess a deep understanding of what a distractor is, and some generated distractors are in fact interchangeable with the key.
Despite these limitations, the model is very effective at producing exercises that are not trivial (plausibility error rate of 1%) and negligibly affected by bias and stereotypes.
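The per-topic rates reported in Table 2 can be reproduced from the annotator judgments with a short script. This is a sketch under our own assumptions: the label names and the list-of-tuples layout of the annotations are hypothetical, not the authors' actual data format.

```python
from collections import Counter

# Hypothetical annotator labels: (grammar topic, error category assigned)
# for each exercise marked as incorrect.
annotations = [
    ("past simple", "ambiguity"),
    ("past simple", "common_sense"),
    ("modal verbs", "common_sense"),
    ("modal verbs", "ambiguity"),
]

def category_rates(annotations, topic):
    """Fraction of each error category among a topic's wrong exercises."""
    labels = [cat for t, cat in annotations if t == topic]
    counts = Counter(labels)
    total = len(labels)
    return {cat: n / total for cat, n in counts.items()}

print(category_rates(annotations, "past simple"))
# {'ambiguity': 0.5, 'common_sense': 0.5}
```

Averaging these per-topic dictionaries over all topics would yield the bottom row of Table 2.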
grammar topic CS Acc Amb P
articles 1.00 - - -
comparison adjectives 0.64 0.36 - -
conditional statements 1.00 - - -
future simple 1.00 - - -
modal verbs 0.85 - 0.15 -
infinitive and gerund verbs 0.50 0.12 0.38 -
passive tenses 0.83 - 0.17 -
past continuous 0.60 - 0.40 -
past perfect 0.50 - 0.38 0.12
past simple 0.40 - 0.60 -
personal pronouns 0.56 0.11 0.33 -
possessive adjectives 1.00 - - -
prepositions 0.67 - 0.33 -
present continuous 0.50 - 0.50 -
present perfect 1.00 - - -
present simple 1.00 - - -
quantifiers 0.75 0.25 - -
relative clauses 0.64 0.18 0.18 -
WH- question 0.80 - 0.20 -
average 0.75 0.05 0.19 0.01
Table 2
Results of the error analysis on the wrong exercises. CS stands for Common Sense, Acc for Acceptability, Amb for Ambiguity
and P for Plausibility.
B. Prompts
In this section, we present the prompts used in our work. They follow the Llama3 chat template format but, to make the text more readable, we replace the template markers with three placeholders: #SYSTEM, #USER and #ASSISTANT.
B.1. Fine-Tuning prompt
The prompt used to fine-tune the model has the same structure for all the grammar topics. The only varying parts are the name of the grammar topic and the number of required distractors; they are enclosed in curly brackets and change depending on the dataset item. The prompt used is the following.
Write a multiple-choice gap exercise on {grammar_topic} with {n_distractors} distractors.
Listing 1: Fine-tuning prompt.
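Instantiating the template above for one dataset item can be sketched as follows; the function and constant names are ours, chosen for illustration.

```python
# The fine-tuning prompt of Listing 1, with its two varying parts
# as named placeholders.
PROMPT_TEMPLATE = (
    "Write a multiple-choice gap exercise on {grammar_topic} "
    "with {n_distractors} distractors."
)

def build_prompt(grammar_topic: str, n_distractors: int) -> str:
    """Fill the prompt template for a single dataset item."""
    return PROMPT_TEMPLATE.format(
        grammar_topic=grammar_topic, n_distractors=n_distractors
    )

print(build_prompt("comparisons", 3))
# Write a multiple-choice gap exercise on comparisons with 3 distractors.
```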
A training example is created by concatenating the desired JSON representation of the exercise to the prompt. We chose this format because it is easier to use at inference time. An example of training data is the following.
#USER
Write a multiple-choice gap exercise on comparisons with 3 distractors.
#ASSISTANT
{
    "filled_text": "Thanks to high technology, doctors can better assess patients' conditions.",
    "gapped_text": "Thanks to high technology, doctors can ___ assess patients' conditions.",
    "solution": "better",
    "distractors": ["best", "good", "well"]
}
Listing 2: Example from the Fine-Tuning dataset.
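The construction of a training example like Listing 2 can be sketched as below. The #USER/#ASSISTANT markers stand in for the actual Llama3 chat template, as elsewhere in this appendix; the function name is ours.

```python
import json

def make_training_example(grammar_topic, n_distractors, exercise):
    """Concatenate the fine-tuning prompt and the JSON representation
    of the exercise. #USER and #ASSISTANT are readability placeholders
    for the Llama3 chat template markers."""
    prompt = (f"Write a multiple-choice gap exercise on {grammar_topic} "
              f"with {n_distractors} distractors.")
    return f"#USER\n{prompt}\n#ASSISTANT\n{json.dumps(exercise, indent=4)}"

exercise = {
    "filled_text": "Thanks to high technology, doctors can better "
                   "assess patients' conditions.",
    "gapped_text": "Thanks to high technology, doctors can ___ "
                   "assess patients' conditions.",
    "solution": "better",
    "distractors": ["best", "good", "well"],
}
print(make_training_example("comparisons", 3, exercise))
```

At inference time the same prompt is sent alone, and the model is expected to complete it with a JSON object of this shape.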
B.2. Baseline prompt
To test the performance of the baseline Llama3, we use its instruction-tuned version, Llama3-Instruct, which can follow directions given by the user. This model is not able to answer correctly using the prompt described above. Therefore, we construct an alternative one in which all the useful information is given to the model: the structure of the exercise, the role of each component with its constraints, and the desired output format. The resulting prompt is the following.
#SYSTEM
You are an English teacher creating multiple-choice-gap exercises.
#USER
Write one exercise on {grammar_topic}.
It must contain the:
- sentence: the body exercise text that must contain the <gap> tag instead of the solution
- solution: the <gap> that correctly fills the gap
- distractor: a word related to the solution, but different
The distractor must be such that if substituted for the solution, the sentence is wrong.
Do not generate any explanation.
The output must be a JSON object with the following structure:
{"sentence": str, "solution": str, "distractor": list[str]}
Listing 3: Prompt used for the generation of exercises with the base Llama3 model.
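A minimal check of a model response against the constraints stated in this prompt might look like the sketch below. The field names follow the output structure in Listing 3; the `<gap>` marker and the specific validation rules are our assumptions, not the authors' actual pipeline.

```python
import json

def validate_exercise(raw: str) -> bool:
    """Parse a model response and check the structural constraints:
    valid JSON, required fields, exactly one gap marker in the
    sentence, and no distractor identical to the solution (key)."""
    try:
        ex = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not all(k in ex for k in ("sentence", "solution", "distractor")):
        return False
    if ex["sentence"].count("<gap>") != 1:
        return False
    # A distractor equal to the key would make the exercise ambiguous.
    return all(d != ex["solution"] for d in ex["distractor"])

raw = ('{"sentence": "She <gap> to school every day.", '
       '"solution": "goes", "distractor": ["go", "gone", "going"]}')
print(validate_exercise(raw))  # True
```

Note that this catches only structural violations; semantic problems such as incoherent sentences or distractors that also fit the gap still require human review, as the error analysis in Appendix A shows.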
C. Ethical Considerations
This section outlines the ethical considerations of the system we developed.
Bias and Fairness The dataset used in this study was obtained from a publicly available source, ensuring that all data were collected with appropriate consent. To protect personal information, we removed all sensitive data such as phone numbers, email addresses and URLs. Since humans created this data, we assume that proper names and any references to existing entities are invented. Moreover, we assume that exercises expressing preferences, for example about films or books, do not reflect the real preferences of the users, and that the events or situations described in the exercises are not related to existing facts. Finally, since the data were created by professional authors, we assume that any possible bias or stereotype in the dataset is unintended and coincidental.
Accuracy and Reliability The accuracy of the generated exercises is paramount. We employ both automated
validation tools and human expert reviews to ensure the correctness and reliability of the content. Any inaccuracies
identified are promptly rectified. We acknowledge the potential for bias in LLM-generated content; however, the human evaluation highlights a negligible presence of such bias in the generated outputs.
Transparency We strive for transparency by documenting the sources of our training data and explaining the model architecture. All the techniques used to manipulate the data and all the processing steps are described in detail, highlighting the most important aspects.
Educational Impact We assess the impact of LLM-generated exercises on learning outcomes. We aim to enhance
personalized learning while preventing over-reliance on automated systems. The content is designed to be inclusive
and accessible to all students.