<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicolò Donati</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Periani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Di Natale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Savino</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Torroni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bologna</institution>
          ,
          <addr-line>Corso della Repubblica, 136, 47121 Forlì FC</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bologna</institution>
          ,
          <addr-line>Viale del Risorgimento, 2, 40136 Bologna BO</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Zanichelli editore S.p.A.</institution>
          ,
          <addr-line>Via Irnerio 34, 40126 Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>English grammar Multiple-Choice Cloze (MCC) exercises are crucial for improving learners' grammatical proficiency and comprehension skills. However, creating these exercises is labour-intensive and requires expert knowledge. Effective MCC exercises must be contextually relevant and engaging, incorporating distractors (plausible but incorrect alternatives) to balance difficulty and maintain learner motivation. Despite the increasing interest in utilizing large language models (LLMs) in education, their application in generating English grammar MCC exercises is still limited. Previous methods typically impose constraints on LLMs, producing grammatically correct yet uncreative results. This paper explores the potential of LLMs to independently generate diverse and contextually relevant MCC exercises without predefined limitations. We hypothesize that LLMs can craft self-contained sentences that foster learners' communicative competence. Our analysis of existing MCC exercise datasets revealed issues of diversity, completeness, and correctness. Furthermore, we address the lack of a standardized automatic metric for evaluating the quality of generated exercises. Our contributions include developing an LLM-based solution for generating MCC exercises, curating a comprehensive dataset spanning 19 grammar topics, and proposing an automatic metric validated against human expert evaluations. This work aims to advance the automatic generation of English grammar MCC exercises, enhancing both their quality and creativity.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>Distractor Generation</kwd>
        <kwd>Multiple-Choice Cloze</kwd>
        <kwd>Evaluation Metric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>grammar exercises. Therefore, we set out to identify such a metric and validate its alignment with human judgment. In this paper, we present a novel solution utilizing an LLM to generate English grammar MCC exercises. Our contribution also focuses on curating an MCC dataset that spans 19 topics. Lastly, we propose an automatic metric to evaluate the exercises' correctness, and we verify the validity of our contribution through human expert evaluation.</p>
      <sec id="sec-1-1">
<title>Rule-based generation</title>
        <p>A large share of prior works uses rules to create grammar MCC exercises (Sumita et al. [13], Brown et al. [14], Smith et al. [15], Majumder and Saha [16], Lin et al. [17]). They all follow a three-fold process: (1) select sentences from arbitrary sources, (2) insert the blank into the sentence, and (3) generate distractors for the blank. Sentences usually come from corpora or user-submitted passages.</p>
<p>Many solutions restrict gap detection to fixed schemes: Sumita et al. [13] picked out the leftmost single verb, while Lin et al. [17] only selected adjectives as blanks. One of the few exceptions is Goto et al. [18], who proposed a method based on Conditional Random Fields (CRFs) [19]. Methods that extract sentences from arbitrary text suffer from several limitations. First of all, they lack customization options, such as adjusting the subject or difficulty level of the exercise. Additionally, they are limited by the length and quality of the extracted texts, which can negatively impact the system's results. Recently, parts of MCC generation have been executed by neural networks instead of rule-based algorithms. Bitew et al. [20] use a variation of the RoBERTa [21] model to predict the gap positions within the sentence.</p>
        <p>
          Grammar exercises should define the range of abilities to be assessed and avoid the influence of irrelevant factors like past knowledge or cultural background [9]. We followed the best-practice guidelines for creating grammar MCC items defined in [
          <xref ref-type="bibr" rid="ref18">10</xref>
          ] [
          <xref ref-type="bibr" rid="ref17">11</xref>
          ]. According to them, each item consists of three components. • Body: the sentence with a gap in place of the key. • Key: the correct answer. • Distractor: the incorrect answer.
        </p>
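<p>The three-component structure above can be captured in a minimal data model. The sketch below is purely illustrative (the field names and the well-formedness rule are our own, not the paper's implementation):</p>

```python
from dataclasses import dataclass

GAP = "___"  # placeholder for the blank in the body

@dataclass
class MCCItem:
    body: str           # sentence with a gap in place of the key
    key: str            # the correct answer
    distractors: list   # plausible but incorrect answers

    def is_well_formed(self) -> bool:
        # the body must contain exactly one gap, there must be at least
        # one distractor, and no distractor may coincide with the key
        return (
            self.body.count(GAP) == 1
            and len(self.distractors) > 0
            and self.key not in self.distractors
        )

item = MCCItem(
    body=f"She {GAP} to school every day.",
    key="goes",
    distractors=["go", "gone", "going"],
)
```

<p>Such a structural check is a precondition, not a quality guarantee: plausibility and unambiguity of the distractors still require the deeper checks discussed below.</p>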
<p>To decrease ambiguity, Matsumori et al. [22] trained a Masked Language Model for gap score prediction of each candidate sentence. Chomphooyod et al. [23] proposed a system that uses Transformers [24] to generate candidate sentences given a POS sequence, a keyword, and a desired grammar topic.</p>
<p>
          The body plays a central role in designing effective exercises. Learners should be able to infer the key based on the helpful elements present in the body. However, the effectiveness of an exercise depends mainly on the quality of its distractors. Ideally, challenging distractors should be homogeneous, plausible, and unambiguous. Homogeneous distractors share the same syntactic category as the key [12]. Plausible distractors provide a credible alternative to the key. Lastly, unambiguous distractors ensure that none of them could be considered correct if used in place of the key [
          <xref ref-type="bibr" rid="ref18">10</xref>
          ].
        </p>
        <sec id="sec-1-1-1">
          <title>3.3. Metrics</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
<title>3. Related Works</title>
        <p>The generation of MCC exercises has been explored from various perspectives. In this section, we briefly discuss the main related approaches.</p>
        <p>In the literature, the evaluation of MCC exercises is mainly based on judgments expressed by human annotators. Slavuj et al. [25] asked annotators to perform the language tasks, assuming that the presence of incorrect answers would be a sign of ill-formed exercises. Teachers were then asked to provide feedback on any pitfalls they encountered. Malafeev [26] simply attended to suitability for classroom use. Chomphooyod et al. [23] evaluate, for each exercise, different aspects such as grammatical and semantic correctness, relevance with respect to the topic, and acceptability.</p>
<p>Prior works in creating MCC datasets are very limited. To the best of our knowledge, the only one in English was presented by Liu et al. in their work SC-Ques [8]. It comprises real English test items for students developed by teaching professionals. The dataset contains roughly 300k MCC sentence completion exercises, composed of the question body, a varying number of alternative answers, and the key (i.e. the correct alternative). It comprises exercises with both single and multiple blanks. It has various limitations, discussed in Section 5.</p>
        <p>Very few automatic metrics have been proposed to evaluate exercise generation. Bitew et al. [20] rely on span overlap with respect to ground truth to assess the consistency of gap detection. March et al. [27] test the effectiveness of distractors by their selection rate. Since an important criterion for exercise collection is diversity, similarity measures have often been applied to MCC exercises. Metrics like BLEU [28], ROUGE [29], and METEOR [30] have been used even though they were originally designed for different applications.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Approach</title>
<p>To overcome the limitations of existing solutions, we utilized an LLM to generate exercises in a single, constraint-free step. We chose Llama3 [31] due to its acceptable balance between computational cost and performance. To evaluate its effectiveness, we engineered a well-structured prompt (Appendix B.2). However, the results were unsatisfactory: the model exhibited significant difficulties with certain grammar topics and consistently failed to generate effective distractors. Therefore, we decided to fine-tune the model using a well-formatted dataset containing exercises with distractors that meet our criteria. Each dataset example includes four features: the grammar topic, the exercise text, the key, and the distractors. The model is trained to produce the exercise text, key, and distractors when given a specific grammar topic as input. The prompt used during the fine-tuning and an example of input-output text can be found in appendix section B.1.</p>
      <p>To ensure the exclusively grammatical nature of the exercises, distractors are checked using the metrics proposed in Section 3.3. All exercises lacking valid distractors are then discarded.</p>
      <p>Deduplication We deduplicated and removed all similar exercises to increase the quality of our dataset [32]. Exercises are clustered by topic and compared in terms of embeddings through cosine similarity. Using a topic-specific threshold, all elements exceeding the limit are discarded. Lastly, we noticed that SC-Ques [8] had an unbalanced representation of grammar topics. For example, half of the WH-question exercises have "How" as the key. For each topic, a maximum ratio of key presence is established, and superfluous data are discarded.</p>
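<p>The deduplication step can be illustrated as follows. This is a sketch under stated assumptions: the three-dimensional vectors stand in for real sentence embeddings, the threshold value is hypothetical, and the per-topic clustering is omitted for brevity:</p>

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def deduplicate(embedded_exercises, threshold):
    # greedily keep an exercise only if its embedding is not too similar
    # to any already-kept exercise; items above the threshold are discarded
    kept = []
    for text, vec in embedded_exercises:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

exercises = [
    ("She ___ to school.", [1.0, 0.0, 0.1]),
    ("She ___ to work.",   [0.99, 0.01, 0.12]),  # near-duplicate of the first
    ("___ you like tea?",  [0.0, 1.0, 0.0]),
]
print(deduplicate(exercises, threshold=0.95))
```

<p>With a real embedding model, the same greedy pass would run once per topic cluster, each with its own threshold.</p>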
<p>To assess the correctness of the generated items, we devised metrics that evaluate the minimal structural requirements of an exercise through rule-based analysis. These are defined in Section 7. To monitor the results we used Self-BLEU [6], a metric that detects repetitions by checking continuous lexical overlap.</p>
      <sec id="sec-2-1">
<title>Dataset composition</title>
        <p>After pre-processing, the least represented class contained a quarter of the examples present in the most represented one. The only exception was the "WH-questions" class, which was underrepresented. Therefore, we upsampled this class with synthetic exercises using GPT-4 [33]. The dataset is composed of several fields: the filled_text (complete exercise sentence), the gapped_text (sentence with a blank gap), the key (the text removed to create the gap), and the list of distractors.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Dataset Curation</title>
    </sec>
    <sec id="sec-4">
      <title>6. Fine-Tuning</title>
      <sec id="sec-4-1">
<title>Pre-processing</title>
        <p>We developed the fine-tuning dataset based on the data released by [8]. The data underwent three pre-processing steps: cleaning, grammar topic identification, and removal of similar examples.</p>
      </sec>
      <sec id="sec-4-2">
<title>Fine-tuning setup</title>
        <p>We designed the fine-tuning process to generate exercises on specific grammar topics with a fixed number of distractors. The model’s expected response is a JSON-encoded exercise coherent with the dataset structure described in Section 5. We observed that including the filled_text in the output improves overall accuracy and reduces similarity among exercises. An example from the fine-tuning dataset can be found in appendix section B.1. To reduce the computational resources required for fine-tuning, we employed the Quantized Low-Rank Adapters (QLoRA) [34] approach. Our tests on small models revealed that this strategy prevents significant shrinkage of the model’s dictionary during fine-tuning. Consequently, the generated exercises exhibit greater variability, enhancing the model’s creativity.</p>
        <p>Data cleaning First, we got rid of improperly formatted examples and cleaned the text to comply with the tokenizer specifications and limit potential noise. Items with multiple blank spaces or fewer than two distractors were discarded. Next, we filtered out exercise texts containing instructions, non-Latin symbols or letters, emails, phone numbers, and links.</p>
<p>Extraction of the grammar topic The second step involves the assignment of the grammar topic to each exercise thanks to the Pattern Matcher. First, grammar topics are defined in a tailor-made grammar taxonomy with the aid of the spaCy Dependency Matcher. Given a set of sentences, this tool allows one to identify whether each sentence features the described grammar topics, and if so, at what position. The relevant topic is chosen by comparing the overlap between the position of the topic detected by the Pattern Matcher and the key span1.</p>
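<p>The overlap comparison between matcher spans and the key span can be sketched as follows. Spans are token-index intervals, and the matcher output below is a hand-written stand-in for real spaCy Dependency Matcher results, not the authors' code:</p>

```python
def overlap(span_a, span_b):
    # number of token positions shared by two [start, end) spans
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))

def pick_topic(matched_topics, key_span):
    # choose the grammar topic whose detected span overlaps most with
    # the key span; return None if no topic overlaps at all
    best, best_ov = None, 0
    for topic, span in matched_topics:
        ov = overlap(span, key_span)
        if ov > best_ov:
            best, best_ov = topic, ov
    return best

# hypothetical matcher output for "She has been waiting for hours."
matches = [("present perfect", (1, 4)), ("prepositions", (4, 5))]
print(pick_topic(matches, key_span=(2, 4)))  # → "present perfect"
```

<p>In practice the spans would come from the Dependency Matcher patterns, one pattern set per topic in the taxonomy.</p>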
      </sec>
      <sec id="sec-4-3">
        <title>1The key span is the range of positions the key belongs to.</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Evaluation Metrics</title>
      <sec id="sec-5-1">
<title>Overview</title>
        <p>Two metrics are used to track the model’s performance on diverse aspects. First, we introduce a metric that evaluates the minimal structural requirements of an exercise. Secondly, we control for language diversity to gain more interpretability of the results.</p>
        <sec id="sec-5-1-1">
<title>7.2. Language Diversity</title>
          <p>LLMs often experience the so-called repetition problem, where their output includes excessively repeated segments of text, creating an undesirable effect [35]. In the context of the generation of thousands of exercises, duplicates or overly similar sentences are highly likely to occur. In order to assess this phenomenon we decided to rely on continuous lexical overlap by using Self-BLEU [6] on 2-to-5-grams to capture multi-word repetitions.</p>
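<p>As an illustration, the idea behind a Self-BLEU-style diversity check over 2-to-5-grams can be sketched as below. This is a simplified overlap score, not the reference Self-BLEU implementation of [6]: it averages, per sentence, the fraction of its n-grams that also appear elsewhere in the collection:</p>

```python
def ngrams(tokens, n):
    # set of n-grams of a token sequence
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def self_overlap(sentences, n_range=(2, 5)):
    # mean fraction of each sentence's 2-to-5-grams that also occur in
    # at least one other sentence; higher means more repetition
    scores = []
    for i, sent in enumerate(sentences):
        toks = sent.lower().split()
        others = [s.lower().split() for j, s in enumerate(sentences) if j != i]
        hits = total = 0
        for n in range(n_range[0], n_range[1] + 1):
            grams = ngrams(toks, n)
            pool = set().union(*(ngrams(o, n) for o in others)) if others else set()
            hits += len(grams & pool)
            total += len(grams)
        scores.append(hits / total if total else 0.0)
    return sum(scores) / len(scores)

sents = ["she goes to school", "she goes to work", "do you like tea"]
score = self_overlap(sents)
```

<p>A low score indicates that generated exercises share few multi-word sequences, which is the behaviour we want to monitor during generation.</p>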
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>9. Results</title>
      <sec id="sec-6-1">
<title>Structural compliance</title>
        <p>This metric evaluates the structure and well-formedness of the exercise. Decomposing the validation stage into two steps, we design two rule-based components, namely pertinence and homogeneity.</p>
<p>The former oversees that the gap placeholder is located in the intended position and that the key includes the correct grammar form. The second component checks that the distractor fulfils the criterion of homogeneity as described in Section 2. To achieve this, grammar topics have been grouped into two classes.</p>
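<p>A minimal sketch of how such rule-based homogeneity checks might look is given below. The hand-rolled lemma table stands in for a real morphological analyser, and the word lists are illustrative, not the paper's actual resources:</p>

```python
# Toy lemma lookup standing in for a real morphological analyser.
LEMMAS = {"goes": "go", "went": "go", "gone": "go", "go": "go",
          "better": "good", "best": "good", "good": "good"}

# Illustrative per-topic list of admitted words.
ALLOWED = {"quantifiers": {"some", "any", "much", "many", "few"}}

def homogeneous_inflectional(key, distractor):
    # inflectional class: same lemma as the key, but not identical to it
    key_lemma = LEMMAS.get(key)
    return (distractor != key
            and key_lemma is not None
            and LEMMAS.get(distractor) == key_lemma)

def homogeneous_free_morpheme(topic, key, distractor):
    # free-morpheme class: distractor drawn from the topic's admitted list
    return distractor != key and distractor in ALLOWED.get(topic, set())

print(homogeneous_inflectional("goes", "went"))                 # True
print(homogeneous_free_morpheme("quantifiers", "some", "any"))  # True
```

<p>For topics that admit distractors of either class, passing either check would suffice.</p>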
      </sec>
      <sec id="sec-6-2">
<title>Evaluation results</title>
        <p>To evaluate performance, for each grammar topic we generated 50 exercises, setting the number of distractors to 1. We use the sampling decoding strategy with a temperature of 0.7 to balance the creativity and coherence of the output.</p>
<p>Inflectional Distractors of this class must have the same lemma as the key, so as to rule out the influence of lexis and semantics. We also make adjustments to account for circumstances when the key and the distractor are identical, as well as for handling variation of the auxiliary verb.</p>
        <p>Free morphemes Exercises of this group limit acceptable keys and distractors to a narrow range of options. So, we manually compile a list of admitted words for each grammar topic. If the distractor belongs to that list and is not identical to the key, it is deemed homogeneous. Some grammar topics may be built with distractors of either class. If either of the checks is successful, the distractor passes the test of fitness.</p>
        <p>The exercises are categorized according to their grammar topic. For each exercise, we assessed its structural compliance and its similarity to the exercises within the same grammar topic that have been labelled structurally correct, using the metrics described in Section 3.3. The results are then averaged to obtain the accuracy for each grammar topic. In the end, the model performances are computed by averaging the topic scores. The results are reported in Table 1. Overall, the outcomes are satisfactory. The model on average scores a Structural Compliance (SC) equal to 85%, indicating its ability to generate well-formed exercises. It achieves a Self-BLEU similarity of 7%, demonstrating that text repetitions are limited. Looking at the individual SC scores, we observe that the model tends to perform better on free morphemes grammar topics. We suppose this is due to the limited number of possible key/distractor options. Furthermore, we observed that, due to spaCy limitations in properly labelling certain verbs, grammar topics related to verbal tenses are more prone to be misidentified. This limitation causes occasional misjudgment of the exercise’s structural compliance, leading to a negative effect on the topic performance.</p>
        <sec id="sec-6-2-1">
          <title>9.1. Human Evaluation</title>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>2https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct</title>
      </sec>
      <sec id="sec-6-4">
<title>Experiments and human evaluation</title>
        <p>We fine-tuned the Huggingface implementation of Meta-Llama-3-8B-Instruct2. The model was first quantized to 4-bit precision and then fine-tuned using LoRA adapters, with the following configuration: rank equal to 64, alpha 16, and a dropout percentage of 0.1. The adapters have been added on top of all the attention linear layers so as not to significantly degrade performance. The training hyperparameters are: a constant learning rate of 2e-4, max gradient norm of 0.3, and a weight decay equal to 1e-2. The number of epochs was set to 3, using a batch size of 1 and gradient accumulation equal to 16. The training lasted two hours on an NVIDIA RTX A6000.</p>
        <p>To assess classroom suitability, a human evaluation was performed on all 950 exercises by a computational linguist with a background in pedagogy in language teaching. Each generated exercise was evaluated on four criteria: Plausibility, Ambiguity (defined in Section 2), Common Sense, and Acceptability. Common Sense means that the exercise sentence should be coherent with common sense. Acceptability indicates that a sentence does not perpetuate stereotypes or display inappropriate content, such as violence. If any of these criteria is not met, the item is flagged as incorrect. The results presented in Table 1 establish that 79% of the items satisfy all the requirements to be administered to learners. We conducted an error analysis, summarized in Table 2. Common sense was the most frequently observed inaccuracy, although the magnitude of the issue is modest. As expected, ambiguous distractors remain an open matter in the field, especially for tense-based topics. Instead, we can notice that the generation of sentences with bias or trivial exercises is almost absent.</p>
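<p>The LoRA/QLoRA configuration reported in this section can be sketched with the Hugging Face transformers/peft APIs. This is a sketch under stated assumptions, not the authors' code: the target module names for the attention linear layers and the output directory are assumptions for Llama-style models, and model loading and dataset preparation are omitted:</p>

```python
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit quantization of the base model (QLoRA)
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

# LoRA adapters on the attention linear layers
# (target_modules is an assumption for Llama-style architectures)
lora_config = LoraConfig(
    r=64,                 # rank
    lora_alpha=16,        # alpha
    lora_dropout=0.1,     # dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# hyperparameters as reported in the text
training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    max_grad_norm=0.3,
    weight_decay=1e-2,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    output_dir="out",  # hypothetical path
)
```

<p>With these objects, the quantized base model would be wrapped with the LoRA config and passed to a trainer together with the fine-tuning dataset.</p>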
<p>Furthermore, we asked the annotator to evaluate the structural compliance of the exercises (SC). Then we computed the Precision, Recall and F1 scores using the annotator judgements as gold labels. The results show that our automatic structural compliance metric (SC) has an F1 score of 95% w.r.t. the human evaluation, with a Precision of 98% and a Recall of 91%. This highlights its effectiveness in predicting the overall structural quality of the exercises.</p>
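<p>The agreement computation above reduces to standard precision/recall/F1 over binary labels. The toy labels below are illustrative, not the study's data:</p>

```python
def prf1(gold, pred):
    # gold/pred: lists of 0/1 labels (1 = structurally compliant)
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [1, 1, 1, 0, 0, 1]  # annotator judgements
pred = [1, 1, 0, 0, 1, 1]  # automatic SC metric
print(prf1(gold, pred))    # → (0.75, 0.75, 0.75)
```
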
        <p>One limitation was the presence of many similar
exercises in the SC-Dataset [8] we used to build our resource
from. After removing similar exercises, only 30% of the
original data was left. Another limitation is the
sensitivity of the evaluation metric to the Pattern Matcher,
concerning the evaluation of the key and the distractors,
which caused some false negatives.</p>
<p>The curated dataset and model will be made available to the community3.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
<sec id="sec-7-1">
        <title>10. Conclusion</title>
        <p>We investigated the use of an LLM to generate English grammar MCC exercises. To that end, we curated a new English grammar MCC exercise dataset. We devised metrics for the automatic evaluation of such exercises. We evaluated our work using said metrics and a human study involving domain experts. Our findings demonstrate the model’s ability to generate exercises suitable for educational use. The generated exercises exhibit a low similarity score, indicating that our method can effectively produce original exercises: a significant advantage over prior art, which mostly relies on rule-based methods. We observe that human evaluation correlates positively with the proposed structural compliance metric, corroborating our metric as an indicator of exercise structure correctness and alignment with human expert preferences. We found that a key factor of our method was the availability of high-quality fine-tuning data.</p>
        <p>We wish to thank Zanichelli editore for their support, which enabled data up-sampling, human evaluation, and experimentation with their infrastructure. We also thank Eleonora Cupin for her valuable contribution to the human evaluation of the dataset.</p>
      </sec>
      <sec id="sec-7-2">
        <title>3https://github.com/ZanichelliEditore/</title>
        <p>english-grammar-multiple-choice-generation</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>A. Error analysis</title>
<p>Thanks to the human evaluation we conducted a small error analysis on the errors made by the model. By analyzing the exercises that the annotator marked as incorrect, we found that the major issue is the coherence of the exercise sentence. More precisely, 75% of the wrong exercises have a meaningless or absurd exercise sentence. This behaviour is directly related to the hallucinations suffered by LLMs [36]. The second prevailing error is the ambiguity between the key and the distractors. The model does not possess a deep understanding of what a distractor is; in fact, some generated distractors are interchangeable with the key.</p>
      <p>Despite these limitations, the model is very effective in producing exercises that are not trivial (plausibility error rate at 1%) and negligibly affected by bias and stereotypes.</p>
<p>Per-topic results table: articles, comparison adjectives, conditional statements, future simple, modal verbs, infinitive and gerund verbs, passive tenses, past continuous, past perfect, past simple, personal pronouns, possessive adjectives, prepositions, present continuous, present perfect, present simple, quantifiers, relative clauses, WH-questions; average.</p>
    </sec>
    <sec id="sec-9">
      <title>B. Prompts</title>
<p>In this section, the prompts used in our work are presented. We utilize the Llama3 chat template format, but to make the text more readable we use three placeholders: #SYSTEM, #USER and #ASSISTANT.</p>
      <sec id="sec-9-1">
        <title>B.1. Fine-Tuning prompt</title>
<p>The prompt used to fine-tune the model has the same structure for all the grammar topics. The only varying parts are the name of the grammar topic and the number of distractors required. These parts are enclosed in curly brackets and change depending on the dataset item. The prompt used is the following.</p>
<p>Write a multiple-choice gap exercise on {grammar_topic} with {n_distractors} distractors.</p>
<sec id="sec-9-1-1">
          <title>Listing 1: Fine-tuning prompt.</title>
          <p>A training example is created by concatenating to the prompt the desired JSON representation of the exercise. We decided to use this format because it is easier to use at inference time. An example of training data is the following.</p>
          <p># USER
Write a multiple-choice gap exercise on comparisons with 3 distractors.
# ASSISTANT
{ "filled_text": "Thanks to high technology, doctors can better assess patients’ conditions.", "gapped_text": "Thanks to high technology, doctors can ___ assess patients’ conditions.", "solution": "better", "distractors": ["best", "good", "well"] }</p>
        </sec>
        <sec id="sec-9-1-2">
          <title>Listing 2: Example from the Fine-Tuning dataset.</title>
        </sec>
      </sec>
      <sec id="sec-9-2">
        <title>B.2. Baseline prompt</title>
<p>To test the performance of the baseline Llama3 we utilize its instruction-tuned version, Llama3-Instruct, which can follow directions given by the user. This model is not able to answer correctly using the prompt described above. Therefore, we construct an alternative one in which all the useful information is given to the model. We include the structure of the exercise, the roles of each component with their constraints, and the desired format of the output. The results are the following.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>C. Ethical Considerations</title>
      <p>This section outlines the ethical considerations of the system we developed.</p>
<p>Bias and Fairness The dataset used in this study is obtained from a publicly available source, ensuring that all data was collected with appropriate consent. To protect personal information, we removed all sensitive data such as phone numbers, email addresses and URLs. Since humans created this data, we assume that proper names or any references to existing entities are invented. Moreover, we assume that items mentioning preferences, such as films or books, do not reflect the real preferences of users. We suppose that events or situations described in the exercises are not related to existing facts. Finally, since the data have been created by professional creators, we assume that any possible bias or stereotype in the dataset is unintended and coincidental.</p>
      <p>Accuracy and Reliability The accuracy of the generated exercises is paramount. We employ both automated
validation tools and human expert reviews to ensure the correctness and reliability of the content. Any inaccuracies
identified are promptly rectified. We acknowledge the potential for bias in LLM-generated content. However, the
human evaluation highlights a negligible presence in the generated outputs.</p>
      <p>Transparency We strive for transparency by documenting the sources of our training data and explaining the
model architecture. All the techniques used to manipulate the data and the steps done are described step by step
highlighting all the important aspects.</p>
      <p>Educational Impact We assess the impact of LLM-generated exercises on learning outcomes. We aim to enhance
personalized learning while preventing over-reliance on automated systems. The content is designed to be inclusive
and accessible to all students.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <year>2001</year>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          . the Association for Computational Linguistics, As[20]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Bitew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deleu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Doğruöz</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Develder, sociation for Computational Linguistics</article-title>
          , Philadel-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Demeester</surname>
          </string-name>
          ,
          <article-title>Learning from partially annotated phia</article-title>
          , Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>data: Example-aware creation of gap-filling exercises for language learning</article-title>
          , in: E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, et al. (Eds.),
          <source>Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA@ACL 2023)</source>
          , Association for Computational Linguistics, Toronto, Canada, 13 July
          <year>2023</year>
          , pp.
          <fpage>598</fpage>
          -
          <lpage>609</lpage>
          . URL: https://doi.org/10.18653/v1/2023.bea-1.51. doi:10.18653/v1/2023.bea-1.51.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <label>[21]</label>
        <mixed-citation>
          <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ott</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Goyal</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Du</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Joshi</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Levy</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lewis</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Stoyanov</surname></string-name>,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>,
          <year>2019</year>. URL: https://arxiv.org/abs/1907.11692. arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <label>[22]</label>
        <mixed-citation>
          <string-name><given-names>S.</given-names> <surname>Matsumori</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Okuoka</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Shibata</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Inoue</surname></string-name>, et al.,
          <article-title>Automatic open cloze question generation using a masked language model</article-title>,
          <source>IEEE Access</source>
          <volume>11</volume>
          (<year>2023</year>)
          <fpage>9835</fpage>
          -
          <lpage>9850</lpage>
          . URL: http://dx.doi.org/10.1109/ACCESS.2023.3239005. doi:10.1109/ACCESS.2023.3239005.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <label>[23]</label>
        <mixed-citation>
          <string-name><given-names>P.</given-names> <surname>Chomphooyod</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Suchato</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Tuaycharoen</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Punyabukkana</surname></string-name>,
          <article-title>English grammar multiple-choice question generation using text-to-text transfer transformer</article-title>,
          <source>Comput. Educ. Artif. Intell.</source>
          <volume>5</volume>
          (<year>2023</year>) 100158. URL: https://doi.org/10.1016/j.caeai.2023.100158. doi:10.1016/J.CAEAI.2023.100158.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <label>[24]</label>
        <mixed-citation>
          <string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name>,
          <string-name><given-names>A. N.</given-names> <surname>Gomez</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Kaiser</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>,
          <article-title>Attention is all you need</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/1706.03762. arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <label>[25]</label>
        <mixed-citation>
          <string-name><given-names>V.</given-names> <surname>Slavuj</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Nacinovic Prskalo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Brkic Bakaric</surname></string-name>,
          <article-title>Automatic generation of language exercises based on a universal methodology: An analysis of possibilities</article-title>,
          <source>Bulletin of the Transilvania University of Brasov, Series IV: Philology and Cultural Studies</source>
          <volume>14(63)</volume>
          (<year>2022</year>)
          <fpage>29</fpage>
          -
          <lpage>48</lpage>
          . doi:10.31926/but.pcs.2021.63.14.2.3.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <label>[26]</label>
        <mixed-citation>
          <string-name><given-names>A.</given-names> <surname>Malafeev</surname></string-name>,
          <article-title>Language exercise generation</article-title>,
          <source>International Journal of Conceptual Structures and Smart Applications</source>
          <volume>2</volume>
          (<year>2014</year>)
          <fpage>20</fpage>
          -
          <lpage>35</lpage>
          . doi:10.4018/IJCSSA.2014070102.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <label>[27]</label>
        <mixed-citation>
          <string-name><given-names>D.</given-names> <surname>Perrett</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>March</surname></string-name>,
          <article-title>An evidence-based approach […] language tests</article-title>,
          <year>2019</year>. doi:10.13140/RG.2.2.22779.16165.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <label>[28]</label>
        <mixed-citation>
          <string-name><given-names>K.</given-names> <surname>Papineni</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Roukos</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Ward</surname></string-name>,
          <string-name><given-names>W.-J.</given-names> <surname>Zhu</surname></string-name>,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>,
          in: <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>
          , Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <label>[29]</label>
        <mixed-citation>
          <string-name><given-names>C.-Y.</given-names> <surname>Lin</surname></string-name>,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>,
          in: <source>Text Summarization Branches Out</source>
          , Association for Computational Linguistics, Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <label>[30]</label>
        <mixed-citation>
          <string-name><given-names>S.</given-names> <surname>Banerjee</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lavie</surname></string-name>,
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>,
          in: J. Goldstein, A. Lavie, C.-Y. Lin, C. Voss (Eds.),
          <source>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</source>
          , Association for Computational Linguistics, Ann Arbor, Michigan,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: https://aclanthology.org/W05-0909.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <label>[31]</label>
        <mixed-citation>
          <collab>Meta</collab>,
          <article-title>Introducing Meta Llama 3: The most capable openly available LLM to date</article-title>,
          https://ai.meta.com/blog/meta-llama-3/,
          <year>April 2024</year>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <label>[32]</label>
        <mixed-citation>
          <string-name><given-names>K.</given-names> <surname>Tirumala</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Simig</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Aghajanyan</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Morcos</surname></string-name>,
          <article-title>D4: Improving LLM pretraining via document de-duplication and diversification</article-title>,
          in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          , Curran Associates, Inc.,
          <year>2023</year>
          , pp.
          <fpage>53983</fpage>
          -
          <lpage>53995</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/a8f8cbd7f7a5fb2c837e578c75e5b615-Paper-Datasets_and_Benchmarks.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <label>[33]</label>
        <mixed-citation>
          <collab>OpenAI</collab> et al.,
          <article-title>GPT-4 technical report</article-title>,
          <year>2024</year>. URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <label>[34]</label>
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Dettmers</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Pagnoni</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Holtzman</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
          <article-title>QLoRA: Efficient finetuning of quantized LLMs</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/2305.14314. arXiv:2305.14314.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <label>[35]</label>
        <mixed-citation>
          <string-name><given-names>Z.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Lam</surname></string-name>,
          <string-name><given-names>A. M.-C.</given-names> <surname>So</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Shi</surname></string-name>,
          <article-title>A theoretical analysis of the repetition problem in text generation</article-title>,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>35</volume>
          (<year>2021</year>)
          <fpage>12848</fpage>
          -
          <lpage>12856</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/17520. doi:10.1609/aaai.v35i14.17520.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <label>[36]</label>
        <mixed-citation>
          <string-name><given-names>Z.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Jain</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kankanhalli</surname></string-name>,
          <article-title>Hallucination is inevitable: An innate limitation of large language models</article-title>,
          <year>2024</year>. URL: https://arxiv.org/abs/2401.11817. arXiv:2401.11817.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>