<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nicolò</forename><surname>Donati</surname></persName>
							<email>n.donati@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bologna</orgName>
								<address>
									<addrLine>Viale del Risorgimento, 2</addrLine>
									<postCode>40136, BO</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Zanichelli editore S.p.A</orgName>
								<address>
									<addrLine>Via Irnerio 34</addrLine>
									<postCode>40126</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matteo</forename><surname>Periani</surname></persName>
							<email>matteo.periani2@studio.unibo.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bologna</orgName>
								<address>
									<addrLine>Viale del Risorgimento, 2</addrLine>
									<postCode>40136, BO</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><forename type="middle">Di</forename><surname>Natale</surname></persName>
							<email>paolo.dinatale3@studio.unibo.it</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Bologna</orgName>
								<address>
									<addrLine>Corso della Repubblica</addrLine>
									<postCode>136, 47121</postCode>
									<settlement>Forlì FC</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Savino</surname></persName>
							<email>gsavino@zanichelli.it</email>
							<affiliation key="aff1">
								<orgName type="institution">Zanichelli editore S.p.A</orgName>
								<address>
									<addrLine>Via Irnerio 34</addrLine>
									<postCode>40126</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Torroni</surname></persName>
							<email>p.torroni@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bologna</orgName>
								<address>
									<addrLine>Viale del Risorgimento, 2</addrLine>
									<postCode>40136, BO</postCode>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04-06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">26DCF52A68C81D686A6ACC6CF3B831E4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Distractor Generation</term>
					<term>Multiple-Choice Cloze</term>
					<term>Evaluation Metric</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>English grammar Multiple-Choice Cloze (MCC) exercises are crucial for improving learners' grammatical proficiency and comprehension skills. However, creating these exercises is labour-intensive and requires expert knowledge. Effective MCC exercises must be contextually relevant and engaging, incorporating distractors-plausible but incorrect alternatives-to balance difficulty and maintain learner motivation. Despite the increasing interest in utilizing large language models (LLMs) in education, their application in generating English grammar MCC exercises is still limited. Previous methods typically impose constraints on LLMs, producing grammatically correct yet uncreative results. This paper explores the potential of LLMs to independently generate diverse and contextually relevant MCC exercises without predefined limitations. We hypothesize that LLMs can craft self-contained sentences that foster learner's communicative competence. Our analysis of existing MCC exercise datasets revealed issues of diversity, completeness, and correctness. Furthermore, we address the lack of a standardized automatic metric for evaluating the quality of generated exercises. Our contributions include developing an LLM-based solution for generating MCC exercises, curating a comprehensive dataset spanning 19 grammar topics, and proposing an automatic metric validated against human expert evaluations. This work aims to advance the automatic generation of English grammar MCC exercises, enhancing both their quality and creativity.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>English grammar Multiple-Choice Cloze (MCC) exercises are widely used tools for enhancing a learner's grammatical proficiency and comprehension skills. They consist of fill-the-gap questions where the gap must be filled by choosing one correct solution (key) among several options. The incorrect alternatives are called distractors. Devising these exercises is a labour-intensive process requiring expert knowledge in language teaching and content creation. The exercises must be contextually relevant to help learners understand how rules apply in real-life situations. This requires crafting sentences and scenarios that are both engaging and educational. Learners have different levels of proficiency, from beginner to advanced, so exercises must strike the right balance of difficulty: learners should be neither bored nor frustrated, which is crucial for maintaining their motivation and progress. In MCC exercises this is done by choosing distractors that are incorrect but plausible, thus keeping the exercise challenging for the learner. Studies in Communicative Language Teaching demonstrate that learners must possess knowledge of grammatical structures and the ability to compose syntactically well-formed propositions, and must also acquire the ability to employ grammatical forms in discourse <ref type="bibr" target="#b0">[1]</ref> <ref type="bibr" target="#b1">[2]</ref>.</p><p>Recently, there has been a growing interest in applying LLMs in education <ref type="bibr" target="#b2">[3]</ref>. However, the adoption of LLMs for English grammar MCC exercise generation is still limited. Some proposals focus on testing vocabulary <ref type="bibr" target="#b3">[4]</ref> or use LLMs by constraining their generation capability, for example using fixed part-of-speech sequences <ref type="bibr" target="#b4">[5]</ref>. 
Although the outputs of these models are grammatically correct, they typically lack creativity <ref type="bibr" target="#b6">[6]</ref>.</p><p>In this work, we investigate the potential of LLMs in automatic exercise generation without hampering their creativity. Our working hypothesis is that LLMs can generate self-contained sentences, recreating situational contexts that elicit the communicative competence of the learner <ref type="bibr" target="#b7">[7]</ref>. Our main objective is to understand to what extent LLMs can generate accurate grammar exercises without providing predefined constraints or POS sequences. To pursue this objective, we analyzed the available English grammar MCC exercise dataset <ref type="bibr" target="#b8">[8]</ref>. We observed that it has limited diversity, some topics are underrepresented, and it often contains mistakes. The existing literature does not offer a single agreed-upon automatic metric for evaluating the quality of generated grammar exercises. Therefore, we set out to identify such a metric and validate its alignment with human judgment. In this paper, we present a novel solution utilizing an LLM to generate English grammar MCC exercises. Our contribution also focuses on curating an MCC dataset that spans 19 topics. Lastly, we propose an automatic metric to evaluate exercise correctness, and we verify the validity of our contribution through human expert evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Task description</head><p>Grammar exercises should define the range of abilities to be assessed and avoid the influence of irrelevant factors like past knowledge or cultural background <ref type="bibr" target="#b9">[9]</ref>. We followed the best-practice guidelines for creating grammar MCC items defined in <ref type="bibr" target="#b10">[10]</ref> <ref type="bibr" target="#b11">[11]</ref>. According to them, each item consists of three components.</p><p>• Body: the sentence with a gap in place of the key.</p><p>• Key: the correct answer.</p><p>• Distractors: the incorrect answers.</p><p>The body plays a central role in designing effective exercises. Learners should be able to infer the key based on the helpful elements present in the body. However, the effectiveness of an exercise depends mainly on the quality of its distractors. Ideally, challenging distractors should be homogeneous, plausible, and unambiguous. Homogeneous distractors share the same syntactic category as the key <ref type="bibr" target="#b12">[12]</ref>. Plausible distractors provide a credible alternative to the key. Lastly, unambiguous distractors ensure that none of them could be considered correct if used in place of the key <ref type="bibr" target="#b10">[10]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related Works</head><p>The generation of MCC exercises has been explored from various perspectives. In this section, we will briefly discuss the main related approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">MCC Dataset</head><p>Prior work on creating MCC datasets is very limited. To the best of our knowledge, the only one in English was presented by Liu et al. in their work SC-Ques <ref type="bibr" target="#b8">[8]</ref>. It comprises real English test items for students developed by teaching professionals. The dataset contains roughly 300k MCC sentence completion exercises, composed of the question body, a varying number of alternative answers, and the key (i.e. the correct alternative). It includes exercises with both single and multiple blanks. It has various limitations, discussed in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Grammar MCC Exercise Generation</head><p>A large share of prior work uses rules to create grammar MCC exercises (Sumita et al. <ref type="bibr" target="#b13">[13]</ref>, Brown et al. <ref type="bibr" target="#b14">[14]</ref>, Smith et al. <ref type="bibr" target="#b15">[15]</ref>, Majumder and Saha <ref type="bibr" target="#b16">[16]</ref>, Lin et al. <ref type="bibr" target="#b17">[17]</ref>). They all follow a three-fold process: (1) select sentences from arbitrary sources, (2) insert the blank into the sentence, and (3) generate distractors for the blank. Sentences usually come from corpora or user-submitted passages. Many solutions restrict gap detection to fixed schemes: Sumita et al. <ref type="bibr" target="#b13">[13]</ref> picked out the leftmost single verb, while Lin et al. <ref type="bibr" target="#b17">[17]</ref> only selected adjectives as blanks. One of the few exceptions is Goto et al. <ref type="bibr" target="#b18">[18]</ref>, who proposed a method based on Conditional Random Fields (CRFs) <ref type="bibr" target="#b19">[19]</ref>. Methods that extract sentences from arbitrary text suffer from several limitations. First of all, they lack customization options, such as adjusting for the subject or difficulty level of the exercise. Additionally, they are limited by the length and quality of the extracted texts, which can negatively impact the system's results.</p><p>Recently, parts of MCC generation have been executed by neural networks instead of rule-based algorithms. Bitew et al. <ref type="bibr" target="#b20">[20]</ref> use a variation of the RoBERTa <ref type="bibr" target="#b21">[21]</ref> model to predict the gap positions within the sentence. To decrease ambiguity, Matsumori et al. <ref type="bibr" target="#b22">[22]</ref> trained a Masked Language Model for gap score prediction of each candidate sentence. Chomphooyod et al. 
<ref type="bibr" target="#b23">[23]</ref> proposed a system that uses Transformers <ref type="bibr" target="#b24">[24]</ref> to generate candidate sentences given a POS sequence, a keyword and a desired grammar topic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Metrics</head><p>In the literature, the evaluation of MCC exercises is mainly based on judgments expressed by human annotators. Slavuj et al. <ref type="bibr" target="#b25">[25]</ref> asked annotators to perform the language tasks, assuming that the presence of incorrect answers would be a sign of ill-formed exercises. Teachers were then asked to provide feedback on any pitfalls they encountered. Malafeev <ref type="bibr" target="#b26">[26]</ref> simply attended to suitability for classroom use. Chomphooyod et al. <ref type="bibr" target="#b23">[23]</ref> evaluate different aspects of each exercise, such as grammatical and semantic correctness, relevance with respect to the topic, and acceptability.</p><p>Very few automatic metrics have been proposed to evaluate exercise generation. Bitew et al. <ref type="bibr" target="#b20">[20]</ref> rely on span overlap with respect to ground truth to assess the consistency of gap detection. March et al. <ref type="bibr" target="#b27">[27]</ref> test the effectiveness of distractors by their selection rate.</p><p>Since an important criterion for exercise collection is diversity, similarity measures have often been applied to MCC exercises. Metrics like BLEU <ref type="bibr" target="#b28">[28]</ref>, ROUGE <ref type="bibr" target="#b29">[29]</ref>, and METEOR <ref type="bibr" target="#b30">[30]</ref> have been used, even though they were originally designed for different applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Approach</head><p>To overcome the limitations of existing solutions, we utilized an LLM to generate exercises in a single, constraint-free step. We chose Llama3 <ref type="bibr" target="#b31">[31]</ref> due to its acceptable balance between computational cost and performance. To evaluate its effectiveness, we engineered a well-structured prompt (Appendix B.2). However, the results were unsatisfactory. The model exhibited significant difficulties with certain grammar topics and consistently failed to generate effective distractors. Therefore, we decided to fine-tune the model using a well-formatted dataset containing exercises with distractors that meet our criteria. Each dataset example includes four features: the grammar topic, the exercise text, the key, and the distractors. The model is trained to produce the exercise text, key, and distractors when given a specific grammar topic as input. The prompt used during fine-tuning and an example of input-output text can be found in Appendix B.1.</p><p>To assess the correctness of the generated items, we devised metrics that evaluate the minimal structural requirements of an exercise through rule-based analysis. These are defined in Section 7. To monitor the results we used Self-BLEU <ref type="bibr" target="#b6">[6]</ref>, a metric that detects repetitions by checking continuous lexical overlap.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Dataset Curation</head><p>We developed the fine-tuning dataset based on the data released by <ref type="bibr" target="#b8">[8]</ref>. The data underwent three pre-processing steps: cleaning, grammar topic identification, and removal of similar examples.</p><p>Data cleaning First, we removed improperly formatted examples and cleaned the text to comply with the tokenizer specifications and limit potential noise. Items with multiple blank spaces or fewer than two distractors were discarded. Next, we filtered out exercise texts containing instructions, non-Latin symbols or letters, emails, phone numbers, and links.</p></div>
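The cleaning step above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the `___` gap marker, the `keep_item` helper, and the regular expressions are our assumptions, and the filtering of instruction-like texts is omitted.

```python
import re

GAP = "___"  # assumed blank marker; the real dataset may use a different one
NOISE_PATTERNS = [
    re.compile(r"[^\x00-\x7F]"),                 # non-Latin / non-ASCII symbols
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # e-mail addresses
    re.compile(r"https?://\S+|www\.\S+"),        # links
    re.compile(r"\+?\d[\d\s\-()]{7,}\d"),        # phone-number-like digit runs
]

def keep_item(text: str, distractors: list[str]) -> bool:
    """Return True if the exercise passes the cleaning filters."""
    if text.count(GAP) != 1:       # exactly one blank allowed
        return False
    if len(distractors) < 2:       # at least two distractors required
        return False
    return not any(p.search(text) for p in NOISE_PATTERNS)
```

A clean single-gap item with two distractors is kept, while items containing links, multiple gaps, or a single distractor are dropped.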
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Extraction of the grammar topic</head><p>The second step involves the assignment of a grammar topic to each exercise by means of the Pattern Matcher. First, grammar topics are defined in a tailor-made grammar taxonomy with the aid of the spaCy Dependency Matcher. Given a set of sentences, this tool allows one to identify whether each sentence features the described grammar topics and, if so, at what position. The relevant topic is chosen by comparing the overlap between the position of the topic detected by the Pattern Matcher and the key span, i.e. the range of positions the key belongs to. To ensure the exclusively grammatical nature of the exercises, distractors are checked using the metrics proposed in Section 7. All exercises lacking valid distractors are then discarded.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Deduplication</head><p>To increase the quality of our dataset, we deduplicated it by removing all similar exercises <ref type="bibr" target="#b32">[32]</ref>. Exercises are clustered by topic and compared in terms of embeddings through cosine similarity. Using a threshold 𝑇𝑝, where 𝑝 denotes the topic, all elements exceeding the limit are discarded. Lastly, we noticed that SC-Ques <ref type="bibr" target="#b8">[8]</ref> had an unbalanced representation of grammar topics. For example, half of the WH-question exercises have "How" as the key. For each topic, a maximum ratio of key presence is established, and superfluous data are discarded.</p><p>After pre-processing, the least represented class contained a quarter of the examples present in the most represented one. The only exception was the "WH-questions" class, which was underrepresented. Therefore, we upsampled the class with synthetic exercises generated using GPT-4 <ref type="bibr" target="#b33">[33]</ref>. The dataset is composed of several fields: the filled_text (complete exercise sentence), the gapped_text (sentence with a blank gap), the key (the text removed to create the gap), and the list of distractors.</p></div>
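A minimal sketch of the similarity-based deduplication for one topic cluster, assuming sentence embeddings have already been computed; the greedy keep-first strategy is our assumption, not necessarily the authors' exact procedure:

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float) -> list[int]:
    """Return indices of kept exercises within one topic cluster.

    An exercise is kept only if its cosine similarity to every previously
    kept exercise stays below `threshold` (the paper's per-topic T_p).
    """
    # normalize rows so the dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(unit)):
        if all(unit[i] @ unit[j] < threshold for j in kept):
            kept.append(i)
    return kept
```

With a threshold of 0.9, two near-identical embeddings collapse to a single kept exercise while a dissimilar one survives.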
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Fine-Tuning</head><p>We designed the fine-tuning process to generate exercises on specific grammar topics with a fixed number of distractors. The model's expected response is a JSON-encoded exercise consistent with the dataset structure described in Section 5. We observed that including the filled_text in the output improves overall accuracy and reduces similarity among exercises. An example from the fine-tuning dataset can be found in Appendix B.1. To reduce the computational resources required for fine-tuning, we employed the Quantized Low-Rank Adapters (QLoRA) <ref type="bibr" target="#b34">[34]</ref> approach. Our tests on small models revealed that this strategy prevents significant shrinkage of the model's dictionary during fine-tuning. Consequently, the generated exercises exhibit greater variability, enhancing the model's creativity.</p></div>
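The construction of one training example (the instruction prompt of Appendix B.1 concatenated with the JSON-encoded target, using the dataset fields of Section 5) can be sketched as follows; the exercise content and the `build_example` helper are illustrative:

```python
import json

def build_example(topic: str, n_distractors: int, exercise: dict) -> str:
    """Concatenate the instruction prompt with the JSON-encoded target."""
    prompt = (f"Write a multiple-choice gap exercise on {topic} "
              f"with {n_distractors} distractors.")
    return prompt + "\n" + json.dumps(exercise)

# Hypothetical exercise, following the filled_text / gapped_text /
# key / distractors schema described in Section 5.
example = build_example("comparisons", 3, {
    "filled_text": "My house is bigger than yours.",
    "gapped_text": "My house is ___ than yours.",
    "key": "bigger",
    "distractors": ["biggest", "more big", "big"],
})
```

At inference time the model receives only the prompt line and must emit the JSON part, which is then parsed back into the exercise fields.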
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Evaluation Metrics</head><p>Two metrics are used to track the model's performance on different aspects. First, we introduce a metric that evaluates the minimal structural requirements of an exercise. Secondly, we control for language diversity to make the results more interpretable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Structural Compliance</head><p>This metric evaluates the structure and well-formedness of the exercise. Decomposing the validation stage into two steps, we design two rule-based components, namely pertinence and homogeneity.</p><p>The former checks that the gap placeholder is located in the intended position and that the key includes the correct grammar form. The second component checks that the distractors fulfil the criterion of homogeneity as described in Section 2. To achieve this, grammar topics have been grouped into two classes.</p><p>Inflectional Distractors must have the same lemma as the key, so as to rule out the influence of lexis and semantics. We also make adjustments to account for circumstances where the key and the distractor are identical, as well as for handling variations of the auxiliary verb.</p><p>Free morphemes Exercises of this group limit acceptable keys and distractors to a narrow range of options. We therefore manually compile a list of admitted words for each grammar topic. If the distractor belongs to that list and is not identical to the key, it is deemed homogeneous. Some grammar topics may be built with distractors of either class. If either of the checks succeeds, the distractor passes the fitness test.</p></div>
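The homogeneity check can be sketched as below. The allowed-word lists and the `lemma` callback (which would be backed by spaCy in practice) are illustrative assumptions; the paper's auxiliary-verb adjustments are omitted.

```python
# Illustrative free-morpheme lists, not the authors' own.
ALLOWED = {
    "articles": {"a", "an", "the"},
    "personal pronouns": {"I", "you", "he", "she", "it", "we", "they"},
}

def homogeneous(topic, key, distractor, lemma):
    """Check the homogeneity criterion of Section 2 for one distractor.

    `lemma` maps a word to its lemma (e.g. "went" -> "go"); in practice
    this would come from a spaCy pipeline.
    """
    if distractor == key:
        return False  # identical options would make the item ambiguous
    # Inflectional class: same lemma as the key.
    inflectional_ok = lemma(distractor) == lemma(key)
    # Free-morpheme class: member of the topic's admitted-word list.
    free_ok = distractor in ALLOWED.get(topic, set())
    return inflectional_ok or free_ok
```

So "gone" is a homogeneous distractor for the key "went" (same lemma), "the" is homogeneous for "a" (both admitted articles), while an unrelated adverb fails both checks.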
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Language Diversity</head><p>LLMs often experience the so-called repetition problem, where their output includes excessively repeated segments of text, creating an undesirable effect <ref type="bibr" target="#b35">[35]</ref>. In the context of generating thousands of exercises, duplicates or overly similar sentences are highly likely to occur. To assess this phenomenon, we rely on continuous lexical overlap, applying Self-BLEU <ref type="bibr" target="#b6">[6]</ref> to 2-to-5-grams to capture multi-word repetitions.</p></div>
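A simplified Self-BLEU over 2-to-5-grams can be implemented with clipped n-gram precision, scoring each sentence against all the others as references. This sketch omits BLEU's brevity penalty and geometric averaging, so absolute values differ from standard implementations; it only illustrates the idea of measuring continuous lexical overlap within a generated set.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def self_bleu(sentences: list[list[str]], n_range=(2, 5)) -> float:
    """Average clipped n-gram precision of each sentence vs. the rest."""
    scores = []
    for i, hyp in enumerate(sentences):
        refs = [s for j, s in enumerate(sentences) if j != i]
        precs = []
        for n in range(n_range[0], n_range[1] + 1):
            hyp_ngrams = ngrams(hyp, n)
            if not hyp_ngrams:
                continue
            # clip each hypothesis n-gram count by its max count in any reference
            max_ref = Counter()
            for r in refs:
                for g, c in ngrams(r, n).items():
                    max_ref[g] = max(max_ref[g], c)
            clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
            precs.append(clipped / sum(hyp_ngrams.values()))
        if precs:
            scores.append(sum(precs) / len(precs))
    return sum(scores) / len(scores) if scores else 0.0
```

Identical sentences score 1.0 and fully disjoint sentences score 0.0, so a low average (like the 7% reported in Section 9) indicates limited repetition.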
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Experiments</head><p>We fine-tuned the Huggingface implementation of Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). The model was first quantized to 4-bit precision and then fine-tuned using LoRA adapters, with the following configuration: rank equal to 64, alpha 16, and a dropout percentage of 0.1. The adapters were added on top of all the attention linear layers so as not to significantly degrade performance. The training hyperparameters are: a constant learning rate of 2e−4, a max gradient norm of 0.3, and a weight decay equal to 1e−2. The number of epochs was set to 3, using a batch size of 1 and gradient accumulation equal to 16. Training lasted two hours on an NVIDIA RTX A6000.</p></div>
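The reported hyperparameters can be collected as plain configuration; the commented wiring into `peft`/`transformers` below is a sketch that requires GPU and model access, and the `target_modules` list is our assumption about "all the attention linear layers":

```python
# Hyperparameters reported in Section 8.
LORA = dict(r=64, lora_alpha=16, lora_dropout=0.1)
TRAIN = dict(
    learning_rate=2e-4,
    max_grad_norm=0.3,
    weight_decay=1e-2,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size of 16
    lr_scheduler_type="constant",
)

# Sketch of wiring these into peft/transformers:
#
#   from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
#                             TrainingArguments)
#   from peft import LoraConfig, get_peft_model
#
#   model = AutoModelForCausalLM.from_pretrained(
#       "meta-llama/Meta-Llama-3-8B-Instruct",
#       quantization_config=BitsAndBytesConfig(load_in_4bit=True))
#   model = get_peft_model(model, LoraConfig(
#       task_type="CAUSAL_LM",
#       target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], **LORA))
#   args = TrainingArguments(output_dir="out", **TRAIN)
```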
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Results</head><p>To evaluate performance, we generated 50 exercises for each grammar topic, setting the number of distractors to 1. We use the sampling decoding strategy with a temperature of 0.7 to balance the creativity and the coherence of the output.</p><p>The exercises are categorized according to their grammar topic. For each exercise, we assessed its structural compliance and its similarity to the exercises within the same grammar topic that have been labelled as structurally correct, using the metrics described in Section 7. The results are then averaged to obtain the accuracy for each grammar topic. Finally, the overall model performance is computed by averaging the topic scores. The results are reported in Table <ref type="table" target="#tab_0">1</ref>.</p><p>Overall, the outcomes are satisfactory. The model scores, on average, a Structural Compliance (SC𝐴) of 85%, indicating its ability to generate well-formed exercises. It achieves a self-BLEU similarity of 7%, demonstrating that text repetitions are limited. Looking at the individual SC scores, we observe that the model tends to perform better on free-morpheme grammar topics. We suppose this is due to the limited number of possible key/distractor options. Furthermore, we observed that, due to spaCy limitations in properly labelling certain verbs, grammar topics related to verbal tenses are more prone to be misidentified. This limitation causes occasional misjudgment of the exercise's structural compliance, negatively affecting the topic performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.1.">Human Evaluation</head><p>To assess classroom suitability, a human evaluation was performed on all 950 exercises by a computational linguist with a background in language-teaching pedagogy. Each generated exercise was evaluated on four criteria: Plausibility, Ambiguity (defined in Section 2), Common Sense, and Acceptability. Common Sense means that the exercise sentence should be coherent with common sense. Acceptability indicates that a sentence does not perpetuate stereotypes or display inappropriate content, such as violence. If any of these criteria is not met, the item is flagged as incorrect; the fraction of items meeting all of them is the exercise correctness (EC).</p><p>The results presented in Table 1 show that 79% of the items satisfy all the requirements to be administered to learners. In the table, SC𝐴 is the Structural Compliance evaluated by our automatic metric, SC𝐻 the one evaluated by the human annotator, and EC the exercise correctness; the double lines divide the results of the automatic metric (left) from those of the human evaluation (right). We conducted an error analysis, summarized in Table <ref type="table">2</ref>.</p><p>As expected, ambiguous distractors remain an open matter in the field, especially for tense-based topics. Conversely, we notice that biased sentences and trivial exercises are almost absent. Furthermore, we asked the annotator to evaluate the structural compliance of the exercises (SC𝐻 ). We then computed Precision, Recall and F1 scores using the annotator's judgements as gold labels. The results show that our automatic structural compliance metric (SC𝐴) has an F1 score of 95% with respect to the human evaluation, with a Precision of 98% and a Recall of 91%. This highlights its effectiveness in predicting the overall structural quality of the exercises.</p></div>
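The comparison of the automatic metric against the human gold labels reduces to standard precision, recall and F1 over binary judgements; a minimal sketch (the `prf` helper is a hypothetical name):

```python
def prf(auto: list[bool], human: list[bool]) -> tuple[float, float, float]:
    """Precision/recall/F1 of the automatic metric, with human judgements
    as gold labels (positive class = structurally compliant)."""
    tp = sum(a and h for a, h in zip(auto, human))   # both say compliant
    fp = sum(a and not h for a, h in zip(auto, human))
    fn = sum(h and not a for a, h in zip(auto, human))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```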
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Conclusion</head><p>We investigated the use of an LLM to generate English MCC grammar exercises. To that end, we curated a new English grammar MCC exercise dataset. We devised metrics for the automatic evaluation of such exercises. We evaluated our work using said metrics and a human study involving domain experts. Our findings demonstrate the model's ability to generate exercises suitable for educational use. The generated exercises exhibit a low similarity score, indicating that our method can effectively produce original exercises: a significant advantage over prior art, which mostly relies on rule-based methods. We observe that human evaluation correlates positively with the proposed structural compliance metric, corroborating our metric as an indicator of exercise structure correctness and alignment with human expert preferences. We found that a key factor of our method was the availability of high-quality fine-tuning data.</p><p>One limitation was the presence of many similar exercises in the SC-Ques dataset <ref type="bibr" target="#b8">[8]</ref> from which we built our resource. After removing similar exercises, only 30% of the original data was left. Another limitation is the sensitivity of the evaluation metric to the Pattern Matcher, concerning the evaluation of the key and the distractors, which caused some false negatives.</p><p>The curated dataset and model will be made available to the community<ref type="foot" target="#foot_0">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1. Fine-Tuning prompt</head><p>The prompt used to fine-tune the model has the same structure for all the grammar topics. The only varying parts are the name of the grammar topic and the number of distractors required. These parts are enclosed in brackets and change depending on the dataset item. The prompt used is the following.</p><p>Write a multiple-choice gap exercise on {grammar_topic} with {n_distractors} distractors.</p><p>Listing 1: Fine-tuning prompt.</p><p>A training example is created by concatenating to the prompt the desired JSON representation of the exercise. We decided to use this format because it is easier to use at inference time. An example of training data is the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head># USER</head><p>Write a multiple-choice gap exercise on comparisons with 3 distractors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Baseline prompt</head><p>To test the performance of the baseline Llama3 we utilize its instruction-tuned version, Llama3-Instruct, which can follow directions given by the user. This model is not able to answer correctly using the prompt described above. Therefore, we construct an alternative one in which all the useful information is given to the model. We include the structure of the exercise, the roles of each component with their constraints, and the desired format of the output. The prompt is the following.</p><p># SYSTEM You are an English teacher creating multiple-choice-gap exercises. # USER Write one exercise on {grammar_topic}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>It must contain the following: − sentence: the body exercise text, which must contain the tag &lt;GAP&gt; instead of the solution; − solution: the word that correctly fills the gap; − distractor: a word related to the solution, but different. The distractor must be such that, if substituted for the solution, the sentence is wrong. Do not generate any explanation. The output must be a JSON object with the following structure: { "sentence": str, "solution": str, "distractor": list[str] }</p><p>Listing 3: Prompt used for the generation of exercises with the base Llama3 model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Ethical Considerations</head><p>This section outlines the ethical considerations of the system we developed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Bias and Fairness</head><p>The dataset used in this study is obtained from a publicly available source, ensuring that all data was collected with appropriate consent. To protect personal information, we removed all sensitive data such as phone numbers, email addresses, and URLs. Since this data was created by humans, we assume that proper names and any references to existing entities are invented. Moreover, we assume that items mentioning preferences, such as films or books, do not reflect the real preferences of the users, and that events or situations described in the exercises are not related to existing facts. Finally, since the data have been created by professional authors, we assume that any possible bias or stereotype in the dataset is unintended and coincidental.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Accuracy and Reliability</head><p>The accuracy of the generated exercises is paramount. We employ both automated validation tools and human expert reviews to ensure the correctness and reliability of the content, and any inaccuracies identified are promptly rectified. We acknowledge the potential for bias in LLM-generated content; however, the human evaluation found only a negligible presence of bias in the generated outputs.</p><p>Transparency We strive for transparency by documenting the sources of our training data and explaining the model architecture. All data-manipulation techniques and processing steps are described in order, highlighting the important aspects of each.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Educational Impact</head><p>We assess the impact of LLM-generated exercises on learning outcomes. We aim to enhance personalized learning while preventing over-reliance on automated systems. The content is designed to be inclusive and accessible to all students.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Common sense was the most frequently observed inaccuracy, although the magnitude of the issue is modest. As expected, ambiguous</figDesc><table><row><cell>grammar topic</cell><cell>SC 𝐴</cell><cell>self-BLEU</cell><cell>SC 𝐻</cell><cell>EC</cell></row><row><cell>articles</cell><cell>0.94</cell><cell>0.03</cell><cell>0.94</cell><cell>0.74</cell></row><row><cell>comparison adjectives</cell><cell>0.90</cell><cell>0.09</cell><cell>0.92</cell><cell>0.72</cell></row><row><cell>conditional statements</cell><cell>0.76</cell><cell>0.07</cell><cell>0.90</cell><cell>0.66</cell></row><row><cell>future simple</cell><cell>0.82</cell><cell>0.06</cell><cell>0.90</cell><cell>0.90</cell></row><row><cell>modal verbs</cell><cell>0.62</cell><cell>0</cell><cell>0.78</cell><cell>0.70</cell></row><row><cell>infinitive and gerund verbs</cell><cell>0.76</cell><cell>0</cell><cell>0.96</cell><cell>0.86</cell></row><row><cell>passive tenses</cell><cell>0.84</cell><cell>0</cell><cell>0.86</cell><cell>0.74</cell></row><row><cell>past continuous</cell><cell>0.98</cell><cell>0.16</cell><cell>0.98</cell><cell>0.88</cell></row><row><cell>past perfect</cell><cell>0.94</cell><cell>0.12</cell><cell>0.96</cell><cell>0.82</cell></row><row><cell>past simple</cell><cell>0.88</cell><cell>0</cell><cell>0.86</cell><cell>0.82</cell></row><row><cell>personal pronouns</cell><cell>0.85</cell><cell>0.07</cell><cell>0.92</cell><cell>0.74</cell></row><row><cell>possessive 
adjectives</cell><cell>0.82</cell><cell>0.12</cell><cell>0.90</cell><cell>0.72</cell></row><row><cell>prepositions</cell><cell>0.84</cell><cell>0</cell><cell>0.92</cell><cell>0.72</cell></row><row><cell>present continuous</cell><cell>0.96</cell><cell>0.11</cell><cell>0.98</cell><cell>0.88</cell></row><row><cell>present perfect</cell><cell>0.66</cell><cell>0.08</cell><cell>0.98</cell><cell>0.84</cell></row><row><cell>present simple</cell><cell>0.88</cell><cell>0.05</cell><cell>0.88</cell><cell>0.86</cell></row><row><cell>quantifiers</cell><cell>0.88</cell><cell>0.07</cell><cell>0.88</cell><cell>0.84</cell></row><row><cell>relative clauses</cell><cell>0.94</cell><cell>0.03</cell><cell>0.94</cell><cell>0.74</cell></row><row><cell>WH-question</cell><cell>0.98</cell><cell>0.18</cell><cell>1.00</cell><cell>0.90</cell></row><row><cell>average</cell><cell>0.85</cell><cell>0.07</cell><cell>0.92</cell><cell>0.79</cell></row></table></figure>
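The self-BLEU column in Table 1 measures the diversity of the generated exercises (lower is more diverse). A simplified self-BLEU-style score — geometric mean of clipped n-gram precisions, without the smoothing or brevity penalty of the full metric — can be sketched as follows:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions,
    no smoothing and no brevity penalty (illustration only)."""
    scores = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = Counter()
        for r in references:
            ref |= ngrams(r, n)  # clip against the max count per reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        scores.append(overlap / total)
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

def self_bleu(sentences):
    """Average BLEU of each sentence against all the others:
    1.0 for identical outputs, near 0 for fully diverse ones."""
    toks = [s.lower().split() for s in sentences]
    return sum(bleu(t, toks[:i] + toks[i + 1:])
               for i, t in enumerate(toks)) / len(toks)
```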
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://github.com/ZanichelliEditore/ english-grammar-multiple-choice-generation</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We wish to thank Zanichelli editore for their support which enabled data up-sampling, human evaluation, and experimentation with their infrastructure. We also thank Eleonora Cupin for her valuable contribution to the human evaluation of the dataset.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Error analysis</head><p>Based on the human evaluation, we conducted a small analysis of the errors made by the model. Analyzing the exercises that the annotators marked as incorrect, we found that the main issue is the coherence of the exercise sentence: 75% of the incorrect exercises have a meaningless or absurd sentence. This behaviour is directly related to the hallucinations suffered by LLMs <ref type="bibr" target="#b36">[36]</ref>. The second most common error is ambiguity between the key and the distractors: the model does not possess a deep understanding of what a distractor is, and some generated distractors are interchangeable with the key.</p><p>Despite these limitations, the model is very effective at producing exercises that are not trivial (plausibility error rate of 1%) and are negligibly affected by bias and stereotypes. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Prompts</head><p>In this section, we present the prompts used in our work. They use the Llama3 chat template format, but to make the text more readable we replace the template tokens with three placeholders: #SYSTEM, #USER and #ASSISTANT.</p></div>			</div>
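For reference, the #SYSTEM/#USER placeholders correspond to role headers in the Llama3 chat template. A hypothetical helper (not part of the paper's code) that expands such role/text pairs into the documented Llama 3 special-token format might look like:

```python
def to_llama3_template(sections):
    """Expand (role, text) pairs into the Llama 3 chat template,
    using the special tokens documented for Llama 3 instruct models."""
    out = ["<|begin_of_text|>"]
    for role, text in sections:
        out.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>")
    # Open an assistant header so the model generates the answer next.
    out.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(out)

prompt = to_llama3_template([
    ("system", "You are an English teacher creating multiple-choice gap exercises."),
    ("user", "Write one exercise on past simple."),
])
```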
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">G</forename><surname>Widdowson</surname></persName>
		</author>
		<title level="m">Teaching Language as Communication</title>
				<meeting><address><addrLine>Oxford</addrLine></address></meeting>
		<imprint>
			<publisher>Oxford University Press</publisher>
			<date type="published" when="1978">1978</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">G</forename><surname>Widdowson</surname></persName>
		</author>
		<title level="m">Explorations in Applied Linguistics</title>
				<meeting><address><addrLine>Oxford</addrLine></address></meeting>
		<imprint>
			<publisher>Oxford University Press</publisher>
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Large language models in education: Vision and opportunities</title>
		<author>
			<persName><forename type="first">W</forename><surname>Gan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C.-W</forename><surname>Lin</surname></persName>
		</author>
		<idno type="DOI">10.1109/BigData59044.2023.10386291</idno>
	</analytic>
	<monogr>
		<title level="m">2023 IEEE International Conference on Big Data (BigData)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="4776" to="4785" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Automated generation of multiple-choice cloze questions for assessing english vocabulary using gpt-turbo 3.5</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Orita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sugawara</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2403.02078" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">English grammar multiple-choice question generation using text-to-text transfer transformer</title>
		<author>
			<persName><forename type="first">P</forename><surname>Chomphooyod</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Suchato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tuaycharoen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Punyabukkana</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.caeai.2023.100158</idno>
	</analytic>
	<monogr>
		<title level="j">Computers and Education: Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page">100158</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Texygen: A benchmarking platform for text generation models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1802.01886" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">On communicative competence</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Hymes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Sociolinguistics. Selected Readings</title>
				<editor>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Pride</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Holmes</surname></persName>
		</editor>
		<imprint>
			<publisher>Harmondsworth</publisher>
			<date type="published" when="1972">1972</date>
			<biblScope unit="page" from="269" to="293" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Sc-ques: A sentence completion question dataset for english as a second language learners</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Augmented Intelligence and Intelligent Tutoring Systems</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Frasson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Mylonas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Troussas</surname></persName>
		</editor>
		<meeting><address><addrLine>Nature Switzerland, Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="678" to="690" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Bachman</surname></persName>
		</author>
		<title level="m">Fundamental Considerations in Language Testing</title>
				<meeting><address><addrLine>Oxford</addrLine></address></meeting>
		<imprint>
			<publisher>Oxford University Press</publisher>
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Purpura</surname></persName>
		</author>
		<title level="m">Assessing Grammar, Cambridge Language Assessment</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Practical Language Testing</title>
		<author>
			<persName><forename type="first">G</forename><surname>Fulcher</surname></persName>
		</author>
		<idno type="DOI">10.4324/9780203767399</idno>
		<imprint>
			<date type="published" when="2010">2010</date>
			<publisher>Routledge</publisher>
		</imprint>
	</monogr>
	<note>1st ed</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Multiple choice question corpus analysis for distractor characterization</title>
		<author>
			<persName><forename type="first">V.-M</forename><surname>Pho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>André</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-L</forename><surname>Ligozat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Grau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Illouz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>François</surname></persName>
		</author>
		<ptr target="http://www.lrec-conf.org/proceedings/lrec2014/pdf/692_Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14), European Language Resources Association (ELRA)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Choukri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Declerck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Loftsson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Maegaard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Mariani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Moreno</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Odijk</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Piperidis</surname></persName>
		</editor>
		<meeting>the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14), European Language Resources Association (ELRA)<address><addrLine>Reykjavik, Iceland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="4284" to="4291" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Measuring non-native speakers&apos; proficiency of english by using a test with automatically-generated fill-in-the-blank questions</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sumita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sugaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yamamoto</surname></persName>
		</author>
		<idno type="DOI">10.3115/1609829.1609839</idno>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Automatic question generation for vocabulary assessment</title>
		<author>
			<persName><forename type="first">J</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Frishkoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Eskénazi</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/H05-1103/" />
	</analytic>
	<monogr>
		<title level="m">Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference</title>
				<meeting><address><addrLine>Vancouver; British Columbia, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005-10">October 2005. 2005</date>
			<biblScope unit="page" from="819" to="826" />
		</imprint>
	</monogr>
	<note>The Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Gap-fill tests for language learners: Corpus-driven item generation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kilgarriff</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:61531901" />
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A system for generating multiple choice questions: With a novel approach for sentence selection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Saha</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/W15-4410</idno>
		<ptr target="https://doi.org/10.18653/v1/W15-4410" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Tseng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Matsumoto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Wong</surname></persName>
		</editor>
		<meeting>the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP<address><addrLine>Beijing, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015-07-31">July 31, 2015. 2015</date>
			<biblScope unit="page" from="64" to="72" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A system for generating multiple choice questions: With a novel approach for sentence selection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Saha</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/W15-4410</idno>
		<ptr target="https://doi.org/10.18653/v1/W15-4410" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Tseng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Matsumoto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Wong</surname></persName>
		</editor>
		<meeting>the 2nd Workshop on Natural Language Processing Techniques for Educational Applications, NLP-TEA@ACL/IJCNLP<address><addrLine>Beijing, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015-07-31">July 31, 2015. 2015</date>
			<biblScope unit="page" from="64" to="72" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Automatic generation system of multiple-choice cloze questions and its evaluation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Goto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kojiri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Watanabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Iwata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yamada</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:15482954" />
	</analytic>
	<monogr>
		<title level="j">Knowledge Management &amp; E-Learning: An International Journal</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="210" to="224" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">C N</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001)</title>
				<editor>
			<persName><forename type="first">C</forename><forename type="middle">E</forename><surname>Brodley</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Danyluk</surname></persName>
		</editor>
		<meeting>the Eighteenth International Conference on Machine Learning (ICML 2001)<address><addrLine>Williamstown, MA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="2001-07-01">June 28 -July 1, 2001. 2001</date>
			<biblScope unit="page" from="282" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Learning from partially annotated data: Example-aware creation of gap-filling exercises for language learning</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Bitew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Deleu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Doğruöz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Develder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Demeester</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.BEA-1.51</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.bea-1.51" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2023</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kochmar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Horbach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Laarmann-Quante</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Madnani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Tack</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Yaneva</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Yuan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Zesch</surname></persName>
		</editor>
		<meeting>the 18th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2023<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023-07-13">13 July 2023. 2023</date>
			<biblScope unit="page" from="598" to="609" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Roberta: A robustly optimized bert pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1907.11692" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Mask and cloze: Automatic open cloze question generation using a masked language model</title>
		<author>
			<persName><forename type="first">S</forename><surname>Matsumori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Okuoka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Shibata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Inoue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fukuchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Imai</surname></persName>
		</author>
		<idno type="DOI">10.1109/access.2023.3239005</idno>
		<ptr target="https://doi.org/10.1109/ACCESS.2023.3239005" />
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="9835" to="9850" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">English grammar multiple-choice question generation using text-to-text transfer transformer</title>
		<author>
			<persName><forename type="first">P</forename><surname>Chomphooyod</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Suchato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tuaycharoen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Punyabukkana</surname></persName>
		</author>
		<idno type="DOI">10.1016/J.CAEAI.2023.100158</idno>
		<ptr target="https://doi.org/10.1016/j.caeai.2023.100158" />
	</analytic>
	<monogr>
		<title level="j">Comput. Educ. Artif. Intell</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page">100158</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1706.03762" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Automatic generation of language exercises based on a universal methodology: An analysis of possibilities</title>
		<author>
			<persName><forename type="first">V</forename><surname>Slavuj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Nacinovic Prskalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brkic Bakaric</surname></persName>
		</author>
		<idno type="DOI">10.31926/but.pcs.2021.63.14.2.3</idno>
	</analytic>
	<monogr>
		<title level="j">Bulletin of the Transilvania University of Brasov. Series IV: Philology and Cultural Studies</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">63</biblScope>
			<biblScope unit="page" from="29" to="48" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Language exercise generation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Malafeev</surname></persName>
		</author>
		<idno type="DOI">10.4018/IJCSSA.2014070102</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Conceptual Structures and Smart Applications</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="20" to="35" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">An evidence-based approach to distractor generation in multiple-choice language tests</title>
		<author>
			<persName><forename type="first">D</forename><surname>Perrett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>March</surname></persName>
		</author>
		<idno type="DOI">10.13140/RG.2.2.22779.16165</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://aclanthology.org/P02-1040" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Isabelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Charniak</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</editor>
		<meeting>the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Philadelphia, Pennsylvania, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W05-0909" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Goldstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Voss</surname></persName>
		</editor>
		<meeting>the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics<address><addrLine>Ann Arbor, Michigan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<ptr target="https://ai.meta.com/blog/meta-llama-3/" />
		<title level="m">Introducing Meta Llama 3: The most capable openly available LLM to date</title>
				<imprint>
			<date type="published" when="2024-04">April 2024</date>
		</imprint>
	</monogr>
	<note>Meta</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">D4: Improving llm pretraining via document deduplication and diversification</title>
		<author>
			<persName><forename type="first">K</forename><surname>Tirumala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Simig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aghajanyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Morcos</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2023/file/a8f8cbd7f7a5fb2c837e578c75e5b615-Paper-Datasets_and_Benchmarks.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Oh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="53983" to="53995" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<orgName type="collaboration">OpenAI</orgName>
		</author>
		<ptr target="https://arxiv.org/abs/2303.08774" />
		<title level="m">Gpt-4 technical report</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<title level="m" type="main">Qlora: Efficient finetuning of quantized llms</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dettmers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pagnoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2305.14314" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">A theoretical analysis of the repetition problem in text generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M.-C</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shi</surname></persName>
		</author>
		<idno type="DOI">10.1609/aaai.v35i14.17520</idno>
		<ptr target="https://ojs.aaai.org/index.php/AAAI/article/view/17520" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="12848" to="12856" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">Hallucination is inevitable: An innate limitation of large language models</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kankanhalli</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2401.11817" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
