Can LLMs Solve Reading Comprehension Tests as Second Language Learners?

Akio Hayakawa1, Horacio Saggion1
1 LaSTUS Lab, TALN Research Group, Department of Engineering, Universitat Pompeu Fabra, C/Tànger 122 (08018), Barcelona, Spain
akio.hayakawa@upf.edu (A. Hayakawa); horacio.saggion@upf.edu (H. Saggion)
https://ahaya3776.github.io/ (A. Hayakawa)

KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain

Abstract
The manual evaluation of natural language processing systems is costly and time-consuming, especially when targeting people with specific attributes as evaluators. Current large language models (LLMs) are reported to outperform humans at various tasks, and have recently been used as substitutes for human evaluators. LLMs have also shown the ability to behave as specified in a prompt. This progress raises a fundamental question: can LLMs mimic the behavior of language learners? In this study, we intentionally weaken LLMs, aiming to make them simulate language learners on multiple-choice reading comprehension tests. By comparing answer distributions from language learners and LLMs, we observe that prompts designed to weaken the LLMs indeed degrade their performance. However, this degradation does not bridge the gap between the original LLMs and language learners, thereby highlighting a critical discrepancy between them.

Keywords
Natural Language Processing, Large Language Models, Question Answering, Reading Comprehension

1. Introduction

In the field of Natural Language Processing (NLP), the evaluation of systems is commonly categorized into two approaches: automatic and manual evaluation. Manual evaluation, which is considered more reliable, involves methods ranging from subjective scoring on scales, such as a 5-point rating, to task-based assessments like solving comprehension questions. Despite its reliability, manual evaluation requires greater time and cost investments [1].

The difficulty of conducting manual evaluation significantly increases when targeting individuals with specific attributes, as access to these groups becomes more difficult. This has resulted in the diminished prioritization of their participation, calling into question the trustworthiness of manual evaluation. For instance, in the text simplification task, which aims to make texts more readable and understandable, children, language learners, and people with disabilities are considered ideal evaluators for the simplicity of texts, as they are presumed to benefit most from the simplification [2]. Nevertheless, studies on text simplification have relied on native speakers or people who do not need simplified texts for manual evaluation [3, 4], rarely involving individuals who need simplification, probably due to significant disparities in accessibility to diverse groups. Indeed, Sauberli et al. [5] recently demonstrated subjective differences in perceived text difficulty between people with and without intellectual disabilities, highlighting the importance of their involvement.

[Figure 1: Overview of our experimental setup. We investigate whether it is possible to make the next-token probabilities of an LLM closer to the selection distribution of language learners by weakening the LLM. The figure contrasts a plain one-shot multiple-choice prompt (NONE) with a weakening prompt (PORTRAY) over a CONTEXT/QUESTION/OPTIONS/ANSWER layout, and compares the resulting next-token probabilities over options A-D with the learners' selection distribution.]
Recent advancements in NLP, especially with Large Language Models (LLMs), may address this bottleneck. One line of work has attempted to substitute manual evaluation with assessments conducted by LLMs [6, 7, 8], seeking immediate and inexpensive annotations of higher quality. Another set of studies has reported that LLMs are capable of emulating a specific persona when attributes are included in a prompt [9, 10].

Therefore, we wonder if LLMs could be prompted to serve as substitutes for specific personas. This study specifically focuses on language learners, investigating whether LLMs can mimic their response patterns. This approach could potentially offer a more accessible means of obtaining evaluations for tasks that ideally require responses from specific target groups, for example predicting the difficulty of questions without a pilot pretesting stage, simply by providing the target group's attributes in the prompt.

To judge the mimicability of LLMs, we compare responses to multiple-choice reading comprehension (RC) tests, which have been widely used to measure language comprehension [11], from language learners and NLP systems. Using the CMCQRD dataset [12], a recently released four-choice RC test dataset with selection distributions from language learners, we aim to investigate whether LLM output can closely approximate these distributions. While fine-tuning encoder models is one approach to pursuing distributions closer to those of humans [13], prompting LLMs has the potential to target a broader range of personas, suggesting enhanced applicability.
Figure 1 illustrates the outline of our experimental setup. Given that current models in the NLP field often achieve or even surpass human-level performance on various tasks [14], it is reasonable to presume that LLMs could outperform the average language learner on RC tests. Hence, LLMs need to be weakened to mimic language learners. We try several prompting techniques to degrade LLM performance and analyze their effects.

Contrary to our expectations, our preliminary experimental results show that the prompts considered do not lead LLMs to mimic language learners. Furthermore, we observe that the questions LLMs tend to answer incorrectly differ significantly from those that language learners struggle with. This discrepancy suggests a need for deeper analysis when we try to utilize LLMs as a replacement for human evaluation.

2. Related Work

2.1. Human Response to Reading Comprehension Datasets

Reading comprehension (RC) tests have been widely used in psycholinguistic studies to assess how well readers, especially language learners, understand the content of a given text [15]. While these studies have seldom made their original data publicly available, research in natural language processing has made standard datasets available to measure the text comprehension abilities of machines [16, 17], sometimes for specific capabilities such as reasoning in HotpotQA [18] and the use of external knowledge in ReClor [19]. However, these datasets are designed only to measure system performance, not for comparison with human responses. As a result, human responses to RC are absent from these datasets. There is limited research that compares responses from machines and humans, and even these studies typically offer only summarized data [20]. This data shortage has hindered research into machine emulation of human responses.

In contrast to this scarcity, CMCQRD [12] is a unique RC dataset which includes response data from language learners. CMCQRD adopts a multiple-choice setting like many of the RC datasets mentioned above, and includes the distribution of choices among options. RC tests and participants are categorized based on the CEFR, a guideline used to describe the achievements of foreign language learners. Among the six reference levels of the CEFR (A1, A2, B1, B2, C1, C2), the independent (B1, B2) and proficient (C1, C2) levels are considered in the CMCQRD dataset. In other words, each question in this dataset is labeled with a difficulty level ranging from B1 to C2 according to the CEFR, and also includes the selection distribution of language learners whose proficiency corresponds to these labeled levels. This information enables a detailed analysis of the differences between language learners and machines. Liusie et al. [13] compared outputs from an ELECTRA-based classification model with human responses, reporting low similarity due to the model performing worse than language learners.

2.2. Prompts that Alter LLMs' Behaviour in Question Answering

Retrieving distributions for multiple-choice questions from LLMs involves obtaining not only the final answer but also the probabilities associated with each option. While it is nontrivial to extract an answer or a probability because of the auto-regressive nature of text generation by LLMs, Robinson et al. [21] demonstrated that a multiple-choice prompt can lead to a higher probability of generating option symbols as the next token, especially in one- or few-shot settings. Unlike a traditional cloze prompt, which selects the option whose full sequence has the highest probability without presenting the other options, a multiple-choice prompt provides all options simultaneously and selects the one whose option symbol has the highest probability.

However, even in this setting, it has been reported that LLMs respond less robustly to certain prompts [22, 23]. Utilizing this vulnerability, Santurkar et al. [10] suggested that LLMs change their distributions over attitude options towards controversial social topics when given prompts that mimic the behavior of a human group with specific attributes. LLMs' behaviour also changes when given an expression of uncertainty such as "Perhaps it's" [24]. This change was observed in response to context-free open-ended questions, highlighting an opportunity for extended research on multiple-choice RC tests.

3. Experimental Setup

The primary objective of this work is to investigate whether LLMs can mimic the responses of language learners in solving multiple-choice RC tests.
In this section, we outline our experimental setup, utilizing the CMCQRD dataset [12], which includes responses from at least 100 language learners per question, providing information about answer probability distributions. Our analysis compares the next-token probability on each option by LLMs with the choice patterns of language learners, aiming to understand the extent of LLMs' capability in emulating learner-like understanding in RC tasks.

Assuming that up-to-date LLMs outperform average language learners, degrading these models is needed to bring their output distributions closer to those of language learners. We employ several methods to weaken LLM performance and compare the results with those of the language learners.

Dataset. The CMCQRD dataset consists of 4-choice English RC tests, labeled with difficulty levels ranging from CEFR B1 to C2. A subset of CMCQRD includes responses from non-native English speakers whose proficiency aligns with the difficulty label [12, 13]. We refer to this set of responses as the human distribution. Table 1 shows the statistics of the CMCQRD dataset. The average accuracies of language learners are around 60%, while the accuracies of their mode selections are around 90%. In this experiment, we exclusively use questions at levels B1 and B2 with a human distribution, corresponding to intermediate levels of proficiency. Our focus on these levels is driven by our aim to assess the ability of LLMs to reproduce the challenges faced by language learners who are not fully proficient in reading comprehension.

Table 1
Statistics of the CMCQRD dataset. We use RC tests at the B1 and B2 levels with responses.

CEFR Level | w/o responses: Num Text / Num QA | w/ responses: Num Text / Num QA | Mode Acc | Avg Acc
B1 | 5 / 25 | 23 / 115 | 0.913 | 0.590
B2 | 21 / 160 | 37 / 262 | 0.882 | 0.594
C1 | 13 / 86 | 12 / 83 | 0.880 | 0.613
C2 | 3 / 20 | 6 / 42 | 0.833 | 0.681

LLM Settings. Since the outputs of LLMs are auto-regressive and free-form, some techniques are required to increase the likelihood of the desired tokens in subsequent outputs. To this end, we employ a multiple-choice prompting approach for RC, as described in Robinson et al. [21]. This approach provides LLMs with a single natural language prompt that concatenates the context, a question, the options, and an option-symbol-prompting word, such as "Answer:". We use the next-token probabilities as the LLM's answer distribution: the logits of the next tokens associated with the option symbols, {A, B, C, D} on 4-choice tests, are normalized using softmax.

We adopt GPT-4o1 and LLaMa-2-70B [25] with one-shot prompting. We run LLaMa-2-70B using the HuggingFace library with 4-bit quantization.2 The temperature parameter is set to 1.0 for both models.

1 https://openai.com/index/hello-gpt-4o/
2 https://huggingface.co/meta-llama/Llama-2-70b-hf
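To make this step concrete, the following is a minimal sketch of the multiple-choice probability extraction described above, written against the HuggingFace transformers API for a local causal LM. It is illustrative only: the prompt layout, the assumption that each option symbol maps to a single token, and the loading details are our simplifications rather than the exact configuration used in the experiments (which ran LLaMa-2-70B with 4-bit quantization, and obtained the corresponding quantities for GPT-4o through its API).

```python
# Sketch: extract an answer distribution over option symbols {A, B, C, D}
# from the next-token logits of a causal LM, following the multiple-choice
# prompting idea of Robinson et al. [21]. Model loading and prompt layout
# are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-70b-hf"  # the paper loads this with 4-bit quantization

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def option_distribution(context: str, question: str, options: list[str]) -> dict:
    """Return softmax-normalized probabilities over the option symbols A-D."""
    symbols = ["A", "B", "C", "D"]
    prompt = (
        f"CONTEXT: {context}\n"
        f"QUESTION: {question}\n"
        + "\n".join(f"{s}) {o}" for s, o in zip(symbols, options))
        + "\nANSWER:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # Assumes each option symbol (with a leading space) maps to a single token.
    ids = [tokenizer.encode(" " + s, add_special_tokens=False)[0] for s in symbols]
    probs = torch.softmax(logits[ids], dim=-1)  # normalize over A-D only
    return dict(zip(symbols, probs.tolist()))
```

The one-shot example included in the real prompts is omitted here for brevity; the weakening prompt designs described later modify the same basic layout.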
Evaluation. To compare human and LLM outputs, we use mode accuracy, average accuracy, and KL divergence, following Liusie et al. [13], as well as a correct/wrong F1 score. These metrics are described below.

1. Mode Accuracy: how frequently the most plausible symbol according to the LLM is the correct answer, denoted as
$\mathrm{Mode\ Accuracy} = \mathbb{E}\left[\operatorname{argmax}_{y}(p^{\mathrm{LLM}}) = y^{\mathrm{ans}}\right]$,
where $p$ represents the probabilities for each option and $y^{\mathrm{ans}}$ is the correct option.

2. Average Accuracy: how frequently the correct option is selected on average by the LLM, denoted as
$\mathrm{Average\ Accuracy} = \mathbb{E}\left[y^{\mathrm{LLM}} = y^{\mathrm{ans}}\right]$.

3. KL Divergence: the similarity between two distributions [26], denoted as
$\mathrm{KL\ Divergence} = \sum_{o} l_{o} \log \frac{l_{o}}{h_{o}}$,
where $o$ represents an option, and the LLM and human distributions are denoted by $l$ and $h$, respectively.

4. Correct/Wrong F1: the macro-averaged F1 score focused on question-wise correct and wrong consistency of the mode options, denoted as
$\mathrm{Correct/Wrong\ F1} = \frac{1}{2}(\mathrm{F1}_{\mathrm{correct}} + \mathrm{F1}_{\mathrm{wrong}})$,
where each F1 score is calculated from the elements of a confusion matrix, such as
$\mathrm{TP}_{\mathrm{correct}} = \sum_{i} \left[(y_{i}^{\mathrm{LLM}} = y_{i}^{\mathrm{ans}}) \wedge (y_{i}^{\mathrm{Human}} = y_{i}^{\mathrm{ans}})\right]$ and
$\mathrm{FP}_{\mathrm{wrong}} = \sum_{i} \left[(y_{i}^{\mathrm{LLM}} \neq y_{i}^{\mathrm{ans}}) \wedge (y_{i}^{\mathrm{Human}} = y_{i}^{\mathrm{ans}})\right]$.

Furthermore, we calculate the sum of the probabilities of the option symbols appearing as the next token to evaluate the effectiveness of the prompts.
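For reference, here is a small sketch of how the four comparison metrics could be computed once, for each question, the LLM distribution over A-D, the human selection distribution, and the gold option are available. The data layout, the per-question averaging of the KL divergence, and all names are our assumptions rather than the paper's implementation.

```python
# Sketch: the four comparison metrics, computed per level (B1 or B2) from
# per-question LLM distributions, human distributions, and gold answers.
# Data layout and names are illustrative assumptions.
import math

def argmax_option(dist: dict) -> str:
    return max(dist, key=dist.get)

def evaluate(questions: list[dict]) -> dict:
    """Each item: {"llm": {...}, "human": {...}, "answer": "B"} with A-D keys."""
    eps = 1e-12
    mode_acc = avg_acc = kl = 0.0
    tp = fp = fn = tn = 0  # "correct" class: the mode answer hits the gold option
    for q in questions:
        llm, human, ans = q["llm"], q["human"], q["answer"]
        mode_acc += argmax_option(llm) == ans
        avg_acc += llm[ans]  # probability mass on the gold option
        kl += sum(l * math.log((l + eps) / (human[o] + eps)) for o, l in llm.items())
        llm_right = argmax_option(llm) == ans
        human_right = argmax_option(human) == ans
        tp += llm_right and human_right
        fp += llm_right and not human_right   # false positive for the "correct" class
        fn += (not llm_right) and human_right
        tn += (not llm_right) and (not human_right)
    n = len(questions)
    f1_correct = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    f1_wrong = 2 * tn / (2 * tn + fp + fn) if tn else 0.0
    return {
        "mode_acc": mode_acc / n,
        "avg_acc": avg_acc / n,
        "kl": kl / n,
        "cw_f1": (f1_correct + f1_wrong) / 2,
    }
```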
Prompt Design. We employ four weakening prompt designs in addition to the plain NONE baseline; examples are given in Appendix A.

- NONE: Only the context, question, and candidate answers are given.
- PORTRAY: Similar to Santurkar et al. [10], a role is assigned at the beginning of the prompt, for example, "Answer the following reading comprehension question as if you are a CEFR B1 level English learner.", followed by a description of the level as defined by the CEFR.3
- ESL: Bonner et al. [27] suggested that LLMs seem to have the ability to control outputs based on a targeted CEFR level provided in a prompt. We ask the LLM for the most plausible answer from language learners at a specific CEFR level, such as "What do you think is the most plausible answer by CEFR B1 level learners to the following reading comprehension test?". In addition, we inject an explanation such as "Given the context and considering that the test takers are at a CEFR B1 level, the most plausible answer they might choose could be" after "ANSWER:".
- UNCERTAIN: As reported in Zhou et al. [24], expressions of uncertainty change LLMs' behavior. We inject an expression such as "I'm not sure because there are some sentences I don't understand, but maybe the answer is," after "ANSWER:".
- MASK: Laufer [28] argued that language learners need to know 95% of the vocabulary in a text to comprehend its content. To simulate the scenario where 5% of the vocabulary is not known, the 5% least frequent words within a context are masked. Infrequent words in the question and options are also masked based on this threshold. Word frequency is calculated based on SUBTLEXus [29]. A sketch of this masking procedure is shown after this list.

3 https://www.coe.int/en/web/common-european-framework-reference-languages/cefr-descriptors
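Below is a minimal sketch of the MASK transformation as we read the description above: it derives a frequency threshold covering the 5% least frequent word tokens of the context from a SUBTLEXus-style frequency list and replaces those words, as well as matching words in the question and options, with "[MASK]". The file name, column names, and regex tokenization are assumptions for illustration.

```python
# Sketch: mask the 5% least frequent words in a passage, following the MASK
# prompt design. The SUBTLEXus file name/columns and the simple regex
# tokenization are illustrative assumptions.
import csv
import re

def load_frequencies(path: str = "SUBTLEXus.csv") -> dict:
    """Map lowercased words to a frequency count (assumed columns: Word, FREQcount)."""
    freqs = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            freqs[row["Word"].lower()] = float(row["FREQcount"])
    return freqs

def mask_infrequent(context: str, question: str, options: list[str],
                    freqs: dict, ratio: float = 0.05):
    words = re.findall(r"[A-Za-z']+", context)
    # Rank the context tokens from least to most frequent (unknown words count as 0).
    ranked = sorted(words, key=lambda w: freqs.get(w.lower(), 0.0))
    n_mask = max(1, int(len(ranked) * ratio))
    threshold = freqs.get(ranked[n_mask - 1].lower(), 0.0)

    def mask(text: str) -> str:
        return re.sub(
            r"[A-Za-z']+",
            lambda m: "[MASK]" if freqs.get(m.group().lower(), 0.0) <= threshold else m.group(),
            text,
        )

    # The same threshold is applied to the question and the options.
    return mask(context), mask(question), [mask(o) for o in options]
```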
4. Results

Table 2
Results on the CMCQRD dataset. KL and C/W F1 values are computed against the Human language learners in the first row.

B1 level:
System | Prompt | Mode Acc | Avg Acc | KL (lower is better) | C/W F1 (higher is better) | Sum Prob.
Human | - | 0.913 | 0.585 | - | - | -
GPT-4o | NONE | 0.974 | 0.974 | 0.570 | 0.552 | 0.994
GPT-4o | PORTRAY | 0.974 | 0.971 | 0.566 | 0.552 | 0.988
GPT-4o | ESL | 0.965 | 0.964 | 0.563 | 0.544 | 0.895
GPT-4o | UNCERTAIN | 0.713 | 0.719 | 0.795 | 0.471 | 0.155
GPT-4o | MASK | 0.922 | 0.918 | 0.562 | 0.512 | 0.868
LLaMa-2-70b | NONE | 0.930 | 0.839 | 0.338 | 0.518 | 0.993
LLaMa-2-70b | PORTRAY | 0.930 | 0.831 | 0.320 | 0.518 | 0.984
LLaMa-2-70b | ESL | 0.922 | 0.750 | 0.211 | 0.512 | 0.973
LLaMa-2-70b | UNCERTAIN | 0.922 | 0.646 | 0.163 | 0.512 | 0.966
LLaMa-2-70b | MASK | 0.843 | 0.750 | 0.294 | 0.553 | 0.988

B2 level:
System | Prompt | Mode Acc | Avg Acc | KL (lower is better) | C/W F1 (higher is better) | Sum Prob.
Human | - | 0.885 | 0.592 | - | - | -
GPT-4o | NONE | 0.931 | 0.929 | 0.576 | 0.633 | 0.971
GPT-4o | PORTRAY | 0.927 | 0.927 | 0.580 | 0.606 | 0.975
GPT-4o | ESL | 0.927 | 0.926 | 0.554 | 0.651 | 0.842
GPT-4o | UNCERTAIN | 0.828 | 0.805 | 0.711 | 0.572 | 0.228
GPT-4o | MASK | 0.851 | 0.852 | 0.578 | 0.608 | 0.798
LLaMa-2-70b | NONE | 0.854 | 0.756 | 0.354 | 0.611 | 0.992
LLaMa-2-70b | PORTRAY | 0.847 | 0.740 | 0.332 | 0.604 | 0.980
LLaMa-2-70b | ESL | 0.851 | 0.674 | 0.263 | 0.658 | 0.969
LLaMa-2-70b | UNCERTAIN | 0.839 | 0.556 | 0.226 | 0.646 | 0.971
LLaMa-2-70b | MASK | 0.755 | 0.644 | 0.391 | 0.533 | 0.983

Table 2 shows the performance of the LLMs on CMCQRD given each prompt. Overall, contrary to our expectations, the results reveal the limited ability of LLMs to mimic language learners when solving multiple-choice RC tests.

LLMs tend not to be distracted. First, the distributions produced by the LLMs, especially by GPT-4o, are more skewed under NONE than those of humans. In other words, compared to the small gap between Human and the LLM in Mode Accuracy, the gap in Average Accuracy is much wider. For GPT-4o, there is almost no difference between these two accuracies, which demonstrates that the most plausible next token is essentially a single option symbol regardless of its correctness.

Prompts affect outputs differently across LLMs. The results show a difference in how prompts function between GPT-4o and LLaMa-2-70b. For LLaMa-2-70b, the sum of the probabilities for option symbols exceeds 95% across all prompts, indicating that the prompts effectively induce the generation of these symbols. On the other hand, GPT-4o behaves differently, particularly with the UNCERTAIN prompt, where the probability of generating non-symbol tokens is considerable. This shows that the function of prompts differs across LLMs.

LLaMa-2-70b is better suited than GPT-4o for weakening. A key distinction between responses from language learners and LLMs is that, while both show high Mode Accuracy, LLMs demonstrate substantially higher Average Accuracy than humans, indicating that the LLM distributions are generally skewed. Therefore, an LLM suited for weakening should maintain Mode Accuracy while reducing Average Accuracy. In this respect, LLaMa-2-70b is better than GPT-4o. GPT-4o shows minimal changes in Average Accuracy under the weakening prompts, with UNCERTAIN instead dropping both accuracies and the Sum Probability together. Thus, its distributions remain distinct from those of language learners, as reflected by the persistently high KL divergence. In contrast, LLaMa-2-70b shows the ability to reduce Average Accuracy while maintaining Mode Accuracy, especially with the ESL and UNCERTAIN prompts.

Prompt design plays a crucial role. Prompt design markedly influences the outputs of LLMs, as exemplified by the difference between the PORTRAY and ESL results for LLaMa-2-70b. While both prompts are designed to elicit language-learner-like outputs and include a description of the targeted CEFR level, PORTRAY fails to weaken performance, whereas ESL leads to reductions in Average Accuracy and KL divergence. This suggests that there is much room for prompt engineering in the other designs, including UNCERTAIN.

Language learners and LLMs mistake different questions. Whereas KL divergence measures the similarity between two distributions, the Correct/Wrong F1 score directly measures the consistency of the most plausible answers of humans and LLMs. The LLMs show a low F1 score regardless of the prompt given, indicating a discrepancy between the questions that lead to human errors and those that lead to LLM errors. LLaMa-2-70b shows the largest drop in KL divergence with the UNCERTAIN prompt compared to NONE. However, this does not correspond to a substantial improvement in the F1 score, suggesting that the LLM does not mimic human error patterns effectively. Since the distributions of LLMs are generally skewed compared to those of language learners, a reduction in KL divergence is achievable simply by increasing the temperature parameter. This result reveals the importance of not only comparing distributions but also examining the consistency of the mode answers when trying to mimic humans.

5. Discussion

Our results so far seem to demonstrate the inability of LLMs to mimic human language learners when solving RC tests, even when provided with weakening prompts. In particular, we identify differences in the questions that language learners and the LLMs tend to answer incorrectly. In this section, we turn our attention to an analysis of the underlying factors behind these discrepancies.

We analyze the influence of the complexity of the context on the accuracy gaps between language learners and the LLM (NONE-Human), and also on the gaps between the LLM with and without a weakening prompt (NONE-UNCERTAIN). We select LLaMa-2-70b because of its ability to be weakened. Among the features used in prior research by Sugawara et al. [20], we select Passage Length, FKGL [30], and Word Frequency as indicators of complexity. Correlations are measured between these indicators and the accuracy gaps for each individual question.
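As an illustration of this analysis, the sketch below computes two of the surface indicators (passage length and FKGL, using the standard Kincaid et al. formula [30]) and correlates them with per-question accuracy gaps via Pearson's r. The syllable heuristic, the choice of Pearson correlation, and the data layout are our assumptions; the word-frequency indicator would reuse the SUBTLEXus counts from the MASK sketch above.

```python
# Sketch: correlate surface complexity indicators with per-question accuracy
# gaps (e.g., NONE minus Human). The syllable heuristic and data layout are
# illustrative assumptions; FKGL follows the standard Kincaid et al. formula [30].
import re
from scipy.stats import pearsonr

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def complexity_features(passage: str) -> dict:
    words = re.findall(r"[A-Za-z']+", passage)
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)
    fkgl = 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
    return {"passage_length": len(words), "fkgl": fkgl}

def correlate(items: list) -> dict:
    """items: [(passage, accuracy_gap), ...] where the gap is e.g. NONE minus Human."""
    gaps = [gap for _, gap in items]
    results = {}
    for name in ("passage_length", "fkgl"):
        values = [complexity_features(p)[name] for p, _ in items]
        r, p_value = pearsonr(values, gaps)
        results[name] = (r, p_value)
    return results
```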
Table 3
Correlation between the gap and the complexity measures. N, H, and U mean NONE, Human, and UNCERTAIN, respectively. * indicates statistical significance at p < 0.05.

Measure | Level | N-H | N-U | Avg
Delta Average Accuracy | B1 | 0.254 | 0.193 | -
Delta Average Accuracy | B2 | 0.164 | 0.200 | -
Passage Length | B1 | -0.14 | 0.01 | 342.2
Passage Length | B2 | 0.14* | 0.04 | 656.7
FKGL | B1 | 0.31* | 0.23* | 9.69
FKGL | B2 | 0.05 | -0.02 | 9.22
Word Freq (per 1k words) | B1 | -0.18 | -0.08 | 6.53
Word Freq (per 1k words) | B2 | -0.01 | 0.13* | 6.44

Table 3 shows the correlations, some of which are statistically significant. For Passage Length, there is a weak positive correlation with the gap between NONE and Human at the B2 level, which means that the longer the context, the harder it is for language learners to answer correctly compared to the LLM. This implies that a longer context may hinder B2-level language learners from finding the evidence needed to answer more than it hinders the LLM. FKGL, a readability metric based on the number of words and syllables per sentence, shows a weak-to-moderate positive correlation with the gap between the LLM and humans, and also with the gap between the LLM with and without the uncertainty prompt. Since FKGL is designed to give lower values on easier texts, these statistically significant correlations imply that the LLM retains higher accuracy on more complex contexts. The UNCERTAIN prompt can slightly smooth this trend, but it does not enable the LLM to emulate the tendency of language learners. Finally, for Word Frequency, there is a weak positive correlation with the gap between NONE and UNCERTAIN at the B2 level. This may imply that UNCERTAIN weakens the LLM more when a context is composed of more common words.

Overall, these surface-level complexity indicators are not sufficient to explain the difference between language learners and LLMs. We reserve deeper analysis, such as semantic considerations, for further research.

6. Conclusion

In conclusion, our research reveals that LLMs do not behave as second language learners, even with the potentially performance-weakening prompts we provide. We also observe that the performance varies depending on the model and prompts used, even though a limited set of models and prompts is considered. Expanding the variety of these elements, including prompts with more sophisticated approaches such as chain-of-thought [23] and automatic prompt tuning [31], will be critical for a more comprehensive evaluation of the mimicability.

Our findings demonstrate discrepancies between language learners and LLMs in terms of which questions they find easy, highlighting the necessity of micro-level analysis. Nonetheless, the limited size of the CMCQRD dataset used in this research presents challenges in drawing comprehensive conclusions. The development of datasets incorporating diverse personas beyond language learners is essential when trying to use LLMs as a complement to human evaluators.

Acknowledgments

The authors acknowledge the support from the Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from the Maria de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. This research is part of a project that has received funding from the European Union's Horizon Europe research and innovation programme under Grant Agreement No. 101132431 (iDEM Project). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References

[1] S. Gehrmann, E. Clark, T. Sellam, Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text, 2022. arXiv:2202.06935.
[2] N. Grabar, H. Saggion, Evaluation of automatic text simplification: Where are we now, where should we go from here, in: Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, ATALA, Avignon, France, 2022, pp. 453–463. URL: https://aclanthology.org/2022.jeptalnrecital-taln.47.
[3] L. Martin, A. Fan, Éric de la Clergerie, A. Bordes, B. Sagot, Muss: Multilingual unsupervised sentence simplification by mining paraphrases, 2021. arXiv:2005.00352.
[4] F. Alva-Manchego, L. Martin, A. Bordes, C. Scarton, B. Sagot, L. Specia, Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations, 2020. arXiv:2005.00481.
[5] A. Sauberli, F. Holzknecht, P. Haller, S. Deilen, L. Schiffl, S. Hansen-Schirra, S. Ebling, Digital comprehensibility assessment of simplified texts among persons with intellectual disabilities, 2024. arXiv:2402.13094.
[6] F. Gilardi, M. Alizadeh, M. Kubli, Chatgpt outperforms crowd workers for text-annotation tasks, Proceedings of the National Academy of Sciences 120 (2023) e2305016120.
[7] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023. arXiv:2303.16634.
[8] D. Dillion, N. Tandon, Y. Gu, K. Gray, Can ai language models replace human participants?, Trends in Cognitive Sciences 27 (2023) 597–600.
[9] E. Hwang, B. P. Majumder, N. Tandon, Aligning language models to user opinions, 2023. arXiv:2305.14929.
[10] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, T. Hashimoto, Whose opinions do language models reflect?, 2023. arXiv:2303.17548.
[11] S. Liu, X. Zhang, S. Zhang, H. Wang, W. Zhang, Neural machine reading comprehension: Methods and trends, Applied Sciences 9 (2019) 3698.
[12] A. Mullooly, O. Andersen, L. Benedetto, P. Buttery, A. Caines, M. J. F. Gales, Y. Karatay, K. Knill, A. Liusie, V. Raina, S. Taslimipoor, The Cambridge Multiple-Choice Questions Reading Dataset, Cambridge University Press and Assessment, 2023. URL: https://www.repository.cam.ac.uk/handle/1810/358683. doi:10.17863/CAM.102185.
[13] A. Liusie, V. Raina, A. Mullooly, K. Knill, M. J. F. Gales, Analysis of the cambridge multiple-choice questions reading dataset with a focus on candidate response distribution, 2023. arXiv:2306.13047.
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[15] E. H. Jeon, J. Yamashita, L2 reading comprehension and its correlates: A meta-analysis, Language Learning 64 (2014) 160–212.
[16] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2016, pp. 2383–2392.
[17] G. Lai, Q. Xie, H. Liu, Y. Yang, E. Hovy, Race: Large-scale reading comprehension dataset from examinations, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2017, pp. 785–794.
[18] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning, Hotpotqa: A dataset for diverse, explainable multi-hop question answering, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2018, pp. 2369–2380.
[19] W. Yu, Z. Jiang, Y. Dong, J. Feng, Reclor: A reading comprehension dataset requiring logical reasoning, in: International Conference on Learning Representations, USA, 2019.
[20] S. Sugawara, N. Nangia, A. Warstadt, S. Bowman, What makes reading comprehension questions difficult?, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, USA, 2022, pp. 6951–6971.
[21] J. Robinson, C. M. Rytting, D. Wingate, Leveraging large language models for multiple choice question answering, 2023. arXiv:2210.12353.
[22] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How can we know what language models know?, Transactions of the Association for Computational Linguistics 8 (2020) 423–438. URL: https://aclanthology.org/2020.tacl-1.28. doi:10.1162/tacl_a_00324.
[23] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in Neural Information Processing Systems 35 (2022) 22199–22213.
[24] K. Zhou, D. Jurafsky, T. Hashimoto, Navigating the grey area: How expressions of uncertainty and overconfidence affect language models, 2023. arXiv:2302.13439.
[25] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[26] T. M. Cover, Elements of information theory, John Wiley & Sons, USA, 1999.
[27] E. Bonner, R. Lege, E. Frazier, Large language model-based artificial intelligence in the language classroom: Practical ideas for teaching, Teaching English with Technology 23 (2023) 23–41.
[28] B. Laufer, What percentage of text-lexis is essential for comprehension?, Special Language: From Humans Thinking to Thinking Machines (1989) 316.
[29] M. Brysbaert, B. New, Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English, Behavior Research Methods 41 (2009) 977–990.
[30] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel, 1975.
[31] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, Y. Yang, Connecting large language models with evolutionary algorithms yields powerful prompt optimizers, 2024. URL: https://arxiv.org/abs/2309.08532. arXiv:2309.08532.

A. Prompt Examples

Table 4
Examples of designed prompts.

NONE
CONTEXT: I won't pretend being a flight attendant is easy. But since I started the job, I've been everywhere, from the US to Australia. I work with incredible people, I have a lot of time off, and life is never boring - which ...
QUESTION: What does Jack say about attending his job interview?
A) He was surprised at the age range of people there.
B) He made sure he seemed different from the others.
C) He wondered whether he had enough qualifications.
D) He realised there were too many people for the jobs available.
ANSWER:\n

PORTRAY
Answer the following reading comprehension questions as if you are a CEFR B1 level English learner. Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.
{Same as NONE from CONTEXT: to ANSWER:\n}

ESL
You are an ESL teacher. What do you think is the most plausible answer by CEFR B1 level learners to the following reading comprehension test? Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.
{Same as NONE from CONTEXT: to D) He ...}
ANSWER: Given the context and considering that the test takers are at a CEFR B1 level, the most plausible answer they might choose could be:\n

UNCERTAIN
{Same as NONE from CONTEXT: to D) He ...}
ANSWER: I'm not sure because there are some sentences I don't understand, but maybe the answer is:\n

MASK
CONTEXT: I won't [MASK] being a flight [MASK] is easy. But since I started the job, I've been everywhere, from the US to Australia. I work with incredible people, I have a lot of time off, and life is never [MASK] - which ...
QUESTION: What does Jack say about attending his job interview?
A) He was surprised at the age range of people there.
B) He made sure he seemed different from the others.
C) He [MASK] whether he had enough qualifications.
D) He realised there were too many people for the jobs available.
ANSWER:\n