=Paper=
{{Paper
|id=Vol-3894/paper7
|storemode=property
|title=Can LLMs Solve Reading Comprehension Tests as Second Language Learners?
|pdfUrl=https://ceur-ws.org/Vol-3894/paper7.pdf
|volume=Vol-3894
|authors=Akio Hayakawa,Horacio Saggion
|dblpUrl=https://dblp.org/rec/conf/kil/HayakawaS24
}}
==Can LLMs Solve Reading Comprehension Tests as Second Language Learners?==
Akio Hayakawa¹, Horacio Saggion¹
¹ LaSTUS Lab, TALN Research Group, Department of Engineering, Universitat Pompeu Fabra, C/Tànger 122 (08018), Barcelona, Spain
Abstract
The manual evaluation of natural language processing systems is costly and time-consuming, especially when targeting
people with specific attributes as evaluators. Current large language models (LLMs) are reported to outperform humans
at various tasks, and recently have been used as substitutes for human evaluators. LLMs also have shown the ability to
behave as specified in a prompt. This progress raises a fundamental question: can LLMs mimic the behavior of language
learners? In this study, we intentionally weaken LLMs aiming to make them simulate language learners on multiple-choice
reading comprehension tests. By comparing answer distributions from language learners and LLMs, we observe that prompts
designed to weaken the LLMs indeed degrade their performance. However, this degradation does not bridge the gap between
the original LLMs and language learners, thereby highlighting a critical discrepancy between them.
Keywords
Natural Language Processing, Large Language Models, Question Answering, Reading Comprehension
1. Introduction

In the field of Natural Language Processing (NLP), the evaluation of systems is commonly categorized into two approaches: automatic and manual evaluation. Manual evaluation, which is considered more reliable, involves methods ranging from subjective scoring on scales, such as a 5-point rating, to task-based assessments like solving comprehension questions. Despite its reliability, manual evaluation requires greater time and cost investments [1].

The difficulty of conducting manual evaluation significantly increases when targeting individuals with specific attributes, as access to these groups becomes more difficult. This has resulted in the diminished prioritization of their participation, calling into question the trustworthiness of manual evaluation. For instance, in the text simplification task, which aims to make texts more readable and understandable, children, language learners, and people with disabilities are considered ideal evaluators for the simplicity of texts, as they are presumed to benefit most from the simplification [2]. Nevertheless, studies on text simplification have relied on native speakers or people who do not need simplified texts for manual evaluation [3, 4], rarely involving individuals who need simplification, probably due to significant disparities in accessibility to diverse groups. Indeed, Sauberli et al. [5] recently demonstrated subjective differences in perceived text difficulty between people with and without intellectual disabilities, highlighting the importance of their involvement.

Figure 1: Overview of our experimental setup. We investigate whether it is possible to make the next-token probabilities of an LLM closer to the selection distribution by language learners, by weakening the LLM. (The figure contrasts a one-shot NONE prompt, consisting of a context, question, options, and "ANSWER:", with a weakening PORTRAY prompt, and compares the resulting next-token probabilities over options A-D with the learners' selection distribution.)
KiL'24: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
akio.hayakawa@upf.edu (A. Hayakawa); horacio.saggion@upf.edu (H. Saggion)
https://ahaya3776.github.io/ (A. Hayakawa)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Recent advancements in NLP, especially with Large Language Models (LLMs), may address this bottleneck. One line of work has attempted to substitute manual evaluation with assessments conducted by LLMs [6, 7, 8], seeking immediate and inexpensive annotations of higher quality.
Another set of studies has reported that LLMs are capable of emulating a specific persona by including attributes in a prompt [9, 10].

Therefore, we wonder if LLMs could be prompted to serve as substitutes for specific personas. This study specifically focuses on language learners, investigating whether LLMs can mimic their response patterns. This approach could potentially offer a more accessible means of obtaining evaluations for tasks that ideally require responses from specific target groups, such as predicting the difficulty of questions without a pilot pretesting stage, simply by providing their attributes in the prompt.

To judge the mimicability of LLMs, we compare responses to multiple-choice reading comprehension (RC) tests, which have been widely used to measure language comprehension [11], from language learners and NLP systems. Using the CMCQRD dataset [12], a recently released four-choice RC test dataset with selection distributions from language learners, we aim to investigate whether LLM output can closely approximate these distributions. While fine-tuning encoder models is one approach to pursuing distributions closer to those of humans [13], prompting LLMs has the potential to target a broader range of personas, suggesting enhanced applicability.

Figure 1 illustrates the outline of our experimental setup. Given that current models in the NLP field often achieve or even surpass human-level performance on various tasks [14], it is reasonable to presume that LLMs could outperform the average language learner on RC tests. Hence, LLMs need to be weakened to mimic language learners. We try several prompting techniques to degrade LLM performance and analyze their effects. Contrary to our expectations, our preliminary experimental results show that the prompts considered do not lead LLMs to mimic language learners. Furthermore, we observe that the questions LLMs tend to answer incorrectly differ significantly from those that language learners struggle with. This discrepancy suggests a need for deeper analysis when we try to utilize LLMs as a replacement for human evaluation.

2. Related Work

2.1. Human Response to Reading Comprehension Dataset

Reading comprehension (RC) tests have been widely used in psycholinguistic studies to assess how well readers, especially language learners, understand the content of a given text [15]. While these studies have seldom made their original data publicly available, research in natural language processing has made standard datasets available to measure the text comprehension abilities of machines [16, 17], sometimes for specific capabilities such as reasoning in HotpotQA [18] and the use of external knowledge in ReClor [19]. However, these datasets are designed only to measure system performance, not for comparison with human responses. As a result, human responses to RC are absent from these datasets. There is limited research that compares responses from machines and humans, and even these studies typically offer only summarized data [20]. This data shortage has hindered research into machine emulation of human responses.

In contrast to this scarcity, CMCQRD [12] is a unique RC dataset which includes response data from language learners. CMCQRD adopts a multiple-choice setting like many of the RC datasets mentioned above, and includes the distribution of the choices among options. RC tests and participants are categorized based on the CEFR, which is a guideline used to describe the achievements of foreign language learners. Among the six reference levels (A1, A2, B1, B2, C1, C2) of the CEFR, the independent (B1, B2) and proficient (C1, C2) levels are considered in the CMCQRD dataset. In other words, each question in this dataset is labeled with a difficulty level ranging from B1 to C2 according to the CEFR, and also includes the selection distribution by language learners whose proficiency corresponds to these labeled levels. This information enables a detailed analysis of the differences between language learners and machines. Liusie et al. [13] compared outputs from an ELECTRA-based classification model with human responses, reporting low similarity due to the model performing worse than language learners.

2.2. Prompts that Alter LLMs' Behaviour in Question Answering

Retrieving distributions for multiple-choice questions from LLMs involves obtaining not only the final answer but also the probabilities associated with each option. While it is nontrivial to extract an answer or a probability because of the auto-regressive nature of text generation by LLMs, Robinson et al. [21] demonstrated that a multiple-choice prompt can lead to a higher probability of generating option symbols as the next token, especially with one- or few-shot settings. Unlike a traditional cloze prompt, which selects the option with the highest sequence probability without giving other options, a multiple-choice prompt provides all options simultaneously and selects the one with the highest probability for the option symbols.
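To make this multiple-choice prompting setup concrete, the following minimal Python sketch shows how next-token probabilities over option symbols can be obtained from a HuggingFace causal language model; the model name, the prompt text, and the handling of option-symbol tokenization are illustrative assumptions rather than the exact configuration used in this paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM exposing next-token logits would do.
model_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# A multiple-choice prompt: context, question, options, and an option-symbol-prompting word.
prompt = (
    "CONTEXT: ...\n"
    "QUESTION: ...\n"
    "OPTIONS:\n"
    "A) ...\nB) ...\nC) ...\nD) ...\n"
    "ANSWER:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the token after "ANSWER:"

# Map each option symbol to a token id (tokenizers differ; using " A" rather than "A" is an assumption).
option_ids = [tokenizer(f" {s}", add_special_tokens=False).input_ids[-1] for s in "ABCD"]

# Normalize the option-symbol logits with softmax to obtain the model's answer distribution.
option_probs = torch.softmax(next_token_logits[option_ids], dim=-1)
print(dict(zip("ABCD", option_probs.tolist())))
```

The total probability mass that the full next-token distribution places on the option symbols can be read off analogously, as a check of how strongly a given prompt induces option-symbol generation.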
However, even in this setting, it has been reported that LLMs respond less robustly to certain prompts [22, 23]. Utilizing this vulnerability, Santurkar et al. [10] suggested that LLMs can change the distributions of attitude options towards controversial social topics when given prompts that mimic the behavior of a human group with specific attributes. LLMs' behaviour will also change when given an expression of certainty like "Perhaps it's" [24]. This change was observed in response to context-free open-ended questions, highlighting an opportunity for extended research in multiple-choice RC tests.

3. Experimental Setup

The primary objective of this work is to investigate whether LLMs can mimic the responses of language learners in solving multiple-choice RC tests. In this section, we outline our experimental setup, utilizing the CMCQRD dataset [12], which includes responses from at least 100 language learners per question, providing information about answer probability distributions. Our analysis compares the next-token probability on each option by LLMs with the choice patterns of language learners, aiming to understand the extent of LLMs' capability in emulating learner-like understanding in RC tasks.

Assuming that up-to-date LLMs outperform average language learners, degrading these models is needed to bring their output distributions closer to those of language learners. We employ several methods to weaken LLM performance and compare the results to those of the language learners.

Dataset   The CMCQRD dataset consists of 4-choice English RC tests, labeled with difficulty levels ranging from CEFR B1 to C2. A subset of CMCQRD includes responses from non-native English speakers whose proficiency aligns with the difficulty label [12, 13]. We refer to this set of responses as the human distribution.

Table 1
Statistics of the CMCQRD dataset. We use RC tests at B1 and B2 levels with responses.

                        w/o responses          w/ responses
CEFR Level    Num Text   Num QA    Num Text   Num QA    Mode Acc   Avg Acc
B1            5          25        23         115       0.913      0.590
B2            21         160       37         262       0.882      0.594
C1            13         86        12         83        0.880      0.613
C2            3          20        6          42        0.833      0.681

Table 1 shows the statistics of the CMCQRD dataset. The average accuracies of language learners are around 60%, while the accuracies of their mode selections are around 90%. In this experiment, we exclusively use questions at levels B1 and B2 with a human distribution, corresponding to intermediate levels of proficiency. Our focus on these levels is driven by our aim to assess the ability of LLMs to reproduce the challenges faced by language learners who are not fully proficient in reading comprehension.

LLM Settings   Since the outputs of LLMs are auto-regressive and free-form, some techniques are required to increase the likelihood of desired tokens in subsequent outputs. To this end, we employ a multiple-choice prompting approach for RC, as described in Robinson et al. [21]. This approach provides LLMs with a single natural language prompt that concatenates the context, a question, options, and an option-symbol-prompting word, such as "Answer:". We use the resulting next-token probabilities as the answer distribution of the LLM. The logits of the next tokens associated with the option symbols, {A, B, C, D} on 4-choice tests, are normalized using softmax.

We adopt GPT-4o¹ and LLaMa-2-70B [25] with one-shot prompting. We run LLaMa-2-70B using the HuggingFace library with 4-bit quantization.² The temperature parameter is set to 1.0 for both models.

¹ https://openai.com/index/hello-gpt-4o/
² https://huggingface.co/meta-llama/Llama-2-70b-hf

Evaluation   To compare human and LLM outputs, we use mode accuracy, average accuracy, and KL divergence following Liusie et al. [13], and also the correct/wrong F1 score. Below is the description of these metrics.

1. Mode Accuracy: how frequently the most plausible symbol by the LLM is the correct answer, denoted as
   Mode Accuracy = 𝔼[argmax_y(p^LLM) = y^ans],
   where p represents the probabilities for each option and y^ans is the correct option.
2. Average Accuracy: how frequently the correct option is selected on average by the LLM, denoted as
   Average Accuracy = 𝔼[y^LLM = y^ans].
3. KL Divergence: the similarity between two distributions [26], denoted as
   KL Divergence = ∑_o l_o log(l_o / h_o),
   where o represents an option selection, with the LLM and human distributions fixed to l and h, respectively.
4. Correct/Wrong F1: the macro-averaged F1 score focused on question-wise correct and wrong consistency of the mode options, denoted as
   Correct/Wrong F1 = (1/2)(F1_correct + F1_wrong),
   where each F1 score is calculated based on the elements of a confusion matrix, such as
   TP_correct = ∑_i [(y_i^LLM = y_i^ans) ∧ (y_i^Human = y_i^ans)] and
   FP_wrong = ∑_i [(y_i^LLM ≠ y_i^ans) ∧ (y_i^Human = y_i^ans)].

Furthermore, we calculate the summation of the probabilities for option symbols appearing as the next token to evaluate the effectiveness of the prompts.
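As a concrete illustration of these metrics, the sketch below computes them on toy data, assuming per-question option distributions from the LLM and from the learners are available as arrays; the variable names and numbers are hypothetical, and the Correct/Wrong F1 follows our reading of the definition above (human correctness as the reference, LLM correctness as the prediction).

```python
import numpy as np

# Hypothetical per-question distributions over the options {A, B, C, D}.
llm_dist = np.array([[0.90, 0.05, 0.03, 0.02],
                     [0.10, 0.70, 0.15, 0.05]])
human_dist = np.array([[0.55, 0.20, 0.15, 0.10],
                       [0.30, 0.40, 0.20, 0.10]])
answers = np.array([0, 2])  # index of the correct option for each question

# Mode Accuracy: how often the LLM's most probable option equals the correct answer.
mode_acc = np.mean(llm_dist.argmax(axis=1) == answers)

# Average Accuracy: expected rate of selecting the correct option, i.e. its probability mass.
avg_acc = np.mean(llm_dist[np.arange(len(answers)), answers])

# KL Divergence per question, then averaged: sum_o l_o * log(l_o / h_o), LLM fixed to l, human to h.
eps = 1e-12
kl = np.mean(np.sum(llm_dist * np.log((llm_dist + eps) / (human_dist + eps)), axis=1))

# Correct/Wrong F1: macro-average of F1 for the "correct" and "wrong" classes on mode answers.
llm_correct = llm_dist.argmax(axis=1) == answers
human_correct = human_dist.argmax(axis=1) == answers

def f1(pred, gold):
    tp = np.sum(pred & gold)
    fp = np.sum(pred & ~gold)
    fn = np.sum(~pred & gold)
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0

cw_f1 = 0.5 * (f1(llm_correct, human_correct) + f1(~llm_correct, ~human_correct))
print(mode_acc, avg_acc, kl, cw_f1)
```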
Table 2
Results on the CMCQRD dataset. Values for KL and C/W F1 are computed against the Human language learners above.

                                          B1                                              B2
System        Prompt       Mode Acc  Avg Acc  KL↓    C/W F1↑  Sum Prob.    Mode Acc  Avg Acc  KL↓    C/W F1↑  Sum Prob.
Human         -            0.913     0.585    -      -        -            0.885     0.592    -      -        -
GPT-4o        NONE         0.974     0.974    0.570  0.552    0.994        0.931     0.929    0.576  0.633    0.971
GPT-4o        PORTRAY      0.974     0.971    0.566  0.552    0.988        0.927     0.927    0.580  0.606    0.975
GPT-4o        ESL          0.965     0.964    0.563  0.544    0.895        0.927     0.926    0.554  0.651    0.842
GPT-4o        UNCERTAIN    0.713     0.719    0.795  0.471    0.155        0.828     0.805    0.711  0.572    0.228
GPT-4o        MASK         0.922     0.918    0.562  0.512    0.868        0.851     0.852    0.578  0.608    0.798
LLaMa-2-70b   NONE         0.930     0.839    0.338  0.518    0.993        0.854     0.756    0.354  0.611    0.992
LLaMa-2-70b   PORTRAY      0.930     0.831    0.320  0.518    0.984        0.847     0.740    0.332  0.604    0.980
LLaMa-2-70b   ESL          0.922     0.750    0.211  0.512    0.973        0.851     0.674    0.263  0.658    0.969
LLaMa-2-70b   UNCERTAIN    0.922     0.646    0.163  0.512    0.966        0.839     0.556    0.226  0.646    0.971
LLaMa-2-70b   MASK         0.843     0.750    0.294  0.553    0.988        0.755     0.644    0.391  0.533    0.983
Prompt Design   We employ four types of weakening prompt designs, in addition to the plain NONE prompt, as listed below. See Appendix A for examples.

• NONE: Only the context, question, and candidate answers are given.
• PORTRAY: Similar to Santurkar et al. [10], a role is assigned at the beginning of the prompt, for example, "Answer the following reading comprehension question as if you are a CEFR B1 level English learner.", followed by a description of the level defined by the CEFR.³
• ESL: Bonner et al. [27] suggested that LLMs seem to have the ability to control outputs based on a targeted CEFR level provided in a prompt. We ask LLMs the most plausible answer from language learners at a specific CEFR level, such as "What do you think is the most plausible answer by CEFR B1 level learners to the following reading comprehension test?". In addition, we inject an explanation like "Given the context and considering that the test takers are at a CEFR B1 level, the most plausible answer they might choose could be" after "ANSWER:".
• UNCERTAIN: As reported in Zhou et al. [24], the expression of uncertainty will change LLMs' behavior. We inject an expression like "I'm not sure because there are some sentences I don't understand, but maybe the answer is," after "ANSWER:".
• MASK: Laufer [28] argued that language learners need to know 95% of the vocabulary in a text to comprehend its content. To simulate the scenario where 5% of the vocabulary is not known, the 5% least frequent words within a context are masked. Infrequent words in the question and options are also masked based on this threshold. The word frequency is calculated based on SUBTLEXus [29]. (A rough sketch of this masking procedure is given after this list.)

³ https://www.coe.int/en/web/common-european-framework-reference-languages/cefr-descriptors
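Below is a rough sketch of the MASK procedure, under the assumptions that SUBTLEXus frequencies are available as a simple word-to-frequency dictionary and that the 5% threshold is applied over word types in the context; the tokenization, the toy frequency values, and the handling of unknown words are simplified, hypothetical choices rather than the exact implementation.

```python
import re

def mask_infrequent(text, freq, mask_ratio=0.05, mask_token="[MASK]"):
    """Mask roughly the least frequent 5% of word types in `text`,
    using an external word-frequency list such as SUBTLEXus."""
    words = re.findall(r"[A-Za-z']+", text)
    # Rank word types from rarest to most frequent; unknown words count as rarest.
    types = sorted({w.lower() for w in words}, key=lambda w: freq.get(w, 0.0))
    n_mask = max(1, int(len(types) * mask_ratio))
    to_mask = set(types[:n_mask])
    # The same threshold would also be applied to words in the question and options.
    return re.sub(r"[A-Za-z']+",
                  lambda m: mask_token if m.group(0).lower() in to_mask else m.group(0),
                  text)

# Toy usage with a purely hypothetical frequency dictionary (not real SUBTLEXus values).
toy_freq = {"i": 100000, "won't": 2000, "pretend": 800, "being": 5000,
            "a": 90000, "flight": 700, "attendant": 40, "is": 80000, "easy": 3000}
print(mask_infrequent("I won't pretend being a flight attendant is easy.", toy_freq))
```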
4. Results

Table 2 shows the performance of LLMs on CMCQRD given each prompt. Overall, contrary to our expectations, the results reveal the limited ability of LLMs to mimic language learners when solving multiple-choice RC tests.

LLMs tend not to be distracted.   First, the distributions by LLMs, especially from GPT-4o, show more skewness with NONE compared to those from humans. In other words, compared to the small gap between Human and the LLM in mode accuracy, the average accuracy shows a much wider gap. For GPT-4o, there is almost no difference between these accuracies, which demonstrates that the most plausible next token is only one option symbol regardless of its correctness.

Prompts affect outputs differently across LLMs.   The results show the difference in the function of prompts between GPT-4o and LLaMa-2-70b. For LLaMa-2-70b, the sum of the probabilities for option symbols exceeds 95% across all prompts, indicating that the prompts effectively induce the generation of these symbols. On the other hand, GPT-4o behaves differently, particularly with the UNCERTAIN prompt, where the probability of generating non-symbol tokens is considerable. This shows that the function of prompts differs across LLMs.

LLaMa-2-70b is better than GPT-4o in weakening.   A key distinction between responses from language learners and LLMs is that while both show high Mode Accuracy, LLMs demonstrate substantially higher Average Accuracy compared to humans, indicating that distributions by LLMs are generally skewed. Therefore, an LLM suited for weakening can maintain Mode Accuracy while reducing Average Accuracy. In this aspect, LLaMa-2-70b is better than GPT-4o. GPT-4o shows minimal changes in Average Accuracy even with weakening prompts, including UNCERTAIN, which drops the accuracies and Sum Probability. Thus, its distributions remain distinct from language learners, as reflected by the persistently high KL divergence. In contrast, LLaMa-2-70b shows the ability to reduce Average Accuracy while maintaining Mode Accuracy, especially with the ESL and UNCERTAIN prompts.

Table 3
Correlation between the gap and complexity measures. N, H, and U mean NONE, Human, and UNCERTAIN, respectively. * means statistical significance at p < 0.05.

                             N-H      N-U      Avg
Δ Average Accuracy    B1     0.254    0.193    -
                      B2     0.164    0.200    -
Passage Length        B1     -0.14    0.01     342.2
                      B2     0.14*    0.04     656.7
FKGL                  B1     0.31*    0.23*    9.69
                      B2     0.05     -0.02    9.22
Word Freq             B1     -0.18    -0.08    6.53
(per 1k words)        B2     -0.01    0.13*    6.44
Prompt design plays a crucial role.   Prompt designs markedly influence the outputs from LLMs, as exemplified by the difference between the PORTRAY and ESL results on LLaMa-2-70b. While both prompts are designed to emulate language learner-like outputs and include the description of the targeted CEFR level, PORTRAY fails to weaken performance, whereas ESL leads to reductions in Average Accuracy and KL Divergence. This suggests that there is much room for prompt engineering in other designs, including UNCERTAIN.

Language learners and LLMs mistake different questions.   Whereas KL divergence measures the similarity between two distributions, the Correct/Wrong F1 score directly measures the consistency of the most plausible answers by humans and LLMs. LLMs show a low F1 score regardless of the prompt given, indicating a discrepancy between questions that lead to human errors and those that lead to LLM errors. LLaMa-2-70b shows the largest drop in KL divergence with the UNCERTAIN prompt compared to NONE. However, this does not correspond with a substantial improvement in the F1 score, suggesting that the LLM does not mimic human error patterns effectively. Since distributions by LLMs are generally skewed compared to those by language learners, the reduction of KL divergence is achievable by simply increasing the temperature parameter. This result reveals the importance of not only comparing distributions but also examining the consistency of the mode answers to mimic humans.

5. Discussion

Our results so far seem to demonstrate the inability of LLMs to mimic human language learners when solving RC tests, even when provided with weakening prompts. In particular, we identify differences in the questions that language learners and the LLMs tend to answer incorrectly. In this section, we turn our attention to an analysis of the underlying factors for these discrepancies.

We analyze the influence of the complexity of the context on the accuracy gaps between language learners and the LLM (NONE-Human), and also on the gaps between the LLM with and without a weakening prompt (NONE-UNCERTAIN). We select LLaMa-2-70b because of its ability to be weakened. Among the features used in prior research by Sugawara et al. [20], we select Passage Length, FKGL [30], and Word Frequency as indicators of complexity. Correlations are measured between these indicators and the accuracy gaps for each individual question.

Table 3 shows the correlations, some of which are statistically significant. For Passage Length, there is a weak positive correlation with the gap between NONE and Human at the B2 level, which means that the longer the context, the harder it is for language learners to answer correctly compared to the LLM. This implies that a longer context may hinder B2 level language learners from finding the evidence needed to answer more than it does the LLM. FKGL, a readability metric based on the number of words and syllables per sentence, shows a weak-to-moderate positive correlation with the gap between the LLM and humans, and also with the gap between the LLM with and without the uncertainty prompt. Since FKGL is designed to show a lower value on easier texts, these statistically significant gaps imply that the LLM shows a higher accuracy in more complex contexts. The UNCERTAIN prompt can slightly smooth this trend, but it does not enable the LLM to emulate the tendency of language learners. Finally, for Word Frequency, there is a weak positive correlation with the gap between NONE and UNCERTAIN at the B2 level. This may imply that UNCERTAIN weakens LLMs more when a context is composed of more common words.

Overall, these surface-level complexity indicators are not sufficient to explain the difference between language learners and LLMs. We reserve deeper analysis, such as semantic considerations, for our further research.
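For reference, the surface-level indicators above can be computed per question context roughly as follows; this sketch uses the standard Flesch-Kincaid Grade Level formula with a naive vowel-group syllable counter, which is a simplification rather than the exact implementation behind Table 3.

```python
import re

def count_syllables(word):
    # Naive heuristic: count vowel groups as syllables (a rough stand-in for a real syllabifier).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def passage_length(text):
    # Passage Length: the number of word tokens in the context.
    return len(re.findall(r"[A-Za-z']+", text))

def fkgl(text):
    # Flesch-Kincaid Grade Level: 0.39 * words/sentence + 11.8 * syllables/word - 15.59.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59

print(fkgl("A friend once asked me why I travel when I can see everything on the television."))
```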
6. Conclusion

In conclusion, our research reveals that LLMs do not behave as second language learners, even with the potentially performance-weakening prompts we provide. We also observe that the performance varies depending on the model and prompts used, even though a limited set of models and prompts is considered. Expanding the variety of these elements, including prompts with more sophisticated approaches such as chain-of-thought [23] and automatic prompt tuning [31], will be critical for a more comprehensive evaluation of the mimicability.

Our findings demonstrate discrepancies between language learners and LLMs in terms of the easiness of questions, highlighting the necessity for micro-level analysis. Nonetheless, the limited size of the CMCQRD dataset used in this research presents challenges in drawing comprehensive conclusions. The development of datasets incorporating diverse personas beyond language learners is essential when trying to use LLMs as a complement to human evaluators.

Acknowledgments

The authors acknowledge the support from Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from Maria de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. This research is part of a project that has received funding from the European Union's Horizon Europe research and innovation program under Grant Agreement No. 101132431 (iDEM Project). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References

[1] S. Gehrmann, E. Clark, T. Sellam, Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text, 2022. arXiv:2202.06935.
[2] N. Grabar, H. Saggion, Evaluation of automatic text simplification: Where are we now, where should we go from here, in: Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, ATALA, Avignon, France, 2022, pp. 453–463. URL: https://aclanthology.org/2022.jeptalnrecital-taln.47.
[3] L. Martin, A. Fan, Éric de la Clergerie, A. Bordes, B. Sagot, Muss: Multilingual unsupervised sentence simplification by mining paraphrases, 2021. arXiv:2005.00352.
[4] F. Alva-Manchego, L. Martin, A. Bordes, C. Scarton, B. Sagot, L. Specia, Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations, 2020. arXiv:2005.00481.
[5] A. Sauberli, F. Holzknecht, P. Haller, S. Deilen, L. Schiffl, S. Hansen-Schirra, S. Ebling, Digital comprehensibility assessment of simplified texts among persons with intellectual disabilities, 2024. arXiv:2402.13094.
[6] F. Gilardi, M. Alizadeh, M. Kubli, Chatgpt outperforms crowd workers for text-annotation tasks, Proceedings of the National Academy of Sciences 120 (2023) e2305016120.
[7] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023. arXiv:2303.16634.
[8] D. Dillion, N. Tandon, Y. Gu, K. Gray, Can ai language models replace human participants?, Trends in Cognitive Sciences 27 (2023) 597–600.
[9] E. Hwang, B. P. Majumder, N. Tandon, Aligning language models to user opinions, 2023. arXiv:2305.14929.
[10] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, T. Hashimoto, Whose opinions do language models reflect?, 2023. arXiv:2303.17548.
[11] S. Liu, X. Zhang, S. Zhang, H. Wang, W. Zhang, Neural machine reading comprehension: Methods and trends, Applied Sciences 9 (2019) 3698.
[12] A. Mullooly, O. Andersen, L. Benedetto, P. Buttery, A. Caines, M. J. F. Gales, Y. Karatay, K. Knill, A. Liusie, V. Raina, S. Taslimipoor, The Cambridge Multiple-Choice Questions Reading Dataset, Cambridge University Press and Assessment, 2023. URL: https://www.repository.cam.ac.uk/handle/1810/358683. doi:10.17863/CAM.102185.
[13] A. Liusie, V. Raina, A. Mullooly, K. Knill, M. J. F. Gales, Analysis of the cambridge multiple-choice questions reading dataset with a focus on candidate response distribution, 2023. arXiv:2306.13047.
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[15] E. H. Jeon, J. Yamashita, L2 reading comprehension and its correlates: A meta-analysis, Language Learning 64 (2014) 160–212.
[16] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2016, pp. 2383–2392.
[17] G. Lai, Q. Xie, H. Liu, Y. Yang, E. Hovy, Race: Large-scale reading comprehension dataset from examinations, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2017, pp. 785–794.
[18] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning, Hotpotqa: A dataset for diverse, explainable multi-hop question answering, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2018, pp. 2369–2380.
[19] W. Yu, Z. Jiang, Y. Dong, J. Feng, Reclor: A reading comprehension dataset requiring logical reasoning, in: International Conference on Learning Representations, USA, 2019.
[20] S. Sugawara, N. Nangia, A. Warstadt, S. Bowman, What makes reading comprehension questions difficult?, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, USA, 2022, pp. 6951–6971.
[21] J. Robinson, C. M. Rytting, D. Wingate, Leveraging large language models for multiple choice question answering, 2023. arXiv:2210.12353.
[22] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How can we know what language models know?, Transactions of the Association for Computational Linguistics 8 (2020) 423–438. URL: https://aclanthology.org/2020.tacl-1.28. doi:10.1162/tacl_a_00324.
[23] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in Neural Information Processing Systems 35 (2022) 22199–22213.
[24] K. Zhou, D. Jurafsky, T. Hashimoto, Navigating the grey area: How expressions of uncertainty and overconfidence affect language models, 2023. arXiv:2302.13439.
[25] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[26] T. M. Cover, Elements of information theory, John Wiley & Sons, USA, 1999.
[27] E. Bonner, R. Lege, E. Frazier, Large language model-based artificial intelligence in the language classroom: Practical ideas for teaching, Teaching English with Technology 23 (2023) 23–41.
[28] B. Laufer, What percentage of text-lexis is essential for comprehension?, Special language: From humans thinking to thinking machines (1989) 316.
[29] M. Brysbaert, B. New, Moving beyond kučera and francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for american english, Behavior research methods 41 (2009) 977–990.
[30] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel, 1975.
[31] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, Y. Yang, Connecting large language models with evolutionary algorithms yields powerful prompt optimizers, 2024. URL: https://arxiv.org/abs/2309.08532. arXiv:2309.08532.
A. Prompt Examples
Table 4
Examples of designed prompts.
NONE
CONTEXT: I won’t pretend being a flight attendant is easy. But since I started the
job, I’ve been everywhere, from the US to Australia. I work with incredible people, I
have a lot of time off, and life is never boring - which ...
QUESTION: What does Jack say about attending his job interview?
A) He was surprised at the age range of people there.
B) He made sure he seemed different from the others.
C) He wondered whether he had enough qualifications.
D) He realised there were too many people for the jobs available.
ANSWER:\n
PORTRAY
Answer the following reading comprehension questions as if you are a CEFR B1
level English learner. Learners at this level can understand the main points of clear
standard input on familiar matters regularly encountered in work, school, leisure,
etc. But sometimes it may be difficult to understand the main ideas of complex text
on both concrete and abstract topics, including technical discussions in his/her field
of specialisation.
{Same as NONE from CONTEXT: to ANSWER:\n }
ESL
You are an ESL teacher. What do you think is the most plausible answer by CEFR
B1 level learners to the following reading comprehension test? Learners at this
level can understand the main points of clear standard input on familiar matters
regularly encountered in work, school, leisure, etc. But sometimes it may be difficult
to understand the main ideas of complex text on both concrete and abstract topics,
including technical discussions in his/her field of specialisation.
{Same as NONE from CONTEXT: to D) he ...}
ANSWER:
Given the context and considering that the test takers are at a CEFR B1 level, the
most plausible answer they might choose could be:\n
UNCERTAIN
{Same as NONE from CONTEXT: to D) he ...}
ANSWER:
I’m not sure because there are some sentences I don’t understand, but maybe the
answer is:\n
MASK
CONTEXT: I won’t [MASK] being a flight [MASK] is easy. But since I started the job,
I’ve been everywhere, from the US to Australia. I work with incredible people, I have
a lot of time off, and life is never [MASK] - which ...
QUESTION: What does Jack say about attending his job interview?
A) He was surprised at the age range of people there.
B) He made sure he seemed different from the others.
C) He [MASK] whether he had enough qualifications.
D) He realised there were too many people for the jobs available.
ANSWER:\n