Can LLMs Solve Reading Comprehension Tests as Second Language Learners?

Akio Hayakawa1, Horacio Saggion1
1 LaSTUS Lab, TALN Research Group, Department of Engineering, Universitat Pompeu Fabra, C/Tànger 122 (08018), Barcelona, Spain
akio.hayakawa@upf.edu (A. Hayakawa); horacio.saggion@upf.edu (H. Saggion)
https://ahaya3776.github.io/ (A. Hayakawa)

KiL'24: Workshop on Knowledge-infused Learning, co-located with the 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain

Abstract
The manual evaluation of natural language processing systems is costly and time-consuming, especially when targeting people with specific attributes as evaluators. Current large language models (LLMs) are reported to outperform humans at various tasks, and have recently been used as substitutes for human evaluators. LLMs have also shown the ability to behave as specified in a prompt. This progress raises a fundamental question: can LLMs mimic the behavior of language learners? In this study, we intentionally weaken LLMs, aiming to make them simulate language learners on multiple-choice reading comprehension tests. By comparing answer distributions from language learners and LLMs, we observe that prompts designed to weaken the LLMs indeed degrade their performance. However, this degradation does not bridge the gap between the original LLMs and language learners, thereby highlighting a critical discrepancy between them.

Keywords
Natural Language Processing, Large Language Models, Question Answering, Reading Comprehension

1. Introduction

In the field of Natural Language Processing (NLP), the evaluation of systems is commonly categorized into two approaches: automatic and manual evaluation. Manual evaluation, which is considered more reliable, involves methods ranging from subjective scoring on scales, such as a 5-point rating, to task-based assessments like solving comprehension questions. Despite its reliability, manual evaluation requires greater time and cost investments [1].

The difficulty of conducting manual evaluation significantly increases when targeting individuals with specific attributes, as access to these groups becomes more difficult. This has resulted in the diminished prioritization of their participation, calling into question the trustworthiness of manual evaluation. For instance, in the text simplification task, which aims to make texts more readable and understandable, children, language learners, and people with disabilities are considered ideal evaluators for the simplicity of texts, as they are presumed to benefit most from the simplification [2]. Nevertheless, studies on text simplification have relied on native speakers or people who do not need simplified texts for manual evaluation [3, 4], rarely involving individuals who need simplification, probably due to significant disparities in accessibility to diverse groups. Indeed, Sauberli et al. [5] recently demonstrated subjective differences in perceived text difficulty between people with and without intellectual disabilities, highlighting the importance of their involvement.

[Figure 1: Overview of our experimental setup. We investigate whether it is possible to make the next-token probabilities of an LLM closer to the selection distribution of language learners by weakening the LLM. The figure contrasts a plain one-shot multiple-choice prompt (NONE) with a weakening prompt (PORTRAY) over a CONTEXT/QUESTION/OPTIONS/ANSWER layout, and compares the resulting next-token probabilities over options A-D with the learners' selection distribution.]
Recent advancements in NLP, especially with Large Language Models (LLMs), may address this bottleneck. One line of work has attempted to substitute manual evaluation with assessments conducted by LLMs [6, 7, 8], seeking immediate and inexpensive annotations of higher quality. Another set of studies has reported that LLMs are capable of emulating a specific persona when attributes are included in a prompt [9, 10].

Therefore, we wonder if LLMs could be prompted to serve as substitutes for specific personas. This study specifically focuses on language learners, investigating whether LLMs can mimic their response patterns. This approach could potentially offer a more accessible means of obtaining evaluations for tasks that ideally require responses from specific target groups, for example predicting the difficulty of questions without a pilot pretesting stage, simply by providing the target group's attributes in the prompt.

To judge the mimicability of LLMs, we compare responses to multiple-choice reading comprehension (RC) tests, which have been widely used to measure language comprehension [11], from language learners and NLP systems. Using the CMCQRD dataset [12], a recently released four-choice RC test dataset with selection distributions from language learners, we aim to investigate whether LLM output can closely approximate these distributions. While fine-tuning encoder models is one approach to pursuing distributions closer to those of humans [13], prompting LLMs has the potential to target a broader range of personas, suggesting enhanced applicability.
Figure 1 illustrates the outline of our experimental setup. Given that current models in the NLP field often achieve or even surpass human-level performance on various tasks [14], it is reasonable to presume that LLMs could outperform the average language learner on RC tests. Hence, LLMs need to be weakened to mimic language learners. We try several prompting techniques to degrade LLM performance and analyze their effects.

Contrary to our expectations, our preliminary experimental results show that the prompts considered do not lead LLMs to mimic language learners. Furthermore, we observe that the questions LLMs tend to answer incorrectly differ significantly from those that language learners struggle with. This discrepancy suggests a need for deeper analysis when we try to utilize LLMs as a replacement for human evaluation.

2. Related Work

2.1. Human Response to Reading Comprehension Datasets

Reading comprehension (RC) tests have been widely used in psycholinguistic studies to assess how well readers, especially language learners, understand the content of a given text [15]. While these studies have seldom made their original data publicly available, research in natural language processing has made standard datasets available to measure the text comprehension abilities of machines [16, 17], sometimes for specific capabilities such as reasoning in HotpotQA [18] and the use of external knowledge in ReClor [19]. However, these datasets are designed only to measure system performance, not for comparison with human responses. As a result, human responses to RC are absent from these datasets. There is limited research that compares responses from machines and humans, and even these studies typically offer only summarized data [20]. This data shortage has hindered research into machine emulation of human responses.

In contrast to this scarcity, CMCQRD [12] is a unique RC dataset which includes response data from language learners. CMCQRD adopts a multiple-choice setting like many of the RC datasets mentioned above, and includes the distribution of choices among options. RC tests and participants are categorized based on the CEFR, a guideline used to describe the achievements of foreign language learners. Among the six reference levels of the CEFR (A1, A2, B1, B2, C1, C2), the independent (B1, B2) and proficient (C1, C2) levels are considered in the CMCQRD dataset. In other words, each question in this dataset is labeled with a difficulty level ranging from B1 to C2 according to the CEFR, and also includes the selection distribution of language learners whose proficiency corresponds to these labeled levels. This information enables a detailed analysis of the differences between language learners and machines. Liusie et al. [13] compared outputs from an ELECTRA-based classification model with human responses, reporting low similarity due to the model performing worse than language learners.

2.2. Prompts that Alter LLMs' Behaviour in Question Answering

Retrieving distributions for multiple-choice questions from LLMs involves obtaining not only the final answer but also the probabilities associated with each option. While it is nontrivial to extract an answer or a probability because of the auto-regressive nature of text generation by LLMs, Robinson et al. [21] demonstrated that a multiple-choice prompt can lead to a higher probability of generating option symbols as the next token, especially in one- or few-shot settings. Unlike a traditional cloze prompt, which selects the option whose full sequence has the highest probability without presenting the other options, a multiple-choice prompt provides all options simultaneously and selects the one whose option symbol has the highest probability.

However, even in this setting, it has been reported that LLMs respond less robustly to certain prompts [22, 23]. Utilizing this vulnerability, Santurkar et al. [10] suggested that LLMs change their distributions over attitude options towards controversial social topics when given prompts that mimic the behavior of a human group with specific attributes. LLMs' behaviour also changes when given an expression of uncertainty such as "Perhaps it's" [24]. This change was observed in response to context-free open-ended questions, highlighting an opportunity for extended research on multiple-choice RC tests.

3. Experimental Setup

The primary objective of this work is to investigate whether LLMs can mimic the responses of language learners in solving multiple-choice RC tests.
In this section, we outline our experimental setup, utilizing the CMCQRD dataset [12], which includes responses from at least 100 language learners per question, providing information about answer probability distributions. Our analysis compares the next-token probability on each option by LLMs with the choice patterns of language learners, aiming to understand the extent of LLMs' capability in emulating learner-like understanding in RC tasks.

Assuming that up-to-date LLMs outperform average language learners, degrading these models is needed to bring their output distributions closer to those of language learners. We employ several methods to weaken LLM performance and compare the results with those of the language learners.

Dataset. The CMCQRD dataset consists of 4-choice English RC tests, labeled with difficulty levels ranging from CEFR B1 to C2. A subset of CMCQRD includes responses from non-native English speakers whose proficiency aligns with the difficulty label [12, 13]. We refer to this set of responses as the human distribution. Table 1 shows the statistics of the CMCQRD dataset. The average accuracies of language learners are around 60%, while the accuracies of their mode selections are around 90%. In this experiment, we exclusively use questions at levels B1 and B2 with a human distribution, corresponding to intermediate levels of proficiency. Our focus on these levels is driven by our aim to assess the ability of LLMs to reproduce the challenges faced by language learners who are not fully proficient in reading comprehension.

Table 1
Statistics of the CMCQRD dataset. We use RC tests at the B1 and B2 levels with responses.

CEFR Level | w/o responses: Num Text / Num QA | w/ responses: Num Text / Num QA | Mode Acc | Avg Acc
B1 | 5 / 25 | 23 / 115 | 0.913 | 0.590
B2 | 21 / 160 | 37 / 262 | 0.882 | 0.594
C1 | 13 / 86 | 12 / 83 | 0.880 | 0.613
C2 | 3 / 20 | 6 / 42 | 0.833 | 0.681

LLM Settings. Since the outputs of LLMs are auto-regressive and free-form, some techniques are required to increase the likelihood of the desired tokens in subsequent outputs. To this end, we employ a multiple-choice prompting approach for RC, as described in Robinson et al. [21]. This approach provides LLMs with a single natural language prompt that concatenates the context, a question, the options, and an option-symbol-prompting word, such as "Answer:". We use the next-token probabilities as the LLM's answer distribution: the logits of the next tokens associated with the option symbols, {A, B, C, D} on 4-choice tests, are normalized using softmax.

We adopt GPT-4o1 and LLaMa-2-70B [25] with one-shot prompting. We run LLaMa-2-70B using the HuggingFace library with 4-bit quantization.2 The temperature parameter is set to 1.0 for both models.

1 https://openai.com/index/hello-gpt-4o/
2 https://huggingface.co/meta-llama/Llama-2-70b-hf
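To make this step concrete, the following is a minimal sketch of the multiple-choice probability extraction described above, written against the HuggingFace transformers API for a local causal LM. It is illustrative only: the prompt layout, the assumption that each option symbol maps to a single token, and the loading details are our simplifications rather than the exact configuration used in the experiments (which ran LLaMa-2-70B with 4-bit quantization, and obtained the corresponding quantities for GPT-4o through its API).

```python
# Sketch: extract an answer distribution over option symbols {A, B, C, D}
# from the next-token logits of a causal LM, following the multiple-choice
# prompting idea of Robinson et al. [21]. Model loading and prompt layout
# are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-70b-hf"  # the paper loads this with 4-bit quantization

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def option_distribution(context: str, question: str, options: list[str]) -> dict:
    """Return softmax-normalized probabilities over the option symbols A-D."""
    symbols = ["A", "B", "C", "D"]
    prompt = (
        f"CONTEXT: {context}\n"
        f"QUESTION: {question}\n"
        + "\n".join(f"{s}) {o}" for s, o in zip(symbols, options))
        + "\nANSWER:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # Assumes each option symbol (with a leading space) maps to a single token.
    ids = [tokenizer.encode(" " + s, add_special_tokens=False)[0] for s in symbols]
    probs = torch.softmax(logits[ids], dim=-1)  # normalize over A-D only
    return dict(zip(symbols, probs.tolist()))
```

The one-shot example included in the real prompts is omitted here for brevity; the weakening prompt designs described later modify the same basic layout.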
Evaluation. To compare human and LLM outputs, we use mode accuracy, average accuracy, and KL divergence, following Liusie et al. [13], as well as a correct/wrong F1 score. These metrics are described below.

1. Mode Accuracy: how frequently the most plausible symbol according to the LLM is the correct answer, denoted as
$\mathrm{Mode\ Accuracy} = \mathbb{E}\left[\operatorname{argmax}_{y}(p^{\mathrm{LLM}}) = y^{\mathrm{ans}}\right]$,
where $p$ represents the probabilities for each option and $y^{\mathrm{ans}}$ is the correct option.

2. Average Accuracy: how frequently the correct option is selected on average by the LLM, denoted as
$\mathrm{Average\ Accuracy} = \mathbb{E}\left[y^{\mathrm{LLM}} = y^{\mathrm{ans}}\right]$.

3. KL Divergence: the similarity between two distributions [26], denoted as
$\mathrm{KL\ Divergence} = \sum_{o} l_{o} \log \frac{l_{o}}{h_{o}}$,
where $o$ represents an option, and the LLM and human distributions are denoted by $l$ and $h$, respectively.

4. Correct/Wrong F1: the macro-averaged F1 score focused on question-wise correct and wrong consistency of the mode options, denoted as
$\mathrm{Correct/Wrong\ F1} = \frac{1}{2}(\mathrm{F1}_{\mathrm{correct}} + \mathrm{F1}_{\mathrm{wrong}})$,
where each F1 score is calculated from the elements of a confusion matrix, such as
$\mathrm{TP}_{\mathrm{correct}} = \sum_{i} \left[(y_{i}^{\mathrm{LLM}} = y_{i}^{\mathrm{ans}}) \wedge (y_{i}^{\mathrm{Human}} = y_{i}^{\mathrm{ans}})\right]$ and
$\mathrm{FP}_{\mathrm{wrong}} = \sum_{i} \left[(y_{i}^{\mathrm{LLM}} \neq y_{i}^{\mathrm{ans}}) \wedge (y_{i}^{\mathrm{Human}} = y_{i}^{\mathrm{ans}})\right]$.

Furthermore, we calculate the sum of the probabilities of the option symbols appearing as the next token to evaluate the effectiveness of the prompts.
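For reference, here is a small sketch of how the four comparison metrics could be computed once, for each question, the LLM distribution over A-D, the human selection distribution, and the gold option are available. The data layout, the per-question averaging of the KL divergence, and all names are our assumptions rather than the paper's implementation.

```python
# Sketch: the four comparison metrics, computed per level (B1 or B2) from
# per-question LLM distributions, human distributions, and gold answers.
# Data layout and names are illustrative assumptions.
import math

def argmax_option(dist: dict) -> str:
    return max(dist, key=dist.get)

def evaluate(questions: list[dict]) -> dict:
    """Each item: {"llm": {...}, "human": {...}, "answer": "B"} with A-D keys."""
    eps = 1e-12
    mode_acc = avg_acc = kl = 0.0
    tp = fp = fn = tn = 0  # "correct" class: the mode answer hits the gold option
    for q in questions:
        llm, human, ans = q["llm"], q["human"], q["answer"]
        mode_acc += argmax_option(llm) == ans
        avg_acc += llm[ans]  # probability mass on the gold option
        kl += sum(l * math.log((l + eps) / (human[o] + eps)) for o, l in llm.items())
        llm_right = argmax_option(llm) == ans
        human_right = argmax_option(human) == ans
        tp += llm_right and human_right
        fp += llm_right and not human_right   # false positive for the "correct" class
        fn += (not llm_right) and human_right
        tn += (not llm_right) and (not human_right)
    n = len(questions)
    f1_correct = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    f1_wrong = 2 * tn / (2 * tn + fp + fn) if tn else 0.0
    return {
        "mode_acc": mode_acc / n,
        "avg_acc": avg_acc / n,
        "kl": kl / n,
        "cw_f1": (f1_correct + f1_wrong) / 2,
    }
```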
Prompt Design. We employ four weakening prompt designs in addition to the plain NONE baseline; examples are given in Appendix A.

- NONE: Only the context, question, and candidate answers are given.
- PORTRAY: Similar to Santurkar et al. [10], a role is assigned at the beginning of the prompt, for example, "Answer the following reading comprehension question as if you are a CEFR B1 level English learner.", followed by a description of the level as defined by the CEFR.3
- ESL: Bonner et al. [27] suggested that LLMs seem to have the ability to control outputs based on a targeted CEFR level provided in a prompt. We ask the LLM for the most plausible answer from language learners at a specific CEFR level, such as "What do you think is the most plausible answer by CEFR B1 level learners to the following reading comprehension test?". In addition, we inject an explanation such as "Given the context and considering that the test takers are at a CEFR B1 level, the most plausible answer they might choose could be" after "ANSWER:".
- UNCERTAIN: As reported in Zhou et al. [24], expressions of uncertainty change LLMs' behavior. We inject an expression such as "I'm not sure because there are some sentences I don't understand, but maybe the answer is," after "ANSWER:".
- MASK: Laufer [28] argued that language learners need to know 95% of the vocabulary in a text to comprehend its content. To simulate the scenario where 5% of the vocabulary is not known, the 5% least frequent words within a context are masked. Infrequent words in the question and options are also masked based on this threshold. Word frequency is calculated based on SUBTLEXus [29]. A sketch of this masking procedure is shown after this list.

3 https://www.coe.int/en/web/common-european-framework-reference-languages/cefr-descriptors
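Below is a minimal sketch of the MASK transformation as we read the description above: it derives a frequency threshold covering the 5% least frequent word tokens of the context from a SUBTLEXus-style frequency list and replaces those words, as well as matching words in the question and options, with "[MASK]". The file name, column names, and regex tokenization are assumptions for illustration.

```python
# Sketch: mask the 5% least frequent words in a passage, following the MASK
# prompt design. The SUBTLEXus file name/columns and the simple regex
# tokenization are illustrative assumptions.
import csv
import re

def load_frequencies(path: str = "SUBTLEXus.csv") -> dict:
    """Map lowercased words to a frequency count (assumed columns: Word, FREQcount)."""
    freqs = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            freqs[row["Word"].lower()] = float(row["FREQcount"])
    return freqs

def mask_infrequent(context: str, question: str, options: list[str],
                    freqs: dict, ratio: float = 0.05):
    words = re.findall(r"[A-Za-z']+", context)
    # Rank the context tokens from least to most frequent (unknown words count as 0).
    ranked = sorted(words, key=lambda w: freqs.get(w.lower(), 0.0))
    n_mask = max(1, int(len(ranked) * ratio))
    threshold = freqs.get(ranked[n_mask - 1].lower(), 0.0)

    def mask(text: str) -> str:
        return re.sub(
            r"[A-Za-z']+",
            lambda m: "[MASK]" if freqs.get(m.group().lower(), 0.0) <= threshold else m.group(),
            text,
        )

    # The same threshold is applied to the question and the options.
    return mask(context), mask(question), [mask(o) for o in options]
```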
4. Results

Table 2
Results on the CMCQRD dataset. KL and C/W F1 values are computed against the Human language learners in the first row.

B1 level:
System | Prompt | Mode Acc | Avg Acc | KL (lower is better) | C/W F1 (higher is better) | Sum Prob.
Human | - | 0.913 | 0.585 | - | - | -
GPT-4o | NONE | 0.974 | 0.974 | 0.570 | 0.552 | 0.994
GPT-4o | PORTRAY | 0.974 | 0.971 | 0.566 | 0.552 | 0.988
GPT-4o | ESL | 0.965 | 0.964 | 0.563 | 0.544 | 0.895
GPT-4o | UNCERTAIN | 0.713 | 0.719 | 0.795 | 0.471 | 0.155
GPT-4o | MASK | 0.922 | 0.918 | 0.562 | 0.512 | 0.868
LLaMa-2-70b | NONE | 0.930 | 0.839 | 0.338 | 0.518 | 0.993
LLaMa-2-70b | PORTRAY | 0.930 | 0.831 | 0.320 | 0.518 | 0.984
LLaMa-2-70b | ESL | 0.922 | 0.750 | 0.211 | 0.512 | 0.973
LLaMa-2-70b | UNCERTAIN | 0.922 | 0.646 | 0.163 | 0.512 | 0.966
LLaMa-2-70b | MASK | 0.843 | 0.750 | 0.294 | 0.553 | 0.988

B2 level:
System | Prompt | Mode Acc | Avg Acc | KL (lower is better) | C/W F1 (higher is better) | Sum Prob.
Human | - | 0.885 | 0.592 | - | - | -
GPT-4o | NONE | 0.931 | 0.929 | 0.576 | 0.633 | 0.971
GPT-4o | PORTRAY | 0.927 | 0.927 | 0.580 | 0.606 | 0.975
GPT-4o | ESL | 0.927 | 0.926 | 0.554 | 0.651 | 0.842
GPT-4o | UNCERTAIN | 0.828 | 0.805 | 0.711 | 0.572 | 0.228
GPT-4o | MASK | 0.851 | 0.852 | 0.578 | 0.608 | 0.798
LLaMa-2-70b | NONE | 0.854 | 0.756 | 0.354 | 0.611 | 0.992
LLaMa-2-70b | PORTRAY | 0.847 | 0.740 | 0.332 | 0.604 | 0.980
LLaMa-2-70b | ESL | 0.851 | 0.674 | 0.263 | 0.658 | 0.969
LLaMa-2-70b | UNCERTAIN | 0.839 | 0.556 | 0.226 | 0.646 | 0.971
LLaMa-2-70b | MASK | 0.755 | 0.644 | 0.391 | 0.533 | 0.983

Table 2 shows the performance of the LLMs on CMCQRD given each prompt. Overall, contrary to our expectations, the results reveal the limited ability of LLMs to mimic language learners when solving multiple-choice RC tests.

LLMs tend not to be distracted. First, the distributions produced by the LLMs, especially by GPT-4o, are more skewed under NONE than those of humans. In other words, compared to the small gap between Human and the LLM in Mode Accuracy, the gap in Average Accuracy is much wider. For GPT-4o, there is almost no difference between these two accuracies, which demonstrates that the most plausible next token is essentially a single option symbol regardless of its correctness.

Prompts affect outputs differently across LLMs. The results show a difference in how prompts function between GPT-4o and LLaMa-2-70b. For LLaMa-2-70b, the sum of the probabilities for option symbols exceeds 95% across all prompts, indicating that the prompts effectively induce the generation of these symbols. On the other hand, GPT-4o behaves differently, particularly with the UNCERTAIN prompt, where the probability of generating non-symbol tokens is considerable. This shows that the function of prompts differs across LLMs.

LLaMa-2-70b is better suited than GPT-4o for weakening. A key distinction between responses from language learners and LLMs is that, while both show high Mode Accuracy, LLMs demonstrate substantially higher Average Accuracy than humans, indicating that the LLM distributions are generally skewed. Therefore, an LLM suited for weakening should maintain Mode Accuracy while reducing Average Accuracy. In this respect, LLaMa-2-70b is better than GPT-4o. GPT-4o shows minimal changes in Average Accuracy under the weakening prompts, with UNCERTAIN instead dropping both accuracies and the Sum Probability together. Thus, its distributions remain distinct from those of language learners, as reflected by the persistently high KL divergence. In contrast, LLaMa-2-70b shows the ability to reduce Average Accuracy while maintaining Mode Accuracy, especially with the ESL and UNCERTAIN prompts.

Prompt design plays a crucial role. Prompt design markedly influences the outputs of LLMs, as exemplified by the difference between the PORTRAY and ESL results for LLaMa-2-70b. While both prompts are designed to elicit language-learner-like outputs and include a description of the targeted CEFR level, PORTRAY fails to weaken performance, whereas ESL leads to reductions in Average Accuracy and KL divergence. This suggests that there is much room for prompt engineering in the other designs, including UNCERTAIN.

Language learners and LLMs mistake different questions. Whereas KL divergence measures the similarity between two distributions, the Correct/Wrong F1 score directly measures the consistency of the most plausible answers of humans and LLMs. The LLMs show a low F1 score regardless of the prompt given, indicating a discrepancy between the questions that lead to human errors and those that lead to LLM errors. LLaMa-2-70b shows the largest drop in KL divergence with the UNCERTAIN prompt compared to NONE. However, this does not correspond to a substantial improvement in the F1 score, suggesting that the LLM does not mimic human error patterns effectively. Since the distributions of LLMs are generally skewed compared to those of language learners, a reduction in KL divergence is achievable simply by increasing the temperature parameter. This result reveals the importance of not only comparing distributions but also examining the consistency of the mode answers when trying to mimic humans.

5. Discussion

Our results so far seem to demonstrate the inability of LLMs to mimic human language learners when solving RC tests, even when provided with weakening prompts. In particular, we identify differences in the questions that language learners and the LLMs tend to answer incorrectly. In this section, we turn our attention to an analysis of the underlying factors behind these discrepancies.

We analyze the influence of the complexity of the context on the accuracy gaps between language learners and the LLM (NONE-Human), and also on the gaps between the LLM with and without a weakening prompt (NONE-UNCERTAIN). We select LLaMa-2-70b because of its ability to be weakened. Among the features used in prior research by Sugawara et al. [20], we select Passage Length, FKGL [30], and Word Frequency as indicators of complexity. Correlations are measured between these indicators and the accuracy gaps for each individual question.
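As an illustration of this analysis, the sketch below computes two of the surface indicators (passage length and FKGL, using the standard Kincaid et al. formula [30]) and correlates them with per-question accuracy gaps via Pearson's r. The syllable heuristic, the choice of Pearson correlation, and the data layout are our assumptions; the word-frequency indicator would reuse the SUBTLEXus counts from the MASK sketch above.

```python
# Sketch: correlate surface complexity indicators with per-question accuracy
# gaps (e.g., NONE minus Human). The syllable heuristic and data layout are
# illustrative assumptions; FKGL follows the standard Kincaid et al. formula [30].
import re
from scipy.stats import pearsonr

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def complexity_features(passage: str) -> dict:
    words = re.findall(r"[A-Za-z']+", passage)
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)
    fkgl = 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
    return {"passage_length": len(words), "fkgl": fkgl}

def correlate(items: list) -> dict:
    """items: [(passage, accuracy_gap), ...] where the gap is e.g. NONE minus Human."""
    gaps = [gap for _, gap in items]
    results = {}
    for name in ("passage_length", "fkgl"):
        values = [complexity_features(p)[name] for p, _ in items]
        r, p_value = pearsonr(values, gaps)
        results[name] = (r, p_value)
    return results
```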
Table 3
Correlation between the gap and the complexity measures. N, H, and U mean NONE, Human, and UNCERTAIN, respectively. * indicates statistical significance at p < 0.05.

Measure | Level | N-H | N-U | Avg
Delta Average Accuracy | B1 | 0.254 | 0.193 | -
Delta Average Accuracy | B2 | 0.164 | 0.200 | -
Passage Length | B1 | -0.14 | 0.01 | 342.2
Passage Length | B2 | 0.14* | 0.04 | 656.7
FKGL | B1 | 0.31* | 0.23* | 9.69
FKGL | B2 | 0.05 | -0.02 | 9.22
Word Freq (per 1k words) | B1 | -0.18 | -0.08 | 6.53
Word Freq (per 1k words) | B2 | -0.01 | 0.13* | 6.44

Table 3 shows the correlations, some of which are statistically significant. For Passage Length, there is a weak positive correlation with the gap between NONE and Human at the B2 level, which means that the longer the context, the harder it is for language learners to answer correctly compared to the LLM. This implies that a longer context may hinder B2-level language learners from finding the evidence needed to answer more than it hinders the LLM. FKGL, a readability metric based on the number of words and syllables per sentence, shows a weak-to-moderate positive correlation with the gap between the LLM and humans, and also with the gap between the LLM with and without the uncertainty prompt. Since FKGL is designed to give lower values on easier texts, these statistically significant correlations imply that the LLM retains higher accuracy on more complex contexts. The UNCERTAIN prompt can slightly smooth this trend, but it does not enable the LLM to emulate the tendency of language learners. Finally, for Word Frequency, there is a weak positive correlation with the gap between NONE and UNCERTAIN at the B2 level. This may imply that UNCERTAIN weakens the LLM more when a context is composed of more common words.

Overall, these surface-level complexity indicators are not sufficient to explain the difference between language learners and LLMs. We reserve deeper analysis, such as semantic considerations, for further research.

6. Conclusion

In conclusion, our research reveals that LLMs do not behave as second language learners, even with the potentially performance-weakening prompts we provide. We also observe that the performance varies depending on the model and prompts used, even though a limited set of models and prompts is considered. Expanding the variety of these elements, including prompts with more sophisticated approaches such as chain-of-thought [23] and automatic prompt tuning [31], will be critical for a more comprehensive evaluation of the mimicability.

Our findings demonstrate discrepancies between language learners and LLMs in terms of which questions they find easy, highlighting the necessity of micro-level analysis. Nonetheless, the limited size of the CMCQRD dataset used in this research presents challenges in drawing comprehensive conclusions. The development of datasets incorporating diverse personas beyond language learners is essential when trying to use LLMs as a complement to human evaluators.

Acknowledgments

The authors acknowledge the support from the Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from the Maria de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. This research is part of a project that has received funding from the European Union's Horizon Europe research and innovation programme under Grant Agreement No. 101132431 (iDEM Project). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References

[1] S. Gehrmann, E. Clark, T. Sellam, Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text, 2022. arXiv:2202.06935.
[2] N. Grabar, H. Saggion, Evaluation of automatic text simplification: Where are we now, where should we go from here, in: Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale, ATALA, Avignon, France, 2022, pp. 453–463. URL: https://aclanthology.org/2022.jeptalnrecital-taln.47.
[3] L. Martin, A. Fan, Éric de la Clergerie, A. Bordes, B. Sagot, Muss: Multilingual unsupervised sentence simplification by mining paraphrases, 2021. arXiv:2005.00352.
[4] F. Alva-Manchego, L. Martin, A. Bordes, C. Scarton, B. Sagot, L. Specia, Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations, 2020. arXiv:2005.00481.
[5] A. Sauberli, F. Holzknecht, P. Haller, S. Deilen, L. Schiffl, S. Hansen-Schirra, S. Ebling, Digital comprehensibility assessment of simplified texts among persons with intellectual disabilities, 2024. arXiv:2402.13094.
[6] F. Gilardi, M. Alizadeh, M. Kubli, Chatgpt outperforms crowd workers for text-annotation tasks, Proceedings of the National Academy of Sciences 120 (2023) e2305016120.
[7] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023. arXiv:2303.16634.
[8] D. Dillion, N. Tandon, Y. Gu, K. Gray, Can ai language models replace human participants?, Trends in Cognitive Sciences 27 (2023) 597–600.
[9] E. Hwang, B. P. Majumder, N. Tandon, Aligning language models to user opinions, 2023. arXiv:2305.14929.
[10] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, T. Hashimoto, Whose opinions do language models reflect?, 2023. arXiv:2303.17548.
[11] S. Liu, X. Zhang, S. Zhang, H. Wang, W. Zhang, Neural machine reading comprehension: Methods and trends, Applied Sciences 9 (2019) 3698.
[12] A. Mullooly, O. Andersen, L. Benedetto, P. Buttery, A. Caines, M. J. F. Gales, Y. Karatay, K. Knill, A. Liusie, V. Raina, S. Taslimipoor, The Cambridge Multiple-Choice Questions Reading Dataset, Cambridge University Press and Assessment, 2023. URL: https://www.repository.cam.ac.uk/handle/1810/358683. doi:10.17863/CAM.102185.
[13] A. Liusie, V. Raina, A. Mullooly, K. Knill, M. J. F. Gales, Analysis of the cambridge multiple-choice questions reading dataset with a focus on candidate response distribution, 2023. arXiv:2306.13047.
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[15] E. H. Jeon, J. Yamashita, L2 reading comprehension and its correlates: A meta-analysis, Language Learning 64 (2014) 160–212.
[16] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2016, pp. 2383–2392.
[17] G. Lai, Q. Xie, H. Liu, Y. Yang, E. Hovy, Race: Large-scale reading comprehension dataset from examinations, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2017, pp. 785–794.
[18] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning, Hotpotqa: A dataset for diverse, explainable multi-hop question answering, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2018, pp. 2369–2380.
[19] W. Yu, Z. Jiang, Y. Dong, J. Feng, Reclor: A reading comprehension dataset requiring logical reasoning, in: International Conference on Learning Representations, USA, 2019.
[20] S. Sugawara, N. Nangia, A. Warstadt, S. Bowman, What makes reading comprehension questions difficult?, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, USA, 2022, pp. 6951–6971.
[21] J. Robinson, C. M. Rytting, D. Wingate, Leveraging large language models for multiple choice question answering, 2023. arXiv:2210.12353.
[22] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How can we know what language models know?, Transactions of the Association for Computational Linguistics 8 (2020) 423–438. URL: https://aclanthology.org/2020.tacl-1.28. doi:10.1162/tacl_a_00324.
[23] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in Neural Information Processing Systems 35 (2022) 22199–22213.
[24] K. Zhou, D. Jurafsky, T. Hashimoto, Navigating the grey area: How expressions of uncertainty and overconfidence affect language models, 2023. arXiv:2302.13439.
[25] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[26] T. M. Cover, Elements of information theory, John Wiley & Sons, USA, 1999.
[27] E. Bonner, R. Lege, E. Frazier, Large language model-based artificial intelligence in the language classroom: Practical ideas for teaching, Teaching English with Technology 23 (2023) 23–41.
[28] B. Laufer, What percentage of text-lexis is essential for comprehension?, Special Language: From Humans Thinking to Thinking Machines (1989) 316.
[29] M. Brysbaert, B. New, Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English, Behavior Research Methods 41 (2009) 977–990.
[30] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel, 1975.
[31] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, Y. Yang, Connecting large language models with evolutionary algorithms yields powerful prompt optimizers, 2024. URL: https://arxiv.org/abs/2309.08532. arXiv:2309.08532.

A. Prompt Examples

Table 4
Examples of designed prompts.

NONE
CONTEXT: I won't pretend being a flight attendant is easy. But since I started the job, I've been everywhere, from the US to Australia. I work with incredible people, I have a lot of time off, and life is never boring - which ...
QUESTION: What does Jack say about attending his job interview?
A) He was surprised at the age range of people there.
B) He made sure he seemed different from the others.
C) He wondered whether he had enough qualifications.
D) He realised there were too many people for the jobs available.
ANSWER:\n

PORTRAY
Answer the following reading comprehension questions as if you are a CEFR B1 level English learner. Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.
{Same as NONE from CONTEXT: to ANSWER:\n}

ESL
You are an ESL teacher. What do you think is the most plausible answer by CEFR B1 level learners to the following reading comprehension test? Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.
{Same as NONE from CONTEXT: to D) He ...}
ANSWER: Given the context and considering that the test takers are at a CEFR B1 level, the most plausible answer they might choose could be:\n

UNCERTAIN
{Same as NONE from CONTEXT: to D) He ...}
ANSWER: I'm not sure because there are some sentences I don't understand, but maybe the answer is:\n

MASK
CONTEXT: I won't [MASK] being a flight [MASK] is easy. But since I started the job, I've been everywhere, from the US to Australia. I work with incredible people, I have a lot of time off, and life is never [MASK] - which ...
QUESTION: What does Jack say about attending his job interview?
A) He was surprised at the age range of people there.
B) He made sure he seemed different from the others.
C) He [MASK] whether he had enough qualifications.
D) He realised there were too many people for the jobs available.
ANSWER:\n