<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Tests as Second Language Learners?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Akio Hayakawa</string-name>
          <email>akio.hayakawa@upf.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horacio Saggion</string-name>
          <email>horacio.saggion@upf.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Natural Language Processing, Large Language Models, Question Answering, Reading Comprehension</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>30th ACM KDD Conference</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LaSTUS Lab, TALN Research Group, Department of Engineering, Universitat Pompeu Fabra</institution>
          ,
          <addr-line>C/Tànger 122 (08018), Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The manual evaluation of natural language processing systems is costly and time-consuming, especially when targeting people with specific attributes as evaluators. Current large language models (LLMs) are reported to outperform humans at various tasks, and recently have been used as substitutes for human evaluators. LLMs also have shown the ability to behave as specified in a prompt. This progress raises a fundamental question: can LLMs mimic the behavior of language learners? In this study, we intentionally weaken LLMs aiming to make them simulate language learners on multiple-choice reading comprehension tests. By comparing answer distributions from language learners and LLMs, we observe that prompts designed to weaken the LLMs indeed degrade their performance. However, this degradation does not bridge the gap between the original LLMs and language learners, thereby highlighting a critical discrepancy between them.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In the field of Natural Language Processing (NLP), the
evaluation of systems is commonly categorized into two
approaches: automatic and manual evaluation. Manual
evaluation, which is considered more reliable, involves
methods ranging from subjective scoring on scales, such
as a 5-point rating, to task-based assessments like solving
comprehension questions. Despite its reliability, manual
evaluation requires greater time and cost investments
[
        <xref ref-type="bibr" rid="ref13 ref19">1</xref>
        ].
      </p>
      <p>The difficulty of conducting manual evaluation significantly increases when targeting individuals with specific attributes, as access to these groups becomes more difficult. This has resulted in the diminished prioritization of their participation, calling into question the trustworthiness of manual evaluation. For instance, in the text simplification task, which aims to make texts more readable and understandable, children, language learners, and people with disabilities are considered ideal evaluators for the simplicity of texts, as they are presumed to benefit most from the simplification [2]. Nevertheless, studies on text simplification have relied on native speakers or people who do not need simplified texts for manual evaluation [3, 4], rarely involving individuals who need simplification, probably due to significant disparities in accessibility to diverse groups. Indeed, Sauberli et al. [5] recently demonstrated subjective differences in perceived text difficulty between people with and without intellectual disabilities, highlighting the importance of their involvement.</p>
      <p>Recent advancements in NLP, especially with Large Language Models (LLMs), may address this bottleneck. One line of work has attempted to substitute manual evaluation with assessments conducted by LLMs [6, 7, 8], seeking immediate and inexpensive annotations of higher quality. Another set of studies has reported that LLMs are capable of emulating a specific persona by including attributes in a prompt [9, 10].</p>
      <p>Therefore, we wonder if LLMs could be prompted to serve as substitutes for specific personas. This study specifically focuses on language learners, investigating whether LLMs can mimic their response patterns. This approach could potentially offer a more accessible means of obtaining evaluations for tasks that ideally require responses from specific target groups, such as predicting the difficulty of questions without a pilot pretesting stage, simply by providing their attributes in the prompt.</p>
      <p>To judge the mimicability of LLMs, we compare responses to multiple-choice reading comprehension (RC) tests, which have been widely used to measure language comprehension [11], from language learners and NLP systems. Using the CMCQRD dataset [12], which is a recently released four-choice RC test dataset with selection distributions from language learners, we aim to investigate if LLM output can closely approximate these distributions. While fine-tuning encoder models is one approach to pursuing distributions closer to those of humans [13], prompting LLMs has the potential to target a broader range of personas, suggesting enhanced applicability.</p>
      <p>Figure 1 illustrates the outline of our experimental setup. Given that current models in the NLP field often achieve or even surpass human-level performance on various tasks [14], it is reasonable to presume that LLMs could outperform the average language learner on RC tests. Hence, LLMs need to be weakened to mimic language learners. We try several prompting techniques to degrade LLM performance and analyze their effects.</p>
      <p>[Figure 1: Outline of the experimental setup. Weakening prompts such as NONE and PORTRAY are applied to a one-shot multiple-choice RC item (e.g., "What is Sam Fradd's aim in this text?" with options A-D), and the next-token probabilities for the option symbols are compared with the selection distribution of language learners. The figure asks whether the weakening prompt alters the next-token probability distributions: we investigate whether it is possible to make next-token probabilities of the LLM closer to the selection distribution by language learners, by weakening the LLM.]</p>
      <p>Contrary to our expectations, our preliminary experimental results show that the prompts considered do not lead LLMs to mimic language learners. Furthermore, we observe that the questions LLMs tend to answer incorrectly differ significantly from those that language learners struggle with. This discrepancy suggests a need for deeper analysis when we try to utilize LLM as a replacement for human evaluation.</p>
      <p>KiL'24: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference. © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2b">
      <title>2. Related Work</title>
      <sec id="sec-2b-1">
        <title>2.1. Human Response to Reading Comprehension Dataset</title>
        <p>Reading comprehension (RC) tests have been widely used in psycholinguistic studies to assess how well readers, especially language learners, understand the content of a given text [15]. While these studies have seldom made their original data publicly available, research in natural language processing has made standard datasets available to measure the text comprehension abilities of machines [16, 17], sometimes for specific capabilities such as reasoning in HotpotQA [18] and the use of external knowledge in ReClor [19]. However, these datasets are designed only to measure system performance, not for comparison with human responses. As a result, human responses to RC are absent from these datasets. There is limited research that compares responses from machines and humans, and even these studies typically offer only summarized data [20]. This data shortage has hindered research into machine emulation of human response.</p>
        <p>In contrast to this scarcity, CMCQRD [12] is a unique RC dataset which includes response data from language learners. CMCQRD adopts a multiple-choice setting like many of the RC datasets mentioned above, and includes the distribution of the choices among options. RC tests and participants are categorized based on the CEFR, which is a guideline used to describe achievements of foreign language learners. Among the six reference levels (A1, A2, B1, B2, C1, C2) of the CEFR, independent- (B1, B2) and proficient-level (C1, C2) are considered in the CMCQRD dataset. In other words, each question in this dataset is labeled with a difficulty level ranging from B1 to C2 according to the CEFR, and also includes the selection distribution by language learners whose proficiency corresponds to these labeled levels. This information enables a detailed analysis of the differences between language learners and machines. Liusie et al. [13] compared outputs from an ELECTRA-based classification model with human responses, reporting low similarity due to the model performing worse than language learners.</p>
      </sec>
      <sec id="sec-2b-2">
        <title>2.2. Prompts that Alter LLMs' Behaviour in Question Answering</title>
        <p>Retrieving distributions for multiple-choice questions from LLMs involves obtaining not only the final answer but also the probabilities associated with each option. While it is nontrivial to extract an answer or a probability because of the auto-regressive nature of text generation by LLMs, Robinson et al. [21] demonstrated that a multiple-choice prompt can lead to a higher probability of generating option symbols as the next token, especially with one or few-shot settings. Unlike a traditional cloze prompt, which selects the option with the highest sequence probability without giving other options, a multiple-choice prompt provides all options simultaneously and selects the one with the highest probability for the option symbols.</p>
        <p>However, even in this setting, it has been reported that LLMs respond less robustly to certain prompts [22, 23]. Utilizing this vulnerability, Santurkar et al. [10] suggested that LLMs can change the distributions of attitude options towards controversial social topics, when given prompts that mimic the behavior of a human group with specific attributes. LLMs' behaviour will also change when given a degree of certainty like "Perhaps it's" [24]. This change was observed in response to context-free open-ended questions, highlighting an opportunity for extended research in multiple-choice RC tests.</p>
      </sec>
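      <sec id="sec-2-8">
        <title>Sketch: Option Probabilities from Next-Token Logits</title>
        <p>The multiple-choice prompting scheme described above can be sketched as follows. This is a minimal illustration with hard-coded logits, assuming access to a model's next-token logits; it is not the authors' actual GPT-4o / LLaMa-2-70b pipeline.</p>
        <preformat>
```python
import math

OPTION_SYMBOLS = ["A", "B", "C", "D"]

def option_distribution(next_token_logits, temperature=1.0):
    """Softmax-normalize the logits of the option symbols A-D.

    next_token_logits: dict mapping next-token candidates to logits;
    only the four option symbols are kept, mirroring the
    multiple-choice prompting setup of Robinson et al.
    """
    logits = [next_token_logits[s] / temperature for s in OPTION_SYMBOLS]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {s: e / total for s, e in zip(OPTION_SYMBOLS, exps)}

def sum_option_probability(next_token_logits, temperature=1.0):
    """Probability mass on option symbols among ALL next-token
    candidates -- the 'Sum Probability' used to check that the prompt
    actually elicits an option symbol."""
    m = max(next_token_logits.values())
    exps = {t: math.exp((v - m) / temperature)
            for t, v in next_token_logits.items()}
    total = sum(exps.values())
    return sum(exps[s] for s in OPTION_SYMBOLS) / total
```
        </preformat>
        <p>Raising the temperature flattens the distribution over the options, which is why a lower KL divergence alone is not evidence of learner-like behaviour (see Section 4).</p>
      </sec>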
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>The primary objective of this work is to investigate whether LLMs can mimic the responses of language learners in solving multiple-choice RC tests. In this section, we outline our experimental setup, utilizing the CMCQRD dataset [12], which includes responses from at least 100 [...] about answer probability distributions. Our analysis compares the next-token probability on each option by LLMs with the choice patterns of language learners, aiming to understand the extent of LLMs' capability in emulating learner-like understanding in RC tasks.</p>
      <p>Assuming that up-to-date LLMs outperform average language learners, degrading these models is needed to bring their output distributions closer to those of language learners. We employ several methods to weaken the LLM performance and compare the results to the language learners.</p>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>The CMCQRD dataset consists of 4-choice English RC tests, labeled with difficulty levels ranging from CEFR B1 to C2. A subset of CMCQRD includes responses from non-native English speakers whose proficiency aligns with the difficulty label [12, 13]. We refer to this set of responses as the human distribution. [...] 60%, while the accuracies of their mode selections are around 90%. In this experiment, we exclusively use questions at levels B1 and B2 with a human distribution, corresponding to intermediate levels of proficiency. Our focus on these levels is driven by our aim to assess the ability of LLMs to reproduce the challenges faced by language learners who are not fully proficient in reading comprehension.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Statistics of the CMCQRD dataset. We use RC tests at B1 and B2 levels with responses.</p></caption>
          <table>
            <thead>
              <tr><th/><th>CEFR Level</th><th>B1</th><th>B2</th><th>C1</th><th>C2</th></tr>
            </thead>
            <tbody>
              <tr><td rowspan="2">w/o responses</td><td>Num Text</td><td>5</td><td>21</td><td>13</td><td>3</td></tr>
              <tr><td>Num QA</td><td>25</td><td>160</td><td>86</td><td>20</td></tr>
              <tr><td rowspan="2">w/ responses</td><td>Num Text</td><td>23</td><td>37</td><td>12</td><td>6</td></tr>
              <tr><td>Num QA</td><td>115</td><td>262</td><td>83</td><td>42</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>LLM Settings</title>
        <p>Since the outputs of LLMs are autoregressive and free-form, some techniques are required to increase the likelihood of desired tokens in subsequent outputs. To this end, we employ a multiple-choice prompting approach for RC, as described in Robinson et al. [21]. This approach provides LLMs with a single natural language prompt that concatenates the context, a question, options, and an option-symbol-prompting word, such as "Answer:". We take the next-token probabilities as the distribution by the LLM: the logits of the next tokens associated with the option symbols, {A, B, C, D} on 4-choice tests, are normalized using softmax. We use GPT-4o1 and LLaMa-2-70b, the latter loaded from the Hugging Face library with 4-bit quantization.2 The temperature parameter is set to 1.0 for both models.</p>
        <p>1 https://openai.com/index/hello-gpt-4o/
2 https://huggingface.co/meta-llama/Llama-2-70b-hf</p>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation</title>
        <p>To compare human and LLM outputs, we use mode accuracy, average accuracy, and KL divergence following Liusie et al. [13], and also the correct/wrong F1 score. Below is the description of these metrics.</p>
        <p>1. Mode Accuracy: how frequently the most plausible symbol by the LLM is the correct answer, denoted as Mode Accuracy = E_q[argmax_o(p_q^LLM) = o_q^ans], where p_q represents the probabilities for each option and o_q^ans is the correct option.</p>
        <p>2. Average Accuracy: how frequently the correct option is selected on average by the LLM, denoted as Average Accuracy = E_q[p_q^LLM(o_q^ans)].</p>
        <p>3. KL Divergence: the similarity between two distributions [26], denoted as KL Divergence = Σ_o p^h(o) log(p^h(o) / p^LLM(o)), where o represents an option selection, with the LLM and human distributions fixed to p^LLM and p^h, respectively.</p>
        <p>4. Correct/Wrong F1: the macro-averaged F1 score focused on question-wise correct and wrong consistency on mode options, denoted as Correct/Wrong F1 = (1/2)(F1_correct + F1_wrong), where each F1 score is calculated from the elements of the confusion matrix, such as TP_correct = Σ_q [(argmax_o(p_q^LLM) = o_q^ans) ∧ (argmax_o(p_q^h) = o_q^ans)] and FP_wrong = Σ_q [(argmax_o(p_q^LLM) ≠ o_q^ans) ∧ (argmax_o(p_q^h) = o_q^ans)].</p>
        <p>Furthermore, we calculate the summation of the probabilities for option symbols appearing as the next token to evaluate the effectiveness of the prompts.</p>
      </sec>
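      <sec id="sec-3-3b">
        <title>Sketch: Computing the Evaluation Metrics</title>
        <p>The four metrics above can be written directly from their definitions. A minimal sketch over per-question option distributions; the distributions in the example are toy values, not CMCQRD data.</p>
        <preformat>
```python
import math

def mode_accuracy(llm_dists, answers):
    # fraction of questions where the LLM's most probable option is correct
    hits = sum(max(p, key=p.get) == a for p, a in zip(llm_dists, answers))
    return hits / len(answers)

def average_accuracy(llm_dists, answers):
    # expected probability mass the LLM assigns to the correct option
    return sum(p[a] for p, a in zip(llm_dists, answers)) / len(answers)

def kl_divergence(human, llm, eps=1e-12):
    # D_KL(human || LLM) over the options of one question
    return sum(h * math.log((h + eps) / (llm[o] + eps))
               for o, h in human.items() if h > 0)

def correct_wrong_f1(llm_dists, human_dists, answers):
    # macro F1 over question-wise agreement of mode answers being
    # correct/wrong for the LLM vs. the human distribution
    llm_correct = [max(p, key=p.get) == a for p, a in zip(llm_dists, answers)]
    hum_correct = [max(p, key=p.get) == a for p, a in zip(human_dists, answers)]
    def f1(label):
        tp = sum(l == label and h == label for l, h in zip(llm_correct, hum_correct))
        fp = sum(l == label and h != label for l, h in zip(llm_correct, hum_correct))
        fn = sum(l != label and h == label for l, h in zip(llm_correct, hum_correct))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return 0.5 * (f1(True) + f1(False))
```
        </preformat>
      </sec>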
    </sec>
    <sec id="sec-3b">
      <title>4. Results</title>
      <p>[...] Accuracy compared to humans, indicating that distributions by LLMs are generally skewed. Therefore, an LLM suited for weakening can maintain Mode Accuracy while reducing Average Accuracy. In this aspect, LLaMa-2-70b is better than GPT-4o. GPT-4o shows minimal changes in Average Accuracy even with weakening prompts, including UNCERTAIN, which drops accuracies and Sum Probability. Thus, its distributions remain distinct from language learners, as reflected by the persistently high KL divergence. In contrast, LLaMa-2-70b shows the ability to reduce Average Accuracy while maintaining Mode Accuracy, especially with the ESL and UNCERTAIN prompts.</p>
      <p>Prompt design plays a crucial role. Prompt designs markedly influence the outputs from LLMs, as exemplified by the difference between the PORTRAY and ESL results on LLaMa-2-70b. While both prompts are designed to emulate language learner-like outputs and include the description of the targeted CEFR level, PORTRAY fails to weaken performance, whereas ESL leads to reductions in Average Accuracy and KL Divergence. This suggests that there is much room for prompt engineering in other designs, including UNCERTAIN.</p>
      <p>Language learners and the LLM mistake different questions. Whereas KL divergence measures the similarity between two distributions, the Correct/Wrong F1 score directly measures the consistency of the most plausible answers by humans and LLMs. LLMs show a low F1 score regardless of the prompt given, indicating a discrepancy between questions that lead to human errors and those that lead to LLM errors. LLaMa-2-70b shows the largest drop in KL divergence with the UNCERTAIN prompt compared to NONE. However, this does not correspond with a substantial improvement in the F1 score, suggesting that the LLM does not mimic human error patterns effectively. Since distributions by LLMs are generally skewed compared to those by language learners, the reduction of KL divergence is achievable by simply increasing the temperature parameter. This result reveals the importance of not only comparing distributions but also examining the consistency of the mode answers to mimic humans.</p>
    </sec>
    <sec id="sec-3c">
      <title>5. Discussion</title>
      <p>Our results so far seem to demonstrate the inability of LLMs to mimic human language learners when solving RC tests, even when provided with weakening prompts. In particular, we identify differences in the questions that language learners and the LLMs tend to answer incorrectly. In this section, we turn our attention to an analysis of the underlying factors for these discrepancies.</p>
      <p>We analyze the influence of the complexity of context on accuracy gaps between language learners and the LLM (NONE-Human), and also the gaps between the LLM with and without a weakening prompt (NONE-UNCERTAIN). We select LLaMa-2-70b because of its ability to be weakened. Among the features used in prior research by Sugawara et al. [20], we select Passage Length, FKGL [30], and Word Frequency as indicators of complexity. Correlations are measured between these indicators and the accuracy gaps for each individual question.</p>
      <p>[Table 3: Correlation between the gap and complexity measures. N, H, and U mean NONE, Human, and UNCERTAIN, respectively. * means statistical significance at p &lt; 0.05. Rows: Δ Average Accuracy against Passage Length, FKGL, and Word Freq (per 1k words) at B1 and B2; columns: N-H, N-U, Avg. Individual cell values are not legible in this version.]</p>
      <p>Table 3 shows the correlations, some of which are statistically significant. For Passage Length, there is a weak positive correlation with the gap between NONE and Human at the B2 level, which means that the longer the context, the harder it is for language learners to answer correctly compared to the LLM. This implies that a longer context may hinder B2 level language learners from finding the evidence needed to answer more than it does the LLM. FKGL, a readability metric based on the number of words and syllables per sentence, shows a weak-to-moderate positive correlation with the gap between LLM and human, and also the gap between LLMs with and without the uncertainty prompt. Since FKGL is designed to show a lower value on easier texts, these statistically significant gaps imply that the LLM shows a higher accuracy in more complex contexts. The UNCERTAIN prompt can slightly smooth this trend, but it does not enable the LLM to emulate the tendency of language learners. Finally, for Word Frequency, there is a weak positive correlation with the gap between NONE and UNCERTAIN at the B2 level. This may imply that UNCERTAIN weakens LLMs more when a context is composed of more common words.</p>
      <p>Overall, these surface-level complexity indicators are not sufficient to explain the difference between language learners and LLMs. We reserve deeper analysis, such as semantic considerations, for our further research.</p>
    </sec>
    <sec id="sec-3d">
      <title>6. Conclusion</title>
      <p>In conclusion, our research reveals that LLMs do not behave as second language learners even with the potentially performance-weakening prompts we provide. We also observe that the performance varies depending on the model and prompts used, even though a limited set of models and prompts are considered. Expanding the variety of these elements, including prompts with more sophisticated approaches such as chain-of-thought [23] and automatic prompt tuning [31], will be critical for a more comprehensive evaluation of the mimicability.</p>
      <p>Our findings demonstrate discrepancies between language learners and LLMs in terms of the easiness of questions, highlighting the necessity for micro-level analysis. Nonetheless, the limited size of the CMCQRD dataset used in this research presents challenges in drawing comprehensive conclusions. The development of datasets incorporating diverse personas beyond language learners is essential when trying to use LLMs as a complement to human evaluators.</p>
    </sec>
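    <sec id="sec-3e">
      <title>Sketch: Surface-Level Complexity Indicators</title>
      <p>The surface-level indicators used in the discussion (Passage Length, FKGL, Word Frequency) can be approximated as follows. The syllable counter is a rough vowel-group heuristic and the frequency-table format is our own simplification, not the implementation used in the paper.</p>
      <preformat>
```python
import re

def count_syllables(word):
    # rough heuristic: runs of vowels, minus a common silent final 'e'
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def complexity_indicators(text, freq_table=None):
    """Passage Length, FKGL, and mean word frequency for one passage.

    freq_table: optional dict of word -> occurrences per 1k words in a
    reference corpus (hypothetical format); None skips that indicator.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid Grade Level [30]
    fkgl = (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)
    out = {"passage_length": len(words), "fkgl": fkgl}
    if freq_table is not None:
        out["word_freq"] = sum(freq_table.get(w.lower(), 0.0)
                               for w in words) / len(words)
    return out
```
      </preformat>
    </sec>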
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The authors acknowledge the support from Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from the Maria de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. This research is part of a project that has received funding from the European Union's Horizon Europe research and innovation program under the Grant Agreement No. 101132431 (iDEM Project). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.</p>
      <sec id="sec-4-1">
        <title>Appendix: Prompts</title>
        <sec id="sec-4-1-1">
          <title>NONE</title>
          <p>CONTEXT: I won't pretend being a flight attendant is easy. But since I started the job, I've been everywhere, from the US to Australia. I work with incredible people, I have a lot of time off, and life is never boring - which ...</p>
          <p>QUESTION: What does Jack say about attending his job interview?</p>
          <p>A) He was surprised at the age range of people there.
B) He made sure he seemed different from the others.
C) He wondered whether he had enough qualifications.
D) He realised there were too many people for the jobs available.</p>
          <p>ANSWER:\n</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>PORTRAY</title>
          <p>Answer the following reading comprehension questions as if you are a CEFR B1 level English learner. Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.</p>
          <p>{Same as NONE from CONTEXT: to ANSWER:\n}</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>ESL</title>
          <p>You are an ESL teacher. What do you think is the most plausible answer by CEFR B1 level learners to the following reading comprehension test? Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.</p>
        </sec>
      </sec>
    </sec>
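    <sec id="sec-5">
      <title>Sketch: Assembling the Prompts</title>
      <p>The prompt variants in the appendix can be assembled programmatically. A sketch in which the function names and parameters are ours; the wording is taken from the templates above, and the one-shot example used in the paper is passed in as an opaque string.</p>
      <preformat>
```python
B1_DESCRIPTION = (
    "Learners at this level can understand the main points of clear standard "
    "input on familiar matters regularly encountered in work, school, leisure, "
    "etc. But sometimes it may be difficult to understand the main ideas of "
    "complex text on both concrete and abstract topics, including technical "
    "discussions in his/her field of specialisation."
)

def build_none_prompt(context, question, options, one_shot=""):
    # NONE: bare multiple-choice prompt ending in "ANSWER:" so that the
    # next token is an option symbol (A-D); one_shot holds the 1-shot example
    opts = "\n".join(f"{sym}) {text}" for sym, text in zip("ABCD", options))
    body = f"CONTEXT: {context}\nQUESTION: {question}\nOPTIONS:\n{opts}\nANSWER:\n"
    return one_shot + body

def build_portray_prompt(context, question, options, one_shot=""):
    # PORTRAY: ask the model to answer as a CEFR B1 learner
    header = ("Answer the following reading comprehension questions as if you "
              f"are a CEFR B1 level English learner. {B1_DESCRIPTION}\n")
    return header + build_none_prompt(context, question, options, one_shot)

def build_esl_prompt(context, question, options, one_shot=""):
    # ESL: ask the model, as an ESL teacher, for the most plausible B1 answer
    header = ("You are an ESL teacher. What do you think is the most plausible "
              "answer by CEFR B1 level learners to the following reading "
              f"comprehension test? {B1_DESCRIPTION}\n")
    return header + build_none_prompt(context, question, options, one_shot)
```
      </preformat>
    </sec>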
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[18] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning, HotpotQA: A dataset for diverse, explainable multi-hop question answering, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2018, pp. 2369-2380.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[19] W. Yu, Z. Jiang, Y. Dong, J. Feng, ReClor: A reading comprehension dataset requiring logical reasoning, in: International Conference on Learning Representations, 2020.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[...] -based artificial intelligence in the language classroom: Practical ideas for teaching, Teaching English with Technology 23 (2023) 23-41.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[28] B. Laufer, What percentage of text-lexis is essential for comprehension?, in: Special language: From humans thinking to thinking machines, 1989, p. 316.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[29] M. Brysbaert, B. New, Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English, Behavior Research Methods 41 (2009) 977-990.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[30] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for Navy enlisted personnel, 1975.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[20] S. Sugawara, N. Nangia, A. Warstadt, S. Bowman, What makes reading comprehension questions difficult?, in: Proceedings of the 60th Annual Meeting</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>of the Association for Computational Linguistics</source>
          [31]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          , Association for Computa- G. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Connecting large language
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>tional</surname>
            <given-names>Linguistics</given-names>
          </string-name>
          , USA,
          <year>2022</year>
          , pp.
          <fpage>6951</fpage>
          -
          <lpage>6971</lpage>
          .
          <article-title>models with evolutionary algorithms yields</article-title>
          power[21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Rytting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wingate</surname>
          </string-name>
          , Leveraging ful prompt optimizers,
          <year>2024</year>
          . URL: https://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>large language models for multiple choice question abs/2309.08532</article-title>
          . arXiv:
          <volume>2309</volume>
          .
          <fpage>08532</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>answering</surname>
          </string-name>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2210</volume>
          .
          <fpage>12353</fpage>
          . [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Araki</surname>
          </string-name>
          , G. Neubig, How can we
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>of the Association for Computational Linguistics 8</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          (
          <year>2020</year>
          )
          <fpage>423</fpage>
          -
          <lpage>438</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>tacl-1</source>
          .28. doi:
          <volume>10</volume>
          .1162/tacl_a_
          <fpage>00324</fpage>
          . [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          . [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          , T. Hashimoto, Navigating
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>and overconfidence afect language models</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <source>arXiv:2302</source>
          .
          <fpage>13439</fpage>
          . [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Alma-
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>driguez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Stojnic</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Edunov</surname>
          </string-name>
          ,
          <source>T. Scialom, Llama</source>
          <volume>2</volume>
          :
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>Open foundation and fine-tuned chat models</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>arXiv:2307</source>
          .
          <fpage>09288</fpage>
          . [26]
          <string-name>
            <surname>T. M. Cover</surname>
          </string-name>
          , Elements of information theory, John
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          Wiley &amp; Sons, USA,
          <year>1999</year>
          . [27]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bonner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lege</surname>
          </string-name>
          , E. Frazier, Large language model-
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>Given the context and considering that the test takers are at a CEFR B1 level, the</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>