<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Can LLMs Solve Reading Comprehension Tests as Second Language Learners?</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Akio</forename><surname>Hayakawa</surname></persName>
							<email>akio.hayakawa@upf.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Engineering</orgName>
								<orgName type="laboratory" key="lab1">LaSTUS Lab</orgName>
								<orgName type="laboratory" key="lab2">TALN Research Group</orgName>
								<orgName type="institution">Universitat Pompeu Fabra</orgName>
								<address>
									<addrLine>C/Tànger 122 (08018)</addrLine>
									<settlement>Barcelona</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Horacio</forename><surname>Saggion</surname></persName>
							<email>horacio.saggion@upf.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Engineering</orgName>
								<orgName type="laboratory" key="lab1">LaSTUS Lab</orgName>
								<orgName type="laboratory" key="lab2">TALN Research Group</orgName>
								<orgName type="institution">Universitat Pompeu Fabra</orgName>
								<address>
									<addrLine>C/Tànger 122 (08018)</addrLine>
									<settlement>Barcelona</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Can LLMs Solve Reading Comprehension Tests as Second Language Learners?</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">02E497695C5BA46E9AE5F7AFCA316CAC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Natural Language Processing</term>
					<term>Large Language Models</term>
					<term>Question Answering</term>
					<term>Reading Comprehension</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The manual evaluation of natural language processing systems is costly and time-consuming, especially when targeting people with specific attributes as evaluators. Current large language models (LLMs) are reported to outperform humans at various tasks and have recently been used as substitutes for human evaluators. LLMs have also shown the ability to behave as specified in a prompt. This progress raises a fundamental question: can LLMs mimic the behavior of language learners? In this study, we intentionally weaken LLMs, aiming to make them simulate language learners on multiple-choice reading comprehension tests. By comparing answer distributions from language learners and LLMs, we observe that prompts designed to weaken the LLMs indeed degrade their performance. However, this degradation does not bridge the gap between the original LLMs and language learners, thereby highlighting a critical discrepancy between them.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In the field of Natural Language Processing (NLP), the evaluation of systems is commonly categorized into two approaches: automatic and manual evaluation. Manual evaluation, which is considered more reliable, involves methods ranging from subjective scoring on scales, such as a 5-point rating, to task-based assessments like solving comprehension questions. Despite its reliability, manual evaluation requires greater investments of time and cost <ref type="bibr" target="#b0">[1]</ref>.</p><p>The difficulty of conducting manual evaluation increases significantly when targeting individuals with specific attributes, as access to these groups is limited. This has led to their participation being deprioritized, calling into question the trustworthiness of manual evaluation. For instance, in the text simplification task, which aims to make texts more readable and understandable, children, language learners, and people with disabilities are considered the ideal evaluators of the simplicity of texts, as they are presumed to benefit most from simplification <ref type="bibr" target="#b1">[2]</ref>. Nevertheless, studies on text simplification have relied on native speakers or people who do not need simplified texts for manual evaluation <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>, rarely involving individuals who need simplification, probably due to significant disparities in accessibility to diverse groups. Indeed, Sauberli et al. <ref type="bibr" target="#b4">[5]</ref> recently demonstrated subjective differences in perceived text difficulty between people with and without intellectual disabilities, highlighting the importance of their involvement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure <ref type="figure">1</ref>: Overview of our experimental setup. We investigate whether it is possible to bring the next-token probabilities of an LLM closer to the selection distributions of language learners by weakening the LLM.</p><p>Recent advancements in NLP, especially with Large Language Models (LLMs), may address this bottleneck. One line of work has attempted to substitute manual evaluation with assessments conducted by LLMs <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>, seeking immediate and inexpensive annotations of higher quality. Another set of studies has reported that LLMs are capable of emulating a specific persona when its attributes are included in a prompt <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>.</p><p>Therefore, we ask whether LLMs could be prompted to serve as substitutes for specific personas. This study specifically focuses on language learners, investigating whether LLMs can mimic their response patterns. 
This approach could potentially offer a more accessible means of obtaining evaluations for tasks that ideally require responses from specific target groups, such as predicting the difficulty of questions without a pilot pretesting stage, simply by providing their attributes in the prompt.</p><p>To judge the mimicability of LLMs, we compare responses to multiple-choice reading comprehension (RC) tests, which have been widely used to measure language comprehension <ref type="bibr" target="#b10">[11]</ref>, from language learners and NLP systems. Using the CMCQRD dataset <ref type="bibr" target="#b11">[12]</ref>, which is a recently released four-choice RC test dataset with selection distributions from language learners, we aim to investigate if LLM output can closely approximate these distributions. While fine-tuning encoder models is one approach to pursuing distributions closer to those of humans <ref type="bibr" target="#b12">[13]</ref>, prompting LLMs has the potential to target a broader range of personas, suggesting enhanced applicability.</p><p>Figure <ref type="figure">1</ref> illustrates the outline of our experimental setup. Given that current models in the NLP field often achieve or even surpass human-level performance on various tasks <ref type="bibr" target="#b13">[14]</ref>, it is reasonable to presume that LLMs could outperform the average language learner on RC tests. Hence, LLMs need to be weakened to mimic language learners. We try several prompting techniques to degrade LLM performance and analyze their effects.</p><p>Contrary to our expectations, our preliminary experimental results show that the prompts considered do not lead LLMs to mimic language learners. Furthermore, we observe that the questions LLMs tend to answer incorrectly differ significantly from those that language learners struggle with. This discrepancy suggests a need for deeper analysis when we try to utilize LLM as a replacement for human evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Human Response to Reading Comprehension Dataset</head><p>Reading comprehension (RC) tests have been widely used in psycholinguistic studies to assess how well readers, especially language learners, understand the content of a given text <ref type="bibr" target="#b14">[15]</ref>. While these studies have seldom made their original data publicly available, research in natural language processing has produced standard datasets for measuring the text comprehension abilities of machines <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>, sometimes targeting specific capabilities such as reasoning in HotpotQA <ref type="bibr" target="#b17">[18]</ref> and the use of external knowledge in ReClor <ref type="bibr" target="#b18">[19]</ref>. However, these datasets are designed only to measure system performance, not for comparison with human responses. As a result, human responses to RC tests are absent from these datasets. Little research compares responses from machines and humans, and even these studies typically offer only summarized data <ref type="bibr" target="#b19">[20]</ref>. This data shortage has hindered research into machine emulation of human responses.</p><p>In contrast to this scarcity, CMCQRD <ref type="bibr" target="#b11">[12]</ref> is a unique RC dataset that includes response data from language learners. CMCQRD adopts a multiple-choice setting like many of the RC datasets mentioned above, and includes the distribution of selections across options. RC tests and participants are categorized based on the CEFR, a guideline used to describe the achievements of foreign language learners. Among the six reference levels (A1, A2, B1, B2, C1, C2) of the CEFR, the independent (B1, B2) and proficient (C1, C2) levels are covered in the CMCQRD dataset. 
In other words, each question in this dataset is labeled with a difficulty level ranging from B1 to C2 according to the CEFR, and also includes the selection distribution by language learners whose proficiency corresponds to these labeled levels. This information enables a detailed analysis of the differences between language learners and machines. Liusie et al. <ref type="bibr" target="#b12">[13]</ref> compared outputs from an ELECTRA-based classification model with human responses, reporting low similarity due to the model performing worse than language learners.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Prompts that Alter LLMs' Behaviour in Question Answering</head><p>Retrieving distributions for multiple-choice questions from LLMs involves obtaining not only the final answer but also the probabilities associated with each option. While extracting an answer or a probability is nontrivial because of the auto-regressive nature of text generation by LLMs, Robinson et al. <ref type="bibr" target="#b20">[21]</ref> demonstrated that a multiple-choice prompt can lead to a higher probability of generating option symbols as the next token, especially in one- or few-shot settings. Unlike a traditional cloze prompt, which selects the option whose full sequence has the highest probability without presenting the other options, a multiple-choice prompt provides all options simultaneously and selects the option symbol with the highest next-token probability. However, even in this setting, it has been reported that LLMs respond less robustly to certain prompts <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref>. Exploiting this vulnerability, Santurkar et al. <ref type="bibr" target="#b9">[10]</ref> showed that LLMs change the distributions of attitude options towards controversial social topics when given prompts that mimic the behavior of a human group with specific attributes. LLMs' behaviour also changes when the prompt expresses a degree of certainty, such as "Perhaps it's" <ref type="bibr" target="#b23">[24]</ref>. This change was observed in response to context-free open-ended questions, highlighting an opportunity for extended research in multiple-choice RC tests.</p></div>
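The multiple-choice prompting scheme described above can be sketched in a few lines. This is a minimal illustration, not the exact implementation of the cited work: the prompt wording, the dictionary-based data layout, and the assumption that each option symbol is a single token are ours.

```python
import math

def build_mc_prompt(context, question, options, answer_cue="ANSWER:"):
    """Concatenate the context, question, and options into a single
    multiple-choice prompt ending with an option-symbol cue word."""
    lines = [f"CONTEXT: {context}", f"QUESTION: {question}"]
    lines += [f"{sym}) {text}" for sym, text in options.items()]
    lines.append(answer_cue)
    return "\n".join(lines)

def option_distribution(next_token_logits, symbols=("A", "B", "C", "D")):
    """Softmax-normalize the next-token logits of the option symbols.
    `next_token_logits` maps candidate next tokens to the raw logits
    obtained from a single forward pass over the prompt."""
    logits = [next_token_logits[s] for s in symbols]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {s: e / z for s, e in zip(symbols, exps)}
```

In practice the logits would come from a causal LM's output at the position following the cue word; here any mapping from tokens to logits can be plugged in.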
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experimental Setup</head><p>The primary objective of this work is to investigate whether LLMs can mimic the responses of language learners when solving multiple-choice RC tests. In this section, we outline our experimental setup, which utilizes the CMCQRD dataset <ref type="bibr" target="#b11">[12]</ref>; the dataset includes responses from at least 100 language learners per question, providing answer probability distributions. Our analysis compares the next-token probabilities that LLMs assign to each option with the choice patterns of language learners, aiming to understand the extent of LLMs' capability to emulate learner-like understanding in RC tasks.</p><p>Assuming that up-to-date LLMs outperform average language learners, these models must be degraded to bring their output distributions closer to those of language learners. We employ several methods to weaken LLM performance and compare the results with those of the language learners.</p><p>Dataset The CMCQRD dataset consists of 4-choice English RC tests, labeled with difficulty levels ranging from CEFR B1 to C2. A subset of CMCQRD includes responses from non-native English speakers whose proficiency aligns with the difficulty label <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>. We refer to this set of responses as the human distribution.</p><p>Table <ref type="table" target="#tab_0">1</ref> shows the statistics of the CMCQRD dataset. The average accuracies of language learners are around 60%, while the accuracies of their mode selections are around 90%. In this experiment, we exclusively use questions at levels B1 and B2 with a human distribution, corresponding to intermediate levels of proficiency. 
Our focus on these levels is driven by our aim to assess the ability of LLMs to reproduce the challenges faced by language learners who are not yet fully proficient in reading comprehension.</p><p>LLM Settings Since the outputs of LLMs are autoregressive and free-form, some techniques are required to increase the likelihood of the desired tokens in subsequent outputs. To this end, we employ the multiple-choice prompting approach for RC described in Robinson et al. <ref type="bibr" target="#b20">[21]</ref>. This approach provides LLMs with a single natural language prompt that concatenates the context, a question, the options, and an option-symbol-prompting word, such as "Answer:". We use the next-token probabilities as the LLM's output distribution: the logits of the next tokens associated with the option symbols, {A, B, C, D} on 4-choice tests, are normalized using softmax. We adopt GPT-4o<ref type="foot" target="#foot_0">1</ref> and LLaMa-2-70B <ref type="bibr" target="#b24">[25]</ref> with one-shot prompting. We run LLaMa-2-70B using the Hugging Face library with 4-bit quantization. <ref type="foot" target="#foot_1">2</ref> The temperature parameter is set to 1.0 for both models.</p><p>Evaluation To compare human and LLM outputs, we use mode accuracy, average accuracy, and KL divergence, following Liusie et al. <ref type="bibr" target="#b12">[13]</ref>, as well as a correct/wrong F1 score. These metrics are described below.</p><p>1. Mode Accuracy: how frequently the LLM's most probable symbol is the correct answer, denoted as</p><formula xml:id="formula_0">Mode Accuracy = 𝔼[argmax 𝑦 (𝑝 LLM ) = 𝑦 ans ],</formula><p>where 𝑝 represents the probabilities for each option and 𝑦 ans is the correct option. 2. Average Accuracy: how frequently the correct option is selected on average by the LLM, denoted as Average Accuracy = 𝔼[𝑦 LLM = 𝑦 ans ].</p><p>3. 
KL Divergence: the similarity between two distributions <ref type="bibr" target="#b25">[26]</ref>, denoted as</p><formula xml:id="formula_1">KL Divergence = ∑ 𝑜 𝑙 𝑜 log 𝑙 𝑜 ℎ 𝑜 ,</formula><p>where 𝑜 ranges over option selections, and 𝑙 and ℎ denote the LLM and human distributions, respectively. 4. Correct/Wrong F1: the macro-averaged F1 score measuring question-wise correct/wrong consistency of the mode options, denoted as</p><formula xml:id="formula_2">Correct/Wrong F1 = 1 2 (F1 correct + F1 wrong ),</formula><p>where each F1 score is calculated from the elements of the confusion matrix, such as</p><formula xml:id="formula_3">TP correct = ∑ 𝑖 [(𝑦 LLM 𝑖 = 𝑦 ans 𝑖 ) ∧ (𝑦 Human 𝑖 = 𝑦 ans 𝑖 )] and FP wrong = ∑ 𝑖 [(𝑦 LLM 𝑖 ≠ 𝑦 ans 𝑖 ) ∧ (𝑦 Human 𝑖 = 𝑦 ans 𝑖 )]</formula><p>. Furthermore, we calculate the sum of the probabilities of the option symbols appearing as the next token to evaluate the effectiveness of the prompts.</p><p>Prompt Design We employ the four prompt designs below; see Appendix A for examples. • NONE: Only the context, question, and candidate answers are given. • PORTRAY: Similar to Santurkar et al. <ref type="bibr" target="#b9">[10]</ref>, a role is assigned at the beginning of the prompt, for example, "Answer the following reading comprehension question as if you are a CEFR B1 level English learner.", followed by a description of the level defined by the CEFR. • ESL: Bonner et al. <ref type="bibr" target="#b26">[27]</ref> suggested that LLMs seem to have the ability to control outputs based on a targeted CEFR level provided in a prompt. We ask LLMs for the most plausible answer from language learners at a specific CEFR level, such as "What do you think is the most plausible answer by CEFR B1 level learners to the following reading comprehension test?". In addition, we inject an explanation like "Given the context and considering that the test takers are at a CEFR B1 level, the most plausible answer they might choose could be" after "ANSWER:". • UNCERTAIN: As reported in Zhou et al. <ref type="bibr" target="#b23">[24]</ref>, expressions of uncertainty change LLMs' behavior. We inject an expression like "I'm not sure because there are some sentences I don't understand, but maybe the answer is," after "ANSWER:". • MASK: Laufer <ref type="bibr" target="#b27">[28]</ref> argued that language learners need to know 95% of the vocabulary in a text to comprehend its content. To simulate the scenario in which 5% of the vocabulary is unknown, the 5% least frequent words within a context are masked. Infrequent words in the question and options are also masked based on this threshold.</p><p>3 https://www.coe.int/en/web/common-european-framework-reference-languages/ cefr-descriptors</p><p>Word frequency is calculated based on SUBTLEXus <ref type="bibr" target="#b28">[29]</ref>.</p></div>
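The four evaluation metrics above can be sketched as follows. This is a minimal illustration under assumed data layouts (per-question option distributions as dictionaries over the symbols A–D, human mode answers as the reference for the correct/wrong F1), not the exact evaluation code.

```python
import math

def mode_accuracy(llm_dists, answers):
    """How often the LLM's most probable option is the correct answer."""
    hits = sum(max(d, key=d.get) == a for d, a in zip(llm_dists, answers))
    return hits / len(answers)

def average_accuracy(llm_dists, answers):
    """Expected probability mass the LLM assigns to the correct option."""
    return sum(d[a] for d, a in zip(llm_dists, answers)) / len(answers)

def kl_divergence(llm_dist, human_dist, eps=1e-12):
    """KL(LLM || Human) over the option symbols of one question."""
    return sum(l * math.log((l + eps) / (human_dist[o] + eps))
               for o, l in llm_dist.items())

def correct_wrong_f1(llm_dists, human_dists, answers):
    """Macro-averaged F1 over question-wise correct/wrong agreement of
    mode answers, with the human mode answers as the reference."""
    llm_ok = [max(d, key=d.get) == a for d, a in zip(llm_dists, answers)]
    hum_ok = [max(d, key=d.get) == a for d, a in zip(human_dists, answers)]

    def f1(positive):
        tp = sum(l == positive and h == positive for l, h in zip(llm_ok, hum_ok))
        fp = sum(l == positive and h != positive for l, h in zip(llm_ok, hum_ok))
        fn = sum(l != positive and h == positive for l, h in zip(llm_ok, hum_ok))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return 0.5 * (f1(True) + f1(False))
```

The KL divergence here is computed per question; averaging over questions gives a dataset-level score.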
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>Table <ref type="table" target="#tab_1">2</ref> shows the performance of LLMs on CMCQRD given each prompt. Overall, contrary to our expectations, the results reveal the limited ability of LLMs to mimic language learners when solving multiple-choice RC tests.</p><p>LLMs tend not to be distracted. First, the distributions produced by LLMs, especially GPT-4o, show greater skewness with NONE than the human distributions. In other words, while the gap between Human and the LLM in mode accuracy is small, the gap in average accuracy is much wider. For GPT-4o, there is almost no difference between these two accuracies, which indicates that nearly all next-token probability mass falls on a single option symbol, regardless of its correctness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prompts affect outputs differently across LLMs.</head><p>The results show that prompts function differently for GPT-4o and LLaMa-2-70b. For LLaMa-2-70b, the sum of the probabilities for option symbols exceeds 95% across all prompts, indicating that the prompts effectively induce the generation of these symbols. GPT-4o, on the other hand, behaves differently, particularly with the UNCERTAIN prompt, where the probability of generating non-symbol tokens is considerable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LLaMa-2-70b is better suited to weakening than GPT-4o.</head><p>A key distinction between the responses of language learners and LLMs is that, while both show high Mode Accuracy, LLMs demonstrate substantially higher Average Accuracy than humans, indicating that LLM distributions are generally skewed. An LLM well suited to weakening should therefore maintain Mode Accuracy while reducing Average Accuracy. In this respect, LLaMa-2-70b is better than GPT-4o. GPT-4o shows only limited reductions in Average Accuracy even with weakening prompts, including UNCERTAIN, which lowers both accuracies and the sum probability. Thus, its distributions remain distinct from those of language learners, as reflected by the persistently high KL divergence. In contrast, LLaMa-2-70b is able to reduce Average Accuracy while maintaining Mode Accuracy, especially with the ESL and UNCERTAIN prompts.</p><p>Prompt design plays a crucial role. Prompt designs markedly influence LLM outputs, as exemplified by the difference between the PORTRAY and ESL results on LLaMa-2-70b. While both prompts are designed to elicit language learner-like outputs and include a description of the targeted CEFR level, PORTRAY fails to weaken performance, whereas ESL reduces both Average Accuracy and KL Divergence. This suggests that there is still much room for prompt engineering in these designs, including UNCERTAIN.</p><p>Language learners and LLMs mistake different questions. Whereas KL divergence measures the similarity between two distributions, the Correct/Wrong F1 score directly measures the consistency of the most plausible answers given by humans and LLMs. LLMs show a low F1 score regardless of the prompt, indicating a discrepancy between the questions that lead to human errors and those that lead to LLM errors. LLaMa-2-70b shows the largest drop in KL divergence with the UNCERTAIN prompt compared to NONE. 
However, this does not correspond with a substantial improvement in the F1 score, suggesting that the LLM does not mimic human error patterns effectively. Since distributions by LLMs are generally skewed compared to those by language learners, the reduction of KL divergence is achievable by simply increasing the temperature parameter. This result reveals the importance of not only comparing distributions but also examining the consistency of the mode answers to mimic humans.</p></div>
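The last point, that flattening an already skewed distribution by raising the temperature alone lowers KL divergence against a flatter human-like distribution, can be checked with a toy example. The numbers below are illustrative placeholders, not values from CMCQRD.

```python
import math

def softmax_t(logits, t):
    """Softmax with temperature t; larger t flattens the distribution
    without changing which option has the highest probability."""
    exps = [math.exp(x / t) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A sharply skewed set of option logits (LLM-like) vs. a flatter,
# human-like selection distribution over four options.
llm_logits = [4.0, 1.0, 0.5, 0.0]
human = [0.55, 0.20, 0.15, 0.10]

kl_low_t = kl(softmax_t(llm_logits, 1.0), human)   # skewed LLM distribution
kl_high_t = kl(softmax_t(llm_logits, 3.0), human)  # flattened by temperature
```

Here `kl_high_t` is far smaller than `kl_low_t` even though the mode answer is unchanged, which is exactly why KL divergence alone can be misleading as a measure of human-likeness.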
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>Our results so far seem to demonstrate the inability of LLMs to mimic human language learners when solving RC tests, even when provided with weakening prompts.</p><p>In particular, we identify differences in the questions that language learners and the LLMs tend to answer incorrectly. In this section, we turn our attention to an analysis of the underlying factors behind these discrepancies.</p><p>We analyze the influence of context complexity on the accuracy gaps between language learners and the LLM (NONE-Human), as well as the gaps between the LLM with and without a weakening prompt (NONE-UNCERTAIN).</p><p>We select LLaMa-2-70b because of its ability to be weakened. Among the features used in prior research by Sugawara et al. <ref type="bibr" target="#b19">[20]</ref>, we select Passage Length, FKGL <ref type="bibr" target="#b29">[30]</ref>, and Word Frequency as indicators of complexity. Correlations are measured between these indicators and the accuracy gaps for each individual question. Table <ref type="table" target="#tab_2">3</ref> shows the correlations, some of which are statistically significant. For Passage Length, there is a weak positive correlation with the gap between NONE and Human at the B2 level, meaning that the longer the context, the harder it is for language learners to answer correctly compared to the LLM. This implies that a longer context may hinder B2 level language learners from finding the evidence needed to answer more than it does the LLM. FKGL, a readability metric based on the number of words and syllables per sentence, shows a weak-to-moderate positive correlation with the gap between the LLM and humans, and also with the gap between the LLM with and without the uncertainty prompt. Since FKGL is designed to yield lower values for easier texts, these statistically significant gaps imply that the LLM achieves higher accuracy on more complex contexts. 
The UNCERTAIN prompt slightly smooths this trend, but it does not enable the LLM to emulate the tendency of language learners. Finally, for Word Frequency, there is a weak positive correlation with the gap between NONE and UNCERTAIN at the B2 level. This may imply that UNCERTAIN weakens the LLM more when a context is composed of more common words.</p><p>Overall, these surface-level complexity indicators are not sufficient to explain the difference between language learners and LLMs. We reserve deeper analysis, such as semantic considerations, for future research.</p></div>
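The correlation analysis can be reproduced in outline with a plain Pearson coefficient. The per-question values below are hypothetical placeholders, not the paper's data, and the significance test used for Table 3 (p &lt; 0.05) is omitted from this sketch.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between a complexity indicator (e.g. FKGL)
    and per-question accuracy gaps."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-question values: FKGL of each passage and the gap
# between LLM and human average accuracy on that question.
fkgl = [6.2, 7.8, 8.5, 9.1, 10.4, 11.0]
gap = [0.02, 0.05, 0.04, 0.09, 0.12, 0.10]
r = pearson_r(fkgl, gap)  # positive r: LLM advantage grows with complexity
```

In practice a significance test (e.g. a t-test on r, or `scipy.stats.pearsonr`, which returns the p-value directly) would accompany each coefficient.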
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In conclusion, our research reveals that LLMs do not behave as second language learners, even with the potentially performance-weakening prompts we provide. We also observe that performance varies depending on the model and prompts used, even though only a limited set of models and prompts is considered. Expanding the variety of these elements, including prompts with more sophisticated approaches such as chain-of-thought <ref type="bibr" target="#b22">[23]</ref> and automatic prompt tuning <ref type="bibr" target="#b30">[31]</ref>, will be critical for a more comprehensive evaluation of mimicability.</p><p>Our findings demonstrate discrepancies between language learners and LLMs in terms of which questions they find easy or difficult, highlighting the necessity for micro-level analysis. Nonetheless, the limited size of the CMCQRD dataset used in this research presents challenges in drawing comprehensive conclusions. The development of datasets incorporating diverse personas beyond language learners is essential when trying to use LLMs as a complement to human evaluators. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Prompt Examples</head></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Statistics of CMCQRD dataset. We use RC tests at B1 and B2 levels with responses.</figDesc><table><row><cell></cell><cell cols="2">w/o responses</cell><cell></cell><cell cols="2">w/ responses</cell><cell></cell></row><row><cell>CEFR</cell><cell>Num</cell><cell>Num</cell><cell cols="3">Num Num Mode</cell><cell>Avg</cell></row><row><cell>Level</cell><cell>Text</cell><cell>QA</cell><cell>Text</cell><cell>QA</cell><cell>Acc</cell><cell>Acc</cell></row><row><cell>B1</cell><cell>5</cell><cell>25</cell><cell>23</cell><cell>115</cell><cell>0.913</cell><cell>0.590</cell></row><row><cell>B2</cell><cell>21</cell><cell>160</cell><cell>37</cell><cell>262</cell><cell>0.882</cell><cell>0.594</cell></row><row><cell>C1</cell><cell>13</cell><cell>86</cell><cell>12</cell><cell>83</cell><cell>0.880</cell><cell>0.613</cell></row><row><cell>C2</cell><cell>3</cell><cell>20</cell><cell>6</cell><cell>42</cell><cell>0.833</cell><cell>0.681</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Result on CMCQRD Dataset. Values on KL and C/W F1 are those compared to Human language learners above.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell>B1</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>B2</cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell>Mode</cell><cell>Avg</cell><cell></cell><cell>C/W</cell><cell>Sum</cell><cell>Mode</cell><cell>Avg</cell><cell></cell><cell>C/W</cell><cell>Sum</cell></row><row><cell>System</cell><cell>Prompt</cell><cell>Acc</cell><cell>Acc</cell><cell>KL↓</cell><cell>F1↑</cell><cell>Prob.</cell><cell>Acc</cell><cell>Acc</cell><cell>KL↓</cell><cell>F1↑</cell><cell>Prob.</cell></row><row><cell>Human</cell><cell>-</cell><cell>0.913</cell><cell>0.585</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>0.885</cell><cell>0.592</cell><cell>-</cell><cell>-</cell><cell>-</cell></row><row><cell></cell><cell>NONE</cell><cell>0.974</cell><cell cols="4">0.974 0.570 0.552 0.994</cell><cell>0.931</cell><cell cols="4">0.929 0.576 0.633 0.971</cell></row><row><cell></cell><cell>PORTRAY</cell><cell>0.974</cell><cell cols="4">0.971 0.566 0.552 0.988</cell><cell>0.927</cell><cell cols="4">0.927 0.580 0.606 0.975</cell></row><row><cell>GPT-4o</cell><cell>ESL</cell><cell>0.965</cell><cell cols="4">0.964 0.563 0.544 0.895</cell><cell>0.927</cell><cell cols="4">0.926 0.554 0.651 0.842</cell></row><row><cell></cell><cell>UNCERTAIN</cell><cell>0.713</cell><cell cols="4">0.719 0.795 0.471 0.155</cell><cell>0.828</cell><cell cols="4">0.805 0.711 0.572 0.228</cell></row><row><cell></cell><cell>MASK</cell><cell>0.922</cell><cell cols="4">0.918 0.562 0.512 0.868</cell><cell>0.851</cell><cell cols="4">0.852 0.578 0.608 0.798</cell></row><row><cell></cell><cell>NONE</cell><cell>0.930</cell><cell cols="4">0.839 0.338 0.518 0.993</cell><cell>0.854</cell><cell cols="4">0.756 
0.354 0.611 0.992</cell></row><row><cell></cell><cell>PORTRAY</cell><cell>0.930</cell><cell cols="4">0.831 0.320 0.518 0.984</cell><cell>0.847</cell><cell cols="4">0.740 0.332 0.604 0.980</cell></row><row><cell>LLaMa-2-70b</cell><cell>ESL</cell><cell>0.922</cell><cell cols="4">0.750 0.211 0.512 0.973</cell><cell>0.851</cell><cell cols="4">0.674 0.263 0.658 0.969</cell></row><row><cell></cell><cell>UNCERTAIN</cell><cell>0.922</cell><cell cols="4">0.646 0.163 0.512 0.966</cell><cell>0.839</cell><cell cols="4">0.556 0.226 0.646 0.971</cell></row><row><cell></cell><cell>MASK</cell><cell>0.843</cell><cell cols="4">0.750 0.294 0.553 0.988</cell><cell>0.755</cell><cell cols="4">0.644 0.391 0.533 0.983</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Correlation between the gap and complexity measures. N, H, and U mean NONE, Human, and UNCERTAIN, respectively. * means statistical significance at 𝑝 &lt; 0.05.</figDesc><table><row><cell></cell><cell></cell><cell>N-H</cell><cell>N-U</cell><cell>Avg</cell></row><row><cell>Δ Average</cell><cell>B1</cell><cell>0.254</cell><cell>0.193</cell><cell>-</cell></row><row><cell>Accuracy</cell><cell>B2</cell><cell>0.164</cell><cell>0.200</cell><cell>-</cell></row><row><cell>Passage</cell><cell>B1</cell><cell>-0.14</cell><cell>0.01</cell><cell>342.2</cell></row><row><cell>Length</cell><cell>B2</cell><cell>0.14*</cell><cell>0.04</cell><cell>656.7</cell></row><row><cell>FKGL</cell><cell>B1 B2</cell><cell>0.31* 0.05</cell><cell>0.23* -0.02</cell><cell>9.69 9.22</cell></row><row><cell>Word Freq</cell><cell>B1</cell><cell>-0.18</cell><cell>-0.08</cell><cell>6.53</cell></row><row><cell>(per 1k words)</cell><cell>B2</cell><cell>-0.01</cell><cell>0.13*</cell><cell>6.44</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Examples of designed prompts. NONE CONTEXT: I won't pretend being a flight attendant is easy. But since I started the job, I've been everywhere, from the US to Australia. I work with incredible people, I have a lot of time off, and life is never boring -which ... QUESTION: What does Jack say about attending his job interview? A) He was surprised at the age range of people there. B) He made sure he seemed different from the others. C) He wondered whether he had enough qualifications. D) He realised there were too many people for the jobs available. ANSWER:\n PORTRAY Answer the following reading comprehension questions as if you are a CEFR B1 level English learner. Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation. {Same as NONE from CONTEXT: to ANSWER:\n} ESL You are an ESL teacher. What do you think is the most plausible answer by CEFR B1 level learners to the following reading comprehension test? Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation. 
{Same as NONE from CONTEXT: to D) he ...} ANSWER: Given the context and considering that the test takers are at a CEFR B1 level, the most plausible answer they might choose could be:\n UNCERTAIN {Same as NONE from CONTEXT: to D) he ...} ANSWER: I'm not sure because there are some sentences I don't understand, but maybe the answer is:\n MASK CONTEXT: I won't [MASK] being a flight [MASK] is easy. But since I started the job, I've been everywhere, from the US to Australia. I work with incredible people, I have a lot of time off, and life is never [MASK] -which ... QUESTION: What does Jack say about attending his job interview? A) He was surprised at the age range of people there. B) He made sure he seemed different from the others. C) He [MASK] whether he had enough qualifications. D) He realised there were too many people for the jobs available.</figDesc><table /><note>ANSWER:\n</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://openai.com/index/hello-gpt-4o/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://huggingface.co/meta-llama/Llama-2-70b-hf</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The authors acknowledge the support from Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from the Maria de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. This research is part of a project that has received funding from the European Union's Horizon Europe research and innovation programme under Grant Agreement No. 101132431 (iDEM Project). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Gehrmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sellam</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.06935</idno>
		<title level="m">Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Evaluation of automatic text simplification: Where are we now, where should we go from here</title>
		<author>
			<persName><forename type="first">N</forename><surname>Grabar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Saggion</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.jeptalnrecital-taln.47" />
	</analytic>
	<monogr>
		<title level="m">Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles</title>
				<meeting>Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles<address><addrLine>ATALA, Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="453" to="463" />
		</imprint>
	</monogr>
	<note>conférence principale</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">MUSS: Multilingual unsupervised sentence simplification by mining paraphrases</title>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Éric</forename><surname>De La Clergerie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.00352</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Alva-Manchego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Scarton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.00481</idno>
		<title level="m">ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Säuberli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Holzknecht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Haller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deilen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schiffl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hansen-Schirra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ebling</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.13094</idno>
		<title level="m">Digital comprehensibility assessment of simplified texts among persons with intellectual disabilities</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">ChatGPT outperforms crowd workers for text-annotation tasks</title>
		<author>
			<persName><forename type="first">F</forename><surname>Gilardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alizadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kubli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the National Academy of Sciences</title>
		<imprint>
			<biblScope unit="volume">120</biblScope>
			<biblScope unit="page">e2305016120</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Iter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.16634</idno>
		<title level="m">G-Eval: NLG evaluation using GPT-4 with better human alignment</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Can AI language models replace human participants?</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dillion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tandon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gray</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Trends in Cognitive Sciences</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="page" from="597" to="600" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Hwang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">P</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tandon</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.14929</idno>
		<title level="m">Aligning language models to user opinions</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Whose opinions do language models reflect?</title>
		<author>
			<persName><forename type="first">S</forename><surname>Santurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Durmus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ladhak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hashimoto</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.17548</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Neural machine reading comprehension: Methods and trends</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">3698</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Mullooly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Andersen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Benedetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Buttery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Caines</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J F</forename><surname>Gales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Karatay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Knill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liusie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Raina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Taslimipoor</surname></persName>
		</author>
		<idno type="DOI">10.17863/CAM.102185</idno>
		<ptr target="https://www.repository.cam.ac.uk/handle/1810/358683" />
		<title level="m">The Cambridge Multiple-Choice Questions Reading Dataset</title>
				<imprint>
			<publisher>Cambridge University Press and Assessment</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Liusie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Raina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mullooly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Knill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J F</forename><surname>Gales</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.13047</idno>
		<title level="m">Analysis of the Cambridge multiple-choice questions reading dataset with a focus on candidate response distribution</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">L2 reading comprehension and its correlates: A meta-analysis</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">H</forename><surname>Jeon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yamashita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language learning</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="160" to="212" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">SQuAD: 100,000+ questions for machine comprehension of text</title>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lopyrev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2383" to="2392" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">RACE: Large-scale reading comprehension dataset from examinations</title>
		<author>
			<persName><forename type="first">G</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="785" to="794" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">HotpotQA: A dataset for diverse, explainable multi-hop question answering</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2369" to="2380" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">ReClor: A reading comprehension dataset requiring logical reasoning</title>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations, International Conference on Learning Representations</title>
				<meeting><address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">What makes reading comprehension questions difficult?</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sugawara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nangia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Warstadt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bowman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="6951" to="6971" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Leveraging large language models for multiple choice question answering</title>
		<author>
			<persName><forename type="first">J</forename><surname>Robinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Rytting</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wingate</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.12353</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">How can we know what language models know?</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">F</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Araki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00324</idno>
		<ptr target="https://aclanthology.org/2020.tacl-1.28" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="423" to="438" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Large language models are zero-shot reasoners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kojima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Iwasawa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="22199" to="22213" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hashimoto</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.13439</idno>
		<title level="m">Navigating the grey area: How expressions of uncertainty and overconfidence affect language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Blecher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Ferrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cucurull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Esiobu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hartshorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Inan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kardas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kerkez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Khabsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kloumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korenev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liskovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Molybog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poulton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reizenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rungta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saladi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schelten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">E</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">X</forename><surname>Kuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Zarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kambadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stojnic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Cover</surname></persName>
		</author>
		<title level="m">Elements of information theory</title>
				<meeting><address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<publisher>John Wiley &amp; Sons</publisher>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Large language model-based artificial intelligence in the language classroom: Practical ideas for teaching</title>
		<author>
			<persName><forename type="first">E</forename><surname>Bonner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lege</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Frazier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Teaching English with Technology</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="23" to="41" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">What percentage of text-lexis is essential for comprehension?</title>
		<author>
			<persName><forename type="first">B</forename><surname>Laufer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Special language: From humans thinking to thinking machines</title>
				<imprint>
			<date type="published" when="1989">1989</date>
			<biblScope unit="page">316</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English</title>
		<author>
			<persName><forename type="first">M</forename><surname>Brysbaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>New</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Behavior Research Methods</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="977" to="990" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Kincaid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Fishburne</surname><genName>Jr</genName></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Rogers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">S</forename><surname>Chissom</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1975">1975</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">Connecting large language models with evolutionary algorithms yields powerful prompt optimizers</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.08532</idno>
		<ptr target="https://arxiv.org/abs/2309.08532" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
