<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Tests as Second Language Learners?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Akio Hayakawa</string-name>
          <email>akio.hayakawa@upf.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horacio Saggion</string-name>
          <email>horacio.saggion@upf.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Natural Language Processing, Large Language Models, Question Answering, Reading Comprehension</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>30th ACM KDD Conference</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LaSTUS Lab, TALN Research Group, Department of Engineering, Universitat Pompeu Fabra</institution>
          ,
          <addr-line>C/Tànger 122 (08018), Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The manual evaluation of natural language processing systems is costly and time-consuming, especially when targeting people with specific attributes as evaluators. Current large language models (LLMs) are reported to outperform humans at various tasks, and recently have been used as substitutes for human evaluators. LLMs also have shown the ability to behave as specified in a prompt. This progress raises a fundamental question: can LLMs mimic the behavior of language learners? In this study, we intentionally weaken LLMs aiming to make them simulate language learners on multiple-choice reading comprehension tests. By comparing answer distributions from language learners and LLMs, we observe that prompts designed to weaken the LLMs indeed degrade their performance. However, this degradation does not bridge the gap between the original LLMs and language learners, thereby highlighting a critical discrepancy between them.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In the field of Natural Language Processing (NLP), the
evaluation of systems is commonly categorized into two
approaches: automatic and manual evaluation. Manual
evaluation, which is considered more reliable, involves
methods ranging from subjective scoring on scales, such
as a 5-point rating, to task-based assessments like solving
comprehension questions. Despite its reliability, manual
evaluation requires greater time and cost investments
[
        <xref ref-type="bibr" rid="ref13 ref19">1</xref>
        ].
      </p>
      <p>The difficulty of conducting manual evaluation significantly increases when targeting individuals with specific attributes, as access to these groups becomes more difficult. This has resulted in the diminished prioritization of their participation, calling into question the trustworthiness of manual evaluation. For instance, in the text simplification task, which aims to make texts more readable and understandable, children, language learners, and people with disabilities are considered ideal evaluators for the simplicity of texts, as they are presumed to benefit most from the simplification [2]. Nevertheless, studies on text simplification have relied on native speakers or people who do not need simplified texts for manual evaluation [3, 4], rarely involving individuals who need simplification, probably due to significant disparities in accessibility to diverse groups. Indeed, Sauberli et al. [5] recently demonstrated subjective differences in perceived text difficulty between people with and without intellectual disabilities, highlighting the importance of their involvement.</p>
      <p>Recent advancements in NLP, especially with Large Language Models (LLMs), may address this bottleneck. One line of work has attempted to substitute manual evaluation with assessments conducted by LLMs [6, 7, 8], seeking immediate and inexpensive annotations of higher quality. Another set of studies has reported that LLMs are capable of emulating a specific persona by including attributes in a prompt [9, 10].</p>
      <p>Therefore, we wonder if LLMs could be prompted to serve as substitutes for specific personas. This study specifically focuses on language learners, investigating whether LLMs can mimic their response patterns. This approach could potentially offer a more accessible means of obtaining evaluations for tasks that ideally require responses from specific target groups, such as predicting the difficulty of questions without a pilot pretesting stage, simply by providing their attributes in the prompt.</p>
      <p>To judge the mimicability of LLMs, we compare responses to multiple-choice reading comprehension (RC) tests, which have been widely used to measure language comprehension [11], from language learners and NLP systems. Using the CMCQRD dataset [12], which is a recently released four-choice RC test dataset with selection distributions from language learners, we aim to investigate if LLM output can closely approximate these distributions. While fine-tuning encoder models is one approach to pursuing distributions closer to those of humans [13], prompting LLMs has the potential to target a broader range of personas, suggesting enhanced applicability.</p>
      <p>Figure 1 illustrates the outline of our experimental setup. Given that current models in the NLP field often achieve or even surpass human-level performance on various tasks [14], it is reasonable to presume that LLMs could outperform the average language learner on RC tests. Hence, LLMs need to be weakened to mimic language learners. We try several prompting techniques to degrade LLM performance and analyze their effects.</p>
      <p>[Figure 1: Outline of the experimental setup. Weakening prompts such as NONE and PORTRAY are applied to a one-shot multiple-choice RC item (e.g., "What is Sam Fradd's aim in this text?" with options A-D), and the next-token probabilities for the option symbols are compared with the selection distribution of language learners. The figure asks whether the weakening prompt alters the next-token probability distributions: we investigate whether it is possible to make next-token probabilities of the LLM closer to the selection distribution by language learners, by weakening the LLM.]</p>
      <p>Contrary to our expectations, our preliminary experimental results show that the prompts considered do not lead LLMs to mimic language learners. Furthermore, we observe that the questions LLMs tend to answer incorrectly differ significantly from those that language learners struggle with. This discrepancy suggests a need for deeper analysis when we try to utilize LLM as a replacement for human evaluation.</p>
      <p>KiL'24: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference. © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2b">
      <title>2. Related Work</title>
      <sec id="sec-2b-1">
        <title>2.1. Human Response to Reading Comprehension Dataset</title>
        <p>Reading comprehension (RC) tests have been widely used in psycholinguistic studies to assess how well readers, especially language learners, understand the content of a given text [15]. While these studies have seldom made their original data publicly available, research in natural language processing has made standard datasets available to measure the text comprehension abilities of machines [16, 17], sometimes for specific capabilities such as reasoning in HotpotQA [18] and the use of external knowledge in ReClor [19]. However, these datasets are designed only to measure system performance, not for comparison with human responses. As a result, human responses to RC are absent from these datasets. There is limited research that compares responses from machines and humans, and even these studies typically offer only summarized data [20]. This data shortage has hindered research into machine emulation of human response.</p>
        <p>In contrast to this scarcity, CMCQRD [12] is a unique RC dataset which includes response data from language learners. CMCQRD adopts a multiple-choice setting like many of the RC datasets mentioned above, and includes the distribution of the choices among options. RC tests and participants are categorized based on the CEFR, which is a guideline used to describe achievements of foreign language learners. Among the six reference levels (A1, A2, B1, B2, C1, C2) of the CEFR, independent- (B1, B2) and proficient-level (C1, C2) are considered in the CMCQRD dataset. In other words, each question in this dataset is labeled with a difficulty level ranging from B1 to C2 according to the CEFR, and also includes the selection distribution by language learners whose proficiency corresponds to these labeled levels. This information enables a detailed analysis of the differences between language learners and machines. Liusie et al. [13] compared outputs from an ELECTRA-based classification model with human responses, reporting low similarity due to the model performing worse than language learners.</p>
      </sec>
      <sec id="sec-2b-2">
        <title>2.2. Prompts that Alter LLMs' Behaviour in Question Answering</title>
        <p>Retrieving distributions for multiple-choice questions from LLMs involves obtaining not only the final answer but also the probabilities associated with each option. While it is nontrivial to extract an answer or a probability because of the auto-regressive nature of text generation by LLMs, Robinson et al. [21] demonstrated that a multiple-choice prompt can lead to a higher probability of generating option symbols as the next token, especially with one or few-shot settings. Unlike a traditional cloze prompt, which selects the option with the highest sequence probability without giving other options, a multiple-choice prompt provides all options simultaneously and selects the one with the highest probability for the option symbols.</p>
        <p>However, even in this setting, it has been reported that LLMs respond less robustly to certain prompts [22, 23]. Utilizing this vulnerability, Santurkar et al. [10] suggested that LLMs can change the distributions of attitude options towards controversial social topics, when given prompts that mimic the behavior of a human group with specific attributes. LLMs' behaviour will also change when given a degree of certainty like "Perhaps it's" [24]. This change was observed in response to context-free open-ended questions, highlighting an opportunity for extended research in multiple-choice RC tests.</p>
      </sec>
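      <sec id="sec-2-8">
        <title>Sketch: Option Probabilities from Next-Token Logits</title>
        <p>The multiple-choice prompting scheme described above can be sketched as follows. This is a minimal illustration with hard-coded logits, assuming access to a model's next-token logits; it is not the authors' actual GPT-4o / LLaMa-2-70b pipeline.</p>
        <preformat>
```python
import math

OPTION_SYMBOLS = ["A", "B", "C", "D"]

def option_distribution(next_token_logits, temperature=1.0):
    """Softmax-normalize the logits of the option symbols A-D.

    next_token_logits: dict mapping next-token candidates to logits;
    only the four option symbols are kept, mirroring the
    multiple-choice prompting setup of Robinson et al.
    """
    logits = [next_token_logits[s] / temperature for s in OPTION_SYMBOLS]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {s: e / total for s, e in zip(OPTION_SYMBOLS, exps)}

def sum_option_probability(next_token_logits, temperature=1.0):
    """Probability mass on option symbols among ALL next-token
    candidates -- the 'Sum Probability' used to check that the prompt
    actually elicits an option symbol."""
    m = max(next_token_logits.values())
    exps = {t: math.exp((v - m) / temperature)
            for t, v in next_token_logits.items()}
    total = sum(exps.values())
    return sum(exps[s] for s in OPTION_SYMBOLS) / total
```
        </preformat>
        <p>Raising the temperature flattens the distribution over the options, which is why a lower KL divergence alone is not evidence of learner-like behaviour (see Section 4).</p>
      </sec>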
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>The primary objective of this work is to investigate whether LLMs can mimic the responses of language learners in solving multiple-choice RC tests. In this section, we outline our experimental setup, utilizing the CMCQRD dataset [12], which includes responses from at least 100 [...] about answer probability distributions. Our analysis compares the next-token probability on each option by LLMs with the choice patterns of language learners, aiming to understand the extent of LLMs' capability in emulating learner-like understanding in RC tasks.</p>
      <p>Assuming that up-to-date LLMs outperform average language learners, degrading these models is needed to bring their output distributions closer to those of language learners. We employ several methods to weaken the LLM performance and compare the results to the language learners.</p>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>The CMCQRD dataset consists of 4-choice English RC tests, labeled with difficulty levels ranging from CEFR B1 to C2. A subset of CMCQRD includes responses from non-native English speakers whose proficiency aligns with the difficulty label [12, 13]. We refer to this set of responses as the human distribution. [...] 60%, while the accuracies of their mode selections are around 90%. In this experiment, we exclusively use questions at levels B1 and B2 with a human distribution, corresponding to intermediate levels of proficiency. Our focus on these levels is driven by our aim to assess the ability of LLMs to reproduce the challenges faced by language learners who are not fully proficient in reading comprehension.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Statistics of the CMCQRD dataset. We use RC tests at B1 and B2 levels with responses.</p></caption>
          <table>
            <thead>
              <tr><th/><th>CEFR Level</th><th>B1</th><th>B2</th><th>C1</th><th>C2</th></tr>
            </thead>
            <tbody>
              <tr><td rowspan="2">w/o responses</td><td>Num Text</td><td>5</td><td>21</td><td>13</td><td>3</td></tr>
              <tr><td>Num QA</td><td>25</td><td>160</td><td>86</td><td>20</td></tr>
              <tr><td rowspan="2">w/ responses</td><td>Num Text</td><td>23</td><td>37</td><td>12</td><td>6</td></tr>
              <tr><td>Num QA</td><td>115</td><td>262</td><td>83</td><td>42</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>LLM Settings</title>
        <p>Since the outputs of LLMs are autoregressive and free-form, some techniques are required to increase the likelihood of desired tokens in subsequent outputs. To this end, we employ a multiple-choice prompting approach for RC, as described in Robinson et al. [21]. This approach provides LLMs with a single natural language prompt that concatenates the context, a question, options, and an option-symbol-prompting word, such as "Answer:". We take the next-token probabilities as the distribution by the LLM: the logits of the next tokens associated with the option symbols, {A, B, C, D} on 4-choice tests, are normalized using softmax. We use GPT-4o1 and LLaMa-2-70b, the latter loaded from the Hugging Face library with 4-bit quantization.2 The temperature parameter is set to 1.0 for both models.</p>
        <p>1 https://openai.com/index/hello-gpt-4o/
2 https://huggingface.co/meta-llama/Llama-2-70b-hf</p>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation</title>
        <p>To compare human and LLM outputs, we use mode accuracy, average accuracy, and KL divergence following Liusie et al. [13], and also the correct/wrong F1 score. Below is the description of these metrics.</p>
        <p>1. Mode Accuracy: how frequently the most plausible symbol by the LLM is the correct answer, denoted as Mode Accuracy = E_q[argmax_o(p_q^LLM) = o_q^ans], where p_q represents the probabilities for each option and o_q^ans is the correct option.</p>
        <p>2. Average Accuracy: how frequently the correct option is selected on average by the LLM, denoted as Average Accuracy = E_q[p_q^LLM(o_q^ans)].</p>
        <p>3. KL Divergence: the similarity between two distributions [26], denoted as KL Divergence = Σ_o p^h(o) log(p^h(o) / p^LLM(o)), where o represents an option selection, with the LLM and human distributions fixed to p^LLM and p^h, respectively.</p>
        <p>4. Correct/Wrong F1: the macro-averaged F1 score focused on question-wise correct and wrong consistency on mode options, denoted as Correct/Wrong F1 = (1/2)(F1_correct + F1_wrong), where each F1 score is calculated from the elements of the confusion matrix, such as TP_correct = Σ_q [(argmax_o(p_q^LLM) = o_q^ans) ∧ (argmax_o(p_q^h) = o_q^ans)] and FP_wrong = Σ_q [(argmax_o(p_q^LLM) ≠ o_q^ans) ∧ (argmax_o(p_q^h) = o_q^ans)].</p>
        <p>Furthermore, we calculate the summation of the probabilities for option symbols appearing as the next token to evaluate the effectiveness of the prompts.</p>
      </sec>
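      <sec id="sec-3-3b">
        <title>Sketch: Computing the Evaluation Metrics</title>
        <p>The four metrics above can be written directly from their definitions. A minimal sketch over per-question option distributions; the distributions in the example are toy values, not CMCQRD data.</p>
        <preformat>
```python
import math

def mode_accuracy(llm_dists, answers):
    # fraction of questions where the LLM's most probable option is correct
    hits = sum(max(p, key=p.get) == a for p, a in zip(llm_dists, answers))
    return hits / len(answers)

def average_accuracy(llm_dists, answers):
    # expected probability mass the LLM assigns to the correct option
    return sum(p[a] for p, a in zip(llm_dists, answers)) / len(answers)

def kl_divergence(human, llm, eps=1e-12):
    # D_KL(human || LLM) over the options of one question
    return sum(h * math.log((h + eps) / (llm[o] + eps))
               for o, h in human.items() if h > 0)

def correct_wrong_f1(llm_dists, human_dists, answers):
    # macro F1 over question-wise agreement of mode answers being
    # correct/wrong for the LLM vs. the human distribution
    llm_correct = [max(p, key=p.get) == a for p, a in zip(llm_dists, answers)]
    hum_correct = [max(p, key=p.get) == a for p, a in zip(human_dists, answers)]
    def f1(label):
        tp = sum(l == label and h == label for l, h in zip(llm_correct, hum_correct))
        fp = sum(l == label and h != label for l, h in zip(llm_correct, hum_correct))
        fn = sum(l != label and h == label for l, h in zip(llm_correct, hum_correct))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return 0.5 * (f1(True) + f1(False))
```
        </preformat>
      </sec>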
    </sec>
    <sec id="sec-3b">
      <title>4. Results</title>
      <p>[...] Accuracy compared to humans, indicating that distributions by LLMs are generally skewed. Therefore, an LLM suited for weakening can maintain Mode Accuracy while reducing Average Accuracy. In this aspect, LLaMa-2-70b is better than GPT-4o. GPT-4o shows minimal changes in Average Accuracy even with weakening prompts, including UNCERTAIN, which drops accuracies and Sum Probability. Thus, its distributions remain distinct from language learners, as reflected by the persistently high KL divergence. In contrast, LLaMa-2-70b shows the ability to reduce Average Accuracy while maintaining Mode Accuracy, especially with the ESL and UNCERTAIN prompts.</p>
      <p>Prompt design plays a crucial role. Prompt designs markedly influence the outputs from LLMs, as exemplified by the difference between the PORTRAY and ESL results on LLaMa-2-70b. While both prompts are designed to emulate language learner-like outputs and include the description of the targeted CEFR level, PORTRAY fails to weaken performance, whereas ESL leads to reductions in Average Accuracy and KL Divergence. This suggests that there is much room for prompt engineering in other designs, including UNCERTAIN.</p>
      <p>Language learners and the LLM mistake different questions. Whereas KL divergence measures the similarity between two distributions, the Correct/Wrong F1 score directly measures the consistency of the most plausible answers by humans and LLMs. LLMs show a low F1 score regardless of the prompt given, indicating a discrepancy between questions that lead to human errors and those that lead to LLM errors. LLaMa-2-70b shows the largest drop in KL divergence with the UNCERTAIN prompt compared to NONE. However, this does not correspond with a substantial improvement in the F1 score, suggesting that the LLM does not mimic human error patterns effectively. Since distributions by LLMs are generally skewed compared to those by language learners, the reduction of KL divergence is achievable by simply increasing the temperature parameter. This result reveals the importance of not only comparing distributions but also examining the consistency of the mode answers to mimic humans.</p>
    </sec>
    <sec id="sec-3c">
      <title>5. Discussion</title>
      <p>Our results so far seem to demonstrate the inability of LLMs to mimic human language learners when solving RC tests, even when provided with weakening prompts. In particular, we identify differences in the questions that language learners and the LLMs tend to answer incorrectly. In this section, we turn our attention to an analysis of the underlying factors for these discrepancies.</p>
      <p>We analyze the influence of the complexity of context on accuracy gaps between language learners and the LLM (NONE-Human), and also the gaps between the LLM with and without a weakening prompt (NONE-UNCERTAIN). We select LLaMa-2-70b because of its ability to be weakened. Among the features used in prior research by Sugawara et al. [20], we select Passage Length, FKGL [30], and Word Frequency as indicators of complexity. Correlations are measured between these indicators and the accuracy gaps for each individual question.</p>
      <p>[Table 3: Correlation between the gap and complexity measures. N, H, and U mean NONE, Human, and UNCERTAIN, respectively. * means statistical significance at p &lt; 0.05. Rows: Δ Average Accuracy against Passage Length, FKGL, and Word Freq (per 1k words) at B1 and B2; columns: N-H, N-U, Avg. Individual cell values are not legible in this version.]</p>
      <p>Table 3 shows the correlations, some of which are statistically significant. For Passage Length, there is a weak positive correlation with the gap between NONE and Human at the B2 level, which means that the longer the context, the harder it is for language learners to answer correctly compared to the LLM. This implies that a longer context may hinder B2 level language learners from finding the evidence needed to answer more than it does the LLM. FKGL, a readability metric based on the number of words and syllables per sentence, shows a weak-to-moderate positive correlation with the gap between LLM and human, and also the gap between LLMs with and without the uncertainty prompt. Since FKGL is designed to show a lower value on easier texts, these statistically significant gaps imply that the LLM shows a higher accuracy in more complex contexts. The UNCERTAIN prompt can slightly smooth this trend, but it does not enable the LLM to emulate the tendency of language learners. Finally, for Word Frequency, there is a weak positive correlation with the gap between NONE and UNCERTAIN at the B2 level. This may imply that UNCERTAIN weakens LLMs more when a context is composed of more common words.</p>
      <p>Overall, these surface-level complexity indicators are not sufficient to explain the difference between language learners and LLMs. We reserve deeper analysis, such as semantic considerations, for our further research.</p>
    </sec>
    <sec id="sec-3d">
      <title>6. Conclusion</title>
      <p>In conclusion, our research reveals that LLMs do not behave as second language learners even with the potentially performance-weakening prompts we provide. We also observe that the performance varies depending on the model and prompts used, even though a limited set of models and prompts are considered. Expanding the variety of these elements, including prompts with more sophisticated approaches such as chain-of-thought [23] and automatic prompt tuning [31], will be critical for a more comprehensive evaluation of the mimicability.</p>
      <p>Our findings demonstrate discrepancies between language learners and LLMs in terms of the easiness of questions, highlighting the necessity for micro-level analysis. Nonetheless, the limited size of the CMCQRD dataset used in this research presents challenges in drawing comprehensive conclusions. The development of datasets incorporating diverse personas beyond language learners is essential when trying to use LLMs as a complement to human evaluators.</p>
    </sec>
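    <sec id="sec-3e">
      <title>Sketch: Surface-Level Complexity Indicators</title>
      <p>The surface-level indicators used in the discussion (Passage Length, FKGL, Word Frequency) can be approximated as follows. The syllable counter is a rough vowel-group heuristic and the frequency-table format is our own simplification, not the implementation used in the paper.</p>
      <preformat>
```python
import re

def count_syllables(word):
    # rough heuristic: runs of vowels, minus a common silent final 'e'
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def complexity_indicators(text, freq_table=None):
    """Passage Length, FKGL, and mean word frequency for one passage.

    freq_table: optional dict of word -> occurrences per 1k words in a
    reference corpus (hypothetical format); None skips that indicator.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid Grade Level [30]
    fkgl = (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)
    out = {"passage_length": len(words), "fkgl": fkgl}
    if freq_table is not None:
        out["word_freq"] = sum(freq_table.get(w.lower(), 0.0)
                               for w in words) / len(words)
    return out
```
      </preformat>
    </sec>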
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The authors acknowledge the support from Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021) and from the Maria de Maeztu Units of Excellence Programme CEX2021-001195-M, funded by MCIN/AEI/10.13039/501100011033. This research is part of a project that has received funding from the European Union's Horizon Europe research and innovation program under the Grant Agreement No. 101132431 (iDEM Project). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.</p>
      <sec id="sec-4-1">
        <title>Appendix: Prompts</title>
        <sec id="sec-4-1-1">
          <title>NONE</title>
          <p>CONTEXT: I won't pretend being a flight attendant is easy. But since I started the job, I've been everywhere, from the US to Australia. I work with incredible people, I have a lot of time off, and life is never boring - which ...</p>
          <p>QUESTION: What does Jack say about attending his job interview?</p>
          <p>A) He was surprised at the age range of people there.
B) He made sure he seemed different from the others.
C) He wondered whether he had enough qualifications.
D) He realised there were too many people for the jobs available.</p>
          <p>ANSWER:\n</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>PORTRAY</title>
          <p>Answer the following reading comprehension questions as if you are a CEFR B1 level English learner. Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.</p>
          <p>{Same as NONE from CONTEXT: to ANSWER:\n}</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>ESL</title>
          <p>You are an ESL teacher. What do you think is the most plausible answer by CEFR B1 level learners to the following reading comprehension test? Learners at this level can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. But sometimes it may be difficult to understand the main ideas of complex text on both concrete and abstract topics, including technical discussions in his/her field of specialisation.</p>
        </sec>
      </sec>
    </sec>
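    <sec id="sec-5">
      <title>Sketch: Assembling the Prompts</title>
      <p>The prompt variants in the appendix can be assembled programmatically. A sketch in which the function names and parameters are ours; the wording is taken from the templates above, and the one-shot example used in the paper is passed in as an opaque string.</p>
      <preformat>
```python
B1_DESCRIPTION = (
    "Learners at this level can understand the main points of clear standard "
    "input on familiar matters regularly encountered in work, school, leisure, "
    "etc. But sometimes it may be difficult to understand the main ideas of "
    "complex text on both concrete and abstract topics, including technical "
    "discussions in his/her field of specialisation."
)

def build_none_prompt(context, question, options, one_shot=""):
    # NONE: bare multiple-choice prompt ending in "ANSWER:" so that the
    # next token is an option symbol (A-D); one_shot holds the 1-shot example
    opts = "\n".join(f"{sym}) {text}" for sym, text in zip("ABCD", options))
    body = f"CONTEXT: {context}\nQUESTION: {question}\nOPTIONS:\n{opts}\nANSWER:\n"
    return one_shot + body

def build_portray_prompt(context, question, options, one_shot=""):
    # PORTRAY: ask the model to answer as a CEFR B1 learner
    header = ("Answer the following reading comprehension questions as if you "
              f"are a CEFR B1 level English learner. {B1_DESCRIPTION}\n")
    return header + build_none_prompt(context, question, options, one_shot)

def build_esl_prompt(context, question, options, one_shot=""):
    # ESL: ask the model, as an ESL teacher, for the most plausible B1 answer
    header = ("You are an ESL teacher. What do you think is the most plausible "
              "answer by CEFR B1 level learners to the following reading "
              f"comprehension test? {B1_DESCRIPTION}\n")
    return header + build_none_prompt(context, question, options, one_shot)
```
      </preformat>
    </sec>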
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[18] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning, HotpotQA: A dataset for diverse, explainable multi-hop question answering, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, USA, 2018, pp. 2369-2380.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[19] W. Yu, Z. Jiang, Y. Dong, J. Feng, ReClor: A reading comprehension dataset requiring logical reasoning, in: International Conference on Learning Representations, 2020.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[...] -based artificial intelligence in the language classroom: Practical ideas for teaching, Teaching English with Technology 23 (2023) 23-41.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[28] B. Laufer, What percentage of text-lexis is essential for comprehension?, in: Special language: From humans thinking to thinking machines, 1989, p. 316.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[29] M. Brysbaert, B. New, Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English, Behavior Research Methods 41 (2009) 977-990.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[30] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for Navy enlisted personnel, 1975.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[20] S. Sugawara, N. Nangia, A. Warstadt, S. Bowman, What makes reading comprehension questions difficult?, in: Proceedings of the 60th Annual Meeting</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>of the Association for Computational Linguistics</source>
          [31]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          , Association for Computa- G. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Connecting large language
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>tional</surname>
            <given-names>Linguistics</given-names>
          </string-name>
          , USA,
          <year>2022</year>
          , pp.
          <fpage>6951</fpage>
          -
          <lpage>6971</lpage>
          .
          <article-title>models with evolutionary algorithms yields</article-title>
          power[21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Rytting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wingate</surname>
          </string-name>
          , Leveraging ful prompt optimizers,
          <year>2024</year>
          . URL: https://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>large language models for multiple choice question abs/2309.08532</article-title>
          . arXiv:
          <volume>2309</volume>
          .
          <fpage>08532</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>answering</surname>
          </string-name>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2210</volume>
          .
          <fpage>12353</fpage>
          . [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Araki</surname>
          </string-name>
          , G. Neubig, How can we
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>of the Association for Computational Linguistics 8</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          (
          <year>2020</year>
          )
          <fpage>423</fpage>
          -
          <lpage>438</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>tacl-1</source>
          .28. doi:
          <volume>10</volume>
          .1162/tacl_a_
          <fpage>00324</fpage>
          . [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          . [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          , T. Hashimoto, Navigating
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>and overconfidence afect language models</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <source>arXiv:2302</source>
          .
          <fpage>13439</fpage>
          . [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Alma-
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>driguez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Stojnic</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Edunov</surname>
          </string-name>
          ,
          <source>T. Scialom, Llama</source>
          <volume>2</volume>
          :
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>Open foundation and fine-tuned chat models</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>arXiv:2307</source>
          .
          <fpage>09288</fpage>
          . [26]
          <string-name>
            <surname>T. M. Cover</surname>
          </string-name>
          , Elements of information theory, John
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          Wiley &amp; Sons, USA,
          <year>1999</year>
          . [27]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bonner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lege</surname>
          </string-name>
          , E. Frazier, Large language model-
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>Given the context and considering that the test takers are at a CEFR B1 level, the</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>