<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Can LLMs evaluate items measuring collaborative problem-solving?</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ella</forename><surname>Anghel</surname></persName>
							<email>anghel@bc.edu</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">International Study Center</orgName>
								<orgName type="department" key="dep2">Lynch School of Education and Human Development</orgName>
								<orgName type="institution">Boston College</orgName>
								<address>
									<addrLine>140 Commonwealth Ave</addrLine>
									<postCode>02467</postCode>
									<settlement>Chestnut Hill</settlement>
									<region>MA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yu</forename><surname>Wang</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Department of Administration, Leadership &amp; Technology</orgName>
							<orgName type="institution">New York University Steinhardt School of Culture, Education &amp; Human Development</orgName>
								<address>
									<addrLine>82 Washington Square East</addrLine>
									<postCode>10003</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Madhumitha</forename><surname>Gopalakrishnan</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Department of Administration, Leadership &amp; Technology</orgName>
							<orgName type="institution">New York University Steinhardt School of Culture, Education &amp; Human Development</orgName>
								<address>
									<addrLine>82 Washington Square East</addrLine>
									<postCode>10003</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pranali</forename><surname>Mansukhani</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Department of Administration, Leadership &amp; Technology</orgName>
							<orgName type="institution">New York University Steinhardt School of Culture, Education &amp; Human Development</orgName>
								<address>
									<addrLine>82 Washington Square East</addrLine>
									<postCode>10003</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yoav</forename><surname>Bergner</surname></persName>
							<email>yoav.bergner@nyu.edu</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Department of Administration, Leadership &amp; Technology</orgName>
							<orgName type="institution">New York University Steinhardt School of Culture, Education &amp; Human Development</orgName>
								<address>
									<addrLine>82 Washington Square East</addrLine>
									<postCode>10003</postCode>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Can LLMs evaluate items measuring collaborative problem-solving?</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">5604AB47D018B06A5F7E00C81E4A46CF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>large language models</term>
					<term>item evaluation</term>
					<term>collaborative problem-solving</term>
					<term>prompt engineering</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Collaborative problem-solving (CPS) is a vital skill for students to learn, but designing CPS assessments is challenging due to the construct's complexity. Advances in the capabilities of large language models (LLMs) have the potential to aid the design and evaluation of CPS items. In this study, we tested whether six LLMs agree with human judges on the quality of items measuring CPS. We found that GPT-4 was consistently the best-performing model, with an overall accuracy of .77 (𝜅 = .53). GPT-4 performed best with zero-shot prompts, and other models benefited only marginally from more complex prompts (few-shot, chain-of-thought). This work highlights challenges in using LLMs for assessment and proposes future research directions on the utility of LLMs for assessment design.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Collaborative problem-solving (CPS) is one of the most important 21st century skills according to employers <ref type="bibr" target="#b0">[1]</ref> and has for some time attracted the interest of K-12 educators and policymakers. High-quality assessment of CPS is a vital companion for curricula designed to develop this skill. However, the complexity of the construct makes it challenging to design items that properly target CPS and to evaluate the quality of candidate items. In recent years, the use of large language models (LLMs) and other AI-based methods has been proposed for determining psychometric properties such as item difficulty <ref type="bibr" target="#b1">[2]</ref>. These approaches are rarely applied to the evaluation of items' construct representation or to complex constructs like CPS. Therefore, it is unclear whether LLMs are suitable for such tasks. The current study aims to fill this gap by testing whether LLMs agree with humans on quality criteria for CPS items.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Literature review</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Collaborative learning</head><p>It is now well established that collaboration and teamwork are essential for success in educational and work settings <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b2">3]</ref>. The importance of collaborative problem solving (CPS) has led policymakers to advocate for the development of high-quality CPS assessments <ref type="bibr" target="#b3">[4]</ref>. These calls have been answered by several national and international assessment programs <ref type="bibr" target="#b4">[5]</ref>.</p><p>From a socio-cognitive perspective, CPS is also believed to improve learning of the underlying domain. However, simply working together on a task is not enough to facilitate learning <ref type="bibr" target="#b5">[6]</ref>. Good CPS tasks should be challenging enough to justify the higher cognitive load of collaborating <ref type="bibr" target="#b6">[7]</ref>, focus on conceptual rather than procedural material <ref type="bibr" target="#b7">[8]</ref>, and involve positive interdependence among participants <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>.</p><p>While the importance of CPS in and of itself and as a contributor to other learning is widely supported, it remains a difficult construct to assess <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>. Some challenges relate to construct definition, confounding factors, and psychometric modeling <ref type="bibr" target="#b12">[13]</ref>. But even designing items that foster positive interdependence can be quite tricky. Many "collaborative" tasks can be solved either individually or by dividing the work among the group members rather than through collaboration. 
For example, the PISA 2015 tasks measuring CPS seem to encourage the test-takers to divide the work with their collaborators <ref type="bibr" target="#b13">[14]</ref>.</p><p>Collaborative learning scholars have emphasized the task design component in contrast to, for example, (over-)scripting student interactions <ref type="bibr" target="#b14">[15]</ref>. This approach was central in an online learning and assessment environment called Collaborative Higher-Order Problem Solving (CHOPS) <ref type="bibr" target="#b15">[16]</ref>. There, pairs of students work collaboratively to solve math problems built around three item "templates" designed to foster positive interdependence. These templates are described here, using somewhat trivial examples for illustrative purposes:</p><p>1. Jigsaw -Students must exchange information to solve the problem, as they have only part of the necessary information. For example, one student might have the length of one side of a rectangle and another student has the length of the adjacent side. Together they are asked to find its area. 2. Joint construction -A correct answer is composed of elements provided by each student that must together satisfy some criteria. For example, each student must provide the length of one side of a rectangle such that its area is 48 units. While there may be multiple solutions, the students must coordinate their responses. 3. Information request -Students have an under-specified problem with limited options to request information to complete the task. The pair must decide together what information is needed and coordinate who should ask for what. For example, the students are asked to determine how long a trip should take and can each request one of the following: the car's fuel usage, the distance traveled, the car's average speed, or when the car left its origin.</p><p>These templates allow for relatively short-duration CPS items (compared with elaborate scenario tasks). 
Consequently, many items can be delivered and reliability improved. Item developers can be trained to adapt many "standard" types of test questions to these templates, but the process is still quite time-consuming.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Large language models</head><p>In recent years, the performance of LLMs, such as OpenAI's GPT and Meta's Llama, has improved significantly <ref type="bibr" target="#b16">[17]</ref>. As a result, these models have been applied in diverse areas such as medicine, computing, basic science, and education <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21]</ref>. Specifically, GPT-3.5 and GPT-4 have included innovations in bias reduction and complex problem-solving, which are essential for educational applications like content creation, interactive learning, and teaching assistance <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b23">24]</ref>. Notwithstanding the name "OpenAI", GPT models are proprietary, potentially expensive, and require users to upload private information to OpenAI servers. Open-source initiatives like Llama and Mistral offer promising alternatives. These models have encouraged an efflorescence of open-source additions, for example, other-than-English language capabilities <ref type="bibr" target="#b24">[25]</ref>.</p><p>While LLMs are often remarkably effective at interpreting natural language prompts, higher-quality prompts can yield significantly better outputs <ref type="bibr" target="#b25">[26]</ref>. Prompt engineering has emerged as a design problem for refining the content and structure of LLM prompts to optimize for specific tasks <ref type="bibr" target="#b26">[27]</ref>. Some prompt engineering best practices involve writing clear, detailed instructions, separating distinct parts of the input, asking the model to adopt a persona, and instructing the LLM to work out the solution rather than immediately constructing the answer <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref>. 
A naming convention has emerged in the literature to describe different prompt variations. These names refer to the number of worked examples given to the LLM: zero-shot learning (ZSL) relies solely on the LLM's pre-trained "knowledge" along with the task description, without any worked examples. In contrast, one-shot learning (OSL) includes one example in the prompt, and few-shot learning (FSL) includes two or more <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b30">31]</ref>. There are also variations in the presentation of worked examples. For example, a prompt can include just the correct label or desired response. In chain-of-thought (CoT) reasoning <ref type="bibr" target="#b31">[32]</ref>, however, the prompt demonstrates a multi-step reasoning process, mimicking how a human would approach the problem. These prompting approaches constitute sources of variance that may be important for educational researchers working with LLMs.</p></div>
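The prompt variants above differ only in what is placed between the task description and the item to evaluate. The following is an illustrative sketch (not the study's actual code; the task text, example items, and labels shown here are hypothetical placeholders):

```python
# Illustrative sketch of ZSL / FSL / CoT prompt construction.
# The task text, example items, and responses below are hypothetical.

ZSL_TASK = "Evaluate whether this exercise requires both students to collaborate."

# Few-shot (FSL): worked examples carry only the pass/fail label.
FSL_EXAMPLES = [
    ("Version A: ... Version B: ...", "Verdict: fail"),
]

# Chain-of-thought (CoT): worked examples show the reasoning before the label.
COT_EXAMPLES = [
    ("Version A: ... Version B: ...",
     "Both versions contain identical information, so each student can "
     "solve the exercise alone. Verdict: fail"),
]

def build_prompt(task: str, examples: list[tuple[str, str]], item: str) -> str:
    """Assemble a prompt: task description, optional worked examples, then item."""
    parts = [task]
    for ex_item, ex_response in examples:
        parts.append(f"# Example\n{ex_item}\n{ex_response}")
    parts.append(f"The following is the exercise you need to evaluate:\n{item}")
    return "\n\n".join(parts)

item = "Version A: ... Version B: ..."
zsl = build_prompt(ZSL_TASK, [], item)           # zero-shot: no examples
fsl = build_prompt(ZSL_TASK, FSL_EXAMPLES, item)  # few-shot: labels only
cot = build_prompt(ZSL_TASK, COT_EXAMPLES, item)  # CoT: reasoning then verdict
```

One-shot learning (OSL) is simply the special case of a single entry in the examples list.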
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Large language models in assessment</head><p>Advancements in LLMs have not gone unnoticed by the measurement field, where they have been considered for item generation, scoring, and parameter calibration <ref type="bibr" target="#b32">[33]</ref>. Relatively little research has been conducted on item evaluation using LLMs. Most of this research has focused on automatic evaluation of item difficulty <ref type="bibr" target="#b1">[2]</ref>. For instance, researchers used LLM responses to items to evaluate the guessability or the knowledge required to respond to those items <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b34">35]</ref>. Others have focused on the linguistic features of items <ref type="bibr" target="#b35">[36,</ref><ref type="bibr" target="#b36">37,</ref><ref type="bibr" target="#b37">38]</ref>. Only a few studies attempted to automatically evaluate items' content <ref type="bibr" target="#b38">[39]</ref>, and they generally did not use LLMs for this purpose.</p><p>The contribution of LLMs to assessment research and development may be even more pronounced for difficult-to-measure constructs like CPS. Can these models reduce the burden of new item design? Or will LLM-generated items be disastrous? While LLMs may be able to follow detailed prescriptions for item structure, a more impressive achievement would be understanding the task designer's intent more broadly. To that end, a prudent step before engaging an LLM in item generation is to test whether the model has the foundational knowledge to recognize a good CPS item when it sees one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">The current study</head><p>In the current study, we sought to examine to what extent LLMs can judge the quality of CHOPS template items for measuring CPS. Given the range of performance demonstrated in the literature, we compared multiple foundational models, prompt strategies, and task types to understand how some approaches may outperform others. This study contributes to the literature in several ways. First, understanding LLMs' ability to evaluate CPS items is a first step in improving item quality and even automatically generating such items. Second, this study is relevant to the measurement field as a whole, as it demonstrates how LLMs deal with complex item evaluation tasks. Finally, by examining different models and prompts we can shed light on the models' respective strengths and limitations, guiding future research in educational technology. In sum, our study aimed to answer the following research questions:</p><p>1. To what extent can LLMs evaluate the quality of complex CPS items? 2. To what extent do LLMs' success rates vary by the foundational model, prompting approach, and type of item?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Item design</head><p>We created a small data set of CPS problems for LLMs to evaluate. The items were designed for two students to solve and are approximately at the level of middle-school math. They use one of three CHOPS templates. We label a CPS task as "good" if it invokes positive interdependence. That is, it requires the participants to work together in a meaningful way to solve the problem. A bad task does not require collaboration or cannot be solved for other reasons. The set contained 21 jigsaw (10 good, 11 bad), 20 joint construction (10 good, 10 bad), and 20 information request (9 good, 11 bad) items, which were either new, adapted from items in CHOPS, or adapted from publicly available items from large-scale math assessments like TIMSS and NAEP. Each item was reviewed by at least two team members for clarity, correctness, and content relevance. Figure <ref type="figure">1</ref> shows an example of a joint construction template item based on a TIMSS 2011 item <ref type="bibr" target="#b39">[40]</ref>. Versions A and B would be shown to the two collaborating students. Since both students can enter values that meet the criterion presented in the item, they do not need to collaborate to solve it, making this a bad example for CPS. A (minimally) good version of this item would require each student to enter one value such that together they meet the criterion. The pair of students must then negotiate a common solution.</p><p>Version A:</p><p>The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day. Together with your partner, come up with a possible value for T1 and T2.</p><p>■ Enter value for T1 ♢ Enter value for T2 Version B:</p><p>The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day. 
Together with your partner, come up with a possible value for T1 and T2.</p><p>♢ Enter value for T1 ■ Enter value for T2 Figure <ref type="figure">1</ref>: An example of a bad joint construction item. A good variation would keep, in each version, only the answer input marked with a black square (or only the one marked with a diamond), but not both inputs, so that each student enters one of the two values.</p></div>
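The item set described above can be thought of as a small collection of labeled records. A minimal sketch of such a representation follows (the field and function names are our own illustration, not the study's data format):

```python
from dataclasses import dataclass

@dataclass
class CPSItem:
    template: str   # "jigsaw", "joint construction", or "information request"
    version_a: str  # text shown only to Student A
    version_b: str  # text shown only to Student B
    is_good: bool   # ground-truth label: does it invoke positive interdependence?

# The clock item from Figure 1: both students see equivalent information and
# can each enter a full solution alone, so it is labeled as a bad item.
clock_item = CPSItem(
    template="joint construction",
    version_a="The minute hand of a clock turns 600 degrees between T1 and T2...",
    version_b="The minute hand of a clock turns 600 degrees between T1 and T2...",
    is_good=False,
)

def summarize(items: list[CPSItem]) -> dict[str, tuple[int, int]]:
    """Count (good, bad) items per template, mirroring the set composition
    reported above (e.g., 21 jigsaw items: 10 good, 11 bad)."""
    counts: dict[str, tuple[int, int]] = {}
    for item in items:
        good, bad = counts.get(item.template, (0, 0))
        counts[item.template] = (good + 1, bad) if item.is_good else (good, bad + 1)
    return counts
```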
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Pre-prompt design</head><p>In this study, we use the term "pre-prompt" when referring to the instructions provided to the LLMs on how to approach the item evaluation, since each "prompt" also includes an item that the LLM is asked to evaluate. We designed several types of pre-prompts. Initially, we refined one prompt through trial and error with GPT-4 to improve the output. We also designed a pre-prompt following current best practices of asking the LLM to adopt the role of an evaluator and separating its task by first asking it to identify the item type (template) and then to make a judgment on collaborative interdependence. Our original prompt was also paired with examples, sometimes limited to pass/fail labels or extended to CoT reasoning. In total, we tested five pre-prompts:</p><p>• Zero-shot learning with no examples, prompt refined with GPT-4 • Structured zero-shot learning following prompt engineering best practices • Few-shot learning, original prompt plus one good and one bad example from each template (six total); only pass/fail labels were provided • The same prompt with six CoT examples followed by a verdict • The same prompt and CoT, except with the verdict given before the reasoning Below is our ZSL pre-prompt. The CoT pre-prompts with their example items and the structured ZSL pre-prompt are available in Appendices A.1 and A.2, respectively.</p><p>You will be asked to evaluate one educational exercise for math students working in pairs. The exercise will be presented to you in two parts, the exercise version shown only to Student A (called Version A) and the exercise version as shown only to Student B (Version B). Students A and B are assigned to be partners. Importantly, Version A and Version B may contain different, complementary information, or the information may be formulated differently. Student A cannot see Version B, and Student B cannot see Version A. 
The only way they can access the information available to their partner is by communication with each other via text chat. The exercise should require Both Student A and Student B to submit some answers in an answer field or fields.</p><p>Your criterion for evaluation of the exercise is whether or not the exercise indeed requires Student A and Student B to collaborate in order to solve the problem. If so, indicate pass. It is not acceptable if Student A and Student B can work separately, independently, and without communicating and still each get the correct answer. In such case, indicate fail. For an exercise to pass, it should be impossible for the students to answer correctly by working alone independently. It is not necessary for you to solve the problem. However, you may describe the solution process in explaining your reasons for your evaluation. When providing your evaluation, please format it as follows:</p><p>Verdict: [pass or fail]</p><p>Reason: [explanation for verdict]</p><p>The following is the exercise you need to evaluate:</p></div>
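Each query sent to a model is simply a pre-prompt followed by one item's two versions. A minimal sketch of this pairing (the function name is illustrative, and the pre-prompt text is abridged from the full ZSL pre-prompt above):

```python
# Sketch of pairing a pre-prompt with one item; names are illustrative.

ZSL_PREPROMPT = (
    "You will be asked to evaluate one educational exercise for math "
    "students working in pairs. ... "  # abridged; full text given above
    "The following is the exercise you need to evaluate:"
)

def make_query(pre_prompt: str, version_a: str, version_b: str) -> str:
    """Append an item (as shown to each student) to a pre-prompt."""
    return (
        f"{pre_prompt}\n\n"
        f"Version A:\n{version_a}\n\n"
        f"Version B:\n{version_b}"
    )

query = make_query(
    ZSL_PREPROMPT,
    "One side of a rectangle is 6 units. Find its area with your partner.",
    "One side of a rectangle is 8 units. Find its area with your partner.",
)
```

Swapping in a different pre-prompt (structured ZSL, FSL, or CoT) changes only the first argument.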
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Selection of language models</head><p>We used six LLMs from three families: GPT-3.5 and GPT-4 from OpenAI, Llama2 and Llama3 from Meta, and Mistral7B and Mixtral8x7B from Mistral AI <ref type="bibr" target="#b40">[41,</ref><ref type="bibr" target="#b41">42]</ref>. This selection was designed to explore variance between families as well as within a family, i.e., earlier/later or smaller/larger models. Llama2 and Llama3 come in different sizes; in both cases, we used Q5-quantized versions of the 70 billion (70b) parameter models. Mistral7B is a conventional 7b model. Mixtral8x7b is a Sparse Mixture of Experts (SMoE) architecture with 47b total parameters, but the model uses only 13b at inference time by routing each token to a subset of model components based on the token's attributes. We used Q8-quantized versions of both Mistral models. The Llama and Mistral models were served locally on a high-performance MacBook Pro with 128GB of RAM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Procedure</head><p>Experimental outputs were collected by an automated script with a browser-based front end. The interface supported API calls to GPT models as well as to locally or cloud-hosted open-source models. Items and pre-prompts could be selected, and the script would subsequently append the pre-prompts to each item for each call (61 items × 5 pre-prompts × 6 models). The outputs of each query were saved for subsequent analyses.</p><p>In the analysis stage, LLM outputs were parsed using regular expressions for pass/fail verdicts. All pre-prompts requested verdicts in a specific form, Verdict: Pass/Fail. Model outputs that did not follow this structure were originally parsed as having no verdict. However, further inspection revealed that many model responses contained meaningful evaluations in a different form (e.g., "this exercise meets the criteria"). We therefore wrote a more complex parser to identify relevant phrases. The new parser significantly lowered the no-verdict rates, although we acknowledge that the parser remained imperfect.</p><p>We then compared the results of the parser with our ground-truth labels for each item. The overall agreement is summarized using accuracy (% agreement) and Cohen's 𝜅 <ref type="bibr" target="#b42">[43]</ref>. Following <ref type="bibr" target="#b43">[44]</ref> but slightly more conservative at the low end, we interpret Cohen's 𝜅 values ≤ 0.05 as poor agreement, 0.06 to 0.20 as slight, 0.21 to 0.40 as fair, 0.41 to 0.60 as moderate, and 0.61 to 0.80 as substantial.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>Table <ref type="table" target="#tab_0">1</ref> presents the classification performance for all tested models using the ZSL pre-prompt, across all items as well as disaggregated by item type. GPT-4 had the best performance, with an overall moderate agreement level. The bottom three models were barely better than chance (i.e., 𝜅 scores are about zero). Only two open-source foundational models were somewhat comparable to GPT-4: Llama3 and Mixtral8x7b. Overall, Llama3 was better than Mixtral8x7b, but when disaggregated by item type, the results are more complex.</p><p>Jigsaw items follow the same pattern as the overall results (GPT-4 &gt; Llama3 &gt; Mixtral8x7b). On joint construction items, Llama3 and even Llama2 edge out Mixtral8x7B. However, classifying information request items seems to be the hardest subtask. The highest accuracy, obtained by GPT-4, is 0.63, with a fair 𝜅 of 0.26. Mixtral8x7b slightly beats chance on these items, while Llama3 does worse than chance. In sum, it is possible that to optimize performance using the open-source models, one would do better using Llama3 for jigsaw and joint construction items and Mixtral8x7B for info request items. Next, we examined the other pre-prompts to see if they impacted the results. Table <ref type="table" target="#tab_1">2</ref> includes the classification metrics for the top three performing models, i.e., GPT-4, Llama3, and Mixtral8x7b, across all pre-prompts (the ZSL results from Table <ref type="table" target="#tab_0">1</ref> are embedded in the first column).</p><p>For GPT-4, which had the best overall performance on the task, it is notable that elaboration of the original prompt did not have a positive impact on classification performance and often led to worse performance. The ZSL pre-prompt was as good as or better than all others, except CoT prompting for info request items, which had identical accuracy and a 𝜅 higher by about 0.03. 
However, the difference is probably not of practical significance, as the confidence interval around 𝜅 is on the order of ±0.3.</p><p>While the differences were still small, few-shot prompting did appear to improve the results for Llama3 and Mixtral8x7b in a number of prompt-item-type combinations. For example, CoT prompting improved Mixtral8x7b notably on jigsaw items, while the verdict-first CoT prompt improved the joint construction evaluations. Llama3 had more modest gains from these two prompts. The above analysis is perhaps too fine, slicing by model, prompt, and item type. To understand whether different prompts are generally more suitable to different item types, we average over the top three models. These results are shown in Table <ref type="table" target="#tab_2">3</ref>. Indeed, after averaging, it remains the case that the best overall prompt is not the best prompt for each item type. Notably, the classification of info request items is, at best, barely better than chance. Jigsaw classification appears to be the most successful, followed by joint construction and information request. A high-level summary confirming this finding using accuracy scores averaged over pre-prompts for each model is shown in Table <ref type="table" target="#tab_3">4</ref>. Note that these are not the best results for each model. As an exploratory step, we were interested in whether the models were able to classify items into the correct types, the first sub-task in the structured ZSL approach. Base rate classification accuracy for item types could be expected at 0.33, and actual results ranged from 0.13 to 0.43. Striking, however, is the relationship between accuracy in classifying the item type (template) and accuracy in evaluating the items (see Figure <ref type="figure" target="#fig_0">2</ref>). The strong correlation (𝑟 = 0.80) suggests that models that perform better on one task also tend to perform better on the other. 
Interestingly, when it came to type classification, Llama3 was actually the best-performing model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>The purpose of our study was to test the feasibility of LLMs for evaluating items measuring CPS. We also wanted to see whether different models, pre-prompts, or item types affect the results.</p><p>Understanding these issues may contribute to research on how LLMs interact with complex tasks and to future item design in practice. According to our findings, only three of the tested models did better than chance, with GPT-4 outperforming the other models in almost all cases. Among the open-source models, different models did better on different item types, suggesting that users should consider the task type when choosing the best model. Given GPT-4's success relative to other models in various tasks <ref type="bibr" target="#b44">[45,</ref><ref type="bibr" target="#b45">46]</ref>, including tasks related to item generation <ref type="bibr" target="#b46">[47]</ref>, this result is unsurprising. However, even GPT-4 reached only moderate levels of agreement in most cases. Others have also found that LLMs struggle with evaluative tasks <ref type="bibr" target="#b47">[48]</ref>, suggesting directions for future LLM developments.</p><p>Contrary to existing findings <ref type="bibr" target="#b48">[49]</ref>, elaborate pre-prompting rarely improved on the basic ZSL pre-prompt. It is possible that the examples were confusing or focused the LLMs on the specific cases rather than the general idea. We intend to examine this issue in the future. We also found that some item types were easier for the LLMs to judge than others. All models generally did best with the jigsaw items, followed by the joint construction items and the information request items. We are unaware of existing research comparing LLMs' ability to evaluate different types of interdependent tasks, and this might also be a fruitful direction for future work.</p><p>This study has several limitations. 
First, our basic ZSL pre-prompt was refined using GPT-4, perhaps contributing to its success. Since GPT-4 seems to outperform other models in a variety of complex tasks, we believe this effect is likely small. Second, to enhance the study's generalizability, more items, constructs, models, and pre-prompts should be tested. Finally, we could only examine the final verdict of the models and not their reasoning. Qualitative analysis of the LLMs' outputs is planned and could reveal the reasons for their disagreements with humans.</p><p>In conclusion, when evaluating the quality of CPS items, existing LLMs have only moderate levels of agreement with humans at best. Adding more information beyond ZSL pre-prompts does not improve this by much. However, different models and pre-prompts perform better when evaluating different item types. Therefore, more work on the models or on prompting strategies is required before LLMs can be reliably used for evaluating items measuring CPS and, likely, similarly complex constructs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Full text of prompts</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. Chain-of-Thought</head><p>You will be asked to evaluate one educational exercise for math students working in pairs. The exercise will be presented to you in two parts, the exercise version shown only to Student A (called Version A) and the exercise version as shown only to Student B (Version B). Students A and B are assigned to be partners. Importantly, Version A and Version B may contain different, complementary information, or the information may be formulated differently. Student A cannot see Version B, and Student B cannot see Version A. The only way they can access the information available to their partner is by communicating with each other via text chat. The exercise should require both Student A and Student B to submit some answers in an answer field or fields.</p><p>Your criterion for evaluation of the exercise is whether or not the exercise indeed requires Student A and Student B to collaborate in order to solve the problem. If so, indicate pass. It is not acceptable if Student A and Student B can work separately, independently, and without communicating and still each get the correct answer. In such a case, indicate fail. For an exercise to pass, it should be impossible for the students to answer correctly by working alone independently. It is not necessary for you to solve the problem. However, you may describe the solution process in explaining your reasons for your evaluation. When providing your evaluation, please format it as follows:</p><p>Verdict: [pass or fail]</p><p>Reason: [explanation for verdict] ##The following are example exercises with suitable responses: #Example prompt Version A: A factory produces 100,000 batteries each day. A sample of 200 batteries is drawn from today's production line, and 2 batteries fail the quality test. What is the best estimate for the total number of faulty batteries produced today? Version B: A factory produces 100,000 batteries each day. 
A sample of 200 batteries is drawn from today's production line, and 2 batteries fail the quality test. What is the best estimate for the total number of faulty batteries produced today?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>#Example response</head><p>To estimate the total number of faulty batteries produced, one needs to know the total daily production, the size of the test sample, and the number of failed batteries in the test sample. Both Student A and Student B have the complete information needed to solve the problem and thus can in principle solve the problem without collaborating with one another.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Verdict: Fail #Example prompt</head><p>Version A: A factory produces batteries each day. A sample of 200 batteries is drawn from today's production line, and 2 batteries fail the quality test. What is the best estimate for the total number of faulty batteries produced today? Version B: A factory produces 100,000 batteries each day. A sample of batteries is drawn from today's production line, and 2 batteries fail the quality test. What is the best estimate for the total number of faulty batteries produced today?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>#Example response</head><p>To estimate the total number of faulty batteries produced, one needs to know the total daily production, the size of the test sample, and the number of failed batteries in the test sample. Student A has the sample size but does not have the total number produced, while Student B knows the total number of batteries produced but does not know the size of the sample that was tested. The collaborating students need to communicate this information to each other to estimate the total number of faulty batteries produced today. Thus, this exercise meets the requirement that it can only be solved if Student A and Student B share information with each other.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Verdict: Pass #Example prompt</head><p>Version A:</p><p>The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day. Together with your partner, come up with a possible value for T1 and T2.</p><p>Enter value for T1: Enter value for T2: Version B: The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day. Together with your partner, come up with a possible value for T1 and T2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Enter value for T1: Enter value for T2: #Example response</head><p>There is an infinite number of possible solutions to the posed problem. Each student is able to provide a complete solution to the problem. Thus, it is possible for each student to answer correctly on their own without coordinating with their partner.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Verdict: Fail</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>#Example prompt</head><p>Version A: The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day. Together with your partner, come up with a possible value for T1 and T2.</p><p>Enter value for T1:</p><p>Version B: The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day. Together with your partner, come up with a possible value for T1 and T2.</p><p>Enter value for T2:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>#Example response</head><p>Each student is able to answer one of the two necessary parts of the solution. Moreover, the two parts must together compose a correct solution. Although there is an infinite number of possible solutions to the posed problem, neither student can answer correctly on their own without coordinating with their partner.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Verdict: Pass</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>#Example prompt</head><p>Version A: In a school fund-raiser, students in class A and class B sold boxes of cookies. What was the average number (arithmetic mean) of boxes of cookies sold by all students in both classes?</p><p>To answer this question, you and your partner may each make TWO selections from the following list of values. After you submit your selection, the values you selected will be revealed to you. Use this information to provide your answer in the box below. To answer this question, you and your partner may each make TWO selections from the following list of values. After you submit your selection, the values you selected will be revealed to you. Use this information to provide your answer in the box below.</p><p>Critical pieces of information necessary for solving the problem (such as the total number of students in both classes or the total number of boxes sold in class B) are either missing or inadequately defined in the options available to the students. Therefore, the task is unsolvable with the provided selections, even if students work together to combine their available information. The exercise does not meet the criteria for a solvable and collaborative educational exercise.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Verdict: Fail #Example prompt</head><p>Version A: In a school fund-raiser, students in class A and class B sold boxes of cookies. What was the average number (arithmetic mean) of boxes of cookies sold by all students in both classes?</p><p>To answer this question, you and your partner may each make TWO selections from the following list of values. After you submit your selection, the values you selected will be revealed to you. Use this information to provide your answer in the box below. To answer this question, you and your partner may each make TWO selections from the following list of values. After you submit your selection, the values you selected will be revealed to you. Use this information to provide your answer in the box below.</p><p>To calculate the overall average number of boxes sold by students in both classes, students will need at least four pieces of information from the options provided. For instance, one student might choose the total number of boxes sold in class A and the total number of students in class A, while the other selects the equivalent information for class B. Alternatively, they could choose average numbers and total students in each class. However, each student has the ability to select only two pieces of information. Without sharing this information, neither student can independently calculate the overall average, fulfilling the requirement for collaboration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Verdict: Pass</head><p>The following is the exercise you need to evaluate:</p></div>
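The chain-of-thought pre-prompt above is assembled from general instructions, worked example pairs, and the target item. A minimal sketch of that assembly follows; the helper and its parameters are hypothetical illustrations of the structure, not the authors' code:

```python
def build_cot_prompt(instructions, examples, item):
    """Concatenate instructions, few-shot examples, and the target item.

    `examples` is a list of (prompt_text, response_text) pairs, mirroring
    the "#Example prompt" / "#Example response" layout in Appendix A.1.
    """
    parts = [instructions]
    for prompt_text, response_text in examples:
        parts.append("#Example prompt\n" + prompt_text)
        parts.append("#Example response\n" + response_text)
    parts.append("The following is the exercise you need to evaluate:\n" + item)
    return "\n\n".join(parts)
```

Keeping the examples and target item as separate pieces makes it straightforward to swap pre-prompt variants (ZSL, FSL, CoT) while holding the evaluated item constant.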
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Structured Zero-shot</head><p>Your Role: Collaboration evaluator for math exercises Objective: You need to evaluate collaborative math exercises provided for two students who are solving the exercises together. The goal of this evaluation is to determine whether the exercises require genuine collaboration between the partners to solve.</p><p>Exercise overview: Each exercise will be presented to you in two parts, Version A, accessible only to Student A, and Version B, accessible only to Student B. Students A and B are assigned to be partners.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Types of collaborative exercises:</head><p>1. Jigsaw (the pair of students are provided different or complementary information that needs to be shared to arrive at the solution) 2. Joint construction (the pair of students are provided the same information but need to solve and respond with different parts of the solution) 3. Info request (the students may or may not receive different information, but they will need to collaborate to identify two pieces of information they can request to solve the exercise) Thus, Version A and Version B may contain different or complementary information, the information may be formulated differently, or the response options provided to each student may be different. Images or figures provided are summarized in text within square brackets. Student A cannot see Version B, and Student B cannot see Version A. The only way they can access the information available to their partner is by communication with each other via text chat. The exercise should require both Student A and Student B to submit some answer(s).</p><p>Evaluation Criteria:</p><p>1. Communication Necessity: Is communication between Student A and Student B essential for completing the exercise? 2. Solution Process: Can the problem only be solved through the combined efforts and information of both students?</p><p>It is not necessary for you to solve the problem. However, you may describe the solution process in explaining your reasons for your evaluation. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Relationship between item-type classification success and item quality evaluation using the structured ZSL prompt.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>A</head><label></label><figDesc>. Average number of boxes of cookies sold in class A B. Total number of boxes of cookies sold in class A C. Average number of boxes of cookies sold in class B D. Total number of cookies per box E. Total number of students in class A Version B: In a school fund-raiser, students in class A and class B sold boxes of cookies. What was the average number (arithmetic mean) of boxes of cookies sold by all students in both classes?</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>A</head><label></label><figDesc>. Average number of boxes of cookies sold in class A B. Total number of boxes of cookies sold in class A C. Average number of boxes of cookies sold in class B D. Total number of cookies per box E. Total number of students in class A #Example response</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>A. Average number of boxes of cookies sold in class A B. Total number of boxes of cookies sold in class A C. Average number of boxes of cookies sold in class B D. Total number of boxes of cookies sold in class B E. Total number of cookies per box F. Total number of students in class A G. Total number of students in class B Version B: In a school fund-raiser, students in class A and class B sold boxes of cookies. What was the average number (arithmetic mean) of boxes of cookies sold by all students in both classes?</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>A</head><label></label><figDesc>. Average number of boxes of cookies sold in class A B. Total number of boxes of cookies sold in class A C. Average number of boxes of cookies sold in class B D. Total number of boxes of cookies sold in class B E. Total number of cookies per box F. Total number of students in class A G. Total number of students in class B #Example response</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>Evaluation format: When providing your evaluation, please format it as follows: Verdict: [pass or fail] Type: [Jigsaw, Joint Construction, Info Request, NA (if fail), Other (if pass but does not fit any of the types)] Reason: [explanation for verdict]</figDesc></figure>
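The structured evaluation format above (Verdict / Type / Reason) lends itself to simple automated extraction of the model's verdict. The following is a hypothetical parsing helper, shown only to illustrate how such responses could be scored; it is not the authors' code:

```python
import re

def parse_evaluation(text):
    """Extract the Verdict, Type, and Reason fields from a model response.

    Returns a dict with lowercase keys; a field missing from the
    response maps to None.
    """
    out = {}
    for field in ("Verdict", "Type", "Reason"):
        m = re.search(rf"{field}:\s*(.+)", text)
        out[field.lower()] = m.group(1).strip() if m else None
    return out
```

A parser of this kind only recovers the final verdict, which matches the paper's limitation that the models' reasoning itself was not analyzed quantitatively.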
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Classification performance (accuracy and 𝜅) using a common zero-shot prompt for all models. Results are shown for all items as well as for jigsaw (𝑗𝑖𝑔), joint construction (𝑗𝑐), and info request (𝑖𝑟) items separately.</figDesc><table><row><cell>Model</cell><cell>Acc 𝑎𝑙𝑙</cell><cell>𝜅 𝑎𝑙𝑙</cell><cell>Acc 𝑗𝑖𝑔</cell><cell>𝜅 𝑗𝑖𝑔</cell><cell>Acc 𝑗𝑐</cell><cell>𝜅 𝑗𝑐</cell><cell>Acc 𝑖𝑟</cell><cell>𝜅 𝑖𝑟</cell></row><row><cell>GPT-4</cell><cell>0.77</cell><cell>0.53</cell><cell>0.86</cell><cell>0.71</cell><cell>0.80</cell><cell>0.60</cell><cell>0.63</cell><cell>0.26</cell></row><row><cell>llama3.70B</cell><cell>0.62</cell><cell>0.25</cell><cell>0.81</cell><cell>0.61</cell><cell>0.65</cell><cell>0.30</cell><cell>0.40</cell><cell>-0.14</cell></row><row><cell>mixtral8x7b</cell><cell>0.54</cell><cell>0.09</cell><cell>0.62</cell><cell>0.21</cell><cell>0.50</cell><cell>0.00</cell><cell>0.50</cell><cell>0.08</cell></row><row><cell>mistral7b</cell><cell>0.51</cell><cell>0.03</cell><cell>0.57</cell><cell>0.10</cell><cell>0.50</cell><cell>0.00</cell><cell>0.45</cell><cell>0.00</cell></row><row><cell>llama2.70B</cell><cell>0.51</cell><cell>0.03</cell><cell>0.52</cell><cell>0.00</cell><cell>0.55</cell><cell>0.10</cell><cell>0.45</cell><cell>0.00</cell></row><row><cell>GPT-3.5</cell><cell>0.50</cell><cell>0.00</cell><cell>0.53</cell><cell>0.00</cell><cell>0.43</cell><cell>0.00</cell><cell>0.50</cell><cell>0.00</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Accuracy (Cohen's 𝜅) for GPT-4, Llama3.70B, and Mixtral8x7B by Pre-prompt Type</figDesc><table><row><cell>GPT-4</cell><cell>GPT-ZSL</cell><cell>Structured ZSL</cell><cell>FSL verdict only</cell><cell>CoT</cell><cell>CoT verdict first</cell></row><row><cell>All items</cell><cell>0.77 (0.53)</cell><cell>0.70 (0.40)</cell><cell>0.68 (0.37)</cell><cell>0.75 (0.50)</cell><cell>0.72 (0.43)</cell></row><row><cell>Jigsaw</cell><cell>0.86 (0.71)</cell><cell>0.86 (0.71)</cell><cell>0.86 (0.71)</cell><cell>0.81 (0.61)</cell><cell>0.81 (0.61)</cell></row><row><cell>Joint construction</cell><cell>0.80 (0.60)</cell><cell>0.70 (0.40)</cell><cell>0.70 (0.40)</cell><cell>0.80 (0.60)</cell><cell>0.75 (0.50)</cell></row><row><cell>Info request</cell><cell>0.63 (0.26)</cell><cell>0.53 (0.06)</cell><cell>0.47 (-0.02)</cell><cell>0.63 (0.29)</cell><cell>0.58 (0.19)</cell></row><row><cell>Llama3.70B</cell><cell>GPT-ZSL</cell><cell>Structured ZSL</cell><cell>FSL verdict only</cell><cell>CoT</cell><cell>CoT verdict first</cell></row><row><cell>All items</cell><cell>0.62 (0.25)</cell><cell>0.59 (0.19)</cell><cell>0.66 (0.32)</cell><cell>0.55 (0.11)</cell><cell>0.64 (0.28)</cell></row><row><cell>Jigsaw</cell><cell>0.81 (0.61)</cell><cell>0.76 (0.51)</cell><cell>0.81 (0.61)</cell><cell>0.83 (0.67)</cell><cell>0.71 (0.42)</cell></row><row><cell>Joint construction</cell><cell>0.65 (0.30)</cell><cell>0.55 (0.10)</cell><cell>0.65 (0.30)</cell><cell>0.56 (0.12)</cell><cell>0.70 (0.40)</cell></row><row><cell>Info request</cell><cell>0.40 (-0.14)</cell><cell>0.45 (0.00)</cell><cell>0.50 (-0.03)</cell><cell>0.44 (-0.11)</cell><cell>0.50 (0.01)</cell></row><row><cell>Mixtral8x7B</cell><cell>GPT-ZSL</cell><cell>Structured ZSL</cell><cell>FSL verdict only</cell><cell>CoT</cell><cell>CoT verdict first</cell></row><row><cell>All items</cell><cell>0.54 (0.09)</cell><cell>0.54 (0.09)</cell><cell>0.53 (0.06)</cell><cell>0.53 (0.08)</cell><cell>0.53 (0.07)</cell></row><row><cell>Jigsaw</cell><cell>0.62 (0.21)</cell><cell>0.67 (0.31)</cell><cell>0.62 (0.22)</cell><cell>0.76 (0.52)</cell><cell>0.52 (0.02)</cell></row><row><cell>Joint construction</cell><cell>0.50 (0.00)</cell><cell>0.55 (0.10)</cell><cell>0.50 (0.00)</cell><cell>0.47 (-0.03)</cell><cell>0.63 (0.23)</cell></row><row><cell>Info request</cell><cell>0.50 (0.08)</cell><cell>0.40 (-0.10)</cell><cell>0.45 (0.00)</cell><cell>0.35 (-0.25)</cell><cell>0.45 (-0.04)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Best pre-prompt (using Cohen's 𝜅) overall and by item type. The highest values in each column are bolded.</figDesc><table><row><cell>Prompt</cell><cell>𝜅 𝑎𝑙𝑙</cell><cell>𝜅 𝑗𝑖𝑔</cell><cell>𝜅 𝑗𝑐</cell><cell>𝜅 𝑖𝑟</cell></row><row><cell>GPT-ZSL</cell><cell>0.29</cell><cell>0.51</cell><cell>0.30</cell><cell>0.06</cell></row><row><cell>Structured ZSL</cell><cell>0.22</cell><cell>0.51</cell><cell>0.20</cell><cell>-0.02</cell></row><row><cell>FSL verdict only</cell><cell>0.24</cell><cell>0.51</cell><cell>0.23</cell><cell>-0.01</cell></row><row><cell>CoT</cell><cell>0.22</cell><cell>0.54</cell><cell>0.22</cell><cell>-0.03</cell></row><row><cell>CoT verdict first</cell><cell>0.26</cell><cell>0.35</cell><cell>0.38</cell><cell>0.06</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Average Accuracy by Item Type and Model</figDesc><table><row><cell>Model</cell><cell>Jigsaw</cell><cell>Joint construction</cell><cell>Information request</cell></row><row><cell>GPT-4</cell><cell>0.84</cell><cell>0.75</cell><cell>0.54</cell></row><row><cell>llama3.70B</cell><cell>0.76</cell><cell>0.61</cell><cell>0.46</cell></row><row><cell>mixtral8x7b</cell><cell>0.64</cell><cell>0.52</cell><cell>0.43</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Identifying critical 21st-century skills for workplace success: A content analysis of job advertisements</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Rios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pugh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Becker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bacall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Educational Researcher</title>
		<imprint>
			<biblScope unit="volume">49</biblScope>
			<biblScope unit="page" from="80" to="89" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey on recent approaches to question difficulty estimation from text</title>
		<author>
			<persName><forename type="first">L</forename><surname>Benedetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cremonesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Caines</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Buttery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cappelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Giussani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Turrin</surname></persName>
		</author>
		<idno type="DOI">10.1145/3556538</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="1" to="37" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Identifying the most important 21st century workforce competencies: An analysis of the occupational information network (o*net)</title>
		<author>
			<persName><forename type="first">J</forename><surname>Burrus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Jackson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Xi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ETS Research Report Series</title>
		<imprint>
			<biblScope unit="page">55</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Criteria for high-quality assessment</title>
		<author>
			<persName><forename type="first">L</forename><surname>Darling-Hammond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Herman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pellegrino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Abedi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Aber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Baker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bennett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Haertel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hakuta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Stanford Center for Opportunity Policy in Education</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="171" to="192" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Fiore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Graesser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Greiff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Griffin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kyllonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Massey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>O'neil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pellegrino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rothman</surname></persName>
		</author>
		<title level="m">Collaborative problem solving: Considerations for the national assessment of educational progress</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Cooperative learning: Review of research and practice</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Gillies</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Australian Journal of Teacher Education (Online)</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="39" to="54" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">The collaboration principle in multimedia learning</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Kirschner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Kirschner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Janssen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Cambridge handbook of multimedia learning</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="547" to="575" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Are two heads always better than one? differential effects of collaboration on students&apos; computer-supported learning in mathematics</title>
		<author>
			<persName><forename type="first">D</forename><surname>Mullins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rummel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Spada</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer-Supported Collaborative Learning</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="421" to="443" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">An educational psychology success story: Social interdependence theory and cooperative learning</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Johnson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Educational researcher</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="365" to="379" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Research on cooperative learning and achievement: What we know, what we need to know</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">E</forename><surname>Slavin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Contemporary educational psychology</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="43" to="69" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">M</forename><surname>Stecher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">S</forename><surname>Hamilton</surname></persName>
		</author>
		<title level="m">Measuring Hard-to-Measure Student Competencies: A Research and Development Plan</title>
				<imprint>
			<publisher>ERIC</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">Research Report</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Challenges of assessing collaborative problem solving</title>
		<author>
			<persName><forename type="first">A</forename><surname>Graesser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">W</forename><surname>Foltz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Rosen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Shaffer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Forsyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-L</forename><surname>Germany</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Assessment and teaching of 21st century skills: Research and applications</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="75" to="91" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Initial steps towards a standardized assessment for collaborative problem solving (cps): Practical challenges and strategies, Innovative assessment of collaboration</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Davier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">C</forename><surname>Kyllonen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="135" to="156" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<ptr target="https://www.oecd.org/pisa/innovation/collaborative-problem-solving/" />
		<title level="m">PISA 2015 collaborative problem solving</title>
				<imprint>
			<date type="published" when="2024-05-10">2024-05-10</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Over-scripting cscl: The risks of blending collaborative learning with instructional design</title>
		<author>
			<persName><forename type="first">P</forename><surname>Dillenbourg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Three worlds of CSCL: Can we support CSCL?</title>
				<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="61" to="91" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Mathchops: A platform for developing collaborative higher order problem solving in mathematics</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Bergner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th International Conference on Computer-Supported Collaborative Learning-CSCL 2023</title>
				<meeting>the 16th International Conference on Computer-Supported Collaborative Learning-CSCL 2023</meeting>
		<imprint>
			<publisher>International Society of the Learning Sciences</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="51" to="58" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">ChatGPT and Open-AI models: A preliminary review</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">I</forename><surname>Roumeliotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Tselikas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Internet</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page">192</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Large language models in medicine</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Thirunavukarasu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S J</forename><surname>Ting</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Elangovan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gutierrez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">F</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S W</forename><surname>Ting</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature Medicine</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="1930" to="1940" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">ChatGPT for good? On opportunities and challenges of large language models for education</title>
		<author>
			<persName><forename type="first">E</forename><surname>Kasneci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Seßler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Küchemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bannert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dementieva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Gasser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Groh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Günnemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hüllermeier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Learning and Individual Differences</title>
		<imprint>
			<biblScope unit="volume">103</biblScope>
			<biblScope unit="page">102274</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Bran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Schilter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Baldassari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Schwaller</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.05376</idno>
		<title level="m">ChemCrow: Augmenting large-language models with chemistry tools</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A systematic evaluation of large language models of code</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">F</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Alon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">J</forename><surname>Hellendoorn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming</title>
				<meeting>the 6th ACM SIGPLAN International Symposium on Machine Programming</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Use of ChatGPT in ESP teaching process</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kovačević</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2023 22nd International Symposium INFOTEH-JAHORINA (INFOTEH)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">ChatGPT: Bullshit spewer or the end of traditional assessments in higher education?</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rudolph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Learning and Teaching</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="342" to="363" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tlili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shehata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Adarkwah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bozkurt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">T</forename><surname>Hickey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Agyemang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Smart Learning Environments</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">15</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Balachandran</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.05845</idno>
		<title level="m">Tamil-Llama: A new Tamil language model based on Llama 2</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">How to write effective prompts for large language models</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature Human Behaviour</title>
		<imprint>
			<biblScope unit="page" from="1" to="5" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">An information-theoretic approach to prompt engineering without ground truth labels</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sorensen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Robinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Rytting</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Shaw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Rogers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Delorey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Khalil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Fulda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wingate</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.11364</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<ptr target="https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api#h_eae065300d" />
		<title level="m">Best practices for prompt engineering with the OpenAI API</title>
				<imprint>
			<date type="published" when="2024-05-04">2024-05-04</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<ptr target="https://platform.openai.com/docs/guides/prompt-engineering" />
		<title level="m">Prompt engineering</title>
				<imprint>
			<date type="published" when="2024-05-04">2024-05-04</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Zero-shot and few-shot learning with knowledge graphs: A comprehensive survey</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Z</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Horrocks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the IEEE</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Chain-of-thought prompting elicits reasoning in large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="24824" to="24837" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Transforming assessment: The impacts and implications of large language models and generative AI</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>von Davier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Yaneva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lottridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>von Davier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Harris</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Educational Measurement: Issues and Practice</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<title level="m" type="main">Field-testing multiple-choice questions with AI examinees</title>
		<author>
			<persName><forename type="first">H</forename><surname>Maeda</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<title level="m" type="main">Automatic Generation and Evaluation of Multiple-Choice Reading Comprehension Items with Large Language Models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Säuberli</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
		<respStmt>
			<orgName>University of Zurich</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">End-to-end generation of multiple-choice questions using text-to-text transfer transformer models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rodriguez-Torrealba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Garcia-Lopez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garcia-Cabot</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">208</biblScope>
			<biblScope unit="page">118258</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Raina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gales</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2209.11830</idno>
		<title level="m">Multiple-choice question generation: Towards an automated assessment framework</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<title level="m" type="main">Advanced methods in automatic item generation</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Gierl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tanygin</surname></persName>
		</author>
		<idno type="DOI">10.4324/9781003025634</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
			<publisher>Routledge</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Evaluation of approaches for automatic e-assessment item annotation with levels of Bloom&apos;s taxonomy</title>
		<author>
			<persName><forename type="first">R</forename><surname>Meissner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jenatschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Thor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Symposium on Emerging Technologies for Education</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="57" to="69" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<title level="m">TIMSS 2011 Assessment</title>
		<author>
			<orgName>International Association for the Evaluation of Educational Achievement (IEA)</orgName>
		</author>
		<imprint>
			<publisher>TIMSS &amp; PIRLS International Study Center, Lynch School of Education, Boston College and IEA Secretariat</publisher>
			<pubPlace>Chestnut Hill, MA; Amsterdam, the Netherlands</pubPlace>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lengyel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.06825</idno>
		<title level="m">Mistral 7B</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Savary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">B</forename><surname>Hanna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.04088</idno>
		<title level="m">Mixtral of experts</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">A coefficient of agreement for nominal scales</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cohen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Educational and Psychological Measurement</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="37" to="46" />
			<date type="published" when="1960">1960</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">The measurement of observer agreement for categorical data</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Landis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">G</forename><surname>Koch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="159" to="174" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Borji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mohammadian</surname></persName>
		</author>
		<title level="m">Battle of the wordsmiths: Comparing ChatGPT, GPT-4, Claude, and Bard</title>
				<imprint>
			<date type="published" when="2023-06-12">June 12, 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Whitehouse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Catterson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Perera</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.16212</idno>
		<title level="m">Better call GPT, comparing large language models against lawyers</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Säuberli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Clematide</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2404.07720</idno>
		<title level="m">Automatic generation and evaluation of reading comprehension test items with large language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">Comparing the quality of human and ChatGPT feedback of students&apos; writing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Steiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tate</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Graham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cruz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hebert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Moon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Tseng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Warschauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">B</forename><surname>Olson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Learning and Instruction</title>
		<imprint>
			<biblScope unit="volume">91</biblScope>
			<biblScope unit="page">101894</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Langrené</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.14735</idno>
		<title level="m">Unleashing the potential of prompt engineering in large language models: a comprehensive review</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
