<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can LLMs evaluate items measuring collaborative problem-solving?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ella Anghel</string-name>
          <email>anghel@bc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Madhumitha Gopalakrishnan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pranali Mansukhani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yoav Bergner</string-name>
          <email>yoav.bergner@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Administration, Leadership &amp; Technology, New York University Steinhardt School of Culture, Education &amp; Human Development</institution>
          ,
          <addr-line>82 Washington Square East, New York, NY 10003</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International Study Center, Lynch School of Education and Human Development, Boston College</institution>
          ,
          <addr-line>140 Commonwealth Ave, Chestnut Hill, MA 02467</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Collaborative problem-solving (CPS) is a vital skill for students to learn, but designing CPS assessments is challenging due to the construct's complexity. Advances in the capabilities of large language models (LLMs) have the potential to aid the design and evaluation of CPS items. In this study, we tested whether six LLMs agree with human judges on the quality of items measuring CPS. We found that GPT-4 was consistently the best-performing model with an overall accuracy of .77 (κ = .53). GPT-4 did the best with zero-shot prompts, with other models only marginally benefiting from more complex prompts (few-shot, chain-of-thought). This work highlights challenges in using LLMs for assessment and proposes future research directions on the utility of LLMs for assessment design.</p>
      </abstract>
      <kwd-group>
        <kwd>large language models</kwd>
        <kwd>item evaluation</kwd>
        <kwd>collaborative problem-solving</kwd>
        <kwd>prompt engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <sec id="sec-2-1">
        <title>2.1. Collaborative learning</title>
        <p>
          It is now well established that collaboration and teamwork are essential for success in educational
and work settings [
          <xref ref-type="bibr" rid="ref1 ref3 ref50">1, 3, 50</xref>
          ]. The importance of collaborative problem solving (CPS) has led
policymakers to advocate for the development of high-quality CPS assessments [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. These calls have
been answered by several national and international assessment programs [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          From a socio-cognitive perspective, CPS is also believed to improve learning of the underlying
domain. However, simply working together on a task is not enough to facilitate learning
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Good CPS tasks should be challenging enough to justify the higher cognitive load of
collaborating [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], focus on conceptual rather than procedural material [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and involve positive
interdependence among participants [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ].
        </p>
        <p>
          While the importance of CPS in and of itself and as a contributor to other learning is widely
supported, it remains a difficult construct to assess [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]. Some challenges relate to construct
definition, confounding factors, and psychometric modeling [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. But even designing items that
foster positive interdependence can be quite tricky. Many “collaborative” tasks can either be
solved individually or by dividing the work among the group members rather than through
collaboration. For example, the PISA 2015 tasks measuring CPS seem to encourage the test-takers
to divide the work with their collaborators [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>
          Collaborative learning scholars have emphasized the task design component in contrast to,
for example, (over-)scripting student interactions [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This approach was central in an online
learning and assessment environment called Collaborative Higher-Order Problem Solving
(CHOPS) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. There, pairs of students work collaboratively to solve math problems built
around three item “templates” designed to foster positive interdependence. These templates are
described here, using somewhat trivial examples for illustrative purposes:
1. Jigsaw - Students must exchange information to solve the problem, as they have only
part of the necessary information. For example, one student might have the length of one
side of a rectangle and another student has the length of the adjacent side. Together they
are asked to find its area.
2. Joint construction - A correct answer is composed of elements provided by each student
that must together satisfy some criteria. For example, each student must provide the
length of one side of a rectangle such that its area is 48 units. While there may be multiple
solutions, the students must coordinate their responses.
3. Information request - Students have an under-specified problem with limited options to
request information to complete the task. The pair must decide together what information
is needed and coordinate who should ask for what. For example, the students are asked
to determine how long a trip should take and can each request one of the following: the
car’s fuel usage, the distance traveled, the car’s average speed, or when the car left its
origin.
        </p>
        <p>These templates allow for relatively short-duration CPS items (compared with elaborate
scenario tasks). Consequently, many items can be delivered and reliability improved. Item
developers can be trained to adapt many “standard” types of test questions to these templates,
but the process is still quite time-consuming.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Large language models</title>
        <p>
          In recent years, the performance of LLMs such as OpenAI’s GPT and Meta’s Llama has
improved significantly [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. As a result, these models have been applied in diverse areas such as
medicine, computing, basic science, and education [
          <xref ref-type="bibr" rid="ref18 ref19 ref20 ref21">18, 19, 20, 21</xref>
          ]. Specifically, GPT-3.5 and
GPT-4 have included innovations in bias reduction and complex problem-solving, which are essential
for educational applications like content creation, interactive learning, and teaching assistance
[
          <xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>
          ]. Notwithstanding the name “OpenAI”, GPT models are proprietary, potentially
expensive, and require users to upload private information to OpenAI servers. Open-source
initiatives like Llama and Mistral offer promising alternatives. These models have encouraged
an efflorescence of open-source additions, for example other-than-English language capabilities
[
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
        <p>
          While LLMs are often remarkably effective at interpreting natural language prompts,
higher-quality prompts can yield significantly better outputs [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Prompt engineering has emerged as
a design problem for refining the content and structure of LLM prompts to optimize for specific
tasks [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Some prompt engineering best practices involve writing clear, detailed instructions,
separating distinct parts of the input, asking the model to adopt a persona, and instructing
the LLM to work out the solution rather than immediately constructing the answer [
          <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
          ].
A naming convention has emerged in the literature to describe different prompt variations.
Referring to the number of worked examples given to the LLM, Zero-shot learning (ZSL) relies
solely on the LLM’s pre-trained “knowledge” along with the task description without the use of
any worked examples. In contrast, One-shot learning (OSL) includes an example in the prompt,
and Few-shot learning (FSL) includes two or more [
          <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
          ]. There are also variations in the
presentation of worked examples. A prompt can include just the correct label or desired response,
for example. In Chain-of-thought (CoT) reasoning [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], however, the prompt demonstrates a
multi-step reasoning process, mimicking how a human would approach the problem. These
prompting approaches constitute sources of variance that may be important for educational
researchers working with LLMs.
        </p>
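<p>The prompting variants above can be sketched in code. This is an illustrative sketch only; the task text, example items, and labels below are placeholders, not the pre-prompts used in this study.</p>
<p>
```python
# Illustrative sketch of ZSL, FSL, and CoT prompt assembly. All strings
# here are placeholders, not the study's actual pre-prompts.

TASK = "Decide whether this exercise requires both students to collaborate."

def zero_shot(item):
    # ZSL: task description only, no worked examples
    return TASK + "\n\nExercise:\n" + item + "\nVerdict:"

def few_shot(item, examples):
    # FSL: two or more (exercise, label) pairs precede the target item
    shots = "\n\n".join(
        "Exercise:\n" + ex + "\nVerdict: " + label for ex, label in examples
    )
    return TASK + "\n\n" + shots + "\n\nExercise:\n" + item + "\nVerdict:"

def chain_of_thought(item, examples):
    # CoT: each example shows multi-step reasoning before its verdict
    shots = "\n\n".join(
        "Exercise:\n" + ex + "\nReasoning: " + why + "\nVerdict: " + label
        for ex, why, label in examples
    )
    return TASK + "\n\n" + shots + "\n\nExercise:\n" + item + "\nReasoning:"
```
</p>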
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Large language models in assessment</title>
        <p>
          Advancements in LLMs have not gone unnoticed by the measurement field, where they have
been considered for item generation, scoring, and parameter calibration [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. Relatively little
research has been conducted on item evaluation using LLMs. Most of this research has focused
on automatic evaluation of item difficulty [
          <xref ref-type="bibr" rid="ref2 ref51">2, 51</xref>
          ]. For instance, researchers used LLM responses to
items to evaluate the guessability or the knowledge required to respond to those items [
          <xref ref-type="bibr" rid="ref34 ref35">34, 35</xref>
          ].
Others have focused on the linguistic features of items [
          <xref ref-type="bibr" rid="ref36 ref37 ref38">36, 37, 38</xref>
          ]. Only a few studies attempted
to automatically evaluate items’ content [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ], and they generally did not use LLMs for this
purpose.
        </p>
        <p>The contribution of LLMs to assessment research and development may be even more
pronounced for difficult-to-measure constructs like CPS. Can these models reduce the burden of
new item design? Or will LLM-generated items be disastrous? While LLMs may be able to follow
detailed prescriptions for item structure, a more impressive achievement would be understanding
the task designer’s intent more broadly. To that end, a prudent step before engaging an LLM in
item generation is to test whether the model has the foundational knowledge to recognize a
good CPS item when it sees one.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. The current study</title>
        <p>In the current study, we sought to examine to what extent LLMs can judge the quality of
CHOPS template items for measuring CPS. Given the range of performance demonstrated in
the literature, we compared multiple foundational models, prompt strategies, and task types
to understand how some approaches may outperform others. This study contributes to the
literature in several ways. First, understanding LLMs’ ability to evaluate CPS items is a first
step in improving item quality and even automatically generating such items. Second, this
study is relevant to the measurement field as a whole, as it demonstrates how LLMs deal with
complex item evaluation tasks. Finally, by examining different models and prompts, we can shed
light on the models’ respective strengths and limitations, guiding future research in educational
technology. In sum, our study aimed to answer the following research questions:
1. To what extent can LLMs evaluate the quality of complex CPS items?
2. To what extent do LLMs’ success rates vary by the foundational model, prompting
approach, and type of item?</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Item design</title>
        <p>We created a small data set of CPS problems for LLMs to evaluate. The items were designed for
two students to solve and are approximately at the level of middle-school math. They use one
of three CHOPS templates. We label a CPS task as “good” if it invokes positive interdependence.
That is, it requires the participants to work together in a meaningful way to solve the problem. A
bad task does not require collaboration or cannot be solved for other reasons. The set contained
21 jigsaw (10 good, 11 bad), 20 joint construction (10 good, 10 bad), and 20 request information
(9 good, 11 bad) items, which were either new, adapted from items in CHOPS, or adapted from
publicly available large-scale math assessment items like TIMSS and NAEP. Each item was
reviewed by at least two team members for clarity, correctness, and content relevance.</p>
        <p>
          Figure 1 shows an example of a joint construction template created based on a TIMSS 2011
item [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. Version A and B would be shown to the two collaborating students. Since both
students can enter values that meet the criterion presented in the item, they do not need to
collaborate to solve it, making this a bad example for CPS. A (minimally) good version of this
item would require each student to enter one value such that together they meet the criterion.
The pair of students must then negotiate a common solution.
Version A: The minute hand of a clock turns 600 degrees between time T1 and time T2 of the
same day. Together with your partner, come up with a possible value for T1 and T2.
■ Enter value for T1
♢ Enter value for T2
Version B: The minute hand of a clock turns 600 degrees between time T1 and time T2 of the
same day. Together with your partner, come up with a possible value for T1 and T2.
♢ Enter value for T1
■ Enter value for T2
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Pre-prompt design</title>
        <p>In this study, we use the term “pre-prompt” when referring to the instructions provided to the
LLMs on how to approach the item evaluation, since each “prompt” also includes an item that
the LLM is asked to evaluate. We designed several types of pre-prompts. Initially, we refined
one prompt through trial and error with GPT-4 to improve the output. We also designed a
pre-prompt following current best practices of asking the LLM to adopt the role of an evaluator
and separating its task by first asking it to identify the item type (template) and then make a
judgment on collaborative interdependence. Our original prompt was also paired with examples,
sometimes limited to pass/fail labels or extended to CoT reasoning. In total, we tested five
pre-prompts:
• Zero-shot learning with no examples, prompt refined with GPT-4
• Structured Zero-shot learning following prompt engineering best-practices
• Few-shot learning, original prompt plus one good and bad example from each template
(six total); only pass/fail labels were provided
• The same prompt with six CoT examples followed by a verdict
• The same prompt and CoT, except with the verdict given before the reasoning
Below is our ZSL pre-prompt. The CoT pre-prompts with the example items we used for the
other pre-prompts, as well as the structured ZSL prompt are available in Appendices A.1 and
A.2, respectively.</p>
        <p>You will be asked to evaluate one educational exercise for math students working in pairs.
The exercise will be presented to you in two parts, the exercise version shown only to
Student A (called Version A) and the exercise version as shown only to Student B (Version
B). Students A and B are assigned to be partners.</p>
        <p>Importantly, Version A and Version B may contain different, complementary information,
or the information may be formulated differently. Student A cannot see Version B, and
Student B cannot see Version A. The only way they can access the information available
to their partner is by communication with each other via text chat. The exercise should
require Both Student A and Student B to submit some answers in an answer field or fields.
Your criterion for evaluation of the exercise is whether or not the exercise indeed requires
Student A and Student B to collaborate in order to solve the problem. If so, indicate pass.
It is not acceptable if Student A and Student B can work separately, independently, and
without communicating and still each get the correct answer. In such case, indicate fail.
For an exercise to pass, it should be impossible for the students to answer correctly by
working alone independently. It is not necessary for you to solve the problem. However,
you may describe the solution process in explaining your reasons for your evaluation.
When providing your evaluation, please format it as follows:
Verdict: [pass or fail]
Reason: [explanation for verdict]</p>
        <p>The following is the exercise you need to evaluate:</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Selection of language models</title>
        <p>
          We used six LLMs from three families: GPT-3.5 and GPT-4 from OpenAI, Llama2 and Llama3
from Meta, and Mistral7B and Mixtral8x7B [
          <xref ref-type="bibr" rid="ref41 ref42">41, 42</xref>
          ]. This selection was designed to explore
variance between families as well as within a family, i.e., earlier/later or smaller/larger models.
Llama2 and Llama3 come in different sizes; in both cases, we used Q5-quantized versions of
the 70 billion (70b) parameter models. Mistral7B is a conventional 7b model. Mixtral8x7b is a
Sparse Mixture of Experts (SMoE) architecture with 47b total parameters, but the model uses
only 13b at inference time by routing each token to a subset of model components based on
the token’s attributes. We used Q8-quantized versions of both Mistral models. The Llama and
Mistral models were served locally on a high-performance MacBook Pro with 128GB of RAM.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Procedure</title>
        <p>Experimental outputs were collected by an automated script with a browser-based front end.
The interface provided for API calls to GPT models and local/cloud-based open-source models.
Items and pre-prompts could be selected, and the script would subsequently append the
pre-prompts to each item for each call (61 items × 5 pre-prompts × 6 models). The outputs of each
query were saved for subsequent analyses.</p>
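<p>The collection loop can be sketched as follows. This is a minimal illustration, not the study's script: <code>query_model</code> is a hypothetical stand-in for the actual API or locally served model call, and the item and pre-prompt texts are placeholders.</p>
<p>
```python
import itertools

# Sketch of the automated collection script: every pre-prompt is appended
# to every item for every model (61 items x 5 pre-prompts x 6 models =
# 1,830 calls). Item and pre-prompt texts here are placeholders.
items = ["item-%d" % i for i in range(61)]
pre_prompts = ["pre-prompt-%d" % j for j in range(5)]
models = ["gpt-3.5", "gpt-4", "llama2-70b", "llama3-70b",
          "mistral-7b", "mixtral-8x7b"]

def query_model(model, prompt):
    # hypothetical stand-in for the API / local-model inference call
    return "Verdict: pass"

outputs = []
for model, pre, item in itertools.product(models, pre_prompts, items):
    # pre-prompt followed by the item, as in the study's procedure
    prompt = pre + "\n\nThe following is the exercise you need to evaluate:\n" + item
    outputs.append({"model": model, "pre_prompt": pre, "item": item,
                    "response": query_model(model, prompt)})
# each record is saved for subsequent analysis
```
</p>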
        <p>In the analysis stage, LLM outputs were parsed using regular expressions for pass/fail verdicts.
All pre-prompts requested verdicts in a specific form, Verdict: Pass/Fail. Model outputs that
did not follow this structure were originally parsed as having no verdict. However, further
inspection revealed that many model responses contained meaningful evaluations in a different
form (e.g., “this exercise meets the criteria”). We therefore wrote a more complex parser to
identify relevant phrases. The new parser significantly lowered the no-verdict rates; however,
we understand that the parser was still imperfect.</p>
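<p>A simplified version of such a two-stage parser could look like the following; the fallback phrases shown are illustrative, not the full set used in our parser.</p>
<p>
```python
import re

def parse_verdict(text):
    # First try the requested "Verdict: pass/fail" format.
    m = re.search(r"verdict\s*:\s*(pass|fail)", text, re.IGNORECASE)
    if m:
        return m.group(1).lower()
    # Fall back to loose phrases; these patterns are illustrative only,
    # not the study's full parser.
    if re.search(r"meets the criteri|must collaborate", text, re.IGNORECASE):
        return "pass"
    if re.search(r"does not require|work independently", text, re.IGNORECASE):
        return "fail"
    return None  # parsed as having no verdict
```
</p>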
        <p>
          We then compared the results of the parser with our ground-truth labels for each item. The
overall agreement is summarized using accuracy (% agreement) and Cohen’s κ [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ]. Following
[
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] but slightly more conservative at the low end, we interpret Cohen’s κ values ≤ 0.05 as
poor agreement, 0.06 to 0.20 as slight, 0.21 to 0.40 as fair, 0.41 to 0.60 as moderate, and 0.61 to
0.80 as substantial.
        </p>
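<p>For two raters with binary pass/fail labels, both metrics can be computed directly from the marginal rates. A small self-contained sketch, with made-up labels for illustration:</p>
<p>
```python
# Accuracy and Cohen's kappa for binary pass/fail verdicts; the labels
# below are made up for illustration, not taken from the study's data.

def accuracy_and_kappa(truth, pred):
    n = len(truth)
    # observed agreement (accuracy)
    acc = sum(t == p for t, p in zip(truth, pred)) / n
    # expected chance agreement: product of the raters' marginal "pass"
    # rates plus product of their marginal "fail" rates
    pt = sum(t == "pass" for t in truth) / n
    pp = sum(p == "pass" for p in pred) / n
    p_e = pt * pp + (1 - pt) * (1 - pp)
    # kappa = (observed - chance) / (1 - chance)
    return acc, (acc - p_e) / (1 - p_e)

truth = ["pass", "pass", "fail", "fail"]
pred  = ["pass", "fail", "fail", "fail"]
acc, kappa = accuracy_and_kappa(truth, pred)
```
</p>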
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Table 1 presents the classification performance for all tested models using the ZSL pre-prompt,
across all items as well as disaggregated by item type. GPT-4 had the best performance, with an
overall moderate agreement level. The three bottom models were barely better than chance
(i.e., κ scores are about zero). Only two open-source foundational models were somewhat
comparable to GPT-4: Llama3 and Mixtral8x7b. Overall, Llama3 was better than Mixtral8x7b,
but when disaggregated by item type, the results are more complex.</p>
      <p>Jigsaw items follow the same pattern as the overall (GPT-4 &gt; Llama3 &gt; Mixtral8x7b). On
joint construction items, Llama3 and even Llama2 edge out Mixtral8x7B. However, classifying
information request items seems to be the hardest subtask. The highest accuracy, obtained by
GPT-4, is 0.63, with a fair κ of 0.26. Mixtral8x7b slightly beats chance on these items,
while Llama3 does worse than chance. In sum, it is possible that to optimize performance using
the open-source models, one would do better using Llama3 for jigsaw and joint construction
items and Mixtral8x7B for info request items.</p>
      <p>Next, we examined the other pre-prompts to see if they impacted the results. Table 2
includes the classification metrics for the top three performing models, i.e., GPT-4, Llama3, and
Mixtral8x7b, across all pre-prompts (the ZSL results from Table 1 are embedded in the first
column).</p>
      <p>For GPT-4, which had the best overall performance on the task, it is notable that elaboration
of the original prompt did not have a positive impact on classification performance and often
led to worse performance. The ZSL pre-prompt was as good or better than all others, except
CoT prompting for info request items, which had identical accuracy and higher κ by about
0.03. However, the difference is probably not of practical significance, as the confidence interval
around κ is on the order of ± 0.3.</p>
      <p>While the differences were still small, it does appear that few-shot prompting improved
the results from Llama3 and Mixtral8x7b in a number of prompt-item-type combinations. For
example, CoT prompting improved Mixtral8x7b notably on jigsaw items, while the CoT verdict-first
pre-prompt improved the joint construction evaluations. Llama3 had more modest gains from these
two prompts.</p>
      <p>The above analysis is perhaps too fine, slicing by model, prompt, and item type. To understand
if different prompts are generally more suitable to different item types, we average over the top
three models. These results are shown in Table 3. Indeed, after averaging, it remains the case
that the best overall prompt is not the best prompt for each item type. Notably, the classification
of info request items is, at best, barely better than chance.</p>
      <p>It appears to be the case that jigsaw classification is the most successful, followed by joint
construction and information request. A high-level summary confirming this finding using
accuracy scores averaged over pre-prompts for each model is shown in Table 4. Note that these
are not the best results for each model.</p>
      <p>As an exploratory step, we were interested in whether the models were able to classify
items into the correct types, the first sub-task using the structured ZSL approach. Base rate
classification accuracy for item types could be expected at 0.33, and actual results ranged from
0.13 to 0.43. Striking, however, is the relationship between accuracy in classifying the item type
(template) and accuracy in evaluating the items (see Figure 2). The strong correlation
(r = 0.80) suggests that models doing better on one task tend to do better on the other as well. Interestingly,
when it came to type classification, Llama3 was actually the best performing model.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>
        The purpose of our study was to test the feasibility of LLMs for evaluating items measuring
CPS. We also wanted to see if different models, pre-prompts, or item types affect the results.
Understanding these issues may contribute to research on how LLMs interact with complex
tasks and to future item design in practice. According to our findings, only three of the tested
models did better than chance, with GPT-4 outperforming the other models in almost all cases.
Among the open-access models, different models did better on different item types, suggesting
that users should consider the task type when choosing the best model. Given GPT-4’s success
relative to other models in various tasks [
        <xref ref-type="bibr" rid="ref45 ref46">45, 46</xref>
        ], including tasks related to item generation [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ],
this result is unsurprising. However, even GPT-4 reached only moderate levels of agreement in
most cases. Others have also found that LLMs struggle with evaluative tasks [
        <xref ref-type="bibr" rid="ref48">48</xref>
        ], suggesting
directions for future LLM developments.
      </p>
      <p>
        Contrary to existing findings [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ], elaborate pre-prompting rarely improved on the basic ZSL
pre-prompt. It is possible that the examples were confusing or focused the LLMs on the specific
cases rather than the general idea. We intend to examine this issue in the future. We also found
that some item types were easier for the LLMs to judge than others. All models generally did
best with the jigsaw items followed by the joint construction items and the information request
items. We are unaware of existing research comparing LLMs’ ability to evaluate diferent types
of interdependent tasks, and this might also be a fruitful direction for future work.
      </p>
      <p>This study has several limitations. First, our basic ZSL pre-prompt was refined using
GPT-4, perhaps contributing to its success. Since GPT-4 seems to outperform other models in a
variety of complex tasks, we believe this effect is likely small. Second, to enhance the study’s
generalizability, more items, constructs, models, and pre-prompts should be tested. Finally, we
could only examine the final verdict of the models and not their reasoning. Qualitative analysis
of the LLMs’ outputs is planned and could reveal the reasons for their disagreements with
humans.</p>
      <p>In conclusion, when evaluating the quality of CPS items, existing LLMs have only moderate
levels of agreement with humans at best. Adding more information beyond ZSL pre-prompts
does not improve this by much. However, different models and pre-prompts perform better
when evaluating different item types. Therefore, more work on the models or on prompting
strategies is required before LLMs can be reliably used for evaluating items measuring
CPS and, likely, similarly complex constructs.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Full text of prompts</title>
      <sec id="sec-6-1">
        <title>A.1. Chain-of-Thought</title>
        <p>You will be asked to evaluate one educational exercise for math students working in pairs. The
exercise will be presented to you in two parts, the exercise version shown only to Student A
(called Version A) and the exercise version as shown only to Student B (Version B). Students A
and B are assigned to be partners. Importantly, Version A and Version B may contain different,
complementary information, or the information may be formulated differently. Student A
cannot see Version B, and Student B cannot see Version A. The only way they can access the
information available to their partner is by communication with each other via text chat. The
exercise should require Both Student A and Student B to submit some answers in an answer
field or fields.</p>
        <p>Your criterion for evaluation of the exercise is whether or not the exercise indeed requires
Student A and Student B to collaborate in order to solve the problem. If so, indicate pass. It
is not acceptable if Student A and Student B can work separately, independently, and without
communicating and still each get the correct answer. In such case, indicate fail. For an exercise to
pass, it should be impossible for the students to answer correctly by working alone independently.
It is not necessary for you to solve the problem. However, you may describe the solution process
in explaining your reasons for your evaluation. When providing your evaluation, please format
it as follows:</p>
        <sec id="sec-6-1-1">
          <title>Verdict: [pass or fail]</title>
        </sec>
        <sec id="sec-6-1-2">
          <title>Reason: [explanation for verdict]</title>
          <p>##The following are example exercises with suitable responses:
#Example prompt</p>
          <p>Version A: A factory produces 100,000 batteries each day. A sample of 200 batteries is drawn
from today’s production line, and 2 batteries fail the quality test. What is the best estimate for
the total number of faulty batteries produced today?</p>
          <p>Version B: A factory produces 100,000 batteries each day. A sample of 200 batteries is drawn
from today’s production line, and 2 batteries fail the quality test. What is the best estimate for
the total number of faulty batteries produced today?
#Example response</p>
          <p>To estimate the total number of faulty batteries produced, one needs to know the total daily
production, the size of the test sample, and the number of failed batteries in the test sample.
Both Student A and Student B have the complete information needed to solve the problem and
thus can in principle solve the problem without collaborating with one another.
#Example prompt</p>
          <p>Version A: A factory produces batteries each day. A sample of 200 batteries is drawn from
today’s production line, and 2 batteries fail the quality test. What is the best estimate for the
total number of faulty batteries produced today?</p>
          <p>Version B: A factory produces 100,000 batteries each day. A sample of batteries is drawn from
today’s production line, and 2 batteries fail the quality test. What is the best estimate for the
total number of faulty batteries produced today?
#Example response</p>
          <p>To estimate the total number of faulty batteries produced, one needs to know the total daily
production, the size of the test sample, and the number of failed batteries in the test sample.
Student A has the sample size but does not have the total number produced, while Student B
knows the total number of batteries produced but does not know the size of the sample that
was tested. The collaborating students need to communicate this information to each other
to estimate the total number of faulty batteries produced today. Thus, this exercise meets the
requirement that it can only be solved if Student A and Student B share information with each
other.</p>
        </sec>
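        <sec id="sec-6-1-2-note">
          <title>Worked calculation for the battery example</title>
          <p>As a minimal sketch (not part of the prompt text), the proportion-based estimate described in the example response scales the sample failure rate up to the full daily production:</p>
          <preformat>
```python
# Estimate total faulty batteries by scaling the sample failure rate
# to the full daily production (values taken from the exercise).
daily_production = 100_000
sample_size = 200
sample_failures = 2

failure_rate = sample_failures / sample_size        # 2/200 = 0.01
estimated_faulty = failure_rate * daily_production  # 0.01 * 100,000

print(int(estimated_faulty))  # 1000
```
          </preformat>
        </sec>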
        <sec id="sec-6-1-3">
          <title>Verdict: Pass</title>
          <p>#Example prompt</p>
        </sec>
        <sec id="sec-6-1-4">
          <title>Version A:</title>
        </sec>
        <sec id="sec-6-1-5">
          <title>Enter value for T1: Enter value for T2:</title>
        </sec>
        <sec id="sec-6-1-6">
          <title>Enter value for T1:</title>
          <p>Enter value for T2:
#Example response</p>
          <p>The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day.
Together with your partner, come up with a possible value for T1 and T2.</p>
        </sec>
        <sec id="sec-6-1-7">
          <title>Version B:</title>
          <p>The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day.
Together with your partner, come up with a possible value for T1 and T2.</p>
          <p>There is an infinite number of possible solutions to the posed problem. Each student is
provided with the ability to provide a complete solution to the problem. Thus, it is possible for
each student to answer correctly on their own without coordinating with their partner.</p>
        </sec>
        <sec id="sec-6-1-8">
          <title>Version A: The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day. Together with your partner, come up with a possible value for T1 and T2.</title>
        </sec>
        <sec id="sec-6-1-9">
          <title>Enter value for T1: Version B: The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day. Together with your partner, come up with a possible value for T1 and T2.</title>
          <p>Each student is provided with the ability to answer one of two necessary parts of the solution.
Moreover, the two parts must together compose a correct solution. Although there is an infinite
number of possible solutions to the posed problem, neither student can answer correctly on
their own without coordinating with their partner.</p>
          <p>Verdict: Fail
#Example prompt</p>
        </sec>
        <sec id="sec-6-1-10">
          <title>Enter value for T2: #Example response</title>
        </sec>
        <sec id="sec-6-1-11">
          <title>Verdict: Pass</title>
          <p>#Example prompt</p>
          <p>Version A: In a school fund-raiser, students in class A and class B sold boxes of cookies. What
was the average number (arithmetic mean) of boxes of cookies sold by all students in both
classes?</p>
          <p>To answer this question, you and your partner may each make TWO selections from the
following list of values. After you submit your selection, the values you selected will be revealed
to you. Use this information to provide your answer in the box below.</p>
        </sec>
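        <sec id="sec-6-1-11-note">
          <title>Worked calculation for the clock example</title>
          <p>As a minimal sketch (not part of the prompt text), the clock exercise reduces to elapsed time: the minute hand turns 360 degrees per hour, i.e., 6 degrees per minute, so 600 degrees corresponds to 100 minutes between T1 and T2:</p>
          <preformat>
```python
# The minute hand turns 6 degrees per minute, so 600 degrees = 100 minutes.
DEGREES_PER_MINUTE = 6
target_degrees = 600
elapsed_minutes = target_degrees / DEGREES_PER_MINUTE  # 100.0

def is_valid_pair(t1_minutes, t2_minutes):
    """Times given in minutes after midnight; check the 600-degree condition."""
    return t2_minutes - t1_minutes == elapsed_minutes

# One of infinitely many valid pairs: T1 = 1:00 (60 min), T2 = 2:40 (160 min).
print(is_valid_pair(60, 160))  # True
```
          </preformat>
        </sec>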
        <sec id="sec-6-1-12">
          <title>A. Average number of boxes of cookies sold in class A B. Total number of boxes of cookies sold in class A C. Average number of boxes of cookies sold in class B D. Total number of cookies per box</title>
          <p>E. Total number of students in class A</p>
          <p>Version B: In a school fund-raiser, students in class A and class B sold boxes of cookies. What
was the average number (arithmetic mean) of boxes of cookies sold by all students in both
classes?</p>
          <p>To answer this question, you and your partner may each make TWO selections from the
following list of values. After you submit your selection, the values you selected will be revealed
to you. Use this information to provide your answer in the box below.</p>
          <p>A. Average number of boxes of cookies sold in class A
B. Total number of boxes of cookies sold in class A
C. Average number of boxes of cookies sold in class B
D. Total number of cookies per box
E. Total number of students in class A
#Example response</p>
          <p>Critical pieces of information necessary for solving the problem (such as the total number of
students in both classes or the total number of boxes sold in class B) are either missing or
inadequately defined in the options available to the students. Therefore, the task is unsolvable with
the provided selections, even if students work together to combine their available information.
The exercise does not meet the criteria for a solvable and collaborative educational exercise.</p>
          <p>Verdict: Fail
#Example prompt</p>
          <p>Version A: In a school fund-raiser, students in class A and class B sold boxes of cookies. What was the average number (arithmetic mean) of boxes of cookies sold by all students in both classes?</p>
          <p>To answer this question, you and your partner may each make TWO selections from the following list of values. After you submit your selection, the values you selected will be revealed to you. Use this information to provide your answer in the box below.</p>
          <p>A. Average number of boxes of cookies sold in class A
B. Total number of boxes of cookies sold in class A
C. Average number of boxes of cookies sold in class B
D. Total number of boxes of cookies sold in class B
E. Total number of cookies per box
F. Total number of students in class A
G. Total number of students in class B</p>
          <p>Version B: In a school fund-raiser, students in class A and class B sold boxes of cookies. What
was the average number (arithmetic mean) of boxes of cookies sold by all students in both
classes?</p>
          <p>To answer this question, you and your partner may each make TWO selections from the
following list of values. After you submit your selection, the values you selected will be revealed
to you. Use this information to provide your answer in the box below.
#Example response</p>
          <p>To calculate the overall average number of boxes sold by students in both classes, students
will need at least four pieces of information from the options provided. For instance, one student
might choose the total number of boxes sold in class A and the total number of students in
class A, while the other selects the equivalent information for class B. Alternatively, they could
choose average numbers and total students in each class. However, each student has the ability
to select only two pieces of information. Without sharing this information, neither student can
independently calculate the overall average, fulfilling the requirement for collaboration.</p>
        </sec>
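        <sec id="sec-6-1-12-note">
          <title>Worked calculation for the cookies example</title>
          <p>As a minimal sketch (not part of the prompt text, using hypothetical values), the overall mean described in the example response is the combined total of boxes divided by the combined number of students. With unequal class sizes this is not the simple average of the two class means, which is why four pieces of information are needed:</p>
          <preformat>
```python
# Hypothetical values for illustration only.
total_boxes_a, students_a = 120, 30
total_boxes_b, students_b = 80, 10

# Overall mean = all boxes sold / all students in both classes.
overall_mean = (total_boxes_a + total_boxes_b) / (students_a + students_b)
print(overall_mean)  # 5.0

# Naively averaging the two class means gives a different (wrong) answer here:
naive = (total_boxes_a / students_a + total_boxes_b / students_b) / 2
print(naive)  # 6.0
```
          </preformat>
        </sec>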
        <sec id="sec-6-1-13">
          <title>Verdict: Pass</title>
        </sec>
        <sec id="sec-6-1-14">
          <title>The following is the exercise you need to evaluate:</title>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>A.2. Structured Zero-shot</title>
        <sec id="sec-6-2-1">
          <title>Your Role: Collaboration evaluator for math exercises</title>
          <p>Objective: You need to evaluate collaborative math exercises provided for two students who
are solving the exercises together. The goal of this evaluation is to determine whether the
exercises require genuine collaboration between the partners to solve.</p>
          <p>Exercise overview: Each exercise will be presented to you in two parts, Version A, accessible
only to Student A, and Version B, accessible only to Student B. Students A and B are assigned to
be partners.</p>
          <p>Types of collaborative exercises:
1. Jigsaw (the pair of students are provided different or complementary information that
needs to be shared to arrive at the solution)
2. Joint construction (the pair of students are provided the same information but need to
solve and respond with different parts of the solution)
3. Info request (the students may or may not receive different information, but they will
need to collaborate to identify two pieces of information they can request to solve the
exercise)</p>
          <p>Thus, Version A and Version B may contain different or complementary information, the information may be formulated differently, or the response options provided to each student may be different. Images or figures provided are summarized in text within square brackets. Student A cannot see Version B, and Student B cannot see Version A. The only way they can access the information available to their partner is by communicating with each other via text chat. The exercise should require both Student A and Student B to submit some answer(s).</p>
          <p>It is not necessary for you to solve the problem. However, you may describe the solution
process in explaining your reasons for your evaluation.</p>
          <p>Evaluation format: When providing your evaluation, please format it as follows:</p>
          <p>Verdict: [pass or fail]</p>
          <p>Type: [Jigsaw, Joint Construction, Info Request, NA (if fail), Other (if pass but does not fit any of the types)]</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Rios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacall</surname>
          </string-name>
          ,
          <article-title>Identifying critical 21st-century skills for workplace success: A content analysis of job advertisements</article-title>
          ,
          <source>Educational Researcher</source>
          <volume>49</volume>
          (
          <year>2020</year>
          )
          <fpage>80</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Benedetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buttery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cappelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giussani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turrin</surname>
          </string-name>
          ,
          <article-title>A survey on recent approaches to question difficulty estimation from text</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          . doi:10.1145/3556538.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Burrus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinberg</surname>
          </string-name>
          ,
          <article-title>Identifying the most important 21st century workforce competencies: An analysis of the occupational information network (o*net)</article-title>
          ,
          <source>ETS Research Report Series</source>
          <year>2013</year>
          (
          <year>2013</year>
          )
          <fpage>i</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Darling-Hammond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Herman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pellegrino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Abedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Aber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Haertel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hakuta</surname>
          </string-name>
          , et al.,
          <article-title>Criteria for high-quality assessment</article-title>
          ,
          <source>Stanford Center for Opportunity Policy in Education 2</source>
          (
          <year>2013</year>
          )
          <fpage>171</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Fiore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graesser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Greif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Griffin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kyllonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Massey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>O'Neil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pellegrino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rothman</surname>
          </string-name>
          , et al.,
          <article-title>Collaborative problem solving: Considerations for the national assessment of educational progress</article-title>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Gillies</surname>
          </string-name>
          ,
          <article-title>Cooperative learning: Review of research and practice</article-title>
          ,
          <source>Australian Journal of Teacher Education (Online)</source>
          <volume>41</volume>
          (
          <year>2016</year>
          )
          <fpage>39</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Kirschner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kirschner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Janssen</surname>
          </string-name>
          ,
          <article-title>The collaboration principle in multimedia learning</article-title>
          ,
          <source>The Cambridge handbook of multimedia learning 2</source>
          (
          <year>2014</year>
          )
          <fpage>547</fpage>
          -
          <lpage>575</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mullins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rummel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Spada</surname>
          </string-name>
          ,
          <article-title>Are two heads always better than one? Differential effects of collaboration on students' computer-supported learning in mathematics</article-title>
          ,
          <source>International Journal of Computer-Supported Collaborative Learning</source>
          <volume>6</volume>
          (
          <year>2011</year>
          )
          <fpage>421</fpage>
          -
          <lpage>443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>An educational psychology success story: Social interdependence theory and cooperative learning</article-title>
          ,
          <source>Educational Researcher</source>
          <volume>38</volume>
          (
          <year>2009</year>
          )
          <fpage>365</fpage>
          -
          <lpage>379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Slavin</surname>
          </string-name>
          ,
          <article-title>Research on cooperative learning and achievement: What we know, what we need to know</article-title>
          ,
          <source>Contemporary Educational Psychology</source>
          <volume>21</volume>
          (
          <year>1996</year>
          )
          <fpage>43</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Stecher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <source>Measuring Hard-to-Measure Student Competencies: A Research and Development Plan. Research Report</source>
          , ERIC,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Graesser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Foltz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rosen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Shaffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Forsyth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-L.</given-names>
            <surname>Germany</surname>
          </string-name>
          ,
          <article-title>Challenges of assessing collaborative problem solving</article-title>
          ,
          <source>Assessment and Teaching of 21st Century Skills: Research and Applications</source>
          (
          <year>2018</year>
          )
          <fpage>75</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>von Davier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kyllonen</surname>
          </string-name>
          ,
          <article-title>Initial steps towards a standardized assessment for collaborative problem solving (CPS): Practical challenges and strategies</article-title>
          ,
          <source>Innovative Assessment of Collaboration</source>
          (
          <year>2017</year>
          )
          <fpage>135</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <article-title>PISA 2015 collaborative problem solving</article-title>
          , https://www.oecd.org/pisa/innovation/collaborative-problem-solving/. Accessed: 2024-05-10.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dillenbourg</surname>
          </string-name>
          ,
          <article-title>Over-scripting cscl: The risks of blending collaborative learning with instructional design</article-title>
          .,
          <source>Three worlds of CSCL. Can we support CSCL?</source>
          (
          <year>2002</year>
          )
          <fpage>61</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bergner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Mathchops: A platform for developing collaborative higher order problem solving in mathematics</article-title>
          ,
          <source>in: Proceedings of the 16th International Conference on Computer-Supported Collaborative Learning (CSCL 2023)</source>
          , International Society of the Learning Sciences,
          <year>2023</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K. I.</given-names>
            <surname>Roumeliotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Tselikas</surname>
          </string-name>
          ,
          <article-title>Chatgpt and open-ai models: A preliminary review</article-title>
          ,
          <source>Future Internet</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>192</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Thirunavukarasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S. J.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Elangovan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S. W.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <article-title>Large language models in medicine</article-title>
          ,
          <source>Nature Medicine</source>
          <volume>29</volume>
          (
          <year>2023</year>
          )
          <fpage>1930</fpage>
          -
          <lpage>1940</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seßler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Küchemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bannert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Gasser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Groh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Günnemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hüllermeier</surname>
          </string-name>
          , et al.,
          <article-title>Chatgpt for good? on opportunities and challenges of large language models for education</article-title>
          ,
          <source>Learning and Individual Differences</source>
          <volume>103</volume>
          (
          <year>2023</year>
          )
          <fpage>102274</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Bran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Schilter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baldassari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwaller</surname>
          </string-name>
          ,
          <article-title>Chemcrow: Augmenting large-language models with chemistry tools</article-title>
          ,
          <source>arXiv preprint arXiv:2304.05376</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Alon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Hellendoorn</surname>
          </string-name>
          ,
          <article-title>A systematic evaluation of large language models of code</article-title>
          ,
          <source>in: Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kovačević</surname>
          </string-name>
          ,
          <article-title>Use of chatgpt in esp teaching process</article-title>
          ,
          <source>in: 2023 22nd International Symposium INFOTEH-JAHORINA (INFOTEH)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Chatgpt: Bullshit spewer or the end of traditional assessments in higher education?</article-title>
          ,
          <source>Journal of Applied Learning and Teaching</source>
          <volume>6</volume>
          (
          <year>2023</year>
          )
          <fpage>342</fpage>
          -
          <lpage>363</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tlili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shehata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Adarkwah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bozkurt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Hickey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Agyemang</surname>
          </string-name>
          ,
          <article-title>What if the devil is my guardian angel: Chatgpt as a case study of using chatbots in education</article-title>
          ,
          <source>Smart Learning Environments</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>15</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Balachandran</surname>
          </string-name>
          ,
          <article-title>Tamil-Llama: A new Tamil language model based on Llama 2</article-title>
          ,
          <source>arXiv preprint arXiv:2311.05845</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>How to write effective prompts for large language models</article-title>
          ,
          <source>Nature Human Behaviour</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sorensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Rytting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Delorey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khalil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fulda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wingate</surname>
          </string-name>
          ,
          <article-title>An information-theoretic approach to prompt engineering without ground truth labels</article-title>
          ,
          <source>arXiv preprint arXiv:2203.11364</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <article-title>Best practices for prompt engineering with the OpenAI API</article-title>
          , https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api#h_eae065300d,
          <year>2024</year>
          . Accessed: 2024-05-04.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <article-title>Prompt engineering</article-title>
          , https://platform.openai.com/docs/guides/prompt-engineering, n.d. Accessed: 2024-05-04.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Zero-shot and few-shot learning with knowledge graphs: A comprehensive survey</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>von Davier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Yaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lottridge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>von Davier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <article-title>Transforming assessment: The impacts and implications of large language models and generative AI</article-title>
          ,
          <source>Educational Measurement: Issues and Practice</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>H.</given-names>
            <surname>Maeda</surname>
          </string-name>
          ,
          <article-title>Field-testing multiple-choice questions with AI examinees</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Säuberli</surname>
          </string-name>
          ,
          <article-title>Automatic Generation and Evaluation of Multiple-Choice Reading Comprehension Items with Large Language Models</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Zurich,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rodriguez-Torrealba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Garcia-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Cabot</surname>
          </string-name>
          ,
          <article-title>End-to-end generation of multiple-choice questions using text-to-text transfer transformer models</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>208</volume>
          (
          <year>2022</year>
          )
          <fpage>118258</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>V.</given-names>
            <surname>Raina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <article-title>Multiple-choice question generation: Towards an automated assessment framework</article-title>
          ,
          <source>arXiv preprint arXiv:2209.11830</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Gierl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tanygin</surname>
          </string-name>
          ,
          <article-title>Advanced methods in automatic item generation</article-title>
          ,
          <source>Routledge</source>
          ,
          <year>2021</year>
          . doi:10.4324/9781003025634.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>R.</given-names>
            <surname>Meissner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jenatschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thor</surname>
          </string-name>
          ,
          <article-title>Evaluation of approaches for automatic e-assessment item annotation with levels of Bloom's taxonomy</article-title>
          ,
          <source>in: International Symposium on Emerging Technologies for Education</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          International Association for the Evaluation of Educational Achievement (IEA),
          <article-title>TIMSS 2011 Assessment</article-title>
          ,
          <source>TIMSS &amp; PIRLS International Study Center, Lynch School of Education, Boston College, Chestnut Hill, MA, and IEA Secretariat, Amsterdam, the Netherlands</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lengyel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          ,
          <source>arXiv preprint arXiv:2310.06825</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Savary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. B.</given-names>
            <surname>Hanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , et al.,
          <article-title>Mixtral of experts</article-title>
          ,
          <source>arXiv preprint arXiv:2401.04088</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <article-title>A coefficient of agreement for nominal scales</article-title>
          ,
          <source>Educational and Psychological Measurement</source>
          <volume>20</volume>
          (
          <year>1960</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Landis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <article-title>The measurement of observer agreement for categorical data</article-title>
          ,
          <source>Biometrics</source>
          (
          <year>1977</year>
          )
          <fpage>159</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>A.</given-names>
            <surname>Borji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohammadian</surname>
          </string-name>
          ,
          <article-title>Battle of the wordsmiths: Comparing ChatGPT, GPT-4, Claude, and Bard</article-title>
          (June 12,
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Whitehouse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Catterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perera</surname>
          </string-name>
          ,
          <article-title>Better call GPT, comparing large language models against lawyers</article-title>
          ,
          <source>arXiv preprint arXiv:2401.16212</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>A.</given-names>
            <surname>Säuberli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          ,
          <article-title>Automatic generation and evaluation of reading comprehension test items with large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2404.07720</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>J.</given-names>
            <surname>Steiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hebert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tseng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Warschauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. B.</given-names>
            <surname>Olson</surname>
          </string-name>
          ,
          <article-title>Comparing the quality of human and ChatGPT feedback of students' writing</article-title>
          ,
          <source>Learning and Instruction</source>
          <volume>91</volume>
          (
          <year>2024</year>
          )
          <fpage>101894</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Langrené</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Unleashing the potential of prompt engineering in large language models: a comprehensive review</article-title>
          ,
          <source>arXiv preprint arXiv:2310.14735</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          1. Communication Necessity: Is communication between Student A and Student B essential for completing the exercise?
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          2. Solution Process: Can the problem only be solved through the combined efforts and information of both students?
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>