Can LLMs Evaluate Items Measuring Collaborative Problem-Solving? CEUR-WS Vol-3772, paper 2. PDF: https://ceur-ws.org/Vol-3772/paper2.pdf
                                Can LLMs evaluate items measuring collaborative
                                problem-solving?
                                Ella Anghel1,* , Yu Wang2 , Madhumitha Gopalakrishnan2 , Pranali Mansukhani2 and
                                Yoav Bergner2
                                1
                                  International Study Center, Lynch School of Education and Human Development, Boston College, 140 Commonwealth
                                Ave, Chestnut Hill, MA 02467, USA
                                2
                                  Department of Administration, Leadership & Technology, New York University Steinhardt School of Culture, Education
                                & Human Development, 82 Washington Square East, New York, NY 10003, USA


                                            Abstract
                                            Collaborative problem-solving (CPS) is a vital skill for students to learn, but designing CPS assessments
                                            is challenging due to the construct’s complexity. Advances in the capabilities of large language models
                                            (LLMs) have the potential to aid the design and evaluation of CPS items. In this study, we tested whether
                                            six LLMs agree with human judges on the quality of items measuring CPS. We found that GPT-4 was
                                            consistently the best-performing model with an overall accuracy of .77 (𝜅 = .53). GPT-4 did the best with
                                            zero-shot prompts, with other models only marginally benefiting from more complex prompts (few-shot,
                                            chain-of-thought). This work highlights challenges in using LLMs for assessment and proposes future
                                            research directions on the utility of LLMs for assessment design.

                                            Keywords
                                            large language models, item evaluation, collaborative problem-solving, prompt engineering




                                1. Introduction
                                Collaborative problem-solving (CPS) is one of the most important 21st century skills according
                                to employers [1] and has for some time attracted the interest of K-12 educators and policy-
                                makers. High-quality assessment of CPS is a vital companion for curricula designed to develop
                                this skill. However, the complexity of the construct makes it challenging to design items that
                                properly target CPS and to evaluate the quality of candidate items. In recent years, the use
                                of large language models (LLMs) and other AI-based methods has been proposed for determining
                                psychometric properties such as item difficulty [2]. These approaches are rarely applied to the
                                evaluation of items’ construct representation or to complex constructs like CPS. Therefore, it is
                                unclear whether LLMs are suitable for such tasks. The current study aims to fill this gap by
                                testing whether LLMs agree with humans on quality criteria for CPS items.



                                EvalLAC’24: Workshop on Automatic Evaluation of Learning and Assessment Content, July 08, 2024, Recife, Brazil
                                *
                                 Corresponding author.
                                anghel@bc.edu (E. Anghel); yw3060@nyu.edu (Y. Wang); mg7584@nyu.edu (M. Gopalakrishnan);
                                pm3598@nyu.edu (P. Mansukhani); yoav.bergner@nyu.edu (Y. Bergner)
                                ORCID: 0000-0001-6332-7826 (E. Anghel); 0009-0005-9647-3076 (Y. Wang); 0009-0001-4328-5910 (M. Gopalakrishnan);
                                0009-0005-5801-4266 (P. Mansukhani); 0000-0001-7738-4290 (Y. Bergner)
                                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
2. Literature review
2.1. Collaborative learning
It is now well established that collaboration and teamwork are essential for success in educational
and work settings [1, 3]. The importance of collaborative problem solving (CPS) has led policy-
makers to advocate for the development of high-quality CPS assessments [4]. These calls have
been answered by several national and international assessment programs [5].
    From a socio-cognitive perspective, CPS is also believed to improve learning of the underlying
domain. However, simply working together on a task is not enough to facilitate learning
[6]. Good CPS tasks should be challenging enough to justify the higher cognitive load of
collaborating [7], focus on conceptual rather than procedural material [8], and involve positive
interdependence among participants [9, 10].
    While the importance of CPS in and of itself and as a contributor to other learning is widely
supported, it remains a difficult construct to assess [11, 12]. Some challenges relate to construct
definition, confounding factors, and psychometric modeling [13]. But even designing items that
foster positive interdependence can be quite tricky. Many “collaborative” tasks can either be
solved individually or by dividing the work among the group members rather than through
collaboration. For example, the PISA 2015 tasks measuring CPS seem to encourage the test-takers
to divide the work with their collaborators [14].
    Collaborative learning scholars have emphasized the task design component in contrast to,
for example, (over-)scripting student interactions [15]. This approach was central in an online
learning and assessment environment called Collaborative Higher-Order Problem Solving
(CHOPS) [16]. There, pairs of students work collaboratively to solve math problems built
around three item “templates” designed to foster positive interdependence. These templates are
described here, using somewhat trivial examples for illustrative purposes:

   1. Jigsaw - Students must exchange information to solve the problem, as they have only
      part of the necessary information. For example, one student might have the length of one
      side of a rectangle and another student has the length of the adjacent side. Together they
      are asked to find its area.
   2. Joint construction - A correct answer is composed of elements provided by each student
      that must together satisfy some criteria. For example, each student must provide the
      length of one side of a rectangle such that its area is 48 units. While there may be multiple
      solutions, the students must coordinate their responses.
   3. Information request - Students have an under-specified problem with limited options to
      request information to complete the task. The pair must decide together what information
      is needed and coordinate who should ask for what. For example, the students are asked
      to determine how long a trip should take and can each request one of the following: the
      car’s fuel usage, the distance traveled, the car’s average speed, or when the car left its
      origin.

  These templates allow for relatively short-duration CPS items (compared with elaborate
scenario tasks). Consequently, many items can be delivered and reliability improved. Item
developers can be trained to adapt many “standard” types of test questions to these templates,
but the process is still quite time-consuming.

2.2. Large language models
In recent years, the performance of LLMs such as OpenAI's GPT and Meta's Llama has
improved significantly [17]. As a result, these models have been applied in diverse areas such as
medicine, computing, basic science, and education [18, 19, 20, 21]. Specifically, GPT-3.5 and GPT-
4 have included innovations in bias reduction and complex problem-solving, which are essential
for educational applications like content creation, interactive learning, and teaching assistance
[22, 23, 24]. Notwithstanding the name “OpenAI”, GPT models are proprietary, potentially
expensive, and require users to upload private information to OpenAI servers. Open-source
initiatives like Llama and Mistral offer promising alternatives. These models have encouraged
an efflorescence of open-source additions, for example other-than-English language capabilities
[25].
   While LLMs are often remarkably effective at interpreting natural language prompts, higher-
quality prompts can yield significantly better outputs [26]. Prompt engineering has emerged as
a design problem for refining the content and structure of LLM prompts to optimize for specific
tasks [27]. Some prompt engineering best practices involve writing clear, detailed instructions,
separating distinct parts of the input, asking the model to adopt a persona, and instructing
the LLM to work out the solution rather than immediately constructing the answer [28, 29].
A naming convention has emerged in the literature to describe different prompt variations.
Referring to the number of worked examples given to the LLM, zero-shot learning (ZSL) relies
solely on the LLM's pre-trained "knowledge" along with the task description, without the use of
any worked examples. In contrast, one-shot learning (OSL) includes an example in the prompt,
and few-shot learning (FSL) includes two or more [30, 31]. There are also variations in the
presentation of worked examples. A prompt can include just the correct label or desired response
for example. In Chain-of-thought (CoT) reasoning [32], however, the prompt demonstrates a
multi-step reasoning process, mimicking how a human would approach the problem. These
prompting approaches constitute sources of variance that may be important for educational
researchers working with LLMs.
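To make these naming conventions concrete, the sketch below assembles ZSL, FSL, and CoT prompts from a common task description. It is purely illustrative: the task text, example fields, and `build_prompt` helper are hypothetical stand-ins, not drawn from the study.

```python
# Illustrative only: a hypothetical helper showing how ZSL, FSL, and CoT
# prompts differ in the worked examples they include.
TASK = "Evaluate whether this exercise requires the two students to collaborate."

def build_prompt(item, examples=None, cot=False):
    """ZSL when examples is None; FSL with labels only; CoT adds reasoning."""
    parts = [TASK]
    for ex in examples or []:
        parts.append("Exercise: " + ex["text"])
        if cot:
            # CoT examples demonstrate the reasoning before the verdict.
            parts.append("Reasoning: " + ex["reasoning"])
        parts.append("Verdict: " + ex["label"])
    parts.append("Exercise: " + item)
    return "\n\n".join(parts)
```

A ZSL call is simply `build_prompt(item)`; passing two or more labeled examples yields an FSL prompt, and `cot=True` interleaves worked reasoning, mirroring the variants described above.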

2.3. Large language models in assessment
Advancements in LLMs have not gone unnoticed by the measurement field, where they have
been considered for item generation, scoring, and parameter calibration [33]. Relatively little
research has been conducted on item evaluation using LLMs. Most of this research has focused
on automatic evaluation of item difficulty [2]. For instance, researchers used LLM responses to
items to evaluate the guessability or the knowledge required to respond to those items [34, 35].
Others have focused on the linguistic features of items [36, 37, 38]. Only a few studies attempted
to automatically evaluate items’ content [39], and they generally did not use LLMs for this
purpose.
   The contribution of LLMs to assessment research and development may be even more
pronounced for difficult-to-measure constructs like CPS. Can these models reduce the burden of
new item design? Or will LLM-generated items be disastrous? While LLMs may be able to follow
detailed prescriptions for item structure, a more impressive achievement would be understanding
the task designer’s intent more broadly. To that end, a prudent step before engaging an LLM in
item generation is to test whether the model has the foundational knowledge to recognize a
good CPS item when it sees one.

2.4. The current study
In the current study, we sought to examine to what extent LLMs can judge the quality of
CHOPS template items for measuring CPS. Given the range of performance demonstrated in
the literature, we compared multiple foundational models, prompt strategies, and task types
to understand how some approaches may outperform others. This study contributes to the
literature in several ways. First, understanding LLMs’ ability to evaluate CPS items is a first
step in improving item quality and even automatically generating such items. Second, this
study is relevant to the measurement field as a whole, as it demonstrates how LLMs deal with
complex item evaluation tasks. Finally, by examining different models and prompts we can shed
light on the models’ respective strengths and limitations, guiding future research in educational
technology. In sum, our study aimed to answer the following research questions:

   1. To what extent can LLMs evaluate the quality of complex CPS items?
   2. To what extent do LLMs’ success rates vary by the foundational model, prompting
      approach, and type of item?


3. Methods
3.1. Item design
We created a small data set of CPS problems for LLMs to evaluate. The items were designed for
two students to solve and are approximately at the level of middle-school math. They use one
of three CHOPS templates. We label a CPS task as “good” if it invokes positive interdependence.
That is, it requires the participants to work together in a meaningful way to solve the problem. A
bad task does not require collaboration or cannot be solved for other reasons. The set contained
21 jigsaw (10 good, 11 bad), 20 joint construction (10 good, 10 bad), and 20 request information
(9 good, 11 bad) items, which were either new, adapted from items in CHOPS, or adapted from
publicly available large-scale math assessment items like TIMSS and NAEP. Each item was
reviewed by at least two team members for clarity, correctness, and content relevance.
   Figure 1 shows an example of a joint construction template created based on a TIMSS 2011
item [40]. Version A and B would be shown to the two collaborating students. Since both
students can enter values that meet the criterion presented in the item, they do not need to
collaborate to solve it, making this a bad example for CPS. A (minimally) good version of this
item would require each student to enter one value such that together they meet the criterion.
The pair of students must then negotiate a common solution.
        Version A:
        The minute hand of a clock turns 600 degrees between time T1 and time T2 of the
        same day. Together with your partner, come up with a possible value for T1 and T2.
        ■ Enter value for T1
        ♢ Enter value for T2
        Version B:
        The minute hand of a clock turns 600 degrees between time T1 and time T2 of the
        same day. Together with your partner, come up with a possible value for T1 and T2.
        ♢ Enter value for T1
        ■ Enter value for T2


Figure 1: An example of a bad joint construction item. A good variation would keep only the answer
input rows preceded with black squares or diamonds but not both.


3.2. Pre-prompt design
In this study, we use the term “pre-prompt” when referring to the instructions provided to the
LLMs on how to approach the item evaluation, since each “prompt” also includes an item that
the LLM is asked to evaluate. We designed several types of pre-prompts. Initially, we refined
one prompt through trial and error with GPT-4 to improve the output. We also designed a
pre-prompt following current best practices of asking the LLM to adopt the role of an evaluator
and separating its task by first asking it to identify the item type (template) and then make a
judgment on collaborative interdependence. Our original prompt was also paired with examples,
sometimes limited to pass/fail labels or extended to CoT reasoning. In total, we tested five
pre-prompts:

    • Zero-shot learning with no examples, prompt refined with GPT-4
    • Structured Zero-shot learning following prompt engineering best-practices
    • Few-shot learning, original prompt plus one good and one bad example from each template
      (six total); only pass/fail labels were provided
    • The same prompt with six CoT examples followed by a verdict
    • The same prompt and CoT, except with the verdict given before the reasoning

  Below is our ZSL pre-prompt. The CoT pre-prompts (with the example items also used in the
other few-shot pre-prompts) and the structured ZSL prompt are available in Appendices A.1 and
A.2, respectively.

      You will be asked to evaluate one educational exercise for math students working in pairs.
      The exercise will be presented to you in two parts, the exercise version shown only to
      Student A (called Version A) and the exercise version as shown only to Student B (Version
      B). Students A and B are assigned to be partners.
      Importantly, Version A and Version B may contain different, complementary information,
      or the information may be formulated differently. Student A cannot see Version B, and
      Student B cannot see Version A. The only way they can access the information available
      to their partner is by communication with each other via text chat. The exercise should
      require Both Student A and Student B to submit some answers in an answer field or fields.
      Your criterion for evaluation of the exercise is whether or not the exercise indeed requires
      Student A and Student B to collaborate in order to solve the problem. If so, indicate pass.
      It is not acceptable if Student A and Student B can work separately, independently, and
      without communicating and still each get the correct answer. In such case, indicate fail.
      For an exercise to pass, it should be impossible for the students to answer correctly by
      working alone independently. It is not necessary for you to solve the problem. However,
      you may describe the solution process in explaining your reasons for your evaluation.
      When providing your evaluation, please format it as follows:
      Verdict: [pass or fail]
      Reason: [explanation for verdict]
      The following is the exercise you need to evaluate:

3.3. Selection of language models
We used six LLMs from three families: GPT-3.5 and GPT-4 from OpenAI, Llama2 and Llama3
from Meta, and Mistral7B and Mixtral8x7B [41, 42]. This selection was designed to explore
variance between families as well as within a family, i.e., earlier/later or smaller/larger models.
Llama2 and Llama3 come in different sizes; in both cases, we used Q5-quantized versions of
the 70 billion (70b) parameter models. Mistral7B is a conventional 7b model. Mixtral8x7b is a
Sparse Mixture of Experts (SMoE) architecture with 47b total parameters, but the model uses
only 13b at inference time by routing each token to a subset of model components based on
the token’s attributes. We used Q8-quantized versions of both Mistral models. The Llama and
Mistral models were served locally on a high-performance MacBook Pro with 128GB of RAM.

3.4. Procedure
Experimental outputs were collected by an automated script with a browser-based front end.
The interface provided for API calls to GPT models and local/cloud-based open-source models.
Items and pre-prompts could be selected, and the script would subsequently append the pre-
prompts to each item for each call (61 items × 5 pre-prompts × 6 models). The outputs of each
query were saved for subsequent analyses.
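The collection step amounts to a full cross of models, pre-prompts, and items. A minimal sketch follows, assuming a hypothetical `query_model` callable that stands in for both the API and local-model calls; the names here are illustrative, not the authors' script.

```python
import itertools

def run_experiment(items, pre_prompts, models, query_model):
    """Cross every model with every pre-prompt and item; collect each output.

    items and pre_prompts map names to text; query_model(model, prompt)
    is a stand-in for the API or local-model call.
    """
    records = []
    for model, (pp_name, pp_text), (item_id, item_text) in itertools.product(
            models, pre_prompts.items(), items.items()):
        prompt = pp_text + "\n\n" + item_text  # pre-prompt appended to the item
        records.append({"model": model, "pre_prompt": pp_name,
                        "item": item_id, "output": query_model(model, prompt)})
    return records
```

With 61 items, 5 pre-prompts, and 6 models, a loop of this shape issues 1,830 queries.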
   In the analysis stage, LLM outputs were parsed using regular expressions for pass/fail verdicts.
All pre-prompts requested verdicts in a specific form, Verdict: Pass/Fail. Model outputs that
did not follow this structure were originally parsed as having no verdict. However, further
inspection revealed that many model responses contained meaningful evaluations in a different
form (e.g., “this exercise meets the criteria”). We therefore wrote a more complex parser to
identify relevant phrases. The new parser significantly lowered the no-verdict rates; however,
we understand that the parser was still imperfect.
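A two-stage parser of this kind might look like the sketch below. The strict pattern matches the requested `Verdict: Pass/Fail` format; the fallback phrases are illustrative guesses at free-form verdicts, not the authors' actual patterns.

```python
import re

# Strict pattern for the requested format, plus hedged free-form fallbacks
# (the fallback phrasings are assumptions for illustration).
VERDICT_RE = re.compile(r"Verdict:\s*(pass|fail)", re.IGNORECASE)
FALLBACK_PASS = re.compile(r"meets the criteri|requires .*collaborat", re.IGNORECASE)
FALLBACK_FAIL = re.compile(
    r"does not (?:meet|require)|can (?:be )?solved? (?:separately|independently)",
    re.IGNORECASE)

def parse_verdict(output):
    """Return 'pass', 'fail', or None when no verdict can be recovered."""
    m = VERDICT_RE.search(output)
    if m:
        return m.group(1).lower()
    if FALLBACK_PASS.search(output):
        return "pass"
    if FALLBACK_FAIL.search(output):
        return "fail"
    return None  # counted toward the no-verdict rate
```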
   We then compared the results of the parser with our ground-truth labels for each item. The
overall agreement is summarized using accuracy (% agreement) and Cohen’s 𝜅 [43]. Following
[44] but slightly more conservative at the low end, we interpret Cohen’s 𝜅 values ≤ 0.05 as
poor agreement, 0.06 to 0.20 as slight, 0.21 to 0.40 as fair, 0.41 to 0.60 as moderate, and 0.61 to
0.80 as substantial.
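For reference, both agreement metrics can be computed directly from paired labels; the sketch below implements accuracy and Cohen's kappa from scratch for the binary pass/fail case.

```python
def accuracy(truth, pred):
    """Proportion of items where the model verdict matches the ground truth."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def cohens_kappa(truth, pred):
    """kappa = (p_o - p_e) / (1 - p_e), with chance agreement p_e
    estimated from the marginal label frequencies of both raters."""
    n = len(truth)
    p_o = accuracy(truth, pred)
    labels = set(truth) | set(pred)
    p_e = sum((truth.count(l) / n) * (pred.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

For example, with ground truth `["pass", "pass", "fail", "fail"]` and predictions `["pass", "fail", "fail", "fail"]`, accuracy is 0.75 but kappa is 0.50, reflecting the correction for chance agreement.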


4. Results
Table 1 presents the classification performance for all tested models using the ZSL pre-prompt,
across all items as well as disaggregated by item type. GPT-4 had the best performance, with an
overall moderate agreement level. The three bottom models were barely better than chance
(i.e., 𝜅 scores are about zero). Only two open-source foundational models were somewhat
comparable to GPT-4: Llama3 and Mixtral8x7b. Overall, Llama3 was better than Mixtral8x7b,
but when disaggregated by item type, the results are more complex.
    Jigsaw items follow the same pattern as the overall results (GPT-4 > Llama3 > Mixtral8x7b). On
joint construction items, Llama3 and even Llama2 edge out Mixtral8x7B. However, classifying
information request items seems to be the hardest subtask. The highest accuracy, obtained by
GPT-4, is 0.63, with a fair 𝜅 of 0.26. Mixtral8x7b slightly beats chance on these items,
while Llama3 does worse than chance. In sum, it is possible that to optimize performance using
the open-source models, one would do better using Llama3 for jigsaw and joint construction
items and Mixtral8x7B for info request items.

Table 1
Classification performance (accuracy and 𝜅) using a common zero-shot prompt for all models. Results
are shown for all items as well as for jigsaw (𝑗𝑖𝑔), joint construction (𝑗𝑐), and info request (𝑖𝑟) items
separately
             Model           Acc𝑎𝑙𝑙   𝜅𝑎𝑙𝑙   Acc𝑗𝑖𝑔   𝜅𝑗𝑖𝑔    Acc𝑗𝑐   𝜅𝑗𝑐    Acc𝑖𝑟    𝜅𝑖𝑟
             GPT-4           0.77     0.53   0.86     0.71    0.80    0.60   0.63     0.26
             llama3.70B      0.62     0.25   0.81     0.61    0.65    0.30   0.40     -0.14
             mixtral8x7b     0.54     0.09   0.62     0.21    0.50    0.00   0.50     0.08
             mistral7b       0.51     0.03   0.57     0.10    0.50    0.00   0.45     0.00
             llama2.70B      0.51     0.03   0.52     0.00    0.55    0.10   0.45     0.00
             GPT-3.5         0.50     0.00   0.53     0.00    0.43    0.00   0.50     0.00

   Next, we examined the other pre-prompts to see if they impacted the results. Table 2 in-
cludes the classification metrics for the top three performing models, i.e., GPT-4, Llama3, and
Mixtral8x7b, across all pre-prompts (the ZSL results from Table 1 are embedded in the first
column).
   For GPT-4, which had the best overall performance on the task, it is notable that elaboration
of the original prompt did not have a positive impact on classification performance and often
led to worse performance. The ZSL pre-prompt was as good or better than all others, except
CoT prompting for info request items, which had identical accuracy and a 𝜅 higher by about
0.03. However, the difference is probably not of practical significance as the confidence interval
around 𝜅 is on the order of ±0.3.
   While the differences were still small, it does appear that few-shot prompting improved
the results from Llama3 and Mixtral8x7b in a number of prompt-item-type combinations. For
example, CoT prompting improved Mixtral8x7b notably on jigsaw items, while the CoT verdict-first
prompt improved its joint construction evaluations. Llama3 had more modest gains from these
two prompts.

Table 2
Accuracy (Cohen’s 𝜅) for GPT-4, Llama3.70B, and Mixtral8x7B by Pre-prompt Type
 GPT-4                GPT-ZSL        Structured ZSL        FSL verdict only     CoT            CoT verdict first
 All items            0.77 (0.53)    0.70 (0.40)           0.68 (0.37)          0.75 (0.50)    0.72 (0.43)
 Jigsaw               0.86 (0.71)    0.86 (0.71)           0.86 (0.71)          0.81 (0.61)    0.81 (0.61)
 Joint construction   0.80 (0.60)    0.70 (0.40)           0.70 (0.40)          0.80 (0.60)    0.75 (0.50)
 Info request         0.63 (0.26)    0.53 (0.06)           0.47 (-0.02)         0.63 (0.29)    0.58 (0.19)

 Llama3.70B           GPT-ZSL        Structured ZSL        FSL verdict only     CoT            CoT verdict first
 All items            0.62 (0.25)    0.59 (0.19)           0.66 (0.32)          0.55 (0.11)    0.64 (0.28)
 Jigsaw               0.81 (0.61)    0.76 (0.51)           0.81 (0.61)          0.83 (0.67)    0.71 (0.42)
 Joint construction   0.65 (0.30)    0.55 (0.10)           0.65 (0.30)          0.56 (0.12)    0.70 (0.40)
 Info request         0.40 (-0.14)   0.45 (0.00)           0.50 (-0.03)         0.44 (-0.11)   0.50 (0.01)

 Mixtral8x7B          GPT-ZSL        Structured ZSL        FSL verdict only     CoT            CoT verdict first
 All items            0.54 (0.09)    0.54 (0.09)           0.53 (0.06)          0.53 (0.08)    0.53 (0.07)
 Jigsaw               0.62 (0.21)    0.67 (0.31)           0.62 (0.22)          0.76 (0.52)    0.52 (0.02)
 Joint construction   0.50 (0.00)    0.55 (0.10)           0.50 (0.00)          0.47 (-0.03)   0.63 (0.23)
 Info request         0.50 (0.08)    0.40 (-0.10)          0.45 (0.00)          0.35 (-0.25)   0.45 (-0.04)

   The above analysis is perhaps too fine-grained, slicing by model, prompt, and item type. To understand
if different prompts are generally more suitable to different item types, we average over the top
three models. These results are shown in Table 3. Indeed, after averaging, it remains the case
that the best overall prompt is not the best prompt for each item type. Notably, the classification
of info request items is, at best, barely better than chance.

Table 3
Best pre-prompt (by Cohen's 𝜅) overall and by item type. The highest values in each column are
bolded.
                            Prompt                  𝜅𝑎𝑙𝑙   𝜅𝑗𝑖𝑔     𝜅𝑗𝑐       𝜅𝑖𝑟
                            GPT-ZSL                 0.29   0.51    0.30   0.06
                            Structured ZSL          0.22   0.51    0.20   -0.02
                            FSL verdict only        0.24   0.51    0.23   -0.01
                            CoT                     0.22   0.54    0.22   -0.03
                            CoT verdict first       0.26   0.35    0.38   0.06

  It appears to be the case that jigsaw classification is the most successful, followed by joint
construction and information request. A high-level summary confirming this finding using
accuracy scores averaged over pre-prompts for each model is shown in Table 4. Note that these
are not the best results for each model.
Table 4
Average Accuracy by Item Type and Model
                              Model         Jigsaw        Joint construction     Information request
                              GPT-4           0.84                       0.75                      0.54
                              llama3.70B      0.76                       0.61                      0.46
                              mixtral8x7b     0.64                       0.52                      0.43
[Figure 2 (scatter plot): item-type classification accuracy on the x-axis (roughly 0.1 to 0.5) versus item-quality evaluation accuracy on the y-axis (roughly 0.2 to 0.8) for the six models (GPT-4, llama3_70B, mixtral8x7b, llama2_70B, GPT-3.5, mistral7b), showing a positive relationship.]


Figure 2: Relationship between item-type classification success and item quality evaluation using the
structured ZSL prompt.


   As an exploratory step, we were interested in whether the models were able to classify
items into the correct types, which was the first sub-task in the structured ZSL approach. Base rate
classification accuracy for item types could be expected at 0.33, and actual results ranged from
0.13 to 0.43. Striking, however, is the relationship between accuracy in classifying the item type
(template) and accuracy in evaluating the items (see Figure 2). The strong correlation
(𝑟 = 0.80) suggests that models that perform better on one task also perform better on the other. Interestingly,
when it came to type classification, Llama3 was actually the best performing model.


5. Discussion
The purpose of our study was to test the feasibility of LLMs for evaluating items measuring
CPS. We also wanted to see if different models, pre-prompts, or item types affect the results.
Understanding these issues may contribute to research on how LLMs interact with complex
tasks and to future item design in practice. According to our findings, only three of the tested
models did better than chance, with GPT-4 outperforming the other models in almost all cases.
Among the open-access models, different models did better on different item types, suggesting
that users should consider the task type when choosing the best model. Given GPT-4’s success
relative to other models in various tasks [45, 46], including tasks related to item generation [47],
this result is unsurprising. However, even GPT-4 reached only moderate levels of agreement in
most cases. Others have also found that LLMs struggle with evaluative tasks [48], suggesting
directions for future LLM developments.
   Contrary to existing findings [49], elaborate pre-prompting rarely improved on the basic ZSL
pre-prompt. It is possible that the examples were confusing or focused the LLMs on the specific
cases rather than the general idea. We intend to examine this issue in the future. We also found
that some item types were easier for the LLMs to judge than others. All models generally did
best with the jigsaw items followed by the joint construction items and the information request
items. We are unaware of existing research comparing LLMs’ ability to evaluate different types
of interdependent tasks, and this might also be a fruitful direction for future work.
   This study has several limitations. First, our basic ZSL pre-prompt was refined using GPT-
4, perhaps contributing to its success. Since GPT-4 seems to outperform other models in a
variety of complex tasks, we believe this effect is likely small. Second, to enhance the study’s
generalizability, more items, constructs, models, and pre-prompts should be tested. Finally, we
could only examine the final verdict of the models and not their reasoning. Qualitative analysis
of the LLMs’ outputs is planned and could reveal the reasons for their disagreements with
humans.
   In conclusion, when evaluating the quality of CPS items, existing LLMs have only moderate
levels of agreement with humans at best. Adding more information beyond ZSL pre-prompts
does not improve this by much. However, different models and pre-prompts perform better
when evaluating different item types. Therefore, more work on the models or on prompting
strategies is required before LLMs can be reliably used for evaluating items measuring
CPS and, likely, similarly complex constructs.
References
 [1] J. A. Rios, G. Ling, R. Pugh, D. Becker, A. Bacall, Identifying critical 21st-century skills for
     workplace success: A content analysis of job advertisements, Educational Researcher 49
     (2020) 80–89.
 [2] L. Benedetto, P. Cremonesi, A. Caines, P. Buttery, A. Cappelli, A. Giussani, R. Turrin, A
     survey on recent approaches to question difficulty estimation from text, ACM Computing
     Surveys 55 (2023) 1–37. doi:10.1145/3556538.
 [3] J. Burrus, T. Jackson, N. Xi, J. Steinberg, Identifying the most important 21st century
     workforce competencies: An analysis of the occupational information network (O*NET),
     ETS Research Report Series 2013 (2013) i–55.
 [4] L. Darling-Hammond, J. Herman, J. Pellegrino, J. Abedi, J. L. Aber, E. Baker, R. Bennett,
     E. Gordon, E. Haertel, K. Hakuta, et al., Criteria for high-quality assessment, Stanford
     Center for Opportunity Policy in Education 2 (2013) 171–192.
 [5] S. M. Fiore, A. Graesser, S. Greiff, P. Griffin, B. Gong, P. Kyllonen, C. Massey, H. O’Neil,
     J. Pellegrino, R. Rothman, et al., Collaborative problem solving: Considerations for the
     national assessment of educational progress (2017).
 [6] R. M. Gillies, Cooperative learning: Review of research and practice, Australian Journal of
     Teacher Education (Online) 41 (2016) 39–54.
 [7] P. A. Kirschner, F. Kirschner, J. Janssen, The collaboration principle in multimedia learning,
     The Cambridge handbook of multimedia learning 2 (2014) 547–575.
 [8] D. Mullins, N. Rummel, H. Spada, Are two heads always better than one? differential effects
     of collaboration on students’ computer-supported learning in mathematics, International
     Journal of Computer-Supported Collaborative Learning 6 (2011) 421–443.
 [9] D. W. Johnson, R. T. Johnson, An educational psychology success story: Social interdepen-
     dence theory and cooperative learning, Educational Researcher 38 (2009) 365–379.
[10] R. E. Slavin, Research on cooperative learning and achievement: What we know, what we
     need to know, Contemporary Educational Psychology 21 (1996) 43–69.
[11] B. M. Stecher, L. S. Hamilton, Measuring Hard-to-Measure Student Competencies: A
     Research and Development Plan. Research Report., ERIC, 2014.
[12] A. C. Graesser, P. W. Foltz, Y. Rosen, D. W. Shaffer, C. Forsyth, M.-L. Germany, Challenges
     of assessing collaborative problem solving, Assessment and teaching of 21st century skills:
     Research and applications (2018) 75–91.
[13] J. Hao, L. Liu, A. A. von Davier, P. C. Kyllonen, Initial steps towards a standardized
     assessment for collaborative problem solving (CPS): Practical challenges and strategies,
     Innovative assessment of collaboration (2017) 135–156.
[14] PISA 2015 collaborative problem solving, https://www.oecd.org/pisa/innovation/
     collaborative-problem-solving/, n.d. Accessed: 2024-05-10.
[15] P. Dillenbourg, Over-scripting CSCL: The risks of blending collaborative learning with
     instructional design., Three worlds of CSCL. Can we support CSCL? (2002) 61–91.
[16] Y. Bergner, Y. Wang, Mathchops: A platform for developing collaborative higher order
     problem solving in mathematics, in: Proceedings of the 16th International Conference on
     Computer-Supported Collaborative Learning-CSCL 2023, pp. 51-58, International Society
     of the Learning Sciences, 2023.
[17] K. I. Roumeliotis, N. D. Tselikas, ChatGPT and Open-AI models: A preliminary review,
     Future Internet 15 (2023) 192.
[18] A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, D. S. W. Ting,
     Large language models in medicine, Nature medicine 29 (2023) 1930–1940.
[19] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser,
     G. Groh, S. Günnemann, E. Hüllermeier, et al., ChatGPT for good? On opportunities and
     challenges of large language models for education, Learning and Individual Differences
     103 (2023) 102274.
[20] A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, P. Schwaller, ChemCrow:
     Augmenting large-language models with chemistry tools, arXiv preprint arXiv:2304.05376
     (2023).
[21] F. F. Xu, U. Alon, G. Neubig, V. J. Hellendoorn, A systematic evaluation of large language
     models of code, in: Proceedings of the 6th ACM SIGPLAN International Symposium on
     Machine Programming, 2022, pp. 1–10.
[22] D. Kovačević, Use of ChatGPT in ESP teaching process, in: 2023 22nd International Sympo-
     sium INFOTEH-JAHORINA (INFOTEH), IEEE, 2023, pp. 1–5.
[23] J. Rudolph, S. Tan, S. Tan, ChatGPT: Bullshit spewer or the end of traditional assessments
     in higher education?, Journal of Applied Learning and Teaching 6 (2023) 342–363.
[24] A. Tlili, B. Shehata, M. A. Adarkwah, A. Bozkurt, D. T. Hickey, R. Huang, B. Agyemang,
     What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in
     education, Smart Learning Environments 10 (2023) 15.
[25] A. Balachandran, Tamil-Llama: A new Tamil language model based on Llama 2, arXiv
     preprint arXiv:2311.05845 (2023).
[26] Z. Lin, How to write effective prompts for large language models, Nature Human Behaviour
     (2024) 1–5.
[27] T. Sorensen, J. Robinson, C. M. Rytting, A. G. Shaw, K. J. Rogers, A. P. Delorey, M. Khalil,
     N. Fulda, D. Wingate, An information-theoretic approach to prompt engineering without
     ground truth labels, arXiv preprint arXiv:2203.11364 (2022).
[28] Best practices for prompt engineering with the openai api, https://help.openai.com/
     en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api#h_
     eae065300d, 2024. Accessed: 2024-05-04.
[29] Prompt engineering, https://platform.openai.com/docs/guides/prompt-engineering, n.d.
     Accessed: 2024-05-04.
[30] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
     P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in
     neural information processing systems 33 (2020) 1877–1901.
[31] J. Chen, Y. Geng, Z. Chen, J. Z. Pan, Y. He, W. Zhang, I. Horrocks, H. Chen, Zero-shot and
     few-shot learning with knowledge graphs: A comprehensive survey, Proceedings of the
     IEEE (2023).
[32] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-
     of-thought prompting elicits reasoning in large language models, Advances in neural
     information processing systems 35 (2022) 24824–24837.
[33] J. Hao, A. A. von Davier, V. Yaneva, S. Lottridge, M. von Davier, D. J. Harris, Transforming
     assessment: The impacts and implications of large language models and generative AI,
     Educational Measurement: Issues and Practice (2024).
[34] H. Maeda, Field-testing multiple-choice questions with AI examinees (2024).
[35] A. Säuberli, Automatic Generation and Evaluation of Multiple-Choice Reading Compre-
     hension Items with Large Language Models, Ph.D. thesis, University of Zurich, 2023.
[36] R. Rodriguez-Torrealba, E. Garcia-Lopez, A. Garcia-Cabot, End-to-end generation of
     multiple-choice questions using text-to-text transfer transformer models, Expert Systems
     with Applications 208 (2022) 118258.
[37] V. Raina, M. Gales, Multiple-choice question generation: Towards an automated assessment
     framework, arXiv preprint arXiv:2209.11830 (2022).
[38] M. J. Gierl, H. Lai, V. Tanygin, Advanced methods in automatic item generation, Routledge,
     2021. doi:10.4324/9781003025634.
[39] R. Meissner, D. Jenatschke, A. Thor, Evaluation of approaches for automatic e-assessment
     item annotation with levels of bloom’s taxonomy, in: International Symposium on Emerg-
     ing Technologies for Education, Springer, 2020, pp. 57–69.
[40] International Association for the Evaluation of Educational Achievement (IEA), TIMSS
     2011 Assessment, TIMSS & PIRLS International Study Center, Lynch School of Education,
     Boston College, Chestnut Hill, MA and International Association for the Evaluation of
     Educational Achievement (IEA), IEA Secretariat, Amsterdam, the Netherlands., 2013.
[41] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand,
     G. Lengyel, G. Lample, L. Saulnier, et al., Mistral 7B, arXiv preprint arXiv:2310.06825
     (2023).
[42] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l.
     Casas, E. B. Hanna, F. Bressand, et al., Mixtral of experts, arXiv preprint arXiv:2401.04088
     (2024).
[43] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological
     Measurement 20 (1960) 37–46.
[44] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data,
     Biometrics 33 (1977) 159–174.
[45] A. Borji, M. Mohammadian, Battle of the wordsmiths: Comparing ChatGPT, GPT-4,
     Claude, and Bard (2023).
[46] L. Martin, N. Whitehouse, S. Yiu, L. Catterson, R. Perera, Better call GPT, comparing large
     language models against lawyers, arXiv preprint arXiv:2401.16212 (2024).
[47] A. Säuberli, S. Clematide, Automatic generation and evaluation of reading comprehension
     test items with large language models, arXiv preprint arXiv:2404.07720 (2024).
[48] J. Steiss, T. Tate, S. Graham, J. Cruz, M. Hebert, J. Wang, Y. Moon, W. Tseng, M. Warschauer,
     C. B. Olson, Comparing the quality of human and chatgpt feedback of students’ writing,
     Learning and Instruction 91 (2024) 101894.
[49] B. Chen, Z. Zhang, N. Langrené, S. Zhu, Unleashing the potential of prompt engineering in
     large language models: a comprehensive review, arXiv preprint arXiv:2310.14735 (2023).
A. Full text of prompts
A.1. Chain-of-Thought
You will be asked to evaluate one educational exercise for math students working in pairs. The
exercise will be presented to you in two parts, the exercise version shown only to Student A
(called Version A) and the exercise version as shown only to Student B (Version B). Students A
and B are assigned to be partners. Importantly, Version A and Version B may contain different,
complementary information, or the information may be formulated differently. Student A
cannot see Version B, and Student B cannot see Version A. The only way they can access the
information available to their partner is by communication with each other via text chat. The
exercise should require both Student A and Student B to submit some answers in an answer
field or fields.
    Your criterion for evaluation of the exercise is whether or not the exercise indeed requires
Student A and Student B to collaborate in order to solve the problem. If so, indicate pass. It
is not acceptable if Student A and Student B can work separately, independently, and without
communicating and still each get the correct answer. In such a case, indicate fail. For an exercise to
pass, it should be impossible for the students to answer correctly by working alone independently.
It is not necessary for you to solve the problem. However, you may describe the solution process
in explaining your reasons for your evaluation. When providing your evaluation, please format
it as follows:


  Verdict: [pass or fail]

  Reason: [explanation for verdict]


  ##The following are example exercises with suitable responses:
  #Example prompt

   Version A: A factory produces 100,000 batteries each day. A sample of 200 batteries is drawn
from today’s production line, and 2 batteries fail the quality test. What is the best estimate for
the total number of faulty batteries produced today?

   Version B: A factory produces 100,000 batteries each day. A sample of 200 batteries is drawn
from today’s production line, and 2 batteries fail the quality test. What is the best estimate for
the total number of faulty batteries produced today?

  #Example response

  To estimate the total number of faulty batteries produced, one needs to know the total daily
production, the size of the test sample, and the number of failed batteries in the test sample.
Both Student A and Student B have the complete information needed to solve the problem and
thus can in principle solve the problem without collaborating with one another.
  Verdict: Fail

  #Example prompt

   Version A: A factory produces batteries each day. A sample of 200 batteries is drawn from
today’s production line, and 2 batteries fail the quality test. What is the best estimate for the
total number of faulty batteries produced today?

   Version B: A factory produces 100,000 batteries each day. A sample of batteries is drawn from
today’s production line, and 2 batteries fail the quality test. What is the best estimate for the
total number of faulty batteries produced today?

  #Example response

   To estimate the total number of faulty batteries produced, one needs to know the total daily
production, the size of the test sample, and the number of failed batteries in the test sample.
Student A has the sample size but does not have the total number produced, while Student B
knows the total number of batteries produced but does not know the size of the sample that
was tested. The collaborating students need to communicate this information to each other
to estimate the total number of faulty batteries produced today. Thus, this exercise meets the
requirement that it can only be solved if Student A and Student B share information with each
other.

  Verdict: Pass


  #Example prompt

  Version A:

  The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day.
Together with your partner, come up with a possible value for T1 and T2.

  Enter value for T1:
  Enter value for T2:

  Version B:
  The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day.
Together with your partner, come up with a possible value for T1 and T2.

  Enter value for T1:
  Enter value for T2:

  #Example response
  There is an infinite number of possible solutions to the posed problem. Each student is
provided with the ability to provide a complete solution to the problem. Thus, it is possible for
each student to answer correctly on their own without coordinating with their partner.

  Verdict: Fail


  #Example prompt

  Version A:
  The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day.
Together with your partner, come up with a possible value for T1 and T2.

  Enter value for T1:

  Version B:
  The minute hand of a clock turns 600 degrees between time T1 and time T2 of the same day.
Together with your partner, come up with a possible value for T1 and T2.

  Enter value for T2:

  #Example response

  Each student is provided with the ability to answer one of two necessary parts of the solution.
Moreover, the two parts must together compose a correct solution. Although there is an infinite
number of possible solutions to the posed problem, neither student can answer correctly on
their own without coordinating with their partner.

  Verdict: Pass


  #Example prompt

   Version A: In a school fund-raiser, students in class A and class B sold boxes of cookies. What
was the average number (arithmetic mean) of boxes of cookies sold by all students in both
classes?
   To answer this question, you and your partner may each make TWO selections from the
following list of values. After you submit your selection, the values you selected will be revealed
to you. Use this information to provide your answer in the box below.

  A. Average number of boxes of cookies sold in class A
  B. Total number of boxes of cookies sold in class A
  C. Average number of boxes of cookies sold in class B
  D. Total number of cookies per box
  E. Total number of students in class A
   Version B: In a school fund-raiser, students in class A and class B sold boxes of cookies. What
was the average number (arithmetic mean) of boxes of cookies sold by all students in both
classes?
   To answer this question, you and your partner may each make TWO selections from the
following list of values. After you submit your selection, the values you selected will be revealed
to you. Use this information to provide your answer in the box below.

  A. Average number of boxes of cookies sold in class A
  B. Total number of boxes of cookies sold in class A
  C. Average number of boxes of cookies sold in class B
  D. Total number of cookies per box
  E. Total number of students in class A

  #Example response

   Critical pieces of information necessary for solving the problem (such as the total number of
students in both classes or the total number of boxes sold in class B) are either missing or inade-
quately defined in the options available to the students. Therefore, the task is unsolvable with
the provided selections, even if students work together to combine their available information.
The exercise does not meet the criteria for a solvable and collaborative educational exercise.

  Verdict: Fail

  #Example prompt

   Version A: In a school fund-raiser, students in class A and class B sold boxes of cookies. What
was the average number (arithmetic mean) of boxes of cookies sold by all students in both
classes?
   To answer this question, you and your partner may each make TWO selections from the
following list of values. After you submit your selection, the values you selected will be revealed
to you. Use this information to provide your answer in the box below.

  A. Average number of boxes of cookies sold in class A
  B. Total number of boxes of cookies sold in class A
  C. Average number of boxes of cookies sold in class B
  D. Total number of boxes of cookies sold in class B
  E. Total number of cookies per box
  F. Total number of students in class A
  G. Total number of students in class B

   Version B: In a school fund-raiser, students in class A and class B sold boxes of cookies. What
was the average number (arithmetic mean) of boxes of cookies sold by all students in both
classes?
   To answer this question, you and your partner may each make TWO selections from the
following list of values. After you submit your selection, the values you selected will be revealed
to you. Use this information to provide your answer in the box below.
  A. Average number of boxes of cookies sold in class A
  B. Total number of boxes of cookies sold in class A
  C. Average number of boxes of cookies sold in class B
  D. Total number of boxes of cookies sold in class B
  E. Total number of cookies per box
  F. Total number of students in class A
  G. Total number of students in class B

   #Example response
   To calculate the overall average number of boxes sold by students in both classes, students
will need at least four pieces of information from the options provided. For instance, one student
might choose the total number of boxes sold in class A and the total number of students in
class A, while the other selects the equivalent information for class B. Alternatively, they could
choose average numbers and total students in each class. However, each student has the ability
to select only two pieces of information. Without sharing this information, neither student can
independently calculate the overall average, fulfilling the requirement for collaboration.

  Verdict: Pass

  The following is the exercise you need to evaluate:


A.2. Structured Zero-shot
Your Role: Collaboration evaluator for math exercises

  Objective: You need to evaluate collaborative math exercises provided for two students who
are solving the exercises together. The goal of this evaluation is to determine whether the
exercises require genuine collaboration between the partners to solve.

  Exercise overview: Each exercise will be presented to you in two parts, Version A, accessible
only to Student A, and Version B, accessible only to Student B. Students A and B are assigned to
be partners.

  Types of collaborative exercises:
   1. Jigsaw (the pair of students are provided different or complementary information that
      needs to be shared to arrive at the solution)
   2. Joint construction (the pair of students are provided the same information but need to
      solve and respond with different parts of the solution)
   3. Info request (the students may or may not receive different information, but they will
      need to collaborate to identify two pieces of information they can request to solve the
      exercise)
   Thus, Version A and Version B may contain different or complementary information, the
information may be formulated differently, or the response options provided to each student
may be different. Images or figures provided are summarized in text within square brackets.
Student A cannot see Version B, and Student B cannot see Version A. The only way they can
access the information available to their partner is by communication with each other via text
chat. The exercise should require both Student A and Student B to submit some answer(s).

  Evaluation Criteria:


   1. Communication Necessity: Is communication between Student A and Student B essential
      for completing the exercise?
   2. Solution Process: Can the problem only be solved through the combined efforts and
      information of both students?


  It is not necessary for you to solve the problem. However, you may describe the solution
process in explaining your reasons for your evaluation.

  Evaluation format: When providing your evaluation, please format it as follows: Verdict:
[pass or fail]

  Type: [Jigsaw, Joint Construction, Info Request, NA (if fail), Other (if pass but does not fit
any of the types)]

  Reason: [explanation for verdict]
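Because both pre-prompts fix the response format, the models’ verdicts can be extracted mechanically before scoring against the human judgments. A minimal parsing sketch (the function and sample reply are hypothetical, not the study’s code):

```python
import re

def parse_evaluation(reply):
    # Pull the Verdict / Type / Reason fields out of a structured response;
    # a field the model omitted comes back as None.
    fields = {}
    for key in ("Verdict", "Type", "Reason"):
        match = re.search(rf"{key}:\s*(.+)", reply)
        fields[key.lower()] = match.group(1).strip() if match else None
    return fields

reply = "Verdict: pass\nType: Jigsaw\nReason: Each version holds complementary information."
print(parse_evaluation(reply)["verdict"])  # → pass
```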