Large Language Models for the Assessment of Students' Authentic Tasks. A Replication Study in Higher Education

Large Language Models for the Assessment of Students' Authentic Tasks. A Replication Study in Higher Education DanieleAgostini daniele.agostini@unitn.it Department of Psychology and Cognitive Sciences University of Trento

Corso Bettini, 84 38068 Rovereto Italy

FedericaPicasso federica.picasso@unitn.it Department of Psychology and Cognitive Sciences University of Trento

Corso Bettini, 84 38068 Rovereto Italy

HelgaBallardini helga.ballardini@unitn.it Department of Psychology and Cognitive Sciences University of Trento

Corso Bettini, 84 38068 Rovereto Italy

Large Language Models for the Assessment of Students' Authentic Tasks. A Replication Study in Higher Education 1613-0073 CB74E1CC12FDF748AD3789BB405D6045 GROBID - A machine learning software for extracting information from scholarly documents Large Language Models (LLMs) AI-Assisted Assessment Rubrics Authentic Tasks Academic Assessment † D. Agostini: Conceptualisation Methodology Investigation Formal Analysis Writing -original draft Writing -review & editing Resources Supervision. F. Picasso: Investigation Writing -original draft Data curation. H.Ballardini: Investigation Writing -review & editing

After the public release of ChatGPT (November 30th, 2022) and consequently, that of all its competitors, the use of Large Language Models (LLMs) has become widespread among the public. The most significant impact was perceived from the very beginning in the field of Education and Instruction [1,2,3,4,5,6,7]. Of particular interest for this paper is its use both by teachers and students in particular in the context of higher education [8,4,9]. The immediacy with which Large Language Models (LLMs) have been integrated into higher education practices, both by teachers and students, leads to questions of fundamental importance relating to their effectiveness and reliability. In this field, LLMs become the means through which teachers have the opportunity to revolutionise the interaction with students, the management of workload and the personalisation of each learning experience [2]. Although these technologies are recognised as having advantages and potential for improving learning in terms of accessibility and personalisation [7], a crucial question concerns their application in assessment practices, especially the ability to objectively and impartially evaluate students' performance. The possibilities of using these tools in the field of learning evaluation is relatively little known, which implies the need to delve deeper into the topic for its application both in pedagogical theory and in educational practice. A previous study has been already published [10] which explored the use of the main LLM in the specific context of assessing students' papers, and this is a replication study based on it. The purpose of the current study is to explore the possible use of the main LLMs in the specific context of evaluating students' written productions, with a focus on the aspects of accuracy that are evaluated with the help of a rubric proposed by the teacher. This article is part of a series of contributions that focus on this topic, in light of the principles and application of the AI-Mediated Assessment for Academics and Students (AI-MAAS) model [11].

The Context: AI Assessment in Higher Education

In the last two years, Large Language Models (LLM) have taken on a very significant role in the technological landscape thanks to the launch of ChatGPT, followed subsequently by the release of competing models. The impact of LLMs remained relatively limited over time until increasingly simple and intuitive user interface functions were introduced, firstly the "chat" level, which brought the general public closer to these tools. This phenomenon of "democratisation" boosted the commercial and large-scale use of LLMs, which led institutions, companies and individuals to increase investment in this sector [12,13,14,15]. In addition to OpenAI's ChatGPT, Anthropic's Claude, Microsoft's Copilot and Google's Gemini are just some of the most used LLMs, in addition to the much more numerous open-source models to which Meta's LLAMA has given a notable boost At the same time, however, this has led to a crisis in search engines since LLMs, without requiring advanced research skills, offer new ways of querying and analysing data, more natural interaction and sufficiently precise and exhaustive answers. For example, LLMs allow users to avoid various typical inconvenient steps that characterise the standard use of search engines, such as the selection of long lists of websites, the acceptance of cookies and the appearance of advertising banners. As a result, educational institutions and agencies have begun to incorporate LLMs and generative AI into their curricula at various levels, developing courses to harness the potential of these innovative technologies. There is currently a strong emphasis on AI Literacy [16,17], which allows professionals from different sectors, including educational institutions, to deepen their understanding of the fundamental elements of AI generative, the availability of tools, the functionalities and methods of use that make LLMs effective tools in all fields [18,19,20,21,22,23]. However, one of the most critical issues concerns information management: LLMs possess enormous potential due to their ability to analyse and generate data; this raises numerous questions about accuracy, privacy and ethics in information management and ownership of output. The challenge in this continuously and rapidly evolving field becomes the ability to pay constant attention and critically evaluate so that end users always use LLMs responsibly [24,25,26,27]. In relation to this issue, higher education institutions have reacted by placing themselves on the defensive, so much so that some universities, in order to counteract the possible use of LLMs by students during exam tests, they have reintroduced the obligation to write by hand and also take oral tests [8,28]. At the same time, pieces of software created specifically to detect the productions generated by LLMs were introduced on the market. However, these turned out to be ineffective, causing management and legal problems for institutions because students could be unfairly accused of sending texts generated by artificial intelligence [29,21,30]. To avoid such inconveniences, national and international institutions and universities promptly provided themselves with guidelines that promote ethical behaviour towards the use of LLMs while maintaining a certain caution, allowing students and teachers to use them effectively to carry out tasks and benefit institutions. Important international bodies and universities moved in this direction, such as UNESCO [31,32], the JISC National Center for AI [33], the Russell Group [34], the French National Ministry of Education [35], the US Department of Education [36] and University College London [37]. Assessment tasks have proven to be arguably the ones that can profit most from the AI technology, especially in terms of sustainability. However, caution is needed as LLMs without specific task adaptations have proven incapable and unreliable in managing students' assessments independently [38,33], while LLMs supported by assessment tools have been shown to produce satisfactory results [39]. Above all, the use of artificial intelligence by students requires that teachers know how to take ethical aspects into consideration and act with responsibility when evaluating tasks and tests which results could have a great impact on students' careers (for example, motivation, grades, scholarships, acceptance into master's or doctoral programs).

Theoretical Framework

Since the 1980s the idea of being able to use computerised systems (and now also artificial intelligence) to assist educators in their assessment tasks and to be able to make precise, impartial and informed decisions has been present in much literature [40,41]. The possibility of using LLMs for learning assessment had already been explored in the period immediately preceding the release of ChatGPT, while the use of transformer models, including OpenAI's GPT-3, were already well established. Tamkin et al. [42] emphasised their educational application, which included:

• Summary: LLMs are able to summarise even very long texts. This use can help students submit concise summaries. Furthermore, various parameters can be considered for the synthesis, and this supports educators in providing precise information on the elements of the text that will be evaluated. • Questions and Answers: LLMs can "understand" various portions of text, answer questions, and ask questions when required. These features are useful for providing interactive feedback and learning experiences. • Classification: LLMs can classify the text into predefined categories: this allows you to introduce assisted assessment or classify students' feedback.

• Plagiarism detection: by comparing the similarity between different texts, LLMs are very useful to detect potential cases of plagiarism among students or to identify the misuse of original materials by students. • Assessment of knowledge: LLMs can assess students' understanding of a topic based on their written productions, especially if the information is generated from correct homework and with the help of an assessment rubric to refer to.

These five applications are fundamental to using LLMs in learning assessment. Following the introduction of ChatGPT and other universally accessible LLMs, UNESCO published the guidelines "AI and education: Guidance for policy-makers" [31], which suggest the following recommendations for learning assessment:

1. Test and implement AI technologies to support the assessment of various dimensions of skills and outcomes. 2. Use caution when using an automated assessment with closed-ended, rule-based questions. 3. Use AI-generated formative assessment as an integrated feature of learning management systems (LMS) to analyse learning outcomes more accurately and efficiently and reduce the risk of human bias. 4. Use the ability to provide AI-powered progressive assessments to regularly update students and parents. 5. Examine and evaluate the use of facial recognition and other artificial intelligence capabilities for users' recognition and their tracking in remote online assessments.

Based on these different theoretical approaches, recommendations and guidelines, the AI-Mediated Assessment for Academics and Students (AI-MAAS) model was developed which is currently under validation; it proposes two potential implementations of LLMs for the assessment of learning: the first one for formative evaluation and the second one for summative evaluation [11]. In both cases, the selected LLM must be able to evaluate using an assessment rubric provided by the teachers or by the students. Given the novelty of the tool, so far, there are not many experiments in this field. Martin et al. [39] worked on this opportunity, starting from the need to be able to assign students even complex tasks that involve a certain degree of reasoning, abstract conceptualisation and reprocessing of information; while the correction of these types of tasks (with a large quantity of long open answers) often proves to be an unsustainable task for teachers. Some researchers, working on this aspect, have demonstrated that in the evaluation of a chemistry task, for example, it is possible to use LLMs: in this case, an almost perfect match was obtained between the scores assigned by human raters and the scores generated by the LLMs. It should be highlighted, however, that Martin and colleagues did not simply use an LLM to achieve this result. The researchers used a complex procedure that involved, among other operations, the unsupervised machine learning technique HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), a cluster mapping and training of a deep neural network classifier. The aim of this study was to test an operational model and demonstrate its feasibility. This excellent solution represents the result of models trained on specific tasks and populations; therefore, it cannot be assumed that the procedure applied can be replicated by any teacher not specialised in Machine Learning. Other studies have instead used LLMs for assessment purposes without comparing the performance of the AI with that of a teacher. These studies applied for example in the evaluation of L2 English tasks [43] and in supporting self-assessment came to satisfactory results [44]. Machine learning has also been applied in the evaluation of tasks related to STEM disciplines, but without using LLM [45].

Finally, a previous study [10] explored the use of the main LLMs in the specific context of assessing students' papers, with a focus on their accuracy in assessing according to a rubric developed by the teacher. The idea was that employing LLMs for assessment in higher education can enable the adoption of teaching and assessment approaches that were previously unsustainable and unscalable. This should help to ensure constructive alignment [46] and thereby improve the quality and effectiveness of university teaching. The study, aimed at selecting the most human-like evaluation amongst LLMs, highlighted that while some AI models, like ChatGPT-4 and Claude 2, performed well in most of the assessment criteria, others, such as Microsoft Copilot and Google Bard, were far from human-like assessment. The article recommends further research on ChatGPT and Claude, with potential inclusion of open-source models as well as involving multi-shot prompting, expanding the student sample, involving more evaluators, and refining and redesigning the rubrics.

Methodology and Tools

This is a replication study of "Are Large Language Models Capable of Assessing Students' Written Products? A Pilot Study in Higher Education" published in "Research Trends in Humanities Education & Philosophy, 11" [10] that follows the same methodology with updated LLMs employed, a greater sample of students and of human evaluators. It explores the use of leading Large Language Models (LLMs) in the specific context of assessing student written products, focusing on their precision and ability to evaluate according to a grading rubric developed by the teacher. The goal is to understand whether and which models can be used by university and non-university educators who are not experts in Machine Learning to assess students' written products, even in the presence of open-ended tasks and questions, thanks to grading rubrics. The pilot study was conducted at the University of Trento within the context of a university habilitation course for secondary school teachers, during the module concerning learning methodologies. One-hundred-fourty-two students participated anonymously, divided into 35 groups, along with 3 evaluating teachers, experts in experimental pedagogy and assessment. No data regarding the students' demographics was collected. The groups were tasked with carrying out an authentic task, namely to re-designing a past educational intervention that proved to be unsuccessful, targeted at a specific class (which could range from 1st grade of lower secondary school to 5th grade of higher secondary school, depending on the group composition). They were instructed to identify the past teaching approaches and strategies and to now think of different ones more suited to reach the intended learning outcomes. Furthermore, students' reflection and redesign ability was evaluated through the rubric of reference. To complete this task, groups were given two hours and thirty minutes, and a template for the educational design consisting of the following sections was provided: Involved Disciplines, Class and Grade Level, Intervention Title, Teacher, Programme and Learning Objectives, Context and Environment (formal, informal, type of setting, etc.). Moreover, a description of the reflection process applied for renewing the formative design is required, and it is considered under evaluation. In the part of the schedule with details, they were asked to explain the programming with concise descriptions of the various educational activities, the teacher's tasks, and those of the students. Within this framework, groups had the freedom to propose their original programming. The final product of each group is thus an MS Word file containing the programming of the educational intervention according to the described template. For the evaluation of the products, the following grading rubric (Table 1) was prepared, consisting of five evaluation criteria with four levels for each criterion.

Three expert human evaluators and seven LLMs (plus one that merged the feedback of all 7 models in a single one) evaluated all the student groups' products. The LLMs selected for this study were the most popular competing models at the time, and they are applied in the assessment process through the use of big-AGI (https://big-agi.com/). Big-AGI is an AI suite created to make advanced artificial intelligence accessible and was chosen for ease of adding several models through API, the possibility of imparting system prompts and the function (called "beam") for sending the same prompt to several LLMs at the same time. Human results were then compared with the results produced by the LLMs with various statistical analysis (see Method of Analysis section). The models used are:

1. GPT-4o: released in May 2024, GPT-4o is a multilingual, multimodal generative pretrained transformer developed by OpenAI. The model is capable of processing and generating text, images, and audio, making it a versatile tool for a wide range of tasks. Its multimodal capabilities

Understanding and applying educational architectures

Demonstrates a limited understanding of educational architectures, with applications not always appropriate or consistent.

Shows

a basic understanding of educational architectures, applying them generally correctly but with some uncertainties.

Applies educational architectures correctly, with a good understanding of their use in the specific context.

Demonstrates a thorough understanding of educational architectures, applying them in an innovative and contextually relevant manner. Clearly justifies the choices made.

Selection and implementation of teaching and learning strategies

The teaching strategies chosen are limited or not always appropriate for the objectives of the intervention.

Uses some relevant teaching strategies, but their implementation could be more targeted or diversified in relation to the intervention goals.

Selects and implements appropriate teaching strategies with a good correlation to the intervention goals.

Selects and implements highly effective and diversified teaching strategies, perfectly adapted to the objectives and context of the intervention.

Definition of the intended learning outcomes

The objectives are vague, not measurable or not aligned with the chosen teaching architectures and strategies.

The objectives are present but could be more specific or better aligned with the teaching architectures and strategies.

The objectives are welldefined and generally aligned with the chosen teaching architectures and strategies.

The objectives are clear, specific, measurable and perfectly aligned with the chosen teaching architectures and strategies.

Detailed scanning of the intervention

The scan is incomplete, unclear, or lacks a logical progression of activities.

The scan is present but could be more detailed or better structured in some parts.

The scan is clear and generally wellstructured, with a good progression of activities.

The scan is detailed, logical and wellstructured, with a clear progression of activities and realistic timeframes.

Critical reflection on the redesign process

There is a lack of critical reflection on the changes made or the justifications are superficial.

Includes some reflection on the changes, but the analysis could be more thorough.

Provides good reflection on the changes, with clear links to the learning objectives.

Provides deep and critical reflection on the changes made, clearly justifying each choice in relation to the learning objectives.

enable a deeper integration of different data formats, enhancing its utility in complex applications. Link: https://openai.com/index/hello-gpt-4o/ 2. Gemini 1.5 Pro Latest: a large language model developed by DeepMind (Google), is natively multimodal and supports an extended context window of up to two million tokens, which is currently the longest of any large-scale foundation model. This expansion in token capacity allows for processing more extensive sequences of data, thereby increasing its utility in tasks that require long-term contextual understanding. Link: https://deepmind.google/technologies/gemini/pro/. All these LLMs can "understand" and write in Italian, but it cannot be ruled out that performance in English may be different (presumably better, since most of the training is done in that language). Mistral's models were added for their specific training with European languages that renders them "natively fluent in English, French, Spanish, German, and Italian, with a nuanced understanding of grammar and cultural context". Privacy shouldn't be a concern since there is no data saved to LLM provider servers due to our use of API on a local instance of big-AGI. All the conversations are saved only locally.

Prompting

This study aimed to understand which models could be used by university educators (and, potentially, other educators) to assess students' products. For this reason, overly sophisticated prompting techniques were not used; instead, what an educator might do by providing clear instructions and giving the necessary context data for evaluation was employed. The LLMs systems were promoted through the following instruction (originally written in Italian). The first one is the System Prompt:

You are an experienced and impartial university lecturer. Your job is to assess the quality of student assignments according to a specific assessment rubric. The second is the prompt that were given to the LLMs to assess the products (originally written in Italian):

Evaluate the attached teaching design (student task) that was created by a group of students from the secondary school teaching qualification course. The key competence of this assignment lay in being able to design a teaching intervention that makes effective use of teaching architectures and strategies. In particular, the group's competence in terms of redesign and depth of reflection is taken into account with respect to previous instructional design. At the same time, the instructional design had to prove effective in achieving the goals they set themselves. Take into account that the students only had 2 hours to design. Use the evaluation rubric below to assess: <Starting teaching design evaluation rubric> Evaluation rubric: The prompt was sent simultaneously to all the LLMs involved. Through big-AGI the authentic task document was attached in PDF format. A zero-shot prompting procedure was used for all LLMs, meaning that no examples of human task assessments were given to the models. It is possible for a university instructor to provide an example that can enhance the quality of LLM assessments, however, the goal in this instance was to choose the most suitable models for this type of evaluation, not to find methods for optimising the results. Finally, an "eighth" LLM evaluator has been added, which uses the "Beam" function of big.AGI software. All the nuanced answers from the 7 LLM have been sent for consideration and synthesis to GPT-4o, resulting in an eighth assessment that considers the feedback from all seven LLMs.

Models' Settings

In order to set a common limit of length for every model's answers, all of them have been set through big.AGI API controls to 8128 tokens maximum. Also, the temperature was set to 0.2, that should ensure quite strict adherence to the instructions yet leave some room for creativity in answers.

Attention to Tokens and Context

Understanding tokens and context is crucial when using a Large Language Model (LLM). Tokens can be simplified as units of text that might consist of a word, part of a word, or even a single character. The characteristics of tokens can vary between models. However, it is generally safe to assume that, on average, English might require one to one and a half tokens per word, and Italian might need one and a half to two tokens per word. The context window, another essential concept, represents the number of tokens a language model can consider simultaneously when generating responses. This context depends on the model used and the available memory. Exceeding a model's context window could cause errors if it happens in a single prompt or, in a more extended conversation, the model might start ignoring the earlier parts of the dialogue to make room for more recent inputs. Therefore, preserving context is vital for generating coherent and relevant responses. It is important to note that not only the user's prompts consume context, but the model's responses do as well. To preserve the context window, some LLMs platforms impose a character limit on the prompts that can be sent and on the length of the generated responses, which are shorter than the maximum context window. Contrasting this replication study with the original one, it can be noted that context windows are decidedly wider than the ones that were found in LLMs one year ago, posing less of threat to the coherence of the assessment. Below Table 2 illustrates the maximum context window size for each of the models used:

Method of Analysis

The analysis method for evaluating the data involved examines the levels assigned by each evaluator (both LLMs and humans) to the various criteria of the rubric for each of the 35 group products. Each of the seven evaluators assigned a level to each of the five criteria for every product, resulting in each evaluator assigning a level to a total of 175 criteria. Several statistical techniques were employed to extract insights from the data, including Principal Component Analysis (PCA), analysis of standard deviation, and the creation of a disagreement index among evaluators. Microsoft Excel and JASP (based on R) were used for the statistical analyses.

Results

The consistency of the assessment for different models has been tested over the evaluation of three random tasks from the sample for three times each by each one of the models. For this test, each LLM assessed a total of 45 criteria. From these tests, the following behaviours were observed:

• GPT-4o, Gemini 1.5 Pro Latest and Claude 3.5 Sonnet were extremely consistent, with only one instance of a different assessment for one criterion, by just one point. • Mistral Large 2402 was perfectly consistent with zero instances of different assessments.

• Open Mixtral 8x22B (2404) and Qwen2 72B Instruct were quite consistent, with five instances of different assessments of single criterion by one point. • Llama 3.1 70B Instruct Turbo: was the fairly inconsistent, with 19 instances of different criteria assessment by one point.

Principal Component Analysis

The first analysis conducted, in addition to descriptive data, was the PCA, a dimensionality reduction technique that allows the identification of latent variables within the data and that can represent a general model of the data. Three principal components were identified from the PCA conducted on the assessment data (Table 3). The first component (RC1) is formed by evaluators e1, e3, e4, e5, e6, e7, e8 loadings, which correspond respectively to the LLMs GPT-4o, Claude 3.5 Sonnet, Mistral Large, Mixtral 8x22B, Llama 3.1 70B, Qwen2 72B and the merge of LLMs opinion. The second component (RC2) comprises those of e2, e9, e10 and e11 corresponding to Gemini 1.5 Pro and human evaluators 1, 2 and 3. As can be appreciated in Figure 1, Both GPT-4o and Claude 3.5 Sonnet contributes mainly to RC1 component but also to RC2. Gemini Pro 1.5 on the other hand, contributes only to RC2 component (the tiny loading to RC1 is negative). Trying to name the identified components, RC1 could be called "LLM Evaluation Pattern" and RC2 "Human Evaluation Pattern".

Analysis of Standard Deviation of Grades by Product and Assessment Criterion

To understand how assessments differed from criterion to criterion and from evaluator to evaluator, an analysis was conducted on the standard deviation (SD) of the different variables of the study. The criteria, numbered or abbreviated in some of the graphs, are those listed in Table 4. Firstly, an effort was made to identify which assessment criteria had the slightest and the most SD (Table 4) to understand which were assessed more consistently by all evaluators. The criteria with the minimum SD across all products is Criterion 4 and 1 ("Detailed scanning of the intervention" and "Understanding and application of teaching architectures"), with an average of about 0.5. This suggests a high level of agreement among evaluators in assessing the quality and details of the detailed activities envisaged in the educational design and the understanding and correct application of the teaching architectures at their bases. On the other hand, the criterion with the maximum SD among all activities is Criterion 5 (Critical reflection on redesign), with an average of about 0.8. This indicates a higher level of disagreement or inconsistency in how evaluators assessed the quality of teacher's critical reflection about their past activities and the way in which they tried to improve them.

Agreement Index

An "Agreement Index" (AIdx) was developed to obtain a more robust metric and better understand which evaluators assigned more similar scores for the various criteria. This index combines the average difference between the scores assigned to a criterion and the variability of this difference. It was calculated to understand which evaluators are most similar to the human ones for each criterion. While LLMs evaluators are treated individually, the human benchmark is an average of the human evaluators' (e9, e10 and e11) assessments. It is constructed as follows:

Therefore:

• The "Average Difference" is the absolute average difference in scores assigned between the evaluator in question and the average of human evaluators across all tasks and criteria. • The "Variability of the Difference" is the standard deviation of the difference scores between the tested evaluator and the reference evaluator, reflecting how consistent these differences are across different tasks and criteria.

AIdx is calculated individually for each evaluator. It provides a single measure that encapsulates the average magnitude of evaluation differences relative to the reference evaluator and the consistency of such differences. A lower value indicates a more significant overall agreement in evaluation relative to the human evaluator. The highest possible value for the index for an evaluator would be achieved if they constantly evaluated at the maximum difference from the human evaluators (3 points).

The LLM evaluator who provided assessments most similar to the average of human evaluators (calculated through the AIdx) is GPT-4o, followed at a negligible distance by Claude 3.5 Sonnet. On the other hand, the LLM evaluator with the worst AIdx is Qwen2 72B (Table 5).

Table 5

Agreement Indices (AIdx) with reference to the average of human evaluators.

Evaluator

Agreement Index with "average human" (lower is better) Focusing on the single criterion (Table 6), it can be noted how AIdx with other evaluators vary from criterion to criterion. Unexpectedly, Qwen2 72B, the worst on the general AIdx with the "average" human evaluator, is the single model that is most human-like in three criteria out of five. Its main problem is that it assessed in a very different way from humans the most difficult criterion: criterion number 5 "Critical reflection on redesign" (Table 6). It also did not fare optimally in criterion number 3 "Definition of learning objectives". Other LLMs like GPT-4o and Claude 3.5 Sonnet, as well as the Merge of the different LLMs feedback, keep a good Agreement Index across the board.

Human 3 0.

Assessment correlations among LLM and Human evaluators

As reported in Table 7 the model with higher correlation with human evaluation is by far Gemini 1.5 Pro (r = 0.84), followed at a distance by Claude Sonnet 3.5 (r = 0.66), then by GPT-4o (r = 0.59), the merge of the LLMs feedback by GPT-4o (r = 0.58) and Llama 3.1 70B (r = 0.45). This suggests that Gemini 1.5 Pro's pattern of scores across the criteria is the most similar to that of the human evaluators.

But what happens excluding single criteria from the correlation analysis? That could help in understanding what criteria makes the assessment "human" and what LLMs struggle with:

• Excluding Criterion 3 (Definition of learning objectives): When Criterion 3 is excluded all LLMs' correlation indexes significantly improve. Noticeably, for Mistral Large and Qwen2 72B the jump is from being hardly correlated, or not at all (r = 0.27 and -0.06 respectively), to being significantly correlated (r = 0.86 and 0.88). Excluding Criterion 3 also significantly reduces the correlation of Gemini 1.5 Pro suggesting that this was the Criterion that it got right and mostly contributed to its excellent general correlation to humans' assessment. This suggests that Criterion 3 may be peculiarly human-like in its application, which these models struggle to mimic accurately. The high increase implies that Criterion 3 might involve a complex judgment that those models are incapable to handle or contextual information that is not being passed to the model.

• Excluding Criterion 2 (Selection and implementation of teaching strategies): Excluding

Criterion 2 doesn't change LLMs correlation with human assessment, except for Gemini 1.5 Pro Latest. Gemini shows an almost perfect correlation of r = 0.99 when Criterion 2 is excluded, which is remarkable, but, even in this case, this criterion doesn't seem to be crucial.

• Excluding Criterion 5 (Critical reflection on redesign):

The exclusion leads to a substantial increase in correlation for Mistral Large (from r = 0.27 to 0.88) and a notable improvement for several other models. This criterion, similarly to Criterion 3, may also represent aspects of human judgement that are challenging for models to replicate accurately.

Discussion

Regarding the goal of understanding whether educators without expertise in machine learning can employ current Large Language Models (LLMs) to assess students' written authentic tasks using assessment rubrics, the analyses have revealed several interesting elements:

• Differently from a previous iteration of the study [10], all the models have enough context window to perform this task. • From the PCA, it appears that human evaluators generally have a different pattern of evaluation compared to LLMs. • In contrast with the evaluation pattern, the Agreement Index (AIdx) measures both the magnitude of the score differences and their consistency. A high Agreement Index value suggests significant discrepancies between the model's scores and the average human scores, despite possibly similar trends in the pattern. Transforming in percentage the AIdx of each model referred to the average human an accuracy metric has been achieved. This helps to better visualise each model's performance (Fig. 2 and Fig. 3) • Only Llama 3.1 70B was inconsistent in the repeated assessment of the same task. • Gemini 1.5 Pro is the LLM model with the evaluation pattern more similar (with by far the higher correlation) to the human's (see Table 7). It is the only model that in the PCA results only in the component of human assessment (Fig. 1). On the other hand, its AIdx was the second worst, just before Qwen2 72B (Table 5, Fig. 2). • GPT-4o and Claude 3.5 Sonnet have evaluation patterns not too dissimilar from the human's (Fig 1, Table 7) and on average attribute marks more similar to humans than any other model (Table 5). • Llama 3.1 70B Instruct was the best of the open models, and the fourth in total (Table 5), after the already mentioned three proprietary models. It behaved quite well in the correlation index with the humans' assessment pattern with a moderate correlation (Table 7) and has a good AIdx.

The problem with this model is the inconsistency of the assessment of the same task, where it "changed its mind" 19 times out of 45. It would be interesting to understand if that inconsistency has to do with the quantisation applied by Together AI, the API provider used. • Mixtral 8x22B and Mistral Large fared similarly with patterns quite dissimilar to the human's and AIdx which are pretty decent (similar to Llama's). The correlation of Mistral Large with the human pattern of evaluation, when Criterion 3 is removed, is the second highest, thus giving reasons to follow it closely and keep it in the test pool. • Qwen2 72B, an open LLM, would have been by far the best LLM overall (and Mistral Large would have been the second) if it weren't for Criterion 3. Criterion 3 posed a grave problem for Qwen2 both from the assessment pattern and from the AIdx point of view (Fig. 2 and Fig. 3). • Criterion 3 (Definition of learning objectives), in a larger part, and Criterion 5 (Critical reflection on redesign), in a smaller part, appear to be the most discriminative criteria in terms of capturing what makes the human evaluation pattern unique for this assessment task (Fig. 2 and Fig. 3). These criteria likely involve nuances and complexities in judgment that are particularly human-like and challenging for LLMs to capture accurately, or the authors might have failed to provide all the relevant contextual information regarding these criteria to LLMs. This last hypothesis seems relevant because, in the previous iteration of the study, this same criteria was the easiest one for LLM to assess in a human-like manner [10].

Based on the available data, it appears that the more suitable LLMs for the assessing students' authentic tasks using an assessment rubric are Claude 3.5 Sonnet and GPT-4o. That is because they fared well both on the assessment pattern (PCA and correlations) and in the agreement index (magnitude and consistencies of scores). On the other hand, Gemini 1.5 Pro is the one that had by far the most human-like assessment pattern, but fell short on the AIdx, attributing marks that were very different from the humans'.

Qwen2 7B and Llama 3.1 70B deserve a mention as they are open models, and if not for some flaw would have been at the level (or better than) the aforementioned proprietary models. Llama has a problem of inconsistency of the marks assigned for each criterion, while Qwen2 really just got one criterion very wrong. It might be useful to know that for both of them, Together AI (https://www.together.ai/) was used as an API provider. It applies quantisation of Floating Point 8-bit (FP8) for Llama 3.1 70B Instruct Turbo, while Qwen2 72B Instruct is run at full-precision Floating Point 16-bit (FP16).

Human evaluators have a pattern of evaluation (see the PCA) that can be usually distinguished from the LLMs' one, but Gemini 1.5 Pro, if not for its very different score attribution, has very similar patterns. It is interesting to note that human evaluators among themselves have different score attributions (see Table 5 and Figure 3), but as for critical criteria they assess similarly.

All considered, presently, none of the LLMs can be used for autonomous evaluation for all criteria, especially regarding the more complex and the less contextualised ones. This confirms what Webb [33] highlighted. However, Claude 3.5 Sonnet, GPT-4o and, with some caution, Qwen2 72B Instruct have the potential to be used as solid support for evaluation for the summative evaluation level as described in the AI-MAAS (AI-Mediated Assessment for Academics and Students) model [11].

Conclusions

The fundamental question of this study was whether and which current Large Language Models (LLMs) can be used by university educators (but it applies to other educators and instructors, too), even those without technical experience, to assess student-written authentic products in the presence of open tasks and questions using assessment rubrics. Indeed, using these technologies could make assessment more sustainable and scalable, allowing for more consistent alignment with declared learning objectives. This study has allowed us to determine that Claude 3.5 Sonnet, GPT-4o and, with some caution, Qwen2 72B Instruct have the potential to be used as solid support for summative evaluation. According to this study, the use of LLMs can be beneficial, but only if they are used under proper supervision. They should be seen as assistance for university educators and not as a substitute for assessments. The available data does not indicate that they are reliable enough to perform assessments independently, even if they are getting close to it. In fact, some criteria that is too complex or needs additional information about the context or specific subject can be evaluated in a way that is not in line with human assessment. This finding confirms the guidelines as stated by Miao et al. [31] and Webb [33]. The limitations of the present study lie in the sample size of student products that need to be significantly increased, as well as the number of human expert evaluators and the disciplines involved in the tests. The assessment rubric can also be optimised and, especially for the most critical criteria (such as Criterion 3), it would be important to experiment on its formulation to understand if it could have been a human error in defining the criteria that made it difficult to interpret by the LLMs. The idea behind this study is that it should be expanded and updated on a rolling basis to adjust the discussion and bring useful novelties into the assessment practice. Future evolutions of the study might include multi-shot prompting and the evaluation of textual feedback and assessment to tasks. Feedback that could be provided during the assessment for each of the criteria provided in a rubric deserve particular exploration [11,32,9,42].

**How to respond to requests:** * Do not express personal opinions or subjective judgements. * Focus exclusively on the criteria provided in the rubric. * Provide a fair and impartial assessment based on the task's adherence to the criteria. * Carefully review the student's entire paper before beginning the assessment. * Offer constructive suggestions as to how the student might improve. * Uses clear and concise language. * Justify the marks awarded with specific references to the paper and the rubric. * In your assessment, take into account that the students only had 2 hours for planning. **Request format Each request will include: * **The student's assignment:** The text of the assignment you are to assess. * **The grading rubric:** A list of criteria with descriptions for each grade level. **Response format:** Your answer should follow this format: **Title of the paper (also called title of the paper) as it appears in the document: [insert title here]**. **Total score:** [Insert total score here]. **Scoring breakdown:** | Criterion | Score | Comments |-|-|-| | [Criterion 1] | [Score] | [Comments with specific examples from the task] | | [Criterion 2] | [Score] | [Comments with specific examples from the task] | | [Criterion 3] | [Score] | [Comments with specific examples from the task] | | ... | ... | ... | **Suggestions for improvement * [Suggestion 1] * [Suggestion 2] * ... **Answer following the answer format provided above.**

Criterion 5 -5applications not always appropriate. -Sufficient (award 2 points): Basic understanding with some uncertainties in application. -Good (awarded 3 points): Correct application and good understanding. -Excellent (awarded 4 points): Thorough understanding and innovative and relevant application. Criterion 2 -Selection and implementation of teaching strategies: -Insufficient (award 1 point): Limited or not always adequate strategies. -Sufficient (award 2 points): Relevant strategies but implementation can be improved. -Good (award 3 points): Strategies appropriate and related to the objectives. -Excellent (award 4 points): Highly effective, diverse and well adapted strategies. Criterion 3 -Definition of learning objectives: -Insufficient (award 1 point): Vague or non-measurable objectives. -Sufficient (award 2 points): Objectives present but not very specific. -Good (award 3 points): Well-defined and generally aligned objectives. -Excellent (award 4 points): Clear, specific, measurable and perfectly aligned objectives. Criterion 4 -Detailed scanning of the intervention -Insufficient (award 1 point): Incomplete or unclear scan. -Sufficient (award 2 points): Scan present but can be improved in structure. -Good (awarded 3 points): Clear and well-structured scan. -Excellent (awarded 4 points): Detailed, logical and well-structured scanning. Critical reflection on redesign: -Insufficient (award 1 point): Lack of critical reflection or superficial justifications. -Sufficient (award 2 points): Reflection present but not very thorough. -Good (awarded 3 points): Good reflection with clear connections. -Excellent (award 4 points): Deep and critical reflection, clear justifications. <end of assessment rubric> **Total score:** **Scoring distribution:**

Figure 1 :1Figure 1: PCA Path Diagram. The diagram shows two main components (RC1 and RC2) and their relationships with different evaluators. RC1 represents the "LLM Evaluation Pattern" while RC2 represents the "Human Evaluation Pattern".

Figure 2 :2Figure 2: Radar graph of the LLMs' AIdx (transformed in percentage) with the average human for each criterion.

Figure 3 :3Figure 3: Radar graph of the LLMs' and Humans' AIdx (transformed in percentage) with the average human for each criterion.

Table 11Rubric for the Assessment of the Educational InterventionAssessmentInsufficient Level (1Sufficient Level (2GoodLevel(3Excellent Level (4Criteriapoint)points)points)points)

3. Claude 3.5 Sonnet: developed by Anthropic, excels in the ability to understand nuanced language, It demonstrates top-tier performance in handling sophisticated reasoning challenges, making it a robust tool for both natural language processing and technical tasks. Link: https://mistral.ai/ news/mistral-large/. 5. Open Mixtral 8x22B (2404): is one of the latest model developed by Mistral, featuring a sparse Mixture-of-Experts (SMoE) architecture. Despite its large size, with 141 billion parameters, only 39 billion parameters are actively engaged during processing, optimising both performance and cost efficiency. This approach sets new standards in the AI community for balancing model complexity with computational resource usage. Link: https://mistral.ai/news/mixtral-8x22b/. 6. Llama 3.1 70B Instruct Turbo: developed by Meta, is a 70-billion parameter language model designed for instruction-following tasks. The model is optimised to improve interactions where clear guidance or step-by-step reasoning is required, positioning it as an effective tool for applications in both academic and practical domains. Link: https://ai.meta.com/blog/meta-llama-3-1/ 7. Qwen2 72B Instruct: developed by Alibaba Cloud, is a 72-billion parameter language model optimised for instruction-based tasks. It integrates the latest advancements in generative AI, offering improved efficiency in tasks ranging from conversational AI to complex text generation and reasoning. Its design caters specifically to high-performance needs in both commercial and research applications.humour, and complex instructions. It is designed to generate high-quality content in a relatable, natural tone, showing marked improvements in areas such as writing and human-centric communication. Link: https://www.anthropic.com/news/claude-3-5-sonnet. 4. Mistral Large (2402): is designed to excel in complex reasoning tasks, particularly in multilingual contexts. The model is highly effective in text understanding, transformation, and code generation. Link:https://www.alibabacloud.com/en/solutions/generative-ai/qwen?_p_ lc=1

Table 22Context Windows of the used LLMs. The context windows refer to the APIs. Note that this feature may change with updates.

Large

Language Model (versions available in Italy, September 2024) Context Window (in tokens)GPT-4o128,000Claude 3.5 Sonnet200,000Gemini 1.5 Pro Latest2,000,000Mistral Large (2402)32,000Mixtral 8x22B (2404)64,000Meta Llama 3.1 70B Instruct Turbo131,072Qwen2 72B Instruct32,768

Table 33PCA Component LoadingsEvaluatorRC1RC2 Uniquenesse1 (GPT-4o)0.579 0.3270.472e2 (Gemini 1.5 Pro)0.4210.832e3 (Claude 3.5 Sonnet)0.695 0.3120.321e4 (Mistral Large 2402)0.7750.430e5 (Mixtral 8x22B 2404)0.6510.592e6 (Llama 3.1 70B Instruct)0.7430.405e7 (Qwen2 72B Instruct)0.825 -0.3260.336e8 (Merge of 7 LLMs by GPT-4o) 0.8020.301e9 (Human Evaluator 1)0.5150.708e10 (Human Evaluator 2)0.7530.420e11 (Human Evaluator 3)0.6440.604Note. Applied rotation method is promax.

Table 44Average standard deviation of scores assigned to criteria

Criterion Description Average Standard Deviation Percentage of Total Range (1-4)

5Critical reflection on redesign0.7625%2Selection and implementation of0.6421%teaching strategies3Definition of learning objectives0.6120%1Understanding and application of0.5418%teaching architectures4Detailed scanning of the0.5317%interventionAIdx =Average difference + Variability of the difference 2

Table 66Agreement Indices (AIdx) of evaluators compared to the average human evaluator divided by criterionCrit.GPTGeminiClaudeMistralMixtralMetaQwen2MergeBestSecond4o1.5 ProSonnetLarge8x22BLlama72B(GPT-LLMBestLatest3.5(2402)(2404)3.1 70B4o)LLM10.500.410.500.430.410.430.390.47Qwen2Gemini72B1.5 Pro /Mixtral8x22B20.560.640.570.470.550.570.400.55Qwen2Mistral72BLarge30.430.600.430.490.680.620.700.41MergeGPT-4o(GPT-/ Claude4o)Sonnet3.540.370.460.350.370.480.380.330.32MergeQwen2(GPT-72B4o)50.410.630.450.800.440.551.240.62GPT-4oMixtral8x22B

Table 77Pearson correlation coefficients calculated between the average scores per criterion for each Large Language Model (LLM) and the average scores per criterion of human evaluators. Single criteria have been excluded to understand what makes the pattern "human".LLMTotalExcl. Crit.Excl. Crit.Excl. Crit.Excl. Crit.Excl. Crit.12345GPT-4o0.590.450.600.650.740.66Gemini1.5 Pro0.840.790.990.470.820.86LatestClaude Sonnet0.660.540.660.780.710.813.5Mistral Large0.270.160.210.860.210.88(2402)Mixtral 8x22B0.150.240.010.530.060.19(2404)Llama 3.1 70B0.450.340.420.740.470.69Instrict TurboQwen2 72B-0.06-0.20-0.130.88-0.11-0.77Merge (GPT4o)0.580.450.580.750.630.75

Acknowledgments

The authors would like to thank Elena Benini, PhD Student, for her contribution on the assessment of students' authentic tasks. Thanks also to Prof. Massimo Stella for the fruitful discussion about statistical methods. Both of them work at the Department of Psychology and Cognitive Sciences of the University of Trento.

(H. Ballardini) https://webapps.unitn.it/du/en/Persona/PER0247709 (D. Agostini); https:https://webapps.unitn.it/du/en/Persona/PER0242228 (F. Picasso); https://webapps.unitn.it/du/en/Persona/PER0033179

The acceptance and diffusion of generative artificial intelligence in education: A literature review ABaytak 10.46303/cuper.2023.2 Current Perspectives in Educational Research 6 2023 Article 1 Exploring the integration of ChatGPT in education: Adapting for the future SElbanna LArmstrong 10.1108/MSAR-03-2023-0016 Management & Sustainability: An Arab Review 3 2023 ChatGPT has entered the classroom: How LLMs could transform education AExtance 10.1038/d41586-023-03507-3 Nature 623 2023 Adoption of AI ChatBot like Chat GPT in Higher Education in India: A SEM Analysis Approach SRoy VGupta SRay 10.36683/2306-1758/2023-4-46/130-149 Economic Environment 4 2023 Chat-GPT; validating Technology Acceptance Model (TAM) in education sector via ubiquitous learning mechanism NSaif SUKhan IShaheen AAlotaibi MMAlnfiai MArif 10.1016/j.chb.2023.108097 Computers in Human Behavior 108097 2023 What drives students toward ChatGPT? An investigation of the factors influencing adoption and usage of ChatGPT CKTiwari MABhat STKhan RSubramaniam MA IKhan 10.1108/ITSE-04-2023-0061 Interactive Technology and Smart Education 2023 New era of artificial intelligence in education: Towards a sustainable multifaceted revolution FKamalov DSantandreu Calonge IGurrib Sustainability 15 12451 2023 Academic Integrity Considerations of AI Large Language Models in the Post-Pandemic Era: ChatGPT and Beyond MPerkins Journal of University Teaching and Learning Practice 20 2023 ChatGPT in higher education: Considerations for academic integrity and student learning MSullivan AKelly PMclaughlan 10.37074/jalt.2023.6.1.17 Journal of Applied Learning & Teaching 2023 Are large language models capable of assessing students' written products? A pilot study in higher education DAgostini Research Trends in Humanities Education & Philosophy 11 2024 Large language models for sustainable assessment and feedback in higher education DAgostini FPicasso Intelligenza Artificiale 18 2024 Firm Investments in Artificial Intelligence Technologies and Changes in Workforce Composition TBabina AFedyk AXHe JHodson 10.3386/w31325 2023 National Bureau of Economic Research Working Paper 31325 Generative AI to become a $1.3 trillion market by 2032, research finds 2023 GHammond Big tech outspends venture capital firms in AI investment frenzy 2023 When does AI pay off? AI-adoption intensity, complementary investments, and R&D strategy YSLee TKim SChoi WKim 10.1016/j.technovation.2022.102590 Technovation 118 102590 2022 What is AI literacy? Competencies and design considerations DLong BMagerko Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems the 2020 CHI Conference on Human Factors in Computing Systems 2020 Conceptualizing AI literacy: An exploratory review DT KNg JK LLeung SK WChu MSQiao Computers and Education: Artificial Intelligence 2 100041 2021 Developing and validating a multidimensional AI literacy questionnaire: Operationalizing AI literacy for higher education GBiagini SCuomo MRanieri Proceedings of the First International Workshop on High-Performance Artificial Intelligence Systems in Education, AIxEDU 2023 the First International Workshop on High-Performance Artificial Intelligence Systems in Education, AIxEDU 2023

Aachen

2023 Explicating AI literacy of employees at digital workplaces DCetindamar KKitto MWu YZhang BAbedin SKnight 10.1109/TEM.2021.3138503 IEEE Transactions on Engineering Management 71 2024 Evaluating an Artificial Intelligence Literacy Programme for Developing University Students' Conceptual Understanding, Literacy, Empowerment and Ethical Awareness S.-CKong WM .-YCheung GZhang Educational Technology & Society 26 2023 Measuring user competence in using artificial intelligence: Validity and reliability of artificial intelligence literacy scale BWang P.-LPRau TYuan 10.1080/0144929X.2022.2072768 Behaviour & Information Technology 42 2023 Toward an Objective Measurement of AI Literacy PWeber MPinski LBaum PACIS 2023 Proceedings 2023 Guidance for generative AI in education and research 2023 UNESCO, Report A participatory data-centric approach to AI ethics by design AGerdes 10.1080/08839514.2021.2009222 Applied Artificial Intelligence 36 2022 Coping with vulnerability: The effect of trust in ai and privacy-protective behaviour on the use of ai-based services CJang 10.1080/0144929X.2023.2246590 Behaviour & Information Technology 2023 When AI Meets Information Privacy: The Adversarial Role of AI in Data Sharing Scenario AMajeed SOHwang 10.1109/ACCESS.2023.3297646 IEEE Access 11 2023 Generative AI meets copyright PSamuelson 10.1126/science.adi0656 Science 381 2023 Academic integrity in the age of Artificial Intelligence (AI) authoring apps MAYeo 10.1002/tesj.716 TESOL Journal 14 e716 2023 AI-generated text detectors: Do they work? VVan Oijen 2023 Testing of detection tools for AI-generated text DWeber-Wulff AAnohina-Naumeca SBjelobaba TFoltýnek JGuerrero-Dib OPopoola PŠigut LWaddington 10.1007/s40979-023-00146-z International Journal for Educational Integrity 19 26 2023 FMiao WHolmes HRonghuai ZHui AI and education: Guidance for policy-makers UNESCO 2023 Technical Report ESabzalieva AValentini ChatGPT and artificial intelligence in higher education: Quick start guide UNESCO 2023 Technical Report A Generative AI Primer MWebb 2023 National Centre for AI Technical Report New principles on use of AI in education RussellGroup 2023 Gtnum Intelligence artificielle et éducation: Apports de la recherche et enjeux pour les politiques publiques 2023 MACardona RJRodríguez KIshmael Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations 2023 Technical Report Using generative AI (GenAI) in learning and teaching Ucl 2023 Assessment in the age of artificial intelligence ZSwiecki HKhosravi GChen RMartinez-Maldonado JMLodge SMilligan NSelwyn DGašević 10.1016/j.caeai.2022.100075 Computers and Education: Artificial Intelligence 3 100075 2022 Exploring new depths: Applying machine learning for the analysis of student argumentation in chemistry PPMartin DKranz PWulff NGraulich 10.1002/tea.21903 Journal of Research in Science Teaching 2023 ChatGPT for good? On opportunities and challenges of large language models for education EKasneci KSeßler SKüchemann MBannert DDementieva FFischer Learning and Individual Differences 103 102274 2023 A review of the literature from 1970 to 2022 on the roles of teachers and artificial intelligence in the field of AI in education ALepage NRoy 10.52358/mm.vi16.304 Médiations et Médiatisations 16 2023 ATamkin MBrundage JClark DGanguli arXiv:2102.02503 Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models 2021 arXiv preprint Teaching English in the Age of AI: Embracing ChatGPT to Optimize EFL Materials and Assessment OKoraishi Language Education and Technology 3 2023 Article 1 Supporting self-directed learning and selfassessment using TeacherGAIA, a generative AI chatbot application: Learning approaches and prompt engineering FAli DChoy SDivaharan HYTay WChen 10.1080/23735082.2023.2258886 Learning: Research and Practice 9 2023 A Systematic Review of AI-Driven Educational Assessment in STEM Education FOuyang TADinh WXu 10.1007/s41979-023-00112-x Journal for STEM Education Research 6 2023 Enhancing teaching through constructive alignment JBiggs 10.1007/BF00138871 Higher Education 32 1996