<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Large Language Models for the Assessment of Students&apos; Authentic Tasks. A Replication Study in Higher Education</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Daniele</forename><surname>Agostini</surname></persName>
							<email>daniele.agostini@unitn.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Psychology and Cognitive Sciences</orgName>
								<orgName type="institution">University of Trento</orgName>
								<address>
									<addrLine>Corso Bettini, 84</addrLine>
									<postCode>38068</postCode>
									<settlement>Rovereto</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Federica</forename><surname>Picasso</surname></persName>
							<email>federica.picasso@unitn.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Psychology and Cognitive Sciences</orgName>
								<orgName type="institution">University of Trento</orgName>
								<address>
									<addrLine>Corso Bettini, 84</addrLine>
									<postCode>38068</postCode>
									<settlement>Rovereto</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Helga</forename><surname>Ballardini</surname></persName>
							<email>helga.ballardini@unitn.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Psychology and Cognitive Sciences</orgName>
								<orgName type="institution">University of Trento</orgName>
								<address>
									<addrLine>Corso Bettini, 84</addrLine>
									<postCode>38068</postCode>
									<settlement>Rovereto</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Large Language Models for the Assessment of Students&apos; Authentic Tasks. A Replication Study in Higher Education</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">CB74E1CC12FDF748AD3789BB405D6045</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:01+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models (LLMs)</term>
					<term>AI-Assisted Assessment</term>
					<term>Rubrics</term>
					<term>Authentic Tasks</term>
					<term>Academic Assessment</term>
					<term>† D. Agostini: Conceptualisation, Methodology, Investigation, Formal Analysis, Writing - original draft, Writing - review &amp; editing, Resources, Supervision</term>
					<term>F. Picasso: Investigation, Writing - original draft, Data curation</term>
					<term>H. Ballardini: Investigation, Writing - review &amp; editing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>After the public release of ChatGPT (November 30th, 2022) and, subsequently, of its competitors, the use of Large Language Models (LLMs) has become widespread among the public. From the very beginning, the most significant impact was perceived in the field of Education and Instruction [1,<ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. Of particular interest for this paper is their use by both teachers and students in the context of higher education <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b8">9]</ref>. The immediacy with which LLMs have been integrated into higher education practices, by teachers and students alike, raises questions of fundamental importance about their effectiveness and reliability. In this field, LLMs give teachers the opportunity to revolutionise interaction with students, workload management and the personalisation of each learning experience <ref type="bibr" target="#b1">[2]</ref>. Although these technologies are recognised as having advantages and potential for improving learning in terms of accessibility and personalisation [7], a crucial question concerns their application in assessment practices, especially their ability to evaluate students' performance objectively and impartially. The possibilities of using these tools in the field of learning evaluation are still relatively little known, which implies the need to delve deeper into the topic for its application both in pedagogical theory and in educational practice. 
A previously published study <ref type="bibr" target="#b9">[10]</ref> explored the use of the main LLMs in the specific context of assessing students' papers; the present work is a replication study based on it. The purpose of the current study is to explore the possible use of the main LLMs in the specific context of evaluating students' written productions, with a focus on their accuracy when assessing with the help of a rubric proposed by the teacher. This article is part of a series of contributions that focus on this topic, in light of the principles and application of the AI-Mediated Assessment for Academics and Students (AI-MAAS) model <ref type="bibr" target="#b10">[11]</ref>.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">The Context: AI Assessment in Higher Education</head><p>In the last two years, Large Language Models (LLMs) have taken on a very significant role in the technological landscape thanks to the launch of ChatGPT, followed by the release of competing models. The impact of LLMs remained relatively limited until increasingly simple and intuitive user interfaces were introduced, first of all the "chat" interface, which brought the general public closer to these tools. This phenomenon of "democratisation" boosted the commercial and large-scale use of LLMs, which led institutions, companies and individuals to increase investment in this sector <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>. OpenAI's ChatGPT, Anthropic's Claude, Microsoft's Copilot and Google's Gemini are just some of the most used LLMs, alongside the much more numerous open-source models to which Meta's LLAMA has given a notable boost. At the same time, however, this has led to a crisis in search engines, since LLMs, without requiring advanced research skills, offer new ways of querying and analysing data, more natural interaction and sufficiently precise and exhaustive answers. For example, LLMs allow users to avoid various inconvenient steps that characterise the standard use of search engines, such as sifting through long lists of websites, accepting cookies and dismissing advertising banners. As a result, educational institutions and agencies have begun to incorporate LLMs and generative AI into their curricula at various levels, developing courses to harness the potential of these innovative technologies. 
There is currently a strong emphasis on AI Literacy <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>, which allows professionals from different sectors, including educational institutions, to deepen their understanding of the fundamental elements of generative AI, the availability of tools, and the functionalities and methods of use that make LLMs effective tools in all fields <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23]</ref>. However, one of the most critical issues concerns information management: LLMs possess enormous potential due to their ability to analyse and generate data, and this raises numerous questions about accuracy, privacy and ethics in information management and about ownership of output. The challenge in this continuously and rapidly evolving field is to maintain constant attention and critical evaluation so that end users always use LLMs responsibly <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b26">27]</ref>. In relation to this issue, higher education institutions have reacted defensively, so much so that some universities, in order to counteract the possible use of LLMs by students during exams, have reintroduced mandatory handwritten and oral tests <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b27">28]</ref>. At the same time, software created specifically to detect LLM-generated texts was introduced on the market. 
However, these tools turned out to be ineffective, causing management and legal problems for institutions because students could be unfairly accused of submitting texts generated by artificial intelligence <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b29">30]</ref>. To avoid such problems, national and international institutions and universities promptly adopted guidelines that promote ethical behaviour in the use of LLMs while maintaining a certain caution, allowing students and teachers to use them effectively to carry out tasks and benefit institutions. Important international bodies and universities moved in this direction, such as UNESCO <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32]</ref>, the JISC National Center for AI <ref type="bibr" target="#b32">[33]</ref>, the Russell Group <ref type="bibr" target="#b33">[34]</ref>, the French National Ministry of Education <ref type="bibr" target="#b34">[35]</ref>, the US Department of Education <ref type="bibr" target="#b35">[36]</ref> and University College London <ref type="bibr" target="#b36">[37]</ref>. Assessment tasks are arguably among those that can profit most from AI technology, especially in terms of sustainability. However, caution is needed, as LLMs without specific task adaptations have proven unreliable and incapable of managing students' assessments independently <ref type="bibr" target="#b37">[38,</ref><ref type="bibr" target="#b32">33]</ref>, while LLMs supported by assessment tools have been shown to produce satisfactory results <ref type="bibr" target="#b38">[39]</ref>. 
Above all, the use of artificial intelligence by students requires that teachers know how to take ethical aspects into consideration and act responsibly when evaluating tasks and tests whose results could have a great impact on students' careers (for example, motivation, grades, scholarships, acceptance into master's or doctoral programs).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Theoretical Framework</head><p>Since the 1980s, the idea of using computerised systems (and now also artificial intelligence) to assist educators in their assessment tasks and to help them make precise, impartial and informed decisions has been present in much of the literature <ref type="bibr" target="#b39">[40,</ref><ref type="bibr" target="#b40">41]</ref>. The possibility of using LLMs for learning assessment had already been explored in the period immediately preceding the release of ChatGPT, when the use of transformer models, including OpenAI's GPT-3, was already well established. Tamkin et al. <ref type="bibr" target="#b41">[42]</ref> emphasised their educational applications, which included:</p><p>• Summary: LLMs are able to summarise even very long texts. This use can help students submit concise summaries. Furthermore, various parameters can be considered for the synthesis, and this supports educators in providing precise information on the elements of the text that will be evaluated. • Questions and Answers: LLMs can "understand" various portions of text, answer questions, and ask questions when required. These features are useful for providing interactive feedback and learning experiences. • Classification: LLMs can classify text into predefined categories, which makes it possible to introduce assisted assessment or to classify students' feedback.</p><p>• Plagiarism detection: by comparing the similarity between different texts, LLMs are very useful for detecting potential cases of plagiarism among students or for identifying the misuse of original materials by students. • Assessment of knowledge: LLMs can assess students' understanding of a topic based on their written productions, especially if the information is generated from correct homework and with the help of an assessment rubric to refer to.</p><p>These five applications are fundamental to using LLMs in learning assessment. 
Following the introduction of ChatGPT and other universally accessible LLMs, UNESCO published the guidelines "AI and education: Guidance for policy-makers" <ref type="bibr" target="#b30">[31]</ref>, which make the following recommendations for learning assessment:</p><p>1. Test and implement AI technologies to support the assessment of various dimensions of skills and outcomes. 2. Use caution when using an automated assessment with closed-ended, rule-based questions. 3. Use AI-generated formative assessment as an integrated feature of learning management systems (LMS) to analyse learning outcomes more accurately and efficiently and reduce the risk of human bias. 4. Use the ability to provide AI-powered progressive assessments to regularly update students and parents. 5. Examine and evaluate the use of facial recognition and other artificial intelligence capabilities for users' recognition and their tracking in remote online assessments.</p><p>Based on these different theoretical approaches, recommendations and guidelines, the AI-Mediated Assessment for Academics and Students (AI-MAAS) model was developed and is currently under validation. It proposes two potential implementations of LLMs for the assessment of learning: the first for formative evaluation and the second for summative evaluation <ref type="bibr" target="#b10">[11]</ref>. In both cases, the selected LLM must be able to evaluate using an assessment rubric provided by the teachers or by the students. Given the novelty of the tool, there have so far been few experiments in this field. Martin et al. <ref type="bibr" target="#b38">[39]</ref> explored this opportunity, starting from the need to assign students complex tasks that involve a certain degree of reasoning, abstract conceptualisation and reprocessing of information, while the correction of such tasks (with a large quantity of long open answers) often proves unsustainable for teachers. 
These researchers demonstrated that LLMs can be used, for example, in the evaluation of a chemistry task: in this case, an almost perfect match was obtained between the scores assigned by human raters and the scores generated by the LLMs. It should be highlighted, however, that Martin and colleagues did not simply use an LLM to achieve this result. The researchers used a complex procedure that involved, among other operations, the unsupervised machine learning technique HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), cluster mapping and the training of a deep neural network classifier. The aim of that study was to test an operational model and demonstrate its feasibility. This excellent solution is the result of models trained on specific tasks and populations; therefore, it cannot be assumed that the procedure can be replicated by any teacher not specialised in Machine Learning. Other studies have instead used LLMs for assessment purposes without comparing the performance of the AI with that of a teacher. These studies, applied for example to the evaluation of L2 English tasks <ref type="bibr" target="#b42">[43]</ref> and to the support of self-assessment, reached satisfactory results <ref type="bibr" target="#b43">[44]</ref>. Machine learning has also been applied to the evaluation of tasks related to STEM disciplines, but without using LLMs <ref type="bibr" target="#b44">[45]</ref>.</p><p>Finally, a previous study <ref type="bibr" target="#b9">[10]</ref> explored the use of the main LLMs in the specific context of assessing students' papers, with a focus on their accuracy in assessing according to a rubric developed by the teacher. The idea was that employing LLMs for assessment in higher education can enable the adoption of teaching and assessment approaches that were previously unsustainable and unscalable. 
This should help to ensure constructive alignment <ref type="bibr" target="#b45">[46]</ref> and thereby improve the quality and effectiveness of university teaching. The study, aimed at selecting the most human-like evaluation amongst LLMs, highlighted that while some AI models, like ChatGPT-4 and Claude 2, performed well in most of the assessment criteria, others, such as Microsoft Copilot and Google Bard, were far from human-like assessment. The article recommends further research on ChatGPT and Claude, with potential inclusion of open-source models as well as involving multi-shot prompting, expanding the student sample, involving more evaluators, and refining and redesigning the rubrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology and Tools</head><p>This is a replication study of "Are Large Language Models Capable of Assessing Students' Written Products? A Pilot Study in Higher Education", published in "Research Trends in Humanities Education &amp; Philosophy, 11" <ref type="bibr" target="#b9">[10]</ref>. It follows the same methodology, with updated LLMs, a larger sample of students and more human evaluators. It explores the use of leading Large Language Models (LLMs) in the specific context of assessing students' written products, focusing on their precision and ability to evaluate according to a grading rubric developed by the teacher. The goal is to understand whether, and which, models can be used by university and non-university educators who are not experts in Machine Learning to assess students' written products, even in the presence of open-ended tasks and questions, thanks to grading rubrics. The study was conducted at the University of Trento within the context of a university habilitation course for secondary school teachers, during the module concerning learning methodologies. One hundred and forty-two students participated anonymously, divided into 35 groups, along with 3 evaluating teachers, experts in experimental pedagogy and assessment. No data regarding the students' demographics were collected. The groups were asked to carry out an authentic task, namely redesigning a past educational intervention that had proved unsuccessful, targeted at a specific class (which could range from 1st grade of lower secondary school to 5th grade of higher secondary school, depending on the group composition). They were instructed to identify the past teaching approaches and strategies and to devise different ones better suited to reaching the intended learning outcomes. Students' reflection and redesign abilities were also evaluated through the reference rubric. 
To complete this task, groups were given two hours and thirty minutes, together with a template for the educational design consisting of the following sections: Involved Disciplines, Class and Grade Level, Intervention Title, Teacher, Programme and Learning Objectives, Context and Environment (formal, informal, type of setting, etc.). Moreover, a description of the reflection process applied in renewing the formative design was required and was itself subject to evaluation. In the detailed schedule section, groups were asked to explain the programming with concise descriptions of the various educational activities, the teacher's tasks, and those of the students. Within this framework, groups had the freedom to propose their original programming. The final product of each group is thus an MS Word file containing the programming of the educational intervention according to the described template. For the evaluation of the products, the following grading rubric (Table <ref type="table" target="#tab_0">1</ref>) was prepared, consisting of five evaluation criteria with four levels for each criterion.</p>
<figure type="table"><head>Table 1. Grading rubric: five evaluation criteria with four levels each.</head><table>
<row role="label"><cell>Criterion</cell><cell>Level 1</cell><cell>Level 2</cell><cell>Level 3</cell><cell>Level 4</cell></row>
<row><cell>Understanding and applying educational architectures</cell><cell>Demonstrates a limited understanding of educational architectures, with applications not always appropriate or consistent.</cell><cell>Shows a basic understanding of educational architectures, applying them generally correctly but with some uncertainties.</cell><cell>Applies educational architectures correctly, with a good understanding of their use in the specific context.</cell><cell>Demonstrates a thorough understanding of educational architectures, applying them in an innovative and contextually relevant manner. Clearly justifies the choices made.</cell></row>
<row><cell>Selection and implementation of teaching and learning strategies</cell><cell>The teaching strategies chosen are limited or not always appropriate for the objectives of the intervention.</cell><cell>Uses some relevant teaching strategies, but their implementation could be more targeted or diversified in relation to the intervention goals.</cell><cell>Selects and implements appropriate teaching strategies with a good correlation to the intervention goals.</cell><cell>Selects and implements highly effective and diversified teaching strategies, perfectly adapted to the objectives and context of the intervention.</cell></row>
<row><cell>Definition of the intended learning outcomes</cell><cell>The objectives are vague, not measurable or not aligned with the chosen teaching architectures and strategies.</cell><cell>The objectives are present but could be more specific or better aligned with the teaching architectures and strategies.</cell><cell>The objectives are well-defined and generally aligned with the chosen teaching architectures and strategies.</cell><cell>The objectives are clear, specific, measurable and perfectly aligned with the chosen teaching architectures and strategies.</cell></row>
<row><cell>Detailed scanning of the intervention</cell><cell>The scan is incomplete, unclear, or lacks a logical progression of activities.</cell><cell>The scan is present but could be more detailed or better structured in some parts.</cell><cell>The scan is clear and generally well-structured, with a good progression of activities.</cell><cell>The scan is detailed, logical and well-structured, with a clear progression of activities and realistic timeframes.</cell></row>
<row><cell>Critical reflection on the redesign process</cell><cell>There is a lack of critical reflection on the changes made or the justifications are superficial.</cell><cell>Includes some reflection on the changes, but the analysis could be more thorough.</cell><cell>Provides good reflection on the changes, with clear links to the learning objectives.</cell><cell>Provides deep and critical reflection on the changes made, clearly justifying each choice in relation to the learning objectives.</cell></row>
</table></figure>
<p>Three expert human evaluators and seven LLMs (plus one that merged the feedback of all seven models into a single assessment) evaluated all the student groups' products. The LLMs selected for this study were the most popular competing models at the time, and they were applied in the assessment process through big-AGI (https://big-agi.com/). Big-AGI is an AI suite created to make advanced artificial intelligence accessible; it was chosen for the ease of adding several models through their APIs, the possibility of setting system prompts and the function (called "beam") for sending the same prompt to several LLMs at the same time. Human results were then compared with the results produced by the LLMs through various statistical analyses (see the Method of Analysis section). The models used are:</p><p>1. GPT-4o: released in May 2024, GPT-4o is a multilingual, multimodal generative pretrained transformer developed by OpenAI. The model is capable of processing and generating text, images, and audio, making it a versatile tool for a wide range of tasks. Its multimodal capabilities enable a deeper integration of different data formats, enhancing its utility in complex applications. Link: https://openai.com/index/hello-gpt-4o/ 2. Gemini 1.5 Pro Latest: a large language model developed by DeepMind (Google) that is natively multimodal and supports an extended context window of up to two million tokens, which is currently the longest of any large-scale foundation model. This expansion in token capacity allows for processing more extensive sequences of data, thereby increasing its utility in tasks that require long-term contextual understanding. Link: https://deepmind.google/technologies/gemini/pro/.</p><p>All these LLMs can "understand" and write in Italian, but it cannot be ruled out that performance in English may be different (presumably better, since most of the training is done in that language). Mistral's models were added for their specific training with European languages that renders them "natively fluent in English, French, Spanish, German, and Italian, with a nuanced understanding of grammar and cultural context". Privacy should not be a concern, since no data is saved to LLM provider servers thanks to our use of the API on a local instance of big-AGI. All the conversations are saved only locally.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Prompting</head><p>This study aimed to understand which models could be used by university educators (and, potentially, other educators) to assess students' products. For this reason, overly sophisticated prompting techniques were not used; instead, the prompts reflect what an educator might do: provide clear instructions and the context data necessary for evaluation. The LLMs were prompted with the following instructions (originally written in Italian). The first one is the System Prompt:</p><p>You are an experienced and impartial university lecturer. Your job is to assess the quality of student assignments according to a specific assessment rubric.</p><p>The second is the prompt that was given to the LLMs to assess the products (originally written in Italian):</p><p>Evaluate the attached teaching design (student task) that was created by a group of students from the secondary school teaching qualification course. The key competence of this assignment lay in being able to design a teaching intervention that makes effective use of teaching architectures and strategies. In particular, the group's competence in terms of redesign and depth of reflection is taken into account with respect to previous instructional design. At the same time, the instructional design had to prove effective in achieving the goals they set themselves. Take into account that the students only had 2 hours to design. Use the evaluation rubric below to assess: &lt;Starting teaching design evaluation rubric&gt; Evaluation rubric:</p><p>The prompt was sent simultaneously to all the LLMs involved. Through big-AGI, the authentic task document was attached in PDF format. A zero-shot prompting procedure was used for all LLMs, meaning that no examples of human task assessments were given to the models. 
A university instructor could provide an example to enhance the quality of LLM assessments; however, the goal in this instance was to choose the most suitable models for this type of evaluation, not to find methods for optimising the results. Finally, an "eighth" LLM evaluator was added, which uses the "Beam" function of the big-AGI software. All the answers from the seven LLMs were sent to GPT-4o for consideration and synthesis, resulting in an eighth assessment that takes into account the feedback from all seven LLMs.</p></div>
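<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the zero-shot procedure described above, the following minimal sketch assembles the two prompts and the synthesis request for the "eighth" evaluator. It is not the authors' code: the Message class, the role names, both function names and the wording of the merge instruction are assumptions introduced here, and no real API call is made.</p>
<p>
```python
from dataclasses import dataclass

# English rendering of the system prompt quoted in the Prompting section.
SYSTEM_PROMPT = (
    "You are an experienced and impartial university lecturer. "
    "Your job is to assess the quality of student assignments "
    "according to a specific assessment rubric."
)


@dataclass
class Message:
    role: str      # "system" or "user" (hypothetical chat-style roles)
    content: str


def build_zero_shot_messages(task_prompt, rubric_text):
    """Assemble a zero-shot conversation: no worked human assessments are included."""
    user_content = f"{task_prompt}\n\nEvaluation rubric:\n{rubric_text}"
    return [Message("system", SYSTEM_PROMPT), Message("user", user_content)]


def build_merge_prompt(answers_by_model):
    """Build the synthesis request for the 'eighth' evaluator.

    The instruction wording is an assumption; the paper only states that the
    seven answers were sent to GPT-4o for consideration and synthesis.
    """
    parts = [f"[{name}]\n{answer}" for name, answer in sorted(answers_by_model.items())]
    header = "Consider the following assessments and synthesise them into a single one."
    return header + "\n\n" + "\n\n".join(parts)
```
</p>
<p>The same list of messages would be sent unchanged to every model, which is what keeps the comparison between models fair.</p></div>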
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Models' Settings</head><p>To set a common length limit for every model's answers, all models were capped, through the big-AGI API controls, at a maximum of 8128 tokens. The temperature was set to 0.2, which should ensure fairly strict adherence to the instructions while leaving some room for creativity in the answers.</p></div>
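<div xmlns="http://www.tei-c.org/ns/1.0"><p>The shared settings can be pictured as a single configuration object applied identically to every model. The sketch below is only illustrative: the field names and the use of the model names as dictionary keys are assumptions, and big-AGI exposes these controls through its interface rather than through this code.</p>
<p>
```python
from dataclasses import dataclass, asdict

# Model names as listed in the paper; treating them as identifiers is an assumption.
MODELS = [
    "GPT-4o",
    "Gemini 1.5 Pro Latest",
    "Claude 3.5 Sonnet",
    "Mistral Large 2402",
    "Open Mixtral 8x22B (2404)",
    "Qwen2 72B Instruct",
    "Llama 3.1 70B Instruct Turbo",
]


@dataclass(frozen=True)
class GenerationSettings:
    temperature: float = 0.2       # low value: close adherence to instructions
    max_output_tokens: int = 8128  # common cap on the length of each answer

    def __post_init__(self):
        # Basic sanity checks only; provider-specific limits are not modelled.
        if 0 > self.temperature:
            raise ValueError("temperature must not be negative")
        if 1 > self.max_output_tokens:
            raise ValueError("max_output_tokens must be positive")


def settings_per_model(settings):
    """Return one identical settings dict per model so all runs are comparable."""
    return {name: asdict(settings) for name in MODELS}
```
</p></div>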
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Attention to Tokens and Context</head><p>Understanding tokens and context is crucial when using a Large Language Model (LLM). Tokens can be simplified as units of text that might consist of a word, part of a word, or even a single character. The characteristics of tokens can vary between models. However, it is generally safe to assume that, on average, English might require one to one and a half tokens per word, and Italian might need one and a half to two tokens per word. The context window, another essential concept, represents the number of tokens a language model can consider simultaneously when generating responses. This context depends on the model used and the available memory. Exceeding a model's context window in a single prompt could cause errors, while in a more extended conversation the model might start ignoring the earlier parts of the dialogue to make room for more recent inputs. Therefore, preserving context is vital for generating coherent and relevant responses. It is important to note that not only the user's prompts but also the model's responses consume context. To preserve the context window, some LLM platforms impose a limit on the length of the prompts that can be sent and of the generated responses, both of which are shorter than the maximum context window. Compared with the original study, the context windows in this replication are decidedly wider than those found in LLMs one year earlier, posing less of a threat to the coherence of the assessment. Table <ref type="table" target="#tab_2">2</ref> reports the maximum context window size for each of the models used.</p></div>
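<div xmlns="http://www.tei-c.org/ns/1.0"><p>The rule of thumb above can be turned into a rough budget check. The sketch below estimates a token range from a whitespace word count and verifies that the pessimistic estimate plus the answer cap fits a given context window. The function names are hypothetical, and a real tokenizer would give different counts, so this is only an approximation.</p>
<p>
```python
import math

# Tokens-per-word ranges taken from the rule of thumb in the text.
TOKENS_PER_WORD = {"en": (1.0, 1.5), "it": (1.5, 2.0)}


def estimate_tokens(text, language):
    """Return a (low, high) token estimate from a whitespace word count."""
    low, high = TOKENS_PER_WORD[language]
    words = len(text.split())
    return math.floor(words * low), math.ceil(words * high)


def fits_context(prompt_tokens_high, context_window, max_output_tokens=8128):
    """Check that the pessimistic prompt estimate plus the answer budget fits.

    Both the prompt and the model's response consume the context window.
    """
    return context_window >= prompt_tokens_high + max_output_tokens
```
</p>
<p>For instance, a 3000-word Italian document would be estimated at 4500 to 6000 tokens, to which the 8128-token answer budget must be added before comparing with the context window of the chosen model.</p></div>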
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Method of Analysis</head><p>The analysis examined the levels assigned by each evaluator (both LLMs and humans) to the various criteria of the rubric for each of the 35 group products. Each evaluator assigned a level to each of the five criteria for every product, for a total of 175 ratings per evaluator. Several statistical techniques were employed to extract insights from the data, including Principal Component Analysis (PCA), analysis of standard deviation, and the creation of a disagreement index among evaluators. Microsoft Excel and JASP (based on R) were used for the statistical analyses.</p></div>
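<div xmlns="http://www.tei-c.org/ns/1.0"><p>The arithmetic and two of the measures named above can be sketched as follows. The text does not define the disagreement index, so the mean absolute pairwise difference used here is an assumption, as is the choice of the population standard deviation; the function names are hypothetical.</p>
<p>
```python
from itertools import combinations
from statistics import pstdev

N_PRODUCTS = 35
N_CRITERIA = 5
RATINGS_PER_EVALUATOR = N_PRODUCTS * N_CRITERIA  # 175, as stated in the text


def rating_spread(levels):
    """Population standard deviation of the levels given to one criterion."""
    return pstdev(levels)


def disagreement_index(levels):
    """Mean absolute difference over all pairs of evaluators (assumed definition)."""
    pairs = list(combinations(levels, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)
```
</p>
<p>Both functions take the list of levels that all evaluators assigned to one criterion of one product; a value of zero means full agreement.</p></div>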
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Results</head><p>The consistency of each model's assessment was tested by having each model evaluate three randomly selected tasks from the sample three times each. For this test, each LLM assessed a total of 45 criteria. From these tests, the following behaviours were observed:</p><p>• GPT-4o, Gemini 1.5 Pro Latest and Claude 3.5 Sonnet were extremely consistent, with only one instance of a different assessment for one criterion, by just one point. • Mistral Large 2402 was perfectly consistent, with zero instances of different assessments.</p><p>• Open Mixtral 8x22B (2404) and Qwen2 72B Instruct were quite consistent, with five instances of different assessments of a single criterion by one point. • Llama 3.1 70B Instruct Turbo was fairly inconsistent, with 19 instances of different criteria assessments by one point.</p></div>
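<div xmlns="http://www.tei-c.org/ns/1.0"><p>The text does not specify exactly how an "instance of a different assessment" was counted. One possible operationalisation, offered here only as an assumption, is to take the repeated runs on the same task and count every score that departs from the most common score for that criterion. The function name and the toy scores below are hypothetical.</p>

```python
from collections import Counter


def count_deviations(runs):
    """Count scores that differ from the modal score of their criterion.

    runs: list of repeated runs on the same task, each a list with one
    score per rubric criterion (same order in every run).
    """
    deviations = 0
    for scores in zip(*runs):  # all repeated scores for one criterion
        modal_count = Counter(scores).most_common(1)[0][1]
        deviations += len(scores) - modal_count
    return deviations


if __name__ == "__main__":
    # Hypothetical example: three runs on one task, five criteria each.
    runs_for_task = [
        [3, 2, 3, 4, 2],
        [3, 2, 3, 4, 2],
        [3, 3, 3, 4, 2],  # criterion 2 scored one point higher here
    ]
    print(count_deviations(runs_for_task))
```

<p>Summing this count over the three tasks would give a figure out of 45 scores per model, comparable in form to the counts reported above.</p></div>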
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Principal Component Analysis</head><p>The first analysis conducted, in addition to descriptive statistics, was the PCA, a dimensionality reduction technique that allows the identification of latent variables within the data and that can represent a general model of the data. Three principal components were identified from the PCA conducted on the assessment data (Table <ref type="table" target="#tab_4">3</ref>). The first component (RC1) is formed by the loadings of evaluators e1, e3, e4, e5, e6, e7 and e8, which correspond respectively to the LLMs GPT-4o, Claude 3.5 Sonnet, Mistral Large, Mixtral 8x22B, Llama 3.1 70B, Qwen2 72B and the merge of the LLMs' opinions. The second component (RC2) comprises those of e2, e9, e10 and e11, corresponding to Gemini 1.5 Pro and human evaluators 1, 2 and 3. As can be seen in Figure <ref type="figure" target="#fig_2">1</ref>, both GPT-4o and Claude 3.5 Sonnet contribute mainly to the RC1 component but also to RC2. Gemini 1.5 Pro, on the other hand, contributes only to the RC2 component (its tiny loading on RC1 is negative). If the identified components were to be named, RC1 could be called "LLM Evaluation Pattern" and RC2 "Human Evaluation Pattern".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Analysis of Standard Deviation of Grades by Product and Assessment Criterion</head><p>To understand how assessments differed from criterion to criterion and from evaluator to evaluator, an analysis was conducted on the standard deviation (SD) of the different variables of the study. The criteria, numbered or abbreviated in some of the graphs, are those listed in Table <ref type="table" target="#tab_5">4</ref>. Firstly, an effort was made to identify which assessment criteria had the lowest and the highest SD (Table <ref type="table" target="#tab_5">4</ref>) to understand which were assessed more consistently by all evaluators. The criteria with the minimum SD across all products are Criterion 4 and Criterion 1 ("Detailed scanning of the intervention" and "Understanding and application of teaching architectures"), with an average of about 0.5. This suggests a high level of agreement among evaluators in assessing the quality and detail of the activities envisaged in the educational design and the understanding and correct application of the teaching architectures on which they are based. On the other hand, the criterion with the maximum SD among all activities is Criterion 5 ("Critical reflection on redesign"), with an average of about 0.8. This indicates a higher level of disagreement or inconsistency in how evaluators assessed the quality of the teachers' critical reflection on their past activities and the way in which they tried to improve them.</p></div>
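<div xmlns="http://www.tei-c.org/ns/1.0"><p>The averages discussed above can be illustrated with a short sketch. It assumes, without confirmation from the text, that the SD is computed across evaluators for each product and criterion and then averaged over products, and that the sample SD is used. The share of the range divides by 3, the width of the 1 to 4 scale. The data layout, function name and toy scores are hypothetical.</p>

```python
from statistics import mean, stdev

SCALE_RANGE = 4 - 1  # rubric levels run from 1 to 4


def average_sd_per_criterion(scores):
    """scores[product][criterion] -> list of levels, one per evaluator.

    Returns {criterion: (average SD, share of the scale range)}.
    """
    criteria = sorted({c for product in scores.values() for c in product})
    result = {}
    for c in criteria:
        # Sample SD across evaluators, one value per product (assumption).
        sds = [stdev(product[c]) for product in scores.values()]
        avg = mean(sds)
        result[c] = (avg, avg / SCALE_RANGE)
    return result


if __name__ == "__main__":
    # Hypothetical scores: two products, two criteria, three evaluators.
    toy = {
        "product_A": {1: [2, 2, 2], 5: [1, 3, 4]},
        "product_B": {1: [1, 2, 3], 5: [2, 2, 4]},
    }
    for criterion, (avg_sd, share) in average_sd_per_criterion(toy).items():
        print(f"Criterion {criterion}: SD {avg_sd:.2f} ({share:.0%} of range)")
```

<p>With the values in Table 4, an average SD of 0.76 corresponds to roughly 25% of the three-point range, which matches the percentage reported there.</p></div>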
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.">Agreement Index</head><p>An "Agreement Index" (AIdx) was developed to obtain a more robust metric and better understand which evaluators assigned more similar scores for the various criteria. This index combines the average difference between the scores assigned to a criterion and the variability of this difference. It was calculated to understand which evaluators are most similar to the human ones for each criterion. While LLM evaluators are treated individually, the human benchmark is an average of the human evaluators' (e9, e10 and e11) assessments. It is constructed as follows:</p><formula notation="TeX">\mathrm{AIdx} = \frac{\text{Average Difference} + \text{Variability of the Difference}}{2}</formula><p>Therefore:</p><p>• The "Average Difference" is the absolute average difference in scores assigned between the evaluator in question and the average of human evaluators across all tasks and criteria. • The "Variability of the Difference" is the standard deviation of the difference scores between the tested evaluator and the reference evaluator, reflecting how consistent these differences are across different tasks and criteria.</p><p>AIdx is calculated individually for each evaluator. It provides a single measure that encapsulates the average magnitude of evaluation differences relative to the reference evaluator and the consistency of such differences. A lower value indicates greater overall agreement with the human evaluators. The highest possible value of the index for an evaluator would be achieved if they constantly evaluated at the maximum difference from the human evaluators (3 points).</p><p>The LLM evaluator that provided assessments most similar to the average of human evaluators (calculated through the AIdx) is GPT-4o, followed at a negligible distance by Claude 3.5 Sonnet. On the other hand, the LLM evaluator with the worst AIdx is Qwen2 72B (Table <ref type="table">5</ref>).</p></div>
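<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the index, the following minimal sketch computes it for one evaluator against the human average. Two points are assumptions rather than details confirmed by the text: the "absolute average difference" is read as the absolute value of the mean signed difference (the mean of absolute differences would be another plausible reading), and the variability is taken as the sample standard deviation. The function name and the toy scores are hypothetical.</p>

```python
from statistics import mean, stdev


def agreement_index(evaluator_scores, human_mean_scores):
    """AIdx = (average difference + variability of the difference) / 2.

    Both arguments are equally long sequences of scores, aligned by
    task and criterion. Lower values mean closer agreement.
    """
    diffs = [e - h for e, h in zip(evaluator_scores, human_mean_scores)]
    average_difference = abs(mean(diffs))  # assumed reading of the text
    variability = stdev(diffs)             # sample SD (assumption)
    return (average_difference + variability) / 2


if __name__ == "__main__":
    # Hypothetical scores for four criterion assessments.
    llm = [3, 3, 4, 2]
    humans = [2, 3, 3, 2]
    print(f"AIdx = {agreement_index(llm, humans):.2f}")
```

<p>Under these assumptions, an evaluator that always sits exactly 3 points away from the human average has an average difference of 3 and no variability, giving an index of 1.5, while identical scores give 0.</p></div>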
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Agreement Indices (AIdx) with reference to the average of human evaluators. Columns: Evaluator; Agreement Index with "average human" (lower is better).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Focusing on the single criteria (Table <ref type="table" target="#tab_8">6</ref>), it can be noted how the AIdx varies from criterion to criterion. Unexpectedly, Qwen2 72B, the worst on the general AIdx with the "average" human evaluator, is the single model that is most human-like in three criteria out of five. Its main problem is that it assessed the most difficult criterion, criterion number 5 "Critical reflection on redesign", very differently from the humans (Table <ref type="table" target="#tab_8">6</ref>). It also did not fare optimally in criterion number 3 "Definition of learning objectives". Other LLMs such as GPT-4o and Claude 3.5 Sonnet, as well as the merge of the different LLMs' feedback, maintain a good Agreement Index across the board.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.">Assessment correlations among LLM and Human evaluators</head><p>As reported in Table <ref type="table" target="#tab_9">7</ref>, the model with the highest correlation with human evaluation is by far Gemini 1.5 Pro (r = 0.84), followed at a distance by Claude 3.5 Sonnet (r = 0.66), then by GPT-4o (r = 0.59), the merge of the LLMs' feedback by GPT-4o (r = 0.58) and Llama 3.1 70B (r = 0.45). This suggests that Gemini 1.5 Pro's pattern of scores across the criteria is the most similar to that of the human evaluators.</p><p>But what happens when single criteria are excluded from the correlation analysis? This could help in understanding which criteria make the assessment "human" and what LLMs struggle with:</p><p>• Excluding Criterion 3 (Definition of learning objectives): When Criterion 3 is excluded, the correlation of every LLM except Gemini 1.5 Pro improves, in most cases markedly. Notably, Mistral Large and Qwen2 72B jump from being hardly correlated, or not at all (r = 0.27 and -0.06 respectively), to being strongly correlated (r = 0.86 and 0.88). Excluding Criterion 3 also significantly reduces the correlation of Gemini 1.5 Pro, suggesting that this was the criterion that it got right and that contributed most to its excellent general correlation with the humans' assessment. This suggests that Criterion 3 may be peculiarly human-like in its application, which these models struggle to mimic accurately. The large increase implies that Criterion 3 might involve a complex judgment that those models are incapable of handling, or contextual information that is not being passed to the model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>• Excluding Criterion 2 (Selection and implementation of teaching strategies):</head><p>Excluding Criterion 2 does not change the LLMs' correlation with human assessment, except for Gemini 1.5 Pro Latest. Gemini shows an almost perfect correlation of r = 0.99 when Criterion 2 is excluded, which is remarkable; even in this case, however, the criterion does not seem to be crucial.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>• Excluding Criterion 5 (Critical reflection on redesign):</head><p>The exclusion leads to a substantial increase in correlation for Mistral Large (from r = 0.27 to 0.88) and a notable improvement for several other models. This criterion, similarly to Criterion 3, may also represent aspects of human judgement that are challenging for models to replicate accurately.</p></div>
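<div xmlns="http://www.tei-c.org/ns/1.0"><p>The exclusion analysis above can be sketched as a leave-one-criterion-out Pearson correlation between the per-criterion average scores of an LLM and those of the human evaluators, following the description in the caption of Table 7. This is a minimal illustration rather than the authors' actual procedure; the function names and the score vectors are hypothetical and do not reproduce the study data.</p>

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def leave_one_criterion_out(llm_means, human_means):
    """Correlation on all criteria, then with each criterion excluded."""
    results = {"Total": pearson(llm_means, human_means)}
    for k in range(len(llm_means)):
        llm_rest = llm_means[:k] + llm_means[k + 1:]
        human_rest = human_means[:k] + human_means[k + 1:]
        results[f"Excl. Crit. {k + 1}"] = pearson(llm_rest, human_rest)
    return results


if __name__ == "__main__":
    # Hypothetical per-criterion averages (criteria 1 to 5).
    human = [3.0, 2.8, 2.5, 3.2, 2.0]
    llm = [3.1, 2.9, 3.4, 3.3, 2.2]  # diverges from humans on criterion 3
    for label, r in leave_one_criterion_out(llm, human).items():
        print(f"{label}: r = {r:.2f}")
```

<p>Because each correlation rests on only four or five points, a single divergent criterion can move r substantially, which is worth keeping in mind when reading the jumps reported in Table 7.</p></div>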
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Discussion</head><p>Regarding the goal of understanding whether educators without expertise in machine learning can employ current Large Language Models (LLMs) to assess students' written authentic tasks using assessment rubrics, the analyses have revealed several interesting elements:</p><p>• Unlike in a previous iteration of the study <ref type="bibr" target="#b9">[10]</ref>, all the models have a large enough context window to perform this task. • From the PCA, it appears that human evaluators generally have a different pattern of evaluation compared to LLMs. • In contrast with the evaluation pattern, the Agreement Index (AIdx) measures both the magnitude of the score differences and their consistency. A high Agreement Index value suggests significant discrepancies between the model's scores and the average human scores, despite possibly similar trends in the pattern. Converting each model's AIdx with respect to the average human into a percentage yields an accuracy metric, which helps to better visualise each model's performance (Fig. <ref type="figure" target="#fig_3">2</ref> and Fig. <ref type="figure" target="#fig_4">3</ref>). • Only Llama 3.1 70B was inconsistent in the repeated assessment of the same task. • Gemini 1.5 Pro is the LLM whose evaluation pattern is most similar to the humans', with by far the highest correlation (see Table <ref type="table" target="#tab_9">7</ref>). It is the only model that, in the PCA, loads only on the human assessment component (Fig. <ref type="figure" target="#fig_2">1</ref>). On the other hand, its AIdx was the second worst, just ahead of Qwen2 72B (Table <ref type="table">5</ref>, Fig. <ref type="figure" target="#fig_3">2</ref>). 
• GPT-4o and Claude 3.5 Sonnet have evaluation patterns not too dissimilar from the humans' (Fig. <ref type="figure" target="#fig_2">1</ref>, Table <ref type="table" target="#tab_9">7</ref>) and on average attribute marks more similar to the humans' than any other model (Table <ref type="table">5</ref>). • Llama 3.1 70B Instruct was the best of the open models and the fourth overall (Table <ref type="table">5</ref>), after the three proprietary models already mentioned. It performed quite well on the correlation with the humans' assessment pattern, with a moderate correlation (Table <ref type="table" target="#tab_9">7</ref>), and has a good AIdx.</p><p>The problem with this model is the inconsistency in the assessment of the same task, where it "changed its mind" 19 times out of 45. It would be interesting to understand whether that inconsistency is related to the quantisation applied by Together AI, the API provider used. • Mixtral 8x22B and Mistral Large fared similarly, with patterns quite dissimilar to the humans' and AIdx values that are reasonably good (similar to Llama's). The correlation of Mistral Large with the human pattern of evaluation, when Criterion 3 is removed, is the second highest, which gives reason to follow it closely and keep it in the test pool. • Qwen2 72B, an open LLM, would have been by far the best LLM overall (and Mistral Large would have been the second) if it weren't for Criterion 3. Criterion 3 posed a grave problem for Qwen2 from both the assessment pattern and the AIdx point of view (Fig. <ref type="figure" target="#fig_3">2</ref> and Fig. <ref type="figure" target="#fig_4">3</ref>). • Criterion 3 (Definition of learning objectives), to a larger extent, and Criterion 5 (Critical reflection on redesign), to a smaller extent, appear to be the most discriminative criteria in terms of capturing what makes the human evaluation pattern unique for this assessment task (Fig. <ref type="figure" target="#fig_3">2</ref> and Fig. 
<ref type="figure" target="#fig_4">3</ref>). These criteria likely involve nuances and complexities in judgment that are particularly human-like and challenging for LLMs to capture accurately, or the authors might have failed to provide the LLMs with all the relevant contextual information regarding these criteria. This last hypothesis seems relevant because, in the previous iteration of the study, this same criterion was the easiest one for LLMs to assess in a human-like manner <ref type="bibr" target="#b9">[10]</ref>.</p><p>Based on the available data, it appears that the most suitable LLMs for assessing students' authentic tasks using an assessment rubric are Claude 3.5 Sonnet and GPT-4o. This is because they fared well both on the assessment pattern (PCA and correlations) and on the agreement index (magnitude and consistency of score differences). On the other hand, Gemini 1.5 Pro is the one that had by far the most human-like assessment pattern, but it fell short on the AIdx, attributing marks that were very different from the humans'.</p><p>Qwen2 72B and Llama 3.1 70B deserve a mention as they are open models and, if not for some flaws, would have been at the level of (or better than) the aforementioned proprietary models. Llama has a problem of inconsistency in the marks assigned for each criterion, while Qwen2 really just got one criterion very wrong. It might be useful to know that, for both of them, Together AI (https://www.together.ai/) was used as an API provider. It applies 8-bit floating-point (FP8) quantisation for Llama 3.1 70B Instruct Turbo, while Qwen2 72B Instruct is run at full-precision 16-bit floating point (FP16).</p><p>Human evaluators have a pattern of evaluation (see the PCA) that can usually be distinguished from that of the LLMs, but Gemini 1.5 Pro, if not for its very different score attribution, has a very similar pattern. 
It is interesting to note that the human evaluators differ among themselves in score attribution (see Table <ref type="table">5</ref> and Figure <ref type="figure" target="#fig_4">3</ref>), but they assess the critical criteria similarly.</p><p>All things considered, at present none of the LLMs can be used for autonomous evaluation of all criteria, especially the more complex and less contextualised ones. This confirms what Webb <ref type="bibr" target="#b32">[33]</ref> highlighted. However, Claude 3.5 Sonnet, GPT-4o and, with some caution, Qwen2 72B Instruct have the potential to be used as solid support for evaluation at the summative evaluation level, as described in the AI-MAAS (AI-Mediated Assessment for Academics and Students) model <ref type="bibr" target="#b10">[11]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Conclusions</head><p>The fundamental question of this study was whether, and which, current Large Language Models (LLMs) can be used by university educators (although it applies to other educators and instructors, too), even those without technical experience, to assess student-written authentic products in the presence of open tasks and questions using assessment rubrics. Indeed, using these technologies could make assessment more sustainable and scalable, allowing for more consistent alignment with declared learning objectives. This study has allowed us to determine that Claude 3.5 Sonnet, GPT-4o and, with some caution, Qwen2 72B Instruct have the potential to be used as solid support for summative evaluation. According to this study, the use of LLMs can be beneficial, but only if they are used under proper supervision. They should be seen as assistance for university educators and not as a substitute for assessments. The available data does not indicate that they are reliable enough to perform assessments independently, even if they are getting close to it. In fact, some criteria that are too complex or need additional information about the context or specific subject can be evaluated in a way that is not in line with human assessment. This finding confirms the guidelines stated by Miao et al. <ref type="bibr" target="#b30">[31]</ref> and Webb <ref type="bibr" target="#b32">[33]</ref>. The limitations of the present study lie in the sample size of student products, which needs to be significantly increased, as well as in the number of human expert evaluators and the disciplines involved in the tests. The assessment rubric can also be optimised and, especially for the most critical criteria (such as Criterion 3), it would be important to experiment with its formulation to understand whether a human error in defining the criteria made them difficult for the LLMs to interpret. 
The idea behind this study is that it should be expanded and updated on a rolling basis to adjust the discussion and bring useful novelties into assessment practice. Future evolutions of the study might include multi-shot prompting and the evaluation of textual feedback and assessment of tasks. Feedback that could be provided during the assessment for each of the criteria in a rubric deserves particular exploration <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b41">42]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>**How to respond to requests:** * Do not express personal opinions or subjective judgements. * Focus exclusively on the criteria provided in the rubric. * Provide a fair and impartial assessment based on the task's adherence to the criteria. * Carefully review the student's entire paper before beginning the assessment. * Offer constructive suggestions as to how the student might improve. * Uses clear and concise language. * Justify the marks awarded with specific references to the paper and the rubric. * In your assessment, take into account that the students only had 2 hours for planning. **Request format Each request will include: * **The student's assignment:** The text of the assignment you are to assess. * **The grading rubric:** A list of criteria with descriptions for each grade level. **Response format:** Your answer should follow this format: **Title of the paper (also called title of the paper) as it appears in the document: [insert title here]**. **Total score:** [Insert total score here]. 
**Scoring breakdown:** | Criterion | Score | Comments |-|-|-| | [Criterion 1] | [Score] | [Comments with specific examples from the task] | | [Criterion 2] | [Score] | [Comments with specific examples from the task] | | [Criterion 3] | [Score] | [Comments with specific examples from the task] | | ... | ... | ... | **Suggestions for improvement * [Suggestion 1] * [Suggestion 2] * ... **Answer following the answer format provided above.**</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>applications not always appropriate. -Sufficient (award 2 points): Basic understanding with some uncertainties in application. -Good (awarded 3 points): Correct application and good understanding. -Excellent (awarded 4 points): Thorough understanding and innovative and relevant application. Criterion 2 -Selection and implementation of teaching strategies: -Insufficient (award 1 point): Limited or not always adequate strategies. -Sufficient (award 2 points): Relevant strategies but implementation can be improved. -Good (award 3 points): Strategies appropriate and related to the objectives. -Excellent (award 4 points): Highly effective, diverse and well adapted strategies. Criterion 3 -Definition of learning objectives: -Insufficient (award 1 point): Vague or non-measurable objectives. -Sufficient (award 2 points): Objectives present but not very specific. -Good (award 3 points): Well-defined and generally aligned objectives. -Excellent (award 4 points): Clear, specific, measurable and perfectly aligned objectives. Criterion 4 -Detailed scanning of the intervention -Insufficient (award 1 point): Incomplete or unclear scan. -Sufficient (award 2 points): Scan present but can be improved in structure. -Good (awarded 3 points): Clear and well-structured scan. -Excellent (awarded 4 points): Detailed, logical and well-structured scanning. Criterion 5 -Critical reflection on redesign: -Insufficient (award 1 point): Lack of critical reflection or superficial justifications. -Sufficient (award 2 points): Reflection present but not very thorough. -Good (awarded 3 points): Good reflection with clear connections. -Excellent (award 4 points): Deep and critical reflection, clear justifications. &lt;end of assessment rubric&gt; **Total score:** **Scoring distribution:**</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: PCA Path Diagram. The diagram shows two main components (RC1 and RC2) and their relationships with different evaluators. RC1 represents the "LLM Evaluation Pattern" while RC2 represents the "Human Evaluation Pattern".</figDesc><graphic coords="11,117.13,65.60,361.03,375.23" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Radar graph of the LLMs' AIdx (transformed in percentage) with the average human for each criterion.</figDesc><graphic coords="15,72.00,334.61,451.27,404.76" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Radar graph of the LLMs' and Humans' AIdx (transformed in percentage) with the average human for each criterion.</figDesc><graphic coords="16,72.00,65.61,451.27,404.76" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Rubric for the Assessment of the Educational Intervention</figDesc><table><row><cell>Assessment Criteria</cell><cell>Insufficient Level (1 point)</cell><cell>Sufficient Level (2 points)</cell><cell>Good Level (3 points)</cell><cell>Excellent Level (4 points)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>3. Claude 3.5 Sonnet: developed by Anthropic, excels in the ability to understand nuanced language, humour, and complex instructions. It is designed to generate high-quality content in a relatable, natural tone, showing marked improvements in areas such as writing and human-centric communication. Link: https://www.anthropic.com/news/claude-3-5-sonnet. 4. Mistral Large (2402): is designed to excel in complex reasoning tasks, particularly in multilingual contexts. The model is highly effective in text understanding, transformation, and code generation. It demonstrates top-tier performance in handling sophisticated reasoning challenges, making it a robust tool for both natural language processing and technical tasks. Link: https://mistral.ai/news/mistral-large/. 5. Open Mixtral 8x22B (2404): is one of the latest models developed by Mistral, featuring a sparse Mixture-of-Experts (SMoE) architecture. Despite its large size, with 141 billion parameters, only 39 billion parameters are actively engaged during processing, optimising both performance and cost efficiency. This approach sets new standards in the AI community for balancing model complexity with computational resource usage. Link: https://mistral.ai/news/mixtral-8x22b/. 6. Llama 3.1 70B Instruct Turbo: developed by Meta, is a 70-billion parameter language model designed for instruction-following tasks. The model is optimised to improve interactions where clear guidance or step-by-step reasoning is required, positioning it as an effective tool for applications in both academic and practical domains. Link: https://ai.meta.com/blog/meta-llama-3-1/. 7. Qwen2 72B Instruct: developed by Alibaba Cloud, is a 72-billion parameter language model optimised for instruction-based tasks. It integrates the latest advancements in generative AI, offering improved efficiency in tasks ranging from conversational AI to complex text generation and reasoning. Its design caters specifically to high-performance needs in both commercial and research applications. Link: https://www.alibabacloud.com/en/solutions/generative-ai/qwen?_p_lc=1</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Context windows of the LLMs used. The context windows refer to the APIs. Note that this feature may change with updates.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head></head><label></label><figDesc></figDesc><table><row><cell>Large Language Model (versions available in Italy, September 2024)</cell><cell>Context Window (in tokens)</cell></row><row><cell>GPT-4o</cell><cell>128,000</cell></row><row><cell>Claude 3.5 Sonnet</cell><cell>200,000</cell></row><row><cell>Gemini 1.5 Pro Latest</cell><cell>2,000,000</cell></row><row><cell>Mistral Large (2402)</cell><cell>32,000</cell></row><row><cell>Mixtral 8x22B (2404)</cell><cell>64,000</cell></row><row><cell>Meta Llama 3.1 70B Instruct Turbo</cell><cell>131,072</cell></row><row><cell>Qwen2 72B Instruct</cell><cell>32,768</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>PCA Component Loadings</figDesc><table><row><cell>Evaluator</cell><cell>RC1</cell><cell>RC2</cell><cell>Uniqueness</cell></row><row><cell>e1 (GPT-4o)</cell><cell>0.579</cell><cell>0.327</cell><cell>0.472</cell></row><row><cell>e2 (Gemini 1.5 Pro)</cell><cell></cell><cell>0.421</cell><cell>0.832</cell></row><row><cell>e3 (Claude 3.5 Sonnet)</cell><cell>0.695</cell><cell>0.312</cell><cell>0.321</cell></row><row><cell>e4 (Mistral Large 2402)</cell><cell>0.775</cell><cell></cell><cell>0.430</cell></row><row><cell>e5 (Mixtral 8x22B 2404)</cell><cell>0.651</cell><cell></cell><cell>0.592</cell></row><row><cell>e6 (Llama 3.1 70B Instruct)</cell><cell>0.743</cell><cell></cell><cell>0.405</cell></row><row><cell>e7 (Qwen2 72B Instruct)</cell><cell>0.825</cell><cell>-0.326</cell><cell>0.336</cell></row><row><cell>e8 (Merge of 7 LLMs by GPT-4o)</cell><cell>0.802</cell><cell></cell><cell>0.301</cell></row><row><cell>e9 (Human Evaluator 1)</cell><cell></cell><cell>0.515</cell><cell>0.708</cell></row><row><cell>e10 (Human Evaluator 2)</cell><cell></cell><cell>0.753</cell><cell>0.420</cell></row><row><cell>e11 (Human Evaluator 3)</cell><cell></cell><cell>0.644</cell><cell>0.604</cell></row><row><cell cols="4">Note. Applied rotation method is promax.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Average standard deviation of scores assigned to criteria</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head></head><label></label><figDesc></figDesc><table><row><cell>Criterion</cell><cell>Description</cell><cell>Average Standard Deviation</cell><cell>Percentage of Total Range (1-4)</cell></row><row><cell>5</cell><cell>Critical reflection on redesign</cell><cell>0.76</cell><cell>25%</cell></row><row><cell>2</cell><cell>Selection and implementation of teaching strategies</cell><cell>0.64</cell><cell>21%</cell></row><row><cell>3</cell><cell>Definition of learning objectives</cell><cell>0.61</cell><cell>20%</cell></row><row><cell>1</cell><cell>Understanding and application of teaching architectures</cell><cell>0.54</cell><cell>18%</cell></row><row><cell>4</cell><cell>Detailed scanning of the intervention</cell><cell>0.53</cell><cell>17%</cell></row><row><cell cols="4">AIdx = (Average difference + Variability of the difference) / 2</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 6</head><label>6</label><figDesc>Agreement Indices (AIdx) of evaluators compared to the average human evaluator divided by criterion</figDesc><table><row><cell>Crit.</cell><cell>GPT-4o</cell><cell>Gemini 1.5 Pro Latest</cell><cell>Claude Sonnet 3.5</cell><cell>Mistral Large (2402)</cell><cell>Mixtral 8x22B (2404)</cell><cell>Meta Llama 3.1 70B</cell><cell>Qwen2 72B</cell><cell>Merge (GPT-4o)</cell><cell>Best LLM</cell><cell>Second Best LLM</cell></row><row><cell>1</cell><cell>0.50</cell><cell>0.41</cell><cell>0.50</cell><cell>0.43</cell><cell>0.41</cell><cell>0.43</cell><cell>0.39</cell><cell>0.47</cell><cell>Qwen2 72B</cell><cell>Gemini 1.5 Pro / Mixtral 8x22B</cell></row><row><cell>2</cell><cell>0.56</cell><cell>0.64</cell><cell>0.57</cell><cell>0.47</cell><cell>0.55</cell><cell>0.57</cell><cell>0.40</cell><cell>0.55</cell><cell>Qwen2 72B</cell><cell>Mistral Large</cell></row><row><cell>3</cell><cell>0.43</cell><cell>0.60</cell><cell>0.43</cell><cell>0.49</cell><cell>0.68</cell><cell>0.62</cell><cell>0.70</cell><cell>0.41</cell><cell>Merge (GPT-4o)</cell><cell>GPT-4o / Claude Sonnet 3.5</cell></row><row><cell>4</cell><cell>0.37</cell><cell>0.46</cell><cell>0.35</cell><cell>0.37</cell><cell>0.48</cell><cell>0.38</cell><cell>0.33</cell><cell>0.32</cell><cell>Merge (GPT-4o)</cell><cell>Qwen2 72B</cell></row><row><cell>5</cell><cell>0.41</cell><cell>0.63</cell><cell>0.45</cell><cell>0.80</cell><cell>0.44</cell><cell>0.55</cell><cell>1.24</cell><cell>0.62</cell><cell>GPT-4o</cell><cell>Mixtral 8x22B</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 7</head><label>7</label><figDesc>Pearson correlation coefficients between each Large Language Model's (LLM) average score per criterion and the human evaluators' average score per criterion. The "Total" column uses all criteria; each "Excl. Crit." column omits one criterion, to show which criteria make the scoring pattern "human".</figDesc><table>
<row><cell>LLM</cell><cell>Total</cell><cell>Excl. Crit. 1</cell><cell>Excl. Crit. 2</cell><cell>Excl. Crit. 3</cell><cell>Excl. Crit. 4</cell><cell>Excl. Crit. 5</cell></row>
<row><cell>GPT-4o</cell><cell>0.59</cell><cell>0.45</cell><cell>0.60</cell><cell>0.65</cell><cell>0.74</cell><cell>0.66</cell></row>
<row><cell>Gemini 1.5 Pro Latest</cell><cell>0.84</cell><cell>0.79</cell><cell>0.99</cell><cell>0.47</cell><cell>0.82</cell><cell>0.86</cell></row>
<row><cell>Claude Sonnet 3.5</cell><cell>0.66</cell><cell>0.54</cell><cell>0.66</cell><cell>0.78</cell><cell>0.71</cell><cell>0.81</cell></row>
<row><cell>Mistral Large (2402)</cell><cell>0.27</cell><cell>0.16</cell><cell>0.21</cell><cell>0.86</cell><cell>0.21</cell><cell>0.88</cell></row>
<row><cell>Mixtral 8x22B (2404)</cell><cell>0.15</cell><cell>0.24</cell><cell>0.01</cell><cell>0.53</cell><cell>0.06</cell><cell>0.19</cell></row>
<row><cell>Llama 3.1 70B Instruct Turbo</cell><cell>0.45</cell><cell>0.34</cell><cell>0.42</cell><cell>0.74</cell><cell>0.47</cell><cell>0.69</cell></row>
<row><cell>Qwen2 72B</cell><cell>-0.06</cell><cell>-0.20</cell><cell>-0.13</cell><cell>0.88</cell><cell>-0.11</cell><cell>-0.77</cell></row>
<row><cell>Merge (GPT-4o)</cell><cell>0.58</cell><cell>0.45</cell><cell>0.58</cell><cell>0.75</cell><cell>0.63</cell><cell>0.75</cell></row>
</table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The authors would like to thank Elena Benini, PhD student, for her contribution to the assessment of students' authentic tasks, and Prof. Massimo Stella for the fruitful discussion about statistical methods. Both are affiliated with the Department of Psychology and Cognitive Sciences of the University of Trento.</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>https://webapps.unitn.it/du/en/Persona/PER0247709 (D. Agostini); https://webapps.unitn.it/du/en/Persona/PER0242228 (F. Picasso); https://webapps.unitn.it/du/en/Persona/PER0033179 (H. Ballardini)</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The acceptance and diffusion of generative artificial intelligence in education: A literature review</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baytak</surname></persName>
		</author>
		<idno type="DOI">10.46303/cuper.2023.2</idno>
	</analytic>
	<monogr>
		<title level="j">Current Perspectives in Educational Research</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>Article 1</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Exploring the integration of ChatGPT in education: Adapting for the future</title>
		<author>
			<persName><forename type="first">S</forename><surname>Elbanna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Armstrong</surname></persName>
		</author>
		<idno type="DOI">10.1108/MSAR-03-2023-0016</idno>
	</analytic>
	<monogr>
		<title level="j">Management &amp; Sustainability: An Arab Review</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="16" to="29" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">ChatGPT has entered the classroom: How LLMs could transform education</title>
		<author>
			<persName><forename type="first">A</forename><surname>Extance</surname></persName>
		</author>
		<idno type="DOI">10.1038/d41586-023-03507-3</idno>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">623</biblScope>
			<biblScope unit="page" from="474" to="477" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Adoption of AI ChatBot like Chat GPT in Higher Education in India: A SEM Analysis Approach</title>
		<author>
			<persName><forename type="first">S</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ray</surname></persName>
		</author>
		<idno type="DOI">10.36683/2306-1758/2023-4-46/130-149</idno>
	</analytic>
	<monogr>
		<title level="j">Economic Environment</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="130" to="149" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Chat-GPT; validating Technology Acceptance Model (TAM) in education sector via ubiquitous learning mechanism</title>
		<author>
			<persName><forename type="first">N</forename><surname>Saif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">U</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Shaheen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alotaibi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Alnfiai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Arif</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.chb.2023.108097</idno>
	</analytic>
	<monogr>
		<title level="j">Computers in Human Behavior</title>
		<imprint>
			<biblScope unit="page">108097</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">What drives students toward ChatGPT? An investigation of the factors influencing adoption and usage of ChatGPT</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K</forename><surname>Tiwari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Bhat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subramaniam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A I</forename><surname>Khan</surname></persName>
		</author>
		<idno type="DOI">10.1108/ITSE-04-2023-0061</idno>
	</analytic>
	<monogr>
		<title level="j">Interactive Technology and Smart Education</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">New era of artificial intelligence in education: Towards a sustainable multifaceted revolution</title>
		<author>
			<persName><forename type="first">F</forename><surname>Kamalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Santandreu Calonge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurrib</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Sustainability</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page">12451</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Academic Integrity Considerations of AI Large Language Models in the Post-Pandemic Era: ChatGPT and Beyond</title>
		<author>
			<persName><forename type="first">M</forename><surname>Perkins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of University Teaching and Learning Practice</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">ChatGPT in higher education: Considerations for academic integrity and student learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sullivan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>McLaughlan</surname></persName>
		</author>
		<idno type="DOI">10.37074/jalt.2023.6.1.17</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Learning &amp; Teaching</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Are large language models capable of assessing students&apos; written products? A pilot study in higher education</title>
		<author>
			<persName><forename type="first">D</forename><surname>Agostini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Research Trends in Humanities Education &amp; Philosophy</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="38" to="60" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Large language models for sustainable assessment and feedback in higher education</title>
		<author>
			<persName><forename type="first">D</forename><surname>Agostini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Picasso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Intelligenza Artificiale</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="121" to="138" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Firm Investments in Artificial Intelligence Technologies and Changes in Workforce Composition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Babina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fedyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hodson</surname></persName>
		</author>
		<idno type="DOI">10.3386/w31325</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>National Bureau of Economic Research</publisher>
		</imprint>
	</monogr>
	<note type="report_type">Working Paper 31325</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<ptr target="https://www.bloomberg.com/company/press/generative-ai-to-become-a-1-3-trillion-market-by-2032-research-finds/" />
		<title level="m">Generative AI to become a $1.3 trillion market by 2032, research finds</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Hammond</surname></persName>
		</author>
		<ptr target="https://www.ft.com/content/c6b47d24-b435-4f41-b197-2d826cce9532" />
		<title level="m">Big tech outspends venture capital firms in AI investment frenzy</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">When does AI pay off? AI-adoption intensity, complementary investments, and R&amp;D strategy</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">S</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kim</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.technovation.2022.102590</idno>
	</analytic>
	<monogr>
		<title level="j">Technovation</title>
		<imprint>
			<biblScope unit="volume">118</biblScope>
			<biblScope unit="page">102590</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">What is AI literacy? Competencies and design considerations</title>
		<author>
			<persName><forename type="first">D</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Magerko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems</title>
				<meeting>the 2020 CHI Conference on Human Factors in Computing Systems</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1" to="16" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Conceptualizing AI literacy: An exploratory review</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">T K</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">K L</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K W</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Qiao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers and Education: Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page">100041</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Developing and validating a multidimensional AI literacy questionnaire: Operationalizing AI literacy for higher education</title>
		<author>
			<persName><forename type="first">G</forename><surname>Biagini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cuomo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ranieri</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3605/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First International Workshop on High-Performance Artificial Intelligence Systems in Education, AIxEDU 2023</title>
				<meeting>the First International Workshop on High-Performance Artificial Intelligence Systems in Education, AIxEDU 2023<address><addrLine>Aachen</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Explicating AI literacy of employees at digital workplaces</title>
		<author>
			<persName><forename type="first">D</forename><surname>Cetindamar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kitto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Abedin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Knight</surname></persName>
		</author>
		<idno type="DOI">10.1109/TEM.2021.3138503</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Engineering Management</title>
		<imprint>
			<biblScope unit="volume">71</biblScope>
			<biblScope unit="page" from="810" to="823" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Evaluating an Artificial Intelligence Literacy Programme for Developing University Students&apos; Conceptual Understanding, Literacy, Empowerment and Ethical Awareness</title>
		<author>
			<persName><forename type="first">S.-C</forename><surname>Kong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">M.-Y</forename><surname>Cheung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Educational Technology &amp; Society</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="16" to="30" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Measuring user competence in using artificial intelligence: Validity and reliability of artificial intelligence literacy scale</title>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-L</forename><forename type="middle">P</forename><surname>Rau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yuan</surname></persName>
		</author>
		<idno type="DOI">10.1080/0144929X.2022.2072768</idno>
	</analytic>
	<monogr>
		<title level="j">Behaviour &amp; Information Technology</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="1324" to="1337" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Toward an Objective Measurement of AI Literacy</title>
		<author>
			<persName><forename type="first">P</forename><surname>Weber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Baum</surname></persName>
		</author>
		<ptr target="https://aisel.aisnet.org/pacis2023/60" />
	</analytic>
	<monogr>
		<title level="m">PACIS 2023 Proceedings</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<ptr target="https://www.unesco.org/en/articles/guidance-generative-ai-education-and-research" />
		<title level="m">Guidance for generative AI in education and research</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>UNESCO, Report</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">A participatory data-centric approach to AI ethics by design</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gerdes</surname></persName>
		</author>
		<idno type="DOI">10.1080/08839514.2021.2009222</idno>
	</analytic>
	<monogr>
		<title level="j">Applied Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Coping with vulnerability: The effect of trust in AI and privacy-protective behaviour on the use of AI-based services</title>
		<author>
			<persName><forename type="first">C</forename><surname>Jang</surname></persName>
		</author>
		<idno type="DOI">10.1080/0144929X.2023.2246590</idno>
	</analytic>
	<monogr>
		<title level="j">Behaviour &amp; Information Technology</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">When AI Meets Information Privacy: The Adversarial Role of AI in Data Sharing Scenario</title>
		<author>
			<persName><forename type="first">A</forename><surname>Majeed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">O</forename><surname>Hwang</surname></persName>
		</author>
		<idno type="DOI">10.1109/ACCESS.2023.3297646</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="76177" to="76195" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Generative AI meets copyright</title>
		<author>
			<persName><forename type="first">P</forename><surname>Samuelson</surname></persName>
		</author>
		<idno type="DOI">10.1126/science.adi0656</idno>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">381</biblScope>
			<biblScope unit="page" from="158" to="161" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Academic integrity in the age of Artificial Intelligence (AI) authoring apps</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Yeo</surname></persName>
		</author>
		<idno type="DOI">10.1002/tesj.716</idno>
	</analytic>
	<monogr>
		<title level="j">TESOL Journal</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page">e716</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">AI-generated text detectors: Do they work?</title>
		<author>
			<persName><forename type="first">V</forename><surname>Van Oijen</surname></persName>
		</author>
		<ptr target="https://communities.surf.nl/en/ai-in-education/article/ai-generated-text-detectors-do-they-work" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Testing of detection tools for AI-generated text</title>
		<author>
			<persName><forename type="first">D</forename><surname>Weber-Wulff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Anohina-Naumeca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bjelobaba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Foltýnek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guerrero-Dib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Popoola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Šigut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Waddington</surname></persName>
		</author>
		<idno type="DOI">10.1007/s40979-023-00146-z</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal for Educational Integrity</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page">26</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Miao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Holmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<ptr target="https://unesdoc.unesco.org/ark:/48223/pf0000376709" />
		<title level="m">AI and education: Guidance for policy-makers</title>
				<imprint>
			<publisher>UNESCO</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Sabzalieva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Valentini</surname></persName>
		</author>
		<ptr target="https://unesdoc.unesco.org/ark:/48223/pf0000385146" />
		<title level="m">ChatGPT and artificial intelligence in higher education: Quick start guide</title>
				<imprint>
			<publisher>UNESCO</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">A Generative AI Primer</title>
		<author>
			<persName><forename type="first">M</forename><surname>Webb</surname></persName>
		</author>
		<ptr target="https://nationalcentreforai.jiscinvolve.org/wp/2024/01/02/generative-ai-primer/" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
		<respStmt>
			<orgName>National Centre for AI</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<title level="m" type="main">New principles on use of AI in education</title>
		<author>
			<persName><surname>Russell Group</surname></persName>
		</author>
		<ptr target="https://russellgroup.ac.uk/news/new-principles-on-use-of-ai-in-education/" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><surname>GTnum</surname></persName>
		</author>
		<ptr target="https://edunumrech.hypotheses.org/8726" />
		<title level="m">Intelligence artificielle et éducation: Apports de la recherche et enjeux pour les politiques publiques</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Cardona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Rodríguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ishmael</surname></persName>
		</author>
		<ptr target="https://policycommons.net/artifacts/3854312/ai-report/4660267/" />
		<title level="m">Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">Using generative AI (GenAI) in learning and teaching</title>
		<author>
			<persName><surname>UCL</surname></persName>
		</author>
		<ptr target="https://www.ucl.ac.uk/teaching-learning/publications/2023/sep/using-generative-ai-genai-learning-and-teaching" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Assessment in the age of artificial intelligence</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Swiecki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Khosravi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Martinez-Maldonado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Lodge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Milligan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Selwyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gašević</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.caeai.2022.100075</idno>
	</analytic>
	<monogr>
		<title level="j">Computers and Education: Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page">100075</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Exploring new depths: Applying machine learning for the analysis of student argumentation in chemistry</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">P</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kranz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wulff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Graulich</surname></persName>
		</author>
		<idno type="DOI">10.1002/tea.21903</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Research in Science Teaching</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">ChatGPT for good? On opportunities and challenges of large language models for education</title>
		<author>
			<persName><forename type="first">E</forename><surname>Kasneci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Seßler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Küchemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bannert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dementieva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fischer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Learning and Individual Differences</title>
		<imprint>
			<biblScope unit="volume">103</biblScope>
			<biblScope unit="page">102274</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">A review of the literature from 1970 to 2022 on the roles of teachers and artificial intelligence in the field of AI in education</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lepage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Roy</surname></persName>
		</author>
		<idno type="DOI">10.52358/mm.vi16.304</idno>
	</analytic>
	<monogr>
		<title level="j">Médiations et Médiatisations</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="30" to="50" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Tamkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brundage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ganguli</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2102.02503</idno>
		<ptr target="http://arxiv.org/abs/2102.02503" />
		<title level="m">Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Teaching English in the Age of AI: Embracing ChatGPT to Optimize EFL Materials and Assessment</title>
		<author>
			<persName><forename type="first">O</forename><surname>Koraishi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Education and Technology</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>Article 1</note>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">Supporting self-directed learning and self-assessment using TeacherGAIA, a generative AI chatbot application: Learning approaches and prompt engineering</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Choy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Divaharan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.1080/23735082.2023.2258886</idno>
	</analytic>
	<monogr>
		<title level="j">Learning: Research and Practice</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="135" to="147" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">A Systematic Review of AI-Driven Educational Assessment in STEM Education</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Dinh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<idno type="DOI">10.1007/s41979-023-00112-x</idno>
	</analytic>
	<monogr>
		<title level="j">Journal for STEM Education Research</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="408" to="426" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Enhancing teaching through constructive alignment</title>
		<author>
			<persName><forename type="first">J</forename><surname>Biggs</surname></persName>
		</author>
		<idno type="DOI">10.1007/BF00138871</idno>
	</analytic>
	<monogr>
		<title level="j">Higher Education</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="347" to="364" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
