Large Language Models for the Assessment of Students’
                         Authentic Tasks. A Replication Study in Higher Education
                         Daniele Agostini1,*,† , Federica Picasso1,† and Helga Ballardini1,†
                         1
                             University of Trento, Department of Psychology and Cognitive Sciences, Corso Bettini, 84, 38068 Rovereto, Italy


                                        Abstract
                                        After the public release of ChatGPT (November 30th, 2022) and consequently, that of all its competitors, the use
                                        of Large Language Models (LLMs) has become widespread among the public. The most significant impact was
                                        perceived from the very beginning in the field of Education and Instruction [1, 2, 3, 4, 5, 6, 7]. Of particular interest
                                        for this paper is its use both by teachers and students in particular in the context of higher education [8, 4, 9].
                                        The immediacy with which Large Language Models (LLMs) have been integrated into higher education practices,
                                        both by teachers and students, leads to questions of fundamental importance relating to their effectiveness and
                                        reliability. In this field, LLMs become the means through which teachers have the opportunity to revolutionise the
                                        interaction with students, the management of workload and the personalisation of each learning experience [2].
                                        Although these technologies are recognised as having advantages and potential for improving learning in terms
                                        of accessibility and personalisation [7], a crucial question concerns their application in assessment practices,
                                        especially the ability to objectively and impartially evaluate students’ performance. The possibilities of using
                                        these tools in the field of learning evaluation is relatively little known, which implies the need to delve deeper
                                        into the topic for its application both in pedagogical theory and in educational practice. A previous study has
                                        been already published [10] which explored the use of the main LLM in the specific context of assessing students’
                                        papers, and this is a replication study based on it. The purpose of the current study is to explore the possible use
                                        of the main LLMs in the specific context of evaluating students’ written productions, with a focus on the aspects
                                        of accuracy that are evaluated with the help of a rubric proposed by the teacher. This article is part of a series of
                                        contributions that focus on this topic, in light of the principles and application of the AI-Mediated Assessment
                                        for Academics and Students (AI-MAAS) model [11].

                                        Keywords
                                        Large Language Models (LLMs), AI-Assisted Assessment, Rubrics, Authentic Tasks, Academic Assessment


                         1. The Context: AI Assessment in Higher Education
                         In the last two years, Large Language Models (LLM) have taken on a very significant role in the
                         technological landscape thanks to the launch of ChatGPT, followed subsequently by the release of
                         competing models. The impact of LLMs remained relatively limited over time until increasingly simple
                         and intuitive user interface functions were introduced, firstly the "chat" level, which brought the
                         general public closer to these tools. This phenomenon of "democratisation" boosted the commercial
                         and large-scale use of LLMs, which led institutions, companies and individuals to increase investment
                         in this sector [12, 13, 14, 15]. In addition to OpenAI’s ChatGPT, Anthropic’s Claude, Microsoft’s
                         Copilot and Google’s Gemini are just some of the most used LLMs, in addition to the much more
                         numerous open-source models to which Meta’s LLAMA has given a notable boost At the same time,
                         however, this has led to a crisis in search engines since LLMs, without requiring advanced research
                         skills, offer new ways of querying and analysing data, more natural interaction and sufficiently precise

                         AIxEDU 2024: 2nd International Workshop on Artificial Intelligence Systems in Education, November 25–28, 2024, Bolzano, IT
                         *
                           Corresponding author.
                         †
                           D. Agostini: Conceptualisation, Methodology, Investigation, Formal Analysis, Writing - original draft, Writing - review &
                           editing, Resources, Supervision. F. Picasso: Investigation, Writing - original draft, Data curation. H.Ballardini: Investigation,
                           Writing - review & editing.
                         $ daniele.agostini@unitn.it (D. Agostini); federica.picasso@unitn.it (F. Picasso); helga.ballardini@unitn.it (H. Ballardini)
                          https://webapps.unitn.it/du/en/Persona/PER0247709 (D. Agostini);
                         https:https://webapps.unitn.it/du/en/Persona/PER0242228 (F. Picasso); https://webapps.unitn.it/du/en/Persona/PER0033179
                         (H. Ballardini)
                          0000-0002-9919-5391 (D. Agostini); 0000-0002-8381-6456 (F. Picasso); 0000-0002-3603-9551 (H. Ballardini)
                                        © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
and exhaustive answers. For example, LLMs allow users to avoid various typical inconvenient steps
that characterise the standard use of search engines, such as the selection of long lists of websites, the
acceptance of cookies and the appearance of advertising banners. As a result, educational institutions
and agencies have begun to incorporate LLMs and generative AI into their curricula at various levels,
developing courses to harness the potential of these innovative technologies. There is currently a
strong emphasis on AI Literacy [16, 17], which allows professionals from different sectors, including
educational institutions, to deepen their understanding of the fundamental elements of AI generative,
the availability of tools, the functionalities and methods of use that make LLMs effective tools in all
fields [18, 19, 20, 21, 22, 23]. However, one of the most critical issues concerns information management:
LLMs possess enormous potential due to their ability to analyse and generate data; this raises numerous
questions about accuracy, privacy and ethics in information management and ownership of output. The
challenge in this continuously and rapidly evolving field becomes the ability to pay constant attention
and critically evaluate so that end users always use LLMs responsibly [24, 25, 26, 27]. In relation to
this issue, higher education institutions have reacted by placing themselves on the defensive, so much
so that some universities, in order to counteract the possible use of LLMs by students during exam
tests, they have reintroduced the obligation to write by hand and also take oral tests [8, 28]. At the
same time, pieces of software created specifically to detect the productions generated by LLMs were
introduced on the market. However, these turned out to be ineffective, causing management and legal
problems for institutions because students could be unfairly accused of sending texts generated by
artificial intelligence [29, 21, 30]. To avoid such inconveniences, national and international institutions
and universities promptly provided themselves with guidelines that promote ethical behaviour towards
the use of LLMs while maintaining a certain caution, allowing students and teachers to use them
effectively to carry out tasks and benefit institutions. Important international bodies and universities
moved in this direction, such as UNESCO [31, 32], the JISC National Center for AI [33], the Russell
Group [34], the French National Ministry of Education [35], the US Department of Education [36]
and University College London [37]. Assessment tasks have proven to be arguably the ones that can
profit most from the AI technology, especially in terms of sustainability. However, caution is needed as
LLMs without specific task adaptations have proven incapable and unreliable in managing students’
assessments independently [38, 33], while LLMs supported by assessment tools have been shown
to produce satisfactory results [39]. Above all, the use of artificial intelligence by students requires
that teachers know how to take ethical aspects into consideration and act with responsibility when
evaluating tasks and tests which results could have a great impact on students’ careers (for example,
motivation, grades, scholarships, acceptance into master’s or doctoral programs).


2. Theoretical Framework
Since the 1980s the idea of being able to use computerised systems (and now also artificial intelligence)
to assist educators in their assessment tasks and to be able to make precise, impartial and informed
decisions has been present in much literature [40, 41]. The possibility of using LLMs for learning
assessment had already been explored in the period immediately preceding the release of ChatGPT,
while the use of transformer models, including OpenAI’s GPT-3, were already well established. Tamkin
et al. [42] emphasised their educational application, which included:

    • Summary: LLMs are able to summarise even very long texts. This use can help students submit
      concise summaries. Furthermore, various parameters can be considered for the synthesis, and
      this supports educators in providing precise information on the elements of the text that will be
      evaluated.
    • Questions and Answers: LLMs can "understand" various portions of text, answer questions, and
      ask questions when required. These features are useful for providing interactive feedback and
      learning experiences.
    • Classification: LLMs can classify the text into predefined categories: this allows you to introduce
      assisted assessment or classify students’ feedback.
    • Plagiarism detection: by comparing the similarity between different texts, LLMs are very useful to
      detect potential cases of plagiarism among students or to identify the misuse of original materials
      by students.
    • Assessment of knowledge: LLMs can assess students’ understanding of a topic based on their
      written productions, especially if the information is generated from correct homework and with
      the help of an assessment rubric to refer to.

   These five applications are fundamental to using LLMs in learning assessment. Following the
introduction of ChatGPT and other universally accessible LLMs, UNESCO published the guidelines “AI
and education: Guidance for policy-makers” [31], which suggest the following recommendations for
learning assessment:

   1. Test and implement AI technologies to support the assessment of various dimensions of skills
      and outcomes.
   2. Use caution when using an automated assessment with closed-ended, rule-based questions.
   3. Use AI-generated formative assessment as an integrated feature of learning management systems
      (LMS) to analyse learning outcomes more accurately and efficiently and reduce the risk of human
      bias.
   4. Use the ability to provide AI-powered progressive assessments to regularly update students and
      parents.
   5. Examine and evaluate the use of facial recognition and other artificial intelligence capabilities for
      users’ recognition and their tracking in remote online assessments.

   Based on these different theoretical approaches, recommendations and guidelines, the AI-Mediated
Assessment for Academics and Students (AI-MAAS) model was developed which is currently under
validation; it proposes two potential implementations of LLMs for the assessment of learning: the
first one for formative evaluation and the second one for summative evaluation [11]. In both cases,
the selected LLM must be able to evaluate using an assessment rubric provided by the teachers or
by the students. Given the novelty of the tool, so far, there are not many experiments in this field.
Martin et al. [39] worked on this opportunity, starting from the need to be able to assign students even
complex tasks that involve a certain degree of reasoning, abstract conceptualisation and reprocessing
of information; while the correction of these types of tasks (with a large quantity of long open answers)
often proves to be an unsustainable task for teachers. Some researchers, working on this aspect, have
demonstrated that in the evaluation of a chemistry task, for example, it is possible to use LLMs: in this
case, an almost perfect match was obtained between the scores assigned by human raters and the scores
generated by the LLMs. It should be highlighted, however, that Martin and colleagues did not simply
use an LLM to achieve this result. The researchers used a complex procedure that involved, among
other operations, the unsupervised machine learning technique HDBSCAN (Hierarchical Density-Based
Spatial Clustering of Applications with Noise), a cluster mapping and training of a deep neural network
classifier. The aim of this study was to test an operational model and demonstrate its feasibility. This
excellent solution represents the result of models trained on specific tasks and populations; therefore,
it cannot be assumed that the procedure applied can be replicated by any teacher not specialised in
Machine Learning. Other studies have instead used LLMs for assessment purposes without comparing
the performance of the AI with that of a teacher. These studies applied for example in the evaluation
of L2 English tasks [43] and in supporting self-assessment came to satisfactory results [44]. Machine
learning has also been applied in the evaluation of tasks related to STEM disciplines, but without using
LLM [45].
   Finally, a previous study [10] explored the use of the main LLMs in the specific context of assessing
students’ papers, with a focus on their accuracy in assessing according to a rubric developed by the
teacher. The idea was that employing LLMs for assessment in higher education can enable the adoption
of teaching and assessment approaches that were previously unsustainable and unscalable. This
should help to ensure constructive alignment [46] and thereby improve the quality and effectiveness
of university teaching. The study, aimed at selecting the most human-like evaluation amongst LLMs,
highlighted that while some AI models, like ChatGPT-4 and Claude 2, performed well in most of the
assessment criteria, others, such as Microsoft Copilot and Google Bard, were far from human-like
assessment. The article recommends further research on ChatGPT and Claude, with potential inclusion
of open-source models as well as involving multi-shot prompting, expanding the student sample,
involving more evaluators, and refining and redesigning the rubrics.


3. Methodology and Tools
This is a replication study of “Are Large Language Models Capable of Assessing Students’ Written
Products? A Pilot Study in Higher Education” published in “Research Trends in Humanities Education &
Philosophy, 11” [10] that follows the same methodology with updated LLMs employed, a greater sample
of students and of human evaluators. It explores the use of leading Large Language Models (LLMs) in
the specific context of assessing student written products, focusing on their precision and ability to
evaluate according to a grading rubric developed by the teacher. The goal is to understand whether and
which models can be used by university and non-university educators who are not experts in Machine
Learning to assess students’ written products, even in the presence of open-ended tasks and questions,
thanks to grading rubrics. The pilot study was conducted at the University of Trento within the
context of a university habilitation course for secondary school teachers, during the module concerning
learning methodologies. One-hundred-fourty-two students participated anonymously, divided into
35 groups, along with 3 evaluating teachers, experts in experimental pedagogy and assessment. No
data regarding the students’ demographics was collected. The groups were tasked with carrying out an
authentic task, namely to re-designing a past educational intervention that proved to be unsuccessful,
targeted at a specific class (which could range from 1st grade of lower secondary school to 5th grade
of higher secondary school, depending on the group composition). They were instructed to identify
the past teaching approaches and strategies and to now think of different ones more suited to reach
the intended learning outcomes. Furthermore, students’ reflection and redesign ability was evaluated
through the rubric of reference. To complete this task, groups were given two hours and thirty minutes,
and a template for the educational design consisting of the following sections was provided: Involved
Disciplines, Class and Grade Level, Intervention Title, Teacher, Programme and Learning Objectives,
Context and Environment (formal, informal, type of setting, etc.). Moreover, a description of the
reflection process applied for renewing the formative design is required, and it is considered under
evaluation. In the part of the schedule with details, they were asked to explain the programming
with concise descriptions of the various educational activities, the teacher’s tasks, and those of the
students. Within this framework, groups had the freedom to propose their original programming. The
final product of each group is thus an MS Word file containing the programming of the educational
intervention according to the described template. For the evaluation of the products, the following
grading rubric (Table 1) was prepared, consisting of five evaluation criteria with four levels for each
criterion.
   Three expert human evaluators and seven LLMs (plus one that merged the feedback of all 7 models
in a single one) evaluated all the student groups’ products. The LLMs selected for this study were the
most popular competing models at the time, and they are applied in the assessment process through
the use of big-AGI (https://big-agi.com/). Big-AGI is an AI suite created to make advanced artificial
intelligence accessible and was chosen for ease of adding several models through API, the possibility of
imparting system prompts and the function (called “beam”) for sending the same prompt to several
LLMs at the same time. Human results were then compared with the results produced by the LLMs
with various statistical analysis (see Method of Analysis section). The models used are:

   1. GPT-4o: released in May 2024, GPT-4o is a multilingual, multimodal generative pretrained
      transformer developed by OpenAI. The model is capable of processing and generating text,
      images, and audio, making it a versatile tool for a wide range of tasks. Its multimodal capabilities
  Table 1
  Rubric for the Assessment of the Educational Intervention
Assessment         Insufficient Level (1      Sufficient Level (2       Good    Level        (3    Excellent Level (4
Criteria           point)                     points)                   points)                    points)
Understanding      Demonstrates a lim-        Shows       a     basic Applies educational ar-      Demonstrates a thor-
and applying       ited understanding of      understanding of edu- chitectures correctly,         ough understanding of
educational        educational architec-      cational architectures, with a good under-           educational architec-
architectures      tures, with applica-       applying them gener- standing of their use in        tures, applying them
                   tions not always ap-       ally correctly but with the specific context.        in an innovative and
                   propriate or consis-       some uncertainties.                                  contextually relevant
                   tent.                                                                           manner. Clearly justi-
                                                                                                   fies the choices made.
Selection          The teaching strate-       Uses some relevant        Selects and imple-         Selects and imple-
and      imple-    gies chosen are limited    teaching strategies,      ments     appropriate      ments highly effective
mentation          or not always appro-       but their implemen-       teaching       strate-     and        diversified
of    teaching     priate for the objec-      tation could be more      gies with a good           teaching strategies,
and learning       tives of the interven-     targeted or diversified   correlation to the         perfectly     adapted
strategies         tion.                      in relation to the        intervention goals.        to the objectives
                                              intervention goals.                                  and context of the
                                                                                                   intervention.
Definition of      The objectives are         The objectives are        The objectives are well-   The objectives are
the intended       vague, not measurable      present but could be      defined and generally      clear, specific, mea-
learning out-      or not aligned with        more specific or better   aligned with the cho-      surable and perfectly
comes              the chosen teaching        aligned with the          sen teaching architec-     aligned with the
                   architectures    and       teaching architectures    tures and strategies.      chosen       teaching
                   strategies.                and strategies.                                      architectures     and
                                                                                                   strategies.
Detailed scan-     The scan is incomplete,    The scan is present but   The scan is clear          The scan is detailed,
ning of the in-    unclear, or lacks a log-   could be more detailed    and generally well-        logical and well-
tervention         ical progression of ac-    or better structured in   structured, with a         structured, with a
                   tivities.                  some parts.               good progression of        clear progression of
                                                                        activities.                activities and realistic
                                                                                                   timeframes.
Critical reflec-   There is a lack of crit-   Includes some reflec- Provides good reflec-          Provides deep and crit-
tion on the re-    ical reflection on the     tion on the changes, tion on the changes,            ical reflection on the
design process     changes made or the        but the analysis could with clear links to the       changes made, clearly
                   justifications are su-     be more thorough.      learning objectives.          justifying each choice
                   perficial.                                                                      in relation to the learn-
                                                                                                   ing objectives.


    enable a deeper integration of different data formats, enhancing its utility in complex applications.
    Link: https://openai.com/index/hello-gpt-4o/
 2. Gemini 1.5 Pro Latest: a large language model developed by DeepMind (Google), is natively
    multimodal and supports an extended context window of up to two million tokens, which is
    currently the longest of any large-scale foundation model. This expansion in token capacity allows
    for processing more extensive sequences of data, thereby increasing its utility in tasks that require
    long-term contextual understanding. Link: https://deepmind.google/technologies/gemini/pro/.
 3. Claude 3.5 Sonnet: developed by Anthropic, excels in the ability to understand nuanced language,
    humour, and complex instructions. It is designed to generate high-quality content in a relat-
    able, natural tone, showing marked improvements in areas such as writing and human-centric
    communication. Link: https://www.anthropic.com/news/claude-3-5-sonnet.
 4. Mistral Large (2402): is designed to excel in complex reasoning tasks, particularly in multilingual
    contexts. The model is highly effective in text understanding, transformation, and code generation.
      It demonstrates top-tier performance in handling sophisticated reasoning challenges, making it
      a robust tool for both natural language processing and technical tasks. Link: https://mistral.ai/
      news/mistral-large/.
   5. Open Mixtral 8x22B (2404): is one of the latest model developed by Mistral, featuring a sparse
      Mixture-of-Experts (SMoE) architecture. Despite its large size, with 141 billion parameters, only
      39 billion parameters are actively engaged during processing, optimising both performance and
      cost efficiency. This approach sets new standards in the AI community for balancing model
      complexity with computational resource usage. Link: https://mistral.ai/news/mixtral-8x22b/.
   6. Llama 3.1 70B Instruct Turbo: developed by Meta, is a 70-billion parameter language model de-
      signed for instruction-following tasks. The model is optimised to improve interactions where clear
      guidance or step-by-step reasoning is required, positioning it as an effective tool for applications
      in both academic and practical domains. Link: https://ai.meta.com/blog/meta-llama-3-1/
   7. Qwen2 72B Instruct: developed by Alibaba Cloud, is a 72-billion parameter language model
      optimised for instruction-based tasks. It integrates the latest advancements in generative AI,
      offering improved efficiency in tasks ranging from conversational AI to complex text generation
      and reasoning. Its design caters specifically to high-performance needs in both commercial and
      research applications. Link:https://www.alibabacloud.com/en/solutions/generative-ai/qwen?_p_
      lc=1

   All these LLMs can "understand" and write in Italian, but it cannot be ruled out that performance
in English may be different (presumably better, since most of the training is done in that language).
Mistral’s models were added for their specific training with European languages that renders them
“natively fluent in English, French, Spanish, German, and Italian, with a nuanced understanding of
grammar and cultural context”. Privacy shouldn’t be a concern since there is no data saved to LLM
provider servers due to our use of API on a local instance of big-AGI. All the conversations are saved
only locally.


4. Prompting
This study aimed to understand which models could be used by university educators (and, potentially,
other educators) to assess students’ products. For this reason, overly sophisticated prompting techniques
were not used; instead, what an educator might do by providing clear instructions and giving the
necessary context data for evaluation was employed. The LLMs systems were promoted through the
following instruction (originally written in Italian). The first one is the System Prompt:

      You are an experienced and impartial university lecturer. Your job is
      to assess the quality of student assignments according to a specific
      assessment rubric.
      **How to respond to requests:**
      * Do not express personal opinions or subjective judgements.      *
      Focus exclusively on the criteria provided in the rubric. * Provide
      a fair and impartial assessment based on the task’s adherence to
      the criteria. * Carefully review the student’s entire paper before
      beginning the assessment. * Offer constructive suggestions as to
      how the student might improve. * Uses clear and concise language.
      * Justify the marks awarded with specific references to the paper
      and the rubric. * In your assessment, take into account that the
      students only had 2 hours for planning.
      **Request format
      Each request will include:
     * **The student’s assignment:** The text of the assignment you are
     to assess.   * **The grading rubric:** A list of criteria with
     descriptions for each grade level.
     **Response format:**
      Your answer should follow this format:
     **Title of the paper (also called title of the paper) as it appears
     in the document: [insert title here]**.
     **Total score:** [Insert total score here].
     **Scoring breakdown:**
      | Criterion | Score | Comments |—|—|—| | [Criterion 1] | [Score]
      | [Comments with specific examples from the task] | | [Criterion
      2] | [Score] | [Comments with specific examples from the task] | |
      [Criterion 3] | [Score] | [Comments with specific examples from the
      task] | | ... | ... | ... |
     **Suggestions for improvement
     * [Suggestion 1] * [Suggestion 2] * ... **Answer following the answer
     format provided above.**

   The second is the prompt that were given to the LLMs to assess the products (originally written in
Italian):

      Evaluate the attached teaching design (student task) that was
      created by a group of students from the secondary school teaching
      qualification course. The key competence of this assignment lay in
      being able to design a teaching intervention that makes effective
      use of teaching architectures and strategies. In particular, the
      group’s competence in terms of redesign and depth of reflection is
      taken into account with respect to previous instructional design. At
      the same time, the instructional design had to prove effective in
      achieving the goals they set themselves. Take into account that the
      students only had 2 hours to design. Use the evaluation rubric below
      to assess:
     <Starting teaching design evaluation rubric>
      Evaluation rubric:
      Criterion   1  -         Understanding        and    application        of    teaching
      architectures:
     - Insufficient (award 1 point):        Limited understanding and
     applications not always appropriate. - Sufficient (award 2 points):
     Basic understanding with some uncertainties in application. - Good
     (awarded 3 points): Correct application and good understanding. -
     Excellent (awarded 4 points): Thorough understanding and innovative
     and relevant application.
      Criterion 2 - Selection and implementation of teaching strategies:
     - Insufficient (award 1 point): Limited or not always adequate
     strategies. - Sufficient (award 2 points): Relevant strategies but
     implementation can be improved. - Good (award 3 points): Strategies
     appropriate and related to the objectives. - Excellent (award 4
     points): Highly effective, diverse and well adapted strategies.
      Criterion 3 - Definition of learning objectives:
     - Insufficient (award 1 point): Vague or non-measurable objectives.
     - Sufficient (award 2 points): Objectives present but not very
      specific.   - Good (award 3 points): Well-defined and generally
      aligned objectives. - Excellent (award 4 points): Clear, specific,
      measurable and perfectly aligned objectives. Criterion 4 - Detailed
      scanning of the intervention
     - Insufficient (award 1 point): Incomplete or unclear scan.          -
      Sufficient (award 2 points): Scan present but can be improved in
      structure. - Good (awarded 3 points): Clear and well-structured scan.
     - Excellent (awarded 4 points): Detailed, logical and well-structured
      scanning.
      Criterion 5 - Critical reflection on redesign:
     - Insufficient (award 1 point):        Lack of critical reflection
      or superficial justifications.     - Sufficient (award 2 points):
      Reflection present but not very thorough. - Good (awarded 3 points):
      Good reflection with clear connections. - Excellent (award 4 points):
      Deep and critical reflection, clear justifications.
      <end of assessment rubric>
      **Total score:**
      **Scoring distribution:**

   The prompt was sent simultaneously to all the LLMs involved. Through big-AGI the authentic
task document was attached in PDF format. A zero-shot prompting procedure was used for all LLMs,
meaning that no examples of human task assessments were given to the models. It is possible for a
university instructor to provide an example that can enhance the quality of LLM assessments, however,
the goal in this instance was to choose the most suitable models for this type of evaluation, not to find
methods for optimising the results. Finally, an “eighth” LLM evaluator has been added, which uses
the “Beam” function of big.AGI software. All the nuanced answers from the 7 LLM have been sent for
consideration and synthesis to GPT-4o, resulting in an eighth assessment that considers the feedback
from all seven LLMs.


5. Models’ Settings
In order to set a common limit of length for every model’s answers, all of them have been set through
big.AGI API controls to 8128 tokens maximum. Also, the temperature was set to 0.2, that should ensure
quite strict adherence to the instructions yet leave some room for creativity in answers.

5.1. Attention to Tokens and Context
Understanding tokens and context is crucial when using a Large Language Model (LLM). Tokens can be
simplified as units of text that might consist of a word, part of a word, or even a single character. The
characteristics of tokens can vary between models. However, it is generally safe to assume that, on
average, English might require one to one and a half tokens per word, and Italian might need one and a
half to two tokens per word. The context window, another essential concept, represents the number
of tokens a language model can consider simultaneously when generating responses. This context
depends on the model used and the available memory. Exceeding a model’s context window could
cause errors if it happens in a single prompt or, in a more extended conversation, the model might start
ignoring the earlier parts of the dialogue to make room for more recent inputs. Therefore, preserving
context is vital for generating coherent and relevant responses. It is important to note that not only
the user’s prompts consume context, but the model’s responses do as well. To preserve the context
window, some LLMs platforms impose a character limit on the prompts that can be sent and on the
length of the generated responses, which are shorter than the maximum context window. Contrasting
this replication study with the original one, it can be noted that context windows are decidedly wider
than the ones that were found in LLMs one year ago, posing less of threat to the coherence of the
assessment. Below Table 2 illustrates the maximum context window size for each of the models used:

   Table 2
   Context Windows of the used LLMs. The context windows refer to the APIs. Note that this feature may
   change with updates.
 Large Language Model (versions available in Italy, September 2024)        Context Window (in tokens)
 GPT-4o                                                                    128,000
 Claude 3.5 Sonnet                                                         200,000
 Gemini 1.5 Pro Latest                                                     2,000,000
 Mistral Large (2402)                                                      32,000
 Mixtral 8x22B (2404)                                                      64,000
 Meta Llama 3.1 70B Instruct Turbo                                         131,072
 Qwen2 72B Instruct                                                        32,768


6. Method of Analysis
The analysis method for evaluating the data involved examines the levels assigned by each evaluator
(both LLMs and humans) to the various criteria of the rubric for each of the 35 group products. Each of
the seven evaluators assigned a level to each of the five criteria for every product, resulting in each
evaluator assigning a level to a total of 175 criteria. Several statistical techniques were employed to
extract insights from the data, including Principal Component Analysis (PCA), analysis of standard
deviation, and the creation of a disagreement index among evaluators. Microsoft Excel and JASP (based
on R) were used for the statistical analyses.


7. Results
The consistency of the assessment for different models has been tested over the evaluation of three
random tasks from the sample for three times each by each one of the models. For this test, each LLM
assessed a total of 45 criteria.
  From these tests, the following behaviours were observed:

    • GPT-4o, Gemini 1.5 Pro Latest and Claude 3.5 Sonnet were extremely consistent, with only one
      instance of a different assessment for one criterion, by just one point.
    • Mistral Large 2402 was perfectly consistent with zero instances of different assessments.
    • Open Mixtral 8x22B (2404) and Qwen2 72B Instruct were quite consistent, with five instances of
      different assessments of single criterion by one point.
    • Llama 3.1 70B Instruct Turbo: was the fairly inconsistent, with 19 instances of different criteria
      assessment by one point.

7.1. Principal Component Analysis
The first analysis conducted, in addition to descriptive data, was the PCA, a dimensionality reduction
technique that allows the identification of latent variables within the data and that can represent a
general model of the data. Three principal components were identified from the PCA conducted on the
assessment data (Table 3).
    Table 3
    PCA Component Loadings
                      Evaluator                         RC1      RC2      Uniqueness
                      e1 (GPT-4o)                       0.579    0.327       0.472
                      e2 (Gemini 1.5 Pro)                        0.421       0.832
                      e3 (Claude 3.5 Sonnet)            0.695    0.312       0.321
                      e4 (Mistral Large 2402)           0.775                0.430
                      e5 (Mixtral 8x22B 2404)           0.651                0.592
                      e6 (Llama 3.1 70B Instruct)       0.743                0.405
                      e7 (Qwen2 72B Instruct)           0.825    -0.326      0.336
                      e8 (Merge of 7 LLMs by GPT-4o)    0.802                0.301
                      e9 (Human Evaluator 1)                     0.515       0.708
                      e10 (Human Evaluator 2)                    0.753       0.420
                      e11 (Human Evaluator 3)                    0.644       0.604
                      Note. Applied rotation method is promax.


   The first component (RC1) is formed by evaluators e1, e3, e4, e5, e6, e7, e8 loadings, which correspond
respectively to the LLMs GPT-4o, Claude 3.5 Sonnet, Mistral Large, Mixtral 8x22B, Llama 3.1 70B, Qwen2
72B and the merge of LLMs opinion. The second component (RC2) comprises those of e2, e9, e10 and
e11 corresponding to Gemini 1.5 Pro and human evaluators 1, 2 and 3. As can be appreciated in Figure 1,
Both GPT-4o and Claude 3.5 Sonnet contributes mainly to RC1 component but also to RC2. Gemini Pro
1.5 on the other hand, contributes only to RC2 component (the tiny loading to RC1 is negative). Trying
to name the identified components, RC1 could be called "LLM Evaluation Pattern" and RC2 "Human
Evaluation Pattern".

7.2. Analysis of Standard Deviation of Grades by Product and Assessment Criterion
To understand how assessments differed from criterion to criterion and from evaluator to evaluator, an
analysis was conducted on the standard deviation (SD) of the different variables of the study. The criteria,
numbered or abbreviated in some of the graphs, are those listed in Table 4. Firstly, an effort was made
to identify which assessment criteria had the slightest and the most SD (Table 4) to understand which
were assessed more consistently by all evaluators. The criteria with the minimum SD across all products
is Criterion 4 and 1 (“Detailed scanning of the intervention” and “Understanding and application of
teaching architectures”), with an average of about 0.5. This suggests a high level of agreement among
evaluators in assessing the quality and details of the detailed activities envisaged in the educational
design and the understanding and correct application of the teaching architectures at their bases. On the
other hand, the criterion with the maximum SD among all activities is Criterion 5 (Critical reflection on
redesign), with an average of about 0.8. This indicates a higher level of disagreement or inconsistency
in how evaluators assessed the quality of teacher’s critical reflection about their past activities and the
way in which they tried to improve them.

7.3. Agreement Index
An "Agreement Index" (AIdx) was developed to obtain a more robust metric and better understand
which evaluators assigned more similar scores for the various criteria. This index combines the average
difference between the scores assigned to a criterion and the variability of this difference. It was
calculated to understand which evaluators are most similar to the human ones for each criterion. While
LLMs evaluators are treated individually, the human benchmark is an average of the human evaluators’
(e9, e10 and e11) assessments. It is constructed as follows:
Figure 1: PCA Path Diagram. The diagram shows two main components (RC1 and RC2) and their relationships
with different evaluators. RC1 represents the "LLM Evaluation Pattern" while RC2 represents the "Human
Evaluation Pattern".


   Table 4
   Average standard deviation of scores assigned to criteria
       Criterion   Description                            Average Standard     Percentage of
                                                             Deviation        Total Range (1-4)
       5           Critical reflection on redesign              0.76                 25%
       2           Selection and implementation of              0.64                 21%
                   teaching strategies
       3           Definition of learning objectives            0.61                 20%
       1           Understanding and application of             0.54                 18%
                   teaching architectures
       4           Detailed scanning of the                     0.53                 17%
                   intervention


                                Average difference + Variability of the difference
                      AIdx =                                                                        (1)
                                                       2
  Therefore:

    • The "Average Difference" is the absolute average difference in scores assigned between the
      evaluator in question and the average of human evaluators across all tasks and criteria.
    • The "Variability of the Difference" is the standard deviation of the difference scores between the
      tested evaluator and the reference evaluator, reflecting how consistent these differences are across
      different tasks and criteria.

   AIdx is calculated individually for each evaluator. It provides a single measure that encapsulates the
average magnitude of evaluation differences relative to the reference evaluator and the consistency of
such differences. A lower value indicates a more significant overall agreement in evaluation relative to
the human evaluator. The highest possible value for the index for an evaluator would be achieved if
they constantly evaluated at the maximum difference from the human evaluators (3 points).
   The LLM evaluator who provided assessments most similar to the average of human evaluators
(calculated through the AIdx) is GPT-4o, followed at a negligible distance by Claude 3.5 Sonnet. On the
other hand, the LLM evaluator with the worst AIdx is Qwen2 72B (Table 5).

    Table 5
    Agreement Indices (AIdx) with reference to the average of human evaluators.
                  Evaluator                            Agreement Index with "average
                                                          human" (lower is better)
                  Human 3                                             0.39
                  Human 2                                             0.41
                  Human 1                                             0.45
                  GPT 4o                                              0.46
                  Claude Sonnet 3.5                                   0.47
                  Merge (GPT4o)                                       0.49
                  Llama 3.1 70B Instruct Turbo                        0.52
                  Mixtral 8x22B (2404)                                0.53
                  Mistral Large (2402)                                0.53
                  Gemini1.5 Pro Latest                                0.56
                  Qwen2 72B                                           0.72


   Focusing on the single criterion (Table 6), it can be noted how AIdx with other evaluators vary from
criterion to criterion. Unexpectedly, Qwen2 72B, the worst on the general AIdx with the “average”
human evaluator, is the single model that is most human-like in three criteria out of five. Its main
problem is that it assessed in a very different way from humans the most difficult criterion: criterion
number 5 “Critical reflection on redesign” (Table 6). It also did not fare optimally in criterion number
3 “Definition of learning objectives”. Other LLMs like GPT-4o and Claude 3.5 Sonnet, as well as the
Merge of the different LLMs feedback, keep a good Agreement Index across the board.

7.4. Assessment correlations among LLM and Human evaluators
As reported in Table 7 the model with higher correlation with human evaluation is by far Gemini 1.5
Pro (r = 0.84), followed at a distance by Claude Sonnet 3.5 (r = 0.66), then by GPT-4o (r = 0.59), the merge
of the LLMs feedback by GPT-4o (r = 0.58) and Llama 3.1 70B (r = 0.45). This suggests that Gemini 1.5
Pro’s pattern of scores across the criteria is the most similar to that of the human evaluators.
   But what happens excluding single criteria from the correlation analysis? That could help in under-
standing what criteria makes the assessment “human” and what LLMs struggle with:

    • Excluding Criterion 3 (Definition of learning objectives): When Criterion 3 is excluded
      all LLMs’ correlation indexes significantly improve. Noticeably, for Mistral Large and Qwen2
      72B the jump is from being hardly correlated, or not at all (r = 0.27 and -0.06 respectively), to
      being significantly correlated (r = 0.86 and 0.88). Excluding Criterion 3 also significantly reduces
      the correlation of Gemini 1.5 Pro suggesting that this was the Criterion that it got right and
    Table 6
    Agreement Indices (AIdx) of evaluators compared to the average human evaluator divided by criterion

Crit.         GPT    Gemini        Claude         Mistral     Mixtral       Meta         Qwen2       Merge           Best   Second
               4o    1.5 Pro       Sonnet         Large       8x22B        Llama          72B        (GPT-           LLM     Best
                     Latest          3.5          (2402)      (2404)       3.1 70B                    4o)                    LLM
1             0.50       0.41           0.50        0.43        0.41        0.43          0.39        0.47       Qwen2       Gemini
                                                                                                                  72B       1.5 Pro /
                                                                                                                             Mixtral
                                                                                                                              8x22B
2             0.56       0.64           0.57        0.47        0.55        0.57          0.40        0.55       Qwen2       Mistral
                                                                                                                  72B         Large
3             0.43       0.60           0.43        0.49        0.68        0.62          0.70        0.41       Merge       GPT-4o
                                                                                                                 (GPT-      / Claude
                                                                                                                  4o)        Sonnet
                                                                                                                               3.5
4             0.37       0.46           0.35        0.37        0.48        0.38          0.33        0.32       Merge       Qwen2
                                                                                                                  (GPT-        72B
                                                                                                                   4o)
5             0.41       0.63           0.45        0.80        0.44        0.55          1.24        0.62       GPT-4o     Mixtral
                                                                                                                            8x22B


    Table 7
    Pearson correlation coefficients calculated between the average scores per criterion for each Large
    Language Model (LLM) and the average scores per criterion of human evaluators. Single criteria have
    been excluded to understand what makes the pattern “human”.
        LLM                     Total          Excl. Crit.   Excl. Crit.   Excl. Crit.    Excl. Crit.   Excl. Crit.
                                                   1             2             3              4             5
        GPT-4o                  0.59              0.45          0.60          0.65           0.74            0.66
        Gemini1.5 Pro           0.84              0.79          0.99          0.47           0.82            0.86
        Latest
        Claude Sonnet           0.66              0.54          0.66          0.78           0.71            0.81
        3.5
        Mistral Large           0.27              0.16          0.21          0.86           0.21            0.88
        (2402)
        Mixtral 8x22B           0.15              0.24          0.01          0.53           0.06            0.19
        (2404)
        Llama 3.1 70B           0.45              0.34          0.42          0.74           0.47            0.69
        Instrict Turbo
        Qwen2 72B               -0.06            -0.20         -0.13          0.88           -0.11           -0.77
        Merge (GPT4o)            0.58             0.45          0.58          0.75           0.63            0.75


      mostly contributed to its excellent general correlation to humans’ assessment. This suggests that
      Criterion 3 may be peculiarly human-like in its application, which these models struggle to mimic
      accurately. The high increase implies that Criterion 3 might involve a complex judgment that
      those models are incapable to handle or contextual information that is not being passed to the
      model.
    • Excluding Criterion 2 (Selection and implementation of teaching strategies): Excluding
      Criterion 2 doesn’t change LLMs correlation with human assessment, except for Gemini 1.5 Pro
      Latest. Gemini shows an almost perfect correlation of r = 0.99 when Criterion 2 is excluded,
      which is remarkable, but, even in this case, this criterion doesn’t seem to be crucial.
    • Excluding Criterion 5 (Critical reflection on redesign): The exclusion leads to a substantial
      increase in correlation for Mistral Large (from r = 0.27 to 0.88) and a notable improvement for
      several other models. This criterion, similarly to Criterion 3, may also represent aspects of human
      judgement that are challenging for models to replicate accurately.


8. Discussion
Regarding the goal of understanding whether educators without expertise in machine learning can
employ current Large Language Models (LLMs) to assess students’ written authentic tasks using
assessment rubrics, the analyses have revealed several interesting elements:

    • Differently from a previous iteration of the study [10], all the models have enough context window
      to perform this task.
    • From the PCA, it appears that human evaluators generally have a different pattern of evaluation
      compared to LLMs.
    • In contrast with the evaluation pattern, the Agreement Index (AIdx) measures both the magnitude
      of the score differences and their consistency. A high Agreement Index value suggests significant
      discrepancies between the model’s scores and the average human scores, despite possibly similar
      trends in the pattern. Transforming in percentage the AIdx of each model referred to the aver-
      age human an accuracy metric has been achieved. This helps to better visualise each model’s
      performance (Fig. 2 and Fig. 3)
    • Only Llama 3.1 70B was inconsistent in the repeated assessment of the same task.
    • Gemini 1.5 Pro is the LLM model with the evaluation pattern more similar (with by far the higher
      correlation) to the human’s (see Table 7). It is the only model that in the PCA results only in the
      component of human assessment (Fig. 1). On the other hand, its AIdx was the second worst, just
      before Qwen2 72B (Table 5, Fig. 2).
    • GPT-4o and Claude 3.5 Sonnet have evaluation patterns not too dissimilar from the human’s (Fig
      1, Table 7) and on average attribute marks more similar to humans than any other model (Table
      5).
    • Llama 3.1 70B Instruct was the best of the open models, and the fourth in total (Table 5), after
      the already mentioned three proprietary models. It behaved quite well in the correlation index
      with the humans’ assessment pattern with a moderate correlation (Table 7) and has a good AIdx.
      The problem with this model is the inconsistency of the assessment of the same task, where it
      “changed its mind” 19 times out of 45. It would be interesting to understand if that inconsistency
      has to do with the quantisation applied by Together AI, the API provider used.
    • Mixtral 8x22B and Mistral Large fared similarly with patterns quite dissimilar to the human’s
      and AIdx which are pretty decent (similar to Llama’s). The correlation of Mistral Large with the
      human pattern of evaluation, when Criterion 3 is removed, is the second highest, thus giving
      reasons to follow it closely and keep it in the test pool.
    • Qwen2 72B, an open LLM, would have been by far the best LLM overall (and Mistral Large would
      have been the second) if it weren’t for Criterion 3. Criterion 3 posed a grave problem for Qwen2
      both from the assessment pattern and from the AIdx point of view (Fig. 2 and Fig. 3).
    • Criterion 3 (Definition of learning objectives), in a larger part, and Criterion 5 (Critical reflection
      on redesign), in a smaller part, appear to be the most discriminative criteria in terms of capturing
      what makes the human evaluation pattern unique for this assessment task (Fig. 2 and Fig. 3). These
      criteria likely involve nuances and complexities in judgment that are particularly human-like
      and challenging for LLMs to capture accurately, or the authors might have failed to provide all
      the relevant contextual information regarding these criteria to LLMs. This last hypothesis seems
      relevant because, in the previous iteration of the study, this same criteria was the easiest one for
      LLM to assess in a human-like manner [10].

  Based on the available data, it appears that the more suitable LLMs for the assessing students’
authentic tasks using an assessment rubric are Claude 3.5 Sonnet and GPT-4o. That is because they fared
well both on the assessment pattern (PCA and correlations) and in the agreement index (magnitude
and consistencies of scores). On the other hand, Gemini 1.5 Pro is the one that had by far the most
human-like assessment pattern, but fell short on the AIdx, attributing marks that were very different
from the humans’.
    Qwen2 7B and Llama 3.1 70B deserve a mention as they are open models, and if not for some flaw
would have been at the level (or better than) the aforementioned proprietary models. Llama has a problem
of inconsistency of the marks assigned for each criterion, while Qwen2 really just got one criterion very
wrong. It might be useful to know that for both of them, Together AI (https://www.together.ai/) was
used as an API provider. It applies quantisation of Floating Point 8-bit (FP8) for Llama 3.1 70B Instruct
Turbo, while Qwen2 72B Instruct is run at full-precision Floating Point 16-bit (FP16).
    Human evaluators have a pattern of evaluation (see the PCA) that can be usually distinguished from
the LLMs’ one, but Gemini 1.5 Pro, if not for its very different score attribution, has very similar patterns.
It is interesting to note that human evaluators among themselves have different score attributions (see
Table 5 and Figure 3), but as for critical criteria they assess similarly.
    All considered, presently, none of the LLMs can be used for autonomous evaluation for all criteria,
especially regarding the more complex and the less contextualised ones. This confirms what Webb [33]
highlighted. However, Claude 3.5 Sonnet, GPT-4o and, with some caution, Qwen2 72B Instruct have the
potential to be used as solid support for evaluation for the summative evaluation level as described in
the AI-MAAS (AI-Mediated Assessment for Academics and Students) model [11].


Figure 2: Radar graph of the LLMs’ AIdx (transformed in percentage) with the average human for each criterion.
Figure 3: Radar graph of the LLMs’ and Humans’ AIdx (transformed in percentage) with the average human for
each criterion.


9. Conclusions
The fundamental question of this study was whether and which current Large Language Models (LLMs)
can be used by university educators (but it applies to other educators and instructors, too), even those
without technical experience, to assess student-written authentic products in the presence of open tasks
and questions using assessment rubrics. Indeed, using these technologies could make assessment more
sustainable and scalable, allowing for more consistent alignment with declared learning objectives. This
study has allowed us to determine that Claude 3.5 Sonnet, GPT-4o and, with some caution, Qwen2 72B
Instruct have the potential to be used as solid support for summative evaluation. According to this study,
the use of LLMs can be beneficial, but only if they are used under proper supervision. They should
be seen as assistance for university educators and not as a substitute for assessments. The available
data does not indicate that they are reliable enough to perform assessments independently, even if they
are getting close to it. In fact, some criteria that is too complex or needs additional information about
the context or specific subject can be evaluated in a way that is not in line with human assessment.
This finding confirms the guidelines as stated by Miao et al. [31] and Webb [33]. The limitations of the
present study lie in the sample size of student products that need to be significantly increased, as well
as the number of human expert evaluators and the disciplines involved in the tests. The assessment
rubric can also be optimised and, especially for the most critical criteria (such as Criterion 3), it would
be important to experiment on its formulation to understand if it could have been a human error in
defining the criteria that made it difficult to interpret by the LLMs. The idea behind this study is that it
should be expanded and updated on a rolling basis to adjust the discussion and bring useful novelties
into the assessment practice. Future evolutions of the study might include multi-shot prompting and
the evaluation of textual feedback and assessment to tasks. Feedback that could be provided during the
assessment for each of the criteria provided in a rubric deserve particular exploration [11, 32, 9, 42].


Acknowledgments
The authors would like to thank Elena Benini, PhD Student, for her contribution on the assessment of
students’ authentic tasks. Thanks also to Prof. Massimo Stella for the fruitful discussion about statistical
methods. Both of them work at the Department of Psychology and Cognitive Sciences of the University
of Trento.


References
 [1] A. Baytak, The acceptance and diffusion of generative artificial intelligence in education: A
     literature review, Current Perspectives in Educational Research 6 (2023) Article 1. doi:10.46303/
     cuper.2023.2.
 [2] S. Elbanna, L. Armstrong, Exploring the integration of ChatGPT in education: Adapting for
     the future, Management & Sustainability: An Arab Review 3 (2023) 16–29. doi:10.1108/
     MSAR-03-2023-0016.
 [3] A. Extance, ChatGPT has entered the classroom: How LLMs could transform education, Nature
     623 (2023) 474–477. doi:10.1038/d41586-023-03507-3.
 [4] S. Roy, V. Gupta, S. Ray, Adoption of AI ChatBot like Chat GPT in Higher Education in India: A
     SEM Analysis Approach, Economic Environment 4 (2023) 130–149. doi:10.36683/2306-1758/
     2023-4-46/130-149.
 [5] N. Saif, S. U. Khan, I. Shaheen, A. Alotaibi, M. M. Alnfiai, M. Arif, Chat-GPT; validating Technology
     Acceptance Model (TAM) in education sector via ubiquitous learning mechanism, Computers in
     Human Behavior (2023) 108097. doi:10.1016/j.chb.2023.108097.
 [6] C. K. Tiwari, M. A. Bhat, S. T. Khan, R. Subramaniam, M. A. I. Khan, What drives students toward
     ChatGPT? An investigation of the factors influencing adoption and usage of ChatGPT, Interactive
     Technology and Smart Education (2023). doi:10.1108/ITSE-04-2023-0061.
 [7] F. Kamalov, D. Santandreu Calonge, I. Gurrib, New era of artificial intelligence in education:
     Towards a sustainable multifaceted revolution, Sustainability 15 (2023) 12451.
 [8] M. Perkins, Academic Integrity Considerations of AI Large Language Models in the Post-Pandemic
     Era: ChatGPT and Beyond, Journal of University Teaching and Learning Practice 20 (2023).
 [9] M. Sullivan, A. Kelly, P. Mclaughlan, ChatGPT in higher education: Considerations for academic
     integrity and student learning, Journal of Applied Learning & Teaching (2023). doi:10.37074/
     jalt.2023.6.1.17.
[10] D. Agostini, Are large language models capable of assessing students’ written products? A pilot
     study in higher education, Research Trends in Humanities Education & Philosophy 11 (2024)
     38–60.
[11] D. Agostini, F. Picasso, Large language models for sustainable assessment and feedback in higher
     education, Intelligenza Artificiale 18 (2024) 121–138.
[12] T. Babina, A. Fedyk, A. X. He, J. Hodson, Firm Investments in Artificial Intelligence Technologies
     and Changes in Workforce Composition, Working Paper 31325, National Bureau of Economic
     Research, 2023. doi:10.3386/w31325.
[13] Generative       AI       to   become       a     $1.3     trillion    market     by     2032,     re-
     search         finds,         2023.       URL:        https://www.bloomberg.com/company/press/
     generative-ai-to-become-a-1-3-trillion-market-by-2032-research-finds/.
[14] G. Hammond, Big tech outspends venture capital firms in AI investment frenzy, 2023. URL:
     https://www.ft.com/content/c6b47d24-b435-4f41-b197-2d826cce9532.
[15] Y. S. Lee, T. Kim, S. Choi, W. Kim, When does AI pay off? AI-adoption intensity, complementary
     investments, and R&D strategy, Technovation 118 (2022) 102590. doi:10.1016/j.technovation.
     2022.102590.
[16] D. Long, B. Magerko, What is AI literacy? Competencies and design considerations, in: Proceedings
     of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–16.
[17] D. T. K. Ng, J. K. L. Leung, S. K. W. Chu, M. S. Qiao, Conceptualizing AI literacy: An exploratory
     review, Computers and Education: Artificial Intelligence 2 (2021) 100041.
[18] G. Biagini, S. Cuomo, M. Ranieri, Developing and validating a multidimensional AI literacy
     questionnaire: Operationalizing AI literacy for higher education, in: Proceedings of the First
     International Workshop on High-Performance Artificial Intelligence Systems in Education, AIxEDU
     2023, Aachen, 2023. URL: https://ceur-ws.org/Vol-3605/.
[19] D. Cetindamar, K. Kitto, M. Wu, Y. Zhang, B. Abedin, S. Knight, Explicating AI literacy of
     employees at digital workplaces, IEEE Transactions on Engineering Management 71 (2024)
     810–823. doi:10.1109/TEM.2021.3138503.
[20] S.-C. Kong, W. M.-Y. Cheung, G. Zhang, Evaluating an Artificial Intelligence Literacy Programme
     for Developing University Students’ Conceptual Understanding, Literacy, Empowerment and
     Ethical Awareness, Educational Technology & Society 26 (2023) 16–30.
[21] B. Wang, P.-L. P. Rau, T. Yuan, Measuring user competence in using artificial intelligence: Validity
     and reliability of artificial intelligence literacy scale, Behaviour & Information Technology 42
     (2023) 1324–1337. doi:10.1080/0144929X.2022.2072768.
[22] P. Weber, M. Pinski, L. Baum, Toward an Objective Measurement of AI Literacy, in: PACIS 2023
     Proceedings, 2023. URL: https://aisel.aisnet.org/pacis2023/60.
[23] UNESCO, Report “Guidance for generative AI in education and research”, 2023. URL: https:
     //www.unesco.org/en/articles/guidance-generative-ai-education-and-research.
[24] A. Gerdes, A participatory data-centric approach to AI ethics by design, Applied Artificial
     Intelligence 36 (2022). doi:10.1080/08839514.2021.2009222.
[25] C. Jang, Coping with vulnerability: The effect of trust in ai and privacy-protective behaviour on the
     use of ai-based services, Behaviour & Information Technology (2023). doi:10.1080/0144929X.
     2023.2246590.
[26] A. Majeed, S. O. Hwang, When AI Meets Information Privacy: The Adversarial Role of AI in Data
     Sharing Scenario, IEEE Access 11 (2023) 76177–76195. doi:10.1109/ACCESS.2023.3297646.
[27] P. Samuelson, Generative AI meets copyright, Science 381 (2023) 158–161. doi:10.1126/science.
     adi0656.
[28] M. A. Yeo, Academic integrity in the age of Artificial Intelligence (AI) authoring apps, TESOL
     Journal 14 (2023) e716. doi:10.1002/tesj.716.
[29] V. van Oijen, AI-generated text detectors: Do they work?, 2023. URL: https://communities.surf.nl/
     en/ai-in-education/article/ai-generated-text-detectors-do-they-work.
[30] D. Weber-Wulff, A. Anohina-Naumeca, S. Bjelobaba, T. Foltýnek, J. Guerrero-Dib, O. Popoola,
     P. Šigut, L. Waddington, Testing of detection tools for AI-generated text, International Journal for
     Educational Integrity 19 (2023) 26. doi:10.1007/s40979-023-00146-z.
[31] F. Miao, W. Holmes, H. Ronghuai, Z. Hui, AI and education: Guidance for policy-makers, Technical
     Report, UNESCO, 2023. URL: https://unesdoc.unesco.org/ark:/48223/pf0000376709.
[32] E. Sabzalieva, A. Valentini, ChatGPT and artificial intelligence in higher education: Quick start
     guide, Technical Report, UNESCO, 2023. URL: https://unesdoc.unesco.org/ark:/48223/pf0000385146.
[33] M. Webb, A Generative AI Primer, Technical Report, National Centre for AI, 2023. URL: https:
     //nationalcentreforai.jiscinvolve.org/wp/2024/01/02/generative-ai-primer/.
[34] Russell Group, New principles on use of AI in education, 2023. URL: https://russellgroup.ac.uk/
     news/new-principles-on-use-of-ai-in-education/.
[35] GTnum, Intelligence artificielle et éducation: Apports de la recherche et enjeux pour les politiques
     publiques, 2023. URL: https://edunumrech.hypotheses.org/8726.
[36] M. A. Cardona, R. J. Rodríguez, K. Ishmael, Artificial Intelligence and the Future of Teaching and
     Learning: Insights and Recommendations, Technical Report, 2023. URL: https://policycommons.
     net/artifacts/3854312/ai-report/4660267/.
[37] UCL, Using generative AI (GenAI) in learning and teaching, 2023. URL: https://www.ucl.ac.uk/
     teaching-learning/publications/2023/sep/using-generative-ai-genai-learning-and-teaching.
[38] Z. Swiecki, H. Khosravi, G. Chen, R. Martinez-Maldonado, J. M. Lodge, S. Milligan, N. Selwyn,
     D. Gašević, Assessment in the age of artificial intelligence, Computers and Education: Artificial
     Intelligence 3 (2022) 100075. doi:10.1016/j.caeai.2022.100075.
[39] P. P. Martin, D. Kranz, P. Wulff, N. Graulich, Exploring new depths: Applying machine learning
     for the analysis of student argumentation in chemistry, Journal of Research in Science Teaching
     (2023). doi:10.1002/tea.21903.
[40] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, et al., ChatGPT for
     good? On opportunities and challenges of large language models for education, Learning and
     Individual Differences 103 (2023) 102274.
[41] A. Lepage, N. Roy, A review of the literature from 1970 to 2022 on the roles of teachers and
     artificial intelligence in the field of AI in education, Médiations et Médiatisations 16 (2023) 30–50.
     doi:10.52358/mm.vi16.304.
[42] A. Tamkin, M. Brundage, J. Clark, D. Ganguli, Understanding the Capabilities, Limitations,
     and Societal Impact of Large Language Models, arXiv preprint arXiv:2102.02503 (2021). URL:
     http://arxiv.org/abs/2102.02503.
[43] O. Koraishi, Teaching English in the Age of AI: Embracing ChatGPT to Optimize EFL Materials
     and Assessment, Language Education and Technology 3 (2023) Article 1.
[44] F. Ali, D. Choy, S. Divaharan, H. Y. Tay, W. Chen, Supporting self-directed learning and self-
     assessment using TeacherGAIA, a generative AI chatbot application: Learning approaches and
     prompt engineering, Learning: Research and Practice 9 (2023) 135–147. doi:10.1080/23735082.
     2023.2258886.
[45] F. Ouyang, T. A. Dinh, W. Xu, A Systematic Review of AI-Driven Educational Assessment
     in STEM Education, Journal for STEM Education Research 6 (2023) 408–426. doi:10.1007/
     s41979-023-00112-x.
[46] J. Biggs, Enhancing teaching through constructive alignment, Higher Education 32 (1996) 347–364.
     doi:10.1007/BF00138871.