<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can GPT-4 Enhance Teaching? A Pilot Study on AI-Driven Analysis of Student Course Feedback</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oshani Weerakoon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panu Puhtila</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tuomas Mäkilä</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erkki Kaila</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Turku</institution>
          ,
          <addr-line>Turku</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>In this pilot study, we explored the use of generative AI, specifically GPT-4, to evaluate student feedback in a bilingual software engineering course offered at the University of Turku. Our aim was twofold: to examine whether ChatGPT can meaningfully evaluate student course feedback and propose suitable enhancements, and to compare its evaluations with those made by a course teacher. We collected voluntary feedback from 18 consenting students across three course instances in 2023 and 2024, resulting in a total of 390 feedback entries. These responses were first translated into English and then anonymized. Using structured questionnaires aligned with defined pedagogical goals, we then analyzed the responses through a dual evaluation process: (1) AI-based assessment using a custom JavaScript application integrating GPT-4 and GPT-4o-mini, and (2) manual evaluation by the teacher. Both followed a standardized Likert-scale format with brief textual comments, and all evaluations were consolidated into thirty-six manually maintained recording sheets. Evaluation results were visualized using heat maps across five key themes derived from the pedagogical goals. Our comparative analysis showed general alignment between the two evaluators, with key differences in the perceived content clarity and video quality of the course. We further extended our discussion to examine GPT's applicability and limitations as a feedback evaluator. In particular, we identified its potential to quickly assess structured student feedback in courses with high participation, where manual evaluation may be time-consuming for course teachers. These findings collectively provide insights into using generative AI in course feedback analysis to enhance teaching within software engineering curricula.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative AI</kwd>
        <kwd>Engineering Education</kwd>
        <kwd>Pedagogy</kwd>
        <kwd>GPT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The swift development of artificial intelligence technologies in recent years has significantly
broadened their usefulness across several industries. Notably, the education sector has received a lot of
attention, especially given the potential uses of artificial intelligence in teaching and learning.</p>
      <p>
        ChatGPT is a Large Language Model that was released in November 2022 by OpenAI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is based on
the Generative Pretrained Transformer (GPT) architecture [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which uses deep learning techniques to
generate human-like text. ChatGPT models, including GPT-3.5 and GPT-4, are trained on vast amounts
of text data and leverage transformer architectures, featuring self-attention mechanisms that allow them
to understand context, generate coherent responses, and complete tasks like translation, summarization,
and conversation. GPT-4o is OpenAI’s latest flagship model, capable of real-time reasoning spanning
voice, vision, and text [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The current context window for GPT-4o is 128k, with a knowledge cut-off
date of October 2023 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These models excel at tasks requiring language understanding and generation,
making them useful in chatbots, content creation, and coding assistance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Evaluations of teaching and learning by students themselves have become progressively vital in
facilitating high-quality, learner-centric education [4]. Gaining insight into how students see their
educational experiences provides important information that may affect course design and improvement.
Many feedback tools can be integrated into the course structure to collect feedback from students.
However, the results are often summarised only as charts, and when the volume of student feedback is
large, it is time-consuming for teachers to go through the responses and derive detailed insights.</p>
      <p>ChatGPT has shown capability in simplifying time-consuming text-based analysis and summarization
tasks owing to its massive set of training data and transformer architecture. It further demonstrates
various intelligent behaviours which may even closely follow human writing [5]. Therefore, this study
aims to explore the use of GPT-4 in a novel workflow to derive meaningful evaluations of course
feedback given by students and to compare them with those made by the teacher.</p>
      <p>While there are several Generative AI tools available on the market, we chose GPT-4 for this study for
two reasons. Firstly, ChatGPT is easily accessible and is widely used by many people, including teachers,
who are the target audience of this research. In this research, we offer an API-based web interface to
evaluate student feedback; however, ChatGPT’s conversational interface could equally be used to perform
the evaluation task.</p>
      <p>Secondly, OpenAI’s GPT models still lead much of the Generative AI-based research
in different fields, and OpenAI continues to upgrade its models for specific tasks such as
reasoning and coding, which keeps the models widely applicable compared to their competitors.</p>
      <p>While Generative AI tools like GPT-4 can be extremely useful in various applications, there are
instances where they produce text that may be perceived by humans as incorrect or deceptive. These
instances, known as hallucinations [6, 7, 8], are situations where the responses seem logical or
consistent despite being factually incorrect. There have also been recent findings about ChatGPT
using "shortcuts" to generate text rather than deep reasoning [9]. The privacy and security issues
related to working with ChatGPT are also a concern that cannot be overlooked [10]. Furthermore,
the information generated by ChatGPT can be incorrect and biased [11], which emphasizes the necessity
of carefully reviewing AI-generated content before use.</p>
      <p>In light of these developments and known limitations, our study aims to answer the following research
questions:
1. RQ1: Can ChatGPT evaluate student course feedback and propose suitable enhancements?
2. RQ2: How do ChatGPT’s evaluations compare to those of a teacher?</p>
      <p>We present in this paper a novel workflow utilizing GPT-4 API and GPT-4o-mini API for evaluating
feedback given by students for a software engineering course held at the University of Turku. We then
compare the evaluations made by GPT against those made by a course teacher to assess the efficacy
of the AI. We employed both qualitative and quantitative analyses to explore answers to the research
questions posed above.</p>
      <p>Our paper is outlined as follows. Section 2 explains related studies and Section 3 explains the
methodology that we used to conduct this research in detail. Section 4 presents and compares the
evaluations made by ChatGPT and the teacher on the student feedback. Section 5 extends the discussion
and answers the research questions. Section 6 concludes the findings of our study and proposes future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Since the body of research focusing on the specific phenomena we are studying is quite sparse, we will
here discuss studies that are contextually and thematically adjacent to our work. These include various
studies that have investigated the factors influencing the quality of feedback, as well as a few studies
on the usage of Generative AI tools in giving feedback to students. We will present them in this
order.</p>
      <p>QUEST was an assessment tool designed by Tricker et al. [12] to gauge student satisfaction with
distance education courses within the higher education system. It produced metrics assessing
different aspects of the course’s service quality, which can enhance the alignment between student
expectations and their actual experiences in the course.</p>
      <p>Brew [13] investigated whether students would respond in depth to an anonymous survey,
whether the survey’s integration into the course management system was successful,
and whether the responses would provide helpful criticism for course assessment and revision. In total,
71 people responded to the survey, which was administered in 2001, 2003, 2005, and 2007. The survey
results showed that students gave more thoughtful and comprehensive comments when the
student count was low.</p>
      <p>Dai et al. [14] explored the potential of utilizing ChatGPT as a tool for delivering feedback to enhance
student learning. The findings indicated that: (i) ChatGPT was able to produce feedback that was not
only more detailed but also fluently captured students’ performance compared to human instructors;
(ii) there was a high level of agreement between ChatGPT and the instructors in evaluating the subject
matter of students’ assignments; and (iii) ChatGPT was capable of providing insights into the methods
students employed to complete their tasks, potentially aiding in the development of their learning skills.
However, their research did not investigate utilizing generative AI tools to evaluate student feedback to
improve course teaching.</p>
      <p>Another similar study, conducted by Dai et al. [15] examines the potential of GPT models in enhancing
AI-driven learning analytics for assessment feedback. It evaluates GPT-3.5 and GPT-4 in generating
feedback for student writing tasks in a data science course, comparing them with human instructors in
readability, effectiveness, and reliability. Results show that GPT models, particularly GPT-4, produce
more readable and consistent feedback, outperform human instructors in providing structured feedback,
and demonstrate higher reliability.</p>
      <p>A study done by Morales-Chan et al. [16] also explores AI-driven feedback automation in MOOCs
to enhance student learning. By combining LangChain and the OpenAI API, the system generates
personalized, rubric-based feedback aligned with educational goals. More than just automating
responses, this approach reshapes digital education by making MOOCs more interactive and adaptive. The
research also highlights generative AI’s potential to scale individualized feedback, improving learning
experiences globally. Findings indicate improved student satisfaction and progress, confirming AI’s
value in self-paced courses.</p>
      <p>Utilizing Llama2 (7B) to summarize 742 student course feedback responses across 75 Computer
Science courses, Zhang et al. [17] conducted a study to assess the capability of Llama2 to provide
course instructors with actionable insights. Findings suggest that generative AI can effectively
synthesize factual and relevant feedback, offering a cost-efficient solution for enhancing teaching practices.
However, no comparison with human evaluators was made in that study. Despite this, the
study demonstrates the feasibility of AI-driven evaluation summaries to support educators’ professional
development, highlighting a promising approach to improving teaching feedback in large classroom
settings.</p>
      <p>In the study conducted by Steiss et al. [18], they investigated the capacity of generative AI (ChatGPT)
to offer formative feedback and compared the quality of feedback provided by humans and AI. Their
analysis included 200 instances of formative feedback generated by humans and 200 by AI for identical
essays. The evaluation showed that feedback from well-trained evaluators was of higher quality
compared to that from ChatGPT. However, given ChatGPT’s ability to produce feedback easily and its
reasonable quality, generative AI could still be beneficial in certain scenarios, such as when a skilled
educator is not available.</p>
      <p>Course evaluations are widely utilized in higher education, yet extracting valuable insights from
numerous open-ended comments can be time-consuming. Fuller et al. [19] investigated the use of
ChatGPT for analyzing course evaluation comments, focusing on the time efficiency of theme generation
and the agreement between themes identified by instructors and those determined by AI. The study
demonstrated ChatGPT’s potential as an analytical tool for open-ended comments in health professions
education. However, their results indicate that it is essential to use ChatGPT as an auxiliary tool in the
analysis process and not to depend exclusively on its outputs for conclusions.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Research Methods</title>
      <p>For this pilot study, we selected a new open university software engineering course named Green
Coding, which is conducted in both Finnish and English. The course is implemented under the
Sustainable Web Development Studies category (https://www.utu.fi/en/open-university-studies/courses/faculty-of-technology/sustainable-web-development) at the University of Turku, Finland.</p>
      <sec id="sec-3-1">
        <title>3.1. Course Background</title>
        <p>The course covers concepts related to designing efficient algorithms that minimize resource usage and
energy consumption, as well as monitoring energy use during development and run time. Furthermore,
the course encourages students to make sustainable choices, such as selecting appropriate platforms,
development frameworks, and tools. The ultimate goal is fostering lower carbon footprints, balancing
performance with energy efficiency, and integrating environmental concerns into software development
processes.</p>
        <p>The course comprised six chapters and took place in six different instances (run separately in English
and Finnish) in February 2023, June 2024 and October 2024. The course materials were available online
for enrolled students to complete at their own pace, with one mandatory lecture, held in both Finnish
and English, demonstrating power consumption in software and hardware-based solutions. The
average course duration was four weeks. These courses under the open university category are free of
charge and designed for working professionals who need to update their skill sets, or in general for
anyone interested in the topic.</p>
        <p>Participants: Since the course was free and self-paced, completion of the course depended
on the students themselves. Likewise, providing feedback for each chapter was voluntary. Of all the
students who took the course, 18 students gave explicit consent for their course feedback to be used
in this research. This includes both Finnish and English students from all three instances.</p>
        <p>A research consent form in both English and Finnish was distributed among the students
enrolled in the course to collect consent to process their feedback about the course. This step was
imperative to ensure the ethical and consented usage of student feedback in our scientific study involving
Generative AI. We further ensured the anonymity of the data by excluding all personal identifiers from
the 390 feedback entries of the consenting students during the analysis process.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Study Design</title>
        <p>We collected student feedback with a form published in the Moodle instance of the course, through
which students could express their opinions on the course teaching, learning experience and
quality of course materials. The course included six chapters, and at the end of each chapter, students
could voluntarily answer the questions providing feedback about that chapter. The questionnaire was
designed to capture a spectrum of pedagogical areas, as shown in Table 1. Based on these pedagogical
goals, the eight feedback questions shown in Table 2 were produced. The feedback questions were
also mapped to the pedagogical goals. All questions were evaluated on a 5-point Likert scale
with the following grades: 1. Excellent, 2. Good, 3. Fair, 4. Poor, and 5. Needs Improvement.</p>
        <p>Before conducting the analysis, pre-processing of the feedback data was required. We pre-processed
the 390 feedback points as follows.</p>
        <p>Data Pre-processing: This process began with exporting all the student feedback across all chapters
in all three course instances in Excel format. Afterwards, we carried out the following steps in
order: consolidating all feedback into a single dataset, excluding responses from students who did not
consent to participate in the study, and removing any personal data to uphold GDPR compliance and
maintain confidentiality. Given the bilingual delivery of the course in English and Finnish, it was
necessary to translate the Finnish feedback into English to optimize the performance of ChatGPT.
We utilized the DeepL [20] translation service to ensure high fidelity in our translations.</p>
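        <p>As an illustration of these steps, the following Python sketch mirrors the pipeline under stated assumptions: pandas for consolidating the Excel exports, a hypothetical consent list, and the DeepL Python client for the translation step. File paths and column names are illustrative, not those of the actual dataset.</p>
        <preformat>
# Illustrative pre-processing sketch; file paths, column names and the
# consent list are assumptions, not the actual study artefacts.
import glob
import pandas as pd
import deepl

# 1. Consolidate the per-instance Excel exports into a single dataset.
frames = [pd.read_excel(path) for path in glob.glob("feedback_exports/*.xlsx")]
feedback = pd.concat(frames, ignore_index=True)

# 2. Keep only responses from students who consented to the study.
consenting_ids = set(pd.read_csv("consenting_students.csv")["student_id"])
feedback = feedback[feedback["student_id"].isin(consenting_ids)]

# 3. Remove personal identifiers to uphold GDPR compliance.
feedback = feedback.drop(columns=["student_id", "name", "email"])

# 4. Translate Finnish responses to English with the DeepL API.
translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")
finnish = feedback["language"] == "fi"
feedback.loc[finnish, "answer"] = [
    translator.translate_text(text, target_lang="EN-GB").text
    for text in feedback.loc[finnish, "answer"]
]

feedback.to_csv("preprocessed_feedback.csv", index=False)
        </preformat>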
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Analysis Methods</title>
        <sec id="sec-3-3-1">
          <title>After the student feedback data was pre-processed, we evaluated two folds:</title>
        </sec>
        <sec id="sec-3-3-2">
          <title>1. GPT-4 based Evaluation</title>
          <p>2. Teacher’s Evaluation</p>
        <p>
          GPT-4 based Evaluation: For the AI-based student feedback evaluation, we developed a
JavaScript-based web application, ’Analyse with GPT-4’ [21], that integrated the GPT-4 API (for the February 2023
course instance) and the GPT-4o-mini API (for June 2024 and October 2024), which is reportedly ’faster at
most questions’ [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and fitted our requirements.
        </p>
        <p>To provide contextual knowledge of the course and the pre-defined evaluation criteria to the model, the
following sources of information were provided:
1. Feedback questions
2. Corresponding pedagogical goals
3. Evaluation criteria in the Likert scale
4. Summary of lecture content</p>
          <p>All the provided content was in the form of markdown files (.md format). We analysed the feedback
given for Questions 3 and 4 together to get a more comprehensive evaluation, since these two questions
are related and complement each other. Every time the model was prompted to evaluate a given
piece of student feedback, these knowledge artefacts were also sent with the prompt. Several prompts were
experimented with, and the one we used was chosen because it provided the most consistent results.
The prompt used is shown below.</p>
          <p>"Below is a feedback question and answer, given by a university student about the given
chapter under the course ’X’. The chapter content and criteria for evaluating feedback per
each question are given. Imagine you are the teacher of the course, refer to the chapter
content first. Then, check the relevant question and its evaluation criteria. Finally, evaluate
the student’s answer based on the criteria and knowledge covered by the chapter. Draw
useful insights to better the course but do not improvise your response beyond the given
content, do not repeat the question or student’s answer in the response, and generate the
response with evaluation grade + less than 20 words, in plain text format. ; "</p>
          <p>We requested each evaluation to be in the output format of evaluation grade + further comments in
plain text. The ChatGPT response was intentionally limited to 20 words to match the length of the
teacher’s responses and maintain balance during the comparison stage of the two types of evaluations.</p>
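          <p>For illustration, the sketch below shows how one such evaluation request could be issued through the OpenAI API. The actual tool is a JavaScript web application [21]; this Python rendering, the file names of the knowledge artefacts, and the model identifier are assumptions for readability, not the study’s actual code.</p>
          <preformat>
# Illustrative sketch of one evaluation request (the actual tool is a
# JavaScript web application [21]; names and paths here are assumptions).
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Knowledge artefacts sent with every prompt (assumed file names).
context = "\n\n".join(
    Path(name).read_text(encoding="utf-8")
    for name in ["questions.md", "pedagogical_goals.md",
                 "evaluation_criteria.md", "chapter_summary.md"]
)

PROMPT = "..."  # the full evaluation prompt quoted above

def evaluate_feedback(question, answer, model="gpt-4o-mini"):
    """Grade one feedback entry using the fixed prompt and course context."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": context},
            {"role": "user",
             "content": f"{PROMPT}\n\nQuestion: {question}\nAnswer: {answer}"},
        ],
    )
    # Expected output format: evaluation grade + a comment under 20 words.
    return response.choices[0].message.content
          </preformat>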
          <p>Teacher’s Evaluation: A course teacher conducted a manual evaluation of the feedback of all the
students in all six chapters. The teacher’s evaluation also followed the same output format as above.</p>
          <p>The teacher and GPT-4 each evaluated a total of 390 different student feedback points across six
chapters in all six course instances (both Finnish &amp; English). Each evaluation, including the grade, was
separately recorded in thirty-six recording sheets manually.</p>
          <p>Exclusion from the Analysis: In the October 2024 instance of the course, feedback Question 5 (Estimate
the number of study hours you have done to complete this chapter) was removed from the feedback form.
Therefore, all the feedback received for Question 5 in the other instances was also excluded from this
analysis. This is mainly because the researchers identified that this particular feedback cannot be evaluated
objectively to a qualitative rating (Poor to Excellent) without additional information about the weight
of each chapter’s content.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>To get a comparative view of the two sets of evaluations, we used heat maps to
visualize the feedback evaluation grades given by the teacher and the AI.</p>
      <sec id="sec-4-1">
        <title>4.1. Visualizing using heat maps</title>
        <p>While the collected textual data (students’ feedback) was qualitative, the evaluation grade being on
a Likert scale allowed us to quantify these evaluations and graph them appropriately. First, based on the
pedagogical goals defined in Table 1, we extracted five themes (New Insights, Content Clarity, Exercise
Difficulty, Content Origin and Video Quality) that correspond to feedback questions 1, 2, 3 and 4
(analysed together), 6 and 7, respectively. We took the grades given for these questions by ChatGPT-4
and the teacher from the recording sheets. We then mapped them to a scale of 1-5, with 1 being Needs
Improvement and 5 being Excellent.</p>
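        <p>A minimal sketch of this mapping step follows; the grade labels come from the questionnaire, while the function itself is an illustrative assumption rather than the study’s actual code.</p>
        <preformat>
# Map the qualitative grades onto the 1-5 numeric scale used in the heat maps.
GRADE_SCALE = {
    "Needs Improvement": 1,
    "Poor": 2,
    "Fair": 3,
    "Good": 4,
    "Excellent": 5,
}

def grade_to_number(label):
    """Return the numeric grade; unanswered entries become NaN (see below)."""
    return GRADE_SCALE.get(label, float("nan"))
        </preformat>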
        <p>Further, we used np.nanmean after replacing ’No answer’ entries with NaN to ensure that missing
or incomplete data does not unduly bias the mean estimates. Finally, we computed the mean of all
numerical grades for each chapter per category, separately for the two evaluators, ChatGPT-4 and the teacher. To
visualize the evaluation data that we collected from ChatGPT-4 and the teacher, we generated two heat
maps using the Seaborn and matplotlib.pyplot Python libraries. The complete code used for building the
heat maps is available here [22].</p>
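        <p>The following condensed sketch illustrates the aggregation and plotting steps; since the complete code is available in [22], the array shapes and placeholder data below are illustrative only.</p>
        <preformat>
# Condensed heat-map sketch; the complete code is in [22].
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

themes = ["New Insights", "Content Clarity", "Exercise Difficulty",
          "Content Origin", "Video Quality"]
chapters = [f"Chapter {i}" for i in range(1, 7)]

# Per-entry numeric grades per (chapter, theme); NaN marks 'No answer'.
rng = np.random.default_rng(0)
entry_grades = rng.uniform(1, 5, size=(6, 5, 10))  # placeholder data
entry_grades[0, 0, 0] = np.nan                     # an unanswered entry

# np.nanmean ignores NaN entries, so missing answers do not bias the means.
mean_grades = np.nanmean(entry_grades, axis=2)

sns.heatmap(mean_grades, annot=True, fmt=".1f", vmin=1, vmax=5,
            cmap="YlGnBu", xticklabels=themes, yticklabels=chapters)
plt.title("Mean evaluation grade per chapter and theme")
plt.tight_layout()
plt.show()
        </preformat>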
        <p>Figure 2 shows the heat map of the teacher’s evaluation and Figure 1 shows the heat map of ChatGPT’s
evaluation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Comparison Insights</title>
        <p>According to the heat maps generated, both ChatGPT and the teacher gave high evaluation grades
for New Insights, indicating that they found the student responses indicative of the course
providing students with new knowledge. In terms of Content Clarity, there were some variations,
but both ChatGPT and the teacher rated it from Good to Excellent, with the teacher’s ratings
generally higher than ChatGPT’s. Both the AI and the teacher recognized variability in Exercise
Difficulty, with ratings reflecting a range of student experiences. The AI’s ratings for difficulty were
generally higher than the teacher’s, suggesting that the teacher might perceive the exercises as less
challenging compared to the AI’s judgment.</p>
        <p>In terms of Content Origin, the pedagogical aspect involved was unveiling student perception of
AI-based lecture content generation. Both the teacher and ChatGPT rated the respective
students’ feedback as Fair to Poor. In particular, in certain feedback answers, the students did not prefer
the use of a text-to-speech service in the lecture videos, highlighting the "robot-like" narration. For Video
Quality, ChatGPT provided varied ratings, and the teacher likewise rated the student feedback as showing
less satisfaction with the video content. This observation is also likely to stem from students’ preference
for a human voice over AI-generated narration in the lecture videos.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this study, both ChatGPT (AI) and the human teacher employ an identical evaluation method
characterized by a fixed set of five dimensions (New Insights, Content Clarity, Exercise Difficulty, Content
Origin, and Video Quality) across six chapters of the selected course. This standardization is analogous
to using a validated Likert-scale survey in quantitative research. By mapping qualitative ratings (e.g.,
Excellent, Good, Fair) to numerical values, our study converts the subjective evaluations made by
ChatGPT and by the human teacher into objective, quantifiable data. Altogether, both
evaluators examined 390 student feedback points under the structured criteria.</p>
      <p>To answer RQ1: Can ChatGPT evaluate student course feedback and propose suitable
enhancements?, we observed that the GPT-4 responses were rather informative and complete, and also
suggested enhancements to course teachers. For example, "Evaluation Grade: Good. Insights on Big O
clarity highlight the importance of practical application in assignments." is one such evaluation made by
GPT on the Big O notation content in Chapter 5 of the course, under optimization methods to increase code
efficiency. The evaluation identified that the student’s feedback reflected positively on the Big O notation in
that chapter.</p>
      <p>To answer RQ2: How do ChatGPT’s evaluations compare to those of a teacher?, after comparing
the specific colour shades and ratings in each heat map, it can be determined that the teacher’s
evaluations may tend to reflect a keener understanding of student feedback, which can be due
to subjective experiences and qualitative judgment. On the other hand, the AI’s evaluations appear
to be based on a more consistent, metric-based approach, which is likely to focus on the objective
criteria provided. We also found, generally, that the evaluations of ChatGPT were more informative and
complete than the teacher’s and tried to provide improvements to the course. We observed similarities
in categories that are inherently less subjective, specifically the New Insights and Exercise Difficulty
dimensions. Both evaluators consistently rated these aspects at similarly high levels across the chapters,
suggesting that the course effectively imparts new knowledge and that the exercises are appropriately
challenging. Adding to this, we used two GPT-4 models for the evaluations, but we did not observe a
significant change in the output in this pilot study. The subjective categories, such as Content Clarity
and Video Quality, show the greatest divergence in evaluations, emphasizing the need for a more defined
approach for ChatGPT to assess these aspects.</p>
      <p>On the other hand, the related feedback on these categories was comparatively expressive: students
expressed their thoughts about AI-generated lecture videos openly, with many disliking the
AI-generated narration. According to Kocoń et al. [23], the accuracy of ChatGPT-4 diminishes significantly
for more challenging and practical natural language processing tasks, particularly in the evaluation
of texts with emotional content. Teachers can identify subtleties in student feedback, and thus assess
clearly whether something has been understood rather than just rote-learned, or whether the teaching methods
are engaging and effective beyond just conveying information, along with the sentiments behind them.</p>
      <p>Added to this is the potential bias of ChatGPT in qualitative analysis, likely resulting from the
guidelines enforced for human trainers by OpenAI [23]. There is also the possibility that hallucinations
could have affected the evaluations given by ChatGPT. Although it is impossible to completely remove
the occurrence of hallucinations in ChatGPT as a language model, these can be minimized by providing
ample contextual information along with the prompt [24]. Furthermore, there is the possibility of
translation errors introduced by DeepL where the Finnish feedback translated to English had a complicated
structure and/or off-meaning. It should be noted that the teacher who evaluated the feedback was not
required to translate the feedback and had native proficiency in the Finnish language.</p>
      <p>These findings confirm that while both evaluators agree on certain objective dimensions, the teacher’s
evaluative lens is crucial for capturing the full complexity of student feedback. The similarity in New
Insights and Exercise Dificulty confirms the reliability of the measurement framework, whereas the
divergence in Content Clarity and Video Quality emphasizes the need for human insight to address
exact pedagogical shortcomings. It is clear that while ChatGPT can offer fast, consistent evaluations,
human judgment is still required to capture the full scope of the course experience and satisfaction
level of the students. Therefore, our analysis supports the dual-evaluator approach, where ChatGPT-4’s
consistency complements the teacher’s judgment, ultimately providing a holistic overview of student
feedback to improve course teaching.</p>
      <p>Apart from the above set of five key themes, we consistently observed that GPT as a feedback evaluator
was faster at producing the structured, graded evaluations than the human teacher.
For courses with a higher number of students, the speed of GPT in evaluating student
feedback would greatly help teachers extract insights faster and then review them where needed.</p>
      <p>Lastly, one interesting aspect that arose from the student feedback was their perception of whether
AI technologies had been used in the production of the course materials. They were used, in the
production of lecture videos and slides, based on the human-written course material, and we wanted to
gauge how obvious this was to the students. Our observation was that while many students correctly
pointed out very obvious indicators of some particular presentation having been produced with Large
Language Models (LLMs) or text-to-speech (TTS) systems, many students were completely oblivious to
these elements. Even in situations where, for example, the voice in a lecture video was created with
TTS technology and every other student recognized it as being machine-generated, one student would
claim that all materials seemed completely natural, that is, human-generated. This is interesting, as it
suggests that some people are more prone than others to not recognizing AI-generated content, which
is a phenomenon that would require more study in the future.</p>
      <sec id="sec-5-1">
        <title>5.1. Threats to Validity</title>
        <p>We used a standardized evaluation framework that converts qualitative student feedback into numerical
data. Both ChatGPT and the human teacher applied the same Likert-scale conversion to measure the
dimensions: New Insights, Content Clarity, Exercise Difficulty, Content Origin, and Video Quality. This
consistent approach ensures that differences in evaluations arise from the evaluators’ perspectives, not
from the measurement process, thereby ensuring Internal Validity.</p>
        <p>Before the analysis, we identified that the feedback question on the number of study hours taken by students
to complete a chapter could not be satisfactorily evaluated to a qualitative rating (e.g., Excellent,
Good) by the evaluators without additional information. Therefore, we did not use that particular
feedback in the analysis. Next, by mapping qualitative ratings to numerical values, we captured the
intended constructs of the study. This process confirms that our measurement instrument reflects only
the theoretical dimensions this study measures, thereby ensuring Construct Validity.</p>
        <p>While the study implements internal and construct validity, the sample size of 18 students is relatively
modest; however, we analysed altogether 195 feedback points. This controlled, homogeneous group
allowed for a focused assessment of the framework as a pilot study. Yet, we acknowledge that it may
restrict the external validity.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions &amp; Future Work</title>
      <p>Analysing student feedback on a course and integrating the outcome into the course, not only improves
the quality of teaching but also benefits students in their learning process. In this study, it is shown
that the use of Generative tools like ChatGPT in evaluating student feedback to support teachers could
be applied to practical workflow that could significantly lessen the workload of teachers. It can be
determined from our preliminary study observations(where we analysed 390 student feedback points),
that GPT-4 enhances evaluation consistency and speed, but human judgment remains vital, and thereby
supports a dual-evaluator approach than a complete autonomous approach for comprehensive student
feedback analysis.</p>
      <p>In future research, we aim to construct a simple dual-evaluator approach involving both the teacher
and GPT that complements and retains the positive features of both. We further plan to extend
the use of GPT-4 for evaluating student feedback to larger datasets and to explore the capabilities of
newer GPT reasoning models and other generative AI tools in analyzing and responding to sentiments
expressed in feedback.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been supported by FAST, the Finnish Software Engineering Doctoral Research Network,
funded by the Ministry of Education and Culture, Finland.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors used the GPT-4 API to evaluate student feedback as part of the methodology for this pilot
study. Additionally, GPT-4 was used to debug code related to heat map generation. After using these
tools, the authors reviewed the outputs for accuracy to the best of their knowledge.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Marr</surname>
          </string-name>
          ,
          <article-title>A short history of chatgpt: How we got to where we are today</article-title>
          , https://www.forbes.com/ sites/bernardmarr/2023/05/19/a
          <article-title>-short-history-of-chatgpt-how-we-got-to-where-we-are-</article-title>
          <string-name>
            <surname>today</surname>
            <given-names>/</given-names>
          </string-name>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          , arXiv preprint arXiv:
          <year>2005</year>
          .
          <volume>14165</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] OpenAI, Chat with openai,
          <year>2024</year>
          . URL: https://help.openai.com/en/collections/3742473-chatgpt.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>