<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explainable AI in the Loop: An Instructor-Transformer Collaboration for Improving Explainability and Reliability of Feedback in Introductory Programming Classrooms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muntasir Hoq</string-name>
          <email>mhoq@ncsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bradford Mott</string-name>
          <email>bwmott@ncsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seung Lee</string-name>
          <email>sylee@ncsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jessica Vandenberg</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Narges Norouzi</string-name>
          <email>norouzi@berkeley.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Lester</string-name>
          <email>lester@ncsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bita Akram</string-name>
          <email>bakram@ncsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>North Carolina State University</institution>
          ,
          <addr-line>NC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California Berkeley</institution>
          ,
          <addr-line>CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Active learning is widely recognized as an effective approach to improving learning outcomes in introductory programming courses. However, limited instructional resources often restrict students' access to timely, personalized feedback, which is crucial for mastering foundational programming concepts. While advances in AI, particularly large language models (LLMs), offer scalable feedback solutions, their lack of explainability and reliability remains a challenge. This paper presents an AI-driven classroom assistant that leverages the strengths of LLMs in generating feedback while enhancing the explainability and reliability of the feedback. The AI-driven classroom assistant features an authoring tool that fosters instructor-LLM collaboration in designing programming problems, developing sample solutions, identifying patterns representing common misconceptions and logical errors, and generating feedback corresponding to each pattern. The assistant further features an AI engine that detects similar patterns in student code, matches them with instructor-identified patterns, and selects a set of relevant instructor-verified feedback. To enable this pattern-matching process, we pretrained an explainable embedding model on a large repository of Python code labeled by their correctness. The model is then fine-tuned on programming solutions for a new, unseen problem. Using an attention mechanism, key patterns from student submissions are identified and matched with instructor-selected programming patterns representing common errors and misconceptions, and the corresponding set of feedback is assigned to each programming solution. We conducted a human evaluation to assess the alignment of feedback selected by our system and that provided by instructors. The results demonstrate the system's strong potential in selecting appropriate feedback for student programming solutions.</p>
      </abstract>
      <kwd-group>
        <kwd>AI-driven classroom assistant</kwd>
        <kwd>Explainable AI</kwd>
        <kwd>Programming education</kwd>
        <kwd>Adaptive feedback</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Active learning has been widely recognized as an effective approach for improving student engagement
and learning outcomes in introductory programming courses [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A key component of active learning is
adaptive and timely feedback, which helps students identify and correct misconceptions as they develop
their programming skills [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, the growing enrollment in computer science (CS) courses poses
a significant challenge, as delivering individualized feedback at scale is resource-intensive, thereby
constraining instructors’ ability to effectively incorporate active learning activities [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Recent advancements in large language models (LLMs) have demonstrated their potential in analyzing
student code and generating feedback [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. However, despite their strength, LLMs suffer from
reliability and trustworthiness issues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], making them unsuitable for unsupervised use in introductory
programming courses [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Research has shown that misconceptions introduced early in a student’s
learning process can persist for a long time, further emphasizing the need for accurate and pedagogically
sound feedback [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. To address these challenges, we propose an AI-driven classroom assistant designed
to support both instructors and students. This system facilitates activity design for instructors while
ensuring students receive individualized, real-time feedback as they work through programming
exercises. To ensure both reliability and explainability, we use a triangulated feedback mechanism
that integrates instructors, LLMs, and an explainable AI model, leveraging the strengths of LLMs while
reinforcing instructor expertise.
      </p>
      <p>
        To enable effective instructor-LLM collaboration in designing instructional support, our AI-driven
classroom assistant features an authoring tool that allows instructors to interact with the LLM in crafting
programming problems, identifying common student errors, and generating targeted feedback. The
system utilizes a modified version of the explainable Subtree-based Attention Neural Network (SANN)
model [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10, 11, 12</xref>
        ] to extract key structural components from student code. When incorrect submissions
are detected, the model matches patterns in the code against instructor-defined misconceptions and
assigns the most relevant feedback. The modified SANN model is pretrained on the FalconCode dataset,
a large repository of Python programming submissions [13]. When a new programming problem is
designed through the authoring tool, we use an LLM to generate a synthetic dataset of student submissions based on
instructor-authored problem/solution pairs, covering common student mistakes and misconceptions.
We then fine-tune the pretrained model on this synthetic dataset so that it
selects relevant and accurate feedback for student solutions to a problem the model has not seen
before. While our model enhances explainability, it remains susceptible to errors. To improve
reliability, we incorporate an LLM as a secondary validation mechanism. When a code snippet is flagged
as incorrect by our model, the LLM verifies this decision. If there is no consensus, feedback is withheld,
and students are advised to consult teaching staff. If both models agree on an error, the system maps
the erroneous code to instructor-identified patterns and selects the most relevant feedback.
      </p>
      <p>To evaluate our approach, we conduct a case study using a problem from the FalconCode dataset
that was not included in pretraining. We use the instructor authoring tool to generate potential
solutions, common misconceptions, and relevant feedback. LLM-generated synthetic data for the
new problem is then used to fine-tune our model before evaluating its performance on real student
submissions from the FalconCode dataset. Finally, we assess the effectiveness of our feedback-matching
process through a human evaluation, comparing the feedback selected by our model to that chosen by
expert instructors. Our results demonstrate the effectiveness of instructor-AI collaboration in providing
reliable, trustworthy, timely, and scalable feedback to students in introductory programming courses.
By combining explainable models, instructor expertise, and LLM capabilities, our approach enhances
the quality of AI-assisted feedback, ensuring that students receive meaningful and pedagogically sound
guidance while working on programming tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. Active Learning in Programming Education</title>
        <p>
          Active learning techniques have significantly influenced how introductory programming is taught,
offering a participatory alternative to traditional lectures. Techniques such as pair programming,
collaborative problem-solving, and real-time coding exercises help students better understand complex
programming constructs [
          <xref ref-type="bibr" rid="ref1">1, 14</xref>
          ]. These approaches promote critical thinking and retention, particularly
when students are actively engaged in constructing their own knowledge [15, 16].
        </p>
        <p>Despite the growing adoption of active learning in STEM disciplines, the benefits in computing
education often depend more on the instructional strategy than on the physical classroom
environment [17, 18]. Tools such as classroom response systems and interactive IDEs have emerged to support
engagement, but limitations remain in providing personalized and conceptual feedback to support
deeper learning [19, 20, 21], which is difficult to scale as class sizes increase.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Automated Feedback Systems</title>
        <p>Automated feedback systems have become integral to programming education, aiming to enhance
student learning by providing timely and constructive responses to code submissions [22]. Research
in this domain has predominantly concentrated on the automatic correction or repair of erroneous
programs, often neglecting the delivery of comprehensible natural language feedback [23, 24, 25, 26, 27].
A few systems rely on rule-based feedback generation techniques that, while functional,
can produce feedback that is challenging for students to interpret and apply effectively. Moreover, most
automated grading tools focus primarily on assessing the correctness of assignments in object-oriented
languages, with limited emphasis on the clarity and educational value of the feedback provided [22].</p>
        <p>
          Advances in Generative Artificial Intelligence (Gen-AI) and Large Language Models (LLMs) have
introduced new possibilities for generating feedback in programming education. Recent studies have
explored the deployment of LLM-based systems to produce feedback for student programs. These studies
focused on the application of LLMs in providing feedback on syntax errors, bugs, or misconceptions in
the programs [
          <xref ref-type="bibr" rid="ref4 ref5 ref7">4, 5, 28, 7, 29</xref>
          ]. Although LLMs can generate natural language feedback, recent studies
have revealed that LLM-generated feedback lacks explainability, is prone to incorrect suggestions and
hallucinations, and suffers from reliability and precision issues that limit its ability to provide effective and
adequate support to students [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7, 30</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. INSIGHT Classroom Assistant System</title>
      <p>To support scalable, reliable, and explainable feedback delivery, we developed INSIGHT, a web-based
classroom assistant designed to facilitate programming instruction. While INSIGHT includes a broad
set of features for real-time activity distribution, collaborative learning, and instructor monitoring, this
paper focuses on its explainable feedback selection capabilities.</p>
      <p>The system includes two primary interfaces: an Instructor app and a Student app. The Instructor app
contains a dashboard and an authoring tool. The dashboard enables real-time distribution of exercises
categorized by course topics, while the authoring tool allows instructors to create new exercises, provide
correct and incorrect solution examples, and specify associated feedback. Instructors can optionally
collaborate with an LLM to generate new problems and common student errors. The Student app,
initially implemented as a Visual Studio Code extension, allows students to view assigned exercises,
receive immediate feedback, and submit code within their development environment. All
student-instructor interactions are synchronized to a cloud database for real-time analysis.</p>
      <sec id="sec-3-1">
        <title>3.1. Instructor Authoring Tool</title>
        <p>The authoring tool enables instructors to create topic-tailored coding exercises that are specific to the
course. All coding exercises are categorized by their course subject and topics in the tool, along with
corresponding solutions and feedback [31]. The authoring tool allows instructors to add new coding
exercises and provide sample solutions with common misconceptions about the exercises (Figure 1). The
tool also allows instructors to leverage LLMs in generating coding exercises, solutions, and feedback.
In this way, instructors can incorporate their pedagogical expertise with the extensive problem and
solution spaces of LLMs to generate the most relevant exercises for students. All authored content is
stored in and accessed from the cloud database, so that it can be shared with the dashboard.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Student App</title>
        <p>To allow students to receive the instructor-sent coding exercises and write their own solutions within
the interface, we have developed the INSIGHT Student App. The student app enables students to
access activities within the app, such as viewing coding exercises sent in the class, receiving immediate
personalized feedback and hints to develop solution code, and submitting their solutions in real-time.
The initial version of the INSIGHT student app was developed as a Visual Studio Code extension
(Figure 2). By running inside an integrated development environment (IDE), the app fits seamlessly
into the existing classroom workflow. We plan to support multiple IDEs as well as a standalone
web-based interface.</p>
        <p>Figure panels: (a) problem design interface; (b) example solution creation interface; (c) misconception and feedback tagging interface.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset</title>
      <p>For this work, we used the FalconCode dataset [13], a large and publicly available collection of 1.5
million Python programs submitted by over two thousand undergraduate students at the United States
Air Force Academy over five semesters. The dataset encompasses code submissions for more than 800
Python programming assignments, along with metadata, including the problem prompts, test cases used
for evaluation, and the specific skills required to solve each problem. The problems cover fundamental
CS topics, such as conditionals, loops, files, functions, strings, lists, etc.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <p>We propose a feedback propagation framework that combines an explainable AI model with generative
AI to deliver accurate, reliable, and explainable feedback on student programming submissions. Our
approach uses a pretrained and fine-tuned SANN model to identify the most influential subtrees in
incorrect code, highlighting where feedback is needed. A secondary verification step using an LLM
ensures that only genuinely erroneous code segments are flagged. These verified segments are then
matched with instructor-defined examples to deliver precise, contextually relevant feedback to students.
Figure 3 briefly illustrates the feedback propagation framework.</p>
      <sec id="sec-5-1">
        <title>5.1. Explainable AI Model</title>
        <p>
          Subtree-based Attention Neural Network (SANN) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is a model designed for explainable program
representation learning, capable of encoding source code into compact vector forms by extracting
meaningful substructures from Abstract Syntax Trees (ASTs). These vector representations have been
effectively applied to various prediction tasks, such as code correctness prediction and algorithm
detection [
          <xref ref-type="bibr" rid="ref9">9, 32</xref>
          ]. In this work, we leveraged SANN’s attention mechanism to identify key subtrees that
contribute most to the final classification task (whether a program is correct or not), enabling us to
focus on the code parts where feedback is needed for an incorrect program.
        </p>
        <p>SANN constructs vector representations of source code by first extracting subtrees from the AST and
embedding them using a two-way embedding strategy (a combination of node-based and subtree-based
embeddings) to generate subtree vectors that contain syntactic and semantic information of the program.
These subtree vectors are then aggregated into a single vector representation of the entire program using
an Attention Neural Network. The attention mechanism assigns scalar weights to each subtree vector,
computed via a normalized inner product between the subtree vectors and a global attention vector,
followed by a softmax function. This mechanism highlights the most influential subtrees, enhancing
explainability by revealing which parts of a program contribute to the model’s decision.</p>
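        <p>The aggregation step described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the SANN implementation: the function name attention_pool is hypothetical, and the normalization of the inner product is assumed here to be cosine-style (dividing by both vector norms), which the text does not specify.</p>
        <preformat><![CDATA[
```python
import math


def attention_pool(subtree_vectors, attention_vector):
    """Aggregate subtree vectors into one program vector with softmax attention."""

    def norm(v):
        return math.sqrt(sum(x * x for x in v)) or 1.0

    # Normalized inner product between each subtree vector and the global
    # attention vector (assumption: cosine-style normalization).
    a_norm = norm(attention_vector)
    scores = [
        sum(x * y for x, y in zip(vec, attention_vector)) / (norm(vec) * a_norm)
        for vec in subtree_vectors
    ]
    # Softmax turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of subtree vectors is the program-level vector.
    dim = len(attention_vector)
    program_vector = [
        sum(w * vec[i] for w, vec in zip(weights, subtree_vectors))
        for i in range(dim)
    ]
    return weights, program_vector


# Toy data: three 2-dimensional subtree vectors and one attention vector.
subtrees = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, program = attention_pool(subtrees, [1.0, 0.0])
```
]]></preformat>
        <p>In this toy example the first subtree is best aligned with the attention vector and therefore receives the largest weight; inspecting the weights is what makes the representation explainable.</p>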
        <p>While the original SANN model employed a genetic search-based optimization technique to segment
ASTs into non-overlapping subtrees of fixed sizes based on a specific task, we leveraged a modified
version of SANN to consider all possible subtrees of varying sizes and incorporated a sigmoid attention
activation function and an entropy regularization term to effectively identify the subtrees with logical errors,
as suggested by prior research [12]. This modification allows us to better capture and analyze erroneous
patterns in student code by identifying the most significant subtrees influencing the model to predict
the code as incorrect.</p>
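        <p>To make the idea of considering subtrees of varying sizes concrete, the following minimal sketch enumerates candidate subtrees with Python's standard ast module. It assumes that a candidate subtree is the complete subtree rooted at an AST node, which may differ from the enumeration used by the modified SANN model; the function name enumerate_subtrees and the toy student program are hypothetical. The caps of 100 nodes per subtree and 100 subtrees per program mirror the hyperparameters reported in Section 6.1.</p>
        <preformat><![CDATA[
```python
import ast


def enumerate_subtrees(source, max_nodes=100, max_subtrees=100):
    """Return (root_type, node_count, line) for each AST subtree within the caps."""
    tree = ast.parse(source)
    subtrees = []
    for node in ast.walk(tree):
        # Size of the complete subtree rooted at this node (the node included).
        size = sum(1 for _ in ast.walk(node))
        if size <= max_nodes:
            line = getattr(node, "lineno", None)  # Module and contexts have no line
            subtrees.append((type(node).__name__, size, line))
        if len(subtrees) >= max_subtrees:
            break
    return subtrees


# Toy incorrect submission: the loop overwrites the total instead of adding to it.
student_code = "total = 0\nfor i in range(1, 10):\n    total = i\nprint(total)\n"
for root, size, line in enumerate_subtrees(student_code):
    print(root, size, line)
```
]]></preformat>
        <p>Each tuple identifies one candidate subtree; in the full model, every such subtree would be embedded and scored by the attention mechanism.</p>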
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Model Pretraining</title>
        <p>To simulate a real classroom scenario where students solve new problems without prior historical
data for model training, we pretrained the SANN model using the FalconCode dataset. We filtered out
problems that were ungraded or lacked correctness information, as the model was trained on a binary
correctness classification task. To streamline the pretraining process and make it less resource-hungry,
we randomly selected 50% of the available problems, resulting in a dataset of 234 problems and
435,614 student submissions. The model was pretrained on a machine using a simplified configuration
with 128 GB of RAM and an NVIDIA GeForce RTX 3060 Ti GPU. The final pretrained model comprised
approximately 100 million parameters, providing a robust initialization for downstream tasks.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Model Fine-tuning</title>
        <p>To evaluate the model’s ability to generalize to unseen problems, we selected a problem that was
not included in the pretraining phase. This setup mirrors a real classroom scenario where students
encounter new problems without prior historical data for model training. To fine-tune the pretrained
SANN model on this new problem, we leveraged GPT-4o to generate synthetic student submissions.
We generated 1,000 correct and 1,000 incorrect submissions using a structured prompt that instructed
the model to mimic an introductory student programmer. The prompt also included instructions to
cover a diverse solution space to ensure variations in correct implementations while incorporating
common logical errors, conceptual misunderstandings, and edge case mishandling in incorrect solutions,
along with some instructor-generated example solutions. After generating this synthetic dataset, we
fine-tuned the pretrained model on the generated synthetic data to adapt to the new problem without
using any historical submissions of real students, enabling it to identify errors effectively in student
submissions.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Feedback Selection</title>
        <p>In a real-time classroom setting, instructors can design new programming problems using our authoring
tool, which integrates interactions with an LLM. Using the authoring tool, instructors can provide
examples of both correct and incorrect solutions, highlighting erroneous segments in incorrect
submissions along with corresponding feedback to guide students in similar problem-solving scenarios.
Once a student submits a solution, our model, fine-tuned on synthetic data for the newly introduced
problem, analyzes the submission to identify the most influential subtrees. To simulate this setting,
we selected a problem from the FalconCode dataset in the fine-tuning phase that was not included in
the pretraining data, as mentioned before. Using our authoring tool, a CS1 instructor created example
solutions with corresponding feedback without referencing actual student data for the new problem.
Using the tool’s solution authoring interface, the instructor authored two correct solutions and a set of
incorrect solutions representing common misconceptions. For each incorrect solution, the instructor
annotated the associated erroneous code segments, specified the type of misconception, and provided
corresponding feedback to guide students. While the instructor primarily crafted these examples
manually, the tool also allowed optional interaction with an LLM to explore additional erroneous patterns
and refine feedback suggestions.</p>
        <p>The fine-tuned model then processed real student submissions on this problem, extracting the most
significant subtrees. These extracted subtrees capture the errors in the incorrect submissions, providing
insights into the feedback needed for students [12]. However, given the potential for false positives in
the extracted subtrees, it is crucial to ensure that feedback is accurate and does not mislead students. To
mitigate this risk, we incorporated a secondary verification layer using an LLM. In this work, GPT-4o
was employed to assess whether an identified important subtree by the explainable model indeed
contained an error. Only when both the explainable model and the LLM concurred on the presence
of an error was the subtree considered for the feedback-matching stage. The fine-tuned model then
analyzed instructor-defined incorrect solutions, and their corresponding subtrees were extracted along
with associated feedback. To map the erroneous subtrees extracted from student submissions to
those in the instructor-provided examples, we computed the cosine similarity between their vector
representations extracted from the explainable model. Feedback was assigned to the student’s incorrect
submission based on the most closely matched erroneous subtree, ensuring precise and contextually
relevant guidance.</p>
        <p>Table 1: Models compared on the unseen problem: Individual SANN (Synthetic Only); Mixed Fine-tuned SANN (1% Pretraining Data + Synthetic); Fine-tuned Pretrained SANN (Synthetic Only).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <sec id="sec-6-1">
        <title>6.1. Pretraining the Modified SANN Model</title>
        <p>To develop our generalizable model, we pretrained the modified SANN model on a large-scale
program correctness prediction task. Specifically, we used 435,614 student code submissions from 234
programming problems in the FalconCode dataset, consisting of 113,038 correct and 322,576 incorrect
submissions. Given the inherent class imbalance, we performed a stratified 80:10:10 split for training,
validation, and test sets to preserve the label distribution across all subsets.</p>
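        <p>A stratified 80:10:10 split of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name stratified_split, the seed, and the toy label list are hypothetical, and the toy labels only approximate the reported class ratio (about 26% correct and 74% incorrect).</p>
        <preformat><![CDATA[
```python
import random


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Return train/validation/test index lists that preserve label proportions."""
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    train, val, test = [], [], []
    # Split each class separately so every subset keeps the same class ratio.
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * ratios[0])
        n_val = int(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Toy labels: 1 = correct, 0 = incorrect.
labels = [1] * 260 + [0] * 740
train, val, test = stratified_split(labels)
```
]]></preformat>
        <p>Splitting per class keeps the minority (correct) class equally represented in all three subsets, which matters when incorrect submissions outnumber correct ones by roughly three to one.</p>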
        <p>Hyperparameters were tuned based on validation performance. We set the embedding dimension to
100, capped the number of AST nodes per subtree and the number of extracted subtrees per code to
100, and applied an early stopping criterion with a patience of 20 epochs within a maximum of 200
training epochs. To encourage sparse attention distributions and improve explainability and effective
logical error identification, an entropy regularization term of 0.0003 was applied. Pretraining required
approximately three hours to complete. The pretrained model achieved strong performance on the
held-out test set, attaining an accuracy of 88%, precision of 88%, recall of 81%, and an F1-score of
83%. These results demonstrate the model’s effectiveness in capturing syntactic and semantic patterns
relevant to program correctness, providing a robust foundation for downstream tasks such as fine-tuning
and feedback selection.</p>
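        <p>The effect of the entropy regularization can be illustrated with a small sketch. The exact form of the regularizer is not given in the text, so the version below is an assumption: sigmoid attention weights are normalized to sum to one, their Shannon entropy is computed, and the result is scaled by the reported value of 0.0003 and added to the task loss. The names regularized_loss and attention_entropy are hypothetical.</p>
        <preformat><![CDATA[
```python
import math

ENTROPY_COEFFICIENT = 0.0003  # regularization value reported in the text


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_entropy(weights, eps=1e-12):
    """Shannon entropy of the attention weights after normalizing them to sum to 1."""
    total = sum(weights) + eps
    probs = [w / total for w in weights]
    return -sum(p * math.log(p + eps) for p in probs)


def regularized_loss(task_loss, scores):
    """Add an entropy penalty so that training favors a few dominant subtrees."""
    weights = [sigmoid(s) for s in scores]  # assumed sigmoid attention activation
    return task_loss + ENTROPY_COEFFICIENT * attention_entropy(weights)


# Same task loss, two attention patterns: one peaked, one spread evenly.
peaked = regularized_loss(0.5, [6.0, -6.0, -6.0, -6.0])
uniform = regularized_loss(0.5, [0.0, 0.0, 0.0, 0.0])
```
]]></preformat>
        <p>With equal task loss, the peaked attention pattern incurs a smaller penalty than the uniform one, which is the behavior a sparsity-encouraging regularizer is meant to produce.</p>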
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Program Correctness Prediction on Unseen Problem</title>
        <p>Following the pretraining phase, we evaluated the ability of the pretrained SANN model to generalize
to a previously unseen problem, simulating a real classroom setting in which an instructor introduces a
new programming task without any historical student data. To enable this, we fine-tuned the pretrained
model using 2,000 synthetic code submissions generated by GPT-4o.</p>
        <p>To assess the effectiveness of our fine-tuned model, we evaluated its performance on 548 actual
student submissions from the unseen problem (correct: 168, incorrect: 380), which was excluded
from the pretraining dataset. We compared our approach against two baselines: (i) an individual
SANN model trained solely on the synthetic data, and (ii) a mixed fine-tuning strategy, in which the
pretrained model was fine-tuned using the synthetic dataset augmented with a small portion (1%) of
the pretraining data. This second baseline aligns with the concept of low-resource or few-shot domain
adaptation, where limited in-domain examples are used alongside pretraining knowledge to improve
generalization. As shown in Table 1, our fine-tuned model outperformed both baselines across all
metrics: accuracy, precision, recall, and F1-score, demonstrating the advantage of using a pretrained
model adapted specifically for unseen problems through synthetic, LLM-generated student data. These
results suggest that our approach offers a viable path for building feedback systems in classroom settings
where historical data is unavailable.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Feedback Selection Performance</title>
        <p>To evaluate the effectiveness of our explainable feedback selection framework, we tested our pipeline
on real student submissions for a programming problem excluded from the model’s pretraining phase.
The fine-tuned model was used to extract important subtrees from incorrect student submissions in the
FalconCode dataset. We used an attention threshold of 15% to select the most influential subtrees, which
we hypothesized to contain the erroneous logic responsible for incorrect program behavior. To ensure
the reliability of these subtrees, we introduced a secondary verification step using GPT-4o. If the LLM
disagreed with the model regarding the presence of an error in a subtree, that subtree was discarded
to reduce the likelihood of false positives. For the remaining verified subtrees, we computed cosine
similarity between their vector representations and those of instructor-annotated incorrect examples. If
the maximum similarity score exceeded a threshold of 50%, the feedback associated with the closest
instructor example was assigned to the student submission; otherwise, no feedback was returned.</p>
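        <p>The selection procedure described above can be summarized in a short sketch. It is a simplified illustration under several assumptions: the 15% attention threshold is interpreted as a minimum attention weight of 0.15, the LLM check is represented by a hypothetical callable llm_confirms_error rather than a real API call, and the vectors and feedback strings are made-up toy data.</p>
        <preformat><![CDATA[
```python
import math

ATTENTION_THRESHOLD = 0.15   # attention cutoff reported in the text
SIMILARITY_THRESHOLD = 0.50  # cosine-similarity cutoff reported in the text


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_feedback(student_subtrees, instructor_examples, llm_confirms_error):
    """Return the feedback of the best-matching instructor example, or None.

    student_subtrees: list of (attention_weight, vector, code_snippet)
    instructor_examples: list of (vector, feedback_text)
    llm_confirms_error: hypothetical callable standing in for the LLM check
    """
    best_score, best_feedback = 0.0, None
    for weight, vector, snippet in student_subtrees:
        if weight < ATTENTION_THRESHOLD:
            continue  # subtree is not influential enough
        if not llm_confirms_error(snippet):
            continue  # the two models disagree, so the subtree is discarded
        for example_vector, feedback in instructor_examples:
            score = cosine(vector, example_vector)
            if score > best_score:
                best_score, best_feedback = score, feedback
    # Feedback is returned only when the best match clears the threshold.
    return best_feedback if best_score > SIMILARITY_THRESHOLD else None


# Toy data for illustration only.
examples = [
    ([1.0, 0.0], "Accumulate with += instead of overwriting the total."),
    ([0.0, 1.0], "Check the loop bounds."),
]
subtrees = [(0.40, [0.9, 0.1], "total = i"), (0.05, [0.0, 1.0], "range(1, 10)")]
feedback = select_feedback(subtrees, examples, lambda snippet: True)
```
]]></preformat>
        <p>Returning None corresponds to the case in which no feedback is shown, either because the LLM rejected every candidate subtree or because no instructor example was similar enough.</p>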
        <p>We evaluated the quality of feedback matching based on a manual analysis conducted by one of
the authors. The evaluation focused solely on the accuracy of feedback matching—i.e., whether the
feedback corresponded appropriately to the actual student error, rather than the pedagogical quality of
the feedback itself, as all feedback had been authored by the instructor. Before evaluation, we filtered out
submissions with syntax or runtime errors, resulting in a final set of 169 incorrect student submissions.
Quantitative analysis of the evaluation revealed that: 96 submissions (56.8%) received correctly matched
feedback, 48 submissions (28.4%) were assigned incorrect feedback, and 25 submissions (14.8%) received
no feedback due to low similarity or unmatched patterns.</p>
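        <p>The reported percentages follow directly from the three counts, as the short check below shows. The last line computes a derived figure that the text does not state explicitly: the share of correct matches among only those submissions that received any feedback.</p>
        <preformat><![CDATA[
```python
# Outcome counts reported for the 169 evaluated incorrect submissions.
counts = {"correct feedback": 96, "incorrect feedback": 48, "no feedback": 25}
total = sum(counts.values())

# Share of each outcome, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Derived figure: accuracy among submissions that actually received feedback.
delivered = counts["correct feedback"] + counts["incorrect feedback"]
accuracy_when_delivered = round(100 * counts["correct feedback"] / delivered, 1)
```
]]></preformat>
        <p>Under this reading, 144 of the 169 submissions received feedback, and about two thirds of those matches were judged correct.</p>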
        <p>Further analysis of the “no feedback” cases revealed that these submissions often deviated significantly
from the problem specification—for example, by failing to use the provided boilerplate code for input
handling, which made it difficult to align them with any instructor-authored examples. In practice, such
cases would be flagged for instructor intervention. Among the “incorrect” matches, we observed that
the student code structures bore little resemblance to any of the predefined error patterns provided
by the instructor, resulting in incorrect matches. These errors highlight the need for richer example
coverage during the authoring process or future incorporation of approximate structural matching
techniques. Finally, an important limitation we observed within the “correct” feedback group was that,
for submissions containing multiple errors, the matched feedback typically addressed only a single
issue. While this granularity could support step-by-step debugging and reduce cognitive overload, it
may also frustrate students who prefer comprehensive feedback. Addressing multi-error submissions
and studying how sequential feedback impacts learning remains an important direction for future
work [33].</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>
        In this work, we addressed a critical challenge in CS education: delivering scalable, trustworthy, and
pedagogically sound feedback to students in real-time classroom settings to facilitate active learning.
As class sizes grow and the need for individualized support increases [34, 35], traditional feedback
mechanisms become difficult to scale [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. While LLMs have shown promise in code analysis and
feedback generation, concerns around hallucination, explainability, and reliability hinder their direct
deployment in educational environments [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our work proposes a hybrid approach that integrates
explainable deep learning models with LLMs and instructor-authored examples to ensure feedback
remains accurate, contextually grounded, and pedagogically aligned.
      </p>
      <p>Our approach combines a modified version of the explainable SANN model with LLM-based
verification and instructor input. The SANN model is pretrained on a large-scale Python dataset consisting
of student code submissions to learn meaningful program representations. When instructors design a
new programming problem using the authoring tool of our INSIGHT classroom assistant system, we
generate a synthetic dataset of student submissions with correct and incorrect variants, aided by LLMs.
The model is then fine-tuned on this synthetic dataset to adapt to the new problem without access to
historical student submissions. Once students begin submitting code, our system identifies important
subtrees using SANN’s attention mechanism, verifies the presence of errors using GPT-4o, and matches
verified erroneous patterns to instructor-authored feedback based on vector similarity.</p>
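      <p>The final matching step can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the toy three-dimensional vectors, the feedback strings, and the 0.8 similarity threshold are all hypothetical, and in the actual system the vectors would be derived from SANN's subtree representations. The sketch selects the instructor-authored example whose vector has the highest cosine similarity to the student's subtree vector and returns no feedback when nothing is similar enough, so that such a case can be left to the instructor.</p>
      <preformat>
```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector is all zeros, to avoid dividing by zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_feedback(
    student_vec: Sequence[float],
    examples: List[Tuple[Sequence[float], str]],
    threshold: float = 0.8,  # hypothetical cut-off, not a value from the paper
) -> Optional[str]:
    """Return the feedback of the most similar example, or None if no
    example reaches the similarity threshold."""
    best_score, best_feedback = -1.0, None
    for example_vec, feedback in examples:
        score = cosine_similarity(student_vec, example_vec)
        if score > best_score:
            best_score, best_feedback = score, feedback
    return best_feedback if best_score >= threshold else None


# Hypothetical instructor-authored error patterns with toy vectors.
EXAMPLES = [
    ([1.0, 0.0, 0.0], "Check the loop bound: the last element is skipped."),
    ([0.0, 1.0, 0.0], "The accumulator is reset inside the loop."),
]

close_match = match_feedback([0.9, 0.1, 0.0], EXAMPLES)  # near the first pattern
no_match = match_feedback([0.0, 0.0, 1.0], EXAMPLES)     # unlike both patterns
```
      </preformat>
      <p>In this sketch, a submission that resembles none of the authored patterns yields no feedback rather than the nearest but unrelated message. Whether a threshold of this kind is appropriate, and what value it should take, would have to be determined empirically.</p>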
      <p>Our evaluation showed promising performance for this framework, even when applied
to an unseen programming problem. The fine-tuned model achieved high performance on actual
student submissions for correctness prediction, outperforming models trained solely on synthetic
data or using limited pretraining augmentation. In the feedback selection phase, more than half of
the incorrect submissions received correct feedback, despite the challenges of aligning student logic
with a sparse set of instructor-authored examples. These results highlight the promise of combining
explainable program representations, LLM verification, and instructor domain knowledge to scale
feedback generation in CS classrooms. Beyond performance metrics, this system presents an important
opportunity for enhancing programming instruction. Instructors can use the authoring tool to scaffold
common misconceptions, design problem-specific feedback, and collaborate with LLMs to co-generate
solution spaces. The triangulated feedback mechanism ensures that only reliable, interpretable, and
context-aware feedback reaches students, making it well-suited for real-time learning environments.
This framework also empowers instructors by reducing the burden of individually assessing each
student’s work and enhancing transparency in automated decision-making, a key requirement in
educational AI systems.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitations and Future Work</title>
      <p>Our work has several limitations that suggest directions for future work. First, the effectiveness of
the feedback matching mechanism is strongly dependent on the quality and coverage of
instructor-authored incorrect solutions. Sparse or overly specific examples may limit the system’s ability to
match unseen student errors. In future iterations, we aim to automate the generation of diverse
incorrect examples using LLMs under instructor supervision, thereby enriching the feedback space
while retaining instructional control. Second, our current feedback-matching mechanism relies primarily
on cosine similarity between subtree vectors. While effective in many cases, this approach can propagate
incorrect feedback when structurally dissimilar errors yield incorrect matches due to scarce instructor
examples. To mitigate this, we plan to incorporate LLM-based feedback validation as an additional
layer of confirmation before any feedback is shown to the student, thereby reducing false positives
in the feedback propagation pipeline. Third, our current framework struggles to handle submissions
containing multiple errors. In these cases, the feedback typically addresses only one issue, which may
not be sufficient for students aiming to resolve all problems at once. While sequential, step-by-step, and
immediate feedback could support cognitive load management [33], we intend to explore multi-error
detection and composite feedback generation in future work. This also includes studying whether
iteratively addressing errors, one at a time, can enhance learning outcomes compared to presenting all
issues simultaneously. Another promising future direction is incorporating LLMs in the correctness
prediction stage itself. If both the fine-tuned model and the LLM agree on the correctness label of a
submission, we can increase confidence in the decision and avoid propagating feedback for correct
solutions, eliminating the need for instructors to manually author test cases for correctness validation.
Additionally, this ensemble-style agreement between models could improve robustness and reduce
reliance on any single component. Lastly, our evaluation focused on technical correctness rather than
pedagogical effectiveness. While the system can match errors to instructor feedback, future work will
involve classroom studies measuring the actual learning gains and usability outcomes associated with
receiving AI-assisted feedback. This will ensure that the system is not only technically reliable but also
pedagogically impactful.</p>
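      <p>The agreement check proposed above for the correctness-prediction stage could take roughly the following form. This is only a sketch under stated assumptions: the function name and the three action labels are hypothetical, and routing disagreements to the instructor is one possible policy that the text above does not prescribe.</p>
      <preformat>
```python
def decide_action(model_says_correct: bool, llm_says_correct: bool) -> str:
    """Combine two correctness labels into a single pipeline action.

    Both labels agree on "correct"   -> no feedback is propagated.
    Both labels agree on "incorrect" -> continue to feedback matching.
    The labels disagree              -> defer to the instructor (assumed policy).
    """
    if model_says_correct == llm_says_correct:
        return "skip_feedback" if model_says_correct else "match_feedback"
    return "flag_for_instructor"


# Enumerate all four label combinations.
actions = {
    (m, l): decide_action(m, l)
    for m in (True, False)
    for l in (True, False)
}
```
      </preformat>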
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>This work presented a novel framework that combines explainable AI models, large language models,
and instructor expertise to deliver scalable, explainable, and reliable feedback on student programming
submissions. By fine-tuning a pretrained model on synthetic data for unseen problems and integrating
LLM-based validation, our system enables accurate feedback propagation without relying on historical
student data. Our evaluation demonstrated the effectiveness of our approach in predicting program
correctness and initiating feedback matching, even in classroom settings with newly designed exercises.
While limitations remain, this work lays the groundwork for scalable, reliable, and trustworthy
AI-assisted feedback systems in computing education.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>This research was supported by the National Science Foundation (NSF) under Grants DUE-2236195
and DUE-2331965. Any opinions, findings, and conclusions expressed in this material are those of the
authors and do not necessarily reflect the views of the NSF.</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and
spelling checks. After using these tools, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
      <p>[11] M. Hoq, J. Vandenberg, B. Mott, J. Lester, N. Norouzi, B. Akram, Towards attention-based automatic
misconception identification in introductory programming courses, in: Proceedings of the 55th
ACM Technical Symposium on Computer Science Education V. 2, 2024, pp. 1680–1681.
[12] M. Hoq, A. Rao, R. Jaishankar, K. Piryani, N. Janapati, J. Vandenberg, B. Mott, N. Norouzi, J. Lester,
B. Akram, Automated identification of logical errors in programs: Advancing scalable analysis
of student misconceptions, in: Proceedings of the 18th International Conference on Educational
Data Mining (EDM), International Educational Data Mining Society, Palermo, Italy, 2025, pp. –.
[13] A. de Freitas, J. Coffman, M. de Freitas, J. Wilson, T. Weingart, Falconcode: A multiyear dataset of
python code samples from an introductory computer science course, in: Proceedings of the 54th
ACM Technical Symposium on Computer Science Education V. 1, 2023, pp. 938–944.
[14] D. Baldwin, Discovery learning in computer science, in: Proceedings of the 27th SIGCSE Technical
Symposium on Computer Science Education, 1996, pp. 222–226.
[15] J. E. Froyd, Evidence for the efficacy of student-active learning pedagogies, Project Kaleidoscope
66 (2007) 64–74.
[16] J. Pirker, M. Riffnaller-Schiefer, C. Gütl, Motivational active learning: engaging university students
in computer science education, in: Proceedings of the 2014 Conference on Innovation &amp; Technology
in Computer Science Education, 2014, pp. 297–302.
[17] T. Greer, Q. Hao, M. Jing, B. Barnes, On the effects of active learning environments in computing
education, in: Proceedings of the 50th ACM Technical Symposium on Computer Science Education,
2019, pp. 267–272.
[18] Q. Hao, B. Barnes, M. Jing, Quantifying the effects of active learning environments: separating
physical learning classrooms from pedagogical approaches, Learning Environments Research 24
(2021) 109–122.
[19] M. T. Chi, R. Wylie, The icap framework: Linking cognitive engagement to active learning
outcomes, Educational Psychologist 49 (2014) 219–243.
[20] D. Lombardi, T. F. Shipley, A. Team, B. Team, C. Team, E. Team, G. Team, G. Team, P. Team, The
curious construct of active learning, Psychological Science in the Public Interest 22 (2021) 8–43.
[21] M. Ebert, M. Ring, A presentation framework for programming in programming lectures, in:
Proceedings of the 2016 IEEE Global Engineering Education Conference (EDUCON), IEEE, 2016,
pp. 369–374.
[22] M. Messer, N. C. Brown, M. Kölling, M. Shi, Automated grading and feedback tools for programming
education: A systematic review, ACM Transactions on Computing Education 24 (2024) 1–43.
[23] R. Singh, S. Gulwani, A. Solar-Lezama, Automated feedback generation for introductory
programming assignments, in: Proceedings of the 34th ACM SIGPLAN conference on Programming
language design and implementation, 2013, pp. 15–26.
[24] S. Gulwani, I. Radiček, F. Zuleger, Automated clustering and program repair for introductory
programming assignments, ACM SIGPLAN Notices 53 (2018) 465–480.
[25] S. Bhatia, P. Kohli, R. Singh, Neuro-symbolic program corrector for introductory programming
assignments, in: Proceedings of the 40th International Conference on Software Engineering, 2018,
pp. 60–70.
[26] R. Gupta, A. Kanade, S. Shevade, Deep reinforcement learning for syntactic error repair in student
programs, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp.
930–937.
[27] J. Zhang, J. Cambronero, S. Gulwani, V. Le, R. Piskac, G. Soares, G. Verbruggen, Repairing bugs in
python assignments using large language models, arXiv preprint arXiv:2209.14876 (2022).
[28] T. Phung, V.-A. Pădurean, A. Singh, C. Brooks, J. Cambronero, S. Gulwani, A. Singla, G. Soares,
Automating human tutor-style programming feedback: Leveraging gpt-4 tutor model for hint
generation and gpt-3.5 student model for hint validation, in: Proceedings of the 14th Learning
Analytics and Knowledge Conference, 2024, pp. 12–23.
[29] Z. Xu, S. Jain, M. Kankanhalli, Hallucination is inevitable: An innate limitation of large language
models, arXiv preprint arXiv:2401.11817 (2024).
[30] Q. Jia, J. Cui, R. Xi, C. Liu, P. Rashid, R. Li, E. Gehringer, On assessing the faithfulness of
llm-generated feedback on student assignments, in: Proceedings of the 17th International Conference
on Educational Data Mining, International Educational Data Mining Society, 2024, pp. 491–499.
[31] M. Hoq, J. Vandenberg, S. Jiao, S. Lee, B. Mott, N. Norouzi, J. Lester, B. Akram, Facilitating
instructors-llm collaboration for problem design in introductory programming classrooms, in:
Proceedings of the CHI 2025 Workshop on Augmented Educators and AI: Shaping the Future of
Human and AI Cooperation in Learning, 2025.
[32] M. Hoq, Y. Shi, J. Leinonen, D. Babalola, C. Lynch, T. Price, B. Akram, Detecting chatgpt-generated
code submissions in a cs1 course using machine learning models, in: Proceedings of the 55th ACM
Technical Symposium on Computer Science Education, Association for Computing Machinery,
New York, NY, USA, 2024, pp. 526–532.
[33] J. R. Anderson, F. G. Conrad, A. T. Corbett, Skill acquisition and the lisp tutor, Cognitive Science
13 (1989) 467–505.
[34] M. Hoq, P. Brusilovsky, B. Akram, Analysis of an explainable student performance prediction model
in an introductory programming course, in: Proceedings of the 16th International Conference on
Educational Data Mining, International Educational Data Mining Society, Bengaluru, India, 2023,
pp. 79–90.
[35] M. Hoq, A. Patil, K. Akhuseyinoglu, P. Brusilovsky, B. Akram, An automated approach to
recommending relevant worked examples for programming problems, in: Proceedings of the 56th ACM
Technical Symposium on Computer Science Education (SIGCSE) V. 1, Association for Computing
Machinery, New York, NY, USA, 2025, pp. 527–533.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>McConnell</surname>
          </string-name>
          ,
          <article-title>Active learning and its use in computer science</article-title>
          ,
          <source>in: Proceedings of the 1st Conference on Integrating Technology into Computer Science Education</source>
          ,
          <year>1996</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>MacNeil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luxton-Reilly</surname>
          </string-name>
          ,
          <article-title>Desirable characteristics for ai teaching assistants in programming education</article-title>
          ,
          <source>in: Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>408</fpage>
          -
          <lpage>414</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Sphere: Scaling personalized feedback in programming classrooms with structured review of llm outputs</article-title>
          ,
          <source>arXiv preprint arXiv:2410.16513</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Phung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cambronero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <article-title>Generating high-precision feedback for programming syntax errors using large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.04662</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rashid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gehringer</surname>
          </string-name>
          ,
          <article-title>Llm-generated feedback in real classes and beyond: Perspectives from students and instructors</article-title>
          ,
          <source>in: Proceedings of the 17th International Conference on Educational Data Mining, International Educational Data Mining Society</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>862</fpage>
          -
          <lpage>867</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Phung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-A.</given-names>
            <surname>Pădurean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cambronero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <article-title>Generative ai for programming education: Benchmarking chatgpt, gpt-4, and human tutors</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM Conference on International Computing Education Research-Volume</source>
          <volume>2</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jaschke</surname>
          </string-name>
          ,
          <article-title>Evaluating the application of large language models to generate feedback in programming education</article-title>
          ,
          <source>arXiv preprint arXiv:2403.09744</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sonnert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Sadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sasselov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fredericks</surname>
          </string-name>
          ,
          <article-title>The impact of student misconceptions on student persistence in a mooc</article-title>
          ,
          <source>Journal of Research in Science Teaching</source>
          <volume>57</volume>
          (
          <year>2020</year>
          )
          <fpage>879</fpage>
          -
          <lpage>910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Chilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Ahmadi</given-names>
            <surname>Ranjbar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Akram</surname>
          </string-name>
          ,
          <article-title>Sann: Programming code representation using attention neural network with optimized subtree extraction</article-title>
          ,
          <source>in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>783</fpage>
          -
          <lpage>792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brusilovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Akram</surname>
          </string-name>
          ,
          <article-title>Explaining explainability: Early performance prediction with student programming pattern profiling</article-title>
          ,
          <source>Journal of Educational Data Mining</source>
          <volume>16</volume>
          (
          <year>2024</year>
          )
          <fpage>115</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>