<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Programming Feedback: Aligning Small Language Models Without Human Preferences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Charles Koutcheme</string-name>
          <email>charles.koutcheme@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Dainese</string-name>
          <email>nicola.dainese@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arto Hellas</string-name>
          <email>arto.hellas@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalto University</institution>
          ,
          <addr-line>Espoo</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CSEDM'25: 9th Educational Data Mining in Computer Science Education Workshop</institution>
          ,
          <addr-line>July, 2025, Palermo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Providing students with timely and effective feedback remains a critical challenge in programming education. Locally deployed Small Language Models (SLMs) offer a cost-effective solution that enables educators to generate feedback while avoiding the third-party reliance and privacy concerns associated with Large Language Models (LLMs). However, SLMs often produce misleading or inaccurate feedback, limiting their practical use. This paper presents a fully automated reinforcement learning framework for aligning SLMs to generate high-quality programming feedback without any human-labelled examples or preference annotations. Our approach transfers the feedback capabilities of powerful LLMs (“teacher models”) to smaller, low-resource models (“student models”) that can run locally on consumer hardware, with the optional assistance of medium-sized “assistant” models. The framework supports two configurations: an off-policy setup that uses assistant model generations to bootstrap alignment, and a lightweight online on-policy variant that trains directly on student model outputs. We evaluate both approaches by fine-tuning two SLMs on a real-world dataset of CS1 programming submissions collected across semesters. Our experiments simulate realistic deployment scenarios, training on data from past semesters and evaluating on future ones. Results show that both methods significantly improve feedback quality and generalize across new course offerings. We provide practical considerations for aligning SLMs in educational settings and outline a promising direction for future work. Our code is made available on GitHub.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Learning to program is challenging for many. These challenges can be somewhat alleviated with
improved teaching practice [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. A key part of this is providing feedback, which should be timely
and accurate [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. Large Language Models (LLMs) have shown exceptional success in that task
[
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], leading to their growing adoption in classrooms [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref8 ref9">8, 9, 10, 11, 12</xref>
        ]. However, relying on third-party
services that provide access to LLMs can introduce cost obstacles and scalability issues [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. These
constraints are driving a growing shift towards using smaller, open-source models [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which can be
deployed locally [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ] to reduce costs and provide educators greater control over their students’ data.
      </p>
      <p>
        Although Small Language Models (SLMs) alleviate these issues because they can be run locally
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], their feedback quality often falls short of LLMs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], posing significant challenges in real-world
applications. In particular, SLMs tend to generate more misleading feedback [
        <xref ref-type="bibr" rid="ref14 ref17">14, 17</xref>
        ], including
hallucinations and irrelevant suggestions. Such shortcomings can confuse students and hinder learning [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Reinforcement Learning (RL) has emerged as a promising approach for aligning language models
to generate pedagogically meaningful programming support [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However, existing reinforcement
learning methods for programming feedback generation rely heavily on human supervision, typically
in the form of human-written examples [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] or preference annotations [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. This dependency hinders
improvements in contexts where data or annotators are unavailable.
      </p>
      <p>
        In this paper, we explore aligning small models for programming feedback without human annotations
or preference labels. Our approach uses RL to transfer the feedback abilities from a teacher LLM to a
smaller, locally deployable student model, optionally using medium-sized assistant models to bootstrap
alignment. We implement and compare two fully automated training configurations: an off-policy
setup (TASAP), which builds on prior work using assistant generations [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]; and a novel lightweight
online on-policy method (OSAP), where the model trains directly on its own feedback.
      </p>
      <p>
        Our evaluation focuses on feedback generation for explanations of students’ mistakes [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and
Socratic hints [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], using two SLMs (SmolLM-V2-1.7B [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and Llama-3.2-1B [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]) fine-tuned on real
student submissions from the FalconCode dataset [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. We use this dataset to simulate a realistic
deployment scenario by training on one semester and evaluating on the next. We also study a continual
learning setting where models are refined incrementally as new data becomes available.
      </p>
      <p>Our results show that both configurations significantly improve feedback quality on new student
submissions. We also reflect on methodological challenges when training small models for educational
feedback and highlight a promising direction for future work. Our contributions are the following:
• We introduce a reinforcement learning framework for aligning small language models for
programming feedback, without relying on human-labelled preferences or human-annotated feedback.
• We implement an off-policy method using assistant models (TASAP) and introduce a novel online
on-policy variant (OSAP).
• We evaluate both methods on a real-world, semester-split dataset, and show that they substantially
improve feedback quality on new students’ submissions, including in a continual learning setup.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Learning from Preferences</title>
        <p>
          Reinforcement Learning With Human Feedback Recent advancements in fine-tuning techniques
have significantly improved the performance of small language models on downstream tasks. While
Supervised Fine-Tuning (SFT) remains a widely used approach to improve language models’ generations
[
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], it is limited in its ability to align such models with complex human preferences and objectives [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
Reinforcement Learning with Human Feedback (RLHF) addresses this limitation by training reward
models on ranked human preferences — for example, generation A is better than generation B — and
has contributed to the success of large language models such as GPT-4 [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
        <p>
          LLMs-as-judges. Because of their strong performance, such models have also been used as “judges”
to evaluate outputs from smaller models [
          <xref ref-type="bibr" rid="ref14 ref31">31, 14</xref>
          ]. The growing use of LLMs-as-judges has progressively
reduced the need for human annotators in RLHF pipelines [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], creating a shift towards Reinforcement
Learning From AI Feedback (RLAIF) [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], where AI models themselves are used to supervise other
models’ preference-based training.
        </p>
        <p>
          Direct Preference Optimization. Recent RLHF and RLAIF approaches have predominantly relied
on offline preference alignment methods such as Direct Preference Optimisation (DPO) [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. DPO
simplifies the classical three-step pipeline by directly optimising language models on previously collected
preference data, removing the need to train a separate reward model [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ].
        </p>
        <p>
          Parameter-Efficient Fine-Tuning. In parallel, Parameter-Efficient Fine-Tuning (PEFT) techniques
[
          <xref ref-type="bibr" rid="ref36">36</xref>
          ], such as Low-Rank Adapters (LoRA) [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], have made fine-tuning more accessible by significantly
reducing computational and memory requirements. These techniques have enabled the practical
application of preference alignment to smaller, resource-constrained models.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Improving Small Language Models for Programming Feedback</title>
        <sec id="sec-2-2-1">
          <title>Existing reinforcement learning approaches.</title>
          <p>
            Fine-tuning small language models (SLMs) for
programming education has become an increasingly active area of research in AI in Education. While
early work primarily focused on generating program repairs to correct student code [
            <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
            ], more
recent approaches explore reinforcement learning (RL) and PEFT methods to fine-tune small models to
support students in learning how to program. In all setups, a common challenge is obtaining high-quality
preference pairs to guide the learning process.
          </p>
          <p>
            Some approaches rely on human annotations paired with synthetic examples. For instance, Kumar et
al. use GPT-4 to generate low-quality samples, paired with human-written Socratic questions, to train a
LLaMA model for educational dialogue [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. Other approaches leverage naturally occurring preference
signals. Hicke et al. use TA edits to forum posts as implicit preferences [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Towards Reinforcement Learning with AI Feedback.</title>
          <p>
            While promising, these methods rely on
human supervision or access to structured educational data, which limits their scalability to new
contexts. Kotalwar et al. take a step toward automation by using GPT-4 to generate explanations and
hints, training a small model via supervised fine-tuning alone [
            <xref ref-type="bibr" rid="ref40">40</xref>
            ]. However, whether preference-based
techniques can further improve such models without human annotation remains underexplored.
          </p>
          <p>
            Recent studies highlight the promise of LLMs-as-judges in evaluating feedback quality, with models
such as GPT-4o-mini and Llama-3.1-70B producing high-quality judgments [
            <xref ref-type="bibr" rid="ref14 ref17">14, 17</xref>
            ]. These advances
motivate our work to adapt RLAIF for programming feedback.
          </p>
          <p>
            Closest to our work in another domain is Scarlatos et al. [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ], who use a combination of
human-written feedback, LLM-generated feedback, and AI preferences to train an 8B LLaMA model with PEFT
for multiple-choice math feedback, using Direct Preference Optimisation. While both studies rely
on RLAIF, our approach differs by integrating such techniques within a distillation framework that
addresses specific programming feedback challenges, notably, the lack of human-annotated data, the
vast space of possible student mistakes, and the need for highly contextualised recommendations.
          </p>
          <p>
            Moreover, our work differs from all prior attempts by also integrating online learning algorithms
[
            <xref ref-type="bibr" rid="ref41">41</xref>
            ], where language models improve continuously with their own generated responses.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>Here, we present our two approaches for improving small language models’ programming feedback.
Before presenting the training methods, we formalize the task and outline our assumptions.</p>
      <sec id="sec-3-1">
        <title>3.1. Task and Assumptions</title>
        <p>
          Task. Our primary objective is to fine-tune a small, resource-efficient, instruction-tuned language
model (the student LM) to generate two interrelated types of feedback [
          <xref ref-type="bibr" rid="ref24 ref40 ref42">42, 40, 24</xref>
          ]: an explanation
ℰ, which identifies and describes a bug in a student’s program, and a single next-step hint ℋ, which
guides the student toward resolving the identified bug without revealing the solution.
        </p>
        <p>
          While our method can be adapted to other types of feedback [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we illustrate its effectiveness with
explanations and hints as these two types of feedback play an important role in supporting students
learning programming.
        </p>
        <sec id="sec-3-1-1">
          <title>Quality attributes.</title>
          <p>
            To ensure the feedback supports effective learning, it must adhere to specific quality attributes identified in prior works. First, the generated explanation must be accurate, selective, and clear [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. The explanation is considered accurate (ℰ_acc) if it correctly identifies and mentions the first existing issue in the student program. It is considered selective (ℰ_sel) when it focuses exclusively on one issue in the code (whether the issue is correct or not) and avoids discussing any unrelated or non-existent bugs. Finally, the explanation should be clear (ℰ_clear), meaning it is easy to understand, concise, and presented in a readable format.
          </p>
          <p>
            Second, the generated hint must be correct, informative, concealed, and clear [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ]. A hint is considered correct (ℋ_corr) if it provides accurate information to resolve issues in the buggy program. It is deemed informative (ℋ_info) if it offers valuable insights to help the learner resolve the bug effectively. The hint should also remain concealed (ℋ_conc) by avoiding the direct revelation of the solution, allowing the student to reason through the process of implementing the fix. Lastly, the hint must be clear (ℋ_clear), ensuring that it is easy to understand and devoid of unnecessary complexity. The student language model will be optimized to consistently meet these quality attributes in its generations.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Generation methodology.</title>
          <p>
            Following prior work [
            <xref ref-type="bibr" rid="ref40 ref42">42, 40</xref>
            ], feedback ℱ is always generated using a chain-of-thought approach that prompts language models to generate the explanation ℰ (the “thought”) followed by the hint ℋ. This strategy ensures hints are grounded in accurate explanations.
          </p>
          <p>Assumptions. To reach our objective, we consider a training dataset 𝒟 = {(dᵢ, cᵢ)}ᵢ₌₁ⁿ consisting of pairs of problem descriptions dᵢ and incorrect student programs cᵢ.</p>
          <p>We also assume access to a teacher LLM, accessible via an online API (e.g., GPT-4o-mini via the OpenAI API), as well as to a set of medium-sized assistant LMs.</p>
          <p>The teacher model is presumed to generate high-quality feedback, while the assistant models perform
well but with lower quality, and the student model may initially perform poorly.</p>
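<p>
  As a concrete illustration, the chain-of-thought generation strategy can be sketched as a single prompt that asks for the explanation before the hint. This is a minimal sketch: the wording below is illustrative, not the exact prompt used in our experiments.
</p>

```python
# Minimal sketch of the chain-of-thought feedback prompt: the model is asked
# to produce the explanation (the "thought") first, then a single hint.
# The wording is illustrative, not the exact prompt from the paper.

def build_feedback_prompt(problem_description: str, student_program: str) -> str:
    return (
        "You are a CS1 tutor. A student submitted an incorrect program.\n\n"
        f"## Problem description\n{problem_description}\n\n"
        f"## Student program\n{student_program}\n\n"
        "First, write an explanation that identifies and describes the first bug "
        "(one issue only; do not mention unrelated problems).\n"
        "Then, write a single next-step hint that guides the student toward fixing "
        "that bug without revealing the solution."
    )

prompt = build_feedback_prompt("Print the sum of two numbers.", "print(a - b)")
```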
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Supervised Fine-tuning</title>
        <p>
          Given the lack of human annotations, following Kotalwar et al. [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ], a natural first step in improving our small language model is to apply Supervised Fine-Tuning (SFT), that is, training the student model on teacher-generated feedback for all incorrect programs in the training set using the negative log-likelihood (NLL) loss. This yields a model π_sft. We generate the teacher feedback using greedy decoding.
        </p>
        <p>
          SFT represents the simplest form of distillation [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ], where the student directly mimics the teacher’s
outputs. However, SFT alone risks overfitting, especially when training data is limited, and does not
allow language models to understand what constitutes high-quality responses.
        </p>
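<p>
  The SFT objective reduces to the negative log-likelihood of the teacher feedback tokens under the student model. A toy numeric sketch (the probabilities below are made up; in practice they come from the student LM's softmax output):
</p>

```python
import math

# Toy illustration of the SFT objective: mean negative log-likelihood of the
# teacher's feedback tokens under the student model. The probabilities are
# invented for illustration; a real run uses the model's token probabilities.

def nll_loss(token_probs):
    """Mean negative log-likelihood over the target (teacher feedback) tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Probabilities the student assigns to the teacher's feedback tokens:
loss = nll_loss([0.9, 0.5, 0.25])
```

A perfectly confident model (all probabilities 1.0) attains a loss of zero; training pushes the student toward that regime on teacher outputs.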
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Learning From Feedback Preferences</title>
        <p>
          In this paper, we propose to apply preference-based optimisation techniques on top of the SFT-trained
small language models to refine their abilities to generate high-quality feedback. Unlike RLHF setups
that rely on human preference labels [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], we generate preferences automatically. We compare two
configurations that vary in how feedback examples are generated and how the student model is updated.
        </p>
        <p>In both setups, we use the teacher model to score and rank the generated feedback using a rubric-based
process before optimizing the student model via an appropriate preference alignment algorithm.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Teacher-Assistant-Student Alignment Pipeline (TASAP)</title>
          <p>
            Our first approach, the Teacher-Assistant-Student Alignment Pipeline (TASAP), follows similar offline off-policy preference alignment strategies underpinning the success of many language models [
            <xref ref-type="bibr" rid="ref32 ref44">32, 44</xref>
            ]. To apply such methods in our context, we need to construct a preference dataset of feedback pairs 𝒟_p = {(dᵢ, cᵢ, ℱ_w), (dᵢ, cᵢ, ℱ_l)}ᵢ₌₁ᴹ, where ℱ_w (the “winning” feedback) is ranked higher than ℱ_l (the “losing” feedback) based on a quality criterion.
          </p>
          <p>
            Step 1: Data collection. For each incorrect program, we sample three feedback texts, one from each of the assistant models, using greedy decoding following prior work [
            <xref ref-type="bibr" rid="ref14 ref31">14, 31</xref>
            ]. We also reuse the feedback generated by the teacher language model during the supervised fine-tuning step.
          </p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Step 2: Judging and scoring generations.</title>
          <p>
            Then, we use our teacher LLM as a judge [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] to grade all four generated feedback texts independently against a rubric based on our predefined quality criteria: ℰ_acc, ℰ_sel, ℰ_clear, ℋ_corr, ℋ_info, ℋ_conc, and ℋ_clear, assigning each criterion a binary value of either 0 (false) or 1 (true) [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ]. Following Koutcheme et al. [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], our prompt for judging feedback (see Figure 3, Appendix B) asks the teacher LM to use its own generated feedback as ground truth to evaluate the newly provided one. This reference-grading strategy [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] ensures the student generations remain aligned with the teacher, reduces variability in judgments, and helps keep the preference dataset free of noise [
            <xref ref-type="bibr" rid="ref45">45</xref>
            ]. Using the grading values, we assign each feedback an overall quality score [
            <xref ref-type="bibr" rid="ref22 ref32">22, 32</xref>
            ] using a weighted sum:
s = 0.20 ⋅ ℰ_acc + 0.15 ⋅ ℰ_sel + 0.10 ⋅ ℰ_clear + 0.20 ⋅ ℋ_corr + 0.15 ⋅ ℋ_info + 0.10 ⋅ ℋ_conc + 0.10 ⋅ ℋ_clear
where the resulting score s is bounded between 0 and 1. Our scoring function prioritizes explanation accuracy and hint correctness to ensure feedback is factual. We then weight explanation selectivity and hint informativeness to discourage the generation of irrelevant or hallucinated information. Attributes like clarity and concealment are weighted last, as they are secondary to the validity of the feedback (the scoring function can be adapted by teachers to match their needs).
          </p>
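<p>
  The weighted sum can be sketched in a few lines of Python. The attribute keys (e_acc, ..., h_clear) are illustrative labels for the binary rubric criteria, not the paper's notation:
</p>

```python
# Sketch of the rubric-weighted quality score. Each rubric entry is a binary
# judge verdict (0 or 1); the weights sum to 1.0, so the score lies in [0, 1].

WEIGHTS = {
    "e_acc": 0.20, "e_sel": 0.15, "e_clear": 0.10,   # explanation: accurate, selective, clear
    "h_corr": 0.20, "h_info": 0.15, "h_conc": 0.10,  # hint: correct, informative, concealed
    "h_clear": 0.10,                                 # hint: clear
}

def quality_score(rubric: dict) -> float:
    """Weighted sum of binary (0/1) judge verdicts; missing criteria count as 0."""
    return sum(WEIGHTS[k] * rubric.get(k, 0) for k in WEIGHTS)

perfect = quality_score({k: 1 for k in WEIGHTS})
```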
        </sec>
        <sec id="sec-3-3-3">
          <title>Step 3 - Preference dataset creation.</title>
          <p>Using the four feedback texts obtained for each incorrect program cᵢ (three sampled from the assistant models and one generated by the teacher), we add to our preference dataset 𝒟_p all possible feedback pairs (ℱ_w, ℱ_l) where the score of ℱ_w is higher than the score of ℱ_l (s_w &gt; s_l).</p>
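<p>
  The pair-construction step can be sketched as follows (names and scores are illustrative); pairs with equal scores carry no preference signal and are skipped:
</p>

```python
from itertools import combinations

# Sketch of the preference-dataset creation step: from scored feedback texts,
# keep every pair whose scores differ, ordered as (winner, loser).

def build_preference_pairs(scored_feedback):
    """scored_feedback: list of (feedback_text, score). Returns (winner, loser) pairs."""
    pairs = []
    for (fa, sa), (fb, sb) in combinations(scored_feedback, 2):
        if sa > sb:
            pairs.append((fa, fb))
        elif sb > sa:
            pairs.append((fb, fa))
        # equal scores: no preference signal, skip the pair
    return pairs

pairs = build_preference_pairs(
    [("teacher", 1.0), ("a1", 0.6), ("a2", 0.6), ("a3", 0.3)]
)
```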
        </sec>
        <sec id="sec-3-3-4">
          <title>Step 4 - Optimization.</title>
          <p>
            Using the resulting preference dataset, we train our language model with the DPO loss function [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ]:
ℒ_DPO(π_θ; π_sft) = −𝔼_{(d, c, ℱ_w, ℱ_l) ∼ 𝒟_p} [ log σ( β log (π_θ(ℱ_w ∣ d, c) / π_sft(ℱ_w ∣ d, c)) − β log (π_θ(ℱ_l ∣ d, c) / π_sft(ℱ_l ∣ d, c)) ) ]   (1)
where σ is the logistic function, π_θ is the policy being optimized (i.e., the model during training), π_sft is the reference policy (i.e., the frozen model before training), and β is a regularization parameter that controls the deviation of the trained policy from the reference policy. A higher β keeps the trained model closer to the reference policy. Intuitively, this formulation penalizes the model based on how much it “prefers” the lower-quality (losing) feedback over the higher-quality (winning) feedback, which results in gradually increasing the probability of generating high-quality outputs.
          </p>
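<p>
  For intuition, Equation (1) can be evaluated numerically for a single preference pair from the sequence log-probabilities under the trained and frozen policies (the log-probability values below are invented for illustration):
</p>

```python
import math

# Numeric sketch of the DPO loss for a single preference pair. Inputs are the
# log-probabilities of the winning/losing feedback under the trained policy
# (logp_*) and under the frozen SFT reference policy (ref_logp_*).

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the trained policy prefers the winner more than the reference does,
# the margin is positive and the loss drops below log(2) (the zero-margin value).
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-14.0, ref_logp_l=-14.0)
```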
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.2. Online Student Alignment Pipeline (OSAP)</title>
          <p>
            Our second approach, Online Student Alignment Pipeline (OSAP), is an online on-policy variant of
TASAP based on Direct Language Model Alignment from Online AI Feedback [
            <xref ref-type="bibr" rid="ref41">41</xref>
            ]. Compared to offline
approaches, online training continuously updates a language model based on its own generations,
potentially reducing common issues associated with using static preference datasets, such as distribution
shift [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ] and overfitting [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ].
          </p>
          <p>
            Starting from the supervised fine-tuned model π_sft, OSAP integrates the sampling, data collection, and optimization steps of the TASAP pipeline within a single optimization loop. At each iteration:
(a) Instead of sampling from assistant models, we sample two generations ℱ_1, ℱ_2 ∼ π_θ(⋅ ∣ d, c) from our language model using multinomial sampling with an arbitrary temperature of 0.3 [
            <xref ref-type="bibr" rid="ref41">41</xref>
            ].
(b) We use our teacher model to independently judge and score each generation to determine the winning feedback ℱ_w and losing feedback ℱ_l, before updating the model parameters based on the resulting preference ordering using the original DPO loss function (see Equation 1). If both generations obtain the same score, we default to syntactic distance measures and select the feedback having the highest ROUGE score [
            <xref ref-type="bibr" rid="ref47">47</xref>
            ] with the teacher feedback [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
          </p>
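<p>
  The tie-break in step (b) can be sketched as follows. We use a simple unigram-overlap F1 as a stand-in for ROUGE; the exact ROUGE variant and tokenization are implementation details not fixed here:
</p>

```python
from collections import Counter

# Sketch of the OSAP tie-break: when both sampled generations receive the same
# judge score, pick the one closest to the teacher feedback. unigram_f1 is a
# simplified stand-in for a ROUGE score.

def unigram_f1(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def break_tie(gen_a: str, gen_b: str, teacher_feedback: str) -> str:
    """Return the generation with the highest overlap with the teacher feedback."""
    return max((gen_a, gen_b), key=lambda g: unigram_f1(g, teacher_feedback))
```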
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Adaptation to New Student Data</title>
        <p>
          In real-life scenarios, student programming submissions are collected by course offerings (e.g., semester by semester) and accumulate, and even somewhat change, over time [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ]. To take this scenario into account, we study how effectively each of the two preference-based alignment strategies, TASAP and OSAP, can be applied when additional training data is introduced. We consider the task of refining a model already trained on an initial dataset 𝒟₁, using a new semester of data 𝒟₂ = {(dᵢ, cᵢ)}ᵢ₌₁ᵐ.
TASAP: We perform steps 1 to 4 of the TASAP pipeline on the new dataset of students’ incorrect programs 𝒟₂ to obtain a second preference dataset 𝒟_p,2. Training the first model exclusively on this new dataset might induce catastrophic forgetting [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ], where the student model loses some of the knowledge it acquired when trained on 𝒟_p,1. To mitigate this issue, we initialize the weights of our model to the supervised fine-tuned version (see Section 3.2) of the first semester (i.e., π_sft) and train this model using the IPO loss on the combined 𝒟_p,1 ∪ 𝒟_p,2. Our choice of not repeating the supervised fine-tuning step on the combined dataset, and instead starting from π_sft, is motivated by our attempt to mitigate overfitting risks.
        </p>
        <p>OSAP: For OSAP, we continue the training pipeline directly¹ from the OSAP model trained on the first semester, using the problem descriptions and incorrect programs from 𝒟₁ ∪ 𝒟₂. This strategy thus reflects a true continual learning setup and benefits most from new data.</p>
        <p>We note that both techniques can be applied continuously, for instance, for refining the model trained on two semesters of data using a third one.</p>
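<p>
  Schematically, the two adaptation strategies differ in their starting point and training data. In the sketch below, train_preference and continue_online are hypothetical stand-ins for the preference-training and online pipelines described above:
</p>

```python
# Schematic of the two adaptation strategies. train_preference and
# continue_online are hypothetical callables standing in for the pipelines
# described in the text; only the data/initialization flow is shown.

def adapt_tasap(sft_weights, pref_sem1, pref_sem2, train_preference):
    """Restart from the semester-1 SFT weights and train on the combined
    preference datasets, to limit catastrophic forgetting."""
    return train_preference(init=sft_weights, data=pref_sem1 + pref_sem2)

def adapt_osap(osap1_model, programs_sem1, programs_sem2, continue_online):
    """Continual learning: keep training the semester-1 OSAP model on all
    problems seen so far, sampling and judging fresh generations online."""
    return continue_online(model=osap1_model, programs=programs_sem1 + programs_sem2)
```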
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          In this section, we present our experiments, aiming to answer the following research question:
(RQ) How effective are TASAP and OSAP in improving the feedback quality of small language models
when trained and evaluated across semesters of the same introductory programming course?
We perform our experiments using FalconCode [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], a large and comprehensive publicly available
dataset containing real-life CS1 students’ solutions to Python programming exercises. Beyond its
substantial scale, this dataset distinguishes itself through free-form assignments, enabling a broader
evaluation of language models’ abilities to generate feedback.
        </p>
        <p>
          Preprocessing. The FalconCode dataset is split over three subsets (three semesters of data). Within
each subset, we select all unique incorrect programs from all students’ last submitted solutions for
all assignments automatically evaluated with unit tests [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ]. Uniqueness is determined via AST
normalization2. While we acknowledge that this selection may not fully capture the range of dificulties
students encounter during their attempts, it aligns with the idea that a student’s last attempt often
reflects their improved understanding of the problem. Thus, our setup can be viewed as providing
feedback to students as a last resort for elements they may not have grasped.
        </p>
        <p>
          We leverage the first and second semesters for training and iterative refinement, respectively, and the last semester for testing. To ensure our setup evaluates our models’ generalization abilities, we filter out from the test set the programs having normalised AST representations similar to those in the first two semesters [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ]. This results in three splits with 826, 690, and 693 incorrect programs from 62, 44, and 62 assignments, respectively.
¹ In practice, we also need to generate feedback using the teacher model on the new semester of data 𝒟₂ to allow the fallback to a syntactic distance measure comparison.
² Including variable renaming.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Models</title>
        <p>
          To answer our research questions, we fine-tune two small language models, SmolLM-V2-1.7B [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] and
Llama-3.2-1B [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], using GPT-4o-mini [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ] as the teacher. We chose these two student models for
their strong performance on small-model benchmarks, while GPT-4o-mini has been shown to produce
high-quality programming feedback [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>
          Baseline. As a baseline, we use models trained with Supervised Fine-Tuning (SFT) on the teacher-generated data, following Kotalwar et al. [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ].
        </p>
        <p>Versions. We train each of our models (on FalconCode) using our two proposed approaches: TASAP and OSAP. For each approach, we train a first version (TASAP-1 and OSAP-1) on the first semester of FalconCode. We then train second versions (TASAP-2 and OSAP-2) using the adaptation-to-new-student-data strategy (see Section 3.4) with both the first and second FalconCode semesters.</p>
        <p>
          Assistant models. For TASAP, we leverage three assistant language models: Mistral-Nemo-12B [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ],
Llama-3.1-8B [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], and Qwen-2.5-3B [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ]. We chose these models to ensure diversity across model
families, sizes, and performance [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ].
        </p>
        <p>
          Parameter-Efficient Fine-Tuning. To take into account educators' limited access to computational resources, we train our SFT and TASAP models (as well as the baselines) with Low-Rank Adapters (LoRA) [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], a parameter-efficient fine-tuning method that reduces memory requirements by freezing the base model and adding a small number of trainable parameters called adapters [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. These adapters can be removed to restore the base model's original capabilities.
        </p>
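<p>The adapter mechanism can be sketched in a few lines of plain Python. This is an illustration of the low-rank update only, not our training code; the alpha/r scaling mirrors the usual LoRA convention, and all names here are ours:</p>

```python
def matmul(X, Y):
    """Plain-Python matrix product for the tiny illustration below."""
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*Y)] for row in X]

def lora_forward(W, A, B, x, alpha, r):
    """y = (W + (alpha / r) * B @ A) x, computed without merging.

    W is the frozen base weight; A (r x d) and B (d x r) are the only
    trainable parameters. Dropping (A, B) restores the base model
    exactly, which is why the adapters are removable."""
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    BA = matmul(B, A)
    delta = [sum((alpha / r) * w * xi for w, xi in zip(row, x))
             for row in BA]
    return [b + d for b, d in zip(base, delta)]
```

<p>With rank r much smaller than the weight dimensions, the trainable parameter count (and optimizer memory) shrinks accordingly.</p>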
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Automated Evaluations: LLMs-as-feedback-judges</title>
        <p>
          Manually evaluating all models on our datasets would require substantial effort, even on a subset of
generations. Instead, we leverage LLMs-as-judges once again for our final evaluation [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]. However,
rather than relying on a single model for this task, following Verga et al. [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ], we use a panel of three
strong LLMs: Llama-3.3-70B [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], GPT-4o-mini [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ], and Gemini-2.0-flash [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ]. Earlier versions of
the GPT-4o and Llama-3 model families have been used extensively as judges [
          <xref ref-type="bibr" rid="ref56">56</xref>
          ], also in programming
contexts [
          <xref ref-type="bibr" rid="ref14 ref17">17, 14</xref>
          ], and Gemini has recently demonstrated comparable performance to GPT-4o-mini on
multiple benchmarks. While GPT-4o-mini and Gemini-2.0-flash are lighter versions of their full-size
counterparts, they remain strong judges for programming feedback. For instance, GPT-4o-mini has
been shown to perform on par with GPT-4o for evaluating feedback quality [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Moreover, Verga et al.
[
          <xref ref-type="bibr" rid="ref54">54</xref>
          ] demonstrate that ensembles of smaller LLMs from different families outperform single large models,
particularly by mitigating individual model biases.
        </p>
        <p>Evaluation prompting strategy. For each feedback ℱ generated on the test set, we prompt all judges (see Figure 4, Appendix B) to provide binary decisions across all quality criteria. We obtain the final verdict using a strict unanimity policy: a criterion is marked correct only if all judges agree. While this method does not provide absolute performance guarantees, as discussed in our Limitations of Work, it offers a consistent, scalable, and reliable strategy for comparing the relative effectiveness of different training approaches.</p>
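<p>The strict unanimity policy amounts to a simple aggregation over the judges' binary gradings. A minimal sketch (criterion names are placeholders for the seven criteria used in the paper):</p>

```python
def panel_verdict(judge_gradings: list[dict[str, bool]]) -> dict[str, bool]:
    """Strict unanimity: a criterion is marked correct only if every
    judge on the panel marked it true."""
    criteria = judge_gradings[0].keys()
    return {c: all(g[c] for g in judge_gradings) for c in criteria}
```

<p>A single dissenting judge is thus enough to mark a criterion as not met, which makes the evaluation deliberately conservative.</p>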
        <p>
          Human validation. Following Scarlatos et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], we conduct a small-scale analysis over a subset
of language model generations to validate the use of LLM-as-judges and provide insights into potential
evaluation errors.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Experiment details</title>
        <p>
          We fine-tune our models using the HuggingFace TRL library, following hyperparameters recommended
in prior work. For SFT, we use a learning rate of 1e-4 [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]; for TASAP and OSAP, we set β = 0.25 [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]
and use learning rates of 1e-5 and 1e-6, respectively. Batch sizes are 8 for SFT and TASAP, and 16
for OSAP. TASAP-2 and OSAP-2 reuse these settings and repeat the training process as described in
Section 3.4. We apply LoRA with α = 64 and rank r = 32 [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], train each model for up to 3 epochs, and
select checkpoints based on lowest validation loss. All other hyper-parameters remain at default values.
Full experimental details and prompts are available in our code base. All training was performed on
Nvidia Tesla V100 GPUs (32GB RAM) via Triton, our institution’s research cluster.
        </p>
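<p>Section 3 specifies the exact training objectives; as an illustration only, if the β-weighted preference loss follows the standard DPO formulation (an assumption on our part, with β = 0.25 as above), a single chosen/rejected feedback pair contributes:</p>

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.25):
    """DPO-style loss for one (chosen, rejected) pair: the policy's
    log-probabilities (logp_w, logp_l) are compared against the frozen
    reference model's (ref_logp_w, ref_logp_l), and the margin is
    pushed through a beta-scaled log-sigmoid. Argument names are ours."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

<p>A larger β sharpens the penalty for ranking the rejected feedback above the chosen one; at a zero margin the loss is log 2.</p>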
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Main results</title>
        <p>Legend. TASAP(-2): Teacher-Assistant-Student Alignment Pipeline (trained on 2 semesters); OSAP(-2): Online Student Alignment Pipeline (trained on 2 semesters). QWEN: Qwen-2.5-3B; LLAMA: Llama-3.1-8B; NEMO: Mistral-Nemo-12B; MINI: GPT-4o-mini. Explanation (ℰ) criteria: accuracy, selectivity, clarity. Hint (ℋ) criteria: correctness, informativeness, concealment, clarity.</p>
        <sec id="sec-5-1-1">
          <title>Llama-3.2-1B (Student)</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>Smol2-1.7B (Student)</title>
          <p>[Results table: per-criterion feedback-quality scores for the BASE, SFT, OSAP(-2), and TASAP(-2) versions of each student model, with MINI (GPT-4o-mini) and QWEN (Qwen-2.5-3B) as references; the numeric values are not recoverable from this extraction.] Both aligned versions outperform their respective supervised fine-tuned base models.</p>
          <p>
            Tracing this observation backwards, we note that although the Smol base model performs slightly better than the Llama base model (as expected, due to size differences), supervised fine-tuning benefits the Llama model more. We hypothesise that this may be due to a distribution shift: the Llama model's answer distribution is closer to that of GPT-4o-mini, making further improvements easier [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ]. Training
with OSAP and TASAP may guide the model toward a more optimal solution space. Hieke et al.
[
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] have already highlighted that preference optimization exerts a regularising effect on supervised fine-tuning.
          </p>
          <p>While TASAP generally outperforms OSAP across both language models, this performance gap
narrows when training on additional data (e.g., from a subsequent semester). OSAP-2 performs comparably
to TASAP-2 on Llama, and even outperforms TASAP-2 on Smol.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Small-scale human evaluation</title>
        <p>
          Following Scarlatos et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], we conduct a small-scale analysis of LLMs-as-judges performance in
our setting. We arbitrarily selected a subset of 5 representative assignments in our dataset (see Table 1,
Appendix A). For each assignment, we choose the student’s submitted incorrect solution which had the
highest unit test score. Then, one author of the paper manually annotated the quality of the generations
of the BASE, SFT, TASAP, and OSAP models for those 5 assignments for the two models, resulting in
4 × 5 × 2 = 40 annotations with 7 criteria. Our analysis procedures follow prior work [
          <xref ref-type="bibr" rid="ref17 ref22">17, 22</xref>
          ], considering
such manual annotations as ground truths and the LLM-as-judges ensemble result as predictions in 7
distinct binary classification problems (one per criterion). Table 2 shows the results of these annotations for various classification metrics. [Table 2 caption: LLM-as-judges classification performance. Legend: #PA: number of positive human annotations (out of 40) for each criterion; we report accuracy, precision, recall, F1, F0.5, and Cohen's kappa. Explanation (ℰ) criteria: accuracy, selectivity, clarity. Hint (ℋ) criteria: correctness, informativeness, concealment, clarity. The table values are not recoverable from this extraction.] For some criteria, generations were rarely selective (ℰ) or correct (ℋ), resulting in an imbalanced classification task. Consistent with the main results, the ensemble did not classify any generation as containing an informative hint, and our human annotations identified only 4 out of 40 (10%). While LLMs are not perfect evaluators in general [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ], our small-scale
human analysis supports their utility in this context as a reasonable proxy for human judgment.
        </p>
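<p>Treating the manual annotations as ground truth and the ensemble verdicts as predictions, the per-criterion metrics in Table 2 can be computed with standard-library code (a sketch; the function name is ours):</p>

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and Cohen's kappa for one criterion,
    with human annotations as ground truth and the judge ensemble's
    verdicts as predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    n = len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}
```

<p>Kappa is the most informative of these under class imbalance, since raw accuracy can be high even when the positive class is almost never predicted.</p>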
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Conclusion</title>
      <p>In this paper, we presented a framework for improving small language models' ability to provide feedback using Reinforcement Learning with AI Feedback. We proposed two approaches based on offline and online preference alignment methods and evaluated their performance using LLMs-as-judges on a publicly available dataset of students' Python programs. To summarize the answer to our research question: the proposed framework, including the TASAP and OSAP methods, is effective in enhancing SLMs' feedback capability within a course setting.</p>
      <sec id="sec-6-1">
        <title>Practical educational implications.</title>
        <p>
          By utilizing and training fine-tuned models, educators can
provide tailored guidance to their students in a timely manner without constantly relying on external
APIs. Such small models can be effectively deployed using tools like WebLLM [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ], even allowing
inference on client devices with a compatible GPU. This reduces latency, ensures timely feedback, and
eliminates the need to deploy custom inference services [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. This can also give educators and learners
higher control over the generated information and its use.
        </p>
        <p>
          We do not claim that purely data-driven approaches are the way to go. Ideally, when preference
data can be collected and integrated by human TAs [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], such approaches should most likely be
prioritised. However, in many instances, programming courses do not have human TAs write feedback
to students or collect preference data, which limits how such prior work can be effectively used. Our
work closes this gap.
        </p>
        <p>
          Privacy and cost issues. We acknowledge that leveraging RLAIF pipelines requires sending student
data through external APIs for teacher-model queries, which weakens privacy protections and may also incur initial costs. However, we also note that much of the existing work in programming education
already relies on proprietary models (e.g. [
          <xref ref-type="bibr" rid="ref40 ref42">40, 42</xref>
          ]). Moreover, distilling the performance of remote
queried models to smaller models run locally decreases long-term costs. For institutions with strict
data privacy concerns, open-source LLMs (e.g., a 4-bit quantized Llama-3.3-70B) could, given sufficient
computational resources, also be hosted locally and used as teacher models.
        </p>
        <p>Room for improvement. We note that our results can be further improved, for instance, by training with more data, leveraging several prompts for different feedback tasks simultaneously, and increasing the LoRA rank (the parameter controlling how many parameters are updated during training). Programming educators and practitioners often have data readily available to them through the use of automated assessment systems.</p>
        <p>
          The framework is versatile. We anticipate that this training procedure will generalise to more
complex prompting strategies, for instance, leveraging program repairs [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ] to produce better feedback,
as the improvements stem from the framework’s alignment mechanisms rather than the specific prompt
design. The framework is adaptable; we recommend adopting the same setup as ours: using a teacher LLM for evaluation over one semester to understand what to expect.
Limitations of work. First, we conducted all experiments on a single dataset of Python programming
submissions collected from one institution and did not explore whether our results hold in other contexts.
Second, although our automated evaluation pipeline is robust, leveraging several leading large language
models, no large-scale human analysis was performed. Third, our experiments were limited to two small
models with around 1B parameters. While prior work suggests that performance improves with base
model size [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], it remains to be seen whether the same trends hold when applying OSAP and TASAP
to larger models. Fourth, importantly, we do not claim that OSAP and TASAP trained models produce
feedback matching the exact reported scores (e.g., we do not assert that the models now generate “nearly
perfect feedback”). Rather, the combination of a large dataset and the substantial performance margins
allows us to confirm relative rankings with confidence, even when taking into account judgment error
rates [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>Future work. Future work will address these gaps by first conducting human evaluations to validate
the usefulness of the feedback generated by our trained models. This will include qualitative surveys
with both teachers and students to gain insights into their perspectives. We also plan to conduct
small-scale A/B studies in real educational settings, comparing courses that use these locally deployed
models as AI teaching assistants with those relying on larger models. These deployments will provide
critical insights into small models’ impact on student learning, engagement, and overall educational
outcomes.</p>
        <p>
          Moving forward, we are studying ways to improve small language models’ programming feedback
ability without relying on large language models. In particular, we believe the recent success of pure
reinforcement learning methods such as Group Relative Policy Optimization (GRPO) [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ] could also benefit
programming education.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and Grammarly for grammar and spelling checking. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Dataset</title>
    </sec>
    <sec id="sec-9">
      <title>B. Prompts</title>
      <p>Figures 2, 3, and 4 show the prompts used in our study.</p>
      <p>Assignment 1 - lsn6_lists
Write an algorithm that gets a decimal GPA, APA, and MPA from the user (in that order). You may assume
that all inputs are non-negative whole numbers.</p>
      <p>It then reports which meritorious list the cadet is on. If the GPA is equal to or above 3.0, the cadet is on the
“Dean’s List”, and if the APA is equal to or above 3.0, the cadet is on the “Athletic Director’s List”, and if the
MPA is equal to or above 3.0, the cadet is on the “Commandant’s List”. Finally, if the cadet qualifies for all
three individual lists, then the cadet is on the “Superintendent’s List”. The algorithm should report all the
lists the cadet is on (in the order defined above), unless the cadet is on the Superintendent’s List, in which
case, it should report only “Superintendent’s List”.</p>
      <p>Assignment 2 - lsn9_imagesize
Write a function that computes the size of an uncompressed image. You will name your function
calculate_size_of_image(), and it will have three parameters: the width of the image, the height of the image, and
the bit depth (i.e., the number of bits per pixel). The function should print the size of the image in kilobytes.
Assignment 3 - IterLogic2_football
In Python, write an algorithm that first asks the user how many football players they wish to enter statistics
for and then gets that many yearly passing totals for each player. Output how many of those players had
more than 5000 passing yards in a year. Also, your algorithm will output the average yardage per year as
well as the minimum yardage entered, in that order. You can assume there is at least one player’s yardage to
input.</p>
      <p>Assignment 4 - Lists2_movies
Write a Python function called ‘get_movies‘ that takes three parameters: * A two-dimensional list containing
movie titles and other stats * A rating (e.g., “PG”, “R”) * A run time (in minutes)
Your function should return the number of movies that have the specified rating, and run for at least the
number of minutes specified.</p>
      <p>Assignment 5 - a3_6_pushups
You have been asked to write a program that analyzes number of pushups done by a group of cadets. Write
a program that gets from the user the number of people tested, and gets that many pushup scores (which
you may assume are whole numbers) from the user. Your program must print out: * The average number of
pushups for the group. * The count of cadets that scored higher than the average.</p>
      <p>You are a CS professor teaching introductory programming using Python.</p>
      <p>Below are a problem description and an incorrect program written by a student (i.e., it does not pass all test cases).
&lt;problem description&gt;, &lt;student code&gt;
• Identify and explain the first bug in the student program in 1-3 sentences.
• Focus on a functional issue only; do not discuss performance improvements or stylistic concerns.
• Provide a short and specific hint to help the student address the identified bug.
• The hint should encourage the student to think critically about resolving the issue without directly providing a solution
or code fix.
• Concentrate on one single issue in the program.</p>
      <p>• Ensure both the explanation and the hint are clear, concise, and actionable.</p>
      <p>Below are a problem description and an incorrect program written by a student (i.e., it does not pass all test cases).
1. Explain the first bug:
• Identify and explain the first bug in the student program in 1-3 sentences.
• Focus on a functional issue only; do not discuss performance improvements or stylistic concerns.
2. Generate a Hint:
• Provide a short and specific hint to help the student address the identified bug.
• The hint should encourage the student to think critically about resolving the issue without directly providing a
solution or code fix.
• Concentrate on one single issue in the program.</p>
      <p>• Ensure both the explanation and the hint are clear, concise, and actionable.</p>
      <p>Below is the feedback written by a teaching assistant (TA), which includes an explanation of and fixes for the bugs in the program, as well as a hint for the first bug.</p>
      <p>Your task is to evaluate the quality of the TA’s feedback according to the grading criteria outlined below.</p>
      <p>This evaluation will be conducted in two parts
1. Reasoning: Reflect on the quality of the TA’s feedback.
• Reflect on the quality of the feedback, using the grading criteria as a guide.
• Discuss strengths and weaknesses in the explanation and hint.
2. Grading List: Conclude with your final assessment for each criterion.</p>
      <p>• If the criterion is fully met, respond with “true”; otherwise, respond with “false”.</p>
      <p>Please provide your answer using a JSON format with two keys:
• “reasoning”: your detailed written analysis
• “grading”: a dictionary with each criterion as a key and your final answer (true or false) as the value.</p>
      <p>Use only true or false (no other qualifiers) for each grading criterion in the JSON output.</p>
      <p>List of judge-generated bugs and fixes</p>
      <p>Below is a problem description and an incorrect program written by a student (i.e., it does not pass all test cases).
problem description, student code
Below is the feedback written by a teaching assistant (TA), which includes an explanation of and fixes for the bugs in the program, as well as a hint for the first bug.
This evaluation will be conducted in two parts</p>
      <p>1. Reasoning: Reflect on the quality of the TA’s feedback.
• Reflect on the quality of the feedback, using the grading criteria as a guide.
• Discuss strengths and weaknesses in the explanation and hint.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Luxton-Reilly</surname>
          </string-name>
          , Simon,
          <string-name>
            <given-names>I.</given-names>
            <surname>Albluwi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giannakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Paterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sheard</surname>
          </string-name>
          , et al.,
          <article-title>Introductory programming: a systematic literature review</article-title>
          ,
          <source>in: Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vihavainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Airaksinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Watson</surname>
          </string-name>
          ,
          <article-title>A systematic review of approaches for teaching introductory programming and their influence on success</article-title>
          ,
          <source>in: Proceedings of the tenth annual conference on International computing education research</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Keuning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeuring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Heeren</surname>
          </string-name>
          ,
          <article-title>A systematic literature review of automated feedback generation for programming exercises</article-title>
          ,
          <source>ACM Transactions on Computing Education (TOCE) 19</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Timperley</surname>
          </string-name>
          ,
          <article-title>The power of feedback</article-title>
          ,
          <source>Review of educational research 77</source>
          (
          <year>2007</year>
          )
          <fpage>81</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Shute</surname>
          </string-name>
          , Focus on formative feedback,
          <source>Review of educational research 78</source>
          (
          <year>2008</year>
          )
          <fpage>153</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lohr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Keuning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kiesler</surname>
          </string-name>
          ,
          <article-title>You're (not) my type-can llms generate feedback of specific types for introductory programming tasks?</article-title>
          ,
          <source>Journal of Computer Assisted Learning</source>
          <volume>41</volume>
          (
          <year>2025</year>
          )
          . URL: https://onlinelibrary.wiley.com/doi/10.1111/jcal.13107. doi:10.1111/jcal.13107.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prather</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finnie-Ansley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luxton-Reilly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Reeves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <article-title>Computing education in the era of generative ai</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>67</volume>
          (
          <year>2024</year>
          )
          <fpage>56</fpage>
          -
          <lpage>67</lpage>
          . URL: https://doi.org/10.1145/3624720. doi:10.1145/3624720.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>U. Z.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sahai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karkare</surname>
          </string-name>
          ,
          <article-title>Feasibility study of augmenting teaching assistants with ai for cs1 programming feedback</article-title>
          ,
          <source>in: Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          SIGCSE TS
          <year>2025</year>
          ,
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>2025</year>
          , p.
          <fpage>11</fpage>
          -
          <lpage>17</lpage>
          . doi:10.1145/3641554.3701972.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zenke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thornton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Malan</surname>
          </string-name>
          ,
          <article-title>Teaching CS50 with AI: Leveraging generative artificial intelligence in computer science education</article-title>
          ,
          <source>in: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 2</source>
          , SIGCSE 2024, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>1927</fpage>
          . URL: https://doi.org/10.1145/3626253.3635427. doi:10.1145/3626253.3635427.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <article-title>A large scale RCT on effective error messages in CS1</article-title>
          ,
          <source>in: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1</source>
          , SIGCSE 2024, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>1395</fpage>
          -
          <lpage>1401</lpage>
          . doi:10.1145/3626252.3630764.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vadaparty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zingaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Smith IV</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Padala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alvarado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gorson Benario</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <article-title>CS1-LLM: Integrating LLMs into CS1 instruction</article-title>
          ,
          <source>in: Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1</source>
          , ITiCSE 2024, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>297</fpage>
          -
          <lpage>303</lpage>
          . doi:10.1145/3649217.3653584.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liffiton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Sheese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>CodeHelp: Using large language models with guardrails for scalable support in programming classes</article-title>
          ,
          <source>in: Proceedings of the 23rd Koli Calling International Conference on Computing Education Research</source>
          , Koli Calling '23, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          . doi:10.1145/3631802.3631830.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Adeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>von Arx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brunskill</surname>
          </string-name>
          , et al.,
          <article-title>On the opportunities and risks of foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2108.07258</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dainese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>Open source language models can provide feedback: Evaluating LLMs' ability to help students using GPT-4-as-a-judge</article-title>
          ,
          <source>in: Proceedings of the 2024 Innovation and Technology in Computer Science Education</source>
          , Volume
          <volume>1</volume>
          , ITiCSE '24,
          <year>2024</year>
          . doi:10.1145/3649217.3653612.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bergen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liut</surname>
          </string-name>
          ,
          <article-title>Integrating small language models with retrieval-augmented generation in computing education: Key takeaways, setup, and practical insights</article-title>
          ,
          <source>in: Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          SIGCSE TS 2025, Association for Computing Machinery, New York, NY, USA,
          <year>2025</year>
          , p.
          <fpage>1302</fpage>
          -
          <lpage>1308</lpage>
          . doi:10.1145/3641554.3701844.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bulbulia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bergen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liut</surname>
          </string-name>
          ,
          <article-title>Can small language models with retrieval-augmented generation replace large language models when learning computer science?</article-title>
          ,
          <source>in: Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1</source>
          ,
          ITiCSE 2024, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>388</fpage>
          -
          <lpage>393</lpage>
          . doi:10.1145/3649217.3653554.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dainese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>Evaluating language models for generating and judging programming feedback</article-title>
          ,
          <source>in: Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          SIGCSE TS 2025, Association for Computing Machinery, New York, NY, USA,
          <year>2025</year>
          , p.
          <fpage>624</fpage>
          -
          <lpage>630</lpage>
          . URL: https://doi.org/10.1145/3641554.3701791. doi:10.1145/3641554.3701791.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Amini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Security and privacy challenges of large language models: A survey</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>57</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>AI-TA: Towards an intelligent question-answer teaching assistant using open-source LLMs</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2311.02775. arXiv:2311.02775.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N. Ashok</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <article-title>Improving socratic question generation using data augmentation and preference optimization</article-title>
          ,
          <source>in: Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)</source>
          , Association for Computational Linguistics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Woodrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koyejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <article-title>Improving generative ai student feedback: Direct preference optimization with teachers in the loop</article-title>
          , https://juliettewoodrow.github.io/paper-hosting/dpo_feedback.pdf,
          <year>2025</year>
          . Accessed: 2025-04-12.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Scarlatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Woodhead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <article-title>Improving the validity of automatically generated feedback via reinforcement learning</article-title>
          , Springer Nature Switzerland,
          <year>2024</year>
          , p.
          <fpage>280</fpage>
          -
          <lpage>294</lpage>
          . doi:10.1007/978-3-031-64302-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kujanpää</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sorva</surname>
          </string-name>
          ,
          <article-title>Exploring the responses of large language models to beginner programmers' help requests</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1</source>
          , ICER '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>93</fpage>
          -
          <lpage>105</lpage>
          . URL: https://doi.org/10.1145/3568813.3600139. doi:10.1145/3568813.3600139.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Roest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Keuning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeuring</surname>
          </string-name>
          ,
          <article-title>Next-step hint generation for introductory programming using large language models</article-title>
          ,
          <source>in: Proceedings of the 26th Australasian Computing Education Conference</source>
          , ACE '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>144</fpage>
          -
          <lpage>153</lpage>
          . doi:10.1145/3636243.3636259.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Allal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lozhkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bakouch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Blázquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piqueres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marafioti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zakka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>von Werra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>SmolLM2 - with great data, comes great performance</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          A. L. et al.,
          <article-title>The Llama 3 herd of models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>de Freitas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Coffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Freitas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weingart</surname>
          </string-name>
          ,
          <article-title>FalconCode: A multiyear dataset of Python code samples from an introductory computer science course</article-title>
          ,
          <source>in: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          SIGCSE 2023, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>938</fpage>
          -
          <lpage>944</lpage>
          . doi:10.1145/3545945.3569822.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Efrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>LIMA: Less is more for alignment</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Neural Information Processing Systems</source>
          , NIPS '23, Curran Associates Inc., Red Hook, NY, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stiennon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Irving</surname>
          </string-name>
          ,
          <article-title>Fine-tuning language models from human preferences</article-title>
          ,
          <source>arXiv preprint arXiv:1909.08593</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>Judging LLM-as-a-judge with MT-Bench and Chatbot Arena</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Neural Information Processing Systems</source>
          , NIPS '23, Curran Associates Inc., Red Hook, NY, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Beeching</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belkada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>von Werra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fourrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sarrazin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sanseviero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>Zephyr: Direct distillation of LM alignment</article-title>
          ,
          <source>CoRR abs/2310.16944</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2310.16944. doi:10.48550/ARXIV.2310.16944. arXiv:2310.16944.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Phatale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mansoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mesnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Carbune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Prakash</surname>
          </string-name>
          ,
          <article-title>RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback</article-title>
          , in:
          <source>Forty-first International Conference on Machine Learning, ICML 2024</source>
          , Vienna, Austria, July 21-27,
          <year>2024</year>
          , OpenReview.net,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=uydQ2W41KO.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rafailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <article-title>Direct preference optimization: Your language model is secretly a reward model</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1707.06347. arXiv:1707.06347.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giurgiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jastrzebski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Morrone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>De Laroussilhe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gesmundo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Attariyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <article-title>Parameter-efficient transfer learning for NLP</article-title>
          , in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning</source>
          , volume
          <volume>97</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>2790</fpage>
          -
          <lpage>2799</lpage>
          . URL: https://proceedings.mlr.press/v97/houlsby19a.html.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>in: The Tenth International Conference on Learning Representations, ICLR</source>
          <year>2022</year>
          , Virtual Event, April 25-29,
          <year>2022</year>
          , OpenReview.net,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <article-title>Training Language Models for Programming Feedback Using Automated Repair Tools</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rebolledo-Mendez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matsuda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. C.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dimitrova</surname>
          </string-name>
          (Eds.),
          <source>Artificial Intelligence in Education</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>830</fpage>
          -
          <lpage>835</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>Automated Program Repair Using Generative Models for Code Infilling</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rebolledo-Mendez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matsuda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. C.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dimitrova</surname>
          </string-name>
          (Eds.),
          <source>Artificial Intelligence in Education</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>798</fpage>
          -
          <lpage>803</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kotalwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gotovos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <article-title>Hints-in-browser: Benchmarking language models for programming feedback generation</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Globersons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mackey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Paquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Tomczak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems</source>
          2024, NeurIPS 2024, Vancouver, BC, Canada, December 10-15, 2024,
          <year>2024</year>
          . URL: http://papers.nips.cc/paper_files/paper/2024/hash/34cc2ded6daba59357134c0b9fb06bfe-Abstract-Datasets_and_Benchmarks_Track.html.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khalman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Llinares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mesnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <article-title>Direct language model alignment from online ai feedback</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.04792. arXiv:2402.04792.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>T.</given-names>
            <surname>Phung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-A.</given-names>
            <surname>Pădurean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cambronero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <article-title>Automating human tutor-style programming feedback: Leveraging GPT-4 tutor model for hint generation and GPT-3.5 student model for hint validation</article-title>
          ,
          <source>in: Proceedings of the 14th Learning Analytics and Knowledge Conference</source>
          , LAK '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>23</lpage>
          . doi:10.1145/3636555.3636846.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vieillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stanczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Garea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Geist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <article-title>On-policy distillation of language models: Learning from self-generated mistakes</article-title>
          ,
          <source>in: The Twelfth International Conference on Learning Representations, ICLR</source>
          <year>2024</year>
          , Vienna, Austria, May 7-11, 2024
          , OpenReview.net,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=3zKtaqxLhW.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>Mistral AI Team</string-name>
          ,
          <article-title>Mistral NeMo</article-title>
          , https://mistral.ai/news/mistral-nemo/,
          <year>2024</year>
          . Accessed: 2024-09-16.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <article-title>Provably robust DPO: Aligning language models with noisy feedback</article-title>
          ,
          <source>in: Proceedings of the 41st International Conference on Machine Learning, ICML'24</source>
          , JMLR.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gheshlaghi Azar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. Daniel</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Munos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rowland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Valko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calandriello</surname>
          </string-name>
          ,
          <article-title>A general theoretical paradigm to understand learning from human preferences</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of The 27th International Conference on Artificial Intelligence and Statistics</source>
          , volume
          <volume>238</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2024</year>
          , pp.
          <fpage>4447</fpage>
          -
          <lpage>4455</lpage>
          . URL: https://proceedings.mlr.press/v238/gheshlaghi-azar24a.html.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in:
          <source>Text Summarization Branches Out</source>
          , Association for Computational Linguistics
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lagus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Longi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <article-title>Transfer-learning methods in programming course outcome prediction</article-title>
          ,
          <source>ACM Transactions on Computing Education (TOCE)</source>
          <volume>18</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Castellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Filice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Rokhlenko</surname>
          </string-name>
          ,
          <article-title>Preventing catastrophic forgetting in continual learning of new natural language tasks</article-title>
          ,
          <source>in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3137</fpage>
          -
          <lpage>3145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dainese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <article-title>Using program repair as a proxy for language models' feedback ability in programming education</article-title>
          , in:
          <string-name>
            <given-names>E.</given-names>
            <surname>Kochmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bexte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Horbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Laarmann-Quante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Yaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)</source>
          , Association for Computational Linguistics
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>181</lpage>
          . URL: https://aclanthology.org/2024.bea-1.15.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. Z.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mechtaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roychoudhury</surname>
          </string-name>
          ,
          <article-title>Re-factoring based program repair applied to programming assignments</article-title>
          ,
          <source>in: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE)</source>
          , IEEE/ACM,
          <year>2019</year>
          , pp.
          <fpage>388</fpage>
          -
          <lpage>398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>GPT-4o system card</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.21276. arXiv:2410.21276.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          , et al.,
          <article-title>Qwen2.5-Coder technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2409.12186</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>P.</given-names>
            <surname>Verga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Althammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arkhangorodsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <article-title>Replacing judges with juries: Evaluating LLM generations with a panel of diverse models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.18796. arXiv:2404.18796.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>Google DeepMind</string-name>
          ,
          <article-title>Gemini 2.0: Our largest and most capable AI model</article-title>
          , https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/,
          <year>2024</year>
          . Accessed: 2025-04-24.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Ramayapally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vaidyanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hupkes</surname>
          </string-name>
          ,
          <article-title>Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2406.12624. arXiv:2406.12624.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>H.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Namgoong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>Large language models as evaluators in education: Verification of feedback consistency and accuracy</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>671</fpage>
          . doi:10.3390/app15020671.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>WebLLM</string-name>
          ,
          <article-title>WebLLM: A web-based language model</article-title>
          , https://webllm.mlc.ai/,
          <year>2024</year>
          . Accessed: 2024-09-16.
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>DeepSeekMath: Pushing the limits of mathematical reasoning in open language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.03300. arXiv:2402.03300.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>