<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Programming Feedback: Aligning Small Language Models Without Human Preferences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Charles Koutcheme</string-name>
          <email>charles.koutcheme@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Dainese</string-name>
          <email>nicola.dainese@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arto Hellas</string-name>
          <email>arto.hellas@aalto.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalto University</institution>
          ,
          <addr-line>Espoo</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CSEDM'25: 9th Educational Data Mining in Computer Science Education Workshop</institution>
          ,
          <addr-line>July, 2025, Palermo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Providing students with timely and effective feedback remains a critical challenge in programming education. Locally deployed Small Language Models (SLMs) offer a cost-effective solution that enables educators to generate feedback while avoiding the third-party reliance and privacy concerns associated with Large Language Models (LLMs). However, SLMs often produce misleading or inaccurate feedback, limiting their practical use. This paper presents a fully automated reinforcement learning framework for aligning SLMs to generate high-quality programming feedback without any human-labelled examples or preference annotations. Our approach transfers the feedback capabilities of powerful LLMs (“teacher models”) to smaller, low-resource models (“student models”) that can run locally on consumer hardware, with the optional assistance of medium-sized “assistant” models. The framework supports two configurations: an off-policy setup that uses assistant model generations to bootstrap alignment, and a lightweight online on-policy variant that trains directly on student model outputs. We evaluate both approaches by fine-tuning two SLMs on a real-world dataset of CS1 programming submissions collected across semesters. Our experiments simulate realistic deployment scenarios, training on data from past semesters and evaluating on future ones. Results show that both methods significantly improve feedback quality and generalize across new course offerings. We provide practical considerations for aligning SLMs in educational settings and outline a promising direction for future work. Our code is made available on GitHub.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Learning to program is challenging for many. These challenges can be somewhat alleviated with
improved teaching practice [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. A key part of this is providing feedback, which should be timely
and accurate [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. Large Language Models (LLMs) have shown exceptional success in that task
[
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], leading to their growing adoption in classrooms [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref8 ref9">8, 9, 10, 11, 12</xref>
        ]. However, relying on third-party
services that provide access to LLMs can introduce cost obstacles and scalability issues [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. These
constraints are driving a growing shift towards using smaller, open-source models [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which can be
deployed locally [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ] to reduce costs and provide educators greater control over their students’ data.
      </p>
      <p>
        Although Small Language Models (SLMs) alleviate these issues because they can be run locally
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], their feedback quality often falls short of LLMs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], posing significant challenges in real-world
applications. In particular, SLMs tend to generate more misleading feedback [
        <xref ref-type="bibr" rid="ref14 ref17">14, 17</xref>
        ], including
hallucinations and irrelevant suggestions. Such shortcomings can confuse students and hinder learning [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Reinforcement Learning (RL) has emerged as a promising approach for aligning language models
to generate pedagogically meaningful programming support [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However, existing reinforcement
learning methods for programming feedback generation rely heavily on human supervision, typically
in the form of human-written examples [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] or preference annotations [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. This dependency hinders
improvements in contexts where data or annotators are unavailable.
      </p>
      <p>
        In this paper, we explore aligning small models for programming feedback without human annotations
or preference labels. Our approach uses RL to transfer the feedback abilities from a teacher LLM to a
smaller, locally deployable student model, optionally using medium-sized assistant models to bootstrap
alignment. We implement and compare two fully automated training configurations: an off-policy
setup (TASAP), which builds on prior work using assistant generations [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]; and a novel lightweight
online on-policy method (OSAP), where the model trains directly on its own feedback.
      </p>
      <p>
        Our evaluation focuses on feedback generation for explanations of students’ mistakes [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and
Socratic hints [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], using two SLMs (SmolLM-V2-1.7B [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and Llama-3.2-1B [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]) fine-tuned on real
student submissions from the FalconCode dataset [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. We use this dataset to simulate a realistic
deployment scenario by training on one semester and evaluating on the next. We also study a continual
learning setting where models are refined incrementally as new data becomes available.
      </p>
      <p>Our results show that both configurations significantly improve feedback quality on new student
submissions. We also reflect on methodological challenges when training small models for educational
feedback and highlight a promising direction for future work. Our contributions are the following:
• We introduce a reinforcement learning framework for aligning small language models for
programming feedback, without relying on human-labelled preferences or human-annotated feedback.
• We implement an off-policy method using assistant models (TASAP) and introduce a novel online
on-policy variant (OSAP).
• We evaluate both methods on a real-world, semester-split dataset, and show that they substantially
improve feedback quality on new students’ submissions, including in a continual learning setup.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Learning from Preferences</title>
        <p>
          Reinforcement Learning With Human Feedback Recent advancements in fine-tuning techniques
have significantly improved the performance of small language models on downstream tasks. While
Supervised Fine-Tuning (SFT) remains a widely used approach to improve language models’ generations
[
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], it is limited in its ability to align such models with complex human preferences and objectives [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
Reinforcement Learning with Human Feedback (RLHF) addresses this limitation by training reward
models on ranked human preferences — for example, generation A is better than generation B — and
has contributed to the success of large language models such as GPT-4 [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
        <p>
          LLMs-as-judges. Because of their strong performance, such models have also been used as “judges”
to evaluate outputs from smaller models [
          <xref ref-type="bibr" rid="ref14 ref31">31, 14</xref>
          ]. The growing use of LLMs-as-judges has progressively
reduced the need for human annotators in RLHF pipelines [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], creating a shift towards Reinforcement
Learning From AI Feedback (RLAIF) [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], where AI models themselves are used to supervise other
models’ preference-based training.
        </p>
        <p>
          Direct Preference Optimization. Recent RLHF and RLAIF approaches have predominantly relied
on offline preference alignment methods such as Direct Preference Optimisation (DPO) [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. DPO
simplifies the classical three-step pipeline by directly optimising language models on previously collected
preference data, removing the need to train a separate reward model [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ].
        </p>
        <p>
          Parameter-Efficient Fine-Tuning. In parallel, Parameter-Efficient Fine-Tuning (PEFT) techniques
[
          <xref ref-type="bibr" rid="ref36">36</xref>
          ], such as Low-Rank Adapters (LoRA) [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], have made fine-tuning more accessible by significantly
reducing computational and memory requirements. These techniques have enabled the practical
application of preference alignment to smaller, resource-constrained models.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Improving Small Language Models for Programming Feedback</title>
        <sec id="sec-2-2-1">
          <title>Existing reinforcement learning approaches.</title>
          <p>
            Fine-tuning small language models (SLMs) for
programming education has become an increasingly active area of research in AI in Education. While
early work primarily focused on generating program repairs to correct student code [
            <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
            ], more
recent approaches explore reinforcement learning (RL) and PEFT methods to fine-tune small models to
support students in learning how to program. In all setups, a common challenge is obtaining high-quality
preference pairs to guide the learning process.
          </p>
          <p>
            Some approaches rely on human annotations paired with synthetic examples. For instance, Kumar et
al. use GPT-4 to generate low-quality samples, paired with human-written Socratic questions, to train a
LLaMA model for educational dialogue [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. Other approaches leverage naturally occurring preference
signals. Hicke et al. use TA edits to forum posts as implicit preferences [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Towards Reinforcement Learning with AI Feedback.</title>
          <p>
            While promising, these methods rely on
human supervision or access to structured educational data, which limits their scalability to new
contexts. Kotalwar et al. take a step toward automation by using GPT-4 to generate explanations and
hints, training a small model via supervised fine-tuning alone [
            <xref ref-type="bibr" rid="ref40">40</xref>
            ]. However, whether preference-based
techniques can further improve such models without human annotation remains underexplored.
          </p>
          <p>
            Recent studies highlight the promise of LLMs-as-judges in evaluating feedback quality, with models
such as GPT-4o-mini and Llama-3.1-70B producing high-quality judgments [
            <xref ref-type="bibr" rid="ref14 ref17">14, 17</xref>
            ]. These advances
motivate our work to adapt RLAIF for programming feedback.
          </p>
          <p>
            Closest to our work in another domain is Scarlatos et al. [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ], who use a combination of
human-written feedback, LLM-generated feedback, and AI preferences to train an 8B LLaMA model with PEFT
for multiple-choice math feedback, using Direct Preference Optimisation. While both studies rely
on RLAIF, our approach differs by integrating such techniques within a distillation framework that
addresses specific programming feedback challenges, notably, the lack of human-annotated data, the
vast space of possible student mistakes, and the need for highly contextualised recommendations.
          </p>
          <p>
            Moreover, our work differs from all prior attempts by also integrating online learning algorithms
[
            <xref ref-type="bibr" rid="ref41">41</xref>
            ], where language models improve continuously with their own generated responses.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>Here, we present our two approaches for improving small language models’ programming feedback.
Before presenting the training methods, we formalize the task and outline our assumptions.</p>
      <sec id="sec-3-1">
        <title>3.1. Task and Assumptions</title>
        <p>
          Task. Our primary objective is to fine-tune a small, resource-efficient, instruction-tuned language
model (the student LM) to generate two interrelated types of feedback [
          <xref ref-type="bibr" rid="ref24 ref40 ref42">42, 40, 24</xref>
          ]: an explanation
ℰ, which identifies and describes a bug in a student’s program, and a single next-step hint ℋ, which
guides the student toward resolving the identified bug without revealing the solution.
        </p>
        <p>
          While our method can be adapted to other types of feedback [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we illustrate its effectiveness with
explanations and hints as these two types of feedback play an important role in supporting students
learning programming.
        </p>
        <sec id="sec-3-1-1">
          <title>Quality attributes.</title>
          <p>
            To ensure the feedback supports effective learning, it must adhere to specific quality attributes identified in prior works. First, the generated explanation must be accurate, selective, and clear [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. The explanation is considered accurate (ℰ_acc) if it correctly identifies and mentions the first existing issue in the student program. It is considered selective (ℰ_sel) when it focuses exclusively on one issue in the code (whether the issue is correct or not) and avoids discussing any unrelated or non-existent bugs. Finally, the explanation should be clear (ℰ_clear), meaning it is easy to understand, concise, and presented in a readable format.
          </p>
          <p>
            Second, the generated hint must be correct, informative, concealed, and clear [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ]. A hint is considered correct (ℋ_corr) if it provides accurate information to resolve issues in the buggy program. It is deemed informative (ℋ_info) if it offers valuable insights to help the learner resolve the bug effectively. The hint should also remain concealed (ℋ_conc) by avoiding the direct revelation of the solution, allowing the student to reason through the process of implementing the fix. Lastly, the hint must be clear (ℋ_clear), ensuring that it is easy to understand and devoid of unnecessary complexity. The student language model will be optimized to consistently meet these quality attributes in its generations.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Generation methodology.</title>
          <p>
            Following prior work [
            <xref ref-type="bibr" rid="ref40 ref42">42, 40</xref>
            ], feedback ℱ is always generated using a chain-of-thought approach that prompts language models to generate the explanation ℰ (the “thought”) followed by the hint ℋ. This strategy ensures hints are grounded in accurate explanations.
          </p>
          <p>Assumptions. To reach our objective, we consider a training dataset 𝒟 = {(dᵢ, cᵢ)}ᵢ₌₁ⁿ consisting of pairs of problem descriptions dᵢ and incorrect student programs cᵢ.</p>
          <p>We also assume access to a teacher LLM, accessible via an online API (e.g., GPT-4o-mini via the OpenAI API), as well as to a set of medium-sized assistant LMs.</p>
          <p>The teacher model is presumed to generate high-quality feedback, while the assistant models perform
well but with lower quality, and the student model may initially perform poorly.</p>
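<p>
  As a concrete illustration, the chain-of-thought generation strategy can be sketched as a single prompt that asks for the explanation before the hint. This is a minimal sketch: the wording below is illustrative, not the exact prompt used in our experiments.
</p>

```python
# Minimal sketch of the chain-of-thought feedback prompt: the model is asked
# to produce the explanation (the "thought") first, then a single hint.
# The wording is illustrative, not the exact prompt from the paper.

def build_feedback_prompt(problem_description: str, student_program: str) -> str:
    return (
        "You are a CS1 tutor. A student submitted an incorrect program.\n\n"
        f"## Problem description\n{problem_description}\n\n"
        f"## Student program\n{student_program}\n\n"
        "First, write an explanation that identifies and describes the first bug "
        "(one issue only; do not mention unrelated problems).\n"
        "Then, write a single next-step hint that guides the student toward fixing "
        "that bug without revealing the solution."
    )

prompt = build_feedback_prompt("Print the sum of two numbers.", "print(a - b)")
```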
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Supervised Fine-tuning</title>
        <p>
          Given the lack of human annotations, following Kotalwar et al. [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ], a natural first step in improving our small language model is to apply Supervised Fine-Tuning (SFT), that is, training the student model on teacher-generated feedback for all incorrect programs in the training set using the negative log-likelihood (NLL) loss. This yields a model π_sft. We generate the teacher feedback using greedy decoding.
        </p>
        <p>
          SFT represents the simplest form of distillation [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ], where the student directly mimics the teacher’s
outputs. However, SFT alone risks overfitting, especially when training data is limited, and does not
allow language models to understand what constitutes high-quality responses.
        </p>
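<p>
  The SFT objective reduces to the negative log-likelihood of the teacher feedback tokens under the student model. A toy numeric sketch (the probabilities below are made up; in practice they come from the student LM's softmax output):
</p>

```python
import math

# Toy illustration of the SFT objective: mean negative log-likelihood of the
# teacher's feedback tokens under the student model. The probabilities are
# invented for illustration; a real run uses the model's token probabilities.

def nll_loss(token_probs):
    """Mean negative log-likelihood over the target (teacher feedback) tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Probabilities the student assigns to the teacher's feedback tokens:
loss = nll_loss([0.9, 0.5, 0.25])
```

A perfectly confident model (all probabilities 1.0) attains a loss of zero; training pushes the student toward that regime on teacher outputs.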
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Learning From Feedback Preferences</title>
        <p>
          In this paper, we propose to apply preference-based optimisation techniques on top of the SFT-trained
small language models to refine their abilities to generate high-quality feedback. Unlike RLHF setups
that rely on human preference labels [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], we generate preferences automatically. We compare two
configurations that vary in how feedback examples are generated and how the student model is updated.
        </p>
        <p>In both setups, we use the teacher model to score and rank the generated feedback using a rubric-based
process before optimizing the student model via an appropriate preference alignment algorithm.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Teacher-Assistant-Student Alignment Pipeline (TASAP)</title>
          <p>
            Our first approach, the Teacher-Assistant-Student Alignment Pipeline (TASAP), follows similar offline off-policy preference alignment strategies underpinning the success of many language models [
            <xref ref-type="bibr" rid="ref32 ref44">32, 44</xref>
            ]. To apply such methods in our context, we need to construct a preference dataset of feedback pairs 𝒟_p = {(dᵢ, cᵢ, ℱ_w), (dᵢ, cᵢ, ℱ_l)}ᵢ₌₁ᴹ, where ℱ_w (the “winning” feedback) is ranked higher than ℱ_l (the “losing” feedback) based on a quality criterion.
          </p>
          <p>
            Step 1: Data collection. For each incorrect program, we sample three feedback texts, one from each of the assistant models, using greedy decoding following prior work [
            <xref ref-type="bibr" rid="ref14 ref31">14, 31</xref>
            ]. We also reuse the feedback generated by the teacher language model during the supervised fine-tuning step.
          </p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Step 2: Judging and scoring generations.</title>
          <p>
            Then, we use our teacher LLM as a judge [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] to grade all four generated feedback texts independently against a rubric based on our predefined quality criteria: ℰ_acc, ℰ_sel, ℰ_clear, ℋ_corr, ℋ_info, ℋ_conc, and ℋ_clear, assigning each criterion a binary value of either 0 (false) or 1 (true) [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ]. Following Koutcheme et al. [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], our prompt for judging feedback (see Figure 3, Appendix B) asks the teacher LM to use its own generated feedback as ground truth to evaluate the newly provided one. This reference-grading strategy [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] ensures the student generations remain aligned with the teacher, reduces variability in judgments, and helps keep the preference dataset free of noise [
            <xref ref-type="bibr" rid="ref45">45</xref>
            ]. Using the grading values, we assign each feedback an overall quality score [
            <xref ref-type="bibr" rid="ref22 ref32">22, 32</xref>
            ] using a weighted sum:
s = 0.20 ⋅ ℰ_acc + 0.15 ⋅ ℰ_sel + 0.10 ⋅ ℰ_clear + 0.20 ⋅ ℋ_corr + 0.15 ⋅ ℋ_info + 0.10 ⋅ ℋ_conc + 0.10 ⋅ ℋ_clear
where the resulting score s is bounded between 0 and 1. Our scoring function prioritizes explanation accuracy and hint correctness to ensure feedback is factual. We then weight explanation selectivity and hint informativeness to discourage the generation of irrelevant or hallucinated information. Attributes like clarity and concealment are weighted last, as they are secondary to the validity of the feedback (the scoring function can be adapted by teachers to match their needs).
          </p>
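<p>
  The weighted sum can be sketched in a few lines of Python. The attribute keys (e_acc, ..., h_clear) are illustrative labels for the binary rubric criteria, not the paper's notation:
</p>

```python
# Sketch of the rubric-weighted quality score. Each rubric entry is a binary
# judge verdict (0 or 1); the weights sum to 1.0, so the score lies in [0, 1].

WEIGHTS = {
    "e_acc": 0.20, "e_sel": 0.15, "e_clear": 0.10,   # explanation: accurate, selective, clear
    "h_corr": 0.20, "h_info": 0.15, "h_conc": 0.10,  # hint: correct, informative, concealed
    "h_clear": 0.10,                                 # hint: clear
}

def quality_score(rubric: dict) -> float:
    """Weighted sum of binary (0/1) judge verdicts; missing criteria count as 0."""
    return sum(WEIGHTS[k] * rubric.get(k, 0) for k in WEIGHTS)

perfect = quality_score({k: 1 for k in WEIGHTS})
```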
        </sec>
        <sec id="sec-3-3-3">
          <title>Step 3 - Preference dataset creation.</title>
          <p>Using the four feedback texts obtained for each incorrect program cᵢ (three sampled from the assistant models and one generated by the teacher), we add to our preference dataset 𝒟_p all possible feedback pairs (ℱ_w, ℱ_l) where the score of ℱ_w is higher than the score of ℱ_l (s_w &gt; s_l).</p>
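<p>
  The pair-construction step can be sketched as follows (names and scores are illustrative); pairs with equal scores carry no preference signal and are skipped:
</p>

```python
from itertools import combinations

# Sketch of the preference-dataset creation step: from scored feedback texts,
# keep every pair whose scores differ, ordered as (winner, loser).

def build_preference_pairs(scored_feedback):
    """scored_feedback: list of (feedback_text, score). Returns (winner, loser) pairs."""
    pairs = []
    for (fa, sa), (fb, sb) in combinations(scored_feedback, 2):
        if sa > sb:
            pairs.append((fa, fb))
        elif sb > sa:
            pairs.append((fb, fa))
        # equal scores: no preference signal, skip the pair
    return pairs

pairs = build_preference_pairs(
    [("teacher", 1.0), ("a1", 0.6), ("a2", 0.6), ("a3", 0.3)]
)
```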
        </sec>
        <sec id="sec-3-3-4">
          <title>Step 4 - Optimization.</title>
          <p>
            Using the resulting preference dataset, we train our language model with the DPO loss function [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ]:
ℒ_DPO(π_θ; π_sft) = −𝔼_{(d, c, ℱ_w, ℱ_l) ∼ 𝒟_p} [ log σ( β log (π_θ(ℱ_w ∣ d, c) / π_sft(ℱ_w ∣ d, c)) − β log (π_θ(ℱ_l ∣ d, c) / π_sft(ℱ_l ∣ d, c)) ) ]   (1)
where σ is the logistic function, π_θ is the policy being optimized (i.e., the model during training), π_sft is the reference policy (i.e., the frozen model before training), and β is a regularization parameter that controls the deviation of the trained policy from the reference policy. A higher β keeps the trained model closer to the reference policy. Intuitively, this formulation penalizes the model based on how much it “prefers” the lower-quality (losing) feedback over the higher-quality (winning) feedback, which results in gradually increasing the probability of generating high-quality outputs.
          </p>
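<p>
  For intuition, Equation (1) can be evaluated numerically for a single preference pair from the sequence log-probabilities under the trained and frozen policies (the log-probability values below are invented for illustration):
</p>

```python
import math

# Numeric sketch of the DPO loss for a single preference pair. Inputs are the
# log-probabilities of the winning/losing feedback under the trained policy
# (logp_*) and under the frozen SFT reference policy (ref_logp_*).

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the trained policy prefers the winner more than the reference does,
# the margin is positive and the loss drops below log(2) (the zero-margin value).
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-14.0, ref_logp_l=-14.0)
```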
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.2. Online Student Alignment Pipeline (OSAP)</title>
          <p>
            Our second approach, Online Student Alignment Pipeline (OSAP), is an online on-policy variant of
TASAP based on Direct Language Model Alignment from Online AI Feedback [
            <xref ref-type="bibr" rid="ref41">41</xref>
            ]. Compared to offline
approaches, online training continuously updates a language model based on its own generations,
potentially reducing common issues associated with using static preference datasets, such as distribution
shift [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ] and overfitting [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ].
          </p>
          <p>
            Starting from the supervised fine-tuned model π_sft, OSAP integrates the sampling, data collection, and optimization steps of the TASAP pipeline within a single optimization loop. At each iteration:
(a) Instead of sampling from assistant models, we sample two generations ℱ_1, ℱ_2 ∼ π_θ(⋅ ∣ d, c) from our language model using multinomial sampling with an arbitrary temperature of 0.3 [
            <xref ref-type="bibr" rid="ref41">41</xref>
            ].
(b) We use our teacher model to independently judge and score each generation to determine the winning feedback ℱ_w and losing feedback ℱ_l, before updating the model parameters based on the resulting preference ordering using the original DPO loss function (see Equation 1). If both generations obtain the same score, we default to syntactic distance measures and select the feedback having the highest ROUGE score [
            <xref ref-type="bibr" rid="ref47">47</xref>
            ] with the teacher feedback [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
          </p>
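<p>
  The tie-break in step (b) can be sketched as follows. We use a simple unigram-overlap F1 as a stand-in for ROUGE; the exact ROUGE variant and tokenization are implementation details not fixed here:
</p>

```python
from collections import Counter

# Sketch of the OSAP tie-break: when both sampled generations receive the same
# judge score, pick the one closest to the teacher feedback. unigram_f1 is a
# simplified stand-in for a ROUGE score.

def unigram_f1(candidate: str, reference: str) -> float:
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def break_tie(gen_a: str, gen_b: str, teacher_feedback: str) -> str:
    """Return the generation with the highest overlap with the teacher feedback."""
    return max((gen_a, gen_b), key=lambda g: unigram_f1(g, teacher_feedback))
```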
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Adaptation to New Student Data</title>
        <p>
          In real-life scenarios, student programming submissions are collected by course offerings (e.g., semester by semester) and accumulate, and even somewhat change, over time [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ]. To take this scenario into account, we study how effectively each of the two preference-based alignment strategies, TASAP and OSAP, can be applied when additional training data is introduced. We consider the task of refining a model already trained on an initial dataset 𝒟₁, using a new semester of data 𝒟₂ = {(dᵢ, cᵢ)}ᵢ₌₁ᵐ.
TASAP: We perform steps 1 to 4 of the TASAP pipeline on the new dataset of students’ incorrect programs 𝒟₂ to obtain a second preference dataset 𝒟_p,2. Training the first model exclusively on this new dataset might induce catastrophic forgetting [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ], where the student model loses some of the knowledge it acquired when trained on 𝒟_p,1. To mitigate this issue, we initialize the weights of our model to the supervised fine-tuned version (see Section 3.2) of the first semester (i.e., π_sft) and train this model using the IPO loss on the combined 𝒟_p,1 ∪ 𝒟_p,2. Our choice of not repeating the supervised fine-tuning step on the combined dataset, and instead starting from π_sft, is motivated by our attempt to mitigate overfitting risks.
        </p>
        <p>OSAP: For OSAP, we continue the training pipeline directly¹ from the OSAP model trained on the first semester, using the problem descriptions and incorrect programs from 𝒟₁ ∪ 𝒟₂. This strategy thus reflects a true continual learning setup and benefits most from new data.</p>
        <p>We note that both techniques can be applied continuously, for instance, for refining the model trained on two semesters of data using a third one.</p>
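<p>
  Schematically, the two adaptation strategies differ in their starting point and training data. In the sketch below, train_preference and continue_online are hypothetical stand-ins for the preference-training and online pipelines described above:
</p>

```python
# Schematic of the two adaptation strategies. train_preference and
# continue_online are hypothetical callables standing in for the pipelines
# described in the text; only the data/initialization flow is shown.

def adapt_tasap(sft_weights, pref_sem1, pref_sem2, train_preference):
    """Restart from the semester-1 SFT weights and train on the combined
    preference datasets, to limit catastrophic forgetting."""
    return train_preference(init=sft_weights, data=pref_sem1 + pref_sem2)

def adapt_osap(osap1_model, programs_sem1, programs_sem2, continue_online):
    """Continual learning: keep training the semester-1 OSAP model on all
    problems seen so far, sampling and judging fresh generations online."""
    return continue_online(model=osap1_model, programs=programs_sem1 + programs_sem2)
```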
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          In this section, we present our experiments, aiming to answer the following research question:
(RQ) How effective are TASAP and OSAP in improving the feedback quality of small language models
when trained and evaluated across semesters of the same introductory programming course?
We perform our experiments using FalconCode [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], a large and comprehensive publicly available
dataset containing real-life CS1 students’ solutions to Python programming exercises. Beyond its
substantial scale, this dataset distinguishes itself through free-form assignments, enabling a broader
evaluation of language models’ abilities to generate feedback.
        </p>
        <p>
          Preprocessing. The FalconCode dataset is split over three subsets (three semesters of data). Within
each subset, we select all unique incorrect programs from all students’ last submitted solutions for
all assignments automatically evaluated with unit tests [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ]. Uniqueness is determined via AST
normalization2. While we acknowledge that this selection may not fully capture the range of dificulties
students encounter during their attempts, it aligns with the idea that a student’s last attempt often
reflects their improved understanding of the problem. Thus, our setup can be viewed as providing
feedback to students as a last resort for elements they may not have grasped.
        </p>
        <p>
          We leverage the first and second semesters for training and iterative refinement, respectively, and the last semester for testing. To ensure our setup evaluates our models’ generalization abilities, we filter out from the test set the programs having normalised AST representations similar to those in the first two semesters [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ]. This results in three splits with 826, 690, and 693 incorrect programs from 62, 44, and 62 assignments, respectively.
¹ In practice, we also need to generate feedback using the teacher model on the new semester of data 𝒟₂ to allow the fallback to a syntactic distance measure comparison.
² Including variable renaming.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Models</title>
        <p>
          To answer our research questions, we fine-tune two small language models, SmolLM-V2-1.7B [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] and
Llama-3.2-1B [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], using GPT-4o-mini [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ] as the teacher. We chose these two student models for
their strong performance on small-model benchmarks, while GPT-4o-mini has been shown to produce
high-quality programming feedback [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>
          Baseline. As a baseline, we use models trained with Supervised Fine-Tuning (SFT) on the teacher-generated data, following Kotalwar et al. [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ].
        </p>
        <p>Versions. We train each of our models (on FalconCode) using our two proposed approaches: TASAP and OSAP. For each approach, we train a first version (TASAP-1 and OSAP-1) on the first semester of FalconCode. We then train second versions (TASAP-2 and OSAP-2) using the adaptation-to-new-student-data strategy (see Section 3.4) with both the first and second FalconCode semesters.</p>
        <p>
          Assistant models. For TASAP, we leverage three assistant language models: Mistral-Nemo-12B [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ],
Llama-3.1-8B [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], and Qwen-2.5-3B [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ]. We chose these models to ensure diversity across model
families, sizes, and performance [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ].
        </p>
        <p>
          Parameter-Efficient Fine-Tuning. To take into account educators' limited access to computational resources, we train our SFT and TASAP models (as well as the baselines) with Low-Rank Adapters (LoRA) [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], a parameter-efficient fine-tuning method that reduces memory requirements by freezing the base model and adding a small number of trainable parameters called adapters [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. These adapters can be removed to restore the base model's original capabilities.
        </p>
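<p>The adapter mechanism can be sketched in a few lines of plain Python. This is an illustration of the low-rank update only, not our training code; the alpha/r scaling mirrors the usual LoRA convention, and all names here are ours:</p>

```python
def matmul(X, Y):
    """Plain-Python matrix product for the tiny illustration below."""
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*Y)] for row in X]

def lora_forward(W, A, B, x, alpha, r):
    """y = (W + (alpha / r) * B @ A) x, computed without merging.

    W is the frozen base weight; A (r x d) and B (d x r) are the only
    trainable parameters. Dropping (A, B) restores the base model
    exactly, which is why the adapters are removable."""
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    BA = matmul(B, A)
    delta = [sum((alpha / r) * w * xi for w, xi in zip(row, x))
             for row in BA]
    return [b + d for b, d in zip(base, delta)]
```

<p>With rank r much smaller than the weight dimensions, the trainable parameter count (and optimizer memory) shrinks accordingly.</p>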
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Automated Evaluations: LLMs-as-feedback-judges</title>
        <p>
          Manually evaluating all models on our datasets would require substantial effort, even on a subset of
generations. Instead, we leverage LLMs-as-judges once again for our final evaluation [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]. However,
rather than relying on a single model for this task, following Verga et al. [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ], we use a panel of three
strong LLMs: Llama-3.3-70B [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], GPT-4o-mini [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ], and Gemini-2.0-flash [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ]. Earlier versions of
the GPT-4o and Llama-3 model families have been used extensively as judges [
          <xref ref-type="bibr" rid="ref56">56</xref>
          ], also in programming
contexts [
          <xref ref-type="bibr" rid="ref14 ref17">17, 14</xref>
          ], and Gemini has recently demonstrated comparable performance to GPT-4o-mini on
multiple benchmarks. While GPT-4o-mini and Gemini-2.0-flash are lighter versions of their full-size
counterparts, they remain strong judges for programming feedback. For instance, GPT-4o-mini has
been shown to perform on par with GPT-4o for evaluating feedback quality [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Moreover, Verga et al.
[
          <xref ref-type="bibr" rid="ref54">54</xref>
          ] demonstrate that ensembles of smaller LLMs from different families outperform single large models,
particularly by mitigating individual model biases.
        </p>
        <p>Evaluation prompting strategy. For each feedback ℱ generated on the test set, we prompt all judges (see Figure 4, Appendix B) to provide binary decisions across all quality criteria. We obtain the final verdict using a strict unanimity policy: a criterion is marked correct only if all judges agree. While this method does not provide absolute performance guarantees, as discussed in our Limitations of Work, it offers a consistent, scalable, and reliable strategy for comparing the relative effectiveness of different training approaches.</p>
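<p>The strict unanimity policy amounts to a simple aggregation over the judges' binary gradings. A minimal sketch (criterion names are placeholders for the seven criteria used in the paper):</p>

```python
def panel_verdict(judge_gradings: list[dict[str, bool]]) -> dict[str, bool]:
    """Strict unanimity: a criterion is marked correct only if every
    judge on the panel marked it true."""
    criteria = judge_gradings[0].keys()
    return {c: all(g[c] for g in judge_gradings) for c in criteria}
```

<p>A single dissenting judge is thus enough to mark a criterion as not met, which makes the evaluation deliberately conservative.</p>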
        <p>
          Human validation. Following Scarlatos et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], we conduct a small-scale analysis over a subset
of language model generations to validate the use of LLM-as-judges and provide insights into potential
evaluation errors.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Experiment details</title>
        <p>
          We fine-tune our models using the HuggingFace TRL library, following hyperparameters recommended
in prior work. For SFT, we use a learning rate of 1e-4 [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]; for TASAP and OSAP, we set β = 0.25 [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]
and use learning rates of 1e-5 and 1e-6, respectively. Batch sizes are 8 for SFT and TASAP, and 16
for OSAP. TASAP-2 and OSAP-2 reuse these settings and repeat the training process as described in
Section 3.4. We apply LoRA with α = 64 and rank r = 32 [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], train each model for up to 3 epochs, and
select checkpoints based on lowest validation loss. All other hyper-parameters remain at default values.
Full experimental details and prompts are available in our code base. All training was performed on
Nvidia Tesla V100 GPUs (32GB RAM) via Triton, our institution’s research cluster.
        </p>
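<p>Section 3 specifies the exact training objectives; as an illustration only, if the β-weighted preference loss follows the standard DPO formulation (an assumption on our part, with β = 0.25 as above), a single chosen/rejected feedback pair contributes:</p>

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.25):
    """DPO-style loss for one (chosen, rejected) pair: the policy's
    log-probabilities (logp_w, logp_l) are compared against the frozen
    reference model's (ref_logp_w, ref_logp_l), and the margin is
    pushed through a beta-scaled log-sigmoid. Argument names are ours."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

<p>A larger β sharpens the penalty for ranking the rejected feedback above the chosen one; at a zero margin the loss is log 2.</p>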
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Main results</title>
        <p>Legend. TASAP(-2): Teacher-Assistant-Student Alignment Pipeline (trained on 2 semesters); OSAP(-2): Online Student Alignment Pipeline (trained on 2 semesters). QWEN: Qwen-2.5-3B; LLAMA: Llama-3.1-8B; NEMO: Mistral-Nemo-12B; MINI: GPT-4o-mini. Explanation (ℰ) criteria: accuracy, selectivity, clarity. Hint (ℋ) criteria: correctness, informativeness, concealment, clarity.</p>
        <sec id="sec-5-1-1">
          <title>Llama-3.2-1B (Student)</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>Smol2-1.7B (Student)</title>
          <p>[Results table: per-criterion feedback-quality scores for the BASE, SFT, OSAP(-2), and TASAP(-2) versions of each student model, with MINI (GPT-4o-mini) and QWEN (Qwen-2.5-3B) as references; the numeric values are not recoverable from this extraction.] Both aligned versions outperform their respective supervised fine-tuned base models.</p>
          <p>
            Tracing this observation backwards, we note that although the Smol base model performs slightly better than the Llama base model (as expected, due to size differences), supervised fine-tuning benefits the Llama model more. We hypothesise that this may be due to a distribution shift: the Llama model's answer distribution is closer to that of GPT-4o-mini, making further improvements easier [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ]. Training
with OSAP and TASAP may guide the model toward a more optimal solution space. Hieke et al.
[
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] have already highlighted that preference optimization exerts a regularising effect on supervised fine-tuning.
          </p>
          <p>While TASAP generally outperforms OSAP across both language models, this performance gap
narrows when training on additional data (e.g., from a subsequent semester). OSAP-2 performs comparably
to TASAP-2 on Llama, and even outperforms TASAP-2 on Smol.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Small-scale human evaluation</title>
        <p>
          Following Scarlatos et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], we conduct a small-scale analysis of LLMs-as-judges performance in
our setting. We arbitrarily selected a subset of 5 representative assignments in our dataset (see Table 1,
Appendix A). For each assignment, we choose the student’s submitted incorrect solution which had the
highest unit test score. Then, one author of the paper manually annotated the quality of the generations
of the BASE, SFT, TASAP, and OSAP models for those 5 assignments for the two models, resulting in
4 × 5 × 2 = 40 annotations with 7 criteria. Our analysis procedures follow prior work [
          <xref ref-type="bibr" rid="ref17 ref22">17, 22</xref>
          ], considering
such manual annotations as ground truths and the LLM-as-judges ensemble result as predictions in 7
distinct binary classification problems (one per criterion). Table 2 shows the results of these annotations for various classification metrics. [Table 2 caption: LLM-as-judges classification performance. Legend: #PA: number of positive human annotations (out of 40) for each criterion; we report accuracy, precision, recall, F1, F0.5, and Cohen's kappa. Explanation (ℰ) criteria: accuracy, selectivity, clarity. Hint (ℋ) criteria: correctness, informativeness, concealment, clarity. The table values are not recoverable from this extraction.] For some criteria, generations were rarely selective (ℰ) or correct (ℋ), resulting in an imbalanced classification task. Consistent with the main results, the ensemble did not classify any generation as containing an informative hint, and our human annotations identified only 4 out of 40 (10%). While LLMs are not perfect evaluators in general [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ], our small-scale
human analysis supports their utility in this context as a reasonable proxy for human judgment.
        </p>
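<p>Treating the manual annotations as ground truth and the ensemble verdicts as predictions, the per-criterion metrics in Table 2 can be computed with standard-library code (a sketch; the function name is ours):</p>

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and Cohen's kappa for one criterion,
    with human annotations as ground truth and the judge ensemble's
    verdicts as predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    n = len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}
```

<p>Kappa is the most informative of these under class imbalance, since raw accuracy can be high even when the positive class is almost never predicted.</p>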
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Conclusion</title>
      <p>In this paper, we presented a framework for improving small language models' ability to provide feedback using Reinforcement Learning with AI Feedback. We proposed two approaches based on offline and online preference alignment methods and evaluated their performance using LLMs-as-judges on a publicly available dataset of students' Python programs. To summarize the answer to our research question: the proposed framework, including the TASAP and OSAP methods, is effective in enhancing SLMs' feedback capability within a course setting.</p>
      <sec id="sec-6-1">
        <title>Practical educational implications.</title>
        <p>
          By utilizing and training fine-tuned models, educators can
provide tailored guidance to their students in a timely manner without constantly relying on external
APIs. Such small models can be effectively deployed using tools like WebLLM [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ], even allowing
inference on client devices with a compatible GPU. This reduces latency, ensures timely feedback, and
eliminates the need to deploy custom inference services [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. This can also give educators and learners
higher control over the generated information and its use.
        </p>
        <p>
          We do not claim that purely data-driven approaches are the way to go. Ideally, when preference
data can be collected and integrated by human TAs [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], such approaches should most likely be
prioritised. However, in many instances, programming courses do not have human TAs write feedback
to students or collect preference data, which limits how such prior work can be effectively used. Our
work closes this gap.
        </p>
        <p>
          Privacy and cost issues. We acknowledge that leveraging RLAIF pipelines requires sending student
data through external APIs for teacher-model queries, which weakens privacy protections and may also incur initial costs. However, we also note that much of the existing work in programming education
already relies on proprietary models (e.g. [
          <xref ref-type="bibr" rid="ref40 ref42">40, 42</xref>
          ]). Moreover, distilling the performance of remote
queried models to smaller models run locally decreases long-term costs. For institutions with strict
data privacy concerns, open-source LLMs (e.g., a 4-bit quantized Llama-3.3-70B) could, given sufficient
computational resources, also be hosted locally and used as teacher models.
        </p>
        <p>Room for improvement. We note that our results can be further improved, for instance, by training with more data, leveraging several prompts for different feedback tasks simultaneously, and increasing the LoRA rank (the parameter controlling how many parameters are updated during training). Programming educators and practitioners often have data readily available to them through the use of automated assessment systems.</p>
        <p>
          The framework is versatile. We anticipate that this training procedure will generalise to more
complex prompting strategies, for instance, leveraging program repairs [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ] to produce better feedback,
as the improvements stem from the framework’s alignment mechanisms rather than the specific prompt
design. The framework is adaptable; we recommend adopting the same setup as ours: using a teacher LLM for evaluation over one semester to understand what to expect.
Limitations of work. First, we conducted all experiments on a single dataset of Python programming
submissions collected from one institution and did not explore whether our results hold in other contexts.
Second, although our automated evaluation pipeline is robust, leveraging several leading large language
models, no large-scale human analysis was performed. Third, our experiments were limited to two small
models with around 1B parameters. While prior work suggests that performance improves with base
model size [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], it remains to be seen whether the same trends hold when applying OSAP and TASAP
to larger models. Fourth, importantly, we do not claim that OSAP and TASAP trained models produce
feedback matching the exact reported scores (e.g., we do not assert that the models now generate “nearly
perfect feedback”). Rather, the combination of a large dataset and the substantial performance margins
allows us to confirm relative rankings with confidence, even when taking into account judgment error
rates [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>Future work. Future work will address these gaps by first conducting human evaluations to validate
the usefulness of the feedback generated by our trained models. This will include qualitative surveys
with both teachers and students to gain insights into their perspectives. We also plan to conduct
small-scale A/B studies in real educational settings, comparing courses that use these locally deployed
models as AI teaching assistants with those relying on larger models. These deployments will provide
critical insights into small models’ impact on student learning, engagement, and overall educational
outcomes.</p>
        <p>
          Moving forward, we are studying ways to improve small language models’ programming feedback
ability without relying on large language models. In particular, we believe the recent success of pure
reinforcement learning methods such as Group Relative Policy Optimization (GRPO) [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ] could also benefit
programming education.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and Grammarly for grammar and spelling checking. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Dataset</title>
    </sec>
    <sec id="sec-9">
      <title>B. Prompts</title>
      <p>Figures 2, 3, and 4 show the prompts used in our study.</p>
      <p>Assignment 1 - lsn6_lists
Write an algorithm that gets a decimal GPA, APA, and MPA from the user (in that order). You may assume
that all inputs are non-negative whole numbers.</p>
      <p>It then reports which meritorious list the cadet is on. If the GPA is equal to or above 3.0, the cadet is on the
“Dean’s List”, and if the APA is equal to or above 3.0, the cadet is on the “Athletic Director’s List”, and if the
MPA is equal to or above 3.0, the cadet is on the “Commandant’s List”. Finally, if the cadet qualifies for all
three individual lists, then the cadet is on the “Superintendent’s List”. The algorithm should report all the
lists the cadet is on (in the order defined above), unless the cadet is on the Superintendent’s List, in which
case, it should report only “Superintendent’s List”.</p>
      <p>Assignment 2 - lsn9_imagesize
Write a function that computes the size of an uncompressed image. You will name your function
calculate_size_of_image(), and it will have three parameters: the width of the image, the height of the image, and
the bit depth (i.e., the number of bits per pixel). The function should print the size of the image in kilobytes.
Assignment 3 - IterLogic2_football
In Python, write an algorithm that first asks the user how many football players they wish to enter statistics
for and then gets that many yearly passing totals for each player. Output how many of those players had
more than 5000 passing yards in a year. Also, your algorithm will output the average yardage per year as
well as the minimum yardage entered, in that order. You can assume there is at least one player’s yardage to
input.</p>
      <p>Assignment 4 - Lists2_movies
Write a Python function called ‘get_movies‘ that takes three parameters: * A two-dimensional list containing
movie titles and other stats * A rating (e.g., “PG”, “R”) * A run time (in minutes)
Your function should return the number of movies that have the specified rating, and run for at least the
number of minutes specified.</p>
      <p>Assignment 5 - a3_6_pushups
You have been asked to write a program that analyzes number of pushups done by a group of cadets. Write
a program that gets from the user the number of people tested, and gets that many pushup scores (which
you may assume are whole numbers) from the user. Your program must print out: * The average number of
pushups for the group. * The count of cadets that scored higher than the average.</p>
      <p>You are a CS professor teaching introductory programming using Python.</p>
      <p>Below are a problem description and an incorrect program written by a student (i.e., it does not pass all test cases).
&lt;problem description&gt;, &lt;student code&gt;
• Identify and explain the first bug in the student program in 1-3 sentences.
• Focus on a functional issue only; do not discuss performance improvements or stylistic concerns.
• Provide a short and specific hint to help the student address the identified bug.
• The hint should encourage the student to think critically about resolving the issue without directly providing a solution
or code fix.
• Concentrate on one single issue in the program.</p>
      <p>• Ensure both the explanation and the hint are clear, concise, and actionable.</p>
      <p>Below are a problem description and an incorrect program written by a student (i.e., it does not pass all test cases).
1. Explain the first bug:
• Identify and explain the first bug in the student program in 1-3 sentences.
• Focus on a functional issue only; do not discuss performance improvements or stylistic concerns.
2. Generate a Hint:
• Provide a short and specific hint to help the student address the identified bug.
• The hint should encourage the student to think critically about resolving the issue without directly providing a
solution or code fix.
• Concentrate on one single issue in the program.</p>
      <p>• Ensure both the explanation and the hint are clear, concise, and actionable.</p>
      <p>Below is the feedback written by a teaching assistant (TA), which includes an explanation of and fixes for the bugs in the program, as well as a hint for the first bug.</p>
      <p>Your task is to evaluate the quality of the TA’s feedback according to the grading criteria outlined below.</p>
      <p>This evaluation will be conducted in two parts
1. Reasoning: Reflect on the quality of the TA’s feedback.
• Reflect on the quality of the feedback, using the grading criteria as a guide.
• Discuss strengths and weaknesses in the explanation and hint.
2. Grading List: Conclude with your final assessment for each criterion.</p>
      <p>• If the criterion is fully met, respond with “true”; otherwise, respond with “false”.</p>
      <p>Please provide your answer using a JSON format with two keys:
• “reasoning”: your detailed written analysis
• “grading”: a dictionary with each criterion as a key and your final answer (true or false) as the value.</p>
      <p>Use only true or false (no other qualifiers) for each grading criterion in the JSON output.</p>
      <p>List of judge-generated bugs and fixes</p>
      <p>Below is a problem description and an incorrect program written by a student (i.e., it does not pass all test cases).
problem description, student code
Below is the feedback written by a teaching assistant (TA), which includes an explanation of and fixes for the bugs in the program, as well as a hint for the first bug.
This evaluation will be conducted in two parts</p>
      <p>1. Reasoning: Reflect on the quality of the TA’s feedback.
• Reflect on the quality of the feedback, using the grading criteria as a guide.
• Discuss strengths and weaknesses in the explanation and hint.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Luxton-Reilly</surname>
          </string-name>
          , Simon,
          <string-name>
            <given-names>I.</given-names>
            <surname>Albluwi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giannakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Paterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sheard</surname>
          </string-name>
          , et al.,
          <article-title>Introductory programming: a systematic literature review</article-title>
          ,
          <source>in: Proceedings companion of the 23rd annual ACM conference on innovation and technology in computer science education</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vihavainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Airaksinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Watson</surname>
          </string-name>
          ,
          <article-title>A systematic review of approaches for teaching introductory programming and their influence on success</article-title>
          ,
          <source>in: Proceedings of the tenth annual conference on International computing education research</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Keuning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeuring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Heeren</surname>
          </string-name>
          ,
          <article-title>A systematic literature review of automated feedback generation for programming exercises</article-title>
          ,
          <source>ACM Transactions on Computing Education (TOCE) 19</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Timperley</surname>
          </string-name>
          ,
          <article-title>The power of feedback</article-title>
          ,
          <source>Review of educational research 77</source>
          (
          <year>2007</year>
          )
          <fpage>81</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Shute</surname>
          </string-name>
          , Focus on formative feedback,
          <source>Review of educational research 78</source>
          (
          <year>2008</year>
          )
          <fpage>153</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lohr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Keuning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kiesler</surname>
          </string-name>
          ,
          <article-title>You're (not) my type-can llms generate feedback of specific types for introductory programming tasks?</article-title>
          ,
          <source>Journal of Computer Assisted Learning</source>
          <volume>41</volume>
          (
          <year>2025</year>
          )
          . URL: https://onlinelibrary.wiley.com/doi/10.1111/jcal.13107. doi:10.1111/jcal.13107.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prather</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finnie-Ansley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luxton-Reilly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Reeves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <article-title>Computing education in the era of generative ai</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>67</volume>
          (
          <year>2024</year>
          )
          <fpage>56</fpage>
          -
          <lpage>67</lpage>
          . URL: https://doi.org/10.1145/3624720. doi:10.1145/3624720.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>U. Z.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sahai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karkare</surname>
          </string-name>
          ,
          <article-title>Feasibility study of augmenting teaching assistants with ai for cs1 programming feedback</article-title>
          ,
          <source>in: Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          SIGCSE TS
          <year>2025</year>
          ,
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>2025</year>
          , p.
          <fpage>11</fpage>
          -
          <lpage>17</lpage>
          . doi:10.1145/3641554.3701972.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zenke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thornton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Malan</surname>
          </string-name>
          ,
          <article-title>Teaching CS50 with AI: Leveraging generative artificial intelligence in computer science education</article-title>
          ,
          <source>in: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 2</source>
          , SIGCSE 2024, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>1927</fpage>
          . URL: https://doi.org/10.1145/3626253.3635427. doi:10.1145/3626253.3635427.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <article-title>A large scale RCT on effective error messages in CS1</article-title>
          ,
          <source>in: Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1</source>
          , SIGCSE 2024, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>1395</fpage>
          -
          <lpage>1401</lpage>
          . doi:10.1145/3626252.3630764.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vadaparty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zingaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Smith IV</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Padala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alvarado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gorson Benario</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <article-title>CS1-LLM: Integrating LLMs into CS1 instruction</article-title>
          ,
          <source>in: Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1</source>
          , ITiCSE 2024, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>297</fpage>
          -
          <lpage>303</lpage>
          . doi:10.1145/3649217.3653584.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liffiton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Sheese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Savelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>CodeHelp: Using large language models with guardrails for scalable support in programming classes</article-title>
          ,
          <source>in: Proceedings of the 23rd Koli Calling International Conference on Computing Education Research</source>
          , Koli Calling '23, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          . doi:10.1145/3631802.3631830.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Adeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>von Arx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brunskill</surname>
          </string-name>
          , et al.,
          <article-title>On the opportunities and risks of foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2108.07258</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dainese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>Open source language models can provide feedback: Evaluating LLMs' ability to help students using GPT-4-as-a-judge</article-title>
          ,
          <source>in: Proceedings of the 2024 Innovation and Technology in Computer Science Education</source>
          , Volume
          <volume>1</volume>
          , ITiCSE '24,
          <year>2024</year>
          . doi:10.1145/3649217.3653612.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bergen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liut</surname>
          </string-name>
          ,
          <article-title>Integrating small language models with retrieval-augmented generation in computing education: Key takeaways, setup, and practical insights</article-title>
          ,
          <source>in: Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          SIGCSE TS 2025, Association for Computing Machinery, New York, NY, USA,
          <year>2025</year>
          , p.
          <fpage>1302</fpage>
          -
          <lpage>1308</lpage>
          . doi:10.1145/3641554.3701844.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bulbulia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bergen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liut</surname>
          </string-name>
          ,
          <article-title>Can small language models with retrieval-augmented generation replace large language models when learning computer science?</article-title>
          ,
          <source>in: Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1</source>
          ,
          ITiCSE 2024, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>388</fpage>
          -
          <lpage>393</lpage>
          . doi:10.1145/3649217.3653554.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dainese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>Evaluating language models for generating and judging programming feedback</article-title>
          ,
          <source>in: Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          SIGCSE TS 2025, Association for Computing Machinery, New York, NY, USA,
          <year>2025</year>
          , p.
          <fpage>624</fpage>
          -
          <lpage>630</lpage>
          . URL: https://doi.org/10.1145/3641554.3701791. doi:10.1145/3641554.3701791.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Amini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Security and privacy challenges of large language models: A survey</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>57</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>AI-TA: Towards an intelligent question-answer teaching assistant using open-source LLMs</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2311.02775. arXiv:2311.02775.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N. Ashok</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <article-title>Improving socratic question generation using data augmentation and preference optimization</article-title>
          ,
          <source>in: Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)</source>
          , Association for Computational Linguistics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Woodrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koyejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Piech</surname>
          </string-name>
          ,
          <article-title>Improving generative ai student feedback: Direct preference optimization with teachers in the loop</article-title>
          , https://juliettewoodrow.github.io/paper-hosting/dpo_feedback.pdf,
          <year>2025</year>
          . Accessed: 2025-04-12.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Scarlatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Woodhead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <article-title>Improving the validity of automatically generated feedback via reinforcement learning</article-title>
          , Springer Nature Switzerland,
          <year>2024</year>
          , p.
          <fpage>280</fpage>
          -
          <lpage>294</lpage>
          . doi:10.1007/978-3-031-64302-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kujanpää</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sorva</surname>
          </string-name>
          ,
          <article-title>Exploring the responses of large language models to beginner programmers' help requests</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1</source>
          , ICER '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>93</fpage>
          -
          <lpage>105</lpage>
          . URL: https://doi.org/10.1145/3568813.3600139. doi:10.1145/3568813.3600139.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Roest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Keuning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeuring</surname>
          </string-name>
          ,
          <article-title>Next-step hint generation for introductory programming using large language models</article-title>
          ,
          <source>in: Proceedings of the 26th Australasian Computing Education Conference</source>
          , ACE '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>144</fpage>
          -
          <lpage>153</lpage>
          . doi:10.1145/3636243.3636259.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Allal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lozhkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bakouch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Blázquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piqueres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marafioti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zakka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>von Werra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>SmolLM2 - with great data, comes great performance</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          A. L. et al.,
          <article-title>The Llama 3 herd of models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>de Freitas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Coffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Freitas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weingart</surname>
          </string-name>
          ,
          <article-title>FalconCode: A multiyear dataset of Python code samples from an introductory computer science course</article-title>
          ,
          <source>in: Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1</source>
          ,
          SIGCSE 2023, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>938</fpage>
          -
          <lpage>944</lpage>
          . doi:10.1145/3545945.3569822.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Efrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>LIMA: Less is more for alignment</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Neural Information Processing Systems</source>
          , NIPS '23, Curran Associates Inc., Red Hook, NY, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stiennon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Irving</surname>
          </string-name>
          ,
          <article-title>Fine-tuning language models from human preferences</article-title>
          ,
          <source>arXiv preprint arXiv:1909.08593</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>Judging LLM-as-a-judge with MT-Bench and Chatbot Arena</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Neural Information Processing Systems</source>
          , NIPS '23, Curran Associates Inc., Red Hook, NY, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tunstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Beeching</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belkada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>von Werra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fourrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sarrazin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sanseviero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>Zephyr: Direct distillation of LM alignment</article-title>
          ,
          <source>CoRR abs/2310.16944</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2310.16944. doi:10.48550/ARXIV.2310.16944. arXiv:2310.16944.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Phatale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mansoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mesnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Carbune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Prakash</surname>
          </string-name>
          ,
          <article-title>RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback</article-title>
          , in:
          <source>Forty-first International Conference on Machine Learning, ICML 2024</source>
          , Vienna, Austria, July 21-27,
          <year>2024</year>
          , OpenReview.net,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=uydQ2W41KO.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rafailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <article-title>Direct preference optimization: Your language model is secretly a reward model</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1707.06347. arXiv:1707.06347.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giurgiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jastrzebski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Morrone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>De Laroussilhe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gesmundo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Attariyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <article-title>Parameter-efficient transfer learning for NLP</article-title>
          , in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning</source>
          , volume
          <volume>97</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>2790</fpage>
          -
          <lpage>2799</lpage>
          . URL: https://proceedings.mlr.press/v97/houlsby19a.html.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>in: The Tenth International Conference on Learning Representations, ICLR</source>
          <year>2022</year>
          , Virtual Event, April 25-29,
          <year>2022</year>
          , OpenReview.net,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <article-title>Training Language Models for Programming Feedback Using Automated Repair Tools</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rebolledo-Mendez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matsuda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. C.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dimitrova</surname>
          </string-name>
          (Eds.),
          <source>Artificial Intelligence in Education</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>830</fpage>
          -
          <lpage>835</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leinonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <article-title>Automated Program Repair Using Generative Models for Code Infilling</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rebolledo-Mendez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matsuda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. C.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dimitrova</surname>
          </string-name>
          (Eds.),
          <source>Artificial Intelligence in Education</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>798</fpage>
          -
          <lpage>803</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kotalwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gotovos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <article-title>Hints-in-browser: Benchmarking language models for programming feedback generation</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Globersons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mackey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Paquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Tomczak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems</source>
          2024, NeurIPS 2024, Vancouver, BC, Canada, December 10-15, 2024,
          <year>2024</year>
          . URL: http://papers.nips.cc/paper_files/paper/2024/hash/34cc2ded6daba59357134c0b9fb06bfe-Abstract-Datasets_and_Benchmarks_Track.html.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khalman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Llinares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mesnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <article-title>Direct language model alignment from online ai feedback</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.04792. arXiv:2402.04792.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>T.</given-names>
            <surname>Phung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-A.</given-names>
            <surname>Pădurean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cambronero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gulwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <article-title>Automating human tutor-style programming feedback: Leveraging GPT-4 tutor model for hint generation and GPT-3.5 student model for hint validation</article-title>
          ,
          <source>in: Proceedings of the 14th Learning Analytics and Knowledge Conference</source>
          , LAK '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>23</lpage>
          . doi:10.1145/3636555.3636846.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vieillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stanczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Garea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Geist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <article-title>On-policy distillation of language models: Learning from self-generated mistakes</article-title>
          ,
          <source>in: The Twelfth International Conference on Learning Representations, ICLR</source>
          <year>2024</year>
          , Vienna, Austria, May 7-11, 2024
          , OpenReview.net,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=3zKtaqxLhW.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>Mistral AI Team</string-name>
          ,
          <article-title>Mistral NeMo</article-title>
          , https://mistral.ai/news/mistral-nemo/,
          <year>2024</year>
          . Accessed: 2024-09-16.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <article-title>Provably robust DPO: Aligning language models with noisy feedback</article-title>
          ,
          <source>in: Proceedings of the 41st International Conference on Machine Learning, ICML'24</source>
          , JMLR.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gheshlaghi Azar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. Daniel</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Munos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rowland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Valko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calandriello</surname>
          </string-name>
          ,
          <article-title>A general theoretical paradigm to understand learning from human preferences</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of The 27th International Conference on Artificial Intelligence and Statistics</source>
          , volume
          <volume>238</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2024</year>
          , pp.
          <fpage>4447</fpage>
          -
          <lpage>4455</lpage>
          . URL: https://proceedings.mlr.press/v238/gheshlaghi-azar24a.html.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in:
          <source>Text Summarization Branches Out</source>
          , Association for Computational Linguistics
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lagus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Longi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Klami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <article-title>Transfer-learning methods in programming course outcome prediction</article-title>
          ,
          <source>ACM Transactions on Computing Education (TOCE)</source>
          <volume>18</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Castellucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Filice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Rokhlenko</surname>
          </string-name>
          ,
          <article-title>Preventing catastrophic forgetting in continual learning of new natural language tasks</article-title>
          ,
          <source>in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3137</fpage>
          -
          <lpage>3145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutcheme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dainese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hellas</surname>
          </string-name>
          ,
          <article-title>Using program repair as a proxy for language models' feedback ability in programming education</article-title>
          , in:
          <string-name>
            <given-names>E.</given-names>
            <surname>Kochmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bexte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Horbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Laarmann-Quante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Yaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)</source>
          , Association for Computational Linguistics
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>181</lpage>
          . URL: https://aclanthology.org/2024.bea-1.15.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. Z.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mechtaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roychoudhury</surname>
          </string-name>
          ,
          <article-title>Re-factoring based program repair applied to programming assignments</article-title>
          ,
          <source>in: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE)</source>
          , IEEE/ACM,
          <year>2019</year>
          , pp.
          <fpage>388</fpage>
          -
          <lpage>398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>GPT-4o system card</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.21276. arXiv:2410.21276.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          , et al.,
          <article-title>Qwen2.5-Coder technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2409.12186</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>P.</given-names>
            <surname>Verga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Althammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arkhangorodsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <article-title>Replacing judges with juries: Evaluating LLM generations with a panel of diverse models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.18796. arXiv:2404.18796.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>Google DeepMind</string-name>
          ,
          <article-title>Gemini 2.0: Our largest and most capable AI model</article-title>
          , https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/,
          <year>2024</year>
          . Accessed: 2025-04-24.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Ramayapally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vaidyanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hupkes</surname>
          </string-name>
          ,
          <article-title>Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2406.12624. arXiv:2406.12624.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>H.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Namgoong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>Large language models as evaluators in education: Verification of feedback consistency and accuracy</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>671</fpage>
          . doi:10.3390/app15020671.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>WebLLM</string-name>
          ,
          <article-title>WebLLM: A web-based language model</article-title>
          , https://webllm.mlc.ai/,
          <year>2024</year>
          . Accessed: 2024-09-16.
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>DeepSeekMath: Pushing the limits of mathematical reasoning in open language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.03300. arXiv:2402.03300.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>